Regular Expressions

At the very beginning of this post I’d like to point out that this is neither related to a specific programming language nor related to a specific operating system, although I’ll need to employ some “real-world” examples for better understanding. Regular expressions are supported by a myriad of programming languages, text editors, command line interpreters, software applications, calculators, electric toothbrushes, and so on. What I’m saying is that anyone who even thinks about moving his or her stubby fingers near a computer keyboard ought to know at least about the existence of regular expressions and their basic capabilities. The purpose of this post is to provide you with this vital information.

Let’s consider the following example: You just wrote some source code that prompts a user to enter his or her email address. A variable named email holds the user’s input.

Now, how are you going to check if the address format is valid? If you haven’t yet heard of regular expressions, you’d probably loop through the string to count the occurrence of certain characters: A valid email address must contain exactly one ‘@’ character, a ‘.’ character, and the address must not contain any whitespaces. Now, if you think you’re done with this program, hold on a sec. We haven’t yet made sure there is at least one character preceding the ‘@’, at least one character residing between the ‘@’ and the ‘.’, and at least two (but not more than four) characters following the ‘.’ character, well, oh what about country code second level domains, such as ‘.co.uk’, … blah … blah … blah …

Writing something like this in C can get pretty awkward. You may object that you could use a programming language that provides highly sophisticated string functions or methods. Still, the solution would involve many lines of code that are apt to compromise your program’s maintainability as well as its extensibility.
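To get a feeling for how quickly this grows, here is a rough sketch of such a hand-coded check in Java. The class and method names are made up for illustration, and the rules are just the informal ones listed above (it does not even restrict which characters are allowed), so don't take it for a real validator.

```java
class ManualEmailCheck {

  // Hand-coded check of the informal rules: no whitespace, exactly one '@',
  // at least one character before the '@', a '.' after the '@' with at least
  // one character in between, and 2 to 4 characters after the last '.'.
  static boolean looksValid( String email ) {
    int atCount = 0;
    int atIndex = -1;
    for( int i = 0; i < email.length(); i++ ) {
      char c = email.charAt( i );
      if( Character.isWhitespace( c ) ) {
        return false;
      }
      if( c == '@' ) {
        atCount++;
        atIndex = i;
      }
    }
    if( atCount != 1 || atIndex == 0 ) {
      return false; // wrong number of '@', or nothing in front of it
    }
    int lastDot = email.lastIndexOf( '.' );
    if( lastDot < atIndex + 2 ) {
      return false; // no '.' after the '@', or nothing between '@' and '.'
    }
    int tldLength = email.length() - lastDot - 1;
    return tldLength >= 2 && tldLength <= 4;
  }

  public static void main( String[] args ) {
    System.out.println( looksValid( "john@example.com" ) ); // true
    System.out.println( looksValid( "@example.com" ) );     // false
    System.out.println( looksValid( "john@example.c" ) );   // false
  }
}
```

That is roughly thirty lines for a check that still lets plenty of nonsense through.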

Here is where regular expressions, often referred to as regex, come in handy. They allow us to describe the format of a string in pretty much the same way as we would describe it verbally.

Let’s simplify the example above for the sake of easier understanding.

“An email address is considered valid, if it starts with a sequence of one or more characters (only lowercase letters or numbers allowed), which is followed by an ‘@’ character, which is then followed by yet another sequence of one or more characters (again lowercase or numbers), which is succeeded by the ‘.’ (dot) character and eventually followed by at least two but not more than four lowercase letters.”

Well, you may now think of Sheldon Cooper from The Big Bang Theory, but in just a second from now you will see that this is almost exactly the way we define a regular expression pattern.

Here’s the corresponding regex:

[a-z0-9]+@[a-z0-9]+\\.[a-z]{2,4}

Let’s have a closer look at the details.

[a-z0-9] is called a character class. It matches exactly one character, which must be either a lowercase letter (a to z) or a digit (0 to 9). The + indicates “one or more occurrences” of the preceding pattern. If we removed the + from the expression, we would say that the @ character must be preceded by exactly one [a-z0-9] character. If we replaced the + by a *, the meaning would change to “zero or more occurrences”. This would make the address part to the left of the @ optional.
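If you’d like to see the three variants side by side, here is a small sketch in Java. The class and constant names are made up for this post; the only library call is String.matches, which checks the entire string against the pattern.

```java
class QuantifierDemo {

  static final String ONE_OR_MORE  = "[a-z0-9]+@"; // + : one or more
  static final String EXACTLY_ONE  = "[a-z0-9]@";  // no quantifier: exactly one
  static final String ZERO_OR_MORE = "[a-z0-9]*@"; // * : zero or more

  public static void main( String[] args ) {
    System.out.println( "abc@".matches( ONE_OR_MORE ) );  // true
    System.out.println( "@".matches( ONE_OR_MORE ) );     // false
    System.out.println( "a@".matches( EXACTLY_ONE ) );    // true
    System.out.println( "abc@".matches( EXACTLY_ONE ) );  // false
    System.out.println( "@".matches( ZERO_OR_MORE ) );    // true
  }
}
```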

Most characters, such as the @ character, can be used directly as a literal in the regular expression. In this example, the @ indicates that we are checking for a mandatory @ character to follow.

What follows is something you should be familiar with by now. Yes, correct … good old [a-z0-9]+, yet another sequence of one or more lowercase letters or digits.

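The ‘.’ (dot) representation \\. looks kind of weird. However, a simple . wouldn’t do, because it has a special meaning (it matches any single character), similar to the + character we’ve seen before. Such special symbols are called metacharacters. To use their literals, we need to “escape” them with a backslash. Strictly speaking, the regex for a literal dot is \. with a single backslash; it is written as \\. here because that is how it has to appear inside a Java string literal, where the backslash itself must be doubled. So if you were to check for a mandatory + character, e.g. in a phone number +49…, you’d have to use \+ in the regular expression (\\+ in Java source code).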
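Here is a quick sketch in Java of what the escaping buys you. Keep in mind that in Java source code the regex \. is typed as "\\." because the backslash is also the escape character of string literals. The class name is made up; Pattern.quote is a standard library helper that escapes a whole string for you.

```java
import java.util.regex.Pattern;

class EscapeDemo {

  public static void main( String[] args ) {
    // Unescaped, the dot is a metacharacter that matches any single character.
    System.out.println( "aXb".matches( "a.b" ) );     // true
    // Escaped, it only matches a literal dot.
    System.out.println( "aXb".matches( "a\\.b" ) );   // false
    System.out.println( "a.b".matches( "a\\.b" ) );   // true
    // The same goes for a literal plus sign, e.g. in a phone number.
    System.out.println( "+49".matches( "\\+[0-9]+" ) );                 // true
    // Pattern.quote turns any string into a pattern that matches it literally.
    System.out.println( "+49".matches( Pattern.quote( "+" ) + "[0-9]+" ) ); // true
  }
}
```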

I guess I’ll no longer need to explain [a-z]. The interesting part, however, is the number of occurrences at the end of the expression: {2,4}. This says “at least 2, but not more than 4 occurrences”. If we wrote {42}, the top level domain would have to be a sequence of exactly(!) 42 lowercase letters.
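The counted form is easy to try out, too. The following is just a sketch with made-up class and constant names; {3} stands in for the {42} above so that the test strings stay readable.

```java
class RepetitionDemo {

  static final String TWO_TO_FOUR   = "[a-z]{2,4}"; // at least 2, at most 4
  static final String EXACTLY_THREE = "[a-z]{3}";   // exactly 3

  public static void main( String[] args ) {
    System.out.println( "de".matches( TWO_TO_FOUR ) );      // true
    System.out.println( "info".matches( TWO_TO_FOUR ) );    // true
    System.out.println( "x".matches( TWO_TO_FOUR ) );       // false
    System.out.println( "museum".matches( TWO_TO_FOUR ) );  // false
    System.out.println( "com".matches( EXACTLY_THREE ) );   // true
    System.out.println( "info".matches( EXACTLY_THREE ) );  // false
  }
}
```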

Of course there are many more metacharacters you can use in regular expressions. Various sites on the Internet offer more or less complete lists. Note that the syntax of regular expressions can vary slightly between programming or scripting languages. The principle of how a regular expression is built, most character classes, and most metacharacters are the same almost everywhere, though. Learn one – know ’em all.

Now that you have seen how a regex pattern is built, you may wonder how you can actually use it to perform a format check on an email address. This step may differ between different programming languages. Plus, some languages, such as Java, offer more than one way to employ regular expressions.

The following Java program shows the simplest way in Java to check an email address against a regex.

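import java.util.Scanner;

public class Demo {

  public static void main( String[] args ) {

    Scanner s = new Scanner( System.in );

    System.out.print( "Enter your email address: " );
    // next() reads a single whitespace-delimited token from standard input.
    String input = s.next();
    // The backslash is doubled because "\\." in Java source is the regex \.
    String pattern = "[a-z0-9]+@[a-z0-9]+\\.[a-z]{2,4}";
    // matches() succeeds only if the entire input fits the pattern.
    if( input.matches( pattern ) ) {
      System.out.println( "Valid." );
    }
    else {
      System.out.println( "Invalid." );
    }
  }
}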

In Java, every String object provides a method named matches, which checks whether the entire string matches the regex pattern that is passed as its argument. Impressive, isn’t it?
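One of the other ways Java offers is to compile the pattern once with java.util.regex.Pattern and reuse it for many inputs. Here is a minimal sketch; the class and method names are my own, and whether the reuse pays off depends on how often you check.

```java
import java.util.regex.Pattern;

class CompiledPatternDemo {

  // Compiled once and reused; Pattern objects are immutable and thread-safe.
  private static final Pattern EMAIL =
      Pattern.compile( "[a-z0-9]+@[a-z0-9]+\\.[a-z]{2,4}" );

  static boolean isValid( String input ) {
    // Matcher.matches() requires the whole input to match, like String.matches().
    return EMAIL.matcher( input ).matches();
  }

  public static void main( String[] args ) {
    System.out.println( isValid( "john42@example.com" ) );  // true
    System.out.println( isValid( "John@example.com" ) );    // false (uppercase J)
    System.out.println( isValid( "john@example.museum" ) ); // false (6 letters)
  }
}
```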

Now think of what you can do with this stuff! You could dynamically generate regex patterns during runtime. Or if you’re a homo nerdus delirum, you could use regular expressions to match regular expressions. 😉
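As a taste of the first idea, here is a sketch that builds a pattern at runtime from a list of accepted top level domains. The class, the method, and the sample list are invented for illustration; Pattern.quote makes sure that each list entry is taken literally even if it contained metacharacters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class DynamicPatternDemo {

  // Builds e.g. [a-z0-9]+@[a-z0-9]+\.(\Qde\E|\Qcom\E) from the given list.
  static Pattern emailPatternFor( List<String> topLevelDomains ) {
    String alternatives = topLevelDomains.stream()
        .map( Pattern::quote )               // treat each entry literally
        .collect( Collectors.joining( "|" ) ); // | means "or"
    return Pattern.compile( "[a-z0-9]+@[a-z0-9]+\\.(" + alternatives + ")" );
  }

  public static void main( String[] args ) {
    Pattern p = emailPatternFor( Arrays.asList( "de", "com" ) );
    System.out.println( p.matcher( "john@example.com" ).matches() ); // true
    System.out.println( p.matcher( "john@example.org" ).matches() ); // false
  }
}
```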

On the Linux command line interface (aka ‘Shell’), regular expressions are used to filter information. Consider this example:

ls | egrep "^[0-9]{1,3}\.jpg$"

This command lists all files whose names consist of one to three digits followed by .jpg. The regex pattern (the part in double quotes) contains two metacharacters we haven’t looked at yet: the ^ character, which represents the beginning of a string, and the $ character, which indicates the end of a string. So the expression only matches strings like 004.jpg, but not something23.jpgsomethingelse. I could show you more 😎 stuff like this, but I don’t want to make you faint from awe, hahaha.
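The anchors matter because egrep looks for the pattern anywhere in a line. In Java, Matcher.find() behaves in a comparable way, while matches() always checks the whole string. The following sketch, with made-up class and constant names, shows the difference the anchors make.

```java
import java.util.regex.Pattern;

class AnchorDemo {

  static final Pattern ANCHORED   = Pattern.compile( "^[0-9]{1,3}\\.jpg$" );
  static final Pattern UNANCHORED = Pattern.compile( "[0-9]{1,3}\\.jpg" );

  public static void main( String[] args ) {
    String good = "004.jpg";
    String bad  = "something23.jpgsomethingelse";

    // find() searches for the pattern anywhere in the input.
    System.out.println( ANCHORED.matcher( good ).find() );   // true
    System.out.println( ANCHORED.matcher( bad ).find() );    // false
    // Without ^ and $ the substring 23.jpg is enough for a hit.
    System.out.println( UNANCHORED.matcher( bad ).find() );  // true
  }
}
```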

The last remaining question is: What are the disadvantages of using regex, if there are any?

Actually, there is only one major disadvantage, which is pretty obvious especially to most of my students ;). You need a good deal of practice before you can formulate complex regular expressions quickly. Like all cool things in life, mastering regex requires tedious exercise.

For rare situations, you may want to keep in mind that when using regex you rely on predefined matching algorithms. Depending on what specific regex pattern you want to match, and on the programming language you are using, these algorithms can be more or less runtime efficient than hand-coded “if-trees”. For further information on what happens behind the scenes, please refer to Russ Cox’s website on regex matching.

A neat and pretty complete regex tutorial can be found here:
http://www.cs.colorado.edu/~schenkc/UNIX_Regular_Expressions.pdf

See you in class,

— Andre M. Maier
