How to use ancient Morse code to solve a modern problem

If you own a Raspberry Pi mini computer the following problem may sound familiar to you: You just connected your Rasperry Pi to a network with automatic DHCP. Now you need to find out what IP address has been assigned to it. While this may be quite easy to do when you have a computer screen and a keyboard at hand, it requires some nifty workarounds if don’t.

There are numerous ways to solve this problem, e.g.

  • Set up a network packet sniffer (e.g. Wireshark) on another machine in the same network to examine the DHCP communication between server and client.
  • View the logs or a list of the active clients on the DHCP server.
  • Set up the Raspberry Pi to register with a dynamic DNS provider or some homebrew service on another machine.
  • Write a script and configure the Raspberry Pi to build and send out an email to you each time it is connected to a network.

However, there is a catch to such solutions: They all rely on external devices and configurations. Sometimes you may connect your Raspberry Pi to a network that is configured to block certain ports or traffic. Or maybe there is a a power outage that affects your homemade registration server.

I was therefore looking for an independent alternative solution to find out my Raspberry Pi’s newly assigned IP address. It should not involve sending data across the network, and it was supposed to work without having to connect a computer screen.

The idea I came up with was a program (see below) that displays the device’s IP address on one single LED that is connected to one of its GPIO ports. The only hitch to my nerdy solution is that you need to learn how the digits 0 to 9 are represented in Morse code, which is a piece of cake though.

Feel free to check out my solution by performing the following steps:

  • Compile the C program below in your Raspberry Pi
    gcc ip2morse.c -o ip2morse -lwiringPi
  • Copy the generated executable file to /bin (root permissions / sudo required).
    cp ip2morse /bin
  • Add a new entry to the file /etc/crontab to have the program executed every minute.
    * * * * * root /bin/ip2morse

Note that the author of the C program below cannot be held liable for any damages, loss, claims, or expenses caused by running the program.

/* Program: ip2morse.c
 * Platform/OS: RaspberryPi/Raspbian
 * Description: The purpose of this program is to output the IPv4 
 * address of a network interface on a LED in Morse 
 * code.
 *
 * Copyright (c) 2017 by Andre M. Maier - All Rights Reserved
 * You may use, distribute and modify this code under the
 * terms of the Creative Commons Attribution 4.0 International 
 * (CC BY 4.0) license. For further information, please visit 
 * https://creativecommons.org/licenses/by/4.0
 */
#include <stdio.h>
#include <wiringPi.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <netinet/in.h>
#include <net/if.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

#define PIN 1 // BCM GPIO 18
#define SHORT_IN_MS 200 // Length of a dot in Morse code
#define LONG_IN_MS 600 // Length of a dash in Morse code
#define PAUSE_IN_MS 200 // Pause within dots and dashes

// Pause between digits; The pause between the octets of the IP
// address will be twice this value.
#define PAUSE_AFTER_DIGIT_IN_MS 1000

// Network interface for which to get the IP address
#define INTERFACE "wlan0"

// Digits in Morse code (from 0 to 9)
const char* const numbers_in_morse[] = { "-----", ".----", "..---",
                                         "...--", "....-", ".....",
                                         "-....", "--...", "---..",
                                         "----." };

/* Outputs a specified decimal digit (0-9)
   on the GPIO pin. */
void digit_to_morse_output( unsigned char digit ) 
{
   const char* num_in_morse = numbers_in_morse[ digit ];
   int i = 0;
   while( num_in_morse[ i ] != '\0' )
   {
      digitalWrite( PIN, 1 );
      if( num_in_morse[ i ] == '.' ) 
      {
         delay( SHORT_IN_MS );
      }
      else
      {
         delay( LONG_IN_MS );
      }
      digitalWrite( PIN, 0 );
      delay( PAUSE_IN_MS );
      i++;
   }
}

/* Determines the IPv4 address from the specified network
   interface and returns it as a string in dotted-decimal
   format. */
char* get_ip_address_string( char* interface )
{
   struct ifreq ifr;
   int fd = socket( AF_INET, SOCK_DGRAM, 0 );
   ifr.ifr_addr.sa_family = AF_INET;
   strncpy( ifr.ifr_name, interface, IFNAMSIZ - 1 );
   ioctl( fd, SIOCGIFADDR, &ifr );
   close( fd );
   return inet_ntoa( ( ( struct sockaddr_in *)&ifr.ifr_addr ) 
                     -> sin_addr );
}
 
/* Main function. This is where the program starts. */
int main( int argc, char** argv )
{
   // Setting up wiringPi and initializing the GPIO port
   if( wiringPiSetup() == -1 )
     return 1;
   pinMode( PIN, OUTPUT );
 
   // Determine ip address of the network interface
   char* ip_address = get_ip_address_string( INTERFACE );

   // Looping through the whole IPv4 string. The dots between
   // the octets will be represented by a long pause (twice
   // as long as the pause between the digits.
   int length = strlen( ip_address );
   int i;
   for( i = 0; i < length; i++ )
   {
      char c = ip_address[ i ];
      if( c == '.' )
      {
         delay( PAUSE_AFTER_DIGIT_IN_MS );
      }
      else
      {
         unsigned char digit = c - '0';
         digit_to_morse_output( digit );
         delay( PAUSE_AFTER_DIGIT_IN_MS );
      }
   }
   return 0;
}

Last but not least, a short note to my teaching colleagues in computer science:
The program can be used as a nice little classroom project for advanced students. It combines topics such as network and hardware-oriented C programming, the history of communications, and scheduled tasks (you can either run the program in a cronjob as described above, or you can make it a SysV or systemd service).

Have fun!
— Andre M. Maier

Advertisements
Posted in c, Information Theory, Linux, Raspberry Pi, Uncategorized | Tagged , , , , , , ,

Let’s make the stack explode in Java!

In one of my lectures about memory management in Java, I was looking for a quick and simple way to demonstrate the influence of local variables on the stack memory usage. So what came to my mind was this:

Local variables, including method parameters, are stored within a stack frame. As the stack size is limited, the maximum number of stack frames allowed on the stack depends on the size of each frame. Thus, when you declare some local variables within a method that recursively calls itself, the number of recursive calls until a stack overflow occurs should be lower than without the local variables.

Here is a program that allows you to demonstrate the expected behavior.

/**
 * This program will intentionally cause a stack overflow.
 *
 * CC BY-NC
 * This work is licensed under a Creative Commons
 * Attribution-NonCommercial 4.0 International License.
 *
 * @author Andre M. Maier
 * @version 1.0
 */
public class Demo {

   public static void main( String[] args ) {
      m( 0 );
   }

   public static void m( int count ) {
      long a, b, c, d, e, f, g, h, i, j, k, l, m;
      System.out.println( count );
      m( ++count );
   }
}

In order to make all variables, including the counter, reside in the stack frames, I decided to pass the count variable to the recursive method by parameter. The only purpose of the 64-bit long variables a to m is to inflate the stack frames.

During its runtime, the program will output the actual counter value on standard output. Sooner or later, the program is going to crash with a stack overflow exception.

When running the program, make sure you redirect the standard error stream

Note that you will have to redirect the standard error stream, to prevent it from interfering with the counter output.

java Demo 2> /dev/null

You will notice a significant difference in the maximum counter value if you comment out the line that declares the long variables a to m.

Using the non-standard option -Xss of the JVM you can change the maximum stack size and explore the influence on the maximum counter value.

java -Xss400k Demo 2> /dev/null

Have fun, and check out my blog entry about the JVM Heap.

Posted in Java, Programming Essentials | Tagged , , , , ,

A simple way to obfuscate string literals

By Andre M. Maier

While I was sitting in a doctor’s waiting room last week, it occurred to me that I could send out a geeky holiday greeting to my younger programming students. Naturally, as a teacher, I wanted to make it educational. It should allow my students to recap on the very basics of computer programming, and maybe provide a new aspect that is not covered in my regular curriculum. These thoughts led me to the idea of “hiding” the characters of the holiday note within integer literals.

I first came across this principle back in the late 80s when I spent hours after hours studying the machine code of various commercial computer games. Some programmers have used this method to prevent level passwords or product keys from showing up as plain ASCII in any text editor. Many years later, when I worked in computer forensics, I learned that many famous hardware and software manufacturers used similar, albeit more effective, techniques to obfuscate proprietary information.

So, I took my laptop out of my bag and wrote the following program.

program

The innermost while loop retrieves three subsequent digits from the value of an int variable n. The digits are weighted with powers of 10 according to their position, and then summed up in a char variable which represents an actual ASCII character. The inner for loop repeats this procedure three times, because this simple variant allows you to hide up to three ASCII characters per int literal. Thus, the first int value in the array (101105076) contains the ASCII characters 76, 105, 101 in decimal representation, which resemble the printed letters L, i, and e. The rest of the program should be pretty much self-explanatory.

As a result, the output of my program looks like this:

Liebe BKI15, 
FROHES FEST! 
wuenscht Euch
Andre Maier

I must admit that I have both seen and used far more sophisticated methods of hiding information in source code or in binary files. One of the most impressive variants I have seen involved some weird algorithm that was used to pick 4 bit of each character from a JPEG image. What made it impressive to me was the fact that the algorithm retrieved its changing parameters from the same source of data. With only disassembled RISC machine code and no debugging tools available, it took me quite a while to figure out the exact function and design of the algorithm.

The program above is simple to understand and straight forward. Nevertheless, I think it makes a neat little exercise on the essentials of computer programming.

Merry Christmas and a Happy New Year 2017 to all of you!

 

Posted in Information Theory, Java, Programming Essentials | Tagged , , , , ,

Why Linux DOES NOT Suck

By Andre M. Maier

On October 5, 1991, Linus Torvalds officially published the first version of his homemade Linux kernel. Today, 24 years later, a large number of IT professionals agree that best of Linux is yet to come. I would like to take this opportunity to explain why I share this opinion and what I think should be done to bring the future forward.

1. Technological Aspect

Technologically, Linux is a slightly modified clone of the Unix operating system. Thus, it encompasses all the advantages of Unix. Unix was developed in the early 1970s, long before Microsoft and Apple were founded. Back in those days, computers were extremely expensive, so even universities and companies could only afford one computer. This computer had to be shared among many users, who only had “dumb terminals” (consisting of a keyboard and a monochrome cathode ray tube monitor) on their desks. As there were no hard drives, flash drives, or local disk drives, all data was stored centrally in the data center. It was clear, therefore, that Unix had to be a highly available operating system that supports multi-user, multi-processing, and networking features. Because the hardware configuration was very different between the data centers, Unix also had to be scalable and modular, which is often referred to as the “Unix Philosophy”.

Microsoft just didn’t have these requirements in mind when they designed their Disk Operating System (DOS) and Microsoft Windows. In fact, they didn’t need to, because during the 1980s, hardware became cheap enough to allow more and more people to have their private “data center” right on their desks. It was the dawn of the Personal Computer (PC). In the 1980s, nobody would have thought that by 2015 we would share our thoughts on some global network, a.k.a the Internet. I’m pretty sure that if you came up with the idea of a smart watch running an operating system, nobody would have taken your serious back then.

Times have changed and so have the requirements. In today’s technological world, we had to get back to the old “values” such as networking, multitasking, multi-user support, high availability and high scalability. While Microsoft was struggling to add these features to their already existing products, Linux (as well as Unix) didn’t have to undergo any major changes at all. Anyone who has some experience in software development will agree that any changes in the requirements during the process of development inevitably increase the risk of bugs.

For me, this seems to be the main reason why both Linux and Unix have succeeded in mobile computing (Android, Apple iOS), why most of the Internet’s infrastructure is based on Linux, and why more than 99.99% of the TOP500 supercomputers in the world are running some Linux distribution.

2. Philosophical Aspect

Although Linus Torvalds first published the Linux kernel under his own license, he released it under the GNU General Public License (GPL) only one year later. At this point, it would make sense for me to write about the idea behind GNU and the free software movement, but there is a great TED talk by no other than the founder of the GNU Project himself. So let’s give Richard Stallman a chance to sum up his remarkable idea.

Anyone in the world may have his or her own definition of what freedom is. And anyone in the world may decide for himself or herself how much convenience he or she is ready to sacrifice. We all, however, should be striving for a society in which anyone who chooses one over the other is aware of what he is about to give up. Also, people should understand that, in this universe, there is no way to claim both freedom and convenience at the same time.

What’s more, there is no 100.0% freedom like there is no 100.0% convenience. Any software, no matter how convenient it was designed to be, can cause inconveniences. Likewise, anyone who uses a computer will inescapably lose a significant amount of freedom, no matter what software he uses. From my point of view, it should be socially acceptable to refrain from using any computers at all. I’m sure this would have saved us a lot of political trouble, too. German Chancellor Merkel’s sentence “Das Internet ist Neuland für uns alle.” (The Internet is uncharted terrain for all of us) caused an uproar among the German citizens, who felt being called “incompetent”. In my personal opinion, Merkel was absolutely right, as technology has evolved way faster than education, and it still is.

Most people today have never been taught to think about where exactly they stand between freedom and convenience. They know the latest gimmicks and keyboard shortcuts in commercial, non-free software products, but they don’t know the inevitable consequences of their actions. Would you feel vulnerable if, for some reason, whatever text or media you have entered into your computer ended up published publicly on the web? Do you really think that laws, contracts, user agreements, and promises are going to stop companies and governments from doing unethical and immoral things? Think about this question and discuss it with a loved one.

3. Conclusion

Linux is not just a product, but a well proven and tested technological concept. Its portability and high scalability makes it a perfect operating system not only for mobile devices and servers, but also for most components in the “Internet of Things”. So there is no doubt Linux will face a bright future.

We should get our students involved in Linux and free software. Currently, almost none of my younger students have sufficient knowledge in this area. All they have been taught in previous schools was how to play with proprietary non-free software. Most of them even don’t know that there is an alternative.

It is human to always look for the easiest and most convenient path. But it is a teacher’s job to take his students on alternative routes, too. Someone who has never seriously tried to gain at least some basic knowledge of Linux will never be able to discuss its merits and demerits. I definitely agree with most of what Richard Stallman has to say about Linux in schools: http://www.gnu.org/education/edu-schools.en.html

Posted in Linux, Non-technical | Tagged , , , ,

Let’s make the heap explode!

This blog entry is dedicated to those who teach computer science at high schools and vocational schools.

In one of my Java programming classes, only few of the students showed up. It didn’t seem to make sense to continue with my regular lesson plan, so I had to make an ad hoc decision about what to do instead. I was clear that I would have to pull something out of the hat. After a second or so I thought it might be a good idea to take a closer look at memory management in Java. During that class, I wrote a simple quick ‘n’ dirty program that just eats up heap memory like there’s no tomorrow.

Screenshot from 2014-11-29 23:25:58Here is how the essential lines of code work: First, an array a is declared to hold SIZE references to objects of the type of SomeObject. Then, SIZE objects of SomeObject are instantiated in a loop. There is nothing noteworthy about class SomeObject, except it contains about 25 (useless) attributes of datatype long. The primary purpose of those attributes is to waste some heap memory.

If you run this program normally, sooner or later you will see the following error message.

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at Demo.main(Demo.java:14)

This is because the default maximum heap size is limited to 1 GByte in 32-bit JVMs. Run your program again with the following JVM option to push your computer to its limits.

java -Xmx8g Demo

Sooner or later you’re computer system will almost freeze until you kill the process.

There is an interesting fact about Java that you can see when you don’t keep the references stored in the array: It is Garbage Collection to its finest. As soon as an object doesn’t have any references to it any more, the Java Virtual Machine automatically deallocates the related memory. In order to explore this effect, use a fixed index, e.g. 0, inside the loop.

a[ 0 ] = new SomeObject();

It is also worth it to open a graphical system monitor before starting the program. Here is how it looked like on my laptop (Intel i5, 4 gigs of RAM).

usage
The sky-rocketing rise at t-59 in the uppermost diagram is where the program was started. From t-57 to t-45 the CPU is busily creating objects, which is also indicated by the rising purple line (physical RAM usage) in the diagram in the middle. At t-45 the physical RAM is vastly exhausted, so the operating system starts to swap RAM data to hard disk. This involves time-consuming disk operations which are not handled by the CPU. Hence, the process is put to sleep once in a while, which results in a decrease of the average CPU usage. As the administrative overhead in memory management increases, CPU usage further decreases due to prolonged sleep periods of the process. Eventually, your system is going to freeze due to deprivation of memory, although the CPU usage remains relatively low.

When I got home from school late that evening, I thought that I could reuse this program in a different context, specifically in one of my lectures on Linux, though the underlying principles apply to any other operating system as well.

In my opinion, everyone who is about to touch, use, or even administer a computer system should be able to identify memory exhaustion. However, both identification and debugging of memory-related issues can be tricky, and you need to have quite some background knowledge. As with many things in computer science, the whole topic seems to end up in a difficult balancing act between hiding confusing aspects on the one hand, and establishing mental cross-links on the other. As far as I’m concerned, I’m still looking for a simple and “unscientific” approach to memory management which is easily scalable from high school to university level. If you have found one, please let me know.

I wish you a happy Christmas time!

— Andre M. Maier

Posted in Java, Linux, Programming Essentials, Uncategorized | Tagged , , , , ,

Floating Point Representation

Back in the days when I was in middle school, I started to develop a strong preference for “nice” and “short” numbers, such as 0.1. One likely reason was that any math problem in school was designed to yield a “nice” result. You simply didn’t want to end up with results like 3.0078125 in a math test, because you knew there must have been something wrong in your calculations.

Here’s a fun fact, though: If we were computers we’d prefer 3.0078125 to be the result, because computers represent numbers in binary. It is due to the same fact that we have to strictly differentiate between range and accuracy of a floating point number.

This article will give you a closer look on the underlying IEEE 754 floating point representation. You should already have some basic understanding of the binary system, though.

You may remember that binary numbers follow exactly the same mathematical rules as decimal numbers, except that the base is 2 instead of 10. So from the right to the left, the first digit of a binary number is weighted by 2 to the power of 0, the second digit is 2 to the power of 1, the third digit is 2 to the power of 2, and so on. This is also true for digits on the left side of the decimal point.

binary_fpConsidering this, you should no longer be surprised that your computer loves numbers like 3.0078125 but hates 0.1 at the same time. How many bits would be required to represent the exact number of 0.1?

At this point in class, I usually ask my students to come up with a clever way to store real numbers, such as 9.75, in a computer’s memory. Unless there is a downright nerd among the students, their first idea would usually involve to use a fixed number of bits to hold the binary digits left to the decimal point, and a fixed number of bits to hold the decimals after the decimal point. I’m then going to show them the obvious waste of memory if we wanted to store numbers like 9.0 or 0.25.

In 1985, the IEEE (Institute of Electrical and Electronics Engineers) established a standard (IEEE 754) that allows handling of real numbers in a more efficient way. This standard not only defines the number format, but also defines all kinds of arithmetic operations, rounding, and so on.

The underlying principle is fairly simple, and it has been used by scientists for ages. Physicists, for example, would never say that the diameter of a hydrogen atom was about 0.000000000074 meters. Instead, they’d say that it is “74 times 10 to the power of minus 12 meters” or “74 pico meter” (because “pico” means 10-12). This is much shorter to write, saving both ink and memory. On computer screens or calculator displays you will usually find E-12 (‘E’ stands for Exponent) instead of 10-12. This principle is called floating point representation, because you can picture the decimal point to be “floating” between the digits as the exponent increases or decreases.

scanIf you didn’t skip physics classes in high school, you will be familiar with this kind of number representation. So any real number can be described by its exponent and its mantissa (that’s what we call the fraction before the multiplication sign).

Now, let’s do exactly the same thing in binary. It follows exactly the same principle, except that the base is now 2.

scanAs you can see, it is quite easy to describe any binary number by its base-2 exponent and its mantissa.

Let’s now have a look at how this kind of number is actually stored in a computer’s memory. As you can now guess, in order to sufficiently describe a floating point number we need to store three parts of information:

  • Signum (i.e. whether the number is positive or negative)
  • Base 2 – Exponent
  • Mantissa

The IEEE 754 standard specifies floating point formats of different length. The most popular formats are binary32 (a.k.a. single precision) and binary64 (a.k.a. double precision). In the programming language Java, for example, they correspond to the data types float and double, respectively.

scanWhen stored in a binary32 (float) data type, the number 9.75 from the above example yields exactly the following bit pattern:

scanDon’t worry if you haven’t been able to find this out by yourself. 😉 I’m going to explain each and every single bit.

The first bit is the signum, i.e. it indicates whether the number is positive (0) or negative (1). Since the number in the example is +9.75, this bit is set to 0.

The next eight bits are designated to hold the exponent. Since the exponent can be either negative or positive, the IEEE754 specifies a bias of 2n-1 – 1, which is 2⁸-1 – 1 = 127 in single precision format. So, in order to calculate the corresponding bit pattern, just add 127 to your exponent and convert the result to binary. In the example above, the floating point number to be stored is 1.00111·2³. So the biased exponent is 127+3=130, which corresponds to a binary value of 1000 0010.

The remaining 23 bits are used to store the normalized mantissa. Normalized means that the “binary point” is shifted to the left until there is only one leading 1 remaining on the left side of the “binary point”. This is exactly what we did previously in the example, where we found out that 9.75 is “1.00111 times 2 to the power of three”. Now there is just one more little tweak to it: We do not need to store the leading 1, because we know that it is there. So, what we do need to store are just the digits on the right side of the “binary point”: 00111. All the remaining less significant bits are set to zero, so the final result is 00111000000000000000000.

That’s pretty much all there is to say about how floating point numbers are represented inside your computer. Feel free to pick any positive or negative real number that comes to your mind and convert it to IEEE754 binary32 (float) or binary64 (double) format. You may want to check your results by using one of the numerous IEEE754 converters on the Internet (enter +ieee754 +converter as a search string in Google).

Just one more thing before you leave this blog: There are certain problematic conditions that may occur when converting or handling floating point numbers. For example, the number to be converted may be outside the representable range. Therefore, the IEEE754 standard defines several exceptions to indicate these conditions. You should at least know the two most important exceptions that you will frequently encounter as a software developer.

  • Positive Infinity and Negative Infinity
    indicate that the number has grown too large for representation. This condition is denoted with an exponent of all ones and a mantissa of all zeros. The signum bit is set to 0 to indicate positive infinity and set to 1 for negative infinity respectively.
  • Not-a-Number (NaN)
    is used to represent a value that is not a real number. You can evoke this scenario by calculating the square root of -1, for example. We know that the result is i, but a computer can’t handle complex numbers by default. So the result is NaN, because it does not represent a real number. According to the IEEE754 specification, NaN is denoted with an exponent of all ones and a non-zero mantissa.

There would be many more details worth mentioning, such as “Quiet NaN” and “Signalling NaN”, but this should be enough for the time being. For further details please refer to http://en.wikipedia.org/wiki/IEEE_floating_point and/or my lectures on data representation.

See you around,

— Andre M. Maier

Posted in Bit-Twiddling, Math, Uncategorized | Tagged , , , , ,

Regular Expressions

At the very beginning of this post I’d like to point out that this is neither related to a specific programming language nor related to a specific operating system, although I’ll need to employ some “real-world” examples for better understanding. Regular expressions are supported by a myriad of programming languages, text editors, command line interpreters, software applications, calculators, electric toothbrushes, and so on. What I’m saying is that anyone who even thinks about moving his or her stubby fingers near a computer keyboard ought to know at least about the existence of regular expressions and their basic capabilities. The purpose of this post is to provide you with this vital information.

Let’s consider the following example: You just wrote some source code that prompts a user to enter his or her email address. A variable named email holds the user’s input.

Now, how are you going to check if the address format is valid? If you haven’t yet heard of regular expressions, you’d probably loop through the string to count the occurrence of certain characters: A valid email address must contain exactly one ‘@’ character, a ‘.’ character, and the address must not contain any whitespaces. Now, if you think you’re done with this program, hold on a sec. We haven’t yet made sure there is at least one character preceding the ‘@’, at least one character residing between the ‘@’ and the ‘.’, and at least two (but not more than four) characters following the ‘.’ character, well, oh what about country code second level domains, such as ‘.co.uk’, … blah … blah … blah …

Writing something like this in C can get pretty awkward. You may object that you could use a programming language that provides highly sophisticated string functions or methods. Still though, the solution would involve many lines of code that are apt to compromise your program’s maintainability and as well as its extensibility.

Here is where regular expressions, often referred to as regex, come in handy. They allow us to describe the format of a string in pretty much the same way like we would describe it verbally.

Let’s simplify the example above for the sake of easier understanding.

“An email address is considered valid, if it starts with a sequence of one or more characters (only lowercase letters or numbers allowed), which is followed by an ‘@’ character, which is then followed by yet another sequence of one or more characters (again lowercase or numbers), which is succeeded by the ‘.’ (dot) character and eventually followed by at least two but not more than four lowercase letters.”

Well, you may now think of Sheldon Cooper from The Big Bang Theory, but in just a second from now you will see that this is almost exactly the way we define a regular expression pattern.

Here’s the according regex:

[a-z0-9]+@[a-z0-9]+\\.[a-z]{2,4}

Let’s have a closer look on the details.

regex[a-z0-9] is called a character class. It matches either a lowercase letter (ranging from a to z) or a number (ranging from 0-9). The +, marked green, indicates “one or more occurrences” of the previously described pattern. If we removed the +  from the expression, we would say that the @ character must be preceded by exactly one single [a-z0-9] character. If we replaced + by a *, the meaning would change to “zero or more occurrences”. This would make the address part left to the @ optional.

Most characters, such as the @ character, can be used directly as a literal in the regular expression. In this example, the @ indicates that we are checking for a mandatory @ character to follow.

What’s then following is something you should be familiar with by now. Yes, correct … good old [a-z0-9]+, yet another sequence of one or more lowercase letters or numbers.

The ‘.’ (dot) representation \\. looks kind of weird. However, a simple . wouldn’t do, because this would have a special meaning, similar to the + character we’ve seen before. Such special symbols are called metacharacters. To use their literals, we need to “escape” them with \\. So if you were to check for a mandatory + character, e.g. in a phone number +49…, you’d have to do a \\+ in the regular expression.

I guess I’ll no longer need to explain [a-z]. The interesting part, however, is the number of occurrences at the end of the expression: {2,4}. This says “at least 2, but not more than 4 occurrences”. If we wrote {42}, the top level domain would have to be a sequence of exactly(!) 42 lowercase letters.

Of course there are much more metacharacters you can use in regular expressions. Various sites on the Internet offer more or less complete lists. Note that the syntax of regular expression can vary slightly between different programming or scripting languages. The principle of how a regular expression is built, most character classes, and most metacharacters are always the same, though. Learn one – know ’em all.

Now that you have seen how a regex pattern is built, you may wonder how you can actually use it to perform a format check on an email address. This step may differ between different programming languages. Plus, some languages, such as Java, offer more than one way to employ regular expressions.

The following Java program shows the simplest approach in Java to regex check an email address.

import java.util.Scanner;

public class Demo {

  public static void main( String[] args ) {

    Scanner s = new Scanner( System.in );

    System.out.print( "Enter your email address: " );
    String input = s.next();
    String pattern = "[a-z0-9]+@[a-z0-9]+\\.[a-z]{2,4}";
    if( input.matches( pattern ) ) {
      System.out.println( "Valid." );
    }
    else {
      System.out.println( "Invalid." );
    }
  }
}

In Java, any String object provides a method which allows you to verify if this string matches the regex pattern that is passed as its argument. Impressive, isn’t it?

Now think of what you can do with this stuff! You could dynamically generate regex patterns during runtime. Or if you’re a homo nerdus delirum, you could use regular expressions to match regular expressions. 😉

On the Linux command line interface (aka ‘Shell’), regular expressions are used to filter information. Consider this example:

ls | egrep "^[0-9]{1,3}\.jpg$"

This command lists all files with names that start with one to three numbers and end with .jpg. The regex pattern (printed bold) contains two metacharacters we haven’t yet been looking at: The ^ character to represent the beginning of a string, and the $ character to indicate the end of a string. So the expression only matches strings like 004.jpg, but not something23.jpgsomethingelse. I could show you more 😎 stuff like this, but I don’t want to make you faint from awe, hahaha.

The last remaining question is: What are the disadvantages of using regex, if there are any?

Actually, there is only one major disadvantage, which is pretty obvious especially to most of my students ;). You really need to gain some routine to formulate complex regular expressions in little time. Like all cool things in life, mastering regex requires tedious exercise.

For rare situations, you may want to keep in mind that when using regex you rely on predefined matching algorithms. Depending on what specific regex pattern you want to match, and on the programming language you are using, these algorithms can be more or less runtime efficient than hand-coded “if-trees”. For further information on what happens behind the scenes, please refer to Russ Cox’s website on regex matching.

A neat and pretty complete regex tutorial, can be found here:
http://www.cs.colorado.edu/~schenkc/UNIX_Regular_Expressions.pdf

See you in class,

— Andre M. Maier

Posted in Programming Essentials, Uncategorized | Tagged , , , , , ,