Floating Point Representation

Back in the days when I was in middle school, I started to develop a strong preference for “nice” and “short” numbers, such as 0.1. One likely reason was that any math problem in school was designed to yield a “nice” result. You simply didn’t want to end up with results like 3.0078125 in a math test, because you knew there must have been something wrong in your calculations.

Here’s a fun fact, though: If we were computers, we would prefer 3.0078125 as the result, because computers represent numbers in binary. The same fact is the reason why we have to strictly distinguish between the range and the accuracy of a floating point number.

This article will give you a closer look at the underlying IEEE 754 floating point representation. You should already have some basic understanding of the binary system, though.

You may remember that binary numbers follow exactly the same mathematical rules as decimal numbers, except that the base is 2 instead of 10. So from right to left, the first digit of a binary number is weighted by 2 to the power of 0, the second digit by 2 to the power of 1, the third digit by 2 to the power of 2, and so on. The same rule extends across the binary point: the digits to its right are weighted by 2 to the power of −1, 2 to the power of −2, and so on.

Considering this, you should no longer be surprised that your computer loves numbers like 3.0078125 (which is simply 11.0000001 in binary) but hates 0.1 at the same time. How many bits would be required to represent 0.1 exactly? There is no finite answer: in binary, 0.1 becomes the repeating fraction 0.0001100110011…, so no finite number of bits can store it exactly.
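If you want to see this effect for yourself, here is a minimal Java sketch (Java is also the language the float and double examples below refer to). The class name TenthDemo is just an arbitrary example; the sketch adds 0.1 ten times and then lets BigDecimal reveal the value that is actually stored for 0.1:

    import java.math.BigDecimal;

    public class TenthDemo {
        public static void main(String[] args) {
            // 0.1 has no exact binary representation, so every addition
            // carries a tiny rounding error that eventually becomes visible.
            double sum = 0.0;
            for (int i = 0; i < 10; i++) {
                sum += 0.1;
            }
            System.out.println(sum);                  // prints 0.9999999999999999

            // BigDecimal shows the value that is really stored for 0.1
            System.out.println(new BigDecimal(0.1));  // 0.1000000000000000055511151231...
        }
    }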

At this point in class, I usually ask my students to come up with a clever way to store real numbers, such as 9.75, in a computer’s memory. Unless there is a downright nerd among the students, their first idea usually involves using a fixed number of bits to hold the binary digits to the left of the binary point, and a fixed number of bits to hold the digits after it. I then show them the obvious waste of memory if we wanted to store numbers like 9.0 or 0.25.

In 1985, the IEEE (Institute of Electrical and Electronics Engineers) established a standard (IEEE 754) that allows handling of real numbers in a more efficient way. This standard not only defines the number format, but also defines all kinds of arithmetic operations, rounding, and so on.

The underlying principle is fairly simple, and it has been used by scientists for ages. Physicists, for example, would never say that the diameter of a hydrogen atom is about 0.000000000074 meters. Instead, they’d say that it is “74 times 10 to the power of minus 12 meters” or “74 picometers” (because “pico” means 10⁻¹²). This is much shorter to write, saving both ink and memory. On computer screens or calculator displays you will usually find E-12 (‘E’ stands for exponent) instead of 10⁻¹². This principle is called floating point representation, because you can picture the decimal point “floating” between the digits as the exponent increases or decreases.

If you didn’t skip physics classes in high school, you will be familiar with this kind of number representation. So any real number can be described by its exponent and its mantissa (that’s what we call the factor in front of the multiplication sign).

Now, let’s do the same thing in binary. It follows exactly the same principle, except that the base is now 2.

For example, 9.75 is 1001.11 in binary, which we can write as 1.00111 · 2³. As you can see, it is quite easy to describe any binary number by its base-2 exponent and its mantissa.
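By the way, the Java standard library can do this decomposition for us: Math.getExponent returns the base-2 exponent of a number, and dividing by the corresponding power of two (via Math.scalb) leaves the normalized mantissa. A minimal sketch (the class name is arbitrary):

    public class MantissaExponent {
        public static void main(String[] args) {
            double x = 9.75;                                  // 1001.11 in binary

            // the unbiased base-2 exponent of x
            int exponent = Math.getExponent(x);               // 3

            // dividing by 2^exponent leaves the normalized mantissa 1.xxxxx
            double mantissa = x / Math.scalb(1.0, exponent);  // 1.21875 = 1.00111 in binary

            System.out.println(x + " = " + mantissa + " * 2^" + exponent);
        }
    }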

Let’s now have a look at how this kind of number is actually stored in a computer’s memory. As you can now guess, in order to sufficiently describe a floating point number we need to store three pieces of information:

  • Signum (i.e. whether the number is positive or negative)
  • Base-2 exponent
  • Mantissa

The IEEE 754 standard specifies floating point formats of different lengths. The most popular formats are binary32 (a.k.a. single precision) and binary64 (a.k.a. double precision). In the programming language Java, for example, they correspond to the data types float and double, respectively.
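A quick way to convince yourself of the two sizes is to look at the constants Java provides for its primitive types; a tiny sketch:

    public class FormatSizes {
        public static void main(String[] args) {
            float  single = 9.75f;   // stored as IEEE 754 binary32 ("single precision")
            double dbl    = 9.75;    // stored as IEEE 754 binary64 ("double precision")

            System.out.println(single + " in a float  occupies " + Float.SIZE  + " bits"); // 32
            System.out.println(dbl    + " in a double occupies " + Double.SIZE + " bits"); // 64
        }
    }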

When stored in a binary32 (float) data type, the number 9.75 from the above example yields exactly the following bit pattern:

0 10000010 00111000000000000000000

Don’t worry if you haven’t been able to figure this out by yourself. 😉 I’m going to explain each and every single bit.

The first bit is the signum, i.e. it indicates whether the number is positive (0) or negative (1). Since the number in the example is +9.75, this bit is set to 0.

The next eight bits are designated to hold the exponent. Since the exponent can be either negative or positive, IEEE 754 specifies a bias of 2ⁿ⁻¹ − 1, where n is the number of exponent bits; in single precision format this is 2⁸⁻¹ − 1 = 127. So, in order to calculate the corresponding bit pattern, just add 127 to your exponent and convert the result to binary. In the example above, the floating point number to be stored is 1.00111 · 2³. So the biased exponent is 127 + 3 = 130, which corresponds to a binary value of 1000 0010.

The remaining 23 bits are used to store the normalized mantissa. Normalized means that the “binary point” is shifted to the left until there is only one leading 1 remaining on the left side of the “binary point”. This is exactly what we did previously in the example, where we found out that 9.75 is “1.00111 times 2 to the power of three”. Now there is just one more little tweak to it: We do not need to store the leading 1, because we know that it is there. So, what we do need to store are just the digits on the right side of the “binary point”: 00111. All the remaining less significant bits are set to zero, so the final result is 00111000000000000000000.
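You don’t have to take my word for it: Float.floatToIntBits hands you exactly this binary32 pattern as an int, so a few shifts and masks take the number apart into the three fields described above. A minimal sketch (FloatBits is just an example class name):

    public class FloatBits {
        public static void main(String[] args) {
            int bits = Float.floatToIntBits(9.75f);   // the raw binary32 pattern

            int signum   = (bits >>> 31) & 0x1;       //  1 bit
            int exponent = (bits >>> 23) & 0xFF;      //  8 bits, still biased by 127
            int mantissa =  bits & 0x7FFFFF;          // 23 bits, without the hidden leading 1

            String pattern = String.format("%32s", Integer.toBinaryString(bits)).replace(' ', '0');
            System.out.println("pattern : " + pattern);       // 01000001000111000000000000000000
            System.out.println("signum  : " + signum);        // 0
            System.out.println("exponent: " + exponent        // 130 (biased)
                    + " -> unbiased " + (exponent - 127));     // 3
            System.out.println("mantissa: "
                    + Integer.toBinaryString(mantissa));       // 00111000...0, leading zeros dropped
        }
    }

The same works for double via Double.doubleToLongBits, with an 11-bit exponent (bias 1023) and a 52-bit mantissa.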

That’s pretty much all there is to say about how floating point numbers are represented inside your computer. Feel free to pick any positive or negative real number that comes to your mind and convert it to IEEE 754 binary32 (float) or binary64 (double) format. You may want to check your results by using one of the numerous IEEE 754 converters on the Internet (enter +ieee754 +converter as a search string in Google).

Just one more thing before you leave this blog: There are certain problematic conditions that may occur when converting or handling floating point numbers. For example, the number to be converted may be outside the representable range. Therefore, the IEEE 754 standard defines special values to indicate such exceptional conditions. You should at least know the two most important ones, which you will frequently encounter as a software developer.

  • Positive Infinity and Negative Infinity
    indicate that the number has grown too large for representation. This condition is denoted with an exponent of all ones and a mantissa of all zeros. The signum bit is set to 0 for positive infinity and to 1 for negative infinity.
  • Not-a-Number (NaN)
    is used to represent a value that is not a real number. You can provoke this scenario by calculating the square root of −1, for example. We know that the result is i, but a computer can’t handle complex numbers by default. So the result is NaN, because it does not represent a real number. According to the IEEE 754 specification, NaN is denoted with an exponent of all ones and a non-zero mantissa. (Both conditions are easy to trigger in code, as the short sketch after this list shows.)
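Here is a minimal Java sketch that provokes both conditions. Note that NaN is defined to be unequal to everything, including itself, which is why you should always test with isNaN() instead of ==:

    public class SpecialValues {
        public static void main(String[] args) {
            // growing beyond the representable range yields Infinity
            float tooBig = Float.MAX_VALUE * 2.0f;
            System.out.println(tooBig);                    // Infinity
            System.out.println(-1.0 / 0.0);                // -Infinity

            // an operation without a real-number result yields NaN
            double notANumber = Math.sqrt(-1.0);
            System.out.println(notANumber);                // NaN

            // NaN is not even equal to itself
            System.out.println(notANumber == notANumber);  // false
            System.out.println(Double.isNaN(notANumber));  // true
        }
    }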

There would be many more details worth mentioning, such as “Quiet NaN” and “Signalling NaN”, but this should be enough for the time being. For further details please refer to http://en.wikipedia.org/wiki/IEEE_floating_point and/or my lectures on data representation.

See you around,

— Andre M. Maier
