Okay, so that’s a Nintendo DS. Pay no mind to the man behind the curtain! Anyways…

This time we’re going to talk about floating-point numbers: what they are, how they’re calculated, why they’re used, et cetera. There seems to be a lot of confusion about, and fear of, floating-point numbers, especially among people who aren’t computer scientists (like me).

So if you’re completely in the dark so far, let’s answer the most basic question: what is a floating-point number? Basically, floating-point numbers are computer approximations of real numbers (i.e. the set of all rational and irrational numbers). Why do computers require approximations? Well, let’s go back a few posts, when we discussed the Halting Problem. Remember that some programs run forever. We can construct such a program pretty easily.

For example, a program that tried to write out every digit of $\frac{1}{3}$ would never stop. You can represent the number $\frac{1}{3}$ as the infinite sum $\frac{1}{3} = \lim\limits_{n \to \infty} \sum\limits_{i = 1}^n \frac{3}{10^i} = \frac{3}{10} + \frac{3}{100} + \frac{3}{1000} + \cdots$. If you aren’t familiar with this notation, look at the right-hand side of the equality. The first term equals .3, the second equals .03, the third equals .003, and so on. If you add these up you get .3333 repeating, which is exactly $\frac{1}{3}$.
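If you want to watch the sum creep up on $\frac{1}{3}$, here is a minimal Python sketch. The function name `partial_sum` is made up for this post, and the standard `fractions` module is used so that the arithmetic stays exact instead of being rounded.

```python
from fractions import Fraction

def partial_sum(n):
    """Sum of 3/10**i for i = 1..n, kept as an exact fraction."""
    return sum(Fraction(3, 10**i) for i in range(1, n + 1))

for n in (1, 2, 3, 10):
    s = partial_sum(n)
    gap = Fraction(1, 3) - s   # how far the finite sum still is from 1/3
    print(n, s, gap)
```

The gap shrinks by a factor of ten with every extra term, but it never reaches zero for any finite `n`.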

Now, a computer cannot replicate such a number exactly (well, that’s not exactly true; there are certain methods that can replicate some of them, but not most) because computers are finite-state machines. This is another Turing vocabulary word, and all it means is that computers can only accept finite inputs and can only produce finite outputs. That means that if you enter a symbolic representation of an infinitely long number (e.g. the number $\frac{1}{3}$), the computer must convert it into a finite one. In this process, the computer necessarily rounds or chops after a certain number of digits. Floating-point numbers provide the rules for this rounding and chopping.
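You can see the rounding happen on your own machine. The sketch below assumes Python’s `float` is the usual binary double (true on essentially every current platform) and uses `Fraction(1 / 3)` to recover the exact value the computer actually stored.

```python
from fractions import Fraction

stored = Fraction(1 / 3)   # exact value of the float closest to 1/3
exact = Fraction(1, 3)     # the real 1/3

print(stored.denominator)      # a power of two, since the machine works in base 2
print(stored == exact)         # False: the stored value is only an approximation
print(float(exact - stored))   # the (tiny) rounding error
```

The stored number is a fraction whose denominator is a power of two, and no such fraction equals $\frac{1}{3}$, so a small error is unavoidable.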

So how are floating-point numbers calculated? I’m going to give a formula… WAIT… DON’T CLOSE THE BROWSER.. yes, I know, formulas, we hates them. I’ll explain everything, I promise.
So here goes:

Let $x$ be the number we want to represent. The formula used to calculate $x$ is this:

$x = \pm \beta ^\epsilon \left( {\frac{{d_1 }}{\beta } + \frac{{d_2 }}{{\beta ^2 }} + \frac{{d_3 }}{{\beta ^3 }} + \cdots + \frac{{d_t }}{{\beta ^t }}} \right)$
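If formulas read more easily as code, here is a rough Python transcription of this one. The function name `fp_value` and its argument names are invented for illustration; `digits` is the list $d_1, d_2, \ldots, d_t$, and `Fraction` keeps the result exact.

```python
from fractions import Fraction

def fp_value(sign, beta, exponent, digits):
    """Evaluate sign * beta**exponent * (d1/beta + d2/beta**2 + ... + dt/beta**t)."""
    mantissa = sum(Fraction(d, beta**k) for k, d in enumerate(digits, start=1))
    return sign * Fraction(beta) ** exponent * mantissa

# Base 10, exponent 0, five digits of 3: the truncated version of 1/3.
print(fp_value(+1, 10, 0, [3, 3, 3, 3, 3]))   # 33333/100000

# Base 2, exponent 3, digits 1 0 1: -(2**3) * (1/2 + 1/8) = -5.
print(fp_value(-1, 2, 3, [1, 0, 1]))          # -5
```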

Let’s start with $\beta$.

$\beta$ is the base of the number system you want to use; for instance, we use base 10 in everyday math (the decimal system). $\beta$ is discretionary in the sense that you may choose whatever base you want. Personal computers use a base-2 system because it’s a natural extension of the way they work. Computers represent information via electrical voltages. A positive voltage corresponds to “on”, or the binary number 1. A zero voltage (or a negative voltage with regard to the positive) corresponds to “off”, or the binary number 0. You could design a ternary computer that ran on base 3 by representing three steps in voltage as 0, 1, and 2; we could call these three states off, kinda-off, and on.

Next, $\epsilon$.

$\epsilon$ is the exponent of the floating-point number, and the bounds placed on it set the number’s range: how large and how small a number you can represent. These bounds are also discretionary. MATLAB, which is a mathematical computing suite, uses a lower bound of -1022 and an upper bound of 1023, plus two special exponent values set aside for zero and for infinity or NaN (which stands for “not a number”). Infinity here is only the name given to results whose $\epsilon$ would have to be greater than 1023. (This is MATLAB’s infinity. In mathematics, infinity is not a number in the strict sense of the word.)
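MATLAB isn’t the only place to see these bounds. Here is a small Python sketch, assuming your machine uses the same standard double-precision format (nearly all do). Python’s `sys.float_info` reports the exponent bounds as -1021 and 1024, one higher than the figures above, because it counts exponents with all digits after the point, which happens to be the convention of the formula in this post.

```python
import math
import sys

info = sys.float_info
print(info.radix)      # the base, beta
print(info.mant_dig)   # the precision, t, counted in base-2 digits
print(info.min_exp, info.max_exp)

big = 2.0 ** 1023            # still representable
print(big)
print(big * 2)               # needs a larger exponent, so the result is inf
print(math.isinf(big * 2))   # True

not_a_number = float("inf") - float("inf")
print(math.isnan(not_a_number))   # True
```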

In theory, you can choose whatever exponent bounds you like. In practice, you will probably never need those really, really large or really, really small numbers. And anyway, you have to choose some bounds: a finite-state machine can only set aside a finite amount of storage for the exponent.

Next, the $d$’s.

The $d$’s represent the digits of the floating-point expansion. In base 10 (decimal), these are just the digits after the decimal point. So in our $\frac{1}{3}$ example, $d_1 = 3, d_2 = 3, d_3 = 3$, and in fact all of them equal 3. These variables have two properties: the $d$’s are all nonnegative integers (0, 1, 2, et cetera) and $0 \leq d_i \leq \beta - 1$. Why? The first one is just convenient. If we allowed the digits of our expansion to be fractions themselves, it would just complicate things; what if one of the digits were irrational? We would have a lot of trouble justifying the floating-point system. The second one is simply what it means to be a digit in base $\beta$: decimal digits run from 0 to 9, and binary digits are 0 or 1.

You’re probably wondering what that $t$ is about, though. The $t$ is what’s called the precision of the floating-point number. This variable is what we use to turn a number with possibly infinite digits into a number with finite digits (so that a computer can actually use it). $t$ gives the maximum number of terms in the floating-point expansion. So for example, if $t = 5$ then $\frac{1}{3} \approx .33333$; i.e. the floating-point number goes out 5 places, and the rest is chopped off.
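Here is a minimal sketch of that chopping in Python. The helper name `chop_digits` is hypothetical, and it only handles numbers between 0 and 1, which is all the mantissa ever needs. It pulls out one digit at a time by multiplying by the base and keeping the integer part.

```python
from fractions import Fraction

def chop_digits(x, beta, t):
    """First t base-beta digits of x (with 0 <= x < 1); everything after is dropped."""
    digits = []
    for _ in range(t):
        x *= beta        # shift the next digit in front of the point
        d = int(x)       # that digit is the integer part
        digits.append(d)
        x -= d           # keep only what is left after the point
    return digits

print(chop_digits(Fraction(1, 3), 10, 5))   # [3, 3, 3, 3, 3]
print(chop_digits(Fraction(2, 3), 10, 5))   # [6, 6, 6, 6, 6]
```

Notice that $\frac{2}{3}$ comes out as .66666 rather than .66667: this sketch chops, it doesn’t round. Real floating-point systems usually round instead, but the idea of stopping after $t$ digits is the same.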

So let’s do some examples:

Let’s calculate the number 64 in base 2, base 3, and base 10 with a 5-digit mantissa (i.e. with $t=5$).

Base 2:

$64_{\mathbf{(base\ 10)}} = 2^7 \left( \frac{1}{2} + \frac{0}{4} + \frac{0}{8} + \frac{0}{16} + \frac{0}{32} \right) = 1000000_{\mathbf{(base\ 2)}}$ (this is a binary number! It isn’t equal to a million in decimal! A computer represents this number with the sequence of voltages, “on, off, off, off, off, off, off”.)

Base 3:

$64_{\mathbf{(base\ 10)}} = 3^5 \left( \frac{0}{3} + \frac{2}{9} + \frac{1}{27} + \frac{0}{81} + \frac{1}{243} \right) = 2101_{\mathbf{(base\ 3)}}$ (this is a ternary number! It isn’t equal to two thousand, one hundred and one in decimal.)

Base 10:

$64_{\mathbf{(base\ 10)}} = 10^2 \left( \frac{6}{10} + \frac{4}{100} + \frac{0}{1000} + \frac{0}{10000} + \frac{0}{100000} \right) = 64_{\mathbf{(base\ 10)}}$ (this is a decimal number!)
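The three examples above can be checked with a short Python sketch. The function name `decompose` is made up for this post, and it makes one assumption the formula doesn’t force on you: it picks the exponent so that the first digit is nonzero. For base 3 that means it reports exponent 4 with digits 2, 1, 0, 1, 0; the same value can also be written with exponent 5 and a leading zero digit.

```python
from fractions import Fraction

def decompose(x, beta, t):
    """Return (exponent, digits) so that x is about beta**exponent * 0.d1d2...dt.

    Assumes x > 0. The exponent is chosen so that d1 is nonzero, and any
    digits beyond the t-th are chopped off.
    """
    x = Fraction(x)
    exponent = 0
    while x >= 1:                  # scale down until x is below 1
        x /= beta
        exponent += 1
    while x < Fraction(1, beta):   # scale up until the first digit is nonzero
        x *= beta
        exponent -= 1
    digits = []
    for _ in range(t):             # peel off t digits, one at a time
        x *= beta
        d = int(x)
        digits.append(d)
        x -= d
    return exponent, digits

for beta in (2, 3, 10):
    print(beta, decompose(64, beta, 5))
```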

The great success of the floating-point scheme is that it allows us to tune range and precision via the formula variables. We can approximate a number as closely as we want, provided we have the storage capacity. You can also see why computers use binary and not the other systems. Instead of having to tell ten different voltage levels apart for decimal digits, a computer only has to distinguish two for binary.
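To put a number on “as closely as we want”, here is a small Python sketch that chops $\frac{1}{3}$ to $t$ binary digits and measures what was lost. The helper name `chop` is hypothetical, and it only handles numbers between 0 and 1.

```python
from fractions import Fraction

def chop(x, beta, t):
    """Keep t base-beta digits of x after the point (0 <= x < 1) and drop the rest."""
    scale = beta ** t
    return Fraction(int(x * scale), scale)

x = Fraction(1, 3)
for t in (4, 8, 16, 32):
    error = x - chop(x, 2, t)
    print(t, float(error))   # always smaller than 2**-t
```

Every extra binary digit of storage cuts the worst-case error in half, so the price of more accuracy is simply more digits.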

So there you have it. Not so bad, was it? By the way, if you’re wondering what a mantissa is, it’s just the part of our formula where we add up the fractions.