Hi again! Have you ever used a floating point number in your code? They usually show up as a float or a double, but essentially a floating point is a type of data that represents real numbers (like 0.1, or 3.14159265359, or 123456789 × 10^(−23)). While it can represent decimals, it can also represent whole numbers like 1 or 12345678.
Regardless of which one you used, there is a chance your code might be in trouble. When you use a number (like 0.1), your computer might not actually be using that number but instead something really close to it.
Now multiply your slightly-wrong number a few times, add it to a few other slightly-wrong numbers, and soon your math is chaos! Your computer isn't actually listening to you.
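Don't believe me? Here's a tiny Python check (Python's float is an IEEE double, so any language with doubles behaves the same way):

```python
# 0.1 cannot be represented exactly in binary floating point.
print(f"{0.1:.20f}")          # 0.10000000000000000555...

# Adding it ten times does NOT give exactly 1.0.
total = 0.0
for _ in range(10):
    total += 0.1
print(total)                  # 0.9999999999999999
print(total == 1.0)           # False
```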
How do we reduce errors with floating point operations?
Today I'll be summarizing "What Every Computer Scientist Should Know About Floating-Point Arithmetic" by David Goldberg, from 1991. Give it a read if you dare…. hehe
Ok that paper was very long…. so I'mma just cover some of the basic points of what I read. Drop a comment if you want me to try and explain a section from the paper that I didn't cover (I'll try lol).
Get ready to get smarter 🙂
Representing Floating Points
Floating points are represented with a numerical base β (like decimal or binary or hexadecimal), an exponent range e_min to e_max, and a precision p (the number of digits). They are represented in scientific notation with a single non-zero digit before the decimal point.
For example, it would look something like this:
±d.dd…d × β^e
The digits of the significand d.dd…d are each in the range 0 ≤ d < β, and there are p of them.
So if β = 2 and p = 3 and I wanted to represent 0.77, it would be something like this:
1.10 × 2^(−1)
This would be equal to 0.110 in binary, which is 0.75. This is the closest we could get to 0.77 with 3 digits of precision p in binary. As you can see, floating point representations lie: 0.75 is not 0.77, butttt it is close enough.
If β = 10 and p = 3, then 2734 would be represented as
2.73 × 10^3
which is really 2730, not 2734.
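If you want to play with this, here's a minimal sketch of that rounding idea in Python (fp_round is just a toy helper I made up, assuming round-to-nearest; it's not how your hardware actually does it):

```python
import math

def fp_round(x, beta=2, p=3):
    """Toy model: round x to the nearest p-digit floating point number in base beta."""
    if x == 0:
        return 0.0
    e = math.floor(math.log(abs(x), beta))   # exponent so the significand is in [1, beta)
    m = abs(x) / beta**e                     # significand d.dd...d
    digits = round(m * beta**(p - 1))        # keep only p digits of the significand
    return math.copysign(digits * beta**(e - p + 1), x)

print(fp_round(0.77, beta=2, p=3))    # 0.75   (that's 1.10 x 2^-1)
print(fp_round(2734, beta=10, p=3))   # 2730.0 (that's 2.73 x 10^3)
```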
If we keep saying floating points are “close enough” for our numbers and we start doing operations on them, eventually our representations will be far from the actual number.
Okay, let’s measure how far we are from the actual number. AKA, what is the error?
Understanding the error
There are two kinds of floating point error (rounding error) that are commonly measured: ulps (units in the last place) and relative error.
ulps
The units in the last place is how much the last digit of our representation is off by, compared to the actual number. To be exact, if the floating point number d.dd…d × β^e is used to represent the actual number z, it is off by
|d.dd…d − (z / β^e)| × β^(p−1)
units in the last place.
If this looks confusing, let's just do an example and it'll be a lot easier. Let's say we have the number 314.09 (which is 3.1409 × 10^2) and our z = 314.1592653589.
From this we know β = 10, p = 5, and e = 2.
|3.1409 − 3.141592653589| × 10^(5−1) ≈ 6.93
So this representation is off by roughly 6.93 ulps.
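Here's that calculation as a quick Python sanity check (the values of β, p, and e come straight from the example above):

```python
# Ulps of 314.09 (3.1409 x 10^2) as a representation of z = 314.1592653589,
# using |d.dd...d - z / beta^e| * beta^(p-1).
beta, p, e = 10, 5, 2
significand = 3.1409
z = 314.1592653589

ulps = abs(significand - z / beta**e) * beta**(p - 1)
print(ulps)   # ~6.9265
```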
Relative error
This error takes the absolute error and puts it in proportion to the magnitude of the real number.
Super simple. First take the absolute error (the difference between the representation and the actual number):
absolute error = |representation − actual|
Now divide that by the real number to put it in proportion:
relative error = |representation − actual| / |actual|
The idea is that errors only matter relative to how big the number is. If I wanted to buy a million dollar home (if I'm ever rich enough), I wouldn't really mind a difference of 10$.
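To make it concrete, here's the relative error of the 314.09 example from the ulps section:

```python
# Relative error = |representation - actual| / |actual|,
# reusing 314.09 as the representation of z = 314.1592653589.
representation = 314.09
z = 314.1592653589

abs_error = abs(representation - z)
rel_error = abs_error / abs(z)
print(abs_error)   # ~0.0693
print(rel_error)   # ~0.00022
```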
Converting 0.5 ulps to relative error
Let's say I have a number represented as d.dd…d × β^e. If this number is off by 0.5 ulps, the error is bounded by
(β/2) × β^(−p) × β^e
where (β/2) × β^(−p) is half an ulp of the significand.
Convincing you about the 0.5 ulps absolute error
Let me convince you. Let's say we're in base 10 (β = 10) and we had p = 3 with the number 9.97. If this came from properly rounding the actual number to the nearest representation, the actual number would have been somewhere between 9.965 (inclusive) and 9.975 (exclusive).
The limits of the actual number are bounded by an error of 0.005. This error is 0.5 ulps. It is also the same as (β/2) × β^(−p) × β^e, because (10/2) × 10^(−3) × 10^0 = 0.005.
This might seem obvious in base 10, but let's try something in base 2 (β = 2). Let's say we had 1.11 (1.75 in decimal) when β = 2 and p = 3. If this came from proper rounding, the real number before rounding could have been anywhere between 1.101 (inclusive) and 1.111 (exclusive), in other words within 0.001 of 1.11. That maximum error of 0.001 (ex. 1.111 − 1.110) is 0.1 ulps written in binary, which is 0.5 ulps in decimal. It is also the same as (β/2) × β^(−p) × β^e, because (2/2) × 2^(−3) × 2^0 = 0.125 in decimal = 0.001 in binary.
Therefore, if a number was rounded properly to its nearest representation, it has an error of at most 0.5 ulps.
In other words, 0.5 ulps always equals (β/2) × β^(−p) × β^e. This is the absolute error of 0.5 ulps.
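Quick check of that formula against both examples (half_ulp is just a throwaway helper name):

```python
# Half an ulp of a number d.dd...d x beta^e is (beta/2) * beta^(-p) * beta^e.
def half_ulp(beta, p, e):
    return (beta / 2) * beta**(-p) * beta**e

print(half_ulp(beta=10, p=3, e=0))   # 0.005  -> the 9.97 example
print(half_ulp(beta=2,  p=3, e=0))   # 0.125  -> 0.001 in binary, the 1.11 example
```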
Onto the relative error
Now, what is 0.5 ulps in relative error? Remember that relative error was absolute error divided by the actual number.
We just said that the absolute error when we have 0.5 ulps is (β/2) × β^(−p) × β^e. But the actual number could be anything.
Specifically, it could be anywhere in the range between β^e and β × β^e (since the significand runs from 1.00…0 up to just below β).
So, if 0.5 ulps was 0.001 in binary (0.125 in decimal), then p = 3 and e = 0. In that case the real number must have been between 2^0 and 2^1, which is 1 and 2 in decimal, or 1 and 10 in binary.
From our previous example, we can see that is true: 1.11 (1.75 in decimal) was in fact between 1 and 10 in binary.
Cool, so we set some bounds on what the actual number could have been, namely: at least β^e and less than β × β^e.
This means we can set bounds for the relative error of 0.5 ulps. So let's divide the absolute error by the bounds of the real number.
Upper bound of relative error (dividing by the smallest possible number, β^e):
((β/2) × β^(−p) × β^e) / β^e = (β/2) × β^(−p)
Lower bound of relative error (dividing by the largest possible number, β × β^e):
((β/2) × β^(−p) × β^e) / (β × β^e) = (1/2) × β^(−p)
Therefore,
(1/2) × β^(−p) ≤ relative error of 0.5 ulps ≤ (β/2) × β^(−p)
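Here's a tiny sketch of those bounds for a couple of bases (rel_error_bounds is a made-up helper, nothing official):

```python
# Bounds on the relative error that corresponds to 0.5 ulps.
def rel_error_bounds(beta, p):
    lower = 0.5 * beta**(-p)          # actual number near beta * beta^e
    upper = (beta / 2) * beta**(-p)   # actual number near beta^e
    return lower, upper

for beta, p in [(2, 3), (10, 3)]:
    lo, hi = rel_error_bounds(beta, p)
    print(beta, p, lo, hi, hi / lo)   # the ratio hi/lo is always beta (the wobble)
```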
Machine epsilon
The upper bound on the relative error of 0.5 ulps is called machine epsilon: ε = (β/2) × β^(−p). This is the largest relative error you can get from rounding to the nearest representation in a given base and precision.
A larger precision p, as expected, means a smaller relative error/machine epsilon. We also notice that the relative error of 0.5 ulps is bounded between ε/β and ε. These bounds differ by a factor of β, which we call the wobble.
Yeahhh… wobble baby wobble baby…
BTW, machine epsilon is such a cool name. I’m not judging if you name your dog or child machine epsilon.
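Side note: for the doubles your code actually uses (β = 2, p = 53), machine epsilon works out to about 1.1 × 10^(−16). One gotcha if you check this in Python: sys.float_info.epsilon follows a slightly different convention (it's the gap between 1.0 and the next float, β^(1−p), which is exactly twice our ε):

```python
import sys

beta, p = 2, 53                      # IEEE 754 double precision
machine_eps = (beta / 2) * beta**(-p)
print(machine_eps)                   # 1.1102230246251565e-16

# Python reports the gap between 1.0 and the next float, which is 2x this value.
print(sys.float_info.epsilon)        # 2.220446049250313e-16
print(sys.float_info.epsilon == 2 * machine_eps)   # True
```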
Relative errors with machine epsilon
Remember, machine epsilon is the upper bound for the relative error of rounding (i.e. of 0.5 ulps). So if the relative error we actually got is much lower, we can express it as a multiple of machine epsilon, like this:
relative error = k × ε
Let's do an example. If I had the number 3.14159 to represent with β = 10 and p = 3, I would have to round to 3.14. This would have an absolute error of 0.00159, or 0.159 ulps. For relative error, I do 0.00159 / 3.14159, which gives roughly 0.0005.
Now, to find the ratio, we must find the machine epsilon:
ε = (10/2) × 10^(−3) = 0.005
So… the ratio is:
k = 0.0005 / 0.005 = 0.1
So we say that the relative error is 0.1ε.
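Same example in a few lines of Python (expect some tiny float noise in the printed digits):

```python
# 3.14159 rounded with beta = 10, p = 3 gives 3.14.
beta, p = 10, 3
actual, rounded = 3.14159, 3.14

abs_err = abs(rounded - actual)              # ~0.00159
ulps    = abs_err / beta**(0 - p + 1)        # one ulp of 3.14 x 10^0 is 10^(-2)
rel_err = abs_err / actual                   # ~0.0005
eps     = (beta / 2) * beta**(-p)            # 0.005
print(ulps, rel_err, rel_err / eps)          # ~0.159, ~0.0005, ~0.1
```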
The Wobble
Get ready for things to wobble. First let me show you how ulps and relative error react to each other.
Using 1.0 to represent 1.04 in decimal (β = 10, p = 2) gives an error of 0.4 ulps and a relative error of about 0.038. The machine epsilon here is 0.05, which makes the relative error about 0.77ε.
Great! Hopefully this made sense so far.
Now, let's multiply our number by, say, 8. The actual number would be 8.32 while the calculated number would be 8.0. That is an error of 3.2 ulps, which is 8 times larger than before! However, our relative error is still about 0.038, which is the same 0.77ε.
Whoa! Our ulps increased, but our relative error was the same?
Yep. It turns out whenever you have a fixed relative error, your ulps can wobble by a factor of β.
On the other hand, whenever we have a fixed number of ulps (like we showed earlier with 0.5 ulps), the relative error has bounds which show it can also wobble by a factor of β.
So, the smaller the β, the smaller the wobble and the tighter the error bounds! Using binary keeps the wobble, and with it our error bounds, as small as possible.
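Here's that wobble example verified in Python (e = 0 for both 1.0 and 8.0, so one ulp is 0.1 in both cases):

```python
# Multiplying by 8 scales the ulps error by 8, but the relative error stays put.
beta, p = 10, 2
eps = (beta / 2) * beta**(-p)                       # 0.05

for actual, computed in [(1.04, 1.0), (8.32, 8.0)]:
    err = abs(actual - computed)
    ulp = beta**(0 - p + 1)                         # 0.1, since e = 0 for both
    print(err / ulp, err / actual, (err / actual) / eps)
    # -> ~0.4 ulps, rel err ~0.038, ~0.77 * eps
    # -> ~3.2 ulps, rel err ~0.038, ~0.77 * eps
```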
Contaminated digits
We now know that the number of ulps and the relative error ratio k can differ from each other by at most a factor of β, the wobble. As a result, either one can be used to estimate the number of contaminated digits (the number of digits that are wrong compared to the correct representation of the number):
contaminated digits ≈ log_β(n)
Here n is the number of ulps. n can also mean k, the ratio between the relative error and ε. It can mean either because they only differ by the wobble factor.
So if I had the decimal number 3.10 with p = 3 and it was trying to represent 3.1415, it would have an error of 4.15 ulps. The contaminated digits would be roughly log₁₀(4.15), which is roughly 0.61804809 digits.
LOL we can’t have partial digits! We’ll see that when pigs fly.
Looking at it visually, we can see that it is wrong in 1 digit, the last one, which is pretty close to what we got from our calculation.
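And the contaminated-digits estimate from that example, checked in Python:

```python
import math

# Contaminated digits ~ log_beta(n), where n is the error in ulps.
beta, p = 10, 3
actual, rounded = 3.1415, 3.10

ulps = abs(rounded - actual) / beta**(0 - p + 1)   # ~4.15 ulps
print(ulps, math.log(ulps, beta))                  # ~4.15, ~0.618 contaminated digits
```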
Guard digits
Let's subtract 2 values when β = 10 and p = 3: x = 1.01 and y = 0.993 (which is 9.93 × 10^(−1)). To subtract, y has to be shifted so its digits line up with x's:
  1.01
− 0.99
———————
  0.02
y becomes 0.99 and not 0.993 because we had to lose some data with p = 3 so that the two numbers could be subtracted from each other at the same exponent.
As you know, the actual answer is 0.017, but the answer ended up being 0.02. So the computed difference is off by 30 ulps (the error is 0.003, and one ulp at the scale of the true answer 1.70 × 10^(−2) is 0.0001)!
The relative error from this kind of subtraction (using only p digits) can be as large as β − 1. Let me show you why.
Take x = 1.00…0 and y = 0.ρρ…ρ where ρ = β − 1 (ex. in base 10 with p = 3, x = 1.00 and y = 0.999). Both x and y have p digits.
If I subtracted them, I should get the actual answer of β^(−p), but because we shift y to the right and lose a digit, we end up getting β^(−p+1).
The relative error is then (β^(−p+1) − β^(−p)) / β^(−p) = β − 1.
If our β = 2, that would make our relative error 1: the error is as big as the answer itself. In terms of ε (which is 2^(−p) when β = 2), it would mean the relative error is k × ε with k = 2^p, and so the contaminated digits would be log₂(2^p) = p. If p digits are contaminated, all of them are contaminated.
Was there a way we could have avoided some of this error? Okay, fine, there is…. otherwise I wouldn't have written about this section…
Just add a temporary extra digit during the subtraction. And let's call it a … wait for it … a guard digit. surprise surprise
With the guard digit, y gets to keep its last digit while it's being aligned, so the earlier subtraction comes out exact: 1.010 − 0.993 = 0.017. No error at all. A bit better than before.
Turns out a single guard digit bounds the relative error of the result to less than 2ε. I'm lazy, but if someone in the comments asks, I'll figure out why and try to explain it (it's in the linked paper).
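If you want to poke at this yourself, here's a toy simulation I threw together with Python's decimal module (subtract_p_digits is a made-up helper that mimics aligning y to x's exponent inside a register with a limited number of digit positions; it's a sketch of the idea, not how real hardware does it):

```python
import math
from decimal import Decimal, ROUND_DOWN

def subtract_p_digits(x, y, p, guard_digits=0):
    """Toy base-10 subtraction: y gets aligned to x's exponent, and only
    p (+ guard_digits) digit positions fit, so y's trailing digits fall off."""
    x, y = Decimal(x), Decimal(y)
    e = math.floor(x.copy_abs().log10())                   # exponent of x (0 for 1.01)
    quantum = Decimal(10) ** (e - (p - 1) - guard_digits)
    y_aligned = y.quantize(quantum, rounding=ROUND_DOWN)   # chop digits that don't fit
    diff = x - y_aligned
    if diff == 0:
        return diff
    e_diff = math.floor(diff.copy_abs().log10())
    return diff.quantize(Decimal(10) ** (e_diff - (p - 1)))  # round result to p digits

print(subtract_p_digits("1.01", "0.993", p=3))                  # 0.0200 -> off by 30 ulps
print(subtract_p_digits("1.01", "0.993", p=3, guard_digits=1))  # 0.0170 -> exact
```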
Benign and Catastrophic Cancellation
When we try to subtract two numbers that are really close to each other, many of their digits cancel out and become 0. We call this cancellation. Cancellation can be catastrophic or benign.
Often the later, far-right digits (the least significant digits) of the two numbers already carry rounding errors from earlier operations or earlier rounding, while the more accurate digits are at the front (the most significant digits). When we subtract two close numbers, the accurate front digits cancel each other out, and we're left with a result built from the less accurate trailing digits, which is much less accurate relative to its size. (Like when you calculate the determinant ad − bc after the products ad and bc have each been rounded.)
The catastrophic cancellation just exposes the rounding errors from prior operations.
Benign cancellation happens when you subtract numbers that have no rounding errors.
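Here's a tiny illustration of the difference (the subtraction itself is innocent; it just reveals rounding that already happened, or doesn't if there was none):

```python
# Catastrophic: the subtraction exposes the rounding error hiding in (1 + 1e-15).
tiny = 1e-15
x = 1 + tiny          # the true sum is not exactly representable, so it got rounded
print(x - 1)          # 1.1102230246251565e-15  (about 11% away from 1e-15!)

# Benign: both inputs are exact floats, so there is nothing bad to expose.
a, b = 1.5, 1.25      # both exactly representable in binary
print(a - b)          # 0.25, exact
```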
IEEE Standard
So the IEEE standard is a set of rules that many systems follow to ensure consistent floating point behavior. There are two IEEE standards here: IEEE 754 and IEEE 854. They both define a smaller and a larger floating point format, called single precision and double precision.
IEEE 754
This standard requires β = 2. It has single precision (p = 24) and double precision (p = 53). It also spells out exactly how the bits should be laid out.
In fact, here is a cool table that shows how IEEE 754 sets its floating point parameters:
Parameter        Single     Double
p                24         53
e_max            +127       +1023
e_min            −126       −1022
exponent width   8 bits     11 bits
format width     32 bits    64 bits
Exponents could be stored with a sign/magnitude split (one bit for the sign of the exponent, the remaining bits for its magnitude) or with two's complement. Interestingly, IEEE 754 uses neither: the exponent is stored in a biased form, so the stored value is always non-negative.
In fact, here is exactly how the bits are laid out to represent different kinds of values:
Exponent field        Fraction    Represents
e = e_min − 1         f = 0       ±0
e = e_min − 1         f ≠ 0       0.f × 2^e_min (denormalized numbers)
e_min ≤ e ≤ e_max     any f       1.f × 2^e (normal numbers)
e = e_max + 1         f = 0       ±infinity
e = e_max + 1         f ≠ 0       NaN
To represent 0, you use the smallest exponent (e_min − 1) with the fraction zeroed out. Infinity uses the largest exponent (e_max + 1), also with the fraction zeroed out. NaN is another special value (like when 0 is divided by 0, or opposite infinities are added); it is represented the same as infinity but with a non-zero fraction.
The fraction is the part of the significand after the leading digit (the significand being the full d.dd…d string).
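You can actually peek at these bits from Python for the doubles you use every day (double_bits is a helper I made up for this post; the exact NaN fraction bits can vary):

```python
import math
import struct

def double_bits(x):
    """Show the raw 64 bits of a double: 1 sign bit | 11 exponent bits | 52 fraction bits."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    bits = f"{raw:064b}"
    return f"{bits[0]} {bits[1:12]} {bits[12:]}"

print(double_bits(0.0))        # 0 00000000000 000...0   (everything zero)
print(double_bits(1.0))        # 0 01111111111 000...0   (biased exponent 1023)
print(double_bits(math.inf))   # 0 11111111111 000...0   (max exponent, zero fraction)
print(double_bits(math.nan))   # 0 11111111111 100...0   (max exponent, non-zero fraction)
```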
IEEE 854
On the other hand, this standard allows β = 2 or β = 10. However, it has no rules about how the bits should be laid out for single and double precision.
It allows base 10 because that is the way humans normally count. Base 2 is included because of its low wobble.
The Conclusion
Okay, I kinda rushed the last section. But overall, I wanted to say floating points can have a lot of imprecision. If you can avoid them, use integers instead.
In the case that you do use them, try to limit your wobbles and avoid catastrophic cancellations (there are ways you could do it sometimes by rearranging formulas).
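For example, here's the rearrangement trick from Goldberg's paper applied to the quadratic formula, where the nearly equal quantities are b and sqrt(b² − 4ac) when ac is tiny compared to b²:

```python
import math

# If b > 0, the root (-b + sqrt(b*b - 4*a*c)) / (2*a) subtracts two nearly equal
# numbers, so rewrite it as 2*c / (-b - sqrt(b*b - 4*a*c)), which has no cancellation.
a, b, c = 1.0, 1e8, 1.0            # the roots are about -1e8 and -1e-8

d = math.sqrt(b * b - 4 * a * c)
naive = (-b + d) / (2 * a)
rearranged = (2 * c) / (-b - d)
print(naive)        # about -7.45e-09: cancellation wiped out the real answer
print(rearranged)   # about -1e-08: essentially the true root
```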
Try to read the original paper yourself. This blog you are reading right now is a summary of just a fraction of Goldberg's original paper.
But as always, drop your questions and comments. And I’m out for now…
Peace
-absterdabster
lol