Hi again! Have you ever used a floating point number in your code? They usually show up as a float or a double, but essentially a floating point is a type of data that represents real numbers (like 0.1, or 3.14159265359, or 123456789 × 10^(−23)). While it can represent decimals, it can also represent whole numbers like 1 or 12345678.
Regardless of which one you used, there is a chance your code might be in trouble. When you use a number (like 0.1), your computer might not actually be using that number but instead something really close to it.
Now multiply your slightly-wrong number a few times, add it to a few other slightly-wrong numbers, and soon your math is chaos! Your computer isn't actually listening to you.
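Don't believe me? Here's a tiny Python check (Python's float is an IEEE double, so any language with doubles behaves the same way):

```python
# 0.1 cannot be represented exactly in binary floating point.
print(f"{0.1:.20f}")          # 0.10000000000000000555...

# Adding it ten times does NOT give exactly 1.0.
total = 0.0
for _ in range(10):
    total += 0.1
print(total)                  # 0.9999999999999999
print(total == 1.0)           # False
```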
How do we reduce errors with floating point operations?
Today I'll be summarizing "What Every Computer Scientist Should Know About Floating-Point Arithmetic" by David Goldberg, from 1991. Give it a read if you dare…. hehe
Ok that paper was very long…. so I'mma just cover some of the basic points of what I read. Drop a comment if you want me to try and explain a section from the paper that I didn't cover (I'll try lol).
Get ready to get smarter 🙂
Representing Floating Points
Floating points are represented with a numerical base β (like decimal or binary or hexadecimal), an exponent range e_min to e_max, and a precision p (the number of digits). They are represented in scientific notation with a single non-zero digit before the decimal point.
For example, it would look something like this:
±d.dd…d × β^e
The digits of the significand d.dd…d are each in the range 0 ≤ d < β, and there are p of them.
So if β = 2 and p = 3 and I wanted to represent 0.77, it would be something like this:
1.10 × 2^(−1)
This would be equal to 0.110 in binary, which is 0.75. This is the closest we could get to 0.77 with 3 digits of precision p in binary. As you can see, floating point representations lie: 0.75 is not 0.77, butttt it is close enough.
If β = 10 and p = 3, then 2734 would be represented as
2.73 × 10^3
which is really 2730, not 2734.
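If you want to play with this, here's a minimal sketch of that rounding idea in Python (fp_round is just a toy helper I made up, assuming round-to-nearest; it's not how your hardware actually does it):

```python
import math

def fp_round(x, beta=2, p=3):
    """Toy model: round x to the nearest p-digit floating point number in base beta."""
    if x == 0:
        return 0.0
    e = math.floor(math.log(abs(x), beta))   # exponent so the significand is in [1, beta)
    m = abs(x) / beta**e                     # significand d.dd...d
    digits = round(m * beta**(p - 1))        # keep only p digits of the significand
    return math.copysign(digits * beta**(e - p + 1), x)

print(fp_round(0.77, beta=2, p=3))    # 0.75   (that's 1.10 x 2^-1)
print(fp_round(2734, beta=10, p=3))   # 2730.0 (that's 2.73 x 10^3)
```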
If we keep saying floating points are “close enough” for our numbers and we start doing operations on them, eventually our representations will be far from the actual number.
Okay, let’s measure how far we are from the actual number. AKA, what is the error?
Understanding the error
There are two kinds of floating point error (rounding error) that are commonly measured: ulps (units in the last place) and relative error.
ulps
The units in the last place is how much the last digit of our representation is off by, compared to the actual number. To be exact, if the floating point number d.dd…d × β^e is used to represent the actual number z, it is off by
|d.dd…d − (z / β^e)| × β^(p−1)
units in the last place.
If this looks confusing, let's just do an example and it'll be a lot easier. Let's say we have the number 314.09 (which is 3.1409 × 10^2) and our z = 314.1592653589.
From this we know β = 10, p = 5, and e = 2.
|3.1409 − 3.141592653589| × 10^(5−1) ≈ 6.93
So this representation is off by roughly 6.93 ulps.
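Here's that calculation as a quick Python sanity check (the values of β, p, and e come straight from the example above):

```python
# Ulps of 314.09 (3.1409 x 10^2) as a representation of z = 314.1592653589,
# using |d.dd...d - z / beta^e| * beta^(p-1).
beta, p, e = 10, 5, 2
significand = 3.1409
z = 314.1592653589

ulps = abs(significand - z / beta**e) * beta**(p - 1)
print(ulps)   # ~6.9265
```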
Relative error
This error takes the absolute error and puts it in proportion to the magnitude of the real number.
Super simple. First take the absolute error (the difference between the representation and the actual number):
absolute error = |representation − actual|
Now divide that by the real number to put it in proportion:
relative error = |representation − actual| / |actual|
The idea is that errors only matter relative to how big the number is. If I wanted to buy a million dollar home (if I'm ever rich enough), I wouldn't really mind a difference of 10$.
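To make it concrete, here's the relative error of the 314.09 example from the ulps section:

```python
# Relative error = |representation - actual| / |actual|,
# reusing 314.09 as the representation of z = 314.1592653589.
representation = 314.09
z = 314.1592653589

abs_error = abs(representation - z)
rel_error = abs_error / abs(z)
print(abs_error)   # ~0.0693
print(rel_error)   # ~0.00022
```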
Converting 0.5 ulps to relative error
Let's say I have a number represented as d.dd…d × β^e. If this number is off by 0.5 ulps, the error is bounded by
(β/2) × β^(−p) × β^e
where (β/2) × β^(−p) is half an ulp of the significand.
Convincing you about the 0.5 ulps absolute error
Let me convince you. Let's say we're in base 10 (β = 10) and we had p = 3 with the number 9.97. If this came from properly rounding the actual number to the nearest representation, the actual number would have been somewhere between 9.965 (inclusive) and 9.975 (exclusive).
The limits of the actual number are bounded by an error of 0.005. This error is 0.5 ulps. It is also the same as (β/2) × β^(−p) × β^e, because (10/2) × 10^(−3) × 10^0 = 0.005.
This might seem obvious in base 10, but let's try something in base 2 (β = 2). Let's say we had 1.11 (1.75 in decimal) when β = 2 and p = 3. If this came from proper rounding, the real number before rounding could have been anywhere between 1.101 (inclusive) and 1.111 (exclusive), in other words within 0.001 of 1.11. That maximum error of 0.001 (ex. 1.111 − 1.110) is 0.1 ulps written in binary, which is 0.5 ulps in decimal. It is also the same as (β/2) × β^(−p) × β^e, because (2/2) × 2^(−3) × 2^0 = 0.125 in decimal = 0.001 in binary.
Therefore, if a number was rounded properly to its nearest representation, it has an error of at most 0.5 ulps.
In other words, 0.5 ulps always equals (β/2) × β^(−p) × β^e. This is the absolute error of 0.5 ulps.
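Quick check of that formula against both examples (half_ulp is just a throwaway helper name):

```python
# Half an ulp of a number d.dd...d x beta^e is (beta/2) * beta^(-p) * beta^e.
def half_ulp(beta, p, e):
    return (beta / 2) * beta**(-p) * beta**e

print(half_ulp(beta=10, p=3, e=0))   # 0.005  -> the 9.97 example
print(half_ulp(beta=2,  p=3, e=0))   # 0.125  -> 0.001 in binary, the 1.11 example
```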
Onto the relative error
Now, what is 0.5 ulps in relative error? Remember that relative error was absolute error divided by the actual number.
We just said that the absolute error when we have 0.5 ulps is (β/2) × β^(−p) × β^e. But the actual number could be anything.
Specifically, it could be anywhere in the range between β^e and β × β^e (since the significand runs from 1.00…0 up to just below β).
So, if 0.5 ulps was 0.001 in binary (0.125 in decimal), then p = 3 and e = 0. In that case the real number must have been between 2^0 and 2^1, which is 1 and 2 in decimal, or 1 and 10 in binary.
From our previous example, we can see that is true: 1.11 (1.75 in decimal) was in fact between 1 and 10 in binary.
Cool, so we set some bounds on what the actual number could have been, namely: at least β^e and less than β × β^e.
This means we can set bounds for the relative error of 0.5 ulps. So let's divide the absolute error by the bounds of the real number.
Upper bound of relative error (dividing by the smallest possible number, β^e):
((β/2) × β^(−p) × β^e) / β^e = (β/2) × β^(−p)
Lower bound of relative error (dividing by the largest possible number, β × β^e):
((β/2) × β^(−p) × β^e) / (β × β^e) = (1/2) × β^(−p)
Therefore,
(1/2) × β^(−p) ≤ relative error of 0.5 ulps ≤ (β/2) × β^(−p)
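Here's a tiny sketch of those bounds for a couple of bases (rel_error_bounds is a made-up helper, nothing official):

```python
# Bounds on the relative error that corresponds to 0.5 ulps.
def rel_error_bounds(beta, p):
    lower = 0.5 * beta**(-p)          # actual number near beta * beta^e
    upper = (beta / 2) * beta**(-p)   # actual number near beta^e
    return lower, upper

for beta, p in [(2, 3), (10, 3)]:
    lo, hi = rel_error_bounds(beta, p)
    print(beta, p, lo, hi, hi / lo)   # the ratio hi/lo is always beta (the wobble)
```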
Machine epsilon
The upper bound on the relative error of 0.5 ulps is called machine epsilon: ε = (β/2) × β^(−p). This is the largest relative error you can get from rounding to the nearest representation in a given base and precision.
A larger precision p, as expected, means a smaller relative error/machine epsilon. We also notice that the relative error of 0.5 ulps is bounded between ε/β and ε. These bounds differ by a factor of β, which we call the wobble.
Yeahhh… wobble baby wobble baby…
BTW, machine epsilon is such a cool name. I’m not judging if you name your dog or child machine epsilon.
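Side note: for the doubles your code actually uses (β = 2, p = 53), machine epsilon works out to about 1.1 × 10^(−16). One gotcha if you check this in Python: sys.float_info.epsilon follows a slightly different convention (it's the gap between 1.0 and the next float, β^(1−p), which is exactly twice our ε):

```python
import sys

beta, p = 2, 53                      # IEEE 754 double precision
machine_eps = (beta / 2) * beta**(-p)
print(machine_eps)                   # 1.1102230246251565e-16

# Python reports the gap between 1.0 and the next float, which is 2x this value.
print(sys.float_info.epsilon)        # 2.220446049250313e-16
print(sys.float_info.epsilon == 2 * machine_eps)   # True
```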
Relative errors with machine epsilon
Remember, machine epsilon is the upper bound for the relative error of rounding (i.e. of 0.5 ulps). So if the relative error we actually got is much lower, we can express it as a multiple of machine epsilon, like this:
relative error = k × ε
Let's do an example. If I had the number 3.14159 to represent with β = 10 and p = 3, I would have to round to 3.14. This would have an absolute error of 0.00159, or 0.159 ulps. For relative error, I do 0.00159 / 3.14159, which gives roughly 0.0005.
Now, to find the ratio, we must find the machine epsilon:
ε = (10/2) × 10^(−3) = 0.005
So… the ratio is:
k = 0.0005 / 0.005 = 0.1
So we say that the relative error is 0.1ε.
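Same example in a few lines of Python (expect some tiny float noise in the printed digits):

```python
# 3.14159 rounded with beta = 10, p = 3 gives 3.14.
beta, p = 10, 3
actual, rounded = 3.14159, 3.14

abs_err = abs(rounded - actual)              # ~0.00159
ulps    = abs_err / beta**(0 - p + 1)        # one ulp of 3.14 x 10^0 is 10^(-2)
rel_err = abs_err / actual                   # ~0.0005
eps     = (beta / 2) * beta**(-p)            # 0.005
print(ulps, rel_err, rel_err / eps)          # ~0.159, ~0.0005, ~0.1
```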
The Wobble
Get ready for things to wobble. First let me show you how ulps and relative error react to each other.
Using 1.0 to represent 1.04 in decimal (β = 10, p = 2) gives an error of 0.4 ulps and a relative error of about 0.038. The machine epsilon here is 0.05, which makes the relative error about 0.77ε.
Great! Hopefully this made sense so far.
Now, let's multiply our number by, say, 8. The actual number would be 8.32 while the calculated number would be 8.0. That is an error of 3.2 ulps, which is 8 times larger than before! However, our relative error is still about 0.038, which is the same 0.77ε.
Whoa! Our ulps increased, but our relative error was the same?
Yep. It turns out whenever you have a fixed relative error, your ulps can wobble by a factor of β.
On the other hand, whenever we have a fixed number of ulps (like we showed earlier with 0.5 ulps), the relative error has bounds which show it can also wobble by a factor of β.
So, the smaller the β, the smaller the wobble and the tighter the error bounds! Using binary keeps the wobble, and with it our error bounds, as small as possible.
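Here's that wobble example verified in Python (e = 0 for both 1.0 and 8.0, so one ulp is 0.1 in both cases):

```python
# Multiplying by 8 scales the ulps error by 8, but the relative error stays put.
beta, p = 10, 2
eps = (beta / 2) * beta**(-p)                       # 0.05

for actual, computed in [(1.04, 1.0), (8.32, 8.0)]:
    err = abs(actual - computed)
    ulp = beta**(0 - p + 1)                         # 0.1, since e = 0 for both
    print(err / ulp, err / actual, (err / actual) / eps)
    # -> ~0.4 ulps, rel err ~0.038, ~0.77 * eps
    # -> ~3.2 ulps, rel err ~0.038, ~0.77 * eps
```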
Contaminated digits
We now know that the number of ulps and the relative error ratio k can differ from each other by at most a factor of β, the wobble. As a result, either one can be used to estimate the number of contaminated digits (the number of digits that are wrong compared to the correct representation of the number):
contaminated digits ≈ log_β(n)
Here n is the number of ulps. n can also mean k, the ratio between the relative error and ε. It can mean either because they only differ by the wobble factor.
So if I had the decimal number 3.10 with p = 3 and it was trying to represent 3.1415, it would have an error of 4.15 ulps. The contaminated digits would be roughly log₁₀(4.15), which is roughly 0.61804809 digits.
LOL we can’t have partial digits! We’ll see that when pigs fly.
Looking at it visually, we can see that it is wrong in 1 digit, the last one, which is pretty close to what we got from our calculation.
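And the contaminated-digits estimate from that example, checked in Python:

```python
import math

# Contaminated digits ~ log_beta(n), where n is the error in ulps.
beta, p = 10, 3
actual, rounded = 3.1415, 3.10

ulps = abs(rounded - actual) / beta**(0 - p + 1)   # ~4.15 ulps
print(ulps, math.log(ulps, beta))                  # ~4.15, ~0.618 contaminated digits
```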
Guard digits
Let's subtract 2 values when β = 10 and p = 3: x = 1.01 and y = 0.993 (which is 9.93 × 10^(−1)). To subtract, y has to be shifted so its digits line up with x's:
  1.01
− 0.99
———————
  0.02
y becomes 0.99 and not 0.993 because we had to lose some data with p = 3 so that the two numbers could be subtracted from each other at the same exponent.
As you know, the actual answer is 0.017, but the answer ended up being 0.02. So the computed difference is off by 30 ulps (the error is 0.003, and one ulp at the scale of the true answer 1.70 × 10^(−2) is 0.0001)!
The relative error from this kind of subtraction (using only p digits) can be as large as β − 1. Let me show you why.
Take x = 1.00…0 and y = 0.ρρ…ρ where ρ = β − 1 (ex. in base 10 with p = 3, x = 1.00 and y = 0.999). Both x and y have p digits.
If I subtracted them, I should get the actual answer of β^(−p), but because we shift y to the right and lose a digit, we end up getting β^(−p+1).
The relative error is then (β^(−p+1) − β^(−p)) / β^(−p) = β − 1.
If our β = 2, that would make our relative error 1: the error is as big as the answer itself. In terms of ε (which is 2^(−p) when β = 2), it would mean the relative error is k × ε with k = 2^p, and so the contaminated digits would be log₂(2^p) = p. If p digits are contaminated, all of them are contaminated.
Was there a way we could have avoided some of this error? Okay, fine, there is…. otherwise I wouldn't have written about this section…
Just add a temporary extra digit during the subtraction. And let's call it a … wait for it … a guard digit. surprise surprise
With the guard digit, y gets to keep its last digit while it's being aligned, so the earlier subtraction comes out exact: 1.010 − 0.993 = 0.017. No error at all. A bit better than before.
Turns out a single guard digit bounds the relative error of the result to less than 2ε. I'm lazy, but if someone in the comments asks, I'll figure out why and try to explain it (it's in the linked paper).
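If you want to poke at this yourself, here's a toy simulation I threw together with Python's decimal module (subtract_p_digits is a made-up helper that mimics aligning y to x's exponent inside a register with a limited number of digit positions; it's a sketch of the idea, not how real hardware does it):

```python
import math
from decimal import Decimal, ROUND_DOWN

def subtract_p_digits(x, y, p, guard_digits=0):
    """Toy base-10 subtraction: y gets aligned to x's exponent, and only
    p (+ guard_digits) digit positions fit, so y's trailing digits fall off."""
    x, y = Decimal(x), Decimal(y)
    e = math.floor(x.copy_abs().log10())                   # exponent of x (0 for 1.01)
    quantum = Decimal(10) ** (e - (p - 1) - guard_digits)
    y_aligned = y.quantize(quantum, rounding=ROUND_DOWN)   # chop digits that don't fit
    diff = x - y_aligned
    if diff == 0:
        return diff
    e_diff = math.floor(diff.copy_abs().log10())
    return diff.quantize(Decimal(10) ** (e_diff - (p - 1)))  # round result to p digits

print(subtract_p_digits("1.01", "0.993", p=3))                  # 0.0200 -> off by 30 ulps
print(subtract_p_digits("1.01", "0.993", p=3, guard_digits=1))  # 0.0170 -> exact
```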
Benign and Catastrophic Cancellation
When we try to subtract two numbers that are really close to each other, many of their digits cancel out and become 0. We call this cancellation. Cancellation can be catastrophic or benign.
Often the later, far-right digits (the least significant digits) of the two numbers already carry rounding errors from earlier operations or earlier rounding, while the more accurate digits are at the front (the most significant digits). When we subtract two close numbers, the accurate front digits cancel each other out, and we're left with a result built from the less accurate trailing digits, which is much less accurate relative to its size. (Like when you calculate the determinant ad − bc after the products ad and bc have each been rounded.)
The catastrophic cancellation just exposes the rounding errors from prior operations.
Benign cancellation happens when you subtract numbers that have no rounding errors.
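Here's a tiny illustration of the difference (the subtraction itself is innocent; it just reveals rounding that already happened, or doesn't if there was none):

```python
# Catastrophic: the subtraction exposes the rounding error hiding in (1 + 1e-15).
tiny = 1e-15
x = 1 + tiny          # the true sum is not exactly representable, so it got rounded
print(x - 1)          # 1.1102230246251565e-15  (about 11% away from 1e-15!)

# Benign: both inputs are exact floats, so there is nothing bad to expose.
a, b = 1.5, 1.25      # both exactly representable in binary
print(a - b)          # 0.25, exact
```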
IEEE Standard
So the IEEE standard is a set of rules that many systems follow to ensure consistent floating point behavior. There are two IEEE standards here: IEEE 754 and IEEE 854. They both define a smaller and a larger floating point format, called single precision and double precision.
IEEE 754
This standard requires β = 2. It has single precision (p = 24) and double precision (p = 53). It also spells out exactly how the bits should be laid out.
In fact, here is a cool table that shows how IEEE 754 sets its floating point parameters:
Parameter        Single     Double
p                24         53
e_max            +127       +1023
e_min            −126       −1022
exponent width   8 bits     11 bits
format width     32 bits    64 bits
Exponents could be stored with a sign/magnitude split (one bit for the sign of the exponent, the remaining bits for its magnitude) or with two's complement. Interestingly, IEEE 754 uses neither: the exponent is stored in a biased form, so the stored value is always non-negative.
In fact, here is exactly how the bits are laid out to represent different kinds of values:
Exponent field        Fraction    Represents
e = e_min − 1         f = 0       ±0
e = e_min − 1         f ≠ 0       0.f × 2^e_min (denormalized numbers)
e_min ≤ e ≤ e_max     any f       1.f × 2^e (normal numbers)
e = e_max + 1         f = 0       ±infinity
e = e_max + 1         f ≠ 0       NaN
To represent 0, you use the smallest exponent (e_min − 1) with the fraction zeroed out. Infinity uses the largest exponent (e_max + 1), also with the fraction zeroed out. NaN is another special value (like when 0 is divided by 0, or opposite infinities are added); it is represented the same as infinity but with a non-zero fraction.
The fraction is the part of the significand after the leading digit (the significand being the full d.dd…d string).
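You can actually peek at these bits from Python for the doubles you use every day (double_bits is a helper I made up for this post; the exact NaN fraction bits can vary):

```python
import math
import struct

def double_bits(x):
    """Show the raw 64 bits of a double: 1 sign bit | 11 exponent bits | 52 fraction bits."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    bits = f"{raw:064b}"
    return f"{bits[0]} {bits[1:12]} {bits[12:]}"

print(double_bits(0.0))        # 0 00000000000 000...0   (everything zero)
print(double_bits(1.0))        # 0 01111111111 000...0   (biased exponent 1023)
print(double_bits(math.inf))   # 0 11111111111 000...0   (max exponent, zero fraction)
print(double_bits(math.nan))   # 0 11111111111 100...0   (max exponent, non-zero fraction)
```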
IEEE 854
On the other hand, this standard allows β = 2 or β = 10. However, it has no rules about how the bits should be laid out for single and double precision.
It allows base 10 because that is the way humans normally count. Base 2 is included because of its low wobble.
The Conclusion
Okay, I kinda rushed the last section. But overall, I wanted to say floating points can have a lot of imprecision. If you can avoid them, use integers instead.
In the case that you do use them, try to limit your wobbles and avoid catastrophic cancellations (there are ways you could do it sometimes by rearranging formulas).
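For example, here's the rearrangement trick from Goldberg's paper applied to the quadratic formula, where the nearly equal quantities are b and sqrt(b² − 4ac) when ac is tiny compared to b²:

```python
import math

# If b > 0, the root (-b + sqrt(b*b - 4*a*c)) / (2*a) subtracts two nearly equal
# numbers, so rewrite it as 2*c / (-b - sqrt(b*b - 4*a*c)), which has no cancellation.
a, b, c = 1.0, 1e8, 1.0            # the roots are about -1e8 and -1e-8

d = math.sqrt(b * b - 4 * a * c)
naive = (-b + d) / (2 * a)
rearranged = (2 * c) / (-b - d)
print(naive)        # about -7.45e-09: cancellation wiped out the real answer
print(rearranged)   # about -1e-08: essentially the true root
```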
Try to read the original paper yourself. This blog you are reading right now is a summary of just a fraction of Goldberg's original paper.
But as always, drop your questions and comments. And I’m out for now…
Peace
-absterdabster
lol