#### Jeffrey Quesnelle

Computer systems often need to represent real numbers as a part of their operation, but how to encode these numbers in fixed, finite space is non-trivial. If the size of the variable in question is bounded then so-called fixed point arithmetic can be used, treating both sides of the decimal point as integer values. In general however it would be useful to have a more flexible method of representing values that can hold both $10$ and $10^{-9}$ in a small, fixed amount of memory.

The current base ten number system is a relatively new invention [1]. We may take for granted that there is a well-defined way of writing down numbers (with only terminating reals having non-unique representations) but the problem of representing an arbitrary real number in fixed space (say, in computer memory) raises several interesting tradeoffs between precision and accuracy; we give an extremely abbreviated overview of the most popular method of representing reals in computers: IEEE 754.

*IEEE Standard for Floating-Point Arithmetic* (IEEE 754-2008 or simply 754) is the internationally accepted method for performing operations on and transmitting approximations of reals on computer systems. The key property of 754 is that the decimal point “floats”, i.e. if a number is very large or very small the decimal point can be “moved around” so that most bits are used to represent the significant digits of the number. Compare this method to fixed point arithmetic which has a bias towards numbers closer to zero; in 8.8 fixed point math (8 bits for whole part, 8 bits for fractional part) the number $2^{7}$ is represented as $1000 \; 0000.0000 \; 0000$ and $2^8$ cannot be represented at all, even though both numbers contain only one “significant digit”!

Continue reading “Real number approximations in finite space with IEEE 754”