Floating-Point Representable Values

Excerpted from the document ARM Assembly Language (pages 211-218)

Chapter 9 Introduction to Floating-Point: Basics, Data Types,

9.6 Floating-Point Representable Values

All representable values have a single encoding in each floating-point format, but not all floating-point encodings represent a number. This is another difference between floating-point and integer representation. The IEEE 754-2008 specification defines five classes of floating-point encodings: normal numbers, subnormal numbers, zeros, NaNs, and infinities. Each class has some shared properties and some unique properties. Let's consider each one separately.
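Each class can be recognized from the exponent and fraction fields alone. The following Python sketch (the helper name classify is ours, not from this text) anticipates the encoding rules detailed in the sections below:

```python
def classify(bits):
    """Classify a 32-bit single-precision encoding into its
    IEEE 754-2008 class (with the quiet/signaling NaN split)."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0xFF:
        if fraction == 0:
            return "infinity"
        # bit 22, the most-significant fraction bit, selects the NaN type
        return "quiet NaN" if fraction & 0x400000 else "signaling NaN"
    if exponent == 0:
        return "zero" if fraction == 0 else "subnormal"
    return "normal"

print(classify(0x3F800000))  # normal (this is 1.0)
print(classify(0x80000000))  # zero (this is -0.0)
print(classify(0x00000001))  # subnormal
print(classify(0x7F800000))  # infinity
print(classify(0x7FC00000))  # quiet NaN
```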

9.6.1 Normal Values

We use the term normal value to define a floating-point value that satisfies the equation

F = (−1)^s × 2^(exp−bias) × 1.f    (9.1)

which we saw earlier in Section 9.4. In the space of normal values, each floating-point number has a single encoding, that is, an encoding represents only one floating-point value and each representable value has only one encoding. Put another way, no aliasing exists within the single-precision floating-point data type. It is possible for multiple encodings to represent a single value in decimal floating-point formats, but this is beyond the scope of this text. See the IEEE 754-2008 specification for more on this format.
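Equation 9.1 can be evaluated directly from the bit fields. A minimal Python sketch (decode_normal is a hypothetical helper name; the bias is 127 for single precision):

```python
def decode_normal(bits):
    """Evaluate F = (-1)^s * 2^(exp - bias) * 1.f for a single-precision
    encoding; valid for normal values only (0 < exp < 255)."""
    sign = (bits >> 31) & 0x1
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    assert 0 < exp < 0xFF, "normal encodings only"
    significand = 1.0 + frac / 2**23   # the implicit leading 1
    return (-1) ** sign * 2.0 ** (exp - 127) * significand

print(decode_normal(0x3F800000))  # 1.0
print(decode_normal(0xBF800000))  # -1.0
print(decode_normal(0x44800000))  # 1024.0
```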

Recall that a 32-bit signed integer has a range of −2,147,483,648 to 2,147,483,647 (±2.147 × 10^9). Figure 9.9 shows the range of the signed 32-bit integer, half-precision (16-bit), and single-precision (32-bit) floating-point data types for the normal range.

Notice that the ranges of the signed 32-bit integer and the half-precision data types are roughly the same; however, notice the much greater range available in the single-precision floating-point data type.

Remember, the tradeoff between the integer data types and the floating-point data types is in the precision of the result. In short, as we showed in Figure 9.8, the precision of a floating-point data value is a function of the exponent. As the exponent increases, the precision decreases, resulting in an increased numeric separation between representable values.

FIGURE 9.9 Relative normal range for signed 32-bit integer, half-precision floating-point, and single-precision floating-point data types. (The figure marks, on a shared number line, 6.10 × 10^−5 and 6.55 × 10^4 bounding half-precision, 2.15 × 10^9 for the signed 32-bit integer, and 1.18 × 10^−38 and 3.40 × 10^38 bounding single-precision.)

Table 9.3 shows some examples of normal data values for half-precision and single-precision formats. Note that each of these values may be made negative by setting the most-significant bit. For example, −1.0 is 0xBF800000. Using the technique shown in Section 9.4, try out some of these. You can check your work using the conversion tools listed in the References.
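Another way to check your work is with Python's struct module, which can produce both half-precision ('e') and single-precision ('f') encodings; the helper names below are our own:

```python
import struct

def f16_hex(x):
    """Half-precision (16-bit) encoding of x as a hex string."""
    return struct.pack(">e", x).hex().upper()

def f32_hex(x):
    """Single-precision (32-bit) encoding of x as a hex string."""
    return struct.pack(">f", x).hex().upper()

# Several of the Table 9.3 values, plus -1.0 with its sign bit set
for value in (1.0, 2.0, 0.5, 1024.0, 0.005, -1.0):
    print(value, f16_hex(value), f32_hex(value))
# e.g. 1.0 -> 3C00 and 3F800000; -1.0 -> BC00 and BF800000
```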

9.6.2 Subnormal Values

The inclusion of subnormal values* was an issue of great controversy in the original IEEE 754-1985 deliberations. When a value is non-zero and too small to be represented in the normal range, its value may be represented by a subnormal encoding.

These values satisfy Equation 9.2:

F = (−1)^s × 2^−126 × 0.f    (9.2)

Notice first that the exponent value is fixed at −126, one greater than the negative of the bias value. This value is referred to as emin, and is the exponent of the smallest normal representation. Also notice that the implicit 1.0 factor is missing, changing the significand range to [0.0, 1.0). The subnormal range extends the lower bound of the representable numbers by further dividing the range between zero and the smallest normal representable value into 2^23 additional representable values. If we look again at Figure 9.8, we see in the region marked Subnormals that the range between 0 and the minimum normal value is represented by n values, as in each exponent range of the normal values. The numeric separation in the subnormal range is equal to that of the normal values with the minimum normal exponent. The minimum value in the normal range for the single-precision floating-point format is 1.18 × 10^−38; the subnormal values extend the minimum representable magnitude down to about 1.40 × 10^−45. Be aware, however, that as an operand in the subnormal range decreases toward the minimum value, the number of significant digits decreases. In other words, the precision of subnormal values may be significantly less than the precision of normal values, or even of larger subnormal values. The range of subnormal values for the half-precision and single-precision data types is shown in Table 9.4. Table 9.5 shows some examples of subnormal data values. As with the normal values, each of these values may be made negative by setting the most-significant bit.

* The ARM documentation in the ARM v7-M Architecture Reference Manual uses the terms "denormal" and "denormalized" to refer to subnormal values. The ARM Cortex-M4 Technical Reference Manual uses the terms "denormal" and "subnormal" to refer to subnormal values.

TABLE 9.3
Several Normal Half-Precision and Single-Precision Floating-Point Values

Value            Half-Precision   Single-Precision
1.0              0x3C00           0x3F800000
2.0              0x4000           0x40000000
0.5              0x3800           0x3F000000
1024             0x6400           0x44800000
0.005            0x1D1F           0x3BA3D70A
6.10 × 10^−5     0x0400           0x38800000
6.55 × 10^4      0x7BFF           0x477FE000
1.175 × 10^−38   Out of range     0x00800000
3.40 × 10^38     Out of range     0x7F7FFFFF

EXAMPLE 9.3 Convert the value −4.59 × 10^−41 to single-precision.

Solution

The value is below the minimum threshold representable as a normal value in the single-precision format, but it is greater than the minimum representable subnormal value, so it falls within the subnormal range for the single-precision format.

Recalling our conversion steps above, we can use the same methodology for subnormal values so long as we recall that the exponent is fixed at the value 2^−126 and no implicit 1 is present.

TABLE 9.4
Subnormal Range for Half-Precision and Single-Precision

           Half-Precision   Single-Precision
Minimum    ±5.96 × 10^−8    ±1.40 × 10^−45
Maximum    ±6.10 × 10^−5    ±1.175 × 10^−38

TABLE 9.5
Examples of Subnormal Values for Half-Precision and Single-Precision

Value            Half-Precision   Single-Precision
6.10 × 10^−5     0x03FF           (normal range)
1.43 × 10^−6     0x0018           (normal range)
5.96 × 10^−8     0x0001           (normal range)
1.175 × 10^−38   (out of range)   0x007FFFFF
4.59 × 10^−41    (out of range)   0x00008000
1.40 × 10^−45    (out of range)   0x00000001
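The boundary values in these tables can be checked by decoding the encodings directly. A small Python sketch using the struct module (f32_from_hex is our own helper name):

```python
import struct

def f32_from_hex(hexstr):
    """Decode a single-precision encoding given as eight hex digits."""
    return struct.unpack(">f", bytes.fromhex(hexstr))[0]

print(f32_from_hex("00000001"))  # 2**-149, about 1.40e-45 (minimum subnormal)
print(f32_from_hex("007FFFFF"))  # about 1.175e-38 (maximum subnormal)
print(f32_from_hex("00800000"))  # 2**-126, the smallest normal value
```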

First, divide −4.59 × 10^−41 by 2^−126 to get −0.00390625, whose magnitude is equal to 2^−8. This leaves us with

−4.59 × 10^−41 = (−1)^1 × 2^−126 × 0.00390625

The resulting single-precision value is 0x80008000, shown in binary and hexadecimal in Figure 9.10.

The conversion to and from half-precision is done in an identical manner, but remember the subnormal exponent for the half-precision format is −14 and the format is only 16 bits.
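The arithmetic of Example 9.3 can be confirmed in Python; this sketch assumes the target value is exactly −2^−134 (≈ −4.59 × 10^−41):

```python
import struct

value = -(2.0 ** -134)                 # about -4.59e-41, a single-precision subnormal
bits = struct.pack(">f", value).hex().upper()
print(bits)                            # 80008000

# Reverse check: fixed exponent 2**-126, no implicit 1 (Equation 9.2)
fraction = 0x008000 / 2**23            # 0.00390625, which is 2**-8
print(-(2.0 ** -126) * fraction == value)  # True
```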

A computation that results in a subnormal value may set the Underflow flag and may signal an exception. We will address exceptions in a later chapter.

9.6.3 Zeros

It's odd to think of zero as anything other than, well, zero. In floating-point, however, zeros are signed; you may compute a function and see a negative zero as a result! Zeros are formed by a zero exponent and zero fraction. A critical bit of information here: if the fraction is not zero, the value is a subnormal, as we saw above. While numerous subnormal encodings are possible, only two zero encodings are possible: a positive zero with a sign bit of zero, and a negative zero with a sign bit of one. How is it possible to have a negative zero? There are several ways outlined in the IEEE 754-2008 specification. One way is to be in Round to Minus Infinity mode (we will consider rounding in Chapter 10) and sum two equal values that have opposite signs.
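Signed zeros are easy to observe. A brief Python sketch (double- and single-precision behave the same way here):

```python
import math
import struct

# The two zero encodings differ only in the sign bit
print(struct.pack(">f", 0.0).hex())   # 00000000
print(struct.pack(">f", -0.0).hex())  # 80000000

# Yet they compare equal, so the sign must be recovered another way
print(-0.0 == 0.0)                    # True
print(math.copysign(1.0, -0.0))       # -1.0
```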

EXAMPLE 9.4 Add the two single-precision values 0x3F80000C and 0xBF80000C with different rounding modes.

Solution

Let register s0 contain 0x3F80000C and register s1 contain 0xBF80000C. The two operands have the same magnitude but opposite sign, so the result of adding the two operands using the Cortex-M4 VADD instruction (we will consider this instruction in Chapter 11)

VADD s2, s0, s1

in each case is zero. But notice that the sign of the zero is determined by the rounding mode. We will consider rounding modes in detail in Chapter 10, but for now consider the four in Table 9.6. (The names give a clue to the rounding that is done. For example, roundTowardPositive always rounds up if the result is not exact. The rounding mode roundTiesToEven uses the method we learned in school: round to the nearest valid number, and if the result is exactly halfway between two valid numbers, pick the one that is even.)

FIGURE 9.10 Single-precision representation of −4.59 × 10^−41: sign = 1, exponent = 00000000, fraction = 00000000100000000000000, or 0x80008000.

Likewise, a multiplication of two values, one positive and the other negative, with a product too small to represent as a subnormal, will return a negative zero. And finally, the square root of −0 returns −0. Why bother with signed zeros? The negative zero is an artifact of the sign-magnitude format, but more importantly, the sign of zero indicates the direction of the operation, or the sign of the value before it was rounded to zero. This affords the numeric analyst information about the computation that is not obvious from an unsigned zero result, and it may be useful even if the result of the computation is zero.
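Both behaviors show up in ordinary double-precision arithmetic; a brief Python sketch:

```python
import math

# A product too small even for the subnormal range underflows to zero,
# and the sign of the zero records the sign the true result would have had
product = 1e-300 * -1e-300           # true magnitude 1e-600, far below range
print(product)                       # -0.0
print(math.copysign(1.0, product))   # -1.0

# The square root of -0.0 returns -0.0
print(math.sqrt(-0.0))               # -0.0
```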

The formats of the two zeros for half-precision and single-precision are shown in Table 9.7.

TABLE 9.6
Operations with Zero Result in Each Rounding Mode

Rounding Mode         Result
roundTiesToEven       0x00000000 (positive zero)
roundTowardPositive   0x00000000 (positive zero)
roundTowardNegative   0x80000000 (negative zero)
roundTowardZero       0x00000000 (positive zero)

TABLE 9.7
Format of Signed Zero in Half-Precision and Single-Precision

        Half-Precision   Single-Precision
+0.0    0x0000           0x00000000
−0.0    0x8000           0x80000000

9.6.4 Infinities

Another distinction between floating-point and integer values is the presence of an infinity encoding in the floating-point formats. A floating-point infinity is encoded with an exponent of all ones and a fraction of all zeros. The sign indicates whether it is a positive or negative infinity. While it is tempting to consider positive infinity as the value just greater than the maximum normal value, it is best considered a mathematical symbol and not a number. In this way computations involving infinity will behave as would be expected. In other words, any operation computed with an infinity value and a normal or subnormal value will return the infinity value. However, some operations are invalid, that is, there is no generally accepted result value for the operation. An example is multiplication of infinity by zero. We note that the IEEE 754-2008 specification defines the nature of the infinity in an affine sense, that is,

−∞ < all finite numbers < +∞
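A short Python sketch of these rules (double precision, but the principle is identical in single precision):

```python
import math

# Overflow past the largest finite value returns an infinity
big = 1.7e308
print(big * 10)                    # inf

# Infinity absorbs finite operands...
print(math.inf + big)              # inf
print(-math.inf < big < math.inf)  # True: the affine ordering

# ...but infinity times zero has no accepted result: it yields a NaN
print(math.isnan(math.inf * 0.0))  # True
```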

Recall from Section 7.2.2 that overflow in an integer computation produces an incorrect value and sets a hardware flag. To determine whether overflow occurred, a check on the flags in the status register must be made before you can take appropriate action. Multiplying two very large values that result in a value greater than the maximum for the floating-point format will return an infinity,* and further calculations on the infinity will indicate the overflow. While there is an overflow flag (more on this in Chapter 10), in most cases the result of a computation that overflows will indicate as much without requiring the programmer to check any flags. The result will make sense as if you had done the computation on paper. The format of the half-precision and single-precision infinities is shown in Table 9.8.

* In some rounding modes, a value of Maximum Normal will be returned. We will consider this case in our discussion of exceptions.

TABLE 9.8
Format of Signed Infinity in Half-Precision and Single-Precision

           Half-Precision   Single-Precision
+Infinity  0x7C00           0x7F800000
−Infinity  0xFC00           0xFF800000

9.6.5 Not-a-Numbers (NaNs)

Perhaps the oddest of the various floating-point classes is the not-a-number, or NaN. Why would a numerical computation method include a data representation that is "not a number"? A reasonable question, certainly. NaNs have several uses, and we will consider two of them. In the first use, a programmer may choose to return a NaN with a unique payload (the bits in the fraction portion of the format) as an indicator that a specific, typically unexpected, condition existed in a routine within the program. For instance, the programmer believes the range of data for a variable at a point in the program should not be greater than 100. But if it is, he can use a NaN to replace the value and encode the payload to locate the line or algorithm in the routine that caused the behavior.

Secondly, NaNs have historically found use as the default value put in registers or in data structures. Should the register or data structure be read before it is written with valid data, a NaN would be returned. If the NaN is of a type called a signaling NaN, the Invalid Operation exception would be signaled, giving the programmer another tool for debugging. This use would alert the programmer to the fact that uninitialized data was used in a computation, likely an error. The Motorola MC68881 and later 68K floating-point processors initialized the floating-point register file with signaling NaNs upon reset for this purpose. Both signaling NaNs, and a second type known as quiet NaNs, have been used to represent non-numeric data, such as symbols in a symbolic math system. These programs operate on both numbers and symbols, but the routines operating on numbers can't handle the symbols. NaNs have been used to represent the symbols in the program, and when a symbol is encountered it causes the program to jump to a routine written specifically to perform the needed computation on symbols rather than numbers. This way it is easy to intermix symbols and numbers, with the arithmetic of the processor operating on the numbers and the symbol routines operating whenever an operand is a symbol.

How does one use NaNs? One humorous programmer described NaNs this way:

when you think of computing with NaNs, replace the NaN with a "Buick" in a calculation.* So, what is a NaN divided by 5? Well, you could ask instead, "What is a Buick divided by 5?" You quickly see that it's not possible to reasonably answer this question, since a Buick divided by 5 is not-a-number, so we will simply return the Buick (unscratched, if we know what's good for us). Simply put, in an operation involving a NaN, the NaN, or one of the NaNs if both operands are NaN, is returned.

This is the behavior of an IEEE 754-2008-compliant system in most cases when a NaN is involved in a computation. The specification does not direct which of the NaNs is returned when two or more operands are NaN, leaving it to the floating-point designer to select which is returned.

A NaN is encoded with an exponent of all ones and a non-zero fraction. Note that an exponent of all ones with a zero fraction is an infinity encoding, so to avoid confusing the two representations, a NaN must not have a zero fraction. As we mentioned above, NaNs come in two flavors: signaling NaNs (sNaN) and non-signaling, or quiet, NaNs (qNaN). The difference is the value of the first, or most significant, of the fraction bits.

If the bit is a one, the NaN is quiet. Conversely, if the bit is a zero, the NaN is signaling, but only if at least one other fraction bit is a one. In the half-precision format, bit 9 is the bit that identifies the NaN type; in the single-precision format it's bit 22. The format of the NaN encodings for the half-precision format and the single-precision format is shown in Table 9.9.

Why two encodings? The signaling NaN will cause an Invalid Operation exception (covered in Section 10.3.4) to be set, while a quiet NaN will not. What about the fraction bits when a NaN is an operand to an operation? The specification requires that the fraction bits of a NaN be preserved, that is, returned in the NaN result, if it is the only NaN in the operation and if preservation is possible. (An example when it would not be possible to preserve the fraction is the case of a format conversion in which the fraction cannot be preserved because the final format lacks the necessary number of bits.) If two or more NaNs are involved in an operation, the fraction of one of them is to be preserved, but which is again the decision of the processor designer.

* Buick is a brand of General Motors vehicle popular in the 1980s.

The sign bit of a NaN is not significant, and may be considered as another payload bit. Several of the many NaN values are shown in Table 9.10, with payloads of 0x01 and 0x55. Notice how the differentiator is the most-significant fraction bit.
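At the bit level, the two NaN types differ only in bit 22. A Python sketch constructing single-precision NaN patterns with the payloads mentioned above (the helper names are ours; we stay at the bit level because passing a signaling-NaN pattern through hardware arithmetic may quietly set the quiet bit):

```python
EXP_ALL_ONES = 0xFF << 23   # 0x7F800000: exponent field of all ones
QUIET_BIT = 1 << 22         # bit 22, the most-significant fraction bit

def make_qnan(payload):
    """Quiet NaN: exponent all ones, bit 22 set, payload in the low bits."""
    return EXP_ALL_ONES | QUIET_BIT | payload

def make_snan(payload):
    """Signaling NaN: bit 22 clear, so the payload must be non-zero."""
    assert payload != 0, "an all-zero fraction would encode infinity"
    return EXP_ALL_ONES | payload

print(hex(make_qnan(0x55)))  # 0x7fc00055
print(hex(make_snan(0x55)))  # 0x7f800055
print(hex(make_qnan(0x01)))  # 0x7fc00001
```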
