C language study notes: the floating-point types float, double, and long double

Floating-point types can represent a much larger range of numbers than integer types, including numbers with fractional parts. Floating-point representation resembles scientific notation: a number is written as a decimal fraction multiplied by a power of 10. This notation is commonly used for very large or very small numbers.

float

The C standard requires that float be able to represent at least 6 significant figures and cover a range of at least 10^{-37} to 10^{+37}. The first requirement means that float must represent at least the first 6 digits of 33.333333, not that it must be accurate to 6 digits after the decimal point. The second requirement makes it convenient to express numbers such as the mass of the sun (2.0e30 kilograms), the charge of a proton (1.6e-19 coulombs), or the national debt. Systems usually use 32 bits to store a float. Of these, 8 bits hold the exponent and its sign, and the remaining 24 bits hold the non-exponent part (called the mantissa or significand) and its sign.

double

Another floating-point type provided by C is double (for double precision). The minimum range of double is the same as that of float, but double must represent at least 10 significant figures. Typically, double occupies 64 bits instead of 32. Some systems use all of the extra 32 bits for the non-exponent part, which increases the number of significant figures (that is, the precision) and reduces rounding errors. Other systems allocate some of those bits to the exponent in order to accommodate larger exponents, which increases the representable range. Either approach gives at least 13 significant figures, which exceeds the standard minimum.

long double

The third floating-point type in C is long double, intended for cases that need more precision than double. However, C only guarantees that long double is at least as precise as double.

Floating-point constants

The basic form of a floating-point constant is a signed series of digits that includes a decimal point, followed by e or E, followed by a signed exponent that gives the power of 10. For example:

-1.56E+12

2.76e-3

The plus sign can be omitted. You can leave out the decimal point (e.g., 2E5) or the exponent part (e.g., 19.28), but not both. You can omit the fractional part (e.g., 3.E16) or the integer part (e.g., .45E-6), but not both.

Here are some more valid floating-point constants:

3.14159

.2

4e16

.8E-5

100.

Do not use spaces inside a floating-point constant: 1.56 E+12 (wrong!)

By default, the compiler treats floating-point constants as type double. For example, suppose that some is a variable of type float and you write the following statement:

some = 4.0 * 2.0;

Typically, 4.0 and 2.0 are stored as 64-bit double values, the product is computed in double precision, and the result is then converted to the narrower float type. This gives the calculation higher precision, but it can slow the program down.

Adding an f or F suffix to a floating-point constant overrides the default and makes the compiler treat the constant as type float, as in 2.3f and 9.1E9F. An l or L suffix makes the constant type long double, as in 54.3l and 4.32L. It is better to use L, because the letter l is easily confused with the digit 1. A floating-point constant without a suffix is of type double.

Overflow and Underflow of Floating-Point Values

Suppose that the largest float value on your system is about 3.4E38 and you write the following code:

float toobig = 3.4E38 * 100.0f;

printf("%e\n", toobig);

What happens? This is an example of overflow: a calculation produces a number too large to be represented in the current type. The behavior used to be undefined, but C now specifies that toobig is assigned a special value that stands for infinity, and printf() displays that value as inf or infinity (or some variation on that).

The situation is more complicated when you divide very small numbers. Recall that a float is stored as an exponent part and a mantissa part. Consider the number that has the smallest possible exponent and also the smallest mantissa value that still uses all the available bits. This is the smallest number that float can represent to full precision. Now divide it by 2. Normally this would reduce the exponent, but here the exponent is already as small as it can get. So the computer instead shifts the bits of the mantissa to the right, vacating the first binary position and losing the last binary digit. As a decimal analogy, dividing a number with 4 significant digits, such as 0.1234E-10, by 10 gives 0.0123E-10. You get a result, but the last digit of the original significand has been lost. This situation is called underflow, and C refers to floating-point values that have lost the full precision of their type as subnormal. So dividing the smallest positive normal floating-point value by 2 yields a subnormal value. Dividing by a large enough value loses all the digits and leaves 0. The C library now provides functions that let you check whether a calculation produced a subnormal value.

There is another special floating-point value: NaN, or not-a-number. For example, when you pass a value to the asin() function, it returns the angle whose sine is that value. But a sine cannot be greater than 1, so the function is not defined for arguments greater than 1. In that case, the function returns the NaN value, which printf() displays as nan, NaN, or something similar.


Origin blog.csdn.net/weixin_51995147/article/details/128525859