What are floating point numbers?

Floating point is a way of representing real numbers in which a number is stored as a mantissa and an exponent ("point" rather than "comma", as is customary in English-speaking countries). In this representation a number has a fixed relative precision but a varying absolute precision. The representation used most often is the one standardized by IEEE 754. Arithmetic operations on floating-point numbers are implemented in computer systems both in hardware and in software.


Period or comma

Articles on the decimal separator list the English-speaking countries in which the fractional part of a number is separated from the integer part by a dot, which is why the terminology of those countries speaks of a floating point. In Russia the fractional part is traditionally separated by a comma, so the historically established Russian term, "floating comma numbers", denotes the same concept. Today both variants are perfectly acceptable in technical documentation and in Russian-language literature.

The term "floating point" comes from the fact that in the positional representation of a number the radix point (decimal in ordinary notation, binary in a computer) can land anywhere among the digits of the string, so its position must be recorded separately. Floating-point representation can therefore be viewed as a computer implementation of the exponential (scientific) notation of a number. Its advantage over fixed-point and integer formats is that the range of representable values grows enormously while the relative precision stays unchanged.

Example

If the point in a number is fixed, the number can be written in only one way. Suppose, for example, that six digits are allotted to the integer part and two to the fractional part: then only numbers of the form 123456.78 can be written. The floating-point format gives far more freedom with the same eight digits. If the programmer adds a two-digit field holding a decimal exponent from 0 to 16, the record takes ten digits in total (8 + 2) and admits a great many values.

A few of the values the floating-point format then allows: 12345678000000000000; 0.0000012345678; 123.45678; 1.2345678; and so on. This format even has its own unit of speed, or rather of the performance of a computing system: FLOPS (floating-point operations per second), the number of floating-point operations a computer performs per second. It is the fundamental unit for measuring the speed of a computing system.


Structure

A number is written in floating-point format as a sequence of required parts, since the record is exponential: real numbers are represented as a mantissa and an exponent (order). This makes very large and very small numbers far more convenient to read. The mandatory parts are the number being recorded (N), the mantissa (M), the sign of the exponent (p) and the exponent itself (n); the last two together form the characteristic of the number. Thus N = M · 10^(±n). This is how floating-point numbers are written; the examples vary.

1. Suppose you need to write one million without getting lost in the zeros. 1000000 is the ordinary arithmetic notation; to a computer it looks like 1.0 · 10^6. That is, "ten to the sixth power": three characters stand in for all six zeros. Here the difference in spelling between fixed-point and floating-point representation is immediately visible.

2. Even an unwieldy number such as 1435000000 (one billion four hundred thirty-five million) can be written just as simply: 1.435 · 10^9. With a minus sign, any number can be written the same way. This is how fixed-point and floating-point numbers differ from each other.

But those are large numbers; what about small ones? Just as easy.

3. For example, how do you denote one millionth? 0.000001 = 1.0 · 10^-6. Both writing and reading the number become much easier.

4. Something harder? Five hundred forty-six billionths: 0.000000546 = 546 · 10^-9. There it is. The range of representation of floating-point numbers is very wide.
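These examples can be checked directly in Python, whose "e" format code prints a number as a normalized mantissa and a decimal exponent (note that it normalizes 546 · 10^-9 to the one-digit-before-the-point form 5.460 · 10^-7):

```python
# The four examples above, rendered in Python's scientific notation.
values = [1_000_000, 1_435_000_000, 0.000001, 0.000000546]
for v in values:
    print(f"{v:>12} = {v:.3e}")
# 1000000 prints as 1.000e+06, 1435000000 as 1.435e+09,
# 1e-06 as 1.000e-06, and 5.46e-07 as 5.460e-07.
```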


The form

A number can be in normal or normalized form. The normal form always preserves the full precision of a floating-point number; its mantissa, sign aside, lies in the half-open interval [0, 1), i.e. 0 ⩽ a < 1. Outside normal form a number loses precision. The drawback of the normal form is that many numbers can be written in several ways, i.e. ambiguously. An example of different records of the same number: 0.0001 = 0.000001 · 10^2 = 0.00001 · 10^1 = 0.0001 · 10^0 = 0.001 · 10^-1 = 0.01 · 10^-2, and many more are possible. For this reason computer science uses a different, normalized form of notation, in which a decimal mantissa takes a value from one (inclusive) up to ten (exclusive), and likewise a binary mantissa takes a value from one (inclusive) to two (exclusive).

Thus 1 ⩽ a < 10 in decimal, and likewise 1 ⩽ a < 2 for binary floating-point numbers; this form of writing fixes every number except zero uniquely. The drawback is that zero itself cannot be represented this way, so computer formats reserve a special bit pattern for 0. In a normalized binary number the integer part (the high-order digit) of the mantissa is always 1, so it need not be stored: this is the implicit (hidden) unit used by IEEE 754. Positional number systems with a base greater than two (ternary, quaternary and others) do not have this property.
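As a small illustration (not part of the text above), Python's `math.frexp` splits a float into a mantissa and a binary exponent with the mantissa in [0.5, 1); a single shift converts that to the [1, 2) convention, making the implicit leading 1 visible:

```python
import math

# math.frexp returns (m, e) with x == m * 2**e and 0.5 <= |m| < 1.
# Doubling m and decrementing e gives the IEEE-style form 1.f * 2**(e-1),
# whose leading 1 is the "implicit unit" described in the text.
def normalize(x):
    m, e = math.frexp(x)     # x == m * 2**e, 0.5 <= m < 1
    return m * 2, e - 1      # x == (m*2) * 2**(e-1), 1 <= m*2 < 2

m, e = normalize(9.0)
print(m, e)                  # -> 1.125 3, since 1.125 * 2**3 == 9.0
```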


Real numbers

Floating point is not the only way to represent a real number, but it is a very convenient one: a compromise between range of values and precision, an analogue of exponential notation carried out on a computer. A floating-point number is a set of binary digits divided into a sign, an exponent (order) and a mantissa. In the most common format, IEEE 754, some of the bits encode the mantissa, others encode the exponent, and a single bit gives the sign of the number: zero if it is positive, one if it is negative. The exponent is written as an integer in biased form (a code with a shift), and the mantissa is stored in normalized form, its fractional part in the binary system.

The sign is a single bit that applies to the floating-point number as a whole. The mantissa and the exponent are integers which, together with the sign, make up the representation; the exponent is also called the order or the characteristic. Not every real number can be represented exactly in a computer; the rest are stored as approximations. A much simpler alternative would be a fixed-point real number, with the integer and fractional parts stored separately, say with X bits always allocated to the integer part and Y bits to the fractional part. But processor architectures do not support such a scheme directly, so preference is given to floating-point numbers.
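As an illustration of this layout, the three bit fields of an IEEE 754 double can be pulled apart with Python's `struct` module (1 sign bit, 11 exponent bits stored with a bias of 1023, 52 mantissa bits):

```python
import struct

# Reinterpret the 8 bytes of a double as a 64-bit integer, then slice
# out sign, biased exponent, and mantissa fraction.
def fields(x):
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]
    sign     = bits >> 63
    exponent = (bits >> 52) & 0x7FF      # biased: true exponent + 1023
    mantissa = bits & ((1 << 52) - 1)    # fraction; implicit 1 not stored
    return sign, exponent, mantissa

s, e, m = fields(-2.0)                   # -2.0 == -1.0 * 2**1
print(s, e - 1023, m)                    # -> 1 1 0
```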


Addition

Adding floating-point numbers is fairly simple in principle. An IEEE 754 single-precision number has a great many bits, so it is clearer to work through an example in the smallest practical floating-point representation: a one-byte format with one sign bit, a three-bit exponent and a four-bit mantissa. Take two numbers, X and Y.

Variable   Sign   Exponent   Mantissa
X          0      100        1110
Y          0      011        1000

The steps will be as follows:

a) Put both numbers in normalized form, writing the hidden unit out explicitly: X = 1.110 · 2^2 and Y = 1.000 · 2^0.

b) Addition can continue only once the exponents are equal, and for that the value of Y must be rewritten. The rewritten value is numerically equal to the normalized number, although in fact it is denormalized.

Compute the difference of the exponents: 2 − 0 = 2. Now shift the mantissa to compensate, that is, add 2 to the exponent of the second term while moving its point (hidden unit included) two places to the left. The result is 0.0100 · 2^2, equivalent to the previous value of Y; call it Y′.

c) Now add the mantissa of X and the adjusted mantissa Y′:

1.110 + 0.010 = 10.000

The exponent so far remains that of X, namely 2.

d) The sum obtained in the previous step has pushed a digit past the normalization position, so the exponent must be adjusted and the result renormalized. 10.000 has two bits to the left of the point; normalizing means moving the point one place to the left and increasing the exponent by 1, giving 1.000 · 2^3.

e) It remains to pack the result back into the one-byte format.

Sum     Sign   Exponent   Mantissa
X + Y   0      101        0000
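The whole walkthrough can be sketched in a few lines of Python. The representation here is a simplification of the packed bit format: each number is an (m, e) pair meaning m · 2^e, with m an integer mantissa of at most four significant bits (the names and format width are illustrative choices, not part of any standard):

```python
# Keep the mantissa at exactly `bits` significant bits, adjusting the
# exponent to compensate (step d of the walkthrough).
def renorm(m, e, bits=4):
    while m.bit_length() > bits:
        m >>= 1
        e += 1
    while 0 < m.bit_length() < bits:
        m <<= 1
        e -= 1
    return m, e

def fp_add(x, y):
    (mx, ex), (my, ey) = x, y
    if ex < ey:                      # make x the larger-exponent operand
        (mx, ex), (my, ey) = (my, ey), (mx, ex)
    my >>= ex - ey                   # step b: align exponents (may lose bits)
    return renorm(mx + my, ex)       # steps c-d: add, then renormalize

# X = 1.110 * 2**2 = 7, written as 14 * 2**-1;
# Y = 1.000 * 2**0 = 1, written as  8 * 2**-3.
print(fp_add((14, -1), (8, -3)))     # -> (8, 0), i.e. 1.000 * 2**3 = 8
```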

Conclusion

As you can see, adding such numbers is not too difficult; it does not matter that the point floats. The only extra work is reducing the number with the smaller exponent to the scale of the one with the larger exponent (in the example above, Y to X) and then restoring the status quo, i.e. compensating by moving the mantissa's point. After the addition itself one more complication is quite possible: renormalization and truncation of bits when their number no longer fits the format of the result.


Multiplication

The binary number system offers two ways to multiply: multiplication can start from the least significant bits of the multiplier or from the most significant ones. In both cases a series of operations successively accumulates partial products, and these additions are controlled by the bits of the multiplier. Where a multiplier digit is a one, the multiplicand is added into the sum of partial products with the corresponding shift; where the digit is a zero, nothing is added.

When just two numbers are multiplied, the product can contain up to twice as many digits as each factor, which for large numbers is already a great deal; when several numbers are multiplied, the product may not fit at all. Since the number of digits in any digital machine is finite, the adders are limited to at most twice the operand width, and this limit inevitably introduces an error into the product. In large volumes of computation the errors pile up and the overall error grows considerably. The only way out is to round the multiplication results, so that the error of the product alternates in sign. During multiplication the result can run off the digit grid, but only at the low-order end, because of the restriction imposed on numbers represented in fixed-point form.
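A minimal sketch of floating-point multiplication in Python, with each number an (m, e) pair meaning m · 2^e and an illustrative four-bit mantissa: multiply the mantissas, add the exponents, then truncate the double-width product back to the mantissa width (the truncation is exactly where the rounding error described above enters):

```python
# Multiply mantissas, add exponents, then cut the product (up to 2*bits
# wide) back down to `bits` significant bits by dropping low-order bits.
def fp_mul(x, y, bits=4):
    (mx, ex), (my, ey) = x, y
    m, e = mx * my, ex + ey
    while m.bit_length() > bits:
        m >>= 1                      # truncation: source of rounding error
        e += 1
    return m, e

# 1.100 * 2**1 (= 3) times 1.010 * 2**0 (= 1.25); exact product 3.75.
print(fp_mul((12, -2), (10, -3)))    # -> (15, -2), i.e. 15 * 2**-2 = 3.75
```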

Some explanation

It is better to start from the beginning. The most common way to write a number is as a string of digits forming an integer, with the point implied at the very end. The string can be of any length, with the point standing wherever needed to separate the integer part from the fractional part. In a fixed-point format the position of the point is fixed by the system. Exponential notation instead uses the standard normalized representation of a number, a · q^n. Here a is the mantissa; as was just said, 0 ⩽ a < q. Further, n is an integer, the exponent, and q is also an integer, the base of the number system (in writing, most often 10). The mantissa keeps the point right after its first nonzero digit, and the rest of the record carries the remaining information about the value of the number.

A floating-point number looks very much like a number written in ordinary exponential notation, except that the exponent and the mantissa are stored separately. The mantissa is again in normalized form, with the point fixed at the first significant digit. The floating point is used mainly inside the computer, that is, in electronic representation, where the system is binary rather than decimal, and where the mantissa may even be denormalized by moving the point: there it stands before the first digit rather than after it, so there may be no integer part at all. For example, hand the decimal nine over to the binary system for temporary use. With a floating point its mantissa is written as +1001000…0 and its exponent as +0…0100, i.e. 0.1001 · 2^4. The decimal system, however, cannot perform calculations as complex as those the binary system makes possible with the floating-point form.
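The worked example is easy to verify in Python: a mantissa of 0.1001 (binary) times two to the power 100 (binary, i.e. 4) is indeed nine:

```python
# 0.1001 in binary is 1001 (= 9) scaled down by 2**4, i.e. 0.5625.
mantissa = int('1001', 2) / 2**4
exponent = int('100', 2)             # binary 100 == decimal 4
print(mantissa * 2**exponent)        # -> 9.0
```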


Long arithmetic

Computers carry built-in software packages in which the memory allotted to the mantissa and to the exponent is set programmatically and limited only by the machine's memory. This is what long arithmetic looks like: ordinary operations on numbers performed by the computer. They are the same as always — subtraction and addition, division and multiplication, elementary functions and root extraction — only the numbers are entirely different: their width can greatly exceed the length of a machine word. Such operations are implemented not in hardware but in software, although the basic hardware is used heavily for work on numbers of much smaller magnitude. There is also arithmetic in which the length of numbers is limited only by the amount of memory: arbitrary-precision arithmetic. Long arithmetic is used in many areas.

1. In code for processors and microcontrollers of low bit width (ten-bit values from a converter, eight-bit registers), the native word is clearly too small to process the output of an analog-to-digital converter, so long arithmetic is indispensable.

2. Long arithmetic is also used in cryptography, where the result of an exponentiation or a multiplication must be exact for numbers on the order of 10^309. Integer arithmetic is performed modulo m, a large natural number that is not necessarily prime.

3. Software for financiers and mathematicians cannot do without long arithmetic either, since it is the only way to verify paper calculations by computer with high accuracy, using arbitrarily long floating-point widths. Engineering and scientific work, by contrast, rarely needs software-precision calculation, because it is very hard to enter the input data without errors, and those errors are usually far larger than the rounding in the results.
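Python's built-in integers are already arbitrary-precision, which makes both the 10^309-scale magnitudes and the modular arithmetic mentioned above easy to demonstrate (the particular base and modulus below are arbitrary choices for illustration):

```python
# 2**1024 is just above 1.797e308 — a 309-digit number, far beyond any
# machine word; Python computes it exactly.
big = 2 ** 1024
print(len(str(big)))        # -> 309

# Cryptographic-style modular exponentiation: pow(a, b, m) computes
# a**b mod m without ever forming the full power.
print(pow(7, 100, 13))
```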

Fighting Errors

In operations on numbers whose point floats, it is very hard to estimate the error of the results; no mathematical theory satisfying everyone has yet been invented that would solve this problem. Errors in integer arithmetic, by contrast, are easy to assess, and an obvious way to get rid of the inaccuracy lies on the surface: use only fixed-point numbers. Financial programs, for example, are built on this principle. Their task is simpler, though: the required number of digits after the decimal point is known in advance.

Other applications cannot limit themselves in this way, because they must work with both very small and very large numbers. So one always assumes that inaccuracies are possible and rounds the output accordingly; moreover, automatic rounding is often insufficient, and the rounding must be specified explicitly. The comparison operation is especially dangerous in this respect: here even estimating the size of future errors is extremely difficult.
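A short Python demonstration of these points: binary floating point cannot store 0.1 exactly, so a naive comparison fails, while a tolerant comparison or the fixed-point `decimal` module behaves as the financial-style code described above expects:

```python
import math
from decimal import Decimal

# The classic rounding surprise: 0.1 and 0.2 are binary approximations,
# so their sum is not exactly 0.3.
print(0.1 + 0.2 == 0.3)                    # -> False
print(0.1 + 0.2)                           # -> 0.30000000000000004

# Remedy 1: compare with a tolerance instead of exact equality.
print(math.isclose(0.1 + 0.2, 0.3))        # -> True

# Remedy 2: decimal fixed-point, with the digit count known in advance.
print(Decimal('0.10') + Decimal('0.20'))   # -> 0.30
```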

