tinggo

Comparing floating point numbers


 

Comparing two floating point numbers is not a trivial job, and getting it wrong can cause a lot of avoidable bugs. Fortunately, I found a good article that explains how to compare floating point numbers efficiently, so I have copied it to my personal technical site for future reference.

I am very grateful to the author for his effort; the reference URL is given below. Note that, apparently because of a content length limit, some of the content is not displayed properly on javaeye.

(http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm)

 

Comparing floating point numbers

Bruce Dawson

Comparing for equality

Floating point math is not exact. Simple values like 0.2 cannot be precisely represented using binary floating point numbers, and the limited precision of floating point numbers means that slight changes in the order of operations can change the result. Different compilers and CPU architectures store temporary results at different precisions, so results will differ depending on the details of your environment. If you do a calculation and then compare the results against some expected value it is highly unlikely that you will get exactly the result you intended.

 

In other words, if you do a calculation and then do this comparison:

if (result == expectedResult)

then it is unlikely that the comparison will be true. If the comparison is true then it is probably unstable – tiny changes in the input values, compiler, or CPU may change the result and make the comparison be false.
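As a small illustration of the point above, here is a sketch that assumes IEEE 754 doubles. The function name is made up for this example, and volatile is used only to keep the compiler from folding the arithmetic away at compile time.

```c
#include <stdbool.h>

/* Hypothetical demo: returns true only if 0.1 + 0.2 compares
   exactly equal to 0.3 when evaluated in double precision. */
bool SumIsExactlyExpected(void)
{
    volatile double a = 0.1;
    volatile double b = 0.2;
    volatile double expectedResult = 0.3;
    volatile double result = a + b;   /* rounded and stored as a double */
    return result == expectedResult;
}
```

With IEEE doubles the sum is slightly larger than the double nearest to 0.3, so the comparison fails even though the arithmetic looks trivial.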

Comparing with epsilon – absolute error

Since floating point calculations involve a bit of uncertainty we can try to allow for this by seeing if two numbers are ‘close’ to each other. If you decide – based on error analysis, testing, or a wild guess – that the result should always be within 0.00001 of the expected result then you can change your comparison to this:

if (fabs(result - expectedResult) < 0.00001)

The maximum error value is typically called epsilon.
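Wrapped up as a function, the absolute error check might look like the sketch below. The name AlmostEqualAbsolute is an assumption made for this example and is not part of the original article.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: true when A and B differ by less than
   maxAbsoluteError (the "epsilon" described in the text). */
bool AlmostEqualAbsolute(float A, float B, float maxAbsoluteError)
{
    return fabsf(A - B) < maxAbsoluteError;
}
```

A call such as AlmostEqualAbsolute(result, expectedResult, 0.00001f) then replaces the direct == comparison.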

 

Absolute error calculations have their place, but they aren’t what is most often used. When talking about experimental error it is more common to specify the error as a percentage. Absolute error is used less often because if you know, say, that the error is 1.0 that tells you very little. If the result is one million then an error of 1.0 is great. If the result is 0.1 then an error of 1.0 is terrible.

 

With the fixed precision of floating point numbers in computers there are additional considerations with absolute error. If the absolute error is too small for the numbers being compared then the epsilon comparison may have no effect, because the finite precision of the floats may not be able to represent such small differences.

 

Let's say you do a calculation that has an expected answer of about 10,000. Because floating point math is imperfect you may not get an answer of exactly 10,000 - you may be off by one or two in the least significant bits of your result. If you're using 4-byte floats and you're off by one in the least significant bit of your result then instead of 10,000 you'll get +10000.000977. So we have:

 

float expectedResult = 10000;
float result = +10000.000977;   // The closest 4-byte float to 10,000 without being 10,000
float diff = fabs(result - expectedResult);

 

diff is equal to 0.000977, which is 97.7 times larger than our epsilon. So, our comparison tells us that result and expectedResult are not nearly equal, even though they are adjacent floats! Using an epsilon value 0.00001 for float calculations in this range is meaningless – it’s the same as doing a direct comparison, just more expensive.

 

Absolute error comparisons have value. If the range of the expectedResult is known then checking for absolute error is simple and effective. Just make sure that your absolute error value is larger than the minimum representable difference for the range and type of float you’re dealing with.
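One rough way to check whether an epsilon is large enough for a given range is sketched below, assuming IEEE 4-byte floats. It relies on the fact that FLT_EPSILON times the magnitude is an upper bound on the gap between adjacent normal floats near that magnitude. Both function names are hypothetical.

```c
#include <float.h>
#include <math.h>
#include <stdbool.h>

/* Upper bound on the gap between adjacent normal floats near 'magnitude'. */
float MaxGapNear(float magnitude)
{
    return FLT_EPSILON * fabsf(magnitude);
}

/* True if epsilon exceeds that bound, so the absolute error check
   can accept more than just an exact match. */
bool EpsilonIsMeaningful(float epsilon, float magnitude)
{
    return epsilon > MaxGapNear(magnitude);
}
```

For example, EpsilonIsMeaningful(0.00001f, 10000.0f) is false, which matches the 10,000 example above, while the same epsilon passes the check for values around 1.0.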

Comparing with epsilon – relative error

An error of 0.00001 is appropriate for numbers around one, too big for numbers around 0.00001, and too small for numbers around 10,000. A more generic way of comparing two numbers – one that works regardless of their range – is to check the relative error. Relative error is measured by comparing the error to the expected result. One way of calculating it would be like this:

relativeError = fabs((result - expectedResult) / expectedResult);

If result is 99.5, and expectedResult is 100, then the relative error is 0.005.

 

Sometimes we don’t have an ‘expected’ result, we just have two numbers that we want to compare to see if they are almost equal. We might write a function like this:

// Non-optimal AlmostEqual function - not recommended.
bool AlmostEqualRelative(float A, float B, float maxRelativeError)
{
    if (A == B)
        return true;
    float relativeError = fabs((A - B) / B);
    if (relativeError <= maxRelativeError)
        return true;
    return false;
}

The maxRelativeError parameter specifies what relative error we are willing to tolerate. If we want 99.999% accuracy then we should pass a maxRelativeError of 0.00001.

 

The initial comparison for A == B may seem odd – if A == B then won’t relativeError be zero? There is one case where this will not be true. If A and B are both equal to zero then the relativeError calculation will calculate 0.0 / 0.0. Zero divided by zero is undefined, and gives a NAN result. A NAN will never return true on a <= comparison, so this function will return false if A and B are both zero (on some platforms where NAN comparisons are not handled properly this function might return true for zero, but it will then return true for all NAN inputs as well, which makes this poor behavior to count on).
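To see the zero case in isolation, here is a sketch of the relative check with the initial A == B test deliberately left out. It assumes IEEE-conforming NAN handling (for instance, no fast-math compiler options), and the function name is invented for this example.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical variant of the relative check WITHOUT the A == B test. */
bool RelativeWithoutEqualityCheck(float A, float B, float maxRelativeError)
{
    float relativeError = fabsf((A - B) / B);   /* 0.0 / 0.0 gives NAN */
    return relativeError <= maxRelativeError;   /* false whenever relativeError is NAN */
}
```

Called with two zeroes this returns false even though the inputs are identical, which is the case the early return protects against.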

 

The trouble with this function is that AlmostEqualRelative(x1, x2, epsilon) may not give the same result as AlmostEqualRelative(x2, x1, epsilon), because the second parameter is always used as the divisor. An improved version of AlmostEqualRelative would always divide by the larger number. This function might look like this:

// Slightly better AlmostEqual function – still not recommended
bool AlmostEqualRelative2(float A, float B, float maxRelativeError)
{
    if (A == B)
        return true;
    float relativeError;
    if (fabs(B) > fabs(A))
        relativeError = fabs((A - B) / B);
    else
        relativeError = fabs((A - B) / A);
    if (relativeError <= maxRelativeError)
        return true;
    return false;
}
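The asymmetry of the first version can be shown with concrete numbers. The sketch below restates both approaches compactly under hypothetical names (DivideBySecond and DivideByLarger) so that it is self-contained; the values 100, 101 and the tolerance 0.00995 are chosen only for illustration.

```c
#include <math.h>
#include <stdbool.h>

/* Same idea as the first version in the text: always divides by B. */
bool DivideBySecond(float A, float B, float maxRelativeError)
{
    if (A == B)
        return true;
    return fabsf((A - B) / B) <= maxRelativeError;
}

/* Same idea as the second version: divides by the larger magnitude. */
bool DivideByLarger(float A, float B, float maxRelativeError)
{
    if (A == B)
        return true;
    float larger = (fabsf(B) > fabsf(A)) ? B : A;
    return fabsf((A - B) / larger) <= maxRelativeError;
}
```

DivideBySecond(100, 101, 0.00995) divides by 101 and gets about 0.0099, which passes, while DivideBySecond(101, 100, 0.00995) divides by 100 and gets 0.01, which fails. DivideByLarger divides by 101 in both cases, so the argument order no longer matters.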

Even now our function isn’t perfect. In general this function will behave poorly for numbers around zero. The positive number closest to zero and the negative number closest to zero are extremely close to each other, yet this function will correctly calculate that they have a huge relative error of 2.0. If you want to count numbers near zero but of opposite sign as being equal then you need to add a maxAbsoluteError check also. The function would then return true if either the absoluteError or the relativeError were smaller than the maximums passed in. A typical value for this backup maxAbsoluteError would be very small – FLT_MIN or less, depending on whether the platform supports subnormals.

// Slightly better AlmostEqual function – still not recommended
bool AlmostEqualRelativeOrAbsolute(float A, float B,
                float maxRelativeError, float maxAbsoluteError)
{
    if (fabs(A - B) < maxAbsoluteError)
        return true;
    float relativeError;
    if (fabs(B) > fabs(A))
        relativeError = fabs((A - B) / B);
    else
        relativeError = fabs((A - B) / A);
    if (relativeError <= maxRelativeError)
        return true;
    return false;
}
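The near-zero behaviour can be checked with a pair of tiny values of opposite sign. This is only a sketch: it assumes the platform supports subnormals (FLT_MIN / 4 is a subnormal float), and the function names are hypothetical restatements of the logic above.

```c
#include <float.h>
#include <math.h>
#include <stdbool.h>

/* Relative error measured against the operand with the larger magnitude. */
float RelativeErrorOf(float A, float B)
{
    float larger = (fabsf(B) > fabsf(A)) ? B : A;
    return fabsf((A - B) / larger);
}

/* Relative check with an absolute-error fallback for values near zero. */
bool NearlyEqualWithFallback(float A, float B,
                             float maxRelativeError, float maxAbsoluteError)
{
    if (fabsf(A - B) < maxAbsoluteError)
        return true;
    return RelativeErrorOf(A, B) <= maxRelativeError;
}
```

For A = FLT_MIN / 4 and B = -FLT_MIN / 4 the relative error is 2.0, so a purely relative check rejects the pair, while passing FLT_MIN as maxAbsoluteError accepts it.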

Comparing using integers

There is an alternate technique for checking whether two floating point numbers are close to each other. Recall that the problem with absolute error checks is that they don’t take into consideration whether there are any values in the range being checked. That is, with an allowable absolute error of 0.00001 and an expectedResult of 10,000 we are saying that we will accept any number in the range 9,999.99999 to 10,000.00001, without realizing that when using 4-byte floats there is only one representable float in that range – 10,000. Wouldn’t it be handy if we could easily specify our error range in terms of how many floats we want in that range? That is, wouldn’t it be convenient if we could say “I think the answer is 10,000 but since floating point math is imperfect I’ll accept the 5 floats above and the 5 floats below that value.”

 

It turns out there is an easy way to do this.

 

The IEEE float and double formats were designed so that the numbers are “lexicographically ordered”, which – in the words of IEEE architect William Kahan – means “if two floating-point numbers in the same format are ordered ( say x < y ), then they are ordered the same way when their bits are reinterpreted as Sign-Magnitude integers.”

 

This means that if we take two floats in memory, interpret their bit pattern as integers, and compare them, we can tell which is larger, without doing a floating point comparison. In the C/C++ language this comparison looks like this:

if (*(int*)&f1 < *(int*)&f2)

This charming syntax means take the address of f1, treat it as an integer pointer, and dereference it. All those pointer operations look expensive, but they basically all cancel out and just mean ‘treat f1 as an integer’. Since we apply the same syntax to f2 the whole line means ‘compare f1 and f2, using their in-memory representations interpreted as integers instead of floats’.

 

Kahan says that we can compare them if we interpret them as sign-magnitude integers. That’s unfortunate because most processors these days use twos-complement integers. Effectively this means that the comparison only works if one or more of the floats is positive. If both floats are negative then the sense of the comparison is reversed – the result will be the opposite of the equivalent float comparison. Later we will see that there is a handy technique for dealing with this inconvenience.
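The ordering behaviour described above can be tried out with the sketch below. It assumes a 32-bit IEEE float and twos-complement integers, and it copies the bytes with memcpy rather than using the pointer cast from the text. FloatBits and LessAsInt are names invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Copy the bytes of a float into a 32-bit integer
   (assumes float is a 32-bit IEEE type). */
int32_t FloatBits(float f)
{
    int32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Compare two floats using only their integer representations. */
bool LessAsInt(float f1, float f2)
{
    return FloatBits(f1) < FloatBits(f2);
}
```

LessAsInt(1.0f, 2.0f) and LessAsInt(-1.0f, 2.0f) agree with the float comparison, but LessAsInt(-1.0f, -2.0f) is also true, which is the reversed result for two negative floats.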

 

Because the floats are lexicographically ordered that means that if we increment the representation of a float as an integer then we move to the next float. In other words, this line of code:

(*(int*)&f1) += 1;

will increment the underlying representation of a float and, subject to certain restrictions, will give us the next float. For a positive number this means the next larger float, for a negative number this means the next smaller float. In both cases it gives us the next float farther away from zero.

 

We can apply this logic in reverse also. If we subtract the integer representations of two floats then that will tell us how close they are. If the difference is zero, they are identical. If the difference is one, they are adjacent floats. In general, if the difference is n then there are n-1 floats between them.
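Both directions can be sketched as follows, assuming 32-bit IEEE floats, finite inputs, and (for the subtraction) two floats of the same sign. The helper names are hypothetical, and memcpy stands in for the pointer casts shown in the text.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float's bytes as a 32-bit integer. */
int32_t FloatBits(float f)
{
    int32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Reinterpret a 32-bit integer's bytes as a float. */
float FloatFromBits(int32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Next representable float farther away from zero (finite input assumed). */
float NextAwayFromZero(float f)
{
    return FloatFromBits(FloatBits(f) + 1);
}

/* Number of representable steps from a to b (same sign assumed). */
int32_t StepsBetween(float a, float b)
{
    return FloatBits(b) - FloatBits(a);
}
```

For instance, NextAwayFromZero(2.0f) gives 2.00000024, NextAwayFromZero(-2.0f) gives -2.00000024, and StepsBetween(1.99999976f, 2.00000048f) is 4, meaning there are three floats between them.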

 

The chart below shows some floating point numbers and the integer stored in memory that represents them. It can be seen in this chart that the five numbers near 2.0 are represented by adjacent integers. This demonstrates the meaning of subtracting integer representations, and also shows that there are no floats between 1.99999988 and 2.0.

 

 

Float value     Representation (hexadecimal)   Representation (decimal)
+1.99999976     0x3FFFFFFE                     1073741822
+1.99999988     0x3FFFFFFF                     1073741823
+2.00000000     0x40000000                     1073741824
+2.00000024     0x40000001                     1073741825
+2.00000048     0x40000002                     1073741826

 

With this knowledge of the floating point format we can write this revised AlmostEqual implementation:

// Initial AlmostEqualULPs version - fast and simple, but
// some limitations.
bool AlmostEqualUlps(float A, float B, int maxUlps)
{
    assert(sizeof(float) == sizeof(int));
    if (A == B)
        return true;
    int intDiff = abs(*(int*)&A - *(int*)&B);
    if (intDiff <= maxUlps)
        return true;
    return false;
}

It’s certainly a lot simpler, especially when you look at all the divides and calls to fabs() that it’s not doing!

 

The last parameter to this function is different from the previous AlmostEqual. Instead of passing in maxRelativeError as a ratio we pass in the maximum error in terms of Units in the Last Place. This specifies how big an error we are willing to accept in terms of the value of the least significant digit of the floating point number’s representation. maxUlps can also be interpreted in terms of how many representable floats we are willing to accept between A and B. This function will allow maxUlps-1 floats between A and B.

 

If two numbers are identical except for a one-bit difference in the last digit of their mantissa then this function will calculate intDiff as one.

 

If one number is the maximum number for a particular exponent – perhaps 1.99999988 – and the other number is the smallest number for the next exponent – 2.0 – then this function will again calculate intDiff as one.

 

In both cases the two numbers are the closest floats there are.

 

There is not a completely direct translation between maxRelativeError and maxUlps. For a normal float number a maxUlps of 1 is equivalent to a maxRelativeError of between 1/8,000,000 and 1/16,000,000. The variance is because the accuracy of a float varies slightly depending on whether it is near the top or bottom of its current exponent’s range. This can be seen in the chart of numbers near 2.0 – the gap between numbers just above 2.0 is twice as big as the gap between numbers just below 2.0.
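The range quoted above can be checked numerically. The sketch below assumes a positive, finite, 32-bit IEEE float and measures the size of one ULP step relative to the value itself; RelativeStepAbove is a made-up name.

```c
#include <stdint.h>
#include <string.h>

/* Relative size of the step from f to the next float above it
   (positive, finite f assumed). */
double RelativeStepAbove(float f)
{
    int32_t bits;
    float next;
    memcpy(&bits, &f, sizeof bits);
    bits += 1;                          /* move to the adjacent float */
    memcpy(&next, &bits, sizeof next);
    return ((double)next - (double)f) / (double)f;
}
```

At 1.0, the bottom of an exponent range, one step is a relative error of 1/8,388,608. At 1.99999988, the top of the same range, it is very slightly more than 1/16,777,216. These are the precise values behind the rounded figures of 1/8,000,000 and 1/16,000,000.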

 

Our AlmostEqualUlps function starts by checking whether A and B are equal – just like AlmostEqualRelative did, but for a different reason that will be discussed below.

Compiler issues

In our last version of AlmostEqualUlps we use pointers and casting to tell the compiler to treat the in-memory representation of a float as an int. There are a couple of things that can go wrong with this. One risk is that int and float might not be the same size. A float should be 32 bits, but an int can be almost any size. This is certainly something to be aware of, but every modern compiler that I am aware of has 32-bit ints. If your compiler has ints of a different size, find a 32-bit integral type and use it instead.

 

Another complication comes from aliasing optimizations. Strictly speaking the C/C++ standard says that the compiler can assume that different types do not overlap in memory (with a few exceptions such as char pointers). For instance, it is allowed to assume that a pointer to an int and a pointer to a float do not point to overlapping memory. This opens up lots of worthwhile optimizations, but for code that violates this rule—which is quite common—it leads to undefined results. In particular, some versions of g++ default to very strict aliasing rules, and don’t like the techniques used in AlmostEqualUlps.

 

Luckily g++ knows that there will be a problem, and it gives this warning:

warning: dereferencing type-punned pointer will break strict-aliasing rules

There are two possible solutions if you encounter this problem. Turn off the strict aliasing option using the -fno-strict-aliasing switch, or use a union between a float and an int to implement the reinterpretation of a float as an int. The documentation for -fstrict-aliasing gives more information.
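A union-based variant might look like the sketch below. It is not the article's own code: the names FloatInt and AlmostEqualUlpsUnion are assumptions, int32_t is used as the 32-bit integral type, and the difference is computed in 64 bits so that the subtraction cannot overflow. Reading a union member other than the one last written is accepted in C, but C++ does not formally allow it, so treat this as a C example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical union for viewing a float's bits as a 32-bit integer. */
typedef union {
    float   f;
    int32_t i;
} FloatInt;

/* Same logic as the initial AlmostEqualUlps, with the same limitations,
   but without type-punned pointer casts. */
bool AlmostEqualUlpsUnion(float A, float B, int maxUlps)
{
    _Static_assert(sizeof(float) == sizeof(int32_t), "float must be 32 bits");
    if (A == B)
        return true;
    FloatInt a, b;
    a.f = A;
    b.f = B;
    /* Widen to 64 bits so the subtraction is always well defined. */
    int64_t intDiff = (int64_t)a.i - (int64_t)b.i;
    if (intDiff < 0)
        intDiff = -intDiff;
    return intDiff <= maxUlps;
}
```

With this version AlmostEqualUlpsUnion(1.99999988f, 2.0f, 1) is true, while AlmostEqualUlpsUnion(1.99999976f, 2.0f, 1) is false because the two floats are two steps apart.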

Complications

Floating point math is never simple. AlmostEqualUlps doesn’t properly deal with all the peculiar types of floating point numbers. Whether it deals with them well enough depends on how you want to use it, but an improved version will often be needed.

 

IEEE floating point numbers fall into a few categories:

  • Zeroes
  • Subnormals
  • Normal numbers
  • Infinities
  • NANs

Zeroes

AlmostEqual is designed to deal with normal numbers, and it is there that it behaves its best. Its first problem is when dealing with zeroes. IEEE floats can have both positive and negative zeroes. If you compare them as floats they are equal, but their integer representations are quite different – positive 0.0 is an integer zero, but negative zero is 0x80000000! (in decimal this is -2147483648). The chart below shows the positive and negative floats closest to zero, together with their integer representations.

 

 

Float value        Representation (hexadecimal)   Representation (decimal)
+4.2038954e-045    0x00000003                     3
+2.8025969e-045    0x00000002                     2
+1.4012985e-045    0x00000001                     1
+0.00000000        0x00000000                     0
-0.00000000        0x80000000                     -2147483648
-1.4012985e-045    0x80000001                     -2147483647
-2.8025969e-045    0x80000002                     -2147483646
-4.2038954e-045    0x80000003                     -2147483645
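The two zeroes are easy to inspect directly. The following sketch assumes 32-bit IEEE floats and twos-complement integers. The helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a float's bytes as a 32-bit integer. */
int32_t FloatBits(float f)
{
    int32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* True when two floats compare equal as floats but are stored
   with different bit patterns. */
bool EqualButDifferentBits(float a, float b)
{
    return a == b && FloatBits(a) != FloatBits(b);
}
```

EqualButDifferentBits(0.0f, -0.0f) is true: the float comparison treats the zeroes as equal, while FloatBits(-0.0f) is INT32_MIN and FloatBits(0.0f) is 0, as in the chart.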
