This is a good confirmation
[QUOTE=kracker;420768]I replaced M_PI with pi.. on my (windows)/Haswell system, I'm getting
[code]
cos((2 * pi) * 33 / 256.0) == 3fe610b7551d2cdf
Press enter to exit...
[/code][/QUOTE]
Thank you for your post. It is the same result as on Skylake. On Ivy the result is different. Debugging the cos function evaluation shows that on Ivy it branches to pure AVX code, while on Skylake it branches to asm code that uses the FMA instructions "vfmadd213sd" and "vfnmadd231sd". Based on your confirmation, the same happens on Haswell. Theoretically the end result should be the same no matter which processor is used; otherwise errors of this kind would be spread all over, in basically every program that simply evaluates a cos function. |
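For anyone who wants to repeat the comparison on their own machine, here is a minimal sketch in Python. It is an assumption that this mirrors the quoted test, which was evidently a C program; CPython's `math.cos` normally defers to the platform's C math library, so the last hex digit may differ between systems, which is the very effect under discussion. The helper name `double_bits` is made up for this example.

```python
import math
import struct

def double_bits(x: float) -> str:
    """Return the IEEE 754 binary64 bit pattern of x as 16 hex digits."""
    return struct.pack(">d", x).hex()

# Same expression as in the quoted output.
angle = (2 * math.pi) * 33 / 256.0
print(f"cos((2 * pi) * 33 / 256.0) == {double_bits(math.cos(angle))}")
```

Comparing the printed string between an FMA-capable machine and one without FMA shows whether the two math libraries agree to the last bit.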
Technical questions about the workings of Prime 95
To narrow down the cause of the error further, and perhaps rule out this rounding error (Prime 95 v27.9 works correctly on Haswell): does someone know the error tolerance of the multiplication done through FFT?
Basically, the FFT is used to multiply 2 big numbers A and B like this:
result = IFFT ( FFT(A, 0) <point.by.point.multiplication> FFT(B, 0) )
If A and B have n coefficients, the middle coefficient of the product is the sum of n intermediate coefficient products, and the FFT and IFFT are performed on size n * 2. So the error tolerance should account for this middle "n" term, and also for the log(n) bits of precision lost in computing the FFT (maybe twice, for the IFFT as well). Then a fixed tolerance should be added on top. Does someone know what this fixed tolerance is? If it is big enough, the multiplication will not be affected by a possible precision loss in the calculations. |
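The scheme described in the post above can be sketched in a few lines of Python. This is only an illustration of the zero-padded FFT multiplication as the poster describes it, not Prime95's actual code; the function names, the base-10 digit representation, and the textbook recursive FFT are all assumptions chosen to keep the example short.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)  # twiddle factor
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def multiply(a_digits, b_digits):
    """Multiply two little-endian digit lists via zero-padded FFTs.

    Returns the integer product coefficients (carries not propagated)
    and the largest distance of any raw output from an integer.
    """
    size = 1
    while size < len(a_digits) + len(b_digits):
        size *= 2
    fa = fft([complex(d) for d in a_digits] + [0j] * (size - len(a_digits)))
    fb = fft([complex(d) for d in b_digits] + [0j] * (size - len(b_digits)))
    prod = fft([x * y for x, y in zip(fa, fb)], invert=True)
    raw = [v.real / size for v in prod]  # the inverse transform needs the 1/size scaling
    coeffs = [round(v) for v in raw]
    max_err = max(abs(v - c) for v, c in zip(raw, coeffs))
    return coeffs, max_err

def digits(n):
    """Little-endian base-10 digits of a non-negative integer."""
    return [int(ch) for ch in reversed(str(n))]

def to_int(coeffs, base=10):
    """Evaluate the coefficient list at the base, which also resolves carries."""
    return sum(c * base**i for i, c in enumerate(coeffs))

a, b = 123456789, 987654321
coeffs, max_err = multiply(digits(a), digits(b))
print(to_int(coeffs) == a * b, max_err)
```

The value `max_err` is the quantity the tolerance question is about: as long as every raw output stays well within 0.5 of an integer, rounding recovers the exact product, and a one-bit difference in a twiddle factor does not change the result.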
Prime95 does not use any trigonometric functions (well it does, but via a library that produces 128-bit floating point values).
Even if it did, a change in the least significant bit would not affect prime95's results. |
Questions?
Are all the trig functions precomputed and stored in memory or do some of them get to be computed during runtime ? At least for the 768KB FFT.... And 768KB FFT is more exactly a 49152 Complex point FFT, right ?
|
[QUOTE=megabit8;420792]Are all the trig functions precomputed and stored in memory or do some of them get to be computed during runtime ? At least for the 768KB FFT.... And 768KB FFT is more exactly a 49152 Complex point FFT, right ?[/QUOTE]
All are precomputed. A 768K is a 786432 point real FFT. |
It starts to make sense...
Now I see,
And a point real occupies 8 bytes or 16 bytes ? In other words is the in-place memory 6MB or 12 MB ? Is more memory used intensely during the In-Place test ? Like a temp buffer of equal size or lower to copy back and forth the transformation ? A good test to exclude this error would be to export all the precomputed 128bit coefficients into a binary file from a Skylake processor and from say an Ivy Bridge processor or another without FMA. I can do this test if someone points me to the point where the precomputed buffer is filled. If the exports match then it is really a complex Intel Architecture problem. |
[QUOTE=megabit8;420778]Thank you for your post. It is the same result as on Skylake. On Ivy the result is different.[/QUOTE]
Too ignorant to even understand the empirical... To those watching this thread, we're (mostly) smarter than bricks here. Particularly those who have a few posts (and a bit of software) under their belts.... |
[QUOTE=megabit8;420794]Now I see,
And a point real occupies 8 bytes or 16 bytes ? In other words is the in-place memory 6MB or 12 MB ? Is more memory used intensely during the In-Place test ? Like a temp buffer of equal size or lower to copy back and forth the transformation ? A good test to exclude this error would be to export all the precomputed 128bit coefficients into a binary file from a Skylake processor and from say an Ivy Bridge processor or another without FMA. I can do this test if someone points me to the point where the precomputed buffer is filled. If the exports match then it is really a complex Intel Architecture problem.[/QUOTE]
It occupies 6MB and operates in place, with an additional 1.2MB of precomputed constants. All signs point to a complex Intel Architecture problem. If it weren't, the torture test would fail in the same place every time with the same error message. |
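The 6MB figure is consistent with the 786432 real points mentioned earlier in the thread if each point is stored as an 8-byte double. That element size is an inference from the numbers, not something stated outright in the reply; a quick check of the arithmetic:

```python
# Working-set arithmetic for the 768K FFT discussed above.
points = 768 * 1024          # 786432 real points
bytes_per_point = 8          # assumption: one IEEE 754 double per point
total_bytes = points * bytes_per_point
mib = total_bytes / (1024 * 1024)
print(points, total_bytes, mib)
```

With 16 bytes per point the same calculation would give 12MB, the other value asked about in the quoted question.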
Thank you for your compliment chalsall! :tu:
The empirical is logical: the cos evaluates differently depending on FMA support, which Haswell has. I was not trying to be ignorant; I was trying to help rationally in getting this issue sorted out. Anyway...

The questions in my last post may seem unimportant, but they are very important for a tech person. The coefficient comparison is important too: I do not trust that something operating on different data produces the same results, because of error propagation. And I know a bit about how software is developed.

Theoretically, with 2 doubles you have 53 x 2 = 106 bits of precision, but you lose 3 * log2(768*1024) = 58.8 bits to the transformations and the middle coefficient, plus 2 bits for tolerance, so about 61 bits are lost. You are left with 45 bits of precision for each real number, and this has to hold a square of something. So the numbers assigned to this double real vector should be less than 2^22 * sqrt(2).

Another thing: 1 bit is lost in this simple case, cos(PI*33/128), but there may be other values for which the rounding error is bigger. I have not tested this yet. These are all good reasons to compare the coefficients.

In the end I hope that Skylake is fine and that I and thousands of other people did not buy a processor that produces junk from time to time and can make the system freeze and applications crash. This is true for any processor/RAM/electronic device until it is proven to work correctly.

I am trying a step-by-step approach to sorting this issue out; otherwise, with an error thrown only once an hour, other factors could be influencing the calculation. We have the code and an error every hour. What's left is to make it happen in 1 second or less. How would you do that? Otherwise it is too hard to test: imagine Intel using a program to record all the calculations performed in an hour for comparison; that's hundreds of TB and very slow even with adequate hardware ... |
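The bit budget in the post above can be reproduced with a few lines of Python. This only redoes the poster's own estimate (two doubles of mantissa, three log2(n) terms, two bits of slack); it is not a statement of how Prime95 actually bounds its round-off error, and the variable names are made up for the example.

```python
import math

n = 768 * 1024                      # FFT length used in the post
mantissa_bits = 2 * 53              # "2 doubles" in the poster's estimate
lost = 3 * math.log2(n) + 2         # three log2(n) terms plus 2 bits of tolerance
remaining = mantissa_bits - math.ceil(lost)
max_input = 2 ** (remaining / 2)    # each value is squared, so halve the exponent

print(f"bits lost      ~ {lost:.1f}")
print(f"bits remaining = {remaining}")
print(f"input bound    ~ {max_input:.0f}  (2^22 * sqrt(2))")
```

Under these assumptions the bound comes out at about 5.9 million per element, matching the 2^22 * sqrt(2) figure in the post.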
I have built my own Skylake system in the days between Christmas and New Year's Day. The build is made of the following parts:
Motherboard: Asus Z170 Deluxe
Processor: Intel 6700K (Skylake)
RAM memory: Corsair Vengeance DDR4 4*4 GB 3200 MHz

After installing Ubuntu 15.10 I installed mprime version 28.7. The test worktodo.txt file is:
[CODE]
[Worker #1]
Test=N/A,14942209,67,1
[Worker #2]
Test=N/A,14942267,67,1
[Worker #3]
Test=N/A,14942293,67,1
[Worker #4]
Test=N/A,14942437,67,1
[/CODE]
I first let the program run this with the following two lines added to local.txt:
[CODE]
CpuSupportsFMA3=0
CpuNumHyperthreads=1
[/CODE]
This switches hyperthreading off and forces mprime to use the older AVX implementation of the 768K FFT. As expected, the processor finished this in about 12 hours:
[CODE]
M14942209 is not prime. Res64: 8587C9937E3BED22. We8: CA7381D0,2354169,00000000
M14942267 is not prime. Res64: C35562BC4F3511F3. We8: D9111948,9356811,00000000
M14942293 is not prime. Res64: 035EFC95F88CFC27. We8: 361EF8AE,3597260,00000000
M14942437 is not prime. Res64: 683A0DFFC5827CD8. We8: E69D5DB7,323098,00000000
[/CODE]
I then deleted both lines from local.txt, allowing two threads to run on each assignment and also allowing the new instruction set to be used. As expected, the results were the same as in the first run, but obtained slightly faster.

I then added the line CpuSupportsFMA3=0 to local.txt again, to force mprime to use the older FFT implementation while allowing hyperthreading on all 8 logical CPUs. So far it has run for three hours, doing nearly 25% of the work, without anything noticeable happening. I am writing this message on that machine.

Any thoughts? |
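Checking that two runs produced the same residues can be automated. The sketch below parses result lines in the format shown in the post and reports any exponent whose Res64 differs; the regular expression and function names are assumptions based only on those four lines. Since the post gives the output of the first run only (and states the second matched), the same text stands in for both runs here.

```python
import re

# Matches lines such as "M14942209 is not prime. Res64: 8587C9937E3BED22."
RESULT_RE = re.compile(r"M(\d+) is not prime\. Res64: ([0-9A-F]{16})\.")

def parse_residues(text):
    """Map each exponent to its Res64 string."""
    return {int(m.group(1)): m.group(2) for m in RESULT_RE.finditer(text)}

def mismatches(run_a, run_b):
    """Return the exponents whose residues differ or appear in only one run."""
    a, b = parse_residues(run_a), parse_residues(run_b)
    return sorted(e for e in a.keys() | b.keys() if a.get(e) != b.get(e))

first_run = """\
M14942209 is not prime. Res64: 8587C9937E3BED22. We8: CA7381D0,2354169,00000000
M14942267 is not prime. Res64: C35562BC4F3511F3. We8: D9111948,9356811,00000000
M14942293 is not prime. Res64: 035EFC95F88CFC27. We8: 361EF8AE,3597260,00000000
M14942437 is not prime. Res64: 683A0DFFC5827CD8. We8: E69D5DB7,323098,00000000
"""

# Stand-in: replace the second argument with the text from the other run.
print(mismatches(first_run, first_run))
```

An empty list means every exponent produced the same residue in both runs.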
[QUOTE=tha;420846]I have built my own Skylake system in the days between Christmas and New Year's Day. The build is made of the following parts:[/QUOTE]
Sweet! Lucky you! :smile:
[QUOTE=tha;420846]Any thoughts?[/QUOTE]
Try running one exponent which uses the 768K FFT across all the cores (both physical and virtual). Initially don't set affinity. Also, post the information requested of you in post #171 of this thread.

Note also that it has been shown that _some_ (possibly most) Skylake systems work fine. This whole exercise is to try to figure out whether there is a correlation among the many variables that explains why. |