#23
May 2004
FRANCE
24416 Posts
Hi,
Thanks to Iain Bethune, Mac OS X 32-bit and 64-bit binaries are now available. Regards, Jean
#24
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2
2²·23·103 Posts
Another test would be to run two 2-threaded tests in two separate folders. Optionally, assign optimal affinities using external system tools.
#25
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
2³·3·5·7² Posts
The speed gain from multiple threads is heavily dependent on the size of the FFT.
zero-padded 168K:
1 thread: 0.513
2 threads: 0.386 = 1.329x
3 threads: 0.268 = 1.914x
4 threads: 0.222 = 2.311x
4 separate 1-thread tests: 0.566/4 = 0.142

zero-padded 288K:
1 thread: 0.876
2 threads: 0.516 = 1.698x
3 threads: 0.361 = 2.427x
4 threads: 0.275 = 3.186x
4 separate 1-thread tests: 0.974/4 = 0.244
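For clarity, the "= N.NNNx" figures in these timings are just the single-thread time divided by the N-thread time. A minimal sketch (per-iteration timings in seconds copied from the post; the helper name is my own):

```python
# Per-iteration timings (seconds) from the post above.
timings_168k = {1: 0.513, 2: 0.386, 3: 0.268, 4: 0.222}
timings_288k = {1: 0.876, 2: 0.516, 3: 0.361, 4: 0.275}

def speedups(timings):
    """Speedup of each thread count relative to the single-thread time."""
    base = timings[1]
    return {n: round(base / t, 3) for n, t in timings.items()}

print(speedups(timings_168k))  # 2 threads give about a 1.329x speedup
```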
#26
Feb 2016
UK
432₁₀ Posts
One more variable: the CPU L3 cache size. In my previous testing, small FFTs did not benefit and may even be hindered by the overhead of managing the threads. The best gains may come when a single task substantially fills, but does not exceed, the L3 cache. Yes, this means you will have to consider the CPU model for optimal performance.
#27
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
2³×3×5×7² Posts
#28
Feb 2016
UK
432₁₀ Posts
Multiply the FFT size by 8 to get it in bytes, e.g. a 128K FFT = 1024 kB. That would be 256K for 2 MB, 512K for 4 MB, 768K for 6 MB, and 1024K for 8 MB. I would caution that this is only the FFT data itself, and I don't know how much else is needed for other things, so beware if you're at a limit. As an observation it seems to hold OK.

I'd also repeat that I didn't see good scaling at smaller FFT sizes previously, presumably because the overheads are proportionately more significant.
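The rule of thumb above can be sketched as follows (each FFT element taken as an 8-byte double; the helper names are my own, not anything from LLR):

```python
# Rule of thumb from the post: FFT data occupies roughly fft_size * 8 bytes.
def fft_bytes(fft_size_k):
    """Approximate FFT data size in bytes for an FFT of fft_size_k * 1024 elements."""
    return fft_size_k * 1024 * 8

def largest_fft_fitting(l3_bytes, sizes_k=(128, 256, 512, 768, 1024)):
    """Largest of these common FFT sizes whose data alone fits in an L3 of l3_bytes."""
    fitting = [s for s in sizes_k if fft_bytes(s) <= l3_bytes]
    return max(fitting) if fitting else 0

print(largest_fft_fitting(8 * 1024**2))  # an 8 MB L3 holds up to a 1024K FFT's data
```

As the post cautions, this counts only the FFT data, so real working sets will be somewhat larger.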
#29
Feb 2016
UK
2⁴·3³ Posts
Testing is ongoing; I'm currently using the SR5 project at PrimeGrid. Interestingly, I've had some invalid results on two Skylake systems: 2 invalid vs. 42 valid on an i5, and 2 invalid vs. 37 valid on an i7. These systems are overclocked and voltage-optimised, and to the best of my knowledge were stable under similar loads running one task per core. My speculation at this point is that running a multithreaded test stresses the CPU in a different way. I've reduced the clock for now to see if that makes the problem go away. If confirmed, extra care may be needed when optimising clocks versus voltages.

As a thought, the default Prime95 stress test runs a separate task per core; is there a setting to make it run all cores on the same task? I'll look into this later.
#30
Dec 2011
New York, U.S.A.
141₈ Posts
There's a serious problem with the multi-threaded code. Under some circumstances it seems to always produce an erroneous result: a seemingly random, but always wrong, residue.

The test case we're using is "-d -t4 -q64598*5^2318694-1". The expected result is "64598*5^2318694-1 is not prime. RES64: 47DC8FF7EF9583A8. OLD64: 5C3C1DD5662F6EF0". Running with just one thread yields the correct result; running the 64-bit Windows and 64-bit Linux FMA3 transforms with 4 threads produces incorrect results. It seems that if FMA3 is not used you get the correct result, but I'm not 100% certain that the FMA3 transform is what makes the difference between working and not working. What I know (or at least think I know) at this point:
- It always works if 1 thread is used.
- It always works on CPUs that don't support FMA3.
- It always works with a 32-bit Windows build (which doesn't use FMA3, for whatever reason).
- On machines where it fails, it fails 100% of the time under the same test conditions, but the residue produced is different in each run.
- I did not see this problem when testing small numbers, but this SR5 test case always produces an error.

It's being discussed at length at PrimeGrid here: http://www.primegrid.com/forum_threa...ap=true#104836
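When scripting many runs of this test case, one way to flag a bad run is to pull the RES64 token out of the result line and compare it to the known-good value. A minimal sketch (the parsing helper is my own; only the result-line format and expected residue come from the post):

```python
import re

# Known-good residue for 64598*5^2318694-1, quoted in the post above.
EXPECTED_RES64 = "47DC8FF7EF9583A8"

def extract_res64(result_line):
    """Return the 16-hex-digit RES64 value from an LLR result line, or '' if absent."""
    m = re.search(r"RES64:\s*([0-9A-F]{16})", result_line)
    return m.group(1) if m else ""

line = "64598*5^2318694-1 is not prime. RES64: 47DC8FF7EF9583A8. OLD64: 5C3C1DD5662F6EF0"
print(extract_res64(line) == EXPECTED_RES64)  # a matching run prints True
```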
#31
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
2³·3·5·7² Posts
#32
Feb 2016
UK
2⁴×3³ Posts
I ran -t2 tests on an i3 overnight and they were also bad, as on my Haswell/Skylake systems. The task is small enough to fit in the i5/i7 cache, but not the i3's.

I've eliminated hardware as far as I can, having run this on multiple systems, as have others. As mentioned in my earlier post, not every test does this, but this specific one seems easily repeatable.
#33
Sep 2012
New Jersey, USA
59 Posts
On 64598*5^2318694-1 with LLR 3.8.18, I can't get it to fail on my Xeon X3430s, my Sandy Bridge, or my Ivy Bridge. But on my Kaby Lake, running four threads always gives a wrong (and different every time) result, except when I add -oFFT_Increment=1, which yields the proper residue. That bumps the FFT size from 512K to 576K.

On a single thread it always works everywhere. Edit: I always have HT turned off.

Last fiddled with by JimB on 2017-02-11 at 16:39