mersenneforum.org  

2017-02-08, 15:54   #23
Jean Penné
May 2004
FRANCE
24416 Posts
MacIntel binaries

Hi,

Thanks to Iain Bethune, Mac OS X 32-bit and 64-bit binaries are now available.
Regards,
Jean

2017-02-08, 16:08   #24
Batalov
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2
2²·23·103 Posts

Quote:
Originally Posted by mackerel
That equates to two threads being 1.84x faster, and four threads 2.96x faster. Again, that doesn't account for turbo clocks when fewer threads are run. Assuming ideal turbo speeds, with correction that goes to 1.9x for two threads, and 3.2x for four threads.

One additional test to complement these comparisons is to run four single-threaded tests in four folders simultaneously. Then you will be able to see whether four single-threaded tests are also slower than a single one on an otherwise idle machine, and by how much.
Another test would be to run two 2-threaded tests in two folders. Optionally, assign the best affinities using external system tools.
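
A minimal sketch of how the two layouts suggested above (four single-threaded runs, or two 2-threaded runs) could be prepared on Linux. It only builds the command lines and does not launch anything. Assumptions, none of which come from the post: the binary is called `./llr` and a copy sits in each working folder, the folders are named `run0`, `run1`, ..., the external affinity tool is `taskset`, and logical CPUs 0..3 are distinct physical cores. The `-d`, `-tN` and `-q` options are the ones quoted elsewhere in this thread, and the candidate expression is just an example.

```python
def build_plan(instances, threads_each, expr, binary="./llr", folder_prefix="run"):
    """Return a list of (working_folder, argv) pairs, one per instance.

    binary and folder_prefix are hypothetical names used for illustration.
    """
    plan = []
    for i in range(instances):
        first = i * threads_each
        # Give each instance its own contiguous block of logical CPUs.
        cores = ",".join(str(c) for c in range(first, first + threads_each))
        argv = ["taskset", "-c", cores, binary, "-d"]
        if threads_each > 1:
            argv.append(f"-t{threads_each}")  # thread count flag as used in this thread
        argv.append(f"-q{expr}")
        plan.append((f"{folder_prefix}{i}", argv))
    return plan


# Four single-threaded instances, then two 2-threaded instances.
for layout in (build_plan(4, 1, "64598*5^2318694-1"),
               build_plan(2, 2, "64598*5^2318694-1")):
    for folder, argv in layout:
        print(folder, " ".join(argv))
```

Each pair could then be started with `subprocess.Popen(argv, cwd=folder)` so that every instance writes its files into its own folder.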

2017-02-08, 16:33   #25
henryzz
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
2³·3·5·7² Posts

The speed gain from multiple threads is heavily dependent on the size of the FFT.

zero-padded 168K
1 thread   0.513
2 threads  0.386 = 1.329x
3 threads  0.268 = 1.914x
4 threads  0.222 = 2.311x
4 threads  0.566/4 = 0.142

zero-padded 288K
1 thread   0.876
2 threads  0.516 = 1.698x
3 threads  0.361 = 2.427x
4 threads  0.275 = 3.186x
4 threads  0.974/4 = 0.244
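
For reference, a small sketch that recomputes the speedups from the timings listed above and adds the parallel efficiency (speedup divided by thread count). The timings are copied from the post; the units are whatever the poster measured, and the efficiency column is an addition, not something the post reports.

```python
# Timings copied from the post (same units as reported there).
timings = {
    "168K": {1: 0.513, 2: 0.386, 3: 0.268, 4: 0.222},
    "288K": {1: 0.876, 2: 0.516, 3: 0.361, 4: 0.275},
}


def scaling(times):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread time."""
    base = times[1]
    return {n: (base / t, base / t / n) for n, t in times.items()}


for fft, times in timings.items():
    for n, (speedup, eff) in scaling(times).items():
        print(f"{fft} {n} thread(s): speedup {speedup:.3f}, efficiency {eff:.0%}")
```

On these numbers the larger FFT keeps a noticeably higher efficiency at four threads than the smaller one, which is the dependence on FFT size described above.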

2017-02-08, 17:01   #26
mackerel
Feb 2016
UK
432₁₀ Posts

Quote:
Originally Posted by Batalov
One additional test to complement these comparisons is to specifically run four single-threaded tests in four folders, simultaneously. Then you will be able to see if four single-threaded tests are also slower than a single one in an otherwise idle machine, and how much slower.
And another test would be to run two 2-threaded tests in two folders. Optionally, here, assign best affinities using external system tools.

Four parallel instances is kinda covered by the later testing. I don't intend to revisit these older machines for testing. Two 2-thread instances are unlikely to be interesting, as you'll lack adequate cache to hold them. Maybe there are some specific FFT sizes for some processors where it will help, but it won't in this case.

Quote:
Originally Posted by henryzz
The speed gain from multiple threads is heavily dependent on the size of the FFT.

One more variable: the CPU L3 cache size. In my previous testing, small FFTs did not benefit from multiple threads and may even be hindered by the overhead of managing them. The best benefit may come when a single task substantially fills but doesn't exceed the L3 cache. Yes, this means you will have to consider the CPU model for optimal performance.

2017-02-08, 18:32   #27
henryzz
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
2³×3×5×7² Posts

Quote:
Originally Posted by mackerel
4 parallel instances is kinda covered by the later testing. I don't intend to revisit these older machines for test. Two of 2-thread is unlikely to be interesting as you'll lack adequate cache to hold them. Maybe there are some specific FFT sizes for some processors where it will help, but it won't be in this case.

One more variable: the CPU L3 cache size. In my previous testing, small FFTs did not benefit and may be hindered by managing the threads. Best benefits may be when a single task substantially fills but doesn't exceed L3 cache size. Yes, this means you will have to consider the CPU model for optimal performance.

Both of my tests would have fit in the L3 cache on my i7 (8 MB). It would be useful if someone could work out which FFTs fit in 2, 4, 6 and 8 MB of cache.

2017-02-08, 21:16   #28
mackerel
Feb 2016
UK
432₁₀ Posts

Multiply the FFT size by 8 to get it in bytes, e.g. a 128k FFT = 1024kB. That would be 256k for 2MB, 512k for 4MB, 768k for 6MB, 1024k for 8MB. I would caution that this covers only the FFT itself; I don't know how much else is needed for other things, so beware if you're at a limit. As an observation, it seems to hold OK.

I'd also repeat, I didn't see good scaling at smaller FFT sizes previously, presumably as the overheads are proportionately more significant.
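
The rule of thumb above can be written out as a short sketch. It assumes 8 bytes per FFT element (presumably one double-precision value) and counts only the FFT data itself, so the real footprint may be somewhat larger. The function names are made up for illustration.

```python
BYTES_PER_ELEMENT = 8  # rule of thumb from the post; assumed to be one double per element


def fft_kib(fft_len_k):
    """Approximate FFT data size in kB (1024 bytes) for an FFT length given in K."""
    return fft_len_k * 1024 * BYTES_PER_ELEMENT // 1024


def max_fft_k_for_cache(cache_mb):
    """Largest FFT length (in K) whose data alone fits in cache_mb MB of cache."""
    return cache_mb * 1024 * 1024 // (BYTES_PER_ELEMENT * 1024)


print(fft_kib(128))  # the 128k example from the post
for mb in (2, 4, 6, 8):
    print(f"{mb}MB cache -> up to {max_fft_k_for_cache(mb)}k FFT")
```

Because other working data also competes for the cache, an FFT sitting exactly at one of these limits may not fit in practice.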

2017-02-10, 12:37   #29
mackerel
Feb 2016
UK
2⁴·3³ Posts

Testing is ongoing, and I'm currently using the SR5 project at PrimeGrid. Interestingly, I've had some invalid results on two Skylake systems: 2 invalid and 42 valid on an i5, and 2 invalid and 37 valid on an i7. These systems are overclocked and voltage optimised, and to the best of my knowledge were stable doing similar loads running one task per core. My speculation at this point is that running a multithreaded test stresses the CPU in a different way. I've reduced the clock for now to see if this makes it go away. If confirmed, then extra care may be needed when optimising clocks vs voltages.

As a thought, the default Prime95 stress test runs separate tasks per core. Is there a setting to change it to run all cores on the same task? I'll look into this later.
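
To put the counts above in proportion, a quick calculation of the invalid rate per system, assuming the totals are simply invalid plus valid results:

```python
# (invalid, valid) counts as reported in the post.
results = {"i5": (2, 42), "i7": (2, 37)}


def invalid_rate(invalid, valid):
    """Fraction of completed results that were invalid."""
    return invalid / (invalid + valid)


for cpu, (bad, good) in results.items():
    print(f"{cpu}: {invalid_rate(bad, good):.1%} invalid")
```

That works out to roughly 4.5% on the i5 and 5.1% on the i7.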

2017-02-11, 07:34   #30
AG5BPilot
Dec 2011
New York, U.S.A.
1418 Posts

There's a serious problem with the multi-threaded code. Under some circumstances it seems to always produce an erroneous result. The bad result is a seemingly random, but always wrong, residue.

The test case we're using is "-d -t4 -q64598*5^2318694-1". The expected result is "64598*5^2318694-1 is not prime. RES64: 47DC8FF7EF9583A8. OLD64: 5C3C1DD5662F6EF0".

Running with just one thread yields the correct result. Running the 64-bit Windows and 64-bit Linux FMA3 transform with 4 threads produces incorrect results.

It seems that if FMA3 is not used, you get the correct result, but I'm not 100% certain that it's the FMA3 transform that makes the difference between working and not working.

It seems to always work if 1 thread is used.

It seems to always work on CPUs that don't support FMA3.

It seems to always work if a 32-bit Windows build is used (which doesn't use FMA3 for whatever reason).

On machines where it's failing, under the same test conditions it fails 100% of the time, but the residue that's produced is different in each run.

I did not see this problem when testing small numbers, but it seems that this SR5 test case always produces an error.

That's what I know at this point, or at least think that I know. It's being discussed at length at PrimeGrid here: http://www.primegrid.com/forum_threa...ap=true#104836
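
A minimal sketch of how repeated runs of this test case could be checked automatically: it extracts the RES64 value from a result line in the format quoted above and compares it with the expected residue. How the result line is obtained (from the program's output or a results file) is left out, and the function names are made up for illustration.

```python
import re

EXPECTED_RES64 = "47DC8FF7EF9583A8"  # expected residue quoted in the post
RES64_RE = re.compile(r"RES64:\s*([0-9A-Fa-f]{16})")


def res64(line):
    """Return the 64-bit residue from a result line, or None if absent."""
    match = RES64_RE.search(line)
    return match.group(1).upper() if match else None


def residue_ok(line, expected=EXPECTED_RES64):
    """True only if the line carries exactly the expected residue."""
    return res64(line) == expected


good = ("64598*5^2318694-1 is not prime. "
        "RES64: 47DC8FF7EF9583A8. OLD64: 5C3C1DD5662F6EF0")
print(res64(good), residue_ok(good))
```

Since the failing runs reportedly give a different residue each time, collecting `res64()` over several runs and checking that they all match the expected value would separate good configurations from bad ones.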

2017-02-11, 11:51   #31
henryzz
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
2³·3·5·7² Posts

Quote:
Originally Posted by AG5BPilot
There's a serious problem with the multi-threaded code. Under some circumstances it seems to always produce an erroneous result. The bad result is a seemingly random, but always wrong, residue.

The test case we're using is "-d -t4 -q64598*5^2318694-1". The expected result is "64598*5^2318694-1 is not prime. RES64: 47DC8FF7EF9583A8. OLD64: 5C3C1DD5662F6EF0".

Running with just one thread yields the correct result. Running the 64-bit Windows and 64-bit Linux FMA3 transform with 4 threads produces incorrect results.

It seems that if FMA3 is not used, you get the correct result, but I'm not 100% certain that it's the FMA3 transform that makes the difference between working and not working.

It seems to always work if 1 thread is used.

It seems to always work on CPUs that don't support FMA3.

It seems to always work if a 32-bit Windows build is used (which doesn't use FMA3 for whatever reason).

On machines where it's failing, under the same test conditions it fails 100% of the time, but the residue that's produced is different in each run.

I did not see this problem when testing small numbers, but it seems that this SR5 test case always produces an error.

That's what I know at this point, or at least think that I know. It's being discussed at length at PrimeGrid here: http://www.primegrid.com/forum_threa...ap=true#104836

On all these systems, is multithreading the difference between the test having to use main memory and not (i.e. fitting in cache)?

2017-02-11, 12:23   #32
mackerel
Feb 2016
UK
2⁴×3³ Posts

I ran -t2 tests on i3 overnight and they were also bad on my Haswell/Skylake systems. The task is small enough to fit in the i5/i7 cache, but not the i3.

I've eliminated hardware as far as I can, having run this on multiple systems, as have others.

As mentioned in my earlier post, not every test is doing this, but this specific one seems easily repeatable.

2017-02-11, 16:25   #33
JimB
Sep 2012
New Jersey, USA
59 Posts

On 64598*5^2318694-1 with LLR 3.8.18, I can't get it to fail on my Xeon X3430s, my Sandy Bridge or my Ivy Bridge. But on my Kaby Lake, running four threads always gives a wrong result (different every time) unless I add -oFFT_Increment=1, which yields the proper residue. That bumps the FFT size from 512K to 576K.

On a single thread it always works everywhere.

Edit: I always have HT turned off.

Last fiddled with by JimB on 2017-02-11 at 16:39

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.