mersenneforum.org  

2012-04-07, 11:33   #1
Unregistered
2×37×53 Posts

Single LL Question

Could an LL test be split across multiple local computers, the goal being to speed up computation of a single LL test for large exponents? Is it feasible that current home or small server tech designed to minimise latency between machines would be good enough to allow this?

I don't know how the LL test is split across multiple cores; I guess the multiplication must be split somehow. Can the work be split into an arbitrary number of pieces? Is there an optimum number of pieces or piece size (perhaps dependent on p and/or CPU architecture)? And would the optimum piece count for large exponents be high enough to suggest that a multi-computer LL test might be worthwhile?

I realise that there may be many problems with splitting the workload across cores which aren't tightly in sync, particularly for something as highly tuned and latency-sensitive as an LL test probably is. But as I don't know anything for sure and can only guess, I thought asking might be a good idea :)

2012-04-07, 15:20   #2
bcp19
Oct 2011
2A7₁₆ Posts

No. The LL test is a long chain of modular squarings: each iteration squares the previous residue, subtracts 2, and reduces the result modulo 2^p - 1, so the iterations must be performed serially. You could not, for example, cut the test into 4 sections, run them separately, and then put them together, since you could not start section 2 without the end result of section 1.
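
To make the serial dependency concrete, here is a minimal Python sketch of the Lucas-Lehmer recurrence. The function name `lucas_lehmer` is chosen for illustration, and this is not Prime95's code, which uses FFT-based squaring rather than Python's built-in big integers. Each pass through the loop needs the `s` produced by the previous pass, so the p - 2 iterations cannot be handed out to different machines.

```python
def lucas_lehmer(p):
    """Return True if 2**p - 1 is prime. Assumes p is an odd prime."""
    m = (1 << p) - 1          # the Mersenne number being tested
    s = 4                     # starting value of the sequence
    for _ in range(p - 2):
        # The new s depends on the old s: this is the serial bottleneck.
        s = (s * s - 2) % m
    return s == 0


if __name__ == "__main__":
    for p in (3, 5, 7, 11, 13):
        print(p, lucas_lehmer(p))
```

For p = 7 the loop runs five times and ends at 0, so 2^7 - 1 = 127 is prime; for p = 11 the final value is non-zero, which matches 2047 = 23 × 89.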

2012-04-07, 15:38   #3
emily
Feb 2012
Athens, Greece
47 Posts

You can, however, pause an LL test and take it from one computer to another so that you can continue the same test when you upgrade hardware or change computers.

2012-04-07, 16:27   #4
zanmato
Apr 2012
10₁₀ Posts

As I understand it, an LL test can currently be run multi-core because the FFT used in the multiplication can be run multi-core. I know iterations cannot be performed out of order or without the result of the previous iteration. In cases where an iteration can be done entirely in cache, is it right to think that any external communication (from this CPU to RAM or anywhere else) would make it slower no matter what? For cases which cannot be done wholly in cache (do such cases exist?), could a multi-CPU setup potentially help?

I am the OP, please excuse my ignorance
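
For readers wondering where the parallelism inside one iteration comes from, below is a rough pure-Python sketch of squaring an integer through an FFT. It is only an illustration under simplifying assumptions: the names `fft` and `square_via_fft` and the base-1000 digit split are made up here, and production code such as Prime95 uses a far more optimized transform. The point is that the two half-size transforms inside `fft` do not depend on each other, so the work inside a single squaring can be shared between cores even though successive squarings cannot.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    # These two half-size transforms are independent of each other,
    # which is where multi-core implementations find parallel work.
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def square_via_fft(x, base=1000):
    """Square a non-negative integer using an FFT-based convolution."""
    digits = []
    while x:
        x, d = divmod(x, base)
        digits.append(d)
    if not digits:
        return 0
    n = 1
    while n < 2 * len(digits):   # room for the full product
        n *= 2
    freq = fft([complex(d) for d in digits] + [0j] * (n - len(digits)))
    conv = fft([v * v for v in freq], invert=True)
    # Undo the FFT scaling, round away float noise, and let Python's
    # big integers absorb the carries.
    return sum(round(v.real / n) * base**i for i, v in enumerate(conv))


if __name__ == "__main__":
    x = (1 << 127) - 1
    print(square_via_fft(x) == x * x)
```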

2012-04-07, 16:40   #5
bcp19
Oct 2011
7·97 Posts

If you have a multi-core system, you can run the benchmark program, observe the timings for 1, 2, 3, etc. cores, and see for yourself. Generally, though, you see a smaller benefit from each added core, as shown by these timings from one of my systems:

1024K FFT on 1 core = 22.390ms
1024K FFT on 2 cores = 13.738ms
1024K FFT on 3 cores = 9.706ms
1024K FFT on 4 cores = 8.489ms
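
The diminishing returns are easier to see as speedup and per-core efficiency. The following Python sketch computes both from the four timings above; the names `timings_ms` and `scaling` are illustrative only.

```python
# Per-iteration times in milliseconds, keyed by core count (from the post).
timings_ms = {1: 22.390, 2: 13.738, 3: 9.706, 4: 8.489}


def scaling(timings):
    """Return (cores, speedup, efficiency) tuples relative to one core."""
    base = timings[1]
    rows = []
    for cores, t in sorted(timings.items()):
        speedup = base / t
        rows.append((cores, speedup, speedup / cores))
    return rows


if __name__ == "__main__":
    for cores, speedup, efficiency in scaling(timings_ms):
        print(f"{cores} cores: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

With these numbers the speedups are roughly 1.63x, 2.31x and 2.64x on 2, 3 and 4 cores, so per-core efficiency falls from about 81% to about 66%.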

Last fiddled with by bcp19 on 2012-04-07 at 16:42

2012-04-07, 17:54   #6
Dubslow
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
3×29×83 Posts

Indeed, you get very reduced returns for each successive core added. With regard to the cache, each test works on data roughly the size of the number being tested, which for GIMPS' current wavefront (58,xxx,xxx exponents) is 58 million bits, or around 6.9 MB. That is larger than the L1 or L2 cache, and if you're running more than one test, larger than the L3 cache as well. (And you can see what happens if you try to run only 1 test on four cores: you get horrible efficiency.)
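
The 6.9 MB figure can be checked with a couple of lines of Python. This sketch (the function name `residue_size_mib` is illustrative) counts only the packed bits of a p-bit residue; a floating-point FFT representation of the same number needs more memory than that, which only strengthens the point about cache.

```python
def residue_size_mib(p):
    """Size in MiB of a p-bit residue stored as packed bits."""
    return p / 8 / 2**20      # bits -> bytes -> MiB


if __name__ == "__main__":
    print(f"{residue_size_mib(58_000_000):.2f} MiB")
```

Note that 58 million bits is 7.25 million bytes, so "6.9 MB" here means binary megabytes (MiB).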

2012-04-10, 21:06   #7
aketilander
"Åke Tilander"
Apr 2011
Sandviken, Sweden
236₁₆ Posts

"Very"? Well, I would say it depends on many different factors. Just an example from a six-core system:

If core #1 is set to 100%,
the 2nd core adds 86% of the first core's capacity
3rd 83%
4th 83%
5th 80%
6th 24%

using AVX.

It seems as if the speed of the memory is a very crucial factor in how much capacity you lose by adding another core.

2012-04-10, 21:24   #8
TObject
Feb 2012
625₈ Posts

The 3rd core, 83% of the first core or of the second? Thanks.

2012-04-10, 21:37   #9
Dubslow
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
1110000110101₂ Posts

Quote:
Originally Posted by aketilander
It seems as if the speed of the memory is a very crucial factor in how much capacity you lose by adding another core.
Indeed, AVX is so fast that Prime95 is now severely memory limited. The reason the extra cores appear to be relatively efficient is that running fewer tests across the system reduces the memory requirements. I suspect that if we had infinitely fast memory, the marginal efficiency would be far lower.

2012-04-11, 16:55   #10
aketilander
"Åke Tilander"
Apr 2011
Sandviken, Sweden
2·283 Posts

Quote:
Originally Posted by TObject
The 3rd core, 83% of the first core or of the second? Thanks.
First core. All are % of first core.

2012-04-11, 18:23   #11
aketilander
"Åke Tilander"
Apr 2011
Sandviken, Sweden
2×283 Posts

Quote:
Originally Posted by Dubslow
Indeed, AVX is so fast that Prime95 is now severely memory limited. The reason the extra cores appear to be relatively efficient is that running fewer tests across the system reduces the memory requirements. I suspect that if we had infinitely fast memory, the marginal efficiency would be far lower.
Another example from an "identical" six-core system but with faster memory:

If core #1 is set to 100% (= 112% of the first core's capacity with the slower memory),
the 2nd core adds 92% of the first core's capacity
3rd 90% of the first core's capacity
4th 75% of the first core's capacity
5th 50% of the first core's capacity
6th 17% of the first core's capacity

using AVX.
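
Adding up the marginal figures gives the total throughput of each machine. The short Python sketch below does that for both six-core systems; the list and function names are invented for illustration, and the percentages are simply the ones quoted in this thread. It assumes the 112% figure applies as a uniform scale factor to the faster-memory system, so that both totals are expressed in units of the slower system's first core.

```python
# Marginal contribution of each core, in percent of that system's first core.
slow_memory = [100, 86, 83, 83, 80, 24]
fast_memory = [100, 92, 90, 75, 50, 17]


def cumulative(marginal, scale=1.0):
    """Running total of throughput as cores are added, scaled by `scale`."""
    total = 0.0
    totals = []
    for m in marginal:
        total += m * scale
        totals.append(total)
    return totals


if __name__ == "__main__":
    print("slow memory:", cumulative(slow_memory))
    # The fast system's first core is 112% of the slow system's first core.
    print("fast memory:", cumulative(fast_memory, scale=1.12))
```

Under that reading, the six cores together deliver about 456% on the slower-memory system and about 475% on the faster one, both relative to the slower system's first core.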

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.