Single LL Question
Could an LL test be split across multiple local computers, with the goal of speeding up computation of a single LL test for a large exponent? Is current home or small-server tech, designed to minimise latency between machines, good enough to make this feasible?
I don't know how the LL test is split onto multiple cores; I guess it must be that the multiplication is split somehow. Can the work be split into an arbitrary number of pieces? Is there an optimum number of pieces or piece size to split the work into (dependent on p and/or CPU architecture, perhaps), and would the optimum piece count for large exponents be high enough to even suggest that a multi-computer LL test might be worthwhile? I realise there may be many problems with splitting the workload onto cores which aren't tightly in sync, particularly for something as highly tuned and latency-sensitive as an LL test. But as I don't know anything for sure and can only guess, I thought asking might be a good idea :)
[QUOTE=Unregistered;295698]Could an LL test be split across multiple local computers, the goal being to speed up computation of a single LL test for large exponents? Is it feasible that current home or small server tech designed to minimise latency between machines would be good enough to allow this?
I don't know how the LL test is split onto multiple cores, I guess it must be that the multiplication is split (?) somehow. Can the work be split into an arbitrary number of pieces, is there an optimum number of pieces or piece size to split the work into, and would the optimum piece count for large exponents be high enough to even suggest that a multi-computer LL test might be worthwhile?[/QUOTE]
No. The LL test works through a series of multiplications and modular reductions, and these must be performed serially. You could not, for example, take 4 sections, run them separately, and then put the results together, since you could not start section 2 without the end result of section 1.
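The serial dependency is easy to see in the recurrence itself. Below is a minimal Python sketch of the Lucas-Lehmer test, for illustration only; real implementations such as Prime95 replace the schoolbook squaring with an FFT-based multiply and a fast reduction, but the loop structure is the same: each iteration needs the previous iteration's result before it can start.

```python
def lucas_lehmer(p):
    """Lucas-Lehmer test for the Mersenne number M_p = 2**p - 1 (p an odd prime).

    Each iteration squares the previous residue, so iteration k cannot
    begin until iteration k-1 has finished: the loop is inherently serial.
    """
    m = (1 << p) - 1          # M_p = 2^p - 1
    s = 4                     # s_0 = 4
    for _ in range(p - 2):    # p - 2 serial squarings
        s = (s * s - 2) % m   # s_{k+1} = s_k^2 - 2 (mod M_p)
    return s == 0             # M_p is prime iff s_{p-2} == 0
```

For example, `lucas_lehmer(7)` confirms that M_7 = 127 is prime, while `lucas_lehmer(11)` correctly reports that M_11 = 2047 = 23 * 89 is composite.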
You can, however, pause an LL test and take it from one computer to another so that you can continue the same test when you upgrade hardware or change computers.
As I understand it, an LL test can currently be run multi-core, because the FFT used in the multiplication can be run multi-core. I know the iterations cannot be performed out of order or without the result of the previous iteration. In cases where an iteration can be done entirely in cache, is it right to think that any external communication (outside the CPU, to RAM or anywhere else) would make it slower no matter what? For iterations which cannot be done wholly in cache (do such cases exist?), could a multi-CPU setup potentially be of benefit?
I am the OP, please excuse my ignorance :grin:
[QUOTE=zanmato;295725]As I understand it LL currently can be done multi-core, because the FFT used in multiplication can be run multi-core. In cases where the iteration can be done entirely in the cache, is it right to think that any external communication would make it slower no matter what? For any which cannot be done wholly in the cache (do such cases exist?), would a multi-cpu setup potentially benefit then?[/QUOTE]
If you have a multi-core system, you can run the benchmark program and observe the timings for 1, 2, 3, etc. cores and see for yourself. Generally, though, you see a lesser benefit for each added core, as shown by these timings from one of my systems:

1024K FFT on 1 core = 22.390 ms
1024K FFT on 2 cores = 13.738 ms
1024K FFT on 3 cores = 9.706 ms
1024K FFT on 4 cores = 8.489 ms
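Those timings translate into speedup and parallel-efficiency figures with a quick back-of-the-envelope calculation (plain arithmetic on the numbers quoted above):

```python
# Per-iteration times (ms) for a 1024K FFT on 1-4 cores, from the post above
times = {1: 22.390, 2: 13.738, 3: 9.706, 4: 8.489}

for cores, t in times.items():
    speedup = times[1] / t        # how many times faster than a single core
    efficiency = speedup / cores  # fraction of ideal linear scaling
    print(f"{cores} cores: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

Two cores reach roughly a 1.63x speedup (about 81% efficiency), but four cores only about 2.64x (about 66% efficiency), so each added core buys less than the last.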
[QUOTE=zanmato;295725]In cases where the iteration can be done entirely in the cache, is it right to think that any external communication would make it slower no matter what? For any which cannot be done wholly in the cache (do such cases exist?), would a multi-cpu setup potentially benefit then?[/QUOTE]
[QUOTE=bcp19;295728]Generally though, you see a lesser benefit for each added core, as shown by these timings from one of my systems:
1024K FFT on 1 core = 22.390 ms
1024K FFT on 2 cores = 13.738 ms
1024K FFT on 3 cores = 9.706 ms
1024K FFT on 4 cores = 8.489 ms[/QUOTE]
Indeed, you get sharply diminishing returns for each successive core added. With regard to the cache: each test uses roughly the size of the number, which for GIMPS' current wavefront (58,xxx,xxx exponents) is about 58 million bits, or around 6.9 MB. That is larger than the L1 or L2 cache, and if you're running more than one test, larger than the L3 cache as well (and you can see what happens if you try to run only one test on four cores: you get horrible efficiency).
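The 6.9 MB figure is just the raw size of the residue (a quick check in Python; note that the FFT's actual working set is typically larger still, since the residue is spread across floating-point words):

```python
bits = 58_000_000            # wavefront exponent size, in bits
raw_mib = bits / 8 / 2**20   # raw size of the residue in MiB
print(f"{raw_mib:.1f} MiB")  # ~6.9 MiB, already larger than typical L1/L2 caches
```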
[QUOTE=Dubslow;295735]Indeed, you get very reduced returns for each successive core added.[/QUOTE]
"Very" --- well, I would say it depends on many different factors. Just an example from a six-core system, using AVX. If core #1 is set to 100%, each added core contributes the following, as a percentage of the first core's capacity:

2nd: 86%
3rd: 83%
4th: 83%
5th: 80%
6th: 24%

It seems as if the speed of the memory is a very crucial factor in how much capacity you lose when adding another core.
[QUOTE=aketilander;296057]If core #1 is set to 100%, the second core adds 86% of the first core's capacity, 3rd 83%, 4th 83%, 5th 80%, 6th 24%, using AVX.[/QUOTE]
The 3rd core: 83% of the first core or of the second? Thanks.
[QUOTE=aketilander;296057]It seems as if the speed of the memory is a very crucial factor in relation to how much capacity you lose adding another core.[/QUOTE]
Indeed, AVX is so fast that Prime95 is now severely memory-limited. The reason the extra cores [i]appear[/i] to be relatively efficient is that running fewer tests across the system reduces the memory bandwidth required. I suspect that if we had infinitely fast memory, the marginal efficiency would be far lower.
[QUOTE=TObject;296059]The 3rd core, 83% of the first core or of the second? Thanks.[/QUOTE]
Of the first core. All figures are percentages of the first core's capacity.
[QUOTE=Dubslow;296061]Indeed, AVX is so fast that Prime95 is now severely memory limited. I suspect if we had infinitely fast memory, the marginal efficiency would be far lower.[/QUOTE]
Another example from an "identical" six-core system, but with faster memory, using AVX. If core #1 is set to 100% (= 112% of the first core's capacity on the slower-memory system), each added core contributes the following, as a percentage of the first core's capacity:

2nd: 92%
3rd: 90%
4th: 75%
5th: 50%
6th: 17%
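Summing the marginal contributions gives each system's aggregate six-core throughput, in units of the slower-memory system's single-core capacity (my own arithmetic on the figures posted above, not a new measurement):

```python
# Marginal capacity added per core, as % of that system's own first core
slow_mem = [100, 86, 83, 83, 80, 24]
fast_mem = [100, 92, 90, 75, 50, 17]
fast_core1 = 1.12  # faster-memory core #1 = 112% of slower-memory core #1

slow_total = sum(slow_mem) / 100               # 4.56x one slow-memory core
fast_total = fast_core1 * sum(fast_mem) / 100  # ~4.75x one slow-memory core
print(f"slow memory: {slow_total:.2f}x, fast memory: {fast_total:.2f}x")
```

So the faster-memory system ends up only slightly ahead overall; its later cores hit the bandwidth wall sooner even though each early core does more work.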