Did I read a post re 4 cores to NOT LL all?
I seem to recall reading on this forum that on a Quad it is not the best use of the PC to do LL tests on all 4 cores; something about overhead on the CPU? or RAM? I think it recommended doing TF on one core?
If I am not imagining things then can someone tell me what the conditions were? i.e. Does it depend on the CPU or RAM technology or speed or the OS? I have an Intel Q9550 (quad core, 2.83 GHz) with 4GB DDR2-1066 RAM and Vista 64. |
[QUOTE=petrw1;150458]I seem to recall reading on this forum that on a Quad it is not the best use of the PC to do LL tests on all 4 cores; something about overhead on the CPU? or RAM? I think it recommended doing TF on one core?
If I am not imagining things then can someone tell me what the conditions were? i.e. Does it depend on the CPU or RAM Technology or Speed or the OS?[/QUOTE]The limitations in accessing RAM (irrespective of technology) cause the multiple threads to compete and choke down the throughput. With a quad, I personally would suggest at least 1, maybe 2 threads doing T-F. T-F puts less demand on the bus. Others may offer their own suggestions based upon actual experience, but that is my dos centavos. |
You may be right....I swapped one core to TF and the other three doing LL (though still in the P-1 phase) are now running at least 10% faster
|
Hi petrw1,
I think you should give the new v25.8 a try with AffinityScrambling set to AffinityScramble=1230 and 2 worker / 2 helper threads for LL tests; see [URL]http://mersenneforum.org/showpost.php?p=150437&postcount=21[/URL] |
With your memory, there's not as much of a performance hit as most people get when you run LL tests on all 4 cores. I think it's along the lines of with DDR2-800 or below, you only get something like 3 cores worth of output, but with DDR2-1066 it's closer to 3.5 cores worth of output. It's not big enough of a hit to discourage me from running LL on all 4 cores of my two quad-cores.
Whether it's "better" to run LL on all 4 cores or LL-TF-LL-TF is somewhat a matter of opinion. If you just care about total production, then with the current crediting system you're best off alternating. Back when TF only got credited 1/10 as much, it was worth the performance hit to run LL on all 4 cores. If you care about advancing the project, then I believe that LL testing is the way to go. I'm pretty sure the crediting disparity was originally created to motivate LL testing over factoring, and I think with that disparity gone we're going to eventually see the TF leading edge pull away from the LL testing leading edge. |
[QUOTE=Phantomas;150470]Hi petrw1,
I think you should give the new v25.8 a try with AffinityScrambling set to AffinityScramble=1230 and 2 worker / 2 helper threads for LL tests; see [URL]http://mersenneforum.org/showpost.php?p=150437&postcount=21[/URL][/QUOTE] Do I understand correctly that this will give me 2 concurrent LL tests with 2 cores jointly working on each? And if I do this, will the per-iteration time be close to half, so that whether I do 4 LL tests on separate cores OR 2 tests on 2 cores each, the total elapsed time for 4 tests will be about the same? |
[quote=petrw1;150518]Do I understand correctly that this will give me 2 concurrent LL tests with 2 cores jointly working on each?
[/quote] Yes, that's right. And if I interpret my results correctly, then each LL will run on one dual-core die, so (my impression) one LL can use the 6MB L2 cache alone, and it doesn't need to access the ordinary RAM so often. [quote=petrw1;150518] And if I do this will the per Iteration time be close to half so that whether I do 4 LL tests on separate cores OR 2 by 2 cores each the total elapsed time for 4 tests will be about the same?[/quote] This is the case in my test with my Q9450. With 4 independent LL tests (2560K), one iteration is about 54.somewhat ms. With 2 LL tests on 2 cores each, it's about 26.somewhatelse ms. So it is in fact a tiny bit faster, and I assume that this is because it's using the L2 cache better. My RAM runs at 1200MHz 6,6,6,15, and maybe the effect is bigger with 800MHz RAM (hope so...). But it seems to be important to run one test on cores [1,2] and the other on cores [0,3]. Otherwise my iteration time went up 20%. |
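To put the two setups in the post above on the same footing, it helps to convert the per-iteration times into combined throughput. The following is a minimal Python sketch, assuming round figures of 54 ms and 26 ms (the post does not give the fractions of a millisecond) and a hypothetical helper named `iterations_per_second`; it is only arithmetic on the quoted timings, not a benchmark.

```python
# Approximate per-iteration times taken from the post above.
# The fractional milliseconds were not given, so these are rounded assumptions.
ITER_MS_FOUR_SINGLE = 54.0  # 4 independent LL tests, one core each
ITER_MS_TWO_DUAL = 26.0     # 2 LL tests, two cores each


def iterations_per_second(num_tests, iter_ms):
    """Combined LL iterations per second across all concurrently running tests."""
    return num_tests * 1000.0 / iter_ms


four_single = iterations_per_second(4, ITER_MS_FOUR_SINGLE)
two_dual = iterations_per_second(2, ITER_MS_TWO_DUAL)

# Relative difference of the 2 x 2 setup versus the 4 x 1 setup, in percent.
gain_percent = (two_dual / four_single - 1.0) * 100.0

print(f"4 tests x 1 core : {four_single:.1f} iter/s")
print(f"2 tests x 2 cores: {two_dual:.1f} iter/s")
print(f"difference       : {gain_percent:+.1f}%")
```

With these rounded inputs the 2 x 2 setup comes out a few percent ahead, which fits the "tiny bit faster" description; the real difference depends on the fractions that were left out.
|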
For future reference, all of the above applies to Intel quad core processors prior to the i7 series. Most of the adjustments/tweaks listed probably do not need to be done on i7 systems (at least those with triple channel RAM) and won't have a significant effect on Phenom quad cores.
|
[QUOTE]For future reference, all of the above applies to Intel quad core processors[/QUOTE]
Yepp! |
[QUOTE=Kevin;150489]With your memory, there's not as much of a performance hit as most people get when you run LL tests on all 4 cores. I think it's along the lines of with DDR2-800 or below, you only get something like 3 cores worth of output, but with DDR2-1066 it's closer to 3.5 cores worth of output. It's not big enough of a hit to discourage me from running LL on all 4 cores of my two quad-cores.[/QUOTE]
The bandwidth of the RAM itself is not the limiting factor. The limiting factor is the memory bus itself, and contention for it by four cores. 800 MHz QDR RAM is more than enough for what we do. Moving to 1066 MHz won't add any more than 5% to your performance. Of far greater concern to him is the chipset... as the nVidia chipsets have far greater problems with all four cores demanding high-volume access to the memory bus, whereas the Intel chipsets are much, MUCH better. If he's running nVidia, 2 LL and 2 TF are about optimal. If he's running Intel, then 4 LL are fine -- downshifting to 3 LL and 1 TF doesn't buy you any improvement. Jester |
[QUOTE=ADBjester;150735]The bandwidth of the RAM itself is not the limiting factor. The limiting factor is the memory bus itself, and contention for it by four cores. 800 MHz QDR RAM is more than enough for what we do. Moving to 1066 MHz won't add any more than 5% to your performance.[/QUOTE]I do not agree, on P4 D and on Core2 Quad the performance of Prime95 was proportional to the memory speed (measured from 533 MHz DDR2 to 1066 MHz DDR2.)[QUOTE=ADBjester;150735]... the nVidia chipsets have far greater problems with all four cores demanding high-volume access to the memory bus, whereas the Intel chipsets are much, MUCH better.[/QUOTE]Yes[QUOTE=ADBjester;150735]If he's running nVidia, 2 LL and 2 TF are about optimal. If he's running Intel, then 4 LL are fine -- downshifting to 3 LL and 1 TF doesn't buy you any improvement.[/QUOTE]I don't agree: on the P35, 965 and G965 chipsets running 3 LL + 1 TF, the core doing LL and sharing a die with TF sees a 6% to 12% improvement over the cores of the die where both are doing LL tests.
Jacob |
[quote=S485122;150762]I do not agree, on P4 D and on Core2 Quad the performance of Prime95 was proportional to the memory speed (measured from 533 MHz DDR2 to 1066 MHz DDR2.)[/quote]
At equal core frequency and different memory speeds? Could you please provide detailed benchmarks? :) |
I and others posted details in the Hardware subforum. I do not have the time to try to find them now (I did a quick search: the threads Quad Core and Quad Core and P95 should contain the necessary data). All parameters except memory were constant (motherboard, FSB speed, CPU).
Jacob |
[QUOTE=Phantomas;150536]Yes, that's right. And if I interpret my results correctly, then each LL will run on one dual-core die, so (my impression) one LL can use the 6MB L2 cache alone, and it doesn't need to access the ordinary RAM so often.
But it seems to be important to run one test on cores [1,2] and the other on cores [0,3]. Otherwise my iteration time went up 20%.[/QUOTE] Interesting... must be the combined cache kicking in. What settings are needed to ensure we run on cores [1,2] and cores [0,3]? Is it achievable only on 25.8 with the AffinityScramble setting? |
[QUOTE=db597;151581]Interesting... must be the combined cache kicking in. What settings are needed to ensure we run on core [1,2] and core [0,3]? Is it achievable only on 25.8 with the affinityscramble setting?[/QUOTE]
Yes, so says George after my attempts to do the same with 25.7 with mixed results. See post ... [url]http://www.mersenneforum.org/showpost.php?p=151570&postcount=29[/url] ... and the next 3 |
[quote=db597;151581]Interesting... must be the combined cache kicking in. What settings are needed to ensure we run on core [1,2] and core [0,3]? Is it achievable only on 25.8 with the affinityscramble setting?[/quote]
Yes, only 25.8 gives you full control over which core to use. But (at least) on my system I noticed that the core binding varies with the FSB and/or CPU speed. I can't explain why, but it is reproducible.... See [URL]http://mersenneforum.org/showpost.php?p=151272&postcount=24[/URL] and [URL]http://mersenneforum.org/showpost.php?p=151272&postcount=26[/URL] |
Thanks guys. I'll download 25.8 and give it a try tonight. Even running 24/7, it takes me over a month to complete 1 LL (first-time tests), so it's a welcome thought to be able to get 2 workers on the same exponent without sacrificing any speed (and even a tiny speedup would be a fantastic bonus!).
|
While examining the quad core performance of my system I noticed something interesting. When running 4 LL tests I get the equivalent of about 3.2 cores-worth of performance if I pick as a reference the speed of a single core operating on a single exponent. This is a well known issue and agrees with the observations of others (aka memory bottleneck). This made me initially think it is only minimally worth the effort of running the 4th core for LL, as getting 0.2x performance out of it isn't all that good. However, when I run 3 cores on LL I don't get 3 cores-worth of performance. I get 2.6. Only when I go down to 2 cores do I get twice the single core performance. So running the fourth core on LL has more than a 0.2 effect, as it takes me from 2.6 to 3.2. I believe others have noticed this too, as I've seen some recommend running 2 LL and 2 TF (instead of 3 LL and 1 TF). My quad is overclocked to 3.2GHz, with 1066DDR2 memory running at 533MHz, and yet I still see this behavior. Nonetheless, I'm happy with its performance as it far exceeds the stock performance and is exactly double that of my dual-core E8500 (3.16GHz) which I always thought was fast and not suffering from a memory bottleneck.
|
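The argument in the post above is about marginal gains: what each additional core running LL adds to the total, rather than how far the total falls short of the ideal. A minimal Python sketch, using only the core-equivalent figures quoted in that post (1.0, 2.0, 2.6, 3.2); the function name `marginal_gains` is made up for illustration.

```python
# Core-equivalent throughput reported in the post above,
# keyed by the number of cores running LL tests.
core_equivalents = {1: 1.0, 2: 2.0, 3: 2.6, 4: 3.2}


def marginal_gains(totals):
    """Return how much each additional core adds to the combined throughput."""
    gains = {}
    previous = 0.0
    for cores in sorted(totals):
        # Round to suppress floating-point noise in the subtraction.
        gains[cores] = round(totals[cores] - previous, 2)
        previous = totals[cores]
    return gains


gains = marginal_gains(core_equivalents)
for cores, gain in gains.items():
    print(f"core #{cores} adds {gain:.1f} core-equivalents")
```

Seen this way, the third and fourth cores each add about 0.6, so the fourth core contributes as much as the third rather than only 0.2.
|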
[quote=stars10250;153057]While examining the quad core performance of my system I noticed something interesting. When running 4 LL tests I get the equivalent of about 3.2 cores-worth of performance if I pick as a reference the speed of a single core operating on a single exponent. This is a well known issue and agrees with the observations of others (aka memory bottleneck). This made me initially think it is only minimally worth the effort of running the 4th core for LL, as getting 0.2x performance out of it isn't all that good. However, when I run 3 cores on LL I don't get 3 cores-worth of performance. I get 2.6. Only when I go down to 2 cores do I get twice the single core performance. So running the fourth core on LL has more than a 0.2 effect, as it takes me from 2.6 to 3.2. I believe others have noticed this too, as I've seen some recommend running 2 LL and 2 TF (instead of 3 LL and 1 TF). My quad is overclocked to 3.2GHz, with 1066DDR2 memory running at 533MHz, and yet I still see this behavior. Nonetheless, I'm happy with its performance as it far exceeds the stock performance and is exactly double that of my dual-core E8500 (3.16GHz) which I always thought was fast and not suffering from a memory bottleneck.[/quote]
I bet if you remove your overclocking but keep the memory at the same speed, it will scale better. |
[quote=henryzz;153278]I bet if you remove your overclocking but keep the memory at the same speed, it will scale better.[/quote]
I tried this and did get better scaling but overall lower performance. Here are the numbers:

3.2 GHz Q6600 (8x), 400 MHz FSB, 533 MHz DRAM
4 cores (0,1,2,3): 3.2 core-equivalent performance (total # of iter in 1 hr: 239016)
3 cores (1,2,3): 2.6 core-equivalent performance
2 cores (1,3): 2.0 core-equivalent performance
1 core (3): 1.0 core-equivalent performance (48 ms iter time, M47.8)

2.8 GHz Q6600 (7x), 400 MHz FSB, 533 MHz DRAM
4 cores (0,1,2,3): 3.4 core-equivalent performance (total # of iter in 1 hr: 220699)
3 cores (1,2,3): 2.7 core-equivalent performance
2 cores (1,3): 2.0 core-equivalent performance
1 core (3): 1.0 core-equivalent performance (55 ms iter time, M47.8)

2.4 GHz Q6600 (6x), 400 MHz FSB, 533 MHz DRAM
4 cores (0,1,2,3): 3.5 core-equivalent performance (total # of iter in 1 hr: 200000)
3 cores (1,2,3): 2.8 core-equivalent performance
2 cores (1,3): 2.0 core-equivalent performance
1 core (3): 1.0 core-equivalent performance (63 ms iter time, M47.8)

Overall, the maximum number of iterations performed in a given time is achieved by running all 4 cores at the highest CPU (and memory) speed. |
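The core-equivalent figures above can be cross-checked against the raw numbers in the same post. The post does not say how they were derived, so the following Python sketch assumes the natural definition: total iterations per hour with 4 cores, divided by the iterations per hour one core manages when running alone. The function name `core_equivalents` is an illustrative choice.

```python
# (CPU clock in GHz, single-core iteration time in ms,
#  total iterations in 1 hour with all 4 cores running LL)
runs = [
    (3.2, 48.0, 239016),
    (2.8, 55.0, 220699),
    (2.4, 63.0, 200000),
]

MS_PER_HOUR = 3_600_000.0


def core_equivalents(single_iter_ms, total_iters_per_hour):
    """Total 4-core throughput expressed in units of one unloaded core."""
    single_core_iters_per_hour = MS_PER_HOUR / single_iter_ms
    return total_iters_per_hour / single_core_iters_per_hour


results = [(ghz, core_equivalents(ms, total)) for ghz, ms, total in runs]
for ghz, equivalents in results:
    print(f"{ghz:.1f} GHz: {equivalents:.2f} core-equivalents")
```

Under that assumption the three runs round to 3.2, 3.4 and 3.5, matching the posted figures: scaling improves as the clock drops, while the absolute iteration count is still highest at 3.2 GHz.
|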
Exactly as I expected.
Computer speed isn't as dependent on CPU speed as people used to think. At some point I will do some benchmarks with different memory speeds to show the difference. |
This is not to cause a stink, but Prime95 is specifically made for Intel processors. I've heard opinions that modern AMD processors would kick butt if there were a publicly available LLR client for AMDs.
If one were made available publicly (the one I heard about is integer-based and probably still alpha), would it be something that a decent number of people would be interested in? I guess I should be more direct: If an LLR client (Prime95 is an LLR client made specifically for Mersenne numbers) were made available for AMD computers, producing the same residues (when there's not an error), would a good chunk of the community be interested in using that program? |
[QUOTE=jasong;153347]This is not to cause a stink, but Prime95 is specifically made for Intel processors. I've heard opinions that modern AMD processors would kick butt if there were a publicly available LLR client for AMDs.[/QUOTE]I've got my marshmallows out.:popcorn:
|
Ha... thanks to the way P95 is optimised, I've bought only Intel processors for the last 5 years! Total of 4 Pentium 4s, 1 E6400, 1 Q6600 and 1 E5200!
And from the way the benchmarks are looking, the next one is likely to be an Intel i7. |
As far as I am aware, Prime95 has been optimised for both CPUs.
Intels just happen to have an internal design feature that makes them do Lucas-Lehmer tests twice as fast (some info from an expert would be nice). If you try trial factoring on Prime95, then I think you will find AMDs are faster. |