Open Discussion
I would like to open a discussion concerning how best to utilize multi-core
processors to run NFS. It seems as if cache contention would be a major constraint against simply running multiple instances (i.e. giving a separate special q to each core and letting it proceed independently). It might be the case that, e.g., running 4 separate instances on a quad core actually has lower total throughput than a single core (or 2 cores) owing to cache and bus contention. An alternative approach would be to give separate parts of the computation to different cores: while one core is sieving, another core could be computing the sieve start points and sieve boundary for the next special q to be processed. Also, once sieving is finished, we could let separate cores do the trial division on separate candidate smooth values, and so on. Will this approach be better than running multiple copies? If someone has access to a Windows-based multi-core system, I can provide code and data to perform benchmarks with respect to running multiple instances. It would be nice to get a feel for how these processors will perform. |
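A minimal sketch of the split described above, where one worker computes sieve start points for the next special q while another sieves the current one. It only illustrates the hand-off between the two stages and is not the actual lattice siever: `prepare_start_points`, `sieve`, and `run_pipeline` are hypothetical names, and the arithmetic inside them is a toy stand-in. Note also that CPython threads do not run CPU-bound work in parallel, so a real version would presumably use native threads in the siever's own language.

```python
import queue
import threading


def prepare_start_points(q, factor_base):
    # Toy stand-in (assumption): the first index hit by each prime p.
    return {p: (-q) % p for p in factor_base}


def sieve(start_points, length):
    # Toy stand-in (assumption): count how many primes hit each position.
    hits = [0] * length
    for p, start in start_points.items():
        for i in range(start, length, p):
            hits[i] += 1
    return hits


def run_pipeline(special_qs, factor_base, length):
    # Bounded queue: holds at most one prepared special q, so the
    # preparing worker stays only slightly ahead of the sieving worker.
    handoff = queue.Queue(maxsize=1)
    results = {}

    def preparer():
        for q in special_qs:
            handoff.put((q, prepare_start_points(q, factor_base)))
        handoff.put(None)  # sentinel: no more work

    def siever():
        while True:
            item = handoff.get()
            if item is None:
                break
            q, start_points = item
            results[q] = sieve(start_points, length)

    workers = [threading.Thread(target=preparer),
               threading.Thread(target=siever)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results
```

The bounded queue is the design choice worth noting: the two stages share only one small hand-off item, which is the property that might reduce cache pressure compared with fully independent instances. Whether it does in practice is exactly what the benchmarks would have to show.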
[QUOTE=R.D. Silverman;103062]I would like to open a discussion concerning how best to utilize multi-core
processors to run NFS. It seems as if cache-contention would be a major constraint against simply running multiple instances (i.e. give a separate special-q to each core and let it proceed independently). It might be the case that e.g. running 4 separate instances on a quad core might actually have lower total throughput than a single core (or 2 cores) owing to cache and bus contention. An alternative approach would be to give separate parts of the computation to different cores; while one core is sieving, another core could be computing the sieve start points and sieve boundary for the next special q to be processed. Also, once sieving is finished, we could let separate cores do the trial division on separate candidate smooth values. etc. Will this approach be better than running multiple copies? If someone has access to a Windows based multi-core system, I can provide code and data to perform benchmarks with respect to running multiple instances. It would be nice to get a feel for how these processors will perform.[/QUOTE] I would also be interested in seeing how fast a single NFS thread is on the new processors when compared with (say) a P4 @ 3.6GHz. What is the effect of a much larger L2 cache on the efficiency of a *single* instance (even when running at lower clock rates)? |
[QUOTE=R.D. Silverman;103062]
If someone has access to a Windows based multi-core system, I can provide code and data to perform benchmarks with respect to running multiple instances. It would be nice to get a feel for how these processors will perform.[/QUOTE] I've moved Windows development to a dual-core 1.86GHz system with 2MB of shared cache; for most tasks it's slightly slower than a dual 2GHz Opteron system. It should be fairly typical of modern systems, and I'd be happy to run any benchmarks you're thinking of. Maybe we should continue this via email? |
I have access to a few Windows-based multi-core systems: one dual P4 2.8GHz, one AMD X2 3800+, one Intel Core Duo T5500, and two P4 3.0GHz HT.
Carlos |
[QUOTE=em99010pepe;103224]I have access to a few Windows based multi-core system like one dual P4 2.8Ghz, one AMD X2 3800+, Intel Core Duo T5500 and two P4 3.0GHz HT.
Carlos[/QUOTE] If you will send me your email address, I will bundle up my lattice siever, along with input data and instructions. Send it to my private message box. |
[QUOTE=R.D. Silverman;103313]If you will send me your email address, I will bundle up my lattice
siever, along with input data and instructions. Send it to my private message box.[/QUOTE] I am sending the code and data. I have timings for a P4 and P4 HT. I would like timings for a single thread on a dual core and for 2 threads on a dual core run simultaneously. The output from the siever gives verbose timing info. Please post those files. My 1.6 GHz P4 laptop takes just under 20 seconds to process a single special q. |
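Once both sets of timings are in, the comparison could be made along these lines. This is a sketch under assumptions: the function names are hypothetical, and the numbers in the usage lines are made-up placeholders, not measurements from any machine in this thread.

```python
def aggregate_throughput(instances, seconds_per_q):
    """Special q values finished per second, summed over all instances."""
    return instances / seconds_per_q


def scaling_efficiency(single_seconds, multi_seconds):
    """Per-instance speed relative to running alone.

    1.0 means no contention; 0.5 means each instance runs at half speed.
    """
    return single_seconds / multi_seconds


def multiple_instances_win(single_seconds, multi_seconds, instances):
    """True if running several copies beats one copy in total throughput."""
    return (aggregate_throughput(instances, multi_seconds)
            > aggregate_throughput(1, single_seconds))


# Placeholder values only: 20 s per special q alone,
# 25 s per special q when two instances run side by side.
single, dual = 20.0, 25.0
print(scaling_efficiency(single, dual))
print(multiple_instances_win(single, dual, 2))
```

The break-even point is when the time per special q with n instances reaches n times the single-instance time; beyond that, the extra instances lower total throughput.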
[QUOTE=jasonp;103122]I've moved windows development to a dual-core 1.86GHz system with 2MB of shared cache; for most tasks it's slightly slower than a dual 2GHz opteron system. It should be fairly typical of modern systems, and I'd be happy to run any benchmarks you're thinking of. Maybe we should continue this via email?[/QUOTE]
Send me your email in my private messages. I will send code and data. |
[quote=R.D. Silverman;103315]I am sending the code and data. I have timings for a P4 and P4 HT.
I would like timings for a single thread on a dual core and for 2 threads on a dual core run simultaneously. The output from the siever gives verbose timing info. Please post those files. My 1.6 GHz P4 laptop takes just under 20 seconds to process a single special q.[/quote] I will test it as soon as I get home. Carlos |
[QUOTE=em99010pepe;103317]I will test it as soon as I get home.
Carlos[/QUOTE] Here's the output from my laptop:
Siever built on Apr 9 2007 09:39:45
Clock rate = 1594930.693333
We try to factor: 790041893431046581209233185025824311175827003282978140813822078685169030310463219677902163905473613516505865387434570494416331465957580100723618236415905252281207423
======================================================
Total time to process 243542 : 19.396596
Total sieve time = 9.274764
Int line: 0.457527
Alg line: 0.420746
Int vector: 4.171477
Alg vector: 4.225013
Total resieve time = 1.742181
Int resieve: 1.096560
Alg resieve: 0.645621
Trial Int time: 0.118909
Trial Alg time: 2.399764
Find startpts time: 3.148531
Alg scan time: 1.095609
Lattice reduce time: 1.431896
QS/Squfof time: 1.072377
Prepare regions time: 1.088817
Inverse time = 1.117249
Prime Test time = 0.431831
======================================================
======================
Time for Q = 19.396670
======================
======================================================
Total time to process 243543 : 19.351597
Total sieve time = 9.285043
Int line: 0.458326
Alg line: 0.473001
Int vector: 4.174365
Alg vector: 4.179350
Total resieve time = 1.769205
Int resieve: 1.095318
Alg resieve: 0.673887
Trial Int time: 0.119059
Trial Alg time: 2.474075
Find startpts time: 3.128221
Alg scan time: 1.099960
Lattice reduce time: 1.445799
QS/Squfof time: 1.126384
Prepare regions time: 1.131905
Inverse time = 1.151887
Prime Test time = 0.456115
======================================================
======================
Time for Q = 19.351665 |
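For anyone comparing runs, a small script along these lines could turn the verbose output into shares of the total time per special q. It is a sketch, not part of the siever: `parse_timings` and `shares` are hypothetical helpers, and the log excerpt is copied from the first special q above. The listed entries add up to more than the total, so some of them presumably overlap, and the shares should not be read as a clean partition.

```python
import re

# Excerpt of the first special q's output, copied from the post above.
LOG = """\
Total time to process 243542 : 19.396596
Total sieve time = 9.274764
Total resieve time = 1.742181
Trial Int time: 0.118909
Trial Alg time: 2.399764
Find startpts time: 3.148531
"""


def parse_timings(text):
    """Map each 'label : value' or 'label = value' line to a float."""
    timings = {}
    for line in text.splitlines():
        m = re.match(r"\s*(.+?)\s*[:=]\s*([0-9.]+)\s*$", line)
        if m:
            timings[m.group(1)] = float(m.group(2))
    return timings


def shares(timings):
    """Express every phase as a fraction of the total time for the special q."""
    total_key = next(k for k in timings if k.startswith("Total time to process"))
    total = timings[total_key]
    return {k: v / total for k, v in timings.items() if k != total_key}


for label, frac in shares(parse_timings(LOG)).items():
    print(f"{label}: {frac:.1%}")
```

On these figures the sieve itself is a little under half the total and finding start points is roughly 16%, which gives an upper bound on what overlapping the start-point computation with sieving could save on this machine.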
I didn't get your email. Please check your PM. Thanks.
|
[QUOTE=em99010pepe;103334]I didn't get your email.....please check you PM. Thanks.[/QUOTE]
Sent to your alternate address. The zip file contains exe's. |