mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Old 2013-04-28, 16:45   #23
jasonp
Don't ask for 32 threads out of Msieve. I would limit it to 2-3 threads per physical package on the machine you are using.
Old 2013-04-29, 07:01   #24
NBtarheel_33
 

Quote:
Originally Posted by Prime95 View Post
Fixed in 27.9. All earlier v27s had the bug. You are correct in that it does not damage the result.
The LL in question just finished. It has an error code of 08004400, which is 68 of those huge roundoff errors, and mprime catching its breath eight times long enough to say "hey, this isn't a hardware problem". Should I submit this result as a "Suspect LL" or should I send results.txt to you, so the error code can be adjusted?
Old 2013-04-29, 07:26   #25
NBtarheel_33
 
NBtarheel_33's Avatar
 
"Nathan"
Jul 2008
Maryland, USA

5·223 Posts
Default

Quote:
Originally Posted by bcp19 View Post
I don't know mprime all that well, but I doubt you get your best performance using the entire CPU on a single exponent. Run a benchmark and compare the iteration times for 1/2/3/4/5/6/7/8 threads. On mine, at 3072K FFT (which is around the 50M exponent range), I get 16.680 ms for 1, 8.987 for 2, 6.541 for 3, and 5.842 for all 4. That means two shared cores give me only 92% of the throughput of two separate single-core workers, and it drops to 85% with 3 and 71% with all 4.

I definitely understand that this is usually the case. But for some reason, and it may be related to my still not fully understanding how Linux pairs physical cores and logical "helper" hyperthreads, I get weird timings, e.g. for 50M exponents:
  • 1 worker on 16 threads = 2.6 ms/iteration (130K sec/exponent), with fluctuations as low as 2.0 ms/iteration and as high as 3.2 ms/iteration.
  • 2 workers, each 8 threads = 5.1 ms/iteration on one exponent and 3.9 ms/iteration on the other (average of 225K sec/exponent).
  • 4 workers, each 4 threads = 9.4 ms/iteration on two exponents and 17.2 ms/iteration on the other two (average of 665K sec/exponent).
  • 8 workers, each 2 threads = 19.5 ms/iteration on four exponents and 37.8 ms/iteration on the other four (average of 1.43M sec/exponent).
  • 16 workers, each 1 thread = 30-40 ms/iteration on eight exponents and 80-90 ms/iteration on the other eight, subject to volatile fluctuation (average of 2.75M - 3.25M sec/exponent).
After extensive testing and direction of expletives at the cryptic information provided by top and /proc/cpuinfo, I found that 8 cores/16 hyperthreads for each test seemed the most efficient, offering the most stable iteration times. I can't complain about finishing an LL test on a 50M exponent in 2-1/2 days, after all.

I am curious what might be causing the difference in iteration times between one set of eight cores and the other set, though. As I said above, everywhere else in my experience I agree 100% that maximal throughput is achieved by running one number on each core. But not so much in this case...

Would it perhaps be better to run two copies of mprime, one on each CPU? Does mprime play nicely with multi-socket (as opposed to multi-core) systems?

Last fiddled with by NBtarheel_33 on 2013-04-29 at 07:28 Reason: Remove redundant quote block
Old 2013-04-29, 14:52   #26
bcp19
 
bcp19's Avatar
 
Oct 2011

7·97 Posts
Default

Quote:
Originally Posted by NBtarheel_33 View Post
I definitely understand that this is usually the case. But for some reason [...] I get weird timings [...] Would it perhaps be better to run two copies of mprime, one on each CPU? Does mprime play nicely with multi-socket (as opposed to multi-core) systems?
My first thought is that you are running a hyperthreaded setup. My second thought is that you are running two physical CPUs on the board.

George's program is so efficient that H/T can actually slow it down. Case in point: I have a 4-physical-core laptop with H/T. These are the timings I get running ECM:
1 worker, 1 core (affinity logical cpu 1) = 774 sec avg
1 worker, 2 cores (affinity logical cpus 1,2) = 785 sec avg
1 worker, 4 cores (affinity logical cpus 1,2,3,4) = 540 sec avg
If I 'trick' the program by setting affinity to CPU 2, I get:
1 worker, 2 cores (affinity logical cpus 2,3 [physical cores 1,2]) = 502 sec avg

As you can see, using both logical cores on a physical core is slower than using 1 in both instances.
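That affinity 'trick' can also be applied from the OS side rather than through the program's own settings. A minimal Linux-only sketch using Python's standard library, assuming (illustratively; verify against your own topology first) that logical CPUs 0 and 1 sit on different physical cores:

```python
# Pin the current process (and any threads it spawns) to one logical CPU
# per physical core, so a worker never competes with its own HT sibling.
import os

# Assumption for illustration: logical CPUs 0 and 1 are on *different*
# physical cores. Check /proc/cpuinfo or sysfs before relying on this.
preferred = {0, 1}

# Only request CPUs that actually exist on this machine.
available = os.sched_getaffinity(0)
target = (preferred & available) or available

os.sched_setaffinity(0, target)
print("running on logical CPUs:", sorted(os.sched_getaffinity(0)))
```

The same calls work on any process ID, so a wrapper script can pin each worker copy to its own physical core at launch.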

To truly see how your system works, I would start by setting it up to run 16 workers of 1 core each. Start one worker and record timings, then start a second and record timings, and continue until you have all 16 running OR you see a significant slowdown in timings.

My 2500 (it's not H/T) cannot run all 4 cores as efficiently as 3 due to a bottleneck somewhere, most likely memory bandwidth: 1 worker running = ~16.2 ms/iter, 2 = ~16.4 avg, 3 = ~17.2 avg, 4 = ~21.3 avg.

Time to complete the equiv of 1 exponent:
~9.9 days on 1 core
~5.0 days on 2 cores
~3.4 days on 3 cores
~3.2 days on 4 cores
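The arithmetic behind those equivalent-time figures is simple: per-worker wall time is ms/iteration times the iteration count (roughly the exponent, for an LL test), and the equivalent days per exponent is that divided by the number of workers. A sketch, assuming a 50M exponent (bcp19's actual exponent isn't stated, so the results land slightly below his figures):

```python
# Equivalent days per exponent when k workers each run one LL test.
# An LL test on M(p) takes ~p iterations; p = 50M is assumed here.
SECONDS_PER_DAY = 86_400
EXPONENT = 50_000_000  # assumption: the exact exponent isn't given above

def equiv_days(ms_per_iter, workers):
    per_worker_seconds = ms_per_iter / 1000 * EXPONENT
    return per_worker_seconds / SECONDS_PER_DAY / workers

for ms, k in [(16.2, 1), (16.4, 2), (17.2, 3), (21.3, 4)]:
    print(f"{k} worker(s) at {ms} ms/iter -> "
          f"{equiv_days(ms, k):.1f} equivalent days per exponent")
```

Four workers still win on aggregate throughput despite the slower per-worker iterations, consistent with the 3.4-day vs 3.2-day figures above.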
Old 2013-04-29, 15:16   #27
Prime95

Quote:
Originally Posted by NBtarheel_33 View Post
The LL in question just finished. It has an error code of 08004400, which is 68 of those huge roundoff errors, and mprime catching its breath eight times long enough to say "hey, this isn't a hardware problem". Should I submit this result as a "Suspect LL" or should I send results.txt to you, so the error code can be adjusted?
Submit the result "as is". Someone will do an early double-check.
Old 2013-04-30, 09:10   #28
NBtarheel_33
 
NBtarheel_33's Avatar
 
"Nathan"
Jul 2008
Maryland, USA

5×223 Posts
Default

Another nice P-1, but no factor...

M93111047 completed P-1, B1=1080000, B2=33210000, E=12

Takes about 4 hours for Stage 1 and 8 hours for Stage 2, running on 8 cores/16 threads and 30GB RAM. Anyone ever seen a higher E? This is the third time I've had E=12...

Last fiddled with by NBtarheel_33 on 2013-04-30 at 09:16 Reason: Add timing for Stage 1 and Stage 2
Old 2013-04-30, 09:33   #29
NBtarheel_33
 
NBtarheel_33's Avatar
 
"Nathan"
Jul 2008
Maryland, USA

5×223 Posts
Default

Quote:
Originally Posted by Prime95 View Post
Submit the result "as is". Someone will do an early double-check.

Done. The exponent is 50098369, if anyone's interested...


Never mind, it's been assigned already! It's only been trial-factored to 72 bits, actually, so perhaps the GPUto7x folk will have a go at it. Incidentally, George, you factored it to 64 bits way back in 2008 (so let me say that it's an honor to have collaborated with you, sir).

In other happenings, this post is my nth, where n is the Number of the Beast...(say, wasn't there a devilish smiley at one time?)

Last fiddled with by NBtarheel_33 on 2013-04-30 at 09:39
Old 2013-04-30, 10:16   #30
xilman

Quote:
Originally Posted by NBtarheel_33 View Post
In other happenings, this post is my nth, where n is the Number of the Beast...(say, wasn't there a devilish smiley at one time?)
Like this, you mean?

Old 2013-04-30, 15:06   #31
Xyzzy
 
Xyzzy's Avatar
 
"Mike"
Aug 2002

2·23·179 Posts
Default

Quote:
This is the third time I've had E=12...
Almost all of our recent P-1 work has "E=12" because we give P-1 8GiB of memory. It is strange that every once in a while we get an "E=6" even though the memory allocated doesn't change.

Code:
UID: Xyzzy/i7, M61192819 completed P-1, B1=580000, B2=11890000, E=12, We4: Γ—Γ—Γ—Γ—Γ—Γ—Γ—Γ—
UID: Xyzzy/i7, M61478429 completed P-1, B1=585000, B2=11992500, E=6, We4: Γ—Γ—Γ—Γ—Γ—Γ—Γ—Γ—
UID: Xyzzy/i7, M60505889 completed P-1, B1=570000, B2=11685000, E=12, We4: Γ—Γ—Γ—Γ—Γ—Γ—Γ—Γ—
Old 2013-04-30, 15:31   #32
c10ck3r
 
c10ck3r's Avatar
 
Aug 2010
Kansas

547 Posts
Default

Xyzzy - could that be because of the size of the exponent? I noticed that only the largest one had a reduced E value.
Old 2013-04-30, 22:35   #33
NBtarheel_33
 
NBtarheel_33's Avatar
 
"Nathan"
Jul 2008
Maryland, USA

5·223 Posts
Default

Quote:
Originally Posted by c10ck3r View Post
Xyzzy- could that be because of the size of the exponent? I noticed that only the largest one had a reduced E-value.
Yeah, perhaps there's a breakpoint in the 61M range where 8GB is no longer enough to support E=12 and stage 2 falls back to E=6. Maybe try another P-1 in the high 61M range with 12-16GB of RAM and see what happens.