mersenneforum.org  

Old 2012-07-20, 19:16   #1
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

7,537 Posts
What SIEVE_PRIMES are you using?

I'm working on GPU sieving and I'd like to know what sieve_primes setting most mfaktc users are using.

I think mfaktc can auto-adjust this value. If so, please report the sieve_primes value mfaktc thinks is optimal.

My goal is to create a GPU sieve that is as fast as the CPU sieve for most users. To do this, the GPU sieve must be fast enough so that it can sieve deeper than the CPU -- where the time saved by reduced TF candidates equals the time spent doing GPU sieving.

This may not be possible, but it is good to have goals!
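
To put a rough number on that break-even point: for a fixed exponent p the candidates have the form 2kp+1, so an odd sieve prime q (other than p) removes exactly one candidate in every q. The share of candidates left for the GPU is then the product of (1 - 1/q) over the sieve primes. The sketch below is only an illustration, not mfaktc code: treating SIEVE_PRIMES as "the first n odd primes" is an assumption, and the names first_odd_primes, surviving_fraction and sieve_pays_off are made up for this example.

```python
import math


def first_odd_primes(count):
    """Return the first `count` odd primes (3, 5, 7, ...) via a sieve of Eratosthenes."""
    if count <= 0:
        return []
    limit = 64
    while True:
        is_prime = bytearray([1]) * (limit + 1)
        is_prime[0:2] = b"\x00\x00"
        for i in range(2, math.isqrt(limit) + 1):
            if is_prime[i]:
                # Cross off the multiples of i, starting at i*i.
                is_prime[i * i :: i] = bytearray(len(range(i * i, limit + 1, i)))
        odd_primes = [n for n in range(3, limit + 1) if is_prime[n]]
        if len(odd_primes) >= count:
            return odd_primes[:count]
        limit *= 2  # not enough primes yet: retry with a larger table


def surviving_fraction(sieve_primes):
    """Share of candidates 2kp+1 left after sieving with the first n odd primes.

    Assumption: the exponent p is not one of the sieve primes (true for
    exponents in the 50M-60M range discussed in this thread).
    """
    fraction = 1.0
    for q in first_odd_primes(sieve_primes):
        fraction *= 1.0 - 1.0 / q
    return fraction


def sieve_pays_off(extra_sieve_seconds, tf_seconds_before, frac_before, frac_after):
    """True if the TF time saved by deeper sieving covers the extra sieving time."""
    saved = tf_seconds_before * (1.0 - frac_after / frac_before)
    return saved >= extra_sieve_seconds


if __name__ == "__main__":
    for sp in (5000, 25000, 40000):
        print(sp, round(surviving_fraction(sp), 4))
```

Sieving deeper is worthwhile only while the TF time it saves is at least the time the extra sieving costs, which is what sieve_pays_off checks.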
Old 2012-07-20, 19:37   #2
pinhodecarlos
 
"Carlos Pinho"
Oct 2011
Milton Keynes, UK

3×17×97 Posts

You need to think of it as a specific energy ratio, like (kWh/sieve_depth)_GPU vs (kWh/sieve_depth)_CPU_all_cores.

Last fiddled with by pinhodecarlos on 2012-07-20 at 19:38
Old 2012-07-20, 22:03   #3
kladner
 
"Kieren"
Jul 2011
In My Own Galaxy!

2·3·1,693 Posts

The upper instance is working on a low 50M. The lower instance is a low 60M.
This is on a GTX 460 running at 810 MHz being fed by 2 cores of a Phenom II x6 1090T running at 3.5 GHz.
Attached Thumbnails
Name: mfaktc_capture.JPG (191.3 KB)
Old 2012-07-20, 23:52   #4
Xyzzy
 
"Mike"
Aug 2002

2·23·179 Posts

We use 5,000 per instance of mfaktc, with each instance locked to one core. This uses 9-12% of each core on our (underclocked to 3GHz) 2500K processors. The remaining cycles per box are used with Prime95 running a P-1 test with four worker threads. The GPUs (GTX 570) are almost saturated with three instances so the fourth instance is there as a backup in case one instance fails and to fully utilize the GPU. TF from 71 to 72 bits takes around three hours per core and P-1 results are around two per day per box.

We can get the same TF performance, without the P-1 work, by underclocking the CPUs to 2GHz. Our room temperature is 83°F with just TF and 88°F with TF and P-1.

Using ~25,000 per instance of mfaktc improves 71 to 72 bit times by 15 minutes or so, which is not an efficient use of cycles in our opinion.
Old 2012-07-21, 00:18   #5
Dubslow
Basketry That Evening!
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

16065₈ Posts

With a 460 and one core of a 2600K, 5000 was barely low enough to saturate it.

Are you starting with rcv's sieve or from scratch?

Last fiddled with by Dubslow on 2012-07-21 at 00:24 Reason: new question
Old 2012-07-21, 01:10   #6
TObject
 
Feb 2012

3⁴×5 Posts

GTX 580, 797MHz on Core i7 920, 2.67GHz
Attached Thumbnails
Name: 072012.gif (97.5 KB)
Old 2012-07-21, 02:24   #7
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

7,537 Posts

Quote:
Originally Posted by Dubslow
Are you starting with rcv's sieve or from scratch?

I started from bsquared's code as well as many ideas from rcv's code.

At this point, I think I can make a GPU sieve as fast as a SIEVE_PRIMES=5000 CPU sieve. I've got a few more ideas I'm working on.

The responses are interesting in that the 3 replies have SIEVE_PRIMES anywhere from 5000 to 40000. Also interesting is that Kladner's 460 reports 100M/sec, yet my 460 reports it can do up to 205M/sec.

Last fiddled with by Prime95 on 2012-07-21 at 02:26
Old 2012-07-21, 02:43   #8
Dubslow
Basketry That Evening!
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3·29·83 Posts

Kladner runs two instances, so between them they get 200M/s. That's also roughly what I got with mine. The main reason is he's using a Phenom II, whose cores are individually slower than our SB cores. (At least I presume your 460 is in your SB.)

SP on its own mostly varies with CPU/GPU throughput ratios. For all but the slowest cards or fastest chips though, SP has to be as low as possible (5000 was the lower bound set by TheJudger, though rcv et al. demonstrated that SP below 2000 runs even faster).
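
The throughput-ratio point can be sketched as a simple selection rule: the GPU tests candidates at a roughly fixed rate, each CPU instance delivers fewer sieved candidates per second as SP goes up, and the best SP is the largest one at which the instances together still keep the GPU fed. This is only an illustration of that reasoning; pick_sieve_primes is a made-up name, and the per-instance rates in the usage lines are invented placeholders, not measurements from anyone in this thread.

```python
def pick_sieve_primes(cpu_rate_by_sp, gpu_rate, instances=1):
    """Pick the largest SIEVE_PRIMES that still keeps the GPU saturated.

    cpu_rate_by_sp: dict mapping a SIEVE_PRIMES value to the candidates/s
                    that ONE CPU instance delivers at that setting (measured
                    by the user; hypothetical input for this sketch).
    gpu_rate:       candidates/s the GPU can test when fully fed.
    instances:      number of CPU instances feeding the same GPU.

    If no setting saturates the GPU, the CPU is the bottleneck and the
    smallest available SIEVE_PRIMES is returned.
    """
    feasible = [
        sp for sp, rate in cpu_rate_by_sp.items()
        if rate * instances >= gpu_rate
    ]
    return max(feasible) if feasible else min(cpu_rate_by_sp)


if __name__ == "__main__":
    # Invented per-instance feed rates (candidates/s), for illustration only.
    rates = {5000: 110e6, 10000: 95e6, 25000: 70e6, 40000: 55e6}
    for n in (1, 2, 3):
        print(n, "instance(s) ->", pick_sieve_primes(rates, 205e6, instances=n))
```

With a slow CPU or a fast card the rule falls back to the minimum, which matches the observation that SP ends up pinned at the lower bound on most setups.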
Old 2012-07-21, 02:53   #9
kladner
 
"Kieren"
Jul 2011
In My Own Galaxy!

2×3×1,693 Posts

Quote:
Originally Posted by Dubslow
Kladner runs two instances, so between them they get 200M/s. That's also roughly what I got with mine. The main reason is he's using a Phenom II, whose cores are individually slower than our SB cores. (At least I presume your 460 is in your SB.)

SP on its own mostly varies with CPU/GPU throughput ratios. For all but the slowest cards or fastest chips though, SP has to be as low as possible (5000 was the lower bound set by TheJudger, though rcv et al. demonstrated that SP below 2000 runs even faster).

Correct on my setup, and the reasons for it. I did have the idea that SP adjustment was driven by CPU wait time in current versions of mfaktc. I'm running v 0.18.

EDIT: The attached is my night setup with SP @ 19000 adjustable, and NumStreams=5, Priority=High. The machine is virtually unusable. I have to shut mfaktc down first thing in the AM and reset NumStreams back to 3 and Priority back to Low.

EDIT2: These are the same exponents as the previous post.
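
The wait-time idea mentioned above can be sketched as a small feedback step: if the CPU thread spends a noticeable share of its time waiting for the GPU, it has slack and can sieve deeper; if it almost never waits, the GPU is probably starved and SP should come down. This is a guess at the general shape only, not mfaktc's actual code. The function name, the thresholds, the 10% step and the upper bound are all hypothetical; only the 5000 lower bound comes from this thread.

```python
def adjust_sieve_primes(sp, cpu_wait_fraction, sp_min=5000, sp_max=200000,
                        low=0.02, high=0.10, step_percent=10):
    """Return the next SIEVE_PRIMES value after one adjustment step.

    cpu_wait_fraction: share of wall time (0.0 to 1.0) the CPU thread spent
                       idle, waiting for the GPU to accept more candidates.
    low, high:         hypothetical thresholds; between them SP is left alone.
    step_percent:      hypothetical relative step size, in percent.
    """
    delta = sp * step_percent // 100
    if cpu_wait_fraction > high:
        sp += delta   # CPU has slack: sieve deeper, send fewer candidates
    elif cpu_wait_fraction < low:
        sp -= delta   # GPU is likely starved: sieve less, feed it faster
    return max(sp_min, min(sp_max, sp))


if __name__ == "__main__":
    sp = 19000
    # Invented sequence of wait fractions, for illustration only.
    for wait in (0.50, 0.30, 0.05, 0.01, 0.00):
        sp = adjust_sieve_primes(sp, wait)
        print(wait, sp)
```

Clamping at the lower bound is why a CPU-limited setup ends up sitting at 5000 no matter how the controller is tuned.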
Attached Thumbnails
Name: mfaktc_SP19000Adjust_NumStr5_PriHi.JPG (127.7 KB)

Last fiddled with by kladner on 2012-07-21 at 03:49
Old 2012-07-21, 05:17   #10
LaurV
Romulan Interpreter
 
Jun 2011
Thailand

7²×197 Posts

GTX 580 x2 or x3, with an i7 2600K, SievePrimes set to auto-adjust (25000), all threads locked to physical cores if there are fewer than 4 threads, or to logical cores if there are more than 4 threads.

With two instances of mfaktc per card and 2 cards, the numbers stabilize around 5800-6500, more or less. GPU occupancy is ~87-90%. Raising the CPU clock to around 4.4 GHz increases the number to 7000-8000 or more (but it is unstable, it rises and falls) and the GPU occupancy to about 96-98%.

With two cards and 3 instances per card, or 3 cards and 2 instances per card, the numbers drop quickly to 5000 and stay there. GPU occupancy goes to 92-97% and stays there (with 2 cards; with 3 cards it is lower, about 70%). There is no way to max out the cards with this setup. The CPU is the bottleneck, even if I clock it over 4.5 GHz for short periods of time. I can't leave the clock that high: even with a water-cooled ASUS Maximus Extreme, I saw a couple of blue screens. Depending on the weather (the radiator is outside the house), the CPU is usually clocked between a minimum of 3.6 GHz during the day and 4.4 GHz during the night.

With 3 cards I can run a total of 8 instances, mapped to logical cores. The active (video) card gets only two instances, and I can still watch videos (which I never do, in fact, but low-res video, like DVD-rip quality, is playable with VLC player). The GPUs stabilize somewhere around 70% on average (the one with 2 instances is lower). I tried 3 instances per card, but the computer can't be used anymore and there is no improvement in output; locking the ninth instance to any number of CPUs doesn't help either. Of course, SievePrimes drops quickly to 5000 and stays there, no matter what I do with the CPU clock.

Sometimes I wish mfaktc would let me set the number below 5000, as the system is CPU... not bottlenecked, but totally "strangled". P95 never runs together with mfaktc, as the performance of both becomes really lousy. When I do TF with one card only, for example (when the other two are doing CL, or are busy with my fantasies or daily job), the remaining free CPU cores (if any) run aliqueit and yafu.
Old 2012-07-21, 05:20   #11
LaurV
Romulan Interpreter
 
Jun 2011
Thailand

7²·197 Posts

Summary: tl;dr version:
SievePrimes is 5000, and it would be very nice if you implemented GPU sieving in mfaktc.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.