mersenneforum.org  

Old 2010-07-28, 13:15   #331
TheJudger
 
"Oliver"
Mar 2005
Germany

11·101 Posts

Hi Ludovic,

Quote:
Originally Posted by Aillas View Post
This is the standard behavior. So it's ok for me. I didn't want to run the program many days for nothing.
You can (and should) run the built-in self-test:
Code:
./mfaktc.exe -st
This might remove your checkpoint file, so make a copy of mfaktc.ckp first.

Quote:
Originally Posted by Aillas View Post
Now, I'm curious how many days it will take to sieve 3321931967 from 76 to 77 bit on a Quadro NVS 140M.
Once it finishes the first class you can extrapolate the total runtime. In the standard configuration there are 96 classes (with MORE_CLASSES enabled in params.h there are 960). So just multiply the time for the first class by 96. I assume that you leave the defaults in mfaktc.ini. I think SievePrimes will increase from class to class on your system up to 100000, and the time per class will go down a little.
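
The extrapolation described above can be sketched in a few lines of Python. This is only an illustration: the function name `estimate_total_runtime` is made up for this example, the default of 96 classes is the standard-configuration figure given above, and the sample timing is the class 0 line from the log posted later in this thread.

```python
def estimate_total_runtime(ms_per_class: float, active_classes: int = 96) -> dict:
    """Extrapolate the total runtime from the time taken by a single class.

    ms_per_class   -- time for one class in milliseconds, as printed in the log
    active_classes -- number of classes actually processed (96 by default;
                      960 when MORE_CLASSES is enabled, per the post above)
    """
    total_seconds = ms_per_class / 1000.0 * active_classes
    return {
        "seconds": total_seconds,
        "hours": total_seconds / 3600.0,
        "days": total_seconds / 86400.0,
    }


if __name__ == "__main__":
    # 8622263 ms is the class 0 timing from the log in this thread.
    estimate = estimate_total_runtime(8622263)
    print(f"{estimate['hours']:.1f} h, {estimate['days']:.2f} days")
```

Since the time per class is expected to drop a little as SievePrimes rises, a figure based on the first class is likely to be slightly pessimistic.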

Oliver
Old 2010-07-28, 15:06   #332
Aillas
 
Oct 2002
France

3³×5 Posts

So roughly, it will be tested in 60 hours, i.e. 2 1/2 days.

Code:
tf(3321931967, 76, 77, ...);
 k_min = 11372578438620
 k_max = 22745156877535
Using GPU kernel "95bit_mul32"
class    0: tested 5293211648 candidates in 8622263ms (613900/sec) (avg. wait: 1657638usec)
avg. wait > 500usec, increasing SievePrimes to 27000
class    4: tested 5257560064 candidates in 8564600ms (613871/sec) (avg. wait: 1656671usec)
avg. wait > 500usec, increasing SievePrimes to 29000
class    9: tested 5225054208 candidates in 8509699ms (614011/sec) (avg. wait: 1653730usec)
avg. wait > 500usec, increasing SievePrimes to 31000
class   12: tested 5195694080 candidates in 8461917ms (614009/sec) (avg. wait: 1654384usec)
avg. wait > 500usec, increasing SievePrimes to 33000
class   24: tested 5168431104 candidates in 8417560ms (614005/sec) (avg. wait: 1655990usec)
avg. wait > 500usec, increasing SievePrimes to 35000
class   25: tested 5142216704 candidates in 8374795ms (614011/sec) (avg. wait: 1655011usec)
avg. wait > 500usec, increasing SievePrimes to 37000
class   28: tested 5118099456 candidates in 8335604ms (614004/sec) (avg. wait: 1653319usec)
avg. wait > 500usec, increasing SievePrimes to 39000
class   33: tested 5096079360 candidates in 8299648ms (614011/sec) (avg. wait: 1651130usec)
avg. wait > 500usec, increasing SievePrimes to 41000
class   37: tested 5074059264 candidates in 8263817ms (614009/sec) (avg. wait: 1652407usec)
avg. wait > 500usec, increasing SievePrimes to 43000
class   40: tested 5054136320 candidates in 8231369ms (614009/sec) (avg. wait: 1651757usec)
avg. wait > 500usec, increasing SievePrimes to 45000
class   45: tested 5035261952 candidates in 8201827ms (613919/sec) (avg. wait: 1647743usec)
avg. wait > 500usec, increasing SievePrimes to 47000
class   49: tested 5017436160 candidates in 8176214ms (613662/sec) (avg. wait: 1642147usec)
avg. wait > 500usec, increasing SievePrimes to 49000
class   52: tested 4999610368 candidates in 8142560ms (614009/sec) (avg. wait: 1635144usec)
avg. wait > 500usec, increasing SievePrimes to 51000
The config:

Code:
Compiletime Options
  THREADS_PER_GRID_MAX 1048576
  THREADS_PER_BLOCK    256
  SIEVE_SIZE_LIMIT     32kiB
  SIEVE_SIZE           230945bits
  VERBOSE_TIMING       disabled
  MORE_CLASSES         disabled

Runtime Options
  SievePrimes          25000
  SievePrimesAdjust    1
  NumStreams           3
  WorkFile             worktodo.txt
  Checkpoints          enabled
  Stages               enabled
  StopAfterFactor      bitlevel

CUDA device info
  name:                      Quadro NVS 140M
  compute capability:        1.1
  maximum threads per block: 512
  number of multiprocessors: 2 (16 shader cores)
  clock rate:                800MHz

Automatic parameters
  threads per grid:          1048576
Old 2010-07-29, 07:35   #333
TheJudger
 
"Oliver"
Mar 2005
Germany

11·101 Posts

Hi Ludovic,

For your next run (if you do one) I recommend that you edit mfaktc.ini and set SievePrimes=100000. Your CPU is easily capable of feeding your GPU with factor candidates fast enough.

You had problems with mfaktc 0.09 (cudaStreamCreate() failed); how often have you tried it? I've noticed on my system that driver 256.35 sometimes isn't capable of running even the simplest examples from the SDK. In that case I shut down my X server, reload the nvidia kernel module and start X again.

The code for memory/stream allocation is unchanged between 0.09 and 0.10 (except for a printf() change).

---
Luigi, any idea how Ludovic's timings compare to a recent CPU running factor5?

Oliver
Old 2010-07-29, 13:32   #334
ET_
Banned
 
"Luigi"
Aug 2002
Team Italia

61·79 Posts

I'm running some benchmarks right now; I'll let you know soon.

[Done]
About 3.25 days on an i5-750 @ 2.67 GHz with all 4 cores in use.


Luigi

Last fiddled with by ET_ on 2010-07-29 at 13:56
Old 2010-07-29, 14:31   #335
TheJudger
 
"Oliver"
Mar 2005
Germany

457₁₆ Posts

Hi Luigi, Ludovic,

When I compare those numbers, I would say that it is perhaps not worth running mfaktc on this GPU.

Once SievePrimes reaches 100000 I assume about 8000000 ms per class. That is 8000 seconds, or a bit over 2 hours, for one class.
8000s * 96 (classes) = 768000s = ~213h = ~8.9 days!
OK, it is just one CPU core (at unknown speed), but compared to your i5-750 this isn't much faster than a single core of an i7.

Oliver
Old 2010-07-29, 14:42   #336
Aillas
 
Oct 2002
France

3³·5 Posts

Hi,

I run mfaktc on a laptop with a Core 2 Duo: Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz. It's not an i7 quad core, so 2 1/2 days with one core, while the other one does other crunching, is not bad.

I was looking for factor5 for 32-bit Linux but didn't find it, so I tried mfaktc.

I will let mfaktc finish its task; then I will try SievePrimes=100000 as you suggest to see if there is any improvement.

About the difference between the two versions, I think it comes from the compilation options in the Makefile (the -arch).

Ludovic

Last fiddled with by Aillas on 2010-07-29 at 14:42
Old 2010-08-02, 13:37   #337
Aillas
 
Oct 2002
France

3³·5 Posts

Hi,

I must be missing something, or I don't understand how classes work. I thought I would finish my exponent this weekend, but found this this morning:

Code:
class  273: tested 4722786304 candidates in 7693152ms (613894/sec) (avg. wait: 1608724usec)
When you talked about 96 classes, I thought the program would end once something like the line below was displayed:

Code:
class   96: tested 5118099456 candidates in ....
So I read this thread a little more carefully and found something about MORE_CLASSES.

In one of your posts, you said:
Quote:
Once it finishes the first class you can extrapolate the total runtime. In standard configuration there are 96 classes (and with MORE_CLASSES in params.h enabled there are 960 classes).
I opened params.h:

Code:
If MORE_CLASSES is defined than the while TF process is split into 4620
(4 * 3*5*7*11) classes. Otherwise it will be split into 420 (4 * 3*5*7)
classes. With 4620 the siever runs a bit more efficent at the cost of 10 times
more sieve initializations. This will allow to increase SIEVE_PRIMES a little
bit further.
This starts to become usefull on my system for e.g. TF M66xxxxxx from 2^66 to
So, does that mean I will finish my exponent when I reach class 420?

Code:
class  420: tested 4722786304 candidates in 7693152ms (613894/sec) (avg. wait: 1608724usec)
If so, the time remaining for this exponent should be:
(420-273) * 7700 sec = 314 hours = 13 days (??)

Thanks

Ludovic
Old 2010-08-02, 14:16   #338
ET_
Banned
 
"Luigi"
Aug 2002
Team Italia

12D3₁₆ Posts

I think so. You need to go through the whole cycle of 420 classes, of which only 96 are actually used for sieving. Notice that the class indicator does not advance by 1 per iteration.
Old 2010-08-02, 15:59   #339
TheJudger
 
"Oliver"
Mar 2005
Germany

11×101 Posts

Hi Aillas,

Quote:
Originally Posted by Aillas View Post
Hi,

I must be missing something, or I don't understand how classes work. I thought I would finish my exponent this weekend, but found this this morning:

Code:
class  273: tested 4722786304 candidates in 7693152ms (613894/sec) (avg. wait: 1608724usec)
Maybe my description was not detailed enough, sorry.
You're running without MORE_CLASSES, so there are 420 classes, but most of them can be eliminated entirely; 96 classes remain. So take the time of one class and multiply by 96. See post #335, where I predicted 8.9 days.
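
To illustrate why only 96 of the 420 classes remain, here is a rough Python sketch. It is not taken from the mfaktc source. It assumes the standard facts about factors of a Mersenne number with prime exponent p (every factor has the form q = 2kp + 1 and satisfies q ≡ ±1 mod 8) and assumes that a class is simply k mod 420 (or k mod 4620 with MORE_CLASSES). The function name `surviving_classes` is hypothetical, and the exponent is assumed not to be divisible by any of the small primes.

```python
def surviving_classes(exponent: int, small_primes=(3, 5, 7)) -> list:
    """Return the residues of k modulo 4 * prod(small_primes) that can still
    contain factors q = 2*k*exponent + 1.

    A residue class is dropped when every q in it is
      - not congruent to +1 or -1 modulo 8, or
      - divisible by one of the small primes.
    """
    num_classes = 4
    for prime in small_primes:
        num_classes *= prime

    survivors = []
    for k in range(num_classes):
        q = 2 * k * exponent + 1  # representative candidate for this class
        if q % 8 not in (1, 7):
            continue
        if any(q % prime == 0 for prime in small_primes):
            continue
        survivors.append(k)
    return survivors


if __name__ == "__main__":
    classes = surviving_classes(3321931967)
    print(len(classes), classes[:13])
```

Under these assumptions the surviving class numbers for 3321931967 start with 0, 4, 9, 12, 24, 25, ..., which matches the class numbers in the log posted earlier, and class 273 is among them.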

With your latest timing (7693 seconds per class) you're at 7693s * 96 classes = 8.55 days.

You can see that the time needed per class goes down over time on your system because SievePrimes increases all the way up to 100k. This removes more candidates by sieving; you can see that the number of candidates tested per class has gone down since the start. Next time you can start directly at SievePrimes=100000 in mfaktc.ini. You're totally GPU-bound (very high average wait), so your CPU easily handles sieving that much.
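
The SievePrimes ramp visible in the log can be modelled roughly as follows. This is a sketch under assumptions, not the actual mfaktc logic: the 500 usec threshold and the step of 2000 are read off the log lines ("avg. wait > 500usec, increasing SievePrimes to 27000"), the cap of 100000 comes from the posts above, and the function names are made up. The real program may well adjust differently, for example by lowering SievePrimes again when the wait is short.

```python
def adjust_sieve_primes(sieve_primes: int, avg_wait_usec: float,
                        threshold_usec: float = 500, step: int = 2000,
                        maximum: int = 100000) -> int:
    """Raise SievePrimes by one step when the GPU wait is above the threshold.

    All default values are assumptions inferred from the log in this thread.
    """
    if avg_wait_usec > threshold_usec:
        return min(sieve_primes + step, maximum)
    return sieve_primes


def classes_until_maximum(start: int, avg_wait_usec: float,
                          maximum: int = 100000) -> int:
    """Count how many classes it takes to reach the maximum at a constant wait."""
    value = start
    count = 0
    while value < maximum:
        new_value = adjust_sieve_primes(value, avg_wait_usec, maximum=maximum)
        if new_value == value:
            break  # wait is below the threshold, no further increase
        value = new_value
        count += 1
    return count


if __name__ == "__main__":
    # Start value 25000 and a wait of about 1650000 usec, as in the log.
    print(classes_until_maximum(25000, 1650000))
```

With these assumed values, a run starting at 25000 spends a few dozen classes below the cap, which is why starting directly at SievePrimes=100000 saves some time on a GPU-bound system.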

Oliver
Old 2010-08-03, 06:50   #340
Aillas
 
Oct 2002
France

3³×5 Posts

Thanks for the explanation.

Back to crunching now. Just do it
Old 2010-08-04, 08:28   #341
TheJudger
 
"Oliver"
Mar 2005
Germany

11·101 Posts

Hello,

Just some GPU benchmarks for those who are interested. The values are the raw GPU speed, without taking sieve performance into account. On a GTX 4x0 you'll usually need two instances of mfaktc to utilize the GPU 100%.
Percentages are the speed relative to the 71bit kernel.

Slightly factory overclocked GTX 275 (1458 MHz SP clock (reference 1404MHz)):
Code:
kernel | M66362159, 2^64 to 2^67 | M3321932839, 2^50 to 2^71
-------+-------------------------+--------------------------
71bit  | 80.8M/s                 | 62.3M/s         0.09-pre5
75bit  | 62.5M/s           77.4% | 48.2M/s             77.4%
95bit  | 52.2M/s           64.6% | 40.3M/s             64.7%
Impressively factory-overclocked GTX 460 (1600MHz SP clock (reference 1350MHz)):
Code:
kernel | M66362159, 2^64 to 2^67 | M3321932839, 2^50 to 2^71
-------+-------------------------+--------------------------
71bit  | 85.2M/s                 | 65.8M/s         0.10-pre7
75bit  | 145.2M/s         170.4% | 112.2M/s           170.5%
95bit  | 120.2M/s         141.1% | 92.8M/s            141.0%
Stock GTX 470:
Code:
kernel | M66362159, 2^64 to 2^67 | M3321932839, 2^50 to 2^71
-------+-------------------------+--------------------------
71bit  | 102.7M/s                | 79.7M/s         0.10-pre7
75bit  | 183.8M/s         179.0% | 143.8M/s           180.4%
95bit  | 155.4M/s         151.3% | 121.2M/s           152.1%
In this comparison the GTX 470 gives the most bang for the buck, while the GTX 460 has the highest performance per watt.
My GTX 275 feels kind of slow these days
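
For anyone who wants to redo the percentage columns, a small Python sketch follows. It simply divides each kernel's rate by the 71bit rate on the same card. The helper name `relative_speed` and the dictionary layout are made up for this example; the numbers are the M66362159 columns from the tables above.

```python
def relative_speed(rate: float, baseline: float) -> float:
    """Speed of a kernel as a percentage of the baseline (71bit) kernel."""
    return round(100.0 * rate / baseline, 1)


# Raw GPU rates in M candidates per second (M66362159, 2^64 to 2^67).
rates = {
    "GTX 275": {"71bit": 80.8, "75bit": 62.5, "95bit": 52.2},
    "GTX 460": {"71bit": 85.2, "75bit": 145.2, "95bit": 120.2},
    "GTX 470": {"71bit": 102.7, "75bit": 183.8, "95bit": 155.4},
}

if __name__ == "__main__":
    for card, kernels in rates.items():
        baseline = kernels["71bit"]
        for kernel in ("75bit", "95bit"):
            percent = relative_speed(kernels[kernel], baseline)
            print(f"{card} {kernel}: {percent}%")
```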

Oliver

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.