mersenneforum.org  

mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Old 2010-05-16, 18:21   #221
TheJudger
 
"Oliver"
Mar 2005, Germany

Actually, I have no clue why it runs so slowly on your system.
A high "average wait" usually indicates that the CPU (the siever) is NOT limiting the whole application.
The "average wait" also goes up when SievePrimes is lowered; if it doesn't, it could be that CPU and GPU don't run interleaved for some reason (the GPU should process one block while the CPU preprocesses the next).

You also have relatively large jitter in the average wait.
Quote:
Originally Posted by frmky
Code:
class    9: tested 160235520 candidates in 5835ms (27461100/sec) (avg. wait: 15431usec)
class   12: tested 160235520 candidates in 5267ms (30422540/sec) (avg. wait: 11899usec)
sp = 26250, min = 5285
sp = 13125, min = 0, max = 52500, prev = 26250 5285
class   20: tested 170065920 candidates in 5552ms (30631469/sec) (avg. wait: 14334usec)
class   21: tested 170065920 candidates in 5434ms (31296635/sec) (avg. wait: 13653usec)
sp = 13125, min = 5442
sp = 32812, min = 13125, max = 52500, prev = 26250 0
SievePrimes is the same for classes 9 and 12, so the runtime and average wait should be the same. Classes 20 and 21 show the same problem.

On the other hand, you get full GPU performance when you have 3 copies running at the same time.

Do you run the application from X? Perhaps you could try running it from the command line or remotely via ssh.

A wild guess: latency issues on GPU-kernel startup?
Can you enable VERBOSE_TIMING in params.h? (I hope it compiles; IIRC I broke it in one release, but if that is the case you can fix it easily: the compile issues are just wrong format strings in printf().)
If you do, redirect the output to a file! It generates a lot of output.

PCIe type/speed?


Thank you,
Oliver

Last fiddled with by TheJudger on 2010-05-16 at 18:23
Old 2010-05-16, 19:30   #222
frmky
 
Jul 2003, So Cal

This is an Intel G35 board, so it's running PCIe 1.0 x16. It's a headless compute node with no X running, controlled remotely via ssh. Enabling MORE_CLASSES improves the speed:

default params, one binary running:
Code:
mfaktc v0.06
Compiletime Options
  THREADS_PER_GRID    983040
  THREADS_PER_BLOCK   256
  SIEVE_SIZE_LIMIT    32kiB
  SIEVE_SIZE          230945bits
  USE_PINNED_MEMORY   enabled
  USE_ASYNC_COPY      enabled
  VERBOSE_TIMING      disabled
  SELFTEST            disabled
  MORE_CLASSES        disabled

Runtime Options
  SievePrimes         25000
  SievePrimesAdjust   2
  CudaStreams         2

CUDA device info
  name:                      GeForce GTX 480
  compute capabilities:      2.0
  maximum threads per block: 1024
  number of multiprocessors: 15 (120 shader cores)
  clock rate:                1401MHz

tf(3321932839, 66, 71);
 k_min = 11106030600
 k_max = 355392982921
class    0: tested 150405120 candidates in 5694ms (26414668/sec) (avg. wait: 11717usec)
class    5: tested 150405120 candidates in 5694ms (26414668/sec) (avg. wait: 11716usec)
sp = 52500, min = 5729
sp = 26250, min = 0, max = 100000, prev = 52500 5729
class    9: tested 160235520 candidates in 5238ms (30590973/sec) (avg. wait: 11717usec)
class   12: tested 160235520 candidates in 5238ms (30590973/sec) (avg. wait: 11717usec)
sp = 26250, min = 5255
sp = 13125, min = 0, max = 52500, prev = 26250 5255
class   20: tested 170065920 candidates in 5094ms (33385535/sec) (avg. wait: 11718usec)
class   21: tested 170065920 candidates in 5093ms (33392091/sec) (avg. wait: 11718usec)
sp = 13125, min = 5102
sp = 6562, min = 0, max = 26250, prev = 13125 5102
class   29: tested 181862400 candidates in 5060ms (35941185/sec) (avg. wait: 11719usec)
class   32: tested 181862400 candidates in 5060ms (35941185/sec) (avg. wait: 11719usec)
sp = 6562, min = 5064
sp = 3281, min = 0, max = 13125, prev = 6562 5064
class   36: tested 195624960 candidates in 5095ms (38395477/sec) (avg. wait: 11719usec)
class   41: tested 195624960 candidates in 5095ms (38395477/sec) (avg. wait: 11719usec)
sp = 3281, min = 5097
sp = 8203, min = 3281, max = 13125, prev = 6562 0
class   44: tested 177930240 candidates in 5060ms (35164079/sec) (avg. wait: 11721usec)
class   56: tested 177930240 candidates in 5062ms (35150185/sec) (avg. wait: 11719usec)
sp = 8203, min = 5066
sp = 5742, min = 3281, max = 13125, prev = 8203 5066
class   57: tested 184811520 candidates in 5074ms (36423240/sec) (avg. wait: 11720usec)
class   60: tested 184811520 candidates in 5074ms (36423240/sec) (avg. wait: 11719usec)
sp = 5742, min = 5078
sp = 9433, min = 5742, max = 13125, prev = 8203 0
class   65: tested 175964160 candidates in 5085ms (34604554/sec) (avg. wait: 11718usec)
class   69: tested 175964160 candidates in 5085ms (34604554/sec) (avg. wait: 11719usec)
sp = 9433, min = 5091
sp = 7587, min = 5742, max = 13125, prev = 9433 5091
class   72: tested 178913280 candidates in 5049ms (35435389/sec) (avg. wait: 11719usec)
class   77: tested 178913280 candidates in 5048ms (35442408/sec) (avg. wait: 11719usec)
MORE_CLASSES, one binary running:
Code:
mfaktc v0.06
Compiletime Options
  THREADS_PER_GRID    983040
  THREADS_PER_BLOCK   256
  SIEVE_SIZE_LIMIT    32kiB
  SIEVE_SIZE          193154bits
  USE_PINNED_MEMORY   enabled
  USE_ASYNC_COPY      enabled
  VERBOSE_TIMING      disabled
  SELFTEST            disabled
  MORE_CLASSES        enabled

Runtime Options
  SievePrimes         25000
  SievePrimesAdjust   2
  CudaStreams         2

CUDA device info
  name:                      GeForce GTX 480
  compute capabilities:      2.0
  maximum threads per block: 1024
  number of multiprocessors: 15 (120 shader cores)
  clock rate:                1401MHz

tf(3321932839, 66, 71);
 k_min = 11106027240
 k_max = 355392982921
class    0: tested 15728640 candidates in 560ms (28086857/sec) (avg. wait: 11713usec)
class    5: tested 15728640 candidates in 559ms (28137101/sec) (avg. wait: 11714usec)
sp = 52500, min = 594
sp = 26250, min = 0, max = 100000, prev = 52500 594
class    9: tested 16711680 candidates in 507ms (32961893/sec) (avg. wait: 11715usec)
class   20: tested 16711680 candidates in 507ms (32961893/sec) (avg. wait: 11717usec)
sp = 26250, min = 524
sp = 13125, min = 0, max = 52500, prev = 26250 524
class   21: tested 17694720 candidates in 491ms (36038126/sec) (avg. wait: 11718usec)
class   29: tested 17694720 candidates in 491ms (36038126/sec) (avg. wait: 11717usec)
sp = 13125, min = 499
sp = 6562, min = 0, max = 26250, prev = 13125 499
class   32: tested 18677760 candidates in 480ms (38912000/sec) (avg. wait: 11716usec)
class   36: tested 18677760 candidates in 481ms (38831101/sec) (avg. wait: 11717usec)
sp = 6562, min = 484
sp = 3281, min = 0, max = 13125, prev = 6562 484
class   41: tested 19660800 candidates in 473ms (41566173/sec) (avg. wait: 11720usec)
class   44: tested 19660800 candidates in 473ms (41566173/sec) (avg. wait: 11713usec)
sp = 3281, min = 476
sp = 1640, min = 0, max = 6562, prev = 3281 476
class   57: tested 21626880 candidates in 491ms (44046598/sec) (avg. wait: 11719usec)
class   60: tested 21626880 candidates in 491ms (44046598/sec) (avg. wait: 11718usec)
sp = 1640, min = 493
sp = 4101, min = 1640, max = 6562, prev = 3281 0
class   65: tested 19660800 candidates in 483ms (40705590/sec) (avg. wait: 11718usec)
class   69: tested 19660800 candidates in 483ms (40705590/sec) (avg. wait: 11717usec)
sp = 4101, min = 486
sp = 2870, min = 1640, max = 6562, prev = 4101 486
class   72: tested 20643840 candidates in 492ms (41959024/sec) (avg. wait: 11710usec)
class   77: tested 20643840 candidates in 492ms (41959024/sec) (avg. wait: 11711usec)
sp = 2870, min = 494
sp = 4716, min = 2870, max = 6562, prev = 4101 0
class   81: tested 19660800 candidates in 490ms (40124081/sec) (avg. wait: 11716usec)
class   84: tested 19660800 candidates in 490ms (40124081/sec) (avg. wait: 11716usec)
sp = 4716, min = 493
sp = 3793, min = 2870, max = 6562, prev = 4716 493
class   92: tested 19660800 candidates in 480ms (40960000/sec) (avg. wait: 11717usec)
class   96: tested 19660800 candidates in 480ms (40960000/sec) (avg. wait: 11717usec)
sp = 3793, min = 483
sp = 3331, min = 2870, max = 4716, prev = 3793 483
class  104: tested 19660800 candidates in 474ms (41478481/sec) (avg. wait: 11717usec)
class  105: tested 19660800 candidates in 474ms (41478481/sec) (avg. wait: 11718usec)
sp = 3331, min = 476
sp = 3100, min = 2870, max = 3793, prev = 3331 476
class  116: tested 19660800 candidates in 471ms (41742675/sec) (avg. wait: 11730usec)
class  117: tested 19660800 candidates in 471ms (41742675/sec) (avg. wait: 11719usec)
Verbose timing is attached.
Attached Files
File Type: zip verbose.zip (154.1 KB, 96 views)
Old 2010-05-17, 09:55   #223
TheJudger
 

Hi frmky,

those two non-verbose outputs are interesting. The average wait is constant regardless of how high or low SievePrimes is. This could indicate that the GPU code doesn't run concurrently with the CPU code, but I have no idea why.
The GPU throughput increases when the siever consumes less CPU time; that is why throughput increases with MORE_CLASSES (it makes the siever run more efficiently due to the denser candidate distribution).

Oliver

P.S. Generally (if the application runs as expected) this assignment is too small for proper use of MORE_CLASSES; the overhead is too big in this case.

Last fiddled with by TheJudger on 2010-05-17 at 09:56
Old 2010-05-18, 16:53   #224
TheJudger
 

Hi,

the performance problem which frmky "found" might be related to kjaget's modification. I gave frmky a beta of 0.07 and the problem did not occur.

I've looked into kjaget's version again and noticed one difference in the way the streams are used.

kjaget's version:
...
- wait for stream N to finish GPU calculations
- preprocess data for stream N
- start GPU calculation on stream N
- wait for stream N+1 to finish GPU calculations
- preprocess data for stream N+1
- start GPU calculation on stream N+1
...

my version (implemented in the 0.07 beta):
...
- preprocess data for stream N
- wait for stream N+1 to finish GPU calculations
- start GPU calculation on stream N
- preprocess data for stream N+1
- wait for stream N+2 to finish GPU calculations
- start GPU calculation on stream N+1
...

- on paper, both versions are correct!
- kjaget's version seems more intuitive
- my version needs N+1 streams to have N streams running on the GPU, while kjaget's version can run N streams with only N streams defined
- moving the wait behind the start in my version causes a small performance penalty on my system

JFYI: my version of 0.06 has a fixed number of streams (2) and uses 3 segments for preprocessing, so with 2 streams running on the GPU I can preprocess the third one.

I'm unsure whether this is the reason for the performance problem frmky noticed.

Oliver
Old 2010-05-18, 18:51   #225
kjaget
 
Jun 2005

Quote:
Originally Posted by TheJudger View Post

I'm unsure if this is the reason for the performance problem frmky has noticed.

Oliver
This is entirely possible. I'm glad you agree that on paper my approach should be the same, so at least I didn't miss anything really obvious. Without knowing the details of how the streams are scheduled and what sort of delays each step incurs, there's no way to know which works better other than testing, and we know the results there. It's been part of the frustration of working with CUDA for me, but at least we found an answer.

I'm beginning to wonder if the CUDA runtime will dispatch more than one stream at a time if they're ready and fit on the GPU hardware. Then, by having more than one stream ready to go at a time, the original version would be faster by shipping off multiple streams at once. This would fit with what I've read about the overhead for starting a stream being higher on Windows than on Linux. If the original code only had to dispatch half of the streams (by combining every 2 into 1), you'd reduce that overhead quite a bit.

In that case it's possible that increasing the number of streams in my version would eventually give good performance. Or then again, maybe the timing would never line up correctly with my code ... it's been that kind of week around here ;) I do know that on my system I saw steady performance with up to 6 streams and then eventually saw performance go down again, so it's also possible that the "correct" value for my code is high enough that it would hit another bottleneck.

Edit to add: if you need 3 or 4 instances of the program to get good throughput, you'd need 3 streams for each instance, or 1 instance with 9-12 (!) streams. Ouch, that starts to add up to real memory pretty quickly.

In any case, the problem is fixed in 0.07, so I don't have to worry quite so much. It still annoys me, since there shouldn't be any difference between the two approaches, but I'll live :)

Kevin

Last fiddled with by kjaget on 2010-05-18 at 18:57
Old 2010-05-19, 09:31   #226
TheJudger
 

On paper:

1 dataset (1 stream): preprocessing (CPU), upload and processing (GPU) are done sequentially.
2 datasets (1 or 2 streams, depending on the implementation): preprocessing of block N+1 (CPU) can be done simultaneously with the upload and processing of block N (GPU); upload and processing on the GPU are sequential.
3 datasets (2 or 3 streams, again depending on the implementation): preprocessing of block N+2, upload of block N+1 and processing of block N can all be done at the same time (this is how 0.06 works).
more than 3 datasets/streams: hides latencies, hides jitter in the runtime of sieve_candidates(), ...

0.07 uses as many streams as there are datasets (ktab on the CPU). It would be possible to save one stream and some GPU memory, but it doesn't hurt much, since the memory usage is still far below the available memory of today's GPUs.
With 5 streams I'm still below 50MB of memory usage.

Oliver
Old 2010-05-23, 20:59   #227
TheJudger
 

Hi,

I just tried my current version of mfaktc on my 8800 GTX (G80 chip, compute capability 1.0). The code doesn't run as expected. It seems that my fix for the issue with "multiple factors found at the same time" causes the problems.
So perhaps I have to declare the G80 chip "not supported" and add a check which refuses to run mfaktc on chips with compute capability 1.0. (AFAIK the G80 is the only chip with compute capability exactly 1.0.)
Another option would be to disable the fix on compute capability 1.0 GPUs.

Oliver
Old 2010-05-28, 07:30   #228
TheJudger
 

Hello,

find attached mfaktc 0.07.

Highlights of this version:
- parses Prime95 worktodo files (thank you, Luigi (ET_)!)
- changed the command-line interface
- should compile on Windows (thank you, Kevin (kjaget)!)
  Currently there is no compile script for Windows, but the code itself should compile.
- improved the siever performance a little bit

For details, take a look at Changelog.txt and README.txt.

Oliver
Attached Files
File Type: gz mfaktc-0.07.tar.gz (62.3 KB, 108 views)
Old 2010-05-28, 07:34   #229
TheJudger
 

And a small teaser from my current development version:

Code:
M3321929759 has a factor: 4103086300931724495689
M3321931637 has a factor: 28475025393798152885081
M3321931061 has a factor: 29833158347165530570273
M3321933893 has a factor: 36285087986156170392041
M3321929827 has a factor: 73716630445294762224353
Who will notice the difference first?

Oliver
Old 2010-05-28, 07:41   #230
fivemack
(loop (#_fork))
 
Feb 2006, Cambridge, England

76 bits: did you increase the base for the arithmetic or use four-digit numbers?
Old 2010-05-28, 08:09   #231
TheJudger
 

Hi fivemack,

that was fast!
I've increased the base to 32 bits per int. It is a second GPU kernel which allows factors up to 2^95 (untested so far). It uses 32-bit multiplication.

Raw GPU speed on my GTX 275, in candidates per second (percentage of the 71-bit kernel in parentheses):
Code:
GPU kernel	M66362159	M3321932839
71bit		80.8M		62.4M
75bit		61.0M (75.5%)	47.1M (75.5%)
95bit		51.1M (63.2%)	39.4M (63.1%)
The 75-bit kernel is the same as the 95-bit kernel, with the only difference that it skips the first iteration of the long division. I'm happy with the performance; it is already at the top end of my expectations. I expected 50-65% for the 95-bit code compared to the 71-bit code on non-Fermi GPUs.
I'll do some benchmarks on a GTX 480 soon! Perhaps the 75-bit kernel is even faster than the 71-bit kernel on Fermi, since Fermi has much faster 32-bit multiplication than non-Fermi GPUs.

Oliver