mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

TheJudger 2010-05-16 18:21

Actually I have no clue why it runs so slowly on your system.
A high "average wait" usually indicates that the CPU (the siever) is NOT the limiting factor for the whole application.
The "average wait" should also go up when SievePrimes is lowered; if it doesn't, it may be that CPU and GPU are not running interleaved (GPU processes one block while the CPU preprocesses the next block) for some reason.

You also have relatively large jitter in the average wait.
[QUOTE=frmky][CODE]
class 9: tested 160235520 candidates in 5835ms (27461100/sec) (avg. wait: 15431usec)
class 12: tested 160235520 candidates in 5267ms (30422540/sec) (avg. wait: 11899usec)
sp = 26250, min = 5285
sp = 13125, min = 0, max = 52500, prev = 26250 5285
class 20: tested 170065920 candidates in 5552ms (30631469/sec) (avg. wait: 14334usec)
class 21: tested 170065920 candidates in 5434ms (31296635/sec) (avg. wait: 13653usec)
sp = 13125, min = 5442
sp = 32812, min = 13125, max = 52500, prev = 26250 0[/CODE][/QUOTE]
SievePrimes is the same for classes 9 and 12, so the runtime and average wait should be the same. Classes 20 and 21 show the same problem.

On the other hand you get full GPU performance when you have 3 copies running at the same time.

Do you run the application from X? Perhaps you could try running it from the command line, or remotely via ssh.

A wild guess: latency issues on GPU kernel startup?
Can you enable VERBOSE_TIMING in params.h? (I hope it compiles; IIRC I broke it in one release, but if that is the case you can fix it easily (compile errors due to wrong format specifiers in printf()).)
If you do so: [B]redirect the output to a file![/B] This generates a lot of output.

PCIe type/speed?


Thank you,
Oliver

frmky 2010-05-16 19:30

1 Attachment(s)
This is an Intel G35 board, so it's running PCIe 1.0 x16. It's running on a headless compute node with no X running, controlled remotely by ssh. Enabling MORE_CLASSES improves the speed:

default params, one binary running:
[CODE]mfaktc v0.06
Compiletime Options
THREADS_PER_GRID 983040
THREADS_PER_BLOCK 256
SIEVE_SIZE_LIMIT 32kiB
SIEVE_SIZE 230945bits
USE_PINNED_MEMORY enabled
USE_ASYNC_COPY enabled
VERBOSE_TIMING disabled
SELFTEST disabled
MORE_CLASSES disabled

Runtime Options
SievePrimes 25000
SievePrimesAdjust 2
CudaStreams 2

CUDA device info
name: GeForce GTX 480
compute capabilities: 2.0
maximum threads per block: 1024
number of multiprocessors: 15 (120 shader cores)
clock rate: 1401MHz

tf(3321932839, 66, 71);
k_min = 11106030600
k_max = 355392982921
class 0: tested 150405120 candidates in 5694ms (26414668/sec) (avg. wait: 11717usec)
class 5: tested 150405120 candidates in 5694ms (26414668/sec) (avg. wait: 11716usec)
sp = 52500, min = 5729
sp = 26250, min = 0, max = 100000, prev = 52500 5729
class 9: tested 160235520 candidates in 5238ms (30590973/sec) (avg. wait: 11717usec)
class 12: tested 160235520 candidates in 5238ms (30590973/sec) (avg. wait: 11717usec)
sp = 26250, min = 5255
sp = 13125, min = 0, max = 52500, prev = 26250 5255
class 20: tested 170065920 candidates in 5094ms (33385535/sec) (avg. wait: 11718usec)
class 21: tested 170065920 candidates in 5093ms (33392091/sec) (avg. wait: 11718usec)
sp = 13125, min = 5102
sp = 6562, min = 0, max = 26250, prev = 13125 5102
class 29: tested 181862400 candidates in 5060ms (35941185/sec) (avg. wait: 11719usec)
class 32: tested 181862400 candidates in 5060ms (35941185/sec) (avg. wait: 11719usec)
sp = 6562, min = 5064
sp = 3281, min = 0, max = 13125, prev = 6562 5064
class 36: tested 195624960 candidates in 5095ms (38395477/sec) (avg. wait: 11719usec)
class 41: tested 195624960 candidates in 5095ms (38395477/sec) (avg. wait: 11719usec)
sp = 3281, min = 5097
sp = 8203, min = 3281, max = 13125, prev = 6562 0
class 44: tested 177930240 candidates in 5060ms (35164079/sec) (avg. wait: 11721usec)
class 56: tested 177930240 candidates in 5062ms (35150185/sec) (avg. wait: 11719usec)
sp = 8203, min = 5066
sp = 5742, min = 3281, max = 13125, prev = 8203 5066
class 57: tested 184811520 candidates in 5074ms (36423240/sec) (avg. wait: 11720usec)
class 60: tested 184811520 candidates in 5074ms (36423240/sec) (avg. wait: 11719usec)
sp = 5742, min = 5078
sp = 9433, min = 5742, max = 13125, prev = 8203 0
class 65: tested 175964160 candidates in 5085ms (34604554/sec) (avg. wait: 11718usec)
class 69: tested 175964160 candidates in 5085ms (34604554/sec) (avg. wait: 11719usec)
sp = 9433, min = 5091
sp = 7587, min = 5742, max = 13125, prev = 9433 5091
class 72: tested 178913280 candidates in 5049ms (35435389/sec) (avg. wait: 11719usec)
class 77: tested 178913280 candidates in 5048ms (35442408/sec) (avg. wait: 11719usec)
[/CODE]

MORE_CLASSES, one binary running
[CODE]mfaktc v0.06
Compiletime Options
THREADS_PER_GRID 983040
THREADS_PER_BLOCK 256
SIEVE_SIZE_LIMIT 32kiB
SIEVE_SIZE 193154bits
USE_PINNED_MEMORY enabled
USE_ASYNC_COPY enabled
VERBOSE_TIMING disabled
SELFTEST disabled
MORE_CLASSES enabled

Runtime Options
SievePrimes 25000
SievePrimesAdjust 2
CudaStreams 2

CUDA device info
name: GeForce GTX 480
compute capabilities: 2.0
maximum threads per block: 1024
number of multiprocessors: 15 (120 shader cores)
clock rate: 1401MHz

tf(3321932839, 66, 71);
k_min = 11106027240
k_max = 355392982921
class 0: tested 15728640 candidates in 560ms (28086857/sec) (avg. wait: 11713usec)
class 5: tested 15728640 candidates in 559ms (28137101/sec) (avg. wait: 11714usec)
sp = 52500, min = 594
sp = 26250, min = 0, max = 100000, prev = 52500 594
class 9: tested 16711680 candidates in 507ms (32961893/sec) (avg. wait: 11715usec)
class 20: tested 16711680 candidates in 507ms (32961893/sec) (avg. wait: 11717usec)
sp = 26250, min = 524
sp = 13125, min = 0, max = 52500, prev = 26250 524
class 21: tested 17694720 candidates in 491ms (36038126/sec) (avg. wait: 11718usec)
class 29: tested 17694720 candidates in 491ms (36038126/sec) (avg. wait: 11717usec)
sp = 13125, min = 499
sp = 6562, min = 0, max = 26250, prev = 13125 499
class 32: tested 18677760 candidates in 480ms (38912000/sec) (avg. wait: 11716usec)
class 36: tested 18677760 candidates in 481ms (38831101/sec) (avg. wait: 11717usec)
sp = 6562, min = 484
sp = 3281, min = 0, max = 13125, prev = 6562 484
class 41: tested 19660800 candidates in 473ms (41566173/sec) (avg. wait: 11720usec)
class 44: tested 19660800 candidates in 473ms (41566173/sec) (avg. wait: 11713usec)
sp = 3281, min = 476
sp = 1640, min = 0, max = 6562, prev = 3281 476
class 57: tested 21626880 candidates in 491ms (44046598/sec) (avg. wait: 11719usec)
class 60: tested 21626880 candidates in 491ms (44046598/sec) (avg. wait: 11718usec)
sp = 1640, min = 493
sp = 4101, min = 1640, max = 6562, prev = 3281 0
class 65: tested 19660800 candidates in 483ms (40705590/sec) (avg. wait: 11718usec)
class 69: tested 19660800 candidates in 483ms (40705590/sec) (avg. wait: 11717usec)
sp = 4101, min = 486
sp = 2870, min = 1640, max = 6562, prev = 4101 486
class 72: tested 20643840 candidates in 492ms (41959024/sec) (avg. wait: 11710usec)
class 77: tested 20643840 candidates in 492ms (41959024/sec) (avg. wait: 11711usec)
sp = 2870, min = 494
sp = 4716, min = 2870, max = 6562, prev = 4101 0
class 81: tested 19660800 candidates in 490ms (40124081/sec) (avg. wait: 11716usec)
class 84: tested 19660800 candidates in 490ms (40124081/sec) (avg. wait: 11716usec)
sp = 4716, min = 493
sp = 3793, min = 2870, max = 6562, prev = 4716 493
class 92: tested 19660800 candidates in 480ms (40960000/sec) (avg. wait: 11717usec)
class 96: tested 19660800 candidates in 480ms (40960000/sec) (avg. wait: 11717usec)
sp = 3793, min = 483
sp = 3331, min = 2870, max = 4716, prev = 3793 483
class 104: tested 19660800 candidates in 474ms (41478481/sec) (avg. wait: 11717usec)
class 105: tested 19660800 candidates in 474ms (41478481/sec) (avg. wait: 11718usec)
sp = 3331, min = 476
sp = 3100, min = 2870, max = 3793, prev = 3331 476
class 116: tested 19660800 candidates in 471ms (41742675/sec) (avg. wait: 11730usec)
class 117: tested 19660800 candidates in 471ms (41742675/sec) (avg. wait: 11719usec)
[/CODE]

Verbose timing is attached.

TheJudger 2010-05-17 09:55

Hi frmky,

those two non-verbose outputs are interesting. The average wait is constant regardless of how high or low SievePrimes is. This could indicate that the GPU code doesn't run concurrently with the CPU code, but I have no idea why.
The GPU throughput increases when the siever consumes less CPU time; that is the reason why throughput increases with MORE_CLASSES (which makes the siever run more efficiently thanks to a denser candidate distribution).

Oliver

P.S. Generally (if the application runs as expected), this assignment is too small for proper use of MORE_CLASSES; the overhead is too big in this case.

TheJudger 2010-05-18 16:53

Hi,

the performance problem which frmky "found" might be related to kjaget's modification. I gave frmky a beta of 0.07, and the problem did not occur there.

I've looked into kjaget's version again and noticed one difference in the way the streams are used.

kjaget's version:
...
- wait for stream N to finish GPU calculations
- preprocess data for stream N
- start GPU calculation on stream N
- wait for stream N+1 to finish GPU calculations
- preprocess data for stream N+1
- start GPU calculation on stream N+1
...

my version (implemented in the 0.07 beta):
...
- preprocess data for stream N
- wait for stream N+1 to finish GPU calculations
- start GPU calculation on stream N
- preprocess data for stream N+1
- wait for stream N+2 to finish GPU calculations
- start GPU calculation on stream N+1
...

- on paper, both versions are correct!
- kjaget's version seems more intuitive
- my version needs N+1 streams to keep N streams running on the GPU, while kjaget's version can keep N streams running with only N streams defined
- moving the wait behind the start in my version causes a small performance penalty on my system

JFYI: my version of 0.06 has a fixed number of streams (2) and uses 3 segments for preprocessing, so with 2 streams running on the GPU I can preprocess the third one.

[B]I'm unsure if this is the reason for the performance problem frmky has noticed.[/B]

Oliver

kjaget 2010-05-18 18:51

[QUOTE=TheJudger;215319]

[B]I'm unsure if this is the reason for the performance problem frmky has noticed.[/B]

Oliver[/QUOTE]

This is entirely possible. I'm glad you agree that on paper my approach should be equivalent, so at least I didn't miss anything really obvious. Without knowing the details of how the streams are scheduled and what sort of delays each step incurs, there's no way to know which works better other than testing - and we know the results there. It's been part of the frustration of working with CUDA for me, but at least we found an answer.

I'm beginning to wonder whether the CUDA runtime will dispatch more than one stream at a time if they're ready and fit on the GPU hardware. Then, by having more than one stream ready to go at a time, the original version would be faster by shipping off multiple streams at once. This would fit with what I've read about the overhead for starting a stream being higher on Windows than on Linux. If the original code only had to dispatch half of the streams (by combining every 2 into 1), you'd reduce that overhead quite a bit.

In that case it's possible that increasing the number of streams in my version would eventually give good performance. Or then again, maybe the timing would never line up correctly with my code ... it's been that kind of week around here ;) I do know that on my system I saw steady performance with up to 6 streams and then eventually saw performance go down again, so it's also possible that the "correct" value for my code is high enough that it would hit some other bottleneck.

Edit to add - if you need 3 or 4 instances of the program to get good throughput, you'd need 3 streams for each instance, or 1 instance with 9-12 (!) streams. Ouch, that starts to add up to real memory pretty quickly.

In any case, the problem is fixed in 0.07 so I don't have to worry quite so much. It still annoys me since there shouldn't be any differences between the two approaches but I'll live :)

Kevin

TheJudger 2010-05-19 09:31

On paper:

1 dataset (1 stream): preprocessing (CPU), upload and processing (GPU) are done sequentially
2 datasets (1 or 2 streams, depending on the implementation): preprocessing of block N+1 (CPU) can be done simultaneously with the upload and processing of block N (GPU) (upload and processing on the GPU are sequential)
3 datasets (2 or 3 streams, again depending on the implementation): preprocessing of block N+2, upload of block N+1 and processing of block N can happen at the same time (this is how 0.06 works)
more than 3 datasets/streams: hide latencies, hide jitter in the runtime of sieve_candidates(), ...

0.07 uses as many streams as there are datasets (ktab on the CPU). It would be possible to save one stream and some GPU memory, but it doesn't hurt much since the memory usage is still far below the available memory of today's GPUs.
With 5 streams I'm still below 50MB of memory usage. :smile:

Oliver

TheJudger 2010-05-23 20:59

Hi,

I just tried my current version of mfaktc on my 8800 GTX (G80 chip, compute capability 1.0). The code doesn't run as expected. It seems that my fix for the issue with "multiple factors found at the same time" causes the problems. :sad:
So perhaps I'll have to declare the G80 chip "not supported" and add a check which refuses to run mfaktc on chips with compute capability 1.0. (AFAIK the G80 is the only chip with compute capability 1.0.)
Another option would be to disable the fix on compute capability 1.0 GPUs.

Oliver

TheJudger 2010-05-28 07:30

1 Attachment(s)
Hello,

find attached mfaktc 0.07.

Highlights of this version:
- parse Prime95 worktodo files (thank you Luigi (ET_)! :smile:)
- changed the commandline interface
- should compile on Windows (thank you Kevin (kjaget)! :smile:)
Currently there is no compile script for Windows, but the code itself should compile.
- improved the siever performance a little bit

For details take a look at Changelog.txt and README.txt.

Oliver

TheJudger 2010-05-28 07:34

And a small teaser from my current development version:

[CODE]M3321929759 has a factor: 4103086300931724495689
M3321931637 has a factor: 28475025393798152885081
M3321931061 has a factor: 29833158347165530570273
M3321933893 has a factor: 36285087986156170392041
M3321929827 has a factor: 73716630445294762224353[/CODE]

Who will notice the difference first :question:

Oliver

fivemack 2010-05-28 07:41

76 bits: did you increase the base for the arithmetic or use four-digit numbers?

TheJudger 2010-05-28 08:09

Hi fivemack,

that was fast!
I've increased the base to 32 bits per int. It is a second GPU kernel which allows factors up to 2^95 (untested so far); it uses 32-bit multiplication.

Raw GPU speed on my GTX 275 (candidates per second (percentage of 71bit kernel)):
[CODE]GPU kernel M66362159 M3321932839
71bit 80.8M 62.4M
75bit 61.0M (75.5%) 47.1M (75.5%)
95bit 51.1M (63.2%) 39.4M (63.1%)[/CODE]

The 75-bit kernel is the same as the 95-bit kernel, with the only difference that it skips the first iteration of the long division. I'm happy with the performance; it is already at the top end of my expectations. I expected 50-65% of the 71-bit kernel's speed for the 95-bit code on non-Fermi GPUs.
I'll do some benchmarks on a GTX 480 soon! Perhaps the 75-bit kernel is even faster than the 71-bit kernel on Fermi, since Fermi has much faster 32-bit multiplication than non-Fermi GPUs.

Oliver

