Actually I have no clue why it runs so slowly on your system.
The high "average wait" usually indicates that the CPU (the siever) is NOT the limiting factor for the whole application. The "average wait" also goes up when SievePrimes is lowered (if not, it could be that CPU and GPU don't run interleaved for some reason, i.e. the GPU processing one block while the CPU preprocesses the next block). You also have relatively large jitter in the average wait.

[QUOTE=frmky][CODE]class 9: tested 160235520 candidates in 5835ms (27461100/sec) (avg. wait: 15431usec)
class 12: tested 160235520 candidates in 5267ms (30422540/sec) (avg. wait: 11899usec)
sp = 26250, min = 5285
sp = 13125, min = 0, max = 52500, prev = 26250 5285
class 20: tested 170065920 candidates in 5552ms (30631469/sec) (avg. wait: 14334usec)
class 21: tested 170065920 candidates in 5434ms (31296635/sec) (avg. wait: 13653usec)
sp = 13125, min = 5442
sp = 32812, min = 13125, max = 52500, prev = 26250 0[/CODE][/QUOTE]

SievePrimes is the same for class 9 and class 12, so the runtime and average wait should be the same; classes 20 and 21 have the same problem. On the other hand, you get full GPU performance when you have 3 copies running at the same time.

Do you run the application from X? Perhaps you could try to run it on the command line or remotely by ssh. A wild guess: latency issues on GPU-kernel startup? Can you enable VERBOSE_TIMING in params.h? (I hope it compiles; IIRC I screwed this up in one release, but if that is the case you can fix it easily (compile issues due to wrong format strings in printf()).) If you do so: [B]redirect the output to a file![/B] This generates a lot of output.

PCIe type/speed?

Thank you,
Oliver
1 Attachment(s)
This is an Intel G35 board, so it's PCIe 1.0 x16. The card is in a headless compute node with no X running, controlled remotely by ssh. Enabling MORE_CLASSES improves the speed:
default params, one binary running: [CODE]mfaktc v0.06

Compiletime Options
THREADS_PER_GRID   983040
THREADS_PER_BLOCK  256
SIEVE_SIZE_LIMIT   32kiB
SIEVE_SIZE         230945bits
USE_PINNED_MEMORY  enabled
USE_ASYNC_COPY     enabled
VERBOSE_TIMING     disabled
SELFTEST           disabled
MORE_CLASSES       disabled

Runtime Options
SievePrimes        25000
SievePrimesAdjust  2
CudaStreams        2

CUDA device info
name: GeForce GTX 480
compute capabilities: 2.0
maximum threads per block: 1024
number of multiprocessors: 15 (120 shader cores)
clock rate: 1401MHz

tf(3321932839, 66, 71);
k_min = 11106030600
k_max = 355392982921
class 0: tested 150405120 candidates in 5694ms (26414668/sec) (avg. wait: 11717usec)
class 5: tested 150405120 candidates in 5694ms (26414668/sec) (avg. wait: 11716usec)
sp = 52500, min = 5729
sp = 26250, min = 0, max = 100000, prev = 52500 5729
class 9: tested 160235520 candidates in 5238ms (30590973/sec) (avg. wait: 11717usec)
class 12: tested 160235520 candidates in 5238ms (30590973/sec) (avg. wait: 11717usec)
sp = 26250, min = 5255
sp = 13125, min = 0, max = 52500, prev = 26250 5255
class 20: tested 170065920 candidates in 5094ms (33385535/sec) (avg. wait: 11718usec)
class 21: tested 170065920 candidates in 5093ms (33392091/sec) (avg. wait: 11718usec)
sp = 13125, min = 5102
sp = 6562, min = 0, max = 26250, prev = 13125 5102
class 29: tested 181862400 candidates in 5060ms (35941185/sec) (avg. wait: 11719usec)
class 32: tested 181862400 candidates in 5060ms (35941185/sec) (avg. wait: 11719usec)
sp = 6562, min = 5064
sp = 3281, min = 0, max = 13125, prev = 6562 5064
class 36: tested 195624960 candidates in 5095ms (38395477/sec) (avg. wait: 11719usec)
class 41: tested 195624960 candidates in 5095ms (38395477/sec) (avg. wait: 11719usec)
sp = 3281, min = 5097
sp = 8203, min = 3281, max = 13125, prev = 6562 0
class 44: tested 177930240 candidates in 5060ms (35164079/sec) (avg. wait: 11721usec)
class 56: tested 177930240 candidates in 5062ms (35150185/sec) (avg. wait: 11719usec)
sp = 8203, min = 5066
sp = 5742, min = 3281, max = 13125, prev = 8203 5066
class 57: tested 184811520 candidates in 5074ms (36423240/sec) (avg. wait: 11720usec)
class 60: tested 184811520 candidates in 5074ms (36423240/sec) (avg. wait: 11719usec)
sp = 5742, min = 5078
sp = 9433, min = 5742, max = 13125, prev = 8203 0
class 65: tested 175964160 candidates in 5085ms (34604554/sec) (avg. wait: 11718usec)
class 69: tested 175964160 candidates in 5085ms (34604554/sec) (avg. wait: 11719usec)
sp = 9433, min = 5091
sp = 7587, min = 5742, max = 13125, prev = 9433 5091
class 72: tested 178913280 candidates in 5049ms (35435389/sec) (avg. wait: 11719usec)
class 77: tested 178913280 candidates in 5048ms (35442408/sec) (avg. wait: 11719usec)
[/CODE]

MORE_CLASSES, one binary running: [CODE]mfaktc v0.06

Compiletime Options
THREADS_PER_GRID   983040
THREADS_PER_BLOCK  256
SIEVE_SIZE_LIMIT   32kiB
SIEVE_SIZE         193154bits
USE_PINNED_MEMORY  enabled
USE_ASYNC_COPY     enabled
VERBOSE_TIMING     disabled
SELFTEST           disabled
MORE_CLASSES       enabled

Runtime Options
SievePrimes        25000
SievePrimesAdjust  2
CudaStreams        2

CUDA device info
name: GeForce GTX 480
compute capabilities: 2.0
maximum threads per block: 1024
number of multiprocessors: 15 (120 shader cores)
clock rate: 1401MHz

tf(3321932839, 66, 71);
k_min = 11106027240
k_max = 355392982921
class 0: tested 15728640 candidates in 560ms (28086857/sec) (avg. wait: 11713usec)
class 5: tested 15728640 candidates in 559ms (28137101/sec) (avg. wait: 11714usec)
sp = 52500, min = 594
sp = 26250, min = 0, max = 100000, prev = 52500 594
class 9: tested 16711680 candidates in 507ms (32961893/sec) (avg. wait: 11715usec)
class 20: tested 16711680 candidates in 507ms (32961893/sec) (avg. wait: 11717usec)
sp = 26250, min = 524
sp = 13125, min = 0, max = 52500, prev = 26250 524
class 21: tested 17694720 candidates in 491ms (36038126/sec) (avg. wait: 11718usec)
class 29: tested 17694720 candidates in 491ms (36038126/sec) (avg. wait: 11717usec)
sp = 13125, min = 499
sp = 6562, min = 0, max = 26250, prev = 13125 499
class 32: tested 18677760 candidates in 480ms (38912000/sec) (avg. wait: 11716usec)
class 36: tested 18677760 candidates in 481ms (38831101/sec) (avg. wait: 11717usec)
sp = 6562, min = 484
sp = 3281, min = 0, max = 13125, prev = 6562 484
class 41: tested 19660800 candidates in 473ms (41566173/sec) (avg. wait: 11720usec)
class 44: tested 19660800 candidates in 473ms (41566173/sec) (avg. wait: 11713usec)
sp = 3281, min = 476
sp = 1640, min = 0, max = 6562, prev = 3281 476
class 57: tested 21626880 candidates in 491ms (44046598/sec) (avg. wait: 11719usec)
class 60: tested 21626880 candidates in 491ms (44046598/sec) (avg. wait: 11718usec)
sp = 1640, min = 493
sp = 4101, min = 1640, max = 6562, prev = 3281 0
class 65: tested 19660800 candidates in 483ms (40705590/sec) (avg. wait: 11718usec)
class 69: tested 19660800 candidates in 483ms (40705590/sec) (avg. wait: 11717usec)
sp = 4101, min = 486
sp = 2870, min = 1640, max = 6562, prev = 4101 486
class 72: tested 20643840 candidates in 492ms (41959024/sec) (avg. wait: 11710usec)
class 77: tested 20643840 candidates in 492ms (41959024/sec) (avg. wait: 11711usec)
sp = 2870, min = 494
sp = 4716, min = 2870, max = 6562, prev = 4101 0
class 81: tested 19660800 candidates in 490ms (40124081/sec) (avg. wait: 11716usec)
class 84: tested 19660800 candidates in 490ms (40124081/sec) (avg. wait: 11716usec)
sp = 4716, min = 493
sp = 3793, min = 2870, max = 6562, prev = 4716 493
class 92: tested 19660800 candidates in 480ms (40960000/sec) (avg. wait: 11717usec)
class 96: tested 19660800 candidates in 480ms (40960000/sec) (avg. wait: 11717usec)
sp = 3793, min = 483
sp = 3331, min = 2870, max = 4716, prev = 3793 483
class 104: tested 19660800 candidates in 474ms (41478481/sec) (avg. wait: 11717usec)
class 105: tested 19660800 candidates in 474ms (41478481/sec) (avg. wait: 11718usec)
sp = 3331, min = 476
sp = 3100, min = 2870, max = 3793, prev = 3331 476
class 116: tested 19660800 candidates in 471ms (41742675/sec) (avg. wait: 11730usec)
class 117: tested 19660800 candidates in 471ms (41742675/sec) (avg. wait: 11719usec)
[/CODE]

Verbose timing is attached.
Hi frmky,
those two non-verbose outputs are interesting. The average wait is constant regardless of how high or low SievePrimes is. This could indicate that the GPU code doesn't run concurrently with the CPU code, but I have no idea why. The GPU throughput increases when the siever consumes less CPU time; that is the reason why the throughput increases with MORE_CLASSES (which makes the siever run more efficiently due to a denser candidate distribution).

Oliver

P.S. Generally (if the application runs as expected) this assignment is too small for proper usage of MORE_CLASSES; the overhead is too big in this case.
Hi,
the performance problem which frmky "found" might be related to kjaget's modification. I gave frmky a beta of 0.07 and the problem did not occur. I've looked into kjaget's version again and noticed one difference in the way the streams are used.

kjaget's version:
...
- wait for stream N to finish GPU calculations
- preprocess data for stream N
- start GPU calculation on stream N
- wait for stream N+1 to finish GPU calculations
- preprocess data for stream N+1
- start GPU calculation on stream N+1
...

my version (implemented in the 0.07 beta):
...
- preprocess data for stream N
- wait for stream N+1 to finish GPU calculations
- start GPU calculation on stream N
- preprocess data for stream N+1
- wait for stream N+2 to finish GPU calculations
- start GPU calculation on stream N+1
...

- on paper both versions are correct!
- kjaget's version seems more intuitive
- my version needs N+1 streams to have N streams running on the GPU, while kjaget's version can keep N streams running with only N streams defined
- moving the wait behind the start in my version causes a small performance penalty on my system

JFYI: My version of 0.06 has a fixed number of streams (2) and uses 3 segments for preprocessing, so with 2 streams running on the GPU I can preprocess the third one.

[B]I'm unsure if this is the reason for the performance problem frmky has noticed.[/B]

Oliver
[QUOTE=TheJudger;215319]
[B]I'm unsure if this is the reason for the performance problem frmky has noticed.[/B] Oliver[/QUOTE]

This is entirely possible. I'm glad you agree that on paper my approach should be the same, so at least I didn't miss anything really obvious. Without knowing the details of how the streams are scheduled and what sort of delays each step incurs, there's no way to know which works better other than testing - and we know the results there. It's been part of the frustration of working with CUDA for me, but at least we found an answer.

I'm beginning to wonder if the CUDA runtime will dispatch more than one stream at a time if they're ready and will fit on the GPU hardware. Then, by having more than one stream ready to go at a time, the original version would be faster by shipping off multiple streams at once. This would fit in with what I read about the overhead for starting up a stream being higher on Windows than on Linux. If the original code only had to dispatch half of the streams (by combining every 2 into 1), you'd reduce that overhead quite a bit. In that case it's possible that increasing the number of streams in my version would eventually give good performance. Or then again maybe the timing would never line up correctly with my code ... it's been that kind of week around here ;) I do know that on my system I saw steady performance with up to 6 streams and then eventually saw performance go down again, so it's also possible that the "correct" value for my code is high enough that it would hit some other bottleneck.

Edit to add - if you need 3 or 4 instances of the program to get good throughput, you'd need 3 streams for each instance, or 1 instance with 9-12 (!) streams. Ouch, that starts to add up to real memory pretty quickly.

In any case, the problem is fixed in 0.07 so I don't have to worry quite so much. It still annoys me since there shouldn't be any differences between the two approaches, but I'll live :)

Kevin
On paper:
1 dataset (1 stream): preprocessing (CPU), upload and processing (GPU) are done sequentially.

2 datasets (1 or 2 streams, depending on implementation): preprocessing of block N+1 (CPU) can be done simultaneously with the upload and processing of block N (GPU) (upload and processing on the GPU are sequential).

3 datasets (2 or 3 streams, again depending on implementation): preprocessing of block N+2, upload of block N+1 and processing of block N can be done at the same time (this is how 0.06 works).

More than 3 datasets/streams: hide latencies, hide jitter in the runtime of sieve_candidates(), ...

0.07 uses as many streams as datasets (ktab on the CPU) exist. It is possible to save one stream and some memory on the GPU, but it doesn't hurt much since the memory usage is still far away from the available memory of today's GPUs. With 5 streams I'm still below 50MB memory usage. :smile:

Oliver
Hi,
I just tried my current version of mfaktc on my 8800 GTX (G80 chip, compute capability 1.0). The code doesn't run as expected. It seems that my fix for the issue with "multiple factors found at the same time" causes the problems. :sad: So perhaps I have to declare the G80 chip "not supported" and add a check which refuses to run mfaktc on chips with compute capability 1.0. (AFAIK the G80 is the only chip with compute capability exactly 1.0.) Another option would be to disable the fix on compute capability 1.0 GPUs.

Oliver
1 Attachment(s)
Hello,
find attached mfaktc 0.07. Highlights of this version:
- parses Prime95 worktodo files (thank you Luigi (ET_)! :smile:)
- changed commandline interface
- should compile on Windows (thank you Kevin (kjaget)! :smile:) Currently there is no compile script for Windows, but the code itself should compile.
- improved siever performance a little bit

For details take a look at Changelog.txt and README.txt.

Oliver
And a small teaser from my current development version:
[CODE]M3321929759 has a factor: 4103086300931724495689
M3321931637 has a factor: 28475025393798152885081
M3321931061 has a factor: 29833158347165530570273
M3321933893 has a factor: 36285087986156170392041
M3321929827 has a factor: 73716630445294762224353[/CODE]

Who will notice the difference first :question:

Oliver
76 bits: did you increase the base for the arithmetic or use four-digit numbers?
Hi fivemack,
that was fast! I've increased the base to 32 bits per int. It is a 2nd GPU kernel which allows factors up to 2^95 (untested so far); it uses 32-bit multiplication. Raw GPU speed on my GTX 275, in candidates per second (with the percentage of the 71bit kernel in parentheses):

[CODE]GPU kernel   M66362159       M3321932839
71bit        80.8M           62.4M
75bit        61.0M (75.5%)   47.1M (75.5%)
95bit        51.1M (63.2%)   39.4M (63.1%)[/CODE]

The 75bit kernel is the same as the 95bit kernel, with the only difference that it skips the first iteration of the long division. I'm happy with the performance; it is already at the top end of my expectations. I expected 50-65% for the 95bit code compared to the 71bit code on non-Fermi GPUs. I'll do some benchmarks on a GTX 480 soon! Perhaps the 75bit kernel is even faster than the 71bit kernel on Fermi, because Fermi has a much faster 32bit multiplication than non-Fermi GPUs.

Oliver