
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

chalsall 2019-03-14 20:47

[QUOTE=Mark Rose;510774]It would be worth it to test numbers around 90M, 100M, and 110M, too.[/QUOTE]

Indeed. The GPU TF'ers are currently taking 91M and up to 77 "bits"; 90M is already done.

It might also be worth testing at 332M, to see if there's any optimization which could be squeezed out using different kernels going to 81 "bits".

TheJudger 2019-03-14 22:56

[QUOTE=nomead;510750]There is a selection table in mfaktc.c that only checks for compute capability 1.x (where the speed order was 76 -> 77 -> 87 -> 88 -> 79 -> 92) and all the rest get 76 -> 87 -> 88 -> 77 -> 79 -> 92. [COLOR="Red"]So the barrett77_mul32_gs kernel is in effect never selected on anything newer than GTX2xx.[/COLOR][/QUOTE]

Are you sure about this? I'm not! Hint: check kernel_possible() in the same file.

Last time I did some benchmarks, barrett 87 and 88 were faster than 77 (Pascal series).

Oliver

nomead 2019-03-15 01:05

[QUOTE=TheJudger;510820]Are you sure about this? I'm not! Hint: check kernel_possible() in the same file.

Last time I did some benchmarks, barrett 87 and 88 were faster than 77 (Pascal series).

Oliver[/QUOTE]
Well, not 100% sure of course, but isn't kernel_possible() just called from tf() to check whether a given kernel works at all for the selected bit-range combination? It says nothing about relative speed. I may have oversimplified a bit when I said "in effect never gets selected", since selection can fall through all the way to barrett77 if 87 and 88 won't work. Ah yes, there's that extra check on the barrett87, 88 and 92 rows: if more than one bit depth range is being factored at once, those aren't selected.

So, with the code as it is, for compute capability greater than 1.x:
76-77 gets barrett87_mul32_gs
75-77 gets barrett77_mul32_gs
78-79 gets barrett87_mul32_gs
77-79 gets barrett79_mul32_gs
79-80 gets barrett87_mul32_gs
78-80 or 79-81 will actually get 95bit_mul32_gs
But I'd like to think that since factoring at these bit levels takes quite a while, most people would be running with the default Stages=1 set in mfaktc.ini. This is my reasoning behind that "in effect never"...

The one thing I'm not at all sure about is the 1% improvement. On real-life work the difference seems to be less than that (still on Turing). I'll have to gather some more timing information, but this will take a while longer. :smile:

nomead 2019-03-15 16:47

[QUOTE=nomead;510824]I'll have to gather some more timing information, but this will take a while longer. :smile:[/QUOTE]

Okay, I was shocked. For whatever reason, there is pretty much no measurable performance difference between barrett77 and 87 as tested on real work. So, again, RTX 2080, GPU clock locked at 1800 MHz. Six exponents each in the M9152xxxx range factored from 76 to 77 bits. All are reported as 167.21 GHz-days. Average for the barrett77 runs: 1 hour 18 minutes 40.223 seconds. And for the barrett87 runs: 1 hour 18 minutes... 42.352 seconds. It's well within the measurement error margin now. I wonder why I saw that 1% earlier, but then, that was for a single run for each kernel.

So, nothing needs to be changed, it doesn't make any difference. Meh. :yawn:

TheJudger 2019-03-15 19:21

[QUOTE=nomead;510883]Okay, I was shocked. For whatever reason, there is pretty much no measurable performance difference between barrett77 and 87 as tested on real work. So, again, RTX 2080, GPU clock locked at 1800 MHz. Six exponents each in the M9152xxxx range factored from 76 to 77 bits. All are reported as 167.21 GHz-days. Average for the barrett77 runs: 1 hour 18 minutes 40.223 seconds. And for the barrett87 runs: 1 hour 18 minutes... 42.352 seconds. It's well within the measurement error margin now. I wonder why I saw that 1% earlier, but then, that was for a single run for each kernel.

So, nothing needs to be changed, it doesn't make any difference. Meh. :yawn:[/QUOTE]

No problem. And yes, those run-to-run variations are annoying. On a stock GeForce you have a power target, a temperature target, the actual temperature and so on. Even when you try to lock a specific clock rate you get those (minor) run-to-run variations. This happens on Tesla too, and on Tesla it is much easier to make sure you're running at a fixed clock rate (just set a relatively low application clock). For benchmarks/comparisons you should always run in a realistic setting, not on stuff like "RAW GPU BENCH".

Oliver

Thecmaster 2019-03-17 23:31

Help. I'm running mfaktc 0.21 CUDA 6.5 right now. I have a GTX 960 and saw there are CUDA 8.0 and CUDA 10.0 versions of mfaktc. What's the difference between them, and should I run another version?
/Arvid

kriesel 2019-03-18 01:13

[QUOTE=Thecmaster;511001]Help. I'm running mfaktc 0.21 CUDA 6.5 right now. I have a GTX 960 and saw there are CUDA 8.0 and CUDA 10.0 versions of mfaktc. What's the difference between them, and should I run another version?
/Arvid[/QUOTE]Test them and see what's faster on your card. Note that mfaktc tuning can make a difference of several percent for a given version. CUDA 6.5 has done well in speed comparisons in my CUDALucas testing. (I don't have a GTX 960.)

Thecmaster 2019-03-18 11:01

[QUOTE=kriesel;511004]Test them and see what's faster on your card. Note that mfaktc tuning can make a difference of several percent for a given version. CUDA 6.5 has done well in speed comparisons in my CUDALucas testing. (I don't have a GTX 960.)[/QUOTE]
Just tested CUDA 10.0 and it was 10% faster. I'll test 8.0 too and take whichever is fastest.

CUDA 8.0 was only 8% faster than 6.5, so 10.0 it is. Thanks for the help.
/Arvid

kriesel 2019-03-18 14:15

[QUOTE=Thecmaster;511029]Just tested CUDA 10.0 and it was 10% faster. I'll test 8.0 too and take whichever is fastest.

CUDA 8.0 was only 8% faster than 6.5, so 10.0 it is. Thanks for the help.
/Arvid[/QUOTE]Thanks, I've not looked at anything above 8.0 myself yet; it looks like there may be some gains there for some of my fleet too.

Was your testing with or without tuning? See [URL]https://mersenneforum.org/showpost.php?p=395719&postcount=2505[/URL]

GPU clock constant, power-limited, or allowed to fluctuate?

Thecmaster 2019-03-18 19:02

[QUOTE=kriesel;511037]Thanks, I've not looked at anything above 8.0 myself yet; it looks like there may be some gains there for some of my fleet too.

Was your testing with or without tuning? See [URL]https://mersenneforum.org/showpost.php?p=395719&postcount=2505[/URL]

GPU clock constant, power-limited, or allowed to fluctuate?[/QUOTE]

No, I didn't tune any of that; I was just about to search for information on it, or ask here.

I looked around in the mfaktc.ini file and found some interesting things to tweak but I don't know where to start.


Have done some tuning now.

GPUSieveProcessSize=32
GPUSieveSize=128
GPUSievePrimes=110000 (this gets adjusted to 110134 when program starts)

This gave me a bit more throughput.

With 6.5 I got 303 GHz-d/day
With 10.0 I got 331 GHz-d/day
After tweaking I got 337 GHz-d/day

This on a GTX 960 2GB

petrw1 2019-03-20 18:43

CPU impacts GPU more than I expected.
 
I have a 2080Ti GPU running mfaktc
on an i7-7820X with 32 GB of DDR4-3600 RAM, running large P-1 on all 8 cores.

The CPU is running at 60 °C and the GPU at 81 °C.

The GPU does about 3,900 GHz-days/day, but if I stop Prime95 the GPU throughput immediately goes to about 4,250.
The GPU stays at 81 °C.
If I restart Prime95, the GPU stays at 4,250 until about the time all 8 cores have started, allocated their RAM, and are running P-1 again.

In other words, the total throughput of the rig is LOWER when the CPU is busy: it does about 75 GHz-days/day of P-1 while the GPU loses about 300.

I don't know if the impact would be the same if I were running LL instead of P-1 (which uses much less RAM), though my guess is it would be about the same.

