mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

James Heinrich 2011-03-07 19:26

[QUOTE=Bdot;254545]I just hate wasting cycles and therefore wanted to ask for raising the priority of an event-based implementation of the GPU synchronisation.[/QUOTE]I sometimes (depending on exponent range) see the same thing here (i7-920(OC) vs 8800GT). I don't know if a "quick & dirty" solution would be to allow SievePrimes > 100k, maybe allow it to float up to 200k? Reducing the CPU usage to only what's needed would of course be "better", but as a quick fix this requires almost no effort (I think?).

Bdot 2011-03-08 10:18

I now set the priorities of both prime95 and mfaktc to 4, resulting in only half a CPU for mfaktc. This reduced the average wait time from 63000 us to 49000 us, and the prime95 thread advances at half speed :-)

Kind of a workaround ...

TheJudger 2011-03-08 16:49

[QUOTE=Bdot;254545]I just hate wasting cycles and therefore wanted to ask for raising the priority of an event-based implementation of the GPU synchronisation.[/QUOTE]
Then I'd say I have a goal for mfaktc 0.17. :smile:
First I have to read up on how events work; a quick look did not reveal a way to wait for a single event from any stream. There is cudaEventQuery(), but that ends up in a busy loop again.
Perhaps it is enough to add an (automatically adjusted) sleep to the current implementation?
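For what it's worth, a blocking (non-spinning) wait does seem expressible with the event API: if the event is created with the cudaEventBlockingSync flag, cudaEventSynchronize() blocks the calling host thread instead of spinning. A rough host-side sketch (kernel name, stream, and arguments are illustrative placeholders, not mfaktc's actual code):

```cuda
/* Sketch only: event-based GPU synchronisation that yields the CPU.
 * my_kernel, grid, block, and stream are placeholders. */
cudaEvent_t done;

/* cudaEventBlockingSync makes the later cudaEventSynchronize() block
 * the calling thread instead of busy-waiting. */
cudaEventCreateWithFlags(&done, cudaEventBlockingSync);

my_kernel<<<grid, block, 0, stream>>>(/* args */);
cudaEventRecord(done, stream);   /* enqueue the event behind the kernel */
cudaEventSynchronize(done);      /* blocks (no spin) until the event completes */

cudaEventDestroy(done);
```

Waiting for "a single event from any stream" is a different matter, though: this only waits for one known event in one known stream.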

[QUOTE=James Heinrich;254556]I sometimes (depending on exponent range) see the same thing here (i7-920(OC) vs 8800GT). I don't know if a "quick & dirty" solution would be to allow SievePrimes > 100k, maybe allow it to float up to 200k? Reducing the CPU usage to only what's needed would of course be "better", but as a quick fix this requires almost no effort (I think?).[/QUOTE]

It is better not to do so. If you do both (increase SievePrimes above 100k and set THREADS_PER_GRID_MAX to 2^21), mfaktc will fail because k_tab[] will contain values above 2^24, which is too much for the 24-bit multiply (which is faster on cc 1.x GPUs). The benefit above 100k wouldn't be great anyway; only a few more candidates would be removed by sieving.

Oliver

TheJudger 2011-03-08 17:12

Btw., perhaps a kindly mod of this forum could change the title of this thread to e.g. "Trial division with CUDA (mfaktc)"

Thank you!

Oliver

Bdot 2011-03-08 17:38

Sleep
 
[QUOTE=TheJudger;254622]
Perhaps it is enough to add a (automatical adjusted) sleeps in the current implementation?
[/QUOTE]

Yes, that could already help a lot: I currently see an avg. wait between 49183 and 50558 us, so a single sleep of (avg_wait >> 10) ms, followed by the busy loop, would be good and self-adjusting.

50558 / 1024 = 49, so the sleep after the longest wait would still be a bit shorter than the shortest wait.

Or add a config setting to specify a fixed sleep time ...

Or always subtract 1000 (configurable) from the last avg wait time and use this for the sleep if >0 ...
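The first scheme above could be sketched roughly like this in plain C. Here avg_wait_us and gpu_result_ready() are placeholders for the measured average wait and mfaktc's GPU poll (e.g. cudaEventQuery()); this is an assumption, not mfaktc's actual code:

```c
#include <unistd.h>   /* usleep() */

/* Placeholder for the real GPU poll, e.g. cudaEventQuery() == cudaSuccess. */
static int gpu_result_ready(void) { return 1; }

/* Sketch of the proposed self-adjusting wait. avg_wait_us is the measured
 * average GPU wait (in microseconds) from previous iterations.
 * (avg_wait_us >> 10) roughly converts us to ms (dividing by 1024 rather
 * than 1000) and rounds down, so the sleep stays a little shorter than
 * the shortest observed wait, e.g. 50558 >> 10 == 49. */
static void wait_for_gpu(unsigned long avg_wait_us)
{
    unsigned long sleep_ms = avg_wait_us >> 10;
    if (sleep_ms > 0)
        usleep(sleep_ms * 1000);   /* sleep through most of the expected wait */
    while (!gpu_result_ready())
        ;                          /* short busy loop for the remainder */
}
```

Because the sleep is derived from the previous average, it tracks the workload automatically with no extra configuration.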

James Heinrich 2011-03-08 17:51

[QUOTE=TheJudger;254622]...mfaktc will fail because k_tab[] will contain values above 2^24, which is too much for the 24-bit multiply[/QUOTE]Perhaps it could be enabled for 32-bit multiplies then? With 24-bit stuff, SievePrimes hovers around 70k for me; at higher bit depths with 32-bit multiplies, it's stuck at 100k all the time, which is where I think a larger limit would give at least a little benefit.

TheJudger 2011-03-09 11:38

[QUOTE=Bdot;254629]Yes, that could already help a lot: I currently see an avg. wait between 49183 and 50558 us, so a single sleep of (avg_wait >> 10) ms, followed by the busy loop, would be good and self-adjusting.

50558 / 1024 = 49, so the sleep after the longest wait would still be a bit shorter than the shortest wait.[/QUOTE]
I don't think it is that easy, since it interferes with the automatic adjustment of SievePrimes.
But I already have some ideas... just give me some time. I won't do this for mfaktc 0.16, which I plan to release in a few days if everything works right. As I said yesterday, this is on my .plan for 0.17.

[QUOTE=James Heinrich;254634]Perhaps it could be enabled for 32-bit multiplies then? On 24-bit stuff, SievePrimes hovers around 70k for me; on higher bit depths with 32-bit multiplies, it's stuck at 100k all the time, which is where I think at least a little benefit of larger limit would be.[/QUOTE]

Another option is to lower the maximum allowed THREADS_PER_GRID_MAX to 2^20. Does anybody use/need THREADS_PER_GRID_MAX > 2^20?

Oliver

James Heinrich 2011-03-09 12:01

[QUOTE=TheJudger;254701]Does anybody use/need THREADS_PER_GRID_MAX > 2^20?[/QUOTE]That's what's compiled into the Windows binary, so I don't really have a choice in the matter. But capping it at 2^20 seems reasonable.

Bdot 2011-03-10 12:59

"Iterations" in prime95 vs. candidates in mfaktc?
 
Hi, I'm wondering if someone could explain the difference in how prime95 and mfaktc do the factoring, especially the number of candidates being tested/displayed.

What I see: mfaktc has a throughput of ~8.6M candidates/sec for tf(333000091, 76, 77, ...).

prime95 is set to report one status line every 10000 iterations, and it does so every ~90 sec for a similar task. That would correspond to ~110 iterations per sec.

Comparing the overall progress (total runtime for this tf), mfaktc advances at about 4 times the speed of prime95.

Is prime95 testing that many fewer candidates, each of which takes that much longer? Or what is an "iteration" in this respect? As far as I understood the sieving in mfaktc, there's not a big opportunity to omit more factor candidates with reasonable effort.

Maybe this should rather go into some prime95 thread?

TheJudger 2011-03-10 17:39

The average rate reported by mfaktc is the number of candidates tested on the GPU. This is [B]after[/B] the sieving part.

In Prime95, one iteration is one block of the sieve. AFAIK the 64-bit Prime95 variant uses a 3(?) times bigger block than the 32-bit variant, so you can't even compare iterations between the two directly.

The best you can do is compare the overall runtime. That is easy, and it's what you really want to know anyway.

Oliver

TheJudger 2011-03-13 14:03

mfaktc 0.16
 
1 Attachment(s)
Hello,

Please find attached mfaktc 0.16! :smile:

Highlights:[LIST][*]some changes to the screen output, including a new option in mfaktc.ini where the user can choose between a new line for each finished class or (more compact) overwriting the last line. James Heinrich cast an eye over the screen output and provided some ideas. Thank you, James! :tu:[*]the barrett92 kernel [B]is up to 5% faster[/B]; the speedup is similar on all supported GPUs :smile:[*]the barrett79 kernel [B]is up to 18% faster[/B]; again the speedup is similar on all supported GPUs :smile:[/LIST]Of course the improved kernel speed will put more pressure on your CPU...

Upgrade instructions (as usual): finish your current assignment (or at least the current bit level of your assignment), then switch to the new version (checkpoint files from other versions of mfaktc are ignored).

Oliver


All times are UTC.