mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

apsen 2011-07-18 05:49

1 Attachment(s)
[QUOTE=TheJudger;266725]apsen: your changes seem to screw something up. :sad:
[/QUOTE]

I originally tried to follow the changes since version 0.8, as it was the last stock version that worked on my machine, so it is quite likely that I broke something unintentionally.

I've now taken a clean mfaktc-0.17 and reapplied my changes directly to it. I've kept them almost minimal (the exception is enabling it to compile under CUDA 2.2), so it should be easy to diff.

I do not see any test failures with it, so please test it on your system to see whether you still get them.

Also, the slowdown was due to a synchronous memory copy in the main loop, but on my machine it was not as noticeable. I was losing less than 10% of performance, so I was going to look into it later. I have now reworked that too, and performance is back on par with the stock build. Please check whether I did it the right way.

TheJudger 2011-07-18 10:00

Hi apsen,

[QUOTE=apsen;266731]The idea is simple: give each thread its own chunk of memory to write the results to, so there's no need for a shared variable.[/QUOTE]
Each thread or each stream? I think it is the latter, and then it won't work reliably. Old GPUs can't even run concurrent kernels, so the behavior is the same as without separate chunks of memory.

[QUOTE=apsen;266751]I've now taken a clean mfaktc-0.17 and reapplied my changes directly to it. I've kept them almost minimal (the exception is enabling it to compile under CUDA 2.2), so it should be easy to diff.
[/QUOTE]

May I ask why you're still on CUDA 2.2? (There might be good reasons.)

Oliver

apsen 2011-07-18 12:08

[QUOTE=TheJudger;266768]
Each thread or each stream? I think it is the latter, and then it won't work reliably. Old GPUs can't even run concurrent kernels, so the behavior is the same as without separate chunks of memory.
[/QUOTE]
I meant each thread, but I guess I did indeed do it per stream :-( I guess I would need to go back and fix it.


[QUOTE=TheJudger;266768]
May I ask why you're still on CUDA 2.2? (There might be good reasons.)
[/QUOTE]

The driver on the machine does not support CUDA 3.1 or higher, and if I upgrade it the performance of stock mfaktc-0.8 drops significantly. So while I'm changing mfaktc I need to test it on 2.2. When it's ready I'll compile it for a newer CUDA and upgrade the driver.

It was late yesterday so I didn't really do it this time.

Is there a problem with 2.2?

TheJudger 2011-07-18 14:01

[QUOTE=apsen;266778]Is there a problem with 2.2?[/QUOTE]

I don't know of any problems related to mfaktc (except that it doesn't compile as it is now). As CUDA 2.2 is in the past, I don't have any real plans to support it in mfaktc. If the needed changes are trivial and have no side effects, I might try it anyway.

Oliver

apsen 2011-07-18 14:43

1 Attachment(s)
[QUOTE=TheJudger;266790]As CUDA 2.2 is in the past, I don't have any real plans to support it in mfaktc. If the needed changes are trivial and have no side effects, I might try it anyway.
[/QUOTE]

The only difference is the absence of __launch_bounds__. As you can see, I've just added a conditional macro that drops it when compiling under CUDA 2.2. I don't care about supporting 2.2 either; I just need it until we can make the current version of mfaktc work with sm_10.
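
For readers without the attachment, a conditional macro of the kind described above could look roughly like the sketch below. It is not the code from the attachment: every DEMO_ name is hypothetical, and the 3000 version threshold is an assumption (the thread only establishes that CUDA 2.2 lacks __launch_bounds__). CUDART_VERSION is the version macro from the CUDA runtime headers; when the file is built as plain C it is undefined, so the qualifier simply disappears.

```c
/* Sketch only: all DEMO_ names are hypothetical, not taken from mfaktc. */

/* CUDART_VERSION is defined by the CUDA runtime headers (2020 for CUDA 2.2).
   The 3000 threshold is an assumption; adjust it to the first toolkit that
   actually accepts __launch_bounds__. */
#if defined(CUDART_VERSION) && (CUDART_VERSION >= 3000)
  #define DEMO_HAVE_LAUNCH_BOUNDS 1
  #define DEMO_LAUNCH_BOUNDS(max_threads, min_blocks) \
          __launch_bounds__(max_threads, min_blocks)
#else
  #define DEMO_HAVE_LAUNCH_BOUNDS 0
  #define DEMO_LAUNCH_BOUNDS(max_threads, min_blocks) /* expands to nothing */
#endif

#ifdef __CUDACC__
/* Only compiled by nvcc. On an old toolkit the qualifier expands to nothing
   and the kernel is built without launch bounds. */
__global__ void DEMO_LAUNCH_BOUNDS(256, 2) demo_kernel(unsigned int *out)
{
  out[blockIdx.x * blockDim.x + threadIdx.x] = 0;
}
#endif

/* Host-side helper so the selected branch can be checked at run time. */
static int demo_launch_bounds_enabled(void)
{
  return DEMO_HAVE_LAUNCH_BOUNDS;
}
```

The kernel bodies stay identical across toolkits; only the qualifier changes, which keeps the diff against the stock sources small.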


[QUOTE=apsen;266778]I guess I would need to go back and fix it.
[/QUOTE]

I've made the changes but I won't be able to test them until later today. Maybe you'd be willing to take a look before then to see whether my understanding is right. (If you are on German time, it will be past midnight for you before I get a chance to test.)

apsen 2011-07-18 15:57

[QUOTE=TheJudger;265318]
use of atomic instructions for access to the results array (this needs CC >=1.1)
[/QUOTE]
:geek:
Just to make sure I understand it right: blindly replacing the atomics with unprotected access to d_RES can cause a problem only when we find more than one factor per class (tf_class_* call). Even then it will report that at least one factor has been found, but the factor(s) themselves may be scrambled by simultaneous attempts to store them in the result array. So if the program reports no factors found, that will be true. Is this correct?

TheJudger 2011-07-18 16:32

correct!
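
The difference discussed above can be sketched on the host side with C11 atomics. This is an analogy, not mfaktc's kernel code: the struct layout, the limit of 10 factors, the three 32-bit words per factor and all demo_ names are assumptions made for illustration. The atomic variant reserves a distinct slot per factor; the unprotected variant always writes slot 0, so "something was found" stays reliable while the stored factor may be overwritten (or, with truly simultaneous writers, mixed word by word).

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Sketch only: layout and names are hypothetical, not mfaktc's d_RES. */
#define DEMO_MAX_FACTORS 10
#define DEMO_WORDS        3   /* assumed: one factor = three 32-bit words */

typedef struct {
  atomic_uint count;                           /* number of factors found */
  uint32_t words[DEMO_MAX_FACTORS][DEMO_WORDS];
} demo_results;

static void demo_reset(demo_results *r)
{
  atomic_init(&r->count, 0u);
  memset(r->words, 0, sizeof r->words);
}

/* With atomics: the fetch-and-add hands every finder its own slot. */
static void demo_store_atomic(demo_results *r, const uint32_t f[DEMO_WORDS])
{
  unsigned int slot = atomic_fetch_add(&r->count, 1u);
  if (slot < DEMO_MAX_FACTORS) {
    for (int i = 0; i < DEMO_WORDS; i++) r->words[slot][i] = f[i];
  }
}

/* Without atomics: every finder marks "found" and writes slot 0.
   The count is never zero after a hit, but the factor can be clobbered. */
static void demo_store_unprotected(demo_results *r, const uint32_t f[DEMO_WORDS])
{
  atomic_store(&r->count, 1u);   /* on the GPU this would be a plain store */
  for (int i = 0; i < DEMO_WORDS; i++) r->words[0][i] = f[i];
}
```

Calling the unprotected variant twice in a row leaves the count at 1 and only the second factor in slot 0, which is the sequential version of the scrambling described above. A report of zero factors is unaffected in both variants.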

apsen 2011-07-18 18:24

Is the CPUStreams configuration parameter basically the length of the sieve queue?

TheJudger 2011-07-18 19:38

yes!
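
To picture what "length of the sieve queue" means, here is a small bounded-queue sketch. It is not how mfaktc implements its streams: the demo_ names, the integer buffer ids and the compile-time maximum of 8 are assumptions for illustration. The capacity field plays the role of CPUStreams: the CPU sieve can run at most that many buffers ahead of the GPU before it has to wait.

```c
#include <stddef.h>

/* Sketch only: hypothetical names, not mfaktc's data structures. */
#define DEMO_MAX_STREAMS 8

typedef struct {
  int    slots[DEMO_MAX_STREAMS]; /* stand-ins for presieved buffers */
  size_t capacity;                /* plays the role of CPUStreams */
  size_t head;                    /* oldest queued buffer */
  size_t count;                   /* buffers currently queued */
} demo_sieve_queue;

static void demo_queue_init(demo_sieve_queue *q, size_t cpu_streams)
{
  q->capacity = cpu_streams > DEMO_MAX_STREAMS ? DEMO_MAX_STREAMS : cpu_streams;
  q->head = 0;
  q->count = 0;
}

/* CPU side: returns 0 when the queue is full, i.e. the sieve must wait. */
static int demo_queue_push(demo_sieve_queue *q, int buffer_id)
{
  if (q->count == q->capacity) return 0;
  q->slots[(q->head + q->count) % q->capacity] = buffer_id;
  q->count++;
  return 1;
}

/* GPU side: returns 0 when nothing is queued, i.e. the GPU would idle. */
static int demo_queue_pop(demo_sieve_queue *q, int *buffer_id)
{
  if (q->count == 0) return 0;
  *buffer_id = q->slots[q->head];
  q->head = (q->head + 1) % q->capacity;
  q->count--;
  return 1;
}
```

A longer queue lets the sieve absorb short stalls on either side at the cost of more buffer memory; a queue of length 1 forces the CPU and GPU to take strict turns.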

btw.: please change the version string in your modified code to something unique,
e.g. "0.17-ap1"

Oliver

TheJudger 2011-07-18 20:11

Hi Eric,

[QUOTE=Christenson;266730]Hi Oliver:

I've been putting my time into parse.c ... gone through 1 re-write, need another to get it organized with a parse_line function that returns a structure with both the data found and the original line.[/QUOTE]

Feel free to send me your stuff (even if it is not finished).

Oliver

apsen 2011-07-18 22:03

[QUOTE=TheJudger;266839]
btw.: please change the version string in your modified code to something unique,
e.g. "0.17-ap1"
Oliver[/QUOTE]

Ok. But would you be willing to incorporate my changes in your code once it goes through testing?

