mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

TheJudger 2011-07-18 22:25

[QUOTE=apsen;266858]Ok. But would you be willing to incorporate my changes in your code once it goes through testing?[/QUOTE]

G80 support: yes, at least as an "official patch" or a "special G80 version", depending on how/if the changes affect the code / performance. I assume that a 1% slowdown on non-G80 GPUs has a bigger impact than the ability to run on G80 GPUs. G80 isn't that hot today. Even entry-level Fermis are faster.

CUDA < 2.3 support: unlikely, how many people need this? Main reason: needs testing.

Cleanups: very welcome. I know that the code is messy. :blush: It has grown over time and some parts have been changed / expanded many times.

Performance improvements: [B]YES[/B]

Oliver

apsen 2011-07-18 23:04

[QUOTE=TheJudger;266863]CUDA < 2.3 support: unlikely, how many people need this? Main reason: needs testing.[/QUOTE]

Do not worry about it - I'll remove it myself eventually :-)

[QUOTE=TheJudger;266863]
Cleanups: very welcome. I know that the code is messy. :blush: It has grown over time and some parts have been changed / expanded many times.
[/QUOTE]

Are you sure you know what you are asking for? :bounce:
Actually, the reason my first code had so many changes is that I was trying to make sense of the code, so I tried to make it more readable. And all those #ifdefs inside the code were getting to me. Sorry, but with you prompting it I could not resist.
:redface:

[QUOTE=TheJudger;266863]Performance improvements: [B]YES[/B]
[/QUOTE]

I might be able to cook something here but this is my first encounter with GPU computing. :unsure:

apsen 2011-07-19 03:19

1 Attachment(s)
Ok. Here it goes. I think this one is ready for more extensive testing as it passes all the selftests multiple times on different machines.

I've included a Win64 executable compiled with CUDA 4.0 for multiple compute capabilities. I'd be interested if anyone could run the selftests and any other tests they can think of. I'm also interested to know if there are any noticeable performance differences compared with stock mfaktc-0.17.

Bdot 2011-07-19 08:55

[QUOTE=apsen;266804]:geek:
Just to make sure I understand it right: blindly replacing the atomics with unprotected access to d_RES can cause a problem only when we find more than one factor per class (tf_class_* call). Even then it will report that at least one factor has been found, but the factor(s) themselves may be scrambled by simultaneous attempts to store them in the result array. So if the program reports no factors found, that will be true. Is this correct?[/QUOTE]

Actually, as far as I understood the code, the problem only appears if multiple factors are found within the same grid of a class. Therefore, reducing the grid size will reduce the risk even more (when was the last time anyone had 2 factors in the same class?).

For ATI's HD4xxx GPUs I'm limiting the GridSize to 2 and issuing a warning: "GPU does not support atomics. There is a small chance mfakto can report wrong factors; if it does, then the exponent has multiple factors in the tested range."
And yes, I know it is not the exponent itself having the factors ...
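
To make the failure mode concrete, here is a minimal host-side sketch in plain C (not the actual kernel code) that replays one worst-case interleaving of two threads storing a factor without atomics. The layout is an assumption for illustration: word 0 of the result array is a factor counter, followed by three 32-bit words per factor. The names `factor96` and `racy_store_two` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define RES_WORDS 32  /* assumed size of the shared result array */

/* One 96-bit factor as three 32-bit words (assumed layout). */
typedef struct { uint32_t hi, mid, lo; } factor96;

/* Replay two unprotected stores with the worst-case ordering:
 * both "threads" read the counter before either writes it back. */
void racy_store_two(uint32_t res[RES_WORDS], factor96 a, factor96 b)
{
    assert(res[0] * 3 + 3 < RES_WORDS);  /* room for one more factor */

    uint32_t idx_a = res[0];  /* thread A reads the counter        */
    uint32_t idx_b = res[0];  /* thread B reads the same value     */
    res[0] = idx_a + 1;       /* A writes the counter back         */
    res[0] = idx_b + 1;       /* B overwrites it with the same sum */

    /* Both threads now write to the same slot, word by word. */
    res[idx_a * 3 + 1] = a.hi;
    res[idx_b * 3 + 1] = b.hi;
    res[idx_b * 3 + 2] = b.mid;
    res[idx_a * 3 + 2] = a.mid;
    res[idx_a * 3 + 3] = a.lo;
    res[idx_b * 3 + 3] = b.lo;
}
```

With this ordering the counter ends up at 1 instead of 2, so "at least one factor found" is still reported, but the stored factor is a mix of words from both candidates. A counter of 0 can only occur if no thread stored anything, which matches the reasoning in the quoted question.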

ckdo 2011-07-19 09:32

Why not simply(?) use CPU-based code to check the factors found and, if one proves to be wrong, use CPU-based code to re-search the grid/class? The time needed for this should be negligible.
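
A minimal sketch of such a CPU-side check in plain C, under stated assumptions: a candidate f divides 2^p - 1 exactly when 2^p mod f == 1, which a square-and-multiply loop can test. The function names are hypothetical, and the sketch only handles f below 2^63; real mfaktc candidates can be wider than 64 bits, so this is an illustration of the idea rather than a drop-in check.

```c
#include <assert.h>
#include <stdint.h>

/* (a * b) mod m by shift-and-add. Requires m < 2^63 so that
 * the doubling and the addition below cannot overflow. */
uint64_t mulmod_u64(uint64_t a, uint64_t b, uint64_t m)
{
    uint64_t r = 0;
    a %= m;
    while (b) {
        if (b & 1) {
            r += a;
            if (r >= m) r -= m;
        }
        a += a;
        if (a >= m) a -= m;
        b >>= 1;
    }
    return r;
}

/* Returns 1 if f divides 2^p - 1, i.e. if 2^p mod f == 1. */
int divides_mersenne(uint32_t p, uint64_t f)
{
    assert(f > 1 && f < (UINT64_C(1) << 63));

    uint64_t result = 1;
    uint64_t base = 2 % f;
    while (p) {  /* right-to-left square-and-multiply */
        if (p & 1) result = mulmod_u64(result, base, f);
        base = mulmod_u64(base, base, f);
        p >>= 1;
    }
    return result == 1;
}
```

For example, `divides_mersenne(11, 23)` accepts 23 as a factor of 2^11 - 1 = 2047 = 23 * 89, while `divides_mersenne(11, 47)` rejects 47. Each check costs only a few dozen modular multiplications, which supports the point that the CPU time would be negligible.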

James Heinrich 2011-07-19 11:35

[QUOTE=apsen;266891]I've included a Win64 executable compiled with CUDA 4.0 for multiple compute capabilities. I'd be interested if anyone could run the selftests and any other tests they can think of.[/QUOTE]
I would, but you didn't bundle cudart64_40_17.dll. Can you please post a copy of it to make things easier for testers?

apsen 2011-07-19 14:07

1 Attachment(s)
[QUOTE=James Heinrich;266911]I would, but you didn't bundle cudart64_40_17.dll -- can you post a copy of that please to make it easier for testers?[/QUOTE]

Sorry.

apsen 2011-07-19 14:50

1 Attachment(s)
[QUOTE=James Heinrich;266911]I would, but you didn't bundle cudart64_40_17.dll -- can you post a copy of that please to make it easier for testers?[/QUOTE]

Actually, could you use this version:

James Heinrich 2011-07-19 16:57

[QUOTE=apsen;266891]I'd be interested if anyone could run the selftests and any other tests they can think of. I'm also interested to know if there are any noticeable performance differences compared with stock mfaktc-0.17.[/QUOTE]
Selftest (for mfaktc17apsen.cuda40.sm_multi.win64) works fine here (Win7x64, i7-920, 8800GT):
[code]
Selftest statistics
number of tests 4914
successfull tests 4914

selftest PASSED!
[/code]
Speed (in my very, very brief testing) seemed identical to stock.

edit: you posted the new version after I'd run my tests

TheJudger 2011-07-19 17:29

[QUOTE=apsen;266866]Are you sure you know what you are asking for? :bounce:

Actually, the reason my first code had so many changes is that I was trying to make sense of the code, so I tried to make it more readable. And all those #ifdefs inside the code were getting to me. Sorry, but with you prompting it I could not resist.
:redface: [/QUOTE]

No problem. I saw your modifications to the #ifdefs. I'll try to remove/replace some of them in my official version (e.g. optional parameters for debugging, etc.).

Oliver

apsen 2011-07-19 17:41

[QUOTE=James Heinrich;266936]Speed (in my very, very brief testing) seemed identical to stock.[/QUOTE]
No surprise here. The difference is mostly in memory usage. The stock version uses 32 integers in total to hold results. My version uses 32 integers per thread to avoid using atomics. It could store up to 10 factors per thread, but the results are folded down to the original amount on output. I did not modify the output, to minimize diffs against the stock version, but that could easily be done later.
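
A minimal host-side sketch in plain C of what such a fold could look like, not the actual patch. The layout is an assumption for illustration: each 32-integer slot holds a factor count in word 0, followed by three 32-bit words per factor (which is consistent with a limit of 10 factors, since 1 + 10 * 3 = 31). The name `fold_results` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RES_WORDS   32  /* integers per result slot            */
#define MAX_FACTORS 10  /* 1 counter + 10 * 3 words fits in 32 */

/* Fold per-thread result slots into one stock-sized array.
 * Returns the total number of factors seen across all threads;
 * out[0] holds how many were actually stored (at most MAX_FACTORS). */
uint32_t fold_results(const uint32_t *per_thread, uint32_t num_threads,
                      uint32_t out[RES_WORDS])
{
    uint32_t total = 0, stored = 0;

    for (uint32_t i = 0; i < RES_WORDS; i++) out[i] = 0;

    for (uint32_t t = 0; t < num_threads; t++) {
        const uint32_t *slot = per_thread + (size_t)t * RES_WORDS;
        uint32_t n = slot[0];
        assert(n <= MAX_FACTORS);

        for (uint32_t k = 0; k < n; k++) {
            total++;
            if (stored < MAX_FACTORS) {
                /* copy the three words of factor k */
                for (uint32_t w = 1; w <= 3; w++)
                    out[stored * 3 + w] = slot[k * 3 + w];
                stored++;
            }
        }
    }
    out[0] = stored;
    return total;
}
```

Because each thread writes only to its own slot, no two writers ever touch the same word, so no atomics are needed during the search. The cost is memory: 32 integers times the number of threads instead of 32 integers in total.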

[QUOTE=James Heinrich;266936]edit: you posted the new version after I'd run my tests[/QUOTE]
Originally I did not realize that threadId is unique only within a block, so I had to fix that. mfaktc17apsen could have problems if two threads with the same id in different blocks find factors at the same time. mfaktc171apsen fixes that.
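
For readers new to CUDA, a minimal host-side model in plain C of why the per-block id is not enough. In a one-dimensional launch, the usual idiom for a grid-wide unique index is `blockIdx.x * blockDim.x + threadIdx.x`; whether the fixed version uses exactly this expression is an assumption. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RES_WORDS 32  /* integers per result slot (assumed) */

/* Model of the common 1D CUDA idiom
 *   blockIdx.x * blockDim.x + threadIdx.x
 * which is unique across the whole grid. */
uint32_t global_thread_index(uint32_t block_idx, uint32_t block_dim,
                             uint32_t thread_idx)
{
    assert(thread_idx < block_dim);
    return block_idx * block_dim + thread_idx;
}

/* Offset of a thread's private result slot, in integers. */
size_t result_slot_offset(uint32_t block_idx, uint32_t block_dim,
                          uint32_t thread_idx)
{
    return (size_t)global_thread_index(block_idx, block_dim, thread_idx)
           * RES_WORDS;
}
```

With 256 threads per block, thread 5 of block 0 and thread 5 of block 1 share the same per-block id, but their global indices are 5 and 261, so they get different result slots. Indexing the slots by the per-block id alone would send both threads to the same slot, which is the collision described above.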

BTW, at what point could the program be deemed good enough for production use? What is the usual standard?

