mersenneforum.org  

Old 2011-07-18, 05:49   #1079
apsen
 
Jun 2011

131 Posts

Quote:
Originally Posted by TheJudger
apsen: your changes seem to screw something up.
I originally tried to follow the changes since version 0.8, as it was the last stock version that worked on my machine, so it is quite likely that I broke something unintentionally.

I've now taken a clean mfaktc-0.17 and reapplied my changes directly to it. I've kept them almost minimal (with the exception of enabling it to compile under CUDA 2.2), so it should be easy to diff.

I do not see any test failures with it, so please test it on your system to see whether you still experience them.

Also, the slowdown was due to a synchronous memory copy in the main loop, but on my machine it was not as noticeable. I was losing less than 10% of performance, so I was going to look into it later. I have now reworked that too, and performance is back on par with the stock build. Please check whether I did it the right way.
Attached Files
File Type: zip mfaktc-test-cuda2.2.zip (231.4 KB, 101 views)
Old 2011-07-18, 10:00   #1080
TheJudger
 
"Oliver"
Mar 2005
Germany

11×101 Posts

Hi apsen,

Quote:
Originally Posted by apsen
The idea is simple: give each thread its own chunk of memory to write the results to, so there's no need for a shared variable.
Each thread or each stream? I think it is the latter, and then it won't work reliably. Old GPUs can't even run concurrent kernels, so the behavior is the same as without separate chunks of memory.
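
To make the distinction concrete, here is a minimal host-side sketch in plain C. It is not taken from mfaktc or from the attached patch; the names `slot_per_thread`, `slot_per_stream`, `threads_collide` and the constant `THREADS_PER_LAUNCH` are hypothetical and exist only for illustration. With one slot per thread, every (stream, thread) pair maps to a distinct index. With one chunk per stream, all threads of a launch map to the same index, so two threads that find factors in the same launch can still collide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical launch size, chosen only for illustration. */
#define THREADS_PER_LAUNCH 256

/* One slot per thread: unique for every (stream, thread) pair. */
size_t slot_per_thread(size_t stream, size_t thread)
{
    assert(thread < THREADS_PER_LAUNCH);
    return stream * THREADS_PER_LAUNCH + thread;
}

/* One chunk per stream: every thread of a launch gets the same slot. */
size_t slot_per_stream(size_t stream, size_t thread)
{
    (void)thread; /* the thread index plays no role, which is the problem */
    return stream;
}

/* Returns 1 if two different threads of the same launch share a slot. */
int threads_collide(size_t (*slot)(size_t, size_t), size_t stream)
{
    for (size_t a = 0; a < THREADS_PER_LAUNCH; a++)
        for (size_t b = a + 1; b < THREADS_PER_LAUNCH; b++)
            if (slot(stream, a) == slot(stream, b))
                return 1;
    return 0;
}
```

Under these assumptions, `threads_collide(slot_per_thread, 0)` returns 0 and `threads_collide(slot_per_stream, 0)` returns 1, which is the per-stream weakness described above.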

Quote:
Originally Posted by apsen
I've now taken a clean mfaktc-0.17 and reapplied my changes directly to it. I've kept them almost minimal (with the exception of enabling it to compile under CUDA 2.2), so it should be easy to diff.
May I ask why you're still on CUDA 2.2? (There might be good reasons.)

Oliver
Old 2011-07-18, 12:08   #1081
apsen
 
Jun 2011

131 Posts

Quote:
Originally Posted by TheJudger
Each thread or each stream? I think it is the latter, and then it won't work reliably. Old GPUs can't even run concurrent kernels, so the behavior is the same as without separate chunks of memory.
I meant each thread, but it seems I did indeed do it per stream :-( I will need to go back and fix it.


Quote:
Originally Posted by TheJudger
May I ask why you're still on CUDA 2.2? (There might be good reasons.)
The driver on the machine does not support CUDA 3.1 or higher, and if I upgrade the driver, the performance of stock mfaktc-0.8 drops significantly. So while I'm changing mfaktc, I need to test it on 2.2. When it's ready, I'll compile it for a newer CUDA and upgrade the driver.

It was late yesterday so I didn't really do it this time.

Is there a problem with 2.2?

Last fiddled with by apsen on 2011-07-18 at 12:13
Old 2011-07-18, 14:01   #1082
TheJudger
 
"Oliver"
Mar 2005
Germany

11·101 Posts

Quote:
Originally Posted by apsen
Is there a problem with 2.2?
I don't know of any problems related to mfaktc (except that it doesn't compile as it is now). As CUDA 2.2 is a thing of the past, I don't have any real plans for supporting it in mfaktc. If the needed changes are trivial and have no side effects, I might try it anyway.

Oliver
Old 2011-07-18, 14:43   #1083
apsen
 
Jun 2011

131 Posts

Quote:
Originally Posted by TheJudger
As CUDA 2.2 is a thing of the past, I don't have any real plans for supporting it in mfaktc. If the needed changes are trivial and have no side effects, I might try it anyway.
The only difference is the absence of __launch_bounds__. As you can see, I've just added a conditional macro that does away with it when compiling under CUDA 2.2. I don't care about supporting 2.2 either; I just need it until we can make the current version of mfaktc work with sm_10.
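
For readers without the attachment, a conditional macro of this kind might look roughly like the sketch below. This is not the code from the zip file. The macro name `KERNEL_LAUNCH_BOUNDS` and the function `launch_bounds_demo` are hypothetical, and the version cutoff (anything newer than CUDA 2.2, where `CUDART_VERSION` is 2020) is an assumption based only on the statement above that 2.2 lacks `__launch_bounds__`.

```c
/*
 * CUDART_VERSION is defined by the CUDA runtime headers (2020 for CUDA 2.2).
 * ASSUMPTION: every toolkit newer than 2.2 accepts __launch_bounds__.
 * When a plain C compiler builds this file, CUDART_VERSION is undefined,
 * so the attribute is dropped there as well.
 */
#if defined(CUDART_VERSION) && CUDART_VERSION > 2020
#define KERNEL_LAUNCH_BOUNDS(max_threads, min_blocks) \
    __launch_bounds__(max_threads, min_blocks)
#else
#define KERNEL_LAUNCH_BOUNDS(max_threads, min_blocks) /* expands to nothing */
#endif

/*
 * Stand-in for a kernel definition. In real CUDA code this would be a
 * __global__ function; an ordinary function is used here so the sketch
 * also compiles as plain C.
 */
KERNEL_LAUNCH_BOUNDS(256, 2)
int launch_bounds_demo(int x)
{
    return x + 1;
}
```

The point of wrapping the attribute is that each kernel definition stays identical across toolkit versions; only the macro body changes.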


Quote:
Originally Posted by apsen
I will need to go back and fix it.
I've made the changes, but I won't be able to test them until later today. Maybe you'd be willing to take a look before then to see whether my understanding is right. (If you are on German time, it will be past midnight for you before I get a chance to test.)
Attached Files
File Type: zip mfaktc-test-cuda2.2.zip (120.9 KB, 101 views)
Old 2011-07-18, 15:57   #1084
apsen
 
Jun 2011

131 Posts

Quote:
Originally Posted by TheJudger
use of atomic instructions for access to the results array (this needs CC >=1.1)

Just to make sure I understand it correctly: blindly replacing the atomics with unprotected access to d_RES can cause a problem only when we find more than one factor per class (tf_class_* call). Even then, the program will still report that at least one factor has been found, but the factors themselves may be scrambled by simultaneous attempts to store them in the result array. So if the program reports no factors found, that will be true. Is this correct?
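
The scenario can be replayed deterministically on the host. The sketch below is plain C, not mfaktc code, and the result layout is an assumption made for illustration: word 0 counts the factors found, and each factor occupies three 32-bit words after it. The real d_RES layout may differ. `store_atomic` stands in for the protected path, and `store_racy_pair` replays one unlucky interleaving of two threads that both read the counter before either updates it.

```c
#include <stdint.h>
#include <string.h>

#define WORDS_PER_FACTOR 3   /* hypothetical: one factor as three 32-bit words */
#define MAX_FACTORS      4
#define RES_WORDS        (1 + MAX_FACTORS * WORDS_PER_FACTOR)

/* Protected path: the slot index is reserved in one indivisible step
 * (on the GPU an atomic increment provides this; here it is sequential). */
void store_atomic(uint32_t *res, const uint32_t *factor)
{
    uint32_t idx = res[0]++;
    if (idx < MAX_FACTORS)
        memcpy(&res[1 + idx * WORDS_PER_FACTOR], factor,
               WORDS_PER_FACTOR * sizeof(uint32_t));
}

/* Unprotected path, replayed for two threads A and B.
 * Assumes res[0] < MAX_FACTORS on entry. */
void store_racy_pair(uint32_t *res, const uint32_t *fa, const uint32_t *fb)
{
    uint32_t idx_a = res[0];   /* A reads the counter */
    uint32_t idx_b = res[0];   /* B reads the same value */
    uint32_t *slot_a = &res[1 + idx_a * WORDS_PER_FACTOR];
    uint32_t *slot_b = &res[1 + idx_b * WORDS_PER_FACTOR];

    slot_a[0] = fa[0];
    slot_b[0] = fb[0];         /* overwrites A's word 0 */
    slot_b[1] = fb[1];
    slot_a[1] = fa[1];         /* overwrites B's word 1 */
    slot_a[2] = fa[2];
    slot_b[2] = fb[2];         /* overwrites A's word 2 */

    res[0] = idx_a + 1;        /* A publishes its count */
    res[0] = idx_b + 1;        /* B publishes the same count */
}
```

In this replay the counter ends up nonzero, so "at least one factor found" is still reported, while the single stored slot holds a mix of words from both factors. A zero counter therefore still means that no thread found a factor.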
Old 2011-07-18, 16:32   #1085
TheJudger
 
"Oliver"
Mar 2005
Germany

11·101 Posts

correct!
Old 2011-07-18, 18:24   #1086
apsen
 
Jun 2011

131 Posts

Is the CPUStreams configuration parameter basically the length of the sieve queue?
Old 2011-07-18, 19:38   #1087
TheJudger
 
"Oliver"
Mar 2005
Germany

457₁₆ Posts

yes!

btw.: please change the version string in your modified code to something unique.
e.g. "0.17-ap1"

Oliver

Last fiddled with by TheJudger on 2011-07-18 at 20:15
Old 2011-07-18, 20:11   #1088
TheJudger
 
"Oliver"
Mar 2005
Germany

1111₁₀ Posts

Hi Eric,

Quote:
Originally Posted by Christenson
Hi Oliver:

I've been putting my time into parse.c ... it has gone through one rewrite and needs another to get it organized around a parse_line function that returns a structure with both the data found and the original line.
Feel free to send me your stuff (even if it is not finished).

Oliver
Old 2011-07-18, 22:03   #1089
apsen
 
Jun 2011

131₁₀ Posts

Quote:
Originally Posted by TheJudger
btw.: please change the version string in your modified code to something unique.
e.g. "0.17-ap1"
Oliver
OK. But would you be willing to incorporate my changes into your code once they have gone through testing?

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.