mersenneforum.org

mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

TheJudger 2010-10-13 17:29

Hi Ethan,

[QUOTE=Ethan (EO);233232]Actually, what made me wonder about this is that I _was_ able to make my 470 miss a factor in the selftest with a not-too-severe overclock! This occurred during one of three selftests that I ran, so that means one error out of under 15000 tests. Not so terrible, I suppose, when multiplied by the fraction of exponents which will have detectable factors, but it was unsettling nonetheless.[/QUOTE]

Of course the selftest will detect some computational errors (e.g. too much overclocking, defective hardware, ...). But I wouldn't recommend mfaktc's selftest as a hardware test. At default settings up to ~15M factor candidates are processed for a single test, but I only check whether the one known factor is found and don't care about the results for the remaining factor candidates. So less than 0.00001% of the results are verified. This is OK because false positives are very easy to spot; they won't harm primenet progress.
A good hardware test would check the result of every single factor candidate, but this would need a lot of changes in the code and would cause a slight slowdown, too.
Very early versions of my code (before mfaktc 0.01) downloaded the whole bunch of data back to the CPU. This was slow! :wink:
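
To make the difference between the two kinds of check concrete, here is a minimal CPU-side sketch in C. It is not taken from mfaktc: the function names and the `reported[]` array (one flag per k, standing in for the results coming back from the GPU) are made up for this example. It assumes the usual trial-factoring test, that q divides 2^p - 1 exactly when 2^p mod q = 1, and candidates of the form q = 2kp + 1. It ignores sieving and classes, and it relies on the `unsigned __int128` extension of GCC/Clang.

```c
#include <stdint.h>

/* 2^p mod q by square-and-multiply (requires q > 1).
   The 128-bit product uses the GCC/Clang unsigned __int128 extension. */
uint64_t pow2_mod(uint64_t p, uint64_t q)
{
    uint64_t result = 1;
    uint64_t base = 2 % q;
    while (p > 0) {
        if (p & 1)
            result = (uint64_t)(((unsigned __int128)result * base) % q);
        base = (uint64_t)(((unsigned __int128)base * base) % q);
        p >>= 1;
    }
    return result;
}

/* q divides 2^p - 1 exactly when 2^p mod q == 1. */
int is_factor(uint64_t p, uint64_t q)
{
    return q > 1 && pow2_mod(p, q) == 1;
}

/* Selftest-style check: only the flag of the one known factor is inspected. */
int selftest_ok(const unsigned char *reported, uint64_t known_k)
{
    return reported[known_k] != 0;
}

/* Hardware-test-style check: recompute every candidate q = 2*k*p + 1
   for k = 1..k_max on the CPU and count disagreements with reported[k]. */
uint64_t count_mismatches(uint64_t p, const unsigned char *reported, uint64_t k_max)
{
    uint64_t bad = 0;
    for (uint64_t k = 1; k <= k_max; k++) {
        int expected = is_factor(p, 2 * k * p + 1);
        if ((reported[k] != 0) != expected)
            bad++;
    }
    return bad;
}
```

For p = 11 the candidates with k = 1..4 are 23, 45, 67 and 89, of which 23 and 89 divide 2^11 - 1 = 2047. If a faulty run drops the flag for k = 4, `selftest_ok()` with `known_k = 1` still passes, while `count_mismatches()` reports one error.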

Oliver

TheJudger 2010-10-17 11:32

Hi Lorenzo,

[QUOTE=Lorenzo;227744]Hi Oliver,

I'm trying to set up a 64-bit system:
1. Windows XP Professional x64 Edition
2. nVidia Forceware v258.96 International (WinXP x64) WHQL
3. CUDA ToolKit 3.1 (Win64)
4. Video Card nVidia Geforce 8500 GT (256 Mb DDR2)
5. mfaktc-0.09-win64-eoc

But when I run mfaktc-0.09-win64-eoc:
[CODE]mfaktc v0.09-Win
...
[B]cudaStreamCreate() failed[/B][/CODE]
What am I doing wrong?[/QUOTE]

I found at least one way to reproduce this error message:
I've built a binary with openSUSE 11.2 / Nvidia driver 260.40 / CUDA toolkit [B]3.2.9[/B] and then switched to CUDA toolkit [B]3.1[/B]...
Of course I can't fix this in my code, but I'll try to produce a more meaningful error message (compare the toolkit version used for compiling with the toolkit version actually in use, [B]if possible[/B]).
I'm not sure if this is the issue that you've seen, but at least this is one way to produce this error message.

Oliver

TheJudger 2010-10-17 11:51

Hi frmky,

[QUOTE=frmky;231731]I thought I'd put my money where my mouth is. Here are the results.

GPU Bench: 222.5 M/s
Single threaded: 105.0 M/s
OMP 1 thread: 97.3 M/s
OMP 2 threads: 116.9 M/s
OMP 3 threads: 134.8 M/s
OMP 4 threads: 139.5 M/s

I was hoping for more than 33% improvement for a 300% increase in CPU time. :smile: The k_tab[] generation code is ugly (no offense intended!). It would need to be more structured to parallelize. [/QUOTE]

Thank you for your attempt. :smile:
Scaling is limited to four threads. I've tried 5 and 6 threads on my six-core and the timings got worse. :sad:
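
For reference, the scaling figures quoted above can be turned into speedup and efficiency numbers with a few lines of C. This is only a sketch: the function names are invented for this example, and the Amdahl estimate assumes the simple model S = 1 / ((1 - f) + f / n), which ignores memory bandwidth limits and OpenMP overhead.

```c
/* Speedup of a parallel run relative to the single-threaded baseline. */
double speedup(double parallel_rate, double baseline_rate)
{
    return parallel_rate / baseline_rate;
}

/* Parallel efficiency: speedup divided by the number of threads. */
double efficiency(double parallel_rate, double baseline_rate, int threads)
{
    return speedup(parallel_rate, baseline_rate) / threads;
}

/* Parallel fraction f implied by Amdahl's law,
   S = 1 / ((1 - f) + f / n), solved for f. Only meaningful for n > 1. */
double amdahl_parallel_fraction(double s, int n)
{
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / n);
}
```

With frmky's numbers (105.0 M/s single threaded, 139.5 M/s with 4 OMP threads) this gives a speedup of about 1.33, an efficiency of about 33% and, under the Amdahl model, a parallel fraction of roughly one third.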

I've fiddled with the parallel AND of the per-thread sieves:
[CODE]
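if(n == 4)
{
  /* special case: combine all four sieves in a single pass */
  for(j=0; j<(((SIEVE_SIZE-1)>>3)+sizeof(unsigned int))/sizeof(unsigned int); j++)
  {
    sieve[j] = sieves[0][j] & sieves[1][j] & sieves[2][j] & sieves[3][j];
  }
}
else // n != 4
{
  /* general case: the words of the sieve are distributed over the threads;
     note that "&=" relies on sieve[] being initialized beforehand */
#pragma omp parallel for private(j)
  for(i=0; i<(((SIEVE_SIZE-1)>>3)+sizeof(unsigned int))/sizeof(unsigned int); i++)
  {
    for(j=0; j<n; j++)
    {
      sieve[i] &= sieves[j][i];
    }
  }
}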
[/CODE]

which gives better (but still not good) results for 4 threads.

2nd attempt:
[CODE]
i=0;
do
{
  i*=2;
  i++;  /* i = 1, 3, 7, ...; partner offset (i+1)>>1 = 1, 2, 4, ... */
//  printf("i = %d\n", i);
#pragma omp parallel for private(j)
  for(thd=0; thd<n; thd++)
  {
    if((thd & i) == 0 && (thd+((i+1)>>1)) < n)
    {
//      printf("Thread %d: %d\n", thd, thd+((i+1)>>1));
      for(j=0; j<(((SIEVE_SIZE-1)>>3)+sizeof(unsigned int))/sizeof(unsigned int); j++)
      {
        sieves[thd][j] &= sieves[thd+((i+1)>>1)][j];
      }
    }
  }
}
while(i<n);

/* copy the combined result from sieves[0][] to sieve[] */
for(j=0; j<(((SIEVE_SIZE-1)>>3)+sizeof(unsigned int))/sizeof(unsigned int); j++)
{
  sieve[j] = sieves[0][j];
}
[/CODE]
But this runs even worse. :sad:

Of course in all 3 variants one could save that extra copy to sieve[] and use sieves[0][] directly in the k_tab generation code, but I don't expect that this will solve the scalability issue...
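
The pairwise combining idea of the 2nd attempt can also be written with an explicit stride, which may be easier to follow. The following is a standalone sketch, not the mfaktc code: `and_linear()`, `and_tree()` and the `words` parameter are names invented here, and the per-thread sieves are assumed to be `n` separate buffers of `words` unsigned ints each. `and_linear()` is a plain reference to compare against.

```c
#include <stddef.h>

/* Reference: AND all n sieves word by word into out[]. */
void and_linear(unsigned int *out, unsigned int **sieves, int n, size_t words)
{
    for (size_t w = 0; w < words; w++) {
        unsigned int acc = sieves[0][w];
        for (int t = 1; t < n; t++)
            acc &= sieves[t][w];
        out[w] = acc;
    }
}

/* Pairwise (tree) reduction in place. Afterwards sieves[0] holds the AND
   of all n sieves; some of the other buffers are overwritten with
   partial results. */
void and_tree(unsigned int **sieves, int n, size_t words)
{
    for (int stride = 1; stride < n; stride *= 2) {
        /* The pairs (t, t + stride) touch disjoint buffers,
           so they can be processed independently. */
#ifdef _OPENMP
#pragma omp parallel for
#endif
        for (int t = 0; t < n; t += 2 * stride) {
            if (t + stride < n) {
                for (size_t w = 0; w < words; w++)
                    sieves[t][w] &= sieves[t + stride][w];
            }
        }
    }
}
```

Note that in every variant each word is only read and ANDed once, so the work is probably limited by memory bandwidth rather than by arithmetic. That might explain why more threads help so little, but this is a guess and not a measurement.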


Oliver

Lorenzo 2010-10-19 07:06

Hi Oliver,

I found another cudart.dll (in this thread) and put it in the directory with mfaktc. Everything worked fine!

But now I have sold that card and bought a GTX 460. With the new card everything works fine, too!

TheJudger 2010-10-19 12:09

Hi Lorenzo,

So it might still be a version mismatch.
My current development version already contains a check for the versions:
[CODE]
CUDA version info
binary compiled for CUDA 3.10
CUDA driver version 3.10
CUDA runtime version 3.10
[/CODE]

If the binary is compiled for a newer version than the driver or runtime library there is a "WARNING: ..."
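
A check along these lines could look like the following sketch. It is an illustration only and not the mfaktc source: the function names are made up, and the version numbers are passed in as plain integers so the example compiles without the CUDA toolkit. In a real CUDA build the three values would come from the `CUDART_VERSION` macro, `cudaDriverGetVersion()` and `cudaRuntimeGetVersion()`, which encode a version as 1000 * major + 10 * minor (e.g. 3010 for CUDA 3.1).

```c
#include <stdio.h>

/* Format a CUDA version number (1000 * major + 10 * minor) in the
   "major.minor" style of the output above, e.g. 3010 -> "3.10". */
void format_cuda_version(int version, char *buf, size_t len)
{
    snprintf(buf, len, "%d.%d", version / 1000, version % 100);
}

/* Returns 1 if the binary was compiled for a newer CUDA version than
   the installed driver or runtime library supports, 0 otherwise. */
int cuda_version_mismatch(int compiled, int driver, int runtime)
{
    return compiled > driver || compiled > runtime;
}
```

A binary built with toolkit 3.2 (3020) but run with a 3.1 runtime library (3010) would be flagged, while the all-3.10 case shown above would not.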

Oliver

P.S. GTX 460 is a very nice card for mfaktc! :smile:

Lorenzo 2010-10-19 13:31

[QUOTE]so it might still be a version mismatch.[/QUOTE]
I think that is true!

A small suggestion for the next version.
For 78 bits and above, mfaktc takes a very long time per class. Sometimes I have to interrupt the process. I may stop it at the beginning of a class or near the end of a class, and if I break off near the end of a class I can lose an hour or more of work. But if I could see the completion time of the previous class, I could estimate when the current class will finish and wait a few minutes.
For example:
[QUOTE][19.10.2010 16:30] class 0: tested 105.63G candidates in 2244s (47.08M/sec) (avg. wait: 11115usec)[/QUOTE]

TheJudger 2010-10-19 14:04

Hi Lorenzo,

About the runtime per class:
- A build with MORE_CLASSES enabled has a lower runtime per class (~1/10 time per class but 10 times more classes). :smile:
- It looks like you're affected by the "high average wait" bug which can occur on Windows with 2xx series drivers.
Assuming that you run [B]one[/B] instance of mfaktc, the performance is way too low. I don't know the size of your current exponent but I think ~150M/s is more realistic on a GTX 460 if everything works right.
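
As a rough illustration of what that means for the time per class, the arithmetic can be sketched in C. The helper names are invented for this example; the input numbers are the ones from the log line quoted above.

```c
/* Throughput in million candidates per second. */
double rate_mps(double candidates, double seconds)
{
    return candidates / seconds / 1e6;
}

/* Estimated seconds for a class with the given number of candidates
   at a given throughput (in million candidates per second). */
double class_seconds_at(double candidates, double rate)
{
    return candidates / (rate * 1e6);
}
```

For 105.63G candidates in 2244s, `rate_mps()` gives about 47.1 M/s. At ~150 M/s the same class would take about 704s instead of 2244s, and with MORE_CLASSES roughly a tenth of that per class.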

mfaktc 0.13 has a modified stream scheduler, Ethan reported that this cures the performance issue (Windows, 2xx series driver) on his system. :smile:

Oliver

Ethan (EO) 2010-10-20 04:51

[QUOTE=TheJudger;233869]Hi Lorenzo,

...

mfaktc 0.13 has a modified stream scheduler, Ethan reported that this cures the performance issue (Windows, 2xx series driver) on his system. :smile:

Oliver[/QUOTE]

[CODE]class 412: 93.59M candidates in 378ms (247.58M/sec) (avg. wait: 51usec)
class 417: 93.59M candidates in 378ms (247.58M/sec) (avg. wait: 48usec)
no factor for M66362159 from 2^64 to 2^65 [mfaktc xxxxxxxxx barrett79_mul32]
tf(): total time spent: 36.487s[/CODE] :smile:

TheJudger 2010-10-27 09:26

Hello,

find attached mfaktc 0.13.
This version features a modified stream scheduler which might solve some performance issues. If everything was OK with earlier versions on your system you [B]won't[/B] see any difference in performance. But there might be some nice improvements if you had performance problems before (e.g. Windows + 2xx.xx series driver (+ fast GPU)).

This version prints some information about CUDA versions, too. This might help to identify some issues (version mismatches).

Thank you Luigi, Kevin, Dave and Ethan for testing the pre-releases. :smile:
Thank you Ethan for two nice code cleanups. :smile:

Oliver

ET_ 2010-10-27 10:59

[QUOTE=TheJudger;234513]Hello,

find attached mfaktc 0.13.
This version features a modified stream scheduler which might solve some performance issues. If everything was OK with earlier versions on your system you [B]won't[/B] see any difference in performance. But there might be some nice improvements if you had performance problems before (e.g. Windows + 2xx.xx series driver (+ fast GPU)).

This version prints some information about CUDA versions, too. This might help to identify some issues (version mismatches).

Thank you Luigi, Kevin, Dave and Ethan for testing the pre-releases. :smile:
Thank you Ethan for two nice code cleanups. :smile:

Oliver[/QUOTE]

I started a new factorization with 0.12 just a few hours ago... :-(

Luigi

ET_ 2010-10-29 10:31

China claims supercomputer crown

Reaching 2.5 PetaFLOPS with an infrastructure holding about 14,000 Intel CPUs and 7,000 nVidia GPUs.

[url]http://www.bbc.co.uk/news/technology-11644252[/url]

Luigi

