mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

Chuck 2011-07-21 02:49

Selftest passed OK. Core i7-970 and GTX 580

[FONT=Fixedsys]Result[00]: M3321928619 has a factor: 38814612911305349835664385407[/FONT]
[FONT=Fixedsys]found 1 factor(s) for M3321928619 from 2^94 to 2^95 [mfaktc 0.17.1-Win !!!TEST!![/FONT]
[FONT=Fixedsys]! apsen 95bit_mul32][/FONT]
[FONT=Fixedsys]selftest for M3321928619 passed![/FONT]
[FONT=Fixedsys]tf(): total time spent: 0.212s[/FONT]
[FONT=Fixedsys]Selftest statistics[/FONT]
[FONT=Fixedsys]number of tests 4914[/FONT]
[FONT=Fixedsys]successfull tests 4914[/FONT]
[FONT=Fixedsys]selftest PASSED![/FONT]

apsen 2011-07-21 21:37

[QUOTE=Christenson;266992]
I do have this dream of moving sieving off the CPU in mfaktc....it would significantly improve my multi-core LL performance...so your effort to understand how the sieving works in enough detail to make it screw up creatively won't be wasted.[/QUOTE]

:smile: I'll see how much time I could spend on it.

apsen 2011-07-21 22:04

[QUOTE=apsen;266988]
[QUOTE=Christenson;266975]
b) running them through a large, relatively slow expression in parallel.
[/QUOTE]

That actually makes it harder to trigger. The atomics deal only with storing thread local data to shared memory. That's only four bytes for each factor, seven lines of c code:

[CODE]
index=RES[0]++;
if(index<10) /* limit to 10 factors per thread */
{
RES[index*3 + 1]=f.d2;
RES[index*3 + 2]=f.d1;
RES[index*3 + 3]=f.d0;
}
[/CODE]

It's very negligible time comparing with the rest of computation.
[/QUOTE]

! But we could

[CODE]
[COLOR="Red"]__syncthreads();[/COLOR]
index=RES[0]++;
if(index<10) /* limit to 10 factors per thread */
{
RES[index*3 + 1]=f.d2;
RES[index*3 + 2]=f.d1;
RES[index*3 + 3]=f.d0;
}
[/CODE]

to try threads go for the shared memory at the same time once we figure the way to heard the factors right.

apsen 2011-07-22 01:42

Oliver, did you get my message?

TheJudger 2011-07-22 11:08

[QUOTE=apsen;267179]
[CODE]
[COLOR="Red"]__syncthreads();[/COLOR]
index=RES[0]++;
if(index<10) /* limit to 10 factors per thread */
{
RES[index*3 + 1]=f.d2;
RES[index*3 + 2]=f.d1;
RES[index*3 + 3]=f.d0;
}
[/CODE]
[/QUOTE]
The chance for a lost update is higher than without the update... (I did some expericments in the past). There is another issue in this variant: not all threads execute the __syncthreads() which has AFAIK undefined behavior.

[QUOTE=apsen;267203]Oliver, did you get my message?[/QUOTE]
Yepp, had YABT (Yet Another Business Trip).

Oliver

TheJudger 2011-07-22 11:10

Hi James,

[QUOTE=James Heinrich;266979]I found an interesting issue, possibly bug, in mfaktc:[code]
M862452323 has a factor: 1724904647
M862452323 has a factor: 103494278761
M862452323 has a factor: 26768470919200553
M862452323 has a factor: 178517762372762302367
M862452323 has a factor: 50544926613807686257
found 5 factor(s) for M862452323 from 2^ 1 to 2^70 [mfaktc 0.18-pre1-Win 71bit_mul24][/code]The problem?
1724904647 * 103494278761 = 178517762372762302367
So it really only found 4 factors, but [i]also[/i] reported a composite factor of the two smallest factors.

How did this happen?[/QUOTE]

This is known (at least to me).
And technically it is correct. 178517762372762302367 is a factor of M862452323. It is a composite factor, not a prime factor.

Oliver

Christenson 2011-07-22 11:49

That multiple factor thing might have been avoided with an absurdly high setting for SievePrimes (and code updates to make sure that setting actually worked), wouldn't it?

Not that we wouldn't all grow old waiting for the CPU to do the sieving at the same time!

James Heinrich 2011-07-22 12:04

[QUOTE=TheJudger;267248]This is known (at least to me).
And technically it is correct. 178517762372762302367 is a factor of M862452323. It is a composite factor, not a prime factor.[/QUOTE]Of course. You could claim that all these are factors, and it would be true, although not neccesarily useful :smile:[code]M862452323 has a factor: 1724904647
M862452323 has a factor: 103494278761
M862452323 has a factor: 26768470919200553
M862452323 has a factor: 50544926613807686257
M862452323 has a factor: 178517762372762302367
M862452323 has a factor: 46173059881613395394669791
M862452323 has a factor: 2770383591317463939447354833
M862452323 has a factor: 87185178798430852389017336279
M862452323 has a factor: 5231110724923700473126386687577
M862452323 has a factor: 1353010398174837130096359433084900121
M862452323 has a factor: 4778647530636042401407668963299608951
M862452323 has a factor: 9023167198392429666531803015720504470319
M862452323 has a factor: 2333813923251096884171353943910429764243762287
M862452323 has a factor: 140028835315238199557965845958940579302616630081[/code]Would it be worth a few extra lines of code to try and factor the found factor to break it down into prime factors on the chance that it happens to be a composite factor? I know PrimeNet will break it down when submitted, but I still think I'd rather know that I found 2 small factors rather than one larger one.

wblipp 2011-07-22 12:34

[QUOTE=James Heinrich;267255]Would it be worth a few extra lines of code to try and factor the found factor to break it down into prime factors on the chance that it happens to be a composite factor? I know PrimeNet will break it down when submitted, but I still think I'd rather know that I found 2 small factors rather than one larger one.[/QUOTE]

Use alpertron or yafu; don't mess with the single minded pursuit of speed by distracting the programmers.

apsen 2011-07-22 14:31

[QUOTE=TheJudger;267246]The chance for a lost update is higher than without the update...
[/QUOTE]

That was related to the talks about how to increase the chance of memory access related problems in special test version if we want to test for the presence of said problems.

[QUOTE=TheJudger;267246]There is another issue in this variant: not all threads execute the __syncthreads() which has AFAIK undefined behavior.
[/QUOTE]

That was just to illustrate the idea but good point. Need to keep that in mind.

Bdot 2011-07-25 23:49

[QUOTE=apsen;266988]
It looks like the approach Bdot is taking with mfakto. BTW I wonder if he's still reading this thread? Maybe he'll borrow my solution? :smile:[/QUOTE]

I do follow the thread, just did not have a lot of time recently ...

What problem is it that you want to see in mfaktc/o? Just what happens when multiple threads write to the same buffer? For this test you don't need mfaktc - a small kernel like this
[code]
f=get_global_id(0);
index=RES[0]++;
if(index<10) /* limit to 10 factors */
{
RES[index*3 + 1]=f;
RES[index*3 + 2]=f;
RES[index*3 + 3]=f;
}
[/code]will show. Now run that kernel in one, two, five, twenty ... threads and check the RES buffer. No need to actually test any factor ...

For the problem to occur in real life, the factors need to be in the same class (we talked about that sufficiently). And they also need to be close together (in the same grid, not just the same bitlevel). mfaktc/o tests up to 2^20 resp. 2^21 factor candidates in one run. Someone with better mathematical background than I have could maybe calculate the chances:
[LIST][*]we're typically searching factors above 2^64, more often above 2^68[*]the two factors must be in the same class out of 4620 classes[*]they must not be more than 2^20 (2^21) of our "probably prime" factor candidates apart. (At SievePrimes=200.000, grid size 2^21, "last k" - "first k" of a grid is ~ 5.3E10)[/LIST]@apsen: I don't think that copying 32 bytes per FC from the GPU to the CPU after a grid was tested is a good idea. Currently mfaktc copies 4 byte per FC to the GPU and 32 bytes back in TOTAL. Your approach (as of[URL="http://mersenneforum.org/showpost.php?p=266751&postcount=1079"] this post[/URL]) increases the required bandwidth about 9 times.

Probably the comment " /* limit to 10 factors per thread */" was misleading :smile: : A single thread will only test one candidate, and as such will have at most one result. So the code could be changed to just double the required bandwitdh (4 bytes in, 4 bytes out per FC). Still I'm afraid that would be a major performance hit, at least for OpenCL where there is no transfer-hiding support yet.

I'll see that I can test the above mini-kernel and verify that for all possible thread numbers at least RES[0] is non-zero. Then we're (almost) certain that we always have some indication that there is at least one factor in the unlikely event above.


Unfortunately my Windows-PC has lost its PSU. The good thing about that: Now there is a running version of mfakto for Linux :smile:.


All times are UTC. The time now is 23:14.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.