mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

xilman 2010-04-11 12:01

[quote=Mini-Geek;211371]You could always do TF-LMH or manually choose numbers TFd to a very low bit level and do 1 bit on it to speed the process (and be playing with less consequential things, so you don't have to potentially give up the credit for two full-size TFs).[/quote]You could always stop being obsessed with credit and just do something useful.

Paul

Mini-Geek 2010-04-11 12:07

[quote=xilman;211374]You could always stop being obsessed with credit and just do something useful.

Paul[/quote]
I was thinking more of the time involved in finding a factor with normal TF, considering the point of the exercise is to test something that applies at any bit level. The credit is a bit of a side issue. But if you wanted to do TF anyway, and don't mind delaying the experiment an unknown amount of time (and don't care too much about some credit), by all means, do it that way. :smile:

TheJudger 2010-04-11 13:19

[QUOTE=ixfd64;211344]That sounds really exciting!

However, problems may arise when it comes to submitting the results. The source code that generates the PrimeNet checksums is not publicly available, so I don't know if results from mfaktc will be rejected. If that's the case, we have three alternatives:

1) E-mail the results to George every time.
2) Convince George to implement the mfaktc code in Prime95.
3) Convince George to configure PrimeNet such that it accepts results from mfaktc. This is definitely a possibility, since PrimeNet already "trusts" [url=http://mersenneforum.org/showthread.php?t=12576&page=4]MacLucasFFTW[/url] (I think).

Personally, I like option #2. :smile:[/QUOTE]

I've already talked with George a little bit about this topic.
1) Not an option, too time-consuming for George.
2) This might have negative side effects. You would probably need the CUDA environment on all machines that run Prime95 on the CPU, too. Not really useful!
3) I think that is a good option.

---
In the current state you can take the "has a factor" lines out of the results.txt from mfaktc and paste them directly into the manual result check-in form.
I've already submitted some factors found by mfaktc this way.
For those who are hunting for credit: if you log in to the website before you submit the results, you'll get it...


Oliver

henryzz 2010-04-11 13:23

[quote=TheJudger;211380]2) This might have negative side effects. You would probably need the CUDA environment on all machines that run Prime95 on the CPU, too. Not really useful!
[/quote]
What if there were multiple downloads, one including CUDA and one not?

TheJudger 2010-04-11 13:42

Hi,

[QUOTE=Mini-Geek;211371]You could always do TF-LMH or manually choose numbers TFd to a very low bit level and do 1 bit on it to speed the process (and be playing with less consequential things, so you don't have to potentially give up the credit for two full-size TFs).[/QUOTE]

Performance-wise this is not a good idea for mfaktc. For good performance you need big ranges.
E.g. TF of M66xxxxxx from 2^65 to 2^68 in one step yields a higher throughput than TF of the same number from 2^65 to 2^66, 2^66 to 2^67 and 2^67 to 2^68 in three separate passes.
There are two main reasons for this. mfaktc tests thousands of candidates per block (controlled by THREADS_PER_GRID).
-> TF doesn't stop exactly at the specified bit level. The last block includes on average THREADS_PER_GRID/2 candidates which fall into the next bit level.
-> Sieving (CPU) and candidate testing (GPU) run interleaved. For long runs the best performance is achieved when sieving takes as long as candidate testing. Let's assume that one block needs 100ms of compute time for each part. A specific block must be sieved before it can be tested.
For small runs this means that a range consisting of a single candidate block needs 100ms for sieving followed by 100ms for candidate testing. So one block needs 200ms (5 blocks per second).
For long runs (e.g. 1000 blocks): in the first 100ms the first block is in the sieving stage. From 100ms to 200ms the 2nd block is in the sieving stage and the first block is in the candidate testing stage. So the runtime for 1000 blocks will be 1000 x 100ms + 100ms = 100100ms (nearly 10 blocks per second).

So mfaktc needs "big" assignments; TF-LMH is not optimal for mfaktc.
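
The timing argument above can be written down as a small model. This is only a sketch under the stated assumption that sieving and testing each take the same fixed time per block; the function names are made up for illustration and are not taken from the mfaktc source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model, not mfaktc code: every block is first sieved on the
// CPU and then tested on the GPU, and each stage takes stage_ms per block.

// Runtime of one run of n_blocks when sieving block i+1 overlaps with
// testing block i. Only the very first sieve step cannot be hidden.
constexpr std::int64_t pipelined_runtime_ms(std::int64_t n_blocks, std::int64_t stage_ms) {
    return n_blocks > 0 ? (n_blocks + 1) * stage_ms : 0;
}

// Runtime when the work is split into n_runs separate runs of
// blocks_per_run blocks each. Every run pays the start-up step again.
constexpr std::int64_t split_runtime_ms(std::int64_t n_runs, std::int64_t blocks_per_run,
                                        std::int64_t stage_ms) {
    return n_runs * pipelined_runtime_ms(blocks_per_run, stage_ms);
}

// Throughput in blocks per second for a given total runtime.
inline double blocks_per_second(std::int64_t n_blocks, std::int64_t runtime_ms) {
    assert(runtime_ms > 0);
    return 1000.0 * static_cast<double>(n_blocks) / static_cast<double>(runtime_ms);
}
```

With stage_ms = 100, pipelined_runtime_ms(1, 100) gives the 200ms of the single-block case and pipelined_runtime_ms(1000, 100) gives the 100100ms of the long run. Splitting the same 1000 blocks into 1000 one-block runs would cost split_runtime_ms(1000, 1, 100) = 200000ms in this model.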

Oliver

cheesehead 2010-04-11 18:14

[quote=Mini-Geek;211371]You could always do TF-LMH or manually choose numbers TFd to a very low bit level and do 1 bit on it to speed the process (and be playing with less consequential things, so you don't have to potentially give up the credit for two full-size TFs).[/quote]Thank you for your concern.

The reason I specified "make a copy of the prime.spl file" before editing it was so that after I submitted the edited test version to PrimeNet and saw what happened, I could restore the unedited prime.spl, submit that to PrimeNet, and verify that all was normal. (If there were other results besides the edited ones, they'd simply be noted as "not needed" on the second communication.)

In other words, I left out describing some steps of the envisioned experiment.

I learned long ago on-the-job to make backups before doing something that could screw up data.

[quote=Mini-Geek;211375]But if you wanted to do TF anyway,[/quote]... which I do ...
[quote]and don't mind delaying the experiment an unknown amount of time[/quote]I'm patient.
[quote](and don't care too much about some credit)[/quote]I don't.

[quote]by all means, do it that way. :smile:[/quote]Thanks again. :smile:

- - -

[quote=TheJudger;211384]Performance-wise this is not a good idea for mfaktc. For good performance you need big ranges.
E.g. TF of M66xxxxxx from 2^65 to 2^68 in one step yields a higher throughput than TF of the same number from 2^65 to 2^66, 2^66 to 2^67 and 2^67 to 2^68 in three separate passes.[/quote]Mini-Geek was only responding to my experiment description there, not recommending single-bit-at-a-time to CUDA TFers in general.

But thank you for your optimization explanation anyway!

TheJudger 2010-04-11 18:17

Hi (again),

[QUOTE=TheJudger;211326]I tried to put the 72bit GPU code into a separate .cu file (as a preparation for having multiple code paths). In the current state I'm not able to build a binary from the code (linking problems).
[/QUOTE]

[B]Fixed![/B] :)
Now I can compile the .c file which contains main() directly with gcc. :)
In the .cu file I had to add [I]extern "C"[/I] to the function prototype.... simple fix :)
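
For readers who have not hit this before, here is a minimal sketch of that kind of fix. nvcc compiles .cu files as C++, so without extern "C" the function gets a mangled symbol name that a C object file built by gcc cannot resolve at link time. The function name tf_class_demo and its body are placeholders for illustration only, not the real mfaktc prototype.

```cpp
#include <cassert>

// Hypothetical prototype, as it could appear in a header shared by the
// .c file (compiled by gcc as C) and the .cu file (compiled by nvcc as C++).
// The guard makes the header valid in both languages.
#ifdef __cplusplus
extern "C" {
#endif

int tf_class_demo(unsigned int exponent, int bit_min, int bit_max);

#ifdef __cplusplus
}
#endif

// Placeholder definition standing in for the code in the .cu file.
// Because of extern "C" the symbol is exported under the plain name
// "tf_class_demo", which is what the C caller expects.
// Returns 0 for a valid bit range and -1 otherwise; no factoring is done.
extern "C" int tf_class_demo(unsigned int exponent, int bit_min, int bit_max) {
    (void)exponent;  // unused in this stub
    return (bit_min > 0 && bit_max > bit_min) ? 0 : -1;
}
```

The C side would then simply declare and call tf_class_demo(3321932839u, 66, 71) like any other C function.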
---
[CODE]o@Sinope:~/mfaktc/mfaktc-0.06/0.06-pre4> ./mfaktc.exe 3321932839 66 71
mfaktc v0.06-pre4
...

Runtime Options
SievePrimes 15000
SievePrimesAdjust 1

...
tf(3321932839, 66, 71);
k_min = 11106030600
k_max = 355392982921
class 0: tested 168099840 candidates in 2715ms (61915226/sec) (avg. wait: 1987usec)
avg. wait > 1000usec, increasing SievePrimes to 16000
class 5: tested 167116800 candidates in 2698ms (61940993/sec) (avg. wait: 1806usec)
avg. wait > 1000usec, increasing SievePrimes to 17000
class 9: tested 166133760 candidates in 2701ms (61508241/sec) (avg. wait: 1699usec)
avg. wait > 1000usec, increasing SievePrimes to 18000
class 12: tested 165150720 candidates in 2672ms (61807904/sec) (avg. wait: 1492usec)
avg. wait > 1000usec, increasing SievePrimes to 19000
class 20: tested 164167680 candidates in 2667ms (61555185/sec) (avg. wait: 1388usec)
avg. wait > 1000usec, increasing SievePrimes to 20000
class 21: tested 164167680 candidates in 2663ms (61647645/sec) (avg. wait: 1239usec)
avg. wait > 1000usec, increasing SievePrimes to 21000
class 29: tested 163184640 candidates in 2643ms (61742202/sec) (avg. wait: 1090usec)
avg. wait > 1000usec, increasing SievePrimes to 22000
class 32: tested 162201600 candidates in 2621ms (61885387/sec) (avg. wait: 949usec)
class 36: tested 162201600 candidates in 2608ms (62193865/sec) (avg. wait: 894usec)
class 41: tested 162201600 candidates in 2614ms (62051109/sec) (avg. wait: 914usec)
class 44: tested 162201600 candidates in 2614ms (62051109/sec) (avg. wait: 917usec)
...
[/CODE]
As you can see, the number of candidates decreases from class to class because SievePrimes is adjusted automatically at runtime. Of course this needs some fine-tuning, but basically it is working. :)
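
The adjustment visible in the log can be sketched roughly as follows. The threshold (1000usec) and the step (1000) are read off the log lines above; the function names, the upper cap and the absence of any lowering rule are assumptions made for this illustration, and the real mfaktc logic may well differ.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Threshold and step size as they appear in the log; the names are made up.
constexpr int kWaitThresholdUsec = 1000;
constexpr int kSievePrimesStep   = 1000;

// One adjustment step: raise SievePrimes by one step when the average wait
// exceeds the threshold, but never above max_sieve_primes (an assumed cap).
// Lowering SievePrimes does not occur in the log, so it is not modelled.
inline int adjust_sieve_primes(int sieve_primes, int avg_wait_usec, int max_sieve_primes) {
    assert(sieve_primes > 0 && max_sieve_primes >= sieve_primes);
    if (avg_wait_usec > kWaitThresholdUsec) {
        return std::min(sieve_primes + kSievePrimesStep, max_sieve_primes);
    }
    return sieve_primes;
}

// Replay a sequence of per-class average waits and return the final value.
inline int replay_waits(const std::vector<int>& waits_usec, int start, int max_sieve_primes) {
    int sieve_primes = start;
    for (int wait : waits_usec) {
        sieve_primes = adjust_sieve_primes(sieve_primes, wait, max_sieve_primes);
    }
    return sieve_primes;
}
```

Replaying the waits from the log (1987, 1806, 1699, 1492, 1388, 1239, 1090, 949, ...) from a start value of 15000 ends at 22000 in this sketch, which matches the last "increasing SievePrimes" line.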

Performance numbers are [U]NOT[/U] comparable to previous posts: I have a new CPU in my desktop PC, a "Core i7 800 series". It seems that a Core i7 8x0 is a little bit slower per clock than a Core 2 Duo for the sieving part (using a single core). I have to check, but I think it is related to the huge L2 cache of the Core 2 CPUs; the L2 cache of the Core i7 is small, while its L3 cache is relatively slow.

Oliver

kjaget 2010-04-12 13:20

A note from running this code on a Windows machine (based on 0.05).

I upgraded to Windows 7 and was seeing very slow results (~25M/sec on an i750+GTX275). I was seeing a large average wait regardless of the setting of the sieve values. I read that Win7 (Vista also) has a large latency for setting up kernels, so I think a lot of that delay was OS overhead. I made similar changes to what was mentioned earlier to launch multiple kernels at once - this sped things up by about 2x. I'm still only getting ~55M/sec or so which I think is below what you would expect.

Since you're looking at updates to the math code, is it possible to find a way for the individual kernels to run longer, so that the effect of this OS latency is reduced? I know you're limited by the hardware, but it's something to think about.

TheJudger 2010-04-12 15:44

Hi kjaget,

can you check whether the GPU runs at full speed (e.g. try GPU-Z) while mfaktc is running? Maybe the GPU isn't running at full speed.

You can increase the runtime of a kernel by
- increasing THREADS_PER_GRID in params.h and recompiling. This might have the side effect that the GUI becomes even more laggy. The GPU kernel needs THREADS_PER_GRID * 8 bytes, so with my default setting it consumes ~8MiB of memory, not really a limit nowadays...
- choosing bigger exponents (usually not really an option, and the effect isn't that big)
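
The memory figure is easy to check. A small sketch, assuming the 8 bytes per thread stated above; the helper names are made up, and the value 2^20 for the default THREADS_PER_GRID is only inferred from the "~8MiB" remark, not read from params.h.

```cpp
#include <cassert>
#include <cstdint>

// 8 bytes per thread, as stated in the post.
constexpr std::uint64_t kBytesPerThread = 8;

// GPU memory needed by the kernel for a given THREADS_PER_GRID.
constexpr std::uint64_t kernel_buffer_bytes(std::uint64_t threads_per_grid) {
    return threads_per_grid * kBytesPerThread;
}

// Convert a byte count to MiB (1 MiB = 1024 * 1024 bytes).
constexpr double to_mib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

For THREADS_PER_GRID = 2^20 this gives exactly 8 MiB, and doubling THREADS_PER_GRID doubles the buffer, so memory is unlikely to be the limiting factor here.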

Please tell me what you needed to do to compile the code on Windows. Try to explain it for somebody who has no experience with compiling code on Windows... :rolleyes:

[B]edit: increasing THREADS_PER_GRID above 2^20 might cause overflows in the offset tables...[/B]

TheJudger 2010-04-13 07:44

Hi kjaget,

When you keep SIEVE_PRIMES below 100,000 it is safe to increase THREADS_PER_GRID up to 2,500,000.

If increasing the kernel runtime by increasing the number of candidates per kernel call helps on Windows Vista/7, then I could add a few lines to support a bigger THREADS_PER_GRID.

Oliver

alexhiggins732 2010-04-17 06:48

Is there a Windows binary of the latest version? Or perhaps an MSVC project I can compile myself?

