[QUOTE=TheJudger;230756]Hi,
as some of you already know, mfaktc 0.12 will be final soon ([B]perhaps[/B] next week). It features two new kernels based on "Barrett's modular reduction" which are faster than the current kernels in some ranges (e.g. up to 50% on 2.x GPUs for factors above 2^75 and below 2^79). :smile: A second feature is a combined binary for sm_11 and sm_20 code, so we can have one binary which delivers "optimal" performance on all currently supported GPUs.

What's next? I don't know. Feel free to post ideas for what I could implement next. Of course, posting an idea does [B]not[/B] guarantee that I'll implement it.

Things which won't happen short-term, if ever:
- primenet integration
- a GUI
- multicore support (CPU); if one CPU core isn't fast enough, just start another instance of mfaktc on another exponent in a separate directory

Oliver[/QUOTE]
I'm not asking for this to be done in the near future, just making a proposal. Once mfaktc is well optimized and tuned, I'd like to see a modified version (ffaktc?) for Fermat and GFN factors. :smile:
Luigi |
One idea is to make a library based on mfaktc, which will allow it to be easily used by other programs.
|
[QUOTE=TheJudger;230756]- multicore support (CPU), if one CPU core isn't fast enough just start another instance of mfaktc on another exponent in a separate directory
[/QUOTE] How about multicore support? :smile: Using OpenMP pragma directives, the main loop over i from 6 (or 7) to sieve_limit would be easy to parallelize. OK ... there is a complication. The threads can't be allowed to stomp all over each other when accessing the sieve array, since that would kill the speed. So you'd need n sieve arrays for n threads, and then AND them all together after that loop in another parallel for loop. But unless I'm missing something significant (which is likely, BTW), that should add efficient threading to the most time-intensive loop in perhaps a dozen additional lines of code. |
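The idea of one private sieve array per thread, followed by a parallel AND, could be sketched roughly as below. This is not the mfaktc sieve: `sieve_parallel`, `SIEVE_WORDS`, `NUM_PARTS` and the `primes[]`/`first[]` inputs are hypothetical names chosen for illustration, and the real code computes its starting offsets differently. Without -fopenmp the pragmas are ignored and the code runs serially with the same result.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SIEVE_WORDS 64                    /* hypothetical size: 64 * 32 = 2048 bits */
#define SIEVE_BITS  (SIEVE_WORDS * 32u)
#define NUM_PARTS   4                     /* one private sieve array per worker */

/* primes[i] is a sieve prime and first[i] the first bit index it crosses off.
 * On return, bit j of out is 1 if no prime crossed off position j.
 * Returns 0 on success, -1 if the scratch memory could not be allocated. */
int sieve_parallel(uint32_t *out, const unsigned *primes,
                   const unsigned *first, int nprimes)
{
    assert(nprimes >= 0);
    size_t bytes = (size_t)NUM_PARTS * SIEVE_WORDS * sizeof(uint32_t);
    uint32_t *part = malloc(bytes);
    if (part == NULL)
        return -1;
    memset(part, 0xFF, bytes);            /* every candidate starts as "alive" */

    /* Step 1: each worker sieves its own slice of the primes into its own array. */
    #pragma omp parallel for
    for (int t = 0; t < NUM_PARTS; t++) {
        uint32_t *mine = part + (size_t)t * SIEVE_WORDS;
        int lo = (int)((long)nprimes * t / NUM_PARTS);
        int hi = (int)((long)nprimes * (t + 1) / NUM_PARTS);
        for (int i = lo; i < hi; i++) {
            assert(primes[i] > 0);
            for (unsigned j = first[i]; j < SIEVE_BITS; j += primes[i])
                mine[j / 32] &= ~(UINT32_C(1) << (j % 32));
        }
    }

    /* Step 2: AND the private arrays together, again in parallel over words. */
    #pragma omp parallel for
    for (int w = 0; w < SIEVE_WORDS; w++) {
        uint32_t acc = part[w];
        for (int t = 1; t < NUM_PARTS; t++)
            acc &= part[(size_t)t * SIEVE_WORDS + w];
        out[w] = acc;
    }

    free(part);
    return 0;
}
```

Splitting the primes into contiguous slices is the simplest choice but not a balanced one, since small primes cross off far more positions than large ones; an interleaved assignment would probably spread the work more evenly.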
Hi frmky,
yep, the method you suggested [B]should[/B] work. But I'm unsure how big the improvement would be, because the sieve is limited by two loops: the first is the sieving itself, and the second is the generation of k_tab[] from the bits in the sieve. The changes to the sieve (mfaktc 0.03, 0.05, 0.07) were all about the k_tab generation. So this loop needs parallelization, too. :sad: Oliver |
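One common way to parallelize a "bits to table" step like the k_tab[] generation is a two-pass scheme: count the set bits per chunk, turn the counts into write offsets, then let each chunk fill its own slice of the table. The sketch below assumes that the table simply holds the bit index of every surviving candidate; the actual k_tab[] layout in mfaktc may differ, and `build_k_tab` and `CHUNKS` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNKS 4    /* hypothetical number of independent work chunks */

/* Portable population count: clears the lowest set bit until none remain. */
static unsigned popcount32(uint32_t x)
{
    unsigned c = 0;
    while (x) {
        x &= x - 1;
        c++;
    }
    return c;
}

/* Writes the index of every set bit in sieve[0..words) to k_tab in ascending
 * order and returns the number of entries. k_tab must have room for them. */
unsigned build_k_tab(const uint32_t *sieve, unsigned words, uint32_t *k_tab)
{
    unsigned offset[CHUNKS + 1];

    /* Pass 1: count the survivors in each chunk. */
    #pragma omp parallel for
    for (int c = 0; c < CHUNKS; c++) {
        unsigned lo = (unsigned)((uint64_t)words * (unsigned)c / CHUNKS);
        unsigned hi = (unsigned)((uint64_t)words * (unsigned)(c + 1) / CHUNKS);
        unsigned n = 0;
        for (unsigned w = lo; w < hi; w++)
            n += popcount32(sieve[w]);
        offset[c + 1] = n;                /* per-chunk count for now */
    }

    /* Serial prefix sum: offset[c] becomes the first table slot of chunk c. */
    offset[0] = 0;
    for (int c = 0; c < CHUNKS; c++)
        offset[c + 1] += offset[c];

    /* Pass 2: each chunk writes into its own, non-overlapping table slice. */
    #pragma omp parallel for
    for (int c = 0; c < CHUNKS; c++) {
        unsigned lo = (unsigned)((uint64_t)words * (unsigned)c / CHUNKS);
        unsigned hi = (unsigned)((uint64_t)words * (unsigned)(c + 1) / CHUNKS);
        unsigned pos = offset[c];
        for (unsigned w = lo; w < hi; w++)
            for (unsigned b = 0; b < 32; b++)
                if ((sieve[w] >> b) & 1u)
                    k_tab[pos++] = w * 32 + b;
        assert(pos == offset[c + 1]);     /* both passes must agree */
    }
    return offset[CHUNKS];
}
```

The price is that the sieve is read twice, so whether this pays off would have to be measured against the single-threaded loop.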
1 Attachment(s)
I thought I'd put my money where my mouth is. Here are the results.
GPU Bench: 222.5 M/s
Single threaded: 105.0 M/s
OMP 1 thread: 97.3 M/s
OMP 2 threads: 116.9 M/s
OMP 3 threads: 134.8 M/s
OMP 4 threads: 139.5 M/s

I was hoping for more than a 33% improvement for a 300% increase in CPU time. :smile: The k_tab[] generation code is ugly (no offense intended!). It would need to be more structured to parallelize.

Here are the details of my test case, and the patch is attached in case anyone wants to play with it. To compile on Linux, just add -fopenmp to the CFLAGS and LDFLAGS in the Makefile, then run with a command like OMP_NUM_THREADS=4 ./mfaktc.exe for 4 threads. The code passes the -st self tests for 1-4 threads.

[CODE]mfaktc v0.11

Compiletime Options
  THREADS_PER_GRID_MAX      1048576
  THREADS_PER_BLOCK         256
  SIEVE_SIZE_LIMIT          32kiB
  SIEVE_SIZE                230945bits
  SIEVE_SPLIT               250
  VERBOSE_TIMING            disabled
  MORE_CLASSES              disabled

Runtime Options
  SievePrimes               25000
  SievePrimesAdjust         1
  NumStreams                3
  WorkFile                  worktodo.txt
  Checkpoints               disabled
  Stages                    enabled
  StopAfterFactor           bitlevel

CUDA device info
  name:                     GeForce GTX 480
  compute capability:       2.0
  maximum threads per block: 1024
  number of multiprocessors: 15 (480 shader cores)
  clock rate:               1401MHz

Automatic parameters
  threads per grid:         983040

running a simple selftest...
selftest PASSED!

got assignment: exp=100006129 bit_min=65 bit_max=68
tf(100006129, 65, 66, ...);
 k_min = 184456135080
 k_max = 368912270841[/CODE] |
It would be nice to have a version with only the siever (hence it wouldn't need any GPU) so that people without a GPU can tune it :smile:
|
Hi ldesnogu,
I'll try to provide a sieve-only version. :smile: --- Hi frmky, perhaps you'll find a less ugly variant of k_tab generation in the earlier versions. Oliver |
1 Attachment(s)
Hi!
Here is mfaktc 0.12. :smile:

Highlights:
- two new kernels based on "Barrett's modular reduction", nice speed on "Fermi" cards. :smile:
- a combined binary with optimized code for both sm_11 and sm_20 is easily possible. Older cards (pre-Fermi) will use the sm_11 code and newer cards (Fermi) will use the sm_20 code. :smile:

Raw GPU benchmarks on stock GTX 470
[CODE]
kernel | M66362159 above 2^64 | M3321932839 above 2^64
-------+----------------------+--------------------------
71bit  | 102.7M/s             |  79.8M/s                  mfaktc-0.12
75bit  | 189.1M/s  184.0%     | 148.1M/s  185.6%
95bit  | 155.8M/s  151.6%     | 121.8M/s  152.6%
79bit* | 235.2M/s  229.0%     | 187.7M/s  235.2%
92bit* | 214.5M/s  208.7%     | 170.0M/s  213.0%
[/CODE]
On pre-Fermi cards the improvement is much smaller: the 75bit kernel is faster than the 2 new kernels. The new ones are still a small improvement over the 95bit kernel.

A makefile for Windows is included. (Thank you Dave & Kevin! :smile:) No Windows build yet. :sad:

Oliver |
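For readers unfamiliar with Barrett's modular reduction, here is a toy sketch of the technique on 32-bit integers. It is only meant to show the principle (replace the division by a multiplication with a precomputed reciprocal, then correct with at most one subtraction); it is not the code of the 79bit or 92bit kernels, which work on multi-word numbers on the GPU. The names `barrett32`, `barrett_init`, `barrett_reduce` and `pow2_mod` are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Modulus n together with the precomputed reciprocal mu = floor(2^32 / n). */
typedef struct {
    uint32_t n;
    uint64_t mu;
} barrett32;

barrett32 barrett_init(uint32_t n)
{
    assert(n >= 1);
    barrett32 b = { n, (UINT64_C(1) << 32) / n };
    return b;
}

/* Returns x mod n without dividing by n. The estimate q is either the true
 * quotient or one less, so a single conditional subtraction is enough. */
uint32_t barrett_reduce(const barrett32 *b, uint32_t x)
{
    uint64_t q = ((uint64_t)x * b->mu) >> 32;   /* estimate of floor(x / n) */
    uint64_t r = (uint64_t)x - q * b->n;        /* 0 <= r < 2n */
    if (r >= b->n)
        r -= b->n;
    return (uint32_t)r;
}

/* 2^p mod f by binary exponentiation, for f < 2^16 so that every product
 * fits in 32 bits. f divides the Mersenne number 2^p - 1 iff the result is 1. */
uint32_t pow2_mod(uint32_t p, uint32_t f)
{
    assert(f >= 1 && f < 65536);
    barrett32 b = barrett_init(f);
    uint32_t result = barrett_reduce(&b, 1);
    uint32_t base = barrett_reduce(&b, 2);
    while (p) {
        if (p & 1)
            result = barrett_reduce(&b, result * base);
        base = barrett_reduce(&b, base * base);
        p >>= 1;
    }
    return result;
}
```

As a usage example, M11 = 2^11 - 1 = 2047 = 23 * 89, so `pow2_mod(11, 23)` and `pow2_mod(11, 89)` both give 1, while a non-factor such as 29 gives a different remainder.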
1 Attachment(s)
I sent the Windows Makefile to Oliver via PM and it looks like it stripped out the necessary tabs. I attach a corrected version below.
|
1 Attachment(s)
Hi!
[QUOTE=ldesnogu;231754]It would be nice to have a version with only the siever (hence it wouldn't need any GPU) so that people without a GPU can tune it :smile:[/QUOTE] here we go. This should compile/run without GPU/Nvidia toolkit. This is only for those who want to try to optimize the sieve code (not for the users who just want to run TF). Oliver |
Great! Thanks a lot.
|