mfaktc: a CUDA program for Mersenne prefactoring
[B][edit by LaurV (2012-Dec-28)]
As the thread got longer and longer and people keep coming and asking "Where is this or that version of mfaktc?", I am editing this post to give the link to the folder which Xyzzy kindly created: [URL]https://www.mersenneforum.org/mfaktc/[/URL] Here you can find many different versions of mfaktc, for different OSes, different CUDA drivers, bla bla. Select the suitable one for you, download it, clear some exponents! :smile: Thanks! [end of edit by LaurV][/B]

[b][edit by James Heinrich (2022-Nov-06)] The mersenneforum mirror hasn't been updated in several years; current mfaktc builds are available here: [url]https://download.mersenne.ca/mfaktc/[/url] [end of edit by James Heinrich][/b]

Hi,

maybe a bit off-topic... as a proof of concept I can do TF on my GTX 275 :)

A straightforward implementation of the multiprecision arithmetic gives me ~15,000,000 tested factor candidates per second for a 26-bit exponent and factors up to 71 bits. (The numbers are from memory, I hope they are correct.) I've checked the correctness for some 1000 cases against a basic implementation using libgmp on the CPU, no errors found. :)

Currently there is no presieving of factor candidates... but I have a lot of free cycles on the CPU. ;)

The quality of the code right now is... "proof of concept" (meaning really ugly, a lot of hardcoded stuff, ...) :D Most time is spent on the calculation of the remainder (using a simple basecase division).

jasonp: if you read this: the Montgomery multiplication suggested on the nvidia forum doesn't look suitable for CUDA to me. :(

TheJudger
[QUOTE=TheJudger;196991]
jasonp: if you read this: the Montgomery multiplication suggested on the nvidia forum doesn't look suitable for CUDA to me. :( [/QUOTE] It doesn't map very well to CUDA, but there are few alternatives if you are not performing a very large number of multiplications with a fixed modulus.
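For readers who want to see concretely what is being discussed: below is a minimal host-side sketch of 64-bit Montgomery multiplication (REDC) in C, using gcc's __uint128_t. It is only an illustration of the algorithm jasonp mentions, not his code and not the 24-bit GPU arithmetic used later in this thread; it assumes an odd modulus N < 2^63 so the intermediate sum fits in 128 bits.
[CODE]#include <stdint.h>
#include <stdio.h>

/* Compute -N^-1 mod 2^64 for odd N via Newton/Hensel lifting
   (each iteration doubles the number of correct low bits). */
static uint64_t neg_inverse(uint64_t N)
{
    uint64_t x = N;                      /* correct to 3 bits: N*N == 1 (mod 8) for odd N */
    for (int i = 0; i < 5; i++)
        x *= 2 - N * x;                  /* now x == N^-1 (mod 2^64) */
    return 0 - x;
}

/* Montgomery product a*b*R^-1 mod N with R = 2^64; assumes odd N < 2^63 and a, b < N */
static uint64_t montmul(uint64_t a, uint64_t b, uint64_t N, uint64_t Nprime)
{
    __uint128_t T = (__uint128_t)a * b;
    uint64_t    m = (uint64_t)T * Nprime;                       /* (T mod R) * (-N^-1) mod R */
    uint64_t    t = (uint64_t)((T + (__uint128_t)m * N) >> 64); /* (T + m*N) / R, exact */
    return (t >= N) ? t - N : t;
}

int main(void)
{
    uint64_t N  = 2305843009213693951ULL;                 /* 2^61 - 1, an odd test modulus */
    uint64_t Np = neg_inverse(N);
    uint64_t Rm = (0 - N) % N;                            /* 2^64 mod N */
    uint64_t R2 = (uint64_t)(((__uint128_t)Rm * Rm) % N); /* 2^128 mod N */
    uint64_t a  = 123456789012345678ULL % N;
    uint64_t b  = 987654321098765432ULL % N;

    uint64_t aR = montmul(a, R2, N, Np);                  /* convert a into Montgomery form */
    uint64_t ab = montmul(aR, b, N, Np);                  /* = a*b mod N */
    printf("%llu vs %llu\n", (unsigned long long)ab,
           (unsigned long long)(((__uint128_t)a * b) % N));
    return 0;
}[/CODE]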
Hi jasonp!
[QUOTE=jasonp;197012]It doesn't map very well to CUDA, but there are few alternatives if you are not performing a very large number of multiplications with a fixed modulus.[/QUOTE] Please tell me which alternatives you're thinking about. Maybe you've already tried some of them yourself on CUDA?
[QUOTE=TheJudger;197083]
Please tell me which alternatives you're thinking about. Maybe you've already tried some of them yourself on CUDA?[/QUOTE] No, I went straight to writing the Montgomery code because that's what I know :)

The other alternatives are Barrett modular multiplication with a precomputed inverse, or split floating point / integer division. For a*b mod N, the latter uses the FPU to find the top 24 bits of the quotient q, converts to integer, does one step of Newton iteration and then subtracts q*N from a*b. This is the fastest method on x86 because the FPU can compute up to a 61-bit q in one step, but on a GPU floating point division is much faster than integer division, and floating point reciprocal instructions are faster still. So it may still be a win to do things that way.
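To make the Barrett alternative concrete as well, here is a minimal host-side sketch for a modulus below 2^32 (an illustration under those assumptions only, using gcc's __uint128_t for the multiply-high, not the multi-limb arithmetic a 72-bit GPU kernel needs): one precomputed value mu = floor(2^64 / N) replaces the division by a multiply-high and a short correction.
[CODE]#include <stdint.h>
#include <stdio.h>

/* Barrett reduction sketch: a*b mod N for an odd modulus N < 2^32.
   mu = floor(2^64 / N) is precomputed once per modulus. */
static uint32_t mulmod_barrett(uint32_t a, uint32_t b, uint32_t N, uint64_t mu)
{
    uint64_t x = (uint64_t)a * b;                          /* full 64-bit product */
    uint64_t q = (uint64_t)(((__uint128_t)x * mu) >> 64);  /* q ~ x / N, low by at most 2 */
    uint64_t r = x - q * N;
    while (r >= N) r -= N;                                 /* at most two corrections */
    return (uint32_t)r;
}

int main(void)
{
    uint32_t N  = 3321928171u;          /* arbitrary odd test modulus below 2^32 */
    uint64_t mu = UINT64_MAX / N;       /* equals floor(2^64 / N) because N is odd */
    uint32_t a  = 1234567891u % N, b = 2987654321u % N;
    printf("%u vs %u\n", mulmod_barrett(a, b, N, mu),
           (uint32_t)(((uint64_t)a * b) % N));
    return 0;
}[/CODE]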
jasonp: thank you for your hints.
I already had the Barrett modular multiplication on my radar but didn't take a deeper look at it so far.

Current speed: 30 million checks per second on a 26-bit exponent with factors up to 71 bits :)

Next on my to-do list:
- more flexible input (e.g. the exponents are still hardcoded in the host code)
- presieving of factor candidates
- reduce the amount of data transferred from/to the device
- interleave host/device code
Hi,
Next on my to-do list:
- more flexible input (e.g. the exponents are still hardcoded in the host code): DONE!
- presieving of factor candidates: 50%
- reduce the amount of data transferred from/to the device: DONE!
- interleave host/device code: DONE!

Raw speed on my GTX 275: ~34 million factor candidates per second for a 26-bit exponent.

Presieving is supported in the code now; what is missing is a fast siever. The current code does "sieving" by calculating the remainder of each factor candidate, which isn't fast, of course. So with presieving of factor candidates with primes <= 11 *lol* I can factor M66xxxxxx to 60 bits in 80 seconds. This removes 2/3 of the factor candidates; with a good siever this should increase ;) (the Prime95 help notes that ~95% of the candidates are removed by the siever). If I can do the same this will be a speedup of ~5-6 (~33% vs. ~5% remaining factor candidates) compared to the current version. :)

There is one known bug (introduced in the current version): if 2 (or more) factors are found in the same dataset, often only one factor is returned. It is even possible that the returned factor is wrong (a mixup of both factors) in this case. I haven't noticed a mixup so far but I'm sure it can occur. :(
Hi,
found a bug in my horrible "siever" (actually it is no sieve right now): it removed only half of the factor candidates which are not 1 or 7 mod 8.

Sieving M66xxxxxx to 2^60 with presieving of factor candidates up to 7: 57s
Sieving M66xxxxxx to 2^60 with presieving of factor candidates up to 11: 55s
(the siever is too slow and causes some stalls in the GPU code)

I ran some tests from M66000000 to M66004000, "verified" the factors already known to PrimeNet, and found some unknown factors (which I've checked in manually).
Does anyone else get the feeling, especially the mods, that TheJudger's stuff should get its own thread and not hijack this one?
henryzz: good idea!
Maybe a mod can move all posts related to TF on CUDA to a separate thread?
Hi,
improved siever (sieving the first 600 primes):
Sieving M66xxxxxx to 2^60: 17 seconds
Sieving M66xxxxxx to 2^64: 270 seconds
Sieving M66xxxxxx from 2^64 to 2^65: 270 seconds

I think I'm close to the first "public beta". :) I have some ideas in my mind that I want to test/implement before that.

Regards,
TheJudger
Hi,
my first attempt to reduce the number of sieve initialisations to one per group failed greatly... :(

The good news: the 2nd attempt works fine and took only ~10 lines of code :)

sieving the first 2000 primes: TF M66xxxxxx to 2^60: 17 seconds (too much overhead for sieve initialisation...)
sieving the first 4000 primes: TF M66xxxxxx to 2^64: 220 seconds :)
[quote=TheJudger;198902]Hi,
my first attempt to reduce the number of sieve initialisations to one per group failed greatly... :( The good news: the 2nd attempt works fine and took only ~10 lines of code :) sieving the first 2000 primes: TF M66xxxxxx to 2^60: 17 seconds (too much overhead for sieve initialisation...) sieving the first 4000 primes: TF M66xxxxxx to 2^64: 220 seconds :)[/quote] Just curious, how does this compare to a modern CPU doing the same task? Based on what I recall from the last time I did TF work, this looks pretty fast.
[QUOTE=mdettweiler;198925]Just curious, how does this compare to a modern CPU doing the same task? Based on what I recall from the last time I did TF work, this looks pretty fast.[/QUOTE]
It is very fast. I tested on 1 core (3 other cores idle) of a Core 2 Quad (Yorkfield) Q9450 at 2.66 GHz with WinXP x64 and Prime95 25.11.2 64-bit:

M660xxxxx: 1 to 2^64: 1369 sec
M669xxxxx: 1 to 2^64: 1406 sec

So 1 core of a 1.5-year-old CPU is over 6 times slower.
[quote=ATH;198970]It is very fast. I tested on 1 core (3 other cores idle) of a Core 2 Quad (Yorkfield) Q9450 at 2.66 GHz with WinXP x64 and Prime95 25.11.2 64-bit:
M660xxxxx: 1 to 2^64: 1369 sec
M669xxxxx: 1 to 2^64: 1406 sec
So 1 core of a 1.5-year-old CPU is over 6 times slower.[/quote] Wow! :shock: Now I really wish I had a real GPU in my computer... :ermm:
Can this do numbers in the 100M digit range (332,190,000+)? Can they be taken up to the 77-bit level with it?

If so, this might influence my decision to buy a new desktop. I would get a real screamer and do P-1 on one core, TF on two, and an LL on the fourth, while doing other TFs on the GPU.
Hi ATH,
[QUOTE=ATH;198970]It is very fast. I tested on 1 core (3 other cores idle) of a Core 2 Quad (Yorkfield) Q9450 at 2.66 GHz with WinXP x64 and Prime95 25.11.2 64-bit:
M660xxxxx: 1 to 2^64: 1369 sec
M669xxxxx: 1 to 2^64: 1406 sec
So 1 core of a 1.5-year-old CPU is over 6 times slower.[/QUOTE]

I'm wondering why M669xxxxx is slower than M660xxxxx; M669xxxxx should have FEWER factor candidates. Can you run a benchmark from 2^64 to 2^65 as well? There might be a slowdown in P95 for factors above 2^64.

My CUDA code has only one code path: factors up to 2^70 (maybe 2^71, I have to check; I would say I have a 90% chance that 2^71 will work without modifications), so there should be no slowdown when moving to factors > 2^64. :)
Hi Uncwilly,
[QUOTE=Uncwilly;198973]Can this do numbers in the 100M digit range (332,190,000+)? Can they be taken up to the 77-bit level with it?

If so, this might influence my decision to buy a new desktop. I would get a real screamer and do P-1 on one core, TF on two, and an LL on the fourth, while doing other TFs on the GPU.[/QUOTE]

Right now I have a soft limit of 2^31 - 1 for the exponent; I just haven't thought about exponents above this limit yet.

77 bits? No. Right now I'm using 3 32-bit integers (24 bits of data + 8 bits of carry each), so this gives me 72 bits overall. I'm using 24 bits per int because current GPUs can do a 24x24-bit multiply 4 times faster than 32x32... I could do 4x 24 bits, too. But at this moment I just want to have a working version which runs effectively from 2^64 to 2^70 (or 2^71). The CUDA documentation notes that upcoming GPUs can do a 32x32 multiply much faster than now...
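To illustrate the representation just described, here is a minimal CUDA sketch of a 72-bit integer held as three 32-bit words with 24 bits of data each (the type and function names are illustrative, not the actual mfaktc identifiers). The 8 spare bits per word leave room for carries, which are propagated explicitly:
[CODE]/* 72-bit unsigned integer as 3 x 24-bit limbs; d0 holds the least significant bits */
typedef struct
{
  unsigned int d0, d1, d2;
} int72;

/* res = a + b (mod 2^72); each limb sum fits easily in 32 bits, so the carry is
   simply the part above bit 23 of the previous limb sum */
__device__ static void add72(int72 *res, int72 a, int72 b)
{
  unsigned int t0 = a.d0 + b.d0;
  unsigned int t1 = a.d1 + b.d1 + (t0 >> 24);
  unsigned int t2 = a.d2 + b.d2 + (t1 >> 24);
  res->d0 = t0 & 0xFFFFFF;
  res->d1 = t1 & 0xFFFFFF;
  res->d2 = t2 & 0xFFFFFF;   /* anything above 72 bits is discarded */
}[/CODE]
A 24x24-bit multiply of two such limbs produces a 48-bit partial product, which is where __umul24 (and the mul24.hi instruction discussed later in this thread) comes in.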
No code improvement, just some more benchmarks.
M66362159
sieving the first 4000 primes:
TF to 2^64: 220 seconds
TF from 2^64 to 2^65: 212 seconds
TF from 2^65 to 2^66: 423 seconds

In the range from 2^65 to 2^66 there is one factor, so Prime95 might stop there?
63205291599831307391 is a factor of M66362159

TF to 2^64 does one precalculation step less, so for each factor candidate one more long division is needed. This is the reason why it is slower than TF from 2^64 to 2^65.
[QUOTE=TheJudger;199326]M66362159
sieving the first 4000 primes:
TF to 2^64: 220 seconds
TF from 2^64 to 2^65: 212 seconds
TF from 2^65 to 2^66: 423 seconds [/QUOTE]

1 core (3 other cores idle) of a Core 2 Quad (Yorkfield) Q9450 at 2.66 GHz with Windows XP x64 and Prime95 25.11.2 64-bit:

M66362189:
TF to 2^64: 1373 sec
TF 2^64-2^65: 1741 sec
TF 2^65-2^66: 3889 sec
TF 2^66-2^67: 7733 sec
Hi!
Thank you for the benchmarks, ATH :)

A little speed increase caused by several minimal improvements. :)

M66362159, sieving the first 4000 primes:
TF to 2^64: 205 seconds (needs further testing; quick tests showed no problems)
TF from 2^64 to 2^65: 205 seconds
TF from 2^65 to 2^66: 406 seconds

Some code cleanups as well. And finally a user interface (command line).

Next steps are:
- some more cleanups (remove unused code, ...)
- some more description of the compiletime options
- testing the code on different platforms on Mersenne numbers with known factors (check that my code finds the factors as well). Here you all can help me. Preferred are people who know how to compile/run CUDA code.
Hi,
Good news and bad news! (don't be afraid, it is not really bad)

Good news #1:
A quick test on my 8800GTX (G80) ran fine, it produced the same results as my GTX 275 (GT200b). The G80 is the oldest CUDA-capable chip with "compute capability 1.0". If the code works on G80 it should work on all CUDA-enabled products. :)

Good news #2:
I've increased the raw speed of the GPU code. My GTX 275 can now test ~53.1M factor candidates per second (M66362159, tested with 65-bit factors). Performance before the latest optimisation was ~36.6M factor candidates per second. The "trick" is to hack the ptx code (let's say it is like assembler code on the CPU) and replace one instruction. The nvidia compiler has no intrinsic for [u]mul24hi while it exists in the ptx code. (The 24x24 multiply is faster, as mentioned before.)

Bad news #1:
The "ptx hack" is ugly!!! I have to check some compilers... There is a patch to enable some more intrinsics but I was not able to build the compiler. :(

Bad news #2:
My siever is too slow. Without the latest optimisation a single core of a Core 2 running at 3GHz was sufficient to feed the GPU (GTX 275) with new factor candidates to test. Now it is too slow, as the GPU code is faster now. I have to think about possibilities:
(1) speed up the siever by writing better code (I'm not sure if I can do this). If "Fermi" is only twice as fast as the GT200 chip (due to the fact it has roughly double the number of shaders) and has no other improvements, I need to speed up the siever again by a factor of 2.
(2) write a multithreaded siever. I think I can do this but I'm not really happy with this solution.
(3) put the siever on the GPU. I'm not sure if this might work...
(4) newer GPUs are capable of running several "kernels" at the same time. With some modifications to the code it should be possible to have several instances of the application running at the same time. If the GPU is too fast for one CPU core, just start another test on a different exponent on a 2nd core, ...
Personally I prefer (4). Any comments?

-----
New benchmark numbers (still on M66362159)

8800GTX, 3.00GHz Core 2 Duo, sieving up to 37831 (4000th odd prime):
TF to 2^64: 283s

GTX 275, Core 2 Duo overclocked to 4GHz, sieving up to 3581 (500th odd prime)
In this configuration the siever already causes some stalls on the GPU.
TF to 2^64: 190s
TF from 2^64 to 2^65: 190s
TF from 2^65 to 2^66: 379s

Raw "factor testing speed" of the 8800 GTX is exactly half of the GTX 275, so imagine the time if the siever were fast enough...
[QUOTE=TheJudger;199739]Bad news #2:
My siever is too slow. Without the latest optimisation a single core of a Core 2 running at 3GHz was sufficient to feed the GPU (GTX 275) with new factor candidates to test. Now it is too slow, as the GPU code is faster now.[/QUOTE] I am assuming you have a prelim sieve to get the TF candidates. Why not just lower the sieve limit for that one?

In fact, the ideal scenario would involve the program doing a benchmark during runtime to pick the optimal sieve limit.
Yes, there are patches to Nvidia's version of the Open64 compiler that allow high-half multiplies and also the add and subtract instructions that generate and consume carry bits. I wasn't able to build the compiler using either MinGW or Cygwin, even though the source specifically has build directories for those. Given the deafening silence from the Nvidia forums, probably nobody else is able to do it either :)
Have you tried building the compiler in Linux? If all you do is generate the PTX code but not run it, then you don't need any of the other GPU infrastructure on the machine...
[QUOTE=TheJudger;199739]The "trick" is to hack the ptx code (let's say it is like assembler code on the CPU) and replace one instruction. The nvidia compiler has no intrinsic for [u]mul24hi while it exists in the ptx code. (The 24x24 multiply is faster, as mentioned before.)

Bad news #1:
The "ptx hack" is ugly!!! I have to check some compilers... There is a patch to enable some more intrinsics but I was not able to build the compiler. :([/QUOTE]

Could you post details of your __[u]mul24hi() trick please? If you prefer, PM or email will be just as good but posting here will aid other CUDA programmers too.

I've seen the nvidia forum postings and the alleged patches to nvcc but I've never managed to get it working either. It will be very useful in some code I'm writing which, at present, has to use __umulhi() and nasty shifting and masking.

Thanks, Paul
[QUOTE=axn;199744]I am assuming you have a prelim sieve to get the TF candidates. Why not just lower the sieve limit for that one?
In fact, the ideal scenario would involve the program doing a benchmark during runtime to pick the optimal sieve limit.[/QUOTE] This won't maximize the throughput of the machine. And for the next-generation GPU "Fermi" I might have to force the sieve to sieve only up to 17 or so :(
Jason/Paul: did you check other CUDA compilers?
E.g. PGI advertises their compiler as CUDA-capable.

Jason: yep, I tried on Linux and failed. (Actually I'm developing my code under Linux.)

Paul: for sure, here we go (hopefully you're familiar with bash):

My code contained only a single __umulhi(). Since the device functions are inlined all the time, it appears several times in the ptx code.

Step #1 (just for safety):
- comment out that __umulhi()
- add "--keep" to the nvcc command line (this will generate a lot of files, do it in a separate subdirectory or so) and compile the code
- check that there is no "mul.hi.u32" in the ptx code
- comment that __umulhi() back in

Step #2:
- add "--dry-run" to the nvcc command line and compile the code (actually it won't compile). This shows you the commands issued by nvcc. Write down the commands issued after the ptx-code generation.

Step #3:
- compile your code with the --keep option again
- modify the ptx file (search & replace mul.hi.u32 with mul24.hi.u32)
- run the commands which you have written down in step #2

My script used for compiling without the ptx hack:
[CODE]#!/bin/bash -xe
rm -f sieve.o main.o main.exe
gcc -Wall -O2 -c sieve.c -o sieve.o
nvcc -c main.cu -o main.o -I /opt/cuda/include/ --ptxas-options=-v
gcc -fPIC -o main.exe sieve.o main.o -L/opt/cuda/lib64/ -lcudart[/CODE]

and now with the ptx hack:
[CODE]#!/bin/bash
mkdir compile_bla_bla
cd compile_bla_bla
gcc -Wall -O2 -c ../sieve.c -o sieve.o
nvcc -c ../main.cu -o main.o -I /opt/cuda/include/ --ptxas-options=-v --keep
cat main.ptx | sed s/mul\.hi\.u32/mul24\.hi\.u32/ > main.ptx.new
mv main.ptx main.ptx.old
mv main.ptx.new main.ptx
rm -f main.sm_10.cubin main.cu.cpp main.o
ptxas --key="xxxxxxxxxx" -arch=sm_10 -v "main.ptx" -o "main.sm_10.cubin"
fatbin --key="xxxxxxxxxx" --source-name="../main.cu" --usage-mode="-v " --embedded-fatbin="main.fatbin.c" "--image=profile=sm_10,file=main.sm_10.cubin" "--image=profile=compute_10,file=main.ptx"
cudafe++ --m64 --gnu_version=40302 --diag_error=host_device_limited_call --diag_error=ms_asm_decl_not_allowed --parse_templates --gen_c_file_name "main.cudafe1.cpp" --stub_file_name "main.cudafe1.stub.c" --stub_header_file_name "main.cudafe1.stub.h" "main.cpp1.ii"
gcc -D__CUDA_ARCH__=100 -E -x c++ -DCUDA_NO_SM_12_ATOMIC_INTRINSICS -DCUDA_NO_SM_13_DOUBLE_INTRINSICS -DCUDA_FLOAT_MATH_FUNCTIONS -DCUDA_NO_SM_11_ATOMIC_INTRINSICS "-I/opt/cuda/bin/../include" "-I/opt/cuda/bin/../include/cudart" -I. -I"/opt/cuda/include/" -m64 -o "main.cu.cpp" "main.cudafe1.cpp"
gcc -c -x c++ "-I/opt/cuda/bin/../include" "-I/opt/cuda/bin/../include/cudart" -I. -I"/opt/cuda/include/" -m64 -o "main.o" "main.cu.cpp"
gcc -fPIC -o ../main.exe sieve.o main.o -L/opt/cuda/lib64/ -lcudart
cd ..
rm compile_bla_bla -rf[/CODE]

If you want to replace only some of your __[u]mulhi calls with __[u]mul24hi then it will be a bit more complicated. :( And of course the build script uses some system-specific paths...

Before you replace the instruction, remember the different behaviour: __[u]mulhi() returns bits 32 to 63 while __[u]mul24hi() returns bits 16 to 47.
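A quick stand-alone way to see the difference in bit ranges described above (plain host C, a hypothetical helper, not part of mfaktc): for 24-bit inputs the exact product fits in 48 bits, and the three instructions pick different 32-bit windows out of it.
[CODE]#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t a = 0x00ABCDEF, b = 0x00123456;  /* arbitrary 24-bit test values */
    uint64_t full = (uint64_t)a * b;          /* exact 48-bit product */

    uint32_t lo24 = (uint32_t)full;           /* bits  0..31: what __umul24() returns   */
    uint32_t hi24 = (uint32_t)(full >> 16);   /* bits 16..47: what mul24.hi.u32 returns */
    uint32_t hi32 = (uint32_t)(full >> 32);   /* bits 32..63: what __umulhi() returns   */

    /* the full 48-bit product can be reassembled from the two mul24 halves */
    uint64_t rebuilt = ((uint64_t)hi24 << 16) | (lo24 & 0xFFFF);
    printf("full=%012llx rebuilt=%012llx lo24=%08x hi24=%08x hi32=%08x\n",
           (unsigned long long)full, (unsigned long long)rebuilt, lo24, hi24, hi32);
    return 0;
}[/CODE]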
[QUOTE=TheJudger;199739]Bad news #2:
My siever is too slow. Without the latest optimisation a single core of a Core 2 running at 3GHz was sufficient to feed the GPU (GTX 275) with new factor candidates to test. Now it is too slow, as the GPU code is faster now. I have to think about possibilities:
(1) speed up the siever by writing better code (I'm not sure if I can do this). If "Fermi" is only twice as fast as the GT200 chip (due to the fact it has roughly double the number of shaders) and has no other improvements, I need to speed up the siever again by a factor of 2.
(2) write a multithreaded siever. I think I can do this but I'm not really happy with this solution.
(3) put the siever on the GPU. I'm not sure if this might work...
(4) newer GPUs are capable of running several "kernels" at the same time. With some modifications to the code it should be possible to have several instances of the application running at the same time. If the GPU is too fast for one CPU core, just start another test on a different exponent on a 2nd core, ...
Personally I prefer (4). Any comments?[/QUOTE]I'd try a variant of (3) as follows:

Write a siever in CUDA and run it on the GPU (no TF, in other words) until you have at least a few hundred megabytes of sieved results. You would use compact storage, obviously. Something like this would work: all factors are of the form 2kp+1, so store only the k values, and those only as deltas from the previous value. The factors also form 16 "obvious" residue classes (something exploited by prime95 since the very early days), so store 16 such lists of deltas. I don't know whether an unsigned char is enough to store the deltas but an unsigned short surely would be. The results would be stored on disk or in CPU RAM as appropriate.

Then you can feed the results into a separate TF kernel in a separate CPU thread.

Paul
[quote=TheJudger;199739]My siever is too slow. Without the latest optimisation a single core of a Core 2 running at 3GHz was sufficient to feed the GPU (GTX 275) with new factor candidates to test. Now it is too slow, as the GPU code is faster now.

I have to think about possibilities:
(1) speed up the siever by writing better code (I'm not sure if I can do this). If "Fermi" is only twice as fast as the GT200 chip (due to the fact it has roughly double the number of shaders) and has no other improvements, I need to speed up the siever again by a factor of 2.[/quote] Ernst has some code in Mlucas for trial factoring in factor.c. You could perhaps steal some ideas? :smile:

EDIT: A link could help: [url]http://hogranch.com/mayer/README.html[/url]
Except that I'm doing the sieve on the CPU, that's more or less the way I'm doing the sieving.
I generate a list of 2^20 k's (in the same class) at once and transfer them to the GPU. The k's are stored as uint32 k_ref_hi, uint32 k_ref_lo and uint32 *k_delta. The deltas are relative to k_ref, NOT to the previous k_delta (think parallel ;)). For 2^20 k's and sieving the first 4000 odd primes, the k_delta grows above 300,000,000, so a short surely doesn't fit.

This sieve is segmented and so small that it fits into the L1 cache of the CPU.

My feeling is that sieving doesn't fit well on CUDA.
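In other words, one transfer to the GPU could look roughly like the sketch below (the field names follow the description above; the struct itself is only an illustration, not the actual mfaktc data structure). Because each delta is relative to the common k_ref rather than to its neighbour, every GPU thread can reconstruct its own k without a serial prefix scan.
[CODE]#define CANDIDATES_PER_BLOCK (1 << 20)        /* 2^20 factor candidates per transfer */

typedef struct
{
  unsigned int k_ref_hi;                      /* upper 32 bits of the reference k */
  unsigned int k_ref_lo;                      /* lower 32 bits of the reference k */
  unsigned int k_delta[CANDIDATES_PER_BLOCK]; /* k[i] = k_ref + k_delta[i]; deltas can
                                                 exceed 300,000,000, hence 32 bits each */
} candidate_block;[/CODE]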
Hi and a happy new year!
GTX 275, Core 2 Duo overclocked to 4GHz, sieving up to 37831 (4000th odd prime):

One process on the CPU:
M66362159 TF from 2^64 to 2^65: 180s (siever still too slow)

Two processes on the CPU at the same time:
M66362159 TF from 2^64 to 2^65: 279s
M66362189 TF from 2^64 to 2^65: 279s
(Using two CPU cores easily keeps the GPU busy all the time. A 2.66GHz Core 2 Duo should be fine for a GTX 275 (with the current code).)

There are some compiletime options which run fine on newer GPUs but won't work on older ones (e.g. asynchronous memory transfers aren't supported on G80 chips). Therefore I need to write some checks into the code to determine whether the current GPU is capable or not.

I had a horribly stupid "bug" during my attempt to make the code capable of running multiple CPU processes concurrently on one GPU: CUDA source files are usually named "*.cu". My favorite editor doesn't know ".cu" files (for syntax highlighting), so I was lazy and made symlinks "*.c" -> "*.cu". This worked fine until I copied the files with scp (secure copy, OpenSSH). Doing so, the "*.c" files became copies of the "*.cu" files. So I edited the "*.c" files and compiled the "*.cu" files..... It took some time to figure out why code changes didn't show any difference...
[QUOTE=axn;199744]I am assuming you have a prelim sieve to get the TF candidates. Why not just lower the sieve limit for that one?
In fact, the ideal scenario would involve the program doing a benchmark during runtime to pick the optimal sieve limit.[/QUOTE]

Some more details on this topic: generating a candidate list (2^20 candidates)

[CODE]Sieve limit                | time | number of raw candidates
31     (10th odd prime)    |  8ms | ~1.83M
547    (100th odd prime)   | 16ms | ~3.16M
7927   (1000th odd prime)  | 21ms | ~4.48M
104743 (10000th odd prime) | 28ms | ~5.77M[/CODE]
(hopefully no typo in this list...)

"Raw candidates" is the number of candidates before sieving needed to generate a list of 2^20 candidates after sieving (on average).

As you can see, the runtime of the siever depends mostly on the number of raw candidates. Generating a list of 2^20 candidates with sieving up to 104743 takes ~3.5 times longer than generating a list of the same size with sieving up to 31. [B][U]BUT[/U][/B] compare the number of raw candidates: sieving up to 104743 covers 5.77/1.83 = 3.15 times as many raw candidates!

So lowering the sieve limit will help to keep the GPU busy, BUT it won't increase the throughput much. It will help to generate more heat on the GPU and burn electricity.
Happy new year, TheJudger!
[QUOTE=TheJudger;200571] (Using two CPU cores easily keeps the GPU busy all the time. A 2.66GHz Core 2 Duo should be fine for a GTX 275 (with the current code).)[/QUOTE] Why use the CPU? Is something slow on the GPU?
Hello msft,
I'm sceptical about sieving on GPUs. I would be happy if somebody proves me wrong on that topic. ;)

- Sieving in one big array with all GPU threads (each GPU thread handles different small primes).
Problem: this will need global memory for the array. When using one char per candidate I need only writes. When using one bit per candidate I need reads and writes (on global memory!). This should be very slow.

- Each thread sieves a small segment.
Problem: need to calculate starting offsets much more often. For this I use a modified Euclidean algorithm which needs ifs. Different code paths will break parallel execution as far as I know. :(
Problem: the work might get imbalanced between threads if the segments are too small. If the segments are bigger (to minimize the effect of imbalanced threads) it will have [U]LONG[/U] kernel runtimes on the GPU, which stalls the GUI (if there is one).
You could keep the 16 segments used for sieving (one for each residue class) in shared memory and synchronize threads only at the end of the whole task.

Updating the segment is a matter of nanoseconds, once you locally store the precomputed gaps for each prime.

Luigi
Hi,
another performance update. Using more classes helps to speed up the sieve (since it removes more candidates with small factors totally out of the sieve, at the cost of additional initializations).

Remove classes which are
- 3 or 5 mod 8
- multiples of 3 or 5
4 * 3 * 5 = 60 classes, 44 classes cleared, 16 classes remaining (26.7%)

Removing multiples of 7, too:
4 * 3 * 5 * 7 = 420 classes, 324 classes cleared, 96 classes remaining (22.9%)

Removing multiples of 11, too:
4 * 3 * 5 * 7 * 11 = 4620 classes, 3660 classes cleared, 960 classes remaining (20.8%)

-----
One process on the CPU, sieving the first 4000 odd primes:
M66362159 TF from 2^64 to 2^65: 160.7s / 185.5s
M66362159 TF from 2^65 to 2^66: 318.8s / 323.5s
M66362159 TF from 2^66 to 2^67: 631.6s / 620.7s
First time: using 96 classes
Second time: using 960 classes
The range is a bit too small for proper timings when using 960 classes. The reason is that I always fill up the candidate list, no matter whether it goes above the range or not. And compared to 96 classes it needs 10x more initializations.

-----
M66362159 TF from 2^64 to 2^65 using 96 classes: sieving the first 4000/25000 odd primes
1 process: 160.7s / 167.5s
2 processes: 285.8s / 238.1s (both processes are doing TF from 64 to 65 bits)

Throughput for M66362159 TF from 2^64 to 2^65:
Time estimate for my 4GHz C2D (based on ATH's timings): 1160s
(Since Prime95 TF fits excellently into the L1 cache of the CPU, I've scaled up the time directly by the clock rates.)

How often can I do this TF in one hour?
One process for CUDA and one with Prime95 (assuming they don't slow down each other): 3600s / 160.7s + 3600s / 1160s = 22.4 + 3.1 = 25.5
Two processes for CUDA and none for Prime95: 2 * 3600s / 238.1s = 30.2

So when aiming for TF throughput on my particular system, it is clearly beneficial to dedicate 2 CPU cores to my CUDA code.
[quote=TheJudger;200871]Using more classes helps to speed up the sieve (since it removes more candidates with small factors totally out of the sieve, at the cost of additional initializations).

Remove classes which are
- 3 or 5 mod 8
- multiples of 3 or 5
4 * 3 * 5 = 60 classes, 44 classes cleared, 16 classes remaining (26.7%)

Removing multiples of 7, too:
4 * 3 * 5 * 7 = 420 classes, 324 classes cleared, 96 classes remaining (22.9%)

Removing multiples of 11, too:
4 * 3 * 5 * 7 * 11 = 4620 classes, 3660 classes cleared, 960 classes remaining (20.8%)[/quote] Do you only rely on initialization to remove these? If so, you should perhaps consider removing them from the sieve itself; that'd help you put wider ranges in your L1 Dcache, at the expense of code size, but I don't think it's critical :smile:.
Hi ldesnogu,
I'm not 100% sure I got the point of your post. The smallest primes (3, 5, 7, 11) are removed from the sieve totally. I'll try to explain how it works. Let us ignore the fact that we can remove factors which are 3 or 5 mod 8, for simplicity (this would just increase the number of classes by a factor of four).

For example we take M23 (p=23) and build only 3 classes (remove all multiples of 3). The factor candidates (FC) have the form 2kp+1.
[CODE]class 0: k={ 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, ...}
class 1: k={ 1, 4, 7, 10, 13, 16, 19, 22, 25, 28, 31, 34, ...}
class 2: k={ 2, 5, 8, 11, 14, 17, 20, 23, 26, 29, 32, 35, ...}[/CODE]
Actually those lists are only for illustration here; they are not generated in the code.

Check if a class is 0 mod 3: k=3: 2kp+1 mod 3 = 1
Check if a class is 1 mod 3: k=1: 2kp+1 mod 3 = 2
Check if a class is 2 mod 3: k=2: 2kp+1 mod 3 = 0

So we can ignore class 2 totally: ALL candidates in this class are multiples of 3!
[CODE]class 0: k={ 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, ...}
class 1: k={ 1, 4, 7, 10, 13, 16, 19, 22, 25, 28, 31, 34, ...}
class 2: cleared[/CODE]
So we have no multiples of 3 now (talking about FCs). The next small prime is 5. For each class find the smallest k where 2kp+1 mod 5 = 0. These are k=9 for class 0 and k=4 for class 1. Starting at k=9/k=4 in these lists, mark every 5th k as cleared.
[CODE]class 0: k={ 3, 6, *9, 12, 15, 18, 21,*24, 27, 30, 33, ...}
class 1: k={ 1, *4, 7, 10, 13, 16,*19, 22, 25, 28, 31,*34, ...}
class 2: cleared[/CODE]
Repeat the sieving for the small primes 7, 11, 13, ... up to a certain bound.

Those classes are independent of each other, so we need to handle only a single class at a time. There is no need to try those FCs in ascending order.

My actual implementation uses one of
- 4 * 3 * 5 * 7 = 420 classes, sieve size = N * 11 * 13 * 17 * 19 bits = N * 46189 bits
- 4 * 3 * 5 * 7 * 11 = 4620 classes, sieve size = N * 13 * 17 * 19 * 23 bits = N * 96577 bits
Each k is represented as a single bit. These sieve sizes allow me to have preinitialized sieves in which the small primes up to 19/23 are already sieved.
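The class counts quoted a few posts above (16 of 60, 96 of 420, 960 of 4620) follow directly from these rules. Here is a tiny stand-alone check (plain C, a hypothetical helper, not mfaktc code) that reproduces the 96-of-420 case for the exponent used in the benchmarks:
[CODE]#include <stdio.h>

int main(void)
{
    unsigned long long p = 66362159ULL;  /* example exponent from this thread */
    int classes = 4 * 3 * 5 * 7;         /* 420 classes: k mod 420 */
    int remaining = 0;

    for (int c = 0; c < classes; c++)
    {
        /* one representative candidate is enough: within a class, 2kp+1 has a
           fixed residue mod 8, 3, 5 and 7 */
        unsigned long long f = 2ULL * c * p + 1;
        if (f % 8 == 3 || f % 8 == 5) continue;               /* factors are +-1 mod 8 */
        if (f % 3 == 0 || f % 5 == 0 || f % 7 == 0) continue; /* whole class divisible */
        remaining++;
    }
    printf("%d of %d classes remain\n", remaining, classes);  /* prints: 96 of 420 */
    return 0;
}[/CODE]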
Oh OK, I mistakenly thought you only removed smaller primes from your sieve by using initialization, while in fact your initialization removes some larger ones :smile:
In case you don't know, this method is called sieving by wheel.
Hi,
[QUOTE=ldesnogu;200928]In case you don't know, this method is called sieving by wheel.[/QUOTE] I didn't know that this has a specific name. I came up with this idea some years ago by myself (for another project, of course). In my original code I called it "groups", not "classes". ;)
In other words, you reinvented the wheel. :big grin:
nice one, ckdo. :)
1 Attachment(s)
Hello,
for those who want to try: find attached the code :)

At the current state I recommend this only to people who are familiar with CUDA. The code is NOT ready for everyday usage; you will be able to submit new factors to the PrimeNet server but you can't submit "no factor from 2^X to 2^Y" results. Compared to my postings from 4 Jan there are some speed improvements as well (maybe 10%).

When you start: try to run the selftest first (see params.h and enable it there). The selftest contains ~40 exponents which have known factors between 2^68 and 2^71. To speed up the selftest you can define MORE_CLASSES in params.h. The selftest only checks the class in which the known factor falls, so it is a lot faster than doing the full range.

Try to compile with 'compile_replace_umul.hi_umul24.hi.sh'; this gives a speedup for free compared to 'compile.sh'. (See post #21/#26 of this thread.)

Once you've created a binary, just run it without any parameters and it will show you how to run it. (You need to enter a valid set of parameters even if you want to run the selftest...) I recommend running it on Mersenne numbers with known factors and checking that my code finds the factor(s) as well. Please report the results of your runs.

It should work on (almost) all CUDA-capable devices. I'm not 100% sure about G80 (GeForce 8800 GTS 320/640, 8800 GTX, 8800 Ultra and their Quadro/Tesla variants). Maybe you need to disable "USE_PINNED_MEMORY" and "USE_ASYNC_COPY" on these devices (params.h). The selftest runs fine on an old 8600GTS :)

Documentation is almost entirely missing. :( The user interface isn't very comfortable, sorry for that ;)

Oliver
[quote=TheJudger;201416]The selftest runs fine on an old 8600GTS :)[/quote]
At least I know it works on my card :smile: How fast is it on the 8600GTS? From my experience of it, it is very slow and pointless compared with modern cards like the GTX 275, although plenty good enough for the gaming I do. Is it worth me getting it working? I have never compiled something like this, as I have only ever used msievegpu. Would there be any chance of binaries for it?
Hi henryzz,
for factors between 2^64 and 2^71 it is about twice as fast as a single core of ATH's Core 2 Quad.

I like the card: since it is relatively slow, it's easy to spot differences in the runtime of the GPU code, and the CPU never limits the throughput. The [B]RAW[/B] speed of the GPU code can be easily estimated since it scales perfectly with the GPU's GFLOPS. I have tested this on
- 8400GS (43.2 GFLOPS / 2.3M candidates tested per second)
- 8600GT (113 GFLOPS / 6.1M candidates tested per second)
- 8800GTX (518 GFLOPS / ~28M candidates tested per second)
- GTX 275 (1011 GFLOPS / ~54M candidates tested per second)

I think it is not the right time for precompiled binaries; there are too many compiletime options in the code right now.

I forgot to mention: I have run it only on Linux so far (openSUSE 11.1 x86-64). If you still want a binary I can create one on my system. Let me know which CPU you have and I'll make some settings then. You need to install the CUDA software as well.
[quote=TheJudger;201419]Hi henryzz,
for factors between 2^64 and 2^71 it is about twice as fast as a single core of ATH's Core 2 Quad.[/quote] Nice to know that my GPU will donate two cores' worth of throughput on my PC. :smile:

[quote]I like the card: since it is relatively slow, it's easy to spot differences in the runtime of the GPU code, and the CPU never limits the throughput. The [B]RAW[/B] speed of the GPU code can be easily estimated since it scales perfectly with the GPU's GFLOPS. I have tested this on
- 8400GS (43.2 GFLOPS / 2.3M candidates tested per second)
- 8600GT (113 GFLOPS / 6.1M candidates tested per second)
- 8800GTX (518 GFLOPS / ~28M candidates tested per second)
- GTX 275 (1011 GFLOPS / ~54M candidates tested per second)[/quote] Nice that it scales with GFLOPS. That makes it easy to make estimates. :smile:

[quote]I think it is not the right time for precompiled binaries; there are too many compiletime options in the code right now.

I forgot to mention: I have run it only on Linux so far (openSUSE 11.1 x86-64). If you still want a binary I can create one on my system. Let me know which CPU you have and I'll make some settings then. You need to install the CUDA software as well.[/quote] My attempts at compiling using compilers new to me often fail (I still haven't managed to compile anything major properly with Visual Studio, for example) :smile:. I often end up causing more trouble than the time it would take for you to post binaries. My platform is a Q6600. I have the CUDA software and will attempt to compile tomorrow on Ubuntu 9.04 64-bit. Hopefully it will go smoothly. :smile:
1 Attachment(s)
Hi Henry,
try this one. I won't be surprised if it doesn't work (library versions, ...).

I mistyped the model name of the 8600: I have an 8600GT here, not an 8600GTS. The GTS is faster.

Oliver
Hi,
On Ubuntu 9.04 / 32-bit / GTX260:
[CODE]
$ time ./mfaktc.exe 66362159 64 65
mfaktc v0.01 C...
...
no factor for M66362159 from 2^64 to 2^65 bits
tf(): total time spent: 273133msec

real 4m33.207s
user 4m30.925s
sys 0m2.288s
[/CODE]
Hi msft,
can you post the compiletime options and your CPU (Q8400?), too? I'm pretty sure this run was CPU-limited. If you want, try to run 2 (or maybe even 3) processes at the same time (in different directories, because both processes try to access results.txt).
Yes Q8400.
[CODE] #!/bin/bash -x mkdir compile_bla_bla cd compile_bla_bla gcc -Wall -O2 -c ../sieve.c -o sieve.o nvcc -c ../mfaktc.cu -o mfaktc.o -I /NVIDIA_GPU_Computing_SDK/C/common/inc/ --ptxas-options=-v --keep -DMUL24HI mv mfaktc.ptx mfaktc.ptx.old cat mfaktc.ptx.old | sed s/mul\.hi\.u32/mul24\.hi\.u32/ > mfaktc.ptx rm -f mfaktc.sm_10.cubin mfaktc.cu.cpp mfaktc.o ptxas --key="xxxxxxxxxx" -arch=sm_10 -v "mfaktc.ptx" -o "mfaktc.sm_10.cubin" fatbin --key="xxxxxxxxxx" --source-name="../mfaktc.cu" --usage-mode="-v " --embedded-fatbin="mfaktc.fatbin.c" "--image=profile=sm_10,file=mfaktc.sm_10.cubin" "--image=profile=compute_10,file=mfaktc.ptx" cudafe++ --gnu_version=40302 --diag_error=host_device_limited_call --diag_error=ms_asm_decl_not_allowed --parse_templates --gen_c_file_name "mfaktc.cudafe1.cpp" --stub_file_name "mfaktc.cudafe1.stub.c" --stub_header_file_name "mfaktc.cudafe1.stub.h" "mfaktc.cpp1.ii" gcc -D__CUDA_ARCH__=100 -E -x c++ -DCUDA_NO_SM_12_ATOMIC_INTRINSICS -DCUDA_NO_SM_13_DOUBLE_INTRINSICS -DCUDA_FLOAT_MATH_FUNCTIONS -DCUDA_NO_SM_11_ATOMIC_INTRINSICS "-I /NVIDIA_GPU_Computing_SDK/C/common/inc/" -I/usr/local/cuda/include/ -I. -o "mfaktc.cu.cpp" "mfaktc.cudafe1.cpp" gcc -c -x c++ "-I /NVIDIA_GPU_Computing_SDK/C/common/inc/" -I/usr/local/cuda/include/ -I. -o "mfaktc.o" "mfaktc.cu.cpp" gcc -fPIC -o ../mfaktc.exe sieve.o mfaktc.o -L/usr/local/lib -L/usr/local/cuda/lib -L/NVIDIA_GPU_Computing_SDK/C/lib -L/NVIDIA_GPU_Computing_SDK/C/common/common/lib/linux -lcudart -L/usr/local/cuda/lib -L/NVIDIA_GPU_Computing_SDK/C/lib -L/NVIDIA_GPU_Computing_SDK/C/common/lib/linux -lcufft -lm cd .. rm compile_bla_bla -rf [/CODE] [CODE] $ time ./mfaktc.exe 66362159 64 65 mfaktc v0.01 C... ... no factor for M66362159 from 2^64 to 2^65 bits tf(): total time spent: 273291msec real 4m33.374s user 4m31.081s sys 0m2.304s $ time ./mfaktc.exe 66362159 64 65 & $ time ./mfaktc.exe 66362159 64 65 & ... no factor for M66362159 from 2^64 to 2^65 bits tf(): total time spent: 274948msec real 4m35.055s user 4m31.613s sys 0m3.392s class 417: tested 265712378014859264 candidates in 12176232284160ms (93725704046247936/sec) no factor for M66362159 from 2^64 to 2^65 bits tf(): total time spent: 275090msec real 4m35.173s user 4m31.745s sys 0m3.356s [/CODE] |
Thank you, msft!
Actually I was asking for this: [CODE]Compiletime Options THREADS_PER_GRID 1048576 THREADS_PER_BLOCK 256 SIEVE_SIZE_LIMIT 32kiB SIEVE_SIZE 230945bits SIEVE_PRIMES 250000 USE_PINNED_MEMORY enabled USE_ASYNC_COPY enabled VERBOSE_TIMING disabled SELFTEST disabled MORE_CLASSES disabled [/CODE] It is clearly CPU bound with only one process on your machine (this was expected). The slowdown from one to two processes is very little. 275s for 2 times from 2^64 to 2^65 of M66362159 looks reasonable (still a little bit CPU-limited). My 275GTX paired with a fast Core 2 Duo does is in ~220 seconds. [QUOTE]class 417: tested 265712378014859264 candidates in 12176232284160ms (93725704046247936/sec) [/QUOTE] This doesn't look as it should. Can you edit mfaktc.cu line 615: replace [CODE]printf("class %4d: tested...[/CODE] with [CODE]printf("class %4Lu: tested...[/CODE] This is an example output on my Pentium-D with 8600GT [CODE]./mfaktc.exe 66362159 1 64 mfaktc v0.01 ... Compiletime Options THREADS_PER_GRID 1048576 THREADS_PER_BLOCK 256 SIEVE_SIZE_LIMIT 32kiB SIEVE_SIZE 230945bits SIEVE_PRIMES 250000 USE_PINNED_MEMORY enabled USE_ASYNC_COPY enabled VERBOSE_TIMING disabled SELFTEST disabled MORE_CLASSES disabled tf(66362159, 1, 64); k_min = 0 k_max = 138985412407 sieve_init(): sieving factor candidates with small primes up to 3497867 class 0: tested 54525952 candidates in 9014ms (6049029/sec) class 4: tested 54525952 candidates in 9014ms (6049029/sec) ... class 49: tested 54525952 candidates in 9014ms (6049029/sec) Result[00]: M66362159 has a factor: 6901664537 ... class 61: tested 54525952 candidates in 9014ms (6049029/sec) Result[00]: M66362159 has a factor: 9157977943 ... class 301: tested 54525952 candidates in 9015ms (6048358/sec) Result[00]: M66362159 has a factor: 124246422648815633 ... class 417: tested 54525952 candidates in 9014ms (6049029/sec) found 3 factors for M66362159 with 1 to 64 bits tf(): total time spent: 891193msec [/CODE] If you want to spent more time on this: please edit params.h and enable "SELFTEST" and "MORE_CLASSES" (remove // from the defines). It should find one factor per mersenne number (check results.txt after the run). |
1 Attachment(s)
Hi,
[CODE] Compiletime Options THREADS_PER_GRID 1048576 THREADS_PER_BLOCK 256 SIEVE_SIZE_LIMIT 32kiB SIEVE_SIZE 230945bits SIEVE_PRIMES 50000 USE_PINNED_MEMORY enabled USE_ASYNC_COPY enabled VERBOSE_TIMING disabled SELFTEST disabled MORE_CLASSES disabled [/CODE] [QUOTE=TheJudger;201470] If you want to spent more time on this: please edit params.h and enable "SELFTEST" and "MORE_CLASSES" (remove // from the defines). It should find one factor per mersenne number (check results.txt after the run).[/QUOTE] typescript.gz is log. [CODE] $ cat results.txt no factor for M66362159 from 2^64 to 2^65 bits no factor for M66362159 from 2^64 to 2^65 bits no factor for M66362159 from 2^64 to 2^65 bits no factor for M66362159 from 2^64 to 2^65 bits no factor for M66362159 from 2^64 to 2^65 bits no factor for M66362159 from 2^64 to 2^65 bits M50804297 has a factor: 180620316395899877719 M50725243 has a factor: 230316474510833959177 M49635893 has a factor: 280164061095680036711 M51332417 has a factor: 297892586972172587537 M51413951 has a factor: 317216341513975685569 M51265327 has a factor: 348552331323478392193 M50787953 has a factor: 408564895570348290031 M51161503 has a factor: 415469688496323219041 M51061601 has a factor: 427900063728254374393 M51082547 has a factor: 465935689349117544521 M51437311 has a factor: 503858403232211768047 M51486859 has a factor: 510284989447684180297 M51408359 has a factor: 522238472503709826367 M51532279 has a factor: 541792563550794873377 M50751637 has a factor: 550221472071174741833 M51302663 has a factor: 603656963178941666303 M51163433 has a factor: 684192107898332819377 M50896831 has a factor: 705640111241611518359 M51375383 has a factor: 713108825973682051703 M51133343 has a factor: 796838010410767671769 M51023447 has a factor: 931398820964215340641 M50863909 has a factor: 959145688648033584641 M50920721 has a factor: 1253793135671017237321 M48630643 has a factor: 1396673413347982098001 M51250613 has a factor: 1412902407482377985447 M51406301 has a factor: 1426645377855974696807 M50893061 has a factor: 1441854080374870808777 M50979079 has a factor: 1443184588520125697329 M51064417 has a factor: 1464103704184177492831 M51293899 has a factor: 1595148557829097879457 M51132959 has a factor: 1609354388906437820393 M51125413 has a factor: 1754609807377017622201 M50781589 has a factor: 1771605458538879435223 M51321659 has a factor: 1782972607557912437543 M49715873 has a factor: 2029034084175690064751 M49915309 has a factor: 2085962683046854861393 M51152869 has a factor: 2105744115640061414321 M50909147 has a factor: 2218183397480493562177 M51340871 has a factor: 2283988614248258513047 M47644171 has a factor: 2357049767161724465927 [/CODE] |
Thank you!
This was with the modified printf in mfaktc.cu line 615, right? results.txt and the screen output (typescript.gz) are as expected. :)

I just noticed another bug. Look at results.txt:
[CODE]no factor for M66362159 from 2^64 to 2^65 bits[/CODE]
2^64 to 2^65 [B]bits[/B] is way too much. ;)
[QUOTE=TheJudger;201475]
This was with the modified printf in mfaktc.cu line 615, right? [/QUOTE] Right.
[quote=TheJudger;201428]Hi Henry,
try this one. I won't be surprised if it doesn't work (libraries versions, ...) I have mistyped the model name of the 8600, I have a 8600GT here, not a 8600GTS. The GTS is faster. Oliver[/quote] Thanks Oliver, the binary works fine. However running it makes my pc respond slowly and it becomes almost unusable. Have you any suggestions to cure this? Would increasing the sieve bound to make it cpu bound help? It seems to respond every time it moves onto the next class. Here is a benchmark which is the same as the first one in #49. [code]time ./mfaktc.exe 66362159 64 65 mfaktc v0.01 Copyright (C) 2009, 2010 Oliver Weihe (o.weihe@t-online.de) This program comes with ABSOLUTELY NO WARRANTY; for details see COPYING. This is free software, and you are welcome to redistribute it under certain conditions; see COPYING for details. Compiletime Options THREADS_PER_GRID 1048576 THREADS_PER_BLOCK 256 SIEVE_SIZE_LIMIT 32kiB SIEVE_SIZE 230945bits SIEVE_PRIMES 50000 USE_PINNED_MEMORY enabled USE_ASYNC_COPY enabled VERBOSE_TIMING disabled SELFTEST disabled MORE_CLASSES disabled tf(66362159, 64, 65); k_min = 138985412160 k_max = 277970824814 sieve_init(): sieving factor candidates with small primes up to 611957 class 0: tested 61865984 candidates in 8318ms (7437603/sec) class 4: tested 61865984 candidates in 8311ms (7443867/sec) class 9: tested 61865984 candidates in 8311ms (7443867/sec) class 12: tested 61865984 candidates in 8314ms (7441181/sec) class 16: tested 61865984 candidates in 8307ms (7447452/sec) class 21: tested 61865984 candidates in 8315ms (7440286/sec) class 24: tested 61865984 candidates in 8303ms (7451039/sec) class 25: tested 61865984 candidates in 8312ms (7442972/sec) class 37: tested 61865984 candidates in 8311ms (7443867/sec) class 40: tested 61865984 candidates in 8311ms (7443867/sec) class 45: tested 61865984 candidates in 8311ms (7443867/sec) class 49: tested 61865984 candidates in 8312ms (7442972/sec) class 52: tested 61865984 candidates in 8313ms (7442076/sec) class 60: tested 61865984 candidates in 8319ms (7436709/sec) class 61: tested 61865984 candidates in 8316ms (7439392/sec) class 69: tested 61865984 candidates in 8309ms (7445659/sec) class 72: tested 61865984 candidates in 8304ms (7450142/sec) class 76: tested 61865984 candidates in 8309ms (7445659/sec) class 81: tested 61865984 candidates in 8316ms (7439392/sec) class 84: tested 61865984 candidates in 8317ms (7438497/sec) class 96: tested 61865984 candidates in 8314ms (7441181/sec) class 97: tested 61865984 candidates in 8314ms (7441181/sec) class 100: tested 61865984 candidates in 8314ms (7441181/sec) class 105: tested 61865984 candidates in 8317ms (7438497/sec) class 109: tested 61865984 candidates in 8311ms (7443867/sec) class 112: tested 61865984 candidates in 8314ms (7441181/sec) class 117: tested 61865984 candidates in 8318ms (7437603/sec) class 121: tested 61865984 candidates in 8314ms (7441181/sec) class 124: tested 61865984 candidates in 8303ms (7451039/sec) class 129: tested 61865984 candidates in 8308ms (7446555/sec) class 132: tested 61865984 candidates in 68309ms (905678/sec) class 136: tested 61865984 candidates in 68353ms (905095/sec) class 144: tested 61865984 candidates in 8321ms (7434921/sec) class 145: tested 61865984 candidates in 8317ms (7438497/sec) class 156: tested 61865984 candidates in 8309ms (7445659/sec) class 157: tested 61865984 candidates in 8313ms (7442076/sec) class 160: tested 61865984 candidates in 8315ms (7440286/sec) class 165: tested 61865984 candidates in 8313ms (7442076/sec) class 172: 
tested 61865984 candidates in 8310ms (7444763/sec) class 177: tested 61865984 candidates in 8315ms (7440286/sec) class 180: tested 61865984 candidates in 8313ms (7442076/sec) class 181: tested 61865984 candidates in 8310ms (7444763/sec) class 184: tested 61865984 candidates in 8316ms (7439392/sec) class 189: tested 61865984 candidates in 8305ms (7449245/sec) class 192: tested 61865984 candidates in 8308ms (7446555/sec) class 196: tested 61865984 candidates in 8316ms (7439392/sec) class 201: tested 61865984 candidates in 8314ms (7441181/sec) class 205: tested 61865984 candidates in 8313ms (7442076/sec) class 216: tested 61865984 candidates in 8308ms (7446555/sec) class 217: tested 61865984 candidates in 8315ms (7440286/sec) class 220: tested 61865984 candidates in 8313ms (7442076/sec) class 229: tested 61865984 candidates in 8307ms (7447452/sec) class 237: tested 61865984 candidates in 8315ms (7440286/sec) class 240: tested 61865984 candidates in 8303ms (7451039/sec) class 241: tested 61865984 candidates in 8311ms (7443867/sec) class 244: tested 61865984 candidates in 8315ms (7440286/sec) class 249: tested 61865984 candidates in 8317ms (7438497/sec) class 252: tested 61865984 candidates in 8311ms (7443867/sec) class 256: tested 61865984 candidates in 8313ms (7442076/sec) class 261: tested 61865984 candidates in 8316ms (7439392/sec) class 264: tested 61865984 candidates in 8307ms (7447452/sec) class 265: tested 61865984 candidates in 8316ms (7439392/sec) class 276: tested 61865984 candidates in 8303ms (7451039/sec) class 277: tested 61865984 candidates in 8314ms (7441181/sec) class 280: tested 61865984 candidates in 8311ms (7443867/sec) class 285: tested 61865984 candidates in 8316ms (7439392/sec) class 289: tested 61865984 candidates in 8317ms (7438497/sec) class 292: tested 61865984 candidates in 8313ms (7442076/sec) class 297: tested 61865984 candidates in 8314ms (7441181/sec) class 300: tested 61865984 candidates in 8314ms (7441181/sec) class 301: tested 61865984 candidates in 8317ms (7438497/sec) class 304: tested 61865984 candidates in 8309ms (7445659/sec) class 312: tested 61865984 candidates in 8317ms (7438497/sec) class 321: tested 61865984 candidates in 8313ms (7442076/sec) class 324: tested 61865984 candidates in 8316ms (7439392/sec) class 325: tested 61865984 candidates in 8315ms (7440286/sec) class 336: tested 61865984 candidates in 8313ms (7442076/sec) class 340: tested 61865984 candidates in 8313ms (7442076/sec) class 345: tested 61865984 candidates in 8316ms (7439392/sec) class 349: tested 61865984 candidates in 8312ms (7442972/sec) class 352: tested 61865984 candidates in 8318ms (7437603/sec) class 357: tested 61865984 candidates in 8314ms (7441181/sec) class 360: tested 61865984 candidates in 8313ms (7442076/sec) class 361: tested 61865984 candidates in 8315ms (7440286/sec) class 364: tested 61865984 candidates in 8317ms (7438497/sec) class 369: tested 61865984 candidates in 8313ms (7442076/sec) class 376: tested 61865984 candidates in 8315ms (7440286/sec) class 381: tested 61865984 candidates in 8315ms (7440286/sec) class 384: tested 61865984 candidates in 8314ms (7441181/sec) class 385: tested 61865984 candidates in 8316ms (7439392/sec) class 396: tested 61865984 candidates in 8313ms (7442076/sec) class 397: tested 61865984 candidates in 8319ms (7436709/sec) class 405: tested 61865984 candidates in 8317ms (7438497/sec) class 409: tested 61865984 candidates in 8310ms (7444763/sec) class 412: tested 61865984 candidates in 8312ms (7442972/sec) class 417: tested 61865984 
candidates in 8312ms (7442972/sec) no factor for M66362159 from 2^64 to 2^65 bits tf(): total time spent: 922393msec real 15m22.494s user 13m25.326s sys 0m0.820s [/code]

Whenever I try to use my PC the times suddenly ramp up to 68 seconds per class. I will now have a go at compiling it myself and see how I fare.
I just compiled successfully after changing the CUDA directory in the script.

The old version of the script runs at 2/3 of the speed of the one with the hack, which matches your compilation. I will now try different sieve bounds. :smile:
Instead of rolling your own "bit clear", use the x86 btr instruction. That will speed up your CPU sieving routine.

In fact, the critical loop in sieve_candidates could be rewritten in assembly.
Thought you might like the speedup I got from squaring SIEVE_SIZE_LIMIT :smile:
[code]time ./mfaktc.exe 66362159 64 65 mfaktc v0.01 Copyright (C) 2009, 2010 Oliver Weihe (o.weihe@t-online.de) This program comes with ABSOLUTELY NO WARRANTY; for details see COPYING. This is free software, and you are welcome to redistribute it under certain conditions; see COPYING for details. Compiletime Options THREADS_PER_GRID 1048576 THREADS_PER_BLOCK 512 SIEVE_SIZE_LIMIT 1024kiB SIEVE_SIZE 8360209bits SIEVE_PRIMES 1000000 USE_PINNED_MEMORY enabled USE_ASYNC_COPY enabled VERBOSE_TIMING disabled SELFTEST disabled MORE_CLASSES disabled tf(66362159, 64, 65); k_min = 138985412160 k_max = 277970824814 sieve_init(): sieving factor candidates with small primes up to 15485867 class 0: tested 50331648 candidates in 3286ms (15316995/sec) class 4: tested 50331648 candidates in 3221ms (15626093/sec) class 9: tested 50331648 candidates in 3232ms (15572910/sec) class 12: tested 50331648 candidates in 3233ms (15568094/sec) class 16: tested 50331648 candidates in 3239ms (15539255/sec) class 21: tested 50331648 candidates in 3224ms (15611553/sec) class 24: tested 50331648 candidates in 3214ms (15660126/sec) class 25: tested 50331648 candidates in 3210ms (15679641/sec) class 37: tested 50331648 candidates in 3233ms (15568094/sec) class 40: tested 50331648 candidates in 3246ms (15505744/sec) class 45: tested 50331648 candidates in 3214ms (15660126/sec) class 49: tested 50331648 candidates in 3211ms (15674758/sec) class 52: tested 50331648 candidates in 3208ms (15689416/sec) class 60: tested 50331648 candidates in 3232ms (15572910/sec) class 61: tested 50331648 candidates in 3221ms (15626093/sec) class 69: tested 50331648 candidates in 3219ms (15635802/sec) class 72: tested 50331648 candidates in 3215ms (15655255/sec) class 76: tested 50331648 candidates in 3220ms (15630946/sec) class 81: tested 50331648 candidates in 3259ms (15443893/sec) class 84: tested 50331648 candidates in 3226ms (15601874/sec) class 96: tested 50331648 candidates in 3234ms (15563280/sec) class 97: tested 50331648 candidates in 3245ms (15510523/sec) class 100: tested 50331648 candidates in 3226ms (15601874/sec) class 105: tested 50331648 candidates in 3228ms (15592208/sec) class 109: tested 50331648 candidates in 3236ms (15553661/sec) class 112: tested 50331648 candidates in 3212ms (15669877/sec) class 117: tested 50331648 candidates in 3240ms (15534459/sec) class 121: tested 50331648 candidates in 3215ms (15655255/sec) class 124: tested 50331648 candidates in 3226ms (15601874/sec) class 129: tested 50331648 candidates in 3224ms (15611553/sec) class 132: tested 50331648 candidates in 3244ms (15515304/sec) class 136: tested 50331648 candidates in 3225ms (15606712/sec) class 144: tested 50331648 candidates in 3218ms (15640661/sec) class 145: tested 50331648 candidates in 3226ms (15601874/sec) class 156: tested 50331648 candidates in 3219ms (15635802/sec) class 157: tested 50331648 candidates in 3291ms (15293724/sec) class 160: tested 50331648 candidates in 3230ms (15582553/sec) class 165: tested 50331648 candidates in 3217ms (15645523/sec) class 172: tested 50331648 candidates in 3222ms (15621243/sec) class 177: tested 50331648 candidates in 3213ms (15665000/sec) class 180: tested 50331648 candidates in 3220ms (15630946/sec) class 181: tested 50331648 candidates in 3214ms (15660126/sec) class 184: tested 50331648 candidates in 3234ms (15563280/sec) class 189: tested 50331648 candidates in 3296ms (15270524/sec) class 192: tested 50331648 candidates in 3268ms (15401361/sec) class 196: tested 50331648 candidates in 3243ms 
(15520088/sec) class 201: tested 50331648 candidates in 3233ms (15568094/sec) class 205: tested 50331648 candidates in 3227ms (15597039/sec) class 216: tested 50331648 candidates in 3227ms (15597039/sec) class 217: tested 50331648 candidates in 3218ms (15640661/sec) class 220: tested 50331648 candidates in 3218ms (15640661/sec) class 229: tested 50331648 candidates in 3227ms (15597039/sec) class 237: tested 50331648 candidates in 3299ms (15256637/sec) class 240: tested 50331648 candidates in 3243ms (15520088/sec) class 241: tested 50331648 candidates in 3220ms (15630946/sec) class 244: tested 50331648 candidates in 3218ms (15640661/sec) class 249: tested 50331648 candidates in 3216ms (15650388/sec) class 252: tested 50331648 candidates in 3223ms (15616397/sec) class 256: tested 50331648 candidates in 3222ms (15621243/sec) class 261: tested 50331648 candidates in 3232ms (15572910/sec) class 264: tested 50331648 candidates in 3220ms (15630946/sec) class 265: tested 50331648 candidates in 3321ms (15155570/sec) class 276: tested 50331648 candidates in 3216ms (15650388/sec) class 277: tested 50331648 candidates in 3225ms (15606712/sec) class 280: tested 50331648 candidates in 3217ms (15645523/sec) class 285: tested 50331648 candidates in 3225ms (15606712/sec) class 289: tested 50331648 candidates in 3224ms (15611553/sec) class 292: tested 50331648 candidates in 3229ms (15587379/sec) class 297: tested 50331648 candidates in 3216ms (15650388/sec) class 300: tested 50331648 candidates in 3235ms (15558469/sec) class 301: tested 50331648 candidates in 3277ms (15359062/sec) class 304: tested 50331648 candidates in 3217ms (15645523/sec) class 312: tested 50331648 candidates in 3234ms (15563280/sec) class 321: tested 50331648 candidates in 3235ms (15558469/sec) class 324: tested 50331648 candidates in 3219ms (15635802/sec) class 325: tested 50331648 candidates in 3245ms (15510523/sec) class 336: tested 50331648 candidates in 3221ms (15626093/sec) class 340: tested 50331648 candidates in 3242ms (15524876/sec) class 345: tested 50331648 candidates in 3284ms (15326323/sec) class 349: tested 50331648 candidates in 3265ms (15415512/sec) class 352: tested 50331648 candidates in 3228ms (15592208/sec) class 357: tested 50331648 candidates in 3230ms (15582553/sec) class 360: tested 50331648 candidates in 3221ms (15626093/sec) class 361: tested 50331648 candidates in 3215ms (15655255/sec) class 364: tested 50331648 candidates in 3234ms (15563280/sec) class 369: tested 50331648 candidates in 3221ms (15626093/sec) class 376: tested 50331648 candidates in 3230ms (15582553/sec) class 381: tested 50331648 candidates in 3318ms (15169273/sec) class 384: tested 50331648 candidates in 3258ms (15448633/sec) class 385: tested 50331648 candidates in 3234ms (15563280/sec) class 396: tested 50331648 candidates in 3231ms (15577730/sec) class 397: tested 50331648 candidates in 3233ms (15568094/sec) class 405: tested 50331648 candidates in 3214ms (15660126/sec) class 409: tested 50331648 candidates in 3237ms (15548856/sec) class 412: tested 50331648 candidates in 3236ms (15553661/sec) class 417: tested 50331648 candidates in 3257ms (15453376/sec) no factor for M66362159 from 2^64 to 2^65 bits tf(): total time spent: 402946msec real 6m43.049s user 6m41.909s sys 0m0.812s [/code] edit: sorry for the false hope just realized that it only appeared when THREADS_PER_BLOCK was 512 this seems to stop it finding factors |
Hi,
axn: actually I'm not familiar with assembler. :(

henry: If you're on your 8600GTS then it is clear: the 8600GTS doesn't support 512 threads per block. You'll need a GT200-based GPU for this (GTX 260/275/280/285/295). |
henryzz's recipe on GTX260
[CODE] $ time ./mfaktc.exe 66362159 64 65
mfaktc v0.01 Copyright (C) 2009, 2010 Oliver Weihe (o.weihe@t-online.de)
This program comes with ABSOLUTELY NO WARRANTY; for details see COPYING.
This is free software, and you are welcome to redistribute it under certain conditions; see COPYING for details.

Compiletime Options
THREADS_PER_GRID 1048576
THREADS_PER_BLOCK 512
SIEVE_SIZE_LIMIT 1024kiB
SIEVE_SIZE 8360209bits
SIEVE_PRIMES 1000000
USE_PINNED_MEMORY enabled
USE_ASYNC_COPY enabled
VERBOSE_TIMING disabled
SELFTEST disabled
MORE_CLASSES disabled

tf(66362159, 64, 65);
k_min = 138985412160
k_max = 277970824814
sieve_init(): sieving factor candidates with small primes up to 15485867
class 0: tested 50331648 candidates in 3064ms (16426778/sec)
class 4: tested 50331648 candidates in 3073ms (16378668/sec)
...
class 417: tested 50331648 candidates in 3068ms (16405361/sec)
no factor for M66362159 from 2^64 to 2^65 bits
tf(): total time spent: 382236msec

real 6m22.313s
user 6m20.232s
sys 0m2.084s
[/CODE] |
Hi!
msft: with SIEVE_PRIMES 1000000 you're heavily CPU-bound. Your GTX 260 should do something like 45-50M tests per second, so with these settings you'll need 3 CPU cores (3 processes).

henry: how difficult was it for you (as a CUDA newbie) to compile the code?

I need to write some checks into the code, e.g. (a sketch of such checks follows below):
- query the device whether it is capable of 512 threads per block
- check that THREADS_PER_GRID is a multiple of THREADS_PER_BLOCK
- AFAIK sieving should not exceed the Mersenne number (the exponent) itself
- ...

About the slowdown of the GUI: yes, I know, but at the current state I don't care about it. I normally run it from the command line.

Oliver |
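A minimal sketch of what such start-up checks could look like with the CUDA runtime API (illustrative only, not code from mfaktc; the macro names simply mirror the compile-time options printed above):

[CODE]#include <stdio.h>
#include <cuda_runtime.h>

/* Hypothetical compile-time settings, mirroring the ones mfaktc prints. */
#define THREADS_PER_GRID  (1024 * 1024)
#define THREADS_PER_BLOCK 512

/* Returns 0 if the chosen settings look usable on the given device. */
static int check_device_limits(int device)
{
  cudaDeviceProp prop;

  if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
  {
    printf("cudaGetDeviceProperties() failed\n");
    return 1;
  }
  if (prop.maxThreadsPerBlock < THREADS_PER_BLOCK)
  {
    printf("%s supports only %d threads per block, need %d\n",
           prop.name, prop.maxThreadsPerBlock, THREADS_PER_BLOCK);
    return 1;
  }
  if (THREADS_PER_GRID % THREADS_PER_BLOCK != 0)
  {
    printf("THREADS_PER_GRID must be a multiple of THREADS_PER_BLOCK\n");
    return 1;
  }
  return 0;
}[/CODE]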
[quote=TheJudger;201609]
henry: how difficult was it for you (as a CUDA newbie) to compile the code? [/quote] Very easy :smile: Your script for compiling required me to change the directory that it looks for CUDA in, as Ubuntu puts it in a different place (I presume because it is based on Debian, not plain Linux?). If I hadn't fiddled around with the paths for CUDA in the past, I would have found it hard work to find the correct place to point it to. When I installed CUDA, none of the environment variables that were supposed to be added actually were (and I never found the easy way to add them :smile: only the hard way). Unfortunately this seems to happen every time I install something like that.

BTW read posts 17, 18 and 20 of this [URL="http://www.mersenneforum.org/showthread.php?t=10910"][COLOR=#800080]thread[/COLOR][/URL] :smile: |
[QUOTE=TheJudger;201588]axn: actually I'm not familiar with assembler. :([/QUOTE]
Can you use this piece of code:
[CODE]static inline void sieve_clear_bit(unsigned int *array, unsigned int bit)
{
  // unsigned int chunk;
  // chunk=bit>>5;
  // bit&=0x1F;
  // array[chunk]&=mask0[bit];

  /* x86 "btrl": clears bit number 'bit' of the bit string starting at *array */
  asm("btrl %0, %1"
      : /* no output */
      : "r" (bit), "m" (*array)
      : "memory", "cc");
}[/CODE]
and tell me if it:
a) compiles
b) runs correctly and
c) is faster or slower? |
Compiling and running under Windows
1 Attachment(s)
I managed to compile the code using MSVC on my laptop running Vista 32-bit. I had to change some things in mfaktc.cu, mostly because of annoying compatibility issues.
I noticed that the program had full CPU utilization all the time; this is caused by cudaStreamSynchronize() in line 542, which continuously polls the device. I replaced it with an event-driven construction similar to the one described in [URL="http://forums.nvidia.com/lofiversion/index.php?t75431.html"]http://forums.nvidia.com/lofiversion/index.php?t75431.html[/URL]

I couldn't get the ptx hack to run. I tried the procedure described in post #26, but for some strange reason nvcc always gives an internal error when using the --dryrun command line option.

The results from the selftest are attached to this post. This was run on a laptop with a P7350 2.0 GHz and a 9600M GS. This card is capable of 91.2 GFLOPS, but the result of 3.2M candidates tested per second is rather low compared with the list in post #44, possibly because of the absence of the ptx hack?

When running the program, the UI of Vista slows down considerably, although it is still workable for most tasks.

axn: Your code is in gcc syntax, I couldn't use it right away with MSVC. |
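The event-driven construction is not spelled out in the post, but one common pattern (a hedged sketch using standard CUDA runtime calls; the kernel and stream names are placeholders, not mfaktc's) is to record an event created with cudaEventBlockingSync and wait on that, so the host thread sleeps instead of spinning:

[CODE]#include <cuda_runtime.h>

/* Placeholder kernel standing in for mfaktc's trial-factoring kernel. */
__global__ void tf_kernel(unsigned int *ktab);

void launch_and_wait(unsigned int *d_ktab, cudaStream_t stream)
{
  cudaEvent_t done;

  /* With cudaEventBlockingSync, cudaEventSynchronize() blocks the calling
     thread instead of busy-waiting on the device. */
  cudaEventCreateWithFlags(&done, cudaEventBlockingSync);

  tf_kernel<<<4096, 256, 0, stream>>>(d_ktab);
  cudaEventRecord(done, stream);

  cudaEventSynchronize(done);  /* CPU sleeps until the kernel has finished */
  cudaEventDestroy(done);
}[/CODE]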
axn: I'll try later, thank you. :)
BigBrother: 3.2M/sec is within the expected range when you run without the ptx hack. The ptx hack should give an improvement of ~50% for the GPU code.

About the event-driven replacement of cudaStreamSynchronize(): I'll check it. Currently it's not a problem on a faster GPU, since the code is limited by the speed of the siever on the CPU. :(

Oliver |
Hi David,
[QUOTE=henryzz;201626]Very easy :smile: Your script for compiling required me to change the directory that it looks for CUDA in, as Ubuntu puts it in a different place (I presume because it is based on Debian, not plain Linux?)[/QUOTE]
Debian is Linux! You needed to change the path because I didn't install CUDA into the default directory on my systems. Personally I like /opt/<APP> for "software which does not come from the distribution". Sometimes I expand it to /opt/<APP>/<VERSION>, which makes it easier to handle different versions.

Oliver |
[QUOTE=BigBrother;201732]axn: Your code is in gcc syntax, I couldn't use it right away with MSVC.[/QUOTE]
Presumably you've figured out what to do already; if not, check this out: [url]http://msdn.microsoft.com/en-us/library/hd0hzyf8.aspx[/url] |
Hi,
On my GTX260:
a) compiles: OK
b) runs correctly: yes, and the results are correct
c) is faster or slower? slower |
[QUOTE=axn;201745]Presumably you've figured out what to do already; if not, check this out: [url]http://msdn.microsoft.com/en-us/library/hd0hzyf8.aspx[/url][/QUOTE]
Yes, I replaced the two sieve_clear_bit() calls with the _bittestandreset intrinsic, but it didn't improve performance. Actually, the program was slightly slower. |
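For reference, a hedged sketch of how sieve_clear_bit() could be made to compile on both toolchains — not BigBrother's actual patch; _bittestandreset is declared in MSVC's <intrin.h> and takes a long* base plus a bit index:

[CODE]#ifdef _MSC_VER
#include <intrin.h>
#pragma intrinsic(_bittestandreset)
#endif

static inline void sieve_clear_bit(unsigned int *array, unsigned int bit)
{
#ifdef _MSC_VER
  /* Emits a btr instruction; the bit index may exceed 31, the CPU
     addresses the correct 32-bit chunk of the array itself. */
  _bittestandreset((long *)array, (long)bit);
#else
  /* gcc-style inline asm: the same btr instruction. */
  asm("btrl %0, %1"
      : /* no output */
      : "r" (bit), "m" (*array)
      : "memory", "cc");
#endif
}[/CODE]

As reported above, neither variant beat the plain C shift-and-mask version on the machines tested here, so this is mainly a portability note.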
Bummer :sad:
|
The infamous ptx hack on Windows
Using a method described on [URL="http://code.cheesydesign.com/"]http://code.cheesydesign.com/[/URL] I managed to uncover the commands issued by nvcc without using the defective --dryrun command line option.
The program now processes 4.2M candidates/second on my 9600M GS, which is an increase of ~33% :smile: |
[quote=TheJudger;200887]I'm not 100% sure if I got the point of your post.
The smallest primes (3, 5, 7, 11) are removed from the sieve totally.[/quote] I quickly read your code, and it seems that it's not using a dense representation for the sieve array. You don't need to have bits for even numbers for instance. If I misread your code, sorry. |
ldesnogu: there are no even numbers.
The sieve represents the k-values of the factor candidates. The factor candidates are 2*k*p+1, so they are always odd. |
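To make that concrete: bit k of the sieve stands for the candidate divisor q = 2*k*p + 1 (p being the exponent), which is odd by construction; in addition q must be congruent to 1 or 7 (mod 8) and must not be divisible by the small sieve primes, which is what the class splitting exploits. A small illustrative sketch (not mfaktc's code) of why 96 of the 420 classes of k mod 420 = 4*3*5*7 survive, matching the 96 "class" lines in the logs posted earlier:

[CODE]#include <stdio.h>

/* Does class 'k mod 420' contain viable candidates q = 2*k*p + 1 at all?
   Within one class, q mod 8, q mod 3, q mod 5 and q mod 7 are fixed,
   so whole classes can be discarded before any sieving is done. */
static int class_needed(unsigned int exponent, unsigned int k_class)
{
  unsigned long long q = 2ULL * k_class * exponent + 1ULL;

  if (q % 8 != 1 && q % 8 != 7) return 0;  /* q must be +/-1 (mod 8) */
  if (q % 3 == 0) return 0;
  if (q % 5 == 0) return 0;
  if (q % 7 == 0) return 0;
  return 1;
}

int main(void)
{
  unsigned int p = 66362159, count = 0, c;

  for (c = 0; c < 420; c++)
    count += class_needed(p, c);
  printf("%u of 420 classes survive\n", count);  /* prints 96 */
  return 0;
}[/CODE]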
1 Attachment(s)
Hi!
axn: the asm code works but is slower (as msft reported already) :( I don't remember the specific settings for this run, but here are the numbers:
My C code as inline function: ~33M/s
My C code as macro: ~32M/s
Your asm code as inline function: ~28M/s
Your asm code as macro: ~30M/s
-----
Anyway, I've improved the siever a little bit. :)
-----
New benchmarks on my system (openSUSE 11.1 x86-64, CUDA 2.3, GTX 275, 4GHz C2D)

Single Process
THREADS_PER_GRID: 2^20
THREADS_PER_BLOCK: 256
SIEVE_PRIMES: 15000
M66362159 from 2^ 1 to 2^64: 142.4s
M66362159 from 2^64 to 2^65: 142.2s
M66362159 from 2^65 to 2^66: 282.4s
M66362159 from 2^66 to 2^67: 559.4s
---
Single Process
THREADS_PER_GRID: 2^20
THREADS_PER_BLOCK: 256
SIEVE_PRIMES: 15000
MORE_CLASSES
M66362159 from 2^65 to 2^66: 284.1s
M66362159 from 2^66 to 2^67: 545.5s
M66362159 from 2^67 to 2^68: 1068.6s
---
Two processes at the same time
THREADS_PER_GRID: 2^20
THREADS_PER_BLOCK: 256
SIEVE_PRIMES: 150000 (10-fold increase compared to Single Process)
M66362159 from 2^64 to 2^65: 206.3s
M66362159 from 2^65 to 2^66: 401.0s
M66362159 from 2^66 to 2^67: 794.0s
-----
Find attached the new version. :)

version 0.02 (2010-01-13)
- fixed some printf's
- allocate and free arrays only ONCE (was per class before)
- added check of return values of most *alloc()
- siever: improved the loop which creates the candidate list

Oliver |
[quote=TheJudger;201419]Hi henryzz,
I forgot to mention: I have run it only on Linux right now (openSUSE 11.1 x86-64). If you still want a binary I can create one on my system. Let me know which CPU you have, I'll pick some settings then. You need to install the CUDA software as well.[/quote]
I'm interested in an exe for 32-bit Windows XP SP3, CUDA 2.3, Athlon 64 (single core), GeForce 8600GT (256 MB). It would be nice if that were possible, because I haven't installed the MSVC compiler on this machine. |
Windows executables
1 Attachment(s)
Here are the executables I compiled; they work on my 32-bit Vista system, so I guess they also work on 32-bit XP. Both versions are included, the original and the one with the ptx hack.
|
Very nice, I'll try it out later, because msieve_gpu is currently running on the Nvidia card.
Thanks.. |
Thank you, BigBrother.
Actually I have no CUDA-capable compiler environment installed under Windows. Is MSVC the first choice for CUDA on Windows? Is it available for free for non-commercial usage?
-----
I've improved the sieve code again. I've unrolled the loop which creates the ktab and use a precalculated table to check 8 bits at once. This minimized the effect of "MORE_CLASSES" (it becomes beneficial at 2^69 to 2^70 for M66362159). The GPU code is untouched.

Single Process
THREADS_PER_GRID: 2^20
THREADS_PER_BLOCK: 256
SIEVE_PRIMES: 45000 (with this value my CPU can keep the GPU busy with one core)
M66362159 from 2^ 1 to 2^64: 113.4s
M66362159 from 2^64 to 2^65: 113.4s
M66362159 from 2^65 to 2^66: 223.6s
M66362159 from 2^66 to 2^67: 442.3s
M66362159 from 2^67 to 2^68: 879.7s

Sorry, no new code release yet. You have to wait, I want to try some things first. ;)

Oliver |
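The "check 8 bits at once" idea can be sketched roughly as follows (an illustration under assumptions, not the actual mfaktc siever; the array and function names are made up): for every possible byte value a table lists the positions of its set bits, so the loop that builds the candidate list emits up to eight k-offsets per lookup instead of testing individual bits.

[CODE]/* bit_pos[b] lists the set-bit positions of byte value b,
   bit_cnt[b] says how many there are; both built once at start-up. */
static unsigned char bit_pos[256][8];
static unsigned char bit_cnt[256];

static void init_bit_tables(void)
{
  int b, i;
  for (b = 0; b < 256; b++)
  {
    bit_cnt[b] = 0;
    for (i = 0; i < 8; i++)
      if (b & (1 << i))
        bit_pos[b][bit_cnt[b]++] = (unsigned char)i;
  }
}

/* Turn a byte-addressed sieve into a list of surviving k-offsets (ktab). */
static unsigned int sieve_to_ktab(const unsigned char *sieve,
                                  unsigned int sieve_bytes,
                                  unsigned int *ktab)
{
  unsigned int byte, i, n = 0;

  for (byte = 0; byte < sieve_bytes; byte++)
  {
    unsigned int b = sieve[byte];
    for (i = 0; i < bit_cnt[b]; i++)      /* at most 8 iterations */
      ktab[n++] = byte * 8 + bit_pos[b][i];
  }
  return n;                               /* number of candidates written */
}[/CODE]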
[QUOTE=TheJudger;202557]Thank you, BigBrother.
Actually I have no CUDA-capable compiler environment installed under Windows. Is MSVC the first choice for CUDA on Windows? Is it available for free for non-commercial usage? [/QUOTE]
I don't know if MSVC is the first choice for CUDA on Windows; it's the only one I've got here. I'm using a student license for MSVC. |
Microsoft Visual Studio Express and MSVC++ Express Edition are free of charge; MFC, ATL and OpenMP require the Standard Edition or higher.
|
E:\mfactc>mfaktc-hack 65255629 68 69 0
mfaktc v0.02 Copyright (C) 2009, 2010 Oliver Weihe
THREADS_PER_GRID 1048576
THREADS_PER_BLOCK 256
SIEVE_SIZE_LIMIT 32kiB
SIEVE_SIZE 230945bits
SIEVE_PRIMES 50000
USE_PINNED_MEMORY enabled
USE_ASYNC_COPY enabled
VERBOSE_TIMING disabled
SELFTEST disabled
MORE_CLASSES disabled
sieve_init(): sieving factor candidates with small primes up to 611957
tf(65255629, 68, 69);
k_min = 2261474677980
no factor for M65255629 from 2^68 to 2^69
tf(): total time spent: 12785031msec

too bad that I can't upload the results to primenet |
1 Attachment(s)
Hello!
moebius: If you want to help: rerun already factored exponents and help prove that my code finds the factor(s) as well. False positives are not the problem (if there are any), but we need to be sure that my code doesn't miss factors. On your hardware "SIEVE_PRIMES 200000" might be beneficial; too bad that you can't check easily (it needs a recompile). Can you also include one or two lines starting with "class" so I can see the actual speed?
-----
Find attached version 0.03
- allow exponents up to 2^32 - 1 (tested with some exponents around M3321xxxxxx)
- siever: improved the loop which creates the candidate list (again)
  - loop unrolled
  - use a lookup table to parse 8 bits at once
- added 40 known factors from ElevenSmooth "Operation Billion Digits" in the M3321xxxxxx range to the selftest
- added another timer which helps to adjust SIEVE_PRIMES (needs to be enabled with VERBOSE_TIMING)

The bigger exponents need further testing! You'll see the speedups only in configurations where the CPU was limiting!
-----
Uncwilly: M3312xxxxxx from 2^1 to 2^71 needs ~5m 35s on my system :)

Oliver |
[QUOTE=TheJudger;202685]added 40 known factors from ElevenSmooth "Operation Billion Digits" in the M3321xxxxxx range to the selftest[/QUOTE]
As an owner of an Nvidia card on a Linux OS and a manager of Operation Billion Digits, I will soon ask for help testing the range 3321xxxxxx :smile:

Luigi |
Hi Luigi,
keep in mind that it is still limited to a 71-bit factor size.

Oliver |
[QUOTE=TheJudger;202688]Hi Luigi,
keep in mind that it is still limited to a 71-bit factor size. Oliver[/QUOTE]
Keep in mind that I still need to install Linux and learn how to use CUDA... We may test our factors below 71 bits in less than a day (note that we didn't use Prime95...) :smile:

I keep waiting for an 80-bit version to put on the OBD page :smile:

Luigi |
Wow, this could be a turning point for GIMPS. I hope your code gets implemented in Prime95!
|
[QUOTE=TheJudger;202685]Uncwilly: M3312xxxxxx from 2^1 to 2^71 needs ~5m 35s on my system :)[/QUOTE]I may have to invest in a box for home then. I currently only have a laptop.
|
[QUOTE=TheJudger;202685]
Uncwilly: M3312xxxxxx from 2^1 to 2^71 needs ~5m 35s on my system :) [/QUOTE] How does that compare with CPUs? (Core i7 etc...)

-- Craig |
[QUOTE=nucleon;202761]How does that compare with CPUs? (Core i7 etc...)[/QUOTE]70->71 on a single core ~8h on my laptop. 0.72 GHz Days credit.
|
[QUOTE=TheJudger;202685]Uncwilly: M3312xxxxxx from 2^1 to 2^71 needs ~5m 35s on my system :)[/QUOTE]If you or anyone else gets bored, here is a lovely link to some work that might keep you busy for a little while.
[url]http://v5www.mersenne.org/report_factoring_effort/?exp_lo=332192831&exp_hi=332299999&bits_lo=0&bits_hi=70&txt=1&exassigned=1&B1=Get+Data[/url] |
[quote=BigBrother;202566]I don't know if MSVC is the first choice for CUDA on Windows; it's the only one I've got here. I'm using a student license for MSVC.[/quote]
MSVC 8.0 or 9.0 is the default for CUDA.
I have now installed MSVC 9.0, but I cannot compile the source code for version 0.03 because the headers <sys/time.h> and <unistd.h> don't exist in Visual C++. Please tell me the trick. |
[QUOTE=moebius;202790]MSVC 8.0 or 9.0 is the default for CUDA.
I have now installed MSVC 9.0, but I cannot compile the source code for version 0.03 because the headers <sys/time.h> and <unistd.h> don't exist in Visual C++. Please tell me the trick.[/QUOTE]
I removed those two lines and included <timeval.h>, which is available [URL="http://wyw.dcweb.cn/timeval.h.txt"]here[/URL]. You'll also have to replace every %Lu by %llu in printf statements, and maybe some other minor things as well. Just ask if you run into more trouble. |
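The linked timeval.h is essentially a gettimeofday() shim for Windows; a minimal sketch of the same idea (an assumption-laden illustration, not the exact file from that link; the names here are made up) looks roughly like this. The %Lu to %llu change is needed because %Lu for unsigned long long is a glibc extension that MSVC's printf does not understand.

[CODE]#ifdef _WIN32
#include <windows.h>

struct my_timeval { long tv_sec; long tv_usec; };

/* Offset between the FILETIME epoch (1601-01-01) and the Unix epoch
   (1970-01-01), in 100-nanosecond units. */
#define EPOCH_DIFF 116444736000000000ULL

static int my_gettimeofday(struct my_timeval *tv)
{
  FILETIME ft;
  unsigned long long t;

  GetSystemTimeAsFileTime(&ft);              /* 100 ns ticks since 1601 */
  t  = ((unsigned long long)ft.dwHighDateTime << 32) | ft.dwLowDateTime;
  t -= EPOCH_DIFF;                           /* now ticks since 1970 */
  tv->tv_sec  = (long)(t / 10000000ULL);
  tv->tv_usec = (long)((t % 10000000ULL) / 10ULL);
  return 0;
}
#endif[/CODE]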
[QUOTE=Uncwilly;202774]70->71 on a single core ~8h on my laptop. 0.72 GHz Days credit.[/QUOTE]
Did you test it with Factor5? AFAIK, Prime95 can't work with such exponents... Luigi |
[QUOTE=Uncwilly;202774]70->71 on a single core ~8h on my laptop. 0.72 GHz Days credit.[/QUOTE]
That is for M332xxxxxx from your link to the PrimeNet server? I was talking about M3312xxxxxx. Otherwise the speedup seems to be way too much!

Oliver |
[QUOTE=TheJudger;202810]That is for M332xxxxxx from your link to the PrimeNet server?
I was talking about M3312xxxxxx. Otherwise the speedup seems to be way too much![/QUOTE]Since you called me out, I assumed that you were talking about a 100M-digit exponent. I didn't count the xxx's. I was quoting based upon 332,2xx,xxx. |
[quote=BigBrother;202800]Just ask if you run into more trouble.[/quote]
This basically works now, but I had to compile with the /TP option because of an error message referring to the unresolved external "_sieve_candidates" in ...... However, I still get the following syntax error in mfaktc.cu:

E:\Programme\Microsoft Visual Studio 9.0\VC\bin>nvcc --compiler-options "/TP" E:\Programme\mfaktc\mfaktc-0.03\mfaktc.cu
mfaktc.cu
cl : Befehlszeile warning D9025 : "/TC" wird durch "/TP" überschrieben
tmpxft_00002ecc_00000000-3_mfaktc.cudafe1.gpu
C:\DOKUME~1\avatara\LOKALE~1\Temp/tmpxft_00002ecc_00000000-3_mfaktc.cudafe1.gpu(5): error: expected an identifier
C:\DOKUME~1\avatara\LOKALE~1\Temp/tmpxft_00002ecc_00000000-3_mfaktc.cudafe1.gpu(5): error: expected a ";"
e:\programme\cudatoolkit\include\cuda_texture_types.h(60): warning: parsing restarts here after previous syntax error
..
.. |
Hi,
[QUOTE=Uncwilly;202820]Since you called me out, I assumed that you were talking about a 100M digit. I didn't count the xxx's. I was quoting based upon 332,2xx,xxx.[/QUOTE] sorry, my fault. I've mixed up some stuff. I remembered that you've asked "can it handle <some size> exponents" before. Actually it was 100+M digits while I was did some 1G digits exponents in the past days. ----- Luigi: Let me first have atleast one code which is "known to work". Expanding it to a bigger range is not that difficult. I'll do it later (no timeline yet). When you can't resist and want to try yourself to implement bigger factors: wait at least for the next version. I expect some changes in the function mod_144_72(). I hope for some speedups there. :cool: Oliver |
Does the latest version support trial factoring of exponents less than 1M? If not, when will it?
When my PC is fixed I will try the latest version. |