mersenneforum.org mfaktc: a CUDA program for Mersenne prefactoring

 2010-01-03, 09:43 #34 ET_ Banned     "Luigi" Aug 2002 Team Italia 2⁵×149 Posts You can keep the 16 segments used for sieving (one per residue class) in shared memory and synchronize threads only at the end of the whole task. Updating a segment takes only nanoseconds once you locally store the precomputed gaps for each prime. Luigi
 2010-01-04, 19:54 #35 TheJudger     "Oliver" Mar 2005 Germany 3³×41 Posts Hi,

another performance update. Using more classes helps to speed up the sieve, since it removes more candidates with small factors from the sieve entirely, at the cost of additional initializations.

Remove classes which are
- 3 or 5 mod 8
- multiples of 3 or 5
4 * 3 * 5 = 60 classes, 44 classes cleared, 16 classes remaining (26.7%)
Removing multiples of 7, too:
4 * 3 * 5 * 7 = 420 classes, 324 classes cleared, 96 classes remaining (22.9%)
Removing multiples of 11, too:
4 * 3 * 5 * 7 * 11 = 4620 classes, 3660 classes cleared, 960 classes remaining (20.8%)

-----

One process on the CPU, sieving the first 4000 odd primes:
M66362159 TF from 2^64 to 2^65: 160.7s / 185.5s
M66362159 TF from 2^65 to 2^66: 318.8s / 323.5s
M66362159 TF from 2^66 to 2^67: 631.6s / 620.7s
First time: using 96 classes
Second time: using 960 classes
The range is a bit too small for proper timings when using 960 classes. The reason is that I always fill up the candidate list, whether or not it is above the range, and compared to 96 classes this needs 10x more initializations.

-----

M66362159 TF from 2^64 to 2^65 using 96 classes: sieving first 4000/25000 odd primes
1 process: 160.7s / 167.5s
2 processes: 285.8s / 238.1s (both processes doing TF from 64 to 65 bit)

Throughput for M66362159 TF from 2^64 to 2^65:
Time estimate for my 4GHz C2D (based on ATH's timings): 1160s
(Since Prime95 TF fits nicely into the L1 cache of the CPU, I've scaled the time directly by the clock rates.)
How often can I do the TF in one hour?
One process for CUDA and one for Prime95 (assuming they don't slow each other down):
3600s / 160.7s + 3600s / 1160s = 22.4 + 3.1 = 25.5
Two processes for CUDA and none for Prime95:
2 * 3600s / 238.1s = 30.2
So when aiming for TF throughput on my particular system it is clearly beneficial to dedicate 2 CPU cores to my CUDA code.
2010-01-04, 21:13   #36
ldesnogu

Jan 2008
France

1000010000₂ Posts

Quote:
 Originally Posted by TheJudger Using more classes helps to speed up the sieve (since it removes more candidates with small factors from the sieve entirely, at the cost of additional initializations). Remove classes which are
- 3 or 5 mod 8
- multiples of 3 or 5
4 * 3 * 5 = 60 classes, 44 classes cleared, 16 classes remaining (26.7%)
Removing multiples of 7, too:
4 * 3 * 5 * 7 = 420 classes, 324 classes cleared, 96 classes remaining (22.9%)
Removing multiples of 11, too:
4 * 3 * 5 * 7 * 11 = 4620 classes, 3660 classes cleared, 960 classes remaining (20.8%)
Do you rely only on initialization to remove these? If so, you should perhaps consider removing them from the sieve itself; that would help you fit wider ranges in your L1 D-cache, at the expense of code size, but I don't think that's critical.

 2010-01-04, 22:45 #37 TheJudger     "Oliver" Mar 2005 Germany 3³·41 Posts Hi ldesnogu,

I'm not 100% sure I got the point of your post. The smallest primes (3, 5, 7, 11) are removed from the sieve entirely. I'll try to explain how it works. Let us ignore, for simplicity, the fact that we can remove factors which are 3 or 5 mod 8 (this would just multiply the number of classes by four). For example, take M23 (p=23) and build only 3 classes (remove all multiples of 3). The factor candidates (FCs) have the form 2kp+1.
Code:
class 0: k={ 3,  6,  9, 12, 15, 18, 21, 24, 27, 30, 33, ...}
class 1: k={ 1,  4,  7, 10, 13, 16, 19, 22, 25, 28, 31, 34, ...}
class 2: k={ 2,  5,  8, 11, 14, 17, 20, 23, 26, 29, 32, 35, ...}
These lists are shown here only for illustration; they are not generated in the code.
Check class 0 (k = 0 mod 3): k=3: 2kp+1 mod 3 = 1
Check class 1 (k = 1 mod 3): k=1: 2kp+1 mod 3 = 2
Check class 2 (k = 2 mod 3): k=2: 2kp+1 mod 3 = 0
So we can ignore class 2 entirely: ALL candidates in this class are multiples of 3!
Code:
class 0: k={ 3,  6,  9, 12, 15, 18, 21, 24, 27, 30, 33, ...}
class 1: k={ 1,  4,  7, 10, 13, 16, 19, 22, 25, 28, 31, 34, ...}
class 2: cleared
So we now have no multiples of 3 (among the FCs). The next small prime is 5. For each class, find the smallest k where 2kp+1 mod 5 = 0. These are k=9 for class 0 and k=4 for class 1. Starting at k=9 / k=4 in these lists, mark every 5th k as cleared:
Code:
class 0: k={ 3,  6, *9, 12, 15, 18, 21,*24, 27, 30, 33, ...}
class 1: k={ 1, *4,  7, 10, 13, 16,*19, 22, 25, 28, 31,*34, ...}
class 2: cleared
Repeat the sieving for the small primes 7, 11, 13, ... up to a certain bound. The classes are independent of each other, so we only need to handle a single class at a time. There is no need to try the FCs in ascending order.
My actual implementation uses one of
- 4 * 3 * 5 * 7 = 420 classes, sieve size = N * 11 * 13 * 17 * 19 bits = N * 46189 bits
- 4 * 3 * 5 * 7 * 11 = 4620 classes, sieve size = N * 13 * 17 * 19 * 23 bits = N * 96577 bits
Each k is represented as a single bit. These sieve sizes allow for preinitialized sieves in which the small primes up to 19/23 have already been sieved.
 2010-01-05, 09:47 #38 ldesnogu     Jan 2008 France 1020₈ Posts Oh OK, I mistakenly thought you removed the smaller primes from your sieve only via initialization, while in fact your initialization removes some larger ones as well. In case you don't know, this method is called sieving by wheel.
2010-01-05, 19:06   #39
TheJudger

"Oliver"
Mar 2005
Germany

3³·41 Posts

Hi,

Quote:
 Originally Posted by ldesnogu In case you don't know, this method is called sieving by wheel.
I didn't know that this has a specific name. I came up with the idea myself some years ago (for another project, of course). In my original code I called them "groups", not "classes". ;)

 2010-01-05, 19:40 #40 ckdo     Dec 2007 Cleves, Germany 23² Posts In other words, you reinvented the wheel.
 2010-01-05, 19:45 #41 TheJudger     "Oliver" Mar 2005 Germany 3³·41 Posts Nice one, ckdo. :)
2010-01-10, 18:20   #42
TheJudger

"Oliver"
Mar 2005
Germany

3³×41 Posts

Hello,

for those who want to try: find attached the code :)
At the current state I recommend this only to people who are familiar with CUDA.
The code is NOT ready for everyday usage: you will be able to submit new factors to the Primenet server, but you can't submit "no factor from 2^X to 2^Y" results.

Compared to my postings from 4 January there are some speed improvements as well (maybe 10%).

When you start, try to run the selftest first (see params.h and enable it).
The selftest contains ~40 exponents which have known factors between 2^68 and 2^71. To speed up the selftest you can define MORE_CLASSES in params.h. The selftest only checks the class in which the known factor falls, so it is a lot faster than doing the full range.

Try to compile with 'compile_replace_umul.hi_umul24.hi.sh'; this gives a speedup for free compared to 'compile.sh' (see posts #21/#26 of this thread).

Once you've created a binary, just run it without any parameters and it will show you how to run it. (You need to enter a valid set of parameters even if you only want to run the selftest...)

I recommend running it on Mersenne numbers with known factors and checking whether my code finds the factor(s) as well. Please report the results of your runs.

It should work on (almost) all CUDA-capable devices.
I'm not 100% sure about G80 (GeForce 8800 GTS 320/640, 8800 GTX, 8800 Ultra and their Quadro/Tesla variants). You may need to disable "USE_PINNED_MEMORY" and "USE_ASYNC_COPY" on these devices (params.h).

The selftest runs fine on an old 8600GTS :)

Documentation is almost entirely missing. :(
The user interface isn't very comfortable, sorry for that ;)

Oliver
Attached Files
 mfaktc.tar.gz (24.4 KB, 323 views)

2010-01-10, 18:49   #43
henryzz
Just call me Henry

"David"
Sep 2007
Cambridge (GMT/BST)

1011001100001₂ Posts

Quote:
 Originally Posted by TheJudger The selftest runs fine on an old 8600GTS :)
At least I know it works on my card.
How fast is it on the 8600GTS? In my experience it is very slow compared with modern cards like the GTX275, although plenty good enough for the gaming I do.
Is it worth me getting it working? I have never compiled anything, as I have only ever used msievegpu. Would there be any chance of binaries for it?

 2010-01-10, 19:09 #44 TheJudger     "Oliver" Mar 2005 Germany 453₁₆ Posts Hi henryzz,

for factors between 2^64 and 2^71 it is about twice as fast as a single core of ATH's Core 2 Quad. I like the card: since it is relatively slow, it's easy to spot differences in the runtime of the GPU code, and the CPU never limits the throughput. The raw speed of the GPU code can be easily estimated since it scales almost linearly with the GPU's GFLOPS. I have tested this on
- 8400GS (43.2 GFLOPS / 2.3M candidates tested per second)
- 8600GT (113 GFLOPS / 6.1M candidates tested per second)
- 8800GTX (518 GFLOPS / ~28M candidates tested per second)
- GTX 275 (1011 GFLOPS / ~54M candidates tested per second)
I think it is not the right time for precompiled binaries; there are too many compile-time options in the code right now. I forgot to mention: I have only run it on Linux so far (openSUSE 11.1 x86-64). If you still want a binary I can create one on my system. Let me know which CPU you have and I'll pick some settings then. You need to install the CUDA software as well. Last fiddled with by TheJudger on 2010-01-10 at 20:01

