There is a global hysteria about residue classes. Step back from residue classes; they eat up your brain.
To collect the easy fruits: [QUOTE=LaurV;504082]will they??? [...] when you include 13 in the product, you start sieving with 17, etc., saving a blink, but doing it 10k times more[/QUOTE] False. The total length of the interval you sieve will be smaller; you missed that point. Preferring small easy examples (numbers <= 1000): say we have to find the primes up to 1000. You don't care and sieve the whole interval, taking the whole interval=1000 for each sieving prime. But using the smallest primes 2 and 3, I will sieve on 2 intervals, 6*k+1 and 6*k+5, whose combined length is 167+166=333. A speedup by a factor of 3, not to mention that I haven't had to sieve by 2 and 3 at all; so I get a sieve that is 1/((1-1/2)*(1-1/3))=3 times faster. When you sieve on r=5, you need 1000/5=200 updates, but I only need 33+34=67 updates, which is roughly 1/3 of 200. And you get this speedup for every sieving prime r. Bingo. This is a very basic rule. [QUOTE=TheJudger;504047] This is very simple - there are no checkpoints within a residue class! Checkpoints are written between two residue classes by just remembering the last finished residue class. Runtimes are rather short and I don't want to overengineer this - a simple solution is a good solution. [/QUOTE] A simple solution is a slow solution. Yes, these are unorthodox sieve methods. I haven't spoken about a checkpoint within a residue class; in that sentence I wrote about the total time, when I've broken up the whole task into C=10000 tasks and made 10 checkpoints. [QUOTE=LaurV;504082]MfaktX works one class at a time.[/QUOTE] Thanks, I asked about that before. Say you have 3 GPU threads that need to sieve K=1..100, and 2 residue classes 6*k+2 and 6*k+3, with k=0..16 (so using m=6). Both of them contain 17 integers. We have 34 integers, and want to make, say, C=3 tasks for the 3 threads.
1st thread: [0,10] ---> sieve on 6*k+2 for k=0..10 #11 k numbers
2nd thread: [11,21] ---> sieve on 6*k+2 for k=11..16 and 6*k+3 for k=0..4 #11 k numbers
3rd thread: [22,33] ---> sieve on 6*k+3 for k=5..16 #12 k numbers

As you can see, in spite of the many threads we sieved on "very" long arithmetic progressions, and you can see the good facts: different threads working on the same residue, and the same thread doing different residues. Where we haven't switched residue classes we have almost equal interval lengths: 11, 12. And what you're doing: consider 6*k+2 and distribute its 17 integers into tiny intervals of 5, 6, 6 to the 3 threads, then do the 6*k+3 sequence in the same archaic way. OK, this is a small example, and a larger one has an even better property: you switch residue class only 11520 times, and the combined length of those switch boundaries is small, < 11520*32000. Refinement, more details: each GPU thread would sieve on a different interval of length L=32768 (or 32000); this needs 4 kbytes per thread, using primes say r<32000 (this bound can be anything, so it can be larger or smaller than L); you need to store the starting sieve position in each(!) thread for each prime r. For each prime you need floor(L/r) [or one more] sieve deletions; when you are done with the updates, the new sieve offset will be (location-L). Rarely (11520 times in total) you need to update the whole offset table [for the given thread(s)] when you switch to a new residue; in this rare case do the update(s) when you start the new residue's block of length L.
[QUOTE=Mark Rose;504073]Run times per class.
The longest I've seen a class run for is about a minute, on an ancient GT 520, at a high TF level.[/QUOTE] Thanks for clarifying. Much longer than a minute per class is possible. Granted, these examples are slow GPUs and high bit levels. Attempting to further factor some small exponents will also produce long class times.

On an RX550 in mfakto (still nearly a day to go now):
[CODE]Starting trial factoring M658000139 from 2^82 to 2^83 (1488.55GHz-days)
Using GPU kernel "cl_barrett32_87_gs_2"
Date   Time  | class   Pct |   time    ETA | GHz-d/day Sieve Wait
Dec 11 23:43 |     0  0.1% | 1464.7 16d06h |     91.47 81206 0.00%
Dec 12 00:08 |     9  0.2% | 1464.0 16d05h |     91.51 81206 0.00%[/CODE]
Some fairly ordinary TF assignments exceed a minute per class on Intel IGPs, which are of order 20 GHzD/day performance when not thermally throttled, and can drop to half that. Still a couple of days to run on a Quadro 4000 on the following, and the next bit levels will take about 2 and 4 times longer per class than the current ~28 minutes, so up to nearly two hours:
[CODE]got assignment: exp=990000029 bit_min=83 bit_max=86 (13851.05 GHz-days)
Starting trial factoring M990000029 from 2^83 to 2^84 (1978.72 GHz-days)
...found a valid checkpoint file!
...
Date   Time  | class   Pct |   time    ETA | GHz-d/day Sieve Wait
Dec 25 17:48 |  3684 79.9% | 1654.1  3d16h |    107.66 82485 n.a.%
Dec 25 18:16 |  3687 80.0% | 1660.1  3d16h |    107.28 82485 n.a.%[/CODE]
On a Quadro 2000, and not so high above the 100Mdigit exponent activity:
[CODE]got assignment: exp=366000023 bit_min=81 bit_max=82 (1338.07 GHz-days)
Starting trial factoring M366000023 from 2^81 to 2^82 (1338.07 GHz-days)
k_min = 3303075802303200
k_max = 6606151604611397
Using GPU kernel "barrett87_mul32_gs"
found a valid checkpoint file!
last finished class was: 4116
found 0 factor(s) already
Date   Time  | class   Pct |   time    ETA | GHz-d/day Sieve Wait
Jul 31 01:48 |  4117 89.1% | 2257.0  2d17h |     53.36 82485 n.a.%
Jul 31 02:27 |  4120 89.2% | 2347.2  2d19h |     51.31 82485 n.a.%[/CODE]
On an RX550 again, 55 minutes/class:
[CODE]got assignment: exp=580500007 bit_min=83 bit_max=84 (3374.56 GHz-days)
Starting trial factoring M580500007 from 2^83 to 2^84 (3374.56GHz-days)
Using GPU kernel "cl_barrett32_87_gs_2"
Found a valid checkpoint file.
last finished class was: 5
Date   Time  | class   Pct |   time    ETA | GHz-d/day Sieve Wait
Jun 05 09:06 |     9  0.3% | 3327.9 36d20h |     91.26 81206 0.00%
Jun 05 10:02 |    12  0.4% | 3327.9 36d19h |     91.26 81206 0.00%
Jun 05 10:57 |    20  0.5% | 3328.6 36d19h |     91.24 81206 0.00%[/CODE]
The high exponent / high bit level combinations come from taking scattered exponents throughout the mersenne.org range to full GPU to 72 TF depth, to qualify exponents for testing CUDAPm1 on them.
[QUOTE=R. Gerbicz;504006]If you have cnt residues (mod M) then you'll sieve NR=cnt*(k1-k0+1)=11520*(k1-k0+1) numbers in our case (if you're using the primes up to 13). With some third-grade knowledge from elementary school you can distribute these numbers into C classes in an almost equal way: the numbers in [floor(i*NR/C), floor((i+1)*NR/C)) will be in the i-th class/task for i=0..(C-1).
And this is independent of 11520 and of the number of GPU threads; you can use C=1, C=1000, C=1024, C=20480 (yes, smaller or larger than 11520). From the value of i you can get the residues and the intervals (for each residue) that you need to do. Yes, different threads could work on the same residue class, and the same thread could do multiple residues (of course not at once in this case)! About checkpoints: I don't know how you are doing checkpoint(s); say with 1024 GPU threads and with C=10000 you would lose only ~T/20 work (on average), where T is the total computation time, if you're distributing 1000 tasks ten times. (Note that here we are not using all threads, and even though 10000<11520 the above 'equal way' distribution works nicely.) For a much deeper task (N2~2^80 or so) use a larger C and/or more small sieve primes (17-19-23). [/QUOTE] [QUOTE=LaurV;504082] But you have to sieve 11k+ classes, instead of 960 classes. ... MfaktX works one class at a time. Take the class, sieve the class, exponentiate the class ... Think about, at the limit, you have millions of classes with a single candidate. Is the sieving faster, or slower, per total? ... think about the fact that when you split to 4620 classes[/QUOTE] Wait: in my post's example with 1024 GPU threads and C=10000 tasks, at each step 1000 tasks are distributed to 1000 threads (24 are "idle"), one task per thread; you do this 10 times, and with that you cover the k=[N1,N2] interval. So virtually in that run I had 10, yes ten, residue-class switches, hence 9 (intermediate) breakpoints. While you're still using 420 or 4620 classes, hence you can make 420/4620 breakpoints, I can't do that because I'm using a much smaller number of 'big' tasks. Then who is the winner? Think about C=1000: there would be no checkpoint; all k's are distributed in one step. I've written enough sieves; this should work on a CPU, though a GPU is a question. But realize that in a CPU(!) implementation running even (hypothetically) 1536 CPU threads on one fixed residue class, so on one interval with length in the range of a very few billion, you couldn't do (much) better than what mfaktX is currently doing with the small sieve primes up to 11. Look at my optimized gap12.c code: (maybe) M was fixed, but we sieved seq(i)=c(t)+i*M for different c(t) mod M, where c(t) is constant in the t-th thread. So it wasn't true that the union of these arithmetic sequences gives a single arithmetic sequence for some starting point with difference=M. That would be a way too restrictive approach; it would totally block my sieve on a CPU.
Hello Robert,
[QUOTE=R. Gerbicz;504111]False. The total length of the interval you sieve will be smaller; you missed that point. Preferring small easy examples (numbers <= 1000): say we have to find the primes up to 1000. You don't care and sieve the whole interval, taking the whole interval=1000 for each sieving prime. But using the smallest primes 2 and 3, I will sieve on 2 intervals, 6*k+1 and 6*k+5, whose combined length is 167+166=333.[/QUOTE] Unless I got something totally wrong, this is exactly what mfaktX does (96 out of 420 (4*3*5*7) or 960 out of 4620 (4*3*5*7*11) classes). Prime95 does this, too (IIRC 16 out of 60 (4*3*5)). The 4 comes from the fact that factors of M(p) must be +/-1 mod 8. [QUOTE=R. Gerbicz;504111]Say you have 3 GPU threads that need to sieve K=1..100, and 2 residue classes 6*k+2 and 6*k+3, with k=0..16 (so using m=6). Both of them contain 17 integers. We have 34 integers, and want to make, say, C=3 tasks for the 3 threads. 1st thread: [0,10] ---> sieve on 6*k+2 for k=0..10 #11 k numbers 2nd thread: [11,21] ---> sieve on 6*k+2 for k=11..16 and 6*k+3 for k=0..4 #11 k numbers 3rd thread: [22,33] ---> sieve on 6*k+3 for k=5..16 #12 k numbers[/QUOTE] The remaining ranges (classes) from above are so big that we have to work on them in small chunks - so there is no need to have different threads working on different classes. The chunks have to fit in some fast memory (read: GPU-internal shared memory, or fast L1/L2 cache on a CPU). I'm pretty sure you know this, but others may read it, too: for CPUs/GPUs (and most likely most other hardware types) it is not possible to read/write single bits. On current GPUs, clearing (or setting) a bit in memory means reading the whole word (32 bits for current nvidia GPUs), modifying the word, and writing the word back to memory - so a 32-bit read plus a 32-bit write to clear a single bit (when talking about sieving).
GPU memory is waaaaaay too slow for this (even though hundreds of gigabytes per second sounds impressive), and GPU-internal L2 cache is still way too slow (even at more than a terabyte per second on current high-end GPUs); sieving has to be done in L1 cache and/or shared memory. Oliver
[QUOTE=TheJudger;504213]
The remaining ranges (classes) from above are so big that we have to work on them in small chunks - so there is no need to have different threads working on different classes. [/QUOTE] Yes, your method works; the problem with it is that when all threads are working on the same class you shorten the interval excessively. I'm working on all of the interval [N1,N2]'s possible numbers (those q coprime to 3*5*7*11*13 with a good q mod 8): sieve on q=[N1,N2], and assume (as we generally have) that N2=2*N1. Then your interval's length is LEN1=N1/(8*p*1155), using a single class and the small sieve primes up to 11 (the 8 comes from the fact that q==1,7 mod 8). My playing field has length LEN2=N1/(4*p)*eulerphi(3*5*7*11*13)/(3*5*7*11*13) [we have 2 good residues for q mod 8, so that gives the 2/8=1/4 ratio]. So the ratio is LEN2/LEN1~886: I have a much longer interval, and this enables us to use more small sieve primes. Just for comparison, 13*17=221<886, suggesting that it is optimal to use 17 as well. I'm interested in the following: On a typical sieve, say p~2^26, q=[2^74,2^75], what is your sieve bound? Is it true that the GPU threads are sieving different intervals? If not, how do you distribute the sieving? What is the length of the interval that you sieve per thread? How much time do you spend on sieving, i.e. without the powmod tests mp%q? If you have no timing on this already, then just count the number of survivors (and print it after a class is done), so the number of tested q values, but don't do the powmod tests [you have to count or do something similarly fast; doing nothing is dangerous, since it could let the compiler eliminate all or parts of the code, giving a totally false feel for the speed].
For a simplified description: I'd sieve on a sieve interval of length = bound for primes = 2^15=32768 (call it L); for that you need 4 Kbytes for the sieve bits + 14048 bytes to store the starting points for each prime r, which is slightly less than 18 Kbytes of data per GPU thread. This is basically the same as what I'd do in CPU sieving; how is it, or isn't it, feasible on a GPU? Note that we can basically sieve with the same prime r on each thread at the same time; we need to do floor(L/r) [or one more] sieve-point deletions, then update the starting point for r. One more thing: if you count the number of survivors in each thread's interval with one scan, then with another scan you can build a large array containing the q values that survived the sieve [we use the count array to put each q value in the right place]. The advantage is that with this you can balance out the work: each thread will do floor(#S/1024) or ceil(#S/1024) powmod tests (assuming we have 1024 threads), where #S is the number of survivors. Of course for a somewhat large L you can expect a very similar number of survivors per thread; I don't know if you have done this or a similar thing. P.S. Maybe you need to store the r primes in each thread(?) Here sieve interval length = bound for primes = 2^15=32768 is just an example; you could choose bound for primes < sieve interval length (or the other inequality), and for primes < 65536 you could pack the starting point into 2 bytes, saving memory.
Hi Robert,
[QUOTE=R. Gerbicz;504225]Interested in the followings: On a typical sieve, say p~2^26, q=[2^74,2^75] what is your sieve bound?[/QUOTE] The default number of primes used for the GPU sieve is 82486 (primes up to 1055143). This is not a "nice round number" because the number of threads in flight isn't a round number either (see below). [QUOTE=R. Gerbicz;504225]Is it true, that the gpu threads are sieving different intervals? If not, how do you distribute the sieving? What is the length of the interval that you sieve per thread(s)?[/QUOTE] (Assuming you know the basics of how parallelism in CUDA works:) The default bitmap size is 64 Mibit (2[SUP]26[/SUP] bit). Each thread block (256 threads in my case) is working on a 64 Kibit bitmap (2[SUP]16[/SUP] bit) in fast shared memory (GPU-internal). Within each thread block[LIST][*]for primes up to 511: each of the 256 threads within a thread block sieves all primes up to 511 within a 32 byte (256 bit) segment of the 64 Kibit shared memory (no memory conflicts)[*]for the next few primes we use 8 threads for each prime, and each thread accesses 8 Kibit of memory -> still 256 threads in flight, so 256/8 = 32 threads are sieving 32 different primes within the same 8 Kibit of memory - we ignore shared memory conflicts here; the worst that could happen is a lost update, so a composite q is not cleared. Using atomics for memory access is way slower.[*]for the remaining primes each thread takes one prime and sieves the complete 64 Kibit shared memory segment - again ignoring memory conflicts.[/LIST] [QUOTE=R. Gerbicz;504225]How much time do you spend on sieving, so without the powmod tests mp%q ?
If you have no timing on this already, then for this say count only the number of survivors (and then print out after a class done), so the number of tested q numbers, but don't do the powmod tests.[/QUOTE] I didn't do any measurements on current cards, but when George implemented the GPU sieve code I did some comparisons to the old CPU sieve code (the GPU did just the powmod stuff) - the GPU sieve lowered total throughput by just a few percent while leaving the CPU idle. For current high-end GPUs the CPU sieve code is way too slow. [QUOTE=R. Gerbicz;504225]For a simplified description: I'd sieve on a sieve interval length=bound for primes=2^15=32768 (call it L), for that you need 4 Kbytes for sievebits+14048 bytes to store the starting points for each r prime, and this is slightly less than 18 Kbytes data per gpu thread. This is basically the same what I'd do in a cpu sieving, how is it/isn't feasible in gpu?[/QUOTE] The main constraint is many threads vs. a limited amount of fast memory. (Threads: don't count just the cores; you need multiple threads per core for maximum throughput on nvidia GPUs, at least 4-8 threads/core in a compute-bound situation.) There are a lot of "where to start sieving" computations, but this seems to be the best choice (balanced work, SIMD, lots of threads). [QUOTE=R. Gerbicz;504225]One more thing: if you count the number of survivors in each thread's interval with one scan, then with another scan you can make a large interval containing the q value's survived the sieve [we use the count array to put the q value to the right place]. The advantage is that with this you can balance out the work: each thread will do floor(#S/1024) or ceil(#S/1024) powmod tests (assuming we have 1024 threads), where #S is the number of survivors.
Ofcourse for a somewhat large L you can expect very similar number of survivors per thread, don't know if you have done this or similar thing. [/QUOTE] After sieving is done there is a parallel count of survivors (each thread again counts the survivors in a small segment). Then each thread needs to know the number of survivors in the segments before its own segment (this can be done in log[SUB]2[/SUB]<number of threads> steps). Now each thread writes its survivors into an array (generating q's from bits) - no gaps, just a linear array of survivors - and this can easily be balanced between a million threads doing the powmod stuff. [QUOTE=R. Gerbicz;504225]ps. maybe you need to store the r primes in each thread(?) Here sieve interval length=bound for primes=2^15=32768 is just an example, you could choose bound for primes<sieve interval length (or the other inequality), and for primes<65536 you could pack the starting point in 2 bytes, saving memory. [/QUOTE] Current GPUs have ~5k cores; with 8 threads per core we're talking about 40k threads at least. With the need to balance work between threads I see no option to store per-thread starting points. 40k threads -> 40k primes in flight for the sieve; the lowest prime is e.g. 13, 17 or 19 (it doesn't really matter unless we get much bigger), while the 40000th prime is somewhere near 480k. So 480k / 19 = ~25k times the number of bits to clear - [I]"slightly"[/I] unbalanced. Oliver P.S. "Threads" are [U]very cheap[/U] in CUDA - don't compare them to threads on your CPU and your favorite OS.
[QUOTE=TheJudger;504233]I didn't do any measurements on current cards but when George implemented the GPU sieve code I did some comparisons to the old CPU sieve code (GPU did just the powmod stuff) - the GPU sieve lowered total throughput by just a few percent while leaving the CPU idle. For current highend GPUs the CPU sieve code is way too slow.[/QUOTE]
From the point of view of a cloud user: CPU instances are very cheap, while GPU instances are expensive. A Tesla V100 is about 60 times more expensive per hour than one core of a Skylake Xeon. I'm certainly ready to believe that the Tesla will sieve more than sixty times faster than a single Skylake core, but if not, is there some hybrid solution possible where factoring is split between two different programs on different machines? Say, one or more CPUs does all the sieving in advance for a list of exponents, slowly but more cheaply, and then the result of that, some kind of preprocessed data file, is sent as a batch job to a different instance that has a GPU? Internal bandwidth is free within the same AWS region and availability zone, so the size of this preprocessed data file shouldn't matter.
[QUOTE=GP2;504240]From the point of view of a cloud user: CPU instances are very cheap, while GPU instances are expensive. A Tesla V100 is about 60 times more expensive per hour than one core of a Skylake Xeon.
I'm certainly ready to believe that the Tesla will sieve more than sixty times faster than a single Skylake core, but if not, is there some hybrid solution possible where factoring is split between two different programs on different machines? Say, one or more CPUs does all the sieving in advance for a list of exponents, slowly but more cheaply, and then the result of that, some kind of preprocessed data file, is sent as a batch job to a different instance that has a GPU? Internal bandwidth is free within the same AWS region and availability zone, so the size of this preprocessed data file shouldn't matter.[/QUOTE] Sieving is only a small fraction of GPU time and I *guess* we're talking about 10+ GB/sec for a V100 with the current implementation of the CPU sieve...
[QUOTE=TheJudger;504233]Each thread block (256 threads in my case) is working on a 64 kibibyte bitmap (2[SUP]16[/SUP] bit) using fast shared memory (GPU internal).[/quote]
[quote] Current GPUs have ~5k cores, with 8 threads per core we're talking about 40k threads at least.[/quote] Does this mean that during sieving most of the cores on a high-end device are idle? Or is mfaktc sieving for the next class while trial factoring the current class?
[QUOTE=Mark Rose;504244]Does this mean that during sieving most of the cores on a high end device are idle? Or is mfaktc sieving the for the next class while trial factoring the current class?[/QUOTE]
No and no. There are 256 threads in each thread block - and of course there are multiple thread blocks running at the same time.
[QUOTE=TheJudger;504245]No and no. There are 256 threads in each thread block - and ofc there are multiple thread blocks running at the same time.[/QUOTE]
I missed that. Thanks!