[QUOTE=R. Gerbicz;504006]Interesting discussion, there are lots of misunderstandings here.[/QUOTE]
Likely! [QUOTE=R. Gerbicz;504006]So at once you are sieving only one residue per gpu thread? In this case you would be too restrictive, I'll return to this.[/QUOTE] All threads (the whole GPU) are working on a single residue class. [QUOTE=R. Gerbicz;504006]Say you want a sieve for q=2*k*p+1=N1..N2 in range (say p~2^26 in the current wavefront, and N1=2^74, N2=2^75 a typical(?) input), using small primes in the sieving (and also the knowledge that q mod 8=1,7 from quadratic reciprocity): q=res+k*M for k=k0..k1, and here k0,k1 are independent of res (this k is not the same as above). M=8*p*3*5*7*11*13, where k0=floor(N1/M) and k1=floor(N2/M). With this you will sieve at most two more numbers per residue class (so at most one number<N1 and at most one >N2 per res). Note that res==1 mod (2*p), because q==1 mod (2*p). If you have cnt number of residues (mod M) then you'll sieve NR=cnt*(k1-k0+1)=11520*(k1-k0+1) numbers in our case (if you're using the primes up to 13). With some 3rd-grade knowledge from elementary school you can distribute these numbers into C classes in an almost equal way: the numbers in [floor(i*NR/C), floor((i+1)*NR/C)) will be in the i-th class/task for i=0..(C-1). And this is independent of 11520 and the number of GPU threads; you can use C=1, C=1000, C=1024, C=20480 (yes, smaller/larger than 11520). From the value of i you can get the residues and intervals (for each residue) that you need to do. Yes, different threads could work on the same residue class, and the same thread could do multiple residues (of course not at once in this case)![/QUOTE] Talking about the "too much work at the end of the residue class"? Keep in mind that GPUs (especially older GPUs) don't like divergent thread branches - it is better to keep all threads on the same code path. You're right, calculating the limits is very easy, but you can't do a simple for loop to work on the factor candidates. Parallelism is SIMD. [QUOTE=R. 
Gerbicz;504006]About checkpoints: Don't know how you are doing checkpoint(s); say with 1024 gpu threads and with C=10000 you would lose only ~T/20 work (on average), where T is the total computation time, if you're distributing 1000 work units ten times. (Note that here we are not using all threads, and even though 10000<11520 the above 'equal way' distribution works nicely.) For a much deeper task (N2~2^80 or so) use a larger C and/or more small sieve primes (17-19-23).[/QUOTE] This is very simple - there are no checkpoints within a residue class! Checkpoints are written between two residue classes by just remembering the last finished residue class. Runtimes are rather short and I don't want to overengineer this - a simple solution is a good solution. [QUOTE=R. Gerbicz;504006] Using 64 bits? Where in the sieving, when you are using sieve primes up to say 32000? For the sieving initialization in a cpu case I would precompute and store ((2*p) mod R) and (M mod R) for R<sievingprimes=32000 or whatever you are using. But maybe it isn't that needed for gpu; you have to do that when you start/switch to a new residue, so basically you don't need to store/precompute it.[/QUOTE] At some point you have to "convert" your bits in the sieve back to the actual factor candidates. Sieving is done over the k's in FC = 2kp+1. Oliver |
[QUOTE=TheJudger;504047]
Runtimes are rather short Oliver[/QUOTE] Meaning run times per class are short? Run times per TF bit level can be several hours. [URL]http://www.mersenneforum.org/showpost.php?p=488519&postcount=2[/URL] |
[QUOTE=kriesel;504054]Meaning run times per class are short?
Run times per TF bit level can be several hours. [URL]http://www.mersenneforum.org/showpost.php?p=488519&postcount=2[/URL][/QUOTE] Run times per class. The longest I've seen a class run is about a minute, on an ancient GT 520, at a high TF level. |
[QUOTE=TheJudger;503945]You can compare the two versions (420 vs. 4620 classes) and give an estimate when the next prime should be included in the number of residue classes.[/QUOTE]
We did this in the past, during the development of mfaktc, and after, and not only once; there are endless discussions around here. You forget the speed decrease for sieving (a lot more classes means smaller classes, i.e. less efficient sieving). Actually, the best point (top of the parabola) seemed to be somewhere between 420 and 4620; as you go higher things get slower, as well as if you go lower. As we progress towards higher exponents, this point is (extremely) slowly shifting [U]towards[/U] 420. This is not a joke... For example, at 420 classes you have 96 remaining to sieve/exponentiate, and for 4620 you have 960. The rest are eliminated by modularity (that is why we use 2[U]^2[/U]*p1*p2*... instead of 2*p1*p2*..., where px are odd primes: to completely eliminate classes which are 3 or 5 mod 8). The 420-class split is good when there are not many candidates in each class (either a large exponent, or a low bitlevel, or both), [U]especially[/U] because it makes each class larger, giving a faster sieve. When you sieve, you take each prime, do some modular calculus with it, then clear a lot of bits in the matrix (i.e. eliminate candidates in the class). A smaller class means you do more modular calculus and less elimination. As the classes get larger (higher bitlevels, and lower exponents), the candidates in each class become too numerous to be efficiently "tabulated" in the sieving matrices in RAM, and a split into more classes makes sense, but sieving becomes slower. This is compensated by the fact that proportionally fewer candidates need to be sieved and exponentiated: 960/4620=[B]20.78%[/B] of the candidates, instead of 96/420=[B]22.86%[/B]. If we add 13 to the product, then you will have 60060 classes, of which 11520 remain after modular elimination - [B]19.18%[/B] of the candidates need to be sieved and exponentiated - but now they are divided into a lot more classes, there is a lot more modular calculus to be done, and the sieving is less efficient. 
There is a lot of overhead, as Oliver said, to handle all these 11k classes. It is not worth it. In fact, as we move to higher exponents, the "ideal", "optimum" "top of the parabola" shifts imperceptibly [U]towards 420[/U] classes, until some trick is implemented ("invented") to somehow sieve all the 11k+ classes together... With the actual sieving method, the 60060 classes would only be worthwhile for taking very low exponents to high bitlevels (like factoring under-1M or under-100k exponents to 66+ bits or so), and it would make happy the few hobbyists here who don't believe that low exponents have had so much ECM done that they may not have a factor under 100 bits :razz: But that is a different discussion, and it is in itself a different challenge, due to the fact that sieving for such low exponents can inadvertently eliminate factors (where the exponent is comparable in size with the sieving primes). |
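The 96/420, 960/4620 and 11520/60060 counts above can be reproduced with a few lines of brute force (a sketch in Python, not mfaktX code; the exponent 67108879 is just an arbitrary wavefront-sized odd number coprime to the sieve primes): a class k mod M survives only if q = 2kp+1 is 1 or 7 mod 8 and divisible by none of the small odd primes.

```python
from functools import reduce

def surviving_classes(p, odd_primes):
    """Count classes k mod M (M = 4 * product of odd_primes) in which
    q = 2*k*p + 1 can still be a prime factor of 2^p - 1:
    q must be 1 or 7 (mod 8) and divisible by none of the odd primes."""
    M = 4 * reduce(lambda a, b: a * b, odd_primes)
    cnt = sum(1 for k in range(M)
              if (2 * k * p + 1) % 8 in (1, 7)
              and all((2 * k * p + 1) % r for r in odd_primes))
    return cnt, M

p = 67108879  # hypothetical wavefront-sized odd exponent, coprime to the sieve primes
for odd in ([3, 5, 7], [3, 5, 7, 11], [3, 5, 7, 11, 13]):
    cnt, M = surviving_classes(p, odd)
    print(cnt, "of", M, "=", round(100 * cnt / M, 2), "%")
# -> 96 of 420 = 22.86 %
#    960 of 4620 = 20.78 %
#    11520 of 60060 = 19.18 %
```

The counts come out the same for any odd p coprime to the sieve primes: each odd prime r kills exactly one residue of k mod r, and the q mod 8 condition keeps 2 of the 4 residues of k mod 4.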
Another misconception is the belief that you have thousands of cores, and they do thousands of things in parallel. You don't. They are not "true" cores like those of a CPU. They cannot do different tasks at the same time. Or well, they can, but it costs time.
They usually all do the same task at the same time, but applied to different data. That is why it is called SIMD: single instruction, multiple data. They are very fast if you have (for example) a screen with 4 million pixels and want to make all of them greenish because it is Christmas (i.e. add some constant to all components of a vector) - that is what they are designed for (graphics cards, remember?): each core takes a pixel and paints it green, and they all do it in parallel, at the same time. But if you have 10000 pixels to paint, and 3000 cores in your GPU, then they will do the change in 4 "ticks": 3 times each core will change a pixel, for a total of 9000, and in the last "tick" only 1000 cores will change a pixel; the other 2000 cores will wait, or rest, because there is no pixel for them to change. You can reconfigure them to do other things in that time, but it comes with a cost for the other 1000 which have work to do, because they still share resources (memory, pipelines, etc.). Alternatively, you could arrange from the start that only 2500 cores turn pixels green, and the other 500 do other things. You still need 4 steps, and you can do other things meanwhile, but you may get a better or a worse total time. This is endless tweaking, a lot of work invested by the programmers, etc. Profound respect for the guys who do that kind of software! How about having only 2000 cores paint pixels green, and 1000 do other tasks? This will require 5 "ticks", but maybe the other task that gets done is more important (and would by itself require more ticks, which are now saved)? Remember the story with the doughnuts? You have a frying pan in which you can fry two of them, on one side, which takes a minute. How long will it take to fry 3 doughnuts on both sides? (If you said 4 minutes, don't start writing code for GPUs :razz:) |
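The "ticks" arithmetic in the pixel example is just a ceiling division; a toy cost model (my own illustration, nothing GPU-specific):

```python
from math import ceil

def ticks(n_items, n_cores):
    """SIMD cost model: per tick every core processes at most one item,
    so the time is ceil(items/cores) whether or not the last tick is full."""
    return ceil(n_items / n_cores)

print(ticks(10_000, 3000))  # -> 4: three full ticks, then 1000 busy / 2000 idle cores
print(ticks(10_000, 2500))  # -> 4 as well: freeing 500 cores costs nothing here
print(ticks(10_000, 2000))  # -> 5: one extra tick buys 1000 free cores throughout
```

This is why carving off a few cores for other work is sometimes free (3000 vs. 2500 cores both need 4 ticks) and sometimes not (2000 cores need 5).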
[QUOTE=R. Gerbicz;504006]Almost, in theory you get that speedup, but only in the sieve. [/QUOTE]
Actually, no. See my post. Sieving each class is much faster, indeed - ten times faster, or more. But you have to sieve 11k+ classes, instead of 960 classes. In total, sieving is slower. MfaktX works [U]one class[/U] at a time. Take the class, sieve the class, exponentiate the class, etc., like [URL="https://www.youtube.com/watch?v=m842HLSOprA"]Pierre Richard with the plates[/URL] (take the plate, turn the plate, wash the plate, rinse the plate... you know the drill). If a factor is found, stop (or continue, if the user says so in the init file). This helps when you sieve (all candidates in the class "at the same time", see above) and exponentiate (all remaining candidates "at the same time"). Otherwise you can't get the speed. Think about it: at the limit, you have millions of classes with a single candidate each. Is the sieving faster, or slower, in total? Of course, with a larger split fewer candidates will survive and the exponentiation may be faster (will it??? consider that when you split into 4620 classes, sieving with 13, 17, 19, 23, etc. will immediately eliminate the candidates that would have been eliminated by modularity in a larger split - this takes only a blink of time, and you do it once for each of the 960 classes; when you include 13 in the product, you start sieving with 17, etc., saving a blink, but doing it 10k times more). So, as I said, unless a new method is found to sieve all classes at the same time or so, or sieve them in bunches, whatever - splitting into more classes will be slower. |
An idea occurred to me:
Is it possible to add some "doublecheck" feature to mfaktc? For example, for each class we could record the trial factor p which minimizes abs(n-p*round(n/p)). After the run we would have 4620 (or 420) such p values, and we could use an operation like XOR to merge some of them together. For example, we could first convert the "more classes" result into the "fewer classes" form, then use XOR (or, again, minimizing abs(n-p*round(n/p))) to merge the 420 "fewer class" values into, say, 15 "megaclasses". Finally, for each "megaclass", we could use an algorithm like CRC-16 (or just the last 16 bits) to compress it into 16 bits, and then compress all of the 16*15=240 bits into 40 base-64 characters. This would be helpful for double-checking: a double-checker would not need to recalculate all 15 "megaclasses" - to verify whether a submission is a valid result, verifying one megaclass is enough. So this method could provide a roughly 15x faster double-check, which might be helpful for further double-checking. IMO a double-checking code could prevent false "no factor" results; at the very least it would make it harder to submit a hand-written result to earn illegitimate GHz-days. |
There is a global hysteria about residue classes. Step away from residue classes; they eat up your brain.
To collect the easy fruit: [QUOTE=LaurV;504082]will they??? [...] when you include 13 in the product, you start sieving with 17, etc., saving a blink, but doing it 10k times more[/QUOTE] False. The total length of the interval that you sieve will be smaller; you missed that point. Preferring small easy examples (numbers<=1000): say we have to find the primes up to 1000. You don't care and sieve the whole interval, taking the whole interval=1000 for each sieving prime. But using the smallest primes 2,3, I will sieve on 2 intervals, 6*k+1 and 6*k+5; their combined length is 167+166=333. A speedup by a factor of 3, not to mention that I haven't sieved by 2,3; so I got the 1/((1-1/2)*(1-1/3))=3 times faster sieve. When you sieve with r=5, you need 1000/5=200 updates, but I only need 33+34=67 updates etc., which is roughly 1/3 of 200. And you get this speedup for every sieving prime r. Bingo. This is a very basic rule. [QUOTE=TheJudger;504047] This is very simple - there are no checkpoints within a residue class! Checkpoints are written between two residue classes by just remembering the last finished residue class. Runtimes are rather short and I don't want to overengineer this - a simple solution is a good solution. [/QUOTE] A simple solution is a slow solution. Yes, these are non-orthodox sieve methods. I wasn't talking about breakpoints within a residue class; in that sentence I wrote about the total time, when I've broken up the whole task into C=10000 tasks and made 10 checkpoints. [QUOTE=LaurV;504082]MfaktX works one class at a time.[/QUOTE] Thanks, I asked about that before. Say you have 3 gpu threads that need to sieve K=1..100, with 2 remainder classes 6*k+2 and 6*k+3, and k=0..16 (so using m=6). Both of them contain 17 integers. We have 34 integers, and want to make, say, C=3 tasks for the 3 threads. 
1st thread: [0,10] ---> sieve on 6*k+2 for k=0..10 (11 k values)
2nd thread: [11,21] ---> sieve on 6*k+2 for k=11..16 and 6*k+3 for k=0..4 (11 k values)
3rd thread: [22,33] ---> sieve on 6*k+3 for k=5..16 (12 k values)
As you can see, in spite of the many threads we sieved on "very" long arithmetic progressions, and you can see the good properties: different threads working on the same residue, and the same thread doing different residues. Where we haven't switched residue classes we have almost equal interval lengths: 11, 12. And what you're doing: consider 6*k+2 and distribute the 17 integers into tiny intervals of 5,6,6 to the 3 threads, then do the 6*k+3 sequence in the same archaic way. OK, this is a small example, and a larger one has even better properties: you switch residue class only 11520 times, and the combined length of those switch regions is small, < 11520*32000. Refinement, more details: one gpu thread would sieve on an interval of L=32768 (or 32000); this needs 4 kbytes per thread, using primes say r<32000 (this can be anything, so it can be larger/smaller than L); you need to store the start sieving point in each(!) thread for each prime r. For each prime you need floor(L/r) [or one more] sieving deletions; when you are done with the updates, the new sieving offset will be (location-L). Rarely (11520 times) you need to update the whole offset table [for the given thread(s)] when you switch to a new residue; in this rare case do the update(s) when you start the new residue's block of L. |
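The [floor(i*NR/C), floor((i+1)*NR/C)) split and the decode from a global index back to (residue, k) can be sketched in a few lines (names are mine; the toy numbers are the ones from the post above):

```python
def task_ranges(NR, C):
    """Split NR consecutive work items into C almost-equal tasks:
    task i gets global indices [floor(i*NR/C), floor((i+1)*NR/C))."""
    return [(i * NR // C, (i + 1) * NR // C) for i in range(C)]

def decode(idx, residues, k0, k1):
    """Map a global index to (residue, k): items are ordered residue-major,
    with k1-k0+1 values of k per residue (a sketch of the layout above)."""
    per_res = k1 - k0 + 1
    return residues[idx // per_res], k0 + idx % per_res

# The toy example from the post: residues 2 and 3 mod 6, k = 0..16, 3 threads.
residues, k0, k1 = [2, 3], 0, 16
NR = len(residues) * (k1 - k0 + 1)   # 34 items in total
for i, (lo, hi) in enumerate(task_ranges(NR, 3)):
    first, last = decode(lo, residues, k0, k1), decode(hi - 1, residues, k0, k1)
    print(f"thread {i}: items [{lo},{hi-1}] -> from {first} to {last}")
# -> thread 0: items [0,10] -> from (2, 0) to (2, 10)
#    thread 1: items [11,21] -> from (2, 11) to (3, 4)
#    thread 2: items [22,33] -> from (3, 5) to (3, 16)
```

The task sizes come out 11, 11, 12, matching the post, and thread 1 indeed straddles both residue classes.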
[QUOTE=Mark Rose;504073]Run times per class.
The longest I've seen a class run for is about minute, on an ancient GT 520, at a high TF level.[/QUOTE] Thanks for clarifying. Much longer than a minute per class is possible. Granted, these examples are slow gpus and high bit levels. Attempting to further factor some small exponents will also produce long class times. On an RX550 in mfakto (still nearly a day to go now): [CODE]Starting trial factoring M658000139 from 2^82 to 2^83 (1488.55GHz-days) Using GPU kernel "cl_barrett32_87_gs_2" Date Time | class Pct | time ETA | GHz-d/day Sieve Wait Dec 11 23:43 | 0 0.1% | 1464.7 16d06h | 91.47 81206 0.00% Dec 12 00:08 | 9 0.2% | 1464.0 16d05h | 91.51 81206 0.00%[/CODE]Some fairly ordinary TF assignments exceed a minute per class on Intel IGPs, which are of order 20GhzD/day performance when not thermally throttled, and can drop to half that. Still a couple days to run on a Quadro 4000 on the following, and the following bit levels will take about 2 and 4 times longer per class than the current ~28 minutes, so up to nearly two hours: [CODE]got assignment: exp=990000029 bit_min=83 bit_max=86 (13851.05 GHz-days) Starting trial factoring M990000029 from 2^83 to 2^84 (1978.72 GHz-days) ...found a valid checkpoint file! ... Date Time | class Pct | time ETA | GHz-d/day Sieve Wait Dec 25 17:48 | 3684 79.9% | 1654.1 3d16h | 107.66 82485 n.a.% Dec 25 18:16 | 3687 80.0% | 1660.1 3d16h | 107.28 82485 n.a.%[/CODE]On a Quadro 2000, and not so high above the 100Mdigit exponent activity: [CODE]got assignment: exp=366000023 bit_min=81 bit_max=82 (1338.07 GHz-days) Starting trial factoring M366000023 from 2^81 to 2^82 (1338.07 GHz-days) k_min = 3303075802303200 k_max = 6606151604611397 Using GPU kernel "barrett87_mul32_gs" found a valid checkpoint file! 
last finished class was: 4116 found 0 factor(s) already Date Time | class Pct | time ETA | GHz-d/day Sieve Wait Jul 31 01:48 | 4117 89.1% | 2257.0 2d17h | 53.36 82485 n.a.% Jul 31 02:27 | 4120 89.2% | 2347.2 2d19h | 51.31 82485 n.a.%[/CODE]On an RX550 again, 55 minutes/class: [CODE]got assignment: exp=580500007 bit_min=83 bit_max=84 (3374.56 GHz-days) Starting trial factoring M580500007 from 2^83 to 2^84 (3374.56GHz-days) Using GPU kernel "cl_barrett32_87_gs_2" Found a valid checkpoint file. last finished class was: 5 Date Time | class Pct | time ETA | GHz-d/day Sieve Wait Jun 05 09:06 | 9 0.3% | 3327.9 36d20h | 91.26 81206 0.00% Jun 05 10:02 | 12 0.4% | 3327.9 36d19h | 91.26 81206 0.00% Jun 05 10:57 | 20 0.5% | 3328.6 36d19h | 91.24 81206 0.00%[/CODE]The high exponent/high bit level combinations are taking scattered exponents throughout the mersenne.org range, to full gputo72 TF depth, to qualify exponents for testing CUDAPm1 on them. |
[QUOTE=R. Gerbicz;504006]If you have cnt number of residues (mod M) then you'll sieve NR=cnt*(k1-k0+1)=11520*(k1-k0+1) numbers in our case (if you're using the primes up to 13). With some 3rd-grade knowledge from elementary school you can distribute these numbers into C classes in an almost equal way: the numbers in [floor(i*NR/C), floor((i+1)*NR/C)) will be in the i-th class/task for i=0..(C-1).
And this is independent of 11520 and the number of GPU threads; you can use C=1, C=1000, C=1024, C=20480 (yes, smaller/larger than 11520). From the value of i you can get the residues and intervals (for each residue) that you need to do. Yes, different threads could work on the same residue class, and the same thread could do multiple residues (of course not at once in this case)! About checkpoints: Don't know how you are doing checkpoint(s); say with 1024 gpu threads and with C=10000 you would lose only ~T/20 work (on average), where T is the total computation time, if you're distributing 1000 work units ten times. (Note that here we are not using all threads, and even though 10000<11520 the above 'equal way' distribution works nicely.) For a much deeper task (N2~2^80 or so) use a larger C and/or more small sieve primes (17-19-23). [/QUOTE] [QUOTE=LaurV;504082] But you have to sieve 11k+ classes, instead of 960 classes. ... MfaktX works one class at a time. Take the class, sieve the class, exponentiate the class ... Think about, at the limit, you have millions of classes with a single candidate. Is the sieving faster, or slower, per total? ... think about the fact that when you split to 4620 classes[/QUOTE] Wait - in my post's example with 1024 gpu threads and C=10000 tasks, 1000 tasks at a time are distributed to 1000 threads (24 are "idle"), one task per thread; you do this 10 times, and with that you cover the k=[N1,N2] interval. So virtually in that run I had 10, yes ten, residue classes, hence 9 (intermediate) breakpoints. While you're still using 420 or 4620 classes, hence you can make 420/4620 breakpoints, I can't do that, because I use a much smaller number of 'big' tasks. Then who is the winner? Think about C=1000: there would be no checkpoint; all k's distributed in one step. I've written enough sieves that this should work for a cpu; well, a gpu is a question. But realize the fact that in a cpu(!) 
implementation that runs even a (hypothetical) 1536 cpu threads on one fixed residue class - so on one interval, with length in the range of a very few billion - you couldn't do (much) better than what mfaktx is currently doing using the small sieve primes up to 11. Look at my optimized gap12.c code: (maybe) M was fixed, but we sieved seq(i)=c(t)+i*M for different c(t) mod M, where c(t) is constant in the t-th thread. So it wasn't true that the union of these arithmetic sequences gives a single arithmetic sequence for some starting point with difference=M. That would be a way too restrictive approach; it would totally block my sieve on the cpu. |
Hello Robert,
[QUOTE=R. Gerbicz;504111]False. The total length of interval what you sieve will be smaller, and you missed that point. Preferring small easy examples (numbers<=1000) say we have to find the primes up to 1000, you don't care and sieve the whole interval, taking the whole interval=1000 for each sieving prime. But using the smallest primes=2,3 I will sieve on 2 intervals, 6*k+1 and 6*k+5, their combined length is 167+166=333.[/QUOTE] Unless I got something totally wrong, this is exactly what mfaktX does (96 out of 420 (4*3*5*7) or 960 out of 4620 (4*3*5*7*11) classes). Prime95 does this, too (IIRC 16 out of 60 (4*3*5)). The 4 comes from the fact that factors of M(p) must be +/-1 mod 8. [QUOTE=R. Gerbicz;504111]Say you have 3 gpu threads need to sieve K=1..100, and 2 remainder classes 6*k+2 and 6*k+3, and k=0..16 (so using m=6). Both of them contains 17 integers. We have 34 integers, and want to make say C=3 tasks for the 3 threads. 1st thread: [0,10] ---> sieve on 6*k+2 for k=0..10 #11 k numbers 2nd thread: [11,21] ---> sieve on 6*k+2 for k=11..16 and 6*k+3 for k=0..4 #11 k numbers 3rd thread: [22,33] ---> sieve on 6*k+3 for k=5..16 #12 k numbers[/QUOTE] The remaining ranges (classes) from above are so big that we have to work on them in small chunks - so there is no need to have different threads working on different classes. The chunks have to fit in some fast memory (read: GPU-internal shared memory, or fast L1/L2 cache on a CPU). I'm pretty sure you know this, but others may read it, too: for CPUs/GPUs (and most likely most other HW types, too) it is not possible to read/write single bits - on current GPUs, clearing (or setting) a bit in memory means reading the whole word (32 bit for current nvidia GPUs), modifying the word and writing the word back to memory - so a 32-bit read + 32-bit write to clear a single bit in memory (when talking about sieving). 
GPU memory is way too slow for this (even though hundreds of gigabytes per second sound impressive), and GPU-internal L2 cache is still way too slow (speaking of more than a terabyte per second on current high-end GPUs); sieving has to be done in L1 cache and/or shared memory. Oliver |
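A word-level sketch of the read-modify-write described above, and of the benign "lost update" you accept by skipping atomics (plain Python standing in for two GPU threads; a lost update only re-marks a composite as a candidate, it can never hide a factor):

```python
def clear_bit(words, bit):
    """Clearing one bit of a packed sieve is a full-word read-modify-write:
    read the 32-bit word, mask the bit out, write the word back."""
    w, b = divmod(bit, 32)
    words[w] &= ~(1 << b) & 0xFFFFFFFF

words = [0xFFFFFFFF]          # one 32-bit sieve word, all candidates alive
stale = words[0]              # "thread A" reads the word...
clear_bit(words, 3)           # ...meanwhile "thread B" clears bit 3
words[0] = stale & ~(1 << 7)  # A writes back its stale copy: B's update is lost
print(words[0] & (1 << 7) == 0, words[0] & (1 << 3) != 0)  # -> True True
```

Bit 7 is cleared, but bit 3 is alive again: one composite q survives to the powmod stage and is rejected there, costing a little work but never a missed factor.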
[QUOTE=TheJudger;504213]
The remaining ranges (classes) from above are so big that we have to work on them in small chunks - so there is no need to have different threads working on different classes. [/QUOTE] Yes, your method works; the problem is that when all threads are working on the same class you shorten the length of the interval in an excessive way. I'm working on the whole [N1,N2] interval's possible numbers (those q coprime to 3*5*7*11*13 that give a good q mod 8): sieve on q=[N1,N2], and assume (what we have in general) that N2=2*N1. Then your interval's length is LEN1=N1/(8*p*1155), using one class and the small sieve primes up to 11 (the 8 comes from the fact that q==1,7 mod 8). My playing field has length LEN2=N1/(4*p)*eulerphi(3*5*7*11*13)/(3*5*7*11*13) [we have 2 good residues for q mod 8, which gives the 2/8=1/4 ratio]. So the ratio is LEN2/LEN1~886: I have a much longer interval, and this enables us to use more small sieve primes. Just for comparison, 13*17=221<886, suggesting that it is optimal to use 17 as well. Interested in the following: On a typical sieve, say p~2^26, q=[2^74,2^75], what is your sieve bound? Is it true that the gpu threads are sieving different intervals? If not, how do you distribute the sieving? What is the length of the interval that you sieve per thread? How much time do you spend on sieving, i.e. without the powmod tests mp%q? If you have no timing on this already, then just count the number of survivors (and print it out after a class is done), i.e. the number of tested q values, but don't do the powmod tests [you have to count or do a similarly fast thing; doing nothing is dangerous, since that could enable the compiler to eliminate all/parts of the code, giving a totally false feel for the speed]. 
For a simplified description: I'd sieve on a sieve interval of length = bound for primes = 2^15 = 32768 (call it L); for that you need 4 Kbytes for the sieve bits + 14048 bytes to store the starting points for each prime r, which is slightly less than 18 Kbytes of data per gpu thread. This is basically the same as what I'd do in cpu sieving - how is it/isn't it feasible on a gpu? Note that we can basically sieve with the same prime r on each thread at the same time; we need to do floor(L/r) [or one more] sieving-point deletions, then update the starting point for r. One more thing: if you count the number of survivors in each thread's interval with one scan, then with another scan you can build a large array containing the q values that survived the sieve [we use the count array to put each q value in the right place]. The advantage is that with this you can balance out the work: each thread will do floor(#S/1024) or ceil(#S/1024) powmod tests (assuming we have 1024 threads), where #S is the number of survivors. Of course for a somewhat large L you can expect a very similar number of survivors per thread; don't know if you have done this or a similar thing. ps. maybe you need to store the r primes in each thread(?) Here sieve interval length = bound for primes = 2^15 = 32768 is just an example; you could choose bound for primes < sieve interval length (or the other inequality), and for primes < 65536 you could pack the starting points into 2 bytes, saving memory. |
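The ~886 ratio above checks out numerically (a sketch; the common factor N1/p cancels, so neither appears):

```python
from math import prod

odd = [3, 5, 7, 11, 13]
m = prod(odd)                    # 15015
phi = prod(r - 1 for r in odd)   # eulerphi of a squarefree product: 2*4*6*10*12 = 5760

# LEN1/(N1/p): one class out of 4620, sieve primes up to 11 (1155 = 3*5*7*11,
# the extra 8 from q == 1,7 mod 8)
len1 = 1 / (8 * 1155)
# LEN2/(N1/p): all admissible residues mod 8*m at once (2 of 8 residues mod 8)
len2 = phi / (4 * m)

print(round(len2 / len1))        # -> 886, and indeed 13*17 = 221 < 886
```

So the single-class interval really is almost three orders of magnitude shorter, which is the whole argument for why the longer playing field can afford larger small-prime products.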
Hi Robert,
[QUOTE=R. Gerbicz;504225]Interested in the followings: On a typical sieve, say p~2^26, q=[2^74,2^75] what is your sieve bound?[/QUOTE] Default number of primes used for the GPU sieve is 82486 (primes up to 1055143). This is not a "nice round number" because the number of threads in flight isn't a round number either (see below). [QUOTE=R. Gerbicz;504225]Is it true, that the gpu threads are sieving different intervals? If not, how do you distribute the sieving? What is the length of the interval that you sieve per thread(s)?[/QUOTE] (Assuming you know the basics of how parallelism in CUDA works): Default bitmap size is 64 mebibit (2[SUP]26[/SUP] bit). Each thread block (256 threads in my case) is working on a 64 kibibit bitmap (2[SUP]16[/SUP] bit) using fast shared memory (GPU internal). Within each thread block[LIST][*]for primes up to 511: each of the 256 threads within a thread block sieves all primes up to 511 within a 32 byte (256 bit) segment of the 64 Kibit shared memory (no memory conflicts)[*]for the next few primes we're using 8 threads for each prime and each thread accesses 8 Kibit of memory -> still 256 threads in flight, so 256/8 = 32 threads are sieving 32 different primes within the same 8 Kibit of memory - we ignore shared memory conflicts here; the worst that could happen is a lost update, so a composite q is not cleared. Using atomics for memory access is way slower.[*]for the remaining primes each thread takes one prime and sieves the complete 64 Kibit shared memory segment - again ignoring memory conflicts.[/LIST] [QUOTE=R. Gerbicz;504225]How much time do you spend on sieving, so without the powmod tests mp%q ? 
If you have no timing on this already, then for this just count the number of survivors (and print it out after a class is done), i.e. the number of tested q numbers, but don't do the powmod tests [you have to count or do a similarly fast thing; doing nothing is dangerous, since that could enable the compiler to eliminate all/parts of the code, giving a totally false feel about the speed].[/QUOTE] I didn't do any measurements on current cards, but when George implemented the GPU sieve code I did some comparisons to the old CPU sieve code (GPU did just the powmod stuff) - the GPU sieve lowered total throughput by just a few percent while leaving the CPU idle. For current high-end GPUs the CPU sieve code is way too slow. [QUOTE=R. Gerbicz;504225]For a simplified description: I'd sieve on a sieve interval length=bound for primes=2^15=32768 (call it L), for that you need 4 Kbytes for sievebits+14048 bytes to store the starting points for each r prime, and this is slightly less than 18 Kbytes data per gpu thread. This is basically the same what I'd do in a cpu sieving, how is it/isn't feasible in gpu?[/QUOTE] The main constraint is many threads vs. a limited amount of fast memory. (Threads - don't count just the cores; you need multiple threads per core for maximum throughput on nvidia GPUs, at least 4-8 threads/core in a compute-bound situation.) There are a lot of "where to start sieving" computations, but this seems to be the best choice (balanced work, SIMD, lots of threads). [QUOTE=R. Gerbicz;504225]One more thing: if you count the number of survivors in each thread's interval with one scan, then with another scan you can make a large interval containing the q value's survived the sieve [we use the count array to put the q value to the right place]. The advantage is that with this you can balance out the work: each thread will do floor(#S/1024) or ceil(#S/1024) powmod tests (assuming we have 1024 threads), where #S is the number of survivors. 
Of course for a somewhat large L you can expect a very similar number of survivors per thread; don't know if you have done this or a similar thing. [/QUOTE] After sieving is done there is a parallel count of survivors (each thread counts survivors in a small segment again). Then each thread needs to know the number of survivors in the segments before its own segment (this can be done in log[SUB]2[/SUB]<number of threads> steps). Now each thread writes its survivors into an array (generating q's from bits) - no gaps then, just a linear array of survivors - and this can easily be balanced between a million threads doing the powmod stuff. [QUOTE=R. Gerbicz;504225]ps. maybe you need to store the r primes in each thread(?) Here sieve interval length=bound for primes=2^15=32768 is just an example, you could choose bound for primes<sieve interval length (or the other inequality), and for primes<65536 you could pack the starting point in 2 bytes, saving memory. [/QUOTE] Current GPUs have ~5k cores; with 8 threads per core we're talking about 40k threads at least. With the need to balance work between threads I see no option to store starting points. 40k threads -> 40k primes in flight for the sieve; the lowest prime is e.g. 13, 17 or 19 (it doesn't really matter unless we're getting much bigger) while the 40000th prime is somewhere near 480k. So 480k / 19 = ~25k times the number of bits to clear - [I]"slightly"[/I] unbalanced. Oliver P.S. "Threads" are [U]very cheap[/U] in CUDA - don't compare them to threads on your CPU and your favorite OS. |
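The survivor-compaction step described above (count per segment, give each thread the number of survivors preceding its segment, then scatter into a dense array) is an exclusive prefix sum. A serial Python sketch of the idea, not CUDA, with names of my own choosing:

```python
def compact_survivors(segments):
    """segments: per-thread lists of 0/1 flags (1 = candidate survived the
    sieve). Returns one dense list of global bit indices, built the way a
    parallel compaction would be: per-segment counts, an exclusive prefix
    sum of those counts, then a scatter at each segment's offset."""
    counts = [sum(seg) for seg in segments]
    offsets, total = [], 0          # exclusive prefix sum of the counts
    for c in counts:
        offsets.append(total)
        total += c
    out = [None] * total
    width = len(segments[0])
    for s, seg in enumerate(segments):
        pos = offsets[s]
        for j, alive in enumerate(seg):
            if alive:
                out[pos] = s * width + j   # global bit index, later mapped to a q
                pos += 1
    return out

print(compact_survivors([[1, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]))
# -> [0, 2, 3, 7, 8, 9]
```

The dense output is what makes the load balancing trivial: any number of powmod threads can take equal contiguous slices of it, regardless of how unevenly the survivors were spread over the segments.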
[QUOTE=TheJudger;504233]I didn't do any measurements on current cards but when George implemented the GPU sieve code I did some comparisons to the old CPU sieve code (GPU did just the powmod stuff) - GPU sieve lowered total throughput but just a few percent while leaving CPU idle. For current highend GPUs the CPU sieve code is way too slow.[/QUOTE]
From the point of view of a cloud user: CPU instances are very cheap, while GPU instances are expensive. A Tesla V100 is about 60 times more expensive per hour than one core of a Skylake Xeon. I'm certainly ready to believe that the Tesla will sieve more than sixty times faster than a single Skylake core, but if not, is there some hybrid solution possible where factoring is split between two different programs on different machines? Say, one or more CPUs does all the sieving in advance for a list of exponents, slowly but more cheaply, and then the result of that, some kind of preprocessed data file, is sent as a batch job to a different instance that has a GPU? Internal bandwidth is free within the same AWS region and availability zone, so the size of this preprocessed data file shouldn't matter. |
[QUOTE=GP2;504240]From the point of view of a cloud user: CPU instances are very cheap, while GPU instances are expensive. A Tesla V100 is about 60 times more expensive per hour than one core of a Skylake Xeon.
I'm certainly ready to believe that the Tesla will sieve more than sixty times faster than a single Skylake core, but if not, is there some hybrid solution possible where factoring is split between two different programs on different machines? Say, one or more CPUs does all the sieving in advance for a list of exponents, slowly but more cheaply, and then the result of that, some kind of preprocessed data file, is sent as a batch job to a different instance that has a GPU? Internal bandwidth is free within the same AWS region and availability zone, so the size of this preprocessed data file shouldn't matter.[/QUOTE] Sieving is only a small fraction of GPU time and I *guess* we're talking about 10+ GB/sec for a V100 with the current implementation of the CPU sieve... |
[QUOTE=TheJudger;504233]Each thread block (256 threads in my case) is working on a 64 kibibyte bitmap (2[SUP]16[/SUP] bit) using fast shared memory (GPU internal).[/quote]
[quote] Current GPUs have ~5k cores, with 8 threads per core we're talking about 40k threads at least.[/quote] Does this mean that during sieving most of the cores on a high-end device are idle? Or is mfaktc sieving for the next class while trial factoring the current class? |
[QUOTE=Mark Rose;504244]Does this mean that during sieving most of the cores on a high-end device are idle? Or is mfaktc sieving for the next class while trial factoring the current class?[/QUOTE]
No and no. There are 256 threads in each thread block - and ofc there are multiple thread blocks running at the same time. |
[QUOTE=TheJudger;504245]No and no. There are 256 threads in each thread block - and ofc there are multiple thread blocks running at the same time.[/QUOTE]
I missed that. Thanks! |
Thanks Judger, that was a detailed description. OK, so even on a CPU that thread work distribution would not work my way.
Just some more questions: 1. In your latest binary mfaktc-0.21 I'm assuming that the main sieve is in gpusieve.cu's __global__ static void __launch_bounds__(256,6) SegSieve (uint8 *big_bit_array_dev, uint8 *pinfo_dev, uint32 maxp) method; 256 refers to the fact that you are using 256 threads in one pool, right? What is the 6? 2. Where do you store pinfo_dev, i.e. the sieving primes' info? Is it in your default 64 Mbit bitmap or elsewhere? 3. The ratio of sieving time to total time is still not clear (at least to me); if it is really that tiny, then why not use a deeper sieve, with primes up to 2M or 10M? With that you could lower the number of surviving k values by a factor of log(B2)/log(B1), where B2 is the new depth and B1 is the old one (of course, the sieve would be slower). 4. In the mentioned SegSieve, say we have 20480=80*256 GPU threads; do you then launch these threads at once in that routine, making a sieve on an interval of length 80*65536=5242880? Or do you start only 256 threads at a time, when you have 256 available threads? 5. What happens if the number of GPU threads is not divisible by 256? Possible ideas: on line 539:
[CODE]locsieve32[threadIdx.x * block_size / threadsPerBlock / 32 + j] |= mask;[/CODE]
where we know:
[CODE]const uint32 block_size_in_bytes = 8192;           // Size of shared memory array in bytes
const uint32 block_size = block_size_in_bytes * 8; // Number of bits generated by each block
const uint32 threadsPerBlock = 256;                // Threads per block[/CODE]
so we could also write:
[CODE]locsieve32[8 * threadIdx.x + j] |= mask;[/CODE]
and the compiler can easily replace it by a shift. I'm not sure the compiler arrives at the single shift, but if you are sure about this, then it is absolutely right not to change it. Note that there are many such cases in this method. And you could replace those many const declarations
with #define (using a single integer; don't leave multiplications/operations there), say:
[CODE]#define block_size_in_bytes 8192 // Size of shared memory array in bytes
#define block_size 65536         // = block_size_in_bytes * 8; Number of bits generated by each block
#define threadsPerBlock 256[/CODE]
etc. Would it be faster on the GPU? |
You can set sieve primes in mfaktc.ini
[CODE]# GPUSievePrimes defines how far we sieve the factor candidates on the GPU.
# The first <GPUSievePrimes> primes are sieved.
#
# Minimum: GPUSievePrimes=0
# Maximum: GPUSievePrimes=1075000
#
# Default: GPUSievePrimes=82486[/CODE] Usually you test the speed with different sieve-prime counts for your GPU the first time, but the fastest has been between 75K and 150K every time I have tested it. The optimal value can be slightly different, I believe, for different exponent sizes and bit depths. |
Hi Robert,
[QUOTE=R. Gerbicz;504271]1. in your latest binary mfaktc-0.21 I'm assuming that the main sieve is in gpusieve.cu's __global__ static void __launch_bounds__(256,6) SegSieve (uint8 *big_bit_array_dev, uint8 *pinfo_dev, uint32 maxp) method, 256 refers to the fact that you are using 256 threads in one pool, right? What is the 6 ?[/QUOTE] Correct, 256 threads per block and a minimum of 6 blocks per [I]"multiprocessor"[/I] (a group of GPU cores). [QUOTE=R. Gerbicz;504271]2. where do you store the pinfo_dev, so the sieving prime's info? Is it in your default 64mbit bitmap or elsewhere?[/QUOTE] Not sure what you want to know here. The big 64 Mibit bitmap contains just the bits of the sieve. [QUOTE=R. Gerbicz;504271]3. still not clear (at least for me) the ratio of sieving time/total time; say if it is really that tiny, then why not use deeper sieve, using primes up to 2m or 10m. With that you could lower the number of k survivors by a factor of log(B2)/log(B1), where B2 is the new depth, B1 is the old (ofcourse, the sieve would be slower).[/QUOTE] I can't tell you the timings because I don't know them either. As ATH noted, one simply adjusts GPUSievePrimes to the "optimal" value (minimum runtime for given work). [QUOTE=R. Gerbicz;504271]4. in the mentioned SegSieve, say we have 20480=80*256 GPU threads, then do you launch these threads at once in that routine, making a sieve on 80*65536=5242880 length interval ? Or do you start at once only 256 threads, when you have 256 available threads?[/QUOTE] All threads (blocks) are launched at the same time. With the default configuration there are 1024 blocks of 256 threads each (64 Mibit partitioned into 1024 64 kibit segments). From the code itself it is OK to run them all at once or block after block; the GPU (driver) does the magic of scheduling the blocks onto the HW. Currently the (big) bitmap is limited to 128 Mibit and thus 524288 threads - enough until nvidia manages to build GPUs with more than, let's say, 64k cores. [QUOTE=R. Gerbicz;504271]5. What happens if the number of GPU threads is not divisible by 256 ?[/QUOTE] It is! [QUOTE=R. Gerbicz;504271]Possible ideas: on line 539: locsieve32[threadIdx.x * block_size / threadsPerBlock / 32 + j] |= mask; where we know: const uint32 block_size_in_bytes = 8192; // Size of shared memory array in bytes const uint32 block_size = block_size_in_bytes * 8; // Number of bits generated by each block const uint32 threadsPerBlock = 256; // Threads per block so we could also write: locsieve32[8 * threadIdx.x + j] |= mask; and the compiler can easily replace it by a shift. I'm not sure if the compiler arrives to the single shifting. But if you are sure about this, then it is absolutely right not to change. Note that there are many such cases in this method. And you could replace those lots of const .. with #define (with a single integer, don't leave mult/operations there), say: #define block_size_in_bytes 8192 // Size of shared memory array in bytes #define block_size 65536 // = block_size_in_bytes * 8; // Number of bits generated by each block #define threadsPerBlock 256 etc., would it be faster on GPU?[/QUOTE] Pretty sure the compiler knows that those variables are constant and precomputes the value. Replacing the multiplication with a shift might be a bad idea - at least on some GPUs shift has the same throughput as mul - and we still need the add. On the other hand there is a multiply-add with the same throughput as mul (the add is for free). [URL="https://docs.nvidia.com/cuda/cuda-c-programming-guide/#arithmetic-instructions"]https://docs.nvidia.com/cuda/cuda-c-programming-guide/#arithmetic-instructions[/URL] Oliver |
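To illustrate the constant-folding point with the constants quoted from gpusieve.cu (block_size = 65536 bits, threadsPerBlock = 256): the index expression evaluates left to right and reduces to 8*threadIdx.x + j. A host-side C sketch (my own check, not kernel code):

```c
/* The index expression from the quoted gpusieve.cu line 539, with the
   constants from the thread; all operands are compile-time constants,
   so any compiler can fold the whole expression. */
static unsigned idx_original(unsigned tid, unsigned j) {
    const unsigned block_size = 8192u * 8u;   /* bits per block = 65536 */
    const unsigned threadsPerBlock = 256u;
    /* evaluates left to right: ((tid * 65536) / 256) / 32 + j */
    return tid * block_size / threadsPerBlock / 32u + j;
}

/* The folded form Gerbicz proposes: 8 * threadIdx.x + j. */
static unsigned idx_folded(unsigned tid, unsigned j) {
    return (tid << 3) + j;
}
```

Whether the folded form ends up as a shift-plus-add or a single multiply-add is then purely the compiler's choice; as noted above, on the GPU the multiply-add is just as cheap.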
[QUOTE=TheJudger;504293]
I can't tell you the timings because I don't know, too. As ATH noted one simply adjusts GPUSievePrimes to the "optimal" value (minimum runtime for given work). [/QUOTE] Could it be obtained by profiling? |
[QUOTE=TheJudger;196991][B][edit by LaurV]
As the thread got longer and longer and people continue coming and asking "Where is this or that version of mfaktc?" I am editing this post to give the link to the folder which Xyzzy kindly created: [URL]http://www.mersenneforum.org/mfaktc/[/URL] Here you can find many different versions of mfaktc, for different OS-es, different Cuda drivers, bla bla. Select the suitable one for you, download it, clear some exponents! :smile: Thanks! [end of edit by LaurV][/B] Hi, maybe a bit offtopic... as a proof-of-concept I can do TF on my GTX 275 :) A straightforward implementation of the multiprecision arithmetic gives me ~15.000.000 tested factors per second for a 26-bit exponent and factors up to 71 bits. (The numbers are from memory, I hope they are correct.) I've checked the correctness for some 1000 cases against a basic implementation using libgmp on the CPU, no errors found. :) Currently there is no presieving of factor candidates... but I've a lot of cycles free on the CPU. ;) The quality of the code right now is.... "proof-of-concept" (meaning really ugly, a lot of hardcoded stuff, ...) :D Most time is spent on the calculation of the remainder (using a simple basecase division). jasonp: if you read this: the Montgomery multiplication suggested on the nvidia forum doesn't look suitable for CUDA to me. :( TheJudger[/QUOTE] Recently I found some possible improvements while using mfaktc; one is here: [url]https://www.mersenneforum.org/showpost.php?p=504109&postcount=3014[/url] I think adding a double-check variable might provide a more convincing result. That post has had no reply; I don't know whether it was overlooked, and I would apologize if the suggestion is really useless. Furthermore, I found that mfaktc is really slow when dealing with a really big worktodo.txt (e.g., ~2 MB): it takes ~1 second to rewrite the worktodo.txt. I think a better way to deal with this problem is: once the worktodo.txt is provided, mfaktc could remember the size of worktodo.txt and work on the last record.
After TF, mfaktc could then delete the last record. I think deleting the last record is much quicker than deleting the first one. |
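For what it's worth, the suggestion amounts to truncating the file at the offset of the previous newline, which is O(1) in file traffic, instead of rewriting every remaining byte as happens when the first line is removed. A small sketch of finding that offset (my own illustration of the idea, not mfaktc code):

```c
#include <stddef.h>

/* Return the length worktodo.txt would have after dropping its final
   record; the caller would then truncate the file to this length. */
static size_t drop_last_record(const char *buf, size_t len) {
    size_t end = len;
    if (end > 0 && buf[end - 1] == '\n')
        end--;                               /* ignore the trailing newline */
    while (end > 0 && buf[end - 1] != '\n')
        end--;                               /* back up to the previous one */
    return end;
}
```

On POSIX systems the actual shrink would be a single ftruncate() call, independent of the file size.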
Hi,
[QUOTE=Neutron3529;504342]Recently I found some improvments while using mfaktc, one is here [url]https://www.mersenneforum.org/showpost.php?p=504109&postcount=3014[/url] I think add a double-check variable may provides a more convicent result. This post have no reply, I don't know whether it is ignored. And I would apologize if such suggest is really useless.[/QUOTE] There is currently no double-checking for TF work and thus you can't improve (reduce time) for that. :yucky: You want a checksum or whatever for all factor attempts in each class, right? That won't work, for multiple reasons:[LIST][*]the list of factor candidates depends on the sieve parameters (e.g. more sieving, fewer candidates)[*]even with the same settings the list of candidates depends on the hardware, because we ignore memory conflicts and sometimes a composite candidate isn't cleared by the sieve while on the next run it is.[/LIST] [QUOTE=Neutron3529;504342]Further more, I found it is really slow when mfaktc is dealing with a really big worktodo.txt(e.g., ~2M) It will take ~1 second to rewrite the worktodo.txt. I think a better way to dealing such problem is, once the worktodo.txt is provided, mfaktc could remember the size of worktodo.txt, and dealing with the last record. after TF, mfaktc could delete the last record. I think delete the last record is much quickly than delete the first one.[/QUOTE] Is this a common use case? Keep in mind that I focus on the current primenet wavefront; a 2 MB worktodo isn't general usage. Maybe the easiest solution is to split your worktodo into reasonable sizes and put a small script around mfaktc: put a small worktodo.txt into the directory, start mfaktc, let it run until it has finished worktodo.txt, then repeat with the next worktodo.txt. |
[QUOTE=TheJudger;504352]Is this a common usecase? 2M worktodo isn't general usage.[/QUOTE]No, that's not a common usecase. Even with my work at the large-exponent-low-bits above 1000M where exponent (not class) runtimes are ~1s there's no reason to have mammoth worktodo. For convenience I fetch/submit 1000 exponents at a time (~25kB worktodo) but even if it was an offline system I would seriously consider writing some script that would slice off 100-1000 assignments at a time from a separate bulk assignment file when worktodo.txt runs empty (and at the same time archive off results.txt since that also gets large quickly).
I'm curious what [i]Neutron3529[/i] is doing to get 2MB worktodo... is he my mystery TF'er who reserves [url=https://www.mersenne.ca/tf1G.php#assign_count_by_age]a million exponents at a time[/url] and then takes 2 weeks to submit the results? |
[QUOTE=James Heinrich;504378]No, that's not a common usecase. Even with my work at the large-exponent-low-bits above 1000M where exponent (not class) runtimes are ~1s there's no reason to have mammoth worktodo. ... my mystery TF'er who reserves [URL="https://www.mersenne.ca/tf1G.php#assign_count_by_age"]a million exponents at a time[/URL] and then takes 2 weeks to submit the results?[/QUOTE]At what point does it stop being an acceptable request and start looking like a DOS?
I don't see the point of MB worktodo downloads when there's so much work to do below 10[SUP]9 [/SUP]with a well chosen 1k size worktodo representing several ThzDays. |
[QUOTE=kriesel;504414]I don't see the point of MB worktodo downloads[/QUOTE]
I can see why someone would do that if they didn't have a script to automate fetching/distributing the work. |
[QUOTE=Mark Rose;504416]I can see why someone would do that if they didn't have a script to automate fetching/distributing the work.[/QUOTE]
Plenty of work to do well under 10[SUP]9[/SUP]. It would be good to TF exponents to full gputo72 bit depth, successively, with preference for lowest exponent first, to clear the path for P-1 and primality testing. In other words, instead of single-bit-depth per exponent, do full-to-goal-bit-depth before going on to another exponent on the same gpu. [URL]https://www.mersenne.ca/exponent/200000033[/URL] half a ThzDay. [URL]https://www.mersenne.ca/exponent/400000079[/URL] a ThzDay [URL]https://www.mersenne.ca/exponent/800000087[/URL] nearly 10 ThzDays. [URL]https://www.mersenne.ca/exponent/990000337[/URL] ~15ThzDays. Two weeks on a GTX1080, for ~56 bytes of worktodo content. 1KB of such work could occupy even a fast gpu for a month or more. I have TF of one exponent on an older gpu that's estimated to complete in April. |
[QUOTE=Neutron3529;504342]
It will take ~1 second to rewrite the worktodo.txt. [/QUOTE] Even if you use a RAM drive? |
[QUOTE=James Heinrich;504378]...I'm curious what [I]Neutron3529[/I] is doing to get 2MB worktodo... is he my mystery TF'er who reserves [URL="https://www.mersenne.ca/tf1G.php#assign_count_by_age"]a million exponents at a time[/URL] and then takes 2 weeks to submit the results?[/QUOTE]
The simple math says he's running nearly 3,000 per hour. He seems to have found a way around your limitations. He may be using a recursive batch process to do this. 1,000 iterations of 1,000 each, with a very short timeout between. I would not want to have to concatenate all of that. Even 100 would take some time. He would also have to rename each download. If he has any programming experience, and I suspect he does, he could automate this entire process to run unattended. |
[QUOTE=storm5510;504559]He seems to have found a way around your limitations.[/QUOTE]Whoever it is isn't doing anything "wrong", as long as they keep reporting back 25THz-days of work once a week or so I'm not complaining much, although I would still much prefer that whoever it is would reserve and submit only a day's worth at a time. But as long as they work's getting done...
|
[QUOTE=TheJudger;504352]Hi,
There is currently no doublechecking for TF work and thus you can't improve (reduce time) for that. :yucky: You want a checksum or whatever for all factor attemps in each class, right? Won't work for multiple reasons:[LIST][*]the list of factor candidates depends on sieve parameters (e.g. more sieving, less candidates)[*]even with same settings the list of candidates depends on hardware because we ignore memory conflicts and sometimes a composite factor isn't cleared by the sieve while on the next run it is.[/LIST] Is this a common usecase? Keep in mind that I focus on current primenet wavefront. 2M worktodo isn't general usage. Maybe easiest is to split your worktodo to reasonable sizes and put a small script around mfaktc (put small worktodo.txt into directory, start mfaktc (let it run until it has finished worktodo.txt, repeat with next worktodo.txt)[/QUOTE] I finally came up with a possible idea. First, it is easy to maintain a small set of candidates, for example about 100 per class. For each candidate we could use the BPSW algorithm to check whether it is a (pseudo)prime. If none of the candidates is a pseudoprime we could use a special mark; otherwise we could compute a checksum from a pseudoprime. It is quite hard to fake a pseudoprime, and since the probability that an odd number n < 2^90 is prime is about 2/(90 log 2) ~ 1/31, the probability that none of the ~100 candidates in a class is prime should be very low (<0.04). Hence a residual check value could be available. |
[QUOTE=TheJudger]There is currently no double-checking for TF work and thus you can't improve (reduce time) for that. :yucky:
You want a checksum or whatever for all factor attempts in each class, right? Won't work for multiple reasons:[/QUOTE] There is more than enough DC work as it is without checking TF. Except for the wave-front, there is far too much TF going on. By wave-front, I mean [I]GPUto72[/I]. |
[QUOTE=storm5510;504670]There is more than enough DC work as it is without checking TF. Except for the wave-front, there is far too much TF going on. By wave-front, I mean [I]GPUto72[/I].[/QUOTE]
TF _is_ checked, in the sense of detecting false positive factors, when submitted. The payoff on detecting missed factors (false negatives) is very low. |
[QUOTE=kriesel;504693]TF _is_ checked, in the sense of detecting false positive factors, when submitted. The payoff on detecting missed factors (false negatives) is very low.[/QUOTE]
If I may please share... Early on in the GPU72 effort I would sometimes notice people whose results were "unusual" (read: an unexpectedly low "success" rate). I spent a lot of time and money rechecking their work, and never once did I find a "cheat". And at the end of the day, it doesn't really matter all that much if a factor is missed. Finding a factor simply removes the candidate from the LL'ing and then DC'ing effort. The latter are definitive as to primality. |
I would remind the gentle readers that we do have a TF DC system in place. It is called the user [URL="https://www.mersenneforum.org/showthread.php?t=19014"]TJAOI [/URL]
|
[QUOTE=ixfd64;502467]I'm running mfaktc on a borrowed MSI gaming laptop with a GeForce GTX 1070 video card. There was a moment today when mfaktc got stuck on a class. However, this wasn't a complete freeze because mfaktc processed the next set of classes when I pressed Ctrl + C. I had to press Ctrl + C a few more times (with classes being processed each time) before mfaktc correctly exited. Has anyone encountered this issue?
It's worth mentioning that the cursor on this laptop sometimes freezes for a short time. I have no idea if these issues are related.[/QUOTE] I just noticed that this issue occurs when I select text inside the mfaktc window. The program resumes as soon as I cancel the selection. This is 100% reproducible as far as I could tell. The screen freezing issue likely isn't related to mfaktc as it did go away after a Windows update. |
[QUOTE=ixfd64;504712]I just noticed that this issue occurs when I select text inside the mfaktc window. The program resumes as soon as I cancel the selection. This is 100% reproducible as far as I could tell.[/QUOTE]That's a Windows thing, nothing to do with mfaktc.
Any program running in a command window will be suspended while you're marking/selecting text. You can also suspend a program with the Pause/Break key, and resume by hitting any (other?) key. |
[QUOTE=James Heinrich;504378]No, that's not a common usecase. Even with my work at the large-exponent-low-bits above 1000M where exponent (not class) runtimes are ~1s there's no reason to have mammoth worktodo. For convenience I fetch/submit 1000 exponents at a time (~25kB worktodo) but even if it was an offline system I would seriously consider writing some script that would slice off 100-1000 assignments at a time from a separate bulk assignment file when worktodo.txt runs empty (and at the same time archive off results.txt since that also gets large quickly).
I'm curious what [i]Neutron3529[/i] is doing to get 2MB worktodo... is he my mystery TF'er who reserves [url=https://www.mersenne.ca/tf1G.php#assign_count_by_age]a million exponents at a time[/url] and then takes 2 weeks to submit the results?[/QUOTE] I use [QUOTE][url]https://www.mersenne.org/report_factoring_effort/[/url][/QUOTE] to generate a worktodo.txt file, so it is quite easy to get a 0 KB worktodo.txt or a ~2.7 MB worktodo.txt. |
[QUOTE=GP2;504422]Even if you use a RAM drive?[/QUOTE]
I use ImDisk to create a RAM drive. The problem is that when worktodo.txt is rewritten, every byte in worktodo.txt must be written again, which causes a lot of waste. I think the reason it is so slow is that I keep prime95 running, which may slow down rewriting the worktodo file. |
[QUOTE=Neutron3529;504757]I use Imdisk to create a RAM drive.
The question is, when rewrite worktodo.txt, every byte in worktodo.txt must be changed, causing a lot of waste. I think the reason why so slow is that I keep the priime95 running, which may slow down the rewrite the worktodo file[/QUOTE] prime95 by design runs at a quite low priority, to use otherwise-idle cpu cycles, without slowing user applications and console interactive response speed. (See the end of undoc.txt) |
[QUOTE=Neutron3529;504756]I use
[URL]https://www.mersenne.org/report_factoring_effort/[/URL] to get a worktodo.txt file So it is quite easy to get a 0KB worktodo.txt or a ~2.7M worktodo.txt[/QUOTE] Hmm, unless I'm mistaken, that's not for assignments; that's a report generator page. It's my understanding that assignments are issued through Primenet automatic connections in prime95 or mprime, or manually through [URL]https://www.mersenne.org/manual_assignment/[/URL] or [URL]https://www.mersenne.org/manual_gpu_assignment/[/URL], and include unique assignment IDs in the worktodo records. The output of [URL]https://www.mersenne.org/report_factoring_effort/[/URL] for 92M to 93M just now includes exponents I have assigned TF work for. Such a report listing, even if currently assigned exponents are excluded, won't exclude assignments made in the near future, either through automatic Primenet connections or through the manual assignment pages. It seems likely to me that the assignee and you will be duplicating work. I did a quick test on [URL]https://www.mersenne.org/report_factoring_effort/[/URL] for 92000000 to 92002000. If I check "exclude currently assigned exponents", check "worktodo format", and specify 76 bits, it does not include an assignment ID in its output; it provides a prime95-style manual assignment format record, and the resulting assignment (if any) does not show up in my current assignments retrieved afterward. Its output was: [CODE]Factor=N/A,92000429,75,76[/CODE]Checking the status of [URL]https://www.mersenne.org/report_exponent/?exp_lo=92000429&exp_hi=[/URL] afterward indicates no active TF assignment.
Therefore I conclude that if I were to use this method to obtain a large list of exponents needing more factoring, and factored them over time without promptly taking other actions to reserve them, they would very likely be assigned to other people by the usual mechanisms listed above, and for many if not most of the exponents so obtained, wasteful duplication of TF effort on the same exponent and bit level would result. "N/A" is the dead giveaway in the record produced that an assignment was not issued: that's where the lengthy AID would normally appear (32 uppercase hexadecimal characters). |
[QUOTE=Neutron3529;504756]So it is quite easy to get a 0KB worktodo.txt or a ~2.7M worktodo.txt[/QUOTE]Yes, but the question is, [b][i]why[/i][/b] are you generating a worktodo with ~100,000 entries (I can't call them assignments because they're not assigned).
You would get better throughput (both for you and for GIMPS) by taking fewer assignments in the conventional way (ideally actually assigned to you, so there's no risk of effort being duplicated).
[QUOTE=kriesel;504779]prime95 by design runs at a quite low priority, to use otherwise-idle cpu cycles, without slowing user applications and console interactive response speed[/QUOTE]Usually true, but can still have an effect in the right circumstances. For example, I'm currently running several large backup compressions with 7zip, and if I also have Prime95 P-1 running in the background it slows down 7zip by about 300%, I suspect the limiting resource is memory bandwidth and not CPU power (quad-channel, but only DDR3-1333). [i]Generally[/i] Prime95 has little effect on user applications, but there are always exceptions :smile:
|
[QUOTE=James Heinrich;504715]That's a Windows thing, nothing to do with mfaktc.
Any program running in a command window will be suspended while you're marking/selecting text. You can also suspend a program with the Pause/Break key, and resume by hitting any (other?) key.[/QUOTE] I see. However, I've noticed the GPU fans are still running at full power when the program is paused. What is the GPU actually doing during this time? |
[QUOTE=James Heinrich;504791]Usually true, but can still have an effect in the right circumstances. For example, I'm currently running several large backup compressions with 7zip, and if I also have Prime95 P-1 running in the background it slows down 7zip by about 300%, I suspect the limiting resource is memory bandwidth and not CPU power (quad-channel, but only DDR3-1333). [i]Generally[/i] Prime95 has little effect on user applications, but there are always exceptions :smile:[/QUOTE]I have seen an Adobe install creeping along while Prime95 was running. I think it was trying to run at low priority. So both programs were "after you sir", 'no, after you sir', "no, after you sir", etc.
|
I noticed an oddity in the timestamp. On my older system running mfaktc compiled for CUDA 6.5, the day is padded with a leading zero when it is a single digit:
[CODE][Mon Jan 08 22:50:07 2018][/CODE] But on my laptop running mfaktc for CUDA 10, there is no leading zero: [CODE][Wed Jan 2 09:40:29 2019][/CODE] Is there a reason for this inconsistency? |
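A plausible explanation (an assumption about how the two builds format the date, not something confirmed from the mfaktc source): one build uses a zero-padded day field and the other a space-padded, asctime()-style one. In strftime terms that is "%d" versus "%e" (the latter is C99/POSIX):

```c
#include <string.h>
#include <time.h>

/* Render the same day-of-month two ways: zero-padded ("%d") and
   space-padded ("%e"), matching the two log styles above. */
static void day_formats(int mday, char *padded, char *spaced, size_t n) {
    struct tm t;
    memset(&t, 0, sizeof t);
    t.tm_mday = mday;
    t.tm_mon  = 0;    /* January */
    t.tm_year = 119;  /* 2019 */
    strftime(padded, n, "%d", &t);   /* zero-padded day */
    strftime(spaced, n, "%e", &t);   /* space-padded day */
}
```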
[QUOTE=James Heinrich;504378]No, that's not a common usecase.[/QUOTE]
+1. Misfit does a very good job of keeping the worktodo small, doing all the reports, and stuff. In fact, I don't see any reason to have a large worktodo file either. |
[QUOTE=LaurV;504881]+1. Misfit does a very good job of keeping the worktodo small, doing all the reports, and stuff. In fact, I don't see any reason of having large worktodo file either.[/QUOTE]
I reserved a large number of exponents when doing manual testing, but that was before setting up a loop script. Now the script does not fetch new work items until the worktodo file is down to 3 items; until then it just consumes the existing worktodo items. 3 items seems reasonable if you tune the fetch/submit timing accordingly. |
From mfakto thread:
[QUOTE=R. Gerbicz;504831]Hm, reading the code I still don't know where you sieve by p>11; confirmed that it is in SegSieve in gpusieve.cl, but as far as I can see we don't even call this in the runs: say, replacing line 1296 by big_bit_array32[j * threadsPerBlock + get_local_id(0)]=0; (this would eliminate all k values), or placing Visual Studio breakpoints on the first line in SegSieve, or even using [CODE]#define TRACE_SIEVE_KERNEL 5
// If above tracing is on, only the thread with the ID below will trace
#define TRACE_SIEVE_TID 2[/CODE] results in us passing the selftest and seeing no break, no additional debug info. Furthermore, modifying the default GPUSievePrimes=81157 gives different times, suggesting that we're really sieving - but where? One more thing that I've also seen on this forum: at run time it displays that the automatic parameter for threads per grid is 0: [CODE]OpenCL device info
name                      Intel(R) HD Graphics 530 (Intel(R) Corporation)
device (driver) version   OpenCL 2.1 NEO (25.20.100.6471)
maximum threads per block 256
maximum threads per grid  16777216
number of multiprocessors 24 (24 compute elements)
clock rate                1150MHz

Automatic parameters
threads per grid          0

optimizing kernels for INTEL[/CODE] I still have some ideas (not that many) to improve the current code, but without understanding the basics of the code it is somewhat hard.[/QUOTE] In mfaktc we can see that at line 1079, and those #define's are not there, but the question is the same. It could be something trivial. |
Hi Robert,
sieving of small primes starts right at the beginning of SegSieve() in src/gpusieve.cu. At this point (no sieving done yet in shared memory) George used four 32-bit words (128 bits) of local variables (mask, mask2, mask3 and mask4) for sieving. Don't be afraid of those [CODE]if (primesNotSieved == X)[/CODE] checks: primesNotSieved is const and thus the compiler knows the correct code path at compile time. Oliver |
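A toy version of that register-mask technique, as I understand the description (my own sketch, not George's actual code): for each tiny prime p, clear every p-th bit across a 128-bit window held in four 32-bit words before the shared-memory sieve takes over:

```c
/* Clear every p-th bit (the multiples of the tiny prime p), starting at
   `start`, across a 128-bit window held in four 32-bit words:
   window bit i lives in word i/32, bit i%32. */
static void mask_sieve(unsigned mask[4], unsigned p, unsigned start) {
    unsigned bit;
    for (bit = start; bit < 128; bit += p)
        mask[bit / 32] &= ~(1u << (bit % 32));
}
```

Because p and start are known per thread, the loop is short and branch-free across a warp, which fits the no-divergence constraint discussed earlier in the thread.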
[QUOTE=kriesel;504779]prime95 by design runs at a quite low priority, to use otherwise-idle cpu cycles, without slowing user applications and console interactive response speed. (See the end of undoc.txt)[/QUOTE]
Maybe It is the memory bandwith that make the rewrite slower. I will check it later since I am running an R script now. [QUOTE=kriesel;504783]Hmm, unless I'm mistaken, that's not for assignments, that's a report generator page. It's my understanding that assignments are issued through Primenet automatic connections in prime95 or mprime, or manually through [URL]https://www.mersenne.org/manual_assignment/[/URL] or [URL]https://www.mersenne.org/manual_gpu_assignment/[/URL] and include unique assignment IDs in the worktodo records. Output of [URL]https://www.mersenne.org/report_factoring_effort/[/URL] for 92M to 93M just now includes exponents I have assigned TF work for. Exponents in such a report listing, even if current assignments are excluded, won't exclude assignments made in the near future either through automatic primenet connections or through the manual assignment pages. It seems likely to me, that the assignee and you will be duplicating work. I did a quick test on [URL]https://www.mersenne.org/report_factoring_effort/[/URL] for 92000000 to 92002000. If I check exclude currently assigned exponents, and check worktodo format, and specify 76 bits, it does not include an assignment ID in its output, it provides a prime95 style manual assignment format record, and the resulting assignment if any does not show up in my current assignments retrieved afterward. Its output was: [CODE]Factor=N/A,92000429,75,76[/CODE]Checking the status of [URL]https://www.mersenne.org/report_exponent/?exp_lo=92000429&exp_hi=[/URL] afterward indicates no active TF assignment. 
Therefore, I conclude, that if I were to use this method to obtain a large list of exponents needing more factoring, and factored them over time, without taking other actions promptly to reserve them, meanwhile they very likely are being assigned to other people, by the usual mechanisms listed above, and for many if not most of the exponents so obtained, wasteful duplication of TF effort on the same exponent and bit level would occur as a result. N/A is the dead giveaway in the record produced, that an assignment was not issued. That's where the lengthy AID would normally appear (32 upper case hexadecimal characters).[/QUOTE] The reason I tried the report page to generate a worktodo file is that I want to test some relatively low exponents, but [URL="https://www.mersenne.org/manual_gpu_assignment/"]manual_gpu_assignment/[/URL] may return an error, while [URL="https://www.mersenne.org/manual_assignment/"]manual_assignment/[/URL] may assign a really large exponent to me. |
[QUOTE=Neutron3529;505008]I want to test some relative low exponent[/QUOTE]What kind of exponent range are you testing, and to what bitdepth? How long does each "assignment" (I say in quotes because they're not actually assigned to you) take to run on your GPU?
|
I'm factoring exponents above 900000000, to 71-bit depth.
As @[B]kriesel[/B] suggested, I tried [URL]https://www.mersenne.org/manual_gpu_assignment/[/URL] and got some results. Unfortunately, some collisions occurred: [url]https://www.mersenne.org/M903462383[/url] I think it was a collision caused by the later submission: [COLOR=darkgreen]processing: TF no-factor for [URL="https://www.mersenne.org/M903482861"]M903482861[/URL] (2[SUP]70[/SUP]-2[SUP]71[/SUP])[/COLOR] [COLOR=blue]Result was not needed. TF on M903482861, sf: 70, ef: 71 CPU credit is 0.2647 GHz-days.[/COLOR] |
[QUOTE=Neutron3529;505041]I'm factoring exponent above 900000000, to 71 bitdepth.
as @[B]kriesel [/B]suggests, I tried [URL]https://www.mersenne.org/manual_gpu_assignment/[/URL] and get some results. And unfortunately some collision occur: [URL]https://www.mersenne.org/M903462383[/URL] I find it might be a collision due to the latter submit: [COLOR=darkgreen]processing: TF no-factor for [URL="https://www.mersenne.org/M903482861"]M903482861[/URL] (2[SUP]70[/SUP]-2[SUP]71[/SUP])[/COLOR] [COLOR=blue]Result was not needed. TF on M903482861, sf: 70, ef: 71 CPU credit is 0.2647 GHz-days.[/COLOR][/QUOTE] Not certain if you actually meant results, assignments, or factors, there. Collisions, waste, or poaching can be frustrating. That's a low bit level, but above 900M is a very [B]high[/B] exponent, in the upper range of what mersenne.org supports. (Not low exponent as earlier stated) The GIMPS primality testing wavefront won't get there for decades if not lifetimes. See [URL]https://www.mersenneforum.org/showthread.php?t=23845[/URL], particularly post 11. At 0.265 GhzD/day, any reasonable speed gpu will do hundreds or thousands of such exponents from 2[SUP]70[/SUP] to 2[SUP]71[/SUP] per day. It may be easier to avoid collision by working longer on fewer exponents, closer to the wavefront, where the investment in computing time will also pay off sooner for the project, perhaps in the next decade or less. Or get actual TF assignments on scattered exponents, such as in the first thousand range of some million exponent range bins, and take them to full gputo72 recommended TF bit level. Those are very helpful and useful to have partly or fully done for exponent qualification for software testers of P-1 and primality test code, testing well ahead of the wavefront. Some portions of the exponent range do not yield assignments from the mersenne.org assignment page for TF or P-1. That may be because it is (not so visibly) reserved to gputo72, or is simply not being issued at the time, or someone has grabbed a huge span for single-bit-level trial factoring. 
In my experience, once runs break through and establish "islands" above the bit level "sea", collisions with other users reporting results are very rare. |
[QUOTE=Neutron3529;505041]And unfortunately some collision occur[/QUOTE]Collisions will occur if you don't get assignments and just work on random exponents. Many more collisions will occur than you know about, when you do work that's already assigned to other people, when they submit their results their efforts will be wasted.
If you want to work on really big exponents you can try [url=https://www.mersenne.ca/tf1G.php]TF above 1000M[/url], but [i]please[/i] use the provided scripts to automate fetching/submitting work and avoid getting more work than you can complete in one day. |
[QUOTE=James Heinrich;505056]Collisions will occur if you don't get assignments and just work on random exponents. Many more collisions will occur than you know about, when you do work that's already assigned to other people, when they submit their results their efforts will be wasted.
[/QUOTE] That's one way to look at it. Another is to consider that your own efforts are less likely to be wasted or go uncredited if you diligently use the assignment methods. And others then have less reason to be annoyed with you. Good point, though, about the probable magnitude of the entire waste from working without assignments being much larger than may be apparent to any one user. |
[QUOTE=James Heinrich;505056]If you want to work on really big exponents you can try [URL="https://www.mersenne.ca/tf1G.php"]TF above 1000M[/URL], but [I]please[/I] use the provided scripts to automate fetching/submitting work and avoid getting more work than you can complete in one day.[/QUOTE]
[U]Congrats[/U]! All the 65's and 66's are done. :smile: On another note: I am now using a separate PSU to power my GTX 1080 in the HP. I am getting a lot of, what I would call, screen noise. Is this harmful in any way? That PSU is 550W and nine years old. |
[QUOTE=storm5510;505266][U]Congrats[/U]! All the 65's and 66's are done. :smile:[/QUOTE]Slightly premature congratulations; there are still 2,562,566 exponents below 2[sup]67[/sup], but they're already assigned. Everything should be complete within a week or two though.
|
[QUOTE=storm5510;505266]On another note: I am now using a separate PSU to power my GTX 1080 in the HP. I am getting a lot of, what I would call, screen noise. Is this harmful in any way?[/QUOTE]
Yes. You should use a firm ground between the two PSUs, and DO NOT rely on the ground connection that goes [B][U]through the card[/U][/B]! You most probably have a different ground potential between the two PSU grounds (as each is an isolated output) and a lot of EMI in that card (which causes the screen flickering**). You can have EMI sparks of over 500V there. This will shorten the lifetime of your card, and can also cause calculation errors. You must connect a thick, short wire between the grounds of the two PSUs, using one of the unused outputs of each. ------- ** Related to screen flickering: it can also be due to mechanical vibrations (from the fans, etc.), so first check that the card is well seated in its socket and that the monitor cables are firmly seated in their connectors; otherwise the plugs vibrate, causing intermittent loss of contact, which may cause flickering. |
[QUOTE=LaurV;505366]Yes. You should use a firm ground between the two PS, and DO NOT rely on the ground connection that goes [B][U]through the card[/U][/B]! You most probably have different ground potential between the two PS grounds (as they are insulated outputs each) and have a lot of EMI in that card (which causes the screen flickering**). You can have EMI sparks of over 500V there. This will shorten the life time of your card, and also can give calculus errors. You must connect a thick and short wire between the grounds of the two PS, using one of the unused outputs of each).
------- ** related to screen flickering, it can also be due to mechanical vibrations (from the fans, etc), so first check if the card is well stuck in the socket, and if the monitor cables are firm stuck in the connectors, etc., otherwise the plugs vibrate causing intermittent losing of contacts which may cause flickering.[/QUOTE] Thank you for the reply! I disconnected the extra PSU after I posted the above. I was not comfortable with running it that way. I got this card just before crypto began to tank, so it was a large investment, for me. To run [I]mfaktc[/I], I use [I]MSI Afterburner[/I] to throttle it back to 80%. There is a little performance drop, but everything runs much cooler. [I]CUDALucas[/I] and [I]CUDAPm1[/I] are not nearly as demanding so I let the card run at its factory defaults. The PSU in the HP is 400W, and proprietary. I do not tax it hard. Replacements are scarce and expensive! The reason I moved the card to the HP was because it became like the watched pot that never boiled in my i7. I was always stopping the process to do something else. I never seemed to get anywhere. In the HP, it can simply run unattended, monitor off, until I have to give it more work. With DC's, it's like a once-a-day peek to see if it needs anything. :smile: |
Tuning mfaktc for GTX 1080 Ti performance
1 Attachment(s)
After seeing mfaktc not produce the expected GhzD/day numbers on my runs, and looking through the thread here for guidance, here's what I found, on my Windows 7 installation.
Following Mark Rose's advice on tuning parameters and sequence, [URL]https://www.mersenneforum.org/showpost.php?p=395719&postcount=2505[/URL] I iterated to individually tune, in order, GPUSieveProcessSize, GPUSieveSize, and GPUSievePrimes, and then went back and varied each separately, recording and plotting the results. There was still some performance left on the table; running a second instance gained about 2%. That reached 1285 GhzD/day, not quite the 1300 GhzD/day benchmark rating, on current assignments (92M, 75-76 bits). All the preceding was with NumStreams=3; I'm not sure that matters. (Varying NumStreams, the results look like measurement noise.) See the attachment. Note that this gpu was thermally throttling at 83C and 175-200 watts, gpu core clock around 1570 MHz, at the time. The plots look consistent with some further gains being possible, particularly if the maximum for GPUSieveSize were larger. |
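The one-parameter-at-a-time procedure described above can be sketched as a small Python loop. This is purely hypothetical helper code, not part of mfaktc; in practice measure() would run a timed mfaktc self-test with the given ini settings and return GHz-d/day, and GPUSievePrimes would be a third entry in the grid.

```python
# Hypothetical tuning sketch (not part of mfaktc): tune parameters one at
# a time, in the given order, keeping the best value found for each.
def tune(measure, grid, settings):
    """grid: list of (parameter name, candidate values); settings: dict of
    current ini values. measure(settings) returns a throughput score."""
    for name, values in grid:
        scores = {v: measure({**settings, name: v}) for v in values}
        settings[name] = max(scores, key=scores.get)
    return settings

# Toy measurement model just to make the sketch runnable; real tuning
# would invoke mfaktc and read its reported GHz-d/day instead.
def fake_measure(s):
    return -abs(s["GPUSieveProcessSize"] - 24) - abs(s["GPUSieveSize"] - 96)

grid = [("GPUSieveProcessSize", [8, 16, 24, 32]),
        ("GPUSieveSize", [32, 64, 96, 128])]
best = tune(fake_measure, grid,
            {"GPUSieveProcessSize": 16, "GPUSieveSize": 64})
print(best)  # {'GPUSieveProcessSize': 24, 'GPUSieveSize': 96}
```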
mfaktc updates?
There was mention of an mfaktc v0.22-pre2 back in 2015 at [URL]https://www.mersenneforum.org/showpost.php?p=402408&postcount=2547[/URL]
What did that offer or plan for implementation, and was it ever released? Given that the GTX 1080 Ti seems to be bumping up against the documented maximums for gpusieveprocesssize and gpusievesize, and there are now faster gpus (RTX20xx), there may be performance gains available by increasing those maximums. Could we get an updated version? Is there some obstacle preventing an increase of the maximum gpusieveprocesssize or gpusievesize? |
RTX gains several % more throughput by 1024 gpusievesize
At [URL]https://www.mersenneforum.org/showpost.php?p=505395&postcount=83[/URL] nomead provides test data for a modified version of mfaktc on RTX2080. He found a 7% gain in indicated throughput going up to gpusievesize 1024. The shape of the curves plotted for his data, in [URL]https://www.mersenneforum.org/showpost.php?p=507031&postcount=106[/URL] is consistent with additional gain at even higher gpusievesize. In my testing there appears to be some potential gain for GTX1080 Ti above 128 also. [URL]https://www.mersenneforum.org/showpost.php?p=506990&postcount=3069[/URL] Modern gpu cards have plenty of memory to support large sizes, even if running multiple instances.
TheJudger, what sort of issues might arise when increasing gpusievesize? How could we spot them? Similarly, on the GTX1080Ti, it looked like an increase in max gpusieveprocesssize would be beneficial. Again, what sort of possible issues, and symptoms? |
[QUOTE=kriesel;507040]At [URL]https://www.mersenneforum.org/showpost.php?p=505395&postcount=83[/URL] nomead provides test data for a modified version of mfaktc on RTX2080. He found a 7% gain in indicated throughput going up to gpusievesize 1024. The shape of the curves plotted for his data, in [URL]https://www.mersenneforum.org/showpost.php?p=507031&postcount=106[/URL] is consistent with additional gain at even higher gpusievesize. In my testing there appears to be some potential gain for GTX1080 Ti above 128 also. [URL]https://www.mersenneforum.org/showpost.php?p=506990&postcount=3069[/URL] Modern gpu cards have plenty of memory to support large sizes, even if running multiple instances.
TheJudger, what sort of issues might arise when increasing gpusievesize? How could we spot them?[/QUOTE] I had a similar increase of speed on my RTX 2060 |
Hi,
[QUOTE=kriesel;507040]At [URL]https://www.mersenneforum.org/showpost.php?p=505395&postcount=83[/URL] nomead provides test data for a modified version of mfaktc on RTX2080. He found a 7% gain in indicated throughput going up to gpusievesize 1024. The shape of the curves plotted for his data, in [URL]https://www.mersenneforum.org/showpost.php?p=507031&postcount=106[/URL] is consistent with additional gain at even higher gpusievesize. In my testing there appears to be some potential gain for GTX1080 Ti above 128 also. [URL]https://www.mersenneforum.org/showpost.php?p=506990&postcount=3069[/URL] Modern gpu cards have plenty of memory to support large sizes, even if running multiple instances. TheJudger, what sort of issues might arise when increasing gpusievesize? How could we spot them? Similarly, on the GTX1080Ti, it looked like an increase in max gpusieveprocesssize would be beneficial. Again, what sort of possible issues, and symptoms?[/QUOTE] Possible issues? Missing factors; everything else doesn't really matter. Will check the max value of gpusievesize in the next version (no ETA yet). Thank you for drawing my attention to this! :smile: I think gpusieveprocesssize is more of a hard(er) limit; on the other hand, the performance increase from 16k to 32k is rather low?! Oliver |
[QUOTE=TheJudger;507050]
Will check max value of gpusievesize in next version (no ETA yet). Thank you for putting my attention on this! :smile: [/QUOTE] I found at least one hard limit, 2047 still passed self tests but 2048 gave an error. [CODE]gpusieve.cu(1276) : CUDA Runtime API error 2: out of memory.[/CODE] And besides, above 1024 the performance increase wasn't so significant, so to be (semi-)safe I'll stay at max 1024 for now. I'm fully aware that I'm poking things I really shouldn't poke :reality-check: |
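A back-of-the-envelope observation (my own speculation, not anything from the mfaktc source): 2048 Mbits is exactly 2^31 bits, so if the sieve size in bits ever passes through a signed 32-bit integer, 2047 would be the largest value that fits. That would at least be consistent with 2047 passing the self tests while 2048 errors out:

```python
# Speculative arithmetic only: GPUSieveSize is in Mbits, so 2048 Mbits is
# exactly 2^31 bits, one past the signed 32-bit maximum; 2047 Mbits fits.
MBIT = 1024 * 1024
assert 2047 * MBIT < 2**31      # largest value that still passed self tests
assert 2048 * MBIT == 2**31     # first failing value
print(2048 * MBIT // 8 // 2**20)  # the sieve buffer would be 256 MiB if allocated
```

Whether the reported "out of memory" actually comes from such an overflow or from a genuine allocation limit is something only the source (or Oliver) can settle.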
[QUOTE=nomead;507052]I found at least one hard limit, 2047 still passed self tests but 2048 gave an error.
[CODE]gpusieve.cu(1276) : CUDA Runtime API error 2: out of memory.[/CODE]And besides, above 1024 the performance increase wasn't so significant, so to be (semi-)safe I'll stay at max 1024 for now. I'm fully aware that I'm poking things I really shouldn't poke :reality-check:[/QUOTE] If you have data for above 1024, please share. Thank you for poking at things! That's how progress occurs. [URL]https://www.goodreads.com/quotes/536961-the-reasonable-man-adapts-himself-to-the-world-the-unreasonable[/URL] And wow, now for some reason my GTX1080Ti is no longer thermally throttling at 83C; seems to be happily running to the power limit at 90-91C, hotter than I'd like but one instance mfaktc 1356 GhzD/day & 93% gpu load, 2 instances 1383 & 97% gpu load. |
[QUOTE=kriesel;507057]If you have data for above 1024, please share.
[/QUOTE] In the [URL="https://www.mersenneforum.org/showpost.php?p=507051&postcount=107"]other thread...[/URL] 1024 -> 1536 +0,7% 1536 -> 2047 +0,2% |
[QUOTE=nomead;507058]In the [URL="https://www.mersenneforum.org/showpost.php?p=507051&postcount=107"]other thread...[/URL]
1024 -> 1536 +0,7% 1536 -> 2047 +0,2%[/QUOTE] The same happens on the RTX 2060 |
gtx 1060 tune variations (likes large gpusievesize)
GTX1060 pwr & vrel limiting
exponent 720M, bit level 81-82
GPUSieveSize=64 GPUSieveProcessSize=16 thruput 412.34 GhzD/day
GPUSieveSize=64 GPUSieveProcessSize=32 thruput 412.88
GPUSieveSize=[B]128[/B] GPUSieveProcessSize=32 thruput [B]420.25[/B] |
Trial factoring concepts
Concepts in GIMPS trial factoring (TF) (note, sort of mfaktc oriented, more so toward the end)
Does this list cover the main concepts? Anything missing, misstated, incomplete, etc.? (Tact encouraged!)
1 Trial factoring work is generally assigned by exponent and bit levels: Mersenne exponent, bit level already trial factored to (or where to start), and bit level to factor up to.
2 Each bit level of TF is about as much computing effort as all bit levels below it, for the same exponent.
3 TF makes use of the special form of factors of Mersenne numbers, f = 2*k*p + 1, where p is the prime exponent of the Mersenne number and k is a positive integer.
4 Use a wheel generator for candidate factors, to efficiently exclude (and not even consider) candidate factors that have very small factors themselves, such as 2, 3, 5, 7 and usually 11. 2*2*3*5*7 = 420 is the less-classes number; 2*2*3*5*7*11 = 4620 is the more-classes number. [URL]https://www.mersenneforum.org/showpost.php?p=200871&postcount=35[/URL] [URL]https://www.mersenneforum.org/showpost.php?p=200887&postcount=37[/URL] (There was discussion of going to 13 or higher, but that was considered not worthwhile; 2*2*3*5*7*11*13 = 60060, etc. Higher complexity and overhead, trading off against the diminishing returns of the incremental number of candidates excluded.)
5 For a given exponent, exclude entire classes of candidate factors with a single test. [URL]https://www.mersenneforum.org/showpost.php?p=200887&postcount=37[/URL]
6 Make use of the special form of factors, that they are 1 or 7 mod 8.
7 Dense representation of the candidate factors, as a bit map of k values. [URL]https://www.mersenneforum.org/showpost.php?p=201884&postcount=72[/URL]
8 These candidate factors are sieved somewhat, but not exhaustively. Candidate factors found composite by sieving are discarded; trialing prime candidates is sufficient, and composite candidates are redundant.
9 Exhaustive sieving of candidate factors can produce less throughput per total computing effort than a lesser amount of sieving of candidates. 
There's user control of sieving depth available in mfaktc and mfakto, to allow adjustment to near-optimal for the exponents, depths, and other variables. [URL]https://www.mersenneforum.org/showpost.php?p=212815&postcount=180[/URL]
10 The sieving level must not exceed the level where possible candidate factors are included in sieving values. This makes sieve limit settings dependent on exponent for low-exponent, low-bit-level combinations. [URL]https://www.mersenneforum.org/showpost.php?p=260105&postcount=788[/URL] (If I recall correctly, in later versions of mfaktc special handling of certain cases relaxes this restriction on sieving level. See #21 below.)
11 Surviving candidate factors are tried.
12 The trial method is to generate a representation of the Mersenne number plus 1 (2^p), modulo the trial factor, by a succession of squarings and doublings according to the bit pattern of the exponent, modulo the trial factor. If 2^p mod f = 1, then 2^p - 1 mod f = 0 and f is a factor. The powering method is much, much faster than trial long division for sizable numbers, and uses far less memory, and more so on both counts for larger numbers. Description and brief small-numbers example at [URL]https://www.mersenne.org/various/math.php[/URL]
13 On significantly parallel computing hardware, such as gpus, many trial factors can be run in parallel, so successive subsets of candidates are distributed over the many available cores.
14 Operation is on multiword integers, sometimes with some bits used for carries. [URL]https://www.mersenneforum.org/showpost.php?p=199203&postcount=17[/URL] [URL]https://www.mersenneforum.org/showpost.php?p=211326&postcount=159[/URL]
15 Different code sequences are written for different bit-level ranges of trial factor, for best speed in each range. Higher bit levels take longer code sequences, so factors tried per unit time decline at higher bit levels on the same hardware and exponent. 
[URL]https://www.mersenneforum.org/showpost.php?p=216430&postcount=231[/URL] Cursory examination of current source code shows the following 17 distinct sequences identified in mfaktc: _71BIT_MUL24 _75BIT_MUL32 _95BIT_MUL32 BARRETT76_MUL32 BARRETT77_MUL32 BARRETT79_MUL32 BARRETT87_MUL32 BARRETT88_MUL32 BARRETT92_MUL32 _75BIT_MUL32_GS BARRETT76_MUL32_GS BARRETT77_MUL32_GS BARRETT79_MUL32_GS BARRETT87_MUL32_GS BARRETT88_MUL32_GS BARRETT92_MUL32_GS _95BIT_MUL32_GS (mfakto is similar although it does not include 95-bit)
16 The Barrett 77 (or was it 76) was derived from the 79. [URL]https://www.mersenneforum.org/showpost.php?p=306572&postcount=1824[/URL] [URL]https://www.mersenneforum.org/showpost.php?p=306808&postcount=1838[/URL] [URL]https://www.mersenneforum.org/showpost.php?p=307238&postcount=1845[/URL]
17 "Funnel shift" Barrett 87. [URL]https://www.mersenneforum.org/showpost.php?p=334251&postcount=2243[/URL]
18 The frequency of primes on the number line declines gradually as the bit level increases. This partially offsets the effect of the longer code sequences for higher bit levels.
19 For gpu applications, there are various implementation approaches for performance. [URL]https://www.mersenneforum.org/showpost.php?p=199799&postcount=29[/URL]
20 Multiple streams and data sets allow concurrent data transfer and processing. [URL]https://www.mersenneforum.org/showpost.php?p=207433&postcount=152[/URL]
21 On gpu sieving: [URL]https://www.mersenneforum.org/showpost.php?p=251120&postcount=554[/URL] Tradeoffs and possible decisions about high gpu sieving, low exponents: [URL]https://www.mersenneforum.org/showpost.php?p=385780&postcount=2409[/URL] [URL]https://www.mersenneforum.org/showpost.php?p=503820&postcount=2999[/URL]
22 There's a small performance advantage to 32-bit images when available, when using GPU sieving. 
(32-bit addresses are smaller, placing less demand on memory bandwidth.) [URL]https://www.mersenneforum.org/showpost.php?p=323678&postcount=1981[/URL] At CUDA 8 or higher (which means GTX10xx or newer models) only 64-bit CUDA is available.
23 How, in the CUDA or OpenCL cases, all those factor candidates get bundled into batches for processing in parallel on many gpu cores is very hazy for me. Presumably it involves using as many of the cores as possible as much of the time as possible, implying work batches of equal size / run time. Prime95 comments on passing out chunks of work and using memory bandwidth efficiently: [URL]https://www.mersenneforum.org/showpost.php?p=292154&postcount=1634[/URL] Oliver writes to R. Gerbicz about it: [URL]https://www.mersenneforum.org/showpost.php?p=504233&postcount=3020[/URL]
24 Could FP be faster than integer math for the kernels? Probably not. [URL]https://www.mersenneforum.org/showpost.php?p=383607&postcount=2377[/URL] Mark Rose did a detailed analysis in 2014 (and maybe revisiting that for recent gpu designs would be useful). [URL]https://www.mersenneforum.org/showpost.php?p=384137&postcount=2380[/URL]
25 mfaktc v0.21 announcement, and ruminations about a possible v0.22: [URL]https://www.mersenneforum.org/showpost.php?p=395689&postcount=2492[/URL]
26 ini file parameter tuning advice, [URL]https://www.mersenneforum.org/showpost.php?p=395719&postcount=2505[/URL] - 2508, in this order: GPUSieveProcessSize, GPUSieveSize, GPUSievePrimes.
27 There was mention of an mfaktc v0.22-pre2 back in 2015 at [URL]https://www.mersenneforum.org/showpost.php?p=402408&postcount=2547[/URL] What did it offer or plan for implementation, and was it ever released?
28 Experimenting on tuning with a GTX 1080 Ti on Windows 7, it seems to like the upper end of the tuning variable limits. So do some lesser models. Possibly the faster cards would benefit from higher maximums than documented for mfaktc v0.21. 
[CODE]# GPUSieveSize defines how big of a GPU sieve we use (in M bits).
# Minimum: GPUSieveSize=4
# Maximum: GPUSieveSize=128
# Default: GPUSieveSize=64
GPUSieveSize=128

# GPUSieveProcessSize defines how many bits of the sieve each TF block
# processes (in K bits). Larger values may lead to less wasted cycles by
# reducing the number of times all threads in a warp are not TFing a
# candidate. However, more shared memory is used which may reduce occupancy.
# Smaller values should lead to a more responsive system (each kernel takes
# less time to execute). GPUSieveProcessSize must be a multiple of 8.
# Minimum: GPUSieveProcessSize=8
# Maximum: GPUSieveProcessSize=32
# Default: GPUSieveProcessSize=16
GPUSieveProcessSize=32[/CODE] GPUSievePrimes ~94000
29 Some RTX20xx owners have modified the tuning variable limits, recompiled, and obtained performance gains of several percent, with diminishing incremental returns as GPUSieveSize grows to GB.
30 The decoupling of FP from integer math in the RTX20xx may change the tradeoffs significantly or allow use of both simultaneously for TF math. As far as I know this is as yet unexplored. |
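Points 3, 6, 8, 11 and 12 above can be condensed into a tiny worked example. This is an illustrative Python sketch only, not how mfaktc is organized (no classes, no bitmap, only a token sieve); it shows the factor form, the mod-8 filter, light sieving, and the powering test:

```python
# Illustrative TF sketch (not mfaktc's implementation): candidates
# f = 2*k*p + 1, keep only f = 1 or 7 (mod 8), lightly sieve by small
# primes, then test 2^p mod f == 1 via modular powering.
SMALL_PRIMES = [3, 5, 7, 11]

def trial_factor(p, k_max):
    for k in range(1, k_max + 1):
        f = 2 * k * p + 1                         # point 3: special factor form
        if f % 8 not in (1, 7):                   # point 6: 1 or 7 mod 8 only
            continue
        if any(f % q == 0 for q in SMALL_PRIMES): # point 8: sieve out composites
            continue
        if pow(2, p, f) == 1:                     # point 12: powering, not division
            return f                              # f divides 2^p - 1
    return None

print(trial_factor(23, 10))   # 47 = 2*1*23 + 1 divides M23 = 8388607
print(trial_factor(29, 10))   # 233 = 2*4*29 + 1 divides M29
```

For p = 29, k = 1 is rejected by the mod-8 test (59 = 3 mod 8) and k = 2 and 3 are sieved out, so only k = 4 reaches the powering test; that is the whole point of filtering before the expensive step.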
[QUOTE=kriesel;507954]16 The Barrett 77 (or was it 76) was derived from the 79 [URL]https://www.mersenneforum.org/showpost.php?p=306572&postcount=1824[/URL]
[URL]https://www.mersenneforum.org/showpost.php?p=306808&postcount=1838[/URL] [URL]https://www.mersenneforum.org/showpost.php?p=307238&postcount=1845[/URL][/QUOTE] Correct but not complete. The first Barrett kernel in mfaktc was BARRETT92; all other kernels are stripped-down versions. From BARRETT92 we went to BARRETT79 first (fixed inverse, multiple bits in a single stage possible, a bit faster). From there we go from BARRETT92 to BARRETT88 and BARRETT87 by (re)moving interim correction steps and some other "tricks" (loss of accuracy in interim steps; small example: 22 mod 10 yields 12 (instead of 2)). Trading accuracy for speed. The same "tricks" lead from BARRETT79 to BARRETT77 and BARRETT76. [QUOTE=kriesel;507954]25 mfaktc v0.21 announce, and ruminations about possible v0.22 [URL]https://www.mersenneforum.org/showpost.php?p=395689&postcount=2492[/URL] 27 There was mention of an mfaktc v0.22-pre2 back in 2015 at [URL]https://www.mersenneforum.org/showpost.php?p=402408&postcount=2547[/URL] What did that offer or was planned for implementation, and was it ever released?[/QUOTE] "-pre" versions aren't released into the wild and are not intended for production usage. It removed old stuff (CC 1.x code, CUDA compatibility < 6.5 dropped) plus minor changes and bugfixes. Oliver |
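The "trading accuracy for speed" idea can be made concrete with a generic toy Barrett reduction in Python (my own sketch, not mfaktc's BARRETT kernels, and the modulus and shift are made up). The quotient estimate q can come out slightly low, so the "reduced" value may overshoot the modulus, in the same spirit as the "22 mod 10 yields 12" example, while remaining congruent mod m, which is all that matters until the final compare:

```python
def barrett_reduce_relaxed(x, m, shift, mu):
    """Reduce x modulo m using a precomputed inverse mu ~= 2**shift // m.
    The final correction step is skipped, so the result is only guaranteed
    to be congruent to x (mod m); it may slightly exceed m."""
    q = (x * mu) >> shift        # quotient estimate (may be a bit low)
    return x - q * m             # congruent to x mod m, possibly >= m

m, shift = 10, 8
mu = (1 << shift) // m           # precomputed once per modulus (here: 25)
r = barrett_reduce_relaxed(112, m, shift, mu)
print(r, r % m)                  # 12 2 -- "112 mod 10 yields 12", still 2 (mod 10)
```

In the stripped-down kernels the deferred corrections are either folded into later steps or skipped entirely, which is why an occasional over-wide interim value is acceptable.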
[QUOTE=TheJudger;508134]Correct but not complete. First barrett kernel im mfaktc was BARRETT92, all other kernels are stripped down versions.
From BARRETT92 to BARRETT79 first (fixed inverse, multibit in single stage possible, a bit faster) From there we go from BARRETT92 to BARRETT88 and BARRETT87 by (re)moving interim correction steps and some other "tricks" (loss of accuracy in interim steps (small example 22 mod 10 yields 12 (instead of 2))). Trading accuracy for speed. The same "tricks" lead from BARRETT79 to BARRETT77 and BARRETT76. "-pre" versions aren't released into the wild and are not intended for productive usage. Removed old stuff (CC 1.x code, CUDA compatibility < 6.5 dropped, minor changes and bugfixed). Oliver[/QUOTE]Thanks for the review, followup on Barretts, and clarification on -pre versions. I'm looking forward to updates and any bug fixes or enhancements whenever they're ready for field testing. (I can throw a variety of gpu models at it, from CC2.0 up to gtx1080Ti) One other thing: there are _gs variations of lots of kernels. What does that _gs mean? |
[QUOTE=kriesel;508137]One other thing: there are _gs variations of lots of kernels. What does that _gs mean?[/QUOTE]
[B]G[/B]PU [B]s[/B]ieve |
[QUOTE=kriesel;507954]Concepts in GIMPS trial factoring (TF) (note, sort of mfaktc oriented, more so toward the end)[/QUOTE]This would make a good entry for the wiki.
|
[QUOTE=Uncwilly;508271]This would make a good entry for the wiki.[/QUOTE]
It's going in one of my reference threads. |
[QUOTE=Uncwilly;508271]This would make a good entry for the wiki.[/QUOTE]
Feel free to link to [url]https://www.mersenneforum.org/showpost.php?p=508523&postcount=6[/url] from the wiki. |
Ugh... I'm trying to compile mfaktc on Windows now. Instead of Visual Studio 2012, I got Visual Studio 2017 (Community). File paths are all over the place. It really took a while to find all the extra bits needed so that the compile job would run through. But it seems that installing C++/CLI support, then finding and running vcvars64.bat, finally did the trick.
I already installed MinGW earlier for other purposes and thus had GNU make. I also installed CUDA Toolkit 10.0.130. Still, after a successful compile, the executable gives this error (also included are the last bits of info given by the program): [CODE]CUDA version info
binary compiled for CUDA 10.0
CUDA runtime version 10.0
CUDA driver version 10.0
CUDA device info
name GeForce RTX 2060
compute capability 7.5
max threads per block 1024
max shared memory per MP 65536 byte
number of multiprocessors 30
clock rate (CUDA cores) 1830MHz
memory clock rate: 7001MHz
memory bus width: 192 bit
Automatic parameters
threads per grid 983040
GPUSievePrimes (adjusted) 82486
GPUsieve minimum exponent 1055144
running a simple selftest...
ERROR: cudaGetLastError() returned 8: invalid device function[/CODE] Which is strange, since I added this in the Makefile [CODE]NVCCFLAGS += --generate-code arch=compute_75,code=sm_75 # CC 7.5 Turing[/CODE] And it seems to generate 7.5 code during the compilation process. The same thing also happens if I replace code=sm_75 with code=compute_75 to enable just-in-time compilation. It shouldn't be because of VS 2017 / VS 2012 differences, but who knows? Maybe I'll try that, too, but not right now :smile: |
[QUOTE=nomead;508640]It shouldn't be because of VS 2017 / VS 2012 differences, but who knows? Maybe I'll try that, too, but not right now :smile:[/QUOTE]
How wrong can I be? First of all, I *had* to try it now on VS 2012. And now, it works! Even despite the NVCC compiler showing warnings like this: [CODE]support for this version of Microsoft Visual Studio has been deprecated! Only the versions between 2013 and 2017 (inclusive) are supported![/CODE] |
Compilation notes
The basic outline is documented in the mfaktc README.txt, but here are the specific steps I had to do, to make it work. Let's forget about Visual Studio 2017 for the moment and concentrate on Visual Studio 2012. All installation packages listed here are available for free. Even though a Microsoft account is needed for downloading VS2012 Express, it's free to use. And I'm running on Windows 7 64-bit.
First, I got 64-bit MinGW (originally for other reasons, but it includes GNU make) from [URL="https://nuwen.net/mingw.html"]https://nuwen.net/mingw.html[/URL] From there, mingw-16.1-without-git.exe is enough for our purposes. Install that somewhere.

Then, Visual Studio 2012 Express for Windows Desktop: [URL="https://my.visualstudio.com/Downloads?q=visual%20studio%202012%20express"]https://my.visualstudio.com/Downloads?q=visual%20studio%202012%20express[/URL] Log in, or create an account and then log in. The one marked "Visual Studio Express 2012" only works on Windows 8 (and up, maybe?) but the "for Windows Desktop" one also works on Windows 7. I got the installer EXE and then ran it.

Finally, CUDA Toolkit 10.0: [URL="https://developer.nvidia.com/cuda-downloads"]https://developer.nvidia.com/cuda-downloads[/URL] Download and install.

Then prepare the Makefile.win. First of all, you need to change the CUDA_DIR to point to where your CUDA Toolkit was installed. For me this was [CODE]CUDA_DIR = "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0"
[/CODE] After that, add code generation for the cards you're planning to use. For example, [CODE]NVCCFLAGS += --generate-code arch=compute_60,code=sm_60 # CC 6.0 Pascal / GTX10xx
NVCCFLAGS += --generate-code arch=compute_70,code=sm_70 # CC 7.0 Volta / Titan V
NVCCFLAGS += --generate-code arch=compute_75,code=sm_75 # CC 7.5 Turing / RTX20xx, GTX16xx
[/CODE]

Then there was a problem with NVCC that needed a fix. It expects to find vcvars64.bat in a certain place, and it seems that the VS2012 Express installer doesn't put it there. Go to C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\bin and see if the subfolder amd64 exists there, with vcvars64.bat inside it. If not, you need to copy the subfolder x86_amd64 and its contents to amd64, and rename the now-copied vcvarsx86_amd64.bat to vcvars64.bat.

Finally, time to start compiling. Start a command prompt window. 
Go to the root folder of where you installed MinGW and run set_distro_paths.bat from there. Then go to wherever that vcvars64.bat is and run it. Then go to the mfaktc-0.21 source folder and run make -f Makefile.win. Wait a while... (it seems to take a whole lot longer than with Linux gcc + nvcc) Done! If you want to compile other versions (more/less classes, Wagstaff), these can be set by editing params.h and then recompiling. |
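Condensed, the compile sequence above amounts to three steps in one command prompt. This is only a sketch: the VS path is the default from this post, while the MinGW and mfaktc paths are placeholders, so adjust everything to your own install locations.

```bat
:: Sketch of the build sequence described above (Windows 7, VS2012 Express,
:: CUDA 10.0). Paths are assumed defaults / placeholders.

:: 1. Put GNU make (from MinGW) on the PATH.
cd /d C:\MinGW
call set_distro_paths.bat

:: 2. Set up the 64-bit VC++ compiler environment.
cd /d "C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\bin\amd64"
call vcvars64.bat

:: 3. Build mfaktc.
cd /d path\to\mfaktc-0.21
make -f Makefile.win
```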
TF concepts updated
Some of the existing points have been refined or expanded, and I've added several additional points recently. It's now up to 40 entries. It's at [URL]https://www.mersenneforum.org/showpost.php?p=508523&postcount=6[/URL]
|
Another poke at the internals. I was factoring a few exponents in mfaktc where the bit depth was 76-77 (among others). I wondered why the barrett87_mul32_gs kernel was chosen instead of barrett77_mul32_gs. Then I looked at the kernel_benchmarks.txt in the source directory. Okay, tests were done back in CUDA 5.5 days and the freshest card used was a Tesla K20m, three (and a half) generations old by now. That got me wondering, again, have things changed? Well, of course, I HAD to do some benchmarking of my own on Turing, and at least there, yes they have. Not by much, but now barrett77 is faster than barrett87 by about 1%.
Exponent tested: 66362159, bit depth 68-69 (the same as in kernel_benchmarks.txt), less classes, debug RAW GPU BENCH mode on (disables sieving, so the GHz-d/d numbers are low because of that), CUDA 10.1 and RTX 2080 locked at 1800 MHz: [CODE]
                     time       GHz-d/day
barrett76_mul32_gs   02:15.827  572.49
barrett77_mul32_gs   02:24.794  537.04
barrett87_mul32_gs   02:26.262  531.65
barrett88_mul32_gs   02:30.296  517.38
barrett79_mul32_gs   02:45.376  470.20
barrett92_mul32_gs   02:56.342  440.96
75bit_mul32_gs       04:54.998  263.60
95bit_mul32_gs       06:04.134  213.55
[/CODE] There is a selection table in mfaktc.c that only checks for compute capability 1.x (where the speed order was 76 -> 77 -> 87 -> 88 -> 79 -> 92) and all the rest get 76 -> 87 -> 88 -> 77 -> 79 -> 92. So the barrett77_mul32_gs kernel is in effect never selected on anything newer than GTX2xx. It's a small difference, and it only affects this single bit depth, and separate benchmarks should be run on every architecture to see if there are any changes there as well. A lot of work, so is it worth it? I'd like to think yes, since GPU72 is now factoring over 76 bits, and every little bit of extra performance should help. |
It would be worth it to test numbers around 90M, 100M, and 110M, too.
|
[QUOTE=Mark Rose;510774]It would be worth it to test numbers around 90M, 100M, and 110M, too.[/QUOTE]
Indeed. The GPU TF'ers are currently taking 91M and up to 77 "bits"; 90M is already done. It might also be worth testing at 332M, to see if there's any optimization that could be squeezed out using different kernels going to 81 "bits". |
[QUOTE=nomead;510750]There is a selection table in mfaktc.c that only checks for compute capability 1.x (where the speed order was 76 -> 77 -> 87 -> 88 -> 79 -> 92) and all the rest get 76 -> 87 -> 88 -> 77 -> 79 -> 92. [COLOR="Red"]So the barrett77_mul32_gs kernel is in effect never selected on anything newer than GTX2xx.[/COLOR][/QUOTE]
Are you sure about this? I'm not! Hint: check kernel_possible() in the same file. Last time I did some benchmarks, barrett 87 and 88 were faster than 77 (Pascal series). Oliver |
[QUOTE=TheJudger;510820]Are you sure about this? I'm not! Hint: check kernel_possible() in the same file.
Last time I did some benchmarks, barrett 87 and 88 were faster than 77 (Pascal series). Oliver[/QUOTE] Well, not 100% sure of course, but isn't kernel_possible() just called from tf() to see whether a certain kernel works at all for the selected bit range combination, saying nothing about the relative speed? I may have oversimplified a bit when I said "in effect never gets selected", as it can fall through all the way to barrett77 if 87 and 88 wouldn't work. Ah yes, there's that extra check on the barrett87, 88 and 92 rows to see whether it's factoring more than one bit depth range at once, and then those aren't selected. So, with the code as it is, for compute capability greater than 1.x:

76-77 gets barrett87_mul32_gs
75-77 gets barrett77_mul32_gs
78-79 gets barrett87_mul32_gs
77-79 gets barrett79_mul32_gs
79-80 gets barrett87_mul32_gs
78-80 or 79-81 will actually get 95bit_mul32_gs

But I'd like to think that since factoring at these bit levels takes quite a while, most people would be running with the default Stages=1 set in mfaktc.ini. This is my reasoning behind that "in effect never"... The one thing I'm not at all sure about is the 1% improvement. On real-life work the difference seems to be less than that (still on Turing). I'll have to gather some more timing information, but this will take a while longer. :smile: |
[QUOTE=nomead;510824]I'll have to gather some more timing information, but this will take a while longer. :smile:[/QUOTE]
Okay, I was shocked. For whatever reason, there is pretty much no measurable performance difference between barrett77 and 87 as tested on real work. So, again, RTX 2080, GPU clock locked at 1800 MHz. Six exponents each in the M9152xxxx range factored from 76 to 77 bits. All are reported as 167.21 GHz-days. Average for the barrett77 runs: 1 hour 18 minutes 40.223 seconds. And for the barrett87 runs: 1 hour 18 minutes... 42.352 seconds. It's well within the measurement error margin now. I wonder why I saw that 1% earlier, but then, that was for a single run for each kernel. So, nothing needs to be changed, it doesn't make any difference. Meh. :yawn: |
[QUOTE=nomead;510883]Okay, I was shocked. For whatever reason, there is pretty much no measurable performance difference between barrett77 and 87 as tested on real work. So, again, RTX 2080, GPU clock locked at 1800 MHz. Six exponents each in the M9152xxxx range factored from 76 to 77 bits. All are reported as 167.21 GHz-days. Average for the barrett77 runs: 1 hour 18 minutes 40.223 seconds. And for the barrett87 runs: 1 hour 18 minutes... 42.352 seconds. It's well within the measurement error margin now. I wonder why I saw that 1% earlier, but then, that was for a single run for each kernel.
So, nothing needs to be changed, it doesn't make any difference. Meh. :yawn:[/QUOTE] No problem. And yes, those run-to-run variations are annoying. On a stock Geforce you have a power target, a temperature target, the actual temperature and so on. Even when you try to lock a specific clock rate you get those (minor) run-to-run variations. This happens on Tesla, too, though on Tesla it is much easier to make sure you're running at a fixed clock rate (just set a relatively low application clock). For benchmarks/comparisons you should always run in a realistic setting and not on stuff like "RAW GPU BENCH". Oliver |
Help. I'm running mfaktc 0.21 cuda 65 right now. I have a GTX 960 and saw there was a cuda 80 and a cuda 100 version of mfaktc. What's the difference between them, and should I run another version?
/Arvid |
[QUOTE=Thecmaster;511001]Help. I'm running mfaktc 0.21 cuda 65 right now. I have a GTX 960 and saw there was a cuda 80 and a cuda 100 version of mfaktc. What's the difference between them, and should I run another version?
/Arvid[/QUOTE]Test them and see what's faster on your card. Note that mfaktc tuning can make a several percent difference for a set version. CUDA 6.5 has done well in speed comparisons in my testing in CUDALucas. (I don't have a GTX960.) |
[QUOTE=kriesel;511004]Test them and see what's faster on your card. Note that mfaktc tuning can make a several percent difference for a set version. CUDA 6.5 has done well in speed comparisons in my testing in CUDALucas. (I don't have a GTX960.)[/QUOTE]
Just tested cuda 100 and got 10% faster. I will test 80 too and take the one with the best speed. The speed on 80 was just 8% faster than 65, so 100 it is. Thanks for the help. /Arvid |
[QUOTE=Thecmaster;511029]Just tested cuda 100 and got 10% faster. I will test 80 too and take the one with the best speed.
The speed on 80 was just 8% faster than 65. So 100 it is. ty for help. /Arvid[/QUOTE]Thanks, I've not looked into above 8.0 myself yet, looks like there may be some gains there for some of my fleet too. Was your testing with or without tuning? See [URL]https://mersenneforum.org/showpost.php?p=395719&postcount=2505[/URL] Gpu clock constant, or power limited, or allowed to fluctuate? |
[QUOTE=kriesel;511037]Thanks, I've not looked into above 8.0 myself yet, looks like there may be some gains there for some of my fleet too.
Was your testing with or without tuning? See [URL]https://mersenneforum.org/showpost.php?p=395719&postcount=2505[/URL] Gpu clock constant, or power limited, or allowed to fluctuate?[/QUOTE] No, I didn't tune any of that. I was just on my way to search for information on that or ask about it. I looked around in the mfaktc.ini file and found some interesting things to tweak, but didn't know where to start. I have done some tuning now:

GPUSieveProcessSize=32
GPUSieveSize=128
GPUSievePrimes=110000 (this gets adjusted to 110134 when the program starts)

This gave me a bit more throughput. With 6.5 I got 303 GHz-d/day, with 10.0 I got 331, and after tweaking I got 337. This is on a GTX 960 2GB. |
CPU impacts GPU more than I expected.
I have a 2080Ti GPU running mfaktc
on an i7-7820X with 32GB of 3600 DDR4 RAM running Large P-1 on all 8 cores. The CPU is running at 60 degrees F and the GPU at 81 degrees F. The GPU is at about 3,900 GHzDays/Day, but if I stop Prime95 the GPU throughput immediately goes to about 4,250. The GPU stays at 81 degrees F. If I restart Prime95, the GPU stays at 4,250 until about the time all 8 cores are started, have the RAM allocated and are running the P-1 again. In other words, the total throughput of the rig is LOWER when the CPU is busy. It does about 75 GHzDays/Day of P-1 while the GPU loses about 300. I don't know if the impact would be the same if I was running LL instead of P-1 (much less RAM), though my guess is it would be about the same impact. |
I see the same thing, so I always leave at least one hyperthreaded core idle to maximize mfaktc throughput.
|
[QUOTE=petrw1;511247]I have a 2080Ti GPU running mfaktc
on an i7-7820X with 32GB of 3600 DDR4 RAM running Large P-1 on all 8 cores. The CPU is running at 60 degrees F and the GPU at 81 degrees F. The GPU is at about 3,900 GHzDays/Day, but if I stop Prime95 the GPU throughput immediately goes to about 4,250. The GPU stays at 81 degrees F. If I restart Prime95, the GPU stays at 4,250 until about the time all 8 cores are started, have the RAM allocated and are running the P-1 again. In other words, the total throughput of the rig is LOWER when the CPU is busy. It does about 75 GHzDays/Day of P-1 while the GPU loses about 300. I don't know if the impact would be the same if I was running LL instead of P-1 (much less RAM), though my guess is it would be about the same impact.[/QUOTE] 60[B]F[/B] and 81[B]F[/B]? What's the ambient temperature where this system is located? My systems are typically 70-85[B]C[/B] on the cpu cores and 80-90[B]C[/B] on the gpu, even in a system with TEN or more fans. HDs are ~29[B]C[/B]. Ambient is typically 20-27C.

It's common for modern gpus to clock faster at GIMPS gpu application startup, then dial back as the gpu warms up and the gpu fan speed goes up (even if the cpu is idle for the duration). Pretty much the same goes for the cpu cores; cpus thermally regulate by changing clock rate. Perhaps you've allowed for the effect of thermal time constants by providing plenty of time stagger in your throughput test, but that was not apparent to me in your post.

Another consideration is that a primality test and P-1 are one kind of calculation, and TF another, which gpus are very good at. GHzD/day in primality and P-1 on a gpu are much lower than the same gpu's TF throughput. Another way to think about that is that TF GHzD are cheap, while primality and P-1 GHzD are more valuable, since the cost of ~63 GHzD of primality or P-1, or ~1016 GHzD of TF, is about the same: a GTX1080-day, a ~16:1 exchange rate. The "exchange rate" is even more extreme on the RTX20xx, around 40:1 as I recall; -300/40 = -7.5, not bad for +75 from your cpu.
The tradeoff is worse when using igp's on laptops to do TF. The igp uses part of the package power budget, reducing primality or P-1 throughput on the cpu side of the chip by around half. It can still be worthwhile on a raw GHzD/day basis, but not necessarily when allowing for the exchange rate. |