
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

preda 2018-12-26 07:43

[QUOTE=GP2;503961]Is it somehow possible, at least in principle, to accumulate full-size blocks that may include work from more than one class or even more than one exponent, and then send that as a batch to the GPU?

In other words, how much of the overhead is fundamental to the way the CUDA code works, and how much is just a legacy implementation choice? I know very little about CUDA and am trying to decide if it's worth learning more.[/QUOTE]

My impression is that you can't easily combine blocks or exponents. This is not because of CUDA but because of the algorithm, which pre-computes a set of per-exponent, per-class values once before starting each class.

About learning more -- sure, it's worth it, if you have the time and the inclination. The concepts are pretty much exactly the same between CUDA and OpenCL, only the details differ, so you almost get to learn OpenCL at the same time :)

[QUOTE]
But most people will have a 64-bit architecture, I think? Or are you referring to single-precision vs. double-precision speed within the GPU?[/QUOTE]

It's a question of the memory taken up by a large number of 32-bit integers vs. the same number of 64-bit integers.

R. Gerbicz 2018-12-26 12:33

Interesting discussion, but there are lots of misunderstandings here.

[QUOTE=TheJudger;503945]
Another source of overhead is the fact that mfaktc doesn't handle partial blocks; the last block is always fully used - with more classes the work above the upper limit increases as well.
You can compare the two versions (420 vs. 4620 classes) and give an estimate of when the next prime should be included in the number of residue classes.
[/QUOTE]

So at any one time you are sieving only one residue per gpu thread? In that case you are being too restrictive; I'll return to this.

[QUOTE=GP2;503820]Currently, 960/4620 means that 20.78% of the classes are "classes_needed", and then I guess sieve_init_class filters out some percentage of the remainder.

But if it was possible to go all the way to 2 × (2 × 3 × 5 × 7 × 11 × 13 × 17 × 19 × 23 × 29), then there would be a lot more classes but only 15.79% of those classes would be needed. It would be one-quarter fewer as a percent of the total classes. [/QUOTE]

Almost, in theory you get that speedup, but only in the sieve. In the mp%q calculations (where mp=2^p-1) it gives no speedup; but there is a trade-off: lower sieving time could allow a higher sieve limit, which in turn increases sieving time again but lowers the number of mp%q calculations, letting you rebalance toward the optimal total computation time.

The sieving interval's length is approx 2^74/M ~ 2.3*10^9 k-values per residue class for p~2^26, N1=2^74, N2=2^75; see below for M=8*p*3*5*7*11*13. Still way too large for a smallish 32000 (?) sieve length.
[you can try more primes, but using the primes up to 29 is excessive: that gives roughly 11000 for the length]
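For illustration, those interval lengths can be checked with a few lines of Python (a sketch; taking p as exactly 2^26 is my simplification, not a real wavefront exponent):

```python
# Per-residue sieving interval length ~ N1/M, as in the post.
p = 2**26                            # assumed: p taken as exactly 2^26
N1 = 2**74

M13 = 8 * p * 3 * 5 * 7 * 11 * 13    # modulus with odd primes up to 13
M29 = M13 * 17 * 19 * 23 * 29        # modulus with odd primes up to 29

print(N1 // M13)   # ~2.3 * 10^9 k-values per residue class
print(N1 // M29)   # roughly 11000, as noted above
```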
Well, I'm not coding in gpu, but here are my thoughts:

[url]https://www.mersenneforum.org/showthread.php?t=22435&page=13[/url] - gap12 is/was the latest code from me. See my prime gap code; the task is comparable, and it shows what you can do in an even harder problem, where you need to sieve in a harder way. And it was true sieving: we sieved with primes up to sqrt(n) on the sieve intervals, and used a large modulus in a much more complicated setup, m=2*3*5*7*11*13*17*19. You could ask why we haven't used the next prime - as in your case, there is an overhead when you switch intervals. These tasks have a similar flavour: for one large residue we sieved an interval of ~1e15 numbers.

Returning to your problem:

Say you want to sieve q=2*k*p+1 in the range N1..N2 (say p~2^26 in the current wavefront, and N1=2^74, N2=2^75 a typical(?) input).

Using small primes in the sieve (and also the knowledge that q mod 8 = 1 or 7, from quadratic reciprocity):

q=res+k*M for k=k0..k1, and here k0,k1 are independent of res (this k is not the same as above).
M=8*p*3*5*7*11*13.
where k0=floor(N1/M) and k1=floor(N2/M).
With this you will sieve at most two extra numbers per residue class (at most one number <N1 and at most one >N2 per res).
Note that res==1 mod (2*p), because q==1 mod (2*p).

If you have cnt residues (mod M) then you'll sieve NR=cnt*(k1-k0+1)=11520*(k1-k0+1) numbers in our case (if you're using the primes up to 13). With nothing more than elementary school arithmetic you can distribute these numbers into C classes in an almost equal way: the numbers in [floor(i*NR/C),floor((i+1)*NR/C)) will be in the i-th class/task for i=0..(C-1).
And this is independent of 11520 and of the number of GPU threads; you can use C=1, C=1000, C=1024, C=20480 (yes, smaller/larger than 11520). From the value of i you can get the residues and the intervals (for each residue) that you need to do. Yes, different threads could work on the same residue class, and the same thread could do multiple residues (of course not at once in this case)!
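A minimal Python sketch of that almost-equal distribution (function and variable names are mine, not from any existing code):

```python
def task_range(i, NR, C):
    """Half-open index range [lo, hi) assigned to task i out of C tasks."""
    return (i * NR) // C, ((i + 1) * NR) // C

# Example: NR = 11520 residues times 1000 k-values, split into C tasks.
NR = 11520 * 1000
for C in (1, 1000, 1024, 20480):           # smaller or larger than 11520
    sizes = [task_range(i, NR, C)[1] - task_range(i, NR, C)[0]
             for i in range(C)]
    assert sum(sizes) == NR                # every number covered exactly once
    assert max(sizes) - min(sizes) <= 1    # tasks differ by at most one number
```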

About checkpoints:
Don't know how you are doing checkpoint(s). Say with 1024 gpu threads and with C=10000 you would lose only ~T/20 of the work (on average), where T is the total computation time, if you're distributing 1000 tasks ten times. (Note that here we are not using all threads, and even though 10000<11520 the above 'equal way' distribution works nicely.) For a much deeper task (N2~2^80 or so) use a larger C and/or more small sieve primes (17-19-23).

[QUOTE=TheJudger;503945]Hi,
More classes means that the distance between two factor candidates increases - at some point you have to take care of this, too (e.g. 64-bit instead of 32-bit difference values).
[/QUOTE]

Using 64 bits? Where in the sieving, when you are using sieve primes up to, say, 32000?

For the sieving initialization in the CPU case I would precompute and store ((2*p) mod R) and (M mod R) for every sieve prime R < 32000 (or whatever limit you are using). But maybe that isn't needed on the GPU: you have to do it when you start/switch to a new residue, so basically you don't need to store/precompute it.
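A sketch of that per-residue initialization in Python; the modular-inverse step is my assumption about how the first sieve hit would be located (the inverse exists because the sieve primes R are larger than the primes inside M, so gcd(M, R) = 1):

```python
def first_hit(res, M, R, k0):
    """Smallest k >= k0 with (res + k*M) % R == 0, assuming gcd(M, R) == 1."""
    m_inv = pow(M % R, -1, R)     # M % R is the precomputable part (Python 3.8+)
    k = (-res * m_inv) % R        # solve res + k*M == 0 (mod R)
    return k0 + (k - k0) % R

# Toy numbers, not wavefront-sized:
M, R, res, k0 = 2310, 13, 17, 1000
k = first_hit(res, M, R, k0)
assert (res + k * M) % R == 0 and k0 <= k < k0 + R
```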

R. Gerbicz 2018-12-26 13:17

One more minor thought:
If you are using (many) more tasks than threads, then it is possible that the first factor you find is not the smallest prime factor
of mp=2^p-1, but that is no issue for us...
In any case, in these situations it would be possible to scan the whole interval to find the smallest/all prime factors; since the probability of success is small, it would change almost nothing.
Yes, finding a smaller factor is easier/faster, but the difference is tiny, roughly (up to a constant multiplier): 1/74-1/75 ~ 1/74^2 for N1=2^74, N2=2^75.
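The arithmetic behind that last estimate, checked exactly: 1/74 - 1/75 = 1/(74*75), which is within a few percent of 1/74^2.

```python
from fractions import Fraction

# Difference between the "factor below 2^74" and "factor below 2^75"
# probabilities used in the post, compared against the 1/74^2 estimate.
diff = Fraction(1, 74) - Fraction(1, 75)
assert diff == Fraction(1, 74 * 75)            # exactly 1/5550
assert abs(float(diff) - 1 / 74**2) < 3e-6     # 1/74^2 = 1/5476, same order
```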

TheJudger 2018-12-26 23:32

[QUOTE=GP2;503961]But most people will have a 64-bit architecture, I think? Or are you referring to single-precision vs. double-precision speed within the GPU?[/QUOTE]

Not really; the Maxwell and Pascal generations have only 16-bit integer multiplication in HW (a 32-bit multiplication is constructed from four 16-bit multiplications).

TheJudger 2018-12-27 00:06

[QUOTE=R. Gerbicz;504006]Interesting discussion, but there are lots of misunderstandings here.[/QUOTE]

Likely!

[QUOTE=R. Gerbicz;504006]So at any one time you are sieving only one residue per gpu thread? In that case you are being too restrictive; I'll return to this.[/QUOTE]

All threads (the whole GPU) are working on a single residue class.


[QUOTE=R. Gerbicz;504006]Say you want to sieve q=2*k*p+1 in the range N1..N2 (say p~2^26 in the current wavefront, and N1=2^74, N2=2^75 a typical(?) input).

Using small primes in the sieve (and also the knowledge that q mod 8 = 1 or 7, from quadratic reciprocity):

q=res+k*M for k=k0..k1, and here k0,k1 are independent of res (this k is not the same as above).
M=8*p*3*5*7*11*13.
where k0=floor(N1/M) and k1=floor(N2/M).
With this you will sieve at most two extra numbers per residue class (at most one number <N1 and at most one >N2 per res).
Note that res==1 mod (2*p), because q==1 mod (2*p).

If you have cnt residues (mod M) then you'll sieve NR=cnt*(k1-k0+1)=11520*(k1-k0+1) numbers in our case (if you're using the primes up to 13). With nothing more than elementary school arithmetic you can distribute these numbers into C classes in an almost equal way: the numbers in [floor(i*NR/C),floor((i+1)*NR/C)) will be in the i-th class/task for i=0..(C-1).
And this is independent of 11520 and of the number of GPU threads; you can use C=1, C=1000, C=1024, C=20480 (yes, smaller/larger than 11520). From the value of i you can get the residues and the intervals (for each residue) that you need to do. Yes, different threads could work on the same residue class, and the same thread could do multiple residues (of course not at once in this case)![/QUOTE]

Talking about the "too much work at the end of the residue class"? Keep in mind that GPUs (especially older GPUs) don't like divergent thread branches - it is better to keep all threads on the same code path. You're right, calculating the limits is very easy, but you can't just use a simple for loop to walk through the factor candidates. Parallelism is SIMD.

[QUOTE=R. Gerbicz;504006]About checkpoints:
Don't know how you are doing checkpoint(s). Say with 1024 gpu threads and with C=10000 you would lose only ~T/20 of the work (on average), where T is the total computation time, if you're distributing 1000 tasks ten times. (Note that here we are not using all threads, and even though 10000<11520 the above 'equal way' distribution works nicely.) For a much deeper task (N2~2^80 or so) use a larger C and/or more small sieve primes (17-19-23).[/QUOTE]

This is very simple - there are no checkpoints within a residue class! Checkpoints are written between two residue classes by just remembering the last finished residue class. Runtimes are rather short and I don't want to overengineer this - a simple solution is a good solution.


[QUOTE=R. Gerbicz;504006]
Using 64 bits? Where in the sieving, when you are using sieve primes up to say 32000 ?

For the sieving initialization in a cpu case I would precompute and store ((2*p) mod R) and (M mod R) for R<sievingprimes=32000 or whatever you are using. But maybe it isn't that needed for gpu; you have to do that when you start/switch on a new residue, so basically you don't need to store/precompute it.[/QUOTE]

At some point you have to "convert" the bits in the sieve back to the actual factor candidates. Sieving is done over the k's in FC = 2kp+1.
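For illustration, that conversion could look like the following Python sketch (scalar, not the SIMD form mfaktc actually needs; names are mine):

```python
def bits_to_candidates(sieve, p, k_base):
    """Map surviving sieve positions back to factor candidates FC = 2*k*p + 1.

    sieve[i] == True means k = k_base + i survived the sieve.
    """
    return [2 * (k_base + i) * p + 1
            for i, alive in enumerate(sieve) if alive]

# Toy example: an 8-entry sieve block for p = 31, starting at k = 100.
sieve = [True, False, False, True, False, True, False, False]
print(bits_to_candidates(sieve, 31, 100))   # candidates for k = 100, 103, 105
```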

Oliver

kriesel 2018-12-27 01:40

[QUOTE=TheJudger;504047]
Runtimes are rather short
Oliver[/QUOTE]
Meaning run times per class are short?
Run times per TF bit level can be several hours.
[URL]http://www.mersenneforum.org/showpost.php?p=488519&postcount=2[/URL]

Mark Rose 2018-12-27 05:38

[QUOTE=kriesel;504054]Meaning run times per class are short?
Run times per TF bit level can be several hours.
[URL]http://www.mersenneforum.org/showpost.php?p=488519&postcount=2[/URL][/QUOTE]

Run times per class.

The longest I've seen a class run for is about a minute, on an ancient GT 520, at a high TF level.

LaurV 2018-12-27 06:17

[QUOTE=TheJudger;503945]You can compare the two versions (420 vs. 4620 classes) and give an estimate of when the next prime should be included in the number of residue classes.[/QUOTE]
We did this in the past, during the development of mfaktc and after, and not only once; there are endless discussions around here.

You forget the speed decrease for sieving (a lot more classes means smaller classes, i.e. less efficient sieving). Actually, the best point (the top of the parabola) seemed to be somewhere between 420 and 4620: things get slower as you go higher, as well as if you go lower.

As we progress towards higher exponents, this point is (extremely) slowly shifting [U]towards[/U] 420. This is not a joke...

For example, at 420 classes you have 96 remaining to sieve/exponentiate, and for 4620 you have 960. The rest are eliminated by modularity (that is why we use 2[U]^2[/U]*p1*p2*... instead of 2*p1*p2*..., where px are odd primes: to completely eliminate the classes whose candidates are 3 or 5 mod 8). The 420-class split is good when there are not many candidates in each class (either a large exponent, or a low bitlevel, or both), [U]especially[/U] because it makes each class larger, which gives a faster sieve. When you sieve, you take each prime, do some modular calculus with it, then clear a lot of bits in the matrix (i.e. eliminate candidates in the class). A smaller class means you do more modular calculus and less elimination. As the classes get larger (higher bitlevels, and lower exponents), the candidates in each class become too numerous to be efficiently "tabulated" in the sieving matrices in RAM, and a split into more classes makes sense, but sieving becomes slower. This is compensated by the fact that proportionally fewer classes need to be sieved and exponentiated (960/4620 = [B]20.78%[/B] of the candidates, instead of 96/420 = [B]22.86%[/B]).

If we add 13 to the product, then you have 60060 classes, of which 11520 remain after modular elimination, meaning [B]19.18%[/B] of the candidates need to be sieved and exponentiated; but now they are divided into many more classes, there is a lot more modular calculus to be done, and the sieving is less efficient. There is a lot of overhead, as Oliver said, in handling all these 11k classes. It is not worth it.
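Those percentages can be reproduced by brute force; here is a Python sketch (any odd p coprime to the small primes gives the same counts, so the p below is just a placeholder):

```python
def surviving_classes(p, odd_primes):
    """Count classes k mod (4 * product(odd_primes)) whose candidates
    q = 2*k*p + 1 survive modular elimination: q % 8 in {1, 7} and
    q divisible by none of the small odd primes."""
    mod = 4
    for r in odd_primes:
        mod *= r
    kept = sum(1 for k in range(mod)
               if (2 * k * p + 1) % 8 in (1, 7)
               and all((2 * k * p + 1) % r for r in odd_primes))
    return kept, mod

p = 1000003   # placeholder odd prime, coprime to 3, 5, 7, 11, 13
for primes in ([3, 5, 7], [3, 5, 7, 11], [3, 5, 7, 11, 13]):
    kept, mod = surviving_classes(p, primes)
    print(f"{kept}/{mod} = {100 * kept / mod:.2f}%")
# 96/420 = 22.86%, 960/4620 = 20.78%, 11520/60060 = 19.18%
```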

In fact, as we move to higher exponents, the "ideal", "optimum" "top of the parabola" shifts imperceptibly [U]towards 420[/U] classes, until some trick is implemented ("invented") to somehow sieve all the 11k+ classes together...

With the actual sieving method, the 60060 classes would be worth it only for taking very low exponents to high bitlevels (like factoring under-1M or under-100k exponents to 66+ bits or so), and it would make happy a few hobbyists here who don't believe that low exponents have had so much ECM done that they may not have a factor under 100 bits :razz: But that is a different discussion, and it is in itself a different challenge, due to the fact that sieving for such low exponents can inadvertently eliminate factors (when the exponent is comparable in size to the sieving primes).

LaurV 2018-12-27 07:10

Another misconception is the idea that you have thousands of cores and they do thousands of things in parallel. You don't. They are not "true" cores like those of a CPU. They cannot do different tasks at the same time. Or well, they can, but it costs time.

They usually all do the same task at the same time, but applied to different data. That is why it is called SIMD: single instruction, multiple data. They are very fast if you have (for example) a screen with 4 million pixels and want to make all of them greenish, because it is Christmas (i.e. add some constant to all components of a vector); that is what they are designed for (graphics cards, remember?): each core takes a pixel and paints it green, because it is Christmas, and they all do it in parallel, all at the same time. But if you have 10000 pixels to paint and 3000 cores in your GPU, then they will do the change in 4 "ticks": 3 times each core will change a pixel, for a total of 9000, and on the last "tick" only 1000 cores will change a pixel; the other 2000 cores will wait, or rest, because there is no pixel left for them to change. You can reconfigure them to do other things in that time, but it comes with a cost for the other 1000 which have work to do, because they still share resources (memory, pipelines, etc). Alternatively, you could arrange from the start that only 2500 cores turn pixels green, and the other 500 do other things. You still need 4 steps, and you can do other things in the meantime, but you may get a better or a worse total time. This is endless tweaking, a lot of work invested by the programmers, etc.
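The tick counts in that example are just a ceiling division; a quick check (the function name is mine):

```python
import math

def ticks(items, cores):
    """SIMD passes needed when each core handles at most one item per pass."""
    return math.ceil(items / cores)

assert ticks(10000, 3000) == 4   # 3 full passes + 1 pass with 1000 busy cores
assert ticks(10000, 2500) == 4   # exactly 4 full passes
assert ticks(10000, 2000) == 5   # the 5-tick split discussed in this thread
```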

Profound respect for the guys who do that kind of software!

How about, maybe, only 2000 cores paint pixels green, and 1000 do other tasks? This would require 5 "ticks", but maybe the other task that gets done is more important? (And it would by itself require more ticks, which are now saved?) Remember the story with the doughnuts? You have a frying pan in which you can fry two of them at a time, one side per minute. How long does it take to fry 3 doughnuts on both sides? (If you said 4 minutes, don't start writing code for GPUs :razz:)

LaurV 2018-12-27 07:43

[QUOTE=R. Gerbicz;504006]Almost, in theory you get that speedup, but only in the sieve. [/QUOTE]
Actually, no. See my post. Sieving each class is much faster, indeed - ten times faster, or more. But you have to sieve 11k+ classes instead of 960 classes. In total, sieving is slower. MfaktX works [U]one class[/U] at a time. Take the class, sieve the class, exponentiate the class, etc., like [URL="https://www.youtube.com/watch?v=m842HLSOprA"]Pierre Richard with the plates[/URL] (take the plate, turn the plate, wash the plate, rinse the plate... you know the drill). If a factor is found, stop (or continue, if the user says so in the init file). This helps when you sieve (all candidates in the class "at the same time", see above) and exponentiate (all remaining candidates "at the same time"). Otherwise you can't get the speed. Think about it: at the limit, you have millions of classes with a single candidate each. Is the sieving faster, or slower, in total?

Of course, with a larger split fewer candidates survive, and the exponentiation may be faster (will it??? think about the fact that when you split into 4620 classes, sieving with 13, 17, 19, 23, etc. immediately eliminates the candidates that would have been eliminated by modularity with a bigger split; this only takes a blink of time, and you do it once for each of the 960 classes. When you include 13 in the product, you start sieving with 17, etc., saving a blink each time, but doing it for 11k+ classes instead of 960).

So, as I said, unless a new method is found to sieve all classes at the same time, or in bunches, or whatever, splitting into more classes will be slower.

Neutron3529 2018-12-27 13:07

I occasionally came up with an idea:
Is it possible to add some "double-check" feature to mfaktc?
For example, for each class we could record a trial value p which minimizes abs(n-p*round(n/p)).
After the calculation we would have 4620 (or 420) values of p,
and then we could use an operation like XOR to merge some of them together.
For example, we could first convert the "more classes" result into a "fewer classes" one,
then use XOR (or, again, minimizing abs(n-p*round(n/p))) to merge the 420 "fewer classes" into, for example, 15 "megaclasses".
Finally, for each "megaclass" we could use an algorithm like CRC-16 (or just the last 16 bits) to compress it into 16 bits, and then compress all of the 16*15=240 bits into 40 base-64 characters.
This would be helpful for double-checking.
When double-checking, people would not need to recalculate all 15 "megaclasses" to verify whether the submitted result is valid; verifying 1 megaclass is enough.


So this method could provide a roughly 15x faster double-check, which might be helpful for further double-checking.


IMO a double-checking code could prevent false "no factor" results; at least it would make it harder to submit a handwritten result to earn illegitimate GHz-days.
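For what it's worth, the proposed compression could be sketched like this in Python. Everything here is an assumption based on the idea above, not existing mfaktc behaviour: the XOR merge, the truncation to the last 16 bits standing in for a real CRC-16, and all the names.

```python
import base64
import struct

def megaclass_digest(class_values, n_mega=15):
    """XOR per-class integer values into n_mega "megaclasses", keep the low
    16 bits of each (a CRC-16 stand-in), and emit 40 base-64 characters."""
    mega = [0] * n_mega
    for i, v in enumerate(class_values):
        mega[i % n_mega] ^= v                 # merge classes into megaclasses
    low16 = [m & 0xFFFF for m in mega]        # "last 16 bits" compression
    packed = struct.pack(f">{n_mega}H", *low16)   # 15 * 16 bits = 240 bits
    return base64.b64encode(packed).decode()      # 30 bytes -> 40 characters

digest = megaclass_digest(list(range(420)))   # one dummy value per class
assert len(digest) == 40
```

A verifier would then recompute just one megaclass (1/15 of the work) and compare its 16-bit slot against the submitted digest.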

