mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

Bdot 2012-03-05 20:07

[QUOTE=rcv;292021]
So, let me re-ask my 2nd question... Aside from validating the code, is there a reason why the user shouldn't be allowed to set a lower SievePrimes than 5K?
[/QUOTE]

I think the best way to free your CPU core(s) for other tasks is to set mfaktc to low priority.

Regarding the pure question "why not lower SievePrimes": because it hasn't been tested. As long as SievePrimes remains above SIEVE_SPLIT (250), the code will probably work correctly. However, before mfakt[co] is given to the public, some more tests need to be performed. You may have seen the CHECKS_MODBASECASE #define. Completing the full selftest at a certain SievePrimes value is a prerequisite, but not sufficient.

OK, now that this is out I can tell you that some time ago I finished a few tests with CHECKS_MODBASECASE at SievePrimes=256, and they did not turn up any errors, [B]but it was not the full test[/B], and it was mfakto, which runs different kernels.

And I briefly tested a special version of mfakto that skips sieving and memory transfer to the GPU completely, testing all candidates of a class. This one also successfully completed a few rounds of CHECKS_MODBASECASE tests.

So I'm quite confident that the full tests of lower SievePrimes would pass, but so far nobody has done these tests.

[QUOTE=rcv;292021]
I maintain that if both sieving and trial factoring are done on the GPU, the tradeoff *is* as simple as matching candidates per second removed by the siever to candidates per second tested by the trial factorer.

I actually have some prototype sieving code. It is not optimized. At the smallest prime factor not inherently sieved by the class mechanism (p=13), it can sieve out 64 billion candidates per second. At p=1583, the incremental rate of candidate removal is 1 billion candidates per second. At p=2297, the incremental rate of candidate removal is 500 megacandidates per second, and at p=4093, the incremental rate of candidate removal is 261 megacandidates per second. But the curve is rather flat here. With my 560Ti GPU, my prototype sieving code, and your trial factoring code, it would seem the tradeoff between more sieving and more trial factoring is probably in the vicinity of SievePrimes=1000+/-500, and not especially sensitive to variations. [This would leave your CPU essentially unused.]

@Bdot: If you are interested, would you please weigh in on how this compares with your results.

Thanks to all who responded![/QUOTE]
I'd really be interested to see this work. Though I'm maintaining the OpenCL version, a hint at how GPU sieving could be done quickly would certainly help.

My first approach was so terribly slow that I could only sieve the first 256 primes to get the 30M/s (sieve output for the siever alone) on a GPU that otherwise runs ~150M/s through factoring. I was so disappointed that I gave up ... Your results, however ...

Bdot 2012-03-05 20:16

[QUOTE=James Heinrich;291989]I highly second this. The "M/s" value is somewhat meaningless for the user, and often misunderstood. The conversion of time-per-class into GHz-days-per-day should be very simple: GHz-days for the assignment is given by:[code]0.016968 * pow(2, $bitlevel - 48) * 1680 / $exponent

// example using M50,000,000 from 2^69-2^70:
= 0.016968 * pow(2, 70 - 48) * 1680 / 50000000
= 2.3912767291392 GHz-days

// magic constant is 0.016968 for TF to 65-bit and above
// magic constant is 0.017832 for 63- and 64-bit
// magic constant is 0.01116 for 62-bit and below[/code]Then all you need is: 86400 / (time_per_class * classes_per_exponent) * ghz_days_assignment

Of course, the above code is based on a single bitlevel, but it is easily adapted to multi-bitlevel assignments.[/QUOTE]
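For reference, the quoted formula transcribes directly into a short Python sketch (the function names are mine; the constants and the throughput expression are taken verbatim from the quoted post):

```python
def ghz_days(exponent, bitlevel):
    # magic constant depends on the target bit level (values from the quoted post)
    if bitlevel >= 65:
        magic = 0.016968
    elif bitlevel >= 63:
        magic = 0.017832
    else:
        magic = 0.01116
    return magic * 2 ** (bitlevel - 48) * 1680 / exponent

def ghz_days_per_day(time_per_class, classes_per_exponent, ghz_days_assignment):
    # throughput: seconds per day divided by total seconds for the assignment
    return 86400 / (time_per_class * classes_per_exponent) * ghz_days_assignment

# example using M50,000,000 from 2^69 to 2^70, as in the quoted post:
print(ghz_days(50000000, 70))  # -> 2.3912767291392
```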

It's the first time I've seen the formula to calculate this. (OK, if I had asked, you would probably have told me earlier ...) I think I'll leave the M/s as an option in the ini-file, but switch the default to GHz-days/day (= GHz?). Added to todo-list ...

Dubslow 2012-03-05 22:23

Perhaps call it Eq. GHz, for Equivalent GHz (to a Core2).

@rcv: What a wonderful defense :smile: I like where this is going. I'd volunteer to do some 460 testing, but unfortunately I cannot for the life of me upgrade my Linux Nvidia drivers past 270.xx, so I'm stuck with mfaktc 0.17.

flashjh 2012-03-05 22:35

[QUOTE=Bdot;292025]So I'm quite confident that the full tests of lower SievePrimes would pass, but so far nobody has done these tests.[/QUOTE]

Just point me in the right direction and I'll test it. What needs to be done, specifically?

Prime95 2012-03-05 23:19

[QUOTE=rcv;292021]I actually have some prototype sieving code. It is not optimized. [/QUOTE]

I too would be interested in seeing this code.

Bdot 2012-03-05 23:26

[QUOTE=flashjh;292046]Just point me in the right direction and I'll test it. What needs to be done, specifically?[/QUOTE]

1. Build a binary that allows for setting low SievePrimes, and that has CHECKS_MODBASECASE defined.
2. Run the full selftest with that binary with fixed, low SievePrimes
3. Analyze the output for any CHECKS_MODBASECASE violations (and of course, all factors need to be found).

But before we all run and do something just for the sake of doing something: Maybe go back and think again about "Why do we want lower SievePrimes?" and "Do we really want that?". I haven't quite understood that part yet.

Dubslow 2012-03-06 02:57

[QUOTE=Bdot;292048]

But before we all run and do something just for the sake of doing something: Maybe go back and think again about "Why do we want lower SievePrimes?" and "Do we really want that?". I haven't quite understood that part yet.[/QUOTE]

[QUOTE=rcv;292021]

In the parlance of Mathematica, the fraction of candidates which pass the sieving is given by Apply[Times, (Prime[5+Range[sp]]-1)/Prime[5+Range[sp]]], where sp is the number of SievePrimes. At SievePrimes=1500, the above formula yields 28.5914%. At SievePrimes=5000, the above formula yields 25.0285%.

The number of candidates reported in each class by mfaktc (1.39G, as shown above) with SievePrimes=1500 agrees with the theoretical value: Floor[.285914665945569*2^71/4620/52575179/2+1/2]=1389675478 candidates per class.

At SievePrimes=5000, the number of candidates per class is theoretically Floor[0.250284623178239*2^71/4620/52575179/2+1/2]=1216497244.

[B][U]When I switched from SievePrimes=5000 to SievePrimes=1500, the number of candidates per second remained constant, but the time per class increased by about 14% (0.285915/0.250285-1).[/U][/B] [U]As best as I could tell, my CPU usage due to mfaktc went down by more than half.[/U] Now, the GPU is almost never starved for work. In contrast, with high fixed values of SievePrimes, my CPU becomes saturated, the GPU is often starved for work, the net mfaktc throughput goes down, and I can't use my CPU for other useful work. With moderate values of SievePrimes, the CPU burns a lot of time and the GPU is sometimes starved for work.
[/QUOTE]

Seems like a pretty clear benefit to me. (My emphasis.)
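For anyone without Mathematica: the quoted expressions transcribe into a self-contained Python sketch (helper names are mine; a naive trial-division generator stands in for Mathematica's Prime[]). With sp=1500 it matches the quoted 28.5914% and the 1389675478 candidates per class, and 0.285915/0.250285 - 1 gives the ~14% increase in time per class.

```python
import math

def primes_after_11(count):
    # Prime[5 + Range[sp]] in the quoted Mathematica: the first `count`
    # primes after 11, since 2, 3, 5, 7, 11 are handled by the class mechanism
    out, n = [], 11
    while len(out) < count:
        n += 2
        if all(n % p for p in range(3, math.isqrt(n) + 1, 2)):
            out.append(n)
    return out

def surviving_fraction(sieve_primes):
    # fraction of candidates that survive sieving: product of (p-1)/p
    frac = 1.0
    for p in primes_after_11(sieve_primes):
        frac *= (p - 1) / p
    return frac

def candidates_per_class(fraction, exponent, bitlevel=71, classes=4620):
    # Floor[fraction * 2^bitlevel / 4620 / exponent / 2 + 1/2] from the quoted
    # post: candidates q = 2*k*exponent + 1 below 2^bitlevel, one of 4620 classes
    return int(fraction * 2 ** bitlevel / classes / exponent / 2 + 0.5)
```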

kjaget 2012-03-06 04:18

[QUOTE=Dubslow;292058]Seems like a pretty clear benefit to me. (My emphasis.)[/QUOTE]

Not sure on which side you meant the benefit was, but here's my take.

Considering how much more efficient GPUs are at generating GHz-days of work, trading 100% of 1 CPU for a 14% speedup in GPU production feels like a net win to me. If you choose to measure it that way, of course.

For a 560Ti, that's ~25 GHz-days/day extra throughput. I don't see a CPU giving anywhere near that kind of throughput doing other types of work.

Dubslow 2012-03-06 06:00

No, it's about half a CPU core; there'd still be significant CPU use (unless he pushes the GPU sieve thing through as well, which I'm hoping for).

TheJudger 2012-03-06 10:59

Hi rcv,

[QUOTE=rcv;292021]I actually have some prototype sieving code. It is not optimized. At the smallest prime factor, not inherently sieved by the class mechanism (p=13), it can sieve out 64 billion candidates per second. At p=1583, the incremental rate of candidate removal is 1 billion candidates per second, At p=2297, the incremental rate of candidate removal is 500 megacandidates per second, and at p=4093, the incremental rate of candidate removal is 261 megacandidates per second. But the curve is rather flat, here. With my 560Ti GPU, my prototype sieving code, and your trial factoring code, it would seem the tradeoff between more sieving and more trial factoring is probably in the vicinity of SievePrimes=1000+/-500, and not especially sensitive to variations. [This would leave your CPU essentially unused.][/QUOTE]

This sounds interesting. I would love to see your code. Are you talking about sieving only, or does this include the translation of set/unset bits into FC candidates, too?

Oliver
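For context, the "translation" step asked about here means turning surviving sieve bits back into explicit factor candidates of the form q = 2*k*p + 1. A minimal CPU-side illustration (my own sketch, not code from mfaktc/mfakto or rcv's prototype; it also omits the mod-4620 class restriction for brevity):

```python
def sieve_and_translate(exponent, k_start, n_bits, small_primes):
    # Mersenne factor candidates have the form q = 2*k*exponent + 1.
    # Clear the bit for every k whose q is divisible by a small prime,
    # then translate the surviving set bits into explicit candidates.
    bits = [True] * n_bits
    for p in small_primes:
        # smallest offset o with 2*(k_start + o)*exponent + 1 divisible by p;
        # assumes p does not divide 2*exponent
        inv = pow(2 * exponent % p, -1, p)
        o = (-(2 * k_start * exponent + 1) * inv) % p
        for i in range(o, n_bits, p):
            bits[i] = False
    return [2 * (k_start + i) * exponent + 1
            for i, b in enumerate(bits) if b]

# toy example: k = 4 gives q = 233, a known factor of 2^29 - 1
print(sieve_and_translate(29, 1, 20, [3, 5, 7, 11, 13]))
```

The expensive part on a GPU is exactly this bits-to-candidates compaction, which is why the question of where the translation happens matters.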

apsen 2012-03-06 14:18

[QUOTE=Bdot;292048]Maybe go back and think again about "Why do we want lower SievePrimes?" and "Do we really want that?". I haven't quite understood that part yet.[/QUOTE]

I have an old CPU and a new GPU. With the lower limit at 5000 I saturate all 4 CPU cores, but my GPU usage is well below 100%. I played with reducing the limit too and found that the lowest time per class for me is around 1000-1500. As the limit is lowered below 1000, the CPU usage goes down. I ran long tests at 1000 and no problems were detected, but I did not use the CHECKS_MODBASECASE define.

And low priority does not always help. I have not specifically checked mfaktc, but prime95 would cause hiccups in some programs even on idle priority. In fact, IIRC it has a setting to pause processing when it detects specified programs running.

