
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU to 72 (https://www.mersenneforum.org/forumdisplay.php?f=95)
-   -   GPU to 72 status... (https://www.mersenneforum.org/showthread.php?t=16263)

James Heinrich 2014-05-21 21:54

[QUOTE=Bdot;373923]There are 3 factors that influence mfakto (and mfaktc) performance[/QUOTE][QUOTE=James Heinrich;373890]What would be brilliant would be if Oliver/Bertram could include a broad benchmark that runs a few classes for a range of exponents (every 1M, 5M, etc across the range specified [e.g. 30M-80M]) and for each test at various bit ranges and give throughput performance at that exponent+bitlevel. That would provide consistent data to map the 3D performance variance for the various GPUs.[/QUOTE]@Bdot: how hard would it be to implement a benchmark such as I suggest?
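For what it's worth, the exponent-by-bitlevel grid suggested above is easy to sketch as a harness. This is hypothetical: `measure` stands in for whatever would actually time a few TF classes in mfakto/mfaktc, and no real command-line flags are assumed.

```python
import itertools

def benchmark_grid(measure, exponents, bit_levels):
    """Run measure(exponent, bit_level) -- a throughput probe in
    GHz-days/day -- over every combination and return a map of
    (exponent, bit_level) -> rate, i.e. the 3D performance surface."""
    return {(p, b): measure(p, b)
            for p, b in itertools.product(exponents, bit_levels)}

# Placeholder probe; a real one would run a few classes in mfakto.
def fake_probe(exponent, bits):
    return 1000.0 / (1 + bits - 70)  # made-up shape, illustration only

grid = benchmark_grid(fake_probe,
                      exponents=range(30_000_000, 80_000_001, 5_000_000),
                      bit_levels=range(70, 77))
print(len(grid))  # 11 exponents x 7 bit levels = 77 cells
```

The resulting dictionary is exactly the "consistent data to map the 3D performance variance" asked for, one cell per exponent+bitlevel.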

LaurV 2014-05-22 03:05

[QUOTE=manfred4;373967]seems to be a lot smoother between the exponents and bitlevels.[/QUOTE]
Yes, it is. As explained a few posts above, mfakt[B][U]c[/U][/B] uses (almost?) the same kernel (barrett76?) for all of this. You will feel the "big drop" in performance only for very short assignments, or for bit levels above 76, when a less [URL="http://weblogs.asp.net/jgalloway/archive/2007/05/10/performant-isn-t-a-word.aspx"]performant[/URL] kernel is used. Also, as discussed, mfakt[B][U]o[/U][/B] has "lower bitlevel" kernels optimized to fit the AMD/OpenCL architecture (see Bdot's posts).

NickOfTime 2014-05-22 06:31

Hmm, or how much work to create a BARRETT76_MUL15 kernel? :ermm:

kracker 2014-05-22 14:01

[QUOTE=NickOfTime;373985]Hmm, or how much work to create a BARRETT76_MUL15 kernel? :ermm:[/QUOTE]

Well, I think the question is: How fast or efficient would it be?

Bdot 2014-05-22 16:41

[QUOTE=kracker;373998]Well, I think the question is: How fast or efficient would it be?[/QUOTE]
:grin:
It would be exactly as fast as the 82_MUL15 kernel, because it would need to be implemented the same way.

When using 32-bit chunks of data, all current kernels need 3 of them, giving 96 bits. Now, certain short-cuts are possible that reduce the available bits: they basically use less exact intermediate values that are cheaper to compute, like skipping the evaluation of some carry flags. Different short-cuts have different costs, but adding them all in brings you down to 76 usable bits out of the 96.

Now, when using 15-bit chunks, you can use 5 of them for 75 "raw" bits, or 6 chunks for 90. Adding in all short-cuts results in 69 and 82 usable bits, respectively. 73 bits is the full implementation with 5 chunks (no short-cuts; there are always small rounding errors that eat one or two bits).

It might be worth checking again whether I can squeeze out 74 bits from 5 chunks - I currently don't remember why I didn't succeed last time ...
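A tiny sketch of the chunk arithmetic above; the usable-bit figures in the comments are just the numbers quoted in this post, not derived from the mfakto source.

```python
def raw_bits(chunk_bits, n_chunks):
    # Raw capacity is simply chunks times bits per chunk;
    # the short-cuts then trade some of it for cheaper arithmetic.
    return chunk_bits * n_chunks

# 32-bit chunks: 3 x 32 = 96 raw -> 76 usable with all short-cuts.
print(raw_bits(32, 3))  # 96
# 15-bit chunks: 5 x 15 = 75 raw -> 69 usable (73 with no short-cuts).
print(raw_bits(15, 5))  # 75
# 15-bit chunks: 6 x 15 = 90 raw -> 82 usable.
print(raw_bits(15, 6))  # 90
```

This makes the gap visible: a 5-chunk kernel tops out at 73-74 bits no matter what, so a 74_15 kernel has only 1-2 bits of headroom over the full implementation.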

NickOfTime 2014-05-23 18:56

[QUOTE=Bdot;374005]:grin:
It would be exactly as fast as the 82_MUL15 kernel, because it would need to be implemented the same way.

When using 32-bit chunks of data, all current kernels need 3 of them, giving 96 bits. Now, certain short-cuts are possible that reduce the available bits: they basically use less exact intermediate values that are cheaper to compute, like skipping the evaluation of some carry flags. Different short-cuts have different costs, but adding them all in brings you down to 76 usable bits out of the 96.

Now, when using 15-bit chunks, you can use 5 of them for 75 "raw" bits, or 6 chunks for 90. Adding in all short-cuts results in 69 and 82 usable bits, respectively. 73 bits is the full implementation with 5 chunks (no short-cuts; there are always small rounding errors that eat one or two bits).

It might be worth checking again whether I can squeeze out 74 bits from 5 chunks - I currently don't remember why I didn't succeed last time ...[/QUOTE]

Hmm, I guess it depends on how many exponents are left to take to 2[sup]74[/sup] and how long we will be processing them :-). I seem to be mostly processing 73-74's in the 65/66M range at the moment...

James Heinrich 2014-05-23 20:11

[QUOTE=NickOfTime;374106]Hmm, I guess it depends on how many exponents are left to take to 2[sup]74[/sup] and how long we will be processing them :-)[/QUOTE]Many, and a long time.
As of 01-May-2014, "many" was 21,176,383 exponents above 65M in the PrimeNet range (below 1000M) that are currently TF'd to less than 2[sup]74[/sup] and will eventually need to be taken there. I didn't bother to calculate the THz-years required, but it'll be a bunch.

[SIZE="1"]Small trivia: if we continue TF limits along the current curve, TF for the range between 1000M-4294M will require 1.5 EHz-days (exahertz-days, i.e. a thousand million GHz-days). That means 1000 TitanBlack/780ti GPUs running continuously for 1000 years.[/SIZE]
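As a back-of-the-envelope check, the per-GPU throughput implied by those trivia numbers works out as follows. The rate is derived purely from the figures in the post, not from a measured Titan Black benchmark.

```python
ehz_days = 1.5                      # quoted total TF effort
ghz_days = ehz_days * 1e9           # 1 EHz-day = 10^9 GHz-days
gpu_days = 1000 * 1000 * 365.25     # 1000 GPUs x 1000 years, in GPU-days
implied = ghz_days / gpu_days       # GHz-days per GPU per day
print(round(implied, 1))            # about 4.1
```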

Bdot 2014-05-23 23:16

[QUOTE=James Heinrich;373970]@Bdot: how hard would it be to implement a benchmark such as I suggest?[/QUOTE]
I have been extending the --perftest mode of mfakto over the last few versions, but so far it mainly tests the sieving performance, in order to find the best config values.

Doing the performance tests for each kernel is on the list ... Oliver and I discussed that a while ago, in order to have some comparable results. We need to revive that.


And my attempt for a 74_15 kernel comes in less than 1% ahead of the 82_15 kernel, but still misses some factors :bangheadonwall: I will need to use an even more accurate modulo function that will slow down the kernel even more ...

chalsall 2014-05-24 21:10

[QUOTE=Bdot;374132]Doing the performance tests for each kernel is on the list ... Oliver and I discussed that a while ago, in order to have some comparable results. We need to revive that.[/QUOTE]

That would be really cool. What would be even more cool is if such results could then be submitted to Primenet, and then made available to those interested. Perhaps James could help with that.

[QUOTE=Bdot;374132]And my attempt for a 74_15 kernel comes in less than 1% ahead of the 82_15 kernel, but still misses some factors :bangheadonwall: I will need to use an even more accurate modulo function that will slow down the kernel even more ...[/QUOTE]

Not meaning to blow inappropriate sunshine, but what you and Oliver (et al) do is (IMO) quite impressive.

James Heinrich 2014-05-24 21:59

[QUOTE=chalsall;374192]What would be even more cool is if such results could then be submitted to Primenet, and then made available to those interested. Perhaps James could help with that.[/QUOTE]I don't know about PrimeNet, since I'm not all that comfortable with the database interactions there, but I'd be happy to make such data available in raw and aggregated form on mersenne.ca

chalsall 2014-05-24 22:07

[QUOTE=James Heinrich;374203]I don't know about PrimeNet, since I'm not all that comfortable with the database interactions there, but I'd be happy to make such data available in raw and aggregated form on mersenne.ca[/QUOTE]

LOL... If you could "make it so", it would be appreciated and useful.


