mersenneforum.org > Great Internet Mersenne Prime Search > PrimeNet > GPU to 72
Old 2016-01-29, 20:10   #309
airsquirrels
"David"
Jul 2015
Ohio

11×47 Posts
Quote:
Originally Posted by chalsall View Post
...Unless we decide to take 60M to 63M up to 74 bits...
It seems likely that those 24k exponents should get assigned, since the need to move them to 74 bits isn't going to just disappear. However, I'm not anxious to include them in the initial DCTF burst, since it already moves the goalposts from where we started. They are also well beyond any active DCTF work.

I also wonder if those levels are still the desired transition points for the average card in the GPU72 fleet. Was it not the case that most of those transition points were chosen for older cards?
Old 2016-01-29, 20:40   #310
Mark Rose
"/X\(‘-‘)/X\"
Jan 2013

2930₁₀ Posts
Quote:
Originally Posted by airsquirrels View Post
It seems likely that those 24k exponents should get assigned, since the need to move them to 74 bits isn't going to just disappear. However, I'm not anxious to include them in the initial DCTF burst, since it already moves the goalposts from where we started. They are also well beyond any active DCTF work.

I also wonder if those levels are still the desired transition points for the average card in the GPU72 fleet. Was it not the case that most of those transition points were chosen for older cards?
Well, for my GTX 580s, it makes no sense. I should only take DCTF to 74 above 66M. Last year I went ahead and did all the DCTF that makes sense for a GTX 580 between 60M and 105M, including exponents assigned as first-time LL with not enough TF for DC. So unless PrimeNet runs out of available exponents, DCTF is essentially done for a GTX 580, barring some algorithmic improvement in TF throughput.

The clients don't report what cards are being used, so it's hard to say what the average card in the fleet is. I do believe that the GTX 580 was used as the basis for the cutover, at least comparing what I see on GPU72 versus James' graphs at mersenne.ca. When Chris expanded the current levels chart, I remember him saying he didn't adjust the colours of the cells.

I gather that the AMD cards are even better at TF relative to LL than Nvidia cards, so it makes sense to take lower exponents to a higher bit level (keeping in mind that Nvidia does better at higher bit levels).

It would be nice if there were a way for a card to indicate its relative abilities or model to GPU72, which could then make a better decision as to what the card should do when letting GPU72 decide. However, with the bulk of the work being exponents needing TF to 75 bits, if a pile of AMD cards are doing TF, they're going to end up doing that work anyway.
Old 2016-01-29, 23:56   #311
dragonbud20
Mar 2014

2⁴×5 Posts
Quote:
Originally Posted by Mark Rose View Post
The clients don't report what cards are being used, so it's hard to say what the average card in the fleet is. I do believe that the GTX 580 was used as the basis for the cutover,
I wonder if it would be possible to add a section to your GPU72 profile, or maybe your PrimeNet one, to list which graphics cards you trial factor with, as it would help to at least figure out what cards people generally use. Do any of the guys in charge know if this could be done?
Old 2016-01-30, 00:59   #312
airsquirrels
"David"
Jul 2015
Ohio

517₁₀ Posts
Quote:
Originally Posted by Mark Rose View Post
Well, for my GTX 580s, it makes no sense. I should only take DCTF to 74 above 66M. Last year I went ahead and did all the DCTF that makes sense for a GTX 580 between 60M and 105M, including exponents assigned as first-time LL with not enough TF for DC. So unless PrimeNet runs out of available exponents, DCTF is essentially done for a GTX 580, barring some algorithmic improvement in TF throughput.

The clients don't report what cards are being used, so it's hard to say what the average card in the fleet is. I do believe that the GTX 580 was used as the basis for the cutover, at least comparing what I see on GPU72 versus James' graphs at mersenne.ca. When Chris expanded the current levels chart, I remember him saying he didn't adjust the colours of the cells.

I gather that the AMD cards are even better at TF relative to LL than Nvidia cards, so it makes sense to take lower exponents to a higher bit level (keeping in mind that Nvidia does better at higher bit levels).

It would be nice if there were a way for a card to indicate its relative abilities or model to GPU72, which could then make a better decision as to what the card should do when letting GPU72 decide. However, with the bulk of the work being exponents needing TF to 75 bits, if a pile of AMD cards are doing TF, they're going to end up doing that work anyway.
I am curious why the AMD cards have this performance cutoff vs. the NVIDIA cards. I know we jump to a different kernel because we need more bits, but that seems like something that happens on both architectures. Has the mfaktc code just had more tuning for high-bit kernels? Or, conversely, was more effort in the mfakto code put into low-bit kernels? It hardly seems like an actual hardware difference.

Similarly, in the time since the 580 cutoff-point analysis was done, mfakto, mfaktc, and clLucas (and probably CUDALucas), as well as the AMD and NVIDIA drivers, have all changed substantially. At best the cutoff point is a very rough estimate.

Once we are sufficiently ahead of the CPU DC and CPU LL wavefronts with TF, will most of us actually turn those cards to LL testing? Or will we just trial factor to infinity (and beyond)?
Old 2016-01-30, 01:12   #313
Mark Rose
"/X\(‘-‘)/X\"
Jan 2013

2×5×293 Posts
Quote:
Originally Posted by airsquirrels View Post
I am curious why the AMD cards have this performance cutoff vs. the NVIDIA cards. I know we jump to a different kernel because we need more bits, but that seems like something that happens on both architectures. Has the mfaktc code just had more tuning for high-bit kernels? Or, conversely, was more effort in the mfakto code put into low-bit kernels? It hardly seems like an actual hardware difference.
I don't know enough to say.

Quote:
Similarly, in the time since the 580 cutoff-point analysis was done, mfakto, mfaktc, and clLucas (and probably CUDALucas), as well as the AMD and NVIDIA drivers, have all changed substantially. At best the cutoff point is a very rough estimate.
I do know that James periodically updates his tables when new versions are released. For instance, mfaktc 0.21 gave about a 1.5% increase over 0.20.

Quote:
Once we are sufficiently ahead of the CPU DC and CPU LL wavefronts with TF, will most of us actually turn those cards to LL testing? Or will we just trial factor to infinity (and beyond)?
I haven't given any thought to what I will do. Until a few months ago, when the surge started, I had a goal of finishing DCTF that I expected to take a couple more years, lol.
Old 2016-01-30, 01:32   #314
airsquirrels
"David"
Jul 2015
Ohio

11·47 Posts
Quote:
Originally Posted by Mark Rose View Post
I haven't given any thought to what I will do. Until a few months ago, when the surge started, I had a goal of finishing DCTF that I expected to take a couple more years, lol.
As long as there is TF work to do that makes sense in the LL or 100M range I intend to put both HW and hopefully mental/code effort into that work. At some point when we are far enough ahead I hope to put all my iron into DC-LL until I'm reasonably satisfied in its stability for 100M testing. Currently all my Titan-era NVIDIA cards are working DC-LL.

My profiling on mfakto so far has shown it to be pretty effectively loading the VALUs, and not memory bound/stalled at all after my change to on-card memory, which makes sense. Kernel/workgroup scheduling is quite reasonable, with long-running kernels and low overhead for most active TF ranges. 90+% of the work is in the actual factoring kernel, and the sieve efficiency isn't really worth tuning. George's suggestion of looking at hand-tuned ISA that makes use of the hardware carry flag and add-with-carry instructions to reduce instruction count is the only big place I see for easy improvement, and that's not the big percentage of the VALU instructions. The SALU is mostly sitting idle, but I don't see a worthwhile use of it with the current architecture. In summary, I can't see much headroom for more than a few more percentage points of improvement in mfakto. Since they are so similar in structure, I believe mfaktc will be similar.
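[Editor's illustration, not from the thread: why the hardware carry flag matters here. Trial factoring works on multi-word integers (e.g. 96 bits as three 32-bit limbs), and without an add-with-carry instruction every limb addition needs extra instructions just to propagate the carry. A minimal Python sketch of the limb arithmetic:]

```python
# Sketch: 96-bit addition as three 32-bit limbs with explicit carry
# propagation. On a GPU with add-with-carry (e.g. PTX add.cc/addc),
# the carry handling is one instruction per limb; emulating it costs
# extra compares and adds, which is the instruction count being discussed.

MASK32 = 0xFFFFFFFF

def add96(a, b):
    """Add two 96-bit numbers held as [lo, mid, hi] 32-bit limbs."""
    carry = 0
    out = []
    for x, y in zip(a, b):
        s = x + y + carry      # hardware: ADD, then ADDC for later limbs
        out.append(s & MASK32) # keep the low 32 bits
        carry = s >> 32        # carry into the next limb (0 or 1)
    return out

a = [0xFFFFFFFF, 0x00000001, 0x00000000]  # 0x00000001_FFFFFFFF
b = [0x00000001, 0x00000000, 0x00000000]  # 1
print(add96(a, b))                        # [0, 2, 0] = 0x00000002_00000000
```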

clLucas is an entirely different story, with pipeline stalls, cache misses, and excessive kernel queuing and the associated overhead. This isn't msft's fault so much as the fact that clFFT was not really designed for this kind of work (EDIT: I'm referring to it being called over and over again in a tight loop, with other kernels running before and after). I am fairly confident at this stage that at least a 2x improvement in performance is possible, possibly more. That would bring Fury X LL test performance above the 32-core Xeon v3 configurations that currently hold the speed record. At that point, clearing an exponent per day on a Fury is probably better than TF work.

Last fiddled with by airsquirrels on 2016-01-30 at 01:37
Old 2016-01-30, 01:43   #315
Mark Rose
"/X\(‘-‘)/X\"
Jan 2013

2·5·293 Posts
Quote:
Originally Posted by airsquirrels View Post
In summary, I can't see much headroom for more than a few more percentage points of improvement in mfakto. Since they are so similar in structure, I believe mfaktc will be similar.
I do know that Oliver keeps tweaking mfaktc to eke out small percentage increases.

Quote:
clLucas is an entirely different story, with pipeline stalls, cache misses, and excessive kernel queuing and the associated overhead. This isn't msft's fault so much as the fact that clFFT was not really designed for this kind of work (EDIT: I'm referring to it being called over and over again in a tight loop, with other kernels running before and after). I am fairly confident at this stage that at least a 2x improvement in performance is possible, possibly more. That would bring Fury X LL test performance above the 32-core Xeon v3 configurations that currently hold the speed record. At that point, clearing an exponent per day on a Fury is probably better than TF work.
Have you looked at CUDALucas? Merely curious what your observations are.
Old 2016-01-30, 01:54   #316
airsquirrels
"David"
Jul 2015
Ohio

205₁₆ Posts
Quote:
Originally Posted by Mark Rose View Post
I do know that Oliver keeps tweaking mfaktc to eke out small percentage increases.
This is exactly why there is not much headroom left. Very capable hands have already been working on these problems, all I can lend is some more hours.

Quote:
Originally Posted by Mark Rose View Post
Have you looked at CUDALucas? Merely curious what your observations are.
I have, though I am not as well versed in the CUDA side of things. I have also never seen source published for cuFFT (clFFT is open source), so the ability to tweak things for our specific application and avoid queuing several kernels per iteration is either not there or not as easy (for me) to do. The comments in the CUDALucas code suggest quite a bit of time was spent profiling and tuning the current implementation. My focus with CUDALucas has been on making it run more effectively on dual-GPU cards such as the Titan Zs I have, which is a bit of a niche use case that sprang out of wanting to do a faster GPU validation run for M49*.
Old 2016-01-30, 02:09   #317
ATH
Einyen
Dec 2003
Denmark

7×11×41 Posts
Quote:
Originally Posted by airsquirrels View Post
I am fairly confident at this stage that at least a 2x improvement in performance is possible, possibly more. That would bring Fury X LL test performance above the 32-core Xeon v3 configurations that currently hold the speed record. At that point, clearing an exponent per day on a Fury is probably better than TF work.
Even with DP performance at 1/16th of the SP performance, it could still possibly beat a Xeon?
Old 2016-01-30, 02:21   #318
airsquirrels
"David"
Jul 2015
Ohio

205₁₆ Posts
Quote:
Originally Posted by ATH View Post
Even with DP performance at 1/16th of the SP performance, it could still possibly beat a Xeon?
The theoretical peak DP for the 2698 v3 is around 600 GFLOPS, so 1.2 TFLOPS between two of them. A FirePro W9100 is 2.6 TFLOPS DP, so it's good for more than twice the potential power of the Xeon system.

The Fury X has 537 GFLOPS DP, which is similar to one of the Xeons; however, the memory bandwidth is the kicker: 500 GB/s on the Fury vs. 48 GB/s or so per Xeon.


These are all theoretical numbers, but it does give an idea of how the two compare.
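[Editor's sketch, not from the thread: a quick sanity check of the theoretical figures quoted above.]

```python
# All figures below are the thread's quoted theoretical peaks, not measurements.
xeon_dp    = 0.6e12   # one E5-2698 v3, ~600 GFLOPS DP peak
firepro_dp = 2.6e12   # FirePro W9100 peak DP
fury_dp    = 537e9    # Fury X peak DP (1/16 of its SP rate)
fury_bw    = 500e9    # Fury X HBM bandwidth, bytes/s
xeon_bw    = 48e9     # per-socket DDR4 bandwidth, bytes/s

print(f"W9100 vs dual Xeon:  {firepro_dp / (2 * xeon_dp):.1f}x DP")   # 2.2x
print(f"Fury X vs one Xeon:  {fury_dp / xeon_dp:.2f}x DP")
print(f"Fury X vs one Xeon:  {fury_bw / xeon_bw:.0f}x memory bandwidth")
```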
Old 2016-01-30, 02:50   #319
airsquirrels
"David"
Jul 2015
Ohio

1000000101₂ Posts
Replying to myself, because I wanted to run the numbers.

Estimating what we need to verify M49 in 24 hours:

1.16 ms/iteration

A complex FFT of length 4M requires about 5·N·log2(N) DP FLOPs. We need that twice, plus about 4·N operations for pointwise multiplication and normalization.

That works out to 940M or so ops per iteration, or about 810 GFLOPS.

The memory access requirements are a bit more nebulous to compute, but at the very least we need to read and write 4M complex values each iteration, which is about 55 GB/s.

So the FirePro, if it could magically line everything up, has the raw throughput. The dual Xeons are memory bound, and the Fury is ALU bound.
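[Editor's sketch: the back-of-envelope arithmetic above can be reproduced directly, using the thread's own assumptions (5·N·log2(N) FLOPs per 4M-point complex FFT, two FFTs plus ~4·N pointwise ops per iteration, 1.16 ms/iteration).]

```python
import math

N = 4 * 2**20              # 4M-point complex FFT
iter_time = 1.16e-3        # seconds per iteration

fft_flops = 5 * N * math.log2(N)    # one complex FFT
per_iter = 2 * fft_flops + 4 * N    # forward + inverse FFT + pointwise work

print(f"{per_iter / 1e6:.0f}M ops/iteration")            # 940M
print(f"{per_iter / iter_time / 1e9:.0f} GFLOPS sustained")  # 810 GFLOPS

# Minimum traffic: 4M complex doubles (16 B each) in each direction,
# ~58 GB/s per direction -- same ballpark as the ~55 GB/s quoted above.
print(f"{N * 16 / iter_time / 1e9:.0f} GB/s per direction")
```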