#309
"David"
Jul 2015
Ohio
10000001012 Posts |
It seems likely that those 24k assignments should get assigned, since the need to move them to 74 bits isn't going to just disappear; however, I'm not anxious to include them in the initial DCTF burst, since it already moves the goalposts from where we started. They are also well beyond any active DCTF work.

I also wonder whether those levels are still the desired transition points for the average card in the GPU72 fleet. Wasn't it the case that most of those transition points were chosen for older cards?
#310
"/X\(‘-‘)/X\"
Jan 2013
2·5·293 Posts
Quote:
The clients don't report what cards are being used, so it's hard to say what the average card in the fleet is. I do believe the GTX 580 was used as the basis for the cutover, at least comparing what I see on GPU72 versus James' graphs at mersenne.ca. When Chris expanded the current levels chart, I remember him saying he didn't adjust the colours of the cells.

I gather that AMD cards are even stronger at TF relative to LL than Nvidia cards are, so it makes sense for them to take lower exponents to a higher bit level (keeping in mind that Nvidia does better at higher bit levels). It would be nice if there were a way for a card to indicate its relative abilities or model to GPU72, which could then make a better decision about what the card should do when letting GPU72 decide. However, with the bulk of the work being exponents needing TF to 75 bits, if a pile of AMD cards are doing TF, they're going to end up doing that work anyway.
#311
Mar 2014
50₁₆ Posts
I wonder if it would be possible to add a section to your GPU72 profile, or maybe your PrimeNet one, listing which graphics cards you trial factor with, as it would help to at least figure out what cards people generally use. Do any of the guys in charge know if this could be done?
#312
"David"
Jul 2015
Ohio
11×47 Posts
Quote:
Similarly, in the time since the 580 cut-off analysis was done, mfakto, mfaktc, and clLucas (and probably CUDALucas), as well as the AMD and NVIDIA drivers, have all changed substantially. At best the cutoff point is a very rough estimate.

Once we are sufficiently ahead of the CPU DC and CPU LL wavefronts with TF, will most of us actually turn those cards to LL testing? Or will we just trial factor to infinity (and beyond)?
#313
"/X\(‘-‘)/X\"
Jan 2013
B72₁₆ Posts
Quote:
Quote:
Quote:
#314
"David"
Jul 2015
Ohio
1000000101₂ Posts
Quote:
My profiling on mfakto so far has shown it to be loading the VALUs pretty effectively, and not memory bound/stalled at all after my change to on-card memory, which makes sense. Kernel/workgroup scheduling is quite reasonable, with long-running kernels and low overhead for most active TF ranges. 90+% of the work is in the actual factoring kernel, and the sieve efficiency isn't really worth tuning. George's suggestion of looking at hand-tuned ISA that makes use of the hardware carry flag and add-with-carry instructions to reduce instruction count is the only big place I see for easy improvement, and that's not the big percentage of the VALU instructions. The SALU is mostly sitting idle, but I don't see a worthwhile use for it with the current architecture.

In summary, I can't see headroom for more than a few more percentage points of improvement in mfakto. Since they are so similar in structure, I believe mfaktc will be similar.

clLucas is an entirely different story, with pipeline stalls, cache misses, and extremely excessive kernel queuing and the associated overhead. This isn't msft's fault so much as the fact that clFFT was not really designed for this kind of work (EDIT: I'm referring to being called over and over again in a tight loop with other kernels running before and after). I am fairly confident at this stage that at least a 2x improvement in performance is possible, possibly more. That would bring Fury X LL test performance to better than the 32-core Xeon V3 configurations that currently hold the speed record. At that point, clearing an exponent per day on a Fury is probably better than TF work.

Last fiddled with by airsquirrels on 2016-01-30 at 01:37
#315
"/X\(‘-‘)/X\"
Jan 2013
2·5·293 Posts
Quote:
Quote:
#316
"David"
Jul 2015
Ohio
517₁₀ Posts
Quote:
I have; however, I am not as well versed in the CUDA side of things. I have also never seen source published for cuFFT (clFFT is open source), so the ability to tweak things for our specific application and avoid queuing several kernels per iteration is either not there or not as easy (for me) to do. The comments in the CUDALucas code suggest quite a bit of time was spent profiling and tuning the current implementation. My focus with CUDALucas has been on making it run more effectively on dual-GPU cards such as the Titan Zs I have, which is a bit of a niche use case that sprang out of wanting to do a faster GPU validation run for M49*.
#317
Einyen
Dec 2003
Denmark
2·1,579 Posts
Quote:
#318
"David"
Jul 2015
Ohio
11×47 Posts
Quote:
The Fury X has 537 GFLOPS DP, which is similar to one of the Xeons; however, the memory bandwidth is the kicker: 500 GB/s on the Fury vs. 48 GB/s or so per Xeon. These are all theoretical numbers, but they do give an idea of how the two compare.
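Putting the theoretical numbers side by side as FLOPs available per byte of memory traffic makes the contrast clearer for a bandwidth-hungry code like an LL test. The Fury X figures are the ones quoted above; the single-Xeon DP figure is my ballpark assumption for a Haswell-EP part, not from the post.

```python
# Theoretical specs (assumed/quoted, not measured).
fury = {"dp_gflops": 537, "mem_gbps": 500}  # Fiji: 1/16-rate DP, HBM
xeon = {"dp_gflops": 500, "mem_gbps": 48}   # ballpark E5 v3 (assumption)

# DP FLOPs available per byte of memory bandwidth.  A low ratio means the
# chip can feed its ALUs; a high ratio means bandwidth-bound codes starve.
for name, c in (("Fury X", fury), ("Xeon", xeon)):
    ratio = c["dp_gflops"] / c["mem_gbps"]
    print(f"{name}: {ratio:.1f} DP FLOPs per byte of bandwidth")
```

On these numbers the Fury can stream roughly one DP operand per FLOP, while the Xeon has an order of magnitude less bandwidth per FLOP, which is why the Xeon configurations are memory bound on large FFTs.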
#319
"David"
Jul 2015
Ohio
1000000101₂ Posts
Replying to myself, because I wanted to run the numbers.
Estimating what we need to verify M49 in 24 hours: 1.16 ms/iteration.

A complex FFT of 4M points requires about 5·N·log₂(N) DP FLOPs. We need that twice, plus about 4·N operations for pointwise multiplication and normalization. That works out to 940M or so ops per iteration, or 810 GFLOPS.

The memory access requirements are a bit more nebulous to compute, but at the very least we need to read and write 4M complex values each iteration, which is 55 GB/s.

So the FirePro, if it could magically line everything up, has the raw throughput. The dual Xeons are memory bound, and the Fury is ALU bound.