mersenneforum.org  

Old 2016-01-30, 04:35   #320
Madpoo
Serpentine Vermin Jar
 
 
Jul 2014

3,313 Posts

Quote:
Originally Posted by airsquirrels View Post
So the FirePro, if it could magically line everything up, has the raw throughput. The Dual Xeons are memory bound, and the Fury is ALU bound.
I managed to mostly figure out that the Nvidia Pascal will have 1/3 DP performance, if people are speculating correctly.

That might put its DP performance somewhere in the 2-3 TFLOPS range, although I saw some reports that it was rated over 4.

As for the memory bandwidth, some reports are saying as much as 1 TB/s...

I take all of that with a heavy dose of salt since the rumor mill is based on who knows what, but whatever the case, it does sound pretty intriguing.

I know it's a sucky thing to look at a performance problem, shrug your shoulders, and throw more hardware at it, especially when the software side has apparent room for improvement.
Old 2016-01-30, 04:44   #321
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

7,537 Posts

Quote:
Originally Posted by airsquirrels View Post
The memory access requirements are a bit more nebulous to compute but at the very least we need read and write to 4M complex values each iteration, which is 55GB/s
Prime95 requires 2 r/w passes over memory.

Prime95 does some of the forward FFT, point-wise squaring, and inverse FFT while data is in memory, clLucas does not. At best, you are going to need 2 r/w for the forward FFT, 1 r/w for the squaring, 2 r/w for the inverse FFT, 1 r/w for the rounding and carry propagation.

Prime95 uses a 256KB L2 cache which CUDA cards don't have, I assume AMD doesn't either. Consequently, I expect the 2 r/w in the forward and inverse FFT is optimistic -- probably 4 r/w is more realistic. You'd need to look at the clFFT code to know for sure.
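To put rough numbers on those pass counts, here is a minimal sketch. It assumes 16 bytes per double-precision complex value and the 1.16 ms/iteration target from the estimate being discussed, and it counts each r/w pass as one full read plus one full write of the 4M values. The constant and function names are made up for illustration; none of this is taken from Prime95, clLucas, or clFFT.

```python
# Assumptions (not from the posts): 16 bytes per double-precision complex
# value, and the 1.16 ms/iteration target from the 24-hour M49 estimate.
FFT_LEN = 4 * 2**20            # "4M" complex values
BYTES_PER_COMPLEX = 16         # two 8-byte doubles
ITER_TIME_S = 1.16e-3          # target time per iteration


def bandwidth_gb_s(rw_passes, n=FFT_LEN, iter_time_s=ITER_TIME_S):
    """Bandwidth needed if each r/w pass reads and writes every value once."""
    bytes_moved = rw_passes * 2 * n * BYTES_PER_COMPLEX
    return bytes_moved / iter_time_s / 1e9


# Pass counts from the post: forward FFT + squaring + inverse FFT + carry.
BEST_CASE = 2 + 1 + 2 + 1      # 6 r/w passes
PESSIMISTIC = 4 + 1 + 4 + 1    # 10 r/w passes

if __name__ == "__main__":
    for label, passes in [("2 r/w passes", 2),
                          ("best case", BEST_CASE),
                          ("pessimistic", PESSIMISTIC)]:
        print(f"{label:>12}: {bandwidth_gb_s(passes):7.1f} GB/s")
```

Under these assumptions, every extra r/w pass costs a bit over 100 GB/s at the target iteration time, so the pass count matters far more than the exact per-value size.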
Old 2016-01-30, 06:33   #322
frmky
 
 
Jul 2003
So Cal

2²×23² Posts

I'm not sure about the consumer cards, but the Tesla K20 has 1.25 MB L2 cache and the K20x and K40 have 1.5 MB L2.
Old 2016-01-30, 14:00   #323
airsquirrels
 
 
"David"
Jul 2015
Ohio

11·47 Posts

Quote:
Originally Posted by Prime95 View Post
Prime95 requires 2 r/w passes over memory.

Prime95 does some of the forward FFT, point-wise squaring, and inverse FFT while data is in memory, clLucas does not. At best, you are going to need 2 r/w for the forward FFT, 1 r/w for the squaring, 2 r/w for the inverse FFT, 1 r/w for the rounding and carry propagation.

Prime95 uses a 256KB L2 cache which CUDA cards don't have, I assume AMD doesn't either. Consequently, I expect the 2 r/w in the forward and inverse FFT is optimistic -- probably 4 r/w is more realistic. You'd need to look at the clFFT code to know for sure.
Thanks for the detailed information!

The AMD cards are a bit better equipped in terms of cache than we have been discussing. Of course, we currently throw almost all of this out between kernel calls.

There is a 2MB L2 cache shared between all compute units. There is also a small GDS (in 32KB pages) shared between all compute units, an L1 cache per CU, and a 64KB LDS for each 64 ALUs. Most importantly, each of the 64 CUs has a relatively huge number of vector registers: 256KiB worth per CU, plus 2KB of scalar registers.

As one more bonus for a carefully tuned kernel, there are basic integer ops (add, compare/swap) and reordering ops built into the LDS which can run fully independently of the VALUs.
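For a sense of scale, here is a rough tally of that on-chip storage against the size of one 4M complex FFT. It is a sketch under assumptions: one CU is taken to be 64 ALUs (so 64KB of LDS per CU), KB is treated as KiB, and each complex value is taken to be 16 bytes. The names are invented for illustration.

```python
KIB = 1024
MIB = 1024 * KIB

# Figures from the post; "64KB LDS for each 64 ALUs" is read as per CU.
NUM_CUS = 64
L2_BYTES = 2 * MIB
LDS_PER_CU = 64 * KIB
VGPR_PER_CU = 256 * KIB


def on_chip_bytes():
    """Total capacity of each on-chip store across the whole GPU."""
    return {
        "L2": L2_BYTES,
        "LDS": NUM_CUS * LDS_PER_CU,
        "VGPR": NUM_CUS * VGPR_PER_CU,
    }


def working_set_bytes(n=4 * 2**20, bytes_per_value=16):
    """Size of one 4M complex FFT buffer (assumed 16 bytes per value)."""
    return n * bytes_per_value


if __name__ == "__main__":
    stores = on_chip_bytes()
    for name, size in stores.items():
        print(f"{name:>5}: {size / MIB:5.1f} MiB")
    print(f"total: {sum(stores.values()) / MIB:5.1f} MiB")
    print(f"  FFT: {working_set_bytes() / MIB:5.1f} MiB")
```

With these assumptions the vector registers hold more than the L2 and LDS combined, but all of it together is still well short of one full FFT buffer, so the data cannot simply stay on chip between kernels.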

Last fiddled with by airsquirrels on 2016-01-30 at 14:01
Old 2016-01-30, 18:16   #324
ATH
Einyen
 
 
Dec 2003
Denmark

2×1,579 Posts

Quote:
Originally Posted by airsquirrels View Post
Estimating what we need to verify M49 in 24 hours:

1.16ms/ iteration

A complex FFT of 4M requires about 5 * N log2(N) DP FLOPs,
We need that twice plus about 4*N operations for point multiplication and normalization.

That works out to 940M or so ops per iteration, or 810 GFLOPs

The memory access requirements are a bit more nebulous to compute but at the very least we need read and write to 4M complex values each iteration, which is 55GB/s

So the FirePro, if it could magically line everything up, has the raw throughput. The Dual Xeons are memory bound, and the Fury is ALU bound.
Titan Black has a theoretical DP performance of 1700 GFLOPS and 336 GB/s memory bandwidth but it still takes 55-56 hours for M49.
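The arithmetic in the quoted estimate can be reproduced with a short script. This is a sketch, not anyone's actual tooling: the M49 exponent (74,207,281) is not stated in the post and is filled in here as an assumption, and the function names are made up for illustration.

```python
import math

# Figures taken from the quoted estimate; the exponent is an added assumption.
EXPONENT = 74_207_281          # M49 = 2**74207281 - 1 (not stated in the post)
FFT_LEN = 4 * 2**20            # "4M" complex FFT
SECONDS = 24 * 3600            # target: verify in 24 hours


def iteration_time_s(exponent=EXPONENT, seconds=SECONDS):
    """Time budget per iteration; an LL test of 2**p - 1 takes p - 2 squarings."""
    return seconds / (exponent - 2)


def flops_per_iteration(n=FFT_LEN):
    """Two FFTs at ~5*N*log2(N) each, plus ~4*N for multiply and normalize."""
    return 2 * 5 * n * math.log2(n) + 4 * n


def required_gflops(n=FFT_LEN, exponent=EXPONENT, seconds=SECONDS):
    """Sustained DP throughput needed to hit the per-iteration budget."""
    return flops_per_iteration(n) / iteration_time_s(exponent, seconds) / 1e9


if __name__ == "__main__":
    print(f"{iteration_time_s() * 1e3:.2f} ms/iteration")
    print(f"{flops_per_iteration() / 1e6:.0f} M ops/iteration")
    print(f"{required_gflops():.0f} GFLOPS required")
```

This lands on the same 1.16 ms and roughly 940M ops as the quote. The GFLOPS figure comes out a few GFLOPS below 810 because the script does not round the iteration time to 1.16 ms first.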
Old 2016-01-30, 18:21   #325
airsquirrels
 
 
"David"
Jul 2015
Ohio

11·47 Posts

Quote:
Originally Posted by ATH View Post
Titan Black has a theoretical DP performance of 1700 GFLOPS and 336 GB/s memory bandwidth but it still takes 55-56 hours for M49.
Correct, realized performance is closer to 200 GFLOPS due to inefficiencies in the FFT implementation. My statements were meant to point out that clFFT is not achieving the same level of performance as cuFFT despite running on similar HW. clFFT is much younger and has more room for optimization.
Old 2016-01-30, 19:13   #326
chalsall
If I May
 
 
"Chris Halsall"
Sep 2002
Barbados

23047₈ Posts

Quote:
Originally Posted by airsquirrels View Post
Or will we just trial factor to infinity (and beyond)?
That probably wouldn't (optimally) Make Sense (TM).

There will always be a need for additional GPU TF'ing before the LL'ing wavefronts. But at some point (read: probably in about eight months or so) we'll be far enough ahead in the TF'ing that most GPU efforts should go to DC'ing and/or LL'ing.

Of course, this all depends on the optimal economic cross-over points for LL'ing vs. TF'ing. These points have already changed several times as the GPU codes were optimized for different GPUs and different worktypes.
Old 2016-02-16, 16:36   #327
manfred4
 
 
Mar 2014
Germany

1111000₂ Posts

3735 assignments, or about 65 THzd, are left for completion. All of them were assigned approx. 20 days ago to ANONYMOUS, so this will be finished at the next drop-off of his completed assignments next Friday, or maybe the week after. Then all the DCTF is done!

What do you think he will do after he has finished? Help in the 100M digit range or on the LLTF front? Or continue with what he did before?

I also have a theory about what that guy was doing before he started DCTF. Before that time there was an Anonymous user who submitted a very large amount of TF work in the 875M range (here) and stopped doing so at about the time this anonymous user started DCTF. That's why I think he was around for a long time before, doing very-high-range work that nobody else cares about right now.
Old 2016-02-20, 16:11   #328
airsquirrels
 
 
"David"
Jul 2015
Ohio

11·47 Posts

77 candidates left!

Looks like tomorrow, if Anonymous posts results like he/she has on weekends in the past. We may finally get the answer to whether Anonymous will help with LLTF or move on to some other work.
Old 2016-02-20, 16:16   #329
LaurV
Romulan Interpreter
 
 
Jun 2011
Thailand

2×3×1,609 Posts

Or Chris can bring in the 62M to 74, there are just 2000 of them, then this DCTF is RIP forever... (well, until the next better hardware allows us a few more bits... )
Old 2016-02-20, 16:41   #330
chalsall
If I May
 
 
"Chris Halsall"
Sep 2002
Barbados

9,767 Posts

Quote:
Originally Posted by airsquirrels View Post
We may finally get the answer to whether Anonymous will help with LLTF or move on to some other work.
I *really* hope he does. We still need some LLTF'ing love in order to keep feeding the P-1'ers and the Cat 4 churners. For the latter things should lighten up in about a month (when the assignments given out after the MP announcement start expiring in quantity).

But even now, for LL Cat 4 there are approximately 240 assigned a day, with only 75 being completed a day.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.