mersenneforum.org  

Old 2013-01-31, 21:11   #683
Bdot
 
 
Nov 2010
Germany

255₁₆ Posts
Default

Quote:
Originally Posted by kracker View Post


My 7770 is running at 200 M/s, 150 GHz-days/day, at SievePrimes 15000. Do you think I can get more if I increase it?

(it's OC'ed)
As long as your CPU can manage the higher sieving effort, increasing SievePrimes will increase overall throughput. But keep in mind that this may reduce prime95 throughput, if you run that in parallel. It makes some sense to have both GPUs set to the same SievePrimes, because of diminishing returns in sieving.

If the CPU can handle it, you could even sieve 1000000 primes. That is almost the point where M/s and GHz-days/day show the same numbers (i.e. you'd be close to 200 GHz-days/day, but you may have to add another CPU to get there).

Monitor the CPU and GPU load. If you want to max overall throughput, both should be close to 100%.
Old 2013-01-31, 21:26   #684
sdbardwick
 
 
Aug 2002
North San Diego County

5·137 Posts
Default

Quote:
Originally Posted by Bdot View Post
Oh, boy. Results, after all.
  • 6.25 M/s translates to ~5-6 GHz-days/day.
  • this kernel certainly is not optimal, there may be better ones
  • Error -5 is CL_OUT_OF_RESOURCES, we may have less memory available, or fewer streams. This already happens when copying the second memory block to the device.
  • no factor found still needs to be analyzed
Next round: In order to make it a little easier for the iGPU, please create (or copy) an mfakto.ini file, with these settings changed from the default:
NumStreams=1
GridSize=1
VectorSize=1

and retry the 'mfakto.hd4000-pi -d 11 -st'.
They all have a serious effect on resources (and performance).
With the 3-line mfakto.ini:
Code:
C:\hd>mfakto.hd4000-pi -d 11 -st
mfakto 0.12-Win-HD4000 (64bit build)


Runtime options
  Inifile                   mfakto.ini
WARNING: Cannot read SievePrimesMin from inifile, using default value (5000)
  SievePrimesMin            5000
WARNING: Cannot read SievePrimesMax from inifile, using default value (1000000)
  SievePrimesMax            1000000
WARNING: Cannot read SievePrimes from inifile, using default value (25000)
  SievePrimes               25000
WARNING: Cannot read SievePrimesAdjust from inifile, using default value (0)
  SievePrimesAdjust         0
  NumStreams                1
  GridSize                  1
WARNING: Cannot read WorkFile from inifile, using default (worktodo.txt)
  WorkFile                  worktodo.txt
WARNING: Cannot read ResultsFile from inifile, using default (results.txt)
  ResultsFile               results.txt
WARNING: Cannot read Checkpoints from inifile, enabled by default
  Checkpoints               enabled
WARNING: Cannot read CheckpointDelay from inifile, set to 300s by default
  CheckpointDelay           300s
WARNING: Cannot read Stages from inifile, enabled by default
  Stages                    enabled
WARNING: Cannot read StopAfterFactor from inifile, set to 1 by default
  StopAfterFactor           bitlevel
WARNING: Cannot read PrintMode from inifile, set to 0 by default
  PrintMode                 full
  V5UserID                  none
  ComputerID                none
WARNING: Cannot read AllowSleep from inifile, set to 0 by default
  AllowSleep                no
  TimeStampInResults        no
  VectorSize                1
WARNING: Cannot read GPUType from inifile, using default (AUTO)
  GPUType                   AUTO
WARNING: Cannot read SieveOnGPU from inifile, set to 0 by default
  SieveOnGPU                no
WARNING: Cannot read SmallExp from inifile, set to 0 by default
  SmallExp                  no
WARNING: Cannot read SieveCPUMask from inifile, set to 0 by default
  SieveCPUMask              0
Compiletime options
  SIEVE_SIZE_LIMIT          36kiB
  SIEVE_SIZE                289731bits
  SIEVE_SPLIT               250
  MORE_CLASSES              enabled
  CL_PERFORMANCE_INFO       enabled (DEBUG option)
Select device - Get device info - Compiling kernels ..........
WARNING: Unknown GPU name, assuming VLIW5 type. Please post the device name "Int
el(R) HD Graphics 4000 (Intel(R) Corporation)" to http://www.mersenneforum.org/s
howthread.php?t=15646 to have it added to mfakto. Set GPUType in mfakto.ini to s
elect a GPU type yourself and avoid this warning.

OpenCL device info
  name                      Intel(R) HD Graphics 4000 (Intel(R) Corporation)
  device (driver) version   OpenCL 1.1  (9.17.10.2932)
  maximum threads per block 512
  maximum threads per grid  134217728
  number of multiprocessors 16 (1280 compute elements)
  clock rate                350MHz

Automatic parameters
  threads per grid          262144
  optimizing kernels for    VLIW5

########## testcase 1/1559 ##########
Starting trial factoring M50804297 from 2^67 to 2^68 (0.59GHz-days)
  k_min = 1599999998520 - k_max = 1900000000000
Using GPU kernel "barrett15_75"
  done |    ETA |     GHz |time/class|    #FCs | avg. rate | SieveP. |CPU idle
262144 FCs copied in 0.11 ms (9546.39 MB/s), proc'd in 47.81 ms (5.48 M/s)
Currently 'stuck' here for 5+ minutes @ 13% CPU usage (1 full core).
Of course, while I was posting:
Code:
########## testcase 1/1559 ##########
Starting trial factoring M50804297 from 2^67 to 2^68 (0.59GHz-days)
  k_min = 1599999998520 - k_max = 1900000000000
Using GPU kernel "barrett15_75"
  done |    ETA |     GHz |time/class|    #FCs | avg. rate | SieveP. |CPU idle
262144 FCs copied in 0.11 ms (9546.39 MB/s), proc'd in 47.81 ms (5.48 M/s)
Error -5: Copying h_ktab(clEnqueueWriteBuffer)
ERROR from tf_class.
Error exit as selftest failed

C:\hd>

Last fiddled with by sdbardwick on 2013-01-31 at 21:28
Old 2013-01-31, 21:47   #685
Bdot
 
 
Nov 2010
Germany

255₁₆ Posts
Unhappy

Quote:
Originally Posted by sdbardwick View Post
...
Currently 'stuck' here for 5+ minutes @ 13% CPU usage (1 full core).
Of course, while I was posting:
Code:
########## testcase 1/1559 ##########
Starting trial factoring M50804297 from 2^67 to 2^68 (0.59GHz-days)
  k_min = 1599999998520 - k_max = 1900000000000
Using GPU kernel "barrett15_75"
  done |    ETA |     GHz |time/class|    #FCs | avg. rate | SieveP. |CPU idle
262144 FCs copied in 0.11 ms (9546.39 MB/s), proc'd in 47.81 ms (5.48 M/s)
Error -5: Copying h_ktab(clEnqueueWriteBuffer)
ERROR from tf_class.
Error exit as selftest failed

C:\hd>
OK, same error, even though mfakto with these options is really light-weight. Taking 5 minutes to come up with that error can only mean some internal loop eating up all memory or whatever resource.

That is the point where some serious debugging seems necessary ... or better drivers from Intel or AMD (I'm not even sure whose code is running).
The prospect of adding 5 or maybe 10 GHz-days/day is also not too encouraging ...
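
For anyone decoding similar failures, here is a minimal sketch that maps the raw OpenCL status number printed in the log to its symbolic name. The numeric values are the ones defined in the Khronos cl.h header; the function name cl_status_name is hypothetical and is not part of mfakto or of the OpenCL API.

```c
/* Hypothetical helper: translate a few common OpenCL status codes
   (values as defined in the Khronos cl.h header) into their names. */
static const char *cl_status_name(int status) {
    switch (status) {
    case  0: return "CL_SUCCESS";
    case -1: return "CL_DEVICE_NOT_FOUND";
    case -2: return "CL_DEVICE_NOT_AVAILABLE";
    case -3: return "CL_COMPILER_NOT_AVAILABLE";
    case -4: return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
    case -5: return "CL_OUT_OF_RESOURCES";   /* the "Error -5" seen above */
    case -6: return "CL_OUT_OF_HOST_MEMORY";
    default: return "unknown (look it up in cl.h)";
    }
}
```

The name alone does not say which resource ran out; it only confirms that the failure is reported by the OpenCL runtime rather than by mfakto's own checks.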
Old 2013-01-31, 22:09   #686
sdbardwick
 
 
Aug 2002
North San Diego County

5·137 Posts
Default

I'd agree that waiting for Intel to go through another iteration of OpenCL development is the efficient choice. Now about that GPU sieving...
Old 2013-01-31, 22:48   #687
kracker
 
 
"Mr. Meeseeks"
Jan 2012
California, USA

2³×271 Posts
Default

Quote:
Originally Posted by Bdot View Post
As long as your CPU can manage the higher sieving effort, increasing SievePrimes will increase overall throughput. But keep in mind that this may reduce prime95 throughput, if you run that in parallel. It makes some sense to have both GPUs set to the same SievePrimes, because of diminishing returns in sieving.

If the CPU can handle it, you could even sieve 1000000 primes. That is almost the point where M/s and GHz-days/day show the same numbers (i.e. you'd be close to 200 GHz-days/day, but you may have to add another CPU to get there).

Monitor the CPU and GPU load. If you want to max overall throughput, both should be close to 100%.
Thanks! My cpu is definitely not the best, so I'll keep fiddling.

What I need is GPU sieving, or a better CPU!
Old 2013-02-25, 14:21   #688
E_tron
 
 
Sep 2002
Austin, TX

3·11·17 Posts
Thumbs up

I'm glad to see that GPU TF has come so far; thank you all for your effort. I'm trying out Mfakto on an AMD A8-5600K APU. Looks promising, but CPU sieving appears to be a bottleneck.

Any hope of AMD OpenCL implementations getting LL testing and factoring with GPU Sieving? It seems hard to believe that AMD's modern hardware doesn't have the same instructions as Cuda 2.x (Double precision floating point operations).
Old 2013-02-25, 14:27   #689
LaurV
Romulan Interpreter
 
 
Jun 2011
Thailand

7²·197 Posts
Default

By the way, I accidentally ran into an OpenCL FFT implementation by Apple, which seems to be just the missing link to make "OpenCL_LL" (i.e. CudaLucas for Radeons). Maybe someone stronger than me on the subject can have a look?

Last fiddled with by LaurV on 2013-02-25 at 14:28
Old 2013-02-25, 19:19   #690
kracker
 
 
"Mr. Meeseeks"
Jan 2012
California, USA

100001111000₂ Posts
Default

Quote:
Originally Posted by E_tron View Post
I'm glad to see that GPU TF has come so far; thank you all for your effort. I'm trying out Mfakto on an AMD A8-5600K APU. Looks promising, but CPU sieving appears to be a bottleneck.

Any hope of AMD OpenCL implementations getting LL testing and factoring with GPU Sieving? It seems hard to believe that AMD's modern hardware doesn't have the same instructions as Cuda 2.x (Double precision floating point operations).
I have an AMD A8-3850, a 2.9 GHz quad core, and I have no bottleneck on sieving: it takes exactly one of the four cores with two sessions.
Old 2013-02-25, 20:57   #691
Bdot
 
 
Nov 2010
Germany

3·199 Posts
Default

If you have all your CPU cores occupied with something (e.g. 2x mfakto + 2x prime95 on a quad core), then you need to manually select a suitable SievePrimes value and disable SievePrimesAdjust. AutoAdjust does not work on a fully loaded system - it will max out SievePrimes, which indeed results in a performance bottleneck.

Regarding GPU sieving: the GPU part of OpenCL is not all that different from the GPU part of CUDA; most of the differences can even be handled by one or two dozen #defines. It is the CPU (host) part of the two that differs so much. And a few concepts/abilities are hard to "translate" (inline assembler, for instance).

Anyway, I have George's GPU sieve running on OpenCL so that it provides some output. I need some verification of it being correct, and I need to do the adaptations of the kernels to read the raw sieve bitfield. As mfakto has many more kernels than mfaktc, I'm still thinking of a smart way to do this ...

OpenCL FFTs are available, not only from Apple but also from AMD. However, they are far from well-optimized, and I'm not sure if the ones that CuLu needs are there. Double precision is available on GCN and on the higher-end cards of the previous generation, but at a rather high penalty (1/4 of single precision speed). Not sure if this allows for an efficient OcLu.
Old 2013-02-25, 20:59   #692
Dubslow
Basketry That Evening!
 
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

1C35₁₆ Posts
Default

nVidia's cards suffer a worse SP->DP penalty than 1/4th. We would be a lot happier if it was "only" 1/4th.
Old 2013-03-03, 17:15   #693
Bdot
 
 
Nov 2010
Germany

1125₈ Posts
Default

Quote:
Originally Posted by Dubslow View Post
nVidia's cards suffer a worse SP->DP penalty than 1/4th. We would be a lot happier if it was "only" 1/4th.
Well then, why's nobody porting cudalucas to OpenCL?

BTW, I'm still busy with the GPU sieving for mfakto, and I could use some suggestions. I think I found out why my sieve seems to randomly kill FCs, it's code like this:
Code:
          mask = 1 << i37;
          mask |= (1 << i41) | (1 << i43);
          mask |= (1 << i47) | (1 << i53);
          mask |= (1 << i59) | (1 << i61);
The i<nn> variables contain values mod nn; both i<nn> and mask are 32-bit values. On CUDA, shifts of 32 or more result in 0, but OpenCL first takes the shift value mod 32 and shifts only by the remainder. This is one of the less prominent differences between CUDA and OpenCL.

So how can I get the code above to work correctly and efficiently on OpenCL? The best I could come up with is three instructions for one:
Code:
          mask = i37 > 31 ? 0 : (1 << i37);
          mask |= (i41 > 31 ? 0 : (1 << i41)) | (i43 > 31 ? 0 : (1 << i43));
          mask |= (i47 > 31 ? 0 : (1 << i47)) | (i53 > 31 ? 0 : (1 << i53));
          mask |= (i59 > 31 ? 0 : (1 << i59)) | (i61 > 31 ? 0 : (1 << i61));
Does anyone see a more efficient way?
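
One candidate, offered only as a sketch and not benchmarked on any GPU: do the shift in 64 bits and keep the low 32 bits. Since every i<nn> here is below 64, a bit shifted to position 32 or higher simply falls away in the truncation. The host-side C below emulates both shift behaviours so the idea can be checked on a CPU; the function names shl32_opencl, shl32_clamped and bit32_via_wide are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates the OpenCL rule for 32-bit operands:
   only the low 5 bits of the shift count are used. */
static uint32_t shl32_opencl(uint32_t x, uint32_t n) {
    return x << (n & 31u);
}

/* Emulates the behaviour the original code relies on:
   a shift count of 32 or more yields 0. */
static uint32_t shl32_clamped(uint32_t x, uint32_t n) {
    return n > 31u ? 0u : x << n;
}

/* Candidate without a compare/select: set bit n in a 64-bit value,
   then truncate to 32 bits. Requires n < 64, which holds for
   residues mod 37 .. mod 61. */
static uint32_t bit32_via_wide(uint32_t n) {
    assert(n < 64u);
    return (uint32_t)(UINT64_C(1) << n);
}
```

In an OpenCL kernel this would presumably read mask = (uint)(1UL << i37); and so on, assuming the shift count of a ulong operand is reduced mod 64. Whether it actually beats the compare-and-select version depends on how the device implements 64-bit shifts, so it would need measuring.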

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.