mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfakto: an OpenCL program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=15646)

Bdot 2013-01-31 21:11

[QUOTE=kracker;326917]:smile:

My 7770 is running at 200 M/s, 150 GHz-days/day, at SievePrimes 15000. Do you think I can get more if I increase it?

(it's OC'ed)[/QUOTE]
As long as your CPU can manage the higher sieving effort, increasing SievePrimes will increase overall throughput. But keep in mind that this may reduce prime95 throughput, if you run that in parallel. It makes some sense to have both GPUs set to the same SievePrimes, because of diminishing returns in sieving.

If the CPU can handle it, you could even sieve 1000000 primes. That is almost the point where M/s and GHz-days/day show the same numbers (i.e. you'd be close to 200 GHz-days/day, but you may have to add another CPU to get there :no:).

Monitor the CPU and GPU load. If you want to max overall throughput, both should be close to 100%.

sdbardwick 2013-01-31 21:26

[QUOTE=Bdot;326792]Oh, boy. Results, after all.
[LIST][*]6.25 M/s translates to ~5-6 GHz-days/day.[*]this kernel certainly is not optimal, there may be better ones[*]Error -5 is CL_OUT_OF_RESOURCES, we may have less memory available, or fewer streams. This already happens when copying the second memory block to the device.[*]no factor found still needs to be analyzed[/LIST]Next round: In order to make it a little easier for the iGPU, please create (or copy) an mfakto.ini file, with these settings changed from the default:
NumStreams=1
GridSize=1
VectorSize=1

and retry the 'mfakto.hd4000-pi -d 11 -st'.
They all have a serious effect on resources (and performance).[/QUOTE]

With the 3-line mfakto.ini:
[CODE]
C:\hd>mfakto.hd4000-pi -d 11 -st
mfakto 0.12-Win-HD4000 (64bit build)


Runtime options
Inifile mfakto.ini
WARNING: Cannot read SievePrimesMin from inifile, using default value (5000)
SievePrimesMin 5000
WARNING: Cannot read SievePrimesMax from inifile, using default value (1000000)
SievePrimesMax 1000000
WARNING: Cannot read SievePrimes from inifile, using default value (25000)
SievePrimes 25000
WARNING: Cannot read SievePrimesAdjust from inifile, using default value (0)
SievePrimesAdjust 0
NumStreams 1
GridSize 1
WARNING: Cannot read WorkFile from inifile, using default (worktodo.txt)
WorkFile worktodo.txt
WARNING: Cannot read ResultsFile from inifile, using default (results.txt)
ResultsFile results.txt
WARNING: Cannot read Checkpoints from inifile, enabled by default
Checkpoints enabled
WARNING: Cannot read CheckpointDelay from inifile, set to 300s by default
CheckpointDelay 300s
WARNING: Cannot read Stages from inifile, enabled by default
Stages enabled
WARNING: Cannot read StopAfterFactor from inifile, set to 1 by default
StopAfterFactor bitlevel
WARNING: Cannot read PrintMode from inifile, set to 0 by default
PrintMode full
V5UserID none
ComputerID none
WARNING: Cannot read AllowSleep from inifile, set to 0 by default
AllowSleep no
TimeStampInResults no
VectorSize 1
WARNING: Cannot read GPUType from inifile, using default (AUTO)
GPUType AUTO
WARNING: Cannot read SieveOnGPU from inifile, set to 0 by default
SieveOnGPU no
WARNING: Cannot read SmallExp from inifile, set to 0 by default
SmallExp no
WARNING: Cannot read SieveCPUMask from inifile, set to 0 by default
SieveCPUMask 0
Compiletime options
SIEVE_SIZE_LIMIT 36kiB
SIEVE_SIZE 289731bits
SIEVE_SPLIT 250
MORE_CLASSES enabled
CL_PERFORMANCE_INFO enabled (DEBUG option)
Select device - Get device info - Compiling kernels ..........
WARNING: Unknown GPU name, assuming VLIW5 type. Please post the device name
"Intel(R) HD Graphics 4000 (Intel(R) Corporation)" to
http://www.mersenneforum.org/showthread.php?t=15646 to have it added to mfakto.
Set GPUType in mfakto.ini to select a GPU type yourself and avoid this warning.

OpenCL device info
name Intel(R) HD Graphics 4000 (Intel(R) Corporation)
device (driver) version OpenCL 1.1 (9.17.10.2932)
maximum threads per block 512
maximum threads per grid 134217728
number of multiprocessors 16 (1280 compute elements)
clock rate 350MHz

Automatic parameters
threads per grid 262144
optimizing kernels for VLIW5

########## testcase 1/1559 ##########
Starting trial factoring M50804297 from 2^67 to 2^68 (0.59GHz-days)
k_min = 1599999998520 - k_max = 1900000000000
Using GPU kernel "barrett15_75"
done | ETA | GHz |time/class| #FCs | avg. rate | SieveP. |CPU idle
262144 FCs copied in 0.11 ms (9546.39 MB/s), proc'd in 47.81 ms (5.48 M/s)
[/CODE] Currently 'stuck' here for 5+ minutes @ 13% CPU usage (1 full core).
Of course, while I was posting:
[CODE]########## testcase 1/1559 ##########
Starting trial factoring M50804297 from 2^67 to 2^68 (0.59GHz-days)
k_min = 1599999998520 - k_max = 1900000000000
Using GPU kernel "barrett15_75"
done | ETA | GHz |time/class| #FCs | avg. rate | SieveP. |CPU idle
262144 FCs copied in 0.11 ms (9546.39 MB/s), proc'd in 47.81 ms (5.48 M/s)
Error -5: Copying h_ktab(clEnqueueWriteBuffer)
ERROR from tf_class.
Error exit as selftest failed

C:\hd>[/CODE]
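
For reference, the M/s figure in the last log line can be reproduced from the two numbers printed next to it. This is a minimal C sketch, assuming the rate is simply factor candidates (FCs) divided by processing time; the function name is made up for illustration and is not taken from mfakto.

```c
#include <assert.h>

/* Millions of factor candidates (FCs) per second for one processed batch.
 * Hypothetical helper for illustration only, not part of mfakto. */
double fc_rate_mps(unsigned long fcs, double elapsed_ms)
{
    assert(elapsed_ms > 0.0);             /* guard against division by zero */
    double seconds = elapsed_ms / 1000.0; /* the log reports milliseconds */
    return (double)fcs / seconds / 1e6;   /* FCs per second, scaled to M/s */
}
```

Calling `fc_rate_mps(262144, 47.81)` gives roughly 5.48, which matches the "5.48 M/s" in the log.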

Bdot 2013-01-31 21:47

[QUOTE=sdbardwick;326924]
...
Currently 'stuck' here for 5+ minutes @ 13% CPU usage (1 full core).
Of course, while I was posting:
[CODE]########## testcase 1/1559 ##########
Starting trial factoring M50804297 from 2^67 to 2^68 (0.59GHz-days)
k_min = 1599999998520 - k_max = 1900000000000
Using GPU kernel "barrett15_75"
done | ETA | GHz |time/class| #FCs | avg. rate | SieveP. |CPU idle
262144 FCs copied in 0.11 ms (9546.39 MB/s), proc'd in 47.81 ms (5.48 M/s)
Error -5: Copying h_ktab(clEnqueueWriteBuffer)
ERROR from tf_class.
Error exit as selftest failed

C:\hd>[/CODE][/QUOTE]
OK, same error, even though mfakto with these options is really lightweight. Taking 5 minutes to come up with that error can only mean some internal loop is eating up all memory or some other resource.

That is the point where some serious debugging seems necessary :yucky: ... or better drivers from Intel or AMD (I'm not even sure whose code is running).
The prospect of adding 5 or maybe 10 GHz-days/day is also not too encouraging ...
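
For anyone decoding such messages: the "Error -5" in the log is the raw OpenCL status code CL_OUT_OF_RESOURCES. Below is a small lookup sketch in C for the first few status codes, with the values as defined in the standard `CL/cl.h` header. The table is deliberately incomplete, and the struct and function names are my own, not something mfakto provides.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the lookup table (hypothetical type, for illustration). */
struct status_entry { int code; const char *name; };

/* First few OpenCL status codes, values as in CL/cl.h. Not exhaustive. */
static const struct status_entry STATUS_TABLE[] = {
    {  0, "CL_SUCCESS" },
    { -1, "CL_DEVICE_NOT_FOUND" },
    { -2, "CL_DEVICE_NOT_AVAILABLE" },
    { -3, "CL_COMPILER_NOT_AVAILABLE" },
    { -4, "CL_MEM_OBJECT_ALLOCATION_FAILURE" },
    { -5, "CL_OUT_OF_RESOURCES" },
    { -6, "CL_OUT_OF_HOST_MEMORY" },
};

/* Return the symbolic name for a status code, or a placeholder string. */
const char *cl_status_name(int status)
{
    assert(status <= 0); /* OpenCL status codes are zero or negative */
    size_t n = sizeof STATUS_TABLE / sizeof STATUS_TABLE[0];
    for (size_t i = 0; i < n; i++) {
        if (STATUS_TABLE[i].code == status) {
            return STATUS_TABLE[i].name;
        }
    }
    return "(status not in this table)";
}
```

So `cl_status_name(-5)` returns "CL_OUT_OF_RESOURCES", the status that clEnqueueWriteBuffer reported when copying h_ktab.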

sdbardwick 2013-01-31 22:09

I'd agree that waiting for Intel to go through another iteration of OpenCL development is the efficient choice. Now about that GPU sieving... :smile:

kracker 2013-01-31 22:48

[QUOTE=Bdot;326920]As long as your CPU can manage the higher sieving effort, increasing SievePrimes will increase overall throughput. But keep in mind that this may reduce prime95 throughput, if you run that in parallel. It makes some sense to have both GPUs set to the same SievePrimes, because of diminishing returns in sieving.

If the CPU can handle it, you could even sieve 1000000 primes. That is almost the point where M/s and GHz-days/day show the same numbers (i.e. you'd be close to 200 GHz-days/day, but you may have to add another CPU to get there :no:).

Monitor the CPU and GPU load. If you want to max overall throughput, both should be close to 100%.[/QUOTE]

Thanks!:smile: My cpu is definitely not the best, so I'll keep fiddling.

What I need is [strike]gpu sieving[/strike] a better cpu ! :smile:

E_tron 2013-02-25 14:21

I'm glad to see that GPU TF has come so far; thank you all for your effort. I'm trying out Mfakto on an AMD A8-5600K APU. Looks promising, but CPU sieving appears to be a bottleneck.

Any hope of AMD OpenCL implementations getting LL testing and factoring with GPU Sieving? It seems hard to believe that AMD's modern hardware doesn't have the same instructions as Cuda 2.x (Double precision floating point operations).

LaurV 2013-02-25 14:27

By the way, I accidentally ran into an [URL="http://developer.apple.com/library/mac/#samplecode/OpenCL_FFT/Introduction/Intro.html"]OpenCL FFT implementation[/URL] by Apple, which seems to be just the missing link to make "OpenCL_LL" (i.e. CudaLucas for Radeons). Maybe someone stronger than me on the subject can have a look?

kracker 2013-02-25 19:19

[QUOTE=E_tron;330925]I'm glad to see that GPU TF has come so far; thank you all for your effort. I'm trying out Mfakto on an AMD A8-5600K APU. Looks promising, but CPU sieving appears to be a bottleneck.

Any hope of AMD OpenCL implementations getting LL testing and factoring with GPU Sieving? It seems hard to believe that AMD's modern hardware doesn't have the same instructions as Cuda 2.x (Double precision floating point operations).[/QUOTE]

I have an AMD A8-3850 (2.9 GHz quad-core) and see no bottleneck on sieving: with two sessions it takes exactly one of the four cores.

Bdot 2013-02-25 20:57

If you have all your CPU cores occupied with something (e.g. 2x mfakto + 2x prime95 on a quad core), then you need to manually select a suitable SievePrimes value and disable SievePrimesAdjust. AutoAdjust does not work on a fully loaded system - it will max out SievePrimes, which indeed results in a performance bottleneck.
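
As an mfakto.ini fragment, that would look roughly like the lines below. Both setting names appear in the startup log earlier in this thread; the value 25000 is just the default shown there, used as a placeholder rather than a recommendation, so pick whatever your CPU can sustain.

```ini
SievePrimes=25000
SievePrimesAdjust=0
```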

Regarding GPU sieving: the GPU part of OpenCL is not that different from the GPU part of CUDA; most of those differences can be handled with one or two dozen #defines. It is the CPU (host) part of the two that differs so much. And a few concepts/abilities are hard to "translate" (inline assembler, for instance).

Anyway, I have George's GPU sieve running on OpenCL so that it provides some output. I need some verification of it being correct, and I need to do the adaptations of the kernels to read the raw sieve bitfield. As mfakto has many more kernels than mfaktc, I'm still thinking of a smart way to do this ...

OpenCL FFTs are available, not only from Apple but also from AMD. However, they are far from well-optimized, and I'm not sure whether the ones that CuLu (CUDALucas) needs are there. Double precision is available on GCN and on the higher-end cards of the previous generation, but at a rather high penalty (1/4 of single-precision speed). Not sure if this allows for an efficient OcLu :smile:.

Dubslow 2013-02-25 20:59

nVidia's cards suffer a worse SP->DP penalty than 1/4th. We would be a lot happier if it was "only" 1/4th.

Bdot 2013-03-03 17:15

[QUOTE=Dubslow;330990]nVidia's cards suffer a worse SP->DP penalty than 1/4th. We would be a lot happier if it was "only" 1/4th.[/QUOTE]

Well then, why's nobody porting cudalucas to OpenCL :smile:?

BTW, I'm still busy with the GPU sieving for mfakto, and I could use some suggestions. I think I found out why my sieve seems to randomly kill FCs, it's code like this:
[code]
mask = 1 << i37;
mask |= (1 << i41) | (1 << i43);
mask |= (1 << i47) | (1 << i53);
mask |= (1 << i59) | (1 << i61);
[/code]The i<nn> variables contain values mod nn; both i<nn> and mask are 32-bit values. On CUDA, a shift by 32 or more results in 0, but OpenCL first takes the shift value mod 32 and shifts only by the remainder. This is one of the less prominent differences between CUDA and OpenCL.

So how can I get the code above to work correctly and efficiently on OpenCL? The best I could come up with is three instructions for one:
[code]
mask = i37 > 31 ? 0 : (1 << i37);
mask |= (i41 > 31 ? 0 : (1 << i41)) | (i43 > 31 ? 0 : (1 << i43));
mask |= (i47 > 31 ? 0 : (1 << i47)) | (i53 > 31 ? 0 : (1 << i53));
mask |= (i59 > 31 ? 0 : (1 << i59)) | (i61 > 31 ? 0 : (1 << i61));
[/code]Does anyone see a more efficient way?
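
One candidate, offered as an untested idea rather than a known-faster answer: do the shift in 64 bits and keep only the low 32 bits. All the residues here are below 61, so the shift count never reaches 64, and any bit shifted to position 32 or higher disappears in the truncation. In OpenCL C that would be something like `(uint)(1UL << i37)`. The plain C sketch below models the three variants so they can be compared on the host; the function names are made up for illustration. Whether a 64-bit shift is actually cheaper than the compare-and-select on a given GPU would have to be measured.

```c
#include <assert.h>
#include <stdint.h>

/* Models the OpenCL behaviour described above for 32-bit operands:
 * only the low 5 bits of the shift count are used. */
uint32_t bit_opencl_style(uint32_t s)
{
    return UINT32_C(1) << (s & 31u);
}

/* The guarded form from the post: yields 0 for shift counts above 31. */
uint32_t bit_guarded(uint32_t s)
{
    return s > 31u ? 0u : (UINT32_C(1) << s);
}

/* Alternative sketch: shift in 64 bits, then truncate to 32 bits.
 * Valid only for shift counts below 64. */
uint32_t bit_wide(uint32_t s)
{
    assert(s < 64u); /* shifting a 64-bit value by 64 or more is undefined in C */
    return (uint32_t)(UINT64_C(1) << s);
}
```

For example, with i37 = 36 the mod-32 behaviour sets bit 4 (`bit_opencl_style(36)` is 16), while both `bit_guarded(36)` and `bit_wide(36)` give 0, which is what the sieve needs.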

