mersenneforum.org

mfakto: an OpenCL program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=15646)

therealwebs 2011-12-30 05:55

bdot, I'm using the new mfakto10p1 and it crashes consistently using the --CLtest argument. It passes -st and -st2 flawlessly, though. I've attached a dump from the process*. It seems to blame amdocl64.dll. Have a look if you'd like!

*it's my first time using procdump, so it might not have captured the right thing :S

dump: [url]http://dl.dropbox.com/u/5274619/mfakto_dmp_111230_002904.dmp[/url]

debrouxl 2011-12-30 13:00

A while ago, at home, we bought a cheap desktop computer, to replace an old laptop which died after several years of ~24/7 BOINC crunching (World Community Grid, with a short period of RSALS when we were factoring the TI-68k & TI-Z80 512-bit RSA signing keys).
The desktop computer is an Athlon II X4 640 @ 3 GHz, 4 GB of RAM, and a Radeon HD 5450: it has therefore never been intended as a serious number cruncher (either NFS or TF).
But I wanted to test the GPU nonetheless, so I set up fglrx 11-12 under Debian Testing 64 bits :smile:

Unsurprisingly, this GPU is not very fast: few estimated compute elements, lots of CPU wait even with high SievePrimes values. Excerpt of mfakto 0.10 output:
[code]OpenCL device info
name Cedar (Advanced Micro Devices, Inc.)
device (driver) version OpenCL 1.1 AMD-APP (831.4) (CAL 1.4.1646)
maximum threads per block 128
maximum threads per grid 2097152
number of multiprocessors 2 (160 compute elements (estimate for ATI GPUs))
clock rate 650MHz

snip

got assignment: exp=65XXXXXX bit_min=69 bit_max=70
Starting trial factoring M65XXXXXX from 2^69 to 2^70, k_min = Y - k_max = Z
Using GPU kernel "mfakto_cl_barrett79"

found a valid checkpoint file!
last finished class was: 888
found 0 factors already

class | candidates | time | avg. rate | SievePrimes | ETA | CPU wait
893/4620 | 247.46M | 26.399s | 9.37M/s | 5000 | 5h40m | 191738us
896/4620 | 243.27M | 25.684s | 9.47M/s | 5625 | 5h30m | 203540us
897/4620 | 241.17M | 25.518s | 9.45M/s | 6328 | 5h27m | 203248us
900/4620 | 239.08M | 25.204s | 9.49M/s | 7119 | 5h23m | 199535us
908/4620 | 234.88M | 24.866s | 9.45M/s | 8008 | 5h18m | 202758us
917/4620 | 232.78M | 24.610s | 9.46M/s | 9009 | 5h15m | 200071us
920/4620 | 230.69M | 24.404s | 9.45M/s | 10135 | 5h11m | 199695us
921/4620 | 228.59M | 24.182s | 9.45M/s | 11401 | 5h08m | 200666us
932/4620 | 224.40M | 23.770s | 9.44M/s | 12826 | 5h03m | 199813us
936/4620 | 222.30M | 23.591s | 9.42M/s | 14429 | 5h00m | 199118us
941/4620 | 220.20M | 23.306s | 9.45M/s | 16232 | 4h56m | 197502us
945/4620 | 218.10M | 23.110s | 9.44M/s | 18261 | 4h53m | 197998us
948/4620 | 216.01M | 22.825s | 9.46M/s | 20543 | 4h49m | 195479us
953/4620 | 213.91M | 22.615s | 9.46M/s | 23110 | 4h46m | 196987us
956/4620 | 211.81M | 22.402s | 9.46M/s | 25998 | 4h43m | 195551us
957/4620 | 209.72M | 22.207s | 9.44M/s | 29247 | 4h40m | 193541us
965/4620 | 207.62M | 22.004s | 9.44M/s | 32902 | 4h37m | 194135us
972/4620 | 205.52M | 21.809s | 9.42M/s | 37014 | 4h34m | 189328us
977/4620 | 203.42M | 21.633s | 9.40M/s | 41640 | 4h32m | 189369us
980/4620 | 201.33M | 21.381s | 9.42M/s | 46845 | 4h28m | 188979us
class | candidates | time | avg. rate | SievePrimes | ETA | CPU wait
981/4620 | 199.23M | 21.184s | 9.40M/s | 52700 | 4h25m | 188281us
992/4620 | 197.13M | 20.996s | 9.39M/s | 59287 | 4h23m | 186956us
1001/4620 | 195.04M | 20.802s | 9.38M/s | 66697 | 4h20m | 185757us
1005/4620 | 192.94M | 20.616s | 9.36M/s | 75034 | 4h17m | 184389us
1008/4620 | 192.94M | 20.543s | 9.39M/s | 84413 | 4h16m | 181585us
1013/4620 | 190.84M | 20.360s | 9.37M/s | 94964 | 4h13m | 179461us
1016/4620 | 188.74M | 20.131s | 9.38M/s | 106834 | 4h10m | 175500us
1020/4620 | 186.65M | 20.009s | 9.33M/s | 120188 | 4h08m | 174065us
1025/4620 | 184.55M | 19.711s | 9.36M/s | 135211 | 4h04m | 170872us
1028/4620 | 184.55M | 19.725s | 9.36M/s | 152112 | 4h04m | 168330us
1032/4620 | 182.45M | 19.588s | 9.31M/s | 171126 | 4h02m | 167198us
1040/4620 | 180.36M | 19.379s | 9.31M/s | 192516 | 3h59m | 161959us
1041/4620 | 180.36M | 19.463s | 9.27M/s | 200000 | 4h00m | 158546us
1053/4620 | 180.36M | 19.391s | 9.30M/s | 200000 | 3h59m | 163496us
1056/4620 | 180.36M | 19.354s | 9.32M/s | 200000 | 3h58m | 162110us
1061/4620 | 180.36M | 19.525s | 9.24M/s | 200000 | 4h00m | 163995us
1065/4620 | 180.36M | 19.550s | 9.23M/s | 200000 | 4h00m | 161185us
1068/4620 | 180.36M | 19.526s | 9.24M/s | 200000 | 3h59m | 162294us[/code]

Obviously, I'm not going to make this GPU work much :smile:
But could it somehow be forced to complete the current assignments a bit faster? For instance, higher SievePrimes values (though values above 180K don't seem to make much of a difference), a SIEVE_SIZE_LIMIT of 64 KB, or something else?
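
For anyone who wants to cross-check the "avg. rate" and "ETA" columns in the log above, here is a rough Python sketch. It assumes that mfakto, like mfaktc, actually tests only 960 of the 4620 classes and that those are spread evenly; the function names are made up for illustration, and this is not how mfakto itself is known to compute the ETA.

```python
TOTAL_CLASSES = 4620
ACTIVE_CLASSES = 960  # assumption: number of classes actually tested, as in mfaktc


def avg_rate_m_per_s(candidates_m, seconds):
    """Candidates per second (in millions) for one class."""
    return candidates_m / seconds


def eta_seconds(current_class, seconds_per_class):
    """Estimate remaining time, assuming active classes are evenly spread."""
    done_fraction = current_class / TOTAL_CLASSES
    remaining_classes = ACTIVE_CLASSES * (1.0 - done_fraction)
    return remaining_classes * seconds_per_class


# First row of the log: class 893, 247.46M candidates in 26.399s
rate = avg_rate_m_per_s(247.46, 26.399)
eta = eta_seconds(893, 26.399)
print(f"{rate:.2f} M/s, ETA about {eta / 3600:.1f} h")
```

Under these assumptions the first row comes out close to the 9.37M/s and 5h40m that mfakto printed.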

Bdot 2011-12-30 22:42

[QUOTE=therealwebs;284024]bdot, I'm using the new mfakto10p1 and it crashes consistently using the --CLtest argument. It passes -st and -st2 flawlessly, though. I've attached a dump from the process*. It seems to blame amdocl64.dll. Have a look if you'd like!

*it's my first time using procdump, so it might not have captured the right thing :S

dump: [URL]http://dl.dropbox.com/u/5274619/mfakto_dmp_111230_002904.dmp[/URL][/QUOTE]

What is the last output of mfakto?

You captured the right thing; however, my debugger cannot make anything meaningful out of it. This is most likely caused by different runtime versions. What I do see is the abort location, amdocl64!clGetSamplerInfo. This is the OpenCL runtime, but mfakto never calls clGetSamplerInfo, so I assume this part of the information is already wrong. I'll see if I can somehow get more info out of the dump, thanks a lot for providing it. If I can't extract anything better, I'll probably create a special debug version for you that should show more.

Do you still have any aborts during normal operation?

Bdot 2011-12-30 23:15

[QUOTE=debrouxl;284049] Athlon II X4 640 @ 3 GHz, 4 GB of RAM, and a Radeon HD 5450: ... fglrx 11-12 under Debian Testing 64 bits :smile:
...
Obviously, I'm not going to make this GPU work much :smile:
But could it somehow be forced to complete the current assignments a bit faster? For instance, higher SievePrimes values (though values above 180K don't seem to make much of a difference), a SIEVE_SIZE_LIMIT of 64 KB, or something else?[/QUOTE]

Thanks for this test. It confirms that the HD 5450 is capable of delivering about 8-9 GHz-Days/day. Probably without consuming a lot of CPU power. Better than nothing, but certainly not well-suited for bringing the GPU-to-72 assignments to 72 bits.

Well, if you have a good idea how to force it to finish the assignment faster, let me know!

Higher SievePrimes would go in this direction. By doubling the CPU effort, you could expect a speedup of 3-5%. As you noticed with the values >180k: not really worth the effort.

SIEVE_SIZE_LIMIT 64kB would make the siever more efficient on your system as the Athlon CPU has 64kB L1 cache. The next mfakto version will have the sieve size configurable, but in your case it would just increase the CPU wait time.

In my eyes a new kernel has the best chance for real improvement of the throughput. This would be a barrett kernel based on 24-bit operations. I'm not yet certain if it would need to be entirely based on 24-bit, or if the 32-bit mul_hi is still allowed. I'm (slowly) working on these kernels, but cannot tell when they'll be ready. Also, it is hard to give a good estimate if the improvement will be 5% or 50% ...

And I recently thought of another idea that could increase throughput, especially on slower cards: the calculations in the kernel always require an initial division. GPUs are not made for divisions, so I could move this division from the GPU to the CPU, preferably into another thread.
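
To illustrate what that initial division is for: a Barrett-style kernel needs an approximate reciprocal of each factor candidate before it can reduce the squarings without dividing. The Python sketch below is not mfakto's code. It is a minimal model, with made-up function names, of computing that reciprocal once (the part that could move to the CPU) and then doing the modular exponentiation with only multiplications, shifts and a corrective subtraction.

```python
def barrett_setup(q):
    """One-time division: mu = floor(4**k / q), with k the bit length of q."""
    k = q.bit_length()
    mu = (1 << (2 * k)) // q  # the only real division per candidate
    return k, mu


def barrett_reduce(x, q, k, mu):
    """Return x mod q for 0 <= x < q*q (q >= 2), without dividing."""
    estimate = (x * mu) >> (2 * k)  # approximates x // q, never too large
    r = x - estimate * q
    while r >= q:  # small correction because the estimate may be too low
        r -= q
    return r


def is_factor_of_mersenne(p, q):
    """True if q divides 2**p - 1, i.e. if 2**p mod q == 1."""
    k, mu = barrett_setup(q)
    r = 1
    for bit in bin(p)[2:]:  # left-to-right binary exponentiation
        r = barrett_reduce(r * r, q, k, mu)
        if bit == "1":
            r = barrett_reduce(2 * r, q, k, mu)
    return r == 1


print(is_factor_of_mersenne(11, 23))  # True, since 2**11 - 1 = 2047 = 23 * 89
```

In this model only barrett_setup() contains a division, which is why it is the natural piece to hand to a CPU thread.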

But for now, I'm afraid there's nothing in mfakto that you can do to speed it up.

Hmm, can HD 5450 be overclocked? If so, leave the memory clock low but push the core clock higher - this will linearly increase throughput.

therealwebs 2011-12-31 01:43

On this machine (2x5870), whether or not mfakto 10p1 crashes seems to be up to chance. I'm trying to run 2 instances mostly unattended, and when I check on it (every 2 to 12 hours), usually one has crashed. I won't be able to dig in and really test until I get back to the physical location tomorrow (everything is being done via Teamviewer/remote desktop). I'll see if I can't screenshot the mfakto window along with a process dump the next time it goes south. Thanks for your work on this!

debrouxl 2011-12-31 08:46

[quote]Thanks for this test. It confirms that the HD 5450 is capable of delivering about 8-9 GHz-Days/day. Probably without consuming a lot of CPU power. Better than nothing, but certainly not well-suited for bringing the GPU-to-72 assignments to 72 bits.[/quote]
Exactly.
On another computer, which I have intermittent access to, a Mobility Radeon HD 4650 (550 MHz) driven by a Core i7 720QM @ 1.6 GHz (which has lower single-core throughput than the Athlon II X4 640 @ 3 GHz) goes through assignments in the 65M range more than two and a half times faster than the desktop HD 5450...
Unsurprisingly, on both computers, mfakto -d cpu does less than 4M per second.

[quote]But for now, I'm afraid there's nothing in mfakto that you can do to speed it up.[/quote]
OK :smile:

[quote]Hmm, can HD 5450 be overclocked? If so, leave the memory clock low but push the core clock higher - this will linearly increase throughput.[/quote]
I'll look into that, even if I probably won't overclock anything.


Thanks :smile:

KyleAskine 2012-01-10 14:49

I'd be interested to see how the new 7970's do if anyone manages to get their hands on one.

Bdot 2012-01-10 22:50

[QUOTE=KyleAskine;285724]I'd be interested to see how the new 7970's do if anyone manages to get their hands on one.[/QUOTE]

Raw figures and game benchmarks look promising, and the simplified instruction scheduling should boost performance of the 32-bit operations quite a bit, even though I have not been able to find detailed specs about the timing of the operations. Not sure if mul32, mul_hi and convert_* still occupy the whole "Graphics Core Next" ...

Another update:
I have been working a bit on mfakto and implemented the variable SieveSizeLimit. To make it easy to test, I also made the upper limit of SievePrimes configurable (between 5000 and 1,000,000). I added a test for it to a new --perftest option, so that you can check which SieveSize best fits the typical SievePrimes values you have. The output contains a list of how fast sieving alone is (this is on an otherwise idle Phenom II X4 955 @ 3.2 GHz):
[code]
2. Sieve (M/s)
SievePrimes:      5000   20000   80000  200000  500000 1000000
SieveSizeLimit
 12 kiB         136.60   89.10   46.78   27.38   14.15    6.28
 24 kiB         152.08  110.37   62.30   39.25   22.82   11.36
 36 kiB         156.94  115.52   71.29   47.36   28.78   15.52
 48 kiB         158.79  119.92   78.07   52.81   33.24   19.01
 59 kiB         157.13  120.70   82.93   54.79   36.58   21.85
 71 kiB         137.47  107.61   77.29   54.07   36.73   23.29
 83 kiB         127.99   99.11   73.83   52.63   37.19   24.49
 95 kiB         122.54   94.50   71.26   51.05   37.70   25.69
107 kiB         114.02   89.02   67.95   51.58   37.26   26.32
118 kiB         107.10   84.73   63.78   50.51   37.16   26.74
142 kiB          99.38   78.03   59.94   49.09   36.82   27.11
166 kiB          93.95   73.78   57.86   47.93   35.08   27.41
189 kiB          87.60   69.12   54.13   45.88   35.02   27.41
213 kiB          83.13   66.16   52.67   45.00   33.74   27.53
236 kiB          81.05   64.50   51.39   43.93   34.11   27.60
260 kiB          79.17   62.76   50.06   42.78   34.09   27.24
283 kiB          77.22   61.57   49.01   42.63   34.19   26.93
307 kiB          76.78   60.33   47.70   42.01   33.84   27.52
331 kiB          75.66   59.80   48.18   41.02   33.37   27.37
354 kiB          73.93   58.56   47.70   41.23   33.00   27.31
378 kiB          73.40   58.40   47.24   40.37   33.45   27.27
[/code]
And this is on a stock (2.7 GHz?) i7-2600:
[code]
2. Sieve (M/s)
SievePrimes:      5000   20000   80000  200000  500000 1000000
SieveSizeLimit
 12 kiB         167.10  107.70   54.40   33.82   20.04   11.96
 24 kiB         189.79  136.77   73.83   47.18   29.02   18.64
 36 kiB         194.79  142.45   86.47   56.14   35.26   23.38
 48 kiB         177.86  135.17   89.44   59.99   38.60   26.43
 59 kiB         162.03  124.87   89.10   61.40   40.98   28.72
 71 kiB         148.61  117.39   86.94   61.31   41.94   30.22
 83 kiB         141.95  112.71   86.40   62.64   43.48   31.89
 95 kiB         136.69  108.89   85.23   63.35   44.90   33.15
107 kiB         131.41  104.69   82.61   62.55   45.39   33.82
118 kiB         126.81  101.86   79.87   62.46   45.66   34.41
142 kiB         120.92   96.91   76.12   62.03   44.90   35.07
166 kiB         114.77   92.57   74.75   62.40   46.12   33.94
189 kiB         111.64   90.66   73.31   62.00   47.28   36.93
213 kiB         107.66   86.50   71.72   60.39   46.84   36.71
236 kiB         107.26   86.01   71.08   60.86   47.30   38.52
260 kiB         103.58   84.10   68.98   59.79   46.59   38.97
283 kiB         102.16   82.36   67.12   58.66   46.62   38.48
307 kiB         101.63   80.78   66.40   57.28   47.35   37.33
331 kiB          99.62   79.60   65.41   58.17   47.14   38.61
354 kiB          97.86   78.80   64.63   56.96   47.37   38.78
378 kiB          96.47   77.18   64.45   55.75   47.64   38.88
[/code]
For larger SievePrimes values, it is advantageous to increase SieveSizeLimit towards the L2 cache size. This is even more evident when the machine is loaded with more mfakto instances and mprime.
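
A table like this can be read mechanically: for the SievePrimes value an instance typically settles at, pick the row with the highest sieve rate. Here is a small Python sketch using a few rows copied from the i7-2600 table above. The helper name is made up for illustration and is not part of mfakto.

```python
SIEVE_PRIMES = [5000, 20000, 80000, 200000, 500000, 1000000]

# SieveSizeLimit in kiB -> sieve rate in M/s, one entry per SievePrimes column
# (subset of the i7-2600 rows)
RATES = {
    12: [167.10, 107.70, 54.40, 33.82, 20.04, 11.96],
    36: [194.79, 142.45, 86.47, 56.14, 35.26, 23.38],
    48: [177.86, 135.17, 89.44, 59.99, 38.60, 26.43],
    95: [136.69, 108.89, 85.23, 63.35, 44.90, 33.15],
    189: [111.64, 90.66, 73.31, 62.00, 47.28, 36.93],
    260: [103.58, 84.10, 68.98, 59.79, 46.59, 38.97],
    378: [96.47, 77.18, 64.45, 55.75, 47.64, 38.88],
}


def best_sieve_size(sieve_primes):
    """Return the SieveSizeLimit (kiB) with the highest rate in that column."""
    column = SIEVE_PRIMES.index(sieve_primes)
    return max(RATES, key=lambda size: RATES[size][column])


for sp in SIEVE_PRIMES:
    print(f"SievePrimes {sp:>7}: best SieveSizeLimit {best_sieve_size(sp)} kiB")
```

With these rows the best size grows from 36 kiB at SievePrimes 5000 to several hundred kiB at the largest SievePrimes values, which is the trend described above.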

And I finally got around to implementing a Barrett kernel based on mul24. Performance is quite promising (174M/s compared to the other kernel's 135M/s on a HD5770). The only disadvantage is that it does not find any factors yet :redface:.
However, positive side-effect: I found a few places in the traditional mul24 kernel where I could combine a left-shift + add into a mad24, increasing the total performance of that kernel by ~2-3%.
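
For readers unfamiliar with mad24, the identity behind that last change can be modelled in a few lines of Python. This only emulates the unsigned OpenCL built-ins for operands that fit in 24 bits. It does not show the actual places in the kernel that were changed, and shift_add is a made-up helper name.

```python
MASK24 = (1 << 24) - 1
MASK32 = (1 << 32) - 1


def mul24(a, b):
    """Model of OpenCL mul24 for unsigned operands that fit in 24 bits."""
    assert 0 <= a <= MASK24 and 0 <= b <= MASK24
    return (a * b) & MASK32  # low 32 bits of the product


def mad24(a, b, c):
    """Model of OpenCL mad24: mul24(a, b) + c, wrapped to 32 bits."""
    return (mul24(a, b) + c) & MASK32


def shift_add(a, s, c):
    """The two-step form: left shift, then add, wrapped to 32 bits."""
    return ((a << s) + c) & MASK32


# A shift by s is a multiplication by 2**s, so for s <= 23 both forms agree.
print(hex(shift_add(0x123456, 8, 0x9A)))   # 0x1234569a
print(hex(mad24(0x123456, 1 << 8, 0x9A)))  # 0x1234569a
```

The gain on the GPU would come from issuing one instruction instead of two; the Python model only demonstrates that the results match.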

Dubslow 2012-01-10 23:28

[QUOTE=Bdot;285814]And this is on a stock (2.7GHz ?) i7-2600:
[/QUOTE]
3.4 GHz, turbo to 3.8.

(mfaktc...)

kracker 2012-01-11 00:06

Works good on my Llano A8-3850 apu, thanks :)

Bdot 2012-01-11 08:51

[QUOTE=kracker;285827]Works good on my Llano A8-3850 apu, thanks :)[/QUOTE]
Thanks for this info! Could you please also post the OpenCL device info part as mfakto reports it? If I can easily figure out that we're running on Llano, then I can enable a zero-memory-copy optimization that should increase GPU utilisation by ~10% when only a single instance is running (and by a small amount for multi-instance).

