#298
Dec 2011
Ottawa, Canada
48 Posts
bdot, I'm using the new mfakto 0.10p1 and it crashes consistently with the --CLtest argument, though it passes -st and -st2 flawlessly. I've attached a dump from the process*; it seems to blame amdocl64.dll. Have a look if you'd like!
*It's my first time using procdump, so it might not have captured the right thing :S dump: http://dl.dropbox.com/u/5274619/mfak...230_002904.dmp
#299
Sep 2009
3D116 Posts
A while ago, at home, we bought a cheap desktop computer to replace an old laptop which had died after several years of ~24/7 BOINC crunching (World Community Grid, with a short period of RSALS when we were factoring the TI-68k & TI-Z80 512-bit RSA signing keys).
The desktop computer is an Athlon II X4 640 @ 3 GHz with 4 GB of RAM and a Radeon HD 5450: it was therefore never intended as a serious number cruncher (either NFS or TF). But I wanted to test the GPU nonetheless, so I set up fglrx 11-12 under 64-bit Debian Testing. Unsurprisingly, this GPU is not very fast: few estimated compute elements, and lots of CPU wait even with high SievePrimes values. Excerpt of mfakto 0.10 output: Code:
OpenCL device info
name Cedar (Advanced Micro Devices, Inc.)
device (driver) version OpenCL 1.1 AMD-APP (831.4) (CAL 1.4.1646)
maximum threads per block 128
maximum threads per grid 2097152
number of multiprocessors 2 (160 compute elements (estimate for ATI GPUs))
clock rate 650MHz
snip
got assignment: exp=65XXXXXX bit_min=69 bit_max=70
Starting trial factoring M65XXXXXX from 2^69 to 2^70, k_min = Y - k_max = Z
Using GPU kernel "mfakto_cl_barrett79"
found a valid checkpoint file!
last finished class was: 888
found 0 factors already
class | candidates | time | avg. rate | SievePrimes | ETA | CPU wait
893/4620 | 247.46M | 26.399s | 9.37M/s | 5000 | 5h40m | 191738us
896/4620 | 243.27M | 25.684s | 9.47M/s | 5625 | 5h30m | 203540us
897/4620 | 241.17M | 25.518s | 9.45M/s | 6328 | 5h27m | 203248us
900/4620 | 239.08M | 25.204s | 9.49M/s | 7119 | 5h23m | 199535us
908/4620 | 234.88M | 24.866s | 9.45M/s | 8008 | 5h18m | 202758us
917/4620 | 232.78M | 24.610s | 9.46M/s | 9009 | 5h15m | 200071us
920/4620 | 230.69M | 24.404s | 9.45M/s | 10135 | 5h11m | 199695us
921/4620 | 228.59M | 24.182s | 9.45M/s | 11401 | 5h08m | 200666us
932/4620 | 224.40M | 23.770s | 9.44M/s | 12826 | 5h03m | 199813us
936/4620 | 222.30M | 23.591s | 9.42M/s | 14429 | 5h00m | 199118us
941/4620 | 220.20M | 23.306s | 9.45M/s | 16232 | 4h56m | 197502us
945/4620 | 218.10M | 23.110s | 9.44M/s | 18261 | 4h53m | 197998us
948/4620 | 216.01M | 22.825s | 9.46M/s | 20543 | 4h49m | 195479us
953/4620 | 213.91M | 22.615s | 9.46M/s | 23110 | 4h46m | 196987us
956/4620 | 211.81M | 22.402s | 9.46M/s | 25998 | 4h43m | 195551us
957/4620 | 209.72M | 22.207s | 9.44M/s | 29247 | 4h40m | 193541us
965/4620 | 207.62M | 22.004s | 9.44M/s | 32902 | 4h37m | 194135us
972/4620 | 205.52M | 21.809s | 9.42M/s | 37014 | 4h34m | 189328us
977/4620 | 203.42M | 21.633s | 9.40M/s | 41640 | 4h32m | 189369us
980/4620 | 201.33M | 21.381s | 9.42M/s | 46845 | 4h28m | 188979us
class | candidates | time | avg. rate | SievePrimes | ETA | CPU wait
981/4620 | 199.23M | 21.184s | 9.40M/s | 52700 | 4h25m | 188281us
992/4620 | 197.13M | 20.996s | 9.39M/s | 59287 | 4h23m | 186956us
1001/4620 | 195.04M | 20.802s | 9.38M/s | 66697 | 4h20m | 185757us
1005/4620 | 192.94M | 20.616s | 9.36M/s | 75034 | 4h17m | 184389us
1008/4620 | 192.94M | 20.543s | 9.39M/s | 84413 | 4h16m | 181585us
1013/4620 | 190.84M | 20.360s | 9.37M/s | 94964 | 4h13m | 179461us
1016/4620 | 188.74M | 20.131s | 9.38M/s | 106834 | 4h10m | 175500us
1020/4620 | 186.65M | 20.009s | 9.33M/s | 120188 | 4h08m | 174065us
1025/4620 | 184.55M | 19.711s | 9.36M/s | 135211 | 4h04m | 170872us
1028/4620 | 184.55M | 19.725s | 9.36M/s | 152112 | 4h04m | 168330us
1032/4620 | 182.45M | 19.588s | 9.31M/s | 171126 | 4h02m | 167198us
1040/4620 | 180.36M | 19.379s | 9.31M/s | 192516 | 3h59m | 161959us
1041/4620 | 180.36M | 19.463s | 9.27M/s | 200000 | 4h00m | 158546us
1053/4620 | 180.36M | 19.391s | 9.30M/s | 200000 | 3h59m | 163496us
1056/4620 | 180.36M | 19.354s | 9.32M/s | 200000 | 3h58m | 162110us
1061/4620 | 180.36M | 19.525s | 9.24M/s | 200000 | 4h00m | 163995us
1065/4620 | 180.36M | 19.550s | 9.23M/s | 200000 | 4h00m | 161185us
1068/4620 | 180.36M | 19.526s | 9.24M/s | 200000 | 3h59m | 162294us
But could it somehow be forced to complete the current assignments a bit faster? For instance, higher SievePrimes values (though values above 180K don't seem to make much of a difference), a SIEVE_SIZE_LIMIT of 64 KB, or something else?
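For anyone else tuning a similar card: the two knobs mentioned above live in different places in mfakto 0.10 (a sketch; exact key names and defaults may differ between versions). SievePrimes is a runtime setting in mfakto.ini:

```ini
; mfakto.ini (runtime) - deeper sieving; values much above ~180k
; showed diminishing returns on this HD 5450
SievePrimes=180000
```

SIEVE_SIZE_LIMIT, by contrast, is a compile-time #define in the source (params.h in mfaktc-derived trees, if I recall correctly), so matching it to a CPU's 64 kB L1 cache means editing the define and recompiling.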
#300
Nov 2010
Germany
3×199 Posts
Quote:
You captured the right thing; however, my debugger cannot make anything meaningful out of it, most likely because of different runtime versions. What I do see is the abort location, amdocl64!clGetSamplerInfo. That is in the OpenCL runtime, but mfakto never calls clGetSamplerInfo, so I assume this part of the information is already wrong. I'll see if I can somehow get more info out of the dump; thanks a lot for providing it. If I can't extract anything better, I'll probably create a special debug version for you that shows more. Do you still have any aborts during normal operation?
#301
Nov 2010
Germany
3×199 Posts
Quote:
Well, if you have a good idea how to force it to finish the assignment faster, let me know! Higher SievePrimes would go in this direction: for double the CPU effort you could expect a speedup of 3-5%. As you noticed with values >180k, it's not really worth the effort. SIEVE_SIZE_LIMIT 64kB would make the siever more efficient on your system, as the Athlon CPU has 64kB of L1 cache. The next mfakto version will have the sieve size configurable, but in your case it would just increase the CPU wait time. In my eyes, a new kernel has the best chance of really improving throughput. This would be a barrett kernel based on 24-bit operations. I'm not yet certain whether it would need to be entirely 24-bit, or whether the 32-bit mul_hi is still allowed. I'm (slowly) working on these kernels, but cannot tell when they'll be ready. It is also hard to estimate whether the improvement will be 5% or 50% ... I recently thought of another idea that could increase throughput, especially on slower cards: the calculations in the kernel always require an initial division. GPUs are not made for divisions, so I could move this division from the GPU to the CPU, preferably into another thread. But for now, I'm afraid there's nothing in mfakto that you can do to speed it up. Hmm, can the HD 5450 be overclocked? If so, leave the memory clock low but push the core clock higher: this will increase throughput linearly.
#302
Dec 2011
Ottawa, Canada
22 Posts
On this machine (2x HD 5870), whether mfakto 0.10p1 crashes seems to be up to chance. I'm trying to run 2 instances mostly unattended, and when I check on it (every 2 to 12 hours), usually one has crashed. I won't be able to dig in and really test until I get back to the physical location tomorrow (everything is being done via TeamViewer/remote desktop). I'll see if I can screenshot the mfakto window along with a process dump the next time it goes south. Thanks for your work on this!
#303
Sep 2009
977 Posts
Quote:
On another computer, which I have intermittent access to, a Mobility Radeon HD 4650 (550 MHz) driven by a Core i7 720QM @ 1.6 GHz (which has lower single-core throughput than the Athlon II X4 640 @ 3 GHz) goes through assignments in the 65M range more than two and a half times faster than the desktop HD 5450... Unsurprisingly, on both computers, mfakto -d cpu does less than 4M per second. Quote:
Quote:
Thanks
#304
Oct 2011
Maryland
2·5·29 Posts
I'd be interested to see how the new 7970s do if anyone manages to get their hands on one.
#305
Nov 2010
Germany
3·199 Posts
Quote:
Another update: I have been working a bit on mfakto and implemented the variable SieveSizeLimit. And in order to easily test it, I also made the upper limit of SievePrimes configurable (between 5000 and 1,000,000). I added a test for it to a new --perftest option, so that you can check which SieveSize fits best to the typical SievePrimes values you have. The output contains a list how fast sieving alone is (this is on an otherwise idle Phenom 2 X4 955 @ 3.2 GHz): Code:
2. Sieve (M/s)
SievePrimes:       5000   20000   80000  200000  500000  1000000
SieveSizeLimit
 12 kiB          136.60   89.10   46.78   27.38   14.15     6.28
 24 kiB          152.08  110.37   62.30   39.25   22.82    11.36
 36 kiB          156.94  115.52   71.29   47.36   28.78    15.52
 48 kiB          158.79  119.92   78.07   52.81   33.24    19.01
 59 kiB          157.13  120.70   82.93   54.79   36.58    21.85
 71 kiB          137.47  107.61   77.29   54.07   36.73    23.29
 83 kiB          127.99   99.11   73.83   52.63   37.19    24.49
 95 kiB          122.54   94.50   71.26   51.05   37.70    25.69
107 kiB          114.02   89.02   67.95   51.58   37.26    26.32
118 kiB          107.10   84.73   63.78   50.51   37.16    26.74
142 kiB           99.38   78.03   59.94   49.09   36.82    27.11
166 kiB           93.95   73.78   57.86   47.93   35.08    27.41
189 kiB           87.60   69.12   54.13   45.88   35.02    27.41
213 kiB           83.13   66.16   52.67   45.00   33.74    27.53
236 kiB           81.05   64.50   51.39   43.93   34.11    27.60
260 kiB           79.17   62.76   50.06   42.78   34.09    27.24
283 kiB           77.22   61.57   49.01   42.63   34.19    26.93
307 kiB           76.78   60.33   47.70   42.01   33.84    27.52
331 kiB           75.66   59.80   48.18   41.02   33.37    27.37
354 kiB           73.93   58.56   47.70   41.23   33.00    27.31
378 kiB           73.40   58.40   47.24   40.37   33.45    27.27
Code:
2. Sieve (M/s)
SievePrimes:       5000   20000   80000  200000  500000  1000000
SieveSizeLimit
 12 kiB          167.10  107.70   54.40   33.82   20.04    11.96
 24 kiB          189.79  136.77   73.83   47.18   29.02    18.64
 36 kiB          194.79  142.45   86.47   56.14   35.26    23.38
 48 kiB          177.86  135.17   89.44   59.99   38.60    26.43
 59 kiB          162.03  124.87   89.10   61.40   40.98    28.72
 71 kiB          148.61  117.39   86.94   61.31   41.94    30.22
 83 kiB          141.95  112.71   86.40   62.64   43.48    31.89
 95 kiB          136.69  108.89   85.23   63.35   44.90    33.15
107 kiB          131.41  104.69   82.61   62.55   45.39    33.82
118 kiB          126.81  101.86   79.87   62.46   45.66    34.41
142 kiB          120.92   96.91   76.12   62.03   44.90    35.07
166 kiB          114.77   92.57   74.75   62.40   46.12    33.94
189 kiB          111.64   90.66   73.31   62.00   47.28    36.93
213 kiB          107.66   86.50   71.72   60.39   46.84    36.71
236 kiB          107.26   86.01   71.08   60.86   47.30    38.52
260 kiB          103.58   84.10   68.98   59.79   46.59    38.97
283 kiB          102.16   82.36   67.12   58.66   46.62    38.48
307 kiB          101.63   80.78   66.40   57.28   47.35    37.33
331 kiB           99.62   79.60   65.41   58.17   47.14    38.61
354 kiB           97.86   78.80   64.63   56.96   47.37    38.78
378 kiB           96.47   77.18   64.45   55.75   47.64    38.88

And I finally got around to implementing a barrett kernel based on mul24. Performance is quite promising (174M/s compared to the other kernel's 135M/s on an HD 5770). The only disadvantage is that it does not find any factors yet. However, a positive side effect: I found a few places in the traditional mul24 kernel where I could combine a left-shift + add into a mad24, increasing that kernel's total performance by ~2-3%.
#306
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
160658 Posts
#307
"Mr. Meeseeks"
Jan 2012
California, USA
87816 Posts
Works well on my Llano A8-3850 APU, thanks :)
#308
Nov 2010
Germany
3×199 Posts
Thanks for this info! Could you please also post the OpenCL device info part as mfakto reports it? If I can easily figure out that we're running on Llano, then I can enable a zero-memory-copy optimization that should increase GPU utilisation by ~10% when only a single instance is running (and by a small amount for multiple instances).