bdot, I'm using the new mfakto10p1 and it crashes consistently using the --CLtest argument. It passes -st and -st2 flawlessly, though. I've attached a dump from the process*. It seems to blame amdocl64.dll. Have a look if you'd like!
*it's my first time using procdump, so it might not have captured the right thing :S dump: [url]http://dl.dropbox.com/u/5274619/mfakto_dmp_111230_002904.dmp[/url] |
A while ago, at home, we bought a cheap desktop computer, to replace an old laptop which died after several years of ~24/7 BOINC crunching (World Community Grid, with a short period of RSALS when we were factoring the TI-68k & TI-Z80 512-bit RSA signing keys).
The desktop computer is an Athlon II X4 640 @ 3 GHz, 4 GB of RAM, and a Radeon HD 5450: it has therefore never been intended as a serious number cruncher (either NFS or TF). But I wanted to test the GPU nonetheless, so I set up fglrx 11-12 under Debian Testing 64 bits :smile: Unsurprisingly, this GPU is not very fast: few estimated compute elements, lots of CPU wait even with high SievePrimes values. Excerpt of mfakto 0.10 output:
[code]OpenCL device info
  name                      Cedar (Advanced Micro Devices, Inc.)
  device (driver) version   OpenCL 1.1 AMD-APP (831.4) (CAL 1.4.1646)
  maximum threads per block 128
  maximum threads per grid  2097152
  number of multiprocessors 2 (160 compute elements (estimate for ATI GPUs))
  clock rate                650MHz
snip
got assignment: exp=65XXXXXX bit_min=69 bit_max=70
Starting trial factoring M65XXXXXX from 2^69 to 2^70, k_min = Y - k_max = Z
Using GPU kernel "mfakto_cl_barrett79"
  found a valid checkpoint file!
  last finished class was: 888
  found 0 factors already
    class | candidates |    time | avg. rate | SievePrimes |   ETA | CPU wait
 893/4620 |    247.46M | 26.399s |   9.37M/s |        5000 | 5h40m | 191738us
 896/4620 |    243.27M | 25.684s |   9.47M/s |        5625 | 5h30m | 203540us
 897/4620 |    241.17M | 25.518s |   9.45M/s |        6328 | 5h27m | 203248us
 900/4620 |    239.08M | 25.204s |   9.49M/s |        7119 | 5h23m | 199535us
 908/4620 |    234.88M | 24.866s |   9.45M/s |        8008 | 5h18m | 202758us
 917/4620 |    232.78M | 24.610s |   9.46M/s |        9009 | 5h15m | 200071us
 920/4620 |    230.69M | 24.404s |   9.45M/s |       10135 | 5h11m | 199695us
 921/4620 |    228.59M | 24.182s |   9.45M/s |       11401 | 5h08m | 200666us
 932/4620 |    224.40M | 23.770s |   9.44M/s |       12826 | 5h03m | 199813us
 936/4620 |    222.30M | 23.591s |   9.42M/s |       14429 | 5h00m | 199118us
 941/4620 |    220.20M | 23.306s |   9.45M/s |       16232 | 4h56m | 197502us
 945/4620 |    218.10M | 23.110s |   9.44M/s |       18261 | 4h53m | 197998us
 948/4620 |    216.01M | 22.825s |   9.46M/s |       20543 | 4h49m | 195479us
 953/4620 |    213.91M | 22.615s |   9.46M/s |       23110 | 4h46m | 196987us
 956/4620 |    211.81M | 22.402s |   9.46M/s |       25998 | 4h43m | 195551us
 957/4620 |    209.72M | 22.207s |   9.44M/s |       29247 | 4h40m | 193541us
 965/4620 |    207.62M | 22.004s |   9.44M/s |       32902 | 4h37m | 194135us
 972/4620 |    205.52M | 21.809s |   9.42M/s |       37014 | 4h34m | 189328us
 977/4620 |    203.42M | 21.633s |   9.40M/s |       41640 | 4h32m | 189369us
 980/4620 |    201.33M | 21.381s |   9.42M/s |       46845 | 4h28m | 188979us
    class | candidates |    time | avg. rate | SievePrimes |   ETA | CPU wait
 981/4620 |    199.23M | 21.184s |   9.40M/s |       52700 | 4h25m | 188281us
 992/4620 |    197.13M | 20.996s |   9.39M/s |       59287 | 4h23m | 186956us
1001/4620 |    195.04M | 20.802s |   9.38M/s |       66697 | 4h20m | 185757us
1005/4620 |    192.94M | 20.616s |   9.36M/s |       75034 | 4h17m | 184389us
1008/4620 |    192.94M | 20.543s |   9.39M/s |       84413 | 4h16m | 181585us
1013/4620 |    190.84M | 20.360s |   9.37M/s |       94964 | 4h13m | 179461us
1016/4620 |    188.74M | 20.131s |   9.38M/s |      106834 | 4h10m | 175500us
1020/4620 |    186.65M | 20.009s |   9.33M/s |      120188 | 4h08m | 174065us
1025/4620 |    184.55M | 19.711s |   9.36M/s |      135211 | 4h04m | 170872us
1028/4620 |    184.55M | 19.725s |   9.36M/s |      152112 | 4h04m | 168330us
1032/4620 |    182.45M | 19.588s |   9.31M/s |      171126 | 4h02m | 167198us
1040/4620 |    180.36M | 19.379s |   9.31M/s |      192516 | 3h59m | 161959us
1041/4620 |    180.36M | 19.463s |   9.27M/s |      200000 | 4h00m | 158546us
1053/4620 |    180.36M | 19.391s |   9.30M/s |      200000 | 3h59m | 163496us
1056/4620 |    180.36M | 19.354s |   9.32M/s |      200000 | 3h58m | 162110us
1061/4620 |    180.36M | 19.525s |   9.24M/s |      200000 | 4h00m | 163995us
1065/4620 |    180.36M | 19.550s |   9.23M/s |      200000 | 4h00m | 161185us
1068/4620 |    180.36M | 19.526s |   9.24M/s |      200000 | 3h59m | 162294us[/code]
Obviously, I'm not going to make this GPU work much :smile: But could it somehow be forced to complete the current assignments a bit faster? For instance, higher SievePrimes values (though values above 180K don't seem to make much of a difference), a SIEVE_SIZE_LIMIT of 64 KB, or something else? |
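For anyone reading the status lines in the log above: the avg. rate column is simply candidates divided by time, and the ETA can be roughly reproduced from the class number. Here is a minimal Python sketch of that arithmetic. It assumes that 960 of the 4620 classes actually get tested (the others are skipped because every candidate in them is divisible by 3, 5, 7 or 11, or is not ±1 mod 8) and that each remaining class takes as long as the current one. The function names are made up for illustration; this is not mfakto's own code.

```python
TOTAL_CLASSES = 4620   # residue classes of k, as in the "893/4620" column
TESTED_CLASSES = 960   # assumption: classes surviving the small-prime / mod 8 filter


def avg_rate_m_per_s(candidates_m, seconds):
    """Candidates tested per second, in millions (the 'avg. rate' column)."""
    return candidates_m / seconds


def eta_seconds(current_class, seconds_per_class):
    """Rough ETA: surviving classes still ahead times the current class time."""
    remaining_fraction = (TOTAL_CLASSES - current_class) / TOTAL_CLASSES
    return TESTED_CLASSES * remaining_fraction * seconds_per_class


def format_eta(seconds):
    """Format a duration the way the log does, e.g. 5h40m."""
    hours, rest = divmod(int(seconds), 3600)
    return f"{hours}h{rest // 60:02d}m"


if __name__ == "__main__":
    # First status line of the log: 893/4620 | 247.46M | 26.399s
    print(f"{avg_rate_m_per_s(247.46, 26.399):.2f}M/s")
    print(format_eta(eta_seconds(893, 26.399)))
```

With the first status line as input this prints 9.37M/s and 5h40m, the same values the log shows. Later lines can differ by a minute or so, since the time per class keeps changing while SievePrimes adapts.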
[QUOTE=therealwebs;284024]bdot, I'm using the new mfakto10p1 and it crashes consistently using the --CLtest argument. It passes -st and -st2 flawlessly, though. I've attached a dump from the process*. It seems to blame amdocl64.dll. Have a look if you'd like!
*it's my first time using procdump, so it might not have captured the right thing :S dump: [URL]http://dl.dropbox.com/u/5274619/mfakto_dmp_111230_002904.dmp[/URL][/QUOTE]
What is the last output of mfakto? You captured the right thing; however, my debugger cannot make anything meaningful out of it. This is most likely caused by different runtime versions. What I do see is the abort location, amdocl64!clGetSamplerInfo. This is in the OpenCL runtime, but mfakto never calls clGetSamplerInfo, so I assume this part of the information is already wrong. I'll see if I can somehow get more info out of the dump, thanks a lot for providing it. If I can't extract anything better, I'll probably create a special debug version for you that should show more. Do you still have any aborts during normal operation? |
[QUOTE=debrouxl;284049] Athlon II X4 640 @ 3 GHz, 4 GB of RAM, and a Radeon HD 5450: ... fglrx 11-12 under Debian Testing 64 bits :smile:
... Obviously, I'm not going to make this GPU work much :smile: But could it somehow be forced to complete the current assignments a bit faster? For instance, higher SievePrimes values (though values above 180K don't seem to make much of a difference), a SIEVE_SIZE_LIMIT of 64 KB, or something else?[/QUOTE]
Thanks for this test. It confirms that the HD 5450 is capable of delivering about 8-9 GHz-Days/day, probably without consuming a lot of CPU power. Better than nothing, but certainly not well-suited for bringing the GPU-to-72 assignments to 72 bits.

Well, if you have a good idea how to force it to finish the assignment faster, let me know! Higher SievePrimes would go in this direction: by doubling the CPU effort you could expect a speedup of 3-5%. As you noticed with the values >180k, that is not really worth the effort. SIEVE_SIZE_LIMIT 64kB would make the siever more efficient on your system as the Athlon CPU has 64kB L1 cache. The next mfakto version will have the sieve size configurable, but in your case it would just increase the CPU wait time.

In my eyes a new kernel has the best chance for real improvement of the throughput. This would be a barrett kernel based on 24-bit operations. I'm not yet certain whether it would need to be entirely based on 24-bit, or if the 32-bit mul_hi is still allowed. I'm (slowly) working on these kernels, but cannot tell when they'll be ready. Also, it is hard to give a good estimate of whether the improvement will be 5% or 50% ...

And I recently thought of another idea that could increase throughput, especially on slower cards: the calculations in the kernel always require an initial division. GPUs are not made for divisions, so I could move this division from the GPU to the CPU, preferably into another thread.

But for now, I'm afraid there's nothing in mfakto that you can do to speed it up. Hmm, can the HD 5450 be overclocked? If so, leave the memory clock low but push the core clock higher - this will linearly increase throughput. |
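To illustrate what a "barrett kernel" with an "initial division" means, here is a small Python sketch of the general technique. It checks whether a candidate q divides 2^p - 1 by computing 2^p mod q with Barrett reduction: one real division up front to get a scaled reciprocal, then only multiplications, shifts and subtractions. This is a textbook version on Python integers, not mfakto's kernel code. All function names are made up, and reading the "initial division" as the reciprocal setup is my assumption.

```python
def barrett_setup(q):
    """One-time setup: k = bit length of q, mu = floor(4**k / q).

    This is the only real division; it is the kind of step that could
    be precomputed on the CPU (assumption, see the text above).
    """
    k = q.bit_length()
    mu = (1 << (2 * k)) // q
    return k, mu


def barrett_reduce(x, q, k, mu):
    """Return x mod q for 0 <= x < q*q, without dividing by q."""
    estimate = (x * mu) >> (2 * k)   # never larger than x // q
    r = x - estimate * q
    while r >= q:                    # small correction, at most a few steps
        r -= q
    return r


def divides_mersenne(p, q):
    """True if q divides 2**p - 1, i.e. 2**p mod q == 1."""
    k, mu = barrett_setup(q)
    r = 1
    for bit in bin(p)[2:]:           # left-to-right binary exponentiation
        r = barrett_reduce(r * r, q, k, mu)
        if bit == "1":
            r <<= 1                  # multiply by the base 2
            if r >= q:
                r -= q
    return r == 1


if __name__ == "__main__":
    # 2**11 - 1 = 2047 = 23 * 89; candidates have the form q = 2*k*p + 1
    p = 11
    print([2 * k * p + 1 for k in range(1, 5) if divides_mersenne(p, 2 * k * p + 1)])
```

The example prints [23, 89], the two prime factors of M11. A GPU kernel does the same thing with fixed-width multi-word integers, which is where the choice between 24-bit and 32-bit multiplications comes in.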
On this machine (2x5870), whether or not mfakto 10p1 crashes seems to be up to chance. I'm trying to run 2 instances mostly unattended, and when I check on it (every 2 to 12 hours), usually one has crashed. I won't be able to dig in and really test until I get back to the physical location tomorrow (everything is being done via Teamviewer/remote desktop). I'll see if I can't screenshot the mfakto window along with a process dump the next time it goes south. Thanks for your work on this!
|
[quote]Thanks for this test. It confirms that the HD 5450 is capable of delivering about 8-9 GHz-Days/day. Probably without consuming a lot of CPU power. Better than nothing, but certainly not well-suited for bringing the GPU-to-72 assignments to 72 bits.[/quote]
Exactly. On another computer, which I have intermittent access to, a Mobility Radeon HD 4650 (550 MHz) driven by a Core i7 720QM @ 1.6 GHz (which has lower single-core throughput than the Athlon II X4 640 @ 3 GHz) goes through assignments in the 65M range more than two and a half times faster than the desktop HD 5450... Unsurprisingly, on both computers, mfakto -d cpu does less than 4M per second.
[quote]But for now, I'm afraid there's nothing in mfakto that you can do to speed it up.[/quote]
OK :smile:
[quote]Hmm, can HD 5450 be overclocked? If so, leave the memory clock low but push the core clock higher - this will linearly increase throughput.[/quote]
I'll look into that, even if I probably won't overclock anything. Thanks :smile: |
I'd be interested to see how the new 7970's do if anyone manages to get their hands on one.
|
[QUOTE=KyleAskine;285724]I'd be interested to see how the new 7970's do if anyone manages to get their hands on one.[/QUOTE]
Raw figures and game benchmarks look promising, and the simplified instruction scheduling should boost performance of the 32-bit operations quite a bit, even though I have not been able to find detailed specs about the timing of the operations. Not sure if mul32, mul_hi and convert_* still occupy the whole "Graphics Core Next" ...

Another update: I have been working a bit on mfakto and implemented the variable SieveSizeLimit. And in order to easily test it, I also made the upper limit of SievePrimes configurable (between 5000 and 1,000,000). I added a test for it to a new --perftest option, so that you can check which SieveSize fits best to the typical SievePrimes values you have. The output contains a list of how fast sieving alone is (this is on an otherwise idle Phenom 2 X4 955 @ 3.2 GHz):
[code]
  2. Sieve (M/s)
SievePrimes:       5000   20000   80000  200000  500000 1000000
SieveSizeLimit
   12 kiB        136.60   89.10   46.78   27.38   14.15    6.28
   24 kiB        152.08  110.37   62.30   39.25   22.82   11.36
   36 kiB        156.94  115.52   71.29   47.36   28.78   15.52
   48 kiB        158.79  119.92   78.07   52.81   33.24   19.01
   59 kiB        157.13  120.70   82.93   54.79   36.58   21.85
   71 kiB        137.47  107.61   77.29   54.07   36.73   23.29
   83 kiB        127.99   99.11   73.83   52.63   37.19   24.49
   95 kiB        122.54   94.50   71.26   51.05   37.70   25.69
  107 kiB        114.02   89.02   67.95   51.58   37.26   26.32
  118 kiB        107.10   84.73   63.78   50.51   37.16   26.74
  142 kiB         99.38   78.03   59.94   49.09   36.82   27.11
  166 kiB         93.95   73.78   57.86   47.93   35.08   27.41
  189 kiB         87.60   69.12   54.13   45.88   35.02   27.41
  213 kiB         83.13   66.16   52.67   45.00   33.74   27.53
  236 kiB         81.05   64.50   51.39   43.93   34.11   27.60
  260 kiB         79.17   62.76   50.06   42.78   34.09   27.24
  283 kiB         77.22   61.57   49.01   42.63   34.19   26.93
  307 kiB         76.78   60.33   47.70   42.01   33.84   27.52
  331 kiB         75.66   59.80   48.18   41.02   33.37   27.37
  354 kiB         73.93   58.56   47.70   41.23   33.00   27.31
  378 kiB         73.40   58.40   47.24   40.37   33.45   27.27
[/code]
And this is on a stock (2.7GHz ?) i7-2600:
[code]
  2. Sieve (M/s)
SievePrimes:       5000   20000   80000  200000  500000 1000000
SieveSizeLimit
   12 kiB        167.10  107.70   54.40   33.82   20.04   11.96
   24 kiB        189.79  136.77   73.83   47.18   29.02   18.64
   36 kiB        194.79  142.45   86.47   56.14   35.26   23.38
   48 kiB        177.86  135.17   89.44   59.99   38.60   26.43
   59 kiB        162.03  124.87   89.10   61.40   40.98   28.72
   71 kiB        148.61  117.39   86.94   61.31   41.94   30.22
   83 kiB        141.95  112.71   86.40   62.64   43.48   31.89
   95 kiB        136.69  108.89   85.23   63.35   44.90   33.15
  107 kiB        131.41  104.69   82.61   62.55   45.39   33.82
  118 kiB        126.81  101.86   79.87   62.46   45.66   34.41
  142 kiB        120.92   96.91   76.12   62.03   44.90   35.07
  166 kiB        114.77   92.57   74.75   62.40   46.12   33.94
  189 kiB        111.64   90.66   73.31   62.00   47.28   36.93
  213 kiB        107.66   86.50   71.72   60.39   46.84   36.71
  236 kiB        107.26   86.01   71.08   60.86   47.30   38.52
  260 kiB        103.58   84.10   68.98   59.79   46.59   38.97
  283 kiB        102.16   82.36   67.12   58.66   46.62   38.48
  307 kiB        101.63   80.78   66.40   57.28   47.35   37.33
  331 kiB         99.62   79.60   65.41   58.17   47.14   38.61
  354 kiB         97.86   78.80   64.63   56.96   47.37   38.78
  378 kiB         96.47   77.18   64.45   55.75   47.64   38.88
[/code]
For larger SievePrimes it is advantageous to increase SieveSizeLimit towards the L2 cache size. This is even more evident when the machine is loaded with more mfakto instances and mprime.

And I finally got around to implementing a barrett kernel based on mul24. Performance is quite promising (174M/s compared to the other kernel's 135M/s on a HD5770). The only disadvantage is that it does not find any factors yet :redface:. However, a positive side effect: I found a few places in the traditional mul24 kernel where I could combine a left-shift + add into a mad24, increasing the total performance of that kernel by ~2-3%. |
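For readers wondering what "combine a left-shift + add into a mad24" amounts to: shifting left by n bits is the same as multiplying by 2^n, so (x << n) + z can be written as a single multiply-add. Below is a small Python emulation of that identity. It assumes the unsigned OpenCL mad24(x, y, z) returns the low 32 bits of x*y + z when x and y fit in 24 bits. Masking out-of-range inputs is my own choice for the emulation, since OpenCL leaves that case implementation-defined, and the helper names are hypothetical.

```python
import random

MASK24 = (1 << 24) - 1
MASK32 = (1 << 32) - 1


def mad24(x, y, z):
    """Emulated unsigned mad24: low 32 bits of (24-bit x) * (24-bit y) + z.

    Masking x and y to 24 bits is an assumption of this sketch only.
    """
    return ((x & MASK24) * (y & MASK24) + z) & MASK32


def shift_then_add(x, shift, z):
    """Two separate operations: a left shift followed by an add."""
    return ((x << shift) + z) & MASK32


def fused(x, shift, z):
    """The same result from one multiply-add, using 2**shift as the factor."""
    return mad24(x, 1 << shift, z)


if __name__ == "__main__":
    rng = random.Random(1)
    for _ in range(5):
        x = rng.randrange(1 << 24)    # must fit in 24 bits
        shift = rng.randrange(24)     # 2**shift must fit in 24 bits too
        z = rng.randrange(1 << 32)
        print(shift_then_add(x, shift, z) == fused(x, shift, z))
```

The identity only holds while both x and 2^shift fit into 24 bits, which is why the sketch restricts its random inputs to that range.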
[QUOTE=Bdot;285814]And this is on a stock (2.7GHz ?) i7-2600:
[/QUOTE] 3.4 GHz, turbo to 3.8. (mfaktc...) |
works
Works good on my Llano A8-3850 apu, thanks :)
|
[QUOTE=kracker;285827]Works good on my Llano A8-3850 apu, thanks :)[/QUOTE]
Thanks for this info! Could you please also post the OpenCL device info part as mfakto reports it? If I can easily figure out we're running on Llano, then I can enable a zero-memory-copy optimization, that should increase GPU utilisation by ~10% when only a single instance is running (and by a small amount for multi-instance). |