mersenneforum.org  

Old 2011-12-30, 05:55   #298
therealwebs
 
Dec 2011
Ottawa, Canada

48 Posts

bdot, I'm using the new mfakto10p1 and it crashes consistently using the --CLtest argument. It passes -st and -st2 flawlessly, though. I've attached a dump from the process*. It seems to blame amdocl64.dll. Have a look if you'd like!

*it's my first time using procdump, so it might not have captured the right thing :S

dump: http://dl.dropbox.com/u/5274619/mfak...230_002904.dmp
Old 2011-12-30, 13:00   #299
debrouxl
 
 
Sep 2009

3D1₁₆ Posts

A while ago, at home, we bought a cheap desktop computer, to replace an old laptop which died after several years of ~24/7 BOINC crunching (World Community Grid, with a short period of RSALS when we were factoring the TI-68k & TI-Z80 512-bit RSA signing keys).
The desktop computer is an Athlon II X4 640 @ 3 GHz, 4 GB of RAM, and a Radeon HD 5450: it has therefore never been intended as a serious number cruncher (either NFS or TF).
But I wanted to test the GPU nonetheless, so I set up fglrx 11-12 under 64-bit Debian Testing.

Unsurprisingly, this GPU is not very fast: few estimated compute elements, lots of CPU wait even with high SievePrimes values. Excerpt of mfakto 0.10 output:
Code:
OpenCL device info
  name                      Cedar (Advanced Micro Devices, Inc.)
  device (driver) version   OpenCL 1.1 AMD-APP (831.4) (CAL 1.4.1646)
  maximum threads per block 128
  maximum threads per grid  2097152
  number of multiprocessors 2 (160 compute elements (estimate for ATI GPUs))
  clock rate                650MHz

snip

got assignment: exp=65XXXXXX bit_min=69 bit_max=70
Starting trial factoring M65XXXXXX from 2^69 to 2^70, k_min = Y -  k_max = Z
Using GPU kernel "mfakto_cl_barrett79"

found a valid checkpoint file!
  last finished class was: 888
  found 0 factors already

    class | candidates |    time | avg. rate | SievePrimes |    ETA | CPU wait
 893/4620 |    247.46M | 26.399s |   9.37M/s |        5000 |  5h40m |  191738us
 896/4620 |    243.27M | 25.684s |   9.47M/s |        5625 |  5h30m |  203540us
 897/4620 |    241.17M | 25.518s |   9.45M/s |        6328 |  5h27m |  203248us
 900/4620 |    239.08M | 25.204s |   9.49M/s |        7119 |  5h23m |  199535us
 908/4620 |    234.88M | 24.866s |   9.45M/s |        8008 |  5h18m |  202758us
 917/4620 |    232.78M | 24.610s |   9.46M/s |        9009 |  5h15m |  200071us
 920/4620 |    230.69M | 24.404s |   9.45M/s |       10135 |  5h11m |  199695us
 921/4620 |    228.59M | 24.182s |   9.45M/s |       11401 |  5h08m |  200666us
 932/4620 |    224.40M | 23.770s |   9.44M/s |       12826 |  5h03m |  199813us
 936/4620 |    222.30M | 23.591s |   9.42M/s |       14429 |  5h00m |  199118us
 941/4620 |    220.20M | 23.306s |   9.45M/s |       16232 |  4h56m |  197502us
 945/4620 |    218.10M | 23.110s |   9.44M/s |       18261 |  4h53m |  197998us
 948/4620 |    216.01M | 22.825s |   9.46M/s |       20543 |  4h49m |  195479us
 953/4620 |    213.91M | 22.615s |   9.46M/s |       23110 |  4h46m |  196987us
 956/4620 |    211.81M | 22.402s |   9.46M/s |       25998 |  4h43m |  195551us
 957/4620 |    209.72M | 22.207s |   9.44M/s |       29247 |  4h40m |  193541us
 965/4620 |    207.62M | 22.004s |   9.44M/s |       32902 |  4h37m |  194135us
 972/4620 |    205.52M | 21.809s |   9.42M/s |       37014 |  4h34m |  189328us
 977/4620 |    203.42M | 21.633s |   9.40M/s |       41640 |  4h32m |  189369us
 980/4620 |    201.33M | 21.381s |   9.42M/s |       46845 |  4h28m |  188979us
    class | candidates |    time | avg. rate | SievePrimes |    ETA | CPU wait
 981/4620 |    199.23M | 21.184s |   9.40M/s |       52700 |  4h25m |  188281us
 992/4620 |    197.13M | 20.996s |   9.39M/s |       59287 |  4h23m |  186956us
1001/4620 |    195.04M | 20.802s |   9.38M/s |       66697 |  4h20m |  185757us
1005/4620 |    192.94M | 20.616s |   9.36M/s |       75034 |  4h17m |  184389us
1008/4620 |    192.94M | 20.543s |   9.39M/s |       84413 |  4h16m |  181585us
1013/4620 |    190.84M | 20.360s |   9.37M/s |       94964 |  4h13m |  179461us
1016/4620 |    188.74M | 20.131s |   9.38M/s |      106834 |  4h10m |  175500us
1020/4620 |    186.65M | 20.009s |   9.33M/s |      120188 |  4h08m |  174065us
1025/4620 |    184.55M | 19.711s |   9.36M/s |      135211 |  4h04m |  170872us
1028/4620 |    184.55M | 19.725s |   9.36M/s |      152112 |  4h04m |  168330us
1032/4620 |    182.45M | 19.588s |   9.31M/s |      171126 |  4h02m |  167198us
1040/4620 |    180.36M | 19.379s |   9.31M/s |      192516 |  3h59m |  161959us
1041/4620 |    180.36M | 19.463s |   9.27M/s |      200000 |  4h00m |  158546us
1053/4620 |    180.36M | 19.391s |   9.30M/s |      200000 |  3h59m |  163496us
1056/4620 |    180.36M | 19.354s |   9.32M/s |      200000 |  3h58m |  162110us
1061/4620 |    180.36M | 19.525s |   9.24M/s |      200000 |  4h00m |  163995us
1065/4620 |    180.36M | 19.550s |   9.23M/s |      200000 |  4h00m |  161185us
1068/4620 |    180.36M | 19.526s |   9.24M/s |      200000 |  3h59m |  162294us
Obviously, I'm not going to make this GPU work much.
But could it somehow be forced to complete the current assignments a bit faster? For instance, with higher SievePrimes values (though values above 180K don't seem to make much of a difference), a SIEVE_SIZE_LIMIT of 64 KB, or something else?
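
Two things can be read straight off the log: the "avg. rate" column is simply candidates divided by time, and SievePrimes grows by one eighth per class until it hits 200000. A minimal Python sketch of both follows. The function names are hypothetical, and the growth rule is only inferred from the posted numbers, not taken from the mfakto source.

```python
def avg_rate_m_per_s(candidates_m, seconds):
    """Average rate in M candidates/s, as in the 'avg. rate' column."""
    return candidates_m / seconds


def next_sieve_primes(current, upper_limit=200_000):
    """Grow SievePrimes by 1/8 using integer arithmetic, capped at upper_limit.

    Assumption: this rule is inferred from the log, not from mfakto itself.
    """
    return min(current + current // 8, upper_limit)


# First row of the log: 247.46M candidates in 26.399 s.
print(round(avg_rate_m_per_s(247.46, 26.399), 2))  # 9.37

# First five SievePrimes values of the log, starting from 5000.
values = [5000]
for _ in range(4):
    values.append(next_sieve_primes(values[-1]))
print(values)  # [5000, 5625, 6328, 7119, 8008]
```

Since the rate stays near 9.4M/s while SievePrimes climbs from 5000 to 200000, the time per class drops only because fewer candidates survive the sieve.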
Old 2011-12-30, 22:42   #300
Bdot
 
 
Nov 2010
Germany

3×199 Posts

Quote:
Originally Posted by therealwebs View Post
bdot, I'm using the new mfakto10p1 and it crashes consistently using the --CLtest argument. It passes -st and -st2 flawlessly, though. I've attached a dump from the process*. It seems to blame amdocl64.dll. Have a look if you'd like!

*it's my first time using procdump, so it might not have captured the right thing :S

dump: http://dl.dropbox.com/u/5274619/mfak...230_002904.dmp
What is the last output of mfakto?

You captured the right thing; however, my debugger cannot make anything meaningful out of it. This is most likely caused by different runtime versions. What I do see is the abort location, amdocl64!clGetSamplerInfo. This is in the OpenCL runtime, but mfakto never calls clGetSamplerInfo, so I assume this part of the information is already wrong. I'll see if I can somehow get more info out of the dump; thanks a lot for providing it. If I can't extract anything better, I'll probably create a special debug version for you that should show more.

Do you still have any aborts during normal operation?
Old 2011-12-30, 23:15   #301
Bdot
 
 
Nov 2010
Germany

3×199 Posts

Quote:
Originally Posted by debrouxl View Post
Athlon II X4 640 @ 3 GHz, 4 GB of RAM, and a Radeon HD 5450: ... fglrx 11-12 under Debian Testing 64 bits
...
Obviously, I'm not going to make this GPU work much.
But could it somehow be forced to complete the current assignments a bit faster? For instance, with higher SievePrimes values (though values above 180K don't seem to make much of a difference), a SIEVE_SIZE_LIMIT of 64 KB, or something else?
Thanks for this test. It confirms that the HD 5450 is capable of delivering about 8-9 GHz-Days/day, probably without consuming a lot of CPU power. That is better than nothing, but certainly not well suited for bringing the GPU-to-72 assignments to 72 bits.

Well, if you have a good idea how to force it to finish the assignment faster, let me know!

Higher SievePrimes would go in this direction. By doubling the CPU effort, you could expect a speedup of 3-5%. As you noticed with the values >180k: not really worth the effort.

A SIEVE_SIZE_LIMIT of 64 kB would make the siever more efficient on your system, as the Athlon CPU has a 64 kB L1 cache. The next mfakto version will have the sieve size configurable, but in your case it would just increase the CPU wait time.

In my eyes, a new kernel has the best chance of a real improvement in throughput. This would be a Barrett kernel based on 24-bit operations. I'm not yet certain whether it would need to be entirely based on 24-bit operations, or whether the 32-bit mul_hi is still allowed. I'm (slowly) working on these kernels, but cannot tell when they'll be ready. Also, it is hard to estimate whether the improvement will be 5% or 50% ...

And I recently thought of another idea that could increase throughput, especially on slower cards: the calculations in the kernel always require an initial division. GPUs are not made for divisions, so I could move this division from the GPU to the CPU, preferably into another thread.
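
To illustrate where such an initial division comes from: Barrett reduction replaces every modular reduction with multiplications and shifts, at the price of one precomputed reciprocal per factor candidate. The Python sketch below is textbook Barrett reduction, not mfakto's kernel code. The names `barrett_setup`, `barrett_reduce` and `divides_mersenne` are hypothetical. The single `//` in `barrett_setup` is the kind of step that could, under this assumption, be done on the CPU.

```python
def barrett_setup(q):
    """Precompute the Barrett constant for modulus q (the one real division)."""
    k = q.bit_length()
    mu = (1 << (2 * k)) // q
    return k, mu


def barrett_reduce(x, q, k, mu):
    """Return x mod q for 0 <= x < 2**(2*k), using only multiply, shift, subtract."""
    t = (x * mu) >> (2 * k)  # estimate of x // q, never too large
    r = x - t * q
    while r >= q:            # the estimate is off by a small amount at most
        r -= q
    return r


def divides_mersenne(p, q):
    """Check whether q divides 2**p - 1 by left-to-right square-and-multiply."""
    k, mu = barrett_setup(q)
    r = 1
    for bit in bin(p)[2:]:
        r = barrett_reduce(r * r, q, k, mu)      # square
        if bit == "1":
            r = barrett_reduce(2 * r, q, k, mu)  # multiply by the base 2
    return r == 1


# M11 = 2047 = 23 * 89
print(divides_mersenne(11, 23), divides_mersenne(11, 89), divides_mersenne(11, 47))
# True True False
```

Every candidate needs its own `mu`, which is why the division sits at the start of each kernel invocation rather than being computed once per assignment.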

But for now, I'm afraid there's nothing in mfakto that you can do to speed it up.

Hmm, can HD 5450 be overclocked? If so, leave the memory clock low but push the core clock higher - this will linearly increase throughput.
Old 2011-12-31, 01:43   #302
therealwebs
 
Dec 2011
Ottawa, Canada

22 Posts

On this machine (2x5870), whether or not mfakto 10p1 crashes seems to be up to chance. I'm trying to run 2 instances mostly unattended, and when I check on it (every 2 to 12 hours), usually one has crashed. I won't be able to dig in and really test until I get back to the physical location tomorrow (everything is being done via Teamviewer/remote desktop). I'll see if I can't screenshot the mfakto window along with a process dump the next time it goes south. Thanks for your work on this!
Old 2011-12-31, 08:46   #303
debrouxl
 
 
Sep 2009

977 Posts

Quote:
Thanks for this test. It confirms that the HD 5450 is capable of delivering about 8-9 GHz-Days/day, probably without consuming a lot of CPU power. That is better than nothing, but certainly not well suited for bringing the GPU-to-72 assignments to 72 bits.
Exactly.
On another computer, which I have intermittent access to, a Mobility Radeon HD 4650 (550 MHz) driven by a Core i7 720QM @ 1.6 GHz (which has lower single-core throughput than the Athlon II X4 640 @ 3 GHz) goes through assignments in the 65M range more than two and a half times faster than the desktop HD 5450...
Unsurprisingly, on both computers, mfakto -d cpu does less than 4M per second.

Quote:
But for now, I'm afraid there's nothing in mfakto that you can do to speed it up.
OK

Quote:
Hmm, can HD 5450 be overclocked? If so, leave the memory clock low but push the core clock higher - this will linearly increase throughput.
I'll look into that, even if I probably won't overclock anything.


Thanks
Old 2012-01-10, 14:49   #304
KyleAskine
 
 
Oct 2011
Maryland

2·5·29 Posts

I'd be interested to see how the new 7970s do if anyone manages to get their hands on one.
Old 2012-01-10, 22:50   #305
Bdot
 
 
Nov 2010
Germany

3·199 Posts

Quote:
Originally Posted by KyleAskine View Post
I'd be interested to see how the new 7970s do if anyone manages to get their hands on one.
Raw figures and game benchmarks look promising, and the simplified instruction scheduling should boost performance of the 32-bit operations quite a bit, even though I have not been able to find detailed specs about the timing of the operations. Not sure if mul32, mul_hi and convert_* still occupy the whole "Graphics Core Next" ...

Another update:
I have been working a bit on mfakto and implemented the variable SieveSizeLimit. In order to test it easily, I also made the upper limit of SievePrimes configurable (between 5000 and 1,000,000). I added a test for it to a new --perftest option, so that you can check which SieveSize best fits the typical SievePrimes values you have. The output contains a list of how fast sieving alone is (this is on an otherwise idle Phenom 2 X4 955 @ 3.2 GHz):
Code:
2. Sieve (M/s)
   SievePrimes:      5000   20000   80000   200000   500000  1000000
SieveSizeLimit
  12 kiB           136.60   89.10   46.78    27.38    14.15     6.28
  24 kiB           152.08  110.37   62.30    39.25    22.82    11.36
  36 kiB           156.94  115.52   71.29    47.36    28.78    15.52
  48 kiB           158.79  119.92   78.07    52.81    33.24    19.01
  59 kiB           157.13  120.70   82.93    54.79    36.58    21.85
  71 kiB           137.47  107.61   77.29    54.07    36.73    23.29
  83 kiB           127.99   99.11   73.83    52.63    37.19    24.49
  95 kiB           122.54   94.50   71.26    51.05    37.70    25.69
 107 kiB           114.02   89.02   67.95    51.58    37.26    26.32
 118 kiB           107.10   84.73   63.78    50.51    37.16    26.74
 142 kiB            99.38   78.03   59.94    49.09    36.82    27.11
 166 kiB            93.95   73.78   57.86    47.93    35.08    27.41
 189 kiB            87.60   69.12   54.13    45.88    35.02    27.41
 213 kiB            83.13   66.16   52.67    45.00    33.74    27.53
 236 kiB            81.05   64.50   51.39    43.93    34.11    27.60
 260 kiB            79.17   62.76   50.06    42.78    34.09    27.24
 283 kiB            77.22   61.57   49.01    42.63    34.19    26.93
 307 kiB            76.78   60.33   47.70    42.01    33.84    27.52
 331 kiB            75.66   59.80   48.18    41.02    33.37    27.37
 354 kiB            73.93   58.56   47.70    41.23    33.00    27.31
 378 kiB            73.40   58.40   47.24    40.37    33.45    27.27
And this is on a stock (2.7GHz ?) i7-2600:
Code:
2. Sieve (M/s)
   SievePrimes:      5000   20000   80000   200000   500000  1000000
SieveSizeLimit
  12 kiB           167.10  107.70   54.40    33.82    20.04    11.96
  24 kiB           189.79  136.77   73.83    47.18    29.02    18.64
  36 kiB           194.79  142.45   86.47    56.14    35.26    23.38
  48 kiB           177.86  135.17   89.44    59.99    38.60    26.43
  59 kiB           162.03  124.87   89.10    61.40    40.98    28.72
  71 kiB           148.61  117.39   86.94    61.31    41.94    30.22
  83 kiB           141.95  112.71   86.40    62.64    43.48    31.89
  95 kiB           136.69  108.89   85.23    63.35    44.90    33.15
 107 kiB           131.41  104.69   82.61    62.55    45.39    33.82
 118 kiB           126.81  101.86   79.87    62.46    45.66    34.41
 142 kiB           120.92   96.91   76.12    62.03    44.90    35.07
 166 kiB           114.77   92.57   74.75    62.40    46.12    33.94
 189 kiB           111.64   90.66   73.31    62.00    47.28    36.93
 213 kiB           107.66   86.50   71.72    60.39    46.84    36.71
 236 kiB           107.26   86.01   71.08    60.86    47.30    38.52
 260 kiB           103.58   84.10   68.98    59.79    46.59    38.97
 283 kiB           102.16   82.36   67.12    58.66    46.62    38.48
 307 kiB           101.63   80.78   66.40    57.28    47.35    37.33
 331 kiB            99.62   79.60   65.41    58.17    47.14    38.61
 354 kiB            97.86   78.80   64.63    56.96    47.37    38.78
 378 kiB            96.47   77.18   64.45    55.75    47.64    38.88
For larger SievePrimes, it is advantageous to increase SieveSizeLimit towards the L2 cache size. This is even more evident when the machine is loaded with more mfakto instances and mprime.
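
As a rough illustration of how such a table can be read, the Python sketch below picks the fastest SieveSizeLimit for each SievePrimes column. It uses an excerpt of the Phenom figures posted above. The parsing assumes the row layout shown here, and `best_sieve_sizes` is a hypothetical helper, not part of mfakto.

```python
# Excerpt of the Phenom 2 X4 955 table (sieve rate in M/s).
PERFTEST_EXCERPT = """\
  12 kiB           136.60   89.10   46.78    27.38    14.15     6.28
  24 kiB           152.08  110.37   62.30    39.25    22.82    11.36
  36 kiB           156.94  115.52   71.29    47.36    28.78    15.52
  48 kiB           158.79  119.92   78.07    52.81    33.24    19.01
  59 kiB           157.13  120.70   82.93    54.79    36.58    21.85
  71 kiB           137.47  107.61   77.29    54.07    36.73    23.29
  95 kiB           122.54   94.50   71.26    51.05    37.70    25.69
 142 kiB            99.38   78.03   59.94    49.09    36.82    27.11
 236 kiB            81.05   64.50   51.39    43.93    34.11    27.60
 378 kiB            73.40   58.40   47.24    40.37    33.45    27.27
"""
SIEVE_PRIMES = [5000, 20000, 80000, 200000, 500000, 1000000]


def best_sieve_sizes(table_text, sieve_primes):
    """Map each SievePrimes value to (best size in kiB, rate in M/s)."""
    rows = []
    for line in table_text.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[1] != "kiB":
            continue  # skip headers and anything that is not a data row
        rows.append((int(fields[0]), [float(v) for v in fields[2:]]))
    best = {}
    for col, sp in enumerate(sieve_primes):
        size, rates = max(rows, key=lambda row: row[1][col])
        best[sp] = (size, rates[col])
    return best


for sp, (size, rate) in best_sieve_sizes(PERFTEST_EXCERPT, SIEVE_PRIMES).items():
    print(f"SievePrimes {sp:>7}: {size:>3} kiB at {rate:.2f} M/s")
```

For this excerpt the selection is 48 kiB at SievePrimes 5000, 59 kiB from 20000 to 200000, 95 kiB at 500000 and 236 kiB at 1000000, which matches the trend towards larger sieve sizes for larger SievePrimes.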

And I finally got around to implementing a Barrett kernel based on mul24. Performance is quite promising (174M/s compared to the other kernel's 135M/s on a HD5770). The only disadvantage is that it does not find any factors yet.
However, a positive side effect: I found a few places in the traditional mul24 kernel where I could combine a left-shift + add into a mad24, increasing the total performance of that kernel by ~2-3%.
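
The shift-plus-add rewrite relies on the identity (x << s) + z = x * 2^s + z, which is a single multiply-add. The actual kernel lines are not shown in this thread, so the following is only a Python emulation of the unsigned OpenCL `mad24` built-in for in-range inputs, with a hypothetical `shift_add` standing in for the two-instruction form.

```python
MASK24 = (1 << 24) - 1
MASK32 = (1 << 32) - 1


def mad24(x, y, z):
    """Emulate unsigned mad24: low 32 bits of x * y + z.

    x and y must fit in 24 bits; OpenCL leaves other inputs
    implementation-defined, so they are rejected here.
    """
    if not (0 <= x <= MASK24 and 0 <= y <= MASK24):
        raise ValueError("mad24 operands must fit in 24 bits")
    return (x * y + z) & MASK32


def shift_add(x, s, z):
    """Two-step form: 32-bit left shift followed by a 32-bit add."""
    return (((x << s) & MASK32) + z) & MASK32


x, s, z = 0x00ABCDEF, 5, 0x12345678
print(hex(shift_add(x, s, z)), hex(mad24(x, 1 << s, z)))
print(shift_add(x, s, z) == mad24(x, 1 << s, z))  # True
```

The equivalence holds only while x fits in 24 bits and the shift count is at most 23, so that 1 << s is itself a valid 24-bit operand.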
Old 2012-01-10, 23:28   #306
Dubslow
Basketry That Evening!
 
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

16065₈ Posts

Quote:
Originally Posted by Bdot View Post
And this is on a stock (2.7GHz ?) i7-2600:
3.4 GHz, turbo to 3.8.

(mfaktc...)
Old 2012-01-11, 00:06   #307
kracker
 
 
"Mr. Meeseeks"
Jan 2012
California, USA

878₁₆ Posts
works

Works good on my Llano A8-3850 apu, thanks :)
Attachment: picture.jpg (141.8 KB)
Old 2012-01-11, 08:51   #308
Bdot
 
 
Nov 2010
Germany

3×199 Posts

Quote:
Originally Posted by kracker View Post
Works good on my Llano A8-3850 apu, thanks :)
Thanks for this info! Could you please also post the OpenCL device info part as mfakto reports it? If I can easily figure out that we're running on Llano, then I can enable a zero-memory-copy optimization that should increase GPU utilisation by ~10% when only a single instance is running (and by a small amount for multi-instance).

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.