mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfakto: an OpenCL program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=15646)

flashjh 2011-12-24 23:49

Warning Question
 
I have several TF DC assignments for my P4 3.4 system with a HD 4670. Anyway, I noticed it kept giving a warning about a particular exponent and would skip it. I finally got around to messing with it.

Factor=N/A,27960979,68,69

Always gives:

WARNING: exponent is not prime!
Ignoring TF M27960979 from 2^68 to 2^69!
WARNING: ignoring line 1 in "worktodo.txt"! Reason: invalid data

So, I know it's not prime and both mfak(co) say the same thing... is there any way to fulfill my GPU TF on this exponent or do I need to use Prime95 for this one?
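The check that rejects this line can be sketched in a few lines of Python. This is only an illustration, not mfakto's actual code: the helper names `parse_factor_line` and `is_prime` are made up for this example, and the field layout (assignment key, exponent, start bit, end bit) is assumed from the worktodo line shown above.

```python
from math import isqrt

def is_prime(n: int) -> bool:
    """Trial division; fast enough for exponents of this size."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for d in range(3, isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True

def parse_factor_line(line: str):
    """Split 'Factor=<key>,<exponent>,<bit_min>,<bit_max>' (assumed layout).

    Returns (exponent, bit_min, bit_max) or raises ValueError on bad input.
    """
    keyword, _, rest = line.strip().partition("=")
    if keyword != "Factor":
        raise ValueError("not a Factor= line")
    fields = rest.split(",")
    if len(fields) != 4:
        raise ValueError("expected 4 comma-separated fields")
    exponent, bit_min, bit_max = (int(f) for f in fields[1:])
    if bit_min >= bit_max:
        raise ValueError("bit_min must be below bit_max")
    return exponent, bit_min, bit_max

exponent, bit_min, bit_max = parse_factor_line("Factor=N/A,27960979,68,69")
if not is_prime(exponent):
    # Mirrors the warning in spirit only; the wording is not mfakto's.
    print(f"exponent {exponent} is not prime, skipping 2^{bit_min} to 2^{bit_max}")
```

Running a check like this over a pasted list before starting the program is a cheap way to catch a mistyped digit.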

TheJudger 2011-12-25 00:32

[QUOTE=flashjh;283454]
[...]
Factor=N/A,27960979,68,69

Always gives:

WARNING: exponent is not prime!
Ignoring TF M27960979 from 2^68 to 2^69!
WARNING: ignoring line 1 in "worktodo.txt"! Reason: invalid data

So, I know it's not prime and both mfak(co) say the same thing... [/QUOTE]
Well, no surprise that both, mfaktc and mfakto, tell you that this exponent is not prime... it is the same code!

[QUOTE=flashjh;283454]is there any way to fulfill my GPU TF on this exponent or do I need to use Prime95 for this one?[/QUOTE]
Take the source code and disable the check for prime-only exponents. :razz:
This will work for your assignment (M27960979 from 2[SUP]68[/SUP] to 2[SUP]69[/SUP]) because the start is "big enough". The problem with non-prime exponents is that the prime factors can be very small: for prime exponents every factor of M[SUB]p[/SUB] has the form 2kp+1, so the smallest possible factor is 2p+1, but for composite exponents factors can be much smaller than the exponent itself. Those very small factors may be sieved out before they are ever tested, simply because no code has been written to handle that case.

Oliver
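The point about factor sizes can be checked numerically with two tiny exponents. The sketch below is an illustration only (the function names are invented here, and nothing in it is taken from mfaktc or mfakto): it tests divisibility with modular exponentiation and reports whether a factor has the form 2kp+1.

```python
def divides_mersenne(f: int, n: int) -> bool:
    """True if f divides 2^n - 1, checked by modular exponentiation."""
    return pow(2, n, f) == 1

def k_for(f: int, p: int):
    """Return k when f == 2*k*p + 1, otherwise None."""
    k, remainder = divmod(f - 1, 2 * p)
    return k if remainder == 0 else None

# Prime exponent 11: M11 = 2047 = 23 * 89, both of the form 2*k*11 + 1.
for f in (23, 89):
    print(f, divides_mersenne(f, 11), k_for(f, 11))

# Composite exponent 15: M15 = 32767 = 7 * 31 * 151.
# 7 divides it although it is not of the form 2*k*15 + 1,
# and it is smaller than the exponent itself.
for f in (7, 31, 151):
    print(f, divides_mersenne(f, 15), k_for(f, 15))
```

A search that only visits candidates 2kp+1 above some starting bit level never looks at a factor like 7, which is why the composite case would need extra code.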

flashjh 2011-12-25 01:40

[QUOTE=TheJudger;283455]Well, no surprise that both, mfaktc and mfakto, tell you that this exponent is not prime... it is the same code!
[/QUOTE]

Fair enough... I was just making sure everyone knew I tested both and since I hadn't seen this before in my readings I didn't want that to be the reason :cool:

axn 2011-12-25 02:58

[QUOTE=flashjh;283454]
So, I know it's not prime and both mfak(co) say the same thing... is there any way to fulfill my GPU TF on this exponent or do I need to use Prime95 for this one?[/QUOTE]

If the exponent is not prime, you have an invalid exponent (possibly due to a typo). Find the correct exponent and do the TF on it. If you can't find the correct exponent, throw out that line. GIMPS does not deal with composite exponents. Even P95 will balk at that one.

flashjh 2011-12-26 05:03

Reporting Question
 
Lesson learned for me... I copy and paste the lists in, but somehow I messed up that one. I fixed it to match my assignments (one digit was off) and everything worked. Thanks for your help.

And a question... Has anyone noticed PrimeNet result changes? I now use the spider to post my results (which is awesome by the way). I noticed that PrimeNet now shows all my 'factor' results as F-PM1 instead of just F. The results column has the correct factor, but since I use mfakt(oc) for all my TFing I was wondering why PrimeNet is showing the change. Is the spider making a mistake when reporting or is PrimeNet making a mistake? Also, PrimeNet and GPU to 72 don't show the same GHz days since PrimeNet thinks it was found with P-1.


Jerry

Dubslow 2011-12-26 20:04

[url]http://www.mersenneforum.org/showthread.php?t=12827&page=58[/url]

Last post there ^, and there's a few posts on the next page. I'd read through changelog.txt as well. This is a known issue and hopefully will be fixed soon.

therealwebs 2011-12-30 05:55

bdot, I'm using the new mfakto10p1 and it crashes consistently using the --CLtest argument. It passes -st and -st2 flawlessly, though. I've attached a dump from the process*. It seems to blame amdocl64.dll. Have a look if you'd like!

*it's my first time using procdump, so it might not have captured the right thing :S

dump: [url]http://dl.dropbox.com/u/5274619/mfakto_dmp_111230_002904.dmp[/url]

debrouxl 2011-12-30 13:00

A while ago, at home, we bought a cheap desktop computer, to replace an old laptop which died after several years of ~24/7 BOINC crunching (World Community Grid, with a short period of RSALS when we were factoring the TI-68k & TI-Z80 512-bit RSA signing keys).
The desktop computer is an Athlon II X4 640 @ 3 GHz, 4 GB of RAM, and a Radeon HD 5450: it has therefore never been intended as a serious number cruncher (either NFS or TF).
But I wanted to test the GPU nonetheless, so I set up fglrx 11-12 under Debian Testing 64 bits :smile:

Unsurprisingly, this GPU is not very fast: few estimated compute elements, lots of CPU wait even with high SievePrimes values. Excerpt of mfakto 0.10 output:
[code]OpenCL device info
name Cedar (Advanced Micro Devices, Inc.)
device (driver) version OpenCL 1.1 AMD-APP (831.4) (CAL 1.4.1646)
maximum threads per block 128
maximum threads per grid 2097152
number of multiprocessors 2 (160 compute elements (estimate for ATI GPUs))
clock rate 650MHz

snip

got assignment: exp=65XXXXXX bit_min=69 bit_max=70
Starting trial factoring M65XXXXXX from 2^69 to 2^70, k_min = Y - k_max = Z
Using GPU kernel "mfakto_cl_barrett79"

found a valid checkpoint file!
last finished class was: 888
found 0 factors already

class | candidates | time | avg. rate | SievePrimes | ETA | CPU wait
893/4620 | 247.46M | 26.399s | 9.37M/s | 5000 | 5h40m | 191738us
896/4620 | 243.27M | 25.684s | 9.47M/s | 5625 | 5h30m | 203540us
897/4620 | 241.17M | 25.518s | 9.45M/s | 6328 | 5h27m | 203248us
900/4620 | 239.08M | 25.204s | 9.49M/s | 7119 | 5h23m | 199535us
908/4620 | 234.88M | 24.866s | 9.45M/s | 8008 | 5h18m | 202758us
917/4620 | 232.78M | 24.610s | 9.46M/s | 9009 | 5h15m | 200071us
920/4620 | 230.69M | 24.404s | 9.45M/s | 10135 | 5h11m | 199695us
921/4620 | 228.59M | 24.182s | 9.45M/s | 11401 | 5h08m | 200666us
932/4620 | 224.40M | 23.770s | 9.44M/s | 12826 | 5h03m | 199813us
936/4620 | 222.30M | 23.591s | 9.42M/s | 14429 | 5h00m | 199118us
941/4620 | 220.20M | 23.306s | 9.45M/s | 16232 | 4h56m | 197502us
945/4620 | 218.10M | 23.110s | 9.44M/s | 18261 | 4h53m | 197998us
948/4620 | 216.01M | 22.825s | 9.46M/s | 20543 | 4h49m | 195479us
953/4620 | 213.91M | 22.615s | 9.46M/s | 23110 | 4h46m | 196987us
956/4620 | 211.81M | 22.402s | 9.46M/s | 25998 | 4h43m | 195551us
957/4620 | 209.72M | 22.207s | 9.44M/s | 29247 | 4h40m | 193541us
965/4620 | 207.62M | 22.004s | 9.44M/s | 32902 | 4h37m | 194135us
972/4620 | 205.52M | 21.809s | 9.42M/s | 37014 | 4h34m | 189328us
977/4620 | 203.42M | 21.633s | 9.40M/s | 41640 | 4h32m | 189369us
980/4620 | 201.33M | 21.381s | 9.42M/s | 46845 | 4h28m | 188979us
class | candidates | time | avg. rate | SievePrimes | ETA | CPU wait
981/4620 | 199.23M | 21.184s | 9.40M/s | 52700 | 4h25m | 188281us
992/4620 | 197.13M | 20.996s | 9.39M/s | 59287 | 4h23m | 186956us
1001/4620 | 195.04M | 20.802s | 9.38M/s | 66697 | 4h20m | 185757us
1005/4620 | 192.94M | 20.616s | 9.36M/s | 75034 | 4h17m | 184389us
1008/4620 | 192.94M | 20.543s | 9.39M/s | 84413 | 4h16m | 181585us
1013/4620 | 190.84M | 20.360s | 9.37M/s | 94964 | 4h13m | 179461us
1016/4620 | 188.74M | 20.131s | 9.38M/s | 106834 | 4h10m | 175500us
1020/4620 | 186.65M | 20.009s | 9.33M/s | 120188 | 4h08m | 174065us
1025/4620 | 184.55M | 19.711s | 9.36M/s | 135211 | 4h04m | 170872us
1028/4620 | 184.55M | 19.725s | 9.36M/s | 152112 | 4h04m | 168330us
1032/4620 | 182.45M | 19.588s | 9.31M/s | 171126 | 4h02m | 167198us
1040/4620 | 180.36M | 19.379s | 9.31M/s | 192516 | 3h59m | 161959us
1041/4620 | 180.36M | 19.463s | 9.27M/s | 200000 | 4h00m | 158546us
1053/4620 | 180.36M | 19.391s | 9.30M/s | 200000 | 3h59m | 163496us
1056/4620 | 180.36M | 19.354s | 9.32M/s | 200000 | 3h58m | 162110us
1061/4620 | 180.36M | 19.525s | 9.24M/s | 200000 | 4h00m | 163995us
1065/4620 | 180.36M | 19.550s | 9.23M/s | 200000 | 4h00m | 161185us
1068/4620 | 180.36M | 19.526s | 9.24M/s | 200000 | 3h59m | 162294us[/code]

Obviously, I'm not going to make this GPU work much :smile:
But could it somehow be forced to complete the current assignments a bit faster ? For instance, higher SievePrimes values (though values above 180K don't seem to make much of a difference), a SIEVE_SIZE_LIMIT of 64 KB, or something else ?
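The SievePrimes column in the log above follows a simple pattern: each value is the previous one plus one eighth of it (in integer arithmetic), until it reaches 200000. The sketch below reproduces that ramp. It is inferred from the log alone, so the function name, the assumption that the step is exactly `current // 8`, and the cap are illustrative; when the adjustment is triggered and how the value is lowered again are not modeled.

```python
SIEVE_PRIMES_MAX = 200000  # upper limit seen in the log

def next_sieve_primes(current: int) -> int:
    """Raise SievePrimes by one eighth (integer division), capped at the maximum."""
    return min(current + current // 8, SIEVE_PRIMES_MAX)

value = 5000
trace = [value]
while value < SIEVE_PRIMES_MAX:
    value = next_sieve_primes(value)
    trace.append(value)

print(trace[:6])                                  # first steps of the ramp
print(len(trace) - 1, "adjustments until the cap")
```

Each step removes only a few more candidates per class while the candidate rate stays near 9.4M/s, which matches the observation that values above roughly 180K no longer make much of a difference.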

Bdot 2011-12-30 22:42

[QUOTE=therealwebs;284024]bdot, I'm using the new mfakto10p1 and it crashes consistently using the --CLtest argument. It passes -st and -st2 flawlessly, though. I've attached a dump from the process*. It seems to blame amdocl64.dll. Have a look if you'd like!

*it's my first time using procdump, so it might not have captured the right thing :S

dump: [URL]http://dl.dropbox.com/u/5274619/mfakto_dmp_111230_002904.dmp[/URL][/QUOTE]

What is the last output of mfakto?

You captured the right thing, however, my debugger cannot make anything meaningful out of it. This is most likely caused by different runtime versions. What I do see is the abort location, amdocl64!clGetSamplerInfo. This is the OpenCL runtime, but mfakto never calls clGetSamplerInfo. So I assume this part of the information is already wrong. I'll see if I can somehow get more info out of the dump, thanks a lot for providing it. If I can't extract anything better, I'll probably create a special debug version for you that should show more.

Do you still have any aborts during normal operation?

Bdot 2011-12-30 23:15

[QUOTE=debrouxl;284049] Athlon II X4 640 @ 3 GHz, 4 GB of RAM, and a Radeon HD 5450: ... fglrx 11-12 under Debian Testing 64 bits :smile:
...
Obviously, I'm not going to make this GPU work much :smile:
But could it somehow be forced to complete the current assignments a bit faster ? For instance, higher SievePrimes values (though values above 180K don't seem to make much of a difference), a SIEVE_SIZE_LIMIT of 64 KB, or something else ?[/QUOTE]

Thanks for this test. It confirms that the HD 5450 is capable of delivering about 8-9 GHz-Days/day. Probably without consuming a lot of CPU power. Better than nothing, but certainly not well-suited for bringing the GPU-to-72 assignments to 72 bits.

Well, if you have a good idea how to force it to finish the assignment faster, let me know!

Higher SievePrimes would go in this direction. Doubling the CPU effort you could expect a speedup of 3-5%. As you noticed with the values >180k: not really worth the effort.

SIEVE_SIZE_LIMIT 64kB would make the siever more efficient on your system as the Athlon CPU has 64kB L1 cache. The next mfakto version will have the sieve size configurable, but in your case it would just increase the CPU wait time.

In my eyes a new kernel has the best chance for real improvement of the throughput. This would be a barrett kernel based on 24-bit operations. I'm not yet certain if it would need to be entirely based on 24-bit, or if the 32-bit mul_hi is still allowed. I'm (slowly) working on these kernels, but cannot tell when they'll be ready. Also, it is hard to give a good estimate if the improvement will be 5% or 50% ...

And I recently thought of another idea that could increase throughput, especially on slower cards: the calculations in the kernel always require an initial division. GPUs are not made for divisions, so I could move this division from the GPU to the CPU, preferably into another thread.

But for now, I'm afraid there's nothing in mfakto that you can do to speed it up.

Hmm, can HD 5450 be overclocked? If so, leave the memory clock low but push the core clock higher - this will linearly increase throughput.

therealwebs 2011-12-31 01:43

On this machine (2x5870), whether or not mfakto 10p1 crashes seems to be up to chance. I'm trying to run 2 instances mostly unattended, and when I check on it (every 2 to 12 hours), usually one has crashed. I won't be able to dig in and really test until I get back to the physical location tomorrow (everything is being done via Teamviewer/remote desktop). I'll see if I can't screenshot the mfakto window along with a process dump the next time it goes south. Thanks for your work on this!

debrouxl 2011-12-31 08:46

[quote]Thanks for this test. It confirms that the HD 5450 is capable of delivering about 8-9 GHz-Days/day. Probably without consuming a lot of CPU power. Better than nothing, but certainly not well-suited for bringing the GPU-to-72 assignments to 72 bits.[/quote]
Exactly.
On another computer, which I have intermittent access to, a Mobility Radeon HD 4650 (550 MHz) driven by a Core i7 720QM @ 1.6 GHz (which has lower single-core throughput than the Athlon II X4 640 @ 3 GHz) goes through assignments in the 65M range more than two and a half times faster than the desktop HD 5450...
Unsurprisingly, on both computers, mfakto -d cpu does less than 4M per second.

[quote]But for now, I'm afraid there's nothing in mfakto that you can do to speed it up.[/quote]
OK :smile:

[quote]Hmm, can HD 5450 be overclocked? If so, leave the memory clock low but push the core clock higher - this will linearly increase throughput.[/quote]
I'll look into that, even if I probably won't overclock anything.


Thanks :smile:

KyleAskine 2012-01-10 14:49

I'd be interested to see how the new 7970's do if anyone manages to get their hands on one.

Bdot 2012-01-10 22:50

[QUOTE=KyleAskine;285724]I'd be interested to see how the new 7970's do if anyone manages to get their hands on one.[/QUOTE]

Raw figures and game benchmarks look promising, and the simplified instruction scheduling should boost performance of the 32-bit operations quite a bit, even though I have not been able to find detailed specs about the timing of the operations. Not sure if mul32, mul_hi and convert_* still occupy the whole "Graphic Core Next" ...

Another update:
I have been working a bit on mfakto and implemented the variable SieveSizeLimit. To make it easy to test, I also made the upper limit of SievePrimes configurable (between 5000 and 1,000,000). I added a test for it to a new --perftest option, so that you can check which SieveSize best fits the typical SievePrimes values you have. The output contains a list of how fast sieving alone is (this is on an otherwise idle Phenom 2 X4 955 @ 3.2 GHz):
[code]
2. Sieve (M/s)
SievePrimes:     5000    20000    80000   200000   500000  1000000
SieveSizeLimit
 12 kiB        136.60    89.10    46.78    27.38    14.15     6.28
 24 kiB        152.08   110.37    62.30    39.25    22.82    11.36
 36 kiB        156.94   115.52    71.29    47.36    28.78    15.52
 48 kiB        158.79   119.92    78.07    52.81    33.24    19.01
 59 kiB        157.13   120.70    82.93    54.79    36.58    21.85
 71 kiB        137.47   107.61    77.29    54.07    36.73    23.29
 83 kiB        127.99    99.11    73.83    52.63    37.19    24.49
 95 kiB        122.54    94.50    71.26    51.05    37.70    25.69
107 kiB        114.02    89.02    67.95    51.58    37.26    26.32
118 kiB        107.10    84.73    63.78    50.51    37.16    26.74
142 kiB         99.38    78.03    59.94    49.09    36.82    27.11
166 kiB         93.95    73.78    57.86    47.93    35.08    27.41
189 kiB         87.60    69.12    54.13    45.88    35.02    27.41
213 kiB         83.13    66.16    52.67    45.00    33.74    27.53
236 kiB         81.05    64.50    51.39    43.93    34.11    27.60
260 kiB         79.17    62.76    50.06    42.78    34.09    27.24
283 kiB         77.22    61.57    49.01    42.63    34.19    26.93
307 kiB         76.78    60.33    47.70    42.01    33.84    27.52
331 kiB         75.66    59.80    48.18    41.02    33.37    27.37
354 kiB         73.93    58.56    47.70    41.23    33.00    27.31
378 kiB         73.40    58.40    47.24    40.37    33.45    27.27
[/code]And this is on a stock (2.7GHz ?) i7-2600:
[code]
2. Sieve (M/s)
SievePrimes:     5000    20000    80000   200000   500000  1000000
SieveSizeLimit
 12 kiB        167.10   107.70    54.40    33.82    20.04    11.96
 24 kiB        189.79   136.77    73.83    47.18    29.02    18.64
 36 kiB        194.79   142.45    86.47    56.14    35.26    23.38
 48 kiB        177.86   135.17    89.44    59.99    38.60    26.43
 59 kiB        162.03   124.87    89.10    61.40    40.98    28.72
 71 kiB        148.61   117.39    86.94    61.31    41.94    30.22
 83 kiB        141.95   112.71    86.40    62.64    43.48    31.89
 95 kiB        136.69   108.89    85.23    63.35    44.90    33.15
107 kiB        131.41   104.69    82.61    62.55    45.39    33.82
118 kiB        126.81   101.86    79.87    62.46    45.66    34.41
142 kiB        120.92    96.91    76.12    62.03    44.90    35.07
166 kiB        114.77    92.57    74.75    62.40    46.12    33.94
189 kiB        111.64    90.66    73.31    62.00    47.28    36.93
213 kiB        107.66    86.50    71.72    60.39    46.84    36.71
236 kiB        107.26    86.01    71.08    60.86    47.30    38.52
260 kiB        103.58    84.10    68.98    59.79    46.59    38.97
283 kiB        102.16    82.36    67.12    58.66    46.62    38.48
307 kiB        101.63    80.78    66.40    57.28    47.35    37.33
331 kiB         99.62    79.60    65.41    58.17    47.14    38.61
354 kiB         97.86    78.80    64.63    56.96    47.37    38.78
378 kiB         96.47    77.18    64.45    55.75    47.64    38.88
[/code]For larger SievePrimes it is of advantage to increase SieveSizeLimit towards the L2-cache-size. This is even more evident when the machine is loaded with more mfakto-instances and mprime.
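To read the trend off the numbers, one can pick the fastest SieveSizeLimit per SievePrimes column. The sketch below does that for a handful of rows copied from the Phenom II table; the selection of rows and the helper name `best_sizes` are choices made for this illustration, not part of mfakto's --perftest.

```python
SIEVE_PRIMES = [5000, 20000, 80000, 200000, 500000, 1000000]

# Selected rows of the Phenom II table: SieveSizeLimit in kiB -> M/s per column.
PHENOM = {
    12:  [136.60, 89.10, 46.78, 27.38, 14.15, 6.28],
    48:  [158.79, 119.92, 78.07, 52.81, 33.24, 19.01],
    59:  [157.13, 120.70, 82.93, 54.79, 36.58, 21.85],
    95:  [122.54, 94.50, 71.26, 51.05, 37.70, 25.69],
    236: [81.05, 64.50, 51.39, 43.93, 34.11, 27.60],
    378: [73.40, 58.40, 47.24, 40.37, 33.45, 27.27],
}

def best_sizes(table):
    """For each SievePrimes column, return the sieve size with the highest rate."""
    best = {}
    for col, sp in enumerate(SIEVE_PRIMES):
        best[sp] = max(table, key=lambda size: table[size][col])
    return best

for sp, size in best_sizes(PHENOM).items():
    print(f"SievePrimes {sp:>7}: {size} kiB")
```

On these rows the best size grows from 48 kiB at SievePrimes 5000 to 236 kiB at 1,000,000, although the differences between the larger sizes are small.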

And I finally got around to implementing a barrett-kernel based on mul24. Performance is quite promising (174M/s compared to the other kernel's 135M/s on a HD5770). The only disadvantage is that it does not find any factors yet :redface:.
However, positive side-effect: I found a few places in the traditional mul24 kernel where I could combine a left-shift + add into a mad24, increasing the total performance of that kernel by ~2-3%.
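The shift-and-add to mad24 rewrite rests on the identity (a << s) + b = a * 2^s + b. The sketch below emulates the OpenCL `mul24` and `mad24` built-ins for unsigned values in Python to show the equivalence. It is an emulation for illustration, not kernel code from mfakto, and it assumes both multiplicands fit into 24 bits.

```python
MASK24 = (1 << 24) - 1
MASK32 = (1 << 32) - 1

def mul24(x: int, y: int) -> int:
    """Multiply the low 24 bits of x and y, keep the low 32 bits of the product."""
    return ((x & MASK24) * (y & MASK24)) & MASK32

def mad24(x: int, y: int, z: int) -> int:
    """mul24(x, y) + z, truncated to 32 bits."""
    return (mul24(x, y) + z) & MASK32

def shift_then_add(a: int, shift: int, b: int) -> int:
    """The two-step form: left-shift, then add, truncated to 32 bits."""
    return ((a << shift) + b) & MASK32

a, b, shift = 0x1234, 0x00ABCDEF, 5
# One fused multiply-add replaces the shift and the add.
assert shift_then_add(a, shift, b) == mad24(a, 1 << shift, b)
```

The replacement is only valid while `a` and `1 << shift` stay within 24 bits; beyond that the emulated `mul24` drops the upper bits and the two forms diverge.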

Dubslow 2012-01-10 23:28

[QUOTE=Bdot;285814]And this is on a stock (2.7GHz ?) i7-2600:
[/QUOTE]
3.4 GHz, turbo to 3.8.

(mfaktc...)

kracker 2012-01-11 00:06

works
 
1 Attachment(s)
Works good on my Llano A8-3850 apu, thanks :)

Bdot 2012-01-11 08:51

[QUOTE=kracker;285827]Works good on my Llano A8-3850 apu, thanks :)[/QUOTE]
Thanks for this info! Could you please also post the OpenCL device info part as mfakto reports it? If I can easily figure out we're running on Llano, then I can enable a zero-memory-copy optimization, that should increase GPU utilisation by ~10% when only a single instance is running (and by a small amount for multi-instance).

kracker 2012-01-11 15:42

[QUOTE=Bdot;285871]Thanks for this info! Could you please also post the OpenCL device info part as mfakto reports it? If I can easily figure out we're running on Llano, then I can enable a zero-memory-copy optimization, that should increase GPU utilisation by ~10% when only a single instance is running (and by a small amount for multi-instance).[/QUOTE]

Ok, here it is.
(btw, it only uses about 65-80% of my gpu when I run 1 instance... that might be normal ?)

OpenCL device info
name BeaverCreek (Advanced Micro Devices, Inc.)
device (driver) version OpenCL 1.1 AMD-APP (851.4) (CAL 1.4.1646 (VM))
maximum threads per block 256
maximum threads per grid 16777216
number of multiprocessors 5 (400 compute elements (estimate for ATI GPUs))
clock rate 600MHz

Dubslow 2012-01-11 15:52

[QUOTE=kracker;285896]
(btw, it only uses about 65-80% of my gpu when I run 1 instance... that might be normal ?)[/QUOTE]
That depends on what CPU you're using. I'll try and link you to an old post that I have no intention of rewriting from scratch.



[offtopic]Welcome to the GPU to 72 team! Except it seems you haven't actually gotten work from the [url=gpu.mersenne.info]tool[/url]. You can find more info in the GPU to 72 subforum somewhere around here. Happy crunching![/offtopic]

KyleAskine 2012-01-11 16:10

[QUOTE=kracker;285896]
(btw, it only uses about 65-80% of my gpu when I run 1 instance... that might be normal ?)
[/QUOTE]

It is probably the issue that Bdot mentioned above. You should be at around 90% or so with one instance in my opinion, since it is obvious that your CPU can sieve way faster than your GPU can process.

kracker 2012-01-11 16:47

Ah, ok thanks :)
Oh, and also I was wondering is there any way to reduce the priority of it? I have to pause it every time I do a gpu-intensive program or game, Thanks :)

(P.S.: Is there a way to automatically pull assignments? Right now I realized I'll have to manually get more once it gets done. :whistle:)

Dubslow 2012-01-11 21:07

[QUOTE=kracker;285903]Ah, ok thanks :)
Oh, and also I was wondering is there any way to reduce the priority of it? I have to pause it every time I do a gpu-intensive program or game, Thanks :)
[/QUOTE]

Try using a batch file; you can set CPU affinity and priority in the command to start mfakto. Or you can change it via Task Manager after it's already running. (Sadly I have yet to find a decent post on the GPU usage thing.)
Edit: As a holdover: [QUOTE=Dubslow;274571]The CPU wait indicates how long the CPU is waiting for work. If it's greater than 1000, then the CPU is waiting a lot, which means the GPU is overwhelmed. Sieve Primes controls how much work is done on the CPU; that's why the program auto-adjusted that up to 200,000 (the default is 25,000, and 5,000 is the minimum).[/QUOTE]

[QUOTE=kracker;285903]
(P.S.: Is there a way to automatically pull assignments? Right now I realized I'll have to manually get more once it gets done. :whistle:)[/QUOTE]
That is being worked on at the moment, unfortunately not ready yet.

Bdot 2012-01-11 22:32

Thanks for the device info, I'll put that on the wishlist ;-)

[QUOTE=kracker;285903]Ah, ok thanks :)
Oh, and also I was wondering is there any way to reduce the priority of it? I have to pause it every time I do a gpu-intensive program or game, Thanks :)
[/QUOTE]

Adding to Dubslow's comment:

While you can lower the priority as mentioned (it is not built in), it may not give you what you want. The reason is that the priority setting applies to the CPU part only. On the GPU there is no such thing as priorities - it's all round-robin on the same level. mfakto tries to keep 5 blocks (tasks) in the GPU queue, which can make the UI laggy if, for instance, window movements have to wait until these 5 tasks are processed.

You can try two settings in [B]mfakto.ini[/B]:
[B]GridSize[/B] defines how many factor candidates make up one block. Lowering this value should already increase responsiveness a lot, at the expense of a little more CPU overhead.
[B]NumStreams[/B] is the number of blocks being scheduled. Lowering to 3 or 2 causes other tasks to be served quicker, but mfakto will have a smaller buffer to cover fluctuations in available CPU power.

BTW, the relatively low GPU utilization can also occur if the CPU cores are rather busy. Sometimes the auto-adjusting of the SievePrimes value is confused if there is no CPU available to serve the GPU queue: the time it took to get the required CPU power is then wrongly interpreted as CPU idle time waiting for the GPU to finish. Try setting [B]SievePrimesAdjust=0[/B] and [B]SievePrimes=100000[/B] (to be tested what is good). Alternatively, set up two copies of mfakto to run in parallel. Then they can cover each other's gaps in GPU utilisation.

James Heinrich 2012-01-11 23:30

Can I request any [i]mfakto[/i] users help me out with some benchmark data. I want to update my [url=http://mersenne-aries.sili.net/mfaktc.php]mfaktc table[/url] to include AMD GPUs as well, but I need some data to base it on. Please send me benchmarks on a wide variety of GPUs (very old to very new, and very slow to very fast) so that I can get as accurate a picture of how GFLOPS scales into GHz-days/day performance across the various products. For each GPU, I need the following 4 bits of data of [b]a single running instance[/b] (even if you normally run multiple instances, please just run one for this test):[list=1][*]GPU model (including clockspeed if overclocked)[*]assignment (exponent, from-bits, to-bits)[*]wall clock runtime[*]average GPU usage[/list]If you want to include the CPU model/speed and SievePrimes values as well that's interesting, but not required.

Please PM or email me the results as opposed to posting in this thread. I'll post back when I have enough data to make a reasonable chart.

KyleAskine 2012-01-12 03:44

[QUOTE=James Heinrich;285993]Can I request any [i]mfakto[/i] users help me out with some benchmark data. I want to update my [url=http://mersenne-aries.sili.net/mfaktc.php]mfaktc table[/url] to include AMD GPUs as well, but I need some data to base it on. Please send me benchmarks on a wide variety of GPUs (very old to very new, and very slow to very fast) so that I can get as accurate a picture of how GFLOPS scales into GHz-days/day performance across the various products. For each GPU, I need the following 4 bits of data of [b]a single running instance[/b] (even if you normally run multiple instances, please just run one for this test):[list=1][*]GPU model (including clockspeed if overclocked)[*]assignment (exponent, from-bits, to-bits)[*]wall clock runtime[*]average GPU usage[/list]If you want to include the CPU model/speed and SievePrimes values as well that's interesting, but not required.

Please PM or email me the results as opposed to posting in this thread. I'll post back when I have enough data to make a reasonable chart.[/QUOTE]

Why isn't SievePrimes required data? If the GPU is the bottleneck, a different SievePrimes number will affect wall clock runtime, but not the other three variables.

Put another way: I can always lower sieve primes and destroy my wall clock performance by increasing the number of candidates, but the reason performance is bad would be missed by your metrics, since the GPU usage would be the same.

I will try to get you some values tomorrow. I have 6950's modded with shaders unlocked, and a 6570 I can hopefully get you tomorrow!

James Heinrich 2012-01-12 13:38

[QUOTE=KyleAskine;286016]Why isn't SievePrimes required data?[/QUOTE]It's nice to have so I can see if I expect any given benchmark to be on the high or low side, but I don't need it per se for the calculations. There's enough (too much!) variance in the data (based on what I've seen of mfaktc data) that it doesn't really make much difference overall, my chart will just provide [i]rough[/i] guidelines, +/-10% at best.

KyleAskine 2012-01-12 14:00

[QUOTE=James Heinrich;286044]It's nice to have so I can see if I expect any given benchmark to be on the high or low side, but I don't need it per se for the calculations. There's enough (too much!) variance in the data (based on what I've seen of mfaktc data) that it doesn't really make much difference overall, my chart will just provide [i]rough[/i] guidelines, +/-10% at best.[/QUOTE]

I have no idea, and this could be way off target, but why don't you just get the M/s value and the GPU usage? Wouldn't that be the easiest benchmark to get? Or does that not account for everything?

James Heinrich 2012-01-12 15:27

[QUOTE=James Heinrich;285993]I want to update my [url=http://mersenne-aries.sili.net/mfaktc.php]mfaktc table[/url] to include AMD GPUs as well[/QUOTE]I have some rough data now, enough to at least put up the chart. I may refine it slightly as I get more data, but it should be reasonably close.

You'll notice that the Radeon+mfakto combination is considerably less efficient at turning theoretical GFLOPS into GHz-days/day TF results than GeForce+mfaktc. Right now I'm using a divider of 18 (for mfaktc, I'm using 14 for older v1.x GPUs, 5 for v2.0 and 7.5 for v2.1). So that's why you see a Radeon 6990 and a GeForce GTX 570 both expecting ~282GHz-days/day, even though the 6990 has 5100 GFLOPS to the 570's 1400.
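The divider model described above amounts to a single division. The sketch below uses the dividers and the rounded GFLOPS figures quoted in this post; the dictionary keys and the function name are made up for the illustration, and treating the GTX 570 as a "v2.0" card follows from the numbers in the post rather than from the chart's code.

```python
# Dividers quoted in the post: theoretical GFLOPS per GHz-day/day of TF.
DIVIDERS = {
    "mfakto (Radeon)": 18.0,
    "mfaktc v1.x": 14.0,
    "mfaktc v2.0": 5.0,
    "mfaktc v2.1": 7.5,
}

def ghz_days_per_day(gflops: float, platform: str) -> float:
    """Estimate TF throughput from theoretical GFLOPS and the platform divider."""
    return gflops / DIVIDERS[platform]

print(round(ghz_days_per_day(5100, "mfakto (Radeon)")))  # Radeon 6990
print(round(ghz_days_per_day(1400, "mfaktc v2.0")))      # GeForce GTX 570
```

With the rounded inputs this gives about 283 and 280 GHz-days/day, in line with the ~282 quoted for both cards.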

More benchmark data is still welcome, especially from older/slower GPUs.

kjaget 2012-01-12 15:48

[QUOTE=KyleAskine;286047]I have no idea, and this could be way off target, but why don't you just get the M/s value and the GPU usage? Wouldn't that be the easiest benchmark to get? Or does that not account for everything?[/QUOTE]

For one, I don't think that M/sec has anything to do with run time per factor. Or at least there's no easy way to map from one to the other. Turn off the siever and your M/sec would go through the roof - as would execution time since you're doing a lot of unnecessary work.

KyleAskine 2012-01-12 19:29

[QUOTE=kjaget;286062]For one, I don't think that M/sec has anything to do with run time per factor. Or at least there's no easy way to map from one to the other. Turn off the siever and your M/sec would go through the roof - as would execution time since you're doing a lot of unnecessary work.[/QUOTE]

Sure it does. Candidates / M/s = time per class. I do admit that I am not sure what each class is, or why it seems to skip around at random, but I am sure there is a simple explanation of that.
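The relation can be checked against two lines of the HD 5450 log posted earlier in this thread. The sketch below is plain arithmetic; the function name is invented for the illustration.

```python
def seconds_per_class(candidates: float, rate_per_second: float) -> float:
    """Time for one class = number of candidates / candidate rate."""
    return candidates / rate_per_second

# Class 893: 247.46M candidates at 9.37M/s, SievePrimes 5000 (log shows 26.399s).
early = seconds_per_class(247.46e6, 9.37e6)
# Class 1068: 180.36M candidates at 9.24M/s, SievePrimes 200000 (log shows 19.526s).
late = seconds_per_class(180.36e6, 9.24e6)

print(f"{early:.2f}s vs {late:.2f}s")
```

The rate is nearly the same in both lines, yet the class time drops by about a quarter because more sieving leaves fewer candidates. The rate alone therefore does not fix the run time; the candidate count is needed too.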

kjaget 2012-01-12 19:55

[QUOTE=KyleAskine;286080]Sure it does. Candidates / M/s = time per class.[/QUOTE]

Well yeah, if you record X/sec and X you can work back to seconds, but since time is also given in the output, there's no point in making things more complex than they have to be. But my point was that the rate by itself tells you nothing since you have no idea how much work the GPU is doing, even if you do know what rate it is doing it at.

Bdot 2012-01-13 12:37

[QUOTE=kjaget;286084]Well yeah, if you record X/sec and X you can work back to seconds, but since time is also given in the output, there's no point in making things more complex than they have to be. But my point was that the rate by itself tells you nothing since you have no idea how much work the GPU is doing, even if you do know what rate it is doing it at.[/QUOTE]

I also thought a bit about what mfakto should display. Each test is split into 4620 classes, of which 960 need to be tested, the others can be excluded right away (because they contain only FC's that are divisible by 3, 5, 7 or 11). Now, I can change the current display of class numbers to a display of the class counter (e. g. 23/960), or a percent complete. What would you prefer?
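The count of 960 can be reproduced with a short script. One assumption goes beyond what is stated above: besides dropping classes whose candidates are divisible by 3, 5, 7 or 11, the sketch also drops classes whose candidates are not congruent to 1 or 7 modulo 8, a known property of factors of Mersenne numbers with odd prime exponent. Without that extra condition the count would be 1920. The function names and the sample exponents are chosen for the illustration and are not taken from mfakto.

```python
NUM_CLASSES = 4620  # 4 * 3 * 5 * 7 * 11

def class_needs_testing(c: int, p: int) -> bool:
    """Decide whether class c (all k with k % 4620 == c) can contain factors of M_p.

    Candidates are f = 2*k*p + 1. Because 2 * 4620 is a multiple of 8, 3, 5, 7
    and 11, every k in the class gives the same residues of f for those moduli.
    """
    f = 2 * c * p + 1
    if f % 8 not in (1, 7):
        return False
    return all(f % q != 0 for q in (3, 5, 7, 11))

def classes_to_test(p: int) -> int:
    return sum(class_needs_testing(c, p) for c in range(NUM_CLASSES))

print(classes_to_test(65537))  # any odd exponent coprime to 3*5*7*11 behaves the same
```

Since the surviving class numbers are spread irregularly over 0..4619, the class column in the progress output appears to skip around.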

Bdot 2012-01-13 13:06

[QUOTE=James Heinrich;286059]I have some rough data now, enough to at least put up the chart. I may refine it slightly as I get more data, but it should be reasonably close.
[/QUOTE]

Thanks a lot for this table addition!
The question marks in the "Compute" column could be replaced by the OpenCL version that these cards support: 1.1 for all cards except those with an RVxxx chip with xxx<700. RV700 is the first to support OpenCL 1.1. But I don't know if the earlier cards supported 1.0 or no OpenCL at all ...

And OpenCL 1.1 is required for mfakto, so the same split can be used to mark the AMD cards on which mfakto "will not run". This should be the same as marking anything below HD4xxx as "will not run".

[QUOTE=James Heinrich;286059]
You'll notice that the Radeon+mfakto combination is considerably less efficient at turning theoretical GFLOPS into GHz-days/day TF results than GeForce+mfaktc. Right now I'm using a divider of 18 (for mfaktc, I'm using 14 for older v1.x GPUs, 5 for v2.0 and 7.5 for v2.1). So that's why you see a Radeon 6990 and a GeForce GTX 570 both expecting ~282GHz-days/day, even though the 6990 has 5100 GFLOPS to the 570's 1400.
[/QUOTE]

Do you really need to show that so clearly :cry:
I now got the barrett mul24 kernel to work (correctly!), which increases the efficiency by ~20-30%. But to reach the "5" divider will be hard ... Maybe with the HD7970 ...

KyleAskine 2012-01-13 13:21

[QUOTE=kjaget;286084]Well yeah, if you record X/sec and X you can work back to seconds, but since time is also given in the output, there's no point in making things more complex than they have to be. But my point was that the rate by itself tells you nothing since you have no idea how much work the GPU is doing, even if you do know what rate it is doing it at.[/QUOTE]

Which is why you need GPU usage as well.

M/s should be constant no matter what the assignment. Time and Candidates change. Thus M/s is better.

I think M/s and GPU usage should be sufficient to determine a theoretical maximum GHz-d/d.

Of course things like CPU (which affects SievePrimes) matters too, but I think you can get a theoretical max (independent of system) from those numbers.

James Heinrich 2012-01-13 14:13

[QUOTE=Bdot;286156]The question marks in the "Compute" column could be replaced by the OpenCL version that these cards support[/quote]I've updated the table with that data. Does this seem reasonable?[code]UPDATE `gpu` SET `compute` = "1.2" WHERE (`brand` = "A") AND (`model` LIKE "Radeon HD 7%");
UPDATE `gpu` SET `compute` = "1.2" WHERE (`brand` = "A") AND (`model` LIKE "Radeon HD 6%");
UPDATE `gpu` SET `compute` = "1.1" WHERE (`brand` = "A") AND (`model` LIKE "Radeon HD 5%");
UPDATE `gpu` SET `compute` = "1.1" WHERE (`brand` = "A") AND (`model` LIKE "Radeon HD 4%");

UPDATE `gpu` SET `compute` = "1.1" WHERE (`brand` = "A") AND ((`codename` = "Westler") OR (`codename` = "Zacate") OR (`codename` = "Ontario") OR (`codename` = "WinterPark") OR (`codename` = "BeaverCreek"));[/code]


[QUOTE=Bdot;286156]Do you really need to show that so clearly :cry:[/QUOTE]Sorry! :blush:
It's no reflection on your programming, just the design of AMD GPUs. [url=http://www.tomshardware.com/reviews/radeon-hd-7970-benchmark-tahiti-gcn,3104-2.html]This article[/url] illustrates some of the problems with VLIW4 that [i]Graphics Core Next[/i] is supposed to remedy. Perhaps it can translate into better mfakto efficiency(?)

[QUOTE=Bdot;286156]But to reach the "5" divider will be hard[/quote]It's also hard for mfaktc/NVIDIA. Older v1.x GPUs are pretty close to the current Radeon efficiency, and newer v2.1 GPUs actually take a 33% performance hit compared to v2.0, not quite sure why. But I still [b]need more benchmark data[/b]. I've seen results ranging from 13x to 18x in the few benchmarks I've received so far, I need more data points to figure out what patterns there may be.

kjaget 2012-01-13 14:43

[QUOTE=KyleAskine;286158]Which is why you need GPU usage as well.

M/s should be constant no matter what the assignment. Time and Candidates change. Thus M/s is better.[/QUOTE]

Again, candidates per second is only useful for determining run time if you know how many candidates are needed to test an exponent. Since, as you say, this number changes depending on lots of factors, just measuring the time is a better measure of how long it takes to test one exponent.

If you're looking for a theoretical measure, we'd need to hack the code to turn off sieving so as many candidates are fed to the GPU as possible per CPU<->GPU transaction. Run as many copies of these as necessary to max the GPU (or compare this to running 1 instance and scaling it with GPU load to see if it gives the same answer). Then we'd need to run through a pass with sieve primes maxed to see the minimum number of candidates required to test an exponent. This last step would only have to be done once since it's independent of the GPU.

Combining the peak candidates per second with the minimum number of candidates per exponent would get us close to the theoretical peak throughput.
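
As a rough sketch of that combination (the function name and the sample numbers are placeholders, not measurements from any card):

[code]
SECONDS_PER_DAY = 86400

def peak_exponents_per_day(peak_candidates_per_sec, min_candidates_per_exponent):
    """Upper bound on assignments per day if the GPU were always fed at its peak rate."""
    return peak_candidates_per_sec * SECONDS_PER_DAY / min_candidates_per_exponent

# Hypothetical inputs: 200 M candidates/s peak, 1e12 candidates per assignment.
print(peak_exponents_per_day(200e6, 1e12))
[/code]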

But I'm not convinced that ignoring the real overhead in real systems is any more accurate a measurement than just seeing how long an exponent takes to run in a real system.

kjaget 2012-01-13 14:46

[QUOTE=James Heinrich;286162]It's also hard for mfaktc/NVIDIA. Older v1.x GPUs are pretty close to the current Radeon efficiency, and newer v2.1 GPUs actually take a 33% performance hit compared to v2.0, not quite sure why. [/QUOTE]

See [url]http://www.mersenneforum.org/showpost.php?p=281245&postcount=1399[/url].

If I understand it correctly, 2.1 removed some compute resources and relies on a better scheduler to try and run more instructions in parallel. But mfaktc instruction parallelism can't be improved by the better scheduler so it gets hit by the reduced resources without any corresponding gain from the better scheduler.

kjaget 2012-01-13 14:58

[QUOTE=Bdot;286155]I also thought a bit about what mfakto should display. Each test is split into 4620 classes, of which 960 need to be tested, the others can be excluded right away (because they contain only FC's that are divisible by 3, 5, 7 or 11). Now, I can change the current display of class numbers to a display of the class counter (e. g. 23/960), or a percent complete. What would you prefer?[/QUOTE]

A % complete would be interesting, but in a way it's implied by the ETA field.

I would like to see the timing info grouped together first (time/class & eta), then sieve primes, then the throughput stuff grouped together last. This orders it roughly by order of importance performance-wise, at least from a user's perspective. I've seen too many people set sieveprimes as low as possible to get a higher candidates/sec number when all that does is kill their run times. Hopefully moving time first will inspire them to minimize that instead of trying to max M/s by making the GPU do unnecessary work.

But whatever you do, I'd coordinate with Oliver so you guys keep as much of the code common as possible. Should make it easier later on when it's integrated into Prime95 (I can dream, can't I).

Dubslow 2012-01-13 15:12

[QUOTE=KyleAskine;286158]
M/s should be constant no matter what the assignment. Time and Candidates change. Thus M/s is better.
[/QUOTE]
I found that when testing a 200M number, avg. rate dropped from ~195 to ~170, maybe ~165 sometimes. When I went back to 50M, the rate went up again. Could this be due to a higher cost of checking factors?

Dubslow 2012-01-13 15:22

[QUOTE=kjaget;286165]
If you're looking for a theoretical measure, we'd need to hack the code to turn off sieving so as many candidates are fed to the GPU as possible per CPU<->GPU transaction. Run as many copies of these as necessary to max the GPU (or compare this to running 1 instance and scaling it with GPU load to see if it gives the same answer).[/QUOTE]
This is what TheJudger [url=http://www.mersenneforum.org/showpost.php?p=281726&postcount=1409]does[/url] when he tests efficiency. He posted such a test on mfaktc 0.18's release. As for what Mr. Askine was saying, yes you'd need number of candidates tested to get runtime, but OTOH, avg. rate should correlate with GHzD/d, regardless of runtime, e.g. I get ~190 M/s, and GD/d of roughly ~100. Then you multiply by runtime to get GD=total FC per assignment.
[QUOTE=kjaget;286168]A % complete would be interesting, but in a way it's implied by the ETA field.

I would like to see the timing info grouped together first (time/class & eta), then sieve primes, then the throughput stuff grouped together last. This orders it roughly by order of importance performance-wise, at least from a user's perspective. I've seen too many people set sieveprimes as low as possible to get a higher candidates/sec number when all that does is kill their run times. Hopefully moving time first will inspire them to minimize that instead of trying to max M/s by making the GPU do unnecessary work.[/quote]
I'd prefer a count/960 rather than percentage. How do you know that it's always exactly 960 classes and that all the others don't work for a given assignment? Why couldn't it be 961, 962, or 1063?
[QUOTE=kjaget;286165]But whatever you do, I'd coordinate with Oliver so you guys keep as much of the code common as possible. Should make it easier later on when it's integrated into Prime95 (I can dream, can't I).[/QUOTE]
I agree on the coordination point, but as for integration, at this point at least, we'd need to include both. Has anybody ever tested mfakto on nVidia cards?

Bdot 2012-01-13 17:17

[QUOTE=James Heinrich;286162]I've updated the table with that data. Does this seem reasonable?
[/QUOTE]
Thanks, looks good to me! When I checked what's new in 1.2 and did not find anything important I did not even bother to check which cards support it. So I cannot tell this difference.

[QUOTE=James Heinrich;286162]
Sorry! :blush:
It's no reflection on your programming, just the design of AMD GPUs. [URL="http://www.tomshardware.com/reviews/radeon-hd-7970-benchmark-tahiti-gcn,3104-2.html"]This article[/URL] illustrates some of the problems with VLIW4 that [I]Graphics Core Next[/I] is supposed to remedy. Perhaps it can translate into better mfakto efficiency(?)
[/QUOTE]
No worries, I did not take it too hard :smile:.
While this article shows a basic problem, it is one that the OpenCL compiler was brilliant at circumventing. Those optimizations probably cost quite some effort; the translated OpenCL code was reordered so much that I sometimes had trouble matching it to the original code. The compiler knows about the VLIW4/5 dependency issue, analyzes it and reorders as much as the dependencies allow.
But often it is hard to find independent instructions to fill the gaps.
An even bigger problem with VLIW5 is the set of instructions that can run only on the special "t" unit, leaving the 4 others empty. Widely discussed are mul32 and mul_hi in this respect, but conversions back and forth between integer and floating point representation are as bad. And finally all the operations to provide for carry/borrow cost their share of the available GFLOPS.

[QUOTE=James Heinrich;286162]
But I still [B]need more benchmark data[/B].[/QUOTE]

I need more machines to test on :grin:
[quote=Dubslow]
I found that when testing a 200M number, avg. rate dropped from ~195 to ~170, maybe ~165 sometimes. When I went back to 50M, the rate went up again. Could this be due to a higher cost of checking factors?
[/quote]
It may not seem much: 2 or 3 bits more are just 2 or 3 more loops. But testing a 50M number usually requires 19 loops, so we have an increase of more than 10%. I'd say: yes, it's the higher cost of checking the factors. The barrett kernel should not suffer that much from the additional loops as its loops are simpler at the cost of some more one-time effort.
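
To put numbers on that, here is a small sketch. It assumes the extra loops simply track the bit length of the exponent; the baseline of 19 loops for a 50M exponent is the figure given above, not something the code derives.

[code]
def extra_loops(small_exponent, large_exponent):
    """Additional loops, assuming one loop per extra bit in the exponent."""
    return large_exponent.bit_length() - small_exponent.bit_length()

BASELINE_LOOPS_50M = 19  # figure quoted above for a 50M exponent

extra = extra_loops(50_000_000, 200_000_000)
print(extra, f"{extra / BASELINE_LOOPS_50M:.1%}")
[/code]

A 50M exponent has 26 bits and a 200M exponent has 28, so this gives 2 extra loops on top of 19.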

[quote=Dubslow]
How do you know that it's always exactly 960 classes and that all the others don't work for a given assignment? Why couldn't it be 961, 962, or 1063?[/quote]

Because I've counted them all :smile:
That's the nice thing about modulo: it all repeats over and over ... No matter where in the circle of 4620 classes you start, you'll always hit each class once. By excluding FC's that are 3 or 5 mod 8, and multiples of 3, 5, 7, 11 you keep 4620 * 2/4 * 2/3 * 4/5 * 6/7 * 10/11 = 960 of the 4620 classes.
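
A brute-force check of that count, as a sketch: it assumes factor candidates have the form 2kp+1 and that a class is k mod 4620. The function name is made up for this example, and 43112609 is just one known prime exponent used as sample input.

[code]
NUM_CLASSES = 4 * 3 * 5 * 7 * 11  # 4620

def surviving_classes(p):
    """Classes k (mod 4620) whose candidates f = 2*k*p + 1 are not excluded outright."""
    keep = []
    for k in range(NUM_CLASSES):
        f = 2 * k * p + 1
        if f % 8 not in (1, 7):  # drop candidates that are 3 or 5 mod 8
            continue
        if any(f % q == 0 for q in (3, 5, 7, 11)):  # drop multiples of small primes
            continue
        keep.append(k)
    return keep

print(len(surviving_classes(43112609)))
[/code]

Which classes survive depends on the exponent, but the count does not: for any prime exponent above 11, exactly 2 of every 4, 2 of 3, 4 of 5, 6 of 7 and 10 of 11 residues remain.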

[quote=Dubslow]
Has anybody ever tested mfakto on nVidia cards? [/quote]

I did not notice the newer NV drivers add OpenCL 1.1 support! Thanks for the hint. Currently the "-O3" parameter to the OpenCL compiler makes it fail, but I'll try without it ...

KyleAskine 2012-01-13 17:31

[QUOTE=Bdot;286179]
I need more machines to test on :grin:
[/QUOTE]

In a sort of serious way, do you need anything that would help with mfakto? Do you have a 6xxx card? I would be more than willing to pitch something in to help you get appropriate equipment to help you test in house.

James Heinrich 2012-01-13 17:45

[QUOTE=KyleAskine;286181]Do you have a 6xxx card?[/QUOTE]I think a 7xxx-series card would be far more useful, since things actually changed between 6 and 7. :smile:

KyleAskine 2012-01-13 18:47

[QUOTE=James Heinrich;286184]I think a 7xxx-series card would be far more useful, since things actually changed between 6 and 7. :smile:[/QUOTE]

Well, no one has one yet.

On my 5870 I get around 200 M/s with Barrett32.

On my 6970 I get around 120 M/s with Barrett32. I get around 140 M/s with MUL24.

So I think we still need major refinements with the 6xxx series.

Though hopefully Barrett24 fixes everything!

Bdot 2012-01-14 01:27

[QUOTE=KyleAskine;286189]
Though hopefully Barrett24 fixes everything![/QUOTE]

Well, certainly not everything. Currently it is capable only of finding factors between 2[SUP]63[/SUP] and 2[SUP]70[/SUP]. It should be able to find them up to 2[SUP]71[/SUP], but at 70.8 bits I see some misses. Once I see that in the debugger I will be able to tell if it will stay with the 2[SUP]70[/SUP] limit, or if I can fix it to work for all 2[SUP]71[/SUP] as well. And once that is done, I'd like to send it out to a few people for testing.

But 2[SUP]72[/SUP], the goal of GPU-to-72, will not be possible with this kernel. The next kernel will add another 24 bits, which will certainly slow it down considerably. Or maybe just add 12 bits? Hmm, let's see ... I also started a kernel that uses 15-bit chunks in order to avoid the expensive mul_hi instructions, just to check if that maybe can increase the efficiency of AMD GPUs ...

BTW, testing mfakto on Nvidia turns out to be way more effort than it might be worth. Nvidia's OpenCL compiler is buggy and not yet complete. I had to remove all printf's even though they were in inactive #ifdefs. And once that was done, the compiler crashed.
[code]
Error in processing command line: Don't understand command line argument "-O3"!
[/code][code]
(0) Error: call to external function printf is not supported
[/code][code]
Select device - Get device info - Compiling kernels .Stack dump:
0. Running pass 'Function Pass Manager' on module ''.
1. Running pass 'Combine redundant instructions' on function '@mfakto_cl_barrett79'

mfakto-nv.exe has stopped working
[/code]

Dubslow 2012-01-14 01:30

Lol, I can't help. I hardly know anything about programming, only the very basics.

KyleAskine 2012-01-14 03:13

[QUOTE=Bdot;286230]Well, certainly not everything. Currently it is capable only of finding factors between 2[SUP]63[/SUP] and 2[SUP]70[/SUP]. It should be able to find them up to 2[SUP]71[/SUP], but at 70.8 bits I see some misses. Once I see that in the debugger I will be able to tell if it will stay with the 2[SUP]70[/SUP] limit, or if I can fix it to work for all 2[SUP]71[/SUP] as well. And once that is done, I'd like to send it out to a few people for testing.

But 2[SUP]72[/SUP], the goal of GPU-to-72, will not be possible with this kernel. The next kernel will add another 24 bits, which will certainly slow it down considerably. Or maybe just add 12 bits? Hmm, let's see ... I also started a kernel that uses 15-bit chunks in order to avoid the expensive mul_hi instructions, just to check if that maybe can increase the efficiency of AMD GPUs ...
[/QUOTE]

Well, since I have to use MUL24 anyway, I cannot factor to 72 as is, so I am not really losing any functionality. Though being able to factor to 71 would be helpful, since there really aren't too many candidates left that are only done to 69 or less.

flashjh 2012-01-15 05:44

[QUOTE=Bdot;286230]I'd like to send it out to a few people for testing.[/QUOTE]

I can help test when you're ready.

Bdot 2012-01-16 09:36

[QUOTE=KyleAskine;286238]Well, since I have to use MUL24 anyway, I cannot factor to 72 as is, so I am not really losing any functionality. Though being able to factor to 71 would be helpful, since there really aren't too many candidates left that are only done to 69 or less.[/QUOTE]

The MUL24 kernel can handle up to 72. Having the limit at 71 was a mistake in one of the test versions I sent to you; it was fixed in the 0.10 release.

The barrett24 kernel, however, normally needs 3 spare bits. I managed to "borrow" one, but not more. As the processing width is 3x24 bits, I need to limit the new kernel's bit_max at 70. I also noticed that the new kernel's register usage seems to be very efficient, resulting in 1-2% performance gain when using a vector size of 8 instead of 4.
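
In numbers (a sketch only; the names are invented for this example and the figures are the ones given above):

[code]
def usable_bit_max(limbs, bits_per_limb, spare_bits_needed, bits_borrowed=0):
    """Largest factor size in bits once the kernel's spare bits are reserved."""
    return limbs * bits_per_limb - (spare_bits_needed - bits_borrowed)

# 3 x 24-bit processing width, 3 spare bits needed, 1 of them "borrowed".
print(usable_bit_max(3, 24, spare_bits_needed=3, bits_borrowed=1))
[/code]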

I'll send you and flashjh a test version within the next few days. Try to save a few 69 -> 70 assignments for it ...

KyleAskine 2012-01-16 11:36

[QUOTE=Bdot;286453]The MUL24 kernel can handle up to 72. Having the limit at 71 was a mistake in one of the test versions I sent to you; it was fixed in the 0.10 release.

The barrett24 kernel, however, normally needs 3 spare bits. I managed to "borrow" one, but not more. As the processing width is 3x24 bits, I need to limit the new kernel's bit_max at 70. I also noticed that the new kernel's register usage seems to be very efficient, resulting in 1-2% performance gain when using a vector size of 8 instead of 4.

I'll send you and flashjh a test version within the next few days. Try to save a few 69 -> 70 assignments for it ...[/QUOTE]

I will try to harvest some from GPU to 72 tonight when I get home.

flashjh 2012-01-16 15:23

[QUOTE=KyleAskine;286457]I will try to harvest some from GPU to 72 tonight when I get home.[/QUOTE]

Me too :smile:

chalsall 2012-01-16 16:30

[QUOTE=KyleAskine;286457]I will try to harvest some from GPU to 72 tonight when I get home.[/QUOTE]

I've asked Spidy to look for some available candidates in the 59M range. So far it's only found 15.

If it can't get a couple of hundred or so, I'll ask it to grab a few up in the 60M or so range. Past where it's currently useful for the LL wavefront, but worthwhile for future work.

I'll have the system return them to the pool after being TFed to 70 or beyond.

Edit: Spidy also often finds at least a few dozen only TFed to 69 in the current working ranges at around 0000 UTC if you want to check then.

Edit2: And, of course, you could always do some work in the DC range where there's a large number available only TFed to 68.

KyleAskine 2012-01-16 16:40

I have mfakto on a 6570, which is far from a high end card. When I go from 69 to 70 on it, it can do around 50 or 55 M/s. When I go from 70 to 71 it can only do around 25 M/s. Could I be hitting the onboard memory limit, so it has to use slow memory? Could it be something else? I haven't experienced this with any other card.

Edit - Nope, I just grabbed a 69 to 70, and that has the same issue. Something else has happened here...

Edit 2 - Alright, the core speed seems to have been halved... and I cannot change it... time for a reboot.

Dubslow 2012-01-16 16:48

[QUOTE=chalsall;286481]

Edit: Spidy also often finds at least a few dozen only TFed to 69 in the current working ranges at around 0000 UTC if you want to check then.
[/QUOTE]

I've found that more often than not the standard daily dump finds 100+ at 69 <60M. Just give it a day or two, and as long as flash and Kyle are paying attention, they ought to be able to get them real easy like. (I think that's the first time I've ever used that particular construction.) I check like once or twice a day, and there's usually at least 50 available at 69.

Bdot 2012-01-16 16:53

[QUOTE=KyleAskine;286482]I have mfakto on a 6570, which is far from a high end card. When I go from 69 to 70 on it, it can do around 50 or 55 M/s. When I go from 70 to 71 it can only do around 25 M/s. Could I be hitting the onboard memory limit, so it has to use slow memory? Could it be something else? I haven't experienced this with any other card.[/QUOTE]
Memory usage should be the same (and very low). I'd guess that some other parameters are chosen differently, like threads per grid, NumStreams, VectorSize, SievePrimes or the kernel being used. Threads per grid and NumStreams are the only parameters to influence GPU memory consumption.

Are CPU and GPU load about the same? Do you set the affinity of mfakto (differently)? I'll include in the test package a binary that can tell you the raw kernel speed. That can help isolate the issue.

KyleAskine 2012-01-16 19:13

[QUOTE=Bdot;286485]Memory usage should be the same (and very low). I'd guess that some other parameters are chosen differently, like threads per grid, NumStreams, VectorSize, SievePrimes or the kernel being used. Threads per grid and NumStreams are the only parameters to influence GPU memory consumption.

Are CPU and GPU load about the same? Do you set the affinity of mfakto (differently)? I'll include in the test package a binary that can tell you the raw kernel speed. That can help isolate the issue.[/QUOTE]

Something wonky happened with the drivers (or something), but a reboot got me back to normal.

RMAC9.5 2012-01-19 02:27

Bdot,

I recently added an ATI6970 video card to this PC and need some help in setting up Mfakto-0.10p1. My O/S is Windows Server 2003 SP1. My video driver is 11-12_xp64_dd_ccc_ocl. The CCC setup exe or msi appeared to install correctly even though I don't see the red CCC icon at bottom right side of my screen. My initial error message when I ran Mfakto-64.exe was a MSVCR100.dll not found message. After getting this error I ran the OpenCL.msi file in the OpenCL64 folder and the CCC_utility64.msi file in the Utility64 folder. They appeared to install correctly, but this did not solve my problem. Search finds 3 copies of the MSVCR100.dll on this PC with today's date. Suggestions on what to do next are appreciated.

Thanks,
Roger

flashjh 2012-01-19 03:27

[QUOTE=RMAC9.5;286665]Bdot,

I recently added an ATI6970 video card to this PC and need some help in setting up Mfakto-0.10p1. My O/S is Windows Server 2003 SP1. My video driver is 11-12_xp64_dd_ccc_ocl. The CCC setup exe or msi appeared to install correctly even though I don't see the red CCC icon at bottom right side of my screen. My initial error message when I ran Mfakto-64.exe was a MSVCR100.dll not found message. After getting this error I ran the OpenCL.msi file in the OpenCL64 folder and the CCC_utility64.msi file in the Utility64 folder. They appeared to install correctly, but this did not solve my problem. Search finds 3 copies of the MSVCR100.dll on this PC with today's date. Suggestions on what to do next are appreciated.

Thanks,
Roger[/QUOTE]

Did you use the GPU computing guide? I found I needed the [URL="http://developer.amd.com/sdks/AMDAPPSDK/downloads/Pages/default.aspx"]AMD App SDK[/URL]. I never saw that error though... bdot may have more suggestions.

EDIT: I did some searching for your specific error and found [URL="http://social.msdn.microsoft.com/Forums/nl-NL/vssetup/thread/51573822-a6cc-4cf8-aaad-2899c8703438"]this[/URL]:

[CODE]I managed to fix this missing MSVCR100.dll error by installing the 2010 C++ Visual Studio redistributable at:
[URL]http://www.microsoft.com/downloads/en/details.aspx?FamilyID=BD512D9E-43C8-4655-81BF-9350143D5867[/URL]
[/CODE]

Bdot 2012-01-20 16:14

[QUOTE=RMAC9.5;286665]Bdot,

I recently added an ATI6970 video card to this PC and need some help in setting up Mfakto-0.10p1. My O/S is Windows Server 2003 SP1. My video driver is 11-12_xp64_dd_ccc_ocl. The CCC setup exe or msi appeared to install correctly even though I don't see the red CCC icon at bottom right side of my screen. My initial error message when I ran Mfakto-64.exe was a MSVCR100.dll not found message. After getting this error I ran the OpenCL.msi file in the OpenCL64 folder and the CCC_utility64.msi file in the Utility64 folder. They appeared to install correctly, but this did not solve my problem. Search finds 3 copies of the MSVCR100.dll on this PC with today's date. Suggestions on what to do next are appreciated.

Thanks,
Roger[/QUOTE]
Hi Roger,

As flashjh already pointed out, this is the MS Visual C++ Runtime, version 10.0, which needs to be installed via the redistributable. If you already have that dll a few times, then mfakto was just not able to find it (the dll's directory needs to be in the PATH variable), or it has the wrong version. You can try creating another copy of that dll in mfakto's directory, if you want to avoid downloading and installing the redistributable.
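
If you want to check where (or whether) the dll would be found, something like this sketch lists the candidate directories. It only approximates the Windows search order (the program's own directory first, then the PATH entries), and the mfakto directory in the example call is a placeholder.

[code]
import os

def find_dll(name, app_dir, path_value=None):
    """Return the directories containing `name`: app_dir first, then PATH entries."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    dirs = [app_dir] + [d for d in path_value.split(os.pathsep) if d]
    return [d for d in dirs if os.path.isfile(os.path.join(d, name))]

# Example call; replace the placeholder with your own mfakto directory:
# print(find_dll("MSVCR100.dll", r"C:\path\to\mfakto"))
[/code]

An empty result means none of those directories holds the dll. Note that this does not check whether a copy it finds is the right version.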

Bdot

Bdot 2012-01-22 10:41

[QUOTE=flashjh;286668]Did you use the GPU computing guide? I found I needed the [URL="http://developer.amd.com/sdks/AMDAPPSDK/downloads/Pages/default.aspx"]AMD App SDK[/URL]. I never saw that error though... bdot may have more suggestions.[/QUOTE]

BTW, when using the latest (11.12) AMD drivers, the APP SDK is no longer required. I'm not sure since when that is so - AMD documented it for 11.7/Windows and 11.9/Linux, but I found on 11.9/Windows it was still needed.

flashjh 2012-01-22 14:31

[QUOTE=Bdot;286938]BTW, when using the latest (11.12) AMD drivers, the APP SDK is no longer required. I'm not sure since when that is so - AMD documented it for 11.7/Windows and 11.9/Linux, but I found on 11.9/Windows it was still needed.[/QUOTE]

Yes, you're right... My newest install only has 11.12 and everything works great. Thanks.

RMAC9.5 2012-01-23 06:37

ATI 6970 & Windows Server 2003 64 bit O/S
 
Hi Bdot,
Thanks for your quick reply. I have some progress to report.
I copied the MSVCR100 and MSVCP100 dlls from C:\ATI\Support\11-12_xp64_dd_ccc_ocl\Bin64 (default install folder) to the C:\Mfakto folder that I created for the mfakto-0.10p1 zip file that I downloaded. Mfakto-64.exe opened a black background message window and told me the following when I ran it:[LIST]
[*]select device -gpu not found, fall back to cpu
[*]then a simple test started from 1 - 15
[*]when it completed the black background message window closed and nothing else appeared to happen[/LIST]Two more pieces of information that might be important: [LIST]
[*]The AMDDriverDownloader.exe that I downloaded from AMD's web site recommended the 11-12_xp64 file above as the best match for my Windows Server 2003 O/S.
[*]This PC is still using a plain vanilla VGA device driver for my CRT because it was NOT replaced when the CCC driver completed its successful install. My video resolution is fine because I don't game on this PC and I only installed my ATI video card and CCC driver for Folding or (my first love) GIMPS.[/LIST]

Bdot 2012-01-23 19:10

[QUOTE=RMAC9.5;287014]Hi Bdot,
Thanks for your quick reply. I have some progress to report.
I copied the MSVCR100 and MSVCP100 dlls from C:\ATI\Support\11-12_xp64_dd_ccc_ocl\Bin64 (default install folder) to the C:\Mfakto folder that I created for the mfakto-0.10p1 zip file that I downloaded. Mfakto-64.exe opened a black background message window and told me the following when I ran it:[LIST][*]select device -gpu not found, fall back to cpu
[*]then a simple test started from 1 - 15
[*]when it completed the black background message window closed and nothing else appeared to happen[/LIST][/QUOTE]
This means that copying in the dlls allowed mfakto to start. "GPU not found" is not what you want, and it will always happen if the AMD device driver is not used for the GPU. There can be other reasons, but as you already found out that the AMD driver is not used, that needs to be resolved first.

[QUOTE=RMAC9.5;287014]
Two more pieces of information that might be important: [LIST][*]The AMDDriverDownloader.exe that I downloaded from AMD's web site recommended the 11-12_xp64 file above as the best match for my Windows Server 2003 O/S.
[*]This PC is still using a plain vanilla VGA device driver for my CRT because it was NOT replaced when the CCC driver completed its successful install. My video resolution is fine because I don't game on this PC and I only installed my ATI video card and CCC driver for Folding or (my first love) GIMPS.[/LIST][/QUOTE]
The choice of the driver is correct, 2003 is the server OS matching XP. If this driver is not working, mfakto will not be able to use the GPU and revert to the CPU. (Am I repeating myself ? :ermm: )
I suggest completely removing the whole CCC (via Add/Remove Programs), and reinstalling the MSI package. I don't have 2003 around, but maybe some forum will know if 64-bit 2003 is supposed to work with CCC (or vice versa).

Bdot 2012-01-23 19:35

While reading the Release Notes of the just-released AMD APP SDK 2.6 I found this here:
[quote]
Async copies preview (set environment variable GPU_ASYNC_MEM_COPY=2 to enable).
[/quote]As this is a runtime thing, I tested it on a Windows and a Linux box, and indeed: When using a single instance of mfakto, the memory transfer to the GPU will be hidden. Performance gain: 10-20%. When using more than one mfakto instance, the gain is less, but still measurable on my systems. This is mainly caused by a higher GPU utilisation.

However, be cautious: AMD called it a "preview". I guess it is not well-tested. If you decide to give it a try, please run the full selftest (mfakto -st2) before real trial-factoring. Prerequisite is Catalyst 11.12 which has the same OpenCL runtime as APP SDK 2.6.

(In case anyone is not sure how to set that on Windows: Either set it in Control Panel -> System -> Advanced -> Environment variables as a new System variable, and then restart all cmd-prompts, or in a cmd-prompt, run "set GPU_ASYNC_MEM_COPY=2" before starting mfakto.)
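
The same thing can be done from a small launcher script, which sets the variable only for the mfakto process. This is a sketch: the function names are made up, the executable path is a placeholder, and -st2 is the selftest option mentioned above.

[code]
import os
import subprocess

def async_copy_env(base_env=None):
    """Copy of the environment with AMD's async-copy preview switched on."""
    env = dict(os.environ if base_env is None else base_env)
    env["GPU_ASYNC_MEM_COPY"] = "2"
    return env

def run_mfakto(exe_path, *args):
    """Start mfakto with the modified environment and wait for it to finish."""
    return subprocess.call([exe_path, *args], env=async_copy_env())

# Placeholder path; run the full selftest first, as recommended above:
# run_mfakto(r"C:\path\to\mfakto.exe", "-st2")
[/code]

Because the variable lives only in the child process, other programs and already-open cmd-prompts are not affected.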

I'd be very interested to hear about your experiences should you give it a try. I think, this 10-20% speed-up will also affect entry-level cards. Is there anyone around who could test that on one of those?

KyleAskine 2012-01-23 20:15

[QUOTE=Bdot;287055]While reading the Release Notes of the just-released AMD APP SDK 2.6 I found this here:
As this is a runtime thing, I tested it on a Windows and a Linux box, and indeed: When using a single instance of mfakto, the memory transfer to the GPU will be hidden. Performance gain: 10-20%. When using more than one mfakto instance, the gain is less, but still measurable on my systems. This is mainly caused by a higher GPU utilisation.

However, be cautious: AMD called it a "preview". I guess it is not well-tested. If you decide to give it a try, please run the full selftest (mfakto -st2) before real trial-factoring. Prerequisite is Catalyst 11.12 which has the same OpenCL runtime as APP SDK 2.6.

(In case anyone is not sure how to set that on Windows: Either set it in Control Panel -> System -> Advanced -> Environment variables as a new System variable, and then restart all cmd-prompts, or in a cmd-prompt, run "set GPU_ASYNC_MEM_COPY=2" before starting mfakto.)

I'd be very interested to hear about your experiences should you give it a try. I think, this 10-20% speed-up will also affect entry-level cards. Is there anyone around who could test that on one of those?[/QUOTE]

I upgraded to 12.1 preview recently on my entry level (6570) card, and this must have automatically been set, because my performance increased around 20% and my utilization went up around 7% (from around 88% to 95%). I tried typing that, and it didn't change, which is why I speculate that this gets set automatically in 12.1.

flashjh 2012-01-24 03:00

[QUOTE=Bdot;287055]While reading the Release Notes of the just-released AMD APP SDK 2.6 I found this here:
As this is a runtime thing, I tested it on a Windows and a Linux box, and indeed: When using a single instance of mfakto, the memory transfer to the GPU will be hidden. Performance gain: 10-20%. When using more than one mfakto instance, the gain is less, but still measurable on my systems. This is mainly caused by a higher GPU utilisation.

However, be cautious: AMD called it a "preview". I guess it is not well-tested. If you decide to give it a try, please run the full selftest (mfakto -st2) before real trial-factoring. Prerequisite is Catalyst 11.12 which has the same OpenCL runtime as APP SDK 2.6.

(In case anyone is not sure how to set that on Windows: Either set it in Control Panel -> System -> Advanced -> Environment variables as a new System variable, and then restart all cmd-prompts, or in a cmd-prompt, run "set GPU_ASYNC_MEM_COPY=2" before starting mfakto.)

I'd be very interested to hear about your experiences should you give it a try. I think, this 10-20% speed-up will also affect entry-level cards. Is there anyone around who could test that on one of those?[/QUOTE]

Amazing increase in speed! I run two instances (one per GPU) on my test machine. Times dropped from 2.3 sec/iter to 1.5. It's definitely not as good with more than one instance per GPU. The SievePrimes dropped to 5000, so I know it increased the GPU efficiency. I just wish it was as efficient with more than one instance.

Bdot 2012-01-24 08:39

[QUOTE=flashjh;287088]Amazing increase in speed! I run two instances (one per GPU) on my test machine. Times dropped from 2.3 sec/iter to 1.5. It's definitely not as good with more than one instance per GPU. The SievePrimes dropped to 5000, so I know it increased the GPU efficiency. I just wish it was as efficient with more than one instance.[/QUOTE]
Hmm, what do you get with two instances? Due to better sieving it should be below 3 secs/class for each instance ... What is the CPU load with two instances? If two instances need to share a CPU core, then you will not see any additional throughput - maybe even a decrease as the async copy seems to require a little more CPU.

Bdot 2012-01-24 08:42

[QUOTE=KyleAskine;287062]I upgraded to 12.1 preview recently on my entry level (6570) card, and this must have automatically been set, because my performance increased around 20% and my utilization went up around 7% (from around 88% to 95%). I tried typing that, and it didn't change, which is why I speculate that this gets set automatically in 12.1.[/QUOTE]
Thanks, that is good to know - when that feature leaves "preview" status and is enabled everywhere, more people will benefit from it.

flashjh 2012-01-24 13:44

[QUOTE=Bdot;287098]Hmm, what do you get with two instances? Due to better sieving it should be below 3 secs/class for each instance ... What is the CPU load with two instances? If two instances need to share a CPU core, then you will not see any additional throughput - maybe even a decrease as the async copy seems to require a little more CPU.[/QUOTE]

Yes it is, sorry for the unnecessarily discouraging report :blush:. I was getting about 2.3 sec/class when running two instances per card, which is very good! I stopped the extra instance because the other factoring on the computer slowed down too much for now. I'll start other instances up later.

Bdot 2012-01-28 19:39

Catalyst 12.1 released, but wait!
 
Kyle has tried the freshly released Catalyst 12.1 drivers and now mfakto is aborting (SIGSEGV) in Linux. You may want to wait a little until I have had time to investigate, probably during the coming week. I may have a chance to test 12.1 on W7 tomorrow.

flashjh 2012-01-28 21:06

[QUOTE=Bdot;287536]Kyle has tried the freshly released Catalyst 12.1 drivers and now mfakto is aborting (SIGSEGV) in Linux. You may want to wait a little until I have had time to investigate, probably during the coming week. I may have a chance to test 12.1 on W7 tomorrow.[/QUOTE]

I have been using 12.1 on WinXP 32 and Win 7 64 for several days with really good results.

kracker 2012-01-29 03:22

[QUOTE=flashjh;287549]I have been using 12.1 on WinXP 32 and Win 7 64 for several days with really good results.[/QUOTE]

+1 Here also.

KyleAskine 2012-01-30 00:30

[QUOTE=kracker;287595]+1 Here also.[/QUOTE]

Yes, it works fantastic on Win 7. But my Linux box bombed when I installed it. I had to roll back.

bcp19 2012-01-30 14:57

Is there a... dunno the right 'word'... a changeover in mfakto around 29.504-29.505M? I was just noticing that my GPU is now taking 51-55 minutes to complete 29.505M exp's when it had been taking 43 minutes to do 29.503M ones.

[code]Using GPU kernel "mfakto_cl_barrett79"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
4617/4620 | 271.32M | 2.718s | 99.82M/s | 5000 | 0m00s | 6169us
no factor for M29504119 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett79]
tf(): total time spent: 43m 40.808s
got assignment: exp=29504159 bit_min=68 bit_max=69
tf(29504159, 68, 69, ...);
k_min = 5001801697200 - k_max = 10003603396367
Using GPU kernel "mfakto_cl_barrett79"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
4612/4620 | 271.32M | 2.727s | 99.49M/s | 5000 | 0m00s | 6275us
no factor for M29504159 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett79]
tf(): total time spent: 43m 41.343s
got assignment: exp=29504177 bit_min=68 bit_max=69
tf(29504177, 68, 69, ...);
k_min = 5001798643380 - k_max = 10003597293337
Using GPU kernel "mfakto_cl_barrett79"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
4612/4620 | 271.32M | 2.727s | 99.49M/s | 5000 | 0m00s | 6264us
no factor for M29504177 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett79]
tf(): total time spent: 43m 41.141s
got assignment: exp=29504227 bit_min=68 bit_max=69
tf(29504227, 68, 69, ...);
k_min = 5001790165680 - k_max = 10003580340517
Using GPU kernel "mfakto_cl_barrett79"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
4617/4620 | 271.32M | 2.715s | 99.93M/s | 5000 | 0m00s | 6165us
no factor for M29504227 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett79]
tf(): total time spent: 43m 41.050s
got assignment: exp=29504269 bit_min=68 bit_max=69
tf(29504269, 68, 69, ...);
k_min = 5001783046260 - k_max = 10003566100192
Using GPU kernel "mfakto_cl_barrett79"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
4616/4620 | 271.32M | 2.727s | 99.49M/s | 5000 | 0m00s | 6235us
no factor for M29504269 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett79]
tf(): total time spent: 43m 41.435s
got assignment: exp=29504351 bit_min=68 bit_max=69
tf(29504351, 68, 69, ...);
k_min = 5001769144680 - k_max = 10003538297770
Using GPU kernel "mfakto_cl_barrett79"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
4609/4620 | 271.32M | 2.820s | 96.21M/s | 5000 | 0m00s | 6629us
no factor for M29504351 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett79]
tf(): total time spent: 48m 29.711s
got assignment: exp=29504383 bit_min=68 bit_max=69
tf(29504383, 68, 69, ...);
k_min = 5001763720800 - k_max = 10003527448086
Using GPU kernel "mfakto_cl_barrett79"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
4613/4620 | 271.32M | 2.787s | 97.35M/s | 5000 | 0m00s | 6448us
no factor for M29504383 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett79]
tf(): total time spent: 52m 37.465s
got assignment: exp=29504399 bit_min=68 bit_max=69
tf(29504399, 68, 69, ...);
k_min = 5001761008860 - k_max = 10003522023253
Using GPU kernel "mfakto_cl_barrett79"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
4617/4620 | 271.32M | 2.789s | 97.28M/s | 5000 | 0m00s | 6405us
no factor for M29504399 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett79]
tf(): total time spent: 52m 11.283s
got assignment: exp=29504443 bit_min=68 bit_max=69
tf(29504443, 68, 69, ...);
k_min = 5001753552180 - k_max = 10003507104992
Using GPU kernel "mfakto_cl_barrett79"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
4617/4620 | 271.32M | 2.832s | 95.80M/s | 5000 | 0m00s | 6659us
no factor for M29504443 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett79]
tf(): total time spent: 56m 20.353s
got assignment: exp=29504507 bit_min=68 bit_max=69
tf(29504507, 68, 69, ...);
k_min = 5001742699800 - k_max = 10003485405784
Using GPU kernel "mfakto_cl_barrett79"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
4617/4620 | 271.32M | 2.824s | 96.08M/s | 5000 | 0m00s | 6597us
no factor for M29504507 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett79]
tf(): total time spent: 55m 39.322s
got assignment: exp=29504509 bit_min=68 bit_max=69
tf(29504509, 68, 69, ...);
k_min = 5001742362540 - k_max = 10003484727685
Using GPU kernel "mfakto_cl_barrett79"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
4619/4620 | 271.32M | 2.823s | 96.11M/s | 5000 | 0m00s | 6633us
no factor for M29504509 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett79]
tf(): total time spent: 51m 11.275s
got assignment: exp=29504569 bit_min=68 bit_max=69
tf(29504569, 68, 69, ...);
k_min = 5001732189300 - k_max = 10003464384765
Using GPU kernel "mfakto_cl_barrett79"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
4616/4620 | 271.32M | 2.721s | 99.71M/s | 5000 | 0m00s | 6085us
no factor for M29504569 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett79]
tf(): total time spent: 52m 12.215s
got assignment: exp=29504669 bit_min=68 bit_max=69
tf(29504669, 68, 69, ...);
k_min = 5001715238520 - k_max = 10003430480082
Using GPU kernel "mfakto_cl_barrett79"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
4612/4620 | 271.32M | 2.818s | 96.28M/s | 5000 | 0m00s | 6580us
no factor for M29504669 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett79]
tf(): total time spent: 54m 25.430s
got assignment: exp=29504677 bit_min=68 bit_max=69
tf(29504677, 68, 69, ...);
k_min = 5001713880240 - k_max = 10003427767718
Using GPU kernel "mfakto_cl_barrett79"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
4619/4620 | 271.32M | 2.822s | 96.14M/s | 5000 | 0m00s | 6607us
no factor for M29504677 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett79]
tf(): total time spent: 53m 29.768s
got assignment: exp=29504693 bit_min=68 bit_max=69
tf(29504693, 68, 69, ...);
k_min = 5001711168300 - k_max = 10003422342993
Using GPU kernel "mfakto_cl_barrett79"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
4615/4620 | 271.32M | 2.846s | 95.33M/s | 5000 | 0m00s | 6649us
no factor for M29504693 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett79]
tf(): total time spent: 51m 17.654s
got assignment: exp=29504773 bit_min=68 bit_max=69
tf(29504773, 68, 69, ...);
k_min = 5001697608600 - k_max = 10003395219456
Using GPU kernel "mfakto_cl_barrett79"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
4616/4620 | 271.32M | 2.817s | 96.31M/s | 5000 | 0m00s | 6581us
no factor for M29504773 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett79]
tf(): total time spent: 53m 33.822s
got assignment: exp=29504801 bit_min=68 bit_max=69
tf(29504801, 68, 69, ...);
k_min = 5001692859240 - k_max = 10003385726253
Using GPU kernel "mfakto_cl_barrett79"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
4615/4620 | 271.32M | 2.757s | 98.41M/s | 5000 | 0m00s | 6250us
no factor for M29504801 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett79]
tf(): total time spent: 53m 46.318s
got assignment: exp=29504863 bit_min=68 bit_max=69
tf(29504863, 68, 69, ...);
k_min = 5001682348740 - k_max = 10003364705653
Using GPU kernel "mfakto_cl_barrett79"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
4617/4620 | 271.32M | 2.831s | 95.84M/s | 5000 | 0m00s | 6653us
no factor for M29504863 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett79]
tf(): total time spent: 51m 21.457s
[/code]
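
To make the jump in the log above easier to spot, the per-exponent totals can be pulled out with a few lines of Python. This is only a sketch: the function name `total_times` is made up for illustration, and it assumes the two line formats stay exactly as printed by mfakto 0.09 above (runs under an hour, "no factor" results only).

```python
import re

# Patterns for the two log lines of interest (formats copied from the log above).
RESULT = re.compile(r"no factor for M(\d+) from")
TOTAL = re.compile(r"tf\(\): total time spent:\s*(\d+)m\s*([\d.]+)s")

def total_times(log_text):
    """Return a list of (exponent, seconds) pairs, one per finished assignment."""
    results = []
    exponent = None
    for line in log_text.splitlines():
        m = RESULT.search(line)
        if m:
            exponent = int(m.group(1))   # remember which exponent just finished
            continue
        m = TOTAL.search(line)
        if m and exponent is not None:
            seconds = int(m.group(1)) * 60 + float(m.group(2))
            results.append((exponent, seconds))
            exponent = None
    return results

sample = """no factor for M29504269 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett79]
tf(): total time spent: 43m 41.435s
no factor for M29504351 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett79]
tf(): total time spent: 48m 29.711s
"""
for exp, secs in total_times(sample):
    print(f"M{exp}: {secs / 60:.1f} min")
```

Feeding it the whole log instead of `sample` gives one line per exponent, which makes it easy to see where the totals leave the 43-minute level.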

bcp19 2012-01-31 15:45

mfakto just moved up to 29.507M exps and has dropped back down to 43 min per exp. Interesting/weird little bump in that small a range.

KyleAskine 2012-01-31 21:05

[QUOTE=KyleAskine;287696]Yes, it works fantastic on Win 7. But my Linux box bombed when I installed it. I had to roll back.[/QUOTE]

I think this was an installation issue. It now works.

Bdot 2012-02-01 00:09

[QUOTE=bcp19;287889]mfakto just moved up to 29.507M exps and has dropped back down to 43 min per exp. Interesting/weird little bump in that small a range.[/QUOTE]
I bet if you do the range again, it will be fast. I'd blame Windows and all the background tasks it is performing (indexer, backup, Wupdate, defender / virus scanning ...). Even if it is none of that I'd try blaming Windows ;-)

Bdot 2012-02-01 00:23

[QUOTE=KyleAskine;287930]I think this was an installation issue. It now works.[/QUOTE]
I have 12.1 running for 2 days on my Linux box now. No aborts nor other issues. So I think it was some old library still on the system, or old values in LD_LIBRARY_PATH.

bcp19 2012-02-01 00:56

[QUOTE=Bdot;287943]I bet if you do the range again, it will be fast. I'd blame Windows and all the background tasks it is performing (indexer, backup, Wupdate, defender / virus scanning ...). Even if it is none of that I'd try blaming Windows ;-)[/QUOTE]

I can't imagine what process would run for 16+ hours impacting core 4 without affecting other items. Core 1 running 332M LL stayed at .168ms/iter, Core 2 still averaged 20 min 41-46sec on mfaktc, core 3 running 45M LL stayed at .017 ms/iter. So, something impacted Core 4? M/s stayed at 97-99. Odd thing to me was I caught 2 exps ending, and after seeing the ~54min post, the first class printed said 43min to go.

Bdot 2012-02-01 12:49

[QUOTE=bcp19;287947]I can't imagine what process would run for 16+ hours impacting core 4 without affecting other items. Core 1 running 332M LL stayed at .168ms/iter, Core 2 still averaged 20 min 41-46sec on mfaktc, core 3 running 45M LL stayed at .017 ms/iter. So, something impacted Core 4? M/s stayed at 97-99. Odd thing to me was I caught 2 exps ending, and after seeing the ~54min post, the first class printed said 43min to go.[/QUOTE]
Even after thinking about this a little bit more, I have no explanation. I thought about the difference in the exponents. The barrett kernels need a tiny bit longer to process a "1" instead of a "0" in the binary representation of the exponent. Usually, the first 7 bits are preprocessed on the host, so they don't count. Of the remaining bits, M29504269 has just 8 set to "1", M29504399 has 10. I doubt this would really be measurable, and it certainly cannot account for +25% runtime.
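
The bit counts are easy to check. The following Python sketch drops the leading 7 bits of the exponent (the part said to be preprocessed on the host) and counts the ones that remain; the helper name `ones_after_prefix` is made up for this illustration, and it says nothing about how the kernel actually walks the bits.

```python
def ones_after_prefix(exponent, prefix_bits=7):
    """Count the 1-bits of `exponent` left after dropping its top `prefix_bits` bits."""
    bits = bin(exponent)[2:]              # binary digits, most significant first
    return bits[prefix_bits:].count("1")

for p in (29504269, 29504399):
    print(p, bin(p)[2:], ones_after_prefix(p))
```

For the two exponents named above this gives 8 and 10, matching the figures in the post.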
mfakto 0.09 still wrote checkpoints after each class. You have 11min=660s more runtime. That is 660/960 ~ 0.7s more per class. As the reported times per class do not fluctuate by that much, it is quite likely that the delay is in the host code. If you don't have any task specifically pinned to core #4, and the other tasks are not affected, this really just leaves disk access as the culprit. Which mfaktc version are you running? If that is before 0.18, then mfaktc would also write checkpoints after every class, so it should also be delayed by ~0.7s per class ... but if you did not switch to the latest mfakto version, then I assume you also did not switch to the latest mfaktc version. And if mfaktc < 0.18 was not affected I'm at the end of my knowledge/guesswork.
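
The arithmetic in the paragraph above can be redone against the numbers in the log. This is a rough Python sketch: the function name is made up, the 960 is simply the divisor used in the post, and treating the "time" column as the per-class time the program itself measured is an assumption.

```python
CLASSES_TESTED = 960  # classes per assignment, the divisor used in the post above

def extra_seconds_per_class(slow_total_s, fast_total_s, classes=CLASSES_TESTED):
    """Average extra wall-clock time per class between a slow and a fast run."""
    return (slow_total_s - fast_total_s) / classes

# 11 minutes of extra runtime, as in the post.
print(round(extra_seconds_per_class(54 * 60 + 41, 43 * 60 + 41), 2), "s extra per class")

# Total time implied by the reported per-class times (values taken from the log).
for per_class in (2.727, 2.820):
    print(f"{per_class} s/class -> {per_class * CLASSES_TESTED / 60:.1f} min")
```

The slow runs report about 2.82 s per class, which adds up to roughly 45 minutes, so most of a 52 to 56 minute total is not visible in the per-class times. That fits the suspicion that the delay sits outside the measured kernel time.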
BTW, both indexing and virus scan can take forever if you have lots of files. On a dev machine with some GB of subversion repositories (~500k files including the .svn ones), it did not finish within one day - I had to disable them.

bcp19 2012-02-01 19:13

[QUOTE=Bdot;287974]Even after thinking about this a little bit more, I have no explanation. I thought about the difference in the exponents. The barrett kernels need a tiny bit longer to process a "1" instead of a "0" in the binary representation of the exponent. Usually, the first 7 bits are preprocessed on the host, so they don't count. Of the remaining bits, M29504269 has just 8 set to "1", M29504399 has 10. I doubt this would really be measurable, and it certainly cannot account for +25% runtime.
mfakto 0.09 still wrote checkpoints after each class. You have 11min=660s more runtime. That is 660/960 ~ 0.7s more per class. As the reported times per class do not fluctuate by that much, it is quite likely that the delay is in the host code. If you don't have any task specifically pinned to core #4, and the other tasks are not affected, this really just leaves disk access as the culprit. Which mfaktc version are you running? If that is before 0.18, then mfaktc would also write checkpoints after every class, so it should also be delayed by ~0.7s per class ... but if you did not switch to the latest mfakto version, then I assume you also did not switch to the latest mfaktc version. And if mfaktc < 0.18 was not affected I'm at the end of my knowledge/guesswork.
BTW, both indexing and virus scan can take forever if you have lots of files. On a dev machine with some GB of subversion repositories (~500k files including the .svn ones), it did not finish within one day - I had to disable them.[/QUOTE]

I'm running .18 on mfaktc and .09 on mfakto. Disk access should not be a factor since it is being run from a ramdisk and I highly doubt my ram has a latency of .7s.

chalsall 2012-02-01 19:20

[QUOTE=bcp19;288004]I'm running .18 on mfaktc and .09 on mfakto. Disk access should not be a factor since it is being run from a ramdisk and I highly doubt my ram has a latency of .7s.[/QUOTE]

Things which make you go "Hmmmm... That's unusual..." are things which should be investigated.

Perhaps these exponents should be run again by another mfakto worker (using the same code) to see if the same behaviour is observed.

Bdot 2012-02-01 23:31

[QUOTE=chalsall;288005]Things which make you go "Hmmmm... That's unusual..." are things which should be investigated.

Perhaps these exponents should be run again by another mfakto worker (using the same code) to see if the same behaviour is observed.[/QUOTE]

Yes, you're right - if it is reproducible at all.

bcp19, could you please rerun one of the slow exponents, just to make sure it is something in mfakto? If it is slow again, then I'd like to know what Windows you're running, and which Catalyst version so I can setup the same ...

bcp19 2012-02-02 00:49

[QUOTE=Bdot;288034]Yes, you're right - if it is reproducible at all.

bcp19, could you please rerun one of the slow exponents, just to make sure it is something in mfakto? If it is slow again, then I'd like to know what Windows you're running, and which Catalyst version so I can setup the same ...[/QUOTE]

Ok, rerunning 29504443 which was 56 min. I'm running Win 7 and the driver under device manager says 8.892.0.0, which I think is the latest version, 12.1?

chalsall 2012-02-02 01:29

[QUOTE=bcp19;288044]Ok, rerunning 29504443 which was 56 min. I'm running Win 7 and the driver under device manager says 8.892.0.0, which I think is the latest version, 12.1?[/QUOTE]

You are both exactly correct -- rerun the unusual in the exact same environment, and see if the results are the same.

Einstein once said "Insanity: doing the same thing over and over again and expecting different results."

Of course, he never believed in quantum uncertainty; and certainly had never encountered Windows.... :grin:

bcp19 2012-02-02 01:53

[QUOTE=chalsall;288046]You are both exactly correct -- rerun the unusual in the exact same environment, and see if the results are the same.

Einstein once said "Insanity: doing the same thing over and over again and expecting different results."

Of course, he never believed in quantum uncertainty; and certainly had never encountered Windows.... :grin:[/QUOTE]

Too true. Exp complete, 46 minutes. So some stray protons or neutrons must have been flying around causing time to warp on my 4th core.

Bdot 2012-02-02 07:28

[QUOTE=bcp19;288044]Ok, rerunning 29504443 which was 56 min. I'm running Win 7 and the driver under device manager says 8.892.0.0, which I think is the latest version, 12.1?[/QUOTE]

Thanks, I should be able to run the same and see if I can find out what's up.

Edit: Oops, did not see this before:
[quote=bcp19]
Too true. Exp complete, 46 minutes. So some stray protons or neutrons must have been flying around causing time to warp on my 4th core.
[/quote]
Was that the same exponent again (a third time)? So we have 56m 20.353s, 56m and 46m? Sorry, I'm a bit confused ... :confused:

bcp19 2012-02-02 16:33

[QUOTE=Bdot;288062]Thanks, I should be able to run the same and see if I can find out what's up.

Edit: Oops, did not see this before:

Was that the same exponent again (a third time)? So we have 56m 20.353s, 56m and 46m? Sorry, I'm a bit confused ... :confused:[/QUOTE]

I only ran it twice. If you look close, I said I was rerunning the exp "which was 56 min".

Bdot 2012-02-02 21:23

File locks for worktodo and results
 
[QUOTE=bcp19;288091]I only ran it twice. If you look close, I said I was rerunning the exp "which was 56 min".[/QUOTE]
Sorry, my poor English had made me think this meant that the rerun took 56 min. Probably one of the occasions when it becomes obvious that I'm not a native English speaker ...

Anyway, thanks for the test - it seems the case can be blamed on Windows (or other applications running there), after all.

I have a question for the scripters among you. In the next mfakto version I plan to add file locking for worktodo and results files. While on Windows all APIs seem to boil down to the same underlying locking mechanism, on Linux we have fcntl/lockf and flock locks which are independent. In case you have scripts or programs for Linux to maintain worktodo and results files (such as [URL="http://www.gpu72.com/spider/"]Chalsall's Submission Spider[/URL]), would you consider synchronizing file accesses with mfakto? Do you have preferences which one to use? Is anyone accessing these files over NFS? Any other reasons why I should use one or the other? Of course I will provide these changes to Oliver once I get the locking to work, so that mfaktc can get the same.

Dubslow 2012-02-02 21:45

[QUOTE=Bdot;288114]Sorry, my poor English had made me think this meant that the rerun took 56 min. Probably one of the occasions when it becomes obvious that I'm not a native English speaker ...

[/QUOTE]
FWIW, you've managed to get by me for *checks join date* 8 months. I only just now looked at your location.

chalsall 2012-02-03 14:23

[QUOTE=Bdot;288114]I have a question for the scripters among you. In the next mfakto version I plan to add file locking for worktodo and results files. While on Windows all APIs seem to boil down to the same underlying locking mechanism, on Linux we have fcntl/lockf and flock locks which are independent. In case you have scripts or programs for Linux to maintain worktodo and results files (such as [URL="http://www.gpu72.com/spider/"]Chalsall's Submission Spider[/URL]), would you consider synchronizing file accesses with mfakto? Do you have preferences which one to use?[/QUOTE]

This would be excellent. And it doesn't really matter which is used, just so long as everyone agrees to use the same one (which would effectively be defined by you). Perl (and, of course, C) can do any of them. fcntl is more powerful, but in this case lockf would be just fine (and is a little easier to use).

[QUOTE]Is anyone accessing these files over NFS? Any other reasons why I should use one or the other? Of course I will provide these changes to Oliver once I get the locking to work, so that mfaktc can get the same.[/QUOTE]

I know at least a few people are using NFS, so I would suggest in addition to the locking, you also consider implementing the same functionality as Prime95/mprime's worktodo.add feature. This would allow scripts to add work to mfakto with no risks, even on NFS, as it would be mfakto which would be the sole writer of the file.
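
To make the idea concrete, here is a rough Python sketch of what a worktodo.add merge step could look like. It is not taken from Prime95/mprime or mfakto; the function name `merge_added_work` and the `.merging` suffix are invented for the illustration, and it assumes that scripts only ever create the add file while the program alone writes worktodo.txt.

```python
import os
import tempfile

def merge_added_work(worktodo="worktodo.txt", add_file="worktodo.add"):
    """Append the lines of add_file to worktodo, then remove add_file.

    Returns the number of assignment lines merged (0 if there was nothing to do).
    """
    claimed = add_file + ".merging"       # hypothetical private name
    try:
        # Take the file away in one step so a script can start a fresh add file.
        os.rename(add_file, claimed)
    except FileNotFoundError:
        return 0
    with open(claimed) as src:
        lines = [ln.strip() for ln in src if ln.strip()]
    # Make sure the first new line does not get glued to an unterminated last line.
    needs_newline = False
    if os.path.exists(worktodo) and os.path.getsize(worktodo) > 0:
        with open(worktodo, "rb") as f:
            f.seek(-1, os.SEEK_END)
            needs_newline = f.read(1) != b"\n"
    with open(worktodo, "a") as dst:
        if needs_newline:
            dst.write("\n")
        for ln in lines:
            dst.write(ln + "\n")
    os.remove(claimed)
    return len(lines)

with tempfile.TemporaryDirectory() as d:
    todo = os.path.join(d, "worktodo.txt")
    add = os.path.join(d, "worktodo.add")
    with open(todo, "w") as f:
        f.write("Factor=N/A,29504443,68,69\n")
    with open(add, "w") as f:
        f.write("Factor=N/A,29504507,68,69\n")
    print(merge_added_work(todo, add), "line(s) merged")
```

A script on the other side would ideally write its new assignments to a temporary name first and rename that to worktodo.add once complete, so the merge never sees a half-written file. How well that holds up over NFS is not something this sketch tests.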

I asked for this feature on the mfaktc thread; don't know if it was added to a todo list. Since you're looking at working on related functionality, could you consider doing so? I would then write a reservation spider to complement it (actually two, one for PrimeNet, and one for GPU72).

Dubslow 2012-02-03 18:00

FWIW, I like being able to modify the worktodo without having to worry about mfakt*. I only check to make sure it's not close to finishing an assignment, which is the only time it writes files. When would it lock the file? All the time? (I personally find it a pain to have to stop MPrime to modify worktodo, especially if I have S2 P-1 going. From what I can tell, most of the changes I typically make would not be possible with worktodo.add, e.g. adding only in sequence only whenever MPrime feels like looking at the file.)

chalsall 2012-02-03 20:02

[QUOTE=Dubslow;288204]FWIW, I like being able to modify the worktodo without having to worry about mfakt*. I only check to make sure it's not close to finishing an assignment, which is the only time it writes files.[/QUOTE]

Or, depending on the settings, when it (randomly) finds a factor, which you cannot predict.

[QUOTE=Dubslow;288204]From what I can tell, most of the changes I typically make would not be possible with worktodo.add, e.g. adding only in sequence only whenever MPrime feels like looking at the file.)[/QUOTE]

But you are exceptional (with all meanings of the word intended :smile:).

Most would find this function useful....

Dubslow 2012-02-03 22:35

Well then make the file locking a user option?

Bdot 2012-02-03 23:05

I've added worktodo.add to the todo list. But it's separate from the locking.

The files will not remain locked all the time (this would defeat the purpose of having multiple mfakto instances safely writing to the same results.txt file). The following is planned: mfakto will try to lock the file in question. If it fails to get the lock, it will say so and wait until it can lock the file. After the update to the file it will release the lock.
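
For script authors, the planned sequence (try the lock, report and wait if it is taken, update, release) could look roughly like this in Python on Linux. This is a sketch under assumptions, not mfakto's code: it uses flock() via Python's `fcntl` module, and the helper name `append_locked` is invented here.

```python
import fcntl
import os
import tempfile

def append_locked(path, line):
    """Append one line to `path` while holding an exclusive flock() on it."""
    with open(path, "a") as f:
        try:
            # First try without blocking so contention can be reported.
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            print(f"{path} is locked by someone else, waiting ...")
            fcntl.flock(f, fcntl.LOCK_EX)      # block until the lock is free
        try:
            f.write(line.rstrip("\n") + "\n")
            f.flush()
            os.fsync(f.fileno())               # push the data out before unlocking
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

with tempfile.TemporaryDirectory() as d:
    results = os.path.join(d, "results.txt")
    append_locked(results, "no factor for M29504443 from 2^68 to 2^69")
    append_locked(results, "no factor for M29504507 from 2^68 to 2^69")
    with open(results) as f:
        print(f.read(), end="")
```

The lock is only held for the duration of one write, so several writers sharing a results file just take turns. This only works if every writer uses the same lock type.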

On Windows this will mean that your editor will complain (either when opening the file, or when attempting to save - depends on the cleverness of the editor), if exactly at this moment mfakto owns the lock. Retrying should work then (again, depends). If you opened the file in an editor while mfakto wants to write something, most editors will block mfakto.

On Linux, by default, only "lock-aware" applications will notice that mfakto holds a lock - locks are not enforced. For instance,
[code]
gvim worktodo.txt
[/code]will ignore any locks on worktodo.txt, and mfakto may write updates to the file while you edit it.
[code]
flock worktodo.txt -c "gvim -f worktodo.txt"
[/code]will keep the file locked while you edit it, blocking mfakto for that time (the -f keeps gvim in the foreground, so the lock is held until the editor exits). (Some may have noticed I currently favor flock() over fcntl().)
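
The difference between a lock-aware and a lock-unaware program can be shown in a few lines of Python on Linux. This is a sketch only: it uses flock() through Python's `fcntl` module, one process plays all three roles by opening the file several times, and the function name is made up.

```python
import fcntl
import os
import tempfile

def demo_advisory_lock(path):
    """Hold an flock() on `path` and see what other openers of the file can do."""
    holder = open(path, "a")
    fcntl.flock(holder, fcntl.LOCK_EX)          # plays the role of the lock owner

    # A writer that never asks for the lock is not stopped by it.
    with open(path, "a") as unaware:
        unaware.write("written without asking for the lock\n")

    # A lock-aware writer asks first and is refused while the lock is held.
    with open(path, "a") as aware:
        try:
            fcntl.flock(aware, fcntl.LOCK_EX | fcntl.LOCK_NB)
            aware_got_lock = True
            fcntl.flock(aware, fcntl.LOCK_UN)
        except BlockingIOError:
            aware_got_lock = False

    fcntl.flock(holder, fcntl.LOCK_UN)
    holder.close()
    with open(path) as f:
        unaware_wrote = "without asking" in f.read()
    return unaware_wrote, aware_got_lock

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "worktodo.txt")
    open(path, "w").close()
    print(demo_advisory_lock(path))
```

The plain write goes through even though the file is locked, while the second flock() attempt is turned away. That is the "locks are not enforced" behaviour: the lock only protects against programs that ask for it.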

Do you really think this needs to be configurable?


And another update: I wrote a quick-hack-kernel that uses just 15 bits per int. This way I can completely avoid the expensive 32-bit mul and mul_hi. Kernel Analyzer says that on Cayman this will be ~20% faster than mul24, while it's of no use for the other GPUs. So this kernel could make HD6970 almost as fast as HD5870, or even a bit faster ...
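
Why 15 bits per int avoids mul_hi can be illustrated in Python. This is not the kernel, just a schoolbook multiplication over 15-bit limbs with made-up helper names; since Python integers never overflow, an assert stands in for the 32-bit limit.

```python
LIMB_BITS = 15
LIMB_MASK = (1 << LIMB_BITS) - 1

def to_limbs(n, count):
    """Split a non-negative integer into `count` 15-bit limbs, least significant first."""
    return [(n >> (LIMB_BITS * i)) & LIMB_MASK for i in range(count)]

def from_limbs(limbs):
    """Reassemble an integer from 15-bit limbs."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))

def mul_limbs(a, b):
    """Schoolbook product of two limb lists, checking every intermediate fits in 32 bits."""
    result = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            t = result[i + j] + ai * bj + carry
            assert t < 1 << 32            # a 32-bit unsigned int would be enough
            result[i + j] = t & LIMB_MASK
            carry = t >> LIMB_BITS
        result[i + len(b)] = carry
    return result

# A factor candidate 2*k*p + 1, built from the k_min and exponent of one log entry.
f = 2 * 5001753552180 * 29504443 + 1
limbs = to_limbs(f, 5)                     # 5 limbs cover up to 75 bits
print(from_limbs(mul_limbs(limbs, limbs)) == f * f)
```

With limbs below 2^15, a product of two limbs stays below 2^30, and adding a limb and a carry keeps the running value below 2^30 as well. The low 32 bits therefore hold the whole result and the high half of the product is never needed, at the cost of more limbs per number.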

Dubslow 2012-02-03 23:32

[QUOTE=Bdot;288241]

Do you really think this needs to be configurable?
[/QUOTE]

I really had no idea about the details of your goals and implementation, and was only stating what I liked (or rather that's what I should have done). If you think that my way and this locking thingy are still compatible, that's fine by me. I suppose my only real request is that when this is all cut and dried, that I know whether (and how) I can still edit worktodo on the fly.

chalsall 2012-02-03 23:49

[QUOTE=Bdot;288241]I've added worktodo.add to the todo list. But it's separate from the locking. On Linux, by default, only "lock-aware" applications will notice that mfakto holds a lock - locks are not enforced[/QUOTE]

Of course. Locks are agreed upon; they are semaphores.

