mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfakto: an OpenCL program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=15646)

Bdot 2014-07-27 11:46

[QUOTE=kracker;378955]I'll do/try that. :smile:

On another note...
[code]

Selftest statistics
number of tests 287351
successful tests 287350
no factor found 1

selftest FAILED!

ERROR: selftest failed, exiting.
[/code][/QUOTE]
Oh, that's sad. Was that the IntelHD? And again it was a really good choice to add so many test cases ...

It must be a difference in rounding that leads to a higher-than-expected error. I cannot reproduce this error on my H/W. Can you reproduce it?

The factor is not particularly close to the limit of this kernel, and the exponent does not have a long run of ones in its binary form ... I don't see why this test should fail.
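
For anyone following along, here is a minimal sketch of the two properties mentioned above: whether a candidate really divides a Mersenne number, and how long the longest run of ones in the exponent's binary form is. It is plain Python, not mfakto's OpenCL code, and the function names are made up for illustration. The small example M11 = 23 x 89 stands in for the failing test case, whose factor is not given in this thread.

```python
def is_factor_of_mersenne(p, f):
    """True if f divides 2**p - 1, i.e. 2**p is congruent to 1 modulo f."""
    return f > 1 and pow(2, p, f) == 1


def has_candidate_form(p, f):
    """Factors of 2**p - 1 (p an odd prime) have the form 2*k*p + 1
    and are congruent to 1 or 7 modulo 8."""
    return (f - 1) % (2 * p) == 0 and f % 8 in (1, 7)


def longest_run_of_ones(n):
    """Length of the longest run of consecutive 1 bits in n."""
    best = run = 0
    while n:
        if n & 1:
            run += 1
            best = max(best, run)
        else:
            run = 0
        n >>= 1
    return best


# Illustrative values only: M11 = 2047 = 23 * 89.
for f in (23, 89):
    print(f, is_factor_of_mersenne(11, f), has_candidate_form(11, f))
```

The modular exponentiation is the same kind of check a trial factoring kernel performs per candidate, only without the fixed-width arithmetic where rounding problems can arise.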

I'll provide a special test version to you to be able to analyze this failure.

kracker 2014-07-28 01:12

Yep, just ran part of -st2 again on my HD4600. Same failure.

Rodrigo 2014-07-29 12:05

[QUOTE=LaurV;376585]If you want to squeeze more performance from mfakto, try to factor only to bit levels below 73. Contrary to mfaktc, where there is no big drop in performance for higher bit levels (or, say, no big gain in performance for lower bit levels), for mfakto, especially on higher GCN cards, the "shorter" kernels are much faster. For example, from my HD7970 GHz edition I get something like 450GHzD/D when factoring 6xM to 74, but 500GHzD/D when factoring to 73 only, and so on. Decreasing the bit level increases the "gain" (but helps GIMPS less), and decreasing the exponent also increases the "gain", but only a little. For example, the same card I described above gives 630-650GHzD/D when factoring 4xM exponents to 69 bits. Right now, Chris has made them unavailable from GPU72, to channel the workers toward Cat4 exponents, but there are still [URL="http://www.gpu72.com/reports/current_level/"]35 thousands[/URL] of them (44-47M) at 68 bits; you can take them to 69 directly from PrimeNet, or ask Chris to make them available. For this range of exponents and bit level, the performance of the card (kernel) is about 50% higher.[/QUOTE]

How does one go about getting these kinds of assignments? I just went into PrimeNet to request manual assignments in that range using the specified range option, and the server took forever to respond, finally issuing a timeout error.

Can I enter, say "45000000" to "47000000", or do I have to enter specific starting/ending exponents?

(I was able to get manual assignments in the normal manner if no range was indicated.)

Rodrigo

LaurV 2014-07-29 13:29

[QUOTE=Rodrigo;379296]How does one go about getting these kinds of assignments?[/QUOTE]
GPU72

chalsall 2014-07-29 16:26

[QUOTE=Rodrigo;379296]Can I enter, say "45000000" to "47000000", or do I have to enter specific starting/ending exponents?[/QUOTE]

I can't speak to how one might get such assignments from PrimeNet using the manual assignment page there, since GPU72 uses different techniques to reserve candidates for TF'ing beyond the nominal CPU "bit levels".

But such work is easily available from GPU72 -- both low DCTF'ing and low LLTF'ing. Please note that LLTF'ing is the most needed at the moment, and the deeper the better. But those whose cards are better suited to lower bit levels are encouraged to work there.

Rodrigo 2014-07-29 23:20

Thanks, LaurV and Chris.

I guess I got sidetracked by this part:

[QUOTE]but there are still [URL="http://www.gpu72.com/reports/current_level/"]35 thousands[/URL] of them (44-47M) at 68 bits, you can take them to 69 directly from PrimeNet[/QUOTE]
It's time to get around to investigating GPU72. :smile:

Rodrigo

potonono 2014-08-03 16:35

[QUOTE=kracker;379188]Yep, just ran part of -st2 again on my HD4600. Same failure.[/QUOTE]

I get the same failure on my HD2500.

[QUOTE]######### testcase 2584/32927 (M59000521[82-83]) #########
Starting trial factoring M59000521 from 2^82 to 2^83 (16600.99GHz-days)
Using GPU kernel "cl_barrett15_83_gs_4"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Aug 03 08:00 | 3828 0.1% | 4.565 n.a. | n.a. 81205 0.00%
no factor for M59000521 from 2^82 to 2^83 [mfakto 0.15pre1-Win cl_barrett15_83_gs_4]
ERROR: selftest failed for M59000521 (cl_barrett15_83_gs)
no factor found
tf(): total time spent: 4.565s[/QUOTE]

Bdot 2014-08-05 20:56

[QUOTE=potonono;379619]I get the same failure on my HD2500.[/QUOTE]
Thanks for the confirmation.

I'm troubleshooting this with kracker, but I don't have enough time to make good progress on it. The next step would be to build a version where tracing shows exactly where the calculations go wrong ...

Jayder 2014-08-11 13:43

[QUOTE=Bdot;371459]I have released [URL="http://mersenneforum.org/mfakto/mfakto-0.14/"]mfakto 0.14.[/URL]

As most people seem to use the GPU sieve, I no longer created the versions with different (CPU-) sieve sizes. If anyone still needs them, just let me know.[/QUOTE]
I know it's been a while since release, but if you can be bothered to, would you mind making a 64kB version if not also (optionally) a -var version? The GPU sieve is nice, but I think I am willing to switch back as the CPU sieve results in almost twice the speed on my APU. The standard 36kB sieve size limit is also quite a bit slower than 64kB.

Feel free to say no or to put it at the end of your to-do list. :smile: I can stick with the GPU sieve for a while longer. I seem to be the only one wanting it, and I don't expect you to go out of your way or anything.

Bdot 2014-08-13 00:50

[QUOTE=Jayder;380174]I know it's been a while since release, but if you can be bothered to, would you mind making a 64kB version if not also (optionally) a -var version? The GPU sieve is nice, but I think I am willing to switch back as the CPU sieve results in almost twice the speed on my APU. The standard 36kB sieve size limit is also quite a bit slower than 64kB.

Feel free to say no or to put it at the end of your to-do list. :smile: I can stick with the GPU sieve for a while longer. I seem to be the only one wanting it, and I don't expect you to go out of your way or anything.[/QUOTE]

No problem, I think I can build them within the next days.


Just a quick update about the HD4600/2500 selftest failure:
I analyzed kracker's data and the code. The reason is that the HD4600 has a slightly different rounding behavior.

I also noticed that on AMD devices, too, the code walks dangerously close to the limit of the available precision. Even though all tests succeed, the warning lights that Oliver once built into the code (CHECKS_MODBASECASE) do light up in the 15_82/15_83 and 15_88 kernels.

To fix that, I finally made a long-overdue attempt: basing the initial division on double instead of float. This allows doing the div_180_90 in two steps instead of five, with only one big multiplication in between instead of four.

Result: no more CHECKS_MODBASECASE issues (plenty of safety bits), and 1.5% faster overall (on HD7950), even though doubles run at only a quarter of the single-precision rate there (4:1). I will run a few tests overnight and then probably send out 0.15pre2 for testing - it should at least fix the IntelHD issue.
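
To illustrate why a wider mantissa needs fewer division steps, here is a toy model in Python. It is not the div_180_90 code from mfakto: the function names, the operands and the estimation scheme are assumptions chosen for illustration. Each step estimates the quotient from only the top bits of the remainder and divisor (24 bits to mimic float, 53 to mimic double), subtracts, and repeats. Every step can only contribute about as many quotient bits as the mantissa holds, so a quotient of roughly 90 bits takes noticeably more passes with the narrow mantissa.

```python
def top_bits(x, bits):
    """Return (m, s) with m = x >> s holding at most `bits` bits (a toy mantissa)."""
    s = max(x.bit_length() - bits, 0)
    return x >> s, s


def divide_with_estimates(n, d, mantissa_bits):
    """Return (quotient, remainder, steps) for n // d using truncated estimates."""
    q, r, steps = 0, n, 0
    d_top, d_shift = top_bits(d, mantissa_bits)
    while r >= d:
        r_top, r_shift = top_bits(r, mantissa_bits)
        # Truncated numerator over rounded-up divisor: never exceeds r // d.
        est = (r_top << (r_shift - d_shift)) // (d_top + 1)
        est = max(est, 1)  # r >= d here, so subtracting d once is always safe
        q += est
        r -= est * d
        steps += 1
    return q, r, steps


# Hypothetical operands: a 180-bit numerator and a 90-bit divisor.
n = (1 << 179) + 987654321
d = (1 << 89) + 12345
for bits in (24, 53):
    q, r, steps = divide_with_estimates(n, d, bits)
    assert q == n // d and r == n % d
    print(bits, "mantissa bits:", steps, "steps")
```

The exact step counts of this toy model need not match the five and two steps of the real kernel, since the real code fixes the number of steps and handles the final correction differently. The trend is the point.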

I'll then test whether the smaller kernels would also benefit from using doubles, and what the performance looks like on mid-range and lower-end GPUs, where doubles run at just 1/16 of the single-precision rate (16:1).

Well, maybe after my vacation :motorhome:

Bdot 2014-08-13 20:07

mfakto-0.15pre2 ready for testing
 
Dear mfakto-testers,

I have now put the Windows 64-bit version of mfakto-0.15pre2 on the [URL="http://mersenneforum.org/mfakto/mfakto-0.15pre2/mfakto-0.15pre2.zip"]ftp[/URL]. I'd appreciate it if you could test it on the various systems you have access to:

[LIST]
[*]Does mfakto detect the devices automatically, or are switches (like -d 11) required?
[*]Does it correctly identify the devices and their device type?
[*]Is 'mfakto -st' reporting success (on fast systems, or when you have lots of time, 'mfakto -st[B]2[/B]')? If testing takes too long, you can always interrupt it by pressing 'q' or Ctrl-C.
[*]Use a normal trial factoring task and try to optimize the ini-file settings: try VectorSize=2 and =4 (1, 3, 8 and 16 are possible as well) to see which is faster, then use the +/-, s/S, p/P keys to get the best possible GHz-days. What was the TF job, and which settings (VectorSize, SievePrimes, SieveSize, SieveProcessSize) gave the best performance for the specific device?
[*]Any problems/suggestions?
[/LIST]
Additional performance-testing:

As the new division algorithm is based on double precision, I'd need to get performance results from as many different devices as possible:
[LIST]
[*]Modify the ini-file to use the best VectorSize (see above)
[*]Switch to CPU sieving: SieveOnGPU=0
[*]Make sure CPU and GPU are idle
[*]Run "mfakto-pi.exe -st > st-pi.log"
[*]Keep it running for one or two minutes, then press q (or Ctrl-C)
[*]Have a look at st-pi.log: is the detected clock speed correct? (It rarely is on AMD - please let me know the correct one.)
[*]Send me the log
[/LIST]
Thanks a lot for any help you can provide - even if the complete checklist is too long for you: any partial result is also appreciated.
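
If you want to script the ini-file changes from the checklists above rather than editing by hand, something along these lines might help. It is only a sketch: it assumes the ini-file consists of simple key=value lines, the helper name set_ini_value is made up, and the sample text is not a real mfakto ini-file. It works on a string, so reading and writing the actual file is left to you.

```python
import re


def set_ini_value(ini_text, key, value):
    """Return ini_text with `key=value` set, appending the line if the key is absent."""
    pattern = re.compile(rf"^[ \t]*{re.escape(key)}[ \t]*=.*$", re.MULTILINE)
    line = f"{key}={value}"
    if pattern.search(ini_text):
        # Replace only the first active (uncommented) occurrence of the key.
        return pattern.sub(lambda m: line, ini_text, count=1)
    if ini_text and not ini_text.endswith("\n"):
        ini_text += "\n"
    return ini_text + line + "\n"


# Hypothetical sample content, for illustration only.
sample = "SieveOnGPU=1\nVectorSize=2\n"

# Switch to CPU sieving, then produce one variant per VectorSize to try.
cpu_sieve = set_ini_value(sample, "SieveOnGPU", 0)
variants = {vs: set_ini_value(cpu_sieve, "VectorSize", vs) for vs in (2, 4)}
print(variants[4])
```

Each variant can then be written out and timed with the same trial factoring job, so that the GHz-days figures are comparable.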


All times are UTC.