mersenneforum.org  

2014-07-27, 11:46   #1167
Bdot
Nov 2010
Germany
3×199 Posts

Quote:
Originally Posted by kracker View Post
I'll do/try that.

On another note...
Code:
Selftest statistics                                    
  number of tests           287351
  successful tests          287350
  no factor found           1

selftest FAILED!

ERROR: selftest failed, exiting.
Oh, that's sad. Was that the IntelHD? And again it was a really good choice to add so many test cases ...

It must be a different rounding that leads to a higher-than-expected error. I cannot reproduce this error on my H/W. Can you reproduce it?

The factor is not particularly close to the limit of this kernel, the exponent does not have a long run of ones in its binary form ... I don't see why this test should fail.

I'll provide you with a special test version so that we can analyze this failure.

2014-07-28, 01:12   #1168
kracker
"Mr. Meeseeks"
Jan 2012
California, USA
2³×271 Posts

Yep, just ran part of -st2 again on my HD4600. Same failure.

2014-07-29, 12:05   #1169
Rodrigo
Jun 2010
Pennsylvania
2·467 Posts

Quote:
Originally Posted by LaurV View Post
If you want to squeeze more performance from mfakto, try to factor lower than 73 bits only. Contrary to mfaktc, where there is no big drop in performance for higher bit levels (or, say, no big gain in performance for lower bit levels), for mfakto, especially on higher GCN cards, the "shorter" kernels are much faster. For example, I get from my HD7970 GHz Edition something like 450 GHzD/D when factoring 6xM to 74, but I get 500 GHzD/D when factoring to 73 only, and so on. Decreasing the bit level increases the "gain" (but helps GIMPS less), and decreasing the exponent also increases the "gain", but only a little. For example, the same card I described above gives 630-650 GHzD/D when factoring 4xM exponents to 69 bits. Right now, Chris made them unavailable from GPU72, to channel the workers toward Cat4 exponents, but there are still 35 thousand of them (44-47M) at 68 bits; you can take them to 69 directly from PrimeNet, or ask Chris to make them available. For this range of expos and bit level, the performance of the card (kernel) is about 50% higher.
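For anyone who wants to compare the throughput figures in the quote, here is a minimal sketch of the arithmetic. Taking the 450 GHzD/D figure (6xM to 74 bits) as the baseline is an assumption on my part, and `gain_percent` is just a helper name made up for this example.

```python
def gain_percent(baseline, value):
    """Relative throughput gain of `value` over `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# GHz-days/day figures quoted above for the HD7970 GHz Edition.
BASELINE_74_BITS = 450.0      # 6xM exponents, factoring to 74 bits (assumed baseline)
TO_73_BITS = 500.0            # 6xM exponents, factoring to 73 bits
TO_69_BITS = (630.0, 650.0)   # 4xM exponents, factoring to 69 bits

if __name__ == "__main__":
    print(f"6xM to 73 bits: +{gain_percent(BASELINE_74_BITS, TO_73_BITS):.1f}%")
    for rate in TO_69_BITS:
        gain = gain_percent(BASELINE_74_BITS, rate)
        print(f"4xM to 69 bits at {rate:.0f} GHzD/D: +{gain:.1f}%")
```

Against that baseline the 73-bit work comes out about 11% faster and the 4xM/69-bit work about 40-44% faster, which is in the neighborhood of the "about 50%" mentioned in the quote.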
How does one go about getting these kinds of assignments? I just went into PrimeNet to request manual assignments in that range using the specified range option, and the server took forever to respond, finally issuing a timeout error.

Can I enter, say "45000000" to "47000000", or do I have to enter specific starting/ending exponents?

(I was able to get manual assignments in the normal manner if no range was indicated.)

Rodrigo

2014-07-29, 13:29   #1170
LaurV
Romulan Interpreter
Jun 2011
Thailand
3×3,221 Posts

Quote:
Originally Posted by Rodrigo View Post
How does one go about getting these kinds of assignments?
GPU72

2014-07-29, 16:26   #1171
chalsall
If I May
"Chris Halsall"
Sep 2002
Barbados
10011000110110₂ Posts

Quote:
Originally Posted by Rodrigo View Post
Can I enter, say "45000000" to "47000000", or do I have to enter specific starting/ending exponents?
I can't speak to how one might get such assignments from Primenet using the manual assignment page there, since GPU72 uses different techniques to reserve candidates for TF'ing beyond the nominal CPU "bit levels".

But such work is easily available from GPU72, both low DCTF'ing and low LLTF'ing. Please note that LLTF'ing is the most needed at the moment, and the deeper the better. But those whose cards are better suited to lower bit levels are encouraged to work there.

2014-07-29, 23:20   #1172
Rodrigo
Jun 2010
Pennsylvania
3A6₁₆ Posts

Thanks, LaurV and Chris.

I guess I got sidetracked by this part:

Quote:
but there are still 35 thousand of them (44-47M) at 68 bits, you can take them to 69 directly from PrimeNet
It's time to get around to investigating GPU72.

Rodrigo

2014-08-03, 16:35   #1173
potonono
Jun 2005
USA, IL
193 Posts

Quote:
Originally Posted by kracker View Post
Yep, just ran part of -st2 again on my HD4600. Same failure.
I get the same failure on my HD2500.

Quote:
######### testcase 2584/32927 (M59000521[82-83]) #########
Starting trial factoring M59000521 from 2^82 to 2^83 (16600.99GHz-days)
Using GPU kernel "cl_barrett15_83_gs_4"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Aug 03 08:00 | 3828 0.1% | 4.565 n.a. | n.a. 81205 0.00%
no factor for M59000521 from 2^82 to 2^83 [mfakto 0.15pre1-Win cl_barrett15_83_gs_4]
ERROR: selftest failed for M59000521 (cl_barrett15_83_gs)
no factor found
tf(): total time spent: 4.565s
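For readers wondering what such a test case checks: a selftest trial-factors an exponent over a bit range that is known to contain a factor, and fails with "no factor found" if the kernel misses it. The following is only a minimal sketch of that idea on small textbook examples, not mfakto code; `is_factor` and `trial_factor` are names made up for this illustration, and the known factor of M59000521 is not reproduced here.

```python
def is_factor(p, q):
    """True if q divides the Mersenne number 2**p - 1."""
    return pow(2, p, q) == 1


def trial_factor(p, bit_lo, bit_hi):
    """Return the smallest factor q of 2**p - 1 with 2**bit_lo <= q < 2**bit_hi.

    Returns None if the range holds no factor. For an odd prime p, every
    factor has the form q = 2*k*p + 1 with q % 8 equal to 1 or 7, so only
    those candidates are tested.
    """
    lo, hi = 1 << bit_lo, 1 << bit_hi
    k = max(1, (lo - 1 + 2 * p - 1) // (2 * p))  # first k with 2*k*p + 1 >= lo
    while True:
        q = 2 * k * p + 1
        if q >= hi:
            return None
        if q % 8 in (1, 7) and is_factor(p, q):
            return q
        k += 1


if __name__ == "__main__":
    # M11 = 2047 = 23 * 89, so a "selftest" of M11 from 2^1 to 2^5 must find 23.
    print(trial_factor(11, 1, 5))
    # No factor of M11 lies between 2^5 and 2^6, so this range yields None.
    print(trial_factor(11, 5, 6))
```

A real kernel does the modular exponentiation with multi-word arithmetic on the GPU, which is where a rounding difference can make a true factor look like a non-factor.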

2014-08-05, 20:56   #1174
Bdot
Nov 2010
Germany
1125₈ Posts

Quote:
Originally Posted by potonono View Post
I get the same failure on my HD2500.
Thanks for the confirmation.

I'm troubleshooting this with kracker, but I don't have enough time to make good progress on it. The next step would be to build a version where tracing shows exactly where the calculations go wrong ...

2014-08-11, 13:43   #1175
Jayder
Dec 2012
2×139 Posts

Quote:
Originally Posted by Bdot View Post
I have released mfakto 0.14.

As most people seem to use the GPU sieve, I no longer created the versions with different (CPU-) sieve sizes. If anyone still needs them, just let me know.
I know it's been a while since release, but if you can be bothered to, would you mind making a 64kB version if not also (optionally) a -var version? The GPU sieve is nice, but I think I am willing to switch back as the CPU sieve results in almost twice the speed on my APU. The standard 36kB sieve size limit is also quite a bit slower than 64kB.

Feel free to say no or to put it at the end of your to-do list. I can stick with the GPU sieve for a while longer. I seem to be the only one wanting it, and I don't expect you to go out of your way or anything.

Last fiddled with by Jayder on 2014-08-11 at 13:44

2014-08-13, 00:50   #1176
Bdot
Nov 2010
Germany
255₁₆ Posts

Quote:
Originally Posted by Jayder View Post
I know it's been a while since release, but if you can be bothered to, would you mind making a 64kB version if not also (optionally) a -var version? The GPU sieve is nice, but I think I am willing to switch back as the CPU sieve results in almost twice the speed on my APU. The standard 36kB sieve size limit is also quite a bit slower than 64kB.

Feel free to say no or to put it at the end of your to-do list. I can stick with the GPU sieve for a while longer. I seem to be the only one wanting it, and I don't expect you to go out of your way or anything.
No problem, I think I can build them within the next few days.


Just a quick update about the HD4600/2500 selftest failure:
I analyzed kracker's data and the code. The reason is that the HD4600 has a slightly different rounding behavior.

I noticed that on AMD devices, too, the code walks dangerously close to the limit of the available precision. Even though all tests succeed, the warning lights that Oliver once built into the code (CHECKS_MODBASECASE) do light up in the 15_82/15_83 and 15_88 kernels.

To fix that, I finally made a long-overdue attempt: basing the initial division on double instead of float. This allows doing the div_180_90 in two steps instead of five, with only one big multiplication in between instead of four.

Result: no more CHECKS_MODBASECASE issues (plenty of safety bits), and 1.5% faster overall (on HD7950), even though doubles are processed at just a quarter of the float rate (4:1). I will run a few tests overnight and then probably send out 0.15pre2 for testing. It should at least fix the IntelHD issue.
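To illustrate why a wider mantissa means fewer division steps, here is a rough sketch. It is not the div_180_90 code from mfakto: it emulates limited-precision quotient estimates with Python integers, and `approx_divide` and the test values are made up for this example. The mantissa widths 24 and 53 are those of IEEE 754 float and double.

```python
def approx_divide(n, d, mantissa_bits):
    """Compute n // d using quotient estimates limited to `mantissa_bits` bits.

    Returns (quotient, number_of_estimate_steps). Every estimate is rounded
    down, so the running remainder never goes negative.
    """
    shift = d.bit_length() + mantissa_bits
    recip = (1 << shift) // d  # truncated reciprocal of d, scaled by 2**shift
    quotient, remainder, steps = 0, n, 0
    while remainder >= d:
        # Keep only the leading `mantissa_bits` bits of the remainder.
        excess = max(remainder.bit_length() - mantissa_bits, 0)
        top = remainder >> excess
        # estimate = top * 2**excess * recip / 2**shift, rounded down
        product = top * recip
        k = excess - shift
        estimate = product << k if k >= 0 else product >> -k
        estimate = max(estimate, 1)  # remainder >= d, so one more d always fits
        quotient += estimate
        remainder -= estimate * d
        steps += 1
    return quotient, steps


if __name__ == "__main__":
    n = (1 << 179) + 12345678901234567890  # arbitrary 180-bit dividend
    d = (1 << 89) + 987654321              # arbitrary 90-bit divisor
    for name, bits in (("float", 24), ("double", 53)):
        q, steps = approx_divide(n, d, bits)
        print(f"{name:6s}: {steps} steps, correct = {q == n // d}")
```

Each step can only pin down roughly as many quotient bits as the mantissa holds, so the 53-bit variant needs fewer rounds (and fewer big multiply-subtract corrections) than the 24-bit one for the same 180-by-90-bit division.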

I'll then test whether the smaller kernels would also benefit from using doubles, and what the performance looks like on mid-range and lower-end GPUs, where doubles are processed at just one sixteenth of the float rate (16:1).

Well, maybe after my vacation

2014-08-13, 20:07   #1177
Bdot
Nov 2010
Germany
3×199 Posts
mfakto-0.15pre2 ready for testing

Dear mfakto-testers,

I have now put the Windows 64-bit version of mfakto-0.15pre2 on the FTP server. I'd appreciate it if you could test it on the various systems you have access to:

  • does mfakto detect the devices automatically or are switches (like -d 11) required
  • does it correctly identify the devices and their device type
  • is 'mfakto -st' reporting success (on fast systems, or when you have lots of time, 'mfakto -st2') - if testing is too long, you can always interrupt by pressing 'q' or Ctrl-C.
  • use a normal trial factoring task and try to optimize the ini-file settings: try VectorSize=2 and =4 (1, 3, 8 and 16 are possible as well) to see which is faster, then use the +/-, s/S, p/P keys to get the best possible GHz-days: what was the TF job, and which settings (VectorSize, SievePrimes, SieveSize, SieveProcessSize) gave the best performance for the specific device?
  • any problems/suggestions?

Additional performance-testing:

As the new division algorithm is based on double precision, I'd need to get performance results from as many different devices as possible:
  • Modify the ini-file to use the best VectorSize (see above)
  • Switch to CPU sieving: SieveOnGPU=0
  • make sure CPU and GPU are idle
  • run "mfakto-pi.exe -st > st-pi.log"
  • keep it running for one or two minutes, then press q (or Ctrl-C)
  • have a look at st-pi.log: is the detected clock speed correct? (It rarely is on AMD; please let me know the correct one.)
  • send me the log
Thanks a lot for any help you can provide - even if the complete checklist is too long for you: any partial result is also appreciated.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.