mersenneforum.org  

Old 2011-08-19, 13:44   #78
Bdot
Nov 2010
Germany
255₁₆ Posts

Quote:
Originally Posted by KingKurly View Post
I found that if I plug in a monitor and keyboard to that computer and then login to the computer locally, the video card is found and can be used just fine. It would be a bit of a burden to have to always login locally, but I guess I can do that until a better solution is determined.
Well, it appears that the dependency on a running X server has not been dropped yet (or some additional work is necessary). You need to be logged in locally for the X server to start, but once it is running you can lock the screen and run mfakto remotely.

I'll check if we can get rid of that.

Quote:
Originally Posted by KingKurly View Post
That said, I do have a new problem to report:
ERROR: THREADS_PER_BLOCK (256) > deviceinfo.maxThreadsPerBlock
I'll check what implications that has and if we could drop this check altogether as OpenCL calculates the threads a little differently.

Quote:
Originally Posted by KingKurly View Post
it fails selftest 1-5 and 9-11. See below:
Now that is odd! The 72-bit kernel fails, but the vectorized versions of the same kernel succeed! I just compared the kernels, and there are no differences in the code.
Plus, I can now reproduce it on my Linux box: my LD_LIBRARY_PATH still pointed to 2.4, and that runs fine. When I point it to 2.5, the problem appears. This looks like an AMD APP issue; I'll check what I can do about it. Running 2.5 on the CPU also works fine ...

I already wanted to drop the single kernel because it is so much slower ...

As you built your own binary anyway, go to mfaktc.c and comment out line 487 (removing the _71BIT_MUL24 kernel). Don't submit results from that build to PrimeNet; just use it to check what your GPU can do. You can then run the full selftest (-st) if you want the GPU to work for a while. There you will also see the speed of the different kernels for different tasks.

Old 2011-08-20, 03:00   #79
KingKurly
Sep 2010
Annapolis, MD, USA
3³·7 Posts

I rebuilt the program with the change you recommended. All the tests pass, including the large selftest. The card seems to do about 5-10M/s in the "lower" ranges (like below 75M) and is about 10% of that in the 332M+ range.

Code:
Selftest statistics
  number of tests           3637
  successfull tests         3637

selftest PASSED!
I look forward to future versions, and I will not use the program to submit any "no factor" results until you have indicated that it is safe for me to do so. If I happen to find factors, I might submit those, but I do not expect to use the program for much production work until things have stabilized a bit more.

Thanks again!

Old 2011-08-20, 17:18   #80
KingKurly
Sep 2010
Annapolis, MD, USA
3³×7 Posts

The very first test I ran saved me an LL test, and of course saved someone else the LL-D down the road.

Code:
    class | candidates |    time | avg. rate | SievePrimes |    ETA | avg. wait
3760/4620 |    159.38M | 16.721s |   9.53M/s |       50000 | 49m53s |   90889us
3765/4620 |    159.38M | 16.696s |   9.55M/s |       50000 | 49m32s |   90729us
Result[00]: M40660811 has a factor: 490782599517282826471
found 1 factor(s) for M40660811 from 2^68 to 2^69 (partially tested) [mfakto 0.07 mfakto_cl_barrett79]
tf(): total time spent:  3h 53m 44.575s
I had the exponent queued up for a first-time LL test, but I've since removed it from my worktodo because it's not necessary!
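
For anyone who wants to double-check a reported factor independently of mfakto, here is a minimal Python sketch. It is not part of mfakto, and the function names are made up for illustration. It relies on the fact that f divides 2^p - 1 exactly when 2^p mod f equals 1, and it also checks the two standard necessary conditions for factors of a Mersenne number with odd prime exponent: f ≡ 1 (mod 2p) and f ≡ ±1 (mod 8).

```python
def is_mersenne_factor(p, f):
    """Return True if f divides M_p = 2**p - 1."""
    # f | 2**p - 1  <=>  2**p == 1 (mod f). Three-argument pow() does modular
    # exponentiation, so the 40-million-bit number M_p is never built.
    return pow(2, p, f) == 1


def has_mersenne_factor_form(p, f):
    """Necessary (not sufficient) conditions for a factor of M_p, p an odd prime."""
    return f % (2 * p) == 1 and f % 8 in (1, 7)


p = 40660811
f = 490782599517282826471  # factor reported in the output above

print("bit length:", f.bit_length())
print("has form 2kp+1 and is +-1 mod 8:", has_mersenne_factor_form(p, f))
print("divides M_p:", is_mersenne_factor(p, f))
```

The bit length comes out as 69, which matches the "from 2^68 to 2^69" range in the result line.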

Old 2011-08-21, 14:49   #81
Bdot
Nov 2010
Germany
3·199 Posts

Quote:
Originally Posted by KingKurly View Post
The very first test I ran saved me an LL test, and of course saved someone else the LL-D down the road.

Code:
    class | candidates |    time | avg. rate | SievePrimes |    ETA | avg. wait
3760/4620 |    159.38M | 16.721s |   9.53M/s |       50000 | 49m53s |   90889us
3765/4620 |    159.38M | 16.696s |   9.55M/s |       50000 | 49m32s |   90729us
Result[00]: M40660811 has a factor: 490782599517282826471
found 1 factor(s) for M40660811 from 2^68 to 2^69 (partially tested) [mfakto 0.07 mfakto_cl_barrett79]
tf(): total time spent:  3h 53m 44.575s
I had the exponent queued up for a first-time LL test, but I've since removed it from my worktodo because it's not necessary!
What a start! While I have already found a lot of factors with mfakto, almost all of them were already known.

BTW, at the expense of a little more CPU you can speed up the tests a little: Set SievePrimes to 200000 and the siever will eliminate some more candidates so the GPU will not test them. What's mfakto's CPU-load right now and with SievePrimes at 200k?
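
To illustrate why a larger SievePrimes means less GPU work, here is a rough Python sketch. It is not mfakto's siever (which is organized differently), and `first_odd_primes` and `count_survivors` are hypothetical helpers. It only uses two facts: factor candidates of M_p have the form q = 2kp + 1 with q ≡ ±1 (mod 8), and a candidate that is divisible by a small prime does not need to be tested.

```python
def first_odd_primes(count):
    """First `count` odd primes by trial division (slow, but fine for a demo)."""
    primes = []
    n = 3
    while len(primes) < count:
        if all(n % q for q in primes if q * q <= n):
            primes.append(n)
        n += 2
    return primes


def count_survivors(p, k_max, sieve_primes):
    """Count candidates q = 2*k*p + 1 (1 <= k <= k_max) that the sieve lets through."""
    survivors = 0
    for k in range(1, k_max + 1):
        q = 2 * k * p + 1
        if q % 8 not in (1, 7):
            continue  # factors of M_p are always +-1 mod 8
        if any(q % s == 0 for s in sieve_primes if s < q):
            continue  # q has a small prime divisor, so skip it
        survivors += 1
    return survivors


p = 40660811
k_max = 5000
for n_primes in (10, 100, 1000):
    left = count_survivors(p, k_max, first_odd_primes(n_primes))
    print(f"{n_primes:5d} sieve primes -> {left} of {k_max} values of k left for the GPU")
```

Each additional sieve prime s removes only about 1/s of the remaining candidates, so the gain per extra prime shrinks while the CPU cost keeps growing. That is why the best setting depends on how the CPU and GPU are balanced.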

9.5 M/s is also not bad for an entry-level GPU - I guess it is at least twice as fast as one of your CPU cores.
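
The average rate can be cross-checked from any status line, since it is just candidates divided by time. A small Python sketch, assuming the column layout shown in the output quoted above (`parse_status_line` is a hypothetical helper, not something mfakto provides):

```python
def parse_status_line(line):
    """Split one status line into (candidates in millions, seconds, reported M/s)."""
    cols = [c.strip() for c in line.split("|")]
    candidates_m = float(cols[1].rstrip("M"))
    seconds = float(cols[2].rstrip("s"))
    reported = float(cols[3].replace("M/s", ""))
    return candidates_m, seconds, reported


lines = [
    "3760/4620 |    159.38M | 16.721s |   9.53M/s |       50000 | 49m53s |   90889us",
    "3765/4620 |    159.38M | 16.696s |   9.55M/s |       50000 | 49m32s |   90729us",
]
for line in lines:
    cand, secs, reported = parse_status_line(line)
    print(f"computed {cand / secs:.2f} M/s, reported {reported:.2f} M/s")
```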

Congrats also on the successful selftest. The speed of the tests does not depend much on the size of the exponent but mainly on the kernel being used. The selftest will run each test with all kernels that can handle the required factor length. If you still have the output of the selftest, you should see that mfakto_cl_barrett79 is always close to 10 M/s, most others a bit below that, and mfakto_cl_95 slowly crawling along.

Old 2011-08-24, 20:55   #82
Bdot
Nov 2010
Germany
3·199 Posts

Did anyone else give mfakto a try? Any experiences to share (anything strange happening, suggestions you'd like to get included or excluded for the next versions, performance figures for other GPUs, ...)?

I'm running this version on a SuSE 11.4 box with AMD APP SDK 2.4, and when multiple instances are running I occasionally see one instance hanging. It will completely occupy one CPU core but no GPU resources. It is looping inside some kernel code, being immune to kill, kill -9 or attempts to attach a debugger or gcore. So far, reboot was the only way I know to get rid of it. How can I find out where that hang occurs? And what else could I try to kick such a process without a reboot?
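
One thing that might help narrow this down on Linux: a process that ignores kill -9 is usually stuck inside the kernel, and /proc can show its state without attaching a debugger. A minimal Python sketch, assuming a Linux /proc filesystem (`describe_process` is a made-up helper, and reading /proc/<pid>/stack normally needs root):

```python
import os


def describe_process(pid):
    """Collect status, wait channel and (if permitted) kernel stack for a PID on Linux."""
    base = f"/proc/{pid}"
    if not os.path.isdir(base):
        return None  # no such process, or no /proc on this system
    info = {}
    for name in ("status", "wchan", "stack"):
        try:
            with open(os.path.join(base, name)) as fh:
                info[name] = fh.read()
        except OSError as exc:  # e.g. PermissionError for "stack" without root
            info[name] = f"<unreadable: {exc.strerror}>"
    # Pull the one-line "State:" entry out of status, e.g. "R (running)".
    for line in info["status"].splitlines():
        if line.startswith("State:"):
            info["state"] = line.split(":", 1)[1].strip()
    return info


if __name__ == "__main__":
    me = describe_process(os.getpid())  # replace with the PID of the hung instance
    if me is not None:
        print("state:", me.get("state"))
        print("wchan:", me["wchan"] or "(none)")
        print("stack:", me["stack"].strip() or "(empty)")
```

A state of D (uninterruptible sleep), or a kernel stack that keeps showing the same driver functions, would suggest that the hang is in the GPU driver rather than in mfakto itself.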

Old 2011-08-25, 17:29   #83
apsen
Jun 2011
131 Posts

Quote:
Originally Posted by Bdot View Post
Did anyone else give mfakto a try?
I had the same experience as another poster: I had to recompile to reduce the number of threads per block and disable one kernel. Apart from that, AMD_APP refused to install on Win2008, so I had to swap the graphics cards between two machines so that the AMD one would be on Windows 7. The performance is about 20% of what I get out of a GeForce 8800 GTS (around 6 M/s compared to 29 M/s). I haven't played with the sieve parameter much; I just had to disable auto-adjust, as it would raise the setting to the limit, slowing the testing to a crawl. If I lower it below the default, I would probably get better overall performance.

Last fiddled with by apsen on 2011-08-25 at 17:31

Old 2011-08-25, 20:19   #84
Bdot
Nov 2010
Germany
3×199 Posts

Quote:
Originally Posted by apsen View Post
The performance is about 20% of what I get out of a GeForce 8800 GTS (around 6 M/s compared to 29 M/s). I haven't played with the sieve parameter much; I just had to disable auto-adjust, as it would raise the setting to the limit, slowing the testing to a crawl. If I lower it below the default, I would probably get better overall performance.
That is a bit slower than I had expected. Which kernel and bit level was that? If raising SievePrimes slows down the tests, then the tests are CPU-limited and the GPU is not running at full load. If you want, build a test binary with CL_PERFORMANCE_INFO defined (params.h); this will tell you the memory transfer rate and pure kernel speed, without accounting for the siever.

According to hwcompare, the 8800 GTS should be 3-4 times faster than your AMD card, so 8-10 M/s would be expected from it if OpenCL and my port were as efficient as Oliver's CUDA implementation.

Last fiddled with by Bdot on 2011-08-25 at 20:29 Reason: added hwcompare

Old 2011-08-26, 11:55   #85
Chaichontat
Aug 2011
216 Posts

Hi, I'm running mfakto on my HD6950 @912MHz with Catalyst 11.8 and SDK 2.5. One thing I have noticed is that it uses only about 30 percent of the GPU and gives about 50M/s. Does anyone know how to make it fully use the GPU?
Thanks.

Old 2011-08-26, 14:30   #86
apsen
Jun 2011
131 Posts

Quote:
Originally Posted by Bdot View Post
That is a bit slower than I had expected. Which kernel and bit level was that? If raising SievePrimes slows down the tests, then the tests are CPU-limited and the GPU is not running at full load. If you want, build a test binary with CL_PERFORMANCE_INFO defined (params.h); this will tell you the memory transfer rate and pure kernel speed, without accounting for the siever.

According to hwcompare, the 8800 GTS should be 3-4 times faster than your AMD card, so 8-10 M/s would be expected from it if OpenCL and my port were as efficient as Oliver's CUDA implementation.
I did some more testing, and it looks like the problem is getting enough CPU. When I run it alone I get about 7.3 M/s, and CPU usage is 50-56% (!) on a two-core machine. When I start prime95, the CPU usage drops to about 10% on average and I get about 6.5 M/s, even though prime95 runs at its default priority and mfakto at normal. Also, the average wait is always in the low teens of milliseconds (12000-15000 microseconds). It is lower without prime95 running.

Old 2011-08-27, 12:44   #87
MrHappy
Dec 2003
Paisley Park & Neverland
5×37 Posts

I get ~28M/s on my HD5670 / Phenom II 4 Core 925, with 2 cores on P-1 tests, 1 core on LL-D, and 1 core busy with video editing. I'll look again when the video job is done.

Old 2011-08-28, 14:18   #88
Christenson
Dec 2010
Monticello
70316 Posts

Quote:
Originally Posted by Chaichontat View Post
Hi, I'm running mfakto on my HD6950 @912MHz with Catalyst 11.8 and SDK 2.5. One thing I have noticed is that it uses only about 30 percent of the GPU and gives about 50M/s. Does anyone know how to make it fully use the GPU?
Thanks.
At the current stage of development, mfaktc/mfakto sieve the factor candidates on the CPU, removing those with small prime divisors, before passing the survivors to the GPU for checking. Make sure that SievePrimes on your machine has gone down to 10,000. Beyond that, at the moment, you have to throw more CPU at it by running a second copy of the program on a different core.
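
As a back-of-the-envelope illustration of that advice: if the GPU sits at about 30 percent while one instance delivers 50M/s, and if throughput scales roughly linearly with the number of CPU sievers (an assumption, not a measurement), the number of instances needed can be estimated as below. `instances_to_saturate` is a hypothetical helper.

```python
import math


def instances_to_saturate(gpu_utilization):
    """Estimate how many equally fast CPU sieving instances would fill the GPU.

    Assumes each instance adds the same GPU load as the first one, which ignores
    scheduling overhead and contention between instances.
    """
    if not 0 < gpu_utilization <= 1:
        raise ValueError("gpu_utilization must be in (0, 1]")
    return math.ceil(1 / gpu_utilization)


rate_one_instance = 50.0  # M/s, as reported for the HD6950
utilization = 0.30        # about 30 percent GPU load with one instance

n = instances_to_saturate(utilization)
ceiling = rate_one_instance / utilization
print(f"about {n} instances; rough upper bound {ceiling:.0f} M/s")
```

In practice the scaling will be worse than linear, so treat the result as an optimistic estimate.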

50M/s is doing a bit better than my GTX440 under mfaktc, incidentally.

Setting up both mfaktc and mfakto to sieve on the GPU is at least a dream for the developers.

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.