mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfakto: an OpenCL program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=15646)

Bdot 2011-08-19 13:44

[QUOTE=KingKurly;269472]
I found that if I plug in a monitor and keyboard to that computer and then login to the computer locally, the video card is found and can be used just fine. It would be a bit of a burden to have to always login locally, but I guess I can do that until a better solution is determined.
[/quote]

Well, it appears the dependency on a running X server has not been dropped yet (or some additional work is needed). You have to be logged in locally so that the X server is running, but you can then lock the screen and run mfakto remotely.

I'll check if we can get rid of that.

[QUOTE=KingKurly;269472]
That said, I do have a new problem to report:
ERROR: THREADS_PER_BLOCK (256) > deviceinfo.maxThreadsPerBlock
[/quote]

I'll check what implications that has and whether we can drop this check altogether, since OpenCL calculates the thread counts a little differently.

[QUOTE=KingKurly;269472]
it fails selftest 1-5 and 9-11. See below:
[/quote]

Now that is odd! The 72-bit kernel fails, but the vectored versions of the same kernel succeed. I just compared the kernels: there are no code differences.
Plus, I can reproduce it now on my Linux box: I still had LD_LIBRARY_PATH pointing to 2.4, and that runs fine. When I point it to 2.5, the problem appears. Looks like an AMD APP issue; I'll check what I can do about it. Running 2.5 on the CPU also works fine ...

I already wanted to drop the single kernel because it is so much slower ...

As you built your own binary anyway, go to mfaktc.c and comment out line 487 (this removes the _71BIT_MUL24 kernel). Don't submit results from that build to PrimeNet, just use it to check what your GPU can do :smile:. You can then run the full selftest (-st) if you want the GPU to work for a while; there you will also see the speed of the different kernels on different tasks.

KingKurly 2011-08-20 03:00

I rebuilt the program with the change you recommended. All the tests pass, including the large selftest. The card seems to do about 5-10M/s in the "lower" ranges (like below 75M) and is about 10% of that in the 332M+ range.

[CODE]
Selftest statistics
number of tests 3637
successfull tests 3637

selftest PASSED!
[/CODE]

I look forward to future versions, and I will not use the program to submit any "no factor" results until you have indicated that it is safe for me to do so. If I happen to find factors, I might submit those, but I do not expect to use the program for much production work until things have stabilized a bit more.

Thanks again! :smile:

KingKurly 2011-08-20 17:18

The very first test I ran saved me an LL test, and of course saved someone else the LL-D down the road.

[CODE]
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
3760/4620 | 159.38M | 16.721s | 9.53M/s | 50000 | 49m53s | 90889us
3765/4620 | 159.38M | 16.696s | 9.55M/s | 50000 | 49m32s | 90729us
Result[00]: M40660811 has a factor: 490782599517282826471
found 1 factor(s) for M40660811 from 2^68 to 2^69 (partially tested) [mfakto 0.07 mfakto_cl_barrett79]
tf(): total time spent: 3h 53m 44.575s
[/CODE]

I had the exponent queued up for a first-time LL test, but I've since removed it from my worktodo because it's not necessary!

Bdot 2011-08-21 14:49

[QUOTE=KingKurly;269623]The very first test I ran saved me an LL test, and of course saved someone else the LL-D down the road.

[CODE]
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
3760/4620 | 159.38M | 16.721s | 9.53M/s | 50000 | 49m53s | 90889us
3765/4620 | 159.38M | 16.696s | 9.55M/s | 50000 | 49m32s | 90729us
Result[00]: M40660811 has a factor: 490782599517282826471
found 1 factor(s) for M40660811 from 2^68 to 2^69 (partially tested) [mfakto 0.07 mfakto_cl_barrett79]
tf(): total time spent: 3h 53m 44.575s
[/CODE]I had the exponent queued up for a first-time LL test, but I've since removed it from my worktodo because it's not necessary![/QUOTE]

What a start! While I have already found a lot of factors with mfakto, almost all of them were known before :smile:.

BTW, at the expense of a little more CPU time you can speed up the tests a little: set SievePrimes to 200000 and the siever will eliminate some more candidates, so the GPU will not have to test them. What is mfakto's CPU load right now, and what is it with SievePrimes at 200k?
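For anyone else wanting to try this: the setting lives in mfakto's ini file (the filename mfakto.ini is an assumption here; check the file shipped with your build). Demonstrated on a throwaway copy so the command is safe to experiment with:

```shell
# Work on a scratch copy; point sed at your real ini file instead.
printf 'SievePrimes=50000\n' > /tmp/mfakto.ini
sed -i 's/^SievePrimes=.*/SievePrimes=200000/' /tmp/mfakto.ini
cat /tmp/mfakto.ini
```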

9.5 M/s is also not bad for an entry-level GPU - I guess it is at least twice as fast as one of your CPU cores.

Congrats also on the successful selftest. The speed of the tests does not depend much on the size of the exponent, but mainly on the kernel being used. The selftest runs each test with all kernels that can handle the required factor size. If you still have the selftest output, you should see that mfakto_cl_barrett79 is always close to 10 M/s, most others a bit below that, and mfakto_cl_95 slowly crawling along.

Bdot 2011-08-24 20:55

Did anyone else give mfakto a try? Any experiences to share (anything strange happening, suggestions you'd like to see included in or excluded from the next versions, performance figures for other GPUs, ...)?

I'm running this version on a SuSE 11.4 box with AMD APP SDK 2.4, and when multiple instances are running I occasionally see one of them hang. It completely occupies one CPU core but no GPU resources. It is looping inside some kernel code, immune to kill, kill -9, and attempts to attach a debugger or gcore. So far, a reboot is the only way I know to get rid of it. How can I find out where that hang occurs? And what else could I try in order to kill such a process without a reboot?
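A process that shrugs off kill -9 is usually stuck in uninterruptible sleep (state D) inside the kernel, often in the GPU driver. As a starting point for the "where does it hang" question, /proc can show the state and the kernel function the process is waiting in. A sketch ($$ is just the current shell, standing in for the hung mfakto's PID; reading /proc/<pid>/stack needs root):

```shell
# Show process state (look for "D") and the kernel wait channel.
# Substitute the hung process's PID for $$.
ps -o pid,stat,wchan:25,comm -p $$
cat "/proc/$$/wchan"; echo
# With root privileges, the full kernel stack is in /proc/<pid>/stack
```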

apsen 2011-08-25 17:29

[QUOTE=Bdot;270043]Did anyone else give mfakto a try?[/QUOTE]

I had the same experience as another poster: I had to recompile to reduce the number of threads per block and to disable one kernel. Apart from that, AMD APP refused to install on Win2008, so I had to swap the graphics cards between two machines so that the AMD one would be on Windows 7. The performance is about 20% of what I get out of a GeForce 8800 GTS (around 6 M/s compared to 29 M/s). I haven't played with the sieve parameter much - I just had to disable the auto-adjustment, as it raises the setting to the limit, slowing the testing to a crawl. If I lower it to below the default, I will probably get better overall performance.

Bdot 2011-08-25 20:19

[QUOTE=apsen;270091]The performance is about 20% of what I get out of GeForce 8800 GTS (around 6 M/s comparing to 29 M/s). I haven't played with sieve parameter much - just had to disable auto adjust as it will raise the setting to the limit slowing the testing to a crawl. If I'll lower it to below default I would probably get better overall performance.[/QUOTE]

That is a bit slower than I had expected. Which kernel and bit level was that? If raising SievePrimes slows down the tests, then the tests are CPU-limited and the GPU is not running at full load. If you want, build a test binary with CL_PERFORMANCE_INFO defined (params.h) - this will tell you the memory transfer rate and the pure kernel speed, without accounting for the siever.

According to [URL="http://www.hwcompare.com/3268/geforce-8800-gts-g80-320mb-vs-radeon-hd-4550-256mb/"]hwcompare[/URL], the 8800 GTS should be 3-4 times faster, so 8-10 M/s would be expected if OpenCL and my port were as efficient as Oliver's CUDA implementation.

Chaichontat 2011-08-26 11:55

Hi, I'm running mfakto on my HD6950 @ 912MHz, Catalyst 11.8, SDK 2.5. One thing I have seen is that it uses approx. 30 percent of my GPU and gives about 50M/s. Does anyone know how to make it use the GPU fully?
Thanks.

apsen 2011-08-26 14:30

[QUOTE=Bdot;270107]That is a bit slower than I had expected. Which kernel and bit level was that? If raising SievePrimes slows down the tests, then the tests are CPU-limited and the GPU is not running at full load. If you want, build a test binary with CL_PERFORMANCE_INFO defined (params.h) - this will tell you the memory transfer rate and the pure kernel speed, without accounting for the siever.

According to [URL="http://www.hwcompare.com/3268/geforce-8800-gts-g80-320mb-vs-radeon-hd-4550-256mb/"]hwcompare[/URL], the 8800 GTS should be 3-4 times faster, so 8-10 M/s would be expected if OpenCL and my port were as efficient as Oliver's CUDA implementation.[/QUOTE]

I did some more testing, and it looks like the problem is getting enough CPU. When I run it alone, I get about 7.3 M/s and the CPU usage is 50-56% (!) on a two-core machine. When I start Prime95, the CPU usage drops to about 10% on average and I get about 6.5 M/s, even though Prime95 runs at default priority and mfakto at normal. Also, the average wait is always in the teens of milliseconds (12000-15000 microseconds); it is lower without Prime95 running.

MrHappy 2011-08-27 12:44

I get ~28M/s on my HD5670 / Phenom II X4 925, with 2 cores on P-1 tests, 1 core on LL-D, and 1 core busy with video editing. I'll look again when the video job is done.

Christenson 2011-08-28 14:18

[QUOTE=Chaichontat;270138]Hi, I'm running mfakto on my HD6950 @ 912MHz, Catalyst 11.8, SDK 2.5. One thing I have seen is that it uses approx. 30 percent of my GPU and gives about 50M/s. Does anyone know how to make it use the GPU fully?
Thanks.[/QUOTE]

At the current stage of development, mfaktc/mfakto sieves the factor candidates on the CPU side before passing the survivors to the GPU for testing. Make sure that SievePrimes on your machine has gone down to 10,000. Beyond that, at the moment, you have to throw more CPU at it, in the form of running a second copy on a different core.

50M/s is a bit better than my GTX440 does under mfaktc, incidentally.

Setting up both mfaktc and mfakto to sieve on the GPU is at least a dream for the developers.

