mersenneforum.org  

mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Old 2012-04-30, 12:35   #430
Bdot
 
Bdot's Avatar
 
Nov 2010
Germany

3·199 Posts

Quote:
Originally Posted by Dubslow View Post


...One (not-so-)small request. With the multi-threading (sort of) and now this, would you be willing to "backport" your changes/additions into mfaktc?

From a user's standpoint (i.e. helping people, and for those with both nVidia and AMD cards), it's optimal if mfaktc and mfakto are as similar as possible (TF algos aside), and it's clear you have more time (or desire/drive/whatever) for developing the extra non-math goodies than TheJudger.

Thanks
I'd be happy if TheJudger decided to use some of my code for mfaktc, after I took so much of his code to make mfakto. However, it remains his decision whether he wants any of it in.

When I started building the OpenCL stuff, I screwed up my CUDA dev environment and never really spent the effort to fix it. But anyone who has ever built mfaktc and knows how to read code should be able to merge these changes. For quite some time now I've been checking my code in regularly at https://github.com/Bdot42/mfakto, and it's still open source.
Have a look at https://github.com/Bdot42/mfakto/com...0885c16337d05a, for instance, to see the first check-in introducing the variable progress display - there's still some work needed, though ...
Bdot is offline
Old 2012-04-30, 13:02   #431
Dubslow
Basketry That Evening!
 
Dubslow's Avatar
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3×29×83 Posts

Quote:
Originally Posted by Bdot View Post
I'd be happy if TheJudger decided to use some of my code for mfaktc, after I took so much of his code to make mfakto. However, it remains his decision whether he wants any of it in.

When I started building the OpenCL stuff, I screwed up my CUDA dev environment and never really spent the effort to fix it. But anyone who has ever built mfaktc and knows how to read code should be able to merge these changes. For quite some time now I've been checking my code in regularly at https://github.com/Bdot42/mfakto, and it's still open source.
Have a look at https://github.com/Bdot42/mfakto/com...0885c16337d05a, for instance, to see the first check-in introducing the variable progress display - there's still some work needed, though ...
I'd love to do a merge -- there are two reasons I asked you:
1) I have very limited C experience, though merging already-written code should be good from an experience standpoint.
2) For the next two weeks I will have little time to spend on coding/merging -- but after that it's summer.

I guess that means that if no one else has done it in two weeks' time, I'll take a crack at it. (Some people know I already took a shot at merging some mfaktc code into CUDALucas, and I had planned on extending that.)
Dubslow is offline
Old 2012-05-02, 21:31   #432
KyleAskine
 
KyleAskine's Avatar
 
Oct 2011
Maryland

2·5·29 Posts
500M/s Sieving Cap??

I have around 600M/s worth of cards in my main PC (2x 6970), but I cannot seem to feed them candidates faster than about 480M/s, no matter how I arrange my instances of mfakto. I suspect I've hit a 'sieving cap'. The processor isn't the issue, as it's only at around 70% usage, nor are the GPUs themselves, which can each go to 99% load if I kill one of the mfakto instances feeding the other card.

What could be holding me back? Memory bandwidth? Something to do with caching? Something I'm not considering?
KyleAskine is offline
Old 2012-05-02, 22:47   #433
bcp19
 
bcp19's Avatar
 
Oct 2011

7×97 Posts

Quote:
Originally Posted by KyleAskine View Post
I have around 600M/s worth of cards in my main PC (2x 6970), but I cannot seem to feed them candidates faster than about 480M/s, no matter how I arrange my instances of mfakto. I suspect I've hit a 'sieving cap'. The processor isn't the issue, as it's only at around 70% usage, nor are the GPUs themselves, which can each go to 99% load if I kill one of the mfakto instances feeding the other card.

What could be holding me back? Memory bandwidth? Something to do with caching? Something I'm not considering?
Are you running Windows or Linux? An AMD or Intel chip? I've never run Linux (so I don't know whether it would change things), but I know that with a Win7/i5 2400 combo I could not max out my HD5770 even with 2 cores feeding it, and the 'active' window always ran faster. When I put the card in my 2500, I could hit 88% load if the mfakto window was active and 66% load if it was not (with a pretty high SievePrimes as well, or it would have had a 20% CPU wait at SP 5000). I've since gotten an AMD Phenom II X6 1055, on which I run 3 cores feeding a GTX 460 and 2 cores feeding the 5770. It too runs Win7, but the 2 cores keep it at 88% load regardless of which window is active. I think the problem boils down to OpenCL not being able to run as well as CUDA, unless Linux makes a difference.
bcp19 is offline
Old 2012-05-03, 01:31   #434
KyleAskine
 
KyleAskine's Avatar
 
Oct 2011
Maryland

442₈ Posts

Quote:
Originally Posted by bcp19 View Post
Are you running Windows or Linux? An AMD or Intel chip? I've never run Linux (so I don't know whether it would change things), but I know that with a Win7/i5 2400 combo I could not max out my HD5770 even with 2 cores feeding it, and the 'active' window always ran faster. When I put the card in my 2500, I could hit 88% load if the mfakto window was active and 66% load if it was not (with a pretty high SievePrimes as well, or it would have had a 20% CPU wait at SP 5000). I've since gotten an AMD Phenom II X6 1055, on which I run 3 cores feeding a GTX 460 and 2 cores feeding the 5770. It too runs Win7, but the 2 cores keep it at 88% load regardless of which window is active. I think the problem boils down to OpenCL not being able to run as well as CUDA, unless Linux makes a difference.
I think I was really hazy in my last post. Let me try to be a bit more precise with my language.

I am running four instances of mfakto on a windows 7 machine. Two for each of my two graphics cards.

No matter how I arrange my mfakto instances, I cannot sieve more than around 480-490 M/s across all of them combined. Like I said, I can max out either of my graphics cards (it takes around 300M/s of sieving to get 99% GPU load) by simply killing one of the processes feeding the other card. However, I need around 600M/s of sieving power to saturate both cards at the same time, and I cannot get that right now.

Processor is definitely not the bottleneck. Nor is raw GPU power.

I just want to know if there is a way for me to discover what is.
KyleAskine is offline
Old 2012-05-03, 11:00   #435
bcp19
 
bcp19's Avatar
 
Oct 2011

1247₈ Posts

Quote:
Originally Posted by KyleAskine View Post
I think I was really hazy in my last post. Let me try to be a bit more precise with my language.

I am running four instances of mfakto on a windows 7 machine. Two for each of my two graphics cards.

No matter how I arrange my mfakto instances, I cannot sieve more than around 480-490 M/s across all of them combined. Like I said, I can max out either of my graphics cards (it takes around 300M/s of sieving to get 99% GPU load) by simply killing one of the processes feeding the other card. However, I need around 600M/s of sieving power to saturate both cards at the same time, and I cannot get that right now.

Processor is definitely not the bottleneck. Nor is raw GPU power.

I just want to know if there is a way for me to discover what is.
When I was testing mfakto on one of my quads, I ran into a somewhat similar problem with what I refer to as 'diminishing returns'. mfakto, unlike mfaktc, seems to need a bit of computing power beyond the core it is given (a single mfakto instance reports 30% CPU usage overall), so when I used all 4 cores it would actually give me less throughput than using 3 cores and leaving the 4th idle. (I only tested this on a Core2Quad, which has its own quirks, but I would imagine an i5/i7 quad would show somewhat similar results.)
bcp19 is offline
Old 2012-05-03, 11:23   #436
Bdot
 
Bdot's Avatar
 
Nov 2010
Germany

3×199 Posts

Quote:
Originally Posted by KyleAskine View Post
I think I was really hazy in my last post. Let me try to be a bit more precise with my language.

I am running four instances of mfakto on a windows 7 machine. Two for each of my two graphics cards.

No matter how I arrange my mfakto instances, I cannot sieve more than around 480-490 M/s across all of them combined. Like I said, I can max out either of my graphics cards (it takes around 300M/s of sieving to get 99% GPU load) by simply killing one of the processes feeding the other card. However, I need around 600M/s of sieving power to saturate both cards at the same time, and I cannot get that right now.

Processor is definitely not the bottleneck. Nor is raw GPU power.

I just want to know if there is a way for me to discover what is.
A few ideas:
  • If you run two instances for one card, none for the other, and two threads of prime95, what sieving rate do you get out of the mfakto instances? If starting prime95 slows sieving down to ~60%, then it is probably just hyper-threads being assigned to the tasks. Windows reports an available hyper-thread as a free CPU, but if you start using it, you're actually stealing CPU time from its twin.
  • On an otherwise idle machine, run 4 instances of "mfakto --perftest" at once. What total sieving performance do you get? This test eliminates any GPU interaction (copy and process).
  • Take the variable-SieveSize binary that I sent you and try a SieveSizeLimit of 24k. If the L1 caches are the problem, this should be faster than 36k with the same binary. You can also use this binary for the --perftest above, in order to see what the best SieveSizeLimit would be. (Note this setting works in multiples of ~12k - mfakto will use the highest such multiple that does not exceed the value you specify.)
  • I've added a dummy kernel to my todo list that would just read each transferred byte but not do any TF with it, purely for better performance tests.
  • I've finished my tests with lower SievePrimes (as low as 256, though I'll probably not allow values below 1000). Together with the new GHz-days/day display you can easily see the effects of lowering SP that far, so I think I can publish it. In many cases you would sacrifice total throughput just to get higher GPU utilization - but your case may be different.
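The SieveSizeLimit rounding described in the third bullet can be sketched roughly like this (my own illustration, not mfakto's actual code - the post only says "~12k", so the exact 12 KiB granularity and the helper name are assumptions):

```python
# Toy sketch of the SieveSizeLimit rounding described above.
# ASSUMPTION: the granularity is exactly 12 KiB; the post only says "~12k",
# and this helper is illustrative, not mfakto's real implementation.
SIEVE_BLOCK = 12 * 1024

def effective_sieve_size(limit_bytes):
    """Largest multiple of SIEVE_BLOCK that does not exceed the limit."""
    if limit_bytes < SIEVE_BLOCK:
        return SIEVE_BLOCK  # assume at least one block is always used
    return (limit_bytes // SIEVE_BLOCK) * SIEVE_BLOCK

print(effective_sieve_size(24 * 1024))  # 24576 (two blocks)
print(effective_sieve_size(36 * 1024))  # 36864 (three blocks)
```

So a limit of, say, 30000 bytes would also fall back to two blocks (24576), which is why 24k and 36k are the natural settings to compare.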
Bdot is offline
Old 2012-05-03, 12:52   #437
KyleAskine
 
KyleAskine's Avatar
 
Oct 2011
Maryland

2·5·29 Posts

I will take a look tonight. Thanks for the suggestions!
KyleAskine is offline
Old 2012-05-03, 18:10   #438
KyleAskine
 
KyleAskine's Avatar
 
Oct 2011
Maryland

2×5×29 Posts

And just so you know, the reason I suspect it isn't processor related is that I moved SievePrimes around, and once I got below a certain point the processor load dropped but the M/s stayed constant.
KyleAskine is offline
Old 2012-05-06, 02:24   #439
KyleAskine
 
KyleAskine's Avatar
 
Oct 2011
Maryland

100100010₂ Posts

Quote:
Originally Posted by Bdot View Post
A few ideas:
  • If you run two instances for one card, none for the other, and two threads of prime95, what sieving rate do you get out of the mfakto instances? If starting prime95 slows sieving down to ~60%, then it is probably just hyper-threads being assigned to the tasks. Windows reports an available hyper-thread as a free CPU, but if you start using it, you're actually stealing CPU time from its twin.
  • On an otherwise idle machine, run 4 instances of "mfakto --perftest" at once. What total sieving performance do you get? This test eliminates any GPU interaction (copy and process).
  • Take the variable-SieveSize binary that I sent you and try a SieveSizeLimit of 24k. If the L1 caches are the problem, this should be faster than 36k with the same binary. You can also use this binary for the --perftest above, in order to see what the best SieveSizeLimit would be. (Note this setting works in multiples of ~12k - mfakto will use the highest such multiple that does not exceed the value you specify.)
  • I've added a dummy kernel to my todo list that would just read each transferred byte but not do any TF with it, purely for better performance tests.
  • I've finished my tests with lower SievePrimes (as low as 256, though I'll probably not allow values below 1000). Together with the new GHz-days/day display you can easily see the effects of lowering SP that far, so I think I can publish it. In many cases you would sacrifice total throughput just to get higher GPU utilization - but your case may be different.
I ran the stress test with two threads and two instances of mfakto hitting the same card. It ran at around 90% load, which is what the 'faster' of the two cards runs at when I run 4x mfakto.

The perftests look fine - they each run at a max of around 500M/s, even when I run 4 at the same time. So raw sieving isn't the issue.

mfakto runs at about the same speed with the 36k executable and with the variable one with 24 specified in the ini file.

So I continue to be stumped. If this were somehow processor-bound, lowering the sieving load should increase GPU load, but it just doesn't. When I run four instances at SievePrimes=5000, I get 50% processor usage and 170% GPU load across both of my graphics cards (adding them together).

When I run four instances at SievePrimes=25000, I get 80% processor usage and 170% GPU load across both cards.

I just don't know why I can't push my combined GPU usage close to 200%, since each card can easily reach 99% individually.
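As a back-of-envelope check (my own arithmetic, not from the thread, and it assumes GPU load scales linearly with the candidate supply per card), a ~490 M/s sieving cap split over two cards predicts a combined load in the same neighborhood as what is observed:

```python
# Back-of-envelope: does a ~490 M/s sieving cap explain ~170% combined GPU load?
# ASSUMPTION: GPU load scales linearly with the candidate supply per card.
total_sieve_rate = 490.0  # M/s, the observed cap across all instances
need_per_card = 300.0     # M/s of sieving needed to hold one card at ~99%
max_load = 0.99

per_card_supply = total_sieve_rate / 2  # candidates split evenly over two cards
per_card_load = min(max_load, per_card_supply / need_per_card * max_load)
combined_load = 2 * per_card_load
print(f"{combined_load:.0%}")  # ~162%, in the same ballpark as the observed 170%
```

The rough agreement with the observed ~170% is at least consistent with a shared cap on candidate production rather than a per-card limit.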
KyleAskine is offline
Old 2012-05-06, 21:01   #440
Bdot
 
Bdot's Avatar
 
Nov 2010
Germany

597₁₀ Posts

Quote:
Originally Posted by KyleAskine View Post
I ran the stress test with two threads and two instances of mfakto hitting the same card. It ran at around 90% load, which is what the 'faster' of the two cards runs at when I run 4x mfakto.

The perftests look fine - they each run at a max of around 500M/s, even when I run 4 at the same time. So raw sieving isn't the issue.

mfakto runs at about the same speed with the 36k executable and with the variable one with 24 specified in the ini file.

So I continue to be stumped. If this were somehow processor-bound, lowering the sieving load should increase GPU load, but it just doesn't. When I run four instances at SievePrimes=5000, I get 50% processor usage and 170% GPU load across both of my graphics cards (adding them together).

When I run four instances at SievePrimes=25000, I get 80% processor usage and 170% GPU load across both cards.

I just don't know why I can't push my combined GPU usage close to 200%, since each card can easily reach 99% individually.
The tests point somewhat towards memory, if they point to anything at all. It's not the CPU - I fully agree. And the GPUs also have some headroom.

If 2x mfakto can bring one GPU to 99%, but adding the prime95 stress test lowers the GPU load to 90%, then we already see that there is some influence. In this case, it can only be the memory system, including the caches.

Delivering 500M candidates per second to the GPUs also means transferring 2 GB/s of data over the bus. PCIe 2.0 x16 should be able to transfer 8 GB/s to each card (4 GB/s if you have CrossFire enabled) - plenty of room, you'd think.
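The bus arithmetic here works out as follows (a restatement of the post's numbers, assuming 4 bytes per 32-bit candidate offset):

```python
# 500M candidates/s at 4 bytes each versus nominal PCIe 2.0 x16 bandwidth.
candidates_per_sec = 500e6
bytes_per_candidate = 4     # one 32-bit offset per factor candidate
pcie2_x16_gb_per_s = 8.0    # nominal PCIe 2.0 x16 throughput

traffic_gb_per_s = candidates_per_sec * bytes_per_candidate / 1e9
print(traffic_gb_per_s)                       # 2.0
print(traffic_gb_per_s / pcie2_x16_gb_per_s)  # 0.25 of the nominal bus rate
```

On paper that leaves a 4x margin per card, which is why the measured transfer rates below are the more interesting number.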

I suggest another test: I also sent you the performance-info binary in the last package. This is a normal mfakto binary that additionally queries and displays OpenCL performance data for both the data transfer and the kernel execution. The perf info you sent me last time showed transfer rates of 2.1-2.3 GB/s. Please start the pi binary instead of the real ones, but start them one by one and monitor the transfer rates being reported. The first one will certainly start with a fairly consistent ~2.2 GB/s. When you add another mfakto-pi on the same card, does it start to fluctuate? Is the reaction the same when you add an instance on the other card? And what do the transfer rates look like with 4 instances?

I expect them to still show 2.3 GB/s quite often, but if the memory transfer to the GPU is an issue, it will also show much lower values in between.

I think I will compile a version for you that adds another debug flag to show detailed timing info for each of the steps. That will show where the extra time is spent as more instances start up.

I also have a version that skips the sieving and the transfer of candidates to the GPU completely; I just need to adapt it to the new kernels. That way we could test what the GPUs could really do if they had all the data they needed.

Have you already played around with the clocks of your memory modules? Of course, overclocking is always a bit risky, but how about slowing the memory down a bit, to see whether the capping effect gets stronger?

And I have yet another idea: of each 32-bit offset for the FCs, only 24 bits are evaluated. Each GPU thread needs 4 FCs to fill its vector. Instead of transferring 4x32=128 bits per GPU thread, I could squeeze the 4x24 bits into 3x32-bit integers. This should reduce the required bandwidth by 25%. It means a bit more computational effort, but the reduced I/O may well offset that. Certainly worth a test.
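The packing idea can be sketched like this (my own host-side illustration in Python; mfakto would do this in C/OpenCL, and the bit layout shown is just one possible choice, not Bdot's actual scheme):

```python
# Pack four 24-bit offsets into three 32-bit words (96 bits instead of 128).
# ASSUMPTION: this particular bit layout is illustrative, not mfakto's.
MASK24 = 0xFFFFFF
MASK32 = 0xFFFFFFFF

def pack_4x24(a, b, c, d):
    w0 = (a | (b << 24)) & MASK32         # a[0:24] | b[0:8]
    w1 = ((b >> 8) | (c << 16)) & MASK32  # b[8:24] | c[0:16]
    w2 = ((c >> 16) | (d << 8)) & MASK32  # c[16:24] | d[0:24]
    return w0, w1, w2

def unpack_4x24(w0, w1, w2):
    a = w0 & MASK24
    b = ((w0 >> 24) | (w1 << 8)) & MASK24
    c = ((w1 >> 16) | (w2 << 16)) & MASK24
    d = (w2 >> 8) & MASK24
    return a, b, c, d

vals = (0x123456, 0xABCDEF, 0x0F0F0F, 0xFFFFFF)
assert unpack_4x24(*pack_4x24(*vals)) == vals  # lossless round trip
```

The unpacking (the extra shifts and masks) is the "bit more computational effort" mentioned above; on the GPU side it would be a few integer ops per vector in exchange for 25% less bus traffic.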
Bdot is offline


Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.