#430
Nov 2010
Germany
3·199 Posts
Quote:
When I started building the OpenCL stuff, I screwed up my CUDA dev environment, and I never really spent the effort to fix it. But anyone who has ever built mfaktc and knows how to read code should be able to merge these changes. For quite some time now I have been regularly checking my code in to https://github.com/Bdot42/mfakto, and it is still open source. Have a look at https://github.com/Bdot42/mfakto/com...0885c16337d05a, for instance, to see the first check-in bringing the variable progress - there is still some work needed, though.
#431
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
3×29×83 Posts
Quote:
1) I have very limited C experience, though merging already-written code should be a good thing from an experience standpoint. 2) For the next two weeks I will have little time to spend on coding/merging -- but after that it is summer. I guess that means that if no one else has done it in two weeks' time, I'll take a crack at it. (Some people know I already took a shot at merging some mfaktc code into CUDALucas, and I had planned on extending that.)
#432
Oct 2011
Maryland
2·5·29 Posts
I have around 600M/s worth of cards in my main PC (2×6970), but I cannot seem to feed them more than 480M/s of candidates, no matter how I arrange my instances of mfakto. I suspect I am hitting a 'sieving cap'. The processor isn't the issue, as I am only at around 70% usage, nor are the GPUs themselves, which can both reach 99% if I kill one of the mfakto instances feeding the other card.
What could be holding me back? Memory bandwidth? Something to do with caching? Something I am not considering?
#433
Oct 2011
7×97 Posts
Quote:
#434
Oct 2011
Maryland
442₈ Posts
Quote:
I am running four instances of mfakto on a Windows 7 machine, two for each of my two graphics cards. No matter how I arrange them, I cannot sieve more than around 480-490 M/s across all instances combined. Like I said, I can max out either graphics card (it takes around 300M/s of sieving to reach 99% GPU load) simply by killing one of the processes feeding the other card. However, I need around 600M/s of sieving power to saturate both cards at the same time, and I cannot get there right now. The processor is definitely not the bottleneck, nor is raw GPU power. I just want to know whether there is a way to discover what is.
#435
Oct 2011
1247₈ Posts
Quote:
#436
Nov 2010
Germany
3×199 Posts
Quote:
#437
Oct 2011
Maryland
2·5·29 Posts
I will take a look tonight. Thanks for the suggestions!
#438
Oct 2011
Maryland
2×5×29 Posts
And just so you know, the reason I suspect it isn't processor related is that I varied SievePrimes: once I dropped below a certain point, processor load fell, but the M/s stayed constant.
#439
Oct 2011
Maryland
100100010₂ Posts
Quote:
The perftests look fine - they all run at a maximum of around 500M/s each, even when I run four at the same time, so raw sieving isn't the issue. mfakto runs at about the same speed with the 36k executable and with the variable executable with 24 specified in the ini file, so I continue to be stumped. I think that if it were somehow processor-bound, lowering my sieving should increase GPU load, but it just doesn't. When I run four instances at 5000 SievePrimes, I get 50% processor usage and 170% GPU load across both graphics cards (when you add them together). When I run four instances at 25000 SievePrimes, I get 80% processor usage and 170% GPU load. I just don't know why I can't push GPU usage close to 200%, since each card can easily reach 99% individually.
#440
Nov 2010
Germany
597₁₀ Posts
Quote:
If 2× mfakto can bring one GPU to 99%, but adding the prime95 stress test lowers the GPU load to 90%, then we already see that there is some influence. In that case it can only be the memory system, including caches. Delivering 500M candidates to the GPUs also means transferring 2 GB/s of data over the bus. PCIe 2.0 x16 should be able to transfer 8 GB/s to each card (4 GB/s if you enabled CrossFire) - plenty of room, you'd think.

I suggest another test: I also sent you the performance-info binary in the last package. This is a normal mfakto binary that additionally queries and displays OpenCL performance data for both the data transfer and the kernel execution. The perf-info you sent me last time showed transfer rates of 2.1-2.3 GB/s. Please start the pi-binary instead of the real ones, but start them one by one and monitor the transfer rates it reports. The first one will certainly start at a fairly consistent ~2.2 GB/s. When you add another mfakto-pi on the same card, does it start to fluctuate? Is the reaction the same when you add an instance on the other card? And what do the transfer rates look like with four instances? I expect them to still show 2.3 GB/s quite often, but if the memory transfer to the GPU is an issue, in between they will also show much lower values.

I think I will compile a version for you that adds another debug flag to show detailed timing info for each of the steps. This will show where more time is spent as more instances start up. I also have a version that skips sieving and the transfer of candidates to the GPU completely; I just need to adapt it to the new kernels. This way we could test what the GPUs could really do if they had all the data they needed.

Did you already play around with the clocks of your memory modules? Of course, overclocking is always a bit dangerous, but how about slowing them down a bit, to see whether the capping effect gets stronger?

And I have yet another idea: of each 32-bit offset for the FCs, only 24 bits are evaluated. Each GPU thread needs 4 FCs to fill its vector. Instead of transferring 4×32=128 bits per GPU thread, I could squeeze 4×24 bits into 3×32-bit integers. That should reduce the required bandwidth by 25%. A bit more computational effort, but the reduced I/O may well offset that. Certainly worth a test.
Similar Threads
| Thread | Thread Starter | Forum | Replies | Last Post |
| mfaktc: a CUDA program for Mersenne prefactoring | TheJudger | GPU Computing | 3498 | 2021-08-06 21:07 |
| gpuOwL: an OpenCL program for Mersenne primality testing | preda | GpuOwl | 2719 | 2021-08-05 22:43 |
| LL with OpenCL | msft | GPU Computing | 433 | 2019-06-23 21:11 |
| OpenCL for FPGAs | TObject | GPU Computing | 2 | 2013-10-12 21:09 |
| Program to TF Mersenne numbers with more than 1 sextillion digits? | Stargate38 | Factoring | 24 | 2011-11-03 00:34 |