[QUOTE=Dubslow;297966]:shock:
...One (not-so-)small request. With the multi-threading (sort of) and now this, would you be willing to "backport" your changes/additions into mfaktc? From a user's standpoint (i.e. helping people, and for those with both nVidia and AMD cards), it's optimal if mfaktc and mfakto are as similar as possible (TF algos aside), and it's clear you have more time (or desire/drive/whatever) for developing the extra non-math goodies than TheJudger. Thanks :smile:[/QUOTE] I'd be happy if TheJudger decided to use some of my code for mfaktc, after I took so much of his code to make mfakto. However, it remains his decision whether he wants any of that in. When I started building OpenCL stuff, I screwed up my CUDA dev env, and I never really spent the effort to fix it. But anyone who has ever built mfaktc and knows how to read code should be able to merge these changes. For quite some time I have been regularly checking my code in to [URL]https://github.com/Bdot42/mfakto[/URL], and it's still open source :smile:. Have a look at [URL]https://github.com/Bdot42/mfakto/commit/ccf6d26fe3be5d4ab655b0069e0885c16337d05a[/URL], for instance, to see the first check-in that brings in the variable progress display - there's still some work needed though ...
[QUOTE=Bdot;297979]I'd be happy if TheJudger decided to use some of my code for mfaktc, after I took so much of his code to make mfakto. However, it remains his decision whether he wants any of that in.
When I started building OpenCL stuff, I screwed up my CUDA dev env, and I never really spent the effort to fix it. But anyone who has ever built mfaktc and knows how to read code should be able to merge these changes. For quite some time I have been regularly checking my code in to [URL]https://github.com/Bdot42/mfakto[/URL], and it's still open source :smile:. Have a look at [URL]https://github.com/Bdot42/mfakto/commit/ccf6d26fe3be5d4ab655b0069e0885c16337d05a[/URL], for instance, to see the first check-in that brings in the variable progress display - there's still some work needed though ...[/QUOTE] I'd love to do a merge -- there are two reasons I asked you: 1) I have very limited C experience, though merging already-written code should be a good thing from an experience standpoint. 2) For the next two weeks I will have little time to spend on coding/merging -- but after that is summer :smile: I guess that means that if no one else has done it in two weeks' time, I'll take a crack at it. (Some people know I already took a shot at merging some mfaktc code into CUDALucas, and I had planned on extending that.)
500M/s Sieving Cap??
I have around 600M/s worth of cards in my main PC (2x6970), but I cannot seem to feed my cards more than about 480M candidates per second, no matter how I arrange my instances of mfakto. I suspect I am hitting a 'sieving cap'. The processor isn't the issue, as I am only at around 70% usage now, nor are the GPUs themselves, which can both go to 99% if I kill one of the instances of mfakto feeding the other card.
What could be holding me back? Memory bandwidth? Something to do with caching? Something I am not considering?
[QUOTE=KyleAskine;298219]I have around 600M/s worth of cards in my main PC (2x6970), but I cannot seem to feed my cards more than about 480M candidates per second, no matter how I arrange my instances of mfakto. I suspect I am hitting a 'sieving cap'. The processor isn't the issue, as I am only at around 70% usage now, nor are the GPUs themselves, which can both go to 99% if I kill one of the instances of mfakto feeding the other card.
What could be holding me back? Memory bandwidth? Something to do with caching? Something I am not considering?[/QUOTE] Are you running Windows or Linux? AMD or Intel chip? While I have never run Linux (so I don't know if it would change things), I know that with a Win7/i5 2400 combo I could not max out my HD5770 even with 2 cores feeding it, plus the 'active' window always ran faster. When I put it in my 2500, I could hit 88% load if the mfakto window was 'active' and 66% load if it was not (it had a pretty high SP as well, or it'd have a 20% cpu wait at SP 5k). I have since gotten an AMD Phenom II X6 1055, on which I run 3 cores for a GTX 460 and 2 cores for the 5770. It too runs Win7, but the 2 cores keep it at 88% load regardless of which window is active. I think the problem boils down to OpenCL not being able to run as well as CUDA, unless Linux makes a difference.
[QUOTE=bcp19;298230]Are you running Windows or Linux? AMD or Intel chip? While I have never run Linux (so I don't know if it would change things), I know that with a Win7/i5 2400 combo I could not max out my HD5770 even with 2 cores feeding it, plus the 'active' window always ran faster. When I put it in my 2500, I could hit 88% load if the mfakto window was 'active' and 66% load if it was not (it had a pretty high SP as well, or it'd have a 20% cpu wait at SP 5k). I have since gotten an AMD Phenom II X6 1055, on which I run 3 cores for a GTX 460 and 2 cores for the 5770. It too runs Win7, but the 2 cores keep it at 88% load regardless of which window is active. I think the problem boils down to OpenCL not being able to run as well as CUDA, unless Linux makes a difference.[/QUOTE]
I think I was really hazy in my last post. Let me try to be a bit more precise with my language. I am running four instances of mfakto on a Windows 7 machine, two for each of my two graphics cards. No matter how I arrange my mfakto instances, I cannot sieve more than around 480-490 M/s across all of the instances combined. Like I said, I can max out either of my graphics cards (it takes around 300M/s of sieving to reach 99% GPU load) by simply killing one of the processes feeding the other graphics card. However, I need around 600M/s of sieving power to saturate both cards at the same time, and I cannot get that right now. The processor is definitely not the bottleneck, nor is raw GPU power. I just want to know if there is a way for me to discover what is.
[QUOTE=KyleAskine;298245]I think I was really hazy in my last post. Let me try to be a bit more precise with my language.
I am running four instances of mfakto on a Windows 7 machine, two for each of my two graphics cards. No matter how I arrange my mfakto instances, I cannot sieve more than around 480-490 M/s across all of the instances combined. Like I said, I can max out either of my graphics cards (it takes around 300M/s of sieving to reach 99% GPU load) by simply killing one of the processes feeding the other graphics card. However, I need around 600M/s of sieving power to saturate both cards at the same time, and I cannot get that right now. The processor is definitely not the bottleneck, nor is raw GPU power. I just want to know if there is a way for me to discover what is.[/QUOTE] When I was testing mfakto on one of my quads I ran into a somewhat similar problem with what I refer to as 'diminishing returns'. mfakto, unlike mfaktc, seems to need an extra bit of computing power beyond the core supplied (a single mfakto instance reports 30% cpu usage overall), so when I used all 4 cores, it would actually give me less throughput than if I used 3 and left the 4th core idle. (I only tested this on a Core2Quad, which has its own quirks, but I would imagine an i5/i7 quad would have somewhat similar results.)
[QUOTE=KyleAskine;298245]I think I was really hazy in my last post. Let me try to be a bit more precise with my language.
I am running four instances of mfakto on a Windows 7 machine, two for each of my two graphics cards. No matter how I arrange my mfakto instances, I cannot sieve more than around 480-490 M/s across all of the instances combined. Like I said, I can max out either of my graphics cards (it takes around 300M/s of sieving to reach 99% GPU load) by simply killing one of the processes feeding the other graphics card. However, I need around 600M/s of sieving power to saturate both cards at the same time, and I cannot get that right now. The processor is definitely not the bottleneck, nor is raw GPU power. I just want to know if there is a way for me to discover what is.[/QUOTE] A few ideas: [LIST][*]if you run two instances for one card, none for the other, and two threads of prime95, what sieving rate do you get out of the mfakto instances? If starting prime95 slows down sieving to ~60%, then it is probably just hyper-threads that are assigned to the tasks. An available hyper-thread is reported by Windows as available, but if you start using it, you're actually stealing CPU time from its twin.[*]on an otherwise idle machine, run 4 instances of "mfakto --perftest" at once. What's the total sieving performance with that? This test eliminates any GPU interaction (copy and process).[*]take the variable-SieveSize binary that I sent you and try a 24k SieveSizeLimit. If the L1 caches are the problem, then this should be faster than 36k [B]with the same binary[/B]. You can also use this binary for the --perftest above, in order to see what the best SieveSizeLimit would be. (Note this setting is a multiple of ~12k - it will use the highest such multiple that is not more than what you specify.)[*]I added to my todo-list a dummy kernel that would just read each transferred byte but not do any TF with it, just for better performance tests.[*]I finished my tests with lower SievePrimes (as low as 256, but I think I'll not allow going below 1000).
Together with the new GHz-days/day display you can easily see the effects of lowering SP that far, so I think I can publish that. In many cases you would sacrifice total throughput just to get higher GPU utilization - but your case may be different.[/LIST]
I will take a look tonight. Thanks for the suggestions!
And just so you know, the reason I suspect it isn't processor-related is that I moved SievePrimes around, and once I got below a certain point, processor load dropped but the M/s stayed constant.
[QUOTE=Bdot;298277]a few ideas:
[LIST][*]if you run two instances for one card, none for the other, and two threads of prime95, what sieving rate do you get out of the mfakto instances? If starting prime95 slows down sieving to ~60%, then it is probably just hyper-threads that are assigned to the tasks. An available hyper-thread is reported by Windows as available, but if you start using it, you're actually stealing CPU time from its twin.[*]on an otherwise idle machine, run 4 instances of "mfakto --perftest" at once. What's the total sieving performance with that? This test eliminates any GPU interaction (copy and process).[*]take the variable-SieveSize binary that I sent you and try a 24k SieveSizeLimit. If the L1 caches are the problem, then this should be faster than 36k [B]with the same binary[/B]. You can also use this binary for the --perftest above, in order to see what the best SieveSizeLimit would be. (Note this setting is a multiple of ~12k - it will use the highest such multiple that is not more than what you specify.)[*]I added to my todo-list a dummy kernel that would just read each transferred byte but not do any TF with it, just for better performance tests.[*]I finished my tests with lower SievePrimes (as low as 256, but I think I'll not allow going below 1000). Together with the new GHz-days/day display you can easily see the effects of lowering SP that far, so I think I can publish that. In many cases you would sacrifice total throughput just to get higher GPU utilization - but your case may be different.[/LIST][/QUOTE] I ran the stress test with two threads and two instances of mfakto hitting the same card. It ran at around 90%, which is what the 'faster' of the two cards runs at when I run 4x mfakto. The perftests look fine - they each run at around a max of 500M/s, even when I run 4 at the same time. So raw sieving isn't the issue. mfakto runs at about the same speed with the 36k exec and with the variable exec with 24 specified in the ini file. So I continue to be stumped.
I think if it were somehow processor-constrained, lowering my sieving should increase GPU load, but it just doesn't. When I run four instances at 5000 SievePrimes, I get 50% processor usage and 170% GPU load across both of my graphics cards (when you add them together). When I run four instances at 25000 SievePrimes, I get 80% processor usage and 170% GPU load across both cards. I just don't know why I can't push my GPU usage close to 200%, since each card can easily reach 99% individually.
[QUOTE=KyleAskine;298577]I ran the stress test with two threads and two instances of mfakto hitting the same card. It ran at around 90%, which is what the 'faster' of the two cards runs at when I run 4x mfakto.
The perftests look fine - they each run at around a max of 500M/s, even when I run 4 at the same time. So raw sieving isn't the issue. mfakto runs at about the same speed with the 36k exec and with the variable exec with 24 specified in the ini file. So I continue to be stumped. I think if it were somehow processor-constrained, lowering my sieving should increase GPU load, but it just doesn't. When I run four instances at 5000 SievePrimes, I get 50% processor usage and 170% GPU load across both of my graphics cards (when you add them together). When I run four instances at 25000 SievePrimes, I get 80% processor usage and 170% GPU load across both cards. I just don't know why I can't push my GPU usage close to 200%, since each card can easily reach 99% individually.[/QUOTE] The tests somewhat point towards memory, if they point to anything at all. It's not the CPU, I fully agree. And the GPUs also have some headroom. If 2x mfakto can bring one GPU to 99%, but adding the prime95 stress test lowers the GPU load to 90%, then we already see that there is some influence. In this case, it can only be the memory system, including the caches. Delivering 500M candidates per second to the GPUs also means transferring 2 GB/s of data over the bus (4 bytes per candidate). PCIe 2.0 x16 should be able to transfer 8 GB/s to each card (4 GB/s if you enabled CrossFire) - plenty of room, you'd think. I suggest another test: I also sent you the performance-info binary in the last package. This is a normal mfakto binary that additionally queries and displays OpenCL performance data for both the data transfer and the kernel execution. The perf-info you sent me last time showed transfer rates of 2.1-2.3 GB/s. Please start the pi-binary instead of the real ones, but start them one by one and monitor the transfer rates being reported. The first one will certainly start with a fairly consistent ~2.2 GB/s. When you add another mfakto-pi on the same card, does it start to fluctuate?
Is the reaction the same when you add an instance on the other card? And what do the transfer rates look like with 4 instances? I expect them to still show 2.3 GB/s quite often, but if the memory transfer to the GPU is an issue, in between they will also show much lower values. I think I will compile a version for you that adds another debug flag to show detailed timing info for each of the steps. This will show where more time is spent as more instances start up. I also have a version that skips sieving and the transfer of candidates to the GPU completely; I just need to adapt it to the new kernels. This way we could test what the GPUs really could do if they had all the data they needed. Did you already play around with the clocks of your memory modules? Of course, overclocking is always a bit dangerous, but how about slowing the memory down a bit, to see if the capping effect gets stronger? And I have yet another idea: of each 32-bit offset for the factor candidates (FCs), only 24 bits are evaluated. Each GPU thread needs 4 FCs to fill its vector. Instead of transferring 4x32=128 bits for each GPU thread, I could squeeze 4x24 bits into 3x32-bit integers. This should reduce the required bandwidth by 25%. A bit more computational effort, but perhaps the reduced I/O more than offsets that? Certainly worth a test.