[QUOTE=retina;409787]Those graphs are the OS time slice views and have absolutely nothing at all to do with memory bandwidth or bottlenecks. The OS can't halt the CPU when instructions are waiting for memory, it just doesn't work that way. The OS can't see into the process and decide to somehow insert a HLT instruction in the middle of a memory read instruction. The OS is just another program (albeit with higher privileges) and runs on the same core hardware as everything else. Only when some interrupt or exception happens does the OS get to run some code, but during normal program operation the OS isn't even running.[/QUOTE]
I know all that. But the graphs *do* indicate how much each core is being used by the system, including all programs. I think you misunderstood what I'm saying when I mentioned that the CPU % used on each core isn't at 100%. All I mean is that it's a clear indication that Prime95 itself isn't so much CPU bound at that point, but rather bottlenecked by something else, which is almost certainly the memory. Even on a new system that I'm burning in, where Prime95 is literally the only non-OS thing running, this is the case.

If Prime95 is in the middle of some LL iteration that involves reading/writing a large chunk of RAM, then it is by definition memory bound at that point, and any further execution in the program is, by necessity, on hold until that's done. If that memory access happens to fit into L2/L3 cache it will finish fairly quickly, but even then there's still latency from L2/L3 cache coherence and from any cache read misses or write-throughs to the memory controller. It's the cases where it has to access main memory, and the latency involved, where I suspect we're seeing the most memory-related bottlenecks, and the LL test stalls for a bit while waiting for those ops to complete. That's expected with large datasets and not out of the ordinary.

My only point was about the degree to which the CPU is stalled during those times. It may be sloppy of me to say the CPU is stalled, because it's really the Prime95 execution thread that's stalled. If I had a bunch of other things running on this system that needed CPU cycles (web server, SQL, whatever), it would happily use those cycles if needed. Is that the confusing part of my terminology, that you thought I was saying the CPU itself is being halted? Because I didn't mean to imply that.
[QUOTE=Madpoo;409765]
Assign a really small exponent to that worker in the 10M range while you look at the graphs of each individual core. What you *should* see (and what I see) is that the first core on that worker will use roughly all of its power, but the other 3 "helper" cores will use a noticeably smaller amount. In the case of a 10M exponent, it should be pretty obvious. Is that true? I have no idea... it was just my theory to explain why I saw a more pronounced CPU idle with smaller exponents. The fewer cores I threw at the worker, the more of its full potential each CPU was using, I guess because there's a point where memory and CPU are roughly balanced.[/QUOTE]

In this case the OS is measuring prime95's shoddy multithreading code. Prime95 breaks the task into several decent-sized chunks, submits them to the worker threads, waits for them all to finish -- repeat. What you are seeing is that the worker threads won't all finish their chunks at the same time, causing idleness. Worse, if you have say 10 chunks to do on 4 workers, then two workers do three chunks and two workers do two chunks --- i.e., two workers are 33% idle.
[QUOTE=Prime95;409814]In this case the OS is measuring prime95's shoddy multithreading code. Prime95 breaks the task into several decent-sized chunks, submits them to the worker threads, waits for them all to finish -- repeat. What you are seeing is that the worker threads won't all finish their chunks at the same time, causing idleness. Worse, if you have say 10 chunks to do on 4 workers, then two workers do three chunks and two workers do two chunks --- i.e., two workers are 33% idle.[/QUOTE]
Oh... weird. :smile: Is that the type of thing that would be more pronounced with smaller exponents? And is there anything that could be done to improve the way it works, like having an equal number of chunks and worker threads? I understand that the LL process is sequential, and any multithreading has to happen within a single iteration, which limits things a bit. Given a best-case scenario (no other apps running, leaving as much of the horsepower to Prime95 as possible), should each core/thread finish its work at the same time as every other one? And is it the role of the first thread to distribute those chunks, collect the results, and prep for the next step? If so, I guess there is indeed a case to be made for running that primary thread on something besides core zero, which also has to deal with interrupts on most systems. Maybe give that core the lightest possible load. :smile:
[QUOTE=Madpoo;409838]Is that the type of thing that would be more pronounced with smaller exponents?[/quote]
Likely. With larger FFTs there are more chunks to process, so percentage-wise there is less wastage when an uneven number of chunks is distributed (e.g. 50 chunks on 4 threads vs. 10 chunks on 4 threads = 8% vs. 33% waste).

[quote]And is there anything that could be done to improve the way it works, like having an equal # of chunks as worker threads?[/quote]
Maybe, but multi-threading optimization is not a high-priority item for me. Ernst's Mlucas seems to do a better job with multi-threading. I'm not sure if that's because he has superior methods or some other reason.

[quote]Given a best case scenario (no other apps running, leaving as much of the horsepower to Prime95 as possible), each core/thread should finish its work at the same time as every other one?[/quote]
Assuming the number of chunks is a multiple of the number of threads and no lock-contention issues arise, then they should finish very close to the same time.

[quote]And is it the role of the first thread to distribute those chunks and collect the results and prep for the next step?[/QUOTE]
Yes.
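As a back-of-envelope check of the waste figures quoted above: assuming equal-sized chunks, the busiest worker gets the ceiling of chunks/threads and the lightest gets the floor, and the lightest worker idles for the difference. A small Python sketch (the function name is my own, not from prime95):

```python
# Idle fraction of the least-loaded worker when `chunks` equal-sized
# chunks are spread over `threads` workers and all must finish before
# the next iteration starts.
import math

def worst_case_idle(chunks, threads):
    most = math.ceil(chunks / threads)   # chunks done by the busiest worker
    least = chunks // threads            # chunks done by the lightest worker
    return (most - least) / most

print(round(worst_case_idle(10, 4) * 100))  # 33 (%)
print(round(worst_case_idle(50, 4) * 100))  # 8 (%)
```

This reproduces the 33% vs. 8% figures: at 10 chunks the lightest worker does 2 of a possible 3 chunks' worth of time, while at 50 chunks it does 12 of 13, so the relative waste shrinks as the chunk count grows.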