[QUOTE=Mark Rose;402559]It's basically a matter of flops. A GTX 580 running at stock clocks does 1581 GFLOPS. A four year newer 18 core E5-2699 v3 gives you 662 GFLOPS at base clock. Assuming the E5-2699 could be programmed as efficiently as mfaktc uses the GTX 580, you'd be looking at about 180-185 GHz-d/d, or 42% of the performance for 58% of the power usage.
TF does. Doing TF might be an option for the "free" cores on memory bandwidth starved systems.[/QUOTE] Is this including new instructions such as AVX/AVX2/AVX3.2? |
[QUOTE=Anonuser;402563]I think that an E5-2699 v3 running at stock clocks does 2.3 • 18 • 32 = 1324.8 GFLOPS (single precision).
[URL]http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/performance-xeon-e5-v3-advanced-vector-extensions-paper.pdf[/URL][/QUOTE] Then the numbers I read earlier, 16 flops per clock, were talking about DP not SP. Thanks for the correction! |
[QUOTE=Mark Rose;402568]Then the numbers I read earlier, 16 flops per clock, were talking about DP not SP. Thanks for the correction![/QUOTE]
The same paper notes that [CODE]Workloads using Intel AVX instructions may reduce processor frequency as far down as the AVX base frequency to stay within TDP limits.[/CODE] Thus the peak number may be unreachable on some SKUs. Oliver P.S. This is *not* the case for the "smaller socket CPUs" like socket 1150 Haswell CPUs. |
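For reference, the peak-throughput arithmetic in the posts above can be written out as a short Python sketch. The clock, core count, FLOPs-per-cycle figures and the GTX 580 number are the ones quoted in the thread; the AVX base clock used here is only an illustrative assumption (it is not stated in the thread and should be checked against the table in the linked paper), and the helper name `peak_gflops` is made up for this example.

```python
def peak_gflops(clock_ghz: float, cores: int, flops_per_cycle: int) -> float:
    """Theoretical peak = clock (GHz) x cores x FLOPs per core per cycle."""
    return clock_ghz * cores * flops_per_cycle


# Figures quoted in the thread for the E5-2699 v3 and the GTX 580.
BASE_CLOCK_GHZ = 2.3
CORES = 18
SP_FLOPS_PER_CYCLE = 32   # single precision
DP_FLOPS_PER_CYCLE = 16   # double precision
GTX_580_SP_GFLOPS = 1581.0

# ASSUMPTION: illustrative AVX base clock, not taken from the thread.
ASSUMED_AVX_BASE_GHZ = 1.9

sp_peak = peak_gflops(BASE_CLOCK_GHZ, CORES, SP_FLOPS_PER_CYCLE)
dp_peak = peak_gflops(BASE_CLOCK_GHZ, CORES, DP_FLOPS_PER_CYCLE)
sp_peak_avx = peak_gflops(ASSUMED_AVX_BASE_GHZ, CORES, SP_FLOPS_PER_CYCLE)

print(f"SP peak at base clock:         {sp_peak:.1f} GFLOPS")
print(f"DP peak at base clock:         {dp_peak:.1f} GFLOPS")
print(f"SP peak at assumed AVX clock:  {sp_peak_avx:.1f} GFLOPS")
print(f"DP peak vs GTX 580 (SP):       {dp_peak / GTX_580_SP_GFLOPS:.0%}")
print(f"SP peak vs GTX 580 (SP):       {sp_peak / GTX_580_SP_GFLOPS:.0%}")
```

These are theoretical ceilings only; how much of that trial-factoring code could actually reach on either chip is a separate question.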
[QUOTE=henryzz;402564]Is this including new instructions such as AVX/AVX2/AVX3.2?[/QUOTE]
Yeah, that was kind of my other question/comment. Specifically that George has said previously that he hasn't updated/optimized the CPU factoring code in Prime95 simply because GPUs did it better. However, I don't think that would still necessarily hold true. Maybe that's still the case, but like I said, we do have some massively multi-cored CPUs coming our way in the years (heck... in the *months*) to come. I imagine a 72-core Knights Landing CPU (with 4-way hyperthreading) would compare favorably at doing optimized TF work compared to any GPU out there. I'm trying to find out more about the KNL hyperthreading. I know it's 4-way but what I'm trying to dig up is what kind of capabilities are present in each HT. Each actual core has *two* 512-bit VPUs based on AVX-512. That seems to imply that at the very least, 2 threads per core can still do awesome 512-bit goodness. With 144 threads that can do AVX-512, tell me that wouldn't be pretty awesome. On the double-precision topic, the specs for the current generation Xeon Phi top out at 1.2 TFlops for the 7120P. I don't know how that would relate to GHz-d/d (it's probably not an exact science), so I'll leave that to smarter people than me. But then people don't use their GPUs for double-precision math, if I understand it correctly. Knights Landing is estimated to do DP work up to 3+ TFlops. At some point, someone will get a KNL CPU and try out Prime95 on it. Even with its fast memory options, an LL test is going to hit a memory bandwidth speed bump past a certain # of cores. If the rest of the cores could be doing TF work that would be pretty cool. Every 2 cores = 1 tile, and each tile has a 1MB L2 cache. I don't know what kinds of workloads might do well in a 1MB workspace, but keeping things L2 cache friendly would help, and maybe that means splitting TF into appropriate chunks across pairs of physical cores.
The biggest deal is going to be that MCDRAM up to 16GB with up to 400GB/s of mem bandwidth, and then DDR4 for the off-chip RAM. Besides KNL, the new AVX-512 support on future desktop chips seems like a good thing to start taking advantage of. |
[QUOTE=Madpoo;402584]At some point, someone will get a KNL CPU and try out Prime95 on it. Even with its fast memory options, an LL test is going to hit a memory bandwidth speed bump past a certain # of cores. If the rest of the cores could be doing TF work that would be pretty cool.
[/QUOTE] Even as it is it's faster if I leave a core idle per socket, so that the main worker isn't interrupted and miscellaneous tasks run on that idle core. The idle cores usually run somewhere around 10-50% utilization. [QUOTE]Besides KNL, the new AVX-512 support on future desktop chips seems like a good thing to start taking advantage of.[/QUOTE] I think they're leaving this off of the desktop chips... |
[QUOTE=aurashift;402587]<re: avx-512> ... I think they're leaving this off of the desktop chips...[/QUOTE]
Yeah, and that'll be a bummer. If I'm due for a desktop refresh on my personal system around then, I'd seriously consider a Xeon based motherboard and snag one of the Skylake Xeons. I'm sure at some point as we upgrade some of our servers we'll wind up with Skylake Xeons in the future HP Proliants too. |
[QUOTE=Madpoo;402591]Yeah, and that'll be a bummer. If I'm due for a desktop refresh on my personal system around then, I'd seriously consider a Xeon based motherboard and snag one of the Skylake Xeons.
I'm sure at some point as we upgrade some of our servers we'll wind up with Skylake Xeons in the future HP Proliants too.[/QUOTE] EVGA used to make a dual-socket Xeon gaming board... shame they discontinued it. I think Asus still makes one, but I've fried so many Asus boards I'm kinda shying away from them. In any case, it'll be interesting to see how the many-core problem is solved. |
AVX3.2 is not only double the size but it also seems more complete than any of the other extensions.
Hopefully it will make it onto a Skylake refresh or Cannon Lake. I get the feeling that it might be the one that eventually gets the most development done for it. |
I have a feature request: skip the first CPU.
At least with Linux, and probably other operating systems, a lot of system interrupts happen on the first CPU. On a four core box, where I want to keep a single core free for other tasks, I'd like to run mprime on cores 1-3 and leave core 0 free. The only way I can currently do this besides manually setting process/thread affinities on the command line with taskset is to configure mprime to use four workers and AffinityScramble2=01234, and put no assignments in worktodo.txt for the first worker. Solutions would be to grab the highest numbered cores first, or to use the cores as listed in AffinityScramble2 so that if I set it to 1230, it would use cores 1-3 before core 0. Presently it insists on always using core 0, which is suboptimal. Unless I'm doing it wrong? |
[QUOTE=Mark Rose;402743]I have a feature request: skip the first CPU.
At least with Linux, and probably other operating systems, a lot of system interrupts happen on the first CPU. On a four core box, where I want to keep a single core free for other tasks, I'd like to run mprime on cores 1-3 and leave core 0 free. The only way I can currently do this besides manually setting process/thread affinities on the command line with taskset is to configure mprime to use four workers and AffinityScramble2=01234, and put no assignments in worktodo.txt for the first worker. Solutions would be to grab the highest numbered cores first, or to use the cores as listed in AffinityScramble2 so that if I set it to 1230, it would use cores 1-3 before core 0. Presently it insists on always using core 0, which is suboptimal. Unless I'm doing it wrong?[/QUOTE] In local.txt, under each worker's individual heading you may write "Affinity=1" or whichever. You needn't bother with affinity scramble or any such thing. Just set 3 workers, then assign them individually to cores 1, 2, and 3 (excluding core 0). Rather than editing the file you can also do this through one of the menu options in "./mprime -m" (I forget which option exactly, I believe it's one of the first three or four). |
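For anyone who wants the taskset-style workaround in script form, here is a minimal Python sketch of "use every core except the first". It relies on `os.sched_getaffinity` / `os.sched_setaffinity`, which are in the standard library but Linux-only; the helper names `cores_excluding_first` and `pin_to` are made up for this example, and nothing here touches mprime's own settings in local.txt.

```python
import os


def cores_excluding_first(available):
    """Return the available cores minus the lowest-numbered one.

    If only one core is available it is returned unchanged, so the
    result is never empty for a non-empty input.
    """
    cores = sorted(available)
    if len(cores) <= 1:
        return set(cores)
    return set(cores[1:])


def pin_to(cores, pid=0):
    """Restrict a process (pid 0 = the calling process) to the given cores. Linux only."""
    os.sched_setaffinity(pid, cores)


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        target = cores_excluding_first(os.sched_getaffinity(0))
        pin_to(target)
        print("Now restricted to cores:", sorted(target))
    else:
        print("sched_setaffinity is not available on this platform")
```

On a four-core box this leaves the process on cores 1-3. A program launched from the script afterwards would normally inherit that mask, though whether mprime then overrides it with its own affinity handling is something to verify rather than assume.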