mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Hardware (https://www.mersenneforum.org/forumdisplay.php?f=9)
-   -   Hardware Benchmark Test Thread for 100M exponents (https://www.mersenneforum.org/showthread.php?t=13185)

aurashift 2015-08-29 16:02

[QUOTE=Madpoo;409088]I haven't done extensive testing on this new system, but on a dual 10-core box (and almost the same on dual 8-core systems) if I run 2 workers, each one using all of the cores on a CPU, it gets a little squirrelly ...

If either of the workers is doing a test on an exponent > 58M in size, the other worker needs to be working on one < 37M, otherwise the performance of both workers starts to degrade pretty noticeably, and gets worse pretty quick as the other worker does things larger than 37M.

In other words, I can run a pair of 57M tests okay, but running a pair of 58M tests gets sketchy, and running a pair of 59M+ exponents really doesn't work well. So when I'm testing anything 58M or higher, I make sure and set the other one to doing some triple checks in the 34M-37M range and everything is happy.

On the new 14-core box, the increased memory bandwidth must really help because I did a basic test and the breakpoint is a little higher, like as much as 60M. I haven't exactly tried running a pair of 60M tests at once on each worker, but I can do a 60M and a 45M okay... it slows down a little, but not too much.

One thing is for sure... if you're running a 332M+ exponent on one core, you really need to have worker #2 doing something < 37M or you'll see a big hit. Fortunately it's easy to see... just start up all the workers and get a baseline of how each is doing, then stop worker #2 (the one not doing a large exponent) and see if worker #1 speeds up dramatically.

For example, when I was testing M383838383 on one CPU and doing anything larger than 37M on the other, the ETA for the big one was ~160-180 days. If I stopped the other worker or gave it something < 37M, then it was only ~120 days. That's a pretty big difference.

Note that the second worker also runs slower, so both are affected.

This same memory contention also occurs if you're running a bunch of workers, each one using a single core/thread. It may seem like a good idea to run a bunch of workers, each one doing its own thing, but I guarantee that unless you're doing work on exponents in the 1M-5M range, the performance of each worker will slow down dramatically...

This is all just based on my own experience... server systems with probably better than average memory systems compared to the average desktop. Generally doing 2 DIMMs per channel with interleaving, decent memory speeds, etc. It seemed to hold true whether it was DDR2 or DDR3 systems, and as mentioned, it was only on this new system with DDR4 RAM that I saw that threshold go up a bit.

This is why I'm pretty psyched to see how Knights Landing CPU's will do, with all of that on-die high speed memory (and of course AVX-512). Since memory tends to be the bottleneck, having lots of really fast RAM will make for some pretty dramatic improvements.[/QUOTE]

So, I've got my 100M digit workers set to use all the cores on a single socket except the first core, to avoid interrupts.

For anything lower than 100M, I have one worker per two cores doing LL tests, using all available cores.

What would you recommend?
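For reference, the pairing rule described in the quoted post can be sketched as follows. The thresholds (58M / 37M, rising to roughly 60M on the DDR4 box) are taken directly from the post; the function name and structure are purely illustrative and not anything from Prime95 itself:

```python
# Sketch of Madpoo's exponent-pairing heuristic for two single-socket workers.
# Thresholds come from the quoted post; everything else is illustration.

def workers_compatible(expo_a, expo_b,
                       big_threshold=58_000_000,
                       small_threshold=37_000_000):
    """Return True if two workers should run at full speed together:
    if either exponent is above big_threshold, the other must be
    below small_threshold, otherwise both workers degrade."""
    if expo_a > big_threshold or expo_b > big_threshold:
        return min(expo_a, expo_b) < small_threshold
    return True

# Examples from the post:
print(workers_compatible(57_000_000, 57_000_000))   # pair of 57M tests: True
print(workers_compatible(59_000_000, 45_000_000))   # 59M with 45M: False
print(workers_compatible(332_000_000, 34_000_000))  # 332M with a 34M TC: True
```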

Madpoo 2015-08-30 05:22

[QUOTE=aurashift;409117]So, I've got my 100M digit workers set to use all the cores on a single socket except the first core, to avoid interrupts.

For anything lower than 100M, I have one worker per two cores doing LL tests, using all available cores.

What would you recommend?[/QUOTE]

The interrupt handling on the first core really depends on the system itself and how it's set up. I can't remember what kind of boxes you have?

I remember looking into this again recently on my HP Proliant servers, and the INT handling on those is different from that of a desktop.

First off, the PCI cards will use one CPU or the other as their primary interrupt handler (the detailed system specs will tell you which slots are handled by which CPU). This helps you set up a heavy I/O system doing lots of network or disk traffic, since you can populate the slots on the least loaded CPU with extra NICs or array controllers.

Other than that though, it sounded like interrupt handling isn't really locked to the first core of a chip, but that's where I couldn't get as much info. On older Compaq Proliants with ancient Windows NT versions, Compaq had a custom HAL that would also allow any CPU to handle interrupts, but that was in the dinosaur age when there weren't multi-core chips. That feature is pretty common even on desktop dual-socket motherboards that use the Intel chipset, and I think the default Windows HAL has that built-in now.

Personally, my systems don't do a lot of disk or network things that generate lots of interrupts... they're typically more CPU or memory related during crunch times. And even then, the NICs all do TOE (TCP offload engine) so they're able to do more things on their own without generating interrupts. Ditto with the smart array controllers and lots of caching going on. Even on a busy SQL server connected to a shared storage unit over SAS, I can do something pretty heavy like a defrag and the interrupts are barely noticeable. I don't remember now if those were affined to core 0 though... I'll have to check next time.
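The post above mentions wanting to check whether interrupts are affined to core 0. A minimal, OS-agnostic sketch of that check: take two snapshots of per-core interrupt counters a few seconds apart (from perfmon, /proc/interrupts, or similar) and compare the deltas. The sample numbers below are made up for illustration:

```python
# Hypothetical check for interrupt affinity: which core handled the most
# interrupts between two counter snapshots? Sample data is invented.

def interrupt_shares(before, after):
    """Return each core's fraction of interrupts delivered between snapshots."""
    deltas = [b2 - b1 for b1, b2 in zip(before, after)]
    total = sum(deltas)
    return [d / total for d in deltas]

before = [1_000_000, 400_000, 395_000, 410_000]   # hypothetical counters
after  = [1_250_000, 405_000, 400_000, 415_000]

shares = interrupt_shares(before, after)
print(shares)  # core 0 handled the overwhelming share in this made-up sample
```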

Madpoo 2015-08-31 18:21

[QUOTE=Madpoo;409088]...If either of the workers is doing a test on an exponent > 58M in size, the other worker needs to be working on one < 37M, otherwise the performance of both workers starts to degrade pretty noticeably, and gets worse pretty quick as the other worker does things larger than 37M....[/QUOTE]

Looking back at this... I found out something kind of cool with the Xeon E5-2697 v3 system I just got. On the v1/v2 systems, yeah, what I said above is true.

On this system with fast (2133) DDR4 memory and the faster QPI (I think it's 9.6 Gbps), it definitely does a lot better running two larger LL tests on both CPUs at once.

Right now I have a 74M on one worker and a 70M on the other, and they're both running as fast as they would solo. So now I reckon I may experiment later and see where its limits are. I guess that jibes with what I saw when testing a 100M digit exponent... it did better than I expected when adding additional cores from the other CPU. I expected it to do a little better with one more core, but then drop in performance as the QPI link flooded. But no, it did pretty well with 3-4 more cores. Not astoundingly faster, but marginally so.

aurashift 2015-08-31 21:43

are you seeing the CPUs at 100% but still with the disparate timings?

(oracle nix and bl460c gen8)

Madpoo 2015-09-01 16:00

[QUOTE=aurashift;409293]are you seeing the CPUs at 100% but still with the disparate timings?

(oracle nix and bl460c gen8)[/QUOTE]

When running a pair of larger exponents and seeing the "slowdown" issue, yeah, in general both of them are still running at 100% on all of their assigned cores, but not exactly. Like, there might be a % or two of wiggle room.

That did strike me as odd, that CPU usage would still be near max if it was indeed being memory throttled, but... well, I don't know. I have no theories on that off the top of my head.

I have noticed that when I'm testing a really *small* exponent, say something in the 1M-10M range, the first core for that worker will be 100%, but the rest of the cores will be less than that, maybe only 70% or so. I really don't know why that would be either.

When I was doing my triple-checks of things below 2M, to avoid that I re-configured a couple boxes to only do pairs of cores per worker (and just had more workers). Something like that anyway... I don't recall what I ended up with, but that seemed to balance it out and get more oomph out of the system. At those FFT sizes it was okay to run multiple workers on a single CPU without experiencing whatever contention.

aurashift 2015-09-07 15:37

let's start a thread to determine: "optimal LL configuration" so we don't hijack this one any further.

Madpoo 2016-06-23 05:19

[QUOTE=Madpoo;408718]Just got me a shiny new dual E5-2697 v3 server (14 cores each @ 2.6 GHz).

With max performance mode enabled and running 1 worker with 14 cores on one CPU, I get "51 days, 1 hour" to test M332220523.

I can bump it to add more cores on the other cpu:
16 total = 45 days
17 total = 43.5 days
18 total = 43 days
...

At some point the gains go backwards... if I go to all 28 cores on both chips the total time reverts to around 44.5 days.
[/QUOTE]

And now the same test with the E5-2690 v4 (dual 14-cores, 2.6GHz, DDR4-2400)

Testing M332220523 as the only worker, with varying # of threads:
[CODE]10 threads = 64 days, 13 hours
12 threads = 59 days, 5 hours
14 threads = 57 days, 9 hours
16 threads = 61 days, 3 hours
18 threads = 66 days, 0 hours
22 threads = 66 days, 18 hours
28 threads = 67 days, 6 hours
[/CODE]

Well that's odd... from that it seems worse than the v3 processor at the same clock and slower RAM. The only diff is now I'm using Prime95 v28.9 instead of v28.7 so I may pull up the older version and do another comparison, see if that makes a difference.
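The thread-count table above can be reduced to relative throughput to make the scaling knee obvious. The timings are taken directly from the table; the small helper is just illustration:

```python
# Reduce the v4 timing table for M332220523 to hours and relative throughput.
# Data comes straight from the post; the code is editorial illustration.

timings = {  # threads -> (days, hours) to complete the test
    10: (64, 13), 12: (59, 5), 14: (57, 9),
    16: (61, 3), 18: (66, 0), 22: (66, 18), 28: (67, 6),
}

hours = {t: d * 24 + h for t, (d, h) in timings.items()}
best = min(hours, key=hours.get)
print(best)  # 14 threads is the sweet spot on this machine

# Throughput relative to the best (14-thread) run:
for t in sorted(hours):
    print(t, round(hours[best] / hours[t], 3))
```

The numbers show throughput dropping again past 14 threads, i.e. once the second socket gets involved.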

I guess it could also be that Prime95 doesn't recognize the new CPU exactly (although it sees that it has FMA, SSE2, etc) and might be choosing a non-optimal method for that exponent?

Okay... so with Prime95 28.7:
[CODE]14 threads = 56 days, 15 hours
18 threads = 66 days, 7 hours
[/CODE]

So, not really that different. Hmm... very puzzling. I may have to retest on that v3 CPU to confirm my previous times. I'm confused by that.

EDIT: Crumb... I went back to that v3 box and re-did the tests... same as before, meaning for some reason it would finish faster than the more awesome v4 CPU.

I'm hopeful it's just a matter of this model CPU getting some love from Prime95?

At least it runs faster on the smaller FFT sizes which matters more to me for now. :smile:

Mark Rose 2016-06-23 05:26

Make sure the turbo button is depressed.

Madpoo 2016-06-23 15:09

[QUOTE=Mark Rose;436758]Make sure the turbo button is depressed.[/QUOTE]

LOL... yeah, good point. :smile:

But seriously, it does remind me I should use CPU-Z to look at the turbo clock it's using when under stress. If past experience is any guide, the first CPU is probably 1-tick below max turbo, and the 2nd CPU is 2 below.

Someone had mentioned a nexus between turbo mode and AVX and I did read something the other day about AVX use on Intel leading to a deliberate reduction in possible turbo boosts by one multiple. Whether that was for thermal or other reasons, I'm not sure... probably to fit into the TDP spec but that's a guess.

I also considered the unlikely scenario that it's using a different FFT size on the new machine. Probably not, but since I forgot to see what it used, I'll check it anyway.

What struck me as the most odd was how it got worse right away when adding extra cores on the other CPU. Both CPUs have a 9.6 GT/s QPI link, but the overall faster RAM (2400 vs 2133) would lead me to think it would be even better at doing cross-CPU transfers?

Puzzling... today I'm going to test it out on smaller #'s (like 4M FFT size, perhaps using M49 as a comparison) to see how it handles adding extra cores.

Hopefully I can get all of my 2400 MHz RAM tests done today, because I need to install the rest of the memory and get the system deployed; this is my chance to get whatever performance testing I can done in an ideal setup.

EDIT: I just noticed that the E5-2697 v3 has a max turbo of 5x whereas the E5-2690 v4 has a max of 6x. So that may account for why, at the same 2.6GHz base spec, this thing does faster right off the top, regardless of mem speed.

LaurV 2016-06-24 09:49

If you have some bandwidth limitations, then more cores will waste time waiting for the data and for each other.

About 50 days on 14 cores for that expo is quite OK; a Titan needs ~60.

Madpoo 2016-06-24 16:45

[QUOTE=LaurV;436834]If you have some bandwidth limitations, then more cores will waste time waiting for the data and for each other.

About 50 days on 14 cores for that expo is quite OK; a Titan needs ~60.[/QUOTE]

The weird part was that the Broadwell-E with faster memory and a higher turbo (I verified the clock was faster during the run on the v4 compared to the v3) was getting *slower* iteration times than Haswell-E.

At that FFT size anyway (18M I think).

At "regular" FFT sizes, in the 2M-4M range anyway, the Broadwell-E is definitely faster, but I think that has everything to do with a slightly faster clock thanks to the extra 100MHz multiplier.

I tried all kinds of things to get that 332,220,523 test running faster. Disabled hyperthreading, set the fans to "max cooling" (it's in a pretty cold server room in my building... cooler in there than in the colocation it will eventually move to... 64 F). Nothing made it run faster. Disabling HT had a (very) marginal boost, like it would have shaved an hour off the 57+ day run.

It's definitely not thermal throttling on the CPU side... CPU temps were low, but its multiplier was consistently around 29x instead of the max 32x. That's still faster than the stock 2.6GHz, and doing heavy AVX limits you to 31x anyway, I think.

I confirmed that with Prime95 stopped, it runs happily at 32x in the static high performance setting, so that's all fine. Memory is definitely 2400 MHz confirmed by CPU-Z.
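The multiplier observations above work out to the following clocks (the 29x/32x figures and 100 MHz base clock are from the post; the arithmetic is just spelled out):

```python
# Clock arithmetic for the observed multipliers under AVX load.
# 29x sustained vs. 32x max turbo, 100 MHz BCLK, 2.6 GHz nominal base.

BCLK_MHZ = 100
avx_clock  = 29 * BCLK_MHZ   # ~2.9 GHz sustained under Prime95
max_turbo  = 32 * BCLK_MHZ   # ~3.2 GHz with Prime95 stopped
base_clock = 26 * BCLK_MHZ   # 2.6 GHz nominal

print(avx_clock / 1000)                   # 2.9 (GHz) under heavy AVX
print(round(avx_clock / base_clock, 3))   # 1.115, still ~11.5% above base
```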

I took some screenshots of Broadwell v Haswell side by side, with Prime95 in a background window with CPU-Z in front, just to better see how everything is doing. I need to grab those off my work desktop where I snapped the images and I can post them.

