mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Hardware (https://www.mersenneforum.org/forumdisplay.php?f=9)
-   -   Hardware Benchmark Test Thread for 100M exponents (https://www.mersenneforum.org/showthread.php?t=13185)

TObject 2015-06-11 16:29

Try affinitizing the Prime95 executable itself to the second socket; I wonder if that would switch things around.
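One way to try this on Windows is `start /affinity`, which takes a hex bitmask of logical CPUs. A quick Python sketch to build that mask — assuming the sockets enumerate contiguously (socket 0 = CPUs 0..n-1, socket 1 = CPUs n..2n-1) and HT is off; check your actual enumeration before relying on it:

```python
def socket_mask(socket_index, cores_per_socket):
    """Bitmask covering one socket's logical CPUs, assuming contiguous
    enumeration per socket (an assumption -- verify on your box)."""
    low = socket_index * cores_per_socket
    return ((1 << cores_per_socket) - 1) << low

# Second socket of a dual 10-core box, as hex for `start /affinity`:
print(format(socket_mask(1, 10), "x"))  # -> ffc00
```

So something like `start /affinity ffc00 prime95.exe` from a cmd prompt would pin the whole process to the second socket.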

Madpoo 2015-06-11 18:53

[QUOTE=Madpoo;403894]...The stock clock of the chip is 3.0 GHz, and with all 10 cores enabled the max turbo boost is 3 x 133MHz....[/QUOTE]

I checked out CPU-Z... I guess I should have checked some assumptions. The bus speed is 100 MHz, not 133 MHz like I thought? So I guess for 3 GHz the default multiplier is 30, with a max boost to 33x (3.3 GHz).
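The arithmetic is straightforward — base clock times multiplier:

```python
bus_mhz = 100  # base clock reported by CPU-Z
for mult in (30, 32, 33):
    print(f"{mult}x -> {bus_mhz * mult / 1000:.1f} GHz")  # 3.0, 3.2, 3.3 GHz
```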

On one server I checked, both sockets were running at 33x, but on another I looked at, CPU 1 was 33x and CPU 2 was only 32x. That's weird. That's only a 3% drop, but maybe sometimes it drops to 31x which could account for the 6% drop I saw the other day. (EDIT: that system is reporting proc temps of 44C and 50C respectively, right now... nowhere near excessive).

I'll have to explore a few more of these systems to see if there's any pattern to it.

Curiously, CPU-Z says the multiplier ranges from 12x to 36x, not 33x. I guess the Intel spec sheet I looked at could be wrong, or maybe that just shows up because it *can* do 6x boost if only 1 core is enabled. CPU-Z probably just pulls that info from a lookup of the CPU type.

Madpoo 2015-06-11 19:22

Turbo Boost info
 
From the Intel CPU doc:

[QUOTE]To determine the highest performance frequency amongst active cores, the processor
takes the following into consideration:
• The number of cores operating in the C0 state.
• The estimated current consumption.
• The estimated power consumption.
• The die temperature.
Any of these factors can affect the maximum frequency for a given workload. If the
power, current, or thermal limit is reached, the processor will automatically reduce the
frequency to stay within its TDP limit.[/QUOTE]

For the purposes of these tests, all cores are in the C0 state, I know that much. I know CPU #2 is slightly warmer. CPU #1 with 33x is running at 1.111v, CPU #2 at 32x is 1.126v.

That's about all I can piece together anyway. I guess CPU #2 is just deciding 32x is all it can do given its current situation.

These 2690 v2 chips have a pretty high TDP... 130W. I know they're doing a lot of work and all that, but they're each doing the same work and the only difference I can see is the temp, so that must be the deciding factor at play... just enough higher that it stepped down by one multiplier.

Oh well. I'm going to assume that's that and there's nothing I can do short of changing the fan speed. There *is* an option on these models to go from "increased cooling" to "max cooling" or something like that. It would basically pump all the fans to 100% all the time. I could look at that, see if it makes a difference. They're spinning right now at 68% on 3 of them, 80% on one, and already 100% on 2 more. About in the zones I'd expect too if I remember the sequence correctly. CPU #1 sits more or less behind the ones running faster, and they're probably running faster because the array controller is just further back than the CPU and it tends to run pretty hot, so those fans run faster to keep it cool.

If I increased the fan speed to 100% on the other ones that blow on CPU #2 that might tip the scales back.

In other words, it's only incidental that on this model server, CPU #1 is cooler, just because those fans happen to run a bit faster to cool some other components in that sector.

Next time I have to reboot one of these I'll give it a shot. Probably not for a few weeks though.

Madpoo 2015-06-11 20:52

[QUOTE=Madpoo;403913]...Next time I have to reboot one of these I'll give it a shot. Probably not for a few weeks though.[/QUOTE]

I gave myself the excuse of installing this month's Microsoft patches on the system (I did want to pick one system as a test for them).

I set to max cooling, and the fans are now running 100%.

With Prime95 running full tilt just as before, the CPU temps are now 44C and 42C, so CPU #2 dropped from 50C. Not bad.

Unfortunately, from CPU-Z it looks like the multiplier is still at 32x, although it does hit 33x about 10% of the time now, so that's new.

With Prime95 stopped, both CPUs will sit at the 33x multiplier, so it's really just the extra heat and wattage of a full load, I think.

Well, that's interesting... I'll leave it at the max cooling for now but I'll probably set it back to increased cooling at some point. It's a marginal improvement, but I'd rather not have the fans run full tilt 24/7 unless it was a night and day type of difference.

aurashift 2015-07-18 21:42

Too lazy to ssh. Two 16-core CPUs, 17 threads, 45 days.

Madpoo 2015-08-24 22:32

[QUOTE=aurashift;406072]Too lazy to ssh. Two 16-core CPUs, 17 threads, 45 days.[/QUOTE]

Just got me a shiny new dual E5-2697 v3 server (14 cores each @ 2.6 GHz).

With max performance mode enabled and running 1 worker with 14 cores on one CPU, I get "51 days, 1 hour" to test M332220523.

I can bump it to add more cores on the other cpu:
16 total = 45 days
17 total = 43.5 days
18 total = 43 days
...

At some point the gains go backwards... if I go to all 28 cores on both chips the total time reverts to around 44.5 days.
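Putting those observed ETAs side by side makes the diminishing (then negative) returns easy to see — a quick sketch, using the numbers from the runs above:

```python
# Observed ETAs (days) by worker core count; 14 cores = "51 days, 1 hour"
etas = {14: 51 + 1 / 24, 16: 45.0, 17: 43.5, 18: 43.0, 28: 44.5}
base = etas[14]
for cores in sorted(etas):
    print(f"{cores:2d} cores: {etas[cores]:5.1f} days, {base / etas[cores]:.2f}x vs 14 cores")
```

The 28-core run is actually slower than 18 cores — presumably the point where cross-socket communication and memory contention outweigh the extra compute.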

I did NOT use the "time" option... it ignores the affinity scramble and doesn't seem to pick the right cores on my system with HT enabled, so I just changed the # of cores for that one worker and let it run through enough iterations to get an idea of the actual ETA.
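Estimating the ETA by hand from the observed iteration timings is simple enough — a sketch, where the per-iteration time is illustrative, not a measured figure:

```python
def eta_days(exponent, iters_done, sec_per_iter):
    """Rough LL-test ETA: an LL test on M(p) takes p - 2 squaring iterations."""
    remaining = (exponent - 2) - iters_done
    return remaining * sec_per_iter / 86400

# Illustrative: ~0.0133 s/iter on the 14-core worker gives ~51 days for M332220523
print(round(eta_days(332_220_523, 0, 0.01327), 1))  # -> 51.0
```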

I tried it out on the M383,838,383 that I completed on my dual 10-core (E5-2690 v2) box... that took ~ 120 days to complete on 10 cores, and with 14 cores of the new system it would have been done in around 74 days. Not bad at all.

A 34M exponent like the ones I'm DC'ing right now will finish in under 10 hours on 14 cores. :smile: I think those are currently taking ~ 13-14 hours on a 10-core worker.

EDIT: P.S. I had a little scare with the new box. It shipped with old firmware and I hadn't updated yet... I got Windows installed and did the Prime95 bench before updating anything. It wasn't seeing all of the cores properly... it seemed to be missing the other CPU altogether. I flashed the latest ProLiant DL380 Gen9 firmware and noticed in the release notes that there was an issue that could keep some apps from seeing all the cores. After flashing, Prime95 saw all the cores just fine and my heart resumed its normal pace.

Madpoo 2015-08-24 22:52

[QUOTE=Madpoo;408718]Just got me a shiny new dual E5-2697 v3 server (14 cores each @ 2.6 GHz).[/QUOTE]

Oh, and before someone asks, the memory is running with 2 DIMMs per channel, 4 channels per CPU, all DDR4 @ 2133 MHz. 256 GB total (16 x 16 GB). So it benefits from the fast RAM and being able to interleave. I'm pretty pleased with its performance boost over our Gen8 boxes.
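For reference, the theoretical peak bandwidth of that configuration works out as follows (note that the second DIMM per channel adds capacity and rank interleaving, not raw bandwidth):

```python
channels_per_cpu = 4
transfers_mhz = 2133       # DDR4-2133: 2133 MT/s
bytes_per_transfer = 8     # 64-bit data bus per channel
peak_gb_s = channels_per_cpu * transfers_mhz * bytes_per_transfer / 1000
print(peak_gb_s)  # -> 68.256 GB/s theoretical peak per socket
```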

kladner 2015-08-25 02:21

[QUOTE=Madpoo;403917].....
With Prime95 stopped, both CPUs will sit at the 33x multiplier, so it's really just the extra heat and wattage of a full load, I think.
.....[/QUOTE]

How does it behave with one single-thread worker? Wouldn't that show the maximum boost ratio?

Madpoo 2015-08-25 20:12

[QUOTE=kladner;408730]How does it behave with one single-thread worker? Wouldn't that show the maximum boost ratio?[/QUOTE]

It probably would (haven't tested it) because it wouldn't be running as hot. But gaining the extra multiplier isn't that big a deal compared to losing all of the extra threads on that worker.

If I had a system that was running heavy from normal use (not just Prime95) and could only devote the resources of one core, that'd be another story... depending on what the rest of the load was like, and assuming it wasn't totally maxed out the way Prime95 maxes things out, it could pick up that extra clock multiplier without TDP concerns.

In my case, our servers are provisioned to handle load from other locations in failover situations, so we're not normally running full bore unless things are really bad somewhere in the country (or sometimes the web servers or SQL actually get a workout, which I can usually tell because the iterations/sec of some worker will drop like a rock). Aside from bursty activity and disaster recovery, normal operations aren't too bad, so Prime95 can do its idle processing fairly well and might as well use all the cores.

One benefit I get out of it is early detection of memory issues. We had one system that started to show early signs of DIMM problems, and I'd like to think it was because of the good workout it was getting. The servers have advanced ECC so the end residues of some runs during that period matched, but the server let me know it recovered some mem errors and I was able to swap out the bad modules.

A couple years back before I got back into Prime95, we had a server that would only blue screen with a memory issue when it was the active SQL node in a cluster... boo. During normal operation when it was a passive node, it was all roses and unicorns, but the moment we actually asked it to do something, about 3-4 hours into operation it would BSOD. So yeah... that's the kind of thing I'm hoping to avoid. Call Prime95 my early detection system or whatever, but it's worked like that in at least one case.

Lorenzo 2015-08-28 19:51

[QUOTE=Madpoo;408718]Just got me a shiny new dual E5-2697 v3 server (14 cores each @ 2.6 GHz)....[/QUOTE]
Very interesting! How many days would it take if you assigned an LL test to each CPU? As I understand it, you got 51 days when you used 14 cores of one CPU and the other cores idled. So what happens when running two LL tests across all cores (2x14)? How many days does Prime95 need to complete each of them in parallel?

Madpoo 2015-08-29 01:40

[QUOTE=Lorenzo;409050]Very interesting! How many days would it take if you assigned an LL test to each CPU? As I understand it, you got 51 days when you used 14 cores of one CPU and the other cores idled. So what happens when running two LL tests across all cores (2x14)? How many days does Prime95 need to complete each of them in parallel?[/QUOTE]

I haven't done extensive testing on this new system, but on a dual 10-core box (and almost the same on dual 8-core systems) if I run 2 workers, each one using all of the cores on a CPU, it gets a little squirrelly ...

If either worker is doing a test on an exponent > 58M, the other worker needs to be working on one < 37M; otherwise the performance of both workers degrades pretty noticeably, and it gets worse pretty quickly as the second worker takes on exponents larger than 37M.

In other words, I can run a pair of 57M tests okay, but running a pair of 58M tests gets sketchy, and a pair of 59M+ exponents really doesn't work well. So when I'm testing anything 58M or higher, I make sure to set the other worker to doing triple checks in the 34M-37M range, and everything is happy.

On the new 14-core box, the increased memory bandwidth must really help because I did a basic test and the breakpoint is a little higher, like as much as 60M. I haven't exactly tried running a pair of 60M tests at once on each worker, but I can do a 60M and a 45M okay... it slows down a little, but not too much.
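A plausible explanation for those breakpoints is the FFT working set spilling out of the shared L3 cache (25 MB on the E5-2690 v2, 35 MB on the E5-2697 v3, per Intel's specs). A rough sketch — the bits-per-word packing density here is an assumed average, not an exact Prime95 figure:

```python
def fft_working_set_mib(exponent, bits_per_word=18.0):
    """Rough size of the double-precision FFT data for an LL test on
    M(exponent); bits_per_word is an assumed packing density."""
    fft_len = exponent / bits_per_word
    return fft_len * 8 / 2**20  # 8 bytes per double

for p in (37_000_000, 58_000_000, 60_000_000):
    print(f"{p // 10**6}M: ~{fft_working_set_mib(p):.0f} MiB")
```

By this estimate a ~37M exponent (~16 MiB) still fits in a 25 MB L3 while a ~58M one (~25 MiB) does not, which would line up with the observed thresholds — though this is back-of-envelope reasoning, not something I've verified against Prime95's actual FFT sizes.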

One thing is for sure... if you're running a 332M+ exponent on one core, you really need to have worker #2 doing something < 37M or you'll see a big hit. Fortunately it's easy to see... just start up all the workers and get a baseline of how each is doing, then stop worker #2 (the one not doing a large exponent) and see if worker #1 speeds up dramatically.

For example, when I was testing M383838383 on one CPU and doing anything larger than 37M on the other, the ETA for the big one was ~160-180 days. If I stopped the other worker or gave it something < 37M, it was only ~120 days. That's a pretty big difference.
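That works out to a sizable penalty — using the midpoint of the contended range:

```python
solo, contended = 120, 170  # days: alone vs. with a >37M neighbor (midpoint of 160-180)
print(f"{(contended - solo) / solo:.0%} longer")  # -> 42% longer
```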

Note that the second worker also runs slower, so both are affected.

This same memory contention also occurs if you're running a bunch of workers, each using a single core/thread. It may seem like a good idea to have each worker doing its own thing, but I guarantee that unless you're working on exponents in the 1M-5M range, the performance of each worker will slow down dramatically...

This is all just based on my own experience... server systems with probably better than average memory systems compared to the average desktop. Generally doing 2 DIMMs per channel with interleaving, decent memory speeds, etc. It seemed to hold true whether it was DDR2 or DDR3 systems, and as mentioned, it was only on this new system with DDR4 RAM that I saw that threshold go up a bit.

This is why I'm pretty psyched to see how Knights Landing CPUs will do, with all of that on-die high speed memory (and of course AVX-512). Since memory tends to be the bottleneck, having lots of really fast RAM should make for some pretty dramatic improvements.

