mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware

Old 2017-11-05, 13:35   #1
Dubslow
Basketry That Evening!
 
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3×2,399 Posts
Default The bandwidth bottleneck is apparently much older than I thought

(Could go in either Hardware or Software, meh)

I recently finished a batch of DCs for GIMPS before temporarily shifting the computer to other duties. That means I got to watch the four DCs finish within a few hours of each other.

This is on a now-geriatric Sandy Bridge i5-2500K, a quad-core with no hyperthreading, with 16 GiB of DDR3-1333 (or whatever the standard DDR3 speed is) in four sticks across two channels. As far as I know, no other stressful computation was going on on the machine.

Note the iteration times:

Quote:
[Worker #3 Nov 5 01:11] Iteration: 42450000 / 42451699 [99.99%], ms/iter: 31.699, ETA: 00:00:53
[Worker #3 Nov 5 01:12] M42451699 is not prime. Res64: 24A7F063EA489C2C. We8: F270D118,38481060,00000000
[Worker #3 Nov 5 01:12] No work to do at the present time. Waiting.
[Comm thread Nov 5 01:12] Sending result to server: UID: Dubslow/Guilty-Spark, M42451699 is not prime. Res64: 24A7F063EA489C2C. We8: F270D118,38481060,00000000, AID: 2676DE321FBE2EB1D16A8471FADAF228
[Comm thread Nov 5 01:12]
[Comm thread Nov 5 01:12] PrimeNet success code with additional info:
[Comm thread Nov 5 01:12] LL test successfully completes double-check of M42451699 --
[Comm thread Nov 5 01:12] CPU credit is 62.5720 GHz-days.
[Comm thread Nov 5 01:12] Done communicating with server.
[Worker #1 Nov 5 01:12] Iteration: 41900000 / 42454033 [98.69%], ms/iter: 31.639, ETA: 04:52:09
[Worker #4 Nov 5 01:12] Iteration: 42300000 / 42454177 [99.63%], ms/iter: 31.541, ETA: 01:21:02
[Worker #2 Nov 5 01:20] Iteration: 42300000 / 42454169 [99.63%], ms/iter: 29.410, ETA: 01:15:34
[Worker #1 Nov 5 01:33] Iteration: 41950000 / 42454033 [98.81%], ms/iter: 25.525, ETA: 03:34:25
[Worker #4 Nov 5 01:34] Iteration: 42350000 / 42454177 [99.75%], ms/iter: 25.518, ETA: 00:44:18
[Worker #2 Nov 5 01:41] Iteration: 42350000 / 42454169 [99.75%], ms/iter: 25.511, ETA: 00:44:17
[Worker #1 Nov 5 01:55] Iteration: 42000000 / 42454033 [98.93%], ms/iter: 25.548, ETA: 03:13:19
[Worker #4 Nov 5 01:55] Iteration: 42400000 / 42454177 [99.87%], ms/iter: 25.548, ETA: 00:23:04
[Worker #2 Nov 5 01:02] Iteration: 42400000 / 42454169 [99.87%], ms/iter: 25.575, ETA: 00:23:05
[Worker #3 Nov 5 01:12] Resuming.
[Worker #3 Nov 5 01:12] No work to do at the present time. Waiting.
[Worker #1 Nov 5 01:16] Iteration: 42050000 / 42454033 [99.04%], ms/iter: 25.598, ETA: 02:52:22
[Worker #4 Nov 5 01:16] Iteration: 42450000 / 42454177 [99.99%], ms/iter: 25.588, ETA: 00:01:46
[Worker #4 Nov 5 01:18] M42454177 is not prime. Res64: 75C26583462ECFFD. We8: 4BCBB21D,24876058,00000000
[Worker #4 Nov 5 01:18] No work to do at the present time. Waiting.
[Comm thread Nov 5 01:18] Sending result to server: UID: Dubslow/Guilty-Spark, M42454177 is not prime. Res64: 75C26583462ECFFD. We8: 4BCBB21D,24876058,00000000, AID: 508690EF92F14B3239659DA73E9CC91D
[Comm thread Nov 5 01:18]
[Comm thread Nov 5 01:18] PrimeNet success code with additional info:
[Comm thread Nov 5 01:18] LL test successfully completes double-check of M42454177 --
[Comm thread Nov 5 01:18] CPU credit is 62.5757 GHz-days.
[Comm thread Nov 5 01:18] Done communicating with server.
[Worker #2 Nov 5 01:23] Iteration: 42450000 / 42454169 [99.99%], ms/iter: 24.285, ETA: 00:01:41
[Worker #2 Nov 5 01:24] M42454169 is not prime. Res64: 5E28ADCEE818D89D. We8: BED9A63C,34428640,00000000
[Worker #2 Nov 5 01:24] No work to do at the present time. Waiting.
[Comm thread Nov 5 01:24] Sending result to server: UID: Dubslow/Guilty-Spark, M42454169 is not prime. Res64: 5E28ADCEE818D89D. We8: BED9A63C,34428640,00000000, AID: 55BF2EA00C479FA6554382767795CB26
[Comm thread Nov 5 01:24]
[Comm thread Nov 5 01:24] PrimeNet success code with additional info:
[Comm thread Nov 5 01:24] LL test successfully completes double-check of M42454169 --
[Comm thread Nov 5 01:24] CPU credit is 62.5757 GHz-days.
[Comm thread Nov 5 01:24] Done communicating with server.
[Worker #1 Nov 5 01:31] Iteration: 42100000 / 42454033 [99.16%], ms/iter: 18.304, ETA: 01:48:00
[Worker #1 Nov 5 01:44] Iteration: 42150000 / 42454033 [99.28%], ms/iter: 15.197, ETA: 01:17:00
[Worker #1 Nov 5 01:57] Iteration: 42200000 / 42454033 [99.40%], ms/iter: 15.246, ETA: 01:04:32
[Worker #1 Nov 5 02:09] Iteration: 42250000 / 42454033 [99.51%], ms/iter: 15.278, ETA: 00:51:57
[Worker #3 Nov 5 02:12] Resuming.
[Worker #3 Nov 5 02:12] No work to do at the present time. Waiting.
[Worker #4 Nov 5 02:18] Resuming.
[Worker #4 Nov 5 02:18] No work to do at the present time. Waiting.
[Worker #1 Nov 5 02:22] Iteration: 42300000 / 42454033 [99.63%], ms/iter: 15.284, ETA: 00:39:14
[Worker #2 Nov 5 02:24] Resuming.
[Worker #2 Nov 5 02:24] No work to do at the present time. Waiting.
[Worker #1 Nov 5 02:35] Iteration: 42350000 / 42454033 [99.75%], ms/iter: 15.272, ETA: 00:26:28
[Worker #1 Nov 5 02:47] Iteration: 42400000 / 42454033 [99.87%], ms/iter: 15.259, ETA: 00:13:44
[Worker #1 Nov 5 03:00] Iteration: 42450000 / 42454033 [99.99%], ms/iter: 15.246, ETA: 00:01:01
It's a shame that the second and third workers finished with less than one update-gap's worth of time between them, but still: three workers achieve 93% of the throughput of four, and one worker alone achieves a hair over half the throughput of four workers.
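Those percentages can be sanity-checked with a quick script; the ms/iter values are pulled from the log above, everything else is plain arithmetic:

```python
# Aggregate LL throughput, in iterations/ms, for each worker count,
# using the per-worker ms/iter values logged above.
ms_per_iter = {4: 31.6, 3: 25.5, 1: 15.25}   # workers running -> ms/iter each
throughput = {n: n / ms for n, ms in ms_per_iter.items()}

for n in (4, 3, 1):
    pct = throughput[n] / throughput[4]
    print(f"{n} worker(s): {throughput[n]:.4f} iter/ms ({pct:.0%} of 4 workers)")
```

That prints 93% for three workers and 52% for a single worker, matching the figures above.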

What the heck? Is the memory really that bad? Is AVX really that good? I've been running two Sandy Bridge boxen with the same memory quantity, speed, and channel count for six years now (on and off), and I never thought it was this bad. Maybe the hyperthreaded box would show differently.

I wonder how two cores per worker would fare? Presumably better? I have since shifted the box to its other work, so I won't be running GIMPS for several weeks, but seriously, what the heck is going on?
Old 2017-11-05, 14:19   #2
Mark Rose
 
 
"/X\(‘-‘)/X\"
Jan 2013

32·11·29 Posts
Default

I'm also running an i5-2500K, but with 4 sticks of dual rank DDR3-1600, configured for a single worker:

Quote:
[Work thread Nov 5 08:57] Iteration: 30740000 / 41532151 [74.01%], ms/iter: 4.512, ETA: 13:31:38
The exponent is a little smaller, and it's using a 2240K FFT. From what I recall, 4 workers give a little more throughput on Sandy Bridge, but I prefer to finish assignments quickly.
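The quoted ETA is just the remaining iterations times the per-iteration time; a quick check (my arithmetic, using the numbers from the log line above):

```python
# Remaining iterations x per-iteration time should reproduce the logged ETA.
total, done, ms_per_iter = 41_532_151, 30_740_000, 4.512
secs = (total - done) * ms_per_iter / 1000
hours, rem = divmod(int(secs), 3600)
minutes, seconds = divmod(rem, 60)
print(f"ETA ~ {hours}:{minutes:02d}:{seconds:02d}")  # within seconds of the logged 13:31:38
```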

From another thread:

Quote:
Originally Posted by Prime95 View Post
That does look tasty!! I presume the 16 float-ops vs 8 in Sandy/Ivy Bridge comes from counting FMA as 2 float-ops. So, I'll need to get crackin' on FMA coding once I get a Haswell chip. They've also cut down on the bottlenecks to L1 and L2.

That leaves our main memory bottleneck. In fact, the improvements above will make that bottleneck even more glaring. The CPU really needs a third memory channel.
So main memory bandwidth has been an issue for a long time. It's before my time, but I believe Nehalem, with its three memory channels, was the last CPU that didn't suffer from a memory bottleneck. Perhaps one of the Mersenne elders can provide some enlightenment there.
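For a rough sense of scale (the FFT length and pass count below are my assumptions, not measurements from the thread), the traffic four workers generate can be compared against the theoretical dual-channel DDR3-1333 peak:

```python
# Very rough main-memory traffic estimate; the FFT length and the number of
# passes per iteration are assumptions, not measurements.
fft_doubles = 2304 * 1024        # assumed FFT length for a ~42.45M exponent
passes_per_iter = 4              # assumed: forward + inverse, read + write
bytes_per_iter = fft_doubles * 8 * passes_per_iter   # 8-byte doubles

workers, ms_per_iter = 4, 31.6   # from the first post's log
demand_gbs = workers * bytes_per_iter / (ms_per_iter / 1000) / 1e9
peak_gbs = 2 * 8 * 1.333e9 / 1e9    # dual-channel DDR3-1333 theoretical peak
print(f"~{demand_gbs:.1f} GB/s demanded vs ~{peak_gbs:.1f} GB/s theoretical peak")
```

Sustained bandwidth for FFT-style access patterns is typically well below the theoretical peak, so even this crude estimate puts four workers into bottleneck territory.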
Old 2017-11-05, 14:20   #3
VictordeHolland
 
 
"Victor de Hollander"
Aug 2011
the Netherlands

2×587 Posts
Default

I'm running dual-channel DDR3-2133 on my 2500K @ 4.0 GHz, and even then it is memory-bottlenecked.

Not the fastest or most efficient desktop anymore, but at 130 W it makes a nice space heater for the room in winter.
Old 2017-11-05, 15:04   #4
henryzz
Just call me Henry
 
 
"David"
Sep 2007
Cambridge (GMT/BST)

2×3×5×191 Posts
Default

Memory bandwidth has been an issue since the first Core 2 Quads. My old Q6600 could only manage around 3× the throughput of one core with 1066 MHz DDR2.
Old 2017-11-06, 16:51   #5
Madpoo
Serpentine Vermin Jar
 
 
Jul 2014

29×113 Posts
Default

Quote:
Originally Posted by henryzz View Post
Memory bandwidth has been an issue since the first Core 2 Quads. My old Q6600 could only manage around 3× the throughput of one core with 1066 MHz DDR2.
Yeah, that goes along with my server/Xeon observations too... on new as well as older gen CPUs.

You could predict that the slower the memory, the more slowdown you'll see with multiple workers on a single memory bus, and that does seem to play out in the real world.

A while back I was running a few dual-socket Xeon 5160 servers (Woodcrest-vintage CPUs), and I think my memory config had them running at 800 MHz, even less than the 1333 MHz those CPUs can support (I was probably running 3 DIMMs per channel, which lowers the available speed).

This was in my early days of trying out different configs, but it showed that memory penalty pretty clearly when running multiple workers. It was around that time I decided it would just be easier overall to run one worker using all the cores of a single socket (NUMA node). If the system has 2 CPUs, each CPU has its own bank of RAM, and if you're using a NUMA-aware operating system, it should assign a process memory from the bank belonging to the CPU that process is affined to.

In practice I can't be totally sure that Prime95, even with the affinity I force on it, is actually using memory from the proper bank, but I'm fairly confident it's working as expected: I don't typically see any difference in per-iteration times whether one or both workers are running, so I don't think the QPI (QuickPath Interconnect, the bus between the CPUs) is being flooded with "wrong-way" memory accesses.
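One way to check this on Linux (my sketch, not something from the thread): the kernel reports per-node page counts for each mapping in /proc/&lt;pid&gt;/numa_maps, so you can tally where a worker's memory actually lives:

```python
# Sketch: tally resident pages per NUMA node for a process by parsing the
# N<node>=<pages> fields of /proc/<pid>/numa_maps (Linux-only; see proc(5)).
import re
from collections import Counter

NODE_FIELD = re.compile(r"N(\d+)=(\d+)")

def pages_per_node(pid):
    counts = Counter()
    with open(f"/proc/{pid}/numa_maps") as f:
        for line in f:
            for node, pages in NODE_FIELD.findall(line):
                counts[int(node)] += int(pages)
    return counts
```

Run it with the Prime95 worker's PID; if the affinity trick is working, nearly all pages should land on the node the worker is affined to.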

Anyway... point being, memory contention has indeed always been an issue with multiple workers. I'm pretty sure that ever since the dawn of multi-core CPUs, CPU speed has outpaced memory speed in terms of what Prime95 is doing.
Old 2017-11-16, 19:50   #6
aurashift
 
Jan 2015

11×23 Posts
Default

Quote:
Originally Posted by Madpoo View Post

Anyway... point being, memory contention has indeed always been an issue with multiple workers. I'm pretty sure that ever since the dawn of multi-core CPUs, CPU speed has outpaced memory speed in terms of what Prime95 is doing.
The 600-series chipsets are triple-channel memory, right? Wasn't quad-channel coming out at some point?