mersenneforum.org  

2017-11-30, 18:50   #1
diep
Sep 2006
The Netherlands
327₁₆ Posts
ivy bridge versus haswell

Hello,

It's been some years since I looked at Intel Xeon CPUs, and prices on eBay have now dropped for some of them.

Clock for clock, running Woltman's great AVX versus AVX2 code, what's the speed difference between Sandy Bridge (Ivy Bridge) and Haswell?

I see that Haswell with AVX2 has FMA3 and 2 ports that can do FMA, whereas Sandy Bridge/Ivy Bridge have no FMA and can do 1 AVX multiply plus 1 AVX add per clock, and AVX2 adds a bunch of instructions for gathering data. But I'm interested in hearing the difference clock for clock, since with most results you find online you don't know whether they tested 1 highly overclocked and turbo-boosted core against something duck slow and busy swapping :)

I should add: this is to test with LLR, roughly 4M - 8M bits.

Last fiddled with by diep on 2017-11-30 at 18:54

2017-11-30, 19:56   #2
Mark Rose
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/
2⁴×199 Posts

Haswell is something like 30% faster, clock for clock, provided you have the memory bandwidth to keep up.

2017-11-30, 20:22   #3
diep
Sep 2006
The Netherlands
1447₈ Posts

Quote:
Originally Posted by Mark Rose
Haswell is something like 30% faster, clock for clock, provided you have the memory bandwidth to keep up.

Oh boy, that's a lot!

2017-12-01, 12:24   #4
mackerel
Feb 2016
UK
2⁶·7 Posts

Peak IPC going from Sandy Bridge to Haswell is 50% faster in my testing.

What sort of FFT size are we looking at here? You will likely struggle with ram limiting on dual channel models with slow DDR3. Depending on the Xeon type, if they use quad channel DDR4 that would be quite nice. I was looking at the E5-2666 v3 on ebay last night as possibly the best balance between price, cores and clocks, although compatible mobos seem to be hard to find at a good price nowadays. I also note the i7-5820k are finally dropping in price and could be interesting also.

Actually, since LLR was mentioned, using it in multi-thread mode can help depending on FFT size.

2017-12-01, 13:34   #5
diep
Sep 2006
The Netherlands
3·269 Posts

Yeah, I bought a Titan-Z from Nvidia to toy with here, to get FFT working for k * 2^n +- c transforms. Most of it will be testing to see whether it's possible to get the bandwidth, though.

There are 2 problems. The first problem, which CPUs do not share because of their caches, is reordering the data. What sits at limb number binary 0b0000001010001, for example, has to go to 0b1000101000000 and vice versa (the index with its bits reversed). That's one of the major problems to solve on a GPU for transforms that do not fit inside the tiny L1 datacache of a GPU.
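The index swap in that example matches the bit-reversal permutation used by power-of-2 FFTs. Reading it that way is an assumption based on the two indices given, and the sketch below is only an illustration in Python with hypothetical helper names (`bit_reverse`, `bit_reversal_permute`), not code from any of the programs discussed here.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Return `index` with its lowest `bits` bits in reversed order."""
    reversed_index = 0
    for _ in range(bits):
        # Shift the result left and append the current lowest bit of index.
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


def bit_reversal_permute(data: list) -> list:
    """Return a copy of `data` reordered by bit-reversed index.

    The length must be a power of 2, as in a radix-2 FFT.
    """
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    bits = n.bit_length() - 1
    return [data[bit_reverse(i, bits)] for i in range(n)]


# The 13-bit limb index from the post and the position it swaps with.
src = 0b0000001010001
dst = bit_reverse(src, 13)
print(f"{src:013b} <-> {dst:013b}")
```

Neighbouring source indices land far apart after the swap, which is why this step is painful when the data does not fit in a fast on-chip memory.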

That cache is just a meager 48KB or so. Divided by, say, at least 8 warps (groups of 32 threads running simultaneously on the same SM), you have at most 6KB for a vector of 32 CUDA cores. In practice, with a power of 2, that's just 512 doubles, or 4KB (which allows 12 warps).

With those 512 doubles, which you have to share among 32 CUDA cores, it's all about how to use the GDDR5 bandwidth efficiently.
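The per-warp budget above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the numbers from the post (48KB per SM, 8 warps, 32 threads per warp); the constant and function names are made up for illustration.

```python
SHARED_BYTES_PER_SM = 48 * 1024   # "48KB or so", taken from the post
BYTES_PER_DOUBLE = 8
WARP_SIZE = 32                    # threads per warp


def largest_power_of_two_doubles(budget_bytes: int) -> int:
    """Largest power-of-2 count of doubles that fits in budget_bytes."""
    doubles = budget_bytes // BYTES_PER_DOUBLE
    return 1 << (doubles.bit_length() - 1)


budget_per_warp = SHARED_BYTES_PER_SM // 8                  # 8 warps share the SM
tile_doubles = largest_power_of_two_doubles(budget_per_warp)
tile_bytes = tile_doubles * BYTES_PER_DOUBLE
max_warps = SHARED_BYTES_PER_SM // tile_bytes               # warps that fit at this tile size
doubles_per_thread = tile_doubles // WARP_SIZE

print(budget_per_warp, tile_doubles, tile_bytes, max_warps, doubles_per_thread)
```

With these inputs each thread ends up with only 16 doubles of fast storage, which is the point the post is making.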

If I ignore the manufacturer cheating and use diepeveen double precision gflops, a single chip in the Titan-Z (it has 2 GPUs on each card) delivers roughly: 0.7GHz * 15 SMs * 64 doubles a clock = 672 gflops (without cheating; with cheating, where a fused multiply-add counts as 2 flops, multiply it by 2 or so).

The bandwidth of the GDDR5 is 336.5GB/s. Maybe 80% of that is usable in practice, which is roughly 0.8 * 336.5 = 270GB/s.

That gives an efficiency need of 672 / 270 = 2.5 flops per byte on this Nvidia Titan-Z GPU.
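The GPU figures above can be reproduced as follows. This is only a sketch of the post's own arithmetic: the 80% usable-bandwidth fraction is the poster's guess, and the function names are hypothetical.

```python
def peak_gflops(clock_ghz: float, units: int, doubles_per_clock: int) -> float:
    """Peak double precision rate, counting one operation per double per clock."""
    return clock_ghz * units * doubles_per_clock


def flops_per_byte(gflops: float, bandwidth_gb_s: float, usable: float = 0.8) -> float:
    """Flops that must be done per byte moved to keep the chip busy.

    gflops divided by GB/s leaves plain flops per byte.
    """
    return gflops / (bandwidth_gb_s * usable)


gpu_gflops = peak_gflops(0.7, 15, 64)            # one GPU of the Titan-Z
gpu_ratio = flops_per_byte(gpu_gflops, 336.5)    # GDDR5 at 336.5GB/s, 80% usable

print(round(gpu_gflops, 1), round(gpu_ratio, 2))
```

The exact usable bandwidth is 269.2GB/s rather than the rounded 270GB/s, so the ratio comes out marginally under 2.5.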

If we compare with a CPU, it's very impressive what the latest generation Xeons deliver there. They have made up a lot of ground compared to GPUs in the last years.

I was looking at building a 2-socket Haswell. It seems you need 4 RDIMMs at 2133MHz (well, that's what it's called; the actual clock is of course a lot lower) for the 4 channels. In Johan de Gelas's test you can see that putting in 1 DDR4 RDIMM per channel is fastest for maximum bandwidth, provided you do not need overly much RAM.

90GB/s roughly is very impressive there from a single socket CPU.

The 12-core CPUs are sometimes a bit cheaper on eBay, as opposed to 4k dollar a chip for the 18-core things.

I see there are 3 ports on the Haswell. I didn't check how many uops a Haswell AVX2 instruction eats; I'm assuming that somehow you can get to 3 AVX2 instructions per clock.

In diepeveen double precision throughput, as compared to the Nvidia chip, that gives: 3 * 4 = 12 doubles per core per clock tick.

So a single chip then delivers:

12 cores * 2.2GHz * 12 = 316.8 gflops double precision for a 12-core chip at 2.2GHz. Very impressive.

Regrettably, the cheapest motherboard I saw that I would find acceptable is a Supermicro X10DRi, though I do not know the memory bandwidth it can achieve.

Those Intel test machines are, with 100% certainty, always overclocked major league everywhere. I remember how Johan de Gelas, at the start of this century, also had new P4s with Diep running on them, and it took me many years until I could reproduce his results. As it turned out, I could only reproduce them, years and years later, on P4s with an overclocked memory controller (FSB).

Yet if I apply the same math as for the Nvidia: 316.8 gflops / (45GB/s * 80%) = 8.8 flops per byte on the Haswell Xeons. In practice it will be more than that, as he achieved that 90GB/s only with 18-core Xeons (and with 2 of them, not 1, hence the edit to 8.8 flops per byte).

Yet even 9 flops per byte of DDR4 bandwidth is very impressive from the CPU hardware. It's very easy there to toy with the caches, which the Nvidia hardly has.
Please note these are diepeveen gflops, not the cheated gflops from the manufacturer. They multiply everything by 2 with a lame excuse.
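The Xeon side of the comparison works out the same way. The sketch below assumes, as the post does, 3 AVX2 instructions per clock on 4 doubles each (an assumption, not a measured figure), 45GB/s per socket and 80% usable bandwidth; all names are hypothetical.

```python
AVX_LANES_DOUBLE = 4          # a 256-bit register holds 4 doubles
INSTRUCTIONS_PER_CLOCK = 3    # the post's assumption, not a measurement


def cpu_peak_gflops(cores: int, clock_ghz: float) -> float:
    """Peak doubles processed per nanosecond across all cores."""
    return cores * clock_ghz * INSTRUCTIONS_PER_CLOCK * AVX_LANES_DOUBLE


def flops_per_byte(gflops: float, bandwidth_gb_s: float, usable: float = 0.8) -> float:
    """Flops needed per byte of memory traffic to keep the cores busy."""
    return gflops / (bandwidth_gb_s * usable)


xeon_gflops = cpu_peak_gflops(cores=12, clock_ghz=2.2)
xeon_ratio = flops_per_byte(xeon_gflops, bandwidth_gb_s=45.0)

print(round(xeon_gflops, 1), round(xeon_ratio, 1))
```

A higher ratio means more work has to be done out of the caches per byte fetched from DRAM, so the result is quite sensitive to the per-socket bandwidth figure that is plugged in.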

On Nvidia you get a couple of dozen doubles in the L1 datacache and a handful of registers, versus all those great L1, L2 and L3 caches on the Xeons.

I would argue x64 strikes back.

Edit: it seems you need 4 DIMMs per socket for DDR4 with 4 channels, which is of course 8 DIMMs for a 2-socket Haswell Xeon box. Not cheap at 51 dollar for a registered ECC DIMM of 4GB. That's 400+ dollar.

Last fiddled with by diep on 2017-12-01 at 13:45

2017-12-01, 14:10   #6
diep
Sep 2006
The Netherlands
3·269 Posts

Quote:
Originally Posted by mackerel
Peak IPC going from Sandy Bridge to Haswell is 50% faster in my testing.

What sort of FFT size are we looking at here? You will likely struggle with ram limiting on dual channel models with slow DDR3. Depending on the Xeon type, if they use quad channel DDR4 that would be quite nice. I was looking at the E5-2666 v3 on ebay last night as possibly the best balance between price, cores and clocks, although compatible mobos seem to be hard to find at a good price nowadays. I also note the i7-5820k are finally dropping in price and could be interesting also.

Actually, since LLR was mentioned, using it in multi-thread mode can help depending on FFT size.

Your system will eat 600 watt or so if you use a good PSU and run those at full throttle. Most engineering samples use more power than the TDP of the release version.

On the other hand, there are also 12-core Xeon ES chips on eBay, which have 30MB L3 cache as opposed to 25MB L3 cache for the 10-core Xeons. That will for sure give greater speed, depending upon transform size.

Also, with watercooling, which doesn't cost much, it's easy to take those 12-core engineering samples from 2.0 / 2.2GHz to 2.4GHz. Above that I wouldn't really go.

With watercooling you can easily keep the CPU at room temperature. Those CPUs easily consume 10% less power at room temperature than at 50C+.

In short, you can clock them a little higher, though I have no idea which dual socket motherboard for Haswell allows that :)

The GHz they show in a screenshot could come from a really expensive overclockers' motherboard or a special Intel motherboard for special testing purposes. I'm not sure of that (well, that's how it was some years ago).

More L3 cache is a huge advantage for the bit sizes of a couple of million of the Riesels.

P.S. I'm using these to watercool CPUs under 100 watt. Do not look at the TDP numbers of the v3 Haswells; they consume power as if it is for free :)
For those you need to move to bigger blocks than this, provided you have a slow waterpump.
https://www.ebay.com/itm/New-Water-C...MAAOSwSypY~btk

On the 2666 v3 engineering sample you need to remove at least 250 watt from each CPU running Woltman's DWT codes...

Last fiddled with by diep on 2017-12-01 at 14:17

2017-12-02, 01:42   #7
Mark Rose
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/
2⁴×199 Posts

I'm still of the opinion that the i5-7500 or i3-8100 (when the non-Z motherboards are out) are the way to go, simply because they consume so much less power.

2017-12-02, 01:46   #8
LaurV
Romulan Interpreter
"name field"
Jun 2011
Thailand
41·251 Posts

I didn't read all the posts yet, just the first. That is because they are quite long; I will leave the topic open and read them during the day. But you seem to be trying to beat me in the length of posts, haha. My boss always complains that my emails are too long: I give too many details and he has no time to read them.

Anyhow, I only wanted to say that if you go for an Ivy, make sure it is not one that would need delidding. Most of them do; they are not suitable for the "intensive heat" activities that people on and around this forum do.

2017-12-02, 09:15   #9
diep
Sep 2006
The Netherlands
3·269 Posts

Quote:
Originally Posted by Mark Rose
I'm still of the opinion that the i5-7500 or i3-8100 (when the non-Z motherboards are out) are the way to go, simply because they consume so much less power.

Let's look at the system I had in mind, which costs 1200 with registered ECC DDR4 RAM and ES processors. Under full load with watercooling I guess, based upon the watt numbers Johan de Gelas posted (measured at the wall), 410 watt.

That's Haswell, 24 cores @ 2.2GHz, with 8 sticks of registered ECC DDR4 RAM,
which is 24 * 2.2 = 52.8GHz in total.

Your suggestion, the i5-7500, is 3.4GHz * 4 cores = 13.6GHz.

As for power, I estimate the 2-socket Xeon system under full load with water cooling at 410 watt at the wall.

That means that, powerwise, the i5-7500 should eat less than:

410 * 13.6 / 52.8 = 105.6 watt at the wall.
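That break-even figure can be recomputed with the short sketch below. It follows the post in treating cores times GHz as the measure of work, which assumes equal throughput per clock on both chips; the 410 watt input is the poster's estimate, and the function names are hypothetical.

```python
def aggregate_ghz(cores: int, clock_ghz: float) -> float:
    """Crude capacity measure used in the post: cores times clock."""
    return cores * clock_ghz


def break_even_watts(ref_watts: float, ref_ghz: float, candidate_ghz: float) -> float:
    """Wall power at which the candidate matches the reference in watt per GHz."""
    return ref_watts * candidate_ghz / ref_ghz


xeon_ghz = aggregate_ghz(cores=24, clock_ghz=2.2)   # dual 12-core Haswell
i5_ghz = aggregate_ghz(cores=4, clock_ghz=3.4)      # i5-7500
limit = break_even_watts(410.0, xeon_ghz, i5_ghz)

print(round(xeon_ghz, 1), round(i5_ghz, 1), round(limit, 1))
```

Any i5-7500 box drawing more than this at the wall under full load would be less efficient than the dual Xeon by this measure.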

Under full load, no way you get that. A few fans and a motherboard with DDR4 are already going to eat more than that, and PSUs are notoriously inefficient at 100 watt.

Years ago it took me ages to find a PSU that was efficient around 170 watt...
Yes, that was an 80+ Gold PSU rated at 350 watt. You would need a Gold PSU rated at 200 watt, as they reach their rated efficiency at 50% load.

Equipping it with DDR4 still comes at the high price of 110 dollar or so for 2 sticks of DDR4 RAM. That's without ECC, and I like to have ECC on my RAM, except for GDDR5, which has its own CRC.

Years ago the ECC RAM for my Xeons cost 4-6 dollar a gigabyte; for registered ECC DDR4 the cheapest offer I can find now is nearly 13 dollar a gigabyte.

L3 math, or SRAM math (as most L3 is similar to SRAM): those Haswell Xeons have 30MB L3 cache, so with multithreading pretty large transforms fit inside.

L3 math for the i5-7500: it has only 6MB L3. Only the very smallest transforms fit inside the L3 cache...

Pricewise it's also doubtful that you can find a Gold rated PSU with 80%+ efficiency (Gold reaches its rated efficiency of around 90% at 50% load), times 4, that's going to be cheaper than the single Gold rated PSU that's going to run the show for the dual Xeon system.

Yet the price of building the system is the only thing where 4 of the i5-7500 might be close to the dual socket system. And running on 4 machines at the same time is a pain....

2017-12-02, 09:23   #10
diep
Sep 2006
The Netherlands
3×269 Posts

Quote:
Originally Posted by LaurV
I didn't read all the posts yet, just the first. That is because they are quite long, I will let the topic open and read them during the day, but you seem to try to beat me in the length of posts, haha, my boss always complains that my emails are too long, I give to many details and he has no time to read them.

Anyhow, I only wanted to say that if you go for an ivy, make sure that is not one that would need delidding. Most of them do, they are not suitable for "intensive heat" activities that people on and around this forum do.

Thanks for the tip, LaurV. I agree with you that Ivy doesn't seem interesting at all.

Right now I'm keeping my money in my pocket until I have sold some 3D printers (the printer still has to be released). Yet I'm really amazed at how fast Haswell is from a crunching perspective, and at the many offers on eBay for Xeons which some years ago would've cost a fortune online. My computers here are getting really outdated (L5420s), especially for CAD. So let's see how long I can keep up the money saving on a CPU system, being a GPGPU enthusiast :)

In total I overpaid for the Titan-Z (of course it was in the Netherlands itself that I got swindled, as they do not have eBay over here but a marketplace that is not so funny). Yet I want to toy with it to see how much I can limit the overhead for FFT.

It will take some time, and it's not certain I'm going to manage a high bandwidth out of the GDDR5 with FFT/DWT :)

Until then, for sure a Haswell would run what I run right now a lot, lot faster... ...even the i5-7500 system would. And we haven't yet had the discussion that an FFT of mine would initially be a power-of-2 FFT, so it wouldn't be useful for testing a single k systematically, only some fixed k's :)

2017-12-02, 09:40   #11
mackerel
Feb 2016
UK
2⁶×7 Posts

Quote:
Originally Posted by diep
Your system will eat 600 watt or so if you use good PSU and run those at full throttle. Most engineering samples use more power than the TDP of the release.

While I haven't tried it, I very strongly doubt it'll go anywhere near that power level, even if you had two of them. Based on scaling from multiple consumer desktop CPUs, in the worst case you might get 200W apiece, but I think that unlikely.

I previously got a 14 core ES to play with, and it is not hot, although it differs from the retail equivalent in that it has slightly lower clocks. Cooling is more than catered for by a small air cooler: http://noctua.at/en/nh-u9dx-i4 - note I consider that a small cooler compared to typical high end air cooling, as I use the D14 and D15 in my desktop builds where more cooling makes more difference. While I have measured power consumption of that Xeon in the past, I can't recall the exact value but I'm pretty sure the total system power when running P95-like workloads (more usually LLR) was well below 200W. I don't see numbers much bigger than that unless doing extreme overclocking at 1.5V+ through a CPU. Not recommended for power efficiency for obvious reasons.

On L3 cache, to my observations it looks like an enough/not enough thing. Once you have enough L3 for the work, having more doesn't make any difference. The cache per core tends to follow a trend within a product family, around 2.5MB/core for E5 Xeon, 2.0MB/core on i7, 1.5MB/core on i5. With so many products, there are exceptions, but it works as a generalisation. I'd also caution that multi-thread scaling is less than perfect, especially at higher core counts.
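The enough/not enough observation can be turned into a rough check. The per-core cache figures below are the generalisation from the post; the working-set estimate is a loud assumption, using a hypothetical 18 bits of the number stored per double and ignoring lookup tables and rounding up to supported FFT lengths, so treat the output as a ballpark only.

```python
MB = 1024 * 1024

# L3 per core by product family, as generalised in the post.
L3_MB_PER_CORE = {"E5 Xeon": 2.5, "i7": 2.0, "i5": 1.5}

# Hypothetical packing density; real software varies this with FFT size.
ASSUMED_BITS_PER_DOUBLE = 18


def l3_size_mb(family: str, cores: int) -> float:
    """Approximate total L3 for a chip of the given family and core count."""
    return L3_MB_PER_CORE[family] * cores


def fft_working_set_mb(number_bits: int) -> float:
    """Very rough size of the FFT data for a number of number_bits bits."""
    words = -(-number_bits // ASSUMED_BITS_PER_DOUBLE)   # ceiling division
    return words * 8 / MB


def fits_in_l3(family: str, cores: int, number_bits: int, concurrent_tests: int = 1) -> bool:
    """True if all concurrently running tests fit in L3 under these assumptions."""
    return fft_working_set_mb(number_bits) * concurrent_tests <= l3_size_mb(family, cores)


bits = 8 * 2**20   # upper end of the 4M - 8M bit range from the first post
print(round(fft_working_set_mb(bits), 2))
print(fits_in_l3("i5", 4, bits), fits_in_l3("i5", 4, bits, concurrent_tests=4))
print(fits_in_l3("E5 Xeon", 12, bits, concurrent_tests=4))
```

Under these assumptions a single multi-threaded test fits even on the i5, while one independent test per core does not, which is consistent with the earlier remark that multi-thread mode can help depending on FFT size.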

There was an earlier comment on working out instruction rates, I'd suggest looking at http://www.agner.org/optimize/ if not already. Particularly item 3, it goes into some low level description of how and what is done in modern CPUs.

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
