mersenneforum.org  

Old 2005-05-29, 19:55   #45
Dresdenboy
 
Apr 2003
Berlin, Germany

19² Posts

Quote:
Originally Posted by PhilF
To me, these are amazing differences.
As I already tried to explain, such differences (big plus on one side, big minus on another) are the result of the many variables, which changed with the processors (cache, processor core, clock frequency).

Although the smaller cache of the Northwood-based Celeron (see here for a comparison) needs fewer clock cycles per access, the clock speed penalty roughly cancels that out. But now comes the big change (look at the first graph shown on the linked page): less cache means more accesses to RAM, where the access latency is ~4 times as big as the L2 access latency in the graph (the long middle part, which is close to 20 cycles for the Northwood).

Now imagine your 128k Celeron having to wait 4-5 times longer for data that doesn't fit into the 128k L2 cache but still fits into the 256k L2 on the other Celeron. No data means waiting, at 2.8 GHz, while the other CPU has its data and can do something. Although it only runs at 1.7 GHz, it is doing something during that time. The reality is more complex, but at least this example should demystify these results. Well, ok, there is the video encoder. But as I said, it depends on the algorithm. If the video encoder only works with e.g. 100k of image data at once, it won't profit from more cache with higher latency, but from a faster core and a faster cache.
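
To put rough numbers on the waiting argument, here is a minimal toy model in Python. It reads the figures in the post as about 20 cycles for an L2 hit and about 4 times that for a RAM access; the hit rates (0.5 and 1.0) and the function name are made up purely for illustration and are not measurements of either Celeron.

```python
def avg_access_ns(clock_ghz, l2_cycles, ram_cycles, l2_hit_rate):
    """Average data access time in ns for a toy two-level model (L2, then RAM)."""
    avg_cycles = l2_hit_rate * l2_cycles + (1.0 - l2_hit_rate) * ram_cycles
    return avg_cycles / clock_ghz  # cycles divided by cycles-per-ns gives ns

L2_CYCLES = 20               # "close to 20 cycles" in the post
RAM_CYCLES = 4 * L2_CYCLES   # "~4 times" the L2 latency in the post

# Hypothetical hit rates: the working set half misses the 128k cache
# but fits entirely into the 256k cache.
small_cache_fast_clock = avg_access_ns(2.8, L2_CYCLES, RAM_CYCLES, l2_hit_rate=0.5)
big_cache_slow_clock = avg_access_ns(1.7, L2_CYCLES, RAM_CYCLES, l2_hit_rate=1.0)

print(f"128k @ 2.8 GHz: {small_cache_fast_clock:.1f} ns per access")
print(f"256k @ 1.7 GHz: {big_cache_slow_clock:.1f} ns per access")
```

Under these assumed hit rates the slower-clocked CPU gets its data sooner on average. Change the hit rates and the picture flips, which is the "it depends on the algorithm" point.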
Old 2005-05-30, 01:19   #46
PhilF
 
"6800 descendent"
Feb 2005
Colorado

1342₈ Posts

Quote:
Originally Posted by Dresdenboy
As I already tried to explain, such differences (big plus on one side, big minus on another) are the result of the many variables, which changed with the processors (cache, processor core, clock frequency).
Thanks for the explanation. I understand the mechanisms causing the differences. I am simply amazed at the amount of the difference.

I have a new motherboard on the way that uses DDR memory, which I plan on using to experiment with the 128K Celeron. I hope to get the iteration time down to where it should be based on others' benchmarks.

This stuff is fun!
Old 2005-05-30, 03:29   #47
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

2³×1,021 Posts

PhilF,

Preliminary results with the 10 level 2nd pass show that the 128KB L2 cache will never be fast for the 1024K FFT and higher.

A kind user has set up a 128KB Celeron for me to SSH into. At 2.0 GHz I cannot get the 1024K FFT below 75 ms. My 2.0 GHz Northwood is about 47 ms. The penalty will be worse for larger FFTs.

P.S. This 2.0 GHz machine is 161 ms for a 1792K FFT - and that is using the existing 11 level 2nd pass.
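
For a feel of what 75 ms versus 47 ms per iteration means in practice, here is a rough sketch in Python. A Lucas-Lehmer test of exponent p takes p - 2 iterations; the exponent used below is a hypothetical round number chosen only for illustration, and the function name is made up.

```python
SECONDS_PER_DAY = 86_400

def ll_test_days(exponent, ms_per_iteration):
    """Estimated wall-clock days for one Lucas-Lehmer test (p - 2 iterations)."""
    iterations = exponent - 2
    return iterations * ms_per_iteration / 1000.0 / SECONDS_PER_DAY

EXPONENT = 20_000_000  # hypothetical exponent, not taken from the thread

celeron_days = ll_test_days(EXPONENT, 75.0)    # 128KB Celeron, 1024K FFT
northwood_days = ll_test_days(EXPONENT, 47.0)  # 2.0 GHz Northwood, 1024K FFT

print(f"Celeron:   {celeron_days:.1f} days")
print(f"Northwood: {northwood_days:.1f} days")
print(f"Penalty:   {celeron_days / northwood_days:.2f}x")
```

The penalty factor is just 75/47, about 1.6, whatever the exponent. The day counts only show how that ratio adds up over a full test.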

Last fiddled with by Prime95 on 2005-05-30 at 03:32
Old 2005-05-30, 12:35   #48
db597
 
Jan 2003

2·103 Posts

PhilF, Prime95 is very specialised software that has been highly optimised around certain instructions and cache sizes. So it isn't representative of the CPU's performance in average real-world applications (and vice versa).

In fact, this isn't the only time I've seen a large deviation in performance due to cache size. For the work I do at college, the software I'm working on depends very much on cache because of the large matrix sizes it processes. Here are the rough numbers I get:

P4 Northwood 512 KB @ 2.0 GHz - 2.98 seconds / time step
P4 Prescott 1024 KB @ 3.4 GHz - 0.96 seconds / time step

So the Prescott is 3x faster, despite only a 70% increase in clock speed. I suspect anyone doing CFD (StarCD, Fluent) or FEA (Foam, Ansys) will find similar results when going up in cache size.
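
The arithmetic behind that comparison, as a short Python sketch using only the two timings quoted above (the variable names are just for illustration):

```python
# Figures quoted in the post above.
northwood = {"clock_ghz": 2.0, "sec_per_step": 2.98}
prescott = {"clock_ghz": 3.4, "sec_per_step": 0.96}

clock_ratio = prescott["clock_ghz"] / northwood["clock_ghz"]
speedup = northwood["sec_per_step"] / prescott["sec_per_step"]

# Whatever speed-up is left after dividing out the clock ratio must come
# from something other than frequency (cache size, core changes, ...).
beyond_clock = speedup / clock_ratio

print(f"clock ratio:         {clock_ratio:.2f}x")
print(f"measured speed-up:   {speedup:.2f}x")
print(f"not explained by clock: {beyond_clock:.2f}x")
```

The leftover factor cannot be pinned on the cache alone from two data points, since the core differs as well, but it shows how far the result is from simple clock scaling.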

Whatever the hardware review sites like Tomshardware, HardOCP et al say about the Prescott being no better than a Northwood doesn't apply here at all. Each CPU has its strengths and weaknesses - you really need to choose your processor depending on what application you intend to run on it. For 3D games it's Athlon64, for GIMPS it's P4 Northwood, for engineering/science it's P4 660 with 2MB cache!
Old 2005-05-30, 13:11   #49
Dresdenboy
 
Apr 2003
Berlin, Germany

551₈ Posts

@db597:
Is there some cache blocking involved or is everything left to the compiler (incl. prefetching)?

BTW, I don't agree with the Prescott being the optimal CPU for engineering/scientific purposes. But that also depends on more than just the type of CPU.
Old 2005-05-30, 14:47   #50
db597
 
Jan 2003

2×103 Posts

Everything is left to the compiler. There's no fancy MPI, OpenMP etc in this code I used for benchmarking. We use the Intel Fortran Compiler for Linux, so it should optimise well for both Northwood and Prescott.

In fact, the previous version of IFORT didn't have any Prescott support (v7 was released before the Prescott). Version 8 adds SSE3 instructions, but I don't use them. Even SSE2 is turned off, as we need double precision floating point. The compiler switches are all the same for both codes (-O2 only). As for prefetching, I think not. The IFORT man page for -O2 says:

· Inlining of intrinsics
· The following capabilities for performance gain: constant propagation, copy propagation, dead-code elimination, global register allocation, global instruction scheduling and control speculation, loop unrolling, optimized code selection, partial redundancy elimination, strength reduction/induction variable simplification, variable renaming, exception handling optimizations, tail recursions, peephole optimizations, structure assignment lowering and optimizations, and dead store elimination.

g77 (gcc v3.4) is much worse than IFORT. It gives about 50% slower time steps. You really can't beat Intel for compilers, they are excellent. IFORT gives even Athlons a pretty good boost compared to g77 (though naturally not as much of a boost as the P4s).

For uniprocessor development and small simulation runs, the P4 6XX are great (larger runs need MPI clusters and a domain decomposition approach). But of course, you're right that it depends heavily on what code you're running. CFD and FEA seem to be very cache dependent. I've run benchmarks on BOFFIN (LES) and FOAM - both show a non-linear speed-up moving to the Prescott. Can't say for sure about the other science/engineering code.

Last fiddled with by db597 on 2005-05-30 at 14:50
Old 2005-05-30, 16:07   #51
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

2³·1,021 Posts

Quote:
Originally Posted by db597
Even SSE2 is turned off, as we need double precision floating point.
SSE2 has double-precision floats. Turn that optimization on! The P4 has twice as much peak floating point throughput using SSE2 instructions.
Old 2005-05-30, 18:19   #52
db597
 
Jan 2003

2·103 Posts

I'm not entirely sure if it's the effect of SSE2 or of the other optimisations that IFORT does when I tell it that the target architecture is a P4 (-xW switch). But when I enable them, the results are slightly different from when they are turned off (in the 4th significant figure).

This is a bit of a worry, since it effectively means I can only trust the output as a single precision result. The compiler seems to have taken some "shortcuts" that affect the quality of the data. There's a "-mp" switch to maintain floating point precision, but that results in a slowdown that sort of defeats the purpose. Also, the speed-up I get is only about 10%, not double.

Do you have a similar precision issue with Prime95? Or is this avoided since you use assembly?
Old 2005-05-30, 19:11   #53
TheJudger
 
"Oliver"
Mar 2005
Germany

2×557 Posts

The results of SSE2 operations (double precision) may differ a bit from those of the x87 FPU (dp), usually in the least significant bits of the mantissa.

Depending on the algorithm, these errors can move towards the most significant bits...

e.g. subtracting 2 floats of similar magnitude is often bad ;)
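
A small Python sketch of that effect. It does not reproduce the x87 versus SSE2 difference itself; it uses 32-bit versus 64-bit rounding as a stand-in for "the same number stored with slightly different low bits", and the values and the helper name are chosen only for illustration.

```python
import struct

def round_to_f32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]

a, b = 1.0000001, 1.0
a32, b32 = round_to_f32(a), round_to_f32(b)

# Error introduced in the input by the coarser rounding: tiny.
input_rel_err = abs(a32 - a) / a

# Error in the difference of the two nearly equal numbers: large.
diff64 = a - b
diff32 = a32 - b32
diff_rel_err = abs(diff32 - diff64) / diff64

print(f"relative error of the input:      {input_rel_err:.1e}")
print(f"relative error of the difference: {diff_rel_err:.1e}")
```

The subtraction itself is exact here. It simply removes the leading digits the two numbers share, so the rounding error that used to sit in the last bits ends up in the leading digits of the result.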

Last fiddled with by TheJudger on 2005-05-30 at 19:11
Old 2005-05-31, 01:03   #54
db597
 
Jan 2003

CE₁₆ Posts

Having read your replies, I've just done a few more tests on the compiler options. In the code, there are some very small numbers on the order of 1E-10. Normally, without the -xW switch, I get -4.xxxxxE-10 for one of the numbers. Once I turn on SSE2, I get +7.xxxxxxE-10. A big difference, and with the opposite sign!

My code has a lot of subtraction of numbers, since a differential:

dx
---
dy

in numerical simulation is (x1-x2)/(y1-y2). As TheJudger mentioned, if the numbers are very close to each other, small differences in precision make a big difference.
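
To illustrate how such a quotient can even change sign, here is a minimal Python sketch. The inputs are invented (not values from the simulation): x1 is moved by a single last-bit step above or below x2, which is the size of discrepancy two different floating point code paths could plausibly produce.

```python
def difference_quotient(x1, x2, y1, y2):
    """Finite-difference estimate of dx/dy, as written in the post."""
    return (x1 - x2) / (y1 - y2)

x2 = 1.0
y1, y2 = 1.0e-6, 0.0

# Two versions of "the same" x1 that differ only in the last bit:
x1_up = 1.0 + 2.0**-52     # next double above 1.0
x1_down = 1.0 - 2.0**-53   # next double below 1.0

slope_up = difference_quotient(x1_up, x2, y1, y2)
slope_down = difference_quotient(x1_down, x2, y1, y2)

print(f"x1 one step above x2: {slope_up:+.3e}")
print(f"x1 one step below x2: {slope_down:+.3e}")
```

Both results are on the order of 1E-10 but with opposite signs, even though the two x1 values agree to about 16 significant digits.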

Still, I'm not entirely sure if it's the SSE2 or the other optimisations that the compiler does when I invoke the -xW switch. When compiling some subroutines, it says "vectorised". That could be the culprit instead.
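
One way vectorisation alone could change results is by reordering additions, since floating point addition is not associative. The Python sketch below mimics that with a made-up "two lane" sum; it is only an analogy for what a vectorising compiler might do to a reduction loop, not a description of what IFORT actually generates for this code.

```python
def sum_sequential(values):
    """Add the values one after another, in source order."""
    total = 0.0
    for v in values:
        total += v
    return total

def sum_two_lanes(values):
    """Sum even- and odd-indexed values separately, then combine the two
    partial sums, roughly like a 2-wide SIMD reduction."""
    lanes = [0.0, 0.0]
    for i, v in enumerate(values):
        lanes[i % 2] += v
    return lanes[0] + lanes[1]

# Illustrative data: large terms that cancel, plus small terms.
values = [1.0e16, 1.0, -1.0e16, 1.0]

print("sequential:", sum_sequential(values))
print("two lanes: ", sum_two_lanes(values))
```

With IEEE 754 doubles the sequential sum gives 1.0 and the two-lane sum gives 2.0 (the exact answer is 2), so neither order is "wrong" in the sense of a bug. They just round differently.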

Last fiddled with by db597 on 2005-05-31 at 01:04
Old 2005-05-31, 03:28   #55
simi69
 

5²·97 Posts

AMD Athlon(TM) XP 3200+

CPU speed: 2200.30 MHz

CPU features: RDTSC, CMOV, Prefetch, 3DNow!, MMX, SSE

L1 cache size: 64 KB
L2 cache size: 512 KB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
L1 TLBS: 32
L2 TLBS: 256
Prime95 32-bit version 24.12, RdtscTiming=1
Best time for 512K FFT length: 31.550 ms.
Best time for 640K FFT length: 42.628 ms.
Best time for 768K FFT length: 51.965 ms.
Best time for 896K FFT length: 63.704 ms.
Best time for 1024K FFT length: 70.795 ms.
Best time for 1280K FFT length: 95.638 ms.
Best time for 1536K FFT length: 114.720 ms.
Best time for 1792K FFT length: 139.079 ms.
Best time for 2048K FFT length: 157.082 ms.
Best time for 2560K FFT length: 206.427 ms.
[Mon May 30 22:24:38 2005]
Best time for 3072K FFT length: 258.363 ms.
Best time for 3584K FFT length: 305.050 ms.
Best time for 4096K FFT length: 340.004 ms.
Best time for 58 bit trial factors: 5.689 ms.
Best time for 59 bit trial factors: 5.555 ms.
Best time for 60 bit trial factors: 5.593 ms.
Best time for 61 bit trial factors: 5.607 ms.
Best time for 62 bit trial factors: 10.223 ms.
Best time for 63 bit trial factors: 10.308 ms.
Best time for 64 bit trial factors: 24.220 ms.
Best time for 65 bit trial factors: 25.047 ms.
Best time for 66 bit trial factors: 25.297 ms.
Best time for 67 bit trial factors: 25.453 ms.

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
