#45 | |
Apr 2003
Berlin, Germany
192 Posts |
Quote:
Although the smaller cache of the Northwood-based Celeron (see here for a comparison) needs fewer clock cycles per access, the clock speed penalty roughly offsets that. But now comes the big change (look at the first graph shown on the linked page): less cache means more accesses to the RAM, where the access latency is ~4 times as big as the L2 access latency in the graph (the long middle part, which is close to 20 cycles for the Northwood). Now imagine your 128k Celeron having to wait 4-5 times longer for data which doesn't fit into the 128k L2 cache but still fits into the 256k L2 on the other Celeron. No data means waiting, even at 2.8 GHz, while the other CPU has its data and, although running at only 1.7 GHz, is doing something during that time.
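To put rough numbers on that argument, here is a small sketch. It is only illustrative: it assumes the ~20-cycle L2 latency read off the graph applies to both chips and takes the RAM penalty as 4 times that figure. The function name `wait_ns` is made up for this example.

```python
def wait_ns(latency_cycles, clock_ghz):
    """Convert a latency in CPU cycles to nanoseconds at a given clock."""
    return latency_cycles / clock_ghz  # cycles / (cycles per ns)

L2_CYCLES = 20   # assumed: rough L2 latency taken from the graph
RAM_FACTOR = 4   # assumed: a RAM access costs ~4x the L2 latency

# Data that misses the 128k L2 and has to come from RAM, at 2.8 GHz.
miss_128k = wait_ns(L2_CYCLES * RAM_FACTOR, 2.8)
# The same data still sitting in the 256k L2 of the 1.7 GHz chip.
hit_256k = wait_ns(L2_CYCLES, 1.7)

print(f"2.8 GHz, RAM access: {miss_128k:.1f} ns")
print(f"1.7 GHz, L2 hit:     {hit_256k:.1f} ns")
```

Under these assumptions the faster-clocked chip waits more than twice as long for that piece of data, which is the point being made: the higher clock does not help while the CPU is stalled.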
#46 | |
"6800 descendent"
Feb 2005
Colorado
13428 Posts |
Quote:
I have a new motherboard on the way that uses DDR memory, which I plan on using to experiment with the 128K Celeron. I hope to get the iteration time down to where it should be based on others' benchmarks. This stuff is fun!
#47 |
P90 years forever!
Aug 2002
Yeehaw, FL
23×1,021 Posts |
PhilF,
Preliminary results with the 10 level 2nd pass show that the 128KB L2 cache will never be fast for the 1024K FFT and higher. A kind user has set up a 128KB Celeron for me to SSH into. At 2.0 GHz I cannot get the 1024K FFT below 75 ms. My 2.0 GHz Northwood is about 47 ms. The penalty will be worse for larger FFTs.
P.S. This 2.0 GHz machine is 161 ms for a 1792K FFT - and that is using the existing 11 level 2nd pass.
Last fiddled with by Prime95 on 2005-05-30 at 03:32
#48 |
Jan 2003
2·103 Posts |
PhilF, Prime95 is very specialised software that's been highly optimised in its use of certain instructions and cache sizes. So it isn't representative of the CPU's performance in average real-world applications (and vice versa).
In fact, this isn't the only time I've seen a large deviation in performance due to cache size. For the work I do at college, the software I'm working on depends very much on cache because of the large matrix sizes it processes. Here are the rough numbers I get:
P4 Northwood 512KB @2GHz - 2.98 seconds / time step
P4 Prescott 1024KB @3.4GHz - 0.96 seconds / time step
So the Prescott is 3x faster, despite only a 70% increase in clock speed. I suspect anyone doing CFD (StarCD, Fluent) or FEA (Foam, Ansys) will find similar results when going up in cache size.
Whatever hardware review sites like Tomshardware, HardOCP et al. say about the Prescott being no better than a Northwood doesn't apply here at all. Each CPU has its strengths and weaknesses - you really need to choose your processor depending on what application you intend to run on it. For 3D games it's Athlon64, for GIMPS it's P4 Northwood, for engineering/science it's P4 660 with 2MB cache!
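As a quick check of those figures, the sketch below separates the part of the speedup explained by clock speed from the remainder. It uses only the two timings and clock speeds quoted above; the variable names are made up for this example, and attributing the remainder to cache size (rather than other Prescott changes) is the post's interpretation, not something the arithmetic proves.

```python
# Timings and clocks as quoted in the post.
northwood = {"clock_ghz": 2.0, "sec_per_step": 2.98}
prescott = {"clock_ghz": 3.4, "sec_per_step": 0.96}

# Overall speedup: how many times faster one time step runs.
speedup = northwood["sec_per_step"] / prescott["sec_per_step"]
# What the clock speed alone would account for.
clock_ratio = prescott["clock_ghz"] / northwood["clock_ghz"]
# What is left over once the clock difference is factored out.
per_clock_gain = speedup / clock_ratio

print(f"overall speedup:    {speedup:.2f}x")
print(f"clock ratio:        {clock_ratio:.2f}x")
print(f"gain beyond clock:  {per_clock_gain:.2f}x")
```

So roughly 1.7x comes from the clock and a further ~1.8x from everything else, which is why the result is far from linear in clock speed.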
#49 |
Apr 2003
Berlin, Germany
5518 Posts |
@db597:
Is there some cache blocking involved, or is everything left to the compiler (incl. prefetching)? BTW, I don't agree with the Prescott being the optimal CPU for engineering/scientific purposes. But that also doesn't depend only on the type of CPU.
#50 |
Jan 2003
2×103 Posts |
Everything is left to the compiler. There's no fancy MPI, OpenMP etc. in the code I used for benchmarking. We use the Intel Fortran Compiler for Linux, so it should optimise well for both Northwood and Prescott.
In fact, the previous version of IFORT didn't have any Prescott support (v7 was released before the Prescott). Version 8 adds SSE3 instructions, but I don't use them. Even SSE2 is turned off, as we need double precision floating point. The compiler switches are all the same for both codes (-O2 only).
As for prefetching, I think not. The IFORT man page for -O2 says:
· Inlining of intrinsics
· The following capabilities for performance gain: constant propagation, copy propagation, dead-code elimination, global register allocation, global instruction scheduling and control speculation, loop unrolling, optimized code selection, partial redundancy elimination, strength reduction/induction variable simplification, variable renaming, exception handling optimizations, tail recursions, peephole optimizations, structure assignment lowering and optimizations, and dead store elimination.
g77 (gcc v3.4) is much worse than IFORT. It gives about 50% slower time steps. You really can't beat Intel for compilers, they are excellent. IFORT gives even Athlons a pretty good boost compared to g77 (though naturally not as much of a boost as the P4s).
For uniprocessor development and small simulation runs, the P4 6XX are great (larger runs need MPI clusters and a domain decomposition approach). But of course, you're right that it depends heavily on what code you're running. CFD and FEA seem to be very cache dependent. I've run benchmarks on BOFFIN (LES) and FOAM - both show a non-linear speed up moving to the Prescott. Can't say for sure about other science/engineering code.
Last fiddled with by db597 on 2005-05-30 at 14:50
#51 | |
P90 years forever!
Aug 2002
Yeehaw, FL
23·1,021 Posts |
![]() Quote:
#52 |
Jan 2003
2·103 Posts |
I'm not entirely sure if it's the effect of the SSE2 or of the other optimisations that IFORT does when I tell it that the target architecture is a P4 (-xW switch). But when I enable them, the results are slightly different from when they are turned off (in the 4th significant figure).
This is a bit of a worry, since that effectively allows me to trust it only as a single precision result. The compiler seems to have taken some "shortcuts" that affect the quality of the data. There's a "-mp" switch to maintain floating point precision, but that results in a slowdown that sort of defeats the purpose. Also, the speed up I get is only about 10%, not double.
Do you have a similar precision issue with Prime95? Or is this avoided since you use assembly?
#53 |
"Oliver"
Mar 2005
Germany
2×557 Posts |
The results of SSE2 operations (double precision) may differ a bit compared to the x87 FPU (dp)... (usually in the least significant bits of the mantissa).
Depending on the algorithm, these errors can move towards the most significant bits... e.g. subtraction of 2 floats of similar magnitude is often bad ;)
Last fiddled with by TheJudger on 2005-05-30 at 19:11
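A small sketch of that effect in ordinary IEEE double precision (this does not reproduce the x87-versus-SSE2 difference itself, only how a last-bit rounding error gets promoted by a subtraction). The numbers are chosen purely for illustration:

```python
import sys
from fractions import Fraction

eps = sys.float_info.epsilon      # ~2.2e-16, relative precision of a double

delta = Fraction(1, 10**10)       # the exact difference we are after: 1e-10
a = 1.0 + 1e-10                   # rounded to the nearest double near 1.0
computed = a - 1.0                # this subtraction is exact; the damage
                                  # was already done by the rounding of `a`

# Relative error of the computed difference, evaluated exactly.
rel_err = abs(Fraction(computed) - delta) / delta

print(f"computed difference: {computed!r}")
print(f"relative error:      {float(rel_err):.3e}")
print(f"machine epsilon:     {eps:.3e}")
```

The rounding error in `a` is at the level of the last bit, but relative to the tiny difference it is many orders of magnitude larger than machine epsilon, so only about 7 significant digits of the result survive.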
#54 |
Jan 2003
CE16 Posts |
Having read your replies, I've just done a few more tests on the compiler options. In the code, there are some very small numbers of the order of 1E-10. Normally, without the -xW switch, for one of the numbers I get -4.xxxxxE-10. Once I turn on the SSE2, I get +7.xxxxxxE-10. A big difference in the opposite direction!
My code has a lot of subtraction of numbers, since a differential dx/dy in numerical simulation is (x1-x2)/(y1-y2). As TheJudger mentioned, if the numbers are very close to each other, small differences in the precision make a big difference.
Still, I'm not entirely sure if it's the SSE2 or the other optimisations that the compiler does when I invoke the -xW switch. When compiling some subroutines, it says "vectorised". That could be the culprit instead.
Last fiddled with by db597 on 2005-05-31 at 01:04
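One way vectorisation can change a result is by changing the order in which terms are added, since floating point addition is not associative. The sketch below is a toy model and not what IFORT actually generates: `two_lane_sum` is a made-up stand-in for a 2-wide SIMD reduction, and the input values are contrived to make the effect obvious.

```python
def sequential_sum(values):
    """Add the values strictly left to right, as scalar code would."""
    total = 0.0
    for v in values:
        total += v
    return total

def two_lane_sum(values):
    """Toy model of a 2-wide vectorised sum: even-indexed and odd-indexed
    elements are accumulated separately and combined at the end."""
    lanes = [0.0, 0.0]
    for i, v in enumerate(values):
        lanes[i % 2] += v
    return lanes[0] + lanes[1]

# Large terms that cancel, plus small terms carrying the real answer (0.5).
values = [1e16, 1.0, -1e16, -0.5]

print("sequential:", sequential_sum(values))
print("two-lane:  ", two_lane_sum(values))
```

In the sequential version the 1.0 is lost when it is added to 1e16, so the small result comes out with the wrong sign, while the reordered version happens to keep it. Same inputs, same arithmetic, opposite signs, which is at least consistent with the kind of flip described above.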
#55 |
52·97 Posts |
AMD Athlon(TM) XP 3200+
CPU speed: 2200.30 MHz
CPU features: RDTSC, CMOV, Prefetch, 3DNow!, MMX, SSE
L1 cache size: 64 KB
L2 cache size: 512 KB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
L1 TLBS: 32
L2 TLBS: 256
Prime95 32-bit version 24.12, RdtscTiming=1
Best time for 512K FFT length: 31.550 ms.
Best time for 640K FFT length: 42.628 ms.
Best time for 768K FFT length: 51.965 ms.
Best time for 896K FFT length: 63.704 ms.
Best time for 1024K FFT length: 70.795 ms.
Best time for 1280K FFT length: 95.638 ms.
Best time for 1536K FFT length: 114.720 ms.
Best time for 1792K FFT length: 139.079 ms.
Best time for 2048K FFT length: 157.082 ms.
Best time for 2560K FFT length: 206.427 ms.
[Mon May 30 22:24:38 2005]
Best time for 3072K FFT length: 258.363 ms.
Best time for 3584K FFT length: 305.050 ms.
Best time for 4096K FFT length: 340.004 ms.
Best time for 58 bit trial factors: 5.689 ms.
Best time for 59 bit trial factors: 5.555 ms.
Best time for 60 bit trial factors: 5.593 ms.
Best time for 61 bit trial factors: 5.607 ms.
Best time for 62 bit trial factors: 10.223 ms.
Best time for 63 bit trial factors: 10.308 ms.
Best time for 64 bit trial factors: 24.220 ms.
Best time for 65 bit trial factors: 25.047 ms.
Best time for 66 bit trial factors: 25.297 ms.
Best time for 67 bit trial factors: 25.453 ms.