mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   Prime95 version 27.3 (https://www.mersenneforum.org/showthread.php?t=16535)

Prime95 2012-02-17 04:34

Prime95 version 27.3
 
An early beta of prime95 version 27.3 is available. This version supports 64-bit optimized AVX FFTs. The 32-bit AVX FFTs are also a little faster. I haven't done full benchmarks, so I'm not sure how much faster it is than version 27.2 or 26.6.

The good/bad news is that these FFTs are so fast they are limited by memory bandwidth -- standard Sandy Bridge CPUs will see a slowdown when running all 4 cores. I'd like to hear from Sandy Bridge-E users to see if they also suffer slowdowns when all 4 cores are running.

If you do not have a Sandy Bridge CPU, there is [B]absolutely no reason to download this version[/B] (this applies especially to AMD Bulldozer).

Download links:
Windows 64-bit: [url]ftp://mersenne.org/gimps/p64v273.zip[/url]
Windows 32-bit: [url]ftp://mersenne.org/gimps/p95v273.zip[/url]
Linux 64-bit: [url]ftp://mersenne.org/gimps/mprime273-linux64.tar.gz[/url]
Linux 32-bit: [url]ftp://mersenne.org/gimps/mprime273.tar.gz[/url]
Mac OS X: [url]ftp://mersenne.org/gimps/Prime95-MacOSX-273.zip[/url]
Source code: [url]ftp://mersenne.org/gimps/source273.zip[/url]

My Ubuntu box no longer boots, so Linux versions are not available yet.

I addressed several bug reports from LLR users. Please retest cases that caused problems in gwnum 27.2.

Dubslow 2012-02-17 05:01

[QUOTE=Prime95;289657]

My Ubuntu box no longer boots so Linux versions are not available yet.
[/QUOTE]

If you can give directions (makefile?), I have gcc 4.5.2. (Assuming no one beats me to it.)

Edit: I'm having problems downloading the source. It gets the first few megabytes and then stops completely.

Prime95 2012-02-17 05:02

P.S. Can a Phenom user check to see if CPU speed detection is any better?

Jwb52z 2012-02-17 05:13

How would you check whether you have a Sandy Bridge CPU without googling?

retina 2012-02-17 05:17

[QUOTE=Prime95;289659]P.S. Can a Phenom user check to see if CPU speed detection is any better?[/QUOTE]I don't have a Phenom but I tried my Fusion 1.6GHz Dual Core anyway:[code]AMD E-350 Processor
CPU speed: 3130.84 MHz, 2 cores
CPU features: Prefetch, MMX, SSE, SSE2
L1 cache size: 32 KB
L2 cache size: 512 KB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
L1 TLBS: 40
L2 TLBS: 512
Prime95 64-bit version 27.3, RdtscTiming=1[/code]It still reports almost double the speed it should.

Dubslow 2012-02-17 05:18

[QUOTE=Jwb52z;289660]How would you check without googling if you have a Sandy Bridge CPU?[/QUOTE]

When and how did you buy it? (If it's more than a year old, it's not SB.) If you're running Windows, download CPU-Z; it tells you exactly what the processor is.

Prime95 2012-02-17 05:20

Use Options/CPU. It will say something like Intel Core i3 or i5 or i7 - 2xxx or 3xxx. My Sandy Bridge is an i5-2500K.
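For Linux users without CPU-Z, the same check can be scripted. This is only a sketch of the model-number pattern described above (the `is_sandy_bridge` helper is made up; standard Sandy Bridge parts are the 2xxx series, while SB-E shows up as i7-3xxx):

```shell
# Hypothetical helper: decide from the model-name string whether this is
# a standard (2xxx-series) Sandy Bridge part.
is_sandy_bridge() {
  case "$1" in
    *i[357]-2???*) echo yes ;;  # e.g. "Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz"
    *)             echo no  ;;
  esac
}

# On Linux the model name is exposed in /proc/cpuinfo:
is_sandy_bridge "$(grep -m1 'model name' /proc/cpuinfo)"
```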

Jwb52z 2012-02-17 05:20

[QUOTE=Dubslow;289662]When and how did you buy it? (If it's more than a year old it's not SB.) If you're running Windows, download CPU-Z, it tells you exactly what the processor is.[/QUOTE]Ok, I don't have a Sandy Bridge then. Mine is an i7 Q720, which was released in the third quarter of 2009.

Dubslow 2012-02-17 05:27

[QUOTE=Jwb52z;289664]Ok, I don't have a Sandy Bridge then. Mine is an i7 Q720, which was released in the third quarter of 2009.[/QUOTE]

SB and up all have four-digit 2xxx model numbers. (Ivy Bridge will be 3xxx, as will the hexcore SBs, which really makes no sense to me.)

Prime95 2012-02-17 05:46

[QUOTE=Dubslow;289658]If you can give directions (makefile?), I have gcc 4.5.2. (Assuming no one beats me to it.)[/QUOTE]

You won't get the necessary security files to build an official release. You'll have to wait for me to get my box up and running.

bcp19 2012-02-17 06:29

[QUOTE=Prime95;289657]The good/bad news is these FFTs are so fast that they are limited by memory bandwidth -- standard Sandy Bridge CPUs will experience a slow down when running all 4 cores. I'd like to hear from Sandy Bridge-E users to see if they also suffer slow downs when all 4 cores are running.[/QUOTE]

Tried it out on my 2500K: on exponent 26000069 (1344K FFT), timing went from .010 to .012; on exponent 26214607 (1440K FFT), timing went from .010 to .013. I figure it is the memory bandwidth you mentioned, since the machine is running P95 on 3 cores and mfaktc on 1. It could also be due to the overclock.
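To put those numbers in perspective, the quoted per-iteration timings work out to roughly a 20% and 30% slowdown. A quick sanity-check of the arithmetic (values copied from the post above):

```shell
# Percentage slowdown computed from the quoted ms/iteration figures.
awk 'BEGIN {
  printf "1344K FFT: %.0f%% slower\n", (0.012 - 0.010) / 0.010 * 100
  printf "1440K FFT: %.0f%% slower\n", (0.013 - 0.010) / 0.010 * 100
}'
```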

LaurV 2012-02-17 06:54

Do we need to update (recommended? not recommended?) from v26 for core2 duo/quad processors? (nehalem/westmere: no SB, no AVX, no FMA3/4 :razz:) Or should we stay on 26 for those processors? (I am not a big fan of new stuff over stable stuff!)

edit: I will give the new version a test tonight, when I reach "the" SB-E computer.

LaurV 2012-02-17 09:38

(time limit)
edit2: (rhetorical) what the heck do you have in 38 MB of sources?? :shock: (The firewall kicked me off at about 30 MB, three times, till I realized why; then I clicked "stop" and "resume" partway through and cheated it, hehe.)
edit3: no need to answer, I saw the gwnum libraries already. Maybe they could be separated; most guys only want to "look" into the sources, not to use them for anything...

schickel 2012-02-17 11:24

[QUOTE=Prime95;289659]P.S. Can a Phenom user check to see if CPU speed detection is any better?[/QUOTE]Here's what I get:

v26.6 Build 3:[quote]AMD Phenom(tm) II X6 1090T Processor
CPU speed: 9455.57 MHz, 6 cores
CPU features: Prefetch, 3DNow!, MMX, SSE, SSE2
L1 cache size: 64 KB
L2 cache size: 512 KB, L3 Cache size 6 MB[/quote]v27.3:[quote]AMD Phenom(tm) II X6 1090T Processor
CPU speed: 3151.63 MHz, 6 cores
CPU features: Prefetch, 3DNow!, MMX, SSE, SSE2
L1 cache size: 64 KB
L2 cache size: 512 KB, L3 cache size: 6 MB[/quote]Windows reports 3.20 GHz.

BigBrother 2012-02-17 11:29

My i7-2600K @ 4.5GHz with DDR3-2133 memory runs slightly faster compared to 27.2. Three cores, 2560K FFTs.
And as a bonus, Prime95 now correctly detects AVX on my laptop's i3-2310M; it wasn't detected in 27.1 or 27.2.

Ralf Recker 2012-02-17 13:37

[QUOTE=LaurV;289683](time limit)
edit2: (rhetorical) what the heck do you have in 38 MB of sources?? :shock: (The firewall kicked me off at about 30 MB, three times, till I realized why; then I clicked "stop" and "resume" partway through and cheated it, hehe.)
edit3: no need to answer, I saw the gwnum libraries already. Maybe they could be separated; most guys only want to "look" into the sources, not to use them for anything...[/QUOTE]
7zip compresses the unpacked source272.zip archive to 4421674 bytes with -mx9 (6855628 bytes with the default settings).

LaurV 2012-02-17 14:17

1 Attachment(s)
That was what I was wondering about! The 38MB was the COMPRESSED file.

Anyhow, that is not why I posted. I just did a comparative benchmark. Attached is the 27.3 vs 27.2 comparison, so if George can do anything with it, be my guest. If there is anything in particular I should focus on, just say so.

The benchmark was done on an i7 2600K running at 4.35GHz, on a Maximus Extreme-Z mobo (quad-channel DDR3).

Ralf Recker 2012-02-17 14:39

[QUOTE=LaurV;289712]That was what I was wondering about! The 38MB was the COMPRESSED file.[/QUOTE]
[CODE]ralf@quadriga:~/Temp$ find source273/ | grep [.]a$ | xargs ls -l
-rw-r--r-- 1 ralf ralf 30653248 16. Feb 22:11 source273/gwnum/linux64/gwnum.a
-rw-r--r-- 1 ralf ralf 30557998 16. Feb 21:58 source273/gwnum/linux/gwnum.a
-rw-r--r-- 1 ralf ralf 20951800 16. Feb 22:11 source273/gwnum/macosx64/gwnum.a
-rw-r--r-- 1 ralf ralf 18362596 16. Feb 21:59 source273/gwnum/macosx/gwnum.a
[/CODE]not to mention the few MB of Windows object files, the .c and .asm sources...

That's why I would like to see .7z files as a download option:

[CODE]ralf@quadriga:~/Temp$ ls -l *.7z
-rw-r--r-- 1 ralf ralf 1028891 17. Feb 15:42 p64v273.7z
-rw-r--r-- 1 ralf ralf 946552 17. Feb 15:42 p95v273.7z
-rw-r--r-- 1 ralf ralf 4687444 17. Feb 15:41 source273.7z[/CODE]
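The effect is easy to reproduce: archives full of redundant data (four near-identical multi-megabyte gwnum.a builds) respond strongly to a better compressor and a higher setting. A stand-in sketch using gzip, which unlike 7z is installed almost everywhere, on deliberately repetitive data:

```shell
# Generate highly repetitive data (in the spirit of the duplicated code
# across the four gwnum.a builds), then compare fastest vs. best level.
yes "AVX FFT benchmark padding line for compression demo" | head -16000 > sample.txt
fast=$(gzip -1 -c sample.txt | wc -c)
best=$(gzip -9 -c sample.txt | wc -c)
echo "level 1: $fast bytes, level 9: $best bytes"
rm -f sample.txt
```

The same principle is why -mx9 shaves another couple of megabytes off the 7z archives above.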

Prime95 2012-02-17 14:40

[QUOTE=LaurV;289678]Do we need to update (recommended? non-recommended?) from v26 for core2 duo/quad processors? (nehalem/westmere no sb, no avx, no fma3/4 :razz:) Or we better stay on 26 for these processors? [/QUOTE]

Reread the original post. Pay attention to the bold text.

James Heinrich 2012-02-17 15:03

[QUOTE=Prime95;289657]I'd like to hear from Sandy Bridge-E users to see if they also suffer slow downs when all 4 cores are running.[/QUOTE]What about when all 6 cores are running? :smile:

Here are benchmarks for v26.6.3 and v27.3.1.
CPU is an Intel Core i7-3930K running at 4500MHz (125 x 36).
RAM is Corsair 32GB (4 x 8GB) running at 1333MHz.

v26.6.3:[code]Compare your results to other computers at http://www.mersenne.org/report_benchmarks
Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
CPU speed: 4427.36 MHz, 6 hyperthreaded cores
CPU features: Prefetch, MMX, SSE, SSE2, SSE4, AVX
L1 cache size: 32 KB
L2 cache size: 256 KB, L3 cache size: 12 MB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
TLBS: 64
Prime95 64-bit version 26.6, RdtscTiming=1
Best time for 768K FFT length: 5.773 ms., avg: 6.164 ms.
Best time for 896K FFT length: 7.083 ms., avg: 7.347 ms.
Best time for 1024K FFT length: 7.818 ms., avg: 7.965 ms.
Best time for 1280K FFT length: 10.178 ms., avg: 10.291 ms.
Best time for 1536K FFT length: 12.385 ms., avg: 13.020 ms.
Best time for 1792K FFT length: 15.095 ms., avg: 15.479 ms.
Best time for 2048K FFT length: 16.689 ms., avg: 17.075 ms.
Best time for 2560K FFT length: 21.396 ms., avg: 22.026 ms.
Best time for 3072K FFT length: 26.322 ms., avg: 26.632 ms.
Best time for 3584K FFT length: 31.657 ms., avg: 32.234 ms.
Best time for 4096K FFT length: 35.007 ms., avg: 35.929 ms.
Best time for 5120K FFT length: 45.628 ms., avg: 45.921 ms.
Best time for 6144K FFT length: 57.025 ms., avg: 57.752 ms.
Best time for 7168K FFT length: 70.184 ms., avg: 71.039 ms.
Best time for 8192K FFT length: 77.062 ms., avg: 77.326 ms.
Timing FFTs using 2 threads on 1 physical CPUs.
Best time for 768K FFT length: 5.633 ms., avg: 5.682 ms.
Best time for 896K FFT length: 6.820 ms., avg: 6.916 ms.
Best time for 1024K FFT length: 7.632 ms., avg: 7.709 ms.
Best time for 1280K FFT length: 9.846 ms., avg: 10.161 ms.
Best time for 1536K FFT length: 12.099 ms., avg: 12.156 ms.
Best time for 1792K FFT length: 14.622 ms., avg: 14.716 ms.
Best time for 2048K FFT length: 16.420 ms., avg: 16.503 ms.
Best time for 2560K FFT length: 20.711 ms., avg: 20.773 ms.
Best time for 3072K FFT length: 25.758 ms., avg: 25.852 ms.
Best time for 3584K FFT length: 30.745 ms., avg: 30.902 ms.
Best time for 4096K FFT length: 34.547 ms., avg: 34.596 ms.
Best time for 5120K FFT length: 44.723 ms., avg: 44.834 ms.
Best time for 6144K FFT length: 54.992 ms., avg: 55.260 ms.
Best time for 7168K FFT length: 64.596 ms., avg: 64.752 ms.
Best time for 8192K FFT length: 72.898 ms., avg: 73.215 ms.
Timing FFTs using 4 threads on 2 physical CPUs.
Best time for 768K FFT length: 2.913 ms., avg: 3.109 ms.
Best time for 896K FFT length: 3.543 ms., avg: 3.621 ms.
Best time for 1024K FFT length: 3.984 ms., avg: 4.106 ms.
Best time for 1280K FFT length: 5.119 ms., avg: 5.302 ms.
Best time for 1536K FFT length: 6.214 ms., avg: 6.354 ms.
Best time for 1792K FFT length: 7.584 ms., avg: 7.667 ms.
Best time for 2048K FFT length: 8.499 ms., avg: 8.607 ms.
Best time for 2560K FFT length: 10.744 ms., avg: 10.928 ms.
Best time for 3072K FFT length: 13.348 ms., avg: 13.519 ms.
Best time for 3584K FFT length: 15.908 ms., avg: 15.968 ms.
Best time for 4096K FFT length: 17.892 ms., avg: 21.058 ms.
Best time for 5120K FFT length: 23.064 ms., avg: 23.210 ms.
Best time for 6144K FFT length: 28.449 ms., avg: 28.839 ms.
Best time for 7168K FFT length: 33.486 ms., avg: 33.655 ms.
Best time for 8192K FFT length: 37.878 ms., avg: 37.994 ms.
Timing FFTs using 6 threads on 3 physical CPUs.
Best time for 768K FFT length: 2.215 ms., avg: 2.482 ms.
Best time for 896K FFT length: 2.443 ms., avg: 2.494 ms.
Best time for 1024K FFT length: 3.040 ms., avg: 3.660 ms.
Best time for 1280K FFT length: 3.609 ms., avg: 3.981 ms.
Best time for 1536K FFT length: 4.372 ms., avg: 4.955 ms.
Best time for 1792K FFT length: 5.268 ms., avg: 5.607 ms.
Best time for 2048K FFT length: 5.907 ms., avg: 6.060 ms.
Best time for 2560K FFT length: 7.448 ms., avg: 8.190 ms.
Best time for 3072K FFT length: 9.166 ms., avg: 9.358 ms.
Best time for 3584K FFT length: 10.914 ms., avg: 11.352 ms.
Best time for 4096K FFT length: 12.253 ms., avg: 12.536 ms.
Best time for 5120K FFT length: 15.852 ms., avg: 17.838 ms.
Best time for 6144K FFT length: 19.515 ms., avg: 19.837 ms.
Best time for 7168K FFT length: 23.596 ms., avg: 23.755 ms.
Best time for 8192K FFT length: 25.831 ms., avg: 26.815 ms.
Timing FFTs using 8 threads on 4 physical CPUs.
Best time for 768K FFT length: 1.957 ms., avg: 2.193 ms.
Best time for 896K FFT length: 1.918 ms., avg: 2.289 ms.
Best time for 1024K FFT length: 2.676 ms., avg: 3.021 ms.
Best time for 1280K FFT length: 3.169 ms., avg: 3.562 ms.
Best time for 1536K FFT length: 3.722 ms., avg: 4.273 ms.
Best time for 1792K FFT length: 4.074 ms., avg: 5.503 ms.
Best time for 2048K FFT length: 4.776 ms., avg: 4.979 ms.
Best time for 2560K FFT length: 6.129 ms., avg: 6.505 ms.
Best time for 3072K FFT length: 7.328 ms., avg: 7.529 ms.
Best time for 3584K FFT length: 8.511 ms., avg: 9.094 ms.
Best time for 4096K FFT length: 9.812 ms., avg: 10.025 ms.
Best time for 5120K FFT length: 12.560 ms., avg: 12.816 ms.
Best time for 6144K FFT length: 15.458 ms., avg: 15.687 ms.
Best time for 7168K FFT length: 19.118 ms., avg: 19.346 ms.
Best time for 8192K FFT length: 20.482 ms., avg: 20.753 ms.
Timing FFTs using 10 threads on 5 physical CPUs.
Best time for 768K FFT length: 1.840 ms., avg: 1.959 ms.
Best time for 896K FFT length: 1.768 ms., avg: 1.849 ms.
Best time for 1024K FFT length: 2.506 ms., avg: 3.226 ms.
Best time for 1280K FFT length: 2.973 ms., avg: 3.861 ms.
Best time for 1536K FFT length: 3.461 ms., avg: 3.662 ms.
Best time for 1792K FFT length: 3.789 ms., avg: 4.184 ms.
Best time for 2048K FFT length: 4.409 ms., avg: 5.578 ms.
Best time for 2560K FFT length: 5.735 ms., avg: 6.310 ms.
Best time for 3072K FFT length: 6.912 ms., avg: 7.930 ms.
Best time for 3584K FFT length: 7.963 ms., avg: 8.226 ms.
Best time for 4096K FFT length: 9.262 ms., avg: 10.254 ms.
Best time for 5120K FFT length: 11.633 ms., avg: 12.171 ms.
Best time for 6144K FFT length: 14.213 ms., avg: 15.227 ms.
Best time for 7168K FFT length: 17.578 ms., avg: 20.373 ms.
Best time for 8192K FFT length: 19.021 ms., avg: 20.007 ms.
Timing FFTs using 12 threads on 6 physical CPUs.
Best time for 768K FFT length: 1.735 ms., avg: 1.870 ms.
Best time for 896K FFT length: 1.681 ms., avg: 1.777 ms.
Best time for 1024K FFT length: 2.405 ms., avg: 2.565 ms.
Best time for 1280K FFT length: 2.846 ms., avg: 3.015 ms.
Best time for 1536K FFT length: 3.312 ms., avg: 4.853 ms.
Best time for 1792K FFT length: 3.648 ms., avg: 3.810 ms.
Best time for 2048K FFT length: 4.273 ms., avg: 4.455 ms.
Best time for 2560K FFT length: 5.473 ms., avg: 5.709 ms.
Best time for 3072K FFT length: 6.495 ms., avg: 6.830 ms.
Best time for 3584K FFT length: 7.519 ms., avg: 10.094 ms.
Best time for 4096K FFT length: 8.793 ms., avg: 9.152 ms.
Best time for 5120K FFT length: 10.979 ms., avg: 11.273 ms.
Best time for 6144K FFT length: 13.569 ms., avg: 13.756 ms.
Best time for 7168K FFT length: 19.527 ms., avg: 24.010 ms.
[Fri Feb 17 09:51:23 2012]
Best time for 8192K FFT length: 17.883 ms., avg: 18.977 ms.
Best time for 61 bit trial factors: 1.722 ms.
Best time for 62 bit trial factors: 1.750 ms.
Best time for 63 bit trial factors: 1.980 ms.
Best time for 64 bit trial factors: 2.041 ms.
Best time for 65 bit trial factors: 2.394 ms.
Best time for 66 bit trial factors: 2.832 ms.
Best time for 67 bit trial factors: 2.795 ms.
Best time for 75 bit trial factors: 2.732 ms.
Best time for 76 bit trial factors: 2.714 ms.
Best time for 77 bit trial factors: 2.731 ms.[/code]

James Heinrich 2012-02-17 15:06

v27.3.1:[code]Compare your results to other computers at http://www.mersenne.org/report_benchmarks
Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
CPU speed: 4428.45 MHz, 6 hyperthreaded cores
CPU features: Prefetch, MMX, SSE, SSE2, SSE4, AVX
L1 cache size: 32 KB
L2 cache size: 256 KB, L3 cache size: 12 MB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
TLBS: 64
Prime95 64-bit version 27.3, RdtscTiming=1
Best time for 768K FFT length: 3.525 ms., avg: 3.766 ms.
Best time for 896K FFT length: 4.229 ms., avg: 4.360 ms.
Best time for 1024K FFT length: 4.822 ms., avg: 5.015 ms.
Best time for 1280K FFT length: 6.237 ms., avg: 6.578 ms.
Best time for 1536K FFT length: 7.705 ms., avg: 7.937 ms.
Best time for 1792K FFT length: 9.450 ms., avg: 9.811 ms.
Best time for 2048K FFT length: 10.610 ms., avg: 10.857 ms.
Best time for 2560K FFT length: 13.572 ms., avg: 13.968 ms.
Best time for 3072K FFT length: 16.983 ms., avg: 17.560 ms.
Best time for 3584K FFT length: 20.632 ms., avg: 21.255 ms.
Best time for 4096K FFT length: 23.395 ms., avg: 23.797 ms.
Best time for 5120K FFT length: 30.964 ms., avg: 31.520 ms.
Best time for 6144K FFT length: 37.074 ms., avg: 38.321 ms.
Best time for 7168K FFT length: 45.105 ms., avg: 46.369 ms.
Best time for 8192K FFT length: 52.026 ms., avg: 53.417 ms.
Timing FFTs using 2 threads on 1 physical CPUs.
Best time for 768K FFT length: 3.742 ms., avg: 3.801 ms.
Best time for 896K FFT length: 4.420 ms., avg: 4.515 ms.
Best time for 1024K FFT length: 5.099 ms., avg: 5.186 ms.
Best time for 1280K FFT length: 6.550 ms., avg: 6.690 ms.
Best time for 1536K FFT length: 8.284 ms., avg: 8.397 ms.
Best time for 1792K FFT length: 9.921 ms., avg: 10.083 ms.
Best time for 2048K FFT length: 11.230 ms., avg: 11.452 ms.
Best time for 2560K FFT length: 14.445 ms., avg: 14.628 ms.
Best time for 3072K FFT length: 17.977 ms., avg: 18.175 ms.
Best time for 3584K FFT length: 21.966 ms., avg: 22.118 ms.
Best time for 4096K FFT length: 24.553 ms., avg: 24.876 ms.
Best time for 5120K FFT length: 33.083 ms., avg: 33.401 ms.
Best time for 6144K FFT length: 40.608 ms., avg: 41.396 ms.
Best time for 7168K FFT length: 49.466 ms., avg: 49.643 ms.
Best time for 8192K FFT length: 53.535 ms., avg: 53.966 ms.
Timing FFTs using 4 threads on 2 physical CPUs.
Best time for 768K FFT length: 1.968 ms., avg: 2.111 ms.
Best time for 896K FFT length: 2.345 ms., avg: 2.731 ms.
Best time for 1024K FFT length: 2.703 ms., avg: 2.807 ms.
Best time for 1280K FFT length: 3.468 ms., avg: 3.591 ms.
Best time for 1536K FFT length: 4.339 ms., avg: 4.501 ms.
Best time for 1792K FFT length: 5.213 ms., avg: 5.424 ms.
Best time for 2048K FFT length: 5.943 ms., avg: 6.631 ms.
Best time for 2560K FFT length: 7.598 ms., avg: 8.088 ms.
Best time for 3072K FFT length: 9.481 ms., avg: 9.869 ms.
Best time for 3584K FFT length: 11.683 ms., avg: 11.948 ms.
Best time for 4096K FFT length: 13.051 ms., avg: 13.461 ms.
Best time for 5120K FFT length: 17.282 ms., avg: 18.642 ms.
Best time for 6144K FFT length: 21.036 ms., avg: 22.284 ms.
Best time for 7168K FFT length: 25.351 ms., avg: 25.719 ms.
Best time for 8192K FFT length: 28.323 ms., avg: 29.149 ms.
Timing FFTs using 6 threads on 3 physical CPUs.
Best time for 768K FFT length: 1.581 ms., avg: 1.669 ms.
Best time for 896K FFT length: 1.937 ms., avg: 2.077 ms.
Best time for 1024K FFT length: 2.185 ms., avg: 2.443 ms.
Best time for 1280K FFT length: 2.767 ms., avg: 3.056 ms.
Best time for 1536K FFT length: 3.426 ms., avg: 3.683 ms.
Best time for 1792K FFT length: 4.156 ms., avg: 4.502 ms.
Best time for 2048K FFT length: 4.636 ms., avg: 6.082 ms.
Best time for 2560K FFT length: 6.003 ms., avg: 7.454 ms.
Best time for 3072K FFT length: 7.315 ms., avg: 7.658 ms.
Best time for 3584K FFT length: 9.087 ms., avg: 9.937 ms.
Best time for 4096K FFT length: 10.031 ms., avg: 10.480 ms.
Best time for 5120K FFT length: 12.685 ms., avg: 15.464 ms.
Best time for 6144K FFT length: 15.021 ms., avg: 15.328 ms.
Best time for 7168K FFT length: 17.718 ms., avg: 18.087 ms.
Best time for 8192K FFT length: 21.619 ms., avg: 22.076 ms.
Timing FFTs using 8 threads on 4 physical CPUs.
Best time for 768K FFT length: 1.466 ms., avg: 1.647 ms.
Best time for 896K FFT length: 1.799 ms., avg: 2.035 ms.
Best time for 1024K FFT length: 2.020 ms., avg: 2.312 ms.
Best time for 1280K FFT length: 2.567 ms., avg: 3.278 ms.
Best time for 1536K FFT length: 3.145 ms., avg: 3.565 ms.
Best time for 1792K FFT length: 3.880 ms., avg: 4.463 ms.
Best time for 2048K FFT length: 4.251 ms., avg: 4.737 ms.
Best time for 2560K FFT length: 5.459 ms., avg: 7.065 ms.
Best time for 3072K FFT length: 6.771 ms., avg: 7.051 ms.
Best time for 3584K FFT length: 8.493 ms., avg: 8.961 ms.
Best time for 4096K FFT length: 9.493 ms., avg: 9.850 ms.
Best time for 5120K FFT length: 10.965 ms., avg: 11.244 ms.
Best time for 6144K FFT length: 12.694 ms., avg: 13.180 ms.
Best time for 7168K FFT length: 14.442 ms., avg: 16.460 ms.
Best time for 8192K FFT length: 19.903 ms., avg: 23.450 ms.
Timing FFTs using 10 threads on 5 physical CPUs.
Best time for 768K FFT length: 1.385 ms., avg: 1.503 ms.
Best time for 896K FFT length: 1.703 ms., avg: 2.205 ms.
Best time for 1024K FFT length: 1.913 ms., avg: 2.148 ms.
Best time for 1280K FFT length: 2.423 ms., avg: 2.616 ms.
Best time for 1536K FFT length: 2.985 ms., avg: 3.177 ms.
Best time for 1792K FFT length: 3.631 ms., avg: 4.262 ms.
Best time for 2048K FFT length: 4.061 ms., avg: 4.855 ms.
Best time for 2560K FFT length: 5.160 ms., avg: 6.439 ms.
Best time for 3072K FFT length: 6.350 ms., avg: 7.193 ms.
Best time for 3584K FFT length: 8.003 ms., avg: 8.788 ms.
Best time for 4096K FFT length: 8.992 ms., avg: 10.087 ms.
Best time for 5120K FFT length: 10.316 ms., avg: 10.618 ms.
Best time for 6144K FFT length: 12.080 ms., avg: 14.433 ms.
Best time for 7168K FFT length: 13.702 ms., avg: 14.702 ms.
Best time for 8192K FFT length: 18.641 ms., avg: 19.888 ms.
Timing FFTs using 12 threads on 6 physical CPUs.
Best time for 768K FFT length: 1.341 ms., avg: 2.044 ms.
Best time for 896K FFT length: 1.650 ms., avg: 1.899 ms.
Best time for 1024K FFT length: 1.853 ms., avg: 2.241 ms.
Best time for 1280K FFT length: 2.349 ms., avg: 2.675 ms.
Best time for 1536K FFT length: 2.885 ms., avg: 3.267 ms.
Best time for 1792K FFT length: 3.543 ms., avg: 3.845 ms.
Best time for 2048K FFT length: 3.978 ms., avg: 4.259 ms.
Best time for 2560K FFT length: 5.058 ms., avg: 5.342 ms.
Best time for 3072K FFT length: 6.145 ms., avg: 6.588 ms.
Best time for 3584K FFT length: 7.639 ms., avg: 8.056 ms.
Best time for 4096K FFT length: 8.871 ms., avg: 9.962 ms.
Best time for 5120K FFT length: 10.090 ms., avg: 10.410 ms.
Best time for 6144K FFT length: 11.937 ms., avg: 13.256 ms.
Best time for 7168K FFT length: 13.119 ms., avg: 14.355 ms.
Best time for 8192K FFT length: 18.253 ms., avg: 19.411 ms.
Best time for 61 bit trial factors: 1.730 ms.
Best time for 62 bit trial factors: 1.738 ms.
Best time for 63 bit trial factors: 1.976 ms.
Best time for 64 bit trial factors: 2.042 ms.
Best time for 65 bit trial factors: 2.398 ms.
Best time for 66 bit trial factors: 2.815 ms.
Best time for 67 bit trial factors: 2.802 ms.
Best time for 75 bit trial factors: 2.729 ms.
Best time for 76 bit trial factors: 2.736 ms.
Best time for 77 bit trial factors: 2.716 ms.[/code][QUOTE=Prime95;289659]P.S. Can a Phenom user check to see if CPU speed detection is any better?[/QUOTE]Both v26.6 and v27.3 detected my 4500MHz Sandy-E as 4428MHz.

And both still don't recognize the architecture:[quote][Main thread Feb 17 10:01] Mersenne number primality test program version 27.3
[Main thread Feb 17 10:01] Optimizing for CPU architecture: [b]Unknown Intel[/b], L2 cache size: 256 KB, L3 cache size: 12 MB[/quote]

Is this a speed-testing alpha, or should it be considered a semi-stable beta and suitable for production work?

Ralf Recker 2012-02-17 15:15

[QUOTE=James Heinrich;289720]Both v26.6 and v27.3 detected my 4500MHz Sandy-E as 4428MHz.[/QUOTE]

What is your mainboard type / BIOS version?

[QUOTE=James Heinrich;289720]And both still don't recognize the architecture:[/QUOTE]

If you post your CPUID (you can look it up with CPU-Z or a similar tool) George might be able to use the information to improve the CPU detection.
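For reference, the mapping such detection needs is small. A sketch (the `intel_arch` helper is hypothetical, but the family/model values are the documented CPUID signatures: Sandy Bridge is family 6, model 42 (06_2AH); Sandy Bridge-E is model 45 (06_2DH); Ivy Bridge will be model 58 (06_3AH)):

```shell
# Hypothetical lookup from CPUID family/model (the numbers CPU-Z shows)
# to an architecture name.
intel_arch() {
  case "$1:$2" in
    6:42) echo "Sandy Bridge"   ;;  # 06_2AH: i5-2500K, i7-2600K, ...
    6:45) echo "Sandy Bridge-E" ;;  # 06_2DH: i7-3930K, i7-3960X
    6:58) echo "Ivy Bridge"     ;;  # 06_3AH
    *)    echo "unknown"        ;;
  esac
}

intel_arch 6 45
```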

fivemack 2012-02-17 15:19

[QUOTE=James Heinrich;289718]What about when all 6 cores are running? :smile:[/QUOTE]

Maybe I'm misunderstanding the request, but I think the question is whether there's a slowdown running six one-thread workers on six different jobs.

Robert_47 2012-02-17 16:31

Just FYI, neither version runs on an AMD FX-4100 Bulldozer.

Zero 2012-02-17 16:42

Here's a run on an i5-2500K @ 4.5GHz for comparison, with 4GB RAM @ 2133MT/s.
[code] [Fri Feb 17 23:16:05 2012]
Compare your results to other computers at http://www.mersenne.org/report_benchmarks
Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
CPU speed: 4429.34 MHz, 4 cores
CPU features: Prefetch, MMX, SSE, SSE2, SSE4, AVX
L1 cache size: 32 KB
L2 cache size: 256 KB, L3 cache size: 6 MB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
TLBS: 64
Prime95 64-bit version 27.3, RdtscTiming=1
Best time for 768K FFT length: 3.528 ms., avg: 3.556 ms.
Best time for 896K FFT length: 4.288 ms., avg: 4.299 ms.
Best time for 1024K FFT length: 4.817 ms., avg: 4.835 ms.
Best time for 1280K FFT length: 6.145 ms., avg: 6.158 ms.
Best time for 1536K FFT length: 7.547 ms., avg: 7.597 ms.
Best time for 1792K FFT length: 9.048 ms., avg: 9.057 ms.
Best time for 2048K FFT length: 10.071 ms., avg: 10.144 ms.
Best time for 2560K FFT length: 12.760 ms., avg: 12.794 ms.
Best time for 3072K FFT length: 15.845 ms., avg: 15.856 ms.
Best time for 3584K FFT length: 19.112 ms., avg: 19.134 ms.
Best time for 4096K FFT length: 21.419 ms., avg: 21.444 ms.
Best time for 5120K FFT length: 27.735 ms., avg: 27.755 ms.
Best time for 6144K FFT length: 33.359 ms., avg: 33.404 ms.
Best time for 7168K FFT length: 40.513 ms., avg: 40.526 ms.
Best time for 8192K FFT length: 46.788 ms., avg: 46.831 ms.
Timing FFTs using 2 threads.
Best time for 768K FFT length: 1.947 ms., avg: 1.956 ms.
Best time for 896K FFT length: 2.317 ms., avg: 2.457 ms.
Best time for 1024K FFT length: 2.587 ms., avg: 2.743 ms.
Best time for 1280K FFT length: 3.333 ms., avg: 3.344 ms.
Best time for 1536K FFT length: 4.058 ms., avg: 4.316 ms.
Best time for 1792K FFT length: 4.872 ms., avg: 4.898 ms.
Best time for 2048K FFT length: 5.403 ms., avg: 5.732 ms.
Best time for 2560K FFT length: 6.829 ms., avg: 6.868 ms.
Best time for 3072K FFT length: 8.434 ms., avg: 8.447 ms.
Best time for 3584K FFT length: 10.214 ms., avg: 10.232 ms.
Best time for 4096K FFT length: 11.372 ms., avg: 11.385 ms.
Best time for 5120K FFT length: 14.721 ms., avg: 14.776 ms.
Best time for 6144K FFT length: 17.614 ms., avg: 17.627 ms.
Best time for 7168K FFT length: 21.228 ms., avg: 21.244 ms.
Best time for 8192K FFT length: 24.790 ms., avg: 24.846 ms.
Timing FFTs using 3 threads.
Best time for 768K FFT length: 1.346 ms., avg: 1.360 ms.
Best time for 896K FFT length: 1.605 ms., avg: 1.624 ms.
Best time for 1024K FFT length: 1.807 ms., avg: 1.829 ms.
Best time for 1280K FFT length: 2.309 ms., avg: 2.336 ms.
Best time for 1536K FFT length: 2.804 ms., avg: 2.847 ms.
Best time for 1792K FFT length: 3.341 ms., avg: 3.373 ms.
Best time for 2048K FFT length: 3.759 ms., avg: 3.793 ms.
Best time for 2560K FFT length: 4.749 ms., avg: 4.785 ms.
Best time for 3072K FFT length: 5.897 ms., avg: 5.939 ms.
Best time for 3584K FFT length: 7.119 ms., avg: 7.186 ms.
Best time for 4096K FFT length: 8.076 ms., avg: 8.121 ms.
Best time for 5120K FFT length: 10.241 ms., avg: 10.296 ms.
Best time for 6144K FFT length: 12.166 ms., avg: 12.191 ms.
Best time for 7168K FFT length: 14.617 ms., avg: 14.654 ms.
Best time for 8192K FFT length: 17.305 ms., avg: 17.379 ms.
Timing FFTs using 4 threads.
Best time for 768K FFT length: 1.175 ms., avg: 1.182 ms.
Best time for 896K FFT length: 1.407 ms., avg: 1.414 ms.
Best time for 1024K FFT length: 1.586 ms., avg: 1.595 ms.
Best time for 1280K FFT length: 2.027 ms., avg: 2.038 ms.
Best time for 1536K FFT length: 2.452 ms., avg: 2.467 ms.
Best time for 1792K FFT length: 2.949 ms., avg: 2.963 ms.
Best time for 2048K FFT length: 3.260 ms., avg: 3.270 ms.
Best time for 2560K FFT length: 4.160 ms., avg: 4.212 ms.
Best time for 3072K FFT length: 5.050 ms., avg: 5.078 ms.
Best time for 3584K FFT length: 6.244 ms., avg: 6.264 ms.
Best time for 4096K FFT length: 6.917 ms., avg: 6.943 ms.
Best time for 5120K FFT length: 8.485 ms., avg: 8.560 ms.
Best time for 6144K FFT length: 10.061 ms., avg: 10.113 ms.
Best time for 7168K FFT length: 11.901 ms., avg: 11.986 ms.
Best time for 8192K FFT length: 14.386 ms., avg: 14.399 ms.
Best time for 61 bit trial factors: 1.715 ms.
Best time for 62 bit trial factors: 1.731 ms.
Best time for 63 bit trial factors: 1.957 ms.
Best time for 64 bit trial factors: 2.029 ms.
Best time for 65 bit trial factors: 2.376 ms.
Best time for 66 bit trial factors: 2.804 ms.
Best time for 67 bit trial factors: 2.776 ms.
Best time for 75 bit trial factors: 2.702 ms.
Best time for 76 bit trial factors: 2.698 ms.
Best time for 77 bit trial factors: 2.699 ms.
[/code]JH, it would be interesting to see your results with HT disabled.

bcp19 2012-02-17 16:55

Just put the new version on my 2400. The system has 2 cores running mfaktc, 1 doing an LL test on a 46M exponent, and 1 doing P-1. LL iteration times dropped from .024 to .019. Nice speed boost.
Am I right in thinking the P-1 code should be unaffected? I had the LL on 27.2 and the P-1 on 26.6 due to memory, but with this being a 64-bit build, I can now run both on 27.3.

2500K bench:
[code]Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
CPU speed: 4260.11 MHz, 4 cores
CPU features: Prefetch, MMX, SSE, SSE2, SSE4, AVX
L1 cache size: 32 KB
L2 cache size: 256 KB, L3 cache size: 6 MB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
TLBS: 64
Prime95 64-bit version 27.3, RdtscTiming=1
Best time for 768K FFT length: 3.677 ms., avg: 3.826 ms.
Best time for 896K FFT length: 4.489 ms., avg: 4.539 ms.
Best time for 1024K FFT length: 5.041 ms., avg: 5.064 ms.
Best time for 1280K FFT length: 6.452 ms., avg: 6.476 ms.
Best time for 1536K FFT length: 7.924 ms., avg: 8.121 ms.
Best time for 1792K FFT length: 9.499 ms., avg: 9.518 ms.
Best time for 2048K FFT length: 10.590 ms., avg: 10.612 ms.
Best time for 2560K FFT length: 13.410 ms., avg: 13.509 ms.
Best time for 3072K FFT length: 16.680 ms., avg: 16.714 ms.
Best time for 3584K FFT length: 20.142 ms., avg: 20.180 ms.
Best time for 4096K FFT length: 22.639 ms., avg: 22.790 ms.
Best time for 5120K FFT length: 29.448 ms., avg: 29.816 ms.
Best time for 6144K FFT length: 35.307 ms., avg: 35.353 ms.
Best time for 7168K FFT length: 42.849 ms., avg: 42.887 ms.
Best time for 8192K FFT length: 49.787 ms., avg: 49.882 ms.
Timing FFTs using 2 threads.
Best time for 768K FFT length: 2.037 ms., avg: 2.138 ms.
Best time for 896K FFT length: 2.429 ms., avg: 2.466 ms.
Best time for 1024K FFT length: 2.724 ms., avg: 2.755 ms.
Best time for 1280K FFT length: 3.529 ms., avg: 3.551 ms.
Best time for 1536K FFT length: 4.288 ms., avg: 4.332 ms.
Best time for 1792K FFT length: 5.137 ms., avg: 5.180 ms.
Best time for 2048K FFT length: 5.752 ms., avg: 5.786 ms.
Best time for 2560K FFT length: 7.260 ms., avg: 7.306 ms.
Best time for 3072K FFT length: 8.987 ms., avg: 9.026 ms.
Best time for 3584K FFT length: 10.890 ms., avg: 11.477 ms.
Best time for 4096K FFT length: 12.189 ms., avg: 12.224 ms.
Best time for 5120K FFT length: 15.746 ms., avg: 16.710 ms.
Best time for 6144K FFT length: 18.816 ms., avg: 19.186 ms.
Best time for 7168K FFT length: 22.616 ms., avg: 23.328 ms.
Best time for 8192K FFT length: 26.472 ms., avg: 26.990 ms.
Timing FFTs using 3 threads.
Best time for 768K FFT length: 1.420 ms., avg: 1.455 ms.
Best time for 896K FFT length: 1.708 ms., avg: 1.758 ms.
Best time for 1024K FFT length: 1.955 ms., avg: 1.998 ms.
Best time for 1280K FFT length: 2.532 ms., avg: 2.578 ms.
Best time for 1536K FFT length: 3.118 ms., avg: 3.171 ms.
Best time for 1792K FFT length: 3.683 ms., avg: 3.720 ms.
Best time for 2048K FFT length: 4.193 ms., avg: 4.600 ms.
Best time for 2560K FFT length: 5.345 ms., avg: 5.787 ms.
Best time for 3072K FFT length: 6.541 ms., avg: 6.711 ms.
Best time for 3584K FFT length: 8.069 ms., avg: 8.950 ms.
Best time for 4096K FFT length: 9.011 ms., avg: 9.303 ms.
Best time for 5120K FFT length: 11.368 ms., avg: 11.669 ms.
Best time for 6144K FFT length: 13.815 ms., avg: 14.036 ms.
Best time for 7168K FFT length: 16.336 ms., avg: 16.566 ms.
Best time for 8192K FFT length: 19.011 ms., avg: 19.205 ms.
Timing FFTs using 4 threads.
Best time for 768K FFT length: 1.289 ms., avg: 1.307 ms.
Best time for 896K FFT length: 1.572 ms., avg: 1.589 ms.
Best time for 1024K FFT length: 1.773 ms., avg: 1.825 ms.
Best time for 1280K FFT length: 2.309 ms., avg: 2.366 ms.
Best time for 1536K FFT length: 2.817 ms., avg: 3.212 ms.
Best time for 1792K FFT length: 3.364 ms., avg: 3.433 ms.
Best time for 2048K FFT length: 3.795 ms., avg: 3.886 ms.
Best time for 2560K FFT length: 4.860 ms., avg: 5.039 ms.
Best time for 3072K FFT length: 5.842 ms., avg: 6.479 ms.
Best time for 3584K FFT length: 7.207 ms., avg: 7.550 ms.
Best time for 4096K FFT length: 8.130 ms., avg: 8.508 ms.
Best time for 5120K FFT length: 10.159 ms., avg: 10.619 ms.
Best time for 6144K FFT length: 12.097 ms., avg: 13.624 ms.
Best time for 7168K FFT length: 14.258 ms., avg: 14.404 ms.
Best time for 8192K FFT length: 16.324 ms., avg: 16.675 ms.
Best time for 61 bit trial factors: 1.787 ms.
Best time for 62 bit trial factors: 1.796 ms.
Best time for 63 bit trial factors: 2.036 ms.
Best time for 64 bit trial factors: 2.107 ms.
Best time for 65 bit trial factors: 2.468 ms.
Best time for 66 bit trial factors: 2.911 ms.
Best time for 67 bit trial factors: 2.886 ms.
Best time for 75 bit trial factors: 2.809 ms.
Best time for 76 bit trial factors: 2.814 ms.
Best time for 77 bit trial factors: 2.810 ms.
[/code]

Lennart 2012-02-17 17:12

[QUOTE=Robert_47;289728]Just FYI, neither version runs on an AMD FX-4100 Bulldozer.[/QUOTE]

What software did you use?


Lennart

Prime95 2012-02-17 17:15

[QUOTE=fivemack;289723]Maybe I'm misunderstanding the request, but I think the question is whether there's a slowdown running six one-thread workers on six different jobs[/QUOTE]

That is correct. James, try adding "TimingOutput=4" to prime.txt. Restart and run just one worker. Note the per-iteration times. Now start the second worker, note the times again, and so on. Do the workers slow down a lot?

On my machine (all workers running 2400K FFTs), I get times of 1 worker - 13.7ms, 2 workers - 13.9ms, 3 workers - 14.5ms, 4 workers - 16.6ms.

Prime95 2012-02-17 17:17

[QUOTE=Robert_47;289728]Just FYI, neither version runs on an AMD FX-4100 Bulldozer.[/QUOTE]

Grrrr. Does Options/CPU identify the chip as supporting AVX?

If not, can you add the line "CpuSupportsAVX=1" to local.ini and let me know if your benchmarks indicate prime95 runs faster with AVX vs. v26 using SSE2? Thanks.

Prime95 2012-02-17 17:24

[QUOTE=James Heinrich;289720]Both v26.6 and v27.3 detected my 4500MHz Sandy-E as 4428MHz.

And both still don't recognize the architecture:

Is this a speed-testing alpha, or should it be considered a semi-stable beta and suitable for production work?[/QUOTE]

I got the family/model number from cpu-world.com. You'll get recognized properly in the next release.

I think this version is fairly stable and suitable for production work.

Prime95 2012-02-17 17:30

[QUOTE=bcp19;289730]Am I right in thinking the P-1 coding should be unaffected?[/QUOTE]

If you are asking "Can I use 27.3 and resume a P-1 that was partially completed by an earlier version?", the answer is yes.

mdettweiler 2012-02-17 17:47

[QUOTE=Prime95;289734]That is correct. James, try adding "TimingOutput=4" to prime.txt. Restart and run just one worker. Note the per-iteration times. Now start the second worker, note the times again, and so on. Do the workers slow down a lot?

On my machine (all workers running 2400K FFTs), I get times of 1 worker - 13.7ms, 2 workers - 13.9ms, 3 workers - 14.5ms, 4 workers - 16.6ms.[/QUOTE]
Just curious, would this memory-bandwidth bottleneck affect (relatively) very small FFTs that fit entirely in-cache (or something like that...I forget exactly how it works), such as those often used with LLR?

Prime95 2012-02-17 17:58

[QUOTE=mdettweiler;289738]Just curious, would this memory-bandwidth bottleneck affect (relatively) very small FFTs that fit entirely in-cache (or something like that...I forget exactly how it works), such as those often used with LLR?[/QUOTE]

Probably not. My L3 cache is 6MB, or 1.5 MB per core. A float is 8 bytes, so the max FFT size is 192K. The sin/cos data, the program itself, and the OS will all want memory too. Maybe a 128K FFT will fit in the L3 cache. At 20 bits per float, you might be able to test 2.5-million-bit numbers. If you try this, let me know if you see a slowdown as you run more workers.
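
The back-of-the-envelope arithmetic above can be checked with a quick sketch (the cache size, core count, and bits-per-float are the figures assumed in the post, not measurements):

```python
# Rough check of the in-cache FFT estimate: 6 MB shared L3 across 4 cores,
# 8 bytes per double-precision float (figures assumed in the post above).
l3_bytes = 6 * 1024 * 1024
cores = 4
bytes_per_float = 8

floats_per_core = l3_bytes // cores // bytes_per_float
print(floats_per_core // 1024, "K floats per core")  # upper bound: a 192K FFT

# Sin/cos tables, the program, and the OS also want cache space, so assume
# only a 128K FFT actually fits; at ~20 bits stored per float that handles
# numbers of roughly:
fft_size = 128 * 1024
bits_per_float = 20
print(fft_size * bits_per_float, "bits")  # about 2.6 million bits
```

This reproduces the 192K ceiling and the "2.5 million bit" figure quoted in the post.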

James Heinrich 2012-02-17 18:01

[QUOTE=Ralf Recker;289721]What is your mainboard type / BIOS version?[/QUOTE]Asus P9X79 PRO, BIOS v0802

[QUOTE=Ralf Recker;289721]If you post your CPUID (you can look it up with CPU-Z or a similar tool) George might be able to use the information to improve the CPU detection.[/QUOTE]CPU-Z screenshot attached.

[QUOTE=Zero;289729]JH, it would be interesting to see your results with HT disabled.[/QUOTE]
Faster:
[code]Compare your results to other computers at http://www.mersenne.org/report_benchmarks
Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
CPU speed: 4425.82 MHz, 6 cores
CPU features: Prefetch, MMX, SSE, SSE2, SSE4, AVX
L1 cache size: 32 KB
L2 cache size: 256 KB, L3 cache size: 12 MB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
TLBS: 64
Prime95 64-bit version 27.3, RdtscTiming=1
Best time for 768K FFT length: 3.489 ms., avg: 3.518 ms.
Best time for 896K FFT length: 4.176 ms., avg: 4.653 ms.
Best time for 1024K FFT length: 4.750 ms., avg: 4.867 ms.
Best time for 1280K FFT length: 6.127 ms., avg: 6.862 ms.
Best time for 1536K FFT length: 7.589 ms., avg: 8.131 ms.
Best time for 1792K FFT length: 9.277 ms., avg: 9.828 ms.
Best time for 2048K FFT length: 10.422 ms., avg: 10.728 ms.
Best time for 2560K FFT length: 13.317 ms., avg: 14.179 ms.
Best time for 3072K FFT length: 16.633 ms., avg: 17.080 ms.
Best time for 3584K FFT length: 20.201 ms., avg: 20.543 ms.
Best time for 4096K FFT length: 22.846 ms., avg: 23.072 ms.
Best time for 5120K FFT length: 30.047 ms., avg: 31.448 ms.
Best time for 6144K FFT length: 36.142 ms., avg: 38.056 ms.
Best time for 7168K FFT length: 43.865 ms., avg: 44.438 ms.
Best time for 8192K FFT length: 50.945 ms., avg: 51.268 ms.
Timing FFTs using 2 threads.
Best time for 768K FFT length: 1.964 ms., avg: 2.017 ms.
Best time for 896K FFT length: 2.268 ms., avg: 2.521 ms.
Best time for 1024K FFT length: 2.570 ms., avg: 2.840 ms.
Best time for 1280K FFT length: 3.346 ms., avg: 3.861 ms.
Best time for 1536K FFT length: 4.126 ms., avg: 4.513 ms.
Best time for 1792K FFT length: 4.986 ms., avg: 5.269 ms.
Best time for 2048K FFT length: 5.590 ms., avg: 5.960 ms.
Best time for 2560K FFT length: 7.147 ms., avg: 7.247 ms.
Best time for 3072K FFT length: 8.904 ms., avg: 10.948 ms.
Best time for 3584K FFT length: 10.938 ms., avg: 11.665 ms.
Best time for 4096K FFT length: 12.223 ms., avg: 12.343 ms.
Best time for 5120K FFT length: 15.972 ms., avg: 17.223 ms.
Best time for 6144K FFT length: 19.084 ms., avg: 19.621 ms.
Best time for 7168K FFT length: 23.257 ms., avg: 23.906 ms.
Best time for 8192K FFT length: 27.104 ms., avg: 27.206 ms.
Timing FFTs using 3 threads.
Best time for 768K FFT length: 1.356 ms., avg: 1.397 ms.
Best time for 896K FFT length: 1.585 ms., avg: 1.637 ms.
Best time for 1024K FFT length: 1.781 ms., avg: 1.871 ms.
Best time for 1280K FFT length: 2.300 ms., avg: 2.585 ms.
Best time for 1536K FFT length: 2.898 ms., avg: 3.237 ms.
Best time for 1792K FFT length: 3.532 ms., avg: 4.807 ms.
Best time for 2048K FFT length: 3.944 ms., avg: 5.175 ms.
Best time for 2560K FFT length: 4.980 ms., avg: 8.678 ms.
Best time for 3072K FFT length: 6.210 ms., avg: 7.212 ms.
Best time for 3584K FFT length: 7.864 ms., avg: 8.434 ms.
Best time for 4096K FFT length: 8.591 ms., avg: 9.051 ms.
Best time for 5120K FFT length: 11.160 ms., avg: 12.002 ms.
Best time for 6144K FFT length: 13.168 ms., avg: 14.839 ms.
Best time for 7168K FFT length: 15.991 ms., avg: 16.532 ms.
Best time for 8192K FFT length: 18.811 ms., avg: 20.220 ms.
Timing FFTs using 4 threads.
Best time for 768K FFT length: 1.289 ms., avg: 1.306 ms.
Best time for 896K FFT length: 1.517 ms., avg: 1.603 ms.
Best time for 1024K FFT length: 1.697 ms., avg: 1.795 ms.
Best time for 1280K FFT length: 2.191 ms., avg: 2.825 ms.
Best time for 1536K FFT length: 2.660 ms., avg: 3.394 ms.
Best time for 1792K FFT length: 3.196 ms., avg: 3.266 ms.
Best time for 2048K FFT length: 3.531 ms., avg: 3.884 ms.
Best time for 2560K FFT length: 4.483 ms., avg: 4.571 ms.
Best time for 3072K FFT length: 5.494 ms., avg: 6.034 ms.
Best time for 3584K FFT length: 6.751 ms., avg: 7.105 ms.
Best time for 4096K FFT length: 7.544 ms., avg: 9.167 ms.
Best time for 5120K FFT length: 8.947 ms., avg: 9.108 ms.
Best time for 6144K FFT length: 10.688 ms., avg: 12.479 ms.
Best time for 7168K FFT length: 12.787 ms., avg: 13.099 ms.
Best time for 8192K FFT length: 16.064 ms., avg: 18.325 ms.
Timing FFTs using 5 threads.
Best time for 768K FFT length: 1.201 ms., avg: 1.218 ms.
Best time for 896K FFT length: 1.457 ms., avg: 1.708 ms.
Best time for 1024K FFT length: 1.615 ms., avg: 1.642 ms.
Best time for 1280K FFT length: 2.070 ms., avg: 2.183 ms.
Best time for 1536K FFT length: 2.499 ms., avg: 2.537 ms.
Best time for 1792K FFT length: 3.035 ms., avg: 3.092 ms.
Best time for 2048K FFT length: 3.354 ms., avg: 3.414 ms.
Best time for 2560K FFT length: 4.285 ms., avg: 4.558 ms.
Best time for 3072K FFT length: 5.231 ms., avg: 5.284 ms.
Best time for 3584K FFT length: 6.474 ms., avg: 6.568 ms.
Best time for 4096K FFT length: 7.195 ms., avg: 7.287 ms.
Best time for 5120K FFT length: 8.379 ms., avg: 10.145 ms.
Best time for 6144K FFT length: 9.583 ms., avg: 10.677 ms.
Best time for 7168K FFT length: 11.155 ms., avg: 12.152 ms.
Best time for 8192K FFT length: 15.262 ms., avg: 17.920 ms.
Timing FFTs using 6 threads.
Best time for 768K FFT length: 1.145 ms., avg: 1.161 ms.
Best time for 896K FFT length: 1.383 ms., avg: 1.403 ms.
Best time for 1024K FFT length: 1.538 ms., avg: 1.562 ms.
Best time for 1280K FFT length: 1.965 ms., avg: 1.998 ms.
Best time for 1536K FFT length: 2.390 ms., avg: 2.445 ms.
Best time for 1792K FFT length: 2.918 ms., avg: 3.389 ms.
Best time for 2048K FFT length: 3.234 ms., avg: 3.581 ms.
Best time for 2560K FFT length: 4.163 ms., avg: 4.245 ms.
Best time for 3072K FFT length: 5.088 ms., avg: 5.615 ms.
Best time for 3584K FFT length: 6.297 ms., avg: 6.394 ms.
Best time for 4096K FFT length: 6.991 ms., avg: 7.393 ms.
Best time for 5120K FFT length: 8.213 ms., avg: 8.400 ms.
Best time for 6144K FFT length: 9.420 ms., avg: 10.439 ms.
Best time for 7168K FFT length: 10.742 ms., avg: 11.658 ms.
Best time for 8192K FFT length: 14.644 ms., avg: 17.097 ms.
Best time for 61 bit trial factors: 1.716 ms.
Best time for 62 bit trial factors: 1.718 ms.
Best time for 63 bit trial factors: 1.958 ms.
Best time for 64 bit trial factors: 2.031 ms.
Best time for 65 bit trial factors: 2.371 ms.
Best time for 66 bit trial factors: 2.808 ms.
Best time for 67 bit trial factors: 2.777 ms.
Best time for 75 bit trial factors: 2.701 ms.
Best time for 76 bit trial factors: 2.701 ms.
Best time for 77 bit trial factors: 2.707 ms.[/code]

James Heinrich 2012-02-17 18:15

[QUOTE=Prime95;289734]That is correct. James, try adding "TimingOutput=4" to prime.txt. Restart and run just one worker. Note the per-iteration times. Now start the second worker, note the times again, and so on. Do the workers slow down a lot?

On my machine (all workers running 2400K FFTs), I get times of 1 worker - 13.7ms, 2 workers - 13.9ms, 3 workers - 14.5ms, 4 workers - 16.6ms.[/QUOTE]
With hyperthreading disabled:
[quote]Resuming primality test of M44789989 using AVX Core2 type-3 FFT length 2400K, Pass1=384, Pass2=6400[/quote]
1: 12.8ms
2: 13.2ms
3: 13.6ms
4: 14.8ms
5: 16.7ms
6: ~19ms (ranges from 18.3 to 21.1 in different workers)

flashjh 2012-02-17 18:22

1 Attachment(s)
[QUOTE=Prime95;289735]Grrrr. Does Options/CPU identify the chip as supporting AVX?

If not, can you add the line "CpuSupportsAVX=1" to local.ini and let me know if your benchmarks indicate prime95 runs faster with AVX vs. v26 using SSE2? Thanks.[/QUOTE]

I have a dual Opteron 6272 (Bulldozer) system. It doesn't work. Here is what happens when I run the benchmark:

[CODE]
[Feb 17 11:17] Worker starting
[Feb 17 11:17] Your timings will be written to the results.txt file.
[Feb 17 11:17] Compare your results to other computers at [URL]http://www.mersenne.org/report_benchmarks[/URL]
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 2 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 3 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 4 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 5 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 6 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 7 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 8 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 9 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 10 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 11 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 12 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 13 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 14 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 15 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 16 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 17 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 18 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 19 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 20 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 21 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 22 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 23 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 24 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 25 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 26 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 27 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 28 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 29 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 30 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 31 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing FFTs using 32 threads.
[Feb 17 11:17] Cannot initialize FFT code, errcode=1002
[Feb 17 11:17] Number sent to gwsetup is too large for the FFTs to handle.
[Feb 17 11:17] Timing trial factoring of M35000011 with 61 bit length factors. Best time: 6.404 ms.
[Feb 17 11:17] Timing trial factoring of M35000011 with 62 bit length factors. Best time: 6.428 ms.
[Feb 17 11:17] Timing trial factoring of M35000011 with 63 bit length factors. Best time: 12.623 ms.
[Feb 17 11:17] Timing trial factoring of M35000011 with 64 bit length factors. Best time: 12.646 ms.
[Feb 17 11:17] Timing trial factoring of M35000011 with 65 bit length factors. Best time: 10.620 ms.
[Feb 17 11:17] Timing trial factoring of M35000011 with 66 bit length factors. Best time: 10.568 ms.
[Feb 17 11:17] Timing trial factoring of M35000011 with 67 bit length factors. Best time: 10.608 ms.
[Feb 17 11:17] Timing trial factoring of M35000011 with 75 bit length factors. Best time: 10.851 ms.
[Feb 17 11:17] Timing trial factoring of M35000011 with 76 bit length factors. Best time: 10.865 ms.
[Feb 17 11:17] Timing trial factoring of M35000011 with 77 bit length factors. Best time: 10.838 ms.
[Feb 17 11:17] Benchmark complete.
[Feb 17 11:17] Worker stopped.
[/CODE]

The screenshot shows what it detected at startup. Note that it sees AVX, but the L3 cache size should be 16MB. Also, the CPU cores run at 2.1GHz each.

Robert_47 2012-02-17 18:39

[QUOTE=Prime95;289735]Grrrr. Does Options/CPU identify the chip as supporting AVX?

If not, can you add the line "CpuSupportsAVX=1" to local.ini and let me know if your benchmarks indicate prime95 runs faster with AVX vs. v26 using SSE2? Thanks.[/QUOTE]

It does indicate AVX support, and adding CpuSupportsAVX=1 does nothing. Adding CpuArchitecture=5 does work, with the following results.

[CODE]
[Fri Feb 17 11:10:00 2012]
Compare your results to other computers at http://www.mersenne.org/report_benchmarks
AMD FX(tm)-4100 Quad-Core Processor
CPU speed: 7145.40 MHz, 4 cores
CPU features: Prefetch, MMX, SSE, SSE2, SSE4, AVX
L1 cache size: 16 KB
L2 cache size: 2 MB, L3 cache size: 8 MB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
L1 TLBS: 32
L2 TLBS: 1024
Prime95 64-bit version 27.3, RdtscTiming=1
Best time for 768K FFT length: 16.035 ms., avg: 16.180 ms.
Best time for 896K FFT length: 18.598 ms., avg: 18.746 ms.
Best time for 1024K FFT length: 21.362 ms., avg: 21.603 ms.
Best time for 1280K FFT length: 27.315 ms., avg: 27.488 ms.
Best time for 1536K FFT length: 34.157 ms., avg: 34.386 ms.
Best time for 1792K FFT length: 38.610 ms., avg: 39.915 ms.
Best time for 2048K FFT length: 45.505 ms., avg: 45.838 ms.
Best time for 2560K FFT length: 58.154 ms., avg: 58.291 ms.
Best time for 3072K FFT length: 72.854 ms., avg: 73.033 ms.
Best time for 3584K FFT length: 87.523 ms., avg: 88.185 ms.
Best time for 4096K FFT length: 101.059 ms., avg: 101.255 ms.
Best time for 5120K FFT length: 126.274 ms., avg: 126.459 ms.
Best time for 6144K FFT length: 154.792 ms., avg: 155.017 ms.
Best time for 7168K FFT length: 180.377 ms., avg: 180.638 ms.
Best time for 8192K FFT length: 228.028 ms., avg: 229.531 ms.
Timing FFTs using 2 threads.
Best time for 768K FFT length: 10.198 ms., avg: 10.450 ms.
Best time for 896K FFT length: 12.109 ms., avg: 12.195 ms.
Best time for 1024K FFT length: 13.813 ms., avg: 14.111 ms.
Best time for 1280K FFT length: 17.648 ms., avg: 17.932 ms.
Best time for 1536K FFT length: 22.396 ms., avg: 22.549 ms.
Best time for 1792K FFT length: 26.009 ms., avg: 26.172 ms.
Best time for 2048K FFT length: 29.854 ms., avg: 30.128 ms.
Best time for 2560K FFT length: 38.316 ms., avg: 38.867 ms.
Best time for 3072K FFT length: 47.161 ms., avg: 47.609 ms.
Best time for 3584K FFT length: 56.724 ms., avg: 56.919 ms.
Best time for 4096K FFT length: 64.791 ms., avg: 65.295 ms.
Best time for 5120K FFT length: 81.092 ms., avg: 81.646 ms.
Best time for 6144K FFT length: 101.216 ms., avg: 102.067 ms.
Best time for 7168K FFT length: 119.703 ms., avg: 120.206 ms.
Best time for 8192K FFT length: 147.991 ms., avg: 148.265 ms.
Timing FFTs using 3 threads.
Best time for 768K FFT length: 7.077 ms., avg: 7.567 ms.
Best time for 896K FFT length: 8.158 ms., avg: 8.368 ms.
Best time for 1024K FFT length: 9.247 ms., avg: 9.802 ms.
Best time for 1280K FFT length: 12.015 ms., avg: 12.205 ms.
Best time for 1536K FFT length: 14.970 ms., avg: 15.260 ms.
Best time for 1792K FFT length: 17.418 ms., avg: 17.661 ms.
Best time for 2048K FFT length: 20.051 ms., avg: 20.346 ms.
Best time for 2560K FFT length: 25.626 ms., avg: 26.019 ms.
Best time for 3072K FFT length: 31.602 ms., avg: 32.133 ms.
Best time for 3584K FFT length: 37.741 ms., avg: 38.158 ms.
Best time for 4096K FFT length: 43.382 ms., avg: 43.971 ms.
Best time for 5120K FFT length: 54.130 ms., avg: 54.678 ms.
Best time for 6144K FFT length: 67.366 ms., avg: 68.239 ms.
Best time for 7168K FFT length: 78.552 ms., avg: 79.195 ms.
Best time for 8192K FFT length: 97.640 ms., avg: 98.646 ms.
Timing FFTs using 4 threads.
Best time for 768K FFT length: 5.554 ms., avg: 5.650 ms.
Best time for 896K FFT length: 6.334 ms., avg: 6.482 ms.
Best time for 1024K FFT length: 7.213 ms., avg: 7.749 ms.
Best time for 1280K FFT length: 9.412 ms., avg: 9.538 ms.
Best time for 1536K FFT length: 11.761 ms., avg: 12.036 ms.
Best time for 1792K FFT length: 13.605 ms., avg: 13.865 ms.
Best time for 2048K FFT length: 15.838 ms., avg: 16.180 ms.
Best time for 2560K FFT length: 20.355 ms., avg: 20.613 ms.
Best time for 3072K FFT length: 24.890 ms., avg: 25.225 ms.
Best time for 3584K FFT length: 29.366 ms., avg: 29.775 ms.
Best time for 4096K FFT length: 34.192 ms., avg: 34.332 ms.
Best time for 5120K FFT length: 42.279 ms., avg: 42.504 ms.
Best time for 6144K FFT length: 53.200 ms., avg: 53.802 ms.
[Fri Feb 17 11:15:03 2012]
Best time for 7168K FFT length: 61.309 ms., avg: 61.941 ms.
Best time for 8192K FFT length: 77.664 ms., avg: 78.375 ms.
Best time for 61 bit trial factors: 2.898 ms.
Best time for 62 bit trial factors: 2.935 ms.
Best time for 63 bit trial factors: 3.331 ms.
Best time for 64 bit trial factors: 3.821 ms.
Best time for 65 bit trial factors: 4.922 ms.
Best time for 66 bit trial factors: 7.235 ms.
Best time for 67 bit trial factors: 7.068 ms.
Best time for 75 bit trial factors: 5.679 ms.
Best time for 76 bit trial factors: 5.657 ms.
Best time for 77 bit trial factors: 5.664 ms.
[/CODE]

The 26.6 benchmarks:

[CODE]
[Fri Feb 17 11:16:31 2012]
Compare your results to other computers at http://www.mersenne.org/report_benchmarks
AMD FX(tm)-4100 Quad-Core Processor
CPU speed: 7145.37 MHz, 4 cores
CPU features: Prefetch, MMX, SSE, SSE2, SSE4, AVX
L1 cache size: 16 KB
L2 cache size: 2 MB, L3 cache size: 8 MB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
L1 TLBS: 32
L2 TLBS: 1024
Prime95 64-bit version 26.6, RdtscTiming=1
Best time for 768K FFT length: 12.699 ms., avg: 13.011 ms.
Best time for 896K FFT length: 14.730 ms., avg: 15.182 ms.
Best time for 1024K FFT length: 16.543 ms., avg: 16.850 ms.
Best time for 1280K FFT length: 21.618 ms., avg: 22.041 ms.
Best time for 1536K FFT length: 27.136 ms., avg: 27.453 ms.
Best time for 1792K FFT length: 32.042 ms., avg: 32.217 ms.
Best time for 2048K FFT length: 35.837 ms., avg: 36.206 ms.
Best time for 2560K FFT length: 44.669 ms., avg: 45.025 ms.
Best time for 3072K FFT length: 55.582 ms., avg: 55.973 ms.
Best time for 3584K FFT length: 70.664 ms., avg: 70.783 ms.
Best time for 4096K FFT length: 72.125 ms., avg: 73.739 ms.
Best time for 5120K FFT length: 95.987 ms., avg: 96.159 ms.
Best time for 6144K FFT length: 118.114 ms., avg: 118.218 ms.
Best time for 7168K FFT length: 145.150 ms., avg: 145.459 ms.
Best time for 8192K FFT length: 160.341 ms., avg: 160.421 ms.
Timing FFTs using 2 threads.
Best time for 768K FFT length: 8.115 ms., avg: 8.359 ms.
Best time for 896K FFT length: 9.243 ms., avg: 9.315 ms.
Best time for 1024K FFT length: 10.619 ms., avg: 10.959 ms.
Best time for 1280K FFT length: 13.965 ms., avg: 14.182 ms.
Best time for 1536K FFT length: 17.407 ms., avg: 17.531 ms.
Best time for 1792K FFT length: 19.832 ms., avg: 20.011 ms.
Best time for 2048K FFT length: 23.235 ms., avg: 23.496 ms.
Best time for 2560K FFT length: 28.494 ms., avg: 28.726 ms.
Best time for 3072K FFT length: 35.235 ms., avg: 35.437 ms.
Best time for 3584K FFT length: 47.298 ms., avg: 47.841 ms.
Best time for 4096K FFT length: 47.045 ms., avg: 47.491 ms.
Best time for 5120K FFT length: 60.316 ms., avg: 60.747 ms.
Best time for 6144K FFT length: 74.983 ms., avg: 75.628 ms.
Best time for 7168K FFT length: 91.288 ms., avg: 91.763 ms.
Best time for 8192K FFT length: 105.784 ms., avg: 106.304 ms.
Timing FFTs using 3 threads.
Best time for 768K FFT length: 5.506 ms., avg: 5.592 ms.
Best time for 896K FFT length: 6.490 ms., avg: 6.588 ms.
Best time for 1024K FFT length: 7.328 ms., avg: 7.850 ms.
Best time for 1280K FFT length: 9.430 ms., avg: 9.531 ms.
Best time for 1536K FFT length: 11.618 ms., avg: 11.750 ms.
Best time for 1792K FFT length: 13.882 ms., avg: 14.021 ms.
Best time for 2048K FFT length: 15.552 ms., avg: 15.774 ms.
Best time for 2560K FFT length: 19.449 ms., avg: 19.657 ms.
Best time for 3072K FFT length: 23.973 ms., avg: 24.064 ms.
Best time for 3584K FFT length: 32.700 ms., avg: 33.082 ms.
Best time for 4096K FFT length: 31.916 ms., avg: 31.972 ms.
Best time for 5120K FFT length: 41.108 ms., avg: 41.458 ms.
Best time for 6144K FFT length: 50.510 ms., avg: 51.064 ms.
Best time for 7168K FFT length: 61.613 ms., avg: 62.130 ms.
Best time for 8192K FFT length: 69.090 ms., avg: 69.676 ms.
Timing FFTs using 4 threads.
Best time for 768K FFT length: 4.364 ms., avg: 4.425 ms.
Best time for 896K FFT length: 5.044 ms., avg: 5.093 ms.
Best time for 1024K FFT length: 5.716 ms., avg: 6.188 ms.
Best time for 1280K FFT length: 7.491 ms., avg: 7.534 ms.
Best time for 1536K FFT length: 9.245 ms., avg: 9.304 ms.
Best time for 1792K FFT length: 10.880 ms., avg: 11.000 ms.
Best time for 2048K FFT length: 12.416 ms., avg: 12.567 ms.
Best time for 2560K FFT length: 15.492 ms., avg: 15.610 ms.
Best time for 3072K FFT length: 19.008 ms., avg: 19.299 ms.
Best time for 3584K FFT length: 25.681 ms., avg: 25.914 ms.
Best time for 4096K FFT length: 25.362 ms., avg: 25.521 ms.
Best time for 5120K FFT length: 32.789 ms., avg: 32.869 ms.
Best time for 6144K FFT length: 40.238 ms., avg: 40.991 ms.
Best time for 7168K FFT length: 48.517 ms., avg: 48.870 ms.
Best time for 8192K FFT length: 56.353 ms., avg: 57.335 ms.
Best time for 61 bit trial factors: 2.898 ms.
Best time for 62 bit trial factors: 2.935 ms.
Best time for 63 bit trial factors: 3.327 ms.
Best time for 64 bit trial factors: 3.822 ms.
Best time for 65 bit trial factors: 4.925 ms.
Best time for 66 bit trial factors: 5.845 ms.
Best time for 67 bit trial factors: 5.814 ms.
Best time for 75 bit trial factors: 5.674 ms.
Best time for 76 bit trial factors: 5.656 ms.
Best time for 77 bit trial factors: 5.659 ms.
[/CODE]

Both benchmarks were run as 64-bit. The actual CPU speed was 3800MHz.

Prime95 2012-02-17 19:30

[QUOTE=James Heinrich;289742]With hyperthreading disabled:
1: 12.8ms
2: 13.2ms
3: 13.6ms
4: 14.8ms
5: 16.7ms
6: ~19ms (ranges from 18.3 to 21.1 in different workers)[/QUOTE]

Interesting - worse than I would have expected (I assume all 4 memory channels are populated). Perhaps contention for the L3 cache is the culprit. If I understand correctly, your L3 cache has to feed all 6 cores.

Prime95 2012-02-17 19:34

[QUOTE=Robert_47;289747]It does indicate AVX support, and adding CpuSupportsAVX=1 does nothing. Adding CpuArchitecture=5 does work, with the following results.[/QUOTE]

Pretty grim.

Until I write a version that supports FMA, it looks like I need to steer Bulldozer down the SSE2 path. The way AMD implemented AVX on Bulldozer, SSE2 and AVX have the same theoretical throughput unless you use FMA.

Jwb52z 2012-02-17 19:41

How many more interim versions do you think there will be before non-Sandy Bridge CPUs can utilize version 27?

Robert_47 2012-02-17 19:46

[QUOTE=Lennart;289732]What software did you use?


Lennart[/QUOTE]

Prime95, and Rebirther's newest llr and pfgw. All three have the same problems in both 32-bit and 64-bit.

James Heinrich 2012-02-17 19:50

[QUOTE=Prime95;289754]Interesting - worse than I would have expected (I assume all 4 memory channels are populated). Perhaps contention for the L3 cache is the culprit. If I understand correctly, your L3 cache has to feed all 6 cores.[/QUOTE]All 4 channels are populated, but the RAM is running at 1333MHz, so performance may be better with faster RAM.

aketilander 2012-02-17 20:11

Intel Core i7-3930K @ 3.20GHz
 
Well, when I changed from version 26.6 to version 27.3 on my Intel Core i7-3930K @ 3.20GHz, the per-iteration time dropped from 0.033 to 0.021 on all six cores under Windows Ultimate 64-bit. All six cores report that they are using AVX.

So there seems to be a VERY large improvement! :smile:

Prime95 2012-02-17 20:12

[QUOTE=Jwb52z;289757]How many more interim versions do you think there will be before non-Sandy Bridge CPUs can utilize version 27?[/QUOTE]

I don't know when we will have a final v27 release. This version will work on a non-Sandy Bridge CPU, but it won't be any faster than v26.6.

drh 2012-02-17 20:33

I'm seeing similar results. My i5-2500K has dropped from .032 to .022 sec per iteration on 1 core, and the P-1 stage 2 time dropped from 450 sec to 355 sec, also on 1 core, with mfaktc running on the other 2 cores, with no OC.

Huge improvement, great job!
Doug

monst 2012-02-17 20:58

Here's what I'm seeing for iteration times on my i5-2500K running 2 instances
of Prime95 (on M26161123 and M26161217) and 2 instances of mfaktc.
(The chip is overclocked to 4.5 GHz.)

26.6 (64-bit) --> 12.4 ms

27.2 (32-bit) --> 9.8 ms

27.3 (32-bit) --> 9.6 ms
27.3 (64-bit) --> 9.1 ms

Nice improvement!!

fivemack 2012-02-17 23:15

[QUOTE=James Heinrich;289742]With hyperthreading disabled:1: 12.8ms
2: 13.2ms
3: 13.6ms
4: 14.8ms
5: 16.7ms
6: ~19ms (ranges from 18.3 to 21.1 in different workers)[/QUOTE]

I would be intrigued to see, if it's not too awkward to run, what the speed-as-you-add-more-workers progression is for v26.6.3.

James Heinrich 2012-02-17 23:53

[QUOTE=fivemack;289797]I would be intrigued to see, if it's not too awkward to run, what the speed-as-you-add-more-workers progression is for v26.6.3.[/QUOTE]v26.6.3 vs v27.1.3 (both 64-bit):
1: 21.1ms vs 12.8ms (65% faster)
2: 21.2ms vs 13.2ms (61% faster)
3: 21.4ms vs 13.6ms (57% faster)
4: 21.5ms vs 14.8ms (45% faster)
5: 22.0ms vs 16.7ms (32% faster)
6: 22.6ms vs 19.0ms (19% faster)

[b]edit:[/b] just realized I ran the 27.1.3 with Hyperthreading disabled, and v26.6.3 with it enabled :sad:
Will re-run benchmarks later.
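The "N% faster" figures work out as the ratio of old to new iteration times; a quick throwaway check of the arithmetic, using the numbers from the table above:

```python
# James's per-iteration times (ms) for 1..6 workers, from the table above.
v266 = [21.1, 21.2, 21.4, 21.5, 22.0, 22.6]
v27x = [12.8, 13.2, 13.6, 14.8, 16.7, 19.0]

# Speedup = old/new - 1, printed as a whole percentage.
for workers, (old, new) in enumerate(zip(v266, v27x), start=1):
    print(f"{workers}: {old / new - 1:.0%} faster")
```

The printed percentages come out 65/61/57/45/32/19, matching the post.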

fivemack 2012-02-18 00:18

These numbers are a bit confusing to analyse because everyone seems to be running their machines at different clock speeds and memory speeds.

I appreciate that it involves multiple reboots and makes running other jobs difficult while you're doing it, but I think that a conclusive analysis of the effect of memory bandwidth really would benefit from benchmarks from 27.3 at two CPU multipliers as far apart as possible, with memory speed kept the same, and turbo and hyperthreading turned off in both cases.

(really ideally would also be data points at two different memory speeds with CPU multiplier kept the same, but I don't know if X79 BIOSes allow you to set that conveniently)

The idea's to solve for runtime as A + B/cpuspeed + C/memoryspeed and see if anything interesting shows up in the values of A, B and C. I've done this analysis with the SPEC99 benchmarks to divide them into CPU-intensive and memory-intensive ones.

Prime95 2012-02-18 00:27

Linux executables should be available. Untested. Primenet sometimes doesn't recognize new versions; I can't check right now, as I have some evening entertaining to do.

P.S. This is the second time Ubuntu has toasted the root disk. Arcane fsck command restored it both times. I don't know how a novice user would ever recover...

James Heinrich 2012-02-18 00:43

[QUOTE=fivemack;289802]I think that a conclusive analysis of the effect of memory bandwidth really would benefit from benchmarks from 27.3 at two CPU multipliers as far apart as possible, with memory speed kept the same, and turbo and hyperthreading turned off in both cases. (really ideally would also be data points at two different memory speeds with CPU multiplier kept the same, but I don't know if X79 BIOSes allow you to set that conveniently)[/QUOTE]I'll see if I can get you this data tomorrow.

LaurV 2012-02-18 01:54

[QUOTE=fivemack;289723]Maybe I'm misunderstanding the request, but I think the question is whether there's a slowdown running six one-thread workers on six different jobs[/QUOTE]
You are right, the memory traffic would be higher in that case. There is no slowdown for me when running 4 different workers doing 4 different jobs on 4 physical cores (alone, or used as 8 logical, with helpers) compared with 27.2; I would say it is a touch faster. And much faster than 26.5/26.6.

firejuggler 2012-02-18 07:07

i5-2500K, stock speed , windows 7 home premium
[code]
Compare your results to other computers at http://www.mersenne.org/report_benchmarks
Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
CPU speed: 3336.07 MHz, 4 cores
CPU features: Prefetch, MMX, SSE, SSE2, SSE4, AVX
L1 cache size: 32 KB
L2 cache size: 256 KB, L3 cache size: 6 MB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
TLBS: 64
Prime95 64-bit version 27.3, RdtscTiming=1
Best time for 768K FFT length: 4.720 ms., avg: 4.901 ms.
Best time for 896K FFT length: 5.770 ms., avg: 5.987 ms.
Best time for 1024K FFT length: 6.469 ms., avg: 6.606 ms.
Best time for 1280K FFT length: 8.261 ms., avg: 8.452 ms.
Best time for 1536K FFT length: 10.157 ms., avg: 10.394 ms.
Best time for 1792K FFT length: 12.174 ms., avg: 12.454 ms.
Best time for 2048K FFT length: 13.578 ms., avg: 13.884 ms.
Best time for 2560K FFT length: 17.210 ms., avg: 17.559 ms.
Best time for 3072K FFT length: 21.431 ms., avg: 21.703 ms.
Best time for 3584K FFT length: 25.954 ms., avg: 26.447 ms.
Best time for 4096K FFT length: 29.306 ms., avg: 29.481 ms.
Best time for 5120K FFT length: 37.961 ms., avg: 38.296 ms.
Best time for 6144K FFT length: 45.709 ms., avg: 47.362 ms.
Best time for 7168K FFT length: 55.606 ms., avg: 55.943 ms.
Best time for 8192K FFT length: 63.926 ms., avg: 64.368 ms.
Timing FFTs using 2 threads.
Best time for 768K FFT length: 2.611 ms., avg: 2.691 ms.
Best time for 896K FFT length: 3.123 ms., avg: 3.200 ms.
Best time for 1024K FFT length: 3.512 ms., avg: 3.643 ms.
Best time for 1280K FFT length: 4.557 ms., avg: 4.876 ms.
Best time for 1536K FFT length: 5.513 ms., avg: 5.684 ms.
Best time for 1792K FFT length: 6.637 ms., avg: 6.861 ms.
Best time for 2048K FFT length: 7.410 ms., avg: 7.569 ms.
Best time for 2560K FFT length: 9.301 ms., avg: 9.652 ms.
Best time for 3072K FFT length: 11.599 ms., avg: 11.857 ms.
Best time for 3584K FFT length: 14.025 ms., avg: 14.427 ms.
Best time for 4096K FFT length: 15.921 ms., avg: 16.137 ms.
Best time for 5120K FFT length: 20.980 ms., avg: 23.371 ms.
Best time for 6144K FFT length: 24.257 ms., avg: 24.651 ms.
Best time for 7168K FFT length: 29.402 ms., avg: 29.815 ms.
Best time for 8192K FFT length: 34.335 ms., avg: 34.642 ms.
Timing FFTs using 3 threads.
Best time for 768K FFT length: 1.838 ms., avg: 1.916 ms.
Best time for 896K FFT length: 2.223 ms., avg: 2.361 ms.
Best time for 1024K FFT length: 2.547 ms., avg: 3.664 ms.
Best time for 1280K FFT length: 3.276 ms., avg: 3.400 ms.
Best time for 1536K FFT length: 4.017 ms., avg: 4.120 ms.
Best time for 1792K FFT length: 4.768 ms., avg: 4.914 ms.
Best time for 2048K FFT length: 5.392 ms., avg: 5.509 ms.
Best time for 2560K FFT length: 6.934 ms., avg: 7.173 ms.
Best time for 3072K FFT length: 8.511 ms., avg: 8.696 ms.
Best time for 3584K FFT length: 10.403 ms., avg: 10.938 ms.
Best time for 4096K FFT length: 11.650 ms., avg: 12.018 ms.
Best time for 5120K FFT length: 14.835 ms., avg: 15.071 ms.
Best time for 6144K FFT length: 17.789 ms., avg: 18.049 ms.
Best time for 7168K FFT length: 21.164 ms., avg: 21.397 ms.
Best time for 8192K FFT length: 24.641 ms., avg: 25.339 ms.
Timing FFTs using 4 threads.
Best time for 768K FFT length: 1.670 ms., avg: 1.723 ms.
Best time for 896K FFT length: 2.032 ms., avg: 2.079 ms.
Best time for 1024K FFT length: 2.305 ms., avg: 2.403 ms.
Best time for 1280K FFT length: 2.997 ms., avg: 3.066 ms.
Best time for 1536K FFT length: 3.637 ms., avg: 5.094 ms.
Best time for 1792K FFT length: 4.386 ms., avg: 4.509 ms.
Best time for 2048K FFT length: 4.895 ms., avg: 7.237 ms.
Best time for 2560K FFT length: 6.309 ms., avg: 6.483 ms.
Best time for 3072K FFT length: 7.560 ms., avg: 7.742 ms.
Best time for 3584K FFT length: 9.366 ms., avg: 9.645 ms.
Best time for 4096K FFT length: 10.515 ms., avg: 11.590 ms.
Best time for 5120K FFT length: 13.031 ms., avg: 13.184 ms.
[Fri Feb 17 22:29:50 2012]
Best time for 6144K FFT length: 15.449 ms., avg: 15.707 ms.
Best time for 7168K FFT length: 18.263 ms., avg: 18.686 ms.
Best time for 8192K FFT length: 21.290 ms., avg: 21.571 ms.
Best time for 61 bit trial factors: 2.294 ms.
Best time for 62 bit trial factors: 2.309 ms.
Best time for 63 bit trial factors: 2.607 ms.
Best time for 64 bit trial factors: 2.698 ms.
Best time for 65 bit trial factors: 3.169 ms.
Best time for 66 bit trial factors: 3.740 ms.
Best time for 67 bit trial factors: 3.709 ms.
Best time for 75 bit trial factors: 3.614 ms.
Best time for 76 bit trial factors: 3.596 ms.
Best time for 77 bit trial factors: 3.633 ms.
[/code]
tl;dr: more than 3 cores on one task is useless.

fivemack 2012-02-18 07:58

[QUOTE=firejuggler;289840]i5-2500K, stock speed , windows 7 home premium[/QUOTE]

Useful data - what's the memory speed here?

(I think you can see something like a memory-bandwidth effect by comparing this to the 4429/2133 i5/2500K data and seeing that the speed ratio goes down as the number of threads go up - the 4429/2133 is 48% faster at 4 threads and only 34% faster at 1 thread - but that would imply that the memory on the i5/2500K is 1333MHz, so if it isn't I'll have to revise my analysis)

firejuggler 2012-02-18 13:50

Yes, memory speed is 1333MHz.

James Heinrich 2012-02-18 14:32

[QUOTE=fivemack;289802]I think that a conclusive analysis of the effect of memory bandwidth really would benefit from benchmarks from 27.3 at two CPU multipliers as far apart as possible, with memory speed kept the same, and turbo and hyperthreading turned off in both cases. (really ideally would also be data points at two different memory speeds with CPU multiplier kept the same)[/QUOTE]You may find these numbers interesting. I ran the measurements 6 times: 3 at 1500MHz and 3 at 4500MHz, 2 each at 800/1333/1600MHz RAM, with multipliers of 12x/15x/36x/45x to match:

[color=navy]
Prime95 v27.3.1, Windows 7 Pro x64
Intel Core i7-3930K, Hyperthreading disabled
Corsair 4x8GB DDR3-1600

CPU: 100x45=4500MHz; RAM: 1600-10-10-10-27-2T
1-worker: 12.6ms
2-worker: 12.9ms
3-worker: 13.1ms
4-worker: 13.6ms
5-worker: 14.8ms
6-worker: 16.4ms

CPU: 125x36=4500MHz; RAM: 1333-10-10-10-27-2T
1-worker: 12.8ms
2-worker: 13.1ms
3-worker: 13.6ms
4-worker: 14.6ms
5-worker: 16.5ms
6-worker: 18.9ms

CPU: 100x45=4500MHz; RAM: 800-10-10-10-27-2T
1-worker: 15.0ms
2-worker: 17.7ms
3-worker: 21.6ms
4-worker: 26.1ms
5-worker: 26.2ms
6-worker: 31.0ms

CPU: 100x15=1500MHz; RAM: 1600-10-10-10-27-2T
1-worker: 36.5ms
2-worker: 36.6ms
3-worker: 36.8ms
4-worker: 37.0ms
5-worker: 37.9ms
6-worker: 38.8ms

CPU: 125x12=1500MHz; RAM: 1333-10-10-10-27-2T
1-worker: 36.5ms
2-worker: 36.8ms
3-worker: 37.1ms
4-worker: 37.3ms
5-worker: 38.0ms
6-worker: 38.6ms

CPU: 100x15=1500MHz; RAM: 800-10-10-10-27-2T
1-worker: 36.9ms
2-worker: 37.3ms
3-worker: 37.6ms
4-worker: 38.1ms
5-worker: 39.3ms
6-worker: 41.0ms
[/color]

ATH 2012-02-18 15:10

Dell XPS laptop with Core i7-2720QM 2.20 GHz, 4 cores / 8 threads. RAM: 8GB DDR3 PC3-10700 (667 MHz).

Prime95 26.6.3 and 27.3.1 reports "OldCpuSpeed=2947" in local.txt

Testing LL on 45M exponents (FFT 2400K):

26.6.3:
1 worker, 2 threads each: [B]21.7 ms[/B]
2 workers, 2 threads each: [B]22.9 ms[/B]
3 workers, 2 threads each: [B]32.5-35.8 ms[/B]
4 workers, 2 threads each: [B]41.3 ms[/B]
8 workers, 1 thread each: [B]81.5-83.2 ms[/B]
1 worker, 4 threads each: [B]11.3 ms[/B]
2 workers, 4 threads each: [B]21.2 ms[/B]
1 worker, 8 threads each: [B]11.6 ms[/B]

27.3.1:
1 worker, 2 threads each: [B]13.3 ms[/B] (~60% faster than 26.6.3)
2 workers, 2 threads each: [B]15.1 ms[/B] (~50% faster than 26.6.3)
3 workers, 2 threads each: [B]23.2-25.4 ms[/B] (30-50% faster than 26.6.3)
4 workers, 2 threads each: [B]29.4-29.9 ms[/B] (~40% faster than 26.6.3)
8 workers, 1 thread each: [B]55.4-56.4 ms[/B] (45-50% faster than 26.6.3)
1 worker, 4 threads each: [B]8.5 ms[/B] (~30% faster than 26.6.3)
2 workers, 4 threads each: [B]16.3 ms[/B] (~30% faster than 26.6.3)
1 worker, 8 threads each: [B]9.0 ms[/B] (~30% faster than 26.6.3)

ATH 2012-02-18 15:24

Testing this I might have found a bug. If I run for example 2 workers on 2 threads each and I use this local.txt:
[CODE]WorkerThreads=2

[Worker #1]
ThreadsPerTest=2

[Worker #2]
ThreadsPerTest=2[/CODE]

then Prime95 assigns affinity automatically:
[Worker #1] Setting affinity to run on any logical CPU.
[Worker #1] Setting affinity to run helper thread 1 on any logical CPU.
[Worker #2] Setting affinity to run on any logical CPU.
[Worker #2] Setting affinity to run helper thread 1 on any logical CPU.

and I get around 15.1ms per iteration for each worker (45M exponent).

If I want to assign affinity myself like this local.txt:

[CODE]WorkerThreads=2

[Worker #1]
ThreadsPerTest=2
Affinity=1

[Worker #2]
ThreadsPerTest=2
Affinity=3
[/CODE]

Now Prime95 reports:
[Worker #1] Setting affinity to run worker on logical CPU #2
[Worker #1] Setting affinity to run helper thread 1 on logical CPU #3
[Worker #2] Setting affinity to run worker on logical CPU #4
[Worker #2] Setting affinity to run helper thread 1 on logical CPU #5

but now I get 22.5 ms on each worker, so like 50% slower. This happens with other choices for affinity and with 3 workers and also happens in Prime95 26.6.3.

James Heinrich 2012-02-18 15:49

[QUOTE=ATH;289871]If I want to assign affinity myself like this local.txt[/QUOTE]Affinity settings in local.txt are zero-based. Try Affinity=0 and Affinity=2 instead of 1 and 3. Right now you're running the first instance on the hyperthreaded part of core1 plus the real part of core2, and the second instance on the hyperthreaded part of core2 plus the real part of core3.
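The pairing being described can be sketched as follows. This assumes the common enumeration where zero-based logical CPUs 2k and 2k+1 are the two hyperthreads of physical core k; the function is illustrative only, not part of Prime95:

```python
def physical_core(logical_cpu: int) -> int:
    # Assumes logical CPUs 2k and 2k+1 are hyperthread siblings of core k
    # (an assumption about the enumeration; it varies by system).
    return logical_cpu // 2

# Affinity=1 with one helper thread occupies logical CPUs 1 and 2,
# which straddle two different physical cores:
print([physical_core(c) for c in (1, 2)])  # -> [0, 1]

# Affinity=0 keeps both threads on the two hyperthreads of core 0:
print([physical_core(c) for c in (0, 1)])  # -> [0, 0]
```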

bcp19 2012-02-18 16:59

[QUOTE=James Heinrich;289868]You may find these numbers interesting. I ran the measurements 6 times: 3 at 1500MHz, 3 at 4500MHz; 2 each at 800/1333/1600MHz RAM; and with multipliers between of 12x/15x/36x/45x to match:


[COLOR=navy]Prime95 v27.3.1, Windows 7 Pro x64[/COLOR]
[COLOR=navy]Intel Core i7-3930K, Hyperthreading disabled[/COLOR]
[COLOR=navy]Corsair 4x8GB DDR3-1600[/COLOR]

[COLOR=navy]CPU: 100x45=4500MHz; RAM: 1600-10-10-10-27-2T[/COLOR]
[COLOR=navy]1-worker: 12.6ms[/COLOR]
[COLOR=navy]2-worker: 12.9ms[/COLOR]
[COLOR=navy]3-worker: 13.1ms[/COLOR]
[COLOR=navy]4-worker: 13.6ms[/COLOR]
[COLOR=navy]5-worker: 14.8ms[/COLOR]
[COLOR=navy]6-worker: 16.4ms[/COLOR]

[/QUOTE]

What timings do you use for your memory at 1600? Mine is actually a bit off norm, 103x41=4261, so memory is at 1650 and 9-9-9-24.

James Heinrich 2012-02-18 18:29

[QUOTE=bcp19;289877][quote]CPU: 100x45=4500MHz; RAM: 1600-10-10-10-27-2T[/quote]What timings do you use for your memory at 1600? Mine is actually a bit off norm, 103x41=4261, so memory is at 1650 and 9-9-9-24.[/QUOTE]I used 10-10-10-27-2T timings for all tests. Obviously it could've run faster than that at 800MHz, but I kept it constant across all tests.

ATH 2012-02-18 18:45

[QUOTE=James Heinrich;289875]Affinity settings in local.txt are zero-based. Try Affiinity=0 and Affinity=2 instead of 1 and 3. Right now you're running the first instance on the hyperthreaded part of core1 plus the real part of core2, and the second instance on the hyperthreaded part of core2 plus the real part of core3.[/QUOTE]

Yeah, it's a bit confusing: Affinity runs from 0 to 7, but inside Prime95 it writes CPU #1 to 8. But I tried Affinity=0 and Affinity=2 just now and got 25.9ms, and I tried Affinity=2 and Affinity=4 and still 25.9ms; then I removed Affinity from local.txt and got 14.8ms, so there seems to be a bug.

Prime95 2012-02-18 19:06

We have a reproducible bug on non-SSE4 machines (like Pentium 4s) testing numbers like: 30448908048555*2^666666-1

GIMPS users won't care, but users of LLR / PFGW built on this library need to be careful.

James Heinrich 2012-02-18 19:25

[QUOTE=ATH;289885]then I removed Affinity from local.txt and got 14.8ms, so there seems to be a bug.[/QUOTE]No, you removed the affinity settings and it ran each thread on a separate physical core. With the affinity settings you were telling it to run two threads on each of two physical cores, leaving two full cores (4 virtual cores) unused. Without the affinity, Prime95 simply spread the load across 4 real cores (leaving only the 4 virtual cores "unused").

ATH 2012-02-18 19:49

[QUOTE=James Heinrich;289890]No, you removed the affinity settings and it ran each thread on a separate physical core. With the affinity settings you were telling it to run two threads on each of two physical cores, leaving two full cores (4 virtual cores) unused. Without the affinity, Prime95 simply spread the load across 4 real cores (leaving only the 4 virtual cores "unused").[/QUOTE]

Then why is CPU usage around 50% in both cases? Hyperthreading sucks; I'd better try to disable it then.

James Heinrich 2012-02-18 20:00

[QUOTE=ATH;289893]Then why is CPU usage around 50% in both cases? Hyperthreading sucks; I'd better try to disable it then.[/QUOTE]CPU is reported as 50% because you're using 4 of 8 logical cores. It doesn't take into account that a logical core isn't necessarily the same as a physical core. With 4 threads running on [0,1] [2,3] you're using exactly half the processing capabilities of the CPU. With 4 threads running on [0] [2] [4] [6] you're using nearly all the processing capability of the CPU. Running 8 threads instead of 4 won't give you much extra performance in Prime95, but it will allow smoother multitasking (in that other, disparate tasks can better use the idle parts of the cores via Hyperthreading).

retina 2012-02-18 23:27

[QUOTE=James Heinrich;289875]Affinity settings in local.txt are zero-based. Try Affiinity=0 and Affinity=2 instead of 1 and 3. Right now you're running the first instance on the hyperthreaded part of core1 plus the real part of core2, and the second instance on the hyperthreaded part of core2 plus the real part of core3.[/QUOTE]Note that there is no such thing as a hyperthreaded portion and a real portion, all logical CPUs are the same in all respects. Two L-CPUs can share the resources of one CPU, but it makes no difference which of those two you choose to use. Perhaps what you were trying to say was that the two logical threads for each worker are running on different physical CPUs. Can we call this cross-threading?

emily 2012-02-19 00:30

[QUOTE=retina;289922]Note that there is no such thing as a hyperthreaded portion and a real portion, all logical CPUs are the same[/QUOTE]

Yeah, I think HT simply duplicates the registers and keeps one execution pipeline per core, so it can hold two threads at the same time and execute either of them. Because the pipeline is long, one instruction can be at the beginning of the pipeline while another is in the middle.

So, each logical CPU refers to a set of registers, and there's one execution pipeline for every 2 sets of registers.

Dubslow 2012-02-19 00:46

[QUOTE=retina;289922]Note that there is no such thing as a hyperthreaded portion and a real portion, all logical CPUs are the same in all respects. Two L-CPUs can share the resources of one CPU, but it makes no difference which of those two you choose to use. Perhaps what you were trying to say was that the two logical threads for each worker are running on different physical CPUs. Can we call this cross-threading?[/QUOTE]

I'm pretty sure he understands this but was using the bad terminology to get his meaning across. Regardless of implementation details, 2 logical cores map to one physical core, and it's best (if using hyperthreading) to run 2 threads per worker, with each worker pinned to two logical cores that are "paired". They are not paired in any way in the software, but we humans "know" that they are "paired".

James Heinrich 2012-02-19 01:07

[QUOTE=Dubslow;289925]I'm pretty sure he understands this but was using the bad terminology to get his meaning across.[/QUOTE]Yes. Sorry for wording it badly.

[QUOTE=Dubslow;289925]They are not paired in any way in the software, but we humans "know" that they are "paired".[/QUOTE]Prime95 does (attempt to) detect which cores are paired on startup (detection doesn't always work in my experience, but the fallback default assumptions are correct, so it doesn't matter much). As mentioned above, if affinity isn't specified, Prime95 will assign workers to separate physical cores first (where possible) before doubling them up. If you specify 2 threads per worker, and more threads total than there are physical cores, Prime95 will try to keep both threads of each worker on the same physical core (e.g. assign worker #1 to run on core0 with its helper thread on core1).

fivemack 2012-02-19 01:30

OK, so the six-worker times from James's data are fitted reasonably by

t = 38000/{CPU MHz} + 16000/{memory speed MHz} milliseconds

The one-worker times are fitted reasonably by

t = 45000/{CPU MHz} + 5000/{memory speed MHz} milliseconds

Which suggests that a 3000MHz / 1600MHz system would give around 23ms with six workers and around 18ms with one worker; is this like what's observed?
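For what it's worth, redoing that fit on James's six-worker numbers from the post above lands in the same neighbourhood. A pure-Python sketch of the least-squares solve (no constant term, via the 2x2 normal equations):

```python
# James's six-worker iteration times (ms) at each CPU/RAM combination.
points = [  # (cpu_mhz, ram_mhz, ms_per_iter)
    (4500, 1600, 16.4), (4500, 1333, 18.9), (4500, 800, 31.0),
    (1500, 1600, 38.8), (1500, 1333, 38.6), (1500, 800, 41.0),
]

# Least-squares fit of t = B/cpu + C/mem: accumulate the normal equations.
sxx = sum((1 / cpu) ** 2 for cpu, mem, t in points)
syy = sum((1 / mem) ** 2 for cpu, mem, t in points)
sxy = sum((1 / cpu) * (1 / mem) for cpu, mem, t in points)
sxt = sum(t / cpu for cpu, mem, t in points)
syt = sum(t / mem for cpu, mem, t in points)

# Solve the 2x2 system by Cramer's rule.
det = sxx * syy - sxy * sxy
B = (sxt * syy - sxy * syt) / det
C = (sxx * syt - sxy * sxt) / det
print(f"B ~= {B:.0f}, C ~= {C:.0f}")  # in the neighbourhood of 38000 and 16000

# Predicted six-worker time for a 3000MHz CPU with 1600MHz RAM:
print(f"prediction: {B / 3000 + C / 1600:.1f} ms")  # close to the ~23ms quoted
```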

Zero 2012-02-19 05:26

[QUOTE=James Heinrich;289897]Running 8 threads instead of 4 won't give you much extra performance in Prime95, but it will allow smoother multitasking (in that other, disparate tasks can better use the idle parts of the cores via Hyperthreading).[/QUOTE]I thought that you just showed that HT actually results in degradation with Prime where 6/12 runs slower than 6/6.


CPU usage displayed in Windows is actually scheduler usage and not CPU load.


I see about 50% improvement overall running with AVX over AVX disabled. Is this about right or are there some further optimizations still to be done?

James Heinrich 2012-02-19 13:05

[QUOTE=Zero;289950]I thought that you just showed that HT actually results in degradation with Prime where 6/12 runs slower than 6/6.[/QUOTE]Comparing HT vs non-HT on 27.3.1, using average times for 4096K:[code] 1-thread, 1 core, non-HT: 23.072
1-thread, 1 core, HT: 23.797 ( 3% slower)
2-thread, 1 core, HT: 24.876 ( 8% slower)

6-thread, 6 core, non-HT: 7.393
12-thread, 6 core, HT: 9.962 (35% slower)[/code]So yes, with Hyperthreading running, throughput is slower than without (but the loss here may be gained elsewhere in the system if you use the system for actual work and not just dedicated GIMPS). But it seems clear that multiple threads per worker is a waste of time: it actually runs noticeably slower than one thread per worker.

erg 2012-02-19 18:19

Just upgraded from 'Mersenne Prime Test Program, Version 26.6.8' to the Linux 64 27.3 binary.

Version 26.6.8:
[Worker #1 Feb 19 10:03] Iteration: 3890000 / 56500573 [6.88%]. Per iteration time: 0.035 sec.
[Worker #2 Feb 19 10:07] Iteration: 7880000 / 46052729 [17.11%]. Per iteration time: 0.028 sec.
[Worker #3 Feb 19 10:07] Iteration: 7600000 / 46289729 [16.41%]. Per iteration time: 0.029 sec.
[Worker #4 Feb 19 10:05] Iteration: 12830000 / 27892327 [45.99%]. Per iteration time: 0.017 sec.
[Worker #5 Feb 19 10:02] Iteration: 6280000 / 56422397 [11.13%]. Per iteration time: 0.035 sec.
[Worker #6 Feb 19 10:07] Iteration: 2900000 / 51423049 [5.63%]. Per iteration time: 0.033 sec.

Version 27.3:
[Worker #1 Feb 19 10:10] Iteration: 3900000 / 56500573 [6.90%]. Per iteration time: 0.022 sec.
[Worker #2 Feb 19 10:12] Iteration: 7890000 / 46052729 [17.13%]. Per iteration time: 0.018 sec.
[Worker #3 Feb 19 10:12] Iteration: 7610000 / 46289729 [16.43%]. Per iteration time: 0.019 sec.
[Worker #4 Feb 19 10:11] Iteration: 12850000 / 27892327 [46.07%]. Per iteration time: 0.011 sec.
[Worker #5 Feb 19 10:13] Iteration: 6300000 / 56422397 [11.16%]. Per iteration time: 0.023 sec.
[Worker #6 Feb 19 10:12] Iteration: 2910000 / 51423049 [5.65%]. Per iteration time: 0.021 sec.

erg@ommegang ~ $ uname -a
Linux ommegang 3.2.5-1-ARCH #1 SMP PREEMPT Tue Feb 7 08:34:36 CET 2012 x86_64 Intel(R) Core(TM) i7-3960X CPU @ 3.30GHz GenuineIntel GNU/Linux

Looks like AVX gives a nice speedup. Great work!

drh 2012-02-20 01:54

LL instead of P1
 
Anyone else seen this? In my worker window I'm asking for P-1s and have been getting them for a long time; now, after upgrading to 27.3, I'm starting to get LLs instead.

Doug

LaurV 2012-02-20 02:51

Anyone experiencing a "blue screen" issue? I did, repeated 3-4 times, about 3 to 5 minutes after launching v273 (i7 2600k, 4 workers, all DC in the 26M range, OC-ed at 4.35GHz; the mobo seems to be stable much higher with v272 of p95). The message (second paragraph of the blue-screen text, where the error is explained, after the introductory part and before the memory-dumping part :D) says something like "one IC of the mainboard did not get a timer interrupt and the computer is kicked down to avoid bla bla bla", which I had never seen before, and I assume is some asus-maximus-4-extreme-z stuff (which is a very good mobo, and I would recommend it to anyone; since I got it I am every day more and more amazed by it!).

It could be that I am pushing the OC too high, or it could be p95v273 itself, because the problem does not happen with the same config (1-2 cudalucas and/or 2-6 mfaktc in background) AND p95v272. Also, it does not seem to happen with 273 if I run it at 3.8GHz or lower (stock is 3.4). It only happens with 273 AND OC over 4GHz.

Just FYI.

Prime95 2012-02-20 02:58

[QUOTE=LaurV;290038]"blue screen" issue...the mobo seems to be stable much higher with v272 of p95).[/QUOTE]

Since v27.3 is more efficient than 27.2, it will put more pressure on your overclocked CPU. You'll need to run torture tests to find your new stable speed.

retina 2012-02-20 03:03

[QUOTE=LaurV;290038]It could be the fact that I am pushing the OC too high, ...[/QUOTE]I would definitely vote affirmative to that suggestion.

[size=1]I still don't quite understand the desire to overclock. Sure, you might get a small improvement in run times, but long term: the risk of error increases, power bills climb, component life shortens, and the frustration and lost work when things like blue screens happen can easily eat up any speed advantage.[/size]

Dubslow 2012-02-20 04:24

Definitely memory issues. My one mfaktc instance dropped 9M/s throughput. (175->166)

fivemack 2012-02-20 13:20

[QUOTE=LaurV;290038]Anyone experiencing a "blue screen" issue? I did, and repeated 3-4 times, about 3 to 5 minutes after I launched v273 (i7 2600k, 4 workers, all DC in 26M range, OC-ed at 4.35GHz, the mobo seems to be stable much higher with v272 of p95).

It could be the fact that I am pushing the OC too high[/QUOTE]

Almost surely; I wonder if it's relevant that you're getting speeds an average of 18% faster at the same clock speed, and have to push your clock speed down by 14% to get reliability back.

Zero 2012-02-20 14:27

[QUOTE=retina;290042][SIZE=1]higher power bills[/SIZE][/QUOTE]

That depends...

For example, take a base system that draws 120W (MB, VGA card, monitor, etc.) with a CPU that draws 80W @ 3.4GHz stock but 150W @ 5GHz OC'd.

Then a job taking 10 hours to run at 5GHz would consume (120+150)*10=2.7kWh

At 3.4GHz the same job would take 14.7 hours to complete (120+80)*14.7=2.94kWh

In this case the OC'd user has used less power and has more time to do other things. Of course he/she needs to ensure their system is stable and that CPU life is not significantly reduced. YMMV ;)
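Under the stated assumptions (constant base draw, runtime scaling inversely with clock speed), the arithmetic checks out; a quick sketch:

```python
base_w = 120.0                      # MB, VGA card, monitor, etc.
cpu_stock_w, ghz_stock = 80.0, 3.4  # CPU draw and clock at stock
cpu_oc_w, ghz_oc = 150.0, 5.0       # CPU draw and clock overclocked

# A 10-hour job at 5GHz; assume runtime scales perfectly with clock speed.
hours_oc = 10.0
hours_stock = hours_oc * ghz_oc / ghz_stock

kwh_oc = (base_w + cpu_oc_w) * hours_oc / 1000
kwh_stock = (base_w + cpu_stock_w) * hours_stock / 1000
print(f"OC: {kwh_oc:.2f} kWh, stock: {kwh_stock:.2f} kWh")  # OC: 2.70, stock: 2.94
```

The perfect-scaling assumption is the generous one for the overclocker; real jobs that are partly memory-bound would narrow the gap.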

Dubslow 2012-02-20 15:10

Whoops. Last time I checked, 2.7 > 2.49, but then 2.49 isn't what he wrote :razz:


Then I think he underestimated the power draw for the OC. OC in general draws power at a greater than linear rate, so even factoring in the shorter runtime, the total energy consumption should still be higher.

James Heinrich 2012-02-20 15:14

[QUOTE=Dubslow;290087]Last time I checked 2.7 > 2.94[/QUOTE]Maybe you should check again. :smile:

bcp19 2012-02-20 15:22

[QUOTE=James Heinrich;290089]Maybe you should check again. :smile:[/QUOTE]

Maybe he is correct, after all, 2^34 is not 512 times larger than 2^25.

Zero 2012-02-20 15:44

[QUOTE=Dubslow;290087]OC in general draws power at a greater than linear rate[/QUOTE]

If you look at the power ratio vs the frequency ratio in the above example you'll see that they are not linear.

Whether it's a fair example is debatable, as there are many scenarios. It's just to show that there is more to consider than just CPU power. It was not my intention to go so far OT; apologies for that.

LaurV 2012-02-20 16:56

Overclocking can be used for many different things, for example hardware testing. Say I don't care about the lifetime of the hardware I use, as long as it dies only after it is deprecated :cool:, and I give no sheep about the power bill. Electricity where I live is quite cheap and I am not a big fan of the global warming hoax. In fact, except for April (when 40 deg Celsius is the monthly average, nights included) the rest of the year seems a bit cold to me here, and year by year it seems colder and colder as I get older and older, so I need my 1200W power supply running to make lots of heat :smile:. The only big BIG[SIZE=4] BIG[/SIZE] problem is the eternal arguing with the family, who always need aircond... days and nights, including November (the coldest month of the year; last Nov the temperatures dropped to +12 Celsius in the morning). My family is quite strange: why should they need aircond when it is 35 degrees outside? Body temperature is over 36, so it should be OK! :ick:

Aaaa, if you say "to cool the case which is under the table", well, that is a different story, but to cool the sleeping room? Who needs aircond in the sleeping room? :loco:

[COLOR=Silver](this post is not for the guys having their fun-detectors turned off)
[/COLOR]

Dubslow 2012-02-20 17:10

I think the high today is around 5, and that's kinda warm for around here :wink:

James Heinrich 2012-02-20 17:17

[QUOTE=LaurV;290118]last Nov the temperatures dropped to +12 Celsius in the morning[/QUOTE]Sometimes the temperature rises to -35C in the afternoon in my part of the world... overclocked systems can be quite useful in that department. :smile:

Still, it's all about what we're each comfortable with. I personally overclock all my systems to what I consider a "safe" level (e.g. my system is 90% stable at 4900MHz CPU/1600MHz RAM but 100% stable at 4500/1333). I consider the increased power consumption and heat generation a fair exchange for the increased throughput (and decreased need to heat my house).

bcp19 2012-02-20 17:57

[QUOTE=James Heinrich;290121]Sometimes the temperature rises to -35C in the afternoon in my part of the world... overclocked systems can be quite useful in that department. :smile:

Still, it's all about what we're each comfortable with. I personally overclock all my systems to what I consider a "safe" level (e.g. my system is 90% stable at 4900MHz CPU/1600MHz RAM but 100% stable at 4500/1333). I consider the increased power consumption and heat generation a fair exchange for the increased throughput (and decreased need to heat my house).[/QUOTE]

Wow, you live underwater?
[quote]The confluence at 42N 81W was of particular interest to many of us, a group of windsurfers in Northeast and Central Ohio. Its position, 9 miles off shore in Lake Erie, represented a difficult but attainable challenge.[/quote]

Dubslow 2012-02-20 18:02

Also since when does Ohio get [i]that[/i] cold? :smile:

James Heinrich 2012-02-20 18:15

Hmm, typo in my profile, should be 4[b]3[/b]N81W :redface:
And no, it doesn't get [i]that[/i] cold here, but I used to live in Edmonton and it does there.
:threadhijacked:

chalsall 2012-02-20 18:44

[QUOTE=James Heinrich;290127]...but I used to live in Edmonton and it does there.[/QUOTE]

LOL... I grew up in a small town in the interior of British Columbia. During high-school I used to do watchman duty at a sawmill in -40 degrees (C or F) weather in a light ski jacket and jeans.

Now that I've been living in Barbados for ten plus years, I feel cold if the temperature drops below +25 C.... :smile:

Dubslow 2012-02-20 18:57

[QUOTE=chalsall;290129]LOL... I grew up in a small town in the interior of British Columbia. During high-school I used to do watchman duty at a sawmill in -40 degrees (C or F) weather in a light ski jacket and jeans.
[/QUOTE]

That sounds like something I'd do. I have one pair of jeans and like ten pairs of shorts, and only a sweatshirt.

(Edit: In case you hadn't seen this: [url]http://www.mersenneforum.org/showpost.php?p=289651&postcount=557[/url])

kladner 2012-02-21 00:21

[QUOTE=James Heinrich;290127]Hmm, typo in my profile, should be 4[B]3[/B]N81W :redface:
And no, it doesn't get [I]that[/I] cold here, but I used to live in Edmonton and it does there.
:threadhijacked:[/QUOTE]

DANG! I had previously looked up your old coordinates, and found a spot in the middle of Lake Erie, which had been visited by sail boarders. It was a latitude/longitude "node" (I think).

EDIT: Ah! That was 4[B][U]2[/U][/B]N81W

James Heinrich 2012-02-21 12:50

I just noticed this:[quote]Cannot use lots of memory because ÀzØ is running.[/quote]This is the first time I've seen something like that. LowMemWhileRunning= normally works fine. Stopping and restarting Prime95 worked fine. For reference, my config is:[code]LowMemWhileRunning=photoshop,PTgui,steamapps[/code]

kladner 2012-02-21 15:13

1 Attachment(s)
ÀzØ
= Latin capital A with grave, Latin small letter z, Latin capital O with stroke, plus a (binary) control character

Hi James,
I am just wondering if the mystery executable appears as above (and in attached) on your system. I looked at this in a few different Character Encoding choices, but could not sort out possible languages.

Have you found any explanation for this oddity?

James Heinrich 2012-02-21 15:35

[QUOTE=kladner;290277]I am just wondering if the mystery executable appears as above (and in attached) on your system. I looked at this in a few different Character Encoding choices, but could not sort out possible languages.[/QUOTE]No, there was nothing running that could match that. My best guess is an invalid pointer: perhaps Prime95 gets a list of running process IDs, then looks up the name/path of each one to see if it matches, and in the few cycles between getting the list and doing the lookup, the process no longer exists (note: wild theory only).
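The theorized race (time-of-check vs. time-of-use on the process list) can be sketched in a few lines of Python. This is only a simulation of the idea, not Prime95's actual code; the process table and names here are made up:

```python
# Simulated OS process table: PID -> executable name (hypothetical values).
process_table = {101: "photoshop.exe", 102: "steamapps.exe", 103: "notepad.exe"}

def snapshot_pids():
    """Time-of-check: grab the current list of PIDs."""
    return list(process_table)

def lookup_name(pid):
    """Time-of-use: resolve a PID to a name. If the process has exited
    in the meantime, the lookup yields nothing useful (here: None)."""
    return process_table.get(pid)

pids = snapshot_pids()                   # 1. enumerate running processes
del process_table[103]                   # 2. a process exits in the gap
names = [lookup_name(p) for p in pids]   # 3. the stale PID no longer resolves
print(names)
```

The stale entry comes back as None instead of a valid name; in real code, reading a name for a vanished process through an invalid pointer could plausibly produce exactly this kind of garbage string.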

ÀzØ = [C0][7A][D8][01]
I thought it might be a couple UTF-8 characters, but it's invalid for that. UTF-16BE or LE don't make much sense either. I have to conclude it's just garbage. :smile:
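The byte analysis above is easy to verify directly. Assuming the four bytes really are C0 7A D8 01, they decode cleanly as Windows-1252 (the usual Windows "ANSI" codepage) into the observed "ÀzØ" plus a control character, but they are not valid UTF-8, since 0xC0 is never a legal UTF-8 lead byte:

```python
raw = bytes([0xC0, 0x7A, 0xD8, 0x01])

# Interpreted as Windows-1252, the bytes come out as the observed
# string plus a (binary) control character.
decoded = raw.decode("cp1252")
print(repr(decoded))  # 'ÀzØ\x01'

# As UTF-8 the same sequence is rejected outright.
try:
    raw.decode("utf-8")
    utf8_valid = True
except UnicodeDecodeError:
    utf8_valid = False
print("valid UTF-8:", utf8_valid)
```

This is consistent with the "just garbage" conclusion: whatever raw memory was read got rendered through the system codepage rather than being a real program name.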


All times are UTC. The time now is 17:50.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.