mersenneforum.org

akselsm 2013-02-09 11:36

Prime95 performance on Core i5 3570K
 
I am struggling to understand how Prime95 performs on my Intel Core i5 3570K-based computer. I am running 4 workers on 4 threads (one on each core). From the benchmarks on mersenne.org, I expect to see iteration times of approx. 20 ms (or a little bit higher, since I am using the computer as well) for 3M FFTs, but the observed iteration times range from 50 to 75 ms.

I tried enabling the workers one by one, with the following results:
[LIST]
[*]1 worker: 20 ms (CPU#1)
[*]2 workers: 25 ms (CPU#1 and CPU#2)
[*]3 workers: 40+ ms (CPU#1-3)
[*]4 workers: 50+ ms (CPU#1-4)
[/LIST]
I have run the benchmark, and it gives me the same results as others have gotten (at 3072K: 20 ms on one thread, down to 13 ms on 4 threads).

Have I misunderstood something about how running multiple workers affects performance, or is there something wrong?

Aramis Wyler 2013-02-09 15:47

With 4 workers running at 50 ms each, every 50 ms you're making progress on 4 numbers: 4 units of work. With 4 threads running at 20 ms (combined) but only against 1 number, you're only getting 2.5 units of work (50/20) in each 50 ms.

So yes, putting 4 cores against 1 number will get that number completed faster than doing 4 numbers at a time, but not 4 times faster. If you put all 4 cores on 1 number and complete 4 numbers that way (at 20 ms per iteration each), that will take longer than doing 4 simultaneously at 50 ms per iteration.
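The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the helper name `iterations_per_second` is made up, and the 50 ms and 20 ms figures are the ones quoted in this thread, not new measurements.

```python
def iterations_per_second(ms_per_iteration, numbers_in_flight=1):
    """Total LL iterations completed per second across all numbers being tested."""
    return numbers_in_flight * 1000.0 / ms_per_iteration

# Figures quoted in this thread (assumed, not re-measured).
four_workers = iterations_per_second(50, numbers_in_flight=4)  # one number per core
one_worker_four_threads = iterations_per_second(20)            # four cores on one number

ratio = four_workers / one_worker_four_threads

print(f"4 workers x 1 thread : {four_workers:.0f} iterations/s")
print(f"1 worker  x 4 threads: {one_worker_four_threads:.0f} iterations/s")
print(f"throughput ratio     : {ratio:.1f}")
```

The ratio is the same 4 versus 2.5 comparison, just expressed per second instead of per 50 ms.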

chappy 2013-02-09 16:50

You should be getting around 24 ms per iteration on each worker for the exponents in the 59M range that are typically being handed out right now.

Do you have hyperthreading turned off, and are you running the latest version of Prime95?

As an example, I have 4 numbers running on mine: each 59 million exponent runs at 24 ms per iteration, the 56 million at 23-24 ms per iteration, and the 53 million at 21 ms per iteration.

akselsm 2013-02-09 17:30

All four exponents are in the 55M to 59M range, and run at 50-60 ms per iteration.

AFAIK the i5 3570K doesn't have HT, and I am running Prime95 v27.9 build 1 (x64). Previously, I ran v27.7, and performance was the same.

As stated previously, if I run two workers on two cores (one worker per core, leaving two cores idle), I get normal performance. If I add more workers and cores (preserving the one-to-one worker-core relationship), performance drops. Thus, running two workers at "full speed" will give the same results in the same time as running four workers at "half speed".
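A rough way to see this is to convert the per-worker iteration times from the first post into a combined rate. The sketch below is only illustrative: the helper `aggregate_throughput` is a made-up name, and the "40+" and "50+" readings are treated as exactly 40 and 50 ms, which is an assumption.

```python
# Per-worker iteration times (ms) reported in the first post, keyed by the
# number of active workers. "40+" and "50+" are assumed to be 40 and 50.
observed_ms = {1: 20, 2: 25, 3: 40, 4: 50}

def aggregate_throughput(workers, ms_per_iteration):
    """Combined iterations per second when every worker runs at the same speed."""
    return workers * 1000.0 / ms_per_iteration

throughput = {n: aggregate_throughput(n, ms) for n, ms in observed_ms.items()}

for n, rate in throughput.items():
    print(f"{n} worker(s): {rate:.0f} iterations/s in total")
```

With these inputs the combined rate stops growing after two workers, which is the pattern one would expect if something shared between the cores, rather than the cores themselves, is the limit.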

akselsm 2013-02-09 17:37

[QUOTE=Aramis Wyler;328675]With 4 workers running at 50 ms each, every 50 ms you're making progress on 4 numbers: 4 units of work. With 4 threads running at 20 ms (combined) but only against 1 number, you're only getting 2.5 units of work (50/20) in each 50 ms.

So yes, putting 4 cores against 1 number will get that number completed faster than doing 4 numbers at a time, but not 4 times faster. If you put all 4 cores on 1 number and complete 4 numbers that way (at 20 ms per iteration each), that will take longer than doing 4 simultaneously at 50 ms per iteration.[/QUOTE]

I am aware that the Lucas-Lehmer algorithm is highly sequential and does not parallelize well. However, this is not my problem:

I am putting [B]one[/B] core against [B]one[/B] number, but if I utilize all four cores (one core per number), the performance of all cores drops drastically.

Mr. P-1 2013-02-09 18:17

Some drop-off in efficiency is to be expected. For example, doing 29M doublechecks on my 3570K at stock speed, my iteration time is typically 11-12 ms with just one worker, rising to 16 ms with four workers. The main reason for the slowdown is probably cache contention.

Your efficiency drop-off is extreme. Since your processor is the same as mine, we should look elsewhere for the problem. What is your RAM configuration?

sdbardwick 2013-02-09 18:23

Double-check that each worker is assigned to its own unique core via the Worker Windows dialog. Personally, I don't use the 'smart assignment' option; I always go through the worker-specific options to set worker #1 to CPU #1, etc., just to make sure the CPU assignments are what I expect.

If the assignments reflect that (and Task Manager shows 100% CPU usage) but the iteration times are still high, then the most likely culprit is thermal throttling due to bad contact with the CPU heatsink/fan (those Intel push-pin types can look like they are seated correctly, then pop loose later).

Dubslow 2013-02-09 18:30

Posts #6 and #7 (Mr. P-1 and sdbardwick) have it right: the two most likely issues are a RAM bottleneck (you should be okay with dual channel DDR3-1333 or higher), or the workers somehow interfering with each other in a way that's not supposed to happen. Definitely check the Task Manager.
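For a sense of why the channel count matters, here is a small Python sketch of theoretical peak memory bandwidth. It assumes the usual 64-bit (8-byte) data bus per DDR3 channel and a nominal 1333 MT/s transfer rate; the function name `peak_bandwidth_gb_s` is made up for illustration.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, channels, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s (10^9 bytes per second).

    bus_bytes=8 assumes a 64-bit data bus per channel.
    """
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9

single = peak_bandwidth_gb_s(1333, channels=1)
dual = peak_bandwidth_gb_s(1333, channels=2)

print(f"DDR3-1333 single channel: {single:.1f} GB/s peak")
print(f"DDR3-1333 dual channel  : {dual:.1f} GB/s peak")
```

Sustained bandwidth in practice is lower than these peak figures, and the sketch does not try to estimate how much of it Prime95 actually needs.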

akselsm 2013-02-09 18:46

I have 2x Kingston HyperX blu 8GB (16 GB total). I checked everything and found that at some point I must have put the memory sticks in the wrong slots (i.e. no dual channel).

I get iteration times of 26 ms now. Thanks for helping!

Xyzzy 2013-02-09 23:36

[QUOTE]I have 2x Kingston HyperX blu 8GB (16 GB total).[/QUOTE]

Doing one or two instances of P-1 factoring would really help the project. Most people do not have that much memory.

Mr. P-1 2013-02-10 01:03

[QUOTE=Xyzzy;328751]Doing one or two instances of P-1 factoring would really help the project. Most people do not have that much memory.[/QUOTE]

I do. :evil:


All times are UTC.