mersenneforum.org  

Old 2020-02-29, 19:02   #1
phillipsjk
 
Nov 2019

28₁₀ Posts
Performance dropped dramatically after update

Hello,

I am running FreeBSD on an old dual-Xeon system. After a software update last month, the run-time on primality testing went up dramatically. I initially blamed the web browser and a virtual machine downloading the internet, but the poor performance remained after killing those processes.

I suspect it may be related to "Lazy FP State Restore" mitigation.

https://wiki.freebsd.org/Speculative...ulnerabilities
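(The kernel's mitigation state can be inspected directly; the sysctl names below are the ones used in the FreeBSD security advisories, but they vary by release, so treat this as a sketch:)

```shell
# Query speculative-execution mitigation knobs on FreeBSD.
# Names vary by release -- these are the ones from the advisories.
sysctl vm.pmap.pti                      # Meltdown page-table isolation
sysctl hw.ibrs_disable hw.ibrs_active   # Spectre v2 (IBRS)
sysctl hw.mds_disable                   # MDS
sysctl hw.spec_store_bypass_disable     # Spectre v4 (SSBD)
```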

I was also assigned two large exponents for initial PRP testing in mid-January. I suspect I am thoroughly memory-bandwidth limited for those. I am processing 4 assignments at once because each CPU has two independent L2 caches (2×6 MB, 12 MB total).

If I look at my assignments on mersenne.org, it says I am expected to finish 54xxxxxx (double-check work, 47% done) in 10 days. Based on the actual throughput (269 ms/iter), mprime expects me to finish in about 90 days. That is about double the run-time I was expecting (I think I got closer to 100 ms/iter previously).

For the 105xxxxxx exponents, it is even worse: the assignment list on mersenne.org estimates 80 days to go (8% done), but based on throughput (500 ms/iter), mprime says I should expect to finish in about 560 days. Again, this is double the run-time I expect (I would expect run-time to increase by at most 4×).
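For anyone checking the arithmetic, mprime's estimates can be reproduced from these figures (a sketch; it assumes roughly one iteration per bit of the exponent, the usual PRP approximation, and the commonly quoted ~p^2.1 run-time scaling):

```python
# Rough ETA arithmetic for the figures quoted above (assumes one
# iteration per bit of the exponent -- the usual PRP/LL approximation).
def eta_days(ms_per_iter, exponent, fraction_done):
    total_seconds = ms_per_iter / 1000 * exponent
    return total_seconds * (1 - fraction_done) / 86400

print(eta_days(269, 54_000_000, 0.47))   # ~89 days: the 54M double check
print(eta_days(500, 105_000_000, 0.08))  # ~559 days: the 105M PRP test

# Expected slowdown from exponent size alone, ~p^2.1 scaling:
print((105 / 54) ** 2.1)                 # ~4.04x
```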

I already told mprime not to grab more work for that machine. I was hoping that once the 54xxxxxx exponent finished, my other processes would use the idle cores and avoid cache flushes.

Edit: I did check that all 8 cores (over 2 CPUs) are active.

Last fiddled with by phillipsjk on 2020-02-29 at 19:20 Reason: typo, added L2 cache size.
Old 2020-03-01, 05:09   #2
phillipsjk
 
Nov 2019

2²×7 Posts

I may transfer this work to my "white elephant" (quad CPU) AMD machine. I generally can't afford to run it, but it should complete within a month (more efficiently than the current machine).

Is switching machines mid-run frowned upon? I presume it is only a problem if the machines are unreliable.
Old 2020-03-01, 14:00   #3
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

6,833 Posts

Switching machines happens often -- not frowned upon at all.
Old 2020-03-01, 14:40   #4
phillipsjk
 
Nov 2019

2²·7 Posts

Good to know.

Last night, killing the browser (but not the virtual machine) improved performance by 20% (about right for how much CPU it was using). Not sure why that did not happen earlier.

Maybe I was not tracking performance carefully enough.

Last fiddled with by phillipsjk on 2020-03-01 at 14:42 Reason: Added second parenthetical comment.
Old 2020-03-01, 17:06   #5
Xyzzy
 
 
"Mike"
Aug 2002

7·29·37 Posts

Quote:
Originally Posted by phillipsjk View Post
After a software update last month, the run-time on Primality testing went up dramatically.
Maybe it was a Spectre/Meltdown update?

https://www.pcworld.com/article/3250...-hardware.html
https://www.zdnet.com/article/linus-...h-needs-curbs/

Old 2020-03-01, 20:36   #6
fivemack
(loop (#_fork))
 
 
Feb 2006
Cambridge, England

2²×3×23² Posts

Quote:
Originally Posted by Xyzzy View Post
Maybe it was a Spectre/Meltdown update?
I do wish that operating system manufacturers would provide a version that unleashes the full speculation potential of our hardware, for people actually wanting to do computation rather than checking their bank accounts and email.
Old 2020-03-02, 02:00   #7
phillipsjk
 
Nov 2019

28₁₀ Posts

Quote:
Originally Posted by Xyzzy View Post
Maybe it was a Spectre/Meltdown update?
Don't think so. My CPU does not do hyperthreading.

Edit: None of the Intel CPUs I have in use support hyperthreading. (I did run a hyperthreaded, overclocked Pentium 4 in place of a Pentium D for a while to save a bit of power.)

Last fiddled with by phillipsjk on 2020-03-02 at 02:07
Old 2020-03-07, 21:39   #8
ewmayer
2ω=0
 
 
Sep 2002
República de California

2×5²×223 Posts

What does 'top' show in terms of CPU utilization?
Old 2020-03-17, 06:47   #9
phillipsjk
 
Nov 2019

2²×7 Posts

Quote:
Originally Posted by ewmayer View Post
What does 'top' show in terms of CPU utilization?
If I recall correctly, mprime was consistently over 500% CPU usage.


Without mprime (and at 2.5 GHz instead of 2.0 GHz):


Code:
last pid:  2002;  load averages:  0.56,  0.47,  0.41                                                        up 0+06:45:34  00:45:53
47 processes:  1 running, 46 sleeping
CPU:  1.7% user,  0.0% nice,  2.9% system,  0.0% interrupt, 95.4% idle
Mem: 1023M Active, 2086M Inact, 152M Laundry, 4967M Wired, 7682M Free
ARC: 2064M Total, 1055M MFU, 931M MRU, 199K Anon, 41M Header, 36M Other
     1511M Compressed, 5313M Uncompressed, 3.52:1 Ratio
Swap: 10G Total, 10G Free
Edit: I see "system" usage is about double "user" usage. I wonder if the system is trying to compress mprime's working memory. Edit2: probably not: ARC is a disk cache.


Edit3: I wonder if I was having NUMA issues. On "white elephant", I am running one instance for each CPU. That does not explain why it would suddenly stop working well, though.
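(If it is cache or memory affinity, FreeBSD's cpuset(1) can pin each instance to one package. A hypothetical sketch; the core numbering and directories below are assumptions, so check the real topology with `cpuset -g` first:)

```shell
# Pin each mprime instance (run from its own directory) to one CPU
# package so its FFT data stays in that package's caches.
# Core numbers 0-3 / 4-7 are hypothetical -- verify with: cpuset -g
cd /home/prime/instance1 && cpuset -l 0-3 ./mprime &
cd /home/prime/instance2 && cpuset -l 4-7 ./mprime &
```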

Last fiddled with by phillipsjk on 2020-03-17 at 06:55
Old 2020-03-17, 18:00   #10
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2×23×83 Posts

Quote:
Originally Posted by phillipsjk View Post
Hello,

I am running FreeBSD on an old dual-Xeon system. After a software update last month, the run-time on Primality testing went up dramatically.
...
I was also assigned two large exponents for initial PRP testing in mid January. I suspect I am thoroughly memory bandwidth limited for those. I am processing 4 assignments at once because each CPU has two independent L2 caches (2x6MB, 12MB total).

If I look at my assignments on mersenne.org, it says I am expected to finish 54xxxxxx (double-check work, 47% done) in 10 days. Based on the actual throughput (269 ms/iter), mprime expects me to finish in about 90 days. That is about double the run-time I was expecting (I think I got closer to 100 ms/iter previously).

For the 105xxxxxx exponents, it is even worse: the assignment list on mersenne.org estimates 80 days to go (8% done), but based on throughput (500 ms/iter), mprime says I should expect to finish in about 560 days. Again, this is double the run-time I expect (I would expect run-time to increase by at most 4×).

I already told mprime not to grab more work for that machine. I was hoping that once the 54xxxxxx exponent finished, my other processes would use the idle cores and avoid cache flushes.

Edit: I did check that all 8 cores (over 2 CPUs) are active.
What are your power save settings?
Which Xeon model do you have? How do your iteration times compare to the 1-core benchmarks for it at mersenne.ca? https://www.mersenne.ca/throughput.p...%7C0&mhz1=2400
What benchmarking have you done before and after the update, and what does that show?
What else might have changed?

On my old dual-E5520, I scaled back to two workers to keep assignment latencies acceptable as exponents increased, and for that I'm getting 28 ms/iter on two 93.8M PRPs on Win7 (all 3 memory slot sets full; 4 cores per Xeon).

Your exponents give an expected run-time ratio of (105M / 54M)^2.1 ≈ 4.04.
Your 54M iteration time is like an old single core 32-bit laptop I have, on XP or Vista (Pentium 750M, both memory slots full)
Old 2020-03-18, 00:04   #11
phillipsjk
 
Nov 2019

2²·7 Posts

I have a dual Intel Xeon L5420 @ 2.50GHz, and my single-core benchmark numbers are way better than the ones listed on that site:


https://www.mersenne.ca/throughput.p...%7C0&mhz1=2500



Code:
[Worker #1 Mar 17 17:39] Worker starting
[Worker #1 Mar 17 17:39] Your timings will be written to the results.bench.txt file.
[Worker #1 Mar 17 17:39] Compare your results to other computers at http://www.mersenne.org/report_benchmarks
[Worker #1 Mar 17 17:39] Timing 41 iterations of 768K all-complex FFT length.  Best time: 12.660 ms., avg time: 12.789 ms.
[Worker #1 Mar 17 17:39] Timing 39 iterations of 800K all-complex FFT length.  Best time: 13.521 ms., avg time: 13.683 ms.
[Worker #1 Mar 17 17:39] Timing 36 iterations of 864K all-complex FFT length.  Best time: 15.596 ms., avg time: 15.775 ms.
[Worker #1 Mar 17 17:39] Timing 33 iterations of 960K all-complex FFT length.  Best time: 16.617 ms., avg time: 16.761 ms.
[Worker #1 Mar 17 17:39] Timing 31 iterations of 1024K all-complex FFT length.  Best time: 17.187 ms., avg time: 17.436 ms.
[Worker #1 Mar 17 17:39] Timing 27 iterations of 1152K all-complex FFT length.  Best time: 20.683 ms., avg time: 20.886 ms.
[Worker #1 Mar 17 17:39] Timing 24 iterations of 1280K all-complex FFT length.  Best time: 22.413 ms., avg time: 22.627 ms.
[Worker #1 Mar 17 17:39] Timing 22 iterations of 1440K all-complex FFT length.  Best time: 26.845 ms., avg time: 27.130 ms.
[Worker #1 Mar 17 17:39] Timing 20 iterations of 1536K all-complex FFT length.  Best time: 27.724 ms., avg time: 28.007 ms.
[Worker #1 Mar 17 17:39] Timing 19 iterations of 1600K all-complex FFT length.  Best time: 28.822 ms., avg time: 29.064 ms.
[Worker #1 Mar 17 17:39] Timing 18 iterations of 1728K all-complex FFT length.  Best time: 32.894 ms., avg time: 33.140 ms.
[Worker #1 Mar 17 17:39] Timing 16 iterations of 1920K all-complex FFT length.  Best time: 35.669 ms., avg time: 36.704 ms.
[Worker #1 Mar 17 17:39] Timing 15 iterations of 2048K all-complex FFT length.  Best time: 36.545 ms., avg time: 36.827 ms.
[Worker #1 Mar 17 17:39] Timing 13 iterations of 2304K all-complex FFT length.  Best time: 43.282 ms., avg time: 43.601 ms.
[Worker #1 Mar 17 17:39] Timing 13 iterations of 2400K all-complex FFT length.  Best time: 45.685 ms., avg time: 45.946 ms.
[Worker #1 Mar 17 17:39] Timing 12 iterations of 2560K all-complex FFT length.  Best time: 47.397 ms., avg time: 47.689 ms.
[Worker #1 Mar 17 17:39] Timing 11 iterations of 2880K all-complex FFT length.  Best time: 56.123 ms., avg time: 56.526 ms.
[Worker #1 Mar 17 17:39] Timing 10 iterations of 3072K all-complex FFT length.  Best time: 57.419 ms., avg time: 57.765 ms.
[Worker #1 Mar 17 17:39] Timing 10 iterations of 3200K all-complex FFT length.  Best time: 60.542 ms., avg time: 60.882 ms.
[Worker #1 Mar 17 17:39] Timing 10 iterations of 3456K all-complex FFT length.  Best time: 69.810 ms., avg time: 70.180 ms.
[Worker #1 Mar 17 17:39] Timing 10 iterations of 3840K all-complex FFT length.  Best time: 73.925 ms., avg time: 74.258 ms.
[Worker #1 Mar 17 17:39] Timing 10 iterations of 4096K all-complex FFT length.  Best time: 76.530 ms., avg time: 78.048 ms.
[Worker #1 Mar 17 17:39] Timing 10 iterations of 4608K all-complex FFT length.  Best time: 91.814 ms., avg time: 97.894 ms.
[Worker #1 Mar 17 17:39] Timing 10 iterations of 4800K all-complex FFT length.  Best time: 94.609 ms., avg time: 95.093 ms.
[Worker #1 Mar 17 17:39] Timing 10 iterations of 5120K all-complex FFT length.  Best time: 98.754 ms., avg time: 99.913 ms.
[Worker #1 Mar 17 17:40] Timing 10 iterations of 5760K all-complex FFT length.  Best time: 118.108 ms., avg time: 119.862 ms.
[Worker #1 Mar 17 17:40] Timing 10 iterations of 6144K all-complex FFT length.  Best time: 120.637 ms., avg time: 123.088 ms.
[Worker #1 Mar 17 17:40] Timing 10 iterations of 6400K all-complex FFT length.  Best time: 126.661 ms., avg time: 128.707 ms.
[Worker #1 Mar 17 17:40] Timing 10 iterations of 6912K all-complex FFT length.  Best time: 142.442 ms., avg time: 143.616 ms.
[Worker #1 Mar 17 17:40] Timing 10 iterations of 7680K all-complex FFT length.  Best time: 157.132 ms., avg time: 158.526 ms.
[Worker #1 Mar 17 17:40] Timing 10 iterations of 8192K all-complex FFT length.  Best time: 160.198 ms., avg time: 161.768 ms.
[Worker #1 Mar 17 17:40] FFT timings benchmark complete.
[Worker #1 Mar 17 17:40] Worker stopped.
I have SpeedStep enabled, but on FreeBSD I essentially choose manually between 2.0 and 2.5 GHz with
Code:
# sysctl dev.cpu.0.freq=2500
I am mostly memory-bandwidth limited, with each CPU having access to dual-channel DDR2 FB-DIMMs (at 667 MHz, IIRC).
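(As a rough sanity check on that limit — a sketch only; FB-DIMM's serial AMB links change the protocol details but not the order of magnitude:)

```python
# Peak theoretical bandwidth for dual-channel DDR2-667:
# 2 channels x 8-byte bus x 667 MT/s. Sustained bandwidth is lower.
channels, bus_bytes, mt_per_s = 2, 8, 667e6
peak_gb_s = channels * bus_bytes * mt_per_s / 1e9
print(peak_gb_s)   # ~10.7 GB/s per CPU, shared by its 4 cores
```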


I did not significantly change the workload of the machine, but I did update the web browser (Mozilla). I also have a VirtualBox instance running a mostly idle Archive Team "warrior" instance. The latter may trigger extra cache evictions during context switches.


Benchmarking has been mainly limited to monitoring the ms/iteration times. As I said up-thread, they approximately doubled. It is possible that the 105xxxxxx assignments (which arrived at around the right time) are just too large for my CPU caches.
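(The cache argument is easy to put numbers on. The sketch below assumes ~8 bytes per FFT element and FFT lengths of roughly 3072K for a 54M exponent and 5760K for a 105M one — both assumptions on my part:)

```python
# Working-set size of an FFT residue: fft_length * 8 bytes per element.
def working_set_mb(fft_len_k):
    return fft_len_k * 1024 * 8 / 2**20

print(working_set_mb(3072))  # 24.0 MB for a ~54M exponent
print(working_set_mb(5760))  # 45.0 MB for a ~105M exponent
```

Either way the residue no longer fits in a 6 MB L2, so every pass streams from memory; the 105M tests just do it harder.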


Just today I started doing P-1 factoring on this machine (shorter run-times). I am now limiting mprime to 6 CPU cores. Other software, such as the web browser, is much more responsive with 2 "idle" cores available.


I suspect I was getting cache evictions with mprime loading all cores.

Last fiddled with by phillipsjk on 2020-03-18 at 00:05
Similar Threads
Thread Thread Starter Forum Replies Last Post
Dropbox dropped Prime95 performance 47% Rodrigo Information & Answers 26 2011-06-21 05:06
How to restore CPU if it was dropped from server? Unregistered Information & Answers 4 2009-10-22 11:16
64-bit performance of v25.6 James Heinrich PrimeNet 11 2008-04-24 01:42
64 bit performance? zacariaz Hardware 1 2007-05-10 13:08
Performance battlemaxx Prime Sierpinski Project 4 2005-06-29 20:32
