mersenneforum.org

2016-05-01, 01:35   #23
retina ("The unspeakable one", Jun 2006)

Quote:
Originally Posted by Madpoo
That's exactly the problem... the CPU usage can vary wildly.
Then the whole benchmark thing is useless. If you can't get reliable figures for real production usage, then benchmarks from a different runtime environment configuration are not going to help you.

I think bgbeuning had it correct. Use actual live data to select your run parameters.

2016-05-01, 03:40   #24
Prime95 (P90 years forever!, Aug 2002)

I think I have my general game plan in mind. Thanks to all for the lively discussion.

First off, I'll only include the best 2 to 4 implementations for each FFT size, so that even if benchmarking fails to find the best implementation we won't lose too much performance.

My plan is to store in local.txt several (10?) throughput benchmarks for each FFT size. Prime95 will schedule a job to run at random times. This job scans worktodo.txt to see which FFT sizes we will need now or in the near future. Then it runs 20-second benchmarks for each FFT implementation where we don't already have sufficient benchmark data.
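
In rough C, that scheduled job might look something like this (every helper name below is just a placeholder for illustration, not an actual prime95 routine):
Code:
#include <time.h>

#define BENCH_SECONDS 20

/* All of these are hypothetical placeholders. */
extern int  fft_sizes_needed(int *sizes, int max);  /* scan worktodo.txt */
extern int  num_implementations(int fft_size);      /* the 2-4 candidates */
extern int  have_enough_bench_data(int fft_size, int impl);
extern void run_fft_iteration(int fft_size, int impl);
extern void record_benchmark(int fft_size, int impl, long iterations);

void bench_needed_ffts(void)
{
    int sizes[64];
    int n = fft_sizes_needed(sizes, 64);

    for (int i = 0; i < n; i++) {
        for (int impl = 0; impl < num_implementations(sizes[i]); impl++) {
            if (have_enough_bench_data(sizes[i], impl))
                continue;                 /* already have ~10 samples */
            time_t start = time(NULL);
            long iters = 0;
            while (time(NULL) - start < BENCH_SECONDS) {
                run_fft_iteration(sizes[i], impl);
                iters++;
            }
            record_benchmark(sizes[i], impl, iters);  /* throughput sample */
        }
    }
}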

Prime95 will remember the CPU brand string from CPUID to detect moving local.txt to a new machine. Also, bench data will be dated and deleted after say 6 months. This also limits the damage from copying local.txt to new machines.
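
Reading the brand string is the standard CPUID technique: leaves 0x80000002 through 0x80000004 each return 16 bytes of the 48-byte string. A minimal sketch using GCC/Clang's <cpuid.h> (prime95 uses its own CPUID code):
Code:
#include <cpuid.h>
#include <stdio.h>
#include <string.h>

/* Build the 48-byte CPU brand string from CPUID leaves 0x80000002-4. */
static void cpu_brand_string(char brand[49])
{
    unsigned int r[4];
    for (unsigned int leaf = 0; leaf < 3; leaf++) {
        __get_cpuid(0x80000002 + leaf, &r[0], &r[1], &r[2], &r[3]);
        memcpy(brand + 16 * leaf, r, 16);  /* EAX,EBX,ECX,EDX in order */
    }
    brand[48] = '\0';
}

int main(void)
{
    char brand[49];
    cpu_brand_string(brand);
    /* If this string differs from the one saved in local.txt, the file
     * was copied to another machine and the bench data is discarded. */
    printf("%s\n", brand);
    return 0;
}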

The theory is that if we do several benchmarks, hopefully a substantial number will be done while interference from other apps is minimal or non-impactful.
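
Given, say, 10 samples per implementation, the highest observed throughput is the sample least likely to have been disturbed, so selection can be as simple as this (illustrative only):
Code:
/* Of n throughput samples taken at random times, return the highest;
 * it is the one least likely to have suffered interference. */
double best_of_samples(const double *throughput, int n)
{
    double best = throughput[0];
    for (int i = 1; i < n; i++)
        if (throughput[i] > best)
            best = throughput[i];
    return best;
}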

The next help I'll need is throughput benchmarks from your machines, so that I can figure out the best 2 to 4 implementations to include.

2016-05-01, 09:53   #25
dh1 ("Denny", Sep 2015)

I respectfully request a user-selectable option to run all needed tuning jobs manually, at once or on demand, and be done with it. I assume I can provide a set of FFT ranges, and I know my own system's loads.

2016-05-01, 12:29   #26
ATH (Einyen, Dec 2003, Denmark)

Quote:
Originally Posted by Prime95
The next help I'll need is throughput benchmarks from your machines, so that I can figure out the best 2 to 4 implementations to include.
Which FFTs? And do you need benchmarks for 1 worker, 2 workers, 4 workers, etc.?

2016-05-01, 15:57   #27
Prime95 (P90 years forever!, Aug 2002)

Quote:
Originally Posted by ATH
Which FFTs? And do you need benchmarks for 1 worker, 2 workers, 4 workers, etc.?
I'll need 1 and 4 workers (all cores), but you'll have to wait for me to make a special prime95 version that includes many more FFT implementations. I'll probably need all FFTs above 64K. I'll post instructions when the time comes.

Last fiddled with by Prime95 on 2016-05-01 at 16:00

2016-05-02, 14:04   #28
tului (Jan 2013)

Couldn't the benchmarks just be labeled as a "stress test/benchmark" that you run after the person clicks on Join GIMPS? I mean, they've already clicked to join and logged in. Then the burn-in crowd won't be bothered with it.

Alternatively, have a version of the burn-in stuff that builds a giant database of FFT performance.

2016-05-03, 10:21   #29
rudi_m (Jul 2005)

Quote:
Originally Posted by Prime95
I think I have my general game plan in mind. Thanks to all for the lively discussion.

First off, I'll only include the best 2 to 4 implementations for each FFT size, so that even if benchmarking fails to find the best implementation we won't lose too much performance.

My plan is to store in local.txt several (10?) throughput benchmarks for each FFT size. Prime95 will schedule a job to run at random times. This job scans worktodo.txt to see which FFT sizes we will need now or in the near future. Then it runs 20-second benchmarks for each FFT implementation where we don't already have sufficient benchmark data.

Prime95 will remember the CPU brand string from CPUID to detect moving local.txt to a new machine. Also, bench data will be dated and deleted after say 6 months. This also limits the damage from copying local.txt to new machines.

The theory is that if we do several benchmarks, hopefully a substantial number will be done while interference from other apps is minimal or non-impactful.
Hm, I'm a bit sceptical.

Do we know that the best implementation really depends on the CPUID at all? If so, all GIMPS users could share/re-use their benchmarks somehow via GIMPS. The benchmark would only need to run if there is no cached benchmark available via GIMPS.
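
As a hypothetical sketch of that lookup (both function names are made up):
Code:
/* Reuse a benchmark shared via GIMPS, keyed on the CPU brand string;
 * only benchmark locally when nothing is cached for this CPU. */
extern int server_lookup_best_impl(const char *brand, int fft_size, int *impl);
extern int run_local_benchmark(int fft_size);  /* returns best impl */

int best_impl_for(const char *brand, int fft_size)
{
    int impl;
    if (server_lookup_best_impl(brand, fft_size, &impl))
        return impl;                    /* reuse another user's result */
    return run_local_benchmark(fft_size);
}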

But I have doubts about this sharing idea. I guess what else is running on the machine is more important: how the CPU cache is used by other processes, or how the OS kernel (process scheduler) is configured. An idealized benchmark on an idle machine may not tell us the best FFT implementations for when the system is running normally.

So I would not run extra benchmarks on the client. I would just measure the current real workload from time to time using the different implementations. Maybe once per day, switch through the 4 implementations one after the other and continue with the best one. This way you would never need to waste CPU time (except for not using the best implementation for a few minutes per day).
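
As a sketch of what I mean (switch_worker_impl() and measure_live_throughput() are made-up names):
Code:
/* Once per day: run the live worker on each candidate implementation
 * for a while, measure the real iteration rate in place, keep the best. */
#define NUM_CANDIDATES 4

extern void   switch_worker_impl(int worker, int impl);    /* hypothetical */
extern double measure_live_throughput(int worker);         /* iters/sec, in situ */

void daily_reselect(int worker)
{
    int best_impl = 0;
    double best_tput = 0.0;

    for (int impl = 0; impl < NUM_CANDIDATES; impl++) {
        switch_worker_impl(worker, impl);
        double tput = measure_live_throughput(worker);
        if (tput > best_tput) {
            best_tput = tput;
            best_impl = impl;
        }
    }
    switch_worker_impl(worker, best_impl);  /* continue with the winner */
}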

Last fiddled with by rudi_m on 2016-05-03 at 10:23

2016-05-03, 11:16   #30
axn (Jun 2003)

Quote:
Originally Posted by rudi_m
what else is running on the machine is more important: how the CPU cache is used by other processes, or how the OS kernel (process scheduler) is configured.
These are transient factors (except the OS kernel configuration), and beyond our (the software's) control. It would be futile to try to account for them -- even dynamic measurement can only achieve so much, since which FFT is best may change with the current system conditions.

What this thread is trying to achieve is to find out which FFT implementation is best for which hardware configuration(s). What George thought was an improvement to some FFTs (as measured on his Haswell / Skylake) turned out to be a regression on some other machines.

So... scientifically establish the best FFT implementation for as many h/w configurations as possible, and use that data to select the optimal one at runtime by benchmarking a limited subset of FFT implementations.

Last fiddled with by axn on 2016-05-03 at 11:17

2016-05-03, 12:49   #31
rudi_m (Jul 2005)

Quote:
Originally Posted by axn
These are transient factors (except the OS kernel configuration), and beyond our (the software's) control. It would be futile to try to account for them -- even dynamic measurement can only achieve so much, since which FFT is best may change with the current system conditions.
IMO it's the other way around. Just benchmarking the currently running workers is easier. We wouldn't need to care about CPUID, OS, etc.

Quote:
Originally Posted by axn
What George thought was an improvement to some FFTs (as measured on his Haswell / Skylake) turned out to be a regression on some other machines.
All my Haswell and Skylake systems run a bit slower with 28.9 for real LL tests. That's why I doubt that the idealized benchmarks would improve things on my systems.

BTW, benchmarking the real workload could also be used to find the optimal CPU affinity automatically. For example, I have one setup with two LL threads and two factoring threads. I get the most throughput when running LL on CPUs 0 and 2, and factoring on CPUs 1 and 3. It should be possible to find this out automatically by experimentally moving the threads between CPUs.
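
On Linux such pinning is just pthread_setaffinity_np(); a minimal sketch of my layout (the thread handles are assumed to already exist):
Code:
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin one thread to one logical CPU. */
static int pin_thread(pthread_t t, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(t, sizeof(set), &set);
}

/* My layout: LL workers on CPUs 0 and 2, factoring on CPUs 1 and 3,
 * e.g.  pin_thread(ll_thread[0], 0);  pin_thread(ll_thread[1], 2);
 *       pin_thread(tf_thread[0], 1);  pin_thread(tf_thread[1], 3);  */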

2016-05-03, 13:28   #32
axn (Jun 2003)

Quote:
Originally Posted by rudi_m
IMO it's the other way around. Just benchmarking the currently running workers is easier. We wouldn't need to care about CPUID, OS, etc.
And that's the plan. The current data collection is to determine the top 2-4 implementations in general; then those will be benchmarked locally.

I think you are misunderstanding the CPUID thing. It is there to make sure that the locally recorded benchmark data can be discarded if it is copied to another machine.

2016-05-03, 17:20   #33
rudi_m (Jul 2005)

Quote:
Originally Posted by axn
And that's the plan. The current data collection is to determine the top 2-4 implementations in general; then those will be benchmarked locally.
I was only concerned about what this local benchmark will look like. IMO we should just measure the throughput of the actual running workers instead of running an extra synthetic benchmark (while all the real workers are stopped). In particular, each worker should be "benchmarked" separately while the other workers are running as usual.

In theory it could even happen that two workers should use different implementations to get the most combined throughput (e.g. one worker uses more CPU cache, the other less).
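
So the selection would really be over pairs of implementations, something like this (measure_combined_throughput() is a made-up name):
Code:
/* Try every pair of candidate implementations on the two concurrent
 * workers and keep the pair with the best combined throughput; the
 * individually-best implementations need not win together. */
#define NUM_CANDIDATES 4

extern double measure_combined_throughput(int impl_a, int impl_b);

void pick_pair(int *impl_a, int *impl_b)
{
    double best = 0.0;
    for (int a = 0; a < NUM_CANDIDATES; a++)
        for (int b = 0; b < NUM_CANDIDATES; b++) {
            double t = measure_combined_throughput(a, b);
            if (t > best) {
                best = t;
                *impl_a = a;
                *impl_b = b;
            }
        }
}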

Quote:
Originally Posted by axn View Post
I think you are misunderstanding the CPUID thing. That is to make sure that the locally recorded benchmark data can be discarded if it is copied to another machine.
Well, I mean it could be better not to cache whole benchmark results but just the last-used FFT implementation per worker. Then from time to time we check whether a worker should switch to another implementation (without using old benchmarks).

George wrote: "The theory is that if we do several benchmarks, hopefully a substantial number will be done while interference from other apps is minimal or non-impactful."

But I think:
1. We should not use absolute benchmark times from past dates and situations; they are in general not comparable.
2. I _want_ the benchmark running _while_ the usual interference from other apps (or other mprime workers!) is present, rather than filtering for ideal, synthetic benchmarks.