mersenneforum.org  

Old 2016-04-26, 20:32   #1
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

1D3D₁₆ Posts
help designing prime95 28.10 feature

I want to add a feature to prime95 to select the best FFT implementation for the user's computer. The need for this became all too apparent in 28.9, when some users reported that the new version was slower than 28.6 and others found it faster.

My plan is to do a throughput benchmark and write the results to local.txt. When starting an LL test, prime95 looks at the benchmark data and selects the fastest FFT.

The open issues where your ideas may be helpful:

1) The benchmark really wants to run when nothing else is going on. Even so, the OS may fire up some process that skews a run. So, I need a good algorithm that tracks multiple runs and throws out outlier data.

2) My first thought was to have a menu choice to run the throughput benchmark, but I suspect it will not get run on many machines. Alternatively, I could run the benchmark when prime95 is launched, or at a late-night hour, or both, until we have enough runs to be confident in our throughput data. Ideas? How many runs before we are confident in the results?

3) Any ideas on how best to detect a significant change (CPU, memory, whatever?) and discard the accumulated data?
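
For issue 1, one way to track multiple runs and throw out outliers is a median-based filter. The sketch below is only an assumption about how that could look, not prime95's actual code: `robust_throughput`, `confident`, the cutoff `k`, and the thresholds `min_runs` and `rel_tol` are all hypothetical names and values. It drops any run farther than `k` scaled median absolute deviations (MAD) from the median, then treats the data as trustworthy once enough surviving runs agree closely, which is one possible answer to "how many runs" in issue 2.

```python
from statistics import median

def robust_throughput(samples, k=3.0):
    """Return (estimate, kept) after discarding runs far from the median.

    samples: throughput readings from repeated benchmark runs.
    k: hypothetical cutoff in scaled-MAD units (3.0 is a common choice).
    """
    if not samples:
        raise ValueError("need at least one sample")
    med = median(samples)
    # Scaled median absolute deviation: comparable to a standard deviation
    # for normally distributed data, but barely moved by a few bad runs.
    mad = 1.4826 * median([abs(x - med) for x in samples])
    # Fall back to all samples if a very small k would reject everything.
    kept = [x for x in samples if abs(x - med) <= k * mad] or list(samples)
    return median(kept), kept

def confident(kept, min_runs=5, rel_tol=0.02):
    """True once enough surviving runs agree to within rel_tol of their median."""
    if len(kept) < min_runs:
        return False
    return (max(kept) - min(kept)) <= rel_tol * median(kept)

runs = [412.0, 415.0, 409.0, 414.0, 310.0, 413.0]  # made-up iterations/sec
estimate, kept = robust_throughput(runs)
print(estimate, confident(kept))
```

With these made-up numbers, the 310.0 run (say, the OS fired up a background task) is discarded and the other five runs are kept. A median is used rather than a mean so that one skewed run cannot drag the estimate.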

Last fiddled with by Prime95 on 2016-04-26 at 20:34
Old 2016-04-26, 21:31   #2
Mark Rose
 
"/X\(‘-‘)/X\"
Jan 2013

B73₁₆ Posts

The optimal FFT size may also change depending on the load on the machine throughout the day.

How quickly can you run a benchmark and get an accurate result? Would it be possible to run a micro-benchmark every hour, or every so many iterations, and adjust to current conditions?

I'm not aware of any easy way to detect a change in memory bandwidth, for instance, so micro-benchmarks may be the way to go.
Old 2016-04-26, 22:29   #3
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

3×5×499 Posts

I was going to use 20-second benchmarks running on all cores to determine a machine's throughput for each possible FFT implementation.
Old 2016-04-26, 23:28   #4
Mark Rose
 
"/X\(‘-‘)/X\"
Jan 2013

2931₁₀ Posts

Quote:
Originally Posted by Prime95
I was going to use 20-second benchmarks running on all cores to determine a machine's throughput for each possible FFT implementation.
Run hourly, that would be a 0.6% overhead. Would it be possible to pick the three or four implementations most likely to be fastest and benchmark just those?
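
The 0.6% figure is 20 seconds out of every 3600. A small sketch of that arithmetic, assuming a single 20-second run per hour; the function name and parameters are illustrative only. If each candidate implementation needs its own 20-second run, the cost scales with the number of candidates, which is the reason to benchmark only the likeliest few.

```python
def benchmark_overhead(seconds_per_run, runs_per_period=1, period_seconds=3600):
    """Fraction of wall-clock time spent benchmarking instead of doing LL work."""
    return seconds_per_run * runs_per_period / period_seconds

print(f"{benchmark_overhead(20):.2%}")     # one 20 s run per hour -> 0.56%
print(f"{benchmark_overhead(20, 4):.2%}")  # four candidates, 20 s each -> 2.22%
```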
Old 2016-04-27, 00:45   #5
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

1D3D₁₆ Posts

I was thinking we'd run a handful and be done, or maybe a dozen if we're not getting consistent results, or something. This is not something to be done on a continuing basis.
Old 2016-04-27, 02:56   #6
axn
 
Jun 2003

2⁵×5×31 Posts

Quote:
Originally Posted by Prime95
I was thinking we'd run a handful and be done, or maybe a dozen if we're not getting consistent results, or something. This is not something to be done on a continuing basis.
You can use the live LL test itself to benchmark. No work would be wasted in that case. When you encounter an FFT for the first time, the LL test would run in "tune" mode. When sufficient data has been obtained (say, one day's worth), you can write it to a file.

However, that requires a fixed worker/thread configuration. If that also needs to be tuned at the same time, this approach is not feasible.
Old 2016-04-27, 02:58   #7
potonono
 
Jun 2005
USA, IL

193₁₀ Posts

Can you temporarily increase the program priority?
Old 2016-04-27, 11:09   #8
xtreme2k
 
Aug 2002

2·3·29 Posts

My suggestions:
  • Benchmark can be manually requested
  • Benchmark can run automatically prior to the start of an LL test
  • Benchmark can run automatically when new CPU instructions, a different cache size, or a new P95 version is detected
  • Benchmark can run based solely on the user's settings (e.g. I use 1 worker with 12 threads for Prime95, so please don't test 28 workers, different worker/thread combinations, or FFTs other than the ones being worked on, as those are pointless)
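
One way to implement the "re-run when something changes" trigger, and to decide when accumulated data should be discarded, is to store a fingerprint of the configuration next to the results. This is a sketch under assumptions: the field names in `info`, and the helpers `config_fingerprint` and `load_valid_results`, are hypothetical and not part of prime95. It also cannot notice changes that the program has no way to query, such as a change in memory bandwidth.

```python
import hashlib
import json

def config_fingerprint(info):
    """Hash the fields that should invalidate stored benchmark data when they change."""
    canonical = json.dumps(info, sort_keys=True)  # stable regardless of key order
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def load_valid_results(stored, current_info):
    """Return stored results only if they were recorded under the same configuration."""
    if stored.get("fingerprint") != config_fingerprint(current_info):
        return {}  # hardware, version, or settings changed: benchmark again
    return stored.get("results", {})

# Hypothetical values, for illustration only.
info = {"cpu": "example CPU string", "features": ["avx2", "fma"],
        "l2_kb": 256, "l3_kb": 8192, "version": "28.10",
        "workers": 1, "threads": 12}
stored = {"fingerprint": config_fingerprint(info), "results": {"fft_a": 413.0}}

print(load_valid_results(stored, info))                         # results still valid
print(load_valid_results(stored, {**info, "version": "28.11"})) # discarded
```

Including the worker and thread counts in the fingerprint ties the stored results to the user's own settings, so only the configuration actually in use gets benchmarked.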

Last fiddled with by xtreme2k on 2016-04-27 at 11:10
Old 2016-04-27, 13:03   #9
bgbeuning
 
Dec 2014

3·5·17 Posts

If we have 4 FFT algorithms to choose from, and we have been assigned an exponent of 70M, can prime95 do this?
1. Run FFT 1 for iterations 1 to 1000 (for the 70M exponent)
2. Run FFT 2 for iterations 1001 to 2000
3. Run FFT 3 for iterations 2001 to 3000
4. (you get the idea)
5. Run FFT 1 again for 4001 to 5000
...
Run FFT (best) for iterations 20001 to 70M

The rotation continues until prime95 has enough samples to pick among the 4 choices.
If some runs for an FFT disagree with its other runs, then some other process may
have been active on the system during a run, so we might need more samples.
The advantage is that all CPU time goes toward finishing real LL work.
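
The rotation described above could be sketched as follows. This assumes the LL test can switch FFT implementations between blocks of iterations, which is exactly the open question in this post. `pick_fft` and the `run_block` callback are hypothetical names; `run_block(name)` stands in for "advance the real LL test by one block (say 1000 iterations) using the named implementation".

```python
import time
from statistics import median

def pick_fft(implementations, run_block, rounds=5, clock=time.perf_counter):
    """Time each candidate on successive blocks of real iterations, round-robin.

    Returns (best_name, medians), where medians maps each name to the
    median time of its blocks. Lower is better.
    """
    times = {name: [] for name in implementations}
    for _ in range(rounds):
        for name in implementations:
            start = clock()
            run_block(name)  # real LL iterations, so no work is wasted
            times[name].append(clock() - start)
    # The median tolerates an occasional block disturbed by another process.
    medians = {name: median(t) for name, t in times.items()}
    best = min(medians, key=medians.get)
    return best, medians
```

Interleaving the candidates, rather than running all blocks of one before the next, spreads any slow drift in system load across all of them. The `clock` parameter exists only so the function can be exercised with a fake timer.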

If wall-clock elapsed time equals the CPU time used, then no other process has run to compete with prime95.
(See the Windows API GetProcessTimes.)
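
A minimal sketch of that check, written in Python rather than against the Windows API directly. `time.process_time` reports CPU time for the whole process (PEP 418 lists GetProcessTimes as its Windows backend), so with several threads the comparison has to be against wall time multiplied by the thread count. The helper names and the 0.95 threshold are assumptions.

```python
import time

def cpu_share(wall_seconds, cpu_seconds, threads=1):
    """Fraction of the CPU time available to `threads` threads that was actually used."""
    if wall_seconds <= 0 or threads < 1:
        raise ValueError("wall_seconds must be positive and threads at least 1")
    return cpu_seconds / (wall_seconds * threads)

def measure(work, threads=1):
    """Run work() and return (wall_seconds, cpu_share) for the whole process."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    work()
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    return wall, cpu_share(wall, cpu, threads)

def undisturbed(share, threshold=0.95):
    """Hypothetical rule: accept a run only if it got nearly all of its CPU time."""
    return share >= threshold

wall, share = measure(lambda: sum(i * i for i in range(200_000)))
print(f"wall={wall:.4f}s share={share:.2f} undisturbed={undisturbed(share)}")
```

A share well below 1.0 suggests the run was preempted or was waiting on something, so its timing could be dropped instead of recorded.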

If the process's page-fault rate is low, then no other process is competing for memory.
(See the Windows API GetProcessMemoryInfo. Windows will take memory from
a process even when the system has lots of free memory. When the page is
needed again, it causes a "soft" page fault: the page is still in memory but
marked available. A "hard" page fault requires a disk access.
GetProcessMemoryInfo counts soft and hard faults together, so it will never show
0 page faults. The PDH library lets a process access perfmon data, which reports soft
and hard faults separately.)
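
Python's standard library does not wrap GetProcessMemoryInfo, so the sketch below shows the same idea with the Unix analog instead: `resource.getrusage` reports minor (soft) and major (hard) page faults as separate counters. It is Unix-only, and the helper names are hypothetical. A benchmark run could record the fault counts before and after, then distrust the timing if the hard-fault delta is not zero.

```python
import resource

def page_faults():
    """Return (soft, hard) page-fault counts for this process so far (Unix only)."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt

def fault_delta(work):
    """Run work() and return how many (soft, hard) faults occurred meanwhile."""
    soft0, hard0 = page_faults()
    work()
    soft1, hard1 = page_faults()
    return soft1 - soft0, hard1 - hard0

# Touching freshly allocated memory typically causes soft faults only.
soft, hard = fault_delta(lambda: bytearray(10_000_000))
print(f"soft={soft} hard={hard}")
```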
Old 2016-04-27, 19:53   #10
TObject
 
Feb 2012

3⁴·5 Posts

Can we figure out what makes certain algorithms run faster on certain architectures, so that no benchmarking is necessary?

What are we up against in this case? Intel’s secrecy?
Old 2016-04-27, 21:08   #11
Mark Rose
 
"/X\(‘-‘)/X\"
Jan 2013

101101110011₂ Posts

Quote:
Originally Posted by TObject
Can we figure out what makes certain algorithms run faster on certain architectures, so that no benchmarking is necessary?

What are we up against in this case? Intel’s secrecy?
The problem is that memory bandwidth, cache bandwidth, cache sizes, the exponent, and the number of cores running all affect which FFT is optimal. There are too many variables.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.