mersenneforum.org


Prime95 2016-04-26 20:32

help designing prime95 28.10 feature
 
I want to add a feature to prime95 to select the best FFT implementation for the user's computer. The need for this became all-too-apparent in 28.9 when some reported the new version was slower than 28.6 and others found it faster.

My plan is to do a throughput benchmark and write the results to local.txt. When starting an LL test, prime95 looks at the benchmark data and selects the fastest FFT.

The open issues where your ideas may be helpful:

1) The benchmark really wants to run when nothing else is going on. Even so, the OS may fire up some process that skews a run. So, I need a good algorithm that tracks multiple runs and throws out outlier data.

2) My first thought was to have a menu choice to run the throughput benchmark. I suspect it will not get run on many machines. Alternatively, I could run the benchmark when prime95 is launched, or at a late-night hour, or both, until we have enough runs that we are confident in our throughput data. Ideas? How many runs before we are confident in the results?

3) Any ideas on how best to detect a significant change (CPU, memory, whatever?) and discard the accumulated data?
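
One possible approach to item 1, sketched in Python rather than in prime95's own code: keep the throughput from every run, discard runs that sit more than a few median absolute deviations (MAD) from the median, and average what is left. The function name `robust_throughput` and the cutoff of 3 are assumptions chosen for illustration, not anything prime95 is known to do.

```python
from statistics import mean, median

def robust_throughput(samples, cutoff=3.0):
    """Return (estimate, kept_samples) after dropping outliers via median/MAD.

    samples: throughput readings (e.g. iterations/sec) from repeated runs.
    cutoff:  hypothetical threshold, in scaled-MAD units.
    """
    if not samples:
        raise ValueError("need at least one sample")
    med = median(samples)
    mad = median(abs(x - med) for x in samples)
    if mad == 0:
        # At least half the runs agree exactly; keep only those.
        kept = [x for x in samples if x == med]
    else:
        # 1.4826 scales the MAD to match a standard deviation for normal data.
        limit = cutoff * 1.4826 * mad
        kept = [x for x in samples if abs(x - med) <= limit]
    return mean(kept), kept

# Four consistent runs and one disturbed by another process.
estimate, kept = robust_throughput([100.0, 101.0, 99.5, 100.5, 62.0])
print(estimate, kept)
```

The median and MAD are used here instead of the mean and standard deviation because a single badly skewed run shifts them far less.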

Mark Rose 2016-04-26 21:31

The optimal FFT size may also change depending on the load on the machine throughout the day.

How quickly can you run a benchmark and get an accurate result? Would it be possible to run a micro-benchmark every hour or so many iterations and adjust to current conditions?

I'm not sure of any easy way to detect change in memory bandwidth, for instance, and so micro-benchmarks may be the way to go.

Prime95 2016-04-26 22:29

I was going to use 20 second benchmarks running on all cores to determine a machine's throughput for each possible FFT implementation.

Mark Rose 2016-04-26 23:28

[QUOTE=Prime95;432618]I was going to use 20 second benchmarks running on all cores to determine a machine's throughput for each possible FFT implementation.[/QUOTE]

Run hourly, that would be a 0.6% overhead. Would it be possible to pick the three or four most likely fastest implementations and benchmark just those?
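
The 0.6% figure corresponds to a single 20-second benchmark in each 3600-second hour (20/3600 is about 0.56%). A small Python sketch of that arithmetic follows; the function name `benchmark_overhead` is illustrative, and the second call assumes four candidate implementations each get their own 20-second run.

```python
def benchmark_overhead(seconds_per_run, runs_per_interval=1, interval_seconds=3600):
    """Fraction of wall-clock time spent benchmarking instead of doing LL work."""
    return seconds_per_run * runs_per_interval / interval_seconds

print(f"{benchmark_overhead(20):.2%}")                       # 0.56%
print(f"{benchmark_overhead(20, runs_per_interval=4):.2%}")  # 2.22%
```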

Prime95 2016-04-27 00:45

I was thinking we'd run a handful and be done, or maybe a dozen if we're not getting consistent results, or something. This is not something to be done on a continuing basis.

axn 2016-04-27 02:56

[QUOTE=Prime95;432634]I was thinking we'd run a handful and be done, or maybe a dozen if we're not getting consistent results, or something. This is not something to be done on a continuing basis.[/QUOTE]

You can use the live LL test itself to benchmark. No work would be wasted in that case. When you encounter an FFT for the first time, the LL test would be run in "tune" mode. When sufficient data is obtained (say 1 day's worth), you can write it to a file.

However, that requires a fixed worker/thread configuration. If that also needs to be tuned simultaneously, then it is not feasible to do it.

potonono 2016-04-27 02:58

Can you temporarily increase the program priority?

xtreme2k 2016-04-27 11:09

My suggestions:
- Benchmark can be manually requested
- Benchmark can run automatically prior to the start of an LL test
- Benchmark can run automatically when new CPU instructions, a new cache size, or a new P95 version is detected
- Benchmark can run solely based on the user's settings (e.g. I use 1 worker 12 threads for Prime95, so please don't test 28 workers or different w/t combinations, or FFTs other than the ones being worked on, as those are pointless)
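
One way to decide when saved benchmark data should be thrown away is to store a fingerprint of the configuration next to it and re-benchmark whenever the fingerprint changes. A minimal Python sketch follows; the field names (`cpu_brand`, `l2_cache_kb`, and so on) are hypothetical placeholders for whatever the program already detects, and this would not catch changes it cannot see, such as memory speed.

```python
import hashlib
import json

def config_fingerprint(config):
    """Hash a dict of detected settings into a stable hex string."""
    # sort_keys makes the result independent of dict insertion order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def benchmarks_still_valid(saved_fingerprint, current_config):
    """True if the saved benchmark data was taken under the same configuration."""
    return saved_fingerprint == config_fingerprint(current_config)

# Hypothetical values, for illustration only.
old = {"cpu_brand": "example-cpu", "l2_cache_kb": 256, "workers": 1, "threads": 12, "version": "28.9"}
saved = config_fingerprint(old)
new = dict(old, version="28.10")
print(benchmarks_still_valid(saved, old), benchmarks_still_valid(saved, new))
```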

bgbeuning 2016-04-27 13:03

If we have 4 FFT algorithms to choose from, and we have been assigned an exponent of 70M, can prime95 do this?
1. Run FFT 1 for iterations 1 to 1000 (for the 70M exponent)
2. Run FFT 2 for iterations 1001 to 2000
3. Run FFT 3 for iterations 2001 to 3000
4. (you get the idea)
5. Run FFT 1 again for 4001 to 5000
...
Run FFT (best) for iterations 20001 to 70M

This continues until prime95 has enough samples to pick between the 4 choices. If some runs for an FFT disagree with other runs, then some other process may have been active on the system during a run, so we might need more samples. The advantage is that all CPU time is used to finish real LL work.
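
The interleaving scheme above could be sketched roughly as follows. This is a Python illustration under stated assumptions, not prime95 code: `run_block` is a hypothetical callback that advances the real LL test by a fixed block of iterations using the named FFT implementation and returns the elapsed seconds, and summarizing each implementation by its median block time is a choice made for this example.

```python
from statistics import median

def pick_fastest(impls, run_block, rounds=5):
    """Rotate through impls for `rounds` passes and return (best, timings).

    best is the implementation with the lowest median block time.
    """
    timings = {name: [] for name in impls}
    for _ in range(rounds):
        for name in impls:
            timings[name].append(run_block(name))
    best = min(impls, key=lambda name: median(timings[name]))
    return best, timings

# Simulated block times in seconds; fft2 has one run disturbed by another process.
fake = {"fft1": [2.0, 2.1, 2.0], "fft2": [1.8, 5.0, 1.8], "fft3": [2.4, 2.4, 2.5]}
cursor = {name: iter(times) for name, times in fake.items()}
best, timings = pick_fastest(list(fake), lambda name: next(cursor[name]), rounds=3)
print(best)  # fft2
```

Because the implementations are rotated rather than run back to back, a transient disturbance tends to hit one block of one implementation instead of biasing the whole comparison.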

If wall clock elapsed time = CPU usage, then no other process has run to compete with prime95.
(See Windows API GetProcessTimes.)
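
As a portable illustration of the wall-clock versus CPU-time check (an analogue of the idea, not a call to GetProcessTimes itself), Python's `time.perf_counter()` and `time.process_time()` can bracket a block of work. The helper names and the 0.95 threshold below are assumptions made for this sketch.

```python
import time

def cpu_share(wall_seconds, cpu_seconds, threads=1):
    """Fraction of the available CPU time that the process actually received."""
    if wall_seconds <= 0 or threads < 1:
        raise ValueError("wall_seconds must be positive and threads >= 1")
    return cpu_seconds / (wall_seconds * threads)

def timed(work, threads=1):
    """Run work() and return (wall_seconds, cpu_share) for this process."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    work()
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0  # user + system time, all threads
    return wall, cpu_share(wall, cpu, threads)

def looks_undisturbed(share, threshold=0.95):
    """Hypothetical rule: accept a sample only if we got nearly all the CPU."""
    return share >= threshold

wall, share = timed(lambda: sum(i * i for i in range(200_000)))
print(f"wall={wall:.4f}s share={share:.2f} keep={looks_undisturbed(share)}")
```

A share well below 1.0 suggests the process was descheduled during the sample, so that sample could be discarded and repeated.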

If process page rate is low, then no other process is competing for memory.
(See Windows API GetProcessMemoryInfo. Windows will take memory from
a process even when the system has lots of free memory. When the page is
needed again, it causes a "soft" page fault. A hard page fault requires a disk access.
A soft page fault is when the page is already in memory but marked available.
GetProcessMemoryInfo includes both soft and hard faults. So it will never show
0 page faults. The PDH library lets a process access perfmon data which has soft
and hard faults separated.)

TObject 2016-04-27 19:53

Can we figure out what makes certain algorithms run faster on certain architectures, so that no benchmarking is necessary?

What are we up against in this case? Intel’s secrecy?

Mark Rose 2016-04-27 21:08

[QUOTE=TObject;432673]Can we figure out what makes certain algorithms run faster on certain architectures, so that no benchmarking is necessary?

What are we up against in this case? Intel’s secrecy?[/QUOTE]

The problem is that memory bandwidth, cache bandwidth, cache size, the exponent, and the number of cores running can all affect which FFT is optimal. There are too many variables.


All times are UTC.