mersenneforum.org  

2017-04-28, 19:13   #23
pinhodecarlos
"Carlos Pinho"
Oct 2011
Milton Keynes, UK

3·17·97 Posts

Quote:
Originally Posted by Prime95
I worry you are leaving too much performance on the table. Are you planning on running multithreaded FFTs? The gwnum library's multithreading implementation is not great for smaller FFTs. See the benchmark below showing every implementation of the 128K FFT loses 25% or more running multithreaded. This is from a non-OC'ed Skylake box.

Code:
Prime95 64-bit version 29.2, RdtscTiming=1
FFTlen=128K, Type=3, Arch=4, Pass1=128, Pass2=1024, clm=4 (4 cpus, 1 worker):  0.15 ms.  Throughput: 6828.90 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=128, Pass2=1024, clm=4 (4 cpus, 4 workers):  0.46,  0.46,  0.46,  0.46 ms.  Throughput: 8680.52 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=128, Pass2=1024, clm=2 (4 cpus, 1 worker):  0.15 ms.  Throughput: 6774.04 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=128, Pass2=1024, clm=2 (4 cpus, 4 workers):  0.46,  0.45,  0.46,  0.46 ms.  Throughput: 8786.84 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=128, Pass2=1024, clm=1 (4 cpus, 1 worker):  0.16 ms.  Throughput: 6238.53 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=128, Pass2=1024, clm=1 (4 cpus, 4 workers):  0.46,  0.46,  0.46,  0.46 ms.  Throughput: 8616.51 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=512, Pass2=256, clm=4 (4 cpus, 1 worker):  0.18 ms.  Throughput: 5605.04 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=512, Pass2=256, clm=4 (4 cpus, 4 workers):  0.45,  0.45,  0.45,  0.45 ms.  Throughput: 8911.29 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=512, Pass2=256, clm=2 (4 cpus, 1 worker):  0.17 ms.  Throughput: 6036.93 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=512, Pass2=256, clm=2 (4 cpus, 4 workers):  0.43,  0.43,  0.43,  0.43 ms.  Throughput: 9346.56 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=512, Pass2=256, clm=1 (4 cpus, 1 worker):  0.17 ms.  Throughput: 5885.71 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=512, Pass2=256, clm=1 (4 cpus, 4 workers):  0.44,  0.44,  0.44,  0.44 ms.  Throughput: 9054.45 iter/sec.
I was going to do that, but last night (before your post), after testing that FFT size on my laptop with 3 threads, I was totally disappointed. The gain drops from 2.67x at n=5.2M to 2.1x at n=2M, Riesel base 2.
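
For anyone who wants to redo the arithmetic, here is a rough Python sketch (not part of Prime95 or LLR) that parses the throughput lines quoted above and compares one multithreaded worker against four single-threaded workers for each FFT implementation. The names `parse_throughput`, `compare` and `parallel_efficiency` are made up for illustration. The efficiency figure at the end assumes both of my gains (2.67x and 2.1x) were measured on 3 threads.

```python
import re

# Throughput lines copied from the quoted Prime95 29.2 benchmark.
LOG = """\
FFTlen=128K, Type=3, Arch=4, Pass1=128, Pass2=1024, clm=4 (4 cpus, 1 worker):  0.15 ms.  Throughput: 6828.90 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=128, Pass2=1024, clm=4 (4 cpus, 4 workers):  0.46,  0.46,  0.46,  0.46 ms.  Throughput: 8680.52 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=128, Pass2=1024, clm=2 (4 cpus, 1 worker):  0.15 ms.  Throughput: 6774.04 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=128, Pass2=1024, clm=2 (4 cpus, 4 workers):  0.46,  0.45,  0.46,  0.46 ms.  Throughput: 8786.84 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=128, Pass2=1024, clm=1 (4 cpus, 1 worker):  0.16 ms.  Throughput: 6238.53 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=128, Pass2=1024, clm=1 (4 cpus, 4 workers):  0.46,  0.46,  0.46,  0.46 ms.  Throughput: 8616.51 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=512, Pass2=256, clm=4 (4 cpus, 1 worker):  0.18 ms.  Throughput: 5605.04 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=512, Pass2=256, clm=4 (4 cpus, 4 workers):  0.45,  0.45,  0.45,  0.45 ms.  Throughput: 8911.29 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=512, Pass2=256, clm=2 (4 cpus, 1 worker):  0.17 ms.  Throughput: 6036.93 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=512, Pass2=256, clm=2 (4 cpus, 4 workers):  0.43,  0.43,  0.43,  0.43 ms.  Throughput: 9346.56 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=512, Pass2=256, clm=1 (4 cpus, 1 worker):  0.17 ms.  Throughput: 5885.71 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=512, Pass2=256, clm=1 (4 cpus, 4 workers):  0.44,  0.44,  0.44,  0.44 ms.  Throughput: 9054.45 iter/sec.
"""

LINE = re.compile(
    r"Pass1=(\d+), Pass2=(\d+), clm=(\d+) \(\d+ cpus, (\d+) workers?\):"
    r".*?Throughput: ([\d.]+) iter/sec"
)


def parse_throughput(log):
    """Map (pass1, pass2, clm) -> {worker count: throughput in iter/sec}."""
    results = {}
    for m in LINE.finditer(log):
        pass1, pass2, clm, workers = (int(g) for g in m.groups()[:4])
        results.setdefault((pass1, pass2, clm), {})[workers] = float(m.group(5))
    return results


def compare(results):
    """For each implementation, return two views of the same gap:
    loss = fraction of the 4-worker throughput given up by running
           1 multithreaded worker;
    gain = extra throughput of 4 workers relative to 1 multithreaded worker.
    """
    rows = {}
    for key, by_workers in results.items():
        multi, single = by_workers[1], by_workers[4]
        rows[key] = {"loss": 1 - multi / single, "gain": single / multi - 1}
    return rows


def parallel_efficiency(speedup, threads):
    """Speedup divided by thread count (1.0 would be perfect scaling)."""
    return speedup / threads


if __name__ == "__main__":
    for key, row in compare(parse_throughput(LOG)).items():
        print(key, f"loss={row['loss']:.1%}", f"gain={row['gain']:.1%}")
    # Assumption: both gains below were measured on 3 threads.
    for n, speedup in (("5.2M", 2.67), ("2M", 2.1)):
        print(f"n={n}: efficiency={parallel_efficiency(speedup, 3):.0%}")
```

The size of the loss depends on the baseline you pick. Measured against the 4-worker throughput, the quoted figures give a gap of roughly 21% to 37%; measured against the single multithreaded worker, the 4-worker runs deliver at least about 27% more.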

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.