mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Software

Reply
 
Thread Tools
Old 2017-09-28, 15:41   #1
IBethune
 
Nov 2010

52 Posts
Default Multithreaded PFGW

I have implemented a version of PFGW which supports multi-threading, essentially along the same lines as LLR i.e. adding a gwset_num_threads() call between gwinit() and gwsetup(). The code is working correctly i.e. giving the correct answers, but does not always give speedup. If anyone wants to play, the patch is here: https://pastebin.com/xw7ykJbM (note that the patch also works with the most recent gwnum, since it removes the use of the deprecated gwmap_to_fft_info() function). Multiple threads are enabled using the -T<number> command line argument.

For example, if I run a PRP test on a large factorial prime 208003!-1 I find that going from 1 to 2 threads I get a 1.6x speedup, which is comparable to what I get using the LLR program to test 387*2^3322763+1, which is a Proth of roughly the same digit length.

Code:
../pfgw64 -q'208003!-1' -V -T2
PFGW Version 3.8.3.64BIT.20161203.Mac_Dev [GWNUM 29.2]

Generic modular reduction using generic reduction FMA3 FFT length 336K, Pass1=448, Pass2=768, clm=4, 2 threads on A 3374558-bit number

./llr64.3.8.20 -q387*2^3322763+1 -d -t2
Starting Proth prime test of 387*2^3322763+1
Using all-complex FMA3 FFT length 256K, Pass1=128, Pass2=2K, 2 threads, a = 5
However, if I test the same Proth using PFGW (either a PRP or a primality test), the performance goes down when I switch from 1 to two threads. The main difference I see is that this test uses the "Special modular reduction":

Code:
../pfgw64 -q'387*2^3322763+1' -V -T2
PFGW Version 3.8.3.64BIT.20161203.Mac_Dev [GWNUM 29.2]

Special modular reduction using all-complex FMA3 FFT length 240K, Pass1=1280, Pass2=192, clm=2, 2 threads on 387*2^3322763+1

../pfgw64 -q'387*2^3322763+1' -V -t -T2
PFGW Version 3.8.3.64BIT.20161203.Mac_Dev [GWNUM 29.2]

Primality testing 387*2^3322763+1 [N-1, Brillhart-Lehmer-Selfridge]                                    
Running N-1 test using base 5                                                  
Special modular reduction using all-complex FMA3 FFT length 240K, Pass1=1280, Pass2=192, clm=2, 2 threads on 387*2^3322763+1
However, even if I hack the code to use the generic reduction for this candidate, I still observe the slowdown:

Code:
../pfgw64 -q'387*2^3322763+1' -V -T2
PFGW Version 3.8.3.64BIT.20161203.Mac_Dev [GWNUM 29.2]

Generic modular reduction using all-complex FMA3 FFT length 240K, Pass1=1280, Pass2=192, clm=2, 2 threads on 387*2^3322763+1
So question to George, Mark and other knowledgable people... is there anything which PFGW is doing differently to LLR in terms of how it uses gwnum that results in an configuration which is not efficient on multiple threads?

- Iain
IBethune is offline   Reply With Quote
Old 2017-09-28, 19:15   #2
Batalov
 
Batalov's Avatar
 
"Serge"
Mar 2008
Phi(3,3^1118781+1)/3

2×7×647 Posts
Default

Quote:
Originally Posted by IBethune View Post
... is there anything which PFGW is doing differently to LLR in terms of how it uses gwnum that results in an configuration which is not efficient on multiple threads?

- Iain
As soon as you are in the 'Generic modular reduction using generic reduction' territory, by definition you have PFGW doing things differently to LLR. Neither LLR nor P95 have an equivalent use mode.
Batalov is offline   Reply With Quote
Old 2017-09-28, 20:16   #3
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

23·52·29 Posts
Default

The term "Special" or "Generic" txt is output depending upon the call used to gwnum to set up the modular reduction. gwsetup() vs gwsetup_general_mod()/gwsetup_general_mod_64(). The text following "using" is the output produced by a call to gwmodulo_as_string().
rogue is offline   Reply With Quote
Old 2017-09-29, 08:27   #4
IBethune
 
Nov 2010

52 Posts
Default

That's why I'm confused - it doesn't seem to be as simple as generic/special reduction controlling whether the multithreaded FFT performs well or not e.g.
  • million digit Factorial prime with PFGW -> generic reduction -> good thread scaling
  • million digit Proth prime with LLR -> special reduction -> good thread scaling
  • million digit Proth prime with PFGW -> special reduction -> poor thread scaling
  • million digit Proth prime with PFGW, hacked to use generic reduction -> poor thread scaling.

So there is some other effect at work here, which I don't quite understand at this point. I wonder if George has any insight?
IBethune is offline   Reply With Quote
Old 2017-09-29, 11:28   #5
axn
 
axn's Avatar
 
Jun 2003

23×3×193 Posts
Default

Iain, can you post the output (showing the FFT) from both single thread and 2-thread runs for the Proth number? Also the iteration timings and the CPU usage as well for both of these?

What kind of CPU is it? If it is HT, are both the threads of the 2-thread run somehow running on the same physical core?

Last fiddled with by axn on 2017-09-29 at 11:32 Reason: Clarified "output"
axn is offline   Reply With Quote
Old 2018-03-16, 17:52   #6
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

23·52·29 Posts
Default

I applied the change and didn't see any improvement either. Did you look at how llr implemented it?
rogue is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
PFGW GUI vs CMD houding Software 1 2016-06-20 12:11
Which Work Types are Multithreaded tului Software 6 2015-11-28 21:59
PFGW 3.3.6 or PFGW 3.4.2 Please update now! Joe O Sierpinski/Riesel Base 5 5 2010-09-30 14:07
Feature request: multithreaded polsel Andi47 Msieve 1 2010-02-20 01:16
Multithreaded QS/NFS sieve stage nuggetprime Msieve 5 2008-08-11 07:48

All times are UTC. The time now is 01:34.

Thu Jul 9 01:34:14 UTC 2020 up 105 days, 23:07, 0 users, load averages: 1.48, 1.75, 1.57

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.