mersenneforum.org  

Go Back   mersenneforum.org > Factoring Projects > Msieve

Reply
 
Thread Tools
Old 2022-08-14, 20:06   #1
frmky
 
frmky's Avatar
 
Jul 2003
So Cal

2·11·113 Posts
Default Msieve and OpenMP

I've started looking at transitioning Msieve from pthreads to OpenMP to make it easier for others to work with the source. I don't expect any performance regressions (or gains in already threaded sections) but since OpenMP and pthreads happily coexist I can keep pthreads in place anywhere it is faster. This also means I can work on one section at a time.

I started with the square root code. The FaaS model of running multiple dependencies at a time works ok, but it significantly increases the memory use and saves no time when the first dependency finds a factor. Instead I use multiple threads to work on a single dependency.

After reading the relations for a dependency, the sqrt has two time-intensive steps, multiplying the relations and calculating the algebraic square root using Newton's method. Multiplying the relations has lots of available parallelism in the beginning but less for the last few, largest products. However, I still see reasonable performance. As an example, for 11,311-, on a 64-core AMD EPYC node this step requires 132 minutes with a single thread but about 7.75 minutes with 64 threads, about 17x faster.

The Newton iteration is more difficult. However, we can extract some parallelism in the polynomial products and mods. Again for 11,311-, this step also took 132 minutes with a single thread but about 25 minutes with 7 active threads, a little over 5x faster.

Overall, including the time to read the relations, each dependency for 11,311- requires 73 minutes. For this same number, the FaaS code required 304 minutes for 4 dependencies in parallel that fit in memory.

This code is available in the try_openmp branch at https://github.com/gchilders/msieve_nfsathome.git. You can play with it if you wish, but it is a work in progress so don't be surprised by occasional brokenness.
frmky is offline   Reply With Quote
Old 2022-08-14, 21:37   #2
pinhodecarlos
 
pinhodecarlos's Avatar
 
"Carlos Pinho"
Oct 2011
Milton Keynes, UK

3×1,693 Posts
Default

Thank you Greg for the above, for this step of improvement. Quick question since I lost sensitivity on how intensity the hard disk works, how is it during the mentioned "multiple threads to work on a single dependency"? In the past I've never managed to start in parallel a filtering processing of the next job before finishing the SR of the previous job, even having lost of memory. Think my bottleneck was my HD speed but I might be wrong. Thank you

Last fiddled with by pinhodecarlos on 2022-08-14 at 21:40
pinhodecarlos is offline   Reply With Quote
Old 2022-08-14, 21:41   #3
frmky
 
frmky's Avatar
 
Jul 2003
So Cal

2×11×113 Posts
Default

The drive is only accessed when the cycles and relations are read at the beginning. That still uses a single thread.
frmky is offline   Reply With Quote
Old 2022-08-14, 21:43   #4
pinhodecarlos
 
pinhodecarlos's Avatar
 
"Carlos Pinho"
Oct 2011
Milton Keynes, UK

507910 Posts
Default

Ah ok, thank you.
pinhodecarlos is offline   Reply With Quote
Old 2022-08-21, 09:14   #5
frmky
 
frmky's Avatar
 
Jul 2003
So Cal

1001101101102 Posts
Default

I've now rewritten the linear algebra using OpenMP rather than pthreads. Using gcc 11, it runs at the same speed as the pthreads version both with and without MPI on both AMD EPYC and Fujitsu A64FX nodes. I haven't tested earlier versions of gcc, but I don't expect it work at all on gcc 6 or earlier (I think) due to lack of support of reduction of array sections.

Next I'll look into parallelizing parts of the filtering. I'm not hopeful there's much to gain there, but I have a couple of things to try.
frmky is offline   Reply With Quote
Old 2022-08-23, 13:05   #6
jasonp
Tribal Bullet
 
jasonp's Avatar
 
Oct 2004

1101110111012 Posts
Default

Multiple threads can probably benefit the duplicate and singleton removal but those would be a major refactoring. CADO-NFS can run the merge phase in parallel but that would be a super-major refactoring :)
jasonp is offline   Reply With Quote
Old 2022-08-24, 00:27   #7
frmky
 
frmky's Avatar
 
Jul 2003
So Cal

248610 Posts
Default

I'm not up for a major rewrite, but I've add some parallelism to the in-memory singleton and clique removal phases so far. Still staring at the code as time permits.
frmky is offline   Reply With Quote
Old 2022-08-26, 05:19   #8
frmky
 
frmky's Avatar
 
Jul 2003
So Cal

2·11·113 Posts
Default

I think I'm done for now. I've added OpenMP constructs where it seems to make sense without making large changes. Of course some filtering steps parallelize better than others. For example, for 11,311- on a 64-core node the initial reading of the relations is 14.5x faster and the singleton removal initial pass is 6.4x faster while duplicate removal pass 2 and full merge steps remain single threaded. Overall the filtering for 11,311- is reduced from 21 hours to 7 hours, 3x faster. You can directly compare the runs:
Original code: https://pastebin.com/raw/qAqkPAg6
OpenMP code: https://pastebin.com/raw/JsWVNqfR

As another test, I'm running 13_2_938m1 now. Filtering starting with 652M unique relations took 1h40m to produce 34.9M cycles for a matrix. LA is running now.

Better performance is possible with more extensive changes. Suggestions and improvements are welcome!

Last fiddled with by frmky on 2022-08-26 at 05:33
frmky is offline   Reply With Quote
Old 2022-08-26, 16:52   #9
pinhodecarlos
 
pinhodecarlos's Avatar
 
"Carlos Pinho"
Oct 2011
Milton Keynes, UK

3·1,693 Posts
Default

Wow...well done...congrats!
pinhodecarlos is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
msieve with GMP 6.2.0 ryanp Msieve 2 2020-06-16 20:36
GMP-ECM with --enable-openmp flag set in configure = bad results? GP2 GMP-ECM 3 2016-10-16 10:21
Msieve on a Mac (Help) pxp Msieve 1 2013-02-28 14:56
OpenMP R. Gerbicz Programming 13 2011-09-22 01:11
msieve help em99010pepe Msieve 23 2009-09-27 16:13

All times are UTC. The time now is 02:56.


Fri Oct 7 02:56:56 UTC 2022 up 50 days, 25 mins, 0 users, load averages: 1.90, 1.85, 1.53

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2022, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.

≠ ± ∓ ÷ × · − √ ‰ ⊗ ⊕ ⊖ ⊘ ⊙ ≤ ≥ ≦ ≧ ≨ ≩ ≺ ≻ ≼ ≽ ⊏ ⊐ ⊑ ⊒ ² ³ °
∠ ∟ ° ≅ ~ ‖ ⟂ ⫛
≡ ≜ ≈ ∝ ∞ ≪ ≫ ⌊⌋ ⌈⌉ ∘ ∏ ∐ ∑ ∧ ∨ ∩ ∪ ⨀ ⊕ ⊗ 𝖕 𝖖 𝖗 ⊲ ⊳
∅ ∖ ∁ ↦ ↣ ∩ ∪ ⊆ ⊂ ⊄ ⊊ ⊇ ⊃ ⊅ ⊋ ⊖ ∈ ∉ ∋ ∌ ℕ ℤ ℚ ℝ ℂ ℵ ℶ ℷ ℸ 𝓟
¬ ∨ ∧ ⊕ → ← ⇒ ⇐ ⇔ ∀ ∃ ∄ ∴ ∵ ⊤ ⊥ ⊢ ⊨ ⫤ ⊣ … ⋯ ⋮ ⋰ ⋱
∫ ∬ ∭ ∮ ∯ ∰ ∇ ∆ δ ∂ ℱ ℒ ℓ
𝛢𝛼 𝛣𝛽 𝛤𝛾 𝛥𝛿 𝛦𝜀𝜖 𝛧𝜁 𝛨𝜂 𝛩𝜃𝜗 𝛪𝜄 𝛫𝜅 𝛬𝜆 𝛭𝜇 𝛮𝜈 𝛯𝜉 𝛰𝜊 𝛱𝜋 𝛲𝜌 𝛴𝜎𝜍 𝛵𝜏 𝛶𝜐 𝛷𝜙𝜑 𝛸𝜒 𝛹𝜓 𝛺𝜔