mersenneforum.org  

2022-09-28, 03:27   #1
bsquared ("Ben", Feb 2007)

ggnfs improvements

As mentioned here, the microecm/tinyecm code in yafu appears to be helpful in the ggnfs lattice sievers as a replacement for mpqs when splitting the composite cofactors that remain after trial division.
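To make the role concrete, here is a rough sketch of where the splitter sits in the relation pipeline. This is Python with a Pollard-Brent rho stand-in for uecm/tinyecm (names like split_cofactor and lpb_bits are made up for the sketch, not yafu's API):

```python
import math
import random

def tiny_factor(n):
    """Return a nontrivial factor of composite n (Brent's variant of
    Pollard rho). Stand-in for uecm/tinyecm, NOT the actual yafu code."""
    if n % 2 == 0:
        return 2
    while True:
        y, c, m = random.randrange(1, n), random.randrange(1, n), 128
        g, r, q = 1, 1, 1
        while g == 1:
            x = y
            for _ in range(r):
                y = (y * y + c) % n
            k = 0
            while k < r and g == 1:
                ys = y
                for _ in range(min(m, r - k)):
                    y = (y * y + c) % n
                    q = q * abs(x - y) % n
                g = math.gcd(q, n)
                k += m
            r *= 2
        if g == n:  # overshot the cycle; back up one step at a time
            g = 1
            while g == 1:
                ys = (ys * ys + c) % n
                g = math.gcd(abs(x - ys), n)
        if g != n:
            return g

def split_cofactor(cof, lpb_bits=30):
    """Split the composite cofactor left after trial division. The
    relation survives only if both pieces fit under the large-prime
    bound; otherwise the attempt was 'vain'."""
    p = tiny_factor(cof)
    q = cof // p
    if p.bit_length() <= lpb_bits and q.bit_length() <= lpb_bits:
        return sorted((p, q))
    return None
```

The splitter runs on every surviving sieve report, which is why its speed shows up directly in the TD (MPQS) lines of the logs below.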

I pulled the lasieve4_64 code into yafu with the changes to use ecm. I've built it on two different linux systems and with gcc-9, gcc-11, and icc. The gcc-9 build was on WSL and I needed to add -no-pie to CFLAGS, but other than that it went smoothly.

Here are some comparative timings on a few inputs. Obviously much more testing is needed, but take this as a good sign that it could be an improvement across the board.


300 bits, I11e, -f 650000 -c 10000
Code:
====== uecm version ======
Total yield: 51520
0/0 mpqs failures, 2821/28 vain mpqs
milliseconds total: Sieve 5820 Sched 0 medsched 6400
TD 4910 (Init 1030, MPQS 400) Sieve-Change 17040
34.392u 0.233s 0:35.11 98.6%    0+0k 800+6848io 1pf+0w

====== mpqs version ======
Total yield: 51520
0/0 mpqs failures, 2821/28 vain mpqs
milliseconds total: Sieve 5730 Sched 0 medsched 6430
TD 8190 (Init 1180, MPQS 3470) Sieve-Change 16720
37.289u 0.233s 0:37.95 98.8%    0+0k 800+6848io 1pf+0w

330 bits, I12e, -f 900000 -c 10000
Code:
====== uecm version ======
Total yield: 112189
0/0 mpqs failures, 77770/22906 vain mpqs
milliseconds total: Sieve 23770 Sched 0 medsched 16730
TD 16400 (Init 2270, MPQS 2370) Sieve-Change 23750
80.383u 0.891s 1:21.86 99.2%    0+0k 0+16224io 0pf+0w

====== mpqs version ======
Total yield: 112189
0/0 mpqs failures, 77770/22906 vain mpqs
milliseconds total: Sieve 23390 Sched 0 medsched 16300
TD 36800 (Init 2510, MPQS 22050) Sieve-Change 23520
99.705u 0.929s 1:41.19 99.4%    0+0k 8+16224io 0pf+0w

360 bits, I13e, -f 1600000 -c 5000
Code:
====== uecm version ======
Total yield: 28932
0/0 mpqs failures, 23373/6029 vain mpqs
milliseconds total: Sieve 51640 Sched 0 medsched 21420
TD 13870 (Init 1480, MPQS 560) Sieve-Change 19980
106.987u 1.517s 1:48.85 99.6%   0+0k 712+4528io 1pf+0w

====== mpqs version ======
Total yield: 28932
0/0 mpqs failures, 23373/6029 vain mpqs
milliseconds total: Sieve 54140 Sched 0 medsched 21630
TD 18650 (Init 1280, MPQS 5680) Sieve-Change 20280
114.655u 1.626s 1:56.85 99.5%   0+0k 0+4528io 0pf+0w

390 bits, I13e, -f 2700000 -c 5000
Code:
====== uecm version ======
Total yield: 18051
0/0 mpqs failures, 15358/4583 vain mpqs
milliseconds total: Sieve 46570 Sched 0 medsched 22940
TD 11160 (Init 760, MPQS 570) Sieve-Change 30600
112.513u 1.480s 1:54.54 99.5%   0+0k 0+2976io 0pf+0w

====== mpqs version ======
Total yield: 18051
0/0 mpqs failures, 15358/4583 vain mpqs
milliseconds total: Sieve 48100 Sched 0 medsched 21990
TD 14530 (Init 960, MPQS 3910) Sieve-Change 30460
116.308u 1.485s 1:59.34 98.6%   0+0k 0+2976io 0pf+0w

Diminishing returns here: cofactor splitting is still faster, but it's a smaller part of the whole.

420 bits, I13e, -f 6500000 -c 2000
Code:
====== uecm version ======
Total yield: 2766
0/0 mpqs failures, 2193/633 vain mpqs
milliseconds total: Sieve 21200 Sched 0 medsched 11240
TD 4060 (Init 280, MPQS 100) Sieve-Change 22970
63.047u 0.608s 1:04.10 99.2%    0+0k 0+488io 0pf+0w

====== mpqs version ======
Total yield: 2766
0/0 mpqs failures, 2193/633 vain mpqs
milliseconds total: Sieve 21910 Sched 0 medsched 11290
TD 4310 (Init 330, MPQS 560) Sieve-Change 22660
63.718u 0.637s 1:05.06 98.8%    0+0k 8+488io 0pf+0w

Maybe a parameter problem here? It is barely splitting cofactors at all.

A contrived case with three large primes on the rational side, of (I think) 2^523-1 with 14e: lpbr/a 30, mfbr 90, mfba 60, rlambda 3.6, alambda 2.6
Code:
====== uecm version ======
Total yield: 3442
0/0 mpqs failures, 68120/2701 vain mpqs
milliseconds total: Sieve 41680 Sched 0 medsched 16230
TD 110120 (Init 1670, MPQS 88770) Sieve-Change 48560
218.973u 1.240s 3:40.75 99.7%   0+0k 696+704io 1pf+0w

====== mpqs version ======
Total yield: 3590
0/0 mpqs failures, 68138/146190 vain mpqs
milliseconds total: Sieve 42420 Sched 0 medsched 15620
TD 134580 (Init 1690, MPQS 114620) Sieve-Change 50040
245.104u 1.206s 4:07.07 99.6%   0+0k 8+736io 0pf+0w

Faster, but with slightly lower yield.

That's all for now. Any more successful builds and testing would be great to hear about!

2022-09-28, 14:58   #2
bsquared

Just to note: building with CFLAGS=-DGGNFS_MPQS during the final make should re-enable the original mpqs code.
2022-09-28, 15:15   #3
charybdis (Apr 2020)

I know the CADO developers don't like to include other people's code, but it's likely that your new ECM code is also faster than CADO's internal ECM that it uses for splitting cofactors. Might be worth dropping them a line just in case some of them are interested.
2022-09-28, 21:43   #4
bsquared

Quote:
Originally Posted by charybdis
I know the CADO developers don't like to include other people's code, but it's likely that your new ECM code is also faster than CADO's internal ECM that it uses for splitting cofactors. Might be worth dropping them a line just in case some of them are interested.
I don't know, their ecm looks pretty Serious... Edwards curves, bytecode driven...

But I've communicated with Paul Z. before, I can see if he's interested.
2022-09-30, 13:55   #5
bsquared

I've been studying and working with the ggnfs lattice siever lately. The assembly code is pretty much unassailable, but there are C equivalents for most of it. I've made some progress understanding these... enough to do some vectorized rewrites using AVX512. The first routine I picked is the Sieve-Change section (lasieve_setup). Example timings for the original code, shown here for a random 512-bit semiprime:

Code:
time ./gnfs-lasieve4I14e -v -f 15300000 -c 2000 -a nfs512_15300k_I14.job
gnfs-lasieve4I14e (with asm64): L1_BITS=15, SVN $Revision: 430 $
Warning:  lowering FB_bound to 15299999.
FBsize 990256+0 (deg 5), 1892784+0 (deg 1)
total yield: 3136, q=15302017 (0.05278 sec/rel) ETA 0h00m)
108 Special q, 158 reduction iterations
reports: 19855429->2233021->1979662->1025589->638030->385088
Number of relations with k rational and l algebraic primes for (k,l)=:

Total yield: 3136
0/0 mpqs failures, 2729/1443 vain mpqs
milliseconds total: Sieve 74220 Sched 0 medsched 23260
TD 13650 (Init 420, MPQS 1030) Sieve-Change 54390
TD side 0: init/small/medium/large/search: 1730 870 610 2010 1100
sieve: init/small/medium/large/search: 1140 15470 920 20890 1290
TD side 1: init/small/medium/large/search: 1980 640 730 1850 510
sieve: init/small/medium/large/search: 830 10280 620 22300 480
171.995u 1.947s 2:54.71 99.5%   0+0k 848+624io 1pf+0w

And here it is again with the new vectorized lasieve_setup, and tinyecm:

Code:
time ./gnfs-lasieve4I14e -v -f 15300000 -c 2000 -a nfs512_15300k_I14.job
gnfs-lasieve4I14e (with asm64): L1_BITS=15, SVN $Revision: 430 $
Warning:  lowering FB_bound to 15299999.
FBsize 990256+0 (deg 5), 1892784+0 (deg 1)
total yield: 3136, q=15302017 (0.04164 sec/rel) ETA 0h00m)
108 Special q, 158 reduction iterations
reports: 19853879->2232572->1979660->1025587->638029->385088
Number of relations with k rational and l algebraic primes for (k,l)=:

Total yield: 3136
0/0 mpqs failures, 2729/1442 vain mpqs
milliseconds total: Sieve 73830 Sched 0 medsched 25360
TD 12910 (Init 380, MPQS 190) Sieve-Change 18470
TD side 0: init/small/medium/large/search: 1560 880 630 2070 1090
sieve: init/small/medium/large/search: 920 16610 710 20290 1310
TD side 1: init/small/medium/large/search: 1930 520 790 2010 720
sieve: init/small/medium/large/search: 560 11060 810 21380 180
136.897u 1.872s 2:19.38 99.5%   0+0k 840+624io 1pf+0w

About a 3x improvement for that routine, 25% overall.
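A note on what actually vectorizes here: lasieve_setup maps each factor-base root into the special-q lattice, which costs roughly one modular inverse per factor-base prime. I'm not reproducing the real code; as a toy illustration of the kind of restructuring that pays off, here is Montgomery's batch-inversion trick in Python (a hypothetical standalone sketch, inverting many residues modulo a single prime, trading n inversions for one inversion plus O(n) multiplications):

```python
# Toy sketch: Montgomery's batch-inversion trick. Illustrative only,
# not the actual lasieve_setup code.

def batch_modinv(vals, p):
    """Invert every element of vals modulo prime p using a single
    modular inversion plus O(n) multiplications."""
    n = len(vals)
    prefix = [1] * (n + 1)
    for i, a in enumerate(vals):
        prefix[i + 1] = prefix[i] * a % p   # running products
    inv_all = pow(prefix[n], -1, p)         # the only real inversion
    out = [0] * n
    for i in range(n - 1, -1, -1):
        out[i] = prefix[i] * inv_all % p    # peel off one inverse
        inv_all = inv_all * vals[i] % p
    return out
```

The siever's problem is shaped differently (one inverse per distinct prime), but the same amortize-the-expensive-operation idea is what a SIMD rewrite chases.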

I'll be maintaining and working with the ggnfs lattice siever in the yafu github. These new changes are available there. Build process is the same, except add CFLAGS=-DAVX512_LASIEVE_SETUP when making the gnfs-lasieve4I executables (after building the liblasieve libraries in the asm/ directory, the same as before).

Other sections of the code are looking approachable as well, I hope there is more to come.

2022-10-02, 17:49   #6
wombatman ("I moo ablest echo power!", May 2013)

I think I'm compiling this incorrectly, but how do I compile without AVX-512 instructions? I'm getting a number of errors like this:

Code:
recurrence6.c:327:17: error: incompatible types when assigning to type ‘__m512i’ {aka ‘__vector(8) long long int’} from type ‘int’
  327 |             k = _mm512_mask_div_epu32(k, m, b, c);
2022-10-02, 19:31   #7
bsquared

Quote:
Originally Posted by wombatman
I think I'm compiling this incorrectly, but how do I compile without AVX-512 instructions? I'm getting a number of similar errors like this:

Code:
recurrence6.c:327:17: error: incompatible types when assigning to type ‘__m512i’ {aka ‘__vector(8) long long int’} from type ‘int’
  327 |             k = _mm512_mask_div_epu32(k, m, b, c);
I'll have to check, maybe I didn't protect all of the avx512 code. You need avx512 to get any speedup.

2022-10-03, 01:10   #8
wombatman

Quote:
Originally Posted by bsquared
I'll have to check, maybe I didn't protect all of the avx512 code. You need avx512 to get any speedup.
Ah, ok. I thought the microecm also gave at least a small boost to GGNFS even without avx512.
2022-10-03, 14:52   #9
bsquared

Quote:
Originally Posted by wombatman
Ah, ok. I thought the microecm also gave at least a small boost to GGNFS even without avx512.
You are right! The microecm improvement doesn't require AVX512. I just checked in a version that should compile without -DAVX512_LASIEVE_SETUP in CFLAGS.

Probably -DAVX512_LASIEVE_SETUP will only work with the Intel compiler right now, because I use the _mm512_div_epu32 and _mm512_rem_epu32 intrinsics, which I think only Intel has definitions of (they are SVML library functions rather than hardware instructions). Building statically fails because the static cilkrts library doesn't exist... I'm looking into including a copy of the cilkrts library with OpenCilk.

2022-10-04, 02:33   #10
wombatman

Quote:
Originally Posted by bsquared
You are right! The microecm improvement doesn't require AVX512. I just checked in a version that should compile without -DAVX512_LASIEVE_SETUP in CFLAGS.

Probably -DAVX512_LASIEVE_SETUP will only work with the Intel compiler right now, because I use the _mm512_div_epu32 and _mm512_rem_epu32 intrinsics which I think only Intel has definitions of. Trying to make statically fails because the static cilkrts library doesn't exist... looking into including a copy of the cilkrts library with OpenCilk.
Thanks! They build, so that's a good sign! I'm still fighting with MSieve to get it to work properly (it compiles but doesn't actually do Stage 1 GPU work), but I will test the new GGNFS as soon as possible.
2022-10-05, 16:57   #11
bsquared

New improvement

The large prime bucket sieve (lasched) and trial division (MMX-TD) benefitted from AVX512 as well. The changes were actually very similar to things I did in yafu's SIQS.
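For anyone who hasn't met the large-prime bucket sieve: a prime bigger than one cache-sized region hits each region at most once, so instead of striding each prime across the whole array (a cache miss per hit), the hits are first dropped into per-region buckets and then applied one region at a time. A hypothetical scalar sketch in Python (not the actual lasched code; the AVX512 work speeds up loops like the fill phase below):

```python
# Toy bucket sieve: accumulate log(p) contributions of large primes over
# an interval, region by region. Hypothetical sketch, not ggnfs's lasched.

def bucket_sieve(interval_len, region_bits, primes_roots_logs):
    """primes_roots_logs: list of (p, r, logp) with r = first hit offset.
    Returns an array of summed log contributions over the interval."""
    region_len = 1 << region_bits
    n_regions = (interval_len + region_len - 1) >> region_bits
    buckets = [[] for _ in range(n_regions)]
    # Fill phase: walk each prime's arithmetic progression, dropping each
    # hit into the bucket of the region it lands in.
    for p, r, logp in primes_roots_logs:
        loc = r
        while loc < interval_len:
            buckets[loc >> region_bits].append((loc & (region_len - 1), logp))
            loc += p
    # Apply phase: process one cache-sized region at a time.
    sieve = [0] * interval_len
    for reg, hits in enumerate(buckets):
        base = reg << region_bits
        for off, logp in hits:
            sieve[base + off] += logp
    return sieve
```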

Here is a summary of a few tests.

Test cases:
time ./gnfs-lasieve4I11e -v -f 650000 -c 10000 -a nfs300_650k_I11.job
time ./gnfs-lasieve4I12e -v -f 900000 -c 5000 -a nfs330_900k_I12.job
time ./gnfs-lasieve4I13e -v -f 1600000 -c 5000 -a nfs360_1600k_I13.job
time ./gnfs-lasieve4I13e -v -f 2700000 -c 2000 -a nfs390_2700k_I13.job
time ./gnfs-lasieve4I13e -v -f 6500000 -c 2000 -a nfs420_6500k_I13.job
time ./gnfs-lasieve4I14e -v -f 15300000 -c 2000 -a nfs512_15300k_I14.job
time ./gnfs-lasieve4I16e -v -f 50000000 -c 500 -a nfs631_50000k_I16.job

The first 6 .jobs were produced by yafu with default parameters for the indicated semiprime sizes.
The last one I pulled from the msieve poly request thread on a GNFS C190 and then used yafu to fill with default (probably bad) parameters.
Code:
Original GGNFS:
size      time (s)
300       41
330       53
360       121
390       44
420       59
512       170
631       613

With AVX512 versions of lasched, lasieve_setup, and MMX-TD
size      time (s)
300       24 
330       32
360       92
390       32
420       41
512       121
631       473

lasched is about 20% faster and trial division is over 2x faster for larger inputs. Overall it seems about 30% faster on average.

The build process is slightly different now. To make with all of the new code do:

Code:
cd asm
make liblasieve.a AVX512_TD=1
make liblasieveI11.a AVX512_TD=1
make liblasieveI12.a AVX512_TD=1
make liblasieveI13.a AVX512_TD=1
make liblasieveI14.a AVX512_TD=1
make liblasieveI15.a AVX512_TD=1
make liblasieveI16.a AVX512_TD=1
cp *.a ..
cd ..
make all AVX512_LASCHED=1 AVX512_LASETUP=1 AVX512_TD=1 LASTATS=1

I think the new stuff might only build with icc. That's all I've tested with for now, anyway. If you have it, add CC=icc to the make lines. As before, there is a small speedup with tinyecm if all you have is gcc and a non-avx512 system.

To verify it all still works, I factored 2^523-1 using the new sievers.
