#78
"Ben"
Feb 2007
111010110100₂ Posts
Thank you very much!
Found and fixed, uploaded just now. Code:

    time ./gnfs-lasieve4I12e -k -v -n0 -a c103-2.job.T1 -o test.out
    gnfs-lasieve4I12e (with asm64,avx-512 mmx-td,lasetup,lasched,sieve1,ecm,tds0,search0,tdsched): L1_BITS=15
    FBsize 100777+0 (deg 4), 171534+0 (deg 1)
    Sorted factor base on side 0: 1: 32495 2: 32397
    Sorted factor base on side 1: 1: 168023
    total yield: 155958, q=1327517 (0.00051 sec/rel) ETA 0h00m)
    937 Special q, 1411 reduction iterations
    reports: 165365787->20138278->18040597->11058350->11056552->10304931->321750 (5306059)
    Total yield: 155958
    milliseconds total: Sieve 24660 Sched 8950 medsched 6040
    TD 25660 (Init 970, MPQS 3550) Sieve-Change 13800
    TD side 0: init/small/medium/large/search: 790 2510 930 1140 6040
    sieve: init/small/medium/large/search: 1470 10160 750 860 1170
    TD side 1: init/small/medium/large/search: 130 2460 1010 810 4910
    sieve: init/small/medium/large/search: 640 6140 990 1180 1300
    aborts: 0 0
    Expected yield/cost: 4.15e+04 0
    p-1: 0 tests, 0 successes
    ecm: 0 tests, 0 successes
    MPQS-AUX 0
    COF: 280255 tests, 0 ecm, 0 aux: 0 mpqs, 0 mpqs3, 0 ecm, 0 too big
    77.763u 1.444s 1:19.53 99.5% 0+0k 1648+23208io 1pf+0w
#79
Sep 2009
4627₈ Posts
Why does c103-2.job.T1 contain:
Quote:
Code:

    print OUTF "m: $M\n" if ($M);

Code:

    print OUTF "n: $N\nm: $M\n";
#80
"Ben"
Feb 2007
2²×941 Posts
Quote:
It was an inexact translation to AVX512 of #defines like this one in lasieve_prepn:

    #define A1MOD0(p) ((aux = absa1 % p) > 0 ? p - aux : 0)

where I neglected to account for the "else" case (where absa1 % p == 0).
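A portable sketch of what the vector translation has to do, assuming 16 lanes of 32-bit values (one 512-bit vector) emulated with a plain loop so it runs without AVX-512 hardware. The function names are hypothetical, and this is not the lasieve_prepn code. On AVX-512, one possible mapping is a compare-to-mask such as `_mm512_cmpneq_epi32_mask` feeding a zero-masked subtract such as `_mm512_maskz_sub_epi32`; that mapping is an assumption for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16 /* one 512-bit vector of 32-bit lanes */

/* Scalar reference, equivalent to
   A1MOD0(p) ((aux = absa1 % p) > 0 ? p - aux : 0) */
static uint32_t a1mod0_scalar(uint32_t absa1, uint32_t p)
{
    uint32_t aux = absa1 % p;
    return aux > 0 ? p - aux : 0;
}

/* Lane-wise emulation of a masked vector version: compute p - aux in every
   lane, then use a "remainder is nonzero" mask to force lanes with
   aux == 0 to zero, as a zero-masked subtract would. */
static void a1mod0_lanes(uint32_t absa1, const uint32_t p[LANES],
                         uint32_t out[LANES])
{
    for (int i = 0; i < LANES; i++) {
        uint32_t aux = absa1 % p[i];
        /* all-ones when aux != 0, all-zeros when aux == 0 */
        uint32_t nonzero = (uint32_t)0 - (uint32_t)(aux != 0);
        out[i] = (p[i] - aux) & nonzero;
    }
}
```

Without the mask, every lane where p divides absa1 would return p instead of 0, which is exactly the missing "else" case.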
#81
"Bo Chen"
Oct 2005
Wuhan, China
2×3×31 Posts
When sieving R942 with lpb34, the avx512 lasieve5 crashes immediately, while the original lasieve5 works properly. The command is:

    ./gnfs-lasieve4I16e -v -f 380000000 -c 1000 -o R942_16e_r_380000000_380001000.out -r R942_poly.txt -R

The R942_poly.txt file is attached.
#82
"Ben"
Feb 2007
2²×941 Posts
Fixed and checked in! Thanks for the report!
|
#83
"Bo Chen"
Oct 2005
Wuhan, China
BA₁₆ Posts
Thanks for the fix.
Another puzzle: when I set rlim and alim to 500000000, version 551 (the newest) is about 10% slower than version 550 (the previous one). I'm not sure whether changing _mm512_set1_epi32(ij_ub) to _mm512_set1_epu32(ij_ub), or some similar modification, would recover the speed.
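A small scalar illustration of where signedness actually enters, offered as background rather than a diagnosis of the slowdown: a broadcast such as `_mm512_set1_epi32` only copies bits into every lane, so the signed/unsigned distinction appears in the instructions that read those lanes, for example `_mm512_cmplt_epi32_mask` versus `_mm512_cmplt_epu32_mask`. The two helpers below are hypothetical lane-level stand-ins for those compares.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Like an epi32 compare: reinterpret the lane bits as signed. */
static int lane_lt_signed(uint32_t a, uint32_t b)
{
    int32_t sa, sb;
    memcpy(&sa, &a, sizeof sa);
    memcpy(&sb, &b, sizeof sb);
    return sa < sb;
}

/* Like an epu32 compare: treat the lane bits as unsigned. */
static int lane_lt_unsigned(uint32_t a, uint32_t b)
{
    return a < b;
}
```

For lane values below 2^31 both compares agree; they only diverge once the top bit of a lane is set.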
#84
Jul 2003
So Cal
101001010111₂ Posts
Quote:
#85
"Ben"
Feb 2007
2²×941 Posts
Quote:
#86
Jul 2003
So Cal
2,647 Posts
Just those two SVML intrinsics? Then let's see if we can replace them without a performance hit. It'll make it much easier to work with if it's not tied to the Intel compilers.
|
#87
"Ben"
Feb 2007
2²·941 Posts
Yep, pretty sure it's just those. I started on that, based on code I wrote a long time ago that used the fast reciprocal intrinsic (_mm512_rcp14_pd, I think) followed by a couple of rounds of Newton iteration, but I never finished it to the point where it works. I will try to find it.
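A scalar sketch of the reciprocal-plus-Newton idea, assuming an initial estimate with relative error below 2^-14 (the documented accuracy of `_mm512_rcp14_pd`). The estimate is faked here by deliberately perturbing the exact reciprocal, and all helper names are hypothetical. Each step x' = x(2 - dx) roughly squares the relative error, and because a·(1/b) can still land just below an integer, the integer quotient gets an explicit correction.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a hardware estimate such as _mm512_rcp14_pd (relative error
   below 2^-14): the exact reciprocal is deliberately perturbed by 2^-15. */
static double rcp_estimate(double d)
{
    return (1.0 / d) * (1.0 + 0x1p-15);
}

/* Newton-Raphson step for f(x) = 1/x - d: x' = x * (2 - d*x).
   If x = (1+e)/d then x' = (1-e*e)/d, so the relative error squares. */
static double newton_step(double d, double x)
{
    return x * (2.0 - d * x);
}

/* Two steps: 2^-15 -> 2^-30 -> about 2^-60, below double precision. */
static double refined_reciprocal(double d)
{
    double x = rcp_estimate(d);
    x = newton_step(d, x);
    return newton_step(d, x);
}

/* Hypothetical 32-bit unsigned quotient via the reciprocal, assuming b > 0.
   The product may be one off, so fix it up with integer arithmetic. */
static uint32_t div_u32_via_rcp(uint32_t a, uint32_t b)
{
    double inv = refined_reciprocal((double)b);
    int64_t q = (int64_t)((double)a * inv);
    int64_t rem = (int64_t)a - q * (int64_t)b;
    while (rem < 0) { q--; rem += (int64_t)b; }
    while (rem >= (int64_t)b) { q++; rem -= (int64_t)b; }
    return (uint32_t)q;
}
```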
|
#88
"Ben"
Feb 2007
7264₈ Posts
Also needed a replacement for _mm512_rem_epu64, which was a taller order.
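A scalar sketch of Barrett reduction, one standard way to compute x mod m without a hardware divide. It is not the lasieve5 code: it assumes a 32-bit modulus, uses the GCC/Clang `unsigned __int128` extension (64-bit targets) for a full 64×64→128-bit product rather than 52-bit vector multiplies, and the function names are hypothetical. With mu = floor((2^64 - 1)/m), the estimated quotient is either floor(x/m) or one less, so a single conditional subtraction finishes the job.

```c
#include <assert.h>
#include <stdint.h>

/* Precompute mu = floor((2^64 - 1) / m) for 1 <= m < 2^32. */
static uint64_t barrett_mu(uint32_t m)
{
    return UINT64_MAX / m;
}

/* x mod m for any 64-bit x. q = floor(x * mu / 2^64) is floor(x/m) or one
   less, so r = x - q*m lies in [0, 2m) and needs at most one subtraction. */
static uint32_t barrett_rem_u64(uint64_t x, uint32_t m, uint64_t mu)
{
    uint64_t q = (uint64_t)(((unsigned __int128)x * mu) >> 64);
    uint64_t r = x - q * m;
    if (r >= m)
        r -= m;
    return (uint32_t)r;
}

/* Hypothetical modular multiply: (a * b) mod m via Barrett reduction. */
static uint32_t modmul32_barrett(uint32_t a, uint32_t b, uint32_t m, uint64_t mu)
{
    return barrett_rem_u64((uint64_t)a * b, m, mu);
}
```

Since mu depends only on m, it can be computed once per modulus and reused across many reductions.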
I put in a 52-bit vector Barrett multiplier to do the job of the _mm512_rem_epu64 SVML intrinsic (in modmul32_16). Replacing the 32-bit div/rem SVML intrinsics was fairly straightforward using double-precision floating-point divides instead.

So the good news is that the lasieve5_64 project will now compile with GCC (tested with gcc 11.1.0)! The bad news is that some of the AVX512 routines cause segfaults when built with GCC. Testing them one at a time, these seem to work: ECM, LASETUP, TDS0. LASCHED, SIEVE1, TDSCHED, SEARCH, and TD all lead to segfaults.

I initially suspected alignment issues, since icc/icx is very lenient about alignment (I think mallocs all get aligned to 64-byte boundaries automatically, and load/store get compiled to loadu/storeu since there is no difference in speed). But so far, working with LASCHED, that isn't helping. I'll keep working at it as I get time. Meanwhile, building with AVX512_LASETUP=1 AVX512_TDS0=1 AVX512_ECM=1 works so far and gives some speedup over no AVX512.

[edit] Tests with larger factor bases are failing, so this isn't quite ready yet.

Last fiddled with by bsquared on 2023-02-16 at 22:11
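A minimal scalar sketch of the double-precision route for 32-bit unsigned div/rem, assuming b > 0; the function name is hypothetical. Both operands are exact in a double's 53-bit significand, and the rounding error of a/b is smaller than the gap to the next integer whenever a, b < 2^32, so truncating the double quotient gives exactly floor(a/b).

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit unsigned quotient and remainder through one double divide.
   Truncation of the correctly rounded a/b equals floor(a/b) for a, b < 2^32. */
static void divrem_u32_via_double(uint32_t a, uint32_t b,
                                  uint32_t *q, uint32_t *r)
{
    uint32_t quot = (uint32_t)((double)a / (double)b);
    *q = quot;
    *r = a - quot * b; /* quot * b <= a, so no wraparound */
}
```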
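If alignment does turn out to matter under GCC, a minimal C11 sketch of getting 64-byte-aligned buffers, which aligned 512-bit loads and stores such as `_mm512_load_si512` require. Plain `malloc` only guarantees alignment for fundamental types (often 16 bytes on x86-64). The helper names are hypothetical, and whether this addresses the LASCHED segfaults is unknown.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate n 32-bit words on a 64-byte boundary. C11 aligned_alloc
   requires the size to be a multiple of the alignment, so round it up. */
static uint32_t *alloc_u32_aligned64(size_t n)
{
    size_t bytes = n * sizeof(uint32_t);
    bytes = (bytes + 63) & ~(size_t)63;
    if (bytes == 0)
        bytes = 64;
    return aligned_alloc(64, bytes);
}

/* True when ptr sits on a 64-byte boundary. */
static int is_aligned64(const void *ptr)
{
    return ((uintptr_t)ptr & 63u) == 0;
}
```

Memory from `aligned_alloc` is released with an ordinary `free`.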