mersenneforum.org (https://www.mersenneforum.org/index.php)
-   CADO-NFS (https://www.mersenneforum.org/forumdisplay.php?f=170)
-   -   CADO-NFS error (exit code -6) (https://www.mersenneforum.org/showthread.php?t=25842)

RedGolpe 2020-08-16 10:21

CADO-NFS error (exit code -6)
 
I have been using a CADO-NFS installation for some time now and everything worked smoothly (probably around 100 factorizations with this machine, ranging from 66 to 143 digits) until it suddenly dropped an error on a C114. Given the error, I thought the data might have been physically corrupted somehow, so I reran the job and got the exact same error at the same point. Here are the last few lines:

[CODE]PID22512 2020-08-16 03:24:26,424 Debug:HTTP server: 127.0.0.1 "POST /cgi-bin/upload.py HTTP/1.1" 200 -
PID22512 2020-08-16 03:24:26,424 Debug:HTTP server: 127.0.0.1 Translated path cgi-bin/upload.py to /home/ubuntu/cado-nfs/scripts/cadofactor/upload.py
PID22512 2020-08-16 03:24:26,520 Info:HTTP server: 127.0.0.1 Sending workunit c115_sieving_2460000-2470000 to client localhost+3
PID22512 2020-08-16 03:24:26,520 Debug:HTTP server: 127.0.0.1 "GET /cgi-bin/getwu?clientid=localhost+3 HTTP/1.1" 200 -
PID22512 2020-08-16 03:24:29,807 Debug:HTTP server: 127.0.0.1 "POST /cgi-bin/upload.py HTTP/1.1" 200 -
PID22512 2020-08-16 03:24:29,807 Debug:HTTP server: 127.0.0.1 Translated path cgi-bin/upload.py to /home/ubuntu/cado-nfs/scripts/cadofactor/upload.py
PID22512 2020-08-16 03:24:29,921 Info:HTTP server: 127.0.0.1 Sending workunit c115_sieving_2470000-2480000 to client localhost+2
PID22512 2020-08-16 03:24:29,921 Debug:HTTP server: 127.0.0.1 "GET /cgi-bin/getwu?clientid=localhost+2 HTTP/1.1" 200 -
PID22512 2020-08-16 03:26:24,258 Warning:Command: Process with PID 31842 finished with return code -6
PID22512 2020-08-16 03:26:24,259 Error:Filtering - Duplicate Removal, removal pass: Program run on server failed with exit code -6
PID22512 2020-08-16 03:26:24,259 Error:Filtering - Duplicate Removal, removal pass: Command line was: /home/ubuntu/cado-nfs/build/ip-172-31-36-46/filter/dup2 -poly nfsdata/c115.poly -nrels 3021546 -renumber nfsdata/c115.renumber.gz -t 8 nfsdata/c115.dup1//0/dup1.0.0000.gz > nfsdata/c115.dup2.slice0.stdout.1 2> nfsdata/c115.dup2.slice0.stderr.1
PID22512 2020-08-16 03:26:24,259 Error:Filtering - Duplicate Removal, removal pass: Stderr output (last 10 lines only) follow (stored in file nfsdata/c115.dup2.slice0.stderr.1):
PID22512 2020-08-16 03:26:24,259 Error:Filtering - Duplicate Removal, removal pass: 1 files (1 new and 0 already renumbered)
PID22512 2020-08-16 03:26:24,259 Error:Filtering - Duplicate Removal, removal pass: Reading files already renumbered:
PID22512 2020-08-16 03:26:24,260 Error:Filtering - Duplicate Removal, removal pass: Reading new files (using 8 auxiliary threads for roots mod p):
PID22512 2020-08-16 03:26:24,260 Error:Filtering - Duplicate Removal, removal pass: terminate called after throwing an instance of 'renumber_t::corrupted_table'
PID22512 2020-08-16 03:26:24,260 Error:Filtering - Duplicate Removal, removal pass: what(): terminate called recursively
PID22512 2020-08-16 03:26:24,260 Error:Filtering - Duplicate Removal, removal pass: Renumber table is corrupt: cannot find p=0x3, r=0x2 on side 1; note: vp=0x4, vr=0x2
PID22512 2020-08-16 03:26:24,260 Error:Filtering - Duplicate Removal, removal pass: terminate called recursively
PID22512 2020-08-16 03:26:24,260 Error:Filtering - Duplicate Removal, removal pass: terminate called recursively
PID22512 2020-08-16 03:26:24,260 Error:Filtering - Duplicate Removal, removal pass: terminate called recursively
PID22512 2020-08-16 03:26:24,260 Error:Filtering - Duplicate Removal, removal pass:[/CODE]

The C114 in question is 351896878082073008542259904904535828992306666357139721605086070409717621857387884266956068558630908324661823125361. I have the complete log file if it's of any use.

EdH 2020-08-20 15:00

I have just experienced the same trouble. I traced my failure to a corrupted relations (*.gz) file, in my case for a 163-digit composite. At first the whole upload directory vanished, but it came back after a reboot*. I tried deleting the corrupted file, but CADO-NFS wouldn't complete because it was missing. I didn't try removing or editing anything else; instead, I started msieve to do the linear algebra rather than spend more time with CADO-NFS.

*Be careful if you try rebooting! If you are using the default setup for CADO-NFS, the working directory is in /tmp and will be removed during a reboot. You must copy the directory elsewhere to save it.
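The copy EdH describes can be sketched as follows. The directory names here are illustrative stand-ins (the demo creates its own throwaway directories); CADO-NFS generates its own workdir name under /tmp, and in practice you would copy that into somewhere persistent like $HOME:

```shell
# Sketch: copy a CADO-NFS working directory out of /tmp before a reboot,
# since /tmp is cleared on reboot. Names below are illustrative demo
# stand-ins, not what CADO-NFS actually generates.
src=$(mktemp -d /tmp/cado.demo.XXXXXX)     # stand-in for the real workdir
echo "demo relation data" > "$src/c115.dup1.gz"
dest=$(mktemp -d)                          # in practice: a directory in $HOME
cp -a "$src" "$dest/"                      # -a preserves times and permissions
ls "$dest/$(basename "$src")"              # the copy survives a reboot
```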

RedGolpe 2020-08-20 17:53

[QUOTE=EdH;554380]I tried deleting the corrupted file and CADO_NFS wouldn't complete because it was missing.[/QUOTE]
Which means that for now one has no choice but to restart the factorization from scratch. Also of note: I reran my job [I]after[/I] deleting the workdir, so it looks like such corruption is generated (possibly reproducibly, at least on similar hardware) by the software.
[QUOTE=EdH;554380]If you are using the default setup for CADO-NFS, the working directory is in /tmp and will be removed during a reboot. You must copy the directory elsewhere to save it.[/QUOTE]
In fact, I strongly suggest running it with a custom directory. Whenever a factorization is interrupted, /tmp isn't cleaned and the CADO files quickly clog it, not to mention that one might want to check something after the factorization is complete anyway.

RedGolpe 2020-08-20 20:44

And it happened again on a C107. Same error as before, can reproduce.

EdH 2020-08-20 22:26

When my current factorization completes (tomorrow), I want to run your C114 on that machine, which is the one that failed. If you post the C107, I'll run that one, too. Of note, the current machine (Z620) would not run the most recent git revision of CADO-NFS for anything somewhat large, although it factored the example with no issue. It is running a revision that has worked nearly flawlessly on two other machines. Unfortunately, I don't remember what the failure was, only that I had to fall back to something earlier.

RedGolpe 2020-08-20 23:04

[QUOTE=EdH;554434]If you post the c107, I'll run that one, too.[/QUOTE]
54022122323205311359700529131254845253584832080092810873601245077747279904751944559089001546838958178759103

Both problematic factorizations were run on an Amazon EC2 instance with Ubuntu. Tonight I will test the C107 on another machine with a similar OS (Ubuntu on WSL/Windows 10) and (hopefully) the same build, and see whether the problem persists.

On EC2 I am running a fairly recent version (one or two weeks old) installed with [CODE]git clone https://gitlab.inria.fr/cado-nfs/cado-nfs.git[/CODE]I'm not sure how to obtain the build version, though.

EdH 2020-08-21 00:26

Your EC2 is probably on a Xeon, yes? My Z620 is a Xeon.

Type "git log" in the cado-nfs directory to find out which commit you're running:
[code]
commit ea3f28ba3f41ecbcdf3c15f9fe3433680ab0df42
Author: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Date: Fri Sep 6 17:23:13 2019 +0200

[polyselect1] avoid polynomials that are found multiple times

commit b5a1635fbcf6083923c44b439f92ece5ad91292f
Merge: 053a11b 43ae1d1
Author: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Date: Fri Sep 6 10:04:56 2019 +0200

Merge branch 'master' of git+ssh://scm.gforge.inria.fr/git/cado-nfs/cado-nfs

commit 053a11b449753ec69018593c4634de63ed5d7e89
Author: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Date: Fri Sep 6 10:04:39 2019 +0200

added KnuthSchroeppel function

commit 43ae1d1ddc095f74709ecb50e98b9e2413716c34
Author: Pierrick Gaudry <pierrick.gaudry@loria.fr>
Date: Thu Sep 5 12:30:51 2019 +0200
. . .
[/code]I don't remember how to "get" an earlier commit, but I'm sure it's in the docs. I know I had to do that in the past, but my memory is only good for a short period of time.
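For reference, getting an earlier commit is a plain [C]git checkout[/C] by hash. A sketch, demonstrated on a throwaway repository so it's self-contained; in the cado-nfs clone you would run the same [C]git log[/C] / [C]git checkout <hash>[/C] with a hash taken from the real history:

```shell
# Sketch: pin a working tree to an earlier commit by hash, demonstrated
# on a throwaway repo. For CADO-NFS, run the checkout in the cado-nfs
# clone with a hash from `git log` instead.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "older commit"
old=$(git rev-parse HEAD)                 # remember the earlier commit
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "newer commit"
git checkout -q "$old"                    # detached HEAD at the earlier commit
git log --oneline -1                      # confirms which commit is checked out
```

Returning to the latest state afterwards is just [C]git checkout master[/C].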

I will have to run up a Colab session and check these numbers also.

RedGolpe 2020-08-21 01:08

The instances I use run 3.3 GHz AMD EPYC processors with 4 cores, 8 threads, and 16 GiB RAM; in Amazon jargon they are of type "c5a.2xlarge", if that's of any use. In the meantime, I ran the C105 on my WSL machine and it completed correctly. Tomorrow I'll check the version (I'm sure mine is older, but I don't know by how much) and run some more tests.

EdH 2020-08-21 12:20

Well, I factored both composites this morning with no issues. I suppose at some point I'll play with some different revisions and see if anything more turns up.

RedGolpe 2020-08-21 12:59

Summary of findings follows.

Tested factorization of the C107 54022122323205311359700529131254845253584832080092810873601245077747279904751944559089001546838958178759103 = 6892192422790360694669529583587636497846216763819494386433 * 7838162228984026472885414974266767581580262988991 on two machines with the following specifications:
- "EC2": an Amazon EC2 instance with 3.3 GHz AMD EPYC processors, 4 cores, 8 threads, 16 GiB RAM, Ubuntu 18.04 fully updated
- "WSL": a Windows 10 PC with Intel Core i7-7800X processors, 6 cores, 12 threads, 32 GiB RAM, WSL Ubuntu 18.04 fully updated

All tests were run with the default command line [C]cado-nfs.py <N> workdir=<workdir>[/C] unless specified otherwise.
Every failure showed the same "corrupted table" error described [URL="https://www.mersenneforum.org/showpost.php?p=553876&postcount=1"]here[/URL].
When a test failed, the error was reproducible with the same command line on the same machine.
A log file is available.

[B][COLOR="Red"]Failed[/COLOR][/B] on EC2, CADO-NFS with timestamp July 22 (two runs).
[B][COLOR="red"]Failed[/COLOR][/B] on EC2, CADO-NFS with timestamp August 18.
[B][COLOR="red"]Failed[/COLOR][/B] on EC2, CADO-NFS with timestamp August 18, parameters -t 6.
[B][COLOR="#090"]Passed[/COLOR][/B] on WSL, CADO-NFS with timestamp March 17.
[B][COLOR="red"]Failed[/COLOR][/B] on WSL, CADO-NFS with timestamp August 18.
[B][COLOR="red"]Failed[/COLOR][/B] on WSL, CADO-NFS with timestamp August 18, parameters -t 8.

So it looks like whatever it is, it does not depend on the processor type, the number of cores, or the actual cores used, and it was introduced some time between March 17 and July 22. I will report this as a bug to the cado-nfs-discuss mailing list.

EdH 2020-08-21 14:27

That's why my September commit is working, then. I'll watch the mailing list to see what they have to say. Thanks.

No promises, but I might try to narrow the commits down a bit more.

EdH 2020-08-21 14:56

I swapped over to a newer commit (Aug 5) and remembered why I wasn't using it: it won't communicate properly with clients:
[code]
ERROR:root:Invalid workunit file: Error: key STDOUT not recognized
[/code]I wonder if this is a conflict between commits, with clients needing to be closer in version to the server, in which case I won't be able to use later commits because I still have some Core2 machines. . .

RedGolpe 2020-08-21 15:12

It seems the good guys at INRIA are already looking into my report. They don't seem to require more information for now.

EdH 2020-08-21 16:20

I'll read the posts when I get my digest version. For now, I'm going to run my September commit and see what shows up later. I'll check the latest git again later on and see if the client communication issue has disappeared.

bur 2021-05-03 10:18

Unfortunately, I ran into that error on a C153 which ran over the weekend:

[CODE]Warning:Command: Process with PID 849626 finished with return code -6
Error:Filtering - Duplicate Removal, removal pass: Program run on server failed with exit code -6
Error:Filtering - Duplicate Removal, removal pass: Command line was: /home/florian/Math/cado-nfs/build/florian-Precision-3640-Tower/filter/dup2 -poly ./workdir/AL30081984/1971-C153/c155.poly -nrels 62519376 -renumber ./workdir/AL30081984/1971-C153/c155.renumber.gz ./workdir/AL30081984/1971-C153/c155.dup1//0/dup1.0.0000.gz ./workdir/AL30081984/1971-C153/c155.dup1//0/dup1.0.0001.gz > ./workdir/AL30081984/1971-C153/c155.dup2.slice0.stdout.4 2> ./workdir/AL30081984/1971-C153/c155.dup2.slice0.stderr.4
Error:Filtering - Duplicate Removal, removal pass: Stderr output (last 10 lines only) follow (stored in file ./workdir/AL30081984/1971-C153/c155.dup2.slice0.stderr.4):
Error:Filtering - Duplicate Removal, removal pass: antebuffer set to /home/florian/Math/cado-nfs/build/florian-Precision-3640-Tower/utils/antebuffer
Error:Filtering - Duplicate Removal, removal pass: [checking true duplicates on sample of 750234 cells]
Error:Filtering - Duplicate Removal, removal pass: Allocated hash table of 75023359 entries (286MiB)
Error:Filtering - Duplicate Removal, removal pass: Constructing the two filelists...
Error:Filtering - Duplicate Removal, removal pass: 2 files (2 new and 0 already renumbered)
Error:Filtering - Duplicate Removal, removal pass: Reading files already renumbered:
Error:Filtering - Duplicate Removal, removal pass: Reading new files (using 3 auxiliary threads for roots mod p):
Error:Filtering - Duplicate Removal, removal pass: terminate called after throwing an instance of 'renumber_t::corrupted_table'
Error:Filtering - Duplicate Removal, removal pass: what(): Renumber table is corrupt: cannot find p=0x4a2bfa9, r=0xd70340 on side 1; note: vp=0x4a2bfb6, vr=0xd70340
Error:Filtering - Duplicate Removal, removal pass:
Traceback (most recent call last):
  File "./cado-nfs.py", line 122, in <module>
    factors = factorjob.run()
  File "./scripts/cadofactor/cadotask.py", line 6131, in run
    last_status = task.run()
  File "./scripts/cadofactor/cadotask.py", line 3845, in run
    raise Exception("Program failed")
Exception: Program failed[/CODE]Restarting with parameters.snapshot.0 didn't help.

It seems I can still use the relations by having msieve continue the work? How would I do that?

According to [url]https://www.mersenneforum.org/showthread.php?t=11948&page=21#227[/url] it seems I can cat the gz files and have msieve process them. But if one of the files is apparently corrupted, how do I find out which one? They all have sizes between 3 and 7 MB. I did a zcat | grep and the missing 4a2bfa9 prime is present in some relation, but does that help?
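One quick check worth running first (a sketch, not a guarantee: gzip's own CRC only catches gzip-level damage, and a file can pass it while still triggering the renumber-table error) is to test each archive's integrity. The demo files below are created on the spot for illustration; in practice you'd run just the loop in the directory holding the relation files:

```shell
# Sketch: report any *.gz relation file that fails gzip's built-in
# integrity check (gzip -t). Demo files created here for illustration;
# in practice run only the for-loop in the workdir/upload directory.
dir=$(mktemp -d); cd "$dir"
echo "good relations" | gzip > good.gz
head -c 20 good.gz > bad.gz            # truncated copy = corrupt archive
for f in *.gz; do
    gzip -t "$f" 2>/dev/null || echo "corrupt: $f"
done
```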

[SIZE="1"]Please don't tell me all is lost...[/SIZE]

bur 2021-05-03 12:13

So I just ignored the cado error message and used the relations with msieve. In case someone has the same problem in the future:

All required files are in workdir/cxxx.upload.
First, combine all gz-compressed relations into one rels.dat:
[CODE]zcat *.gz > rels.dat[/CODE]

Then use convert_poly in cado-nfs/build/<machine>/misc to convert the cnnn.poly file to cnnn.fb:
[CODE]convert_poly -if cado -of msieve < c155.poly > c155.fb[/CODE]

I suggest copying both files to a new directory so nothing gets accidentally modified. Create a cnnn.n file with the number to be factored and then run:
[CODE]../msieve/msieve -i c155.n -s rels.dat -l c155msieve.log -nf c155.fb -t 10 -nc1
../msieve/msieve -i c155.n -s rels.dat -l c155msieve.log -nf c155.fb -t 10 -nc2
../msieve/msieve -i c155.n -s rels.dat -l c155msieve.log -nf c155.fb -t 10 -nc3[/CODE]

Currently I'm at the -nc2 step and it's performing LA with an ETA of 2:20 hours.

For the sake of completeness: if not enough relations are found, see [url]https://www.mersenneforum.org/showthread.php?t=11948&page=21#230[/url] for how to make cado-nfs do more sieving. After that it should be possible to use msieve as explained above.

charybdis 2021-05-03 13:04

[QUOTE=bur;577517][CODE]../msieve/msieve -i c155.n -s rels.dat -l c155msieve.log -nf c155.fb -t 10 -nc1
../msieve/msieve -i c155.n -s rels.dat -l c155msieve.log -nf c155.fb -t 10 -nc2
../msieve/msieve -i c155.n -s rels.dat -l c155msieve.log -nf c155.fb -t 10 -nc3[/CODE][/QUOTE]

[C]-nc[/C] performs all of [C]-nc1[/C], [C]-nc2[/C], [C]-nc3[/C] in succession.

EdH 2021-05-03 13:09

Good post!

I thought I had posted a "How I ..." on using CADO-NFS for poly/sieving and Msieve for LA, but apparently I've been slacking. This is how I run all my larger jobs. I had originally written my own conversion (for the .fb), before I learned of the provided one.

For some of my scripts, I do a check for *.cyc after the -nc1 step. The scripts use the existence of that file to tell whether filtering succeeded or not. Then the scripts can either call -nc2 or call for more sieving.
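That check can be sketched roughly as below (filenames illustrative: the demo creates its own stand-in file; in a real script the .cyc file is what msieve's -nc1 filtering leaves behind when it succeeds, as described above):

```shell
# Sketch of the post-filtering check: after `msieve ... -nc1`, branch on
# whether a .cyc file exists before starting linear algebra. A stand-in
# file is created here so the demo is self-contained.
dir=$(mktemp -d); cd "$dir"
touch c155.cyc                              # stand-in for msieve's -nc1 output
if ls ./*.cyc >/dev/null 2>&1; then
    echo "filtering succeeded: run -nc2"
else
    echo "filtering failed: sieve more relations"
fi
```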

Not sure if you know this (you probably do), but if -nc2 is interrupted, use -ncr to continue. If you use -nc2 again, it will start LA from scratch.

bur 2021-05-03 13:32

Thanks, it's basically your linked post with the small addition of how to convert the poly to fb. I'm glad this error can easily be worked around; otherwise I'd be quite nervous about longer jobs.

Not sure why cado-nfs chokes on the rels while msieve has no problem with them.

[QUOTE]This is how I run all my larger jobs.[/QUOTE]Why is that? Is msieve faster on those steps?

[QUOTE]-nc performs all of -nc1, -nc2, -nc3 in succession.[/QUOTE]Yes, and EdH already mentioned that in his post. I still used the separate steps since I wasn't sure it would work at all, given the corruption cado-nfs complained about.

charybdis 2021-05-03 13:47

[QUOTE=bur;577526]Not sure why cado-nfs chokes on othe rels while msieve has no problem with them.[/quote]

I don't think there's anything wrong with the relations; it's a bug in the way CADO's duplicate removal processes them. And if a few relations are bad, msieve will just ignore them.

[QUOTE]Why is that? Is msieve faster on those steps?[/quote]

The most time-consuming part of the postprocessing, the linear algebra (-nc2), is substantially faster with msieve than with CADO. In addition, CADO uses much more memory than msieve during the filtering stage, so a given machine will be able to run larger numbers with msieve than with CADO.

bur 2021-05-03 14:16

Ah, that's good to know!

Maybe a stupid question, but since msieve is open source why is the implementation of cado-nfs linear algebra not just taken from msieve?

VBCurtis 2021-05-03 14:45

CADO's algorithm features less interprocess communication than msieve's during the (longest) first stage of matrix solving, which allows jobs to be split fruitfully among machines. That lets larger jobs run on regular hardware.

An ideal solution would be to have an -msieve flag in CADO which runs the matrix using msieve within the cado-nfs.py wrapper.
