mersenneforum.org  

2020-08-16, 10:21   #1
RedGolpe
Aug 2006
Monza, Italy

2²·17 Posts
CADO-NFS error (exit code -6)

I have been using a CADO-NFS installation for some time now and everything worked smoothly (I have probably done around 100 factorizations with this machine, ranging from 66 to 143 digits) until it suddenly threw an error on a C114. Given the error, I thought the data might have been physically corrupted somehow, so I reran the job and got exactly the same error at the same point. Here are the last few lines:

Code:
PID22512 2020-08-16 03:24:26,424 Debug:HTTP server: 127.0.0.1 "POST /cgi-bin/upload.py HTTP/1.1" 200 -
PID22512 2020-08-16 03:24:26,424 Debug:HTTP server: 127.0.0.1 Translated path cgi-bin/upload.py to /home/ubuntu/cado-nfs/scripts/cadofactor/upload.py
PID22512 2020-08-16 03:24:26,520 Info:HTTP server: 127.0.0.1 Sending workunit c115_sieving_2460000-2470000 to client localhost+3
PID22512 2020-08-16 03:24:26,520 Debug:HTTP server: 127.0.0.1 "GET /cgi-bin/getwu?clientid=localhost+3 HTTP/1.1" 200 -
PID22512 2020-08-16 03:24:29,807 Debug:HTTP server: 127.0.0.1 "POST /cgi-bin/upload.py HTTP/1.1" 200 -
PID22512 2020-08-16 03:24:29,807 Debug:HTTP server: 127.0.0.1 Translated path cgi-bin/upload.py to /home/ubuntu/cado-nfs/scripts/cadofactor/upload.py
PID22512 2020-08-16 03:24:29,921 Info:HTTP server: 127.0.0.1 Sending workunit c115_sieving_2470000-2480000 to client localhost+2
PID22512 2020-08-16 03:24:29,921 Debug:HTTP server: 127.0.0.1 "GET /cgi-bin/getwu?clientid=localhost+2 HTTP/1.1" 200 -
PID22512 2020-08-16 03:26:24,258 Warning:Command: Process with PID 31842 finished with return code -6
PID22512 2020-08-16 03:26:24,259 Error:Filtering - Duplicate Removal, removal pass: Program run on server failed with exit code -6
PID22512 2020-08-16 03:26:24,259 Error:Filtering - Duplicate Removal, removal pass: Command line was: /home/ubuntu/cado-nfs/build/ip-172-31-36-46/filter/dup2 -poly nfsdata/c115.poly -nrels 3021546 -renumber nfsdata/c115.renumber.gz -t 8 nfsdata/c115.dup1//0/dup1.0.0000.gz > nfsdata/c115.dup2.slice0.stdout.1 2> nfsdata/c115.dup2.slice0.stderr.1
PID22512 2020-08-16 03:26:24,259 Error:Filtering - Duplicate Removal, removal pass: Stderr output (last 10 lines only) follow (stored in file nfsdata/c115.dup2.slice0.stderr.1):
PID22512 2020-08-16 03:26:24,259 Error:Filtering - Duplicate Removal, removal pass:     1 files (1 new and 0 already renumbered)
PID22512 2020-08-16 03:26:24,259 Error:Filtering - Duplicate Removal, removal pass:     Reading files already renumbered:
PID22512 2020-08-16 03:26:24,260 Error:Filtering - Duplicate Removal, removal pass:     Reading new files (using 8 auxiliary threads for roots mod p):
PID22512 2020-08-16 03:26:24,260 Error:Filtering - Duplicate Removal, removal pass:     terminate called after throwing an instance of 'renumber_t::corrupted_table'
PID22512 2020-08-16 03:26:24,260 Error:Filtering - Duplicate Removal, removal pass:       what():  terminate called recursively
PID22512 2020-08-16 03:26:24,260 Error:Filtering - Duplicate Removal, removal pass:     Renumber table is corrupt: cannot find p=0x3, r=0x2 on side 1; note: vp=0x4, vr=0x2
PID22512 2020-08-16 03:26:24,260 Error:Filtering - Duplicate Removal, removal pass:     terminate called recursively
PID22512 2020-08-16 03:26:24,260 Error:Filtering - Duplicate Removal, removal pass:     terminate called recursively
PID22512 2020-08-16 03:26:24,260 Error:Filtering - Duplicate Removal, removal pass:     terminate called recursively
PID22512 2020-08-16 03:26:24,260 Error:Filtering - Duplicate Removal, removal pass:
The C114 in question is 351896878082073008542259904904535828992306666357139721605086070409717621857387884266956068558630908324661823125361. I have the complete log file if it's of any use.
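
A return code of -6 reported by a Python driver normally means the child process was killed by signal 6 (SIGABRT), which is consistent with the "terminate called" lines from dup2 above. The sketch below decodes such a code and pulls only the Warning and Error lines out of a complete log. The function names and the log path in the usage comment are illustrative assumptions, not part of CADO-NFS.

```shell
#!/bin/sh
# Hypothetical helpers, not part of CADO-NFS.

# Map a negative return code (as reported by Python's subprocess module)
# to a signal name by stripping the leading minus sign.
signal_name() {
    kill -l "${1#-}"
}

# Print only the Warning and Error lines of a cado-nfs.py log file,
# skipping the much more numerous Debug and Info lines.
cado_errors() {
    grep -E '^PID[0-9]+ [0-9-]+ [0-9:,]+ (Warning|Error):' "$1"
}

# Usage (the log path is an assumption):
#   signal_name -6
#   cado_errors /path/to/workdir/logfile.log | tail -n 20
```

Filtering this way makes it easier to find the first failing command in a long log, since the HTTP server chatter is logged at Debug and Info level.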

2020-08-20, 15:00   #2
EdH
"Ed Hall"
Dec 2009
Adirondack Mtns

3,461 Posts

I have just experienced the same trouble. I have traced my failure to a corrupted relations (*.gz) file. In my case this is for a 163-digit composite. At first, the whole upload directory vanished, but it came back after a reboot*. I tried deleting the corrupted file, and CADO-NFS wouldn't complete because it was missing. I didn't try removing or editing anything else. Instead of spending more time with CADO-NFS, I started msieve to do the linear algebra.

*Be careful if you try rebooting! If you are using the default setup for CADO-NFS, the working directory is in /tmp and will be removed during a reboot. You must copy the directory elsewhere to save it.

2020-08-20, 17:53   #3
RedGolpe
Aug 2006
Monza, Italy

2²·17 Posts

Quote:
Originally Posted by EdH
I tried deleting the corrupted file and CADO-NFS wouldn't complete because it was missing.
Which means that, for now, one has no choice but to restart the factorization from scratch. Also of note: I reran my job after deleting the workdir, so it looks like the corruption is generated by the software itself (possibly in a reproducible way, at least with similar hardware).
Quote:
Originally Posted by EdH
If you are using the default setup for CADO-NFS, the working directory is in /tmp and will be removed during a reboot. You must copy the directory elsewhere to save it.
In fact, I strongly suggest running it with a custom directory. Whenever a factorization is interrupted, /tmp isn't cleaned and the CADO files quickly clog it, not to mention that one might want to check something after the factorization is complete anyway.
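
A minimal sketch of running with a working directory outside /tmp, using the `workdir=` parameter of cado-nfs.py (the same form as the command line quoted later in this thread). The wrapper name `run_cado`, the `CADO_NFS_PY` variable, and the directory layout are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper, not part of CADO-NFS.
# CADO_NFS_PY may point at the cado-nfs.py script; it defaults to ./cado-nfs.py.
run_cado() {
    n=$1
    workdir=$2
    # Create the directory first so that a bad parent path fails early.
    mkdir -p "$workdir" || return 1
    "${CADO_NFS_PY:-./cado-nfs.py}" "$n" workdir="$workdir"
}

# Usage (the directory name is an assumption):
#   run_cado <N> "$HOME/cado-work/c114"
```

Keeping one directory per composite under the home directory means the files survive a reboot and can be inspected or deleted by hand afterwards.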

Last fiddled with by RedGolpe on 2020-08-20 at 18:33

2020-08-20, 20:44   #4
RedGolpe
Aug 2006
Monza, Italy

2²·17 Posts

And it happened again on a C107. Same error as before, can reproduce.

Last fiddled with by RedGolpe on 2020-08-20 at 20:53

2020-08-20, 22:26   #5
EdH
"Ed Hall"
Dec 2009
Adirondack Mtns

3,461 Posts

When my current factorization is completed (tomorrow), I want to run your c114 on that machine, which is the one that failed. If you post the c107, I'll run that one, too. Of note, the current machine (Z620) would not run the most recent git revision of CADO-NFS for anything somewhat large, although it factored the example with no issue. It is running a revision that has worked nearly flawlessly on two other machines. Unfortunately, I don't remember what the failure was, only that I had to try something earlier.

2020-08-20, 23:04   #6
RedGolpe
Aug 2006
Monza, Italy

2²·17 Posts

Quote:
Originally Posted by EdH
If you post the c107, I'll run that one, too.
54022122323205311359700529131254845253584832080092810873601245077747279904751944559089001546838958178759103

Both the problematic factorizations were run on an Amazon EC2 instance with Ubuntu. Tonight I will test the C107 on another machine with a similar OS (Ubuntu on WSL/Windows 10) and (hopefully) the same build and see if the problem persists.

On EC2 I am running a fairly recent version (one or two weeks old) installed with
Code:
git clone https://gitlab.inria.fr/cado-nfs/cado-nfs.git
Not sure how to obtain the build version though.

2020-08-21, 00:26   #7
EdH
"Ed Hall"
Dec 2009
Adirondack Mtns

3461₁₀ Posts

Your EC2 is probably on a Xeon, yes? My Z620 is a Xeon.

Type "git log" in the cado-nfs directory to find out which commit you're running:
Code:
commit ea3f28ba3f41ecbcdf3c15f9fe3433680ab0df42
Author: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Date:   Fri Sep 6 17:23:13 2019 +0200

    [polyselect1] avoid polynomials that are found multiple times

commit b5a1635fbcf6083923c44b439f92ece5ad91292f
Merge: 053a11b 43ae1d1
Author: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Date:   Fri Sep 6 10:04:56 2019 +0200

    Merge branch 'master' of git+ssh://scm.gforge.inria.fr/git/cado-nfs/cado-nfs

commit 053a11b449753ec69018593c4634de63ed5d7e89
Author: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Date:   Fri Sep 6 10:04:39 2019 +0200

    added KnuthSchroeppel function

commit 43ae1d1ddc095f74709ecb50e98b9e2413716c34
Author: Pierrick Gaudry <pierrick.gaudry@loria.fr>
Date:   Thu Sep 5 12:30:51 2019 +0200
. . .
I don't remember how to "get" an earlier commit, but I'm sure it's in the docs. I know I had to do that in the past, but my memory is only good for a short period of time.

I will have to run up a Colab session and check these numbers also.
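
On the question of identifying the installed revision and getting an earlier one, a minimal sketch using standard git commands follows. The helper names and the repository path in the usage comment are assumptions; the date is only an example.

```shell
#!/bin/sh
# Hypothetical helpers built on standard git commands.

# Print the hash and commit date of the checked-out revision in repo $1.
current_commit() {
    git -C "$1" log -1 --format='%H %cd'
}

# Print the newest commit reachable from HEAD in repo $1
# that was committed before date $2.
commit_before() {
    git -C "$1" rev-list -n 1 --before="$2" HEAD
}

# Usage (the repository path is an assumption):
#   current_commit ~/cado-nfs
#   git -C ~/cado-nfs checkout "$(commit_before ~/cado-nfs 2019-09-07)"
# After checking out another revision, rebuild before running again.
```

Checking out a bare hash leaves the repository in a detached HEAD state; checking out the original branch again returns to the latest revision.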

2020-08-21, 01:08   #8
RedGolpe
Aug 2006
Monza, Italy

68₁₀ Posts

The instances I use run 3.3 GHz AMD EPYC processors with 4 cores and 8 threads, and have 16 GiB RAM. If it's of any use, they are of type "c5a.2xlarge" in Amazon jargon. In the meantime, I ran the C107 on my WSL machine and it completed correctly. Tomorrow I'll check the version (I'm sure mine is older, but I don't know by how much) and run some more tests.

2020-08-21, 12:20   #9
EdH
"Ed Hall"
Dec 2009
Adirondack Mtns

D85₁₆ Posts

Well, I factored both composites this morning with no issues. I suppose at some point I'll play with some different revisions and see if anything more turns up.

2020-08-21, 12:59   #10
RedGolpe
Aug 2006
Monza, Italy

2²·17 Posts

Summary of findings follows.

Tested factorization of the C107 54022122323205311359700529131254845253584832080092810873601245077747279904751944559089001546838958178759103 = 6892192422790360694669529583587636497846216763819494386433 * 7838162228984026472885414974266767581580262988991 on two machines with the following specifications:
- "EC2": an Amazon EC2 instance with 3.3 GHz AMD EPYC processors, 4 cores, 8 threads, 16 GiB RAM, Ubuntu 18.04 fully updated
- "WSL": a Windows 10 PC with Intel Core i7-7800X processors, 6 cores, 12 threads, 32 GiB RAM, WSL Ubuntu 18.04 fully updated

All tests were run with the default command line cado-nfs.py <N> workdir=<workdir> unless otherwise specified.
Every test that failed did so with the same "corrupted table" error described here.
When a test fails, the error seems reproducible with the same command line on the same machine.
A log file is available.

Failed on EC2, CADO-NFS with timestamp July 22 (two runs).
Failed on EC2, CADO-NFS with timestamp August 18.
Failed on EC2, CADO-NFS with timestamp August 18, parameters -t 6.
Passed on WSL, CADO-NFS with timestamp March 17.
Failed on WSL, CADO-NFS with timestamp August 18.
Failed on WSL, CADO-NFS with timestamp August 18, parameters -t 8.

So it looks like, whatever it is, it does not depend on the processor type, the number of cores, or the number of cores actually used, and it was introduced some time between March 17 and July 22. I will report this as a bug to the cado-nfs-discuss mailing list.

Last fiddled with by RedGolpe on 2020-08-21 at 13:42

2020-08-21, 14:27   #11
EdH
"Ed Hall"
Dec 2009
Adirondack Mtns

3,461 Posts

That's why my September commit is working, then. I'll watch the mailing list to see what they have to say. Thanks.

No promises, but I might try to narrow the commits down a bit more.
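
One way to narrow the commits down is git's built-in bisection between a revision known to work and one known to fail. The sketch below wraps the standard `git bisect` commands; the function name and the test script in the usage comment are hypothetical. For CADO-NFS the test command would have to rebuild and rerun a failing factorization at every step, so each step could take a long time.

```shell
#!/bin/sh
# Hypothetical helper built on standard git-bisect commands.
# $1 = repository, $2 = known-good commit, $3 = known-bad commit,
# remaining arguments = test command, run from the top of the repository
# (exit 0 = good, 125 = skip this commit, 1 to 127 otherwise = bad).
first_bad_commit() {
    repo=$1
    good=$2
    bad=$3
    shift 3
    git -C "$repo" bisect start "$bad" "$good" >/dev/null 2>&1 || return 1
    # When the run succeeds, refs/bisect/bad names the first bad commit.
    git -C "$repo" bisect run "$@" >/dev/null 2>&1 &&
        git -C "$repo" rev-parse refs/bisect/bad
    status=$?
    # Restore the revision that was checked out before bisecting.
    git -C "$repo" bisect reset >/dev/null 2>&1
    return $status
}

# Usage (paths, commits, and the test script are assumptions):
#   first_bad_commit ~/cado-nfs <good-commit> <bad-commit> /path/to/test-c107.sh
```

Bisection needs roughly log2(n) test runs for n candidate commits, which is why it is usually preferred over trying revisions one by one.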
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.