mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Msieve (https://www.mersenneforum.org/forumdisplay.php?f=83)
-   -   Msieve v1.46 feedback (https://www.mersenneforum.org/showthread.php?t=13676)

Jeff Gilchrist 2010-08-16 19:20

[QUOTE=Batalov;225577]Jeff, I see that Jasonp assumes that your memory is ECC and cannot fail. However, clusters of cross-my-heart-and-hope-not-to-die non-ECC i7's exist. Is your cluster ECC? And the computer where non-MPI version ran?[/QUOTE]

I'm not sure. They are both HP based Xeon systems made for the HPC market so could very well have ECC but I'm not sure how to find out. I don't have any model numbers. Is there something I could check in Linux as a non-root user to see if the RAM is ECC or not?

Phil MjX 2010-08-16 20:04

Hi !

Today I hit the corrupt state at iteration 57718 (dim 3649974): 10000-epsilon...
and it happened again from the backup checkpoint, this time at iteration 55914 (dim 3535884) :surrender

My computer isn't overclocked, was built to crunch composites with quality components, and has been stress tested for days (OCCT, Prime95, Linpack...) under Windows.

I think Jeff's RAM isn't implicated in the failure, or else mine is too!

The SVN 377 binary wasn't able to recover from either of the checkpoints.

I had 700k new rels from another computer waiting, so I have rebuilt the matrix from scratch with these relations and SVN 377, to see whether corrupt states still happen.

Hope this helps. I have backed up the tree checkpoints and all the files are kept frozen awaiting tests. The restart run has been launched in a new folder.

Kind regards.
Philippe

jasonp 2010-08-16 20:16

Nothing is wrong with anyone's computer when this problem crops up; it's just an unintended side effect of making the LA more flexible. There's a fix in SVN but I haven't updated any binaries.

Phil MjX 2010-08-16 20:46

Thanks,

I have compiled the SVN 377 version of the msieve source code: it wasn't able to recover from the checkpoints, which is why I have restarted the LA phase.

Is this the SVN version you are referring to for the fix?
Should a fresh run avoid the corrupt states? (I'll see for myself if it all runs OK to completion...)

Regards.
Philippe.

jasonp 2010-08-17 01:53

Note that if you still have the old checkpoint, you can try patching the source as described in post #77 above and your old work should not be wasted.

Batalov 2010-08-17 04:01

[quote=Phil MjX;225726]I had 700k new rels from another computer waiting, so I have rebuilt the matrix from scratch with these relations and SVN 377, to see whether corrupt states still happen.

Hope this helps. I have backed up the tree checkpoints and all the files are kept frozen awaiting for tests. The restart run has been launched in a new folder.[/quote]

In addition to what Jason wrote above, note that if 1) you have a matrix and a checkpoint and 2) you later [I]strictly[/I] append to the .dat file (no remdups or any other change to the order), then the older matrix and checkpoint will still work fine. This is because there are no checksums (or similar) that would prevent it, and because the old "file" sits inside the new appended .dat file like a nesting doll, so the matrix and cycles files will only reference the old portion of the file.

In contrast, if we simply delete one line (for example, a redundant one) from somewhere near the beginning of the .dat file, this house of cards will collapse. So that is something to keep in mind: if you plan to manipulate the .dat file and have a back-burner idea of trying something later with the old matrix, it is better to make a full backup of the project directory first.
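
If you want to convince yourself that a newer .dat file really is a strict append of the older one before reusing the old matrix and checkpoint, a byte-for-byte prefix comparison is enough. Below is a minimal Python sketch of that check; it is not part of msieve, and the function name `is_strict_append` and the file names in the usage note are made up for illustration. It reads in chunks so that a multi-gigabyte .dat file does not need to fit in memory.

```python
import os


def is_strict_append(old_path, new_path, chunk_size=1 << 20):
    """Return True if new_path starts with exactly the bytes of old_path.

    Hypothetical helper (not an msieve function): it only checks that the
    old file is a byte-for-byte prefix of the new one.
    """
    old_size = os.path.getsize(old_path)
    # A shorter file cannot contain the old one as a prefix.
    if os.path.getsize(new_path) < old_size:
        return False
    with open(old_path, "rb") as old, open(new_path, "rb") as new:
        remaining = old_size
        while remaining > 0:
            n = min(chunk_size, remaining)
            a = old.read(n)
            b = new.read(n)
            # Any difference (or an unexpected short read) means the old
            # relations no longer sit untouched at the start of the file.
            if not a or a != b:
                return False
            remaining -= len(a)
    return True
```

For example, `is_strict_append("backup/r521b.dat", "r521b.dat")` (assumed paths) should come back `True` after a pure append and `False` after remdups or after deleting a line near the top.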

tgrdy 2010-08-17 06:51

[quote=jasonp;225798]Note that if you still have the old checkpoint, you can try patching the source as described in post #77 above and your old work should not be wasted.[/quote]

Jason, I patched /common/lanczos/lanczos.c as you described in #77, rebuilt msieve, and ran it.
It still does not work.

I copied all the data to another PC, ran v1.45, and built a new matrix; it seems OK.
After about 24 hours:
linear algebra completed 2105405 of 4895210 dimensions (43.0%, ETA 32h15m)

I am keeping the old .chk file. If a newer SVN revision fixes it, I will test the checkpoint again for the C157 GNFS job.

Thanks.


201,607,496 Aug 14 05:52 r521b.dat.chk
122,372,972 Aug 12 18:37 r521b.dat.cyc
1,462,452,324 Aug 12 18:39 r521b.dat.mat
8,043,804,767 Aug 12 17:12 r521b.dat

[quote]Tue Aug 17 14:35:41 2010 Msieve v. 1.47
Tue Aug 17 14:35:41 2010 random seeds: 9043c742 23934020
Tue Aug 17 14:35:41 2010 factoring ....... (157 digits)
Tue Aug 17 14:35:42 2010 no P-1/P+1/ECM available, skipping
Tue Aug 17 14:35:42 2010 commencing number field sieve (157-digit input)
Tue Aug 17 14:35:42 2010 R0: -718286775230264074412218462160
Tue Aug 17 14:35:42 2010 R1: 330882384079102889
Tue Aug 17 14:35:42 2010 A0: -574785673446953991103337093662087263
Tue Aug 17 14:35:42 2010 A1: -2849845913901309456445653108969
Tue Aug 17 14:35:42 2010 A2: 19403845861276275944787296
Tue Aug 17 14:35:42 2010 A3: 40609767955771924744
Tue Aug 17 14:35:42 2010 A4: 185982982727102
Tue Aug 17 14:35:42 2010 A5: 33078600
Tue Aug 17 14:35:42 2010 skew 468676.75, size 2.585e-15, alpha -7.130, combined = 1.913e-12 rroots = 3
Tue Aug 17 14:35:42 2010
Tue Aug 17 14:35:42 2010 commencing linear algebra
Tue Aug 17 14:35:43 2010 matrix starts at (0, 0)
Tue Aug 17 14:35:44 2010 matrix is 5039914 x 5040091 (1529.3 MB) with weight 485006686 (96.23/col)
Tue Aug 17 14:35:44 2010 sparse part has weight 340412623 (67.54/col)
Tue Aug 17 14:35:44 2010 saving the first 48 matrix rows for later
Tue Aug 17 14:35:45 2010 matrix includes 64 packed rows
Tue Aug 17 14:35:46 2010 matrix is 5039866 x 5040091 (1482.4 MB) with weight 387000662 (76.78/col)
Tue Aug 17 14:35:46 2010 sparse part has weight 338202876 (67.10/col)
Tue Aug 17 14:35:46 2010 using block size 65536 for processor cache size 8192 kB
Tue Aug 17 14:36:06 2010 commencing Lanczos iteration (8 threads)
Tue Aug 17 14:36:06 2010 memory use: 1456.0 MB
Tue Aug 17 14:36:06 2010 restarting at iteration 47281 (dim = 2989936)
Tue Aug 17 14:36:10 2010 error: corrupt state, please restart from checkpoint[/quote]

jasonp 2010-08-17 17:35

Okay, if you only lost a day then it probably is not worth trying to salvage the previous LA run.

Phil MjX 2010-08-17 20:48

Thanks Jason and Batalov, this c161 is the largest composite I have ever factored (it comes from the aliquot sequence with index 5400 that I have been dealing with for months now) and I am interested in how the size of the matrix evolves with extra sieving.

I have also run a lot of postprocessing with various sets of rels, to see exactly when "cycle explosion" occurs and what happens after...

I have happily sieved it for 2 months with 6 cores, so 24 h of computation lost won't perturb my sleep :smile:!

For me too, everything is currently running OK up to 54 % !

Kind regards

Philippe.

jasonp 2010-08-17 23:02

Philippe, let us know what you find...when these kinds of experiments take months it's very difficult to collect many data points.

EdH 2010-08-19 03:18

My current copy of Msieve does not appear to want to stop polynomial selection. On Monday I was running Aliqueit against a c91 with gnfs_cutoff set to 89 and use_msieve_polyfind = true. After five hours of endless polynomials, I stopped it and changed gnfs_cutoff to 95. The c91 subsequently finished in two hours via SIQS.

Today I had a c99 turn up and had set gnfs_cutoff to 95 and use_msieve_polyfind = false. This time the polynomial selection ran for 2.5 hours before I shut it down.

Here is a portion of the terminal output:
[code]
Msieve v. 1.46
Wed Aug 18 22:50:34 2010
random seeds: e1729248 fdb4d4d9
factoring 618155374139563156953657470966251220509792800441172795436567726667286308678511882481861288669912867 (99 digits)
searching for 15-digit factors
commencing number field sieve (99-digit input)
commencing number field sieve polynomial selection
time limit set to 0.33 hours
searching leading coefficients from 1 to 5694691
deadline: 5 seconds per coefficient
coeff 60-60 5538380 6092218 6092219 6701440
------- 5538380-6092218 6092219-6701440
poly 0 p 6045077 q 6125117 coeff 37026803899009
poly 0 p 5879333 q 6139231 coeff 36094583412923
poly 0 p 5675141 q 6141751 coeff 34855302911891
poly 0 p 5600069 q 6153677 coeff 34461015803713
coeff 120-120 5783585 6361943 6361944 6998138
------- 5783585-6361943 6361944-6998138
[/code]
Is it supposed to stop after 0.33 hours? Or is it supposed to run through all the coefficients?

I don't remember it running this way before. I thought I had been running 1.46 for a while now, but this is a different machine; the other one destroyed its OS.

Current machine:
linux Fedora 13, Intel(R) Pentium(R) 4 CPU 2.80GHz, 512MB

Any thoughts? Thanks. . .

(Sorry if this has been covered already. I only did a weak search.)

