mersenneforum.org


fivemack 2017-04-03 06:34

Trouble restarting large job
 
I ran -nc2 on a 64G machine and want to transfer the matrix and checkpoint to a faster machine with 32G of memory to do the actual linear algebra.

But on two separate machines, with two attempts per machine (including restarting from an earlier checkpoint), I get something like:

[code]
Fri Mar 31 20:51:19 2017 commencing Lanczos iteration (6 threads)
Fri Mar 31 20:51:19 2017 memory use: 16253.3 MB
Fri Mar 31 20:51:20 2017 restarting at iteration 626 (dim = 39615)
Fri Mar 31 20:53:58 2017 linear algebra at 0.1%, ETA 1106h20m
Fri Mar 31 20:55:29 2017 error: corrupt state, please restart from checkpoint
[/code]
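
To compare failed runs, it can help to measure how long each one lasted before the error appeared. The following is a minimal sketch, assuming the log lines keep the timestamp layout shown above; the function names are made up for illustration and are not part of msieve.

```python
from datetime import datetime

# Timestamp layout assumed from the log excerpt, e.g. "Fri Mar 31 20:51:19 2017"
STAMP_FORMAT = "%a %b %d %H:%M:%S %Y"


def split_line(line):
    """Split a log line into (timestamp, message); raises ValueError if it does not parse."""
    parts = line.split()
    stamp = datetime.strptime(" ".join(parts[:5]), STAMP_FORMAT)
    return stamp, " ".join(parts[5:])


def seconds_until_corruption(log_lines):
    """Seconds from the most recent Lanczos start to the first 'corrupt state' error after it.

    Returns None if no such error follows a Lanczos start.
    """
    start = None
    for line in log_lines:
        try:
            stamp, message = split_line(line)
        except ValueError:
            continue  # skip lines without a leading timestamp
        if message.startswith("commencing Lanczos iteration"):
            start = stamp
        elif message.startswith("error: corrupt state") and start is not None:
            return (stamp - start).total_seconds()
    return None


excerpt = [
    "Fri Mar 31 20:51:19 2017 commencing Lanczos iteration (6 threads)",
    "Fri Mar 31 20:51:19 2017 memory use: 16253.3 MB",
    "Fri Mar 31 20:51:20 2017 restarting at iteration 626 (dim = 39615)",
    "Fri Mar 31 20:53:58 2017 linear algebra at 0.1%, ETA 1106h20m",
    "Fri Mar 31 20:55:29 2017 error: corrupt state, please restart from checkpoint",
]

print(seconds_until_corruption(excerpt))  # 250.0
```

On the excerpt above, the error arrives 250 seconds after the Lanczos iteration starts.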

I've tried restarting on the 64G machine I was using originally, and that ran for eight hours without giving such a message.

henryzz 2017-04-03 20:42

Is it trying to use a different block size on the 32gb machines?

fivemack 2017-04-03 22:32

[QUOTE=henryzz;456135]Is it trying to use a different block size on the 32gb machines?[/QUOTE]

Different superblock size, though same block size:

[code]
tractor (64G)
Sun Apr 2 22:07:29 2017 using block size 8192 and superblock size 983040 for processor cache size 10240 kB
pumpkin (32G, i7-4930K)
Fri Mar 31 20:46:40 2017 sparse part has weight 4557683460 (118.90/col)
Fri Mar 31 20:46:40 2017 using block size 8192 and superblock size 1179648 for processor cache size 12288 kB
butternut (32G, i7-5820K)
Fri Mar 31 20:08:38 2017 using block size 8192 and superblock size 1179648 for processor cache size 12288 kB
Fri Mar 31 20:13:16 2017 commencing Lanczos iteration (6 threads)
[/code]
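
As a quick check of how the reported sizes relate, the numbers in these three log lines can be compared directly. This is only a sketch over the values quoted above: it shows what the logs report, and makes no claim about how msieve computes the superblock size internally. The dictionary and function names are made up for illustration.

```python
BLOCK_SIZE = 8192  # identical in all three log lines

# host -> (processor cache size in kB, superblock size), copied from the logs above
observed = {
    "tractor": (10240, 983040),
    "pumpkin": (12288, 1179648),
    "butternut": (12288, 1179648),
}


def superblock_per_cache_kb(cache_kb, superblock):
    """Ratio of reported superblock size to reported cache size in kB."""
    return superblock / cache_kb


for host, (cache_kb, superblock) in observed.items():
    ratio = superblock_per_cache_kb(cache_kb, superblock)
    blocks = superblock // BLOCK_SIZE
    print(f"{host}: superblock/cache_kB = {ratio:g}, superblock/block = {blocks}")
```

In all three lines the superblock size is 96 times the reported cache size in kB, so the difference between tractor and the two 32G machines follows the 10240 kB versus 12288 kB cache sizes.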

I'm currently redoing the processing on butternut in the hope that I can run the whole job there, but RelProcTime is about 26 hours, so I will not have results immediately.

Xyzzy 2018-01-03 22:56

We are experiencing the same error when starting a job.

We have tried a binary we compiled ourselves and one from someone else that is known to work.

We have also tried several different target densities.

FWIW, we ran msieve.dat through "remdups" prior to starting the job.

[CODE]Wed Jan 3 13:14:14 2018 commencing linear algebra
Wed Jan 3 13:14:40 2018 read 23031042 cycles
Wed Jan 3 13:15:14 2018 cycles contain 64089728 unique relations
Wed Jan 3 13:49:41 2018 read 64089728 relations
Wed Jan 3 13:52:03 2018 using 20 quadratic characters above 4294917296
Wed Jan 3 13:57:39 2018 building initial matrix
Wed Jan 3 14:09:31 2018 memory use: 8680.3 MB
Wed Jan 3 14:09:40 2018 read 23031042 cycles
Wed Jan 3 14:09:43 2018 matrix is 23026809 x 23031042 (7496.0 MB) with weight 2179834698 (94.65/col)
Wed Jan 3 14:09:43 2018 sparse part has weight 1711677031 (74.32/col)
Wed Jan 3 14:16:50 2018 filtering completed in 3 passes
Wed Jan 3 14:16:54 2018 matrix is 22923906 x 22924106 (7476.6 MB) with weight 2173903730 (94.83/col)
Wed Jan 3 14:16:54 2018 sparse part has weight 1707775293 (74.50/col)
Wed Jan 3 14:17:48 2018 matrix starts at (0, 0)
Wed Jan 3 14:17:51 2018 matrix is 22923906 x 22924106 (7476.6 MB) with weight 2173903730 (94.83/col)
Wed Jan 3 14:17:51 2018 sparse part has weight 1707775293 (74.50/col)
Wed Jan 3 14:17:51 2018 saving the first 48 matrix rows for later
Wed Jan 3 14:17:55 2018 matrix includes 64 packed rows
Wed Jan 3 14:17:57 2018 matrix is 22923858 x 22924106 (7142.7 MB) with weight 1764421261 (76.97/col)
Wed Jan 3 14:17:57 2018 sparse part has weight 1643180859 (71.68/col)
Wed Jan 3 14:17:58 2018 using block size 8192 and superblock size 294912 for processor cache size 3072 kB
Wed Jan 3 14:19:57 2018 commencing Lanczos iteration (2 threads)
Wed Jan 3 14:19:57 2018 memory use: 6034.5 MB
Wed Jan 3 14:23:49 2018 linear algebra at 0.0%, ETA 936h37m
Wed Jan 3 14:25:03 2018 checkpointing every 30000 dimensions
Wed Jan 3 16:46:03 2018 error: corrupt state, please restart from checkpoint[/CODE]

RichD 2018-01-04 01:13

If you are already into Block Lanczos, one option to try (back up the folder first) is:
[CODE]./msieve -v -t 2 -ncr skip_matbuild=1 -nc3[/CODE]
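
For the backup step, one possible approach is to copy the whole job folder to a timestamped sibling directory before retrying. This is a rough sketch: the function name, the backup naming scheme, and the placeholder file in the demo are all assumptions for illustration, and the demo runs against a temporary directory instead of a real job folder.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def backup_job_dir(job_dir):
    """Copy job_dir to a sibling named '<name>.backup-<timestamp>' and return the new path.

    Raises FileExistsError if a backup with the same timestamp already exists.
    """
    job_dir = Path(job_dir)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = job_dir.with_name(f"{job_dir.name}.backup-{stamp}")
    shutil.copytree(job_dir, target)
    return target


# Demo on a throwaway directory; point backup_job_dir at the real job folder instead.
with tempfile.TemporaryDirectory() as scratch:
    job = Path(scratch) / "job"
    job.mkdir()
    (job / "checkpoint.placeholder").write_bytes(b"example")  # stand-in for the real files
    copy = backup_job_dir(job)
    print(sorted(p.name for p in copy.iterdir()))
```

Copying the folder needs as much free disk space again as the job currently uses, which can be substantial for a matrix of this size.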


All times are UTC.