Trouble restarting large job
I ran -nc2 on a 64G machine, and want to transfer the matrix and checkpoint to a faster machine with 32G memory to do the actual linear algebra.
But, on two separate machines and two attempts per machine, including restarting at an earlier checkpoint, I get something like
[code]
Fri Mar 31 20:51:19 2017 commencing Lanczos iteration (6 threads)
Fri Mar 31 20:51:19 2017 memory use: 16253.3 MB
Fri Mar 31 20:51:20 2017 restarting at iteration 626 (dim = 39615)
Fri Mar 31 20:53:58 2017 linear algebra at 0.1%, ETA 1106h20m
Fri Mar 31 20:55:29 2017 error: corrupt state, please restart from checkpoint
[/code]
I've tried restarting on the 64G machine I was using originally, and that ran for eight hours without giving such a message. |
Is it trying to use a different block size on the 32gb machines?
|
[QUOTE=henryzz;456135]Is it trying to use a different block size on the 32gb machines?[/QUOTE]
Different superblock size, though same block size:
[code]
tractor (64G)
Sun Apr 2 22:07:29 2017 using block size 8192 and superblock size 983040 for processor cache size 10240 kB

pumpkin (32G, i7-4930K)
Fri Mar 31 20:46:40 2017 sparse part has weight 4557683460 (118.90/col)
Fri Mar 31 20:46:40 2017 using block size 8192 and superblock size 1179648 for processor cache size 12288 kB

butternut (32G, i7-5820K)
Fri Mar 31 20:08:38 2017 using block size 8192 and superblock size 1179648 for processor cache size 12288 kB
Fri Mar 31 20:13:16 2017 commencing Lanczos iteration (6 threads)
[/code]
I'm currently redoing the processing on butternut in the hope I can run the whole job there, but RelProcTime is about 26 hours so will not have results immediately. |
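For anyone comparing machines the same way, here is a rough Python sketch that pulls the block size, superblock size and cache size out of the "using block size" log lines quoted in this thread. The helper names (`parse_sizes`, `superblock_per_cache_kb`) are made up for illustration. In these particular logs the superblock size works out to 96 times the reported cache size in kB on every machine; that is only an observation from the lines in this thread, not a claim about how msieve chooses the value internally.

```python
import re

# Log lines copied from the posts in this thread (tractor, pumpkin, and the
# 3072 kB machine from the later report).
LOG_LINES = [
    "Sun Apr 2 22:07:29 2017 using block size 8192 and superblock size 983040 for processor cache size 10240 kB",
    "Fri Mar 31 20:46:40 2017 using block size 8192 and superblock size 1179648 for processor cache size 12288 kB",
    "Wed Jan 3 14:17:58 2018 using block size 8192 and superblock size 294912 for processor cache size 3072 kB",
]

# Pattern matching the wording of the log line as it appears above.
SIZE_PATTERN = re.compile(
    r"using block size (\d+) and superblock size (\d+) "
    r"for processor cache size (\d+) kB"
)


def parse_sizes(line):
    """Return (block, superblock, cache_kb) as ints, or None if no match."""
    match = SIZE_PATTERN.search(line)
    if match is None:
        return None
    block, superblock, cache_kb = (int(group) for group in match.groups())
    return block, superblock, cache_kb


def superblock_per_cache_kb(line):
    """Return superblock size divided by cache size in kB, or None."""
    sizes = parse_sizes(line)
    if sizes is None:
        return None
    _, superblock, cache_kb = sizes
    return superblock / cache_kb


if __name__ == "__main__":
    for line in LOG_LINES:
        print(parse_sizes(line), superblock_per_cache_kb(line))
```

So the different superblock sizes on the 32G machines follow directly from their larger reported cache (12288 kB versus 10240 kB), while the block size stays at 8192 everywhere.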
We are experiencing the same error when starting a job.
We have tried a binary we compiled ourselves and one from someone else that is known to work. We have also tried several different target densities. FWIW, we ran msieve.dat through "remdups" prior to starting the job.
[CODE]
Wed Jan 3 13:14:14 2018 commencing linear algebra
Wed Jan 3 13:14:40 2018 read 23031042 cycles
Wed Jan 3 13:15:14 2018 cycles contain 64089728 unique relations
Wed Jan 3 13:49:41 2018 read 64089728 relations
Wed Jan 3 13:52:03 2018 using 20 quadratic characters above 4294917296
Wed Jan 3 13:57:39 2018 building initial matrix
Wed Jan 3 14:09:31 2018 memory use: 8680.3 MB
Wed Jan 3 14:09:40 2018 read 23031042 cycles
Wed Jan 3 14:09:43 2018 matrix is 23026809 x 23031042 (7496.0 MB) with weight 2179834698 (94.65/col)
Wed Jan 3 14:09:43 2018 sparse part has weight 1711677031 (74.32/col)
Wed Jan 3 14:16:50 2018 filtering completed in 3 passes
Wed Jan 3 14:16:54 2018 matrix is 22923906 x 22924106 (7476.6 MB) with weight 2173903730 (94.83/col)
Wed Jan 3 14:16:54 2018 sparse part has weight 1707775293 (74.50/col)
Wed Jan 3 14:17:48 2018 matrix starts at (0, 0)
Wed Jan 3 14:17:51 2018 matrix is 22923906 x 22924106 (7476.6 MB) with weight 2173903730 (94.83/col)
Wed Jan 3 14:17:51 2018 sparse part has weight 1707775293 (74.50/col)
Wed Jan 3 14:17:51 2018 saving the first 48 matrix rows for later
Wed Jan 3 14:17:55 2018 matrix includes 64 packed rows
Wed Jan 3 14:17:57 2018 matrix is 22923858 x 22924106 (7142.7 MB) with weight 1764421261 (76.97/col)
Wed Jan 3 14:17:57 2018 sparse part has weight 1643180859 (71.68/col)
Wed Jan 3 14:17:58 2018 using block size 8192 and superblock size 294912 for processor cache size 3072 kB
Wed Jan 3 14:19:57 2018 commencing Lanczos iteration (2 threads)
Wed Jan 3 14:19:57 2018 memory use: 6034.5 MB
Wed Jan 3 14:23:49 2018 linear algebra at 0.0%, ETA 936h37m
Wed Jan 3 14:25:03 2018 checkpointing every 30000 dimensions
Wed Jan 3 16:46:03 2018 error: corrupt state, please restart from checkpoint
[/CODE] |
If you are already into Block Lanczos, one option to try (back up the folder first) is:
[CODE]./msieve -v -t 2 -ncr skip_matbuild=1 -nc3[/CODE] |
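For the "back up the folder first" step, something along these lines would do; it is only a sketch. The function name `backup_job_dir` and both directory paths are placeholders, and it assumes everything the job needs (matrix, checkpoint, msieve.dat) lives in one folder. It refuses to overwrite an existing backup so an older good copy is not clobbered by accident.

```python
import shutil
from pathlib import Path


def backup_job_dir(job_dir, backup_dir):
    """Copy the whole job folder to backup_dir and return the backup path.

    Raises FileNotFoundError if job_dir is missing and FileExistsError if
    backup_dir already exists, so an earlier backup is never overwritten.
    """
    src = Path(job_dir)
    dst = Path(backup_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"job folder not found: {src}")
    if dst.exists():
        raise FileExistsError(f"backup already exists: {dst}")
    # copytree recreates the directory tree and copies every file in it.
    shutil.copytree(src, dst)
    return dst
```

Usage would be something like `backup_job_dir("myjob", "myjob.backup")` before running the restart command, with both names replaced by your own paths. With a matrix of several GB, check there is enough free disk space for a second copy first.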