Hello all!
I also have a "corruption" problem with msieve 1.46, both with the official SourceForge binary and with Jeff's 64-bit ones. When I build the matrix for a c161 with 44 million unique relations, I get the message "error: column too large; corrupt file?" (it isn't printed in the log file, but it appears after the first filtering).

I have resieved a bit: same problem with the different binaries. Surprisingly, everything went fine with msieve 1.45 (GPU-enabled version). Apart from the abnormal termination with msieve 1.46, everything in the log file is exactly the same (number of cycles, matrix size up to the break...).

I'll do the linear algebra phase with msieve 1.45, but I would have liked to test the "large blocks" option of the 64-bit binaries!

Kind regards and thanks!
Philippe |
Philippe: this is a known (minor) problem and is fixed in the latest SVN.
Jeff: your problem is not the same as Philippe's, and Serge sees it too on a very big (non-MPI) run. In his case I'm not confident about the stability of the underlying machine, but that excuse won't fly on your cluster. I guess it really is something in the code. Serge has added a patch that keeps two savefiles, because when a restart failed immediately, restarting from the previous checkpoint allowed the LA to get past the failure point. Have you managed to get an MPI run to complete? |
My last msieve 1.46 run lasted 31 hours straight without a crash, but I changed two things. First, I reduced the machine's overclock: if 1.46 is faster, it will push the CPU harder and run hotter. Second, I shut down all scheduled tasks, because an application I had installed added a task that runs at 5 am. I suppose this solved my problem, but now I see different issues with version 1.46.
|
[QUOTE=jasonp;225364]Philippe: this is a known (minor) problem and is fixed in the latest SVN.
[/QUOTE] Hi Jason ! Thanks for answering. I'll compile the latest SVN (with cygwin) and try it again ! Regards. Philippe |
Carlos: interesting! v1.46 has more multithreading in the LA and so will probably push the memory bus harder than v1.45 did.
|
[quote=jasonp;225368]Carlos: interesting! v1.46 has more multithreading in the LA and so will probably push the memory bus harder than v1.45 did.[/quote]
That also happened to me when the last llr version came out: it was faster and ran hotter. Meanwhile, and this is a stupid question: my last msieve run went well, but I never had to resume the LA; it was "-nc -v -t 4" all the way through. I only had crashes when resuming the LA with the "-nc2 -nc3 -ncr -v -t 4" flags. Could version 1.46 have a flag issue? I have two batch files: one performs the NFS combining (-nc) and the other resumes the LA from the last checkpoint and finishes the square root (-nc2 -nc3 -ncr). Maybe it's a -nc2 issue because I have -nc2 and -nc3 together!? |
[QUOTE=em99010pepe;225382]I only had crashes when resuming the LA with the "-nc2 -nc3 -ncr -v -t 4" flags. Could version 1.46 have a flag issue? [/QUOTE]
I start the LA with "-nc2 -nc3 -v -t 4" but resume with "-ncr -nc3 -v -t 4" without the -nc2. |
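The two invocations described above can be written out side by side. This is only a sketch: the helper name `la_command` and the binary name `msieve` are assumptions for illustration, and the flags are exactly the ones quoted in the post. It builds the argument lists without running anything.

```python
def la_command(resume, threads=4, binary="msieve"):
    """Build the argument list for starting or resuming the linear algebra.

    `binary` is a hypothetical executable name; the flags are copied from
    the post: "-nc2 -nc3 -v -t 4" to start, "-ncr -nc3 -v -t 4" to resume.
    """
    first_flag = "-ncr" if resume else "-nc2"
    return [binary, first_flag, "-nc3", "-v", "-t", str(threads)]


# Fresh start of the LA, followed by the square root
print(" ".join(la_command(resume=False)))
# Resume from the last checkpoint, without -nc2
print(" ".join(la_command(resume=True)))
```

Keeping the start and resume cases in one helper makes it harder to end up with a batch file that passes both -nc2 and -ncr.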
[quote=frmky;225385]I start the LA with "-nc2 -nc3 -v -t 4" but resume with "-ncr -nc3 -v -t 4" without the -nc2.[/quote]
But why does resume work with v1.44 and not with v1.46? |
'-nc2 -ncr' is redundant, but -ncr sets a flag that automatically performs a restart.
I don't know why the older version worked, but the older version had less multithreading than the current one... Could you be more specific about what else you saw that wasn't working? |
[quote=Jeff Gilchrist;225357]*UPDATE*: Crap, when I try to restart it gives me the corrupt message immediately:
[code]restarting at iteration 9329 (dim = 589938)
error: corrupt state, please restart from checkpoint
[/code][/quote]
Yep, exactly like what I had recently with the unpatched rel. 1.46. (I had that when restarting from a .chk file with dim = 142[U]9926[/U].) Note that dim = 58[U]9938[/U], and after 1 iteration (which is 64 dims) the orthogonality check is invoked (because it is set to run every 10000 dims) and immediately fails.

I also have a bootstrap Perl script, so what I found was 50 restarts in the log. I did have my daily backup of a .chk file, so I lost only a couple of hours.

In my code, I set the dump_interval to be divisible by check_interval (this is a palliative treatment; it dodges the bug rather than removing it). And it works; I just passed iteration 10000000 out of 16092101. ETA is 8 days away. Home-style LA still ruleZ. :razz: |
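The arithmetic behind "fails after 1 iteration" can be sketched as follows. This is not msieve's code: it assumes each iteration adds exactly 64 dims and that the orthogonality check fires when the dim count reaches the next multiple of 10000, as the post describes. The function names are hypothetical; `dump_interval` and `check_interval` are the names used in the post.

```python
def iterations_until_check(dim, dims_per_iter=64, check_interval=10000):
    """Iterations needed after a restart at `dim` before the dim count
    reaches the next multiple of `check_interval` (assumed trigger)."""
    next_check = (dim // check_interval + 1) * check_interval
    remaining = next_check - dim
    # Integer ceiling division: iterations of `dims_per_iter` dims each
    return (remaining + dims_per_iter - 1) // dims_per_iter


def aligned_dump_interval(wanted, check_interval=10000):
    """Round `wanted` up to a multiple of `check_interval`, so that
    dump_interval is divisible by check_interval."""
    return (wanted + check_interval - 1) // check_interval * check_interval


# Restart point from the quoted log: 62 dims short of 590000
print(iterations_until_check(589938))
# A hypothetical dump interval of 25000 dims rounded up to the check grid
print(aligned_dump_interval(25000))
```

Under these assumptions a checkpoint written just below a check boundary is tested almost immediately after a restart, which is why aligning the two intervals avoids the symptom without fixing the underlying bug.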
[QUOTE=Phil MjX;225367]Hi Jason !
Thanks for answering. I'll compile the latest SVN (with cygwin) and try it again ! Regards. Philippe[/QUOTE]
Hi!
Cygwin didn't want to allocate enough memory for the LA phase of this fairly big job (even with 8 GB of RAM), so I have installed msieve 1.47 (SVN 376) on Kubuntu.

(For Linux noobs using WUBI: I'd advise NOT trying to install the Nvidia driver by hand by following two-year-old website instructions, manually killing the X server and answering yes, yes and yes to all of the install warnings... :big grin: It destroyed Kubuntu (unrecoverable even in recovery mode) and also corrupted GRUB and my access to Windows 7... After a few hours with the repair CD and the Kubuntu .iso, installing everything from scratch and recompiling gmp, gmp-ecm and msieve, I am back!)

The point was to test the effect of the LARGEBLOCKS flag on the completion time of a 6M-column matrix.

With msieve 1.47 x86_64: block size 65536 for a cache of 6 MB (Core 2 Quad 9550: shouldn't it be 3 MB per core = 12 MB?), ETA 62.5 hours after 0.3%.
With msieve 1.47 x86_64 LARGEBLOCKS=1: block size 262144 for a cache of 6 MB, ETA 50 hours after 0.3%.

A great improvement for this one (is the wrong cache size widening the gap?).

Thanks for always improving msieve!
Kind regards
Philippe |
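For reference, the size of the improvement reported above works out as follows. The sketch uses only the numbers from the post (ETAs of 62.5 and 50 hours, block sizes 65536 and 262144); the function name is hypothetical, and both ETAs were read after only 0.3% of the run, so they are early estimates rather than measured completion times.

```python
def eta_comparison(baseline_hours, new_hours):
    """Return (speedup factor, percentage of time saved) for two ETAs."""
    speedup = baseline_hours / new_hours
    saved_pct = 100.0 * (baseline_hours - new_hours) / baseline_hours
    return speedup, saved_pct


speedup, saved_pct = eta_comparison(62.5, 50.0)
print(f"speedup: {speedup:.2f}x, time saved: {saved_pct:.0f}%")

# Ratio between the two reported block sizes
print(262144 // 65536)
```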