mersenneforum.org  

Go Back   mersenneforum.org > Factoring Projects > Cunningham Tables

Old 2019-06-02, 18:02   #166
pinhodecarlos
"Carlos Pinho"
Oct 2011
Milton Keynes, UK

Weren't you complaining about your internet connection dropping? How often does the server now send out a WU? As I said previously, NFS@Home uses a Q range of 2k, but I wouldn't mind having it at double the size (4k).

Last fiddled with by pinhodecarlos on 2019-06-02 at 18:03
Old 2019-06-02, 18:35   #167
VBCurtis
"Curtis"
Feb 2005
Riverside, CA

I was, but the machines subject to that drop have more than 4 cores, so Tom's --override trick lets me be reasonably sure I can keep my own workunits under 30 minutes for a Q-range of 2000.
To me, the tradeoff is time wasted starting up las for each WU versus time wasted on a partially completed WU when someone pauses CADO for other work. There seems to be a general consensus that longer WUs are acceptable, so I'll wait a day for anyone to argue that we're OK as-is, and then I suppose I'll double the WU length to 2000.
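For what it's worth, the tradeoff can be sketched with a back-of-envelope model. Every number below (startup cost, time per special-q) is made up for illustration, not measured:

```python
# Back-of-envelope for the WU-length tradeoff: fixed per-WU cost
# (las startup + package/upload) vs. work lost when a partially
# completed WU is abandoned. All constants are hypothetical.
startup_s = 30.0   # assumed fixed overhead per WU
sec_per_q = 0.9    # assumed sieving time per special-q

def overhead_fraction(q_range):
    """Fraction of wall time spent on fixed per-WU overhead."""
    work = q_range * sec_per_q
    return startup_s / (startup_s + work)

def expected_loss_on_pause(q_range):
    """Seconds of work lost if a WU is interrupted at a uniformly
    random point: half a WU on average."""
    return 0.5 * q_range * sec_per_q

for q_range in (1000, 2000, 4000):
    print(q_range,
          round(overhead_fraction(q_range), 4),
          round(expected_loss_on_pause(q_range) / 60, 1), "min lost")
```

Doubling the Q-range halves the relative startup overhead but also doubles the expected loss on an interruption, which is the judgment call being made here.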
Old 2019-06-02, 19:50   #168
RichD
Sep 2008
Kansas

Quote:
Originally Posted by VBCurtis
That said, there may be some startup costs for each WU ...
Actually, I was thinking of what happens as the work completes. Many threads wait for the last thread to finish; then the client packages everything up, compresses it, uploads it, cleans up, and requests a new WU. I'm seeing ten steps between each of my "Running ..." commands.

Edit: Please don't take this as complaining. I am simply pointing out a possible abnormality.
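The per-WU serial tail being described can be sketched roughly like this. This is illustrative Python, not CADO's actual client code: sieving threads run in parallel, then a single thread packages and compresses the result while the other cores sit idle.

```python
# Sketch of the serial tail between workunits: parallel sieving
# followed by single-threaded packaging/compression. sieve_subrange
# is a stand-in for las working a sub-range of special-q.
import concurrent.futures
import gzip

def sieve_subrange(q0, q1):
    # stand-in for actual sieving output
    return f"relations for q in [{q0},{q1})".encode()

def run_workunit(q_start, q_end, threads=4):
    step = (q_end - q_start) // threads
    with concurrent.futures.ThreadPoolExecutor(threads) as pool:
        # parallel phase: all threads sieve; the WU is not done
        # until the slowest sub-range finishes
        parts = list(pool.map(
            lambda i: sieve_subrange(q_start + i * step,
                                     q_start + (i + 1) * step),
            range(threads)))
    # serial tail: package + compress (upload/cleanup omitted here)
    return gzip.compress(b"\n".join(parts))

blob = run_workunit(15254000, 15255000)
print(len(gzip.decompress(blob).splitlines()))  # 4 sub-range results
```

Longer WUs don't remove this tail, but they do amortize it over more sieving time per round trip.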

Last fiddled with by RichD on 2019-06-02 at 19:56
Old 2019-06-02, 21:34   #169
VBCurtis
"Curtis"
Feb 2005
Riverside, CA

Those steps seem like a pretty good argument for using two clients per socket, memory permitting. If one client is in admin-task mode, the other will be using most of the CPU. When both run, HT balances things out.

It's also a good argument for longer WUs, as you point out: less time spent with an empty pool of Q's at the end of each WU.

There is a scheduled power outage in my building overnight Thursday into Friday morning (US Pacific time). Since the server will be restarting then anyway, I'll change the Q-range to 2000 at that time.
Old 2019-06-03, 08:28   #170
SethTro
"Seth"
Apr 2019

The server seems to have been dead for the last ~60 minutes.

In the logs I see
Code:
PID22558 2019-06-03 00:07:23,776 Info:Lattice Sieving: Adding workunit 2330L.c207_sieving_15254000-15255000 to database
PID22558 2019-06-03 00:07:23,779 Debug:Lattice Sieving: Return code is: 134
PID22558 2019-06-03 00:07:23,779 Debug:Lattice Sieving: stderr is: b'code BUG() : condition slice_start != NULL failed in realloc_slice_start at /home/nfsforluke/cado/sieve/bucket.cpp:159 -- Abort\nAborted\n'
PID22558 2019-06-03 00:07:23,780 Error:Lattice Sieving: Program run on instance-1.66ad9519 failed with exit code 134
PID22558 2019-06-03 00:07:23,780 Error:Lattice Sieving: Stderr output (last 10 lines only) follow (stored in file <HIDDEN_PATH>/cado-nfs/2330Ljob/2330L.c207.upload/2330L.c207_sieving_15247000-15248000.7nc9k5rh.stderr0):
PID22558 2019-06-03 00:07:23,780 Error:Lattice Sieving: 	code BUG() : condition slice_start != NULL failed in realloc_slice_start at /home/nfsforluke/cado/sieve/bucket.cpp:159 -- Abort
PID22558 2019-06-03 00:07:23,780 Error:Lattice Sieving: 	Aborted
PID22558 2019-06-03 00:07:23,780 Error:Lattice Sieving: 	
PID22558 2019-06-03 00:07:23,780 Error:Lattice Sieving: Exceeded maximum number of failed workunits, maxfailed=100
Old 2019-06-03, 09:59   #171
ET_
Banned
"Luigi"
Aug 2002
Team Italia

Quote:
Originally Posted by SethTro
The server seems to have been dead for the last ~60 minutes.

In the logs I see
Code:
PID22558 2019-06-03 00:07:23,776 Info:Lattice Sieving: Adding workunit 2330L.c207_sieving_15254000-15255000 to database
PID22558 2019-06-03 00:07:23,779 Debug:Lattice Sieving: Return code is: 134
PID22558 2019-06-03 00:07:23,779 Debug:Lattice Sieving: stderr is: b'code BUG() : condition slice_start != NULL failed in realloc_slice_start at /home/nfsforluke/cado/sieve/bucket.cpp:159 -- Abort\nAborted\n'
PID22558 2019-06-03 00:07:23,780 Error:Lattice Sieving: Program run on instance-1.66ad9519 failed with exit code 134
PID22558 2019-06-03 00:07:23,780 Error:Lattice Sieving: Stderr output (last 10 lines only) follow (stored in file <HIDDEN_PATH>/cado-nfs/2330Ljob/2330L.c207.upload/2330L.c207_sieving_15247000-15248000.7nc9k5rh.stderr0):
PID22558 2019-06-03 00:07:23,780 Error:Lattice Sieving: 	code BUG() : condition slice_start != NULL failed in realloc_slice_start at /home/nfsforluke/cado/sieve/bucket.cpp:159 -- Abort
PID22558 2019-06-03 00:07:23,780 Error:Lattice Sieving: 	Aborted
PID22558 2019-06-03 00:07:23,780 Error:Lattice Sieving: 	
PID22558 2019-06-03 00:07:23,780 Error:Lattice Sieving: Exceeded maximum number of failed workunits, maxfailed=100
My clients can't connect, giving
Code:
ERROR:root:Upload failed, URL error: <urlopen error [Errno 111] Connection refused>
ERROR:root:Waiting 10.0 seconds before retrying (I have been waiting since 9830.0 seconds)
Old 2019-06-03, 11:09   #172
pinhodecarlos
"Carlos Pinho"
Oct 2011
Milton Keynes, UK

CPR
Old 2019-06-03, 14:12   #173
VBCurtis
"Curtis"
Feb 2005
Riverside, CA

CADO has a couple of safeties: the maximum number of failed workunits is 100, and the maximum number of timed-out workunits is 100.

I've increased both of these. It seems Luke's client failed somehow, sent error codes back to the server, and after 100 such failed WUs the server quit.

I increased WU size to 2000 while I was in there.
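For reference, a hedged sketch of what the relevant lines in a CADO-NFS parameters file might look like. The parameter names below are guesses inferred from the server's own log messages (maxfailed, maxtimedout) plus the standard qrange setting; the exact paths should be checked against the CADO-NFS documentation for your version:

```
# Sketch only; verify parameter names against your CADO-NFS version.
tasks.maxfailed = 1000      # server aborts after this many failed WUs
tasks.maxtimedout = 1000    # ... or this many timed-out WUs
tasks.sieve.qrange = 2000   # special-q per workunit ("WU length")
```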
Old 2019-06-03, 14:17   #174
VBCurtis
"Curtis"
Feb 2005
Riverside, CA

Luke's client is now sending back crash reports:
Code:
code BUG() : condition rc == 0 failed in malloc_aligned at /home/nfsforluke/cado/utils/memory.c:188 -- Abort
I think this is the same error Carlos had initially?
They're coming in every 10 seconds, so even with the failed-workunit limit raised to 1000, the server would die again in about 3 hours. I'm taking it back down in hopes Luke can kill his client.
I finish class at 9am Pacific time and will restart the server then.
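A quick sanity check on that estimate, one failed WU every 10 seconds against a limit of 1000 failures:

```python
# One crash report every 10 s; server quits at 1000 failed WUs.
failures_limit = 1000
seconds_per_failure = 10
hours = failures_limit * seconds_per_failure / 3600
print(round(hours, 2))  # → 2.78, i.e. roughly 3 hours
```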
Old 2019-06-03, 14:46   #175
lukerichards
"Luke Richards"
Jan 2018
Birmingham, UK

Apologies for this, folks; I'm not sure what happened!

All I can say is I stopped the client and restarted it and then the errors started happening.

When the server is back up, if the errors are still popping up I'll just create a new VM instance and get to work there. Hopefully a fresh instance will be error-free.
Old 2019-06-03, 14:52   #176
VBCurtis
"Curtis"
Feb 2005
Riverside, CA

I've restarted the server. Growing pains of team-CADO work, I suppose.
