#166
"Carlos Pinho"
Oct 2011
Milton Keynes, UK
3·17·97 Posts

Weren't you complaining about your internet connection dropping? How often does the server now send out a WU? As I said previously, NFS@Home uses a Q range of 2k, but I wouldn't mind having it at double that size (4k).

Last fiddled with by pinhodecarlos on 2019-06-02 at 18:03
#167
"Curtis"
Feb 2005
Riverside, CA
2⁸×19 Posts

I was, but the machines subject to that drop have more than 4 cores, so Tom's --override trick lets me be reasonably sure I can keep my own workunits under 30 minutes at a Q-range of 2000.

To me, the tradeoff is time wasted starting up las for each WU versus time wasted on a partially completed WU when someone pauses CADO for other work. There seems to be a general consensus that longer WUs are acceptable, so I'll wait a day for anyone to opine that we're OK as-is, and then I suppose I'll double the WU length to 2000.
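The tradeoff described here can be sketched numerically. This is a rough model, not a measurement: the 30-second las startup time and the per-WU sieving times are illustrative assumptions.

```python
# Rough model of the workunit-size tradeoff: per-WU startup overhead
# versus work discarded when a client is paused mid-WU.
# All timing numbers here are illustrative assumptions, not measurements.

def overhead_fraction(wu_minutes, startup_seconds=30.0):
    """Fraction of wall time lost to restarting las for each workunit."""
    total = wu_minutes * 60.0 + startup_seconds
    return startup_seconds / total

def avg_loss_on_pause(wu_minutes):
    """Sieving time discarded when a client is paused mid-WU:
    on average, half a workunit's worth of work is lost."""
    return wu_minutes / 2.0

for wu in (15, 30):  # e.g. Q-range 1000 vs 2000, at assumed speeds
    print(f"{wu:2d}-min WU: startup overhead {overhead_fraction(wu):.2%}, "
          f"avg loss per pause {avg_loss_on_pause(wu):.1f} min")
```

Doubling the WU length halves the startup overhead but doubles the expected loss when a WU is abandoned, which is why the choice depends on how often clients get paused.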
#168
Sep 2008
Kansas
17×199 Posts

Actually, I was thinking of what happens as the work completes. Many threads wait for the last thread to finish; the client then packages up the results, compresses them, uploads, cleans up, and requests a new WU. I'm seeing ten steps between each of my "Running ..." commands.

Edit: Please don't take me as complaining. I am simply pointing out a possible abnormality.

Last fiddled with by RichD on 2019-06-02 at 19:56
#169
"Curtis"
Feb 2005
Riverside, CA
2⁸×19 Posts

Those steps seem like a pretty good argument for running two clients per socket, memory permitting. If one client is in admin-task mode, the other will be using most of the CPU; when both run, HT balances things out.

It's also a good argument for longer WUs, as you point out: less time spent with an empty pool of Q's at the end of each WU. There is a scheduled power outage in my building overnight Thursday night / Friday morning (US Pacific time). Since the server will be restarting then anyway, I'll change the Q-range to 2000 at that time.
#170
"Seth"
Apr 2019
293 Posts

The server seems to have been dead for the last ~60 minutes. In the logs I see:

Code:
PID22558 2019-06-03 00:07:23,776 Info:Lattice Sieving: Adding workunit 2330L.c207_sieving_15254000-15255000 to database
PID22558 2019-06-03 00:07:23,779 Debug:Lattice Sieving: Return code is: 134
PID22558 2019-06-03 00:07:23,779 Debug:Lattice Sieving: stderr is: b'code BUG() : condition slice_start != NULL failed in realloc_slice_start at /home/nfsforluke/cado/sieve/bucket.cpp:159 -- Abort\nAborted\n'
PID22558 2019-06-03 00:07:23,780 Error:Lattice Sieving: Program run on instance-1.66ad9519 failed with exit code 134
PID22558 2019-06-03 00:07:23,780 Error:Lattice Sieving: Stderr output (last 10 lines only) follow (stored in file <HIDDEN_PATH>/cado-nfs/2330Ljob/2330L.c207.upload/2330L.c207_sieving_15247000-15248000.7nc9k5rh.stderr0):
PID22558 2019-06-03 00:07:23,780 Error:Lattice Sieving: code BUG() : condition slice_start != NULL failed in realloc_slice_start at /home/nfsforluke/cado/sieve/bucket.cpp:159 -- Abort
PID22558 2019-06-03 00:07:23,780 Error:Lattice Sieving: Aborted
PID22558 2019-06-03 00:07:23,780 Error:Lattice Sieving:
PID22558 2019-06-03 00:07:23,780 Error:Lattice Sieving: Exceeded maximum number of failed workunits, maxfailed=100
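Exit code 134 in that log follows the usual POSIX shell convention of 128 plus the signal number: the failed BUG() assertion calls abort(), which raises SIGABRT (signal 6), giving 128 + 6 = 134. A quick sketch of the decoding:

```python
import signal

def decode_exit_code(code):
    """Decode a shell-style exit code: values above 128 conventionally
    mean the process was killed by signal (code - 128)."""
    if code > 128:
        return f"killed by {signal.Signals(code - 128).name}"
    return f"exited with status {code}"

print(decode_exit_code(134))  # the code from the log: 128 + SIGABRT(6)
```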
#171
Banned
"Luigi"
Aug 2002
Team Italia
2⁴×7×43 Posts

Quote:
Code:
ERROR:root:Upload failed, URL error: <urlopen error [Errno 111] Connection refused>
ERROR:root:Waiting 10.0 seconds before retrying (I have been waiting since 9830.0 seconds)
#172
"Carlos Pinho"
Oct 2011
Milton Keynes, UK
3·17·97 Posts

CPR
#173
"Curtis"
Feb 2005
Riverside, CA
2⁸×19 Posts

CADO has a couple of safeties: the maximum number of failed workunits is 100, and the maximum number of timed-out workunits is 100. I've increased both of these.

It seems Luke's client failed somehow, sent error codes back to the server, and after 100 such failed WUs the server quit. I increased the WU size to 2000 while I was in there.
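The limits being raised here live in the server-side parameter file. The `maxfailed` name matches the server's own log line ("Exceeded maximum number of failed workunits, maxfailed=100"), but the exact section prefixes below are assumptions and may differ between CADO-NFS versions, so treat this as an illustrative sketch rather than a verified config:

```
# Illustrative only -- check your CADO-NFS snapshot's parameter docs.
tasks.maxfailed = 1000      # give up after this many failed workunits
tasks.maxtimedout = 1000    # give up after this many timed-out workunits
tasks.sieve.qrange = 2000   # workunit size: range of special-q per WU
```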
#174
"Curtis"
Feb 2005
Riverside, CA
2⁸×19 Posts

Luke's client is now sending back crash reports:

Code:
code BUG() : condition rc == 0 failed in malloc_aligned at /home/nfsforluke/cado/utils/memory.c:188 -- Abort

They're coming every 10 seconds, so even with a failed-workunit setting of 1000 the server will die again in about 3 hours. I'm taking it back down in hopes Luke can kill his client. I finish class at 9am Pacific time and will restart the server then.
#175
"Luke Richards"
Jan 2018
Birmingham, UK
2⁵×3² Posts

Apologies for this, folks - not sure what's happened! All I can say is that I stopped the client and restarted it, and then the errors started happening. When the server is back up, if the errors are still popping up I'll just create a new VM instance and get to work on there. Hopefully a fresh instance will be error-free.
#176
"Curtis"
Feb 2005
Riverside, CA
2⁸×19 Posts

I've restarted the server. Growing pains of team-CADO work, I suppose.