mersenneforum.org (https://www.mersenneforum.org/index.php)
-   CADO-NFS (https://www.mersenneforum.org/forumdisplay.php?f=170)
-   -   Team sieve for Kosta C198 (https://www.mersenneforum.org/showthread.php?t=25492)

EdH 2020-05-07 02:52

[QUOTE=axn;544773]Server unreachable for 40 minutes. Problem at my end or server?[/QUOTE]
All my machines are suffering, too.

VBCurtis 2020-05-07 04:10

oops! My home desktop crashed. It seems it somehow took down the server (I must've forgotten to run cado within screen). It'll be back up momentarily.
EDIT: Nope, that was merely a coincidence. I get the error "too many failed workunits (max 100)".
This is not the max-timed-out error; that setting, tasks.maxtimedout, is set to 5000, which is plenty.

It seems the buckets-full errors Ed gets on a few clients added up. I checked the params.c90 file for a setting to loosen this limit and let the factorization keep going, but I may need to build a new job in a new folder, starting at the Q where we left off. More shortly.
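For anyone hitting the same wall: the two server-side caps mentioned above live in the job's parameter file. A minimal sketch follows; tasks.maxtimedout is the setting named above, while the name of the failed-workunit cap, tasks.maxfailed, is my assumption from the error text, so verify both against your CADO-NFS version's parameter list:

```
# Server-side workunit limits in a CADO-NFS parameter file (sketch).
tasks.maxtimedout = 5000   # cap on timed-out workunits; the value this job already uses
tasks.maxfailed = 1000     # assumed name for the failed-workunit cap (error text says default is 100)
```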

Re-Edit: started CADO anew at Q=57.66M. Building the factor base now; it should be up for workunits in 10-15 minutes.

axn 2020-05-07 11:52

More problems. Just now, the clients are unable to connect to the server.

EdH 2020-05-07 12:40

[QUOTE=axn;544790]More problems. Just now, the clients are unable to connect to the server.[/QUOTE]
When I woke up the slumberers this morning, they all got in each other's way, because they hadn't stopped gracefully last night due to the outage. The new run by Curtis might have compounded my local troubles, since the machines (or at least some of them) wanted a new roots1 file.

I have removed all my machines entirely and hope this allows the server to catch up. I will check later to see whether I should add any back. I will also have to re-evaluate my scripts to lessen the load.

Although I previously chose to ignore the "full" issue, I did actually address it when I found it, by installing the latest dev version on those machines; I also upgraded some others. However, I have found that the latest dev slows my machines by over 10%, and this is repeatable: WUs that averaged 14.5 minutes took 16.5; those that took 27 minutes took 33; and so on. Going back to the earlier installs brought the times back down. Through all of this I did keep an eye out for the "full" issues and thought I had minimized them. I must not have minimized them enough.

My apologies for breaking the server. I hope it is back up now that my machines have been removed.

axn 2020-05-07 13:15

:smile: Well, FWIW, things are running fine now. Maybe you can start adding the clients back in a staggered fashion?

EdH 2020-05-07 14:31

[QUOTE=axn;544795]:smile: Well, FWIW, things are running fine now. Maybe you can start adding the clients back in a staggered fashion?[/QUOTE]
Done. I think they are all back up properly. I'll try to keep an eye on this thread. . .

(I had to distribute the roots1 file locally. Some of my machines wanted to d/l the 200+MB file at about 5kB/s! 40000 seconds!?! Not sure what's going on there.)

VBCurtis 2020-05-07 15:52

The server is on a university network, and I've experienced 100Mbit speeds downloading relations files from it to my home desktop.

Bummer about the speed loss due to updating software!

Things look good presently from my end. Q=59.9M, 5.8M new relations since the restart last night.

EdH 2020-05-07 22:08

[QUOTE=VBCurtis;544808]The server is on a university network, and I've experienced 100Mbit speeds downloading relations files from it to my home desktop.

Bummer about the speed loss due to updating software!

Things look good presently from my end. Q=59.9M, 5.8M new relations since the restart last night.[/QUOTE]Well, I found two more of my machines with "most_full" issues and updated CADO-NFS on both. They immediately slowed by 2+ minutes per run. I'm left choosing between errors and slower progress. If the errors compound into server stoppages, that is the bigger issue, so I'm opting for slower running whenever I come across these "exit code 134" / "most_full" errors.

EdH 2020-05-07 22:38

From some digging through previous logs, I have come to the idea (possibly false) that the slowdown actually comes from CADO-NFS adjusting some internal values to prevent the "most_full" conditions from arising. This comes at a cost to performance.

VBCurtis 2020-05-07 22:40

If your errors are related to the bkmult buckets-full condition, then the lost speed may be partly illusory: your old version may either have run fast with a smaller bucket, or crashed and had to redo a whole workunit on a larger setting. I've set bkmult=1.10 or 1.12 on 190+ digit jobs, because even newer versions would run into buckets-full, increase bkmult, and restart on a Q. If that happens a bunch, I've found it's faster to just set bkmult from the get-go (I did not set it on this job).
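In params-file terms, pre-setting the multiplier is a one-line addition. The exact option path below is a guess on my part and may differ between CADO-NFS versions (las itself takes it as a -bkmult command-line argument), so treat this as a sketch:

```
# Start las with a larger bucket multiplier so it does not have to grow
# it and restart a special-q mid-run; the option path is an assumption.
tasks.sieve.las.bkmult = 1.10
```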

Regardless of the cause of the error: if that machine crashed on roughly one workunit in every eleven, but was 10% faster, your new copy of CADO is just as fast as the old one, since it does 10 slow WUs in the time the old copy attempted 11 but failed on one (at least if the failure consumed nearly the whole workunit's time).
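A quick back-of-the-envelope check of that break-even argument (illustrative numbers only, not measurements from this job):

```python
# Effective throughput = successful workunits per unit of wall time.
def throughput(wu_time, attempts, failures):
    return (attempts - failures) / (attempts * wu_time)

# Old client: 10% faster per WU, but loses one whole WU out of every 11.
fast_flaky = throughput(wu_time=1.00, attempts=11, failures=1)

# New client: 10% slower per WU, but never fails.
slow_reliable = throughput(wu_time=1.10, attempts=10, failures=0)

# Both work out to 10/11 of a WU per unit time: the clients break even.
print(fast_flaky, slow_reliable)
```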

I hope that makes you feel a bit better!

I saw errors most often on #40 and #53. I saw 53 today, so I bet you caught that one today too.

EdH 2020-05-08 02:45

[QUOTE=VBCurtis;544850]If your errors are related to the bkmult buckets-full condition, then the lost speed may be partly illusory: your old version may either have run fast with a smaller bucket, or crashed and had to redo a whole workunit on a larger setting. I've set bkmult=1.10 or 1.12 on 190+ digit jobs, because even newer versions would run into buckets-full, increase bkmult, and restart on a Q. If that happens a bunch, I've found it's faster to just set bkmult from the get-go (I did not set it on this job).

Regardless of the cause of the error: if that machine crashed on roughly one workunit in every eleven, but was 10% faster, your new copy of CADO is just as fast as the old one, since it does 10 slow WUs in the time the old copy attempted 11 but failed on one (at least if the failure consumed nearly the whole workunit's time).

I hope that makes you feel a bit better!

I saw errors most often on #40 and #53. I saw 53 today, so I bet you caught that one today too.[/QUOTE]All my errors came early, between 13 and 27 seconds in. #40 was one for sure, but I don't remember whether #53 was. I had two or three others besides #40.

I'll try to remember to check #53 for sure tomorrow. I changed #49 from an i5 to an i7 today, to squeeze a little more out of it. #49 was the slowest of my machines at about 33 minutes/WU; with a brand new CADO-NFS install, it's now running at around 27 minutes/WU. Not hugely impressive, but every little bit helps.


All times are UTC.
