mersenneforum.org  

Old 2019-07-16, 13:23   #320
R.D. Silverman
 

Quote:
Originally Posted by yoyo View Post
The workunit processing is not fully under the server's control; there are also volunteers in the game. For CN, HC, and GCW, 5 curves are put into one BOINC workunit and sent to a volunteer with a deadline of 5 days. If the volunteer doesn't return any result, it takes 5 days until the server recognises it. After 5 days the workunit is sent to someone else, who also might not return it. So some curves just need time.
Such resends also happen if a workunit comes back with an error; in that case it is likewise sent to someone else.
These resends don't lead to any progress.
In the meantime the server sends out other ECM workunits from its "unsent tasks" list, always oldest first.

If the "unsent tasks" list drains below 1000, it generates new tasks from one of the ECM input queues, choosing the input queue of the project that received the least computing power over the last 5 days.
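The dispatch policy yoyo describes (oldest task first; refill below 1000 tasks from the queue of whichever project got the least recent compute) can be sketched as follows. All names and data structures here are hypothetical illustrations, not actual BOINC server code:

```python
from collections import deque

REFILL_THRESHOLD = 1000   # server refills when unsent tasks drop below this

def pick_refill_project(projects):
    # Choose the project that received the least computing power
    # over the last 5 days.
    return min(projects, key=lambda p: p["credit_last_5_days"])

def next_workunit(unsent, projects):
    # Refill the unsent list from the chosen project's ECM input queue
    # when it drains below the threshold.
    if len(unsent) < REFILL_THRESHOLD:
        src = pick_refill_project(projects)
        for curves in src["queue"]:
            unsent.append({"project": src["name"], "curves": curves})
        src["queue"].clear()
    # Always hand out the oldest task first.
    return unsent.popleft() if unsent else None
```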
It might be efficacious to keep track of volunteers who repeatedly fail to return
results.

Also, under the current scheme it is possible that a given number may never finish. It gets close to finishing (say within 50 curves), those assignments fail, they get reassigned, they fail again, and so on. If the server kept track of which users return results quickly, it could hand out the reassignments of failed workunits to just those users. That way a number would not take more than (say) a week to finish the last 50 or so curves.
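The reliability-tracking idea could look something like this sketch. Every name here is hypothetical; nothing like this exists in the server:

```python
def reliable_clients(history, min_jobs=10, min_rate=0.9):
    # history maps client id -> (results returned, results assigned).
    # Only clients with a proven track record qualify.
    return {c for c, (returned, assigned) in history.items()
            if assigned >= min_jobs and returned / assigned >= min_rate}

def choose_client(task, candidates, history):
    # Route retries of previously failed tasks to reliable clients
    # only, so the last few curves of a number finish promptly.
    if task["failures"] > 0:
        reliable = [c for c in candidates if c in reliable_clients(history)]
        if reliable:
            candidates = reliable
    return candidates[0]
```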

5 days seems quite generous. Just one core on my 2.4GHz laptop finishes a curve (with B1 = 10^9, composite = 200 digits) in about 4500 seconds.
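As a quick sanity check of that figure, assuming the quoted 4500 s per curve and 5 curves per workunit:

```python
# Figures from the post: one curve at B1 = 10^9 on a ~200-digit
# composite takes about 4500 s on one 2.4GHz core.
SECONDS_PER_CURVE = 4500
CURVES_PER_WORKUNIT = 5
DEADLINE_DAYS = 5

work_hours = CURVES_PER_WORKUNIT * SECONDS_PER_CURVE / 3600
deadline_hours = DEADLINE_DAYS * 24
print(work_hours, deadline_hours)  # 6.25 vs 120: roughly a 19x margin
```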

The policy for assigning new numbers seems very reasonable. However, if a number is stalled owing to repeated failed assignments, then that particular project will accumulate almost no computing power during that period. Should not the next number to be started then get assigned to that project?

This is not what I see. I see numbers in some projects get stalled for an extended
period, but new numbers seemingly get started up in OTHER projects.

Perhaps my eyesight needs correction???

We see numbers stalled for more than a week [thus no computing
power applied] and no new numbers get started despite the dearth of compute power
over that period.

Of course, my perceptions might be wrong.
Old 2019-07-17, 22:47   #321
R.D. Silverman
 

Quote:
Originally Posted by R.D. Silverman View Post
BTW, does anyone know what is happening with 2,2102L? It had already been
sieved, now it seems to be sieving again. Of course sometimes relations fall short
and a number needs additional sieving, but the resieve effort for 2,2102L has been
going on for sufficiently long that I suspect this is not the case.
Any ideas? The resieve has taken sufficiently long that it can't be just
"additional sieving".
Old 2019-07-19, 14:27   #322
thome
 

Hi folks.

Thanks for your efforts in doing this computation with cado-nfs.

Just a few random thoughts that came after reading this thread:
  • the SIZEOF_P_R_VALUES and SIZEOF_INDEX (w/o underscores since commit 11913a722) are relevant only to filtering. Sieving doesn't care.
  • the server isn't resilient to the situation of clients failing repeatedly. We're aware of that, and this is indeed a nuisance when one wants to run a distributed computation. The reason is that we *want to know* when a client is failing like this. But obviously this doesn't make much sense if we're not baby-sitting the computation ourselves. The more resilient way would be to have the server blacklist those annoying clients.
  • the server can only cope with so many connections per second. One of the culprits is the sqlite3 back-end -- at least if you see things like "sqlite3: Operational Error", then don't look further. cado-nfs can be used with a mysql backend. Making the switch halfway is not trivial, but doable. Filesystems can also be a source of delays if you have millions of files.
  • there's not much info on crisis management with the database if it ever gets corrupt. Likewise, respawning WUs that were abandoned because of repeated failures is not a trivial task. I just committed shell fragments that I use for that purpose (f1f77c911).
  • the report page at http://factoring.cloudygo.com/ is nice; actually this is an acknowledged missing feature of cado-nfs (see this page on the tracker, for instance). This functionality doesn't exist in cado-nfs yet because we never invested time in it. But we're pretty happy to receive decent contributions to this end.
  • Quote:
    Originally Posted by SethTro View Post
    A couple of places in the code (las-parallel.cpp, bind_threads.sh, cpubinding.cpp) all might affect this. I'll have to dig into that :/
    Indeed (concerning las-parallel.cpp). If you compile with hwloc installed, and if you try "-t auto", las will automatically adjust to your machine. There are many other options (see "-t help" if you dare).
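For the client-blacklisting idea mentioned above, a server-side sketch might look like this. The names and the failure cutoff are hypothetical; cado-nfs had no such option as of this thread:

```python
from collections import Counter

MAX_FAILURES = 5          # hypothetical cutoff before a client is banned
failures = Counter()      # per-client consecutive failure count
blacklist = set()

def record_result(client_id, ok):
    # Reset the count on success; blacklist a client once it has
    # failed MAX_FAILURES times in a row, instead of letting it
    # keep poisoning the computation.
    if ok:
        failures[client_id] = 0
    else:
        failures[client_id] += 1
        if failures[client_id] >= MAX_FAILURES:
            blacklist.add(client_id)

def may_serve(client_id):
    # The workunit dispatcher would consult this before handing out work.
    return client_id not in blacklist
```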

Briefly put: we're well aware that cado-nfs has rough edges. It's not easy to get it to work on large projects when you're not familiar with the internals. On the other hand, contributions are welcome, and we're happy to help you understand some obscure features (preferably via the mailing list).
Old 2019-07-19, 15:30   #323
VBCurtis
 

thome-
Thank you very much for your comments, and time spent browsing this lengthy thread!
Unfortunately, most times CADO has failed to continue issuing workunits, it has done so without any error at all. Both the server terminal window and the log look fully normal, but the server is stalled. A pair of ctrl-c's issued to the server terminal kills CADO, and a restart using the snapshot file fixes whatever stalled the server. These stalls may be linked to a large user starting many clients at once, but we're talking maybe 20 new clients, not hundreds. One failure was due to a poisoned client, but that problem was easy to discover and correct by fixing the client and extending the number of bad results allowed before the server shuts down. A client-blacklist option would be welcome.

Questions concerning postprocessing:
The host currently has 64GB RAM and 200GB of swap on an NVMe SSD. Do you expect I will need to upgrade to 128GB of RAM? Does fast swap help enough on filtering to maybe not need the extra memory? The job is being run on a 1TB SSD, and the relations will take up ~300GB of disk.
My only data point is a C186 that ran 32/33LP. 675M relations filtered in 32GB without issue, and the matrix also ran within 32GB without swap. I should have found a C19x job to do before tackling this one, but sometimes interesting projects appear before we're perfectly ready!
Old 2019-07-19, 16:21   #324
thome
 

Quote:
Originally Posted by VBCurtis View Post
thome-
Thank you very much for your comments, and time spent browsing this lengthy thread!
Unfortunately, most times CADO has failed to continue issuing workunits, it has done so without any error at all. Both the server terminal window and the log look fully normal, but the server is stalled. A pair of ctrl-c's issued to the server terminal kills CADO, and a restart using the snapshot file fixes whatever stalled the server. These stalls may be linked to a large user starting many clients at once, but we're talking maybe 20 new clients, not hundreds.
Odd. I definitely remember seeing things like this when we were using the sqlite3 back-end, but not since I switched to preferring the mysql one.

Quote:
Originally Posted by VBCurtis View Post
One failure was due to a poisoned client, but that problem was easy to discover and correct by fixing the client and extending the number of bad results allowed before the server shuts down. A client-blacklist option would be welcome.
This is probably not very hard to do, but that would go with a handful of other things I'd like to see added to cado-nfs client/server eventually (one of them being failover servers for the client -- I have that in a branch, I may merge it at some point if I get to finish the work).

Quote:
Originally Posted by VBCurtis View Post
Questions concerning postprocessing:
The host currently has 64GB RAM and 200GB of swap on an NVMe SSD. Do you expect I will need to upgrade to 128GB of RAM? Does fast swap help enough on filtering to maybe not need the extra memory? The job is being run on a 1TB SSD, and the relations will take up ~300GB of disk.
My only data point is a C186 that ran 32/33LP. 675M relations filtered in 32GB without issue, and the matrix also ran within 32GB without swap. I should have found a C19x job to do before tackling this one, but sometimes interesting projects appear before we're perfectly ready!
I don't have a recent data point at this size, and I don't recall having tried to factor things with a 64GB node in the recent past.

From the rsa220 log files (more than 5 years ago, on a machine which was probably 3 years old by then):
- purge: 6h WCT, 62G RAM
- merge: 60h WCT, 195G RAM (but cado-nfs merge has evolved a lot recently!)
- replay: 4h WCT, 207G RAM
- lingen in linear algebra (done in early 2016, somewhat later than the rest) required 500G RAM (but I'm actively working on it this summer).

The RAM figures are VmPeak values, which may be of limited significance.

I actually still have the relations for that one. Time permitting, I may give the recent software a try to see how it fares with that data.
Old 2019-07-19, 17:56   #325
SethTro
 

Quote:
Originally Posted by thome View Post
Hi folks.
Thanks for your efforts in doing this computation with cado-nfs.
Thome, thanks for reading through our problems and adding your insight.


Quote:
  • there's not much info on crisis management with the database if it ever gets corrupt. Likewise, respawning WUs that were abandoned because of repeated failures is not a trivial task. I just committed shell fragments that I use for that purpose (f1f77c911).
I'm going to start saving daily copies of the database for cloudygo. At ~100MB each, I can store 180 days in 18GB and reduce this concern.

Thome, would you be interested in a new server flag that saves the last N days of the database? I'm imagining copying it every night and deleting any copies older than N days, or maybe saving a new copy every X WUs (so that during periods of no work the older copies aren't deleted).
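One possible shape for that nightly rotation, as a shell sketch (all paths and the retention window are illustrative, not cado-nfs options; `find -delete` assumes GNU or BSD find):

```shell
# Keep the last KEEP_DAYS daily copies of the workunit database.
backup_db() {
    db="$1"; dir="$2"; keep_days="$3"
    mkdir -p "$dir"
    # Date-stamped copy, e.g. wudb-2019-07-19.db
    cp "$db" "$dir/wudb-$(date +%F).db"
    # Drop copies older than the retention window.
    find "$dir" -name 'wudb-*.db' -mtime +"$keep_days" -delete
}
```

Run nightly from cron, e.g. `backup_db /path/to/wudb.db /path/to/backups 180`.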


Quote:
  • the report page at http://factoring.cloudygo.com/ is nice ; actually this is an acknowledged missing feature of cado-nfs. (see this page on the tracker, for instance). This functionality doesn't exist in cado-nfs yet because we never invested time in it. But we're pretty happy to receive decent contributions to this end.
The whole site is around 1000 lines of code (https://github.com/sethtroisi/factoring-ui); if you think that it, or something similar, would be interesting to add to the project, I'm happy to write up a proposal for it and work with you to get it committed.

    Quote:
    indeed (concerning las-parallel.cpp). If you compile with hwloc installed, and if you try "-t auto", you'll have las automatically adjust to your machine. There are many other options (see "-t help" if you dare).
I'm actively investigating this, I might have some questions on the mailing list later.

Quote:
Originally Posted by thome View Post
Briefly put: we're well aware that cado-nfs has rough edges. It's not easy to get it to work on large projects when you're not familiar with the internals. On the other hand, contributions are welcome, and we're happy to help understand some obscure features. (preferrably using the mailing list).
I was shocked but pleasantly surprised to get the logdate patch pulled in a single day. I have a couple of other contributions planned :)
Old 2019-07-20, 02:51   #326
VBCurtis
 

Looks like the server machine or connection failed. It does not respond to ssh, so it may be a campus internet outage.
I'll head in to my office and investigate, hopefully we'll be back up in half an hour or so.
Old 2019-07-20, 03:19   #327
EdH
 

Three of four clients locked up, which also locked up the switch they were on. I see the server is not communicating with cloudygo ATM, either. In case one or more of these machines caused the server to go down, I have taken all four offline.

The fourth machine went into the waiting loop.

Edit: I guess I was composing this message as VBCurtis was posting. I had not seen his post until after I submitted mine.

I will still leave my clients offline for now.

Last fiddled with by EdH on 2019-07-20 at 03:24
Old 2019-07-20, 04:10   #328
VBCurtis
 

The power outage mentioned in post #169 that never happened got around to happening tonight.
According to campus police, power should return to the building at 7am pacific time Saturday (10 hours from now).

I expect the machine will boot itself when it gets power, so I'll try to connect and fire up CADO shortly after 7am. If I am unable to connect, I'll drive back to campus and power on the machine manually.

Sorry for the outage, folks! On the bright side, better this week than next (when I wouldn't be able to get to campus at all).
Old 2019-07-20, 12:14   #329
EdH
 

Quote:
Originally Posted by VBCurtis View Post
The power outage mentioned in post #169 that never happened got around to happening tonight.
According to campus police, power should return to the building at 7am pacific time Saturday (10 hours from now).

I expect the machine will boot itself when it gets power, so I'll try to connect and fire up CADO shortly after 7am. If I am unable to connect, I'll drive back to campus and power on the machine manually.

Sorry for the outage, folks! On the bright side, better this week than next (when I wouldn't be able to get to campus at all).
Bummer, but as you pointed out, good to get the outage done with.

Glad to hear my machines were not the cause.

Last fiddled with by EdH on 2019-07-20 at 12:26
Old 2019-07-20, 14:32   #330
VBCurtis
 

We're back up and running; our host box did power up on its own.