[QUOTE=yoyo;521740]The workunit processing is not fully under the server's control; there are also volunteers in the game. For CN, HC, and GCW, 5 curves are put into one BOINC workunit and sent to a volunteer with a deadline of 5 days. If the volunteer doesn't return any result, it takes 5 days until the server recognises it. After 5 days the workunit is sent to someone else, who also might not return it. So some curves just need time.
Those resends also happen if a workunit returns with an error; in that case it is likewise resent to someone else. Such resends don't lead to any progress. In the meantime the server sends out other ECM workunits from its "unsent tasks" list, always oldest first. If the "unsent tasks" list drains below 1000, it generates new tasks from one of the ECM input queues, choosing the input queue of the project that got the least computing power in the last 5 days.[/QUOTE] It might be worthwhile to keep track of volunteers who repeatedly fail to return results. Also, under the current scheme it is possible that a given number never finishes: it gets close to finishing (say within 50 curves), those assignments fail, they get reassigned, they fail again, and so on. If the server kept track of which users return results quickly, it could hand out reassignments of failed workunits to just those users; that way a number would not take more than a week to finish (say) the last 50 or so curves. Five days also seems quite generous: just one core on my 2.4GHz laptop finishes a curve (with B1 = 10^9, composite = 200 digits) in about 4500 seconds. The policy for assigning new numbers seems very reasonable. However, if a number is stalled owing to repeated failed assignments, then that particular project will accumulate almost no computing power during that period. Shouldn't the next number to be started then get assigned to that project? This is not what I see: I see numbers in some projects stall for an extended period while new numbers are started in other projects. Perhaps my eyesight needs correction? We see numbers stalled for more than a week (thus no computing power applied) and no new numbers started despite the dearth of compute power over that period. Of course, my perceptions might be wrong.
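The dispatch policy yoyo describes (oldest unsent task first, refill below 1000 from the queue of the least-served project) plus the reliable-resend suggestion above can be sketched as a toy model. Everything here is hypothetical illustration, not yoyo's actual server code: the class, field names, and the `fast_users` set are all assumptions made up for this sketch.

```python
from collections import deque

REFILL_THRESHOLD = 1000  # server refills the unsent list below this size


class Scheduler:
    """Toy model of the dispatch policy described above (names are invented)."""

    def __init__(self, project_queues):
        self.unsent = deque()                  # FIFO: oldest tasks go out first
        self.project_queues = project_queues   # project -> pending curve tasks
        self.recent_credit = {p: 0.0 for p in project_queues}
        self.fast_users = set()                # users who return results quickly

    def refill(self):
        """When the unsent list drains, pull tasks from the project that
        received the least computing power recently."""
        if len(self.unsent) < REFILL_THRESHOLD:
            project = min(self.recent_credit, key=self.recent_credit.get)
            self.unsent.extend(self.project_queues[project])
            self.project_queues[project].clear()

    def dispatch(self, user, is_resend=False):
        """Hand out the oldest unsent task; restrict resends of failed or
        expired workunits to fast responders (the suggestion above)."""
        if is_resend and user not in self.fast_users:
            return None
        self.refill()
        return self.unsent.popleft() if self.unsent else None
```

With this restriction, a number within its last few curves would only wait on users with a track record of returning results, instead of cycling through 5-day deadlines.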
[QUOTE=R.D. Silverman;521564]BTW, does anyone know what is happening with 2,2102L? It had already been
sieved, now it seems to be sieving again. Of course sometimes relations fall short and a number needs additional sieving, but the resieve effort for 2,2102L has been going on for sufficiently long that I suspect this is not the case.[/QUOTE] Any ideas? The resieve has been going on long enough that it can't be just "additional sieving".
Hi folks.
Thanks for your efforts in doing this computation with cado-nfs. Just a few random thoughts (that came after reading this thread). [LIST]the SIZEOF_P_R_VALUES and SIZEOF_INDEX (w/o underscores since commit [URL="https://scm.gforge.inria.fr/anonscm/gitweb?p=cado-nfs/cado-nfs.git;a=commit;h=11913a722"]11913a722[/URL]) are relevant only to filtering. Sieving doesn't care.[/LIST][LIST]the server isn't resilient to the situation of clients failing repeatedly. We're aware of that, and this is indeed a nuisance when one wants to run a distributed computation. The reason is that we *want to know* when a client is failing like this. But obviously this doesn't make much sense if we're not baby-sitting the computation ourselves. The more resilient way would be to have the server blacklist those annoying clients.[/LIST][LIST]the server can only cope with so many connections per second. One of the culprits is the sqlite3 back-end -- at least if you see things like "sqlite3: Operational Error", then don't look further. cado-nfs can be used with a mysql backend. Making the switch halfway is not trivial, but doable. Filesystems can also be a source of delays if you have millions of files.[/LIST][LIST]there's not much info on crisis management with the database if it ever gets corrupt. Likewise, respawning WUs that were abandoned because of repeated failures is not a trivial task. I just committed shell fragments that I use for that purpose ([URL="https://scm.gforge.inria.fr/anonscm/gitweb?p=cado-nfs/cado-nfs.git;a=commit;h=f1f77c911f0e71db05c05ea943e7e26f05051d06"]f1f77c911[/URL]).[/LIST][LIST]the report page at [url]http://factoring.cloudygo.com/[/url] is nice; actually this is an acknowledged missing feature of cado-nfs (see [URL="https://gforge.inria.fr/tracker/index.php?func=detail&aid=16699&group_id=2065&atid=7445"]this page on the tracker[/URL], for instance). This functionality doesn't exist in cado-nfs yet because we never invested time in it. 
But we're pretty happy to receive decent contributions to this end.[/LIST][LIST] [QUOTE=SethTro;521724] A couple of places in the code (las-parallel.cpp, bind_threads.sh, cpubinding.cpp) all might affect this. I'll have to dig into that :/[/QUOTE] indeed (concerning las-parallel.cpp). If you compile with hwloc installed, and if you try "-t auto", you'll have las automatically adjust to your machine. There are many other options (see "-t help" if you dare).[/LIST] Briefly put: we're well aware that cado-nfs has rough edges. It's not easy to get it to work on large projects when you're not familiar with the internals. On the other hand, contributions are welcome, and we're happy to help understand some obscure features (preferably using the mailing list).
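The server-side blacklist thome mentions could look something like the following sketch. This is purely illustrative: cado-nfs has no such option as of this thread, and the class, method names, and consecutive-failure policy are all assumptions.

```python
from collections import defaultdict

FAILURE_LIMIT = 3  # consecutive bad results before a client is cut off


class BlacklistFilter:
    """Hypothetical sketch of a client blacklist: after a client returns too
    many consecutive bad results, stop handing it work instead of letting its
    failures count toward shutting down the whole server."""

    def __init__(self, limit=FAILURE_LIMIT):
        self.limit = limit
        self.failures = defaultdict(int)  # client_id -> consecutive failures
        self.blacklist = set()

    def record_result(self, client_id, ok):
        if ok:
            # a good result resets the streak
            self.failures[client_id] = 0
        else:
            self.failures[client_id] += 1
            if self.failures[client_id] >= self.limit:
                self.blacklist.add(client_id)

    def may_serve(self, client_id):
        """Checked before handing a workunit to a client."""
        return client_id not in self.blacklist
```

Counting *consecutive* failures rather than a lifetime total keeps a mostly-healthy client from being banned over occasional transient errors, while a poisoned client trips the limit quickly.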
thome-
Thank you very much for your comments, and time spent browsing this lengthy thread! Unfortunately, most times CADO has failed to continue issuing workunits it has done so without any error at all. Both the server terminal window and log look fully normal, but the server is stalled. A pair of Ctrl-C's issued to the server terminal kills CADO, and a restart using the snapshot file fixes whatever stalled the server. These stalls may be linked to a large user starting many clients at once; but we're talking maybe 20 new clients, not hundreds. One failure was due to a poisoned client, but that problem was easy to discover and correct on the client, and by extending the number of bad results allowed before shutting down the server. A client-blacklist option would be welcome. Questions concerning postprocessing: the host currently has 64GB RAM and 200GB of swap on an NVMe SSD. Do you expect I will need to upgrade to 128GB RAM? Does fast swap help enough on filtering to maybe not need the extra memory? The job is being run on a 1TB SSD, and the relations will take up ~300GB of disk. My only data point is a C186 that ran 32/33LP: 675M relations filtered in 32GB without issue, and the matrix also ran within 32GB without swap. I should have found a C19x job to do before tackling this one, but sometimes interesting projects appear before we're perfectly ready!
[QUOTE=VBCurtis;521926]thome-
Thank you very much for your comments, and time spent browsing this lengthy thread! Unfortunately, most times CADO has failed to continue issuing workunits it has done so without any error at all. Both the server terminal window and log look fully normal- but the server is stalled. A pair of ctrl-c's issued to the server terminal kill CADO, and a restart using the snapshot file fixes whatever stalled the server. These stalls may be linked to a large user starting many clients at once; but we're talking maybe 20 new clients, not hundreds.[/QUOTE] Odd. I definitely remember seeing things like this when we were using the sqlite3 back-end. Not since I've preferred the mysql one. [QUOTE=VBCurtis;521926] One failure was due to a poisoned client, but that problem was easy to discover and correct with the client and by extending the number of bad results before shutting down the server. A client-blacklist option would be welcome.[/QUOTE] This is probably not very hard to do, but that would go with a handful of other things I'd like to see added to cado-nfs client/server eventually (one of them being failover servers for the client -- I have that in a branch, I may merge it at some point if I get to finish the work). [QUOTE=VBCurtis;521926] Questions concerning postprocessing: The host currently has 64GB ram and 200GB swap on NVM SSD. Do you expect I will need to upgrade to 128GB ram? Does fast swap help enough on filtering to maybe not need the extra memory? The job is being run on a 1TB SSD, and the relations will take up ~300GB of disk. My only data point is a C186 that ran 32/33LP. 675M relations filtered in 32GB without issue, and the matrix also ran within 32GB without swap. I should have found an C19x job to do before tackling this one, but sometimes interesting projects appear before we're perfectly ready![/QUOTE] I don't have a recent data point at this size, and I don't recall having tried to factor things with a 64GB node in the recent past. 
From the rsa220 log files (more than 5 years ago, on a machine which was probably 3 years old by then):
- purge: 6h WCT, 62G RAM
- merge: 60h WCT, 195G RAM (but cado-nfs merge has evolved a lot recently!)
- replay: 4h WCT, 207G RAM
- lingen in linear algebra (done in early 2016, somewhat later than the rest): 500G RAM (but I'm actively working on it this summer)
The RAM counts represent the VmPeak, which may have limited significance. I actually still have the relations for that one; time permitting, I may give the recent software a try and see how it fares with that data.
[QUOTE=thome;521924]Hi folks.
Thanks for your efforts in doing this computation with cado-nfs. [/QUOTE] Thome, thanks for reading through our problems and adding your insight. [QUOTE][LIST]there's not much info on crisis management with the database if it ever gets corrupt. Likewise, respawning WUs that were abandoned because of repeated failures is not a trivial task. I just committed shell fragments that I use for that purpose ([URL="https://scm.gforge.inria.fr/anonscm/gitweb?p=cado-nfs/cado-nfs.git;a=commit;h=f1f77c911f0e71db05c05ea943e7e26f05051d06"]f1f77c911[/URL]).[/LIST] [/QUOTE] I'm going to start saving daily copies of the database for cloudygo. At ~100 MB each, I can store 180 days in 18 GB and reduce this concern. Thome, would you be interested in a new server flag that saves the last N days of the database? I'm imagining copying it every night and deleting any copies older than N days, or maybe saving a new copy every X WUs (so that during periods of no work it doesn't delete the older copies). [QUOTE][LIST]the report page at [url]http://factoring.cloudygo.com/[/url] is nice; actually this is an acknowledged missing feature of cado-nfs (see [URL="https://gforge.inria.fr/tracker/index.php?func=detail&aid=16699&group_id=2065&atid=7445"]this page on the tracker[/URL], for instance). This functionality doesn't exist in cado-nfs yet because we never invested time in it. But we're pretty happy to receive decent contributions to this end.[/LIST][LIST] [/QUOTE] The whole site is around 1000 lines of code ([url]https://github.com/sethtroisi/factoring-ui[/url]); if you think that it or something similar would be interesting to add to the project, I'm happy to write up a proposal and work with you to get it committed. [QUOTE] indeed (concerning las-parallel.cpp). If you compile with hwloc installed, and if you try "-t auto", you'll have las automatically adjust to your machine. 
There are many other options (see "-t help" if you dare).[/LIST][/QUOTE] I'm actively investigating this; I might have some questions on the mailing list later. [QUOTE=thome;521924] Briefly put: we're well aware that cado-nfs has rough edges. It's not easy to get it to work on large projects when you're not familiar with the internals. On the other hand, contributions are welcome, and we're happy to help understand some obscure features (preferably using the mailing list).[/QUOTE] I was shocked but pleasantly surprised to get the logdate patch pulled in a single day. I have a couple of other contributions planned :)
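The rotating-backup idea above (copy the database periodically, prune copies older than N days) might be sketched like this. The function, file-naming scheme, and paths are hypothetical illustrations; this is not an existing cado-nfs flag.

```python
import os
import shutil
import time

KEEP_DAYS = 180  # roughly matches the 180-day / 18 GB estimate above


def backup_database(db_path, backup_dir, keep_days=KEEP_DAYS):
    """Copy the workunit database into backup_dir under a timestamped name,
    then prune backups older than keep_days. Naming is illustrative."""
    os.makedirs(backup_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = os.path.join(backup_dir, f"wudb-{stamp}.sqlite")
    shutil.copy2(db_path, dest)  # copy2 preserves the file's timestamps

    cutoff = time.time() - keep_days * 86400
    for name in os.listdir(backup_dir):
        path = os.path.join(backup_dir, name)
        if name.startswith("wudb-") and os.path.getmtime(path) < cutoff:
            os.remove(path)
    return dest
```

The "save a new copy every X WUs" variant would simply call this from the result-handling path instead of a nightly cron job, which naturally stops both creating and expiring copies during idle periods.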
Looks like the server machine or connection failed. It does not respond to ssh, so it may be a campus internet outage.
I'll head in to my office and investigate; hopefully we'll be back up in half an hour or so.
Three of four clients locked up, which also locked up the switch they were on. I see the server is not communicating with cloudygo ATM, either. In case one or more of these machines caused the server to go down, I have taken all four offline.
The fourth machine went into the waiting loop. Edit: I guess I was composing this message as VBCurtis was posting; I had not seen his until after I submitted mine. I will still leave my clients offline for now.
The power outage mentioned in post #169 that never happened got around to happening tonight.
According to campus police, power should return to the building at 7am Pacific time Saturday (10 hours from now). I expect the machine will boot itself when it gets power, so I'll try to connect and fire up CADO shortly after 7am. If I am unable to connect, I'll drive back to campus and power on the machine manually. Sorry for the outage, folks! On the bright side, better this week than next (when I wouldn't be able to get to campus at all).
[QUOTE=VBCurtis;521969]The power outage mentioned in post #169 that never happened got around to happening tonight.
According to campus police, power should return to the building at 7am pacific time Saturday (10 hours from now). I expect the machine will boot itself when it gets power, so I'll try to connect and fire up CADO shortly after 7am. If I am unable to connect, I'll drive back to campus and power on the machine manually. Sorry for the outage, folks! On the bright side, better this week than next (when I wouldn't be able to get to campus at all).[/QUOTE] Bummer, but as you pointed out, good to get the outage done with. Glad to hear my machines were not the cause.:smile:
We're back up and running; our host box did power up on its own.