Team sieve for Kosta C198
If you have some spare cores, my usual server/port is now serving workunits for a C198 from the Kosta numbers (M107^12 + 1). A=30, so about 5.5GB per process. The default is now 4 threads per process; naturally, you can set it to whatever you like via "--override t {number}" on the command line.
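For anyone joining: a client launch looks along these lines. This is a sketch with a placeholder server URL and client id, not the real ones for this job:

```
# Hypothetical example -- substitute the real server host/port and a
# unique client id; "--override t N" sets the thread count per process.
./cado-nfs-client.py --server=http://example.edu:44455 \
                     --clientid=myhost.0 \
                     --override t 4
```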
swellman and I plan to sieve this from Q=2-80M, with the 15e queue handling 80-700M. In a sense, a JV version of what we just did for the C217: use CADO for what it is best at, a large sieve region and small Q, and The Cloud in the region that minimizes wasted effort. If it works well, I think we can use this hybrid to factor numbers of up to 201 or 202 digits on the 15e queue. I can personally sieve about 1MQ/day, and I'm planning a month of sieving.
My cores are occupied for the next 5 days or so, but I can join in afterwards.
So, the exact same command line, and it will pull in the new poly file and root file and start crunching? Is it better to run one client with all the threads, or to split into multiple (say, 2 or 3) clients on a single CPU with 12 threads?
Same exact command line, correct. I don't know what is most efficient; on my home machine I often run two 6-threaded instances, so that I can kill one when something else catches my attention.
The server itself is running 10 4-threaded instances.
Server seems to be unavailable to both my machines.
I just started adding machines. I hope I didn't break something again! I did before, but that was because the ones I tried didn't have enough RAM. This time I made sure all the machines had at least 8GB, with >6GB free. I had several already running and then added a few more. Maybe the initial d/l for the more recent machines was too heavy.
[later] I think I will kill all my machines that never finished downloading the initial files, leave those that had failed to upload running, and see if that affects the server in a positive manner.

[later, still] It is serving again. I aborted several machines. Is that what fixed it, or just coincidence? I will start adding again, but at a slower pace.

[Even later] It appears that I cannot download the roots1 file in any acceptable time frame, and that the d/l keeps the server from performing other tasks while it fiddles. Speedtest.net shows my d/l at >22Mbps, but the file size shows only about 20MB after a couple of minutes. I have taken to manually copying the roots1 file from an already-running client into the others I am trying to start. This appears to be working.
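That manual copy is easy to script. A rough sketch, with made-up hostnames and paths, that pushes the large roots file over the local network instead of having every new client download it from the busy server:

```
# Rough sketch -- hostnames and paths are placeholders, and the actual
# roots file name depends on the job.
for host in client1 client2 client3; do
    scp download/*roots1* "$host":cado-nfs/download/
done
```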
Hrmm... I see lots of clients from eFarm, and a couple from seven, and the localhost clients are all still running. Work is being accepted and given out.
Strange. Yield at Q=7M is higher than it was at Q=2M, so I started the sieving a little low. Holy wow, does Ed have some clients going! eFarm 40 appears to be crashing repeatedly, though.
I added a few more machines today with no d/l issues and, more importantly, I didn't seem to hang the server at all. :smile:
From my vantage point, eFarm.40 looks like it's running correctly ATM. Is there (or will there be) a cloudygo page for this team sieve?
Thanks to Ed and Seth's clients, we're flying through this one!
Presently Q=11.3M, just over 18M relations found. We started at Q=2M, so yield is a tick below 2 so far. Not great, and slightly tempting to use I=16; but even at this yield we'll still get ~150M relations from a Q-range that doesn't sieve well on the 15e queue, "free" relations, so to speak! I think on any larger input, we'd do this "free" CADO sieving on I=16 to reduce the Q-range to be done on the 15e queue.

Thanks for all the assistance! We've done 5MQ in a day, so just ~14 more days at this pace will complete it to Q=80M. Hopefully that helps Seth decide whether to bother with a cloudygo tracker; I felt bad that we stopped the 2,1165 job like the day after he got the tracker up!
I am seeing an occasional full bucket on more than one machine:
[code]
ERROR:root:Command resulted in exit code 134
ERROR:root:Stderr: code BUG() : condition most_full > 1 failed in reserve at .../las-threads.cpp:123 -- Abort
Aborted (core dumped)
[/code]
Is there something I should adjust at this end, or, as long as it's only a few, not worry about it? My current scripts kill the machine until I can review and restart it. I can easily change that to allow it to continue on its own, which is my current plan, unless you have another idea. So far, the restarted machines appear to run fine.
There is a setting, "bkmult", that in my experience occasionally self-adjusts when a buckets-full message pops up. It usually results in a slower run for that one workunit, but when it happens often I have added tasks.sieve.bkmult=1.10 (the default is 1, from what I can tell, and this only happens on large jobs).
You can try setting --override bkmult=1.10 or 1.12 and see if the errors cease; but I don't know why you get a crash when I just get a slowdown from that particular bucket, so maybe I'm talking about a different "bucket". You could also try the CADO mailing list via [url]http://cado-nfs.gforge.inria.fr/support.html[/url], but if that machine isn't using the most current git, that may be fruitless (what dev wants to hear "on the version from 7 months ago, I have this bug..."?). EDIT: "Don't worry about it" is fine, too. You're not getting many bad WUs from this problem.
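For reference, on the client side that override would look something like the line below. This is an untested sketch with a placeholder server URL, using the same --override mechanism as the thread-count setting:

```
# Hypothetical -- the extra argument pair is passed through to the siever.
./cado-nfs-client.py --server=http://example.edu:44455 \
                     --override bkmult 1.10
```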
[QUOTE=VBCurtis;544284]
. . . EDIT: "don't worry about it" is fine, too. You're not getting many bad WUs from this problem.[/QUOTE] The failure is easy to skip with my scripts, and the machines that have shown it continued on fine when restarted, so I'll go with "don't worry about it."
We've reached Q=20M, with 37.7M relations found. Yield is now above 2.0 average, and the last 8.7MQ got us 19.5M relations for a yield of 2.2ish since my update yesterday. If yield stays put, that's ~160M relations from Q=2-80M. Yield on ggnfs at 700M was around 1.1, so we're saving a Q-range of ~140M by doing this CADO effort.
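Just to make the bookkeeping explicit, the yield figures above reduce to simple division; a quick sketch in Python (numbers taken from this update and the previous one):

```python
def yield_rate(relations_m, q_start_m, q_end_m):
    """Relations found per unit of special-q sieved (all in millions)."""
    return relations_m / (q_end_m - q_start_m)

overall = yield_rate(37.7, 2, 20)     # average yield since Q=2M, ~2.09
recent = yield_rate(19.5, 11.3, 20)   # the last 8.7MQ, ~2.24
projected = overall * (80 - 2)        # ~163M relations if yield stays put
print(f"{overall:.2f} {recent:.2f} {projected:.0f}M")
```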
The job has been posted to the 15e queue, but is a ways down the list; I think we'll finish this CADO effort before the ggnfs relations are ready from 15e.
Today's update:
Q=34.2M, 72.3M relations found. The server got stuck for a few minutes, and then processed ~40 workunits in a single burst. No idea what goes on in the database, but at least I didn't need to restart it this time!
[QUOTE=VBCurtis;544494]Today's update:
Q=34.2M, 72.3M relations found. The server got stuck for a few minutes, and then processed ~40 workunits in a single burst. No idea what goes on in the database, but at least I didn't need to restart it this time![/QUOTE] My machines had trouble uploading WUs for two periods yesterday: the first at around 5PM Eastern and the second at around 9PM Eastern. Examples from one of my machines show uploads that usually take about two seconds taking as long as 23 minutes and 48 seconds:
[code]
upload example 1:
start   - 17:03:32
complete- 17:21:10

upload example 2:
start   - 17:32:01
complete- 17:55:49

upload example 3:
start   - 21:42:07
complete- 21:51:27
[/code]
Curiosity:
All WUs appear to be the same size (2000). I have two i5's that do not hyperthread, so they are running -t 4. I have several i7's that do hyperthread, so they are running -t 8. All have at least 8GB RAM. Timewise, the two i5's are just about keeping up with the i7's in completing their WUs. The i5's are running at about 3200 MHz, while the i7's are running at about 3400 MHz. The only thing I see in the i5's favor is that they have SSDs. Is there that much drive activity to account for these observations? Or is it something to do with hyperthreading overhead?
The only relevant data I've taken is from running 6 threads on a 6-core, and then 12. The 12-threaded job was about 20% faster than the 6-threaded job.
I've no idea why a faster i7 would take as long as the i5, unless the i5 is a newer generation with newer instructions.
[QUOTE=VBCurtis;544528]The only relevant data I've taken is running 6 threads on a 6-core, and then 12. The 12-threaded job was about 20% faster than the 6 threaded job.
I've no idea why a faster i7 would take as long as the i5, unless the i5 is a newer generation with newer instructions.[/QUOTE] That may very well be the difference. The i5's are much newer than the i7's. BTW, as I write this, all of my machines are complaining:
[code]
2020-05-03 20:07:55,928 - ERROR:root:Upload failed, URL error: <urlopen error [Errno 111] Connection refused>
2020-05-03 20:07:55,928 - ERROR:root:Waiting 10.0 seconds before retrying (I have been waiting since 1930.0 seconds)
[/code]
and:
[code]
INFO:root:spin=44 is_wu=True blog=0
INFO:root:Downloading http://TheMachine.dyn.ucr.edu:44455/cgi-bin/getwu?clientid=eFarm.20 to download/WU.eFarm.20117763498 (cafile = None)
ERROR:root:Download failed, URL error: <urlopen error [Errno 111] Connection refused>
[/code]
Yea, CADO quit; when I tried to restart it, I got the error message that we hit the max failed workunits of 100. Your machines killed us! :razz:
I set the new max to 1000, which should last us the duration of this effort. The server is back up.
[QUOTE=VBCurtis;544544]Yea, CADO quit; when I tried to restart it, I got the error message that we hit the max failed workunits of 100. Your machines killed us! :razz:
I set the new max to 1000, which should last us the duration of this effort. The server is back up.[/QUOTE]Apologies! But mine shouldn't be failing that much now, unless it's possibly due to "tasks.wutimeout = 3600 # one hour." My slower machines are taking less than 30 minutes to complete, other than not being able to report. I probably used up some of the "failed" quota in the beginning, but I have all the scripts doing a good job of gracefully ending after a submission now; that means the curfewed ones don't leave anything unfinished. I did have a couple of machines with the "condition most_full" failure (one had several). I installed a brand-new CADO-NFS on that one and haven't seen the error anymore.
Update: Q=49.2M, 111.2M total relations. Average yield: 111.2/47.2 = 2.36.
Yield since last update (Q=34.2M): 38.9M / 15M = 2.59. At the current yield, we'll get ~75M more relations, for a total approaching 190M relations. That leaves ~850M for nfs@home to sieve. We're running just over 5MQ a day, and I just added 10 threads. If Ed continues his support, we'll finish Monday the 11th.
[QUOTE=VBCurtis;544657]Update: Q=49.2M, 111.2M total relations. Average yield: 111.2/47.2 = 2.36.
Yield since last update (Q=34.2M): 38.9M / 15M = 2.59. At the current yield, we'll get ~75M more relations for a total approaching 190M relations. That leaves ~850M for nfs@home to sieve. We're running just over 5MQ a day, and I just added 10 threads. If Ed continues his support, we'll finish Monday the 11th.[/QUOTE] I expect to see this through, and even just added the machine that finished the c178 HCN today.
Server unreachable for 40 minutes. Problem at my end or server?
[QUOTE=axn;544773]Server unreachable for 40 minutes. Problem at my end or server?[/QUOTE]
All my machines are suffering, too.
Oops! My home desktop crashed. Seems it somehow took down the server (must've forgotten to run cado within screen). It'll be up momentarily.
EDIT: Nope, that was merely a coincidence. I get the error: too many failed workunits (max 100). This is not the max-timed-out error; that setting, tasks.maxtimedout, is set to 5000, plenty. It seems the buckets-full error Ed gets on a few clients added up. I checked the params.c90 file to see if there's a setting to loosen this and allow the factorization to keep going, but I may need to build a new job in a new folder, starting at the Q we left off at. More shortly.

Re-Edit: Started CADO anew at Q=57.66M. Building the factor base now; should be up in 10-15 min for workunits.
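For anyone curious what "anew at Q=57.66M" amounts to: in the params file for the new job, only a couple of values change. An illustrative fragment (parameter names can vary between CADO versions, so treat this as a sketch, not the exact file):

```
tasks.qmin = 57660000        # resume where the previous job stopped
tasks.sieve.qrange = 2000    # keep the same workunit size as before
```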
More problems. Just now, the clients are unable to connect to the server.
[QUOTE=axn;544790]More problems. Just now, the clients are unable to connect to the server.[/QUOTE]
When I woke up the slumberers this morning, they all got in each other's way, because they hadn't gracefully stopped last night due to the outage. The new run by Curtis might have compounded my local troubles, since the machines (or at least some of them) wanted a new roots1. I have totally removed all my machines and hope this allows the server to catch up. I will check later to see if I should add any back, and I will have to evaluate my scripts again to lessen the load.

Although I previously chose to ignore the "full" issue, I did actually address it when I found some, by installing the latest dev on those machines. I also upgraded some others. However, I have found that the latest dev slows my machines by over 10%. This is repeatable: WUs that averaged 14.5 minutes took 16.5, those that took 27 minutes took 33, etc. Going back to the earlier installs took the time back to the earlier lengths. But in all of this, I did keep an eye out for the "full" issues and thought I had minimized them. I must not have minimized them enough. My apologies for breaking the server. I hope it is back up now that my machines have been removed.
:smile: Well, FWIW, things are running fine now. Maybe you can start adding the clients back in a staggered fashion?
[QUOTE=axn;544795]:smile: Well, FWIW, things are running fine now. Maybe you can start adding the clients back in a staggered fashion?[/QUOTE]
Done so. I think they are all back up properly. I'll try to keep an eye on this thread. . . (I had to distribute the roots1 file locally. Some of my machines wanted to d/l the 200+MB file at about 5kB/s! 40000 seconds!?! Not sure what's going on there.)
The server is on a university network, and I've experienced 100Mbit speeds downloading relations files from it to my home desktop.
Bummer about the speed loss due to updating software! Things look good presently from my end. Q=59.9M, 5.8M new relations since the restart last night.
[QUOTE=VBCurtis;544808]The server is on a university network, and I've experienced 100Mbit speeds downloading relations files from it to my home desktop.
Bummer about the speed loss due to updating software! Things look good presently from my end. Q=59.9M, 5.8M new relations since the restart last night.[/QUOTE]Well, I found two more of my machines with "most_full" issues and updated CADO-NFS on both. They immediately slowed by 2+ minutes per run. I'm left with a choice between errors and slower progress. If the errors compound into server stoppages, that's the bigger issue, so I'm choosing slower running whenever I come across these "exit code 134" / "most_full" errors.
Due to some research through previous logs, I have come to the idea (possibly false) that the slowdown is actually from CADO-NFS adjusting some internal values to prevent the "most_full" conditions from arising. This comes at a cost to performance.
If your errors are related to the bkmult bucket-full condition, then the lost speed may be illusory to an extent: your old version may have either run fast with a smaller bucket, or crashed and had to redo a whole workunit on a larger setting. I've set bkmult=1.10 or 1.12 on 190+ digit jobs because even newer versions would run into buckets-full, increase bkmult, and restart on a Q. If that happens a bunch, I've found it's faster to just set bkmult from the get-go (I did not set it on this job).
Regardless of the cause of the error: if you crashed every 10 workunits on that machine but were 10% faster, your new copy of CADO is just as fast as the old one, since it's doing 10 slow WUs in the time the old copy did 11 but failed on one (if the failure took almost the whole time, at least). I hope that makes you feel a bit better! I saw errors most often on #40 and #53. I saw 53 today, so I bet you caught that one today too.
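The break-even arithmetic above checks out; a tiny sketch, assuming a failed WU wastes its whole time slot:

```python
def effective_throughput(wu_time, fail_every=None):
    """Successful workunits per unit time; a failure once every
    `fail_every` WUs wastes that WU's whole slot (the worst case)."""
    rate = 1.0 / wu_time
    if fail_every:
        rate *= (fail_every - 1) / fail_every
    return rate

old = effective_throughput(1.0, fail_every=11)  # 10% faster, 1-in-11 crash
new = effective_throughput(1.1)                 # 10% slower, no crashes
# Both come out to 10/11 successful WUs per unit time.
```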
[QUOTE=VBCurtis;544850]If your errors are related to the bkmult bucket full, then the lost speed may be illusory to an extent- your old version may have either run fast with a smaller bucket, or crashed/had to redo a whole workunit on a larger setting. I've set bkmult=1.10 or 1.12 on 190+ jobs because ever newer versions would run into buckets-full, increase bkmult, and restart on a Q. If that happens a bunch, I've found it's faster to just set bkmult from the get-go (I did not set it on this job).
Regardless of the cause of the error: if you crashed every 10 workunits on that machine, but were 10% faster, your new copy of CADO is just as fast as the old one, since it's doing 10 slow WUs in the time the old copy did 11 but failed on one (if the failure took almost the whole time, at least). I hope that makes you feel a bit better! I saw errors most often on #40 and #53. I saw 53 today, so I bet you caught that one today too.[/QUOTE]All my errors were early, between 13 and 27 seconds in. #40 was one for sure, but I don't remember if #53 was. I did two or three others besides #40. I'll try to remember to check #53 for sure tomorrow. I changed #49 from an i5 to an i7 today, to squeeze just a little more from it. #49 was the slowest of my machines at about 33 minutes/WU. With a brand-new CADO-NFS install, it's running at around 27 minutes/WU now. Not real impressive, but every little bit helps.
[QUOTE=VBCurtis;544850]. . .
I saw errors most often on #40 and #53. I saw 53 today, so I bet you caught that one today too.[/QUOTE]I'm not seeing any more short-time errors. If you note any more, of any length, let me know next time you post. Thanks!
Ed-
I saw no more errors.

All: We've reached Q=80M, so we're done! I'll cat and remdups these presently, and will post the number of unique relations we achieved as an edit to this post shortly. Thanks for your help, everyone!

Stats: 137.4M unique relations, 59.6M duplicates. That's 197M total relations from Q=2-80M. Lots of these will duplicate the nfs@home run, but we'll still save a bunch of inefficient high-Q effort!
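remdups is CADO's bundled duplicate remover; purely to illustrate what "cat and remdups" does, here's a toy stand-in (not the real tool) that treats two relations sharing the same (a,b) prefix as duplicates:

```python
def remove_duplicate_relations(lines):
    """Toy dedup: a relation line looks like 'a,b:primes:primes', and
    two lines with the same 'a,b' prefix report the same relation."""
    seen = set()
    unique = []
    for line in lines:
        if line.startswith('#') or not line.strip():
            continue                    # skip comment and blank lines
        key = line.split(':', 1)[0]     # the 'a,b' part
        if key not in seen:
            seen.add(key)
            unique.append(line)
    return unique

rels = ["1,2:3,5:7", "1,2:3,5:7", "4,9:11:13"]
print(len(remove_duplicate_relations(rels)))   # prints 2
```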
[QUOTE=VBCurtis;545053]Ed-
I saw no more errors. All- We've reached Q=80M, so we're done! I'll cat and remdups these presently, and will post the number of unique relations we achieved as an edit to this post shortly. Thanks for your help, everyone! Stats: 137.4M unique relations, 59.6M duplicates. That's 197M total relations from Q=2-80M. Lots of these will duplicate the nfs@home run, but we'll still save a bunch of inefficient high-Q effort![/QUOTE] Nice. I'll keep an eye on the NFS@Home job for this composite, though it probably won't finish sieving until midsummer. But ~100M relations added onto what 15e generates should get this into LA. I wonder if a C200 is possible on 15e using this approach? A question for another day.
[QUOTE=swellman;545083]I wonder if a C200 is possible on 15e using this approach? Question for another day.[/QUOTE]
There's no doubt this approach makes C200-201 reasonable with 15e. If yield looks sketchy, we simply run Q=20-100M on I=16 rather than the A=30 that we chose for this job. We'd get 30% more relations -> 220M uniques before starting nfs@home. Or, if it proves faster, Q=20-150M on A=30 and then nfs@home.
Mike has withdrawn his reservation for postprocessing this job. Anybody interested in taking it on? I have a couple of i7s, each with 32 GB of memory, but it would take months to finish this job on either rig.
Sure, I can do it; I already have the CADO'ed relations anyway.
You can post the reservation for me.