mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > PrimeNet > GPU to 72

Old 2014-06-25, 15:06   #23
chalsall
If I May
 
chalsall's Avatar
 
"Chris Halsall"
Sep 2002
Barbados

2×5×7×139 Posts
Default

Quote:
Originally Posted by Mark Rose View Post
I must apologize for all the trouble. I had one stale file that I missed deleting that was causing the problem. Everything is working perfectly. I'm sorry.
PLEASE don't apologize. You are building up a HUGE buffer for DC Cat 4, you led me to figure out that cg1 was more economical than g2, and this is probably the first time so many requests for work were made concurrently -- there could very well have been a bug on my end.

All's cool!

P.S. I think it would be great if you made your AMI available publicly for anyone else who wanted to play.
chalsall is online now   Reply With Quote
Old 2014-06-25, 16:12   #24
Mark Rose
 
Mark Rose's Avatar
 
"/X\(‘-‘)/X\"
Jan 2013

22·733 Posts
Default

Quote:
Originally Posted by chalsall View Post
PLEASE don't apologize. You are building up a HUGE buffer for DC Cat 4, you led me to figure out that cg1 was more economical than g2, and this is probably the first time so many requests for work were made concurrently -- there could very well have been a bug on my end.
How big is the buffer? Is there any way to see the size of the buffers for the various work categories?

Quote:
All's cool!

P.S. I think it would be great if you made your AMI available publicly for anyone else who wanted to play.
I might have time later. I want to rebuild it as an instance store, plus tweak it to work with user data for passing the passwords, picking work type, and whatnot. Right now that's either baked right into the AMI or specified in my GitHub repository.
Mark Rose is offline   Reply With Quote
Old 2014-06-25, 16:36   #25
Mark Rose
 
Mark Rose's Avatar
 
"/X\(‘-‘)/X\"
Jan 2013

22×733 Posts
Default

Speaking of making the AMI, the biggest concern is handling exponents that don't get completed when an instance is terminated. I'm doing that now with rsync/ssh and manually looking at the backup directories, but that's not turn-key easy to set up. I have some ideas of features that GPU72 could implement to make it easier:

1. Being able to request a one-day expiry when fetching work. If the machine can do 280 GHz-days/day, and the largest assignment (LL, 71->74) is about 50 GHz-days, making the needed buffer of 2 assignments 100 GHz-days, assignments should be complete in about 9 hours. A 200 GHz-days/day machine could complete them in 12 hours.

2. Being able to specify a machine name that is stored with the assignment, and then manually going through the assignment list when an instance is terminated and unreserving those assignments (which could be scripted). More convenient would be a way of unreserving all the assignments of a machine by name.

I may go the route of creating another script to deal with the stale backup directories, with the option of unreserving anything stale, or making a new worktodo.txt with the lost exponents (along with any checkpoint files).
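The buffer arithmetic in point 1 above can be sketched like this; the throughput and assignment-size figures are just the examples from this post, not GPU72 constants.

```python
# Sketch of the buffer timing from point 1 above. The figures
# (throughput, assignment size, buffer depth) are the example
# numbers from this post, not GPU72 constants.

def hours_to_clear(buffer_ghzd, rate_ghzd_per_day):
    """Hours needed to finish a buffer at a given throughput."""
    return 24.0 * buffer_ghzd / rate_ghzd_per_day

# Two 50 GHz-day LL (71->74) assignments = 100 GHz-days of buffer.
buffer_ghzd = 2 * 50

print(round(hours_to_clear(buffer_ghzd, 280), 1))  # ~8.6 hours
print(round(hours_to_clear(buffer_ghzd, 200), 1))  # 12.0 hours
```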
Mark Rose is offline   Reply With Quote
Old 2014-06-25, 17:02   #26
chalsall
If I May
 
chalsall's Avatar
 
"Chris Halsall"
Sep 2002
Barbados

973010 Posts
Default

Quote:
Originally Posted by Mark Rose View Post
How big is the buffer? Is there any way to see the size of the buffers for the various work categories?
That's rather complicated...

You have to look at the Primenet Exponent Status Report, and then take into consideration the Primenet Thresholds Report.

Then, look at the Trial Factored Depth Change Report vs the Exponent Status Change Report (this is for the 30M to 40M range).

...and then try to figure out how things stand.

This is further complicated by the fact that the category thresholds move every day as candidates are DCed and LLed, and the fact that you can't entirely rely on the DC or LL completion rates for an estimate as to how many assignments will be requested per day, since some Cat 4 assignments can be held for *years* before being recycled (even under the new assignment / recycling rules).

But the short version is... We've probably now got about a month's worth of DC Cat 4 buffer -- LL Cat 4 is right at the edge.

I'm actually in the process of migrating my EC2 instances to LL Cat 4 work based on the recent surge in the DC range.
chalsall is online now   Reply With Quote
Old 2014-06-25, 17:14   #27
chalsall
If I May
 
chalsall's Avatar
 
"Chris Halsall"
Sep 2002
Barbados

2×5×7×139 Posts
Default

Quote:
Originally Posted by Mark Rose View Post
I have some ideas of features that GPU72 could implement to make it easier:
I would be *very* happy to work with you on an optimal solution.

But if I may suggest an alternative solution space...

I've been using a custom AMI which is launched with a 1GB volume attached (at /dev/sdb, auto-mounted at /home). The crontab then contains @reboot entries which launch the mfaktc.exe process(es) and an mprime process after a successful boot.

This has the advantage that while offline an instance's storage is only 1GB, and the same AMI could be used by anyone. The downside is that only one instance can be launched at a time from the "Management Console", in order to select the 1GB volume to attach.
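A minimal sketch of what that crontab might look like -- the paths, GPU device numbers, and log names here are illustrative assumptions, not the actual setup:

```shell
# Hypothetical crontab sketch of the @reboot scheme described above.
# Paths, device numbers, and log file names are assumptions.

# Start one mfaktc per GPU once the instance comes up.
@reboot cd /home/gimps/gpu0 && ./mfaktc.exe -d 0 >> mfaktc0.log 2>&1
@reboot cd /home/gimps/gpu1 && ./mfaktc.exe -d 1 >> mfaktc1.log 2>&1

# Keep the CPU busy with mprime as well.
@reboot cd /home/gimps && ./mprime >> mprime.log 2>&1
```

Since /home lives on the attached 1GB volume, the worktodo and checkpoint files survive instance termination.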

Thoughts?

P.S. I'm very much enjoying this experiment, and working with you. Collaboration is good!
chalsall is online now   Reply With Quote
Old 2014-06-25, 17:25   #28
chalsall
If I May
 
chalsall's Avatar
 
"Chris Halsall"
Sep 2002
Barbados

2·5·7·139 Posts
Default

Quote:
Originally Posted by Mark Rose View Post
I have some ideas of features that GPU72 could implement to make it easier:
Sorry... My immediate above was actually speaking about the AMI. To speak to your suggestions about GPU72...

Quote:
Originally Posted by Mark Rose View Post
1. Being able to request a 1 day time when fetching work. If the machine can do 280 GHz-day/day, and the largest assignment (LL, 71->74) is about 50 GHz-day, making the needed buffer of 2 assignments 100 GHz, assignments should be complete in 9 hours. A 200 GHz-day/day machine could complete in 12 hours.
There is already an option on the assignment pages for "GHz Days of Work". Is this what you are suggesting, or are you suggesting that GPU72 should refine its assignment ranges/depths based on this? The latter is not currently done (it simply limits the number of assignments), but the former makes sense.

Quote:
Originally Posted by Mark Rose View Post
2. Being able to specify a machine name that is stored with the assignment, and then manually going through the assignment list when an instance is terminated and unreserving those assignments (which could be scripted). More convenient would be a way of unreserving all the assignments of a machine by name.
This has been requested before (long ago, by Bdot amongst others), and the code is already half developed. Perhaps this will be the kick-in-the-pants needed for me to finish the code.
chalsall is online now   Reply With Quote
Old 2014-06-25, 18:35   #29
Mark Rose
 
Mark Rose's Avatar
 
"/X\(‘-‘)/X\"
Jan 2013

22·733 Posts
Default

Quote:
Originally Posted by chalsall View Post
That's rather complicated...

You have to look at the Primenet Exponent Status Report, and then take into consideration the Primenet Thresholds Report.

Then, look at the Trial Factored Depth Change Report vs the Exponent Status Change Report (this is for the 30M to 40M range).

...and then try to figure out how things stand.

This is further complicated by the fact that the category thresholds move every day as candidates are DCed and LLed, and the fact that you can't entirely rely on the DC or LL completion rates for an estimate as to how many assignments will be requested per day, since some Cat 4 assignments can be held for *years* before being recycled (even under the new assignment / recycling rules).
Don't we have computers to figure that all out? ;)

Quote:
But the short version is... We've probably now got about a month's worth of DC Cat 4 buffer -- LL Cat 4 is right at the edge.

I'm actually in the process of migrating my EC2 instances to LL Cat 4 work based on the recent surge in the DC range.
Okay, cool. I've got another 22 hours of DCTF crunching scheduled (2300 GHz-days, or 325 70->71 assignments, if that's what GPU72 decides), plus whatever leftovers I get from the terminated machines, which I'll put on my own cards (350 GHz-days/day) before I head out of town for about a week.

Last fiddled with by Mark Rose on 2014-06-25 at 18:35
Mark Rose is offline   Reply With Quote
Old 2014-06-25, 19:11   #30
Mark Rose
 
Mark Rose's Avatar
 
"/X\(‘-‘)/X\"
Jan 2013

293210 Posts
Default

Quote:
Originally Posted by chalsall View Post
I would be *very* happy to work with you on an optimal solution.

But if I may suggest an alternative solution space...

I've been using a custom AMI which is launched with a 1GB volume attached (at /dev/sdb, auto-mounted at /home). The crontab then contains @reboot entries which launch the mfaktc.exe process(es) and an mprime process after a successful boot.

This has the advantage that while offline an instance's storage is only 1GB, and the same AMI could be used by anyone. The downside is that only one instance can be launched at a time from the "Management Console", in order to select the 1GB volume to attach.

Thoughts?
I prefer not to use extra EBS volumes. First, they cost money, even if not a lot. Second, if you create your spot request with an extra EBS volume, a new one is created every time a spot instance is launched. You could end up with dozens floating around, all costing you money, with half-baked assignments. Scripting a solution to attach, mount, copy, umount, detach, delete, etc., is work and requires access keys and whatnot. Sure, you can set up IAM roles, but then it's no longer a turn-key solution. Plus it needs to be cron'ed from somewhere, and what do you do with the assignments you collect?

I think a simpler solution will be just giving up on the assignments, and telling people not to bid too close to the spot price, to avoid frequent instance termination. Amazon also says they try to shut down instances gracefully, so I may write a shutdown script that stops mfaktc, submits any final results to PrimeNet, and unreserves any assignments left in worktodo.txt. All of that can happen as-is right now. It only fails if the instance dies too quickly.
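That graceful-shutdown idea could be sketched along these lines. The "Factor=AID,exponent,from,to" line format is the usual mfaktc worktodo shape, but the unreserve step itself is left as a placeholder, since GPU72's actual API isn't shown in this thread.

```python
# Sketch of the graceful-shutdown idea above: on SIGTERM, stop work
# and collect the exponents left in worktodo.txt so they can be
# unreserved. The "Factor=AID,exponent,from,to" line format is the
# usual mfaktc worktodo shape; the unreserve call is a placeholder,
# not GPU72's real API.
import signal
import sys

def unfinished_exponents(worktodo_text):
    """Exponents still listed in a worktodo.txt body."""
    exps = []
    for line in worktodo_text.splitlines():
        if line.startswith("Factor="):
            fields = line.split("=", 1)[1].split(",")
            exps.append(int(fields[1]))  # fields: AID, exponent, from, to
    return exps

def on_sigterm(signum, frame):
    # Placeholder: stop mfaktc, submit final results, unreserve leftovers.
    with open("worktodo.txt") as f:
        leftovers = unfinished_exponents(f.read())
    print("would unreserve:", leftovers)
    sys.exit(0)

if __name__ == "__main__":
    signal.signal(signal.SIGTERM, on_sigterm)
```

As noted above, this only helps if the instance actually receives the termination signal before it dies.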

Quote:
P.S. I'm very much enjoying this experiment, and working with you. Collaboration is good!
This whole project is one massive cooperative effort. It's excellent :D

Quote:
Originally Posted by chalsall View Post
There is already an option on the assignment pages for "GHz Days of Work". Is this what you are suggesting, or are you suggesting that GPU72 should refine its assignment ranges/depths based on this? The latter is not currently done (it simply limits the number of assignments), but the former makes sense.
I use that feature. :) I added support for it to teknohog's mfloop.py script. On my own boxes I have a crontab entry for each copy of mfaktc, each set to buffer two days' worth of GHz-days for the particular card. So my GTX 760 has a buffer of 500 GHz-days. mfloop.py calculates what's in worktodo.txt and, if it's below 500, requests the difference from GPU72. GPU72 always returns at least one assignment, even if it's larger than the requested amount, so the buffer fills past 500. It works beautifully. It's key that the buffer is bigger than the largest assignment that could be returned, because mfaktc will quit if it runs out of work.
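The top-up logic described above can be sketched like this. The per-assignment cost and line format are rough assumptions for illustration; this is not mfloop.py's actual implementation.

```python
# Sketch of the buffer top-up described above: estimate GHz-days left
# in worktodo.txt and ask GPU72 for the difference when under target.
# The flat per-assignment cost is a rough assumption, not mfloop.py's
# actual accounting.

GHZD_PER_TF_ASSIGNMENT = 7.0  # rough DCTF figure mentioned in this thread

def ghzd_buffered(worktodo_text):
    """Rough GHz-days represented by the Factor= lines in worktodo.txt."""
    n = sum(1 for line in worktodo_text.splitlines()
            if line.startswith("Factor="))
    return n * GHZD_PER_TF_ASSIGNMENT

def ghzd_to_request(worktodo_text, target=500.0):
    """How many GHz-days to ask GPU72 for to refill the buffer."""
    return max(0.0, target - ghzd_buffered(worktodo_text))
```

Because GPU72 returns at least one assignment per request, the buffer can overshoot the target slightly, which is why the target must exceed the largest possible assignment.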

On the EC2 instances, I'm only buffering 25 GHz-days of whatever DCTF work GPU72 decides. The largest assignments are 7 GHz-days and the cron runs once an hour, so that works. On EC2 I'm more concerned about handling lost assignments than about coping with unexpected ISP or GPU72 downtime.

Quote:
This is something requested before (long ago, by Bdot amongst others), and the code is already half developed. Perhaps this will be the kick-in-the-pants needed for me to finish the code.
:D
Mark Rose is offline   Reply With Quote
Old 2014-06-25, 19:19   #31
TheMawn
 
TheMawn's Avatar
 
May 2013
East. Always East.

11·157 Posts
Default

A lot of this is going right over my head but it sounds promising enough.

One question I do have is regarding the buffer that has been mentioned so much. I've just been grabbing assignments in batches of 100 per card, which can last me two weeks if they're all 71 -> 74. It always seemed like a bit much but I figured it wasn't hurting anything. Should I be doing smaller batches?
TheMawn is offline   Reply With Quote
Old 2014-06-25, 19:33   #32
chalsall
If I May
 
chalsall's Avatar
 
"Chris Halsall"
Sep 2002
Barbados

2×5×7×139 Posts
Default

Quote:
Originally Posted by TheMawn View Post
One question I do have is regarding the buffer that has been mentioned so much. I've just been grabbing assignments in batches of 100 per card, which can last me two weeks if they're all 71 -> 74. It always seemed like a bit much but I figured it wasn't hurting anything. Should I be doing smaller batches?
You're good. Keep doing what you're doing.
chalsall is online now   Reply With Quote
Old 2014-06-25, 19:44   #33
chalsall
If I May
 
chalsall's Avatar
 
"Chris Halsall"
Sep 2002
Barbados

260216 Posts
Default

Quote:
Originally Posted by Mark Rose View Post
I prefer not to use extra EBS volumes. First, they cost money, even if not a lot. Second, if you create your spot request with an extra EBS volume, a new one is created every time a spot instance is launched.
Only a new root volume is created at instance creation, and then deleted at termination. The attached 1GB volume is the same across instances. And during "standby", only the 1GB /home volumes are stored.

But this gives the advantage of distributing an AMI and a sample /home image.

LOL, we're arguing about technical details which cost cents per day!
chalsall is online now   Reply With Quote