mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > Cloud Computing

2016-05-25, 06:16   #45
Mark Rose

Quote:
Originally Posted by chalsall
Ah... Thanks. That explains why Craig didn't understand how I was getting GPU instances so inexpensively.

But still, it seems a bit silly... Why present different pricing options to different accounts? Did they not know the "plebs" would compare notes?
There are five availability zones in total in us-east-1. Our primary account has access to all five since it was created long ago. Our other accounts don't get access to all the zones. I could probably pester them, but we haven't had a pressing need.

2016-05-27, 04:51   #46
GP2


Has anyone ever tried out Google Cloud or Microsoft Azure?

I took a brief look at the documentation for Google Compute Engine; it says that spot instances ("preemptible instances") are automatically "terminated" after they have run for 24 hours. However, Google's definition of a "terminated" instance is really a "stopped" instance: it can be restarted (unless it has a local SSD device attached). So I guess you could just keep restarting it every day...
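For example, a minimal sketch of that approach (the instance name mprime-1 and zone us-central1-a are placeholders), which a daily cron job could run:
Code:
#! /bin/sh
# Restart a preemptible instance that GCE "terminated" after 24 hours.
# If the instance is already running, the command fails harmlessly.
gcloud compute instances start mprime-1 --zone us-central1-a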

2016-07-06, 10:15   #47
patrik


I just saw this discussion and wanted to tell you that I made a script last year that successfully attached an EBS volume to a persistent spot instance request. (Persistent means that if the instance is terminated because the spot price rises above the maximum price you have set, a new one automatically launches once the price drops below that level again.)

I have prepared an EBS volume containing an mprime executable, configured with my PrimeNet user ID and preferences.

I made a modified AMI with /mnt/xvdf created and the following lines added to /etc/rc.local (which on this Linux system is the script that executes during boot):
Code:
# Attach volume with mprime and run
/home/ec2-user/start_mprime.sh
Then I have the start script start_mprime.sh (attach_vol.sh is explained below):
Code:
#! /bin/sh
#
# Attach volume with partial result and continue work with mprime
#

# Attach volume to only instance (script only works with one)
/home/ec2-user/attach_vol.sh
# Wait for device to appear
while [ ! -e /dev/xvdf ]
do
  sleep 1
done
# Sleep 10 extra seconds just to be safe
sleep 10
# Mount disk
mount /dev/xvdf /mnt/xvdf
# Wait if mount command is not blocking (didn't check)
while [ ! -e /mnt/xvdf/mprime ]
do
  sleep 1
done
# Start mprime
nohup /mnt/xvdf/mprime/mprime -d >> /mnt/xvdf/mprime/log.txt &
When I created the persistent spot instance request, I gave it an IAM role permitted to attach volumes from within the instance, so the attach_vol.sh "script" just issues the command
Code:
aws ec2 attach-volume --volume-id vol-[my-id-masked] --instance-id `/home/ec2-user/instance_id.sh` --device /dev/xvdf
Finally, the instance_id.sh is another one-line script:
Code:
#! /bin/sh
aws ec2 describe-instances --filters Name=instance-state-name,Values=running | grep "InstanceId" | awk '{print $2}' | sed -e s/\"//g -e s/,//
Note that this quick hack only works with one running instance, and that the volume ID is hard-coded into the previous script, but it illustrates that the method works.

Btw, welcome back, GP2!

2016-07-07, 15:19   #48
GP2


Quote:
Originally Posted by patrik
I just saw this discussion and wanted to tell you that I made a script last year that successfully attached an EBS volume to a persistent spot instance request.
There is another thread in the Hardware forum with more extensive discussion of Amazon EC2, mostly about its cost-effectiveness versus running a simple compute-server farm of your own.

If there is enough interest, perhaps a Cloud Computing sub-forum could be created.

Are you currently still using Amazon?

There is a better solution available now: in just the past few days, Amazon made EFS (Elastic File System) generally available. It is a managed network filesystem based on standard NFSv4.1. It's currently only available in us-east-1 (N. Virginia), us-west-2 (Oregon) and eu-west-1 (Ireland), but presumably they will soon deploy it more widely.

With EFS you no longer need to allocate a separate do-not-delete-on-termination EBS volume with a fixed 1 GB allocation. Rather, EFS just grows automatically as necessary, and it lets you store the work directories (with the worktodo.txt and save files) of all the instances as sibling subdirectories. The mprime executable and configuration files (prime.txt and local.txt) also live on the EFS filesystem. All the availability zones within a single region can share the same EFS filesystem.
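For reference, mounting it is just an NFS mount; a minimal sketch (fs-12345678, us-east-1 and /mnt/efs are placeholders; check the EFS console for your filesystem's exact mount-target DNS name), using Amazon's recommended NFSv4.1 options:
Code:
#! /bin/sh
# Mount an EFS filesystem at /mnt/efs over NFSv4.1 (run as root).
mkdir -p /mnt/efs
mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 \
    fs-12345678.efs.us-east-1.amazonaws.com:/ /mnt/efs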

I wrote a user-data (startup) script that lets newly-launched instances automatically locate orphaned subdirectories (of instances that were terminated for whatever reason) and resume the work. I can share if anyone's interested. This uses the User Data field that is filled in when the instance is configured and launched, so you can just use the standard Amazon AMI rather than creating one of your own just to modify the operating system startup scripts.

This allows for any number of simultaneous instances, and multiple instance types (c4.large, c4.xlarge, etc).

Quote:
Originally Posted by patrik
Finally, the instance_id.sh is another one-line script:
Code:
#! /bin/sh
aws ec2 describe-instances --filters Name=instance-state-name,Values=running | grep "InstanceId" | awk '{print $2}' | sed -e s/\"//g -e s/,//
If an instance just wants to discover its own instance ID, the standard way is $(curl http://169.254.169.254/latest/meta-data/instance-id) (that URL only works from within a running EC2 instance).

I do use describe-instances in my user-data startup script to discover all the other running instances, so I can figure out which EFS subdirectories are orphaned. Each subdirectory has a name matching the instance ID that created it, so if that instance is no longer running, the subdirectory is orphaned and a newly-launched instance can take over its worktodo and save files. Note that your IAM role has to grant permission to run describe-instances.
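For example, a sketch of granting that permission with an inline policy (the role name mprime-role and the policy name are hypothetical):
Code:
#! /bin/sh
# Allow instances running under the role to call ec2:DescribeInstances.
aws iam put-role-policy --role-name mprime-role \
    --policy-name allow-describe-instances \
    --policy-document '{
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "ec2:DescribeInstances",
            "Resource": "*"
        }]
    }'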

Quote:
Originally Posted by patrik
Note that this quick hack only works with one running instance, and that the volume ID is hard-coded into the previous script, but it illustrates that the method works.
I am running more than one instance, so it wouldn't work for me (in AWS regions where EFS is not yet available). However, you could probably run aws ec2 describe-volumes to find all unattached volumes, and then just pick the first one in the list.
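A sketch of that idea (unattached volumes report the volume status "available"; if two instances grab the same volume, attach-volume simply fails for the loser, which could then retry with the next volume):
Code:
#! /bin/sh
# Attach the first unattached EBS volume in the region to this instance.
vol=$(aws ec2 describe-volumes --filters Name=status,Values=available \
      --query 'Volumes[0].VolumeId' --output text)
aws ec2 attach-volume --volume-id "$vol" \
    --instance-id "$(curl -s http://169.254.169.254/latest/meta-data/instance-id)" \
    --device /dev/xvdf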


Quote:
Originally Posted by patrik
Btw, welcome back, GP2!
Thanks. By the way, there was an amusing case where you did both the original first-time LL test and the double check, eight years apart: M35062633. I think the exponent was randomly assigned both times. I've seen this happen with Curtis C. as well.

2016-07-07, 15:31   #49
GP2


In another thread Mark Rose posted some benchmarks for the smallest compute-optimized instance (c4.large).

2016-07-07, 18:46   #50
chalsall


Quote:
Originally Posted by GP2
Each subdirectory has a name which matches the instance id that created it, so if that instance is no longer running then the subdirectory is orphaned and a newly-launched instance can take over its worktodo and save files.
Just wondering... How do you avoid race conditions in case two or more instances come online at the same instant? Filesystem lock files? Or do you launch instances manually so this isn't an issue?

2016-07-07, 20:45   #51
GP2


Quote:
Originally Posted by chalsall
Just wondering... How do you avoid race conditions in case two or more instances come online at the same instant? Filesystem lock files? Or do you launch instances manually so this isn't an issue?
Here's the section of code from my user-data script. Note that this script runs as root.

I'm no expert on handling race conditions, so any critiques would be welcome.

Each subdirectory has the same name as the instance id that created it.

When a new instance launches, it first creates a temporary file containing a list (one per line) of all orphaned subdirectories, i.e., subdirectories with names starting with "i-" that don't correspond to any running instance. If orphaned subdirectories exist, it renames one of them to its own instance ID, and then it will take over and resume any pending worktodo and save files.

Here's how I try to handle race conditions: the script reads the list of orphaned subdirectories one line at a time. For each line, it attempts to rename the orphaned subdirectory to its own instance id, and then checks for the existence of the renamed subdirectory.

Normally it will succeed on the first line and break out of the loop, but if there's a race condition another instance might have renamed that subdirectory in the meantime, so the rename will have failed. It then continues looping to the next line and tries that one, and so forth.

If it reaches end of file without finding any suitable orphaned subdirectory to rename (or perhaps the list was empty to begin with), then it simply creates a new subdirectory named after its own instance id.

Note: the IAM role that your instances run under must include a policy that allows describe-instances.


Code:
    availability_zone=$(curl http://169.254.169.254/latest/meta-data/placement/availability-zone)
    region=$(echo -n ${availability_zone} | sed 's/[a-z]$//')

    all_subdirs_tmpfile=$(mktemp)
    ls -d -1 i-* > ${all_subdirs_tmpfile}
    all_instances_tmpfile=$(mktemp)
    #Make sure to filter out recently terminated instances, which otherwise remain for up to one hour
    aws ec2 describe-instances --region=${region} --output=text --query 'Reservations[*].Instances[*].InstanceId' --filters "Name=instance-state-name,Values=running" | sed 's/\t/\n/g' | sort > ${all_instances_tmpfile}

    orphaned_subdirs_tmpfile=$(mktemp)
    comm -2 -3 ${all_subdirs_tmpfile} ${all_instances_tmpfile} > ${orphaned_subdirs_tmpfile}

    instance_id=$(curl http://169.254.169.254/latest/meta-data/instance-id)

    while read -r line; do
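       # The rename is a single atomic operation on EFS (NFSv4.1); if another
       # instance already claimed this subdirectory, mv fails and the loop
       # simply tries the next candidate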
       mv ${line} ${instance_id}
       if [ -d ${instance_id} ]; then
           break
       fi
    done < ${orphaned_subdirs_tmpfile}

    if [ ! -d ${instance_id} ]; then
        mkdir ${instance_id}
    fi

2016-07-07, 21:19   #52
chalsall


Quote:
Originally Posted by GP2
I'm no expert on handling race conditions, so any critiques would be welcome.
Assuming the EFS's rename function is atomic (and it should be), this should work fine.

2016-07-08, 16:23   #53
Madpoo


Quote:
Originally Posted by chalsall
Assuming the EFS's rename function is atomic (and it should be), this should work fine.
One other technique that might be helpful is to add a randomized delay before starting the process, just in case two systems started up at the exact same time. This is somewhat common in recovery situations where race conditions are a possibility (or power surges if we're talking hardware, or bandwidth surges when things are scheduled at the same time).
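For instance, a one-line sketch (assuming bash, since POSIX sh lacks $RANDOM):
Code:
# Sleep a random 0-59 seconds before trying to claim an orphaned directory.
sleep $(( RANDOM % 60 ))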

2016-07-08, 16:47   #54
chalsall


Quote:
Originally Posted by Madpoo
One other technique that might be helpful is to randomize a delay before starting the process, just in case two systems started up at the exact time.
This doesn't actually eliminate the race condition, just makes it less likely. What GP2 has implemented should be deterministically sane (again, assuming renames are atomic; since EFS is based on NFSv4.1, I believe it is).

2016-07-10, 11:59   #55
GP2


Quote:
Originally Posted by GP2
I wrote a user-data (startup) script that lets newly-launched instances automatically locate orphaned subdirectories (of instances that were terminated for whatever reason) and resume the work. I can share if anyone's interested. This uses the User Data field that is filled in when the instance is configured and launched, so you can just use the standard Amazon AMI rather than creating one of your own just to modify the operating system startup scripts.
Note that there is one minor drawback to using the user-data field at instance launch time to run your startup script: the script runs only once, when the virtual machine boots for the first time after being instantiated.

So if the virtual machine is ever rebooted for any reason, nothing in the user-data startup script gets done again, including running mprime. The virtual machine will just sit there doing nothing, but of course the instance is still running and billable.

Normally this isn't a problem, since you would rarely need to reboot your virtual machine, and if you ever did, you could just manually re-run the user-data script commands. However, one of my instances spontaneously rebooted today for an unknown reason, and I didn't discover that until ten hours later. It might have been hardware-related: when I clicked on it in the EC2 console, it displayed "Retiring: This instance is scheduled for retirement after [date about two weeks from now]". So I just terminated it, and the spot fleet request automatically launched a new instance.
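One possible workaround (a sketch, not something I have tested): have the user-data script install a cron entry so the same startup logic re-runs after any reboot. Here /mnt/efs/start_mprime.sh is a hypothetical path to that logic:
Code:
# Run once from the user-data script; files in /etc/cron.d take a user field.
cat > /etc/cron.d/mprime <<'EOF'
@reboot root /mnt/efs/start_mprime.sh
EOF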

The moral of the story is that you have to monitor: just because your instance is up and running doesn't mean that mprime is up and running. With EFS it's easy to run an "ls -lt" periodically and check the last-modified dates of the sibling subdirectories that contain the worktodo and save files: any subdirectory whose timestamp is more than half an hour old (i.e. the DiskWriteTime interval) means that mprime isn't writing save files, so it probably isn't running.
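For example (a sketch; /mnt/efs is a placeholder for the EFS mount point):
Code:
# List work directories untouched for over 30 minutes; any hit probably
# means mprime has stopped writing save files there.
find /mnt/efs -maxdepth 1 -type d -name 'i-*' -mmin +30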

For traditional setups that don't use EFS (which is currently only available in the us-east-1, us-west-2 and eu-west-1 regions), I'm not sure what the best way is to ensure that mprime is still running. Maybe screen-scrape http://www.mersenne.org/workload/ and check that the percentage-complete statistic keeps incrementing daily. You could theoretically go to http://www.mersenne.org/cpus/, click on each registered computer, and set the Email option to send an alert if the computer is late contacting the PrimeNet server, but that's completely impractical for ephemeral virtual machines rather than physical hardware, especially spot instances that can be terminated at any time; you'd also get bogus e-mails when terminated instances stop talking to PrimeNet.