Old 2014-06-24, 16:57   #12
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

2×5×7×139 Posts

Quote:
Originally Posted by Mark Rose View Post
That inspired me. After tweaking all night, I've now got 10 M2050 cards (at a total of $0.80/hr or less) crunching DCTF.
Coolness! Thanks for helping out! And, yeah, it's kinda neat being able to "spin up" ~2,800 GHzd/d of GPU compute when needed, all for less than $17 USD a day!

Quote:
Originally Posted by Mark Rose View Post
Had to release back quite a few DCTF exponents as part of the debugging, though. Hopefully that didn't mess anything up on GPU72's end.
Not a problem at all.

Quote:
Originally Posted by Mark Rose View Post
Now to see how long I feel like paying for this experiment...
Indeed!

I've set myself a budget of no more than $150 USD a month -- approximately the same as my other habit (tobacco).

Quote:
Originally Posted by Mark Rose View Post
Edit: looks like I hit a race condition. I ended up getting one set of assignments 9 times, when all 5 machines (10 cards) launched at once. My guess is a non-isolated read in the database. I'll have to hack in a random delay for polling for work.
Within your own code, or GPU72? The DCTF assignment form locks the tables before querying for and assigning candidates, so it /shouldn't/ have given the same assignment twice. But if you can give me a couple of example candidates (and, if possible, the approximate time assigned), I can drill down and see if I have a SPE.

Thanks again!

P.S. Oh, and so you know, I had planned on comparing the economics of g2.2xlarge vs. cg1.4xlarge. The latter is a bit less than twice the spot price (USD 0.0815 vs 0.135) currently, with a bit more than twice the GPU throughput (~240 vs 2*~280 GHzd/d), so the cg1 definitely wins. Thanks for inspiring me to drill down on What Makes (Economic) Sense!
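
For anyone wanting to redo this comparison, here is a minimal sketch of the arithmetic using only the figures quoted above. The dictionary layout and the function name `ghzd_per_dollar` are made up for illustration, and spot prices move around, so treat the numbers as a snapshot.

```python
# Figures quoted in the post above: spot price in USD/hour, throughput in GHz-d/day.
# Spot prices fluctuate, so this is a snapshot, not current pricing.
INSTANCES = {
    "g2.2xlarge": {"usd_per_hour": 0.0815, "ghzd_per_day": 240.0},
    "cg1.4xlarge": {"usd_per_hour": 0.135, "ghzd_per_day": 2 * 280.0},  # two GPUs
}


def ghzd_per_dollar(usd_per_hour: float, ghzd_per_day: float) -> float:
    """GHz-days of work bought per USD, assuming a constant hourly price."""
    return ghzd_per_day / (usd_per_hour * 24)


results = {name: ghzd_per_dollar(**spec) for name, spec in INSTANCES.items()}
ratio = results["cg1.4xlarge"] / results["g2.2xlarge"]

for name, value in results.items():
    print(f"{name}: {value:.1f} GHz-d per USD")
print(f"cg1/g2 ratio: {ratio:.2f}")
```

With these particular inputs the cg1 comes out roughly 1.4 times better per dollar; different measured throughputs or spot prices will shift that.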
Old 2014-06-24, 19:55   #13
Mark Rose
 
"/X\(‘-‘)/X\"
Jan 2013

101101110100₂ Posts

Quote:
Originally Posted by chalsall View Post
Coolness! Thanks for helping out! And, yeah, it's kinda neat being able to "spin up" ~2,800 GHzd/d of GPU compute when needed, all for less than $17 USD a day!
Yeah. I had no idea it would be so cheap with spot. I did bump up the spot price over a cent when I was launching and terminating all the instances, but I think there's still a fair amount of spare capacity in the one zone in us-east-1. The other zones there and in Ireland (eu-west-1) are all at full price at the moment.

Quote:
I've set myself a budget of no more than $150 USD a month -- approximately the same as my other habit (tobacco).
$50 for myself. But it's good for a few days and several positions on the DCTF leaderboard lol

Quote:
Within your own code, or GPU72? The DCTF assignment form locks the tables before querying for and assigning candidates, so it /shouldn't/ have given the same assignment twice. But if you can give me a couple of example candidates (and, if possible, the approximate time assigned), I can drill down and see if I have a SPE.
With GPU72. Approximately 09:08 to 09:15 UTC (5:08 AM EDT). Look for a bunch of exponents that all got handed out 9+ times each. As it was from 5 separate machines, it's not a race in the mfloop.py script.

If you're using MySQL, be sure you're using the LOCK TABLE `table` WRITE; syntax.

Quote:
P.S. Oh, and so you know, I had planned on comparing the economics of g2.2xlarge vs. cg1.4xlarge. The latter is a bit less than twice the spot price (USD 0.0815 vs 0.135) currently, with a bit more than twice the GPU throughput (~240 vs 2*~280 GHzd/d), so the cg1 definitely wins. Thanks for inspiring me to drill down on What Makes (Economic) Sense!
As soon as I saw that cg1.4xlarge could be had for less than double the g2.2xlarge spot price, I decided to investigate the performance of the GPU, and it turned out you get about double the GHz-d/$. There is, however, a lot less cg1 capacity available.

I wrote my scripts fairly generically. They should support the g2 out of the box. I want to add a 0-50 second random delay to the work fetching to not swamp mersenne/gpu72. A little more work and I could have an AMI that could be completely configured from the user-data, so anyone could launch and participate. Given how quickly they chew through exponents, the backup component really isn't necessary, but might be desirable for 73-74 assignments. I would also like to use the CPUs, which sit idle.
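
The random delay mentioned above could look something like the sketch below. This is not the actual mfloop.py code; `fetch_with_jitter` and its parameters are hypothetical names, and the `sleep` and `rng` arguments exist only so the behaviour can be tested without really waiting.

```python
import random
import time


def fetch_with_jitter(fetch, max_delay_s=50.0, sleep=time.sleep, rng=None):
    """Wait a random 0..max_delay_s seconds, then call fetch() and return its result.

    Spreading out the start times keeps a fleet of machines that boot together
    from all polling the server in the same instant.
    """
    rng = rng or random.Random()
    delay = rng.uniform(0.0, max_delay_s)
    sleep(delay)
    return fetch()


if __name__ == "__main__":
    # Demo with a short maximum delay and a stand-in fetch function.
    print(fetch_with_jitter(lambda: "got work", max_delay_s=0.1))
```

Note that jitter only makes simultaneous requests less likely; it does not replace locking on the server side.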
Old 2014-06-24, 20:37   #14
chalsall

Quote:
Originally Posted by Mark Rose View Post
With GPU72. Approximately 09:08 to 09:15 UTC (5:08 AM EDT). Look for a bunch of exponents that all got handed out 9+ times each. As it was from 5 separate machines, it's not a race in the mfloop.py script.
Drilled down. Didn't find any evidence in the logs that GPU72 made a mistake. Could you give additional evidence that I've made a mistake by way of candidates assigned, rather than just time?

Quote:
Originally Posted by Mark Rose View Post
If you're using MySQL, be sure you're using the LOCK TABLE `table` WRITE; syntax.
The exact code is:
Code:
   DoSQLCommand("lock tables GPU write, Assigned write");

   $sql = "select * from (select Exponent,FactTo from GPU where"
." Exponent>=${Low} and Exponent<=${High} and FactTo<${Pledge} and WorkTypePN=201 and Status=0 and Back=0"
." order by ${Sort} limit ${Number}) as G order by Exponent";
Old 2014-06-24, 22:33   #15
Mark Rose
 

Quote:
Originally Posted by chalsall View Post
Drilled down. Didn't find any evidence in the logs that GPU72 made a mistake. Could you give additional evidence that I've made a mistake by way of candidates assigned, rather than just time?
These are the candidates that have shown up in my worktodo.txt files more than once in the past nine hours (I'm buffering 9 hours of work, and this doesn't include work already done). I haven't unreserved any of these, at least not in the last 12 hours (I was unreserving a bunch last night during debugging, but that stopped over 12 hours ago).

Code:
backup:~/mfaktc_backup$ cat */worktodo.txt | sort | uniq -cd | less | sort -r
      5 Factor=N/A,35253013,70,71
      5 Factor=N/A,35253007,70,71
      5 Factor=N/A,35252993,70,71
      5 Factor=N/A,35252927,70,71
      5 Factor=N/A,35252717,70,71
      5 Factor=N/A,35252689,70,71
      5 Factor=N/A,35252641,70,71
      5 Factor=N/A,35252431,70,71
      5 Factor=N/A,35252417,70,71
      5 Factor=N/A,35252359,70,71
      5 Factor=N/A,35252233,70,71
      5 Factor=N/A,35252143,70,71
      5 Factor=N/A,35252089,70,71
      5 Factor=N/A,35252051,70,71
      5 Factor=N/A,35252047,70,71
      5 Factor=N/A,35252039,70,71
      5 Factor=N/A,35251757,70,71
      5 Factor=N/A,35251621,70,71
      5 Factor=N/A,35251597,70,71
      5 Factor=N/A,35251499,70,71
      5 Factor=N/A,35251303,70,71
      5 Factor=N/A,35251199,70,71
      5 Factor=N/A,35251147,70,71
      5 Factor=N/A,35251087,70,71
      2 Factor=N/A,35253047,70,71
      2 Factor=N/A,35251789,70,71
or ordered by exponent:

Code:
backup:~/mfaktc_backup$ cat */worktodo.txt | sort | uniq -cd | sort -t, -k2,2n
      5 Factor=N/A,35251087,70,71
      5 Factor=N/A,35251147,70,71
      5 Factor=N/A,35251199,70,71
      5 Factor=N/A,35251303,70,71
      5 Factor=N/A,35251499,70,71
      5 Factor=N/A,35251597,70,71
      5 Factor=N/A,35251621,70,71
      5 Factor=N/A,35251757,70,71
      2 Factor=N/A,35251789,70,71
      5 Factor=N/A,35252039,70,71
      5 Factor=N/A,35252047,70,71
      5 Factor=N/A,35252051,70,71
      5 Factor=N/A,35252089,70,71
      5 Factor=N/A,35252143,70,71
      5 Factor=N/A,35252233,70,71
      5 Factor=N/A,35252359,70,71
      5 Factor=N/A,35252417,70,71
      5 Factor=N/A,35252431,70,71
      5 Factor=N/A,35252641,70,71
      5 Factor=N/A,35252689,70,71
      5 Factor=N/A,35252717,70,71
      5 Factor=N/A,35252927,70,71
      5 Factor=N/A,35252993,70,71
      5 Factor=N/A,35253007,70,71
      5 Factor=N/A,35253013,70,71
      2 Factor=N/A,35253047,70,71
I haven't been copying old worktodo.txt files around, just rsync'ing them to the backup machine every minute.
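
The same duplicate check can be sketched in Python, which makes the numeric ordering by exponent explicit. The function name `duplicated_assignments` is hypothetical, and the sample lines below reuse a few exponents from the listing with made-up repeat counts.

```python
from collections import Counter


def duplicated_assignments(lines):
    """Return (count, line) pairs for worktodo lines seen more than once,
    ordered numerically by exponent."""
    counts = Counter(line.strip() for line in lines if line.strip())
    dupes = [(n, line) for line, n in counts.items() if n > 1]
    # In "Factor=N/A,<exponent>,<from>,<to>" the exponent is the second field.
    return sorted(dupes, key=lambda pair: int(pair[1].split(",")[1]))


# Illustrative input only; the repeat counts here are invented.
sample = [
    "Factor=N/A,35253047,70,71",
    "Factor=N/A,35251087,70,71",
    "Factor=N/A,35253047,70,71",
    "Factor=N/A,35251087,70,71",
    "Factor=N/A,35252039,70,71",
]

for n, line in duplicated_assignments(sample):
    print(f"{n:7d} {line}")
```

To run it over real files, feed it the lines of every `*/worktodo.txt` (for example by looping over `glob.glob("*/worktodo.txt")`). Like the shell pipeline, it only says that a line appears in more than one file, not why.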

Last fiddled with by Mark Rose on 2014-06-24 at 22:37
Old 2014-06-24, 22:53   #16
chalsall

Quote:
Originally Posted by Mark Rose View Post
These are the candidates that have shown up in my worktodo.txt files more than once in the past nine hours (I'm buffering 9 hours of work, and this doesn't include work already done). I haven't unreserved any of these, at least not in the last 12 hours (I was unreserving a bunch last night during debugging, but that stopped over 12 hours ago).

All I can tell you is what I see.

Code:
mysql> select * from Assigned where Exponent=35251087;
+---------+----------+----------------------------------+----------+------------+--------+----------+--------+---------------------+---------------------+---------------------+---------------------+---------------------+----------+----------+----+--------------+-----------------+-----+-----+---------+-----+
| ID      | Exponent | User                             | WorkType | WorkTypePN | Status | FactFrom | FactTo | Assigned            | Updated             | Completed           | Adjusted            | James               | Extended | Factored | P1 | GHzDays      | IP              | CID | CPU | Percent | AID |
+---------+----------+----------------------------------+----------+------------+--------+----------+--------+---------------------+---------------------+---------------------+---------------------+---------------------+----------+----------+----+--------------+-----------------+-----+-----+---------+-----+
| 1384288 | 35251087 | 0b704871bc2c1d367ffd1518c35d6e60 |        1 |        201 |      1 |       69 |     70 | 2014-06-08 23:20:16 | 0000-00-00 00:00:00 | 2014-06-12 01:21:47 | 2014-06-09 01:21:47 | 0000-00-00 00:00:00 |        0 |        0 |  0 | 3.3917772770 | 84.80.145.240   |     |   0 |       0 |     | 
| 1405044 | 35251087 | ea39a75de82cd896610be22735054fc5 |        1 |        201 |     90 |       70 |     71 | 2014-06-24 07:13:48 | 0000-00-00 00:00:00 | 2014-06-24 07:37:56 | 0000-00-00 00:00:00 | 0000-00-00 00:00:00 |        0 |        0 |  0 | 6.7835545540 | 50.17.6.127     |     |   0 |       0 |     | 
| 1405226 | 35251087 | ea39a75de82cd896610be22735054fc5 |        1 |        201 |     90 |       70 |     71 | 2014-06-24 08:27:58 | 0000-00-00 00:00:00 | 2014-06-24 08:56:44 | 0000-00-00 00:00:00 | 0000-00-00 00:00:00 |        0 |        0 |  0 | 6.7835545540 | 50.19.167.195   |     |   0 |       0 |     | 
| 1405425 | 35251087 | ea39a75de82cd896610be22735054fc5 |        1 |        201 |      1 |       70 |     71 | 2014-06-24 09:59:58 | 0000-00-00 00:00:00 | 2014-06-24 11:22:34 | 2014-06-24 11:22:34 | 0000-00-00 00:00:00 |        0 |        0 |  0 | 6.7835545540 | 174.129.137.188 |     |   0 |       0 |     | 
+---------+----------+----------------------------------+----------+------------+--------+----------+--------+---------------------+---------------------+---------------------+---------------------+---------------------+----------+----------+----+--------------+-----------------+-----+-----+---------+-----+
Status == 90 means canceled; Status == 1 means completed.
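
One way to read those rows is to check whether any two of the 70-to-71 assignments were open at the same time. Here is a rough sketch, assuming the Completed column records when a row was either canceled or completed (that reading is an inference from the output, not something confirmed by the schema); `overlapping_pairs` is a made-up helper name.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (ID, Status, Assigned, Completed) for the three 70->71 rows in the output above.
ROWS = [
    (1405044, 90, "2014-06-24 07:13:48", "2014-06-24 07:37:56"),
    (1405226, 90, "2014-06-24 08:27:58", "2014-06-24 08:56:44"),
    (1405425, 1, "2014-06-24 09:59:58", "2014-06-24 11:22:34"),
]


def overlapping_pairs(rows):
    """Return pairs of IDs whose [Assigned, Completed] intervals overlap."""
    spans = sorted(
        (datetime.strptime(assigned, FMT), datetime.strptime(completed, FMT), row_id)
        for row_id, _status, assigned, completed in rows
    )
    overlaps = []
    for i, (_start_i, end_i, id_i) in enumerate(spans):
        for start_j, _end_j, id_j in spans[i + 1:]:
            if start_j < end_i:  # the later row began before the earlier one closed
                overlaps.append((id_i, id_j))
    return overlaps


print(overlapping_pairs(ROWS))
```

An empty list means each assignment was closed before the next one was opened, which is what you would expect if the exponent was released and then reassigned rather than handed out twice at once.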
Old 2014-06-24, 23:17   #17
Mark Rose
 

Quote:
Originally Posted by chalsall View Post
All I can tell you is what I see.

Hmm. There was a lot of GPU72 slowness when I was unassigning candidates this morning. It's possible that a transaction died halfway through (is innodb_rollback_on_timeout set to ON?). Could you run something like
Code:
SELECT COUNT(g.Exponent)
FROM GPU g
LEFT JOIN Assigned a ON g.Exponent = a.Exponent AND g.FactTo <= a.FactTo AND g.Status != a.Status
WHERE g.Exponent > 35000000 AND g.Exponent < 36000000 AND g.WorkTypePN=201 and g.Status=0 and g.Back=0
and see if any inconsistencies are found? I assume that Status should match in both tables, correct? I can only guess at the internals :)

I've run into issues with innodb_rollback_on_timeout being set to OFF (default) before.
Old 2014-06-24, 23:30   #18
chalsall

Quote:
Originally Posted by Mark Rose View Post
I've run into issues with innodb_rollback_on_timeout being set to OFF (default) before.
GPU72 doesn't use transactions. It uses table locks when (and only when) needed.
Old 2014-06-24, 23:38   #19
Mark Rose
 

Quote:
Originally Posted by chalsall View Post
GPU72 doesn't use transactions. It uses table locks when (and only when) needed.
Okay. Checking the Status consistency is the only other thing I can think of checking. Otherwise I'll see how things look tomorrow.
Old 2014-06-25, 00:15   #20
chalsall

Quote:
Originally Posted by Mark Rose View Post
Okay. Checking the Status consistency is the only other thing I can think of checking. Otherwise I'll see how things look tomorrow.
Absolutely no disrespect intended. But you don't have the full schema. Assigned.Status != GPU.Status != Exponent.Status.

All I can tell you (again) is I don't think the code on GPU72 did you wrong. Happy to be proven wrong.
Old 2014-06-25, 00:19   #21
Mark Rose
 

Quote:
Originally Posted by chalsall View Post
Absolutely no disrespect intended. But you don't have the full schema. Assigned.Status != GPU.Status != Exponent.Status.
None taken :)

I may look into more detailed logging. But not tonight. For now I'll just let it run.
Old 2014-06-25, 03:37   #22
Mark Rose
 

I must apologize for all the trouble. I had one stale file that I missed deleting that was causing the problem.

Everything is working perfectly. I'm sorry.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.