mersenneforum.org  

Old 2019-10-07, 19:28   #1
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

5×7×257 Posts
Default GPU72 Notebook Integration...

As some of you might know, several of us have been having great fun over in the Google Colaboratory Notebook? thread -- learning about Notebook instances...

Since the GPU72 integration is just a small "proof-of-concept" part of the bigger picture, I wanted to fork out further discussions specific to GPU72 here.

An update:

1. Everyone is now running version 0.32 of the Bootstrap payload.

1.1. This fixes a SPE with the regular expression (regex) which was (very rarely) extracting the GHzD/D and IntrT values incorrectly. (Always remember: regex quantifiers are "greedy" by default!)

2. I have reduced the recycling period from twelve (12) to three (3) hours.

2.1. So long as no-one sees anything odd, I should be able to bring this down to perhaps as low as 30 minutes.

2.2. The temporal delta value is calculated from Last Updated or Assigned, whichever is more recent.

3. I /might/ have an odd race condition going on in the TF'ing instances. I have seen extremely rare cases on some of my own machines where an assignment is given, but then not reported on. Further drill-down shows it doesn't appear in the worktodo files either.

3.1. The system will automatically recover from this if and whenever it happens, but I don't like things happening which I don't understand. It means I've made a mistake somewhere...
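
To illustrate the "greedy" remark in item 1.1, here is a minimal Python sketch. The status line is invented for the example and is not real mfaktc or GPU72 output, and the patterns are not the ones used in the Bootstrap payload. It only shows how a greedy wildcard can capture the wrong value when a field name appears more than once.

```python
import re

# Hypothetical status text, made up for this example (NOT real mfaktc output).
line = "GHzD/D=1180.25 IntrT=5.1 | GHzD/D=1195.50 IntrT=5.0"

# Greedy: ".*" first runs to the end of the line, then backtracks,
# so the capture group lands on the LAST "IntrT=" value.
greedy = re.search(r"GHzD/D=.*IntrT=([\d.]+)", line)

# Lazy: ".*?" expands one character at a time and stops at the
# FIRST place where the rest of the pattern matches.
lazy = re.search(r"GHzD/D=.*?IntrT=([\d.]+)", line)

# Explicit: no wildcard at all; each field gets a tight character class.
explicit = re.search(r"GHzD/D=([\d.]+)\s+IntrT=([\d.]+)", line)

print(greedy.group(1))    # 5.0
print(lazy.group(1))      # 5.1
print(explicit.groups())  # ('1180.25', '5.1')
```

Of the three, the explicit pattern is usually the safest choice, because it cannot silently skip over unrelated text between the two fields.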

As always, observations welcomed...
Old 2019-10-08, 01:28   #2
Uncwilly
6809 > 6502
 
"""""""""""""""""""
Aug 2003
101×103 Posts

17564₈ Posts
Default

Old 2019-10-10, 17:03   #3
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

21443₈ Posts
Default

Quick update:

The race condition has finally been solved. Man, that was stupid!

A good example of how difficult debugging can be when you start introducing unusual but not unreasonable temporal harmonics.

I have now reduced the recycling period down to one (1) hour. So long as no one reports anything weird, I'll be able to bring this down to twenty (20) minutes.

Edit: Oh... And could I please ask that people try to post here for GPU72_TF Notebook stuff. I feel like I've dominated the Google Notebook thread with this; I'd like that to get back to general usage case-studies and examples.

Edit2: Perhaps a kind Super Mod could create a new sub-forum? Something like "Notebook Instances"? This subject space is just begging to fork into many, many threads!!!

Last fiddled with by chalsall on 2019-10-10 at 17:07
Old 2019-10-13, 01:14   #4
Chuck
 
May 2011
Orange Park, FL

2·3·139 Posts
Default Checkpoint remaining after my housecleaning

After a day of Kaggle/Colab sessions, I often have a bunch of leftover checkpoint files from restarts etc.

I have manually transferred these to my local worktodo file, then completed and reported the work with mfaktc. However, the Kaggle/Colab checkpoints still show up on my GPUto72 report, even though the work has been reported to PrimeNet and the assignments are no longer on my GPUto72 assignments list.

This leads to the checkpoint files being picked up by a later Colab run. It restarts work that I have already reported.

If there were checkboxes to delete checkpoint files this could be eliminated.
Old 2019-10-13, 05:01   #5
Uncwilly
6809 > 6502
 
Uncwilly's Avatar
 
Aug 2003
101×103 Posts

2²×3×11×61 Posts
Default

Why not just let the system finish them next time you get going? That is what I do.
Old 2019-10-13, 05:14   #6
LaurV
Romulan Interpreter
 
Jun 2011
Thailand

8548₁₀ Posts
Default

Chris, please do NOT expire/recycle the assignments that were already started, for at least 4 or 5 days! One hour is too short.

That is because we have limited account time per week, which we use up in the first 2-3 days. We then have to wait until the next week to resume the interrupted assignment; otherwise the work is lost.
Old 2019-10-13, 13:06   #7
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

2323₁₆ Posts
Default

Quote:
Originally Posted by Chuck View Post
This leads to the checkpoint files being picked up by a later Colab run. It restarts work that I have already reported. If there were checkboxes to delete checkpoint files this could be eliminated.
Hmmm... An unusual use-case. I'll have to think about how to handle this -- it's basically a race condition between the Human and the Machine(s).

Could you please PM me a couple of candidate examples where this happened?
Old 2019-10-13, 13:08   #8
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

5×7×257 Posts
Default

Quote:
Originally Posted by LaurV View Post
Chris, please do NOT expire/recycle the assignments that were already started, for at least 4 or 5 days! One hour is too short.
To be clear... candidates which have had work done will *not* be expired. They are held until another one of *your* instances asks for more work -- however long that may be.
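
The recycling rule as described in this thread can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Assignment` record, its field names and `is_recyclable` are hypothetical and do not reflect the GPU72 server's actual schema or code. The sketch reads "whichever is youngest" as "the more recent timestamp", and uses the one-hour period as a default only because it is one of the values mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# One of the periods quoted in the thread; the real value changed several times.
RECYCLE_AFTER = timedelta(hours=1)


@dataclass
class Assignment:
    """Hypothetical record; not the GPU72 database schema."""
    assigned: datetime
    last_updated: Optional[datetime] = None
    work_done: bool = False  # True once any work has been checkpointed


def is_recyclable(a: Assignment, now: datetime,
                  period: timedelta = RECYCLE_AFTER) -> bool:
    # Candidates that have had work done are never expired.
    if a.work_done:
        return False
    # Temporal delta is measured from the more recent of the two timestamps.
    youngest = max(t for t in (a.assigned, a.last_updated) if t is not None)
    return now - youngest > period


now = datetime(2019, 10, 13, 13, 8)
untouched = Assignment(assigned=now - timedelta(hours=3))
started = Assignment(assigned=now - timedelta(days=5), work_done=True)
print(is_recyclable(untouched, now))  # True
print(is_recyclable(started, now))    # False
```

The point of the `work_done` check coming first is that an assignment with a checkpoint stays reserved no matter how old it is, which addresses the weekly-quota concern raised above.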
Old 2019-10-21, 16:20   #9
Uncwilly
6809 > 6502
 
Uncwilly's Avatar
 
"""""""""""""""""""
Aug 2003
101×103 Posts

2²×3×11×61 Posts
Default

Twice now I have tried to spin up a session, and each time it goes through the bootstrap and then exits.
I will be AFT for ~6 hours, sorry.
Old 2019-10-21, 17:04   #10
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

21443₈ Posts
Default

Quote:
Originally Posted by Uncwilly View Post
Twice now I have tried to spin up a session, and each time it goes through the bootstrap and then exits.
Yeah... Something weird is going on with the GPU72 server. I found the load at 90 this morning.

What is happening is the Comms spider is getting work from the server, but the connection is timing out as far as the spider is concerned. I don't understand why, but it's happening on all my instances (including two simulated ones which run 24/7).

I'm in the process of applying updates and will be rebooting in about ten minutes.

If things go well it should be back about two minutes after that. If things don't go well, it will be a bit longer...
Old 2019-10-21, 17:42   #11
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

5·7·257 Posts
Default

Quote:
Originally Posted by chalsall View Post
If things go well it should be back about two minutes after that. If things don't go well, it will be a bit longer...
OK... The reboot went well.

I was still seeing the timeout error, so I dug down on the assignment code. Turns out there was a somewhat expensive SQL statement that was pushing the execution time *just* over the spider's time-out period.

Assignments should be working again for everyone. I'll drill down on optimizing that query, and/or just throw it into the background to process in parallel.
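
As a rough illustration of the "throw it into the background" option, here is a minimal Python sketch. It is not the GPU72 server code (whose language and structure are not shown in this thread); `expensive_bookkeeping` and `handle_request` are hypothetical names, and the `sleep` stands in for the slow SQL statement. The idea is simply that the reply goes out right away while the slow step finishes on a worker thread, so the client's time-out is never reached.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Single background worker; a stand-in for whatever job queue a real server uses.
executor = ThreadPoolExecutor(max_workers=1)
completed = []  # records which assignments had their bookkeeping finished


def expensive_bookkeeping(assignment_id):
    # Stand-in for the slow SQL statement (hypothetical).
    time.sleep(0.5)
    completed.append(assignment_id)


def handle_request(assignment_id):
    # Queue the slow step and return immediately, without waiting for it.
    future = executor.submit(expensive_bookkeeping, assignment_id)
    return {"assignment": assignment_id}, future


start = time.monotonic()
reply, pending = handle_request(12345)
elapsed = time.monotonic() - start  # time the client actually waited

pending.result()  # demo only: wait so we can inspect the result
print(reply, completed)
```

The trade-off is that the reply no longer proves the bookkeeping succeeded, so a real implementation would need some way to notice and retry a failed background step.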

As always, please let me know if anyone sees anything else strange.

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.