mersenneforum.org


chalsall 2019-10-07 19:28

GPU72 Notebook Integration...
 
As some of you might know, several of us have been having great fun over in the [URL="https://mersenneforum.org/showthread.php?t=24646"]Google Colaboratory Notebook?[/URL] thread -- learning about Notebook instances...

Since the GPU72 integration is just a small "proof-of-concept" part of the bigger picture, I wanted to fork out further discussions specific to GPU72 here.

An update:

1. Everyone is now running version 0.32 of the Bootstrap payload.

1.1. This fixes a SPE in the regular expression (regex) that was (very rarely) extracting the GHzD/D and IntrT values incorrectly. (Always remember: regex quantifiers are "greedy" by default!)

2. I have reduced the recycling period from twelve (12) to three (3) hours.

2.1. So long as no one sees anything odd, I should be able to bring this down to perhaps as low as 30 minutes.

2.2. The temporal delta value is calculated from Last Updated or Assigned, whichever is more recent.

3. I /might/ have an odd race condition going on in the TF'ing instances. I have seen extremely rare cases on some of my own machines where an assignment is given, but then not reported on. Further drill-down shows it doesn't appear in the worktodo files either.

3.1. The system will automatically recover from this if and whenever it happens, but I don't like things happening which I don't understand. It means I've made a mistake somewhere...
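The greedy-matching pitfall in point 1.1 is easy to reproduce. Below is a minimal Python sketch. The status line and the patterns are made up for illustration; the actual Bootstrap payload's output format and regex are not shown in this thread.

```python
import re
from typing import Optional

# Hypothetical status line; the real payload's output format is an unknown here.
LINE = "GHzD/D: 1234.56 | IntrT: 97.30 | ETA: 0h12m"


def extract(pattern: str, text: str) -> Optional[str]:
    """Return the first capture group of `pattern` in `text`, or None if no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Greedy: ".*" grabs as much as it can, so the capture runs up to the LAST " |".
greedy = extract(r"GHzD/D: (.*) \|", LINE)

# Lazy: ".*?" stops at the FIRST " |" that lets the pattern match.
lazy = extract(r"GHzD/D: (.*?) \|", LINE)

# Stricter still: only allow the characters a number can contain.
strict = extract(r"GHzD/D: ([0-9.]+)", LINE)

print(repr(greedy))  # '1234.56 | IntrT: 97.30'
print(repr(lazy))    # '1234.56'
print(repr(strict))  # '1234.56'
```

Restricting the character class is usually the more robust of the two fixes, since it does not depend on which delimiter happens to follow the value.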

As always, observations welcomed... :smile:

Uncwilly 2019-10-08 01:28

:tu:

chalsall 2019-10-10 17:03

Quick update:

The race condition has finally been solved. Man, that was stupid!

A good example of how difficult debugging can be when you start introducing unusual but not unreasonable temporal harmonics.

I have now reduced the recycling period down to one (1) hour. So long as no one reports anything weird, I'll be able to bring this down to twenty (20) minutes.

Edit: Oh... And could I please ask that people try to post here for GPU72_TF Notebook stuff. I feel like I've dominated the Google Notebook thread with this; I'd like that to get back to general usage case-studies and examples.

Edit2: Perhaps a kind Super Mod could create a new sub-forum? Something like "Notebook Instances"? This subject space is just begging to fork into many, many threads!!! :smile:

Chuck 2019-10-13 01:14

Checkpoint remaining after my housecleaning
 
After a day of Kaggle/Colab sessions, I often have a bunch of leftover checkpoint files from restarts etc.

I have manually transferred these to my local worktodo file and completed and reported the work with mfaktc. However, the Kaggle/Colab checkpoints still show up on my GPUto72 report even after they have been reported to Primenet and are not on my GPUto72 assignments list.

This leads to the checkpoint files being picked up by a later Colab run. It restarts work that I have already reported.

If there were checkboxes to delete checkpoint files this could be eliminated.
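The check being asked for amounts to comparing stored checkpoints against work that has already been reported. Below is a rough Python sketch of that comparison. The `Checkpoint` record and its fields are hypothetical, since how GPU72 actually stores checkpoints is not described in this thread.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Checkpoint(NamedTuple):
    """Hypothetical record for one stored checkpoint."""
    exponent: int  # Mersenne exponent being trial-factored
    bit_from: int  # lower bit level of the TF range
    bit_to: int    # upper bit level of the TF range


def stale_checkpoints(
    checkpoints: Iterable[Checkpoint],
    reported: Iterable[Tuple[int, int, int]],
) -> List[Checkpoint]:
    """Return the checkpoints whose (exponent, bit_from, bit_to) was already reported."""
    reported_set = set(reported)
    return [c for c in checkpoints if tuple(c) in reported_set]


stored = [Checkpoint(100000007, 73, 74), Checkpoint(100000037, 73, 74)]
already_reported = [(100000007, 73, 74)]

# Only the first checkpoint is safe to delete; the second is still unfinished work.
print(stale_checkpoints(stored, already_reported))
```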

Uncwilly 2019-10-13 05:01

Why not just let the system finish them next time you get going? That is what I do.

LaurV 2019-10-13 05:14

Chris, please do NOT expire/recycle the assignments [U]that were already started[/U], for at least 4 or 5 days! One hour is far too short.

That is because we have limited account time per week, which we use up in the first 2-3 days. We then have to wait until the following week to resume the interrupted assignment; if it has been recycled by then, the work is lost.

chalsall 2019-10-13 13:06

[QUOTE=Chuck;527871]This leads to the checkpoint files being picked up by a later Colab run. It restarts work that I have already reported. If there were checkboxes to delete checkpoint files this could be eliminated.[/QUOTE]

Hmmm... An unusual use-case. I'll have to think about how to handle this -- it's basically a race condition between the Human and the Machine(s).

Could you please PM me a couple of candidate examples where this happened?

chalsall 2019-10-13 13:08

[QUOTE=LaurV;527883]Chris, please do NOT expire/recycle the assignments [U]that were already started[/U], for at least 4 or 5 days! One hour is far too short.[/QUOTE]

To be clear... candidates which have had work done will *not* be expired. They are held until another one of *your* instances asks for more work -- however long that may be.
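Put together with the earlier note that the temporal delta is measured from Last Updated or Assigned, the recycling rule can be sketched roughly as follows in Python. The function and parameter names are hypothetical, and the one-hour constant is simply the period quoted on 2019-10-10; this is an illustration of the stated rule, not the server's actual code.

```python
from datetime import datetime, timedelta
from typing import Optional

# Recycling period quoted in the thread on 2019-10-10; it has changed over time.
RECYCLE_AFTER = timedelta(hours=1)


def should_recycle(
    assigned: datetime,
    last_updated: Optional[datetime],
    work_done: bool,
    now: datetime,
) -> bool:
    """Decide whether an assignment may be handed to someone else."""
    if work_done:
        # Candidates with any work done are held for the same user's instances.
        return False
    # Measure age from the more recent of the two timestamps.
    reference = max(assigned, last_updated) if last_updated else assigned
    return now - reference > RECYCLE_AFTER


now = datetime(2019, 10, 13, 13, 0)
untouched = should_recycle(datetime(2019, 10, 13, 11, 0), None, False, now)
refreshed = should_recycle(datetime(2019, 10, 13, 11, 0), datetime(2019, 10, 13, 12, 30), False, now)
started = should_recycle(datetime(2019, 10, 6, 11, 0), None, True, now)
print(untouched, refreshed, started)  # True False False
```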

Uncwilly 2019-10-21 16:20

I just tried twice to spin up a session; each time it went through the bootstrap and then exited.
I will be AFT for ~6 hours, sorry.

chalsall 2019-10-21 17:04

[QUOTE=Uncwilly;528491]I just tried twice to spin up a session; each time it went through the bootstrap and then exited.[/QUOTE]

Yeah... Something weird is going on with the GPU72 server. I found the load at 90 this morning.

What is happening is the Comms spider is getting work from the server, but the connection is timing out as far as the spider is concerned. I don't understand why, but it's happening on all my instances (including two simulated ones which run 24/7).

I'm in the process of applying updates and will be rebooting in about ten minutes.

If things go well it should be back about two minutes after that. If things don't go well, it will be a bit longer...

chalsall 2019-10-21 17:42

[QUOTE=chalsall;528495]If things go well it should be back about two minutes after that. If things don't go well, it will be a bit longer...[/QUOTE]

OK... The reboot went well.

I was still seeing the timeout error, so I dug down on the assignment code. Turns out there was a somewhat expensive SQL statement that was pushing the execution time *just* over the spider's time-out period.

Assignments should be working again for everyone. I'll drill down on optimizing that query, and/or just throw it into the background to process in parallel.

As always, please let me know if anyone sees anything else strange.
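The "throw it into the background" option can be illustrated with a small Python sketch. Everything here is a stand-in: the GPU72 server's real language, handler and query are not shown in this thread, and `time.sleep` merely simulates the expensive SQL statement.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List, Tuple

executor = ThreadPoolExecutor(max_workers=1)
completed: List[int] = []  # records which assignments had their bookkeeping run


def expensive_bookkeeping(assignment_id: int) -> None:
    """Stand-in for the slow query; not needed to answer the client."""
    time.sleep(0.5)
    completed.append(assignment_id)


def hand_out_assignment(assignment_id: int) -> Tuple[int, Future]:
    """Reply to the client at once and defer the slow part to a worker thread."""
    future = executor.submit(expensive_bookkeeping, assignment_id)
    return assignment_id, future


start = time.monotonic()
assignment, pending = hand_out_assignment(42)
elapsed = time.monotonic() - start  # well under the simulated 0.5 s query time

pending.result(timeout=5)  # wait here only so the example can show the result
print(assignment, completed)
```

The trade-off is that the client gets its answer before the bookkeeping has finished, so this only suits work whose result the reply does not depend on.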


All times are UTC.