GPU72 Notebook Integration...
As some of you might know, several of us have been having great fun over in the [URL="https://mersenneforum.org/showthread.php?t=24646"]Google Colaboratory Notebook?[/URL] thread -- learning about Notebook instances...
Since the GPU72 integration is just a small "proof-of-concept" part of the bigger picture, I wanted to fork out further discussions specific to GPU72 here. An update:

1. Everyone is now running version 0.32 of the Bootstrap payload.
1.1. This fixes a SPE with the regular expression (regex) which was (very rarely) extracting the GHzD/D and IntrT values incorrectly. (Always remember: regex patterns are "greedy"!)
2. I have reduced the recycling period from twelve (12) to three (3) hours.
2.1. So long as no-one sees anything odd, I should be able to bring this down to perhaps as low as 30 minutes.
2.2. The temporal delta value is calculated from Last Updated or Assigned, whichever is younger.
3. I /might/ have an odd race condition going on in the TF'ing instances. I have seen extremely rare cases on some of my own machines where an assignment is given, but then not reported on. Further drill-down shows it doesn't appear in the worktodo files either.
3.1. The system will automatically recover from this if and whenever it happens, but I don't like things happening which I don't understand. It means I've made a mistake somewhere...

As always, observations welcomed... :smile: |
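A minimal illustration of the greedy-regex pitfall mentioned in point 1.1. The sample line and patterns here are hypothetical, not the actual Bootstrap payload's regex or mfaktc's output format:

```python
import re

# A hypothetical status line; the real Bootstrap regex is not shown here.
line = "GHzD/D=123.45 IntrT=6.78 done"

# Greedy: .* grabs as much as it can and only backtracks to the *last*
# space, so the captured "value" swallows the IntrT field too.
greedy = re.search(r"=(.*) ", line).group(1)

# Lazy: .*? stops as early as the rest of the pattern allows,
# capturing just the first value.
lazy = re.search(r"=(.*?) ", line).group(1)

print(greedy)  # 123.45 IntrT=6.78
print(lazy)    # 123.45
```

The lazy quantifier (or an explicit character class like `([\d.]+)`) is the usual fix when a field value must stop at the first delimiter.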
:tu:
|
Quick update:
The race condition has finally been solved. Man, that was stupid! A good example of how difficult debugging can be when you start introducing unusual but not unreasonable temporal harmonics.

I have now reduced the recycling period down to one (1) hour. So long as no one reports anything weird, I'll be able to bring this down to twenty (20) minutes.

Edit: Oh... And could I please ask that people try to post here for GPU72_TF Notebook stuff? I feel like I've dominated the Google Notebook thread with this; I'd like that to get back to general usage case-studies and examples.

Edit2: Perhaps a kind Super Mod could create a new sub-forum? Something like "Notebook Instances"? This subject space is just begging to fork into many, many threads!!! :smile: |
Checkpoint remaining after my housecleaning
After a day of Kaggle/Colab sessions, I often have a bunch of leftover checkpoint files from restarts etc.
I have manually transferred these to my local worktodo file, then completed and reported the work with mfaktc. However, the Kaggle/Colab checkpoints still show up on my GPUto72 report even after they have been reported to PrimeNet and are no longer on my GPUto72 assignments list. This leads to the checkpoint files being picked up by a later Colab run, which restarts work that I have already reported. If there were checkboxes to delete checkpoint files, this could be eliminated. |
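A sketch of how leftover checkpoints could be pruned by hand, assuming mfaktc's conventional M&lt;exponent&gt;.ckp checkpoint naming and a hypothetical bookkeeping set of already-reported exponents (GPU72 itself provides no such list here):

```python
import os
import re

# mfaktc checkpoint files are conventionally named M<exponent>.ckp;
# adjust the pattern if your build names them differently.
CKP = re.compile(r"^M(\d+)\.ckp$")

def prune_checkpoints(directory, reported):
    """Remove checkpoint files whose exponent is in `reported`.

    `reported` is a hypothetical set of exponents already sent to
    PrimeNet. Returns the filenames that were deleted.
    """
    removed = []
    for name in os.listdir(directory):
        m = CKP.match(name)
        if m and int(m.group(1)) in reported:
            os.remove(os.path.join(directory, name))
            removed.append(name)
    return removed
```

Run against the session's working directory before restarting a notebook, so a later Colab run cannot pick the stale checkpoints back up.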
Why not just let the system finish them next time you get going? That is what I do.
|
Chris, please do NOT expire/recycle the assignments [U]that were already started[/U], for at least 4 or 5 days! One hour is far too short.
That is because we have limited account time per week, which we consume in the first 2-3 days; then we must wait until the next week to resume the interrupted assignment. Otherwise the work is lost. |
[QUOTE=Chuck;527871]This leads to the checkpoint files being picked up by a later Colab run. It restarts work that I have already reported. If there were checkboxes to delete checkpoint files this could be eliminated.[/QUOTE]
Hmmm... An unusual use-case. I'll have to think about how to handle this -- it's basically a race condition between the Human and the Machine(s). Could you please PM me a couple of candidate examples where this happened? |
[QUOTE=LaurV;527883]Chris, please do NOT expire/recycle the assignments [U]that were already started[/U], for at least 4 or 5 days! One hour is too less.[/QUOTE]
To be clear... candidates which have had work done will *not* be expired. They are held until another one of *your* instances asks for more work -- however long that may be. |
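The recycling rule described above can be sketched as follows. This is an illustrative reconstruction, not the GPU72 server's actual code; the field names and `work_done` flag are assumptions:

```python
from datetime import datetime, timedelta

def eligible_for_recycle(assigned, last_updated, work_done,
                         period=timedelta(hours=1), now=None):
    """Recycle only if no work has been done and the candidate has
    been idle longer than `period`, measured from the *younger* of
    Assigned and Last Updated (per point 2.2 above)."""
    if work_done:
        return False  # started work is held for the same user
    now = now or datetime.utcnow()
    newest = max(assigned, last_updated) if last_updated else assigned
    return now - newest > period
```

Under this rule, shortening the period to one hour only affects assignments that were never started, which is why work already underway survives a week-long Kaggle quota pause.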
I just had 2 instances of trying to spin up a session and it goes through the bootstrap and then exits.
I will be AFK for ~6 hours, sorry. |
[QUOTE=Uncwilly;528491]I just had 2 instances of trying to spin up a session and it goes through the bootstrap and then exits.[/QUOTE]
Yeah... Something weird is going on with the GPU72 server. I found the load at 90 this morning. What is happening is the Comms spider is getting work from the server, but the connection is timing out as far as the spider is concerned. I don't understand why, but it's happening on all my instances (including two simulated ones which run 24/7). I'm in the process of applying updates and will be rebooting in about ten minutes. If things go well it should be back about two minutes after that. If things don't go well, it will be a bit longer... |
[QUOTE=chalsall;528495]If things go well it should be back about two minutes after that. If things don't go well, it will be a bit longer...[/QUOTE]
OK... The reboot went well. I was still seeing the timeout error, so I dug down into the assignment code. Turns out there was a somewhat expensive SQL statement that was pushing the execution time *just* over the spider's time-out period. Assignments should be working again for everyone. I'll drill down on optimizing that query, and/or just throw it into the background to process in parallel. As always, please let me know if anyone sees anything else strange. |
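The "throw it into the background" option mentioned above can be sketched like this. The GPU72 server's actual implementation is not shown anywhere in the thread; this is a generic Python deferral pattern with placeholder data:

```python
import queue
import threading
import time

# Deferred-work queue: expensive bookkeeping (e.g. a slow SQL UPDATE)
# is queued so the assignment response returns before the spider's
# timeout instead of blocking on it.
background = queue.Queue()

def worker():
    while True:
        job = background.get()
        job()  # run the deferred statement
        background.task_done()

threading.Thread(target=worker, daemon=True).start()

def assign_work(instance_id):
    # Placeholder lookup; a real server would query its database here.
    assignment = {"exponent": 63759127, "bits": "74-75"}
    # Defer the slow part rather than pushing the request over timeout.
    background.put(lambda: time.sleep(0.01))  # stand-in for slow SQL
    return assignment
```

The trade-off is that the bookkeeping becomes eventually consistent, so any logic that reads it immediately after assignment has to tolerate a short lag.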
I noticed a possibly related issue: a lot of my old assignments from 2014 are showing up on my "Completed Assignments" with the computer name "Kaggle." However, I'm not using Kaggle at the moment.
|
[QUOTE=ixfd64;528503]I noticed a possibly related issue: a lot of my old assignments from 2014 are showing up on my "Completed Assignments" with the computer name "Kaggle." However, I'm not using Kaggle at the moment.[/QUOTE]
OK. Thanks very much for the report. I'll do a sanity check on this. A bit of a busy morning...

Me: "But the code didn't change!!!"

Reality: "No... but the size of your datasets did, and thus the execution times have increased... Perhaps you might consider that..." |
[QUOTE=ixfd64;528503]I noticed a possibly related issue: a lot of my old assignments from 2014 are showing up on my "Completed Assignments" with the computer name "Kaggle." However, I'm not using Kaggle at the moment.[/QUOTE]
Ditto. As well as many of my current assignments. I am using Kaggle but these are NOT Kaggle assignments. |
Bogus GPU72 "view assignments" data
[CODE]Computer  Exponent  WT     From  To  Percent  Assigned          Age   Updated           Estimated Completion  Days To Go  GHzD
Manual    63759127  DC TF  74    75           2019-10-19 19:07  2.20  2019-10-21 15:03  2012-12-12 13:14      -2504.44    60.00
Manual    63759217  DC TF  74    75           2019-10-19 19:07  2.20  2019-10-21 15:03  2012-12-12 13:14      -2504.44    60.00
Manual    63759221  DC TF  74    75           2019-10-19 19:07  2.20  2019-10-21 15:03  2012-12-12 13:14      -2504.44    60.00
[/CODE]All of my assignments have this same bogus "estimated completion" and "days to go" information. (Sorry the column header alignment doesn't show correctly in this post). |
[QUOTE=Chuck;528540]All of my assignments have this same bogus "estimated completion" and "days to go" information. (Sorry the column header alignment doesn't show correctly in this post).[/QUOTE]
Yup. A massively stupid typo by yours truly... There are backups. Working it... |
[QUOTE=Chuck;528540](Sorry the column header alignment doesn't show correctly in this post).[/QUOTE]
Huh? It seems it does... |
[QUOTE=LaurV;528564]Huh? It seems it does...[/QUOTE]
...after you kindly fixed it! |
And then there were none
Just got blocked from Kaggle
|
[QUOTE=petrw1;539053]Just got blocked from Kaggle[/QUOTE]
Sorry to hear that. Frankly, I couldn't understand why you seemed to be the sole person /not/ banned! Just out of interest, were you running CPU jobs in parallel? |
[QUOTE=chalsall;539054]Sorry to hear that. Frankly, I couldn't understand why you seemed to be the sole person /not/ banned!
Just out of interest, were you running CPU jobs in parallel?[/QUOTE]
In Kaggle I only ever ran mfaktc: 2 commits and 1 RunAll. |