mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Cloud Computing (https://www.mersenneforum.org/forumdisplay.php?f=134)
-   -   Google Diet Colab Notebook (https://www.mersenneforum.org/showthread.php?t=24646)

ATH 2019-10-23 04:57

[QUOTE=mnd9;528644]This is what's confusing: my version 1 is clickable; it just returns me to the page with Code, Data, Log, Comments but no output.

In post 418, axn said I should have an output tab regardless of whether the kernel is killed after 9 hours or completes... I lost 9 hours of GPU quota and the kernel ran, so why no output?

Also, can anyone confirm? My log makes it look like it failed with error code 137 in 5 seconds, but gives no explanation and didn't stop running. Supposedly that's a memory error code, but I'm running something requiring little memory that I know works from trying it in draft mode.

Finally, I'd like to test something, but I need help from a Python-savvy user out there: is there a way to issue a keyboard interrupt (i.e. Ctrl+C) after a certain time delay? My thought is maybe I'll program my code to interrupt my script before the kernel times out, and maybe that will result in a successful "complete" status, as all of the code cells will run, hopefully giving me some output... thoughts?[/QUOTE]

Make sure your script is using the "/kaggle/working" directory. I'm not sure a subfolder there works; I just have the files directly in /kaggle/working. It should work with cudalucas in /usr/local/bin as long as the output and temp files are in /kaggle/working, but there is really no need: it works fine for me with cudalucas and mprime directly in that directory.

When you start your notebook, remember to enable internet on the right side if your script is downloading files from somewhere. Test the script by starting it in the editor so you are sure it works before committing it.
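On the timed-interrupt question quoted above: the standard library can do this. Here is a minimal sketch (the function name and delays are my own choices, not tested on Kaggle itself) that raises KeyboardInterrupt in the main thread after a set delay:

```python
# Sketch: schedule a KeyboardInterrupt in the main thread after `seconds`.
# _thread.interrupt_main is the stdlib way to simulate Ctrl+C from a thread.
import threading, _thread, time

def interrupt_after(seconds):
    timer = threading.Timer(seconds, _thread.interrupt_main)
    timer.daemon = True  # don't keep the process alive just for the timer
    timer.start()
    return timer

# Demo with a 2-second delay; before a Kaggle timeout you might use 8.5*3600.
interrupt_after(2)
try:
    while True:        # stands in for the long-running work
        time.sleep(0.1)
except KeyboardInterrupt:
    print("interrupted cleanly")
```

Since the interrupt surfaces as a normal KeyboardInterrupt, the cell can catch it, flush its output files, and finish normally; whether Kaggle then keeps the Output tab is exactly what would need testing.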

kriesel 2019-10-23 13:00

CUDAPm1 v0.20 on Colab failed selftest
 
I managed to get a couple of sessions very early this morning, finish a thread benchmarking run, and run a selftest in CUDAPm1 v0.20, CUDA 5.5. See the selftest section of [URL]https://www.mersenneforum.org/showpost.php?p=527928&postcount=5[/URL] for some details.

Has anyone else done CUDAPm1 selftests on Colab, obtained a successful selftest, or seen failures? What version, what exponent, etc.?

ATH 2019-10-23 18:13

I found out how to restart a kernel from the command line with Kaggle API.

I have a kernel here that successfully ran mprime (no GPU) for 9 hours and finished: kernel3abfb13938.
I'm getting the output files with the script I explained in post #424, using the Kaggle API and wput to upload the files I need to an FTP site.

Do not power off the kernel, because it needs to still exist, so we can restart it.
Now I can download the script from the kernel using:

~/.local/bin/kaggle kernels pull -w -m <username>/kernel3abfb13938

-w means it downloads to the current directory; otherwise use "-p PATH".

It downloads 2 files: kernel3abfb13938.ipynb + kernel-metadata.json. Now I want to rename it to something sensible like "mprime1", so I rename the .ipynb file to mprime1.ipynb.

The kernel-metadata.json looks like this:
[CODE]{
"id": "<username>/kernel3abfb13938",
"id_no": 6347611,
"title": "kernel3abfb13938",
"code_file": "kernel3abfb13938.ipynb",
"language": "python",
"kernel_type": "notebook",
"is_private": true,
"enable_gpu": false,
"enable_internet": true,
"keywords": [],
"dataset_sources": [],
"kernel_sources": [],
"competition_sources": []
}[/CODE]

Rename the 3 instances of "kernel3abfb13938" to "mprime1" and save the file. Notice that "enable_gpu" and "enable_internet" are already set to what you had the last time you ran it.

Now we restart the kernel with the new name with:

~/.local/bin/kaggle kernels push -p PATH

where PATH is the folder the 2 files are located in, and it will say:
[QUOTE]Kernel version 2 successfully pushed. Please check progress at https://www.kaggle.com/<username>/mprime1[/QUOTE]

and I checked that it is working: I had chalsall's reverse SSH script as part of the kernel I restarted, and I can connect to it again with SSH, and it is running mprime fine.

I suspect I can just push the same 2 files next time without "pulling" down the script again, but I have not tested this yet.
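The rename step above can also be scripted. Here is a sketch of just that step (the function name is my own; the pull and push stay as the kaggle CLI calls shown above):

```python
# Sketch: rewrite the three "kernel3abfb13938"-style fields in
# kernel-metadata.json (id, title, code_file) and rename the notebook file.
import json, os

def rename_kernel(folder, old_name, new_name):
    meta_path = os.path.join(folder, "kernel-metadata.json")
    with open(meta_path) as f:
        meta = json.load(f)
    user = meta["id"].split("/")[0]          # keep the <username> part
    meta["id"] = f"{user}/{new_name}"
    meta["title"] = new_name
    meta["code_file"] = f"{new_name}.ipynb"
    with open(meta_path, "w") as f:
        json.dump(meta, f, indent=1)
    old_nb = os.path.join(folder, f"{old_name}.ipynb")
    if os.path.exists(old_nb):
        os.rename(old_nb, os.path.join(folder, f"{new_name}.ipynb"))

# Afterwards, restart with:  ~/.local/bin/kaggle kernels push -p <folder>
```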



If we combine this with xx005fs's method of having output files attached as a dataset, we can probably restart it without downloading and uploading the files.

[QUOTE=xx005fs;528602]On the top right corner of the output tab, there is a button that says "New Dataset Version", which would pop open a new window prompting you to name it a certain version number of the dataset you used in the previous file. Then as you go on to the next commit, delete the dataset from the input section by crossing the x next to the folder, and reimport the same dataset you just updated with the new version, and you are good to go.[/QUOTE]


When you have a dataset attached to a notebook, how do you use the files in the script? Do you have to run an "import" command or something like that, or do the files just appear in "/kaggle/working" ready for use right away?

chalsall 2019-10-23 19:22

[QUOTE=ATH;528707]I found out how to restart a kernel from the command line with Kaggle API.[/QUOTE]

Nice work! :tu:

xx005fs 2019-10-23 19:40

[QUOTE=ATH;528707]When you have a dataset attached to a notebook, how do you use the files in the script? Do you have to run an "import" command or something like that, or do the files just appear in "/kaggle/working" ready for use right away?[/QUOTE]

I just use a command to copy whatever is in the /kaggle/input/<your dataset> to /kaggle/working.
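For reference, that copy can also be done from Python rather than a shell command. A minimal sketch (the dataset name is a placeholder, and the helper is my own):

```python
# Sketch: mirror an attached (read-only) dataset into the writable working
# directory, copying both files and subdirectories.
import shutil, os

def stage_dataset(name, src_root="/kaggle/input", dst="/kaggle/working"):
    src = os.path.join(src_root, name)
    for entry in os.listdir(src):
        s = os.path.join(src, entry)
        if os.path.isdir(s):
            shutil.copytree(s, os.path.join(dst, entry), dirs_exist_ok=True)
        else:
            shutil.copy2(s, dst)
```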

PhilF 2019-10-23 22:22

[QUOTE=mnd9;528644]This is what's confusing: my version 1 is clickable; it just returns me to the page with Code, Data, Log, Comments but no output.

In post 418, axn said I should have an output tab regardless of whether the kernel is killed after 9 hours or completes... I lost 9 hours of GPU quota and the kernel ran, so why no output?[/quote]

My test with mprime confirms what axn said: In my case, a killed mprime job still had the Output tab and the working directory intact. Interestingly, what wasn't saved was the screen output.

[QUOTE=mnd9;528644]Also, can anyone confirm? My log makes it look like it failed with error code 137 in 5 seconds, but gives no explanation and didn't stop running. Supposedly that's a memory error code, but I'm running something requiring little memory that I know works from trying it in draft mode.[/quote]

My killed mprime job showed the same error (137) at the same time (about 5 seconds). Exit code 137 means the process received SIGKILL (128 + 9), so it comes from the job being killed rather than from the application itself.

I performed another test that confirmed what I reported earlier. If you get disconnected from a non-committed session, DO NOT click on the banner to power it back on. That will reset the connection and all will be lost. However, if you note the name of the disconnected notebook, close the window it is running in completely, open a new Kaggle browser window, then choose that notebook from your notebook list, you will get dumped back into your previously disconnected session.

In my case at least, the total session time was correct and incrementing; also the CPU usage was at 100%. However, the code block had the "Play" arrow next to it, not the stop button, making one think that it isn't running. Also, the screen output window was gone, which might further make one think the code isn't running. But it is. Only when the CPU usage indicator dropped to near zero could I tell that my code had completed its run. At that point !ls -l (actually it takes 2 of them for some reason) reveals that your working directory and output file(s) are still there.

chalsall 2019-10-24 15:08

[QUOTE=xx005fs;528718]I just use a command to copy whatever is in the /kaggle/input/<your dataset> to /kaggle/working.[/QUOTE]

Just as it might be of some interest... I ran some experiments last night with the "/kaggle/working/" directory.

It is quota'ed to 5 GB of storage, even though "df -h" shows the "/" partition has something like 490 GB of free space. Doing a "dd" into a test file caused an "out of space" error at exactly 5 GB.

Interestingly, there appears to be about 1 TB of storage scattered around the file system. I've had no problem creating files several hundred GB in size in locations other than "/kaggle/working/".

These, obviously, don't survive restarting, but the storage is (temporarily) there if you ever need it (can't think why you would need anything larger than 5 GB, mind you).
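The dd probe above can be reproduced from a notebook cell. A rough sketch (the function and its safety cap are my own; on Kaggle you would point it at a file under /kaggle/working and raise the cap well past 5 GB):

```python
# Sketch: append fixed-size chunks until the filesystem refuses, then report
# how many MiB fit. The temp file is removed afterwards either way.
import os

def probe_quota(path, chunk_mib=1, limit_mib=10):
    written = 0
    chunk = b"\0" * (chunk_mib * 1024 * 1024)
    try:
        with open(path, "wb") as f:
            while written < limit_mib:   # safety cap so a test run stays small
                f.write(chunk)
                written += chunk_mib
    except OSError:                      # "out of space" arrives as OSError
        pass
    finally:
        if os.path.exists(path):
            os.remove(path)
    return written
```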

xx005fs 2019-10-24 16:45

[QUOTE=chalsall;528785]Just as it might be of some interest... I ran some experiments last night with the "/kaggle/working/" directory.

It is quota'ed to 5 GB of storage, even though "df -h" shows the "/" partition has something like 490 GB of free space. Doing a "dd" into a test file caused an "out of space" error at exactly 5 GB.

Interestingly, there appears to be about 1 TB of storage scattered around the file system. I've had no problem creating files several hundred GB in size in locations other than "/kaggle/working/".

These, obviously, don't survive restarting, but the storage is (temporarily) there if you ever need it (can't think why you would need anything larger than 5 GB, mind you).[/QUOTE]

Interesting, I thought Kaggle would offer more free storage than 5 GB, since a lot of AI training datasets easily exceed that. I used the /kaggle/working directory only because that's the directory where I can actually execute gpuowl even with chmod, and it only outputs the files if you run it in commit mode, not in interactive mode (as far as I know there isn't any persistent storage method for interactive mode, nor a way to export the data out of /kaggle/working with a script). I would recommend spending some quota on initial experimentation with the file system; after that, just commit the run, update the dataset from the commit output section after it finishes, and then reimport and repeat.

chalsall 2019-10-24 17:01

[QUOTE=xx005fs;528806]Interesting, I thought kaggle would offer more free storage than 5G since a lot of AI training datasets easily exceeds that.[/QUOTE]

My understanding is there are HUGE datasets available to train with; I think you have to ask (the UI) for them to be exposed within the FS. I haven't investigated that myself, but the primary purpose of this environment is to provide (human) training in AI, so the datasets must be available (somehow).

Dylan14 2019-10-24 23:13

If anyone wants to work the ranges above 1G with mfaktc on Colab (to help the project on [URL="https://www.mersenne.ca/"]mersenne.ca[/URL]), I have devised some Python code to do it. See below:


[CODE]#Script to automate trial factoring of Mersennes above 1G, using mfaktx.
#Requires wget and curl.
#version history:
#v0.01 - Testing version.
#v0.02 - First public release.

#import needed packages
import sys, os, subprocess, time
#set path to mfaktx (note: wget writes worktodo.txt to the current
#directory, so run this script from inside mfaktx_path)
mfaktx_path = "C:\\Users\\Dylan\\Desktop\\mfaktc\\mfaktc-0.21\\"

#names of executables, change if needed
MFAKTX = 'mfaktc.exe'
WGET = 'wget.exe'
CURL = 'curl.exe'

#specify certain parameters for later, when we go and fetch assignments:
TF_LIMIT = str(71)
TF_MIN = str(68)
MAXASSIGNMENTS = str(1)
BIGGEST = str(1)

#changes should not be needed below
print("---------------------------------")
print("This is tf1G.py v0.02, a Python script to automate Mersenne trial factoring for exponents above 1 billion.")
print("It is copyleft, 2019, by Dylan Delgado.")
print("---------------------------------")

#run checks to see if we have the paths correct
if not os.path.exists(mfaktx_path):
    print("The path for Mfaktx does not exist. Check your setting for mfaktx_path.")
    sys.exit()
#do we have mfaktx?
if not os.path.exists(mfaktx_path + MFAKTX):
    print("Mfaktx does not exist. Check your path or name of your executable.")
    sys.exit()

#Now we define our URL
URL = "https://www.mersenne.ca/tf1G.php?download_worktodo=1&tf_limit=" + TF_LIMIT + "&tf_min=" + TF_MIN + "&max_assignments=" + MAXASSIGNMENTS + "&biggest=" + BIGGEST
print(URL)

#delete a file (code courtesy of Brian Gladman)
def delete_file(fn):
    if os.path.exists(fn):
        try:
            os.unlink(fn)
        except OSError:  #OSError, not WindowsError, so this also works on Linux/Mac
            pass

#submit work to mersenne.ca
def submit_results():
    print("Submitting work...")
    subprocess.run([CURL, "-F", "results_file=@results.txt", "https://www.mersenne.ca/bulk-factors.php"])
    delete_file(mfaktx_path + "results.txt")

#main loop - fetch work with wget, run mfaktx, and submit results
while True:
    #check if worktodo.txt is missing or empty
    if not os.path.exists(mfaktx_path + 'worktodo.txt') or os.stat(mfaktx_path + 'worktodo.txt').st_size == 0:
        print("No work to do, fetching more work in 5 seconds...")
        time.sleep(5)
        submit_results()
        subprocess.run([WGET, URL, "-Oworktodo.txt"])
    #run mfaktc
    subprocess.run([mfaktx_path + MFAKTX])[/CODE]The advantage is that this script is OS agnostic, so it also works on Windows and Mac machines. The only things that need changing are the paths to the executables, their names, and the parameters.

chalsall 2019-10-25 00:12

[QUOTE=Dylan14;528855]...I have devised some code in Python to do it. See below:[/QUOTE]

I'm going to have to (perhaps) revisit my opinion that Python is always unreadable.

This makes sense! :smile: :tu:


All times are UTC.