mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Cloud Computing (https://www.mersenneforum.org/forumdisplay.php?f=134)
-   -   Google Diet Colab Notebook (https://www.mersenneforum.org/showthread.php?t=24646)

kriesel 2019-11-07 07:44

[QUOTE=axn;529858]If you see any 5x timings, then kill the session and reconnect. Hopefully you'll get a better one.
Also, for the FMA3 runs, you might get better timings by enabling Hyperthreaded LL.[/QUOTE]I don't see the 5x timings until the mprime log updates, which is at the end of the run. Mprime is run in the background, along with a GPU application, so there is no web-browser indication of timing during a run. See the third code block of [URL]https://www.mersenneforum.org/showpost.php?p=528073&postcount=8[/URL]
Also, I am reluctant to give up a session early, since it has become unreliable to get one at all. Will try "CpuNumHyperthreads=2" in mprime local.txt next time around.
It would be good if there were a way to determine, in the Colab script, which GPU model is available, and to branch to running TF, PRP, LL, or P-1 based on that. Anyone have a code fragment that will do that?
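For what it's worth, a sketch of such a branch in shell: [C]nvidia-smi --query-gpu=name --format=csv,noheader[/C] is a real query, but the model-to-worktype mapping below is only an illustration, not a recommendation.
[CODE]# Ask the driver which GPU model this session got; fall back to
# "none" if nvidia-smi is absent or the session has no GPU.
gpu=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null || echo none)

# Illustrative mapping only: pick a worktype per model.
case "$gpu" in
  *K80*)              task="TF"  ;;      # slower card: trial factoring
  *T4*|*P100*|*V100*) task="PRP" ;;      # faster cards: primality tests
  none)               task="CPU-only" ;;
  *)                  task="P-1" ;;
esac
echo "GPU: $gpu -> task: $task"[/CODE]The script could then launch gpuowl with different work depending on [C]$task[/C].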

axn 2019-11-07 10:31

[QUOTE=kriesel;529881]I don't see the 5x timings, until the mprime log updates, which is at the end of the run. Mprime is run in background, along with a gpu application, so there is no web browser indication of timing during a run. See the third code block of [URL]https://www.mersenneforum.org/showpost.php?p=528073&postcount=8[/URL][/quote]
Ok. That complicates things. But you can just run mprime for a while (about 5-10 minutes) in the foreground to observe the iteration timings, and then either relaunch it in the background if the numbers are satisfactory or launch a new instance, can't you?

[QUOTE=kriesel;529881]Will try "CpuNumHyperthreads=2" in mprime local.txt next time around.[/QUOTE]
No. The correct flag is [C]HyperthreadLL=1[/C]

kriesel 2019-11-07 14:16

[QUOTE=axn;529894]Ok. That complicates things. But you can just run mprime for a while (about 5-10 minutes) in the foreground to observe the iteration timings, and then either relaunch it in the background if the numbers are satisfactory or launch a new instance, can't you?

No. The correct flag is [C]HyperthreadLL=1[/C][/QUOTE]I could initially run in the foreground as you suggest, although that increases personal overhead, and I'm trying to move toward more automation, not more manual involvement. A test built into the script for which CPU type was allocated, with a branch on that basis, would be better. (Anyone have suggestions for that? Maybe for now I'll just throw !lscpu in at the front and see what I'm getting.) Attempting to launch a new instance may give a different CPU type, fail to obtain a GPU, or fail to obtain any VM at all. With two different accounts, from different hosts and web browsers, I'm currently getting about a 50% duty cycle on each.
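One way to branch on the allocated CPU from its feature flags. This is a sketch: the helper name [C]cpu_class[/C] and the keying on the [C]avx512f[/C]/[C]avx2[/C] flags are my own choices, not anything from mprime.
[CODE]# Classify the allocated CPU from its feature-flag string, so a
# script can branch (e.g. AVX-512 Xeon vs. older FMA3/AVX2 hardware).
cpu_class() {
  case " $1 " in
    *" avx512f "*) echo "AVX-512"   ;;  # e.g. the Model 85 Xeon shown by lscpu
    *" avx2 "*)    echo "FMA3/AVX2" ;;
    *)             echo "pre-AVX2"  ;;
  esac
}

# On a Linux VM, feed it the real flags line from /proc/cpuinfo:
if [ -r /proc/cpuinfo ]; then
  cpu_class "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
fi[/CODE]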

Re the mprime flag you gave, I don't see it in any of the prime95 V29.8b6 documentation. Where did you find that?

From prime95 V29.8b6 undoc.txt:
[CODE]The program automatically computes the number of CPUs, hyperthreading, and speed.
This information is used to calculate how much work to get.
If the program did not correctly figure out your CPU information,
you can override the info in local.txt:
NumCPUs=n
CpuNumHyperthreads=1 or 2
CpuSpeed=s
Where n is the number of physical CPUs or cores, not logical CPUs created by
hyperthreading. Choose 1 for non-hyperthreaded and 2 for hyperthreaded. Finally,
s is the speed in MHz.[/CODE]From readme.txt:
The string "hyperthread" was not present.

From Whatsnew.txt:
Also no mention of any hyperthread-related configuration-file entry syntax.

Google Colaboratory is a pretty (sneaky|clever) way of getting us to learn Linux and Python, one little bit at a time.

kriesel 2019-11-07 14:27

Here's a first attempt. This attempt terminated immediately because no GPU was available, but I still used the dual-task script.[CODE]Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 2
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) CPU @ 2.00GHz
Stepping: 3
CPU MHz: 2000.176
BogoMIPS: 4000.35
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 39424K
NUMA node0 CPU(s): 0,1
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities
Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=...

Enter your authorization code:
··········
Mounted at /content/drive
/content/drive/My Drive
/content/drive/My Drive/mprime
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

2019-11-07 14:20:55 gpuowl
2019-11-07 14:20:55 Note: no config.txt file found
2019-11-07 14:20:56 config: -use ORIG_X2 -log 120000 -maxAlloc 10240 -user kriesel -cpu colab/K80
2019-11-07 14:20:56 411000059 FFT 24576K: Width 256x4, Height 256x4, Middle 12; 16.33 bits/word
2019-11-07 14:20:56 Exception gpu_error: clGetPlatformIDs(16, platforms, (unsigned *) &nPlatforms) at clwrap.cpp:64 getDeviceIDs
2019-11-07 14:20:56 Bye[/CODE]But apparently the mprime instance is still running, because a second try with [CODE]#Notebook to resume a run of mprime on a Colab session
!lscpu
import os.path
from google.colab import drive
import sys
if not os.path.exists('/content/drive/My Drive'):
  drive.mount('/content/drive')
%cd '/content/drive/My Drive//'
!chmod +w '/content/drive/My Drive'
%cd '/content/drive/My Drive/mprime//'
!chmod +x ./mprime
!echo run ./mprime
!./mprime -d | tee -a ./mprimelog.txt[/CODE]says so:[CODE]Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 2
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) CPU @ 2.00GHz
Stepping: 3
CPU MHz: 2000.176
BogoMIPS: 4000.35
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 39424K
NUMA node0 CPU(s): 0,1
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities
/content/drive/My Drive
/content/drive/My Drive/mprime
run ./mprime
[Main thread Nov 7 14:30] Mersenne number primality test program version 29.8
[Main thread Nov 7 14:30] Optimizing for CPU architecture: Core i3/i5/i7, L2 cache size: 1 MB, L3 cache size: 39424 KB
Another mprime is already running![/CODE]Running !top -d 30 in a separate Colab section confirms it.
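Since a leftover instance blocks a new launch, the script could check first. A sketch using pgrep; the wrapper function and its use of echo as a stand-in launcher are mine, not part of mprime.
[CODE]# Only launch if no mprime process already exists; pgrep -x matches
# the exact process name and exits nonzero when nothing is found.
start_if_absent() {
  if pgrep -x mprime > /dev/null; then
    echo "mprime already running; not starting another"
  else
    "$@"    # run the launch command passed as arguments
  fi
}

# Real use would be: start_if_absent ./mprime -d
start_if_absent echo "starting mprime"[/CODE]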

storm5510 2019-11-07 14:49

[QUOTE=kriesel;529921]...Google Colaboratory is a pretty (sneaky|clever) way of getting us to learn linux and Python, one little bit at a time.[/QUOTE]

I looked at some Python code yesterday on a different site. It was amazing, to me, how much I could understand having not seen any before. :smile:

axn 2019-11-07 15:17

[QUOTE=kriesel;529921]Re the mprime flag you gave, I don't see it in any of the prime95 V29.8b6 documentation. Where did you find that?[/QUOTE]
I used P95 on a Windows laptop with HT, checked the box that says "Use HT for LL test", and looked in local.txt for what changed.

CpuNumHyperthreads is needed only if the program doesn't correctly detect the presence of HT. It does in Kaggle/Colab, so that setting doesn't do anything.

chalsall 2019-11-07 16:16

[QUOTE=kriesel;529921]Anyone have suggestions for that? Maybe for now I'll just throw !lscpu in at the front and see what I'm getting.[/QUOTE]

A quick idea might be to launch mprime with the "-d" option, and either fork it and read its STDOUT from a pipe, or else redirect its STDOUT to a file which is then "tail -f"'ed.

How you'd automatically deal with a restart request is an exercise left to the reader... :wink:

kriesel 2019-11-07 18:44

[QUOTE=chalsall;529934]A quick idea might be to launch mprime with the "-d" option, and either fork it and read its STDOUT from a pipe, or else redirect its STDOUT to a file which is then "tail -f"'ed.

How you'd automatically deal with a restart request is an exercise left to the reader... :wink:[/QUOTE]Such as by tee -a?
[CODE]#Notebook to resume a run of mprime on a Colab session
!lscpu
import os.path
from google.colab import drive
import sys
if not os.path.exists('/content/drive/My Drive'):
  drive.mount('/content/drive')
%cd '/content/drive/My Drive//'
!chmod +w '/content/drive/My Drive'
%cd '/content/drive/My Drive/mprime//'
!chmod +x ./mprime
!echo run ./mprime
!./mprime -d | tee -a ./mprimelog.txt[/CODE]The destination does not get updated on the fly, only when the mprime session terminates, perhaps because it's on a Google Drive. If mprime is run in the foreground instead of the background, the Colab console says:
[CODE]/content/drive/My Drive
/content/drive/My Drive/mprime
run ./mprime
[Main thread Nov 7 18:25] Mersenne number primality test program version 29.8
[Main thread Nov 7 18:25] Optimizing for CPU architecture: Core i3/i5/i7, L2 cache size: 1 MB, L3 cache size: 39424 KB
[Main thread Nov 7 18:25] Starting worker.
[Comm thread Nov 7 18:25] Sending interim residue 30000000 for M87092557
[Work thread Nov 7 18:25] Worker starting
[Comm thread Nov 7 18:25] Done communicating with server.
[Work thread Nov 7 18:25] Resuming Gerbicz error-checking PRP test of M87092557 using AVX-512 FFT length 4608K, Pass1=192, Pass2=24K, clm=4
[Work thread Nov 7 18:25] Iteration: 30257107 / 87092557 [34.74%].
[Work thread Nov 7 18:26] Iteration: 30260000 / 87092557 [34.74%], ms/iter: 24.861, ETA: 16d 08:28
[Work thread Nov 7 18:30] Iteration: 30270000 / 87092557 [34.75%], ms/iter: 24.566, ETA: 16d 03:45[/CODE]But the end of mprimelog.txt is still:[CODE][Work thread Nov 7 18:15] Iteration: 30240000 / 87092557 [34.72%], ms/iter: 24.493, ETA: 16d 02:48
[Work thread Nov 7 18:19] Iteration: 30250000 / 87092557 [34.73%], ms/iter: 24.526, ETA: 16d 03:15
[Main thread Nov 7 18:22] Stopping all worker threads.
[Work thread Nov 7 18:22] Stopping PRP test of M87092557 at iteration 30257106 [34.74%]
[Work thread Nov 7 18:22] Worker stopped.
[Main thread Nov 7 18:22] Execution halted.
[Main thread Nov 7 18:22] Choose Test/Continue to restart.[/CODE](as displayed by double-clicking the file in the Google Drive tab; the Colab Files menu produces a more current version, but its narrow file display is hard to work with.)
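One hedged guess at the cause: stdio block buffering. When a program's stdout goes to a pipe instead of a terminal, it is typically block-buffered, so tee only receives output in large chunks. [C]stdbuf -oL[/C] (GNU coreutils) forces line buffering on the producer; Drive's own sync delay may also contribute. In this runnable sketch, seq stands in for mprime:
[CODE]# When stdout is a pipe it is block-buffered, so `tee` receives output
# in large chunks. stdbuf -oL forces line buffering on the producer.
# seq stands in for `./mprime -d` here; with the real binary the line
# would be: stdbuf -oL ./mprime -d | tee -a ./mprimelog.txt
stdbuf -oL seq 1 3 | tee -a /tmp/mprimelog_demo.txt[/CODE]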

chalsall 2019-11-07 19:07

[QUOTE=kriesel;529957]Such as by tee -a? The destination does not get updated on the fly; only when the mprime session terminates.[/QUOTE]

1. Yup. Tee works too.

1.1. You guys are going to /love/ Linux. You really are! :wink:

2. Really? Hmmm...

2.1. I have no time to run experiments myself.

2.2. I haven't attached an instance to a Drive since we all started experimenting with this -- way back at the beginning of September.

2.2.1. How time flies when you're having fun!!! :smile: :tu:

3. Try running some tests with the filesystem entirely within the nominal Colab File System (FS).

3.1. It should be as simple as copying files over from a Drive into a directory ("/root/" is fine)...

3.2. And then "throwing" mprime "into the background" by way of the Python shell, and then tail / tail -f logs, etc.

HTH. YMMV.
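A runnable sketch of point 3: copy the work directory off the Drive into the local filesystem, detach the process, and follow its log. The paths are illustrative, and a short echo script stands in for ./mprime -d.
[CODE]# Real use would first cp -r '/content/drive/My Drive/mprime/.' /root/mprime/
workdir=$(mktemp -d)          # stands in for /root/mprime
cd "$workdir"

# Detach the worker so the notebook cell returns; output goes to a
# local log file. An echo script stands in for ./mprime -d here.
nohup sh -c 'echo "worker starting"; echo "worker stopped"' > mprime.log 2>&1 &
wait                          # real use: leave it running, then tail -f mprime.log
cat mprime.log[/CODE]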

bayanne 2019-11-08 07:06

An exponent that had been allocated to me, 97930517, has been completed by someone else as well, and their result has been accepted. No problem for me, except that this result has not been cleared from the results.txt file, wherever that may be held. Thus it keeps appearing in the results for my instance name.

Where is that file, and can I clear this entry from it?

Uncwilly 2019-11-08 14:46

Ok folks, with regards to sessions being a little persistent, I found out something else. If you have a phone that is linked to a Google account, you can 'pick up' a session that is live on a desktop with that same account, and vice versa. This morning I had my work phone at home with a P100 session going. I got into work, fired up my desktop browser, headed to Colab, and fired up the session. The same session was running. Then I closed the browser on the phone. It was a successful hand-off. Just a small way of getting the most out of this.


Has anyone that has been working on using the CPUs tried to set it up under one of the GPU72-sponsored sessions? If you have and it works, then getting Chris to include that in the deployment could get us some nice P-1 or DC work too.

