Old 2019-12-17, 23:13   #716
PhilF

Quote:
Originally Posted by lycorn
That is true, but all my Colab instances' CPUs are identified as "Intel Xeon @ 2.30GHz Linux64" under the My Account -> CPUs option on mersenne.org. Next time I use it I'll select "no accelerator" and see what happens. It probably won't change anything, meaning the "Intel Xeon @ 2.30GHz" reported is simply the VM's CPU made available on each session.
Use !lscpu next time you connect and check the "Model" field (a quick cell for this is sketched below):

63 is a Haswell Xeon.
79 is a Broadwell Xeon.
85 is a Skylake Xeon, which supports AVX-512.
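For example, a cell like this (a minimal sketch; the grep patterns are just one way to pull out the relevant fields):
Code:
# show the numeric Model field and the model name string
!lscpu | grep -E '^(Model|Model name):'
# quick check for AVX-512 support among the cpu flags
!lscpu | grep -o 'avx512f'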
Old 2019-12-17, 23:26   #717
kriesel

Quote:
Originally Posted by lycorn
That is true, but all my Colab instances' CPUs are identified as "Intel Xeon @ 2.30GHz Linux64" under the My Account -> CPUs option on mersenne.org. Next time I use it I'll select "no accelerator" and see what happens. It probably won't change anything, meaning the "Intel Xeon @ 2.30GHz" reported is simply the VM's CPU made available on each session.
What is the name of the program you are running for ECM?
I routinely run mprime (for Mersenne prime PRP testing) together with gpuowl or mfaktc on the same Colab session, with mprime in the background. I'd expect a cpu-only "None" accelerator session on the same cpu type to give slightly better mprime performance, since the cpu would not also be lightly serving the gpu-centric application as it does when a gpu is in use. That doesn't make a gpu a cpu or vice versa.
!lscpu shows cpu characteristics;
!nvidia-smi shows gpu characteristics, including the model.
If those look OK, I mount Google Drive for the session. Each application is started; if it's an mfaktc run, it is also put in the background, and top -d 120 is run in the foreground to show signs of life and, later, how long the session lasted (sketched below).
https://www.mersenneforum.org/showpo...73&postcount=8
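For the shape of such a session, here is a minimal sketch only: the Drive folder layout and binary names are hypothetical, and the real scripts are in the linked post.
Code:
# mount Google Drive so checkpoints and results survive the session
from google.colab import drive
drive.mount('/content/drive')
# hypothetical working folders on Drive; adjust to your own layout
%cd "/content/drive/My Drive/mprime"
!nohup ./mprime -d > mprime.log 2>&1 &
%cd "/content/drive/My Drive/mfaktc"
!nohup ./mfaktc > mfaktc.log 2>&1 &
# foreground heartbeat, refreshing every 120 seconds
!top -d 120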

Old 2019-12-19, 19:53   #718
kriesel

T4

Next time someone gets a T4 on Google Colab, please run and submit benchmarks for TF and LL.
https://www.mersenne.ca/mfaktc.php
https://www.mersenne.ca/cudalucas.php
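To tell quickly whether a session actually landed a T4 before setting up the benchmarks (a minimal sketch using standard nvidia-smi query flags):
Code:
# prints e.g. "Tesla T4, 15109 MiB" when a T4 is attached
!nvidia-smi --query-gpu=name,memory.total --format=csv,noheader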
Old 2019-12-19, 20:05   #719
chalsall

Quote:
Originally Posted by kriesel
Next time someone gets a T4 on Google Colab, please run and submit benchmarks for TF and LL.
Done for TF; I don't have the time to do the LL benchmark. James, as noted in the form, the T4s under Colab now seem to be "shared" across two instances.

Also, while I'm typing... Based on Wayne's comment, I spun up a Kaggle TF instance again (on a "disposable" account). After 19 hours across three runs, still working. Weird!
Old 2019-12-19, 20:55   #720
chalsall

Quote:
Originally Posted by chalsall
Also, while I'm typing... Based on Wayne's comment, I spun up a Kaggle TF instance again (on a "disposable" account). After 19 hours across three runs, still working. Weird!
Hmmm... I spoke too soon. Kaggle account has just been "blocked"...
Old 2019-12-19, 21:32   #721
kriesel

Quote:
Originally Posted by chalsall
Done for TF; I don't have the time to do the LL benchmark
Thanks for the TF benchmark.
The LL benchmark is only 30,000 iterations on Mp48* (~58M), so it should not take any reasonable gpu very long. If you haven't the time to set it up and report it, it will wait.
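A minimal sketch of the kind of run meant here, assuming a CUDALucas binary already set up in the current folder (Mp48 is M57885161; check your build's README for the exact options):
Code:
# start an LL test of Mp48; interrupt it once the timing output
# past iteration 30000 has printed, and report the ms/iter figures
!./CUDALucas 57885161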
Old 2019-12-19, 21:32   #722
petrw1

Quote:
Originally Posted by chalsall
Hmmm... I spoke too soon. Kaggle account has just been "blocked"...
You just have bad luck.
I'll know tomorrow if I'm still on the "NICE" list.
Old 2019-12-19, 22:00   #723
kriesel

gpu pot luck

Speaking of luck, the availability of K80s that I'm seeing in Colab is low again.
So I'm switching paradigms: make a folder for each gpu model that may come my way, launch a session, see which gpu I get, and then launch the matching script for the relevant folder and benchmarking work (roughly like the sketch below).
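A minimal sketch of that dispatch, with hypothetical folder and script names:
Code:
import os
# identify the gpu this session drew
gpu = !nvidia-smi --query-gpu=name --format=csv,noheader
print(gpu[0])
# naive mapping to hypothetical per-model folders on Drive (K80/, T4/, P100/);
# adjust for longer names such as "Tesla P100-PCIE-16GB"
folder = gpu[0].split()[-1]
os.chdir('/content/drive/My Drive/' + folder)
!bash ./run.sh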

Old 2019-12-20, 00:04   #724
Dylan14

Trouble with CudaPM1 with Colab?

I was doing some P-1 on Mersenne numbers with no stage 2 done, in the hope of finding some factors. The script starts up OK and begins working on the exponent, but after a while it is unable to write to Drive:


Code:
Best time for fft = 1568K, time: 0.0818, t1 = 256, t2 = 32, t3 = 64 
Using threads: norm1 256, mult 128, norm2 128. 
Using up to 15912M GPU memory. 
Selected B1=525000, B2=12731250, 4.7% chance of finding a factor 
Using B1 = 525000 from savefile. 
Continuing stage 1 from a partial result of M28222361 fft length = 1568K, iteration = 60001 
Iteration 70000 M28222361, 0x98b803a08faa200c, n = 1568K, CUDAPm1 v0.22 err = 0.07520 (0:06 real, 0.6628 ms/iter, ETA 7:35) 
Iteration 80000 M28222361, 0x1f50201cb4e89065, n = 1568K, CUDAPm1 v0.22 err = 0.07031 (0:07 real, 0.6641 ms/iter, ETA 7:30) 
Iteration 90000 M28222361, 0x306bf7766242d8b8, n = 1568K, CUDAPm1 v0.22 err = 0.07422 (0:07 real, 0.6585 ms/iter, ETA 7:19) 
Couldn't write checkpoint. 
Iteration 100000 M28222361, 0xed07659e0434e0e2, n = 1568K, CUDAPm1 v0.22 err = 0.07227 (0:06 real, 0.6621 ms/iter, ETA 7:15) 
Couldn't write checkpoint. 
Iteration 110000 M28222361, 0x440db9dc0edc5f08, n = 1568K, CUDAPm1 v0.22 err = 0.07422 (0:07 real, 0.6748 ms/iter, ETA 7:17) 
Couldn't write checkpoint. 
Iteration 120000 M28222361, 0x4e29dd3852cc20cc, n = 1568K, CUDAPm1 v0.22 err = 0.07031 (0:07 real, 0.6743 ms/iter, ETA 7:09) 
    SIGINT caught, writing checkpoint. 
    SIGINT caught, writing checkpoint. 
Couldn't write checkpoint. 
Estimated time spent so far: 1:22 
shell-init: error retrieving current directory: getcwd: cannot access parent directories: Transport endpoint is not connected 
cat: results.txt: Transport endpoint is not connected 
shell-init: error retrieving current directory: getcwd: cannot access parent directories: Transport endpoint is not connected 
rm: cannot remove 'results.txt': Transport endpoint is not connected
I have not seen this error before with CUDAPm1. Can someone reproduce it?
PS: I'm using the script I provided in post 158, with the addition of '!apt-get update' and '!apt-get install cuda-cudart-10-0 cuda-cufft-dev-10-0'.
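"Transport endpoint is not connected" means the FUSE mount to Google Drive has dropped out from under the program, so nothing can read or write the Drive paths until it is remounted. Nothing in this thread confirms a fix, but one mitigation worth trying (my assumption, using the standard Colab drive API) is forcing a fresh mount before restarting the run:
Code:
# if the Drive FUSE mount has died, force a fresh mount
from google.colab import drive
drive.mount('/content/drive', force_remount=True)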
Old 2019-12-20, 03:22   #725
kriesel

Quote:
Originally Posted by Dylan14
I was doing some P-1 on Mersenne numbers with no stage 2 done, in the hope of finding some factors. The script starts up OK and begins working on the exponent, but after a while it is unable to write to Drive:

I had a different issue in Colab with CUDAPm1: all-zero res64s in a selftest that failed to find a known factor. See https://www.mersenneforum.org/showth...928#post527928
Old 2019-12-21, 01:29   #726
kriesel

Quote:
Originally Posted by Dylan14
I was doing some P-1 on Mersenne numbers with no stage 2 done, in the hope of finding some factors. The script starts up OK and begins working on the exponent, but after a while it is unable to write to Drive:


Code:
...
Couldn't write checkpoint. 
Iteration 100000 M28222361, 0xed07659e0434e0e2, n = 1568K, CUDAPm1 v0.22 err = 0.07227 (0:06 real, 0.6621 ms/iter, ETA 7:15) 
Couldn't write checkpoint. 
Iteration 110000 M28222361, 0x440db9dc0edc5f08, n = 1568K, CUDAPm1 v0.22 err = 0.07422 (0:07 real, 0.6748 ms/iter, ETA 7:17) 
Couldn't write checkpoint. 
Iteration 120000 M28222361, 0x4e29dd3852cc20cc, n = 1568K, CUDAPm1 v0.22 err = 0.07031 (0:07 real, 0.6743 ms/iter, ETA 7:09) 
    SIGINT caught, writing checkpoint. 
    SIGINT caught, writing checkpoint. 
Couldn't write checkpoint. 
Estimated time spent so far: 1:22 
shell-init: error retrieving current directory: getcwd: cannot access parent directories: Transport endpoint is not connected 
cat: results.txt: Transport endpoint is not connected 
shell-init: error retrieving current directory: getcwd: cannot access parent directories: Transport endpoint is not connected 
rm: cannot remove 'results.txt': Transport endpoint is not connected
I have not seen this error before with CUDAPm1. Can someone reproduce it?
PS: I'm using the script I provided in post 158, with the addition of '!apt-get update' and '!apt-get install cuda-cudart-10-0 cuda-cufft-dev-10-0'.
My first P100 gpuowl session on Google Colab hit the same issue after 8 hours today. I've had many (dozens of) K80 gpuowl P-1 sessions without ever hitting it.
Code:
2019-12-20 23:29:54 colab/TeslaP100 Exception NSt12experimental10filesystem2v17__cxx1116filesystem_errorE: filesystem error: cannot get current path: Transport endpoint is not connected 
2019-12-20 23:29:54 colab/TeslaP100 waiting for background GCDs.. 
2019-12-20 23:29:54 colab/TeslaP100 Bye 
shell-init: error retrieving current directory: getcwd: cannot access parent directories: Transport endpoint is not connected 
shell-init: error retrieving current directory: getcwd: cannot access parent directories: Transport endpoint is not connected
