[QUOTE=lycorn;533133]That is true, but all my colab instances CPUs are identified as Intel Xeon @ 2.30GHz Linux64 under the My Account->CPUs option in the mersenne.org menu. Next time I use it I´ll select "no accelerator" and see what happens. It probably won´t change anything, meaning the "Intel Xeon @ 2.30GHz" reported is simply the VM´s CPU made available on each session.[/QUOTE]
Use !lscpu next time you connect and check the "Model" field: 63 is a Haswell Xeon, 79 is a Broadwell Xeon, and 85 is a Skylake Xeon, which supports AVX-512.
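For instance, a quick way to pull just those fields (a sketch; the exact lscpu output layout varies a bit by kernel version):

```shell
# In a Colab cell, prefix with "!". The numeric "Model" field identifies
# the Xeon generation: 63 = Haswell, 79 = Broadwell, 85 = Skylake (AVX-512).
lscpu | grep -E '^Model( name)?:'
```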
[QUOTE=lycorn;533133]That is true, but all my colab instances CPUs are identified as Intel Xeon @ 2.30GHz Linux64 under the My Account->CPUs option in the mersenne.org menu. Next time I use it I´ll select "no accelerator" and see what happens. It probably won´t change anything, meaning the "Intel Xeon @ 2.30GHz" reported is simply the VM´s CPU made available on each session.[/QUOTE]What is the name of the program you are running for ECM?
I routinely run mprime (for Mersenne prime PRP testing) together with gpuowl or mfaktc on the same Colab session, with mprime in the background. I'd expect a cpu-only "None" session on the same cpu type to give slightly better mprime performance, since the cpu would not also be lightly serving the gpu-centric application as it does when a gpu is in use. That doesn't make a gpu a cpu or vice versa. !lscpu shows cpu characteristics; !nvidia-smi shows gpu characteristics, including model. If those look ok, I log on to Google Drive for the session. Each program starts; if it's an mfaktc run, it's also put in the background, and top -d 120 is run in the foreground to show signs of life and, later, how long the session lasted. [url]https://www.mersenneforum.org/showpost.php?p=528073&postcount=8[/url]
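A hedged sketch of that session layout; the folder and binary names here are assumptions, not the poster's exact script:

```shell
# Launch a worker from its own folder, backgrounded, with output to a log.
start_bg() {
  ( cd "$1" && "./$2" >> run.log 2>&1 & )
}
# start_bg /content/drive/MyDrive/mprime  mprime   # PRP testing in background
# start_bg /content/drive/MyDrive/mfaktc  mfaktc   # TF on the gpu, also backgrounded
# top -d 120   # foreground: signs of life, and later how long the session lasted
```

Keeping each worker in its own folder means checkpoints and results files never collide between applications.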
T4
Next time someone gets a T4 on Google Colab, please run and submit benchmarks for TF and LL.
[url]https://www.mersenne.ca/mfaktc.php[/url] [url]https://www.mersenne.ca/cudalucas.php[/url]
[QUOTE=kriesel;533229]Next time someone gets a T4 on Google Colab, please run and submit benchmarks for TF and LL.[/QUOTE]
Done for TF; don't have the time to do the LL benchmark. James, as noted in the form, the T4s under Colab seem now to be "shared" across two instances. Also, while I'm typing... Based on Wayne's comment, I spun up a Kaggle TF instance again (on a "disposable" account). After 19 hours across three runs, still working. Weird!
[QUOTE=chalsall;533230]Also, while I'm typing... Based on Wayne's comment, I spun up a Kaggle TF instance again (on a "disposable" account). After 19 hours across three runs, still working. Weird![/QUOTE]
Hmmm... I spoke too soon. Kaggle account has just been "blocked"... |
[QUOTE=chalsall;533230]Done for TF; don't have the time to do the LL benchmark[/QUOTE]Thanks for the TF benchmark.
The LL is only 30,000 iterations on Mp48* (~58M), so it should not take any reasonable gpu very long. If you don't have the time to set it up and report it, it will wait.
[QUOTE=chalsall;533232]Hmmm... I spoke too soon. Kaggle account has just been "blocked"...[/QUOTE]
You just have bad luck. I'll know tomorrow if I'm still on the "NICE" list. |
gpu pot luck
Speaking of luck, the availability of K80 that I'm seeing in Colab is low again.
So I'm switching paradigms: make a folder for each gpu model that may come my way, launch a session, see which gpu I get, then launch the matching script from the relevant folder and its benchmarking work.
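That dispatch step can be sketched as a small shell helper; the folder names and launch script here are hypothetical, to be adjusted to your own Drive layout:

```shell
# Map whichever gpu Colab hands out to a per-model work folder.
gpu_dir() {
  case "$1" in
    *K80*)  echo k80   ;;
    *T4*)   echo t4    ;;
    *P100*) echo p100  ;;
    *)      echo other ;;
  esac
}
# In a session:
#   GPU=$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1)
#   cd "/content/drive/MyDrive/$(gpu_dir "$GPU")" && ./launch.sh
```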
Trouble with CudaPM1 with Colab?
I was working on doing some P-1 on some Mersennes with no stage 2 done in the hopes of finding some factors. The script starts up ok, and it starts working on the exponent, but after a while it is unable to write to the Drive:
[CODE]Best time for fft = 1568K, time: 0.0818, t1 = 256, t2 = 32, t3 = 64
Using threads: norm1 256, mult 128, norm2 128.
Using up to 15912M GPU memory.
Selected B1=525000, B2=12731250, 4.7% chance of finding a factor
Using B1 = 525000 from savefile.
Continuing stage 1 from a partial result of M28222361
fft length = 1568K, iteration = 60001
Iteration 70000 M28222361, 0x98b803a08faa200c, n = 1568K, CUDAPm1 v0.22 err = 0.07520 (0:06 real, 0.6628 ms/iter, ETA 7:35)
Iteration 80000 M28222361, 0x1f50201cb4e89065, n = 1568K, CUDAPm1 v0.22 err = 0.07031 (0:07 real, 0.6641 ms/iter, ETA 7:30)
Iteration 90000 M28222361, 0x306bf7766242d8b8, n = 1568K, CUDAPm1 v0.22 err = 0.07422 (0:07 real, 0.6585 ms/iter, ETA 7:19)
Couldn't write checkpoint.
Iteration 100000 M28222361, 0xed07659e0434e0e2, n = 1568K, CUDAPm1 v0.22 err = 0.07227 (0:06 real, 0.6621 ms/iter, ETA 7:15)
Couldn't write checkpoint.
Iteration 110000 M28222361, 0x440db9dc0edc5f08, n = 1568K, CUDAPm1 v0.22 err = 0.07422 (0:07 real, 0.6748 ms/iter, ETA 7:17)
Couldn't write checkpoint.
Iteration 120000 M28222361, 0x4e29dd3852cc20cc, n = 1568K, CUDAPm1 v0.22 err = 0.07031 (0:07 real, 0.6743 ms/iter, ETA 7:09)
SIGINT caught, writing checkpoint.
SIGINT caught, writing checkpoint.
Couldn't write checkpoint.
Estimated time spent so far: 1:22
shell-init: error retrieving current directory: getcwd: cannot access parent directories: Transport endpoint is not connected
cat: results.txt: Transport endpoint is not connected
shell-init: error retrieving current directory: getcwd: cannot access parent directories: Transport endpoint is not connected
rm: cannot remove 'results.txt': Transport endpoint is not connected[/CODE]I have not seen this error before with CUDAPm1. Can someone reproduce this? PS: I'm using the script I provided in post 158, with the addition of '!apt-get update' and '!apt-get install cuda-cudart-10-0 cuda-cufft-dev-10-0'.
[QUOTE=Dylan14;533248]I was working on doing some P-1 on some Mersennes with no stage 2 done in the hopes of finding some factors. The script starts up ok, and it starts working on the exponent, but after a while it is unable to write to the Drive:[/QUOTE] I had a different issue in Colab with CUDAPm1: all-zero res64s in a selftest that failed to find a known factor. See [url]https://www.mersenneforum.org/showthread.php?p=527928#post527928[/url]
[QUOTE=Dylan14;533248]I was working on doing some P-1 on some Mersennes with no stage 2 done in the hopes of finding some factors. The script starts up ok, and it starts working on the exponent, but after a while it is unable to write to the Drive:
[CODE]...
Couldn't write checkpoint.
Iteration 100000 M28222361, 0xed07659e0434e0e2, n = 1568K, CUDAPm1 v0.22 err = 0.07227 (0:06 real, 0.6621 ms/iter, ETA 7:15)
Couldn't write checkpoint.
Iteration 110000 M28222361, 0x440db9dc0edc5f08, n = 1568K, CUDAPm1 v0.22 err = 0.07422 (0:07 real, 0.6748 ms/iter, ETA 7:17)
Couldn't write checkpoint.
Iteration 120000 M28222361, 0x4e29dd3852cc20cc, n = 1568K, CUDAPm1 v0.22 err = 0.07031 (0:07 real, 0.6743 ms/iter, ETA 7:09)
SIGINT caught, writing checkpoint.
SIGINT caught, writing checkpoint.
Couldn't write checkpoint.
Estimated time spent so far: 1:22
shell-init: error retrieving current directory: getcwd: cannot access parent directories: Transport endpoint is not connected
cat: results.txt: Transport endpoint is not connected
shell-init: error retrieving current directory: getcwd: cannot access parent directories: Transport endpoint is not connected
rm: cannot remove 'results.txt': Transport endpoint is not connected[/CODE]I have not seen this error before with CUDAPm1. Can someone reproduce this? PS: I'm using the script I provided in post 158, with the addition of '!apt-get update' and '!apt-get install cuda-cudart-10-0 cuda-cufft-dev-10-0'.[/QUOTE] My first P100 gpuowl session on Google Colab hit the same issue today after 8 hours. I've had many (dozens of) K80 gpuowl P-1 sessions without ever hitting it.
[CODE]2019-12-20 23:29:54 colab/TeslaP100 Exception NSt12experimental10filesystem2v17__cxx1116filesystem_errorE: filesystem error: cannot get current path: Transport endpoint is not connected
2019-12-20 23:29:54 colab/TeslaP100 waiting for background GCDs..
2019-12-20 23:29:54 colab/TeslaP100 Bye
shell-init: error retrieving current directory: getcwd: cannot access parent directories: Transport endpoint is not connected
shell-init: error retrieving current directory: getcwd: cannot access parent directories: Transport endpoint is not connected[/CODE]