mersenneforum.org  

Old 2017-04-30, 15:00   #67
LaurV
Romulan Interpreter
 
"name field"
Jun 2011
Thailand
5×11²×17 Posts

To be clear, the error is caused by the hardware, not by the program. The program does a wonderful job in catching the error!

My quest is (first) to resume properly after an error or a ctrl+c, and (second) to have all the residue files saved and properly named (CUDALucas style).
Old 2017-04-30, 22:56   #68
airsquirrels
 
"David"
Jul 2015
Ohio
11·47 Posts

Quote:
Originally Posted by airsquirrels View Post
Pretty fast for hand-rolled code (clFFT has had a lot of resources put into it by AMD), but I'm definitely not seeing the performance levels indicated above. Still slower than clLucas 1.04. Anything I should check? This is a FuryX.

Good news is, the residues match for the first chunk of iterations.

Code:
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
LL of 75002911 at iteration 0
FFT 1024*2048 (4M words, 17.88 bits per word)
OpenCL compile: 1106 ms
setup: 1638 ms
00020000 / 75002911 [0.03%], ms/iter: 5.413, ETA: 4d 16:45; 777b6635d6b78b75 error 0.125 (max 0.125)
00040000 / 75002911 [0.05%], ms/iter: 5.422, ETA: 4d 16:54; b9fc5678347cad9f error 0.140625 (max 0.140625)
00060000 / 75002911 [0.08%], ms/iter: 5.413, ETA: 4d 16:40; e7fab5c1f11d0f39 error 0.125 (max 0.140625)
00080000 / 75002911 [0.11%], ms/iter: 5.423, ETA: 4d 16:52; 76a6fb920dd95b71 error 0.140625 (max 0.140625)
Code:
Continuing work from a partial result of M75002911 fft length = 4096K iteration = 1001
Iteration 10000 M( 75002911 )C, 0xc9a6d6ecad1fb00c, n = 4096K, clLucas v1.04 err = 0.1875 (0:36 real, 3.5486 ms/iter, ETA 73:55:06)
Iteration 20000 M( 75002911 )C, 0x777b6635d6b78b75, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9315 ms/iter, ETA 81:53:04)
Iteration 30000 M( 75002911 )C, 0x0f0c343e5174fa89, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9205 ms/iter, ETA 81:38:41)
Iteration 40000 M( 75002911 )C, 0xb9fc5678347cad9f, n = 4096K, clLucas v1.04 err = 0.1875 (0:40 real, 3.9280 ms/iter, ETA 81:47:22)
Iteration 50000 M( 75002911 )C, 0x992a088a20504a90, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9407 ms/iter, ETA 82:02:32)
Iteration 60000 M( 75002911 )C, 0xe7fab5c1f11d0f39, n = 4096K, clLucas v1.04 err = 0.1875 (0:40 real, 3.9230 ms/iter, ETA 81:39:51)
Iteration 70000 M( 75002911 )C, 0x89386b82336fc06d, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9320 ms/iter, ETA 81:50:26)
Iteration 80000 M( 75002911 )C, 0x76a6fb920dd95b71, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9236 ms/iter, ETA 81:39:18)
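The residue match can also be checked mechanically rather than by eye. Below is a minimal sketch, assuming log lines shaped like the two excerpts above. The regular expressions and the helper names (`gpuowl_residues`, `cllucas_residues`, `compare`) are illustrative only and are not part of either program.

```python
import re

# Patterns fitted to the log excerpts in this post; they are assumptions,
# not an official log format of gpuOwL or clLucas.
GPUOWL_LINE = re.compile(
    r"^(\d+) / \d+ \[[^\]]*\], ms/iter: [\d.]+, ETA: [^;]*; ([0-9a-f]{16}) error"
)
CLLUCAS_LINE = re.compile(r"^Iteration (\d+) M\( \d+ \)C, 0x([0-9a-f]{16}),")


def _residues(pattern, text):
    """Map iteration number -> 64-bit residue (hex string) for matching lines."""
    found = {}
    for line in text.splitlines():
        m = pattern.match(line.strip())
        if m:
            found[int(m.group(1))] = m.group(2)
    return found


def gpuowl_residues(text):
    return _residues(GPUOWL_LINE, text)


def cllucas_residues(text):
    return _residues(CLLUCAS_LINE, text)


def compare(a, b):
    """Return {iteration: residues_equal} for iterations present in both logs."""
    return {it: a[it] == b[it] for it in sorted(a.keys() & b.keys())}


gpuowl_log = """
00020000 / 75002911 [0.03%], ms/iter: 5.413, ETA: 4d 16:45; 777b6635d6b78b75 error 0.125 (max 0.125)
00040000 / 75002911 [0.05%], ms/iter: 5.422, ETA: 4d 16:54; b9fc5678347cad9f error 0.140625 (max 0.140625)
"""
cllucas_log = """
Iteration 10000 M( 75002911 )C, 0xc9a6d6ecad1fb00c, n = 4096K, clLucas v1.04 err = 0.1875 (0:36 real, 3.5486 ms/iter, ETA 73:55:06)
Iteration 20000 M( 75002911 )C, 0x777b6635d6b78b75, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9315 ms/iter, ETA 81:53:04)
Iteration 40000 M( 75002911 )C, 0xb9fc5678347cad9f, n = 4096K, clLucas v1.04 err = 0.1875 (0:40 real, 3.9280 ms/iter, ETA 81:47:22)
"""

print(compare(gpuowl_residues(gpuowl_log), cllucas_residues(cllucas_log)))
```

Only iterations reported by both programs are compared, so the differing log intervals (every 20000 vs. every 10000 iterations) do not matter.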

I've set up a different Ubuntu 16.30-based system with the latest AMDGPU-PRO driver 17.10 on a FuryX. In this case, timing is incredibly improved.

Code:
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Fiji; OpenCL 1.2 AMD-APP (2348.3)
LL FFT 4096K (1024*2048*2) of 75002911 (17.88 bits/word) at iteration 0
OpenCL setup: 2419 ms
00020000 / 75002911 [0.03%], ms/iter: 2.400, ETA: 2d 02:00; 777b6635d6b78b75 error 0.125 (max 0.125)
00040000 / 75002911 [0.05%], ms/iter: 2.432, ETA: 2d 02:38; b9fc5678347cad9f error 0.125 (max 0.125)
00060000 / 75002911 [0.08%], ms/iter: 2.432, ETA: 2d 02:38; e7fab5c1f11d0f39 error 0.125 (max 0.125)
I also ran clLucas on the FuryX with the same system/driver with clFFT 2.12.2, which did not perform as well. Interestingly, the older driver on the previous system was faster for clLucas.
Code:
....
Iteration 10000 M( 75002911 )C, 0xc9a6d6ecad1fb00c, n = 4096K, clLucas v1.04 err = 0.1914 (0:48 real, 4.7985 ms/iter, ETA 99:57:18)
Iteration 20000 M( 75002911 )C, 0x777b6635d6b78b75, n = 4096K, clLucas v1.04 err = 0.1914 (0:48 real, 4.8387 ms/iter, ETA 100:46:48)
....
I also tested clLucas vs. gpuOwl on an RX480 in the same Ubuntu/AMDGPU-PRO system, which yielded very good numbers:
Code:
(clLucas)
Iteration 10000 M( 75002911 )C, 0xc9a6d6ecad1fb00c, n = 4096K, clLucas v1.04 err = 0.1914 (0:52 real, 5.1695 ms/iter, ETA 107:40:57)
Iteration 20000 M( 75002911 )C, 0x777b6635d6b78b75, n = 4096K, clLucas v1.04 err = 0.1914 (0:52 real, 5.1787 ms/iter, ETA 107:51:42)
Iteration 30000 M( 75002911 )C, 0x0f0c343e5174fa89, n = 4096K, clLucas v1.04 err = 0.1914 (0:52 real, 5.2092 ms/iter, ETA 108:28:54)

(gpuOwl)
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Ellesmere; OpenCL 1.2 AMD-APP (2348.3)
LL FFT 4096K (1024*2048*2) of 75002911 (17.88 bits/word) at iteration 0
OpenCL setup: 2419 ms
00020000 / 75002911 [0.03%], ms/iter: 3.677, ETA: 3d 04:35; 777b6635d6b78b75 error 0.125 (max 0.125)
00040000 / 75002911 [0.05%], ms/iter: 3.691, ETA: 3d 04:51; b9fc5678347cad9f error 0.125 (max 0.125)
Finally, here are numbers for a W9100 (Hawaii) using the 15.2 fglrx driver and AMD APPSDK 3.0:
Code:
(W9100 - gpuOwl)
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Hawaii; OpenCL 2.0 AMD-APP (1912.5)
LL FFT 4096K (1024*2048*2) of 75002911 (17.88 bits/word) at iteration 0
OpenCL setup: 888 ms
00020000 / 75002911 [0.03%], ms/iter: 3.180, ETA: 2d 18:14; 777b6635d6b78b75 error 0.140625 (max 0.140625)
00040000 / 75002911 [0.05%], ms/iter: 3.138, ETA: 2d 17:21; b9fc5678347cad9f error 0.132812 (max 0.140625)

(W9100 - clLucas)
Iteration 10000 M( 75002911 )C, 0xc9a6d6ecad1fb00c, n = 4096K, clLucas v1.04 err = 0.1914 (0:47 real, 4.7813 ms/iter, ETA 99:35:46)
Iteration 20000 M( 75002911 )C, 0x777b6635d6b78b75, n = 4096K, clLucas v1.04 err = 0.1914 (0:48 real, 4.7354 ms/iter, ETA 98:37:42)

The doubled performance is pretty amazing - now we just need more FFT sizes :)
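To put numbers on the comparison, here is a small sketch that computes the clLucas-to-gpuOwl speedup per card from the first ms/iter figure of each run quoted above. The function names are made up for illustration, and the total run time is only a rough estimate that assumes the iteration time stays constant.

```python
EXPONENT = 75002911

# First reported ms/iter of each run, copied from the logs in this post.
runs = {
    "FuryX (AMDGPU-PRO 17.10)": {"clLucas": 4.7985, "gpuOwl": 2.400},
    "RX480 (AMDGPU-PRO)": {"clLucas": 5.1695, "gpuOwl": 3.677},
    "W9100 (fglrx 15.2)": {"clLucas": 4.7813, "gpuOwl": 3.180},
}


def speedup(old_ms, new_ms):
    """How many times faster the new per-iteration time is than the old one."""
    return old_ms / new_ms


def total_days(exponent, ms_per_iter):
    """Rough full-test duration: about one iteration per bit of the exponent."""
    return exponent * ms_per_iter / 1000.0 / 86400.0


for card, ms in runs.items():
    s = speedup(ms["clLucas"], ms["gpuOwl"])
    d = total_days(EXPONENT, ms["gpuOwl"])
    print(f"{card}: {s:.2f}x faster, about {d:.2f} days per test with gpuOwl")
```

By these figures the factor of two applies to the FuryX; the RX480 and W9100 gains come out smaller, at roughly 1.4x and 1.5x.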
Old 2017-05-01, 05:14   #69
preda
 
"Mihai Preda"
Apr 2015
2650₈ Posts

Quote:
Originally Posted by LaurV View Post
To be clear, the error is caused by the hardware, not by the program. The program does a wonderful job in catching the error!

My quest is (first) to resume properly after an error or a ctrl+c, and (second) to have all the residue files saved and properly named (CUDALucas style).
LaurV, I just updated gpuOwL on github with these new changes:

- persist checkpoint every "savestep" (new command line argument, defaulting to 500 * logstep).
- use new name format for persist checkpoints (but with final extension .ll)
- use new checkpoint format. The human-readable info is now at the end. Can be printed nicely with:
"tail -1 c<N>.ll" (i.e. use "tail" to print only the very last line, which is the human-readable part).
- in general, use file naming in the style of CUDALucas but with .ll extension
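For anyone who wants the same "last line only" view from a script instead of `tail -1`, here is a minimal sketch. It assumes nothing about the checkpoint beyond what is stated above (the human-readable part is the final line); the helper name `last_line` and the sample file contents are hypothetical.

```python
import os
import tempfile


def last_line(path, chunk_size=4096):
    """Return the final line of a file without reading the whole file.

    Reads backwards in chunks until a newline precedes the last line.
    Trailing newlines at the very end of the file are ignored.
    """
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        data = b""
        while pos > 0:
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            data = f.read(step) + data
            if b"\n" in data.rstrip(b"\n"):
                break  # the complete last line is now in `data`
    return data.rstrip(b"\n").split(b"\n")[-1].decode("ascii", errors="replace")


# Demo on a made-up file: binary junk followed by a readable final line.
fd, demo_path = tempfile.mkstemp(suffix=".ll")
with os.fdopen(fd, "wb") as demo:
    demo.write(b"\x00\x01\x02binary\n\xff\xfe more\nhypothetical info 75002911 20000\n")
print(last_line(demo_path))
os.remove(demo_path)
```

Reading from the end keeps this cheap even for large checkpoint files, which is the same reason `tail` is convenient here.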

There may be bugs/problems with these new things, looking for feedback :)

Not done yet: no sub-folders.
Old 2017-05-01, 05:17   #70
preda
 
"Mihai Preda"
Apr 2015
2³·181 Posts

And a couple of other fixes:
- add a trivial checksum, to catch partially-written checkpoints.
- correctly handle multiple OpenCL "platforms" (discover all the devices in some multi-device setups)
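As an illustration of how a trivial checksum can catch a partially-written checkpoint, here is a sketch. This is not gpuOwL's actual format or checksum, which is not described here: the layout (little-endian 32-bit words followed by their sum modulo 2^32) and all names are assumptions chosen for the example.

```python
import struct


def checksum(words):
    """Sum of all words modulo 2^32 (an assumed, deliberately trivial checksum)."""
    return sum(words) & 0xFFFFFFFF


def pack_checkpoint(words):
    """Serialize the words and append the checksum as one extra word."""
    payload = struct.pack(f"<{len(words)}I", *words)
    return payload + struct.pack("<I", checksum(words))


def unpack_checkpoint(blob, n_words):
    """Return the words, or raise ValueError if the blob looks incomplete."""
    if len(blob) != 4 * (n_words + 1):
        raise ValueError("checkpoint has the wrong length (truncated write?)")
    *words, stored = struct.unpack(f"<{n_words + 1}I", blob)
    if checksum(words) != stored:
        raise ValueError("checksum mismatch (partially written checkpoint?)")
    return words


blob = pack_checkpoint([1, 2, 3, 4])
print(unpack_checkpoint(blob, 4))
```

The length check catches a file that was cut short, while the checksum catches a file of the right size whose tail was never filled in. A plain sum is weak against deliberate or unlucky corruption, so it should be read as a sanity check only.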
Old 2017-05-01, 05:25   #71
preda
 
"Mihai Preda"
Apr 2015
2³×181 Posts

Quote:
Originally Posted by airsquirrels View Post
I've set up a different Ubuntu 16.30-based system with the latest AMDGPU-PRO driver 17.10 on a FuryX. In this case, timing is incredibly improved.
Nice, that sort of answers the question ("where did the performance go? -- ask the OpenCL compiler") and saves me the effort to perf-debug the ISA dumps.

Quote:
Originally Posted by airsquirrels View Post
The doubled performance is pretty amazing - now we just need more FFT sizes :)
OK, what FFT sizes do you need? (and why?)
Old 2017-05-01, 05:32   #72
preda
 
"Mihai Preda"
Apr 2015
2³·181 Posts

BTW, did you notice the improved error margin as well? Not a huge deal, but it does extend a bit the exponent range available for a given FFT size (which 'radically' changes the cost for the exponents 'on the border' that now fit in the lower, power-of-two FFT).
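The link between error margin and exponent range can be made concrete with the "bits per word" figure from the logs. The sketch below reproduces the 17.88 bits/word reported for 75002911 at 4096K and shows how a tolerated bits/word limit translates into a largest exponent. The two limits used (18.0 and 18.4) are made-up illustrative values, not measured thresholds of any program.

```python
FFT_WORDS_4096K = 1024 * 2048 * 2  # "4096K (1024*2048*2)" as printed in the log


def bits_per_word(exponent, fft_words):
    """Average number of bits of the exponent carried by each FFT word."""
    return exponent / fft_words


def max_exponent(bits_limit, fft_words):
    """Largest exponent whose average bits/word stays within bits_limit."""
    return int(bits_limit * fft_words)


print(round(bits_per_word(75002911, FFT_WORDS_4096K), 2))  # figure shown in the log
print(round(bits_per_word(76453229, FFT_WORDS_4096K), 2))

# Hypothetical limits, only to show how a better error margin widens the range.
for limit in (18.0, 18.4):
    print(limit, max_exponent(limit, FFT_WORDS_4096K))
```

A lower round-off error leaves room for more bits per word, so exponents slightly above the old cut-off can stay on the 4096K FFT instead of moving to the next, slower size.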
Old 2017-05-01, 11:54   #73
LaurV
Romulan Interpreter
 
"name field"
Jun 2011
Thailand
5×11²×17 Posts

Yes, we said in the very first post that some 77M exponents can be done with 4K, in spite of the fact that c*Lucas wants more, and we appreciate this!

Short question: Does the new file format imply that I cannot resume from the old format? (If so, I will first have to wait for 76453229 to finish before playing with the new version, sorry. You do not have to do anything in this direction; whatever format you choose for the future is OK with us.)

Last fiddled with by LaurV on 2017-05-01 at 11:57
Old 2017-05-01, 14:37   #74
Mark Rose
 
"/X\(‘-‘)/X\"
Jan 2013
2×19×83 Posts

Quote:
Originally Posted by preda View Post
OK, what FFT sizes do you need? (and why?)
Well, DC is basically beyond 2048K at this time, mostly in the 2560K range, but starting to reach into the 3072K range. LL, as you have discovered, is in the 4096K and 4608K range.

But it really comes down to what exact FFT sizes will be fastest for the hardware.
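Several of the sizes listed above are not powers of two, which appears to be the point of asking for more FFT sizes. A quick sketch that splits each size into its odd factor and power of two (the helper name is illustrative):

```python
def odd_part_and_log2(n):
    """Split n into (odd, k) such that n == odd * 2**k."""
    k = 0
    while n % 2 == 0:
        n //= 2
        k += 1
    return n, k


for size_k in (2048, 2560, 3072, 4096, 4608):
    odd, k = odd_part_and_log2(size_k * 1024)
    kind = "power of two" if odd == 1 else f"{odd} * 2^{k}"
    print(f"{size_k}K: {kind}")
```

So 2560K, 3072K and 4608K would need transforms with a factor of 5, 3 or 9 in their length, in addition to the power-of-two part.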
Old 2017-05-01, 16:04   #75
henryzz
Just call me Henry
 
"David"
Sep 2007
Liverpool (GMT/BST)
1011110101000₂ Posts

There are many people who would appreciate a fast fft modulo k*2^n-1 for the LLR test for the GPU
Old 2017-05-01, 16:31   #76
Mark Rose
 
"/X\(‘-‘)/X\"
Jan 2013
C52₁₆ Posts

Quote:
Originally Posted by henryzz View Post
There are many people who would appreciate a fast fft modulo k*2^n-1 for the LLR test for the GPU
k*2^n+1 would also be nice.
Old 2017-05-01, 16:42   #77
rogue
 
"Mark"
Apr 2003
Between here and the
7,069 Posts

Quote:
Originally Posted by henryzz View Post
There are many people who would appreciate a fast fft modulo k*2^n-1 for the LLR test for the GPU
To be more specific: k*b^n+1 and k*b^n-1.
All times are UTC.



Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
