[QUOTE=R. Gerbicz;468640]Modular division by a small number "d" takes only linear time: say you want to get x/d mod N, where d is coprime to N. Let
r=x mod d, s=N mod d; then x/d==(x+k*N)/d mod N for k==(-r/s) mod d, and there is an exact(!) division here: d divides x+k*N. The idea behind this was to get the good k multiplier: k is good iff x+k*N==0 mod d, so k==-x/N==-r/s mod d.[/QUOTE]
Yes, thank you, this is a simpler way than a full modMul! |
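For anyone who wants to try the trick from the quote, here is a minimal Python sketch. The function name `div_small_mod` is made up for illustration; it assumes 0 <= x < N and gcd(d, N) = 1, and it uses Python's built-in `pow(s, -1, d)` (Python 3.8+) for the small inverse modulo d.

```python
def div_small_mod(x: int, d: int, N: int) -> int:
    """Return x/d mod N for a small d coprime to N, using one exact division.

    Hypothetical helper, not taken from any existing program.
    """
    r = x % d
    s = N % d
    # k == -r/s (mod d); the inverse is taken modulo the small d only.
    k = (-r * pow(s, -1, d)) % d
    # By the choice of k, d divides x + k*N exactly.
    q, rem = divmod(x + k * N, d)
    assert rem == 0, "d must be coprime to N"
    return q % N


# Example: divide by 9 modulo the Mersenne prime 2^61 - 1.
N = (1 << 61) - 1
x = 123456789123456789 % N
y = div_small_mod(x, 9, N)
print((9 * y) % N == x)
```

The only big-number work is one multiply of N by a value below d, one addition, and one division by the small d, which is why the cost is linear in the size of N.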
[QUOTE=GP2;468661]I don't really understand all the details being discussed here, but...
I think once you submit a truncated (64-bit or whatever) residue of a given type, and throw away the remaining bits, it's no longer possible to convert that truncated residue from one type to another, no? And for the sake of being able to double-check residues in the database that were returned by different programs, what matters is simply that the 64-bit hex strings match, regardless of how they were calculated. So switching from one type to another in the future would be quite problematic, and it would be quite important for all software authors to agree on a type, right from the start. Or am I missing something?[/QUOTE]
Agreeing on a single standard would be best. But George said that the server can ingest various "types" of PRP-3, and Prime95 will be able to double-check any. gpuOwl may be a reason for this, because the existing prp-3 residues were not "type-1". I'm looking into switching gpuOwl to type-1 (for the final residue) by doing a "div-by-9" at the end, because I understand that's the preferred "standard". I think indeed the "small" 64-bit res64 can't be transformed between types perfectly, but it can be moved one step back or forth with a small bounded error (< 3 per step), as Robert explained here: [url]http://www.mersenneforum.org/showpost.php?p=468381&postcount=210[/url] |
gpuOwl news, v1.5
A summary of recent changes (v 1.5):
- the savefile extension changed to "owl" (from "ll"),
- the savefile header is now a readable text first line. E.g. on Linux use "head -n1 74207281.owl" to get:
OWL 1 74207281 2000 2048 2048 -8840269 0
which means: iteration 2000 of 74207281, using 8M FFT (2048 x 2048 x 2), savefile checksum -8840269, and 0 recovered errors.
- added FFT 2M and 8M.
- added explicit FFT size setting using an argument like "-fft 8M" on the command line.

About FFT size selection: by default, the FFT size is automatically selected based on exponent size, with these cutoffs:
- under 40'000'000, use FFT 2M,
- under 78'000'000, use FFT 4M,
- under 155'000'000, use FFT 8M.
If the exponent is below 4'000'000 or above 155'000'000, it is not accepted.

The FFT size can be overridden by explicitly setting it on the command line with "-fft" and a choice of 2M, 4M or 8M. This allows setting nonsensical sizes that would surely result in [roundoff] errors, but those errors will be detected by the normal strong check.

A savefile contains the FFT size inside, and when loaded forces that particular size. It's not possible to change the size of a savefile in midflight. To change the size one must move the savefile away (or delete it), and start again with a different -fft value.

- the final residue is now of "type-1" (the preferred standard). So for a (PRP) prime, the residue is "1".
- There are two choices of kernels, "default" and "legacy". Usually the default set is faster, but it has much shorter carry propagation (4 words carry propagation). That means that the default set is only usable for bits-per-word >= 13. The "legacy" kernels have 32-words carry propagation, and this allows as low as 2 bits-per-word.
- For the new 8M FFT, due to poor VGPR optimization by the OpenCL compiler, the "default" kernels are slower than "legacy". One can select legacy by specifying "-legacy" on the command line. |
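To make the header layout and the default size cutoffs above concrete, here is a rough Python sketch. It is not gpuOwl code: the function and field names are invented for illustration, the meaning of the second header field ("1") is not explained above and is only assumed to be a format version, and an exponent of exactly 155'000'000 is treated as rejected, which is also an assumption.

```python
# Cutoffs as listed in the post: (exclusive upper bound, FFT size).
FFT_CUTOFFS = [(40_000_000, "2M"), (78_000_000, "4M"), (155_000_000, "8M")]
MIN_EXPONENT = 4_000_000


def auto_fft_size(exponent: int) -> str:
    """Pick the default FFT size for an exponent, or raise if out of range."""
    if exponent < MIN_EXPONENT:
        raise ValueError(f"exponent {exponent} is too small")
    for limit, size in FFT_CUTOFFS:
        if exponent < limit:
            return size
    raise ValueError(f"exponent {exponent} is too large")


def parse_owl_header(line: str) -> dict:
    """Split the one-line text header of a savefile into named fields."""
    tag, version, exponent, iteration, width, height, checksum, errors = line.split()
    if tag != "OWL":
        raise ValueError("not an OWL header")
    return {
        "version": int(version),        # assumed meaning, not stated in the post
        "exponent": int(exponent),
        "iteration": int(iteration),
        "fft_words": int(width) * int(height) * 2,  # 2048 x 2048 x 2 = 8M
        "checksum": int(checksum),
        "errors": int(errors),
    }


header = parse_owl_header("OWL 1 74207281 2000 2048 2048 -8840269 0")
print(header["iteration"], "of", header["exponent"], "FFT words:", header["fft_words"])
print("default size for this exponent:", auto_fft_size(header["exponent"]))
```

Note that the example header uses 8M even though the default for 74207281 would be 4M, which is consistent with the size having been set explicitly.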
GpuOwl news: FGT
I recently understood how to implement a "Fast Galois Transform" (FGT) which is simply complex arithmetic with integers modulo some number M.
I had hoped that this integer-only transform might be faster on the GPU because it does not use double-precision floating point (which is slow on commodity GPUs). So I had fun and implemented FGT modulo M(31)=2^31-1 and modulo M(61)=2^61-1. Unfortunately the hoped-for performance gain was not there, but it was a very cool exercise nevertheless.

Anyway, now it's possible to select among these 4 transforms:
-fft DP : the old double precision floating point
-fft SP : single precision FP
-fft M61 : FGT(M61)
-fft M31 : FGT(M31)

Of these, SP is very fast but also useless at 2M FFT size and up (it may prove useful for something at lower FFT sizes). M31 has about 5 usable bits-per-word at 4M FFT size. It's not much use by itself, but can be tested. M61 has more usable bits per word than DP, so it can be used for real work. Unfortunately it's also slower than DP. Part of the slowness may come from poor compiler optimizations, and that aspect may hopefully improve in the future.

I updated the savefile format to save "compacted" bits now. That means that it's possible to change the FFT size (among 2M, 4M, 8M) or the FFT kind (e.g. switching between DP and M61) in the middle of a test, and everything should work fine (assuming enough usable bits are available; otherwise the "Gerbicz check", which is used with all the transforms, will catch it).

It's nice that adding the FGTs was done with very little additional code compared to DP-FFT-only, since most of the code is common.

Also, a dynamic step of the Gerbicz verification is implemented, which starts with a very small step of 2K iterations at program start (allowing the user to see that the program functions correctly) and ramps up towards 500K if no errors are encountered, or back down if errors are detected.

Anyway, if anybody wants to play with pure-integer convolutions on the GPU (for the limited FFT sizes of 2M/4M/8M), the code is here: [url]https://github.com/preda/gpuowl[/url] |
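For readers unfamiliar with the idea, here is a small Python sketch of "complex arithmetic with integers modulo M" for M(61) and M(31). It is only an illustration, not the gpuOwl kernels: the names `reduce_mersenne`, `fgt_mul` and `fgt_pow` are made up, and values are (real, imag) pairs of integers in [0, p). Because p = 2^bits - 1 is a prime congruent to 3 mod 4, -1 has no square root mod p, so these pairs form a field with p^2 - 1 nonzero elements; for M(61) that count is 2^62 * (2^60 - 1), which is why roots of unity of large power-of-two order are available for the transform.

```python
M31 = (1 << 31) - 1
M61 = (1 << 61) - 1


def reduce_mersenne(v: int, bits: int) -> int:
    """Reduce a non-negative v modulo 2^bits - 1 using shifts and adds only."""
    p = (1 << bits) - 1
    while v > p:
        # 2^bits == 1 (mod p), so the high part can be folded onto the low part.
        v = (v & p) + (v >> bits)
    return 0 if v == p else v


def fgt_mul(a, b, bits=61):
    """Multiply (ar + ai*i) * (br + bi*i) with i*i = -1, modulo 2^bits - 1."""
    p = (1 << bits) - 1
    ar, ai = a
    br, bi = b
    # -ai*bi is written as (p - ai)*bi to keep every intermediate non-negative.
    real = reduce_mersenne(ar * br + (p - ai) * bi, bits)
    imag = reduce_mersenne(ar * bi + ai * br, bits)
    return (real, imag)


def fgt_pow(a, e, bits=61):
    """Raise a to the power e by repeated squaring."""
    result = (1, 0)
    while e:
        if e & 1:
            result = fgt_mul(result, a, bits)
        a = fgt_mul(a, a, bits)
        e >>= 1
    return result


print(fgt_mul((0, 1), (0, 1)) == (M61 - 1, 0))        # i*i == -1
print(fgt_pow((3, 5), M61 * M61 - 1) == (1, 0))       # order divides p^2 - 1
```

The reduction needs no division at all, which is the main attraction of Mersenne moduli for this kind of transform.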
"gpuowl -h" outputs:
[QUOTE]
gpuOwL v1.8-12fc090 GPU Mersenne primality checker
Command line options:
-size 2M|4M|8M : override FFT size.
-fft DP|SP|M61|M31 : choose FFT variant [default DP]:
    DP : double precision floating point.
    SP : single precision floating point.
    M61 : Fast Galois Transform (FGT) modulo M(61).
    M31 : FGT modulo M(31).
-user <name> : specify the user name.
-cpu <name> : specify the hardware name.
-legacy : use legacy kernels
-dump <path> : dump compiled ISA to the folder <path> that must exist.
[/QUOTE]
Note:
- the git revision hash is automatically included in the version number.
- a "dump" option was added, which on AMDGPU-pro and on ROCm dumps disassembled ISA in the specified folder (useful for performance tuning and debugging) |
1 Attachment(s)
Latest binaries for Windows from git.
|
I've tried to give it a go on my Haswell HD4600, giving the following error message.
[url]https://pastebin.com/raw/eUX6AuDy[/url] It's a beignet-related issue I guess? |
[QUOTE=kracker;471663]Latest binaries for Windows from git.[/QUOTE]
Thanks. This program doesn't work on WSL because WSL can't do GPU acceleration, so you just saved me the hassle of setting up MSVS or MinGW with OpenCL :smile: |
[QUOTE=heliosh;471686]I've tried to give it a go on my Haswell HD4600, giving the following error message.
[URL]https://pastebin.com/raw/eUX6AuDy[/URL] It's a beignet-related issue I guess?[/QUOTE]
[QUOTE]
In file included from stringInput.cl:1:
./gpuowl.cl:135:52: error: use of undeclared identifier 'M_SQRT1_2'
./gpuowl.cl:136:52: error: use of undeclared identifier 'M_SQRT1_2'
./gpuowl.cl:216:49: error: call to 'fabs' is ambiguous
[/QUOTE]
Yes, it appears that beignet does not correctly interpret the
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
which is used to enable "double" in OpenCL 1.x:
[url]https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/cl_khr_fp64.html[/url]
[url]https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/mathConstants.html[/url]

Unfortunately at this point I don't know more about that (i.e. double support in beignet). At first look it seems the problem is not in gpuOwl. There may be some work-around, but it's hard for me to locate it as I don't use beignet. |
GpuOwl 4M timings
Some performance data that I see on my hardware, at 4M FFT (adequate for the current wavefront around 76M).
This is with ROCm 1.6-180. ROCm seems to generate better optimized code compared to AMDGPU-pro, so in general better performance. All hardware is standard, air-cooled, nothing changed.

Vega64: 1.63 ms/it (under-clocked to 1401MHz for thermal reasons)
FuryX: 1.89 ms/it
R9-Nano: 2.05 ms/it (the card downclocks itself for thermal reasons)
390x: 2.17 ms/it

Broadly speaking this comes out to a bit under 2 days per exponent. |
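As a quick check of the "a bit under 2 days" figure, here is the arithmetic as a short Python sketch. It assumes a test of exponent p takes roughly p iterations and uses a round 76M wavefront exponent; the function name is made up for illustration.

```python
SECONDS_PER_DAY = 86_400

# ms per iteration at 4M FFT, as reported in the post above.
TIMINGS_MS = {"Vega64": 1.63, "FuryX": 1.89, "R9-Nano": 2.05, "390x": 2.17}


def days_per_test(exponent: int, ms_per_iteration: float) -> float:
    """Estimate wall-clock days, assuming about `exponent` iterations per test."""
    return exponent * ms_per_iteration / 1000 / SECONDS_PER_DAY


for card, ms in TIMINGS_MS.items():
    print(f"{card}: {days_per_test(76_000_000, ms):.2f} days")
```

This ignores the time spent on error checking and savefiles, so real runs will take somewhat longer.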
I know you can do this in your OpenCL code:
#ifdef cl_khr_fp64
// do something using the extension
#else
// do something else or #error!
#endif
but that probably won't be quite as elegant as using clGetDeviceInfo with CL_DEVICE_EXTENSIONS. That way you can give the end user a meaningful message. |
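To illustrate the host-side variant: the value that clGetDeviceInfo returns for CL_DEVICE_EXTENSIONS is a space-separated list of extension names, so the check itself is just a string test. The Python sketch below shows only that logic on an already retrieved string; it does not call OpenCL, and the function names, device name and message wording are made up for illustration.

```python
def has_extension(extensions: str, name: str = "cl_khr_fp64") -> bool:
    """Return True if `name` appears as a whole token in the extension string."""
    return name in extensions.split()


def fp64_message(device_name: str, extensions: str) -> str:
    """Build a user-facing message about double-precision support."""
    if has_extension(extensions):
        return f"{device_name}: cl_khr_fp64 available, double precision can be used"
    return (f"{device_name}: cl_khr_fp64 is not reported by the driver, "
            "so kernels using 'double' will not compile on this device")


# Hypothetical extension string, for demonstration only.
print(fp64_message("Example GPU", "cl_khr_icd cl_khr_fp16 cl_khr_byte_addressable_store"))
```

Splitting on whitespace rather than using a substring search avoids false matches on longer extension names that merely contain the one being looked for.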