mersenneforum.org

CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

ldesnogu 2010-09-24 18:20

[QUOTE=xilman;231292]Some crypto applications use only integer and logical operations on small word sizes and are embarrassingly parallel. Examples include direct key search on simple block ciphers or LFSR-based stream ciphers, together with similar computations to build Hellman tables or rainbow tables. These typically run [B]very[/B] quickly on a GPU.[/QUOTE]
Did you compare against multi-threaded code on a current CPU? Is that CPU code somehow available?

I'm not saying you're doing something wrong (in fact, I'd bet you aren't :smile:), but that'd be the first time I've seen a result with a speedup >20 without using some special graphics units of the GPU such as the interpolators.

frmky 2010-09-24 20:11

[QUOTE=ldesnogu;231298]that'd be the first time I've seen a result with a speedup >20 without using some special graphics units of the GPU such as the interpolators.[/QUOTE]
An available example of just this type of crypto application is distributed.net's RC5-72 challenge. It uses only very simple 32-bit integer instructions, requires very little memory, and is embarrassingly parallel. In short, it's ideal for any GPU. A Core i7-920 (2.8 GHz) using 8 threads can test 29.2 Mkeys/sec. Here are comparisons with various GPUs, with the speedup over the Core i7-920 in parentheses.

Core i7-920: 29.2 Mkeys/sec
nVidia GTX 275: 300 Mkeys/sec (10.3x)
nVidia GTX 480: 620 Mkeys/sec (21.2x)
ATI HD 4890: 730 Mkeys/sec (25x)
ATI HD 5870: 1.9 Gkeys/sec (65x)
ATI HD 5970: 3.0 Gkeys/sec (100x)
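
To make the "ideal for any GPU" point concrete, here is a minimal C sketch of an RC5-32/12/9 key search (72-bit key), written from the published RC5 description and not taken from distributed.net's client. The function names and the way candidate keys are enumerated (upper 5 key bytes fixed, a 32-bit counter in the low 4 bytes) are assumptions for illustration only. Each candidate needs nothing but 32-bit adds, XORs and rotates on about 30 words of state, and no candidate depends on any other, so on a GPU each loop iteration could in principle be given to its own thread.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RC5_ROUNDS    12                       /* r */
#define RC5_KEY_BYTES 9                        /* b: 72-bit key */
#define RC5_T         (2 * (RC5_ROUNDS + 1))   /* 26 subkey words */
#define RC5_C         3                        /* ceil(9 / 4) key words */

/* 32-bit left rotate; only the low 5 bits of s are used. */
uint32_t rotl32(uint32_t x, uint32_t s)
{
    s &= 31;
    return (x << s) | (x >> ((32 - s) & 31));
}

/* RC5 key schedule: expand a 9-byte key into 26 subkey words. */
void rc5_expand(const uint8_t key[RC5_KEY_BYTES], uint32_t S[RC5_T])
{
    uint32_t L[RC5_C] = {0, 0, 0};
    uint32_t A = 0, B = 0;
    int i, j, k;

    for (i = RC5_KEY_BYTES - 1; i >= 0; i--)   /* little-endian key words */
        L[i / 4] = (L[i / 4] << 8) + key[i];

    S[0] = 0xB7E15163u;                        /* P32 */
    for (i = 1; i < RC5_T; i++)
        S[i] = S[i - 1] + 0x9E3779B9u;         /* Q32 */

    for (k = 0, i = 0, j = 0; k < 3 * RC5_T; k++) {
        A = S[i] = rotl32(S[i] + A + B, 3);
        B = L[j] = rotl32(L[j] + A + B, A + B);
        i = (i + 1) % RC5_T;
        j = (j + 1) % RC5_C;
    }
}

/* Encrypt one 64-bit block (two 32-bit words). */
void rc5_encrypt(const uint32_t S[RC5_T], const uint32_t pt[2], uint32_t ct[2])
{
    uint32_t A = pt[0] + S[0], B = pt[1] + S[1];
    int i;

    for (i = 1; i <= RC5_ROUNDS; i++) {
        A = rotl32(A ^ B, B) + S[2 * i];
        B = rotl32(B ^ A, A) + S[2 * i + 1];
    }
    ct[0] = A;
    ct[1] = B;
}

/* Try `count` candidate keys: low 4 key bytes count up from `start`,
   upper 5 bytes are fixed (hypothetical enumeration order).
   Returns 1 and stores the matching low word in *found, else returns 0. */
int rc5_search(const uint8_t high[5], uint32_t start, uint32_t count,
               const uint32_t pt[2], const uint32_t ct[2], uint32_t *found)
{
    uint32_t n;

    assert(found != NULL);
    for (n = 0; n < count; n++) {              /* iterations are independent */
        uint32_t lo = start + n, S[RC5_T], out[2];
        uint8_t key[RC5_KEY_BYTES];
        int i;

        for (i = 0; i < 4; i++)
            key[i] = (uint8_t)(lo >> (8 * i));
        memcpy(key + 4, high, 5);
        rc5_expand(key, S);
        rc5_encrypt(S, pt, out);
        if (out[0] == ct[0] && out[1] == ct[1]) {
            *found = lo;
            return 1;
        }
    }
    return 0;
}
```

The whole working set per candidate is the 26-word subkey table plus three key words, which is why the problem needs so little memory; the data-dependent rotates are the only mildly unusual instruction.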

ldesnogu 2010-09-24 20:32

Thanks for the info, very impressive, no matter the characteristics of the problem :smile:

Ken_g6 2010-09-25 04:36

Interesting. I've done some work with [url=http://www.primegrid.com/forum_thread.php?id=1737]nVIDIA[/url] and [url=http://www.primegrid.com/forum_thread.php?id=2683]ATI[/url], and I found ATI [i]slower[/i]! It's mostly simple operations; no memory required in the inner loop. I wonder whether it's the 64-bit integer multiplies or the OpenCL code that made ATI slow?

frmky 2010-09-25 05:20

ATI's OpenCL compiler isn't very good right now. The ATI RC5 core is written in CAL/IL while the nVidia code is written in CUDA. As far as I'm aware, no one has tried to optimize the nVidia RC5 PTX code created by the CUDA compiler.

msft 2010-09-25 23:44

Hi, mdettweiler
[QUOTE=mdettweiler;231044]I tried re-hardcoding the u[sub]0[/sub] value manually for a specific LLR test and feeding MacLucasFFTW the number's exponent, but it didn't work[/QUOTE]
Please upload the source code; I can review it.

mdettweiler 2010-09-26 01:34

[QUOTE=msft;231459]Hi, mdettweiler

Please upload the source code; I can review it.[/QUOTE]
Ah, I deleted it already. But it was pretty simple (brain-dead, more like it). In the main function of the program, we have:
[code]case 3:
    n = (q-1)/averbits +1;
    j = power_of_two_length(n);
    n = choose_length(j,n);

    if (x != NULL) cutilSafeCall(cudaFreeHost((char *)x));
    cutilSafeCall(cudaMallocHost((void**) &x,(n+n)*sizeof(BIG_DOUBLE)));
    for (k=1;k<n;k++) x[k]=0.0;
    [B]x[0] = 4.0;[/B]
    j = 1;
    break;[/code]
In the line that I've marked in bold, I changed 4.0 to 3.0, since for the number in question (2001*2^287959-1) that is the value of u[sub]0[/sub] that should be used for the LLR test. (Per [URL="http://en.wikipedia.org/wiki/Lucas–Lehmer–Riesel_test"]Wikipedia's article on the LLR test[/URL], u[sub]0[/sub]=3 is used when n is 3 mod 4, as it is in this case.)

2001*2^287959-1 is a known prime; the result I got, however, said that M287959 was composite, and gave a residue (which of course would be incorrect even for 2^287959-1).

msft 2010-09-26 11:32

[QUOTE=mdettweiler;231465][code]
[B]x[0] = 4.0;[/B][/code]
[/QUOTE]
I think this code computes mod 2^287959-1, so it is no good. :cry:
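
A small C sketch of this point, using word-sized numbers instead of FFT arithmetic: for an LLR test of N = k*2^n-1, both the seed and the modulus have to change, not the seed alone. The function names are hypothetical. The seed rule used here, u[sub]0[/sub] = V_k(4,1) mod N, is the one that applies when N is 1 mod 3 (so k is not a multiple of 3); it does not cover k = 2001, which is divisible by 3 and needs a different seed selection.

```c
#include <assert.h>
#include <stdint.h>

/* V_k(4,1) mod N, from V_0 = 2, V_1 = 4, V_{m+1} = 4*V_m - V_{m-1}.
   Requires 2 < N < 2^32 so intermediate values fit in 64 bits. */
uint64_t lucas_v4(uint64_t k, uint64_t N)
{
    uint64_t prev, cur, m;

    assert(N > 2 && N < (1ull << 32));
    prev = 2;
    cur = 4 % N;
    for (m = 1; m < k; m++) {
        uint64_t next = (4 * cur + N - prev) % N;
        prev = cur;
        cur = next;
    }
    return k == 0 ? prev : cur;
}

/* Apply u <- u*u - 2 (mod N) `steps` times and return the final residue. */
uint64_t lucas_iterate(uint64_t u, unsigned steps, uint64_t N)
{
    unsigned i;

    assert(N > 2 && N < (1ull << 32));
    u %= N;
    for (i = 0; i < steps; i++)
        u = (u * u + N - 2) % N;
    return u;
}

/* LLR test of N = k*2^n - 1 for small inputs: returns 1 if N is prime.
   Only valid under the asserted conditions (seed V_k(4,1) case). */
int llr_is_prime(uint64_t k, unsigned n)
{
    uint64_t N;

    assert(n >= 3 && n <= 15);
    assert(k % 2 == 1 && k < (1ull << n));
    N = (k << n) - 1;
    assert(N % 3 == 1);            /* condition for the V_k(4,1) seed */
    return lucas_iterate(lucas_v4(k, N), n - 2, N) == 0;
}
```

With k = 5 and n = 4 (N = 79, which is prime), `llr_is_prime(5, 4)` returns 1. Running the same seed and step count modulo 2^4-1 = 15 instead, as in `lucas_iterate(lucas_v4(5, 15), 2, 15)`, leaves a non-zero residue: the arithmetic is then a statement about 15, not about 79, which is the same kind of false "composite" reported above.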

mdettweiler 2010-09-26 18:27

[QUOTE=msft;231510]I think this code computes mod 2^287959-1, so it is no good. :cry:[/QUOTE]
Yeah, that was my thought too. The trouble is that I don't have the coding know-how to change that part. :ermm:

Meanwhile, though, Gary's GPU has completed another LL result:
[url=http://www.mersenne.org/report_exponent/?exp_lo=35000627&exp_hi=35000627&B1=Get+status]M35000627[/url]

msft 2010-09-27 03:34

[QUOTE=mdettweiler;231534]
Meanwhile, though, Gary's GPU has completed another LL result:
[url=http://www.mersenne.org/report_exponent/?exp_lo=35000627&exp_hi=35000627&B1=Get+status]M35000627[/url][/QUOTE]
Our GPU is busy. :lol:

TheJudger 2010-09-30 09:31

Hi msft,

Does this CUDA FFT implementation fit into your code?

NukadaFFT
[url]http://matsu-www.is.titech.ac.jp/~nukada/nufft/[/url]

Oliver

