[QUOTE=xilman;231292]Some crypto applications use only integer and logical operations on small word sizes and are embarrassingly parallel. Examples include direct key search on simple block ciphers or LFSR-based stream ciphers, together with similar computations to build Hellman tables or rainbow tables. These typically run [B]very[/B] quickly on a GPU.[/QUOTE]
Did you compare against multi-threaded code on a current CPU? Is that CPU code somehow available? I'm not saying you're doing something wrong (in fact, I'd bet you aren't :smile:), but that'd be the first time I see a result with a speedup >20 without using some special graphic units of the GPU such as the interpolators. |
[QUOTE=ldesnogu;231298]that'd be the first time I see a result with a speedup >20 without using some special graphic units of the GPU such as the interpolators.[/QUOTE]
An available example of just this type of crypto application is distributed.net's RC5-72 challenge. It uses only very simple 32-bit integer instructions, requires very little memory, and is embarrassingly parallel. In short, it's ideal for any GPU. A Core i7-920 (2.8 GHz) using 8 threads can test 29.2 Mkeys/sec. Here are comparisons with various GPUs:

Core i7-920: 29.2 Mkeys/sec
nVidia GTX 275: 300 Mkeys/sec (10.3x)
nVidia GTX 480: 620 Mkeys/sec (21.2x)
ATI HD 4890: 730 Mkeys/sec (25x)
ATI HD 5870: 1.9 Gkeys/sec (65x)
ATI HD 5970: 3.0 Gkeys/sec (100x) |
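To make the "very simple 32-bit integer instructions" point concrete, here is a minimal C sketch of plain RC5-32/12/9 (72-bit key) following Rivest's published description. It is not distributed.net's code, and the function names (`rc5_72_setup`, `rc5_72_search_last_byte`, and so on) are hypothetical. The whole per-key workload is 32-bit adds, XORs, and rotates on a few dozen words of state, and no candidate key depends on any other.

```c
#include <assert.h>
#include <stdint.h>

#define RC5_ROUNDS 12
#define RC5_T (2 * (RC5_ROUNDS + 1)) /* 26 subkey words */
#define RC5_C 3                      /* 9 key bytes packed into 3 words */

/* Rotations use only the low 5 bits of the shift count, as RC5 specifies. */
static uint32_t rotl32(uint32_t x, uint32_t s) {
    s &= 31;
    return s ? (x << s) | (x >> (32 - s)) : x;
}
static uint32_t rotr32(uint32_t x, uint32_t s) {
    s &= 31;
    return s ? (x >> s) | (x << (32 - s)) : x;
}

/* Key schedule: expand a 72-bit key into the subkey table S. */
void rc5_72_setup(const uint8_t key[9], uint32_t S[RC5_T]) {
    uint32_t L[RC5_C] = {0, 0, 0};
    for (int i = 8; i >= 0; i--)            /* little-endian packing */
        L[i / 4] = (L[i / 4] << 8) + key[i];
    S[0] = 0xB7E15163u;                     /* P32 */
    for (int i = 1; i < RC5_T; i++)
        S[i] = S[i - 1] + 0x9E3779B9u;      /* Q32 */
    uint32_t A = 0, B = 0;
    int i = 0, j = 0;
    for (int k = 0; k < 3 * RC5_T; k++) {   /* 78 mixing steps */
        A = S[i] = rotl32(S[i] + A + B, 3);
        B = L[j] = rotl32(L[j] + A + B, A + B);
        i = (i + 1) % RC5_T;
        j = (j + 1) % RC5_C;
    }
}

void rc5_encrypt(const uint32_t S[RC5_T], const uint32_t pt[2], uint32_t ct[2]) {
    uint32_t A = pt[0] + S[0], B = pt[1] + S[1];
    for (int i = 1; i <= RC5_ROUNDS; i++) {
        A = rotl32(A ^ B, B) + S[2 * i];
        B = rotl32(B ^ A, A) + S[2 * i + 1];
    }
    ct[0] = A;
    ct[1] = B;
}

void rc5_decrypt(const uint32_t S[RC5_T], const uint32_t ct[2], uint32_t pt[2]) {
    uint32_t A = ct[0], B = ct[1];
    for (int i = RC5_ROUNDS; i >= 1; i--) {
        B = rotr32(B - S[2 * i + 1], A) ^ A;
        A = rotr32(A - S[2 * i], B) ^ B;
    }
    pt[0] = A - S[0];
    pt[1] = B - S[1];
}

/* Toy key search: try all 256 values of the last key byte and return the
   first one that maps pt to ct, or -1. Each candidate is independent, which
   is what lets a GPU hand one candidate to each thread. */
int rc5_72_search_last_byte(const uint8_t base[9], const uint32_t pt[2],
                            const uint32_t ct[2]) {
    uint8_t key[9];
    uint32_t S[RC5_T], out[2];
    assert(base && pt && ct);
    for (int i = 0; i < 9; i++) key[i] = base[i];
    for (int b = 0; b < 256; b++) {
        key[8] = (uint8_t)b;
        rc5_72_setup(key, S);
        rc5_encrypt(S, pt, out);
        if (out[0] == ct[0] && out[1] == ct[1]) return b;
    }
    return -1;
}
```

A real client searches the full 72-bit space and runs many keys at once; this sketch only shows the shape of the inner loop.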
Thanks for the info, very impressive, no matter the characteristics of the problem :smile:
|
Interesting. I've done some work with [url=http://www.primegrid.com/forum_thread.php?id=1737]nVIDIA[/url] and [url=http://www.primegrid.com/forum_thread.php?id=2683]ATI[/url], and I found ATI [i]slower[/i]! It's mostly simple operations; no memory required in the inner loop. I wonder whether it's the 64-bit integer multiplies or the OpenCL code that made ATI slow?
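On the 64-bit multiply question, a small C sketch of one plausible cost factor, assuming hardware whose native multiplier takes 32-bit operands (an assumption, not a measurement of either vendor's chips). The helper names `mul64_from_32` and `mul64_check` are hypothetical. Getting the low 64 bits of a 64x64 product takes one widening 32x32 multiply plus two more 32-bit multiplies and some adds and shifts, so each 64-bit multiply in an inner loop costs several native instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Low 64 bits of a*b, built only from multiplies with 32-bit operands. */
uint64_t mul64_from_32(uint64_t a, uint64_t b) {
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);
    uint64_t low = (uint64_t)a_lo * b_lo;       /* widening 32x32 -> 64 */
    uint32_t cross = a_lo * b_hi + a_hi * b_lo; /* only the low 32 bits survive */
    return low + ((uint64_t)cross << 32);       /* a_hi*b_hi falls off the top */
}

/* Sanity check against the compiler's native 64-bit multiply. */
void mul64_check(uint64_t a, uint64_t b) {
    assert(mul64_from_32(a, b) == a * b);
}
```

Whether this or the OpenCL compiler dominates in practice would need profiling on the actual cards.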
|
ATI's OpenCL compiler isn't very good right now. The ATI RC5 core is written in CAL/IL while the nVidia code is written in CUDA. As far as I'm aware, no one has tried to optimize the nVidia RC5 PTX code created by the CUDA compiler.
|
Hi mdettweiler,
[QUOTE=mdettweiler;231044]I tried re-hardcoding the u[sub]0[/sub] value manually for a specific LLR test and feeding MacLucasFFTW the number's exponent, but it didn't work[/QUOTE] Please upload the source code; I can review it. |
[QUOTE=msft;231459]Hi mdettweiler,
Please upload the source code; I can review it.[/QUOTE] Ah, I deleted it already. But it was pretty simple (brain-dead more like it). In the main function of the program, we have:
[code]
case 3:
    n = (q-1)/averbits + 1;
    j = power_of_two_length(n);
    n = choose_length(j,n);
    if (x != NULL) cutilSafeCall(cudaFreeHost((char *)x));
    cutilSafeCall(cudaMallocHost((void**) &x,(n+n)*sizeof(BIG_DOUBLE)));
    for (k=1;k<n;k++) x[k]=0.0;
    [B]x[0] = 4.0;[/B]
    j = 1;
    break;
[/code]
In the line that I've marked in bold, I changed 4.0 to 3.0, since for the number in question (2001*2^287959-1) that is the value of u[sub]0[/sub] that should be used for the LLR test. (Per [URL="http://en.wikipedia.org/wiki/Lucas–Lehmer–Riesel_test"]Wikipedia's article on the LLR test[/URL], u[sub]0[/sub]=3 is used when n is 3 mod 4, as it is in this case.) 2001*2^287959-1 is a known prime; the result I got, however, said that M287959 was composite, and gave a residual (which of course would be incorrect even for 2^287959-1). |
[QUOTE=mdettweiler;231465][code]
[B]x[0] = 4.0;[/B][/code] [/QUOTE] I think this code computes mod 2^287959-1, so it is no good. :cry: |
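To illustrate the modulus point with small numbers, here is a minimal C sketch. It assumes the textbook LLR starting value for k = 1 or 5 (mod 6) with 3 not dividing N, namely u[sub]0[/sub] = V_k(4,1), so it does not cover k = 2001, which is a multiple of 3 and needs a different choice of u[sub]0[/sub]. The function names are hypothetical and are not taken from MacLucasFFTW. It shows that the u -> u^2 - 2 loop has to reduce modulo k*2^n-1; keeping the Mersenne modulus 2^n-1 gives a meaningless residue even when u[sub]0[/sub] is right.

```c
#include <assert.h>
#include <stdint.h>

/* (a * b) mod m; safe in 64 bits because callers keep m below 2^32. */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
    return (a % m) * (b % m) % m;
}

/* V_k(4,1) mod m via V_0 = 2, V_1 = 4, V_{j+1} = 4*V_j - V_{j-1}. */
static uint64_t lucas_v4(uint64_t k, uint64_t m) {
    uint64_t prev = 2 % m, cur = 4 % m;
    if (k == 0) return prev;
    for (uint64_t j = 1; j < k; j++) {
        uint64_t next = (mulmod(4, cur, m) + m - prev) % m;
        prev = cur;
        cur = next;
    }
    return cur;
}

/* Run `iters` steps of u <- u^2 - 2 (mod m), starting from u0. */
static uint64_t ll_iterate(uint64_t u0, unsigned iters, uint64_t m) {
    uint64_t u = u0 % m;
    for (unsigned i = 0; i < iters; i++)
        u = (mulmod(u, u, m) + m - 2) % m;
    return u;
}

/* LLR test for N = k*2^n - 1, restricted to the easy case:
   k odd, k not divisible by 3, k < 2^n, 3 does not divide N, N < 2^32. */
int llr_is_prime(uint64_t k, unsigned n) {
    assert(n > 2 && n < 32 && (k & 1) && k % 3 != 0 && k < (1ull << n));
    uint64_t N = k * (1ull << n) - 1;
    assert(N < (1ull << 32) && N % 3 != 0);
    return ll_iterate(lucas_v4(k, N), n - 2, N) == 0;
}

/* Same u0 rule and iteration count, but reduced mod 2^n - 1 instead of N.
   This mimics changing only the starting value in a Mersenne-only program. */
uint64_t residue_wrong_modulus(uint64_t k, unsigned n) {
    uint64_t M = (1ull << n) - 1;
    return ll_iterate(lucas_v4(k, M), n - 2, M);
}
```

For example, 5*2^4-1 = 79 is prime and `llr_is_prime(5, 4)` reports that, while the same recipe reduced mod 2^4-1 = 15 leaves a nonzero residue. With k = 1 the routine reduces to the ordinary Lucas-Lehmer test with u[sub]0[/sub] = 4.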
[QUOTE=msft;231510]I think this code computes mod 2^287959-1, so it is no good. :cry:[/QUOTE]
Yeah, that was my thought too. The trouble is that I don't have the coding know-how to change that part. :ermm: Meanwhile, though, Gary's GPU has completed another LL result: [url=http://www.mersenne.org/report_exponent/?exp_lo=35000627&exp_hi=35000627&B1=Get+status]M35000627[/url] |
[QUOTE=mdettweiler;231534]
Meanwhile, though, Gary's GPU has completed another LL result: [url=http://www.mersenne.org/report_exponent/?exp_lo=35000627&exp_hi=35000627&B1=Get+status]M35000627[/url][/QUOTE] Our GPU is busy.:lol: |
Hi msft,
does this CUDA FFT implementation fit into your code? NukadaFFT [url]http://matsu-www.is.titech.ac.jp/~nukada/nufft/[/url] Oliver |