20141115, 00:05  #1 
"Ed Hall"
Dec 2009
Adirondack Mtns
3,259 Posts 
More CUDA Queries
I have a working CUDA program that calculates the proper divisor sum of smallish numbers at this point, across many threads. However, I've only implemented trial factoring so far. Currently, I have the array of small primes initialized in the kernel, since I haven't been able to successfully initialize it in the host and provide a single copy on the GPU for all threads to use. I've tried looking at shared memory, but I'm not sure if what I'm looking at is in the proper direction. I'd like to move:
Code:
const unsigned short ps[0,2,3,5,7,11,13,...N] My eventual plan is to move my entire cycle search routine into threads on the GPU and see if I can run 48 threads of an equivalent program to the two I'm running in the CPU threads, currently. Any pointers, hints, comments, etc. that anyone might care to provide would be appreciated. 
20141115, 02:41  #2 
Nov 2014
2^{3} Posts 
Hi,
How large is this array, and how often is it accessed, and what is the access pattern? You'll probably want to put this in constant memory or shared memory. If you put it in constant memory, you can copy it there once and then it will persist between kernel calls. Declaring it as a global variable in your kernel will also put it in constant memory. Constant memory is faster than global memory since it is cached, and optimized for when all threads are reading the same address. Otherwise the reads are serialized. You might also want to consider copying it from constant memory into shared memory at the beginning of the kernel, assuming its not too big. I find this method to be most useful for lookup tables with random access patterns. 
20141115, 04:38  #3 
"Ed Hall"
Dec 2009
Adirondack Mtns
3,259 Posts 
Thanks for the reply. It looks like efficiency is going to keep me below 1000 elements. The elements will each be used from the start to some point within, or possibly through the entire set, by every thread. I have played with a full set of about 240k elements in a CPU version, but the array was over 1M in size. It does work, but it's too slow. I'm going to have to figure out a way to duplicate some math routines within the kernel.
Constant memory sounds like what I'll want, but I couldn't figure out the correct way to initialize and then access it. 
20141115, 05:14  #4 
Nov 2014
2^{3} Posts 
You can write to constant memory using the cudaMemcpyToSymbol function:
Code:
// Constant memory cannot be dynamically allocated so the size must be // known at runtime, and is limited to 64KB __constant__ unsigned short device_ps[ 20000 ]; int main() { unsigned short host_ps[ 20000 ]; ... cudaMemcpyToSymbol(device_ps, host_ps, sizeof(host_ps), 0, cudaMemcpyHostToDevice); } 
20141115, 16:09  #5 
"Ed Hall"
Dec 2009
Adirondack Mtns
110010111011_{2} Posts 
Thank you, BenR,
That appears to be working fine. The array ended up with 6542 elements and a size of 13084 bytes. I had seen that reference in the manual, but hadn't been able to implement it properly. Thank you for helping out. 
20141127, 00:16  #6 
"Ed Hall"
Dec 2009
Adirondack Mtns
110010111011_{2} Posts 
MillerRabin Only Valid to 31 Bits?
I appear to have a 32 (31) bit problem, but I can't run it down. I've written a GPU version of the MillerRabin algorithm that works fine at 10 digits, but doesn't find primes of 11 digits. I believe I have all necessary variables using int64_t, which should work at 11 digits. I will eventually be using this up to 16 digits, if possible.
Here is my code: Code:
#include <stdio.h> #include <math.h> #define N 18 __constant__ int64_t dev_rand[100]; __device__ int64_t mulmod( int64_t a, int64_t b, int64_t mod){ int64_t x=0, y=a%mod; while (b>0){ if ( b%2==1 ) { x=(x+y)%mod; } y=(y*2)%mod; b/=2; } return x%mod; } __device__ int64_t modulo(int64_t base, int64_t exponent, int64_t mod){ int64_t x=1; int64_t y=base; while (exponent>0){ if (exponent%2==1) { x=(x*y)%mod; } y=(y*y)%mod; exponent/=2; } return x%mod; } __global__ void millerrabin( int64_t *a, int *b ) { int i, iteration, tid=blockIdx.x; int64_t s; if (tid < N){ iteration=b[tid]; if (a[tid]<2) { b[tid]=0; } else if (a[tid]!=2 && a[tid]%2==0) { b[tid]=0; } else{ s=a[tid]1; while (s%2==0) { s/=2; } for (i=0;i<iteration;i++){ int64_t aa=dev_rand[i], temp=s; int64_t mod=modulo(aa, temp, a[tid]); while (temp!=a[tid]1 && mod!=1 && mod!=a[tid]1){ mod=mulmod(mod, mod, a[tid]); temp*=2; } if (mod!=a[tid]1 && temp%2==0) { b[tid]=0; } } } if (b[tid]!=0) { b[tid]=1; } } } int main( int argc, char *argv[] ) { int64_t inputnum, *dev_a, host_rand[100], a[N]; int i, iteration, *dev_b, b[N], size64=sizeof(int64_t), size=sizeof(int); if (argc <= 2) { printf("usage: %s <number> <iterations>\n", argv[0]); return(1); } cudaError_t err = cudaSuccess; inputnum = strtoul(argv[1],0,10); iteration = atoi(argv[2]); if (iteration>100) { iteration=100; } for (i=0;i<100;i++) { host_rand[i]=rand()%(inputnum1)+1; } cudaMemcpyToSymbol(dev_rand, host_rand, sizeof(host_rand), 0, cudaMemcpyHostToDevice); for (i=0;i<N;i++) { a[i]=inputnum+i; b[i]=iteration; } // allocate the memory on the GPU cudaMalloc(&dev_a, N * size64 ); cudaMalloc(&dev_b, N * size ); // copy a and b arrayss to the GPU cudaMemcpy( dev_a, a, N * size64, cudaMemcpyHostToDevice ); cudaMemcpy( dev_b, b, N * size, cudaMemcpyHostToDevice ); millerrabin<<<N,1>>>( dev_a, dev_b ); err = cudaGetLastError(); if (err != cudaSuccess) { printf("Failed to launch add kernel (error code: %s)!\n", cudaGetErrorString(err)); } // copy the arrays back from the GPU to the CPU cudaMemcpy( a, dev_a, N * size64, cudaMemcpyDeviceToHost ); cudaMemcpy( b, dev_b, N * size, cudaMemcpyDeviceToHost ); // display the results for (i=0;i<N;i++){ printf("%ld is ", a[i]); if (!b[i]) { printf("composite.\n"); } else { printf("prime.\n"); } } // printf("%lld/%lld time\n", c1, sz1); // free the memory allocated on the GPU cudaFree( dev_a ); cudaFree( dev_b ); cudaDeviceReset(); return 0; } Code:
$ ./mrtest 1123456801 5 1123456801 is prime. 1123456802 is composite. 1123456803 is composite. 1123456804 is composite. 1123456805 is composite. 1123456806 is composite. 1123456807 is composite. 1123456808 is composite. 1123456809 is composite. 1123456810 is composite. 1123456811 is prime. 1123456812 is composite. 1123456813 is composite. 1123456814 is composite. 1123456815 is composite. 1123456816 is composite. 1123456817 is composite. 1123456818 is composite. Code:
$ ./mrtest 11123456801 5 11123456801 is composite. 11123456802 is composite. 11123456803 is composite. 11123456804 is composite. 11123456805 is composite. 11123456806 is composite. 11123456807 is composite. 11123456808 is composite. 11123456809 is composite. 11123456810 is composite. 11123456811 is composite. 11123456812 is composite. 11123456813 is composite. 11123456814 is composite. 11123456815 is composite. 11123456816 is composite. 11123456817 is composite. 11123456818 is composite. Code:
$ ./mrtest 11123456801 50 11123456801 is composite. 11123456802 is composite. 11123456803 is composite. 11123456804 is composite. 11123456805 is composite. 11123456806 is composite. 11123456807 is composite. 11123456808 is composite. 11123456809 is composite. 11123456810 is composite. 11123456811 is composite. 11123456812 is composite. 11123456813 is composite. 11123456814 is composite. 11123456815 is composite. 11123456816 is composite. 11123456817 is composite. 11123456818 is composite. Thanks for any assistance... 
20141127, 02:57  #7 
Jun 2003
13×359 Posts 

20141127, 03:47  #8 
"Ed Hall"
Dec 2009
Adirondack Mtns
CBB_{16} Posts 
Thanks! I just got done converting all the % lines to an equivalent equation, with no change. Now I can look at the real issue.

20141127, 03:53  #9 
Jun 2003
123B_{16} Posts 
Just to clarify, the issue is not with %, rather, if x,y > 2^32 (which will happen when mod > 2^32), x*y (and y*y) > 2^64 which will overflow int64, and therefore will give wrong answer.

20141127, 04:07  #10 
"Ed Hall"
Dec 2009
Adirondack Mtns
3,259 Posts 
Sorry, I wasn't clear enough in my reply. I did recognize that fully when you pointed it out. I was just commenting on how I had travelled down an unrelated path, due to seeing something on line that seemed to infer that % was 32 bit in CUDA. Thank you. I "think" I'm on the right page and have stepped back to my version prior to the % change. Now to find a solution...
Last fiddled with by EdH on 20141127 at 04:08 
20141127, 07:38  #11  
Jun 2003
13×359 Posts 
Quote:


Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Website Queries  paulofaller  Information & Answers  24  20140402 16:57 
Result Queries  Factoring Limits  S485122  PrimeNet  0  20090310 07:01 
Individual Account Report queries  shu_the_genius  PrimeNet  5  20031219 18:00 
PRP queries  bc  PSearch  6  20030501 13:16 
copyright, archives...housekeeping queries!  maxscribe  Lounge  1  20020818 18:07 