mersenneforum.org More CUDA Queries
 Register FAQ Search Today's Posts Mark Forums Read

 2014-11-15, 00:05 #1 EdH     "Ed Hall" Dec 2009 Adirondack Mtns 107748 Posts More CUDA Queries I have a working CUDA program that calculates the proper divisor sum of smallish numbers at this point, across many threads. However, I've only implemented trial factoring so far. Currently, I have the array of small primes initialized in the kernel, since I haven't been able to successfully initialize it in the host and provide a single copy on the GPU for all threads to use. I've tried looking at shared memory, but I'm not sure if what I'm looking at is in the proper direction. I'd like to move: Code: const unsigned short ps[0,2,3,5,7,11,13,...N] the current array initialization, from the kernel to the host and place one copy in the GPU for all the threads to access from any subsequent kernel calls. The program I'm working with will currently not process numbers of more than 14 digits, so I am thinking of performing the entire factorization via TF (if I can fit the array into the GPU memory) until I can move more functions into the kernel. My eventual plan is to move my entire cycle search routine into threads on the GPU and see if I can run 48 threads of an equivalent program to the two I'm running in the CPU threads, currently. Any pointers, hints, comments, etc. that anyone might care to provide would be appreciated.
 2014-11-15, 02:41 #2 BenR     Nov 2014 916 Posts Hi, How large is this array, and how often is it accessed, and what is the access pattern? You'll probably want to put this in constant memory or shared memory. If you put it in constant memory, you can copy it there once and then it will persist between kernel calls. Declaring it as a global variable in your kernel will also put it in constant memory. Constant memory is faster than global memory since it is cached, and optimized for when all threads are reading the same address. Otherwise the reads are serialized. You might also want to consider copying it from constant memory into shared memory at the beginning of the kernel, assuming its not too big. I find this method to be most useful for lookup tables with random access patterns.
 2014-11-15, 04:38 #3 EdH     "Ed Hall" Dec 2009 Adirondack Mtns 22·1,151 Posts Thanks for the reply. It looks like efficiency is going to keep me below 1000 elements. The elements will each be used from the start to some point within, or possibly through the entire set, by every thread. I have played with a full set of about 240k elements in a CPU version, but the array was over 1M in size. It does work, but it's too slow. I'm going to have to figure out a way to duplicate some math routines within the kernel. Constant memory sounds like what I'll want, but I couldn't figure out the correct way to initialize and then access it.
 2014-11-15, 05:14 #4 BenR     Nov 2014 32 Posts You can write to constant memory using the cudaMemcpyToSymbol function: Code: // Constant memory cannot be dynamically allocated so the size must be // known at runtime, and is limited to 64KB __constant__ unsigned short device_ps[ 20000 ]; int main() { unsigned short host_ps[ 20000 ]; ... cudaMemcpyToSymbol(device_ps, host_ps, sizeof(host_ps), 0, cudaMemcpyHostToDevice); }
 2014-11-15, 16:09 #5 EdH     "Ed Hall" Dec 2009 Adirondack Mtns 22·1,151 Posts Thank you, BenR, That appears to be working fine. The array ended up with 6542 elements and a size of 13084 bytes. I had seen that reference in the manual, but hadn't been able to implement it properly. Thank you for helping out.
 2014-11-27, 00:16 #6 EdH     "Ed Hall" Dec 2009 Adirondack Mtns 22×1,151 Posts Miller-Rabin Only Valid to 31 Bits? I appear to have a 32 (31) bit problem, but I can't run it down. I've written a GPU version of the Miller-Rabin algorithm that works fine at 10 digits, but doesn't find primes of 11 digits. I believe I have all necessary variables using int64_t, which should work at 11 digits. I will eventually be using this up to 16 digits, if possible. Here is my code: Code: #include #include #define N 18 __constant__ int64_t dev_rand[100]; __device__ int64_t mulmod( int64_t a, int64_t b, int64_t mod){ int64_t x=0, y=a%mod; while (b>0){ if ( b%2==1 ) { x=(x+y)%mod; } y=(y*2)%mod; b/=2; } return x%mod; } __device__ int64_t modulo(int64_t base, int64_t exponent, int64_t mod){ int64_t x=1; int64_t y=base; while (exponent>0){ if (exponent%2==1) { x=(x*y)%mod; } y=(y*y)%mod; exponent/=2; } return x%mod; } __global__ void millerrabin( int64_t *a, int *b ) { int i, iteration, tid=blockIdx.x; int64_t s; if (tid < N){ iteration=b[tid]; if (a[tid]<2) { b[tid]=0; } else if (a[tid]!=2 && a[tid]%2==0) { b[tid]=0; } else{ s=a[tid]-1; while (s%2==0) { s/=2; } for (i=0;i \n", argv[0]); return(1); } cudaError_t err = cudaSuccess; inputnum = strtoul(argv[1],0,10); iteration = atoi(argv[2]); if (iteration>100) { iteration=100; } for (i=0;i<100;i++) { host_rand[i]=rand()%(inputnum-1)+1; } cudaMemcpyToSymbol(dev_rand, host_rand, sizeof(host_rand), 0, cudaMemcpyHostToDevice); for (i=0;i>>( dev_a, dev_b ); err = cudaGetLastError(); if (err != cudaSuccess) { printf("Failed to launch add kernel (error code: %s)!\n", cudaGetErrorString(err)); } // copy the arrays back from the GPU to the CPU cudaMemcpy( a, dev_a, N * size64, cudaMemcpyDeviceToHost ); cudaMemcpy( b, dev_b, N * size, cudaMemcpyDeviceToHost ); // display the results for (i=0;i
2014-11-27, 02:57   #7
axn

Jun 2003

28·3·7 Posts

Quote:
 Originally Posted by EdH Code:  if (exponent%2==1) { x=(x*y)%mod; } y=(y*y)%mod;`
These two multiplications can overflow 64 bits, if the modulus > 2^32

 2014-11-27, 03:47 #8 EdH     "Ed Hall" Dec 2009 Adirondack Mtns 22×1,151 Posts Thanks! I just got done converting all the % lines to an equivalent equation, with no change. Now I can look at the real issue.
2014-11-27, 03:53   #9
axn

Jun 2003

28×3×7 Posts

Quote:
 Originally Posted by EdH Thanks! I just got done converting all the % lines to an equivalent equation, with no change. Now I can look at the real issue.
Just to clarify, the issue is not with %, rather, if x,y > 2^32 (which will happen when mod > 2^32), x*y (and y*y) > 2^64 which will overflow int64, and therefore will give wrong answer.

2014-11-27, 04:07   #10
EdH

"Ed Hall"
Dec 2009

107748 Posts

Quote:
 Originally Posted by axn Just to clarify, the issue is not with %, rather, if x,y > 2^32 (which will happen when mod > 2^32), x*y (and y*y) > 2^64 which will overflow int64, and therefore will give wrong answer.
Sorry, I wasn't clear enough in my reply. I did recognize that fully when you pointed it out. I was just commenting on how I had travelled down an unrelated path, due to seeing something on line that seemed to infer that % was 32 bit in CUDA. Thank you. I "think" I'm on the right page and have stepped back to my version prior to the % change. Now to find a solution...

Last fiddled with by EdH on 2014-11-27 at 04:08

2014-11-27, 07:38   #11
axn

Jun 2003

28×3×7 Posts

Quote:
 Originally Posted by EdH Sorry, I wasn't clear enough in my reply. I did recognize that fully when you pointed it out. I was just commenting on how I had travelled down an unrelated path, due to seeing something on line that seemed to infer that % was 32 bit in CUDA. Thank you. I "think" I'm on the right page and have stepped back to my version prior to the % change. Now to find a solution...
Ah! A misreading on my part. But better safe than sorry, I say.

 Similar Threads Thread Thread Starter Forum Replies Last Post paulofaller Information & Answers 24 2014-04-02 16:57 S485122 PrimeNet 0 2009-03-10 07:01 shu_the_genius PrimeNet 5 2003-12-19 18:00 bc PSearch 6 2003-05-01 13:16 maxscribe Lounge 1 2002-08-18 18:07

All times are UTC. The time now is 10:34.

Thu Jun 30 10:34:19 UTC 2022 up 77 days, 8:35, 0 users, load averages: 1.47, 1.36, 1.30