#1
"Ed Hall"
Dec 2009
Adirondack Mtns
12204₈ Posts
I have a working CUDA program that calculates the proper divisor sum of smallish numbers across many threads. However, I've only implemented trial factoring so far. Currently I have the array of small primes initialized in the kernel, since I haven't been able to initialize it on the host and provide a single copy on the GPU for all threads to use. I've looked at shared memory, but I'm not sure whether that's the right direction. I'd like to move:
Code:
const unsigned short ps[] = {0, 2, 3, 5, 7, 11, 13, /* ... up to N */};
My eventual plan is to move my entire cycle-search routine into threads on the GPU and see whether I can run 48 GPU threads of a program equivalent to the two I'm currently running in CPU threads. Any pointers, hints, comments, etc. that anyone might care to provide would be appreciated.
#2
Nov 2014
32 Posts
Hi,
How large is the array, how often is it accessed, and what is the access pattern? You'll probably want to put it in constant memory or shared memory.

If you put it in constant memory, you can copy it there once and it will persist between kernel calls. Declaring it as a __constant__ variable at file scope (rather than inside the kernel) also places it in constant memory. Constant memory is faster than global memory because it is cached, and it is optimized for the case where all threads read the same address; otherwise the reads are serialized.

You might also want to consider copying it from constant memory into shared memory at the beginning of the kernel, assuming it's not too big. I find this method most useful for lookup tables with random access patterns.
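For example, here is a minimal sketch of that constant-to-shared staging (not from this thread; the array name, size, and kernel are purely illustrative):
Code:
#include <cuda_runtime.h>

#define PS_COUNT 1000                          // illustrative table size

__constant__ unsigned short c_ps[PS_COUNT];    // filled once from the host

__global__ void lookup_kernel(/* ... */)
{
    __shared__ unsigned short s_ps[PS_COUNT];

    // Each thread copies a strided slice of the table, then the block
    // synchronizes before anyone uses it.
    for (int i = threadIdx.x; i < PS_COUNT; i += blockDim.x) {
        s_ps[i] = c_ps[i];
    }
    __syncthreads();

    // ... use s_ps[] for the per-thread work ...
}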
#3
"Ed Hall"
Dec 2009
Adirondack Mtns
2²×13×101 Posts
Thanks for the reply. It looks like efficiency is going to keep me below 1000 elements. Every thread will read the elements in order from the start, stopping at some point within the set or possibly running through the entire set. I have played with a full set of about 240k elements in a CPU version, but that array was over 1M in size; it works, but it's too slow. I'm going to have to figure out a way to duplicate some math routines within the kernel.

Constant memory sounds like what I'll want, but I couldn't figure out the correct way to initialize and then access it.
#4
Nov 2014
32 Posts
You can write to constant memory using the cudaMemcpyToSymbol function:
Code:
// Constant memory cannot be dynamically allocated, so the size must be
// known at compile time, and the total is limited to 64 KB.
__constant__ unsigned short device_ps[ 20000 ];

int main()
{
    unsigned short host_ps[ 20000 ];
    ...
    cudaMemcpyToSymbol(device_ps, host_ps, sizeof(host_ps), 0, cudaMemcpyHostToDevice);
}
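Once the symbol has been filled, any kernel in the same compilation unit can read it directly; a minimal sketch (the kernel itself is just illustrative):
Code:
// Illustrative only: device_ps behaves like an ordinary array inside kernels.
__global__ void use_primes(unsigned int *out, int count)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < count) {
        out[tid] = device_ps[tid];   // cached read from constant memory
    }
}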
#5
"Ed Hall"
Dec 2009
Adirondack Mtns
2²·13·101 Posts
Thank you, BenR,
That appears to be working fine. The array ended up with 6542 elements and a size of 13084 bytes. I had seen that reference in the manual, but hadn't been able to implement it properly. Thank you for helping out.
#6
"Ed Hall"
Dec 2009
Adirondack Mtns
5252₁₀ Posts
I appear to have a 32-bit (31-bit) problem, but I can't run it down. I've written a GPU version of the Miller-Rabin algorithm that works fine at 10 digits, but doesn't find primes at 11 digits. I believe all the necessary variables are int64_t, which should work at 11 digits. I will eventually be using this up to 16 digits, if possible.
Here is my code:
Code:
#include <stdio.h>
#include <math.h>
#include <stdint.h>   // int64_t
#include <stdlib.h>   // strtoul, atoi, rand

#define N 18

__constant__ int64_t dev_rand[100];

__device__ int64_t mulmod(int64_t a, int64_t b, int64_t mod)
{
    int64_t x = 0, y = a % mod;
    while (b > 0) {
        if (b % 2 == 1) {
            x = (x + y) % mod;
        }
        y = (y * 2) % mod;
        b /= 2;
    }
    return x % mod;
}

__device__ int64_t modulo(int64_t base, int64_t exponent, int64_t mod)
{
    int64_t x = 1;
    int64_t y = base;
    while (exponent > 0) {
        if (exponent % 2 == 1) {
            x = (x * y) % mod;
        }
        y = (y * y) % mod;
        exponent /= 2;
    }
    return x % mod;
}

__global__ void millerrabin(int64_t *a, int *b)
{
    int i, iteration, tid = blockIdx.x;
    int64_t s;
    if (tid < N) {
        iteration = b[tid];
        if (a[tid] < 2) {
            b[tid] = 0;
        } else if (a[tid] != 2 && a[tid] % 2 == 0) {
            b[tid] = 0;
        } else {
            s = a[tid] - 1;
            while (s % 2 == 0) {
                s /= 2;
            }
            for (i = 0; i < iteration; i++) {
                int64_t aa = dev_rand[i], temp = s;
                int64_t mod = modulo(aa, temp, a[tid]);
                while (temp != a[tid] - 1 && mod != 1 && mod != a[tid] - 1) {
                    mod = mulmod(mod, mod, a[tid]);
                    temp *= 2;
                }
                if (mod != a[tid] - 1 && temp % 2 == 0) {
                    b[tid] = 0;
                }
            }
        }
        if (b[tid] != 0) {
            b[tid] = 1;
        }
    }
}

int main(int argc, char *argv[])
{
    int64_t inputnum, *dev_a, host_rand[100], a[N];
    int i, iteration, *dev_b, b[N], size64 = sizeof(int64_t), size = sizeof(int);

    if (argc <= 2) {
        printf("usage: %s <number> <iterations>\n", argv[0]);
        return (1);
    }

    cudaError_t err = cudaSuccess;
    inputnum = strtoul(argv[1], 0, 10);
    iteration = atoi(argv[2]);
    if (iteration > 100) {
        iteration = 100;
    }
    for (i = 0; i < 100; i++) {
        host_rand[i] = rand() % (inputnum - 1) + 1;
    }
    cudaMemcpyToSymbol(dev_rand, host_rand, sizeof(host_rand), 0, cudaMemcpyHostToDevice);

    for (i = 0; i < N; i++) {
        a[i] = inputnum + i;
        b[i] = iteration;
    }

    // allocate the memory on the GPU
    cudaMalloc(&dev_a, N * size64);
    cudaMalloc(&dev_b, N * size);

    // copy the a and b arrays to the GPU
    cudaMemcpy(dev_a, a, N * size64, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * size, cudaMemcpyHostToDevice);

    millerrabin<<<N, 1>>>(dev_a, dev_b);

    err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("Failed to launch millerrabin kernel (error code: %s)!\n", cudaGetErrorString(err));
    }

    // copy the arrays back from the GPU to the CPU
    cudaMemcpy(a, dev_a, N * size64, cudaMemcpyDeviceToHost);
    cudaMemcpy(b, dev_b, N * size, cudaMemcpyDeviceToHost);

    // display the results
    for (i = 0; i < N; i++) {
        printf("%ld is ", a[i]);
        if (!b[i]) {
            printf("composite.\n");
        } else {
            printf("prime.\n");
        }
    }

    // printf("%lld/%lld time\n", c1, sz1);

    // free the memory allocated on the GPU
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaDeviceReset();
    return 0;
}
Code:
$ ./mrtest 1123456801 5
1123456801 is prime.
1123456802 is composite.
1123456803 is composite.
1123456804 is composite.
1123456805 is composite.
1123456806 is composite.
1123456807 is composite.
1123456808 is composite.
1123456809 is composite.
1123456810 is composite.
1123456811 is prime.
1123456812 is composite.
1123456813 is composite.
1123456814 is composite.
1123456815 is composite.
1123456816 is composite.
1123456817 is composite.
1123456818 is composite.
Code:
$ ./mrtest 11123456801 5
11123456801 is composite.
11123456802 is composite.
11123456803 is composite.
11123456804 is composite.
11123456805 is composite.
11123456806 is composite.
11123456807 is composite.
11123456808 is composite.
11123456809 is composite.
11123456810 is composite.
11123456811 is composite.
11123456812 is composite.
11123456813 is composite.
11123456814 is composite.
11123456815 is composite.
11123456816 is composite.
11123456817 is composite.
11123456818 is composite.
Code:
$ ./mrtest 11123456801 50
11123456801 is composite.
11123456802 is composite.
11123456803 is composite.
11123456804 is composite.
11123456805 is composite.
11123456806 is composite.
11123456807 is composite.
11123456808 is composite.
11123456809 is composite.
11123456810 is composite.
11123456811 is composite.
11123456812 is composite.
11123456813 is composite.
11123456814 is composite.
11123456815 is composite.
11123456816 is composite.
11123456817 is composite.
11123456818 is composite.
Thanks for any assistance...
#7
Jun 2003
5,437 Posts
#8
"Ed Hall"
Dec 2009
Adirondack Mtns
5252₁₀ Posts
Thanks! I just got done converting all the % lines to equivalent expressions, with no change in the results. Now I can look at the real issue.
#9
Jun 2003
153D₁₆ Posts
Just to clarify, the issue is not with % itself. Rather, when mod > 2^32, x and y can themselves exceed 2^32, so x*y (and y*y) can exceed 2^64; the product overflows int64_t and gives a wrong answer.
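One common workaround (a rough, untested sketch with made-up function names, not the poster's code) is to route every multiply-mod through the same add-and-double trick the posted mulmod already uses, so the full product is never formed. This stays safe as long as mod is well below 2^62, which covers the 16-digit targets mentioned above:
Code:
#include <stdint.h>

// Multiply a*b (mod m) without forming the full product: repeated doubling
// keeps every intermediate value below 2*m.
__device__ int64_t mulmod_safe(int64_t a, int64_t b, int64_t m)
{
    int64_t x = 0, y = a % m;
    while (b > 0) {
        if (b % 2 == 1) {
            x = (x + y) % m;      // x + y < 2*m, no overflow for m < 2^62
        }
        y = (y * 2) % m;
        b /= 2;
    }
    return x;
}

// Modular exponentiation that uses mulmod_safe instead of (x*y) % m.
__device__ int64_t powmod_safe(int64_t base, int64_t exponent, int64_t m)
{
    int64_t x = 1, y = base % m;
    while (exponent > 0) {
        if (exponent % 2 == 1) {
            x = mulmod_safe(x, y, m);   // instead of (x*y) % m
        }
        y = mulmod_safe(y, y, m);       // instead of (y*y) % m
        exponent /= 2;
    }
    return x;
}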
#10
"Ed Hall"
Dec 2009
Adirondack Mtns
2²×13×101 Posts
Sorry, I wasn't clear enough in my reply. I did recognize that fully when you pointed it out. I was just commenting on how I had travelled down an unrelated path, after seeing something online that seemed to imply that % was 32-bit in CUDA. Thank you. I "think" I'm on the right page and have stepped back to my version prior to the % change. Now to find a solution...

Last fiddled with by EdH on 2014-11-27 at 04:08
#11
Jun 2003
5437₁₀ Posts