mersenneforum.org
Old 2014-11-15, 00:05   #1
EdH

"Ed Hall"
Dec 2009
Adirondack Mtns

3,259 Posts

More CUDA Queries

I have a working CUDA program that calculates the proper divisor sum of smallish numbers across many threads, but so far I've only implemented trial factoring. Currently, the array of small primes is initialized in the kernel, since I haven't been able to initialize it on the host and provide a single copy on the GPU for all threads to use. I've looked at shared memory, but I'm not sure whether I'm looking in the right direction. I'd like to move:
Code:
const unsigned short ps[] = { 0, 2, 3, 5, 7, 11, 13, ..., N };
the current array initialization, from the kernel to the host, and place one copy on the GPU for all the threads to access in any subsequent kernel calls. The program I'm working with currently won't process numbers of more than 14 digits, so I'm thinking of performing the entire factorization via TF (if I can fit the array into GPU memory) until I can move more functions into the kernel.

My eventual plan is to move my entire cycle-search routine into threads on the GPU and see whether I can run 48 GPU threads of a program equivalent to the two I'm currently running in CPU threads.

Any pointers, hints, comments, etc. that anyone might care to provide would be appreciated.
Old 2014-11-15, 02:41   #2
BenR

Nov 2014

23 Posts

Hi,

How large is this array, how often is it accessed, and what is the access pattern?

You'll probably want to put this in constant memory or shared memory.

If you put it in constant memory, you can copy it there once and it will persist between kernel calls. Declaring it as a __constant__ global variable in your kernel source file will also put it in constant memory. Constant memory is faster than global memory since it is cached, and it is optimized for the case where all threads read the same address; otherwise the reads are serialized.

You might also want to consider copying it from constant memory into shared memory at the beginning of the kernel, assuming it's not too big. I find this method to be most useful for lookup tables with random access patterns.
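As a rough illustration, staging a __constant__ table into shared memory at the top of a kernel looks something like this (a sketch only; the names and sizes are placeholders, not your actual code):
Code:
// Sketch only: NUM_PRIMES, dev_primes and the kernel name are illustrative.
#define NUM_PRIMES 1000

__constant__ unsigned short dev_primes[NUM_PRIMES];    // filled once from the host

__global__ void divisor_kernel(/* ... */)
{
    // One copy of the table per block, staged from constant memory.
    __shared__ unsigned short s_primes[NUM_PRIMES];

    for (int i = threadIdx.x; i < NUM_PRIMES; i += blockDim.x)
        s_primes[i] = dev_primes[i];
    __syncthreads();    // make sure the whole table is staged before anyone uses it

    // ... trial division against s_primes[] ...
}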
Old 2014-11-15, 04:38   #3
EdH

"Ed Hall"
Dec 2009
Adirondack Mtns

3,259 Posts

Thanks for the reply. It looks like efficiency is going to keep me below 1000 elements. Every thread will read the elements in order from the start up to some point within the set, or possibly through the entire set. I have played with a full set of about 240k elements in a CPU version, but that array was over 1 MB in size. It does work, but it's too slow. I'm going to have to figure out a way to duplicate some math routines within the kernel.
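For illustration, the per-thread work is roughly this kind of loop (just a sketch of the idea, not my actual code; the names are placeholders):
Code:
// Sketch only: sum of proper divisors by trial division against a small-prime
// table ps[] of length np. Assumes the table reaches past the square root of
// whatever cofactor remains, so any leftover m is prime.
__device__ int64_t proper_divisor_sum(int64_t n, const unsigned short *ps, int np)
{
    int64_t m = n, sigma = 1;
    for (int i = 0; i < np && (int64_t)ps[i] * ps[i] <= m; i++) {
        if (m % ps[i] == 0) {
            int64_t term = 1, pk = 1;
            while (m % ps[i] == 0) {   // strip out the full power of ps[i]
                m /= ps[i];
                pk *= ps[i];
                term += pk;            // 1 + p + p^2 + ...
            }
            sigma *= term;
        }
    }
    if (m > 1) sigma *= (1 + m);       // remaining prime cofactor
    return sigma - n;                  // proper divisors exclude n itself
}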

Constant memory sounds like what I'll want, but I couldn't figure out the correct way to initialize and then access it.
Old 2014-11-15, 05:14   #4
BenR

Nov 2014

23 Posts

You can write to constant memory using the cudaMemcpyToSymbol function:

Code:
// Constant memory cannot be dynamically allocated, so the size must be
// known at compile time, and the total is limited to 64KB
__constant__ unsigned short device_ps[ 20000 ];


int main()
{

unsigned short host_ps[ 20000 ];

...

cudaMemcpyToSymbol(device_ps, host_ps, sizeof(host_ps), 0, cudaMemcpyHostToDevice);

}
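On the device side, the kernel can then read device_ps directly; a __constant__ symbol is visible to kernels in the same source file without being passed as an argument. A minimal sketch (the kernel name and body are only illustrative):
Code:
// Sketch only: reading the __constant__ array from a kernel.
__global__ void use_primes(int64_t *n, int count)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < count) {
        for (int i = 0; i < 20000; i++) {
            unsigned short p = device_ps[i];   // cached, broadcast-friendly read
            // ... trial divide n[tid] by p ...
        }
    }
}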
Old 2014-11-15, 16:09   #5
EdH

"Ed Hall"
Dec 2009
Adirondack Mtns

110010111011₂ Posts

Thank you, BenR,

That appears to be working fine. The array ended up with 6542 elements and a size of 13084 bytes. I had seen that reference in the manual, but hadn't been able to implement it properly. Thank you for helping out.
Old 2014-11-27, 00:16   #6
EdH

"Ed Hall"
Dec 2009
Adirondack Mtns

110010111011₂ Posts

Miller-Rabin Only Valid to 31 Bits?

I appear to have a 32 (31) bit problem, but I can't run it down. I've written a GPU version of the Miller-Rabin algorithm that works fine at 10 digits, but doesn't find primes of 11 digits. I believe I have all necessary variables using int64_t, which should work at 11 digits. I will eventually be using this up to 16 digits, if possible.

Here is my code:
Code:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#define N 18

__constant__ int64_t dev_rand[100];

// modular multiplication by repeated doubling, so a full a*b product is never formed
__device__ int64_t mulmod( int64_t a, int64_t b, int64_t mod){
  int64_t x=0, y=a%mod;
  while (b>0){
    if ( b%2==1 ) { x=(x+y)%mod; }
    y=(y*2)%mod;
    b/=2;
  }
  return x%mod;
}

// modular exponentiation by repeated squaring
__device__ int64_t modulo(int64_t base, int64_t exponent, int64_t mod){
  int64_t x=1;
  int64_t y=base;
  while (exponent>0){
    if (exponent%2==1) { x=(x*y)%mod; }
    y=(y*y)%mod;
    exponent/=2;
  }
  return x%mod;
}

// one candidate per block: a[] holds the numbers to test, b[] holds the
// iteration count on input and 0 (composite) / 1 (probable prime) on output
__global__ void millerrabin( int64_t *a, int *b ) {
  int i, iteration, tid=blockIdx.x;
  int64_t s;
  if (tid < N){
    iteration=b[tid];
    if (a[tid]<2) { b[tid]=0; }
    else if (a[tid]!=2 && a[tid]%2==0) { b[tid]=0; }
    else{
      s=a[tid]-1;
      while (s%2==0) { s/=2; }
      for (i=0;i<iteration;i++){
        int64_t aa=dev_rand[i], temp=s;
        int64_t mod=modulo(aa, temp, a[tid]);
        while (temp!=a[tid]-1 && mod!=1 && mod!=a[tid]-1){
          mod=mulmod(mod, mod, a[tid]);
          temp*=2;
        }
        if (mod!=a[tid]-1 && temp%2==0) { b[tid]=0; }
      }
    }
    if (b[tid]!=0) { b[tid]=1; }
  }
}

int main( int argc, char *argv[] ) {
  int64_t inputnum, *dev_a, host_rand[100], a[N];
  int i, iteration, *dev_b, b[N], size64=sizeof(int64_t), size=sizeof(int);

  if (argc <= 2) { printf("usage: %s <number> <iterations>\n", argv[0]); return(1); }

  cudaError_t err = cudaSuccess;

  inputnum = strtoul(argv[1],0,10);
  iteration = atoi(argv[2]);
  if (iteration>100) { iteration=100; }

  for (i=0;i<100;i++) { host_rand[i]=rand()%(inputnum-1)+1; }

  cudaMemcpyToSymbol(dev_rand, host_rand, sizeof(host_rand), 0, cudaMemcpyHostToDevice);


  for (i=0;i<N;i++) { a[i]=inputnum+i; b[i]=iteration; }

  // allocate the memory on the GPU
  cudaMalloc(&dev_a, N * size64 );
  cudaMalloc(&dev_b, N * size );


  // copy the a and b arrays to the GPU
  cudaMemcpy( dev_a, a, N * size64, cudaMemcpyHostToDevice );
  cudaMemcpy( dev_b, b, N * size, cudaMemcpyHostToDevice );


  millerrabin<<<N,1>>>( dev_a, dev_b );

  err = cudaGetLastError();
  if (err != cudaSuccess)
    {
      printf("Failed to launch add kernel (error code: %s)!\n", cudaGetErrorString(err));
    }

  // copy the arrays back from the GPU to the CPU
  cudaMemcpy( a, dev_a, N * size64, cudaMemcpyDeviceToHost );
  cudaMemcpy( b, dev_b, N * size, cudaMemcpyDeviceToHost );

  // display the results
  for (i=0;i<N;i++){
    printf("%ld is ", a[i]);
    if (!b[i]) { printf("composite.\n"); }
    else { printf("prime.\n"); }
  }


//  printf("%lld/%lld time\n", c1, sz1);

  // free the memory allocated on the GPU
  cudaFree( dev_a );
  cudaFree( dev_b );

cudaDeviceReset();
return 0;
}
And, here are some example runs:
Code:
$ ./mrtest 1123456801 5
1123456801 is prime.
1123456802 is composite.
1123456803 is composite.
1123456804 is composite.
1123456805 is composite.
1123456806 is composite.
1123456807 is composite.
1123456808 is composite.
1123456809 is composite.
1123456810 is composite.
1123456811 is prime.
1123456812 is composite.
1123456813 is composite.
1123456814 is composite.
1123456815 is composite.
1123456816 is composite.
1123456817 is composite.
1123456818 is composite.
Code:
$ ./mrtest 11123456801 5
11123456801 is composite.
11123456802 is composite.
11123456803 is composite.
11123456804 is composite.
11123456805 is composite.
11123456806 is composite.
11123456807 is composite.
11123456808 is composite.
11123456809 is composite.
11123456810 is composite.
11123456811 is composite.
11123456812 is composite.
11123456813 is composite.
11123456814 is composite.
11123456815 is composite.
11123456816 is composite.
11123456817 is composite.
11123456818 is composite.
Code:
$ ./mrtest 11123456801 50
11123456801 is composite.
11123456802 is composite.
11123456803 is composite.
11123456804 is composite.
11123456805 is composite.
11123456806 is composite.
11123456807 is composite.
11123456808 is composite.
11123456809 is composite.
11123456810 is composite.
11123456811 is composite.
11123456812 is composite.
11123456813 is composite.
11123456814 is composite.
11123456815 is composite.
11123456816 is composite.
11123456817 is composite.
11123456818 is composite.
Is there something obvious that I'm missing?

Thanks for any assistance...
Old 2014-11-27, 02:57   #7
axn

Jun 2003

13×359 Posts

Quote:
Originally Posted by EdH
Code:
    if (exponent%2==1) { x=(x*y)%mod; }
    y=(y*y)%mod;
These two multiplications can overflow 64 bits if the modulus is > 2^32.
Old 2014-11-27, 03:47   #8
EdH

"Ed Hall"
Dec 2009
Adirondack Mtns

CBB₁₆ Posts

Thanks! I just got done converting all the % lines to an equivalent equation, with no change. Now I can look at the real issue.
Old 2014-11-27, 03:53   #9
axn

Jun 2003

123B₁₆ Posts

Quote:
Originally Posted by EdH
Thanks! I just got done converting all the % lines to an equivalent equation, with no change. Now I can look at the real issue.
Just to clarify, the issue is not with %. Rather, if x, y > 2^32 (which will happen when mod > 2^32), then x*y (and y*y) > 2^64, which will overflow int64 and therefore give a wrong answer.
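One straightforward fix (a sketch only, reusing the mulmod() already in your posted code) is to route both products in modulo() through mulmod() instead of multiplying directly:
Code:
// Sketch: overflow-safe modular exponentiation that reuses the posted mulmod().
// mulmod() only ever adds and doubles values below mod, so nothing exceeds
// 64 bits for mod up to about 2^62.
__device__ int64_t modulo(int64_t base, int64_t exponent, int64_t mod){
  int64_t x=1;
  int64_t y=base%mod;
  while (exponent>0){
    if (exponent%2==1) { x=mulmod(x, y, mod); }
    y=mulmod(y, y, mod);
    exponent/=2;
  }
  return x%mod;
}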
Old 2014-11-27, 04:07   #10
EdH

"Ed Hall"
Dec 2009
Adirondack Mtns

3,259 Posts

Quote:
Originally Posted by axn
Just to clarify, the issue is not with %. Rather, if x, y > 2^32 (which will happen when mod > 2^32), then x*y (and y*y) > 2^64, which will overflow int64 and therefore give a wrong answer.
Sorry, I wasn't clear enough in my reply. I did recognize that fully when you pointed it out. I was just commenting on how I had travelled down an unrelated path, due to seeing something online that seemed to imply that % was 32-bit in CUDA. Thank you. I "think" I'm on the right page and have stepped back to my version prior to the % change. Now to find a solution...

Last fiddled with by EdH on 2014-11-27 at 04:08
Old 2014-11-27, 07:38   #11
axn

Jun 2003

13×359 Posts

Quote:
Originally Posted by EdH
Sorry, I wasn't clear enough in my reply. I did recognize that fully when you pointed it out. I was just commenting on how I had travelled down an unrelated path, due to seeing something online that seemed to imply that % was 32-bit in CUDA. Thank you. I "think" I'm on the right page and have stepped back to my version prior to the % change. Now to find a solution...
Ah! A misreading on my part. But better safe than sorry, I say.