#1 |
Bamboozled!
May 2003
Down not across
32×1,303 Posts |
Any CUDA people out there who may be able to explain why my GPU code doesn't seem to be able to write to global memory?
Code:
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #include <math.h>
    #include <cutil_inline.h>

    __global__ void testKernel(unsigned char *output)
    {
        int k;
        const unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;

        for (k=0; k < 8; k++) output[16*tid+k] = 42;
        for (k=0; k < 8; k++) output[16*tid+k+8] = 66;
    }

    #define BLOCK_SIZE 2
    #define THREADS_PER_BLOCK 2
    #define NUM_THREADS (BLOCK_SIZE * THREADS_PER_BLOCK)
    #define MEM_SIZE (NUM_THREADS * 8)

    void run_test (int argc, char** argv)
    {
        unsigned char *h_output, *d_output; /* Host and device memory for output */
        int iter;
        dim3 grid (BLOCK_SIZE, 1, 1);       /* setup execution parameters */
        dim3 threads (THREADS_PER_BLOCK, 1, 1);

        if (cutCheckCmdLineFlag(argc, (const char**)argv, "device"))
            cutilDeviceInit(argc, argv);
        else
            cudaSetDevice (cutGetMaxGflopsDeviceId());

        /* Allocate host memory for output */
        h_output = (unsigned char *) calloc (1, 2 * MEM_SIZE);

        /* Allocate device memory for output */
        cutilSafeCall (cudaMalloc ((void**) &d_output, 2 * MEM_SIZE));

        /* Execute the kernel */
        testKernel <<< grid, threads >>> (d_output);

        /* Check if kernel execution generated an error */
        cutilCheckMsg ("Kernel execution failed");
        cutilSafeCall (cudaThreadSynchronize()); /* Wait for threads to complete. */

        /* Copy results from device to host memory */
        cutilSafeCall (cudaMemcpy (h_output, d_output, 2 & MEM_SIZE, cudaMemcpyDeviceToHost));
        cutilSafeCall (cudaThreadSynchronize()); /* Wait for threads to complete. */

        for (iter = 0; iter < 2*MEM_SIZE; iter++)
        {
            printf ("%d %02x\n", iter, h_output[iter]);
        }

        free (h_output);
        cutilSafeCall (cudaFree (d_output));
        cudaThreadExit ();
    }

    int main (int argc, char** argv)
    {
        run_test (argc, argv);
        cutilExit(argc, argv);
    }

This test case was developed from the SDK template project, then stripped down pretty much to bare bones. It allocates global memory on host and device, calls a kernel to write constant non-zero bytes, then prints what's happened, if anything. On my system it invariably prints zeros.
A kernel which does significant computation takes significant time to run, which implies that the kernel is being called, yet it still doesn't appear to write to global memory. Since everything else works and only my code fails, I suspect a conceptual error on my part rather than a system bug.

Paul
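One detail in the listing may be worth checking before suspecting the kernel: the size argument of the `cudaMemcpy` call is written as `2 & MEM_SIZE` (bitwise AND), whereas the `calloc` and `cudaMalloc` calls use `2 * MEM_SIZE`. With `MEM_SIZE` equal to 32, the AND evaluates to 0, so no bytes would be copied back and the host buffer would keep the zeros that `calloc` put there, whatever the kernel wrote. The host-only sketch below reuses the macros from the post to show the two values; the helper names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Same sizing macros as the posted test case. */
#define BLOCK_SIZE 2
#define THREADS_PER_BLOCK 2
#define NUM_THREADS (BLOCK_SIZE * THREADS_PER_BLOCK)
#define MEM_SIZE (NUM_THREADS * 8)

/* Hypothetical helper: byte count as written in the posted cudaMemcpy call. */
static size_t copy_size_as_posted(void)
{
    return 2 & MEM_SIZE;    /* bitwise AND: 2 & 32 == 0 */
}

/* Hypothetical helper: byte count matching the calloc/cudaMalloc sizes. */
static size_t copy_size_matching_alloc(void)
{
    return 2 * MEM_SIZE;    /* multiplication: 2 * 32 == 64 */
}

/* Hypothetical helper: checks both values against the arithmetic above. */
static void check_copy_sizes(void)
{
    assert(MEM_SIZE == 32);
    assert(copy_size_as_posted() == 0);        /* nothing would be copied */
    assert(copy_size_matching_alloc() == 64);  /* 4 threads * 16 bytes each */
}
```

If this is indeed the cause, changing the `cudaMemcpy` size to `2 * MEM_SIZE` should make the printed bytes show `2a` and `42` (the hex forms of 42 and 66) in alternating runs of eight.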