mersenneforum.org  

mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Old 2009-12-24, 02:04   #23
jasonp

Yes, there are patches to Nvidia's version of the Open64 compiler that allow high-half multiplies and also the add and subtract instructions that generate and consume carry bits. I wasn't able to build the compiler using either MinGW or Cygwin, even though the source tree specifically has build directories for those. Given the deafening silence on the Nvidia forums, probably nobody else is able to do it either :)

Have you tried building the compiler in Linux? If all you do is generate the PTX code but not run it, then you don't need any of the other GPU infrastructure on the machine...

Old 2009-12-24, 10:19   #24
xilman

Quote:
Originally Posted by TheJudger
The "trick" is to hack the ptx code (lets say it is like assembler code on CPU) and replace one instruction. The nvidia compiler has no intrinsic for [u]mul24hi while it exists in the ptx code. (24x24 multiply is faster as mentioned before)

Bad news #1:
The "ptx hack" is ugly!!!
I have to check some compilers...
There is a patch to enable some more intrinsics but I was not able to build the compiler. :(
Could you post details of your __[u]mul24hi() trick please? If you prefer, PM or email will be just as good but posting here will aid other CUDA programmers too.

I've seen the nvidia forum postings and the alleged patches to nvcc, but I've never managed to get them working either. It would be very useful in some code I'm writing which, at present, has to use __umulhi() and nasty shifting and masking.
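For what it's worth, the shifting-and-masking workaround can be sketched in plain C (these are stand-in functions written for illustration, not the CUDA intrinsics themselves): for operands below 2^24 the 48-bit product's bits 16..47 can be assembled from the low and high 32-bit halves of the product.

```c
#include <stdint.h>

/* Plain-C stand-in for the CUDA __umulhi() intrinsic: high 32 bits of
   the full 32x32-bit product. */
uint32_t umulhi_stub(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Emulate the mul24.hi result (bits 16..47 of the 24x24-bit product)
   using only 32-bit low parts plus a "high half" multiply. */
uint32_t mul24hi_emulated(uint32_t a, uint32_t b)
{
    a &= 0xFFFFFF;                     /* truncate to 24 bits, as mul24 does */
    b &= 0xFFFFFF;
    uint32_t lo = a * b;               /* low 32 bits of the 48-bit product */
    uint32_t hi = umulhi_stub(a, b);   /* high bits (at most 16 are set) */
    return (hi << 16) | (lo >> 16);    /* bits 16..47 */
}
```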


Thanks,
Paul

Old 2009-12-24, 14:38   #25
TheJudger

Quote:
Originally Posted by axn
I am assuming you have a prelim sieve to get the TF candidates. Why not just lower the sieve limit for that one?

In fact, the ideal scenario would involve the program doing a benchmark during runtime to pick the optimal sieve limit.
This won't maximize the throughput of the machine.
And for the next generation GPU "Fermi" I might have to force the sieve to sieve only up to 17 or so :(

Old 2009-12-24, 15:00   #26
TheJudger

Jason/Paul: did you check other CUDA compilers?
E.g. PGI advertises their compiler as CUDA-capable.

Jason: yep, I tried on Linux and failed. (Actually I'm developing my code under Linux.)

Paul: sure, here we go (hopefully you're familiar with bash):

My code contained only a single __umulhi(). Since the device functions are always inlined, it appears several times in the ptx code.

Step #1 (just for safety):
- comment out that __umulhi()
- add "--keep" to the nvcc command line (this will generate alot of files, do it in a seperate subdirectory or so) and compile the code
- check if there is no "mul.hi.u32" in the ptx code
- comment in that __umulhi()

Step #2:
- add "--dry-run" to the nvcc command line and compile the code (actually it won't compile). This shows you the commands issued by nvcc. Write down the commands issued after the ptx-code generation

Step #3:
- compile your code with the --keep option again
- modify the ptx file (search & replace mul.hi.u32 with mul24.hi.u32)
- run the commands which you have written down in step #2

My script for compiling without the ptx hack:
Code:
#!/bin/bash -xe

rm -f sieve.o main.o main.exe

gcc -Wall -O2 -c sieve.c -o sieve.o
nvcc -c main.cu -o main.o -I /opt/cuda/include/ --ptxas-options=-v

gcc -fPIC -o main.exe sieve.o main.o -L/opt/cuda/lib64/ -lcudart
and now with the ptx hack:
Code:
#!/bin/bash

mkdir compile_bla_bla
cd compile_bla_bla

gcc -Wall -O2 -c ../sieve.c -o sieve.o
nvcc -c ../main.cu -o main.o -I /opt/cuda/include/ --ptxas-options=-v --keep

sed 's/mul\.hi\.u32/mul24\.hi\.u32/g' main.ptx > main.ptx.new
mv main.ptx main.ptx.old
mv main.ptx.new main.ptx

rm -f main.sm_10.cubin main.cu.cpp main.o

ptxas --key="xxxxxxxxxx"  -arch=sm_10 -v  "main.ptx"  -o "main.sm_10.cubin"
fatbin --key="xxxxxxxxxx" --source-name="../main.cu" --usage-mode="-v  " --embedded-fatbin="main.fatbin.c" "--image=profile=sm_10,file=main.sm_10.cubin" "--image=profile=compute_10,file=main.ptx"
cudafe++ --m64 --gnu_version=40302 --diag_error=host_device_limited_call --diag_error=ms_asm_decl_not_allowed --parse_templates  --gen_c_file_name "main.cudafe1.cpp" --stub_file_name "main.cudafe1.stub.c" --stub_header_file_name "main.cudafe1.stub.h" "main.cpp1.ii"
gcc -D__CUDA_ARCH__=100 -E -x c++ -DCUDA_NO_SM_12_ATOMIC_INTRINSICS -DCUDA_NO_SM_13_DOUBLE_INTRINSICS -DCUDA_FLOAT_MATH_FUNCTIONS -DCUDA_NO_SM_11_ATOMIC_INTRINSICS  "-I/opt/cuda/bin/../include" "-I/opt/cuda/bin/../include/cudart"   -I. -I"/opt/cuda/include/" -m64 -o "main.cu.cpp" "main.cudafe1.cpp"
gcc -c -x c++ "-I/opt/cuda/bin/../include" "-I/opt/cuda/bin/../include/cudart"   -I. -I"/opt/cuda/include/" -m64 -o "main.o" "main.cu.cpp"

gcc -fPIC -o ../main.exe sieve.o main.o -L/opt/cuda/lib64/ -lcudart

cd ..
rm compile_bla_bla -rf
If you want to replace only some of your __[u]mulhi() calls with __[u]mul24hi() then it will be a bit more complicated. :(
And of course the build script uses some system-specific paths...

Before you replace the instruction, remember the different behaviour:
__[u]mulhi() returns bits 32 to 63 of the product, while __[u]mul24hi() returns bits 16 to 47.
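In plain C the two bit ranges look like this (models written for illustration only, not the GPU instructions themselves):

```c
#include <stdint.h>

/* __umulhi(a,b)   -> bits 32..63 of the full 32x32-bit product */
uint32_t umulhi_model(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* __umul24hi(a,b) -> bits 16..47 of the 24x24-bit product of the
   low 24 bits of each operand */
uint32_t umul24hi_model(uint32_t a, uint32_t b)
{
    uint64_t p = (uint64_t)(a & 0xFFFFFF) * (b & 0xFFFFFF);
    return (uint32_t)(p >> 16);
}
```

So the search & replace is only safe where the surrounding code already accounts for the different shift.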

Last fiddled with by TheJudger on 2009-12-24 at 15:09

Old 2009-12-24, 15:09   #27
xilman

Quote:
Originally Posted by TheJudger
Bad news #2:
My siever is too slow. Without the latest optimisation a single core of a Core 2 running at 3GHz was sufficient to feed the GPU (GTX 275) with new factor candidates to test. Now it is too slow as the GPU code is faster now.
I have to think about possibilities:
(1) speed up the siever by writing better code (I'm not sure if I can do this).
If "Fermi" is only twice as fast as the GT200 chip (due to the fact that it has roughly double the number of shaders) and has no other improvements, I need to speed up the siever by another factor of 2.
(2) write a multithreaded siever. I think I can do this but I'm not really happy with this solution.
(3) put the siever on the GPU. I'm not sure if this might work...
(4) newer GPUs are capable of running several "kernels" at the same time. With some modifications to the code it should be possible to have several instances of the application running at the same time. If the GPU is too fast for one CPU core, just start another test on a different exponent on a 2nd core, ...

Personally I prefer (4)
Any comments?
I'd try a variant of (3) as follows:

Write a siever in CUDA and run it on the GPU (no TF, in other words) until you have at least a few hundred megabytes of sieved results. You would use compact storage, obviously. Something like this would work: all factors are of the form 2kp+1, so store only the k values, and store them only as deltas from the previous value. The factors also fall into 16 "obvious" residue classes (something exploited by prime95 since the very early days), so store 16 such lists of deltas. I don't know whether an unsigned char is enough to store the deltas, but an unsigned short surely would be. The results would be stored on disk or in CPU RAM as appropriate. Then you can feed the results into a separate TF kernel in a separate CPU thread.
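A minimal C sketch of that delta storage idea (the class_of() function and the buffer sizes are placeholders for illustration, not the real 16-class arithmetic, which depends on p):

```c
#include <stdint.h>
#include <stddef.h>

#define NCLASSES 16

/* Placeholder classification -- the real "obvious" residue classes
   are determined by the exponent p, not by k mod 16. */
unsigned class_of(uint64_t k) { return (unsigned)(k % NCLASSES); }

/* Per-class delta lists: each surviving k is stored as the 16-bit
   difference from the previous surviving k in the same class. */
typedef struct {
    uint64_t last_k[NCLASSES];       /* previous k stored per class */
    uint16_t deltas[NCLASSES][1024]; /* illustrative fixed capacity  */
    size_t   count[NCLASSES];
} delta_store;

/* Returns 0 if the delta overflows 16 bits (a wider type would then
   be needed, as discussed above). */
int store_k(delta_store *s, uint64_t k)
{
    unsigned c = class_of(k);
    uint64_t d = k - s->last_k[c];
    if (d > 0xFFFF)
        return 0;
    s->deltas[c][s->count[c]++] = (uint16_t)d;
    s->last_k[c] = k;
    return 1;
}
```

Decoding is just a running sum of the deltas per class, so the TF kernel can reconstruct the k values sequentially.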


Paul

Old 2009-12-24, 15:22   #28
ldesnogu

Quote:
Originally Posted by TheJudger
My siever is too slow. Without the latest optimisation a single core of a Core 2 running at 3GHz was sufficient to feed the GPU (GTX 275) with new factor candidates to test. Now it is too slow as the GPU code is faster now.
I have to think about possibilities:
(1) speed up the siever by writing better code (I'm not sure if I can do this).
If "Fermi" is only twice as fast as the GT200 chip (due to the fact that it has roughly double the number of shaders) and has no other improvements, I need to speed up the siever by another factor of 2.
Ernst has some code for trial factoring in Mlucas, in factor.c. You could perhaps steal some ideas?

EDIT: A link could help: http://hogranch.com/mayer/README.html

Last fiddled with by ldesnogu on 2009-12-24 at 15:22

Old 2009-12-24, 15:23   #29
TheJudger

Except that I do the sieving on the CPU, that's more or less the way I'm doing it.

I generate a list of 2^20 k's (in the same class) at once and transfer them to the GPU.
The k's are stored as uint32 k_ref_hi, uint32 k_ref_lo and uint32 *k_delta. The deltas are relative to k_ref, NOT to the previous delta (think parallel ;))

For 2^20 k's, sieving the first 4000 odd primes, the k_delta grows above 300,000,000, so a short surely doesn't fit.
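A plain-C sketch of that layout (illustrative only): since every delta is relative to the same 64-bit base, each GPU thread can reconstruct its candidate independently of its neighbours.

```c
#include <stdint.h>

/* Reconstruct a candidate k from the 64-bit base (split into two
   32-bit halves) plus a per-candidate 32-bit delta relative to that
   base -- NOT to the previous delta, so threads stay independent. */
uint64_t k_from_delta(uint32_t k_ref_hi, uint32_t k_ref_lo,
                      uint32_t k_delta)
{
    return ((((uint64_t)k_ref_hi << 32) | k_ref_lo)) + k_delta;
}
```

A delta above 300,000,000 still fits comfortably in a uint32, which is why the 32-bit delta works where a 16-bit short does not.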

This sieve is segmented and so small that it fits into the L1-cache of the CPU.


My feeling is that sieving doesn't fit well on CUDA.

Old 2010-01-02, 00:06   #30
TheJudger

Hi and a happy new year!

GTX 275, Core 2 Duo overclocked to 4GHz, sieving up to 37831 (4000th odd prime):

One process on the CPU:
M66362159 TF from 2^64 to 2^65: 180s
(siever still too slow)

Two processes on the CPU at the same time
M66362159 TF from 2^64 to 2^65: 279s
M66362189 TF from 2^64 to 2^65: 279s
(using two CPU cores easily keeps the GPU busy all the time. A 2.66GHz Core 2 Duo should be fine for a GTX 275 (with the current code))

There are some compile-time options which run fine on newer GPUs but won't work on older ones (e.g. asynchronous memory transfers aren't supported on G80 chips). Therefore I need to add some checks to the code for whether the current GPU is capable or not.

I had a horribly stupid "bug" during my attempt to make the code capable of running multiple CPU processes concurrently on one GPU:
CUDA source files are usually named "*.cu". My favorite editor doesn't know ".cu" files (for syntax highlighting), so I was lazy and made symlinks "*.c" -> "*.cu".
This worked fine until I copied the files with scp (secure copy, openssh); after that the "*.c" files were real copies of the "*.cu" files. So I was editing the "*.c" files and compiling the "*.cu" files.....
It took some time to figure out why code changes didn't make any difference...

Old 2010-01-02, 00:33   #31
TheJudger

Quote:
Originally Posted by axn
I am assuming you have a prelim sieve to get the TF candidates. Why not just lower the sieve limit for that one?

In fact, the ideal scenario would involve the program doing a benchmark during runtime to pick the optimal sieve limit.
Some more details on this topic:

Generating a candidate list (2^20 candidates)
Code:
Sieve limit                | time | number of raw candidates
31 (10th odd prime)        |  8ms | ~1.83M
547 (100th odd prime)      | 16ms | ~3.16M
7927 (1000th odd prime)    | 21ms | ~4.48M
104743 (10000th odd prime) | 28ms | ~5.77M
(hopefully no typos in this list...)

"Raw candidates" is the number of candidates, before sieving, needed to generate a list of 2^20 candidates after sieving (on average).
As you can see, the runtime of the siever depends mostly on the number of raw candidates.

Generating a list of 2^20 candidates with sieving up to 104743 takes ~3.5 times longer than generating a list of the same size with sieving up to 31. BUT compare the number of raw candidates: sieving up to 104743 covers 5.77/1.83 = 3.15 times as many raw candidates!

So lowering the sieve limit will help to keep the GPU busy, BUT it won't increase the throughput much. It will just generate more heat on the GPU and burn electricity.

Last fiddled with by TheJudger on 2010-01-02 at 00:34

Old 2010-01-02, 23:59   #32
msft

Happy new year, TheJudger!
Quote:
Originally Posted by TheJudger
(using two CPU cores easily keeps the GPU busy all the time. A 2.66GHz Core 2 Duo should be fine for a GTX 275 (with the current code))
Why use the CPU?
Is something slow on the GPU?

Old 2010-01-03, 01:43   #33
TheJudger

Hello msft,

I'm sceptical about sieving on GPUs. I would be happy if somebody proves me wrong on that topic. ;)

- sieving in one big array with all GPU threads (each GPU thread handles different small primes)
Problem: this needs global memory for the array. When using one char per candidate I need only writes. When using one bit per candidate I need reads and writes (on global memory!). This should be very slow.

- each thread sieves a small segment
Problem: I need to calculate starting offsets much more often.
For this I use a modified euclidean algorithm which needs ifs. Different code paths will break parallel execution as far as I know. :(
Problem: the load might get imbalanced between threads if the segments are too small. If the segments are bigger (to minimize the effect of imbalanced threads) the kernels will have LONG runtimes on the GPU, which stalls the GUI (if there is one).
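For reference, a plain-C sketch of that start-offset computation (illustrative, not TheJudger's actual code): the sieve prime q divides a candidate 2kp+1 exactly when k ≡ -(2p)^-1 (mod q), so every segment needs one modular inverse per sieve prime, here via the branchy extended Euclid that parallelizes badly.

```c
#include <stdint.h>

/* Modular inverse of a mod q via extended Euclid; assumes gcd(a,q)=1.
   The data-dependent loop and sign fixup are exactly the kind of
   branchy code that hurts on the GPU. */
uint32_t inv_mod(uint32_t a, uint32_t q)
{
    int64_t t = 0, newt = 1, r = q, newr = a % q;
    while (newr != 0) {
        int64_t quot = r / newr, tmp;
        tmp = t - quot * newt; t = newt; newt = tmp;
        tmp = r - quot * newr; r = newr; newr = tmp;
    }
    return (uint32_t)(t < 0 ? t + q : t);
}

/* Smallest k (mod q) such that q divides 2kp+1, for exponent p and
   sieve prime q: k = -(2p)^-1 mod q. */
uint32_t sieve_start(uint32_t p, uint32_t q)
{
    uint32_t inv = inv_mod((uint32_t)(((uint64_t)2 * p) % q), q);
    return (uint32_t)(((uint64_t)q - inv) % q);
}
```

From this start value the sieve then just strides through k, k+q, k+2q, ... within the segment.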

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.