mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing
Old 2010-05-13, 15:35   #210
bsquared ("Ben")

Quote:
Originally Posted by TheJudger View Post
sieve_table_ could be aligned (I think). I really mean [1-4], but this can be changed, too.
...
If helpful for alignment: I could move the content of sieve_table_[0] to e.g. sieve_table_[8] and then it would be [0..3] and [4..7].
Having this aligned on a 16-byte boundary will make a big difference in performance. According to Agner Fog's instruction tables, MOVDQA has a fixed latency of 3 cycles and a reciprocal throughput of 1. MOVDQU (for unaligned memory) has a variable latency and a reciprocal throughput anywhere from 4 to 13 times worse. If I understand this correctly, you'll be using two MOVDQAs (or MOVDQUs): one to bring in sieve_table and another to write out ktab. Alignment should help a lot.
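As an aside, a static table can be forced onto a 16-byte boundary in gcc with an alignment attribute (a sketch; `sieve_table_demo` and `table_is_aligned` are illustrative names, not mfaktc code):

```c
#include <stdint.h>

/* Illustrative: force 16-byte alignment of a static table in gcc.
   (C11 _Alignas(16) would work as well.) */
static unsigned int sieve_table_demo[8] __attribute__((aligned(16)));

/* Quick check that the address really lands on a 16-byte boundary. */
int table_is_aligned(void)
{
    return ((uintptr_t)sieve_table_demo % 16) == 0;
}
```

With the table aligned like this, MOVDQA (and the aligned-load intrinsics) can legally touch it.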
Old 2010-05-13, 16:28   #211
TheJudger ("Oliver", Germany)

Hi Kevin,

Without the if it is still faster than the "old method", but slower than the current method which includes the if. Moving the if to the 3rd or 5th position is a little bit slower, too.

Having sieve_table_[0] in a separate array is at least worth a try!


Hi bsquared,
I'm pretty sure I understand the ideas behind SIMD, and I know why aligned memory accesses are better than unaligned ones. I can look at instruction tables and understand basically what the ops do, but I'm unable to write inline assembly code.
So help is needed.

Oliver
Old 2010-05-13, 17:35   #212
bsquared ("Ben")

Quote:
Originally Posted by TheJudger View Post

Hi bsquared,
I'm pretty sure I understand the ideas behind SIMD, and I know why aligned memory accesses are better than unaligned ones. I can look at instruction tables and understand basically what the ops do, but I'm unable to write inline assembly code.
Sorry, I didn't know what level of background knowledge to assume.

Quote:
Originally Posted by TheJudger View Post
So help is needed

Oliver
Well, this (after modification for alignment):
Code:
ktab[k  ]=ic+sieve_table_[0];
ktab[k+1]=ic+sieve_table_[1];
ktab[k+2]=ic+sieve_table_[2];
ktab[k+3]=ic+sieve_table_[3];
might become something like this in gcc (also assuming that ktab is 16-byte aligned):
Code:
__asm__ (
		 "pshufd 0x00, (%0), %%xmm1 \n\t" /* broadcast ic to xmm1 */
		 "paddd (%1), %%xmm1 \n\t" /* add 4 copies of ic to 4 entries of sieve_table */
		 "movdqa %%xmm1, (%2) \n\t" /* mov result into 4 locations of ktab */
		 :
		 : "r"(&ic), "r"(sieve_table_), "r"(&ktab[k])
		 : "xmm1", "memory");
Note also that paddd does not set a carry flag and wraps around on overflow, so this is only useful if you can guarantee that none of the adds will overflow. Since the original code doesn't account for overflow either, I assume this is the case.
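To make the wraparound concrete (a sketch using the SSE2 intrinsic `_mm_add_epi32`, which compiles to paddd; `paddd_wrap_demo` is an illustrative name): each 32-bit lane wraps modulo 2^32 independently, and no carry propagates between lanes:

```c
#include <emmintrin.h>
#include <stdint.h>

/* Each 32-bit lane of paddd wraps independently modulo 2^32; no flags set. */
uint32_t paddd_wrap_demo(void)
{
    __m128i a = _mm_set1_epi32((int)0xFFFFFFFFu); /* all 4 lanes = 2^32 - 1 */
    __m128i b = _mm_set1_epi32(2);
    __m128i c = _mm_add_epi32(a, b);              /* each lane wraps to 1 */
    uint32_t out[4];
    _mm_storeu_si128((__m128i *)out, c);
    return out[0];                                /* 1 */
}
```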

Last fiddled with by bsquared on 2010-05-13 at 17:42 Reason: 8 bit immediate...
Old 2010-05-13, 22:58   #213
TheJudger ("Oliver", Germany)

Hi bsquared,

thank you!

Code:
  __asm__ (
    "pshufd $0x00, (%0), %%xmm1\n\t" /* broadcast ic to xmm1 */
    "paddd (%1), %%xmm1\n\t" /* add 4 copies of ic to 4 entries of sieve_table */
    "movdqu %%xmm1, (%2)\n\t" /* mov result into 4 locations of ktab */
    :
    : "r"(&ic), "r"(sieve_table_), "r"(&ktab[k])
    : "xmm1", "memory");
Took me a few minutes to figure out that there was a missing $ sign.
And I must use movdqu because ktab[k] can't be aligned on a 16-byte boundary all the time.
I haven't integrated this into mfaktc yet. This was just a small test for myself to see how it works.
The wrap around of paddd is not a problem. The results are < 2^24 unless something went really wrong.
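As an aside, the same operation can also be written with SSE2 intrinsics (a sketch; `fill_ktab4` is an illustrative name, not mfaktc code). This sidesteps one subtlety of the asm version: pshufd with a memory operand performs a 16-byte load that legacy SSE requires to be 16-byte aligned, whereas `_mm_set1_epi32` broadcasts from a register:

```c
#include <emmintrin.h>

/* Illustrative equivalent of the inline asm: broadcast ic, add the four
   sieve_table entries, and store the four sums into ktab[k..k+3].
   Unaligned load/store intrinsics, since ktab[k] may not be aligned. */
static void fill_ktab4(unsigned int *ktab, unsigned int k, unsigned int ic,
                       const unsigned int *sieve_table)
{
    __m128i vic = _mm_set1_epi32((int)ic);
    __m128i vst = _mm_loadu_si128((const __m128i *)sieve_table);
    _mm_storeu_si128((__m128i *)&ktab[k], _mm_add_epi32(vic, vst));
}
```

The intrinsics version also lets gcc schedule the loads and stores itself instead of treating the block as opaque asm.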

Oliver

Last fiddled with by TheJudger on 2010-05-13 at 23:22
Old 2010-05-13, 23:56   #214
TheJudger ("Oliver", Germany)

OK, a quick test in the mfaktc code:

- moving sieve_table_[0] into a separate array: no effect on runtime
- using the asm code posted before instead of
Code:
ktab[k  ]=ic+sieve_table_[0];
ktab[k+1]=ic+sieve_table_[1];
ktab[k+2]=ic+sieve_table_[2];
ktab[k+3]=ic+sieve_table_[3];
reached only ~87% of the C code's performance.

Oliver
Old 2010-05-14, 00:04   #215
kjaget

Quote:
Originally Posted by TheJudger View Post
Hi Kevin,

Without the if it is still faster than the "old method", but slower than the current method which includes the if. Moving the if to the 3rd or 5th position is a little bit slower, too.

Having sieve_table_[0] in a separate array is at least worth a try!
Oh, my comments were about what happens after you've converted to SIMD code. I doubt they'd make a difference in the C code (unless the compiler is somehow auto-vectorizing it for you, which is rare but possible).

More thoughts tomorrow when I get a chance to look at the ASM code.

Kevin
Old 2010-05-14, 03:35   #216
bsquared ("Ben")

Quote:
Originally Posted by TheJudger View Post

Took me a few minutes to figure out that there was a missing $ sign.
Sorry about that! I haven't tried to build your code at all, so that ASM snippet was untested when I posted it.
Old 2010-05-14, 03:42   #217
bsquared ("Ben")

Quote:
Originally Posted by TheJudger View Post
OK, a quick test in the mfaktc code:

- moving sieve_table_[0] into a separate array: no effect on runtime
- using the asm code posted before instead of
Code:
ktab[k  ]=ic+sieve_table_[0];
ktab[k+1]=ic+sieve_table_[1];
ktab[k+2]=ic+sieve_table_[2];
ktab[k+3]=ic+sieve_table_[3];
reached only ~87% of the C code's performance.

Oliver
This is maybe not too surprising, considering that the C code's independent adds can execute three per clock and the compiler no doubt interleaves the reads and writes to hide latency, whereas the three SSE2 instructions form a single dependency chain. Also, the variable latency of the last instruction in that chain likely disrupts the pipelining of subsequent writes.

This is where movdqa might really help, if ktab could somehow be made 16-byte aligned.
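If ktab is heap-allocated, one way to guarantee the alignment (a sketch; how mfaktc actually allocates ktab is not shown in this thread, and `alloc_aligned_u32` is an illustrative name) is posix_memalign. With the base 16-byte aligned, &buf[k] stays aligned whenever k is a multiple of 4:

```c
#include <stdlib.h>
#include <stdint.h>

/* Sketch: allocate n unsigned ints on a 16-byte boundary, so that
   &buf[k] is 16-byte aligned whenever k is a multiple of 4. */
unsigned int *alloc_aligned_u32(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 16, n * sizeof(unsigned int)) != 0)
        return NULL; /* allocation failed */
    return (unsigned int *)p;
}
```

A buffer obtained this way could then be written with movdqa (or `_mm_store_si128`) instead of the slower unaligned store.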
Old 2010-05-16, 02:36   #218
frmky (So Cal)

I tried this code (v 0.06-hack-2). I'm using 64-bit Linux with gcc 4.3.2 and CUDA 3.0 on a 2.4GHz Core 2 Quad with a GTX 480. With
./mfaktc.exe 3321932839 66 71
I'm only getting about 38 million/sec. However, running 3 copies simultaneously gives about 31 million/sec each, totaling about 93 million/sec. Here's a snippet of the output from one of the 3 copies. BTW, a minor cosmetic bug: the GTX 480 has 32 cores per multiprocessor, not 8, so the number of "shader cores" is incorrect.

Code:
[cluster@node01 0.06]$ ./mfaktc.exe 3321932839 66 71
mfaktc v0.06
Compiletime Options
  THREADS_PER_GRID    983040
  THREADS_PER_BLOCK   256
  SIEVE_SIZE_LIMIT    32kiB
  SIEVE_SIZE          230945bits
  USE_PINNED_MEMORY   enabled
  USE_ASYNC_COPY      enabled
  VERBOSE_TIMING      disabled
  SELFTEST            disabled
  MORE_CLASSES        disabled

Runtime Options
  SievePrimes         25000
  SievePrimesAdjust   2
  CudaStreams         2

CUDA device info
  name:                      GeForce GTX 480
  compute capabilities:      2.0
  maximum threads per block: 1024
  number of multiprocessors: 15 (120 shader cores)
  clock rate:                1401MHz

tf(3321932839, 66, 71);
 k_min = 11106030600
 k_max = 355392982921
class    0: tested 150405120 candidates in 5730ms (26248712/sec) (avg. wait: 11849usec)
class    5: tested 150405120 candidates in 5704ms (26368359/sec) (avg. wait: 11821usec)
sp = 52500, min = 5739
sp = 26250, min = 0, max = 100000, prev = 52500 5739
class    9: tested 160235520 candidates in 5835ms (27461100/sec) (avg. wait: 15431usec)
class   12: tested 160235520 candidates in 5267ms (30422540/sec) (avg. wait: 11899usec)
sp = 26250, min = 5285
sp = 13125, min = 0, max = 52500, prev = 26250 5285
class   20: tested 170065920 candidates in 5552ms (30631469/sec) (avg. wait: 14334usec)
class   21: tested 170065920 candidates in 5434ms (31296635/sec) (avg. wait: 13653usec)
sp = 13125, min = 5442
sp = 32812, min = 13125, max = 52500, prev = 26250 0
class   29: tested 157286400 candidates in 5386ms (29202822/sec) (avg. wait: 11929usec)
class   32: tested 157286400 candidates in 5376ms (29257142/sec) (avg. wait: 11907usec)
sp = 32812, min = 5397
sp = 22968, min = 13125, max = 52500, prev = 32812 5397
class   36: tested 162201600 candidates in 5448ms (29772687/sec) (avg. wait: 13131usec)
class   41: tested 162201600 candidates in 5260ms (30836806/sec) (avg. wait: 12016usec)
sp = 22968, min = 5275
sp = 18046, min = 13125, max = 32812, prev = 22968 5275
class   44: tested 165150720 candidates in 5323ms (31025872/sec) (avg. wait: 12653usec)
class   56: tested 165150720 candidates in 5277ms (31296327/sec) (avg. wait: 12394usec)
sp = 18046, min = 5289
sp = 25429, min = 18046, max = 32812, prev = 22968 0
class   57: tested 160235520 candidates in 5242ms (30567630/sec) (avg. wait: 11922usec)
class   60: tested 160235520 candidates in 5243ms (30561800/sec) (avg. wait: 11908usec)
sp = 25429, min = 5258
sp = 21737, min = 18046, max = 32812, prev = 25429 5258
class   65: tested 162201600 candidates in 5284ms (30696744/sec) (avg. wait: 12329usec)
class   69: tested 162201600 candidates in 5217ms (31090971/sec) (avg. wait: 12030usec)
sp = 21737, min = 5231
sp = 19891, min = 18046, max = 25429, prev = 21737 5231
class   72: tested 164167680 candidates in 5276ms (31115936/sec) (avg. wait: 12254usec)
class   77: tested 164167680 candidates in 5276ms (31115936/sec) (avg. wait: 12195usec)
sp = 19891, min = 5289
sp = 22660, min = 19891, max = 25429, prev = 21737 0
Old 2010-05-16, 14:37   #219
TheJudger ("Oliver", Germany)

Hi frmky,

Quote:
Originally Posted by frmky View Post
I tried this code (v 0.06-hack-2). I'm using 64-bit Linux with gcc 4.3.2 and CUDA 3.0 on a 2.4GHz Core 2 Quad with a GTX 480. With
./mfaktc.exe 3321932839 66 71
I'm only getting about 38 million/sec. However, running 3 copies simultaneously gives about 31 million/sec each, totaling about 93 million/sec. Here's a snippet of the output from one of the 3 copies. BTW, a minor cosmetic bug: the GTX 480 has 32 cores per multiprocessor, not 8, so the number of "shader cores" is incorrect.
OK, with 3 copies the GPU performance is as expected; I have (limited) access to a GTX 480, too. Your performance with a single copy seems too low (I have no idea why).
Btw.: did you modify the compile script to enable sm_20 code? I've noticed that when the GPU part is compiled for compute capability 1.x, the GF100 chip just ignores the code and does nothing! I know that you're familiar with CUDA, so I think you've already adjusted this.
Just in case: one possibility is to add "-arch=sm_20" to the nvcc command line in the compile script. (My todo list contains a check for this problem...)
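For example (a sketch; the actual compile script and file names in mfaktc are assumptions here), the nvcc invocation might gain the flag like this:

```shell
# Illustrative only: add sm_20 code generation to the nvcc line in the
# compile script (actual file names depend on the mfaktc source tree).
nvcc -O2 -arch=sm_20 -c mfaktc.cu -o mfaktc.o
```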

The number of shader cores is already fixed in the (unfinished, unreleased) 0.07 version.

If you want to spend some time: rerun with more CudaStreams (3, 4 or 5), and perhaps with SievePrimesAdjust=1.

I never ran Kevin's modified versions of 0.06 myself (they didn't compile on my system out of the box). Of course I've read his modifications and implemented some of his stuff in 0.07.

Oliver
Old 2010-05-16, 17:42   #220
frmky (So Cal)

I did not need to enable sm_20 code. It ran fine with the default sm_10 code and found appropriate factors. (The GF100 cannot run binary code generated by CUDA 2.x, but it can run PTX code from 2.x, or binary code from CUDA 3.0, for any sm.) I did a little playing around. Adjusting CudaStreams above 2 had little effect on the speed. SievePrimesAdjust=2 was significantly faster than 1, but both were much slower than expected, so I'm not sure what is happening there.