mersenneforum.org  

Old 2012-02-12, 01:06   #1596
Batalov
 
 
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

36·13 Posts

Quote:
Originally Posted by TheJudger View Post
For sieve.c I've used gcc 4.6. Result: 20% faster sieving (SievePrimes=5000). This needs further testing but it looks promising.
Yes, no code changes, just an updated compiler!

Oliver
Nice! This also means that (CPU-specific) assembly-code fragments would likely do even better still (though that is surely quite some work).

I also run openSUSE 12.1. Just last night, when I tried the GPU GMP-ECM code, I had to "hack" the CUDA includes to let the 4.6 compiler do the job instead of bailing out (which it does by default).
Old 2012-02-12, 14:33   #1597
kjaget
 
 
Jun 2005

3×43 Posts

Quote:
Originally Posted by TheJudger View Post
Yes, no code changes, just an updated compiler!

Oliver
That's really interesting. On the Windows side, I spent some time porting the code to build with Intel's C compiler, which is generally a much better optimizing compiler than MSVC, but I saw no difference. Granted, I wasn't building the .cu files with it, just the .c files, which I then combined with the nvcc/MSVC-compiled .cu files, but that should have picked up any improvements it could find in sieve.c.

It could be a lot of things: MSVC isn't as bad as I thought, the older GCC was particularly bad, or (since I'm not building for AVX-enabled targets) there's something specific in the sieve code that works well with AVX but not SSE, among other possibilities.

Good news regardless, though.

Any idea how 20% faster sieving translates into run time improvements?
Old 2012-02-12, 18:43   #1598
nucleon
 
 
Mar 2003
Melbourne

5·103 Posts

My question is: does the current Windows code have these improvements? If not, can I get hold of it? I have a machine where the GPU% is hovering around 85-90% with SievePrimes at 5000. I'm CPU-limited on my farm at the moment.

Quote:
Originally Posted by kjaget View Post
Any idea how 20% faster sieving translates into run time improvements?
Two outcomes I can think of (my guess):

1) If your GPU% is running at, say, 80% (because your CPU is maxed out), then a 20% sieve-code improvement would boost the GPU% by roughly 20%, so one would expect GPU% to rise to about 96%, giving an overall throughput improvement of approximately 20%.

2) If your GPU% is already close to 99%, and SievePrimes on your mfaktc instances is, say, 'x', then the improvement would allow you to raise the SievePrimes value above 'x'. The actual throughput gain is anyone's guess, but it won't be higher than 20% and is likely to be noticeably less.

The usual disclaimer applies: your mileage may vary.

-- Craig
Old 2012-02-14, 04:16   #1599
Ethan (EO)
 
 
"Ethan O'Connor"
Oct 2002
GIMPS since Jan 1996

22·23 Posts

A micro-optimization that yields an extra 1% throughput in the single-instance case on my machine (a GTX 470 fed to about 50% utilization by one instance):

Beginning at line 124 of tf_common.cu in 0.18, change
Code:
/* set result array to 0 */  
  for(i=0;i<32;i++)mystuff->h_RES[i]=0;
  cudaMemcpy(mystuff->d_RES, mystuff->h_RES, 32*sizeof(int), cudaMemcpyHostToDevice);
to

Code:
/* set result array to 0 */
  cudaMemsetAsync(mystuff->d_RES, 0, 32*sizeof(int));
  for(i=0;i<32;i++)mystuff->h_RES[i]=0;
No improvement for multi-instance cases; -st2 passed. Using memset() on h_RES instead of the loop is slower.

-Ethan
Old 2012-02-14, 12:34   #1600
TheJudger
 
 
"Oliver"
Mar 2005
Germany

11×101 Posts

Ethan,

1% more throughput sounds unreasonably high; this code is executed only once per class. How long was your test case?
Did you mean cudaMemset() or cudaMemsetAsync()? The async version would need the stream ID as an extra parameter and might be unsafe.

Oliver
Old 2012-02-14, 22:38   #1601
Ethan (EO)
 
 
"Ethan O'Connor"
Oct 2002
GIMPS since Jan 1996

5C16 Posts

Quote:
Originally Posted by TheJudger View Post
Ethan,

1% more throughput sounds unreasonably high; this code is executed only once per class. How long was your test case?
Did you mean cudaMemset() or cudaMemsetAsync()? The async version would need the stream ID as an extra parameter and might be unsafe.

Oliver
Hi Oliver -- maybe more like 0.8%; I see 183M/s with this change, compared to 181.5M/s without.

streamid defaults to 0 if omitted, and memsets in stream 0 aren't overlapped with operations in any other stream; since this happens before any kernel launches, it should be safe.

As for the performance difference -- when I profile these with -tf 101001001 70 71, the memcpy case shows a delay of about 10ms between the memcpy and the first kernel launch, while the memset case shows a delay of only about 3.5ms. That accounts for about a 0.4% improvement. Where does the rest of the difference come from?
Old 2012-02-17, 12:21   #1602
TheJudger
 
 
"Oliver"
Mar 2005
Germany

11·101 Posts

Quote:
Originally Posted by Ethan (EO) View Post
A micro-optimization that yields an extra 1% throughput in the single-instance case on my machine (a GTX 470 fed to about 50% utilization by one instance):
[...]

No improvement for multi-instance cases; -st2 passed. Using memset() on h_RES instead of the loop is slower.
I can reproduce this on my system, too. It is faster in CPU-limited situations.

Quote:
Originally Posted by Ethan (EO) View Post
streamid defaults to 0 if omitted, and memsets in stream 0 aren't overlapped with operations in any other stream; since this happens before any kernel launches, it should be safe.
Yep, you're right.

Quote:
Originally Posted by Ethan (EO) View Post
As for the performance difference -- when I profile these with -tf 101001001 70 71, the memcpy case shows a delay of about 10ms between the memcpy and the first kernel launch, while the memset case shows a delay of only about 3.5ms. That accounts for about a 0.4% improvement. Where does the rest of the difference come from?
I really don't understand why it is faster by a constant percentage rather than by a constant amount of time. The code fragment is executed once per class, and the amount of work/data within the fragment is the same every time, so I would assume a constant-time improvement (e.g. the 6.5ms in your case above), but it is actually 0.6% to 0.8% faster. Time per class in my test cases decreased from e.g. 7.220s to 7.165s (-55ms) and from 28.532s to 28.311s (-221ms): a four times longer execution time yields a four times larger difference in runtime...

Oliver
Old 2012-02-19, 19:23   #1603
TheJudger
 
 
"Oliver"
Mar 2005
Germany

111110 Posts

Ideas for "running multiple instances of mfaktc in a single directory"
  • add a commandline switch to specify the name of the instance (called NAME here)
  • worktodo file:
    • remove worktodo file name from mfaktc.ini
    • if NAME is not specified use worktodo.txt
    • if NAME is specified use worktodo.NAME.txt
  • result file:
    • if NAME is not specified write results into results.txt
    • if NAME is specified write results into results.NAME.txt
  • mfaktc.ini:
    • if NAME is not specified read settings from mfaktc.ini
    • if NAME is specified check if mfaktc.NAME.ini exists
      • if mfaktc.NAME.ini exists use mfaktc.NAME.ini
      • if mfaktc.NAME.ini doesn't exist use mfaktc.ini

Oliver
Old 2012-02-19, 19:52   #1604
Bdot
 
 
Nov 2010
Germany

3·199 Posts

Quote:
Originally Posted by TheJudger View Post
Ideas for "running multiple instances of mfaktc in a single directory"
  • add a commandline switch to specify the name of the instance (called NAME here)
  • worktodo file:
    • remove worktodo file name from mfaktc.ini
    • if NAME is not specified use worktodo.txt
    • if NAME is specified use worktodo.NAME.txt
  • result file:
    • if NAME is not specified write results into results.txt
    • if NAME is specified write results into results.NAME.txt
  • mfaktc.ini:
    • if NAME is not specified read settings from mfaktc.ini
    • if NAME is specified check if mfaktc.NAME.ini exists
      • if mfaktc.NAME.ini exists use mfaktc.NAME.ini
      • if mfaktc.NAME.ini doesn't exist use mfaktc.ini

Oliver
This is how it is implemented in mfakto:
  • it has an option (-i) to let you specify a different ini-file (if not specified, it's mfakto.ini)
  • the ini-file contains worktodo (WorkFile=) and results file (ResultsFile=) names for the instance
  • file locking makes access to the files safe, even if the same file is used by all instances
Old 2012-02-25, 19:36   #1605
flashjh
 
 
"Jerry"
Nov 2011
Vancouver, WA

1,123 Posts

Quote:
Originally Posted by James Heinrich View Post
It looks like I'll be able to help George revise the parsing logic in the near future. Don't expect anything changed today or tomorrow, but if we can collectively decide on a standardized format for the results (by the end of this week, let's say), I'll see if I can get the results parser to understand it all correctly within a few weeks.
James, any luck with this?
Old 2012-02-25, 19:41   #1606
James Heinrich
 
 
"James Heinrich"
May 2004
ex-Northern Ontario

D4D16 Posts

Quote:
Originally Posted by flashjh View Post
James, any luck with this?
Not yet.
I'm experiencing infinitely more difficulty setting up a development environment than I expected. WIMP != LAMP (or even WAMP, as I have on my home/development server).