#1596
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2
22·23·103 Posts
Quote:
I also run openSUSE 12.1. Just last night, when I tried the GPU GMP-ECM code, I had to "hack" the CUDA includes so the 4.6 compiler would do the job instead of bailing out (which it does by default).
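The "hack" described here is typically needed because the CUDA 4.x headers reject host compilers newer than the last version tested against that toolkit release. The guard lives in `host_config.h`; the exact version numbers vary by CUDA release, so treat this as an illustrative reconstruction rather than the verbatim header:

```c
/* Illustrative reconstruction of the version guard in CUDA's
 * host_config.h (exact cutoff varies by toolkit release): */
#if defined(__GNUC__)
#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ > 5)
#error -- unsupported GNU version! gcc 4.6 and up are not supported!
#endif
#endif
```

The two common workarounds are commenting out the `#error` line (the "hack" above) or pointing nvcc at an older host compiler with its `-ccbin` option, e.g. `nvcc -ccbin /usr/bin/gcc-4.5 ...`.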
#1597
Jun 2005
2018 Posts
That's really interesting. On the Windows side, I spent some time porting the code to build with Intel's C compiler. It's generally a much better optimizing compiler than MSVC, but I saw no difference. Granted, I wasn't building the .cu files with it, just the .c files, combining them with the nvcc/msvc-compiled .cu files, but that should have picked up any improvements it could find in sieve.c.
It could be a lot of things: MSVC isn't as bad as I thought, the older GCC was particularly bad, or (since I'm not building for AVX-enabled targets) there's something specific in the sieve code that works well with AVX but not SSE, among other possibilities. Good news regardless, though. Any idea how 20% faster sieving translates into run time improvements?
#1598
Mar 2003
Melbourne
5·103 Posts
My question is: does the current Windows code have these improvements? If not, can I get hold of it? I have a machine where the GPU% is hovering around 85-90% with sieve primes at 5000. I'm CPU-limited on my farm atm.
Quote:
1) If your GPU% is running at say 80% (as your CPU is maxed), then a 20% sieve code improvement would boost the GPU% by 20% (ish), so one would expect GPU% to increase to 96% (ish), giving an overall improvement of approximately 20%.
2) If your GPU% is close to 99%, and sieve primes on your mfaktc instances is say 'x', then the improvement would allow you to increase the sieve primes value above 'x'. The actual throughput improvement is anyone's guess, but it won't be higher than 20% and is likely to be noticeably less. Add the disclaimer 'your mileage may vary'.
-- Craig
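Craig's two cases can be sketched with a back-of-envelope model (the function name and the model itself are mine, not from mfaktc): overall throughput is limited by whichever of the CPU sieve or the GPU saturates first.

```python
def gpu_utilization_after_speedup(gpu_util, sieve_speedup):
    """If the CPU sieve is the bottleneck, a faster sieve feeds the GPU
    proportionally more work, capped at 100% utilization."""
    return min(1.0, gpu_util * sieve_speedup)

# Case 1: GPU at 80% because the CPU is maxed. A 20% faster sieve
# lifts GPU utilization to ~96%, i.e. ~20% more overall throughput.
before = 0.80
after = gpu_utilization_after_speedup(before, 1.20)
print(round(after, 2))              # 0.96
print(round(after / before - 1, 2)) # 0.2

# Case 2: GPU already at ~99%. Utilization can only rise to 100%, so
# the directly visible gain is ~1%; the freed CPU headroom goes into
# a larger sieve-primes value instead, with a smaller, harder-to-predict
# throughput benefit.
print(gpu_utilization_after_speedup(0.99, 1.20))  # 1.0
```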
#1599
"Ethan O'Connor"
Oct 2002
GIMPS since Jan 1996
22×23 Posts
A micro-optimization that yields an extra 1% throughput in the single-instance case on my machine (a GTX 470 fed to about 50% utilization):
Beginning at line 124 of tf_common.cu in 0.18, change:

```c
/* set result array to 0 */
for(i=0;i<32;i++)mystuff->h_RES[i]=0;
cudaMemcpy(mystuff->d_RES, mystuff->h_RES, 32*sizeof(int), cudaMemcpyHostToDevice);
```

to:

```c
/* set result array to 0 */
cudaMemsetAsync(mystuff->d_RES, 0, 32*sizeof(int));
for(i=0;i<32;i++)mystuff->h_RES[i]=0;
```

-Ethan
#1600
"Oliver"
Mar 2005
Germany
11×101 Posts
Ethan,
1% more throughput sounds unreasonably high; this code is executed only once per class. How long was your test case? And did you mean cudaMemset() or cudaMemsetAsync()? The async version would need the stream ID as an extra parameter and might be unsafe. Oliver
#1601
"Ethan O'Connor"
Oct 2002
GIMPS since Jan 1996
22·23 Posts
Quote:
streamid defaults to 0 if omitted, and memsets in stream 0 aren't overlapped with operations in any other stream; since this happens before any kernel launches, it should be safe. As for the performance difference: when I profile with -tf 101001001 70 71, the memcpy case shows a delay of about 10 ms between the memcpy and the first kernel launch, while the memset case shows a delay of about 3.5 ms between the memset and the first kernel launch. That accounts for about a 0.4% improvement. The rest of the difference ... ?
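A middle ground between the two positions would be to make the default-stream assumption explicit at the call site and check the return code. This is a sketch of that idea, not mfaktc source; the function name and signature are mine:

```c
/* Defensive variant of the memset change (illustrative sketch, not
 * mfaktc code): pass stream 0 explicitly and check the return value. */
#include <cuda_runtime.h>
#include <stdio.h>

static int zero_result_array(int *d_res, int *h_res, size_t n)
{
    /* Work queued in stream 0 (the default stream) is not overlapped
     * with other streams, so later kernel launches cannot observe the
     * array before it is zeroed. */
    cudaError_t err = cudaMemsetAsync(d_res, 0, n * sizeof(int), 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemsetAsync: %s\n", cudaGetErrorString(err));
        return -1;
    }

    /* The host-side mirror can be cleared while the device memset runs. */
    for (size_t i = 0; i < n; i++)
        h_res[i] = 0;
    return 0;
}
```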
#1602
"Oliver"
Mar 2005
Germany
11·101 Posts
Quote:
Oliver |
#1603
"Oliver"
Mar 2005
Germany
45716 Posts
Ideas for "running multiple instances of mfaktc in a single directory"
Oliver
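The post doesn't spell the ideas out, but since mfaktc keeps its work and results files in its working directory, one common workaround (not a built-in mfaktc feature; the directory and file names here are mine) is to give each instance its own directory and symlink the shared binary and configuration into it:

```shell
# Give each instance a private working directory so worktodo/results
# files don't collide; share the binary and ini via symlinks.
for i in 0 1; do
  mkdir -p "instance$i"
  ln -sf ../mfaktc "instance$i/mfaktc"
  ln -sf ../mfaktc.ini "instance$i/mfaktc.ini"
done
```

Each instance is then started from its own subdirectory, so per-instance files stay separate while the binary and settings remain shared.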
#1604
Nov 2010
Germany
3×199 Posts
Quote:
#1605
"Jerry"
Nov 2011
Vancouver, WA
1,123 Posts
Quote:
#1606
"James Heinrich"
May 2004
ex-Northern Ontario
3·5·227 Posts
Similar Threads

| Thread | Thread Starter | Forum | Replies | Last Post |
|--------|----------------|-------|---------|-----------|
| mfakto: an OpenCL program for Mersenne prefactoring | Bdot | GPU Computing | 1676 | 2021-06-30 21:23 |
| The P-1 factoring CUDA program | firejuggler | GPU Computing | 753 | 2020-12-12 18:07 |
| gr-mfaktc: a CUDA program for generalized repunits prefactoring | MrRepunit | GPU Computing | 32 | 2020-11-11 19:56 |
| mfaktc 0.21 - CUDA runtime wrong | keisentraut | Software | 2 | 2020-08-18 07:03 |
| World's second-dumbest CUDA program | fivemack | Programming | 112 | 2015-02-12 22:51 |