mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

Batalov 2012-02-12 01:06

[QUOTE=TheJudger;289080][B]For sieve.c I've used gcc 4.6. Result: 20% faster sieving (SievePrimes=5000).[/B] This needs further testing but it looks promising. :smile:
Yes, no code changes, just an updated compiler!

Oliver[/QUOTE]
Nice! This also means that (CPU-specific) assembly code fragments would do better still (though that is surely quite some work).

I also run openSUSE 12.1. Just last night, when I tried the GPU GMP-ECM code, I had to "hack" the CUDA includes to let the 4.6 compiler do the job and not bail (which it does by default).

kjaget 2012-02-12 14:33

[QUOTE=TheJudger;289080]
Yes, no code changes, just an updated compiler!

Oliver[/QUOTE]

That's really interesting. On the Windows side, I spent some time porting the code to build with Intel's C compiler. Generally it is a much better optimizing compiler than MSVC, but I saw no difference. Granted, I wasn't building the .cu files with it, just the .c files, and combining them with the nvcc/MSVC-compiled .cu files, but that should have picked up any improvements it could find in sieve.c.

It could be a lot of things: MSVC isn't as bad as I thought, the older GCC was particularly bad, or (since I'm not building for AVX-enabled targets) there's something specific in the sieve code that works well with AVX but not SSE, among other possibilities.

Good news regardless, though.

Any idea how 20% faster sieving translates into run time improvements?

nucleon 2012-02-12 18:43

My question is: does the current Windows build have these improvements? If not, can I get hold of it? I have a machine where the GPU% is hovering around 85-90% with SievePrimes at 5000. I'm CPU-limited on my farm at the moment.

[QUOTE=kjaget;289143]Any idea how 20% faster sieving translates into run time improvements?[/QUOTE]

Two outcomes I can think of (my guess):

1) If your GPU% is running at, say, 80% (because your CPU is maxed out), then a 20% improvement in the sieve code would boost the GPU% by roughly 20%, so one would expect it to rise to about 96%, giving an overall improvement of approximately 20%.

2) If your GPU% is already close to 99%, and SievePrimes on your mfaktc instances is, say, 'x', then the improvement would let you raise the SievePrimes value above 'x'. The actual throughput gain is anyone's guess, but it won't be higher than 20% and is likely to be noticeably less.

Usual disclaimer: your mileage may vary.

-- Craig

Ethan (EO) 2012-02-14 04:16

A micro-optimization that yields an extra 1% throughput in the single-instance case on my machine (a GTX 470 fed to about 50% utilization with one instance):

Beginning at line 124 of tf_common.cu in 0.18, change
[CODE]
/* set result array to 0 */
for(i=0;i<32;i++)mystuff->h_RES[i]=0;
cudaMemcpy(mystuff->d_RES, mystuff->h_RES, 32*sizeof(int), cudaMemcpyHostToDevice);
[/CODE]

to

[CODE]
/* set result array to 0 */
cudaMemsetAsync(mystuff->d_RES, 0, 32*sizeof(int));
for(i=0;i<32;i++)mystuff->h_RES[i]=0;
[/CODE]

No improvement for multi-instance cases; -st2 passed. memset() on h_RES is slower.

-Ethan

TheJudger 2012-02-14 12:34

Ethan,

1% more throughput sounds unreasonably high; this code is executed only once per class. How long was your test case?
Did you mean cudaMemset() or cudaMemsetAsync()? Async would need the stream ID as an extra parameter and might be unsafe.

Oliver

Ethan (EO) 2012-02-14 22:38

[QUOTE=TheJudger;289363]Ethan,

1% more throughput sounds unreasonably high; this code is executed only once per class. How long was your test case?
Did you mean cudaMemset() or cudaMemsetAsync()? Async would need the stream ID as an extra parameter and might be unsafe.

Oliver[/QUOTE]

Hi Oliver -- maybe more like 0.8%; I see 183M/s compared to 181.5M/s with this change.

streamid defaults to 0 if omitted, and memsets in streamid 0 aren't overlapped with operations in any other streams; since this is happening before any kernel launches it should be safe.

As for the performance difference -- when I profile these with -tf 101001001 70 71, the memcpy case sees a delay of about 10ms between the memcpy and the first kernel launch, while the memset case sees a delay of about 3.5ms between the memset and the first kernel launch. That's about 0.4% improvement. The rest of the difference ... ?

TheJudger 2012-02-17 12:21

[QUOTE=Ethan (EO);289336]A micro-optimization that yields an extra 1% throughput in the single-instance case on my machine (a GTX 470 fed to about 50% utilization with one instance):
[...]

No improvement for multi-instance cases; -st2 passed. memset() on h_RES is slower.[/QUOTE]

I can reproduce this on my system, too. It is faster in CPU-limited situations.

[QUOTE=Ethan (EO);289405]streamid defaults to 0 if omitted, and memsets in streamid 0 aren't overlapped with operations in any other streams; since this is happening before any kernel launches it should be safe.[/QUOTE]
Yep, you're right.

[QUOTE=Ethan (EO);289405]As for the performance difference -- when I profile these with -tf 101001001 70 71, the memcpy case sees a delay of about 10ms between the memcpy and the first kernel launch, while the memset case sees a delay of about 3.5ms between the memset and the first kernel launch. That's about 0.4% improvement. The rest of the difference ... ?[/QUOTE]

I really don't understand why it is faster by a [B]constant factor[/B]. The code fragment is executed [B]once per class[/B] and the [B]amount of work/data within the fragment is constant[/B] all the time. I would expect a constant-time improvement (e.g. the 6.5ms in your case above), but it is actually 0.6% to 0.8% faster. Time per class in my test cases decreased from e.g. 7.220s to 7.165s (-55ms) and from 28.532s to 28.311s (-211ms). A four times longer execution time yields a four times bigger difference in runtime...

Oliver

TheJudger 2012-02-19 19:23

Ideas for "running multiple instances of mfaktc in a single directory"
[LIST]
[*]add a commandline switch to specify the name of the instance (called [I]NAME[/I] here)
[*]worktodo file:
[LIST]
[*]remove the worktodo file name from mfaktc.ini
[*]if NAME is not specified, use worktodo.txt
[*]if NAME is specified, use worktodo.NAME.txt
[/LIST]
[*]result file:
[LIST]
[*]if NAME is not specified, write results into results.txt
[*]if NAME is specified, write results into results.NAME.txt
[/LIST]
[*]mfaktc.ini:
[LIST]
[*]if NAME is not specified, read settings from mfaktc.ini
[*]if NAME is specified, check whether mfaktc.NAME.ini exists:
[LIST]
[*]if mfaktc.NAME.ini exists, use mfaktc.NAME.ini
[*]if mfaktc.NAME.ini doesn't exist, use mfaktc.ini
[/LIST]
[/LIST]
[/LIST]
Oliver

Bdot 2012-02-19 19:52

[QUOTE=TheJudger;290009]Ideas for "running multiple instances of mfaktc in a single directory"
[LIST]
[*]add a commandline switch to specify the name of the instance (called [I]NAME[/I] here)
[*]worktodo file:
[LIST]
[*]remove the worktodo file name from mfaktc.ini
[*]if NAME is not specified, use worktodo.txt
[*]if NAME is specified, use worktodo.NAME.txt
[/LIST]
[*]result file:
[LIST]
[*]if NAME is not specified, write results into results.txt
[*]if NAME is specified, write results into results.NAME.txt
[/LIST]
[*]mfaktc.ini:
[LIST]
[*]if NAME is not specified, read settings from mfaktc.ini
[*]if NAME is specified, check whether mfaktc.NAME.ini exists:
[LIST]
[*]if mfaktc.NAME.ini exists, use mfaktc.NAME.ini
[*]if mfaktc.NAME.ini doesn't exist, use mfaktc.ini
[/LIST]
[/LIST]
[/LIST]
Oliver[/QUOTE]
This is how it is implemented in mfakto:
[LIST]
[*]it has an option (-i) to let you specify a different ini-file (if not specified, it's mfakto.ini)
[*]the ini-file contains the worktodo file (WorkFile=) and results file (ResultsFile=) names for the instance
[*]file locking makes access to the files safe, even if the same file is used by all instances
[/LIST]

flashjh 2012-02-25 19:36

[QUOTE=James Heinrich;273210]It looks like I'll be able to help George revise the parsing logic in the near future. Don't expect anything changed today or tomorrow, but if we can collectively decide on a standardized format for the results (by the end of this week, let's say), I'll see if I can get the results parser to understand it all correctly within a few weeks.[/QUOTE]

James, any luck with this?

James Heinrich 2012-02-25 19:41

[QUOTE=flashjh;290901]James, any luck with this?[/QUOTE]Not yet. :no:
I'm experiencing infinitely more difficulty setting up a development environment than I expected. [url="http://en.wikipedia.org/wiki/WIMP_%28software_bundle%29"]WIMP[/url] != [url="http://en.wikipedia.org/wiki/LAMP_%28software_bundle%29"]LAMP[/url] (or even [url=http://en.wikipedia.org/wiki/WAMP]WAMP[/url] as I have on my home/development server).

