mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   The P-1 factoring CUDA program (https://www.mersenneforum.org/showthread.php?t=17835)

owftheevil 2013-03-22 12:36

[QUOTE=ET_;334402]Question.

How is the GPU memory use computed, related to exponent, B1 and B2?

In other words, how can I know if B1 and B2 ranges fit in my GPU memory?

Luigi[/QUOTE]

The exponent affects memory use only through the FFT size needed for that exponent. B1 and B2 don't really have an effect. What really drives memory use is the B-S (Brent-Suyama) exponent e and the number of relative primes p processed in a pass. If n is the FFT size, each data sequence uses 8 * n bytes. I think I can get by with an overhead of 4 such sequences. In addition, e + p sequences are needed, so the total is roughly 8 * n * (4 + e + p) bytes.
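
The estimate above can be turned into a quick calculation. This is only a sketch of the formula as stated in this post (8 * n bytes per sequence, 4 overhead sequences, plus e + p working sequences); the function names and the `overhead_seqs` parameter are hypothetical and are not part of the program, and the program's real footprint may differ.

```python
def estimate_device_bytes(fft_len, e, rp_per_pass, overhead_seqs=4):
    """Rough device memory estimate in bytes.

    fft_len       -- FFT length n (number of 8-byte elements per sequence)
    e             -- B-S exponent
    rp_per_pass   -- relative primes processed per pass (p)
    overhead_seqs -- fixed overhead in sequences (4 is the figure quoted above)
    """
    bytes_per_seq = 8 * fft_len
    return bytes_per_seq * (overhead_seqs + e + rp_per_pass)


def max_rp_per_pass(fft_len, e, free_bytes, overhead_seqs=4):
    """Largest p that fits in free_bytes under the same assumptions."""
    bytes_per_seq = 8 * fft_len
    total_seqs = free_bytes // bytes_per_seq
    return max(0, total_seqs - overhead_seqs - e)


if __name__ == "__main__":
    n = 3360 * 1024  # a "3360k" FFT, taking k = 1024 (an assumption)
    print(estimate_device_bytes(n, e=2, rp_per_pass=8) / 2**20, "MiB")
    print(max_rp_per_pass(n, e=2, free_bytes=1024**3), "relative primes fit in 1 GiB")
```

Treat the result as a lower bound on what the program needs: anything else it allocates on the device is not counted here.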

NBtarheel_33 2013-03-22 22:46

Is there any way of getting the GPU to make use of the system RAM? That would *really* give you some power.

Dubslow 2013-03-22 23:06

[QUOTE=NBtarheel_33;334570]Is there any way of getting the GPU to make use of the system RAM? That would *really* give you some power.[/QUOTE]

Host-to-device and device-to-host memory transfers are [i]painfully[/i] slow. CUDALucas (at least pre-bit shift) used one device-to-host memory transfer (just one!) per iteration at its maximum "-polite" setting -- that caused a performance hit of 20%. Actually using main memory in a useful way (many transfers of much data per "iteration") would be impossibly slow.

owftheevil 2013-03-23 01:04

[QUOTE=NBtarheel_33;334570]Is there any way of getting the GPU to make use of the system RAM? That would *really* give you some power.[/QUOTE]

There is one place where host RAM can be used: Stage 2 initialization data can be stored there. That would make starting a new pass for the next batch of relative primes relatively quick and painless. The host-to-device transfers would be spread out enough not to cause a logjam.

NBtarheel_33 2013-03-24 09:07

[QUOTE=Dubslow;334575]Host-to-device and device-to-host memory transfers are [I]painfully[/I] slow. CUDALucas (at least pre-bit shift) used one device-to-host memory transfer (just one!) per iteration at its maximum "-polite" setting -- that caused a performance hit of 20%. Actually using main memory in a useful way (many transfers of much data per "iteration") would be impossibly slow.[/QUOTE]

Bummer. But owftheevil's idea of storing Stage 2 initialization data there is better than nothing, I suppose.

GPUs should have ever-increasing amounts of RAM in the years to come, anyway. It's also much faster RAM - I think GPUs are already at DDR5, while DDR4 system RAM is still in its infancy.

Dubslow 2013-03-24 10:45

[QUOTE=NBtarheel_33;334752]
GPUs should have ever-increasing amounts of RAM in the years to come, anyway. It's also much faster RAM - I think GPUs are already at DDR5, while DDR4 system RAM is still in its infancy.[/QUOTE]

Don't be confused -- it's GDDR5, not DDR5. It is rather faster than DDR3, though. :smile:

[url]http://en.wikipedia.org/wiki/GDDR5[/url]
[quote]Like its predecessor, GDDR4, GDDR5 is based on DDR3 SDRAM memory which has double the data lines compared to DDR2 SDRAM...[/quote]
(GDDR3 is based on DDR2 tech.)

owftheevil 2013-04-13 22:45

Cudapm1 output:

[CODE]
M61076737 has a factor: 432634830991289176546683053423
[/CODE]
Run with B1 = 65000, B2 = 12035000, n = 3360k, d = 2310, e = 2, 8 relative primes (rp) per pass. It used about 600 MB of device memory. Stage 2 took ~53 minutes.

Edit: Looks like about 15 minutes longer to make e = 4.
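
To get a feel for how the number of relative primes per pass translates into the number of stage 2 passes, here is a small sketch. It counts the residues below d that are coprime to d and divides by the batch size. Whether the program works through all of these residues or uses the symmetry between r and d - r to halve the count is an assumption here, so both cases are shown; the function names are hypothetical.

```python
from math import gcd


def coprime_residues(d):
    """All r with 1 <= r < d and gcd(r, d) == 1."""
    return [r for r in range(1, d) if gcd(r, d) == 1]


def passes_needed(d, rp_per_pass, use_symmetry=False):
    """Number of passes to cover the relative primes of d.

    use_symmetry -- assume r and d - r are handled together,
                    halving the count (an assumption, not confirmed here)
    """
    count = len(coprime_residues(d))
    if use_symmetry:
        count //= 2
    return -(-count // rp_per_pass)  # ceiling division


if __name__ == "__main__":
    d = 2310  # = 2 * 3 * 5 * 7 * 11
    print(len(coprime_residues(d)), "residues coprime to", d)
    print(passes_needed(d, 8), "passes without symmetry")
    print(passes_needed(d, 8, use_symmetry=True), "passes with symmetry")
```

For d = 2310 there are 480 such residues, so 8 per pass means 60 passes, or 30 if the symmetry is used.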

Aramis Wyler 2013-04-13 23:22

That would definitely put a dent in our P-1 deficit. Though it's hard to trade 25x P-1 work for 125x factoring work.

EDIT: Not that it wouldn't get used, though. I was trading up 10 GHz-days of factoring per GHz-day of P-1, and this is a better deal than that. :smile:

bcp19 2013-04-14 15:22

Is the program fairly stable now or is it still 'in beta'? Also, is everything done on the GPU or does it take up a CPU core?

owftheevil 2013-04-14 16:36

[QUOTE=bcp19;337058]Is the program fairly stable now or is it still 'in beta'? Also, is everything done on the GPU or does it take up a CPU core?[/QUOTE]

It's not at all stable yet, and it lacks a lot of basic functionality besides. It makes heavy use of a CPU core during initialization of stage 1 and when computing the GCD after either stage. Other than that, the CPU load is not noticeable, much like CUDALucas.

kladner 2013-04-14 16:44

[QUOTE=owftheevil;337063]It's not at all stable yet, and it lacks a lot of basic functionality besides. It makes heavy use of a CPU core during initialization of stage 1 and when computing the GCD after either stage. Other than that, the CPU load is not noticeable, much like CUDALucas.[/QUOTE]

I look forward to trying it when it is ready to debut!

