mersenneforum.org  

Old 2013-03-22, 12:36   #78
owftheevil
 

Quote:
Originally Posted by ET_ View Post
Question.

How is the GPU memory use computed, related to exponent, B1 and B2?

In other words, how can I know if B1 and B2 ranges fit in my GPU memory?

Luigi
The exponent affects the memory use by way of the fft size needed for that exponent; B1 and B2 don't really have an effect. What really matters is the Brent-Suyama exponent e and the number of relative primes p processed in a pass. If n is the fft size, each data sequence uses 8 * n bytes. I think I can get by with an overhead of 4 such sequences, and e + p more sequences are needed on top of that, so the total is roughly (4 + e + p) * 8 * n bytes.
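A minimal sketch of that estimate (the constant 4 is just the overhead figure quoted above, and the example values are made up; this is not CUDAPm1 output):

Code:
/* Rough device-memory estimate: (4 overhead + e + p) data sequences of
 * 8 * n bytes each, where n is the fft length.  Example values are made up. */
#include <stdio.h>

int main(void)
{
    const size_t n = 2048UL * 1024UL;   /* a "2048K" fft length */
    const int e = 4;                    /* Brent-Suyama exponent */
    const int p = 16;                   /* relative primes per pass */
    const size_t seq = 8UL * n;         /* bytes per data sequence */
    const size_t total = (size_t)(4 + e + p) * seq;
    printf("estimated stage 2 memory: %.0f MiB\n", total / 1048576.0);
    return 0;
}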
Old 2013-03-22, 22:46   #79
NBtarheel_33
 

Is there any way of getting the GPU to make use of the system RAM? That would *really* give you some power.
Old 2013-03-22, 23:06   #80
Dubslow

Quote:
Originally Posted by NBtarheel_33 View Post
Is there any way of getting the GPU to make use of the system RAM? That would *really* give you some power.
Host-to-device and back memory transfers are painfully slow. CUDALucas (at least pre-bit-shift) used one device-to-host memory transfer (just one!) per iteration at its maximum "-polite" setting -- and that alone caused a performance hit of 20%. Actually using main memory in a useful way (many transfers of much data per "iteration") would be impossibly slow.
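For a sense of scale, here is a minimal sketch (not code from CUDALucas or CUDAPm1; the buffer size and names are placeholders) that times a single fft-sized host-to-device copy with CUDA events:

Code:
/* Time one pinned host -> device copy of an fft-sized buffer. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 8UL * 3440640UL;   /* one data sequence at a 3360K fft */
    double *host, *dev;
    cudaMallocHost((void **)&host, bytes);  /* pinned host memory (fastest DMA path) */
    cudaMalloc((void **)&dev, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.1f MiB in %.2f ms (%.1f GiB/s)\n",
           bytes / 1048576.0, ms, bytes / (ms * 1e-3) / 1073741824.0);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(dev); cudaFreeHost(host);
    return 0;
}

Even at several GiB/s over PCIe, doing copies like this many times per iteration would swamp the time the fft itself takes on the device.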
Old 2013-03-23, 01:04   #81
owftheevil
 

Quote:
Originally Posted by NBtarheel_33 View Post
Is there any way of getting the GPU to make use of the system RAM? That would *really* give you some power.
There is one place where host RAM can be used: stage 2 initialization data can be stored there. That will make starting a new pass for the next batch of relative primes relatively quick and painless, and the host-to-device transfers would be spread out enough not to cause a logjam.
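A hedged sketch of that idea: keep the precomputed stage 2 table in pinned host RAM and stream the next batch of relative primes to the device asynchronously, so the copies overlap with compute instead of stalling it. None of these names come from CUDAPm1; they are placeholders.

Code:
/* Stage the next batch of precomputed stage 2 data from pinned host RAM. */
#include <cuda_runtime.h>

void stage_next_batch(const double *host_table, /* pinned (cudaMallocHost) table */
                      double *dev_batch,        /* device buffer for one pass    */
                      size_t seq_bytes,         /* 8 * n bytes per sequence      */
                      int rp_per_pass,          /* relative primes per pass      */
                      int pass,                 /* which batch to fetch          */
                      cudaStream_t copy_stream) /* separate stream from compute  */
{
    const size_t doubles_per_seq = seq_bytes / sizeof(double);
    const double *src = host_table + (size_t)pass * rp_per_pass * doubles_per_seq;

    /* Because the source is pinned memory, this copy can run concurrently with
     * kernels launched on other streams; the caller synchronizes copy_stream
     * before any kernel reads dev_batch. */
    cudaMemcpyAsync(dev_batch, src, (size_t)rp_per_pass * seq_bytes,
                    cudaMemcpyHostToDevice, copy_stream);
}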
Old 2013-03-24, 09:07   #82
NBtarheel_33
 

Quote:
Originally Posted by Dubslow View Post
Host-to-device and back memory transfers are painfully slow. CUDALucas (at least pre-bit-shift) used one device-to-host memory transfer (just one!) per iteration at its maximum "-polite" setting -- and that alone caused a performance hit of 20%. Actually using main memory in a useful way (many transfers of much data per "iteration") would be impossibly slow.
Bummer. But owftheevil's idea of storing Stage 2 initialization data there is better than nothing, I suppose.

GPUs should have ever-increasing amounts of RAM in the years to come, anyway. It's also much faster RAM - I think GPUs are already at DDR5, while DDR4 system RAM is still in its infancy.
Old 2013-03-24, 10:45   #83
Dubslow

Quote:
Originally Posted by NBtarheel_33 View Post
GPUs should have ever-increasing amounts of RAM in the years to come, anyway. It's also much faster RAM - I think GPUs are already at DDR5, while DDR4 system RAM is still in its infancy.
Don't be confused -- it's GDDR5, not DDR5. It is rather faster than DDR3, though.


http://en.wikipedia.org/wiki/GDDR5
Quote:
Like its predecessor, GDDR4, GDDR5 is based on DDR3 SDRAM memory which has double the data lines compared to DDR2 SDRAM...
(GDDR3 is based on DDR2 tech.)
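Rough numbers behind "rather faster": a back-of-the-envelope comparison of a 2013-era 384-bit GDDR5 card against dual-channel DDR3-1600 (illustrative figures, not measurements):

Code:
/* Peak-bandwidth arithmetic with typical, made-up 2013-era parts. */
#include <stdio.h>

int main(void)
{
    double gddr5 = 4.0e9 * 384 / 8;     /* 4 Gb/s per pin * 384-bit bus    ~ 192 GB/s  */
    double ddr3  = 1600e6 * 64 * 2 / 8; /* 1600 MT/s * 64-bit * 2 channels ~ 25.6 GB/s */
    printf("GDDR5 ~%.0f GB/s, DDR3-1600 ~%.1f GB/s (%.0fx)\n",
           gddr5 / 1e9, ddr3 / 1e9, gddr5 / ddr3);
    return 0;
}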
Old 2013-04-13, 22:45   #84
owftheevil
 

CUDAPm1 output:

Code:
M61076737 has a factor: 432634830991289176546683053423
Run with B1 = 65000, B2 = 12035000, n = 3360K, d = 2310, e = 2, 8 relative primes per pass. It used about 600 MB of device memory. Stage 2 took ~53 minutes.

Edit: Looks like going to e = 4 would take about 15 minutes longer.
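(For comparison, the rough count from post #78 -- (4 + e + p) sequences of 8 * n bytes, taking n = 3360 * 1024 -- works out to 14 * 8 * 3,440,640 bytes, about 385 MB, so the ~600 MB observed presumably includes working buffers beyond that simple estimate.)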

Last fiddled with by owftheevil on 2013-04-13 at 23:33
Old 2013-04-13, 23:22   #85
Aramis Wyler
 

That would definitely put a dent in our P-1 deficit, though it's hard to trade 25x P-1 work for 125x factoring work.

EDIT: Not that it wouldn't get used, though. I was trading 10 GHz-days of factoring per GHz-day of P-1, and this is a better deal than that.

Last fiddled with by Aramis Wyler on 2013-04-13 at 23:30 Reason: PS.
Old 2013-04-14, 15:22   #86
bcp19
 

Is the program fairly stable now or is it still 'in beta'? Also, is everything done on the GPU or does it take up a CPU core?
Old 2013-04-14, 16:36   #87
owftheevil
 

Quote:
Originally Posted by bcp19 View Post
Is the program fairly stable now or is it still 'in beta'? Also, is everything done on the GPU or does it take up a CPU core?
It's not at all stable yet, and it lacks a lot of basic functionality besides. It makes heavy use of a CPU core during initialization of stage 1 and when computing the GCD after either stage. Other than that, the CPU load is not noticeable, much like CUDALucas.
Old 2013-04-14, 16:44   #88
kladner
 

Quote:
Originally Posted by owftheevil View Post
It's not at all stable yet, and it lacks a lot of basic functionality besides. It makes heavy use of a CPU core during initialization of stage 1 and when computing the GCD after either stage. Other than that, the CPU load is not noticeable, much like CUDALucas.
I look forward to trying it when it is ready to debut!