mersenneforum.org  

Old 2011-12-27, 03:37   #45
msft
 
Jul 2009
Tokyo

2×5×61 Posts

Quote:
Originally Posted by f11ksx View Post
I have 1536 MB on the GTX 580.
Do you mean that is not enough, and there is no solution?
It is not enough with CUDALucas 1.3, but enough with CUDALucas 1.4.
I guess.
Old 2011-12-27, 17:57   #46
f11ksx
 
Dec 2011

15₈ Posts

Thank you for the answers
Old 2011-12-29, 06:34   #47
flashjh
 
"Jerry"
Nov 2011
Vancouver, WA

1123₁₀ Posts
Thoughts...

So I finally got around to installing CUDALucas 1.2b to use with my GTX 580s. I've been TFing 8 instances with two HD 5870s and two 580s. The 580s are faster.

I dropped one instance of mfaktc for CUDALucas and I can't believe how fast it is for LL testing. I haven't run a full 4 cores on one LL in a while, but even with 3 instances of mfaktc running, the LL is only going to take ~60 hours. I set 3 cores to run mfaktc and one core for CUDA. On my QX9650, if I don't set it up that way, the system gets slow because TFing on the nVidia cards drives the CPU cores to 100%.

So, the reason for my post is that I kinda feel like I'm wasting time using CPUs to LL or TF anymore. I have several systems that are running LLs that might be better off doing something else. I know it's better to have them do something rather than just sit, but in the time it takes them to do one LL I could finish all my current assignments with CUDA (and that's just using the one 580).

Once 580s and 590s (and whatever else is coming) drop in price, we're going to be able to make a huge dent in LLing and TFing. And the other systems can work on P-1 or easier DC checks. I can't wait to pick up some more cards that can run CUDA. Hopefully Windows 8 will fix the Bulldozer problem so I can use some of that system for LL or TF also. Just curious what everyone's thoughts are on this.

Quote:
Originally Posted by LaurV View Post
You are right! The hell is in that thing! And it is (theoretical) 1.3, not 1.2. I will put a photo when I get home, if you tell me how to run cudalucas on both gpu's.
BTW - LaurV, still curious as to what you ordered... I've seen some of the SuperMicro GPU supercomputing server solutions. Is it something like that? Pictures??

Last fiddled with by flashjh on 2011-12-29 at 06:48 Reason: Find out what runs 1.3 DPTF
Old 2011-12-29, 11:49   #48
diamonddave
 
Feb 2004

2⁵·5 Posts

Quote:
Originally Posted by flashjh View Post
I dropped one instance of mfaktc for CUDALucas and I can't believe how fast it is for LL testing. I haven't run a full 4 cores on one LL in a while, but even with 3 instances of mfaktc running the LL is only going to take ~60 hours.
That's some serious performance!

I'm a bit curious about your setup:

1) What's the size of your exponent? Are we talking LL or DC here?
2) How long does your CPU take for the same exponent?
3) Did you consider that your CPU, if it's a 4-core, could do 4 tests in parallel?
4) While using CUDALucas, what is the performance of your mfaktc instance? Exponent size, Bit Factored and SievePrime depth?

Also when using CUDALucas the core in the CPU basically does nothing, you can run an LL test on it with little to no impact on performance.

Thanks,
Old 2011-12-29, 15:02   #49
flashjh
 
"Jerry"
Nov 2011
Vancouver, WA

2143₈ Posts
Some info

Quote:
Originally Posted by diamonddave View Post
That's some serious performance!
I'm a bit curious about your setup:

This is a QX9650 with 8GB DDR2-1066, 2 MSI GTX 580s, GA-EP45-UD3P - Boot overclock is 9.0 multiplier, 450FSB, memory set to 2.40B. Then I downclock with EasyTune6 to 290FSB - I haven't figured out why I get much better performance with that and it stays a lot cooler. I have the 3 mfaktc instances all using cores 1-3, not individually assigned and CUDA assigned to core 4. All 3 mfaktc use GPU 1 and CUDA uses GPU 2.

Quote:
1) What's the size of your exponent? Are we talking LL or DC here?
The TFs vary, right now I'm running 69-72 or 70-72 with no stages on 49XXXXXX to 52XXXXXX. The LL is first time 4524XXXX. I haven't tested anything higher, I asked GPU to 72 for Lucas-Lehmer assignments.

Quote:
2) How long does your CPU take for the same exponent?
I haven't run an LL with this setup but I'll get Prime95 installed and test it to see when it would finish the same exponent.

Quote:
3) Did you consider that your CPU, if it's a 4-core, could do 4 tests in parallel?
Do you mean stop the mfakto and run 4 LLs?

Quote:
4) While using CUDALucas, what is the performance of your mfaktc instance? Exponent size, Bit Factored and SievePrime depth?
All three of these are running 70-72 on a 4915XXXX exponent.

mfakto1:
Code:
   class | candidates |    time |    ETA | avg. rate | SievePrimes | CPU wait
657/4620 |      1.91G | 10.812s |  2h28m | 176.80M/s |        6153 |    2.40%
mfakto2:
Code:
    class | candidates |    time |    ETA | avg. rate | SievePrimes | CPU wait
2316/4620 |      1.89G | 11.658s |  1h33m | 161.81M/s |        7033 |    2.11%
mfakto3:
Code:
    class | candidates |    time |    ETA | avg. rate | SievePrimes | CPU wait
2280/4620 |      1.89G | 12.222s |  1h38m | 154.34M/s |        7033 |    2.19%
CUDA:
Code:
Iteration 13990000 M( 4524XXXX )C, 0x0fc83c04f4e74388, n = 4194304, CUDALucas v1.2b (0:52 real, 5.1693 ms/iter, ETA 44:52:20)
Quote:
Also when using CUDALucas the core in the CPU basically does nothing, you can run an LL test on it with little to no impact on performance.

Thanks,
I hadn't thought of that. When I test throughput for the LL I'll see what effect the CPU LL has on the system. Maybe I can run that too - which leads me back to the original post of what to do with all the extra CPUs. TF on GPU kinda makes a person impatient for LL on CPU. I guess I need to set it and forget it.
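
A quick sanity check on the ETA in the CUDALucas status line above: an LL test of M(p) needs p - 2 iterations, so the remaining time is roughly (iterations left) × (ms/iter). The sketch below is only illustrative. The exponent is masked as 4524XXXX in the post, so `ASSUMED_EXPONENT` is a stand-in value, and the function names are made up for this example.

```python
def ll_eta_seconds(exponent, iterations_done, ms_per_iter):
    """Remaining wall-clock seconds for a Lucas-Lehmer test.

    An LL test of M(p) = 2**p - 1 needs p - 2 squaring iterations in total.
    """
    remaining = (exponent - 2) - iterations_done
    return max(remaining, 0) * ms_per_iter / 1000.0


def format_hms(seconds):
    """Format a duration in seconds as H:MM:SS."""
    total = int(round(seconds))
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# The real exponent is masked as 4524XXXX; 45_240_000 is a stand-in (assumption).
ASSUMED_EXPONENT = 45_240_000
eta = ll_eta_seconds(ASSUMED_EXPONENT, 13_990_000, 5.1693)
print(format_hms(eta))
```

With the stand-in exponent this lands within a minute or so of the reported 44:52:20, which suggests the ETA is simply the remaining iteration count times the current per-iteration time.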

Last fiddled with by flashjh on 2011-12-29 at 15:05 Reason: add multiplier
Old 2011-12-29, 15:13   #50
Dubslow
Basketry That Evening!
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3×29×83 Posts

On a 2600, I can get one of those LLs done in slightly less than a month, so three per month with one core for mfaktc. That fourth core of yours is doing literally nothing at the moment -- task manager should be reporting 1 or 2% usage. If you run LL on that core, memory restrictions will reduce mfakto throughput by 1 or 2% -- minor compared to the LL work you're doing. It may or may not affect CUDALucas, and if it does, the effect will be even smaller than for mfakto. Some notes: CUDALucas, I believe, is up to version 1.4. Also, CUDALucas in general gets around 1/5th to 1/4th of the throughput of mfakt*, measured in PrimeNet's GHz-Days metric. This is because the LL test is only sort of parallelizable, whereas TF is so-called 'embarrassingly parallel'. Thus most people run mfakt* on the GPUs and keep the LL on the CPU, simply because that's what it's most efficient at. Some people use CUDALucas anyway because they don't care about PrimeNet GHz-Days, and there's also the fact that P-1 factoring currently has no GPU equivalent and PrimeNet always has need there. If you can't wait for LL on CPU, then do P-1 factoring with that extra core. (Or TF-LMH, but P-1 would be more useful, I think.) (Edit: You could also run DC's.)
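
To make the "only sort of parallelizable" versus "embarrassingly parallel" point concrete, here is a minimal Python sketch of both tests. It is a toy for small exponents, not how CUDALucas or mfaktc are implemented, and the function names are hypothetical. It relies on two standard facts: the LL recurrence s -> s² - 2 (mod M(p)) starting from s = 4, and that any factor of M(p) has the form 2kp + 1 and is 1 or 7 mod 8.

```python
def lucas_lehmer(p):
    """Return True if M(p) = 2**p - 1 is prime, for an odd prime p."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        # Each step needs the previous s, so the loop itself is strictly sequential.
        s = (s * s - 2) % m
    return s == 0


def trial_factor(p, max_k):
    """Return the smallest factor q = 2*k*p + 1 of M(p) with k <= max_k, or None."""
    for k in range(1, max_k + 1):
        q = 2 * k * p + 1
        # Any factor of M(p) is 1 or 7 mod 8, so other candidates are skipped.
        if q % 8 not in (1, 7):
            continue
        # q divides 2**p - 1 exactly when 2**p == 1 (mod q); each k is independent.
        if pow(2, p, q) == 1:
            return q
    return None


print(lucas_lehmer(13))      # M(13) = 8191
print(trial_factor(11, 10))  # M(11) = 2047
```

In the LL loop every iteration depends on the one before it, so a GPU can only speed up the inside of each big squaring. In TF every k can be checked on its own, so the candidates spread across as many threads as the card offers.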

Last fiddled with by Dubslow on 2011-12-29 at 15:27
Old 2011-12-29, 21:59   #51
f11ksx
 
Dec 2011

13 Posts

For information: I run LL tests with CUDALucas in about 5 days for exponents around 50,xxx,xxx, on a GTX 580 card.

Last fiddled with by f11ksx on 2011-12-29 at 22:02
Old 2011-12-30, 06:11   #52
LaurV
Romulan Interpreter
 
"name field"
Jun 2011
Thailand

41×251 Posts

Quote:
Originally Posted by flashjh View Post
So, the reason for my post is that I kinda feel like I'm wasting time using CPUs to LL or TF anymore.
That is what everybody (including me) has been saying around here for ages. See all the discussion in the GPU272 thread, too. TF-ing on a CPU has not made sense for years; the very first GPUs were already "circles-around" faster. The new Fermis are faster for LL/DC too. Usually a DC test takes less than 24 hours on the hardware you have (how high are the 580s clocked?). And a first-time LL in the 48M range takes less than 65 hours (like the one you gave as an example). But be careful: CL uses power-of-two FFT sizes, which is why the time does not increase in the same fashion as for P95. An exponent around 55M will take twice as long as a 48M exponent, because it needs an FFT twice the size. So you will get about 130 hours for a 55M exponent, and the time stays almost constant (it increases very little: higher exponents need more iterations, but the time per iteration is almost constant) up to 80M or so, where it doubles again (the next FFT step).
Currently I am doing 130 hours per LL in the upper 50M area, and 24 hours per DC in the 28M-32M area, on each GPU, with a single copy of CL running per GPU, which almost maxes out the GPU.
Unfortunately mfaktc does not seem to take full advantage of the Fermis: the internal memory is not used at all, and it relies on the CPU for filtering. I need to put all 4/8 cores into 4 or 6 copies of mfaktc to be able to max out the two GPUs with them, and in that case the computer can't do anything else without decreasing the GPU occupancy. To keep the GPUs at max, I need to keep the computer "idle". That is why I would prefer to use CL for DC on one GPU, and two or three copies of mfaktc to TF at the LL front on the second GPU.

This is the optimum performance. At the DC front you can clear one exponent per day per GPU. This is the fastest method ever to clear exponents. With trial factoring at the DC front you will NOT find a factor every day. Some days you can test 50 exponents for 2-3 bit levels, or combinations of these (100-300 GHz-days/day), and find 1, 2 or 3 factors, but for the next 5, 7, 15, etc. days you will find none. TF is a "lucky draw". DC is "sure". With DC at the DC front, you will clear one exponent per day, per GPU, no question! And (AND!) this leaves your CPU free, so you can still do some P-1 testing on it. Or another DC, if you like, using P95: on a 3 GHz processor you will get about 15-20 ms per iteration using one core, so you can get one DC out every week, or every two weeks. That is, with a Fermi and one (ONE!) CPU core, you can clear 35 exponents per month, at least. If you decide to work at the DC front.

If you decide on the LL front, things are a bit different, and I have explained them (more than once) in the GPU-2-72 topic.
Old 2011-12-30, 06:20   #53
Dubslow
Basketry That Evening!
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3·29·83 Posts

For me at least, I can (almost) max out my one GPU (460) with one of my four CPU cores, so mfaktc/TF makes more sense. I think it varies more with hardware setup than with actual stats and total throughput etc.. (Do you type a .. ?)

Note to flash: For reference, PrimeNet reports expected 5 days for 25M, and 19 days for 45M.

Last fiddled with by Dubslow on 2011-12-30 at 06:38 Reason: flash! read this again!
Old 2011-12-30, 06:26   #54
flashjh
 
"Jerry"
Nov 2011
Vancouver, WA

1,123 Posts

Quote:
Originally Posted by LaurV View Post
That is what everybody (including me) has been saying around here for ages. [...]
Thanks for the breakdown. Once my LLs finish up I'll check into using that GPU for DC. That will leave 7 instances running TF still.
Old 2011-12-30, 09:46   #55
Brain
 
Dec 2009
Peine, Germany

331₁₀ Posts
GPU Computing Guide Update to v 0.07

Hi,
here is an updated version of the GPU Computing Guide.

Changes:
- New versions of mfaktc, mfakto and CUDALucas. Links to all binaries...
- Missing CUDA 3.2/4.0 libs for CUDALucas can be downloaded, see page 2

Please check for major bugs. If valid maybe an admin could update the stickies...

Happy new year, Brain

GIMPS GPU Computing Cheat Sheet (pdf)

Last fiddled with by Brain on 2012-08-05 at 10:06
All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
