mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   GPU LL Testing FAQ (https://www.mersenneforum.org/showthread.php?t=16142)

msft 2011-12-27 03:37

[QUOTE=f11ksx;283566]I have 1536 MB with the GTX 580.
Do you mean it is not enough, and there is no solution? :cry:[/QUOTE]
It is not enough with CUDALucas 1.3, but it is enough with CUDALucas 1.4.
I guess.

f11ksx 2011-12-27 17:57

Thank you for the answers :smile:

flashjh 2011-12-29 06:34

Thoughts...
 
So I finally got around to installing CUDALucas 1.2b to use with my GTX 580s. I've been running 8 TF instances across two HD 5870s and two 580s; the 580s are faster.

I dropped one instance of mfaktc for CUDALucas and I can't believe how fast it is for LL testing. I haven't run a full 4 cores on one LL in a while, but even with 3 instances of mfaktc running, the LL is only going to take ~60 hours. I set 3 cores to run mfaktc and one core for CUDALucas. On my QX9650, if I don't set it up that way the system gets slow, because TFing drives the cores to 100% on the nVidia cards.

So, the reason for my post is that I kinda feel like I'm wasting time using CPUs to LL or TF anymore. I have several systems that are running LLs that might be better off doing something else. I know it's better to have them do something rather than just sit, but in the time it takes them to do one LL I could finish all my current assignments with CUDA (and that's just using the one 580).

Once 580s and 590s (and whatever else is coming) drop in price, we're going to be able to make a huge dent in LLing and TFing, and the other systems can work on P-1 or easier DC checks. I can't wait to pick up some more cards that can run CUDA. Hopefully Windows 8 will fix the Bulldozer problem so I can use some of that system for LL or TF as well. Just curious what everyone's thoughts are on this?

[QUOTE=LaurV;278227]You are right! The hell is in that thing! And it is (theoretical) 1.3, not 1.2. I will put a photo when I get home, if you tell me how to run cudalucas on both gpu's. :smile:[/QUOTE]

BTW - LaurV, still curious as to what you ordered... I've seen some of the [URL="http://www.throughwave.co.th/resources/products/supermicro/gpu_4page.pdf"]SuperMicro[/URL] GPU supercomputing server solutions. Is it something like that? Pictures?? :bow:

diamonddave 2011-12-29 11:49

[QUOTE=flashjh;283909]
I dropped one instance of mfaktc for CUDALucas and I can't believe how fast it is for LL testing. I haven't run a full 4 cores on one LL in a while, but even with 3 instances of mfaktc running the LL is only going to take ~60 hours.[/QUOTE]

That's some serious performance!

I'm a bit curious about your setup:

1) What's the size of your exponent? Are we talking LL or DC here?
2) How long does your CPU take for the same exponent?
3) Did you consider that your CPU, if it's a 4-core, could do 4 tests in parallel?
4) While using CUDALucas, what is the performance of your mfaktc instance? Exponent size, Bit Factored and SievePrime depth?

Also, when using CUDALucas the assigned CPU core basically does nothing, so you can run an LL test on it with little to no impact on performance.

Thanks,

flashjh 2011-12-29 15:02

Some info
 
[QUOTE=diamonddave;283926]That's some serious performance![/QUOTE]

[QUOTE]I'm a bit curious about your setup:[/QUOTE]

This is a QX9650 with 8GB DDR2-1066, 2 MSI GTX 580s, and a GA-EP45-UD3P. Boot overclock is a 9.0 multiplier at 450 FSB, memory set to 2.40B. Then I downclock with EasyTune6 to 290 FSB - I haven't figured out why, but I get [U]much[/U] better performance with that and it stays a lot cooler. I have the 3 mfaktc instances all using cores 1-3 (not individually assigned) and CUDALucas assigned to core 4. All 3 mfaktc instances use GPU 1 and CUDALucas uses GPU 2.
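
(Incidentally, if anyone wants to script that core assignment instead of setting it by hand, something along the lines of the sketch below should work with the Python psutil package. It's just a hypothetical helper: the process-name matching is a placeholder for however the instances actually show up on your box, and the core numbering is 0-based.)

[CODE]# Hypothetical sketch: pin the TF and LL processes to cores with psutil
# instead of Task Manager. Names are placeholders; cores are 0-based.
import psutil

for proc in psutil.process_iter(['name']):
    name = (proc.info['name'] or '').lower()
    if name.startswith('mfaktc'):
        proc.cpu_affinity([0, 1, 2])   # the three TF instances share "cores 1-3"
    elif name.startswith('cudalucas'):
        proc.cpu_affinity([3])         # CUDALucas gets "core 4" to itself[/CODE]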

[QUOTE]1) What's the size of your exponent? Are we talking LL or DC here?[/QUOTE]

The TFs vary; right now I'm running 69-72 or 70-72 with no stages on 49XXXXXX to 52XXXXXX exponents. The LL is a first-time test of 4524XXXX. I haven't tested anything higher; I asked GPU to 72 for Lucas-Lehmer assignments.

[QUOTE]2) How long does your CPU take for the same exponent?[/QUOTE]

I haven't run an LL with this setup but I'll get Prime95 installed and test it to see when it would finish the same exponent.

[QUOTE]3) Did you consider that your CPU, if it's a 4 core could do 4 test in parallel?[/QUOTE]

Do you mean stop the mfakto and run 4 LLs?

[QUOTE]4) While using CUDALucas, what is the performance of your mfaktc instance? Exponent size, Bit Factored and SievePrime depth?[/QUOTE]

All three of these are running 70-72 on a 4915XXXX exponent.

mfakto1:
[CODE] class | candidates | time | ETA | avg. rate | SievePrimes | CPU wait
657/4620 | 1.91G | 10.812s | 2h28m | 176.80M/s | 6153 | 2.40%[/CODE]
mfakto2:
[CODE]class | candidates | time | ETA | avg. rate | SievePrimes | CPU wait
2316/4620 | 1.89G | 11.658s | 1h33m | 161.81M/s | 7033 | 2.11%[/CODE]
mfakto3:
[CODE]class | candidates | time | ETA | avg. rate | SievePrimes | CPU wait
2280/4620 | 1.89G | 12.222s | 1h38m | 154.34M/s | 7033 | 2.19%[/CODE]
CUDA:
[CODE]Iteration 13990000 M( 4524XXXX )C, 0x0fc83c04f4e74388, n = 4194304, CUDALucas v1.2b (0:52 real, 5.1693 ms/iter, ETA 44:52:20)[/CODE]
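
As a sanity check on that ETA: an LL test needs about p - 2 squarings, so the ETA is just the remaining iterations times the ms/iter figure. A quick check in Python, assuming the masked exponent is roughly 45,240,000:

[CODE]# Rough ETA check for the CUDALucas status line above (assumed values).
p           = 45_240_000    # assumed value for the masked 4524XXXX exponent
done        = 13_990_000    # iteration counter from the status line
ms_per_iter = 5.1693        # reported time per iteration

remaining_s = (p - 2 - done) * ms_per_iter / 1000.0
h, rest = divmod(remaining_s, 3600)
m, s = divmod(rest, 60)
print(f"ETA ~ {int(h)}:{int(m):02d}:{int(s):02d}")   # prints ~44:52:20, matching the reported ETA[/CODE]
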
[QUOTE]Also, when using CUDALucas the assigned CPU core basically does nothing, so you can run an LL test on it with little to no impact on performance.

Thanks,[/QUOTE]

I hadn't thought of that. When I test throughput for the LL I'll see what effect the CPU LL has on the system. Maybe I can run that too - which leads me back to the original post of what to do with all the extra CPUs. TF on GPU kinda makes a person impatient for LL on CPU. I guess I need to set it and forget it.

Dubslow 2011-12-29 15:13

On a 2600, I can get one of those LLs done in slightly less than a month, so three per month with one core left for mfaktc. That fourth core of yours is doing literally nothing at the moment -- Task Manager should be reporting 1 or 2% usage. If you run an LL on that core, memory restrictions will reduce mfakto throughput by 1 or 2% -- minor compared to the LL work you'd be doing. It may or may not affect CUDALucas, and if it does, the effect will be even smaller than on mfakto. Some notes: CUDALucas, I believe, is up to version 1.4. Also, CUDALucas in general gets around 1/5th to 1/4th of the throughput of mfakt*, measured in PrimeNet's GHz-days metric. This is because the LL test is only partly parallelizable, whereas TF is so-called 'embarrassingly parallel'. Thus most people run mfakt* on their GPUs and keep LL on the CPU, simply because that's what each is most efficient at. Some people use CUDALucas anyway because they don't care about PrimeNet GHz-days, and there's also the fact that P-1 factoring currently has no GPU equivalent and PrimeNet always has need there. If you can't wait for LL on the CPU, then do P-1 factoring with that extra core. (Or TF-LMH, but P-1 would be more useful, I think.) (Edit: You could also run DCs.)
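
The 'only partly parallelizable' point follows straight from the definition of the test: every LL iteration squares the previous residue, so the iterations themselves can't be spread across independent workers the way TF candidates can; only the giant multiplication inside each iteration parallelizes. A toy version for small exponents (nothing like the FFT arithmetic CUDALucas actually uses):

[CODE]# Toy Lucas-Lehmer test: s(0) = 4, s(i+1) = s(i)^2 - 2 (mod 2^p - 1); for odd prime p,
# M(p) is prime iff the residue is 0 after p - 2 steps. Each step needs the previous
# residue, which is why LL is essentially serial, while TF can hand independent
# candidates to thousands of GPU threads.
def lucas_lehmer(p):
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0

print([p for p in (3, 5, 7, 11, 13, 17, 19) if lucas_lehmer(p)])  # [3, 5, 7, 13, 17, 19][/CODE]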

f11ksx 2011-12-29 21:59

For information: I run LL tests with CUDALucas in 5 days for exponents around 50.xxx.xxx, on a GTX 580 card.

LaurV 2011-12-30 06:11

[QUOTE=flashjh;283909]So, the reason for my post is that I kinda feel like I'm wasting time using CPUs to LL or TF anymore.
[/QUOTE]

That is what everybody (including me) has been saying here for ages. See all the discussion in the GPU272 thread, too. TFing on a CPU has not made sense for years; even the very first GPUs ran circles around it. The new Fermis are faster for LL/DC too. Usually a DC test takes under 24 hours on the hardware you have (how high are your 580s clocked?), and a first-time LL in the 48M range takes under 65 hours (like the one you gave as an example). But be aware that CL uses power-of-two FFT sizes, which is why the time does not scale the way it does for P95. A ~55M exponent will take about double the time of a 48M exponent, because it needs double the FFT size. So you will get about 130 hours for a 55M exponent, and the time is then almost constant (increasing only slightly, since higher exponents need more iterations, but the time per iteration stays almost constant) up to 80M or so, where it doubles again (the next FFT step).
Currently I am doing 130 hours per LL in the upper 50M area and 24 hours per DC in the 28M-32M area, per GPU, with a single copy of CL running on each GPU, and that almost maxes out the GPU.
Unfortunately mfaktc does not seem to take full advantage of the Fermis: the card's internal memory is not used at all, and it relies on the CPU for the sieving. I need to put all 4/8 cores into 4 or 6 copies of mfaktc to max out the two GPUs, and in that case the computer can't do anything else without lowering the GPU occupancy. To keep the GPUs at max, I have to keep the computer otherwise "idle". That is why I would prefer to use CL for DC on one GPU, and two or three copies of mfaktc TFing at the LL front on the second GPU.
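
To put rough numbers on that model: total time is about p squarings times the ms/iter, and the ms/iter stays roughly flat within one power-of-two FFT length and roughly doubles at the next FFT step. A crude sketch, borrowing the 5.17 ms/iter figure from the 4M-FFT output earlier in the thread (the doubling factor and the exponent values are illustrative assumptions, not measurements):

[CODE]# Illustrative only: CUDALucas wall time ~ p squarings * ms/iter, where ms/iter is
# assumed to double each time the exponent needs the next power-of-two FFT length.
def estimate_hours(p, ms_per_iter_at_4m=5.17, fft_steps_above_4m=0):
    ms_per_iter = ms_per_iter_at_4m * (2 ** fft_steps_above_4m)
    return p * ms_per_iter / 1000 / 3600

print(round(estimate_hours(45_240_000), 1))                        # ~65 h at the 4M FFT
print(round(estimate_hours(55_000_000, fft_steps_above_4m=1), 1))  # ~158 h with a full 2x step[/CODE]
The 55M estimate comes out higher than the ~130 hours I actually see, so in practice the FFT step costs somewhat less than a full doubling; take the sketch as a ceiling.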

This setup gives the optimum performance. At the DC front you can clear one exponent per day per GPU; it is the fastest way there is to clear exponents. With trial factoring at the DC front you will NOT find a factor every day. Some days you can test 50 exponents for 2-3 bit levels, or combinations of these (100-300 GHz-days/day), and find 1, 2, or 3 factors, but then for the next 5, 7, 15, etc. days you will find none. TF is a "lucky draw"; DC is a sure thing. With DC at the DC front, you will clear one exponent per day, per GPU, no question! And (AND!) this leaves your CPU free, so you can still do some P-1 testing on it. Or another DC, if you like, using P95: on a 3 GHz processor you will get about 15-20 ms per iteration using one core, so you can get one DC out every week or two. That is, with one Fermi and one (ONE!) CPU core, you can clear at least 35 exponents per month, if you decide to work at the DC front.
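
To check that "35 exponents per month" figure, here is the back-of-the-envelope arithmetic (the 30M exponent size and the 17.5 ms/iter midpoint are just assumed round numbers):

[CODE]# Back-of-the-envelope check of the DC-front throughput claimed above.
gpu_dc_per_month = 30            # one 28M-32M DC per day per GPU, as stated above

p           = 30_000_000         # assumed mid-range DC exponent
ms_per_iter = 17.5               # midpoint of the 15-20 ms/iter quoted for one CPU core
cpu_days_per_dc  = p * ms_per_iter / 1000 / 86400
cpu_dc_per_month = 30 / cpu_days_per_dc

print(round(cpu_days_per_dc, 1))                   # ~6.1 days per CPU double check
print(round(gpu_dc_per_month + cpu_dc_per_month))  # ~35 exponents per month in total[/CODE]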

If you decide on the LL front, things are a bit different; I have explained them (more than once) in the GPU-2-72 topic.

Dubslow 2011-12-30 06:20

For me at least, I can (almost) max out my one GPU (a 460) with one of my four CPU cores, so mfaktc/TF makes more sense. I think it varies more with hardware setup than with actual stats and total throughput etc.. (Do you type a .. ?)

Note to flash: For reference, PrimeNet reports expected 5 days for 25M, and 19 days for 45M.

flashjh 2011-12-30 06:26

[QUOTE=LaurV;284025]That is what everybody (including me) has been saying here for ages. ... If you decide on the LL front, things are a bit different; I have explained them (more than once) in the GPU-2-72 topic.[/QUOTE]

Thanks for the breakdown. Once my LLs finish up I'll check into using that GPU for DC. That will leave 7 instances running TF still.

Brain 2011-12-30 09:46

GPU Computing Guide Update to v 0.07
 
Hi,
here is an updated version of the GPU Computing Guide.

Changes:
- New versions of mfaktc, mfakto and CUDALucas. Links to all binaries...
- Missing CUDA 3.2/4.0 libs for CUDALucas can be downloaded, see page 2

Please check for major bugs. If it's valid, maybe an admin could update the stickies...

Happy new year, Brain

[URL="http://www.mersenneforum.org/attachments/pdfs/GIMPS_GPU_Computing_Cheat_Sheet.pdf"]GIMPS GPU Computing Cheat Sheet (pdf)[/URL]

