mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

Manpowre 2013-05-14 01:22

CudaLucas with HyperQ enabled code
 
Good news.. after 7 hours tonight, I finally got the HyperQ code with 32 simultaneous threads running fine here on my Titan..

I am using 31670941 as a test exponent, and normal CUDALucas on my Titan runs it with an ETA of 21h 22m.

With 32 HyperQ threads, I actually get an ETA of 12h 44m. That's a lot.. almost half the time. There is some overhead here, which I expected, but still, it's tremendously powerful.

I still have some work to do: there are some iterations which need to be checked, and I also have to run the code against all known Mersenne primes to check that this version of CUDALucas actually iterates through and reports a prime for each and every one.

The good thing is that I got the HyperQ code to run and the first test results came out..

owftheevil 2013-05-14 01:30

There's a quicker way to see if you are getting correct results. Run CuLu with the -r option. This goes through 10000 iterations of many of the known primes and compares the result to known residues.

owftheevil 2013-05-14 01:34

Thanks chalsall. Suggestions for improvement are always welcome.

Manpowre 2013-05-14 01:39

Titan HyperQ
 
1 HyperQ thread = 22h 27m = the same as CUDALucas normal mode
4 HyperQ threads = 10h 40m
8 HyperQ threads = 8h 12m
16 HyperQ threads = 7h 22m
24 HyperQ threads = 21h 28m
32 HyperQ threads = 13h

Very interesting to see that 4, 8, and 16 threads scale well on the Titan with the CUDALucas code.

I'm going to run the code now on a Mersenne prime which doesn't take more than a few hours to go through, then I'll report back tomorrow..

Manpowre 2013-05-14 01:42

[QUOTE=owftheevil;340330]There's a quicker way to see if you are getting correct results. Run CuLu with the -r option. This goes through 10000 iterations of many of the known primes and compares the result to known residues.[/QUOTE]

Haha, I got a nice access violation with -r.. beautiful.. as I expected, there are iterations here which aren't safe to run for longer. I can already see 2 places with IF checks where the iterations are not in sync with the number of HyperQ threads being spawned.

But it's in progress. I will continue with this tomorrow.. off to bed :)

Thanks for the great tip about -r..

Manpowre 2013-05-14 02:07

hehe
 
Well, the HyperQ thread initialization was put in the device init function, not the Lucas init, so when the code destroyed threads, they weren't recreated..

Fixed.. so now I ran -r successfully..

I started 57885161, the 48th prime, and the ETA is 22h 14m on one Titan card. That is very, very quick.

The 25964951 prime is expected in 4h 18m
The 20996011 prime is expected in 3h 21m

The 3h run is the one I will leave on here until I get up from bed.. gotta sleep.. :)

I also tried a few smaller ones..
M( 132049 )C, 0xfffffffffffffff3, n = 7168, CUDALucas v2.03
M( 216091 )C, 0xfffffffffffffff3, n = 12288, CUDALucas v2.03
M( 756839 )C, 0xfffffffffffffffb, n = 40960, CUDALucas v2.03

owftheevil 2013-05-14 02:14

Are you running the different kernels in parallel?

Karl M Johnson 2013-05-14 03:34

[QUOTE=Manpowre;340332]1 HyperQ thread = 22h 27m = the same as cualucas normal mode
4 HyperQ threads = 10h 40m
8 HyperQ threads = 8h 12m
16 HyperQ threads = 7h 22m
24 HyperQ threads = 21h 28m
32 HyperQ threads = 13h[/QUOTE]
Is that the code executed on one Titan?
Or both of them?

Manpowre 2013-05-14 07:16

Titan HyperQ
 
Answers to both questions.

Running kernels in parallel, yes. Well, CUDALucas runs all kernels in parallel, but Nvidia has made HyperThreading available, which means one kernel is inserted, and one step in, a second kernel gets inserted into the same CUDA processor.

Running those tests with HyperQ, yes.

I was thinking about the 24 parallel tests; it's probably because 24 isn't in the halving chain 32/2 = 16, /2 = 8, /2 = 4..
So 4, 8, 16, and 32 kernels are the ones to be used.

The 20996011 prime finished in a little more than 3h with a result.txt file.. so all good so far.

And I haven't even started optimization. I still need to run this through a profiler to see whether the kernels are inserted tightly enough, and there are still 3 more different CUDA kernel calls to give the same kind of optimization.
Then the cleanup.. arguments need to be added, plus some textual info back to the user that they are now running HyperQ..

Manpowre 2013-05-14 08:55

[QUOTE=Karl M Johnson;340343]Is that the code executed on one Titan?
Or both of them?[/QUOTE]

The code was executed on one Titan. I haven't looked into the grid part yet where I can use 2 Titans for one thread.. now that's going to be interesting, of course, when I can do that.

Karl M Johnson 2013-05-14 10:38

Hehe, so no multi-gpu.
Very nice speedup.

owftheevil 2013-05-14 11:49

[QUOTE=Manpowre;340352]Answers to both questions.

Running kernels in parallel, yes. Well, CUDALucas runs all kernels in parallel, but Nvidia has made HyperThreading available, which means one kernel is inserted, and one step in, a second kernel gets inserted into the same CUDA processor.
[/QUOTE]

Actually, CUDALucas runs the kernels sequentially. The kernels are not independent; each must start with the output of the previous kernel to give correct results. Residues for the short tests you posted should have been all zero. (Well, CUDALucas shouldn't have shown any residue at all, but instead gloriously announced that the number is prime.)

You could instead run two separate tests in parallel. You won't get the low single test times, but you will still almost double the throughput. Two tests could finish in 24h as opposed to one test in 21h. I was planning to get around to that sometime this summer, but using regular streaming methods instead of HyperQ. That way the more mundane cards could see some benefit too.
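The hard sequential dependency described above can be seen in a minimal CPU-side sketch of the Lucas-Lehmer test (plain Python big integers standing in for the FFT squaring; this is an illustration, not CUDALucas code):

```python
def lucas_lehmer(p):
    """Lucas-Lehmer test for M(p) = 2^p - 1, with p an odd prime.

    Each iteration squares the previous residue, so iteration k cannot
    start before iteration k-1 has finished -- the same dependency that
    forces CUDALucas to run its squaring and normalize kernels in order.
    """
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m  # needs the result of the previous iteration
    return s == 0

# M(7) = 127 and M(13) = 8191 are prime; M(11) = 2047 = 23 * 89 is not.
```

Splitting this loop across independent streams changes the result, which is consistent with the wrong residues seen in the 32-stream run.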

Manpowre 2013-05-14 13:24

[QUOTE=owftheevil;340377]Actually, CUDALucas runs the kernels sequentially. The kernels are not independent; each must start with the output of the previous kernel to give correct results. Residues for the short tests you posted should have been all zero. (Well, CUDALucas shouldn't have shown any residue at all, but instead gloriously announced that the number is prime.)

You could instead run two separate tests in parallel. You won't get the low single test times, but you will still almost double the throughput. Two tests could finish in 24h as opposed to one test in 21h. I was planning to get around to that sometime this summer, but using regular streaming methods instead of HyperQ. That way the more mundane cards could see some benefit too.[/QUOTE]

When I debugged the code, I saw the RDSP call with a nice matrix going to the GPU, then the normalize function call.

You are probably right about the output, but it's normalization code, and it probably hasn't mattered in my code so far since I only tested primes, so the normalization code probably didn't kick in. Actually, what I saw in the CUDALucas code is that there are 2 iterations which I believe are not visited doing it this way.

This is the reason I don't publish the code yet, until I am 100% sure it's doing it the right way.. but at least HyperQ works, and with prime numbers at least it iterates through very quickly.

owftheevil 2013-05-14 14:00

The normalize kernels are an essential part of each iteration, whether the number being tested is prime or not. It would probably be a good idea to learn the IBWDT algorithm you are working with here before spending too much more time optimizing the code.

Manpowre 2013-05-14 14:25

[QUOTE=owftheevil;340391]The normalize kernels are an essential part of each iteration, whether the number being tested is prime or not. It would probably be a good idea to learn the IBWDT algorithm you are working with here before spending too much more time optimizing the code.[/QUOTE]

I understand; also, the CUDA kernels are launched as a matrix, which just means in my code I spawn the CUDA kernel matrix x 16 HyperQ threads, do the same with the normalization code afterwards, then increase the counter accordingly.

Which, when I look at it, simply means the GPU is processing a bigger chunk each turn between RDSP and the normalization afterwards.

I did not run the 2 codebases side by side to compare output, and I will do so tonight.. I will also trace the input to the RDSP GPU call in both variants to see what the algorithm inserts into the GPU and what it brings out to normalize.

Again, not knowing the algorithm in depth, I might be wrong.. I'll look into the algorithm tonight to get a deeper understanding.

Thank you for the guidance.

Manpowre 2013-05-14 19:15

[QUOTE=owftheevil;340391]The normalize kernels are an essential part of each iteration, whether the number being tested is prime or not. It would probably be a good idea to learn the IBWDT algorithm you are working with here before spending too much more time optimizing the code.[/QUOTE]

You are right. I read about FFT and the algorithm to MOD down by a prime, and it has to happen in sequence..

So, then, I can create an array of 16 HQ streams = 16 test numbers (big exponents), and execute HQ 0 as the first prime, round 1; HQ 1 as the second prime, first round; etc. up to 15, then execute the normalization code for each HQn.

Then each HQ stream will be independent, but they will be executed in HyperQ steps to maximize GPU usage when the GPU calls are executed.

That should take the speed of testing 16 primes simultaneously to somewhat more than testing one prime, plus overhead in the complexity of each CUDA kernel.

Would that do it?
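The interleaving scheme above can be sketched on the CPU with Python generators standing in for the HyperQ streams (illustrative only; no CUDA here): each generator is one independent LL test, and every "round" advances each of them by one iteration, the way one squaring kernel per stream would be issued.

```python
def ll_steps(p):
    """Yield the Lucas-Lehmer residue of M(p) = 2^p - 1 after each iteration."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
        yield s

def run_interleaved(exponents):
    """Advance several independent LL tests round-robin, one iteration
    each per round -- the CPU analogue of issuing one kernel launch per
    stream, then one normalize launch per stream, and so on."""
    streams = {p: ll_steps(p) for p in exponents}
    final = {}
    while streams:
        for p, gen in list(streams.items()):
            try:
                final[p] = next(gen)   # one "kernel launch" for this stream
            except StopIteration:
                del streams[p]         # this exponent's test is finished
    return {p: residue == 0 for p, residue in final.items()}

# run_interleaved([7, 11, 13]) reports M(7) and M(13) prime, M(11) composite.
```

Each test stays strictly sequential internally, so the results are still correct; only independent tests overlap.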

owftheevil 2013-05-14 19:50

That sounds much more like it would work, but I think two simultaneous tests will fill up the gpu. Three might see a slight benefit over two, but more than that won't.

Edit: Depending on the fft size. Smaller fft means more tests will see a benefit.

Manpowre 2013-05-14 21:19

[QUOTE=owftheevil;340432]That sounds much more like it would work, but I think two simultaneous tests will fill up the gpu. Three might see a slight benefit over two, but more than that won't.

Edit: Depending on the fft size. Smaller fft means more tests will see a benefit.[/QUOTE]

Agreed,

With 6 GB of memory on the Titan, which is the only board at this moment with HyperQ enabled, I guess it's possible to push more than 2 simultaneous tests.

We'll see. I am going through all references and variables now in the main function calls to enable up to 16 searches at the same time, with an argument to set how many run simultaneously.. it's going to take some time.

Again, thanks for the very good input..

owftheevil 2013-05-14 21:27

It's not the memory that is limiting, it's the number of processors and the size of the kernels.

Karl M Johnson 2013-05-15 05:10

Does that mean we will not see an impressive (3x) reduction in time?

[QUOTE=Manpowre;340423]You are right. I read about FFT and the algorithm to MOD down by a prime, and it has to happen in sequence..

So, then, I can create an array of 16 HQ streams = 16 test numbers (big exponents), and execute HQ 0 as the first prime, round 1; HQ 1 as the second prime, first round; etc. up to 15, then execute the normalization code for each HQn.

Then each HQ stream will be independent, but they will be executed in HyperQ steps to maximize GPU usage when the GPU calls are executed.

That should take the speed of testing 16 primes simultaneously to somewhat more than testing one prime, plus overhead in the complexity of each CUDA kernel.

Would that do it?[/QUOTE]

Manpowre 2013-05-15 09:55

[QUOTE=Karl M Johnson;340527]Does that mean we will not see an impressive (3x) reduction in time?[/QUOTE]

I mentioned earlier in the thread that I had to check the results from the new codebase utilizing HyperQ, and thanks to good advice, it brought me to a level of understanding of the code which means that using HyperQ this way can't be done without dramatically rewriting the CUDALucas code, which will take time.

But the HyperQ test did iterate and execute all the threads; it's just that they didn't produce real results, because the second GPU execution has to happen after the first GPU execution.

I read this in the Kepler GK110 whitepaper:
[url]http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf[/url]

quote for HyperQ:
"Applications that previously encountered false serialization across tasks, thereby limiting achieved GPU utilization, can see a dramatic performance increase without changing any existing code."

So far I have had to dramatically change the code even to use HyperQ as in the example shipped with CUDA 5.0.

CUDALucas was made for 1 thread, and all its variables support 1 parallel task utilizing FFT on the GPU and CUDA kernel algorithms, which is the reason CUDALucas gets its great speedup compared to the CPU.

There are actually 2 steps done inside CUDALucas:
CPU code to initialize CUDALucas
* GPU code to run FFT part 1
CPU code to evaluate GPU code part 1
* GPU code part 2 to normalize data
CPU code to evaluate conditions and steps + exit eval

The only way is to use HyperQ outside of this running scope: to use the GPU while CPU code is running, and also to insert parallel tasks while GPU code is executing.

I am still researching.. what I proved is that there is a tremendous speedup using HyperQ, but for CUDALucas it probably won't involve the same dataset; it probably means inserting a second or third dataset in parallel.

I also looked into spawning a new thread from a thread, which could theoretically (without having investigated much) mean that the CPU evaluation code could be moved to the GPU, and GPU code part 2 could then be spawned from GPU code part 1. It's called dynamic parallelism, and at this point it's for the GK110 chip only.

We'll see. I am trying to do this without changing the code parts which Nvidia wrote, and see where that will take me.

TheJudger 2013-05-15 20:06

Hi Manpowre,

HyperQ is not (hyper-)threading. Actually, even the oldest CUDA-capable GPUs run multiple threads per core (reason: to hide latency to memory as well as possible).
[LIST]
[*]CC 1.x GPUs can run only one GPU kernel at any time. If more kernels are launched they have to wait; no matter whether the kernels are launched from a single host process or from different host processes, they are executed in serial order.
[*]CC 2.x GPUs can run multiple kernels concurrently if and only if they are launched from the [B]same[/B] host process.
[*]CC 3.5 GPUs can run multiple kernels from [B]different[/B] host processes concurrently. This is called HyperQ.
[/LIST]
Oliver

frmky 2013-05-15 23:19

[QUOTE=TheJudger;340596][*]CC 2.x GPUs can run multiple kernels concurrently if and only if they are launched from the [B]same[/B] host process.[*]CC 3.5 GPUs can run multiple kernels from [B]different[/B] host processes concurrently. This is called HyperQ.[/LIST][/QUOTE]
It's not that simple. In CC 2.x, you need to be careful to launch kernels in different streams in a breadth-first manner. Launching kernels depth-first creates false dependencies that prevent them from running concurrently. HyperQ removes this restriction, and thus benefits even when launched from the same host process.

HyperQ does support concurrently running kernels from different host processes, but this is [B]not[/B] supported by the currently released CUDA toolkit. Support for this should be coming in the next version of CUDA.

Manpowre 2013-05-17 10:50

[QUOTE=frmky;340620]It's not that simple. In CC 2.x, you need to be careful to launch kernels in different streams in a breadth-first manner. Launching kernels depth-first creates false dependencies that prevent them from running concurrently. HyperQ removes this restriction, and thus benefits even when launched from the same host process.

HyperQ does support concurrently running kernels from different host processes, but this is [B]not[/B] supported by the currently released CUDA toolkit. Support for this should be coming in the next version of CUDA.[/QUOTE]

Yep, I figured that out.. I tested with separate consoles against the same card, even setting the env variable to support this, and it just slowed the execution of the code down by half..

I am working on changing the CUDALucas code into a C++ program with a .cu file to support this, but it will take time.. this is my summer project..

But I learned the CUDALucas code in less than a week.. so I got pretty far.. thanks to all the good responses.. I really appreciate that..

BTW, I see gpuLucas uses the dd_real library and CUDALucas uses the Nvidia toolset double2ll, and I understand the dd_real lib is more accurate and, from what I read, also faster?

I can't get the QD lib to compile in my Windows environment; I even tried MinGW and MSYS, but it only targets 32-bit, and even then won't compile..

Then I looked into MPIR, which compiles and links just fine in a separate project, but this lib seems very complicated..

Does anyone know a good dd_real lib I can reference for gpuLucas? I just wanted to test gpuLucas compiled on this system.. the gpuLucas code is more cleanly written, so easier to understand..

owftheevil 2013-05-19 16:50

I finally figured out how to edit and flash the BIOS on my 560 Ti. At a memory clock of 2089 MHz, CuLu and CPm1 are stable; at 2088 MHz, memtest quits giving errors. Think I'll run it at 2050 MHz.

chalsall 2013-05-19 16:55

[QUOTE=owftheevil;340953]I finally figured out how to edit and flash the bios on my 560ti. At memory clock of 2089 Mhz, CuLu and CPm1 are stable, at 2088 Mhz, memtest quits giving errors. Think I'll run it at 2050Mhz.[/QUOTE]

Care to enlighten? I hadn't looked into this option.

owftheevil 2013-05-19 17:01

The method requires Wine or a Windows virtual machine, a DOS-bootable USB, the DOS version of nvflash.exe, NiBiTor.exe, and a reboot every time you want to change any setting. I'll post details if you want them.

chalsall 2013-05-19 17:15

[QUOTE=owftheevil;340955]I'll post details if you want them.[/QUOTE]

Please! :smile:

Because I'm in Barbados an RMA is almost out of the question -- the shipping costs involved would probably be almost as much as, if not more than, the card cost me. I'm prepared to risk "bricking" this thing to get it stable.

owftheevil 2013-05-19 17:27

Don't think there's much chance of bricking the thing. Do you have a Windows virtual machine? It's not necessary, but it makes creating the bootable DOS USB easier. If not, get Wine installed; it's needed to run the BIOS editor NiBiTor. I'll look up some links and get back.

chalsall 2013-05-19 17:29

[QUOTE=owftheevil;340957]You have a windows virtual machine?[/QUOTE]

Yes -- several... :wink: Plus, the machine which hosts the card is dual-boot.

[QUOTE=owftheevil;340957]I'll look up some links and get back.[/QUOTE]

Much appreciated!

owftheevil 2013-05-19 18:33

Here is how I edited and flashed the BIOS on my 560 Ti from Linux to reduce the memory clock. The usual disclaimer applies: if this works for you, great; if not, you have my sympathies, but nothing else (well, maybe half of chalsall's non-existent first born). It has worked without a hitch for me over ~20 iterations. If your card is the only card in the machine, it would probably be a good idea to set up an autoexec.bat file to automatically flash the original saved BIOS back in case something bad happens.

First, in Windows, make a DOS-bootable USB. I followed the instructions here:

[URL="http://www.sevenforums.com/tutorials/46707-ms-dos-bootable-flash-drive-create.html"]www.sevenforums.com/tutorials/46707-ms-dos-bootable-flash-drive-create.html[/URL]

Second, put nvflash on the dos usb. I got nvflash from here:

[URL="http://www.techpowerup.com/downloads/2229/nvflash-5-128-0-1/"]www.techpowerup.com/downloads/2229/nvflash-5-128-0-1/
[/URL]
Third, to get a backup copy of the BIOS and something to edit, boot from the USB and run nvflash:

[CODE]nvflash --save filename.rom[/CODE]

Fourth, reboot and either in wine or a windows virtual machine, run NiBiTor, which I got here:

[URL="http://www.guru3d.com/files_details/nvidia_bios_editor_download_nibitor.html"]www.guru3d.com/files_details/nvidia_bios_editor_download_nibitor.html[/URL]

In NiBiTor, open the saved ROM. Under the Tools menu, select Fermi Clocks. This brings up a dialog with sections 3, 7, - (greyed out), and 15, with editable entries. The 3, 7, and 15 represent performance levels 0, 1, and 2 respectively, as reported by the Nvidia X Server Settings application. The memory clock setting is entry 5; only section 15 is important. Edit this accordingly and save.

Fifth, reboot to the DOS USB and use nvflash to flash the newly edited BIOS:

[CODE]nvflash newbios.rom[/CODE]

When it's done, reboot, and you should be using the new clock settings.

Good luck, I hope this works for you.

chalsall 2013-05-19 18:42

[QUOTE=owftheevil;340964]Good luck, I hope this works for you.[/QUOTE]

Thank you VERY much Carl!

I may drive across the country tomorrow (will take me ~20 minutes :wink:) to give this a try! Will report back.

kladner 2013-05-19 20:42

This is wonderful to hear about, even if I don't need it at the moment (running Windows here). I had imagined that a driver hack would be needed, in spite of the fact that I've edited other BIOSes in the past. Thanks for the educational example! :smile:

NBtarheel_33 2013-05-20 08:39

[QUOTE=chalsall;340966]I may drive across the country tomorrow (will take me ~20 minutes :wink:)[/QUOTE]

20 minutes to cross the country...so going from one back yard to another would be crossing state lines, and stepping off your front porch would be crossing a county line, eh? :smile:

I imagine there is very little need for secondary or tertiary political subdivisions in Barbados...

chalsall 2013-05-20 20:51

[QUOTE=NBtarheel_33;341007]I imagine there is very little need for secondary or tertiary political subdivisions in Barbados...[/QUOTE]

LOL... Barbados is [I]very[/I] small. We only have one "city", and it doesn't have a mayor. (And, BTW, we don't have States nor Provinces; we have Parishes...)

[QUOTE=chalsall]I may drive across the country tomorrow to give this a try! Will report back.[/QUOTE]

I decided to delay this until Tuesday, to optimize my time "on the road".

But for everyone's amusement, I emailed EVGA (my card's manufacturer) yesterday asking their advice on how to down-clock my card under Linux, after owftheevil alerted us all to the possibility. Their response, received today, was:

[QUOTE=EVGA Support Email]We unfortunately do not officially support our cards while running linux. It would seem there is no way to adjust the clock speed.[/QUOTE]

Sigh....

kracker 2013-05-20 20:53

I don't know about EVGA specifically, but flashing/editing your own BIOS may void your warranty. I don't know whether they can tell; it's just something I've heard.

sdbardwick 2013-05-20 21:05

[QUOTE=NBtarheel_33;339938]Thanks! I am interested, as I am sure all the GPU code authors are, in seeing if we get a match.

How many cores of the i5 are you using? 11 days is actually faster than CUDALucas! Must be better FFT size selection on Prime95.[/QUOTE]

[URL="http://mersenne.org/report_exponent/?exp_lo=82090249&exp_hi=10000&B1=Get+status"]Matching DC completed[/URL]

It was all 4 cores of an i5-2500K @ 3.3 GHz (not overclocked).

chalsall 2013-05-20 21:05

[QUOTE=kracker;341042]I don't know about EVGA specifically, but flashing/editing your own BIOS may void your warranty. I don't know whether they can tell; it's just something I've heard.[/QUOTE]

It almost certainly does void the warranty. And the manufacturer certainly can tell (using the same nvflash utility owftheevil pointed us to, written and provided by Nvidia) by downloading the BIOS image found on the RMA'ed card and running an MD5/SHA/binary diff on it.

But, at the end of the day, if this is the only way Linux users are able to make the hardware we [B][I][U]own[/U][/I][/B] stable, many of us will do so (accepting the risks associated with the activity).

owftheevil 2013-05-20 21:15

However, what if you keep a backup of the original BIOS and flash it back before the RMA? Nvidia supplies the flash utility, and I think some companies provide BIOS updates for their cards.

chalsall 2013-05-20 21:22

[QUOTE=owftheevil;341046]However, what if you keep a backup of the original BIOS and flash it back before the RMA? Nvidia supplies the flash utility, and I think some companies provide BIOS updates for their cards.[/QUOTE]

Good point.

However, EVGA does not appear to provide BIOS updates for any of their graphics cards. Also, if someone "bricks" their card they won't be able to re-install the backup BIOS.

Although, truth be told, the most likely cause of hardware failure is because some gamer wanted to over-clock / over-voltage their card by way of a BIOS hack -- not someone wanting to down-clock their card for computational stability.

owftheevil 2013-05-20 23:24

As some of the thinking here is somewhat foreign to me, I feel a need to clarify my position.

When I mess with the bios as I do, I do so with the knowledge that what I am doing might destroy my card. I also do so without any intention of RMAing any card I have fiddled with in such a way. Even if I were to RMA such a card, I most definitely would not try to hide the fact that I had fiddled with the bios.

So again, if you follow these procedures, you do so at your own risk. Please take full responsibility for what you do.

That said, I don't believe there is much risk in this procedure. For me, the slight risk is worth the gain of a fully functional card.

chalsall 2013-05-20 23:37

[QUOTE=owftheevil;341053]So again, if you follow these procedures, you do so at your own risk. Please take full responsibility for what you do.[/QUOTE]

Absolutely.

Anyone who does anything beyond buying a card, plugging it in, and returning it to the retailer or manufacturer if it doesn't work takes their own risks.

Carl talked about how he found a solution to his particular problem. He was very clear that this solution may not work for everyone (or, even, anyone else).

[QUOTE=owftheevil;341053]That said, I don't believe there is much risk in this procedure. For me, the slight risk is worth the gain of a fully functional card.[/QUOTE]

It's called "covering your ass with paper".

Thanks for sharing your experience! :smile:

Edit1: Just to be clear, you, Carl, did exactly the correct thing.

Edit2: The hardware and firmware and software providers, on the other hand, might not have. They are the ones covering their asses with legal "paper". Try to enforce any of it, though.

James Heinrich 2013-05-21 14:21

Can I request that everyone please run a quick benchmark for me? I'd like to validate (and update) the lookup table I use to generate my [url=http://www.mersenne.ca/cudalucas.php]CUDALucas throughput page[/url]. Please run this simple benchmark on as wide a variety of GPUs as you have available and email the results to [email]james@mersenne.ca[/email] (or PM me here if you prefer).[code]CUDALucas -info -cufftbench 1048576 8388608 1048576[/code]

James Heinrich 2013-05-21 17:56

Also, please run the benchmark with v2.04 if possible.

chalsall 2013-05-21 18:59

[QUOTE=owftheevil;341053]That said, I don't believe there is much risk in this procedure. For me, the slight risk is worth the gain of a fully functional card.[/QUOTE]

Indeed. And I'm happy to report that this (appears to have) solved the problem on my card. I dropped the memory clock from 2004 MHz to 1960 MHz, and my GTX 560 has now passed three CUDALucas self-tests (-r) and over two hours of your memory test. I'm going to let it run for at least another four hours, but the card never survived more than an hour without at least one error at 2004 MHz.

Absolutely no issues with the nvflash -- your above procedure worked perfectly!

Thank you so much Carl!!! :bow:

chalsall 2013-05-21 23:03

[QUOTE=chalsall;341144]I'm going to let it run for at least another four hours, but the card never survived more than an hour without at least one error at 2004 MHz.[/QUOTE]

[CODE][chalsall@hobbit memtest013]$ ./memtest 74 100000 0 | tee 201305211516.txt

Initializing test using 1850MiB of memory on device 0

Beginning test.

Position 0, Iteration 10000, Errors: 0, completed 0.14%
Position 0, Iteration 20000, Errors: 0, completed 0.27%
...
Position 73, Iteration 90000, Errors: 0, completed 99.86%
Position 73, Iteration 100000, Errors: 0, completed 100.00%[/CODE]

Happiness!!! :smile:

owftheevil 2013-05-21 23:19

Now to see if that translates into error free CuLu and CPm1.

chalsall 2013-05-21 23:40

[QUOTE=owftheevil;341176]Now to see if that translates into error free CuLu and CPm1.[/QUOTE]

An excellent question. Let's run that experiment. I will report back in a couple of days...

[CODE][chalsall@hobbit cl]$ ./CUDALucas -v

CUDALucas v2.051 Alpha

[chalsall@hobbit cl]$ ./CUDALucas 29801743 | tee 201305211935.txt

Starting M29801743 fft length = 1568K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a longer FFT.
Iteration 100, average error = 0.23153, max error = 0.34766
Iteration = 132 < 1000 && err = 0.35938 >= 0.35, increasing n from 1568K
Starting M29801743 fft length = 1600K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a longer FFT.
Iteration 100, average error = 0.09618, max error = 0.14844
Iteration 200, average error = 0.10772, max error = 0.14062
Iteration 300, average error = 0.11145, max error = 0.14600
Iteration 400, average error = 0.11349, max error = 0.14844
Iteration 500, average error = 0.11475, max error = 0.15625
Iteration 600, average error = 0.11599, max error = 0.15234
Iteration 700, average error = 0.11640, max error = 0.14062
Iteration 800, average error = 0.11684, max error = 0.14062
Iteration 900, average error = 0.11732, max error = 0.15283
Iteration 1000, average error = 0.11744 < 0.25 (max error = 0.15625), continuing test.
Iteration 10000 M( 29801743 )C, 0x865251249daf325a, n = 1600K, CUDALucas v2.051 Alpha err = 0.17188 (0:54 real, 5.4364 ms/iter, ETA 44:59:20)
Iteration 20000 M( 29801743 )C, 0x2e336b1f4a815d5b, n = 1600K, CUDALucas v2.051 Alpha err = 0.17188 (0:53 real, 5.2620 ms/iter, ETA 43:31:50)
Iteration 30000 M( 29801743 )C, 0x47e2fdcaa747334b, n = 1600K, CUDALucas v2.051 Alpha err = 0.17188 (0:52 real, 5.2627 ms/iter, ETA 43:31:20)
[/CODE]

prime7989 2013-05-23 07:38

Dear Manpowre,
Can you tell me the URL of your latest incarnation of CUDALucas that works on the GTX Titan?
Also, could you do this:
Try modifying your code to also run Lucas-Lehmer tests on the GPU for Fermat numbers.
The proof of correctness is a theorem in my MSc thesis at U of Toronto.
The quadratic for this is: x^2 - 5x + 1 = f(x)
Everything remains the same for the LL test of Fn = (2^2^n) + 1:
start with S0 = 5 instead of 4 (i.e. x[0] = 5 instead of 4),
and test for S(p-2) == 0 (mod Fn), as S(p-2) == 0 iff Fn is prime.
Note that the recursive polynomial for Fermat and Mersenne numbers is the same,
that is: S_k = (S_(k-1))^2 - 2,
and the FFT must take into account that the binary form of Fn is different from Mp:
M7 = 1111111_2, F_1 = 101_2, F_2 = 10001_2.
Al
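Before touching the CUDA code, the proposed recurrence can be sanity-checked on small cases in plain Python (big-int arithmetic instead of the FFT). Note that reading the post's p as 2^n is my assumption; with it, the claimed test matches the known prime/composite status of F(1) through F(3) and F(5):

```python
def fermat_ll(n):
    """Lucas-Lehmer-style test for F(n) = 2^(2^n) + 1, as proposed above:
    S0 = 5, S_k = S_(k-1)^2 - 2, and F(n) is claimed prime iff
    S(p-2) == 0 (mod F(n)), taking p = 2^n (assumed reading of the post)."""
    f = (1 << (1 << n)) + 1
    s = 5 % f                      # S0 = 5 instead of the Mersenne S0 = 4
    for _ in range((1 << n) - 2):  # p - 2 iterations with p = 2^n
        s = (s * s - 2) % f
    return s == 0

# F(1) = 5, F(2) = 17, F(3) = 257 are prime; F(5) = 641 * 6700417 is not.
```

The only change from the Mersenne version is the starting value and the modulus, consistent with the post's remark that the recursive polynomial is the same.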

Manpowre 2013-05-23 10:02

[QUOTE=prime7989;341328]Dear Manpowre,
Can you tell me the URL of your latest incarnation of CUDALucas that works on the GTX Titan?[/QUOTE]
I have compiled the 2.03 version of CUDALucas with sm_20 and compute_20 under CUDA 5.0, and since I am looking for a CL codebase to branch for a HyperQ branch, I looked at the 2.05 alpha yesterday and got that compiled with sm_35 and compute_35 under CUDA 5.0 after changing a few Linux calls which weren't supported by the Windows compiler.. it was the lock_and_fopen and unlock_and_fclose Linux calls. But 2.05 alpha is really slower than 2.03, so I am still running 2.03 when I'm away from the machines.

I'll make a Dropbox link to the latest build once I have 2.03 with CUDA 5.0 and compute and sm set to 35. I've mostly spent the week finding techniques to test different versions and benchmark them. Tonight I will test 2.03 with sm_35 to see whether it's the library that slows down 2.05 or the code itself.

[QUOTE=prime7989;341328]
Also if you can do this:
Try modifying your code to also run Lucas-Lehmer tests on the GPUS for Fermat numbers:
The proof of correctness of this is a theorem in my MSc thesis at the U of Toronto.
The quadratic for this is : x^2 -5x +1=f(x)
Everything remains the same for the LL for Fn=(2^2^n)+1
Start with S0=5 instead of 4 or x[0]=5 instead of 4
and test for S(p-2)==0(mod Fn) as S(p-2)==0 iff Fn prime
Note that the recursive poly for Fermat and Mersenne numbers is the same.
That is: S_k=(S_(k-1))^2 -2
and the FFT must take into account that the binary form of Fn differs from that of Mp.
M7=1111111_2 F_1=101 F_2=10001
Al[/QUOTE]
I'll take a look at this; can't promise anything, but I'll go through the code. The 2.05 alpha seems to be much more understandable code, so it should be doable to change this and also change all the labels to say it's Fermat testing. I could also send the VS2010 solution with all its files to anyone who wants it, as it at least compiles just fine (if you want to go through the code with a developer close to you).

prime7989 2013-05-23 11:44

[QUOTE=Manpowre;341333]I have compiled the 2.03 version of CUDALucas with sm_20 and compute_20 under CUDA 5.0. Since I am looking for a CL codebase to branch for HyperQ, I looked at the 2.05 alpha yesterday and got it compiled with sm_35 and compute_35 under CUDA 5.0, after replacing a few Linux calls that weren't supported by the Windows compiler (the lock_and_fopen and unlock_and_fclose calls). But 2.05 alpha is noticeably slower than 2.03, so I am still running 2.03 when I'm away from the machines.

I'll put the latest build on Dropbox once I have 2.03 built with CUDA 5.0 and compute/sm set to 35. I've mostly spent the week finding techniques to test different versions and benchmarking. Tonight I will test 2.03 with sm_35 to see whether it's the library or the code itself that slows down 2.05.


I'll take a look at this; can't promise anything, but I'll go through the code. The 2.05 alpha seems to be much more understandable code, so it should be doable to change this and also change all the labels to say it's Fermat testing. I could also send the VS2010 solution with all its files to anyone who wants it, as it at least compiles just fine (if you want to go through the code with a developer close to you).[/QUOTE]
Do you have a Linux version of the source code for versions 2.03 and 2.05 alpha?
I could give it a try for the Fermat numbers. I will have to ask you questions on the forum about the mod points.

Manpowre 2013-05-23 18:04

[QUOTE=prime7989;341343]Do you have a Linux version of the source code for versions 2.03 and 2.05 alpha?
I could give it a try for the Fermat numbers. I will have to ask you questions on the forum about the mod points.[/QUOTE]

[url]http://sourceforge.net/projects/cudalucas/[/url]

Manpowre 2013-05-23 18:53

[QUOTE=Manpowre;341373][url]http://sourceforge.net/projects/cudalucas/[/url][/QUOTE]

The algorithm is just 1/3 of the total code.
The main iteration is done in the int check() function.
If you follow the check() function, you will see the algorithm.

TheJudger 2013-05-23 20:37

Hi Carl,

I have to annoy you again, sorry!
[CODE]
Position 213, Iteration 100000, Errors: 0, completed 91.06%
Position 214, Iteration 10000, Errors: 0, completed 91.11%
Position 214, Iteration 20000, Errors: 0, completed 91.15%
Position 214, Iteration 30000, Errors: 0, completed 91.19%
Position 214, Iteration 40000, Errors: 0, completed 91.23%
Position 214, Iteration 50000, Errors: 0, completed 91.28%
Position 214, Iteration 60000, Errors: 0, completed 91.32%
Position 214, Iteration 70000, Errors: 0, completed 91.36%
Position 214, Iteration 80000, Errors: 0, completed [COLOR="Red"]-[/COLOR]91.36%
Position 214, Iteration 90000, Errors: 0, completed [COLOR="Red"]-[/COLOR]91.32%
Position 214, Iteration 100000, Errors: 0, completed [COLOR="Red"]-[/COLOR]91.28%
[/CODE]

Quick fix line 137:[CODE]
printf("Position %d, Iteration %d, Errors: %d, completed %2.2f%%\n", pos, k, total, ([COLOR="Red"][B](double)[/B][/COLOR]pos*iter+k)*100 / (double) (s*iter));
[/CODE]

Oliver

owftheevil 2013-05-23 20:52

The numbers actually get that big? I'm often amazed at the things I can't imagine. Thanks.

Carl

TheJudger 2013-05-23 21:12

Hi Carl,

you could move the *100 to the other side of the division (*0.01); then it would take much longer to trigger the overflow. Currently it triggers at (2^31-1)/100 ≈ 21.5M iterations.
You [B]could[/B] also add some timing information (iterations per second, estimated remaining time) to your memtest if you have some spare time.

Oliver

owftheevil 2013-05-23 22:17

Oliver

Thanks for the suggestions. Here's what I'm planning:

1. Include device and environment info at the beginning of the test.
2. Include timing, eta, and temperature info at each report.
3. Give address ranges of the memory being tested rather than the uninformative position 1 etc.

Don't know when I will get to it though.

Carl

James Heinrich 2013-05-25 13:12

I'm confused about benchmark timings vs production timings. For example, on my GTX 670 I get this:[code]Iteration 10000 M( 57885161 )C, 0x76c27556683cd84d, n = 3200K,
CUDALucas v2.04 Beta err = 0.1076 (1:21 real, 8.0405 ms/iter, ETA 129:15:02)[/code]And indeed, 57885161 * 0.0080405s = 129.28 hours, so I believe the 8ms/it.

However, when running a benchmark on 3200K I get this:[code]cudalucas -cufftbench 3276800 3276800 32768
CUFFT bench start = 3276800 end = 3276800 distance = 32768
CUFFT_Z2Z size= 3276800 time= 3.704131 msec[/code]Why do I get 3.7ms on the benchmark but 8.0ms when testing an exponent?

Prime95 2013-05-25 14:06

An LL iteration consists of a forward FFT, a point-wise squaring, an inverse FFT, and a rounding-to-integer-and-propagating-carries-step.

The benchmark only times one of the FFTs. So, your LL iteration did two 3.7ms FFTs, and spent 0.6 ms doing point-wise squaring and rounding/carry.

owftheevil 2013-05-25 14:10

cufftbench only times the ffts. 1 iteration of an LL test consists of 2 ffts, pointwise multiplication, normalization, and splicing. For a rough equivalence of the two timings, pretend iteration times are a multiple of fft times. A more accurate equivalence is iteration time = 2 * fft + k * n for some constant k and fft length n.

Edit: Looks like Prime95 beat me to it.

Manpowre 2013-05-26 20:05

[QUOTE=James Heinrich;341532]I'm confused about benchmark timings vs production timings. For example, on my GTX 670 I get this:[code]Iteration 10000 M( 57885161 )C, 0x76c27556683cd84d, n = 3200K,
CUDALucas v2.04 Beta err = 0.1076 (1:21 real, 8.0405 ms/iter, ETA 129:15:02)[/code]And indeed, 57885161 * 0.0080405s = 129.28 hours, so I believe the 8ms/it.

However, when running a benchmark on 3200K I get this:[code]cudalucas -cufftbench 3276800 3276800 32768
CUFFT bench start = 3276800 end = 3276800 distance = 32768
CUFFT_Z2Z size= 3276800 time= 3.704131 msec[/code]Why do I get 3.7ms on the benchmark but 8.0ms when testing an exponent?[/QUOTE]

When I read the CUDALucas code, there are several things happening for each iteration: first a memcopy to the GPU, then the FFT work on the copied arrays, then normalizing the array and rounding.

You are testing exponent 57885161 at a 3200K FFT length; the benchmark times only the raw FFT, while a full iteration also includes the squaring, normalization and carry steps, so the extra time makes sense.

Manpowre 2013-05-26 21:46

2.05 alpha and FFT lengths
 
I tested the 2.05 alpha today, and I found that the FFT test does something with the FFT length which apparently slows down the iteration tremendously.

2.05 alpha with automatically chosen FFT length:
Iteration 40000 M( 61787581 )C, 0x00000000e3305715, n = 3360K, CUDALucas v2.05 Alpha err = 0.21875 (0:48 real, 4.8966 ms/iter, ETA 83:58:36)

2.05 alpha with manually chosen FFT length equal to what 2.03 uses:
Iteration 70000 M( 61787581 )C, 0x00000000435d0c3f, n = 3584K, CUDALucas v2.05 Alpha err = 0.04297 (0:45 real, 4.4737 ms/iter, ETA 76:41:13)

It's a 7h difference between how 2.03 chooses the FFT length and how 2.05 chooses it.

Running the same exponent on 2.03 with a manual FFT of 3584K gives this:
Iteration 60000 M( 61787581 )C, 0x026b17031f430ab1, n = 3670016, CUDALucas v2.03 err = 0.0469 (0:44 real, 4.4753 ms/iter, ETA 76:43:38)

prime7989 2013-05-28 09:02

[QUOTE=Manpowre;341637]I tested the 2.05 alpha today, and I found that the FFT test does something with the FFT length which apparently slows down the iteration tremendously.

2.05 alpha with automatically chosen FFT length:
Iteration 40000 M( 61787581 )C, 0x00000000e3305715, n = 3360K, CUDALucas v2.05 Alpha err = 0.21875 (0:48 real, 4.8966 ms/iter, ETA 83:58:36)

2.05 alpha with manually chosen FFT length equal to what 2.03 uses:
Iteration 70000 M( 61787581 )C, 0x00000000435d0c3f, n = 3584K, CUDALucas v2.05 Alpha err = 0.04297 (0:45 real, 4.4737 ms/iter, ETA 76:41:13)

It's a 7h difference between how 2.03 chooses the FFT length and how 2.05 chooses it.

Running the same exponent on 2.03 with a manual FFT of 3584K gives this:
Iteration 60000 M( 61787581 )C, 0x026b17031f430ab1, n = 3670016, CUDALucas v2.03 err = 0.0469 (0:44 real, 4.4753 ms/iter, ETA 76:43:38)[/QUOTE]

Which GPU do you get these timings on?
On a GTX Titan with the same exponent as yours, for an LL with FFT length 3072K, I get timings of 3.8636 ms per iteration and a total ETA of 66:26 hrs.
I compiled CUDALucas 2.05 alpha with CUDA 5.0 and sm_30, with the GPU kernel launches changed to use 256-thread blocks:

From:

[CODE]
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
rftfsub_kernel <<< n / 512, 128 >>> (n, g_x, g_ct);
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));

normalize_kernel <<<n / threads, threads >>>
(g_x, g_data, g_ttp, g_ttmp, g_numbits, digit, bit, g_err, *maxerr, error_flag || t_f);
normalize2_kernel <<< ((n + threads - 1) / threads + 127) / 128, 128 >>>
(g_x, n, threads, g_data, g_ttp1);
[/CODE]
TO:

[CODE] cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
rftfsub_kernel <<< n / 256, 256 >>> (n, g_x, g_ct);
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));

normalize_kernel <<<n / threads, threads >>>
(g_x, g_data, g_ttp, g_ttmp, g_numbits, digit, bit, g_err, *maxerr, error_flag || t_f);
normalize2_kernel <<< ((n + threads - 1) / threads + 127) / 256, 256 >>>
(g_x, n, threads, g_data, g_ttp1);[/CODE]

Manpowre 2013-05-28 20:21

[QUOTE=prime7989;341749]Which GPU do you get these timings on?
On a GTX Titan with the same exponent as yours, for an LL with FFT length 3072K, I get timings of 3.8636 ms per iteration and a total ETA of 66:26 hrs.
[/QUOTE]

It's a GTX Titan too. I made the same change as you did, and still get 0.5 ms per iteration, 84h, with the same FFT length as you ran your test with,

while 2.03 gives me 0.43 ms per iteration.

prime7989 2013-05-28 20:48

For the GTX Titan, change the nvcc option in the Makefile to sm_30
 
[QUOTE=Manpowre;341794]It's a GTX Titan too. I made the same change as you did, and still get 0.5 ms per iteration, 84h, with the same FFT length as you ran your test with,

while 2.03 gives me 0.43 ms per iteration.[/QUOTE]
For the GTX Titan, change the nvcc option in the Makefile to sm_30 instead of sm_13 or sm_35. Tell me if you get timings similar to mine.
Al

Manpowre 2013-05-28 21:43

[QUOTE=prime7989;341795]For the GTX Titan, change the nvcc option in the Makefile to sm_30 instead of sm_13 or sm_35. Tell me if you get timings similar to mine.
Al[/QUOTE]

I didn't compile it with a makefile; I made a VS2010 project and mapped all the .h files. I also tried sm_30 and compute_30: same result. Strange.
My 2.03 was also compiled on the same machine.

owftheevil 2013-05-28 22:30

Hi Manpowre, prime7989,

If you want to run those kernels with 256 threads per block and you also want correct results, please make the following changes:

From:

[CODE]
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
rftfsub_kernel <<< n / 256, 256 >>> (n, g_x, g_ct);
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
normalize_kernel <<<n / threads, threads >>>
(g_x, g_data, g_ttp, g_ttmp, g_numbits, digit, bit, g_err, *maxerr, error_flag || t_f);
normalize2_kernel <<< ((n + threads - 1) / threads + 127) / 256, 256 >>>
(g_x, n, threads, g_data, g_ttp1);[/CODE]to:

[CODE]
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
rftfsub_kernel <<< n / 1024, 256 >>> (n, g_x, g_ct);
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
normalize_kernel <<<n / threads, threads >>>
(g_x, g_data, g_ttp, g_ttmp, g_numbits, digit, bit, g_err, *maxerr, error_flag || t_f);
normalize2_kernel <<< ((n + threads - 1) / threads + 255) / 256, 256 >>>
(g_x, n, threads, g_data, g_ttp1);[/CODE]The timings are somewhat meaningless otherwise.

Karl M Johnson 2013-05-28 23:04

I recall sm_13 being the best target architecture for most GPUs for CuLu, just saying.

LaurV 2013-05-29 02:39

What are you guys talking about, "0.5" and "0.43" ms/iteration? I only see 4.xxx ms with the VS-compiled version, and 3.8xxx with the original, which is FASTER, and [B]not the other way around[/B]. What is the problem? I can't understand. For the record, we get about 5.xx ms with our OC'd GTX 580, and even our i7-2600K gets 26 ms/iter/core, which (compounded over all 4 cores, running 4 workers) is equivalent to 6-7 ms/iter. My opinion is still "wait for Maxwell". The difference in performance for the Titans is not worth the difference in money spent; of course, that changes if you bought it for other purposes (gaming, bitcoins, folding proteins, whatever, I don't know). If your main goal is primes/factoring, then the 570 is still the best buy, followed by the 580.

LaurV grumpy...

kracker 2013-05-29 02:46

[QUOTE=LaurV;341829] then 570 is still the best buy, followed by 580.

LaurV grumpy...[/QUOTE]

Never mind, I can't "officially" buy a 570 or 580, I think...

Just out of curiosity: how many GHz-days (yes, TF... wrong thread?) do you/can you get on a 680, per $?

kladner 2013-05-29 03:08

[QUOTE]If you main goal is primes/factoring, then 570 is still the best buy, followed by 580.

LaurV grumpy... [/QUOTE]

Very grumpy here, too, that Gigabyte is trying to foist a 660 Ti on me as a "replacement" for my 570.

prime7989 2013-05-29 03:15

LL on Mersenne, Fermat numbers and Pepin's test
 
Hi Owftheevil, Manpowre,
My recommended mod to the threads in the kernel invocation does not give the right results. For example, it says M110503 is composite instead of prime. So I would suggest leaving that at 128 until someone who understands the CUDALucas 2.05 Alpha code comes along and joins the fray.
However, my mod for Fermat numbers can be done in versions 2.03 and 2.05 alpha if n is calculated in code as 2^2^n+1 instead of 2^p-1. Does anyone know in which lines of code n = 2^p-1 is calculated, or is it inherent in the cufft calls?
Thank you!
PS For example, the 2.03 version runs M61787581 with an ETA of 60 hrs, so it will take a few days to check whether that is prime.
Also, trial division of Mp by small primes should be added to parse.c; this can be done simply by implementing it with GMP.
Also, is there any GPU code that does a Pepin's test on Fermat numbers?

owftheevil 2013-05-29 03:29

Well, I certainly understand R.D. Silverman better now.

frmky 2013-05-29 03:46

[QUOTE=prime7989;341838]Hi Owftheevil, Manpowre,
... So I would suggest leaving that at 128 until someone who understands the CUDALucas 2.05 Alpha code comes along and joins the fray...[/QUOTE]
Note that Owftheevil does understand the code and has made significant modifications to msft's original code. :smile:

LaurV 2013-05-29 06:42

[QUOTE=kracker;341832]Never mind I can't "officially" buy a 570 or 580 I think...

Just for my curiosity, how many Ghzdays(Yes, TF..wrong thread?) do you/can you get on at 680 per $?[/QUOTE]

I don't own 6xx cards, but from former testing and from James' page, their performance is lousy compared with the 5xx. I think you are getting at GHzD per Watt, not per dollar. Well, in the end it comes down to dollars too, but only if you plan to use them for a long time, or live in an area where electricity is expensive. The best part of the Keplers (6xx) is the power management: compared to the Fermis they have half the DP performance or less, but consume a third or a quarter of the power (it highly depends on the cards you compare, but you are in these ranges), so they have better performance per Watt. Used long-term in a place with expensive electricity, they can be a good investment.

[URL="http://www.mersenne.ca/cudalucas.php?sort=gpw"]James' site[/URL] is a perfect starting point. You can sort it by other columns too, by clicking on the column head (:P)
[edit: for TF, as I did not see the first time that you asked for TF, the [URL="http://www.mersenne.ca/mfaktc.php?sort=jvr"]link is here[/URL]. Read about the last two columns, very good comparison criteria!]

Manpowre 2013-05-29 08:37

[QUOTE=prime7989;341838]Hi Owftheevil, Manpowre,
My recommended mod to the threads in the kernel invocation does not give the right results. For example, it says M110503 is composite instead of prime. So I would suggest leaving that at 128 until someone who understands the CUDALucas 2.05 Alpha code comes along and joins the fray.
However, my mod for Fermat numbers can be done in versions 2.03 and 2.05 alpha if n is calculated in code as 2^2^n+1 instead of 2^p-1. Does anyone know in which lines of code n = 2^p-1 is calculated, or is it inherent in the cufft calls?
Thank you!
PS For example, the 2.03 version runs M61787581 with an ETA of 60 hrs, so it will take a few days to check whether that is prime.
Also, trial division of Mp by small primes should be added to parse.c; this can be done simply by implementing it with GMP.
Also, is there any GPU code that does a Pepin's test on Fermat numbers?[/QUOTE]

Hi, thanks for this. I had a feeling the modifications might end up giving wrong results, but didn't get to test it. I am just testing the 2.05 alpha, and I can't get it to be as quick as 2.03; I compiled sm_20, sm_30 and sm_35 (didn't do sm_13 yet). I also see that 2.05 alpha doesn't set the FFT length as effectively as 2.03, so I am only running 2.03 at the moment.

The only way I can get the iterations below 4 ms each is to take the memory clock back up to stock, but that doesn't keep the Titan stable; as I understand it, the memory on the back side of the card heats up too much.

kracker 2013-05-29 14:21

[QUOTE=LaurV;341846]
[URL="http://www.mersenne.ca/cudalucas.php?sort=gpw"]James' site[/URL] is a perfect starting point. You can sort it by other columns too, by clicking on the column head (:P)
[edit: for TF, as I did not see the first time that you asked for TF, the [URL="http://www.mersenne.ca/mfaktc.php?sort=jvr"]link is here[/URL]. Read about the last two columns, very good comparison criteria!][/QUOTE]

So I guess buying a Titan for TF is terrible... but even for LL testing, is it worth the grand out of my wallet?

Also, on James's site I saw that the 7970 GHz beats the 580... is that true? :confused:

James Heinrich 2013-05-29 15:03

[QUOTE=kracker;341879]I saw that the 7970 GHz beats the 580... is that true? :confused:[/QUOTE][i]Bdot[/i] has recently released mfakto v0.13 with GPU-sieving support, so AMD has moved up the rankings accordingly.

LaurV 2013-05-29 15:27

Indeed. Looking at the results on the bitcoin forums, I always wondered why the Radeons did not beat the Nvidias at TF from the very beginning... :razz: They are better at integer math, I mean if you don't ask them to do much DP calculus, and as soon as someone ports an FFT package to OpenCL (there is already one done by Apple, I posted a link somewhere), they may surprise us with LL test speed too... It hurts me a bit to say that, as I am a pure Nvidia/CUDA guy...

James Heinrich 2013-06-01 10:59

[QUOTE=owftheevil;341535]cufftbench only times the ffts. 1 iteration of an LL test consists of 2 ffts, pointwise multiplication, normalization, and splicing. For a rough equivalence of the two timings, pretend iteration times are a multiple of fft times. A more accurate equivalence is iteration time = 2 * fft + k * n for some constant k and fft length n.[/QUOTE]How hard would it be to include a benchmark mode that simulates actual iteration times, not just FFT times?
The FFT benchmark is useful for deciding which FFT size to use for a given exponent, but it's not very useful for helping to predict performance.

owftheevil 2013-06-01 13:19

[QUOTE]How hard would it be to include a benchmark mode that simulates actual iteration times, not just FFT times?
The FFT benchmark is useful for deciding which FFT size to use for a given exponent, but it's not very useful for helping to predict performance.
[/QUOTE]

Not too much trouble. The most time consuming part would be finding reasonable exponents for each of the different fft lengths.

This would also be useful for CUDALucas itself. There could be cases where shorter ffts take longer, but yield better iteration times.

James Heinrich 2013-06-01 14:08

[QUOTE=owftheevil;342234]Not too much trouble. The most time consuming part would be finding reasonable exponents for each of the different fft lengths.[/QUOTE]And if I (or someone) could assist by coming up with such a list, that would help the feature get implemented sooner? :smile:
If you have any suggestions for methodology for determining suitable exponents then I don't mind preparing that list for you. My method would be guess-and-check: try different exponents and see what FFT size is selected, populate my chart, and then try and fill in the gaps by guess-and-checking at what exponent would best fill the in-between FFT sizes. Would that work?

owftheevil 2013-06-01 14:43

Or just multiply fft by 15 or so and pick a prime close to that.

But to get it distributed so that you can get results soon would be a problem.

owftheevil 2013-06-01 15:04

Another thought, actual exponents might not be necessary for just a timing benchmark.

owftheevil 2013-06-01 19:52

Here's what it looks like so far. I've done a few trial runs with exponents near those specified in the table. Maximum errors are coming in around 0.25, and with all the error checking and checkpoint processing, iteration times are 0.1%-0.7% higher in the actual test than in the benchmark results.

[CODE]filbert@filbert:~/Build/CudaLucas-2.052$ ./CUDALucas -cufftbench 3000 4000 5 -d 1

------- DEVICE 1 -------
name GeForce GTX 560 Ti
Compatibility 2.1
clockRate (MHz) 1900
memClockRate (MHz) 2080
totalGlobalMem 1073414144
totalConstMem 65536
l2CacheSize 524288
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 1536
multiProcessorCount 8
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
textureAlignment 512
deviceOverlap 1

CUFFT bench testing fft sizes 3000K to 4000K, doing 5 passes.
Pass 1, fft size = 3000K, exp up to 55296000, ave time = 9.306 msec, max-ave = 0.00000
Pass 1, fft size = 3024K, exp up to 55738368, ave time = 8.174 msec, max-ave = 0.00000

.
.
.

Pass 5, fft size = 3000K, exp up to 55296000, ave time = 9.307 msec, max-ave = 0.00132
Pass 5, fft size = 3024K, exp up to 55738368, ave time = 8.173 msec, max-ave = 0.00080
Pass 5, fft size = 3072K, exp up to 56623104, ave time = 8.443 msec, max-ave = 0.00111
Pass 5, fft size = 3087K, exp up to 56899584, ave time = 9.114 msec, max-ave = 0.00192
Pass 5, fft size = 3125K, exp up to 57600000, ave time = 10.053 msec, max-ave = 0.00158
Pass 5, fft size = 3136K, exp up to 57802752, ave time = 8.120 msec, max-ave = 0.00069
Pass 5, fft size = 3150K, exp up to 58060800, ave time = 9.222 msec, max-ave = 0.00037
Pass 5, fft size = 3200K, exp up to 58982400, ave time = 8.691 msec, max-ave = 0.00017
Pass 5, fft size = 3240K, exp up to 59719680, ave time = 9.396 msec, max-ave = 0.00124
Pass 5, fft size = 3360K, exp up to 61931520, ave time = 9.655 msec, max-ave = 0.00039
Pass 5, fft size = 3375K, exp up to 62208000, ave time = 10.542 msec, max-ave = 0.00272
Pass 5, fft size = 3402K, exp up to 62705664, ave time = 9.492 msec, max-ave = 0.00043
Pass 5, fft size = 3430K, exp up to 63221760, ave time = 10.350 msec, max-ave = 0.00118
Pass 5, fft size = 3456K, exp up to 63700992, ave time = 9.091 msec, max-ave = 0.00074
Pass 5, fft size = 3500K, exp up to 64512000, ave time = 10.426 msec, max-ave = 0.00123
Pass 5, fft size = 3528K, exp up to 65028096, ave time = 9.722 msec, max-ave = 0.00029
Pass 5, fft size = 3584K, exp up to 66060288, ave time = 9.315 msec, max-ave = 0.00096
Pass 5, fft size = 3600K, exp up to 66355200, ave time = 10.074 msec, max-ave = 0.00073
Pass 5, fft size = 3645K, exp up to 67184640, ave time = 11.112 msec, max-ave = 0.00154
Pass 5, fft size = 3675K, exp up to 67737600, ave time = 11.758 msec, max-ave = 0.00391
Pass 5, fft size = 3750K, exp up to 69120000, ave time = 12.185 msec, max-ave = 0.00084
Pass 5, fft size = 3780K, exp up to 69672960, ave time = 11.054 msec, max-ave = 0.00065
Pass 5, fft size = 3840K, exp up to 70778880, ave time = 11.030 msec, max-ave = 0.00142
Pass 5, fft size = 3888K, exp up to 71663616, ave time = 10.535 msec, max-ave = 0.00155
Pass 5, fft size = 3920K, exp up to 72253440, ave time = 10.931 msec, max-ave = 0.00047
Pass 5, fft size = 3969K, exp up to 73156608, ave time = 11.438 msec, max-ave = 0.00165
Pass 5, fft size = 4000K, exp up to 73728000, ave time = 11.430 msec, max-ave = 0.00124
filbert@filbert:~/Build/CudaLucas-2.052$ [/CODE]

James Heinrich 2013-06-01 21:22

[QUOTE=owftheevil;342267]Here's what it looks like so far.
iteration times are 0.1%-0.7% higher in the actual test than in the benchmark results.[/QUOTE]That's beautiful!
1% deviation from actual isn't a problem for my purposes, where I'm aggregating results from many people on many GPUs for approximate average performance for a given architecture.

Now I'll just have to wait until this new benchmark mode is widely available (meaning, in many cases, Windows binaries).

TheJudger 2013-06-05 18:53

Hi Carl,

[QUOTE=owftheevil;340301]Here's version 0.13 with Olivers requested fflush(NULL) statements.

Oliver, thanks for showing me that.[/QUOTE]

two more wishes:[LIST=1][*]add a "clean" target to the Makefile[*]add "--generate-code arch=compute_30,code=sm_30" to the CUFLAGS in the Makefile[/LIST]
Your memtest refuses to run on GTX 690 (GK104? GK10x?) when compiled with the default CUFLAGS (Linux, nvcc V0.2.1221 (CUDA 5.0), driver 319.23):
[CODE]./memtest 10 10 0

Initializing test using 250MiB of memory on device 0

memtest.cu(187) : cufftSafeCall() CUFFT error 6: CUFFT_EXEC_FAILED[/CODE]
It doesn't matter how much memory, iterations or which device I choose.
When I add compute_30,sm_30 to the Makefile and recompile it runs fine here (GTX 690, TITAN, Tesla K20).

As you can see I'm really using your tool!

Oliver

P.S. I'm remembering this one: [url]http://www.mersenneforum.org/showpost.php?p=197975&postcount=8[/url] - should Carl's memtest get its own thread?
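
The two Makefile additions might look like this (a hedged sketch: the CUFLAGS variable name is from the post above, while the binary name and object list are assumptions about the memtest Makefile layout; note that make recipes must be indented with a tab):

```make
# generate Kepler (sm_30) code alongside the existing targets
CUFLAGS += --generate-code arch=compute_30,code=sm_30

# conventional clean target; adjust the file list to match the build
clean:
	rm -f memtest *.o
```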

msft 2013-06-11 18:11

1 Attachment(s)
Hi ,
I made a Radeon FFT benchmark.
HD7750:
[QUOTE]
$ sh -x ./run.sh
+ rm *.o a.out
+ g++ -c main.cpp -I /opt/AMDAPP/include/ -I /opt/clAmdFft-1.8.291/include/
+ g++ -c clFFTPlans.cpp -I /opt/AMDAPP/include/ -I /opt/clAmdFft-1.8.291/include/
+ g++ main.o clFFTPlans.o /opt/clAmdFft-1.8.291/lib64/libclAmdFft.Runtime.so -lOpenCL -lfftw3
+ export LD_LIBRARY_PATH=:/opt/clAmdFft-1.8.291/lib64/
+ time ./a.out
Using device: Capeverde
AmdFFT_Z2Z size= 524288 time= 2.780000 msec
Everything went fine!
31.54user 249.49system 4:41.49elapsed 99%CPU (0avgtext+0avgdata 1334368maxresident)k
0inputs+45368outputs (0major+151072minor)pagefaults 0swaps
$
[/QUOTE]:smile:

kracker 2013-06-11 18:39

[QUOTE=msft;343077]Hi ,
I made a Radeon FFT benchmark.
HD7750:
:smile:[/QUOTE]

:shock: Wow. At last someone...
If you need testers for anything I have two 7770's.

owftheevil 2013-06-11 19:33

Hi Oliver,

[QUOTE]two more wishes:[LIST=1][*]add a "clean" target to the Makefile[*]add "--generate-code arch=compute_30,code=sm_30" to the CUFLAGS in the Makefile[/LIST][/QUOTE]

Noted and added to the list. Thanks as always.

Carl

owftheevil 2013-06-11 19:36

Hi msft,

At first glance that time seems a bit slow, but then I have no idea how strong or weak a card the 7750 is. What would a comparable Nvidia card be?
In any event, those are interesting results. Do you plan on making an LL test out of this?

Carl

kracker 2013-06-11 19:36

[QUOTE=owftheevil;343087]Hi msft,

At first glance that time seems a bit slow, but then I have no idea how strong or weak a card the 7750 is. What would a comparable Nvidia card be?

Carl[/QUOTE]

A 650 or a 550, not the Ti versions.

EDIT: Sorry, wasn't my question to answer, just realized that.

owftheevil 2013-06-11 19:45

Hi Kracker,

No problem, the question was for anyone with information. Thanks. The time seems much more reasonable now.

flashjh 2013-06-11 21:39

All: I've been out of the loop for a while; we're just about settled in FL. I would like to get Carl's (et al.) updates incorporated and working. SourceForge has 2.04b, but I don't see 2.05. Can anyone catch me up on current developments?

Thanks

msft 2013-06-12 01:06

[QUOTE=kracker;343080]If you need testers for anything I have two 7770's.[/QUOTE]
I appreciate your offer.

msft 2013-06-12 01:08

[QUOTE=owftheevil;343087]Do you plan on making an LL test out of this?
[/QUOTE]
Some day.

msft 2013-06-13 05:39

1 Attachment(s)
New Version.
HD7750:
[QUOTE]
$ sh -x ./run.sh
+ rm *.o a.out
+ g++ -c main.cpp -I /opt/AMDAPP/include/ -I /opt/clAmdFft-1.10.321/include/
+ g++ -c clFFTPlans.cpp -I /opt/AMDAPP/include/ -I /opt/clAmdFft-1.10.321/include/
+ g++ main.o clFFTPlans.o /opt/clAmdFft-1.10.321/lib64/libclAmdFft.Runtime.so -lOpenCL -lfftw3
+ export LD_LIBRARY_PATH=:/opt/clAmdFft-1.10.321/lib64/
+ time ./a.out
Using device: Capeverde
AmdFFT size= 1048576 time= 4.200000 msec
AmdFFT size= 2097152 time= 7.900000 msec
AmdFFT size= 3145728 time= 27.000000 msec
AmdFFT size= 4194304 time= 15.900000 msec
AmdFFT size= 5242880 time= 80.200000 msec
AmdFFT size= 6291456 time= 137.800000 msec
23.93user 71.93system 45:40.64elapsed 3%CPU (0avgtext+0avgdata 970032maxresident)k
0inputs+18136outputs (0major+120912minor)pagefaults 0swaps
$
[/QUOTE]

