hehe
Well, the HyperQ thread initialization was in the device init function, not in the Lucas init function, so when the code destroyed the threads they weren't recreated. Fixed, so now I ran -r successfully.

I started 57885161, the 48th Mersenne prime, and the ETA is 22h 14m on one Titan card. That is very, very quick. The 25964951 prime is expected in 4h 18m, and the 20996011 prime in 3h 21m. The 3h run is the one I will leave on here until I get up from bed; gotta sleep. :)

I also tried a few smaller ones:
M( 132049 )C, 0xfffffffffffffff3, n = 7168, CUDALucas v2.03
M( 216091 )C, 0xfffffffffffffff3, n = 12288, CUDALucas v2.03
M( 756839 )C, 0xfffffffffffffffb, n = 40960, CUDALucas v2.03
Are you running the different kernels in parallel?
[QUOTE=Manpowre;340332]1 HyperQ thread = 22h 27m = the same as cudalucas normal mode
4 HyperQ threads = 10h 40m
8 HyperQ threads = 8h 12m
16 HyperQ threads = 7h 22m
24 HyperQ threads = 21h 28m
32 HyperQ threads = 13h[/QUOTE]
Is that the code executed on one Titan? Or both of them?
Titan HyperQ
Answer to the second question.

Running kernels in parallel, yes. Well, cudalucas runs all the kernels, but Nvidia has made Hyper-Q available, which means one kernel is inserted and, one step in, a second kernel gets inserted into the same CUDA processor. I ran those tests with HyperQ, yes.

I was thinking about the 24 parallel tests: it's probably slow because 24 is not in the halving chain 32/2 = 16, 16/2 = 8, 8/2 = 4, so 4, 8, 16 and 32 kernels are the counts to use.

The 20996011 prime test finished in a little more than 3h with a results file, so all good so far, and I haven't even started optimization. I still need to run this through a profiler to see that the kernels are inserted tightly enough, and there are still 3 more different CUDA kernel calls to give the same kind of optimization. Then the cleanup: arguments need to be added, plus some textual info back to the user that they are now running HyperQ.
[QUOTE=Karl M Johnson;340343]Is that the code executed on one Titan?
Or both of them?[/QUOTE]
The code was executed on one Titan. I haven't looked into the grid part yet, where I could use 2 Titans for one test. That is of course going to be interesting when I can do it.
Hehe, so no multi-GPU.
Very nice speedup.
[QUOTE=Manpowre;340352]Answer to the second question.
Running kernels in parallel, yes. Well, cudalucas runs all the kernels, but Nvidia has made Hyper-Q available, which means one kernel is inserted and, one step in, a second kernel gets inserted into the same CUDA processor.[/QUOTE]
Actually, cudalucas runs the kernels sequentially. The kernels are not independent: each must start with the output of the previous kernel to give correct results. The residues for the short tests you posted should all have been zero. (Well, cudalucas shouldn't have shown any residue at all, but instead gloriously announced that the number is prime.)

You could instead run two separate tests in parallel. You won't get the low single-test times, but you will still almost double the throughput: two tests could finish in 24h, as opposed to one test in 21h. I was planning to get around to that sometime this summer, but using regular streaming methods instead of HyperQ, so that the more mundane cards could see some benefit too.
[QUOTE=owftheevil;340377]Actually, cudalucas runs the kernels sequentially. The kernels are not independent, but must start with the output of the previous kernel to give correct results. The residues for the short tests you posted should all have been zero. (Well, cudalucas shouldn't have shown any residue at all, but instead gloriously announced that the number is prime.)
You could instead run two separate tests in parallel. You won't get the low single-test times, but you will still almost double the throughput. Two tests could finish in 24h as opposed to one test in 21h. I was planning to get around to that sometime this summer, but using regular streaming methods instead of HyperQ. That way the more mundane cards could see some benefit too.[/QUOTE]
When I debugged the code, I saw the RDSP call pass a nice matrix to the GPU, then the normalize function call. You are probably right about the output; it's normalization code, and it probably only appeared to work in my code because I have only tested known primes so far, so the normalization code probably didn't kick in. What I actually saw in the cudalucas code is that there are 2 iterations which I believe are not visited doing it this way. This is the reason I won't publish the code until I am 100% sure it's doing things the right way. But at least HyperQ works and, with prime numbers, it iterates through very quickly.
The normalize kernels are an essential part of each iteration, whether the number being tested is prime or not. It would probably be a good idea to learn the IBDWT algorithm you are working with here before spending too much more time optimizing the code.
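For background, the arithmetic the IBDWT is built on can be shown in miniature: because 2^p ≡ 1 (mod 2^p − 1), reducing mod M_p is just folding the bits above position p back onto the low bits, which is what lets squaring mod M_p be done as a cyclic convolution (a weighted FFT plus carry/normalize kernels in CUDALucas). A Python sketch of just the wraparound step, with `mod_mersenne` as a hypothetical helper name:

```python
def mod_mersenne(x, p):
    """Reduce x mod M_p = 2**p - 1 without division: since 2**p is
    congruent to 1 mod M_p, the bits above position p fold back onto
    the low bits.  This cyclic wraparound is the property the IBDWT
    exploits."""
    m = (1 << p) - 1
    while x > m:
        x = (x >> p) + (x & m)  # fold high part onto low part
    return 0 if x == m else x

p = 13
x = 123456789
print(mod_mersenne(x * x, p) == (x * x) % ((1 << p) - 1))  # True
```

The fold is cheap bit arithmetic, which is why the expensive part of each iteration is the transform and the carry propagation (normalize), not the modular reduction itself.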
[QUOTE=owftheevil;340391]The normalize kernels are an essential part of each iteration, whether the number being tested is prime or not. It would probably be a good idea to learn the IBDWT algorithm you are working with here before spending too much more time optimizing the code.[/QUOTE]
I understand. Also, the CUDA kernels are launched as a matrix; in my code this just means I spawn the CUDA kernel matrix x 16 HyperQ threads, do the same with the normalization code afterwards, and then increase the counter accordingly. When I look at it, that simply means the GPU is processing a bigger chunk each turn between RDSP and the normalization afterwards.

I did not run the 2 codebases side by side to compare output, and I will do so tonight to see what they produce. I will also trace the input to the RDSP GPU call in both variants, to see what the algorithm inserts into the GPU and what it brings out to normalize. Again, not knowing the algorithm in depth, I might be wrong. I'll look into the algorithm tonight to get a deeper understanding; thank you for the guidance.
[QUOTE=owftheevil;340391]The normalize kernels are an essential part of each iteration, whether the number being tested is prime or not. It would probably be a good idea to learn the IBDWT algorithm you are working with here before spending too much more time optimizing the code.[/QUOTE]
You are right. I read about the FFT and the algorithm for reducing mod the Mersenne number, and it has to happen in sequence. So then I can create an array of 16 HQ streams = 16 numbers under test (big prime exponents), and execute HQ0 as the first prime's round 1, HQ1 as the second prime's round 1, and so on up to 15, then execute the normalization code for each HQn. Each HQ stream will then be independent, but they will be executed in HyperQ steps to maximize GPU usage when the GPU calls are issued. That should take the time of testing 16 primes simultaneously to somewhat more than testing one prime, plus overhead from the complexity of each CUDA kernel. Would that do it?