That sounds much more like it would work, but I think two simultaneous tests will fill up the GPU. Three might see a slight benefit over two, but more than that won't.
Edit: depending on the FFT size. A smaller FFT means more simultaneous tests will see a benefit.
[QUOTE=owftheevil;340432]That sounds much more like it would work, but I think two simultaneous tests will fill up the GPU. Three might see a slight benefit over two, but more than that won't.
Edit: depending on the FFT size. A smaller FFT means more simultaneous tests will see a benefit.[/QUOTE] Agreed. With 6 GB of memory on the Titan, which at the moment is the only board with Hyper-Q enabled, I guess it's possible to push more than two simultaneous tests. We'll see. I am now going through all the references and variables in the main function calls to allow up to 16 searches at the same time, with an argument to set how many run simultaneously. It's going to take some time. Again, thanks for the very good input.
It's not the memory that is limiting, it's the number of processors and the size of the kernels.
Does that mean we will not see an impressive (3x) reduction in time?
[QUOTE=Manpowre;340423]You are right, I read about FFT and the algorithm to mod a prime down, and it has to happen in sequence. So then I can create an array of 16 HQ streams = 16 numbers under test (big primes), and execute HQ0 as first prime round 1, HQ1 as second prime first round, etc., up to 15, then execute the normalization code for each HQn. Each HQ stream will be independent, but they will be executed in Hyper-Q steps to maximize GPU usage when the GPU calls are executed. That should take the speed of testing 16 primes simultaneously to somewhat more than testing one prime plus overhead, depending on the complexity of each CUDA kernel. Would that do it?[/QUOTE]
[QUOTE=Karl M Johnson;340527]Does that mean we will not see an impressive(3x) reduction in time?[/QUOTE]
I mentioned earlier in the thread that I had to check the results from the new codebase utilizing Hyper-Q, and thanks to good advice it brought me to a level of understanding of the code where it's clear that using Hyper-Q this way can't be done without dramatically rewriting the CudaLucas code, which will take time. The Hyper-Q test did iterate and execute all threads, they just didn't produce real results, because the second GPU execution has to happen after the first GPU execution. In the Kepler GK110 whitepaper ([url]http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf[/url]) NVIDIA says of Hyper-Q: "Applications that previously encountered false serialization across tasks, thereby limiting achieved GPU utilization, can see up to dramatic performance increase without changing any existing code." So far I have had to change the code dramatically even to use Hyper-Q as in the CUDA 5.0 example. CudaLucas was written for one thread, and all its variables support a single parallel task utilizing FFT on the GPU plus CUDA kernel algorithms, which is the reason CudaLucas gets such a great speedup compared to the CPU. There are actually two GPU steps inside CudaLucas, interleaved with CPU code:
[LIST][*]CPU code to initialize CudaLucas[*]GPU code to run FFT part 1[*]CPU code to evaluate GPU part 1[*]GPU code part 2 to normalize the data[*]CPU code to evaluate the condition and steps, plus the exit evaluation[/LIST]
The only way is to use Hyper-Q outside of this running scope: use the GPU while CPU code is running, and insert parallel tasks while GPU code is executing. I am still researching. What I proved is that there is a tremendous speedup from Hyper-Q, but for CudaLucas it probably won't involve the same dataset; it probably means inserting a second or third dataset in parallel.
I also looked into spawning a new thread from within a thread, which could theoretically (without having investigated much) mean that the CPU evaluation code could be moved to the GPU, and GPU code part 2 could then be spawned from GPU code part 1. This is called dynamic parallelism, and at this point it's available only on the GK110 chip. We'll see; I am looking at this without changing the code parts NVIDIA wrote, and will see where that takes me.
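The dynamic-parallelism idea above can be sketched as follows. This is a hedged illustration only: the kernel names and launch configurations are hypothetical placeholders, not CudaLucas's actual symbols.

```cuda
// Dynamic parallelism sketch (CC 3.5+ / GK110 only; compile with
// -arch=sm_35 -rdc=true). Names below are illustrative placeholders.
#include <cuda_runtime.h>

__global__ void normalize_kernel(double *data, int n) {
    // GPU code part 2: carry propagation / normalization would go here.
}

__global__ void fft_square_kernel(double *data, int n) {
    // GPU code part 1: FFT, pointwise squaring, inverse FFT would go here.
    // One thread then launches phase 2 directly from the device, so the
    // CPU round-trip between the two GPU phases disappears.
    if (blockIdx.x == 0 && threadIdx.x == 0)
        normalize_kernel<<<64, 256>>>(data, n);
}
```

The device-side launch is asynchronous, just like a host-side launch, so the parent kernel would still need stream ordering (or `cudaDeviceSynchronize()` on the device) if it wanted to consume phase 2's results itself.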
Hi Manpowre,
Hyper-Q is not (hyper-)threading. Actually, even the oldest CUDA-capable GPUs run multiple threads per core (reason: to hide the latency to memory as well as possible). [LIST][*]CC 1.x GPUs can run only one GPU kernel at any time. If more kernels are launched they have to wait; no matter whether they are launched from a single host process or from different host processes, they are executed in serial order.[*]CC 2.x GPUs can run multiple kernels concurrently if and only if they are launched from the [B]same[/B] host process.[*]CC 3.5 GPUs can run multiple kernels from [B]different[/B] host processes concurrently. This is called Hyper-Q.[/LIST] Oliver
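For reference, a small host-side program using the standard CUDA runtime API can report which of the above categories a given card falls into:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);          // query device 0
    printf("%s: CC %d.%d, concurrentKernels=%d\n",
           prop.name, prop.major, prop.minor, prop.concurrentKernels);
    if (prop.major >= 2)
        printf("Can run concurrent kernels from the same host process\n");
    if (prop.major > 3 || (prop.major == 3 && prop.minor >= 5))
        printf("Hyper-Q capable\n");
    return 0;
}
```

The `concurrentKernels` field only tells you the device supports concurrency at all; the compute-capability checks distinguish the three cases in the list above.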
[QUOTE=TheJudger;340596][*]CC 2.x GPUs can run multiple kernels concurrently if and only if they launched from the [B]same[/B] host process.[*]CC 3.5 GPUs can run multiple kernels from [B]different[/B] host processes concurrently. This is called HyperQ.[/LIST][/QUOTE]
It's not that simple. On CC 2.x, you need to be careful to launch kernels in different streams in a breadth-first manner; launching kernels depth-first creates false dependencies that prevent them from running concurrently. Hyper-Q removes this restriction, and thus benefits kernels even when they are launched from the same host process. Hyper-Q does support concurrently running kernels from different host processes, but this is [B]not[/B] supported by the currently released CUDA toolkit; support for it should be coming in the next version of CUDA.
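The breadth-first vs. depth-first distinction can be sketched like this (a fragment with hypothetical kernels `kernel_a`/`kernel_b`, where each stream's `kernel_b` depends on its own `kernel_a`):

```cuda
// Depth-first: all kernels for stream 0, then stream 1, ...
// On CC 2.x the single hardware queue creates a false dependency:
// stream 1's kernel_a sits behind stream 0's kernel_b in the queue,
// so it cannot start until kernel_a of stream 0 has finished.
for (int s = 0; s < NSTREAMS; ++s) {
    kernel_a<<<grid, block, 0, stream[s]>>>(buf[s]);
    kernel_b<<<grid, block, 0, stream[s]>>>(buf[s]);
}

// Breadth-first: interleave launches across streams so independent
// kernels are adjacent in the queue and can overlap on CC 2.x.
// With Hyper-Q (32 hardware queues) both orders run concurrently.
for (int s = 0; s < NSTREAMS; ++s)
    kernel_a<<<grid, block, 0, stream[s]>>>(buf[s]);
for (int s = 0; s < NSTREAMS; ++s)
    kernel_b<<<grid, block, 0, stream[s]>>>(buf[s]);
```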
[QUOTE=frmky;340620]It's not that simple. In CC 2.x, you need to be careful to launch kernels in different streams in a breadth-first manner. Launching kernels depth-first creates false dependencies that prevent them from running concurrently. HyperQ removes this restriction, and thus benefits even when launched from the same host process.
HyperQ does support concurrently running kernels from different host processes, but this is [B]not[/B] supported by the currently released CUDA toolkit. Support for this should be coming in the next version of CUDA.[/QUOTE] Yepp, I figured that out. I tested with separate consoles against the same card, even setting the environment variable to support this, and it just halved the speed of the code. I am working on changing the CudaLucas code into a C++ program with a .cu file to support this, but it will take time; this is my summer project. But I learned the CudaLucas code in less than a week, so I got pretty far, thanks to all the good responses. I really appreciate it.
BTW: I see gpuLucas uses the dd_real library and CudaLucas uses the NVIDIA toolkit's double2, and I understand the dd_real lib is more accurate and, from what I read, also faster? I can't get the QD lib to compile in my Windows environment. I even tried MinGW and MSYS, but it only targets 32-bit, and even then it won't compile. Then I looked into MPIR, which compiles and links just fine in a separate project, but that lib seems very complicated. Does anyone know a good dd_real lib I can reference for gpuLucas? I just wanted to test gpuLucas compiled on this system; the gpuLucas code is more cleanly written, so it's easier to understand.
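On the dd_real question: double-double arithmetic represents one value as an unevaluated sum of two doubles (hi + lo), giving roughly 32 significant digits. The core building block is Knuth's error-free two-sum; here is a sketch in plain C (which is also valid CUDA device code), illustrative of what libraries like QD build on rather than their actual API:

```c
/* Double-double building block: error-free addition (Knuth's TWO-SUM).
   r.hi + r.lo == a + b exactly; r.hi is the rounded sum and r.lo the
   rounding error. Libraries such as QD's dd_real are built from
   operations like this one. */
typedef struct { double hi, lo; } dd;

static dd two_sum(double a, double b) {
    dd r;
    r.hi = a + b;
    double bb = r.hi - a;                 /* part of b actually absorbed */
    r.lo = (a - (r.hi - bb)) + (b - bb);  /* exact rounding error */
    return r;
}
```

For example, `two_sum(1.0, 1e-17)` keeps the 1e-17 in the lo word even though it vanishes entirely in ordinary double addition.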
I finally figured out how to edit and flash the BIOS on my 560 Ti. At a memory clock of 2089 MHz, CuLu and CPm1 are stable; at 2088 MHz, memtest quits giving errors. Think I'll run it at 2050 MHz.
[QUOTE=owftheevil;340953]I finally figured out how to edit and flash the bios on my 560ti. At memory clock of 2089 Mhz, CuLu and CPm1 are stable, at 2088 Mhz, memtest quits giving errors. Think I'll run it at 2050Mhz.[/QUOTE]
Care to enlighten? I hadn't looked into this option.
The method requires Wine or a Windows virtual machine, a DOS-bootable USB stick, the DOS version of nvflash.exe, NiBiTor.exe, and a reboot every time you want to change any setting. I'll post details if you want them.