mersenneforum.org

Nvidia Cuda 6.0 unified memory management (https://www.mersenneforum.org/showthread.php?t=18902)

Manpowre 2013-11-17 12:33

Nvidia Cuda 6.0 unified memory management
 
Nvidia is about to release CUDA 6.0, which introduces unified memory management. I created this thread to discuss it.

[url]http://www.theregister.co.uk/2013/11/16/nvidia_reveals_cuda_6_joins_cpugpu_shared_memory_party/[/url]

[url]http://hexus.net/tech/news/graphics/62493-nvidia-cuda-6-offers-unified-memory-programming-system/[/url]

From reading the links, my first thought is that it will not speed up CUDALucas or P-1 testing with CUDA, because CUDA will still copy the memory over to the GPU anyway.

However, by simplifying the memcopy operations, we can move all memory variables onto the GPU as references and clean up the code a bit, so that we can actually build a second thread inside e.g. CUDALucas and utilize the HyperQ functionality of Titan and Tesla K20X boards.

I am currently using HyperQ for my own mathematical algorithms, and it is working just fine, but the CUDALucas code has been too complex to build a second thread into.

What we might also see with CUDA applications using the new technique from CUDA 6.0 is the possibility of 2 threads (or 2 instances of e.g. CUDALucas) running against the same GPU. Since the memcopy will be done by the CUDA API, one thread could use the GPU while the other copies memory back to host memory for its CPU cycles. When those CPU cycles are done, the roles swap: the second thread copies back and executes on the GPU while the first copies to host memory for its CPU cycles. This is how it works with Titan boards today on Linux with the gateway function for CUDA (it was developed only for Linux because of the TITAN compute farm project). But with CUDA now taking over the memcopy operations and the host program just referencing the memory, this could become a reality very quickly. I am hoping for this, as the Titan boards would then be able to run 2 or maybe even 3 simultaneous threads on the GPU, keeping the GPU 100% active instead of, as now, waiting for memcopy operations to finish before it becomes active again.
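
To put rough numbers on the overlap idea, here is a minimal sketch in Python. It is a toy model, not a measurement: it assumes each instance alternates between a phase where it runs kernels on the GPU and a phase where it does not need the GPU (copying plus CPU work), and that the other instances can fill those gaps perfectly. The function name and the timings are made up for illustration.

```python
def gpu_utilization(gpu_ms, off_gpu_ms, instances):
    """Idealized fraction of time the GPU is busy.

    gpu_ms      -- time one instance spends running kernels per iteration
    off_gpu_ms  -- time the same instance spends copying and on CPU work,
                   during which it does not need the GPU
    instances   -- number of instances (or threads) sharing the GPU
    """
    if gpu_ms <= 0 or off_gpu_ms < 0 or instances < 1:
        raise ValueError("need gpu_ms > 0, off_gpu_ms >= 0, instances >= 1")
    cycle = gpu_ms + off_gpu_ms
    # Each instance can keep the GPU busy for gpu_ms out of every cycle;
    # the GPU cannot be more than 100% busy.
    return min(1.0, instances * gpu_ms / cycle)


# Hypothetical timings: 6 ms of kernels, 4 ms of copy + CPU work per iteration.
for n in (1, 2, 3):
    print(n, "instance(s):", f"{gpu_utilization(6.0, 4.0, n):.0%}")
```

With these made-up timings a single instance keeps the GPU busy 60% of the time, and a second instance is already enough to saturate it in this idealized model. Real scheduling and copy contention would eat into that.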

Manpowre 2013-11-17 12:42

Also, the new multi-GPU possibilities for the FFT and BLAS libraries are amazing.
In theory, 2 to 8 GPUs can cooperate on a single FFT, speeding it up.

"Drop-in libraries and multi-GPU scaling are also implemented in CUDA 6. We are told that the drop-in libraries will automatically accelerate “BLAS and FFTW calculations by up to 8X by simply replacing the existing CPU libraries with the GPU-accelerated equivalents”. Multi-GPU scaling is also supported in the new BLAS and FFT GPU libraries. These libraries “automatically scale performance across up to eight GPUs in a single node, delivering over nine teraflops of double precision performance per node, and supporting larger workloads than ever before (up to 512GB)”."

The question is whether it will scale 2:1 with 2 cards. I assume there is some overhead here.
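
One way to frame the scaling question is Amdahl's law. The sketch below, in Python, is only a rough model: the parallel fraction is an assumed number, not something taken from Nvidia's libraries, and a real multi-GPU FFT also pays for transfers between cards, which a single fixed fraction does not capture well. The function name is made up for illustration.

```python
def amdahl_speedup(num_gpus, parallel_fraction):
    """Speedup over one GPU if only parallel_fraction of the work scales."""
    if num_gpus < 1 or not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("need num_gpus >= 1 and 0 <= parallel_fraction <= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_gpus)


# Assumed: 90% of the runtime scales across cards, 10% does not.
for gpus in (2, 4, 8):
    print(gpus, "GPUs:", f"{amdahl_speedup(gpus, 0.9):.2f}x")
```

Under that assumed 90% figure, 2 cards would give roughly 1.8x rather than 2x, and the gap widens as more cards are added.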

With this, we can easily get up to 100M Mersenne exponents and bigger on the GPUs, as we have a lot more firepower per node (for those who want to test 100M exponents) and a lot more memory per node with GPUs.

Manpowre 2013-11-17 12:58

Also, Nvidia's press release:
[url]http://nvidianews.nvidia.com/Releases/NVIDIA-Dramatically-Simplifies-Parallel-Programming-With-CUDA-6-a62.aspx[/url]

flashjh 2013-11-28 16:25

Is this going to push a re-write of CUDApm1 and CUDALucas?

Manpowre 2013-11-28 18:53

[QUOTE=flashjh;360541]Is this going to push a re-write of CUDApm1 and CUDALucas?[/QUOTE]

Well, when I did the research for this and read the suggested code change, we actually lose some time to overhead by letting the API do the memcopy instead of doing it ourselves in the code. But with the Maxwell platform, where there will be an ARM CPU next to the GPU, we can probably do the normalization on the ARM CPU instead of on the host CPU. That means we don't have to memcopy. Also, with PCI Express 3, a memcopy will be very fast.

I also expect Nvidia to release some support for HyperQ when doing the memcopy with the API instead of ourselves in the code.
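
On the PCI Express 3 point, a quick back-of-the-envelope calculation in Python shows what the bus can do at best. It uses the theoretical per-direction peak rates of an x16 link (5 GT/s with 8b/10b encoding for PCIe 2.0, 8 GT/s with 128b/130b encoding for PCIe 3.0). Real memcopy throughput is lower, and the buffer size is a made-up example, not a figure from CUDALucas.

```python
# Theoretical per-direction peak rates; real copy throughput is lower.
PCIE_GEN = {
    2: (5.0e9, 8 / 10),     # PCIe 2.0: 5 GT/s per lane, 8b/10b encoding
    3: (8.0e9, 128 / 130),  # PCIe 3.0: 8 GT/s per lane, 128b/130b encoding
}


def peak_bytes_per_second(gen, lanes=16):
    """Peak payload rate of a PCIe link in bytes per second, one direction."""
    transfers_per_s, efficiency = PCIE_GEN[gen]
    return transfers_per_s * efficiency * lanes / 8  # 8 bits per byte


def copy_time_ms(num_doubles, gen, lanes=16):
    """Best-case time in milliseconds to move num_doubles 8-byte values."""
    return num_doubles * 8 / peak_bytes_per_second(gen, lanes) * 1e3


# Hypothetical buffer: 4M doubles (32 MiB).
n = 4 * 1024 * 1024
for gen in (2, 3):
    print(f"PCIe {gen}.0 x16: {copy_time_ms(n, gen):.2f} ms")
```

Even at the theoretical peak, PCIe 3.0 roughly halves the copy time compared to PCIe 2.0 rather than making it free, so avoiding the copy altogether would still help.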


All times are UTC.