mersenneforum.org  

Old 2013-11-17, 12:33   #1
Manpowre
 
"Svein Johansen"
May 2013
Norway

Nvidia CUDA 6.0 unified memory management

Nvidia is about to release CUDA 6.0, which introduces unified memory management, and I created this thread to discuss it.

http://www.theregister.co.uk/2013/11..._memory_party/

http://hexus.net/tech/news/graphics/...amming-system/

Reading the links, my first thought is that this will not speed up CudaLucas or P-1 testing with CUDA, because CUDA still copies the memory over to the GPU anyway.

However, with the memcopy operations simplified, we can pass all memory variables to the GPU as references and clean up the code a bit, so we could actually build a second thread inside e.g. CudaLucas and utilize the Hyper-Q functionality on Titan and Tesla K20x boards.
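To make the "references instead of explicit copies" point concrete, here is a minimal sketch of what CUDA 6.0 unified memory looks like. The kernel and sizes are hypothetical; the point is that a single cudaMallocManaged allocation replaces the usual cudaMalloc/cudaMemcpy pair, and both host and device dereference the same pointer:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void square(double *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= x[i];
}

int main() {
    const int n = 1 << 20;

    // CUDA 6.0 unified memory: one allocation visible to host and device.
    // No explicit cudaMemcpy; the driver migrates the data on demand.
    double *x;
    cudaMallocManaged(&x, n * sizeof(double));
    for (int i = 0; i < n; i++) x[i] = (double)i;

    square<<<(n + 255) / 256, 256>>>(x, n);
    cudaDeviceSynchronize();   // required before the host touches x again

    printf("x[2] = %f\n", x[2]);   // host reads through the same pointer
    cudaFree(x);
    return 0;
}
```

Note the cudaDeviceSynchronize() before the host read: on these Kepler-era boards the host may not touch a managed allocation while a kernel is in flight.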

I am currently using Hyper-Q for my own mathematical algorithms, and it is working just fine, but the CudaLucas code has been too complex to build a second thread into.

What we might also see with CUDA applications using the new CUDA 6.0 technique is the possibility of two threads (two instances of e.g. CudaLucas) running against the same GPU. Since the memcopy will be done by the CUDA API, one thread could use the GPU while the other copies memory back to host memory for its CPU cycles, and vice versa: when the CPU work is done, that thread copies back and executes on the GPU while the first copies to host memory for its CPU cycles.

This is how it works with Titan boards today on Linux, via the gateway function for CUDA (because of the TITAN compute-farm project, this was developed only for Linux). But with CUDA now taking over the memcopy operations and the host program just referencing the memory, this could become a reality very quickly. I am hoping for this, as the Titan boards would then be able to run two or maybe even three simultaneous threads on the GPU, keeping the GPU 100% active instead of, as now, waiting for memcopy operations to finish before activating itself again.
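The overlap described above can be sketched today with two CUDA streams in one process, which is roughly what Hyper-Q multiplexes onto independent hardware queues on Titan/K20x. The kernel body and iteration count are placeholders (the x² − 2 step only gestures at a Lucas-Lehmer iteration); the structure shows a kernel in one stream running concurrently with an async device-to-host copy in the other:

```cuda
#include <cuda_runtime.h>

// Placeholder for one iteration step; a real LL test would do an
// FFT-based squaring mod 2^p - 1 here.
__global__ void iterate(double *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * x[i] - 2.0;
}

int main() {
    const int n = 1 << 20;
    double *h[2], *d[2];
    cudaStream_t s[2];

    for (int k = 0; k < 2; k++) {
        cudaMallocHost(&h[k], n * sizeof(double));  // pinned memory: required for truly async copies
        cudaMalloc(&d[k], n * sizeof(double));
        cudaStreamCreate(&s[k]);
    }

    // While stream 0 runs its kernel, stream 1 can be copying its data
    // back to the host for CPU-side work, and vice versa. On Hyper-Q
    // hardware the two streams do not serialize against each other.
    for (int iter = 0; iter < 100; iter++) {
        for (int k = 0; k < 2; k++) {
            iterate<<<(n + 255) / 256, 256, 0, s[k]>>>(d[k], n);
            cudaMemcpyAsync(h[k], d[k], n * sizeof(double),
                            cudaMemcpyDeviceToHost, s[k]);
            // ... host-side normalization on h[k] would go here,
            //     followed by an async copy back into d[k] ...
        }
    }
    cudaDeviceSynchronize();

    for (int k = 0; k < 2; k++) {
        cudaFreeHost(h[k]);
        cudaFree(d[k]);
        cudaStreamDestroy(s[k]);
    }
    return 0;
}
```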
Old 2013-11-17, 12:42   #2
Manpowre
 
"Svein Johansen"
May 2013
Norway


Also, the new multi-GPU possibilities for the FFT and BLAS libraries are amazing.
Theoretically, 2-8 GPUs can cooperate on a single FFT, speeding up the operation.

"Drop-in libraries and multi-GPU scaling are also implemented in CUDA 6. We are told that the drop-in libraries will automatically accelerate “BLAS and FFTW calculations by up to 8X by simply replacing the existing CPU libraries with the GPU-accelerated equivalents”. Multi-GPU scaling is also supported in the new BLAS and FFT GPU libraries. These libraries “automatically scale performance across up to eight GPUs in a single node, delivering over nine teraflops of double precision performance per node, and supporting larger workloads than ever before (up to 512GB)”."
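The multi-GPU path the quote describes is exposed through the cufftXt interface that ships with CUDA 6.0. A minimal sketch, with a hypothetical transform length and assuming two GPUs in the node; data laid out by cufftXtMalloc is split across the cards and the library coordinates the inter-GPU transposes:

```cuda
#include <cufftXt.h>

int main() {
    const long long N = 1 << 24;   // FFT length (hypothetical)
    int gpus[2] = {0, 1};

    cufftHandle plan;
    size_t worksize[2];
    cufftCreate(&plan);
    cufftXtSetGPUs(plan, 2, gpus);                  // spread the plan over two GPUs
    cufftMakePlan1d(plan, N, CUFFT_Z2Z, 1, worksize);

    cudaLibXtDesc *data;
    cufftXtMalloc(plan, &data, CUFFT_XT_FORMAT_INPLACE);  // storage split across the GPUs
    // ... fill data via cufftXtMemcpy(plan, data, host_buf,
    //     CUFFT_COPY_HOST_TO_DEVICE), with host_buf a hypothetical
    //     N-element cufftDoubleComplex array ...
    cufftXtExecDescriptorZ2Z(plan, data, data, CUFFT_FORWARD);

    cufftXtFree(data);
    cufftDestroy(plan);
    return 0;
}
```

Whether this helps a Lucas-Lehmer squaring depends on how much inter-GPU traffic the transposes generate, which is exactly the scaling-overhead question below.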

The question is, will it scale 2:1 with two cards? I assume there is some overhead here.

With this, we could easily handle 100M Mersenne exponents and larger on the GPUs, as we would have a lot more firepower per node (for those who want to test 100M exponents) and a lot more memory per node.
Old 2013-11-17, 12:58   #3
Manpowre
 
"Svein Johansen"
May 2013
Norway


Also, Nvidia's press release:
http://nvidianews.nvidia.com/Release...UDA-6-a62.aspx
Old 2013-11-28, 16:25   #4
flashjh
 
 
"Jerry"
Nov 2011
Vancouver, WA


Is this going to push a re-write of CUDApm1 and CUDALucas?
Old 2013-11-28, 18:53   #5
Manpowre
 
"Svein Johansen"
May 2013
Norway


Quote:
Originally Posted by flashjh
Is this going to push a re-write of CUDApm1 and CUDALucas?
Well, when I did the research for this and read the suggested code change, we actually lose some time to overhead when the API does the memcopy instead of us doing it ourselves in the code. But with the Maxwell platform, where there will be an ARM CPU next to the GPU, we could probably do the normalization on that ARM CPU instead of on the host CPU; that means we wouldn't have to memcopy at all. Also, with PCI Express 3.0, a memcopy will be super fast.

I also expect Nvidia to add Hyper-Q support for memcopies done through the API rather than by our own code.