mersenneforum.org > Great Internet Mersenne Prime Search > Data
2019-12-11, 20:12   #221
kriesel

Quote:
Originally Posted by tServo
After examining the source and the Cuda docs, it looks to be pretty easy to change the cuFFT calls from a single board to multiple boards via the Cuda API.
Go for it.
I'll test it when you have something for 2 or 4 old GPUs of the same model. They're likely to be PCIe-interfaced, though.
2019-12-16, 16:13   #222
kriesel

Quote:
Originally Posted by tServo
I disagree with this if a "modest" cluster is configured. The setup I am referring to would be either the "pizza box" (2U) machines that can hold 4 Tesla SXM2 V100s, or perhaps the slightly larger 4U models that can hold 8 of the V100s. I know Dell, HP, and Nvidia make these; perhaps others do also (SuperMicro?).
The boards are connected via Nvidia's NVLink mesh network, which sits on a large board right above the processor board and runs pretty fast: https://en.wikipedia.org/wiki/NVLink

I have traced CudaLucas and it spends 90% of its time in Nvidia's cuFFT.
After examining the source and the Cuda docs, it looks to be pretty easy to change the cuFFT calls from a single board to multiple boards via the Cuda API.

Thus, you would be doing the FFT crunching on 8 Teslas simultaneously over a fast mesh network, which would be pretty efficient.

However, this would cost BIG money (100K-200K dollars) and would not be sensible.
SXM2's spec is 300 GB/second. That sounds fast, but it's a third or less of the throughput of on-GPU memory: 897 GB/sec on a Tesla V100, a full TB/sec on a Radeon VII. Per Preda, as I recall, the FFT code in gpuowl is largely memory-bound.

So each of multiple GPUs sharing the same task spends its time something like this:
do its own computing chunk of the job for one iteration;
spend 3 times as long communicating its result to another GPU that needs it as input for the next iteration;
repeat that communication delay for as many other GPUs as also need its result for their inputs for the next iteration;
spend 3 times as long receiving a result from another GPU that is needed as input for this GPU's part of the next iteration;
repeat that communication delay for as many other results as are also needed as input for this GPU's part of the next iteration.
Let's suppose that much or all of the computation on-GPU is done in the shadow of all that communication traffic, via some pipelining magic.
So, for this scenario of 1-to-1 direct communication between GPUs (assuming it can be done without a connecting flight in and out of cpu-land, and assuming the communication is via a full m x m crosspoint communication mesh), an iteration shared across m GPUs costs about 3 x 2 x m (+1?) chunk times.
Doing it on one GPU, avoiding those lengthier communication delays, an iteration is m chunk times, so roughly 6 times faster. It's somewhat worse than this if send and receive occur in parallel, since the achievable total rate of simultaneous send and receive is only 200 GB/sec on available hardware (V100). And the maximum m appears to be six in that case, at least for V100, and two for form factors that can be put in workstations, before relaying delays are added in. http://images.nvidia.com/content/vol...whitepaper.pdf https://www.nvidia.com/de-de/design-...vlink-bridges/
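
For what it's worth, here is a toy tabulation of that chunk-time model. The 3x communication penalty and the 3 x 2 x m total are the assumptions from above, not measurements of any actual hardware:

Code:
/* Toy chunk-time model: each of m GPUs computes one chunk per iteration,
 * but moving a chunk across the link is assumed to take ~3 chunk times
 * (300 GB/s link vs ~900-1000 GB/s on-card memory bandwidth). */
#include <stdio.h>

int main(void) {
    const double comm_factor = 3.0;   /* link is ~1/3 of on-GPU memory bandwidth */
    for (int m = 2; m <= 8; m *= 2) {
        double shared = comm_factor * 2.0 * m;  /* send + receive phases, m GPUs */
        double single = (double)m;              /* m chunks done serially on one GPU */
        printf("m=%d: ~%2.0f chunk times shared vs %.0f on one GPU (ratio %.1f)\n",
               m, shared, single, shared / single);
    }
    return 0;
}

The ratio stays near 6 regardless of m, which is where the "~6 times faster" figure comes from.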
2019-12-16, 17:38   #223
kriesel
The amount of GPU RAM occupied for gpuowl 6.5 on a 100M-digit exponent PRP run is about 1.2 GB. Moving that amount of data out and back once per iteration takes at least 2.4 GB / 300 GB/sec = 0.008 seconds, about 10 times the current 5M FFT iteration time on a good Radeon VII. Estimating the 18M FFT iteration time on the Radeon at 3.3 msec (scaling by p^1.1), the SXM2 transfer takes about 2.4 times as long as the iteration.
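
A minimal sketch of that arithmetic, assuming the 0.8 msec 5M-FFT iteration time implied by the "about 10 times" remark; the inputs are the figures quoted above, not fresh benchmarks:

Code:
/* Transfer-time vs iteration-time estimate using the figures above. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double data_gb   = 1.2;    /* gpuowl 6.5 working set, 100M-digit PRP */
    const double link_gb_s = 300.0;  /* SXM2 NVLink spec rate */
    const double t5m_ms    = 0.8;    /* assumed 5M-FFT iteration time, Radeon VII */

    double transfer_ms = 2.0 * data_gb / link_gb_s * 1000.0;  /* out and back: ~8 ms */
    double t18m_ms     = t5m_ms * pow(18.0 / 5.0, 1.1);       /* scale by p^1.1: ~3.3 ms */

    printf("transfer %.1f ms, 18M iteration %.1f ms, ratio %.1f\n",
           transfer_ms, t18m_ms, transfer_ms / t18m_ms);      /* ratio ~2.4 */
    return 0;
}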
2019-12-17, 22:17   #224
tServo

Quote:
Originally Posted by kriesel
The amount of GPU RAM occupied for gpuowl 6.5 on a 100M-digit exponent PRP run is about 1.2 GB. Moving that amount of data out and back once per iteration takes at least 2.4 GB / 300 GB/sec = 0.008 seconds, about 10 times the current 5M FFT iteration time on a good Radeon VII. Estimating the 18M FFT iteration time on the Radeon at 3.3 msec (scaling by p^1.1), the SXM2 transfer takes about 2.4 times as long as the iteration.
Reading the Nvidia documentation for multi-GPU cuFFT, they state that the data for 1D transforms is split into separate pieces, one per GPU. Each GPU's piece is further split into pieces called "strings", and the in-flight transfers of these strings are overlapped with the computation.

I guess the only way to find out for sure is to actually try it. I have been eyeballing one of my machines that has 3 PCIe x16 slots. I have enough old Titan Black GPUs for the job.
It will be a while before I have time to attack it.

I think I remember reading that the GPU connected to a display cannot participate in the computation. The 4x and 8x servers all have a dedicated video port for a console, so it's no problem for them (Tesla V100s don't have any video circuitry anyway). Hence my needing a machine with 3 PCIe slots.
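
For reference, a hedged sketch of what moving a single-board cuFFT call to the multi-GPU cufftXt interface might look like. The two-GPU split, the transform length, and the Z2Z (double-complex) type are illustrative choices, not CudaLucas's actual parameters, and error checking is omitted:

Code:
/* Sketch: spread a 1D double-complex FFT over two GPUs with cufftXt.
 * Illustrative only; not taken from CudaLucas. Error checks omitted. */
#include <cuda_runtime.h>
#include <cufftXt.h>
#include <stdlib.h>

int main(void) {
    const size_t N = 1 << 20;        /* example transform length */
    int gpus[2] = {0, 1};            /* two boards share the plan */

    cufftHandle plan;
    cufftCreate(&plan);
    cufftXtSetGPUs(plan, 2, gpus);   /* attach the plan to both GPUs */

    size_t workSizes[2];
    cufftMakePlan1d(plan, (int)N, CUFFT_Z2Z, 1, workSizes);

    /* Host data to transform (zeros here, just to exercise the path). */
    cufftDoubleComplex *h = (cufftDoubleComplex *)calloc(N, sizeof *h);

    /* cuFFT allocates device memory and splits the data across the GPUs. */
    cudaLibXtDesc *d;
    cufftXtMalloc(plan, &d, CUFFT_XT_FORMAT_INPLACE);
    cufftXtMemcpy(plan, d, h, CUFFT_COPY_HOST_TO_DEVICE);

    /* The library handles the inter-GPU exchanges during execution. */
    cufftXtExecDescriptorZ2Z(plan, d, d, CUFFT_FORWARD);

    cufftXtMemcpy(plan, h, d, CUFFT_COPY_DEVICE_TO_HOST);

    cufftXtFree(d);
    cufftDestroy(plan);
    free(h);
    return 0;
}

Whether the overlap of those exchanges with the butterflies is good enough for an LL/PRP-sized transform is exactly what a test on 2 or 3 boards would show.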