mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Old 2013-09-26, 10:25   #188
owftheevil
 
 
"Carl Darby"
Oct 2012
Spring Mountains, Nevada

3²×5×7 Posts

One Titan can do an LL iteration with a 4M FFT in about 2.75 ms. 250 Gb/s of communication between the devices would be just enough for two Titans to do iterations with 4M FFTs in 2 ms. With more devices the situation gets worse, approaching 500 Gb/s for an infinite number of devices.

Stage 2 of P-1, on the other hand, would benefit very nicely.
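owftheevil's bandwidth figures can be rough-checked with a back-of-envelope model. The accounting below is my own guess at one that fits (he doesn't spell his out): the 4M-point transform is stored as complex doubles (64 MiB), a distributed FFT does one all-to-all transpose in each of the forward and inverse transforms, each moving the fraction (p-1)/p of the data across the link, and transfers overlap compute, so the link must carry that volume within the 2 ms iteration target. All of those assumptions are hypothetical, but they land close to the 250 and 500 Gb/s numbers:

```python
# Hypothetical accounting (not from the post): 4M complex doubles,
# two all-to-all transposes per iteration (forward + inverse FFT),
# each moving the fraction (p-1)/p of the data over the link, with
# transfers overlapped against compute.

SIGNAL = 4 * 2**20 * 16      # bytes: 4M complex double-precision values
T_TARGET = 2.0e-3            # s: target iteration time from the post
TRANSPOSES = 2               # forward + inverse FFT

def link_gbits(p):
    """Link bandwidth in Gb/s needed for p devices to hit T_TARGET."""
    volume = TRANSPOSES * SIGNAL * (p - 1) / p   # bytes crossing the link
    return volume / T_TARGET * 8 / 1e9           # bytes/s -> Gb/s

print(f"two devices: {link_gbits(2):.0f} Gb/s")        # ~268, near the 250 quoted
print(f"many devices: {link_gbits(10**6):.0f} Gb/s")   # ~537, near the 500 quoted
```

With p = 2 only half the data is remote, which is why the many-device limit is exactly double the two-device figure.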

Last fiddled with by owftheevil on 2013-09-26 at 10:30
Old 2013-09-26, 18:33   #189
Karl M Johnson
 
 
Mar 2010

3·137 Posts

Quote:
Originally Posted by owftheevil View Post
One Titan can do an LL iteration with a 4M FFT in about 2.75 ms. 250 Gb/s of communication between the devices would be just enough for two Titans to do iterations with 4M FFTs in 2 ms. With more devices the situation gets worse, approaching 500 Gb/s for an infinite number of devices.

Stage 2 of P-1, on the other hand, would benefit very nicely.
What about two-headed GPUs, like the GTX 690?
Say there's a hypothetical GTX Titan X2 with two GK110 GPUs at lower clocks but the same 2688 shaders per GPU.
Would it perform better than two GTX Titans, from a theoretical-throughput point of view?

Last fiddled with by Karl M Johnson on 2013-09-26 at 18:34 Reason: yes
Old 2013-09-26, 20:49   #190
kracker
 
 
"Mr. Meeseeks"
Jan 2012
California, USA

2³·271 Posts

Quote:
Originally Posted by Karl M Johnson View Post
What about two-headed GPUs, like the GTX 690?
Say there's a hypothetical GTX Titan X2 with two GK110 GPUs at lower clocks but the same 2688 shaders per GPU.
Would it perform better than two GTX Titans, from a theoretical-throughput point of view?
Probably not. As always, it depends mostly on the latency and speed of the "bridge", and I'm not sure whether internal SLI is any different (?)
Old 2013-09-26, 22:17   #191
owftheevil
 
 
"Carl Darby"
Oct 2012
Spring Mountains, Nevada

3²×5×7 Posts

Memory bandwidth would still be the limiting factor; we are almost up to that limit now with a single processor. The normalization and pointwise-multiplication kernels could be split without increasing memory transfers, but they account for only about 15% of the iteration time.

Is the memory on those cards shared or partitioned between the two processors?
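If only that ~15% of the iteration can be split without extra memory traffic, Amdahl's law caps the gain from a second GPU. A minimal sketch (the 15% figure is from the post above; the rest is the standard formula):

```python
# Amdahl's law applied to owftheevil's estimate: only ~15% of the
# iteration (normalization + pointwise multiplication) splits across
# GPUs for free; the rest behaves as serial with respect to the
# second card.

def speedup(parallel_fraction, n_devices):
    """Overall speedup when only parallel_fraction of the work scales."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_devices)

print(f"{speedup(0.15, 2):.3f}x with two GPUs")   # ~1.081x
```

So even a perfect, zero-latency bridge would buy less than 10% on the full iteration.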

Last fiddled with by owftheevil on 2013-09-26 at 22:19
Old 2013-09-26, 22:53   #192
Robish
 
"Rob Gahan"
Aug 2013
Ireland

44₈ Posts

Quote:
Originally Posted by owftheevil View Post
Memory bandwidth would still be the limiting factor; we are almost up to that limit now with a single processor. The normalization and pointwise-multiplication kernels could be split without increasing memory transfers, but they account for only about 15% of the iteration time.

Is the memory on those cards shared or partitioned between the two processors?
Partitioned, I think: 6 GB = 3 GB per GPU.
Old 2013-09-26, 23:00   #193
Robish
 
"Rob Gahan"
Aug 2013
Ireland

36₁₀ Posts

Quote:
Originally Posted by Robish View Post
Partitioned, I think: 6 GB = 3 GB per GPU.
I think I saw a performance review on videocardz showing that a dual-GPU card never outperforms two singles; e.g. a 7990 is roughly 15% slower than 2 × 7970. In any case, SLI and CrossFire are to be avoided for GPU computation: each GPU should be addressed through its PCIe slot as a separate entity.
Old 2013-09-26, 23:48   #194
kracker
 
 
"Mr. Meeseeks"
Jan 2012
California, USA

2³×271 Posts

Radeon HD 7990 PCB: 6 GB VRAM
GeForce GTX 690 PCB: 4 GB VRAM
Old 2013-09-27, 00:11   #195
owftheevil
 
 
"Carl Darby"
Oct 2012
Spring Mountains, Nevada

3²×5×7 Posts

So it's looking like distributed LL is not feasible, in any sense, at this time.

Sorry, kracker and msft. Here's your thread back. Any new developments with clLucas?
Old 2013-09-27, 04:40   #196
msft
 
 
Jul 2009
Tokyo

2·5·61 Posts

Quote:
Originally Posted by owftheevil View Post
So it's looking like distributed LL is not feasible, in any sense, at this time.

Sorry, kracker and msft. Here's your thread back. Any new developments with clLucas?
Hi,
No idea at this time.
Old 2013-09-27, 06:50   #197
LaurV
Romulan Interpreter
 
 
Jun 2011
Thailand

7²×197 Posts

Quote:
Originally Posted by kracker View Post
Probably not. As always, it depends mostly on the latency and speed of the "bridge", and I'm not sure whether internal SLI is any different (?)
In that case, if the manufacturer is clever, it won't be an "internal SLI" but a different type of "bridge", more closely related to the mobo's chipsets (think northbridge). For an actual existing example, see Asus' Mars II cards, which put two 580s together using such a "specialized" bridge, enabling the Mars to get about a 30% speed gain compared with the 590. For the unadvised: the 590 is just two 580s, underclocked (due to power and heat problems) and connected together over an "internal SLI" bridge.

Anyhow, to come back on topic: there would be no advantage in spreading LL tests over multiple cards. The external communication is always slower than the internal computing, and LL tests are awkward to parallelize, except for the FFT used in each iteration; but for that, the data are already available internally (you need all of them for error checking, etc.), so it would make no sense to move them around, wasting precious time. It will always take less time to do the calculation locally than to move the data out, do the calculation, and bring the results back.

If you have two GPUs, then you will do much better running two LL tests, one exponent on each GPU, with SLI or without SLI.

Always.
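LaurV's "always" can be sanity-checked with the timings owftheevil gave earlier in the thread (2.75 ms/iteration on one Titan; 2 ms/iteration for a hypothetical perfectly-bridged split test):

```python
# Aggregate throughput: two independent LL tests, one per GPU, versus
# one test split across both cards at the bandwidth-limited 2 ms.
# Timings are owftheevil's figures from earlier in the thread.

T_SINGLE = 2.75e-3   # s/iteration: one test on one Titan
T_SPLIT = 2.00e-3    # s/iteration: one test split across two Titans

independent = 2 / T_SINGLE   # total iterations/s, one test per card
split = 1 / T_SPLIT          # iterations/s, single split test

print(f"independent: {independent:.0f} it/s vs split: {split:.0f} it/s")
```

The split test finishes a single exponent sooner, but total throughput is about 45% higher running independent tests.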
Old 2013-09-27, 13:11   #198
kracker
 
 
"Mr. Meeseeks"
Jan 2012
California, USA

2³×271 Posts

Yes, one test on one GPU will always be best, I think.

EDIT: On another note, in 4 hours my 4th DC will finish with clLucas.

Last fiddled with by kracker on 2013-09-27 at 13:12