One Titan can do an LL iteration with a 4M FFT in about 2.75 ms. 250 Gb/s of communication between the devices would be just enough for two Titans to do iterations with 4M FFTs in 2 ms. With more devices the situation gets worse, approaching 500 Gb/s for an infinite number of devices.
Stage 2 of P-1, on the other hand, would benefit very nicely. |
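The bandwidth figures above can be reproduced with a simple model. To be clear, the model and its parameters are my assumptions, not necessarily the poster's: a 4M double-precision FFT holds 32 MiB; one LL iteration moves roughly four times that (transposes for the forward and inverse FFTs); a fraction (k-1)/k of it crosses the inter-device link; communication overlaps fully with compute, so the whole 2 ms target is available for transfers; and "Gb/s" is read as gigabits per second.

```python
# Hedged reconstruction of the bandwidth figures in the post above.
# All modelling choices here are assumptions for illustration.

FFT_BYTES = 4 * 2**20 * 8        # 4M doubles = 32 MiB
DATA_PER_ITER = 4 * FFT_BYTES    # ~128 MiB moved per iteration (assumed)
TARGET = 2.0e-3                  # desired iteration time, seconds

def required_gbits(k):
    """Inter-device bandwidth (Gb/s) needed for k devices under this model."""
    crossing = DATA_PER_ITER * (k - 1) / k   # bytes that must cross the link
    return crossing * 8 / TARGET / 1e9       # bits per second, in Gb/s

print(required_gbits(2))     # ~268 Gb/s -- close to the quoted 250 Gb/s
print(required_gbits(1000))  # ~536 Gb/s -- approaching the quoted ~500 Gb/s limit
```

Under these assumptions the two-device requirement comes out near 250 Gb/s and the many-device limit near 500 Gb/s, matching the post's numbers; a different overlap or transpose count would shift both figures.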
[QUOTE=owftheevil;354237]One Titan can do an LL iteration with a 4M FFT in about 2.75 ms. 250 Gb/s of communication between the devices would be just enough for two Titans to do iterations with 4M FFTs in 2 ms. With more devices the situation gets worse, approaching 500 Gb/s for an infinite number of devices.
Stage 2 of P-1, on the other hand, would benefit very nicely.[/QUOTE] What about two-headed GPUs, like the GTX 690? Say there's a hypothetical GTX Titan X2, which has two GK110 GPUs at lower clocks but with the same 2688 shaders per GPU. Would it perform better than two GTX Titans, from a theoretical-throughput point of view? |
[QUOTE=Karl M Johnson;354284]What about two-headed GPUs, like the GTX 690?
Say there's a hypothetical GTX Titan X2, which has two GK110 GPUs at lower clocks but with the same 2688 shaders per GPU. Would it perform better than two GTX Titans, from a theoretical-throughput point of view?[/QUOTE] Probably not. As always, it depends [I]mostly[/I] on the latency and speed of the "bridge", and I'm not sure whether internal SLI is any different. |
Memory bandwidth would still be the limiting factor. We are almost up to that limit now with a single processor. The normalization and pointwise multiplication kernels could be split without increasing memory transfers, but they are only about 15% of the iteration time.
Is the memory on those cards shared or partitioned between the two processors? |
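The 15% figure above puts a hard ceiling on what splitting those kernels could buy, by Amdahl's law. A minimal sketch, using only the fraction quoted in the post:

```python
# Amdahl's-law sketch: if only the normalization and pointwise-multiplication
# kernels (~15% of iteration time, per the post) can be split across GPUs
# without extra memory transfers, the overall speedup ceiling is small.

def amdahl_speedup(parallel_fraction, n_devices):
    """Upper bound on speedup when only `parallel_fraction` of the work
    scales across `n_devices` (Amdahl's law)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_devices)

print(amdahl_speedup(0.15, 2))             # ~1.08x with two GPUs
print(amdahl_speedup(0.15, float("inf")))  # ~1.18x even with infinitely many
```

So even a perfect split of those two kernels across two GPUs would shave only about 7-8% off the iteration time, which is why memory bandwidth dominates.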
[QUOTE=owftheevil;354301]Memory bandwidth would still be the limiting factor. We are almost up to that limit now with a single processor. The normalization and pointwise multiplication kernels could be split without increasing memory transfers, but they are only about 15% of the iteration time.
Is the memory on those cards shared or partitioned between the two processors?[/QUOTE] Partitioned, I think: 6 GB = 3 GB per GPU |
[QUOTE=Robish;354306]Partitioned I think, 6gb = 3gb each gpu[/QUOTE]
I think I saw a performance review on videocardz showing that a dual-GPU card never outperforms two singles, i.e. a 7990 is roughly 15% slower than 2 x 7970s. But SLI and CrossFire are to be avoided for GPU computation; each GPU should only be addressed from its PCIe slot as a separate entity. |
[URL="http://www.coolingconfigurator.com/upload/pictures/AMD-Radeon-7990-6GB-GDDR5-PCB.jpg"]Radeon HD 7990 PCB[/URL] 6GB VRAM
[URL="http://www.hardwareheaven.com/reviewimages/nvidia-geforce-gtx-690/GeForce_GTX_690_F_bare_PCB.jpg"]GeForce GTX 690 PCB[/URL] 4GB VRAM |
So it's looking like distributed LL testing, in any sense, is not feasible at this time.
Sorry kracker and msft. Here's your thread back. Any new developments with clLucas? |
[QUOTE=owftheevil;354314]So it's looking like distributed LL testing, in any sense, is not feasible at this time.
Sorry kracker and msft. Here's your thread back. Any new developments with clLucas?[/QUOTE] Hi, no idea at this time. |
[QUOTE=kracker;354296]Probably not. As always, it depends [I]mostly[/I] on the latency and speed of the "bridge" and I'm not sure if internal SLI is any different (?)[/QUOTE]
In that case, if the manufacturer is clever, it won't be an "internal SLI" but a different type of "bridge", more closely related to the motherboard's chipsets (think northbridge). For an actual existing example, see Asus' Mars II cards, which put two 580s together using such a "specialized" bridge, enabling the Mars to get about a 30% speed gain compared with the 590. For the unadvised: the 590 is just two 580s, underclocked (due to heat and other internal problems) and connected together over an "internal SLI" bridge.

Anyhow, to come back on topic: there will be no advantage in spreading LL tests over multiple cards. The external communication is always slower than the internal computing, and LL tests are freaky to parallelize, except for the FFT used in each iteration; but for that, the data are already available internally (you need all of it, for error correction, etc.), so it would make no sense to move them around too much, wasting precious time. It will always take less time to do the calculation than to move the data, do the calculation, and bring back the results. If you have two GPUs, you will do much better running two LL tests, one exponent on each GPU, with SLI or without SLI. Always. |
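The "moving data costs more than computing" point can be made concrete with a back-of-the-envelope check. The numbers below are illustrative assumptions (the 2.75 ms figure is from earlier in the thread; the 16 GB/s link is a rough PCIe 3.0 x16-class rate, and `offload_wins` is a hypothetical helper, not part of any LL program):

```python
# Hedged sketch: offloading part of an iteration to another device only pays
# off if the round-trip transfer doesn't eat the compute time saved.

def offload_wins(t_local, data_bytes, link_bytes_per_s, remote_speedup=2.0):
    """True if (send + remote compute + receive) beats computing locally."""
    t_transfer = 2 * data_bytes / link_bytes_per_s   # round trip over the link
    t_remote = t_local / remote_speedup              # assume compute halves
    return t_transfer + t_remote < t_local

# Example: 2.75 ms of local work, 32 MiB of FFT data, a ~16 GB/s link,
# and a second GPU that would halve the compute time.
print(offload_wins(2.75e-3, 32 * 2**20, 16e9))  # False: the transfer alone takes ~4.2 ms
```

With these numbers the round trip alone exceeds the entire local iteration time, which is exactly the point made above: two independent LL tests, one per GPU, win.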
Yes, one test on one GPU will always be best, I think.
EDIT: On another note, in 4h my 4th DC will finish with clLucas. |