mfaktc and PCIe bus width
I was thinking about adding a second GPU; the two 16-lane PCIe slots on my motherboard are next to each other and I would really like to space the GPUs further apart for better heat dissipation.
The third PCIe slot can only run with 8 lanes. With the GPU sieving version of mfaktc, would this make any difference in the performance of the program? |
We tested this in our computer and there was no difference in throughput.
We tested "top/middle", "top/bottom" and "middle/bottom". We tested three cards also but our third card is slow (GT 430) so we are not sure if three fast cards would be an issue. |
[QUOTE=Chuck;334503]I was thinking about adding a second GPU; the two 16-lane PCIe slots on my motherboard are next to each other and I would really like to space the GPUs further apart for better heat dissipation.
The third PCIe slot can only run with 8 lanes. With the GPU sieving version of mfaktc, would this make any difference in the performance of the program?[/QUOTE] I had terrible performance when not using the designated slots; the MOBO instructions often state specific slot loading. Sadly, two cards next to each other (GTX 570s) caused the top card to be heat-loaded by the lower card... It didn't go so well... Enter liquid cooling. Also, if you OC the cards you will see a significant increase in power consumption; a 600 watt PS will likely need to be 1000 watts... Where's FlashJH when we need him to tell his horror stories... |
I have a similar question but slightly more extreme. I have been thinking of getting a graphics card at some point, something like a 740 when they come out eventually. The card will be designed for 16x with PCIe 3.0, but my motherboard only has PCIe 1.1. I know 2.0 cards should run on my system at an equivalent bus speed of 16x 1.1 or 8x 2.0. Would a PCIe 3.0 card run at 4x 3.0 speeds? Would it run at all?
The motherboard is an ASUS P5K-VM if that helps. I will be wanting to use it for a variety of stuff including things like cudalucas, gpu-ecm, gpu P-1 etc. |
[QUOTE=Xyzzy;334507]..."top/bottom"...[/QUOTE]
Which did you prefer? :wink: |
[QUOTE=Xyzzy;334507]We tested this in our computer and there was no difference in throughput.
We tested "top/middle", "top/bottom" and "middle/bottom". We tested three cards also but our third card is slow (GT 430) so we are not sure if three fast cards would be an issue.[/QUOTE] My computer has the Raven-style case so it would be "left", "center", and "right". |
[QUOTE=swl551;334508]I had terrible performance by not using the designated slots. the MOBO instructions often states specific slot loading.[/QUOTE]
The MOBO (Asus P6X58D Premium) instructions say the third slot can be configured as 8 lanes, with the trade-off that the middle slot (which would be unused) would then also run at 8 lanes. With the middle one at 16, the third can only run at 1 lane. |
[QUOTE=Chuck;334520]My computer has the Raven-style case so it would be "left", "center", and "right".[/QUOTE]
Sounds like it's time to try it and see what happens. Everything else is just speculation. (Probably good speculation, but it is your stuff.) |
[QUOTE=henryzz;334509]I have a similar question but slightly more extreme. I have been thinking of getting a graphics card at some point something like a 740 when they come out eventually. The card will be designed for 16x with PCIe 3.0 but my motherboard only has PCIe 1.1. I know 2.0 cards should run on my system at an equivalent bus speed of 16x 1.1 or 8x 2.0. Would a PCI 3.0 card run at 4x 3.0 speeds? Would it run at all? [/QUOTE]
A co-worker of mine tried running PCIe 3.0 cards in 2.0 slots. The board was solid and could handle 2x16, not one of the boards that can do 1x16 or 2x8. Even so, the PCIe 3.0 card wouldn't run in it. They say they are backwards compatible (usually with a *), but the truth is it's really hit and miss whether a 3.0 card will run in a 2.0 slot. If it will run, it will almost certainly run slower than it would in a 3.0 slot, not because of the channels/bandwidth but because of the clock speeds. Personally I'd recommend getting a higher-end 500-series 2.0 card over a lower-end 700-series 3.0 card, unless you are looking for a good excuse to replace the motherboard. Regarding Chuck's situation, the 8x or 16x bandwidth won't matter at all, but the board configuration might. Make sure the motherboard doesn't mind running 1 and 3 w/o 2. You might be surprised to find that if you do run 1 and 3 w/o 2, they'll both run at full 16x speed anyway. EDIT: For reference, I run an overclocked 480 and an overclocked 580 on an ASUS board with a 750 watt power supply. |
Chuck could use a riser card or ribbon :smile:
|
[QUOTE=paulunderwood;334531]Chuck could use a riser card or ribbon :smile:[/QUOTE]
Now this is a great idea; something I never thought of. |
[QUOTE=Aramis Wyler;334524]A co-worker of mine tried running PCIe 3.0 cards in 2.0 slots. The board was solid and could handle 2x16, not one of the boards that can do 1x16 or 2x8. Even so, the PCIe 3.0 card wouldn't run in it. They say they are backwards compatible (usually with a *), but the truth is it's really hit and miss whether a 3.0 card will run in a 2.0 slot. If it will run, it will almost certainly run slower than it would in a 3.0 slot, not because of the channels/bandwidth but because of the clock speeds. Personally I'd recommend getting a higher-end 500-series 2.0 card over a lower-end 700-series 3.0 card, unless you are looking for a good excuse to replace the motherboard.
Regarding Chuck's situation, the 8x or 16x bandwidth won't matter at all, but the board configuration might. Make sure the motherboard doesn't mind running 1 and 3 w/o 2. You might be surprised to find that if you do run 1 and 3 w/o 2, they'll both run at full 16x speed anyway. EDIT: For reference, I run an overclocked 480 and an overclocked 580 on an ASUS board with a 750 watt power supply.[/QUOTE] I suspected as much. I am not certain I can get a PCIe 2.0 motherboard with LGA 775 (Core 2 Quad). I would probably use replacing that as an excuse to replace the whole system anyway. I am trying to wait for Skylake to come out with DDR4. Currently one of my biggest problems with my system is that high-density DDR2 (I currently have 4GB of memory) is expensive and barely available. I don't want to get a long-term system with a memory architecture that is getting close to the end of its life span (plus there has been a spate of useful new instruction sets that I would like). edit: looked it up, I could get a PCIe 2.0 motherboard |
[QUOTE=Chuck;334503]I was thinking about adding a second GPU; the two 16-lane PCIe slots on my motherboard are next to each other and I would really like to space the GPUs further apart for better heat dissipation.
The third PCIe slot can only run with 8 lanes. With the GPU sieving version of mfaktc, would this make any difference in the performance of the program?[/QUOTE] My feeling says that PCIe 1.1 x1 is sufficient for a GeForce Titan if you do [B]G[/B]PU sieving. For [B]C[/B]PU sieving, mfaktc needs to transfer 4 bytes per candidate. So if your card is capable of 200M/s you'll need 200M/s * 4 = 800MB/s. Oliver |
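Oliver's arithmetic is easy to sanity-check with a small sketch (the 200M/s rate and 4 bytes per candidate are from the post above; the ~250 MB/s-per-lane figure for PCIe 1.1 is the nominal 2.5 GT/s rate after 8b/10b encoding, my assumption, not from the post):

```python
# mfaktc's CPU-sieving mode sends the GPU 4 bytes per factor candidate,
# so a card testing 200M candidates/s needs:
candidates_per_sec = 200e6        # example rate from the post
bytes_per_candidate = 4           # one 32-bit value per candidate
needed_mb_s = candidates_per_sec * bytes_per_candidate / 1e6
print(needed_mb_s)                # 800.0 MB/s, matching Oliver's figure

# A PCIe 1.1 lane carries ~250 MB/s of payload after 8b/10b encoding,
# so CPU sieving at this rate needs roughly a x4 PCIe 1.1 link:
lanes_needed = needed_mb_s / 250
print(lanes_needed)               # 3.2 -> in practice a x4 link
```

This is also why GPU sieving is so much less sensitive to slot width: with the sieve on the card, only a handful of candidate factors ever cross the bus.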
I may have missed something, but as far as I know you can always put a 2.0 card in a 3.0 board. There's not any problem with getting a 3.0 board regardless of your gpu hardware. The problem would be trying to put a 3.0 card in a 2.0 board.
|
[QUOTE=Aramis Wyler;334558]I may have missed something, but as far as I know you can always put a 2.0 card in a 3.0 board. There's not any problem with getting a 3.0 board regardless of your gpu hardware. The problem would be trying to put a 3.0 card in a 2.0 board.[/QUOTE]
I have a PCIe 3.0 GPU in a PCIe 2.0 slot in the motherboard here. |
The only problem would be to put a card with a longer connector in a shorter slot, for which you will need a milling machine, and at the end it will not work...
PCIe is a versatile animal. You can put 2.0 cards into 3.0 slots, if they fit physically, and you get full 2.0 performance, as your mobo will always know the deal. You can put 3.0 cards into 2.0 slots, and you get full 2.0 performance IF the card knows the deal. Not all cards know the deal, and in that case you get lower performance. Worst case you get 1.0a performance (for [URL="https://www.google.com/search?q=pcie+to+pcmcia&hl=en&source=lnms&tbm=isch&sa=X&ei=60RNUZ-wIe-aiAfgo4CwDw&ved=0CAoQ_AUoAQ&biw=1272&bih=806"]some cards[/URL] you always get lower performance, because that is all the "other side" knows).

The fun part is that you can even put [URL="http://pic.dhe.ibm.com/infocenter/powersys/v3r1m5/index.jsp?topic=/iphcd/iphcdpciexpress.htm"]shorter cards in longer slots[/URL]. The x1 cards end at pin 18, the x4 cards end at pin 32, and the x8 cards end at pin 50, while the x16 cards continue up to pin 82 (you have to check these numbers for yourself, I am not sure), but they [B][U]are[/U] pin-to-pin compatible[/B], and if you take care about the right alignment (the key notches will help you in this direction) then yes, you can put any card in any slot, if it physically fits.

Also, PCIe 4.0, which is yet to appear, featuring another increase in the transfer rate (doubling it again, from 8 GT/s to 16 GT/s, giga-transfers per second), will use the same slot; what it will bring that's new is a "scrambling" algorithm for the data, to avoid high-frequency interference between the parallel lines. The limitation is not the connector, but the tracks on the PCB (very fine copper wires running parallel there, working like antennas relative to each other).

edit: [URL="http://en.wikipedia.org/wiki/PCI_Express#PCI_Express_2.0"]Wikipedia[/URL] is quite good at explaining the 2.0/3.0/4.0 differences, and at showing that the bottleneck is in fact the manufacturing process of the silicon, not the slot/connector.
Also [URL="http://www.redbooks.ibm.com/abstracts/tips0456.html"]IBM says[/URL] that "[I]PCI Express uses an embedded clocking technique using 8b/10b encoding. The clock information is encoded directly into the data stream, rather than having the clock as a separate signal[/I]", which is why the card and the mobo negotiate the clock and the encoding "down" until they can understand each other. So, having different cards in different slots will always work at the performance of the "weakest link in the chain" ("no chain is stronger than its weakest link") or worse, but it will work. |
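The generation differences described above come down to transfer rate and encoding overhead. A small sketch (constants are my own, taken from the published PCIe specs rather than from any post here) shows the effective per-lane payload bandwidth:

```python
# Effective per-lane bandwidth per PCIe generation: raw transfer rate
# times encoding efficiency (8b/10b for 1.x/2.0, 128b/130b for 3.0).
GENERATIONS = {
    # gen: (giga-transfers/s, payload bits, bits on the wire)
    "1.1": (2.5, 8, 10),
    "2.0": (5.0, 8, 10),
    "3.0": (8.0, 128, 130),
}

def lane_mb_s(gen):
    gt_s, payload, total = GENERATIONS[gen]
    # GT/s * encoding efficiency gives payload Gb/s; /8 -> GB/s; *1000 -> MB/s
    return gt_s * payload / total / 8 * 1000

for gen in GENERATIONS:
    print(gen, round(lane_mb_s(gen)), "MB/s per lane")
```

This is also why a 16x PCIe 1.1 slot (~4 GB/s) is roughly equivalent to 8x PCIe 2.0, the equivalence henryzz mentions earlier: Gen 2 doubles the per-lane rate, and Gen 3 nearly doubles it again while swapping 8b/10b for the much leaner 128b/130b encoding.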
That sounds like a great Wikipedia entry, but I have actual experience in the matter. We took a PCIe 3.0 video card and put it in a gaming motherboard with 2.0 slots (one we'd been using for about a year), and it wouldn't work. At all. We put the card in a friend's 3.0 board and it worked fine. We got the best 2.0 card we could and put it in the 2.0 board and it worked fine. Gave the 3.0 card to the friend for Christmas.
By the standards, it should be exactly as you said. Unfortunately, not all manufacturers exactly meet spec. |
That is exactly why I said "[B]IF the card knows the deal[/B]" in the post above, stressing the IF. Your card would most probably have worked in some old compatibility mode, if you set it to SVGA :razz: mode or something like that, hehe, or with the right drivers, or... etc. They still have to negotiate those clocks.
edit: my boss used to make fun of us every chance he gets, crying one of his favorite sentences aloud: "the plug and play devices are not plug and play" (well, he puts it more colorfully) |
My observations concerning PCIe bandwidth and GPU throughput (though it's anecdotal and the GPUs are very weak):
a) 1 GeForce 8600GT, PCIe 16x -> 8.8 GHz-d/day w/ CPU sieve
b) 1 GeForce 8600GT, PCIe 8x -> 8.8 GHz-d/day w/ CPU sieve
c) 2 GeForce 8600GT, PCIe 8x in SLI -> each realize 8.6 GHz-d/day w/ CPU sieve

I then took one of these cards and put it in a motherboard with a 16x slot electrically limited to 4x communications:

d) 1 GeForce 8600GT, PCIe 4x -> 8.8 GHz-d/day w/ CPU sieve

These cards support CUDA 1.1 and are the oldest architecture supported by mfaktc. A faster card may be limited by PCIe bandwidth, but these cards seem to be okay with almost any PCIe bus. |
Currently on a 8600 GTS. I know how slow they are :smile:
|
[QUOTE=henryzz;334509]I have a similar question but slightly more extreme. I have been thinking of getting a graphics card at some point something like a 740 when they come out eventually. The card will be designed for 16x with PCIe 3.0 but my motherboard only has PCIe 1.1. I know 2.0 cards should run on my system at an equivalent bus speed of 16x 1.1 or 8x 2.0. Would a PCI 3.0 card run at 4x 3.0 speeds? Would it run at all?
The motherboard is an ASUS P5K-VM if that helps. I will be wanting to use it for a variety of stuff including things like cudalucas, gpu-ecm, gpu P-1 etc.[/QUOTE] Sorry to bump an old thread, but I thought here is probably best. Got a 750 Ti for Christmas. It is working quite happily in my PCIe 1.1 motherboard. Once I get all the programs compiled and working I will do some benchmarks. If someone has a 750 Ti in a PCIe 3.0 socket it would be interesting to compare. Does the speed of the CPU make any difference at all for GPU sieving or cudalucas? If it does, my Q6600 probably won't match a more recent CPU. |
[QUOTE=henryzz;390990][B]Sorry to bump an old thread but I though here is probably best.[/B]
Got a 750 Ti for Christmas. It is working quite happily in my PCIe 1.1 motherboard. Once I get all the programs compiled and working I will do some benchmarks. If someone has a 750 Ti which is in a PCIe 3.0 socket it would be interesting to compare. Does the speed of the cpu make any difference at all for gpu-sieving or cudalucas? If it does my Q6600 probably won't match a more recent cpu.[/QUOTE] Did you ever figure out the answer to your question? I just installed a Titan Z in a PCIe 3.0 socket and I'm no longer able to keep the card at 100%, even with several mfaktc instances. I'm wondering if I need to get a fast MB/CPU combo or if it's a limit of the PCIe 3.0 bus at this point? |
[QUOTE=flashjh;410642]Did you ever figure out the answer to your question. I just installed a Titan Z in a PCIe 3.0 socket and I'm no longer able to keep the card at 100%, even with several mfaktc instances. I'm wondering if I need to get a fast MB\CPU combo or if it's a limit of the PCIe 3.0 bus at this point?[/QUOTE]
Is your card running at 16x? Or are the PCIe lanes split with other cards? |
Man, don't waste the Z on factoring; give it an LL and a DC for the same exponent (to check whether the residues match), one in each GPU, and let it go.
OTOH, what does the "PerfCap Reason" say? You may not be able to max out that card due to power limitations or thermal/voltage, or whatever other reasons. Also, disable the DP in the Nvidia Control Panel if you insist on doing TF with it (it almost doubles the speed). |
With GPU sieving enabled, even PCIe x1 Gen 1 should be more than enough, but I don't know for sure.
Oliver |
It's running at 16x; it hovers at ~97% on both sides. No biggie, just wondering why it won't go to 100 (99%). I have plenty of power to keep the card at full. I thought the same about the GPU sieve, so I'll just let it go. Each side is already significantly faster than a 580.
I can run LL\DC on the card, but I had to remove my last TF 580 to put this card in, so if I move to LL, I'll not be doing any TF anymore. Do we have the running TF capacity to lose the 450 GHzDays\Day? |
[QUOTE=flashjh;410679]Do we have the running TF capacity to lose the 450 GHzDays\Day?[/QUOTE]
No. Please. We should be better off next week, but not today -- we need a bit more of a TF'ed buffer for both LL (various categories) and LLP-1. I /really/ need to get out more.... :smile: |
[QUOTE=chalsall;410683]No. Please.
We should be better off next week, but not today -- we need a bit more of a TF'ed buffer for both LL (various categories) and LLP-1. I /really/ need to get out more.... :smile:[/QUOTE] Yes, agreed! Ok, no problem. I'll leave the 'Z' on TF. It's doing ~1150 GHzDays\Day as of right now. Hope to get a little more out of it, but I need to finish testing. |
[QUOTE=flashjh;410688]Yes, agreed! Ok, no problem. I'll leave the 'Z' on TF. It's doing ~1150 GHzDays\Day as of right now. Hope to get a little more out of it, but I need to finish testing.[/QUOTE]
Now ye're talking! I was going to say that 450 is a bit on the low side for that card... :yucky: |
[QUOTE=flashjh;410642]Did you ever figure out the answer to your question. I just installed a Titan Z in a PCIe 3.0 socket and I'm no longer able to keep the card at 100%, even with several mfaktc instances. I'm wondering if I need to get a fast MB\CPU combo or if it's a limit of the PCIe 3.0 bus at this point?[/QUOTE]
No one got back to me on this. I hope to get a new, much faster PC in around a year's time. I will do before-and-after tests for the GPU. |
I know mfakto and mfaktc are dealing with different architectures even though they are based on similar code. With that said, PCIe width seems to make a huge difference on fast AMD cards even with GPU sieving on mfakto. I've been looking into why exactly that is, but I haven't had much time lately. Is there any reason the number of classes per exponent is set to what it is? In mfakto it looks from the code like we may end up doing a lot of data transfer to/from the card even between the GPU sieving and the TF step. Results checking also reads back a bit or so for each k checked, which adds up. In my case I'm losing about 30% of my potential capacity due to PCIe lane saturation in my hosts.
How similar are the two programs in how they handle scheduling work on the cards? |
Hi,
not sure how accurately nvidia-smi measures bandwidth, but: [CODE]# nvidia-smi -l 1 -a | grep Throughput
    Tx Throughput    : 2000 KB/s
    Rx Throughput    : 2000 KB/s
    Tx Throughput    : 24000 KB/s
    Rx Throughput    : 1000 KB/s
    Tx Throughput    : 0 KB/s
    Rx Throughput    : 189000 KB/s
    Tx Throughput    : 0 KB/s
    Rx Throughput    : 0 KB/s
    Tx Throughput    : 0 KB/s
    Rx Throughput    : 0 KB/s
    [...]
[/CODE] This was started at the same time I started mfaktc on a (factory-OCed) GTX 970. Rx/Tx values are shown every second; the first 3 pairs (3 seconds) are during the initial selftest, which uses both the CPU- and GPU-sieve-enabled kernels. After that it is doing regular work on M73.xxx.xxx with GPU sieving. Oliver |
[QUOTE=airsquirrels;410760]Is there any reason the number of classes per exponent is set to what it is?[/QUOTE]
[URL="http://www.mersenneforum.org/showpost.php?p=200887&postcount=37"]yes[/URL] :smile: There is nothing special about 420/4620; it would work with any other natural number >= 1 too, but some numbers allow more efficient sieving than others. Oliver |
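The linked post has the details; the effect of choosing 4620 = 2² · 3 · 5 · 7 · 11 can be reproduced with a brute-force count (a hypothetical sketch, not mfaktc's actual code). Any factor q = 2kp + 1 of M_p must satisfy q ≡ ±1 (mod 8), and candidates divisible by 3, 5, 7, or 11 never need testing, so entire residue classes of k can be discarded before sieving even starts:

```python
# Count which residue classes of k (mod 4620) can possibly yield a factor
# q = 2*k*p + 1 of M_p.  Because 8 divides 2*4620 and 3, 5, 7, 11 all
# divide 4620, both q mod 8 and q mod {3,5,7,11} are constant across a
# whole class, so ineligible classes can be dropped wholesale.
def surviving_classes(p, num_classes=4620):
    count = 0
    for c in range(num_classes):
        q = 2 * c * p + 1
        if q % 8 not in (1, 7):            # factors of M_p are +-1 mod 8
            continue
        if any(q % s == 0 for s in (3, 5, 7, 11)):
            continue                        # every q in this class is divisible
        count += 1
    return count

# For any prime exponent p > 11 this comes out to 960 of 4620 classes,
# i.e. only ~21% of the k-range ever needs to be sieved and tested.
print(surviving_classes(127))  # -> 960
```

The same count with 420 = 2² · 3 · 5 · 7 (dropping the mod-11 test) leaves, I believe, 96 of 420 classes, which is the other class count the post mentions; 4620 buys a slightly better elimination ratio at the cost of more class-switching overhead.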
Is that TX/RX from the perspective of the card or the host?
I would assume the RX is reading the bitmap of results back? As far as I know that is the only significant data returned to the CPU. Maybe we can schedule a task to collapse that bitmap on-card and use some barriers to cause that task to wait for all of the other waves to complete? I'm not super familiar with the CUDA side but I'm very familiar with the OpenCL/AMD side. I have by no means done an exhaustive look at this and you are certainly far more familiar with the architectural decisions you made than I am. |
I haven't searched for documentation for Rx/Tx direction.
There is no result bitmap in mfaktc, just a small array of integers (32x 4 Byte) after each class is finished. The array can hold up to 10 factors per class. Oliver |
Interesting. That doesn't appear to be anywhere near enough bandwidth to saturate even a PCIe 2.0 x1 lane. Latency may be a different matter, however, if those transactions are delaying the next wave of work.
|
I finally got time to really dig into and fix this. The problem was mfakto/OpenCL-specific; I have not looked to see whether there is a similar issue with mfaktc, but the Nvidia GPU bandwidth stats do not seem to indicate there is.
It turns out all of the memory buffers used by mfakto were being allocated with the CL_MEM_USE_HOST_PTR flag, which was causing them to be allocated in system memory rather than on the GPU. Changing this to CL_MEM_COPY_HOST_PTR, which copies the pre-initialized data to the GPU but keeps all the sieve/bitarray memory on the device itself, fixed the problem.

My testing showed that on a FuryX I was processing about 2.85 classes/second around 51M, leading to ~210 sieve-kernel + TF-kernel pairs. This was using about 3.3 GB/s of PCIe bandwidth plus overhead in a PCIe 3.0 x16 slot, which was fine and ran at ~960 GHz-d/day. Using an 8x slot incurred a slight penalty (870 GHz-d/day), but a 4x slot at PCIe 2.0 (5 GT/s) ran at about 660 GHz-d/day, which makes sense since it would only have 2 GB/s or less of PCIe bandwidth available.

After the fix I am able to pass all self-tests and get a full 1020 GHz-d/day even in a 4x PCIe 2.0 slot. There is also likely a LOT less CPU core usage. This is a 55% improvement in my x4 slot performance, and a modest 6% improvement on my 16x slots. After some more verification testing I plan to roll this patched mfakto version out across all of my cards, which should yield about a 20% average throughput increase on my AMD cards. |
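For context, those measurements line up with nominal slot capacities (the per-lane figures below are my assumptions from the PCIe specs, not from the post: ~0.5 GB/s per lane for PCIe 2.0, ~0.985 GB/s for 3.0):

```python
# Rough slot-capacity check against the ~3.3 GB/s the unpatched mfakto
# build was observed to transfer over the bus.
LANE_GB_S = {2: 0.5, 3: 0.985}   # nominal payload rate per lane, by generation

def slot_gb_s(gen, lanes):
    return LANE_GB_S[gen] * lanes

measured_need = 3.3  # GB/s, from the post
for gen, lanes in [(3, 16), (3, 8), (2, 4)]:
    cap = slot_gb_s(gen, lanes)
    print(f"PCIe {gen}.0 x{lanes}: {cap:.1f} GB/s ->",
          "ok" if cap >= measured_need else "bottleneck")
```

The x4 Gen 2 slot caps out at 2.0 GB/s, below the ~3.3 GB/s needed, matching the 660 vs. 960 GHz-d/day gap; the x8 Gen 3 slot has raw headroom yet still showed a small penalty, so latency and transfer scheduling evidently matter as well as pure bandwidth.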
[QUOTE=airsquirrels;420850]I finally got time to really dig into and fix this...[/QUOTE]
:tu: |
This is really cool. Did you replace the flag for all buffers? Is that all you had to change? I'm happy to roll that into the mfakto code.
The reason for the USE flag was simply that I carried over how the buffers are allocated from the CPU sieve. I neglected to check for more performant ways ... |
I changed it for all of the buffers, but I never use CPU sieving. You could leave ktab as-is since it isn't used by the GPU sieve. That may or may not affect performance for CPU sieving.
I would be interested in how many others are using AMD cards and can test the release for improvement? |
[QUOTE=airsquirrels;421103]I changed it for all of the buffers, but I never use CPU sieving. You could leave ktab as is since it isn't used by the GPU Sieve. That may or may not affect performance for CPU seiving.
I would be interested in how many others are using AMD cards and can test the release for improvement?[/QUOTE] I'm up for testing anything! |
Linux or Windows?
I'm only equipped to cut Linux builds, but perhaps Bdot can update the buffers to all be COPY_HOST_PTR and cut some builds? |
[QUOTE=airsquirrels;421120]Linux or Windows?
I'm only equipped to cut Linux builds but perhaps bdot can update the buffers to all be COPY_HOST_PTR and cut some builds?[/QUOTE] Windows.. but I can compile mfakto from source assuming there's no code that's dependent on linux. |
Just find and replace CL_MEM_USE_HOST_PTR with CL_MEM_COPY_HOST_PTR in mfakto.cpp and gpusieve.cpp
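For anyone following along, that find-and-replace is a one-liner with sed (demonstrated here on a sample line; in an actual mfakto checkout you would run the `-i` form against the two files named above):

```shell
# Demonstrate the one-line fix on a sample clCreateBuffer call:
echo 'clCreateBuffer(context, CL_MEM_USE_HOST_PTR, size, host_ptr, &err);' \
  | sed 's/CL_MEM_USE_HOST_PTR/CL_MEM_COPY_HOST_PTR/g'

# In an actual mfakto source tree (edits the files in place):
# sed -i 's/CL_MEM_USE_HOST_PTR/CL_MEM_COPY_HOST_PTR/g' mfakto.cpp gpusieve.cpp
```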
|
[QUOTE=airsquirrels;421131]Just find and replace CL_MEM_USE_HOST_PTR with CL_MEM_COPY_HOST_PTR in mfakto.cpp and gpusieve.cpp[/QUOTE]
Nice! I wasn't expecting much of an improvement at all for my GPU since it's not really bottlenecked, but I'm getting 387 GHz-d/day vs 380 GHz-d/day now; it's better than nothing :smile: |
Did this change affect the performance of higher bit levels versus lower bit levels?
|
You mean the fact that mfakto processes lower bitlevels faster? That is because mfakto needs to use different kernels (different implementations) for different bitlevels. This is unchanged. All kernels will benefit equally from this change (a certain percentage of improvement for each).
|
[QUOTE=Bdot;421535]You mean the fact that mfakto processes lower bitlevels faster? That is because mfakto needs to use different kernels (different implementations) for different bitlevels. This is unchanged. All kernels will benefit equally from this change (a certain percentage of improvement for each).[/QUOTE]
Yeah, that's what I was curious about. I was wondering if the kernels for higher bit levels were more memory intensive or not. |