#34 | "David" | Jul 2015 | Ohio

Is that TX/RX from the perspective of the card or the host?

I would assume the RX is reading the bitmap of results back? As far as I know, that is the only significant data returned to the CPU. Maybe we could schedule a task to collapse that bitmap on-card and use barriers to make that task wait for all of the other waves to complete. I'm not super familiar with the CUDA side, but I'm very familiar with the OpenCL/AMD side. I have by no means done an exhaustive look at this, and you are certainly far more familiar with the architectural decisions you made than I am.

#35 | "Oliver" | Mar 2005 | Germany

I haven't searched for documentation on the Rx/Tx direction.

There is no result bitmap in mfaktc, just a small array of integers (32 × 4 bytes) read back after each class is finished. The array can hold up to 10 factors per class.

Oliver

#36 | "David" | Jul 2015 | Ohio

Interesting. That doesn't appear to be anywhere near enough bandwidth to saturate even a PCIe 2.0 x1 lane. Latency may be a different matter, however, if those transactions delay the next wave of work.
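For a rough sanity check on that claim, here is the back-of-the-envelope arithmetic (a sketch, assuming the 32 × 4-byte per-class result array described above, the ~2.85 classes/second rate reported later in the thread, and the standard PCIe 2.0 figure of 5 GT/s per lane with 8b/10b encoding):

```python
# Back-of-the-envelope: per-class result traffic vs. a single PCIe 2.0 lane.
RESULT_ARRAY_BYTES = 32 * 4   # per-class result array (per post #35)
CLASSES_PER_SEC = 2.85        # class rate quoted later in the thread

# PCIe 2.0: 5 GT/s per lane, 8b/10b encoding -> 500 MB/s usable per lane
PCIE2_X1_BYTES_PER_SEC = 5e9 * (8 / 10) / 8

result_bw = RESULT_ARRAY_BYTES * CLASSES_PER_SEC  # bytes/second
print(f"result traffic: {result_bw:.0f} B/s")
print(f"PCIe 2.0 x1:    {PCIE2_X1_BYTES_PER_SEC / 1e6:.0f} MB/s")
print(f"utilization:    {result_bw / PCIE2_X1_BYTES_PER_SEC:.2e}")
```

The result traffic comes out at a few hundred bytes per second against roughly 500 MB/s of lane bandwidth, so the result read-back alone cannot be what saturates the bus.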

#37 | "David" | Jul 2015 | Ohio

I finally got time to really dig into this and fix it. The problem was mfakto/OpenCL-specific. I have not checked whether there is a similar issue in mfaktc, but the Nvidia GPU bandwidth stats do not seem to indicate that there is.

It turns out all of the memory buffers used by mfakto were being allocated with the CL_MEM_USE_HOST_PTR flag, which caused them to be allocated in system memory rather than on the GPU. Changing this to CL_MEM_COPY_HOST_PTR, which copies the pre-initialized data to the GPU but keeps all of the sieve/bit-array memory on the device itself, fixed the problem.

My testing showed that on a Fury X I was processing about 2.85 classes/second around 51M, leading to ~210 Sieve Kernel + TF Kernel pairs. Before the fix, this used about 3.3 GB/s of PCIe bandwidth plus overhead in a PCIe 3.0 x16 slot, which was fine and ran at ~960 GHz-days/day. An x8 slot incurred a slight penalty, at 870 GHz-days/day, but an x4 slot at PCIe 2.0 (5 GT/s) ran at about 660 GHz-days/day, which makes sense since it would have only 2 GB/s or less of PCIe bandwidth available.

After the fix, I am able to pass all self-tests and get a full 1020 GHz-days/day even in an x4 PCIe 2.0 slot. There is also likely a LOT less CPU core usage. This is a 55% improvement in my x4-slot performance and a modest 6% improvement in my x16 slots. After some more verification testing, I plan to roll this patched mfakto version out across all of my cards, which should yield about a 20% average throughput increase on my AMD cards.

Last fiddled with by airsquirrels on 2016-01-01 at 18:08
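The quoted percentages are consistent with the throughput figures in the post (a quick sanity computation using only the numbers reported above):

```python
# Check the claimed improvements against the reported GHz-days/day figures:
# 960 (x16 before), 660 (PCIe 2.0 x4 before), 1020 (after the fix, any slot).
before_x16, before_x4, after = 960, 660, 1020

x4_gain = (after - before_x4) / before_x4 * 100    # ~55%
x16_gain = (after - before_x16) / before_x16 * 100 # ~6%
print(f"x4 slot gain:  {x4_gain:.1f}%")
print(f"x16 slot gain: {x16_gain:.1f}%")

# PCIe 2.0 x4 raw bandwidth: 5 GT/s * 4 lanes * 8b/10b encoding, in GB/s
pcie2_x4_gb_per_s = 5e9 * 4 * (8 / 10) / 8 / 1e9
print(f"PCIe 2.0 x4 bandwidth: {pcie2_x4_gb_per_s:.0f} GB/s")
```

The x4 gain works out to about 54.5% (the "55%" in the post) and the x16 gain to 6.25%, and a PCIe 2.0 x4 link does indeed top out around 2 GB/s, matching the "2GB/sec or less" observation.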

#38 | "Jerry" | Nov 2011 | Vancouver, WA

#39 | Bdot | Nov 2010 | Germany

This is really cool. Did you replace the flag for all buffers? Is that all you had to change? I'm happy to roll that into the mfakto code.

The reason for the USE flag was simply that I carried over the buffer allocation from the CPU sieve. I missed checking for more performant ways ...

Last fiddled with by Bdot on 2016-01-03 at 13:56

#40 | "David" | Jul 2015 | Ohio

I changed it for all of the buffers, but I never use CPU sieving. You could leave ktab as-is, since it isn't used by the GPU sieve; that may or may not affect performance for CPU sieving.

I would be interested to hear how many others are using AMD cards and could test the release for improvement.

#41 | "Mr. Meeseeks" | Jan 2012 | California, USA

#42 | "David" | Jul 2015 | Ohio

Linux or Windows?

I'm only equipped to cut Linux builds, but perhaps Bdot can update the buffers to all be COPY_HOST_PTR and cut some builds?

#43 | "Mr. Meeseeks" | Jan 2012 | California, USA

#44 | "David" | Jul 2015 | Ohio

Just find and replace CL_MEM_USE_HOST_PTR with CL_MEM_COPY_HOST_PTR in mfakto.cpp and gpusieve.cpp.
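As a sketch, that swap can be done mechanically from the shell. The two files created below are hypothetical stand-ins so the example is self-contained; in practice you would run only the `sed` line against the real mfakto.cpp and gpusieve.cpp in the mfakto source tree (the `-i.bak` form works with both GNU and BSD sed):

```shell
# Stand-in files for demonstration; the real targets are mfakto.cpp and
# gpusieve.cpp in the mfakto source tree.
printf 'clCreateBuffer(ctx, CL_MEM_USE_HOST_PTR, size, host, &err);\n' > mfakto.cpp
printf 'flags = CL_MEM_USE_HOST_PTR;\n' > gpusieve.cpp

# The actual fix: swap the flag in place in both files, keeping .bak backups.
sed -i.bak 's/CL_MEM_USE_HOST_PTR/CL_MEM_COPY_HOST_PTR/g' mfakto.cpp gpusieve.cpp

# Confirm the old flag is gone (grep exits non-zero when nothing matches).
grep -q CL_MEM_USE_HOST_PTR mfakto.cpp gpusieve.cpp || echo "all replaced"
```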