mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc and PCIe bus width (https://www.mersenneforum.org/showthread.php?t=18011)

airsquirrels 2015-09-18 17:51

Is that TX/RX from the perspective of the card or the host?

I would assume the RX is reading the bitmap of results back? As far as I know that is the only significant data returned to the CPU. Maybe we can schedule a task to collapse that bitmap on-card and use some barriers to cause that task to wait for all of the other waves to complete? I'm not super familiar with the CUDA side but I'm very familiar with the OpenCL/AMD side.

I have by no means done an exhaustive look at this and you are certainly far more familiar with the architectural decisions you made than I am.

TheJudger 2015-09-18 18:00

I haven't searched for documentation on the Rx/Tx direction.

There is no result bitmap in mfaktc, just a small array of integers (32 × 4 bytes) read back after each class is finished. The array can hold up to 10 factors per class.

Oliver

airsquirrels 2015-09-18 18:06

Interesting. That is nowhere near enough data to saturate even a PCIe 2.0 x1 lane. Latency may be a different matter, however, if those transactions delay the next wave of work.

airsquirrels 2016-01-01 18:08

I finally got time to really dig into this and fix it. The problem was mfakto/OpenCL-specific; I have not checked whether mfaktc has a similar issue, but the Nvidia GPU bandwidth stats do not seem to indicate that it does.

It turns out all of the memory buffers used for mfakto were being allocated with the CL_MEM_USE_HOST_PTR flag, which was causing them to be allocated in system memory rather than on the GPU.

Switching to CL_MEM_COPY_HOST_PTR, which copies the pre-initialized data to the GPU while keeping all of the sieve/bitarray memory on the device itself, fixed the problem.
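For illustration, the fix amounts to swapping one flag in the clCreateBuffer calls. This is a sketch, not mfakto's actual call sites; ctx, size, host_init_data and status are placeholder names:

```c
/* Before: CL_MEM_USE_HOST_PTR tells the runtime to use the host
 * allocation directly. On this stack that left the sieve buffers in
 * system memory, so the kernels pulled them across the PCIe bus. */
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                            size, host_init_data, &status);

/* After: CL_MEM_COPY_HOST_PTR lets the runtime allocate the buffer
 * itself (typically in device memory) and fill it with a one-time
 * copy of the pre-initialized host data. */
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                            size, host_init_data, &status);
```

The one-time copy at creation is the trade-off: slightly more startup work in exchange for no per-kernel traffic over the bus.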

My testing showed that on a Fury X I was processing about 2.85 classes/second around 51M, leading to ~210 sieve kernel + TF kernel pairs. That was using about 3.3 GB/s of PCIe bandwidth plus overhead in a PCIe 3.0 x16 slot, which was fine, running ~960 GHz-days/day.

Using an x8 slot incurred a slight penalty, dropping to 870 GHz-days/day, but an x4 slot at PCIe 2.0 (5 GT/s) ran at about 660 GHz-days/day, which makes sense since it would only have 2 GB/s or less of PCIe bandwidth available.

After the fix I am able to pass all self-tests and get a full 1020 GHz-days/day even in an x4 PCIe 2.0 slot. There is also likely a LOT less CPU core usage. This is a 55% improvement in my x4 slot performance, and a modest 6% improvement in my x16 slots.

After some more verification testing I plan to roll this patched mfakto version out across all of my cards, which should yield about a 20% average throughput increase on my AMD cards.

flashjh 2016-01-01 18:56

[QUOTE=airsquirrels;420850]I finally got time to really dig into and fix this...[/QUOTE]

:tu:

Bdot 2016-01-03 13:51

This is really cool. Did you replace the flag for all buffers? Is that all you had to change? I'm happy to roll that into the mfakto code.

The reason for the USE flag was simply that I carried the buffer allocation over from the CPU sieve. I neglected to check for more performant options ...

airsquirrels 2016-01-03 14:40

I changed it for all of the buffers, but I never use CPU sieving. You could leave ktab as-is since it isn't used by the GPU sieve; that may or may not affect performance for CPU sieving.

I would be interested to know how many others are using AMD cards and could test the release for improvement.

kracker 2016-01-03 16:25

[QUOTE=airsquirrels;421103]I changed it for all of the buffers, but I never use CPU sieving. You could leave ktab as is since it isn't used by the GPU Sieve. That may or may not affect performance for CPU sieving.

I would be interested in how many others are using AMD cards and can test the release for improvement?[/QUOTE]

I'm up for testing anything!

airsquirrels 2016-01-03 17:09

Linux or Windows?

I'm only equipped to cut Linux builds but perhaps bdot can update the buffers to all be COPY_HOST_PTR and cut some builds?

kracker 2016-01-03 18:58

[QUOTE=airsquirrels;421120]Linux or Windows?

I'm only equipped to cut Linux builds but perhaps bdot can update the buffers to all be COPY_HOST_PTR and cut some builds?[/QUOTE]

Windows... but I can compile mfakto from source, assuming there's no Linux-dependent code.

airsquirrels 2016-01-03 19:03

Just find and replace CL_MEM_USE_HOST_PTR with CL_MEM_COPY_HOST_PTR in mfakto.cpp and gpusieve.cpp.
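For anyone building from source, that replacement can be scripted, e.g. with GNU sed (run from the mfakto source directory; back up the files first if you want the originals kept):

```shell
# Swap the buffer-allocation flag in place in both files named above.
sed -i 's/CL_MEM_USE_HOST_PTR/CL_MEM_COPY_HOST_PTR/g' mfakto.cpp gpusieve.cpp
```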


All times are UTC.
