mersenneforum.org

mfakto: an OpenCL program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=15646)

kracker 2014-06-24 14:54

[QUOTE=tului;376583]Any newer or beta mfakto we could test on our GCN cards? I'd love to squeeze some more performance out of them. They're not 79xx renames, they're the real deal new chips.[/QUOTE]

If you have 0.14, that is basically the latest. Here's mfakto on github, if you're curious.
[url]https://github.com/Bdot42/mfakto[/url]

tului 2014-06-25 00:56

[QUOTE=kracker;376615]If you have 0.14, that is basically the latest. Here's mfakto on github, if you're curious.
[url]https://github.com/Bdot42/mfakto[/url][/QUOTE]

Yeah, I cloned the repository and built it with VS2013, so maybe with /arch:AVX I get some small boost. Most of the files were dated with the 0.14 release, so it looks like nothing has changed since then. I also downloaded and built MISFIT from source and was playing with that some.

Bdot 2014-06-25 01:55

/avx can certainly help when the sieve runs on the CPU (SieveOnGPU=0). But usually, the CPU can no longer keep up with the GPU, especially for high-end GPUs.

Which cards do you have? I'd be interested to see some performance info, but I'm not yet finished with my changes for the --perftest mode ...

If you build it yourself, could you uncomment the #define CL_PERFORMANCE_INFO in params.h and build this as a special version (e.g. mfakto_pi.exe) ...
Then modify the mfakto.ini file to set SieveOnGPU=0 and run

mfakto_pi -st > st_pi.log

This will run the selftest with timing information for each of the kernel invocations. Using that information I can see if using different kernels can improve performance.
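
For anyone who wants to script the ini change rather than edit by hand, here is a minimal Python sketch. It assumes mfakto.ini uses plain `Key=Value` lines with `#` for comments; the helper name `set_ini_value` is made up for illustration and is not part of mfakto.

```python
import re

def set_ini_value(text: str, key: str, value: str) -> str:
    """Return ini text with `key` set to `value`.

    Assumption: settings are plain `Key=Value` lines. Lines starting
    with '#' are treated as comments and left untouched. If the key is
    not present, a new `Key=Value` line is appended.
    """
    pattern = re.compile(r"^([ \t]*)" + re.escape(key) + r"[ \t]*=.*$",
                         re.MULTILINE)
    if pattern.search(text):
        # Keep the original indentation, replace everything after it.
        return pattern.sub(lambda m: f"{m.group(1)}{key}={value}", text)
    sep = "" if text.endswith("\n") or not text else "\n"
    return f"{text}{sep}{key}={value}\n"

# Example: turn GPU sieving off before running the selftest.
ini_text = "# sieve location\nSieveOnGPU=1\nVectorSize=4\n"
print(set_ini_value(ini_text, "SieveOnGPU", "0"))
```

After writing the changed text back to mfakto.ini, the selftest command from the post above can be run as given.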

kracker 2014-06-25 02:01

[QUOTE=Bdot;376664]/avx can certainly help when the sieve runs on the CPU (SieveOnGPU=0). But usually, the CPU can no longer keep up with the GPU, especially for high-end GPUs.[/QUOTE]

Hmm, I thought there has to be specific AVX code in source to be used?

tului 2014-06-25 07:22

[QUOTE=kracker;376667]Hmm, I thought there has to be specific AVX code in source to be used?[/QUOTE]

Intrinsics can be used directly, but afaik most modern compilers can also auto-vectorize: they look at the overall idea of what you're trying to do and possibly use AVX, SSE etc. to improve it.

tului 2014-06-26 01:21

[QUOTE=Bdot;376664]/avx can certainly help when the sieve runs on the CPU (SieveOnGPU=0). But usually, the CPU can no longer keep up with the GPU, especially for high-end GPUs.

Which cards do you have? I'd be interested to see some performance info, but I'm not yet finished with my changes for the --perftest mode ...

If you build it yourself, could you uncomment the #define CL_PERFORMANCE_INFO in params.h and build this as a special version (e.g. mfakto_pi.exe) ...
Then modify the mfakto.ini file to set SieveOnGPU=0 and run

mfakto_pi -st > st_pi.log

This will run the selftest with timing information for each of the kernel invocations. Using that information I can see if using different kernels can improve performance.[/QUOTE]

I've got it running on 2 R7 260X cards at the moment. Ivy Bridge i5-3570K, 16GB RAM at 1866MHz CAS 9. Do you want me to uncomment that and build them with /SSE and /AVX as well as all the other optimization levels, or just use /Ox (the full optimization mode) for all of them?

edit - anyone else having the main site crapping out with database errors right now @ 0122UTC ?

tului 2014-06-26 05:32

Attached are the logs, one for "SSE2" which is the default minimum for an x64 build. The other is of course with AVX, which has to be enabled via /arch:AVX.

Sadly I don't own a Haswell yet as I'm waiting for Haswell-E or I'd test AVX2 as well. If my wife lets me use her laptop I can get you numbers on an AMD A10-5750M APU with AVX as well. Sadly I don't think VS12 has any FMA options to play with.

[url]https://drive.google.com/file/d/0B0Yq8K5dWh1BaTU4V2FkRnY5M0E/edit?usp=sharing[/url]

[url]https://drive.google.com/file/d/0B0Yq8K5dWh1BU3FhX0ZOQTZXVGs/edit?usp=sharing[/url]

edit - Just glancing at them I don't see much difference. The small percentages here and there I see could be small bursts of CPU usage on my system. You know the kernels better than me so maybe they'll mean something to you.

kracker 2014-06-26 14:41

[QUOTE=tului;376781]Attached are the logs, one for "SSE2" which is the default minimum for an x64 build. The other is of course with AVX, which has to be enabled via /arch:AVX.

Sadly I don't own a Haswell yet as I'm waiting for Haswell-E or I'd test AVX2 as well. If my wife lets me use her laptop I can get you numbers on an AMD A10-5750M APU with AVX as well. Sadly I don't think VS12 has any FMA options to play with.

[url]https://drive.google.com/file/d/0B0Yq8K5dWh1BaTU4V2FkRnY5M0E/edit?usp=sharing[/url]

[url]https://drive.google.com/file/d/0B0Yq8K5dWh1BU3FhX0ZOQTZXVGs/edit?usp=sharing[/url]

edit - Just glancing at them I don't see much difference. The small percentages here and there I see could be small bursts of CPU usage on my system. You know the kernels better than me so maybe they'll mean something to you.[/QUOTE]

[code]
WARNING: Your GPU was detected as GCN (Graphics Core Next). These chips perform very slow with vector sizes of 4 or higher. Please change to VectorSize=2 in mfakto.ini and restart mfakto for optimal performance.
[/code]

tului 2014-06-27 02:29

[QUOTE=kracker;376796][code]
WARNING: Your GPU was detected as GCN (Graphics Core Next). These chips perform very slow with vector sizes of 4 or higher. Please change to VectorSize=2 in mfakto.ini and restart mfakto for optimal performance.
[/code][/QUOTE]

Yes, but I was running with GPU sieving off, so afaik nothing was sent to the GPU. On my "production" build I have VectorSize set to 2.

Bdot 2014-06-27 09:24

Well, it's true that some parts of the work (namely the sieving) will move to the CPU, but that's not the part that is measured by the PERFORMANCE_INFO test. This test takes timings of the actual trial factoring of the sieved list of factor candidates - and that part is still done on the GPU.
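
To make the two stages concrete, here is a rough Python sketch. It illustrates the standard math only and is not mfakto's actual kernels: any prime factor q of 2^p - 1 (p an odd prime) has the form q = 2kp + 1 and satisfies q = 1 or 7 (mod 8). The sieve discards candidates that cannot be prime factors, and the trial-factoring step checks the survivors. The function names and the tiny list of sieve primes are my own choices.

```python
def sieve_candidates(p, k_max, small_primes=(3, 5, 7, 11, 13)):
    """Stage 1 (sieving): list candidates q = 2*k*p + 1 worth testing."""
    survivors = []
    for k in range(1, k_max + 1):
        q = 2 * k * p + 1
        if q % 8 not in (1, 7):
            continue  # prime factors of 2^p - 1 are always 1 or 7 mod 8
        if any(q % s == 0 and q != s for s in small_primes):
            continue  # q has a small divisor, so it is not prime
        survivors.append(q)
    return survivors

def trial_factor(p, candidates):
    """Stage 2 (trial factoring): keep the q that divide 2^p - 1."""
    return [q for q in candidates if pow(2, p, q) == 1]

# 2^11 - 1 = 2047 = 23 * 89
print(trial_factor(11, sieve_candidates(11, k_max=4)))
```

With SieveOnGPU=0 the first stage is the CPU's job, while the second stage is the part that stays on the GPU and that the timing build measures.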

We had different goals: you wanted to test how CPU-code optimization affects performance, I wanted to get GPU performance figures for the newer GCN generation :smile:.

So while the GPU tests with VectorSize=4 are interesting, the same would be needed for VectorSize=2, in order to see which is faster. Though not likely, it is possible that the newer GCN/OpenCL compilers have improved performance for higher vectorization, but I still think that 2 is the number to choose. For the next mfakto version I will change the default ...

Could you please use the same binaries and run this test again with VectorSize=2?

To do the optimization tests you wanted, please build a binary with CL_PERFORMANCE_INFO and SIEVE_SIZE_LIMIT commented out. Then have a look at the ini-file settings

TestSieveSizes
TestSievePrimes
TestGPUSieveSizes

The first two define the CPU-sieving tests to be performed; the second and third define the GPU-sieving tests.

To actually run these tests, start 'mfakto --perftest <n>' where <n> is the number of times each test is repeated for more reliable results, default is 10. This test is aimed at finding the optimal sieve size, which depends on both the CPU architecture and the SievePrimes value to be used.
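
The reason for repeating each test can be sketched in a few lines of Python. This is not how mfakto implements --perftest; `time_repeated`, `summarize` and the dummy workload are hypothetical names used only to show the idea that several repeats damp one-off bursts from other processes.

```python
import statistics
import time

def time_repeated(func, repeats=10):
    """Call func() `repeats` times and return the wall-clock time of each call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings

def summarize(timings):
    """Minimum and median are less sensitive to outliers than the mean."""
    return {"min": min(timings), "median": statistics.median(timings)}

# Dummy workload standing in for one sieve test.
print(summarize(time_repeated(lambda: sum(range(100_000)), repeats=10)))
```

A single slow run caused by background activity shifts the mean noticeably but barely moves the minimum or median, which is why a higher repeat count gives steadier numbers.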

I'm enhancing the perftest piece by piece - the actual trial factoring is not yet included. But the sieving performance can be tested very well already.

I'm quite sure that you will see differences for the different optimization levels here.

tului 2014-06-27 10:01

[url]https://drive.google.com/file/d/0B0Yq8K5dWh1BWjlqRjAyVjY0elE/edit?usp=sharing[/url]

[url]https://drive.google.com/file/d/0B0Yq8K5dWh1BMS1waG5qZXhKVXM/edit?usp=sharing[/url]

Here they go.


All times are UTC.