[QUOTE=Prime95;377292]Not all ulongs are a problem. I was surprised that bit_to_clr ulong operations were OK. I did turn on the GWDEBUG checks with an ugly and slow mul_32_32 macro and all was OK.
Presently, self tests pass with DEBUG_FACTOR_FIRST set, but only about half pass if it is not set.[/QUOTE] This sounds like a VectorSize problem: as if only half of the candidates are tested. Either VectorSize is not always evaluated, or it is set to 2 but Intel does not support vectors. To occupy as many compute units as possible, I made each thread work on a vector of FCs - the old VLIW4 and VLIW5 devices used 4 per thread; recent GCN devices perform best with 2 per thread.

Can you verify that the sieve returns the proper number of bits set? When DETAILED_INFO is defined, mfakto will copy the device's arrays to the host and partially dump them. For the sieve array, it will count the bits.
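The vector-of-FCs layout described above can be sketched in scalar C. This is illustrative only: the function names and the trivial divisibility test are placeholders, not mfakto's actual code; the point is how VECTOR_SIZE consecutive candidates belong to one GPU thread.

```c
#include <stdint.h>

#define VECTOR_SIZE 2  /* per the post: 2 FCs per thread on GCN, 4 on VLIW4/5 */

/* Toy stand-in for the real trial-factoring test: does fc divide m? */
static int divides(uint32_t fc, uint32_t m) { return fc != 0 && m % fc == 0; }

/* Scalar model of the vectorized layout: each "thread" (outer iteration)
 * owns VECTOR_SIZE consecutive candidates, the way mfakto packs FCs into
 * OpenCL vector types (uint2/uint4). Returns the number of factors found.
 * If VectorSize were mishandled, only some lanes would ever be tested. */
static int test_candidates(const uint32_t *fcs, int n, uint32_t m) {
    int found = 0;
    for (int tid = 0; tid < n / VECTOR_SIZE; tid++)      /* one GPU thread */
        for (int lane = 0; lane < VECTOR_SIZE; lane++)   /* one vector lane */
            if (divides(fcs[tid * VECTOR_SIZE + lane], m))
                found++;
    return found;
}
```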
[QUOTE=Bdot;377314]This sounds like a VectorSize problem....
Can you verify if the sieve returns the proper number of bits set? When defining DETAILED_INFO, mfakto will copy the device's arrays to the host and partially dump them. For the sieve array, it will count the bits.[/QUOTE] I'm failing at VectorSize 1, 2, and 4. extract_bits reports about 4800 bits set. Interestingly, only the barrett32 routines fail; the barrett15 routines always work.
[QUOTE=Prime95;377318]
Interestingly, only the barrett32 routines fail. The barrett15 routines always work.[/QUOTE] That does not sound like a sieve problem. Do they fail the same way when sieving on the CPU (SieveOnGPU=0)? But if DEBUG_FACTOR_FIRST makes them all work, then it must be related to the thread ID or to the location of the factor's bit in the sieve.

I have now committed the changes to github to apply the Intel workaround for CalcModularInverses. There are a few other changes that compile, but are not yet complete or not well tested ...
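For readers wondering what a routine like CalcModularInverses computes: a standard way to get the inverse of an odd number modulo 2^32 is Newton iteration. This is a generic sketch of that well-known technique, not mfakto's actual implementation (which may work modulo a different base).

```c
#include <stdint.h>

/* Newton iteration for the inverse of odd n modulo 2^32:
 * x_{k+1} = x_k * (2 - n * x_k) doubles the number of correct low bits
 * per step. x = n is already correct mod 8 (3 bits, since n*n == 1 mod 8
 * for odd n), so 5 iterations give 3 -> 6 -> 12 -> 24 -> 48 >= 32 bits. */
static uint32_t inv_mod_2_32(uint32_t n) {
    uint32_t x = n;                 /* n must be odd */
    for (int i = 0; i < 5; i++)
        x *= 2u - n * x;            /* all arithmetic wraps mod 2^32 */
    return x;
}
```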
This code is failing in the barrett32 kernels. I don't have a workaround yet.
[CODE]my_k_base.d0 = k_base.d0 + NUM_CLASSES * k_delta;  // k_delta can exceed 2^24: don't use mul24/mad24 for it
my_k_base.d1 = k_base.d1 + mul_hi(NUM_CLASSES, k_delta) - AS_UINT_V(k_base.d0 > my_k_base.d0);
/* k is limited to 2^64 - 1, so there is no need for k.d2 */[/CODE]
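What these lines compute can be modelled in scalar C: a 64-bit value k, held as two 32-bit limbs d0/d1, is advanced by NUM_CLASSES * k_delta, with mul_hi supplying the high product word and the limb comparison detecting the carry. (In the OpenCL original, AS_UINT_V turns the comparison into 0 or 0xFFFFFFFF, hence the subtraction; in scalar C we add 1 instead. The NUM_CLASSES value is an example; names here are mine.)

```c
#include <stdint.h>

typedef struct { uint32_t d0, d1; } k64;  /* low/high 32-bit limbs of k */

#define NUM_CLASSES 4620u  /* example class count */

/* Scalar model of the failing OpenCL lines: k = k_base + NUM_CLASSES * k_delta
 * on 32-bit limbs. mul_hi(a,b) is the high 32 bits of the 64-bit product;
 * (k_base.d0 > k.d0) after the low-limb add detects the carry out of d0. */
static k64 advance_k(k64 k_base, uint32_t k_delta) {
    k64 k;
    k.d0 = k_base.d0 + NUM_CLASSES * k_delta;                /* wraps mod 2^32 */
    uint32_t hi = (uint32_t)(((uint64_t)NUM_CLASSES * k_delta) >> 32); /* mul_hi */
    k.d1 = k_base.d1 + hi + (k_base.d0 > k.d0 ? 1u : 0u);    /* add carry */
    return k;
}
```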
I tried the beta version of Intel's device driver -- no change. I've submitted a bug report to Intel Developer Zone.
[code]mfakto.cpp:39:18: fatal error: menu.h: No such file or directory
 #include "menu.h"
                  ^
compilation terminated.
make: *** [mfakto.o] Error 1[/code] :smile:
I have a workaround:
In the barrett32 kernels, replace the use of NUM_CLASSES with a variable num_c, declared as uint. Before the for loop, add:

[CODE]num_c = NUM_CLASSES % (total_bit_count + 1000000);[/CODE]

Now what?? I think I'll read the Intel documentation for optimization ideas. Are there any timings you'd like me to run?
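Why this workaround presumably helps: total_bit_count + 1000000 is always larger than NUM_CLASSES, so the modulo leaves the value unchanged at runtime, but because it now depends on a runtime input, the compiler can no longer constant-fold NUM_CLASSES at that site, sidestepping whatever the Intel OpenCL compiler mishandles with the literal constant. A minimal sketch of the identity (the function name is mine, and the interpretation of the compiler bug is an assumption):

```c
#include <stdint.h>

#define NUM_CLASSES 4620u  /* example class count */

/* Derive num_c from a runtime value so a compiler cannot treat it as a
 * compile-time constant. Since NUM_CLASSES < total_bit_count + 1000000
 * always holds here, the modulo is a no-op on the value itself. */
static uint32_t opaque_num_classes(uint32_t total_bit_count) {
    return NUM_CLASSES % (total_bit_count + 1000000u);
}
```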
[QUOTE=kracker;377332][code]mfakto.cpp:39:18: fatal error: menu.h: No such file or directory
 #include "menu.h"
                  ^
compilation terminated.
make: *** [mfakto.o] Error 1[/code]:smile:[/QUOTE] :blush: There may be more issues ... I have not yet tried to build on Linux. Give me another day or two :smile:
[QUOTE=Prime95;377336]I have a workaround:
In the barrett32 kernels, replace the use of NUM_CLASSES with a variable num_c, declared as uint. Before the for loop, add: num_c = NUM_CLASSES % (total_bit_count + 1000000); Now what?? I think I'll read the Intel documentation for optimization ideas. Are there any timings you'd like me to run?[/QUOTE] I've committed the changes for the barrett32_gs kernels. Is the same required for the CPU-sieve kernels? Could you test whether (and which) kernels fail with SieveOnGPU=0?

What's next? I'd need to know the speed of each of the kernels. For that, it would be best to send me the output of a few minutes of 'mfakto -st', with CPU sieving and with CL_PERFORMANCE_INFO defined in the build (this would also answer the question above :smile:). I need that to update the kernel order in find_fastest_kernel().

To evaluate the GPU sieve speed, a normal binary without any DEBUG flags would be best. Then set in mfakto.ini:

[CODE]#TestSieveSizes    (commented out to skip the CPU sieve tests)
TestSievePrimes=60000,65000,70000,75000,80000,85000,90000,100000,110000,120000
TestGPUSieveSizes=32,48,64,96,120,128[/CODE]

Then run mfakto --perftest. This should take quite a while and finally come back with a table that can be used to find the optimal settings.

There is no automated testing for GPUSieveProcessSize; for a complete picture, the above test would need to be repeated for all 4 possible values of GPUSieveProcessSize. The test with CL_PERFORMANCE_INFO gives the speed of the TF kernels. The optimal SievePrimes value is approximately where the reported incremental removal rate matches the TF speed.

And when you have that, we all want to know how many GHz-days/day IntelHD can churn out :cool:
[QUOTE=Bdot;377346]:blush:
There may be more issues ... I did not yet try to build on linux. Give me another day or two :smile:[/QUOTE] Just one more thing... I'm "missing" termios.h(included from kbhit.h) It does say "// simulate _kbhit() on Linux" though, so... BTW, VS12 doesn't have it either. |
[QUOTE=kracker;377381]Just one more thing... I'm "missing" termios.h (included from kbhit.h)
It does say "// simulate _kbhit() on Linux" though, so... BTW, VS12 doesn't have it either.[/QUOTE] On Windows, you should not build kbhit.cpp. Just include conio.h; it provides _kbhit(). That file is only a workaround for Linux, which does not provide that functionality in its standard libraries. You will need to adjust the #ifdefs around it to choose the Windows branch when building with MinGW.