mersenneforum.org mtsieve

2020-05-25, 13:03   #309
rogue

"Mark"
Apr 2003
Between here and the

6,043 Posts

Quote:
 Originally Posted by Citrix I tried to write the code my self. I have attached it. The program mainly writes the factors to factor file which need to be processed by srfile. There is no input or output file. I only modified the FPU code and not the AVX multiplication code. I cannot get it to compile. gmp.h is missing. I do not have GMP installed on my computer. Can anyone help compile it? I do not have experience with GPU apps - how do I modify the code for GPU? Possibly cw_kernel.cl needs to be modified. Not sure what to do with the cw_kernel.h file.
Cool. I appreciate your attempt. It sounds like you are very close.

GMP is only needed for gfndsieve and dmdsieve. It seems that the makefile has a mistake, since it lists the GMP library as a dependency for building gcwsieve, which is what it appears that you started with. You can remove that dependency from the makefile.

The OpenCL source files have the .cl extension. These have to be converted to .h files, which are then included in the GpuWorker.cpp class. This requires the use of perl (I use ActivePerl). The command is "perl cltoh.pl file.cl > file.h". It is easier to edit the .cl files and regenerate the .h files than to edit the .h files directly.
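For readers who cannot run the Perl script, the conversion step can be approximated in C++. This is a minimal sketch, assuming the generated header only needs to hold the kernel source as a C string constant; the function name `ClSourceToHeader` and the output layout are hypothetical, and the real cltoh.pl may emit something different.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: wraps OpenCL source text in a C string constant.
// The layout is an assumption; the real cltoh.pl output may differ.
std::string ClSourceToHeader(const std::string &clSource, const std::string &varName)
{
   std::istringstream in(clSource);
   std::ostringstream out;
   std::string line;
   bool wroteLine = false;

   out << "const char *" << varName << " =\n";

   while (std::getline(in, line))
   {
      out << '"';
      for (char c : line)
      {
         if (c == '\r')
            continue;               // drop carriage returns from Windows line endings
         if (c == '\\' || c == '"')
            out << '\\';            // escape characters that would break the literal
         out << c;
      }
      out << "\\n\"\n";             // keep the kernel's own line break inside the string
      wroteLine = true;
   }

   if (!wroteLine)
      out << "\"\"\n";              // empty input still yields a valid declaration

   out << ";\n";
   return out.str();
}
```

A small driver that reads file.cl into a string and writes the result to file.h would stand in for the "perl cltoh.pl file.cl > file.h" command.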

It might be easier to start with the OpenCL kernel for gfndsieve than gcwsieve as that kernel is much simpler. I don't know what experience you have with OpenCL.

I'm not certain what you mean by "modified the FPU code". I'm hoping that you did not modify any .S sources.

2020-05-26, 08:16   #310
Citrix

Jun 2003

2^3·197 Posts

I was able to compile the code. Needed to do some debugging.

I would recommend you include https://github.com/GPUOpen-Libraries...L-SDK/releases & https://github.com/KhronosGroup/OpenCL-Headers with mtsieve source - instead of using AMD SDK which would need to be installed.

I am only using FPU code in the worker file and not the AVX code function. Something to look into when I have time to make the code even faster.

I do not have experience with OpenCL, but the gcwsieve cw_kernel.cpp code seems straightforward. I can look into it. Using Perl might be more challenging.

Do you know if anyone is using gcwsieve to sieve? It can be made faster by replacing multiplication with addition.

Last fiddled with by Citrix on 2020-05-26 at 08:30

2020-05-26, 12:15   #311
rogue

"Mark"
Apr 2003
Between here and the

179B_16 Posts

For the SDK, I have used AMD SDK for years and never had a reason to update, but since AMD doesn't seem to support it anymore, I could look into another OpenCL library.

AVX512 is a bit harder to use if you want to get the full benefit.

For perl, it is just a command line tool I used to generate a header file from a .cl file. Nothing more than that. In theory it could be converted to a .sh or .bat script.

This search uses gcwsieve for Generalized Woodalls.

This search uses gcwsieve for Generalized Cullens.

If you are thinking of an enhancement to gcwsieve, I suggest that you post in the other thread and include the math.

2020-05-26, 13:48   #312
rogue

"Mark"
Apr 2003
Between here and the

6,043 Posts

To be clear, the framework only has code for AVX2, not AVX512. I don't have a CPU that supports AVX512, so I cannot write or test code for it.

2020-05-26, 21:14   #313
Citrix

Jun 2003

11000101000_2 Posts

Quote:
 Originally Posted by rogue AVX512 is a bit harder to use if you want to get the full benefit. For perl, it is just a command line tool I used to generate a header file from a .cl file. Nothing more than that. In theory it could be converted to a .sh or .bat script. If you are thinking of an enhancement to gcwsieve, I suggest that you post in the other thread and include the math.
I was able to modify the AVX - substantially faster. Getting 1.2Mp/sec on 4 cores.
Though limited to 2^52.

Is there a faster way of doing this:

Code:

avx_set_16a(tempinvs);
avx_set_16b(powinvs);
avx_mulmod(dps, reciprocals);
avx_get_16a(tempinvs);

avx_set_16a(tempinvs);
avx_set_16b(multinvs);
avx_mulmod(dps, reciprocals);
avx_get_16a(tempinvs);

If I eliminate the first get step and the second set step the program misses factors.
Are you able to write some code to calculate n/2 (mod p) for doubles and for 64 bit integers? Currently I am having to use mulmod, which is extremely slow.
For GPU - modifying the code should not take too long. Though will need to sieve deep enough with CPU first as program is finding too many factors. Might not be worth changing the GPU code as might have sieved deep enough with CPU anyway. How many factors can the GPU handle and when to switch over?

I will post gcwsieve code/algorithm in other thread.

Last fiddled with by Citrix on 2020-05-26 at 21:30

2020-05-26, 22:45   #314
rogue

"Mark"
Apr 2003
Between here and the

179B_16 Posts

I am impressed with how quickly you have picked up on the framework. It is one of the reasons why I wrote the framework in the first place. The idea is that it should be easily accessible to anyone with some basic knowledge of C and C++ and the math they need for the Worker class.

Quote:
 Originally Posted by Citrix I was able to modify the AVX - substantially faster. Getting 1.2Mp/sec on 4 cores. Though limited to 2^52. Is there a faster way of doing this: Code:  avx_set_16a(tempinvs); avx_set_16b(powinvs); avx_mulmod(dps, reciprocals); avx_get_16a(tempinvs); avx_set_16a(tempinvs); avx_set_16b(multinvs); avx_mulmod(dps, reciprocals); avx_get_16a(tempinvs); If I eliminate the first get step and the second set step the program misses factors.
That it misses factors is very odd. Does it miss the same factors consistently when the lines in red (the first get and the second set) are removed? The geta() and seta() functions access the same set of xmm registers. If this can be reproduced easily, then it should be debugged.

Quote:
 Are you able to write some code to calculate n/2 (mod p) for a double and 64 bit integers. Currently I am having to use mulmod which is extremely slow. For GPU - modifying the code should not take too long. Though will need to sieve deep enough with CPU first as program is finding too many factors. Might not be worth changing the GPU code as might have sieved deep enough with CPU anyway. How many factors can the GPU handle and when to switch over? I will post gcwsieve code/algorithm in other thread.
Please explain what you mean by "64 bit integers"? The FPU routines are limited to 62 bits. The SSE and AVX routines are limited to 52 bits. If you are referring to pure 64-bit and the CPU, I do not have mulmod or powmod routines for it. The GPU can handle 64-bit mulmods.
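The 52-bit limit on the SSE and AVX routines is consistent with how doubles store integers: a double has a 53-bit significand, so larger integers are only exact when their low bits are zero. The sketch below only illustrates that property of the number format. `FitsExactlyInDouble` is a hypothetical helper, not part of the framework.

```cpp
#include <cstdint>

// Hypothetical helper: true when the integer n can be stored in a double
// without rounding (IEEE 754 binary64, 53-bit significand).
bool FitsExactlyInDouble(uint64_t n)
{
   if (n == 0)
      return true;

   while ((n & 1) == 0)
      n >>= 1;                 // trailing zero bits are absorbed by the exponent

   return n < (1ULL << 53);    // the remaining bits must fit in the significand
}
```

Every value below 2^53 passes this check, while 2^53 + 1 does not, which is why residues and primes handled as doubles have to stay under a bound of this size.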

If the GPU is returning too many factors and you are concerned about overflowing a buffer, then you can call SetMinGpuPrime() from the App class. This will tell the framework to not use a GPU worker until it reaches the value passed to that method. This might require some experimentation on your part regarding that limit and the size of the factor buffer that the GPU will support.

2020-05-26, 23:13   #315
Citrix

Jun 2003

2^3×197 Posts

I will see if I can post an example in which factors are missed.

Code:
void TestPrimesAVX(void)
{
   double __attribute__((aligned(32))) powinvs[AVX_ARRAY_SIZE];
   double __attribute__((aligned(32))) tempinvs[AVX_ARRAY_SIZE];
   double __attribute__((aligned(32))) dps[AVX_ARRAY_SIZE];

   // CODE MISSING HERE
   for (int i = 0; i < AVX_ARRAY_SIZE; i++)
   {
      // This code is really slow.
      // Faster to use mulmod
      // Any other tricks? Can we use bitshift etc.
      if (((uint64_t)tempinvs[i]) % 2 == 1) { tempinvs[i] = tempinvs[i] + dps[i]; }
      tempinvs[i] = tempinvs[i] / 2;
      if (((uint64_t)powinvs[i]) % 2 == 1) { powinvs[i] = powinvs[i] + dps[i]; }
      powinvs[i] = powinvs[i] / 2;
      if (((uint64_t)powinvs[i]) % 2 == 1) { powinvs[i] = powinvs[i] + dps[i]; }
      powinvs[i] = powinvs[i] / 2;
   }
}
I need to do the above multiple times. For the doubles in AVX, can we do modular arithmetic or any bit operations? The above code is very slow, slower than using mulmod. By 64 bit integers I mean the uint64_t type used in the FPU code.

Anything you can help with?

Last fiddled with by Citrix on 2020-05-26 at 23:17

2020-05-27, 01:28   #316
rogue

"Mark"
Apr 2003
Between here and the

6,043 Posts

I don't know how to do bit operations with floating point values. What is the specific math you are trying to accomplish? Maybe there is a different way to get the desired result.

2020-05-27, 03:37   #317
Citrix

Jun 2003

2^3×197 Posts

Quote:
 Originally Posted by rogue I don't know how to do bit operations with floating point values. What is the specific math you are trying to accomplish? Maybe there is a different way to get the desired result.
Just multiplying the (double) floating point number by 1/2 (mod p).
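Multiplying by 1/2 (mod p) does not need a mulmod when p is odd and the value is already reduced: an even n is simply halved, and an odd n becomes (n + p)/2. The following is a minimal sketch of that idea for both uint64_t and double. The names `HalveModP64` and `HalveModPDouble` are hypothetical and the code is not taken from the framework. It assumes p is odd and 0 <= n < p, and for the double version that both values are integers below 2^52.

```cpp
#include <cmath>
#include <cstdint>

// n/2 (mod p) for odd p and 0 <= n < p, integers only.
// Odd n: (n + p)/2 is computed as (n >> 1) + (p >> 1) + 1, which cannot overflow.
uint64_t HalveModP64(uint64_t n, uint64_t p)
{
   uint64_t mask = 0 - (n & 1);                 // all ones when n is odd, zero otherwise
   return (n >> 1) + (mask & ((p >> 1) + 1));
}

// Same operation on doubles that hold integers below 2^52.
// No cast to an integer type is needed to test parity.
double HalveModPDouble(double n, double p)
{
   double half = n * 0.5;                       // exact: scaling by a power of two
   double whole = std::floor(half);             // equals (n - 1)/2 when n is odd

   // half != whole means n was odd, so add (p + 1)/2, which is an integer for odd p.
   return (half != whole) ? whole + (p + 1.0) * 0.5 : whole;
}
```

The double version is a multiply, a floor, a compare and a conditional add per element, a pattern that should map onto packed AVX instructions, though whether it beats the existing mulmod would have to be measured.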

I looked into the code further. I was wrong before on get and set giving the error. This portion of the code gives the error. You can try it yourself.

Code:
avx_set_16a(powinvs);
avx_set_16b(multinvs2);
avx_mulmod(dps, reciprocals);
avx_get_16a(powinvs);

The above works all right. multinvs2 is 1/4 (mod p).

avx_set_16a(powinvs);
avx_set_16b(multinvs);
avx_mulmod(dps, reciprocals);
avx_mulmod(dps, reciprocals);
avx_get_16a(powinvs);

The above does not give the right answer. multinvs is 1/2 (mod p).

They should be the same.

The following does work correctly

avx_set_16a(powinvs);
avx_set_16b(multinvs);
avx_mulmod(dps, reciprocals);
avx_get_16a(powinvs);

avx_set_16a(powinvs);
avx_set_16b(multinvs);
avx_mulmod(dps, reciprocals);
avx_get_16a(powinvs);

2020-05-27, 12:26   #318
rogue

"Mark"
Apr 2003
Between here and the

6,043 Posts

avx_mulmod modifies the contents of ymm4-ymm7, thus you cannot call it two times in a row and expect the result to be a*b*b mod p. You would have to call avx_set_16b or avx_set_1b after the first avx_mulmod before calling avx_mulmod again.

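The effect described above can be reproduced with a toy scalar model. This is only an illustration under stated assumptions: `ToyRegs` and `ToyMulmod` are hypothetical stand-ins for the a and b register banks and for the multiply, and storing the quotient in b is an arbitrary choice of scratch data, not what the real routine leaves in ymm4-ymm7.

```cpp
#include <cstdint>

// Toy scalar stand-in for the a/b register banks; not the framework's code.
struct ToyRegs
{
   uint64_t a;
   uint64_t b;
};

// Leaves a*b mod p in regs.a and overwrites regs.b with scratch data,
// mimicking a multiply that clobbers its second operand.
void ToyMulmod(ToyRegs &regs, uint64_t p)
{
   uint64_t product = regs.a * regs.b;   // operands are assumed small enough not to overflow
   regs.a = product % p;
   regs.b = product / p;                 // b no longer holds the multiplier
}

uint64_t MulTwiceWithoutReload(uint64_t a, uint64_t b, uint64_t p)
{
   ToyRegs regs{a, b};
   ToyMulmod(regs, p);
   ToyMulmod(regs, p);                   // multiplies by scratch data, not by b
   return regs.a;
}

uint64_t MulTwiceWithReload(uint64_t a, uint64_t b, uint64_t p)
{
   ToyRegs regs{a, b};
   ToyMulmod(regs, p);
   regs.b = b;                           // plays the role of setting b again
   ToyMulmod(regs, p);
   return regs.a;
}
```

With p = 101, 1/2 (mod p) is 51 and 1/4 (mod p) is 76. Starting from a = 10, the reloaded version returns 53, the same as a single multiply by 76, while the version without the reload returns a different value.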
2020-05-29, 02:36   #319
Citrix

Jun 2003

3050_8 Posts

I was finally able to modify the GPU code. I had to do the Perl portion manually. The script did not work for me. It goes into an infinite while loop.

The code did compile, though I cannot get it to run. I have tried the original gcwsievecl.exe from mtsieve and that does not work either. afsievecl.exe does not work either. ppsieve (OpenCL) etc. works fine.

Any thoughts on how to fix this? Thanks.

Last fiddled with by Citrix on 2020-05-29 at 02:38