mersenneforum.org  

Old 2020-05-25, 13:03   #309
rogue
"Mark"
Apr 2003
Between here and the
5853₁₀ Posts

Quote:
Originally Posted by Citrix View Post
I tried to write the code myself. I have attached it.

The program mainly writes the factors to a factor file, which needs to be processed by srfile. There is no input or output file.

I only modified the FPU code and not the AVX multiplication code.

I cannot get it to compile. gmp.h is missing. I do not have GMP installed on my computer. Can anyone help compile it?

I do not have experience with GPU apps - how do I modify the code for GPU? Possibly cw_kernel.cl needs to be modified. Not sure what to do with the cw_kernel.h file.
Cool. I appreciate your attempt. It sounds like you are very close.

GMP is only needed for gfndsieve and dmdsieve. It seems that the makefile has a mistake, since it lists the GMP library as a dependency for building gcwsieve, which is what it appears that you started with. You can remove that dependency from the makefile.

The OpenCL source files have the .cl extension. These have to be converted to .h files, which are then included in the GpuWorker.cpp class. This requires the use of perl (I use ActivePerl). The command is "perl cltoh.pl file.cl > file.h". It is easier to edit the .cl files.
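
The thread does not show what cltoh.pl emits, so the following is only a sketch of what such a conversion could look like: it wraps each line of kernel source in a C string literal so the kernel text can be compiled into the executable. The function name clToHeader, the variable name passed to it, and the output layout are all assumptions made for illustration, not the actual format used by mtsieve.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not part of mtsieve): turns OpenCL source text into
// the body of a header that defines one string constant holding the kernel.
std::string clToHeader(const std::string& clSource, const std::string& varName)
{
    std::istringstream in(clSource);
    std::ostringstream out;
    std::string line;
    bool wroteLine = false;

    out << "const char *" << varName << " =\n";
    while (std::getline(in, line))
    {
        out << '"';
        for (char c : line)
        {
            // Escape characters that would end or corrupt the literal.
            if (c == '\\' || c == '"')
                out << '\\';
            // Drop carriage returns left over from Windows line endings.
            if (c != '\r')
                out << c;
        }
        // Keep the original line break inside the embedded kernel text.
        out << "\\n\"\n";
        wroteLine = true;
    }
    // An empty input still needs a valid initializer.
    if (!wroteLine)
        out << "\"\"\n";
    out << ";\n";
    return out.str();
}
```

Writing the returned string to file.h would give a header that a worker source file could include and hand to the OpenCL compiler at run time.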

It might be easier to start with the OpenCL kernel for gfndsieve than gcwsieve as that kernel is much simpler. I don't know what experience you have with OpenCL.

I'm not certain what you mean by "modified the FPU code". I'm hoping that you did not modify any .S sources.

Old 2020-05-26, 08:16   #310
Citrix
Jun 2003
11000011110₂ Posts

I was able to compile the code. Needed to do some debugging.

I would recommend you include https://github.com/GPUOpen-Libraries...L-SDK/releases & https://github.com/KhronosGroup/OpenCL-Headers with mtsieve source - instead of using AMD SDK which would need to be installed.

I am only using the FPU code in the worker file and not the AVX function. That is something to look into when I have time, to make the code even faster.

I do not have experience with OpenCL, but the gcwsieve cw_kernel.cpp code seems straightforward. I can look into it. Using Perl might be more challenging.

Do you know if anyone is using gcwsieve to sieve? It can be made faster by replacing multiplication with addition.

Last fiddled with by Citrix on 2020-05-26 at 08:30

Old 2020-05-26, 12:15   #311
rogue
"Mark"
Apr 2003
Between here and the
3×1,951 Posts

For the SDK, I have used AMD SDK for years and never had a reason to update, but since AMD doesn't seem to support it anymore, I could look into another OpenCL library.

AVX512 is a bit harder to use if you want to get the full benefit.

As for perl, it is just a command-line tool I use to generate a header file from a .cl file. Nothing more than that. In theory it could be converted to a .sh or .bat script.

This search uses gcwsieve for Generalized Woodalls.

This search uses gcwsieve for Generalized Cullens.

If you are thinking of an enhancement to gcwsieve, I suggest that you post in the other thread and include the math.

Old 2020-05-26, 13:48   #312
rogue
"Mark"
Apr 2003
Between here and the
3×1,951 Posts

To be clear, the framework only has code for AVX2, not AVX512. I don't have a CPU that supports AVX512, so I cannot write or test code for it.

Old 2020-05-26, 21:14   #313
Citrix
Jun 2003
3036₈ Posts

Quote:
Originally Posted by rogue View Post

AVX512 is a bit harder to use if you want to get the full benefit.

For perl, it is just a command line tool I used to generate a header file from a .cl file. Nothing more than that. In theory it could be converted to a .sh or .bat script.

If you are thinking of an enhancement to gcwsieve, I suggest that you post in the other thread and include the math.
I was able to modify the AVX code, and it is substantially faster: I am getting 1.2Mp/sec on 4 cores, though it is limited to 2^52.

Is there a faster way of doing this?

Code:
		avx_set_16a(tempinvs);
		avx_set_16b(powinvs);
		avx_mulmod(dps, reciprocals);
		avx_get_16a(tempinvs);

		avx_set_16a(tempinvs);
		avx_set_16b(multinvs);
		avx_mulmod(dps, reciprocals);
		avx_get_16a(tempinvs);

If I eliminate the first get step and the second set step, the program misses factors.

Are you able to write some code to calculate n/2 (mod p) for doubles and 64-bit integers? Currently I have to use mulmod, which is extremely slow.

For the GPU, modifying the code should not take too long. However, I will need to sieve deep enough with the CPU first, as the program is finding too many factors. It might not be worth changing the GPU code, as I might have sieved deep enough with the CPU anyway. How many factors can the GPU handle, and when should I switch over?

I will post gcwsieve code/algorithm in other thread.

Last fiddled with by Citrix on 2020-05-26 at 21:30

Old 2020-05-26, 22:45   #314
rogue
"Mark"
Apr 2003
Between here and the
1011011011101₂ Posts

I am impressed with how quickly you have picked up on the framework. It is one of the reasons why I wrote the framework in the first place. The idea is that it should be easily accessible to anyone with some basic knowledge of C and C++ and the math they need for the Worker class.

Quote:
Originally Posted by Citrix View Post
I was able to modify the AVX - substantially faster. Getting 1.2Mp/sec on 4 cores.
Though limited to 2^52.

Is there a faster way of doing this:

Code:
		avx_set_16a(tempinvs);
		avx_set_16b(powinvs);
		avx_mulmod(dps, reciprocals);
		avx_get_16a(tempinvs);

		avx_set_16a(tempinvs);
		avx_set_16b(multinvs);
		avx_mulmod(dps, reciprocals);
		avx_get_16a(tempinvs);
If I eliminate the first get step and the second set step the program misses factors.
That it misses factors is very odd. Does it miss the same factors consistently when the lines in red are removed? The geta() and seta() functions access the same set of xmm registers. If this can be reproduced easily, then it should be debugged.

Quote:
Are you able to write some code to calculate n/2 (mod p) for a double and 64 bit integers. Currently I am having to use mulmod which is extremely slow.

For GPU - modifying the code should not take too long. Though will need to sieve deep enough with CPU first as program is finding too many factors. Might not be worth changing the GPU code as might have sieved deep enough with CPU anyway. How many factors can the GPU handle and when to switch over?

I will post gcwsieve code/algorithm in other thread.
Please explain what you mean by "64 bit integers"? The FPU routines are limited to 62 bits. The SSE and AVX routines are limited to 52 bits. If you are referring to pure 64-bit and the CPU, I do not have mulmod or powmod routines for it. The GPU can handle 64-bit mulmods.
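
The thread does not say why the SSE and AVX routines stop at 52 bits. A likely reason (an assumption on the editor's part) is that those routines hold residues in doubles, and an IEEE 754 double has a 53-bit significand, so not every integer above 2^53 can be stored exactly. The helper below, whose name is made up for this illustration, checks whether a given integer survives a round trip through double.

```cpp
#include <cstdint>

// Returns true when n can be stored in a double without rounding.
// Only meaningful for n < 2^63: larger values may round up to 2^64, and
// converting that back to uint64_t is undefined behavior.
bool isExactInDouble(uint64_t n)
{
    return static_cast<uint64_t>(static_cast<double>(n)) == n;
}
```

For example, 2^52 + 1 round-trips unchanged, while 2^53 + 1 does not because it gets rounded to a neighboring even value.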

If the GPU is returning too many factors and you are concerned about overflowing a buffer, then you can call SetMinGpuPrime() from the App class. This will tell the framework to not use a GPU worker until it reaches the value passed to that method. This might require some experimentation on your part regarding that limit and the size of the factor buffer that the GPU will support.

Old 2020-05-26, 23:13   #315
Citrix
Jun 2003
2·3³·29 Posts

I will see if I can post an example in which factors are missed.

Code:
void TestPrimesAVX(void)
{
	double __attribute__((aligned(32))) powinvs[AVX_ARRAY_SIZE];

	double __attribute__((aligned(32))) tempinvs[AVX_ARRAY_SIZE];

	double __attribute__((aligned(32))) dps[AVX_ARRAY_SIZE];

	// CODE MISSING HERE
	for (int i = 0; i < AVX_ARRAY_SIZE; i++) 
	{
		// This code is really slow.
		// Faster to use mulmod
		// Any other tricks? Can we use bitshift etc.
		if (((uint64_t)tempinvs[i]) % 2 == 1) { tempinvs[i] = tempinvs[i] + dps[i]; }
		tempinvs[i] = tempinvs[i] / 2;
		if (((uint64_t)powinvs[i]) % 2 == 1) { powinvs[i] = powinvs[i] + dps[i]; }
		powinvs[i] = powinvs[i] / 2;
		if (((uint64_t)powinvs[i]) % 2 == 1) { powinvs[i] = powinvs[i] + dps[i]; }
		powinvs[i] = powinvs[i] / 2;
	}
}
I need to do the above multiple times. For the doubles in AVX, can we do modular arithmetic or any bit operations? The above code is very slow, slower than using mulmod. By 64-bit integers I mean the uint64_t type used in the FPU code.
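
For the uint64_t case, one possible approach (a sketch, not code from mtsieve) avoids both mulmod and the branch. When n is even, n/2 (mod p) is just n >> 1. When n is odd and p is odd, n + p is even, and (n + p)/2 can be written as (n >> 1) + (p >> 1) + 1, which never overflows because the result is at most p - 1. The function name halveMod is hypothetical, and the code assumes p is odd and 0 <= n < p.

```cpp
#include <cstdint>

// Computes n/2 (mod p) for an odd modulus p and 0 <= n < p.
uint64_t halveMod(uint64_t n, uint64_t p)
{
    // All ones when n is odd, zero when n is even.
    uint64_t oddMask = 0 - (n & 1);

    // Even n: n >> 1.
    // Odd n:  (n + p) / 2 = (n >> 1) + (p >> 1) + 1, computed without
    //         forming n + p, so it is safe even when p is close to 2^64.
    return (n >> 1) + (((p >> 1) + 1) & oddMask);
}
```

As a check, halveMod(3, 7) gives 5, and 2 * 5 = 10 is congruent to 3 modulo 7.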

Anything you can help with?

Last fiddled with by Citrix on 2020-05-26 at 23:17

Old 2020-05-27, 01:28   #316
rogue
"Mark"
Apr 2003
Between here and the
3·1,951 Posts

I don't know how to do bit operations with floating point values. What is the specific math you are trying to accomplish? Maybe there is a different way to get the desired result.
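
One way to halve residues held in doubles without any integer casts or bit operations is sketched below. It assumes every value is a whole number with 0 <= x < p, that p is odd, and that p < 2^52, so that x + p stays below 2^53 and is represented exactly. Multiplying by 0.5 is exact for such values, so a non-zero fractional part after halving means x was odd. The function name halveModDoubles is made up for this example, and whether it beats mulmod in practice has not been measured here.

```cpp
#include <cmath>
#include <cstddef>

// Replaces each values[i] with values[i]/2 (mod primes[i]).
// Assumes whole-number inputs with 0 <= values[i] < primes[i] < 2^52
// and odd primes[i].
void halveModDoubles(double* values, const double* primes, std::size_t count)
{
    for (std::size_t i = 0; i < count; i++)
    {
        // Scaling by a power of two is exact for these inputs.
        double half = values[i] * 0.5;

        // A fractional part of .5 means the original value was odd.
        bool isOdd = (half != std::floor(half));

        // For odd x, x + p is even and below 2^53, so the sum and the
        // halving are both exact.
        values[i] = isOdd ? (values[i] + primes[i]) * 0.5 : half;
    }
}
```

The loop body uses only a multiply, a floor, a compare and a select, so a compiler may be able to vectorize it, though that depends on the compiler and flags.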

Old 2020-05-27, 03:37   #317
Citrix
Jun 2003
2·3³·29 Posts

Quote:
Originally Posted by rogue View Post
I don't know how to do bit operations with floating point values. What is the specific math you are trying to accomplish? Maybe there is a different way to get the desired result.
Just multiplying the double-precision floating-point number by 1/2 (mod p).

I looked into the code further. I was wrong earlier about the get and set calls causing the error. This portion of the code gives the error. You can try it yourself.

Code:
		avx_set_16a(powinvs);
		avx_set_16b(multinvs2);
		avx_mulmod(dps, reciprocals);
		avx_get_16a(powinvs);

The above works correctly. multinvs2 is 1/4 (mod p).

		avx_set_16a(powinvs);
		avx_set_16b(multinvs);
		avx_mulmod(dps, reciprocals);
		avx_mulmod(dps, reciprocals);
		avx_get_16a(powinvs);

The above does not give the right answer. multinvs is 1/2 (mod p).

They should be the same.

The following does work correctly:

		avx_set_16a(powinvs);
		avx_set_16b(multinvs);
		avx_mulmod(dps, reciprocals);
		avx_get_16a(powinvs);

		avx_set_16a(powinvs);
		avx_set_16b(multinvs);
		avx_mulmod(dps, reciprocals);
		avx_get_16a(powinvs);

Old 2020-05-27, 12:26   #318
rogue
"Mark"
Apr 2003
Between here and the
3×1,951 Posts

avx_mulmod modifies the contents of ymm4-ymm7, so you cannot call it two times in a row and expect the result to be a*b*b mod p. You would have to call avx_set_16b or avx_set_1b after the first avx_mulmod before calling avx_mulmod again.
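
The effect described above can be illustrated with a toy model. Everything in this sketch is hypothetical: ToyRegs and the toy* functions are stand-ins invented for this example, not the real assembly routines, and the garbage value written to b is arbitrary. The only behavior borrowed from the thread is that the multiply leaves the b operand unusable, so a second multiply needs b to be loaded again.

```cpp
#include <cstdint>

// Toy stand-in for the operand registers; NOT the mtsieve interface.
struct ToyRegs
{
    uint64_t a = 0;   // models the "a" operands
    uint64_t b = 0;   // models the "b" operands that the multiply overwrites
};

void toySetA(ToyRegs& r, uint64_t v) { r.a = v; }
void toySetB(ToyRegs& r, uint64_t v) { r.b = v; }
uint64_t toyGetA(const ToyRegs& r) { return r.a; }

// a = a * b (mod p). Requires a, b, p < 2^32 so the product fits in 64 bits.
// Afterwards b holds scratch data, mimicking the clobbered registers.
void toyMulmod(ToyRegs& r, uint64_t p)
{
    r.a = (r.a * r.b) % p;
    r.b = 0xDEADBEEF;   // arbitrary garbage, purely illustrative
}

// Two multiplies in a row without reloading b: the second one uses garbage.
uint64_t mulTwiceWithoutReset(uint64_t a, uint64_t b, uint64_t p)
{
    ToyRegs r;
    toySetA(r, a);
    toySetB(r, b);
    toyMulmod(r, p);
    toyMulmod(r, p);
    return toyGetA(r);
}

// Reloading b before the second multiply yields a*b*b mod p.
uint64_t mulTwiceWithReset(uint64_t a, uint64_t b, uint64_t p)
{
    ToyRegs r;
    toySetA(r, a);
    toySetB(r, b);
    toyMulmod(r, p);
    toySetB(r, b);
    toyMulmod(r, p);
    return toyGetA(r);
}
```

With p = 101, a = 5 and b = 51 (which is 1/2 mod 101), the version that reloads b returns 77, and 4 * 77 = 308 is congruent to 5 modulo 101 as expected for 5/4. The version without the reload returns a different value.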

Old 2020-05-29, 02:36   #319
Citrix
Jun 2003
2×3³×29 Posts

I was finally able to modify the GPU code. I had to do the Perl portion manually; the script did not work for me, as it goes into an infinite while loop. The code did compile.

However, I cannot get the code to run. I have tried the original gcwsievecl.exe from mtsieve and that does not work either. afsievecl.exe does not work either. ppsieve (OpenCL), etc., works fine.

Any thoughts on how to fix this?

Thanks.

Last fiddled with by Citrix on 2020-05-29 at 02:38
All times are UTC.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.