mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfakto: an OpenCL program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=15646)

tului 2014-06-27 13:24

Is there much point going past 72 on TF work?

kladner 2014-06-27 15:33

[QUOTE=tului;376862]Is there much point going past 72 on TF work?[/QUOTE]

This has been the topic of much debate. The slippery answer is, "It depends," on the exponent range in particular. In the current active LL range, 74 is generally the target, but may be adjusted downward if LL workers are running short of material.

The tables [URL="http://www.gpu72.com/reports/available/"]here[/URL] show the current aim points based on the break even levels of TF vs LL effort required.

(The above can probably be better stated, and likely will be by those more mathematically capable than me.)

EDIT: It may also depend on whether one is running AMD or nVidia GPUs.

EDIT2: ATM, the page linked above is not showing me any data in the fields. I can't say if this is a problem on my end or at GPU72.

LaurV 2014-06-27 16:11

[QUOTE=tului;376862]Is there much point going past 72 on TF work?[/QUOTE]
Yes, depending on your GPU and the exponent, you may be much better off doing TF to 74 (or even 75), or doing LL. See [URL="http://www.mersenne.ca/cudalucas.php?model=12"]here[/URL] for details; click on the cards you have (or any cards) to see the graphs.

chalsall 2014-06-27 16:17

[QUOTE=kladner;376872]This has been the topic of much debate.[/QUOTE]

Indeed! :wink:

[QUOTE=kladner;376872]The slippery answer is, "It depends," on the exponent range in particular. In the current active LL range, 74 is generally the target, but may be adjusted downward if LL workers are running short of material.[/QUOTE]

James put a lot of effort into providing a definitive analysis of where the "curves cross" (peer reviewed by many very smart and knowledgeable people). This is defined as where (and to what level) it is more efficient to "TF" than it is to "LL" (or "DC") [B][U]ON THE SAME COMPUTING DEVICE[/U][/B]. His analysis [URL="http://www.mersenne.ca/cudalucas.php?model=12"]is here[/URL].

Note importantly that the cross-over is not only a function of the candidate's size, but also the GPU core. You can click on the cards listed below the graph to see where the cross-over point is for that particular device (it varies slightly).

[QUOTE=kladner;376872]EDIT: It may also depend on whether one is running AMD or nVidia GPUs.[/QUOTE]

Definitely.

Since OpenCL can't currently do LL testing, technically any depth Makes Sense [SUP](TM)[/SUP] (and is why no AMD cards are listed in the table below the graph).

Bdot and others can definitely give you advice on what depth it makes the most sense for AMD cards to TF to (and where). I do code, not math...

[QUOTE=kladner;376872]The tables [URL="http://www.gpu72.com/reports/available/"]here[/URL] show the current aim points based on the break even levels of TF vs LL effort required.

EDIT2: ATM, the page linked above is not showing me any data in the fields. I can't say if this is a problem on my end or at GPU72.[/QUOTE]

And these tables show what GPU72 is currently aiming for, based on James' analysis and our currently available firepower. The yellow cells are where we are aiming, and will be released back to Primenet (if a P-1 test has also been done, or Primenet is "hungry"). We should really be going to 75 "bits" for some of our current ranges, but we simply can't do that at the moment.

(P.S. The report from GPU is working for me. Please let me know if you continue to see issues.)

kracker 2014-06-27 18:36

[QUOTE=LaurV;376880]Yes, depending on your GPU, and the exponent, you may be much better doing TF to 74 (even 75), or doing LL. See [URL="http://www.mersenne.ca/cudalucas.php?model=12"]here[/URL] for details, click on the cards you have (or any cards) to see the graphs.[/QUOTE]
Although after 73 bits, production decreases a bit on AMD cards because mfakto switches to a different kernel...

[QUOTE=chalsall;376882]
Since OpenCL can't currently do LL testing, technically any depth Makes Sense [SUP](TM)[/SUP] (and is why no AMD cards are listed in the table below the graph).[/QUOTE]
I think a better wording would be "Since a LL testing program hasn't been ported/made to OpenCL..." :razz:

Technically, there [URL="http://mersenneforum.org/cllucas"]is[/URL] clLucas which I run on and off. It is "limited"/fastest on powers of two FFT's. If my memory is not mistaken, a 7970 does around 3.6 iter/ms on a 2M FFT.

Bdot 2014-06-27 21:38

[QUOTE=tului;376858][URL]https://drive.google.com/file/d/0B0Yq8K5dWh1BWjlqRjAyVjY0elE/edit?usp=sharing[/URL]

[URL]https://drive.google.com/file/d/0B0Yq8K5dWh1BMS1waG5qZXhKVXM/edit?usp=sharing[/URL]

Here they go.[/QUOTE]

Thanks a lot. The data shows that mfakto does not need to distinguish between the older and the current GCN generations. I compared your results (adjusted for clock speed and number of compute units) to some older results from a 7770 and a 7850. The R7 260X came in 0.5-1.5% short of the expected results for the 15-bit kernels, and 3.2-3.8% short for the 32-bit kernels.

I then compared it to some recent test results of a 7870XT. Here, your card was ahead 1-1.5% for the 15-bit kernels, and exactly as expected for the 32-bit kernels.

I think the new GCN generation has a slight performance improvement, except in 32-bit multiplications. The newer drivers add a slight decline across the board. The differences are so small that no kernel reordering is required.

kladner 2014-06-27 23:30

[QUOTE=chalsall;376882]Indeed! :wink:


<snip>

(P.S. The report from GPU is working for me. Please let me know if you continue to see issues.)[/QUOTE]

It now seems to be fine. It must have been a blockage in the particular internet tube through which I am connected. :razz:

LaurV 2014-06-28 11:51

[QUOTE=kracker;376896]Technically, there [URL="http://mersenneforum.org/cllucas"]is[/URL] clLucas which I run on and off. It is "limited"/fastest on powers of two FFT's. If my memory is not mistaken, a 7970 does around 3.6 iter/ms on a 2M FFT.[/QUOTE]
Indeed, with the observation that the unit is milliseconds per iteration, not iterations per millisecond (don't you wish? :razz:)

clLucas is good if you test exponents under (and close to) L1=38492887, the largest for which the 2048k FFT can be used, or exponents under (and close to) L2=75846319, the largest for which the 4096k FFT can be used. Because the OpenCL FFT is not optimized for non-powers of two, testing a 40M exponent, for example, will be extremely slow if a non-power-of-2 FFT is used; and if the 4096k FFT is used, each iteration takes the same time as for a 75M exponent. Which is very slow.

So, if you have a good GCN card, you can choose to either:
- do DC LL for exponents close to and lower than L1
- do first-time LL for exponents close to and lower than L2 (risky if no DC is done in parallel: you don't know whether the final result is right, and in the worst case you could miss a prime, which would later be found by another hunter, and you would feel very sorry :wink:)
- do TF. Of the three, this is the best and easiest: you get the maximum amount of credit, and you make the other guys happy. But you can't find primes (not that you would find that many anyhow :razz:)

Prime95 2014-06-29 01:28

Please fix
 
The Intel compiler is whining; can someone please change the source code to fix the warnings below -- I don't know how to use git yet.

Also, the mfakto wiki page needs to be updated to the current version, 0.14.

[CODE]In file included from :85:
.\barrett15.cl:218:40: warning: operator '>>' has lower precedence than '+'; '+' will be evaluated first
res->d5 = mad24(a.d3, b.d2, res->d5) + res->d4 >> 15;
~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~ ~~
.\barrett15.cl:218:40: note: place parentheses around the '+' expression to silence this warning
res->d5 = mad24(a.d3, b.d2, res->d5) + res->d4 >> 15;
^
( )
.\barrett15.cl:223:40: warning: operator '>>' has lower precedence than '+'; '+' will be evaluated first
res->d6 = mad24(a.d3, b.d3, res->d6) + res->d5 >> 15;
~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~ ~~
.\barrett15.cl:223:40: note: place parentheses around the '+' expression to silence this warning
res->d6 = mad24(a.d3, b.d3, res->d6) + res->d5 >> 15;
^
( )
.\barrett15.cl:227:40: warning: operator '>>' has lower precedence than '+'; '+' will be evaluated first
res->d7 = mad24(a.d3, b.d4, res->d7) + res->d6 >> 15;
~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~ ~~
.\barrett15.cl:227:40: note: place parentheses around the '+' expression to silence this warning
res->d7 = mad24(a.d3, b.d4, res->d7) + res->d6 >> 15;
^
( )[/CODE]

Prime95 2014-06-29 02:13

bug reports:

1) The Intel compiler does not like -O3.

2) When clBuildProgram is called with invalid build options, it returns error -43. If verbosity is set to 3, then clGetProgramBuildInfo tries to get the build log -- there is none -- and some trash characters are output.

Prime95 2014-06-29 02:17

Please change line 84 of Montgomery.cl to:

r2 += ((r1!=0)? (ulong_v)1UL : (ulong_v)0UL);

Batalov 2014-06-29 02:35

[CODE]warning: operator '>>' has lower precedence than '+'; '+' will be evaluated first[/CODE]Sheesh! That's quite an inconvenient precedence order. At least they put a warning.

With that in mind, you are right to worry about a possible precedence clash between the ternary '?:' and '+'.

I've looked at the OCL specs book (all versions) and they never even specified the precedence. That's odd.

Prime95 2014-06-29 05:19

[QUOTE=Batalov;376972]I've looked at the OCL specs book (all versions) and they never even specified the precedence. That's odd.[/QUOTE]

I presume the standard specifies using C operator precedence.

Prime95 2014-06-29 05:21

Debugging nightmare. Intel does not support printf in kernels.

Bdot 2014-06-29 11:43

[QUOTE=Prime95;376965]The Intel compiler is whining, can someone please change the source code for the warnings below[/QUOTE]

Thank you, this is probably the reason why I never got this function to work. It is only used in a test kernel, not in any factoring.

[QUOTE=Prime95;376970]bug reports:

1) The Intel compiler does not like -O3.
[/QUOTE]
You can use the ini file option OCLCompileOptions to set whichever options you need. Do you know which optimization level I should use for INTEL? -O2 or just -O?
[QUOTE=Prime95;376970]
2) When clBuildProgram is called with invalid build options it returns error -43. If verbosity is set to 3, then clGetBuildInfo tries to get the build log -- there is none -- and some trash characters are output[/QUOTE]
Hmm, if you don't see "Error ... clGetProgramBuildInfo failed.", then this would be an INTEL API bug. Or it returns an OK status but no log ... I've added a check that the resulting log size is >0.
[QUOTE=Prime95;376971]Please change line 84 of Montgomery.cl to:

r2 += ((r1!=0)? (ulong_v)1UL : (ulong_v)0UL);[/QUOTE]
OK, done. But why? Shouldn't this be the same as
r2 += (ulong_v)((r1!=0)? 1UL : 0UL);

Prime95 2014-06-29 13:45

[QUOTE=Bdot;376992]
You can use the ini file option OCLCompileOptions to set whichever option you need. Do you know what is the best available optimization I should use for INTEL? -O2 or just -O?[/quote]

I eliminated the -O argument completely. I've not read the Intel docs to see if -O is supported.

I saw the ini file option. I think we need a solution that works without using that. Maybe grep for "Intel" in the device capabilities or something?

[quote]Hmm, if you dont see "Error ... clGetProgramBuildInfo failed.", then this would be an INTEL API bug. Or, it returns an OK status but no log ... I've added a check if the resulting log size is >0.[/quote]

The problem I think is this line:

if((status == CL_BUILD_PROGRAM_FAILURE) || (mystuff.verbosity > 2))

I'd set verbosity to 3, so clGetProgramBuildInfo was called even when there was no error.

Your log_size > 0 ought to solve this.

[quote]OK, done. But why? Shouldn't this be the same as
r2 += (ulong_v)((r1!=0)? 1UL : 0UL);[/QUOTE]

I'm with you, it should be the same.

Bdot 2014-06-29 19:46

[QUOTE=Prime95;376995]I eliminated the -O argument completely. I've not read the Intel docs to see if -O is supported.

I saw the ini file option. I think we need a solution that works without using that. Maybe grep for "Intel" in the device capabilities or something?
[/QUOTE]
Agreed, I may need to decouple the device detection from the kernel compilation, so that the compilation can use what has been found out about the device. Currently that needs to be set as GPUType ini file option. When that is set to NVIDIA, for example, then the kernel compilation skips O3. This should be automatic - I'll put that in for the next version.

[QUOTE=Prime95;376995]
The problem I think is this line:

if((status == CL_BUILD_PROGRAM_FAILURE) || (mystuff.verbosity > 2))

I'd set verbosity to 3, so clGetProgramBuildInfo was called even when there was no error.
[/QUOTE]
This was intentional so there is a chance to see build warnings or other output. AMD always provides non-zero build info.

Do you attempt to get all kernels to run? Most likely there is no performance improvement from using mul24 instead of 32-bit multiplications on Intel, so barrett24 and barrett15 will be rather slow. montgomery.cl is good only in some corner cases; it was rather a test so I could get a feeling for how it compares to barrett.

Prime95 2014-06-29 20:33

[QUOTE=Prime95;376980]Debugging nightmare. Intel does not support printf in kernels.[/QUOTE]

My bad. Printf does work -- whew! One must delete the .elf file every time a .cl file changes. I'll see if I can change the makefile to do this for me automatically.

Right now, I'm just trying to get the program to pass the self-test. The (or a) problem is in GPU sieving or in translating the sieve into k_deltas. Optimization, if I'm so motivated, comes later.

Bdot 2014-06-29 21:32

[QUOTE=Prime95;377009]My bad. Printf does work -- whew! One must delete the .elf file every time you change a .cl file. [/QUOTE]
Remove the UseBinfile setting from the ini file, then it will always recompile.

Prime95 2014-06-30 00:47

[QUOTE=Bdot;377011]Remove the UseBinfile setting from the ini file, then it will always recompile.[/QUOTE]

Thanks!

Next mini-bug: In extract_bits exponent, k_base, shiftcount, bit_max64, bb are undefined in this code

[CODE]#if (TRACE_SIEVE_KERNEL > 0)
if (lid==TRACE_SIEVE_TID) printf((__constant char *)"extract_bits: exp=%d=%#x, k=%x:%x:%x, bits=%d, shift=%d, bit_max64=%d, bb=%x:%x:%x:%x:%x:%x, wpt=%u, base addr=%#x\n",
exponent, exponent, k_base.d2, k_base.d1, k_base.d0, bits_to_process, shiftcount, bit_max64, bb.d5, bb.d4, bb.d3, bb.d2, bb.d1, bb.d0, words_per_thread, bit_array);
#endif
[/CODE]

Prime95 2014-06-30 04:29

Here is the code in CalcModularInverses that is tripping up the Intel port:

[CODE] facdist = (ulong) (2 * NUM_CLASSES) * (ulong) exponent;
[/CODE]

I can't get Intel to produce 64-bit quantities here. What does the OpenCL spec say about 64-bit multiplies?

I also tried

facdist = 2*NUM_CLASSES; ulongtemp = exponent; facdist *= ulongtemp;

without success.

Any other ideas?

Bdot 2014-06-30 09:33

[QUOTE=Prime95;377037]Here is the code in CalcModularInverses that is tripping up the Intel port:

[CODE] facdist = (ulong) (2 * NUM_CLASSES) * (ulong) exponent;
[/CODE]I can't get Intel to produce 64-bit quantities here. What does the OpenCL spec say about 64-bit multiplies?

I also tried

facdist = 2*NUM_CLASSES; ulongtemp = exponent; facdist *= ulongtemp;

without success.

Any other ideas?[/QUOTE]

I noticed that the calculations are done from left to right, and the first factor usually determines the size of the calculation. If the target needs size adjustment, that is done on the result of the multiplication.

I'd try (can't test right now)

facdist = (ulong) exponent * 2 * NUM_CLASSES;

or even

facdist = (ulong) exponent * 2ULL * NUM_CLASSES##ULL;

Bdot 2014-06-30 14:21

If nothing helps, how about doing it all in 32-bit math?

facdist = (2 * NUM_CLASSES * (exponent%prime))%prime;

Could even be faster ... I'll test that for AMD cards. I saw that many emulated 64-bit operations take more than double the time of their 32-bit counterparts ...

Edit: OK, I should have thought about this before writing ... this would work only for primes smaller than 464823 ...

Bdot 2014-06-30 16:19

[QUOTE=Prime95;377027]
Next mini-bug: In extract_bits exponent, k_base, shiftcount, bit_max64, bb are undefined in this code
[/QUOTE]
Fixed. I did not pay attention when I moved the common code to a shared method.

Prime95 2014-07-01 02:30

Please replace

facdist = (ulong) (2 * NUM_CLASSES) * (ulong) exponent;

with:

facdist = mul_16_32 (2 * NUM_CLASSES, exponent);

and then:

#define mul_16_32(a,b) ((ulong)(a) * (ulong)(b))

then for Intel we need (I hope I did this right):

#define mul_16_32(a,b) ((((ulong) ((uint)(a) * ((uint)(b) >> 16))) << 16) + (ulong) ((uint)(a) * ((uint)(b) & 0xFFFF)))


You can use the Intel definition for AMD if you think it will generate better code. This fixes one bug, but there is at least one more. I'm betting the calculation of bit_to_clr also will need a similar macro.

Prime95 2014-07-01 04:59

[QUOTE=Prime95;377093]I'm betting the calculation of bit_to_clr also will need a similar macro.[/QUOTE]

Nope, the bug is elsewhere. Time for some more nasty debugging.

tului 2014-07-01 21:06

Has anyone looked at what the upcoming/available HSA stuff might add to the mix? Obviously an APU isn't a 290X but having the same memory address space and whatever other stuff HSA allows seems really cool to me.

Bdot 2014-07-01 21:16

It would allow for faster "transfer" of data when sieving on the CPU, which can yield better output (read: more GHz-days) than GPU sieving anyway. However, the memory transfer is not a bottleneck until you have multiple high-end cards in the same system. For those cases, GPU sieving is the better choice these days. And HSA will not help there.

Therefore it may be exciting, but it will not help TF throughput much.

Bdot 2014-07-03 15:38

[QUOTE=Prime95;377093]Please replace

facdist = (ulong) (2 * NUM_CLASSES) * (ulong) exponent;

with:

facdist = mul_16_32 (2 * NUM_CLASSES, exponent);

and then:

#define mul_16_32(a,b) ((ulong)(a) * (ulong)(b))

then for Intel we need (I hope I did this right):

#define mul_16_32(a,b) ((((ulong) ((uint)(a) * ((uint)(b) >> 16))) << 16) + (ulong) ((uint)(a) * ((uint)(b) & 0xFFFF)))


You can use the Intel definition for AMD if you think it will generate better code. This fixes one bug, but there is at least one more. I'm betting the calculation of bit_to_clr also will need a similar macro.[/QUOTE]

The bit-shifting code is quite a slowdown on AMD (older cards: 6 cycles instead of 3; newer cards: 17 cycles instead of 8). Not a big deal, as it is not the sieving itself, but I'd like to avoid it anyway.

Could you please try whether the following code works on Intel? The OpenCL standard discourages type casts in favor of conversion functions. These two versions produce exactly the same assembly on AMD (identical to the original):
[code]
facdist = convert_ulong(2 * NUM_CLASSES) * convert_ulong(exponent);
[/code][code]
uint2 temp;

temp.x = exponent * (2 * NUM_CLASSES);
temp.y = mul_hi(exponent, 2 * NUM_CLASSES);

facdist = as_ulong (temp);
[/code]

Bdot 2014-07-03 16:09

[QUOTE=Prime95;377098]Nope, the bug is elsewhere. Time for some more nasty debugging.[/QUOTE]
This would have been my bet as well. Did you verify that using the GWDEBUG printfs? That verification may also fall victim to the same 32-bit calculation; maybe two wrongs make a right, and therefore no FAILs are reported?

Other occurrences of ulong conversions are:

memory access (maybe only 32 bits are saved?):
#define locsieve64 ((__local ulong *) locsieve)

sieve mask calculation (if that is done in 32 bits, we'll have way more zeros):
mask1 = (i131 > 63 ? 0 : ((ulong) 1 << i131)) | (i137 > 63 ? 0 : ((ulong) 1 << i137));

Prime95 2014-07-03 16:17

Neither of those versions work. Can we #define a work-around flag for Intel devices? If so, we'd need to release two different executables, right?

Do you want to wait until I find the next bug before making any decisions?

Bdot 2014-07-03 16:32

[QUOTE=Prime95;377287]Neither of those versions work. Can we #define a work-around flag for Intel devices? If so, we'd need to release two different executables, right?

Do you want to wait until I find the next bug before making any decisions?[/QUOTE]
That's really sad, and Intel should fix it.

For now, I will modify the host code to pass the detected device type as a define to the kernel. So we can do

[code]
#ifdef INTEL
#define mul_16_32(a,b) ((((ulong) ((uint)(a) * ((uint)(b) >> 16))) << 16) + (ulong) ((uint)(a) * ((uint)(b) & 0xFFFF)))
#else
#define mul_16_32(a,b) ((ulong)(a) * (ulong)(b))
#endif
[/code]This requires separation of device detection from code compilation, which I have not yet done. I'll try to get to that tonight.

Prime95 2014-07-03 16:37

Not all ulongs are a problem. I was surprised that bit_to_clr ulong operations were OK. I did turn on the GWDEBUG checks with an ugly and slow mul_32_32 macro and all was OK.

Presently, self tests pass with DEBUG_FACTOR_FIRST set, but only about half pass if it is not set.

Bdot 2014-07-03 20:10

[QUOTE=Prime95;377292]Not all ulongs are a problem. I was surprised that bit_to_clr ulong operations were OK. I did turn on the GWDEBUG checks with an ugly and slow mul_32_32 macro and all was OK.

Presently, self tests pass with DEBUG_FACTOR_FIRST set, but only about half pass if it is not set.[/QUOTE]
This sounds like a VectorSize problem. As if only half of the candidates are tested ... either VectorSize is not always evaluated, or it is set to 2 but Intel does not support vectors.

In order to occupy as many compute units as possible, I made each thread work on a vector of FCs - the old VLIW4 and VLIW5 devices used 4 per thread, recent GCN devices perform best with 2 per thread ...

Can you verify whether the sieve returns the proper number of bits set? When DETAILED_INFO is defined, mfakto will copy the device's arrays to the host and partially dump them. For the sieve array, it will count the bits.

Prime95 2014-07-03 20:32

[QUOTE=Bdot;377314]This sounds like a VectorSize problem....
Can you verify if the sieve returns the proper number of bits set? When defining DETAILED_INFO, mfakto will copy the device's arrays to the host and partially dump them. For the sieve array, it will count the bits.[/QUOTE]

I'm failing at VectorSize 1, 2, and 4.

extract_bits reports about 4800 bits set.

Interestingly, only the barrett32 routines fail. The barrett15 routines always work.

Bdot 2014-07-03 21:44

[QUOTE=Prime95;377318]
Interestingly, only the barrett32 routines fail. The barrett15 routines always work.[/QUOTE]

That does not sound like a sieve problem. Are they failing the same way when sieving on the CPU (SieveOnGPU=0)? But if DEBUG_FACTOR_FIRST makes them all work then it must be related to the thread ID or the location of the factor's bit in the sieve.

I have now committed the changes to GitHub to apply the Intel workaround for CalcModularInverses. There are a few other changes that compile, but they are not yet complete or well tested ...

Prime95 2014-07-03 23:33

This code is failing in the barrett32 kernels. I don't have a workaround yet.

[CODE] my_k_base.d0 = k_base.d0 + NUM_CLASSES * k_delta; // k_delta can exceed 2^24: don't use mul24/mad24 for it
my_k_base.d1 = k_base.d1 + mul_hi(NUM_CLASSES, k_delta) - AS_UINT_V(k_base.d0 > my_k_base.d0); /* k is limited to 2^64 -1 so there is no need for k.d2 */
[/CODE]

Prime95 2014-07-04 01:12

I tried the beta version of Intel's device driver -- no change. I've submitted a bug report to Intel Developer Zone.

kracker 2014-07-04 01:12

[code]mfakto.cpp:39:18: fatal error: menu.h: No such file or directory
#include "menu.h"
^
compilation terminated.
make: *** [mfakto.o] Error 1
[/code]

:smile:

Prime95 2014-07-04 02:18

I have a workaround:

In the barrett32 kernels, replace the use of NUM_CLASSES with a variable num_c. Declare num_c as uint. Before the for loop, add:

num_c = NUM_CLASSES % (total_bit_count + 1000000);


Now what?? I think I'll read the Intel documentation for optimization ideas. Are there any timings you'd like me to run?

Bdot 2014-07-04 09:33

[QUOTE=kracker;377332][code]mfakto.cpp:39:18: fatal error: menu.h: No such file or directory
#include "menu.h"
^
compilation terminated.
make: *** [mfakto.o] Error 1
[/code]:smile:[/QUOTE]
:blush:

There may be more issues ... I have not yet tried to build on Linux. Give me another day or two :smile:

Bdot 2014-07-04 10:25

[QUOTE=Prime95;377336]I have a workaround:

In the barrett32 kernels replace the use NUM_CLASSES with variable num_c. Declare num_c as uint. Before the for loop add:

num_c = NUM_CLASSES % (total_bit_count + 1000000);


Now what?? I think I'll read the Intel documentation for optimization ideas. Are there any timings you'd like me to run?[/QUOTE]

I've committed the changes for the barrett32_gs kernels.

Is the same required for the CPU-sieve kernels? Could you test if/which kernels fail with SieveOnGPU=0?

What's next? I'd need to know the speed of each of the kernels. For that, it would be best to send me the output of a few minutes of 'mfakto -st', with CPU sieving and with CL_PERFORMANCE_INFO defined in the build (this would also answer the question above :smile:). I need that to update the kernel order in find_fastest_kernel().

To evaluate the GPU sieve speed, a normal binary without any DEBUG flags would be best. Then set in mfakto.ini:

#TestSieveSizes (commented out to skip the CPU sieve tests)
TestSievePrimes=60000,65000,70000,75000,80000,85000,90000,100000,110000,120000
TestGPUSieveSizes=32,48,64,96,120,128

Then run mfakto --perftest. This should take quite a while and finally come back with a table that can be used to find the optimal settings. There is no automated testing for GPUSieveProcessSize. For a complete picture, the above test would need to be repeated for all 4 possible values of GPUSieveProcessSize.

The test with CL_PERFORMANCE_INFO gives the speed of the TF kernels. The optimal SievePrimes value is approximately where the reported incremental removal rate matches the TF speed.

And when you have that, we all want to know, how many GHz-days/day IntelHD can churn out :cool:

kracker 2014-07-04 18:14

[QUOTE=Bdot;377346]:blush:

There may be more issues ... I did not yet try to build on linux. Give me another day or two :smile:[/QUOTE]

Just one more thing... I'm "missing" termios.h (included from kbhit.h).
It does say "// simulate _kbhit() on Linux" though, so...
BTW, VS12 doesn't have it either.

Bdot 2014-07-04 20:02

[QUOTE=kracker;377381]Just one more thing... I'm "missing" termios.h(included from kbhit.h)
It does say "// simulate _kbhit() on Linux" though, so...
BTW, VS12 doesn't have it either.[/QUOTE]
On Windows, you should not build kbhit.cpp. Just include conio.h; it will provide _kbhit(). That file is just a workaround for Linux, which does not provide that functionality in its standard libraries.

You will need to adjust the #ifdefs around that to choose the windows branch when building with MinGW.

kracker 2014-07-04 20:28

[QUOTE=Bdot;377389]On Windows, you should not build kbhit.cpp. Just include conio.h, it will provide _kbhit(). That file is just a workaround for Linux as it does not provide that functionality in standard libraries.

You will need to adjust the #ifdefs around that to choose the windows branch when building with MinGW.[/QUOTE]

Well, should I make a new Makefile then? kbhit.cpp is set to build there, and sadly I think there is no easy way to differentiate between platforms in a Makefile, which is one reason for ./configure :smile:

Prime95 2014-07-04 21:17

[QUOTE=Bdot;377349]

Is the same required for the CPU-sieve kernels? [/QUOTE]

Yes.

In barrett24.cl replace the 4620 in mul_24_48 with (4620 % (exp72.d1 + 1000000))

in common.cl calc_FC32, replace the 4620u in mul_hi with (4620 % (exponent + 1000000))


I sure hope you're not including these workarounds in the AMD code, and that they can be turned off when Intel fixes their compiler/drivers. BTW, here is the link to my bug report: [url]https://software.intel.com/en-us/forums/topic/517787[/url]

Prime95 2014-07-04 21:51

1 Attachment(s)
[QUOTE=Bdot;377349]
What's next? I'd need to know the speed of each of the kernels. For that, best would be to send me the output of a few minutes of 'mfakto -st', with CPU sieving, with CL_PERFORMANCE_INFO defined in the build. (this would also answer the question above :smile: ) I need that to update the kernel order in find_fastest_kernel().[/QUOTE]

4 files coming -- for vector_sizes 1,2,4,8

Prime95 2014-07-04 21:52

1 Attachment(s)
vs2

Prime95 2014-07-04 21:52

1 Attachment(s)
vs4

Prime95 2014-07-04 21:53

1 Attachment(s)
vs8

Prime95 2014-07-04 22:40

1 Attachment(s)
--perftest crashes, output attached. MSVC is useless in finding the cause (call_stack is of no value).

Prime95 2014-07-05 01:57

Bdot, have you looked at the Intel OpenCL optimization guide? I'm not enough of an OpenCL / mfakto expert to make much use of it -- maybe you can help. I ran three TF assignments in the 450M area and was getting a paltry 16GHz-days/day. I don't have a feel for what should be theoretically possible.

Guide is here: [url]https://software.intel.com/en-us/iocl_2014.b1_opg?language=it[/url]

I'm especially interested in optimizations that minimize / eliminate the impact on memory bandwidth. The current mfakto does slow down concurrently running LL tests.

Prime95 2014-07-05 02:21

FYI, running mfakto on Haswell overclocked to 4GHz adds 30 watts to power consumption (up from 135 watts). It also increases temps by 2-3 degrees.

kracker 2014-07-05 02:36

[QUOTE=Prime95;377416]Bdot, have you looked at the Intel OpenCL optimization guide? I'm not enough of an OpenCL / mfakto expert to make much use of it -- maybe you can help. I ran three TF assignments in the 450M area and was getting a paltry 16GHz-days/day. I don't have a feel for what should be theoretically possible.

Guide is here: [url]https://software.intel.com/en-us/iocl_2014.b1_opg?language=it[/url]

I'm especially interested in optimizations that minimize / eliminates impact on memory bandwidth. The current mfakto does slow down running LL tests.[/QUOTE]

Just for fun. How much do you get from the DC range, lowest bit possible?

Also, how did you get mfakto to actually recognize the GPU? (I have a 4670K here)

Prime95 2014-07-05 03:18

[QUOTE=kracker;377421]Just for fun. How much do you get from the DC range, lowest bit possible?[/quote]

19.8 GHz-days/day

[quote]Also, how did you get mfakto to actually recognize the GPU?(I have a 4670K here)[/QUOTE]

I installed the OpenCL SDK and driver from Intel's web site. Then
mfakto -d 11
recognizes the HD 4600 graphics unit. Note: this is on a machine that does not have a separate GPU card.

legendarymudkip 2014-07-05 09:40

I have an i5-4670k and seeing that it worked on the HD4600, I decided to give it a go. However, whenever I open the executable from [url]http://mersenneforum.org/mfakto/mfakto-0.14/[/url] I always get:
ERROR: init_CL<3, 0> failed
and I am unsure as to what the problem is. Do I need to build it on my PC? If so, how do I do so?

Prime95 2014-07-05 19:09

[QUOTE=legendarymudkip;377439]I have an i5-4670k and seeing that it worked on the HD4600, I decided to give it a go. [/QUOTE]

Did you download the latest Intel driver? I got mine starting here: [url]https://software.intel.com/en-us/vcsource/tools/opencl-sdk[/url]

You should not need the SDK unless you want to compile and link an executable. The mfakto you downloaded won't work (it doesn't have the fixes I've made), but it should recognize your HD 4600 engine.

legendarymudkip 2014-07-05 19:49

I installed the latest non-beta driver for it today, but every time I try running either executable (32 or 64 bit) it always just quickly flashes up with
ERROR: init_CL<3, 0> failed
and then closes immediately. I don't know what the problem is, but I hope the error message is helpful :/

Bdot 2014-07-06 10:49

[QUOTE=kracker;377392]Well, should I make a new Makefile then? kbhit.cpp is set to build there, and sadly I think there is no easy way to differentiate between platforms in Makefile, a reason for ./configure :smile:[/QUOTE]
I will add #ifdefs to kbhit.cpp/.h so that we can build it on win as well.
[QUOTE=Prime95;377395]
In barrett24.cl replace the 4620 in mul_24_48 with (4620 % (exp72.d1 + 1000000))

in common.cl calc_FC32, replace the 4620u in mul_hi with (4620 % (exponent + 1000000))


I sure hope you're not including these workarounds into the AMD code and they can be turned off when Intel fixes their compiler/drivers. BTW, here is the link to my bug report: [URL]https://software.intel.com/en-us/forums/topic/517787[/URL][/QUOTE]
In common.cl I used (4620 % (exponent + 1)) so that we don't have an overflow if the exponent gets close to 2[SUP]32[/SUP]. As mfakto enforces a minimum exponent of 1000000, I hope this is OK too.
[QUOTE=Prime95;377398]4 files coming -- for vector_sizes 1,2,4,8[/QUOTE]
Very interesting. It seems the HD 4600 has plenty of registers, or uses them very efficiently: all vector sizes run at almost the same speed, with VectorSize=4 the fastest. Regarding the kernels there is no big surprise: the 32-bit kernels are far more efficient than the 15-bit ones, with 32_76 the best. Its raw speed of ~26M FCs per second can yield ~24 GHz-days/day with very high CPU sieving (SievePrimes up to 200,000). GPU sieving should be able to achieve ~20 GHz-days/day.

[QUOTE=Prime95;377404]--perftest crashes, output attached. MSVC is useless in finding the cause (call_stack is of no value).[/QUOTE]
Looks like the driver does not like re-initializing the application. I'll try to build separate perftest modes for CPU and GPU sieving.

[QUOTE=Prime95;377416]Bdot, have you looked at the Intel OpenCL optimization guide? I'm not enough of an OpenCL / mfakto expert to make much use of it -- maybe you can help. I ran three TF assignments in the 450M area and was getting a paltry 16GHz-days/day. I don't have a feel for what should be theoretically possible.

Guide is here: [URL]https://software.intel.com/en-us/iocl_2014.b1_opg?language=it[/URL]

I'm especially interested in optimizations that minimize / eliminate impact on memory bandwidth. The current mfakto does slow down running LL tests.[/QUOTE]

I browsed through that online guide, but it is all very high-level. More details are available only for OpenCL for AVX, i.e. when you target the CPU instead of the built-in GPU.

The TF kernels require almost no bandwidth; nearly all of it comes from the sieving, where the global-memory sieve is stressed. Local memory, which is used for counting and extracting the sieve bits, is probably implemented in the L2 cache, so using a lot of it will cause more cache misses in LL tests.

My guess is that a smaller SieveSize will reduce the amount of global memory used, a lower GPUSieveProcessSize will reduce the amount of L2 cache being accessed, and a lower SievePrimes will cause fewer accesses to global memory. The downside is that the TF kernels will have more work to do, but they use only registers ...

Eliminating memory bandwidth use entirely would require changing the sieving to primarily use registers. On most platforms register space is pretty scarce, though. And as you probably also want to avoid L2 cache accesses, local memory would need to be avoided too ...

[QUOTE=legendarymudkip;377465]I installed the latest non-beta driver for it today, but every time I try running either executable (32 or 64 bit) it always just quickly flashes up with
ERROR: init_CL<3, 0> failed
and then closes immediately. I don't know what the problem is, but I hope the error message is helpful :/[/QUOTE]

Please open a command prompt, cd to the mfakto directory and run mfakto there; then you will see the complete output. It would be interesting to see which error message is reported before the 'init_CL failed'.

Another test would be to run 'clinfo' which should come with the driver. It should report two devices (CPU and GPU) available for use with OpenCL.

legendarymudkip 2014-07-06 12:47

Select device - Error: No platform found
ERROR: init_CL<3, 0> failed

Ran from command prompt and got this.

potonono 2014-07-06 16:21

1 Attachment(s)
I was getting a similar error. In one of the earlier posts, it was mentioned that changing the GPU type to NVIDIA in the INI file would skip one of the checks that Intel's GPU didn't like. Attached is my output for clinfo and mfakto-x64.

kracker 2014-07-06 19:37

[QUOTE=potonono;377507]I was getting a similar error. On one of the earlier posts, it was mentioned that changing the GPU type specifically to NVIDIA in the INI file would skip one of the checks that intel's GPU didn't like. Attached is my output for clinfo and mfakto-x64.[/QUOTE]

You'll need to either get or compile the latest git, btw.

Bdot 2014-07-06 22:29

[QUOTE=legendarymudkip;377500]Select device - Error: No platform found
ERROR: init_CL<3, 0> failed

Ran from command prompt and got this.[/QUOTE]

"No platform found" means that no usable OpenCL driver is available.

[QUOTE=potonono;377507]I was getting a similar error. On one of the earlier posts, it was mentioned that changing the GPU type specifically to NVIDIA in the INI file would skip one of the checks that intel's GPU didn't like. Attached is my output for clinfo and mfakto-x64.[/QUOTE]

You seem to have solved the OpenCL driver issue. But mfakto 0.14 is not ready for IntelHD graphics.

[QUOTE=kracker;377518]You'll need to either get or compile the latest git, btw.[/QUOTE]
I will most likely provide a test-build in the next few days ...

kracker 2014-07-06 22:32

[QUOTE=Bdot;377543]
I will most likely provide a test-build in the next few days ...[/QUOTE]

:smile:
Good!
Now, if I can figure out why the heck my 4600 doesn't appear in clinfo while my other one does... Maybe because of my two AMD GPU's installed? Dunno..

tului 2014-07-07 01:03

I just finished some 77-bit assignments on two 260X cards. Is there any benefit to doing that? They pull around 200 GHz-days/day. If I need to stay on the "front" line, please advise me where to pull numbers from.

I run these machines 24/7. I'd use the CPU too, but it's crunching a 100M-digit number due in February. I really want to help as much as possible (while pumping my numbers); any idea what to do? I can build mfakto from source, so if you have any tests you want me to run, feel free to shoot them my way.

Mark Rose 2014-07-07 09:29

[QUOTE=tului;377547]I just finished some 77-bit assignments on two 260X cards. Is there any benefit to doing that? They pull around 200 GHz-days/day. If I need to stay on the "front" line, please advise me where to pull numbers from.

I run these machines 24/7. I'd use the CPU too, but it's crunching a 100M-digit number due in February. I really want to help as much as possible (while pumping my numbers); any idea what to do? I can build mfakto from source, so if you have any tests you want me to run, feel free to shoot them my way.[/QUOTE]

To stay on the "front line" the easiest way is to use GPU72 and pick "let GPU72 decide" when requesting assignments. It's currently more helpful to factor only up to 74 bits and do more assignments.

Bdot 2014-07-07 15:06

For these cards (like all GCN ones) it may even be more beneficial to just factor to 73 bits (as long as we have assignments there), as they're more efficient in that range. You can request the "to: 73" level from GPU72 as well, using the "What makes sense" option.

chalsall 2014-07-07 16:14

[QUOTE=Bdot;377578]For these cards (like all GCN ones) it may even be more beneficial to just factor to 73 bits (as long as we have assignments there), as they're more efficient in that range.[/QUOTE]

Indeed, and it's useful because it helps build up a buffer for Spidy's "rip-cord" for if and when we need to release candidates early (although it's becoming less and less of an issue). Oh, and I think it's safe to say that we will [I]always[/I] have such work available and needed. :wink:

[QUOTE=Bdot;377578]You can request the "to: 73" level from GPU72 as well, using the "What makes sense" option.[/QUOTE]

Yes, from the manual assignment form, or using MISFIT or teknohog's "primetools".

Be sure to use "What makes sense", or any other option except "Let GPU72 Decide" (LGD), as LGD sets the "Pledge Level" (currently to 74).

Mark Rose 2014-07-07 21:25

Can I ask a favour of someone with an AMD card on Linux using teknohog's primetools? I've got a fork where I've added support for taking into account checkpoint files when using the cache-by-ghz-days option, but I haven't been able to test it with mfakto checkpoint files (just mfaktc).

To test, clone or merge [url]https://github.com/MarkRose/primetools[/url] and execute the mfloop.py script as you normally do but with the -g option for caching GHz-days of work (a value of 1 will most likely not fetch any new work, if you have anything in your worktodo.txt file) and the -d option to print debugging info. It should state the percent completion from the checkpoint file.

Thanks!

kracker 2014-07-07 22:06

[QUOTE=Mark Rose;377609]Can I ask a favour of someone with an AMD card on Linux using teknohog's primetools? I've got a fork where I've added support for taking into account checkpoint files when using the cache-by-ghz-days option, but I haven't been able to test it with mfakto checkpoint files (just mfaktc).

To test, clone or merge [url]https://github.com/MarkRose/primetools[/url] and execute the mfloop.py script as you normally do but with the -g option for caching GHz-days of work (a value of 1 will most likely not fetch any new work, if you have anything in your worktodo.txt file) and the -d option to print debugging info. It should state the percent completion from the checkpoint file.

Thanks![/QUOTE]

Is there a reason for Linux? I use primetools (experimentally) on Windows.

Mark Rose 2014-07-07 23:15

[QUOTE=kracker;377612]Is there a reason for Linux? I use primetools(experimenting) in Windows.[/QUOTE]

I thought Windows users used MISFIT. No, there is no reason for Linux. If primetools is working, then by all means please try my fork :)

chalsall 2014-07-07 23:44

[QUOTE=Mark Rose;377622]No, there is no reason for Linux.[/QUOTE]

There are many reasons for Linux. It works reliably, for example.... :wink:

Mark Rose 2014-07-08 01:04

[QUOTE=chalsall;377626]There are many reasons for Linux. It works reliably, for example.... :wink:[/QUOTE]

lol yes. I defenestrated 11 years ago and that was one of the reasons why :)

kracker 2014-07-08 14:40

1 Attachment(s)
[QUOTE=Mark Rose;377609]Can I ask a favour of someone with an AMD card on Linux using teknohog's primetools? I've got a fork where I've added support for taking into account checkpoint files when using the cache-by-ghz-days option, but I haven't been able to test it with mfakto checkpoint files (just mfaktc).

To test, clone or merge [url]https://github.com/MarkRose/primetools[/url] and execute the mfloop.py script as you normally do but with the -g option for caching GHz-days of work (a value of 1 will most likely not fetch any new work, if you have anything in your worktodo.txt file) and the -d option to print debugging info. It should state the percent completion from the checkpoint file.

Thanks![/QUOTE]
:smile:

Mark Rose 2014-07-08 16:46

[QUOTE=kracker;377658]:smile:[/QUOTE]

Okay, thanks. Could you please paste the contents of that or another checkpoint file? I'll work on a fix.

kracker 2014-07-08 17:39

Two examples:
[code]
60000011 73 74 4620 mfakto 0.14-MGW: 60 0 6D868427
[/code]
[code]
68223611 71 72 4620 mfakto 0.14-MGW: 3736 0 8ED07642
[/code]

The MGW will usually be "Win"; I just built my binary with MinGW.

Mark Rose 2014-07-08 18:39

[QUOTE=kracker;377673]Two examples:
[code]
60000011 73 74 4620 mfakto 0.14-MGW: 60 0 6D868427
[/code]
[code]
68223611 71 72 4620 mfakto 0.14-MGW: 3736 0 8ED07642
[/code]

The MGW will usually be "Win"; I just built my binary with MinGW.[/QUOTE]

Ah, I see the problem. The mfakto checkpoint files have a program-name field that is not present in mfaktc files.

Can you please pull and test again? I've committed a change.

kracker 2014-07-08 22:39

1 Attachment(s)
:smile:

Mark Rose 2014-07-08 23:31

[QUOTE=kracker;377707]:smile:[/QUOTE]

Beautiful :)

Thanks!

kracker 2014-07-11 01:02

Finally got it recognized, it would only appear in clinfo if the iGPU was the primary active display... ugh

[code]
mfakto 0.15pre1-MGW (64bit build)


Runtime options
Inifile mfakto.ini
Verbosity 1
SieveOnGPU yes
MoreClasses yes
GPUSievePrimes 111157
GPUSieveProcessSize 24Ki bits
GPUSieveSize 96Mi bits
FlushInterval 8
WorkFile worktodo.txt
ResultsFile results.txt
Checkpoints enabled
CheckpointDelay 300s
Stages enabled
StopAfterFactor class
PrintMode compact
V5UserID none
ComputerID none
TimeStampInResults yes
VectorSize 4
GPUType AUTO
SmallExp no
UseBinfile mfakto_Kernels.elf
Compiletime options

Select device - Get device info:

OpenCL device info
name Intel(R) HD Graphics 4600 (Intel(R) Corporation)
device (driver) version OpenCL 1.2 (10.18.10.3621)
maximum threads per block 512
maximum threads per grid 134217728
number of multiprocessors 20 (20 compute elements)
clock rate 1200MHz

Automatic parameters
threads per grid 256
optimizing kernels for INTEL

Compiling kernels.

BUILD OUTPUT
In file included from :81:
.\barrett.cl:2040:3: error: use of undeclared identifier 'cl_uint'
cl_uint num_c;
^
.\barrett.cl:2065:3: error: use of undeclared identifier 'num_c'
num_c = NUM_CLASSES % (total_bit_count + 1000000);
^
.\barrett.cl:2123:32: error: use of undeclared identifier 'num_c'
my_k_base.d0 = k_base.d0 + num_c * k_delta; // k_delta can exceed 2^24: don't use mul24/mad24 for it
^
.\barrett.cl:2124:39: error: use of undeclared identifier 'num_c'
my_k_base.d1 = k_base.d1 + mul_hi(num_c, k_delta) - AS_UINT_V(k_base.d0 > my_k_base.d0); /* k is limited to 2^64 -1 so there is no need for k.d2 */
^
.\barrett.cl:2187:3: error: use of undeclared identifier 'cl_uint'
cl_uint num_c;
^
.\barrett.cl:2212:3: error: use of undeclared identifier 'num_c'
num_c = NUM_CLASSES % (total_bit_count + 1000000);
^
.\barrett.cl:2270:32: error: use of undeclared identifier 'num_c'
my_k_base.d0 = k_base.d0 + num_c * k_delta; // k_delta can exceed 2^24: don't use mul24/mad24 for it
^
.\barrett.cl:2271:39: error: use of undeclared identifier 'num_c'
my_k_base.d1 = k_base.d1 + mul_hi(num_c, k_delta) - AS_UINT_V(k_base.d0 > my_k_base.d0); /* k is limited to 2^64 -1 so there is no need for k.d2 */
^
.\barrett.cl:2334:3: error: use of undeclared identifier 'cl_uint'
cl_uint num_c;
^
.\barrett.cl:2359:3: error: use of undeclared identifier 'num_c'
num_c = NUM_CLASSES % (total_bit_count + 1000000);
^
.\barrett.cl:2417:32: error: use of undeclared identifier 'num_c'
my_k_base.d0 = k_base.d0 + num_c * k_delta; // k_delta can exceed 2^24: don't use mul24/mad24 for it
^
.\barrett.cl:2418:39: error: use of undeclared identifier 'num_c'
my_k_base.d1 = k_base.d1 + mul_hi(num_c, k_delta) - AS_UINT_V(k_base.d0 > my_k_base.d0); /* k is limited to 2^64 -1 so there is no need for k.d2 */
^
.\barrett.cl:2481:3: error: use of undeclared identifier 'cl_uint'
cl_uint num_c;
^
.\barrett.cl:2506:3: error: use of undeclared identifier 'num_c'
num_c = NUM_CLASSES % (total_bit_count + 1000000);
^
.\barrett.cl:2564:32: error: use of undeclared identifier 'num_c'
my_k_base.d0 = k_base.d0 + num_c * k_delta; // k_delta can exceed 2^24: don't use mul24/mad24 for it
^
.\barrett.cl:2565:39: error: use of undeclared identifier 'num_c'
my_k_base.d1 = k_base.d1 + mul_hi(num_c, k_delta) - AS_UINT_V(k_base.d0 > my_k_base.d0); /* k is limited to 2^64 -1 so there is no need for k.d2 */
^
.\barrett.cl:2628:3: error: use of undeclared identifier 'cl_uint'
cl_uint num_c;
^
.\barrett.cl:2653:3: error: use of undeclared identifier 'num_c'
num_c = NUM_CLASSES % (total_bit_count + 1000000);
^
.\barrett.cl:2711:32: error: use of undeclared identifier 'num_c'
my_k_base.d0 = k_base.d0 + num_c * k_delta; // k_delta can exceed 2^24: don't use mul24/mad24 for it
^
.\barrett.cl:2712:39: error: use of undeclared identifier 'num_c'
my_k_base.d1 = k_base.d1 + mul_hi(num_c, k_delta) - AS_UINT_V(k_base.d0 > my_k_base.d0); /* k is limited to 2^64 -1 so there is no need for k.d2 */
^
.\barrett.cl:2775:3: error: use of undeclared identifier 'cl_uint'
cl_uint num_c;
^
.\barrett.cl:2800:3: error: use of undeclared identifier 'num_c'
num_c = NUM_CLASSES % (total_bit_count + 1000000);
^
.\barrett.cl:2858:32: error: use of undeclared identifier 'num_c'
my_k_base.d0 = k_base.d0 + num_c * k_delta; // k_delta can exceed 2^24: don't use mul24/mad24 for it
^
.\barrett.cl:2859:39: error: use of undeclared identifier 'num_c'
my_k_base.d1 = k_base.d1 + mul_hi(num_c, k_delta) - AS_UINT_V(k_base.d0 > my_k_base.d0); /* k is limited to 2^64 -1 so there is no need for k.d2 */
^

error: front end compiler failed build.
END OF BUILD OUTPUT
ERROR: load_kernels(0) failed
[/code]

Prime95 2014-07-11 01:46

Try replacing cl_uint with uint.

kracker 2014-07-11 02:39

1 Attachment(s)
[QUOTE=Prime95;377839]Try replacing cl_uint with uint.[/QUOTE]

Well, that fixed the errors.

EDIT: Just on a whim, I ran it with admin privileges. It works. WTF? Once kernels are compiled it seems to be fine...
EDIT2: Now it doesn't recompile (same error) hmm...

Bdot 2014-07-11 10:59

[QUOTE=kracker;377837]
[code].\barrett.cl:2040:3: error: use of undeclared identifier 'cl_uint'

[/code][/QUOTE]
sorry ... fixed now.

The other error is clearly an out-of-memory on the host. Too many things running?

Bdot 2014-07-11 11:22

[QUOTE=kracker;377842]
EDIT2: Now it doesn't recompile (same error) hmm...[/QUOTE]
Well, it's possible that your driver does not support writing binary kernels. I've improved the error handling ...

kracker 2014-07-11 14:32

1 Attachment(s)
[QUOTE=Bdot;377868]Well, possible that your driver does not support writing binary kernels. I improved the error handling ...[/QUOTE]

Hmm, I don't know. I tried it on my i3-3220 (HD 2500) and it compiles fine, so I transferred the compiled kernel back and it seems to work... Probably something with my computer (the i3 has a slightly older driver).

Also.... :whistle:

kracker 2014-07-11 14:36

Also:
:smile:
[code]
gcc -m64 -Wall -O3 -funroll-loops -ffast-math -finline-functions -frerun-loop-opt -fgcse-sm -fgcse-las -flto -I/opt/AMDAPP/include -DBUILD_OPENCL -c mfakto.cpp -o mfakto.o
mfakto.cpp: In function 'int load_kernels(cl_int*)':
mfakto.cpp:916:5: error: expected ';' before '}' token
}
^
mfakto.cpp:1016:18: error: 'numDevices' was not declared in this scope
for(i = 0; i < numDevices; i++)
^
mfakto.cpp:1018:8: error: 'binaries' was not declared in this scope
if(binaries != NULL && binaries[i] != NULL)
^
mfakto.cpp:1024:6: error: 'binaries' was not declared in this scope
if(binaries != NULL)
^
mfakto.cpp:1029:6: error: 'binarySizes' was not declared in this scope
if(binarySizes != NULL)
^
make: *** [mfakto.o] Error 1
[/code]

Bdot 2014-07-11 19:43

[QUOTE=kracker;377878]Also:
:smile:
[/QUOTE]
Oh-oh, things like this happen when I'm in a hurry and don't check ... even the smallest fix can introduce new bugs :gah:

kracker 2014-07-11 20:52

[QUOTE=Bdot;377863]sorry ... fixed now.

The other error is clearly an out-of-memory on the host. Too many things running?[/QUOTE]

That was the first thing I checked. VRAM is/was fine, and I had 6GB of free RAM when I tried it.

tului 2014-07-17 23:48

[QUOTE=Bdot;377901]Oh-oh, things like these happen when in a hurry without checking ... even the smallest fix can introduce new bugs :gah:[/QUOTE]

Well, a build checked out as of today (VS2013 x64 with full optimizations, AVX and LTO) works fine on R7 260Xs. I've been tempted to enable my motherboard's VirtuMVP just to add some Intel HD 4000 to the mix, but the free Virtu download from Asus isn't 8.1-compatible and I'm not buying the $30 full Virtu software.

Bdot 2014-07-21 13:05

I've put the win-64 version of mfakto-0.15pre1 on the [URL="http://www.mersenneforum.org/mfakto/mfakto-0.15pre1/"]ftp[/URL]. It is [B]NOT YET FULLY TESTED FOR PRODUCTION[/B]!

This version should have all the fixes for IntelHD as suggested by George; however, lacking such a system, I could not test that.

It comes with runtime-modifiable settings: press 'm' to see this menu:

[code]
Settings menu

Num Setting Current value (shortcut outside of the menu for de-/increasing this setting)

1 SievePrimes = 97990 (-/+)
2 SieveSize = 35 (s/S)
3 SieveProcessSize = 35 (p/P)
4 SievePrimesAdjust = 0 (a/A)
5 FlushInterval = 0 (f/F)
6 Verbosity = 1 (v/V)
7 PrintMode = 0 (r/R)
8 Kernel = cl_barrett15_73_2 (k/K)

0 Done (continue factoring)

-1 Exit mfakto (q/Q)
Change setting number:
[/code]Factoring is paused while the menu is shown. While in the menu, select a setting by its number. Outside the menu, pressing the keys in parentheses changes the respective value in steps without pausing TF. Keypresses are evaluated only between classes. Any required reinitializations are done automatically. Changing the kernel is not yet implemented.

This feature is intended to make finding the best settings much easier. Please try to break it (and let me know what you did to break it). This includes messing up the settings while running the selftest - there must be no missed factors, no matter what you try.

Let me know if you see the need for other parameters to change at runtime (e.g. VectorSize, SieveOnGPU or MoreClasses - but they would require recompilation of the kernels, which I did not yet implement).

I'm not yet convinced of the usability of this feature - let me know if you have ideas how to improve it.

And of course, this version should pass all self-tests and be no slower than 0.14 (but also no faster - the kernels are unchanged, apart from the INTEL definitions).

legendarymudkip 2014-07-21 16:37

It works for me now. I get around 18 GHz-days/day throughput. Is there any way to increase this, or does that sound about optimal?

Bdot 2014-07-21 18:52

Most likely, this is about the maximum you can get. You can try different VectorSize values; from George's results I understood VectorSize=4 is fastest.

If it is only about speed for mfakto, then switch to CPU sieving (SieveOnGPU=0) and select a high SievePrimes (e.g. 200000). This will use a portion of a CPU core to help the HD4600.

With GPU sieving, the other option is to play around with the settings while it is running; see my previous post. SievePrimes, SieveSize and SieveProcessSize are the adjustable values that affect performance, and maybe also FlushInterval.

Play around with it and tell us what the optimal settings are :smile:.

potonono 2014-07-22 03:12

1 Attachment(s)
I have no issue running the selftest now, though I did still have to specify -d 11 for it to recognize the GPU. All tests were passed, and all tests still passed even when changing the various settings from the menu.

If I run with option --perftest, after a little bit of output the program generates a generic Windows error indicating that the program has stopped working. Attached are the selftest and perftest runs.

Bdot 2014-07-22 21:29

Thanks a lot for your testing![QUOTE=potonono;378772]I have no issue running the selftest now, though did still have to specify -d 11 for it to recognize the GPU. All tests were passed. All test still passed even when changing the various settings from the menu.
[/QUOTE]
I'm making a few changes now, maybe -d 11 will no longer be needed with the next version.
[QUOTE=potonono;378772]
If I run with option --perftest, after a little bit of output the program generates a generic Windows error indicating that the program has stopped working. Attached are the selftest and perftest runs.[/QUOTE]
Oh, right. George already reported that, but I have not yet acted on it ... maybe next version :smile:

If you already tested some real trial factoring, could you please report what the best values for SievePrimes, SieveSize, SieveProcessSize and maybe VectorSize are?

kracker 2014-07-23 02:49

Started st2. :smile:

Bdot 2014-07-24 08:57

Did any of you try the new feature on a real exponent to find more efficient settings than the defaults? If you try, you'll notice that the best SievePrimes, SieveSize and SieveProcessSize may differ between TF jobs ... and the good thing: any improvements you find using this version can be applied to version 0.14 by writing them to the mfakto.ini file.

I'm interested to hear about any improvements and what you changed.

kracker 2014-07-24 13:39

[QUOTE=Bdot;378942]Did anyone of you try the new feature on a real exponent to find more efficient settings than the defaults? If you try, then you'll notice that the best SievePrimes, SieveSize, SieveProcessSize may be different for different TF jobs ... and the good thing: any improvements you find by using this version can be applied to version 0.14 by writing them to the mfakto.ini file.

I'm interested to hear about any improvements and what you changed.[/QUOTE]

I'll do/try that. :smile:

On another note...
[code]

Selftest statistics
number of tests 287351
successful tests 287350
no factor found 1

selftest FAILED!

ERROR: selftest failed, exiting.
[/code]
[code]
######### testcase 2584/32927 (M59000521[82-83]) #########
Starting trial factoring M59000521 from 2^82 to 2^83 (16600.99GHz-days)
Using GPU kernel "cl_barrett15_83_gs_4"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Jul 22 22:31 | 3828 0.1% | 1.094 n.a. | n.a. 81205 0.00%
no factor for M59000521 from 2^82 to 2^83 [mfakto 0.15pre1-Win cl_barrett15_83_gs_4]
ERROR: selftest failed for M59000521 (cl_barrett15_83_gs)
no factor found
tf(): total time spent: 1.094s

Starting trial factoring M59000521 from 2^82 to 2^83 (16600.99GHz-days)
Using GPU kernel "cl_barrett15_88_gs_4"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Jul 22 22:31 | 3828 0.1% | 1.221 n.a. | n.a. 81205 0.00%
M59000521 has a factor: 6190124149267876918004257

found 1 factor for M59000521 from 2^82 to 2^83 [mfakto 0.15pre1-Win cl_barrett15_88_gs_4]
selftest for M59000521 passed (cl_barrett15_88_gs)!
tf(): total time spent: 1.221s

Starting trial factoring M59000521 from 2^82 to 2^83 (16600.99GHz-days)
Using GPU kernel "cl_barrett32_87_gs_4"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Jul 22 22:31 | 3828 0.1% | 0.710 n.a. | n.a. 81205 0.00%
M59000521 has a factor: 6190124149267876918004257

found 1 factor for M59000521 from 2^82 to 2^83 [mfakto 0.15pre1-Win cl_barrett32_87_gs_4]
selftest for M59000521 passed (cl_barrett32_87_gs)!
tf(): total time spent: 0.711s

Starting trial factoring M59000521 from 2^82 to 2^83 (16600.99GHz-days)
Using GPU kernel "cl_barrett32_88_gs_4"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Jul 22 22:31 | 3828 0.1% | 0.730 n.a. | n.a. 81205 0.00%
M59000521 has a factor: 6190124149267876918004257

found 1 factor for M59000521 from 2^82 to 2^83 [mfakto 0.15pre1-Win cl_barrett32_88_gs_4]
selftest for M59000521 passed (cl_barrett32_88_gs)!
tf(): total time spent: 0.730s

Starting trial factoring M59000521 from 2^82 to 2^83 (16600.99GHz-days)
Using GPU kernel "cl_barrett32_92_gs_4"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Jul 22 22:31 | 3828 0.1% | 0.831 n.a. | n.a. 81205 0.00%
M59000521 has a factor: 6190124149267876918004257

found 1 factor for M59000521 from 2^82 to 2^83 [mfakto 0.15pre1-Win cl_barrett32_92_gs_4]
selftest for M59000521 passed (cl_barrett32_92_gs)!
tf(): total time spent: 0.831s
[/code]

