mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   768k Skylake Problem/Bug (https://www.mersenneforum.org/showthread.php?t=20714)

chalsall 2016-01-02 17:32

[QUOTE=Prime95;420992]Yes, I build with MSVC 2005.[/QUOTE]

And what compiler do you use to produce the Linux executables for mprime?

It has been reported that this bug manifests under Linux as well as Windows (on some machines).

Prime95 2016-01-02 20:34

[QUOTE=chalsall;421010]And what compiler do you use to produce the Linux executables for mprime?[/QUOTE]

GCC but all the critical FFT code is assembled in Windows using Masm.

chalsall 2016-01-02 20:45

[QUOTE=Prime95;421024]GCC but all the critical FFT code is assembled in Windows using Masm.[/QUOTE]

Interesting...

So while we thought we had eliminated a variable (it's OS independent) this might actually come down to focusing on Microsoft's assembler's interaction with the Skylake architecture.

Or, maybe not...

Solving intermittent problems is fun! Not easy, mind you, but rewarding....

Madpoo 2016-01-02 20:59

[QUOTE=tha;420995]I will finish the current test, which will be another two hours. I will then start a new test with 8 threads working concurrently on the following worktodo.txt test case:
[/QUOTE]

You might be better off just stopping what you're running now and doing the torture test at 768K since that's known to cause the problem on affected CPUs.

Then if you get the roundoff errors you'll know your CPU is "in the club" and you can try to reproduce using a "real" exponent.

My guess is that to replicate what the torture test is doing, you'd want to have all 8 (physical and HT) cores in a single worker. Not 8 separate workers doing 8 separate tests. It could be something specific to the threading code and combining the separate chunks from each large multiplication.

megabit8 2016-01-02 21:40

Trying to interpret the results...

If I run an FFT of type 2 (similar to gwsquare), with NORMNUM = 1 and NRMRTN = yi3eCORE (whatever that is; there's no source for it), on the data:
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, .... 0, 0 [10 ones then 768*1024 - 10 zeroes]
I get the following output:
1 2 4 6 8 10 12 14 16 18 18 16 14 12 10 8 6 4 2 0 0 ..... all zeroes till the end.
I was expecting:
1 2 3 4 5 6 7 8 9 10 9 8 7 6 5 4 3 2 1 0 .... zeroes till the end. Or at least double this array, because that is the real square. Is the norm routine doing something to this small data?

[CODE]
// Fill the FFT data: ten ones followed by zeroes.
for (int i = 768 * 1024; --i >= 0; )
{
    *addr(gwdata, s, i) = i < 10 ? 1 : 0;
}
// Run the type 2 FFT described above.
gw_fft(gwdata, asm_data);
// Copy the result words back out.
for (int i = 768 * 1024; --i >= 0; )
{
    output[i] = *addr(gwdata, s, i);
}
[/CODE]
The only explanation I can find is that gwsquare(p) actually computes 2 * p^2 - 2 * p + 1 instead of p^2. Is that right?

Prime95 2016-01-02 22:34

[QUOTE=megabit8;421033]If I run an FFT of type 2 (similar to gwsquare), with NORMNUM = 1 and NRMRTN = yi3eCORE (whatever that is; there's no source for it), on the data:
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, .... 0, 0 [10 ones then 768*1024 - 10 zeroes]
I was expecting:
1 2 3 4 5 6 7 8 9 10 9 8 7 6 5 4 3 2 1 0 .... zeroes till the end. Or at least double this array, because that is the real square. Is the norm routine doing something to this small data?

[/QUOTE]

Gwnum performs weighted transforms. The initial FFT data values are rarely integers. If you call set_fft_value it will apply the proper weighting factor. Similarly, get_fft_value will return the FFT value after removing the weighting factor.

Source for yi3eCORE is in ymult3a.asm. You'll have to wade through a pile of nasty MASM macros to see the generated assembly code. yi3eCORE is the rounding-to-integer and carry propagation code. y=AVX, i=Irrational FFT, e=calc round off error, CORE=optimized for CORE architectures.

Prime95 2016-01-02 22:35

[QUOTE=Madpoo;421030]You might be better off just stopping what you're running now and doing the torture test at 768K since that's known to cause the problem on affected CPUs.

Then if you get the roundoff errors you'll know your CPU is "in the club" and you can try to reproduce using a "real" exponent.

My guess is that to replicate what the torture test is doing, you'd want to have all 8 (physical and HT) cores in a single worker. Not 8 separate workers doing 8 separate tests. It could be something specific to the threading code and combining the separate chunks from each large multiplication.[/QUOTE]

Amen to part 1. As to part 2, a torture test is more like 8 workers running separate tests.

megabit8 2016-01-02 23:01

[QUOTE=Prime95;421036]Gwnum performs weighted transforms. The initial FFT data values are rarely integers. If you call set_fft_value it will apply the proper weighting factor. Similarly, get_fft_value will return the FFT value after removing the weighting factor.

Source for yi3eCORE is in ymult3a.asm. You'll have to wade through a pile of nasty MASM macros to see the generated assembly code. yi3eCORE is the rounding-to-integer and carry propagation code. y=AVX, i=Irrational FFT, e=calc round off error, CORE=optimized for CORE architectures.[/QUOTE]

Thank you for your prompt response.

megabit8 2016-01-03 01:44

[QUOTE=Prime95;421036]Gwnum performs weighted transforms. The initial FFT data values are rarely integers. If you call set_fft_value it will apply the proper weighting factor. Similarly, get_fft_value will return the FFT value after removing the weighting factor.
[/QUOTE]
Tried set_fft_value for the input:
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, .... 0
still the magic output appears with get_fft_value:
1 2 4 6 8 10 12 14 16 18 18 16 14 12 10 8 6 4 2 0 ... 0

Tried with 1, 3, 5, 7, 9, 0, 0, ...., 0
The output is: 1 6 28 74 152 248 278 252 162 0 .... 0
Instead of: 1 6 19 44 85 124 139 126 81 0 .... 0

Prime95 2016-01-03 02:45

[QUOTE=megabit8;421048]Tried set_fft_value for the input:
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, .... 0
still the magic output appears with get_fft_value:
1 2 4 6 8 10 12 14 16 18 18 16 14 12 10 8 6 4 2 0 ... 0

Tried with 1, 3, 5, 7, 9, 0, 0, ...., 0
The output is: 1 6 28 74 152 248 278 252 162 0 .... 0
Instead of: 1 6 19 44 85 124 139 126 81 0 .... 0[/QUOTE]

That is OK. Gwnum stuffs a varying number of bits into each FFT word. In your case, either the floor or the ceiling of 14942209 / 768K, i.e. 19 or 20 bits per word.
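
To see why variable word sizes produce the "magic" output, here is a minimal sketch in plain C that does not call gwnum at all. It assumes the usual Crandall-Fagin style layout in which word i starts at bit ceil(i * p / n); that layout, and the helper names `bit_offset`, `word_bits` and `square_words`, are illustrative assumptions rather than anything taken from the gwnum source. Under that assumption, word 0 holds 20 bits and the following words hold 19 bits each for p = 14942209 and n = 768K, so most cross products land one bit higher than they would in a fixed base and show up doubled.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout (not taken from gwnum): word i starts at bit
   ceil(i * p / n), so every word holds either floor(p / n) or
   ceil(p / n) bits. */
uint64_t bit_offset(uint64_t i, uint64_t p, uint64_t n)
{
    return (i * p + n - 1) / n;
}

/* Number of bits stored in word i under the assumed layout. */
unsigned word_bits(uint64_t i, uint64_t p, uint64_t n)
{
    return (unsigned)(bit_offset(i + 1, p, n) - bit_offset(i, p, n));
}

/* Square a number given as `len` small non-negative words and write
   the first `out_len` words of the result to `out`.
   Toy version for tiny inputs only: it ignores the wrap-around
   modulo 2^p - 1 and assumes every intermediate sum fits in 64 bits. */
void square_words(const uint64_t *in, int len, uint64_t *out, int out_len,
                  uint64_t p, uint64_t n)
{
    assert(len > 0 && out_len >= 2 * len);
    for (int k = 0; k < out_len; k++)
        out[k] = 0;

    for (int i = 0; i < len; i++) {
        for (int j = 0; j < len; j++) {
            /* in[i] * in[j] sits at bit offset(i) + offset(j), which is
               0 or 1 bits above the start of word i + j. */
            int k = i + j;
            unsigned shift = (unsigned)(bit_offset(i, p, n) +
                                        bit_offset(j, p, n) -
                                        bit_offset(k, p, n));
            out[k] += (in[i] * in[j]) << shift;
        }
    }

    /* Carry propagation: anything above a word's width moves up. */
    for (int k = 0; k + 1 < out_len; k++) {
        unsigned w = word_bits(k, p, n);
        out[k + 1] += out[k] >> w;
        out[k] &= (UINT64_C(1) << w) - 1;
    }
}
```

Calling `square_words` with ten ones, p = 14942209 and n = 768 * 1024 gives 1 2 4 6 ... 18 18 16 ... 2 0, and the input 1, 3, 5, 7, 9 gives 1 6 28 74 152 248 278 252 162, matching the values posted above. Both results are therefore consistent with a correct squaring in a mixed 20/19-bit representation rather than with a fault in the normalization routine.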

chalsall 2016-01-03 21:48

So, tomorrow most of the rest of the world gets back to work.

It might be interesting to see what happens, as we have done additional research (and engaged in heated debate) on this issue while others enjoyed their time off...

Sorry for being a prick. It's in my training; and my general nature...

Question everything, and take no offence at anything....


All times are UTC.