mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Math (https://www.mersenneforum.org/forumdisplay.php?f=8)
-   -   Radix-8 roundoff / noise (https://www.mersenneforum.org/showthread.php?t=23754)

Prime95 2018-10-31 01:25

Radix-8 roundoff / noise
 
Pre AVX-512 prime95 did primarily radix-4 FFTs. AVX-512 prime95 does primarily radix-8 FFTs.

The reason for the change is AVX-512 has double the number of registers. In past Intel architectures, it has generally been profitable to do as much work as possible once data is loaded into registers.

Anyway, it has been pointed out by early users that the roundoff error for radix-8 FFTs is worse than radix-4 FFTs. My very preliminary observation is that the maximum roundoff error is about 20% worse. That is if radix-4 FFTs were generating a max roundoff of 0.3125 are now generating an error of 0.375.

Have other FFT developers noticed the same issue?

I presume the increased error is from radix-8 doing several multiplications by (SQRTHALF + i*SQRTHALF). Any ideas for reducing the increased noise?

As is, I estimate this increased noise will result in reduction of 0.1 bits per FFT input word.

Mysticial 2018-10-31 03:11

There's a messy solution that I've played with in the past. It wasn't to improve precision, but to improve performance.

Detach the twiddle factors from the butterfly. Don't let the implementation dictate the operations you are actually doing. IOW, don't let the implementation dictate where you do your twiddle factors.

Some years ago, I implemented an FFT that was mathematically and numerically the same as a split-radix FFT, but was implemented using a standard radix-4 access pattern. The split-radix FFT had a lower operation count, but radix-4 was more memory-friendly. So I was able to get the benefits of both. (Granted, I no longer use this FFT since operation-count is less relevant in today's environment.)

The location and values of the twiddle factors are not constant between butterflies. They change from butterfly to butterfly. For the split-radix/radix-4 case, I needed two types of butterflies, one with the twiddles in the middle, and one with the twiddles at the end/beginning. And it also complicated the load-streams from the twiddle tables.

This will probably get very ugly for the radix 8 case though. Try to find a layout of twiddle factors that's numerically the best. Then try to fit into (multiple) radix-8 butterfly types to cover all the cases.

ewmayer 2018-10-31 19:29

In my own work I have come across examples of counterintuitive behavior involving those muls-by-sqrt2 (or by 1/sqrt2) in the radix-8 DFT, especially when done via FMA. I recall being able to reduce the ROE levels in one case by deliberately rounding the precomputed sqrt2 multiplier "the wrong way", in another by using a mix of correctly and wrong-way rounded such multipliers. Since the number of such muls in a radix-8 DFT is relatively small, it's not terribly hard to try a bunch of combos and tabulate the resulting ROEs for some quasi-random sample of exponents and FFT lengths to see if one particular such combo of roundings shows consistently superior ROE behavior.

Prime95 2018-11-02 19:51

[QUOTE=Mysticial;499146]
Detach the twiddle factors from the butterfly. Don't let the implementation dictate the operations you are actually doing. IOW, don't let the implementation dictate where you do your twiddle factors..[/QUOTE]

I looked at this a decade ago and found the bookkeeping for split radix and twiddle factors would be a major headache. I abandoned any further investigation since most FFTs are not a power-of-two.

The good news is I've studied my code some and think I can reduce the radix-8 round off somewhat. I'll test my idea out soon and if it works, I have a fair amount of macros to rewrite. The basic idea is to combine the radix-8 twiddle factor (stored as cosine/sine, sine) with the SQRTHALF. This lets me rearrange computations so that there is a more balanced number of floating-point ops in calculating each radix-8 output.

Mysticial 2018-11-02 20:32

[QUOTE=Prime95;499374]I looked at this a decade ago and found the bookkeeping for split radix and twiddle factors would be a major headache. I abandoned any further investigation since most FFTs are not a power-of-two.[/QUOTE]

Split radix doesn't require the length to be a power-of-two. Just divisible by 4 at that level. My implementation back then supported it for 2^k, 3*2^k, and 5*2^k.

Non-power-of-two split radix also works with Bernstein's conjugate twiddle trick.

Speaking of which, do you use Bernstein's trick in any form? Because of the data reordering, I've never figured out how to make it work for the 2-pass or N-pass FFTs where all the high-order twiddles are "squeezed" into a single layer between each pass.

preda 2018-11-02 21:12

[QUOTE=Mysticial;499377]
Non-power-of-two split radix also works with Bernstein's conjugate twiddle trick.

Speaking of which, do you use Bernstein's trick in any form?[/QUOTE]

What is Bernstein's conjugate twiddle trick?

Mysticial 2018-11-02 21:28

[QUOTE=preda;499380]What is Bernstein's conjugate twiddle trick?[/QUOTE]

A "normal" split-radix butterfly has twiddle factors of [B][I]w[/I][/B] and [B][I]w^3[/I][/B] on the last two points.

Instead of using [B][I]w[/I][/B] and [B][I]w^3[/I][/B], you use [B][I]w[/I][/B] and [B][I]w^-1[/I][/B]. It produces the correct frequency domain outputs, but it further scrambles up the ordering of the points that got flipped from [B][I]w^3[/I][/B] to [B][I]w^-1[/I][/B].

Since [B][I]w[/I][/B] and [B][I]w^-1[/I][/B] are complex conjugates, you can then reuse the same twiddle factor load for both. Thus it eliminates half your twiddle factor loads and streams.

The trick isn't restricted to the split-radix butterfly. You can do it on the standard radix-4 butterfly as well though you only eliminate 1/3 of the twiddle loads as opposed to 1/2. The concept generalizes to higher radix butterflies and higher split-radix butterflies as well.

Prime95 2018-11-02 23:58

[QUOTE=Mysticial;499377]Speaking of which, do you use Bernstein's trick in any form? Because of the data reordering, I've never figured out how to make it work for the 2-pass or N-pass FFTs where all the high-order twiddles are "squeezed" into a single layer between each pass.[/QUOTE]

I do use Bernstein's trick.

Prime95 really does a standard radix-4 (now radix-8) FFT. When I say prime95 does two passes that simply refers to passes through main memory. Pass 1 reads large-strided data, pass 2 reads single-strided data.

I do further reduce the volume of sin/cos data at the expense of some extra complex multiplies.

Prime95 2018-11-03 16:38

Grrr. My reordering of radix-8 operations did not help. I cannot explain why.

My idea was to turn this:

[CODE]
;;R1 = ((r1+r5)+(r3+r7)) +((r2+r8)+(r4+r6))
;;R2 = ((r1-r5)+(i3-i7)) +SQRTHALF*(((r2+r8)-(r4+r6))+((i2-i8)+(i4-i6)))
[/CODE]

into this:

[CODE]
;;R1 = ((r1+r5)+(r3+r7)) +SQRT2*((r2+r8)+(r4+r6))
;;R2 = ((r1-r5)+(i3-i7)) + (((r2+r8)-(r4+r6))+((i2-i8)+(i4-i6)))
[/CODE]

In theory, the maximum roundoff error is highly likely to occur in calculating R2 because there are more floating point operations compared to calculating R1. In the modified code, one multiply was moved from the calculation of R2 and to the calculation of R1.

The new code works, but generates very slightly worse roundoff behavior. What was wrong with my logic?


All times are UTC. The time now is 18:14.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.