mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Math

Reply
 
Thread Tools
Old 2018-10-31, 01:25   #1
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

11101011001102 Posts
Default Radix-8 roundoff / noise

Pre AVX-512 prime95 did primarily radix-4 FFTs. AVX-512 prime95 does primarily radix-8 FFTs.

The reason for the change is AVX-512 has double the number of registers. In past Intel architectures, it has generally been profitable to do as much work as possible once data is loaded into registers.

Anyway, it has been pointed out by early users that the roundoff error for radix-8 FFTs is worse than radix-4 FFTs. My very preliminary observation is that the maximum roundoff error is about 20% worse. That is if radix-4 FFTs were generating a max roundoff of 0.3125 are now generating an error of 0.375.

Have other FFT developers noticed the same issue?

I presume the increased error is from radix-8 doing several multiplications by (SQRTHALF + i*SQRTHALF). Any ideas for reducing the increased noise?

As is, I estimate this increased noise will result in reduction of 0.1 bits per FFT input word.
Prime95 is online now   Reply With Quote
Old 2018-10-31, 03:11   #2
Mysticial
 
Mysticial's Avatar
 
Sep 2016

33210 Posts
Default

There's a messy solution that I've played with in the past. It wasn't to improve precision, but to improve performance.

Detach the twiddle factors from the butterfly. Don't let the implementation dictate the operations you are actually doing. IOW, don't let the implementation dictate where you do your twiddle factors.

Some years ago, I implemented an FFT that was mathematically and numerically the same as a split-radix FFT, but was implemented using a standard radix-4 access pattern. The split-radix FFT had a lower operation count, but radix-4 was more memory-friendly. So I was able to get the benefits of both. (Granted, I no longer use this FFT since operation-count is less relevant in today's environment.)

The location and values of the twiddle factors are not constant between butterflies. They change from butterfly to butterfly. For the split-radix/radix-4 case, I needed two types of butterflies, one with the twiddles in the middle, and one with the twiddles at the end/beginning. And it also complicated the load-streams from the twiddle tables.

This will probably get very ugly for the radix 8 case though. Try to find a layout of twiddle factors that's numerically the best. Then try to fit into (multiple) radix-8 butterfly types to cover all the cases.
Mysticial is offline   Reply With Quote
Old 2018-10-31, 19:29   #3
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

103·113 Posts
Default

In my own work I have come across examples of counterintuitive behavior involving those muls-by-sqrt2 (or by 1/sqrt2) in the radix-8 DFT, especially when done via FMA. I recall being able to reduce the ROE levels in one case by deliberately rounding the precomputed sqrt2 multiplier "the wrong way", in another by using a mix of correctly and wrong-way rounded such multipliers. Since the number of such muls in a radix-8 DFT is relatively small, it's not terribly hard to try a bunch of combos and tabulate the resulting ROEs for some quasi-random sample of exponents and FFT lengths to see if one particular such combo of roundings shows consistently superior ROE behavior.
ewmayer is offline   Reply With Quote
Old 2018-11-02, 19:51   #4
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

2×53×71 Posts
Default

Quote:
Originally Posted by Mysticial View Post
Detach the twiddle factors from the butterfly. Don't let the implementation dictate the operations you are actually doing. IOW, don't let the implementation dictate where you do your twiddle factors..
I looked at this a decade ago and found the bookkeeping for split radix and twiddle factors would be a major headache. I abandoned any further investigation since most FFTs are not a power-of-two.

The good news is I've studied my code some and think I can reduce the radix-8 round off somewhat. I'll test my idea out soon and if it works, I have a fair amount of macros to rewrite. The basic idea is to combine the radix-8 twiddle factor (stored as cosine/sine, sine) with the SQRTHALF. This lets me rearrange computations so that there is a more balanced number of floating-point ops in calculating each radix-8 output.
Prime95 is online now   Reply With Quote
Old 2018-11-02, 20:32   #5
Mysticial
 
Mysticial's Avatar
 
Sep 2016

22×83 Posts
Default

Quote:
Originally Posted by Prime95 View Post
I looked at this a decade ago and found the bookkeeping for split radix and twiddle factors would be a major headache. I abandoned any further investigation since most FFTs are not a power-of-two.
Split radix doesn't require the length to be a power-of-two. Just divisible by 4 at that level. My implementation back then supported it for 2^k, 3*2^k, and 5*2^k.

Non-power-of-two split radix also works with Bernstein's conjugate twiddle trick.

Speaking of which, do you use Bernstein's trick in any form? Because of the data reordering, I've never figured out how to make it work for the 2-pass or N-pass FFTs where all the high-order twiddles are "squeezed" into a single layer between each pass.
Mysticial is offline   Reply With Quote
Old 2018-11-02, 21:12   #6
preda
 
preda's Avatar
 
"Mihai Preda"
Apr 2015

3·457 Posts
Default

Quote:
Originally Posted by Mysticial View Post
Non-power-of-two split radix also works with Bernstein's conjugate twiddle trick.

Speaking of which, do you use Bernstein's trick in any form?
What is Bernstein's conjugate twiddle trick?
preda is offline   Reply With Quote
Old 2018-11-02, 21:28   #7
Mysticial
 
Mysticial's Avatar
 
Sep 2016

22×83 Posts
Default

Quote:
Originally Posted by preda View Post
What is Bernstein's conjugate twiddle trick?
A "normal" split-radix butterfly has twiddle factors of w and w^3 on the last two points.

Instead of using w and w^3, you use w and w^-1. It produces the correct frequency domain outputs, but it further scrambles up the ordering of the points that got flipped from w^3 to w^-1.

Since w and w^-1 are complex conjugates, you can then reuse the same twiddle factor load for both. Thus it eliminates half your twiddle factor loads and streams.

The trick isn't restricted to the split-radix butterfly. You can do it on the standard radix-4 butterfly as well though you only eliminate 1/3 of the twiddle loads as opposed to 1/2. The concept generalizes to higher radix butterflies and higher split-radix butterflies as well.

Last fiddled with by Mysticial on 2018-11-02 at 21:34
Mysticial is offline   Reply With Quote
Old 2018-11-02, 23:58   #8
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

11101011001102 Posts
Default

Quote:
Originally Posted by Mysticial View Post
Speaking of which, do you use Bernstein's trick in any form? Because of the data reordering, I've never figured out how to make it work for the 2-pass or N-pass FFTs where all the high-order twiddles are "squeezed" into a single layer between each pass.
I do use Bernstein's trick.

Prime95 really does a standard radix-4 (now radix-8) FFT. When I say prime95 does two passes that simply refers to passes through main memory. Pass 1 reads large-strided data, pass 2 reads single-strided data.

I do further reduce the volume of sin/cos data at the expense of some extra complex multiplies.
Prime95 is online now   Reply With Quote
Old 2018-11-03, 16:38   #9
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

2×53×71 Posts
Default

Grrr. My reordering of radix-8 operations did not help. I cannot explain why.

My idea was to turn this:

Code:
;;R1 = ((r1+r5)+(r3+r7))      +((r2+r8)+(r4+r6))
;;R2 = ((r1-r5)+(i3-i7)) +SQRTHALF*(((r2+r8)-(r4+r6))+((i2-i8)+(i4-i6)))
into this:

Code:
;;R1 = ((r1+r5)+(r3+r7))      +SQRT2*((r2+r8)+(r4+r6))
;;R2 = ((r1-r5)+(i3-i7)) + (((r2+r8)-(r4+r6))+((i2-i8)+(i4-i6)))
In theory, the maximum roundoff error is highly likely to occur in calculating R2 because there are more floating point operations compared to calculating R1. In the modified code, one multiply was moved from the calculation of R2 and to the calculation of R1.

The new code works, but generates very slightly worse roundoff behavior. What was wrong with my logic?
Prime95 is online now   Reply With Quote
Reply



Similar Threads
Thread Thread Starter Forum Replies Last Post
Active noise cancelling CPU cooler kladner Hardware 3 2013-06-21 23:30
strange problem: efficient 'radix sums' jasonp Programming 13 2013-05-16 19:11
Roundoff error bcp19 Software 4 2013-02-14 21:23
HDD noise Jugger Information & Answers 10 2011-03-30 06:25
Playing with different radix LoveCraft Programming 7 2005-11-14 07:59

All times are UTC. The time now is 18:14.


Fri Jul 16 18:14:13 UTC 2021 up 49 days, 16:01, 1 user, load averages: 2.44, 2.30, 1.96

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.