mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Math (https://www.mersenneforum.org/forumdisplay.php?f=8)
-   -   Fast length-8 acyclic convolution algorithm (https://www.mersenneforum.org/showthread.php?t=24045)

ewmayer 2019-02-03 20:56

Fast length-8 acyclic convolution algorithm
 
I have an application that involves both cyclic and acyclic (negacyclic) real-data convolutions of length 8, with one of the two vector inputs consisting of constants in each case, and the other varying. For the cyclic convo we expect there to be highly efficient algorithms based on the factorizability over the integers of the polynomial x^8-1. And indeed, Nussbaumer gives an 8-cyclic algorithm needing just [46 add, 14 mul].

For the 8-acyclic, x^8+1 is irreducible over the integers, and things are more challenging. Nussbaumer's reference algorithm here needs a whopping [77 add, 21 mul]. We are targeting modern computer hardware, so can consider options which increase the multiply count, i.e. achieve a better balance of add and mul, i.e. our optimization target is cycle count on various kinds of compute hardware. If fused multiply-add (FMA) is available, even the dumb matrix-multiply form of the 8-acyclic with its 56 add and 64 mul can be done decently fast using just 64 fma. But we hope to do better. For example, calling the const and variable inputs to our 8-acyclic A = {a0,...,a7} and X = {x0,...,x7}, we can recast the 8-acyclic as a complex length-4 cyclic convolution via the right-angle-transform trick, in which we treat our 2 inputs as complex 4-vectors:

A = {a0+I.a4,a1+I.a5,a2+I.a6,a3+I.a7} and similarly for X,

and premultiply each of our complex 4-vectors by {w^0,w^1,w^2,w^3}, where w = exp(2.I.Pi/16) is a 16th root of unity. (More generally, we convert a length-N real acyclic into a length-N/2 complex cyclic by premultiplying our complexified input vectors by the 0,...,(N/2-1) powers of an (N/2)th complex root of unity). In the 4-vector case the resulting acyclic-effecting DWT weights are {w^0,w^1,w^2,w^3} = {1,c+I.s,(1+I)/sqrt(2),s+I.c}, where c = cos(Pi/8), s = sin(Pi/8). The DWT weighting of the A-vector is considered to be free since it's a 1-time precomputation, but applying the DWT multipliers to the X-input needs [6 add, 10 mul]. Nussbaumer gives a real 4-cyclic algorithm needing [15 add, 5 mul]; in our case we need a complex analog of that, which needs [40 add, 20 mul] since we double the add count but then replace each real*real mul with a complex one needing [2 add, 4 mul] in terms of the real-data opcount. Lastly we need to un-DWT-weight our outputs via multiplication by {w^0,w^-1,w^-2,w^-3}, which again costs [6 add, 10 mul]. Our total opcount for the right-angle variant is thus [52 add, 40 mul], which is much better balanced than the original [77 add, 21 mul] for the pure-real algo, and is also much more "FMA-izable". The opcount is still distressingly high compared to the the [46 add, 14 mul] of the real 8-cyclic, however.

Does anyone here know of a better 8-acyclic algorithm?


All times are UTC. The time now is 18:16.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.