Low latency filtering: Partitioned Convolution

work in progress - conceptual thoughts

Text and figures © DD4WH, under GNU GPLv3

A long-standing plan is to change the main audio filtering in UHSDR from time-domain filtering to Fast Convolution filtering. However, to obtain sufficiently steep filters, the FFT size for the FFT-iFFT audio chain has to be at least 2048 or even 4096 (FIR filter impulse response with 1025 / 2049 coefficients). At a 24 ksps sample rate, buffering 4096 samples alone produces an inherent delay of about 170 ms (4096 / 24000 s; about 85 ms for 2048 samples), which is unacceptable for CW operators and can also be annoying for operators in other modes.

A solution to this problem, called "Partitioned Convolution", was highlighted by Warren Pratt in his HAMRADIO 2018 talk at the Software Defined Radio Academy (see also Kulp 1988, Armelloni et al. 2003). In Partitioned Convolution, the filter's impulse response is split into separate blocks, and the convolution is performed block by block with small FFTs instead of one big FFT over the whole impulse response. The latency is then set by the block length rather than by the total filter length.
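Why splitting the impulse response is allowed can be checked in the time domain, without any FFT: convolution is linear, so the full result is the sum of the partial convolutions, each delayed by the offset of its partition. The following is a minimal sketch of that identity only; the function names and the test data are made up for illustration and are not taken from wdsp.

```python
def convolve(x, h):
    """Direct linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y


def partitioned_convolve(x, h, block):
    """Convolve x with h after splitting h into partitions of `block` taps.

    The partition starting at tap `start` contributes its partial result
    delayed by `start` samples, so adding all partial results at their
    offsets rebuilds the full convolution.
    """
    y = [0.0] * (len(x) + len(h) - 1)
    for start in range(0, len(h), block):
        partial = convolve(x, h[start:start + block])
        for n, value in enumerate(partial):
            y[start + n] += value
    return y


# Arbitrary illustrative data: 8 input samples, 8 taps, 4 partitions of 2 taps.
x = [1.0, -2.0, 3.0, 0.5, -1.5, 2.0, 0.0, 4.0]
h = [0.5, 0.25, -0.125, 1.0, 0.75, -0.5, 0.1, 0.2]
direct = convolve(x, h)
split = partitioned_convolve(x, h, block=2)
max_error = max(abs(a - b) for a, b in zip(direct, split))
print("largest difference between direct and partitioned result:", max_error)
```

In the real filter each partial convolution is done with a small FFT, and the delay between partitions comes from reusing the spectra of earlier input blocks.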

For UHSDR running on OVI40 with the STM32F7 processor, we would like to implement Fast Convolution filtering with partitioned convolution in order to minimize filter latency while maintaining a high quality filter with steep filter skirts ("brickwall").

[the following is just notes taken from understanding wdsp, firmin.c, "Standalone Partitioned overlap-save bandpass", Pratt 2018]

Setup (repeat every time the filter is adjusted):

calculate 2048 complex FIR filter coefficients (= impulse response) with windowing (Kaiser or Blackman-Harris 4-term)
partition the coefficients into 8 blocks of 256 coefficients each
calculate a 256-point FFT of each block
store the FFT results in fmask[8][512] (open point: I still have to understand exactly how this is done in wdsp; it looks as if half of the impulse response is discarded)
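The first two setup steps (coefficient calculation and partitioning) could look roughly like the sketch below. It is not the wdsp code: it assumes a windowed-sinc lowpass shifted to the passband centre by a complex exponential, uses the standard 4-term Blackman-Harris window constants, and the band edges of 300 Hz and 2700 Hz are purely illustrative. All function names are hypothetical. The FFT of each block is left out here.

```python
import cmath
import math


def blackman_harris4(n_taps):
    """4-term Blackman-Harris window (standard coefficients)."""
    a0, a1, a2, a3 = 0.35875, 0.48829, 0.14128, 0.01168
    m = n_taps - 1
    return [a0 - a1 * math.cos(2 * math.pi * i / m)
            + a2 * math.cos(4 * math.pi * i / m)
            - a3 * math.cos(6 * math.pi * i / m) for i in range(n_taps)]


def complex_bandpass(n_taps, f_low, f_high, fs):
    """Complex bandpass taps: windowed-sinc lowpass shifted to the band centre."""
    half_bw = (f_high - f_low) / 2.0 / fs    # lowpass cutoff, normalised to fs
    centre = (f_high + f_low) / 2.0 / fs     # frequency shift, normalised to fs
    mid = (n_taps - 1) / 2.0
    window = blackman_harris4(n_taps)
    taps = []
    for i in range(n_taps):
        t = i - mid
        if t == 0:
            sinc = 2.0 * half_bw
        else:
            sinc = math.sin(2 * math.pi * half_bw * t) / (math.pi * t)
        taps.append(window[i] * sinc * cmath.exp(2j * math.pi * centre * t))
    return taps


def partition(taps, n_blocks):
    """Split the impulse response into n_blocks consecutive blocks."""
    size = len(taps) // n_blocks
    return [taps[j * size:(j + 1) * size] for j in range(n_blocks)]


def gain_at(taps, freq, fs):
    """Magnitude response of the filter at one frequency (direct DTFT sum)."""
    return abs(sum(h * cmath.exp(-2j * math.pi * freq * i / fs)
                   for i, h in enumerate(taps)))


taps = complex_bandpass(2048, 300.0, 2700.0, 24000.0)
blocks = partition(taps, 8)                   # 8 blocks of 256 coefficients
print(len(blocks), len(blocks[0]))
print(gain_at(taps, 1500.0, 24000.0), gain_at(taps, 6000.0, 24000.0))
```

The two gain values give a quick sanity check: close to 1 in the middle of the passband and very small well outside it.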

Real-time filter process:

accumulate 128 samples
overlap 50% with previous samples
FFT of those 256 samples
copy FFT result into fftout[buffidx]
clear accum[512]
k = buffidx
repeat for (j = 0; j < 8; j++) {
complex-multiply fftout[k] with fmask[j]
accumulate result of complex-multiply in accum[512]
k = (k - 1) mod 8 (step back to the FFT of the previous input block) }
buffidx = (buffidx + 1) mod 8
inverse FFT on accum[512]
discard first half and take last 256 samples as output [overlap & save]
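The steps above can be sketched end to end as follows. This is a small pure-Python illustration, not the wdsp implementation: the class name `PartitionedFilter` and the hand-written `fft` helper are hypothetical, each partition is assumed to hold `block` coefficients zero-padded to a 2 x `block`-point FFT, and a tiny block size is used so that the result can be compared against direct convolution. On the target, an optimised FFT routine would replace the recursive helper.

```python
import cmath


def fft(a, inverse=False):
    """Recursive radix-2 FFT, unscaled; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


class PartitionedFilter:
    """Uniformly partitioned overlap-save FIR filter (illustrative sketch)."""

    def __init__(self, taps, block):
        self.block = block
        self.nfor = len(taps) // block   # assumes len(taps) is a multiple of block
        # Setup: zero-pad every partition to 2*block points and transform it once.
        self.fmask = [fft(list(taps[j * block:(j + 1) * block]) + [0j] * block)
                      for j in range(self.nfor)]
        self.fftout = [[0j] * (2 * block) for _ in range(self.nfor)]
        self.prev = [0j] * block
        self.buffidx = 0

    def process(self, samples):
        """Take `block` new samples and return `block` filtered samples."""
        n = 2 * self.block
        # 50% overlap: previous block followed by the new block.
        self.fftout[self.buffidx] = fft(self.prev + list(samples))
        self.prev = list(samples)
        accum = [0j] * n
        k = self.buffidx
        for j in range(self.nfor):
            spectrum, mask = self.fftout[k], self.fmask[j]
            for i in range(n):
                accum[i] += spectrum[i] * mask[i]
            k = (k - 1) % self.nfor      # partition j+1 needs the block before
        self.buffidx = (self.buffidx + 1) % self.nfor
        time = fft(accum, inverse=True)
        # Overlap-save: the first half is corrupted by wrap-around, drop it.
        return [v / n for v in time[self.block:]]


# Arbitrary illustrative data: 32 complex taps, 8 partitions of 4 taps.
taps = [complex(((7 * i) % 5) - 2, ((3 * i) % 4) - 1.5) for i in range(32)]
signal = [complex(((5 * i) % 11) - 5, ((2 * i) % 7) - 3) for i in range(64)]
filt = PartitionedFilter(taps, block=4)
out = []
for start in range(0, len(signal), 4):
    out.extend(filt.process(signal[start:start + 4]))
```

Each call to `process` needs only one forward FFT, one inverse FFT and `nfor` spectrum multiplications, and it can run as soon as one block of input is available.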

Benchmark figures could be:

FFT size 128
partitioned blocks nfor = 8
no. of FIR coefficients 1024 (or 1025 ?)
running at 24ksps
delay of a 128-point FFT @24ksps -> 5.33msec
--> estimated memory consumption about 35kbytes

OR:

FFT size 256
partitioned blocks nfor = 8
no. of FIR coefficients 2048 (or 2049 ?)
running at 24ksps
delay of a 256-point FFT @24ksps -> 10.67msec
--> estimated memory consumption about 70kbytes

OR:

FFT size 128
partitioned blocks nfor = 16
no. of FIR coefficients 2048 (or 2049 ?)
running at 24ksps
delay of a 128-point FFT @24ksps -> 5.33msec
--> estimated memory consumption about 70kbytes
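The delay and memory figures of the three variants can be reproduced approximately with a few lines of arithmetic. The accounting below is an assumption, not the author's calculation: it takes the listed "FFT size" as the block length, assumes a complex FFT of twice that length, 8 bytes per single-precision complex value, and counts only the filter masks, the stored input spectra and two work buffers. The function names are hypothetical.

```python
def block_delay_ms(block_size, sample_rate):
    """Time needed to collect one block of samples, in milliseconds."""
    return 1000.0 * block_size / sample_rate


def estimated_bytes(block_size, nfor, bytes_per_complex=8):
    """Rough buffer memory for a partitioned overlap-save filter (assumed layout)."""
    fft_len = 2 * block_size                  # 50% overlap doubles the FFT length
    per_spectrum = fft_len * bytes_per_complex
    masks = nfor * per_spectrum               # one filter mask per partition
    history = nfor * per_spectrum             # one stored input spectrum per partition
    work = 2 * per_spectrum                   # FFT input buffer and accumulator
    return masks + history + work


configs = [(128, 8), (256, 8), (128, 16)]     # (block size, nfor) from the list above
report = []
for block_size, nfor in configs:
    report.append((block_size, nfor, block_size * nfor,
                   block_delay_ms(block_size, 24000.0),
                   estimated_bytes(block_size, nfor)))

for block_size, nfor, n_coeffs, delay, memory in report:
    print(f"block {block_size}, nfor {nfor}: {n_coeffs} coefficients, "
          f"{delay:.2f} ms, {memory / 1024:.0f} KiB")
```

Under these assumptions the three variants come out at 36, 72 and 68 KiB, which is of the same order as the estimates of about 35 and 70 kbytes given above.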

Armelloni et al. (2003): Implementation of real-time partitioned convolution on a DSP board. - IEEE Workshop on Applications of Signal Processing to Audio and Acoustics

Kulp, B.D. (1988): Digital Equalization using Fourier Transform Techniques.

Pratt, W. (2018): Open source DSP library wdsp.
