Author Archives: Picaud Vincent

Direct convolution

By: Picaud Vincent

Re-posted from: https://pixorblog.wordpress.com/2016/07/17/direct-convolution/

For small kernels, direct convolution beats FFT based one. I present here a basic implementation. This implementation allows to compute

\gamma[k]=\sum\limits_{i\in\Omega^\alpha}\alpha[i]\beta[k+\lambda i],\text{ with }\lambda\in\mathbb{Z}^*\text{\ \ \ \ \ \ \ \ \ \ \ \ (1)}

From time to time we will use the notation \gamma=\alpha\bigodot\limits_\lambda\beta.

An arbitrary stride \lambda has been introduced to define:

Also note that with proper boundary extension (periodic and zero padding essentially), changing the sign of \lambda gives the adjoint operator:

\langle \delta, \alpha\bigodot\limits_\lambda\beta \rangle = \langle \alpha\bigodot\limits_{-\lambda}\delta,\beta \rangle

Disclaimer

Maybe the following is overwhelmingly detailed for a simple task like Eq. (1), but I have found some interests in writing this once for all. Maybe it can be useful for someone else.

Some notations

We note \Omega our vector domain (or support), for instance

\Omega^\alpha =\llbracket i_{\min} ,i_{\max} \rrbracket

means that \alpha[i] is defined for

i\in \llbracket i_{\min}, i_{\max} \rrbracket

To get interval lower/upper bounds we use the notation

i_{\min}=l(\Omega^\alpha)\text{ and }i_{\max}=u(\Omega^\alpha)

We denote by \lambda\Omega the scaled domain \Omega defined by:

\lambda\Omega=\llbracket \lambda^+\,i_{min}+\lambda^-\,i_{max},  \lambda^+\,i_{max}+\lambda^-\,i_{min} \rrbracket

where \lambda^+=\max{(0,\lambda)} and \lambda^-=\min{(0,\lambda)}

Finally we use A\setminus B the relative complement of B with respect to the set A defined by

A\setminus B = \{ i\ /\ (i\in A) \wedge (i\notin B) \}

This set is not necessary connex, however like we are working in \mathbb{Z}, it is sufficient to introduce the left and right parts (that can be empty)

(A\setminus B)_{\text{Left}}=\llbracket  l(A), \min{(u(A),l(B)-1)} \rrbracket

(A\setminus B)_{\text{Right}}=\llbracket \max{(l(A),u(B)+1)}, u(A) \rrbracket

Goal

Given two vectors \alpha, \beta defined on \Omega^\alpha, \Omega^\beta we want to define and implement an algorithm that computes \gamma[k] for k\in\Omega^\gamma.

First step, no boundary extension

We need to define the \Omega^\gamma_1 the domain that does not violate \beta domain of definition. This can be expressed as

\Omega^\gamma_1=\{k\in\mathbb{Z}\ /\ \forall i \in \Omega^\alpha \Rightarrow k+\lambda i \in \Omega^\beta \}

Let’s write the details,

(\forall i \in \Omega^\alpha  \Rightarrow k+\lambda i \in \Omega^\beta)\Leftrightarrow (\forall i \in \Omega^\alpha \Rightarrow l(\Omega^\beta)-\lambda i \le k \le u(\Omega^\beta)-\lambda i)

\Leftrightarrow \max\limits_{i\in \Omega^\alpha} l(\Omega^\beta)-\lambda i \le k \le \min\limits_{i\in \Omega^\alpha} u(\Omega^\beta)-\lambda i

\Leftrightarrow l(\Omega^\beta)-l(\lambda \Omega^\alpha) \le k \le u(\Omega^\beta)-u(\lambda \Omega^\alpha)

hence we have

\boxed{   \Omega^\gamma_1=\llbracket  l(\Omega^\beta)-l(\lambda \Omega^\alpha) , u(\Omega^\beta)-u(\lambda \Omega^\alpha) \rrbracket   }

Thus the computation of \gamma[k],\ k\in\Omega^\gamma (Eq. 1) is splitted into two parts:

  • one part \Omega^\gamma \cap \Omega^\gamma_1 free of boundary effect,
  • one part \Omega^\gamma \setminus \Omega^\gamma_1 that requires boundary extension \tilde{\beta}=\Phi(\beta,k)

The algorithm takes the following form:

latex-test.png

Second step, boundary extensions

Usually we define some classical boundary extensions. These extensions are computed from \beta[.] and are sometimes entailed by a validity condition. For a better clarity I give explicit lower/upper bounds:

\Omega^\beta = \llbracket  j_{\min} , j_{\max} \rrbracket \neq \emptyset

Left boundary (j<j_{min}) \tilde{\beta}_j = \Phi_{left}(\beta,j) validity condition
Mirror \tilde{\beta}_j  = \beta[2\,j_{min}-j] 2\,j_{min}-j_{max} \le j
Periodic (or cyclic) \tilde{\beta}_j =  \beta[j_{max}-j_{min}+j+1] 2\,j_{min}-j_{max}-1 \le j
Constant \tilde{\beta}_j = \beta[j_{min}] none
Zero padding \tilde{\beta}_j = 0 none
Right boundary (j>j_{max}) \tilde{\beta}_j = \Phi_{right}(\beta,j) validity condition
Mirror \tilde{\beta}_j  = \beta[2\,j_{max}-j] j\le 2\,j_{max}-j_{min}
Periodic (or cyclic) \tilde{\beta}_j = \beta[-j_{max}+j_{min}+j-1] j\le 2\,j_{max}-j_{min}+1
Constant \tilde{\beta}_j = \beta[j_{max}] none
Zero padding \tilde{\beta}_j = 0 none

As we want something general we want to get rid of these validity conditions.

Periodic case

Starting from a vector \beta defined on \llbracket L=0, U \rrbracket we want to define a periodic function \tilde{\beta} of period T=U+1. This function must fulfills the \tilde{\beta}[j+T]=\tilde{\beta}[j] relation.

We can do that by considering \tilde{\beta}=\beta \circ \phi^P_U(j) where

\phi^P_U(j)=\bmod_F(j,U+1)

and \bmod_F is the modulus function associated to a floored division.

For a vector defined on an arbitrary domain \llbracket j_{\min}, j_{\max} \rrbracket, we first translate the indices

\tau_{j_{\min}}(j)=j-j_{\min}

and then translate them back using \tau^{(-1)}_{j_{\min}}=\tau_{-j_{\min}}

Putting all together, we build a periodized vector

\boxed{\tilde{\beta} = \beta \circ \phi^P_{j_{\min},j_{\max}}}

where

\phi^P_{j_{\min},j_{\max}} = \tau^{(-1)}_{j_{\min}} \circ  \phi^P_{j_{\max}- j_{\min}} \circ \tau_{j_{\min}}

\boxed{\phi^P_{j_{\min},j_{\max}} = j_{\min} + \bmod_F(j-j_{\min},j_{\max}- j_{\min}+1)}

Mirror Symmetry case

Starting from a vector \beta defined on \llbracket L=0, U \rrbracket we can extend it by mirror symmetry on \llbracket U+1, 2U \rrbracket using \tilde{\beta}=\beta\circ \phi^M_U with

\phi^M_U(j)=U-|U-j|

The resulting vector \tilde{\beta}=\beta\circ \phi^M_U fulfills the \tilde{\beta}[U-j]=\tilde{\beta}[U+j] relation for j\in \llbracket 0, U \rrbracket.

To get a “global” definition we then periodize it on \llbracket 0, 2U-1 \rrbracket using \phi^P_{2U-1} (attention 2U-1 and not 2U, otherwise the component 0 is duplicated!).

For an arbitrary domain \llbracket j_{\min}, j_{\max} \rrbracket we use index translation as for the periodic case. Putting everything together we get:

\boxed{\tilde{\beta} = \beta \circ \phi^M_{j_{\min},j_{\max}}}

where

\phi^M_{j_{\min},j_{\max}} =  \tau^{(-1)}_{j_{\min}} \circ \phi^M_{j_{\max}- j_{\min}} \circ  \phi^P_{2(j_{\max}- j_{\min})-1} \circ \tau_{j_{\min}}

\boxed{ \phi^M_{j_{\min},j_{\max}} =j_{\max}-|j_{\max}-j_{\min}-\bmod_F(j-j_{\min},2(j_{\max}-j_{\min}))| }

Boundary extensions

To use the algorithm with boundary extensions, you only have to define:

\tilde{\beta}=\Phi(\beta,k+\lambda i)=\beta[\phi^X[k+\lambda i]]

where X is the boundary extension you have chosen (periodic, constant…). You do not have to take care of any validity condition, these formula are general.

Implementation

This is a straightforward implementation following as close as possible the presented formula. We did not try to optimize it, this would have obscured the presentation. Some ideas: reverse \alpha for \lambda<0 (access memory in the right order), use simd, or C++ meta-programming with loop unrolling for fixed \alpha size, specialize regarding to Vector/StridedVector or \lambda=\pm 1…

Preamble

Index translation / domain definition

There is however one last thing we have to explain. In languages like Julia, C… we are manipulating arrays having a common starting index: 1 in Julia, Fortran… or 0 in C, C++…

For this reason we do not manipulate \alpha on \Omega^\alpha but an another translated array \tilde{\alpha} defined on \llbracket 1, N^\alpha \rrbracket (Julia) or \llbracket 0, N^\alpha-1 \rrbracket (C++).

To cover all cases, I assume that the starting index is denoted by \tilde{i}_0.

The array \tilde{\alpha} is defined by:

\alpha[i] =  \tilde{\alpha}[\tilde{i}] = \tilde{\alpha}[i-l(\Omega^\alpha)+\tilde{i}_0]

Hence we must modify the initiale Eq. (1) to use \tilde{\alpha} instead of \alpha

\gamma[k]=\sum\limits_{i\in\Omega^\alpha}\alpha[i]\beta[k+\lambda i] = \sum\limits_{i\in\Omega^\alpha}\tilde{\alpha}[i-l(\Omega^\alpha)+\tilde{i}_0]\beta[k+\lambda i]

With \tilde{i}=i-l(\Omega^\alpha)+\tilde{i}_0 we have

i\in\Omega^\alpha \Leftrightarrow \tilde{i}\in\llbracket \tilde{i}_0,u(\Omega^\alpha)-l(\Omega^\alpha)+\tilde{i}_0 \rrbracket

and

k+\lambda i = k+ \lambda \tilde{i} + \underbrace{\lambda (l(\Omega^\alpha) - \tilde{i}_0)}_{\beta\_\text{offset}}

Thus, Eq (1) becomes:

\boxed{ \gamma[k]=\sum\limits_{\tilde{i}=\tilde{i}_0}^{u(\Omega^\alpha)-l(\Omega^\alpha)+\tilde{i}_0}\tilde{\alpha}[\tilde{i}]\beta[k+ \lambda \tilde{i} + \lambda (l(\Omega^\alpha) - \tilde{i}_0)]}

The 2 other arrays are less problematic:

  • For \beta array, which is our input array, we implicitly use \Omega^\beta = \llbracket \tilde{i}_0, \tilde{i}_0 + \text{length}(\beta) - 1 \rrbracket. This does not reduce the generality of the subroutine.
  • For \gamma which is the output array, as for \beta we assume it is defined on \llbracket \tilde{i}_0, \tilde{i}_0 +    \text{length}(\gamma) - 1 \rrbracket, but we provide \Omega^\gamma\subset \llbracket \tilde{i}_0, \tilde{i}_0 +    \text{length}(\gamma) - 1 \rrbracket to define the components we want to compute. The other components, \llbracket \tilde{i}_0, \tilde{i}_0 +    \text{length}(\gamma) - 1 \rrbracket \setminus \Omega^\gamma, will remain unmodified by the subroutine.

Definition of \alpha\_\text{offset}

As we have seen before, the convolution subroutine will have \tilde{\alpha} as argument, but we also need \Omega^\alpha. For the driver subroutine we do not directly provide this interval because its length is redundant with \tilde{\alpha} length. Instead we provide an \alpha\_\text{offset} offset. \Omega^\alpha is deduced from:

\Omega^\alpha = \llbracket -\alpha\_\text{offset}, -\alpha\_\text{offset} + \text{length}(\tilde{\alpha}) -1 \rrbracket

Note: this definition does not depend on \tilde{i}_0.

With \alpha\_\text{offset}=0 you are in the “usual situation”. If you have a window size of 2n+1, taking \alpha\_\text{offset}=n returns the middle of the window. Here, in the Fig. below, the graphical representation of an arbitrary case: a filter if size 4, with \alpha\_\text{offset}=2 and \lambda=3.

a_offset.png

Julia

Auxiliary subroutines

We start by defining the basic operations on sets:

function scale::Int64::UnitRange)
    ifelse(λ>0,
           UnitRange(λ*start(Ω),λ*last(Ω)),
           UnitRange(λ*last(Ω),λ*start(Ω)))
end

function compute_Ωγ1(Ωα::UnitRange,
                     λ::Int64,
                     Ωβ::UnitRange)
    
    λΩα = scale(λ,Ωα)

    UnitRange(start(Ωβ)-start(λΩα),
              last(Ωβ)-last(λΩα))
end

# Left & Right relative complements A\B
#
function relelativeComplement_left(A::UnitRange,
                                   B::UnitRange)
    UnitRange(start(A),
              min(last(A),start(B)-1))
end

function relelativeComplement_right(A::UnitRange,
                                    B::UnitRange)
    UnitRange(max(start(A),last(B)+1),
              last(A))
end

Boundary extensions

We then define the boundary extensions. Nothing special there, we only had to check that the Julia mod(x,y) function is the floored division version (by opposition to the rem(x,y) function which is the rounded toward zero division version).

const tilde_i0 = Int64(1)

function boundaryExtension_zeroPadding{T}(β::StridedVector{T},
                                          k::Int64)
    kmin = tilde_i0
    kmax = length(β) + kmin - 1
    
    if (k>=kmin)&&(k<=kmax)
        β[k]
    else
        T(0)
    end
end

function boundaryExtension_constant{T}(β::StridedVector{T},
                                       k::Int64)
    kmin = tilde_i0
    kmax = length(β) + kmin - 1

    if k<kmin
        β[kmin]
    elseif k<=kmax
        β[k]
    else
        β[kmax]
    end
end

function boundaryExtension_periodic{T}(β::StridedVector{T},
                                       k::Int64)
    kmin = tilde_i0
    kmax = length(β) + kmin - 1

    β[kmin+mod(k-kmin,1+kmax-kmin)]
end

function boundaryExtension_mirror{T}(β::StridedVector{T},
                                     k::Int64)
    kmin = tilde_i0
    kmax = length(β) + kmin - 1

    β[kmax-abs(kmax-kmin-mod(k-kmin,2*(kmax-kmin)))]
end

# For the user interface
#
boundaryExtension = 
    Dict(:ZeroPadding=>boundaryExtension_zeroPadding,
         :Constant=>boundaryExtension_constant,
         :Periodic=>boundaryExtension_periodic,
         :Mirror=>boundaryExtension_mirror)

Main subroutine

Finally we define the main subroutine. Its arguments have been defined in the preamble part. I just added one @simd & @inbounds because this has a significant impact concerning perfomance (see end of this post).

function direct_conv!{T}(tilde_α::StridedVector{T},
                         Ωα::UnitRange,
                         λ::Int64,
                         β::StridedVector{T},
                         γ::StridedVector{T},
                         Ωγ::UnitRange,
                         LeftBoundary::Symbol,
                         RightBoundary::Symbol)
    # Sanity check
    @assert λ!=0
    @assert length(tilde_α)==length(Ωα)
    @assert (start(Ωγ)>=1)&&(last(Ωγ)<=length(γ))

    # Initialization
    Ωβ = UnitRange(1,length(β))
    tilde_Ωα = 1:length(Ωα)
    
    for k in Ωγ
        γ[k]=0 
    end

    rΩγ1=intersect(Ωγ,compute_Ωγ1(Ωα,λ,Ωβ))
    
    # rΩγ1 part: no boundary effect
    #
    β_offset = λ*(start(Ωα)-tilde_i0)
    @simd for k in rΩγ1
        for i in tilde_Ωα
            @inbounds γ[k]+=tilde_α[i]*β[k+λ*i+β_offset]
        end
    end

    # Left part
    #
    rΩγ1_left = relelativeComplement_left(Ωγ,rΩγ1)
    Φ_left = boundaryExtension[LeftBoundary]
    
    for k in rΩγ1_left
        for i in tilde_Ωα
            γ[k]+=tilde_α[i]*Φ_left(β,k+λ*i+β_offset)
        end
    end

    # Right part
    #
    rΩγ1_right = relelativeComplement_right(Ωγ,rΩγ1)
    Φ_right = boundaryExtension[RightBoundary]
    
    for k in rΩγ1_right
        for i in tilde_Ωα
            γ[k]+=tilde_α[i]*Φ_right(β,k+λ*i+β_offset)
        end
    end
end

# Some UI functions, γ inplace modification 
#
function direct_conv!{T}(tilde_α::StridedVector{T},
                         α_offset::Int64,
                         λ::Int64,

                         β::StridedVector{T},

                         γ::StridedVector{T},
                         Ωγ::UnitRange,
                         
                         LeftBoundary::Symbol,
                         RightBoundary::Symbol)

    Ωα = UnitRange(-α_offset,
                   length(tilde_α)-α_offset-1)
    
    direct_conv!(tilde_α,
                 Ωα,
                 λ,
                 
                 β,

                 γ,
                 Ωγ,

                 LeftBoundary,
                 RightBoundary)
end

# Some UI functions, allocates γ 
#
function direct_conv{T}(tilde_α::StridedVector{T},
                        α_offset::Int64,
                        λ::Int64,

                        β::StridedVector{T},

                        LeftBoundary::Symbol,
                        RightBoundary::Symbol)

    γ = Array{T,1}(length(β))
    
    direct_conv!(tilde_α,
                 α_offset,
                 λ,

                 β,

                 γ,
                 UnitRange(1,length(γ)),

                 LeftBoundary,
                 RightBoundary)

    γ
end

In C/C++

As this post is already long I will not provide a complete code here. The only trap is to use the right mod function.

C/C++ modulus operator % is not standardized. Only the D%d=D-d*(D/d) relation is invariant allowing to define the Euclidean division. On the other side a lot of CPU x86 idiv…, truncate toward zero, as a consequence C/C++ generally uses this direction.

To be sure, we have to explicitly use our F-mod function:

// Floored mod
int modF(int D, int d)
{
    int r = std::fmod(D,d);
    if((r > 0 && d < 0) || (r < 0 && d > 0)) r = r + d;
    return r;
}

You can read:

Usages examples

Basic usages

Beware that due to the asymmetric role of \alpha and \beta the proposed approach does preserve all the mathematical properties of the \alpha\bigodot\limits_\lambda\beta operator.

  • Commutativity:

\alpha\bigodot\limits_{\lambda=-1}\beta=\beta\bigodot\limits_{\lambda=-1}\alpha

only for ZeroPadding

  • Adjoint operator:

\forall \lambda\in\mathbb{Z}^*,\ \langle \alpha\bigodot\limits_{\lambda}v ,w \rangle_E = \langle v , \alpha\bigodot\limits_{-\lambda} w \rangle_F

only for ZeroPadding and Periodic

  • I have assumed \mathbb{R} arrays (not \mathbb{C} ones): some conjugation are missing
  • Not considered here, but extension to n-dimensional & separable filters is immediate
push!(LOAD_PATH,"./")
using DirectConv

α=rand(4);
β=rand(10);

# Check adjoint operator
# -> restricted to ZeroPadding & Periodic
#    (asymmetric role of α and β)
#    
vβ=rand(length(β))
d1=dot(direct_conv(α,2,-3,vβ,:ZeroPadding,:ZeroPadding),β)
d2=dot(direct_conv(α,2,+3,β,:ZeroPadding,:ZeroPadding),vβ)

@assert abs(d1-d2)<sqrt(eps())

d1=dot(direct_conv(α,-1,-3,vβ,:Periodic,:Periodic),β)
d2=dot(direct_conv(α,-1,+3,β,:Periodic,:Periodic),vβ)

@assert abs(d1-d2)<sqrt(eps())

# Check commutativity 
# -> λ = -1 (convolution) and
#    restricted to ZeroPadding
#    (asymmetric role of α and β)
v1=zeros(20)
v2=zeros(20)
direct_conv!(α,0,-1,
             β,v1,UnitRange(1,20),:ZeroPadding,:ZeroPadding)
direct_conv!(β,0,-1,
             α,v2,UnitRange(1,20),:ZeroPadding,:ZeroPadding)

@assert (norm(v1-v2)<sqrt(eps()))

# Check Interval splitting
# (should work for any boundary extension type)
#
γ=direct_conv(α,3,2,β,:Mirror,:Periodic) # global computation
Γ=zeros(length(γ))
Ω1=UnitRange(1:3)
Ω2=UnitRange(4:length(γ))
direct_conv!(α,3,2,β,Γ,Ω1,:Mirror,:Periodic) # compute on Ω1
direct_conv!(α,3,2,β,Γ,Ω2,:Mirror,:Periodic) # compute on Ω2

@assert (norm(γ-Γ)<sqrt(eps()))

Performance?

In a previous post I gave a short derivation of the Savitzky-Golay filters. I used a FFT based convolution to apply the filters. It is interesting to compare the performance of the presented direct approach vs the FFT one.

push!(LOAD_PATH,"./")
using DirectConv

function apply_filter{T}(filter::StridedVector{T},signal::StridedVector{T})

    @assert isodd(length(filter))

    halfWindow = round(Int,(length(filter)-1)/2)
    
    padded_signal = 
        [signal[1]*ones(halfWindow);
         signal;
         signal[end]*ones(halfWindow)]

    filter_cross_signal = conv(filter[end:-1:1],
                               padded_signal)

    filter_cross_signal[2*halfWindow+1:end-2*halfWindow]
end

# Now we can create a (very) rough benchmark
M=Array(Float64,0,3)
β=rand(1000000);
for halfWidth in 1:2:40
    α=rand(2*halfWidth+1);

    fft_t0 = time()
    fft_v = apply_filter(α,β)
    fft_t1 = time()

    direct_t0 = time()
    direct_v = direct_conv(α,halfWidth,1,β,
                           :Constant,:Constant)
    direct_t1 = time()

    @assert (norm(fft_v -direct_v)<sqrt(eps()))

    M=vcat(M,
           Float64[length(α)
                   (fft_t1-fft_t0)*1e3
                   (direct_t1-direct_t0)*1e3])
end
M

cpu_time.png

ratio_cpu_time.png

We see that for small filters direct method can easily be 10 time faster than the FFT approach!

Conclusion: for small filters, use a direct approach!

Discussion

Optimization/performance

If I have time I will try to benchmark two basic implementations, a Julia one vs a C/C++ one. I’m a beginner in Julia language, with C++, I’m more at home.

I would be curious to see the difference between a basic implementation and an optimized one in Julia. Just to see how optimization can obfuscate (or not) the initial code and the performance gain. In C++ you generally have a lot of boiler-plate code (meta-programming…).

Applications

The basic Eq. (1) is common tool that can be used for:

  • deconvolution procedures,
  • decimated and undecimated wavelet transforms,
  • …

For wavelet transform especially the undecimated one, AFAIK Eq. (1) is really the good choice. I will certainly write some posts on these stuff.

Some extra reading:

Code

The code is on github.

Complement: more domains

The \Omega^\gamma_2 domain

We have introduced \Omega^\gamma_1 the domain that does not violate \beta domain of definition (given \Omega^\alpha and \Omega^\beta).

To be exhaustive we can introduce \Omega^\gamma_2 the domain that use at least one (i,k+\lambda i)\in \Omega^\alpha \times \Omega^\beta.

This domain is:

\Omega^\gamma_2=\{ k\in\mathbb{Z}\ /\ \exists i \in \Omega^\alpha \Rightarrow k+\lambda i \in \Omega^\beta \}

following arguments similar to those used for \Omega^\gamma_1 we get:

\boxed{  \Omega^\gamma_2=\llbracket  l(\Omega^\beta)-u(\lambda \Omega^\alpha) , u(\Omega^\beta)-l(\lambda \Omega^\alpha) \rrbracket  }

The \Omega^\beta_{2'} domain

We can also ask for the “dual” question: given \Omega^\alpha and \Omega^\gamma what is the domain of \beta, \Omega^\beta_{2'}, involved in the computation of \gamma

By definition, this domain must fulfill the following relation:

\Omega^\gamma_2(\Omega^\beta_{2'})=\Omega^\gamma

hence, using the previous result

\llbracket  l(\Omega^\beta_{2'})-u(\lambda \Omega^\alpha) , u(\Omega^\beta_{2'})-l(\lambda \Omega^\alpha) \rrbracket = \llbracket l(\Omega^\gamma),u(\Omega^\gamma) \rrbracket

which gives:

\boxed{ \Omega^\beta_{2'} = \llbracket l(\Omega^\gamma)+u(\lambda \Omega^\alpha),u(\Omega^\gamma)+l(\lambda \Omega^\alpha) \rrbracket }

Savitzky-Golay filters & Julia

By: Picaud Vincent

Re-posted from: https://pixorblog.wordpress.com/2016/07/13/savitzky-golay-filters-julia/

I always have found that presentations of the Savitzky-Golay filters were over tricky. I present here a simple derivation of these formula and a possible implementation in Julia.

Derivation of the formula

Given a polynomial of degree d with values

p(x_i)=\sum\limits_{j=0}^d c_j x_i^j

defined for a window of size 2n+1, \{x_i,\ i\in\llbracket -n,n \rrbracket \}

We want to find the value of its k-order derivative p^{(k)}(x_0) in the middle of the window assuming that the c_j are founded solving a least-squares problem:

\min\limits_{\mathbf{c}} \frac{1}{2} \| \mathbf{V} \mathbf{c} - \mathbf{y} \|_2^2

where \{y_i, i \in\llbracket -n,n \rrbracket \} is our signal values and \mathbf{V} is the Vandermonde matrix:

\mathbf{V}=   \left(     \begin{array}{c|c|c}       \vdots & \vdots & \vdots \\       1 & x_i^{(j-1)} & x_i^d \\       \vdots & \vdots & \vdots     \end{array}   \right)

using the normal equation

\mathbf{c}=(\mathbf{V}^t.\mathbf{V})^{-1}.\mathbf{V}^t.\mathbf{y}

and a QR decomposition, \mathbf{V}=\mathbf{Q}.\mathbf{R} we get

\mathbf{c}=\mathbf{R}^{-1}.\mathbf{Q}^t.\mathbf{y}

now we can express all the polynomial values p(x_i) in a vector \mathbf{p}=\mathbf{V}.\mathbf{c}. Lets rewrite this in matrix form:

\underbrace{\left(     \begin{array}{c}       p(x_{-n}) \\      \vdots \\         p(x_{0}) \\       \vdots \\       p(x_{+n})     \end{array}   \right)}\limits_{\mathbf{p}}=\underbrace{   \left(     \begin{array}{c|c|c}       \vdots & \vdots & \vdots \\       1 & x_i^{(j-1)} & x_i^d \\       \vdots & \vdots & \vdots     \end{array}   \right)}\limits_{\mathbf{V}}.\underbrace{\left(     \begin{array}{c}       c_0 \\      \vdots \\       c_n     \end{array}   \right)}\limits_{\mathbf{c}}

Now the “big trick” is to write the Taylor series and to remember that this formula is exact for polynomial functions:

\forall i,\ P(x_i) = \sum\limits_{j=0}^d \frac{x_i^j}{j!} P^{(j)}(x_0)

Lets rewrite this in matrix form:

\underbrace{     \left(       \begin{array}{c}         p(x_{-n}) \\         \vdots \\         p(x_{0}) \\         \vdots \\         p(x_{n}) \\       \end{array}     \right)   }_{\mathbf{p}} =   \underbrace{     \left(       \begin{array}{c|c|c}         \vdots & \vdots & \vdots \\         1 & \frac{x_i^{(j-1)}}{(j-1)!} &  \frac{x_i^{d}}{d!} \\         \vdots & \vdots & \vdots       \end{array}     \right)     }_{\mathbf{T}}  \underbrace{    \left(      \begin{array}{c}        P^{(0)}(x_0) \\        \vdots \\        P^{(k)}(x_0) \\        \vdots \\        P^{(d)}(x_0) \\      \end{array}    \right)  }_{\mathbf{p^\delta}}

With a good eye we see that \mathbf{V}=\mathbf{T}.\mathbf{D} where \mathbf{D} is a diagonal matrix:

\underbrace{   \left(     \begin{array}{c|c|c}       \vdots & \vdots & \vdots \\       1 & x_i^{(j-1)} & x_i^d \\       \vdots & \vdots & \vdots     \end{array}   \right)}\limits_{\mathbf{V}} = \underbrace{     \left(       \begin{array}{c|c|c}         \vdots & \vdots & \vdots \\         1 & \frac{x_i^{(j-1)}}{(j-1)!} &  \frac{x_i^{d}}{d!} \\         \vdots & \vdots & \vdots       \end{array}     \right)     }_{\mathbf{T}}.\underbrace{\left(     \begin{array}{ccc}       1 & & \\       & (j-1)! & \\       & & d!     \end{array}   \right)}\limits_{\mathbf{D}}

That’s all, we only have to group pieces:

\mathbf{V}.\mathbf{c}=\mathbf{P}=\mathbf{T}.\mathbf{p^\delta}=\mathbf{V}.\mathbf{D}^{-1}.\mathbf{p^\delta}

With the QR decomposition \mathbf{V}=\mathbf{Q}.\mathbf{R} and \mathbf{c}=\mathbf{R}^{-1}.\mathbf{Q}^t.\mathbf{y} we get:

\mathbf{Q}.\mathbf{Q}^t.\mathbf{y}=\mathbf{Q}.\mathbf{R}.\mathbf{D}^{-1}.\mathbf{p^\delta}

using the fact that \mathbf{Q}^t.\mathbf{Q}=\mathbf{I} we get:

\mathbf{Q}^t.\mathbf{y}=\mathbf{R}.\mathbf{D}^{-1}.\mathbf{p^\delta}

hence we have:

\boxed{ \mathbf{p^\delta} = \mathbf{D}.\mathbf{R}^{-1}.\mathbf{Q}^t.\mathbf{y} }

which is our final formula.

Symbolic computation to check that it works

We can use mathematica to do a symbolic computation using \mathbf{p^\delta} = \mathbf{D}.\mathbf{R}^{-1}.\mathbf{Q}^t.\mathbf{y}.

For a window width of 2n+1=7 points and a polynomial of degree d=2 we get:

n = 3; d = 2;
V = Table[If[j != 0, i^j, 1], {i, -n, n}, {j, 0, d}];
{Qt, R} = QRDecomposition[V];
DD = DiagonalMatrix[Table[Factorial[i], {i, 0, d}]];
DD.Inverse[R].Qt // TeXForm

\left( \begin{array}{ccccccc}  -\frac{2}{21} & \frac{1}{7} & \frac{2}{7} & \frac{1}{3} & \frac{2}{7} & \frac{1}{7} & -\frac{2}{21} \\  -\frac{3}{28} & -\frac{1}{14} & -\frac{1}{28} & 0 & \frac{1}{28} & \frac{1}{14} & \frac{3}{28} \\  \frac{5}{42} & 0 & -\frac{1}{14} & -\frac{2}{21} & -\frac{1}{14} & 0 & \frac{5}{42} \end{array} \right)

The first row is the smoothing P(0) filter, the second one the smoothed first order derivative P'(0) filter and the last one the smoothed second order derivative P''(0) filter.

A Julia implementation

Here I present a direct implementation in julia.

We first initialize a Vandermonde matrix:

function vandermonde(halfWindow::Int, polyDeg::Int,T::Type=Float64)
    
    @assert halfWindow>=0
    @assert polyDeg>=0
    
    x=T[i for i in -halfWindow:halfWindow]

    n = polyDeg+1
    m = length(x)
    
    V = Array{T}(m, n)
    
    for i = 1:m
        V[i,1] = T(1)
    end
    for j = 2:n
        for i = 1:m
            V[i,j] = x[i] * V[i,j-1]
        end
    end

    return V
end

We then compute the filter coefficients with the presented formula.

Attention we return the transposed matrix because in Julia it is more convenient to use SG[:,1] which is a julia vector. SG[1,:] would be a (1,n) julia matrix.

function SG(halfWindow::Int, polyDeg::Int,T::Type=Float64)

    @assert 2*halfWindow>polyDeg
    
    V=vandermonde(halfWindow,polyDeg,T)
    Q,R=qr(V)
    SG=R\Q'

    for i in 1:size(SG,1)
        SG[i,:]*=factorial(i-1)
    end
    
# CAVEAT: returns the transposed matrix

    return SG'
end

The final step to use the filter is to provide function to do the cross-correlation.

I do not want to talk too much about this subroutine because in a future post I will show how to efficiently compute such kind of convolution. Here we use a FFT, but with a short filter it is much more efficient (around 10 times faster) to use a direct computation. I will show how to implement

\gamma[k]=\sum\limits_i\alpha[i]\beta[k+\lambda i],\ with\ \lambda\in\mathbb{Z}^*

which can be used to compute discrete and stationary wavelet transform for instance.

One last thing, here we use constant padding to manage the boundaries.

function apply_filter{T}(filter::StridedVector{T},signal::StridedVector{T})

    @assert isodd(length(filter))

    halfWindow = round(Int,(length(filter)-1)/2)
    
    padded_signal = 
        [signal[1]*ones(halfWindow);
         signal;
         signal[end]*ones(halfWindow)]

    filter_cross_signal = conv(filter[end:-1:1], padded_signal)

    return filter_cross_signal[2*halfWindow+1:end-2*halfWindow]
end

Finally I have included a small usage example. To see well the effect of Savitzky-Golay filters, I have over smoothed with a large window width 2.n+1, n=20

using Winston

s=readdlm("signal.txt")[:,1]

sg=SG(20,3) # halt-window, polynomal degree

#________________

smoothed=apply_filter(sg[:,1],s)

plot(s,"r")
oplot(smoothed)
title("Smoothed")
savefig("smoothed.png")

#________________

smoothed_d1=apply_filter(sg[:,2],s)

plot(smoothed_d1)
title("Smoothed derivative")
savefig("smoothed_d1.png")

#________________

smoothed_d2=apply_filter(sg[:,3],s)

plot(smoothed_d2)
title("Smoothed 2-derivative")
savefig("smoothed_d2.png")

Here is the resulting plots:

smoothed.png smoothed_d1.png smoothed_d2.png

Final word

To reproduce these figures you can find the complete code on github.

Some remarks about min()/max() functions

By: Picaud Vincent

Re-posted from: https://pixorblog.wordpress.com/2016/06/27/some-remarks-about-minmax-functions/

Computing the min (or max) of two values looks like a trivial task.

For C++ the default Standard Template Library implementation is

inline const _Tp& min(const _Tp& __a, const _Tp& __b)
{
    return __b < __a ? __b : __a;
}

However for numerical computations this definition does not mix well with the IEEE 754 standard when dealing with potential NaN value.

For instance, min() is not commutative and is not equivatent to cmath::fmin().

We have:

#include <iostream>
#include <cmath>
#include <limits>
#include <cassert>

using namespace std;

int main()
{
    const auto x_nan = numeric_limits<double>::quiet_NaN();
    const auto x_1 = 1.;

    cout << boolalpha;

    cout << "\n\ncommutativity?";

    cout << "\n min: "
     << (min(x_1, x_nan) == min(x_nan, x_1));

    cout << "\nfmin: "
     << (fmin(x_1, x_nan) == fmin(x_nan, x_1));
}
commutativity?
 min: false
fmin: true

The IEEE 754 says:

In section 6.2 of the revised IEEE 754-2008 standard there are two anomalous functions (the maxnum and minnum functions that return the maximum of two operands that are expected to be numbers) that favor numbers — if just one of the operands is a NaN then the value of the other operand is returned.

On the opposite you can read about C++ rules here:

NaN values have the curious property that they compare as “unordered” with all values, even with themselves. In other words, if x is a NaN, and y is any floating-point value, NaN or not, then x<y, x>y, x<=y, x>=y, and x==y are all false, and x!=y is true.

Remark In C/C++ that is the reason why you can implement the isnan function by

bool isnan(const double x) { return x!=x;}

Using min/max in numerical code

C++

Like the IEEE 754 and C++ standard do not mix well, I personally redefine min()/max() functions as follow:

#include <type_traits>
#include <cmath>

namespace libraryNamespace
{
    template <typename T>
    inline enable_if_t<is_floating_point<T>::value, T>
      min(const T& t1, const T& t2)
    {
        return fmin(t1, t2);
    }

    template <typename T>
    constexpr enable_if_t<is_integral<T>::value, const T&>
      min(const T& t1, const T& t2)
    {
        return (t1 < t2) ? t1 : t2;
    }
    // max() function follows the same scheme
}

Julia

Julia follows the scheme I use in C++, you have a specialization for floating point numbers julia/base/math.jl

min{T<:AbstractFloat}(x::T, y::T) =
    ifelse((y < x) | (signbit(y) > signbit(x)),
           ifelse(isnan(y), x, y), ifelse(isnan(x), y, x))

and a generic operator julia/base/operator.jl

min(x,y) = ifelse(y < x, y, x)

We can check that the Float specialization follows the IEEE 754 rule:

(max(5,NaN)==max(NaN,5))&&(max(5,NaN)==5)
true

What is the cost?

Julia

There is a big difference between the simple definition in julia/base/operator.jl and the NaN aware code of julia/base/math.jl.

We can have a look at the assembly code.

I would like to thank Erik Schnetter who made me noticed that Julia 4.6 @code_native was bugged (see his comment at the end of this post). Hence to get the same result you must have a recent Julia version. In this post I use:

versioninfo()
Julia Version 0.5.0-dev+5453
Commit 1fd440e (2016-07-15 23:33 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i3 CPU       M 380  @ 2.53GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Nehalem)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, westmere)

With this Julia version you get the following asm codes:

@code_native(min(1,2))

gives

	.text
Filename: promotion.jl
	pushq	%rbp
	movq	%rsp, %rbp
Source line: 257
	cmpq	%rdi, %rsi
	cmovgeq	%rdi, %rsi
	movq	%rsi, %rax
	popq	%rbp
	retq

whereas

@code_native min(1.0,2.0)

gives

	.text
Filename: math.jl
	pushq	%rbp
	movq	%rsp, %rbp
Source line: 203
	ucomisd	%xmm1, %xmm0
	seta	%al
	movmskpd	%xmm1, %ecx
	movmskpd	%xmm0, %edx
	andl	$1, %edx
	xorb	$1, %dl
	andb	%cl, %dl
	orb	%al, %dl
	je	L54
	movapd	%xmm1, %xmm2
	cmpordsd	%xmm2, %xmm2
	andpd	%xmm2, %xmm1
	andnpd	%xmm0, %xmm2
	orpd	%xmm1, %xmm2
	jmp	L75
L54:
	movapd	%xmm0, %xmm2
	cmpordsd	%xmm2, %xmm2
	andpd	%xmm2, %xmm0
	andnpd	%xmm1, %xmm2
	orpd	%xmm0, %xmm2
L75:
	movapd	%xmm2, %xmm0
	popq	%rbp
	retq
	nopw	%cs:(%rax,%rax)

C++

We have the same in C++

#include <cmath>
#include <iostream>

int main()
{
  double x, y;
  std::cin >> x >> y;
  asm("#ASM FOR FMIN");
  double fmin_x_y = std::fmin(x, y);
  asm("#ASM FOR FMIN - END");
  std::cout << "\n" << fmin_x_y;

  asm("#ASM FOR MIN");
  double min_x_y = std::min(x, y);
  asm("#ASM FOR MIN - END");
  std::cout << "\n" << min_x_y;
  return 0;
}

compiled with

g++ -std=c++11 -O3 -S min.cpp -o min.asm

gives for min()

#ASM FOR MIN
	movsd	24(%rsp), %xmm0
	minsd	16(%rsp), %xmm0
	movsd	%xmm0, 8(%rsp)
#ASM FOR MIN - END

and for fmin()

#ASM FOR FMIN
	movsd	24(%rsp), %xmm1
	movsd	16(%rsp), %xmm0
	call	fmin
	movsd	%xmm0, 8(%rsp)
#ASM FOR FMIN - END

CPU time?

Illustration with the Jaccard distance defined by

d_J(x,y)=1-\frac{\sum\limits_i \min(x_i,y_i)}{\sum\limits_i \max(x_i,y_i)},\ \text{where}\ (x,y)\in\mathbb{R}_+^n\times\mathbb{R}_+^n

We compute this distance using 4 different approaches:

  • Julia min()/max() NaN aware
  • Julia min()/max() comparison
  • C fmin()/fmax() NaN aware
  • C min()/max() comparison

The Julia code is given below

function jaccard_julia_NaN_aware(a::Array{Float64,1},
                                 b::Array{Float64,1})

    @assert length(a)==length(b) 

    num::Float64 = 0
    den::Float64 = 0

    for i in 1:length(a)

        @inbounds num += min(a[i],b[i])
        @inbounds den += max(a[i],b[i])

    end
    return 1. - num/den
end

function jaccard_julia_comparison(a::Array{Float64,1},
                                  b::Array{Float64,1})

    @assert length(a)==length(b) 

    num::Float64 = 0
    den::Float64 = 0

    for i in 1:length(a)

        @inbounds num += ifelse(a[i]<b[i],a[i],b[i])
        @inbounds den += ifelse(a[i]>b[i],a[i],b[i])

    end
    return 1. - num/den
end

function jaccard_C_NaN_aware(a::Array{Float64,1},
                             b::Array{Float64,1})

    @assert length(a)==length(b) 

    return ccall((:jaccard_C_NaN_aware,
                  "./libjaccard.so"),
                 Float64,
                 (Int64,Ptr{Float64},Ptr{Float64}),
                 length(a),a,b)

end

function jaccard_C_comparison(a::Array{Float64,1},
                              b::Array{Float64,1})

    @assert length(a)==length(b) 

    return ccall((:jaccard_C_comparison,
                  "./libjaccard.so"),
                 Float64,
                 (Int64,Ptr{Float64},Ptr{Float64}),
                 length(a),a,b)

end

function test_distance(f,
                       v1::Array{Float64,1},
                       v2::Array{Float64,1})
    sum=0
    for i in 1:5000
        sum+=f(v1,v2)
    end
    sum
end

v1=rand(10000);
v2=rand(10000);

for name in (:jaccard_julia_NaN_aware,
             :jaccard_C_NaN_aware,
             :jaccard_julia_comparison,
             :jaccard_C_comparison)
    print("$name")
    @time @eval test_distance($name,v1,v2)
end

The associated C subroutines are:

#include <math.h>
#include <stdint.h>

double jaccard_C_NaN_aware(uint64_t size,
               double *a, double *b)
{
  double num = 0, den = 0;

  for(uint64_t i = 0; i < size; ++i)
    {
      num += fmin(a[i],b[i]);
      den += fmax(a[i],b[i]);
    }
  return 1. - num / den;
}

double jaccard_C_comparison(uint64_t size,
                double *a, double *b)
{
  double num = 0, den = 0;

  for(uint64_t i = 0; i < size; ++i)
    {
      num += (a[i] < b[i] ? a[i] : b[i]);
      den += (a[i] > b[i] ? a[i] : b[i]);
    }
  return 1. - num / den;
}

and the Makefile is

all: bench

bench: libjaccard.so
    @julia --optimize --check-bounds=no jaccard.jl

libjaccard.so: jaccard.c
    @gcc -O2 -shared -fPIC jaccard.c -o libjaccard.so

The results I get on my computer are:

make

output.png

We see that:

  • for the NaN aware case C and Julia running time is equivalent, great!
  • for the Comparison case C seems to be a little bit faster, but the gap is very small and more precise benchmark would be necessary to quantify it.
  • min()/max() using simple comparisons is more than 3.5 times faster than an implementation taking into account possible NaN values.

Julia 4.6 previous results

In the first version of this post, using julia version 4.6, I got this result:

output_4_6.png

There was a clear advantage for the C code, but it seems this is not the case anymore with julia version 0.5.

You can reproduce the results of this post, using the code available on GitHub.

A source of bugs?

You have to take care that a simple statement like

x\le\min(x,y)

is mathematically true but false in your code. It is even false for both the IEEE 754 and the comparison based versions of the min() function.

In the future I will write a post on Automatic Differentiation. To be brief automatic differentiation is a tool that allows you to efficiently compute gradient with nearly no modification of your original code. As example my personal implementation takes the form:

// Declare the current tape
//
AD_Tape<double> tape;

// Computes gradient of f(x,y,z)=x.z.sin(x.y)
//                   at (x,y,z)=(2,1,5)
//
// Note:
//
// grad(f)={ x * y * z * Cos(x * y) + z * Sin(x * y),
//           x^2 * z *
//           Cos(x * y), x * Sin(x * y) }
//
AD_Scalar<double> x, y, z, f;

x = 2;
y = 1;
z = 5;

f = x * z * sin(x * y);

const auto grad = ad_gradient(f);

std::cout << "\ngrad(f) = {"
          << grad[x] << "," 
          << grad[y] << ","
          << grad[z] << "}";

// The screen output is
//
// grad(f) = {0.385019,-8.32294,1.81859}

One classical approach uses operator overloading. For each basic operation you compute the function value and its differential.

One important and desirable property of AutoDiff library is that its use does not modify your program result.

Unfortunately a lot of AutoDiff libraries are bugged when they define the min()/max() functions.

It is really “natural” to define min() overloading by something like:

function min(x,y)
    ifelse(x<y,
           return (x,dx),
           return (y,dy))

But this implementation is buggy: if y is NaN we now know that x<y is always false, hence:

  • the original code will return x
  • the AutoDiff code will return (y,dy) (a priori =(NaN,NaN))

To stay consistent the correct implementation is something like:

function min(x,y)
    ifelse(x==min(x,y),
           return (x,dx),
           return (y,dy))

which provides the right result (id consistent with the original code) whatever x or y is NaN or not

We can check that, for instance, with Julia ForwardDiff.jl

(githash: 045a828)

using ForwardDiff
f(x::Vector)=max(2*x[1],x[2]);
x=[1.,NaN]
print("Original value: $(f(x))")
print("\nAutoDiff gradient: $(ForwardDiff.gradient(f, x))")
Original value: 2.0
AutoDiff gradient: [0.0,1.0]

which is wrong, in this case the right gradient is

\nabla [(x_1,x_2)\rightarrow max(2.x_1,x_2)]_{x_1=1,x_2=NaN}=(2,0)

Final word

  • Concerning min()/max() examples of this post I have observed a big performance gain in using Julia version 5.0. What is not clear is the reason of this gain:

    • I compiled my own version of Julia 5.0, but I used the 4.6 version shipped with Debian. Maybe my compiled version uses some different compiler flags (like -march=native…).
    • or there is a real improvement between version 4.6 -> 5.0

    You can reproduce the benchmark using the code available on GitHub. Any feed back is welcome.

  • The @native_code macro of Julia version 4.6 seems to be bugged.
  • It is “easy” to introduce bugs when you mix comparison operators and IEEE 754 min()/max() functions. Implementing min()/max() in a Automatic Differentiation framework is one perfect illustration of this and was the point that motivated me to write this post.
  • (Joke) If you are fed up with min()/max() you can use absolute value:

min(x,y)=\frac{1}{2}(|x+y|-|x-y|)

max(x,y)=\frac{1}{2}(|x+y|+|x-y|)

… but these relations do not follow the IEEE 754 standard!

I would like to thank my colleague JP. Both for having noticed that min()/max() was “abnormally” slow (Julia 4.6), Y. Borstein for some emails exchanges concerning min()/max() in Julia (before I wrote this post), Erik Schnetter who told me to switch to a more recent Julia version.