Author Archives: julia | FS

AbstractDifferentiation.jl for AD-backend agnostic code

By: julia | FS

Re-posted from: https://frankschae.github.io/post/abstract_differentiation/

Differentiable programming (∂P), i.e., the ability to differentiate general computer program structures, has enabled the efficient combination of existing packages for scientific computation and machine learning1. The Julia2 language is
well suited for ∂P, see also Chris’ article3 for a detailed examination. There is already a plethora of examples where ∂P has provided massive performance and accuracy advantages over black-box approaches to machine learning. This is because black-box machine learning approaches are flexible but require a large amount of data. Incorporating previously acquired knowledge about the structure of a problem reduces the amount of data and allows the learning task to be simplified4, for example, by focusing on learning only the parts of the model that are actually missing4 5. In the context of quantum control, we have demonstrated the power of this framework for closed6 and
open quantum systems7.

∂P is (commonly) realized by automatic differentiation (AD), which is a family of techniques to efficiently and accurately differentiate numeric functions expressed as computer programs. Generally, besides forward- and reverse-mode AD, the two main branches of AD,
a large variety of software implementations with different
pros and cons exists. The goal is to make the best choice in every part of the program without requiring users to significantly customize their code. Having a common ground by
ChainRules.jl empowers this idea of a
Glue AD where backend developers just define ChainRules overloads. However, switching from one backend to another on the user side can still be tedious because the user has to look up the syntax of the new AD package.

Mohamed Tarek has started to
implement a high level API for differentiation that unifies the APIs of all the AD packages in the Julia ecosystem. Ultimately, the API of our new package,
AbstractDifferentiation.jl, aims at enabling AD users to write AD backend-agnostic code. This will greatly facilitate the switching between different AD packages. Once the interface is completed and all tests are added, it is also planned that
DiffEqSensitivity.jl within the
SciML software suite adopts AbstractDifferentiation.jl as a better way of handling AD choices. In this part of my GSoC project, I’ve started to fix remaining errors of the
initial PR.

The interested reader is encouraged to look at Mohamed’s
first PR for a complete list of functions provided by AbstractDifferentiation.jl (and some great discussions about the package). In the rest of this blog post, I will focus on a concrete example to illustrate the main idea.

Optimization of the Rosenbrock function

The
Rosenbrock function is defined by

$$
g(x_1,x_2) = (a-x_1)^2 + b(x_2-x_1^2)^2.
$$

The function $g$ has a global minimum at $(x_1^\star, x_2^\star)= (a, a^2)$ with $g(x_1^\star, x_2^\star)=0$. In the following, we fix $a = 1$ and $b = 100$. The global minimum is located inside a long, narrow, banana-shaped, flat valley, which makes the function a common test case for optimization algorithms.

Let us now implement the
Gauss–Newton algorithm to find the global minimum. The Gauss–Newton algorithm iteratively finds the value of the $N$ variables ${\bf{x}}=(x_1,\dots, x_N)$ that minimize the sum of squares of $M$ residuals $(f_1,\dots, f_M)$

$$
S({\bf x}) = \frac{1}{2} \sum_{i=1}^M f_i({\bf x})^2.
$$

Starting from an initial guess ${\bf x_0}$ for the minimum, the method runs through the iterations

$$
{\bf x}^{k+1} = {\bf x}^k – \alpha_k \left(J^T J \right)^{-1} J^T f({\bf x}^k),
$$
where $J$ is the Jacobian matrix at ${\bf{x}}^k$ and $\alpha_k$ is the step length determined via a
line search subroutine.

The following plot shows the Rosenbrock function in 3D as well as a 2D heatmap including the global minimum ${\bf x^\star}=(1,1)$ and our initial guess ${\bf x_0}=(0,-0.1)$.

using Pkg
path = @__DIR__
cd(path); Pkg.activate("."); Pkg.instantiate()

## AbstractDifferentiation is not released yet!!

using AbstractDifferentiation
using Test, LinearAlgebra
using FiniteDifferences, ForwardDiff, Zygote
using Enzyme, UnPack
using Plots, LaTeXStrings
# using Diffractor: ∂⃖¹ ## Diffractor needs >julia@1.6

## Rosenbrock function
# R: R^2 -> R: x -> (a-x₁)² + b(x₂-x₁²)²
g(x,p) = (p[1]-x[1])^2 + p[2]*(x[2]-x[1]^2)^2

# visualization
p = [1.0,100.0]
x₀ = [0.0,-0.1]
xopt = [1.0,1.0]

do_plot = true
if do_plot    
    x₁, x₂ = -2.0:0.01:2.0, -0.6:0.01:3.5
    z = Surface((x₁,x₂)->g([x₁,x₂],p), x₁, x₂)
    pl1 = surface(x₁,x₂,z, linealpha = 0.3, c=cgrad(:thermal, scale = :exp), colorbar=true,
                labelfontsize=20,camera = (3,50),
                xlabel = L"x_1", ylabel = L"x_2")

    pl2 = heatmap(x₁,x₂,z, c=cgrad(:thermal, scale = :exp),
                labelfontsize=20,
                xlabel = L"x_1", ylabel = L"x_2")
    scatter!(pl2, [(x₀[1],x₀[2])], label=L"x_0", legendfontsize=15, markershape = :circle, markersize = 10, markercolor = :green)
    scatter!(pl2, [(xopt[1],xopt[2])],label=L"x^\star", legendfontsize=15, markershape = :star, markersize = 10, markercolor = :red)

    pl = plot(pl1,pl2, layout=(2,1))
    savefig(pl, "Rosenbrock.png")
end


To apply the Gauss-Newton algorithm to the Rosenbrock function $g$, we first cast $g$ into an appropriate form fulfilling $S({\bf x})$, i.e., we use:

$$
f:\mathbb{R}^2\rightarrow\mathbb{R}^2: {\bf x} \mapsto \begin{pmatrix}
f_1({\bf x}) \\
f_2({\bf x}) \\
\end{pmatrix} = \begin{pmatrix}
\sqrt{2}(a-x_1) \\
\sqrt{2b}(x_2-x_1^2)\\
\end{pmatrix},
$$

instead of $g$. We can easily compute the Jacobian of $f$ manually

$$
J = \begin{pmatrix}
-\sqrt{2} & 0 \\
-2x_1\sqrt{2b} & \sqrt{2b} \\
\end{pmatrix}.
$$

We can then implement a (simple, non-optimized) version of the Gauss-Newton algorithm as follows.

# bring Rosenbrock function into the form "sum of squares of functions"
f1(x,p) = convert(eltype(x),sqrt(2))*(p[1]-x[1])
f2(x,p) = convert(eltype(x),sqrt(2*p[2]))*(x[2]-x[1]^2)
f(x,p) = [f1(x,p),f2(x,p)]
function f(res,x,p) # Enzyme works with inplace functions
	res[1] = f1(x,p)
	res[2] = f2(x,p)
	return nothing
end

## manually pre-defined Jacobian
function Jacobian(x,p)
  [-convert(eltype(x),sqrt(2))   0
  -2*x[1]*convert(eltype(x),sqrt(2*p[2]))  convert(eltype(x),sqrt(2*p[2]))]
end

## Gauss-Newton scheme
function GaussNewton!(xs, x, p; maxiter=8, backend=nothing)
    for i=1:maxiter
        x = step(x, p, backend)
        @info i
        @show x
        push!(xs, x)
    end
    return xs, x
end
done(x,x2,p) = g(x2,p) < g(x,p)
function step(x, p, backend::Nothing, α=1//1)
  x2 = deepcopy(x)
  while !done(x,x2,p)
    J = Jacobian(x,p)
    d = -inv(J'*J)*J'*f(x,p)
    copyto!(x2,x + α*d)
    α = α//2
  end
  return x2
end

When we run the algorithm, we find the global minimum after about the 7th iteration.

xs = [x₀]
GaussNewton!(xs, x₀, p)
# output:
[ Info: 1 ]
x = [0.125, -0.08750000000000001]
[ Info: 2 ]
x = [0.234375, -0.047265625000000006]
[ Info: 3 ]
x = [0.4257812499999995, 0.06800537109374968]
[ Info: 4 ]
x = [0.5693359374999986, 0.21857223510742047]
[ Info: 5 ]
x = [0.784667968749996, 0.5165503501892037]
[ Info: 6 ]
x = [0.9999999999999961, 0.9536321163177449]
[ Info: 7 ]
x = [0.9999999999999989, 0.9999999999999999]
[ Info: 8 ]
x = [1.0, 1.0]

If computing the Jacobian by hand is too cumbersome (or not possible for other reasons), we can compute it using finite differences. Within the AbstractDifferentiation API, we can directly define, for instance, the Jacobian of
FiniteDifferences.jl as a new primitive operation.

## FiniteDifferences
struct FDMBackend{A} <: AD.AbstractFiniteDifference
    alg::A
end
FDMBackend() = FDMBackend(central_fdm(5, 1))
const fdm_backend = FDMBackend()
# Minimal interface
AD.@primitive function jacobian(ab::FDMBackend, f, xs...)
    return FiniteDifferences.jacobian(ab.alg, f, xs...)
end

# AD Jacobian returns tuple
# df_dx = AD.jacobian(fdm_backend, f(x,p), x₀, p)[1]
# df_dp = AD.jacobian(fdm_backend, f(x,p), x₀, p)[2]

@test AD.jacobian(fdm_backend, x->f(x,p), x₀)[1] ≈ Jacobian(x₀, p)
@test AD.jacobian(fdm_backend, f, x₀, p)[1] ≈ Jacobian(x₀, p)

After overloading the step function, we can run the Gauss-Newton algorithm as follows:

function step(x, p, backend, α=1//1)
  x2 = deepcopy(x)
  while !done(x,x2,p)
    J = AD.jacobian(backend, f, x, p)[1]
    d = -inv(J'*J)*J'*f(x,p)
    copyto!(x2,x + α*d)
    α = α//2
  end
  return x2
end


xs = [x₀]
GaussNewton!(xs, x₀, p, backend=fdm_backend)

If we want to use reverse-mode AD instead, for example via
Zygote.jl, a natural choice for the primitive is to define the pullback function. AbstractDifferentiation then generates the associated code to compute the Jacobian for us.

## Zygote
struct ZygoteBackend <: AD.AbstractReverseMode end
const zygote_backend = ZygoteBackend()
AD.@primitive function pullback_function(ab::ZygoteBackend, f, xs...)
    return function (vs)
        # Supports only single output
        _, back = Zygote.pullback(f, xs...)
        if vs isa AbstractVector
            back(vs)
        else
            @assert length(vs) == 1
            back(vs[1])
        end
    end
end
##

@test minimum(AD.jacobian(fdm_backend, f, x₀, p) .≈ AD.jacobian(zygote_backend, f, x₀, p))
xs = [x₀]
GaussNewton!(xs, x₀, p, backend=zygote_backend)

Typically, reverse-mode AD is only beneficial for functions $f:\mathbb{R}^N\rightarrow\mathbb{R}^M$ where $M \ll N$, thus it is also a good idea to compare the performance with respect to forward-mode AD (
ForwardDiff.jl)

## ForwardDiff
struct ForwardDiffBackend <: AD.AbstractForwardMode end
const forwarddiff_backend = ForwardDiffBackend()
AD.@primitive function pushforward_function(ab::ForwardDiffBackend, f, xs...)
    # jvp = f'(x)*v, i.e., differentiate f(x + h*v) wrt h at 0
    return function (vs)
        if xs isa Tuple
            @assert length(xs) <= 2
            if length(xs) == 1
                (ForwardDiff.derivative(h->f(xs[1]+h*vs[1]),0),)
            else
                ForwardDiff.derivative(h->f(xs[1]+h*vs[1], xs[2]+h*vs[2]),0)
            end
        else
            ForwardDiff.derivative(h->f(xs+h*vs),0)
        end
    end
end
##

@test minimum(AD.jacobian(fdm_backend, f, x₀, p) .≈ AD.jacobian(forwarddiff_backend, f, x₀, p))
xs = [x₀]
GaussNewton!(xs, x₀, p, backend=forwarddiff_backend)

where we have used that the Jacobian-vector product $f'(x)v$, i.e., the primitives of forward-mode AD, can be computed by
differentiating $f(x + hv)$ with respect to $h$ at 0.

Many AD packages, such as Zygote, have troubles with mutating functions.
Enzyme.jl is one of the exceptions. Additionally, it is very fast and has further improved the performance of the
adjoints implemented within the DiffEqSensitivity package.

## Enzyme
struct EnzymeBackend{T1,T2,T3,T4} <: AD.AbstractReverseMode
    out::T1
    λ::T2
    ∂f_∂x::T3
    ∂f_∂p::T4
end

out = zero(x₀)
λ = zero(x₀)
∂f_∂x = zero(x₀)
∂f_∂p = zero(p)

const enzyme_backend = EnzymeBackend(out,λ,∂f_∂x,∂f_∂p)
AD.@primitive function pullback_function(ab::EnzymeBackend, f, xs...)
    return function (vs)  
        # enzyme works only with inplace functions
        if !(vs isa AbstractVector)
            @assert length(vs) == 1 # Supports only single output
            vs = vs[1]
        end

        if xs isa Tuple
            @assert length(xs) == 2  # hard-coded for use case with two inputs
            x₀ = xs[1]
            p = xs[2]
        end

        @unpack out, λ, ∂f_∂x, ∂f_∂p = ab # cached in the struct, could also be created in here

        ∂f_∂x .*= false
        ∂f_∂p .*= false
        out .*= false

        copyto!(λ, vs)

        autodiff(Duplicated(out, λ), Duplicated(x₀, ∂f_∂x), Duplicated(p, ∂f_∂p)) do _out,_x, _p
            f(_out,_x,_p)
        end
        return (∂f_∂x,∂f_∂p)
    end
end
AD.isinplace(ab::EnzymeBackend) = true
AD.primalvalue(ab::EnzymeBackend, nothing, f, xs) = (f(ab.out,xs...);return ab.out)
##

@test minimum(AD.jacobian(fdm_backend, f, x₀, p) .≈ AD.jacobian(enzyme_backend, f, x₀, p))
xs = [x₀]
GaussNewton!(xs, x₀, p, backend=enzyme_backend)

Note that we have declared the Enzyme backend as inplace (which is important for internal control flow) and specified a primalvalue function returning the primal value of the forward pass.

Some current glitches

First, the push forward of a tuple of vectors, e.g., $(v_1, v_2)$, for a function with several input arguments is currently ambiguous. While AD.jacobian primitives and AD.pullback_function primitives interpret the push forward of our $f$ function as

$$
\left(\frac{\partial f(x_0,p)}{\partial x} v_1 , \frac{\partial f(x_0,p)}{\partial p} v_2 \right),
$$

AD.pushforward_function primitives compute

$$
\frac{\partial f(x_0,p)}{\partial x} v_1 + \frac{\partial f(x_0,p)}{\partial p} v_2.
$$

# pushforward_function wrt to multiple vectors is currently ambiguous
vs = (randn(2), randn(2))
res1 = AD.pushforward_function(fdm_backend, f, x₀, p)(vs)
res2 = AD.pushforward_function(forwarddiff_backend, f, x₀, p)(vs)

@test res2 ≈ res1[1] + res1[2]

Thus, we currently solve this issue by augmenting the input in the case of AD.pushforward_function primitives.

res2a = AD.pushforward_function(forwarddiff_backend, f, x₀, p)((vs[1], zero(vs[2])))
res2b = AD.pushforward_function(forwarddiff_backend, f, x₀, p)((zero(vs[1]), vs[2]))

@test res2a ≈ res1[1]
@test res2b ≈ res1[2]

The plural “primitives” is used here because we may have different pushforward_function primitives for different backends. For instance, we can define an additional pushforward_function primitive for FiniteDifferences by:

struct FDMBackend2{A} <: AD.AbstractFiniteDifference
    alg::A
end
FDMBackend2() = FDMBackend2(central_fdm(5, 1))
const fdm_backend2 = FDMBackend2()
AD.@primitive function pushforward_function(ab::FDMBackend2, f, xs...)
    return function (vs)
        FDM.jvp(ab.alg, f, tuple.(xs, vs)...)
    end
end

Second, to avoid misunderstandings for the output of a Hessian of a function with several input arguments, we allow only single input arguments to the Hessian function.

# Hessian only defined with respect to single input variable
@test_throws AssertionError H1 = AD.hessian(forwarddiff_backend, g, x₀, p)
H1 = AD.hessian(forwarddiff_backend, x->g(x,p), x₀)
H2 = AD.hessian(forwarddiff_backend, p->g(x₀,p), p)

Third, computing the Hessian requires to nest AD/backend calls. This can lead to failure if one tries to use Zygote over Zygote. To solve this problem, we have implemented a HigherOrderBackend that takes a tuple containing multiple backends (because, for example, using ForwardDiff over Zygote is perfectly fine).

# Hessian might fail if AD system calls must not be nested (e.g. Zygote over Zygote)
backends = AD.HigherOrderBackend((forwarddiff_backend,zygote_backend))
H3 = AD.hessian(backends, x->g(x,p), x₀)

Outlook

There are many other use cases, e.g.,

  • Sensitivity analysis of differential equations requires vector-Jacobian products for adjoint methods and Jacobian-vector products for tangent methods.
  • The
    Newton–Raphson method for rootfinding requires the gradient in the case of scalar function $f:\mathbb{R}\rightarrow\mathbb{R}$ and the Jacobian in case of $N$ (nonlinear) equations, i.e., finding the zeros of $f:\mathbb{R}^N\rightarrow\mathbb{R}^N$.
  • The
    Newton method in optimization requires the computation of the Hessian.

AbstractDifferentiation.jl is by no means complete yet. We are still in the very early stages, but we hope to make significant progress in the coming weeks. Some of the next steps are:

  • fixing remaining bugs, e.g., with respect to the computation of the Hessian and
  • adding AD/Finite Differentiation packages such as
    Diffractor.

If you have any questions or comments, please don’t hesitate to contact me!


  1. Mike Innes, Alan Edelman, et al., arXiv preprint arXiv:1907.07587 (2019). ↩︎

  2. Jeff Bezanson, Stefan Karpinski, et al., arXiv preprint arXiv:1209.5145 (2012). ↩︎

  3. Chris Rackauckas, The Winnower 8, DOI: 10.15200/winn.156631.13064 (2019). ↩︎

  4. Chris Rackauckas, Yingbo Ma, et al., arXiv preprint arXiv:2001.04385 (2020). ↩︎

  5. Raj Dandekar, Chris Rackauckas, et al., Patterns 1, 100145 (2020). ↩︎

  6. Frank Schäfer, Michal Kloc, et al., Mach. Learn.: Sci. Technol. 1, 035009 (2020). ↩︎

  7. Frank Schäfer, Pavel Sekatski, et al., Mach. Learn.: Sci. Technol. 2, 035004 (2021). ↩︎

Shadowing Methods for Forward and Adjoint Sensitivity Analysis of Chaotic Systems

By: julia | FS

Re-posted from: https://frankschae.github.io/post/shadowing/

In this post, we dig into sensitivity analysis of chaotic systems. Chaotic systems are dynamical, deterministic systems that are extremely sensitive to small changes in the initial state or the system parameters. Specifically, the dependence of a chaotic system on its initial conditions is well known as the “butterfly effect”. Chaotic models are encountered in various fields ranging from simple examples such as the double pendulum to highly complicated fluid or climate models.

Sensitivity analysis methods have proven to be very powerful for solving inverse problems such as parameter estimation or optimal control1 2 3. However, conventional sensitivity analysis methods may fail in chaotic systems due to the ill-conditioning of the initial value problem. Sophisticated methods, such as least squares shadowing4 (LSS) or non-intrusive least squares shadowing5 (NILSS) have been developed in the last decade. Essentially, these methods transform the initial value problem to a well conditioned optimization problem – the least squares shadowing problem. In this second part of my GSoC project, I implemented the LSS and the NILSS method within the
DiffEqSensitivity.jl package.

The objective for LSS and NILSS is a long-time average quantity. More precisely, we define the instantaneous objective by $g(u,p)$, where $u$ is the state and $p$ is the parameter of the differential equation. Then, the objective is obtained by averaging $g$ over an infinitely long trajectory:

$$
\langle g \rangle_∞ = \lim_{T \rightarrow ∞} \langle g \rangle_T,
$$
where
$$
\langle g \rangle_T = \frac{1}{T} \int_0^T g(u,s) \text{d}t.
$$
Under the assumption of ergodicity, $\langle g \rangle_∞$ only depends on $p$.

The Lorenz system

One of the most important chaotic models is the Lorenz system which is a simplified model for atmospheric convection. The Lorenz system has three states $x$, $y$, and $z$, as well as three parameters $\rho$, $\sigma$, and $\beta$. Its time evolution is given by the ODE:

$$
\begin{pmatrix}
\text{d}x \\
\text{d}y \\
\text{d}z \\
\end{pmatrix} = \begin{pmatrix}
\sigma (y-x)\\
x(\rho-z) – y\\
x y – \beta z \\
\end{pmatrix}\text{d}t
$$

For simplicity, let us fix $\sigma=10$ and $\beta=8/3$ and focus only on the sensitivity with respect to $\rho$. The classic Lorenz attractor is obtained when using $\rho=28$:

using Random; Random.seed!(1234)
using OrdinaryDiffEq
using Statistics
using QuadGK, ForwardDiff, Calculus
using DiffEqSensitivity
using SparseArrays, LinearAlgebra

# simulate 1 trajectory of the Lorenz system forward
function lorenz!(du,u,p,t)
  du[1] = 10*(u[2]-u[1])
  du[2] = u[1]*(p[1]-u[3]) - u[2]
  du[3] = u[1]*u[2] - (8//3)*u[3]
end

p = [28.0]
tspan_init = (0.0,30.0)
tspan_attractor = (30.0,50.0)
u0 = rand(3)
prob_init = ODEProblem(lorenz!,u0,tspan_init,p)
sol_init = solve(prob_init,Tsit5())
prob_attractor = ODEProblem(lorenz!,sol_init[end],tspan_attractor,p)
sol_attractor = solve(prob_attractor,Vern9(),abstol=1e-14,reltol=1e-14)

using Plots, LaTeXStrings
pl1 = plot(sol_init,vars=(1,2,3), legend=true,
  label = "initial",
  labelfontsize=20,
  lw = 2,
  xlabel = L"x", ylabel = L"y", zlabel = L"z",
  xlims=(-25,30),ylims=(-30,30),zlims=(5,49)
 )
plot!(pl1, sol_attractor,vars=(1,2,3), label="attractor",xlims=(-25,30),ylims=(-30,30),zlims=(5,49)
 )
savefig(pl1, "Lorenz_forward.png")


Here, we separated the trajectory in two parts: We plot the initial transient dynamics starting from random initial conditions towards the attractor in blue and the subsequent time evolution lying entirely on the attractor in orange.

Following Refs.4 and 5, we choose

$$
\langle z \rangle_∞ = \lim_{T \rightarrow ∞} \frac{1}{T} \int_0^T z \text{d}t
$$

as the objective, where we only use the trajectory that lies completely on the attractor (i.e., the orange trajectory in the plot on top). Let us first study the objective as a function of $\rho$.

function compute_objective(sol)
  quadgk(t-> sol(t)[end]/(tspan_attractor[2]-tspan_attractor[1]) ,tspan_attractor[1],tspan_attractor[2], atol=1e-14, rtol=1e-10)[1]
end

pl2 = plot(sol_attractor.t, getindex.(sol_attractor.u,3), ylabel=L"z(t)", xlabel=L"t", label=false, labelfontsize=20,lw = 2)
mean_z = [mean(getindex.(sol_attractor.u,3))]
int_z = compute_objective(sol_attractor)
hline!(pl2, [int_z], label=L"\langle z\rangle", lw = 2)
savefig(pl2, "zsingle.png")

# for each value of the parameter, solve 20 times the initial value problem
# wrap the procedure inside a function depending on p
function Lorenz_solve(p)
  u0 = rand(3)
  prob_init = ODEProblem(lorenz!,u0,tspan_init,p)
  sol_init = solve(prob_init,Tsit5())
  prob_attractor = ODEProblem(lorenz!,sol_init[end],tspan_attractor,p)
  sol_attractor = solve(prob_attractor,Vern9(),abstol=1e-14,reltol=1e-14)
  sol_attractor, prob_attractor
end

Niter = 10
ps = collect(0.0:1.0:50.0)
probs = []
sols = []
zmean = []
zstd = []
for ρ in ps
  @show ρ
  ztmp = []
  for i=1:Niter
    sol, prob = Lorenz_solve([ρ])
    zbar = compute_objective(sol)
    push!(sols, sol)
    push!(probs, prob)
    push!(ztmp, zbar)
  end
  push!(zmean,mean(ztmp))
  push!(zstd,std(ztmp))
end

pl3 = plot(ps,zmean, ribbon = zstd, ylabel=L"\langle z\rangle", xlabel=L"\rho", legend=false, labelfontsize=20, lw = 2, xlims=(0,50),ylims=(0,50))
savefig(pl3, "zvsrho.png")

pl4 = plot(pl2,pl3, margin=3Plots.mm, layout = (1, 2), size=(600,300))
savefig(pl4, "z.png")

We obtain:


That is, we find a slope of approximately one (almost everywhere except at the kink $\rho\approx 23$), and, therefore, we expect a sensitivity of

$$
\frac{\text{d}\langle z \rangle_∞}{\text{d} \rho} \approx 1.
$$

Conventional forward-mode sensitivity analysis and finite-differencing

For non-chaotic systems, we would just use the
standard discrete or continuous forward sensitivity methods or even finite-differencing. If we try to compute the sensitivity for the Lorenz system:

function G(p, prob=prob_attractor)
  tmp_prob = remake(prob,p=p)
  tmp_sol = solve(tmp_prob,Vern9(),abstol=1e-14,reltol=1e-14)
  res = compute_objective(tmp_sol)
  @info res
  res
end
sense_forward = ForwardDiff.gradient(G,p)
sense_calculus = Calculus.gradient(G,p)

we find diverging values:

$$
\begin{aligned}
& \frac{\text{d}\langle z \rangle_\infty}{\text{d} \rho} \Bigg\rvert_{\rho=28} \approx -49899 {\text{ (ForwardDiff)}} \\
&\frac{\text{d}\langle z \rangle_\infty}{\text{d} \rho} \Bigg\rvert_{\rho=28} \approx 472 {\text{ (Calculus)}}
\end{aligned}
$$

As pointed out in the NILSS paper, this is because the limit of $T\rightarrow ∞$ for a fixed initial state does not commute with the differentiation:

$$
\frac{\text{d}}{\text{d} \rho} \langle z \rangle_∞ \neq \lim_{T \rightarrow ∞} \frac{\partial}{\partial \rho} \langle z \rangle_T
$$

Similarly, using
uncertainty quantification one realizes that due to finite numerical precision and the associated unavoidable errors that are amplified exponentially, one cannot follow the true solution of a chaotic system for long times. We can visualize this by solving the Lorenz system twice with exactly the same parameters and initial condition but with different floating point number precision. In the following animation, we see an $O(1)$ difference between both trajectories after a few Lyapunov lengths:

prob_attractor1 = ODEProblem(lorenz!,sol_init[end],(0.0, 50.0),p)
prob_attractor2 = ODEProblem(lorenz!,convert.(Float32, sol_init[end]),(0f0, 50f0),convert.(Float32,p))
sol1 = solve(prob_attractor1,Tsit5(),abstol=1e-6,reltol=1e-6, saveat=0.01)
sol2 = solve(prob_attractor2,Tsit5(),abstol=1f-6,reltol=1f-6, saveat=0.01f0)

list_plots = []
t1 = 0.0
for i in 1:500
  t2 = i*0.1
  plt1 = plot(sol1, vars=(1,2,3), tspan=(t1,t2), denseplot=true, legend=true,
     label = "Float64", labelfontsize=20, lw = 2,
     xlabel = L"x", ylabel = L"y", zlabel = L"z",
     xlims=(-20,25),ylims=(-28,25),zlims=(5,48))
  plot!(plt1, sol2,vars=(1,2,3), tspan=(t1,t2), denseplot=true, label="Float32",
        xlims=(-20,25),ylims=(-28,25),zlims=(5,48))
  push!(list_plots, plt1)
end

anim = animate(list_plots,every=1)

pl1 = plot(sol1,vars=(1,2,3), legend=true,
  label = "Float64", labelfontsize=20, lw = 2,
  xlabel = L"x", ylabel = L"y", zlabel = L"z",
  xlims=(-20,25),ylims=(-28,25),zlims=(5,48)
 )
plot!(pl1, sol2,vars=(1,2,3), label="Float32", xlims=(-20,25),ylims=(-28,25),zlims=(5,48)
 )

savefig(pl1, "Lorenz_Floats.png")


Without animation:


Luckily, the
shadowing lemma states:

Although a numerically computed chaotic trajectory diverges exponentially from the true trajectory with the same initial coordinates, there exists an errorless trajectory with a slightly different initial condition that stays near (“shadows”) the numerically computed one.

Shadowing methods

The central idea of the shadowing methods is to distill the long-time effect (which actually shifts the attractor) due to a variation of the system parameters (upwards in the $z$-direction with increasing $\rho$ for the Lorenz system) from the transient effect, i.e., the butterfly effect that looks like exponentially diverging trajectories due to variations of the initial conditions. That implies that we aim at finding two trajectories, one with $p$ and one with $p+\delta p$, which do not diverge exponentially from each other (which exist thanks to the shadowing lemma). In this case, their difference will only contain the long-time effect. More details can be found in Refs. 4 and 5, including a visualization of both effects in Fig. 1 of Ref. 5.

LSS and NILSS for the Lorenz system

Switching to LSS or NILSS within the
SciML ecosystem is straightforward by either defining the associated LSS (ForwardLSSProblem or AdjointLSSProblem) or NILSS problem (NILSSProblem) type manually:

# objective
g(u,p,t) = u[end]

####
# LSS
####
lss_problem = ForwardLSSProblem(sol_attractor, ForwardLSS(alpha=DiffEqSensitivity.CosWindowing()), g)
@show shadow_forward(lss_problem) # 1.0095888187322035

lss_problem = ForwardLSSProblem(sol_attractor, ForwardLSS(alpha=DiffEqSensitivity.Cos2Windowing()), g)
@show shadow_forward(lss_problem) # 1.0343951385924328

lss_problem = ForwardLSSProblem(sol_attractor, ForwardLSS(alpha=10.0), g)
@show shadow_forward(lss_problem) # 1.0284286902740765

adjointlss_problem = AdjointLSSProblem(sol_attractor, AdjointLSS(alpha=10.0), g)
@show shadow_adjoint(adjointlss_problem) # 1.028428690274077

or by setting the sensealg= kwarg in solve():

# select via sensealg in solve
using Zygote

function GLSS(p; sensealg=ForwardLSS(), dt=0.01, g=nothing)
  _prob = remake(prob_attractor,p=p)
  _sol = solve(_prob,Vern9(),abstol=1e-14,reltol=1e-14,saveat=dt,sensealg=sensealg, g=g)
  sum(getindex.(_sol.u,3))
end

dp1 = Zygote.gradient((p)->GLSS(p),p) # 0.9694728321500617

Note that we have implemented three different options for forward shadowing with LSS():

  • CosWindowing() (default)
  • Cos2Windowing()
  • time dilation with a factor of $\alpha$.

Additionally, an adjoint implementation AdjointLSS() is available that is particularly recommended for a large number of system parameters. Based on the values computed above, we can easily check that AdjointLSS(alpha=10.0) agrees perfectly with ForwardLSS(alpha=10.0). In all cases considered, we find the expected sensitivity value of $\approx 1$.

However, the use of LSS() is (typically) much more expensive than the use of NILSS(), because LSS() needs to solve a large linear system. This linear system scales with the number of independent variables in the differential equation times the number of time steps and, thus, it can become very large. The computational and memory costs of NILSS() scale with the number of positive (unstable) Lyapunov exponents, since it constrains the optimization problem in the LSS method to its unstable subspace. In many cases, this number is much smaller than the number of independent variables, hence making NILSS() more efficient.

In the NILSS() algorithm, the user can control the number of steps per segment as well as the number of segments.

####
# NILSS
####

# make sure trajectory is fully on the attractor
Random.seed!(1234)
tspan_init = (0.0,100.0)
tspan_attractor = (100.0,120.0)
u0 = rand(3)
prob_init = ODEProblem(lorenz!,u0,tspan_init,p)
sol_init = solve(prob_init,Tsit5())
prob_attractor = ODEProblem(lorenz!,sol_init[end],tspan_attractor,p)

nseg = 100 # number of segments on time interval
nstep = 2001 # number of steps on each segment

nilss_prob = NILSSProblem(prob_attractor, NILSS(nseg, nstep), g)
@show DiffEqSensitivity.shadow_forward(nilss_prob,Tsit5()) # 0.9966924374966089

If the number of segments is chosen too small, a warning is thrown:

nseg = 20 # number of segments on time interval
nstep = 2001 # number of steps on each segment

nilss_prob = NILSSProblem(prob_attractor, NILSS(nseg, nstep), g)
@show DiffEqSensitivity.shadow_forward(nilss_prob,Tsit5()) # 1.0416028730638789

# Warning: Detected a large value of ξ at the beginning of a segment.
# └ @ DiffEqSensitivity ~/.julia/dev/DiffEqSensitivity/src/nilss.jl:474

In the future, we might add an option for the automate control of these variables following the proposal in the NILSS paper5.

Outlook

With respect to the shadowing methods for chaotic systems, we are planning to implement further methods, such as

  • NILSAS6
  • FD-NILSS7

in the upcoming weeks. For further information and a collection of other methods, the interested reader is invited to track the corresponding
issue in the DiffEqSensitivity.jl package.

If you have any questions or comments, please don’t hesitate to contact me!


  1. Frank Schäfer, Michal Kloc, et al., Mach. Learn.: Sci. Technol. 1, 035009 (2020). ↩︎

  2. Frank Schäfer, Pavel Sekatski, et al., Mach. Learn.: Sci. Technol. 2, 035004 (2021). ↩︎

  3. Chris Rackauckas, Yingbo Ma, et al., arXiv preprint arXiv:2001.04385 (2020). ↩︎

  4. Qiqi Wang, Rui Hu, et al. J. Comput. Phys 26, 210-224 (2014) ↩︎

  5. Angxiu Ni and Qiqi Wang. J. Comput. Phys 347, 56-77 (2017). ↩︎

  6. Angxiu Ni and Chaitanya Talnikar, J. Comput. Phys 395, 690-709, (2019) ↩︎

  7. Angxiu Ni, Qiqi Wang et al., J. Comput. Phys 394, 615-631 (2019) ↩︎

Neural Hybrid Differential Equations

By: julia | FS

Re-posted from: https://frankschae.github.io/post/hybridde/

I am delighted that I have been awarded my second GSoC stipend this year. I look forward to carrying out the ambitious project scope with my mentors
Chris Rackauckas,
Moritz Schauer,
Yingbo Ma, and
Mohamed Tarek. This year’s project is embedded within the
NumFocus/
SciML organization and comprises adjoint sensitivity methods for discontinuities, shadowing methods for chaotic dynamics, symbolically generated adjoint methods, and further AD tooling within the Julia Language.

This first post aims to illustrate our new (adjoint) sensitivity analysis tools with respect to event handling in (ordinary) differential equations (DEs).

Hybrid Differential Equations

DEs with additional explicit or implicit discontinuities are called hybrid DEs. Within the SciML software suite, such discontinuities may be incorporated into DE models by
callbacks. Evidently, the incorporation of discontinuities allows a user to specify changes (events) in the system, i.e., changes of the state or the parameters of the DE, which cannot be modeled by a plain ordinary DE. While explicit events can be described by
DiscreteCallbacks, implicit events have to be specified by
ContinuousCallbacks. That is, explicit events possess explicit event times, while implicit events are triggered when a continuous function evaluates to 0. Thus, implicit events require some sort of rootfinding procedure.

Some relevant examples for hybrid DEs with discrete or continuous callbacks are:

  • quantum optics experiments, where photon-counting measurements lead to jumps in the quantum state that occur with a variable rate, see for instance Appendix A in Ref.1 (ContinuousCallback).
  • a bouncing ball2 (ContinuousCallback).
  • classical point process models, such as a Poisson process3.
  • digital controllers4, where a continuous system dynamics is controlled by a discrete-time controller (DiscreteCallback).
  • pharmacokinetic models5, where explicit dosing times change the drug concentration in the blood (DiscreteCallback). The simplest possible example being the one-compartment model.
  • kicked oscillator dynamics, e.g., a harmonic oscillator that gets a kick at some time points (DiscreteCallback).

The associated sensitivity methods that allow us to differentiate through the respective hybrid DE systems have been recently introduced in Refs. 2 and 3.

Kicked harmonic oscillator

Let us consider the simple physical model of a damped harmonic oscillator, described by an ODE of the form

$$
\ddot{x}(t) + a\cdot\dot{x}(t) + b \cdot x(t) = 0 ,
$$

where $a=0.1$ and $b=1$ with initial conditions

$$
\begin{aligned}
x(t=0) &= 1 \\
v(t=0) &= \dot{x}(t=0) = 0.
\end{aligned}
$$

This second-order ODE can be
reduced to two first-order ODEs, such that we can straightforwardly simulate the resulting ODE with the DifferentialEquations.jl package. (Instead of doing this reduction manually, we could also use
ModelingToolkit.jl to transform the ODE in an automatic manner. Alternatively, for second-order ODEs, there is also a SecondOrderODEProblem implemented.) The Julia code reads:

using DiffEqFlux, DifferentialEquations, Flux, Optim, Plots, DiffEqSensitivity
using Zygote
using Random
u0 = Float32[1.; 0.]

tspan = (0.0f0,50.0f0)

dtsave = 0.5f0
t = tspan[1]:dtsave:tspan[2]

function oscillator!(du,u,p,t)
  du[1] = u[2]
  du[2] = - u[1] - 1//10*u[2]
  return nothing
end

prob_data = ODEProblem(oscillator!,u0,tspan)

# ODE without kicks
pl = plot(solve(prob_data,Tsit5(),saveat=t), label=["x(t)" "v(t)"])


We now include a kick to the velocity of the oscillator at regular time steps. Here, we choose both the time difference between the kicks and the increase in velocity as 1.

kicktimes = tspan[1]:1:tspan[2]
function kick!(integrator)
  integrator.u[end] += one(eltype(integrator.u))
end
cb_ = PresetTimeCallback(kicktimes,kick!,save_positions=(false,false))

sol_data = solve(prob_data,Tsit5(),callback=cb_,saveat=t)
t_data = sol_data.t
ode_data = Array(sol_data)

# visualize data
pl1 = plot(t_data,ode_data[1,:],label="data x(t)")
plot!(pl1,t_data,ode_data[2,:],label="data v(t)")

pl2 = plot(t_data[1:20],ode_data[1,1:20],label="data x(t)")
plot!(pl2,t_data[1:20],ode_data[2,1:20],label="data v(t)")
pl = plot(pl2, pl1, layout=(1,2), xlabel="t")


The left-hand side shows a zoom for short times to better resolve the kicks. Note that by setting save_positions=(true,true), the kicks would be saved before and after the event such that the kicks would appear completely vertically in the plot. The data on the right-hand will be used as training data below. In the spirit of universal differential equations6, we now aim at learning (potentially) missing parts of the model from these data traces.

High domain knowledge

For simplicity, we assume that we have almost perfect knowledge about our system. That is, we assume to know the basic structure of the ODE, including its parameters $a$ and $b$, and that the affect! function of the event only acts on the velocity. We then encode the affect as an additional component to the ODE. The task is thus to learn the dynamics of the third component of integrator.u. If we further set the initial value of that component to 1, then the neural network only has to learn that du[3] is 0. In other words, the output of the neural network must be 0 for all states u.

Random.seed!(123)
nn1 = FastChain(FastDense(2, 64, tanh),FastDense(64, 1))
p_nn1 = initial_params(nn1)

function f1!(du,u,p,t)
  du[1] = u[2]
  du[2] = - u[1] - 1//10*u[2]
  du[3] = nn1(u[1:2], p)[1]
  return nothing
end

affect!(integrator) = integrator.u[2] += integrator.u[3]
cb = PresetTimeCallback(kicktimes,affect!,save_positions=(false,false))
z0 = Float32[u0;one(u0[1])]
prob1 = ODEProblem(f1!,z0,tspan,p_nn1)

We can easily compare the time evolution of the neural hybrid DE with respect to the data:

# to visualize the predictions of the trained neural network below
function visualize(prob,p)
  _prob = remake(prob,p=p)
  ode_pred = Array(solve(_prob,Tsit5(),callback=cb,
                 saveat=dtsave))[1:2,:]
  pl1 = plot(t_data,ode_pred[1,:],label="x(t)")
  scatter!(pl1,t_data[1:5:end],ode_data[1,1:5:end],label="data x(t)")
  pl2 = plot(t_data,ode_pred[2,:],label="v(t)")
  scatter!(pl2,t_data[1:5:end],ode_data[2,1:5:end],label="data v(t)")

  pl = plot(pl1, pl2, layout=(1,2), xlabel="t")
  return pl, sum(abs2,ode_data .- ode_pred)
end

pl = plot(solve(prob1,Tsit5(),saveat=t,
  callback=cb
  ),label=["x(t)" "v(t)" "u3(t)"])


which (of course) doesn’t match the data due to the random initialization of the neural network parameters before training. The neural network can be trained, i.e., its parameters can be optimized, by minimizing a mean-squared error loss function:

### loss function
function loss(p; prob=prob1, sensealg = ReverseDiffAdjoint())
  _prob = remake(prob,p=p)
  pred = Array(solve(_prob,Tsit5(),callback=cb,
               saveat=dtsave,sensealg=sensealg))[1:2,:]
  sum(abs2,ode_data .- pred)
end

loss(p_nn1)

The recently implemented tools are deeply hidden within the
DiffEqSensitivity.jl package. However, while the user could previously only choose discrete sensitivities such as ReverseDiffAdjoint() or ForwardDiffAdjoint() that rely on direct differentiation through the solver operations to get accurate gradients, one can now also select continuous adjoint sensitivity methods such as BacksolveAdjoint(), InterpolatingAdjoint(), and QuadratureAdjoint() as the sensealg for hybrid DEs. Each choice has its own characteristics in terms of stability, scaling with parameters, and memory consumption, see, e.g.,
Chris’ talk at the SciML symposium at SIAM CSE.

###################################
# training loop
# optimize the parameters for a few epochs with ADAM
function train(prob, p_nn; sensealg=BacksolveAdjoint())
  opt = ADAM(0.0003f0)
  list_plots = []
  losses = []
  for epoch in 1:200
    println("epoch: $epoch / 200")
    _dy, back = Zygote.pullback(p -> loss(p,
      prob=prob,
      sensealg=sensealg), p_nn)
    gs = @time back(one(_dy))[1]
    push!(losses, _dy)
    if epoch % 10 == 0
      # plot every xth epoch
      pl, test_loss = visualize(prob, p_nn)
      println("Loss (epoch: $epoch): $test_loss")
      display(pl)
      push!(list_plots, pl)
    end
    Flux.Optimise.update!(opt, p_nn, gs)
    println("")
  end
  return losses, list_plots
end

# plot training loss
losses, list_plots = train(prob1, p_nn1)
pl1 = plot(losses, lw = 1.5, xlabel = "epoch", ylabel="loss", legend=false)
pl2 = list_plots[end]
pl3 = plot(solve(prob1,p=p_nn1,Tsit5(),saveat=t,
   callback=cb
  ), label=["x(t)" "v(t)" "u3(t)"])

pl = plot(pl2,pl3)


We see the expected constant value of u[3], indicating a kick to the velocity of +=1, at the kicking times over the full time interval.

Reducing the domain knowledge

If less physical information is included in the model design, the training becomes more difficult, e.g., due to
local minima. Possible modification for the kicked oscillator could be

  • changing the initial condition of the third component of u,
  • using another affect function affect!(integrator) = integrator.u[2] = integrator.u[3],
  • dropping the knowledge that only u[2] gets a kick by using a neural network with 2 outputs (+ a fourth component in the ODE):
affect2!(integrator) = integrator.u[1:2] = integrator.u[3:4]
function f2!(du,u,p,t)
  du[1] = u[2]
  du[2] = - u[1] - 1//10*u[2]
  du[3:4] .= nn2(u[1:2], nn_weights)
  return nothing
end
  • fitting the parameters $a$ and $b$ simultaneously:
function f3!(du,u,p,t)
  a = p[end-1]
  b = p[end]
  nn_weights = p[1:end-2]

  du[1] = u[2]
  du[2] = -b*u[1] - a*u[2]
  du[3:4] .= nn2(u[1:2], nn_weights)
  return nothing
end
  • inferring the entire underlying dynamics using a neural network with 4 outputs:
function f4!(du,u,p,t)
  Ω = nn3(u[1:2], p)

  du[1] = Ω[1]
  du[2] = Ω[2]
  du[3:4] .= Ω[3:4]
  return nothing
end
  • etc.

Outlook

With respect to the adjoint sensitivity methods for hybrid DEs, we are planning to

  • refine the adjoints in case of implicit discontinuities (ContinuousCallbacks) and
  • support direct usage through the
    jump problem interface

in the upcoming weeks. For further information, the interested reader is encouraged to look at the open
issues in the DiffEqSensitivity.jl package.

If you have any questions or comments, please don’t hesitate to contact me!


  1. Frank Schäfer, Pavel Sekatski, et al., Mach. Learn.: Sci. Technol. 2, 035004 (2021) ↩︎

  2. Ricky T. Q. Chen, Brandon Amos, Maximilian Nickel, arXiv preprint arXiv:2011.03902 (2020). ↩︎

  3. Junteng Jia, Austin R. Benson, arXiv preprint arXiv:1905.10403 (2019). ↩︎

  4. Michael Poli, Stefano Massaroli, et al., arXiv preprint arXiv:2106.04165 (2021). ↩︎

  5. Chris Rackauckas, Yingbo Ma, et al., “Accelerated predictive healthcare analytics with pumas, a high performance pharmaceutical modeling and simulation platform.” (2020). ↩︎

  6. Chris Rackauckas, Yingbo Ma, et al., arXiv preprint arXiv:2001.04385 (2020). ↩︎