
Summary of Julia Plotting Packages

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/summary-of-julia-plotting-packages/

This is a repost of my response on the Julia Discourse on this topic. I was asked to make a blog post so here you go!

The “Main” Plotting Packages

Here’s a quick summary of the most widely used plotting packages. I may have missed one, but I haven’t missed one that is very widely used.

  • Plots.jl is the most used. It’s probably the most documented, used in the most tutorials, and is used in many videos.
    • Pros: Its main draw is that it has a lot of plugins to other packages through its recipes system, which means that a lot of odd things like plot(sol::ODESolution) or showing the sparsity of a BandedMatrix just work (see the sketch after this list). With all of these integrations, it’s normally what I would recommend first to newcomers since they will generally get the most done with the least work. It has a backend system, so you can make it generate plots via GR (the default), Plotly (i.e. make webpages), pyplot, PGFPlots (LaTeX output), or UnicodePlots (i.e. output plots as text). Its flexibility means it’s not just for standard plots but also for animations, building small graphical user interfaces, and building small apps. This ease of use and flexibility is its main draw.
    • Cons: Its downside has traditionally been its startup time, though that’s down to around a second now, so it’s fine. Its main downside now is mostly that it’s not as configurable as something like Makie, and it’s not as optimized once you get up to millions of points.
  • Makie is probably the second most popular. It’s natively Julia, which is cool in that aspect: you can see the code all the way down.
    • Pros: It’s very optimized for large plots, especially with GPU acceleration via the OpenGL backend (GLMakie). It has a lot of examples these days.
    • Cons: Its downside is that it’s a bit less “first user friendly”, given that its flexibility means there are a lot more options you’re forced to specify everywhere. It has a recipe system now, but it’s fairly new and not well-integrated with most of the ecosystem, so it’s not as seamless as Plots, though by 2024 I would assume that would largely be fixed. It has the longest startup time: it used to be minutes, but now it’s around 5-10 seconds.
  • AlgebraOfGraphics.jl is a grammar of graphics front-end to Makie. This essentially means it has an API that looks and acts like R’s ggplot2. Thus it has largely the same pros and cons as Makie, since it’s just calling Makie under the hood, but with the additional pro of being more familiar to users coming from R or con if you haven’t worked with grammar of graphics before (or don’t like the style).
  • Gadfly is a grammar of graphics based library.
    • Pros: It’s very familiar to a ggplot2 user. Its default theme is pretty nice looking.
    • Cons: It’s a bit high on startup time, closer to Makie than Plots. Also, it’s pretty feature poor. In particular, it is missing 3D plots, animations, the ability to make interactive apps with buttons, etc. For these reasons more and more people are adopting AlgebraOfGraphics, but if you’re just doing some standard statistics it’s fine.
  • Vega and VegaLite are of the same camp as Gadfly in the focus towards “standard” statistics and data science, but using wrappers to Javascript libraries.
    • Pro: Fast startups
    • Cons: Similar to Gadfly, little to no flexibility (making apps, animations, …) and few integrations with Julia libraries beyond Queryverse.
  • PlotlyLight is a no-frills wrapper to Plotly.
    • Pro: No startup time
    • Cons: Requires reading the Plotly docs to know how to use it and has little flexibility or integration into Julia libraries.
  • GR.jl is a front-end to the C library GR. It’s actually used as the default backend of Plots.jl. Many more people use it from Plots.jl than directly due to the integrations and docs, but it is nice for some things on its own.
    • Pros: It’s fast, scales fairly well, has a fast startup time, has a nice GUI for investigating results, integrates well with iTerm2, and is very flexible.
    • Cons: Its docs are a bit difficult, and it doesn’t have any integrations with Julia libraries.
  • PGFPlotsX.jl is a front-end to generate plots for Latex.
    • Pros: Fast startup, output to Latex which makes it easy to then further modify in publication documents.
    • Cons: Its interface is wonky, even if you are familiar with the pgfplots LaTeX package. This makes it quite hard to use and teach. Very few integrations with Julia libraries (Measurements and Colors only?). Lacking flexibility in terms of animations and making apps, though it’s quite flexible in its ability to modify the plots and make weird things.
  • UnicodePlots.jl is very simple, fast startup, and plots to text. Its downside of course is that text is the only output it has.
  • Gaston.jl is a front-end to gnuplot.
    • Pros: Fast startup.
    • Cons: Pretty basic, lacking flexibility and integrations with Julia packages. Requires gnuplot, so there are limitations on where it can be installed (only supports Linux?).
  • GMT.jl wraps the Generic Mapping Tools. It has some plotting tools highlighted in its docs.
    • Pros: Has good examples in the docs. Nice extra tools for maps.
    • Cons: Missing some standard plot types like trisurf, missing integrations with other Julia packages.
  • Gnuplot.jl uses gnuplot under the hood.
    • Pros: Instant startup, has some interesting data science integrations for things like named datasets, very complete set of plots
    • Cons: Not the most complete documentation; requires gnuplot, so there are limitations on where it can be installed (only supports Linux?)
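To make the recipe system mentioned under Plots.jl above concrete, here is a minimal sketch, assuming DifferentialEquations.jl and Plots.jl are installed; the Lorenz system is just an illustrative model, not part of the original post.

```julia
# Sketch of the Plots.jl recipe system: DifferentialEquations.jl defines a
# plot recipe for its solution type, so `plot(sol)` "just works" with sensible
# labels and defaults, with no manual extraction of the time series required.
using DifferentialEquations, Plots

function lorenz!(du, u, p, t)
    du[1] = 10.0 * (u[2] - u[1])
    du[2] = u[1] * (28.0 - u[3]) - u[2]
    du[3] = u[1] * u[2] - (8 / 3) * u[3]
end

prob = ODEProblem(lorenz!, [1.0, 0.0, 0.0], (0.0, 100.0))
sol = solve(prob, Tsit5())

plot(sol)                    # time series of all three states via the recipe
plot(sol, idxs = (1, 2, 3))  # phase-space plot through the same recipe
```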

tl;dr on plotting in Julia

Plots.jl is the most used plotting package in the Julia programming language for a reason. It’s very flexible, it integrates with the most Julia packages so you’ll find it all throughout other docs, and it has many of the advantages of the other libraries through its backend system (sketched below). Thus if you need LaTeX output, use the PGFPlotsX backend. If you need a webpage, use the Plotly backend. Use the UnicodePlots backend when you want text output. Or use the GR default for the basics. With Julia v1.9 its startup time is much improved (and it’s sub-second on the v1.10 beta), which was its major complaint before. If you’re going to use one plotting library and don’t care too much about every little detail, then Plots.jl is a good one to go with. It’s definitely not the best in every case (animations are better in Makie, LaTeX is better in PGFPlotsX, etc.), but it’s capable everywhere.
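As a minimal sketch of that backend switching, assuming the optional backend packages (e.g. PGFPlotsX.jl, UnicodePlots.jl) are installed in the environment where needed:

```julia
# The same plotting code, rendered by different engines depending on the
# output you need. Switching backends is just a function call.
using Plots

x = 0:0.1:2π

gr()              # default backend: fast, good for everyday use
plot(x, sin.(x))

plotly()          # interactive HTML/webpage output
plot(x, sin.(x))

pgfplotsx()       # LaTeX/PGFPlots output for publications
plot(x, sin.(x))

unicodeplots()    # text output directly in the terminal
plot(x, sin.(x))
```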

Makie.jl is catching up and may be the default in the near future. It scales well and it’s getting all of the niceties of Plots.jl. I wouldn’t learn it first if you’re new to Julia (right now, though that will likely change by 2024). But if you need animations or want to add custom buttons to a window (make a quick GUI-like thing), Makie is unmatched. If it makes its standard plotting interface a bit simpler, gets a few more integrations, and thus matches Plots.jl in simplicity, it may hit a “best of most worlds” soon.

Otherwise it’s a bit domain specific. If you were using Plots.jl and needed more flexibility for publication-quality plots, PGFPlotsX.jl can help. Or if you prefer grammar of graphics, AlgebraOfGraphics.jl is good. If you’re a stats person you may find Gadfly or VegaLite familiar, though I wouldn’t recommend them first because these don’t satisfy general user needs (try making a plot of an FEM output and see what I mean).

All of these are pretty good. You have a lot of options. In the end, pick the one that suits your needs best.


Integrating equation solvers with probabilistic programming through differentiable programming

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/integrating-equation-solvers-with-probabilistic-programming-through-differentiable-programming/

Part of the COMPUTATIONAL ABSTRACTIONS FOR PROBABILISTIC AND DIFFERENTIABLE PROGRAMMING WORKSHOP

Abstract: Many probabilistic programming languages (PPLs) attempt to integrate with equation solvers (differential equations, nonlinear equations, partial differential equations, etc.) from the inside, i.e. the developers of PPLs like Stan provide differential equation solver choices as part of the suite. However, as equation solvers are an entire discipline to themselves with many active development communities and subfields, this places an immense burden on PPL developers to keep up with the changing landscape of tens of thousands of independent researchers. In this talk we will explore how Julia PPLs such as Turing.jl support equation solvers from the outside, i.e. how the tools of differentiable programming allow equation solver libraries to be compatible with PPLs without requiring any co-development between the communities. We will discuss how this has enabled many advanced methods, such as adaptive solvers for stochastic differential equations and nonlinear tearing of differential-algebraic equations, to be integrated into the Turing.jl environment with no development effort required, and how this enables many workflows in scientific machine learning (SciML).


Direct Automatic Differentiation of (Differential Equation) Solvers vs Analytical Adjoints: Which is Better?

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/direct-automatic-differentiation-of-solvers-vs-analytical-adjoints-which-is-better/

Automatic differentiation of a “solver” is a subject with many details for doing it in the most effective form. For this reason, there are a lot of talks and courses that go into lots of depth on the topic. I recently gave a talk on some of the latest stuff in differentiable simulation with the American Statistical Association, and have some detailed notes on such adjoint derivations as part of the 18.337 Parallel Computing and Scientific Machine Learning graduate course at MIT. And there are entire organizations like my SciML Open Source Software Organization which work day-in and day-out on the development of new differentiable solvers.

I’ll give a brief summary of all my materials here below.

Continuous vs Discrete Differentiation of Solvers

AD of a solver can be done in essentially two different ways: either by applying automatic differentiation directly to the steps of the solver, or by defining higher-level adjoint rules that compute the derivative. In some cases these can be mathematically equivalent. For example, forward sensitivity analysis of an ODE $u' = f(u,p,t)$ follows by the chain rule:

$$\frac{d}{dp}\frac{du}{dt} = \frac{d}{dp}f(u,p,t) = \frac{\partial f}{\partial u}\frac{du}{dp} + \frac{\partial f}{\partial p}$$

Thus if you solve the extended system of equations:

$$u' = f(u,p,t)$$
$$s' = \frac{\partial f}{\partial u}s + \frac{\partial f}{\partial p}$$

then you get $s = \frac{du}{dp}$ as the solution to the new equations. Therefore, solve these bigger ODEs and you get the derivative of the solution with respect to the parameters as the extra piece. One way to do “automatic differentiation” is to add a derivative rule to the AD library that says “if you see an ODE solve, then replace the solve with this extended solve and take the latter part as the derivative”. The other way of course is to simply do forward-mode automatic differentiation of the ODE solver library’s steps themselves. It turns out that in this case, if you work out the math, the two are mathematically equivalent. Note that they are not computationally equivalent though, since the AD process may SIMD the expressions in a different way, doing constant folding and common subexpression elimination (CSE) differently from the hand-coded version, and thus the performance can be very different even though it’s mathematically the same algorithm.
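As a hedged sketch of the “differentiate the solver directly” route with forward-mode AD (the exponential-decay ODE and the tolerances here are illustrative assumptions, not from the original post):

```julia
# Forward-mode AD pushed directly through the solver's steps: ForwardDiff's
# dual numbers propagate through every internal operation of solve, which is
# mathematically equivalent to solving the forward sensitivity ODE above.
using OrdinaryDiffEq, ForwardDiff

f(u, p, t) = p[1] * u                          # toy ODE: u' = p₁ u

function final_state(p1)
    # build u0 from p1's type so dual numbers flow through the whole solve
    prob = ODEProblem(f, one(p1), (0.0, 1.0), [p1])
    solve(prob, Tsit5(), abstol = 1e-8, reltol = 1e-8).u[end]
end

s = ForwardDiff.derivative(final_state, 1.5)   # du(T)/dp₁ ≈ exp(1.5)
```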

However, there are cases where the “analytical” way of writing the derivative is not equivalent to its automatic differentiation counterpart. For example, the adjoint method is a different way to get $\frac{du}{dp}$ values in $\mathcal{O}(n+p)$ time (instead of the $\mathcal{O}(np)$ time of the forward sensitivities above) by solving an ODE forward and some related ODE backwards (for a full derivation and description, see the lecture notes or the recorded video). If you were to do reverse-mode automatic differentiation of the solver, you do not get a mathematically equivalent algorithm. For example, if the solver for the ODE was Euler’s method, reverse-mode AD would be mathematically equivalent to solving the forward ODE with Euler’s method and the reverse ODE with something like implicit Euler (where part of the implicit equation is solved exactly using a cached value from the forward solve).
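For the reverse-mode side, here is a hedged sketch of how the choice between a continuous adjoint and a discrete (differentiate-the-solver) adjoint is typically expressed in the Julia SciML stack; the toy ODE and the particular sensealg choices are illustrative assumptions:

```julia
# Reverse-mode gradients of an ODE solve via SciMLSensitivity: the `sensealg`
# keyword picks between a continuous adjoint ODE solved backwards and a
# discrete adjoint obtained by reverse-mode AD of the solver's steps.
using OrdinaryDiffEq, SciMLSensitivity, Zygote

f(u, p, t) = [-p[1] * u[1]]                    # toy decay ODE
prob = ODEProblem(f, [1.0], (0.0, 1.0), [1.5])

loss(p, sensealg) = sum(solve(prob, Tsit5(); p = p, saveat = 0.1,
                              sensealg = sensealg))

# Continuous adjoint: solve a related ODE backwards in time
g_cont = Zygote.gradient(p -> loss(p, InterpolatingAdjoint()), [1.5])[1]

# Discrete adjoint: reverse-mode AD through the solver's own steps
g_disc = Zygote.gradient(p -> loss(p, ReverseDiffAdjoint()), [1.5])[1]
```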

So What is Better, Continuous Derivative Rules or Discrete Derivatives of the Solver?

Like any complex question, it depends. We had a manuscript which looked at this in quite some detail (and a biologically-oriented follow-up), and can boil it down to a few basic notes:

  • Forward-mode outperforms reverse-mode / adjoint techniques when the equations are “sufficiently small”. For modern implementations this seems to be at around 100.
  • For forward-mode cases, “good” automatic differentiation libraries can make use of structure between the primal and derivative constructions to better CSE/SIMD the generated code for the derivative term, thus forward-mode AD of the solver can be much faster than forward sensitivity analysis even though the two are mathematically the same operation.
  • For reverse-mode cases, the continuous adjoints seem to be faster with current implementations.

But that last bit then has many caveats to put on it. For one, there seems to be a trade-off between performance and stability here. This is noted in the appendix of the paper “Universal Differential Equations for Scientific Machine Learning”, which states:

Previous research has shown that the discrete adjoint approach is more stable than continuous adjoints in some cases [53, 47, 94, 95, 96, 97] while continuous adjoints have been demonstrated to be more stable in others [98, 95] and can reduce spurious oscillations [99, 100, 101]. This trade-off between discrete and continuous adjoint approaches has been demonstrated on some equations as a trade-off between stability and computational efficiency [102, 103, 104, 105, 106, 107, 108, 109, 110]. Care has to be taken as the stability of an adjoint approach can be dependent on the chosen discretization method [111, 112, 113, 114, 115]

with the references pointing to those in the manuscript.

This is discussed in even more detail in the manuscript Stiff Neural Ordinary Differential Equations which showcases how there are many ways to implement “the adjoint method”, and they can have major differences in stability, essentially trading off memory or performance for improved stability properties.

Special Case: Implicit Equations

The above discussion shows that there are good reasons to differentiate solvers directly, and good reasons to instead write derivative rules for solvers which use forward/adjoint equations. For time series equations, this is always a trade-off. There is an important special case here, though: for methods which iterate to convergence, automatic differentiation of the solver is essentially never a good idea. The reason is that the implicit function theorem gives the derivative of the solution directly at the solution point. For example, for solving $f(x,p)=0$, if $x^\ast$ is the value of $x$ which satisfies the equation, then

$$\frac{dx^\ast}{dp} = -\left(\frac{\partial f}{\partial x}\right)^{-1}\frac{\partial f}{\partial p}.$$

In other words, Newton’s method might take $n$ steps, and thus automatic differentiation will need to differentiate $f$ at least $n$ times. But if you use the implicit function theorem result, then you only need to differentiate it once!

Note of course that a similar performance vs stability trade-off applies here, since this derivation assumes you have $x^\ast$ such that $f(x^\ast,p)=0$ exactly, but you don’t. Newton’s method from the solve will give you something that satisfies the equation only to tolerance, so maybe $f(x^\ast,p) \approx 10^{-8}$, which means that the derivative expression is also only approximate, and this then induces an error in the gradient, etc. Thus direct differentiation of Newton’s method can be more accurate, and you need to worry about tolerance here if the gradients seem sufficiently off.
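As a hedged illustration of the implicit function theorem trick versus differentiating the iteration directly, here is a small sketch; the scalar problem $f(x,p) = x^2 - p$ and the Newton implementation are made up for illustration:

```julia
# Compare: implicit-function-theorem derivative (differentiate f once at x*)
# vs forward-mode AD pushed through every Newton iteration.
using ForwardDiff

f(x, p) = x^2 - p            # solution x*(p) = sqrt(p)

function newton(p; x = 1.0, tol = 1e-10)
    for _ in 1:100
        dx = f(x, p) / ForwardDiff.derivative(x -> f(x, p), x)
        x -= dx
        abs(dx) < tol && break
    end
    x
end

p = 2.0
xstar = newton(p)

# Implicit function theorem: dx*/dp = -(∂f/∂x)⁻¹ ∂f/∂p, one differentiation of f
dfdx = ForwardDiff.derivative(x -> f(x, p), xstar)
dfdp = ForwardDiff.derivative(p -> f(xstar, p), p)
dxdp_ift = -dfdp / dfdx                  # ≈ 1 / (2*sqrt(2))

# Direct AD of the solver: differentiates f at every Newton iteration
dxdp_ad = ForwardDiff.derivative(newton, p)
```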

This does lead to some counter-intuitive results. For example, we had a paper where we exploited this to note that differentiating an ODE solve which goes to infinity (steady state) is faster than differentiating a “long ODE”, since steady states have a similar implicit definition. It’s quicker to go to infinity than it is to go to 1000, who would’ve thought? Math is fun.

Does Differentiation of Solver Internals Make Sense or Have a Meaning?

“ODE solvers” have all sorts of things in there, like adaptivity parameters and heuristics. One of the things that happens when you do automatic differentiation of the solver is that you aren’t just differentiating the solver’s states and parameters: the process will differentiate everything. It turns out that AD of a solver can thus be useful in some tricky ways which put this to use. For example, at ICML we had a paper which regularized the parameters of a neural ODE by the sum of the computed error estimates of the adaptivity heuristics. This would then push the learned equation towards an area of parameter space where the adaptivity gives the largest time steps possible, and thus the learned equation is the “fastest possible learned equation that fits the time series”. Such a trick is only possible if you are doing automatic differentiation of the solver, since you’d need to differentiate the solver’s internals in order to have access to those values in the loss function! This just shows one of many ways in which AD’s “extra information”, which analytical continuous derivative definitions don’t have, could potentially be useful for some applications.

Automatic Differentiation in Continuous Sensitivity Methods

Finally, I want to note that even when you attempt to avoid automatic differentiation of the solver by using continuous sensitivity methods, it turns out that the optimal way to build the extended equations is to use automatic differentiation!

Summary: there are many good reasons to do automatic differentiation of solvers, but there are also many good reasons to use some analytical derivative techniques. But even if you do use analytical derivative techniques, you still want to automatically differentiate something in order to do it optimally!

For example, let’s return to the forward sensitivity equations:

$$u' = f(u,p,t)$$
$$s' = \frac{\partial f}{\partial u}s + \frac{\partial f}{\partial p}$$

It turns out that computing $\frac{\partial f}{\partial u}s$ does not require computing the full Jacobian. This operation, known as a Jacobian-vector product or jvp, is the primitive operation of forward-mode automatic differentiation, and thus special seeding of a forward-mode AD tool gives a faster and more robust algorithm than a finite difference approximation. When done correctly, this operation is computed without ever building the full Jacobian. A trick for this does exist in the finite difference sense as well:

$$\frac{\partial f}{\partial u}s \approx \frac{f(u+\epsilon s) - f(u)}{\epsilon}$$

since it is equivalent to the directional derivative. This is explained in more detail in these lectures (or accompanying video).
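Here is a minimal sketch of computing that jvp via forward-mode seeding, next to the finite-difference directional derivative it replaces; the toy right-hand side is an illustrative assumption:

```julia
# A Jacobian-vector product (∂f/∂u)·s computed without building the Jacobian,
# by seeding forward mode along the direction s. This is exactly the
# directional derivative that the finite-difference formula approximates.
using ForwardDiff

f(u) = [u[1]^2 + sin(u[2]), 3u[1] * u[2]]    # toy right-hand side
u = [1.0, 2.0]
s = [0.5, -1.0]

jvp = ForwardDiff.derivative(ϵ -> f(u .+ ϵ .* s), 0.0)

# Finite-difference version of the same directional derivative (less robust):
ϵ = sqrt(eps())
jvp_fd = (f(u .+ ϵ .* s) .- f(u)) ./ ϵ
```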

In the same vein, continuous adjoints of ODE solves boil down to defining a differential equation which is solved backwards, and that backwards differential equation has a term of the form $\frac{\partial f}{\partial u}^T s$, i.e. the Jacobian transposed times a vector, also known as the vector-Jacobian product (vjp) because it’s equivalent to $s^T \frac{\partial f}{\partial u}$ when transposed. It turns out that this is the primitive operation of reverse-mode AD, which then allows for computing this operation without ever building the full Jacobian. There is no analogue for this operation with finite differencing, which means that there’s a pretty massive performance gain from doing this properly. Our paper A Comparison of Automatic Differentiation and Continuous Sensitivity Analysis for Derivatives of Differential Equation Solutions measures this effect on a stiff partial differential equation (see the benchmark figure in that paper).
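To make that vjp primitive concrete, here is a hedged sketch using the same toy right-hand side as above:

```julia
# A vector-Jacobian product sᵀ(∂f/∂u) via reverse-mode AD. The pullback
# returned by Zygote applies Jᵀs directly, without ever forming the Jacobian;
# this is the primitive that continuous adjoint methods need at each step.
using Zygote

f(u) = [u[1]^2 + sin(u[2]), 3u[1] * u[2]]    # toy right-hand side
u = [1.0, 2.0]
s = [0.5, -1.0]

y, back = Zygote.pullback(f, u)
vjp = back(s)[1]                             # equals transpose(Jacobian) * s
```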

The takeaway from that benchmark is that using these AD tricks results in a few orders of magnitude performance improvement by avoiding the Jacobian construction (the “seeding” versions on the left of the figure; the right shows the difference that different AD techniques make, which itself is another few orders of magnitude). When people note that the Julia differential equation adjoint solvers are much faster than the adjoints from Sundials’ CVODES and IDAS on large equations, this part right here is one of the major factors, because Sundials does not embed a reverse AD engine into its adjoint code to do the vjp definitions, and instead falls back to a numerical formulation unless the user provides a vjp override, which seems to be uncommon in practice but, from these benchmarks, clearly should be done more often.

Summary

In total, what can we take away so far about differentiating solvers?

  • There are some advantages to differentiating solvers, but there are also some advantages to mixing in analytical continuous adjoints. It’s context-dependent which is better.
  • Even when mixing in analytical continuous derivative rules, these are best defined with automatic differentiation within their constructed equations, so one cannot avoid AD completely if one wishes to achieve full performance on arbitrary models.
  • For cases which converge to some kind of implicitly defined solution, using special adjoint tricks will be much better than direct differentiation of the solver.

There’s still a lot more to mention, especially as stochastic simulation gets involved, but I’ll cut this here for now. As you can see, there’s still some open questions that are being investigated in the field, so if you find this interesting please feel free to get in touch.
