Author Archives: Christopher Rackauckas

The Essential Tools of Scientific Machine Learning (Scientific ML)

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/the-essential-tools-of-scientific-machine-learning-scientific-ml/

Scientific machine learning is a burgeoning discipline which blends scientific computing and machine learning. Traditionally, scientific computing focuses on large-scale mechanistic models, usually differential equations, that are derived from scientific laws that simplify and explain phenomena. On the other hand, machine learning focuses on developing non-mechanistic data-driven models which require minimal knowledge and prior assumptions. The two sides have their pros and cons: differential equation models are great at extrapolating, the terms are explainable, and they can be fit with small data and few parameters. Machine learning models on the other hand require “big data” and lots of parameters, but are not biased by the scientist’s ability to correctly identify valid laws and assumptions.

However, the recent trend has been to merge the two disciplines, allowing explainable models that are data-driven, require less data than traditional machine learning, and utilize the knowledge encapsulated in centuries of scientific literature. The promise is to fuse a priori domain knowledge which doesn’t fit into a “dataset”, allow this knowledge to specify a general structure that prevents overfitting, reduces the number of parameters, and promotes extrapolatability, while still utilizing machine learning techniques to learn specific unknown terms in the model. This has started to be used for outcomes like automated hypothesis generation and accelerated scientific simulation.

The purpose of this blog post is to introduce the reader to the tools of scientific machine learning, identify how they come together, and showcase the existing open source tools which can help one get started. We will be focusing on differentiable programming frameworks in the major languages for scientific machine learning: C++, Fortran, Julia, MATLAB, Python, and R.

We will be comparing two important aspects: efficiency and composability. Efficiency will be taken in the context of scientific machine learning: by now most tools are well-optimized for the giant neural networks found in traditional machine learning, but, as will be discussed here, that does not necessarily make them efficient when deployed inside of differential equation solvers or when mixed with probabilistic programming tools. Additionally, composability is a key aspect of scientific machine learning since our toolkit is not ML in isolation. Our goal is not to do machine learning as seen in a machine learning conference (classification, NLP, etc.), and it’s not to do traditional machine learning as applied to scientific data. Instead, we are putting ML models and techniques into the heart of scientific simulation tools to accelerate and enhance them. Our neural networks need to fully integrate with tools that simulate satellites and with robotics simulators. They need to integrate with the packages that we use in our scientific work for verifying numerical accuracy, tracking units, estimating uncertainty, and much more. We need our neural networks to play nicely with existing packages for delay differential equations or reconstruction of dynamical systems. Otherwise we need to write the entire toolchain from scratch! While writing a neural network framework may be a good undergraduate project with modern tools, writing a neural network framework plus adaptive stiff differential equation solvers plus a robotics simulation platform plus automatic differentiation plus probabilistic programming tooling plus … is infeasible. Instead of trying to reinvent 60 years of scientific computing for every new scientific ML project, we need some way to make use of existing tools within the domain contexts!

Quick Summary Table

The following is a quick summary table of the tools to check out in each of the languages, with a color indicator for the efficiency and composability of these components.

[Table image: a summary of the scientific ML tooling in each language, with efficiency and composability indicators]

Note that these indicators are for the use case of scientific ML, which is described in detail below, and not indicative of the support these AD systems give in traditional machine learning tasks.

Overview of Scientific Machine Learning and Scientific AI

First let’s give a quick overview of scientific machine learning. There are three overview resources that I would point to. For a broad overview of the topic in a position paper, the technical report from the workshop on “Basic Research Needs for Scientific Machine Learning” is a good overview of what people care about in the field. The ICERM Scientific Machine Learning Workshop page has quite a few videos on the topic. And lastly, I myself have a video giving an overview of different applications of scientific ML and scientific AI:

From these resources you can see that there are some common questions which come up across the field.

Some of these questions are starting to get answers, others are still just questions. The key idea behind scientific machine learning is that applications of machine learning need to be approached from the context of the scientific domains with their existing tools and knowledge. While training a neural network to identify fraudulent credit card transactions is the end of the story (you have a good predictor, yay!), in scientific ML making a prediction is just one step among many in the scientific process.

For example, in systems biology and quantitative systems pharmacology, the ordinary differential equation models encode the known structure of the chemical reaction networks. To a biologist or pharmacologist, the Oregonator system:

\begin{align}
\frac{dx}{dt} &= s(y - xy + x - qx^2)\\
\frac{dy}{dt} &= (-y - xy + z)/s\\
\frac{dz}{dt} &= w(x - z)
\end{align}

is saying that protein z is upregulated by x and has linear decay. There are many possible ways to predict a time series, but this is the one generated from first principles of the chemical reaction network, and it will do well in extrapolating to areas where you don’t have data because it has a basis in what we already know about how these biological systems work. While a neural network can be trained to predict the time series on the data it is trained on, it may not have the ability to predict bifurcations, where the dynamics occur completely differently in regimes where we don’t have data: and prediction of bifurcations is the bread and butter of mathematical biology.
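
To make this concrete, here’s a minimal sketch of simulating the Oregonator above with DifferentialEquations.jl (using the classical parameter values; the initial condition is illustrative):

using DifferentialEquations

# Oregonator right-hand side, matching the equations above.
function oregonator!(du,u,p,t)
  x, y, z = u
  s, q, w = p
  du[1] = s*(y - x*y + x - q*x^2)
  du[2] = (-y - x*y + z)/s
  du[3] = w*(x - z)
end

p = (77.27, 8.375e-6, 0.161)  # classical (s, q, w) values
prob = ODEProblem(oregonator!, [1.0,2.0,3.0], (0.0,360.0), p)
sol = solve(prob, Rodas5())   # the Oregonator is stiff, so use a stiff solver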

Also, a trained neural network doesn’t necessarily spit out hypotheses. When doing an investigation of pharmacological or climate models, one develops sets of differential equations and showcases how a given term is necessary for specific behavior such as oscillations to occur. Sometimes, dynamical outcomes can predict the existence of some unknown chemical factor since other feedbacks may be required for a differential equation to exhibit some global geometric result. So okay, you trained a neural network so it matches the timeseries… does it say that we’re missing some important chemical species? Does it say that a critical transition will occur if the climate enters a new regime (where we have no data)?

For a long time scientific computing has kept machine learning at an arm’s length because its lack of interpretability and structure means that, despite its tremendous predictive success, it is almost useless for answering these kinds of central scientific questions. While it could be used to great effect in some scenarios, predicting more answers like the data you’ve already seen is not scientifically interesting in many cases. Science requires understanding and extrapolation beyond the familiar. Also, the idea of using a system starting from a random configuration that goes to a local optimum is a little disconcerting. However, a confluence of events is quickly leading to a change of heart. The key has not been “let’s throw a neural network on this since it’s a universal approximator” like some early methods showcased (indeed, this led to new but very inefficient ways of doing something that people had well-established tools for, so you can understand why the hype died down). Rather, the key has been to re-envision how to utilize a universal approximator within the context of tools and theories that were already being used in scientific computing. My favorite example is this paper which showcases how to solve high dimensional partial differential equations using backwards SDEs which are parameterized by neural networks. While there was some work before under the name latent differential equations, the 2018 NeurIPS best paper on neural ordinary differential equations really sparked a surge in thinking about directly learning differential equations, or pieces of differential equations. While the neural network itself may not be interpretable, if it learns a differential equation and differential equations have an interpretation, then it would seem that we have a machine-learning generated scientific hypothesis.

So, this is where the trend has taken us. Neural networks are being used in the context of ordinary and partial differential equations, and these are being mixed with probabilistic programming approaches to quantify uncertainties and are being slammed into toolchains which require differentiable programming to generate the gradients, and…

As you can see, tooling is getting complicated. And tooling is what needs to be discussed.

The Tools of Scientific Machine Learning

So, what exactly are the computational tools which are utilized in this burgeoning field of scientific machine learning? Since the field is still being developed it’s hard to pinpoint exactly what people are focusing on, but there are a few trends. Two of the big pieces are a neural network framework, and libraries for numerical differential equations. The use of a neural network framework is obvious: the whole point is training neural networks in many contexts. By far the most common context is a differential equation, hence the tooling for discretizing and solving differential equations is necessary. This involves tools such as solvers for ordinary and stochastic differential equations, tools for discretizing PDEs with finite difference, finite volume, finite element, and pseudospectral discretizations. The existence of PDEs just begs for sparse linear algebra tooling. Construction of sparse Jacobians for the Newton methods within implicit schemes calls for color differentiation mixed with sparse or compressed factorization methods.

And all of this needs to play nicely with the automatic differentiation tools used for the differentiable programming and probabilistic programming frameworks of the training process. That is the key issue: just because some old FORTRAN code solves a PDE well isn’t the end of the story anymore: if you cannot find a way to compute the derivatives of it then gradient descent won’t work!

That’s where our tooling list comes in. Partial differential equations are computationally difficult to solve. Neural networks are computationally difficult to train. Partial differential equations with neural networks inside of them are very difficult to do anything with. So to get anywhere, we need to have the most advanced methods for solving PDEs be compatible with our neural network frameworks. We need to make sure both methodologies are utilizing state-of-the-art techniques with efficient implementations, and that they can be correctly and efficiently composed. The types of tools which are necessary for large-scale scientific ML are:

  1. Tooling for solving neural differential equations
  2. Differentiable programming (automatic differentiation) tools
  3. Probabilistic programming tools to learn uncertainty from data
  4. Helper tools for sparsity detection and sparse differentiation
  5. Structured linear algebra tools
  6. Number types for mixed precision arithmetic
  7. Methods for discretizing partial differential equations
  8. Tools for generating and utilizing GPU kernels
  9. Uncertainty quantification and global sensitivity analysis
  10. Surrogate modeling techniques

Example Challenge Problem: Natural Language Processing + PDE Construction

To motivate the tooling that is needed, let’s set the focus by picking an example challenge problem. A recent funding call asked for:


The Defense Advanced Research Projects Agency (DARPA) Defense Sciences Office (DSO) is requesting information on state-of-the-art approaches to generate multi-physics modeling and simulation codes directly from a description of the physical phenomena. Of interest are modeling and simulating increasingly complex systems involving multiple physics that require high fidelity simulations but have limited test data (e.g., combustion, hypersonics, nuclear stockpile).

One way to approach this with scientific ML would be to do the following:

  1. Build a Natural Language Processing (NLP) stack that interprets text into PDEs
  2. Autodiscretize and solve the PDE
  3. Write a loss function which checks the PDE solution against data
  4. Add regularization based on the global sensitivity and uncertainty of the solution

If you’ve been following recent advancements in the field of automated model building, you’ll see that such ideas are not that farfetched anymore. In fact, this is somewhat akin to recent differentiable rendering systems which are being tested for automatically learning an environment model simultaneously with training a control circuit. Here, we are just training an NLP method to understand some scientific text simultaneously with improving its predictive ability through its PDE simulations.

However, just like in the case of differentiable rendering, we will need to make each of the “layers” differentiable: we will need to compute derivatives of the global sensitivity, uncertainty, the PDE’s solution, and the PDE’s generation. This means that while there may be great libraries available for each of these tasks, to arrive at our overall goal all of the components will need to be compatible with our chosen automatic differentiation (/differentiable programming) framework, and that is the most difficult part.

Let’s dig in.

Comparison of Scientific Machine Learning Packages and Tools

The Automatic Differentiation (Differentiable Programming) Frameworks

The dividing factor for scientific machine learning frameworks is not the language. Rather, it’s the differentiable programming or automatic differentiation framework which is utilized. For those who aren’t familiar, automatic differentiation is an umbrella term for a variety of techniques for efficiently computing accurate derivatives of more or less general programs. It is employed by all major neural network frameworks since a single reverse-mode AD backpass (also known as a backpropagation) can compute a full gradient, whereas numerical differentiation would take many forward passes and symbolic differentiation is simply untenable due to expression explosion.
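
To make the scaling difference concrete, here’s a minimal sketch comparing the two modes (ForwardDiff.jl and Zygote.jl are stand-ins for any forward- and reverse-mode tool):

using ForwardDiff, Zygote

# A scalar loss of many inputs: the common shape of an ML training problem.
loss(w) = sum(abs2, sin.(w) .- 0.5)

w = randn(1000)
g_rev = Zygote.gradient(loss, w)[1]     # one reverse sweep gives the full gradient
g_fwd = ForwardDiff.gradient(loss, w)   # cost scales with the number of inputs
g_rev ≈ g_fwd                           # both give the same derivatives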

Thus, the key to making a scientific ML stack work is making every component compatible with the AD system. This is because, if there is just one part of your loss function that isn’t AD-compatible, then the whole network won’t train. When this happens, you the user have three options:

  • Beg the developers of the framework to add the package as a dependency and define the adjoints
  • Define the adjoints yourself
  • Rewrite the package utilizing the tools of the AD framework

For this reason, neural network frameworks like TensorFlow have painstakingly made sure that their frameworks cover all of the standard ML tasks. However, since we are not only doing machine learning but also incorporating scientific computing, there are a lot of methods and packages that we will want to use which are simply not AD-compatible or have never been set up to be compatible with the AD framework. This is by far the most difficult problem for the practice of scientific machine learning.

The term differentiable programming has been adopted to describe AD systems which attempt to support an entire programming language. Each of the differentiable programming systems has had varying degrees of success with language coverage. Early AD systems like ADIFOR (Fortran), TAF (Fortran), and ADOL-C (C++) achieved high coverage of their base languages, but did not attempt to cover a large set of third-party packages. A lot of these frameworks are very efficient and can do source-to-source differentiation, which builds a new source program to compile and thus has very little overhead. This is very helpful for scientific machine learning since many nonlinear functions, like ODE definitions, can only be described in a highly scalar fashion, meaning that source-to-source is efficient for this use case. (ADOL-C also has a tracing mode, which is much slower than its source-to-source mode.)

The next generation of AD systems were domain-specific. Stan is a probabilistic programming language for Bayesian estimation which includes a robust AD system. TensorFlow is a well-known AD system for building computational graphs for neural networks and machine learning. By restricting to a specific domain, these systems were able to achieve very good results in their respective areas, and over time have grown their support into other domains.

The third generation of AD systems attempted to improve upon TensorFlow and bring machine learning AD to a language level. Examples of this are PyTorch and Flux.jl‘s Tracker.jl, which use tracing to generate a local computational graph to backpropagate through. These systems will work on any code for which adjoints have been defined for all of their operations. PyTorch simulates differentiable programming by having an internal module with pretty good coverage of numpy, meaning that lots of numpy code can be ported over by changing the underlying “numpy module” that is used. However, this does mean that you cannot take arbitrary packages off of pip and expect PyTorch to work on them. This is especially an issue since a lot of Python packages are written with C++ extensions or using Cython, meaning that PyTorch will not be directly compatible with them without a lot of work. On the other hand, Flux.jl / Tracker.jl ties into Julia’s multiple dispatch system, which allows it to directly work on any pure-Julia package with sufficiently generic code. The reason is simple to explain through an example. In Julia, there is only one `*` function, which stands for matrix multiplication, that everyone extends. `*` between different types calls different methods (different implementations), but it is still the same function. For example, Array*Array is standard dense multiplication defined in Julia’s Base, while Elemental.Matrix*Elemental.Matrix would use the MPI-compatible Elemental.jl distributed linear algebra library. Thus one definition of the adjoint for `*` applies to all libraries which implement new types and new matrix multiplications, as long as they conform to the system. Since the differentiation rules for the Julia AD systems are defined in common libraries, like DiffRules.jl (with the next generation being ChainRules.jl), all of the Julia AD packages, like ForwardDiff.jl and ReverseDiff.jl, have this same property. Thus while Python packages might choose which AD system to support, packages in Julia choose to support AD and then allow the various AD systems to compete.
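
Here is a minimal sketch of that dispatch property, using Zygote.jl and a structured matrix type from the standard library as a stand-in for a third-party package:

using Zygote, LinearAlgebra

loss(M, x) = sum(M * x)   # generic code: works for any type with a `*` method
x = rand(3)

A = rand(3, 3)            # dense Array: Base's `*` method
D = Diagonal(rand(3))     # structured type: its own `*` method

# The single adjoint rule for `*` covers both calls through dispatch.
Zygote.gradient(x -> loss(A, x), x)[1]
Zygote.gradient(x -> loss(D, x), x)[1]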

However, tracing-based AD systems have a very high overhead since their computational graph changes every time the values change (meaning they have to compile a new backpass each time, or worse, are interpreted), and they have to generate such a computational graph (meaning tons of allocations!). This is not a problem for traditional machine learning since the cost of a single matrix multiplication would dwarf the overhead. However, if highly scalar code shows up in the pipeline, such as the definition of a nonlinear ODE for a chemical reaction network for a pharmacometric system, this overhead starts to dominate the run time.

To handle this problem, differentiable programming frameworks have gone back to the history books, and the newest versions utilize source-to-source transformations instead of tracing. In this category there are Zygote.jl and Swift for TensorFlow. While Zygote is fairly new, the property of Julia AD support mentioned before means that many packages already have a good degree of Zygote support simply by utilizing the multiple dispatch mechanism. Thus Zygote has already been used to showcase scientific machine learning applications like quantum machine learning and neural stochastic differential equations. While theoretically Swift for TensorFlow could do the same, I am not aware of anything like a Swift finite element method PDE discretization library, and thus while it’s turning into a great AD system, it doesn’t currently have enough scientific ML support to warrant further consideration at the moment. It’s possible that Swift for TensorFlow will be successful enough as an AD system that people will start to build scientific libraries in Swift, but we won’t know for at least another five years.

Those are the differentiable programming frameworks that we will explore in more detail. There are other good AD systems, but they tend to lack the support for both scientific computing and machine learning required to complete scientific ML tasks. For example, there are some good MATLAB AD tools that you can find (and some support automatic sparsity detection). However, these tools do not have the kind of neural network and GPU support necessary, for example, to use a convolutional neural network alongside a PDE. On the other hand, there are some AD systems like JAX which are great for traditional machine learning but which, to my knowledge, no one has built any PDE solver libraries on top of (again, this is not a post about traditional ML). I’ll also throw a shoutout to packages like CasADi and DAEPACK which are good AD systems with differential equations, but do not integrate with larger package ecosystems. Tapenade (Fortran) is another great AD system which comes to mind.

Neural Network Support

The first thing to discuss is neural network support. It’s not surprising that the AD systems which were built for traditional machine learning, like TensorFlow, Flux.jl (Julia), and PyTorch, have very complete support in this area. Convolutional layers, LSTMs, batch normalization, etc. are all readily available with all of the goodies like GPU support. If you really wanted to pick a winner here, PyTorch probably has the most complete support right now, but the other two are not far behind and developments occur daily.

The non-ML AD frameworks have a surprising lack of neural network support. Before digging into this, I would have guessed that someone would have made a neural network framework for ADIFOR or ADOL-C. However, my Google skills could not find any. That said, it’s not very difficult to roll your own dense layers (W*x + b), as the sketch below shows. The limitation comes when you want to start using convolutional neural networks or transformers, where you’d need to write some significant CUDA code to get there, while the other frameworks give it to you in one call. I would like to see the upstart frameworks like neural-fortran and OpenNN get some integration with these big AD systems. Another surprising tale is that Stan hasn’t received much neural network support yet. I say yet because there are some early results trying to integrate Stan with PyTorch. It would be really beneficial for Stan’s scientific ML story if that effort was completed and pulled into the standard Stan build.
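
For instance, a hand-rolled dense layer is just a few lines of generic code that any of the AD systems above could differentiate (a sketch; `DenseLayer` is a hypothetical name to avoid clashing with framework types):

# A minimal dense layer: σ.(W*x .+ b). Nothing framework-specific here,
# so any AD system that handles `*`, `+`, and broadcast can train it.
struct DenseLayer{M,V,F}
  W::M
  b::V
  σ::F
end
DenseLayer(in::Integer, out::Integer, σ=identity) = DenseLayer(0.1randn(out,in), zeros(out), σ)
(d::DenseLayer)(x) = d.σ.(d.W * x .+ d.b)

layer = DenseLayer(10, 5, tanh)
layer(rand(10))  # forward pass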

Summary: TensorFlow, Flux.jl, and PyTorch lead the pack quite a bit here.

Neural Differential Equations

Scientific computing has a lot of differential equations. Ordinary differential equations are a major topic of their own, with many scientific laws described in their language. Partial differential equations with a time component are solved by discretizing down to a set of ODEs to be solved. If there are constraints on the ODE, such as conservation of energy, then one has a differential-algebraic equation, and PDEs with some boundary conditions discretize down to DAEs as well. In other cases when there is randomness that is modeled, like in biological or financial models, stochastic differential equations (SDEs) are used. If the reactions are not instantaneous, then delay differential equations are used.

If you do scientific computing, you know this. These models are pervasive, and thus support for all of these kinds of models is essential for doing scientific ML. However, differential equations only recently got on the minds of those in ML. The method of neural differential equations is very interesting though, because there are two very different use cases for it. The best paper at NeurIPS 2018 showcased how neural ODEs could be used for standard ML classification problems, while here we want to do things like automated discovery of dynamical equations. This may seem like a detail, but it has a profound effect on the tools and methods which are necessary. From our studies (and those of folks in David Duvenaud’s lab) with neural differential equations for ML, it seems that the learned functions tend to be non-stiff and “fairly reversible” (this is mentioned at the end of this video). Thus the tools built for neural ODEs in traditional ML, like torchdiffeq, work for that domain. However, it has been pointed out multiple times that the adjoint method it uses is generally unstable. Additionally, torchdiffeq only has non-stiff ODE solvers, which makes them unsuitable for learning the dynamics of many equations due to a lack of stability. Thus, while torchdiffeq is a fantastic package for its domain, its domain is not scientific ML.

A comprehensive overview of software for differential equations can be found here. If you’re using Julia, there already exists DiffEqFlux.jl, which supports neural ODEs, neural SDEs, neural DDEs, neural PDEs, etc., the whole gamut of neural differential equations, using the highly efficient DifferentialEquations.jl (a minimal sketch is below). It can also be used from R and Python, and its sensitivity analysis could be used to plug into systems in these other languages. If you’re using C++ or Fortran and only need ODEs/DAEs, then Sundials is a good choice. While it’s not directly integrated with ADOL-C, TAF, or ADIFOR, this library comes with adjoint sensitivity analysis functions that would make it easy to define the right gradients to hook into the system. In fact, Stan does exactly this to give ODE support (but not DAEs). A lesser known choice is FATODE, which also defines adjoints, so it could also plug into AD systems, but it only supports ODEs.
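
As a taste, a neural ODE in the DiffEqFlux.jl interface of this era is only a few lines (a sketch; layer sizes and time points are placeholders):

using Flux, DiffEqFlux, OrdinaryDiffEq

# A neural network defines the right-hand side du/dt = NN(u).
dudt = Chain(Dense(2,16,tanh), Dense(16,2))
tspan = (0.0f0, 1.5f0)
t = range(tspan[1], tspan[2], length = 30)

# neural_ode solves the ODE and backpropagates through the solver.
n_ode(x) = neural_ode(dudt, x, tspan, Tsit5(), saveat = t)
pred = n_ode(Float32[2.0, 0.0])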

Summary: Julia has expansive neural differential equation support with DiffEqFlux.jl, Stan has some ODE support, while the other systems leave you to your own devices. However, DifferentialEquations.jl, Sundials, and FATODE could be used to give existing AD systems support for ODEs and DAEs.

Probabilistic Programming

Probabilistic programming is the fancy differentiable programming word for Bayesian estimation of parameters. Instead of optimizing and getting point estimates for the best parameters, you get a posterior distribution. This lets you generate a fit with some uncertainty, which is necessary for the uncertainty quantification in the discussed challenge problem.

Probabilistic programming comes in two major flavors: Monte Carlo based methods and variational inference methods. Monte Carlo methods use a stochastic algorithm to approximate a distribution asymptotically, while variational inference methods write out the prior in some basis and push through the basis elements to get the posterior written in said basis. MCMC is generally used for problems with fewer parameters and variational inference for those with more, but there are times when it’s worthwhile to switch it up. Stan is the undeniable leader in probabilistic programming with MCMC, being the first big program to make use of Hamiltonian Monte Carlo (HMC) and the No-U-Turn Sampler (NUTS) for very robust automatic tuning of hyperparameters. This makes Stan seem to “just work” in a way MCMC usually doesn’t. However, other ecosystems are now catching up, with Turing.jl in Julia and PyMC3 being two very good systems. While PyMC3 is built on Theano and thus not compatible with these AD systems, the experimental PyMC4 is a very promising system for TensorFlow. The new Gen system in Julia takes an interesting approach by being more flexible and less automatic, which can be helpful in the most difficult cases. These systems all seem to be in the higher level languages, so I could not find very many in Fortran or C++ (just CPProb).
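
To give a flavor of the workflow, here’s a minimal Turing.jl sketch (in current Turing.jl syntax; the model, priors, and synthetic data are all illustrative):

using Turing

# Infer the decay rate of an exponential model from noisy observations.
@model function decay(ys, ts)
  k ~ truncated(Normal(1.0, 0.5), 0, Inf)   # prior on the rate
  σ ~ truncated(Normal(0.0, 0.1), 0, Inf)   # observation noise
  for i in eachindex(ts)
    ys[i] ~ Normal(exp(-k * ts[i]), σ)
  end
end

ts = collect(0.0:0.1:2.0)
ys = exp.(-1.5 .* ts) .+ 0.05 .* randn(length(ts))
chain = sample(decay(ys, ts), NUTS(), 1000)   # HMC/NUTS posterior samples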

For variational inference, Pyro (built on PyTorch) seems to be at the head of the pack for Bayesian neural networks, with Edward (TensorFlow) being another good choice. Gen in Julia is a recent addition with variational inference as well.

Summary: TensorFlow, PyTorch, and Julia have some good options for probabilistic programming. It will be interesting to see how this space keeps evolving.

Automated Sparsity Detection and Sparse Differentiation

Partial differential equations essentially always give an ODE discretization where the Jacobian is sparse, or give a nonlinear rootfinding problem where the Jacobian is sparse. Either way, using this sparsity is required in order to do anything efficient in this domain. Thus, it’s no surprise that the AD systems which were built to handle scientific computing, such as TAF and ADOL-C, come with automated sparsity detection and sparse differentiation as standard features. Both of these make use of ColPack for the matrix coloring portion. A recent addition to the Julia ecosystem is SparsityDetection.jl and SparseDiffTools.jl, which give similar automated sparsity detection, matrix colorization, and integration with AD. My Google-fu could not find such support in TensorFlow, PyTorch, Stan, JAX, Tapenade, or ADIFOR (this last one was surprising). Since traditional neural networks are never sparse, and newer sparse applications like graph neural networks have a sparsity pattern that you know in advance, it makes sense that machine learning libraries never cared about this. But it’s something we do miss when doing scientific ML.
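
Here is a sketch of that workflow with SparsityDetection.jl and SparseDiffTools.jl (the operator is a toy 1D Laplacian stencil):

using SparsityDetection, SparseDiffTools, SparseArrays

# A tridiagonal stencil, the typical shape of a PDE discretization.
function f!(du,u)
  n = length(u)
  du[1] = -2u[1] + u[2]
  for i in 2:n-1
    du[i] = u[i-1] - 2u[i] + u[i+1]
  end
  du[n] = u[n-1] - 2u[n]
end

u = rand(32); du = similar(u)
pattern = jacobian_sparsity(f!, du, u)      # automated sparsity detection
J = Float64.(sparse(pattern))
colors = matrix_colors(J)                   # graph coloring: 3 colors here
forwarddiff_color_jacobian!(J, f!, u, colorvec = colors)  # 3 AD passes, not 32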

Summary: TAF, ADOL-C, and the Julia AD tools are the ones which showcase sparsity support.

GPU Support

While the scientific computing ecosystems like Fortran, C++, and Julia picked up a few wins with differential equations and sparsity, the machine learning ecosystems like TensorFlow, PyTorch, and Julia are where a lot of the nice tooling for GPUs exists. High level functions let users define efficient kernels directly in Julia and Python, and these will then be automatically compatible with AD. In Fortran or C++, you’re back to the same old story of writing a CUDA kernel yourself, and then having to define adjoint functions for the AD to work. This is still better than Stan, which seems to just have OpenCL support. While I get the appeal of not being vendor-locked to NVIDIA, sources routinely show that CUDA libraries like cuDNN are much more efficient than their OpenCL counterparts, making OpenCL not the best choice for scientific ML, which is already heavy on compute.
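
For example, in Julia an arbitrary scalar function broadcast over a GPU array compiles to a single fused kernel, with no hand-written CUDA and no extra adjoint definitions (a sketch using CuArrays.jl, the CUDA array package of the time):

using CuArrays

# A user-defined scalar nonlinearity...
f(x) = x * tanh(3x)

u = CuArray(rand(Float32, 10^6))
v = f.(u)   # ...broadcast fuses into one GPU kernel, and AD can differentiate it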

Summary: TensorFlow, PyTorch, and Julia have good tooling that will work with AD. C++ and Fortran you can of course use directly with CUDA, but it’s BYOA (bring your own adjoint).

Distributed Dense, Structured, and Sparse Linear Algebra

Once again, we have a similar language divide. In Fortran and C++ we have tools like ScaLAPACK, PARASOL, PETSc, Trilinos, and Elemental. While these are not necessarily baked into the AD systems, defining the adjoints for matrix multiplication and linear solving is pretty straightforward, so that’s not really a drawback here. Fortran, with its scientific computing focus, has the library SPARSKIT, which has some nice structured matrix support as well.
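
For reference, the standard reverse-mode rules one would register are (in the usual “bar” notation for adjoints):

\begin{align}
C &= AB: & \bar{A} &= \bar{C}B^{T}, & \bar{B} &= A^{T}\bar{C},\\
x &= A^{-1}b: & \bar{b} &= A^{-T}\bar{x}, & \bar{A} &= -\bar{b}x^{T}.
\end{align}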

On the other hand, TensorFlow and PyTorch have very good distributed linear algebra support, except they leave off factorizations here and there. torch.distributed and TensorFlow sparse both seem to leave off LU and QR factorizations from the list of things to support. Once again, this is fine for traditional ML, which would almost never use these, but when writing a PDE solver these will come up quite often. Julia stands in an interesting position here since, due to the way its AD overrides work, the Elemental.jl library automatically works for distributed and sparse linear algebra. This is currently a much better solution than the pure-Julia DistributedArrays.jl, though this space may get a lot of development in the near future.

The piece that is quite unique to Julia here is BandedMatrices.jl and BlockBandedMatrices.jl, which are specialized libraries for these structured matrix types. It just so happens that many discretizations of systems of PDEs give you block banded matrices, and these libraries give a pretty significant speedup over using just a sparse matrix. Thus, these are really nice libraries to have around (and they use `*` and `\`… so they’re AD compatible).
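
A quick sketch of why this matters, assuming BandedMatrices.jl’s diagonal-pair constructor (the matrix is the usual second-difference operator):

using BandedMatrices

# Second-difference operator stored as a banded matrix: O(N) storage,
# and `\` dispatches to a banded LU instead of a dense factorization.
N = 10_000
A = BandedMatrix(-1 => fill(1.0, N-1), 0 => fill(-2.0, N), 1 => fill(1.0, N-1))
b = rand(N)
x = A \ b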

Summary: Julia has a good number of tools for this, almost by accident. I’m not sure this interaction between AD and Elemental.jl / BlockBandedMatrices.jl was intended, but it makes for a very good PDE linear algebra ecosystem for Julia. TensorFlow has the right building blocks but needs sparse factorizations. PyTorch just needs more. C++ and Fortran give you a ton of tools but leave you to write your own adjoints, which isn’t that difficult in this case.

Surrogates, Global Sensitivity Analysis, and Uncertainty Quantification

Libraries for surrogates, global sensitivity analysis, and uncertainty quantification are commonly utilized for analyzing whether a PDE fit is good. It’s only natural that these will be used to analyze whether a neural PDE’s fit is good. Our challenge problem noted that we would like to use the global sensitivity or uncertainty of the result as a loss term that quantifies how trustworthy the result is.

However, there seems to be a lack of AD-compatible, complete GSA and UQ systems. Julia has the newly created Surrogates.jl for surrogate modeling and optimization, which rivals the pySOT surrogate optimization package, but that’s the only complete package in this area that I could find which is AD compatible. That isn’t to say there aren’t other good packages; it’s just that they don’t integrate with any of the AD systems so as to be readily used in loss functions. SMT for Python and MUQ for C++ are some great examples, while Dakota and PSUADE are interfaced through their binaries. Even R has a great sensitivity package, and SimLab in MATLAB is pretty good as well. But you cannot expect to just run AD through them, and some of the functionality may not be easy to define adjoints for. Here, the Julia libraries DiffEqSensitivity and DiffEqUncertainty are compatible with AD, but do not have all of the functionality you see in the other libraries. So, there’s no silver bullet, and this field could use some work.

Summary: The only AD-compatible packages in this field are in Julia, but they still need a little more work to be as strong as some of the C++ offerings like Dakota or MUQ. This means there’s no silver bullet for GSA or UQ for scientific ML right now.

Distributed Parallelism

At this point, all of the ecosystems have pretty good distributed parallelism support. TensorFlow in particular seems built for deploying to clusters with GPUs and TPUs. torch.distributed has a very good list of what’s supported with its various backends. Julia can make use of MPI.jl, with which some of the AD systems are compatible (as long as the AD components are serializable). C++ and Fortran of course have MPI, and I am not aware of whether the source-to-source AD systems can handle MPI, so please let me know! Stan also has a way of making use of MPI.

Summary: If you know MPI (common among people doing scientific computing), then any of these are fine.

Automated PDE Discretizations

At some point, if we are automatically generating PDEs, then we will need tooling for automatically discretizing PDEs. If you aren’t familiar with the topic, the way to solve PDEs is essentially to transform them into systems of linear equations, nonlinear equations, or ordinary differential equations (or DAEs, or SDEs, etc.). There are a lot of great libraries for discretizing PDEs. An abundance of such libraries exists in C++, with deal.II, SAMRAI, and hypre being well-known examples (and each of the C++ libraries also generally documents how to use it from Fortran). There is a smaller but strong contingent in Python as well, with FEniCS being one of the top FEM packages and Firedrake also in the running. Additionally, Dedalus is a strong package for pseudospectral discretizations. But again, none of these hook into ADOL-C, TensorFlow, or PyTorch, so they are great libraries when not used in an auto-PDE construction toolchain, but they do have this fundamental limitation. There are some startup libraries brewing in Julia as well. ApproxFun.jl is an enhanced version of Chebfun for Julia which is really good for generating pseudospectral PDE discretizations (with adaptive mesh sizes), though its documentation will need to be more complete for most people to recognize all that it can do (a tiny taste is sketched below). DiffEqOperators.jl is a fledgling finite difference method library which ties into the neural network software to make its convolutions utilize cuDNN and play nicely with Flux.jl. And there are efforts for FEM in pure Julia, with JuliaFEM and JuAFEM.jl picking up a lot of steam, with at least the latter having the ability to have neural networks hook into its local assembly routines since it’s just user-written Julia code. It’ll be interesting to see how these tools evolve in terms of their neural network friendliness.
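
As a small taste of ApproxFun.jl (the function here is arbitrary; everything is adaptively resolved in a Chebyshev basis):

using ApproxFun

x = Fun(identity, 0..10)   # the coordinate function on [0, 10]
f = sin(x) + x^2/10        # adaptively represented to near machine precision
g = f'                     # spectral differentiation
g(3.0)                     # ≈ cos(3.0) + 0.6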

Summary: There are some good PDE discretization libraries for Fortran, C++, and Python users, but neural networks / AD don’t generally play well with them. Julia has some libraries growing which are AD-friendly, but not feature complete yet.

Conclusion

There are many great differentiable programming frameworks, but expansive feature sets for traditional machine learning do not necessarily mean expansive feature sets for scientific machine learning. Here we took a deep dive into scientific machine learning and the tools which are available for solving its problems. By looking at the available packages, we see that Julia has the most mature scientific machine learning package ecosystem, though it (and all the others) still has a ton of work to do. While Fortran and C++ still have a lot of very great scientific computing tooling, the widely used AD systems with automated sparsity support, like ADOL-C and TAF, do not come with neural network libraries, ODE solver libraries, UQ libraries, etc. all baked in. In addition, the big Python frameworks have some great neural network support, but are missing many of the major features used in the study of (partial) differential equations, making them not as ideal for this specific purpose. At the same time, there do not seem to be competitive AD ecosystems in the other popular scientific computing languages, MATLAB and R (again, there are great AD packages there, they’re just missing things like convolutional neural networks mixed with partial differential equations!). Thus, the easiest way forward would likely be to further develop the analysis features (global sensitivity analysis and uncertainty quantification) in Julia, where one can rely on full-language AD systems like Zygote to then incorporate them into neural partial differential equation pipelines. The next most viable path would be to wrap a bunch of adjoints into ADOL-C for neural networks and analysis, which, given the strong scientific computing contingent in C++, will likely happen in the near future.

While this is somewhat of a fringe topic right now, I hope to see a lot of development in scientific ML and scientific AI in the near future, and I plan to track the developments on this blog. What I think is made clear here, however, is that simply focusing on traditional ML workflows will not be sufficient for coverage of this domain, and thus to see progress in scientific ML tools we will need to see features in AD systems which may not have been covered by their original agendas.

Citation

To cite this post, please use the following from The Winnower:

Christopher Rackauckas, The Essential Tools of Scientific Machine Learning (Scientific ML), The Winnower 6:e156631.13064 (2019). DOI: 10.15200/winn.156631.13064

The post The Essential Tools of Scientific Machine Learning (Scientific ML) appeared first on Stochastic Lifestyle.

Neural Jump SDEs (Jump Diffusions) and Neural PDEs

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/neural-jump-sdes-jump-diffusions-and-neural-pdes/

This is just an exploration of some new neural models I decided to jot down for safekeeping. DiffEqFlux.jl gives you the differentiable programming tools to use any DifferentialEquations.jl problem type (DEProblem) mixed with neural networks. We demonstrated this before, not just with neural ordinary differential equations, but also with things like neural stochastic differential equations and neural delay differential equations.

At the time we made DiffEqFlux, we were the “first to the gate” for many of these differential equations types and left it as an open question for people to find a use for these tools. And judging by the Arxiv papers that went out days after NeurIPS submissions were due, it looks like people now have justified some machine learning use cases for them. There were two separate papers on neural stochastic differential equations, showing them to be the limit of deep latent Gaussian models. Thus when you stick these new mathematical results on our existing adaptive high order GPU-accelerated neural SDE solvers, you get some very interesting and fast ways to learn some of the most cutting edge machine learning methods.

So I wanted to help you guys stay one step ahead of the trend by going to the next type of differential equation. One of the interesting NeurIPS-timed Arxiv papers was on jump ODEs. Following the DiffEqFlux.jl spirit, you can just follow the DifferentialEquations.jl tutorials on these problems, implement them, add a neural network, and it will differentiate through them. So let’s take it one step further and show an example of how you’d do that. I wanted to take a look at jump diffusions, or jump stochastic differential equations, which are exactly what they sound like: a mixture of jumps and diffusions. After that, I wanted to show how using some methods for stiff differential equations plus a method of lines discretization gives a way to train neural partial differential equations.

Instead of being fully defined by neural networks, I will also be showcasing how you can selectively make parts of a differential equation neuralitized and other parts pre-defined, something we’ve been calling mixed neural differential equations, so we’ll demonstrate a mixed neural jump stochastic differential equation and a mixed neural partial differential equation with fancy GPU-accelerated adaptive etc. methods. I’ll then leave as homework how to train a mixed neural jump stochastic partial differential equation with the fanciest methods, which should be easy to see from this blog post (so yes, that will be the MIT 18.337 homework). This blog post will highlight that these equations are all already possible within our framework, and will also show the specific places we see that we need to accelerate to really put these types of models into production.

Neural Jump Stochastic Differential Equations (Jump Diffusions)

To get to jump diffusions, let’s start with a stochastic differential equation. A stochastic differential equation is defined via

 dX_t = f(t,X_t)dt + g(t,X_t)dW_t

which is essentially saying that there is a deterministic term f and a continuous randomness term g driven by a Brownian motion. Theorems like Donsker’s theorem can be thought of as a generalization of the central limit theorem, saying that continuous stochastic processes of some large class can be reinterpreted as this kind of process (due to the Gaussian-ness of Brownian motion), so in some sense this is a very large encompassing class. If you haven’t seen the previous blog post which mentions how to define neural SDEs, please check that out now. Let’s start with a code that uses reverse-mode automatic differentiation through a GPU-accelerated high order adaptive SDE solver. The code looks like:

using Flux, DiffEqFlux, StochasticDiffEq, Plots, DiffEqMonteCarlo
 
u0 = Float32[2.; 0.] |> gpu
datasize = 30
tspan = (0.0f0,1.0f0)
 
function trueODEfunc(du,u,p,t)
    true_A = [-0.1 2.0; -2.0 -0.1] |> gpu
    du .= ((u.^3)'true_A)'
end
t = range(tspan[1],tspan[2],length=datasize)
mp = Float32[0.2,0.2] |> gpu
function true_noise_func(du,u,p,t)
    du .= mp.*u
end
prob = SDEProblem(trueODEfunc,true_noise_func,u0,tspan)
 
# Take a typical sample from the mean
monte_prob = MonteCarloProblem(prob)
monte_sol = solve(monte_prob,SOSRI(),num_monte = 100)
monte_sum = MonteCarloSummary(monte_sol)
sde_data = Array(timeseries_point_mean(monte_sol,t))
 
dudt = Chain(x -> x.^3,
             Dense(2,50,tanh),
             Dense(50,2)) |> gpu
ps = Flux.params(dudt)
n_sde = x->neural_dmsde(dudt,x,mp,tspan,SOSRI(),saveat=t,reltol=1e-1,abstol=1e-1)
 
pred = n_sde(u0) # Get the prediction using the correct initial condition
 
dudt_(u,p,t) = Flux.data(dudt(u))
g(u,p,t) = mp.*u
nprob = SDEProblem(dudt_,g,u0,(0.0f0,1.2f0),nothing)
 
monte_nprob = MonteCarloProblem(nprob)
monte_nsol = solve(monte_nprob,SOSRI(),num_monte = 100)
monte_nsum = MonteCarloSummary(monte_nsol)
#plot(monte_nsol,color=1,alpha=0.3)
p1 = plot(monte_nsum, title = "Neural SDE: Before Training")
scatter!(p1,t,sde_data',lw=3)
 
scatter(t,sde_data[1,:],label="data")
scatter!(t,Flux.data(pred[1,:]),label="prediction")
 
function predict_n_sde()
  n_sde(u0)
end
loss_n_sde1() = sum(abs2,sde_data .- predict_n_sde())
loss_n_sde10() = sum([sum(abs2,sde_data .- predict_n_sde()) for i in 1:10])
Flux.back!(loss_n_sde1())
 
data = Iterators.repeated((), 10)
opt = ADAM(0.025)
cb = function () #callback function to observe training
  sample = predict_n_sde()
  # loss against current data
  display(sum(abs2,sde_data .- sample))
  # plot current prediction against data
  cur_pred = Flux.data(sample)
  pl = scatter(t,sde_data[1,:],label="data")
  scatter!(pl,t,cur_pred[1,:],label="prediction")
  display(plot(pl))
end
 
# Display the SDE with the initial parameter values.
cb()
 
Flux.train!(loss_n_sde1 , ps, Iterators.repeated((), 100), opt, cb = cb)
Flux.train!(loss_n_sde10, ps, Iterators.repeated((), 100), opt, cb = cb)
 
dudt_(u,p,t) = Flux.data(dudt(u))
g(u,p,t) = mp.*u
nprob = SDEProblem(dudt_,g,u0,(0.0f0,1.2f0),nothing)
 
monte_nprob = MonteCarloProblem(nprob)
monte_nsol = solve(monte_nprob,SOSRI(),num_monte = 100)
monte_nsum = MonteCarloSummary(monte_nsol)
#plot(monte_nsol,color=1,alpha=0.3)
p2 = plot(monte_nsum, title = "Neural SDE: After Training", xlabel="Time")
scatter!(p2,t,sde_data',lw=3,label=["x" "y" "z" "y"])
 
plot(p1,p2,layout=(2,1))
 
savefig("neural_sde.pdf")
savefig("neural_sde.png")

This just uses the diffeq_rd layer function to tell Flux to use reverse-mode AD (using Tracker.jl, unless you check out a bunch of weird Zygote.jl branches: wait for Zygote) and then trains the neural network using a discrete adjoint. While the previously posted example uses forward-mode, we have found that this is much much faster on neural SDEs, so if you’re trying to train them, I would recommend using this code instead (and I’ll get the examples updated).

Now to this equation let’s add jumps. A jump diffusion is defined like:

 du = f(u,p,t)dt + \sum_i g_i(u,p,t)dW_t^i + \sum_i c_i(u,p,t)dp_i

where dp_i are the jump terms. The jump terms differ from the Brownian terms because they are non-continuous: they are zero except at countably many time points where you “hit” the equation with an amount  c_i(u,p,t) . The timing at which these occur is based on an internal rate  \lambda_i of the jump  dp_i .

Jump diffusions are important because, just as there is a justification for the universality of stochastic differential equations, there is a justification here as well. The Lévy decomposition says that essentially any Markov process can be decomposed into something of this form. They also form the basis for many financial models, because, for example, changing regimes into a recession isn’t gradual but rather sudden. Models like Merton’s model thus use these as an essential tool in quantitative finance. So let’s train a neural network on that!

What we have to do is define jump processes and append them onto an existing differential equation. The documentation shows how to use the different jump definitions along with their pros and cons; for now we will use ConstantRateJump. Let’s define a ConstantRateJump which has a constant rate and a neural network that decides what the effect of the jump ( c_i(u,p,t) ) will be. To do this, you’d simply put the neural network in there:

rate(u,p,t) = 2.0
affect!(integrator) = (integrator.u = dudt2(integrator.u))
jump = ConstantRateJump(rate,affect!)

where dudt2 is another neural network, and then wrap that into a jump problem:

prob = SDEProblem(dudt_,g,gpu(param(x)),tspan,nothing)
jump_prob = JumpProblem(prob,Direct(),jump,save_positions=(false,false))

And of course you can make this fancier: just replace that rate 2.0 with another neural network, make the g(u,p,t) term also have a neural network, etc.: explore this as you wish and go find some cool stuff (a sketch of the neural-rate variant is below). Let’s just stick with this as our example, but please go ahead and make these changes and let DiffEqFlux.jl help you explore your craziest mathematical ideas!
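
For instance, an untested sketch of the neural-rate variant, where `rate_nn` is a hypothetical second network and softplus keeps the rate positive:

# Hypothetical extension: the jump rate itself is a neural network.
rate_nn = Chain(Dense(2,16,tanh), Dense(16,1)) |> gpu
rate(u,p,t) = softplus(rate_nn(u)[1])   # positivity via softplus
affect!(integrator) = (integrator.u = dudt2(integrator.u))
jump = ConstantRateJump(rate,affect!)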

Now when you solve this, the jumps also occur along with the stochastic differential equation. To show what that looks like, let’s define a jump diffusion and solve it 100 times, taking its mean as our training data:

using Flux, DiffEqFlux, StochasticDiffEq, Plots, DiffEqMonteCarlo,
    DiffEqJump
 
u0 = Float32[2.; 0.]
datasize = 30
tspan = (0.0f0,1.0f0)
 
function trueODEfunc(du,u,p,t)
  true_A = [-0.1 2.0; -2.0 -0.1]
  du .= ((u.^3)'true_A)'
end
t = range(tspan[1],tspan[2],length=datasize)
const mp = Float32[0.2,0.2]
function true_noise_func(du,u,p,t)
  du .= mp.*u
end
 
true_rate(u,p,t) = 2.0
true_affect!(integrator) = (integrator.u[1] = integrator.u[1]/2)
true_jump = ConstantRateJump(true_rate,true_affect!)
prob = SDEProblem(trueODEfunc,true_noise_func,u0,tspan)
jump_prob = JumpProblem(prob,Direct(),true_jump,save_positions=(false,false))
 
# Take a typical sample from the mean
monte_prob = MonteCarloProblem(jump_prob)
monte_sol = solve(monte_prob,SOSRI(),num_monte = 100,parallel_type=:none)
plot(monte_sol,title="Training Data")
 
monte_sum = MonteCarloSummary(monte_sol)
sde_data = Array(timeseries_point_mean(monte_sol,t))

From the plot you can see wild discontinuities mixed in with an equation with continuous randomness. Just lovely.

A full code for training a neural jump diffusion thus is:

using Flux, DiffEqFlux, StochasticDiffEq, Plots, DiffEqMonteCarlo,
    DiffEqJump
 
u0 = Float32[2.; 0.] |> gpu
datasize = 30
tspan = (0.0f0,1.0f0)
 
function trueODEfunc(du,u,p,t)
  true_A = [-0.1 2.0; -2.0 -0.1] |> gpu
  du .= ((u.^3)'true_A)'
end
t = range(tspan[1],tspan[2],length=datasize)
const mp = Float32[0.2,0.2] |> gpu
function true_noise_func(du,u,p,t)
  du .= mp.*u
end
 
true_rate(u,p,t) = 2.0
true_affect!(integrator) = (integrator.u[1] = integrator.u[1]/2)
true_jump = ConstantRateJump(true_rate,true_affect!)
prob = SDEProblem(trueODEfunc,true_noise_func,u0,tspan)
jump_prob = JumpProblem(prob,Direct(),true_jump,save_positions=(false,false))
 
# Take a typical sample from the mean
monte_prob = MonteCarloProblem(jump_prob)
monte_sol = solve(monte_prob,SOSRI(),num_monte = 100,parallel_type=:none)
monte_sum = MonteCarloSummary(monte_sol)
sde_data = Array(timeseries_point_mean(monte_sol,t))
 
dudt = Chain(x -> x.^3,
           Dense(2,50,tanh),
           Dense(50,2)) |> gpu
dudt2 = Chain(Dense(2,50,tanh),
            Dense(50,2)) |> gpu
ps = Flux.params(dudt,dudt2)
 
g(u,p,t) = mp.*u
n_sde = function (x)
    dudt_(u,p,t) = dudt(u)
    rate(u,p,t) = 2.0
    affect!(integrator) = (integrator.u = dudt2(integrator.u))
    jump = ConstantRateJump(rate,affect!)
    prob = SDEProblem(dudt_,g,param(x),tspan,nothing)
    jump_prob = JumpProblem(prob,Direct(),jump,save_positions=(false,false))
    solve(jump_prob, SOSRI(); saveat=t ,abstol = 0.1, reltol = 0.1) |> Tracker.collect
end
 
pred = n_sde(u0) # Get the prediction using the correct initial condition
 
dudt__(u,p,t) = Flux.data(dudt(u))
rate__(u,p,t) = 2.0
affect!__(integrator) = (integrator.u = Flux.data(dudt2(integrator.u)))
jump = ConstantRateJump(rate__,affect!__)
nprob = SDEProblem(dudt__,g,u0,(0.0f0,1.0f0),nothing)
njump_prob = JumpProblem(nprob,Direct(),jump, save_positions = (false,false)) # note: nprob, the untracked problem defined above
 
monte_nprob = MonteCarloProblem(njump_prob)
monte_nsol = solve(monte_nprob,SOSRI(),num_monte = 1000,parallel_type=:none, abstol = 0.1, reltol = 0.1)
monte_nsum = MonteCarloSummary(monte_nsol)
 
#plot(monte_nsol,color=1,alpha=0.3)
p1 = plot(monte_nsum, title = "Neural Jump Diffusion: Before Training")
scatter!(p1,t,sde_data',lw=3)
 
scatter(t,sde_data[1,:],label="data")
scatter!(t,Flux.data(pred[1,:]),label="prediction")
 
function predict_n_sde()
    n_sde(u0)
end
 
loss_n_sde1() = sum(abs2,sde_data .- predict_n_sde())
function loss_n_sde100()
    loss = sum([sum(abs2,sde_data .- predict_n_sde()) for i in 1:100])
    @show loss
    loss
end
function loss_n_sde500()
    loss = sum([sum(abs2,sde_data .- predict_n_sde()) for i in 1:500])
    @show loss
    loss
end
Flux.back!(loss_n_sde1())
 
data = Iterators.repeated((), 10)
opt = ADAM(0.025)
cb = function () #callback function to observe training
    sample = predict_n_sde()
    # loss against current data
    display(sum(abs2,sde_data .- sample))
    # plot current prediction against data
    cur_pred = Flux.data(sample)
    pl = scatter(t,sde_data[1,:],label="data")
    scatter!(pl,t,cur_pred[1,:],label="prediction")
    display(plot(pl))
end
 
# Display the SDE with the initial parameter values.
cb()
 
Flux.train!(loss_n_sde1 , ps, Iterators.repeated((), 100), opt, cb = cb)

Notice how it’s almost exactly the same as the SDE code but with the definition of the jumps. You still get the same high order adaptive GPU-accelerated (choice of implicit, etc.) SDE solvers, but now on this more generalized class of problems. Using the GPU gives a good speedup in the neural network case, but slows things down quite a bit when generating the training data since that part is not very parallel. Finding new ways to use GPUs is one thing I am interested in pursuing here. Additionally, using a lower tolerance causes a stack overflow in Tracker.jl, which is something we have fixed with Zygote.jl and which will come to releases once Zygote.jl on the differential equation solvers is more robust. Lastly, the plotting with GPU-based arrays is wonky right now; we’ll need to make the interface a little bit nicer. However, this is a proof of concept that this stuff does indeed work, though it takes a while to train it to a “decent” loss (way more than the number of repetitions showcased in here).

[Note: you need to add using CuArrays to enable the GPU support. I turned it off by default because I was training this on my dinky laptop :)]

Neural Partial Differential Equations

Now let’s do a neural partial differential equation (PDE). We can start by pulling code from this older blog post on solving systems of stochastic partial differential equations with GPUs. Here I’m going to strip the stochastic part off, simply because I want to train this on my laptop before the flight ends, so again I’ll leave it as an exercise to do the same jump diffusion treatment to this PDE. Let’s start by defining the method of lines discretization for our PDE. If you don’t know what that is, please go read that blog post on defining SPDEs. What happens is the discretization gives you a set of ODEs to solve, which looks like:

using OrdinaryDiffEq, RecursiveArrayTools, LinearAlgebra,
      DiffEqOperators, Flux, CuArrays
 
# Define the constants for the PDE
const α₂ = 1.0f0
const α₃ = 1.0f0
const β₁ = 1.0f0
const β₂ = 1.0f0
const β₃ = 1.0f0
const r₁ = 1.0f0
const r₂ = 1.0f0
const D = 100.0f0
const γ₁ = 0.1f0
const γ₂ = 0.1f0
const γ₃ = 0.1f0
const N = 100
const X = reshape([i for i in 1:N for j in 1:N],N,N) |> gpu
const Y = reshape([j for i in 1:N for j in 1:N],N,N) |> gpu
const α₁ = 1.0f0.*(X.>=80)
 
const Mx = Array(Tridiagonal([1.0f0 for i in 1:N-1],[-2.0f0 for i in 1:N],[1.0f0 for i in 1:N-1])) |> gpu
const My = copy(Mx)
Mx[2,1] = 2.0
Mx[end-1,end] = 2.0
My[1,2] = 2.0
My[end,end-1] = 2.0
 
# Define the initial condition as normal arrays
u0 = rand(Float32,N,N,3) |> gpu
const MyA = zeros(Float32,N,N) |> gpu
const AMx = zeros(Float32,N,N) |> gpu
const DA = zeros(Float32,N,N) |> gpu
 
# Define the discretized PDE as an ODE function
function f(_du,_u,p,t)
  u = reshape(_u,N,N,3)
  du = reshape(_du,N,N,3)
  A = @view u[:,:,1]
  B = @view u[:,:,2]
  C = @view u[:,:,3]
  dA = @view du[:,:,1]
  dB = @view du[:,:,2]
  dC = @view du[:,:,3]
  mul!(MyA,My,A)         # My*A: second-difference stencil in y
  mul!(AMx,A,Mx)         # A*Mx: second-difference stencil in x
  @. DA = D*(MyA + AMx)  # diffusion term for A
  @. dA = DA + α₁ - β₁*A - r₁*A*B + r₂*C
  @. dB = α₂ - β₂*B - r₁*A*B + r₂*C
  @. dC = α₃ - β₃*C + r₁*A*B - r₂*C
end
 
# Solve the ODE
prob = ODEProblem(f,vec(u0),(0.0f0,100.0f0))
@time sol = solve(prob,BS3(),  progress=true,saveat = 5.0)
@time sol = solve(prob,ROCK2(),progress=true,saveat = 5.0)
 
 
using Plots; pyplot()
p1 = surface(X,Y,reshape(sol[end],N,N,3)[:,:,1],title = "[A]")
p2 = surface(X,Y,reshape(sol[end],N,N,3)[:,:,2],title = "[B]")
p3 = surface(X,Y,reshape(sol[end],N,N,3)[:,:,3],title = "[C]")
plot(p1,p2,p3,layout=grid(3,1))
savefig("neural_pde_training_data.png")
 
using DiffEqFlux, Flux
 
u0 = param(u0)
tspan = (0.0f0,100.0f0)
 
ann = Chain(Dense(3,50,tanh), Dense(50,3)) |> gpu
p1 = DiffEqFlux.destructure(ann)
ps = Flux.params(ann)
 
# Manually apply the network from the destructured parameter vector p: p[1:150]
# and p[151:200] are layer 1's weights and bias; p[201:350] and p[351:end] layer 2's.
_ann = (u,p) -> reshape(p[3*50+51 : 2*3*50+50],3,50)*
                    tanh.(reshape(p[1:3*50],50,3)*u + p[3*50+1:3*50+50]) + p[2*3*50+51:end]
 
function dudt_(_u,p,t)
  u = reshape(_u,N,N,3)
  A = u[:,:,1]
  DA = D .* (A*Mx + My*A)                           # known diffusion operator on A
  _du = mapslices(x -> _ann(x,p),u,dims=3) |> gpu   # learned local reaction terms
  du = reshape(_du,N,N,3)
  vec(cat(du[:,:,1]+DA,du[:,:,2],du[:,:,3],dims=3))
end
 
prob = ODEProblem(dudt_,vec(Flux.data(u0)),tspan,Flux.data(p1))
@time diffeq_fd(p1,Array,length(u0)*length(0.0f0:5.0f0:100.0f0),prob,ROCK2(),progress=true,
                saveat=0.0f0:5.0f0:100.0f0)
 
function predict_fd()
  diffeq_fd(p1,Array,length(u0)*length(0.0f0:5.0f0:100.0f0),prob,ROCK2(),progress=true,
                  saveat=0.0f0:5.0f0:100.0f0)
end
 
function loss_fd()
  _sol = predict_fd()
  loss = sum(abs2,Array(sol) .- _sol)
  @show loss
  loss
end
loss_fd()
 
data = Iterators.repeated((), 10)
opt = ADAM(0.025)
 
Flux.train!(loss_fd, ps, data, opt)

The interesting part of this neural differential equation is the local/global split of its parts. The mapslices call makes it so that there’s a local nonlinear function of 3 variables applied at each point in space. While this keeps the neural network small, it currently does not play well with reverse-mode automatic differentiation or GPUs. That isn’t a major problem here because, since the neural network is kept small in this architecture, the number of parameters is also quite small. That said, reverse-mode AD will be required for fast adjoint passes, so this is still a work in progress / proof of concept, with a very specific point made (all that’s necessary here is overloads to make mapslices work well).
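
To see what that local/global split means, here is a tiny standalone illustration (my own example, not from the post): `mapslices` with `dims=3` applies a function independently to the length-3 state vector at each spatial grid point, which is exactly how the network above acts as a purely local reaction term.

u_demo = rand(Float32, 4, 4, 3)           # a small 4×4 grid with 3 species per point
nn_demo = x -> 2 .* x                     # stand-in for the neural network
out = mapslices(nn_demo, u_demo, dims=3)  # same (4,4,3) shape; nn_demo applied pointwise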

One point that really came out of this was the choice of ODE solver methods. The ROCK2 method is much faster both when generating the training data and when running diffeq_fd: it was a difference of 3 minutes with ROCK2 vs 40 minutes with BS3 (on the CPU), showing how specialized methods really are the difference between the problem being solvable or not. The standard implicit methods like Rodas5 aren’t performing well here either, since the Jacobian is a 30,000×30,000 dense matrix and I didn’t take the time to specify sparsity patterns or whatnot to actually make them viable competitors. So for lazy neural ODE use with sparsity, ROCK2 seems like a very interesting option. This is a testament to our newest GSoC crew’s results, since it’s one of the newer methods implemented by our student Deepesh Thakur. There are still a few improvements that need to be made to the eigenvalue estimates to make them more GPU-friendly, so this performance result should soon carry over to GPUs as well (currently, the indexing in this part of the code gives it trouble, so a PR is coming probably in a week or so). Lastly, I’m not sure what a good picture is for these kinds of things, so I’m going to have to think about how to represent a global neural PDE fit.
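
As a hedged aside (my own sketch, not code from this post): the way to make the implicit methods viable would be to hand the solver the Jacobian sparsity pattern via DiffEq’s `jac_prototype` option, so the 30,000×30,000 Jacobian is stored and factorized as a sparse matrix. The pattern below is a diagonal-only placeholder; a real run would encode the off-diagonal entries coming from the Mx/My stencils.

using OrdinaryDiffEq, SparseArrays
jac_sparsity = spdiagm(0 => ones(Float32, 3N^2))  # placeholder pattern: diagonal only
f_sparse = ODEFunction(f, jac_prototype = jac_sparsity)
prob_implicit = ODEProblem(f_sparse, vec(Flux.data(u0)), (0.0f0, 100.0f0))
# sol = solve(prob_implicit, Rodas5())  # would now use sparse factorizations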

Conclusion

Have fun with this. There are still some rough edges. For example, plotting is still a little wonky because all of the automatic DiffEq solution plotting seems to index, which GPU-based arrays don’t like (I’ll update that soon now that it’s becoming a standard part of the workflow). Use it as starter code and find some cool stuff. Note that the examples shown here are not the only ones that are possible. This all just uses Julia’s generic programming and differentiable programming infrastructure to automatically generate code that is compatible with GPUs and automatic differentiation, so it’s impossible for me to enumerate all of the possible combinations. That means there’s plenty to explore. These are very early preliminary results, but they show that these equations are all possible. These examples show some places where we want to continue accelerating by improving both the methods and their implementation details. I look forward to doing an update with Zygote soon.

CITATION:

Christopher Rackauckas, Neural Jump SDEs (Jump Diffusions) and Neural PDEs, The Winnower 6:e155975.53637 (2019). DOI: 10.15200/winn.155975.53637

The post Neural Jump SDEs (Jump Diffusions) and Neural PDEs appeared first on Stochastic Lifestyle.

Some State of the Art Packages in Julia v1.0

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/some-state-of-the-art-packages-in-julia-v1-0/

In this post I would like to explain some of the state of the art packages in Julia and how they are pushing their disciplines forward. The part I would really like to make clear is the ways in which Julia gives these packages a competitive advantage, and how these features are being utilized to reach their end goals. Maybe this will generate some more ideas!

What are the advantages for Julia in terms of package development?

Using Python/R to use other people’s packages is perfectly fine, since then your code will be just as fast as the package code if all your time is spent in package code calls (I purposely leave MATLAB off this list because its interoperability is much more difficult to use). There is one big exception where using an efficient package can still be slow even when all of the time is spent in package code instead of Python/R: any package that requires user-defined function inputs (i.e. higher-order functions), like differential equation, optimization, etc. libraries, will run into performance issues. In a recent blog post we showed that Numba+SciPy integrators are still 10x slower than they should be, so it’s not the easy solution some people make it out to be. The reason of course is part of the design: Numba doesn’t do interprocedural optimization through the Fortran code, and user-input functions are compiled separately from the package code. These context-switching costs particularly hurt when the function calls are small, like in the callback functions of an optimization modeling language, because each call has to hit the Python interpreter. So even if you have a fast function call in a fast package, you can still have a limiting factor! If you have an asymptotically costly function then this can be mitigated, but that means a lot of limitations on standard use cases.

This is why some Python optimization and differential equation libraries actually take in strings or SymPy expressions as input (PyDSTool and JitCODE are two of many examples): since direct Numba/Cython usage for function input has these problems, they have to write a compiler behind the scenes, which then of course means you get restrictions like everything having to be Float64 with no arbitrary-precision numbers, etc. etc. etc. Once you have your own compilation toolchain, you have to explicitly choose what kinds of functions and objects you can support, and so throwing in arbitrary user code isn’t necessarily going to work. You can see how this not only reduces the feature set but also adds to development complexity. In fact, this is clearly seen in how PyTorch has a full JIT compiler inside of it along with a full math library! So developing PyTorch was more like developing Julia and the ML library at the same time, instead of just building an ML library. And then it doesn’t get the automatic composability features of Flux.jl, so you have to build everything in there yourself… you can see how the difference in package development efficiency is quite astronomical.

So how does this translate to packages? First, Data Science and ML

Higher order functions and the ability to compose with other packages and use their types: that’s where our competitive advantage is. This fact is reflected in the state-of-the-artness of packages. A lot of data science, statistics, and ML libraries just take a matrix of data in and spit out predictions. In that sense, all of the time is spent in package code, and while Julia does have a competitive advantage for building such packages (building a JIT compiler is not a prerequisite for building a neural net library in Julia), it doesn’t have an advantage for using such packages. This is why data science and ML have done just fine in Python, and statistics has done just fine in R. It would be hard to find real Julia gems in these areas since the existing code works. The advantage Julia gives is simply that it’s easier to create the package, and so one would expect it to have brand new methods arising from someone’s recent research. I’d point to Josh Day’s OnlineStats.jl as an example of Julia being used for bleeding-edge algorithms.

Now let’s look at Bayesian estimation and MCMC. For this area of data science you do need to take in full models. R and Python have resorted to taking Stan models in the form of a string. Stan is good but let’s admit that this interface is subpar. You’re writing in another language (Stan) and in a string (without syntax highlighting) in order to do Bayesian estimation from a scripting language? Julia has been building libraries which are easier to use due to being native Julia, like Turing.jl and DynamicHMC.jl. These libraries let you insert your likelihood function as just standard Julia code, making it much easier to make complicated models. For example, Stan only lets you use its two differential equation solvers (with a total of 5 options…) and only for ODEs, but Turing.jl and DynamicHMC.jl both can internally utilize DifferentialEquations.jl code for ODEs, SDEs, DAEs, DDEs, etc. That composition done efficiently is something you will not find elsewhere.
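
As a concrete (and hedged) illustration of what “the likelihood is just Julia code” looks like, here is a minimal sketch in the style of Turing.jl’s model syntax (the syntax has evolved across versions, so treat this as illustrative):

using Turing

@model function coinflip(y)
    p ~ Beta(1, 1)              # prior on the coin's bias
    for i in eachindex(y)
        y[i] ~ Bernoulli(p)     # likelihood, written as ordinary Julia
    end
end

chain = sample(coinflip([1, 1, 0, 1, 1]), NUTS(), 1_000)

Because the model body is plain Julia, any Julia function call, including a DifferentialEquations.jl solve, can sit inside it.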

And neural net libraries are not as black-box as most of ML. To get around the issues I mentioned earlier, you cannot just pass functions in, so you have to tell the neural net library how to build the entire neural net. This is why PyTorch and Tensorflow are essentially languages embedded into Python which produce objects that represent the functions you want to write, and which pass those functions to the packages so they can compile the right code (and utilize the GPU). Obviously this would be hard to write on your own, but the main disadvantage to users is that normal user code doesn’t work inside of these packages (you cannot just call arbitrary SciPy code, for example; PyTorch can do some of it if you can get its autodiff working, but it cannot autodiff through wrapped C++/Fortran code, which is how most packages are written!). But Flux.jl has full composability with Julia code, so it’s definitely state-of-the-art and has the flexibility that other neural net libraries are trying to build towards (Tensorflow is rebuilding into Swift in order to get this feature of Flux.jl).
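
To make that composability concrete, here is a minimal hedged sketch (using a recent Flux API rather than the Tracker-era code earlier in this post series): the model is an ordinary Julia function, so arbitrary Julia code composes into the loss and gets differentiated.

using Flux

m = Chain(Dense(2, 16, relu), Dense(16, 1))
x, y = rand(Float32, 2, 10), rand(Float32, 1, 10)
loss() = sum(abs2, m(x) .- y)              # plain Julia code in the loss
gs = Flux.gradient(loss, Flux.params(m))   # differentiate through all of it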

What about Scientific Computing?

But where things tend to be wide open is scientific computing. This domain has to deal with user-input functions and abstract types. So I’d point not just to DifferentialEquations.jl and JuMP, which are the clear well-funded front-runners in their disciplines, but also to a lot of the tooling around this area. IterativeSolvers.jl is a great example of something heavily unique. At face value, IterativeSolvers.jl is a lot like Pysparse’s iterative solvers module or the corresponding part of SciPy. Except it’s not when you look at the details. Those both require a NumPy matrix: they specifically require a dense/sparse matrix in each documented function. However, one of the big uses of these algorithms is supposed to be not on sparse matrices but on matrix-free representations of `A*x`, and you cannot do this with those Python packages because they require an actual matrix. In IterativeSolvers you just pass a type which has `mul!` (`*`) overloaded to be the function call you want, and you have a matrix-free iterative solver algorithm. Complex numbers, arbitrary precision, etc. are all supported here, which is important for ill-conditioned and quantum physics uses. Using abstract types, you can also throw GPUArrays in there and have it just run. This library only keeps getting better.
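
Here is a minimal matrix-free sketch of that idea (the operator type and its name are my own, not part of IterativeSolvers.jl): a symmetric positive definite 1D negative-Laplacian that is never materialized as a matrix, handed straight to conjugate gradients.

using IterativeSolvers, LinearAlgebra

struct NegLaplacian1D
    n::Int
end
Base.size(L::NegLaplacian1D) = (L.n, L.n)
Base.size(L::NegLaplacian1D, i::Int) = L.n
Base.eltype(::NegLaplacian1D) = Float64
function LinearAlgebra.mul!(y::AbstractVector, L::NegLaplacian1D, x::AbstractVector)
    n = L.n
    @inbounds for i in 1:n
        y[i] = 2x[i] - (i > 1 ? x[i-1] : 0.0) - (i < n ? x[i+1] : 0.0)
    end
    y
end
Base.:*(L::NegLaplacian1D, x::AbstractVector) = mul!(similar(x), L, x)

A = NegLaplacian1D(100)
b = rand(100)
x = cg(A, b)   # conjugate gradients without ever building the matrix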

One of the best scientific computing packages is BandedMatrices.jl. It lets you directly define banded matrices and then overloads things like `*` and `\` to use the right parts of BLAS and LAPACK. Compared to just using a sparse matrix (the standard MATLAB/Python way), this is SCREAMING FAST (the QR factorization difference is pretty big). Banded matrices show up everywhere in scientific computing (pretty much every PDE has a banded matrix operator). And if it’s not a banded matrix, then you probably have a block-banded matrix, and thank god that @dlfivefifty also wrote BlockBandedMatrices.jl, which utilizes the same speed tricks but expanded to block-banded matrices (so, discretizations of systems of PDEs).
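
A small hedged example of the idea (assuming BandedMatrices.jl’s Pair-based constructor): build a tridiagonal operator as a banded matrix, and `\` and `qr` dispatch to banded routines instead of dense or sparse ones.

using BandedMatrices, LinearAlgebra

n = 10_000
A = BandedMatrix(-1 => ones(n-1), 0 => fill(-2.0, n), 1 => ones(n-1))
b = rand(n)
x = A \ b   # banded LU solve
F = qr(A)   # banded QR, where the speed difference over sparse is largest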

In fact, I would say that @dlfivefifty is an unsung hero in the backend of Julia’s scientific computing ecosystem. ApproxFun.jl is another example of something truly unique. While MATLAB has chebfun, ApproxFun is similar but creates lazy matrix-free operators (so you can send these linear operators directly to things like, you guessed it, IterativeSolvers.jl). It uses infinite-sized array types (InfiniteArrays.jl) to grow and perform whatever sized calculation is needed without allocating the matrices.
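
For a flavor of ApproxFun.jl (a small hedged example of my own), functions become adaptively sized Chebyshev expansions that you can evaluate and differentiate like ordinary functions:

using ApproxFun

x = Fun(identity, 0..10)   # the coordinate function on [0,10]
f = sin(x^2)               # adaptively approximated to near machine precision
f(3.0)                     # pointwise evaluation
g = f'                     # differentiate the expansion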

While these may look like niche topics if you don’t know about them, these are exactly the types of libraries people need in order to build packages that require fast linear algebra, which is essentially anything in data science or scientific computing. So Julia’s competitive advantage in package building is compounded by the fact that there are now great, unique, fundamental numerical tools!

Along the lines of these “fundamental tools” packages is Distributions.jl. It’s quite a boring package: it’s just a bunch of distributions with overloads, but I think Josh Day’s discussion of how you can write code which is independent of the choice of distribution is truly noteworthy (and if you want a good introduction to Julia video, Josh’s is amazing). If you want to write code to compute quantiles in R or Python, you’d need to use a different function for each possible distribution, or take in a function that computes the pdf (which of course then gives you the function-input problem in these languages!). Package developers in other languages have thus far just hardcoded the distributions they need into things like SciPy, and it’s clear what the scalability of that is.
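
Here is what that distribution-generic style looks like in practice (a small sketch of my own, with a hypothetical helper name): one function covers every distribution, because `quantile` is a shared overload.

using Distributions

# central (1-α) interval for any univariate distribution via its quantiles
central_interval(d::UnivariateDistribution, α=0.05) =
    (quantile(d, α/2), quantile(d, 1 - α/2))

central_interval(Normal(0, 1))   # ≈ (-1.96, 1.96)
central_interval(Gamma(2, 3))    # the exact same code, a different distribution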

And then I’d end with JuMP, DifferentialEquations.jl, and LightGraphs.jl as very user-facing scientific computing packages which are in very important domains but don’t really have a comparable library in another scripting language. JuMP‘s advantage is that it allows you to control many things like branch-and-cut algorithms directly from JuMP, something which isn’t possible with “alternatives” like Pyomo (along with building sparse Hessians with autodifferentiation), so if you’re deep into a difficult problem it’s really the only choice. (NOTE: JuMP has not updated to Julia v1.0 yet, so you’ll have to wait!). DifferentialEquations.jl is one of the few libraries with high order symplectic integrators, one of the few libraries with integrators for stiff delay differential equations, one of the two high order adaptive stochastic differential equation libraries (the other being a (pretty good) re-implementation of DiffEq’s SRIW1), one of the only libraries with exponential integrators, one of the only libraries with … I can keep going, and of course it does well in first-order ODE benchmarks. So if you’re serious about solving differential equation based models, then in many cases it’s hard to even find an appropriate alternative. And LightGraphs is on a truly different level performance-wise than what came before, and it only keeps getting better. Using things like MetaGraphs lets you throw metadata in there as well. Its abstract structure lets you write and use graph algorithms independent of the graph implementation, and lets you swap implementations depending on what would be efficient for your application. So if you’re serious about technical computing, these are three libraries to take seriously.
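
The implementation-independence of LightGraphs is easy to show (a tiny hedged sketch of my own): the same algorithm call works on any AbstractGraph, so swapping in a metadata-carrying graph type doesn’t touch the algorithm code.

using LightGraphs

g = path_graph(5)                       # any AbstractGraph implementation works here
state = dijkstra_shortest_paths(g, 1)   # the algorithm is implementation-agnostic
state.dists                             # shortest-path distances from vertex 1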

These packages have allowed many domain-specific tools to be built. Physics is one domain which is doing really well in Julia, with DynamicalSystems.jl (recent winner of the DSWeb2018 prize by SIAM Dynamical Systems!) and QuantumOptics.jl being real top-notch packages (I don’t know of a DynamicalSystems.jl comparison in any other language, and I don’t know of anything that does all of the quantum stochasticity etc. of QO).

There’s still more to come, of course. Compilation tools like PackageCompiler.jl and targeting asm.js with Charlotte.jl are only in their infancy. JuliaDB is shaping up to be a more flexible version of dask, and it’s almost there. We’re missing good parallel linear algebra libraries like a native PETSc, but of course that’s the kind of thing that can never be written in Python (Julia is competing against C++/Fortran/Chapel here). MPIArrays.jl might get there, though with a lot of work.

Conclusion

I hope this is more of a “lead a :horse: to water” kind of post. There are a lot of unique things in Julia, and I hope this explains why certain domains have been able to pull much further ahead than others. Most of the technical computing population is still doing data science on Float64 arrays with blackboxed algorithms (TSne, XGBoost, GLM, etc.), and that drives a lot of the “immature” talk, because what they see in Julia is similar to or just a recreation of what exists elsewhere. But in the more numerical analysis and scientific computing domains, where higher precision matters (“condition number” is common knowledge), functions are inputs, and there are tons of unique algorithms in the literature that still have no implementation, I would even say that Julia surpassed MATLAB/Python/R a while ago, since it’s hard to find packages in those other languages comparable to ours.

If we want to capture the last flag, we really need to build out the competitive advantage Julia has in the area of data science / ML package development, because until it offers something unique it’ll just be “recreations” which don’t draw the biggest user audience. While “programmers” will like the purity of one language and the ability to develop their own code, I think the completeness of R for statistics or Python for ML matters more to this group. I think a focus on big data via streaming algorithms + out-of-core processing is a good strategy, but it’s not something that can’t also be done in C++ (dask shows as much), so there’s something more that’s required here.

Personally, I think part of the story will be around harnessing Julia’s generic array types for taking advantage of new hardware: new ML packages will not need to directly hardcode for TPUs if there is a TPUArray type that users can give to the package (note the difference here: the developers do not need to be “TPU-aware” in this case; instead, the user can mix types together and use automatic code generation to get an efficient TPU-based neural net or other ML algorithm). This “bottom-up” code generation via the type system and fully dependent compilation is a place where Julia has a competitive advantage, since package developers then only need to write a generic implementation of an algorithm and it can work on any hardware for which someone has implemented a compatible type. As Moore’s law comes to an end and variations on GPUs / FPGAs become required to get a speedup, allowing the user to specify the hardware interaction while still using a standard package will definitely be an advantage!
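
As a final hedged sketch of what that “bottom-up” generic programming looks like (my own toy example, with hypothetical names): the package author writes one generic implementation, and the user picks the hardware by picking the array type.

# one generic implementation; no hardware-specific code in the package
relu_layer(W, b, x) = max.(W*x .+ b, 0)

W, b, x = randn(Float32, 5, 3), randn(Float32, 5), randn(Float32, 3)
relu_layer(W, b, x)   # runs on CPU arrays

# With a GPU array package loaded (CuArrays at the time of this post), the
# exact same function runs on the GPU just by converting the inputs:
# using CuArrays
# relu_layer(cu(W), cu(b), cu(x))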

The post Some State of the Art Packages in Julia v1.0 appeared first on Stochastic Lifestyle.