Author Archives: Christopher Rackauckas

Using Juno for Interactive Test-Driven Package Development

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/using-juno-interactive-test-driven-package-development/

I found myself explaining my development workflow all of the time, so I thought I’d show it in a video. This is some of the stuff I do. You can do it other ways. I like this way.

The post Using Juno for Interactive Test-Driven Package Development appeared first on Stochastic Lifestyle.

Algorithm efficiency comes from problem information

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/algorithm-efficiency-comes-problem-information/

This is a high level post about algorithms (especially mathematical, scientific, and data analysis algorithms) which I hope can help people who are not researchers or numerical software developers better understand how to choose and evaluate algorithms. When scientists and programmers think about efficiency of algorithms, they tend to think about high level ideas like temporary arrays, choice of language, and parallelism. Or they tend to think about low level ideas like pointer indirection, cache efficiency, and SIMD vectorization. However, while these features matter, most decent implementations of the same algorithm tend to get quite close in efficiency to each other (probably <3x if everyone is using a performant enough programming language or constructs). If you're sitting down thinking "what is the next algorithm that I am going to invent that will be truly a step beyond what existed before?", these computational details matter, but not as much as the answer to one important question:

What does my algorithm know about my problem?

Let me highlight one of the biggest success stories in numerical analysis: solving linear systems. The problem is simple, for what values of x does Ax=b? Most people learn early on in their math career that you solve this via Gaussian elimination which is a very nice \mathcal{O}(n^3) algorithm. In fact, you can show that solving a linear systems has to have \mathcal{O}(n^3) as a lower bound, so we’re done now right? It’s all a game of who can write the most efficient Gaussian elimination?

Not even close. MATLAB’s backslash (A\b) operator solves the linear system Ax=b, but what algorithm does it use? Well, Mathworks describes it with a picture:

MATLAB_dense

Do you think that’s it? If so, then you’re wrong. That’s only when the matrix is full/dense. If it’s a sparse matrix, then there’s another algorithm for it:

MATLAB_sparse

Julia’s backslash operator is similar, but what’s cool is you can actually inspect Julia’s source code and see exactly what it’s doing. But it’s the same idea. This kind of algorithm is known as a polyalgorithm since it’s many different algorithms together. The algorithm starts by checking what kind of type the matrix is and then performing a specialized algorithm on that type, and if no specialized algorithm exists, falls back to Gaussian elimination through a pivoted QR-factorization.

Why would you write an algorithm like this? The trivial answer is because it’s faster. If you know the matrix is diagonal then the solution is a \mathcal{O}(n) element-wise division by the diagonal. If you know the matrix is positive-definite then you can use a Cholesky decomposition to solve the equation which takes about half of the operations of Gaussian elimination. However, I think it’s better to realize the common thread here. The diagonal method is faster because it knows about properties of diagonal matrices and uses them. The Cholesky decomposition is faster because it utilizes details about positive-definite matrices in order to get rid of possible operations.

These specialized algorithms are fast because they use more “information” about the problem

And that’s only the start of this entire field of mathematics known as numerical linear algebra. There are iterative methods which can also be used. (Dense) Matrix factorizations are dense and thus require the same amount of memory as the full matrix itself (though some sparse matrix factorizations similarly need the memory for non-zero elements and maybe some more). Iterative solvers require only the ability to do A*x and thus are much more memory efficient and can scale well on sparse matrices. This also makes them easily to parallelize across large clusters and thus are the method of choice for large solving large sparse systems that arise from things like PDEs. Even then, you can still make it more efficient by choosing a preconditioner, but good choices of preconditioners are dependent on, you guessed it, properties of your A or the equation it is derived from.

As you can see, every significant step of this tool chain is about baking more information about the problem into the solution methods. This is why I would suggest that one thinks about the amount of information an algorithm contains about a problem since that’s where the true gains are. Let’s take another example.

Are neural networks “efficient”?

If there’s one area of computational data science that everyone is excited about right now, it’s neural networks. The quickest way to explain a neural network is that it is a computationally efficient way to approximate any mapping from inputs to outputs, f(x)=y. The f that it creates uses a lot of matrix multiplies which are easy to GPU-parallelize, and the deep in “deep neural networks” simply is the application of more matrix multiplies to get a better approximation of f. Movie recommendations are where you take in x = (data about movies the person previously watched) and then y = (the score that you think they’d give to other movies), and you use enough data to get a good enough approximation for the movies they would like and spit out the top few. Classification problems like “does this picture contain a bird?” is just about creating a mapping between x = (a picture represented by its pixel brightness), to y = 1 or 0 (yes or no: it contains a bird). You train this on enough data and it’s correct enough of the time. Amazing, right?

The next question people like to ask is, what neural network frameworks are good for these problems? TensorFlow, PyTorch, KNet.jl will all auto-GPU you things for you and in the end all give around the same performance. Of course, package developers will fight over these 2x-5x differences over the next decade, showing that their internal setup is good for these problems, while others will show it’s not good for these other problems. But what this doesn’t tell you is whether you should be using a neural network in the first place.

(Yes, you can use these computational tools for things other than neural networks, but let’s ignore that for now)

If you think about the point of this blog post, it should trigger something in you. I mentioned that neural networks can approximate any function f(x)=y… so what does the neural network “know” about the problem? Pretty much nothing. The reason why neural networks are popular is because they are a nice hammer that you can use on pretty much any problem, but that doesn’t mean it’s always good. In fact, because you can use a neural network on any problem it pretty much implies that it won’t be great on any problem. That’s the fundamental trade off between specificity and generality! My friend Lyndon White displayed this very nicely when he showed 7 different ways to solve a classification problem in Julia. Notice that the neural network method is dead last in efficiency. Neural networks aren’t bad, but they just don’t know anything about a problem. If the problem is linear, then a linear regression or linear SVM will DDDDOMINATE on the problem! If it’s a classification problem, then algorithms which incorporate knowledge about what a classification problem means, more so than “spit out y in [0,1] via sigmoid and say yes if y is greater than some threshold”, will do better. For this domain, it includes things like decision trees. For movie recommendation problems, factorization machines more succinctly capture our internal model of how a user’s movie ratings should be related, and thus it’s no surprise that they tend to win most of the Netflix prize type of competitions.

Note that neural networks can be made problem-specific as well. Convolutional neural networks are successful in image processing because they add an extra constraint about the relations of the data, specifically that small “stencils” of the larger matrix (i.e. groups of nearby pixels) should be interrelated. By adding that kind of information into the structure of the neural network and its optimization algorithms, this then becomes a very efficient tool for tasks where the data has this structure.

So neural networks are not useless even though they are not always efficient. They are extremely useful since they can be applied to pretty much any problem. But the general neural network architectures (like deep feed-forward neural networks) lack knowledge about the specific problems they are approximating, and so they cannot hope to be as efficient as domain-optimized algorithms. How do you make them better? You introduce ideas like “memory” to get recurrent neural networks, and it’s no surprise that the research shows that these methods are more efficient on problems which have some memory relation. But the generality of neural networks leads me to my favorite example.

Algorithm specificity for differential equations (okay, this goes into more depth!)

I spend my time developing new methods for (stochastic) differential equations and developing differential equation solving software. There are hundreds and hundreds of different algorithms for finding the function u(t) whose derivative satisfies u'(t) = f(t,u(t)) where f is the data given by the user starting to t_0 and ending at t_f (this is the definition of an ODE BTW. If this is new to you, think I’m given a model f which describes how things change and I want to compute where I will be in the future). And again, solving this problem efficiently is all about incorporating more information about the problem into the solver algorithm.

Just as a interesting note, you can solve a differential equation with a neural network, but it’s not good. I was hoping that the efficiency gained by using optimized GPU-libraries would be enough to overcome the efficiency lost by not specifying on the problem, but it wasn’t even close and the end result was that it was difficult for a deep neural network to solve non-stiff nonlinear problems like the Lotka-Volterra equations which standard differential equations software solve in microseconds. But what’s more interesting is why it failed. The neural network simply failed to understand the temporal dependencies of the problem. If you look at the failed solutions on the blog post, you see what happens is that the neural network doesn’t know to prioritize the early time points because, well, it’s obvious in a temporal model that if you are wrong now then you will be wrong in the future. But the neural network sees each time point as an unconnected matrix and from there the problem becomes much harder. We could try convolutional nets or recurrent nets, or bias the cost function itself, but this is still trying to get around “the information problem” which differential equations don’t have.

With a differential equation, we are modeling reality. In many cases, we know things about that reality. We might know that the trajectory of our rocket is smooth (it has all derivatives), and so we can develop high order algorithms which require 9 derivatives of the user’s f and it’s not surprise that these Vern algorithms benchmark really well. But is it the best integrator? No, there’s no such thing. First of all, these methods are only for “non-stiff” ODEs, that is an ODE which has a single timescale. If there are multiple timescales, then one can show that these are “unstable” and require a very small time step. So what’s the best algorithm?

What I try to highlight in the post on differential equation libraries is that a large choice of methods is essential to actually being efficient. For example, many only have a few integrators, any in many cases this is simply multistep methods (because integrators like LSODE and VODE, provided in ancient Fortran code, have an option to change between stiff and non-stiff problems). But are these actually good methods? That’s not a good way to look at it. Instead, look at what these multistep methods mean. The form for the BDF2 method is:

 y_{n+2} - \frac{4}{3}y_{n+1} + \frac{1}{3}y_n = \frac{2}{3}hf(t_{n+2},y_{n+2})

What are the advantages of this algorithm? Well, because it uses two past data points (assume you already know y_n and y_{n+1} and want y_{n+2}), you only have to solve an implicit function for the variable y_{n+2}. If the cost of evaluating the function is high (as is the case for example with large PDE discretizations), then you want to evaluate it less so this does well! Since this only has a single f call per update, that’s doing quite well.

However, it made a trade off. In order to decrease the number of function evaluations, it requires more than one previous data point. It also requires that “nothing happened” between those two points. If you have some discontinuity for example, like modeling a ball when it bounces or modeling the amount of chemicals in a patient and you give a dose, then the previous data is invalid. Adaptive order algorithms like LSODE or CVODE go back to the more error prone backwards Euler method

y_{n+1} - y_n = hf(t_{n+1},y_{n+1})

to compute one small step (it has to be small to make the error low since the error is \mathcal{O}(\Delta t)), then uses that step as the previous step in a BDF2, then goes up to BDF3 etc., where each higher method requires more previous data and has a lower error order. But if you’re hitting events quite often (Pk/Pd simulations where a drug dose happens every few hours), notice that this is really really bad. Like, awful: you might as well just have had an implicit Euler scheme!

And not only that, the assumption that this method is efficient requires that f is costly. Other methods, like Rosenbrock or (E)SDIRK methods, make the trade off to take more evaluations of f to get less error, and in turn use this decreased error to take larger time step (and thus less steps overall). And so it’s not surprise that these methods benchmark better on small system and even “medium sized systems”, while the BDF method benchmarks as more efficient on larger systems with a more expensive function evaluation (the Rosenbrock methods are things like Rosenbrock23 and Rodas4, and the BDF method is CVODE_BDF). Again, this is nothing more than incorporating more details about the problem into the choice of solution method.

This is why having a wide variety of methods which incorporate different types of problem information is really what matters for efficiency. IMEX methods are a great example of this because instead of the user specifying a single ODE u' = f(t,u), the user specifies two functions u' = f_1(t,u) + f_2(t,u) where one of the functions is “stiff” (and thus should be solved implicitly by things like Rosenbrock or BDF methods) while the other is “non-stiff” (and thus should be solved by explicit methods like the Verner method). By specifying this split form, libraries like ARKODE or DifferentialEquations.jl can utilize algorithms which incorporate this information and thus will be more efficient than any of the previously mentioned methods on appropriately split problems. Another example are symplectic integrators which incorporate the mathematics of the underlying symplectic manifold into the solver itself, for problems which have a symplectic manifold. What kinds of problems lie on a symplectic manifold? Well, those arising from second order differential equations and physical Hamiltonians like N-body problems of astrodynamics and molecular dynamics. These methods, by utilizing a fundamental property of the problem specification, can noticeably reduce the amount of drift in the energy and angular momentum of the approximated solution and make the resulting simulations closer to reality. The changes an algorithm choice can make can be orders of magnitude. In the paper I showed earlier on methods for stochastic differential equations, incorporating the idea of time correlation and interpolation of Brownian motion (the Brownian bridge) to build an adaptive method resulted in an average of 100,000x less time steps to compute solutions to the SDE, so much that I couldn’t even benchmark how long it would take for algorithms which didn’t make use of this because they could not finish! (In my next paper, I am able to estimate that it would take almost 6 years to solve this problem vs the 22 seconds we’re at now). Choice of language, cache efficiency, etc. pale in comparison to proper algorithm choice.

The Julia Programming Language

I hope I am getting across the point that the way to evaluate algorithms is by the information they use about the problem. This way of thinking about algorithms extends very far. I show in another blog post that the Julia programming language achieves C/Fortran levels of efficiency because it allows the LLVM compiler to have total type information about your program, as opposed to more dynamic languages like R/MATLAB/Python. Julia is specifically designed so that compilers can at compile-time compute as much type information as possible to essentially build the C code you would have written. When you understand the information that is being optimized on, these performance differences aren’t magical anymore: it’s just using more information, more specificity, and getting more performance out because of that.

In fact, I have to put a little blurb out here for the Julia programming language as well. Its multiple dispatch and abstract typing system is a very nice way to organize algorithms by their information. Let’s take how Julia’s linear algebra works. For a standard dense matrix, the backslash operator internally calls factorize(::AbstractMatrix) (and A_ldiv_B!, which is an inplace way of writing A\b) which I showed above is a polyalgorithm. But if you call factorize on some special matrix type, like a Diagonal or Tridiagonal matrix, it uses a specialization of factorize or A_ldiv_B! to perform the most efficient algorithm for that type of matrix. Then in user codes and packages, other people can define a matrix type MyMatrix <: AbstractMatrix and overload factorize or A_ldiv_B! to be efficient for their matrix type. Then in my algorithm like a differential equation solver which uses linear algebra under the hood, I can simply call factorize and Julia will then pick the most efficient form based on the matrix type that I currently have (even if that factorize method was defined in a package I do not depend on!). This means that my generic code (meaning type-insensitive code: I don't know the matrix types and just call factorize without caring what kind of matrix it is) can be made more efficient by a user specifying more problem information in the form of type information and dispatches. The result is that the type system with multiple dispatch doesn't just let the compilers get more information, but it also makes it easy for developers to pass along information and specializations that in tern customize other solvers. So the most efficient algorithm for your differential equation may be a Rosenbrock23 algorithm with a GMRES iterative solver that doesn't use the actual Jacobian because J*x can be computed directly from a function call (this is quite common in PDEs), and you don’t need to re-write your own. Instead you can make my generic Rosenbrock23 algorithm compile a code which does that… even though it was never written to know what kind of method factorize should be. This kind of “opening up of the black box” and being able to optimize pieces of it via information injection for your specific problem is a powerful tool for optimizing numerical algorithms to large scale problems. Given this post I hope it’s easy to understand why I think this will (and already has started to) lead to a new generation of efficient implementations.

(Okay okay, you can do generic programming with C++ templates, but you can’t make me do it! Look at this “gentle introduction” and you’ll be in my boat asking for a high-level language designed to have that information in mind, and I’ll point you to Julia)

Moral of the story

Getting your perfect caches lined up is small potatoes. BLAS and LAPACK are well-known super fast Fortran linear algebra libraries. They are extremely fast not just because they are bit twiddling with assembly (though they are), but they are super fast because they incorporate information about floating point precision and use that to their advantage. Neural networks, ODE solvers, etc. are all just tools for problems. The more specific they are to the problem you’re trying to solve, the more efficient they can be. So next time you’re looking to optimize some numerical codes, don’t ask “what can I do to make the computation more clean/efficient”, ask “what information about the problem is the solution method not using?” That’s where the big potatoes come from.

The post Algorithm efficiency comes from problem information appeared first on Stochastic Lifestyle.

Solving Systems of Stochastic PDEs and using GPUs in Julia

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/solving-systems-stochastic-pdes-using-gpus-julia/

What I want to describe in this post is how to solve stochastic PDEs in Julia using GPU parallelism. I will go from start to finish, describing how to use the type-genericness of the DifferentialEquations.jl library in order to write a code that uses within-method GPU-parallelism on the system of PDEs. This is mostly a proof of concept: the most efficient integrators for this problem are not compatible with GPU parallelism yet, and the GPU parallelism isn’t fully efficient yet. However, I thought it would be nice to show an early progress report showing that it works and what needs to be fixed in Base Julia and various libraries for us to get the full efficiency.

Our Problem: 2-dimensional Reaction-Diffusion Equations

The reaction-diffusion equation is a PDE commonly handled in systems biology which is a diffusion equation plus a nonlinear reaction term. The dynamics are defined as:

u_t = D \Delta u + f(t,u)

But this doesn’t need to only have a single “reactant” u: this can be a vector of reactants and the f is then the nonlinear vector equations describing how these different pieces react together. Let’s settle on a specific equation to make this easier to explain. Let’s use a simple model of a 3-component system where A can diffuse through space to bind with the non-diffusive B to form the complex C (also non-diffusive, assume B is too big and gets stuck in a cell which causes C=A+B to be stuck as well). Other than the binding, we make each of these undergo a simple birth-death process, and we write down the equations which result from mass-action kinetics. If this all is meaningless to you, just understand that it gives the system of PDEs:

A_t = D \Delta A + \alpha_A(x) - \beta_A  A - r_1 A B + r_2 C

B_t = \alpha_B - \beta_B B - r_1 A B + r_2 C

C_t = \alpha_C - \beta_C C + r_1 A B - r_2 C

One addition that was made to the model is that we let \alpha_A(x) be the production of A, and we let that be a function of space so that way it only is produced on one side of our equation. Let’s make it a constant when x>80, and 0 otherwise, and let our spatial domain be x \in [0,100] and y \in [0,100].

This model is spatial: each reactant u(t,x,y) is defined at each point in space, and all of the reactions are local, meaning that f at spatial point (x,y) only uses u_i(t,x,y). This is an important fact which will come up later for parallelization.

Discretizing the PDE into ODEs

In order to solve this via a method of lines (MOL) approach, we need to discretize the PDE into a system of ODEs. Let’s do a simple uniformly-spaced grid finite difference discretization. Choose dx = 1 and dy = 1 so that we have 100*100=10000 points for each reactant. Notice how fast that grows! Put the reactants in a matrix such that A[i,j] = A(x_j,y_i), i.e. the columns of the matrix is the x values and the rows are the y values (this way looking at the matrix is essentially like looking at the discretized space).

So now we have 3 matrices (A, B, and C) for our reactants. How do we discretize the PDE? In this case, the diffusion term simply becomes a tridiagonal matrix M where [1,-2,1] is central band. You can notice that MA performs diffusion along the columns of A, and so this is diffusion along the y. Similarly, AM flips the indices and thus does diffusion along the rows of A making this diffusion along x. Thus D(M_yA + AM_x) is the discretized Laplacian (we could have separate diffusion constants and dx \neq dy if we want by using different constants on the M, but let’s not do that for this simple example. I’ll leave that as an exercise for the reader). I enforced a Neumann boundary condition with zero derivative (also known as a no-flux boundary condition) by reflecting the changes over the boundary. Thus the derivative operator is generated as:

const Mx = full(Tridiagonal([1.0 for i in 1:N-1],[-2.0 for i in 1:N],[1.0 for i in 1:N-1]))
const My = copy(Mx)
# Do the reflections, different for x and y operators
Mx[2,1] = 2.0
Mx[end-1,end] = 2.0
My[1,2] = 2.0
My[end,end-1] = 2.0

I also could have done this using the DiffEqOperators.jl library, but I wanted to show what it truly is at its core.

Since all of the reactions are local, we only have each point in space react separately. Thus this represents itself as element-wise equations on the reactants. Thus we can write it out quite simply. The ODE which then represents the PDE is thus in pseudo Julia code:

DA = D*(M*A + A*M)
@. DA + α₁ - β₁*A - r₁*A*B + r₂*C
@. α₂ - β₂*B - r₁*A*B + r₂*C
@. α₃ - β₃*C + r₁*A*B - r₂*C

Note here that I am using α₁ as a matrix (or row-vector, since that will broadcast just fine) where every point in space with x<80 has this zero, and all of the others have it as a constant. The other coefficients are all scalars. How do we do this with the ODE solver?

Our Type: ArrayPartition

The ArrayPartition is an interesting type from RecursiveArrayTools.jl which allows you to define “an array” as actually being different discrete subunits of arrays. Let’s assume that our initial condition is zero for everything and let the production terms build it up. This means that we can define:

A = zeros(M,N); B  = zeros(M,N); C = zeros(M,N)

Now we can put them together as:

u0 = ArrayPartition((A,B,C))

You can read the RecursiveArrayTools.jl README to get more familiar with what the ArrayPartition is, but really it’s an array where u[i] indexes into A first, B second, then C. It also has efficient broadcast, doing the A, B and C parts together (and this is efficient even if they don’t match types!). But since this acts as an array, to DifferentialEquations.jl it is an array!

The important part is that we can “decouple” the pieces of the array at anytime by accessing u.x, which holds our tuple of arrays. Thus our ODE using this ArrayPartition as its container can be written as follows:

function f(t,u,du)
  A,B,C = u.x
  dA,dB,dC = du.x
  DA = D*(M*A + A*M)
  @. dA = DA + α₁ - β₁*A - r₁*A*B + r₂*C
  @. dB = α₂ - β₂*B - r₁*A*B + r₂*C
  @. dC = α₃ - β₃*C + r₁*A*B - r₂*C
end

where this is using @. to do inplace updates on our du to say how the full ArrayPartition should update in time. Note that we can make this more efficient by adding some cache variables to the diffusion matrix multiplications and using A_mul_B!, but let’s ignore that for now.

Together, the ODE which defines our PDE is thus:

prob = ODEProblem(f,u0,(0.0,100.0))
sol = solve(prob,BS3())

if I want to solve it on t \in [0,100]. Done! The solution gives back ArrayPartitions (and interpolates to create new ones if you use sol(t)). We can plot it in Plots.jl

and see the pretty gradients. Using this 3rd order explicit adaptive Runge-Kutta method we solve this equation in about 40 seconds. That’s okay.

Some Optimizations

There are some optimizations that can still be done. When we do A*B as matrix multiplication, we create another temporary matrix. These allocations can bog down the system. Instead we can pre-allocate the outputs and use the inplace functions A_mul_B! to make better use of memory. The easiest way to store these cache arrays are constant globals, but you can use closures (anonymous functions which capture data, i.e. (x)->f(x,y)) or call-overloaded types to do it without globals. The globals way (the easy way) is simply:

const MyA = zeros(N,N)
const AMx = zeros(N,N)
const DA = zeros(N,N)
function f(t,u,du)
  A,B,C = u.x
  dA,dB,dC = du.x
  A_mul_B!(MyA,My,A)
  A_mul_B!(AMx,A,Mx)
  @. DA = D*(MyA + AMx)
  @. dA = DA + α₁ - β₁*A - r₁*A*B + r₂*C
  @. dB = α₂ - β₂*B - r₁*A*B + r₂*C
  @. dC = α₃ - β₃*C + r₁*A*B - r₂*C
end

For reference, closures looks like:

MyA = zeros(N,N)
AMx = zeros(N,N)
DA = zeros(N,N)
function f_full(t,u,du,MyA,AMx,DA)
  A,B,C = u.x
  dA,dB,dC = du.x
  A_mul_B!(MyA,My,A)
  A_mul_B!(AMx,A,Mx)
  @. DA = D*(MyA + AMx)
  @. dA = DA + α₁ - β₁*A - r₁*A*B + r₂*C
  @. dB = α₂ - β₂*B - r₁*A*B + r₂*C
  @. dC = α₃ - β₃*C + r₁*A*B - r₂*C
end
f = (t,u,du)-> f_full(t,u,du,MyA,AMx,DA)

and a call overloaded type looks like:

struct MyFunction{T} <: Function
  MyA::T
  AMx::T
  DA::T
end
 
# Now define the overload
function (ff::MyFunction)(t,u,du)
  # This is a function which references itself via ff
  A,B,C = u.x
  dA,dB,dC = du.x
  A_mul_B!(ff.MyA,My,A)
  A_mul_B!(ff.AMx,A,Mx)
  @. ff.DA = D*(ff.MyA + ff.AMx)
  @. dA = f.DA + α₁ - β₁*A - r₁*A*B + r₂*C
  @. dB = α₂ - β₂*B - r₁*A*B + r₂*C
  @. dC = α₃ - β₃*C + r₁*A*B - r₂*C
end
 
MyA = zeros(N,N)
AMx = zeros(N,N)
DA = zeros(N,N)
 
f = MyFunction(MyA,AMx,DA)
# Now f(t,u,du) is our function!

These last two ways enclose the pointer to our cache arrays locally but still present a function f(t,u,du) to the ODE solver.

Now since PDEs are large, many times we don’t care about getting the whole timeseries. Using the output controls from DifferentialEquations.jl, we can make it only output the final timepoint.

sol = solve(prob,BS3(),progress=true,save_everystep=false,save_start=false)

Also, if you’re using Juno this’ll give you a nice progress bar so you can track how it’s going.

Quick Note About Performance

We are using an explicit Runge-Kutta method here because that’s what works with GPUs so far. Matrix factorizations need to be implemented for GPUArrays before the implicit (stiff) solvers will be available, so here we choose BS3 since it’s fully broadcasting (not all methods are yet) and it’s fully GPU compatible. In practice, right now using an NxNx3 tensor as the initial condition / dependent variable with either OrdinaryDiffEq’s Rosenbrock23(), Rodas4(), or Sundials’ CVODE_BDF() is actually more efficient right now. But after Julia fixes its broadcasting issue and with some updates to Julia’s differentiation libraries to handle abstract arrays like in DiffEqDiffTools.jl, the stiff solvers will be usable with GPUs and all will be well.

Thus for reference I will show some ways to do this efficiently with stiff solvers. With a stiff solver we will not want to factorize the dense Jacobian since that would take forever. Instead we can use something like Sundials’ Krylov method:

u0 = zeros(N,N,3)
const MyA = zeros(N,N);
const AMx = zeros(N,N);
const DA = zeros(N,N)
function f(t,u,du)
  A = @view u[:,:,1]
  B = @view u[:,:,2]
  C = @view u[:,:,3]
  dA = @view du[:,:,1]
  dB = @view du[:,:,2]
  dC = @view du[:,:,3]
  A_mul_B!(MyA,My,A)
  A_mul_B!(AMx,A,Mx)
  @. DA = D*(MyA + AMx)
  @. dA = DA + α₁ - β₁*A - r₁*A*B + r₂*C
  @. dB = α₂ - β₂*B - r₁*A*B + r₂*C
  @. dC = α₃ - β₃*C + r₁*A*B - r₂*C
end
 
# Solve the ODE
prob = ODEProblem(f,u0,(0.0,100.0))
using Sundials
@time sol = solve(prob,CVODE_BDF(linear_solver=:BCG))

and that will solve it in about a second. In this case it wouldn’t be more efficient to use the banded linear solver since the system of equations tends to have different parts of the system interact which makes the bands large, and thus a Krylov method is preferred. See this part of the docs for details on the available linear solvers from Sundials. DifferentialEquations.jl exposes a ton of Sundials’ possible choices so hopefully one works for your problem (preconditioners coming soon).

To do something similar with OrdinaryDiffEq.jl, we would need to make use of the linear solver choices in order to override the internal linear solve functions with some kind of sparse matrix solver like a Krylov method from IterativeSolvers.jl. For this size of problem though a multistep method like BDF is probably preferred though, at least until we implement some IMEX methods.

So if you want to solve it quickly right now, that’s how you do it. But let’s get back to our other story: the future is more exciting.

The Full ODE Code

As a summary, here’s a full PDE code:

using OrdinaryDiffEq, RecursiveArrayTools
 
# Define the constants for the PDE
const α₂ = 1.0
const α₃ = 1.0
const β₁ = 1.0
const β₂ = 1.0
const β₃ = 1.0
const r₁ = 1.0
const r₂ = 1.0
const D = 100.0
const γ₁ = 0.1
const γ₂ = 0.1
const γ₃ = 0.1
const N = 100
const X = reshape([i for i in 1:100 for j in 1:100],N,N)
const Y = reshape([j for i in 1:100 for j in 1:100],N,N)
const α₁ = 1.0.*(X.>=80)
 
const Mx = full(Tridiagonal([1.0 for i in 1:N-1],[-2.0 for i in 1:N],[1.0 for i in 1:N-1]))
const My = copy(Mx)
Mx[2,1] = 2.0
Mx[end-1,end] = 2.0
My[1,2] = 2.0
My[end,end-1] = 2.0
 
# Define the initial condition as normal arrays
A = zeros(N,N); B  = zeros(N,N); C = zeros(N,N)
u0 = ArrayPartition((A,B,C))
 
const MyA = zeros(N,N);
const AMx = zeros(N,N);
const DA = zeros(N,N)
# Define the discretized PDE as an ODE function
function f(t,u,du)
  A,B,C = u.x
  dA,dB,dC = du.x
  A_mul_B!(MyA,My,A)
  A_mul_B!(AMx,A,Mx)
  @. DA = D*(MyA + AMx)
  @. dA = DA + α₁ - β₁*A - r₁*A*B + r₂*C
  @. dB = α₂ - β₂*B - r₁*A*B + r₂*C
  @. dC = α₃ - β₃*C + r₁*A*B - r₂*C
end
 
# Solve the ODE
prob = ODEProblem(f,u0,(0.0,100.0))
sol = solve(prob,BS3(),progress=true,save_everystep=false,save_start=false)
 
using Plots; pyplot()
p1 = surface(X,Y,sol[end].x[1],title = "[A]")
p2 = surface(X,Y,sol[end].x[2],title = "[B]")
p3 = surface(X,Y,sol[end].x[3],title = "[C]")
plot(p1,p2,p3,layout=grid(3,1))

Making Use of GPU Parallelism

That was all using the CPU. How do we make turn on GPU parallelism with DifferentialEquations.jl? Well, you don’t. DifferentialEquations.jl “doesn’t have GPU bits”. So wait… can we not do GPU parallelism? No, this is the glory of type-genericness, especially in broadcasted operations. To make things use the GPU, we simply use a GPUArray. If instead of zeros(N,M) we used GPUArray(zeros(N,M)), then u becomes an ArrayPartition of GPUArrays. GPUArrays naturally override broadcast such that dotted operations are performed on the GPU. DifferentialEquations.jl uses broadcast internally (except in this list of current exceptions due to a limitation with Julia’s inference engine which I have discussed with Jameson Nash (@vtjnash) who mentioned this should be fixed in Julia’s 1.0 release), and thus just by putting the array as a GPUArray, the array-type will take over how all internal updates are performed and turn this algorithm into a fully GPU-parallelized algorithm that doesn’t require copying to the CPU. Wasn’t that simple?

From that you can probably also see how to multithread everything, or how to set everything up with distributed parallelism. You can make the ODE solvers do whatever you want by defining an array type where the broadcast does whatever special behavior you want.

So to recap, the entire difference from above is changing to:

using CLArrays
gA = CLArray(A); gB  = CLArray(B); gC = CLArray(C)
const gMx = CLArray(Mx)
const gMy = CLArray(My)
const gα₁ = CLArray(α₁)
gu0 = ArrayPartition((gA,gB,gC))
 
const gMyA = zeros(N,N)
const gAMx = zeros(N,N)
const gDA = zeros(N,N)
function gf(t,u,du)
  A,B,C = u.x
  dA,dB,dC = du.x
  A_mul_B!(gMyA,gMy,A)
  A_mul_B!(gAMx,A,gMx)
  @. DA = D*(gMyA + AgMx)
  @. dA = DA + gα₁ - β₁*A - r₁*A*B + r₂*C
  @. dB = α₂ - β₂*B - r₁*A*B + r₂*C
  @. dC = α₃ - β₃*C + r₁*A*B - r₂*C
end
 
prob2 = ODEProblem(gf,gu0,(0.0,100.0))
GPUArrays.allowslow(false) # makes sure none of the slow fallbacks are used
@time sol = solve(prob2,BS3(),progress=true,dt=0.003,adaptive=false,save_everystep=false,save_start=false)
 
prob2 = ODEProblem(gf,gu0,(0.0,100.0))
sol = solve(prob2,BS3(),progress=true,save_everystep=false,save_start=false)
# Adaptivity currently fails due to https://github.com/JuliaGPU/CLArrays.jl/issues/10

You can use CUArrays if you want as well. It looks exactly the same as using CLArrays except you exchange the CLArray calls to CUArray. Go have fun.

And Stochastic PDEs?

Why not make it an SPDE? All that we need to do is extend each of the PDE equations to have a noise function. In this case, let’s use multiplicative noise on each reactant. This means that our noise update equation is:

function g(t,u,du)
  A,B,C = u.x
  dA,dB,dC = du.x
  @. dA = γ₁*A
  @. dB = γ₂*A
  @. dC = γ₃*A
end

Now we just define and solve the system of SDEs:

prob = SDEProblem(f,g,u0,(0.0,100.0))
sol = solve(prob,SRIW1())

We can see the cool effect that diffusion dampens the noise in [A] but is unable to dampen the noise in [B] which results in a very noisy [C]. The stiff SPDE takes much longer to solve even using high order plus adaptivity because stochastic problems are just that much more difficult (current research topic is to make new algorithms for this!). It gets GPU’d just by using GPUArrays like before. But there we go: solving systems of stochastic PDEs using high order adaptive algorithms with within-method GPU parallelism. That’s gotta be a first? The cool thing is that nobody ever had to implement the GPU-parallelism either, it just exists by virtue of the Julia type system.

Side Notes

Warning: This can take awhile to solve! An explicit Runge-Kutta algorithm isn’t necessarily great here, though to use a stiff solver on a problem of this size requires once again smartly choosing sparse linear solvers. The high order adaptive method is pretty much necessary though since something like Euler-Maruyama is simply not stable enough to solve this at a reasonable dt. Also, the current algorithms are not so great at handling this problem. Good thing there’s a publication coming along with some new stuff…

Note: the version of SRIW1 which uses broadcast for GPUs is not on the current versions of StochasticDiffEq.jl since it’s slower due to a bug when fusing too many broadcasts which will hopefully get fixed in one of Julia’s 1.x releases. Until then, GPUs cannot be used with this algorithm without a (quick) modification.

Conclusion

So that’s where we’re at. GPU parallelism works because of abstract typing. But in some cases we need to help the GPU array libraries get up to snuff to handle all of the operations, and then we’ll really be in business! Of course there’s more optimizing that needs to be done, and we can do this by specializing code paths on bottlenecks as needed.

I think this is at least a nice proof of concept showing that Julia’s generic algorithms allow for one to not only take advantage of things like higher precision, but also take advantage of parallelism and extra hardware without having to re-write the underlying algorithm. There’s definitely more work that needs to be done, but I can see this usage of abstract array typing as being one of Julia’s “killer features” in the coming years as the GPU community refines its tools. I’d give at least a year before all of this GPU stuff is compatible with stiff solvers and linear solver choices (so that way it can make use of GPU-based Jacobian factorizations and Krylov methods). And comparable methods for SDEs are something I hope to publish soon since the current tools are simply not fit for this scale of problem: high order, adaptivity, sparse linear solvers, and A/L-stability all need to be combined in order to tackle this problem efficiently.

Full Script

Here’s the full script for recreating everything:

#######################################################
### Solve the PDE
#######################################################
 
using OrdinaryDiffEq, RecursiveArrayTools
 
# Define the constants for the PDE
const α₂ = 1.0
const α₃ = 1.0
const β₁ = 1.0
const β₂ = 1.0
const β₃ = 1.0
const r₁ = 1.0
const r₂ = 1.0
const D = 100.0
const γ₁ = 0.1
const γ₂ = 0.1
const γ₃ = 0.1
const N = 100
const X = reshape([i for i in 1:100 for j in 1:100],N,N)
const Y = reshape([j for i in 1:100 for j in 1:100],N,N)
const α₁ = 1.0.*(X.>=80)
 
const Mx = full(Tridiagonal([1.0 for i in 1:N-1],[-2.0 for i in 1:N],[1.0 for i in 1:N-1]))
const My = copy(Mx)
Mx[2,1] = 2.0
Mx[end-1,end] = 2.0
My[1,2] = 2.0
My[end,end-1] = 2.0
 
# Define the initial condition as normal arrays
A = zeros(N,N); B  = zeros(N,N); C = zeros(N,N)
u0 = ArrayPartition((A,B,C))
 
const MyA = zeros(N,N);
const AMx = zeros(N,N);
const DA = zeros(N,N)
# Define the discretized PDE as an ODE function
function f(t,u,du)
  A,B,C = u.x
  dA,dB,dC = du.x
  A_mul_B!(MyA,My,A)
  A_mul_B!(AMx,A,Mx)
  @. DA = D*(MyA + AMx)
  @. dA = DA + α₁ - β₁*A - r₁*A*B + r₂*C
  @. dB = α₂ - β₂*B - r₁*A*B + r₂*C
  @. dC = α₃ - β₃*C + r₁*A*B - r₂*C
end
 
# Solve the ODE
prob = ODEProblem(f,u0,(0.0,100.0))
@time sol = solve(prob,BS3(),progress=true,save_everystep=false,save_start=false)
 
using Plots; pyplot()
p1 = surface(X,Y,sol[end].x[1],title = "[A]")
p2 = surface(X,Y,sol[end].x[2],title = "[B]")
p3 = surface(X,Y,sol[end].x[3],title = "[C]")
plot(p1,p2,p3,layout=grid(3,1))
 
#######################################################
### Solve the PDE using CLArrays
#######################################################
 
using CLArrays
gA = CLArray(A); gB  = CLArray(B); gC = CLArray(C)
const gMx = CLArray(Mx)
const gMy = CLArray(My)
const gα₁ = CLArray(α₁)
gu0 = ArrayPartition((gA,gB,gC))
 
const gMyA = CLArray(MyA)
const gAMx = CLArray(AMx)
const gDA = CLArray(DA)
function gf(t,u,du)
  A,B,C = u.x
  dA,dB,dC = du.x
  A_mul_B!(gMyA,gMy,A)
  A_mul_B!(gAMx,A,gMx)
  @. gDA = D*(gMyA + gAMx)
  @. dA = gDA + gα₁ - β₁*A - r₁*A*B + r₂*C
  @. dB = α₂ - β₂*B - r₁*A*B + r₂*C
  @. dC = α₃ - β₃*C + r₁*A*B - r₂*C
end
 
prob2 = ODEProblem(gf,gu0,(0.0,100.0))
GPUArrays.allowslow(false)
@time sol = solve(prob2,BS3(),progress=true,dt=0.003,adaptive=false,save_everystep=false,save_start=false)
 
prob2 = ODEProblem(gf,gu0,(0.0,100.0))
sol = solve(prob2,BS3(),progress=true,save_everystep=false,save_start=false)
# Adaptivity currently fails due to https://github.com/JuliaGPU/CLArrays.jl/issues/10
 
#######################################################
### Solve the SPDE
#######################################################
 
using StochasticDiffEq
 
function g(t,u,du)
  A,B,C = u.x
  dA,dB,dC = du.x
  @. dA = γ₁*A
  @. dB = γ₂*A
  @. dC = γ₃*A
end
 
prob3 = SDEProblem(f,g,u0,(0.0,100.0))
sol = solve(prob3,SRIW1(),progress=true,save_everystep=false,save_start=false)
 
p1 = surface(X,Y,sol[end].x[1],title = "[A]")
p2 = surface(X,Y,sol[end].x[2],title = "[B]")
p3 = surface(X,Y,sol[end].x[3],title = "[C]")
plot(p1,p2,p3,layout=grid(3,1))
 
# Exercise: Do SPDE + GPU

The post Solving Systems of Stochastic PDEs and using GPUs in Julia appeared first on Stochastic Lifestyle.