Tag Archives: Programming

Useful Algorithms That Are Not Optimized By Jax, PyTorch, or Tensorflow

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/useful-algorithms-that-are-not-optimized-by-jax-pytorch-or-tensorflow/

In some previous blog posts we described in details how one can generalize automatic differentiation to give automatically stability enhancements and all sorts of other niceties by incorporating graph transformations into code generation. However, one of the things which we didn’t go into too much is the limitation of these types of algorithms. This limitation is what we have termed “quasi-static” which is the property that an algorithm can be reinterpreted as some static algorithm. It turns out that for very fundamental reasons, this is the same limitation that some major machine learning frameworks impose on the code that they can fully optimize, such as Jax or Tensorflow. This led us to the question: are there algorithms which are not optimizable within this mindset, and why? The answer is now published at ICML 2021, so lets dig into this higher level concept.

The Space of Quasi-Static Algorithms

First of all, lets come up with a concrete idea of what a quasi-static algorithm is. It’s the space of algorithms which in some way can be re-expressed as a static algorithm. Think of a “static algorithm” as one which has a simple mathematical description that does not require a full computer description, i.e. no loops, rewriting to memory, etc. As an example, let’s take a look at an example from the Jax documentation. The following is something that the Jax JIT works on:

@jit
def f(x):
  for i in range(3):
    x = 2 * x
  return x
 
print(f(3))

Notice that it’s represented by something with control flow, i.e. it is code represented with a loop, but the but the loop is not necessary We can also understand this method as 2*2*2*x or 8*x. The demonstrated example of where the JIT will fail by default is:

@jit
def f(x):
  if x < 3:
    return 3. * x ** 2
  else:
    return -4 * x
 
# This will fail!
try:
  f(2)
except Exception as e:
  print("Exception {}".format(e))

In this case, we can see that there’s essentially two compute graphs split at x<3, and so as stated this does not have a single mathematical statement that describes the computation. You can get around this by doing lax.cond(x < 3, 3. * x ** 2, -4 * x), but notice this is a fundamentally different computation: the lax.cond form always computes both sides of the if statement before choosing which one to carry forward, while the true if statement changes its computation based on the conditional. The reason why the lax.cond form thus works with Jax's JIT compilation system is thus because it is quasi-static. The computations that will occur are fixed, even if the result is not, while the original if statement will change what is computed based on the input values. This limitation exists because Jax traces through a program to attempt to build the static compute graph under the hood, and it then attempts to do its actual transformations on this graph. Are there other kinds of frameworks that do something similar? It also turns out that the set of algorithms which are transformable into purely symbolic languages is the set of quasi-static algorithms, so something like Symbolics.jl also has a form of quasi-staticness manifest in the behaviors of its algorithms. And it’s for the same reason: in symbolic algorithms you define symbolic variables like “x” and “y”, and then trade through a program to build a static compute graph for “2x^2 + 3y” which you then treat symbolically. In the frequently asked questions, there is a question for what happens when a conversion of a function to symbolic fails. If you take a look at the example:

function factorial(x)
  out = x
  while x > 1
    x -= 1
    out *= x
  end
  out
end
 
@variables x
factorial(x)

You can see that the reason for this is because the algorithm is not representable as a single mathematical expression: the factorial cannot be written as a fixed number of multiplications because the number of multiplications is dependent on that value x you’re trying to compute x! for! The error that the symbolic language throws is “ERROR: TypeError: non-boolean (Num) used in boolean context”, which is saying that it does not know how to symbolically expand out “while x > 1” to be able to represent it statically. And this is not something that is not necessarily “fixable”, it’s fundamental to the fact that this algorithm is not able to be represented by a fixed computation and necessarily needs to change the computation based on the input.

Handling Non-Quasi-Static Algorithms in Symbolics and Machine Learning

The “solution” is to define a new primitive to the graph via “@register factorial(x)”, so that this function itself is a fixed node that does not try to be symbolically expanded. This is the same concept as defining a Jax primitive or a Tensorflow primitive, where an algorithm simply is not quasi-static and so the way to get a quasi-static compute graph is to treat the dynamic block just a function “y = f(x)” that is preordained. An in the context of both symbolic languages and machine learning frameworks, for this to work in full you also need to define derivatives of said function. That last part is the catch. If you take another look at the depths of the documentation of some of these tools, you’ll notice that many of these primitives representing non-static control flow fall outside of the realm that is fully handled.

Right there in the documentation it notes that you can replace a while loop with lax.while_loop, but that is not amenable to reverse-mode automatic differentiation. The reason is because its reverse-mode AD implementation assumes that such a quasi-static algorithm exists and uses this for two purposes, one for generating the backpass but secondly for generating the XLA (“Tensorflow”) description of the algorithm to then JIT compile optimize. XLA wants the static compute graph, which again, does not necessarily exist for this case, hence the fundamental limitation. The way to get around this of course is then to define your own primitive with its own fast gradient calculation and this problem goes away…

Or does it?

Where Can We Find The Limit Of Quasi-Static Optimizers?

There are machine learning frameworks which do not make the assumption of quasi-staticness but also optimize, and most of these things like Diffractor.jl, Zygote.jl, and Enzyme.jl in the Julia programming language (note PyTorch does not assume quasi-static representations, though TorchScript’s JIT compilation does). This got me thinking: are there actual machine learning algorithms for which this is a real limitation? This is a good question, because if you pull up your standard methods like convolutional neural networks, that’s a fixed function kernel call with a good derivative defined, or a recurrent neural network, that’s a fixed size for loop. If you want to break this assumption, you have to go to a space that is fundamentally about an algorithm where you cannot know “the amount of computation” until you know the specific values in the problem, and equation solvers are something of this form.

How many steps does it take for Newton’s method to converge? How many steps does an adaptive ODE solver take? This is not questions that can be answered a priori: they are fundamentally questions which require knowing:

  1. what equation are we solving?
  2. what is the initial condition?
  3. over what time span?
  4. with what solver tolerance?

For this reason, people who work in Python frameworks have been looking for the “right” way to treat equation solving (ODE solving, finding roots f(x)=0, etc.) as a blackbox representation. If you take another look at the Neural Ordinary Differential Equations paper, one of the big things it was proposing was the treatment of neural ODEs as a blackbox with a derivative defined by the ODE adjoint. The reason of course is because adaptive ODE solvers necessarily iterate to tolerance, so there is necessarily something like “while t < tend" which is dependent on whether the current computations are computed to tolerance. As something not optimized in the frameworks they were working in, this is something that was required to make the algorithm work.

Should You Treat Equation Solvers As a Quasi-Static Blackbox?

No it’s not fundamental to have to treat such algorithms as a blackbox. In fact, we had a rather popular paper a few years ago showing that neural stochastic differential equations can be trained with forward and reverse mode automatic differentiation directly via some Julia AD tools. The reason is because these AD tools (Zygote, Diffractor, Enzyme, etc.) do not necessarily assume quasi-static forms due to how they do direct source-to-source transformations, and so they can differentiate the adaptive solvers directly and spit out the correct gradients. So you do not necessarily have to do it in the “define a Tensorflow op” style, but which is better?

It turns out that “better” can be really hard to define because the two algorithms are not necessarily the same and can compute different values. You can boil this down to: do you want to differentiate the solver of the equation, or do you want to differentiate the equation and apply a solver to that? The former, which is equivalent to automatic differentiation of the algorithm, is known as discrete sensitivity analysis or discrete-then-optimize. The latter is continuous sensitivity analysis or optimize-then-discretize approaches. Machine learning is not the first field to come up against this problem, so the paper on universal differential equations and the scientific machine learning ecosystem has a rather long description that I will quote:

“””
Previous research has shown that the discrete adjoint approach is more stable than continuous adjoints in some cases [41, 37, 42, 43, 44, 45] while continuous adjoints have been demonstrated to be more stable in others [46, 43] and can reduce spurious oscillations [47, 48, 49]. This trade-off between discrete and continuous adjoint approaches has been demonstrated on some equations as a trade-off between stability and computational efficiency [50, 51, 52, 53, 54, 55, 56, 57, 58]. Care has to be taken as the stability of an adjoint approach can be dependent on the chosen discretization method [59, 60, 61, 62, 63], and our software contribution helps researchers switch between all of these optimization approaches in combination with hundreds of differential equation solver methods with a single line of code change.
“””

Or, tl;dr: there’s tons of prior research which generally shows that continuous adjoints are less stable than discrete adjoints, but they can be faster. We have done recent follow-ups which show these claims are true on modern problems with modern software. Specifically, this paper on stiff neural ODEs shows why discrete adjoints are more stable that continuous adjoints when training on multiscale data, but we also recently showed continuous adjoints can be much faster at gradient computations than (some) current AD techniques for discrete adjoints.

So okay, there’s a true benefit to using discrete adjoint techniques if you’re handling these hard stiff differential equations, differentiting partial differential equations, etc. and this has been known since the 80’s in the field of control theory. But other than that, it’s a wash, and so it’s not clear whether differentiating such algorithms is better in machine learning, right?

Honing In On An Essentially Non-Quasi-Static Algorithm Which Accelerates Machine Learning

This now brings us to how the recent ICML paper fits into this narrative. Is there a non-quasi-static algorithm that is truly useful for standard machine learning? The answer turns out to be yes, but how to get there requires a few slick tricks. First, the setup. Neural ODEs can be an interesting method for machine learning because they use an adaptive ODE solver to essentially choose the number of layers for you, so it’s like a recurrent neural network (or more specifically, like a residual neural network) that automatically finds the “correct” number of layers, where the number of layers is the number of steps the ODE solver decides to take. In other words, Neural ODEs for image processing are an algorithm that automatically do hyperparameter optimization. Neat!

But… what is the “correct” number of layers? For hyperparameter optimization you’d assume that would be “the least number of layers to make predictions accurately”. However, by default neural ODEs will not give you that number of layers: they will give you whatever they feel like. In fact, if you look at the original neural ODE paper, as the neural ODE trains it keeps increasing the number of layers it uses:

So is there a way to change the neural ODE to make it define “correct number of layers” as “least number of layers”? In the work Learning Differential Equations that are Easy to Solve they did just that. How they did it is that they regularized the training process of the neural ODE. They looked at the solution and noted that ODEs with have more changes going on are necessarily harder to solve, so you can transform the training process into hyperparameter optimization by adding a regularization term that says “make the higher order derivative terms as small as possible”. The rest of the paper is how to enact this idea. How was that done? Well, if you have to treat the algorithm as a blackbox, you need to define some blackbox way to defining high order derivatives which then leads to Jesse’s pretty cool formulation of Taylor-mode automatic differentiation. But no matter how you put it, that’s going to be an expensive object to compute: computing the gradient is more expensive than the forward pass, and the second derivative moreso than the gradient, and the third etc, so an algorithm that wants 6th derivatives is going to be nasty to train. With some pretty heroic work they got a formulation of this blackbox operation which takes twice as long to train but successfully does the hyperparmeter optimization.

End of story? Far from it!

The Better Way to Make Neural ODEs An Automatic Hyperparameter Optimizing Algorithm

Is there a way to make automatic hyperparameter optimization via neural ODEs train faster? Yes, and our paper makes them not only train faster than that other method, but makes it train faster than the vanilla neural ODE. We can make layer hyperparameter optimization less than free: we can make it cheaper than not doing the optimization! But how? The trick is to open the blackbox. Let me show you what a step of the adaptive ODE solver looks like:

Notice that the adaptive ODE solver chooses whether a time step is appropriate by using an error estimate. The ODE algorithm is actually constructed so that the error estimate, the estimate of “how hard this ODE is to solve”, is actually computed for free. What if we use this free error estimate as our regularization technique? It turns out that is 10x faster to train that before, while similarly automatically performing hyperparameter optimization.

Notice where we have ended up: the resulting algorithm is necessarily not quasi-static. This error estimate is computed by the actual steps of the adaptive ODE solver: to compute this error estimate, you have to do the same computations, the same while loop, as the ODE solver. In this algorithm, you cannot avoid directly differentiating the ODE solver because pieces of the ODE solver’s internal calculations are now part of the regularization. This is something that is fundamentally not optimized by methods that require quasi-static compute graphs (Jax, Tensorflow, etc.), and it is something that makes hyperparameter optimization cheaper than not doing hyperparameter optimization since the regularizer is computed for free. I just find this result so cool!

Conclusion: Finding the Limitations of Our Tools

So yes, the paper is a machine learning paper on how to do hyperparameter optimization for free using a trick on neural ODEs, but I think the general software context this sits in highlights the true finding of the paper. This is the first algorithm that I know of where there is both a clear incentive for it to be used in modern machine learning, but also, there is a fundamental reason why common machine learning frameworks like Jax and Tensorflow will not be able to treat them optimally. Even PyTorch’s TorchScript will fundamentally, due to the assumptions of its compilation process, no work on this algorithm. Those assumptions were smartly chosen because most algorithms can satisfy them, but this one cannot. Does this mean machine learning is algorithmically stuck in a rut? Possibly, because I thoroughly believe that someone working within a toolset that does not optimize this algorithm would have never found it, which makes it very thought-provoking to me.

What other algorithms are out there which are simply better than our current approaches but are worse only because of the current machine learning frameworks? I cannot wait until Diffractor.jl’s release to start probing this question deeper.

The post Useful Algorithms That Are Not Optimized By Jax, PyTorch, or Tensorflow appeared first on Stochastic Lifestyle.

ModelingToolkit, Modelica, and Modia: The Composable Modeling Future in Julia

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/modelingtoolkit-modelica-and-modia-the-composable-modeling-future-in-julia/

Let me take a bit of time here to write out a complete canonical answer to ModelingToolkit and how it relates to Modia and Modelica. This question comes up a lot: why does ModelingToolkit exist instead of building on tooling for Modelica compilers? I’ll start out by saying I am a huge fan of Martin and Hilding’s work, I respect them a ton, and they have made major advances in this space. But I think ModelingToolkit tops what they have developed in a not-so-subtle way. And it all comes down to the founding principle, the foundational philosophy, of what a modeling language needs to do.

Composable Abstractions for Model Transformations

There is a major philosophical difference which is seen in both the development and usage of the tools. Everything in the SciML organization is built around a principle of confederated modular development: let other packages influence the capabilities of your own. This is highlighted in a paper about the package structure of DifferentialEquations.jl. The underlying principle is that not everyone wants or needs to be a developer of the package, but still may want to contribute. For example, it’s not uncommon that a researcher in ODE solvers wants to build a package that adds one solver to the SciML ecosystem. Doing this in their own package for their own academic credit, but with the free bonus that now it exists in the multiple dispatch world. In the design of DifferentialEquations.jl, solve(prob,IRKGL16()) now exists because of their package, and so we add it to the documentation. Some of this work is not even inside the organization, but we still support it. The philosophy is to include every researcher as a budding artist in the space of computational research, including all of the possible methods, and building an infrastructure that promotes a free research atmosphere in the methods. Top level defaults and documentation may lead people to the most stable aspects of the ecosystem, but with a flip of a switch you can be testing out the latest research.

When approaching modeling languages like Modelica, I noticed this idea was completely foreign to modeling languages. Modelica is created by a committee, but the implementations that people use are closed like Dymola, or monolithic like OpenModelica. This is not a coincidence but instead a fact of the design of the language. In the Modelica language, there is no reference to what transformations are being done to your models in order to make them “simulatable”. People know about Pantelides algorithm, and “singularity elimination”, but this is outside the language. It’s something that the compiler maybe gives you a few options for, but not something the user or the code actively interacts with. Every compiler is different, advances in one compiler do not help your model when you use another compiler, and the whole world is siloed. By this design, it is impossible for an external user to write compiler passes in Modelica which effects this model lowering process. You can tweak knobs, or write a new compiler. Or fork OpenModelica and hack on the whole compiler to just do the change you wanted.

I do not think that the symbolic transformations that are performed by Modelica are the complete set that everyone will need for all models for all time. I think in many cases you might want to write your own. For example, on SDEs there’s a Lamperti transformation which can exist which transforms general SDEs to SDEs with additive noise. It doesn’t always apply, but when it does it can greatly enhance solver speed and stability. This is niche enough that it’ll never be in a commercial Modelica compiler (in fact, they don’t even have SDEs), but it’s something that some user might want to be able to add to the process.

ModelingToolkit: Opening the Development Process

So the starting goal of ModelingToolkit is to give an open and modular transformation system on which a whole modeling ecosystem can thrive. My previous blog post exemplified how unfamiliar use cases for code transformations can solve many difficult mathematical problems, and my goal is to give this power to the whole development community. `structural_simplify` is something built into ModelingToolkit to do “the standard transformations” on the standard systems, but the world of transformations is so much larger. Log-transforming a few variables? Exponentiating a few to ensure positivity? Lamperti transforms of SDEs? Transforming to the sensitivity equations? And not just transformations, but functionality for inspecting and analyzing models. Are the equations linear? Which parameters are structurally identifiable?

From that you can see that Modia was a major inspiration for ModelingToolkit, but Modia did not go in this direction of decomposing the modeling language: it essentially is a simplified Modelica compiler in Julia. But ModelingToolkit is a deconstruction of what a modeling language is. It pulls it down to its component pieces and then makes it easy to build new modeling languages like Catalyst.jl which internally use ModelingToolkit for all of the difficult transformations. The deconstructed form is a jumping point for building new domain-based languages, along with new transformations which optimize the compiler for specific models. And then in the end, everybody who builds off of it gets improved stability, performance, and parallelism as the core MTK passes improve.

Bringing the Power to the People

Now there’s two major aspects that need to be handle to fully achieve such a vision though. If you want people to be able to reuse code between transformations, what you want is to expose how you are changing code. To achieve this goal, a new Computer Algebra System (CAS), Symbolics.jl, was created for ModelingToolkit.jl. The idea being, if we want everyone writing code transformations, they should all have easy access to a general mathematical toolset for doing such code transformations. We shouldn’t have everyone building a new code for differentiation, simplify, and substitution. And we shouldn’t have everyone relying on undocumented internals of ModelingToolkit.jl either: this should be something that is open, well-tested, documented, and a well-known system so that everyone can easily become a “ModelingToolkit compiler developer”. By building a CAS and making it a Julia standard, we can bridge that developer gap because now everyone knows how to easily manipulate models: they are just Symbolics.jl expressions.

The second major aspect is to achieve a natural embedding into the host language. Modelica is not a language in which people can write compiler passes, which introduces a major gap between the modeler and the developer of extensions to the modeling language. If we want to bridge this gap, we need to ensure the whole modeling language is used from a host which is a complete imperative programming language. And you need to do so in a language that is interactive, high performance, and has a well-developed ecosystem for modeling and simulation. Martin and Hilding had seen this fact as the synthesis for Modia with how Julia uniquely satisfies this need, but I think we need to take it a step further. To really make the embedding natural, you should be able to on the fly automatically convert code to and from the symbolic form. In the previous blog post I showcased how ModelingToolkit.jl could improve people’s code by automatically parallelizing it and performing index reduction even if the code was not written in ModelingToolkit.jl. This grows the developer audience of the transformation language from “anyone who wants to transform models” to “anyone who wants to automate improving models and general code”. This expansion of the audience is thus pulling in developers who are interested in things like automating parallelism and GPU codegen and bringing them into the MTK developer community.

Intern, since all of these advances then apply to the MTK internals and code generation tools such as Symbolics.jl’s build_function, new features are coming all of the time because of how the community is composed. The CTarget build_function was first created to transpile Julia code to C, and thus ModelingToolkit models can generate C outputs for compiling into embedded systems. This is serendipity when seeing one example, but it’s design when you notice that this is how the entire system is growing so fast.

But Can Distributed Development Be As Good As Specialized Code?

Now one of the questions we received early on was, won’t you not be able to match the performance a specialized compiler which was only made to work on Modelica, right? While at face value it may seem like hyperspecialization could be beneficial, the true effect of hyperspecialization is that algorithms are simply less efficient because less work has been put into them. Symbolics.jl has become a phenomenon of its own, with multiple different hundred comment threads digging through many aspects of the pros and cons of its designs, and that’s not even including the 200 person chat channel which has had tens of thousands of messages in the less than 2 months since the CAS was released. Tons of people are advising how to improve every single plus and multiply operation.

So it shouldn’t be a surprise that there are many details that have quickly been added which are still years away from a Modelica implementation. It automatically multithreads tree traversals and rewrite rules. It automatically generates fast parallelized code, and can do so in a way that composes with tearing of nonlinear equations. It lets users define their own high-performance and parallelized functions, register them, and stick them into the right hand side. And that is even excluding the higher level results, like the fact that it is fully differentiable and thus allows training neural networks decomposed within the models, and automatically discover equations from data.

Just at the very basic level we can see that the CAS is transforming the workflows of scientists and engineers in many aspects of the modeling process. By distributing the work of improving symbolic computing, we have already taken examples which were essentially not obtainable and making them instant with Symbolics.jl:

We are building out a full benchmarking system for the symbolic ecosystem to track performance over time and ensure it reaches the top level. It’s integrating pieces from The OSCAR project, getting lots of people tracking performance in their own work, and building a community. Each step is another major improvement and this ecosystem is making these steps fast. It will be hard for a few people working on the internals of a single Modelica compiler to keep up with such an environment, let alone repeating this work to every new Modelica-based project.

But How Do You Connect To Modelica?

This is a rather good question because there are a lot of models already written in Modelica, and it would be a shame for us to not be able to connect with that ecosystem. I will hint that there is coming tooling as part of JuliaSim for connecting to many pre-existing model libraries. In addition, we hope to make use of tooling like Modia.jl and TinyModia.jl will help us make a bridge.

Conclusion: Designing Around the Developer Community Has Many Benefits

The composability and distributed development nature of ModelingToolkit.jl is its catalyst. This is why ModelingToolkit.jl looks like it has rocket shoes on: it is fast and it is moving fast. And it’s because of the thought put into the design. It’s because ModelingToolkit.jl is including the entire research community as its asset instead of just its user. I plan to keep moving forward from here, looking back to learn from the greats, but building it in our own image. We’re taking the idea of a modeling language, distributing it throughout one of the most active developer communities in modeling and simulation, in a language which is made to build fast and parallelized code. And you’re invited.

PS: what about Simulink?

I’m just going to post a self-explanatory recent talk by Jonathan at the NASA Launch Services Program who saw a 15,000x acceleration by moving from Simulink to ModelingToolkit.jl.

Enough said.

The post ModelingToolkit, Modelica, and Modia: The Composable Modeling Future in Julia appeared first on Stochastic Lifestyle.

Writing type-stable Julia code

By: Philippe Mainçon

Re-posted from: https://blog.sintef.com/industry-en/writing-type-stable-julia-code/

Julia code
This text presents type stability, which is one of the important concepts that one needs to understand in order to write high-performance Julia code.

This text is aimed at Julia users that are familiar with composite types, abstract types, functions, methods and multiple dispatch. At the same time, as little advanced Julia syntax as possible is used, to make the text accessible.

To type, or not to type

The developers of Julia wanted to solve the two-language problem. They have achieved this and produced a language that “walks like Python and runs like C”. Julia “walks like Python”, because it is not necessary to systematically define the type of every variable that appears in the code. It “runs like C” because it is a compiled language, and produces (or rather, can produce) highly efficient machine code.

Python and MATLAB are examples of interpreted language. In a pure interpreted language, the type of the variables is computed at run time, at the same time as the value of the variables. As long as the values of the inputs to the code are known at the top level (in the REPL or the top script), the interpretation infers, step by step the type of the variables, all the way down the call stack. This allows to write functions without specifying types, and this in turn allows to write generic code (for example an iterative solver that works just as well with Float64 and Float32 variables). The disadvantage is that inferring the type of variables on the fly introduces significant overhead at run time.

At the other end of the scale C and Fortran are examples of strictly typed compiled languages. Because the source code specifies the type of every variable in the function (both variables in the function interface, and local variables), the compiler can create efficient machine code for each function, just by considering the code of that function alone. The disadvantage is that type declaration takes time to write and clutters the source code, and (unless the language offers “templates”, as C++ does), it may be necessary to write several methods, identical in all but types of variables, to make an algorithm available to various data types.

Julia’s approach to type specifications

Julia takes the sweet spot in between, not requiring to specify the type of each variable, yet producing fast machine code. The trick is as follows: every time a method is called (so, at run time), with a combination of concrete types of arguments that has not yet been encountered for this method, the compiler quicks in. A “concrete type” is the information returned by typeof() when called on a variable. One example is Float64. This is as opposed to an abstract type, like Real, which is a set of concrete types, and includes Float64 and Float32. In the rest of this text “type” will refer to “concrete type”.

The compiler now has the source code of the method, and the types of all the arguments. The compiler will produce a method instance (or instance, for short), which is machine code for this combination. One interesting implication is that writing strictly typed method interfaces in Julia does not provide any improvement of machine code performance: the compiler takes the type of the arguments from the calling context anyway. A strictly typed interface has the disadvantage of offering no flexibility. A method that only accepts a Vector will not accept other vector-like things like a SubArray (an array view), a Adjoint (a transposed array), a SparseMatrix or a StaticArray, even thought the method probably implements an algorithm that would compile perfectly well for all of these.

However, providing partial specification of the type of the arguments of a method serves important purposes in Julia:

  1. If a function has several methods, it allows to specify which method should be executed (multiple dispatch). This is where abstract types like Real, AbstractVector and AbstractVector{<:Real} come into their own.
  2. It improves code readability, stating for example “this method expects some vector of some real numbers – but not a string”.
  3. It provides more graceful failures: “function foo has no method that takes in a string” is more informative that some esoteric failure down the line when attempting to add two strings.

What is type stability?

If the source code of the method is well written, the source code and the concrete type of all arguments is enough information for the compiler to infer the concrete type of every variable and expression within the method. The method is then said to be “typestable”, and the Julia compiler will produce efficient code.

If, for a variety of reasons that will be studied in the following, the type of a local variable cannot be inferred from the types of the arguments, the compiler will produce machine code full of “if”s, covering all options of what the type of each variable could be. The loss in performance is often significant, easily by a factor of 10.

If you are yourself able to infer the type of every local variable, and every expression in a method (or script) from the types (not the values) of the arguments or from constants in the code, the function will be typestable. Actually, as will be seen below, this inference of types is also allowed access to struct declarations, and to the types of the return values of functions called by the function you are studying.

The rest of this text will examine a variety of situations, ranging from obvious to more tricky, in which it is not possible to infer the types of local variables from the types of the arguments, resulting in type instability.

For this purpose, it will be useful to write down the information available to the compiler. So for example, if the method

function add(a::Number,b::Number)
    c = a+b
    return c
end

is called with a of type Float64 and b of type Int32, then we will write the information available to the compiler to create an instance as

instance add(a::Float64,b::Int32)
    c = a+b
    return c
end

instance is not Julia syntax, it is just a notation introduced in this text to describe an instance. In such instance description, a concrete type must be associated with every argument.

Example of Julia programming language code
Julia is a high-level, high-performance, dynamic programming language.

If, then

Consider the following method instance

instance largest(a::Float64,b::Int64)
    if a > b
        c = a
    else
        c = b    
    end
    return c
end

The variable c will be set to either a or b. c will take the value and the type of either a or b. The type of c depends on an operation a > b on the values of a and b: the type of c cannot be inferred from the type of arguments alone, and this code is not typestable.

Several approaches might be relevant to prevent type instability. The simplest is to code largest so that it only accepts two arguments of the same type.

function largest(a::R,b::R) where{R<:Real}
    if a > b
        c = a
    else
        c = b    
    end
    return c
end

The method is general, it can result in the generation of method instances like instance largest(a::Float64,b::Float64), instance largest(a::Int64,b::Int64) and many others. It cannot result in the generation of machine code for instance largest(a::Float64,b::Int64) (because R cannot be both Int64 and Float64). If we need to be able to handle variables of different types, yet want type stability, a solution is to use promotion to ensure that c is always of the same type.

function largest(a,b)
    pa,pb = promote(a,b)
    if a > b
        c = pa
    else
        c = pb  
    end
    return c
end

promote is defined so that pa and pb have the same type, and this type is inferred from the types of a and b. For example, for a call instance largest(a::Float64,b::Int64), the types of pa, pb and c will be Float64, to which one can convert a Int64 variable without loss of information (well, mostly).

Do not allow an if-then construct to return a variable which type depends on the branch taken.

Method return value

A method foo that would call the above first, not typestable, version of the method instance largest would receive as output a variable of a type that is value dependent: foo itself would not be typestable. The workaround here is to create typestable methods for largest, as suggested above.

One example is the method Base.findfirst(A), which given a Vector{Boolean} returns the index of the first true element of the vector. The catch is that if all the vector’s elements are false, the method returns nothing. nothing is of type Nothing, while the index is of type Int64. Using this method will make the calling method not typestable.

Avoid methods that return variables of value-dependant types.

Array of abstract element type

Consider the following code

v = [3.,1,"Hello world!"]
function showall(v)
    for e ∈ v
        @show e
    end
end
showall(v)

The above call showall(v) generates a method instance

instance showall(v::Array{Any,1})
    for e ∈ v
        @show e
    end
end

The concrete type of e cannot be inferred from Array{Any,1}, because Any is not a concrete type. More specifically, the type of e changes from one iteration to the next: the code is not typestable. If v is of type Array{Any,1}, even if V is has elements that are all of the same type, this does not help:

v = Vector{Any}(undef,3)
v[1] = 3.
v[2] = 1.
v[3] = 3.14
showall(v)

e may have the same type at each iteration, but this type still cannot be inferred from the type Array{Any,1} of the argument.

If we define w = randn(3), w has type Array{Float64,1}. This is much more informative: every element of w is known to have the same concrete type Float64. Hence the call showall(w) generates a method instance

instance showall(v::Array{Float64,1})
    for e ∈ v
        @show e
    end
end

and the compiler can infer that e is a Float64.

Wherever possible use arrays with a concrete element type.

Sometimes, the use of array with abstract element type is deliberate. One may really wish to iterate over a heterogeneous collection of elements and apply various methods of the same function to them: we design for dynamic dispatch, and must accept that the process of deciding which method to call takes time. Two techniques can be used to limit the performance penalty.

The first is the use of a “function barrier”: The loop over the heterogenous array should contain as little code as possible, ideally only the access to the arrays element, and the call to a method.

for e ∈ v
    foo(e)
end

If v contains elements of different type, the loop is not typestable and hence slow. Yet each value of e at each iteration has its unique concrete type, for which an instance of foo will be generated: foo can be made typestable and fast.

The second, a further improvement of the first, is to group elements by concrete type, for example, using a heterogenous arrays of homogeneous arrays.

vv = [[1.,2.,3.],[1,2]]
for v ∈ vv  # outerloop
    innerloop(v)
end
function innerloop(v)
    for e ∈ v
        foo(e)
    end
end

Here vv is an Array{Any,1}, containing two vectors of different types. vv[1] is a Array{Float64,1} and vv[2] is a Array{Int64,1}. Function innerloop is called twice and two instances are generated

instance innerloop(v::Array{Float64,1})
    for e ∈ v  # e is Float64
        foo(e)
    end
end
instance innerloop(v::Array{Int64,1})
    for e ∈ v  # e is Int64
        foo(e)
    end
end

and in both instances, the type of e is clearly defined: the instances are typestable.

The with this second approach is that the loop for v ∈ vv has few iterations (if the number of types is small compared to the number of elements in each types).

Structure of abstract field type

A similar loss of type stability arises when reading data from structures that have a field of abstract type:

struct SlowType
    a
end
struct JustAsBad
    a::Real
end
struct MuchBetter
    a::Float64
end
function show_a(s)
    @show s.a
end
show_a(SlowType(3.))
show_a(JustAsBad(3.))
show_a(MuchBetter(3.))

The first call to show_a generates

instance show_a(s::SlowType)
    @show s.a # The concrete type of field a of type SlowType cannot be inferred from the definition of SlowType
end

The second call to show_a has the same problem. The third call generates a typestable instance

instance show_a(s::Better)
    @show s.a # That's a Float64
end

It is often interesting to create structures with fields that can have various types. A classic example is Julia’s Complex type, which can have real and imaginary components which are either both Float64, both Float32 or other more exotic choices. This can be done without losing type stability by using parametric types:

struct FlexibleAndFast{R}
    a::R
end
show_a(FlexibleAndFast(3.))
show_a(FlexibleAndFast(3 ))

The above calls generate two typestable instances of show_a

instance show_a(s::FlexibleAndFast{Float64})
    @show s.a # That's a Float64
end
instance show_a(s::FlexibleAndFast{Int64})
    @show s.a # That's an Int64
end

Always use struct with fields of concrete types. Use parametric structure where necessary.

A note on constructors for parametric types

Consider a struct definition without inner constructor:

struct MyType{A,B}
    a::A
    b::B
end

Julia will automatically generate a constructor method with signature

MyType{A,B}(a::A,b::B)

Julia will also produce another method with signature

MyType(a::A,b::B)

because for MyType, it is possible to infer all type parameters from the types of the inputs to the constructor. Other constructors like

MyType{A}(a::A,b::B)

have to be defined explicitly (how should the compiler decide whether to interpret a single type-parameter input as A or B…).

Consider another example:

struct MyType{A,B,C}
    a::A
    b::B
end

Julia will automatically generate a constructor method with signature

MyType{A,B,C}(a::A,b::B)

but will not generate other methods. A method like

MyType{C}(a::A,b::B)

would have to be defined explicitly.

StaticArray

Julia Arrays are an example of parametric type, where the parameters are the type of elements, and the dimension (the number of indices). Importantly, the size of the array is not part of the type, it is a part of the value of the array.

The package StaticArrays.jl provides the type StaticArray, useful for avoiding another performance problem: garbage collection that follows the allocation of Arrays on the heap. This is because StaticArray are allocated on the stack, simplifying runtime memory management.

using StaticArrays
SA = SVector{3,Float64}([1.,2.,3.])
SA = SVector(1.,2.,3.)
SA = SVector([1.,2.,3.])

The first call to SVector is typestable: all the information needed to infer the type of SA is provided in curly braces. The second call is typestable too, because the compiler can deduce the same information from the type and number of inputs. The third call is problematic: while the type of the elements of SA can be inferred by the compiler, the length of [1.,2.,3.] is part of this array’s value, not type. The type of SA has a parameter that depends on the value (the size) of the argument passed to the constructor. Not only does this generate an instance of the constructor that is not type stable, but the non-inferable type of SA “contaminates” the calling code with type instability.

Val

What if we want to write a function that takes an Vector as an input, processes it (for example just keeps it as it is), and returns a SVector of the same shape. Of course we want this function to be general and not be limited to a given array size and we want this function to be typestable, for good performance.

First attempt:

function static(v::Vector)
    return SVector{length(v),eltype(v)}(v)
end

This function is not typestable. It constructs a variable of type StaticArray{(3,),Float64}, where 3 is obtained as the length of v, and the length is part of the value of an Array. Value-to-type alarm!

One possible solution is to use Val. Let us say that static is called by a function foo within which the length of v can be inferred at compile time. We could create the following code

function static(v,::Val{L}) where{L}
    return SVector{L,Float64}(v)
end
function foo()
    Val3 = Val(3)
    Val4 = Val(4)
    @show static([1.,2.,3.]   ,Val3)
    @show static([1.,2.,3.,4.],Val4)
end

The call Val(3) generates a variable, of type Val{3}. Clearly, Val as a function is not typestable, since it creates a variable of a type depending on the value of its argument.

However, function foo is typestable. This may come as a surprise, but two things conspire to allow this:

  1. The source code of foo explicitly mentions the constants 3 and 4, and the compiler has access to it.
  2. The compiler is greedy – it evaluates at compile time whenever possible. Hence the call Val(3) is evaluated during compilation, and Val3 is known to the compiler to be a a value-empty variable of type Val{3}.

In foo, the method static is called twice, leading to the generation of two typestable instances

instance static(v,::Val{3})  
    return SVector{3,Float64}(v)
end
instance static(v,::Val{4})
    return SVector{4,Float64}(v)
end

What if the length of the vectors is not defined as a constant in foo? If this length is the result of some computation, the call to Val with not be typestable. If foo is high enough in the call hierarchy, and outside any time-critical loop, this is not an issue: only foo will not be typestable, but functions that it calls can still be typestable (cf. the function barrier pattern).

Val allows to move type instability up the call hierarchy, or eliminate it altogether.

Anonymous function

Before Julia 1.6.0, the output type of an anonymous function was known to the compiler as Any. As a consequence, the following code was not typestable:

function foo(f)
    @show f(randn())
    return
end
foo(x->0)

However the following code was

function f(x)
    return 0
end
foo(f)

This is not an issue with Julia 1.6, and anonymous functions can be used.

@code_warntype

One important tool to check that an instance is typestable is the macro @code_warntype. For example

v = randn(3)
@code_warntype Val(length(v))
@code_warntype static(v,Val(length(v)))

The first invocation of @code_warntype outputs a semi-compiled code, and highlights some of the types in red: the call Val(3) is not typestable. The second invocation of @code_warntype produces an output in which all types are highlighted in blue: the call to static is typestable. Note that @code_warntype only analyses the compilation of the outermost function static – given the arguments v and Val(length(v)).

Profiler.jl

Profiler.jl provides a graphical representation of where processor time goes, in which code that is not typestable is highlighted. Output from the profiler often shows how type instability propagates: a single variable that is not typestable makes “anything it touches” type unstable.