Julia at NeurIPS and the Future of Machine Learning Tools

We are excited to share several research papers on the Julia and Flux machine learning ecosystem, to be presented at the NeurIPS Systems for ML Workshop. Since initially proposing the need for a first-class language and ecosystem for machine learning (ML), we have made considerable progress, including the ability to take gradients of arbitrary computations by leveraging Julia’s compiler, and compiling the resulting programs to specialized hardware such as Google’s Tensor Processing Units.

Here we discuss these papers and the projects that brought them to life: Flux.jl [paper], Zygote.jl [paper], and XLA.jl [paper].

Flux.jl is a library that takes a fresh approach to machine learning: it exposes powerful tools to the user in a non-intrusive manner while remaining completely hackable, right down to its core.

“Careful design of the underlying automatic differentiation allows freely mixing mathematical expressions, built-in and custom layers and algorithms with control flow in one model. This makes Flux unusually easy to extend to new problems.”



Flux plays nicely with the entire Julia ecosystem, leveraging Julia’s multiple dispatch so that models and data are shared transparently between Flux and many widely used array types (e.g. CuArrays for effortlessly moving models and data to the GPU). It even lets users extend Julia’s compiler and write custom GPU kernels within the same program.
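As a minimal sketch of what this looks like in practice (assuming Flux’s gpu helper and the CuArrays package, which has since been superseded by CUDA.jl), moving a model and its inputs to the GPU is just a conversion of the underlying arrays:

using Flux, CuArrays

# Build a small model from Flux's built-in layers and convert its
# parameters to GPU arrays; the layers dispatch on whatever array
# type they are given.
m = Chain(Dense(10, 5, relu), Dense(5, 2), softmax) |> gpu

# Data is moved the same way, and the forward pass runs on the GPU.
x = gpu(rand(Float32, 10))
m(x)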

In the Flux paper, we demonstrate how easily one can take advantage of the underlying ecosystem to express complex ideas. One example is how Flux models can be trained with custom training loops that house arbitrary logic, including more complex gradient flows than a typical machine learning framework might support, as in the loop below.

# A custom training loop with two losses and two backward passes.
for (x, c, d) in training_set
    c_hat, d_hat = model(x)
    c_loss = loss(c_hat, c) + λ*loss(d_hat, 1 - d)
    d_loss = loss(d_hat, d)
    # Accumulate gradients for each loss, then take an optimiser step.
    back!(c_loss)
    back!(d_loss)
    opt()
end

Flux.jl has been shown to perform on par with contemporary deep learning libraries while being dramatically simpler, providing intelligent abstractions and maintaining a minimalist API.
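To give a flavour of that minimalist API, here is a small sketch of a classifier built entirely from stock layers; the layer and loss names follow the Flux API at the time of writing:

using Flux

# A simple classifier assembled from Flux's built-in layers.
model = Chain(
    Dense(784, 128, relu),
    Dense(128, 10),
    softmax)

# The loss is just an ordinary Julia function over the model output.
loss(x, y) = Flux.crossentropy(model(x), y)

x = rand(Float32, 784)
y = Flux.onehot(3, 0:9)
loss(x, y)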

Calculating derivatives is a recurring and intensive task when training any large model, and compiler-level optimisations for differentiable code have seen a recent surge in interest. Hooked directly into the language’s compiler, automatic differentiation (AD) can be used almost transparently.

Zygote.jl is one such example: it performs source-to-source transformations on Julia’s Static Single Assignment (SSA) intermediate representation, taking advantage of many recent improvements to the base Julia compiler. Similar efforts such as Capstan.jl showcase alternative applications of the same compiler primitives to automatic differentiation.



Zygote transparently generates adjoint code for arbitrary Julia functions, sacrificing neither speed nor the dynamism of the full Julia language. It interacts directly with Julia’s existing compiler and utilizes its full set of optimisation heuristics. It exposes a familiar interface, making usage extremely simple, as shown by the following example:

julia> @code_llvm derivative(x -> 5x+3, 1)

define i64 @"julia_#625_38792"(i64) {
top:
    ret i64 5
}

It enables reverse mode AD while preserving existing language semantics. The Zygote paper also presents some benchmarks for simple functions against contemporary methods.
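To illustrate how existing language semantics are preserved, here is a small sketch using Zygote’s exported gradient function on a function written with ordinary Julia control flow:

using Zygote

# A scalar function written with a plain Julia loop.
function f(x)
    s = zero(x)
    for i in 1:3
        s += x^i
    end
    return s
end

# d/dx (x + x^2 + x^3) at x = 2 is 1 + 2x + 3x^2 = 17
gradient(f, 2.0)   # (17.0,)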

“It opens up the opportunity for robust traditional compiler techniques to be extended to machine learning, enabling kernel fusion or compilation for accelerators with no artificial limitations on the kinds of models that researchers can express.
This combination has not previously been possible in a high-level, general-purpose programming language.”

XLA.jl, released recently, shows how the Julia compiler can be repurposed to target Google’s TPUs.



This package takes simple, elegant Flux models, applies Zygote’s AD, and offloads the entire forward and backward pass onto the TPU for maximum speed, bringing the story full circle. The XLA paper details its methodology, using Google’s latest XRT API to compile Julia code to XLA IR. It explains how the forward and backward passes are generated, how control flow is handled, and how dynamic Julia code is compiled down to static sub-segments for execution on the TPU.
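In rough outline, the workflow looks like the sketch below; note that the @tpu_compile entry point and the XRTArray conversion shown here are assumptions based on the examples in the XLA paper, not a documented, stable API:

using XLA, Flux

# Hypothetical sketch: a plain Flux model whose call is compiled for
# the TPU. The exact entry points are assumptions (see above).
model = Chain(Dense(10, 5, relu), Dense(5, 2))

# Wrap the input in the TPU-backed array type used by XLA.jl.
x = XRTArray(rand(Float32, 10))

# Compile the forward pass to XLA IR for execution on the TPU.
compiled = @tpu_compile model(x)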

“Targeting TPUs using our compiler, we are able to evaluate the VGG19 forward pass on a batch of 100 images in 0.23s”

XLA.jl is written in under 1000 lines of code, an impressive feat considering the opportunities it opens up, and a testament to the language’s expressive power.

# An HLO operand that generates uniform random
# numbers of the specified shape and element type:
struct HloRng <: HloOp{:rng}
    Type
    Shape
end

"""A function that adds random numbers to
   each entry of a 1000x1000 matrix"""
@eval function add_rand_1000x1000(
        A::XRTArray{Float32, (1000, 1000), 2})
    random = $(HloRng(Float32, (1000, 1000)))()
    result = $(HloAdd())(random, A)
    return result
end

Google Cloud TPUs provide an efficient, extremely high-performance computational platform that can dramatically speed up the demanding task of training models. From the BFloat16s.jl package, which allows algorithms to be prototyped on CPUs to check their numerical stability under the restricted precision available on TPUs, to the compiler internals and the surrounding ML ecosystem, Julia offers a dynamic, familiar and high-performance environment for taking advantage of this special hardware. The progress made within the past few months and the recognition received have us very excited about the future of machine learning in the Julia ecosystem and the world at large.