Keno Fischer, Julia Computing Co-Founder and Chief Technology Officer (Tools), participated in a Quora Session March 18-23.
What are the biggest software challenges in machine learning and differentiable programming systems?
Machine learning has revived a significant amount of interest in techniques that were popular in the 70s and 80s but have since received little mainstream attention outside of academia or certain fairly niche use cases. Among these techniques are polyhedral compilation, optimizing array compilers, automatic differentiation (reverse mode AD in particular), and a few others. Additionally, the hardware folks are reviving ideas that have been out of vogue for a while: systolic arrays, software-scheduled architectures (with software-managed hazards), and innovation in ISA design and interconnects. Writing good compilers for these architectures is quite hard and not really a solved problem yet.
The reason a lot of these ideas failed last time around was that there turned out to be very little advantage in exploring such exotic ideas: next year's general purpose CPU would just be faster than any special purpose hardware you could possibly build. That is no longer the case; physical limitations are preventing CPUs from getting faster. I'm writing this answer on a six-year-old machine, and the most powerful single machine I currently have SSH access to is about a decade old; both would have been mostly unthinkable even a decade ago. So we need to find innovation elsewhere.
The good news is that (if done properly) all this innovation being driven by machine learning will have significant benefits in other fields. Pervasive availability of production-quality automatic differentiation systems (“differentiable programming systems”) will have impact far beyond machine learning. Wherever any sort of optimization process happens, being able to very quickly compute the derivative of your objective function is a crucial prerequisite to getting a good result. This shows up everywhere: finance, astrophysics, medical imaging, personalized medicine, logistics and many others. The same is true for some of the other techniques. Our thesis for building the machine learning stack in Julia is to build it as a set of general infrastructure (AD, compiler support, hardware backends, developer tools, etc.) and have machine learning just fall out as a special case of what’s well supported. That sometimes causes some friction because current-generation machine learning systems tend not to require the full generality that this infrastructure provides, but I think it’s a crucial ingredient for next-generation systems [1].
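To make the optimization point concrete, here is a minimal sketch of the kind of gradient-based loop that shows up across those fields, with the derivative supplied by an AD package (ForwardDiff.jl here) rather than derived by hand. The toy objective, step size, and the `fit` helper are purely illustrative assumptions, not anything specific to those applications:

```julia
using ForwardDiff   # forward mode AD; any Julia AD package could play this role

# Toy objective standing in for a real calibration / fitting problem.
objective(x) = sum(abs2, x .- [1.0, -2.0, 0.5])

function fit(x0; steps = 100, lr = 0.1)
    x = copy(x0)
    for _ in 1:steps
        g = ForwardDiff.gradient(objective, x)  # derivative computed automatically
        x .-= lr .* g                           # plain gradient descent step
    end
    return x
end

fit(zeros(3))   # converges to ≈ [1.0, -2.0, 0.5]
```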
I would be remiss at this point not to point out some of the other work that’s going on in this area. There’s some great work out of Google on Swift for TensorFlow (a very different approach from what we’re doing, but it takes differentiable programming further than any of the other non-Julia systems I have seen) and on MLIR (which, despite its name and much to my enthusiastic support, is trying to build general-purpose, next-generation compiler infrastructure). TVM from UW is doing some great work on ML-driven compiler heuristics and search-space exploration for generating really high-performance kernels on all kinds of architectures. I also liked the goals for Myia when it was announced, but I haven’t heard much about it recently.
[1] I recently gave a talk on this (https://juliacomputing.com/blog/2019/02/19/growing-a-compiler.html). It’s a good overview of the various things we’re doing at the compiler level to support machine learning and differentiable programming. There’s also a larger point on how all of these things have to work together that I sometimes think gets lost.
What type of support for differentiable programming is available in Julia?
Julia as a language tends to make it very easy to write automatic differentiation packages. If you want forward mode AD, JuliaDiff/ForwardDiff.jl is the way to go. For reverse mode AD, the landscape is a lot larger. The packages that I can think of off the top of my head are:
and probably a couple of others that I’m forgetting. However, most of the cutting-edge work is going into Zygote (FluxML/Zygote.jl), which is our next-generation, compiler-based automatic differentiation framework. It’s not quite ready for production use yet, but it’s already quite clear that it will be a significant usability improvement over the previous tracing-based AD implementations (see the sketch after the list below). In particular:
- We’ll be able to differentiate through many more AD-unaware codes (the tracing-based implementations require code to be written fairly generically, which is generally a good property, but not always easy to achieve).
- We no longer have the distinction between tracked arrays and arrays of tracked numbers (along with the associated performance improvements).
- The compiler will be able to introspect the AD data structures. This one in particular is quite cool: a lot of the literature on “tape optimizations” comes down to just applying standard compiler techniques jointly on the forward and backward pass.
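As a rough sketch of what that first point means in practice (the package is still pre-production, so details may shift), Zygote differentiates ordinary Julia code, control flow and all, without requiring any tracked types; the `poly` function below is just an arbitrary illustrative example:

```julia
using Zygote

# An ordinary Julia function with a loop; nothing about it is AD-aware.
function poly(x, n)
    acc = zero(x)
    for k in 1:n
        acc += x^k / k
    end
    return acc
end

Zygote.gradient(x -> poly(x, 3), 2.0)   # (7.0,): d/dx of (x + x^2/2 + x^3/3) at x = 2
```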
That said, we’re still actively working on trying to figure out the correct support that the compiler needs to provide to make this possible. We’re generally against privileging any particular approach to solving a problem (so we wouldn’t just want to put AD into the compiler directly). The reason for this is simple. There are always choices to be made (as the long list of reverse mode packages shows), so if what we’ve built doesn’t quite work for your use case, we’d like to figure out why, but we don’t want to prevent you from building a specialized implementation that does what you need in the meantime.
Our approach instead is to try to figure out the correct reusable abstractions that allow us to implement the thing we want, but that have very clear, simple semantics and will hopefully be useful for use cases we haven’t even thought about yet. For example, right now Zygote is built on top of generated functions, a feature we added to support operations on higher-dimensional arrays as well as to be able to embed a C++ compiler in Julia. But because we came up with a generalized abstraction, lots of cool things were built that we didn’t even think of when we implemented generated functions. Tools like Zygote are outgrowing generated functions a bit, so we need to come up with something slightly different (e.g. the very-much-WIP “RFC: Add an alternative implementation of closures” by Keno, JuliaLang/julia pull request #31253, may be part of the answer), but we’ve been able to do quite a bit of iteration to make sure we get the interface and features right, just with what we already have. Now it’s just a question of adding a little bit of language support to make it amazing.
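For readers unfamiliar with the feature, here is a minimal toy sketch of what a generated function is (just an illustration, not anything from Zygote’s internals): the body runs on the types of the arguments at compile time and returns an expression, which the compiler then specializes for those types:

```julia
# The body of a @generated function sees only the types of its arguments and
# returns an expression; the compiler compiles that expression for each type.
# (Toy example; assumes a non-empty tuple.)
@generated function tuple_sum(t::Tuple)
    n  = length(t.parameters)   # number of elements, known from the tuple type alone
    ex = :(t[1])
    for i in 2:n
        ex = :($ex + t[$i])     # build an unrolled sum: t[1] + t[2] + ...
    end
    return ex
end

tuple_sum((1, 2.0, 3))          # 6.0, from fully unrolled, type-specialized code
```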