Author Archives: Julia Computing, Inc.

Demystifying Auto-vectorization in Julia

Introduction

One of the most impressive features of Julia is that it lets you write generic code. Julia’s powerful LLVM-based compiler can automatically generate highly efficient machine code for base functions and user-written functions alike, on any architecture supported by LLVM, so you rarely have to worry about writing specialized code for each of those architectures.

One additional benefit of relying on the compiler for performance, rather than hand-coding hot loops in assembly, is that it is significantly more future-proof. Whenever a next-generation instruction set architecture comes out, your Julia code automatically gets faster.

Following a (very) brief look at what the hardware provides, we’ll look at a simple example (the sum function) to see how the compiler can take advantage of the hardware architecture to accelerate generic Julia functions.

Intel SIMD Hardware and Addition Instructions

Modern Intel chips provide a range of instruction set extensions. Among these are the various revisions of the Streaming SIMD Extensions (SSE) and several generations of Advanced Vector Extensions (available with their latest processor families). These extensions provide Single Instruction, Multiple Data (SIMD)-style programming, which can significantly speed up code amenable to that style.

SIMD registers are 128 (SSE), 256 (AVX) or 512 (AVX512) bits wide. They can generally be used in chunks of 8, 16, 32 or 64 bits, but exactly which divisions are available and which operations can be performed depends on the exact hardware architecture.
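As a quick sanity check on these widths, the number of lanes per register is just the register width divided by the element width. Here is a small helper (the name `lanes` is ours, purely for illustration):

```julia
# Number of SIMD lanes per register for a given element type,
# where register_bits is 128 (SSE), 256 (AVX) or 512 (AVX512).
lanes(::Type{T}, register_bits::Integer) where {T} = register_bits ÷ (8 * sizeof(T))

lanes(Float64, 256)  # 4 doubles fit in one AVX (ymm) register
lanes(Float32, 512)  # 16 singles fit in one AVX512 (zmm) register
```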

Here is what the add instructions on this architecture look like:

  • (V)ADDPS: Takes two 128/256/512 bit values and adds 4/8/16 single precision values in parallel
  • (V)ADDPD: Takes two 128/256/512 bit values and adds 2/4/8 double precision values in parallel
  • (V)PADD(B/W/D/Q): Takes two 128/256/512 bit values and adds (up to 64) 8/16/32/64-bit integers in parallel
  • (V)ADDSUBP(S,D): Takes two inputs; the operation alternates (+,-,+,-,…) across the packed values
  • There are also a few more exotic instructions that involve horizontal adds, saturating, etc.
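In Julia terms, a packed add such as VADDPD over a 256-bit register behaves like an elementwise add of four Float64 lanes. A small model of that semantics (the name `addpd` is ours, purely for illustration; the real instruction does all four additions in a single cycle-level operation):

```julia
# Model of VADDPD on 256-bit ymm registers: an elementwise add of four
# Float64 lanes, which the hardware performs as a single instruction.
addpd(a::NTuple{4,Float64}, b::NTuple{4,Float64}) = ntuple(i -> a[i] + b[i], 4)

addpd((1.0, 2.0, 3.0, 4.0), (10.0, 20.0, 30.0, 40.0))  # (11.0, 22.0, 33.0, 44.0)
```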

An Example

The following code snippet shows a simple sum function (returning the sum of all the elements in an Array ‘a’) in Julia:

function mysum(a::Vector)
    total = zero(eltype(a))
    @simd for x in a
        total += x
    end
    return total
end

We can visualize this sequential operation as a simple sequence of memory loads and additions:

However, this is not the code that Julia actually generates under the hood. By taking advantage of the SIMD instruction set, the add operation is performed in two phases:

  1. During the first step (denoted “Vector Body” below), intermediate values are accumulated several at a time (four in our example; the exact width depends on the hardware, of course).

  2. The reduction step, in which the final four partial sums are added together.
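The two phases can be sketched in plain Julia. Below is a scalar model of what the compiler emits, assuming a vector width of four and an array length that is a multiple of four (`mysum_sketch` is our name for it; the real compiler keeps the four accumulators in the lanes of one SIMD register):

```julia
# Scalar sketch of the vectorized sum: four independent accumulators
# stand in for the four lanes of one SIMD register.
function mysum_sketch(a::Vector{Float64})
    @assert length(a) % 4 == 0  # assume no scalar tail, for simplicity
    acc = (0.0, 0.0, 0.0, 0.0)
    # "Vector body": accumulate four partial sums per loop iteration
    for i in 1:4:length(a)
        acc = (acc[1] + a[i], acc[2] + a[i+1], acc[3] + a[i+2], acc[4] + a[i+3])
    end
    # "Reduction": combine the four partial sums into one total
    return acc[1] + acc[2] + acc[3] + acc[4]
end
```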

This picture is simplified a bit, but it conveys the general idea of the transformation. In the real code, there are a few extra caveats the compiler has to pay attention to.

  • If the array length is not known to be a multiple of the vector width, the compiler may have to generate an additional scalar part to sum the remaining elements (for large vector widths, e.g. 32, this tail may itself use vector instructions). Depending on the hardware, the same is true if the memory alignment is not known.

  • To take advantage of “superscalarness” (the ability of a processor to execute more than one instruction in parallel, over and above SIMD), compilers will often “unroll” the vector body, keeping more than one SIMD register’s worth of state at the expense of a larger reduction step. (On the previous illustration, imagine the vector body copied four times vertically, with sums happening every fourth set of four values.)

  • If you’re summing floating-point values, the above transformation may need to be explicitly allowed (the Julia “@simd” macro does this for you), since floating-point arithmetic is in general not associative (i.e. the result of the sum may differ between the two methods of computing it).
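This non-associativity is easy to observe at the REPL; the reassociation that “@simd” permits can change results in exactly this way:

```julia
# Floating-point addition is not associative: regrouping the same
# three terms changes the rounding, and therefore the result.
x = (0.1 + 0.2) + 0.3
y = 0.1 + (0.2 + 0.3)
x == y  # false: the two groupings differ in the last bit
x ≈ y   # true: the difference is only rounding error
```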

Machine Code Generated by the Compiler

In Julia, we can use the @code_native macro to inspect the native code generated for any particular function. Trying this for our “mysum” function on 100,000 random numbers of Float64 type, on a machine that supports AVX2, we can see precisely the pattern we expected:

julia> @code_native mysum(rand(Float64, 100000))
vaddpd %ymm5, %ymm0, %ymm0
vaddpd %ymm6, %ymm2, %ymm2
vaddpd %ymm7, %ymm3, %ymm3
vaddpd %ymm8, %ymm4, %ymm4
; NOTE: Omitting length check/branch here
vaddpd %ymm0, %ymm2, %ymm0
vaddpd %ymm0, %ymm3, %ymm0
vaddpd %ymm0, %ymm4, %ymm0
vextractf128 $1, %ymm0, %xmm2
vaddpd %ymm2, %ymm0, %ymm0
vhaddpd %ymm0, %ymm0, %ymm0

The vector body phase on this machine is unrolled four times, using ymm0, ymm2, ymm3 and ymm4 as the accumulation registers.
The reduction step phase accumulates ymm2, ymm3 and ymm4 into ymm0, and finally sums up the parts of ymm0 itself to give the final result.

Here is how the machine code for the same function (arguments of type Float64) looks on a machine that supports AVX512:

julia> @code_native mysum(rand(Float64, 100000))
vaddpd -192(%rdx), %zmm0, %zmm0
vaddpd -128(%rdx), %zmm2, %zmm2
vaddpd -64(%rdx), %zmm3, %zmm3
vaddpd (%rdx), %zmm4, %zmm4
; NOTE: Omitting length check/branch here
vaddpd %zmm0, %zmm2, %zmm0
vaddpd %zmm0, %zmm3, %zmm0
vaddpd %zmm0, %zmm4, %zmm0
vshuff64x2 $14, %zmm0, %zmm0, %zmm2
vaddpd %zmm2, %zmm0, %zmm0
vpermpd $238, %zmm0, %zmm2
vaddpd %zmm2, %zmm0, %zmm0
vpermilpd $1, %zmm0, %zmm2
vaddpd %zmm2, %zmm0, %zmm0

The machine code may look different on other architectures or with different data types, and it may even be more complicated, but the pattern of generating the best machine code possible for the vector body and reduction phases is consistent across architectures, making auto-vectorization in Julia largely effortless.

Julia Computing and Julia Featured in Forbes Asia

Bengaluru, India – Forbes Asia has published a major feature on Julia Computing and Julia.

The article is titled “How a New Programming Language Created by Four Scientists Is Now Used by the World’s Biggest Companies” by Suparna Dutt D’Cunha. It describes the origins of Julia and Julia Computing and how Julia is being used today.

For example, Julia users, partners and firms hiring Julia programmers include Amazon, Apple, BlackRock, Capital One, Citibank, Comcast, Disney, Facebook, Ford, Google, Grindr, IBM, Intel, KPMG, Microsoft, NASA, Oracle, PwC and Uber.

So far, the article has received more than 60,000 page views, and that number is still climbing.

As Stefan Karpinski, Julia Computing CTO for Open Source, explains, “Julia empowers data scientists, physicists, quantitative finance traders and robot designers to solve problems without having to become computer programmers or hire computer programmers to translate their functions into computer code.”

About Julia and Julia Computing

Julia is the fastest modern high performance open source computing language for data, analytics, algorithmic trading, machine learning and artificial intelligence. Julia combines the functionality and ease of use of Python, R, Matlab, SAS and Stata with the speed of C++ and Java. Julia delivers dramatic improvements in simplicity, speed, capacity and productivity. Julia provides parallel computing capabilities out of the box and unlimited scalability with minimal effort. With more than 1.2 million downloads and +161% annual growth, Julia is one of the top programming languages developed on GitHub and adoption is growing rapidly in finance, insurance, energy, robotics, genomics, aerospace and many other fields.

Julia users, partners and employers hiring Julia programmers in 2017 include Amazon, Apple, BlackRock, Capital One, Citibank, Comcast, Disney, Facebook, Ford, Google, Grindr, IBM, Intel, KPMG, Microsoft, NASA, Oracle, PwC and Uber.

  1. Julia is lightning fast. Julia is being used in production today and has generated speed improvements up to 1,000x for insurance model estimation and for parallel supercomputing of astronomical image analysis.

  2. Julia provides unlimited scalability. Julia applications can be deployed on large clusters with a click of a button and can run parallel and distributed computing quickly and easily on tens of thousands of nodes.

  3. Julia is easy to learn. Julia’s flexible syntax is familiar and comfortable for users of Python, R and Matlab.

  4. Julia integrates well with existing code and platforms. Users of C, C++, Python, R and other languages can easily integrate their existing code into Julia.

  5. Elegant code. Julia was built from the ground up for mathematical, scientific and statistical computing. It has advanced libraries that make programming simple and fast and dramatically reduce the number of lines of code required – in some cases, by 90% or more.

  6. Julia solves the two language problem. Because Julia combines the ease of use and familiar syntax of Python, R and Matlab with the speed of C, C++ or Java, programmers no longer need to estimate models in one language and reproduce them in a faster production language. This saves time and reduces error and cost.

Julia Computing was founded in 2015 by the creators of the open source Julia language to develop products and provide support for businesses and researchers who use Julia.

Julia – The Artificial Intelligence Computer Programming Language for the Next 150 Years

Hollywood, CA – Do you know what programming language will be used for artificial intelligence in the Year 2150?

According to TV science fiction drama The 100, the answer is Julia.

The 100 tells the story of a post-apocalyptic Earth shaped by artificial intelligence.

In Season 3, Episode 12, Raven Reyes, performed by Lindsey Morgan, discovers code that enables her team to access A.L.I.E. 2.0, the latest version of the most important artificial intelligence program in the world.

A screenshot reveals that the crucial code is written in Julia and copied from the Julia language repository on GitHub.

According to Viral Shah, co-creator of the Julia language and CEO of Julia Computing, “The 100 staff did their research. It makes sense that they would use Julia to represent the language of artificial intelligence in 2150. Today Julia is being used to guide self-driving race cars and diagnose serious medical conditions. Julia is the most expressive and most powerful language for artificial intelligence today, and based on current growth, this trend will continue to accelerate in the months and years ahead.”


Screenshot from The 100 and the Julia language GitHub repository from which it was copied

The 100 is not the only Hollywood television program featuring Julia.

Julia also made an appearance on Casual, season 3, episode 5. During this episode, the lead character, Alex, interviews for a Chief Technology Officer position at a technology startup company. “The people we meet your age,” the interviewer says to Alex, “they’re not versed in the newer languages: Swift, Julia, Google Go.”

Shah explained, “The character interviewing Alex in Casual hasn’t had quite enough experience with Julia programmers. While it’s true that Julia adoption has exploded among young people – especially at universities and at fast-growing companies including Amazon, Apple, Facebook, Google and Uber – data scientists and researchers with decades of experience are also flocking to Julia to take advantage of Julia’s superior performance and easy-to-learn syntax. Examples include Nobel Laureate Thomas J. Sargent who describes himself as a “walking advertisement for Julia,” and the engineering manager at a major US industrial firm who wrote to tell us:

“I just wanted to thank you for Julia. I am the manager of an engineering group responsible for quite a few numerical tools from stand-alone small programs to full-blown finite element codes. Julia is the most exciting thing I’ve seen in years. When a language is cool enough for a 50-something year old manager to spend his spare time programming in it at home, you know that you’ve kindled serious excitement.”

About Julia and Julia Computing

Julia is the fastest modern high performance open source computing language for data, analytics, algorithmic trading, machine learning and artificial intelligence. Julia combines the functionality and ease of use of Python, R, Matlab, SAS and Stata with the speed of C++ and Java. Julia delivers dramatic improvements in simplicity, speed, capacity and productivity. Julia provides parallel computing capabilities out of the box and unlimited scalability with minimal effort. With more than 1 million downloads and +161% annual growth, Julia is one of the top 10 programming languages developed on GitHub and adoption is growing rapidly in finance, insurance, energy, robotics, genomics, aerospace and many other fields.

Julia users, partners and employers hiring Julia programmers in 2017 include Amazon, Apple, BlackRock, Capital One, Comcast, Disney, Facebook, Ford, Google, Grindr, IBM, Intel, KPMG, Microsoft, NASA, Oracle, PwC, Raytheon and Uber.

  1. Julia is lightning fast. Julia provides speed improvements up to 1,000x for insurance model estimation, 225x for parallel supercomputing image analysis and 11x for macroeconomic modeling.

  2. Julia is easy to learn. Julia’s flexible syntax is familiar and comfortable for users of Python, R and Matlab.

  3. Julia provides unlimited scalability. Julia applications can be deployed on large clusters with a click of a button and can run parallel and distributed computing quickly and easily on tens of thousands of nodes.

  4. Julia integrates well with existing code and platforms. Users of Python, R, Matlab and other languages can easily integrate their existing code into Julia.

  5. Elegant code. Julia was built from the ground up for mathematical, scientific and statistical computing, and has advanced libraries that make coding simple and fast, and dramatically reduce the number of lines of code required – in some cases, by 90% or more.

  6. Julia solves the two language problem. Because Julia combines the ease of use and familiar syntax of Python, R and Matlab with the speed of C, C++ or Java, programmers no longer need to estimate models in one language and reproduce them in a faster production language. This saves time and reduces error and cost.

Julia Computing was founded in 2015 by the creators of the open source Julia language to develop products and provide support for businesses and researchers who use Julia.