JSoC project update – DataStreams.jl

Data processing got ya down? Good news! The DataStreams.jl package, er, framework, has arrived!

The DataStreams processing framework provides a consistent interface for working with data, from source to sink and, eventually, every step in between. It puts forth an interface (specific types and methods) for ingesting and transferring data that hopefully makes for a consistent experience for users, no matter what kind of data they’re working with.

How does it work?

DataStreams is all about creating “sources” (Julia types that represent true data sources, e.g. CSV files, database backends, etc.), “sinks” or data destinations, and defining the appropriate Data.stream!(source, sink) methods to actually transfer data from source to sink. Let’s look at a quick example.

Say I have a table of data in a CSV file on my local machine and need to do a little cleaning and aggregation on the data before building a model with the GLM.jl package. Let’s see some code in action:

```julia
using CSV, SQLite, DataStreams, DataFrames

# let's create a Julia type that understands our data file
csv_source = CSV.Source("datafile.csv")

# let's also create an SQLite destination for our data
# according to its structure
db = SQLite.DB() # create an in-memory SQLite database

# creates an SQLite table
sqlite_sink = SQLite.Sink(Data.schema(csv_source), db, "mydata")

# parse the CSV data directly into our SQLite table
Data.stream!(csv_source, sqlite_sink)

# now I can do some data cleansing/aggregation
# ...various SQL statements on the "mydata" SQLite table...

# now I'm ready to get my data out and ready for model fitting
sqlite_source = SQLite.Source(sqlite_sink)

# stream our data into a Julia structure (Data.Table)
dt = Data.stream!(sqlite_source, Data.Table)

# convert to DataFrame (non-copying)
df = DataFrame(dt)

# do model-fitting
OLS = glm(Y ~ X, df, Normal(), IdentityLink())
```

Here we see it’s quite simple to create a Source type by wrapping a true data source (our CSV file), create a destination for that data (an SQLite table), and transfer the data between them. We can then turn our SQLite.Sink into an SQLite.Source to get the data back out again.

So What Have You Really Been Working On?

Well, a lot actually. Even though the DataStreams framework is currently simple and minimalistic, it took a lot of back and forth on the design, including several discussions at this year’s JuliaCon at MIT. Even with a tidy little framework, however, the bulk of the work still lies in actually implementing the interface in various packages. The two that are ready for release today are CSV.jl and SQLite.jl. They are currently available for Julia 0.4+ only.

Quick rundown of each package:

  • CSV: provides types and methods for working with CSV and other delimited files. Aims to be (and currently is) the fastest and most flexible CSV reader in Julia.
  • SQLite: an interface to the popular SQLite local-machine database. Provides methods for creating/managing database files, executing SQL statements, and viewing the results of those statements (see the short sketch below).
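
To make the rundown a bit more concrete, here is a minimal standalone sketch. It reuses the calls shown in the example above (CSV.Source, SQLite.DB, Data.stream!, Data.Table); the SQLite.query call is an assumption about the era’s query API and the file names are placeholders.

```julia
using CSV, SQLite, DataStreams

# CSV.jl: wrap a delimited file and stream it straight into an in-memory Data.Table
csv = CSV.Source("datafile.csv")     # column names/types inferred from the file
tbl = Data.stream!(csv, Data.Table)

# SQLite.jl: open (or create) a database file and run a statement against it
db = SQLite.DB("mydb.sqlite")
SQLite.query(db, "SELECT sqlite_version()")  # assumed query entry point for this era
```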

So What’s Next?

  • ODBC.jl: the next package to get the DataStreams makeover is ODBC. I’ve already started work on this, and it should hopefully be ready soon.
  • Other packages: I’m always on the hunt for new ways to spread the framework; if you’d be interested in implementing DataStreams for your own package or want to collaborate, just ping me and I’m happy to discuss!
  • transforms: an important part of data processing is not just connecting to data and moving it somewhere else; often you need to clean/transform/aggregate the data in some way in between. Right now, that’s up to users, but I have some ideas around creating DataStreams-friendly ways to easily incorporate transform steps as data is streamed from one place to another.
  • DataStreams for chaining pipelines + transforms: I’m also excited about the idea of creating entire DataStreams that define data processing tasks end-to-end. Setting up a pipeline that could consistently move and process data gets even more powerful as we start looking into automatic parallelism and extensibility.
  • DataStream scheduling/management: I’m also interested in developing capabilities around scheduling and managing DataStreams.

The work on DataStreams.jl was carried out as part of the Julia Summer of Code program, made possible thanks to the generous support of the Gordon and Betty Moore Foundation, and MIT.

JSoC 2015 project – Automatic Differentiation in Julia with ForwardDiff.jl

This summer, I’ve had the good fortune to be able to participate in the first ever Julia Summer of Code (JSoC), generously sponsored by the Gordon and Betty Moore Foundation. My JSoC project was to explore the use of Julia for automatic differentiation (AD), a topic with a wide array of applications in the field of optimization.

Under the mentorship of Miles Lubin and Theodore Papamarkou, I completed a major overhaul of ForwardDiff.jl, a Julia package for calculating derivatives, gradients, Jacobians, Hessians, and higher-order derivatives of native Julia functions (or any callable Julia type, really).

By the end of this post, you’ll hopefully know a little bit about how ForwardDiff.jl works, why it’s useful, and why Julia is uniquely well-suited for AD compared to other languages.

Seeing ForwardDiff.jl In Action

Before we get down to the nitty-gritty details, it might be helpful to see a simple example that illustrates various methods from ForwardDiff.jl’s API.

The snippet below is a somewhat contrived example, but works well enough as an introduction to the package. First, we define a target function we’d like to differentiate, then use ForwardDiff.jl to calculate some derivatives of the function at a given input:

```julia
julia> using ForwardDiff

julia> f(x::Vector) = sum(sin, x) + prod(tan, x) * sum(sqrt, x);

julia> x = rand(5)
5-element Array{Float64,1}:
0.986403
0.140913
0.294963
0.837125
0.650451

julia> g = ForwardDiff.gradient(f); # g = ∇f

julia> g(x)
5-element Array{Float64,1}:
1.01358
2.50014
1.72574
1.10139
1.2445

julia> j = ForwardDiff.jacobian(g); # j = J(∇f)

julia> j(x)
5×5 Array{Float64,2}:
0.585111 3.48083 1.7706 0.994057 1.03257
3.48083 1.06079 5.79299 3.25245 3.37871
1.7706 5.79299 0.423981 1.65416 1.71818
0.994057 3.25245 1.65416 0.251396 0.964566
1.03257 3.37871 1.71818 0.964566 0.140689

julia> ForwardDiff.hessian(f, x) # H(f)(x) == J(∇f)(x), as expected
5×5 Array{Float64,2}:
0.585111 3.48083 1.7706 0.994057 1.03257
3.48083 1.06079 5.79299 3.25245 3.37871
1.7706 5.79299 0.423981 1.65416 1.71818
0.994057 3.25245 1.65416 0.251396 0.964566
1.03257 3.37871 1.71818 0.964566 0.140689
```

Tada! Okay, that’s not too exciting – I could’ve just done the same thing with Calculus.jl! What makes ForwardDiff.jl different? Read on to find out…

How ForwardDiff.jl Works

What is Automatic Differentiation?

In broad terms, automatic differentiation describes a class of algorithms for automatically taking exact derivatives of user-provided functions. In addition to producing more accurate results, AD methods are also often faster than other common differentiation methods (such as finite differencing).

There are two main flavors of AD: forward mode, and reverse mode. As you might’ve guessed, this post only discusses forward mode, which is the kind of AD implemented by ForwardDiff.jl.

ForwardDiffNumber Types

ForwardDiff.jl implements several new number types, all of which are subtypes of ForwardDiffNumber{N,T,C} <: Number.

Elementary numerical functions on these types are then overloaded to evaluate both the original function and its derivative(s), returning the results in the form of a new ForwardDiffNumber. Thus, we can pass these number types into a general function $f$ (which is assumed to be composed of the overloaded elementary functions), and the derivative information is naturally propagated at each step of the calculation by way of the chain rule.

This propagation occurs all the way through to the result of the function, which is itself a ForwardDiffNumber (or an Array{F<:ForwardDiffNumber}). This final result contains the value $f(x)$ and the derivative $f'(x)$, where $x$ was the original point of evaluation.

Simple Forward Mode AD in Julia: Dual Numbers

The easiest way to write actual Julia code demonstrating this technique is to implement a simple dual number type. Note that there is already a Julia package dedicated to such an implementation, but we’re going to roll our own here for pedagogical purposes.
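
A quick aside on why this works (standard dual-number algebra): a dual number is a value $a + b\epsilon$, where $\epsilon$ is a formal quantity satisfying $\epsilon^2 = 0$. For any smooth $f$, Taylor expansion gives

$$f(a + b\epsilon) = f(a) + f'(a)\,b\,\epsilon,$$

so the coefficient of $\epsilon$ in the output is exactly the derivative, and the seed $b$ is what carries the chain rule through composed functions.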

Here’s how we’ll define our DualNumber type:

```julia
immutable DualNumber{T} <: Number
    value::T
    deriv::T
end

value(d::DualNumber) = d.value
deriv(d::DualNumber) = d.deriv
```

Next, we can start defining functions on DualNumber. Here are a few examples to give you a feel for the process:

```julia
function Base.sqrt(d::DualNumber)
    new_value = sqrt(value(d))
    new_deriv = 0.5 / new_value
    return DualNumber(new_value, new_deriv*deriv(d))
end

function Base.sin(d::DualNumber)
    new_value = sin(value(d))
    new_deriv = cos(value(d))
    return DualNumber(new_value, new_deriv*deriv(d))
end

function Base.(:+)(a::DualNumber, b::DualNumber)
    new_value = value(a) + value(b)
    new_deriv = deriv(a) + deriv(b)
    return DualNumber(new_value, new_deriv)
end

function Base.(:*)(a::DualNumber, b::DualNumber)
    val_a, val_b = value(a), value(b)
    new_value = val_a * val_b
    new_deriv = val_b * deriv(a) + val_a * deriv(b)
    return DualNumber(new_value, new_deriv)
end
```

We can now evaluate the derivative of any scalar function composed of the above elementary functions. To do so, we simply pass an instance of our DualNumber type into the function, and extract the derivative from the result. For example:

```julia
julia> f(x) = sqrt(sin(x * x)) + x
f (generic function with 1 method)

julia> f(1.0)
1.8414709848078965

julia> d = f(DualNumber(1.0, 1.0))
DualNumber{Float64}(1.8414709848078965,1.5403023058681398)

julia> deriv1 = deriv(d)
1.589002649374538

julia> using Calculus; deriv2 = Calculus.derivative(f, 1.0)
1.5890026493377403

julia> deriv1 - deriv2
3.679767601738604e-11
```

Notice that our dual number result comes close to the result obtained from Calculus.jl, but is actually slightly different. That slight difference is due to the approximation error inherent to the finite differencing method employed by Calculus.jl.

In reality, the number types that ForwardDiff.jl implements are quite a bit more complicated than DualNumber. ForwardDiffNumbers behave like ensembles of dual numbers and hyper-dual numbers (the higher-order analog of dual numbers), allowing for simultaneous calculation of multiple higher-order partial derivatives. For an in-depth examination of ForwardDiff.jl’s number type implementation, see this section of the developer documentation.
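
To give a flavor of what an “ensemble of dual numbers” means, here is a simplified “multi-dual” sketch (illustrative only, not ForwardDiff.jl’s actual implementation): one value carries several partial derivatives at once, so a single pass through a function of several variables yields the whole gradient.

```julia
# A minimal "multi-dual" sketch: one value plus N partial derivatives.
# Illustrative only; ForwardDiff.jl's real number types are more elaborate.
immutable MultiDual{N,T} <: Number
    value::T
    partials::NTuple{N,T}
end

# Each overloaded function combines the partials via the chain rule.
Base.sin(d::MultiDual) = MultiDual(sin(d.value), map(p -> cos(d.value)*p, d.partials))

function Base.(:*)(a::MultiDual, b::MultiDual)
    MultiDual(a.value*b.value,
              map((pa, pb) -> b.value*pa + a.value*pb, a.partials, b.partials))
end

# Seed each input variable with a unit partial in its own "direction":
x1 = MultiDual(2.0, (1.0, 0.0))
x2 = MultiDual(3.0, (0.0, 1.0))
g = sin(x1 * x2)   # g.partials ≈ (∂/∂x1, ∂/∂x2) of sin(x1*x2) at (2.0, 3.0)
```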

Performance Comparison: The Ackley Function

To illustrate the performance gains that can be achieved using ForwardDiff.jl, let’s do a (very) naive benchmark comparison between ForwardDiff.jl and other libraries. For this comparison, we’ll be comparing the time to calculate the gradient of a function using ForwardDiff.jl, Calculus.jl, and a Python-based AD tool, AlgoPy.

To put these benchmarks in context, here are my machine specs, along with the version info of the relevant libraries:

Machine Specs

  • Model: Late 2013 MacBook Pro
  • OS: OSX 10.9.5
  • Processor: 2.6 GHz Intel Core i5
  • Memory: 8 GB 1600 MHz DDR3

Version Info

  • Julia v0.4.1-pre+15
    • ForwardDiff.jl v0.1.2
    • Calculus.jl v0.1.13
  • Python v2.7.9
    • AlgoPy v0.5.1

Defining the Ackley Function

The function we’ll be using in our test is the Ackley function, which is mathematically defined as

$$\mathrm{ackley}(x) = -a \exp\left(b \sqrt{\tfrac{1}{k}\sum_{i=1}^{k} x_i^2}\right) - \exp\left(\tfrac{1}{k}\sum_{i=1}^{k} \cos(c\,x_i)\right) + a + e,$$

where $a = 20$, $b = -0.2$, $c = 2\pi$, $e$ is Euler’s number, and $k$ is the length of the input vector $x$.

Here’s the definition of the function in Julia:

```julia
function ackley(x)
    a, b, c = 20.0, -0.2, 2.0*π
    len_recip = inv(length(x))
    sum_sqrs = zero(eltype(x))
    sum_cos = sum_sqrs
    for i in x
        sum_cos += cos(c*i)
        sum_sqrs += i^2
    end
    return (-a * exp(b * sqrt(len_recip*sum_sqrs)) -
            exp(len_recip*sum_cos) + a + e)
end
```

…and here’s the corresponding Python definition:

```python
import numpy
import algopy

def ackley(x):
    a, b, c = 20.0, -0.2, 2.0*numpy.pi
    len_recip = 1.0/len(x)
    sum_sqrs, sum_cos = 0.0, 0.0
    for i in x:
        sum_cos += algopy.cos(c*i)
        sum_sqrs += i*i
    return (-a * algopy.exp(b*algopy.sqrt(len_recip*sum_sqrs)) -
            algopy.exp(len_recip*sum_cos) + a + numpy.e)
```

Benchmark Results

The benchmarks were performed with input vectors of length 16, 1600, and 16000, taking the best time out of 5 trials for each test.
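
For reference, here is the kind of best-of-N timing loop that produces numbers like those below (a sketch, not the exact benchmark script used for these results):

```julia
using ForwardDiff

# Time f(x), keeping the best of several trials (the first call includes JIT
# compilation, which the best-of-N minimum effectively discards).
function best_time(f, x; trials=5)
    best = Inf
    for _ in 1:trials
        t = @elapsed f(x)
        best = min(best, t)
    end
    return best
end

x = rand(16000)
g = ForwardDiff.gradient(ackley)   # gradient function, as in the API example above
best_time(ackley, x), best_time(g, x)
```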

Function evaluation time

The table below compares the evaluation times of ackley(x) in both Python and Julia:

| length(x) | Python time (s) | Julia time (s) | Speed-up vs. Python |
|-----------|-----------------|----------------|---------------------|
| 16        | 0.00011         | 2.3e-6         | 47.83x              |
| 1600      | 0.00477         | 4.0269e-5      | 118.45x             |
| 16000     | 0.04747         | 0.00037        | 128.30x             |

Gradient evaluation time

The table below compares the evaluation times of ∇ackley(x) using the various libraries (the chunk_size column denotes a configuration option passed to the ForwardDiff.gradient method; see the chunk-mode docs for details):

| length(x) | AlgoPy time (s) | Calculus.jl time (s) | ForwardDiff time (s) | chunk_size |
|-----------|-----------------|----------------------|----------------------|------------|
| 16        | 0.00212         | 2.2e-5               | 3.5891e-5            | 16         |
| 1600      | 0.53439         | 0.10259              | 0.01304              | 10         |
| 16000     | 101.55801       | 11.18762             | 1.35411              | 10         |

Here, we show the speed-up ratio of ForwardDiff.jl over the other libraries:

| length(x) | Speed-up vs. AlgoPy | Speed-up vs. Calculus.jl |
|-----------|---------------------|--------------------------|
| 16        | 59.07x              | 0.61x                    |
| 1600      | 40.98x              | 7.86x                    |
| 16000     | 74.99x              | 8.26x                    |

As you can see, Python + AlgoPy falls pretty short of the speeds achieved by Julia + ForwardDiff.jl, or even Julia + Calculus.jl. While Calculus.jl is actually almost twice as fast as ForwardDiff.jl for the lowest input dimension vector, it is ~8 times slower than ForwardDiff.jl for the higher input dimension vectors.

Another metric that might be useful to look at is the “slowdown ratio” between the gradient evaluation time and the function evaluation time, defined as:

$$\text{slowdown ratio} = \frac{\text{time to evaluate } \nabla f(x)}{\text{time to evaluate } f(x)}$$

Here are the results (lower is better):

| length(x) | AlgoPy ratio | Calculus.jl ratio | ForwardDiff.jl ratio |
|-----------|--------------|-------------------|----------------------|
| 16        | 19.27        | 9.56              | 15.60                |
| 1600      | 112.03       | 2547.61           | 323.82               |
| 16000     | 2139.41      | 30236.81          | 3659.77              |

Both AlgoPy and ForwardDiff.jl beat out Calculus.jl for evaluation at higher input dimensions, which isn’t too surprising. AlgoPy beating ForwardDiff.jl, though, might catch you off guard – ForwardDiff.jl had the fastest absolute runtimes, after all! One explanation for this outcome is that AlgoPy falls back to vectorized Numpy methods when calculating the gradient, while the ackley function itself uses your usual, slow Python scalar arithmetic. Julia’s scalar arithmetic performance is much faster than Python’s, so ForwardDiff.jl doesn’t have as much “room for improvement” as AlgoPy does.

Julia’s AD Advantage

At the beginning of this post, I promised I would give the reader an answer to the question: “Why is Julia uniquely well-suited for AD compared to other languages?”

There are several good answers, but the chief reason for Julia’s superiority is its efficient implementation of multiple dispatch.

Unlike many other languages, Julia’s type-based operator overloading is fast and natural, as it’s one of the central design tenets of the language. Since Julia is JIT-compiled, the compiled code for a Julia function can be tied directly to the types with which the function is called. This allows the compiler to optimize every Julia method for the specific input types at runtime.

This ability is phenomenally useful for implementing forward mode AD, which relies almost entirely on operator overloading in order to work. In most other scientific computing languages, operator overloading is either very slow (e.g. MATLAB), fraught with weird edge cases (e.g. Python), arduous to implement generally (e.g. C++) or some combination of all three. In addition, very few languages allow operator overloading to naturally extend to native, black-box, user-written code. Julia’s multiple dispatch is the secret weapon leveraged by ForwardDiff.jl to overcome these hurdles.
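
To make the specialization point concrete, here is a small sketch using the DualNumber type defined earlier (illustrative only; run it in a session where those methods exist). The same generic, “black-box” user function is compiled separately, and fully specialized, for each argument type:

```julia
# A generic user function that knows nothing about DualNumber:
g(x) = sin(x * x) + x

g(2.0)                    # triggers compilation of a Float64-specialized method
g(DualNumber(2.0, 1.0))   # triggers compilation of a DualNumber-specialized method

# Inspect the specialized code the JIT generates for each call signature:
@code_llvm g(2.0)
@code_llvm g(DualNumber(2.0, 1.0))
```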

Future Directions

The new version of ForwardDiff.jl has just been released, but development of the package is still ongoing! Here’s a list of things I’d like to see ForwardDiff.jl support in the future:

  • More elementary function definitions on ForwardDiffNumbers
  • More optimized versions of existing elementary function definitions on ForwardDiffNumbers
  • Methods for evaluating Jacobian-matrix products (highly useful in conjunction with reverse mode AD).
  • Parallel/shared-memory/distributed-memory versions of current API methods for handling problems with huge input/output dimensions
  • A more robust benchmarking suite for catching performance regressions

If you have any ideas on how to make ForwardDiff.jl more useful, feel free to open a pull request or issue in the package’s GitHub repository.

JSoC 2015 project: Interactive Visualizations in Julia with GLVisualize.jl

GLVisualize is an interactive visualization library that supports 2D and 3D rendering as well as building of basic GUIs. It’s written entirely in Julia and OpenGL.
I’m really glad that I could continue working on this project with the support of Julia Summer of Code.

During JSoC, my main focus was on advancing GLVisualize, but I also worked on improving the surrounding infrastructure, like GeometryTypes, FileIO, ImageMagick, MeshIO, and FixedSizeArrays.
All recorded gifs in this blog post suffer from lossy compression. You can click on most of them to see the code that produced them.

One of the most interesting parts of GLVisualize is that it combines GUIs and visualizations, instead of relying on a third-party library like Qt for GUIs.
This has many advantages and disadvantages.
The main advantage is that interactive visualizations share a lot of infrastructure with GUI libraries.
By combining the two, new features become possible, e.g. text editing of labels in 3D space, or making elements of a visualization work like a button. These features should end up being pretty snappy, since GLVisualize was created with high performance in mind.

Obviously, the biggest downside is that it is really hard to reach the maturity and feature completeness of, e.g., Qt.

So to really get the best of both worlds, a lot of work is needed.

Current status of GLVisualize, and what I’ve been doing during JSoC

A surprisingly large amount of time went into improving FileIO together with Tim Holy.
The selling point of FileIO is that one can just load a file via FileIO and it will recognize the format and load the respective IO library.
This makes it a lot easier to start working with files in Julia, since no prior knowledge about formats and file loading in Julia is needed.
This is perfect for a visualization library, since most visualizations start from data that comes in some format, which might even be unknown initially.
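
In a nutshell, the single entry point looks like this (a minimal sketch; the file names are placeholders):

```julia
using FileIO

# `load` inspects the file and dispatches to the registered IO package for that format
img  = load("photo.png")    # e.g. handled by an image IO backend such as ImageMagick
mesh = load("model.obj")    # e.g. handled by MeshIO
```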

Since all files are loaded with the same function, it becomes much easier to implement functionality like drag and drop of any file supported by FileIO.
To give you an example, the implementation of the drag and drop feature in GLVisualize only needs a few lines of code thanks to FileIO:

[GIF: drag and drop]

Another feature I’ve been working on is better 2D support.
I’ve implemented different anti-aliased markers, text rendering, and line types.
Apart from the image markers, they all use the distance-field technique to achieve view-independent anti-aliasing.
Here are a few examples:

[GIF: lines]
[GIF: markers]

In the last example all the markers move together.
This is actually one of the core features of GLVisualize. The markers share the same memory for the positions on the GPU without any overhead. Each marker then just has a different offset to that shared position.
This is easily achieved in GLVisualize, since all visualization methods are defined on the GPU objects.
This also works for GPU objects that come from some simulation calculated on the GPU.

During JSoC, I also implemented sliders and line editing widgets for GLVisualize.
One can use them to add interactivity to parameters of a visualization:

[GIF: line_edit]
[GIF: arbitrary_surf]

I have also worked with David P. Sanders to visualize his billiard model, which demonstrates the particle system and a new camera type.

[GIF: billiard]

The particle system can use any mesh primitive. To make it easy to load and create meshes, Steve Kelly and I rewrote the Meshes package to include more features and have a better separation of mesh IO and manipulation. The IO is now in MeshIO, which supports the FileIO interface. The mesh types are in GeometryTypes and meshing algorithms are in different packages in the JuliaGeometry org.

In this example one can see that there are also some GUI widgets to interact with the camera.
The small rectangles in the corner are for switching between orthographic and perspective projection. The cube can be used to center the camera on a particular side.
These kinds of widgets are easy to implement in GLVisualize, as it has been built for GUIs and interactivity from the beginning.
Better camera controls are a big usability win, and I will put more time into improving these even further.

I recorded one last demo to give you some more ideas of what GLVisualize is currently capable of:

[GIF: interactivity]

The demo shows different kinds of animations, 3D text editing, and pop-ups, all of which are relatively easy to include in any visualization created with GLVisualize.

All of this looks promising, but there is still a lot of work needed!
First of all, there is still no tagged version of GLVisualize that will just install via Julia’s package manager.
This is because Reactive.jl and Images.jl are currently not tagged at versions that work with GLVisualize.

On the other hand, the API is not fully thought out yet.
The plan is to use more ideas from Escher.jl and Compose.jl to improve the API.
The goal is to fully support the Compose interface at some point.
That way, GLVisualize can be used as a backend for Gadfly, which will make Gadfly much better suited for large, animated data sets.
In the next weeks, I will need to work on tutorials, documentation, and handling edge cases better.

Big thanks go to the Julia team and everyone involved to make this possible!