Author Archives: David Anthoff

Query.jl v0.7.x released

I just released Query.jl
version v0.7.1. The v0.7.x series includes a number of smaller improvements
and some major new experimental features. This post describes and explains
all new features included in this release.

Enable {} everywhere

{} is the syntax for named tuples in Query.jl.
In previous versions this syntax only worked in @select statements,
and it only worked at the top level (for example you couldn’t create a
named tuple that has a field that is itself a named tuple). The new release
now enables the {} syntax everywhere in queries.

This is especially handy for @group statements. Take the following
query as an example:

using DataFrames, Query

df = DataFrame(sex=[:male, :female, :female, :female], age=[21., 30., 45., 34.], children=[2,2,1,2])

@from i in df begin
    @group i.age by {i.sex, i.children} into g
    @select {g.key.sex, g.key.children, age=mean(g)}
    @collect DataFrame
end

Here we have data about four individuals: their sex, their age and how
many children they have. We then group this data by sex and children, and
compute the average age for each group. The expression by {i.sex, i.children}
creates a named tuple as the grouping key, which is handy later on, because
we can now access the individual fields of the group key by their name.

Experimental standalone commands

I also added number of new standalone commands that enable a user experience
that is inspired by both the method syntax of the LINQ query operators
and dplyr. I am still tinkering with the precise details of this API, so
things might change going forward (hence “experimental”).

Here is a code example that highlights some of the new features and how
they play together with some of the other packages that I’ve created
over the last year:

using Dataverse

df = load("http://www.david-anthoff.com/blog/data/exampledata.csv") |>
    @groupby(_.Children) |>
    @select({Children=_.key,Age=mean(_..Age)}) |>
    @orderby(_.Children) |>
    @tee(save("output.csv")) |>
    @where(_.Age>30) |>    
    @tee(save("output.feather")) |>
    DataFrame

First, this uses the unregistered Dataverse.jl
package (I’m still looking for a better name, ideas welcome!). That package
pulls together a set of data packages that play nicely together and adds
a number of small experimental features on top of Query.jl.

The query starts out by loading a CSV file from a URL. The load function
is from the FileIO.jl package.
You can use that function to load files from web addresses or local disc,
and it works for CSV, Feather, Excel, Stata, SPSS and SAS files.

The next thing to note is that this query uses the pipe operator |>
to build up a data processing pipeline. Essentially you pass your data
through a series of manipulation commands and chain those together via
the |> operator.

The @groupby macro call is the first new standalone query command
introduced in this version of Query.jl.
The argument to the @groupby macro is an anonymous function that selects
the key by which the source data should be grouped. The normal julia
syntax for this would be @groupby(i->i.Children). In the example
above I use another experimental syntax in the new Query.jl
version: when you write an expression that contains the _ symbol,
that expression is translated into _ -> your_original_expression. This
syntax is just a short-cut to writing anonymous functions. In the context
of querying table-like sources, it typically stands for the current row
in query commands. The syntax _.Children here therefor extracts the
value of the Children column for each row, and rows are grouped by
that value.

The next command is a @select macro call. It takes an anonymous function
that projects each element from the source stream. The @groupby macro
creates a stream of groups, so each element that the @select macro sees
will be an element of type Grouping. A Grouping element has one
field key that holds the value of the group key for that group. In
this example I am accessing that value via the _.key expression in
the @select call. Any Grouping element is also at the same time
an array of the elements that were grouped. In our example, the
Grouping element is an array of the rows that make up a single group.
In this example I want to compute the average age for each group. The problem
here is that _ will be an array of rows, not an array of the age column
for each group. But I need to pass just a vector of values to the mean
function, not a vector of rows. The standard julia way to achieve this
would be to use a generator expression like mean(i.Age for i in _).
Because these kind of summary operations for groups are so common in data
analysis, I also added another new experimental syntax to Query.jl
in this version that simplifies the syntax for this common pattern. The
syntax a..b is a shortcut for map(i->i.b,a) in this version, and
that enables the concise syntax mean(_..Age) that is used in our example.

The next command @orderby is relatively simple: it takes an anonymous
function that extracts the key by which things should be sorted from each
source element. So in this example we are sorting the dataset by the
Children column at this stage.

The @tee macro is another experimental feature. It mimics the
tee shell command. It
takes the thing that is piped into it and pipes it into both the argument
of the @tee macro, and then passes the input unmodified to the next
stage in the general pipeline. In the example I use this to save the intermediate
dataset we have created so far into a CSV file, but then pass the data
on to further data manipulation commands.

The save function is once again from the FileIO.jl
package. It currently supports saving tabular data as CSV and Feather files.

The @where command filters the source dataset: only elements for
which the anonymous function that is passed to the @where command returns
true are passed on to the next stage. I am using the _ syntax once
again here to create the anonymous function.

The next @tee command just saves the now filtered dataset to a Feather
file.

The final line of this query pipes the result into a DataFrame. Note
that we never allocated or used any of the table types like DataFrame
or DataTable in the query so far. We could have easily concluded the
query with a save call and actually never materialized the data into
a DataFrame if we just wanted to load a file, manipulate it and save
it back to disc again. But if we intend to use the data for further
in-memory manipulation it is of course convenient to store it in a
DataFrame (or any of the other supported table types, see the list
in IterableTables.jl).

You can find a bit more information about these new experimental features
in the documentation here.

I am really interested in feedback on these new features! They are obviously
not complete at this point, but I hope you can get a general feel for
the direction, and any comments on that and anything else would be most
welcome. I track both the issues over at Query.jl
and the discussion on the julia forum.

One important final point: this new syntax will eventually augment the syntax
we already have in Query.jl.
The @from macro will not go away! In fact, one can easily combine the
two styles, for example in the following way:

using Dataverse

df = load("http://www.david-anthoff.com/blog/data/exampledata.csv") |>
    @query(i, begin
        @group i by i.Children into g
        @select {Children=g.key, Age=mean(g..Age)}
    end) |>
    @where(_.Age>30) |>
    DataFrame

Various smaller changes

Query.jl now uses the iterable
tables interface definition in TableTraits.jl.
This change should be entirely transparent to users, it just amounts
to some reorganization of packages in the background.

The release also includes a number of performance improvements and bug fixes.

Thanks

Thanks to Florian for help with the
.. syntax, and to Keno and Steven
for pointing me to the @tee name.

This post is being discussed here.