Author Archives: David Anthoff

Query.jl v0.11 released

By: David Anthoff

Re-posted from: https://www.queryverse.org//2019/02/02/query-v0.11/

We just released Query.jl v0.11. The new version adds four new standalone query operators: @select, @rename, @mutate and @unique. We also fixed a number of bugs and wrote a bunch of new documentation. Finally, the standalone versions of the query operators are no longer considered experimental, so feel free to use the versions that use the pipe operator |> without any worries that they might go away in the future. The rest of this post will showcase the new query operators we added.

The @select query operator

The standalone @select query operator makes it easier to select a subset of columns from a source table and arrange them in some specified order. Note that the discussion in this section is about the standalone @select query operator, not the LINQ style version of @select.

In the most basic example, you can simply specify all the columns that you want to include in the result table. You can select columns either by their name, or their position.

Lets first create a DataFrame with some example data that we will use as the basis for most examples in this post:

julia> using Query, DataFrames

julia> df = DataFrame(firstCol=rand(10), colA=rand(10), colB=rand(10), colC=rand(10), lastCol=rand(10))
10×5 DataFrame
│ Row │ firstCol    │ colA      │ colB       │ colC      │ lastCol   │
│     │ Float64     │ Float64   │ Float64    │ Float64   │ Float64   │
├─────┼─────────────┼───────────┼────────────┼───────────┼───────────┤
│ 1   │ 0.0815091   │ 0.726354  │ 0.164546   │ 0.419195  │ 0.0685362 │
│ 2   │ 0.984546    │ 0.132284  │ 0.00516908 │ 0.619361  │ 0.847543  │
│ 3   │ 0.452009    │ 0.421744  │ 0.443238   │ 0.581122  │ 0.518726  │
│ 4   │ 0.551547    │ 0.204774  │ 0.80987    │ 0.0977708 │ 0.866429  │
│ 5   │ 0.0537391   │ 0.362015  │ 0.760544   │ 0.775301  │ 0.810517  │
│ 6   │ 0.837029    │ 0.0614884 │ 0.596101   │ 0.0613467 │ 0.230335  │
│ 7   │ 0.701862    │ 0.702575  │ 0.714068   │ 0.283429  │ 0.308963  │
│ 8   │ 0.626259    │ 0.246562  │ 0.573619   │ 0.869932  │ 0.148938  │
│ 9   │ 0.000844883 │ 0.432564  │ 0.289367   │ 0.92641   │ 0.592507  │
│ 10  │ 0.365743    │ 0.666146  │ 0.932312   │ 0.352096  │ 0.684766  │

Now we select the column named colB, the second column and the column named lastCol into a new table:

julia> df |> @select(:colB, 2, :lastCol)
10x3 query result
colB       │ colA      │ lastCol
───────────┼───────────┼──────────
0.164546   │ 0.726354  │ 0.0685362
0.00516908 │ 0.132284  │ 0.847543
0.443238   │ 0.421744  │ 0.518726
0.80987    │ 0.204774  │ 0.866429
0.760544   │ 0.362015  │ 0.810517
0.596101   │ 0.0614884 │ 0.230335
0.714068   │ 0.702575  │ 0.308963
0.573619   │ 0.246562  │ 0.148938
0.289367   │ 0.432564  │ 0.592507
0.932312   │ 0.666146  │ 0.684766

Note that we are achieving two things at the same time here: we are picking only a few of the columns from the source table, and we are rearranging the order of the columns.

There are a whole number of options available that make it easier to select multiple columns without specifying each individually.

First, we can use the standard julia range syntax to select consecutive columns:

julia> df |> @select(:colB : :lastCol, :colA)
10x4 query result
colB       │ colC      │ lastCol   │ colA
───────────┼───────────┼───────────┼──────────
0.164546   │ 0.419195  │ 0.0685362 │ 0.726354
0.00516908 │ 0.619361  │ 0.847543  │ 0.132284
0.443238   │ 0.581122  │ 0.518726  │ 0.421744
0.80987    │ 0.0977708 │ 0.866429  │ 0.204774
0.760544   │ 0.775301  │ 0.810517  │ 0.362015
0.596101   │ 0.0613467 │ 0.230335  │ 0.0614884
0.714068   │ 0.283429  │ 0.308963  │ 0.702575
0.573619   │ 0.869932  │ 0.148938  │ 0.246562
0.289367   │ 0.92641   │ 0.592507  │ 0.432564
0.932312   │ 0.352096  │ 0.684766  │ 0.666146

In that example, we first selected all columns between colB and lastCol (so that would be colB, colC and lastCol), and then we also included colA as the second argument to the @select call.

We can also use the range syntax with column position indices:

julia> df |> @select(3:5, :colA)
10x4 query result
colB       │ colC      │ lastCol   │ colA
───────────┼───────────┼───────────┼──────────
0.164546   │ 0.419195  │ 0.0685362 │ 0.726354
0.00516908 │ 0.619361  │ 0.847543  │ 0.132284
0.443238   │ 0.581122  │ 0.518726  │ 0.421744
0.80987    │ 0.0977708 │ 0.866429  │ 0.204774
0.760544   │ 0.775301  │ 0.810517  │ 0.362015
0.596101   │ 0.0613467 │ 0.230335  │ 0.0614884
0.714068   │ 0.283429  │ 0.308963  │ 0.702575
0.573619   │ 0.869932  │ 0.148938  │ 0.246562
0.289367   │ 0.92641   │ 0.592507  │ 0.432564
0.932312   │ 0.352096  │ 0.684766  │ 0.666146

We also provide a number of convenience functions for bulk column selection. For example, to select all columns that have a name that starts with col, we can write:

julia> df |> @select(startswith("col"))
10x3 query result
colA      │ colB       │ colC
──────────┼────────────┼──────────
0.726354  │ 0.164546   │ 0.419195
0.132284  │ 0.00516908 │ 0.619361
0.421744  │ 0.443238   │ 0.581122
0.204774  │ 0.80987    │ 0.0977708
0.362015  │ 0.760544   │ 0.775301
0.0614884 │ 0.596101   │ 0.0613467
0.702575  │ 0.714068   │ 0.283429
0.246562  │ 0.573619   │ 0.869932
0.432564  │ 0.289367   │ 0.92641
0.666146  │ 0.932312   │ 0.352096

Or we can select all columns that have a name that ends with Col:

julia> df |> @select(endswith("Col"))
10x2 query result
firstCol    │ lastCol
────────────┼──────────
0.0815091   │ 0.0685362
0.984546    │ 0.847543
0.452009    │ 0.518726
0.551547    │ 0.866429
0.0537391   │ 0.810517
0.837029    │ 0.230335
0.701862    │ 0.308963
0.626259    │ 0.148938
0.000844883 │ 0.592507
0.365743    │ 0.684766

Lets say we have a situation where we want to select a large number of columns that we can easily select with one of the bulk column options discussed so far, except for one specific column. Here is how we can run such a query:

julia> df |> @select(startswith("col"), -:colB)
10x2 query result
colA      │ colC
──────────┼──────────
0.726354  │ 0.419195
0.132284  │ 0.619361
0.421744  │ 0.581122
0.204774  │ 0.0977708
0.362015  │ 0.775301
0.0614884 │ 0.0613467
0.702575  │ 0.283429
0.246562  │ 0.869932
0.432564  │ 0.92641
0.666146  │ 0.352096

Here we first select all columns that have a name that starts with col. At this point we have selected the columns colA, colB and colC. Next, we remove colB from this list, by specifying a minus - in front of its name as the next argument.

There is one twist to removing columns: if the first argument to the @select query operator starts with a minus, then we implicitly start out with all columns selected. Here is an example that demonstrates this:

julia> df |> @select(-:colB)
10x4 query result
firstCol    │ colA      │ colC      │ lastCol
────────────┼───────────┼───────────┼──────────
0.0815091   │ 0.726354  │ 0.419195  │ 0.0685362
0.984546    │ 0.132284  │ 0.619361  │ 0.847543
0.452009    │ 0.421744  │ 0.581122  │ 0.518726
0.551547    │ 0.204774  │ 0.0977708 │ 0.866429
0.0537391   │ 0.362015  │ 0.775301  │ 0.810517
0.837029    │ 0.0614884 │ 0.0613467 │ 0.230335
0.701862    │ 0.702575  │ 0.283429  │ 0.308963
0.626259    │ 0.246562  │ 0.869932  │ 0.148938
0.000844883 │ 0.432564  │ 0.92641   │ 0.592507
0.365743    │ 0.666146  │ 0.352096  │ 0.684766

Because we start this @select query operator with an argument that has a minus - in front, we start out with all columns, and then remove the colB column.

The minus - can of course also be combined with all the bulk column options. The next example selects all columns from the source table, except if they have a column name that starts with col:

julia> df |> @select(-startswith("col"))
10x2 query result
firstCol    │ lastCol
────────────┼──────────
0.0815091   │ 0.0685362
0.984546    │ 0.847543
0.452009    │ 0.518726
0.551547    │ 0.866429
0.0537391   │ 0.810517
0.837029    │ 0.230335
0.701862    │ 0.308963
0.626259    │ 0.148938
0.000844883 │ 0.592507
0.365743    │ 0.684766

If you are familiar with dplyr all of this will be highly familiar: we largely copied the semantics from that package.

The @rename query operator

The @rename query operator renames one or more columns from a table. Each rename is specified as a Pair, where the first element is either the old name of the column that should be renamed, or the position of the column. The second element must be a Symbol with the new name for the column. Here is an example:

julia> df |> @rename(:colB=>:foo, 1=>:bar)
10x5 query result
bar         │ colA      │ foo        │ colC      │ lastCol
────────────┼───────────┼────────────┼───────────┼──────────
0.0815091   │ 0.726354  │ 0.164546   │ 0.419195  │ 0.0685362
0.984546    │ 0.132284  │ 0.00516908 │ 0.619361  │ 0.847543
0.452009    │ 0.421744  │ 0.443238   │ 0.581122  │ 0.518726
0.551547    │ 0.204774  │ 0.80987    │ 0.0977708 │ 0.866429
0.0537391   │ 0.362015  │ 0.760544   │ 0.775301  │ 0.810517
0.837029    │ 0.0614884 │ 0.596101   │ 0.0613467 │ 0.230335
0.701862    │ 0.702575  │ 0.714068   │ 0.283429  │ 0.308963
0.626259    │ 0.246562  │ 0.573619   │ 0.869932  │ 0.148938
0.000844883 │ 0.432564  │ 0.289367   │ 0.92641   │ 0.592507
0.365743    │ 0.666146  │ 0.932312   │ 0.352096  │ 0.684766

Here we are renaming the colB column to foo, and the first column to bar. This example highlights that one can rename multiple columns in a single call to @rename.

All other columns in a source table are left unmodified by the @rename query operator, and the order of columns is also not changed.

The semantics of this query operator once again follow the equivalent dplyr function.

The @mutate query operator

The @mutate query operator can be used to either add columns to a table, or mutate the content of some columns in a table. Lets say we want to add a new column to our table that is the sum of column colA and the log of column colB:

julia> df |> @mutate(newCol = _.colA + log(_.colB))
10x6 query result
firstCol    │ colA      │ colB       │ colC      │ lastCol   │ newCol
────────────┼───────────┼────────────┼───────────┼───────────┼────────────
0.0815091   │ 0.726354  │ 0.164546   │ 0.419195  │ 0.0685362 │ -1.07821
0.984546    │ 0.132284  │ 0.00516908 │ 0.619361  │ 0.847543  │ -5.13278
0.452009    │ 0.421744  │ 0.443238   │ 0.581122  │ 0.518726  │ -0.391904
0.551547    │ 0.204774  │ 0.80987    │ 0.0977708 │ 0.866429  │ -0.00610799
0.0537391   │ 0.362015  │ 0.760544   │ 0.775301  │ 0.810517  │ 0.0882928
0.837029    │ 0.0614884 │ 0.596101   │ 0.0613467 │ 0.230335  │ -0.455857
0.701862    │ 0.702575  │ 0.714068   │ 0.283429  │ 0.308963  │ 0.365798
0.626259    │ 0.246562  │ 0.573619   │ 0.869932  │ 0.148938  │ -0.309229
0.000844883 │ 0.432564  │ 0.289367   │ 0.92641   │ 0.592507  │ -0.807497
0.365743    │ 0.666146  │ 0.932312   │ 0.352096  │ 0.684766  │ 0.596058

We pass each column we want compute for the output table as a keyword argument to the @mutate query operator. The name of the keyword argument will become the name of the column that we are computing. To the right of the equal sign = we pass an expression that computes the value for the current row for the new column. We have access to the current row via the _ syntax (in the same way it is used in all the other standalone query operators). In our example this allows us for example to reference the value of colA and colB in the current row _. If the output name for the column (in our case newCol) does not exist as a column in the input table, a new column with that name will be added at the end of the table.

The next example uses an output name for the column that already exists, in which case the content of that column will be replaced with the new, computed values:

julia> df |> @mutate(firstCol = _.colA + log(_.colB))
10x5 query result
firstCol    │ colA      │ colB       │ colC      │ lastCol
────────────┼───────────┼────────────┼───────────┼──────────
-1.07821    │ 0.726354  │ 0.164546   │ 0.419195  │ 0.0685362
-5.13278    │ 0.132284  │ 0.00516908 │ 0.619361  │ 0.847543
-0.391904   │ 0.421744  │ 0.443238   │ 0.581122  │ 0.518726
-0.00610799 │ 0.204774  │ 0.80987    │ 0.0977708 │ 0.866429
0.0882928   │ 0.362015  │ 0.760544   │ 0.775301  │ 0.810517
-0.455857   │ 0.0614884 │ 0.596101   │ 0.0613467 │ 0.230335
0.365798    │ 0.702575  │ 0.714068   │ 0.283429  │ 0.308963
-0.309229   │ 0.246562  │ 0.573619   │ 0.869932  │ 0.148938
-0.807497   │ 0.432564  │ 0.289367   │ 0.92641   │ 0.592507
0.596058    │ 0.666146  │ 0.932312   │ 0.352096  │ 0.684766

One can also specify multiple output columns in a single call to @mutate, for example the following code replaces the values in one existing column (named firstCol) and adds a new column named newCol.

julia> df |> @mutate(firstCol = _.colA + log(_.colB), newCol=_.colC * 2)
10x6 query result
firstCol    │ colA      │ colB       │ colC      │ lastCol   │ newCol
────────────┼───────────┼────────────┼───────────┼───────────┼─────────
-1.07821    │ 0.726354  │ 0.164546   │ 0.419195  │ 0.0685362 │ 0.838391
-5.13278    │ 0.132284  │ 0.00516908 │ 0.619361  │ 0.847543  │ 1.23872
-0.391904   │ 0.421744  │ 0.443238   │ 0.581122  │ 0.518726  │ 1.16224
-0.00610799 │ 0.204774  │ 0.80987    │ 0.0977708 │ 0.866429  │ 0.195542
0.0882928   │ 0.362015  │ 0.760544   │ 0.775301  │ 0.810517  │ 1.5506
-0.455857   │ 0.0614884 │ 0.596101   │ 0.0613467 │ 0.230335  │ 0.122693
0.365798    │ 0.702575  │ 0.714068   │ 0.283429  │ 0.308963  │ 0.566858
-0.309229   │ 0.246562  │ 0.573619   │ 0.869932  │ 0.148938  │ 1.73986
-0.807497   │ 0.432564  │ 0.289367   │ 0.92641   │ 0.592507  │ 1.85282
0.596058    │ 0.666146  │ 0.932312   │ 0.352096  │ 0.684766  │ 0.704192

Note that in all of these examples, @mutate leaves any column that is not named explicitly as an output column alone.

And it might not come as a surprise at this point: we again followed the lead of the equivalent dplyr function in the design.

The @unique query operator

The @unique query operator is simple: it removes any duplicate values from an input sequence. Here is a simple example where we remove duplicate values from an array of Int values:

julia> [1,2,3,4,5,2,4] |> @unique()
?-element query result
 1
 2
 3
 4
 5

One can of course also use this with tabular data:

 julia> df2 = DataFrame(a=[1,2,3,2], b=[4,3,2,3])
4×2 DataFrame
 Row  a      b     
      Int64  Int64 
├─────┼───────┼───────┤
 1    1      4     
 2    2      3     
 3    3      2     
 4    2      3     

julia> df2 |> @unique()
3x2 query result
a  b
──┼──
1  4
2  3
3  2

Conclusion

As always, please let us know about any issues or problems that you might run into while using Query.jl over in the github repository.

Query.jl v0.10 released

By: David Anthoff

Re-posted from: https://www.queryverse.org//2018/08/29/query-v0.10/

I just released Query.jl v0.10. This is the first version that is compatible with julia 1.0, and at the same time it drops support for older julia versions.

This version comes with two breaking changes, other than that there are no new features.

New way to select group key

In previous versions one selected the value of a group key via the .key syntax, like in this example:

@from i in df begin
    @group i by i.children into g
    @select {Key=g.key,Count=length(g)}
    @collect DataFrame
end

This syntax is no longer support, instead you have to use the key function to access the key of a group:

@from i in df begin
    @group i by i.children into g
    @select {Key=key(g),Count=length(g)}
    @collect DataFrame
end

.. syntax removed

In previous versions one could use the a..b syntax as a shortcut for map(i->i.b, a). This syntax has been removed, and instead groups of table rows now support access to a whole column via a simple a.b syntax. For example, a typical groupby query might now look like this:

@from i in df begin
    @group i by i.state into g
    @select {group=key(g),mage=mean(g.age), oldest=maximum(g.age), youngest=minimum(g.age)}
    @collect DataFrame
end