By: David Anthoff
Re-posted from: https://www.queryverse.org//2019/02/02/query-v0.11/
We just released Query.jl v0.11. The new version adds four new standalone query operators: @select
, @rename
, @mutate
and @unique
. We also fixed a number of bugs and wrote a bunch of new documentation. Finally, the standalone versions of the query operators are no longer considered experimental, so feel free to use the versions that use the pipe operator |>
without any worries that they might go away in the future. The rest of this post will showcase the new query operators we added.
The @select
query operator
The standalone @select
query operator makes it easier to select a subset of columns from a source table and arrange them in some specified order. Note that the discussion in this section is about the standalone @select
query operator, not the LINQ style version of @select
.
In the most basic example, you can simply specify all the columns that you want to include in the result table. You can select columns either by their name, or their position.
Lets first create a DataFrame
with some example data that we will use as the basis for most examples in this post:
julia> using Query, DataFrames
julia> df = DataFrame(firstCol=rand(10), colA=rand(10), colB=rand(10), colC=rand(10), lastCol=rand(10))
10×5 DataFrame
│ Row │ firstCol │ colA │ colB │ colC │ lastCol │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────────┼───────────┼────────────┼───────────┼───────────┤
│ 1 │ 0.0815091 │ 0.726354 │ 0.164546 │ 0.419195 │ 0.0685362 │
│ 2 │ 0.984546 │ 0.132284 │ 0.00516908 │ 0.619361 │ 0.847543 │
│ 3 │ 0.452009 │ 0.421744 │ 0.443238 │ 0.581122 │ 0.518726 │
│ 4 │ 0.551547 │ 0.204774 │ 0.80987 │ 0.0977708 │ 0.866429 │
│ 5 │ 0.0537391 │ 0.362015 │ 0.760544 │ 0.775301 │ 0.810517 │
│ 6 │ 0.837029 │ 0.0614884 │ 0.596101 │ 0.0613467 │ 0.230335 │
│ 7 │ 0.701862 │ 0.702575 │ 0.714068 │ 0.283429 │ 0.308963 │
│ 8 │ 0.626259 │ 0.246562 │ 0.573619 │ 0.869932 │ 0.148938 │
│ 9 │ 0.000844883 │ 0.432564 │ 0.289367 │ 0.92641 │ 0.592507 │
│ 10 │ 0.365743 │ 0.666146 │ 0.932312 │ 0.352096 │ 0.684766 │
Now we select the column named colB
, the second column and the column named lastCol
into a new table:
julia> df |> @select(:colB, 2, :lastCol)
10x3 query result
colB │ colA │ lastCol
───────────┼───────────┼──────────
0.164546 │ 0.726354 │ 0.0685362
0.00516908 │ 0.132284 │ 0.847543
0.443238 │ 0.421744 │ 0.518726
0.80987 │ 0.204774 │ 0.866429
0.760544 │ 0.362015 │ 0.810517
0.596101 │ 0.0614884 │ 0.230335
0.714068 │ 0.702575 │ 0.308963
0.573619 │ 0.246562 │ 0.148938
0.289367 │ 0.432564 │ 0.592507
0.932312 │ 0.666146 │ 0.684766
Note that we are achieving two things at the same time here: we are picking only a few of the columns from the source table, and we are rearranging the order of the columns.
There are a whole number of options available that make it easier to select multiple columns without specifying each individually.
First, we can use the standard julia range syntax to select consecutive columns:
julia> df |> @select(:colB : :lastCol, :colA)
10x4 query result
colB │ colC │ lastCol │ colA
───────────┼───────────┼───────────┼──────────
0.164546 │ 0.419195 │ 0.0685362 │ 0.726354
0.00516908 │ 0.619361 │ 0.847543 │ 0.132284
0.443238 │ 0.581122 │ 0.518726 │ 0.421744
0.80987 │ 0.0977708 │ 0.866429 │ 0.204774
0.760544 │ 0.775301 │ 0.810517 │ 0.362015
0.596101 │ 0.0613467 │ 0.230335 │ 0.0614884
0.714068 │ 0.283429 │ 0.308963 │ 0.702575
0.573619 │ 0.869932 │ 0.148938 │ 0.246562
0.289367 │ 0.92641 │ 0.592507 │ 0.432564
0.932312 │ 0.352096 │ 0.684766 │ 0.666146
In that example, we first selected all columns between colB
and lastCol
(so that would be colB
, colC
and lastCol
), and then we also included colA
as the second argument to the @select
call.
We can also use the range syntax with column position indices:
julia> df |> @select(3:5, :colA)
10x4 query result
colB │ colC │ lastCol │ colA
───────────┼───────────┼───────────┼──────────
0.164546 │ 0.419195 │ 0.0685362 │ 0.726354
0.00516908 │ 0.619361 │ 0.847543 │ 0.132284
0.443238 │ 0.581122 │ 0.518726 │ 0.421744
0.80987 │ 0.0977708 │ 0.866429 │ 0.204774
0.760544 │ 0.775301 │ 0.810517 │ 0.362015
0.596101 │ 0.0613467 │ 0.230335 │ 0.0614884
0.714068 │ 0.283429 │ 0.308963 │ 0.702575
0.573619 │ 0.869932 │ 0.148938 │ 0.246562
0.289367 │ 0.92641 │ 0.592507 │ 0.432564
0.932312 │ 0.352096 │ 0.684766 │ 0.666146
We also provide a number of convenience functions for bulk column selection. For example, to select all columns that have a name that starts with col
, we can write:
julia> df |> @select(startswith("col"))
10x3 query result
colA │ colB │ colC
──────────┼────────────┼──────────
0.726354 │ 0.164546 │ 0.419195
0.132284 │ 0.00516908 │ 0.619361
0.421744 │ 0.443238 │ 0.581122
0.204774 │ 0.80987 │ 0.0977708
0.362015 │ 0.760544 │ 0.775301
0.0614884 │ 0.596101 │ 0.0613467
0.702575 │ 0.714068 │ 0.283429
0.246562 │ 0.573619 │ 0.869932
0.432564 │ 0.289367 │ 0.92641
0.666146 │ 0.932312 │ 0.352096
Or we can select all columns that have a name that ends with Col
:
julia> df |> @select(endswith("Col"))
10x2 query result
firstCol │ lastCol
────────────┼──────────
0.0815091 │ 0.0685362
0.984546 │ 0.847543
0.452009 │ 0.518726
0.551547 │ 0.866429
0.0537391 │ 0.810517
0.837029 │ 0.230335
0.701862 │ 0.308963
0.626259 │ 0.148938
0.000844883 │ 0.592507
0.365743 │ 0.684766
Lets say we have a situation where we want to select a large number of columns that we can easily select with one of the bulk column options discussed so far, except for one specific column. Here is how we can run such a query:
julia> df |> @select(startswith("col"), -:colB)
10x2 query result
colA │ colC
──────────┼──────────
0.726354 │ 0.419195
0.132284 │ 0.619361
0.421744 │ 0.581122
0.204774 │ 0.0977708
0.362015 │ 0.775301
0.0614884 │ 0.0613467
0.702575 │ 0.283429
0.246562 │ 0.869932
0.432564 │ 0.92641
0.666146 │ 0.352096
Here we first select all columns that have a name that starts with col
. At this point we have selected the columns colA
, colB
and colC
. Next, we remove colB
from this list, by specifying a minus -
in front of its name as the next argument.
There is one twist to removing columns: if the first argument to the @select
query operator starts with a minus, then we implicitly start out with all columns selected. Here is an example that demonstrates this:
julia> df |> @select(-:colB)
10x4 query result
firstCol │ colA │ colC │ lastCol
────────────┼───────────┼───────────┼──────────
0.0815091 │ 0.726354 │ 0.419195 │ 0.0685362
0.984546 │ 0.132284 │ 0.619361 │ 0.847543
0.452009 │ 0.421744 │ 0.581122 │ 0.518726
0.551547 │ 0.204774 │ 0.0977708 │ 0.866429
0.0537391 │ 0.362015 │ 0.775301 │ 0.810517
0.837029 │ 0.0614884 │ 0.0613467 │ 0.230335
0.701862 │ 0.702575 │ 0.283429 │ 0.308963
0.626259 │ 0.246562 │ 0.869932 │ 0.148938
0.000844883 │ 0.432564 │ 0.92641 │ 0.592507
0.365743 │ 0.666146 │ 0.352096 │ 0.684766
Because we start this @select
query operator with an argument that has a minus -
in front, we start out with all columns, and then remove the colB
column.
The minus -
can of course also be combined with all the bulk column options. The next example selects all columns from the source table, except if they have a column name that starts with col
:
julia> df |> @select(-startswith("col"))
10x2 query result
firstCol │ lastCol
────────────┼──────────
0.0815091 │ 0.0685362
0.984546 │ 0.847543
0.452009 │ 0.518726
0.551547 │ 0.866429
0.0537391 │ 0.810517
0.837029 │ 0.230335
0.701862 │ 0.308963
0.626259 │ 0.148938
0.000844883 │ 0.592507
0.365743 │ 0.684766
If you are familiar with dplyr all of this will be highly familiar: we largely copied the semantics from that package.
The @rename
query operator
The @rename
query operator renames one or more columns from a table. Each rename is specified as a Pair
, where the first element is either the old name of the column that should be renamed, or the position of the column. The second element must be a Symbol
with the new name for the column. Here is an example:
julia> df |> @rename(:colB=>:foo, 1=>:bar)
10x5 query result
bar │ colA │ foo │ colC │ lastCol
────────────┼───────────┼────────────┼───────────┼──────────
0.0815091 │ 0.726354 │ 0.164546 │ 0.419195 │ 0.0685362
0.984546 │ 0.132284 │ 0.00516908 │ 0.619361 │ 0.847543
0.452009 │ 0.421744 │ 0.443238 │ 0.581122 │ 0.518726
0.551547 │ 0.204774 │ 0.80987 │ 0.0977708 │ 0.866429
0.0537391 │ 0.362015 │ 0.760544 │ 0.775301 │ 0.810517
0.837029 │ 0.0614884 │ 0.596101 │ 0.0613467 │ 0.230335
0.701862 │ 0.702575 │ 0.714068 │ 0.283429 │ 0.308963
0.626259 │ 0.246562 │ 0.573619 │ 0.869932 │ 0.148938
0.000844883 │ 0.432564 │ 0.289367 │ 0.92641 │ 0.592507
0.365743 │ 0.666146 │ 0.932312 │ 0.352096 │ 0.684766
Here we are renaming the colB
column to foo
, and the first column to bar
. This example highlights that one can rename multiple columns in a single call to @rename
.
All other columns in a source table are left unmodified by the @rename
query operator, and the order of columns is also not changed.
The semantics of this query operator once again follow the equivalent dplyr function.
The @mutate
query operator
The @mutate
query operator can be used to either add columns to a table, or mutate the content of some columns in a table. Lets say we want to add a new column to our table that is the sum of column colA
and the log of column colB
:
julia> df |> @mutate(newCol = _.colA + log(_.colB))
10x6 query result
firstCol │ colA │ colB │ colC │ lastCol │ newCol
────────────┼───────────┼────────────┼───────────┼───────────┼────────────
0.0815091 │ 0.726354 │ 0.164546 │ 0.419195 │ 0.0685362 │ -1.07821
0.984546 │ 0.132284 │ 0.00516908 │ 0.619361 │ 0.847543 │ -5.13278
0.452009 │ 0.421744 │ 0.443238 │ 0.581122 │ 0.518726 │ -0.391904
0.551547 │ 0.204774 │ 0.80987 │ 0.0977708 │ 0.866429 │ -0.00610799
0.0537391 │ 0.362015 │ 0.760544 │ 0.775301 │ 0.810517 │ 0.0882928
0.837029 │ 0.0614884 │ 0.596101 │ 0.0613467 │ 0.230335 │ -0.455857
0.701862 │ 0.702575 │ 0.714068 │ 0.283429 │ 0.308963 │ 0.365798
0.626259 │ 0.246562 │ 0.573619 │ 0.869932 │ 0.148938 │ -0.309229
0.000844883 │ 0.432564 │ 0.289367 │ 0.92641 │ 0.592507 │ -0.807497
0.365743 │ 0.666146 │ 0.932312 │ 0.352096 │ 0.684766 │ 0.596058
We pass each column we want compute for the output table as a keyword argument to the @mutate
query operator. The name of the keyword argument will become the name of the column that we are computing. To the right of the equal sign =
we pass an expression that computes the value for the current row for the new column. We have access to the current row via the _
syntax (in the same way it is used in all the other standalone query operators). In our example this allows us for example to reference the value of colA
and colB
in the current row _
. If the output name for the column (in our case newCol
) does not exist as a column in the input table, a new column with that name will be added at the end of the table.
The next example uses an output name for the column that already exists, in which case the content of that column will be replaced with the new, computed values:
julia> df |> @mutate(firstCol = _.colA + log(_.colB))
10x5 query result
firstCol │ colA │ colB │ colC │ lastCol
────────────┼───────────┼────────────┼───────────┼──────────
-1.07821 │ 0.726354 │ 0.164546 │ 0.419195 │ 0.0685362
-5.13278 │ 0.132284 │ 0.00516908 │ 0.619361 │ 0.847543
-0.391904 │ 0.421744 │ 0.443238 │ 0.581122 │ 0.518726
-0.00610799 │ 0.204774 │ 0.80987 │ 0.0977708 │ 0.866429
0.0882928 │ 0.362015 │ 0.760544 │ 0.775301 │ 0.810517
-0.455857 │ 0.0614884 │ 0.596101 │ 0.0613467 │ 0.230335
0.365798 │ 0.702575 │ 0.714068 │ 0.283429 │ 0.308963
-0.309229 │ 0.246562 │ 0.573619 │ 0.869932 │ 0.148938
-0.807497 │ 0.432564 │ 0.289367 │ 0.92641 │ 0.592507
0.596058 │ 0.666146 │ 0.932312 │ 0.352096 │ 0.684766
One can also specify multiple output columns in a single call to @mutate
, for example the following code replaces the values in one existing column (named firstCol
) and adds a new column named newCol
.
julia> df |> @mutate(firstCol = _.colA + log(_.colB), newCol=_.colC * 2)
10x6 query result
firstCol │ colA │ colB │ colC │ lastCol │ newCol
────────────┼───────────┼────────────┼───────────┼───────────┼─────────
-1.07821 │ 0.726354 │ 0.164546 │ 0.419195 │ 0.0685362 │ 0.838391
-5.13278 │ 0.132284 │ 0.00516908 │ 0.619361 │ 0.847543 │ 1.23872
-0.391904 │ 0.421744 │ 0.443238 │ 0.581122 │ 0.518726 │ 1.16224
-0.00610799 │ 0.204774 │ 0.80987 │ 0.0977708 │ 0.866429 │ 0.195542
0.0882928 │ 0.362015 │ 0.760544 │ 0.775301 │ 0.810517 │ 1.5506
-0.455857 │ 0.0614884 │ 0.596101 │ 0.0613467 │ 0.230335 │ 0.122693
0.365798 │ 0.702575 │ 0.714068 │ 0.283429 │ 0.308963 │ 0.566858
-0.309229 │ 0.246562 │ 0.573619 │ 0.869932 │ 0.148938 │ 1.73986
-0.807497 │ 0.432564 │ 0.289367 │ 0.92641 │ 0.592507 │ 1.85282
0.596058 │ 0.666146 │ 0.932312 │ 0.352096 │ 0.684766 │ 0.704192
Note that in all of these examples, @mutate
leaves any column that is not named explicitly as an output column alone.
And it might not come as a surprise at this point: we again followed the lead of the equivalent dplyr function in the design.
The @unique
query operator
The @unique
query operator is simple: it removes any duplicate values from an input sequence. Here is a simple example where we remove duplicate values from an array of Int
values:
julia> [1,2,3,4,5,2,4] |> @unique()
?-element query result
1
2
3
4
5
One can of course also use this with tabular data:
julia> df2 = DataFrame(a=[1,2,3,2], b=[4,3,2,3])
4×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 3 │
│ 3 │ 3 │ 2 │
│ 4 │ 2 │ 3 │
julia> df2 |> @unique()
3x2 query result
a │ b
──┼──
1 │ 4
2 │ 3
3 │ 2
Conclusion
As always, please let us know about any issues or problems that you might run into while using Query.jl over in the github repository.