David Anthoff | juliabloggers.com

Re-posted from: https://www.queryverse.org//2019/02/02/query-v0.11/

We just released Query.jl v0.11. The new version adds four new standalone query operators: @select, @rename, @mutate and @unique. We also fixed a number of bugs and wrote a bunch of new documentation. Finally, the standalone versions of the query operators are no longer considered experimental, so feel free to use the versions that use the pipe operator |> without any worries that they might go away in the future. The rest of this post will showcase the new query operators we added.

The `@select` query operator

The standalone @select query operator makes it easier to select a subset of columns from a source table and arrange them in some specified order. Note that the discussion in this section is about the standalone @select query operator, not the LINQ style version of @select.

In the most basic example, you can simply specify all the columns that you want to include in the result table. You can select columns either by their name, or their position.

Lets first create a DataFrame with some example data that we will use as the basis for most examples in this post:

julia> using Query, DataFrames

julia> df = DataFrame(firstCol=rand(10), colA=rand(10), colB=rand(10), colC=rand(10), lastCol=rand(10))
10×5 DataFrame
│ Row │ firstCol    │ colA      │ colB       │ colC      │ lastCol   │
│     │ Float64     │ Float64   │ Float64    │ Float64   │ Float64   │
├─────┼─────────────┼───────────┼────────────┼───────────┼───────────┤
│ 1   │ 0.0815091   │ 0.726354  │ 0.164546   │ 0.419195  │ 0.0685362 │
│ 2   │ 0.984546    │ 0.132284  │ 0.00516908 │ 0.619361  │ 0.847543  │
│ 3   │ 0.452009    │ 0.421744  │ 0.443238   │ 0.581122  │ 0.518726  │
│ 4   │ 0.551547    │ 0.204774  │ 0.80987    │ 0.0977708 │ 0.866429  │
│ 5   │ 0.0537391   │ 0.362015  │ 0.760544   │ 0.775301  │ 0.810517  │
│ 6   │ 0.837029    │ 0.0614884 │ 0.596101   │ 0.0613467 │ 0.230335  │
│ 7   │ 0.701862    │ 0.702575  │ 0.714068   │ 0.283429  │ 0.308963  │
│ 8   │ 0.626259    │ 0.246562  │ 0.573619   │ 0.869932  │ 0.148938  │
│ 9   │ 0.000844883 │ 0.432564  │ 0.289367   │ 0.92641   │ 0.592507  │
│ 10  │ 0.365743    │ 0.666146  │ 0.932312   │ 0.352096  │ 0.684766  │

Now we select the column named colB, the second column and the column named lastCol into a new table:

julia> df |> @select(:colB, 2, :lastCol)
10x3 query result
colB       │ colA      │ lastCol
───────────┼───────────┼──────────
164546   │ 0.726354  │ 0.0685362
00516908 │ 0.132284  │ 0.847543
443238   │ 0.421744  │ 0.518726
80987    │ 0.204774  │ 0.866429
760544   │ 0.362015  │ 0.810517
596101   │ 0.0614884 │ 0.230335
714068   │ 0.702575  │ 0.308963
573619   │ 0.246562  │ 0.148938
289367   │ 0.432564  │ 0.592507
932312   │ 0.666146  │ 0.684766

Note that we are achieving two things at the same time here: we are picking only a few of the columns from the source table, and we are rearranging the order of the columns.

There are a whole number of options available that make it easier to select multiple columns without specifying each individually.

First, we can use the standard julia range syntax to select consecutive columns:

julia> df |> @select(:colB : :lastCol, :colA)
10x4 query result
colB       │ colC      │ lastCol   │ colA
───────────┼───────────┼───────────┼──────────
164546   │ 0.419195  │ 0.0685362 │ 0.726354
00516908 │ 0.619361  │ 0.847543  │ 0.132284
443238   │ 0.581122  │ 0.518726  │ 0.421744
80987    │ 0.0977708 │ 0.866429  │ 0.204774
760544   │ 0.775301  │ 0.810517  │ 0.362015
596101   │ 0.0613467 │ 0.230335  │ 0.0614884
714068   │ 0.283429  │ 0.308963  │ 0.702575
573619   │ 0.869932  │ 0.148938  │ 0.246562
289367   │ 0.92641   │ 0.592507  │ 0.432564
932312   │ 0.352096  │ 0.684766  │ 0.666146

In that example, we first selected all columns between colB and lastCol (so that would be colB, colC and lastCol), and then we also included colA as the second argument to the @select call.

We can also use the range syntax with column position indices:

julia> df |> @select(3:5, :colA)
10x4 query result
colB       │ colC      │ lastCol   │ colA
───────────┼───────────┼───────────┼──────────
164546   │ 0.419195  │ 0.0685362 │ 0.726354
00516908 │ 0.619361  │ 0.847543  │ 0.132284
443238   │ 0.581122  │ 0.518726  │ 0.421744
80987    │ 0.0977708 │ 0.866429  │ 0.204774
760544   │ 0.775301  │ 0.810517  │ 0.362015
596101   │ 0.0613467 │ 0.230335  │ 0.0614884
714068   │ 0.283429  │ 0.308963  │ 0.702575
573619   │ 0.869932  │ 0.148938  │ 0.246562
289367   │ 0.92641   │ 0.592507  │ 0.432564
932312   │ 0.352096  │ 0.684766  │ 0.666146

We also provide a number of convenience functions for bulk column selection. For example, to select all columns that have a name that starts with col, we can write:

julia> df |> @select(startswith("col"))
10x3 query result
colA      │ colB       │ colC
──────────┼────────────┼──────────
726354  │ 0.164546   │ 0.419195
132284  │ 0.00516908 │ 0.619361
421744  │ 0.443238   │ 0.581122
204774  │ 0.80987    │ 0.0977708
362015  │ 0.760544   │ 0.775301
0614884 │ 0.596101   │ 0.0613467
702575  │ 0.714068   │ 0.283429
246562  │ 0.573619   │ 0.869932
432564  │ 0.289367   │ 0.92641
666146  │ 0.932312   │ 0.352096

Or we can select all columns that have a name that ends with Col:

julia> df |> @select(endswith("Col"))
10x2 query result
firstCol    │ lastCol
────────────┼──────────
0815091   │ 0.0685362
984546    │ 0.847543
452009    │ 0.518726
551547    │ 0.866429
0537391   │ 0.810517
837029    │ 0.230335
701862    │ 0.308963
626259    │ 0.148938
000844883 │ 0.592507
365743    │ 0.684766

Lets say we have a situation where we want to select a large number of columns that we can easily select with one of the bulk column options discussed so far, except for one specific column. Here is how we can run such a query:

julia> df |> @select(startswith("col"), -:colB)
10x2 query result
colA      │ colC
──────────┼──────────
726354  │ 0.419195
132284  │ 0.619361
421744  │ 0.581122
204774  │ 0.0977708
362015  │ 0.775301
0614884 │ 0.0613467
702575  │ 0.283429
246562  │ 0.869932
432564  │ 0.92641
666146  │ 0.352096

Here we first select all columns that have a name that starts with col. At this point we have selected the columns colA, colB and colC. Next, we remove colB from this list, by specifying a minus - in front of its name as the next argument.

There is one twist to removing columns: if the first argument to the @select query operator starts with a minus, then we implicitly start out with all columns selected. Here is an example that demonstrates this:

julia> df |> @select(-:colB)
10x4 query result
firstCol    │ colA      │ colC      │ lastCol
────────────┼───────────┼───────────┼──────────
0815091   │ 0.726354  │ 0.419195  │ 0.0685362
984546    │ 0.132284  │ 0.619361  │ 0.847543
452009    │ 0.421744  │ 0.581122  │ 0.518726
551547    │ 0.204774  │ 0.0977708 │ 0.866429
0537391   │ 0.362015  │ 0.775301  │ 0.810517
837029    │ 0.0614884 │ 0.0613467 │ 0.230335
701862    │ 0.702575  │ 0.283429  │ 0.308963
626259    │ 0.246562  │ 0.869932  │ 0.148938
000844883 │ 0.432564  │ 0.92641   │ 0.592507
365743    │ 0.666146  │ 0.352096  │ 0.684766

Because we start this @select query operator with an argument that has a minus - in front, we start out with all columns, and then remove the colB column.

The minus - can of course also be combined with all the bulk column options. The next example selects all columns from the source table, except if they have a column name that starts with col:

julia> df |> @select(-startswith("col"))
10x2 query result
firstCol    │ lastCol
────────────┼──────────
0815091   │ 0.0685362
984546    │ 0.847543
452009    │ 0.518726
551547    │ 0.866429
0537391   │ 0.810517
837029    │ 0.230335
701862    │ 0.308963
626259    │ 0.148938
000844883 │ 0.592507
365743    │ 0.684766

If you are familiar with dplyr all of this will be highly familiar: we largely copied the semantics from that package.

The `@rename` query operator

The @rename query operator renames one or more columns from a table. Each rename is specified as a Pair, where the first element is either the old name of the column that should be renamed, or the position of the column. The second element must be a Symbol with the new name for the column. Here is an example:

julia> df |> @rename(:colB=>:foo, 1=>:bar)
10x5 query result
bar         │ colA      │ foo        │ colC      │ lastCol
────────────┼───────────┼────────────┼───────────┼──────────
0815091   │ 0.726354  │ 0.164546   │ 0.419195  │ 0.0685362
984546    │ 0.132284  │ 0.00516908 │ 0.619361  │ 0.847543
452009    │ 0.421744  │ 0.443238   │ 0.581122  │ 0.518726
551547    │ 0.204774  │ 0.80987    │ 0.0977708 │ 0.866429
0537391   │ 0.362015  │ 0.760544   │ 0.775301  │ 0.810517
837029    │ 0.0614884 │ 0.596101   │ 0.0613467 │ 0.230335
701862    │ 0.702575  │ 0.714068   │ 0.283429  │ 0.308963
626259    │ 0.246562  │ 0.573619   │ 0.869932  │ 0.148938
000844883 │ 0.432564  │ 0.289367   │ 0.92641   │ 0.592507
365743    │ 0.666146  │ 0.932312   │ 0.352096  │ 0.684766

Here we are renaming the colB column to foo, and the first column to bar. This example highlights that one can rename multiple columns in a single call to @rename.

All other columns in a source table are left unmodified by the @rename query operator, and the order of columns is also not changed.

The semantics of this query operator once again follow the equivalent dplyr function.

The `@mutate` query operator

The @mutate query operator can be used to either add columns to a table, or mutate the content of some columns in a table. Lets say we want to add a new column to our table that is the sum of column colA and the log of column colB:

julia> df |> @mutate(newCol = _.colA + log(_.colB))
10x6 query result
firstCol    │ colA      │ colB       │ colC      │ lastCol   │ newCol
────────────┼───────────┼────────────┼───────────┼───────────┼────────────
0815091   │ 0.726354  │ 0.164546   │ 0.419195  │ 0.0685362 │ -1.07821
984546    │ 0.132284  │ 0.00516908 │ 0.619361  │ 0.847543  │ -5.13278
452009    │ 0.421744  │ 0.443238   │ 0.581122  │ 0.518726  │ -0.391904
551547    │ 0.204774  │ 0.80987    │ 0.0977708 │ 0.866429  │ -0.00610799
0537391   │ 0.362015  │ 0.760544   │ 0.775301  │ 0.810517  │ 0.0882928
837029    │ 0.0614884 │ 0.596101   │ 0.0613467 │ 0.230335  │ -0.455857
701862    │ 0.702575  │ 0.714068   │ 0.283429  │ 0.308963  │ 0.365798
626259    │ 0.246562  │ 0.573619   │ 0.869932  │ 0.148938  │ -0.309229
000844883 │ 0.432564  │ 0.289367   │ 0.92641   │ 0.592507  │ -0.807497
365743    │ 0.666146  │ 0.932312   │ 0.352096  │ 0.684766  │ 0.596058

We pass each column we want compute for the output table as a keyword argument to the @mutate query operator. The name of the keyword argument will become the name of the column that we are computing. To the right of the equal sign = we pass an expression that computes the value for the current row for the new column. We have access to the current row via the _ syntax (in the same way it is used in all the other standalone query operators). In our example this allows us for example to reference the value of colA and colB in the current row _. If the output name for the column (in our case newCol) does not exist as a column in the input table, a new column with that name will be added at the end of the table.

The next example uses an output name for the column that already exists, in which case the content of that column will be replaced with the new, computed values:

julia> df |> @mutate(firstCol = _.colA + log(_.colB))
10x5 query result
firstCol    │ colA      │ colB       │ colC      │ lastCol
────────────┼───────────┼────────────┼───────────┼──────────
-1.07821    │ 0.726354  │ 0.164546   │ 0.419195  │ 0.0685362
-5.13278    │ 0.132284  │ 0.00516908 │ 0.619361  │ 0.847543
-0.391904   │ 0.421744  │ 0.443238   │ 0.581122  │ 0.518726
-0.00610799 │ 0.204774  │ 0.80987    │ 0.0977708 │ 0.866429
0.0882928   │ 0.362015  │ 0.760544   │ 0.775301  │ 0.810517
-0.455857   │ 0.0614884 │ 0.596101   │ 0.0613467 │ 0.230335
0.365798    │ 0.702575  │ 0.714068   │ 0.283429  │ 0.308963
-0.309229   │ 0.246562  │ 0.573619   │ 0.869932  │ 0.148938
-0.807497   │ 0.432564  │ 0.289367   │ 0.92641   │ 0.592507
0.596058    │ 0.666146  │ 0.932312   │ 0.352096  │ 0.684766

One can also specify multiple output columns in a single call to @mutate, for example the following code replaces the values in one existing column (named firstCol) and adds a new column named newCol.

julia> df |> @mutate(firstCol = _.colA + log(_.colB), newCol=_.colC * 2)
10x6 query result
firstCol    │ colA      │ colB       │ colC      │ lastCol   │ newCol
────────────┼───────────┼────────────┼───────────┼───────────┼─────────
-1.07821    │ 0.726354  │ 0.164546   │ 0.419195  │ 0.0685362 │ 0.838391
-5.13278    │ 0.132284  │ 0.00516908 │ 0.619361  │ 0.847543  │ 1.23872
-0.391904   │ 0.421744  │ 0.443238   │ 0.581122  │ 0.518726  │ 1.16224
-0.00610799 │ 0.204774  │ 0.80987    │ 0.0977708 │ 0.866429  │ 0.195542
0.0882928   │ 0.362015  │ 0.760544   │ 0.775301  │ 0.810517  │ 1.5506
-0.455857   │ 0.0614884 │ 0.596101   │ 0.0613467 │ 0.230335  │ 0.122693
0.365798    │ 0.702575  │ 0.714068   │ 0.283429  │ 0.308963  │ 0.566858
-0.309229   │ 0.246562  │ 0.573619   │ 0.869932  │ 0.148938  │ 1.73986
-0.807497   │ 0.432564  │ 0.289367   │ 0.92641   │ 0.592507  │ 1.85282
0.596058    │ 0.666146  │ 0.932312   │ 0.352096  │ 0.684766  │ 0.704192

Note that in all of these examples, @mutate leaves any column that is not named explicitly as an output column alone.

And it might not come as a surprise at this point: we again followed the lead of the equivalent dplyr function in the design.

The `@unique` query operator

The @unique query operator is simple: it removes any duplicate values from an input sequence. Here is a simple example where we remove duplicate values from an array of Int values:

julia> [1,2,3,4,5,2,4] |> @unique()
?-element query result
 1
 2
 3
 4
 5

One can of course also use this with tabular data:

 julia> df2 = DataFrame(a=[1,2,3,2], b=[4,3,2,3])
4×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 4     │
│ 2   │ 2     │ 3     │
│ 3   │ 3     │ 2     │
│ 4   │ 2     │ 3     │

julia> df2 |> @unique()
3x2 query result
a │ b
──┼──
1 │ 4
2 │ 3
3 │ 2

Conclusion

As always, please let us know about any issues or problems that you might run into while using Query.jl over in the github repository.

juliabloggers.com

A Julia Language Blog Aggregator

Author Archives: David Anthoff

Query.jl v0.11 released

The `@select` query operator

The `@rename` query operator

The `@mutate` query operator

The `@unique` query operator

Conclusion

Welcome to the new Queryverse homepage

Query.jl v0.10 released

New way to select group key

`..` syntax removed

The @select query operator

The @rename query operator

The @mutate query operator

The @unique query operator

Conclusion

New way to select group key

.. syntax removed

The `@select` query operator

The `@rename` query operator

The `@mutate` query operator

The `@unique` query operator

`..` syntax removed