Column selectors in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/02/06/colsel.html

Introduction

In my last post I have discussed row selector rules for data frames
in DataFrames.jl.

This time I will cover available column selector options.

This post was written under Julia 1.5.3 and DataFrames 0.22.5.

First we set up the environment:

julia> using DataFrames

julia> df = DataFrame(a=1, b=2, x1=3, x2=4, y1=5, y2=6)
1×6 DataFrame
 Row │ a      b      x1     x2     y1     y2
     │ Int64  Int64  Int64  Int64  Int64  Int64
─────┼──────────────────────────────────────────
   1 │     1      2      3      4      5      6

Column selection in indexing syntax

A starting point of column selection syntax is indexing. I will go through
all available options. In later sections of this post I discuss other functions
that support column indexing as they build upon the indexing syntax.

Selecting a single column

You can use either: an integer, a Symbol or a string. Also begin and end
is supported just like for arrays. Here are some examples:

julia> df[:, 1]
1-element Array{Int64,1}:
 1

julia> df[:, :a]
1-element Array{Int64,1}:
 1

julia> df[:, "a"]
1-element Array{Int64,1}:
 1

julia> df[:, begin]
1-element Array{Int64,1}:
 1

julia> df[:, end]
1-element Array{Int64,1}:
 6

Note that if in indexing a single column is selected then it is extracted out
from a data frame as a vector. In all other column selector options described
below you always get a data frame as a result of the operation.

Selecting multiple columns using vectors

You can use the vector selectors: all integers, all Bool, all Symbol,
or all string (mixing styles is not allowed). Note that you can also use begin
and end to define a vector (e.g. in ranges). Here are some examples:

julia> df[:, [1, 2]]
1×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> df[:, [trues(2); falses(4)]]
1×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> df[:, [:a, :b]]
1×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> df[:, ["a", "b"]]
1×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> df[:, begin:2]
1×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> df[:, 2:end]
1×5 DataFrame
 Row │ b      x1     x2     y1     y2
     │ Int64  Int64  Int64  Int64  Int64
─────┼───────────────────────────────────
   1 │     2      3      4      5      6

Special selectors

There are the following special selectors available:

  • : and All() allowing you to select all columns:
julia> df[:, :]
1×6 DataFrame
 Row │ a      b      x1     x2     y1     y2
     │ Int64  Int64  Int64  Int64  Int64  Int64
─────┼──────────────────────────────────────────
   1 │     1      2      3      4      5      6

julia> df[:, All()]
1×6 DataFrame
 Row │ a      b      x1     x2     y1     y2
     │ Int64  Int64  Int64  Int64  Int64  Int64
─────┼──────────────────────────────────────────
   1 │     1      2      3      4      5      6
  • a regular expression picking columns whose name matches it:
julia> df[:, r"1"]
1×2 DataFrame
 Row │ x1     y1
     │ Int64  Int64
─────┼──────────────
   1 │     3      5

julia> df[:, r"x"]
1×2 DataFrame
 Row │ x1     x2
     │ Int64  Int64
─────┼──────────────
   1 │     3      4
  • Between selector allowing you to specify a range of columns (you can specify
    the start and stop column using any of the single column selector syntaxes):
julia> df[:, Between(3, :y1)]
1×3 DataFrame
 Row │ x1     x2     y1
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     3      4      5

julia> df[:, Between(begin, "x1")]
1×3 DataFrame
 Row │ a      b      x1
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3
  • Not selector allowing you to specify the columns you want to exclude from
    the resulting data frames. You can put any valid other column selector
    inside Not:
julia> df[:, Not("a")]
1×5 DataFrame
 Row │ b      x1     x2     y1     y2
     │ Int64  Int64  Int64  Int64  Int64
─────┼───────────────────────────────────
   1 │     2      3      4      5      6

julia> df[:, Not(r"x")]
1×4 DataFrame
 Row │ a      b      y1     y2
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      2      5      6

julia> df[:, Not(Between(:x1, end))]
1×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
  • Cols selector picking a union of other selectors passed as its arguments:
julia> df[:, Cols()]
0×0 DataFrame

julia> df[:, Cols(:x1, 1)]
1×2 DataFrame
 Row │ x1     a
     │ Int64  Int64
─────┼──────────────
   1 │     3      1

julia> df[:, Cols(r"1", r"x")]
1×3 DataFrame
 Row │ x1     y1     x2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     3      5      4

Note that the order of columns in the produced data frame reflects the order
in which ther are selected by consecutive selectors passed inside Cols.

Selecting just column names

Sometimes one only wants to pick names of columns meeting some condition
without actually creating a new data frame. This can be achieved using the
names function.

If you just write names(df) you get a list of all column names of a data frame
as a vector of strings:

julia> names(df)
6-element Array{String,1}:
 "a"
 "b"
 "x1"
 "x2"
 "y1"
 "y2"

You can also pass any column selector expression we described above as a second
argument to names to get a subset of the column names, e.g.:

julia> names(df, r"x")
2-element Array{String,1}:
 "x1"
 "x2"

julia> names(df, Not(Between("b", "x2")))
3-element Array{String,1}:
 "a"
 "y1"
 "y2"

Additionally you can pass a type as a second argument to the names function,
and in this case you will get a vector of column names whose element type is
a subtype of passed argument. Here is an example:

julia> df2 = DataFrame([[1, 2], [1.0, 2.0], ["1", "2"], [1, missing]], :auto)
2×4 DataFrame
 Row │ x1     x2       x3      x4
     │ Int64  Float64  String  Int64?
─────┼─────────────────────────────────
   1 │     1      1.0  1             1
   2 │     2      2.0  2       missing

julia> names(df2, Int)
1-element Array{String,1}:
 "x1"

julia> names(df2, Union{Int, Missing})
2-element Array{String,1}:
 "x1"
 "x4"

julia> names(df2, Real)
2-element Array{String,1}:
 "x1"
 "x2"

There is also an upcoming feature that will be soon available in DataFrames.jl 1.0
release. You are going to be able to pass as a second argument to the names
function a predicate taking the column name as a string and returning true for:
columns that should be kept. Here are some examples (note that this code snippet
will only work if you ceck out main branch of DataFrames.jl from its GitHub
repository):

julia> df3 = DataFrame(x1=1, x2=2, y=3)
1×3 DataFrame
 Row │ x1     x2     y
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3

julia> names(df3, startswith('x'))
2-element Array{String,1}:
 "x1"
 "x2"

julia> names(df3, x -> length(x) == 1)
1-element Array{String,1}:
 "y"

The names function is quite often used if you want to perform multiple
transformations of columns using select, transform or combine, e.g.:

julia> select(df, r"x", names(df, r"x") .=> ByRow(sqrt))
1×4 DataFrame
 Row │ x1     x2     x1_sqrt  x2_sqrt
     │ Int64  Int64  Float64  Float64
─────┼────────────────────────────────
   1 │     3      4  1.73205      2.0

Other functions that support passing column selectors

Here is a list of functions that accept the column selectors we discussed in the
indexing section above:

  • describe takes cols keyword argument allowing you to choose for which
    columns descriptive statistics should be computed;
  • combine, select, select!, transform, transform!, allowing you to
    just keep the passed columns;
  • groupby that takes a second positional argument a column selector;
  • flatten taking a column selector as a second positional argument giving
    informations which columns should be flattened;
  • stack allowing you to specify measure variables and id variables using
    the column selectors;
  • unstack where you can pass columns that specify row keys for unstacking;
  • issorted, sortperm, sort and sort! where the second positional argument
    is a column selector (they also allow some additional syntax to specify how
    sorting should be performed but this is a separate topic);
  • unique, unique!, and nonunique where second positional argument is a
    column selector signaling which column should be checked for uniqueness;
  • allowmissing, allowmissing!, disallowmissing, disallowmissing!,
    completecases, dropmissing, dropmissing! which all take as a second
    argument a column selector indicating which columns should be taken into
    account in transformation performed.

Conclusions

As you can see in this post DataFrames.jl provides a very flexible system of
allowed column selectors. Also, good understanding column selector semantics
will boost your efficiency when working with DataFrames.jl as we consistently
support the same patterns across whole package.