How is => used in DataFrames.jl?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/07/17/pair.html

Introduction

A recent StackOverflow question prompted me to write down a list of the cases
in which DataFrames.jl allows the use of =>. In this post I summarize them, as
=> is indeed used heavily.

Essentially, in many places we use => instead of =. The difference is that
the Julia parser treats => as a two-argument operator that creates a Pair
object. This is much more flexible than =, whose allowed usage is very strict,
and it is possible to dispatch on Pair.
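
As a minimal sketch of what this means in practice, the snippet below uses only
Base Julia to show that => builds a Pair, that chained => nests to the right,
and that a function can dispatch on Pair. The helper arg_kind is a hypothetical
name used purely for illustration; it is not part of DataFrames.jl.

```julia
# => is an ordinary operator; the result is a Pair with fields first and second.
p = :x => 1:4
println(typeof(p) <: Pair)   # p is a Pair
println(p.first)             # left-hand side of =>
println(p.second)            # right-hand side of =>

# => is right-associative, so a => b => c is the same as a => (b => c).
q = :x => maximum => :mx
println(q.second isa Pair)

# Hypothetical helper (an assumption, not a DataFrames.jl function):
# one method for Pair arguments and a fallback for everything else.
arg_kind(arg::Pair) = "pair: $(arg.first) => $(arg.second)"
arg_kind(arg) = "plain value: $arg"

println(arg_kind(:x => 1))
println(arg_kind(42))
```

This is the mechanism that lets a function react to a name => value argument
differently from a plain positional argument.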

The functions in DataFrames.jl that have a special treatment of => are:

  • DataFrame, insertcols!
  • select, select!, transform, transform!, combine
  • filter, filter!
  • describe
  • rename!, rename
  • innerjoin, leftjoin, rightjoin, outerjoin, antijoin, semijoin

I have grouped these functions by the meaning they give to =>; as you can see,
that meaning is context dependent. I describe these usages in sections below.

All examples were run under Julia 1.4.2 and DataFrames.jl 0.21.4. They are meant
to be executed linearly (so later examples assume that earlier examples were
run).

Column assignment: DataFrame, insertcols!

In this case the syntax is target_column_name => value and means that
target_column_name should be set to hold value, for example:

julia> using DataFrames

julia> df = DataFrame(:x => 1:4, :y => "a")
4×2 DataFrame
│ Row │ x     │ y      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ a      │
│ 2   │ 2     │ a      │
│ 3   │ 3     │ a      │
│ 4   │ 4     │ a      │

julia> insertcols!(df, :a => 'a':'d', :b => exp(1))
4×4 DataFrame
│ Row │ x     │ y      │ a    │ b       │
│     │ Int64 │ String │ Char │ Float64 │
├─────┼───────┼────────┼──────┼─────────┤
│ 1   │ 1     │ a      │ 'a'  │ 2.71828 │
│ 2   │ 2     │ a      │ 'b'  │ 2.71828 │
│ 3   │ 3     │ a      │ 'c'  │ 2.71828 │
│ 4   │ 4     │ a      │ 'd'  │ 2.71828 │

Column transformation: select, select!, transform, transform!, combine

In this case there are three allowed forms:

  • source_columns => transformation_function => target_column_name
  • source_columns => transformation_function (target column name will be
    automatically generated)
  • source_column => target_column_name (essentially a way to rename a column)

Here are some examples:

julia> transform(df, :x => maximum)
4×5 DataFrame
│ Row │ x     │ y      │ a    │ b       │ x_maximum │
│     │ Int64 │ String │ Char │ Float64 │ Int64     │
├─────┼───────┼────────┼──────┼─────────┼───────────┤
│ 1   │ 1     │ a      │ 'a'  │ 2.71828 │ 4         │
│ 2   │ 2     │ a      │ 'b'  │ 2.71828 │ 4         │
│ 3   │ 3     │ a      │ 'c'  │ 2.71828 │ 4         │
│ 4   │ 4     │ a      │ 'd'  │ 2.71828 │ 4         │

julia> combine(df, :x => maximum => :mx)
1×1 DataFrame
│ Row │ mx    │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 4     │

julia> select(df, [:x, :b] => +)
4×1 DataFrame
│ Row │ x_b_+   │
│     │ Float64 │
├─────┼─────────┤
│ 1   │ 3.71828 │
│ 2   │ 4.71828 │
│ 3   │ 5.71828 │
│ 4   │ 6.71828 │

julia> transform(df, :a => :b, :b => :a)
4×4 DataFrame
│ Row │ x     │ y      │ a       │ b    │
│     │ Int64 │ String │ Float64 │ Char │
├─────┼───────┼────────┼─────────┼──────┤
│ 1   │ 1     │ a      │ 2.71828 │ 'a'  │
│ 2   │ 2     │ a      │ 2.71828 │ 'b'  │
│ 3   │ 3     │ a      │ 2.71828 │ 'c'  │
│ 4   │ 4     │ a      │ 2.71828 │ 'd'  │

(note that what source_columns and transformation_function can be is
quite flexible; please have a look at the documentation to learn about
the available options)
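
To give a flavor of that flexibility, here is a small sketch, assuming a
DataFrames.jl version that provides ByRow (it is available in the 0.21 series
used in this post). The data frame df_demo and the target column names are
made up for the illustration.

```julia
using DataFrames

# Small stand-in data frame with x and b columns like those of df in the post.
df_demo = DataFrame(:x => [1, 2, 3, 4], :b => fill(exp(1), 4))

# ByRow wraps a function so that it is applied to each row, not to whole columns.
squared = select(df_demo, :x => ByRow(x -> x^2) => :x_sq)

# Several source columns are passed to the function as separate positional arguments.
summed = select(df_demo, [:x, :b] => ByRow(+) => :x_plus_b)

# An anonymous function that receives the whole column; note the parentheses around it.
shifted = transform(df_demo, :x => (v -> v .- minimum(v)) => :x_shifted)
```

Wrapping an anonymous function in parentheses keeps the => chain easy to read
and avoids surprises with operator precedence.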

Row selection: filter, filter!

In this case the allowed form is source_columns => predicate, for
example:

julia> filter(:x => >(1.5), df)
3×4 DataFrame
│ Row │ x     │ y      │ a    │ b       │
│     │ Int64 │ String │ Char │ Float64 │
├─────┼───────┼────────┼──────┼─────────┤
│ 1   │ 2     │ a      │ 'b'  │ 2.71828 │
│ 2   │ 3     │ a      │ 'c'  │ 2.71828 │
│ 3   │ 4     │ a      │ 'd'  │ 2.71828 │
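
The same form also accepts several source columns. The sketch below is an
illustration on a made-up data frame df_demo (not the df used above): the
predicate then takes one positional argument per column, and filter! applies a
condition in place.

```julia
using DataFrames

df_demo = DataFrame(:x => [1, 2, 3, 4], :a => ['a', 'b', 'c', 'd'])

# With several source columns the predicate receives one argument per column.
kept = filter([:x, :a] => ((x, a) -> x > 1 && a != 'd'), df_demo)

# filter! uses the same syntax but removes the rows from df_demo itself.
filter!(:x => <(4), df_demo)
```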

Summarizing: describe

In this case the allowed form is target_column_name => aggregation_function.
Then aggregation_function is applied to every column of a data frame and the
result is stored in the target_column_name column. For example:

julia> describe(df, :first => first, :last => last)
4×3 DataFrame
│ Row │ variable │ first   │ last    │
│     │ Symbol   │ Any     │ Any     │
├─────┼──────────┼─────────┼─────────┤
│ 1   │ x        │ 1       │ 4       │
│ 2   │ y        │ a       │ a       │
│ 3   │ a        │ 'a'     │ 'd'     │
│ 4   │ b        │ 2.71828 │ 2.71828 │

Names transformation: rename!, rename

The form is source_column => target_column_name to rename source_column to
target_column_name, e.g.:

julia> rename(df, "x" => "xx", "y" => "yy")
4×4 DataFrame
│ Row │ xx    │ yy     │ a    │ b       │
│     │ Int64 │ String │ Char │ Float64 │
├─────┼───────┼────────┼──────┼─────────┤
│ 1   │ 1     │ a      │ 'a'  │ 2.71828 │
│ 2   │ 2     │ a      │ 'b'  │ 2.71828 │
│ 3   │ 3     │ a      │ 'c'  │ 2.71828 │
│ 4   │ 4     │ a      │ 'd'  │ 2.71828 │

julia> rename(df, 1 => :xx, 2 => :yy)
4×4 DataFrame
│ Row │ xx    │ yy     │ a    │ b       │
│     │ Int64 │ String │ Char │ Float64 │
├─────┼───────┼────────┼──────┼─────────┤
│ 1   │ 1     │ a      │ 'a'  │ 2.71828 │
│ 2   │ 2     │ a      │ 'b'  │ 2.71828 │
│ 3   │ 3     │ a      │ 'c'  │ 2.71828 │
│ 4   │ 4     │ a      │ 'd'  │ 2.71828 │

Joining: innerjoin, leftjoin, rightjoin, outerjoin, antijoin, semijoin

Here, you can pass left_column_name => right_column_name as the on keyword
argument when the columns on which the join should be performed have different
names in the left and right data frames, for instance:

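julia> df2 = DataFrame(:x2 => 1:4, :a2 => 'a':'d'); # second data frame, with columns inferred from the output below

julia> innerjoin(df, df2, on = :x => :x2)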
4×5 DataFrame
│ Row │ x     │ y      │ a    │ b       │ a2   │
│     │ Int64 │ String │ Char │ Float64 │ Char │
├─────┼───────┼────────┼──────┼─────────┼──────┤
│ 1   │ 1     │ a      │ 'a'  │ 2.71828 │ 'a'  │
│ 2   │ 2     │ a      │ 'b'  │ 2.71828 │ 'b'  │
│ 3   │ 3     │ a      │ 'c'  │ 2.71828 │ 'c'  │
│ 4   │ 4     │ a      │ 'd'  │ 2.71828 │ 'd'  │

julia> innerjoin(df, df2, on = ["x" => "x2", "a" => "a2"])
4×4 DataFrame
│ Row │ x     │ y      │ a    │ b       │
│     │ Int64 │ String │ Char │ Float64 │
├─────┼───────┼────────┼──────┼─────────┤
│ 1   │ 1     │ a      │ 'a'  │ 2.71828 │
│ 2   │ 2     │ a      │ 'b'  │ 2.71828 │
│ 3   │ 3     │ a      │ 'c'  │ 2.71828 │
│ 4   │ 4     │ a      │ 'd'  │ 2.71828 │

(in the second example we have performed the join on two pairs of columns)
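
The other join functions listed above take the same on = left => right form. A
minimal sketch, using two made-up data frames (left and right are illustrative
names, not objects defined earlier in the post):

```julia
using DataFrames

left = DataFrame(:x => [1, 2, 3, 4], :y => ["a", "b", "c", "d"])
right = DataFrame(:x2 => [2, 3, 5], :z => [20.0, 30.0, 50.0])

# Rows whose key appears in both frames; the key column keeps the left-hand name x.
matched = innerjoin(left, right, on = :x => :x2)

# Rows of left that have a match in right (semijoin) and rows that have none (antijoin).
with_match = semijoin(left, right, on = :x => :x2)
without_match = antijoin(left, right, on = :x => :x2)
```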

I hope that you will find the examples provided above useful when working with
DataFrames.jl!