Is case_when needed in DataFrames.jl?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/12/18/casewhen.html

Introduction

Recently I received a very interesting question regarding transforming data
using the DataFrames.jl. One of the users wanted to know if we have
a functionality similar to the case_when function in dplyr. When trying
to answer it I came to the conclusion that we do not need it that we can
reproduce it using the ⋅ ? ⋅ : ⋅ ternary operator.

In this post I will be reproducing selected examples from the documentation of
case_when.

Reproducing dplyr examples

In the examples I will first show R code and then show an Julia code.

I R examples I assume that dplyr is loaded. The Julia examples were tested under
Julia 1.5.3, DataFrames.jl 0.22.2, DataFramesMeta.jl v0.6.0, HTTP.jl v0.8.19,
JSON3.jl v1.5.1, and Pipe.jl v1.3.0.

Example 1

This is the most basic case_when usage scenario.

R code:

> library(dplyr)
> x <- 1:10
> case_when(
+   x %% 35 == 0 ~ "fizz buzz",
+   x %% 5 == 0 ~ "fizz",
+   x %% 7 == 0 ~ "buzz",
+   TRUE ~ as.character(x)
+ )
 [1] "1"    "2"    "3"    "4"    "fizz" "6"    "buzz" "8"    "9"    "fizz"

Julia code:

julia> x = 1:10
1:10

julia> (x -> x % 6 == 0 ? "fizz buzz" :
             x % 2 == 0 ? "fizz" :
             x % 3 == 0 ? "buzz" :
             string(x)).(x)
10-element Array{String,1}:
 "1"
 "fizz"
 "buzz"
 "fizz"
 "5"
 "fizz buzz"
 "7"
 "fizz"
 "buzz"
 "fizz"

In this basic example note the following things:

  • in the Julia code we do not need to load any package; we are using the
    functionality in built into the language;
  • we create an anonymous function that is then broadcasted over an input vector
    using the . operator;
  • both codes look almost the same, apart from a bit different punctuation.

In summary – in my opinion the basic use case shows that the ternary operator
is as convenient as case_when from dplyr.

Example 2

In this example missing values are introduced. We are reusing the vector created
in the previous exercise.

R code:

> x[2:4] <- NA_real_
> case_when(
+   x %% 35 == 0 ~ "fizz buzz",
+   x %% 5 == 0 ~ "fizz",
+   x %% 7 == 0 ~ "buzz",
+   is.na(x) ~ "nope",
+   TRUE ~ as.character(x)
+ )
 [1] "1"    "nope" "nope" "nope" "fizz" "6"    "buzz" "8"    "9"    "fizz"

Julia code (two variants):

julia> x = [2 <= i <= 4 ? missing : x[i] for i in axes(x, 1)]

10-element Array{Union{Missing, Int64},1}:
  1
   missing
   missing
   missing
  5
  6
  7
  8
  9
 10

julia> (x -> isequal(x % 6, 0) ? "fizz buzz" :
             isequal(x % 2, 0) ? "fizz" :
             isequal(x % 3, 0) ? "buzz" :
             ismissing(x) ? "nope" :
             string(x)).(x)
10-element Array{String,1}:
 "1"
 "nope"
 "nope"
 "nope"
 "5"
 "fizz buzz"
 "7"
 "fizz"
 "buzz"
 "fizz"

julia> (x -> coalesce(x % 6 == 0, false) ? "fizz buzz" :
             coalesce(x % 2 == 0, false) ? "fizz" :
             coalesce(x % 3 == 0, false) ? "buzz" :
             ismissing(x) ? "nope" :
             string(x)).(x)
10-element Array{String,1}:
 "1"
 "nope"
 "nope"
 "nope"
 "5"
 "fizz buzz"
 "7"
 "fizz"
 "buzz"
 "fizz"

additionally note that the code can be simplified if we put the ismissing
condition first:

julia> (x -> ismissing(x) ? "nope" :
             x % 6 == 0 ? "fizz buzz" :
             x % 2 == 0 ? "fizz" :
             x % 3 == 0 ? "buzz" :
             string(x)).(x)
10-element Array{String,1}:
 "1"
 "nope"
 "nope"
 "nope"
 "5"
 "fizz buzz"
 "7"
 "fizz"
 "buzz"
 "fizz"

Note the following patterns in this example:

  • we had to materialize the vector in Julia in a bit more complex way as the
    initial x vector was a 1:10 range which was read only;
  • in R comparison to missing is treated as failing by default; on the other
    hand Julia is strict about boolean tests and one has to use either the
    isequal or the coalesce functions to handle missing values (or move the
    ismissing test to the top); this strictness introduces a bit more verbosity
    in the code at the benefit of allowing the user to catch potential bugs in the
    logic of the code more easily.

Example 3

In this example we use the starwars dataset that is shipped with dplyr. So we
first have to fetch it from the Internet in Julia. Here is the code that does
the trick:

julia> using DataFrames

julia> using HTTP

julia> using JSON3

julia> using Pipe

julia> starwars = @pipe HTTP.get("https://swapi.dev/api/people/").body |>
                  JSON3.read |> _.results |> DataFrame |>
                  transform(_,
                            :species => ByRow(x -> isempty(x) ? "Human" : "Droid"),
                            [:height, :mass] .=> ByRow(x -> parse(Int, x)),
                            renamecols=false)
10×16 DataFrame
 Row │ name                height  mass   hair_color     skin_color   eye_color ⋯
     │ String              Int64   Int64  String         String       String    ⋯
─────┼───────────────────────────────────────────────────────────────────────────
   1 │ Luke Skywalker         172     77  blond          fair         blue      ⋯
   2 │ C-3PO                  167     75  n/a            gold         yellow
   3 │ R2-D2                   96     32  n/a            white, blue  red
   4 │ Darth Vader            202    136  none           white        yellow
   5 │ Leia Organa            150     49  brown          light        brown     ⋯
   6 │ Owen Lars              178    120  brown, grey    light        blue
   7 │ Beru Whitesun lars     165     75  brown          light        blue
   8 │ R5-D4                   97     32  n/a            white, red   red
   9 │ Biggs Darklighter      183     84  black          light        brown     ⋯
  10 │ Obi-Wan Kenobi         182     77  auburn, white  fair         blue-gray
                                                               10 columns omitted

We have fetched only 10 rows of data for the analysis (this is the number of
observations that the exposed API produces), but it is enough for our purposes.

As a side note – observe how easy it is in JuliaData ecosystem to fetch a JSON
file from the Internet, parse it, populate a DataFrame, and finally do some
column preprocessing to get the right column types for data that we are
interested it analyzing later.

Let us move to the example. In this case we want to process more than one column
using the case_when function within a data transformation pipeline.

R code

> starwars %>%
+   select(name:mass, gender, species) %>%
+   mutate(
+     type = case_when(
+       height > 200 | mass > 200 ~ "large",
+       species == "Droid"        ~ "robot",
+       TRUE                      ~ "other"
+     )
+   )
# A tibble: 87 x 6
   name               height  mass gender    species type
   <chr>               <int> <dbl> <chr>     <chr>   <chr>
 1 Luke Skywalker        172    77 masculine Human   other
 2 C-3PO                 167    75 masculine Droid   robot
 3 R2-D2                  96    32 masculine Droid   robot
 4 Darth Vader           202   136 masculine Human   large
 5 Leia Organa           150    49 feminine  Human   other
 6 Owen Lars             178   120 masculine Human   other
 7 Beru Whitesun lars    165    75 feminine  Human   other
 8 R5-D4                  97    32 masculine Droid   robot
 9 Biggs Darklighter     183    84 masculine Human   other
10 Obi-Wan Kenobi        182    77 masculine Human   other
# … with 77 more rows

Julia code

julia> @pipe starwars |>
             select(_, Between(:name, :mass), :gender, :species) |>
             transform(_, [:height, :mass, :species] =>
                          ByRow((height, mass, species) ->
                                height > 200 || mass > 200 ? "large" :
                                species == "Droid" ? "robot" :
                                "other") =>
                          :type)
10×6 DataFrame
 Row │ name                height  mass   gender  species  type
     │ String              Int64   Int64  String  String   String
─────┼────────────────────────────────────────────────────────────
   1 │ Luke Skywalker         172     77  male    Human    other
   2 │ C-3PO                  167     75  n/a     Droid    robot
   3 │ R2-D2                   96     32  n/a     Droid    robot
   4 │ Darth Vader            202    136  male    Human    large
   5 │ Leia Organa            150     49  female  Human    other
   6 │ Owen Lars              178    120  male    Human    other
   7 │ Beru Whitesun lars     165     75  female  Human    other
   8 │ R5-D4                   97     32  n/a     Droid    robot
   9 │ Biggs Darklighter      183     84  male    Human    other
  10 │ Obi-Wan Kenobi         182     77  male    Human    other

or if you like using DataFramesMeta.jl:

julia> using DataFramesMeta

julia> @pipe starwars |>
             select(_, Between(:name, :mass), :gender, :species) |>
             @eachrow _ begin
                 @newcol type::Vector{String}
                 :type = :height > 200 || :mass > 200 ? "large" :
                         :species == "Droid" ? "robot" :
                         "other"
             end
10×6 DataFrame
 Row │ name                height  mass   gender  species  type
     │ String              Int64   Int64  String  String   String
─────┼────────────────────────────────────────────────────────────
   1 │ Luke Skywalker         172     77  male    Human    other
   2 │ C-3PO                  167     75  n/a     Droid    robot
   3 │ R2-D2                   96     32  n/a     Droid    robot
   4 │ Darth Vader            202    136  male    Human    large
   5 │ Leia Organa            150     49  female  Human    other
   6 │ Owen Lars              178    120  male    Human    other
   7 │ Beru Whitesun lars     165     75  female  Human    other
   8 │ R5-D4                   97     32  n/a     Droid    robot
   9 │ Biggs Darklighter      183     84  male    Human    other
  10 │ Obi-Wan Kenobi         182     77  male    Human    other

As you can see it is easy to use the ternary operator also in the case of
several variables. Using DataFrames.jl requires a bit of boilerplate syntax.
This limitation can be conveniently overcome using DataFramesMeta.jl, in the
above example I decided to use the @eachrow macro.

Conclusions

As you can see using the ternary operator in Julia gives us a very similar
functionality and syntax in comparison to the case_when function from dplyr.
Apart from the differences how missing values are handled I have discussed above
there are two features that make the solution in Julia more convenient in my
opinion:

  • in case_when all values on right hand side have to have the same type, while
    in Julia there is no such restriction;
  • case_when evaluates all right hand side expressions, while the ternary
    operator evaluates only what has to be evaluated to determine the result
    of the operation (this is often preferred when some operations may
    throw an error for certain values of their arguments).