Updating views in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/09/17/views.html

Introduction

Today I want to preview a feature that will be introduced in 1.3 release
of DataFrames.jl. We will talk about new ways of updating the columns
of a data frame, when one is working with views. My objective is to explain
the rationale behind the new functionality and the way it works.

This post was tested under Julia 1.6.1 and DataFrames.jl checked out at
main branch on Sep 17, 2021 (SHA-1 facb6721e7450c63f2d5684b78e3c3489ed999b0)

What is a SubDataFrame and when it is useful?

In DataFrames.jl you can construct views of data frame object using the
view function or the @view macro exactly like you can create views of arrays
in Julia Base. Here is a simple example:

julia> using DataFrames

(@v1.6) pkg> st DataFrames
      Status `~/.julia/environments/v1.6/Project.toml`
  [a93c6f00] DataFrames v1.2.2 `https://github.com/JuliaData/DataFrames.jl.git#main`

julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

julia> dfv = @view df[2:3, :]
2×2 SubDataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     2      5
   2 │     3      6

Now the dfv object is a view of df data frame. It means that it references
to the same data in memory as the parent data frame df, but allows
to access only a slice of it: in our case we have picked rows 2 and 3 and
all columns.

The key features of a view are:

  • mutating its contents also mutates the contents of the parent data frame;
  • it is cheap to create as it is enough to store only the reference to the parent
    data frame and which rows and columns got selected;
  • it is memory efficient (no copying of data happens);
  • using it has a small computational overhead as when we index a view we need
    to perform transformation of these indices to the parent data frame indices.

Let us show the first feature as it is most important from the functionality
perspective:

julia> dfv[1, 1] = 100
100

julia> dfv
2×2 SubDataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │   100      5
   2 │     3      6

julia> df
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │   100      5
   3 │     3      6

julia> df[3, 1] = 200
200

julia> dfv
2×2 SubDataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │   100      5
   2 │   200      6

julia> df
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │   100      5
   3 │   200      6

As you can see changing dfv also changes df, and vice versa – changing df
also changes dfv (if the changed cells are selected in the view).

To understand performance consider two simple implementations of a procedure
computing 90% confidence interval of correlation between two variables using
bootstrapping:

julia> using Statistics

julia> function bootcor1(df, c1, c2, n)
           cors = Float64[]
           for _ in 1:n
               tmp = df[rand(1:nrow(df), nrow(df)), :]
               push!(cors, cor(tmp[!, c1], tmp[!, c2]))
           end
           return quantile(cors, [0.05, 0.95])
       end
bootcor1 (generic function with 1 method)

julia> function bootcor2(df, c1, c2, n)
           cors = Float64[]
           for _ in 1:n
               tmp = @view df[rand(1:nrow(df), nrow(df)), :]
               push!(cors, cor(tmp[!, c1], tmp[!, c2]))
           end
           return quantile(cors, [0.05, 0.95])
       end
bootcor2 (generic function with 1 method)

(the functions could be further optimized for performance but I did not want to
overly complicate the code)

The difference between bootcor1 and bootcor2 is that the former copies a
data frame, while the latter uses a view. Both take four parameters:

  • df: a data frame to analyze
  • c1, c2: column identifiers of columns we want to compute the correlation;
  • n: number of bootstrapping samples;

Now create a simple data frame and compare the performance of both functions
(I present timings after compilation):

julia> df = DataFrame(rand(10^5, 10), :auto);

julia> @time bootcor1(df, :x1, :x2, 10_000)
 47.059650 seconds (430.02 k allocations: 81.976 GiB, 1.88% gc time)
2-element Vector{Float64}:
 -0.007373812772086598
  0.0029150608879804406

julia> @time bootcor2(df, :x1, :x2, 10_000)
 11.239822 seconds (80.02 k allocations: 7.453 GiB, 0.92% gc time)
2-element Vector{Float64}:
 -0.007643923412421664
  0.002966538851599437

As you can see, because the data frame was wide (10 columns), we saved a lot of
time by avoiding copying of the data.

Of course if the data frame were narrower we would not see such a difference:

julia> df = DataFrame(rand(10^5, 2), :auto);

julia> @time bootcor1(df, :x1, :x2, 10_000)
 10.829548 seconds (190.02 k allocations: 22.363 GiB, 1.60% gc time)
2-element Vector{Float64}:
 -0.006650139955186956
  0.0038227359319118795

julia> @time bootcor2(df, :x1, :x2, 10_000)
 10.963020 seconds (80.02 k allocations: 7.453 GiB, 0.53% gc time)
2-element Vector{Float64}:
 -0.006575024146232311
  0.0038253588364537162

The reason is that now while using a view still allocates less this is offset
by the fact that working with views has some computational overhead as it was
explained above.

What is new for SubDataFrame in DataFrames.jl 1.3?

Let us start with our original small data frame:

julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

Assume you wanted to assign a 1.5 value in the first row of column :a.
Before the upcoming DataFrames.jl 1.3 release it is quite cumbersome. If you
try doing it you get:

julia> df[1, :a] = 1.5
ERROR: InexactError: Int64(1.5)

You need to do two steps:

  1. promote the element type of column :a to allow Float64 values;
  2. perform the assignment.

Here is a way to do it:

julia> df.a = Vector{Float64}(df.a)
3-element Vector{Float64}:
 1.0
 2.0
 3.0

julia> df[1, :a] = 1.5
1.5

julia> df
3×2 DataFrame
 Row │ a        b
     │ Float64  Int64
─────┼────────────────
   1 │     1.5      4
   2 │     2.0      5
   3 │     3.0      6

Here is one more, well known, example of a similar situation, that sometimes
surprises users:

julia> df[1, :b] = 'a'
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> df
3×2 DataFrame
 Row │ a        b
     │ Float64  Int64
─────┼────────────────
   1 │     1.5     97
   2 │     2.0      5
   3 │     3.0      6

In this case Julia silently converted Char value 'a' to its Int
representation which is 97.

The key change in the 1.3 release of DataFrames.jl is that views will allow to
use ! as row index (currently it is disallowed). The mechanics of this
functionality is the same as when ! is used for DataFrame objects – a
column will get replaced in the data frame.

A natural question is the following with what will it get replaced? It is quite
valid as we are replacing only a portion of the column. The design decision we
took is that promote_type will be used to decide the element type of the new
column combining the element type of the already present column and element type
of the newly assigned values.

Therefore in our examples above, when using a view you get the following:

julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

julia> dfv = @view df[1:1, :]
1×2 SubDataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4

julia> dfv[!, :a] = [1.5]
1-element Vector{Float64}:
 1.5

julia> dfv[!, :b] .= 'a'
1-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> df
3×2 DataFrame
 Row │ a        b
     │ Float64  Any
─────┼──────────────
   1 │     1.5  a
   2 │     2.0  5
   3 │     3.0  6

julia> dfv
1×2 SubDataFrame
 Row │ a        b
     │ Float64  Any
─────┼──────────────
   1 │     1.5  a

As you can see it works both with standard assignment as well as with
broadcasted assignment.

Admittedly you still have to make two steps in the process:

  1. create a view;
  2. perform an assignment to it.

This is a bit cumbersome. Fortunately we can expect that in the future
DataFramesMeta.jl will provide a convenience syntax to perform
conditional assignment using this feature, e.g. like in data.table, where you
can write something like df[x == 1, y := 2] to set column y to 2 if
column x is equal to 1.

One special case that is often required is adding columns. It is supported
with both : and ! row selectors (like for DataFrame objects). In this case
we do not have a reference column in a parent data frame, so rows that are not
included in the view are filled with missing.

Here are two examples:

julia> dfv[!, :c] = ["x"]
1-element Vector{String}:
 "x"

julia> dfv[:, :d] .= true
1-element Vector{Bool}:
 1

julia> df
3×4 DataFrame
 Row │ a        b    c        d
     │ Float64  Any  String?  Bool?
─────┼────────────────────────────────
   1 │     1.5  a    x           true
   2 │     2.0  5    missing  missing
   3 │     3.0  6    missing  missing

julia> dfv
1×4 SubDataFrame
 Row │ a        b    c        d
     │ Float64  Any  String?  Bool?
─────┼──────────────────────────────
   1 │     1.5  a    x         true

The only limitation is that in this case it is only allowed if SubDataFrame
was created with : as column selector. The reason of this limitation is that
when one uses : selector we are guaranteed that SubDataFrame has the same
columns and in the same order as its parent, so the requested operation is
guaranteed not to be problematic in interpretation (otherwise we would have to
handle e.g. the case when we want to add a column whose name is not present in
the SubDataFrame but is present in its parent which could confuse users).

Conclusions

In summary the new functionality allows to replace columns in a data frame
through its view. The two main intended use cases of this feature are:

  • adding new columns for which we have data only for some rows
    (selected in the view); it is only allowed when SubDataFrame
    was created with : as column selector;
  • updating data in existing columns even if the new elements cannot be converted
    to the element type of existing column; in this case promote_type is used to
    determine the target column element type.