To copy or not to copy, that is the question

By: Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/03/10/copying.html

Introduction

I have used Julia for quite some time now.
Therefore I thought that I had acquired adequate habits for using this language safely.
Unfortunately, that is not entirely true.
Today I want to share with you my thoughts on an important pattern that I need to
constantly re-learn.

As you probably know, developers of Julia packages want them to be fast. For this
reason they often optimize them. One of the common optimizations is avoiding
unnecessary allocations. This approach makes sense, but it runs the risk of unwanted
mutation of shared data.

In general, there are two scenarios in which you can run into an issue with unwanted
mutation:

  • when you pass your data to a function that might mutate it;
  • when a function returns data that you later mutate, and this operation at the same
    time mutates some other data that you did not expect to be changed.

The first scenario is handled pretty cleanly in Julia. By convention, any function
that might mutate its arguments should have its name end with !.
This is well explained in the Julia manual in the
Append ! to names of functions that modify their arguments section.
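For instance, the sort and sort! pair from Base follows this convention:

julia> v = [3, 1, 2];

julia> sort(v);   # returns a new sorted vector

julia> v          # v itself is unchanged
3-element Vector{Int64}:
 3
 1
 2

julia> sort!(v);  # the ! signals that v is mutated in place

julia> v
3-element Vector{Int64}:
 1
 2
 3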

The second case is less obvious and is the focus of my post today. I will explain
it using two examples taken from real questions asked by Julia users during the last week.

The post was written under Julia 1.9.0-rc1, DataFrames.jl 1.5.0, GLM.jl 1.8.1, and
ShiftedArrays.jl 2.0.0.

First example: shifted arrays

Consider the following code:

julia> using DataFrames

julia> using Random

julia> using ShiftedArrays: lag

julia> Random.seed!(1234);

julia> df = DataFrame(x = rand(10))
10×1 DataFrame
 Row │ x
     │ Float64
─────┼───────────
   1 │ 0.579862
   2 │ 0.411294
   3 │ 0.972136
   4 │ 0.0149088
   5 │ 0.520355
   6 │ 0.639562
   7 │ 0.839622
   8 │ 0.967143
   9 │ 0.131026
  10 │ 0.946453

julia> df.xlag = lag(df.x)
10-element ShiftedVector{Float64, Missing, Vector{Float64}}:
  missing
 0.5798621201341324
 0.4112941179498505
 0.9721360824554687
 0.014908849285099945
 0.520354993723718
 0.6395615996802734
 0.8396219340580711
 0.967142768915383
 0.13102565622085904

julia> df
10×2 DataFrame
 Row │ x          xlag
     │ Float64    Float64?
─────┼────────────────────────────
   1 │ 0.579862   missing
   2 │ 0.411294         0.579862
   3 │ 0.972136         0.411294
   4 │ 0.0149088        0.972136
   5 │ 0.520355         0.0149088
   6 │ 0.639562         0.520355
   7 │ 0.839622         0.639562
   8 │ 0.967143         0.839622
   9 │ 0.131026         0.967143
  10 │ 0.946453         0.131026

The potential trap is that the :xlag column is a view. If we modify :x, then :xlag will also be modified.

Here is an example:

julia> df.x[1] = 0.0
0.0

julia> df
10×2 DataFrame
 Row │ x          xlag
     │ Float64    Float64?
─────┼────────────────────────────
   1 │ 0.0        missing
   2 │ 0.411294         0.0
   3 │ 0.972136         0.411294
   4 │ 0.0149088        0.972136
   5 │ 0.520355         0.0149088
   6 │ 0.639562         0.520355
   7 │ 0.839622         0.639562
   8 │ 0.967143         0.839622
   9 │ 0.131026         0.967143
  10 │ 0.946453         0.131026

In some cases this might indeed be what you want. But in many cases it will lead to a bug,
especially if both objects are used in different parts of your code. Here I stored them in
a data frame because it is a convenient way to show what happens.

Another pitfall is when you try to mutate such a data frame by e.g. pushing rows to it:

julia> push!(df, (10.0, 11.0))
┌ Error: Error adding value to column :xlag. Maybe it was forgotten to ask for column element type promotion, which can be done by passing the promote=true keyword argument.

As you can see, the operation errored because we have a view stored
as a column of the data frame, and views are not resizable.
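This is not specific to ShiftedVector. Plain views from Base cannot be resized either;
here is a minimal sketch (the exact error message may differ between Julia versions):

julia> v = view([1, 2, 3], 1:3);

julia> push!(v, 4)
ERROR: MethodError: no method matching push!(::SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}, ::Int64)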

This might seem to be a simple mistake, but in practice users report making this error.
The simplest way to resolve this issue is to make sure you copy the view:

julia> df.xlag = copy(lag(df.x))
10-element Vector{Union{Missing, Float64}}:
  missing
 0.0
 0.4112941179498505
 0.9721360824554687
 0.014908849285099945
 0.520354993723718
 0.6395615996802734
 0.8396219340580711
 0.967142768915383
 0.13102565622085904

julia> push!(df, (10.0, 11.0))
11×2 DataFrame
 Row │ x           xlag
     │ Float64     Float64?
─────┼─────────────────────────────
   1 │  0.0        missing
   2 │  0.411294         0.0
   3 │  0.972136         0.411294
   4 │  0.0149088        0.972136
   5 │  0.520355         0.0149088
   6 │  0.639562         0.520355
   7 │  0.839622         0.639562
   8 │  0.967143         0.839622
   9 │  0.131026         0.967143
  10 │  0.946453         0.131026
  11 │ 10.0             11.0

Second example: linear models

Let me illustrate another pitfall with linear models. Start by creating a simple data set:

julia> Random.seed!(1234);

julia> df = DataFrame(randn(10, 3), [:x1, :x2, :error]);

julia> df.y = df.x1 + df.x2 + df.error;

julia> df.wts = rand(10);

julia> df
10×5 DataFrame
 Row │ x1          x2          error       y           wts
     │ Float64     Float64     Float64     Float64     Float64
─────┼───────────────────────────────────────────────────────────
   1 │  0.970656    0.705993   -1.0565      0.620151   0.840641
   2 │ -0.979218    1.09156     0.148361    0.260698   0.523948
   3 │  0.901861    0.871498   -1.83851    -0.0651514  0.0128461
   4 │ -0.0328031   0.0856911  -1.07363    -1.02074    0.40573
   5 │ -0.600792    0.960079   -0.20563     0.153657   0.124074
   6 │ -1.44518     0.907837    0.770703    0.233363   0.648874
   7 │  2.70742    -1.46506    -1.21394     0.0284219  0.574296
   8 │  1.52445    -0.215859    1.0915      2.40009    0.408537
   9 │  0.759804    0.575094    1.22004     2.55494    0.0731709
  10 │ -0.881437   -1.79563    -0.0758316  -2.7529     0.501756

Now fit a weighted linear model:

julia> using GLM

julia> model = lm(@formula(y ~ x1 + x2), df, wts=df.wts)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

y ~ 1 + x1 + x2

Coefficients:
─────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)  -0.163691    0.721581  -0.23    0.8550   -7.38861    7.06123
x1            0.698275    0.554725   1.26    0.4108   -4.85598    6.25253
x2            1.03178     0.732565   1.41    0.3753   -6.30312    8.36668
─────────────────────────────────────────────────────────────────────────

Assume we write a simple function that computes the variance-covariance
matrix of the coefficients of our model:

function our_vcov(model)
    X = modelmatrix(model)
    u² = residuals(model) .^ 2
    wts = model.model.rr.wts
    X .*= sqrt.(wts)
    u² .*= wts
    dfr = dof_residual(model)
    M = inv(X'*X)
    return M * sum(u²) / dfr
end

Let us check if we get the same results as returned by the built-in vcov function:

julia> vcov(model)
3×3 Matrix{Float64}:
  0.520679   -0.0861524  -0.0613806
 -0.0861524   0.30772     0.168677
 -0.0613806   0.168677    0.536652

julia> our_vcov(model)
3×3 Matrix{Float64}:
  0.520679   -0.0861524  -0.0613806
 -0.0861524   0.30772     0.168677
 -0.0613806   0.168677    0.536652

All looks good. Let us run our function again:

julia> our_vcov(model)
3×3 Matrix{Float64}:
  0.954193  -0.196279  -0.191441
 -0.196279   0.534882   0.295785
 -0.191441   0.295785   0.965165

Wait – we got a different result this time. What is the problem?

The issue is with the X .*= sqrt.(wts) operation, which updates X in place.
In the operation X = modelmatrix(model) we bound to X a value that
is internally stored in the model object. Thus we unintentionally mutated
model.

In this case a fix would be, for example, to write X = X .* sqrt.(wts) or
X = copy(modelmatrix(model)).
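For illustration, here is a sketch of the corrected function using the copy-based fix
(I call it our_vcov_fixed here; the rest of the body is unchanged). Since it no longer
touches the data stored in model, its result is stable across repeated calls:

function our_vcov_fixed(model)
    X = copy(modelmatrix(model))  # copy, so we do not mutate the matrix stored in model
    u² = residuals(model) .^ 2
    wts = model.model.rr.wts
    X .*= sqrt.(wts)              # now safe: this mutates only our local copy
    u² .*= wts
    dfr = dof_residual(model)
    M = inv(X'*X)
    return M * sum(u²) / dfr
end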

The problem is that it is easy to forget this. Both the X = modelmatrix(model)
and X .*= sqrt.(wts) operations look natural and can easily be used together.

Additionally, if you look at the modelmatrix help:

help?> modelmatrix
search: modelmatrix ModelMatrix

  modelmatrix(model::RegressionModel)

  Return the model matrix (a.k.a. the design matrix).

nothing tells you whether the returned value is the matrix stored inside model
or a copy of it.

A note to package developers: this kind of bug is not likely to be caught by standard tests,
as it becomes apparent only after you run the operation twice on the same object (and usually
tests run each operation only once).
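One cheap guard, sketched below using the Test standard library, is to call the function
twice in the same test and check that the results still agree:

using Test

@testset "our_vcov does not mutate the model" begin
    reference = vcov(model)
    @test our_vcov(model) ≈ reference
    @test our_vcov(model) ≈ reference  # on a freshly fitted model, this second check fails for the buggy version
end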

Conclusions

So what is the pattern that I need to re-learn? The golden rule is:

Always check if a value returned by some function is freshly allocated.

Unfortunately, as you could see from the examples I gave, it is easy to forget to verify this.
The major problem is that there is no visual aid that would help you identify such issues
(as opposed to functions mutating their arguments, whose names end with !).
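One quick, although imperfect, check you can do at the REPL is to call the accessor twice
and compare the results with ===; if the very same object comes back, it is not freshly
allocated. For the model fitted above this check suggests that the stored matrix is returned:

julia> modelmatrix(model) === modelmatrix(model)
true

Keep in mind that this heuristic does not cover every case; for example, a function may
return a freshly created wrapper around data that is still shared.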

If you are a user of DataFrames.jl you are partially guarded against the risks I discussed today.
The commonly used functions like select, combine, and transform, as well as the DataFrame constructor,
make sure to copy the columns of the data frame they return.
Indeed, this reduces their performance, but it helps with safety, and usually the performance hit is minimal
(and for users obsessed with speed there is the copycols=false keyword argument, as well as functions
ending with ! that work in place).
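As a small sketch of this default, the DataFrame constructor copies columns unless you opt out
with copycols=false:

julia> tmp = DataFrame(a = [1, 2, 3]);

julia> DataFrame(tmp).a === tmp.a                     # a fresh copy of the column is made
false

julia> DataFrame(tmp, copycols=false).a === tmp.a     # the column is reused
true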

So my answer to the question from the title of the post is: do copy (unless you are sure that you can
safely skip copying and the performance benefits are worth it).