To copy or not to copy, that is the question
Introduction
I have used Julia for quite some time now, so I thought that I had acquired adequate habits for using the language safely. Unfortunately, this is not entirely true. Today I want to share with you my thoughts on an important pattern that I need to constantly re-learn.
As you probably know, developers of Julia packages want them to be fast, so they often optimize their code. One of the common optimizations is avoiding unnecessary allocations. This approach makes sense, but it runs the risk of unwanted mutation of shared data.
In general, there are two scenarios in which you can have an issue with unwanted mutation:
- when you pass your data to a function that might mutate it;
- when you get data returned from a function and later mutate it, and this operation also mutates some other data that you did not expect to change.
The first scenario is handled pretty cleanly in Julia. Any function that might mutate its arguments should, by convention, have its name terminated with !. This is well explained in the Julia manual, in the "Append ! to names of functions that modify their arguments" section.
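For instance, here is a minimal illustration of the convention using sort and sort!:
julia> x = [3, 1, 2];
julia> sort(x); x  # sort returns a new, sorted vector, so x itself is unchanged
3-element Vector{Int64}:
 3
 1
 2
julia> sort!(x); x  # sort! sorts x in place, as the ! signals
3-element Vector{Int64}:
 1
 2
 3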
The second case is less obvious and is the focus of today's post. I will explain it using two examples taken from real questions asked by Julia users during the last week.
The post was written under Julia 1.9.0-rc1, DataFrames.jl 1.5.0, GLM.jl 1.8.1, and ShiftedArrays.jl 2.0.0.
First example: shifted arrays
Consider the following code:
julia> using DataFrames
julia> using Random
julia> using ShiftedArrays: lag
julia> Random.seed!(1234);
julia> df = DataFrame(x = rand(10))
10×1 DataFrame
Row │ x
│ Float64
─────┼───────────
1 │ 0.579862
2 │ 0.411294
3 │ 0.972136
4 │ 0.0149088
5 │ 0.520355
6 │ 0.639562
7 │ 0.839622
8 │ 0.967143
9 │ 0.131026
10 │ 0.946453
julia> df.xlag = lag(df.x)
10-element ShiftedVector{Float64, Missing, Vector{Float64}}:
missing
0.5798621201341324
0.4112941179498505
0.9721360824554687
0.014908849285099945
0.520354993723718
0.6395615996802734
0.8396219340580711
0.967142768915383
0.13102565622085904
julia> df
10×2 DataFrame
Row │ x xlag
│ Float64 Float64?
─────┼────────────────────────────
1 │ 0.579862 missing
2 │ 0.411294 0.579862
3 │ 0.972136 0.411294
4 │ 0.0149088 0.972136
5 │ 0.520355 0.0149088
6 │ 0.639562 0.520355
7 │ 0.839622 0.639562
8 │ 0.967143 0.839622
9 │ 0.131026 0.967143
10 │ 0.946453 0.131026
The potential trap is that the :xlag column is a view. If we modify :x, then :xlag will be modified as well.
Here is an example:
julia> df.x[1] = 0.0
0.0
julia> df
10×2 DataFrame
Row │ x xlag
│ Float64 Float64?
─────┼────────────────────────────
1 │ 0.0 missing
2 │ 0.411294 0.0
3 │ 0.972136 0.411294
4 │ 0.0149088 0.972136
5 │ 0.520355 0.0149088
6 │ 0.639562 0.520355
7 │ 0.839622 0.639562
8 │ 0.967143 0.839622
9 │ 0.131026 0.967143
10 │ 0.946453 0.131026
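If you are not sure whether a column aliases some other data, you can check it directly. A minimal sketch (assuming, as is typical for wrapper array types, that ShiftedArrays supports Base.parent to retrieve the wrapped vector):
julia> parent(df.xlag) === df.x  # true means both refer to the same underlying vector
true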
In some cases this might indeed be what you want. But in many cases it will lead to a bug, especially if both objects are used in different parts of your code. Here I stored them in a data frame because it is a convenient way to show what happens.
Another pitfall appears when you try to mutate such a data frame, e.g. by pushing rows to it:
julia> push!(df, (10.0, 11.0))
┌ Error: Error adding value to column :xlag. Maybe it was forgotten to ask for column element type promotion, which can be done by passing the promote=true keyword argument.
As you can see, the operation errored because a view is stored as a column of the data frame, and views are not resizable. This might seem like a simple mistake, but in practice users report running into it.
The simplest way to resolve this issue is to make sure you copy the view:
julia> df.xlag = copy(lag(df.x))
10-element Vector{Union{Missing, Float64}}:
missing
0.0
0.4112941179498505
0.9721360824554687
0.014908849285099945
0.520354993723718
0.6395615996802734
0.8396219340580711
0.967142768915383
0.13102565622085904
julia> push!(df, (10.0, 11.0))
11×2 DataFrame
Row │ x xlag
│ Float64 Float64?
─────┼─────────────────────────────
1 │ 0.0 missing
2 │ 0.411294 0.0
3 │ 0.972136 0.411294
4 │ 0.0149088 0.972136
5 │ 0.520355 0.0149088
6 │ 0.639562 0.520355
7 │ 0.839622 0.639562
8 │ 0.967143 0.839622
9 │ 0.131026 0.967143
10 │ 0.946453 0.131026
11 │ 10.0 11.0
Second example: linear models
Let me illustrate another pitfall using linear models. Start by creating a simple data set:
julia> Random.seed!(1234);
julia> df = DataFrame(randn(10, 3), [:x1, :x2, :error]);
julia> df.y = df.x1 + df.x2 + df.error;
julia> df.wts = rand(10);
julia> df
10×5 DataFrame
Row │ x1 x2 error y wts
│ Float64 Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────────────────────
1 │ 0.970656 0.705993 -1.0565 0.620151 0.840641
2 │ -0.979218 1.09156 0.148361 0.260698 0.523948
3 │ 0.901861 0.871498 -1.83851 -0.0651514 0.0128461
4 │ -0.0328031 0.0856911 -1.07363 -1.02074 0.40573
5 │ -0.600792 0.960079 -0.20563 0.153657 0.124074
6 │ -1.44518 0.907837 0.770703 0.233363 0.648874
7 │ 2.70742 -1.46506 -1.21394 0.0284219 0.574296
8 │ 1.52445 -0.215859 1.0915 2.40009 0.408537
9 │ 0.759804 0.575094 1.22004 2.55494 0.0731709
10 │ -0.881437 -1.79563 -0.0758316 -2.7529 0.501756
Now fit a weighted linear model:
julia> using GLM
julia> model = lm(@formula(y ~ x1 + x2), df, wts=df.wts)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
y ~ 1 + x1 + x2
Coefficients:
─────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept) -0.163691 0.721581 -0.23 0.8550 -7.38861 7.06123
x1 0.698275 0.554725 1.26 0.4108 -4.85598 6.25253
x2 1.03178 0.732565 1.41 0.3753 -6.30312 8.36668
─────────────────────────────────────────────────────────────────────────
Assume we write a simple function that computes the variance-covariance matrix of the coefficients of our model:
function our_vcov(model)
X = modelmatrix(model)
u² = residuals(model) .^ 2
wts = model.model.rr.wts
X .*= sqrt.(wts)
u² .*= wts
dfr = dof_residual(model)
M = inv(X'*X)
return M * sum(u²) / dfr
end
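For the record, with W = Diagonal(wts) this computes inv(X' * W * X) * sum(wts .* residuals(model) .^ 2) / dof_residual(model), the usual dispersion-based covariance estimate for weighted least squares; scaling the rows of X by sqrt.(wts) is just a convenient way of forming X' * W * X.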
Let us check if we get the same results as those returned by the built-in vcov function:
julia> vcov(model)
3×3 Matrix{Float64}:
0.520679 -0.0861524 -0.0613806
-0.0861524 0.30772 0.168677
-0.0613806 0.168677 0.536652
julia> our_vcov(model)
3×3 Matrix{Float64}:
0.520679 -0.0861524 -0.0613806
-0.0861524 0.30772 0.168677
-0.0613806 0.168677 0.536652
All looks good. Let us run our function again:
julia> our_vcov(model)
3×3 Matrix{Float64}:
0.954193 -0.196279 -0.191441
-0.196279 0.534882 0.295785
-0.191441 0.295785 0.965165
Wait - we got a different result this time. What is the problem?
The issue is with the X .*= sqrt.(wts) operation, which updates X in place. In the operation X = modelmatrix(model) we bound to X a value that is internally stored in the model object. Thus we unintentionally mutated model. In this case a fix would be, for example, to write X = X .* sqrt.(wts) or X = copy(modelmatrix(model)).
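To make this concrete, here is a sketch of a non-mutating variant of the function (our_vcov_safe is a name I introduce only for illustration):
function our_vcov_safe(model)
    X = copy(modelmatrix(model))   # work on a copy so the model is not mutated
    u² = residuals(model) .^ 2     # .^ already allocates a fresh vector
    wts = model.model.rr.wts       # the weights are only read, so no copy is needed
    X .*= sqrt.(wts)
    u² .*= wts
    dfr = dof_residual(model)
    M = inv(X'*X)
    return M * sum(u²) / dfr
end
Calling our_vcov_safe(model) repeatedly should now return the same matrix every time.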
The problem is that it is easy to forget this. Both the X = modelmatrix(model) and X .*= sqrt.(wts) operations look natural and can easily be used together. Additionally, if you look at the modelmatrix help:
help?> modelmatrix
search: modelmatrix ModelMatrix
modelmatrix(model::RegressionModel)
Return the model matrix (a.k.a. the design matrix).
nothing warns you whether the returned value is a freshly allocated copy or an object stored inside model.
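One quick heuristic (a sketch, not a general guarantee) is to call the accessor twice and compare the results with ===. If the same object comes back both times, it is stored inside the model and should be copied before mutating it. Under the package versions listed above our model behaves exactly like that:
julia> modelmatrix(model) === modelmatrix(model)
true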
A note to package developers: this kind of bug is not likely to be caught by standard tests, as it becomes apparent only after you run the operation twice on the same object (and usually tests run each operation only once).
Conclusions
So what is the pattern that I need to re-learn? The golden rule is:
Always check if a value returned by some function is freshly allocated.
Unfortunately, as you could see from the examples I gave, it is easy to forget to verify this.
The major problem is that there is no visual aid that would help you identify such issues (as opposed to functions mutating their arguments, which by convention end with !).
If you are a user of DataFrames.jl, you are partially guarded against the risks I discuss today. The commonly used functions like select, combine, transform, or the DataFrame constructor make sure to copy the columns of the data frame they return. Indeed, this reduces their performance, but it helps with safety, and usually the performance hit is minimal (and for users obsessed with speed we have the copycols=false keyword argument or functions with ! that work in place).
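For example, here is a small sketch of the difference (=== checks whether the very same vector object is stored in the data frame):
julia> v = [1.0, 2.0, 3.0];
julia> DataFrame(x=v).x === v  # by default the constructor copies the column
false
julia> DataFrame(x=v, copycols=false).x === v  # with copycols=false the vector is stored as-is
true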
So my answer to the question from the title of the post is: do copy (unless you are sure that you can safely skip copying and the performance benefits are worth it).