Does DataFrames.jl copy or not copy, that is the question
Introduction
Some time ago I wrote a post about my thoughts on copying of data when working with it in Julia.
Today I want to focus on a related but narrower topic specific to DataFrames.jl. People starting to work with this package are sometimes confused about when columns get copied and when they do not. I want to discuss the most common cases in this post.
Spoiler! The post is a bit long. If you want simple advice - you can skip to the section with conclusions.
The post was written using Julia 1.9.2 and DataFrames.jl 1.6.1.
Getting a column from a data frame
Let us start with a simpler case. When does copying happen if we get a column from a data frame?
First we set up some initial data:
julia> using DataFrames
julia> df = DataFrame(a=1:10^6)
1000000×1 DataFrame
     Row │ a
         │ Int64
─────────┼─────────
       1 │       1
       2 │       2
       3 │       3
       4 │       4
       5 │       5
    ⋮    │    ⋮
  999997 │  999997
  999998 │  999998
  999999 │  999999
 1000000 │ 1000000
        999991 rows omitted
There are three ways to get the :a column from this data frame: df.a, df[:, :a] and df[!, :a].
Let us check them one by one. Start with df.a:
julia> df.a
1000000-element Vector{Int64}:
1
2
3
4
5
6
7
⋮
999995
999996
999997
999998
999999
1000000
julia> @allocated df.a
0
df.a extracts the column without copying data. You can see this from the fact that no allocations are performed in this operation.
Now check df[:, :a], which uses a standard row index : that is also used in arrays:
julia> df[:, :a]
1000000-element Vector{Int64}:
1
2
3
4
5
6
7
⋮
999995
999996
999997
999998
999999
1000000
julia> @allocated df[:, :a]
8000048
df[:, :a] copies data: we see a lot of memory allocated this time. This is identical to how : works for arrays.
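To make the comparison with arrays concrete, here is a minimal sketch on a plain vector. The variable names x and y are only illustrative and are not part of the session above:

```julia
x = [1, 2, 3]

# Indexing with : on an array returns a new vector with the same contents.
y = x[:]

y == x     # true: same values
y === x    # false: a different object in memory

# Mutating the copy leaves the original untouched.
y[1] = 100
x[1]       # 1
```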
Finally check df[!, :a], which uses a non-standard ! row index:
julia> df[!, :a]
1000000-element Vector{Int64}:
1
2
3
4
5
6
7
⋮
999995
999996
999997
999998
999999
1000000
julia> @allocated df[!, :a]
0
We can see that df[!, :a] does not allocate. It is equivalent to df.a, just with slightly different syntax (the indexing syntax with ! is handy if we want to select multiple columns from a data frame, which is not possible with the df.a syntax).
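The three access paths can also be compared by object identity instead of allocation counts. The following is a minimal sketch on a small data frame; the variable names are illustrative, and the last part assumes that selecting several columns with ! reuses the stored vectors in the same way as selecting a single column does:

```julia
using DataFrames

df = DataFrame(a=1:5)

v_dot = df.a          # the vector stored in df
v_bang = df[!, :a]    # the same vector
v_colon = df[:, :a]   # a fresh copy

v_dot === v_bang      # true
v_dot === v_colon     # false

# A change made through the non-copying accessor is visible in df ...
v_dot[1] = 100
df.a[1]               # 100
# ... but not in the copy taken earlier.
v_colon[1]            # 1

# Selecting several columns with ! gives a new data frame
# whose columns are the same vectors as in df.
sub = df[!, [:a]]
sub.a === df.a        # true
```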
This part was relatively easy. Now let us turn to a harder case of setting a column of a data frame.
Case 1: setting a column in a data frame using assignment
First store the :a column in a temporary variable a (without copying it):
julia> a = df.a
1000000-element Vector{Int64}:
1
2
3
4
5
6
7
⋮
999995
999996
999997
999998
999999
1000000
Now let us check various ways of creating a column that will store a.
Begin with creating a new column:
julia> df.b = a
1000000-element Vector{Int64}:
1
2
3
4
5
6
7
⋮
999995
999996
999997
999998
999999
1000000
julia> df.b === a
true
We can see that if we put df.b on the left hand side the operation does not copy the passed data.
You can probably already guess that the same happens with df[!, :c] on the left hand side. Indeed this is the case:
julia> df[!, :c] = a
1000000-element Vector{Int64}:
1
2
3
4
5
6
7
⋮
999995
999996
999997
999998
999999
1000000
julia> df.c === a
true
What about df[:, :d]? Let us see:
julia> df[:, :d] = a
1000000-element Vector{Int64}:
1
2
3
4
5
6
7
⋮
999995
999996
999997
999998
999999
1000000
julia> df.d === a
false
So we see the first difference. When creating a new column the data was copied. But what would happen if the column already existed in the data frame?
Well, for the df.b and df[!, :c] syntaxes nothing would change, as they just put the right hand side vector into the data frame without copying it.
But for df[:, :d] the situation is different. Let us check:
julia> d = df.d;
julia> df[:, :d] = a;
julia> df.d === a
false
julia> df.d === d
true
We can see that if we use the df[:, :d] syntax on the left hand side the operation is in-place, that is, the vector already present in df is reused and the data is written into it. This means that we cannot use df[:, :d] = ... to change the element type of column :d. Let us see:
julia> df[:, :d] = a .+ 0.5;
ERROR: InexactError: Int64(1.5)
Indeed a .+ 0.5 contains floating point values, and the :d column allows only integers.
Note that with df.b = ... or df[!, :c] = ... we would not have this issue as they replace columns with what is passed on the right hand side:
julia> df.b = a .+ 0.5
1000000-element Vector{Float64}:
1.5
2.5
3.5
4.5
5.5
6.5
7.5
⋮
999995.5
999996.5
999997.5
999998.5
999999.5
1.0000005e6
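The contrast between in-place assignment and column replacement can be condensed into a short script. This is only a sketch on a three-row data frame, with the error captured in a try block so that the script keeps running; the names d and err are illustrative:

```julia
using DataFrames

df = DataFrame(d=[1, 2, 3])
d = df.d

# In-place: values are written into the vector that already sits in df.
df[:, :d] = [10, 20, 30]
df.d === d            # true
d                     # [10, 20, 30]

# In-place assignment cannot change the element type of the column.
err = try
    df[:, :d] = [1.5, 2.5, 3.5]
    nothing
catch e
    e
end
err isa InexactError  # true

# Replacing the column works, because a new vector is put into df.
df[!, :d] = [1.5, 2.5, 3.5]
eltype(df.d)          # Float64
df.d === d            # false
```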
There is one more twist to this story. It is related to ranges.
The issue is that a DataFrame object always materializes ranges stored in it.
Therefore the following operation allocates data:
julia> df.b = 1:10^6
1:1000000
julia> df.b
1000000-element Vector{Int64}:
1
2
3
4
5
6
7
⋮
999995
999996
999997
999998
999999
1000000
The issue is that generally df.b = ... does not allocate, but since we disallow storing ranges as columns of a data frame (in our case the 1:10^6 range) the allocation still takes place.
You would have the same behavior with df[!, :c] = 1:10^6.
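A quick way to confirm that a range was materialized is to inspect the type of the stored column. A minimal sketch, with an illustrative variable r holding the range:

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3])
r = 1:3

df.b = r
df[!, :c] = r

r isa UnitRange{Int}     # true: the right hand side is a lazy range
df.b isa Vector{Int}     # true: the stored column is a materialized vector
df.c isa Vector{Int}     # true

# Each assignment materializes its own vector, so the columns are not aliased.
df.b === df.c            # false
```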
Case 2: setting a column in a data frame using broadcasted assignment
Julia is famous for its powerful broadcasting capabilities. Let us thus investigate what happens when we replace = with .= in our experiments. We will reproduce all the examples we gave above from scratch.
Start with df.b .= a:
julia> df = DataFrame(a=1:10^6);
julia> a = df.a;
julia> df.b .= a;
julia> df.b === a
false
We now see a difference. The :b column is freshly allocated.
Let us check the two other ways of creating a new column:
julia> df[!, :c] .= a;
julia> df.c === a
false
julia> df[:, :d] .= a;
julia> df.d === a
false
They have the same effect: a new column gets allocated.
In the case of an existing column, df.b .= ... and df[!, :c] .= ... would again create a new copied column:
julia> df.b .= a .+ 0.5
1000000-element Vector{Float64}:
1.5
2.5
3.5
4.5
5.5
6.5
7.5
⋮
999995.5
999996.5
999997.5
999998.5
999999.5
1.0000005e6
The difference is with df[:, :d] .= ...:
julia> d = df.d;
julia> df[:, :d] .= a;
julia> df.d === a
false
julia> df.d === d
true
julia> df[:, :d] .= a .+ 0.5
ERROR: InexactError: Int64(1.5)
So we see that here we have an in-place operation, just like with the df[:, :d] = ... syntax.
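The broadcasting rules from this section can be gathered in one small script. This is a sketch on a three-row data frame that assumes the Julia and DataFrames.jl versions stated at the top of the post; the names b_old and d_old are illustrative:

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3])
a = df.a

# New columns: all three forms allocate a fresh vector.
df.b .= a
df[!, :c] .= a
df[:, :d] .= a
df.b === a              # false
df.c === a              # false
df.d === a              # false

# Existing column with df.b .= ...: the column is replaced by a new vector,
# so the element type is free to change.
b_old = df.b
df.b .= a .+ 0.5
df.b === b_old          # false
eltype(df.b)            # Float64

# Existing column with df[:, :d] .= ...: the stored vector is updated in-place.
d_old = df.d
df[:, :d] .= a .* 2
df.d === d_old          # true
d_old                   # [2, 4, 6]
```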
Conclusions
As a summary let me discuss a common anti-pattern:
df.a = df.b
Given the examples I presented we know that after this operation the :a and :b columns of the df data frame are aliased, i.e. df.a === df.b produces true. Usually this is not a desired situation as many operations assume that columns of a data frame do not share memory.
Fortunately, we also already learned an easy fix to the aliasing problem. You can just write:
df.a .= df.b
to get a copy of :b stored in column :a.
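To see why aliasing can be surprising, consider the following minimal sketch on a small data frame. It also uses an explicit copy call, which is a standard Julia way to get an independent vector and is shown here only as an alternative spelling of the fix:

```julia
using DataFrames

df = DataFrame(b=[1, 2, 3])

# The anti-pattern: both columns now share one vector.
df.a = df.b
df.a === df.b     # true

# Changing one column silently changes the other.
df.a[1] = 100
df.b[1]           # 100

# The fix: store an independent copy instead.
df.a = copy(df.b)
df.a === df.b     # false

df.a[2] = 200
df.b[2]           # 2, column :b is no longer affected
```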
I hope the examples I gave in my post today will be useful for your work with DataFrames.jl.