# Introduction

Some time ago I wrote a post about my thoughts on copying of data when working with it in Julia.

Today I want to focus on a related but narrower topic: DataFrames.jl. People starting to work with this package are sometimes confused about when columns get copied and when they are not. In this post I discuss the most common cases.

Spoiler! The post is a bit long. If you just want simple advice, you can skip to the Conclusions section.

The post was written using Julia 1.9.2 and DataFrames.jl 1.6.1.

# Getting a column from a data frame

Let us start with the simpler case. When does copying happen if we get a column from a data frame?

First we set up some initial data:

julia> using DataFrames

julia> df = DataFrame(a=1:10^6)
1000000×1 DataFrame
     Row │ a
         │ Int64
─────────┼─────────
       1 │       1
       2 │       2
       3 │       3
       4 │       4
       5 │       5
    ⋮    │    ⋮
  999997 │  999997
  999998 │  999998
  999999 │  999999
 1000000 │ 1000000
   999991 rows omitted


There are three ways to get the :a column from this data frame: df.a, df[:, :a] and df[!, :a]. Let us check them one by one. Start with df.a:

julia> df.a
1000000-element Vector{Int64}:
1
2
3
4
5
6
7
⋮
999995
999996
999997
999998
999999
1000000

julia> @allocated df.a
0


df.a extracts the column without copying data. We can see this from the fact that the operation performs no allocations.

Now check df[:, :a], which uses a standard row index : that is also used in arrays:

julia> df[:, :a]
1000000-element Vector{Int64}:
1
2
3
4
5
6
7
⋮
999995
999996
999997
999998
999999
1000000

julia> @allocated df[:, :a]
8000048


df[:, :a] copies the data: this time about 8 MB is allocated, which matches one million Int64 values at 8 bytes each. This is the same behavior that : has for arrays.

Finally check df[!, :a], which uses a non-standard ! row index:

julia> df[!, :a]
1000000-element Vector{Int64}:
1
2
3
4
5
6
7
⋮
999995
999996
999997
999998
999999
1000000

julia> @allocated df[!, :a]
0


We can see that df[!, :a] does not allocate. It is equivalent to df.a, just with slightly different syntax (the indexing syntax with ! is handy if we want to select multiple columns from a data frame, which is not possible with the df.a syntax).
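As a quick cross-check that does not rely on @allocated, the sketch below compares object identity with === on a small data frame. The five-row df and the variable names are made up for illustration; the behavior assumed is the one described above for Julia 1.9.2 and DataFrames.jl 1.6.1.

```julia
using DataFrames

df = DataFrame(a=1:5)

# df.a and df[!, :a] return the very vector that df stores
same_object = df.a === df[!, :a]

# df[:, :a] returns a copy: equal contents, but a different object
col_copy = df[:, :a]
is_copy = col_copy !== df.a && col_copy == df.a

# Writing through the non-copied vector changes the data frame ...
v = df.a
v[1] = 100
df_changed = df[1, :a] == 100

# ... while writing to the copy leaves the data frame untouched
col_copy[2] = -1
df_untouched = df[2, :a] == 2
```

The practical reading: use df.a or df[!, :a] when you want speed and accept that later writes affect df, and df[:, :a] when you need an independent vector.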

This part was relatively easy. Now let us turn to a harder case of setting a column of a data frame.

# Case 1: setting a column in a data frame using assignment

First store the :a column in a temporary variable a (without copying it):

julia> a = df.a
1000000-element Vector{Int64}:
1
2
3
4
5
6
7
⋮
999995
999996
999997
999998
999999
1000000


Now let us check the various ways of creating a column that will store a. We begin with creating a new column:

julia> df.b = a
1000000-element Vector{Int64}:
1
2
3
4
5
6
7
⋮
999995
999996
999997
999998
999999
1000000

julia> df.b === a
true


We can see that if we put df.b on the left-hand side, the operation does not copy the passed data. You can probably already guess that the same happens with df[!, :c] on the left-hand side. Indeed, this is the case:

julia> df[!, :c] = a
1000000-element Vector{Int64}:
1
2
3
4
5
6
7
⋮
999995
999996
999997
999998
999999
1000000

julia> df.c === a
true
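Since both forms store the passed vector itself, a later write to that vector is visible in every column that holds it. Here is a minimal sketch of this consequence on a small, made-up data frame (the sizes and values are mine, not from the session above):

```julia
using DataFrames

df = DataFrame(a=1:5)
a = df.a          # no copy: `a` is the vector stored in df

df.b = a          # new column that stores `a` itself
df[!, :c] = a     # new column that stores `a` itself

a[1] = 100        # a single write through the shared vector

# All three columns are the same object, so all of them see the write
seen = (df.a[1], df.b[1], df.c[1])
```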


What about df[:, :d]? Let us see:

julia> df[:, :d] = a
1000000-element Vector{Int64}:
1
2
3
4
5
6
7
⋮
999995
999996
999997
999998
999999
1000000

julia> df.d === a
false


So we see the first difference: when a new column is created with df[:, :d], the data is copied. But what happens if the column already exists in the data frame?

For the df.b and df[!, :c] syntaxes nothing changes, as they just put the right-hand side vector into the data frame without copying it. But for df[:, :d] the situation is different. Let us check:

julia> d = df.d;

julia> df[:, :d] = a;

julia> df.d === a
false

julia> df.d === d
true


We can see that with df[:, :d] on the left-hand side the operation is in-place: the vector already present in df is reused, and the new values are written into it. This means that we cannot use df[:, :d] = ... to change the element type of column :d. Let us see:

julia> df[:, :d] = a .+ 0.5;
ERROR: InexactError: Int64(1.5)


Indeed, a .+ 0.5 contains floating point values, while the :d column allows only integers. Note that with df.b = ... or df[!, :c] = ... we would not have this issue, as they replace the column with whatever is passed on the right-hand side:

julia> df.b = a .+ 0.5
1000000-element Vector{Float64}:
1.5
2.5
3.5
4.5
5.5
6.5
7.5
⋮
999995.5
999996.5
999997.5
999998.5
999999.5
1.0000005e6
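The same contrast can be written as a short script rather than a REPL session. This is only a sketch on a made-up five-row data frame; the try/catch wrapper and the variable names are my additions, used to record the outcome instead of stopping on the error.

```julia
using DataFrames

df = DataFrame(a=1:5, d=1:5)
a = df.a
d = df.d                       # the vector currently stored as :d

# In-place assignment must fit the Int64 column, so 1.5 is rejected
failed_in_place = try
    df[:, :d] = a .+ 0.5
    false
catch err
    err isa InexactError
end

# Replacing the column swaps in the Float64 vector instead
df[!, :d] = a .+ 0.5
new_eltype = eltype(df.d)
detached = df.d !== d          # the old vector is no longer part of df
```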


There is one more twist to this story, and it is related to ranges. A DataFrame object always materializes ranges stored in it. Therefore, the following operation allocates data:

julia> df.b = 1:10^6
1:1000000

julia> df.b
1000000-element Vector{Int64}:
1
2
3
4
5
6
7
⋮
999995
999996
999997
999998
999999
1000000


In general df.b = ... does not allocate, but since a data frame does not allow storing ranges as columns (in our case the 1:10^6 range), the allocation still takes place. You would get the same behavior with df[!, :c] = 1:10^6.
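To make the materialization visible, one can inspect the type of the stored column. A minimal sketch, with a small range chosen only for illustration:

```julia
using DataFrames

df = DataFrame(a=1:5)
r = 1:5                        # a lazy UnitRange{Int64}

df.b = r                       # assignment without broadcasting
stored_type = typeof(df.b)     # the column is a materialized vector
still_a_range = df.b isa AbstractRange

# The same holds for the column created by the constructor
ctor_type = typeof(df.a)
```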

# Case 2: setting a column in a data frame using broadcasted assignment

Julia is famous for its powerful broadcasting capabilities. Let us thus investigate what happens when we replace = with .= in our experiments. We will reproduce all the examples we gave above from scratch.

Start with df.b .= a:

julia> df = DataFrame(a=1:10^6);

julia> a = df.a;

julia> df.b .= a;

julia> df.b === a
false


We now see a difference. The :b column is freshly allocated.

Let us check the two other ways of creating a new column:

julia> df[!, :c] .= a;

julia> df.c === a
false

julia> df[:, :d] .= a;

julia> df.d === a
false


They have the same effect: a new column gets allocated.

For an existing column, df.b .= ... and df[!, :c] .= ... again allocate a new column that replaces the old one:

julia> df.b .= a .+ 0.5
1000000-element Vector{Float64}:
1.5
2.5
3.5
4.5
5.5
6.5
7.5
⋮
999995.5
999996.5
999997.5
999998.5
999999.5
1.0000005e6


The difference is with df[:, :d] .= ...:

julia> d = df.d;

julia> df[:, :d] .= a;

julia> df.d === a
false

julia> df.d === d
true

julia> df[:, :d] .= a .+ 0.5
ERROR: InexactError: Int64(1.5)


So we see that this is an in-place operation, just like df[:, :d] = ....
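The broadcasting results can be collected into one compact check based on ===. This is a sketch under the assumption of the versions used in this post (Julia 1.9.2, DataFrames.jl 1.6.1); the small data frame and the *_old names are made up for illustration.

```julia
using DataFrames

df = DataFrame(a=1:5)
a = df.a

# New columns: every broadcasted form allocates a fresh vector
df.b .= a
df[!, :c] .= a
df[:, :d] .= a
all_fresh = df.b !== a && df.c !== a && df.d !== a

# Existing columns: remember the vectors currently stored in df
b_old, c_old, d_old = df.b, df.c, df.d
df.b .= a
df[!, :c] .= a
df[:, :d] .= a

b_replaced = df.b !== b_old    # df.b .= ... swapped in a new vector
c_replaced = df.c !== c_old    # df[!, :c] .= ... did the same
d_in_place = df.d === d_old    # df[:, :d] .= ... reused the old vector
```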

# Conclusions

As a summary let me discuss a common anti-pattern:

df.a = df.b


Given the examples I presented, we know that after this operation the :a and :b columns of the df data frame are aliased, i.e. df.a === df.b produces true. Usually this is not desired, as many operations assume that the columns of a data frame do not share memory.

Fortunately, we have already learned an easy fix for the aliasing problem. You can just write:

df.a .= df.b


This stores a copy of :b in column :a.
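Here is a minimal sketch of the anti-pattern and the fix side by side, on a made-up three-row data frame and assuming the versions used in this post. The final line with copy is an alternative I am adding for comparison, not something discussed above.

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3], b=[4, 5, 6])

df.a = df.b                    # anti-pattern: :a and :b are now one vector
aliased = df.a === df.b
df.a[1] = 100                  # a write to :a ...
leaked = df.b[1]               # ... shows up in :b as well

df.a .= df.b                   # fix: :a becomes a freshly allocated copy
separated = df.a !== df.b
df.a[2] = -1                   # a write to :a ...
b_after = df.b[2]              # ... no longer reaches :b

df.a = copy(df.b)              # alternative: copy explicitly, then assign
also_separated = df.a !== df.b
```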

I hope the examples I gave in my post today will be useful for your work with DataFrames.jl.