Introduction

In DataFrames.jl 1.4 release we reached the target state of data frame indexing that we aimed for when we designed 1.0 release.

In this post I want to present you what mental model you should have when thinking about indexing into a data frame. The reason why learning this is important is that we needed to extend standard indexing rules that are defined for arrays in Julia to cover all scenarios that users need when working with data frames.

I will focus on working with a single column of a data frame as this is the most common indexing scenario.

The code was tested under Julia 1.8.2 and DataFrames.jl 1.4.2.

Rule 1: indexing always requires passing both row and column selector

When you index into a data frame you must pass exactly two dimensions: a row selector and a column selector like df[row_selector, column_selector]. When indexing you can think of a data frame as of a matrix.

Rule 2: all indexing that works on matrices works the same way for data frames

The benefit of the rule that data frame follows matrix indexing is that it is easy to remember. If you know how matrix indexing works then all translates directly to a data frame.

Here are some examples of extracting a column or its fragment from a data frame.

julia> using DataFrames

julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

julia> df[1, 1]
1

julia> df[1:2, 1]
2-element Vector{Int64}:
 1
 2

julia> df[:, 2]
3-element Vector{Int64}:
 4
 5
 6

(As I commented in the Introduction in this post I concentrate on getting a single column from a data frame so the second index was always an integer.)

One important rule of Julia Base indexing is that when you get a column or its part from a matrix a copy is made, except if you extract a single element. This is exactly what happens for data frame: df[:, 2] is a copy of a second column stored in df. You can check it by writing e.g.:

julia> df[:, 2] === eachcol(df)[2]
false

julia> df[:, 2] == eachcol(df)[2]
true

We see that we get the same data, but not the same object.

If you instead want a view of a column without copying the data, you can use @view exactly like you would do it with matrices:

julia> @view df[1, 1]
0-dimensional view(::Vector{Int64}, 1) with eltype Int64:
1

julia> @view df[1:2, 1]
2-element view(::Vector{Int64}, 1:2) with eltype Int64:
 1
 2

julia> @view df[:, 2]
3-element view(::Vector{Int64}, :) with eltype Int64:
 4
 5
 6

As you can see, again all worked as if df were a matrix.

The same rules that worked for getting data from a data frame work for setting data in a data frame. You have two options here: normal assignment and broadcasted assignment.

julia> df[1:2, 1] = [11, 12]
2-element Vector{Int64}:
 11
 12

julia> df[:, 2] .= 100
3-element view(::Vector{Int64}, :) with eltype Int64:
 100
 100
 100

julia> df
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │    11    100
   2 │    12    100
   3 │     3    100

These operations, again, work exactly like for matrices. In particular they are in-place, that is no memory is allocated when performing them. The data is written into already allocated column. This rule is important as it means that by such assignment you cannot change the element type of a column:

julia> df[:, 1] = ["a", "b", "c"]
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64

julia> df[:, 2] .= "x"
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64

In summary, standard array indexing works the same way for matrices and for data frames.

In what follows I describe the extensions to the indexing rules that are DataFrames.jl specific.

Rule 3: you can use strings or symbols to pass column names

The first extension is related to the fact that standard matrices have to be indexed by integers. In DataFrames.jl you can alternatively use string or symbol to select a column when indexing. Here are some basic examples:

julia> df[1:2, "a"]
2-element Vector{Int64}:
 11
 12

julia> df[:, :b] .= 1000
3-element view(::Vector{Int64}, :) with eltype Int64:
 1000
 1000
 1000

This is intuitive so far. However, this rule leads to one extension. It is related to the fact that you can pass a column name that does not exist yet in a data frame. In this case if you pass : as a column selector a new column in a data frame will be allocated (that is copy of the source data will be performed) and the new column will be added at the end of the data frame:

julia> df[:, "d"] = [-1, -2, -3]
3-element Vector{Int64}:
 -1
 -2
 -3

julia> df
3×3 DataFrame
 Row │ a      b      d
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │    11   1000     -1
   2 │    12   1000     -2
   3 │     3   1000     -3

This rule has two additional special cases. The first is when a data frame has no columns yet. In such a situation you can add any vector to a data frame:

julia> df2 = DataFrame()
0×0 DataFrame

julia> df2[:, :x] = [1, 2]
2-element Vector{Int64}:
 1
 2

julia> df2
2×1 DataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │     1
   2 │     2

What is non-standard with this rule? We allow changing the number of rows in a data frame from zero to whatever is needed. For a data frame having some columns changing their number of rows is not allowed with indexing.

The second special case is when you try to create a new column in a view of a data frame. It is not allowed in general, but if the view was created with : as a column selector we accept it (the reason is that in this case view subsets only rows, but does not change columns; this is a common use case in practice). Here is an example:

julia> df
3×3 DataFrame
 Row │ a      b      d
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │    11   1000     -1
   2 │    12   1000     -2
   3 │     3   1000     -3

julia> dfv = @view df[[3, 1], :]
2×3 SubDataFrame
 Row │ a      b      d
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     3   1000     -3
   2 │    11   1000     -1

julia> dfv[:, :e] = [1, 2]
2-element Vector{Int64}:
 1
 2

julia> dfv
2×4 SubDataFrame
 Row │ a      b      d      e
     │ Int64  Int64  Int64  Int64?
─────┼─────────────────────────────
   1 │     3   1000     -3       1
   2 │    11   1000     -1       2

julia> df
3×4 DataFrame
 Row │ a      b      d      e
     │ Int64  Int64  Int64  Int64?
─────┼──────────────────────────────
   1 │    11   1000     -1        2
   2 │    12   1000     -2  missing
   3 │     3   1000     -3        1

As you can see in this case a new column gets missing for rows that are not present in the dfv data frame.

Rule 4: You can use Not to negate row selection

This is a pretty simple rule. Instead of selecting rows, you can use Not to say which rows you want to drop when indexing:

julia> df
3×4 DataFrame
 Row │ a      b      d      e
     │ Int64  Int64  Int64  Int64?
─────┼──────────────────────────────
   1 │    11   1000     -1        2
   2 │    12   1000     -2  missing
   3 │     3   1000     -3        1

julia> df[Not(2), :d]
2-element Vector{Int64}:
 -1
 -3

Rule 5: There is a special ! row selector

The special ! row selector selects all rows similarly to :, but it has a different behavior in relation to column allocation.

When extracting the data from a data frame if you use ! no copying of data is made. Instead, you just get a column as it is stored in the source data frame (or an appropriate view if you work with a view of a data frame):

julia> df[!, 1]
3-element Vector{Int64}:
 11
 12
  3

julia> df[!, 1] === eachcol(df)[1]
true

julia> dfv[!, 1]
2-element view(::Vector{Int64}, [3, 1]) with eltype Int64:
  3
 11

The reason why ! is allowed for when extracting a column from a data frame is performance. In general it is not safe to use !. You should prefer : as copying data (in R like style) is safer. However, in some cases, when your operations are memory-bound or you need to maximize performance, you have ! at hand to avoid unnecessary operations.

For writing data into a data frame if you use ! selector it has also a different behavior than :. Recall, that : is always in-place. Instead ! stores a fresh column in a data frame.

julia> df
3×4 DataFrame
 Row │ a      b      d      e
     │ Int64  Int64  Int64  Int64?
─────┼──────────────────────────────
   1 │    11   1000     -1        2
   2 │    12   1000     -2  missing
   3 │     3   1000     -3        1

julia> df[!, "a"] = ["a", "b", "c"]
3-element Vector{String}:
 "a"
 "b"
 "c"

julia> dfv[!, "e"] .= "x"
2-element Vector{String}:
 "x"
 "x"

julia> df
3×4 DataFrame
 Row │ a       b      d      e
     │ String  Int64  Int64  Any
─────┼───────────────────────────────
   1 │ a        1000     -1  x
   2 │ b        1000     -2  missing
   3 │ c        1000     -3  x

How the fresh column is stored depends on the type of data frame you use and the type of operation:

  • if you use standard assignment on a data frame (df[!, "a"] = ["a", "b", "c"]) then source vector is not copied, but is stored as-is (so this kind of storage is the fastest way to store a column in a data frame).
  • if you use broadcasted assignment (df[!, "a"] .= ["a", "b", "c"]) or assign into a view of a data frame (dfv[!, "a"] = ["x", "x"]) then a new column is freshly allocated (this kind of behavior is needed if you want to change element type of the column already present in a data frame, but want a copy of the source vector to be made for safety reasons).

So for assignment with ! the basic rule to remember is that it replaces the existing column in a data frame. Then the additional rule is that normal assignment on a data frame is non-copying, and broadcasted assignment or using a view of a data frame allocates a copy.

These rules were designed to cover all possible scenarios that users might need when working with columns of a data frame.

Rule 6: property access works the same as if you used ! as row selector

With this rule we are getting to the point that changed in DataFrames.jl 1.4 (but was planned and announced much earlier).

When you write df.a it is always treated the same as df[!, "a"], for extracting a column from a data frame, for assignment, and for broadcasted assignment.

Since DataFrames.jl 1.4 this is all you need to remember about property access to a data frame. We decided on this rule as it is easy to remember and it does not add any new concepts or exceptions for users to learn.

However, unfortunately, this is the place where we had to deviate from how property access works in Base Julia. Let me explain the issue step by step.

If you write df.a = ["a", "b", "c"] then you expect that column a in df data frame is replaced by ["a", "b", "c"]. And this is what happens. Recall that this is what df[!, "a"] = ["a", "b", "c"] also does.

Now this means that, if we want df.a .= ["a", "b", "c"] to work the same as df[!, "a"] .= ["a", "b", "c"], then both operations replace column a in df.

And here we have a slight inconsistency, as in Base Julia users would expect that df.a .= ["a", "b", "c"] would work in-place, like df[:, "a"] .= ["a", "b", "c"]. This is not the case.

Therefore you need to remember that in DataFrames.jl if you use property access syntax for setting data it always replaces the colum: both when doing assignment and broadcasted assignment.

The reason why we decided on this design is twofold. First is learning. We have a simple rule: ! selector and property access work the same way always. The second reason is convenience. Most users preferred the following operation:

julia> df3 = DataFrame(c=1:3)
3×1 DataFrame
 Row │ c
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> df3.c .= 'x'
3-element Vector{Char}:
 'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
 'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
 'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)

julia> df3
3×1 DataFrame
 Row │ c
     │ Char
─────┼──────
   1 │ x
   2 │ x
   3 │ x

to replace column c with a vector of 'x' values. Note that instead if you use : as row selector you get a bit surprising result:

julia> df3 = DataFrame(c=1:3)
3×1 DataFrame
 Row │ c
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> df3[:, "c"] .= 'x'
3-element view(::Vector{Int64}, :) with eltype Int64:
 120
 120
 120

julia> df3
3×1 DataFrame
 Row │ c
     │ Int64
─────┼───────
   1 │   120
   2 │   120
   3 │   120

The reason of this behavior is that Char is silently converted to Int:

julia> convert(Int, 'x')
120

What is the general benefit of the behavior we adopted? When you write df[!, "a"] .= 'x' or df.a .= 'x' you are sure that as a result you will have a freshly allocated column "a" in df containing 'x' values as all its elements, independent of the fact if column "a" was already present in the data frame or not. So this is like git push --force operation. It is guaranteed to succeed no matter what the starting state of the data frame was (of course assuming that the object on the right hand side supports broadcasting and has proper dimensions).

Why has it taken us so long to reach this state?

The reason why we landed with these rules only in DataFrames.jl 1.4 is that before Julia 1.7 it was not possible to support the behavior we wanted. Therefore we had to wait till Julia 1.7 to be released to achieve consistency between property access and ! row selector behavior in DataFrames.jl in all cases.

So people porting code from DataFrames.jl earlier than 1.4 or from Julia earlier than 1.7 will notice a change that previously df.col .= value was in-place, and currently it allocates a new column. We have considered the risks that this change will be problematic for users, and the conclusion was that:

  • In a vast majority of code this change will not affect the result and users will not even notice this change.
  • It might affect the code that is performance critical (as an extra copy is now made). However, if this is the case it is easy to fix the issue (and it will not cause the code to be broken).
  • It might affect the code that relied on the fact that previously a conversion during in-place assignment was made (and e.g. relied on the fact that such assignment does not change element type of the column). In this case such code would indeed be broken. In this situation the conclusion was that such cases, although possible in practice, are most likely quite rare and if present would affect only experienced Julia users who have a good grasp of conversion upon assignment rules in Base Julia. We concluded that such users would be able to identify the problematic cases and change df.col .= value to df[:, :col] .= value in their code to get back the old behavior.

Conclusions

I hope you found this post useful in building your understanding how the details of indexing in DataFrames.jl work and what are their intended use cases.

Additionally, wanted to present you the mental process we went through when we were making hard design decisions in DataFrames.jl development team. In this process we had to balance three things:

  • user convenience (especially taking into account the fact that target audience of DataFrames.jl are data scientists, who sometimes are not computer science experts);
  • internal (within DataFrames.jl) consistency of rules that one needs to learn;
  • minimization of cases when we diverge from what Base Julia defines for similar objects (the challenge was that data frame is neither a matrix nor a struct, and has different design requirements, but we borrow the syntax from both).

In this post I have not covered all the details of DataFrames.jl indexing rules. If you want to learn about the indexing behavior every supported scenario please check the Indexing section of the DataFrames.jl manual.