Introduction

I recently see that DataFrames.jl use ! as a row selector for a data frame a lot.

Over a year ago, when we have taken data frames indexing seriously, there was a very big debate if ! should be allowed in expressions like df[!, :a] to get an :a column without copying. The conclusion was that we need to have it, but our intention was that it would be reserved for advanced uses only, while in normal circumstances a user would not need to even know that it exists.

In this post let me review the use-cases of ! and comment on its alternatives.

This post was written under Julia 1.5.3 and DataFrames 0.22.4.

First we set up the environment:

julia> using DataFrames

julia> df = DataFrame(col1=1:3, col2='a':'c')
3×2 DataFrame
 Row │ col1   col2
     │ Int64  Char
─────┼─────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c

Reading a single column from a data frame

If you want to get a single column :col1 from a data frame df you have the following options:

  • df[!, :col1], df[!, "col1"], df.col1, and df."col1": get you the column without copying;
  • df[:, :col1] and df[:, "col1"]: gets you a copy of the column.

As you see to get a single column without copying it is usually much easier to rwiere df.col1 than e.g. df[!, :col1] and the operation has exactly the same result.

The only case when df[!, :col1] is more convenient is when you have a column name stored in a variable. Then the following are equivalent:

julia> v = :col1
:col1

julia> df[!, v]
3-element Array{Int64,1}:
 1
 2
 3

julia> getproperty(df, v)
3-element Array{Int64,1}:
 1
 2
 3

and indeed using ! is a big more convenient in this case, as you cannot pass variable v to an expression like df.col1.

Reading multiple columns from a data frame

If you want to get a two columns [:col1, :col2] from a data frame df you have the following options (I am leaving out the sting version and other column selectors we support for simplicity):

  • df[!, [:col1, :col2]] and select(df, [:col1, :col2], copycols=false): creates you a new data frame (a fresh wrapper object is allocated) but the columns of the new data frame are taken from df;
  • df[:, [:col1, :col2]] and select(df, [:col1, :col2]): gets you a new data frame with columns copied.

Note that for multiple column selection you can alternatively use the select function. The difference between select and indexing is that select returns a data frame even if a single column is selected, e.g. like this:

julia> select(df, 1)
3×1 DataFrame
 Row │ col1
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

while as we have explained above we have:

julia> df[!, 1]
3-element Array{Int64,1}:
 1
 2
 3

Note that as in the df[!, [:col1, :col2]] syntax copying of columns is not done this operation is generally not recommended. Using such a data frame often leads to very hard-to-find bugs as when you modify contents of the columns of the newly created data frame also the source is mutated.

Making a view of a data frame

In this case we have:

julia> view(df, !, :col1)
3-element view(::Array{Int64,1}, :) with eltype Int64:
 1
 2
 3

julia> view(df, !, [:col1, :col2])
3×2 SubDataFrame
 Row │ col1   col2
     │ Int64  Char
─────┼─────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c

and the views are exactly the same as if we used view(df, :, :col1) and view(df, :, [:col1, :col2]) respectively.

In this case ! is supported mainly to allow an easy annotation of whole expressions using data frame indexing with @views, e.g. imagine you have the following code:

julia> x = [1, 2, 3, 4]
4-element Array{Int64,1}:
 1
 2
 3
 4

julia> df[!, 1] + x[1:3]
3-element Array{Int64,1}:
 2
 4
 6

and in order to avoid copying x you want to annotate the whole expression with @views. Thanks to the fact that ! is supported with view you can just write:

julia> @views df[!, 1] + x[1:3]
3-element Array{Int64,1}:
 2
 4
 6

Assigning to a single column

The difference between df[!, :co11] = [11, 12, 13] and df[:, :col1] = [11, 12, 13] is that using ! puts a new column passed on the right hand side to the data frame without copying it (no matter if the column exists or not in the data frame), while : assigns to an existing column in-place.

Therefore df[!, :co11] = [11, 12, 13] is equivalent to df.col1 = [11, 12, 13]. On the other hand df[:, :co11] = [11, 12, 13] is equivalent to df.col1[:] = [11, 12, 13], if the column :col1 is present in the data frame.

Here is an example:

julia> v = [11, 13, 13]
3-element Array{Int64,1}:
 11
 13
 13

julia> df2 = copy(df)
3×2 DataFrame
 Row │ col1   col2
     │ Int64  Char
─────┼─────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c

julia> col1 = df2.col1
3-element Array{Int64,1}:
 1
 2
 3

julia> df2[!, :col1] = v
3-element Array{Int64,1}:
 11
 13
 13

julia> col1
3-element Array{Int64,1}:
 1
 2
 3

julia> df2.col1 === v
true

vs.

julia> df2 = copy(df)
3×2 DataFrame
 Row │ col1   col2
     │ Int64  Char
─────┼─────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c

julia> col1 = df2.col1
3-element Array{Int64,1}:
 1
 2
 3

julia> df2[:, :col1] = v
3-element Array{Int64,1}:
 11
 13
 13

julia> col1
3-element Array{Int64,1}:
 11
 13
 13

julia> df2.col1 === v
false

You might have noticed that when I described : I have added a condition that it is equivalen to getproperty syntax only when the column is present in the data frame. The reason is that if column is not present in a data frame then we have:

julia> df
3×2 DataFrame
 Row │ col1   col2
     │ Int64  Char
─────┼─────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c

julia> df[:, :newcol] = v
3-element Array{Int64,1}:
 11
 12
 13

julia> df
3×3 DataFrame
 Row │ col1   col2  newcol
     │ Int64  Char  Int64
─────┼─────────────────────
   1 │     1  a         11
   2 │     2  b         12
   3 │     3  c         13

julia> df.newcol === v
false

So instead of an in-place operation (which is not possible as the column is not present in the data frame), we get a copy operation.

On the other hand:

julia> df.newcol2[:] = v
ERROR: ArgumentError: column name :newcol2 not found in the data frame; existing most similar names are: :newcol

just fails as there is no column to index into.

The other special case is SubDataFrame, where using ! for assignment is not allowed, just like for getproperty syntax:

julia> dfv = view(df, :, :)
3×3 SubDataFrame
 Row │ col1   col2  newcol
     │ Int64  Char  Int64
─────┼─────────────────────
   1 │     1  a         11
   2 │     2  b         12
   3 │     3  c         13

julia> dfv[!, :col1] = 1:3
ERROR: ArgumentError: setting index of SubDataFrame using ! as row selector is not allowed

julia> dfv.col1 = 1:3
ERROR: ArgumentError: Replacing or adding of columns of a SubDataFrame is not allowed. Instead use `df[:, col_ind] = v` or `df[:, col_ind] .= v` to perform an in-place assignment.

There is one exception to the rule that ! replaces column in a data frame without copying. This is the case, when you want o assign a range to a data frame column. In this situation materialization of range always happens as there is a general rule that we do not allow storing ranges in a data frame as they are not mutable, which is something that users usually do not like (using a range is a common operation to add an identifier column to a data frame). Here is an example:

julia> df2 = copy(df)
3×2 DataFrame
 Row │ col1   col2
     │ Int64  Char
─────┼─────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c

julia> ids = axes(df2, 1)
Base.OneTo(3)

julia> df2[!, :id] = ids
Base.OneTo(3)

julia> df2.id
3-element Array{Int64,1}:
 1
 2
 3

As you can see idxs range was materialized to a Vector{Int}.

Assigning to multiple columns

This case is a bit simpler than assigning to a single column case above. The reason is that we do not allow to create new columns when multiple columns are selected. Therefore the rule is: df[!, [:col1, :col2]] = new_values replaces columns :col1 and :col2 in df, while df[:, [:col1, :col2]] = new_values updates them in-place.

Note that new_values must be either a data frame or a matrix, and for ! the columns in df will be always freshly allocated.

Broadcasting assignment to a single column

This is the point where a bit of complexity is introduced, as now getproperty syntax (i.e. df.col) behaves similarly to : indexing and not to ! indexig.

The rules are the following:

  • df[!, :col] .= v allocates a new column and replaces the old one or if :col is not present in df allocates and adds it;
  • df[:, :col] .= v updates the column in-place or allocates or if :col is not present in df allocates adds it;
  • df.col .= v is only allowed if col is present in df and operates in-place.

Note that if :col is not present in df then using ! and : are equivalent.

Also note that in SubDataFrame it is not allowed to add new columns and ! syntax is not allowed.

Broadcasting assignment to multiple columns

Again this case is simpler than broadcasting assigning to a single column case above. The reason is that we do not allow to create new columns when multiple columns are selected. Therefore the rule is: df[!, [:col1, :col2]] .= new_values replaces columns :col1 and :col2 in df, while df[:, [:col1, :col2]] = new_values updates them in-place.

Summary of the cases

Wrapping up the cases we see that ! means the following:

  • in selection context: get me a column or a data frame without copying columns.
  • in views: make me a view (the same as : row selector);
  • in assignment to a single column: replace or add the column to a data frame without copying;
  • in assignment to a multiple columns: replace the colums in a data frame with copying;
  • in broadcasting assignment: allocate a new column and store it (and in the case of a single column selector optionally add it if it is missing);

And : means the following:

  • in selection context: get me a column or data frame with copying of columns.
  • in views: make me a view (the same as : row selector);
  • in assignment to a single column: change the column in-place or add the column to a data frame with copying;
  • in assignment to a multiple columns: change the colums in-place in a data frame;
  • in broadcasting assignment: perform in-place update of columns (and in the case of a single column selector optionally allocate and add it if it is missing);

Finally getproperty (the df.col style) means the following:

  • in selection context: get me a column without copying.
  • in assignment: replace or add the column to a data frame without copying;
  • in broadcasting assignment: update an existing column in-place.

In short (simplifying a bit):

  • ! gets you columns without copying and when setting columns it replaces them;
  • : gets you columns with copying and when setting columns it does this in-place;
  • getproperty gets you columns without copying and setting columns it replaces them, except for broadcasting assignment, when it updates them in-place.

From a practical perspective the major difference between in-place and replace operations is that replacing columns is needed if new values have a different type than the old ones.

For instance here ! works and : fails:

julia> df
3×2 DataFrame
 Row │ col1   col2
     │ Int64  Char
─────┼─────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c

julia> df[:, :col1] .= "a"
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64

julia> df[!, :col1] .= "a"
3-element Array{String,1}:
 "a"
 "a"
 "a"

julia> df
3×2 DataFrame
 Row │ col1    col2
     │ String  Char
─────┼──────────────
   1 │ a       a
   2 │ a       b
   3 │ a       c

Another practical limitation is that broadcasting assignment like df.col .= v is not allowed when :col is not present in a data frame (there is a chance that in the future it will be allowed, see here).

Conclusions

As you can see there are cases when ! row selector is needed to cover all potential use-cases. However, most common operations are done on a single column and in this case:

  • for getting a column or assigning to a column instead of df[!, :col] and df[!, :col] = v it is usually better to just write df.col and df.col = v respectively as it is the same and simpler to type and read;
  • currently the case where ! is really needed is broacasting assignment context where df[!, :col] .= v is the only relatively nice way to freshly allocate a column with v broadcasted into it (but when I look at the codes of DataFrames.jl users this pattern is used much less frequently than we expected when we designed the ecosystem).

Also it is useful to keep the following mental model of possible operations on columns of a data frame (again simplifying a bit by leaving out the corner cases):

  • when you get a column from a data frame you either can: 1) get it with copying, which is achieved with :, or 2) get it as-is (no copying, this operation is achieved with !);
  • when you set a column of a data frame you can either do: 1) an in-place update (which is done with :), or 2) replace a column without copying the right hand side (which is done with ! and = assignment), or 3) replace a column with copying the right hand side (which is done with ! and .= assignment).

I hope this post was helpful. If you are interested in a definitive specification of all the indexing rules in DataFrames.jl you can find them here.