On the bang row selector in DataFrames.jl
Introduction
I have recently noticed that DataFrames.jl users use ! as a row selector for a
data frame a lot.
Over a year ago, when we took data frame indexing seriously, there was a
very big debate about whether ! should be allowed in expressions like df[!, :a]
to get the :a column without copying. The conclusion was that we need to have
it, but our intention was that it would be reserved for advanced uses only,
while in normal circumstances a user would not even need to know that it exists.
In this post let me review the use-cases of ! and comment on its alternatives.
This post was written under Julia 1.5.3 and DataFrames 0.22.4.
First we set up the environment:
julia> using DataFrames
julia> df = DataFrame(col1=1:3, col2='a':'c')
3×2 DataFrame
Row │ col1 col2
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
Reading a single column from a data frame
If you want to get a single column :col1 from a data frame df you have the
following options:
- df[!, :col1], df[!, "col1"], df.col1, and df."col1": get you the column without copying;
- df[:, :col1] and df[:, "col1"]: get you a copy of the column.
As you can see, to get a single column without copying it is usually much easier
to write df.col1 than e.g. df[!, :col1], and the operation has exactly the same
result.
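A quick way to convince yourself of the above is to compare object identity with ===. This is only a minimal sketch on the same example data frame (the names same_object, copied and is_copy are mine); it rebuilds df so that it can be run on its own:

```julia
using DataFrames

df = DataFrame(col1=1:3, col2='a':'c')

# the non-copying forms all return the very same vector object
same_object = df[!, :col1] === df.col1 && df[!, "col1"] === df.col1

# the copying form returns a new vector with equal contents
copied = df[:, :col1]
is_copy = copied !== df.col1 && copied == df.col1
```

Both same_object and is_copy should be true here.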
The only case when df[!, :col1] is more convenient is when you have a column
name stored in a variable. Then the following are equivalent:
julia> v = :col1
:col1
julia> df[!, v]
3-element Array{Int64,1}:
1
2
3
julia> getproperty(df, v)
3-element Array{Int64,1}:
1
2
3
and indeed using ! is a bit more convenient in this case, as you cannot pass
the variable v to an expression like df.col1.
Reading multiple columns from a data frame
If you want to get two columns [:col1, :col2] from a data frame df you
have the following options (I am leaving out the string version and other column
selectors we support for simplicity):
- df[!, [:col1, :col2]] and select(df, [:col1, :col2], copycols=false): create a new data frame (a fresh wrapper object is allocated), but the columns of the new data frame are taken from df;
- df[:, [:col1, :col2]] and select(df, [:col1, :col2]): get you a new data frame with columns copied.
Note that for multiple column selection you can alternatively use the select
function. The difference between select and indexing is that select returns
a data frame even if a single column is selected, e.g. like this:
julia> select(df, 1)
3×1 DataFrame
Row │ col1
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
while as we have explained above we have:
julia> df[!, 1]
3-element Array{Int64,1}:
1
2
3
Note that the df[!, [:col1, :col2]] syntax does not copy columns, so this
operation is generally not recommended. Using such a data frame often leads to
very hard-to-find bugs, because when you modify the contents of the columns of
the newly created data frame the source data frame is mutated as well.
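To make the risk concrete, here is a small sketch (the names shared, copied and selected are just for illustration) showing how a write through the non-copying result leaks into the source:

```julia
using DataFrames

df = DataFrame(col1=1:3, col2='a':'c')

shared = df[!, [:col1, :col2]]  # new wrapper, columns aliased with df
copied = df[:, [:col1, :col2]]  # new wrapper, columns copied
selected = select(df, [:col1, :col2], copycols=false)  # aliased as well

shared.col1[1] = 100            # mutate through the new data frame

df.col1[1]        # 100, the source was changed as well
selected.col1[1]  # 100, it shares the same column
copied.col1[1]    # 1, the copy is unaffected
```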
Making a view of a data frame
In this case we have:
julia> view(df, !, :col1)
3-element view(::Array{Int64,1}, :) with eltype Int64:
1
2
3
julia> view(df, !, [:col1, :col2])
3×2 SubDataFrame
Row │ col1 col2
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
and the views are exactly the same as if we used view(df, :, :col1) and
view(df, :, [:col1, :col2]) respectively.
In this case ! is supported mainly to allow an easy annotation of whole
expressions using data frame indexing with @views, e.g. imagine you have
the following code:
julia> x = [1, 2, 3, 4]
4-element Array{Int64,1}:
1
2
3
4
julia> df[!, 1] + x[1:3]
3-element Array{Int64,1}:
2
4
6
and in order to avoid copying x you want to annotate the whole expression with
@views. Thanks to the fact that ! is supported with view you can just write:
julia> @views df[!, 1] + x[1:3]
3-element Array{Int64,1}:
2
4
6
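If you want to check that both row selectors produce the same kind of view, one option is to look at the parent of the view. A minimal sketch, again on the example data frame (v_bang and v_colon are illustrative names):

```julia
using DataFrames

df = DataFrame(col1=1:3, col2='a':'c')

v_bang = view(df, !, :col1)
v_colon = view(df, :, :col1)

# both views wrap the column stored in df, so no data is copied
same_parent = parent(v_bang) === df.col1 && parent(v_colon) === df.col1

# writing through the view updates the data frame
v_bang[1] = 100
df.col1[1]  # 100
```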
Assigning to a single column
The difference between df[!, :col1] = [11, 12, 13] and df[:, :col1] = [11, 12, 13]
is that using ! puts the new column passed on the right hand side into the data
frame without copying it (no matter if the column exists or not in the data
frame), while : assigns to an existing column in-place.
Therefore df[!, :col1] = [11, 12, 13] is equivalent to df.col1 = [11, 12, 13].
On the other hand df[:, :col1] = [11, 12, 13] is equivalent to
df.col1[:] = [11, 12, 13], if the column :col1 is present in the data frame.
Here is an example:
julia> v = [11, 12, 13]
3-element Array{Int64,1}:
11
12
13
julia> df2 = copy(df)
3×2 DataFrame
Row │ col1 col2
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
julia> col1 = df2.col1
3-element Array{Int64,1}:
1
2
3
julia> df2[!, :col1] = v
3-element Array{Int64,1}:
11
12
13
julia> col1
3-element Array{Int64,1}:
1
2
3
julia> df2.col1 === v
true
vs.
julia> df2 = copy(df)
3×2 DataFrame
Row │ col1 col2
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
julia> col1 = df2.col1
3-element Array{Int64,1}:
1
2
3
julia> df2[:, :col1] = v
3-element Array{Int64,1}:
11
12
13
julia> col1
3-element Array{Int64,1}:
11
12
13
julia> df2.col1 === v
false
You might have noticed that when I described : I added the condition that
it is equivalent to the getproperty syntax only when the column is present in
the data frame. The reason is that if the column is not present in the data
frame then we have:
julia> df
3×2 DataFrame
Row │ col1 col2
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
julia> df[:, :newcol] = v
3-element Array{Int64,1}:
11
12
13
julia> df
3×3 DataFrame
Row │ col1 col2 newcol
│ Int64 Char Int64
─────┼─────────────────────
1 │ 1 a 11
2 │ 2 b 12
3 │ 3 c 13
julia> df.newcol === v
false
So instead of an in-place operation (which is not possible as the column is not present in the data frame), we get a copy operation.
On the other hand:
julia> df.newcol2[:] = v
ERROR: ArgumentError: column name :newcol2 not found in the data frame; existing most similar names are: :newcol
just fails as there is no column to index into.
The other special case is SubDataFrame, where using ! for assignment is not
allowed, just like for the getproperty syntax:
julia> dfv = view(df, :, :)
3×3 SubDataFrame
Row │ col1 col2 newcol
│ Int64 Char Int64
─────┼─────────────────────
1 │ 1 a 11
2 │ 2 b 12
3 │ 3 c 13
julia> dfv[!, :col1] = 1:3
ERROR: ArgumentError: setting index of SubDataFrame using ! as row selector is not allowed
julia> dfv.col1 = 1:3
ERROR: ArgumentError: Replacing or adding of columns of a SubDataFrame is not allowed. Instead use `df[:, col_ind] = v` or `df[:, col_ind] .= v` to perform an in-place assignment.
There is one exception to the rule that ! replaces a column in a data frame
without copying. This is the case when you want to assign a range to a data
frame column. In this situation the range is always materialized, as there
is a general rule that we do not allow storing ranges in a data frame: they
are not mutable, which is something that users usually do not like (using a
range is a common way to add an identifier column to a data frame). Here
is an example:
julia> df2 = copy(df)
3×2 DataFrame
Row │ col1 col2
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
julia> ids = axes(df2, 1)
Base.OneTo(3)
julia> df2[!, :id] = ids
Base.OneTo(3)
julia> df2.id
3-element Array{Int64,1}:
1
2
3
As you can see the ids range was materialized to a Vector{Int}.
Assigning to multiple columns
This case is a bit simpler than the single column assignment case above. The
reason is that we do not allow creating new columns when multiple columns are
selected. Therefore the rule is: df[!, [:col1, :col2]] = new_values replaces
columns :col1 and :col2 in df, while df[:, [:col1, :col2]] = new_values
updates them in-place.
Note that new_values must be either a data frame or a matrix, and for ! the
columns in df will always be freshly allocated.
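The rule can be checked with a short sketch like the one below. The data frame new_values and the variable old are names I made up for the illustration; note that when the right hand side is a data frame its column names have to match the selected columns:

```julia
using DataFrames

df = DataFrame(col1=1:3, col2='a':'c')
new_values = DataFrame(col1=[10, 20, 30], col2=['x', 'y', 'z'])
old = df.col1

# `:` writes into the vectors already stored in df
df[:, [:col1, :col2]] = new_values
updated_in_place = df.col1 === old && old == [10, 20, 30]

# `!` replaces the columns, and the stored columns are fresh copies
df[!, [:col1, :col2]] = new_values
replaced = df.col1 !== old
freshly_allocated = df.col1 !== new_values.col1 && df.col1 == new_values.col1
```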
Broadcasting assignment to a single column
This is the point where a bit of complexity is introduced, as now the
getproperty syntax (i.e. df.col) behaves similarly to : indexing and not to !
indexing.
The rules are the following:
- df[!, :col] .= v allocates a new column and replaces the old one, or if :col is not present in df allocates and adds it;
- df[:, :col] .= v updates the column in-place, or if :col is not present in df allocates and adds it;
- df.col .= v is only allowed if col is present in df and operates in-place.
Note that if :col is not present in df then using ! and : is equivalent.
Also note that in a SubDataFrame it is not allowed to add new columns and the !
syntax is not allowed.
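Here is a minimal sketch of the ! and : rules for a DataFrame, using === to tell an in-place update from a replacement (old, inplace and replaced are illustrative names). I deliberately leave out df.col .= v, since as far as I know its behavior depends on the Julia and DataFrames.jl versions used:

```julia
using DataFrames

df = DataFrame(col=[1, 2, 3])
old = df.col

df[:, :col] .= 0          # in-place: the stored vector is reused
inplace = df.col === old && old == [0, 0, 0]

df[!, :col] .= "a"        # replacement: a new Vector{String} is allocated
replaced = df.col !== old && eltype(df.col) == String

# for a missing column both selectors allocate and add it
df[!, :new1] .= 1
df[:, :new2] .= 2
names(df)                 # ["col", "new1", "new2"]
```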
Broadcasting assignment to multiple columns
Again this case is simpler than the single column broadcasting assignment case above.
The reason is that we do not allow creating new columns when multiple columns are
selected. Therefore the rule is: df[!, [:col1, :col2]] .= new_values replaces
columns :col1 and :col2 in df, while df[:, [:col1, :col2]] .= new_values
updates them in-place.
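A short sketch of this rule, with a made-up two column data frame so that the element type change is easy to see (old1 is an illustrative name):

```julia
using DataFrames

df = DataFrame(col1=[1, 2, 3], col2=[4, 5, 6])
old1 = df.col1

# `:` broadcasts into the existing vectors
df[:, [:col1, :col2]] .= 0
inplace = df.col1 === old1 && old1 == [0, 0, 0]

# `!` allocates new columns, so the element type is free to change
df[!, [:col1, :col2]] .= 1.5
replaced = df.col1 !== old1 && eltype(df.col1) == Float64
```

The old vector old1 still holds the integer zeros after the second assignment, as it is no longer part of df.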
Summary of the cases
Wrapping up the cases we see that ! means the following:
- in selection context: get me a column or a data frame without copying columns;
- in views: make me a view (the same as the : row selector);
- in assignment to a single column: replace or add the column to a data frame without copying;
- in assignment to multiple columns: replace the columns in a data frame with copying;
- in broadcasting assignment: allocate a new column and store it (and in the case of a single column selector optionally add it if it is missing).
And : means the following:
- in selection context: get me a column or data frame with copying of columns;
- in views: make me a view (the same as the ! row selector);
- in assignment to a single column: change the column in-place or add the column to a data frame with copying;
- in assignment to multiple columns: change the columns in-place in a data frame;
- in broadcasting assignment: perform in-place update of columns (and in the case of a single column selector optionally allocate and add it if it is missing).
Finally getproperty (the df.col style) means the following:
- in selection context: get me a column without copying;
- in assignment: replace or add the column to a data frame without copying;
- in broadcasting assignment: update an existing column in-place.
In short (simplifying a bit):
- ! gets you columns without copying and when setting columns it replaces them;
- : gets you columns with copying and when setting columns it does this in-place;
- getproperty gets you columns without copying and when setting columns it replaces them, except for broadcasting assignment, when it updates them in-place.
From a practical perspective the major difference between in-place and replace operations is that replacing columns is needed if new values have a different type than the old ones.
For instance here ! works and : fails:
julia> df
3×2 DataFrame
Row │ col1 col2
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
julia> df[:, :col1] .= "a"
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64
julia> df[!, :col1] .= "a"
3-element Array{String,1}:
"a"
"a"
"a"
julia> df
3×2 DataFrame
Row │ col1 col2
│ String Char
─────┼──────────────
1 │ a a
2 │ a b
3 │ a c
Another practical limitation is that broadcasting assignment like df.col .= v
is not allowed when :col is not present in a data frame (there is a chance that
in the future it will be allowed, see here).
Conclusions
As you can see there are cases when the ! row selector is needed to cover all
potential use-cases. However, most common operations are done on a single
column and in this case:
- for getting a column or assigning to a column, instead of df[!, :col] and df[!, :col] = v it is usually better to just write df.col and df.col = v respectively, as it is the same and simpler to type and read;
- currently the case where ! is really needed is the broadcasting assignment context, where df[!, :col] .= v is the only relatively nice way to freshly allocate a column with v broadcasted into it (but when I look at the code of DataFrames.jl users this pattern is used much less frequently than we expected when we designed the ecosystem).
Also it is useful to keep the following mental model of possible operations on columns of a data frame (again simplifying a bit by leaving out the corner cases):
- when you get a column from a data frame you can either: 1) get it with copying, which is achieved with :, or 2) get it as-is (no copying; this operation is achieved with !);
- when you set a column of a data frame you can either do: 1) an in-place update (which is done with :), or 2) replace a column without copying the right hand side (which is done with ! and = assignment), or 3) replace a column with copying the right hand side (which is done with ! and .= assignment).
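The three ways of setting a column from this mental model can be told apart with ===. The following is only a sketch on a made-up one column data frame (original and v are illustrative names):

```julia
using DataFrames

df = DataFrame(col=[1, 2, 3])
original = df.col
v = [11, 12, 13]

# 1) in-place update: the vector stored in df stays the same object
df[:, :col] = v
in_place = df.col === original && original == v

# 2) replace without copying: df now stores v itself
df[!, :col] = v
no_copy = df.col === v

# 3) replace with copying: df stores a fresh vector equal to v
df[!, :col] .= v
with_copy = df.col !== v && df.col == v
```

All three flags should be true, which matches the three cases listed above.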
I hope this post was helpful. If you are interested in a definitive specification of all the indexing rules in DataFrames.jl you can find them here.