Updating views in DataFrames.jl
Introduction
Today I want to preview a feature that will be introduced in the 1.3 release of DataFrames.jl. We will talk about new ways of updating the columns of a data frame when working with views. My objective is to explain the rationale behind the new functionality and how it works.
This post was tested under Julia 1.6.1 and DataFrames.jl checked out at the main branch on Sep 17, 2021 (SHA-1 facb6721e7450c63f2d5684b78e3c3489ed999b0).
What is a `SubDataFrame` and when is it useful?
In DataFrames.jl you can construct views of a data frame object using the `view` function or the `@view` macro, exactly like you can create views of arrays in Julia Base. Here is a simple example:
julia> using DataFrames
(@v1.6) pkg> st DataFrames
Status `~/.julia/environments/v1.6/Project.toml`
[a93c6f00] DataFrames v1.2.2 `https://github.com/JuliaData/DataFrames.jl.git#main`
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> dfv = @view df[2:3, :]
2×2 SubDataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 2 5
2 │ 3 6
Now the `dfv` object is a view of the `df` data frame. This means that it references the same data in memory as the parent data frame `df`, but gives access to only a slice of it: in our case we have picked rows `2` and `3` and all columns.
The key features of a view are:
- mutating its contents also mutates the contents of the parent data frame;
- it is cheap to create, as it only needs to store a reference to the parent data frame and which rows and columns were selected;
- it is memory efficient (no copying of data happens);
- using it has a small computational overhead, because indexing a view requires translating its indices into indices of the parent data frame.
Let us show the first feature, as it is the most important from the functionality perspective:
julia> dfv[1, 1] = 100
100
julia> dfv
2×2 SubDataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 100 5
2 │ 3 6
julia> df
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 100 5
3 │ 3 6
julia> df[3, 1] = 200
200
julia> dfv
2×2 SubDataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 100 5
2 │ 200 6
julia> df
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 100 5
3 │ 200 6
As you can see, changing `dfv` also changes `df`, and vice versa: changing `df` also changes `dfv` (if the changed cells are selected in the view).
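The bookkeeping behind a view can be inspected with `parent` and `parentindices`. The snippet below is a minimal sketch (the variable names `rows`, `cols` and `dfc` are mine) showing that a `SubDataFrame` only remembers its parent and the selected rows and columns, and that `copy` materializes it into an independent `DataFrame`:

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6)
dfv = @view df[2:3, :]

# the view stores a reference to its parent, not a copy of the data
parent(dfv) === df            # true

# first element: selected rows of the parent; second element: selected columns
rows, cols = parentindices(dfv)

# copy materializes the view into a new DataFrame that no longer shares data
dfc = copy(dfv)
dfc[1, :a] = -1               # changes dfc only
df.a                          # the parent column is left untouched
```

If you need to mutate a slice without touching the parent, `copy` on the view is one way to get a detached data frame.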
To understand the performance implications, consider two simple implementations of a procedure computing a 90% confidence interval of the correlation between two variables using bootstrapping:
julia> using Statistics
julia> function bootcor1(df, c1, c2, n)
           cors = Float64[]
           for _ in 1:n
               tmp = df[rand(1:nrow(df), nrow(df)), :]
               push!(cors, cor(tmp[!, c1], tmp[!, c2]))
           end
           return quantile(cors, [0.05, 0.95])
       end
bootcor1 (generic function with 1 method)
julia> function bootcor2(df, c1, c2, n)
           cors = Float64[]
           for _ in 1:n
               tmp = @view df[rand(1:nrow(df), nrow(df)), :]
               push!(cors, cor(tmp[!, c1], tmp[!, c2]))
           end
           return quantile(cors, [0.05, 0.95])
       end
bootcor2 (generic function with 1 method)
(the functions could be further optimized for performance but I did not want to overly complicate the code)
The difference between `bootcor1` and `bootcor2` is that the former copies a data frame, while the latter uses a view. Both take four parameters:
- `df`: a data frame to analyze;
- `c1`, `c2`: identifiers of the columns whose correlation we want to compute;
- `n`: number of bootstrap samples.
Now create a simple data frame and compare the performance of both functions (I present timings after compilation):
julia> df = DataFrame(rand(10^5, 10), :auto);
julia> @time bootcor1(df, :x1, :x2, 10_000)
47.059650 seconds (430.02 k allocations: 81.976 GiB, 1.88% gc time)
2-element Vector{Float64}:
-0.007373812772086598
0.0029150608879804406
julia> @time bootcor2(df, :x1, :x2, 10_000)
11.239822 seconds (80.02 k allocations: 7.453 GiB, 0.92% gc time)
2-element Vector{Float64}:
-0.007643923412421664
0.002966538851599437
As you can see, because the data frame was wide (10 columns), we saved a lot of time by avoiding copying the data: the view-based version ran about four times faster and allocated roughly a tenth of the memory.
Of course, if the data frame were narrower we would not see such a difference:
julia> df = DataFrame(rand(10^5, 2), :auto);
julia> @time bootcor1(df, :x1, :x2, 10_000)
10.829548 seconds (190.02 k allocations: 22.363 GiB, 1.60% gc time)
2-element Vector{Float64}:
-0.006650139955186956
0.0038227359319118795
julia> @time bootcor2(df, :x1, :x2, 10_000)
10.963020 seconds (80.02 k allocations: 7.453 GiB, 0.53% gc time)
2-element Vector{Float64}:
-0.006575024146232311
0.0038253588364537162
The reason is that, although using a view still allocates less, this is offset by the computational overhead of working with views that was explained above.
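As noted earlier, these functions could be optimized further. One possible direction, sketched here under my own assumptions and not benchmarked for this post, is to extract the two column vectors once and take views of those vectors, reusing a single buffer for the sampled row indices. The name `bootcor3` is hypothetical:

```julia
using DataFrames, Statistics, Random

# bootcor3 is a hypothetical variant; it was not timed in this post
function bootcor3(df, c1, c2, n)
    x, y = df[!, c1], df[!, c2]          # column vectors, no copy
    idx = Vector{Int}(undef, nrow(df))   # reused buffer of sampled row indices
    cors = Vector{Float64}(undef, n)
    for i in 1:n
        rand!(idx, 1:nrow(df))           # resample rows with replacement
        cors[i] = cor(view(x, idx), view(y, idx))
    end
    return quantile(cors, [0.05, 0.95])
end

df = DataFrame(rand(10^3, 2), :auto)
ci = bootcor3(df, :x1, :x2, 100)
```

Whether this is actually faster than `bootcor2` depends on the data size and would need to be measured with `@time` as above.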
What is new for `SubDataFrame` in DataFrames.jl 1.3?
Let us start with our original small data frame:
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
Assume you wanted to assign the value `1.5` in the first row of column `:a`.
Before the upcoming DataFrames.jl 1.3 release this is quite cumbersome. If you try doing it you get:
julia> df[1, :a] = 1.5
ERROR: InexactError: Int64(1.5)
You need to do two steps:
- promote the element type of column `:a` to allow `Float64` values;
- perform the assignment.
Here is a way to do it:
julia> df.a = Vector{Float64}(df.a)
3-element Vector{Float64}:
1.0
2.0
3.0
julia> df[1, :a] = 1.5
1.5
julia> df
3×2 DataFrame
Row │ a b
│ Float64 Int64
─────┼────────────────
1 │ 1.5 4
2 │ 2.0 5
3 │ 3.0 6
Here is one more well-known example of a similar situation that sometimes surprises users:
julia> df[1, :b] = 'a'
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
julia> df
3×2 DataFrame
Row │ a b
│ Float64 Int64
─────┼────────────────
1 │ 1.5 97
2 │ 2.0 5
3 │ 3.0 6
In this case Julia silently converted the `Char` value `'a'` to its `Int` representation, which is `97`.
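Both behaviors come from Julia Base rather than from DataFrames.jl: storing a value in a typed vector goes through `convert` to the vector's element type. A minimal sketch using a plain `Vector{Int}` (the names `v` and `err` are mine):

```julia
v = [4, 5, 6]            # Vector{Int}

v[1] = 'a'               # Char is converted to Int, so 97 is stored
v

# a Float64 with a fractional part cannot be represented as an Int
err = try
    v[2] = 1.5
    nothing
catch e
    e
end
err isa InexactError     # true
```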
The key change in the 1.3 release of DataFrames.jl is that views will allow using `!` as the row index (currently this is disallowed). The mechanics of this functionality are the same as when `!` is used with `DataFrame` objects: a column gets replaced in the data frame.
A natural question is: what will it get replaced with? The question is valid, as we are assigning values for only a portion of the column. The design decision we took is that `promote_type` is used to decide the element type of the new column, combining the element type of the already present column with the element type of the newly assigned values.
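To get a feeling for what `promote_type` returns, here is a small sketch in plain Julia. The helper `widen_assign` is hypothetical and only illustrates the rule described above on ordinary vectors; it is not how DataFrames.jl implements the feature:

```julia
promote_type(Int, Float64)   # Float64
promote_type(Int, Char)      # Any
promote_type(Int, Missing)   # Union{Missing, Int}

# hypothetical helper: build a new column whose element type can hold
# both the old values and the newly assigned ones
function widen_assign(col::AbstractVector, rows, vals::AbstractVector)
    T = promote_type(eltype(col), eltype(vals))
    newcol = Vector{T}(col)   # copy of the old column with the wider eltype
    newcol[rows] = vals       # overwrite only the selected rows
    return newcol
end

widen_assign([1, 2, 3], 1:1, [1.5])   # element type Float64
widen_assign([4, 5, 6], 1:1, ['a'])   # element type Any
```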
Therefore in our examples above, when using a view you get the following:
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> dfv = @view df[1:1, :]
1×2 SubDataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
julia> dfv[!, :a] = [1.5]
1-element Vector{Float64}:
1.5
julia> dfv[!, :b] .= 'a'
1-element Vector{Char}:
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
julia> df
3×2 DataFrame
Row │ a b
│ Float64 Any
─────┼──────────────
1 │ 1.5 a
2 │ 2.0 5
3 │ 3.0 6
julia> dfv
1×2 SubDataFrame
Row │ a b
│ Float64 Any
─────┼──────────────
1 │ 1.5 a
As you can see, it works with both standard assignment and broadcasted assignment.
Admittedly, you still have to make two steps in the process:
- create a view;
- perform an assignment to it.
This is a bit cumbersome. Fortunately, we can expect that in the future DataFramesMeta.jl will provide a convenience syntax to perform conditional assignment using this feature, e.g. like in `data.table`, where you can write something like `df[x == 1, y := 2]` to set column `y` to `2` if column `x` is equal to `1`.
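Until such a syntax exists, the two-step pattern can be written by hand. The following sketch assumes DataFrames.jl 1.3 or later and uses made-up data; it mirrors the `data.table` idea of setting `y` where `x` equals `1`, with a `Float64` value so that the column has to be widened:

```julia
using DataFrames

df = DataFrame(x=[1, 2, 1], y=[10, 20, 30])

# step 1: a view of the rows meeting the condition, keeping all columns
dfv = @view df[df.x .== 1, :]

# step 2: replace column :y through the view; Int and Float64 promote to Float64
dfv[!, :y] .= 2.5

df.y    # rows 1 and 3 are updated, row 2 keeps its value as 20.0
```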
One special case that is often required is adding columns. It is supported with both `:` and `!` row selectors (like for `DataFrame` objects). In this case we do not have a reference column in the parent data frame, so rows that are not included in the view are filled with `missing`.
Here are two examples:
julia> dfv[!, :c] = ["x"]
1-element Vector{String}:
"x"
julia> dfv[:, :d] .= true
1-element Vector{Bool}:
1
julia> df
3×4 DataFrame
Row │ a b c d
│ Float64 Any String? Bool?
─────┼────────────────────────────────
1 │ 1.5 a x true
2 │ 2.0 5 missing missing
3 │ 3.0 6 missing missing
julia> dfv
1×4 SubDataFrame
Row │ a b c d
│ Float64 Any String? Bool?
─────┼──────────────────────────────
1 │ 1.5 a x true
The only limitation is that adding columns is only allowed if the `SubDataFrame` was created with `:` as the column selector. The reason for this limitation is that when one uses the `:` selector we are guaranteed that the `SubDataFrame` has the same columns, in the same order, as its parent, so the requested operation is guaranteed to have an unambiguous interpretation (otherwise we would have to handle e.g. the case when we want to add a column whose name is not present in the `SubDataFrame` but is present in its parent, which could confuse users).
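The limitation can be checked with a short experiment. This is a sketch assuming DataFrames.jl 1.3 or later; the names `dfv_all`, `dfv_sub` and `rejected` are mine, and I deliberately catch any exception rather than rely on a specific error type:

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6)

# a view created with `:` as column selector: adding a column is allowed
dfv_all = @view df[1:1, :]
dfv_all[!, :c] = ["x"]
df.c                          # "x" in row 1, missing in the other rows

# a view created with an explicit column subset: adding a column is rejected
dfv_sub = @view df[1:1, [:a]]
rejected = try
    dfv_sub[!, :d] = [true]
    false
catch
    true
end
rejected                      # expected to be true
```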
Conclusions
In summary, the new functionality allows replacing columns in a data frame through its view. The two main intended use cases of this feature are:
- adding new columns for which we have data only for some rows (selected in the view); this is only allowed when the `SubDataFrame` was created with `:` as the column selector;
- updating data in existing columns even if the new elements cannot be converted to the element type of the existing column; in this case `promote_type` is used to determine the target column element type.