Introduction

Today I want to preview a feature that will be introduced in 1.3 release of DataFrames.jl. We will talk about new ways of updating the columns of a data frame, when one is working with views. My objective is to explain the rationale behind the new functionality and the way it works.

This post was tested under Julia 1.6.1 and DataFrames.jl checked out at main branch on Sep 17, 2021 (SHA-1 facb6721e7450c63f2d5684b78e3c3489ed999b0)

What is a SubDataFrame and when it is useful?

In DataFrames.jl you can construct views of data frame object using the view function or the @view macro exactly like you can create views of arrays in Julia Base. Here is a simple example:

julia> using DataFrames

(@v1.6) pkg> st DataFrames
      Status `~/.julia/environments/v1.6/Project.toml`
  [a93c6f00] DataFrames v1.2.2 `https://github.com/JuliaData/DataFrames.jl.git#main`

julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

julia> dfv = @view df[2:3, :]
2×2 SubDataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     2      5
   2 │     3      6

Now the dfv object is a view of df data frame. It means that it references to the same data in memory as the parent data frame df, but allows to access only a slice of it: in our case we have picked rows 2 and 3 and all columns.

The key features of a view are:

  • mutating its contents also mutates the contents of the parent data frame;
  • it is cheap to create as it is enough to store only the reference to the parent data frame and which rows and columns got selected;
  • it is memory efficient (no copying of data happens);
  • using it has a small computational overhead as when we index a view we need to perform transformation of these indices to the parent data frame indices.

Let us show the first feature as it is most important from the functionality perspective:

julia> dfv[1, 1] = 100
100

julia> dfv
2×2 SubDataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │   100      5
   2 │     3      6

julia> df
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │   100      5
   3 │     3      6

julia> df[3, 1] = 200
200

julia> dfv
2×2 SubDataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │   100      5
   2 │   200      6

julia> df
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │   100      5
   3 │   200      6

As you can see changing dfv also changes df, and vice versa - changing df also changes dfv (if the changed cells are selected in the view).

To understand performance consider two simple implementations of a procedure computing 90% confidence interval of correlation between two variables using bootstrapping:

julia> using Statistics

julia> function bootcor1(df, c1, c2, n)
           cors = Float64[]
           for _ in 1:n
               tmp = df[rand(1:nrow(df), nrow(df)), :]
               push!(cors, cor(tmp[!, c1], tmp[!, c2]))
           end
           return quantile(cors, [0.05, 0.95])
       end
bootcor1 (generic function with 1 method)

julia> function bootcor2(df, c1, c2, n)
           cors = Float64[]
           for _ in 1:n
               tmp = @view df[rand(1:nrow(df), nrow(df)), :]
               push!(cors, cor(tmp[!, c1], tmp[!, c2]))
           end
           return quantile(cors, [0.05, 0.95])
       end
bootcor2 (generic function with 1 method)

(the functions could be further optimized for performance but I did not want to overly complicate the code)

The difference between bootcor1 and bootcor2 is that the former copies a data frame, while the latter uses a view. Both take four parameters:

  • df: a data frame to analyze
  • c1, c2: column identifiers of columns we want to compute the correlation;
  • n: number of bootstrapping samples;

Now create a simple data frame and compare the performance of both functions (I present timings after compilation):

julia> df = DataFrame(rand(10^5, 10), :auto);

julia> @time bootcor1(df, :x1, :x2, 10_000)
 47.059650 seconds (430.02 k allocations: 81.976 GiB, 1.88% gc time)
2-element Vector{Float64}:
 -0.007373812772086598
  0.0029150608879804406

julia> @time bootcor2(df, :x1, :x2, 10_000)
 11.239822 seconds (80.02 k allocations: 7.453 GiB, 0.92% gc time)
2-element Vector{Float64}:
 -0.007643923412421664
  0.002966538851599437

As you can see, because the data frame was wide (10 columns), we saved a lot of time by avoiding copying of the data.

Of course if the data frame were narrower we would not see such a difference:

julia> df = DataFrame(rand(10^5, 2), :auto);

julia> @time bootcor1(df, :x1, :x2, 10_000)
 10.829548 seconds (190.02 k allocations: 22.363 GiB, 1.60% gc time)
2-element Vector{Float64}:
 -0.006650139955186956
  0.0038227359319118795

julia> @time bootcor2(df, :x1, :x2, 10_000)
 10.963020 seconds (80.02 k allocations: 7.453 GiB, 0.53% gc time)
2-element Vector{Float64}:
 -0.006575024146232311
  0.0038253588364537162

The reason is that now while using a view still allocates less this is offset by the fact that working with views has some computational overhead as it was explained above.

What is new for SubDataFrame in DataFrames.jl 1.3?

Let us start with our original small data frame:

julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

Assume you wanted to assign a 1.5 value in the first row of column :a. Before the upcoming DataFrames.jl 1.3 release it is quite cumbersome. If you try doing it you get:

julia> df[1, :a] = 1.5
ERROR: InexactError: Int64(1.5)

You need to do two steps:

  1. promote the element type of column :a to allow Float64 values;
  2. perform the assignment.

Here is a way to do it:

julia> df.a = Vector{Float64}(df.a)
3-element Vector{Float64}:
 1.0
 2.0
 3.0

julia> df[1, :a] = 1.5
1.5

julia> df
3×2 DataFrame
 Row │ a        b
     │ Float64  Int64
─────┼────────────────
   1 │     1.5      4
   2 │     2.0      5
   3 │     3.0      6

Here is one more, well known, example of a similar situation, that sometimes surprises users:

julia> df[1, :b] = 'a'
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> df
3×2 DataFrame
 Row │ a        b
     │ Float64  Int64
─────┼────────────────
   1 │     1.5     97
   2 │     2.0      5
   3 │     3.0      6

In this case Julia silently converted Char value 'a' to its Int representation which is 97.

The key change in the 1.3 release of DataFrames.jl is that views will allow to use ! as row index (currently it is disallowed). The mechanics of this functionality is the same as when ! is used for DataFrame objects - a column will get replaced in the data frame.

A natural question is the following with what will it get replaced? It is quite valid as we are replacing only a portion of the column. The design decision we took is that promote_type will be used to decide the element type of the new column combining the element type of the already present column and element type of the newly assigned values.

Therefore in our examples above, when using a view you get the following:

julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

julia> dfv = @view df[1:1, :]
1×2 SubDataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4

julia> dfv[!, :a] = [1.5]
1-element Vector{Float64}:
 1.5

julia> dfv[!, :b] .= 'a'
1-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> df
3×2 DataFrame
 Row │ a        b
     │ Float64  Any
─────┼──────────────
   1 │     1.5  a
   2 │     2.0  5
   3 │     3.0  6

julia> dfv
1×2 SubDataFrame
 Row │ a        b
     │ Float64  Any
─────┼──────────────
   1 │     1.5  a

As you can see it works both with standard assignment as well as with broadcasted assignment.

Admittedly you still have to make two steps in the process:

  1. create a view;
  2. perform an assignment to it.

This is a bit cumbersome. Fortunately we can expect that in the future DataFramesMeta.jl will provide a convenience syntax to perform conditional assignment using this feature, e.g. like in data.table, where you can write something like df[x == 1, y := 2] to set column y to 2 if column x is equal to 1.

One special case that is often required is adding columns. It is supported with both : and ! row selectors (like for DataFrame objects). In this case we do not have a reference column in a parent data frame, so rows that are not included in the view are filled with missing.

Here are two examples:

julia> dfv[!, :c] = ["x"]
1-element Vector{String}:
 "x"

julia> dfv[:, :d] .= true
1-element Vector{Bool}:
 1

julia> df
3×4 DataFrame
 Row │ a        b    c        d
     │ Float64  Any  String?  Bool?
─────┼────────────────────────────────
   1 │     1.5  a    x           true
   2 │     2.0  5    missing  missing
   3 │     3.0  6    missing  missing

julia> dfv
1×4 SubDataFrame
 Row │ a        b    c        d
     │ Float64  Any  String?  Bool?
─────┼──────────────────────────────
   1 │     1.5  a    x         true

The only limitation is that in this case it is only allowed if SubDataFrame was created with : as column selector. The reason of this limitation is that when one uses : selector we are guaranteed that SubDataFrame has the same columns and in the same order as its parent, so the requested operation is guaranteed not to be problematic in interpretation (otherwise we would have to handle e.g. the case when we want to add a column whose name is not present in the SubDataFrame but is present in its parent which could confuse users).

Conclusions

In summary the new functionality allows to replace columns in a data frame through its view. The two main intended use cases of this feature are:

  • adding new columns for which we have data only for some rows (selected in the view); it is only allowed when SubDataFrame was created with : as column selector;
  • updating data in existing columns even if the new elements cannot be converted to the element type of existing column; in this case promote_type is used to determine the target column element type.