Introduction

Recently on Julia Slack there was a question about using the subset function to drop whole groups from GroupedDataFrame in DataFrames.jl. I thought that indeed this case is tricky enough to be worth a post.

The examples were tested under Julia 1.7.0 and DataFrames.jl 1.3.2.

Standard use cases of the subset function

Let us start with creating some sample data:

julia> using DataFrames

julia> df = DataFrame(id=[1, 1, 1, 1, 2, 2], x=1:6)
6×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3
   4 │     1      4
   5 │     2      5
   6 │     2      6

julia> gdf = groupby(df, :id)
GroupedDataFrame with 2 groups based on key: id
First Group (4 rows): id = 1
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3
   4 │     1      4
⋮
Last Group (2 rows): id = 2
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     2      5
   2 │     2      6

Assume we want to keep rows having value of :x less than the mean of this column from df. This can be achieved with:

julia> using Statistics

julia> subset(df, :x => x -> x .< mean(x))
3×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3

The same operation can be easily done groupwise. Now we keep rows that have the value of :x less than the mean of this column per group defined by :id:

julia> subset(gdf, :x => x -> x .< mean(x))
3×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     2      5

The limitation of the subset contract

The subset function requires that the return value of the passed condition is a vector. Therefore the following operation fails:

julia> subset(df, :x => x -> true)
ERROR: ArgumentError: functions passed to `subset` must return an AbstractVector.

although we might expect that broadcasting would be applied to the result of the function and all rows would be kept. For a reference e.g. select would perform such broadcasting automatically:

julia> select(df, All(), :x => x -> true)
6×3 DataFrame
 Row │ id     x      x_function
     │ Int64  Int64  Bool
─────┼──────────────────────────
   1 │     1      1        true
   2 │     1      2        true
   3 │     1      3        true
   4 │     1      4        true
   5 │     2      5        true
   6 │     2      6        true

You might wonder why this restriction is made. Initially we allowed non-vector return values, but they turned to be confusing for the users so we disallowed them.

Let me give an example. If the user wants to keep all rows for which the :id column is equal to 1 one should write:

julia> subset(df, :id => ByRow(==(1)))
4×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3
   4 │     1      4

However, it turned out that users frequently were forgetting to add ByRow wrapper and instead used:

julia> subset(df, :id => ==(1))
ERROR: ArgumentError: functions passed to `subset` must return an AbstractVector.

Now it throws an error, but if we have not imposed the restriction that we require a vector to be returned we would get the following result:

julia> subset(df, :id => x -> fill(x == 1, length(x)))
0×2 DataFrame

as the whole column :id would be compared to 1 and the result of this comparison is false.

Dropping whole groups from a GroupedDataFrame

The requirement that the condition must return a vector was added for safety reasons. However, there is one case when it is a bit problematic.

Assume we want to keep from the gdf GroupedDataFrame all groups for which the mean of :x column is less than 3. The problem is that the following condition fails:

julia> subset(gdf, :x => x -> mean(x) < 3)
ERROR: ArgumentError: functions passed to `subset` must return an AbstractVector.

since the comparing the mean of the :x column to 3 produces a scalar Bool value.

The solution is to manually expand the result of the condition to match the number of rows in the group:

julia> subset(gdf, :x => x -> fill(mean(x) < 3, length(x)))
4×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3
   4 │     1      4

This is unfortunately a bit inconvenient.

An alternative approach would be to use the filter function which applied to GroupedDataFrame always works on whole groups:

julia> filter(:x => x -> mean(x) < 3, gdf) |> DataFrame
4×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3
   4 │     1      4

(we had to pass the result of filter to DataFrame constructor, as otherwise we would get a filtered GroupedDataFrame)

Conclusions

The design of subset I discussed in this post shows one of the challenges we face when defining APIs in DataFrames.jl. There often is a tension between developer convenience and safety. In this example allowing only vectors as results of conditions in the subset function is safer since it allows to catch some common bugs in the users code. The cost is that in some cases (most notably dropping whole groups from a GroupedDataFrame) it is a bit inconvenient.