# The curious case of subset condition

# Introduction

Recently on Julia Slack there was a question about using the `subset` function
to drop whole groups from a `GroupedDataFrame` in DataFrames.jl.
I thought that this case is tricky enough to be worth a post.

The examples were tested under Julia 1.7.0 and DataFrames.jl 1.3.2.

# Standard use cases of the `subset` function

Let us start with creating some sample data:

```
julia> using DataFrames
julia> df = DataFrame(id=[1, 1, 1, 1, 2, 2], x=1:6)
6×2 DataFrame
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 1 2
3 │ 1 3
4 │ 1 4
5 │ 2 5
6 │ 2 6
julia> gdf = groupby(df, :id)
GroupedDataFrame with 2 groups based on key: id
First Group (4 rows): id = 1
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 1 2
3 │ 1 3
4 │ 1 4
⋮
Last Group (2 rows): id = 2
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 2 5
2 │ 2 6
```

Assume we want to keep rows having a value of `:x` less than the mean of this
column from `df`. This can be achieved with:

```
julia> using Statistics
julia> subset(df, :x => x -> x .< mean(x))
3×2 DataFrame
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 1 2
3 │ 1 3
```

The same operation can be easily done groupwise. Now we keep rows that have the
value of `:x` less than the mean of this column per group defined by `:id`:

```
julia> subset(gdf, :x => x -> x .< mean(x))
3×2 DataFrame
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 1 2
3 │ 2 5
```
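To check which thresholds are applied in the two calls above, here is a minimal sketch that computes the overall mean and the per-group means directly. It rebuilds the same `df` and `gdf`; the variable names `overall` and `means` are only illustrative.

```julia
using DataFrames, Statistics

df = DataFrame(id=[1, 1, 1, 1, 2, 2], x=1:6)
gdf = groupby(df, :id)

# threshold used when `subset` is applied to the whole data frame
overall = mean(df.x)

# thresholds used per group when `subset` is applied to `gdf`;
# the result has one row per group with columns `id` and `x_mean`
means = combine(gdf, :x => mean)
```

The overall mean is 3.5, so rows with `:x` equal to 1, 2, and 3 are kept in the first case. The group means are 2.5 and 5.5, so rows with `:x` equal to 1, 2, and 5 are kept in the groupwise case.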

# The limitation of the `subset` contract

The `subset` function requires that the return value of the passed condition
is a vector. Therefore the following operation fails:

```
julia> subset(df, :x => x -> true)
ERROR: ArgumentError: functions passed to `subset` must return an AbstractVector.
```

although we might expect that broadcasting would be applied to the result of
the function and all rows would be kept. For reference, `select` performs such
broadcasting automatically:

```
julia> select(df, All(), :x => x -> true)
6×3 DataFrame
Row │ id x x_function
│ Int64 Int64 Bool
─────┼──────────────────────────
1 │ 1 1 true
2 │ 1 2 true
3 │ 1 3 true
4 │ 1 4 true
5 │ 2 5 true
6 │ 2 6 true
```

You might wonder why this restriction was made. Initially we allowed non-vector return values, but they turned out to be confusing for users, so we disallowed them.

Let me give an example. If the user wants to keep all rows for which the `:id`
column is equal to `1`, one should write:

```
julia> subset(df, :id => ByRow(==(1)))
4×2 DataFrame
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 1 2
3 │ 1 3
4 │ 1 4
```

However, it turned out that users frequently forgot to add the `ByRow`
wrapper and instead used:

```
julia> subset(df, :id => ==(1))
ERROR: ArgumentError: functions passed to `subset` must return an AbstractVector.
```

Now it throws an error, but if we had not imposed the restriction that a vector must be returned, the scalar result would be broadcast to all rows and we would get the same result as from the following call:

```
julia> subset(df, :id => x -> fill(x == 1, length(x)))
0×2 DataFrame
```

as the whole column `:id` would be compared to `1` and the result of this
comparison is `false`.
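To see why, here is a minimal plain-Julia sketch that contrasts `==` applied to a whole vector with its broadcasted form. The vector `ids` is only a stand-in for the `:id` column, and the names `whole` and `byrow` are illustrative.

```julia
ids = [1, 1, 1, 1, 2, 2]

# compares the vector as a single object to the number 1
whole = ids == 1

# broadcasting compares each element to 1 separately
byrow = ids .== 1
```

Here `whole` is a single `Bool` equal to `false`, while `byrow` is a vector that is `true` for the first four elements. The `ByRow(==(1))` condition corresponds to the second form.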

# Dropping whole groups from a `GroupedDataFrame`

The requirement that the condition must return a vector was added for safety reasons. However, there is one case when it is a bit problematic.

Assume we want to keep from the `gdf` `GroupedDataFrame` all groups for which
the mean of the `:x` column is less than `3`. The problem is that the following
condition fails:

```
julia> subset(gdf, :x => x -> mean(x) < 3)
ERROR: ArgumentError: functions passed to `subset` must return an AbstractVector.
```

since comparing the mean of the `:x` column to `3` produces a scalar `Bool`
value.

The solution is to manually expand the result of the condition to match the number of rows in the group:

```
julia> subset(gdf, :x => x -> fill(mean(x) < 3, length(x)))
4×2 DataFrame
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 1 2
3 │ 1 3
4 │ 1 4
```

This is unfortunately a bit inconvenient.
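If this pattern is needed often, the expansion can be factored out. Below is a minimal sketch of such a wrapper. The name `wholegroup` is hypothetical (it is not part of DataFrames.jl), and the sketch assumes the condition takes one or more column vectors and returns a single `Bool`.

```julia
using DataFrames, Statistics

# Hypothetical helper, not part of DataFrames.jl: turns a condition that
# returns a single Bool into one that returns a vector with that value
# repeated once per row of the first column passed to it.
wholegroup(f) = (cols...) -> fill(f(cols...), length(first(cols)))

df = DataFrame(id=[1, 1, 1, 1, 2, 2], x=1:6)
gdf = groupby(df, :id)

# keep only the groups whose mean of :x is less than 3
res = subset(gdf, :x => wholegroup(x -> mean(x) < 3))
```

With the sample data, only the group with `id` equal to `1` has a mean below `3`, so `res` contains its four rows, just like the explicit `fill` call above.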

An alternative approach would be to use the `filter` function, which when
applied to a `GroupedDataFrame` always works on whole groups:

```
julia> filter(:x => x -> mean(x) < 3, gdf) |> DataFrame
4×2 DataFrame
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 1 2
3 │ 1 3
4 │ 1 4
```

(we had to pass the result of `filter` to the `DataFrame` constructor, as
otherwise we would get a filtered `GroupedDataFrame`)

# Conclusions

The design of `subset` I discussed in this post shows one of the challenges we
face when defining APIs in DataFrames.jl. There is often a tension between
developer convenience and safety. In this example, allowing only vectors as
results of conditions in the `subset` function is safer, since it allows us to
catch some common bugs in users' code. The cost is that in some cases
(most notably dropping whole groups from a `GroupedDataFrame`) it is a bit
inconvenient.