# DataFrames.jl: why do we have both subset and filter functions?

# Introduction

Before I start let me comment that exactly one year ago this blog has been started. I hope to keep posting weekly updates on the Julia language, and especially its ecosystem for data science, so:

Now let us go back to business.

The 1.1 release of the DataFrames.jl package introduced a small fix of how
the `subset`

function works. Today I will discuss its design and compare it
to the `filter`

function.

In this post I am using Julia 1.6.1 and DataFrames.jl 1.1.0.

# The design of `filter`

The `filter`

function is defined in Julia Base. Therefore in DataFrames.jl we
add methods to it. Let us start with the contract for `filter(f, a)`

then:

Return a copy of collection

`a`

, removing elements for which`f`

is`false`

. The function`f`

is passed one argument.

How do we translate this into DataFrames.jl realm? We have to cases.

If `a`

is an `AbstractDataFrame`

then we treat it as a collection of rows.
Therefore `f`

will get one row of data and we expect it to return a `Bool`

value.
As a result of the operation we produce a `DataFrame`

(unless `view`

keyword
argument is `true`

in which case we return a `SubDataFrame`

).

Here is a basic example:

```
julia> using DataFrames
julia> df = DataFrame(a=1:3)
3×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
julia> filter(row -> row.a != 2, df)
2×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 3
```

A more efficient (faster to execute) way to express the same is:

```
julia> filter(:a => !=(2), df)
2×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 3
```

As you can see the style is that you pass a `Pair`

or column name and a
predicate function (i.e. a function that produces `Bool`

). This has two
benefits. Firstly, the operation is type stable (thus faster). Secondly, in the
`row -> row.a != 2`

we define a new anonymous function with each call of
`filter`

, which causes compilation (unless the operation is wrapped in a
function or we predefine the predicate function).

The second case is when `a`

is a `GroupedDataFrame`

. In this case `f`

will get
one group and should return a `Bool`

value again. The result will be a
`GroupedDataFrame`

with groups appropriately removed:

```
julia> gdf = groupby(df, :a)
GroupedDataFrame with 3 groups based on key: a
First Group (1 row): a = 1
Row │ a
│ Int64
─────┼───────
1 │ 1
⋮
Last Group (1 row): a = 3
Row │ a
│ Int64
─────┼───────
1 │ 3
julia> filter(sdf -> sdf.a != [2], gdf)
GroupedDataFrame with 2 groups based on key: a
First Group (1 row): a = 1
Row │ a
│ Int64
─────┼───────
1 │ 1
⋮
Last Group (1 row): a = 3
Row │ a
│ Int64
─────┼───────
1 │ 3
```

A `Pair`

version is also supported:

```
julia> filter(:a => !=([2]), gdf)
GroupedDataFrame with 2 groups based on key: a
First Group (1 row): a = 1
Row │ a
│ Int64
─────┼───────
1 │ 1
⋮
Last Group (1 row): a = 3
Row │ a
│ Int64
─────┼───────
1 │ 3
```

A crucial thing to note is that this time the predicate gets a data frame (or its column/columns).

In summary — the `filter`

function (apart from the `view`

keyword argument and
a special `Pair`

syntax that improves the performance) works exactly like
the Julia Base contract requires.

Before we move forward you might notice that the `Pair`

syntax for the
`AbstractDataFrame`

case is different than the same syntax for `select`

,
`transform`

, and `combine`

functions, where always a whole column is passed.
Indeed there is a small inconsistency. It was left for user convenience
and consistency with Julia Base.

On the other hand `subset`

is fully consistent with the rest of DataFrames.jl
ecosystem, so let us move to it now.

# The design of `subset`

The `subset`

function is designed for filtering of rows in a way consistent
with the `select`

, `transform`

, and `combine`

functions. The contract for
the `subset(df, args...)`

function is:

Return a copy of data frame

`df`

containing only rows for which all values produced by transformation(s)`args`

for a given row are`true`

.

If instead of a `df`

data frame you pass a `GroupedDataFrame`

the rules are
the same, but the difference is that they apply to the `parent`

of the
`GroupedDataFrame`

. So this leads us to a list of differences from `filter`

, as
in `subset`

:

- the
`AbstactDataFrame`

/`GroupedDataFrame`

argument goes first; - you are allowed do pass multiple conditions on which you want to perform row selection;
- always works on whole columns;
- always filters rows;
- the transformation is expected to return a vector (not a scalar
`Bool`

— remember we are filtering rows so the length of the vector must match the number of rows); - by default always produces a data frame.

The additional differences follow the available keyword arguments:

- all transformations must produce vectors containing
`true`

or`false`

; however, optionally`missing`

is allowed if`skipmissing=true`

(this option is not available in`filter`

); - for
`GroupedDataFrame`

case if`ungroup=false`

the resulting data frame is re-grouped based on the same grouping columns as the source`GroupedDataFrame`

(but by default a data frame is returned).

The `view`

keyword argument works like in `filter`

and allows you to produce
a `SubDataFrame`

instead of a `DataFrame`

.

Enough theory, let us get to the examples:

```
julia> df2 = DataFrame(a=repeat(1:3, 2), b=1:6)
6×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
4 │ 1 4
5 │ 2 5
6 │ 3 6
julia> subset(df2, :a => ByRow(==(1)), :b => ByRow(isodd))
1×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 1
```

Here you can see that we had to wrap predicates in `ByRow`

to make sure
that a vector of `Bool`

is produce by the filtering conditions. Otherwise
you would get an error:

```
julia> subset(df2, :a => ==(1))
ERROR: ArgumentError: functions passed to `subset` must return an AbstractVector.
```

(By the way: this is a thing that was changed in DataFrames.jl 1.1 release;
previously unintentionally returning scalar `Bool`

was allowed which was error
prone, as the comparison was made against a whole vector — not its elements.)

The second key thing to remember is that `subset`

filters rows always,
also in `GroupedDataFrame`

case:

```
julia> gdf2 = groupby(df2, :a)
GroupedDataFrame with 3 groups based on key: a
First Group (2 rows): a = 1
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 1 4
⋮
Last Group (2 rows): a = 3
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 3 3
2 │ 3 6
julia> subset(gdf2, :b => (x -> x .== maximum(x)))
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
```

This is often very useful if we want to filter rows by some within-group condition, like in the example above.

Finally, let me show the `skipmissing`

keyword argument at work:

```
julia> df3 = DataFrame(a=[1, missing, 3, 4])
4×1 DataFrame
Row │ a
│ Int64?
─────┼─────────
1 │ 1
2 │ missing
3 │ 3
4 │ 4
julia> subset(df3, :a => ByRow(isodd))
ERROR: ArgumentError: missing was returned in condition number 1 but only true or false are allowed; pass skipmissing=true to skip missing values
julia> subset(df3, :a => ByRow(isodd), skipmissing=true)
2×1 DataFrame
Row │ a
│ Int64?
─────┼────────
1 │ 1
2 │ 3
```

# Conclusions

In summary both `filter`

and `subset`

are useful, but in
different contexts. The basic rules are:

- if you have multiple conditions to apply use
`subset`

; - if you want to easily handle
`missing`

values use`subset`

; - if you have a single predicate that takes a single row (or a scalar)
and returns
`Bool`

and want to filter a data frame use`filter`

(this saves you typing`ByRow`

in`subset`

); - if you have a single predicate that returns
`Bool`

and want to filter whole groups of a`GroupedDataFrame`

(as opposed to rows) use`filter`

.

The things are unfortunately a bit complex, but we provide them for user
convenience as both `filter`

and `subset`

are useful in different contexts.

Before I finish let me highlight that there are also in-place `filter!`

and
`subset!`

variants of these functions.