DataFrames.jl: why do we have both subset and filter functions?
Introduction
Before I start, let me note that this blog was started exactly one year ago. I hope to keep posting weekly updates on the Julia language, and especially its ecosystem for data science.
Now let us go back to business.
The 1.1 release of the DataFrames.jl package introduced a small fix to how the subset function works. Today I will discuss its design and compare it to the filter function.
In this post I am using Julia 1.6.1 and DataFrames.jl 1.1.0.
The design of filter
The filter function is defined in Julia Base. Therefore in DataFrames.jl we add methods to it. Let us start with the contract for filter(f, a) then:
Return a copy of collection a, removing elements for which f is false. The function f is passed one argument.
How do we translate this into the DataFrames.jl realm? We have two cases.
If a is an AbstractDataFrame then we treat it as a collection of rows. Therefore f will get one row of data and we expect it to return a Bool value. As a result of the operation we produce a DataFrame (unless the view keyword argument is true, in which case we return a SubDataFrame).
Here is a basic example:
julia> using DataFrames
julia> df = DataFrame(a=1:3)
3×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
julia> filter(row -> row.a != 2, df)
2×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 3
A more efficient (faster to execute) way to express the same is:
julia> filter(:a => !=(2), df)
2×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 3
As you can see, the style is that you pass a Pair of a column name and a predicate function (i.e. a function that produces Bool). This has two benefits. Firstly, the operation is type stable (thus faster). Secondly, with row -> row.a != 2 we define a new anonymous function with each call of filter, which causes compilation (unless the operation is wrapped in a function or we predefine the predicate function).
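As a minimal sketch of the "predefine the predicate function" remark above, the snippet below uses a named predicate (not_two is a name chosen here for illustration only) together with the Pair syntax. It also shows the view keyword argument mentioned earlier.

```julia
using DataFrames

df = DataFrame(a=1:3)

# A named predicate is defined once, so repeated calls of filter can reuse it.
# `not_two` is a hypothetical helper name used only in this sketch.
not_two(x) = x != 2

# Pair syntax: the predicate receives one element of column :a at a time.
res = filter(:a => not_two, df)

# With view=true a SubDataFrame pointing into df is returned instead of a copy.
res_view = filter(:a => not_two, df, view=true)
```

Both calls keep the rows where a is 1 or 3; only the type of the returned object differs.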
The second case is when a is a GroupedDataFrame. In this case f will get one group and should return a Bool value again. The result will be a GroupedDataFrame with groups appropriately removed:
julia> gdf = groupby(df, :a)
GroupedDataFrame with 3 groups based on key: a
First Group (1 row): a = 1
Row │ a
│ Int64
─────┼───────
1 │ 1
⋮
Last Group (1 row): a = 3
Row │ a
│ Int64
─────┼───────
1 │ 3
julia> filter(sdf -> sdf.a != [2], gdf)
GroupedDataFrame with 2 groups based on key: a
First Group (1 row): a = 1
Row │ a
│ Int64
─────┼───────
1 │ 1
⋮
Last Group (1 row): a = 3
Row │ a
│ Int64
─────┼───────
1 │ 3
A Pair version is also supported:
julia> filter(:a => !=([2]), gdf)
GroupedDataFrame with 2 groups based on key: a
First Group (1 row): a = 1
Row │ a
│ Int64
─────┼───────
1 │ 1
⋮
Last Group (1 row): a = 3
Row │ a
│ Int64
─────┼───────
1 │ 3
A crucial thing to note is that this time the predicate gets a data frame (or its column/columns).
In summary, the filter function (apart from the view keyword argument and a special Pair syntax that improves the performance) works exactly like the Julia Base contract requires.
Before we move forward, you might notice that the Pair syntax for the AbstractDataFrame case differs from the same syntax for the select, transform, and combine functions, where a whole column is always passed. Indeed there is a small inconsistency. It was left for user convenience and consistency with Julia Base.
On the other hand, subset is fully consistent with the rest of the DataFrames.jl ecosystem, so let us move to it now.
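To make the inconsistency described above concrete, here is a small sketch (the column name got_vector is made up for this illustration). It checks what kind of value the function in a Pair receives in filter versus combine.

```julia
using DataFrames

df = DataFrame(a=1:3)

# In filter, the function receives one element of column :a per call,
# so every row passes the `isa Integer` check.
kept = filter(:a => (x -> x isa Integer), df)

# In combine (and likewise in select and transform) the function
# receives the whole column as a vector.
whole = combine(df, :a => (x -> x isa AbstractVector) => :got_vector)
```

Here kept has all three rows, and whole is a one-row data frame whose got_vector column holds true.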
The design of subset
The subset function is designed for filtering of rows in a way consistent with the select, transform, and combine functions. The contract for the subset(df, args...) function is:
Return a copy of data frame df containing only rows for which all values produced by transformation(s) args for a given row are true.
If you pass a GroupedDataFrame instead of a data frame df, the rules are the same, except that they apply to the parent of the GroupedDataFrame. This leads us to a list of differences from filter, as in subset:
- the AbstractDataFrame/GroupedDataFrame argument goes first;
- you are allowed to pass multiple conditions on which you want to perform row selection;
- always works on whole columns;
- always filters rows;
- the transformation is expected to return a vector (not a scalar Bool; remember we are filtering rows so the length of the vector must match the number of rows);
- by default always produces a data frame.
The additional differences follow the available keyword arguments:
- all transformations must produce vectors containing true or false; however, optionally missing is allowed if skipmissing=true (this option is not available in filter);
- for the GroupedDataFrame case, if ungroup=false the resulting data frame is re-grouped based on the same grouping columns as the source GroupedDataFrame (but by default a data frame is returned).
The view keyword argument works like in filter and allows you to produce a SubDataFrame instead of a DataFrame.
Enough theory, let us get to the examples:
julia> df2 = DataFrame(a=repeat(1:3, 2), b=1:6)
6×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
4 │ 1 4
5 │ 2 5
6 │ 3 6
julia> subset(df2, :a => ByRow(==(1)), :b => ByRow(isodd))
1×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 1
Here you can see that we had to wrap predicates in ByRow to make sure that a vector of Bool is produced by the filtering conditions. Otherwise you would get an error:
julia> subset(df2, :a => ==(1))
ERROR: ArgumentError: functions passed to `subset` must return an AbstractVector.
(By the way: this is a thing that was changed in the DataFrames.jl 1.1 release; previously, unintentionally returning a scalar Bool was allowed, which was error prone, as the comparison was made against a whole vector and not its elements.)
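The difference between comparing a whole vector and comparing its elements can be sketched as follows. The variable names are chosen for this illustration; broadcasting inside an anonymous function is shown as one possible alternative to ByRow.

```julia
using DataFrames

df2 = DataFrame(a=repeat(1:3, 2), b=1:6)

# == compares the whole column with the scalar 1, giving a single Bool.
whole_column = df2.a == 1

# Broadcasting compares element by element and gives one Bool per row.
per_row = df2.a .== 1

# Two ways to hand subset a vector of Bool values:
r1 = subset(df2, :a => ByRow(==(1)))      # apply the predicate row by row
r2 = subset(df2, :a => (x -> x .== 1))    # broadcast over the whole column
```

Both r1 and r2 keep the two rows where a equals 1.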
The second key thing to remember is that subset always filters rows, also in the GroupedDataFrame case:
julia> gdf2 = groupby(df2, :a)
GroupedDataFrame with 3 groups based on key: a
First Group (2 rows): a = 1
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 1 4
⋮
Last Group (2 rows): a = 3
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 3 3
2 │ 3 6
julia> subset(gdf2, :b => (x -> x .== maximum(x)))
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
This is often very useful if we want to filter rows by some within-group condition, like in the example above.
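The ungroup and view keyword arguments described earlier are not used in the examples here, so the following is a small sketch of how they could be applied to the same data (the variable names are illustrative).

```julia
using DataFrames

df2 = DataFrame(a=repeat(1:3, 2), b=1:6)
gdf2 = groupby(df2, :a)

# ungroup=false: the filtered rows are re-grouped by the grouping columns of gdf2.
regrouped = subset(gdf2, :b => (x -> x .== maximum(x)), ungroup=false)

# view=true: a SubDataFrame pointing into df2 is returned instead of a copy.
v = subset(df2, :a => ByRow(==(1)), view=true)
```

With ungroup=false the result is a GroupedDataFrame with one row per group; with view=true no new columns are allocated.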
Finally, let me show the skipmissing keyword argument at work:
julia> df3 = DataFrame(a=[1, missing, 3, 4])
4×1 DataFrame
Row │ a
│ Int64?
─────┼─────────
1 │ 1
2 │ missing
3 │ 3
4 │ 4
julia> subset(df3, :a => ByRow(isodd))
ERROR: ArgumentError: missing was returned in condition number 1 but only true or false are allowed; pass skipmissing=true to skip missing values
julia> subset(df3, :a => ByRow(isodd), skipmissing=true)
2×1 DataFrame
Row │ a
│ Int64?
─────┼────────
1 │ 1
2 │ 3
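Since filter has no skipmissing option, the predicate itself has to deal with missing values. The sketch below shows two ways this could be written; it is only an illustration of the idea, not a recommendation from the package documentation.

```julia
using DataFrames

df3 = DataFrame(a=[1, missing, 3, 4])

# The predicate must return a Bool for every row, so missing is checked explicitly.
res = filter(:a => (x -> !ismissing(x) && isodd(x)), df3)

# coalesce turns a missing result of isodd into false, with the same effect here.
res2 = filter(:a => (x -> coalesce(isodd(x), false)), df3)
```

Both calls keep the rows where a is 1 or 3, matching the subset call with skipmissing=true.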
Conclusions
In summary, both filter and subset are useful, but in different contexts. The basic rules are:
- if you have multiple conditions to apply use subset;
- if you want to easily handle missing values use subset;
- if you have a single predicate that takes a single row (or a scalar) and returns Bool and want to filter a data frame use filter (this saves you typing ByRow in subset);
- if you have a single predicate that returns Bool and want to filter whole groups of a GroupedDataFrame (as opposed to rows) use filter.
Things are unfortunately a bit complex, but we provide both functions for user convenience, as each of them is useful in different contexts.
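To illustrate the trade-off in the rules above, here is a sketch of the same two-condition selection written with both functions. The multi-column Pair form of filter is used here, in which the predicate receives the values of the listed columns for one row as separate arguments.

```julia
using DataFrames

df2 = DataFrame(a=repeat(1:3, 2), b=1:6)

# subset: one Pair per condition, each predicate wrapped in ByRow.
s = subset(df2, :a => ByRow(==(1)), :b => ByRow(isodd))

# filter: a single predicate combining both conditions for one row.
f = filter([:a, :b] => ((a, b) -> a == 1 && isodd(b)), df2)
```

Both calls return the single row with a = 1 and b = 1.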
Before I finish, let me highlight that there are also in-place filter! and subset! variants of these functions.
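A brief sketch of the in-place variants, using a small data frame made up for this illustration:

```julia
using DataFrames

df = DataFrame(a=1:6)

# filter! removes rows from df itself: rows with a = 1, 3, 5 remain.
filter!(:a => isodd, df)

# subset! also modifies df in place: rows with a = 3, 5 remain.
subset!(df, :a => ByRow(>(1)))
```

Both functions modify df rather than allocating a new data frame.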