Handling summary statistics of empty collections
Introduction
When designing solutions for data analysis, one often faces a tough choice between doing the correct thing and doing the convenient thing.
One particular case is computing summary statistics of empty collections. Such statistics are often not properly defined for empty data (so Julia throws an error), whereas a data scientist would usually prefer to get some flag value. Today I want to discuss typical cases of such situations and possible solutions.
The post was written under Julia 1.8.2 and DataFrames.jl 1.4.4.
Update: a note on Missings.jl 1.1.0 was added at the very end of this post (make sure to check it out!).
Sum and product
This case is the least problematic. Typically you get what you expect, i.e. the zero (for sum) or the one (for prod) of the element type you aggregate:
julia> sum(Int[])
0
julia> prod(Float64[])
1.0
julia> sum(skipmissing(Union{Int, Missing}[missing, missing]))
0
julia> prod(skipmissing(Union{Float64, Missing}[missing, missing]))
1.0
The only case that is problematic is when you want to work with an empty container of a too-wide element type:
julia> sum([])
ERROR: MethodError: no method matching zero(::Type{Any})
A standard solution, since both sum and prod are reductions, is to provide an initialization value manually in this case:
julia> sum([], init=0)
0
julia> prod([], init=1.0)
1.0
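If you control how the container is created, another option is to give it a concrete element type, so that no init is needed. Here is a minimal sketch; the variable names are only illustrative:

```julia
# An untyped literal `[]` is a Vector{Any}, and `zero(Any)` is not defined.
untyped = []
typed = Int[]          # also empty, but with a concrete element type

eltype(untyped)        # Any
sum(typed)             # 0, no `init` required

# When the element type cannot be narrowed, `init` also decides
# the type of the result for an empty input.
sum(untyped, init=0.0) # 0.0
```

Which of the two is preferable depends on whether the code that builds the collection is yours to change.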
Minimum and maximum
When computing minimum and maximum, by default you get an error with empty collections:
julia> minimum(Int[])
ERROR: MethodError: reducing over an empty collection is not allowed;
consider supplying `init` to the reducer
As you can see, we get an error and are prompted to pass the init value for the reduction, so the situation is less convenient.
Indeed, if a reasonable init value can be passed, this is a good solution:
julia> minimum(Float64[], init=Inf)
Inf
julia> maximum(Int[], init=typemin(Int))
-9223372036854775808
Alternatively, if we know that the values must lie in some range, we can use the bounds of that range. A common case is probabilities:
julia> minimum([], init=1.0)
1.0
julia> maximum([], init=0.0)
0.0
Sometimes, however, we might want to have a special signal value. In this case you have two options. One is to check if the collection is empty, the other is to catch the exception:
julia> x = Int[]
Int64[]
julia> isempty(x) ? missing : minimum(x)
missing
julia> try
           minimum(x)
       catch e
           isa(e, MethodError) ? missing : rethrow(e)
       end
missing
You could wrap both solutions in a function for convenience if you use them often in your code. Their downside is that they add a bit of computational overhead. The isempty check is not always an O(1) operation; the most common such case is skipmissing, where it has to scan for the first non-missing value. The try-catch approach incurs the cost of handling the exception.
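If the extra scan done by isempty is a concern, one possible alternative is to walk the collection once using the iteration protocol directly. The sketch below does that; minimum_or_missing is a hypothetical helper name, not a function from Base:

```julia
# Hypothetical helper: one pass over `itr`, returns `missing` when it is empty.
function minimum_or_missing(itr)
    next = iterate(itr)
    next === nothing && return missing   # nothing to reduce over
    acc, state = next
    while true
        next = iterate(itr, state)
        next === nothing && return acc   # iterator exhausted
        val, state = next
        acc = min(acc, val)
    end
end

minimum_or_missing(Int[])                         # missing
minimum_or_missing(skipmissing([missing, 3, 1]))  # 1
```

This avoids both the separate emptiness check and the exception, at the cost of giving up whatever specialized implementation minimum has for the given collection type.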
Extrema
In the case of the extrema function the situation is analogous to minimum and maximum. The only difference is that you pass two values to init if you want to use this method. Here is an example assuming we are processing data that are probabilities:
julia> extrema(Float64[], init=(1.0, 0.0))
(1.0, 0.0)
Note that in this case minimum is greater than the maximum, so we can immediately see that the passed collection was empty.
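This observation can be turned into a small wrapper. The following is only a sketch: extrema_or_missing and its keyword arguments are hypothetical names, and it assumes that all data lie within the given bounds (otherwise the init values would distort the result):

```julia
# Hypothetical helper: `lower` and `upper` are bounds that every value in `x`
# is assumed to satisfy (e.g. 0.0 and 1.0 for probabilities).
function extrema_or_missing(x; lower=0.0, upper=1.0)
    # Start the minimum at the upper bound and the maximum at the lower bound.
    lo, hi = extrema(x, init=(upper, lower))
    # For non-empty data within the bounds we always have lo <= hi,
    # so an inverted pair can only come from an empty collection.
    return lo > hi ? missing : (lo, hi)
end

extrema_or_missing(Float64[])   # missing
extrema_or_missing([0.2, 0.7])  # (0.2, 0.7)
```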
Mean, variance, and standard deviation
When computing mean, var, or std, we get NaN when working with an empty collection:
julia> using Statistics
julia> mean(Int[])
NaN
julia> var(Float64[])
NaN
julia> std(Float64[])
NaN
This is expected, as their computation involves division by zero. Also, similarly to sum and prod, when the collection has a too-wide element type we get an error:
julia> mean([])
ERROR: MethodError: no method matching zero(::Type{Any})
If we do not like this default behavior and want to handle an empty collection in a special way, checking whether the container is empty is a standard solution:
julia> x = Float64[]
Float64[]
julia> isempty(x) ? missing : var(x)
missing
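A related detail worth keeping in mind is that var (with the default correction) also returns NaN for a single observation, which the isempty check does not catch. If that case should be flagged too, a sketch along the following lines could be used; var_or_missing is a hypothetical helper name:

```julia
using Statistics

# Hypothetical helper: returns `missing` when there are fewer than two
# observations. `count` is used instead of `length` so that it also works
# for iterators such as `skipmissing` (at the cost of an extra pass).
function var_or_missing(x)
    n = count(_ -> true, x)
    return n < 2 ? missing : var(x)
end

var_or_missing(Float64[])    # missing
var_or_missing([1.0])        # missing (plain var([1.0]) is NaN)
var_or_missing([1.0, 3.0])   # 2.0
```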
Median and quantile
Computing quantiles, and the median in particular, is the least convenient case, as for them we always get an error and cannot use an init value (as they are not reductions):
julia> median(Int[])
ERROR: ArgumentError: median of an empty array is undefined, Int64[]
julia> quantile(Float64[], 0.1)
ERROR: ArgumentError: empty data vector
Here, currently, the only options are to check if the collection is empty or to catch the exception:
julia> x = Float64[]
Float64[]
julia> isempty(x) ? missing : median(x)
missing
julia> try
           quantile(x, 0.1)
       catch e
           isa(e, ArgumentError) ? missing : rethrow(e)
       end
missing
Conclusions
My post today was meant to be a quick reference for Julia users who sometimes hit these issues when working with their data. My experience is that the most common scenario of this kind is connected with missing data. Here is a typical problematic case:
julia> using DataFrames
julia> using Random
julia> Random.seed!(1234);
julia> df = DataFrame(id=rand(1:10^6, 10^6),
                      value=rand([1:10; missing], 10^6))
1000000×2 DataFrame
     Row │ id      value
         │ Int64   Int64?
─────────┼─────────────────
       1 │ 325977        4
       2 │ 549052        9
       3 │ 218587        9
       4 │ 894246        8
       5 │ 353112        1
       6 │ 394256       10
       7 │ 953125  missing
       8 │ 795547        5
       9 │ 494250        1
    ⋮    │   ⋮        ⋮
  999993 │ 967428        9
  999994 │ 557085        1
  999995 │ 353965        5
  999996 │ 590548       10
  999997 │ 657727        2
  999998 │ 928733        3
  999999 │ 884126  missing
 1000000 │ 587503        2
        999983 rows omitted
I deliberately generated the data so that it contains quite a few missing values:
julia> combine(groupby(df, :value), proprow)
11×2 DataFrame
 Row │ value    proprow
     │ Int64?   Float64
─────┼───────────────────
   1 │       1  0.091109
   2 │       2  0.091387
   3 │       3  0.091394
   4 │       4  0.090954
   5 │       5  0.090504
   6 │       6  0.091412
   7 │       7  0.090809
   8 │       8  0.090844
   9 │       9  0.090254
  10 │      10  0.090795
  11 │ missing  0.090538
Now notice that some groups will only have missing values (in the output below, the group with :id equal to 12 in row 7 is such a case):
julia> combine(groupby(df, :id),
               :value => (x -> mean(ismissing, x)) => :propmissing)
632166×2 DataFrame
    Row │ id       propmissing
        │ Int64    Float64
────────┼──────────────────────
      1 │       2     0.0
      2 │       3     0.25
      3 │       4     0.0
      4 │       6     0.0
      5 │       8     0.0
      6 │       9     0.333333
      7 │      12     1.0
      8 │      15     0.5
      9 │      16     0.0
   ⋮    │    ⋮          ⋮
 632159 │  999990     0.0
 632160 │  999991     0.0
 632161 │  999992     0.0
 632162 │  999993     0.0
 632163 │  999994     0.0
 632164 │  999996     0.0
 632165 │  999997     0.0
 632166 │ 1000000     0.0
            632149 rows omitted
If we now try to compute, e.g., the median value per group while skipping missing values, we fail:
julia> combine(groupby(df, :id), :value => median∘skipmissing)
ERROR: ArgumentError: median of an empty array is undefined, Int64[]
The solution, as discussed in this post, is to handle the case of an empty collection in a special way. I typically prefer the isempty check.
So first define a helper function:
julia> withempty(f, default) = x -> isempty(x) ? default : f(x)
withempty (generic function with 1 method)
and now we can write:
julia> combine(groupby(df, :id),
               :value => withempty(median, missing)∘skipmissing)
632166×2 DataFrame
    Row │ id       value_function_skipmissing
        │ Int64    Union{Missing, Float64}
────────┼─────────────────────────────────────
      1 │       2                         3.5
      2 │       3                         5.0
      3 │       4                         8.0
      4 │       6                         4.0
      5 │       8                         1.0
      6 │       9                         9.0
      7 │      12                   missing
      8 │      15                         3.0
      9 │      16                         7.0
   ⋮    │    ⋮                 ⋮
 632159 │  999990                         4.0
 632160 │  999991                         3.0
 632161 │  999992                         8.0
 632162 │  999993                         2.0
 632163 │  999994                         7.0
 632164 │  999996                         4.0
 632165 │  999997                         7.0
 632166 │ 1000000                         7.0
                           632149 rows omitted
As you can see, we get missing in row 7 for the group with :id value equal to 12, as expected.
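The same helper can be reused outside of DataFrames.jl and with statistics that take extra arguments. The snippet below is a small sketch of that; withempty is the helper from this post, while q10 is just an illustrative name:

```julia
using Statistics

# The helper defined earlier in the post.
withempty(f, default) = x -> isempty(x) ? default : f(x)

# A statistic that needs an extra argument can be wrapped in a closure first.
q10 = withempty(x -> quantile(x, 0.1), missing)

q10(Float64[])                                            # missing
withempty(median, missing)(skipmissing([missing]))        # missing
withempty(median, missing)(skipmissing([1, missing, 3]))  # 2.0
```

Since the default is an argument, the same pattern also works when a different flag value, such as NaN, is preferred over missing.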
If you have some thoughts about pros and cons of the approaches I discussed today please check out this issue and comment there. Thank you!
Update
After writing this post we released Missings.jl 1.1.0, which contains the emptymissing function that mostly resolves the issues discussed today. Now you can just write:
julia> using Missings
julia> using Statistics
julia> emptymissing(median)([])
missing