# Handling summary statistics of empty collections

# Introduction

When designing solutions for data analysis one is often faced with a tough choice between doing the correct thing and the convenient thing.

One particular case of such a situation is computing summary statistics of empty collections. The reason is that often such statistics are not properly defined for empty data (so Julia throws an error), but data scientist instead would want to get some flag value instead. Today I want to discuss typical cases of such situations and possible solutions.

The post was written under Julia 1.8.2 and DataFrames.jl 1.4.4.

Update: also Missings 1.1.0 is added at the very end of this post (make sure to check it out!).

# Sum and product

In this case the situation is least problematic. Typically you will get what you expect (i.e. respectively zero or one of the domain of values you aggregate):

```
julia> sum(Int[])
0
julia> prod(Float64[])
1.0
julia> sum(skipmissing(Union{Int, Missing}[missing, missing]))
0
julia> prod(skipmissing(Union{Float64, Missing}[missing, missing]))
1.0
```

The only case that is problematic is when you want to work with an empty container of a too-wide element type:

```
julia> sum([])
ERROR: MethodError: no method matching zero(::Type{Any})
```

A standard solution, since both `sum`

and `prod`

are reductions is to provide an
initialization value manually in this case:

```
julia> sum([], init=0)
0
julia> prod([], init=1.0)
1.0
```

# Minimum and maximum

When computing `minimum`

and `maximum`

by default you get an error with empty
collections:

```
julia> minimum(Int[])
ERROR: MethodError: reducing over an empty collection is not allowed;
consider supplying `init` to the reducer
```

As you can see, we get and error are prompted to pass the `init`

value for the
reduction, so the situation is less convenient.

Indeed, if such `init`

value can be reasonably passed this is a good solution:

```
julia> minimum(Float64[], init=Inf)
Inf
julia> maximum(Int[], init=typemin(Int))
-9223372036854775808
```

or e.g. if we know we are working with values that must be in some range, we can provide this range. A common case is with probabilities:

```
julia> minimum([], init=1.0)
1.0
julia> maximum([], init=0.0)
0.0
```

Sometimes, however, we might want to have a special signal value. In this case you have two options. One is to check if the collection is empty, the other is to catch exception:

```
julia> x = Int[]
Int64[]
julia> isempty(x) ? missing : minimum(x)
missing
julia> try
minimum(x)
catch e
isa(e, MethodError) ? missing : rethrow(e)
end
missing
```

You could wrap both solutions with a function for convenience, if you use them
often in your code. Their downside is that they add a bit of computational
overhead. The `isempty`

check in some cases is not a O(1) operation. The most
common case is `skipmissing`

. The `try`

-`catch`

approach introduces the cost of
handling of the exception.

# Extrema

In case of the `extrema`

function the situation is analogous to `minimum`

and
`maximum`

. The only difference is that you pass two values to `init`

if you
want to use this method. Here is an example assuming we are processing data
that are probabilities:

```
julia> extrema(Float64[], init=(1.0, 0.0))
(1.0, 0.0)
```

Note that in this case minimum is greater than the maximum, so we can immediately see that the passed collection was empty.

# Mean, variance, and standard deviation

When computing `mean`

, `var`

, or `std`

, we get `NaN`

when working with an empty
collection:

```
julia> using Statistics
julia> mean(Int[])
NaN
julia> var(Float64[])
NaN
julia> std(Float64[])
NaN
```

This is expected, as we are performing division by zero in their computation.
Also, similarly to `sum`

and `prod`

, when the collection has a too-wide element
type we get an error:

```
julia> mean([])
ERROR: MethodError: no method matching zero(::Type{Any})
```

If we do not like this default behavior and want to handle an empty collection in a special case checking if the container is empty is a standard solution:

```
julia> x = Float64[]
Float64[]
julia> isempty(x) ? missing : var(x)
missing
```

# Median and quantile

Computing quantiles, and median in particular, is the least convenient case
as for them we always get an error and cannot use `init`

value (as they are not
reductions):

```
julia> median(Int[])
ERROR: ArgumentError: median of an empty array is undefined, Int64[]
julia> quantile(Float64[], 0.1)
ERROR: ArgumentError: empty data vector
```

Here, currently, the only solution is to either check if the collection is empty or catch the exception:

```
julia> x = Float64[]
Float64[]
julia> isempty(x) ? missing : median(x)
missing
julia> try
quantile(x, 0.1)
catch e
isa(e, ArgumentError) ? missing : rethrow(e)
end
missing
```

# Conclusions

My post today was meant to be a quick reference for Julia users who sometimes hit these issues when working with their data. My experience is that the most common scenario of this kind is connected with missing data. Here is a typical problematic case:

```
julia> using DataFrames
julia> using Random
julia> Random.seed!(1234);
julia> df = DataFrame(id=rand(1:10^6, 10^6),
value=rand([1:10; missing], 10^6))
1000000×2 DataFrame
Row │ id value
│ Int64 Int64?
─────────┼─────────────────
1 │ 325977 4
2 │ 549052 9
3 │ 218587 9
4 │ 894246 8
5 │ 353112 1
6 │ 394256 10
7 │ 953125 missing
8 │ 795547 5
9 │ 494250 1
⋮ │ ⋮ ⋮
999993 │ 967428 9
999994 │ 557085 1
999995 │ 353965 5
999996 │ 590548 10
999997 │ 657727 2
999998 │ 928733 3
999999 │ 884126 missing
1000000 │ 587503 2
999983 rows omitted
```

I on purpose generated the data in a way that has quite a few `missing`

values:

```
julia> combine(groupby(df, :value), proprow)
11×2 DataFrame
Row │ value proprow
│ Int64? Float64
─────┼───────────────────
1 │ 1 0.091109
2 │ 2 0.091387
3 │ 3 0.091394
4 │ 4 0.090954
5 │ 5 0.090504
6 │ 6 0.091412
7 │ 7 0.090809
8 │ 8 0.090844
9 │ 9 0.090254
10 │ 10 0.090795
11 │ missing 0.090538
```

Now notice that some groups will only have `missing`

values (in the output below
group with `:id`

equal to `12`

in row 7 is such a case):

```
julia> combine(groupby(df, :id),
:value => (x -> mean(ismissing, x)) => :propmissing)
632166×2 DataFrame
Row │ id propmissing
│ Int64 Float64
────────┼──────────────────────
1 │ 2 0.0
2 │ 3 0.25
3 │ 4 0.0
4 │ 6 0.0
5 │ 8 0.0
6 │ 9 0.333333
7 │ 12 1.0
8 │ 15 0.5
9 │ 16 0.0
⋮ │ ⋮ ⋮
632159 │ 999990 0.0
632160 │ 999991 0.0
632161 │ 999992 0.0
632162 │ 999993 0.0
632163 │ 999994 0.0
632164 │ 999996 0.0
632165 │ 999997 0.0
632166 │ 1000000 0.0
632149 rows omitted
```

If we now try to compute e.g. median value per group while skipping missing we fail:

```
julia> combine(groupby(df, :id), :value => median∘skipmissing)
ERROR: ArgumentError: median of an empty array is undefined, Int64[]
```

The solution, as we discussed in this post is to handle the case of an empty
collection in a special way. I typically prefer the `isempty`

check.

So first define a helper function:

```
julia> withempty(f, default) = x -> isempty(x) ? default : f(x)
withempty (generic function with 1 method)
```

and now we can write:

```
julia> combine(groupby(df, :id),
:value => withempty(median, missing)∘skipmissing)
632166×2 DataFrame
Row │ id value_function_skipmissing
│ Int64 Union{Missing, Float64}
────────┼─────────────────────────────────────
1 │ 2 3.5
2 │ 3 5.0
3 │ 4 8.0
4 │ 6 4.0
5 │ 8 1.0
6 │ 9 9.0
7 │ 12 missing
8 │ 15 3.0
9 │ 16 7.0
⋮ │ ⋮ ⋮
632159 │ 999990 4.0
632160 │ 999991 3.0
632161 │ 999992 8.0
632162 │ 999993 2.0
632163 │ 999994 7.0
632164 │ 999996 4.0
632165 │ 999997 7.0
632166 │ 1000000 7.0
632149 rows omitted
```

As you can see we get `missing`

in row 7 for group with `:id`

value equal to
`12`

as expected.

If you have some thoughts about pros and cons of the approaches I discussed today please check out this issue and comment there. Thank you!

# Update

After writing this post we released Missings.jl 1.1.0 which contains
the `emptymissing`

function that mostly resolves the issues discussed today.
Now you can just write:

```
julia> using Missings
julia> using Statistics
julia> emptymissing(median)([])
missing
```