Introduction

When designing solutions for data analysis one is often faced with a tough choice between doing the correct thing and the convenient thing.

One particular case of such a situation is computing summary statistics of empty collections. The reason is that often such statistics are not properly defined for empty data (so Julia throws an error), but data scientist instead would want to get some flag value instead. Today I want to discuss typical cases of such situations and possible solutions.

The post was written under Julia 1.8.2 and DataFrames.jl 1.4.4.

Update: also Missings 1.1.0 is added at the very end of this post (make sure to check it out!).

Sum and product

In this case the situation is least problematic. Typically you will get what you expect (i.e. respectively zero or one of the domain of values you aggregate):

julia> sum(Int[])
0

julia> prod(Float64[])
1.0

julia> sum(skipmissing(Union{Int, Missing}[missing, missing]))
0

julia> prod(skipmissing(Union{Float64, Missing}[missing, missing]))
1.0

The only case that is problematic is when you want to work with an empty container of a too-wide element type:

julia> sum([])
ERROR: MethodError: no method matching zero(::Type{Any})

A standard solution, since both sum and prod are reductions is to provide an initialization value manually in this case:

julia> sum([], init=0)
0

julia> prod([], init=1.0)
1.0

Minimum and maximum

When computing minimum and maximum by default you get an error with empty collections:

julia> minimum(Int[])
ERROR: MethodError: reducing over an empty collection is not allowed;
consider supplying `init` to the reducer

As you can see, we get and error are prompted to pass the init value for the reduction, so the situation is less convenient.

Indeed, if such init value can be reasonably passed this is a good solution:

julia> minimum(Float64[], init=Inf)
Inf

julia> maximum(Int[], init=typemin(Int))
-9223372036854775808

or e.g. if we know we are working with values that must be in some range, we can provide this range. A common case is with probabilities:

julia> minimum([], init=1.0)
1.0

julia> maximum([], init=0.0)
0.0

Sometimes, however, we might want to have a special signal value. In this case you have two options. One is to check if the collection is empty, the other is to catch exception:

julia> x = Int[]
Int64[]

julia> isempty(x) ? missing : minimum(x)
missing

julia> try
           minimum(x)
       catch e
           isa(e, MethodError) ? missing : rethrow(e)
       end
missing

You could wrap both solutions with a function for convenience, if you use them often in your code. Their downside is that they add a bit of computational overhead. The isempty check in some cases is not a O(1) operation. The most common case is skipmissing. The try-catch approach introduces the cost of handling of the exception.

Extrema

In case of the extrema function the situation is analogous to minimum and maximum. The only difference is that you pass two values to init if you want to use this method. Here is an example assuming we are processing data that are probabilities:

julia> extrema(Float64[], init=(1.0, 0.0))
(1.0, 0.0)

Note that in this case minimum is greater than the maximum, so we can immediately see that the passed collection was empty.

Mean, variance, and standard deviation

When computing mean, var, or std, we get NaN when working with an empty collection:

julia> using Statistics

julia> mean(Int[])
NaN

julia> var(Float64[])
NaN

julia> std(Float64[])
NaN

This is expected, as we are performing division by zero in their computation. Also, similarly to sum and prod, when the collection has a too-wide element type we get an error:

julia> mean([])
ERROR: MethodError: no method matching zero(::Type{Any})

If we do not like this default behavior and want to handle an empty collection in a special case checking if the container is empty is a standard solution:

julia> x = Float64[]
Float64[]

julia> isempty(x) ? missing : var(x)
missing

Median and quantile

Computing quantiles, and median in particular, is the least convenient case as for them we always get an error and cannot use init value (as they are not reductions):

julia> median(Int[])
ERROR: ArgumentError: median of an empty array is undefined, Int64[]

julia> quantile(Float64[], 0.1)
ERROR: ArgumentError: empty data vector

Here, currently, the only solution is to either check if the collection is empty or catch the exception:

julia> x = Float64[]
Float64[]

julia> isempty(x) ? missing : median(x)
missing

julia> try
           quantile(x, 0.1)
       catch e
           isa(e, ArgumentError) ? missing : rethrow(e)
       end
missing

Conclusions

My post today was meant to be a quick reference for Julia users who sometimes hit these issues when working with their data. My experience is that the most common scenario of this kind is connected with missing data. Here is a typical problematic case:

julia> using DataFrames

julia> using Random

julia> Random.seed!(1234);

julia> df = DataFrame(id=rand(1:10^6, 10^6),
                      value=rand([1:10; missing], 10^6))
1000000×2 DataFrame
     Row │ id      value
         │ Int64   Int64?
─────────┼─────────────────
       1 │ 325977        4
       2 │ 549052        9
       3 │ 218587        9
       4 │ 894246        8
       5 │ 353112        1
       6 │ 394256       10
       7 │ 953125  missing
       8 │ 795547        5
       9 │ 494250        1
    ⋮    │   ⋮        ⋮
  999993 │ 967428        9
  999994 │ 557085        1
  999995 │ 353965        5
  999996 │ 590548       10
  999997 │ 657727        2
  999998 │ 928733        3
  999999 │ 884126  missing
 1000000 │ 587503        2
        999983 rows omitted

I on purpose generated the data in a way that has quite a few missing values:

julia> combine(groupby(df, :value), proprow)
11×2 DataFrame
 Row │ value    proprow
     │ Int64?   Float64
─────┼───────────────────
   1 │       1  0.091109
   2 │       2  0.091387
   3 │       3  0.091394
   4 │       4  0.090954
   5 │       5  0.090504
   6 │       6  0.091412
   7 │       7  0.090809
   8 │       8  0.090844
   9 │       9  0.090254
  10 │      10  0.090795
  11 │ missing  0.090538

Now notice that some groups will only have missing values (in the output below group with :id equal to 12 in row 7 is such a case):

julia> combine(groupby(df, :id),
               :value => (x -> mean(ismissing, x)) => :propmissing)
632166×2 DataFrame
    Row │ id       propmissing
        │ Int64    Float64
────────┼──────────────────────
      1 │       2     0.0
      2 │       3     0.25
      3 │       4     0.0
      4 │       6     0.0
      5 │       8     0.0
      6 │       9     0.333333
      7 │      12     1.0
      8 │      15     0.5
      9 │      16     0.0
   ⋮    │    ⋮          ⋮
 632159 │  999990     0.0
 632160 │  999991     0.0
 632161 │  999992     0.0
 632162 │  999993     0.0
 632163 │  999994     0.0
 632164 │  999996     0.0
 632165 │  999997     0.0
 632166 │ 1000000     0.0
            632149 rows omitted

If we now try to compute e.g. median value per group while skipping missing we fail:

julia> combine(groupby(df, :id), :value => median∘skipmissing)
ERROR: ArgumentError: median of an empty array is undefined, Int64[]

The solution, as we discussed in this post is to handle the case of an empty collection in a special way. I typically prefer the isempty check.

So first define a helper function:

julia> withempty(f, default) = x -> isempty(x) ? default : f(x)
withempty (generic function with 1 method)

and now we can write:

julia> combine(groupby(df, :id),
               :value => withempty(median, missing)∘skipmissing)
632166×2 DataFrame
    Row │ id       value_function_skipmissing
        │ Int64    Union{Missing, Float64}
────────┼─────────────────────────────────────
      1 │       2                         3.5
      2 │       3                         5.0
      3 │       4                         8.0
      4 │       6                         4.0
      5 │       8                         1.0
      6 │       9                         9.0
      7 │      12                   missing
      8 │      15                         3.0
      9 │      16                         7.0
   ⋮    │    ⋮                 ⋮
 632159 │  999990                         4.0
 632160 │  999991                         3.0
 632161 │  999992                         8.0
 632162 │  999993                         2.0
 632163 │  999994                         7.0
 632164 │  999996                         4.0
 632165 │  999997                         7.0
 632166 │ 1000000                         7.0
                           632149 rows omitted

As you can see we get missing in row 7 for group with :id value equal to 12 as expected.

If you have some thoughts about pros and cons of the approaches I discussed today please check out this issue and comment there. Thank you!

Update

After writing this post we released Missings.jl 1.1.0 which contains the emptymissing function that mostly resolves the issues discussed today. Now you can just write:

julia> using Missings

julia> using Statistics

julia> emptymissing(median)([])
missing