Introduction

Some time ago I have written a post about ABC of handling missing values in Julia. Its objective was to give an introduction to the topic for the newcomers. However, occasionally users complain that working with missing values in Julia is less convenient than in e.g. Python or R.

Such opinions are always debatable, so recently I decided to run a small pool on Julia Discourse about the skipmissing function. The question was if we want to shorten the skipmissing name into something that is more convenient to use in interactive work. To my surprise, a vast majority of voters preferred a verbose and explicit operation name. This preference regarding handling of missings, to my surprise, reminded me of several passages from The Zen of Python:

  • Explicit is better than implicit.
  • Readability counts.
  • In the face of ambiguity, refuse the temptation to guess.

So given this preference, how should Julia users retain convenience. Let me share some of my thoughts on this topic that did not make into my previous post.

This post was written under Julia 1.7.2, Missings.jl 1.0.2, MissingsAsFalse.jl 0.1, and StatsBase.jl 0.33.16.

Verbosity

Indeed writing missing, skipmissing, and passmissing (the last one is defined in Missings.jl), in places where they are needed, might seem verbose. However, the good news is that most of the time you do not have to type them fully thanks to completions so in your editor/REPL:

  • instead of missing write mis<tab>;
  • instead of skipmissing write skipm<tab>;
  • instead of passmissing write pas<tab>.

Let us compare. In R:

sum(c(1, NA), na.rm=T)

vs Julia:

sum(skipmissing([1, missing]))

seems longer. However, if you take into account the amount of typing you need to do (number of keystrokes) it is the same.

Missing values in logical conditions

As I have written in this post I personally strongly recommend using coalesce to handle missing values in logical conditions. This allows you, to follow the Explicit is better than implicit. principle by explicitly showing in the code if missing should be treated as true or as false.

Here is a short example:

julia> c = missing
missing

julia> c ? "true" : "false"
ERROR: TypeError: non-boolean (Missing) used in boolean context

julia> coalesce(c, false) ? "true" : "false"
"false"

However, some users find it more convenient to use @mfalse macro from the MissingsAsFalse.jl package:

julia> using MissingsAsFalse

julia> @mfalse c ? "true" : "false"
"false"

Correlation matrix with missing values

A common, and relatively complex case of handling missing values, is computing of correlation matrix of data that contains missings. Let us check what Julia offers here. We will use the pairwise function from StatsBase.jl.

julia> using Random

julia> using StatsBase

julia> using Statistics

julia> Random.seed!(1234);

julia> x = rand([1:10; missing], 16, 4)
16×4 Matrix{Union{Missing, Int64}}:
  4          missing   2         10
  7         8           missing   8
  3          missing   7          5
 10         9          8         10
  4         2          7          6
  5         6          1           missing
   missing  7          8          9
  9         3          2          6
  6         7         10           missing
  9         3          4          6
  7         2          9          8
  9         7          3          3
  1         8          6          5
  3         9          4          7
  5         7          3          2
  8         5           missing   4

julia> pairwise(cor, eachcol(x))
4×4 Matrix{Union{Missing, Float64}}:
 1.0        missing   missing   missing
  missing  1.0        missing   missing
  missing   missing  1.0        missing
  missing   missing   missing  1.0

julia> pairwise(cor, eachcol(x), skipmissing=:pairwise)
4×4 Matrix{Float64}:
  1.0        -0.218849    -0.0177704   0.0922413
 -0.218849    1.0          0.00973122  0.102969
 -0.0177704   0.00973122   1.0         0.364821
  0.0922413   0.102969     0.364821    1.0

julia> pairwise(cor, eachcol(x), skipmissing=:listwise)
4×4 Matrix{Float64}:
  1.0       -0.229568   -0.100028   0.247283
 -0.229568   1.0        -0.127153  -0.0420058
 -0.100028  -0.127153    1.0        0.67068
  0.247283  -0.0420058   0.67068    1.0

By default pairwise for cor returns missing when at least one of the columns contains missing values. Use :pairwise value of skipmissing keyword argument to skip entries with a missing value in either of the two vectors passed to cor and use :listwise to skip entries with a missing value in any of the vectors passed to pairwise.

In this example we see that since there are several ways how missing values should be handled by cor in pairwise computations, instead of using the skipmissing function, a keyword argument is used allowing to specify what the data scientist wants exactly.

A similar situation is in the subset and subset! functions from DataFrames.jl, that also take a skipmissing keyword argument, to simplify handling of logical conditions specifying which rows from a source data frame should be kept.

Conclusions

In Julia the approach is that handling of missing values is explicit. This choice is guided by the fact that then in the code it is explicitly visible what was the developer’s decision about how they should be treated. This choice makes code more verbose. Today I have tried to show that, especially with a good editor/REPL support, in practice it does not introduce a large overhead.