Is case_when needed in DataFrames.jl?
Introduction
Recently I received a very interesting question regarding transforming data
using the DataFrames.jl. One of the users wanted to know if we have
a functionality similar to the case_when
function in dplyr. When trying
to answer it I came to the conclusion that we do not need it that we can
reproduce it using the ⋅ ? ⋅ : ⋅
ternary operator.
In this post I will be reproducing selected examples from the documentation of
case_when
.
Reproducing dplyr examples
In the examples I will first show R code and then show an Julia code.
I R examples I assume that dplyr is loaded. The Julia examples were tested under Julia 1.5.3, DataFrames.jl 0.22.2, DataFramesMeta.jl v0.6.0, HTTP.jl v0.8.19, JSON3.jl v1.5.1, and Pipe.jl v1.3.0.
Example 1
This is the most basic case_when
usage scenario.
R code:
> library(dplyr)
> x <- 1:10
> case_when(
+ x %% 35 == 0 ~ "fizz buzz",
+ x %% 5 == 0 ~ "fizz",
+ x %% 7 == 0 ~ "buzz",
+ TRUE ~ as.character(x)
+ )
[1] "1" "2" "3" "4" "fizz" "6" "buzz" "8" "9" "fizz"
Julia code:
julia> x = 1:10
1:10
julia> (x -> x % 6 == 0 ? "fizz buzz" :
x % 2 == 0 ? "fizz" :
x % 3 == 0 ? "buzz" :
string(x)).(x)
10-element Array{String,1}:
"1"
"fizz"
"buzz"
"fizz"
"5"
"fizz buzz"
"7"
"fizz"
"buzz"
"fizz"
In this basic example note the following things:
- in the Julia code we do not need to load any package; we are using the functionality in built into the language;
- we create an anonymous function that is then broadcasted over an input vector
using the
.
operator; - both codes look almost the same, apart from a bit different punctuation.
In summary – in my opinion the basic use case shows that the ternary operator
is as convenient as case_when
from dplyr.
Example 2
In this example missing values are introduced. We are reusing the vector created in the previous exercise.
R code:
> x[2:4] <- NA_real_
> case_when(
+ x %% 35 == 0 ~ "fizz buzz",
+ x %% 5 == 0 ~ "fizz",
+ x %% 7 == 0 ~ "buzz",
+ is.na(x) ~ "nope",
+ TRUE ~ as.character(x)
+ )
[1] "1" "nope" "nope" "nope" "fizz" "6" "buzz" "8" "9" "fizz"
Julia code (two variants):
julia> x = [2 <= i <= 4 ? missing : x[i] for i in axes(x, 1)]
10-element Array{Union{Missing, Int64},1}:
1
missing
missing
missing
5
6
7
8
9
10
julia> (x -> isequal(x % 6, 0) ? "fizz buzz" :
isequal(x % 2, 0) ? "fizz" :
isequal(x % 3, 0) ? "buzz" :
ismissing(x) ? "nope" :
string(x)).(x)
10-element Array{String,1}:
"1"
"nope"
"nope"
"nope"
"5"
"fizz buzz"
"7"
"fizz"
"buzz"
"fizz"
julia> (x -> coalesce(x % 6 == 0, false) ? "fizz buzz" :
coalesce(x % 2 == 0, false) ? "fizz" :
coalesce(x % 3 == 0, false) ? "buzz" :
ismissing(x) ? "nope" :
string(x)).(x)
10-element Array{String,1}:
"1"
"nope"
"nope"
"nope"
"5"
"fizz buzz"
"7"
"fizz"
"buzz"
"fizz"
additionally note that the code can be simplified if we put the ismissing
condition first:
julia> (x -> ismissing(x) ? "nope" :
x % 6 == 0 ? "fizz buzz" :
x % 2 == 0 ? "fizz" :
x % 3 == 0 ? "buzz" :
string(x)).(x)
10-element Array{String,1}:
"1"
"nope"
"nope"
"nope"
"5"
"fizz buzz"
"7"
"fizz"
"buzz"
"fizz"
Note the following patterns in this example:
- we had to materialize the vector in Julia in a bit more complex way as the
initial
x
vector was a1:10
range which was read only; - in R comparison to
missing
is treated as failing by default; on the other hand Julia is strict about boolean tests and one has to use either theisequal
or thecoalesce
functions to handle missing values (or move theismissing
test to the top); this strictness introduces a bit more verbosity in the code at the benefit of allowing the user to catch potential bugs in the logic of the code more easily.
Example 3
In this example we use the starwars
dataset that is shipped with dplyr. So we
first have to fetch it from the Internet in Julia. Here is the code that does
the trick:
julia> using DataFrames
julia> using HTTP
julia> using JSON3
julia> using Pipe
julia> starwars = @pipe HTTP.get("https://swapi.dev/api/people/").body |>
JSON3.read |> _.results |> DataFrame |>
transform(_,
:species => ByRow(x -> isempty(x) ? "Human" : "Droid"),
[:height, :mass] .=> ByRow(x -> parse(Int, x)),
renamecols=false)
10×16 DataFrame
Row │ name height mass hair_color skin_color eye_color ⋯
│ String Int64 Int64 String String String ⋯
─────┼───────────────────────────────────────────────────────────────────────────
1 │ Luke Skywalker 172 77 blond fair blue ⋯
2 │ C-3PO 167 75 n/a gold yellow
3 │ R2-D2 96 32 n/a white, blue red
4 │ Darth Vader 202 136 none white yellow
5 │ Leia Organa 150 49 brown light brown ⋯
6 │ Owen Lars 178 120 brown, grey light blue
7 │ Beru Whitesun lars 165 75 brown light blue
8 │ R5-D4 97 32 n/a white, red red
9 │ Biggs Darklighter 183 84 black light brown ⋯
10 │ Obi-Wan Kenobi 182 77 auburn, white fair blue-gray
10 columns omitted
We have fetched only 10 rows of data for the analysis (this is the number of observations that the exposed API produces), but it is enough for our purposes.
As a side note – observe how easy it is in JuliaData ecosystem to fetch a JSON
file from the Internet, parse it, populate a DataFrame
, and finally do some
column preprocessing to get the right column types for data that we are
interested it analyzing later.
Let us move to the example. In this case we want to process more than one column
using the case_when
function within a data transformation pipeline.
R code
> starwars %>%
+ select(name:mass, gender, species) %>%
+ mutate(
+ type = case_when(
+ height > 200 | mass > 200 ~ "large",
+ species == "Droid" ~ "robot",
+ TRUE ~ "other"
+ )
+ )
# A tibble: 87 x 6
name height mass gender species type
<chr> <int> <dbl> <chr> <chr> <chr>
1 Luke Skywalker 172 77 masculine Human other
2 C-3PO 167 75 masculine Droid robot
3 R2-D2 96 32 masculine Droid robot
4 Darth Vader 202 136 masculine Human large
5 Leia Organa 150 49 feminine Human other
6 Owen Lars 178 120 masculine Human other
7 Beru Whitesun lars 165 75 feminine Human other
8 R5-D4 97 32 masculine Droid robot
9 Biggs Darklighter 183 84 masculine Human other
10 Obi-Wan Kenobi 182 77 masculine Human other
# … with 77 more rows
Julia code
julia> @pipe starwars |>
select(_, Between(:name, :mass), :gender, :species) |>
transform(_, [:height, :mass, :species] =>
ByRow((height, mass, species) ->
height > 200 || mass > 200 ? "large" :
species == "Droid" ? "robot" :
"other") =>
:type)
10×6 DataFrame
Row │ name height mass gender species type
│ String Int64 Int64 String String String
─────┼────────────────────────────────────────────────────────────
1 │ Luke Skywalker 172 77 male Human other
2 │ C-3PO 167 75 n/a Droid robot
3 │ R2-D2 96 32 n/a Droid robot
4 │ Darth Vader 202 136 male Human large
5 │ Leia Organa 150 49 female Human other
6 │ Owen Lars 178 120 male Human other
7 │ Beru Whitesun lars 165 75 female Human other
8 │ R5-D4 97 32 n/a Droid robot
9 │ Biggs Darklighter 183 84 male Human other
10 │ Obi-Wan Kenobi 182 77 male Human other
or if you like using DataFramesMeta.jl:
julia> using DataFramesMeta
julia> @pipe starwars |>
select(_, Between(:name, :mass), :gender, :species) |>
@eachrow _ begin
@newcol type::Vector{String}
:type = :height > 200 || :mass > 200 ? "large" :
:species == "Droid" ? "robot" :
"other"
end
10×6 DataFrame
Row │ name height mass gender species type
│ String Int64 Int64 String String String
─────┼────────────────────────────────────────────────────────────
1 │ Luke Skywalker 172 77 male Human other
2 │ C-3PO 167 75 n/a Droid robot
3 │ R2-D2 96 32 n/a Droid robot
4 │ Darth Vader 202 136 male Human large
5 │ Leia Organa 150 49 female Human other
6 │ Owen Lars 178 120 male Human other
7 │ Beru Whitesun lars 165 75 female Human other
8 │ R5-D4 97 32 n/a Droid robot
9 │ Biggs Darklighter 183 84 male Human other
10 │ Obi-Wan Kenobi 182 77 male Human other
As you can see it is easy to use the ternary operator also in the case of
several variables. Using DataFrames.jl requires a bit of boilerplate syntax.
This limitation can be conveniently overcome using DataFramesMeta.jl, in the
above example I decided to use the @eachrow
macro.
Conclusions
As you can see using the ternary operator in Julia gives us a very similar
functionality and syntax in comparison to the case_when
function from dplyr.
Apart from the differences how missing values are handled I have discussed above
there are two features that make the solution in Julia more convenient in my
opinion:
- in
case_when
all values on right hand side have to have the same type, while in Julia there is no such restriction; case_when
evaluates all right hand side expressions, while the ternary operator evaluates only what has to be evaluated to determine the result of the operation (this is often preferred when some operations may throw an error for certain values of their arguments).