Comparing dplyr vs DataFrames.jl
Introduction
This time the post is inspired by the proposal of Andrey Oskin
(thank you for submitting it, below I have adapted business problem description
and dplyr
source codes that Andrey provided).
Andrey shared with me typical tasks that he is faced with when doing logs
analysis. To make things concrete, assume that you have a site and you collect
users’ clicks. In the output of this process you get a table with two fields:
time of click (ts
column below, measured in seconds) and the user identifier
(user_id
column below).
Given such data there are natural business questions, that we can ask, like:
- How many sessions an average user has?
- How many users have exactly two sessions?
- Find top 10 users, ordered by the descending number of sessions?
- What is the average time between sessions start?
Session in these questions is a more or less arbitrary thing, usually, it has a meaning of sequence of events that come together as there is a short time difference between consecutive events. In the examples we assume that if a user has not clicked on our site for 900 seconds after the last click the session is over.
What I do in this post is take a toy data set that has this structure and
dplyr
codes that Andrey shared with me that answer the business questions
presented above and rewrite them to DataFrames.jl.
The objective of this post is to compare the syntaxes of dplyr
and
DataFrames.jl. Therefore neither dplyr
nor DataFrames.jl codes were tuned
to be optimal. Rather I have just taken what Andrey proposed in dplyr
and
translated it to DataFrames.jl in a way that first came to my mind (but trying
to use piping). However, in the last part of the post I out of curiosity I
decided compare the performance of the codes.
All codes were tested under R version 4.0.2 and dplyr 1.0.0. For Julia I used version 1.5.0-rc1.0 and packages: DataFrames.jl 0.21.4, Pipe.jl 1.3.0, and ShiftedArrays 1.0.0. If you do not have much experience with setting-up Julia project environments, in this post I give a simple recipe how you can do it easily while ensuring you use exactly the same versions of the packages as I do.
Setting up the stage
In the first step we load the required packages, create a data frame that
will be used later and sort it by the ts
column.
In all examples in this post I first present R code, and then Julia code. The expected output is shown in a comment. After each step I briefly comment on the Julia code.
dplyr
library(dplyr)
df <- data.frame(ts = c(1,10,20,1000,1010,1200,2200,2220,30,500,1500,1600),
user_id = c(1,1,1,1,1,1,1,1,2,2,2,2)) %>%
arrange(ts)
df
# ts user_id
# 1 1 1
# 2 10 1
# 3 20 1
# 4 30 2
# 5 500 2
# 6 1000 1
# 7 1010 1
# 8 1200 1
# 9 1500 2
# 10 1600 2
# 11 2200 1
# 12 2220 1
DataFrames.jl
using DataFrames
using Pipe
using ShiftedArrays
using Statistics
df = @pipe DataFrame!(ts = [1,10,20,1000,1010,1200,2200,2220,30,500,1500,1600],
user_id = [1,1,1,1,1,1,1,1,2,2,2,2]) |>
sort(_, :ts)
# 12×2 DataFrame
# │ Row │ ts │ user_id │
# │ │ Int64 │ Int64 │
# ├─────┼───────┼─────────┤
# │ 1 │ 1 │ 1 │
# │ 2 │ 10 │ 1 │
# │ 3 │ 20 │ 1 │
# │ 4 │ 30 │ 2 │
# │ 5 │ 500 │ 2 │
# │ 6 │ 1000 │ 1 │
# │ 7 │ 1010 │ 1 │
# │ 8 │ 1200 │ 1 │
# │ 9 │ 1500 │ 2 │
# │ 10 │ 1600 │ 2 │
# │ 11 │ 2200 │ 1 │
# │ 12 │ 2220 │ 1 │
In this step I used two things that are worth learning:
- A
@pipe
macro from the Pipes.jl package allows to pass result of the left hand side of|>
to the right hand side in the position where_
is placed. In this case_
is a first argument tosort
. - I used
DataFrame!
constructor; the!
in this case means that columns passed to a freshly constructed data frame are not copied (by defaultDataFrame
constructor copies passed columns for safety).
Compute session identifier for each row of data
So the first task is to identify sessions in our data. For each user
a session_id
column gives a number of session for this user, starting from
zero. Remember, that we assume that a fresh session starts for some user, if
two consecutive events for this user are separated by at least 900 seconds.
dplyr
session_df <- df %>%
group_by(user_id) %>%
mutate(prev_ts = lag(ts)) %>%
mutate(diff_ts = ts - prev_ts) %>%
mutate(diff_ts = ifelse(is.na(diff_ts), 0, diff_ts)) %>%
mutate(session_start = ifelse(diff_ts >= 900, 1, 0)) %>%
mutate(session_id = cumsum(session_start)) %>%
select(-prev_ts, -diff_ts, -session_start)
session_df
# # A tibble: 12 x 3
# # Groups: user_id [2]
# ts user_id session_id
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 10 1 0
# 3 20 1 0
# 4 30 2 0
# 5 500 2 0
# 6 1000 1 1
# 7 1010 1 1
# 8 1200 1 1
# 9 1500 2 1
# 10 1600 2 1
# 11 2200 1 2
# 12 2220 1 2
DataFrames.jl
session_df = @pipe df |>
groupby(_, :user_id) |>
combine(:ts => ts -> begin
prev_ts = lag(ts)
diff_ts = ts .- prev_ts
diff_ts = coalesce.(diff_ts, 0)
session_start = diff_ts .> 900
session_id = cumsum(session_start)
return (ts=ts, session_id=session_id)
end, _, ungroup=false)
# GroupedDataFrame with 2 groups based on key: user_id
# First Group (8 rows): user_id = 1
# │ Row │ user_id │ ts │ session_id │
# │ │ Int64 │ Int64 │ Int64 │
# ├─────┼─────────┼───────┼────────────┤
# │ 1 │ 1 │ 1 │ 0 │
# │ 2 │ 1 │ 10 │ 0 │
# │ 3 │ 1 │ 20 │ 0 │
# │ 4 │ 1 │ 1000 │ 1 │
# │ 5 │ 1 │ 1010 │ 1 │
# │ 6 │ 1 │ 1200 │ 1 │
# │ 7 │ 1 │ 2200 │ 2 │
# │ 8 │ 1 │ 2220 │ 2 │
# ⋮
# Last Group (4 rows): user_id = 2
# │ Row │ user_id │ ts │ session_id │
# │ │ Int64 │ Int64 │ Int64 │
# ├─────┼─────────┼───────┼────────────┤
# │ 1 │ 2 │ 30 │ 0 │
# │ 2 │ 2 │ 500 │ 0 │
# │ 3 │ 2 │ 1500 │ 1 │
# │ 4 │ 2 │ 1600 │ 1 │
Now I could have rewritten the dpyr
code to DataFrames.jl in many ways, but
a most natural thing to do it was for me to use the following syntax:
combine(source_column => transformation_function, grouped_data_frame)
With this approach I can conveniently define an anonymous function
within a begin
-end
block and return a (ts=ts, session_id=session_id)
value
that is a NamedTuple
and will get expanded into two columns of a data frame.
I use ungroup=false
syntax to keep the result a GroupedDataFrame
to match
what we get in dplyr
.
Also, in the code of the function I use the lag
function from ShiftedArrays.jl.
Now we have all information to answer our business questions.
How many sessions an average user has?
dplyr
session_df %>%
summarize(session_num = max(session_id) + 1) %>%
ungroup() %>%
summarize(avg_sessions_per_user = sum(session_num) / nrow(.))
# # A tibble: 1 x 1
# avg_sessions_per_user
# <dbl>
# 1 2.5
DataFrames.jl
@pipe session_df |>
combine(_, :session_id => (x -> maximum(x) + 1) => :session_num) |>
combine(_, :session_num => mean => :avg_sessions_per_user)
# 1×1 DataFrame
# │ Row │ avg_sessions_per_user │
# │ │ Float64 │
# ├─────┼───────────────────────┤
# │ 1 │ 2.5 │
Observe, that in the DataFrames.jl code the first combine
is applied
to GroupedDataFrame
while the second combine
is applied to a DataFrame
.
How many users have exactly two sessions?
dplyr
session_df %>%
summarize(session_num = max(session_id) + 1) %>%
filter(session_num == 2) %>%
summarize(number_of_two_session_users = nrow(.))
# # A tibble: 1 x 1
# number_of_two_session_users
<int>
1 1
DataFrames.jl
@pipe session_df |>
combine(_, :session_id => (x -> maximum(x) + 1) => :session_num) |>
filter(:session_num => ==(2), _) |>
DataFrame(number_of_two_session_users = nrow(_))
# 1×1 DataFrame
# │ Row │ number_of_two_session_users │
# │ │ Int64 │
# ├─────┼─────────────────────────────┤
# │ 1 │ 1 │
In this code observe that :session_num => ==(2)
syntax means that in the
filter
function we pass each element of :session_num
column to ==(2)
function, which is a curried version of a standard x == 2
comparison.
Find top 10 users, ordered by the descending number of sessions?
dplyr
session_df %>%
summarize(session_num = max(session_id) + 1) %>%
ungroup() %>%
arrange(desc(session_num)) %>%
mutate(rn = row_number()) %>%
filter(rn <= 10) %>%
select(-rn)
# # A tibble: 2 x 2
# user_id session_num
# <dbl> <dbl>
# 1 1 3
# 2 2 2
DataFrames.jl
@pipe session_df |>
combine(_, :session_id => (x -> maximum(x) + 1) => :session_num) |>
sort(_, :session_num, rev=true) |>
first(_, 10)
# 2×2 DataFrame
# │ Row │ user_id │ session_num │
# │ │ Int64 │ Int64 │
# ├─────┼─────────┼─────────────┤
# │ 1 │ 1 │ 3 │
# │ 2 │ 2 │ 2 │
Here note that in the :session_id => (x -> maximum(x) + 1) => :session_num
expression we have to wrap x -> maximum(x) + 1
in parentheses to get the
correct result (if you would omit it => :session_num
would be treated as a
part of an anonymous function definition).
What is the average time between sessions start?
dplyr
session_df %>%
arrange(ts) %>%
group_by(user_id, session_id) %>%
mutate(rn = row_number()) %>%
filter(rn == 1) %>%
group_by(user_id) %>%
mutate(prev_start = lag(ts)) %>%
filter(!is.na(prev_start)) %>%
mutate(sess_diff = ts - prev_start) %>%
ungroup() %>%
summarize(avg_session_starts = mean(sess_diff))
# # A tibble: 1 x 1
# avg_session_starts
# <dbl>
# 1 1223
DataFrames.jl
@pipe session_df |>
parent |>
sort(_, :ts) |>
groupby(_, [:user_id, :session_id]) |>
combine(first, _) |>
groupby(_, :user_id) |>
combine(_, :ts => diff => :sess_diff) |>
combine(_, :sess_diff => mean => :avg_session_starts)
# 1×1 DataFrame
# │ Row │ avg_session_starts │
# │ │ Float64 │
# ├─────┼────────────────────┤
# │ 1 │ 1223.0 │
Here we show that you can use the parent
function to get access to the data
frame of which GroupedDataFrame
is a view. Also a common pattern is
combine(first, _)
to extract the first row of each group in a
GroupedDataFrame
.
Scaling the computations to a larger input data set
In this part I want to check the performance of dplyr
and DataFrames.jl codes
that we have just discussed. For this I want to replicate df
5,000,000 times.
In order to avoid having only users number 1 and 2 (and thus to have more groups
to analyze), we will want to create new user ids in an arbitrary fashion.
We will want to have a look what is timing of the considered operations.
dplyr
time_test <- function(df) {
n <- 5*10^6
print(system.time(df <- df %>%
slice(rep(row_number(), n)) %>%
mutate(user_id = user_id + rep(1:n, each=nrow(df))) %>%
arrange(ts)
))
print(system.time(
session_df <- df %>% group_by(user_id) %>% mutate(prev_ts = lag(ts)) %>%
mutate(diff_ts = ts - prev_ts) %>%
mutate(diff_ts = ifelse(is.na(diff_ts), 0, diff_ts)) %>%
mutate(session_start = ifelse(diff_ts >= 900, 1, 0)) %>%
mutate(session_id = cumsum(session_start)) %>%
select(-prev_ts, -diff_ts, -session_start)
))
print(system.time(
session_df %>%
summarize(session_num = max(session_id) + 1) %>%
ungroup() %>%
summarize(avg_sessions_per_user = sum(session_num)/nrow(.))
))
print(system.time(
session_df %>%
summarize(session_num = max(session_id) + 1) %>%
filter(session_num == 1) %>%
summarize(number_of_one_session_users = nrow(.))
))
print(system.time(
session_df %>%
summarize(session_num = max(session_id) + 1) %>%
ungroup() %>%
arrange(desc(session_num)) %>%
mutate(rn = row_number()) %>%
filter(rn <= 10) %>%
select(-rn)
))
print(system.time(
session_df %>%
arrange(ts) %>%
group_by(user_id, session_id) %>%
mutate(rn = row_number()) %>%
filter(rn == 1) %>%
group_by(user_id) %>%
mutate(prev_start = lag(ts)) %>%
filter(!is.na(prev_start)) %>%
mutate(sess_diff = ts - prev_start) %>%
ungroup() %>%
summarize(avg_session_starts = mean(sess_diff))
))
}
time_test(df)
# user system elapsed
# 6.909 2.099 9.009
# user system elapsed
# 222.398 2.159 224.567
# `summarise()` ungrouping output (override with `.groups` argument)
# user system elapsed
# 6.115 0.000 6.116
# `summarise()` ungrouping output (override with `.groups` argument)
# user system elapsed
# 6.842 0.000 6.843
# `summarise()` ungrouping output (override with `.groups` argument)
# user system elapsed
# 7.409 0.000 7.409
# user system elapsed
# 273.404 2.156 275.569
DataFrames.jl
function time_test(df)
n = 5*10^6
@time df = @pipe repeat(df, n) |>
setindex!(_, _.user_id + repeat(1:n, inner=nrow(df)), !, :user_id) |>
sort(_, :ts)
@time session_df = @pipe df |>
groupby(_, :user_id) |>
combine(:ts => ts -> begin
prev_ts = lag(ts)
diff_ts = ts .- prev_ts
diff_ts = coalesce.(diff_ts, 0)
session_start = diff_ts .> 900
session_id = cumsum(session_start)
return (ts=ts, session_id=session_id)
end, _, ungroup=false)
@time @pipe session_df |>
combine(_, :session_id => (x -> maximum(x) + 1) => :session_num) |>
combine(_, :session_num => mean => :avg_sessions_per_user)
@time @pipe session_df |>
combine(_, :session_id => (x -> maximum(x) + 1) => :session_num) |>
filter(:session_num => ==(1), _) |>
DataFrame(avg_sessions_per_user = nrow(_))
@time @pipe session_df |>
combine(_, :session_id => (x -> maximum(x) + 1) => :session_num) |>
sort(_, :session_num, rev=true) |>
first(_, 10)
@time @pipe session_df |>
parent |>
sort(_, :ts) |>
groupby(_, [:user_id, :session_id]) |>
combine(first, _) |>
groupby(_, :user_id) |>
combine(_, :ts => diff => :sess_diff) |>
combine(_, :sess_diff => mean => :avg_session_starts)
return nothing
end
time_test(df)
# 9.301248 seconds (37.16 M allocations: 17.919 GiB, 10.01% gc time)
# 35.576141 seconds (485.70 M allocations: 24.391 GiB, 8.58% gc time)
# 2.170619 seconds (40.30 M allocations: 1.990 GiB, 21.22% gc time)
# 1.507405 seconds (40.30 M allocations: 1.580 GiB, 8.43% gc time)
# 1.852281 seconds (40.67 M allocations: 1.634 GiB, 7.21% gc time)
# 20.191882 seconds (166.55 M allocations: 22.924 GiB, 8.62% gc time)
As you can see, in the example queries we have investigated DataFrames.jl turns
out to be competitive with dplyr
in terms of timing of the operations (let me
stress again that my objective here was not to write the fastest possible codes
in R and Julia that yield the desired results, therefore these values should be
treated lightly).