Introduction

This time the post is inspired by the proposal of Andrey Oskin (thank you for submitting it, below I have adapted business problem description and dplyr source codes that Andrey provided).

Andrey shared with me typical tasks that he is faced with when doing logs analysis. To make things concrete, assume that you have a site and you collect users’ clicks. In the output of this process you get a table with two fields: time of click (ts column below, measured in seconds) and the user identifier (user_id column below).

Given such data there are natural business questions, that we can ask, like:

1. How many sessions an average user has?
2. How many users have exactly two sessions?
3. Find top 10 users, ordered by the descending number of sessions?
4. What is the average time between sessions start?

Session in these questions is a more or less arbitrary thing, usually, it has a meaning of sequence of events that come together as there is a short time difference between consecutive events. In the examples we assume that if a user has not clicked on our site for 900 seconds after the last click the session is over.

What I do in this post is take a toy data set that has this structure and dplyr codes that Andrey shared with me that answer the business questions presented above and rewrite them to DataFrames.jl.

The objective of this post is to compare the syntaxes of dplyr and DataFrames.jl. Therefore neither dplyr nor DataFrames.jl codes were tuned to be optimal. Rather I have just taken what Andrey proposed in dplyr and translated it to DataFrames.jl in a way that first came to my mind (but trying to use piping). However, in the last part of the post I out of curiosity I decided compare the performance of the codes.

All codes were tested under R version 4.0.2 and dplyr 1.0.0. For Julia I used version 1.5.0-rc1.0 and packages: DataFrames.jl 0.21.4, Pipe.jl 1.3.0, and ShiftedArrays 1.0.0. If you do not have much experience with setting-up Julia project environments, in this post I give a simple recipe how you can do it easily while ensuring you use exactly the same versions of the packages as I do.

Setting up the stage

In the first step we load the required packages, create a data frame that will be used later and sort it by the ts column.

In all examples in this post I first present R code, and then Julia code. The expected output is shown in a comment. After each step I briefly comment on the Julia code.

dplyr

DataFrames.jl

In this step I used two things that are worth learning:

• A @pipe macro from the Pipes.jl package allows to pass result of the left hand side of |> to the right hand side in the position where _ is placed. In this case _ is a first argument to sort.
• I used DataFrame! constructor; the ! in this case means that columns passed to a freshly constructed data frame are not copied (by default DataFrame constructor copies passed columns for safety).

Compute session identifier for each row of data

So the first task is to identify sessions in our data. For each user a session_id column gives a number of session for this user, starting from zero. Remember, that we assume that a fresh session starts for some user, if two consecutive events for this user are separated by at least 900 seconds.

dplyr

DataFrames.jl

Now I could have rewritten the dpyr code to DataFrames.jl in many ways, but a most natural thing to do it was for me to use the following syntax:

combine(source_column => transformation_function, grouped_data_frame)


With this approach I can conveniently define an anonymous function within a begin-end block and return a (ts=ts, session_id=session_id) value that is a NamedTuple and will get expanded into two columns of a data frame.

I use ungroup=false syntax to keep the result a GroupedDataFrame to match what we get in dplyr.

Also, in the code of the function I use the lag function from ShiftedArrays.jl.

How many sessions an average user has?

dplyr

DataFrames.jl

Observe, that in the DataFrames.jl code the first combine is applied to GroupedDataFrame while the second combine is applied to a DataFrame.

How many users have exactly two sessions?

dplyr

DataFrames.jl

In this code observe that :session_num => ==(2) syntax means that in the filter function we pass each element of :session_num column to ==(2) function, which is a curried version of a standard x == 2 comparison.

Find top 10 users, ordered by the descending number of sessions?

dplyr

DataFrames.jl

Here note that in the :session_id => (x -> maximum(x) + 1) => :session_num expression we have to wrap x -> maximum(x) + 1 in parentheses to get the correct result (if you would omit it => :session_num would be treated as a part of an anonymous function definition).

What is the average time between sessions start?

dplyr

DataFrames.jl

Here we show that you can use the parent function to get access to the data frame of which GroupedDataFrame is a view. Also a common pattern is combine(first, _) to extract the first row of each group in a GroupedDataFrame.

Scaling the computations to a larger input data set

In this part I want to check the performance of dplyr and DataFrames.jl codes that we have just discussed. For this I want to replicate df 5,000,000 times. In order to avoid having only users number 1 and 2 (and thus to have more groups to analyze), we will want to create new user ids in an arbitrary fashion.

We will want to have a look what is timing of the considered operations.

dplyr

DataFrames.jl

As you can see, in the example queries we have investigated DataFrames.jl turns out to be competitive with dplyr in terms of timing of the operations (let me stress again that my objective here was not to write the fastest possible codes in R and Julia that yield the desired results, therefore these values should be treated lightly).