Back to the basics: don't lag behind with your DataFrames.jl skills
Introduction
Several of my recent posts were a bit more advanced, so today I thought to write something for beginners. A common task that one needs to do in data analysis pipelines is lagging a variable in time. Today I will show you several patterns how this can be done using DataFrames.jl and ShiftedArrays.jl. In particular, I was prompted to write this post by a 2.0 release of the ShiftedArrays.jl package and it reflects the current state of this package.
In the post I use Julia 1.8.1, DataFrames.jl 1.3.6, and ShiftedArrays.jl 2.0.
Setting up the scene
Let us load the packages that we will use in this post and create some starting data frame with which we will work:
julia> using DataFrames
julia> using Dates
julia> using ShiftedArrays: lag
julia> df = DataFrame(day=Date.(2020, 9, [1, 2, 4, 5, 1, 2, 3, 6]),
id=repeat(1:2, inner=4),
v=1:8)
8×3 DataFrame
Row │ day id v
│ Date Int64 Int64
─────┼──────────────────────────
1 │ 2020-09-01 1 1
2 │ 2020-09-02 1 2
3 │ 2020-09-04 1 3
4 │ 2020-09-05 1 4
5 │ 2020-09-01 2 5
6 │ 2020-09-02 2 6
7 │ 2020-09-03 2 7
8 │ 2020-09-06 2 8
Here an important pattern is using ShiftedArrays: lag
. Since 2.0 release
the ShiftedArrays.jl package does not export the lag
function. Therefore
to be able to use it without having to write ShiftedArrays.lag
you need
to explicitly list it when loading the ShiftedArrays.jl package.
Simple lagging
Assume we want to lag variable v
in the whole data frame. for this we can
use the transform
function like this:
julia> transform(df, :v => lag)
8×4 DataFrame
Row │ day id v v_lag
│ Date Int64 Int64 Int64?
─────┼───────────────────────────────────
1 │ 2020-09-01 1 1 missing
2 │ 2020-09-02 1 2 1
3 │ 2020-09-04 1 3 2
4 │ 2020-09-05 1 4 3
5 │ 2020-09-01 2 5 4
6 │ 2020-09-02 2 6 5
7 │ 2020-09-03 2 7 6
8 │ 2020-09-06 2 8 7
If we wanted to lag it by more than one period we could write:
julia> transform(df, :v => (x -> lag(x, 2)) => :v_lag2)
8×4 DataFrame
Row │ day id v v_lag2
│ Date Int64 Int64 Int64?
─────┼───────────────────────────────────
1 │ 2020-09-01 1 1 missing
2 │ 2020-09-02 1 2 missing
3 │ 2020-09-04 1 3 1
4 │ 2020-09-05 1 4 2
5 │ 2020-09-01 2 5 3
6 │ 2020-09-02 2 6 4
7 │ 2020-09-03 2 7 5
8 │ 2020-09-06 2 8 6
As an exercise, here is an example how to do lagging by different periods using a comprehension:
julia> transform(df, [:v => (x -> lag(x, i)) => "v_lag$i" for i in 1:3])
8×6 DataFrame
Row │ day id v v_lag1 v_lag2 v_lag3
│ Date Int64 Int64 Int64? Int64? Int64?
─────┼─────────────────────────────────────────────────────
1 │ 2020-09-01 1 1 missing missing missing
2 │ 2020-09-02 1 2 1 missing missing
3 │ 2020-09-04 1 3 2 1 missing
4 │ 2020-09-05 1 4 3 2 1
5 │ 2020-09-01 2 5 4 3 2
6 │ 2020-09-02 2 6 5 4 3
7 │ 2020-09-03 2 7 6 5 4
8 │ 2020-09-06 2 8 7 6 5
Lagging by group
Simple lagging in our example seems not very useful. The reason is that it looks
that we should lag the :v
variable in groups defined by the :id
variable.
In DataFrames.jl this is simple, the only thing you need to change in the
above codes is passing groupby(df, :id)
to transform
:
julia> transform(groupby(df, :id), :v => lag)
8×4 DataFrame
Row │ day id v v_lag
│ Date Int64 Int64 Int64?
─────┼───────────────────────────────────
1 │ 2020-09-01 1 1 missing
2 │ 2020-09-02 1 2 1
3 │ 2020-09-04 1 3 2
4 │ 2020-09-05 1 4 3
5 │ 2020-09-01 2 5 missing
6 │ 2020-09-02 2 6 5
7 │ 2020-09-03 2 7 6
8 │ 2020-09-06 2 8 7
(the remaining examples can be updated using the same pattern)
Calendar lagging by group
The above example was better, but assumed that we want to lag data by observations (so we shift a vector disregarding dates). What if we wanted to lag the data by one calendar day. Here are two ways how you can do it (they slightly differ in the produced result, so the choice of the approach depends on what you would want in practice). In both approaches I assume that the passed dates are unique by group.
In the first approach we want to keep all observations after lagging of the
:v
variable and record only days that match some observations (before or after
lagging).
julia> combine(groupby(df, :id)) do sdf
sort!(outerjoin(select(sdf, Not(:id)),
select(sdf,
:day => ByRow(x -> x + Day(1)) => :day,
:v => :v_lag),
on=:day), :day)
end
12×4 DataFrame
Row │ id day v v_lag
│ Int64 Date Int64? Int64?
─────┼─────────────────────────────────────
1 │ 1 2020-09-01 1 missing
2 │ 1 2020-09-02 2 1
3 │ 1 2020-09-03 missing 2
4 │ 1 2020-09-04 3 missing
5 │ 1 2020-09-05 4 3
6 │ 1 2020-09-06 missing 4
7 │ 2 2020-09-01 5 missing
8 │ 2 2020-09-02 6 5
9 │ 2 2020-09-03 7 6
10 │ 2 2020-09-04 missing 7
11 │ 2 2020-09-06 8 missing
12 │ 2 2020-09-07 missing 8
In this approach in each group defined by the :id
column we create a helper
data frame with :day
column incremented by one and outer join it with the
original data frame (with :id
column dropped), next we sort the result on
:day
.
In the second approach we keep only observations in the originally observed period for a given group, but fill all the potential dates in the considered period.
julia> combine(groupby(df, :id)) do sdf
ref_df = DataFrame(day=minimum(sdf.day):Day(1):maximum(sdf.day))
transform!(leftjoin!(ref_df, select(sdf, Not(:id)), on=:day),
:v => lag)
end
11×4 DataFrame
Row │ id day v v_lag
│ Int64 Date Int64? Int64?
─────┼─────────────────────────────────────
1 │ 1 2020-09-01 1 missing
2 │ 1 2020-09-02 2 1
3 │ 1 2020-09-03 missing 2
4 │ 1 2020-09-04 3 missing
5 │ 1 2020-09-05 4 3
6 │ 2 2020-09-01 5 missing
7 │ 2 2020-09-02 6 5
8 │ 2 2020-09-03 7 6
9 │ 2 2020-09-04 missing 7
10 │ 2 2020-09-05 missing missing
11 │ 2 2020-09-06 8 missing
In this approach, again by :id
group, we first create a reference data frame
with all dates in the observed period, next we leftjoin!
the source data frame
into it (dropping :id
as it is not needed), and finally just use the lag
function. Note that in this solution, in comparison to the previous one, rows
for dates 2020-09-06
(group 1) and 2020-09-07
(group 2) are missing, but we
have an extra row 2020-09-05
(group 2) filled with missing.
Conclusions
Lagging of variables is quite often needed in practice, especially in panel econometrics context. I hope you will find the patterns I have presented in this post useful in your data analysis projects. Additionally, I have chosen the examples so that several functions from DataFrames.jl are employed in combination.