Introduction

Several of my recent posts were a bit more advanced, so today I thought to write something for beginners. A common task that one needs to do in data analysis pipelines is lagging a variable in time. Today I will show you several patterns how this can be done using DataFrames.jl and ShiftedArrays.jl. In particular, I was prompted to write this post by a 2.0 release of the ShiftedArrays.jl package and it reflects the current state of this package.

In the post I use Julia 1.8.1, DataFrames.jl 1.3.6, and ShiftedArrays.jl 2.0.

Setting up the scene

Let us load the packages that we will use in this post and create some starting data frame with which we will work:

julia> using DataFrames

julia> using Dates

julia> using ShiftedArrays: lag

julia> df = DataFrame(day=Date.(2020, 9, [1, 2, 4, 5, 1, 2, 3, 6]),
                      id=repeat(1:2, inner=4),
                      v=1:8)
8×3 DataFrame
 Row │ day         id     v
     │ Date        Int64  Int64
─────┼──────────────────────────
   1 │ 2020-09-01      1      1
   2 │ 2020-09-02      1      2
   3 │ 2020-09-04      1      3
   4 │ 2020-09-05      1      4
   5 │ 2020-09-01      2      5
   6 │ 2020-09-02      2      6
   7 │ 2020-09-03      2      7
   8 │ 2020-09-06      2      8

Here an important pattern is using ShiftedArrays: lag. Since 2.0 release the ShiftedArrays.jl package does not export the lag function. Therefore to be able to use it without having to write ShiftedArrays.lag you need to explicitly list it when loading the ShiftedArrays.jl package.

Simple lagging

Assume we want to lag variable v in the whole data frame. for this we can use the transform function like this:

julia> transform(df, :v => lag)
8×4 DataFrame
 Row │ day         id     v      v_lag
     │ Date        Int64  Int64  Int64?
─────┼───────────────────────────────────
   1 │ 2020-09-01      1      1  missing
   2 │ 2020-09-02      1      2        1
   3 │ 2020-09-04      1      3        2
   4 │ 2020-09-05      1      4        3
   5 │ 2020-09-01      2      5        4
   6 │ 2020-09-02      2      6        5
   7 │ 2020-09-03      2      7        6
   8 │ 2020-09-06      2      8        7

If we wanted to lag it by more than one period we could write:

julia> transform(df, :v => (x -> lag(x, 2)) => :v_lag2)
8×4 DataFrame
 Row │ day         id     v      v_lag2
     │ Date        Int64  Int64  Int64?
─────┼───────────────────────────────────
   1 │ 2020-09-01      1      1  missing
   2 │ 2020-09-02      1      2  missing
   3 │ 2020-09-04      1      3        1
   4 │ 2020-09-05      1      4        2
   5 │ 2020-09-01      2      5        3
   6 │ 2020-09-02      2      6        4
   7 │ 2020-09-03      2      7        5
   8 │ 2020-09-06      2      8        6

As an exercise, here is an example how to do lagging by different periods using a comprehension:

julia> transform(df, [:v => (x -> lag(x, i)) => "v_lag$i" for i in 1:3])
8×6 DataFrame
 Row │ day         id     v      v_lag1   v_lag2   v_lag3
     │ Date        Int64  Int64  Int64?   Int64?   Int64?
─────┼─────────────────────────────────────────────────────
   1 │ 2020-09-01      1      1  missing  missing  missing
   2 │ 2020-09-02      1      2        1  missing  missing
   3 │ 2020-09-04      1      3        2        1  missing
   4 │ 2020-09-05      1      4        3        2        1
   5 │ 2020-09-01      2      5        4        3        2
   6 │ 2020-09-02      2      6        5        4        3
   7 │ 2020-09-03      2      7        6        5        4
   8 │ 2020-09-06      2      8        7        6        5

Lagging by group

Simple lagging in our example seems not very useful. The reason is that it looks that we should lag the :v variable in groups defined by the :id variable. In DataFrames.jl this is simple, the only thing you need to change in the above codes is passing groupby(df, :id) to transform:

julia> transform(groupby(df, :id), :v => lag)
8×4 DataFrame
 Row │ day         id     v      v_lag
     │ Date        Int64  Int64  Int64?
─────┼───────────────────────────────────
   1 │ 2020-09-01      1      1  missing
   2 │ 2020-09-02      1      2        1
   3 │ 2020-09-04      1      3        2
   4 │ 2020-09-05      1      4        3
   5 │ 2020-09-01      2      5  missing
   6 │ 2020-09-02      2      6        5
   7 │ 2020-09-03      2      7        6
   8 │ 2020-09-06      2      8        7

(the remaining examples can be updated using the same pattern)

Calendar lagging by group

The above example was better, but assumed that we want to lag data by observations (so we shift a vector disregarding dates). What if we wanted to lag the data by one calendar day. Here are two ways how you can do it (they slightly differ in the produced result, so the choice of the approach depends on what you would want in practice). In both approaches I assume that the passed dates are unique by group.

In the first approach we want to keep all observations after lagging of the :v variable and record only days that match some observations (before or after lagging).

julia> combine(groupby(df, :id)) do sdf
           sort!(outerjoin(select(sdf, Not(:id)),
                           select(sdf,
                                  :day => ByRow(x -> x + Day(1)) => :day,
                                  :v => :v_lag),
                           on=:day), :day)
       end
12×4 DataFrame
 Row │ id     day         v        v_lag
     │ Int64  Date        Int64?   Int64?
─────┼─────────────────────────────────────
   1 │     1  2020-09-01        1  missing
   2 │     1  2020-09-02        2        1
   3 │     1  2020-09-03  missing        2
   4 │     1  2020-09-04        3  missing
   5 │     1  2020-09-05        4        3
   6 │     1  2020-09-06  missing        4
   7 │     2  2020-09-01        5  missing
   8 │     2  2020-09-02        6        5
   9 │     2  2020-09-03        7        6
  10 │     2  2020-09-04  missing        7
  11 │     2  2020-09-06        8  missing
  12 │     2  2020-09-07  missing        8

In this approach in each group defined by the :id column we create a helper data frame with :day column incremented by one and outer join it with the original data frame (with :id column dropped), next we sort the result on :day.

In the second approach we keep only observations in the originally observed period for a given group, but fill all the potential dates in the considered period.

julia> combine(groupby(df, :id)) do sdf
           ref_df = DataFrame(day=minimum(sdf.day):Day(1):maximum(sdf.day))
           transform!(leftjoin!(ref_df, select(sdf, Not(:id)), on=:day),
                      :v => lag)
       end
11×4 DataFrame
 Row │ id     day         v        v_lag
     │ Int64  Date        Int64?   Int64?
─────┼─────────────────────────────────────
   1 │     1  2020-09-01        1  missing
   2 │     1  2020-09-02        2        1
   3 │     1  2020-09-03  missing        2
   4 │     1  2020-09-04        3  missing
   5 │     1  2020-09-05        4        3
   6 │     2  2020-09-01        5  missing
   7 │     2  2020-09-02        6        5
   8 │     2  2020-09-03        7        6
   9 │     2  2020-09-04  missing        7
  10 │     2  2020-09-05  missing  missing
  11 │     2  2020-09-06        8  missing

In this approach, again by :id group, we first create a reference data frame with all dates in the observed period, next we leftjoin! the source data frame into it (dropping :id as it is not needed), and finally just use the lag function. Note that in this solution, in comparison to the previous one, rows for dates 2020-09-06 (group 1) and 2020-09-07 (group 2) are missing, but we have an extra row 2020-09-05 (group 2) filled with missing.

Conclusions

Lagging of variables is quite often needed in practice, especially in panel econometrics context. I hope you will find the patterns I have presented in this post useful in your data analysis projects. Additionally, I have chosen the examples so that several functions from DataFrames.jl are employed in combination.