Introduction

This time the blog post will be shorter than usual.

Logan Kilpatrick has just released a new version of the JuliaAcademy course on DataFrames.jl. It has been updated to the 0.22 release and also contains some new material.

You can find the course materials on GitHub here, while the videos will be released in the coming days; the first two, 1. Environment Setup and 2. First Steps With Data Frames, are already available to watch.

Before you go

So as not to leave you with just bare links, let me present a short example of how you can process dates in DataFrames.jl while taking into account that the data might contain missing values (I was recently asked how to do this).

I am using Julia 1.5.3, DataFrames 0.22.1, and Missings 0.4.4.

First we prepare a sample data frame:

julia> using Dates, Missings, DataFrames

julia> df1 = DataFrame(date = Date.(2020, 1, 1:10));

julia> allowmissing!(df1);

julia> df1.date[5] = missing;

julia> df1
10×1 DataFrame
 Row │ date
     │ Date?
─────┼────────────
   1 │ 2020-01-01
   2 │ 2020-01-02
   3 │ 2020-01-03
   4 │ 2020-01-04
   5 │ missing
   6 │ 2020-01-06
   7 │ 2020-01-07
   8 │ 2020-01-08
   9 │ 2020-01-09
  10 │ 2020-01-10

We now want to split the :date column into three columns that contain the year, month, and day, respectively. Here is how you can achieve it:

julia> df2 = transform(df1, @. :date =>
                               ByRow(passmissing([year, month, day])) =>
                               [:year, :month, :day])
10×4 DataFrame
 Row │ date        year     month    day
     │ Date?       Int64?   Int64?   Int64?
─────┼───────────────────────────────────────
   1 │ 2020-01-01     2020        1        1
   2 │ 2020-01-02     2020        1        2
   3 │ 2020-01-03     2020        1        3
   4 │ 2020-01-04     2020        1        4
   5 │ missing     missing  missing  missing
   6 │ 2020-01-06     2020        1        6
   7 │ 2020-01-07     2020        1        7
   8 │ 2020-01-08     2020        1        8
   9 │ 2020-01-09     2020        1        9
  10 │ 2020-01-10     2020        1       10

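Note that here we use a common pattern: broadcasting (via the @. macro) lets you easily specify multiple operations on the same object (in this case the same source column). Each of the functions year, month, and day is wrapped in passmissing and ByRow and then paired with its own target column name.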
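To see what the broadcast produces, it can help to write the three transformations out by hand and compare. The following is a minimal sketch, assuming the package versions listed above; the names df, specs, and expanded, as well as the three-row sample data, are only illustrative.

```julia
using Dates, Missings, DataFrames

# Small sample frame with one missing date (illustrative data)
df = DataFrame(date = [Date(2020, 1, 1), missing, Date(2020, 1, 3)])

# @. adds a dot to every call and operator, so this is the same as
# :date .=> ByRow.(passmissing.([year, month, day])) .=> [:year, :month, :day]
specs = @. :date => ByRow(passmissing([year, month, day])) => [:year, :month, :day]

# The same three transformations written out one by one
expanded = transform(df,
                     :date => ByRow(passmissing(year)) => :year,
                     :date => ByRow(passmissing(month)) => :month,
                     :date => ByRow(passmissing(day)) => :day)

# Both forms should yield the same data frame
isequal(transform(df, specs), expanded)
```

The broadcast form is shorter, while the expanded form may be easier to read when each column needs a different treatment.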

Finally, we go back and combine the :year, :month, and :day columns into one column that contains the original Date values:

julia> df3 = transform(df2, [:year, :month, :day] =>
                            ByRow(passmissing(Date)) =>
                            :date2)
10×5 DataFrame
 Row │ date        year     month    day      date2
     │ Date?       Int64?   Int64?   Int64?   Date?
─────┼───────────────────────────────────────────────────
   1 │ 2020-01-01     2020        1        1  2020-01-01
   2 │ 2020-01-02     2020        1        2  2020-01-02
   3 │ 2020-01-03     2020        1        3  2020-01-03
   4 │ 2020-01-04     2020        1        4  2020-01-04
   5 │ missing     missing  missing  missing  missing
   6 │ 2020-01-06     2020        1        6  2020-01-06
   7 │ 2020-01-07     2020        1        7  2020-01-07
   8 │ 2020-01-08     2020        1        8  2020-01-08
   9 │ 2020-01-09     2020        1        9  2020-01-09
  10 │ 2020-01-10     2020        1       10  2020-01-10

This time we took advantage of the fact that Date accepts three positional arguments. This matches the default behavior of transformation specifications in DataFrames.jl when multiple source columns are provided: the columns are passed to the function as separate positional arguments.
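For each row, ByRow calls the wrapped function with the values taken from the source columns as separate arguments. A minimal sketch of what such a single call looks like outside of a data frame (the name safe_date is only illustrative):

```julia
using Dates, Missings

# passmissing wraps Date so that the call returns missing
# as soon as any of its arguments is missing
safe_date = passmissing(Date)

# All arguments present: behaves like Date(2020, 1, 5)
d1 = safe_date(2020, 1, 5)

# One argument missing: Date is not called and missing is returned
d2 = safe_date(2020, missing, 5)
```

This is why row 5 of df3 holds missing in :date2 rather than raising an error.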

This is all for today. Bye!