Introduction

Recently I have commented on an interesting question on StackOveflow.

The problem was stated as follows. Given this input table:

2×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ Joseph  B               3  15:05:00

produce the following output table:

5×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ John    A               2  16:01:00
   3 │ Joseph  B               3  15:05:00
   4 │ Joseph  B               3  15:06:00
   5 │ Joseph  B               3  15:07:00

As you can see the task is to repeat each row of the source data frame as many times as column :Duration tells you but additionally increment the Start_Time column by one minute in each consecutive row.

This question caught my attention, because it referenced to a similar question using Pandas. However, I found it quite hard to immediately understand what is going on in that code, while in DataFrames.jl the solution seemed to be relatively simple.

This post was written under Julia 1.6.0-rc1 and DataFrames 0.22.5.

The solution using `flatten`

We start with creating the source data frame:

julia> using DataFrames, Dates

julia> df = DataFrame(Name=["John", "Joseph"],
                      Channel=["A", "B"],
                      Duration=[2,3],
                      Start_Time=Time.(["16:00:00", "15:05:00"]))
2×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ Joseph  B               3  15:05:00

Now in order to solve the task one needs to remember that data frame can store columns having any element type. Therefore a first natural step is to transform the :Start_Time column from a vector holding only a starting time to a vector holding a range of times as defined by :Duration and :Start_Time columns.

This is easy to achieve using the transform function:

julia> df2 = transform(df, [:Start_Time, :Duration] =>
                           ByRow((x,y) -> x .+ Minute.(0:y-1)) =>
                           :Start_Time)
2×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Array…
─────┼──────────────────────────────────────────────────────────────
   1 │ John    A               2  Time[16:00:00, 16:01:00]
   2 │ Joseph  B               3  Time[15:05:00, 15:06:00, 15:07:0…

alternatively one could create the df2 data frame e.g. like this:

julia> df2 = copy(df)
2×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ Joseph  B               3  15:05:00

julia> df2.Start_Time = [x .+ Minute.(0:y-1) for
                         (x, y) in zip(df2.Start_Time, df2.Duration)]
2-element Vector{Vector{Time}}:
 [Time(16), Time(16, 1)]
 [Time(15, 5), Time(15, 6), Time(15, 7)]

julia> df2
2×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Array…
─────┼──────────────────────────────────────────────────────────────
   1 │ John    A               2  Time[16:00:00, 16:01:00]
   2 │ Joseph  B               3  Time[15:05:00, 15:06:00, 15:07:0…

a small benefit of transform is that is is easier to put this operation in a chain of transformations as it takes and returns a data frame.

Once you have a df2 data frame then you need to flatten the :Start_Time column into multiple rows. This is easily done using the flatten function like this:

julia> flatten(df2, :Start_Time)
5×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ John    A               2  16:01:00
   3 │ Joseph  B               3  15:05:00
   4 │ Joseph  B               3  15:06:00
   5 │ Joseph  B               3  15:07:00

and you are done!

For sure I know DataFrames.jl much better than Pandas. However, what I feel (and I am for sure biased here) is that it is much easier to reason about what DataFrames.jl code does.

The solution using iteration

Another approach that could be used to handle this task would be to construct the resulting data frame incrementally. In this case it is a bit more complex than the flatten solution, but it is very often quite convenient so I thought to show it. Here is the code:

julia> df3 = DataFrame()
0×0 DataFrame

julia> for row in eachrow(df)
           chunk = repeat(DataFrame(row), row.Duration)
           chunk.Start_Time .+= Minute.(0:row.Duration-1)
           append!(df3, chunk)
       end

julia> df3
5×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ John    A               2  16:01:00
   3 │ Joseph  B               3  15:05:00
   4 │ Joseph  B               3  15:06:00
   5 │ Joseph  B               3  15:07:00

julia> df4 = DataFrame()
0×0 DataFrame

julia> for row in eachrow(df), i in 0:row.Duration-1
           push!(df4, row)
           df4.Start_Time[end] += Minute(i)
       end

julia> df4
5×4 DataFrame
 Row │ Name    Channel  Duration  Start_Time
     │ String  String   Int64     Time
─────┼───────────────────────────────────────
   1 │ John    A               2  16:00:00
   2 │ John    A               2  16:01:00
   3 │ Joseph  B               3  15:05:00
   4 │ Joseph  B               3  15:06:00
   5 │ Joseph  B               3  15:07:00

The point of these examples is that append! and push! are quite fast in DataFrames.jl and I find them easy to reason about.

Conclusion

I hope that you found these examples useful. In particular functions like flatten are easy to forget about while they often are very handy, especially in combination with the fact that data frame can store objects of any type in its columns.

In particular, you can store a vector of vectors or a vector of structs as a data frame column. This is a type of storage that users of such data bases as BigQuery or Snowflake tend to like. An especially notable feature of this functionality is that such data frames can be easily written to and read back from a file using e.g. Arrow.jl.

If you would like to check out another example of using a vector of vectors as a column of a data frame you can have a look at notebook 5 of the JuliaAcademy DataFrames.jl tutorial.

Introduction

The solution using flatten

The solution using iteration

Conclusion

The solution using `flatten`