Expand your DataFrames.jl toolbox: the flatten function
Introduction
Recently I have commented on an interesting question on StackOveflow.
The problem was stated as follows. Given this input table:
2×4 DataFrame
Row │ Name Channel Duration Start_Time
│ String String Int64 Time
─────┼───────────────────────────────────────
1 │ John A 2 16:00:00
2 │ Joseph B 3 15:05:00
produce the following output table:
5×4 DataFrame
Row │ Name Channel Duration Start_Time
│ String String Int64 Time
─────┼───────────────────────────────────────
1 │ John A 2 16:00:00
2 │ John A 2 16:01:00
3 │ Joseph B 3 15:05:00
4 │ Joseph B 3 15:06:00
5 │ Joseph B 3 15:07:00
As you can see the task is to repeat each row of the source data frame as many
times as column :Duration
tells you but additionally increment the
Start_Time
column by one minute in each consecutive row.
This question caught my attention, because it referenced to a similar question using Pandas. However, I found it quite hard to immediately understand what is going on in that code, while in DataFrames.jl the solution seemed to be relatively simple.
This post was written under Julia 1.6.0-rc1 and DataFrames 0.22.5.
The solution using flatten
We start with creating the source data frame:
julia> using DataFrames, Dates
julia> df = DataFrame(Name=["John", "Joseph"],
Channel=["A", "B"],
Duration=[2,3],
Start_Time=Time.(["16:00:00", "15:05:00"]))
2×4 DataFrame
Row │ Name Channel Duration Start_Time
│ String String Int64 Time
─────┼───────────────────────────────────────
1 │ John A 2 16:00:00
2 │ Joseph B 3 15:05:00
Now in order to solve the task one needs to remember that data frame can store
columns having any element type. Therefore a first natural step is to transform
the :Start_Time
column from a vector holding only a starting time to a vector
holding a range of times as defined by :Duration
and :Start_Time
columns.
This is easy to achieve using the transform
function:
julia> df2 = transform(df, [:Start_Time, :Duration] =>
ByRow((x,y) -> x .+ Minute.(0:y-1)) =>
:Start_Time)
2×4 DataFrame
Row │ Name Channel Duration Start_Time
│ String String Int64 Array…
─────┼──────────────────────────────────────────────────────────────
1 │ John A 2 Time[16:00:00, 16:01:00]
2 │ Joseph B 3 Time[15:05:00, 15:06:00, 15:07:0…
alternatively one could create the df2
data frame e.g. like this:
julia> df2 = copy(df)
2×4 DataFrame
Row │ Name Channel Duration Start_Time
│ String String Int64 Time
─────┼───────────────────────────────────────
1 │ John A 2 16:00:00
2 │ Joseph B 3 15:05:00
julia> df2.Start_Time = [x .+ Minute.(0:y-1) for
(x, y) in zip(df2.Start_Time, df2.Duration)]
2-element Vector{Vector{Time}}:
[Time(16), Time(16, 1)]
[Time(15, 5), Time(15, 6), Time(15, 7)]
julia> df2
2×4 DataFrame
Row │ Name Channel Duration Start_Time
│ String String Int64 Array…
─────┼──────────────────────────────────────────────────────────────
1 │ John A 2 Time[16:00:00, 16:01:00]
2 │ Joseph B 3 Time[15:05:00, 15:06:00, 15:07:0…
a small benefit of transform
is that is is easier to put this operation in a
chain of transformations as it takes and returns a data frame.
Once you have a df2
data frame then you need to flatten the :Start_Time
column into multiple rows. This is easily done using the flatten
function like
this:
julia> flatten(df2, :Start_Time)
5×4 DataFrame
Row │ Name Channel Duration Start_Time
│ String String Int64 Time
─────┼───────────────────────────────────────
1 │ John A 2 16:00:00
2 │ John A 2 16:01:00
3 │ Joseph B 3 15:05:00
4 │ Joseph B 3 15:06:00
5 │ Joseph B 3 15:07:00
and you are done!
For sure I know DataFrames.jl much better than Pandas. However, what I feel (and I am for sure biased here) is that it is much easier to reason about what DataFrames.jl code does.
The solution using iteration
Another approach that could be used to handle this task would be to construct
the resulting data frame incrementally. In this case it is a bit more complex
than the flatten
solution, but it is very often quite convenient so I thought
to show it. Here is the code:
julia> df3 = DataFrame()
0×0 DataFrame
julia> for row in eachrow(df)
chunk = repeat(DataFrame(row), row.Duration)
chunk.Start_Time .+= Minute.(0:row.Duration-1)
append!(df3, chunk)
end
julia> df3
5×4 DataFrame
Row │ Name Channel Duration Start_Time
│ String String Int64 Time
─────┼───────────────────────────────────────
1 │ John A 2 16:00:00
2 │ John A 2 16:01:00
3 │ Joseph B 3 15:05:00
4 │ Joseph B 3 15:06:00
5 │ Joseph B 3 15:07:00
or
julia> df4 = DataFrame()
0×0 DataFrame
julia> for row in eachrow(df), i in 0:row.Duration-1
push!(df4, row)
df4.Start_Time[end] += Minute(i)
end
julia> df4
5×4 DataFrame
Row │ Name Channel Duration Start_Time
│ String String Int64 Time
─────┼───────────────────────────────────────
1 │ John A 2 16:00:00
2 │ John A 2 16:01:00
3 │ Joseph B 3 15:05:00
4 │ Joseph B 3 15:06:00
5 │ Joseph B 3 15:07:00
The point of these examples is that append!
and push!
are quite fast in
DataFrames.jl and I find them easy to reason about.
Conclusion
I hope that you found these examples useful. In particular functions like
flatten
are easy to forget about while they often are very handy, especially in
combination with the fact that data frame can store objects of any type in its
columns.
In particular, you can store a vector of vectors or a vector of struct
s as a
data frame column. This is a type of storage that users of such data bases as
BigQuery or Snowflake tend to like. An especially notable feature
of this functionality is that such data frames can be easily written to and read
back from a file using e.g. Arrow.jl.
If you would like to check out another example of using a vector of vectors as a column of a data frame you can have a look at notebook 5 of the JuliaAcademy DataFrames.jl tutorial.