Getting ready for JuliaCon Local Eindhoven 2023
Introduction
I am really excited that soon we will have a Julia conference in Europe. The JuliaCon Local Eindhoven 2023 will take place on December 1, 2023 in Eindhoven.
I thought that as nice preparation for this event it would be useful to share some tips that can be useful for DataFrames.jl users. Frequently, when working with tables, one has to process time series data. I want to share a few examples how such data can be stored in DataFrames.jl.
The post was written under Julia 1.9.2 and DataFrames.jl 1.6.1.
Setting up the stage
Let us start with a data frame that has 4 columns and 1000 rows representing consecutive measurements of some data.
I also add the :period
column keeping track of the moment at which the measurement was made.
julia> using DataFrames
julia> using Random
julia> Random.seed!(1234);
julia> df1 = DataFrame(rand(1000, 4), :auto);
julia> df1.period = 1:1000;
julia> df1
1000×5 DataFrame
Row │ x1 x2 x3 x4 period
│ Float64 Float64 Float64 Float64 Int64
──────┼───────────────────────────────────────────────────
1 │ 0.579862 0.473374 0.366288 0.771849 1
2 │ 0.411294 0.176997 0.652475 0.938484 2
3 │ 0.972136 0.676856 0.900155 0.390673 3
4 │ 0.0149088 0.177963 0.374975 0.694063 4
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
997 │ 0.424295 0.78012 0.577681 0.274228 997
998 │ 0.481554 0.0441882 0.700181 0.317677 998
999 │ 0.401528 0.658903 0.438309 0.211217 999
1000 │ 0.58985 0.909857 0.354735 0.442829 1000
992 rows omitted
Reshaping data: wide to long
The first operation we want to do is reshaping the data from wide to long format,
in which we will have three columns: the period, the variable name, and the variable value.
To achieve this we can use the stack
function:
julia> df2 = stack(df1, Not(:period))
4000×3 DataFrame
Row │ period variable value
│ Int64 String Float64
──────┼─────────────────────────────
1 │ 1 x1 0.579862
2 │ 2 x1 0.411294
3 │ 3 x1 0.972136
4 │ 4 x1 0.0149088
⋮ │ ⋮ ⋮ ⋮
3997 │ 997 x4 0.274228
3998 │ 998 x4 0.317677
3999 │ 999 x4 0.211217
4000 │ 1000 x4 0.442829
3992 rows omitted
Reshaping data: long to wide
Now let us perform the reverse operation, but this time put :period
as columns.
We can do it using the unstack
function:
julia> df3 = unstack(df2, :variable, :period, :value)
4×1001 DataFrame
Row │ variable 1 2 3 4 5 6 ⋯
│ String Float64? Float64? Float64? Float64? Float64? Float64? ⋯
─────┼────────────────────────────────────────────────────────────────────────
1 │ x1 0.579862 0.411294 0.972136 0.0149088 0.520355 0.639562 ⋯
2 │ x2 0.473374 0.176997 0.676856 0.177963 0.670122 0.042361
3 │ x3 0.366288 0.652475 0.900155 0.374975 0.387857 0.779594
4 │ x4 0.771849 0.938484 0.390673 0.694063 0.724383 0.130453
995 columns omitted
Nesting columns
Another way to reshape this data is to put all periods in a single cell of a data frame. We can do it in two ways.
The first is to keep them in columns :x1
, :x2
, :x3
, and :x4
. Here are two ways
how you can do it starting from df1
or df2
:
julia> combine(df1, Not(:period) .=> Ref, renamecols=false)
1×4 DataFrame
Row │ x1 x2 ⋯
│ Array… Array… ⋯
─────┼────────────────────────────────────────────────────────────────────────
1 │ [0.579862, 0.411294, 0.972136, 0… [0.473374, 0.176997, 0.676856, 0… ⋯
2 columns omitted
julia> unstack(df2, [], :variable, :value, combine=collect)
1×4 DataFrame
Row │ x1 x2 ⋯
│ Array…? Array…? ⋯
─────┼────────────────────────────────────────────────────────────────────────
1 │ [0.579862, 0.411294, 0.972136, 0… [0.473374, 0.176997, 0.676856, 0… ⋯
2 columns omitted
Alternatively, we could want to have two columns, the first with variable names
and the second with the vectors. Again, let me show two ways to do it starting
from df2
and df3
data frames:
julia> df4 = combine(groupby(df2, :variable), :value => Ref, renamecols=false)
4×2 DataFrame
Row │ variable value
│ String SubArray…
─────┼─────────────────────────────────────────────
1 │ x1 [0.579862, 0.411294, 0.972136, 0…
2 │ x2 [0.473374, 0.176997, 0.676856, 0…
3 │ x3 [0.366288, 0.652475, 0.900155, 0…
4 │ x4 [0.771849, 0.938484, 0.390673, 0…
julia> combine(df3, :variable, AsTable(Not(:variable)) => ByRow(identity∘collect) => :value)
4×2 DataFrame
Row │ variable value
│ String Array…
─────┼─────────────────────────────────────────────
1 │ x1 Union{Missing, Float64}[0.579862…
2 │ x2 Union{Missing, Float64}[0.473374…
3 │ x3 Union{Missing, Float64}[0.366288…
4 │ x4 Union{Missing, Float64}[0.771849…
These examples are slightly more complicated so let me explain them.
In the first one we use Ref
to protect a vector against expansion into multiple columns
(note that no copying of data happens here, to perform a copy you should write Ref∘copy
).
In the second one the AsTable(Not(:variable))
source column selector produces a NamedTuple
.
However, Julia users probably are aware that with 1000 entries such a NamedTuple
would take a long
time to compile. For this reason we are using an optimization that DataFrames.jl provides.
If we pass a function of a form some_function∘collect
then this compilation step is avoided.
In our case the function is just identity
as we are happy with producing a vector (which collect
already returns).
Unnesting columns
As a last example let us show how to expand the df4
data frame back into multiple columns.
This is easy. Just write:
julia> select(df4, :variable, :value => AsTable)
4×1001 DataFrame
Row │ variable x1 x2 x3 x4 x5 x6 ⋯
│ String Float64 Float64 Float64 Float64 Float64 Float64 ⋯
─────┼────────────────────────────────────────────────────────────────────────
1 │ x1 0.579862 0.411294 0.972136 0.0149088 0.520355 0.639562 ⋯
2 │ x2 0.473374 0.176997 0.676856 0.177963 0.670122 0.042361
3 │ x3 0.366288 0.652475 0.900155 0.374975 0.387857 0.779594
4 │ x4 0.771849 0.938484 0.390673 0.694063 0.724383 0.130453
995 columns omitted
Since we used AsTable
as target column names specifier we got auto-generated column names.
Conclusions
I hope you liked the examples I presented today and learned something from them.
I am sure that if you attend JuliaCon Local Eindhoven 2023 you are going to see a lot of highly informative and inspirational talks there!