Advanced reshaping in DataFrames.jl
In DataFrames.jl currently you have two functions that you can use
for reshaping your data: stack
and unstack
. Their design goals are very
simple:
stack
allows you to go from wide to long data format;unstck
works the other way around and takes data in long format producing a wide table.
In this post I want to focus on the unstack
function. Starting from explaining
its basic usage and then covering three common more complex scenarios.
This post was written under Julia 1.6.1 and DataFrames.jl 1.1.1.
Note! In this post I explan new features of unstack
introduced
in DataFrames.jl 1.4 release.
Preparing the data
Consider the following data (and practice a bit basic data transformation skills):
julia> using DataFrames
julia> sales = DataFrame(year=repeat(2001:2003, inner=4),
quarter=repeat(1:4, outer=3),
north=1:12, south=21:32)
12×4 DataFrame
Row │ year quarter north south
│ Int64 Int64 Int64 Int64
─────┼──────────────────────────────
1 │ 2001 1 1 21
2 │ 2001 2 2 22
3 │ 2001 3 3 23
4 │ 2001 4 4 24
5 │ 2002 1 5 25
6 │ 2002 2 6 26
7 │ 2002 3 7 27
8 │ 2002 4 8 28
9 │ 2003 1 9 29
10 │ 2003 2 10 30
11 │ 2003 3 11 31
12 │ 2003 4 12 32
julia> costs = select(sales, :year, :quarter, [:north, :south] .=> x -> x/2,
renamecols=false)
12×4 DataFrame
Row │ year quarter north south
│ Int64 Int64 Float64 Float64
─────┼──────────────────────────────────
1 │ 2001 1 0.5 10.5
2 │ 2001 2 1.0 11.0
3 │ 2001 3 1.5 11.5
4 │ 2001 4 2.0 12.0
5 │ 2002 1 2.5 12.5
6 │ 2002 2 3.0 13.0
7 │ 2002 3 3.5 13.5
8 │ 2002 4 4.0 14.0
9 │ 2003 1 4.5 14.5
10 │ 2003 2 5.0 15.0
11 │ 2003 3 5.5 15.5
12 │ 2003 4 6.0 16.0
julia> long_sales = stack(sales, [:north, :south], [:year, :quarter],
variable_name=:region, value_name=:sales)
24×4 DataFrame
Row │ year quarter region sales
│ Int64 Int64 String Int64
─────┼───────────────────────────────
1 │ 2001 1 north 1
2 │ 2001 2 north 2
3 │ 2001 3 north 3
4 │ 2001 4 north 4
5 │ 2002 1 north 5
6 │ 2002 2 north 6
7 │ 2002 3 north 7
8 │ 2002 4 north 8
9 │ 2003 1 north 9
10 │ 2003 2 north 10
11 │ 2003 3 north 11
12 │ 2003 4 north 12
13 │ 2001 1 south 21
14 │ 2001 2 south 22
15 │ 2001 3 south 23
16 │ 2001 4 south 24
17 │ 2002 1 south 25
18 │ 2002 2 south 26
19 │ 2002 3 south 27
20 │ 2002 4 south 28
21 │ 2003 1 south 29
22 │ 2003 2 south 30
23 │ 2003 3 south 31
24 │ 2003 4 south 32
julia> long_costs = stack(costs, [:north, :south], [:year, :quarter],
variable_name=:region, value_name=:costs)
24×4 DataFrame
Row │ year quarter region costs
│ Int64 Int64 String Float64
─────┼─────────────────────────────────
1 │ 2001 1 north 0.5
2 │ 2001 2 north 1.0
3 │ 2001 3 north 1.5
4 │ 2001 4 north 2.0
5 │ 2002 1 north 2.5
6 │ 2002 2 north 3.0
7 │ 2002 3 north 3.5
8 │ 2002 4 north 4.0
9 │ 2003 1 north 4.5
10 │ 2003 2 north 5.0
11 │ 2003 3 north 5.5
12 │ 2003 4 north 6.0
13 │ 2001 1 south 10.5
14 │ 2001 2 south 11.0
15 │ 2001 3 south 11.5
16 │ 2001 4 south 12.0
17 │ 2002 1 south 12.5
18 │ 2002 2 south 13.0
19 │ 2002 3 south 13.5
20 │ 2002 4 south 14.0
21 │ 2003 1 south 14.5
22 │ 2003 2 south 15.0
23 │ 2003 3 south 15.5
24 │ 2003 4 south 16.0
julia> long = innerjoin(long_sales, long_costs, on=[:year, :quarter, :region])
24×5 DataFrame
Row │ year quarter region sales costs
│ Int64 Int64 String Int64 Float64
─────┼────────────────────────────────────────
1 │ 2001 1 north 1 0.5
2 │ 2001 2 north 2 1.0
3 │ 2001 3 north 3 1.5
4 │ 2001 4 north 4 2.0
5 │ 2002 1 north 5 2.5
6 │ 2002 2 north 6 3.0
7 │ 2002 3 north 7 3.5
8 │ 2002 4 north 8 4.0
9 │ 2003 1 north 9 4.5
10 │ 2003 2 north 10 5.0
11 │ 2003 3 north 11 5.5
12 │ 2003 4 north 12 6.0
13 │ 2001 1 south 21 10.5
14 │ 2001 2 south 22 11.0
15 │ 2001 3 south 23 11.5
16 │ 2001 4 south 24 12.0
17 │ 2002 1 south 25 12.5
18 │ 2002 2 south 26 13.0
19 │ 2002 3 south 27 13.5
20 │ 2002 4 south 28 14.0
21 │ 2003 1 south 29 14.5
22 │ 2003 2 south 30 15.0
23 │ 2003 3 south 31 15.5
24 │ 2003 4 south 32 16.0
The basics of unstack
Assume we want to get the sales
table back. We need to unstack
our long
table putting :year
and :quarter
in rows and :region
in columns, while
taking :sales
as values:
julia> unstack(long, [:year, :quarter], :region, :sales)
12×4 DataFrame
Row │ year quarter north south
│ Int64 Int64 Int64? Int64?
─────┼────────────────────────────────
1 │ 2001 1 1 21
2 │ 2001 2 2 22
3 │ 2001 3 3 23
4 │ 2001 4 4 24
5 │ 2002 1 5 25
6 │ 2002 2 6 26
7 │ 2002 3 7 27
8 │ 2002 4 8 28
9 │ 2003 1 9 29
10 │ 2003 2 10 30
11 │ 2003 3 11 31
12 │ 2003 4 12 32
We also check that we have recovered what we wanted:
julia> unstack(long, [:year, :quarter], :region, :sales) == sales
true
However, now try to put only :year
in rows. If we want to drop :quarter
then
we get:
julia> unstack(long, :year, :region, :sales)
ERROR: ArgumentError: Duplicate entries in unstack at row 2 for key (2001,) and variable north. Pass allowduplicates=true to allow them.
julia> unstack(long, :year, :region, :sales, allowduplicates=true)
3×3 DataFrame
Row │ year north south
│ Int64 Int64? Int64?
─────┼───────────────────────
1 │ 2001 4 24
2 │ 2002 8 28
3 │ 2003 12 32
Clearly even if we pass allowduplicates=true
we do not get what we most likely
wanted. This leads us to the first case.
Pivot tables with unstack
Most likely we want to aggregate sales per year using the sum
function. This
is a classic pivot table task. In DataFrames.jl currently one does it in two
steps: first aggregate, then reshape. Here is how you can do it (I am showing
two separate steps, but you could use e.g. Chain.jl to streamline
the processing):
julia> tmp1 = combine(groupby(long, [:year, :region]), :sales => sum => :sales)
6×3 DataFrame
Row │ year region sales
│ Int64 String Int64
─────┼──────────────────────
1 │ 2001 north 10
2 │ 2001 south 90
3 │ 2002 north 26
4 │ 2002 south 106
5 │ 2003 north 42
6 │ 2003 south 122
julia> unstack(tmp1, :year, :region, :sales)
3×3 DataFrame
Row │ year north south
│ Int64 Int64? Int64?
─────┼───────────────────────
1 │ 2001 10 90
2 │ 2002 26 106
3 │ 2003 42 122
Multiple variables put in columns
What if we wanted to put only :year
in rows, but both :quarter
and :region
in columns?
In this case we need to create a temporary column which we combine the
:quarter
and :region
. Here is a simple example:
julia> tmp2 = transform(long, [:region, :quarter] => ByRow(string) => :rq)
24×6 DataFrame
Row │ year quarter region sales costs rq
│ Int64 Int64 String Int64 Float64 String
─────┼────────────────────────────────────────────────
1 │ 2001 1 north 1 0.5 north1
2 │ 2001 2 north 2 1.0 north2
3 │ 2001 3 north 3 1.5 north3
4 │ 2001 4 north 4 2.0 north4
5 │ 2002 1 north 5 2.5 north1
6 │ 2002 2 north 6 3.0 north2
7 │ 2002 3 north 7 3.5 north3
8 │ 2002 4 north 8 4.0 north4
9 │ 2003 1 north 9 4.5 north1
10 │ 2003 2 north 10 5.0 north2
11 │ 2003 3 north 11 5.5 north3
12 │ 2003 4 north 12 6.0 north4
13 │ 2001 1 south 21 10.5 south1
14 │ 2001 2 south 22 11.0 south2
15 │ 2001 3 south 23 11.5 south3
16 │ 2001 4 south 24 12.0 south4
17 │ 2002 1 south 25 12.5 south1
18 │ 2002 2 south 26 13.0 south2
19 │ 2002 3 south 27 13.5 south3
20 │ 2002 4 south 28 14.0 south4
21 │ 2003 1 south 29 14.5 south1
22 │ 2003 2 south 30 15.0 south2
23 │ 2003 3 south 31 15.5 south3
24 │ 2003 4 south 32 16.0 south4
julia> unstack(tmp2, :year, :rq, :sales)
3×9 DataFrame
Row │ year north1 north2 north3 north4 south1 south2 south3 south4
│ Int64 Int64? Int64? Int64? Int64? Int64? Int64? Int64? Int64?
─────┼───────────────────────────────────────────────────────────────────────
1 │ 2001 1 2 3 4 21 22 23 24
2 │ 2002 5 6 7 8 25 26 27 28
3 │ 2003 9 10 11 12 29 30 31 32
Note that this additional step is only required for columns as for rows
unstack
accepts multiple columns as shown above.
Multiple value variables
Now we get to my favorite Chekhov’s gun element of our story. Why do we
have a :costs
column in our long
table? The reason is that now we will
discuss how one can unstack
a data frame on multiple value columns.
Here you have three options how you want to store the values:
- use a nested field;
- stack them vertically;
- merge them horizontally.
Let me now discuss the three options. Nesting the field can be done e.g. in the following way:
julia> tmp3 = transform(long, AsTable([:sales, :costs]) =>
ByRow(identity) =>
:indicators)
24×6 DataFrame
Row │ year quarter region sales costs indicators
│ Int64 Int64 String Int64 Float64 NamedTupl…
─────┼────────────────────────────────────────────────────────────────────
1 │ 2001 1 north 1 0.5 (sales = 1, costs = 0.5)
2 │ 2001 2 north 2 1.0 (sales = 2, costs = 1.0)
3 │ 2001 3 north 3 1.5 (sales = 3, costs = 1.5)
4 │ 2001 4 north 4 2.0 (sales = 4, costs = 2.0)
5 │ 2002 1 north 5 2.5 (sales = 5, costs = 2.5)
6 │ 2002 2 north 6 3.0 (sales = 6, costs = 3.0)
7 │ 2002 3 north 7 3.5 (sales = 7, costs = 3.5)
8 │ 2002 4 north 8 4.0 (sales = 8, costs = 4.0)
9 │ 2003 1 north 9 4.5 (sales = 9, costs = 4.5)
10 │ 2003 2 north 10 5.0 (sales = 10, costs = 5.0)
11 │ 2003 3 north 11 5.5 (sales = 11, costs = 5.5)
12 │ 2003 4 north 12 6.0 (sales = 12, costs = 6.0)
13 │ 2001 1 south 21 10.5 (sales = 21, costs = 10.5)
14 │ 2001 2 south 22 11.0 (sales = 22, costs = 11.0)
15 │ 2001 3 south 23 11.5 (sales = 23, costs = 11.5)
16 │ 2001 4 south 24 12.0 (sales = 24, costs = 12.0)
17 │ 2002 1 south 25 12.5 (sales = 25, costs = 12.5)
18 │ 2002 2 south 26 13.0 (sales = 26, costs = 13.0)
19 │ 2002 3 south 27 13.5 (sales = 27, costs = 13.5)
20 │ 2002 4 south 28 14.0 (sales = 28, costs = 14.0)
21 │ 2003 1 south 29 14.5 (sales = 29, costs = 14.5)
22 │ 2003 2 south 30 15.0 (sales = 30, costs = 15.0)
23 │ 2003 3 south 31 15.5 (sales = 31, costs = 15.5)
24 │ 2003 4 south 32 16.0 (sales = 32, costs = 16.0)
julia> unstack(tmp3, [:year, :quarter], :region, :indicators)
12×4 DataFrame
Row │ year quarter north south
│ Int64 Int64 NamedTup…? NamedTup…?
─────┼───────────────────────────────────────────────────────────────────────
1 │ 2001 1 (sales = 1, costs = 0.5) (sales = 21, costs = 10.5)
2 │ 2001 2 (sales = 2, costs = 1.0) (sales = 22, costs = 11.0)
3 │ 2001 3 (sales = 3, costs = 1.5) (sales = 23, costs = 11.5)
4 │ 2001 4 (sales = 4, costs = 2.0) (sales = 24, costs = 12.0)
5 │ 2002 1 (sales = 5, costs = 2.5) (sales = 25, costs = 12.5)
6 │ 2002 2 (sales = 6, costs = 3.0) (sales = 26, costs = 13.0)
7 │ 2002 3 (sales = 7, costs = 3.5) (sales = 27, costs = 13.5)
8 │ 2002 4 (sales = 8, costs = 4.0) (sales = 28, costs = 14.0)
9 │ 2003 1 (sales = 9, costs = 4.5) (sales = 29, costs = 14.5)
10 │ 2003 2 (sales = 10, costs = 5.0) (sales = 30, costs = 15.0)
11 │ 2003 3 (sales = 11, costs = 5.5) (sales = 31, costs = 15.5)
12 │ 2003 4 (sales = 12, costs = 6.0) (sales = 32, costs = 16.0)
The second option is vertical stacking:
julia> vcat(unstack(long, [:year, :quarter], :region, :sales),
unstack(long, [:year, :quarter], :region, :costs),
source=:indicator=>["sales", "costs"])
24×5 DataFrame
Row │ year quarter north south indicator
│ Int64 Int64 Float64? Float64? String
─────┼───────────────────────────────────────────────
1 │ 2001 1 1.0 21.0 sales
2 │ 2001 2 2.0 22.0 sales
3 │ 2001 3 3.0 23.0 sales
4 │ 2001 4 4.0 24.0 sales
5 │ 2002 1 5.0 25.0 sales
6 │ 2002 2 6.0 26.0 sales
7 │ 2002 3 7.0 27.0 sales
8 │ 2002 4 8.0 28.0 sales
9 │ 2003 1 9.0 29.0 sales
10 │ 2003 2 10.0 30.0 sales
11 │ 2003 3 11.0 31.0 sales
12 │ 2003 4 12.0 32.0 sales
13 │ 2001 1 0.5 10.5 costs
14 │ 2001 2 1.0 11.0 costs
15 │ 2001 3 1.5 11.5 costs
16 │ 2001 4 2.0 12.0 costs
17 │ 2002 1 2.5 12.5 costs
18 │ 2002 2 3.0 13.0 costs
19 │ 2002 3 3.5 13.5 costs
20 │ 2002 4 4.0 14.0 costs
21 │ 2003 1 4.5 14.5 costs
22 │ 2003 2 5.0 15.0 costs
23 │ 2003 3 5.5 15.5 costs
24 │ 2003 4 6.0 16.0 costs
or
julia> unstack(stack(long, [:sales, :costs], [:year, :quarter, :region],
variable_name=:indicator),
[:year, :quarter, :indicator], :region, :value)
24×5 DataFrame
Row │ year quarter indicator north south
│ Int64 Int64 String Float64? Float64?
─────┼───────────────────────────────────────────────
1 │ 2001 1 sales 1.0 21.0
2 │ 2001 2 sales 2.0 22.0
3 │ 2001 3 sales 3.0 23.0
4 │ 2001 4 sales 4.0 24.0
5 │ 2002 1 sales 5.0 25.0
6 │ 2002 2 sales 6.0 26.0
7 │ 2002 3 sales 7.0 27.0
8 │ 2002 4 sales 8.0 28.0
9 │ 2003 1 sales 9.0 29.0
10 │ 2003 2 sales 10.0 30.0
11 │ 2003 3 sales 11.0 31.0
12 │ 2003 4 sales 12.0 32.0
13 │ 2001 1 costs 0.5 10.5
14 │ 2001 2 costs 1.0 11.0
15 │ 2001 3 costs 1.5 11.5
16 │ 2001 4 costs 2.0 12.0
17 │ 2002 1 costs 2.5 12.5
18 │ 2002 2 costs 3.0 13.0
19 │ 2002 3 costs 3.5 13.5
20 │ 2002 4 costs 4.0 14.0
21 │ 2003 1 costs 4.5 14.5
22 │ 2003 2 costs 5.0 15.0
23 │ 2003 3 costs 5.5 15.5
24 │ 2003 4 costs 6.0 16.0
Finally we might want to perform horizontal merging which can be done e.g. like this:
julia> outerjoin(unstack(long, [:year, :quarter], :region, :sales),
unstack(long, [:year, :quarter], :region, :costs),
on=[:year, :quarter], renamecols="_sales" => "_costs")
12×6 DataFrame
Row │ year quarter north_sales south_sales north_costs south_costs
│ Int64 Int64 Int64? Int64? Float64? Float64?
─────┼────────────────────────────────────────────────────────────────────
1 │ 2001 1 1 21 0.5 10.5
2 │ 2001 2 2 22 1.0 11.0
3 │ 2001 3 3 23 1.5 11.5
4 │ 2001 4 4 24 2.0 12.0
5 │ 2002 1 5 25 2.5 12.5
6 │ 2002 2 6 26 3.0 13.0
7 │ 2002 3 7 27 3.5 13.5
8 │ 2002 4 8 28 4.0 14.0
9 │ 2003 1 9 29 4.5 14.5
10 │ 2003 2 10 30 5.0 15.0
11 │ 2003 3 11 31 5.5 15.5
12 │ 2003 4 12 32 6.0 16.0
Concluding remmarks
Today I have focused mostly on the unstack
function, and only mentioned
stack
in a few places.
However, it is also worth to know that there are two other functions that are
very often handy and easy to forget about. One is good old permutedims
(transposing a data frame) and the other is flatten
(flattening nested
columns). If you want to widen your DataFrames.jl related arsenal of tricks I
recommend you to check out their documentation.