Transforming multiple columns with multiple functions in DataFrames.jl
Introduction
A common question asked in relation to mutation of data frame objects in DataFrames.jl is how to apply multiple transformation functions to multiple columns of a data frame. In this post I want to show several possible approaches to this task.
The codes were run under Julia 1.6.1 and DataFrames.jl 1.2.0.
The basic approach
First create a data frame we will work with:
julia> using DataFrames
julia> df = DataFrame(id=repeat(1:2, 3), c1=1:6, c2=11:16)
6×3 DataFrame
Row │ id c1 c2
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 11
2 │ 2 2 12
3 │ 1 3 13
4 │ 2 4 14
5 │ 1 5 15
6 │ 2 6 16
Now assume we want to calculate minimum, maximum and sum of columns
:c1
and :c2
. You can do it like this:
julia> combine(df, [:c1, :c2] .=> [minimum maximum sum])
1×6 DataFrame
Row │ c1_minimum c2_minimum c1_maximum c2_maximum c1_sum c2_sum
│ Int64 Int64 Int64 Int64 Int64 Int64
─────┼────────────────────────────────────────────────────────────────
1 │ 1 11 6 16 21 81
or like this:
julia> combine(df, [:c1 :c2] .=> [minimum, maximum, sum])
1×6 DataFrame
Row │ c1_minimum c1_maximum c1_sum c2_minimum c2_maximum c2_sum
│ Int64 Int64 Int64 Int64 Int64 Int64
─────┼────────────────────────────────────────────────────────────────
1 │ 1 6 21 11 16 81
depending on what order of output columns you prefer.
Let us try to understand what is going on here. The combine
function accepts
vectors or matrices of transformation specifications given as pairs using the
=>
operator. If we check the internal broadcasting operation we see:
julia> [:c1 :c2] .=> [minimum, maximum, sum]
3×2 Matrix{Pair{Symbol, _A} where _A}:
:c1=>minimum :c2=>minimum
:c1=>maximum :c2=>maximum
:c1=>sum :c2=>sum
julia> [:c1, :c2] .=> [minimum maximum sum]
2×3 Matrix{Pair{Symbol, _A} where _A}:
:c1=>minimum :c1=>maximum :c1=>sum
:c2=>minimum :c2=>maximum :c2=>sum
In both cases we create a matrix of transformations. Note that the crucial
trick here is that we broadcast a vector againsst a 1-row matrix to get a final
matrix. For instance in [:c1 :c2] .=> [minimum, maximum, sum]
, the [:c1 :c2]
part is a 1-row matrix (notice that a space is separating its elements) and
[minimum, maximum, sum]
is a vector (notice ,
separating its elements).
Exactly the same pattern can be applied to grouped data frames as you can see here:
julia> gdf = groupby(df, :id)
GroupedDataFrame with 2 groups based on key: id
First Group (3 rows): id = 1
Row │ id c1 c2
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 11
2 │ 1 3 13
3 │ 1 5 15
⋮
Last Group (3 rows): id = 2
Row │ id c1 c2
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 2 2 12
2 │ 2 4 14
3 │ 2 6 16
julia> combine(gdf, [:c1, :c2] .=> [minimum maximum sum])
2×7 DataFrame
Row │ id c1_minimum c2_minimum c1_maximum c2_maximum c1_sum c2_sum
│ Int64 Int64 Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────────────────────────────────────────
1 │ 1 1 11 5 15 9 39
2 │ 2 2 12 6 16 12 42
julia> combine(gdf, [:c1 :c2] .=> [minimum, maximum, sum])
2×7 DataFrame
Row │ id c1_minimum c1_maximum c1_sum c2_minimum c2_maximum c2_sum
│ Int64 Int64 Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────────────────────────────────────────
1 │ 1 1 5 9 11 15 39
2 │ 2 2 6 12 12 16 42
Programmatic specification of column names
Sometimes instead of writing down [:c1, :c2]
manually one would want to select
the columns programmatically. Here are some approaches we currently support:
The most general one is by using the names
function with an appropriate column
selector:
julia> combine(df, names(df, r"c") .=> [minimum maximum sum])
1×6 DataFrame
Row │ c1_minimum c2_minimum c1_maximum c2_maximum c1_sum c2_sum
│ Int64 Int64 Int64 Int64 Int64 Int64
─────┼────────────────────────────────────────────────────────────────
1 │ 1 11 6 16 21 81
or
julia> combine(df, names(df, Not(:id)) .=> [minimum maximum sum])
1×6 DataFrame
Row │ c1_minimum c2_minimum c1_maximum c2_maximum c1_sum c2_sum
│ Int64 Int64 Int64 Int64 Int64 Int64
─────┼────────────────────────────────────────────────────────────────
1 │ 1 11 6 16 21 81
With grouped data frames an option that is often useful is to use the
valuecols
funcion that picks all non-grouping columns:
julia> combine(gdf, valuecols(gdf) .=> [minimum maximum sum])
2×7 DataFrame
Row │ id c1_minimum c2_minimum c1_maximum c2_maximum c1_sum c2_sum
│ Int64 Int64 Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────────────────────────────────────────
1 │ 1 1 11 5 15 9 39
2 │ 2 2 12 6 16 12 42
As above, in all the cases it is useful to investigate the arguments passed to the enclosing functions:
julia> names(df, r"c")
2-element Vector{String}:
"c1"
"c2"
julia> names(df, Not(:id))
2-element Vector{String}:
"c1"
"c2"
julia> valuecols(gdf)
2-element Vector{Symbol}:
:c1
:c2
Nested columns approach
Another approach one could use to apply multiple functions to multiple colums is to create nested columns like this:
julia> combine(df,
[:c1, :c2] .=> x -> [(min=minimum(x), max=maximum(x), sum=sum(x))],
renamecols=false)
1×2 DataFrame
Row │ c1 c2
│ NamedTup… NamedTup…
─────┼──────────────────────────────────────────────────────────────
1 │ (min = 1, max = 6, sum = 21) (min = 11, max = 16, sum = 81)
or
julia> combine(df,
[:c1, :c2] .=> x -> Ref((min=minimum(x), max=maximum(x), sum=sum(x))),
renamecols=false)
1×2 DataFrame
Row │ c1 c2
│ NamedTup… NamedTup…
─────┼──────────────────────────────────────────────────────────────
1 │ (min = 1, max = 6, sum = 21) (min = 11, max = 16, sum = 81)
Notice that the anonymous function we use produces a NamedTuple
, so we need
to protect it against being expanded by wrapping it with a vector or Ref
object.
This trick similar to what is done in broadcasting in Julia Base. Let me give
a simple example:
julia> [1 2] .=> (a=1, b=2)
ERROR: ArgumentError: broadcasting over dictionaries and `NamedTuple`s is reserved
julia> [1 2] .=> [(a=1, b=2)]
1×2 Matrix{Pair{Int64, NamedTuple{(:a, :b), Tuple{Int64, Int64}}}}:
1=>(a = 1, b = 2) 2=>(a = 1, b = 2)
julia> [1 2] .=> Ref((a=1, b=2))
1×2 Matrix{Pair{Int64, NamedTuple{(:a, :b), Tuple{Int64, Int64}}}}:
1=>(a = 1, b = 2) 2=>(a = 1, b = 2)
and as you can see wrapping a NamedTuple
in a vector or Ref
made broadcasting
use it “as is” (essentially we had to create a 1-element container that was
broadcasted over). As a side note: Ref
is usually a technically preferred way
to make such a protection, but vector is easier to type and in most cases it
produces exactly the same result, so I often use it.
Going back to our aggregation case, the nested column is most useful with grouped data frame:
julia> df_agg = combine(gdf,
[:c1, :c2] .=> x->[(min=minimum(x), max=maximum(x), sum=sum(x))],
renamecols=false)
2×3 DataFrame
Row │ id c1 c2
│ Int64 NamedTup… NamedTup…
─────┼─────────────────────────────────────────────────────────────────────
1 │ 1 (min = 1, max = 5, sum = 9) (min = 11, max = 15, sum = 39)
2 │ 2 (min = 2, max = 6, sum = 12) (min = 12, max = 16, sum = 42)
Finally you might ask how to un-nest the columns. If you want to do it for a single column you can just write:
julia> select(df_agg, :id, :c1 => AsTable)
2×4 DataFrame
Row │ id min max sum
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 1 5 9
2 │ 2 2 6 12
However, for multiple columns this will fail:
julia> select(df_agg, :id, [:c1, :c2] .=> AsTable)
ERROR: ArgumentError: Duplicate column name(s) returned: :min, :max, :sum
The problem is that, as you can see, we would try to create columns with names
:min
, :max
, and :sum
twice, which is disallowed.
What you can do instead is to provide explicit column names:
julia> select(df_agg, :id, [:c1, :c2] .=> [c .* ["min", "max", "sum"] for c in ["c1_", "c2_"]])
2×7 DataFrame
Row │ id c1_min c1_max c1_sum c2_min c2_max c2_sum
│ Int64 Int64 Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────────────────────────
1 │ 1 1 5 9 11 15 39
2 │ 2 2 6 12 12 16 42
As usual, it is worth to investigate the argument passed to the combine
function
to make sure we understand exactly the syntax:
julia> [:c1, :c2] .=> [c .* ["min", "max", "sum"] for c in ["c1_", "c2_"]]
2-element Vector{Pair{Symbol, Vector{String}}}:
:c1 => ["c1_min", "c1_max", "c1_sum"]
:c2 => ["c2_min", "c2_max", "c2_sum"]
Conclusions
In conclusion let me highlight one important feature of the syntax using the
=>
operator that we have exploited in this post.
The trick is that =>
is a callable (technically it calls a Pair
constructor). Therefore it is very easy to use it in programmatic scenarios
when one might want to specify column names or applied functions at run time
e.g. by storing them in some variables.