Mass transformations of data frames how-to
Introduction
A very common question related to the usage of DataFrames.jl is how to perform mass transformations of data frames. Typically users want to apply the same function to all columns, rows, or individual cells of a data frame.
In this post I want to summarize basic patterns allowing to perform these tasks. I split the examples by the type of task performed and the requested type of the output of the operation.
The code was tested under Julia 1.5.3 and DataFrames 0.22.2.
In the post we will consider the following source data frame:
julia> using DataFrames
julia> df = DataFrame(reshape(1:24, 6, 4), :auto)
6×4 DataFrame
Row │ x1 x2 x3 x4
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 7 13 19
2 │ 2 8 14 20
3 │ 3 9 15 21
4 │ 4 10 16 22
5 │ 5 11 17 23
6 │ 6 12 18 24
Note that it is important that all columns of the data frame have the same type as usually when we apply mass transformations to different columns this condition is required to be met (it is not a strict rule that this is the case, but I have found that e.g. trying to apply a function that works on floats to strings is one of the most common cases causing confusion of the users).
Each column to a vector
If you want to apply a transformation to each column and get a vector as a
result then use the eachcol
iterator for this. Here are some options you might
find useful:
julia> sum.(eachcol(df))
4-element Array{Int64,1}:
21
57
93
129
julia> map(sum, eachcol(df))
4-element Array{Int64,1}:
21
57
93
129
julia> [sum(x) for x in eachcol(df)]
4-element Array{Int64,1}:
21
57
93
129
julia> [name => sum(x) for (name, x) in pairs(eachcol(df))]
4-element Array{Pair{Symbol,Int64},1}:
:x1 => 21
:x2 => 57
:x3 => 93
:x4 => 129
Each column to a data frame
If you want to produce a data frame as a result of applying a function to all
columns you can either use mapcols
or combine
:
julia> combine(df, names(df) .=> sum)
1×4 DataFrame
Row │ x1_sum x2_sum x3_sum x4_sum
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────────
1 │ 21 57 93 129
julia> combine(df, names(df) .=> sum, renamecols=false)
1×4 DataFrame
Row │ x1 x2 x3 x4
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 21 57 93 129
julia> combine(df, names(df) .=> sum .=> names(df), renamecols=false)
1×4 DataFrame
Row │ x1 x2 x3 x4
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 21 57 93 129
julia> mapcols(sum, df)
1×4 DataFrame
Row │ x1 x2 x3 x4
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 21 57 93 129
In general, as you can see mapcols
was designed to handle this scenario, while
combine
can be used when you would want to perform more different
transformations of the passed data frame (at the cost of being more verbose).
Each row to a vector
In this case you have two major. The basic one is to use eachrow
:
julia> sum.(eachrow(df))
6-element Array{Int64,1}:
40
44
48
52
56
60
julia> map(sum, eachrow(df))
6-element Array{Int64,1}:
40
44
48
52
56
60
julia> [sum(x) for x in eachrow(df)]
6-element Array{Int64,1}:
40
44
48
52
56
60
This should be OK for most cases. The problem with this approach is that
eachrow
is not type stable. So when you have very many rows or need column
type information in the values passed to the aggregation function use
Tables.namedtupleiterator
:
julia> sum.(Tables.namedtupleiterator(df))
6-element Array{Int64,1}:
40
44
48
52
56
60
julia> map(sum, Tables.namedtupleiterator(df))
6-element Array{Int64,1}:
40
44
48
52
56
60
julia> [sum(x) for x in Tables.namedtupleiterator(df)]
6-element Array{Int64,1}:
40
44
48
52
56
60
whih will be faster and type stable (but at the cost of having to be compiled, which can be problematic if you have a lot of columns in you data frame as I have recently explained in this post).
You might ask when one wants type stability in the context of small tables. Here is an example:
julia> df2 = DataFrame(x1=[1, 2, missing], x2 = [1, missing, missing])
3×2 DataFrame
Row │ x1 x2
│ Int64? Int64?
─────┼──────────────────
1 │ 1 1
2 │ 2 missing
3 │ missing missing
julia> (sum∘skipmissing).(Tables.namedtupleiterator(df2))
3-element Array{Int64,1}:
2
2
0
julia> (sum∘skipmissing).(eachrow(df2))
ERROR: ArgumentError: reducing over an empty collection is not allowed
As you can see in the last row of df2
we have only missing
values. If we are
in a type stable context, sum
knows that it should produce an integer 0
,
while in a type unstable context we get an error as it is impossible to tell
what should be the type of 0
that should be produced.
Each row to a data frame
This case is typically handled by using the combine
or the select
functions
(which in the considered scenario produce the same output) along with the
ByRow
wrapper. Here are two examples differing in whether we pass rows as
consecutive positional arguments or as a NamedTuple
to an aggregation
function:
julia> combine(df, names(df) => ByRow(+) => :sum)
6×1 DataFrame
Row │ sum
│ Int64
─────┼───────
1 │ 40
2 │ 44
3 │ 48
4 │ 52
5 │ 56
6 │ 60
julia> combine(df, AsTable(names(df)) => sum => :sum)
6×1 DataFrame
Row │ sum
│ Int64
─────┼───────
1 │ 40
2 │ 44
3 │ 48
4 │ 52
5 │ 56
6 │ 60
Note that in the NamedTuple
passing option we are type stable so the following
code works as in the example from the previous section:
julia> combine(df2, AsTable(names(df2)) => ByRow(sum∘skipmissing) => :sum)
3×1 DataFrame
Row │ sum
│ Int64
─────┼───────
1 │ 2
2 │ 2
3 │ 0
Each cell to a matrix
In order to transform each cell and store the result in a matrix you have the following basic options:
julia> Matrix(df) .^ 2
6×4 Array{Int64,2}:
1 49 169 361
4 64 196 400
9 81 225 441
16 100 256 484
25 121 289 529
36 144 324 576
julia> Matrix(df .^ 2)
6×4 Array{Int64,2}:
1 49 169 361
4 64 196 400
9 81 225 441
16 100 256 484
25 121 289 529
36 144 324 576
julia> [df[i, j]^2 for i in axes(df, 1), j in axes(df, 2)]
6×4 Array{Int64,2}:
1 49 169 361
4 64 196 400
9 81 225 441
16 100 256 484
25 121 289 529
36 144 324 576
In general the first of them (conversion to a Matrix
and then working with it)
should be fastest.
Each cell to a data frame
In this case one can use the same pattern as the second one above. Just write:
julia> df .^ 2
6×4 DataFrame
Row │ x1 x2 x3 x4
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 49 169 361
2 │ 4 64 196 400
3 │ 9 81 225 441
4 │ 16 100 256 484
5 │ 25 121 289 529
6 │ 36 144 324 576
Conclusions
Now you should have a basic understanding of different options how data frame
can be transformed by-row, by-column, or by-cell. I have skipped the discussion
of analogous operations for GroupedDataFrame
. If you would want to perform
them per-group then using the combine
or select
examples given above will
just work also for GroupedDataFrame
as a source of data.