Transforming multiple columns in DataFrames.jl
Introduction
Today I want to comment on a recurring topic that DataFrames.jl users raise. The question is how one should transform multiple columns of a data frame using operation specification syntax.
The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.
What is operation specification syntax?
In DataFrames.jl the combine, select, and transform functions allow
users for passing the requests for data transformation using operation
specification syntax. This syntax is feature-rich, and you can find its
description for example here. Today I want to focus on its principal concept.
In a general form each request for making an operation on data has the (E)xtract-(T)ransform-(L)oad form. That means that we need to specify:
- source columns to get data from (the extract part);;
- the operation to apply to these columns (the transform part);
- the target columns where we want to store the result of the operation (the load part).
These tree parts are syntactically expressed using the following form:
[source columns specification] => [transformation function] => [target columns specification]
Let me give an example. Assume you have the following data:
julia> using DataFrames
julia> df = DataFrame(reshape(1:15, 5, 3), :auto)
5×3 DataFrame
Row │ x1 x2 x3
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 6 11
2 │ 2 7 12
3 │ 3 8 13
4 │ 4 9 14
5 │ 5 10 15
We want to compute the sum of column "x1" and store it in column names "x1_sum"
Since the sum function performs the addition operation the syntax specification should be:
"x1" => sum => "x1_sum"
Let us check it with the combine function:
julia> combine(df, "x1" => sum => "x1_sum")
1×1 DataFrame
Row │ x1_sum
│ Int64
─────┼────────
1 │ 15
In this syntax it is important to note two things:
- the
"x1"column as a whole was passed to thesumfunction (as we want to compute its sum); - the
"x1"column is a single positional argument passed to thesumfunction.
Two natural questions that arise are the following:
- What if I do not want to perform an operation on a whole column, but on its elements (a.k.a. vectorization of operation)?
- What if I want to pass multiple columns as a source for computations?
We will now investigate these two dimensions.
Vectorization of operations
Vectorization in DataFrames.jl is easy. Just wrap the function you use in the ByRow object. Here is an example:
julia> combine(df, "x1" => string => "x1_str")
1×1 DataFrame
Row │ x1_str
│ String
─────┼─────────────────
1 │ [1, 2, 3, 4, 5]
julia> combine(df, "x1" => ByRow(string) => "x1_strs")
5×1 DataFrame
Row │ x1_strs
│ String
─────┼─────────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
5 │ 5
Note that "x1" => string => "x1_str" passed the whole "x1" column to the string function so we got a single "[1, 2, 3, 4, 5]"
string in the output.
While writing "x1" => ByRow(string) => "x1_strs" passed each element of "x1" column to the string function individually,
so in the result we got a vector of five string representations of numbers of the numbers from the source.
Passing multiple columns
Now let us have a look at passing multiple columns. There are two ways you can do it.
The first is when your function accepts multiple positional arguments. An example of such function is string see:
julia> string(df.x1, df.x2)
"[1, 2, 3, 4, 5][6, 7, 8, 9, 10]"
If we pass a collection of columns as a source in operation specification syntax we get this behavior:
julia> combine(df, ["x1", "x2"] => string => "x1_x2_str")
1×1 DataFrame
Row │ x1_x2_str
│ String
─────┼─────────────────────────────────
1 │ [1, 2, 3, 4, 5][6, 7, 8, 9, 10]
Naturally, the above combines with vectorization. Therefore since:
julia> string.(df.x1, df.x2)
5-element Vector{String}:
"16"
"27"
"38"
"49"
"510"
we also have:
julia> combine(df, ["x1", "x2"] => ByRow(string) => "x1_x2_strs")
5×1 DataFrame
Row │ x1_x2_strs
│ String
─────┼────────────
1 │ 16
2 │ 27
3 │ 38
4 │ 49
5 │ 510
However, there are cases when we have a function that expects multiple columns to be passed as a single positional argument.
This is handled in DataFrames.jl with the AsTable wrapper, which you can apply to the source columns.
If you use it then instead of getting multiple positional arguments the function will get a single positional argument
that will be a NamedTuple holding the source columns.
To convince ourselves that this is indeed what happens let us create a helper function:
julia> function helper(x)
@show x
return string(x.x1, x.x2)
end
helper (generic function with 1 method)
This helper function first prints us its only argument x and next assumes that it has x1 and x2 fields and applies the string function to them.
Let us first check it in practice:
julia> helper((x1=[1, 2, 3, 4, 5], x2=[6, 7, 8, 9, 10]))
x = (x1 = [1, 2, 3, 4, 5], x2 = [6, 7, 8, 9, 10])
"[1, 2, 3, 4, 5][6, 7, 8, 9, 10]"
Now let us use the helper function with combine:
julia> combine(df, AsTable(["x1", "x2"]) => helper => "x1_x2_str")
x = (x1 = [1, 2, 3, 4, 5], x2 = [6, 7, 8, 9, 10])
1×1 DataFrame
Row │ x1_x2_str
│ String
─────┼─────────────────────────────────
1 │ [1, 2, 3, 4, 5][6, 7, 8, 9, 10]
Indeed, we see that helper got a named tuple holding two columns of the source data frame.
Again, this syntax plays well with ByRow:
julia> combine(df, AsTable(["x1", "x2"]) => ByRow(helper) => "x1_x2_strs")
x = (x1 = 1, x2 = 6)
x = (x1 = 2, x2 = 7)
x = (x1 = 3, x2 = 8)
x = (x1 = 4, x2 = 9)
x = (x1 = 5, x2 = 10)
5×1 DataFrame
Row │ x1_x2_strs
│ String
─────┼────────────
1 │ 16
2 │ 27
3 │ 38
4 │ 49
5 │ 510
We see that this time helper got a separate named tuple for each row of source data frame.
Conclusions
In summary today we discussed two special operations in DataFrames.jl operation specification syntax:
- the
ByRowwhich vectorizes the function passed to it; - the
AsTablewhich allows us to pass source columns as a single named tuple to the transformation function (instead of passing them as consecutive positional arguments, which is the default).
I hope these examples were useful in helping you understand the design of operation specification syntax.