Handling vectors of vectors in DataFrames.jl
Introduction
In this post I want to explore a topic that is a corner case of transformation mini-language in DataFrames.jl. The issue is about transformations that produce vectors as values. This question arises when users start to do more advanced operations so I decided that it deserves a deeper treatment (as in the last post this time I have chosen a topic following a question from the user I got recently).
This post was written under Julia 1.6.1, Arrow 1.6.2, and DataFrames.jl 1.2.2.
The standard behavior
By default if a transformation operation returns a vector it gets expanded into multiple rows, as this is something that typically users expect. Here is a basic example:
julia> using DataFrames
julia> df = DataFrame(x=[1, 1, 2, 3, 3])
5×1 DataFrame
Row │ x
│ Int64
─────┼───────
1 │ 1
2 │ 1
3 │ 2
4 │ 3
5 │ 3
julia> combine(df, :x => reverse)
5×1 DataFrame
Row │ x_reverse
│ Int64
─────┼───────────
1 │ 3
2 │ 3
3 │ 2
4 │ 1
5 │ 1
julia> combine(df, :x => unique)
3×1 DataFrame
Row │ x_unique
│ Int64
─────┼──────────
1 │ 1
2 │ 2
3 │ 3
As you can see in the unique
example the number of rows produced is flexible
with combine
. This would not be the case for select
or transform
which
require the number of rows in the result to match the source, so we get:
julia> select(df, :x => reverse)
5×1 DataFrame
Row │ x_reverse
│ Int64
─────┼───────────
1 │ 3
2 │ 3
3 │ 2
4 │ 1
5 │ 1
julia> select(df, :x => unique)
ERROR: ArgumentError: length 3 of vector returned from function unique is different from number of rows 5 of the source data frame.
Similar rules apply when split-apply-combine strategy is used:
julia> gdf = groupby(DataFrame(id=[1, 1, 1, 2, 2, 2], x=[1, 1, 2, 1, 3, 3]), :id)
GroupedDataFrame with 2 groups based on key: id
First Group (3 rows): id = 1
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 1 1
3 │ 1 2
⋮
Last Group (3 rows): id = 2
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 2 1
2 │ 2 3
3 │ 2 3
julia> combine(gdf, :x => reverse)
6×2 DataFrame
Row │ id x_reverse
│ Int64 Int64
─────┼──────────────────
1 │ 1 2
2 │ 1 1
3 │ 1 1
4 │ 2 3
5 │ 2 3
6 │ 2 1
julia> combine(gdf, :x => unique)
4×2 DataFrame
Row │ id x_unique
│ Int64 Int64
─────┼─────────────────
1 │ 1 1
2 │ 1 2
3 │ 2 1
4 │ 2 3
julia> select(gdf, :x => reverse)
6×2 DataFrame
Row │ id x_reverse
│ Int64 Int64
─────┼──────────────────
1 │ 1 2
2 │ 1 1
3 │ 1 1
4 │ 2 3
5 │ 2 3
6 │ 2 1
julia> select(gdf, :x => unique)
ERROR: ArgumentError: all functions must return vectors with as many values as rows in each group
Putting a vector into a single row
Sometimes one wants to put a vector into a single row of the resulting data frame.
In such case the recommended way to achieve the desired result is to wrap the result
in Ref
, just like in broadcasting:
julia> combine(df, :x => Ref∘unique)
1×1 DataFrame
Row │ x_Ref_unique
│ Array…
─────┼──────────────
1 │ [1, 2, 3]
julia> combine(gdf, :x => Ref∘unique)
2×2 DataFrame
Row │ id x_Ref_unique
│ Int64 Array…
─────┼─────────────────────
1 │ 1 [1, 2]
2 │ 2 [1, 3]
This pattern is typically most useful when working with grouped data frames.
Here it is worth to mention that this wrapping is not required if we are
performing a ByRow
operation, as ByRow
automatically wraps everything our
transformation function produces in an additional vector as a container.
Here is an example:
julia> select(df, :x => ByRow(x -> fill(x, x)))
5×1 DataFrame
Row │ x_function
│ Array…
─────┼────────────
1 │ [1]
2 │ [1]
3 │ [2, 2]
4 │ [3, 3, 3]
5 │ [3, 3, 3]
If you wanted to expand this result into multiple rows I recommend using the
flatten
function in the post-processing, as I discuss it in this post:
julia> flatten(select(df, :x => ByRow(x -> fill(x, x))), 1)
10×1 DataFrame
Row │ x_function
│ Int64
─────┼────────────
1 │ 1
2 │ 1
3 │ 2
4 │ 2
5 │ 3
6 │ 3
7 │ 3
8 │ 3
9 │ 3
10 │ 3
Producing a table as an output from a transformation
The patterns above are standard. However, as users get more advanced they start doing complex transformations that produce multiple columns in their code.
Here is a simple example to start with:
julia> select(df, :x => ByRow(x -> (a=x, b=fill(x, x))) => AsTable)
5×2 DataFrame
Row │ a b
│ Int64 Array…
─────┼──────────────────
1 │ 1 [1]
2 │ 1 [1]
3 │ 2 [2, 2]
4 │ 3 [3, 3, 3]
5 │ 3 [3, 3, 3]
This worked nicely because ByRow(x -> (a=x, b=fill(x, x)))
produced a vector
of NamedTuples
that was cleanly handled by AsTable
.
However, the following fails when aggregating GroupedDataFrame
:
julia> combine(gdf, :x => (x -> (a=sum(x), b=x)) => AsTable)
ERROR: ArgumentError: mixing single values and vectors in a named tuple is not allowed
The problem is that the NamedTuple
the transformation produces mixes a scalar
(in the column :a
) and a vector (in the column :b
). This is disallowed as it
would be not clear if the user wants the scalar value of :a
to be broadcasted or
the vector stored in :b
to be put into a single row of output.
Here are the ways to achieve both behaviors. If you want to broadcast :a
into
multiple rows to match the length of :b
the simplest approach is to use
a DataFrame
instead of a NamedTuple
:
julia> combine(gdf, :x => (x -> DataFrame(a=sum(x), b=x)) => AsTable)
6×3 DataFrame
Row │ id a b
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 4 1
2 │ 1 4 1
3 │ 1 4 2
4 │ 2 7 1
5 │ 2 7 3
6 │ 2 7 3
Here the trick is that the DataFrame
constructor performs pseudo-broadcasting
automatically.
On the other hand if you want the vector stored in :b
to be put into a single row
do one of the following:
julia> combine(gdf, :x => (x -> (a=[sum(x)], b=[x])) => AsTable)
2×3 DataFrame
Row │ id a b
│ Int64 Int64 SubArray…
─────┼─────────────────────────
1 │ 1 4 [1, 1, 2]
2 │ 2 7 [1, 3, 3]
julia> combine(gdf, :x => (x -> DataFrame(a=sum(x), b=Ref(x))) => AsTable)
2×3 DataFrame
Row │ id a b
│ Int64 Int64 SubArray…
─────┼─────────────────────────
1 │ 1 4 [1, 1, 2]
2 │ 2 7 [1, 3, 3]
In the first approach we wrapped both :a
and :b
in a vector, and in the second
we used the pseudo-broadcasting supported by the DataFrame
constructor again.
Why using nested vectors might be beneficial?
There are two kinds of benefits. One is readability. The other is performance.
Regarding the readability. Consider you have a homogeneous set of values. Then you might prefer to store them in a one column to keep them together. Here is an example of such data:
julia> df1 = DataFrame(x = [fill(i, 1000) for i in 1:10000])
10000×1 DataFrame
Row │ x
│ Array…
───────┼───────────────────────────────────
1 │ [1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ……
2 │ [2, 2, 2, 2, 2, 2, 2, 2, 2, 2 ……
3 │ [3, 3, 3, 3, 3, 3, 3, 3, 3, 3 ……
⋮ │ ⋮
9999 │ [9999, 9999, 9999, 9999, 9999, 9…
10000 │ [10000, 10000, 10000, 10000, 100…
9995 rows omitted
julia> df2 = select(df1, :x => AsTable)
10000×1000 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x7 x8 ⋯
│ Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int ⋯
───────┼───────────────────────────────────────────────────────
1 │ 1 1 1 1 1 1 1 ⋯
2 │ 2 2 2 2 2 2 2
3 │ 3 3 3 3 3 3 3
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
9999 │ 9999 9999 9999 9999 9999 9999 9999 99
10000 │ 10000 10000 10000 10000 10000 10000 10000 100 ⋯
993 columns and 9995 rows omitted
Of course it is subjective, but provided that the nested data makes sense to be
kept together (e.g. it could be a time series), I would prefer to use df1
than
df2
. Clearly, in this case one could argue that one could consider using narrow
rather than wide table format, but in practical cases we would typically
have many additional columns with e.g. metadata that would have a constant value
for the whole series, and then I often find it more convenient not to use narrow
format.
Now let us handle the performance issue. Assume we would want to transform the data by summing it. Here are the timings of operations in both cases (the timings are for a second run of each operation):
julia> @time select(df1, :x => ByRow(sum));
0.006237 seconds (102 allocations: 83.953 KiB)
julia> @time select(df2, x -> sum.(eachrow(x)));
1.829907 seconds (68.27 M allocations: 1.167 GiB, 4.75% gc time, 1.30% compilation time)
Note that in the case of df2
one could consider writing something like
select(df2, names(df2, r"x") => ByRow(+))
to make the operation type-stable,
but the performance of this will be very bad. In this discussion you
can read about the future plans of improving simple row aggregations (like
sum
here), but I could have used some more complex transformation operation
which even after that changes would be much faster on nested column.
In short - nested column, assuming its eltype
is concrete, solves the tension
between type instability of DataFrame
vs potentially extremely long compilation
times when you want to switch to a type stable mode via e.g. a Tuple
or
a NamedTuple
.
Storage of data frames with nested columns
One drawback of nested columns is that they cannot be easily persistently stored in CSV files. However, they are easy enough to work with using Arrow.jl:
julia> Arrow.write("test.arrow", df1)
"test.arrow"
julia> df1_read = DataFrame(Arrow.Table("test.arrow"));
julia> df1_read == df1
true
Here, you just need to remember that df1_read
created this way is read-only
(at the benefit of performance). You would need to write
DataFrame(Arrow.Table("test.arrow"), copycols=true)
to get mutable columns.
Conclusions
Vectors of vectors are not used very commonly in the other data processing ecosystems. However, DataFrames.jl is designed to allow easy processing of such data and I find them convenient surprisingly often.