Benchmarking split-apply-combine: DataFrames.jl vs Pandas
Introduction
Recently chapters 5 and 6 of my Julia for Data Analysis book were published in MEAP. Finally we are getting to some more fun stuff like parametric types or discussion of options for working with strings in Julia.
In the comments from the readers who decided to opt-in for early access to the book I was asked how does performance of DataFrames.jl compare to Pandas. Therefore today I decided to run a small benchmark.
The codes were run under:
- Julia 1.7.2, DataFrames.jl 1.3.4, InlineStrings.jl 1.1.2, PooledArrays.jl 1.4.2, CategoricalArrays.jl 0.10.6, and BenchmarkTools.jl 1.3.1;
- Python 3.9.12, Pandas 1.4.2.
The test scenario
I want to:
- generate 1,000,000 random strings of length 8 consisting of letters;
- put them in a data frame;
- group the data frame by the string column and calculate number of rows in every group.
In the tests I want to compare the performance of different options how strings can be stored.
Julia code
In Julia tests I check the following storage modes of strings: Vector{String}
,
Vector{String15}
(inline string), Vector{Symbol}
, PooledVector{String}
,
and CategoricalVector{String}
. Here is the benchmark:
julia> using Random
julia> using InlineStrings
julia> using BenchmarkTools
julia> using PooledArrays
julia> using CategoricalArrays
julia> Random.seed!(1234);
julia> df = transform!(DataFrame(str=[randstring() for _ in 1:10^6]),
:str .=>
[inlinestrings, ByRow(Symbol),
PooledArray, CategoricalArray] .=>
[:istr, :sym, :pstr, :cstr])
1000000×5 DataFrame
Row │ str istr sym pstr cstr
│ String String15 Symbol String Cat…
─────────┼──────────────────────────────────────────────────
1 │ KYDtLOxn KYDtLOxn KYDtLOxn KYDtLOxn KYDtLOxn
2 │ UkZj0CRg UkZj0CRg UkZj0CRg UkZj0CRg UkZj0CRg
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
999999 │ CZL6fcG5 CZL6fcG5 CZL6fcG5 CZL6fcG5 CZL6fcG5
1000000 │ 1MqvPpVb 1MqvPpVb 1MqvPpVb 1MqvPpVb 1MqvPpVb
999996 rows omitted
julia> @belapsed combine(groupby($df, :str), nrow)
0.0486384
julia> @belapsed combine(groupby($df, :istr), nrow)
0.0461695
julia> @belapsed combine(groupby($df, :sym), nrow)
0.0422065
julia> @belapsed combine(groupby($df, :pstr), nrow)
0.0096023
julia> @belapsed combine(groupby($df, :cstr), nrow)
0.0304362
As you can see PooledVector
is by far fastest. Next comes CategoricalVector
,
but it is significantly slower. The performance of String
, String15
,
and Symbol
is comparable; String
is slowest and Symbol
is fastest.
Python code
In Python I compared Series
containing str
and Categorical
. Here is the
code:
import pandas as pd
import random
import string
random.seed(1234)
s = [''.join(random.choice(string.ascii_letters) for _ in range(8)) for _ in range(10**6)]
cs = pd.Categorical(s)
df = pd.DataFrame({'str': s, 'cstr': cs})
%time df.groupby(['str']).size()
CPU times: total: 1.56 s
Wall time: 1.56 s
Out[8]:
str
AAAGINxY 1
AAAIavpP 1
AAASsiaU 1
AAAfnfqH 1
AAAxQZTv 1
..
zzzNaWBc 1
zzzTCZdT 1
zzzmmTcP 1
zzzoySQG 1
zzzwzAsA 1
Length: 1000000, dtype: int64
%time df.groupby(['cstr']).size()
CPU times: total: 109 ms
Wall time: 105 ms
Out[9]:
cstr
AAAGINxY 1
AAAIavpP 1
AAASsiaU 1
AAAfnfqH 1
AAAxQZTv 1
..
zzzNaWBc 1
zzzTCZdT 1
zzzmmTcP 1
zzzoySQG 1
zzzwzAsA 1
Length: 1000000, dtype: int64
As you can see using Categorical
is much faster. However, both options
are visibly slower than any of the Julia variants we have considered.
Conclusions
First, I would like to thank all reviewers and readers of my book in MEAP. They really help to improve it and I appreciate the feedback I receive a lot.
Now let me turn to a comment on benchmark results. First, I want to remark that I usually avoid running such comparisons as it is really hard to do them comprehensively and objectively. Therefore, I recommend you to treat my conclusions as a general guidance inferred from the tests I have performed:
- DataFrames.jl was consistently faster than Pandas.
- In both ecosystems it is better to “pool” data before running split-apply-combine operation; this is especially relevant if you expect to run such operation many times.
- In Julia PooledArrays.jl is visibly faster than CategoricalArrays.jl. Therefore if you just need compression (as offered by PooledArrays.jl) I do not recommend you to use CategoricalArrays.jl (use it if you need a first-class container for categorical data). Note that we see such a big difference because we had very many levels in our data. If we had only a few levels (a typical use case for CategoricalArrays.jl) its performance would be much closer to PooledArrays.jl.
- In Julia if you use non-pooled vector the choice of string storage type does not affect the considered timings a lot. If you would like to learn when this decision is important I recommend you to read this post.
I hope you will find the presented benchmarks useful in your daily data wrangling tasks.