Welcome to DuckDB
Introduction
I usually write on this blog about DataFrames.jl. This time I decided to switch modes and write about good old SQL queries.
This post is a follow-up to the talk “Data manipulation in Julia for pandas users” that I gave last week during Data Science Summit 2022. In that talk I explained how DataFrames.jl compares to pandas.
After the talk, I wanted to check how the comparison would look against DuckDB, which is an excellent in-process SQL OLAP database management system.
In this post I show how a sample analytical scenario can be executed in DataFrames.jl and in DuckDB, and how well the two integrate.
The code was run under Julia 1.8.2, CSV.jl 0.10.8, DataFrames.jl 1.4.4, and DuckDB.jl 0.6.0 on a laptop with 32 GB of RAM. All tests were performed using 8 cores.
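In case you would like to reproduce the timings, note that both ecosystems are multi-threaded. Below is a minimal sketch of how one can pin both to 8 cores; the -t flag is standard Julia (CSV.jl uses the threads for parsing) and the threads pragma is standard DuckDB SQL, but treat this snippet as a reproduction hint rather than my exact setup:
$ julia -t 8
julia> using DuckDB
julia> con = DBInterface.connect(DuckDB.DB, ":memory:");
julia> DBInterface.execute(con, "PRAGMA threads=8;"); # limit DuckDB to 8 threads (it uses all available cores by default)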
The task
When I was preparing the “Data manipulation in Julia for pandas users” talk I wanted to choose a data set that is not too big (so that things are quick) but also not too small (so that they are not too quick).
So I went to Kaggle and searched for datasets with 5 to 10 GB of data, stored in CSV format, and spread across a few (but not many) files. I ordered the results by hotness, and the one at the top of the list was the Instagram dataset, consisting of three tables:
- instagram_posts.csv: 42M rows of individual post information;
- instagram_locations.csv: 1.2M rows of location information;
- instagram_profiles.csv: 4.5M rows of user profile information.
You can find the dataset here.
For each post we can match the profile of the person who made it and the location where it was made. Therefore I decided to do the following operations as a test:
- read the data from disk to RAM;
- drop all rows with missing data in posts;
- drop all rows with missing data in profiles, and in case of duplicate profiles keep only one of them;
- drop all rows with missing data in locations;
- left join posts and profiles;
- left join posts and locations;
- compute some aggregate statistics of posts data by country.
Below I show how these tasks can be performed in DataFrames.jl and DuckDB.
(As a side note: doing this “as is” in pandas was not possible on my laptop, as I was running out of RAM and some of the steps took too long; therefore in my talk during the conference I presented a slightly different scenario.)
Initial inspection of the data
We start by peeking into the data. First we do it in Julia:
julia> using DataFrames
julia> using CSV
julia> using Statistics
julia> describe(CSV.read("instagram_posts.csv", DataFrame, limit=100), :eltype)
10×2 DataFrame
Row │ variable eltype
│ Symbol Type
─────┼─────────────────────────────────────────
1 │ sid Int64
2 │ sid_profile Int64
3 │ post_id String15
4 │ profile_id Int64
5 │ location_id Int64
6 │ cts String31
7 │ post_type Int64
8 │ description Union{Missing, String}
9 │ numbr_likes Int64
10 │ number_comments Int64
julia> describe(CSV.read("instagram_profiles.csv", DataFrame, limit=100), :eltype)
11×2 DataFrame
Row │ variable eltype
│ Symbol Type
─────┼─────────────────────────────────────────────
1 │ sid Int64
2 │ profile_id Int64
3 │ profile_name String31
4 │ firstname_lastname String31
5 │ description Union{Missing, String}
6 │ following Union{Missing, Int64}
7 │ followers Union{Missing, Int64}
8 │ n_posts Union{Missing, Int64}
9 │ url Union{Missing, String}
10 │ cts String31
11 │ is_business_account Union{Missing, Bool}
julia> describe(CSV.read("instagram_locations.csv", DataFrame, limit=100), :eltype)
23×2 DataFrame
Row │ variable eltype
│ Symbol Type
─────┼──────────────────────────────────────────────────
1 │ sid Int64
2 │ id Int64
3 │ name String
4 │ street Union{Missing, String}
5 │ zip Union{Missing, String15}
6 │ city Union{Missing, String}
7 │ region Missing
8 │ cd Union{Missing, String3}
9 │ phone Union{Missing, String31}
10 │ aj_exact_city_match Union{Missing, Bool}
11 │ aj_exact_country_match Union{Missing, Bool}
12 │ blurb String
13 │ dir_city_id Union{Missing, String15}
14 │ dir_city_name Union{Missing, String}
15 │ dir_city_slug Union{Missing, String}
16 │ dir_country_id Union{Missing, String3}
17 │ dir_country_name Union{Missing, String15}
18 │ lat Union{Missing, Float64}
19 │ lng Union{Missing, Float64}
20 │ primary_alias_on_fb String
21 │ slug String
22 │ website String
23 │ cts String31
Since the data frames have many columns, I summarized them keeping only the element types. I loaded only the first 100 rows of data for inspection.
Now a similar task in DuckDB:
julia> using DuckDB
julia> con = DBInterface.connect(DuckDB.DB, ":memory:")
DuckDB.DB(":memory:")
julia> DBInterface.execute(con,
"""
DESCRIBE
SELECT *
FROM 'instagram_posts.csv'
LIMIT 100
""")
10×6 DataFrame
Row │ column_name column_type null key default extra
│ String? String? String? String? String? String?
─────┼──────────────────────────────────────────────────────────────────
1 │ sid INTEGER YES missing missing missing
2 │ sid_profile INTEGER YES missing missing missing
3 │ post_id VARCHAR YES missing missing missing
4 │ profile_id BIGINT YES missing missing missing
5 │ location_id BIGINT YES missing missing missing
6 │ cts TIMESTAMP YES missing missing missing
7 │ post_type INTEGER YES missing missing missing
8 │ description VARCHAR YES missing missing missing
9 │ numbr_likes INTEGER YES missing missing missing
10 │ number_comments INTEGER YES missing missing missing
julia> DBInterface.execute(con,
"""
DESCRIBE
SELECT *
FROM 'instagram_profiles.csv'
LIMIT 100
""")
11×6 DataFrame
Row │ column_name column_type null key default extra
│ String? String? String? String? String? String?
─────┼──────────────────────────────────────────────────────────────────────
1 │ sid INTEGER YES missing missing missing
2 │ profile_id BIGINT YES missing missing missing
3 │ profile_name VARCHAR YES missing missing missing
4 │ firstname_lastname VARCHAR YES missing missing missing
5 │ description VARCHAR YES missing missing missing
6 │ following INTEGER YES missing missing missing
7 │ followers INTEGER YES missing missing missing
8 │ n_posts INTEGER YES missing missing missing
9 │ url VARCHAR YES missing missing missing
10 │ cts VARCHAR YES missing missing missing
11 │ is_business_account BOOLEAN YES missing missing missing
julia> DBInterface.execute(con,
"""
DESCRIBE
SELECT *
FROM 'instagram_locations.csv'
LIMIT 100
""")
23×6 DataFrame
Row │ column_name column_type null key default extra
│ String? String? String? String? String? String?
─────┼─────────────────────────────────────────────────────────────────────────
1 │ sid INTEGER YES missing missing missing
2 │ id BIGINT YES missing missing missing
3 │ name VARCHAR YES missing missing missing
4 │ street VARCHAR YES missing missing missing
5 │ zip VARCHAR YES missing missing missing
6 │ city VARCHAR YES missing missing missing
7 │ region VARCHAR YES missing missing missing
8 │ cd VARCHAR YES missing missing missing
9 │ phone VARCHAR YES missing missing missing
10 │ aj_exact_city_match BOOLEAN YES missing missing missing
11 │ aj_exact_country_match BOOLEAN YES missing missing missing
12 │ blurb VARCHAR YES missing missing missing
13 │ dir_city_id VARCHAR YES missing missing missing
14 │ dir_city_name VARCHAR YES missing missing missing
15 │ dir_city_slug VARCHAR YES missing missing missing
16 │ dir_country_id VARCHAR YES missing missing missing
17 │ dir_country_name VARCHAR YES missing missing missing
18 │ lat DOUBLE YES missing missing missing
19 │ lng DOUBLE YES missing missing missing
20 │ primary_alias_on_fb VARCHAR YES missing missing missing
21 │ slug VARCHAR YES missing missing missing
22 │ website VARCHAR YES missing missing missing
23 │ cts TIMESTAMP YES missing missing missing
Note that we already see how nicely DuckDB is integrated with Julia. You get a DataFrame as a result, so you can keep working with it if you want.
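For example, here is a minimal sketch (the aggregation itself is made up just for illustration) that materializes a query result with the DataFrame constructor and then continues with regular DataFrames.jl operations:
julia> df = DataFrame(DBInterface.execute(con,
           """
           SELECT post_type, count(*) AS cnt
           FROM 'instagram_posts.csv'
           GROUP BY post_type
           """)) # the result is Tables.jl-compatible, so DataFrame(...) just works
julia> sort!(df, :cnt, rev=true) # a regular DataFrames.jl call on the result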
Performing the analysis
Let us check the performance of the requested operation end-to-end in both ecosystems.
Start with the pure Julia solution:
julia> @time begin
posts = CSV.read("instagram_posts.csv", DataFrame,
select=[:profile_id, :location_id, :number_comments])
profiles = CSV.read("instagram_profiles.csv", DataFrame,
select=[:profile_id, :n_posts])
locations = CSV.read("instagram_locations.csv", DataFrame,
select=[:id, :dir_country_name])
posts2 = dropmissing(posts)
profiles2 = combine(groupby(dropmissing(profiles), :profile_id),
:n_posts => maximum => :n_posts)
locations2 = dropmissing(locations)
leftjoin!(posts2, profiles2, on="profile_id")
leftjoin!(posts2, locations2, on="location_id" => "id")
gdf = groupby(posts2, "dir_country_name", sort=true, skipmissing=true)
combine(gdf, [:n_posts, :number_comments] .=> mean∘skipmissing)
end
23.323866 seconds (514.97 k allocations: 5.622 GiB, 3.30% gc time)
235×3 DataFrame
Row │ dir_country_name n_posts_mean_skipmissing number_comments_mean_skipmissing
│ String? Float64 Float64
─────┼──────────────────────────────────────────────────────────────────────────────────
1 │ Afghanistan 445.58 8.10767
2 │ Albania 524.789 6.83736
3 │ Algeria 353.933 9.9979
4 │ American Samoa 117.692 7.87879
5 │ Andorra 486.206 7.01296
6 │ Angola 469.621 7.48282
7 │ Anguilla 1206.22 4.77193
8 │ Antarctica 1165.97 20.3073
9 │ Antigua 638.995 8.05999
10 │ Argentina 727.673 11.0821
11 │ Armenia 663.196 4.76728
⋮ │ ⋮ ⋮ ⋮
226 │ Uruguay 560.936 8.00079
227 │ Uzbekistan 445.975 8.84245
228 │ Vanuatu 539.349 4.63688
229 │ Vatican City 583.498 7.05353
230 │ Venezuela 551.79 9.4976
231 │ Vietnam 582.829 4.46671
232 │ Western Sahara 643.888 4.6676
233 │ Yemen 411.5 27.9416
234 │ Zambia 411.779 7.09239
235 │ Zimbabwe 518.17 8.3875
214 rows omitted
Now compare it to DuckDB:
julia> @time DBInterface.execute(con,
"""
SELECT dir_country_name,
mean(n_posts) FILTER (WHERE n_posts IS NOT NULL) n_posts_mean,
mean(number_comments) FILTER (WHERE number_comments IS NOT NULL) number_comments_mean
FROM (SELECT po.number_comments, pr.n_posts, loc.dir_country_name
FROM (SELECT profile_id, location_id, number_comments FROM 'instagram_posts.csv'
WHERE profile_id IS NOT NULL
AND location_id IS NOT NULL
AND number_comments IS NOT NULL
) po
LEFT JOIN (SELECT profile_id, max(n_posts) n_posts
FROM 'instagram_profiles.csv'
WHERE profile_id IS NOT NULL
AND n_posts IS NOT NULL
GROUP BY profile_id) pr
ON po.profile_id = pr.profile_id
LEFT JOIN (SELECT id, dir_country_name FROM 'instagram_locations.csv'
WHERE id IS NOT NULL
AND dir_country_name IS NOT NULL) loc
ON po.location_id = loc.id)
WHERE dir_country_name IS NOT NULL
GROUP BY dir_country_name
ORDER BY dir_country_name
""")
93.220900 seconds (166.60 k allocations: 8.578 MiB, 0.11% compilation time)
235×3 DataFrame
Row │ dir_country_name n_posts_mean number_comments_mean
│ String? Float64? Float64?
─────┼──────────────────────────────────────────────────────────
1 │ Afghanistan 445.58 8.10767
2 │ Albania 524.789 6.83736
3 │ Algeria 353.933 9.9979
4 │ American Samoa 117.692 7.87879
5 │ Andorra 486.206 7.01296
6 │ Angola 469.621 7.48282
7 │ Anguilla 1206.22 4.77193
8 │ Antarctica 1165.97 20.3073
9 │ Antigua 638.995 8.05999
10 │ Argentina 727.673 11.0821
11 │ Armenia 663.196 4.76728
⋮ │ ⋮ ⋮ ⋮
226 │ Uruguay 560.936 8.00079
227 │ Uzbekistan 445.975 8.84245
228 │ Vanuatu 539.349 4.63688
229 │ Vatican City 583.498 7.05353
230 │ Venezuela 551.79 9.4976
231 │ Vietnam 582.829 4.46671
232 │ Western Sahara 643.888 4.6676
233 │ Yemen 411.5 27.9416
234 │ Zambia 411.779 7.09239
235 │ Zimbabwe 518.17 8.3875
214 rows omitted
We see that we get the same result, but DuckDB's end-to-end performance is worse. However, let us split out the data ingest (reading from disk into RAM) into a separate step.
Here we will use a very nice feature of DuckDB: you can register a data frame as a table.
We start with Julia again (the posts, profiles, and locations data frames are already in RAM from the previous run):
julia> @time begin
posts2 = dropmissing(posts)
profiles2 = combine(groupby(dropmissing(profiles), :profile_id),
:n_posts => maximum => :n_posts)
locations2 = dropmissing(locations)
leftjoin!(posts2, profiles2, on="profile_id")
leftjoin!(posts2, locations2, on="location_id" => "id")
gdf = groupby(posts2, "dir_country_name", sort=true, skipmissing=true)
combine(gdf, [:n_posts, :number_comments] .=> mean∘skipmissing)
end;
5.632483 seconds (1.61 k allocations: 4.065 GiB, 2.94% gc time)
Now check DuckDB:
julia> DuckDB.register_data_frame(con, posts, "posts")
julia> DuckDB.register_data_frame(con, profiles, "profiles")
julia> DuckDB.register_data_frame(con, locations, "locations")
julia> @time DBInterface.execute(con,
"""
SELECT dir_country_name,
mean(n_posts) FILTER (WHERE n_posts IS NOT NULL) n_posts_mean,
mean(number_comments) FILTER (WHERE number_comments IS NOT NULL) number_comments_mean
FROM (SELECT po.number_comments, pr.n_posts, loc.dir_country_name
FROM (SELECT profile_id, location_id, number_comments
FROM posts
WHERE profile_id IS NOT NULL
AND location_id IS NOT NULL
AND number_comments IS NOT NULL
) po
LEFT JOIN (SELECT profile_id, max(n_posts) n_posts
FROM (SELECT profile_id, n_posts
FROM profiles
WHERE n_posts IS NOT NULL)
GROUP BY profile_id) pr
ON po.profile_id = pr.profile_id
LEFT JOIN (SELECT id, dir_country_name
FROM locations
WHERE id IS NOT NULL
AND dir_country_name IS NOT NULL) loc
ON po.location_id = loc.id)
GROUP BY dir_country_name
ORDER BY dir_country_name
""");
1.818248 seconds (616.46 k allocations: 24.675 MiB, 14.41% compilation time)
As you can see, this time DuckDB is faster than DataFrames.jl. We also see that the majority of the original time was spent reading the data from disk, and that the analysis part in both ecosystems is reasonably fast.
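If you want to verify this claim yourself, here is a sketch that times only the ingest part (the same CSV.read calls as in the end-to-end run above; I omit the timing output, as it will vary by machine):
julia> @time begin
           posts = CSV.read("instagram_posts.csv", DataFrame,
                            select=[:profile_id, :location_id, :number_comments])
           profiles = CSV.read("instagram_profiles.csv", DataFrame,
                               select=[:profile_id, :n_posts])
           locations = CSV.read("instagram_locations.csv", DataFrame,
                                select=[:id, :dir_country_name])
       end;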
I investigated this further, and the major conclusions about the performance difference are:
- DuckDB performs the query in one shot, while the DataFrames.jl API is eager and materializes all intermediate tables; when I split the big SQL query into steps that materialize intermediate tables (see the sketch after this list), the performance difference was significantly smaller;
- joins are faster in DuckDB (this is a known issue in DataFrames.jl that we still need to add multi-threading support to some of the algorithms we provide);
- split-apply-combine is faster in DataFrames.jl.
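For reference, here is a sketch of how one can force DuckDB to materialize an intermediate table, mimicking the eager style of DataFrames.jl; CREATE TEMPORARY TABLE is standard DuckDB SQL, and the subquery is the po step from the big query above (the pr and loc steps can be materialized analogously):
julia> DBInterface.execute(con,
       """
       CREATE TEMPORARY TABLE po AS
       SELECT profile_id, location_id, number_comments
       FROM posts
       WHERE profile_id IS NOT NULL
         AND location_id IS NOT NULL
         AND number_comments IS NOT NULL
       """);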
Conclusions
For me the key takeaways from this test were:
- DuckDB integrates excellently with Julia and DataFrames.jl. As you saw, you can register data frames as tables in DuckDB and get query results back as data frames.
- DuckDB is fast. You can appreciate the difference especially when you perform large joins (this will improve in DataFrames.jl in the future) and when you execute complex queries (this is not likely to improve in DataFrames.jl in the near future; the current design of the package is eager and does not support optimization of composed queries; see the EXPLAIN sketch after this list).
- I found the DataFrames.jl code easier to read, but I am probably biased here. Data scientists proficient in SQL will probably find the DuckDB code more natural.
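As a closing tip: if you are curious about the plan DuckDB builds for a query, you can prepend EXPLAIN to it. A minimal sketch on the tables we registered earlier:
julia> DBInterface.execute(con,
       """
       EXPLAIN
       SELECT po.location_id, pr.n_posts
       FROM posts po
       LEFT JOIN profiles pr
       ON po.profile_id = pr.profile_id
       """)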
Happy hacking with Julia and DuckDB!