Introduction

Joining tables is one of the fundamental operations when wrangling data, so people using DataFrames.jl understandably want joins to be as fast as possible.

In this post I will show you some tricks that can make joins faster in one scenario where they currently hit a bottleneck: when you have many Strings in your data. I cover this scenario because it is quite common in data science workflows.

This post was written under Julia 1.6.1, DataFrames.jl 1.2.1, BenchmarkTools.jl 1.1.1, and WeakRefStrings.jl 1.1.0. I am running the tests on a laptop under Linux and with 16GB of RAM using a single thread.

In the timings below I report the memory footprint of the julia process in each scenario. This is relevant because Julia might struggle with memory management when we get very close to the available memory limit. I have chosen the size of the tests so that they easily fit into memory (but of course they are large enough to cause issues).
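If you want to track this footprint yourself, here is a minimal sketch of two rough ways to do it from within Julia. Note that Sys.maxrss() reports the peak resident set size of the process rather than the current one, and the /proc-based helper (current_rss_gb is just an illustrative name I use here) is Linux-specific.

# Peak resident set size of the julia process so far, in GB:
Sys.maxrss() / 2^30

# Current resident set size on Linux, read from /proc (VmRSS is given in kB);
# current_rss_gb is an illustrative helper name, not something from a package:
function current_rss_gb()
    for line in eachline("/proc/self/status")
        startswith(line, "VmRSS:") && return parse(Int, split(line)[2]) / 2^20
    end
end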

The baseline

In the baseline scenario I join tables that do not contain any strings.

Here is my benchmark:

julia> using DataFrames

julia> using BenchmarkTools

julia> df1 = DataFrame(id=1:5*10^7, left=1:5*10^7)
50000000×2 DataFrame
      Row │ id        left
          │ Int64     Int64
──────────┼────────────────────
        1 │        1         1
        2 │        2         2
    ⋮     │    ⋮         ⋮
 49999999 │ 49999999  49999999
 50000000 │ 50000000  50000000
          49999996 rows omitted

julia> df2 = DataFrame(id=1:5*10^7, right=1:5*10^7)
50000000×2 DataFrame
      Row │ id        right
          │ Int64     Int64
──────────┼────────────────────
        1 │        1         1
        2 │        2         2
    ⋮     │    ⋮         ⋮
 49999999 │ 49999999  49999999
 50000000 │ 50000000  50000000
          49999996 rows omitted

julia> GC.gc() # julia process memory footprint: 1.7GB

julia> @benchmark innerjoin($df1, $df2, on=:id) seconds=60
BenchmarkTools.Trial: 28 samples with 1 evaluation.
 Range (min … max):  1.963 s …   2.255 s  ┊ GC (min … max): 0.02% … 9.21%
 Time  (median):     2.190 s              ┊ GC (median):    9.49%
 Time  (mean ± σ):   2.188 s ± 47.369 ms  ┊ GC (mean ± σ):  9.12% ± 1.79%

                                             █ ▆
  ▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁███▅▄▁▁▁▁▁▁▁▁▅ ▁
  1.96 s         Histogram: frequency by time        2.26 s <

 Memory estimate: 1.86 GiB, allocs estimate: 198.

As you can see, the baseline timing is around 2 seconds. The GC (garbage collector) gets triggered in the benchmarks, but it does not take more than about 10% of the total run time.

This gives us a baseline for our tests.

The issue of many small allocated objects

Now suppose the :left column in df1 and the :right column in df2 instead contain elements of type String. Note that this does not change the join logic itself, as I do not touch the :id column on which we perform the join.

Here is what we get:

julia> df1.left = string.(df1.left);

julia> df2.right = string.(df2.right);

julia> GC.gc() # julia process memory footprint: 4.7GB

julia> @benchmark innerjoin($df1, $df2, on=:id) seconds=60
BenchmarkTools.Trial: 9 samples with 1 evaluation.
 Range (min … max):  4.670 s … 8.088 s  ┊ GC (min … max): 54.57% … 73.43%
 Time  (median):     7.764 s            ┊ GC (median):    73.47%
 Time  (mean ± σ):   7.457 s ± 1.051 s  ┊ GC (mean ± σ):  72.17% ±  6.31%

                                                   █
  ▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▄ ▁
  4.67 s        Histogram: frequency by time        8.09 s <

 Memory estimate: 1.86 GiB, allocs estimate: 198.

As you can see, we are now roughly 3.5 times slower on average. In this case GC typically takes a bit over 70% of the total run time of the join operation. We observe this even though the memory footprint of the julia process is 4.7GB, which is still well below the memory available on my machine.

The issue is that the GC, when triggered, takes a very long time because it has to traverse a huge number of small allocated objects (Strings in our case).
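You can observe this effect in isolation, without any join at all, by timing a full collection while many small objects are alive. A minimal sketch (the exact numbers will of course depend on your machine):

# A full collection with few objects alive is quick:
GC.gc()
@time GC.gc()

# Keep 50 million small String objects alive and collect again; the GC now has
# to visit every one of them, even though none of them is garbage:
v = string.(1:5*10^7)
@time GC.gc()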

We try inlining our strings

Being aware of this issue, @quinnj has implemented a set of InlineString* types in the WeakRefStrings.jl package (and soon CSV.jl will use them by default).

Let us check if using them resolves our issues:

julia> using WeakRefStrings

julia> df1.left = InlineString15.(df1.left);

julia> df2.right = InlineString15.(df2.right);

julia> GC.gc()  # julia process memory footprint: 2.5GB

julia> @benchmark innerjoin($df1, $df2, on=:id) seconds=60
BenchmarkTools.Trial: 24 samples with 1 evaluation.
 Range (min … max):  2.258 s …   2.544 s  ┊ GC (min … max):  0.03% … 10.36%
 Time  (median):     2.534 s              ┊ GC (median):    10.27%
 Time  (mean ± σ):   2.521 s ± 56.830 ms  ┊ GC (mean ± σ):   9.83% ±  2.11%

                                                       ▁▃█▃
  ▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁████ ▁
  2.26 s         Histogram: frequency by time        2.54 s <

 Memory estimate: 2.61 GiB, allocs estimate: 198.

Wow! We are almost back to the timings of the original integer columns. The memory footprint and the timings are a bit larger than in the baseline scenario because the width of the strings in our case requires us to use the InlineString15 type (16 bytes per element instead of 8 for an Int64).
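The reason these types help so much is that each InlineString15 is a fixed-size bits value, so a Vector{InlineString15} stores all string bytes inline and holds no pointers for the GC to chase. A quick sanity check (this reflects my understanding of the layout; it is not output from the runs above):

isbitstype(InlineString15)  # true: a plain 16-byte value, nothing heap-allocated per element
sizeof(InlineString15)      # 16: 15 bytes of payload plus 1 byte storing the length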

Does using Symbol solve the issue?

One alternative practice people use to work around the issues with Strings is to use Symbols instead. Since Symbols are interned, they are faster for certain operations. This is not a free lunch though: in particular, the memory used by Symbols cannot be reclaimed when they are no longer referenced.
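To make the trade-off concrete: interning means that constructing the same Symbol twice gives you literally the same object, which is why comparing Symbols is a cheap pointer check, while every distinct Symbol ever created stays in the intern table until the process exits. A small illustration:

# Both calls return the same interned object, so the comparison is a pointer check:
Symbol("id_12345") === Symbol("id_12345")  # true

# In contrast, each call below allocates a fresh String on the heap that the GC
# has to manage individually:
string(12345)
string(12345)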

Let us check if using Symbol instead of String helps in our case:

julia> df1.left = Symbol.(df1.left);

julia> df2.right = Symbol.(df2.right);

julia> GC.gc() # julia process memory footprint: 4.0GB

julia> @benchmark innerjoin($df1, $df2, on=:id) seconds=60
BenchmarkTools.Trial: 16 samples with 1 evaluation.
 Range (min … max):  1.879 s …    4.831 s  ┊ GC (min … max):  0.04% … 59.90%
 Time  (median):     3.940 s               ┊ GC (median):    51.65%
 Time  (mean ± σ):   3.868 s ± 574.918 ms  ┊ GC (mean ± σ):  50.72% ± 13.20%

                                          █
  ▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
  1.88 s         Histogram: frequency by time         4.83 s <

 Memory estimate: 1.86 GiB, allocs estimate: 198.

As you can see, the benchmarks are better than for String. However, GC time still typically takes around 50% of the execution time. This shows that although Symbols are interned they are still tracked by the GC (which in theory could be avoided, since we know that the memory they occupy is never reclaimed).

Conclusions

Here is a summary of what we have learned:

  • using the String type can lead to significant issues with GC;
  • using Symbol instead is somewhat better, but does not resolve the GC issues in full;
  • using InlineString15 gives very good results.

One just has to remember that the InlineString* types have a fixed maximum width and can hold short strings only. Fortunately, in typical scenarios where we have a lot of strings, they are not very long.
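For example, InlineString15 can hold at most 15 bytes, so constructing one from a longer string fails (the exact error type and message may differ between package versions):

InlineString15("up to 15 bytes")                                # fits: 14 bytes
InlineString15("this one is definitely longer than 15 bytes")  # throws an error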

Also note that I used the String type as an example in these tests because it is common in data science workflows. However, one would run into similar issues with any large number of objects that have to be tracked by the GC. In fact, it does not even matter whether they are processed by the join operation we used in our examples; what matters is that they reside in memory and thus need to be traversed by the GC.
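For instance, the same GC pressure shows up with millions of tiny heap-allocated wrappers that have nothing to do with strings; a minimal sketch:

# 50 million small mutable objects that the GC has to track individually:
refs = [Ref(i) for i in 1:5*10^7]
@time GC.gc()  # a full collection now has to visit each of them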

Let me also highlight that the issues would be even more visible if I used multiple threads. The reason is that currently the GC in Julia does not handle this scenario as efficiently as it potentially could. This is an issue that the Julia core developers plan to resolve, as we could learn during a talk given at JuliaCon 2021.

I believe that in the future it will be possible to make everything efficient out of the box. However, as long as the GC issues with tracking many small objects persist, it is good to know how they can be worked around.