Fixed width strings in CSV.jl
Introduction
In my recent post I have written how using inline strings can help to in crease join performance. Since with 0.9 release of CSV.jl these strings are used by default to read CSV files so I want to focus solely on this topic today.
The post was written under Julia 1.6.1, DataFrames.jl 1.2.2, WeakRrefStrings.jl 1.3.0, BenchmarkTools.jl 1.1.4, and CSV.jl 0.9.1.
Introducing inline strings
WeakRefStrings.jl defines 8 inline string types:
julia> using WeakRefStrings
julia> st = subtypes(InlineString)
8-element Vector{Any}:
String1
String127
String15
String255
String3
String31
String63
String7
What is important is that all these types are bits types, as you can see here:
julia> using DataFrames
julia> sort(DataFrame(st=st,
isbits = isbitstype.(st),
sizeof = sizeof.(st)), :sizeof)
8×3 DataFrame
Row │ st isbits sizeof
│ Any Bool Int64
─────┼───────────────────────────
1 │ String1 true 1
2 │ String3 true 4
3 │ String7 true 8
4 │ String15 true 16
5 │ String31 true 32
6 │ String63 true 64
7 │ String127 true 128
8 │ String255 true 256
The suffix of the name of the specific string type indicates the maximum size of
the string it can hold. So for example String3
can hold strings that have
maximum size 3
as returned by sizeof
. Here is an example:
julia> String3("123")
"123"
julia> String3("1234")
ERROR: ArgumentError: string too large (4) to convert to String3
julia> String3("∀")
"∀"
julia> String3("∀1")
ERROR: ArgumentError: string too large (4) to convert to String3
As you can see here it is important to remember that some characters (like ∀
)
in UTF-8 encoding take more than one code unit.
You can use the InlineString
function to create an inline string of
automatically selected minimal size:
julia> typeof(InlineString("∀"))
String3
julia> typeof(InlineString("∀1"))
String7
Finally, as we can see the maximum size of inline string is 255, so:
julia> InlineString("a"^256)
ERROR: ArgumentError: string too large (256) to convert to InlineString
In summary, as you can see, the String[N]
types are similar to CHAR(N)
types that are provided by many data bases. The limitation is that N
can take only several fixed values.
Why and when to use inline strings?
There are two benefits of inline strings.
The first is that they can be faster. A special case when this is visible are equality comparisons which are quite common in practice.
julia> using Random, BenchmarkTools
julia> Random.seed!(1234);
julia> s1 = [randstring(3) for i in 1:1000];
julia> s2 = String3.(s1);
julia> @benchmark $s1 .== permutedims($s1)
BenchmarkTools.Trial: 1064 samples with 1 evaluation.
Range (min … max): 3.928 ms … 6.435 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.910 ms ┊ GC (median): 0.00%
Time (mean ± σ): 4.697 ms ± 404.737 μs ┊ GC (mean ± σ): 0.06% ± 0.91%
▅▃ ▆ ▃ ▃█▄▂
██▇▇▅▇▅▅▄▆▄▄▄▅▆▅▅▁▄▆▅▅▇▅▄▁▅█▄▄▁█▇█▇▇█████▇▅▄▅▅▄▁▄▁▄▄▅▄▁▁▅▅▄ █
3.93 ms Histogram: log(frequency) by time 5.49 ms <
Memory estimate: 126.53 KiB, allocs estimate: 6.
julia> @benchmark $s2 .== permutedims($s2)
BenchmarkTools.Trial: 4879 samples with 1 evaluation.
Range (min … max): 888.150 μs … 2.037 ms ┊ GC (min … max): 0.00% … 47.53%
Time (median): 1.043 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.023 ms ± 77.610 μs ┊ GC (mean ± σ): 0.23% ± 2.38%
▅ ▃ ▁ ▁ ▅▃ █▆ ▁
█▅▃▃█████▅▃▁█▆▅▅▁█▇▃▁▃██▅▁▄███▆▅▄▆▇▅▇▅▅▇▅▅▄▄▃▅▅▅▅▃▁▃▅▄▃▁▃▁▃▄ █
888 μs Histogram: log(frequency) by time 1.22 ms <
Memory estimate: 126.53 KiB, allocs estimate: 6.
The second is that they are not heap allocated so they do not lead to significant GC latency. This topic was presented in the post on join performance.
So what are the potential shortcommings. There are several:
- they have a limited capacity, so one might not be able to always rely that the conversion to inline string is possible;
- they are not efficient when we work with strings of highly variable length;
- mixing of inline strings in collections can lead to type instabilities.
I have already discussed the first topic. Now, let us handle the second and third one in consecutive sections.
Memory footprint of inline strings
Consider the following collection:
julia> x = [String127("a"^i) for i in 1:100]
100-element Vector{String127}:
"a"
"aa"
"aaa"
⋮
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
Let us check how expensive it is to perform its copy:
julia> @benchmark copy($x)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 677.100 ns … 131.135 μs ┊ GC (min … max): 0.00% … 95.08%
Time (median): 1.224 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.464 μs ± 4.174 μs ┊ GC (mean ± σ): 13.84% ± 4.88%
▁▅███▇▆▄▂
▂▁▁▁▁▁▂▂▂▂▂▂▂▂▂▃▃▆██████████▇▅▅▅▅▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▂▁▁▁▂▂▂ ▃
677 ns Histogram: frequency by time 2.11 μs <
Memory estimate: 12.62 KiB, allocs estimate: 1.
Now we convert it to a standard String
:
julia> y = String.(x)
100-element Vector{String}:
"a"
"aa"
"aaa"
⋮
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
julia> @benchmark copy($y)
BenchmarkTools.Trial: 10000 samples with 973 evaluations.
Range (min … max): 64.867 ns … 1.592 μs ┊ GC (min … max): 0.00% … 82.37%
Time (median): 80.387 ns ┊ GC (median): 0.00%
Time (mean ± σ): 98.944 ns ± 111.194 ns ┊ GC (mean ± σ): 13.24% ± 10.98%
▆█▄▂ ▁
██████▇▆▆▆▄▅▄▃▄▃▁▃▁▁▁▁▁▁▁▃▅▆▅▄▅▇▅▁▁▁▁▁▁▁▁▁▁▁▁▁▄▄▅▅▄▅▅▅▅▇▆▆▅▅ █
64.9 ns Histogram: log(frequency) by time 830 ns <
Memory estimate: 896 bytes, allocs estimate: 1.
As you can see the operation on String
is much faster. The reason is that x
stores 100
entries of which each has 128
bytes, while y
stores 100
pointers to strings that have only 8
bytes of size. See:
julia> sizeof(x)
12800
julia> sizeof(y)
800
So much less data movement is involved if we performed copying of pointers only.
The conclusion is that it is probably better to use String
type if we expect
to work with collections of strings that have a highly variable length.
Collections of inline strings
The second potential issue is working with collections of inline strings.
When you read in string data from a file CSV.jl will automatically create columns of appropriate widths:
julia> using CSV
julia> str = """
w1,w2,w3,w4
a,a,a,a
b,bb,bb,bb
c,cc,ccc,ccc
d,dd,ddd,dddd
"""
"w1,w2,w3,w4\na,a,a,a\nb,bb,bb,bb\nc,cc,ccc,ccc\nd,dd,ddd,dddd\n"
julia> df = CSV.read(IOBuffer(str), DataFrame)
4×4 DataFrame
Row │ w1 w2 w3 w4
│ String1 String3 String3 String7
─────┼────────────────────────────────────
1 │ a a a a
2 │ b bb bb bb
3 │ c cc ccc ccc
4 │ d dd ddd dddd
However, one has to be careful when converting to inline string manually. Have a look:
julia> strs2 = InlineString.(strs)
8-element Vector{InlineString}:
"a"
"aa"
"aaa"
"aaaa"
"aaaaa"
"aaaaaa"
"aaaaaaa"
"aaaaaaaa"
julia> typeof.(strs2)
8-element Vector{DataType}:
String1
String3
String3
String7
String7
String7
String7
String15
Sometimes this is indeed what one would want (the narrowest possible representation at the cost of having a collection of abstract element type). But currently if you would want to have an automatic type promotion and a concrete element type you would have to do it manually e.g.:
julia> String15.(strs)
8-element Vector{String15}:
"a"
"aa"
"aaa"
"aaaa"
"aaaaa"
"aaaaaa"
"aaaaaaa"
"aaaaaaaa"
Operations on inline strings
Additionally one should know that most operations transforming inline strings
will currently produce a String
by default, e.g.:
julia> s = String3("a")
"a"
julia> typeof(s^2)
String
julia> typeof(uppercase(s))
String
which also is best kept in mind when working with them.
Conclusions
Inline strings are an excellent addition to Julia, however, one should know the type of task they were designed to help with.
As you could see in my examples inline strings are ideal for situations where we have millions of relatively small and strings that have relatively homogeneous size and are not mutated.
This use case might seem restrictive, but actually in practice it is quite common as e.g. all sorts of customer or product identifiers have exactly this nature.
Finally, some of the limitations I listed in this post might be lifted in the future. If you need some functionality please do not hesitate to open an issue on WeakRefStrings.jl GitHub repository.