Strings vs symbols in DataFrames.jl column indexing
Introduction
In DataFrames.jl you can use both symbols and strings for column indexing. Which one to choose is among the questions that new users ask most frequently. In this post I will explain why both options are supported and what the difference between them is. Note that this is an entry-level post, so I will omit many details of the discussed topic and focus on the most important aspects only.
The post was written under Julia 1.7.2, DataFrames.jl 1.3.4, DataFramesMeta.jl 0.12.0, BenchmarkTools.jl 1.3.1.
What are strings and symbols?
In Julia a string allows users to store sequences of characters. The simplest way to create a string is to write some text between double quotation marks:
julia> "an example string"
"an example string"
Symbols are objects used in Julia to create identifiers. You can think of them as labels. Symbols are normally created by prefixing some label with : like this:
julia> :label
:label
Using the : prefix you can create symbols that are valid variable names. So, for example, you cannot use it to create a symbol that contains a space:
julia> :my label
ERROR: syntax: extra token "label" after end of expression
Instead, in such cases, you need to call Symbol, passing it a string as an argument:
julia> Symbol("my label")
Symbol("my label")
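As a small illustration of how the two notations relate, the sketch below checks that the : prefix and the Symbol constructor give the same symbol when the label is a valid variable name, and that String turns a symbol back into text. The variable names are only illustrative.

```julia
# Both notations produce the same symbol when the label is a valid variable name.
from_prefix = :label
from_call = Symbol("label")
println(from_prefix === from_call)

# A symbol containing a space is built here with the constructor ...
spaced = Symbol("my label")

# ... and String recovers the text stored in it.
println(String(spaced))
```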
How are strings and symbols different?
To understand the difference between symbols and strings it is easiest to think of them as follows:
- symbols are labels;
- strings are sequences of characters.
So symbols are indivisible - they are always considered as a whole, while strings consist of multiple characters. The most important consequences of this distinction are the following:
- symbols are faster than strings when you compare them for equality using ==;
- you can manipulate strings (e.g. uppercase, chop, perform substring matching etc.), while none of these operations are supported for symbols.
Let us have a look at these two characteristics by example. First we check comparison speed. We create 1000-element vectors with unique values and compare all pairs of their entries, so we make 1 million comparisons and expect 1000 matches.
julia> using BenchmarkTools
julia> string_vec = string.("s", 1:1000)
1000-element Vector{String}:
"s1"
"s2"
"s3"
"s4"
⋮
"s997"
"s998"
"s999"
"s1000"
julia> symbol_vec = Symbol.("s", 1:1000)
1000-element Vector{Symbol}:
:s1
:s2
:s3
:s4
⋮
:s997
:s998
:s999
:s1000
julia> test_cmp(v) = count(x == y for x in v, y in v)
test_cmp (generic function with 1 method)
julia> @btime test_cmp($string_vec)
3.038 ms (0 allocations: 0 bytes)
1000
julia> @btime test_cmp($symbol_vec)
635.400 μs (0 allocations: 0 bytes)
1000
Indeed symbol comparison is faster.
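One common explanation for this difference, which goes beyond what the benchmark itself shows, is that Julia interns symbols: a symbol with a given name is stored once, so an equality check can compare identities instead of walking through characters. The sketch below only illustrates that idea; the names a and b are hypothetical.

```julia
# Two symbols built in different ways from the same name ...
a = Symbol("s", 1)   # the constructor concatenates its arguments
b = :s1

# ... are the very same object, so == has no characters to scan.
println(a === b)

# Strings with the same content are equal too,
# but establishing that means comparing their contents.
println(string("s", 1) == "s1")
```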
Now let us look at manipulation:
julia> str = "example"
"example"
julia> uppercase(str)
"EXAMPLE"
julia> chop(str)
"exampl"
julia> match(r"ex", str)
RegexMatch("ex")
julia> sym = :example
:example
julia> uppercase(sym)
ERROR: MethodError: no method matching uppercase(::Symbol)
julia> chop(sym)
ERROR: MethodError: no method matching chop(::Symbol)
julia> match(r"ex", sym)
ERROR: MethodError: no method matching match(::Regex, ::Symbol)
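If you do need to transform a label that is stored as a symbol, one possible approach is to convert it to a string, apply the string function, and wrap the result back into a symbol. This is a minimal sketch of that round trip; the variable names are made up for the example.

```julia
sym = :example

# Convert to a string, transform it, and turn the result back into a symbol.
upper_sym = Symbol(uppercase(String(sym)))
println(upper_sym)

# The same pattern works for other string functions, for example chop.
chopped_sym = Symbol(chop(String(sym)))
println(chopped_sym)
```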
So in summary we can conclude that:
- you can use symbols if the values stored in them are not manipulated (i.e. are treated as labels); they are faster in comparisons than strings and a bit easier to type (only the : prefix is needed), provided that they do not contain characters like spaces (in which case they are not convenient to type);
- strings support manipulation, as opposed to symbols; the cost is that comparing them is slower than comparing symbols.
Let us now discuss how these considerations translate to the DataFrames.jl realm.
Strings vs symbols in DataFrames.jl
Column names in a DataFrame are labels. For this reason both symbols and strings can be used to reference them without introducing any ambiguity. Here is an example. We start with strings:
julia> using DataFrames
julia> df = DataFrame("col1" => 1, "col 2" => 2)
1×2 DataFrame
 Row │ col1   col 2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
julia> df."col1"
1-element Vector{Int64}:
1
julia> df."col 2"
1-element Vector{Int64}:
2
julia> df[:, "col1"]
1-element Vector{Int64}:
1
julia> df[:, "col 2"]
1-element Vector{Int64}:
2
Now we try the same with symbols:
julia> df = DataFrame(:col1 => 1, Symbol("col 2") => 2)
1×2 DataFrame
 Row │ col1   col 2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
julia> df.col1
1-element Vector{Int64}:
1
julia> getproperty(df, Symbol("col 2"))
1-element Vector{Int64}:
2
julia> df[:, :col1]
1-element Vector{Int64}:
1
julia> df[:, Symbol("col 2")]
1-element Vector{Int64}:
2
We now see the first difference, which we have already discussed. If all column names are valid variable names, symbols are more convenient. However, if they are not (e.g. they contain spaces), then strings are more convenient. As an extreme case, note that the convenience syntax for getproperty using the . accessor does not work for symbols containing spaces, and we need to make an explicit getproperty call.
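To make it concrete that the string and symbol forms are just two ways of naming the same column, the sketch below fetches the "col 2" column in three ways and checks that the results are the same object. It uses df[!, ...] indexing, which returns the stored column without copying; the variable names are only illustrative.

```julia
using DataFrames

df = DataFrame("col1" => 1, "col 2" => 2)

# Three ways of asking for the same column.
by_string = df."col 2"
by_symbol = getproperty(df, Symbol("col 2"))
by_index = df[!, "col 2"]

# Each of them returns the vector stored in the data frame, not a copy.
println(by_string === by_symbol && by_symbol === by_index)
```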
The second important aspect is that all functions that manipulate column names in DataFrames.jl work with strings. This is natural, as symbol manipulation is not supported by Julia. Here is a combo showing this in action:
julia> select(df, Cols(startswith("c")) .=> identity .=> uppercase)
1×2 DataFrame
 Row │ COL1   COL 2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
The Cols(startswith("c")) .=> identity .=> uppercase operation specification syntax means that we want to pick all columns whose names start with "c" (note that the startswith function expects a string as an input), keep them unchanged (the identity function), and uppercase their names in the output (note that uppercase expects a string as an input).
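The same string-based treatment of column names can be seen in smaller steps. The sketch below, which is only one possible way to split the operation, first lists the names matched by the selector and then renames all columns by passing uppercase to rename. The names c_names and df_upper are made up for the example.

```julia
using DataFrames

df = DataFrame("col1" => 1, "col 2" => 2)

# names with a column selector returns the matching names as strings.
c_names = names(df, Cols(startswith("c")))
println(c_names)

# rename applies the function to every column name, passed as a string.
df_upper = rename(uppercase, df)
println(names(df_upper))
```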
Finally, you might ask about comparison of speed of column lookup using strings vs symbols. Here is a simple test:
julia> @btime $df.col1
7.500 ns (0 allocations: 0 bytes)
1-element Vector{Int64}:
1
julia> @btime $df."col1"
38.446 ns (0 allocations: 0 bytes)
1-element Vector{Int64}:
1
As you can see, there is a noticeable performance difference. However, please note that both these operations are very fast. Therefore, in practice, column lookup is almost never a performance bottleneck in operations on data frames (usually what you do with the column picked from a data frame is more expensive by several orders of magnitude). So the practical recommendation is that, most of the time, performance should not be a reason for choosing symbols over strings.
If you really need speed then column lookup using an integer index is fastest:
julia> @btime $df[!, 1]
4.100 ns (0 allocations: 0 bytes)
1-element Vector{Int64}:
1
However, this way of picking columns is not recommended, and you should use it only if you are sure which column is stored under a given number in a data frame.
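If integer lookup is really needed, one way to avoid hard-coding a position is to ask the data frame for it once by name and then reuse the result. The sketch below assumes the columnindex function from DataFrames.jl for this purpose; the loop and the names idx and total are only illustrative.

```julia
using DataFrames

df = DataFrame("col1" => 1, "col 2" => 2)

# Look the position up by name once ...
idx = columnindex(df, :col1)
println(idx)

# ... and reuse the integer for repeated lookups.
total = sum(sum(df[!, idx]) for _ in 1:1000)
println(total)
```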
Additional practical considerations of using strings and symbols in DataFrames.jl
The first tip is that you can get a list of column names of a data frame as strings and as symbols in DataFrames.jl using the names and propertynames functions respectively:
julia> names(df)
2-element Vector{String}:
"col1"
"col 2"
julia> propertynames(df)
2-element Vector{Symbol}:
:col1
Symbol("col 2")
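As a quick check that these two functions describe the same columns, the sketch below converts one list into the other and also shows a membership test for each representation. It is only an illustration of how the two outputs relate.

```julia
using DataFrames

df = DataFrame("col1" => 1, "col 2" => 2)

# The two lists describe the same columns in the same order.
same_as_symbols = Symbol.(names(df)) == propertynames(df)
same_as_strings = String.(propertynames(df)) == names(df)
println(same_as_symbols && same_as_strings)

# Membership checks work with either representation.
println("col 2" in names(df))
println(hasproperty(df, :col1))
```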
The second important consideration is that in DataFramesMeta.jl only symbols are considered to be column identifiers in operations by default. Therefore you can write:
julia> using DataFramesMeta
julia> @rselect(df, :out = :col1 + 1)
1×1 DataFrame
 Row │ out
     │ Int64
─────┼───────
   1 │     2
If you want to use strings instead, you have to escape them with $:
julia> @rselect(df, $"out" = $"col1" + 1)
1×1 DataFrame
 Row │ out
     │ Int64
─────┼───────
   1 │     2
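The $ escape is also useful when the column name is not known in advance but is stored in a variable, for example when it contains a space. The sketch below assumes that $ accepts a variable holding a string in the same way as it accepts a string literal; the names src and res are hypothetical.

```julia
using DataFrames, DataFramesMeta

df = DataFrame("col1" => 1, "col 2" => 2)

# The source column name is held in a variable and escaped with $.
src = "col 2"
res = @rselect(df, :out = $src + 1)
println(res)
```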
Conclusions
The post today was long, but the conclusion is simple. In DataFrames.jl you can use both symbols and strings to get access to a column of a data frame. The main consideration when picking one or the other should be your convenience.