Introduction

The String type in the Julia language supports the full range of Unicode characters, which is great in practice. Other features that are important when working with the String type (especially when you need performance) are the following:

  • String is immutable, but it is not interned (notably Symbols are interned and thus can be compared using === fast);
  • taking a single character from a String produces a Char, which is a 32-bit value;
  • Julia Base normally assumes that String is UTF-8 encoded which is handy because most likely your source is UTF-8 encoded and in general UTF-8 has a reasonably compact memory footprint (but Strings that contain invalid encodings are allowed to be constructed, and it is possible to transcode strings).

While I think that the Julia manual Section on Strings does a very good job explaining how they work I often find that people are confused by the consequences of UTF-8 encoding of String and this post is intended to cover this ground a bit more in depth.

All what I write here was tested under Julia 1.5.

Two types of indices for String

Since String is UTF-8 encoded one character is represented by 1, 2, 3, or 4 bytes in it, see e.g. Wikipedia for the details of the encoding. In Julia you can check that indeed one code unit of String is one byte by writing (I am storing the String in the str variable as we will soon use it again):

julia> str = "😄 Hello! 👋"
"😄 Hello! 👋"

julia> codeunit(str)
UInt8

Now this string contains 10 characters, which you can check using the length function:

julia> length(str)
10

but it is actually stored on more bytes, which the ncodeunits function tells us:

julia> ncodeunits(str)
16

The reason is that the first and last character in this string are not ASCII (note that in UTF-8 all ASCII characters are stored on one byte), which we can check in the following way:

julia> foreach(c -> println(repr(c), ":\t", ncodeunits(c)), str)
'😄':    4
' ':    1
'H':    1
'e':    1
'l':    1
'l':    1
'o':    1
'!':    1
' ':    1
'👋':    4

Given these observations the natural questions are:

  • how to get the i-th code unit in the String (so called byte index);
  • how to get the i-th character in the String (so called character index);
  • is is easy to go ‘There and Back Again’ between byte and character indices;
  • which functions expect byte indices, which expect character indices and what is the cost of using these functions.

Below I try to answer these questions.

Getting code units

Getting the i-th code unit is simple (but in practice rarely needed, except if you are working with strings on low level), you just use the codeunit function, e.g.:

julia> codeunit(str, 1)
0xf0

julia> codeunit(str, 2)
0x9f

julia> codeunit(str, 3)
0x98

julia> codeunit(str, 4)
0x84

julia> codeunit(str, 5)
0x20

or you can use the codeunits function to get them as a vector:

julia> codeunits(str)
16-element Base.CodeUnits{UInt8,String}:
 0xf0
 0x9f
 0x98
 0x84
 0x20
 0x48
 0x65
 0x6c
 0x6c
 0x6f
 0x21
 0x20
 0xf0
 0x9f
 0x91
 0x8b

Getting characters

Now, this is more tricky:

julia> str[1]
'😄': Unicode U+1F604 (category So: Symbol, other)

julia> str[5]
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)

julia> str[2]
ERROR: StringIndexError("😄 Hello! 👋", 2)

and you see that the getindex function in the str[i] syntax does not give you the i-th character in the string but rather a character that starts in the i-th byte index in the string (and errors if at this given byte index the character does not start).

So how should one get the i-th character in the string? Use the nextind function in the following way:

julia> str[nextind(str, 0, 1)]
'😄': Unicode U+1F604 (category So: Symbol, other)

julia> str[nextind(str, 0, 2)]
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)

where you pass 0 as the first argument and the desired character index as a second argument.

You might ask why this is so awkward? The reason is that computing the location of i-th character in the string is expensive, so normally one should use byte indexing. Notably, most functions working on strings take byte indices and only one function — length — returns number of characters, all other functions return byte indices or number of bytes (more on this below in the glossary section).

For example (we store loc as we will use it later also):

julia> loc = findfirst(==('H'), str)
6

julia> str[loc]
'H': ASCII/Unicode U+0048 (category Lu: Letter, uppercase)

and this is fast.

Be warned though that when you use byte indexing you should not do arithmetics on them (unless your string is ASCII only, which you can check using the isascii function). For instance if you want to go back two characters from H do not write:

julia> str[loc - 2]
ERROR: StringIndexError("😄 Hello! 👋", 4)

but rather write:

julia> str[prevind(str, loc, 2)]
'😄': Unicode U+1F604 (category So: Symbol, other)

and let prevind do the calculation of an appropriate byte index. Trying to do arithmetics on byte indices is the most common error when working with Strings in Julia.

Finally it is easy to get all byte indices that point to the start of the character in the string with the eachindex function:

julia> foreach(i -> println("$i:\t$(repr(str[i]))"),  eachindex(str))
1:  '😄'
5:  ' '
6:  'H'
7:  'e'
8:  'l'
9:  'l'
10: 'o'
11: '!'
12: ' '
13: '👋'

Going there and back again between byte and character indices

If you have a byte index and want to find character index of a character that covers this code unit then write:

julia> length(str, 1, loc)
3

Note that byte index does not have to be a valid start of a character. In this case the index of the character that contains this index is returned:

julia> length(str, 1, 1)
1

julia> length(str, 1, 2)
1

julia> length(str, 1, 3)
1

julia> length(str, 1, 4)
1

julia> length(str, 1, 5)
2

(byte index 5 corresponds to the second character in the string)

If you have a character index and want to learn the byte index of this character in the String we already know we should use the nextind function. For example:

julia> nextind(str, 0, 3)
6

gets you the byte index of 'H' character in the string.

The glossary

Below I present the list of functions available in Julia Base that work with strings with a comment if they work with byte or character indices and information about their time complexity for String type. I omit the description what the functions do to keep the table brief and I list only the functions that either take or return an index or byte/character number.

Funtion arguments return value complexity
length byte index characters O(n)
ncodeunits   bytes O(1)
sizeof   bytes O(1)
codeunit byte index   O(1)
isvalid byte index   O(1)
getindex byte index   O(1)
SubString byte index   O(1)
view, @view byte index   O(1)
unsafe_string byte index   O(1)
match byte index   O(n)
findfirst   byte index O(n)
findlast   byte index O(n)
findnext   byte index O(n)
findprev   byte index O(n)
firstindex   byte index O(1)
lastindex   byte index O(1)
thisind byte index byte index O(1)
prevind i: byte index, n: characters byte index O(n)
nextind i: byte index, n: characters byte index O(n)
chop character index   O(n)
first character index   O(n)
last character index   O(n)
lpad characters   O(n)
rpad characters   O(n)
textwidth   screen characters O(n)

Note that nextind and prevind are O(n) for n, but for i they are O(1) as UTF-8 is self-synchronizing. Also, in general, this table gives a very simplified view of cost of different operations, especially these that involve two strings — I have just chatted with @StottPJones on Slack about it and if you are interested in these topics you can reach out to him there.

Conclusions

I must admit that working with the String type can be sometimes tricky, but I hope that the summary I have presented in this post will help you easier navigate through the options.

Finally String is not the only string type available in the Julia language. Actually most functions just work with any AbstractString and there are alternative string types developed in the community, you can check them out for example here.