Strings vs symbols in DataFrames.jl column indexing

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/08/05/symbol.html

Introduction

In DataFrames.jl you can use both symbols and strings for column indexing. Which
to choose is one of the topics that new users ask about most frequently. In this
post I will explain why both options are supported and what the difference
between them is. Note that this is an entry-level post, so I will omit many
details of the discussed topic and focus on the most important aspects only.

The post was written under Julia 1.7.2, DataFrames.jl 1.3.4,
DataFramesMeta.jl 0.12.0, BenchmarkTools.jl 1.3.1.

What are strings and symbols?

In Julia a string allows users to store sequences of characters. The simplest
way to create a string is to write some text between double quotation marks:

julia> "an example string"
"an example string"

Symbols are objects used in Julia to create identifiers. You can think of them
as labels. Symbols are normally created by prefixing some label with : like
this:

julia> :label
:label

This syntax works for labels that are valid variable names.
So, for example, you cannot create a symbol that contains a space using the
: prefix:

julia> :my label
ERROR: syntax: extra token "label" after end of expression

Instead, in such cases, you need to call Symbol passing it a string as
an argument:

julia> Symbol("my label")
Symbol("my label")
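
As a small sketch of how the two representations relate (the variable names
below are only illustrative), you can convert a string to a symbol with Symbol
and a symbol back to a string with string:

```julia
# Illustrative names; only Base functions are used.
sym = Symbol("my label")            # string -> symbol
str = string(sym)                   # symbol -> string
same = Symbol("label") == :label    # both spellings produce the same symbol
```

Here str is again "my label", and same is true because the : prefix and the
Symbol call are two ways of writing the same label.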

How are strings and symbols different?

To understand the difference between symbols and strings it is easiest to
think of them as follows:

  • symbols are labels;
  • strings are sequences of characters.

So symbols are indivisible: they are always considered as a whole,
while strings consist of multiple characters. The most important consequences of
this distinction are the following:

  • comparing symbols for equality using == is faster than comparing strings;
  • you can manipulate strings (e.g. uppercase, chop, perform substring matching etc.)
    while no such operations are supported for symbols.

Let us have a look at these two characteristics by example. First we check
comparison speed. We create 1000-element vectors with unique values and compare
all pairs of their entries, so we make 1 million comparisons and expect 1000
matches.

julia> using BenchmarkTools

julia> string_vec = string.("s", 1:1000)
1000-element Vector{String}:
 "s1"
 "s2"
 "s3"
 "s4"
 ⋮
 "s997"
 "s998"
 "s999"
 "s1000"

julia> symbol_vec = Symbol.("s", 1:1000)
1000-element Vector{Symbol}:
 :s1
 :s2
 :s3
 :s4
 ⋮
 :s997
 :s998
 :s999
 :s1000

julia> test_cmp(v) = count(x == y for x in v, y in v)
test_cmp (generic function with 1 method)

julia> @btime test_cmp($string_vec)
  3.038 ms (0 allocations: 0 bytes)
1000

julia> @btime test_cmp($symbol_vec)
  635.400 μs (0 allocations: 0 bytes)
1000

Indeed symbol comparison is faster.
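
As background for this result (the benchmark itself does not prove it): Julia
interns symbols, so two symbols with the same name are the very same object
and can be compared by identity, while comparing strings requires looking at
their contents. A minimal sketch, with illustrative variable names:

```julia
# Two symbols built in different ways but carrying the same name.
a = Symbol("s", 1)    # built at run time from pieces
b = :s1               # written as a literal

# === checks identity; it holds because symbols are interned.
identical = a === b
```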

Now let us look at manipulation:

julia> str = "example"
"example"

julia> uppercase(str)
"EXAMPLE"

julia> chop(str)
"exampl"

julia> match(r"ex", str)
RegexMatch("ex")

julia> sym = :example
:example

julia> uppercase(sym)
ERROR: MethodError: no method matching uppercase(::Symbol)

julia> chop(sym)
ERROR: MethodError: no method matching chop(::Symbol)

julia> match(r"ex", sym)
ERROR: MethodError: no method matching match(::Regex, ::Symbol)

So in summary we could conclude that:

  • you can use a symbol if the value stored in it is not manipulated
    (i.e. it is treated as a label); symbols are faster to compare than strings
    and a bit easier to type (only the : prefix is needed), provided that they
    do not contain characters like spaces (in which case they are not
    convenient to type);
  • strings, as opposed to symbols, support manipulation; the cost is that
    comparing them is slower than comparing symbols.
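
If you do need to manipulate a label stored as a symbol, one option is to
convert it to a string, do the work there, and convert back. A minimal sketch
using only Base functions (variable names are illustrative):

```julia
sym = :example

# Convert to a string, manipulate, and wrap the result in a symbol again.
upper_sym = Symbol(uppercase(string(sym)))

# String predicates also work once the symbol is converted.
starts_with_ex = startswith(string(sym), "ex")
```

This yields :EXAMPLE for upper_sym and true for starts_with_ex. The
conversions are not free, which is one more reason to pick strings when you
expect to manipulate the values often.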

Let us now discuss how these considerations translate to the DataFrames.jl realm.

Strings vs symbols in DataFrames.jl

Column names in a DataFrame are labels. For this reason both symbols and
strings are allowed to be used when referencing them without introducing
an ambiguity. Here is an example. We start with strings:

julia> using DataFrames

julia> df = DataFrame("col1" => 1, "col 2" => 2)
1×2 DataFrame
 Row │ col1   col 2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> df."col1"
1-element Vector{Int64}:
 1

julia> df."col 2"
1-element Vector{Int64}:
 2

julia> df[:, "col1"]
1-element Vector{Int64}:
 1

julia> df[:, "col 2"]
1-element Vector{Int64}:
 2

Now we try the same with symbols:

julia> df = DataFrame(:col1 => 1, Symbol("col 2") => 2)
1×2 DataFrame
 Row │ col1   col 2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> df.col1
1-element Vector{Int64}:
 1

julia> getproperty(df, Symbol("col 2"))
1-element Vector{Int64}:
 2

julia> df[:, :col1]
1-element Vector{Int64}:
 1

julia> df[:, Symbol("col 2")]
1-element Vector{Int64}:
 2

We now see the first difference, which we have already discussed. If all column
names are valid variable names, symbols are more convenient. However, if they
are not (e.g. they contain spaces), then using strings is more convenient.
As an extreme case, note that the convenience syntax for getproperty using the
. accessor does not work for symbols containing spaces, and we need to make
an explicit getproperty call.
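
Another option, not used in the examples above, is the var"..." syntax
available in Julia 1.3 and later. It lets you write a name that is not a valid
identifier directly after the . accessor. A sketch, assuming the same data
frame as in the example above:

```julia
using DataFrames

df = DataFrame(:col1 => 1, Symbol("col 2") => 2)

# var"col 2" stands for the identifier whose name is "col 2".
via_var = df.var"col 2"

# The explicit call, for comparison.
via_getproperty = getproperty(df, Symbol("col 2"))
```

Both expressions return the same column. Whether this reads better than
df."col 2" is a matter of taste.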

The second important aspect is that all functions that manipulate column
names in DataFrames.jl work with strings. This is natural, as symbol
manipulation is not supported by Julia. Here is a combo showing this in action:

julia> select(df, Cols(startswith("c")) .=> identity .=> uppercase)
1×2 DataFrame
 Row │ COL1   COL 2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

The Cols(startswith("c")) .=> identity .=> uppercase operation specification
syntax means that we want to pick all columns whose names start with "c"
(note that the startswith function expects a string as its input), keep them
unchanged (the identity function), and uppercase their names in the output
(note that uppercase also expects a string as its input).
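
The same idea shows up in other DataFrames.jl functions that take a function
operating on column names. As a sketch, assuming the data frame from above,
names accepts a predicate on string column names and rename accepts a function
that maps an old name (a string) to a new one:

```julia
using DataFrames

df = DataFrame(:col1 => 1, Symbol("col 2") => 2)

# Keep only the names for which the predicate returns true.
c_names = names(df, startswith("c"))

# Apply uppercase to every column name, returning a new data frame.
df_upper = rename(uppercase, df)
```

In both calls the function receives column names as strings, even though the
data frame was created using symbols.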

Finally, you might ask about comparison of speed of column lookup using strings
vs symbols. Here is a simple test:

julia> @btime $df.col1
  7.500 ns (0 allocations: 0 bytes)
1-element Vector{Int64}:
 1

julia> @btime $df."col1"
  38.446 ns (0 allocations: 0 bytes)
1-element Vector{Int64}:
 1

As you can see there is a noticeable performance difference. However, please
note that both these operations are very fast. Therefore, in practice,
column lookup is almost never a performance bottleneck in operations on
data frames (usually what you do with the column picked from a data frame
is more expensive by several orders of magnitude). So a practical recommendation
is that, most of the time, performance should not be a reason for choosing
symbols over strings.

If you really need speed then column lookup using an integer index is fastest:

julia> @btime $df[!, 1]
  4.100 ns (0 allocations: 0 bytes)
1-element Vector{Int64}:
 1

However, this way of picking columns is not recommended, and you should use it
only if you are sure which column is stored under a given number in a data frame.
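
If you do go this route, one way to reduce the risk is to look up the position
by name once and reuse it. The sketch below assumes the data frame from above
and uses the columnindex function from DataFrames.jl, which returns 0 when the
column is not present:

```julia
using DataFrames

df = DataFrame(:col1 => 1, Symbol("col 2") => 2)

# Resolve the name to a position once; 0 means "no such column".
idx = columnindex(df, :col1)
idx > 0 || error("column :col1 not found")

# Later lookups can use the integer position.
col = df[!, idx]

# A name that is not in the data frame (hypothetical, for illustration).
missing_idx = columnindex(df, :not_a_column)
```

The position is only valid as long as the columns of the data frame are not
reordered, added, or removed.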

Additional practical considerations of using strings and symbols in DataFrames.jl

The first tip is that you can get a list of column names of a data frame as
strings and as symbols in DataFrames.jl using the names and propertynames
functions respectively:

julia> names(df)
2-element Vector{String}:
 "col1"
 "col 2"

julia> propertynames(df)
2-element Vector{Symbol}:
 :col1
 Symbol("col 2")
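
The two lists describe the same columns, so you can move between them with
broadcasting. A minimal sketch, assuming the data frame from above:

```julia
using DataFrames

df = DataFrame(:col1 => 1, Symbol("col 2") => 2)

# Strings -> symbols and symbols -> strings, element by element.
as_symbols = Symbol.(names(df))
as_strings = string.(propertynames(df))
```

Here as_symbols equals propertynames(df) and as_strings equals names(df).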

The second important consideration is that in DataFramesMeta.jl only symbols are
considered to be column identifiers in operations by default.
Therefore you can write:

julia> using DataFramesMeta

julia> @rselect(df, :out = :col1 + 1)
1×1 DataFrame
 Row │ out
     │ Int64
─────┼───────
   1 │     2

If you want to use strings instead you have to escape them with $:

julia> @rselect(df, $"out" = $"col1" + 1)
1×1 DataFrame
 Row │ out
     │ Int64
─────┼───────
   1 │     2
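
The $ escape is also useful when the column name is stored in a variable, for
example because it contains a space or is only known at run time. A sketch,
assuming the data frame from above; the variable name src is only illustrative:

```julia
using DataFrames, DataFramesMeta

df = DataFrame(:col1 => 1, Symbol("col 2") => 2)

# The column name is held in a variable rather than written literally.
src = "col 2"

# $src tells DataFramesMeta.jl to treat the value of src as a column name.
res = @rselect(df, :out = $src + 1)
```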

Conclusions

The post today was long, but the conclusion is simple. In DataFrames.jl
you can use both symbols and strings to get access to a column of a data frame.
The main consideration when picking one or the other should be your
convenience.