Fixed width strings in CSV.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/09/10/stringx.html

Introduction

In my recent post I have written how using inline strings can help to in
crease join performance. Since with 0.9 release of CSV.jl these strings are
used by default to read CSV files so I want to focus solely on this
topic today.

The post was written under Julia 1.6.1, DataFrames.jl 1.2.2,
WeakRrefStrings.jl 1.3.0, BenchmarkTools.jl 1.1.4, and CSV.jl 0.9.1.

Introducing inline strings

WeakRefStrings.jl defines 8 inline string types:

julia> using WeakRefStrings

julia> st = subtypes(InlineString)
8-element Vector{Any}:
 String1
 String127
 String15
 String255
 String3
 String31
 String63
 String7

What is important is that all these types are bits types, as you can see here:

julia> using DataFrames

julia> sort(DataFrame(st=st,
                      isbits = isbitstype.(st),
                      sizeof = sizeof.(st)), :sizeof)
8×3 DataFrame
 Row │ st         isbits  sizeof
     │ Any        Bool    Int64
─────┼───────────────────────────
   1 │ String1      true       1
   2 │ String3      true       4
   3 │ String7      true       8
   4 │ String15     true      16
   5 │ String31     true      32
   6 │ String63     true      64
   7 │ String127    true     128
   8 │ String255    true     256

The suffix of the name of the specific string type indicates the maximum size of
the string it can hold. So for example String3 can hold strings that have
maximum size 3 as returned by sizeof. Here is an example:

julia> String3("123")
"123"

julia> String3("1234")
ERROR: ArgumentError: string too large (4) to convert to String3

julia> String3("∀")
"∀"

julia> String3("∀1")
ERROR: ArgumentError: string too large (4) to convert to String3

As you can see here it is important to remember that some characters (like )
in UTF-8 encoding take more than one code unit.

You can use the InlineString function to create an inline string of
automatically selected minimal size:

julia> typeof(InlineString("∀"))
String3

julia> typeof(InlineString("∀1"))
String7

Finally, as we can see the maximum size of inline string is 255, so:

julia> InlineString("a"^256)
ERROR: ArgumentError: string too large (256) to convert to InlineString

In summary, as you can see, the String[N] types are similar to CHAR(N)
types that are provided by many data bases. The limitation is that N
can take only several fixed values.

Why and when to use inline strings?

There are two benefits of inline strings.

The first is that they can be faster. A special case when this is visible are
equality comparisons which are quite common in practice.

julia> using Random, BenchmarkTools

julia> Random.seed!(1234);

julia> s1 = [randstring(3) for i in 1:1000];

julia> s2 = String3.(s1);

julia> @benchmark $s1 .== permutedims($s1)
BenchmarkTools.Trial: 1064 samples with 1 evaluation.
 Range (min … max):  3.928 ms …   6.435 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.910 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.697 ms ± 404.737 μs  ┊ GC (mean ± σ):  0.06% ± 0.91%

  ▅▃                             ▆ ▃  ▃█▄▂
  ██▇▇▅▇▅▅▄▆▄▄▄▅▆▅▅▁▄▆▅▅▇▅▄▁▅█▄▄▁█▇█▇▇█████▇▅▄▅▅▄▁▄▁▄▄▅▄▁▁▅▅▄ █
  3.93 ms      Histogram: log(frequency) by time      5.49 ms <

 Memory estimate: 126.53 KiB, allocs estimate: 6.

julia> @benchmark $s2 .== permutedims($s2)
BenchmarkTools.Trial: 4879 samples with 1 evaluation.
 Range (min … max):  888.150 μs …  2.037 ms  ┊ GC (min … max): 0.00% … 47.53%
 Time  (median):       1.043 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.023 ms ± 77.610 μs  ┊ GC (mean ± σ):  0.23% ±  2.38%

  ▅     ▃ ▁        ▁    ▅▃   █▆                                ▁
  █▅▃▃█████▅▃▁█▆▅▅▁█▇▃▁▃██▅▁▄███▆▅▄▆▇▅▇▅▅▇▅▅▄▄▃▅▅▅▅▃▁▃▅▄▃▁▃▁▃▄ █
  888 μs        Histogram: log(frequency) by time      1.22 ms <

 Memory estimate: 126.53 KiB, allocs estimate: 6.

The second is that they are not heap allocated so they do not lead to
significant GC latency. This topic was presented in the post on join
performance.

So what are the potential shortcommings. There are several:

  • they have a limited capacity, so one might not be able to always
    rely that the conversion to inline string is possible;
  • they are not efficient when we work with strings of highly variable length;
  • mixing of inline strings in collections can lead to type instabilities.

I have already discussed the first topic. Now, let us handle the second and
third one in consecutive sections.

Memory footprint of inline strings

Consider the following collection:

julia> x = [String127("a"^i) for i in 1:100]
100-element Vector{String127}:
 "a"
 "aa"
 "aaa"
 ⋮
 "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
 "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
 "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"

Let us check how expensive it is to perform its copy:

julia> @benchmark copy($x)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  677.100 ns … 131.135 μs  ┊ GC (min … max):  0.00% … 95.08%
 Time  (median):       1.224 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.464 μs ±   4.174 μs  ┊ GC (mean ± σ):  13.84% ±  4.88%

                    ▁▅███▇▆▄▂
  ▂▁▁▁▁▁▂▂▂▂▂▂▂▂▂▃▃▆██████████▇▅▅▅▅▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▂▁▁▁▂▂▂ ▃
  677 ns           Histogram: frequency by time         2.11 μs <

 Memory estimate: 12.62 KiB, allocs estimate: 1.

Now we convert it to a standard String:

julia> y = String.(x)
100-element Vector{String}:
 "a"
 "aa"
 "aaa"
 ⋮
 "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
 "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
 "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"

julia> @benchmark copy($y)
BenchmarkTools.Trial: 10000 samples with 973 evaluations.
 Range (min … max):  64.867 ns …   1.592 μs  ┊ GC (min … max):  0.00% … 82.37%
 Time  (median):     80.387 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   98.944 ns ± 111.194 ns  ┊ GC (mean ± σ):  13.24% ± 10.98%

  ▆█▄▂                                                         ▁
  ██████▇▆▆▆▄▅▄▃▄▃▁▃▁▁▁▁▁▁▁▃▅▆▅▄▅▇▅▁▁▁▁▁▁▁▁▁▁▁▁▁▄▄▅▅▄▅▅▅▅▇▆▆▅▅ █
  64.9 ns       Histogram: log(frequency) by time       830 ns <

 Memory estimate: 896 bytes, allocs estimate: 1.

As you can see the operation on String is much faster. The reason is that x
stores 100 entries of which each has 128 bytes, while y stores 100
pointers to strings that have only 8 bytes of size. See:

julia> sizeof(x)
12800

julia> sizeof(y)
800

So much less data movement is involved if we performed copying of pointers only.

The conclusion is that it is probably better to use String type if we expect
to work with collections of strings that have a highly variable length.

Collections of inline strings

The second potential issue is working with collections of inline strings.

When you read in string data from a file CSV.jl will automatically create
columns of appropriate widths:

julia> using CSV

julia> str = """
       w1,w2,w3,w4
       a,a,a,a
       b,bb,bb,bb
       c,cc,ccc,ccc
       d,dd,ddd,dddd
       """
"w1,w2,w3,w4\na,a,a,a\nb,bb,bb,bb\nc,cc,ccc,ccc\nd,dd,ddd,dddd\n"

julia> df = CSV.read(IOBuffer(str), DataFrame)
4×4 DataFrame
 Row │ w1       w2       w3       w4
     │ String1  String3  String3  String7
─────┼────────────────────────────────────
   1 │ a        a        a        a
   2 │ b        bb       bb       bb
   3 │ c        cc       ccc      ccc
   4 │ d        dd       ddd      dddd

However, one has to be careful when converting to inline string manually. Have a
look:

julia> strs2 = InlineString.(strs)
8-element Vector{InlineString}:
 "a"
 "aa"
 "aaa"
 "aaaa"
 "aaaaa"
 "aaaaaa"
 "aaaaaaa"
 "aaaaaaaa"

julia> typeof.(strs2)
8-element Vector{DataType}:
 String1
 String3
 String3
 String7
 String7
 String7
 String7
 String15

Sometimes this is indeed what one would want (the narrowest possible
representation at the cost of having a collection of abstract element type).
But currently if you would want to have an automatic type promotion and a
concrete element type you would have to do it manually e.g.:

julia> String15.(strs)
8-element Vector{String15}:
 "a"
 "aa"
 "aaa"
 "aaaa"
 "aaaaa"
 "aaaaaa"
 "aaaaaaa"
 "aaaaaaaa"

Operations on inline strings

Additionally one should know that most operations transforming inline strings
will currently produce a String by default, e.g.:

julia> s = String3("a")
"a"

julia> typeof(s^2)
String

julia> typeof(uppercase(s))
String

which also is best kept in mind when working with them.

Conclusions

Inline strings are an excellent addition to Julia, however, one should know the
type of task they were designed to help with.

As you could see in my examples inline strings are ideal for situations where we
have millions of relatively small and strings that have relatively homogeneous
size and are not mutated.

This use case might seem restrictive, but actually in practice it is quite
common as e.g. all sorts of customer or product identifiers have exactly this
nature.

Finally, some of the limitations I listed in this post might be lifted in the
future. If you need some functionality please do not hesitate to open an issue
on WeakRefStrings.jl GitHub repository.