Julia basics: replace

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/07/27/replace.html

Introduction

Today I want to write about a basic vector manipulation pattern that I often need when cleaning data.
The topic is replacing values in a vector.

I want to discuss the basic options Base Julia offers you in this area and the considerations of their usage.

The post was written under Julia 1.9.2.

The problem statement

Assume you have the following vector of strings:

julia> using Random

julia> Random.seed!(1234);

julia> v = [randstring('a':'z', 2) for _ in 1:10^7]
10000000-element Vector{String}:
 "io"
 "fx"
 "jk"
 "yu"
 "mt"
 "ps"
 ⋮
 "ji"
 "zx"
 "yo"
 "db"
 "gy"
 "dg"

It consists of randomly drawn strings of length 2 consisting of two lowercase letters.

Let us check how many unique values it has (we expect 26 * 26 = 676):

julia> Set(v)
Set{String} with 676 elements:
  "ao"
  "gi"
  "gy"
  "ea"
  "an"
  "ec"
  "fs"
  "qq"
  "ys"
  "gd"
  "ow"
  "gu"
  ⋮

Now we have two tasks (similar, but not identical):

  • In all cases when the first and the second letter is the same uppercase the string and otherwise set data to missing.
  • In all cases when the first and the second letter is the same uppercase the string and keep the rest of the data as it is.

Both patterns are often needed in practice so I thought it is useful to cover them. Let us discuss them in the following sections.

Replacing a non-matching value by a flag

This pattern is often needed when we have many categories in our data, and want to keep most important ones, while flagging the less important ones as other. In our task the other signaling flag is missing.

How can we achieve the desired result? A direct approach would be to use a comprehension with a condition like this:

julia> sort!(unique([allequal(x) ? uppercase(x) : missing for x in v]))
27-element Vector{Union{Missing, String}}:
 "AA"
 "BB"
 "CC"
 "DD"
 "EE"
 "FF"
 ⋮
 "VV"
 "WW"
 "XX"
 "YY"
 "ZZ"
 missing

I have used the unique and sort! functions to check that indeed we get 27 values in the result (26 uppercase strings + missing).

This approach works, but it relies on a special structure of our condition (all equal letters in a string). What would be a more general pattern? In such cases it is convenient to use a dictionary. Let us first define the desired mapping:

julia> mapping = Dict(c^2 => uppercase(c)^2 for c in 'a':'z')
Dict{String, String} with 26 entries:
  "ff" => "FF"
  "bb" => "BB"
  "hh" => "HH"
  "ii" => "II"
  "oo" => "OO"
  "vv" => "VV"
  "zz" => "ZZ"
  "dd" => "DD"
  "tt" => "TT"
  "ee" => "EE"
  "jj" => "JJ"
  "ww" => "WW"
  ⋮    => ⋮

Now we can use the get function. Its beauty is that it allows us to define a default value that should be returned
when a key is missing in a dictionary. Let us use it in two ways:

julia> sort!(unique([get(mapping, x, missing) for x in v]))
27-element Vector{Union{Missing, String}}:
 "AA"
 "BB"
 "CC"
 "DD"
 "EE"
 "FF"
 ⋮
 "VV"
 "WW"
 "XX"
 "YY"
 "ZZ"
 missing

julia> sort!(unique(get.(Ref(mapping), v, missing)))
27-element Vector{Union{Missing, String}}:
 "AA"
 "BB"
 "CC"
 "DD"
 "EE"
 "FF"
 ⋮
 "VV"
 "WW"
 "XX"
 "YY"
 "ZZ"
 missing

The first uses a comprehension and the second broadcasting. Which one should be preferred. Let us run a minimal benchmark:

julia> @time [get(mapping, x, missing) for x in v];
  1.309264 seconds (31.70 k allocations: 78.387 MiB, 10.17% compilation time)

julia> @time get.(Ref(mapping), v, missing);
  0.660786 seconds (7 allocations: 76.294 MiB)

Broadcasting is noticeably faster. What is the reason? The issue is that we are working in top-level scope and comprehension is not type stable and also needs to get compiled each time it is run. Let us first make it type stable using let:

julia> let mapping=mapping
           @time [get(mapping, x, missing) for x in v]
       end;
  0.828514 seconds (39.49 k allocations: 78.991 MiB, 22.09% compilation time)

We already made it faster but still we are hit by compilation time. Let us define a helper function:

julia> helper1(x, mapping) = [get(mapping, x, missing) for x in v];

julia> @time helper1(x, mapping);
  0.921935 seconds (180.41 k allocations: 88.752 MiB, 31.15% compilation time)

julia> @time helper1(x, mapping);
  0.655153 seconds (5 allocations: 76.294 MiB)

Now we get the same timing. So the lesson is: within a function broadcasting and comprehension should have a similar timing, but in top-level scope (which is often the case when working interactively with data) it is better to use broadcasting.

Retaining non-matching values

If we want to retain non-matching values we cannot use the dictionary get function, because it always returns the same value in case of a miss.

Here we have two basic options: use a custom comprehension or use the replace function. Let us see them both in action:

julia> reshape(sort!(unique(replace(v, mapping...))), 26, :)
26×26 Matrix{String}:
 "AA"  "ab"  "bc"  "cd"  "de"  "ef"  "fg"  "gh"  "hi"  "ij"  "jk"  …  "pq"  "qr"  "rs"  "st"  "tu"  "uv"  "vw"  "wx"  "xy"  "yz"
 "BB"  "ac"  "bd"  "ce"  "df"  "eg"  "fh"  "gi"  "hj"  "ik"  "jl"     "pr"  "qs"  "rt"  "su"  "tv"  "uw"  "vx"  "wy"  "xz"  "za"
 "CC"  "ad"  "be"  "cf"  "dg"  "eh"  "fi"  "gj"  "hk"  "il"  "jm"     "ps"  "qt"  "ru"  "sv"  "tw"  "ux"  "vy"  "wz"  "ya"  "zb"
 "DD"  "ae"  "bf"  "cg"  "dh"  "ei"  "fj"  "gk"  "hl"  "im"  "jn"     "pt"  "qu"  "rv"  "sw"  "tx"  "uy"  "vz"  "xa"  "yb"  "zc"
 "EE"  "af"  "bg"  "ch"  "di"  "ej"  "fk"  "gl"  "hm"  "in"  "jo"     "pu"  "qv"  "rw"  "sx"  "ty"  "uz"  "wa"  "xb"  "yc"  "zd"
 "FF"  "ag"  "bh"  "ci"  "dj"  "ek"  "fl"  "gm"  "hn"  "io"  "jp"  …  "pv"  "qw"  "rx"  "sy"  "tz"  "va"  "wb"  "xc"  "yd"  "ze"
 "GG"  "ah"  "bi"  "cj"  "dk"  "el"  "fm"  "gn"  "ho"  "ip"  "jq"     "pw"  "qx"  "ry"  "sz"  "ua"  "vb"  "wc"  "xd"  "ye"  "zf"
 "HH"  "ai"  "bj"  "ck"  "dl"  "em"  "fn"  "go"  "hp"  "iq"  "jr"     "px"  "qy"  "rz"  "ta"  "ub"  "vc"  "wd"  "xe"  "yf"  "zg"
 "II"  "aj"  "bk"  "cl"  "dm"  "en"  "fo"  "gp"  "hq"  "ir"  "js"     "py"  "qz"  "sa"  "tb"  "uc"  "vd"  "we"  "xf"  "yg"  "zh"
 "JJ"  "ak"  "bl"  "cm"  "dn"  "eo"  "fp"  "gq"  "hr"  "is"  "jt"     "pz"  "ra"  "sb"  "tc"  "ud"  "ve"  "wf"  "xg"  "yh"  "zi"
 "KK"  "al"  "bm"  "cn"  "do"  "ep"  "fq"  "gr"  "hs"  "it"  "ju"  …  "qa"  "rb"  "sc"  "td"  "ue"  "vf"  "wg"  "xh"  "yi"  "zj"
 "LL"  "am"  "bn"  "co"  "dp"  "eq"  "fr"  "gs"  "ht"  "iu"  "jv"     "qb"  "rc"  "sd"  "te"  "uf"  "vg"  "wh"  "xi"  "yj"  "zk"
 "MM"  "an"  "bo"  "cp"  "dq"  "er"  "fs"  "gt"  "hu"  "iv"  "jw"     "qc"  "rd"  "se"  "tf"  "ug"  "vh"  "wi"  "xj"  "yk"  "zl"
 "NN"  "ao"  "bp"  "cq"  "dr"  "es"  "ft"  "gu"  "hv"  "iw"  "jx"     "qd"  "re"  "sf"  "tg"  "uh"  "vi"  "wj"  "xk"  "yl"  "zm"
 "OO"  "ap"  "bq"  "cr"  "ds"  "et"  "fu"  "gv"  "hw"  "ix"  "jy"     "qe"  "rf"  "sg"  "th"  "ui"  "vj"  "wk"  "xl"  "ym"  "zn"
 "PP"  "aq"  "br"  "cs"  "dt"  "eu"  "fv"  "gw"  "hx"  "iy"  "jz"  …  "qf"  "rg"  "sh"  "ti"  "uj"  "vk"  "wl"  "xm"  "yn"  "zo"
 "QQ"  "ar"  "bs"  "ct"  "du"  "ev"  "fw"  "gx"  "hy"  "iz"  "ka"     "qg"  "rh"  "si"  "tj"  "uk"  "vl"  "wm"  "xn"  "yo"  "zp"
 "RR"  "as"  "bt"  "cu"  "dv"  "ew"  "fx"  "gy"  "hz"  "ja"  "kb"     "qh"  "ri"  "sj"  "tk"  "ul"  "vm"  "wn"  "xo"  "yp"  "zq"
 "SS"  "at"  "bu"  "cv"  "dw"  "ex"  "fy"  "gz"  "ia"  "jb"  "kc"     "qi"  "rj"  "sk"  "tl"  "um"  "vn"  "wo"  "xp"  "yq"  "zr"
 "TT"  "au"  "bv"  "cw"  "dx"  "ey"  "fz"  "ha"  "ib"  "jc"  "kd"     "qj"  "rk"  "sl"  "tm"  "un"  "vo"  "wp"  "xq"  "yr"  "zs"
 "UU"  "av"  "bw"  "cx"  "dy"  "ez"  "ga"  "hb"  "ic"  "jd"  "ke"  …  "qk"  "rl"  "sm"  "tn"  "uo"  "vp"  "wq"  "xr"  "ys"  "zt"
 "VV"  "aw"  "bx"  "cy"  "dz"  "fa"  "gb"  "hc"  "id"  "je"  "kf"     "ql"  "rm"  "sn"  "to"  "up"  "vq"  "wr"  "xs"  "yt"  "zu"
 "WW"  "ax"  "by"  "cz"  "ea"  "fb"  "gc"  "hd"  "ie"  "jf"  "kg"     "qm"  "rn"  "so"  "tp"  "uq"  "vr"  "ws"  "xt"  "yu"  "zv"
 "XX"  "ay"  "bz"  "da"  "eb"  "fc"  "gd"  "he"  "if"  "jg"  "kh"     "qn"  "ro"  "sp"  "tq"  "ur"  "vs"  "wt"  "xu"  "yv"  "zw"
 "YY"  "az"  "ca"  "db"  "ec"  "fd"  "ge"  "hf"  "ig"  "jh"  "ki"     "qo"  "rp"  "sq"  "tr"  "us"  "vt"  "wu"  "xv"  "yw"  "zx"
 "ZZ"  "ba"  "cb"  "dc"  "ed"  "fe"  "gf"  "hg"  "ih"  "ji"  "kj"  …  "qp"  "rq"  "sr"  "ts"  "ut"  "vu"  "wv"  "xw"  "yx"  "zy"

julia> reshape(sort!(unique([haskey(mapping, x) ? mapping[x] : x for x in v])), 26, :)
26×26 Matrix{String}:
 "AA"  "ab"  "bc"  "cd"  "de"  "ef"  "fg"  "gh"  "hi"  "ij"  "jk"  …  "pq"  "qr"  "rs"  "st"  "tu"  "uv"  "vw"  "wx"  "xy"  "yz"
 "BB"  "ac"  "bd"  "ce"  "df"  "eg"  "fh"  "gi"  "hj"  "ik"  "jl"     "pr"  "qs"  "rt"  "su"  "tv"  "uw"  "vx"  "wy"  "xz"  "za"
 "CC"  "ad"  "be"  "cf"  "dg"  "eh"  "fi"  "gj"  "hk"  "il"  "jm"     "ps"  "qt"  "ru"  "sv"  "tw"  "ux"  "vy"  "wz"  "ya"  "zb"
 "DD"  "ae"  "bf"  "cg"  "dh"  "ei"  "fj"  "gk"  "hl"  "im"  "jn"     "pt"  "qu"  "rv"  "sw"  "tx"  "uy"  "vz"  "xa"  "yb"  "zc"
 "EE"  "af"  "bg"  "ch"  "di"  "ej"  "fk"  "gl"  "hm"  "in"  "jo"     "pu"  "qv"  "rw"  "sx"  "ty"  "uz"  "wa"  "xb"  "yc"  "zd"
 "FF"  "ag"  "bh"  "ci"  "dj"  "ek"  "fl"  "gm"  "hn"  "io"  "jp"  …  "pv"  "qw"  "rx"  "sy"  "tz"  "va"  "wb"  "xc"  "yd"  "ze"
 "GG"  "ah"  "bi"  "cj"  "dk"  "el"  "fm"  "gn"  "ho"  "ip"  "jq"     "pw"  "qx"  "ry"  "sz"  "ua"  "vb"  "wc"  "xd"  "ye"  "zf"
 "HH"  "ai"  "bj"  "ck"  "dl"  "em"  "fn"  "go"  "hp"  "iq"  "jr"     "px"  "qy"  "rz"  "ta"  "ub"  "vc"  "wd"  "xe"  "yf"  "zg"
 "II"  "aj"  "bk"  "cl"  "dm"  "en"  "fo"  "gp"  "hq"  "ir"  "js"     "py"  "qz"  "sa"  "tb"  "uc"  "vd"  "we"  "xf"  "yg"  "zh"
 "JJ"  "ak"  "bl"  "cm"  "dn"  "eo"  "fp"  "gq"  "hr"  "is"  "jt"     "pz"  "ra"  "sb"  "tc"  "ud"  "ve"  "wf"  "xg"  "yh"  "zi"
 "KK"  "al"  "bm"  "cn"  "do"  "ep"  "fq"  "gr"  "hs"  "it"  "ju"  …  "qa"  "rb"  "sc"  "td"  "ue"  "vf"  "wg"  "xh"  "yi"  "zj"
 "LL"  "am"  "bn"  "co"  "dp"  "eq"  "fr"  "gs"  "ht"  "iu"  "jv"     "qb"  "rc"  "sd"  "te"  "uf"  "vg"  "wh"  "xi"  "yj"  "zk"
 "MM"  "an"  "bo"  "cp"  "dq"  "er"  "fs"  "gt"  "hu"  "iv"  "jw"     "qc"  "rd"  "se"  "tf"  "ug"  "vh"  "wi"  "xj"  "yk"  "zl"
 "NN"  "ao"  "bp"  "cq"  "dr"  "es"  "ft"  "gu"  "hv"  "iw"  "jx"     "qd"  "re"  "sf"  "tg"  "uh"  "vi"  "wj"  "xk"  "yl"  "zm"
 "OO"  "ap"  "bq"  "cr"  "ds"  "et"  "fu"  "gv"  "hw"  "ix"  "jy"     "qe"  "rf"  "sg"  "th"  "ui"  "vj"  "wk"  "xl"  "ym"  "zn"
 "PP"  "aq"  "br"  "cs"  "dt"  "eu"  "fv"  "gw"  "hx"  "iy"  "jz"  …  "qf"  "rg"  "sh"  "ti"  "uj"  "vk"  "wl"  "xm"  "yn"  "zo"
 "QQ"  "ar"  "bs"  "ct"  "du"  "ev"  "fw"  "gx"  "hy"  "iz"  "ka"     "qg"  "rh"  "si"  "tj"  "uk"  "vl"  "wm"  "xn"  "yo"  "zp"
 "RR"  "as"  "bt"  "cu"  "dv"  "ew"  "fx"  "gy"  "hz"  "ja"  "kb"     "qh"  "ri"  "sj"  "tk"  "ul"  "vm"  "wn"  "xo"  "yp"  "zq"
 "SS"  "at"  "bu"  "cv"  "dw"  "ex"  "fy"  "gz"  "ia"  "jb"  "kc"     "qi"  "rj"  "sk"  "tl"  "um"  "vn"  "wo"  "xp"  "yq"  "zr"
 "TT"  "au"  "bv"  "cw"  "dx"  "ey"  "fz"  "ha"  "ib"  "jc"  "kd"     "qj"  "rk"  "sl"  "tm"  "un"  "vo"  "wp"  "xq"  "yr"  "zs"
 "UU"  "av"  "bw"  "cx"  "dy"  "ez"  "ga"  "hb"  "ic"  "jd"  "ke"  …  "qk"  "rl"  "sm"  "tn"  "uo"  "vp"  "wq"  "xr"  "ys"  "zt"
 "VV"  "aw"  "bx"  "cy"  "dz"  "fa"  "gb"  "hc"  "id"  "je"  "kf"     "ql"  "rm"  "sn"  "to"  "up"  "vq"  "wr"  "xs"  "yt"  "zu"
 "WW"  "ax"  "by"  "cz"  "ea"  "fb"  "gc"  "hd"  "ie"  "jf"  "kg"     "qm"  "rn"  "so"  "tp"  "uq"  "vr"  "ws"  "xt"  "yu"  "zv"
 "XX"  "ay"  "bz"  "da"  "eb"  "fc"  "gd"  "he"  "if"  "jg"  "kh"     "qn"  "ro"  "sp"  "tq"  "ur"  "vs"  "wt"  "xu"  "yv"  "zw"
 "YY"  "az"  "ca"  "db"  "ec"  "fd"  "ge"  "hf"  "ig"  "jh"  "ki"     "qo"  "rp"  "sq"  "tr"  "us"  "vt"  "wu"  "xv"  "yw"  "zx"
 "ZZ"  "ba"  "cb"  "dc"  "ed"  "fe"  "gf"  "hg"  "ih"  "ji"  "kj"  …  "qp"  "rq"  "sr"  "ts"  "ut"  "vu"  "wv"  "xw"  "yx"  "zy"

You might wonder which approach is fester in practice. Let us check it (including all options for the comprehension we have used above):

julia> @time replace(v, mapping...);
  2.896812 seconds (461 allocations: 76.314 MiB)

julia> @time [haskey(mapping, x) ? mapping[x] : v for x in v];
  1.389256 seconds (31.67 k allocations: 154.680 MiB, 9.80% compilation time)

julia> @time let mapping=mapping
           [haskey(mapping, x) ? mapping[x] : v for x in v]
       end;
  1.053450 seconds (33.60 k allocations: 154.853 MiB, 15.47% compilation time)

julia> helper2(x, mapping) = [haskey(mapping, x) ? mapping[x] : v for x in v];

julia> @time helper2(x, mapping);
  0.952112 seconds (70.00 k allocations: 157.290 MiB, 19.68% compilation time)

julia> @time helper2(x, mapping);
  0.910453 seconds (9 allocations: 152.588 MiB, 20.39% gc time)

As you can see replace is slower even than the type unstable version of the comprehension.
What is the underlying reason for this? The answer is that replace checks each key => value
pair linearly, which is much slower than using a dictionary if you have many such pairs.

Conclusions

In summary the basic lessons from the tests we discussed today are:

  • If you are replacing values and a a non-matching value should be replaced by some flag the simplest and fast option is to use broadcasted get;
  • If you are replacing values and want to retain non-matching values then:
    • if you have multiple key=>value paris use a comprehension with haskey (preferably making it type stable with let or a function);
    • if you have few key=>value pairs replace should be easiest and fast enough.