Re-posted from: https://bkamins.github.io/julialang/2023/07/27/replace.html
Introduction
Today I want to write about a basic vector manipulation pattern that I often need when cleaning data.
The topic is replacing values in a vector.
I want to discuss the basic options Base Julia offers you in this area and the considerations of their usage.
The post was written under Julia 1.9.2.
The problem statement
Assume you have the following vector of strings:
julia> using Random
julia> Random.seed!(1234);
julia> v = [randstring('a':'z', 2) for _ in 1:10^7]
10000000-element Vector{String}:
"io"
"fx"
"jk"
"yu"
"mt"
"ps"
⋮
"ji"
"zx"
"yo"
"db"
"gy"
"dg"
It consists of randomly drawn strings of length 2 consisting of two lowercase letters.
Let us check how many unique values it has (we expect 26 * 26 = 676):
julia> Set(v)
Set{String} with 676 elements:
"ao"
"gi"
"gy"
"ea"
"an"
"ec"
"fs"
"qq"
"ys"
"gd"
"ow"
"gu"
⋮
Now we have two tasks (similar, but not identical):
- In all cases when the first and the second letter is the same uppercase the string and otherwise set data to
missing
. - In all cases when the first and the second letter is the same uppercase the string and keep the rest of the data as it is.
Both patterns are often needed in practice so I thought it is useful to cover them. Let us discuss them in the following sections.
Replacing a non-matching value by a flag
This pattern is often needed when we have many categories in our data, and want to keep most important ones, while flagging the less important ones as other. In our task the other signaling flag is missing
.
How can we achieve the desired result? A direct approach would be to use a comprehension with a condition like this:
julia> sort!(unique([allequal(x) ? uppercase(x) : missing for x in v]))
27-element Vector{Union{Missing, String}}:
"AA"
"BB"
"CC"
"DD"
"EE"
"FF"
⋮
"VV"
"WW"
"XX"
"YY"
"ZZ"
missing
I have used the unique
and sort!
functions to check that indeed we get 27 values in the result (26 uppercase strings + missing
).
This approach works, but it relies on a special structure of our condition (all equal letters in a string). What would be a more general pattern? In such cases it is convenient to use a dictionary. Let us first define the desired mapping:
julia> mapping = Dict(c^2 => uppercase(c)^2 for c in 'a':'z')
Dict{String, String} with 26 entries:
"ff" => "FF"
"bb" => "BB"
"hh" => "HH"
"ii" => "II"
"oo" => "OO"
"vv" => "VV"
"zz" => "ZZ"
"dd" => "DD"
"tt" => "TT"
"ee" => "EE"
"jj" => "JJ"
"ww" => "WW"
⋮ => ⋮
Now we can use the get
function. Its beauty is that it allows us to define a default value that should be returned
when a key is missing in a dictionary. Let us use it in two ways:
julia> sort!(unique([get(mapping, x, missing) for x in v]))
27-element Vector{Union{Missing, String}}:
"AA"
"BB"
"CC"
"DD"
"EE"
"FF"
⋮
"VV"
"WW"
"XX"
"YY"
"ZZ"
missing
julia> sort!(unique(get.(Ref(mapping), v, missing)))
27-element Vector{Union{Missing, String}}:
"AA"
"BB"
"CC"
"DD"
"EE"
"FF"
⋮
"VV"
"WW"
"XX"
"YY"
"ZZ"
missing
The first uses a comprehension and the second broadcasting. Which one should be preferred. Let us run a minimal benchmark:
julia> @time [get(mapping, x, missing) for x in v];
1.309264 seconds (31.70 k allocations: 78.387 MiB, 10.17% compilation time)
julia> @time get.(Ref(mapping), v, missing);
0.660786 seconds (7 allocations: 76.294 MiB)
Broadcasting is noticeably faster. What is the reason? The issue is that we are working in top-level scope and comprehension is not type stable and also needs to get compiled each time it is run. Let us first make it type stable using let
:
julia> let mapping=mapping
@time [get(mapping, x, missing) for x in v]
end;
0.828514 seconds (39.49 k allocations: 78.991 MiB, 22.09% compilation time)
We already made it faster but still we are hit by compilation time. Let us define a helper function:
julia> helper1(x, mapping) = [get(mapping, x, missing) for x in v];
julia> @time helper1(x, mapping);
0.921935 seconds (180.41 k allocations: 88.752 MiB, 31.15% compilation time)
julia> @time helper1(x, mapping);
0.655153 seconds (5 allocations: 76.294 MiB)
Now we get the same timing. So the lesson is: within a function broadcasting and comprehension should have a similar timing, but in top-level scope (which is often the case when working interactively with data) it is better to use broadcasting.
Retaining non-matching values
If we want to retain non-matching values we cannot use the dictionary get
function, because it always returns the same value in case of a miss.
Here we have two basic options: use a custom comprehension or use the replace
function. Let us see them both in action:
julia> reshape(sort!(unique(replace(v, mapping...))), 26, :)
26×26 Matrix{String}:
"AA" "ab" "bc" "cd" "de" "ef" "fg" "gh" "hi" "ij" "jk" … "pq" "qr" "rs" "st" "tu" "uv" "vw" "wx" "xy" "yz"
"BB" "ac" "bd" "ce" "df" "eg" "fh" "gi" "hj" "ik" "jl" "pr" "qs" "rt" "su" "tv" "uw" "vx" "wy" "xz" "za"
"CC" "ad" "be" "cf" "dg" "eh" "fi" "gj" "hk" "il" "jm" "ps" "qt" "ru" "sv" "tw" "ux" "vy" "wz" "ya" "zb"
"DD" "ae" "bf" "cg" "dh" "ei" "fj" "gk" "hl" "im" "jn" "pt" "qu" "rv" "sw" "tx" "uy" "vz" "xa" "yb" "zc"
"EE" "af" "bg" "ch" "di" "ej" "fk" "gl" "hm" "in" "jo" "pu" "qv" "rw" "sx" "ty" "uz" "wa" "xb" "yc" "zd"
"FF" "ag" "bh" "ci" "dj" "ek" "fl" "gm" "hn" "io" "jp" … "pv" "qw" "rx" "sy" "tz" "va" "wb" "xc" "yd" "ze"
"GG" "ah" "bi" "cj" "dk" "el" "fm" "gn" "ho" "ip" "jq" "pw" "qx" "ry" "sz" "ua" "vb" "wc" "xd" "ye" "zf"
"HH" "ai" "bj" "ck" "dl" "em" "fn" "go" "hp" "iq" "jr" "px" "qy" "rz" "ta" "ub" "vc" "wd" "xe" "yf" "zg"
"II" "aj" "bk" "cl" "dm" "en" "fo" "gp" "hq" "ir" "js" "py" "qz" "sa" "tb" "uc" "vd" "we" "xf" "yg" "zh"
"JJ" "ak" "bl" "cm" "dn" "eo" "fp" "gq" "hr" "is" "jt" "pz" "ra" "sb" "tc" "ud" "ve" "wf" "xg" "yh" "zi"
"KK" "al" "bm" "cn" "do" "ep" "fq" "gr" "hs" "it" "ju" … "qa" "rb" "sc" "td" "ue" "vf" "wg" "xh" "yi" "zj"
"LL" "am" "bn" "co" "dp" "eq" "fr" "gs" "ht" "iu" "jv" "qb" "rc" "sd" "te" "uf" "vg" "wh" "xi" "yj" "zk"
"MM" "an" "bo" "cp" "dq" "er" "fs" "gt" "hu" "iv" "jw" "qc" "rd" "se" "tf" "ug" "vh" "wi" "xj" "yk" "zl"
"NN" "ao" "bp" "cq" "dr" "es" "ft" "gu" "hv" "iw" "jx" "qd" "re" "sf" "tg" "uh" "vi" "wj" "xk" "yl" "zm"
"OO" "ap" "bq" "cr" "ds" "et" "fu" "gv" "hw" "ix" "jy" "qe" "rf" "sg" "th" "ui" "vj" "wk" "xl" "ym" "zn"
"PP" "aq" "br" "cs" "dt" "eu" "fv" "gw" "hx" "iy" "jz" … "qf" "rg" "sh" "ti" "uj" "vk" "wl" "xm" "yn" "zo"
"QQ" "ar" "bs" "ct" "du" "ev" "fw" "gx" "hy" "iz" "ka" "qg" "rh" "si" "tj" "uk" "vl" "wm" "xn" "yo" "zp"
"RR" "as" "bt" "cu" "dv" "ew" "fx" "gy" "hz" "ja" "kb" "qh" "ri" "sj" "tk" "ul" "vm" "wn" "xo" "yp" "zq"
"SS" "at" "bu" "cv" "dw" "ex" "fy" "gz" "ia" "jb" "kc" "qi" "rj" "sk" "tl" "um" "vn" "wo" "xp" "yq" "zr"
"TT" "au" "bv" "cw" "dx" "ey" "fz" "ha" "ib" "jc" "kd" "qj" "rk" "sl" "tm" "un" "vo" "wp" "xq" "yr" "zs"
"UU" "av" "bw" "cx" "dy" "ez" "ga" "hb" "ic" "jd" "ke" … "qk" "rl" "sm" "tn" "uo" "vp" "wq" "xr" "ys" "zt"
"VV" "aw" "bx" "cy" "dz" "fa" "gb" "hc" "id" "je" "kf" "ql" "rm" "sn" "to" "up" "vq" "wr" "xs" "yt" "zu"
"WW" "ax" "by" "cz" "ea" "fb" "gc" "hd" "ie" "jf" "kg" "qm" "rn" "so" "tp" "uq" "vr" "ws" "xt" "yu" "zv"
"XX" "ay" "bz" "da" "eb" "fc" "gd" "he" "if" "jg" "kh" "qn" "ro" "sp" "tq" "ur" "vs" "wt" "xu" "yv" "zw"
"YY" "az" "ca" "db" "ec" "fd" "ge" "hf" "ig" "jh" "ki" "qo" "rp" "sq" "tr" "us" "vt" "wu" "xv" "yw" "zx"
"ZZ" "ba" "cb" "dc" "ed" "fe" "gf" "hg" "ih" "ji" "kj" … "qp" "rq" "sr" "ts" "ut" "vu" "wv" "xw" "yx" "zy"
julia> reshape(sort!(unique([haskey(mapping, x) ? mapping[x] : x for x in v])), 26, :)
26×26 Matrix{String}:
"AA" "ab" "bc" "cd" "de" "ef" "fg" "gh" "hi" "ij" "jk" … "pq" "qr" "rs" "st" "tu" "uv" "vw" "wx" "xy" "yz"
"BB" "ac" "bd" "ce" "df" "eg" "fh" "gi" "hj" "ik" "jl" "pr" "qs" "rt" "su" "tv" "uw" "vx" "wy" "xz" "za"
"CC" "ad" "be" "cf" "dg" "eh" "fi" "gj" "hk" "il" "jm" "ps" "qt" "ru" "sv" "tw" "ux" "vy" "wz" "ya" "zb"
"DD" "ae" "bf" "cg" "dh" "ei" "fj" "gk" "hl" "im" "jn" "pt" "qu" "rv" "sw" "tx" "uy" "vz" "xa" "yb" "zc"
"EE" "af" "bg" "ch" "di" "ej" "fk" "gl" "hm" "in" "jo" "pu" "qv" "rw" "sx" "ty" "uz" "wa" "xb" "yc" "zd"
"FF" "ag" "bh" "ci" "dj" "ek" "fl" "gm" "hn" "io" "jp" … "pv" "qw" "rx" "sy" "tz" "va" "wb" "xc" "yd" "ze"
"GG" "ah" "bi" "cj" "dk" "el" "fm" "gn" "ho" "ip" "jq" "pw" "qx" "ry" "sz" "ua" "vb" "wc" "xd" "ye" "zf"
"HH" "ai" "bj" "ck" "dl" "em" "fn" "go" "hp" "iq" "jr" "px" "qy" "rz" "ta" "ub" "vc" "wd" "xe" "yf" "zg"
"II" "aj" "bk" "cl" "dm" "en" "fo" "gp" "hq" "ir" "js" "py" "qz" "sa" "tb" "uc" "vd" "we" "xf" "yg" "zh"
"JJ" "ak" "bl" "cm" "dn" "eo" "fp" "gq" "hr" "is" "jt" "pz" "ra" "sb" "tc" "ud" "ve" "wf" "xg" "yh" "zi"
"KK" "al" "bm" "cn" "do" "ep" "fq" "gr" "hs" "it" "ju" … "qa" "rb" "sc" "td" "ue" "vf" "wg" "xh" "yi" "zj"
"LL" "am" "bn" "co" "dp" "eq" "fr" "gs" "ht" "iu" "jv" "qb" "rc" "sd" "te" "uf" "vg" "wh" "xi" "yj" "zk"
"MM" "an" "bo" "cp" "dq" "er" "fs" "gt" "hu" "iv" "jw" "qc" "rd" "se" "tf" "ug" "vh" "wi" "xj" "yk" "zl"
"NN" "ao" "bp" "cq" "dr" "es" "ft" "gu" "hv" "iw" "jx" "qd" "re" "sf" "tg" "uh" "vi" "wj" "xk" "yl" "zm"
"OO" "ap" "bq" "cr" "ds" "et" "fu" "gv" "hw" "ix" "jy" "qe" "rf" "sg" "th" "ui" "vj" "wk" "xl" "ym" "zn"
"PP" "aq" "br" "cs" "dt" "eu" "fv" "gw" "hx" "iy" "jz" … "qf" "rg" "sh" "ti" "uj" "vk" "wl" "xm" "yn" "zo"
"QQ" "ar" "bs" "ct" "du" "ev" "fw" "gx" "hy" "iz" "ka" "qg" "rh" "si" "tj" "uk" "vl" "wm" "xn" "yo" "zp"
"RR" "as" "bt" "cu" "dv" "ew" "fx" "gy" "hz" "ja" "kb" "qh" "ri" "sj" "tk" "ul" "vm" "wn" "xo" "yp" "zq"
"SS" "at" "bu" "cv" "dw" "ex" "fy" "gz" "ia" "jb" "kc" "qi" "rj" "sk" "tl" "um" "vn" "wo" "xp" "yq" "zr"
"TT" "au" "bv" "cw" "dx" "ey" "fz" "ha" "ib" "jc" "kd" "qj" "rk" "sl" "tm" "un" "vo" "wp" "xq" "yr" "zs"
"UU" "av" "bw" "cx" "dy" "ez" "ga" "hb" "ic" "jd" "ke" … "qk" "rl" "sm" "tn" "uo" "vp" "wq" "xr" "ys" "zt"
"VV" "aw" "bx" "cy" "dz" "fa" "gb" "hc" "id" "je" "kf" "ql" "rm" "sn" "to" "up" "vq" "wr" "xs" "yt" "zu"
"WW" "ax" "by" "cz" "ea" "fb" "gc" "hd" "ie" "jf" "kg" "qm" "rn" "so" "tp" "uq" "vr" "ws" "xt" "yu" "zv"
"XX" "ay" "bz" "da" "eb" "fc" "gd" "he" "if" "jg" "kh" "qn" "ro" "sp" "tq" "ur" "vs" "wt" "xu" "yv" "zw"
"YY" "az" "ca" "db" "ec" "fd" "ge" "hf" "ig" "jh" "ki" "qo" "rp" "sq" "tr" "us" "vt" "wu" "xv" "yw" "zx"
"ZZ" "ba" "cb" "dc" "ed" "fe" "gf" "hg" "ih" "ji" "kj" … "qp" "rq" "sr" "ts" "ut" "vu" "wv" "xw" "yx" "zy"
You might wonder which approach is fester in practice. Let us check it (including all options for the comprehension we have used above):
julia> @time replace(v, mapping...);
2.896812 seconds (461 allocations: 76.314 MiB)
julia> @time [haskey(mapping, x) ? mapping[x] : v for x in v];
1.389256 seconds (31.67 k allocations: 154.680 MiB, 9.80% compilation time)
julia> @time let mapping=mapping
[haskey(mapping, x) ? mapping[x] : v for x in v]
end;
1.053450 seconds (33.60 k allocations: 154.853 MiB, 15.47% compilation time)
julia> helper2(x, mapping) = [haskey(mapping, x) ? mapping[x] : v for x in v];
julia> @time helper2(x, mapping);
0.952112 seconds (70.00 k allocations: 157.290 MiB, 19.68% compilation time)
julia> @time helper2(x, mapping);
0.910453 seconds (9 allocations: 152.588 MiB, 20.39% gc time)
As you can see replace
is slower even than the type unstable version of the comprehension.
What is the underlying reason for this? The answer is that replace
checks each key => value
pair linearly, which is much slower than using a dictionary if you have many such pairs.
Conclusions
In summary the basic lessons from the tests we discussed today are:
- If you are replacing values and a a non-matching value should be replaced by some flag the simplest and fast option is to use broadcasted
get
; - If you are replacing values and want to retain non-matching values then:
- if you have multiple
key=>value
paris use a comprehension withhaskey
(preferably making it type stable withlet
or a function); - if you have few
key=>value
pairsreplace
should be easiest and fast enough.
- if you have multiple