Re-posted from: https://bkamins.github.io/julialang/2023/02/10/in.html
Introduction
Today I decided to discuss the in
function, which is a basic topic that,
from my teaching experience, often surprises people learning Julia.
I will cover several concrete cases that are worth knowing as either you might
use them yourself or might encounter them in the code that you would be reading.
The post is tested under Julia 1.8.5.
The basic syntax of in
in
is a function in Julia. It is used to determine whether an item is in the
given collection.
Since in
is a function you can invoke it using the standard function call
syntax:
julia> in(1, [1, 2, 3])
true
However, this operation is so common that there are two other ways to perform
this operation:
julia> 1 in [1, 2, 3]
true
julia> 1 ∈ [1, 2, 3]
If you wonder how to type ∈
then you can check it in Julia’s help:
help?> ∈
"∈" can be typed by \in<tab>
Since ∈
is the same as in
you can also write:
julia> ∈(1, [1, 2, 3])
true
although this is likely not the most readable way to do it.
Finally there is an accompanying ∋
syntax that has the order of arguments
reversed:
julia> ∋([1, 2, 3], 1)
true
julia> [1, 2, 3] ∋ 1
true
help?> ∋
"∋" can be typed by \ni<tab>
Negating in
Often you want to check if some element is not in a collection. Here are the
standard ways you can do it (you could similarly negate ∈
and ∋
):
julia> !in(1, [1, 2, 3])
false
julia> !(1 in [1, 2, 3])
false
However, there are also convenience ∉
and ∌
operators:
julia> 1 ∉ [1, 2, 3]
false
julia> [1, 2, 3] ∌ 1
false
help?> ∉
"∉" can be typed by \notin<tab>
help?> ∌
"∌" can be typed by \nni<tab>
Higher-order function
In all cases of in
, ∈
, ∋
, ∉
, and ∌
you can easily create a function
taking only one argument fixing the second argument of the operation.
For example writing in([1, 2, 3])
is equivalent to creation of an anonymous
function e -> e in [1, 2, 3]
. Let us show this syntax at work:
julia> in([1, 2, 3])(1)
true
julia> ∋(1)([1, 2, 3])
true
julia> ∉([1, 2, 3])(1)
false
This syntax is particularly useful when working with higher-order functions:
julia> map(in(Set([1, 2, 3])), [-1, 1, 3, 5])
4-element Vector{Bool}:
0
1
1
0
Performance
In the last example above you probably noticed that I used Set
instead of a
vector for lookup. This is an important pattern:
- lookup in a vector does not have any preprocessing cost, but later
in
execution time is, on the average, linear with the size of the vector
(advanced tip: if vector is sorted you can use theinsorted
function
instead and it will be faster); - lookup in a set has the cost of creating it, but later
in
execution time
does not grow with the size of the collection.
In summary, if you have a large collection in which you want to perform lookup
many times then make sure to convert this collection to a set (timings are after
compilation):
julia> v = rand(1:1_000_000, 10_000);
julia> @time count(in(v), 1)
0.000022 seconds (2 allocations: 48 bytes)
0
julia> @time count(in(Set(v)), 1)
0.000199 seconds (10 allocations: 144.648 KiB)
0
julia> @time count(in(v), 1:1_000_000)
6.104646 seconds (5 allocations: 112 bytes)
9941
julia> @time count(in(Set(v)), 1:1_000_000)
0.017825 seconds (13 allocations: 144.711 KiB)
9941
Note that if we made one lookup Set
creation cost was significant, but if we
made one million lookups creation of a Set
was crucial to ensure good
performance of the operation.
Broadcasting
It is tempting to run the operation:
julia> map(in(Set([1, 2, 3])), [-1, 1, 3, 5])
4-element Vector{Bool}:
0
1
1
0
using broadcasting like this:
julia> in.([-1, 1, 3, 5], Set([1, 2, 3]))
ERROR: DimensionMismatch: arrays could not be broadcast to a common size; got a dimension with lengths 4 and 3
However, this fails, because broadcasting iterates both arguments of the in
function [-1, 1, 3, 5]
and Set([1, 2, 3])
. There are two ways how you can
fix it. The first is protecting the collection in which you want to perform
lookup using Ref
:
julia> in.([-1, 1, 3, 5], Ref(Set([1, 2, 3])))
4-element BitVector:
0
1
1
0
julia> [-1, 1, 3, 5] .∈ Ref(Set([1, 2, 3]))
4-element BitVector:
0
1
1
0
The other is to use higher-order function approach:
julia> in(Set([1, 2, 3])).([-1, 1, 3, 5])
4-element BitVector:
0
1
1
0
How does in
lookup work?
The final issue is related to the definition of in
. It states that in
checks
if an item is in the given collection. But what does it mean exactly?
First you need to understand how the collections are iterated. If you do not
know much about this topic you can find a description of the iteration interface
in my recent post.
A typical example that is tricky is Dict
lookup. Since Dict
iterates
key-value pairs the following is incorrect:
julia> 1 in Dict(1 => "a", 2 => "b")
ERROR: AbstractDict collections only contain Pairs;
Instead you most likely wanted:
julia> 1 in keys(Dict(1 => "a", 2 => "b"))
true
The second issue is how does in
check for equality between an item and
elements of the collection. This issue is particularly tricky. Normally the
==
function is used, but for Set
and Dict
the isequal
function is used.
Here are some examples showing you the difference:
julia> v = [1.0, missing, -0.0]
3-element Vector{Union{Missing, Float64}}:
1.0
missing
-0.0
julia> s = Set(v)
Set{Union{Missing, Float64}} with 3 elements:
missing
-0.0
1.0
julia> d = Dict(v .=> 'a':'c')
Dict{Union{Missing, Float64}, Char} with 3 entries:
missing => 'b'
-0.0 => 'c'
1.0 => 'a'
julia> missing in v
missing
julia> missing in s
true
julia> missing in keys(d)
true
julia> (missing => 'b') in d
true
julia> 0.0 in v
true
julia> -0.0 in s
true
julia> 0.0 in s
false
julia> 0.0 in keys(d)
false
julia> -0.0 in keys(d)
true
julia> (0.0 => 'c') in d
false
julia> (-0.0 => 'c') in d
true
The reason for these results is:
julia> missing == missing
missing
julia> isequal(missing, missing)
true
julia> 0.0 == -0.0
true
julia> isequal(0.0, -0.0)
false
Conclusions
As you can see the in
function has several non-obvious behaviors in terms
of:
- syntax: you can use five different operations:
in
,∈
,∋
,∉
, and∌
; - performance: be careful to avoid performance trap of doing many lookups in a
vector; - lookup rule:
Set
andDict
useisequal
test, while normally==
is used;
this is especially relevant in combination with performance recommendation –
you might get a different result of your operations if you switch from vector
toSet
because you wanted to speed-up your computations.
All topics I discussed today are documented in the Julia Manual. However,
I hope that having them presented by example in a single place in this post is
useful for you.