Tag Archives: datastructures

A Simplified E-graph Implementation

By: Philip Zucker

Re-posted from: https://www.philipzucker.com/a-simplified-egraph/

I’ve been spending some time mulling over e-graphs. I think I have it kind of pared down until it’s fairly simple.
This implementation is probably not high performance as the main trick is removing a genuinely useful indexing structure. Still, this implementation is small enough that I can kind of keep it in my head. It has become rather cute.

For a user ready implementation of egraphs, see Metatheory https://github.com/0x0f0f0f/Metatheory.jl or egg https://egraphs-good.github.io/. For more motivation, see the egg paper or my first post https://www.philipzucker.com/egraph-1/

In a computer, terms like (a + b) * c are typically stored as trees. The EGraph converts this tree structure to one with an extra layer of indirection. Instead of nodes directly connecting to nodes, they connect first through a different kind of node labelled by an integer. These integer nodes are called eclasses and the modified original nodes are called enodes.

The EGraph is a bipartite directed graph between eclasses and enodes. ENodes belong uniquely to eclasses. Once we start merging eclasses, eclasses will point to more than one enode. The graph may also develop loops allowing representation in a sense of infinite terms.

Last time https://www.philipzucker.com/union-find-dict/, I built the abstraction of a union-find dict. This data structure allows you to retrieve information keyed on an equivalence class while still supporting the union operation. Given this piece, it is simple to define the two data structures.

@auto_hash_equals struct ENode
    head::Symbol
    args::Vector{Int64}
end

struct EGraph
    eclass::IntDisjointMap
    memo::Dict{ENode, Int64}
end

EGraph() = EGraph( IntDisjointMap(union)  , Dict{ENode,Int64}())

The eclass field is a union-find dictionary from equivalence classes to vectors of ENodes. We tell the underlying IntDisjointMap that upon a union! of equivalence classes, we will union the enode vectors in the codomain of the IntDisjointMap to each other.

The memo table is not strictly necessary, but it gives us a good way to lookup which eclass an enode belongs to. Otherwise we’d have to brute force search the entire IntDisjointMap to find ENodes when we want them.

ENodes hold references to EClass ids, which unfortunately can go stale. We can canonize ENodes to use the freshest equivalence class indices.

canonize(G::EGraph, f::ENode) = ENode(f.head, [find_root(G.eclass, a) for a in f.args])

To add an ENode to the egraph first we canonize it, then we check if it already is in the memo table, and if not we actually push it in the IntDisjointMap and update the memo table.

function addenode!(G::EGraph, f::ENode)
    f = canonize(G,f)
    if haskey(G.memo, f)
        return G.memo[f]
    else
        id = push!(G.eclass, [f])
        G.memo[f] = id
        return id
    end
end

#convenience functions for pushing Julia Expr
addexpr!(G::EGraph, f::Symbol) = addenode!(G, ENode(f,[]))
function addexpr!(G::EGraph, f::Expr)
    @assert f.head == :call
    addenode!(G,  ENode(f.args[1], [ addexpr!(G,a) for a in f.args[2:end] ]  ))
end

When we assert an equality to an egraph, we take the union! of the two corresponding eclasses. We union! the underlying IntDisjointMap, then we recanonize all the held ENodes in that eclass and update the memo table.

function Base.union!(G::EGraph, f::Int64, g::Int64)
    id = union!(G.eclass, f, g)
    eclass = ENode[]
    for enode in G.eclass[id] 
        delete!(G.memo, enode) # should we even bother with delete?
        enode = canonize(G, enode) # should canonize return whether it did anything or not?
        G.memo[enode] = id
        push!(eclass, enode)
    end
    G.eclass[id] = eclass
end

That’s kind of it.

The big thing we haven’t discussed is calculating congruence closure. In my original presentation, this was a whole ordeal and the reason why we needed to maintain parent pointers from eclasses to enodes. This was very confusing.

Instead we can just find congruences via a brute force sweep over the egraph. This is inefficient compared to having likely candidates pointed out to us by the parent pointers. However, during naive ematching we are sweeping over the egraph anyway to find possible rewrite rules applications. This approach makes congruence closure feel rather similar to the other rewrite rules in the sense. There may be some utility in not considering congruence closure as a truly intrinsic part of the egraph. Perhaps you could use it for systems where congruence does not strictly hold?

An unfortunate thing is that congruences needs to be applied in worst case a number of time proportional to the depth of the egraph, as it only propagates congruences one layer at a time.

How it works: after a union! operation there are non canonical ENodes held in both memo and eclass. These noncanonical ENodes are exactly those who have arguments that include the eclass that was just turned into a child of another eclass. These are also exactly those ENodes that are candidates for congruence closure propagation. We can detect them during the sweep by canonization.

This expensive congruence sweep forgives more sins than the more efficient one. Something that can happen is that we try to addexpr! an ENode that is one of the stale ones, in other words it should be in the memo table but is not. This will falsely create a new eclass for this ENode. However, the congruence closure sweep will find this equivalence on the next pass.


# I forgot to include this IntDisjointMap iterator function in my last post.
# Conceptually it belongs there.
function Base.iterate(x::IntDisjointMap, state=1)
    while state <= length(x.parents)
        if x.parents[state] < 0
            return ((state, x.values[state]) , state + 1)
        end
        state += 1
    end
    return nothing
end

# returns a list of tuples of found congruences
function congruences(G::EGraph)
    buf = Tuple{Int64,Int64}[] 
    for (id1, eclass) in G.eclass #alternatively iterate over memo
        for enode in eclass
            cnode = canonize(G,enode)
            if haskey(G.memo, cnode)
                id2 = G.memo[cnode]
                if id1 != id2
                    push!(buf, (id1,id2))
                end
            end
        end
    end
    return buf
end

# propagate all congruences
function propagate_congruence(G::EGraph)
    cong = congruences(G)
    while length(cong) > 0
        for (i,j) in cong
            union!(G,i,j)
        end
        cong = congruences(G)
    end
end

Bits and Bobbles

In principle I think this formulation makes it easier to parallelize congruence finding alongside rewrite rule matching. The rewriting process becomes a swapsies between finding tuples to union and actually applying them.

Everything in this post could probably be tuned up to be more efficient.

To add analyses, you want to store a compound structure in the IntDisjointMap. Tuple{Vector{ENode}, Analysis) The merge operation then does both enode merge and analysis merge.

Possibly changing enodes to be binary might be nice. One can compile arbitrary arity into this. Then everything fits in place in the appropriate arrays, removing significant indirection

Uses of egraphs:

My other implementation

Edit: Max Willsey on twitter, an author of egg, says that egg originally took an approach to congruence like the above but found it too slow on larger workloads. It does indeed have a worse asymptotic performance than actually tracking parents and sniping the congruence locations. https://twitter.com/mwillsey/status/1378476707460509698?s=20

Some tests

using EGraphs
using Test

@testset "Basic EGraph" begin
G = EGraph()
a = addenode!(G, ENode(:a, []))
b = addenode!(G, ENode(:b, []))
#println(G)
union!(G, a, b)
#println(G)
@test addenode!(G, ENode(:a, [])) == 2
@test addenode!(G, ENode(:c, [])) == 3
@test addenode!(G, ENode(:f, [a])) == 4
union!(G, 3, 4)

#= println(G)
for (k,v) in G.eclass
    println(k,v)
end =#
G = EGraph()
a = addenode!(G, ENode(:a, []))
b = addenode!(G, ENode(:b, []))
c = addenode!(G, ENode(:c, []))
union!(G, a, b)
fa = addenode!(G, ENode(:f, [a])) 
fb = addenode!(G, ENode(:f, [b])) 
fc = addenode!(G, ENode(:f, [c])) 

ffa = addenode!(G, ENode(:f, [fa])) 
ffb = addenode!(G, ENode(:f, [fb])) 

@test congruences(G) == [(fa,fb)]


for (x,y) in congruences(G)
    union!(G,x,y)
end

@test congruences(G) == [(ffa,ffb)]

union!(G, a, c)

@test congruences(G) == [(fc,fb), (ffa,ffb)]

for (x,y) in congruences(G)
    union!(G,x,y)
end

@test congruences(G) == []


G = EGraph()
f5a = addexpr!(G, :( f(f(f(f(f(a)))))  ))
f2a = addexpr!(G, :( f(f(a))  ))
@test length(G.eclass) == 6
union!(G , f5a, f2a)
@test find_root(G,f5a) == find_root(G,f2a)
@test length(G.eclass) == 5
f5a = addexpr!(G, :( f(f(f(f(f(a)))))  ))
f2a = addexpr!(G, :( f(f(f(a)))  ))
@test length(G.eclass) == 5

G = EGraph()
f5a = addexpr!(G, :( f(f(f(f(f(a)))))  ))
fa = addexpr!(G, :( f(a)  ))
a = addexpr!(G, :a)
@test length(G.eclass) == 6
union!(G , fa, a)
@test find_root(G,fa) == find_root(G,a)

propagate_congruence(G)
@test length(G.eclass) == 1

G = EGraph()
ffa = addexpr!(G, :( f(f(a))  ))
f5a = addexpr!(G, :( f(f(f(f(f(a)))))  ))
a = addexpr!(G, :a)
@test length(G.eclass) == 6
union!(G , ffa, a)
@test find_root(G,ffa) == find_root(G,a)

propagate_congruence(G)
@test length(G.eclass) == 2



end

Union Find Dicts: Dictionaries Keyed on Equivalence Classes

By: Philip Zucker

Re-posted from: https:/www.philipzucker.com/union-find-dict/

Dictionary/map data structures usually have a set-like counter part. You get a Set data structure over the keys of a dict by just storing some dummy value such as ().

  • Inserting element k in the set is dictionary insertion d[k] = ()
  • Key lookup is the same thing as check set membership. haskey(d,k)
  • Set union is dictionary merge.
  • Enumerating elements of the set is enumerating the keys of the dict.

Not every set data structure is derived this way. BitSets for example are not really a dict you ignore the value in.
Nevertheless, hash and binary tree based Maps and Sets have this similarity to each other. They are largely the same except perhaps the set data structure is specialized to not even bothering to hold the default value.

The union find data structure is a way of storing disjoint unions of sets in such a way that the union operation remains fast. There is a pretty good explanation here https://en.wikipedia.org/wiki/Disjoint-set_data_structure

I want a version of union find that is a dictionary that stores values keyed on these equivalence classes. I mention why at the end (spoiler: Egraphs).

I’m kind of surprised this abstract data structure interface is not commonly available. I couldn’t find anything when I looked. It is only a very small addition to the regular union find.

Basic Union Find Without Path Compression

You can find the DataStructures.jl version of the union find data structure with path compression here https://github.com/JuliaCollections/DataStructures.jl/blob/master/src/disjoint_set.jl. It’s pretty readable, which is a nice aspect of Julia.

struct IntDisjointSet
    parents::Vector{Int64}
end

This stores and array of indices of the parents of the nodes.

IntDisjointSet() = IntDisjointSet(Vector{Int64}[])
Base.length(x::IntDisjointSet) = length(x.parents)

Here we use a little trick, perhaps ill advised. When an item j is a root node, the storage at
x.parents[j] is redundant. One choice is to set x.parents[j] = j in this case. This is what the DataStructures.jl version does, and it does make some operations clean. Another is to use this space to store extra information. I used negatives numbers at the parent node pointer to represent the size of the tree. This allows union to easily place the smaller tree under the larger one. Hence creating a set of size 1 has a -1 put in its parents array here.

Base.push!(x::IntDisjointSet) = push!(x.parents, -1)

Here is a simple find_root loop. It traverses up the parent array. Here is a basic version of it where I stripped out path compression. Sacrilege you say? Perhaps. Path compression is kind of the cleverest and most confusing part.

function find_root(x::IntDisjointSet, i::Int64)
    while x.parents[i] >= 0
        i = x.parents[i]
    end
    return i
end

The union operation is almost interesting. It sets the parent of the smaller tree to the larger one. This has a tendency to keep the trees shallow. You could also use the rank of the subtree. In the absence of path compression, the rank is the height.

function Base.union!(x::IntDisjointSet, i::Int64, j::Int64)
    pi = find_root(x, i)
    pj = find_root(x, j)
    if pi != pj
        isize = -x.parents[pi]
        jsize = -x.parents[pj]
        if isize > jsize # swap to make size of i less than j
            pi, pj = pj, pi
            isize, jsize = jsize, isize
        end
        x.parents[pj] -= isize # increase new size of pj
        x.parents[pi] = pj # set parent of pi to pj
    end
    return pj
end

I have some suspicion for a use case that normalizing the parent pointers all at once in a batch might be kind of nice. Then further lookups are no longer mutating and can more easily be done in parallel and without a loop. Depends on your usage pattern.

function normalize!(x::IntDisjointSet)
    for i in length(x)
        pi = find_root(x, i)
        if pi != i
            x.parents[i] = pi
        end
    end
end

# If normalized we don't even need a loop here.
function _find_root_normal(x::IntDisjointSet, i::Int64)
    pi = x.parents[i]
    if pi < 0 # Is `i` a root?
        return i
    else
        return pi
    end
end

A question I had is if it was possible to make a dictionary structure that is keyed on equivalence classes. Yes, it is, With some caveats.

You need to define what you want to happen to the values when two equivalance class keys merge. I think a reasonable requirement is that values held in the are combined using an associative and commutative operation, since you are unlikely to be able to predict these orderings. It is tempting to think that this should be a semilattice (idempotent) but it is not necessary.
If you wanted to use arrays with array concatenation for example, expect the order of the array to be scrambled.

struct IntDisjointMap
    parents::Vector{Int64}
    values::Vector{Any}
    merge_fun
end

IntDisjointMap(merge_fun) = IntDisjointMap(Vector{Int64}[], [], merge_fun)

Most of the previous definitions stay the same. We do have to change the definition of union and we can add new definitions for retrieving and updating the dict.

Base.getindex(x::IntDisjointMap, i::Int64) = x.values[find_root(x,i)]
Base.setindex!(x::IntDisjointMap, v, i::Int64) = x.values[find_root(x,i)] = v


function Base.union!(x::IntDisjointMap, i::Int64, j::Int64)
    pi = find_root(x, i)
    pj = find_root(x, j)
    if pi != pj
        isize = -x.parents[pi]
        jsize = -x.parents[pj]
        if isize > jsize # swap to make size of i less than j
            pi, pj = pj, pi
            isize, jsize = jsize, isize
        end
        x.parents[pj] -= isize # increase new size of pj
        x.values[pj] = x.merge_fun(x.values[pj], x.values[pi])
        x.values[pi] = nothing # clear out unused storage
        x.parents[pi] = pj
    end
    return pj
end

Some usage storing integers in the Map using + as a merge operation

using Test
G = IntDisjointMap(+)
push!(G, 1)
@test G.parents == [-1]
@test G.values == [1]
push!(G,14)
@test G.parents == [-1, -1]
@test G.values == [1, 14]
@test G[1] == 1
@test G[2] == 14
union!(G,1,2)
@test G.parents == [2,-2]
@test G.values == [nothing, 15]
@test G[1] == 15
@test G[2] == 15
push!(G, 4)
@test find_root(G,1) == 2
@test find_root(G,2) == 2
@test find_root(G,3) == 3
union!(G,1,3)
@test G[3] == 19
@test find_root(G,3) == 2
@test find_root(G,1) == 2
@test G.parents == [2,-3,2]

G[3] = 42
@test G[2] == 42

Why?

Well, I think the the IntDisjointMap is a missing obvious intersection of abstract data type interfaces.

I’ve been playing around with EGraphs lately https://www.philipzucker.com/egraph-1/, and they involve a bipartite graph between Equivalence classes and ENodes, which are tern nodes with the children replaced by EClass Ids. A fairly minimal representation of such a structure would be an IntDisjointMap holding ENodes.

struct ENode
    head::Symbol
    args::Vector{Int64}
end

I’m kind of hoping to make a very minimal, perhaps inefficient implementation of the essence of egraphs.

Notes

Structures mapping equivalence classes to terms also show up in unification.

There is no impediment to implementing this interface over a full path compressed union find.
It is also possible to implement this as a wrapper over a normal union find, at the expense of some unnecessary find operations

struct IntDisjointMap
    unionfind::IntDisjointSet
    values::Vector{Int64}
    merge_fun
end

I think that perhaps using Parametric types like so may give a speed boost as Julia specializes on types.

struct IntDisjointMap{V}
    parents::Vector{Int64}
    values::Vector{V}
    merge_fun
end

Even further, putting a tag of the merge_fun in the type might do something good specializing in that function call.

struct IntDisjointMap{V, merge_fun}
    parents::Vector{Int64}
    values::Vector{V}
end

Union Find Dicts: Dictionaries Keyed on Equivalence Classes

By: Philip Zucker

Re-posted from: https://www.philipzucker.com/union-find-dict/

Dictionary/map data structures usually have a set-like counter part. You get a Set data structure over the keys of a dict by just storing some dummy value such as ().

  • Inserting element k in the set is dictionary insertion d[k] = ()
  • Key lookup is the same thing as check set membership. haskey(d,k)
  • Set union is dictionary merge.
  • Enumerating elements of the set is enumerating the keys of the dict.

Not every set data structure is derived this way. BitSets for example are not really a dict you ignore the value in.
Nevertheless, hash and binary tree based Maps and Sets have this similarity to each other. They are largely the same except perhaps the set data structure is specialized to not even bothering to hold the default value.

The union find data structure is a way of storing disjoint unions of sets in such a way that the union operation remains fast. There is a pretty good explanation here https://en.wikipedia.org/wiki/Disjoint-set_data_structure

I want a version of union find that is a dictionary that stores values keyed on these equivalence classes. I mention why at the end (spoiler: Egraphs).

I’m kind of surprised this abstract data structure interface is not commonly available. I couldn’t find anything when I looked. It is only a very small addition to the regular union find.

Basic Union Find Without Path Compression

You can find the DataStructures.jl version of the union find data structure with path compression here https://github.com/JuliaCollections/DataStructures.jl/blob/master/src/disjoint_set.jl. It’s pretty readable, which is a nice aspect of Julia.

struct IntDisjointSet
    parents::Vector{Int64}
end

This stores and array of indices of the parents of the nodes.

IntDisjointSet() = IntDisjointSet(Vector{Int64}[])
Base.length(x::IntDisjointSet) = length(x.parents)

Here we use a little trick, perhaps ill advised. When an item j is a root node, the storage at
x.parents[j] is redundant. One choice is to set x.parents[j] = j in this case. This is what the DataStructures.jl version does, and it does make some operations clean. Another is to use this space to store extra information. I used negatives numbers at the parent node pointer to represent the size of the tree. This allows union to easily place the smaller tree under the larger one. Hence creating a set of size 1 has a -1 put in its parents array here.

Base.push!(x::IntDisjointSet) = push!(x.parents, -1)

Here is a simple find_root loop. It traverses up the parent array. Here is a basic version of it where I stripped out path compression. Sacrilege you say? Perhaps. Path compression is kind of the cleverest and most confusing part.

function find_root(x::IntDisjointSet, i::Int64)
    while x.parents[i] >= 0
        i = x.parents[i]
    end
    return i
end

The union operation is almost interesting. It sets the parent of the smaller tree to the larger one. This has a tendency to keep the trees shallow. You could also use the rank of the subtree. In the absence of path compression, the rank is the height.

function Base.union!(x::IntDisjointSet, i::Int64, j::Int64)
    pi = find_root(x, i)
    pj = find_root(x, j)
    if pi != pj
        isize = -x.parents[pi]
        jsize = -x.parents[pj]
        if isize > jsize # swap to make size of i less than j
            pi, pj = pj, pi
            isize, jsize = jsize, isize
        end
        x.parents[pj] -= isize # increase new size of pj
        x.parents[pi] = pj # set parent of pi to pj
    end
    return pj
end

I have some suspicion for a use case that normalizing the parent pointers all at once in a batch might be kind of nice. Then further lookups are no longer mutating and can more easily be done in parallel and without a loop. Depends on your usage pattern.

function normalize!(x::IntDisjointSet)
    for i in length(x)
        pi = find_root(x, i)
        if pi != i
            x.parents[i] = pi
        end
    end
end

# If normalized we don't even need a loop here.
function _find_root_normal(x::IntDisjointSet, i::Int64)
    pi = x.parents[i]
    if pi < 0 # Is `i` a root?
        return i
    else
        return pi
    end
end

A question I had is if it was possible to make a dictionary structure that is keyed on equivalence classes. Yes, it is, With some caveats.

You need to define what you want to happen to the values when two equivalance class keys merge. I think a reasonable requirement is that values held in the are combined using an associative and commutative operation, since you are unlikely to be able to predict these orderings. It is tempting to think that this should be a semilattice (idempotent) but it is not necessary.
If you wanted to use arrays with array concatenation for example, expect the order of the array to be scrambled.

struct IntDisjointMap
    parents::Vector{Int64}
    values::Vector{Any}
    merge_fun
end

IntDisjointMap(merge_fun) = IntDisjointMap(Vector{Int64}[], [], merge_fun)

Most of the previous definitions stay the same. We do have to change the definition of union and we can add new definitions for retrieving and updating the dict.

Base.getindex(x::IntDisjointMap, i::Int64) = x.values[find_root(x,i)]
Base.setindex!(x::IntDisjointMap, v, i::Int64) = x.values[find_root(x,i)] = v


function Base.union!(x::IntDisjointMap, i::Int64, j::Int64)
    pi = find_root(x, i)
    pj = find_root(x, j)
    if pi != pj
        isize = -x.parents[pi]
        jsize = -x.parents[pj]
        if isize > jsize # swap to make size of i less than j
            pi, pj = pj, pi
            isize, jsize = jsize, isize
        end
        x.parents[pj] -= isize # increase new size of pj
        x.values[pj] = x.merge_fun(x.values[pj], x.values[pi])
        x.values[pi] = nothing # clear out unused storage
        x.parents[pi] = pj
    end
    return pj
end

Some usage storing integers in the Map using + as a merge operation

using Test
G = IntDisjointMap(+)
push!(G, 1)
@test G.parents == [-1]
@test G.values == [1]
push!(G,14)
@test G.parents == [-1, -1]
@test G.values == [1, 14]
@test G[1] == 1
@test G[2] == 14
union!(G,1,2)
@test G.parents == [2,-2]
@test G.values == [nothing, 15]
@test G[1] == 15
@test G[2] == 15
push!(G, 4)
@test find_root(G,1) == 2
@test find_root(G,2) == 2
@test find_root(G,3) == 3
union!(G,1,3)
@test G[3] == 19
@test find_root(G,3) == 2
@test find_root(G,1) == 2
@test G.parents == [2,-3,2]

G[3] = 42
@test G[2] == 42

Why?

Well, I think the the IntDisjointMap is a missing obvious intersection of abstract data type interfaces.

I’ve been playing around with EGraphs lately https://www.philipzucker.com/egraph-1/, and they involve a bipartite graph between Equivalence classes and ENodes, which are tern nodes with the children replaced by EClass Ids. A fairly minimal representation of such a structure would be an IntDisjointMap holding ENodes.

struct ENode
    head::Symbol
    args::Vector{Int64}
end

I’m kind of hoping to make a very minimal, perhaps inefficient implementation of the essence of egraphs.

Notes

Structures mapping equivalence classes to terms also show up in unification.

There is no impediment to implementing this interface over a full path compressed union find.
It is also possible to implement this as a wrapper over a normal union find, at the expense of some unnecessary find operations

struct IntDisjointMap
    unionfind::IntDisjointSet
    values::Vector{Int64}
    merge_fun
end

I think that perhaps using Parametric types like so may give a speed boost as Julia specializes on types.

struct IntDisjointMap{V}
    parents::Vector{Int64}
    values::Vector{V}
    merge_fun
end

Even further, putting a tag of the merge_fun in the type might do something good specializing in that function call.

struct IntDisjointMap{V, merge_fun}
    parents::Vector{Int64}
    values::Vector{V}
end

Would Int32 give savings?

Is there a category of Equivalence Maps? Maybe.

How to do this in prolog? A dict keyed on unification variables. Would attributed variables suffice?