Re-posted from: https://bkamins.github.io/julialang/2020/12/11/binning.html
Introduction
Cutting data into groups (binning) is one of the most common data preprocessing
tasks.
You can easily do binning into groups of equal sizes using the cut
function
from CategoricalArrays.jl like this (here we bin a vector of values from 1 to 10
into 2 groups):
julia> using CategoricalArrays
julia> cut(1:10, 2)
10-element CategoricalArray{String,1,UInt32}:
"Q1: [1.0, 5.5)"
"Q1: [1.0, 5.5)"
"Q1: [1.0, 5.5)"
"Q1: [1.0, 5.5)"
"Q1: [1.0, 5.5)"
"Q2: [5.5, 10.0]"
"Q2: [5.5, 10.0]"
"Q2: [5.5, 10.0]"
"Q2: [5.5, 10.0]"
"Q2: [5.5, 10.0]"
However, the issue becomes more challenging when the number of bins is not a
divisor of vector length or if you have duplicates in the data.
The post is tested under Julia 1.5.3, DataFrames.jl 0.22.1, CategoricalArrays.jl
0.9.0, and FreqTables.jl 0.4.2.
The corner cases of binning
Let us first highlight some potential issues when binning data.
The first problem is when the number of groups is not a divisor of the vector
length. Let us check it out on some examples:
julia> cut(1:10, 3)
10-element CategoricalArray{String,1,UInt32}:
"Q1: [1.0, 4.0)"
"Q1: [1.0, 4.0)"
"Q1: [1.0, 4.0)"
"Q2: [4.0, 7.0)"
"Q2: [4.0, 7.0)"
"Q2: [4.0, 7.0)"
"Q3: [7.0, 10.0]"
"Q3: [7.0, 10.0]"
"Q3: [7.0, 10.0]"
"Q3: [7.0, 10.0]"
julia> cut(1:10, 4)
10-element CategoricalArray{String,1,UInt32}:
"Q1: [1.0, 3.25)"
"Q1: [1.0, 3.25)"
"Q1: [1.0, 3.25)"
"Q2: [3.25, 5.5)"
"Q2: [3.25, 5.5)"
"Q3: [5.5, 7.75)"
"Q3: [5.5, 7.75)"
"Q4: [7.75, 10.0]"
"Q4: [7.75, 10.0]"
"Q4: [7.75, 10.0]"
The cut
function is deterministic and it uses the quantile
function to find
the bin endpoints. This means that in the first example cut(1:10, 3)
the third
bin will be always larger than the first and second bin. Similarly in cut(1:10,
the first and the fourth bins are going to be larger deterministically.
4)
The other problem is duplicates in data. Consider the following scenario:
julia> cut([1; fill(2, 8); 3], 2)
10-element CategoricalArray{String,1,UInt32}:
"Q1: [1.0, 2.0)"
"Q2: [2.0, 3.0]"
"Q2: [2.0, 3.0]"
"Q2: [2.0, 3.0]"
"Q2: [2.0, 3.0]"
"Q2: [2.0, 3.0]"
"Q2: [2.0, 3.0]"
"Q2: [2.0, 3.0]"
"Q2: [2.0, 3.0]"
"Q2: [2.0, 3.0]"
We want two bins. Ideally both should have five elements, but since we have
duplicates in our data the first bin has size one and the second size nine.
In some cases you will like what cut
produces, but in other cases
one might want to avoid these two problems, that is:
- always make the bins of equal size and if it is not possible to do so, make
the decision which bin should be larger and which smaller randomly; - allow duplicates to be split between two or more bins (this is then
unavoidable in some cases), but in such a way that each duplicate has the same
chance to fall into each bin.
Random binning
Here is the function that performs the binning that has the properties I have
described above:
using DataFrames
using FreqTables
using Random
function binvec(x::AbstractVector, n::Int,
rng::AbstractRNG=Random.default_rng())
n > 0 || throw(ArgumentError("number of bins must be positive"))
l = length(x)
# find bin sizes
d, r = divrem(l, n)
lens = fill(d, n)
lens[1:r] .+= 1
# randomly decide which bins should be larger
shuffle!(rng, lens)
# ensure that we have data sorted by x, but ties are ordered randomly
df = DataFrame(id=axes(x, 1), x=x, r=rand(rng, l))
sort!(df, [:x, :r])
# assign bin ids to rows
binids = reduce(vcat, [fill(i, v) for (i, v) in enumerate(lens)])
df.binids = binids
# recover original row order
sort!(df, :id)
return df.binids
end
Let us now test the binning on the following vector:
julia> Random.seed!(1234);
julia> x = repeat('a':'c', 3)
9-element Array{Char,1}:
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
julia> binvec(x, 2)
9-element Array{Int64,1}:
1
2
2
1
2
2
1
1
2
As you can see 'b'
s are split betwen bin 1 and 2 to make them have almost
equal size.
Let us make sure that binvec
does the right job in deciding on bin sizes
and splitting 'b'
s between both bins.
julia> df = reduce(vcat, [DataFrame(x=x, run_id=i, row_id=axes(x, 1),
group_id=binvec(x, 2)) for i in 1:10_000]);
julia> freqtable(df, :group_id, :row_id, :x)
2×9×3 Named Array{Int64,3}
[:, :, x='a'] =
group_id ╲ row_id │ 1 2 3 4 5 6 7 8 9
──────────────────┼──────────────────────────────────────────────────────────────
1 │ 10000 0 0 10000 0 0 10000 0 0
2 │ 0 0 0 0 0 0 0 0 0
[:, :, x='b'] =
group_id ╲ row_id │ 1 2 3 4 5 6 7 8 9
──────────────────┼─────────────────────────────────────────────────────
1 │ 0 4941 0 0 5005 0 0 5027 0
2 │ 0 5059 0 0 4995 0 0 4973 0
[:, :, x='c'] =
group_id ╲ row_id │ 1 2 3 4 5 6 7 8 9
──────────────────┼──────────────────────────────────────────────────────────────
1 │ 0 0 0 0 0 0 0 0 0
2 │ 0 0 10000 0 0 10000 0 0 10000
Indeed we see that each 'b'
falls to group 1
and group 2
with 50%
probability. Also the expected size of group 1
and group 2
is 4.5 as
desired. (both calculations are approximate because we used simulation.)
Conclusions
First – let me comment in what scenario the binning I described is desirable.
Assume you have a set of patients you want to vaccinate against COVID-19. Now
let each of them have assigned a discrete urgency level (typically there will be
3 or 4 such urgency levels). You then have to split them into several batches
of equal size so that each batch gets a vaccine in a different period. If you
want to be fair in assigning people to batches you get exactly the setting I
have described.
Second – as usual I wanted to showcase some features of JuliaData ecosystem. In
particular you have seen reduce
–vcat
combo twice (for vectors and for data
frames) and integration of FreqTables.jl with DataFrames.jl that I am very fond
of. Of course this is not the fastest way to get the desired results. I
encourage you to write a faster function that does the same what the binvec
function does.