Introduction

Cutting data into groups (binning) is one of the most common data preprocessing tasks.

You can easily do binning into groups of equal sizes using the cut function from CategoricalArrays.jl like this (here we bin a vector of values from 1 to 10 into 2 groups):

julia> using CategoricalArrays

julia> cut(1:10, 2)
10-element CategoricalArray{String,1,UInt32}:
 "Q1: [1.0, 5.5)"
 "Q1: [1.0, 5.5)"
 "Q1: [1.0, 5.5)"
 "Q1: [1.0, 5.5)"
 "Q1: [1.0, 5.5)"
 "Q2: [5.5, 10.0]"
 "Q2: [5.5, 10.0]"
 "Q2: [5.5, 10.0]"
 "Q2: [5.5, 10.0]"
 "Q2: [5.5, 10.0]"

However, the issue becomes more challenging when the number of bins is not a divisor of vector length or if you have duplicates in the data.

The post is tested under Julia 1.5.3, DataFrames.jl 0.22.1, CategoricalArrays.jl 0.9.0, and FreqTables.jl 0.4.2.

The corner cases of binning

Let us first highlight some potential issues when binning data.

The first problem is when the number of groups is not a divisor of the vector length. Let us check it out on some examples:

julia> cut(1:10, 3)
10-element CategoricalArray{String,1,UInt32}:
 "Q1: [1.0, 4.0)"
 "Q1: [1.0, 4.0)"
 "Q1: [1.0, 4.0)"
 "Q2: [4.0, 7.0)"
 "Q2: [4.0, 7.0)"
 "Q2: [4.0, 7.0)"
 "Q3: [7.0, 10.0]"
 "Q3: [7.0, 10.0]"
 "Q3: [7.0, 10.0]"
 "Q3: [7.0, 10.0]"

julia> cut(1:10, 4)
10-element CategoricalArray{String,1,UInt32}:
 "Q1: [1.0, 3.25)"
 "Q1: [1.0, 3.25)"
 "Q1: [1.0, 3.25)"
 "Q2: [3.25, 5.5)"
 "Q2: [3.25, 5.5)"
 "Q3: [5.5, 7.75)"
 "Q3: [5.5, 7.75)"
 "Q4: [7.75, 10.0]"
 "Q4: [7.75, 10.0]"
 "Q4: [7.75, 10.0]"

The cut function is deterministic and it uses the quantile function to find the bin endpoints. This means that in the first example cut(1:10, 3) the third bin will be always larger than the first and second bin. Similarly in cut(1:10, 4) the first and the fourth bins are going to be larger deterministically.

The other problem is duplicates in data. Consider the following scenario:

julia> cut([1; fill(2, 8); 3], 2)
10-element CategoricalArray{String,1,UInt32}:
 "Q1: [1.0, 2.0)"
 "Q2: [2.0, 3.0]"
 "Q2: [2.0, 3.0]"
 "Q2: [2.0, 3.0]"
 "Q2: [2.0, 3.0]"
 "Q2: [2.0, 3.0]"
 "Q2: [2.0, 3.0]"
 "Q2: [2.0, 3.0]"
 "Q2: [2.0, 3.0]"
 "Q2: [2.0, 3.0]"

We want two bins. Ideally both should have five elements, but since we have duplicates in our data the first bin has size one and the second size nine.

In some cases you will like what cut produces, but in other cases one might want to avoid these two problems, that is:

  • always make the bins of equal size and if it is not possible to do so, make the decision which bin should be larger and which smaller randomly;
  • allow duplicates to be split between two or more bins (this is then unavoidable in some cases), but in such a way that each duplicate has the same chance to fall into each bin.

Random binning

Here is the function that performs the binning that has the properties I have described above:

using DataFrames
using FreqTables
using Random

function binvec(x::AbstractVector, n::Int,
                rng::AbstractRNG=Random.default_rng())
    n > 0 || throw(ArgumentError("number of bins must be positive"))
    l = length(x)

    # find bin sizes
    d, r = divrem(l, n)
    lens = fill(d, n)
    lens[1:r] .+= 1
    # randomly decide which bins should be larger
    shuffle!(rng, lens)

    # ensure that we have data sorted by x, but ties are ordered randomly
    df = DataFrame(id=axes(x, 1), x=x, r=rand(rng, l))
    sort!(df, [:x, :r])

    # assign bin ids to rows
    binids = reduce(vcat, [fill(i, v) for (i, v) in enumerate(lens)])
    df.binids = binids

    # recover original row order
    sort!(df, :id)
    return df.binids
end

Let us now test the binning on the following vector:

julia> Random.seed!(1234);

julia> x = repeat('a':'c', 3)
9-element Array{Char,1}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

julia> binvec(x, 2)
9-element Array{Int64,1}:
 1
 2
 2
 1
 2
 2
 1
 1
 2

As you can see 'b's are split betwen bin 1 and 2 to make them have almost equal size.

Let us make sure that binvec does the right job in deciding on bin sizes and splitting 'b's between both bins.

julia> df = reduce(vcat, [DataFrame(x=x, run_id=i, row_id=axes(x, 1),
                                    group_id=binvec(x, 2)) for i in 1:10_000]);

julia> freqtable(df, :group_id, :row_id, :x)
2×9×3 Named Array{Int64,3}

[:, :, x='a'] =
group_id ╲ row_id │     1      2      3      4      5      6      7      8      9
──────────────────┼──────────────────────────────────────────────────────────────
1                 │ 10000      0      0  10000      0      0  10000      0      0
2                 │     0      0      0      0      0      0      0      0      0

[:, :, x='b'] =
group_id ╲ row_id │    1     2     3     4     5     6     7     8     9
──────────────────┼─────────────────────────────────────────────────────
1                 │    0  4941     0     0  5005     0     0  5027     0
2                 │    0  5059     0     0  4995     0     0  4973     0

[:, :, x='c'] =
group_id ╲ row_id │     1      2      3      4      5      6      7      8      9
──────────────────┼──────────────────────────────────────────────────────────────
1                 │     0      0      0      0      0      0      0      0      0
2                 │     0      0  10000      0      0  10000      0      0  10000

Indeed we see that each 'b' falls to group 1 and group 2 with 50% probability. Also the expected size of group 1 and group 2 is 4.5 as desired. (both calculations are approximate because we used simulation.)

Conclusions

First – let me comment in what scenario the binning I described is desirable. Assume you have a set of patients you want to vaccinate against COVID-19. Now let each of them have assigned a discrete urgency level (typically there will be 3 or 4 such urgency levels). You then have to split them into several batches of equal size so that each batch gets a vaccine in a different period. If you want to be fair in assigning people to batches you get exactly the setting I have described.

Second – as usual I wanted to showcase some features of JuliaData ecosystem. In particular you have seen reduce-vcat combo twice (for vectors and for data frames) and integration of FreqTables.jl with DataFrames.jl that I am very fond of. Of course this is not the fastest way to get the desired results. I encourage you to write a faster function that does the same what the binvec function does.