Re-posted from: https://bkamins.github.io/julialang/2024/03/29/ffd.html
Introduction
Often when working with data we need to get all possible combinations of some
input factors in a data frame. In the field of design of experiments
this is called full factorial design. In this post I will discuss two functions
that DataFrames.jl provides that can help you to generate such designs if
you needed them.
The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.
What is a full factorial design and how to create it?
Assume we are a cardboard box producer have three factors describing a box: its width, height, and depth.
Each of them has a finite set of possible values (due to production process limitations).
Let us create some sample data of this kind:
julia> height = [10, 12]
2-element Vector{Int64}:
10
12
julia> width = [8, 10, 15]
3-element Vector{Int64}:
8
10
15
julia> depth = [5, 6]
2-element Vector{Int64}:
5
6
Our task is to compute the volume of all possible boxes that can be created by our factory.
The list of all possible cardboard box configurations is a full factorial design.
You can get an iterator of these values by using the Iterators.product
function:
julia> Iterators.product(height, width, depth)
Base.Iterators.ProductIterator{Tuple{Vector{Int64}, Vector{Int64}, Vector{Int64}}}(([10, 12], [8, 10, 15], [5, 6]))
This function is lazy, to see the result we need to materialize its return value using e.g. collect
:
julia> collect(Iterators.product(height, width, depth))
2×3×2 Array{Tuple{Int64, Int64, Int64}, 3}:
[:, :, 1] =
(10, 8, 5) (10, 10, 5) (10, 15, 5)
(12, 8, 5) (12, 10, 5) (12, 15, 5)
[:, :, 2] =
(10, 8, 6) (10, 10, 6) (10, 15, 6)
(12, 8, 6) (12, 10, 6) (12, 15, 6)
We can see that we get an array of tuples of all possible combinations of dimensions. Let us now compute the volumes:
julia> prod.(collect(Iterators.product(height, width, depth)))
2×3×2 Array{Int64, 3}:
[:, :, 1] =
400 500 750
480 600 900
[:, :, 2] =
480 600 900
576 720 1080
The results are nice and efficient. However, sometimes it is more convenient to have this data in a data frame.
Full factorial design in DataFrames.jl
Let us repeat the exercise using DataFrames.jl:
julia> using DataFrames
julia> df = allcombinations(DataFrame; height, width, depth)
12×3 DataFrame
Row │ height width depth
│ Int64 Int64 Int64
─────┼──────────────────────
1 │ 10 8 5
2 │ 12 8 5
3 │ 10 10 5
4 │ 12 10 5
5 │ 10 15 5
6 │ 12 15 5
7 │ 10 8 6
8 │ 12 8 6
9 │ 10 10 6
10 │ 12 10 6
11 │ 10 15 6
12 │ 12 15 6
Note that we passed height
, width
, depth
as keyword arguments to allcombinations
taking advantage of a nice functionality of Julia that in this case we can avoid writing height=height
as just writing height
gives us the same result.
Now we can add a volume column:
julia> transform!(df, All() => ByRow(*) => "volume")
12×4 DataFrame
Row │ height width depth volume
│ Int64 Int64 Int64 Int64
─────┼──────────────────────────────
1 │ 10 8 5 400
2 │ 12 8 5 480
3 │ 10 10 5 500
4 │ 12 10 5 600
5 │ 10 15 5 750
6 │ 12 15 5 900
7 │ 10 8 6 480
8 │ 12 8 6 576
9 │ 10 10 6 600
10 │ 12 10 6 720
11 │ 10 15 6 900
12 │ 12 15 6 1080
We have added the "volume"
column in place to df
. Note that we used *
as it can take any number of positional arguments and returns their product. The ByRow
wrapper signals that we want to perform this operation row-wise.
In comparison to the solution shown before many users find presentation of a full factorial design easier to work with.
What if we have a fractional factorial design?
Sometimes your data is incomplete, and some level combinations are missing. Let us start by creating such a data frame:
julia> df2 = df[Not(2, 5, 9), :]
9×4 DataFrame
Row │ height width depth volume
│ Int64 Int64 Int64 Int64
─────┼──────────────────────────────
1 │ 10 8 5 400
2 │ 10 10 5 500
3 │ 12 10 5 600
4 │ 12 15 5 900
5 │ 10 8 6 480
6 │ 12 8 6 576
7 │ 12 10 6 720
8 │ 10 15 6 900
9 │ 12 15 6 1080
Now one might ask to complete this design and re-fill the design to be complete. This can be done by the fillcombinations
function. Let us see it at work:
julia> fillcombinations(df2, Not("volume"))
12×4 DataFrame
Row │ height width depth volume
│ Int64 Int64 Int64 Int64?
─────┼───────────────────────────────
1 │ 10 8 5 400
2 │ 12 8 5 missing
3 │ 10 10 5 500
4 │ 12 10 5 600
5 │ 10 15 5 missing
6 │ 12 15 5 900
7 │ 10 8 6 480
8 │ 12 8 6 576
9 │ 10 10 6 missing
10 │ 12 10 6 720
11 │ 10 15 6 900
12 │ 12 15 6 1080
Observe that after calling this function we have created a new data frame with the missing rows added. The "volume"
column is filled by default with missing
for rows that were added. The Not("volume")
argument meant that we want to get all combinations of values in all columns except "volume"
.
Conclusions
Today we worked with two functions: allcombinations
and fillcombinations
. You will find them useful if in your work you will ever need to create all combinations of levels of some factors. This functionality seems niche, but it is needed in practice surprisingly often.