Working with binary libraries for optimization in Julia

By: Julia on μβ

Re-posted from: https://matbesancon.xyz/post/2019-11-04-binary-julia/

Unlike people in other ecosystems of the scientific programming world, scientists
and engineers working with Julia usually prefer a whole stack in Julia, for several
reasons: the compiler does a much better job when it can infer what
is going on in a piece of code; when an error is thrown, the stack trace looks
much nicer when only pure Julia code is involved; and functions and types can be
defined as generically as desired, without hard-coded container or number types for instance.

Sometimes, however, interoperability with native code is needed, by which I mean
natively built libraries (*.so files on Linux, *.dylib on macOS, *.dll on Windows).
In this post, we will explore some tools to work with native libraries in Julia.
In the last couple of weeks, I tinkered a bit with the HiGHS
solver developed at the University of Edinburgh, which I will use as an example
throughout this post. It is still a work in progress, but shows great promise as the
next-generation linear optimization solver in the COIN-OR suite.

What does a native lib look like?

Looking at the repository,
it is a pretty standard CMake-based C++ project producing both an executable and
library which can be called through a C interface.
The two initial components are:

  1. The source code producing the library; this can be written in any language producing native code (C, C++, Rust).
  2. The header file defining the C API to call the library from other programs.

This interface is defined in a single header file, src/interfaces/highs_c_api.h.
Header files may define a bunch of types (structs, unions, enums), but most
importantly they define function prototypes looking like:

int preprocess_variables(int* values, double offset, float coefficient);

When using the function from Julia, the call to the native library looks like
the following:

ccall(
  (:preprocess_variables, my_library_name),
  Cint, # return type
  (Ptr{Cint}, Cdouble, Cfloat), # tuple of argument types
  my_array, 3.5, 4.5f0, # arguments, passed directly rather than as a tuple
)
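To see the same pattern against a library that is guaranteed to be present, here is a self-contained call into the C standard library (nothing HiGHS-specific):

```julia
# strlen from libc: no library path is needed for symbols already loaded
# in the process; Cstring converts the Julia String automatically.
len = ccall(:strlen, Csize_t, (Cstring,), "hello")
Int(len)  # 5
```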

Let us dive in.

Solution 1: build and link

For this approach, the first step is to build the HiGHS library and have the
library available. Following the documentation, the instructions are:

cd HiGHS # where HiGHS is installed
mkdir build
cd build
cmake .. # generate makefiles
make # build everything here in the build directory

As is often the case with native packages, some dependencies may be implicitly
assumed. Here is a Dockerfile building the project on an Alpine image; you should
be able to reproduce this with Docker installed.

FROM alpine:3.7

RUN apk add git cmake g++ gcc clang make
WORKDIR /opt
RUN git clone https://github.com/ERGO-Code/HiGHS.git
RUN mkdir -p HiGHS/build
WORKDIR /opt/HiGHS/build
RUN cmake .. && make
RUN make test
RUN make install # optional

Now back to the Julia side: assuming the library is available at a given
path, one can write the Julia functions corresponding to the interface. It is
preferable not to expose error-prone C calls to the user. For the
preprocess_variables function defined above, a Julia wrapper would look
like:

function preprocess_variables(my_array::Vector{Cint}, offset::Cdouble, coefficient::Cfloat)
    result = ccall(
        (:preprocess_variables, my_library_name),
        Cint, (Ptr{Cint}, Cdouble, Cfloat),
        my_array, offset, coefficient,
    )
    return result
end

Once these wrapper functions are defined, users can convert their values to the
corresponding expected argument types and call them. The last thing needed is my_library_name,
which must be the path to the library object. Hard-coding or assuming paths
should be avoided; it makes software harder to install on some systems.
One option is to ask the user to pass the library path through an
environment variable:

ENV["HIGHS_DIR"] # should contain the path to the HIGHS directory
joinpath(ENV["HIGHS_DIR"], "build", "lib", "libhighs.so")
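To fail with an explicit message rather than a KeyError when the variable is unset, the lookup can use a fallback (a sketch of the same idea):

```julia
# Look up HIGHS_DIR, raising a readable error when it is not set.
highs_dir = get(ENV, "HIGHS_DIR") do
    error("Please set the HIGHS_DIR environment variable to your HiGHS directory")
end
const libhighs_path = joinpath(highs_dir, "build", "lib", "libhighs.so")
```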

Doing this at every session is however not convenient. Since library paths do
not change at every call, one can look the path up once, at package
installation time. For this purpose, a deps/build.jl file can be added to any
package; it is run when the package is installed or when Pkg.build
is called. A build.jl for our purpose could look like:

const highs_location = ENV["HIGHS_DIR"]
const libhighs = joinpath(highs_location, "build", "lib", "libhighs.so")
const depsfile = joinpath(@__DIR__, "deps.jl")
open(depsfile, "w") do f
    # repr quotes the path so that deps.jl contains valid Julia code
    println(f, "const libhighs = $(repr(libhighs))")
end

The snippet above looks for the libhighs.so library, using the environment
variable as location of the base directory of HiGHS. Placed in build.jl,
the script will create a deps.jl file in the deps folder of the Julia
package, and write const libhighs = "/my/path/to/highs/lib/libhighs.so".
This is more or less what happens with the
SCIP.jl wrapper v0.9.
Once the build step succeeds, one can add in the main module in /src:

module HiGHS

const deps_file = joinpath(dirname(@__FILE__),"..","deps","deps.jl")
if isfile(deps_file)
    include(deps_file)
else
    error("HiGHS not properly installed. Please run import Pkg; Pkg.build(\"HiGHS\")")
end

# other things

end # module

The global constant libhighs can then be used for ccall.
We now have a functional package wrapping a native library downloaded and
built separately. Summing up what we have, the Julia wrapper package looks as
follows:

$ tree
.
├── Project.toml
├── README.md
├── deps
│   ├── build.jl
│   ├── build.log
│   └── deps.jl
├── src
│   └── HiGHS.jl
└── test
    └── runtests.jl

deps/build.log and deps/deps.jl are not committed to the repository but
generated when installing and/or building the Julia package.

Lifting maintainers’ burden: generating wrapper functions with Clang.jl

One time-consuming task in the previous steps is going from the C header file
describing the API to Julia functions wrapping the ccall. The task is mostly
repetitive and can be automated using Clang.jl.
This package will generate the appropriate functions from a header file,
a reduced example looks like:

import Clang

# HIGHS_DIR = "path/to/highs/dir"
const header_file = joinpath(HIGHS_DIR, "include", "interfaces", "highs_c_api.h")
const LIB_HEADERS = [header_file]

const ctx = Clang.DefaultContext()

Clang.parse_headers!(ctx, LIB_HEADERS,
    includes=[Clang.CLANG_INCLUDE],
)

ctx.libname = "libhighs"
ctx.options["is_function_strictly_typed"] = true
ctx.options["is_struct_mutable"] = false

const api_file = joinpath(@__DIR__, "../src/wrapper", "$(ctx.libname)_api.jl")

open(api_file, "w") do f
    # write each generated function
    # ...
end

This snippet can be placed in a /gen folder of the Julia wrapper package and
writes to src/wrapper all the functions wrapping C calls.
It is less error-prone than manually writing the Julia interface and can
save a great deal of time when managing updates of the native library.
Again, the SCIP.jl wrapper uses this method and can serve as an example.
Since the wrapper generation has different requirements than the package itself,
we can give it its own Project.toml.
Our package structure now looks like this:

$ tree
.
├── Project.toml
├── README.md
├── deps
│   ├── build.jl
│   ├── build.log
│   └── deps.jl
├── gen
│   ├── Project.toml
│   └── gen.jl
├── src
│   ├── HiGHS.jl
│   └── wrapper
│       ├── libhighs_api.jl
│       └── libhighs_common.jl
└── test
    └── runtests.jl

Lifting the user’s burden: BinaryBuilder & BinaryProvider

For non-open-source software, what we did up to this point is the best you can get:
let users download and install the library, pass the path once at build time, and
partly generate the Julia wrapper for ccall through Clang.jl.
For open-source libraries, however, could we go a step further and do everything
for the user when they install the Julia package?

That’s where BinaryBuilder
and BinaryProvider come in.
Remember the Dockerfile above: BinaryBuilder uses the same technology and some
arcane tricks to cross-compile the binary artifacts (executables and libraries).
It does so by letting you install the library as you would on your own machine,
using cmake, make, make install, etc. The result of running BinaryBuilder is a
single Julia script, build_tarballs.jl, describing the commands run to produce
the artifacts.
This script is placed in a repository with Continuous Integration support, which
creates releases for all specified architectures, operating systems, and compilers.
You can see examples for the Clp solver here
and for HiGHS there.

Back to the Julia package: we can now modify the deps/build.jl script to use
BinaryProvider, fetching the binaries corresponding to the current system.
Without knowing anything about what goes on under the hood or how the library
is built, users can simply run Pkg.add("ThePackage"), which builds
automatically and states explicitly when a given OS or architecture is not
supported. Take a look at the modified
build file using BinaryProvider.

Users no longer need to guarantee that they have the same compiler, make, and
cmake versions to get a repeatable and smooth installation of the package.
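For reference, a deps/build.jl using BinaryProvider follows roughly this shape. This is a sketch based on the legacy BinaryProvider API; the URL and hash below are placeholders, and the real file lists one tarball per supported platform:

```julia
using BinaryProvider

# installation prefix inside the package's deps folder
const prefix = Prefix(joinpath(@__DIR__, "usr"))

# the artifact we expect to find after installation
const products = [LibraryProduct(prefix, ["libhighs"], :libhighs)]

# platform => (tarball URL, sha256), one entry per supported system (placeholders)
const download_info = Dict(
    Linux(:x86_64) => ("https://example.com/HiGHS.x86_64-linux-gnu.tar.gz", "0000..."),
)

platform = platform_key_abi()
haskey(download_info, platform) || error("Platform $platform is not supported")
url, tarball_hash = download_info[platform]
install(url, tarball_hash; prefix=prefix, verbose=true)

# writes deps.jl defining the libhighs constant used by the package
write_deps_file(joinpath(@__DIR__, "deps.jl"), products)
```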

Wrapping up

The process from zero to a fully ready Julia package built on top of a binary
library is still not straightforward. Special appreciation goes to the
BinaryBuilder developers and contributors who helped me figure out some tricky
bits. But the key take-away is that once the pipeline is built, updating
the binary version or the Julia wrapper follows the same workflow one is used to
with standard Julia packages. Keep building pure Julia software for all its
benefits, but the tools presented here make working with binaries as pleasant as possible.

Edits

Some design work is in progress on the Pkg side to be able to reason about
artifacts; a post can be found
here.

Many thanks to staticfloat and
giordano for the feedback and additional information.

Lessons learned on object constructors

By: Julia on μβ

Re-posted from: https://matbesancon.xyz/post/2019-09-23-constructors/


Constructors are a basic building block of object-oriented programming (OOP).
They expose ways to build specific types of objects consistently,
using arbitrary rules to validate properties.
Still, constructors are odd beasts in the OOP world.
In Java, constructors are usually the first case of function overloading that
beginning programmers meet, often without knowing the term. An overloaded
constructor is shown in the following example:

class Car {
    private Motor motor;

    public Car(Motor m) {
        this.motor = m;
    }
    public Car() {
        this.motor = new Motor();
    }
}

Scala and Kotlin, both languages on the Java Virtual Machine designed
after and learning from Java, made the design choice of imposing a
primary constructor, which all other constructors have to call.
Constructors are weird beasts because they act partly as a function, partly as a
method. Moreover, they expose a special use of this as a method call instead
of a pointer to the current object:

class Car {
    private Motor motor;

    public Car(Motor m) {
        // 'this' as an object reference
        this.motor = m;
    }
    public Car(int power) {
        Motor m = new Motor(power);
        // this as a method
        this(m);
    }
}

In my experience this has been confusing and harder to teach, because
it forces the learner to grasp many specific tricks at the same time.
Another hard-to-grasp point is this(motor), which has never been defined as
such; the definition it corresponds to is Car(Motor m), and the mental load
required here is just unnecessary.
This is why I appreciate that Kotlin and Scala made constructors more
restrictive, removing the need for hand-wavy explanations of a bad design.
This great blog post
gives an overview of constructors in different mainstream languages and compares
them with the trait-based system of Rust.

Constructors outside class-based OOP

I will focus here on composite types,
or structs. There is a whole section of the Julia docs
on constructors, but I would summarize it as:

  1. There is a primary constructor which must provide values for all fields.
  2. All other constructors are just functions, no magic is involved, and constructors are just multiple methods in the context of multiple dispatch.
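A small self-contained sketch of both points (the Interval type is invented for illustration):

```julia
# The default constructor takes one value per field, in order.
struct Interval
    lo::Float64
    hi::Float64
end

# Any other constructor is an ordinary method added to the same function,
# dispatched like any Julia method.
Interval(x::Real) = Interval(float(x), float(x))

Interval(2)  # Interval(2.0, 2.0)
```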

This way of building objects as simple structures holding data in different
fields is not new; Kotlin and Scala have a similar pattern, as mentioned above.
Languages like Rust and Go take a different path: structures are
plain structures, initialized by providing all fields directly:

// rust example

struct Motor {
    pub power: u8,
}

struct Car {
    pub motor : Motor
}
// let m = Motor{power : 33};

// go example

type Motor struct {
	Power uint
}
// m := Motor{33}

Both languages have conventions for naming a standard constructing function,
namely fn new(args) -> T in Rust and func NewT(args) in Go, but these
functions are not special: they remain a simple convention without additional
language complexity.

Two lessons learned

Two interesting pull requests are about to be merged into
Distributions.jl,
which is the main package for working with probability distributions in Julia.
Both revolve around revisiting the role of constructors.
I will use them to make a point which I believe generalizes well to other systems.
No probability theory is needed here; it is merely a motivating example.

Lesson 1: product distributions and constructor promises

Given multiple random variables $ X_{i}, i = 1..n $, we define a
product distribution as the vector random variable built by stacking the
different $ X_i $:

$$ X = [ X_i | i \in 1..n ] $$

They arrived in Distributions.jl in this PR
if you are curious.
One thing to be careful about is that the term “product distribution” does not
correspond with the eponymous Wikipedia entry. What we refer to here is the
product type in the sense of tuple
construction and not the arithmetic product.
EDIT: the correct corresponding Wikipedia entry is the one on
Product measure,
thanks Chad for pointing it out.
One important property is that the entries of the product type are
independent distributions, which helps a great deal in deducing properties
of the product distribution.
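For instance, independence means the joint density factors into the per-entry densities:

$$ f_X(x_1, \dots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i) $$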

An example product type could be the product of two univariate Gaussian
distributions:

$$ X_1 \sim \mathcal{N}(0, 1)$$
$$ X_2 \sim \mathcal{N}(0, 2)$$
$$ X = [X_1, X_2]$$

The implementation of the Product type stores the vector of univariate
distributions; sampling and computing the PDF/CDF are done on a per-entry basis.
The corresponding code would look like this:

using Distributions: Normal, Product, pdf

Xs = [Normal(0, 1), Normal(0, 2)]
p = Product(Xs)

# sample from p
rand(p)

# compute PDF at (x1 = 0, x2 = 1)
pdf(p, [0.0, 1.0])

One problem here is that we know specialized, faster techniques
can be used in specific cases. Our product, for example, is nothing more
than a multivariate Gaussian distribution with independent components:
$$ X \sim \mathcal{N}([0, 0], diag([1, 2]))$$
with $diag(\cdot)$ constructing a diagonal matrix from a vector.

Sampling and computing quantities of interest for such a multivariate
distribution would be much faster using the multivariate type directly.
A new design could leverage multiple dispatch and would look as follows:

function Product(distributions::Vector{<:Normal})
	# construct a multivariate Gaussian
end

function Product(distributions::Vector{<:Uniform})
	# construct a multivariate uniform
end

function Product(distributions::Vector{<:UnivariateDistribution})
	# construct a generic Product
end

It is all fine and type-stable; if you don’t know what that means, just think
“sound from a type perspective”. One issue though is that we break the
promise of a constructor: a constructor of Product is supposed to return a
Product, and exactly that.
If you work in a language that uses algebraic data types such as
Maybe/Either/Result/Option for possible failure or absence, the constructor
should return the type itself and not one of these.

struct T
	# type fields
end

"""
T constructor
"""
T(args) = # ...

value = T(args)

# the following should always be true
typeof(value) <: T

In our case, a more efficient implementation cannot be returned from a
constructor. This means the construction of our type must be left to another
method, which can return it or something else.
For product distributions, this was done in this PR,
adding the function product_distribution to Distributions.jl, which can have
various methods returning a Product or something else.
With this design, a distribution can still define a special
product type, while the default Product works reasonably well.
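The pattern can be sketched without any dependency; all the type and function names below are invented for illustration:

```julia
abstract type Dist end

struct Gauss <: Dist
    μ::Float64
    σ::Float64
end

# specialized result: a diagonal multivariate Gaussian
struct MvGaussDiag <: Dist
    μ::Vector{Float64}
    σ::Vector{Float64}
end

# generic fallback wrapping the component distributions
struct GenericProduct <: Dist
    components::Vector{Dist}
end

# factory function: dispatch picks the most specific applicable method
make_product(ds::Vector{Gauss}) = MvGaussDiag([d.μ for d in ds], [d.σ for d in ds])
make_product(ds::Vector{<:Dist}) = GenericProduct(ds)

make_product([Gauss(0, 1), Gauss(0, 2)])  # returns an MvGaussDiag
```

Each constructor still returns its own type; only the factory function make_product dispatches between them.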

The lesson learned here is to be wary of exposing constructors when several
construction paths are possible; a dispatch-based function may then be
preferable. Constructors should always return the same type and are not a good
fit for a specialization system.

Lesson 2: main constructors should remain lean

Many constructors for probability distributions include a verification of the
parameters. When constructing a uniform distribution $\mathcal{U}(a, b)$, one
would want to verify that $a \leq b$. For a Gaussian distribution, one would
verify that the standard deviation is positive. These checks are fine, but have
a runtime cost and may interrupt the construction of the object.
There are many cases in which the parameters are guaranteed to be valid, two of
them being:

  1. Constructing an object by copy.
  2. Constructing an object with default parameters.

struct T
	# fields
end

function T()
	# default parameters are valid
end

function T(t::T)
	# t is already constructed, and is therefore valid
end

Throwing errors in a constructor is ill-advised because, again,
the promise of a constructor is to construct the object.
In languages where throwing is discouraged, the constructor would have to
return a Maybe{T} / Either{_, T}, which again breaks the promise.
The problem is that if checking is not the default, users are less likely to
call the checking function. The solution adopted here is to use a keyword
argument in all constructors:

struct D{T <: Real} <: Distribution
	param::T
end

function D(p::T; check_arg = true) where {T <: Real}
	if check_arg
		verify_parameter(p)
	end
	return D{T}(p)
end

The default is still to check the validity of parameters, but objects of type
D can now be constructed with opt-out checking. Another way to do it is
with multiple dispatch:

"""
A flag structure to avoid checking arguments.
"""
struct NoArgCheck end

struct D{T <: Real} <: Distribution
	param::T
end

# standard constructor, validates the parameter
function D(p::T) where {T <: Real}
	verify_parameter(p)
	return D{T}(p)
end

# faster constructor, no argument checking
function D(p::T, ::NoArgCheck) where {T}
	return D{T}(p)
end

In either case, users can now take responsibility for checking parameters
themselves. In recent Julia versions, the compiler’s constant propagation of
booleans makes the two approaches roughly equivalent.
One general rule to highlight for scientific programming work:
the constructor is a fixed cost imposed on all users, so treat additional
checks and operations carefully.

If you found this post useful (or not) or want to react in some way, feel free
to reach out on Twitter and/or
Reddit.

Edit: thanks Alec for spotting redundant code. Another blog post
on the subject was posted on Reddit (thanks Paul for pointing it out).
