
A multi-language overview on how to test your research project code

By: Victor Boussange

Re-posted from: https://vboussange.github.io/post/testing-your-research-code/

Code testing is essential for identifying and fixing issues early, for staying sane over the course of a project's development, and for ensuring the reliability of your experiments over time.

This post is part of a series of posts on best practices for managing research project code. Much of this material was developed in collaboration with Mauro Werder as part of the Course On Reproducible Research, Data Pipelines, and Scientific Computing (CORDS). If you have experiences to share or spot any errors, please reach out!

Content

Unit testing

Unit testing involves testing a unit of code, typically a single function, to ensure its correctness. Here are some key aspects to consider:

  • Test for correctness with typical inputs.
  • Test edge cases.
  • Test for errors with bad inputs.

Some developers write unit tests before writing the actual function, a practice known as Test-Driven Development (TDD). First define the expected behavior of the function (e.g., on a piece of paper), then write the corresponding tests; when all tests pass, you are done. This philosophy ensures that you end up with a well-tested implementation, and avoids unnecessary feature development by forcing you to focus only on what is needed. While TDD is a powerful idea, it can be challenging to follow strictly.

A good habit is to write an additional test whenever you find a bug in your code, so that the bug cannot silently reappear.

Lightweight formal tests with assert

The simplest form of unit testing involves some sort of assert statement.

Python

def fib(x):
    if x == 0:
        return 0
    elif x <= 2:
        return 1
    else:
        return fib(x - 1) + fib(x - 2)

assert fib(0) == 0
assert fib(1) == 1
assert fib(2) == 1

Julia

function fib(x)
    x == 0 && return 0
    x <= 2 && return 1
    return fib(x - 1) + fib(x - 2)
end

@assert fib(0) == 0
@assert fib(1) == 1
@assert fib(2) == 1

When an assertion fails, you get an error at that point and execution stops; you need to fix it before the following assertions are even run.

In Julia or Python, you could directly place the assert statements after your functions. This way, the tests run each time you execute the script. Here is another Pythonic approach, which decouples the tests from imports of the file as a module:

def fib(x):
    if x == 0:
        return 0
    elif x <= 2:
        return 1
    else:
        return fib(x - 1) + fib(x - 2)

if __name__ == '__main__':
    assert fib(0) == 0
    assert fib(1) == 1
    assert fib(2) == 1
    assert fib(6) == 8
    assert fib(40) == 102334155
    print("Tests passed")

Consider using np.isclose or np.testing.assert_allclose (Python) or isapprox / ≈ (Julia) for floating point comparisons.
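For instance, a minimal sketch with NumPy (floating point arithmetic is inexact, so exact comparisons are brittle):

import numpy as np

# 0.1 + 0.2 is not exactly 0.3 in floating point arithmetic:
assert 0.1 + 0.2 != 0.3

# Compare with a tolerance instead:
assert np.isclose(0.1 + 0.2, 0.3)
np.testing.assert_allclose(0.1 + 0.2, 0.3)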

Testing with a test suite

Once you have many tests, it makes sense to group them into a test suite and run them with a test runner. This approach runs all tests, even if some fail, and reports an informative summary of which tests passed and which did not. As you'll see, it also allows you to run the tests automatically at each commit, with continuous integration.

Python

Two main frameworks for unit tests in Python are pytest and unittest, with pytest being more popular.

Example using pytest (this assumes that src/fib.py validates its input and raises NotImplementedError for negative or non-integer arguments):

from src.fib import fib
import pytest

def test_typical():
    assert fib(1) == 1
    assert fib(2) == 1
    assert fib(6) == 8
    assert fib(40) == 102334155

def test_edge_case():
    assert fib(0) == 0

def test_raises():
    with pytest.raises(NotImplementedError):
        fib(-1)

    with pytest.raises(NotImplementedError):
        fib(1.5)

Run the tests with:

pytest test_fib.py
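pytest also lets you parametrize tests, which keeps inputs and expected outputs in one place; a sketch using the same fib:

import pytest
from src.fib import fib

@pytest.mark.parametrize("x, expected", [(0, 0), (1, 1), (2, 1), (6, 8), (40, 102334155)])
def test_fib(x, expected):
    assert fib(x) == expected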

Julia

Julia ships with the built-in Test module, which provides the @test macro. Consider grouping your tests with @testset:

julia> @testset "trigonometric identities" begin
           θ = 2/3*π
           @test sin(-θ) ≈ -sin(θ)
           @test cos(-θ) ≈ cos(θ)
           @test sin(2θ) ≈ 2*sin(θ)*cos(θ)
           @test cos(2θ) ≈ cos(θ)^2 - sin(θ)^2
       end;

This will nicely output

Test Summary:            | Pass  Total  Time
trigonometric identities |    4      4  0.2s

which comes in handy for grouping tests applied to a single function or concept. Your tests may require packages that are not part of the minimal working environment specified at your package root folder; a dedicated virtual environment can be specified for the tests. To develop my tests interactively, I like using TestEnv: simply calling Pkg.activate on the test environment would not give you access to your package's functions, whereas TestEnv handles this.

In your package environment,

using TestEnv
TestEnv.activate()

will activate the test environment.

To reactivate the normal environment,

Pkg.activate(".")

Here is a nice thread to read more on that.

R

In R, the standard unit-testing framework is testthat.
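A minimal sketch (assuming a fib function analogous to the Python one above):

library(testthat)

test_that("fib handles typical inputs", {
  expect_equal(fib(1), 1)
  expect_equal(fib(2), 1)
  expect_equal(fib(6), 8)
})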

Testing non-pure functions and classes

For nondeterministic functions, provide the random seed or the random number generator needed by the function as an argument to make it deterministic (see the sketch after the example below).
For stateful functions, test postconditions to ensure the internal state changes as expected.
For functions with I/O side effects, create mock files to verify proper input reading and expected output.

Python

def file_to_upper(in_file, out_file):
    with open(in_file, 'r') as fin, open(out_file, 'w') as fout:
        for line in fin:
            fout.write(line.upper())

import tempfile
import os

def test_file_to_upper():
    in_file = tempfile.NamedTemporaryFile(delete=False, mode='w')
    out_file = tempfile.NamedTemporaryFile(delete=False)
    out_file.close()
    in_file.write("test123\nthetest")
    in_file.close()
    file_to_upper(in_file.name, out_file.name)
    with open(out_file.name, 'r') as f:
        data = f.read()
        assert data == "TEST123\nTHETEST"
    os.unlink(in_file.name)
    os.unlink(out_file.name)
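For the nondeterministic case, a minimal sketch: pass the random number generator explicitly so the test can fix the seed (the jitter function is hypothetical):

import numpy as np

def jitter(x, rng):
    # Hypothetical helper: adds Gaussian noise drawn from the provided generator.
    return x + rng.normal(scale=0.1, size=len(x))

def test_jitter_deterministic_given_seed():
    x = np.zeros(3)
    out1 = jitter(x, np.random.default_rng(42))
    out2 = jitter(x, np.random.default_rng(42))
    assert np.allclose(out1, out2)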

Continuous integration

Automated testing on local machines is useful, but you can do better with continuous integration (CI). In fact, CI is essential for projects involving multiple developers and various target platforms. CI consists of running the tests automatically whenever changes are committed.
CI can also be used to automatically build documentation, check code coverage, and more. GitHub Actions is a popular CI tool available within GitHub.
CI is configured with .yaml files, which specify the environment in which the tests run. You can build matrices to test across different environments (e.g. Linux, Windows and macOS, with different versions of Python or Julia); a job running your tests is created for each permutation.

An example CI.yaml file for Julia

name: Run tests

on:
  push:
    branches:
      - master
      - main
  pull_request:

permissions:
  actions: write
  contents: read

jobs:
  test:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        julia-version: ['1.6', '1', 'nightly']
        julia-arch: [x64, x86]
        os: [ubuntu-latest, windows-latest, macOS-latest]
        exclude:
          - os: macOS-latest
            julia-arch: x86
    steps:
      - uses: actions/checkout@v4
      - uses: julia-actions/setup-julia@v1
        with:
          version: ${{ matrix.julia-version }}
          arch: ${{ matrix.julia-arch }}
      - uses: julia-actions/cache@v1
      - uses: julia-actions/julia-buildpkg@v1
      - uses: julia-actions/julia-runtest@v1

An example CI.yaml file for Python

This action installs the conda environment called glacier-mass-balance, specified in the environment.yml file.
It then runs pytest, assuming your test functions live in a tests/ folder. First check that pytest runs locally, and do not forget to list pytest among your dependencies.


name: Run tests
on: push

jobs:
  miniconda:
    name: Miniconda ${{ matrix.os }}
    runs-on: ${{ matrix.os }}
    strategy:
        matrix:
            os: ["ubuntu-latest"]
    steps:
      - uses: actions/checkout@v2
      - uses: conda-incubator/setup-miniconda@v2
        with:
          environment-file: environment.yml
          activate-environment: glacier-mass-balance
          auto-activate-base: false
      - name: Run pytest
        shell: bash -l {0}
        run: | 
          pytest


Cool tip

You can include a badge to show visually whether your tests are passing or failing, like so:

[Tests badge]

You can get the code for this badge by going to your GitHub repo, then Actions. Click on the test action, then on the top right click on the ... menu and select Create status badge.
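The generated markdown looks something like this (username, repository and workflow file name are placeholders):

![Tests](https://github.com/<user>/<repo>/actions/workflows/CI.yaml/badge.svg)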

Cool right?

Other types of tests

  • Docstring tests: Unit tests embedded in docstrings (see the doctest sketch after this list).
  • Integration tests: Test whether multiple functions work correctly together.
  • Regression tests: Ensure your code produces the same outputs as previous versions.
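In Python, for instance, the built-in doctest module runs the examples embedded in docstrings:

def fib(x):
    """Return the x-th Fibonacci number.

    >>> fib(0)
    0
    >>> fib(6)
    8
    """
    if x == 0:
        return 0
    elif x <= 2:
        return 1
    return fib(x - 1) + fib(x - 2)

if __name__ == "__main__":
    import doctest
    doctest.testmod()  # runs all docstring examples in the module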

Resources

Take-home messages

  • Systematically implementing tests allows you to ensure the sanity of your code
  • The overhead cost of testing is usually well balanced by the reduced time spent downstream chasing bugs

A multi-language overview on how to organise your research project code and documents

By: Victor Boussange

Re-posted from: https://vboussange.github.io/post/best-practices-for-your-research-code/

I personally find that one of the biggest challenges in research is keeping things neat and organized. A good management system for your code and resources is key to optimizing time and brain resources. In this post, I discuss various methods for structuring a research project folder that includes code, data, publications, and more. Additionally, I discuss the specifics of organizing your research code. When I started my PhD, I wish I had had such guidelines. But starting from scratch allowed me to build, by trial and error, a good system for later. Hopefully, some of this can apply to you!

This post is part of a series of posts on best practices for managing research code. Much of this material was developed in collaboration with Mauro Werder as part of the Course On Reproducible Research, Data Pipelines, and Scientific Computing (CORDS). If you have experiences to share or spot any errors, please reach out!

Content

Project folder structures

I quite like the following project folder structure, which keeps raw data and results apart from the code but still places them relatively close, together with admin documents and publications. Having a separate git repo for the paper is something I would recommend as well (possibly linked to an Overleaf project).

|-- code/
|-- data/
|-- results/
|-- publications/
|    |-- talks/
|    |-- posters/
|    `-- papers/
|-- admin/
|-- meetings/
|-- more-folders/
`-- README.md

You may want to place results/ within code/, together with data/ (which you should not track with git).
The structure of code/ deserves some attention here.

code/ structure

Programming languages typically have their own conventions, but often the folders follow this scheme

  • a README.md file at the top level
  • a src/ folder, containing models and other generic functions and classes that will be used in the scripts/ files,
  • example usages, e.g. examples/
  • scripts to run models, evaluation, etc., e.g. scripts/
  • documentation (often generated), e.g. docs/

It can make sense for research projects to distinguish between scripts placed in scripts/ and reused functions, models, etc., placed in src.

Python folder structure:

|-- src/            # package code
|-- scripts/        # Custom analysis or processing scripts
|-- tests/
|-- examples/       # Example scripts using the package
|-- docs/           # documentation
`-- environment.yml # to handle project dependencies
`-- README.md

R folder structure:

|-- R/               # R scripts and functions (package code)
|-- scripts/         # Custom analysis or processing scripts
|-- man/             # Documentation files
|-- tests/
|-- examples/        # Example scripts using the package
|-- vignettes/       # Long-form documentation
`-- DESCRIPTION      # Package description and metadata
`-- NAMESPACE        # Namespace file for package
`-- README.md        # Project overview and details

Julia folder structure:

|-- src/            # package code
|-- scripts/        # Custom analysis or processing scripts
|-- test/
|-- examples/       # Example scripts using the package
|-- docs/           # documentation
`-- Project.toml    # to handle project dependencies
`-- README.md

Turning your code/ into a “package”

You may want to specify the src folder as a package. This has a few advantages, including

  • not having to deal with relative position of files to call the functions in src/
  • maximizing your productivity by creating a generic package additionally to your main research project.

To import functions and classes (types) located in the src/ folder, you typically need to indicate in each script the relative path to src/. In Julia, you would typically do something like include("../src/path/to/your/src_file.jl"). In Python, you would do something like:

import sys
sys.path.append("..")  # make the project root importable

from src.path.to.your.src_file import my_fun

If the src/ directory grows, it's beneficial to convert it into a separate package. Although this process is a bit more complex, it eliminates the need for path manipulations, simplifies the import of functions and classes, and makes the codebase easily accessible for other research projects.

There are typically ways to turn a code project into an installable package. This is particularly useful for code which other people (or you yourself) use for different projects.

You can achieve this easily with development tools.

For Python, tools like setuptools and poetry facilitate package development. If you’re working in R, devtools is the go-to tool for developing packages. In Julia, the Pkg tool serves a similar purpose.

Package templates can simplify the creation of packages by generating package skeletons. In Python, check out cookiecutter. In R, check usethis. For Julia, use the Pkg.generate() built-in functionality, or the more advanced PkgTemplates.jl package.
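As an illustration, here is a minimal sketch of a pyproject.toml turning a src/ layout into an installable package (the package name and dependency are placeholders, assuming setuptools as the build backend):

[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "my_research_package"  # hypothetical name
version = "0.1.0"
dependencies = ["numpy"]

With this file at the root of code/, pip install -e . installs the package in editable mode, so its modules can be imported from anywhere without path manipulations.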

Note that you may want at some point to locate your src/ (and associated tests, docs, etc…) in a separate git repo.

Further reading

Wrapping up

Explore these exemplary toy research repositories in different programming languages:

  • Julia, using relative paths for importing src functions.
  • Python, implementing src as a package.
  • R, also implementing src as a package.

These repositories showcase what I consider to be best practices in research project organization.

Take-home messages

  • There is no single way to structure your research project folders, only general guidelines. Create the one that makes most sense for you!
  • The chosen structure should be suitable both for work during the development of your project, and for submitting (parts of) it to a repository at a later stage.
  • Consider turning your src/ into a package. This can increase your academic productivity, as you could eventually become the developer of a cool package that people re-use, with minimal effort!

A multi-language overview on how to handle dependencies within a research project

By: Victor Boussange

Re-posted from: https://vboussange.github.io/post/research-project-dependencies/

Your future self and others should be able to recreate the minimal environment to run the scripts in your research project. This is best achieved using package managers and virtual environments.

This post is part of a series of posts on best practices for managing research project code. Much of this material was developed in collaboration with Mauro Werder as part of the Course On Reproducible Research, Data Pipelines, and Scientific Computing (CORDS). If you have experiences to share or spot any errors, please reach out!

Content

Some definitions

What is a dependency?

A dependency is an external package that a project requires to run.

What is a package manager?

A package manager like conda, Pkg or renv automates the process of installing, upgrading, configuring, and managing dependencies. It usually relies on a package repository, a central location that stores the source code of packages (or pointers to it) in one place.

What is a virtual environment?

A virtual environment is an isolated environment where you can install and manage dependencies separately from the system-wide installation. This isolation ensures that different projects can have different dependencies and versions of packages without causing conflicts.

Why use a virtual environment?

  • For yourself, to manage multiple projects and prevent your code from breaking over time.
    • Without a virtual environment, you install packages in your base environment, which is shared across all your projects.
    • Imagine you are working on Project A and Project B, which both depend on Package1 (currently @v1.1).
    • You leave Project A aside for a few months, and focus on Project B.
    • A new feature in Package1 motivates you to upgrade to v1.2, which modifies the API or the behavior of a function used in both projects.
    • You then come back to Project A, but now everything is broken, because your code was written to work with Package1@v1.1.
    • Hence, you want to make sure to compartmentalize environments.
  • To share your environment with other individuals and machines.
    • A virtual environment tracks the minimal dependencies, which can easily be shared and installed on other machines (e.g., an HPC cluster); see the sketch below.
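Concretely, with per-project environment files, each project pins its own version of the dependency (package and project names are hypothetical):

# projectA/environment.yml
name: projectA
dependencies:
  - package1=1.1

# projectB/environment.yml
name: projectB
dependencies:
  - package1=1.2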

Package managers

Multilanguage overview

| | Python | R | Julia |
| --- | --- | --- | --- |
| Package manager | pip, conda (see also mamba), poetry | install.packages() (base R) | Pkg |
| Package repository | PyPI (Python Package Index), conda-forge | CRAN (Comprehensive R Archive Network) | General registry |
| Distribution format | .whl (wheel, incl. binaries) or .tar.gz (source) | .tar.gz (source and/or binary) | Pkg will git clone the source, and download (binary) artifacts |
| Virtual environment | venv, virtualenv, conda env | renv | built into the Pkg module |
| Dependency management | requirements.txt or Pipfile (pip), environment.yml (conda env), or pyproject.toml (poetry) | DESCRIPTION, NAMESPACE | Project.toml, Manifest.toml, Artifacts.toml |

This table is very much inspired by The Scientific Coder article on package managers.

Julia and R have built-in package managers that can be called from within the REPL, whereas Python package managers are called from outside the language.

conda

conda is a very appropriate package manager for scientific projects in Python. Compared to its older competitor pip, it can handle Python versions themselves as well as all sorts of non-Python dependencies. With two lines of code, it allows someone to quickly install the virtual environment, without any prerequisite Python installation.

Here are some essential conda commands.

conda create --name myenv # creates new virtual environment
conda activate myenv # activate the environment
conda install numpy -c conda-forge # install a package
conda deactivate

Note that not using -c conda-forge will do just fine, but what is it? conda-forge is a community-driven channel (a repository, in Python jargon) that often has more up-to-date packages and a broader selection than the default Anaconda channel. You should use it for several reasons, but mostly because conda-forge generally has the largest volume of packages and the most up-to-date versions.

Note that some packages are only available through PyPI (pip). But you are covered: you can install pip packages within a conda environment by first activating it and then using pip as usual (pip itself should then be part of your dependencies); see the sketch below. Always try to install packages with conda first.
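A minimal sketch (the package name is hypothetical):

conda activate myenv
conda install pip -c conda-forge   # give the environment its own pip
pip install some-pypi-only-package # hypothetical PyPI-only package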

We highly recommend using mamba as a drop-in replacement for conda, for much faster use.

Some useful resources

renv

Here are some basics on how to use renv, but see the renv vignette and documentation for more advanced usage.

# Initialize renv in your project
renv::init(project = "path/to/environment")

# Install a package and snapshot the environment
install.packages("dplyr")
renv::snapshot()
# Load the renv environment for the project
renv::activate()

# Restore the project's dependencies
renv::restore()

renv::update()
renv::history()
renv::revert()

Pkg

using Pkg

# Create a new project environment
Pkg.activate("path/to/MyProject")

# Add packages to the project environment
Pkg.add("DataFrames")

You can also use the Pkg REPL mode by typing ]

(@v1.10) pkg> add DataFrames

or the string macro pkg"add DataFrames".

Note that in Julia, custom environments inherit from the global shared environment. This can be useful!
It is a good idea to install utility packages that you use for development, but that are not mandatory to run your code, in the global environment. For instance, the macro @btime from BenchmarkTools is very handy for profiling code, but you may not want BenchmarkTools among your project dependencies. Just install it in the global environment, and you will still be able to call

using BenchmarkTools
within your custom environment.
Other utility packages to consider having in your global environment are

  • Test
  • TestEnv
  • Revise
  • LocalRegistry

Environment files

Environment files specify the exact versions of the dependencies in your virtual environment, and are used by package managers to instantiate the environment. They are usually .txt, .toml or .yml files.

Always version control your environment files!

Julia

In Julia, the environment is defined by two files: Project.toml and Manifest.toml. The Project.toml file lists the direct dependencies, while the Manifest.toml file captures the full dependency graph, including all transitive dependencies, and thus specifies the exact state of the environment. Manifest.toml may be missing from a project, in which case it is reconstructed; for reproducibility, you want to include it in your git repo.
Artifacts.toml is used to handle non-Julia dependencies.

Project.toml example

authors = ["Some One <someone@email.com>",
           "Foo Bar <foo@bar.com>"]
name = "MyEnv"
uuid = "7876af07-990d-54b4-ab0e-23690620f79a" # mandatory for packages
version = "1.2.5"

[deps]
DataFrames = "7876af07-990d-54b4-ab0e-23690620f79a"
Plots = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[compat]
CUDA = "4.4, 5"
julia = "1.10"

When you are located within the project root folder containing the .toml file, start julia with

$ julia --project=.

This will load the environment. If it is the first time that you use it, you need to instantiate it with

(Example) pkg> instantiate

Some useful resources

Python

conda env reads .yml files, which can take any name. These files are not created automatically! Create environment.yml with

conda env export --name machine-learning-env --from-history --file environment.yml

This creates an environment.yml file

environment.yml example

name: machine-learning-env

channels:
  - pytorch
  - conda-forge

dependencies:
  - pytorch=1.1

Not using --from-history will list all dependencies, both those installed explicitly AND those installed implicitly, which may be a bit messier.

To specify pip packages, just insert in the .yml file:

  - pip=19.1
  - pip:
    - kaggle==1.5
    - yellowbrick==0.9

Note the double == instead of = for the pip-installed packages, and that you should include pip itself as a dependency, followed by a subsection listing the packages to be installed via pip. Also note that --from-history won't catch the pip dependencies, so the best way to proceed is to specify the dependencies by hand.

Installing from environment.yml

mamba env create --prefix ./.env --file environment.yml

Some additional resources

R

Environments with renv are specified in renv.lock and DESCRIPTION files. renv.lock is a JSON file that has two main components: R and Packages. The R component specifies the R version used and the list of repositories where packages were installed. The Packages component includes a record for each package used in the project, with all necessary details for reinstalling that exact version. These details are derived from the installed package's DESCRIPTION file and cover installations from any source, including CRAN, Bioconductor, GitHub, GitLab, and Bitbucket. For more information on supported sources, refer to vignette("package-sources").

renv.lock

{
  "R": {
    "Version": "4.3.3",
    "Repositories": [
      {
        "Name": "CRAN",
        "URL": "https://cloud.r-project.org"
      }
    ]
  },
  "Packages": {
    "markdown": {
      "Package": "markdown",
      "Version": "1.0",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "4584a57f565dd7987d59dda3a02cfb41"
    },
    "mime": {
      "Package": "mime",
      "Version": "0.12.1",
      "Source": "GitHub",
      "RemoteType": "github",
      "RemoteHost": "api.github.com",
      "RemoteUsername": "yihui",
      "RemoteRepo": "mime",
      "RemoteRef": "main",
      "RemoteSha": "1763e0dcb72fb58d97bab97bb834fc71f1e012bc",
      "Requirements": [
        "tools"
      ],
      "Hash": "c2772b6269924dad6784aaa1d99dbb86"
    }
  }
}

Working with interactive environments

Jupyter notebooks can use Pkg, conda and renv environments, but you may need some extra steps (see how to make Jupyter aware of your conda environments here and there). You do not need these steps if you are using Visual Studio Code.

Other interactive notebook solutions store the environment directly in the notebook file, which is great for reproducibility. This is the case of Pluto notebooks, which are designed to be reproducible: under the hood, they contain the package environment inside them. Binder notebooks also ship with a virtual environment, but based on Docker (see below and a tutorial here).

I personally do not like notebooks, and prefer using scripts in Visual Studio Code, executing them line by line for development with either the Julia extension or the Jupyter extension with "jupyter.interactiveWindow.textEditor.executeSelection": true. With such an approach, you can specify once which virtual environment should be used, and never worry about it again.

Caveats of virtual environments

Some packages/libraries rely on system libraries and utilities; for instance, pytorch relies on CUDA drivers, which are specific to a certain machine (see how you can deal with CUDA drivers with conda here), and the behavior of packages may depend on system environment variables. As such, by replicating a virtual environment, you won't necessarily reproduce the exact same computing environment.
To reproduce a computing environment more closely, containers may be used. Containers virtualize layers of the operating system, replicating your environment at a deeper level and making it more reproducible. Docker and Singularity are popular solutions. Unfortunately, building containers can be difficult, and the virtualization may add a layer of complexity to your pipeline…
But see Using singularity as a development environment and How to remote dev with vscode and singularity. Note that you could use both a container and a virtual environment; see here a tutorial with renv.
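For illustration, a minimal sketch of a Dockerfile shipping the conda environment from above (the base image tag and environment name are assumptions):

FROM condaforge/mambaforge:latest

# Copy the environment specification and create the environment.
COPY environment.yml .
RUN mamba env create --file environment.yml

# Run the test suite inside the environment by default.
CMD ["conda", "run", "-n", "glacier-mass-balance", "pytest"]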

Some additional resources
For more information, check Reproducible Computational Environments Using Containers: Introduction to Docker.

Advanced topic: package development

It can make sense for research projects to distinguish between scripts placed in scripts/ and reused functions, models, etc., placed in src/. We cover that more broadly in another post. In such a case, it is best to compartmentalize dependencies so as to have a minimal working environment for the src/ functions and classes, independent of the one for your scripts. One practical approach is to specify the src folder as a package. This has a few advantages, including

  • not having to deal with relative position of files to call the functions in src/
  • maximizing your productivity by creating a generic package additionally to your main research project.

You can achieve this easily with development tools.

For Python, tools like setuptools and poetry facilitate package development. If you’re working in R, devtools is the go-to tool for developing packages. In Julia, the Pkg tool serves a similar purpose.

Package templates can simplify the creation of packages by generating package skeletons. In Python, check out cookiecutter. In R, check usethis. For Julia, use the Pkg.generate() built-in functionality, or the more advanced PkgTemplates.jl package.

Note that you may want at some point to locate your src/ (and associated tests, docs, etc…) in a separate git repo.

Some additional resources

Take-home messages

  • Make sure you understand what package managers, virtual environments, and dependencies are, both within your project scripts and at the system level.
  • Clearly document all dependencies and environment setup instructions in project repositories.
  • Provide instructions in an Installation section of the README.md on how to set up the virtual environment.
  • Check out these toy research repositories in Julia (which uses relative paths for importing the src functions), Python (which has src as a package), and R (which has src as a package), which implement what I believe are good examples of research projects!