A multi-language overview on how to document your research project code

By: julia | Victor Boussange

Re-posted from: https://vboussange.github.io/post/documenting-your-research-code/

Documentation serves multiple purposes and may be useful for various audiences, including your future self, collaborators, users and contributors – should you aim at packaging some of your code into a general-purpose library.

This post is part of a series of posts on best practices for managing research project code. Much of this material was developed in collaboration with Mauro Werder as part of the Course On Reproducible Research, Data Pipelines, and Scientific Computing (CORDS). If you have experiences to share or spot any errors, please reach out!

Content

Style guides

The best documentation starts by writing self-explanatory code with good conventions.

Correctly naming your variables enhances code clarity.

There are only two hard things in Computer Science: cache invalidation and naming things.
Martin Fowler

Instead of using generic names like l for a list:

for l in L:
    pass

Use descriptive names like

for line in lines:
    pass

Using style guides for your chosen language ensures consistency and readability in your code. Here are some resources:

Do not hesitate to refactor your code regularly and remove dead code to prevent confusion for yourself and others.

Comments

In-line comments should be used sparingly. Aim to write self-explanatory code instead. Use comments to provide context not apparent from the code itself, such as references to papers, Stack Overflow topics, or TODOs.

Use single-line comments for brief explanations and multi-line comments for more detailed information.

julia

#=
This is a multi-line
comment
=#

python

"""
This is a multi-line
comment
"""

Tip: use vscode rewrap comment/text to nicely format multiline comments.

On top of nicely formatting your code and appending comments where necessary, a literal documentation greatly facilitates the maintenance, understandability and reproducibility of your code.

Literal documentation

Literal documentation helps users understand your tool and get started with it.

README

A README file is essential for any research repository. It is displayed on under the code structure when accessing a GitHub repo.
It should contain:

  • (Badges showing tests, and a nice logo)
  • A one-sentence description of your project
  • A longer description
  • An overview of the repository structure and files
  • Getting started or Examples section
  • An Installation section with dependencies
  • Citation/Reference section
  • (A link to the documentation)
  • (A How to contribute section)
  • An Acknowledgement section
  • License section

Some examples:

API documentation / doc strings

API documentation describes the usage of functions, classes (types) and modules (packages). Parsers usually support markdown styles, which also enhances raw readability for humans. In short, markdown styles consists in using

Doc strings in python live inside the function

def best_function_ever(a_param, another_parameter):
    """
    this is the docstring
    """
    # do some stuff

But above the function or type definition in Julia

"""
this is the docstring
"""
function best_function_ever(a_param, another_parameter)
# do some stuff
end

"Tell whether there are too foo items in the array."
foo(xs::Array) = ...

Best practice for docstrings include

  • (in Julia: insert the signature of your function )
  • Short description
  • Arguments (Args, Input,…)
  • Returns
  • Examples

Several flavours may be used, even for a single language.

python
3 Different documentation style flavours

Google style is easier to read for humans

def add(a, b):
    """
    Add two integers.

    This function takes two integer arguments and returns their sum.

    # Parameters:
    a: The first integer to be added.
    b: The second integer to be added.

    # Return:
    int: The sum of the two integers.

    # Raise:
    TypeError: If either of the arguments is not an integer.

    Examples:
    >>> add(2, 3)
    5
    >>> add(-1, 1)
    0
    >>> add('a', 1)
    Traceback (most recent call last):
        ...
    TypeError: Both arguments must be integers.
    """
    if not isinstance(a, int) or not isinstance(b, int):
        raise TypeError("Both arguments must be integers")
    return a + b

julia

"""
    add(a, b)

Adds two integers.

This function takes two integer arguments and returns their sum.

# Arguments
- `a`: The first integer to be added.
- `b`: The second integer to be added.

# Returns
- The sum of the two integers.

# Examples

```julia-repl
julia> add(2, 3)
5

julia> add(-1, 1)
0
```
"""
function add(a, b)
    return a + b
end

You may use tools like Documenter.jl or Sphinx to automatically render your API documentation on a website. Github actions can automatize the process of building the documentation for you, similarly to how it can automate testing.

Docstrings may be accompanied by typing.

Type annotations

Typing refers to the specification of variable types and function return types within a programming language. It helps define what kind of data a function or variable can handle, ensuring type safety and reducing runtime errors. It

  • clearly indicates the expected input and output types, making the code easier to understand.
  • helps catch type-related errors early in the development process.
  • encourages consistent usage of types throughout the codebase.

python

def add(a: int, b: int) -> int:
    return a + b

In Python, using typing does not enforce type checking at runtime! You may use decorators to enforce it.

julia

function add(a::Int, b::Int)
    return a + b
end

In Julia, types are enforced at runtime! Type annotations help the Julia compiler optimize performance by making type inferences easier.

Consider raising errors

  • We do not like reading manuals. But we are foreced to read error messages. Use assertions and error messages to handle unexpected inputs and guide users.

python

  • assert: When an assert doesn’t pass, it raises an AssertionError. You can optionally add an error message at the end.
  • NotImplementedError, ValueError, NameError: Commonly used, generic errors you can raise. I probably overuse NotImplementedError compared to other types.
def convolve_vectors(vec1, vec2):
    if not isinstance(vec1, list) or not isinstance(vec2, list):
        raise ValueError("Both inputs must be lists.")
    # convolve the vectors

Tutorials

Create tutorial Jupyter notebooks or vignettes in R to demonstrate the usage of your code. Those can be placed in a folder examples or tutorials. Format them as e.g.

  • vignettes in R,
  • or using Jupyter notebooks, which are the perfect format for tutorials

Accessing documentation

julia

?cos
?@time
?r""

python

help(myfun)

But e.g. VSCode can be also quite helpful, and this works also with your own code!

Doc testing

Doc testing, or doctest, allows you to test your code by running examples embedded in the documentation (docstrings). It compares the output of the examples with the expected results given in the docstrings, ensuring the code works as documented.

Why doc testing?

  • Ensures that the code examples in your documentation are accurate and up-to-date.
  • Simple to write and understand, making it accessible for both writing and reading tests.
  • Promotes writing comprehensive docstrings which enhance code readability and maintainability.

Python

def add(a, b):
    """
    Adds two numbers.

    >>> add(2, 3)
    5
    >>> add(-1, 1)
    0
    """
    return a + b

To run the test:

python -m doctest your_module.py

or from within a script

if __name__ == "__main__":
    import doctest
    doctest.testmod()

julia
Available through Documenter.jl

"""
Adds two numbers.

```jldoctest
julia> add(2, 3)
5

julia> add(-1, 1)
0
```
"""

function add(a, b)
    return a + b
end

Useful packages to help you write and lint your documentation

  • Better Comments

  • Automatic doc string generation

  • Python test explorer

More resources

Take home messages

  • Good documentation helps maintain the long-term memory of a project.
  • Refactor code to reduce complexity instead of documenting tricky code.
  • Writing unit tests is often more productive than extensive documentation.
  • Types of documentation include literal, API, and tutorial/example documentation.
  • Literal documentation explains the big picture and setup.
  • API documentation lives in docstrings and explains function usage.
  • Examples connect the details to common tasks.
  • Consider using tools like ChatGPT to assist with documenting your functions.

A multi-language overview on how to handle dependencies within a research project

By: julia | Victor Boussange

Re-posted from: https://vboussange.github.io/post/research-project-dependencies/

Your future self and others should be able to recreate the minimal environment to run the scripts in your research project. This is best achieved using package managers and virtual environments.

This post is part of a series of posts on best practices for managing research project code. Much of this material was developed in collaboration with Mauro Werder as part of the Course On Reproducible Research, Data Pipelines, and Scientific Computing (CORDS). If you have experiences to share or spot any errors, please reach out!

Content

Some definitions

What is a dependency?

A dependency is an external package that a project requires to run.

What is a package manager?

A package manager like conda, Pkg or renv automates the process of installing, upgrading, configuring, and managing dependencies. It usually relies on a package repository, which is a central location that stores in one place the source code of packages or where to find it.

What is a virtual environment?

A virtual environment is an isolated environment where you can install and manage dependencies separately from the system-wide installation. This isolation ensures that different projects can have different dependencies and versions of packages without causing conflicts. Why use a virtual environment?

  • For yourself, to best deal with multiple projects and to prevent your code from breaking down overtime.
    • Without specifying a virtual environment, you install packages in your base environment, which is shared across all your projects.
    • Imagine you are working with Project A and Project B, which both depend on Package1 (currently @v1.1).
    • You leave aside Project A for a few months, and focus on Project B.
    • A new feature in Package1 motivate you to upgrade to v1.2, which modifies the API or the behavior of one function used in both projects.
    • You then want to come back to Project A, but now everything is broken! Because your code has been formatted to work with Package1@v1.1.
    • Hence, you want to make sure to compartmentalize environments.
  • To share your environment with others individuals and machines.
    • A virtual environement tracks the minimum dependencies, which can easily be shared and installed on other machines (e.g., a HPC).

Package managers

Multilanguage overview

Python R Julia
Package Manager pip, conda (see also mamba), poetry install.packages() (base R) Pkg
Package Repository PyPI (Python Package Index), conda-forge CRAN (Comprehensive R Archive Network) General registry
Distribution Format .whl (wheel, incl binaries) or tar.gz (source) .tar.gz (source and/or binary) Pkg will git clone from source, and download (binary) artifacts
Virtual Environment venv, virtualenv, conda env renv Built-in in the Pkg module
Dependency Management requirements.txt or Pipfile (pip), or environment.yml (conda env) or pyproject.toml (poetry) DESCRIPTION, NAMESPACE Project.toml, Manifest.toml, Artifacts.toml

This table is very much inspired by The Scientific Coder article on package managers.

Julia or R have built-in package managers which can be called within the REPL but Python package managers are called from outside the language.

conda

conda is a very appropriate package manager for scientific projects in Python. Over its older concurrent pip, it can handle python versions and all sorts non-python dependencies artifacts. With two lines of code, it allows someone to quickly install the virtual environment, without any pre-requiste python installation.

Here are some essential conda commands.

conda create --name myenv # creates new virtual environment
conda activate myenv # activate the environment
conda install numpy -c conda-forge # install a package
conda deactivate

Note that not using -c conda-forge will do just fine, but what is it? conda-forge is a community-driven channel (repository in the python jargon) that often has more up-to-date packages and a broader selection than the default Anaconda repository. You should use for several reasons, but mostly because conda-forge generally has the largest volume of packages and the most up-to-date versions

Note that some packages are only available through PyPi (pip). But you are covered for that: You can install pip packages within a conda env, by first activating the conda env and then normally using pip. pip should be part of your dependencies though. Always try to install packages using conda first.

We highly recommend using mamba as a drop-in replacement for conda, for much faster use.

Some useful resources

renv

Here are some basics on how to use renv, but see the renv vignette and documentation for more advanced usage.

# Initialize renv in your project
renv::init(project = "path/to/environment")

# Install a package and snapshot the environment
install.packages("dplyr")
renv::snapshot()
# Load the renv environment for the project
renv::activate()

# Restore the project's dependencies
renv::restore()

renv::update()
renv::history()
renv::revert()

Pkg

using Pkg

# Create a new project environment
Pkg.activate("path/to/MyProject")

# Add packages to the project environment
Pkg.add("DataFrames")

You can also use the Julia REPL by typing ]

(@v1.10) pkg> add DataFrames

or string macros pkg"add DataFrames"

Not that in Julia, the global shared environment is inherited in custom environment. This can be useful!
It is a good idea to install utility packages that you will use for development but that are not mandatory to run your code in the global environment. For instance, the macro @btime from BenchmarkTools is very handy to profile code. But you may not want to have BenchmarkTools in your dependencies. Just install it in base, and then you will be able to call
julia using BenchmarkTools
within your custom environment.
Other utility packages to consider having in your global environments are

  • Test,
  • TestEnv,
  • Revise
  • LocalRegistry

Environment files

Environment files specify the exact versions of the dependencies in your virtual environment, and are used by package managers to instantiate the environment. They are usually .txt, .toml or .yml files.

Always version control your environment files!

Julia

In Julia, the environment is defined using two files: the Project.toml and Manifest.toml. The Project.toml file lists the direct dependencies, while the Manifest.toml file captures the full dependency graph, including all transitive dependencies. The Manifest.toml file may not be tracked in a project, and will be reconstructed if missing. It specifies the exact version of the environment. For reproducibility, you want to include Manifest.toml in your git repo.
Artifacts.toml is used to handle non-Julia package dependencies.

Project.toml example

authors = ["Some One ",
           "Foo Bar "]
name = "MyEnv"
uuid = "7876af07-990d-54b4-ab0e-23690620f79a" # mandatory for packages
version = "1.2.5"

[deps] DataFrames = “7876af07-990d-54b4-ab0e-23690620f79a” Plots = “8dfed614-e22c-5e08-85e1-65c5234f0b40”

[compat] CUDA = “4.4, 5” julia = “1.10”

When you are located within the project root folder containing the .toml file, start julia with

$ julia --project=.

This will load the environment. If it is the first time that you use it, you need to instantiate it with

(Example) pkg> instantiate

Some useful resources

Python

conda env reads .yml, which can take any names. .yml files are not created automatically! Create environment.yml with

conda env export --name machine-learning-env --from-history --file environment.yml

This creates an environment.yml file

environment.yml example

name: machine-learning-env

channels:

  • pytorch
  • conda-forge

dependencies:

  • pytorch=1.1

Not using --from-history will result in listing all dependencies, those installed explicitly AND implicitly. This may be a bit messier.

To specify pip packages, just insert in the .toml

  - pip=19.1
  - pip:
    - kaggle==1.5
    - yellowbrick==0.9

Note the double ‘==’ instead of ‘=’ for the pip installation and that you should include pip itself as a dependency and then a subsection denoting those packages to be installed via pip. Also, note that --from-history won’t catch the pip dependencies. So the best way to proceed is to specify the dependencies by hand.

Installing from environment.yml

mamba env create --prefix ./.env --file environment.yml

Some additional resources

R

Environments with renv are specified in a renv.lock and DESCRIPTION files. It is a JSON file that has two main components: R and Packages. The R component specifies the R version used and the list of repositories where packages were installed. The Packages component includes a record for each package used in the project, with all necessary details for reinstalling that exact version. These details are derived from the installed package’s DESCRIPTION file and cover installations from any source, including CRAN, Bioconductor, GitHub, Gitlab, and Bitbucket. For more information on supported sources, refer to vignette(“package-sources”).

renv.lock

{
  "R": {
    "Version": "4.3.3",
    "Repositories": [
      {
        "Name": "CRAN",
        "URL": "https://cloud.r-project.org"
      }
    ]
  },
  "Packages": {
    "markdown": {
      "Package": "markdown",
      "Version": "1.0",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "4584a57f565dd7987d59dda3a02cfb41"
    },
    "mime": {
      "Package": "mime",
      "Version": "0.12.1",
      "Source": "GitHub",
      "RemoteType": "github",
      "RemoteHost": "api.github.com",
      "RemoteUsername": "yihui",
      "RemoteRepo": "mime",
      "RemoteRef": "main",
      "RemoteSha": "1763e0dcb72fb58d97bab97bb834fc71f1e012bc",
      "Requirements": [
        "tools"
      ],
      "Hash": "c2772b6269924dad6784aaa1d99dbb86"
    }
  }
}

Working with interactive environments

Jupyter notebooks can use Pkg, conda and renv environments, but you may need some extra steps (see how to make Jupyter aware of your Conda environments here and there. You do not need to follow these steps if you are using Visual Studio Code.

Other interactive notebooks solutions store directly the environemnts in the files, which is great for reproducibility purposes. This is the case of Pluto notebooks, which are designed to be reproducible. Under the hood they contain the package environment inside them Binder notebooks also ship with a virtual environment, but using Docker (see below and a tutorial here).

I personally do not like notebooks, and prefer using scripts in Visual Studio Code, executing them line by line for development with whether the Julia extension or the Jupyter extension with "jupyter.interactiveWindow.textEditor.executeSelection": true. With such an approach, you can specify which virtual environment should be used at login, and never worry again with that later.

Caveats of virtual environments

Some packages/libraries rely on system libraries and utilities; for instance pytorch relies on CUDA drivers, which are specific to a certain machine (see how you can deal with CUDA drivers with conda here), or the behavior of the packages my be dependent on system environmental variables. As such, by replicating a virtual environment, you won’t necessarily reproduce the same exact computing environment.
To reproduce more closely a computing environment, containers may be used. Containers virtualize layers of the operating system, replicating to a deeper lever your environment and making it more reproducible. Docker or Singularity are popular solutions. Unfortunately, building containers may be difficult, and the virtualization may add a layer of complexity to your pipeline…
But see Using singularity as a development environment and How to remote dev with vscode and singularity. Note that you could use both a container and a virtual environment… See here a tutorial with renv.

Some additional resources
For more information, check Reproducible Computational Environments Using Containers: Introduction to Docker.

Advanced topic: package development

It can make sense for research projects to distinguish between scripts placed in scripts/ and reused functions, models, etc., placed in src. We’ll cover that more broadly in another post. In such case, it is best to compartmentalize dependencies so as to have a minimal working environment for the src/ functions and classes, independent of that for your scripts. One practical approach for this is to specify the src folder as a package. This has a few advantages, including

  • not having to deal with relative position of files to call the functions in src/
  • maximizing your productivity by creating a generic package additionally to your main research project.

You can achieve this easily with development tools.

For Python, tools like setuptools and poetry facilitate package development. If you’re working in R, devtools is the go-to tool for developing packages. In Julia, the Pkg tool serves a similar purpose.

Package templates can be useful to simplify the creation of packages by generating package skeletons. In Python, checkout out cookiecutter. In R, check usethis. For Julia, use the Pkg.generate() built-in functionality, or the more advanced PkgTemplates.jl package.

Note that you may want at some point to locate your src/ (and associated tests, docs, etc…) in a separate git repo.

Some additional resources

Take-home messages

  • Make sure you understand what are package managers, virtual environments, and dependencies both within your project scripts and at the system level.
  • Clearly document all dependencies and environment setup instructions in project repositories.
  • Provide instructions in an Installation section in the readme.md on how to set up the virtual environment.
  • Check out these toy research repositories in Julia (which uses relative paths for importing the src functions), Python (which has src as package), and R (which has src as package) that implement what I believe good examples of research projects!

A multi-language overview on how to organise your research project code and documents

By: julia | Victor Boussange

Re-posted from: https://vboussange.github.io/post/best-practices-for-your-research-code/

I personally find that one of the biggest challenge when doing research is to keep things neat and organized. Having a good management system for your code and resources is key to optimizing time and brain resources. In this post, I discuss various methods for structuring a research project folder that includes code, data, publications, and more. Additionally, I discuss the specifics of organizing your research code. As I started my PhD, I wish I could have had some of such guidelines. But starting from scratch allowed me to build, with trials and errors, a good system for my later life. Hopefully, some of this can apply to you!

This post is part of a series of posts on best practices for managing research code. Much of this material was developed in collaboration with Mauro Werder as part of the Course On Reproducible Research, Data Pipelines, and Scientific Computing (CORDS). If you have experiences to share or spot any errors, please reach out!

Content

Project folder structures

I quite like this project folder structure, which keeps apart raw data and results from the code, but still place them relatively close, together with admin and publications. Having a separate git repo for the paper is something I would recommend as well (possibly linked to an Overleaf project).

|-- code/
|-- data/
|-- results
|-- publications
|    |-- talks
|    |-- posters
|    |-- papers
|-- admin
|-- meetings
|-- more-folders
 -- README.md

You may want to place results within code, together with data (which you should not git track)
The structure of code/ deserves here some attention.

code/ structure

Programming languages typically have their own conventions, but often the folders follow this scheme

  • a README.md file at the top level
  • a src/ folder, containing models and other generic function and classes, that will be used in script/ files,
  • example usages, e.g. examples/
  • scripts to run models, evaluation, etc., e.g. scripts/
  • documentation (often generated), e.g. docs/

It can make sense for research projects to distinguish between scripts placed in scripts/ and reused functions, models, etc., placed in src.

Python Folder structure
|-- src/            # package code
|-- scripts/        # Custom analysis or processing scripts
|-- tests/
|-- examples/       # Example scripts using the package
|-- docs/           # documentation
 -- environment.yml # to handle project dependencies
 -- README.md
R Folder structure
|-- R/               # R scripts and functions (package code)
|-- scripts/         # Custom analysis or processing scripts
|-- man/             # Documentation files
|-- tests/
|-- examples/        # Example scripts using the package
|-- vignettes/       # Long-form documentation
 -- DESCRIPTION      # Package description and metadata
 -- NAMESPACE        # Namespace file for package
 -- README.md        # Project overview and details
Julia Folder structure
|-- src/            # package code
|-- scripts/        # Custom analysis or processing scripts
|-- test/
|-- examples/       # Example scripts using the package
|-- docs/           # documentation
 -- Project.toml    # to handle project dependencies
 -- README.md

Turning your code/ into a “package”

You may want to specify the src folder as a package. This has a few advantages, including

  • not having to deal with relative position of files to call the functions in src/
  • maximizing your productivity by creating a generic package additionally to your main research project.

To import functions and classes (types) located in the src folder, you typically need to indicate in each script the relative path of src. In Julia, you would typically do something like include("../src/path/to/your/src_file.jl"). In Python, you would do something like:

import sys
sys.path.append("../src/")

from src.path.to.your.src_file import my_fun

If src/ directory grows, it’s beneficial to convert it into a separate package. Although this process is a bit more complex, it eliminates the need for path specifications, simplifies the import of functions and classes, and makes the codebase easily accessible for other research projects.

There are typically ways to turn a code-project into an installable package. This is in particular useful for code which other people (or yourself) use for different projects.

You can achieve this easily with development tools.

For Python, tools like setuptools and poetry facilitate package development. If you’re working in R, devtools is the go-to tool for developing packages. In Julia, the Pkg tool serves a similar purpose.

Package templates can be useful to simplify the creation of packages by generating package skeletons. In Python, checkout out cookiecutter. In R, check usethis. For Julia, use the Pkg.generate() built-in functionality, or the more advanced PkgTemplates.jl package.

Note that you may want at some point to locate your src/ (and associated tests, docs, etc…) in a separate git repo.

Further reading for

Wrapping up

Explore these exemplary toy research repositories in different programming languages:

  • Julia, using relative paths for importing src functions.
  • Python, implementing src as a package.
  • R, also implementing src as a package.

These repositories showcase what I consider to be best practices in research project organization.

Take-home messages

  • There is not one way to structure your research project folders, but general guidelines. Create the one that makes most sense for you!
  • A chosen structure should be suitable to both work during the development of your project, and to submit (parts) of it to a repository in a future stage.
  • Consider turning your src/ into a folder. This can increase your academic productivity, as you could eventually be the developer of a cool package that people re-use, with minimum efforts!