Re-posted from: https://vboussange.github.io/post/best-practices-for-your-research-code/
I personally find that one of the biggest challenge when doing research is to keep things neat and organized. Having a good management system for your code and resources is key to optimizing time and brain resources. In this post, I discuss various methods for structuring a research project folder that includes code, data, publications, and more. Additionally, I discuss the specifics of organizing your research code. As I started my PhD, I wish I could have had some of such guidelines. But starting from scratch allowed me to build, with trials and errors, a good system for my later life. Hopefully, some of this can apply to you!
This post is part of a series of posts on best practices for managing research code. Much of this material was developed in collaboration with Mauro Werder as part of the Course On Reproducible Research, Data Pipelines, and Scientific Computing (CORDS). If you have experiences to share or spot any errors, please reach out!
Content
Project folder structures
I quite like this project folder structure, which keeps apart raw data and results from the code, but still place them relatively close, together with admin and publications. Having a separate git repo for the paper is something I would recommend as well (possibly linked to an Overleaf project).
|-- code/
|-- data/
|-- results
|-- publications
| |-- talks
| |-- posters
| |-- papers
|-- admin
|-- meetings
|-- more-folders
-- README.md
You may want to place results
within code
, together with data
(which you should not git track)
The structure of code/
deserves here some attention.
code/
structure
Programming languages typically have their own conventions, but often the folders follow this scheme
- a
README.md
file at the top level - a
src/
folder, containing models and other generic function and classes, that will be used inscript/
files, - example usages, e.g.
examples/
- scripts to run models, evaluation, etc., e.g.
scripts/
- documentation (often generated), e.g.
docs/
It can make sense for research projects to distinguish between scripts placed in scripts/
and reused functions, models, etc., placed in src
.
Python Folder structure
|-- src/ # package code
|-- scripts/ # Custom analysis or processing scripts
|-- tests/
|-- examples/ # Example scripts using the package
|-- docs/ # documentation
-- environment.yml # to handle project dependencies
-- README.md
R Folder structure
|-- R/ # R scripts and functions (package code)
|-- scripts/ # Custom analysis or processing scripts
|-- man/ # Documentation files
|-- tests/
|-- examples/ # Example scripts using the package
|-- vignettes/ # Long-form documentation
-- DESCRIPTION # Package description and metadata
-- NAMESPACE # Namespace file for package
-- README.md # Project overview and details
Julia Folder structure
|-- src/ # package code
|-- scripts/ # Custom analysis or processing scripts
|-- test/
|-- examples/ # Example scripts using the package
|-- docs/ # documentation
-- Project.toml # to handle project dependencies
-- README.md
Turning your code/
into a “package”
You may want to specify the src
folder as a package. This has a few advantages, including
- not having to deal with relative position of files to call the functions in
src/
- maximizing your productivity by creating a generic package additionally to your main research project.
To import functions and classes (types) located in the src
folder, you typically need to indicate in each script the relative path of src
. In Julia, you would typically do something like include("../src/path/to/your/src_file.jl")
. In Python, you would do something like:
import sys
sys.path.append("../src/")
from src.path.to.your.src_file import my_fun
If src/
directory grows, it’s beneficial to convert it into a separate package. Although this process is a bit more complex, it eliminates the need for path specifications, simplifies the import of functions and classes, and makes the codebase easily accessible for other research projects.
There are typically ways to turn a code-project into an installable package. This is in particular useful for code which other people (or yourself) use for different projects.
You can achieve this easily with development tools.
For Python, tools like setuptools
and poetry
facilitate package development. If you’re working in R, devtools
is the go-to tool for developing packages. In Julia, the Pkg
tool serves a similar purpose.
Package templates can be useful to simplify the creation of packages by generating package skeletons. In Python, checkout out cookiecutter
. In R, check usethis
. For Julia, use the Pkg.generate()
built-in functionality, or the more advanced PkgTemplates.jl
package.
Note that you may want at some point to locate your src/
(and associated tests
, docs
, etc…) in a separate git repo.
Further reading for
Wrapping up
Explore these exemplary toy research repositories in different programming languages:
- Julia, using relative paths for importing
src
functions. - Python, implementing
src
as a package. - R, also implementing
src
as a package.
These repositories showcase what I consider to be best practices in research project organization.
Take-home messages
- There is not one way to structure your research project folders, but general guidelines. Create the one that makes most sense for you!
- A chosen structure should be suitable to both work during the development of your project, and to submit (parts) of it to a repository in a future stage.
- Consider turning your
src/
into a folder. This can increase your academic productivity, as you could eventually be the developer of a cool package that people re-use, with minimum efforts!