
Writing open and reproducible code is an essential part of computational biology research. We adhere to a strict set of organisational and stylistic conventions to maximise code reuse by lab members, improve analysis rigour, and reduce the amount of code clean-up needed before paper submission. These standards are meant for general data analysis repositories that would accompany a publication, not for published software tools.

Initialising a new repository

Create a new repository within the Ewald Lab GitHub Org with the following specs:

Repository structure

Each repository should have one or more ‘analysis module’ directories in the root. A common set of modules for a simple project could be 00.data_download_exploration/, 01.analysis_pipeline/, and 02.downstream_analysis/. Within each module will be the following sub-directories:

Files in the analysis folder should be numbered in the order in which they should be executed (e.g. 00.download_data.py, 01.describe_exp_design.py, 02.filter_normalize.py, etc.). You’ll notice that there is a separate environment for each analysis module. This keeps each environment lightweight and reduces the chance of dependency conflicts across the repository.
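For illustration only, a minimal sketch of one module is shown below; the scripts/ sub-directory and environment.yml names are hypothetical placeholders, so follow whatever sub-directory names the list above specifies:

```
00.data_download_exploration/
├── scripts/                      # numbered analysis steps, executed in order
│   ├── 00.download_data.py
│   ├── 01.describe_exp_design.py
│   └── 02.filter_normalize.py
└── environment.yml               # environment for this module only
```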

Repository as a lab notebook

If you encounter an issue with the data, have an open question that needs discussion, or generate cool results, please describe these in GitHub Issues within the repository and tag Jess or other team members instead of sending an email. This keeps a clear record of all project-related troubleshooting, questions, and milestones. It is very helpful for onboarding new team members to the project, and for reminding ourselves why particular decisions were made, which becomes very important for multi-year projects! The only exceptions are discussions that require confidentiality (use email) and quick reminder-type questions (use Slack). Keep in mind that most repositories will eventually be made public, including all associated GitHub Issues, so that is extra motivation to keep things professional.

Nextflow

The bulk of your analysis will likely be in the xx.analysis_pipeline module. Ideally it will be implemented in Nextflow. Pipeline orchestration software like Nextflow or Snakemake has a bit of a learning curve, but it makes running complex pipelines, reproducing them, and experimenting with different parameters or steps much, much easier in the long run. We chose Nextflow because it is officially supported by EBI IT, so there are dedicated training sessions and plenty of institutional knowledge on how to make it work well with the EBI cluster.
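As a rough sketch of what this might look like, a minimal DSL2 workflow can wrap one of the numbered Python scripts as a pipeline step. The script name comes from the example above, but the command-line flags, output filename, and params.counts parameter are hypothetical, not lab conventions:

```nextflow
// Minimal illustrative Nextflow DSL2 workflow; flags, filenames, and params are hypothetical.
nextflow.enable.dsl = 2

// Wrap one numbered analysis script as a pipeline process.
process FILTER_NORMALIZE {
    input:
    path counts

    output:
    path "normalized_counts.csv"

    script:
    """
    python ${projectDir}/02.filter_normalize.py \
        --input ${counts} \
        --output normalized_counts.csv
    """
}

workflow {
    // e.g. nextflow run main.nf --counts raw_counts.csv
    counts_ch = Channel.fromPath(params.counts)
    FILTER_NORMALIZE(counts_ch)
}
```

Keeping each process as a thin wrapper around a versioned script in the module means the Python logic stays testable on its own, while Nextflow handles ordering, caching, and cluster submission.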

General principles