Description

SimBA is a suite of two benchmarking tools: SimCT, which simulates datasets, and BenchCT, which benchmarks results obtained on them. The SimBA suite is fully described in a submitted paper.

SimCT

SimCT generates simulated datasets that mimic specific real biological conditions as closely as possible, together with the list of genomic incidents and mutations that have been inserted.

BenchCT

BenchCT then compares the output of any bioinformatics pipeline run on a SimCT dataset against the simulated genomic and transcriptional variations that the dataset contains, giving an accurate performance evaluation with regard to a specific biological question.
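
To illustrate the kind of comparison described above, here is a minimal sketch that scores a set of predicted variants against a set of simulated ones. It is not BenchCT's implementation: the `(chromosome, position, ref, alt)` key, the `score_calls` function, and the example variants are all assumptions made for the illustration.

```python
from typing import NamedTuple, Set, Tuple

# A variant is identified here by (chromosome, 1-based position, ref, alt).
# This key is an assumption for the example, not BenchCT's representation.
Variant = Tuple[str, int, str, str]


class Scores(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int
    precision: float
    recall: float


def score_calls(truth: Set[Variant], predicted: Set[Variant]) -> Scores:
    """Compare predicted variants with the simulated (true) ones."""
    tp = len(truth & predicted)   # simulated and found
    fp = len(predicted - truth)   # found but never simulated
    fn = len(truth - predicted)   # simulated but missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    return Scores(tp, fp, fn, precision, recall)


# Hypothetical toy data: three simulated variants, three predicted ones.
truth = {("chr1", 1000, "A", "G"), ("chr1", 2500, "C", "T"), ("chr2", 42, "G", "A")}
predicted = {("chr1", 1000, "A", "G"), ("chr2", 42, "G", "A"), ("chr3", 7, "T", "C")}

scores = score_calls(truth, predicted)
print(scores)
```

Exact set membership is the simplest matching rule; a real evaluation would typically need tolerances (for example on splice-junction or fusion breakpoint coordinates), which this sketch does not attempt.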

Source code

The source code of SimCT and BenchCT is freely available in their respective GitHub repositories:

Sample benchmark

We make available the benchmarks used in our paper in two ways:

  1. You can directly access the datasets
  2. You can reproduce the datasets

Dataset description

Four datasets were designed to address the sequencing of human samples in two biological contexts: normal and somatic cells. The normal condition contains a genomic layer of polymorphisms with rates close to those observed by the 1000 Genomes Project. To improve the quality of the simulation, 95% of the introduced mutations were taken from common human polymorphisms. The somatic condition contains higher mutation rates, a more complex gene expression profile, and gene fusions.

Two datasets were generated for each of these conditions, with read lengths of 101 bp and 150 bp and a sequencing depth of 2x80 million paired-end reads. These two lengths were chosen to mimic the Illumina HiSeq 2500 sequencing platform, which typically produces reads of these lengths. This deep sequencing provides a base dataset that can be further sub-sampled for other applications.
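
As a sketch of how such sub-sampling could be done, the following keeps a random fraction of read pairs while keeping the two mates of each pair in sync. It assumes the reads are stored as two FASTQ streams with mates in the same order; the function names and the toy in-memory data are hypothetical and are not part of SimCT.

```python
import io
import random
from itertools import islice
from typing import Iterator, TextIO, Tuple

Record = Tuple[str, ...]  # the four lines of one FASTQ record


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield FASTQ records as 4-tuples of lines without trailing newlines."""
    while True:
        lines = list(islice(handle, 4))
        if len(lines) < 4:  # end of file (or a truncated last record)
            return
        yield tuple(line.rstrip("\n") for line in lines)


def subsample_pairs(r1: TextIO, r2: TextIO, fraction: float,
                    seed: int = 0) -> Iterator[Tuple[Record, Record]]:
    """Keep each read pair with probability `fraction`.

    A single random draw decides the fate of both mates, so the two
    output streams stay paired. A fixed seed makes the draw reproducible.
    """
    rng = random.Random(seed)
    for rec1, rec2 in zip(read_fastq(r1), read_fastq(r2)):
        if rng.random() < fraction:
            yield rec1, rec2


def make_fastq(mate: int, n: int) -> io.StringIO:
    """Build a toy in-memory FASTQ file (illustration only)."""
    return io.StringIO(
        "".join(f"@read{i}/{mate}\nACGT\n+\nIIII\n" for i in range(n))
    )


kept = list(subsample_pairs(make_fastq(1, 1000), make_fastq(2, 1000),
                            fraction=0.25, seed=42))
print(len(kept), "pairs kept out of 1000")
```

With `fraction=0.25`, a dataset of 80 million pairs would be reduced to roughly 20 million pairs. For compressed files, handles opened with `gzip.open(path, "rt")` can be passed in place of the in-memory streams.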

Accessing the datasets

The four datasets can be accessed here. Each directory corresponds to one dataset, and its name denotes, in order, the species, the read length, the number of sequenced bases, and the condition.
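
For orientation, the number of sequenced bases follows from the read length and the sequencing depth. The short calculation below assumes that "2x80 million paired-end reads" means 80 million pairs, with two reads per pair; the function name is hypothetical.

```python
def sequenced_bases(n_pairs: int, read_length: int) -> int:
    """Total bases in a paired-end run: two reads per pair."""
    return 2 * n_pairs * read_length


for read_length in (101, 150):
    total = sequenced_bases(80_000_000, read_length)
    print(f"{read_length} bp reads: {total / 1e9:.2f} Gb")
```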

The .err file is the error model used to generate sequencing errors.

The complete datasets can simply be retrieved, for instance with wget:

wget -r http://bonsai-bioinfo.lille.inria.fr/simba/

Within each directory, the files are SimCT output files, as described in the SimCT documentation.

Reproducing the datasets

The datasets can be reproduced using SimCT and the Snakefile that can be accessed at https://github.com/jaudoux/simba-publication-pipeline