CNV Optimization

Copy number variant detection made powerful and accurate



Accurate detection of CNVs from sequencing data is required for a comprehensive molecular diagnosis. However, best practices for germline CNV calling pipelines are still lacking, so implementing and tuning such a pipeline in a diagnostic lab is challenging for several reasons. First, the wide selection of algorithms and the contradictory conclusions of CNV calling benchmarks make it difficult to choose the most suitable method for a given dataset. Second, different target designs and coverage profiles may greatly affect the accuracy of a particular tool. Third, there are no clear recommendations for selecting the control samples required by joint-calling algorithms. Finally, fine-tuning the parameters of a specific CNV caller is problematic without access to a gold standard call set.


To address these issues, we developed a method for automated parameter tuning of the most popular CNV calling algorithms, including XHMM and CODEX. The main idea behind our approach is to modify the read coverage profiles of a user's dataset to imitate the presence of additional rare CNVs, which we then use for accuracy assessment. The coverage characteristics of these simulated CNVs (i.e. relative changes in depth of coverage), their allele frequencies, lengths and genomic locations are derived from the set of validated CNVs in 1000 Genomes samples and the corresponding sequencing data. Since processing BAM files to obtain high-resolution coverage profiles is time-consuming, we used Apache Spark, a distributed computing framework. In addition, we proposed a novel columnar data format for storing precomputed read depth to further speed up calculations. We demonstrated the utility of our framework on both a publicly available subset of 1000 Genomes data and a locally available dataset.
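The core idea of imitating a CNV by modifying a coverage profile can be illustrated with a minimal Python sketch. It is not the project's implementation: the function name `spike_in_cnv`, the per-target list representation, and the idealized copy ratio of 0.5 for a heterozygous deletion are all assumptions made for illustration, whereas the method described above derives the relative depth changes from validated 1000 Genomes CNVs.

```python
from typing import List


def spike_in_cnv(depth: List[float], start: int, end: int,
                 copy_ratio: float) -> List[float]:
    """Return a copy of a per-target depth profile with a simulated CNV.

    Targets with index in [start, end) are scaled by copy_ratio.
    Hypothetical helper: an idealized heterozygous deletion would use
    copy_ratio=0.5 and a heterozygous duplication copy_ratio=1.5.
    """
    if not 0 <= start < end <= len(depth):
        raise ValueError("CNV interval must lie within the coverage profile")
    if copy_ratio < 0:
        raise ValueError("copy_ratio must be non-negative")
    # Build a new list so the original profile stays untouched.
    return [d * copy_ratio if start <= i < end else d
            for i, d in enumerate(depth)]


# Toy profile: mean read depth over five consecutive targets of one sample.
profile = [100.0, 98.0, 102.0, 101.0, 99.0]

# Simulate a heterozygous deletion spanning targets 1 and 2.
with_deletion = spike_in_cnv(profile, start=1, end=3, copy_ratio=0.5)
print(with_deletion)
```

Because the positions of the spiked-in events are known, they can act as a truth set when the caller is run on the modified profiles.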


The results indicate that our solution can substantially improve the CNV calling performance of state-of-the-art methods compared with their default settings. In conclusion, by simplifying the implementation of CNV calling pipelines, the proposed approach may lead to wider adoption of CNV analysis from sequencing data in clinical laboratories.
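One way such a comparison between default and tuned settings could be scored is sketched below in Python. Everything here is an assumption for illustration: the overlap-based matching rule, the use of F1 as the accuracy measure, the helper names `f1_score` and `pick_best`, and the toy call sets. The text above does not specify which metric or matching criterion the framework uses.

```python
from typing import Callable, Iterable, List, Tuple

# (chromosome, start, end) with a half-open interval; assumed representation.
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals share a chromosome and at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def f1_score(calls: List[Interval], truth: List[Interval]) -> float:
    """F1 of a call set against known (e.g. simulated) CNVs, by any overlap."""
    matched_calls = sum(1 for c in calls if any(overlaps(c, t) for t in truth))
    found_truth = sum(1 for t in truth if any(overlaps(c, t) for c in calls))
    precision = matched_calls / len(calls) if calls else 0.0
    recall = found_truth / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pick_best(settings: Iterable[str],
              run_caller: Callable[[str], List[Interval]],
              truth: List[Interval]) -> str:
    """Return the setting whose calls score the highest F1 on the truth set."""
    return max(settings, key=lambda s: f1_score(run_caller(s), truth))


# Toy data standing in for real caller output under two hypothetical settings.
truth = [("chr1", 100, 200), ("chr2", 500, 800)]
fake_calls = {
    "default": [("chr1", 150, 250), ("chr3", 10, 20)],
    "tuned": [("chr1", 90, 210), ("chr2", 600, 700)],
}

best = pick_best(fake_calls.keys(), fake_calls.__getitem__, truth)
print(best, f1_score(fake_calls[best], truth))
```

In a real pipeline, `run_caller` would wrap an actual invocation of a tool such as XHMM or CODEX with the given parameters; here it is only a dictionary lookup.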

Visit GitHub Repo
  • Status: In progress
  • Started: April 2017
  • Lead: Tomasz Gambin