elPrep* is a high-performance tool for preparing SAM/BAM/CRAM files for variant calling in genomic sequencing pipelines.

Execution Time Cut to 15 Minutes¹

elPrep* is a high-performance tool for preparing sequence alignment/map files for variant calling in sequencing pipelines. It can be used as a replacement for SAMtools* and Picard* for preparation steps such as filtering, sorting, marking duplicates, and reordering contigs, while producing identical results. What sets elPrep* apart is its software architecture, which executes a preparation pipeline in a single pass through the data, no matter how many preparation steps the pipeline contains. elPrep* is designed as a multithreaded application that runs entirely in memory, avoids repeated file I/O, and merges the computation of several preparation steps to significantly reduce execution time.

Performance Results

For a preparation pipeline of five steps on a whole-exome BAM file (NA12878), elPrep* reduces the execution time from about 1 hour 40 minutes with a combination of SAMtools* and Picard* to about 15 minutes, while using the same server resources (48 threads and 23 GB RAM)¹. Tested using picard-tools-1.229*, samtools-1.2*, elprep-2.2*.
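As a quick check on what these numbers imply, the speedup can be worked out from the two reported wall-clock times. This is only arithmetic on the approximate figures quoted above ("about 1:40 hours" and "about 15 minutes"), so the result is approximate as well; the variable names are illustrative.

```python
# Approximate wall-clock times quoted in the text, in minutes.
baseline_minutes = 1 * 60 + 40   # SAMtools* + Picard*: about 1 hour 40 minutes
elprep_minutes = 15              # elPrep*: about 15 minutes

speedup = baseline_minutes / elprep_minutes        # how many times faster
reduction = 1 - elprep_minutes / baseline_minutes  # fraction of time saved

print(f"speedup: about {speedup:.1f}x")            # about 6.7x
print(f"time saved: about {reduction:.0%}")        # about 85%
```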


Sequence analysis generally consists of a mapping phase followed by an analysis phase. In the mapping phase, an alignment tool maps the reads produced by the wet lab to a known reference genome. Afterwards, the mapped reads are processed by an analysis tool, for example for variant detection.

Alignment and analysis tools communicate via sequence alignment/map (SAM) files, a standardized format for storing mapped reads (Li et al., 2009), or the compressed variants thereof (BAM/CRAM). In practice, different alignment tools produce slightly different outputs, and different analysis tools depend on slightly different SAM structures to work properly.
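To make the format concrete, here is a minimal sketch of reading one SAM alignment line in Python. It relies only on the eleven mandatory tab-separated columns defined by the SAM specification; the `SamRecord` class, the `parse_sam_line` helper, and the example read are illustrative and are not part of elPrep*.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SamRecord:
    """The eleven mandatory SAM columns plus any optional tags."""
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence (contig) name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities
    tags: List[str] = field(default_factory=list)


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line (not a header line starting with '@')."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    return SamRecord(cols[0], int(cols[1]), cols[2], int(cols[3]), int(cols[4]),
                     cols[5], cols[6], int(cols[7]), int(cols[8]),
                     cols[9], cols[10], cols[11:])


# A made-up read, used only for illustration.
example = "read1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII\tNM:i:0"
record = parse_sam_line(example)
is_unmapped = bool(record.flag & 0x4)     # SAM flag bit 0x4: read unmapped
is_duplicate = bool(record.flag & 0x400)  # SAM flag bit 0x400: duplicate
```

Preparation steps such as filtering unmapped reads or marking duplicates come down to reading and rewriting fields like these.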

This is why there are typically a number of steps in between the alignment and analysis tools to rewrite the SAM files into a form that is accepted by the analysis tool. For example, the GATK best-practice pipeline (Van der Auwera et al., 2013) requires five preparation steps between alignment (BWA) and analysis (GATK). These steps take up roughly 30% of the runtime of the complete pipeline.

Pipeline Execution Without elPrep*

We developed elPrep*, a new tool designed as a high-performance alternative to existing tools for manipulating SAM, BAM, and CRAM files. elPrep* is designed as a multithreaded program from the ground up: all preparation steps are executed in parallel. The application runs entirely in memory, which avoids repeated file I/O between the preparation steps, and it merges their computations so that they execute more efficiently.
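The idea of merging several preparation steps into one traversal can be sketched as follows. This is a simplified, single-threaded illustration and not elPrep*'s actual implementation: each step is modeled as a function that returns a (possibly modified) read or `None` to drop it, and all steps are applied to a read before moving on to the next one. The read representation (a plain dictionary) and the function names are assumptions made for the example.

```python
from typing import Callable, Iterable, Iterator, List, Optional

Read = dict                                    # hypothetical in-memory read
Filter = Callable[[Read], Optional[Read]]      # returns None to drop the read


def drop_unmapped(read: Read) -> Optional[Read]:
    """Drop reads whose SAM flag has the 'unmapped' bit (0x4) set."""
    return None if read["flag"] & 0x4 else read


def make_replace_read_group(group: str) -> Filter:
    """Build a step that stamps every read with the given read group."""
    def replace_read_group(read: Read) -> Optional[Read]:
        return {**read, "rg": group}
    return replace_read_group


def run_single_pass(reads: Iterable[Read], filters: List[Filter]) -> Iterator[Read]:
    """Apply every step to each read during one traversal of the input."""
    for read in reads:
        for step in filters:
            read = step(read)
            if read is None:       # a step dropped the read; skip the rest
                break
        else:
            yield read             # the read survived every step


reads = [{"name": "a", "flag": 0}, {"name": "b", "flag": 4}]
prepared = list(run_single_pass(reads, [drop_unmapped,
                                        make_replace_read_group("sample1")]))
```

Adding another step means appending one more function to the list; the input is still traversed only once.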

Hypothetical Execution with Parallelized Tools

To execute the pipeline in a single pass, we had to reformulate the preparation steps as filters. In many cases this was straightforward, but some steps required alternative algorithms. For example, the algorithm for marking duplicates in Picard* is based on comparing the adapted mapping positions of all reads. Its implementation traverses the entire read set multiple times to compare the reads' mapping positions one by one. We reformulated this as a single-pass algorithm that uses memoization to keep track of the reads with the best mapping positions. If a subsequent read maps to the same position as a previous one, but with a better quality score, it replaces the old one in the memoization table, and the old one is marked as a duplicate. Despite such algorithmic reformulations, the output of elPrep* is 100% equivalent to the output produced by SAMtools* and Picard*.
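The single-pass duplicate marking described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not elPrep*'s code: reads are plain dictionaries, the table key is just contig, position, and strand (real duplicate marking also accounts for adapted positions, read pairs, and libraries), and `score` is a hypothetical per-read quality score.

```python
DUPLICATE_FLAG = 0x400  # SAM flag bit for a PCR or optical duplicate


def mark_duplicates(reads):
    """Mark duplicates in one traversal using a memoization table.

    The table maps a mapping position to the best read seen so far at
    that position. Reads are modified in place.
    """
    best = {}
    for read in reads:
        key = (read["rname"], read["pos"], read["reverse"])  # simplified key
        current = best.get(key)
        if current is None:
            best[key] = read                     # first read at this position
        elif read["score"] > current["score"]:
            current["flag"] |= DUPLICATE_FLAG    # old best becomes a duplicate
            best[key] = read
        else:
            read["flag"] |= DUPLICATE_FLAG       # new read is the duplicate
    return reads


# Made-up reads for illustration.
reads = [
    {"name": "a", "rname": "chr1", "pos": 100, "reverse": False, "score": 10, "flag": 0},
    {"name": "b", "rname": "chr1", "pos": 100, "reverse": False, "score": 30, "flag": 0},
    {"name": "c", "rname": "chr1", "pos": 100, "reverse": False, "score": 20, "flag": 0},
    {"name": "d", "rname": "chr1", "pos": 200, "reverse": False, "score": 5, "flag": 0},
]
mark_duplicates(reads)
duplicates = [r["name"] for r in reads if r["flag"] & DUPLICATE_FLAG]
```

Each read is looked at exactly once, and the table holds one entry per distinct mapping position, which is what allows the step to run as a filter while the data streams in.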

Pipeline Execution with elPrep*

Once all data is streamed into memory and all filters are applied, the operations that work on the whole data set, such as sorting, are executed. elPrep* implements this phase using fork-join patterns, which are executed on a work-stealing scheduler for load balancing. After the sorting phase, the worker threads transform the data back into SAM file entries in parallel, while possibly applying additional filters, to write the result to the output file.
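The fork-join structure of the sorting phase can be illustrated with a small sketch: split the data into chunks, sort the chunks concurrently (fork), then merge the sorted chunks (join). This is not elPrep*'s implementation. It uses Python's standard thread pool rather than a work-stealing scheduler, and in CPython the threads mainly show the structure rather than deliver a real speedup. The `parallel_sort` name and the tuple representation of reads are assumptions for the example.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def parallel_sort(items, key, workers=4):
    """Fork-join sort: sort chunks concurrently, then merge the results."""
    if not items:
        return []
    chunk_size = -(-len(items) // workers)      # ceiling division
    chunks = [items[i:i + chunk_size]
              for i in range(0, len(items), chunk_size)]
    # Fork: one sorting task per chunk.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_chunks = list(pool.map(lambda chunk: sorted(chunk, key=key),
                                      chunks))
    # Join: merge the already sorted chunks into one sorted list.
    return list(heapq.merge(*sorted_chunks, key=key))


# Made-up (contig, position) pairs standing in for mapped reads.
reads = [("chr1", 300), ("chr1", 100), ("chr2", 50), ("chr1", 200)]
by_coordinate = parallel_sort(reads, key=lambda r: r, workers=2)
```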


Charlotte Herzeel, Pascal Costanza, Dries Decap, Jan Fostier, and Joke Reumers. "elPrep: High-Performance Preparation of Sequence Alignment/Map Files for Variant Calling." PLoS ONE 10, no. 7 (2015). doi:10.1371/journal.pone.0132868.

Configuration Table

System Overview

Software: picard-tools-1.229*, samtools-1.2*, elprep-2.32*, CentOS* release 7.0.1406 (Core), Python* 2.7.5, GCC* 4.8.2 (optional), GNU parallel* 20150222 (optional)

Processor: 2x 12-core Intel® Xeon® E5-2690 processor (2.6 GHz)

Memory: 256 GB

Storage: 2 TB Intel® P3700 SSD



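Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, including SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For complete information, see http://www.intel.com.tw/benchmarks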