Portable Scalable Concurrency for Genomic Data Processing
PI: Ioannis Xenarios (University of Lausanne)
Co-PIs: Heinz Stockinger, Marco Mattavelli, Jorn Janneck, Yann Thoma
January 1, 2014 - December 31, 2016
Project Summary
The sequencing of the genetic information of human genome has become affordable due to highthroughput sequencing technology. This opens new perspectives for the diagnosis and successful treatment of cancer and other genetic illnesses. However, there remain challenges, scientific as well as computational, that need to be addressed for this technology to find its way into everyday practice in healthcare and medicine. The first challenge is to cope with the flood of sequencing data (several Terabytes of data per week are currently collected by Vital-IT (the SIBs High Performance and High Throughput Bioinformatics Competency Center). For instance, a database covering the inhabitants of a small country like Switzerland would need to store a staggering amount of data, about 2335740 Terabytes. The second challenge is the ability to process such enormous deluge of data in order to 1) increase the scientific knowledge of genome sequence information and 2) search genome databases for diagnosis and therapy purposes. Significant compression of genomic data is required to reduce the storage size, to increase the transmission speed and to reduce the cost of I/O bandwidth connecting the database and the processing facilities. In order to process the data in a timely and effective manner, algorithms need to exploit the significant data parallelism and they need to be able to scale and map the data processing stage on the massive parallel processing infrastructure according to the available resources and to the application constraints.
The main aim of the project is to develop for the Swiss Platform for Advanced Scientific Computing a new computation node composed of heterogeneous hardware, a new compression format for genomic data and a software infrastructure that enables emerging applications such as Genome Analysis (e.g. RNASeq) to be able to process extremely large volumes of genome data in an efficient and timely way for scientific and diagnostic purposes (clinical application) as depicted in Figure 1. Specifically, we will select a representative software application in the domain of RNASeq (e.g. tophat, bowtie or similar), and provide a high performance hardware/software implementation. It will then serve as an example for other applications.
The project will achieve the goal by addressing the following issues and tasks:
- Select existing genomics applications such as tophat and bowtie as essential components for a typical genome analysis application.
- Data compression and decompression:
- select existing compression algorithms that, beside compression, provide the appropriate functionality for the efficient implementation of analysis applications such as tophat and bowtie in terms of reduced analysis time
- basing on the assessment of the performance of existing compression algorithms and their efficient implementation, design new improved algorithms
- Computational engines:
- the different software components necessary to implement the genome analysis programs accessing compressed genomes will be implemented using a portable high level dataflow programming methodology
- a set of tools that synthesize and profile the high level dataflow programs to build genome analysis applications that efficiently runs on the PASC processing infrastructure will be developed
- platform specific libraries with the potential of supporting further algorithmic implementations will be developed.
This project will address the various challenges with the following technical thrusts:
- Investigate the realization of very high-throughput I/O solutions embedding genome compression/decompression algorithm implementations as hardware co-processing units. Compression being based on entropy coding, its essentially serial and bitwise structure is a poor match for current processor instruction sets and therefore requires specific optimized implementations. A platform architecture will be proposed and discussed with the CSCS team in order to fit their future computation node.
- Re-factor genome analysis and processing algorithms and associated applications to benefit from the portable parallelism paradigm resulting from a dataflow programming approach. This work is fundamental not only for obtaining acceptable levels of application performance, but also for exploiting the evolution of processing capabilities. Changes in the processing infrastructure necessitate porting processing algorithms, and the portability of dataflow programs promises to overcome the need to re-write applications when higher or different forms of parallelism and concurrency will be available. This task will also include embedding of processing-intensive functions into the (high level) genome application software which, due to their bitwise nature, require processor native implementations or hardware co-processing. Finally, the new algorithms will have to scale and adapt to the future CSCS node, notably in terms of memory size, number of nodes, and types of nodes.
- Further develop the dataflow programming methodology and associated ecosystem of tools built around the dataflow programming standard ISO/IEC 23001-4 so far used essentially in the video streaming and compression field so that it can become the development environment of genome application software and other domain applications that can benefit its features such as portable parallelism.
Overall, this proposal is in the domain of data-intensive memory-bound applications that further pushes the requirements of special purpose hardware. Our applications seem to be a perfect match for non-standard hardware architectures and the development of high-performance algorithms and libraries.