Doctoral Thesis: High-Performance Computational Genomics

Event Speaker: 

Ariya Reza Shajii

Event Location: 

via Zoom, see details below

Event Date/Time: 

Tuesday, August 3, 2021 - 9:00am

Abstract: 
 
Next-generation sequencing data is growing at an alarming rate, far outpacing other “big data” sources like YouTube, Twitter and astronomy data, thereby necessitating high-performance tools and algorithms. However, implementing high-performance, optimized computational genomics software requires extensive knowledge of low-level software optimization techniques as well as insight into the nature of biological data, forcing scientists to resort to high-level software alternatives that are less efficient. This thesis introduces Seq, a Python-based, domain-specific language for bioinformatics and genomics that combines the power and usability of high-level languages like Python with the performance of low-level languages like C or C++. Seq allows for shorter, simpler code, is readily usable by a novice programmer, and obtains significant performance improvements over existing languages and frameworks. Seq is showcased and evaluated by implementing a range of standard, widely used applications from all stages of the genomics analysis pipeline, including genome index construction, finding maximal exact matches, long-read alignment and haplotype phasing. We show that the Seq implementations are up to an order of magnitude faster than existing hand-optimized implementations, with just a fraction of the code. Seq's substantial performance gains are made possible by a host of novel genomics-specific compiler optimizations that are out of reach for general-purpose compilers, coupled with a static type system that avoids all of Python's runtime overhead. By enabling researchers of all backgrounds to easily implement high-performance analysis tools, Seq further opens the door to the democratization and scalability of high-performance computational genomics. Finally, we also formalize many of the principles used by Seq and introduce a generalization called Codon, which can be applied to other domains with similar results.
 
Thesis Supervisors: Profs. Bonnie Berger and Saman Amarasinghe
 
To attend this defense, please contact the doctoral candidate at arshajii at mit dot edu