Dissertation: Fast Computation of Genome Distances

Abstract

To understand the evolutionary relationships between organisms, they are typically presented in a tree-like structure, a phylogeny. In genomic studies, phylogenies are traditionally reconstructed from a multiple sequence alignment. While most accurate, this approach is also computationally demanding. The problem is that in order to identify shared homologies, the sequences are usually first aligned nucleotide by nucleotide. This alignment step has become a bottleneck in the practice of molecular biology, where thousands of whole bacterial genomes, each a few megabases long, are sequenced and then need to be summarized as phylogenies when analyzing pathogen outbreaks.
One alternative are methods that estimate evolutionary distances directly from unaligned genomes. These pairwise distances can then be used to cluster sequences in a tree. Most of these alignment-free methods heavily rely on exact matching techniques for words of a fixed size for fast sequence comparison. However, they usually do not reflect the substitution rate, the most widely used measure of evolutionary distance.
Instead of using words of fixed size, andi used matches of maximal length as anchors for approximate pairwise alignments. These anchor alignments then can be used to estimate the substitution rate. A first implementation, andi, quickly estimates accurate pairwise distances from hundreds of bacterial genomes on standard hardware. However, the thousands of genomes currently being collected during outbreaks again slow the program down.
Andi uses a suffix array as a full-text index for each of the input sequences. Since constructing and searching in a suffix array is slow, the aim of this thesis was to investigate, whether it might be possible to just compute a single suffix array for one of the input sequences and pile all remaining sequences onto that reference. This should produce an approximate multiple sequence alignment, from which pairwise mismatches could be counted.
This approach is implemented in the program phylonium. It is available via package managers or as open source at github.com/evolbioinf/phylonium. Phylonium is much faster than andi while losing little of its predecessor's accuracy. In this thesis I explain the background to phylonium, describe its implementation, and apply it to simulated and real data. In the application section I compare phylonium to its best competitors and show that it holds a reasonable position in the classical trade-off between speed and accuracy.

Download

Grab the PDF here.

Citation

@phdthesis{kloetzl,
	title = {{Fast Computation of Genome Distances}},
	author = {Kl\"otzl, Fabian},
	language = {eng},
	school = {University of L\"ubeck},
	year = {2020},
	date = {2020-10-16},
	month = {Oct}
}