Machine Learning Algorithms for the Analysis of Data from Whole-Genome Tiling Microarrays.
(PhD Thesis, University of Tübingen 2010)

In this work, we developed machine learning-based methods to further our understanding of fundamental questions in molecular biology, using the model plant Arabidopsis thaliana as our example:

What are the differences between genomes of individuals belonging to the same species? Characterizing sequence variants (polymorphisms) genome-wide is a prerequisite for establishing causal links between adaptive quantitative traits and the underlying genetic variants. Single-nucleotide polymorphisms (SNPs) are the most abundant class of polymorphisms. In addition to SNP detection, we investigated genomic regions in which SNP calling algorithms tend to fail: on the one hand, highly variable sequence tracts, for which, paradoxically, only very few SNPs can be identified and, on the other hand, additional polymorphism types, such as insertions and deletions. With our newly developed method (mPPR) we discovered hundreds of thousands of polymorphic regions (with a false-discovery rate of <3%). These correspond, in part, to SNPs, but also contain deletions ranging from a few to several thousand nucleotides in length. Our results revealed, for the first time, a comprehensive, fine-scale picture of the polymorphism patterns in A. thaliana with dramatic differences between coding and noncoding regions and also between individual genes and gene families.

What is an organism's full complement of genes, in which tissues and developmental stages are they transcribed, and how is their expression altered in response to environmental changes? Transcriptome studies have provided the foundation for reconstructing the gene regulatory network, which describes the control of cellular processes, e.g., during cell differentiation. We developed a transcript identification method (mSTAD), which recognizes genic expression patterns. With mSTAD, we discovered thousands of new transcripts that had remained unknown despite extensive annotation efforts. Validation experiments confirmed >75% of the tested cases, corroborating mSTAD's high accuracy. Moreover, we found hundreds of genomic regions with evidence of stress-specific transcription. These include previously unannotated genes as well as incorrectly annotated parts of known genes.

Our computational methods are based on data generated with so-called tiling arrays, an advanced type of DNA microarray that interrogates a whole genome at regular intervals. This technology facilitates both the detection of polymorphisms and transcriptome profiling. Using it, our analyses targeted, for the first time, the whole genome and were not restricted to a few fragments.

Since the resulting data resources are the basis for further research, high accuracy was imperative. However, microarray data typically exhibit high noise levels. We therefore devised new preprocessing techniques to reduce systematic noise, in particular probe sequence effects. We demonstrated the benefit of these techniques for subsequent transcript identification. By contrast, the comparable methods investigated here failed in this respect. In our attempts to detect polymorphic or transcribed regions, we faced segmentation problems. Recently developed machine learning algorithms, especially Hidden Markov Support Vector Machines, were found to be very well suited for solving these problems. In the case of transcript identification, we could show mSTAD's superior accuracy compared to other widely used methods. Since no comparable methods exist for polymorphic region prediction, no such comparison was possible there. Although originally developed for the analysis of A. thaliana data, our methods can be broadly applied to similar data sets, which already exist for a number of organisms. We furthermore discuss their applicability to related data, such as those generated by next-generation sequencing technologies.
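To illustrate the kind of segmentation problem referred to above, the following is a minimal sketch of Viterbi decoding over a track of probe intensities. It is not the implementation of mSTAD or mPPR. Hidden Markov Support Vector Machines learn their scoring functions discriminatively from labeled data, whereas the two states, the emission scores, and the switching penalty used here are hand-set, hypothetical values chosen only to show how a label sequence is decoded and turned into segments.

```python
import numpy as np


def viterbi(emission, transition):
    """Return the highest-scoring state path.

    emission:   array (n_probes, n_states), score of each state at each probe
    transition: array (n_states, n_states), score of moving from state i to j
    """
    n, k = emission.shape
    score = np.empty((n, k))
    back = np.zeros((n, k), dtype=int)
    score[0] = emission[0]
    for t in range(1, n):
        # cand[i, j]: best score ending in state i at t-1, then moving to j
        cand = score[t - 1][:, None] + transition
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + emission[t]
    # Trace back from the best final state
    path = np.empty(n, dtype=int)
    path[-1] = score[-1].argmax()
    for t in range(n - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path


def to_segments(path):
    """Collapse a state path into (start, end_exclusive, state) segments."""
    segments, start = [], 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            segments.append((start, t, int(path[start])))
            start = t
    return segments


# Hypothetical, hand-set example: state 0 = background, state 1 = signal.
intensities = np.array([0.1, 0.2, 0.1, 2.0, 2.2, 0.3, 2.1, 1.9, 0.2, 0.1])
state_means = np.array([0.2, 2.0])   # assumed typical intensity per state
emission = -(intensities[:, None] - state_means[None, :]) ** 2
switch_penalty = 2.0                 # assumed cost of changing state
transition = -switch_penalty * (1 - np.eye(2))

path = viterbi(emission, transition)
segments = to_segments(path)
print(segments)
```

With these assumed values, the single low-intensity probe at index 5 is absorbed into the surrounding signal segment, because switching state twice costs more than one poorly fitting probe. Smoothing over noisy individual probes in this way is the main reason to treat region detection as a segmentation problem rather than classifying each probe independently.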