Georg Zeller's homepage

Systems Biology of Drug Perturbations

since 2010 | Carried out at EMBL

By combining and integrating different types of data (e.g., molecular in vitro data, data from cell-based assays, complex phenotypic data such as side effects), we aim to gain a better understanding of drug mechanisms of action. Ideally, this will aid the interpretation of drug action in many contexts, from the molecular level (for instance, drug-target interaction) to the whole organism (e.g., establishing links between side effects and cellular pathways: Brouwers et al. PLoS One, 2011; see also Iskar et al. Curr. Opin. Biotechnol., 2012). To this end, we are developing methods to predict protein targets, response pathways or side effects of drugs from complex cell-based assays or organism-scale data, and we analyze these predictive models to reveal the underlying biological mechanisms. We regard these as rational approaches to drug repositioning and in silico drug safety assessment.

Currently, we focus on cellular assays of expression changes upon chemical perturbations: The Connectivity Map records gene expression (for >10,000 genes) in several cell lines for hundreds of small molecules. Analysis of such complex multi-parametric read-outs requires intelligent supervised (e.g. feature selection) and unsupervised techniques (such as clustering and bi-clustering) in order to reveal new drug mechanisms of action as well as functions of the biological systems that respond to these treatments (see Iskar et al. Mol. Syst. Biol., 2013).
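As a rough illustration of the unsupervised side of such an analysis, the following sketch groups drugs whose expression profiles are strongly correlated. It is a minimal example under stated assumptions, not the method used in the cited work: the drug names, the four-gene profiles, the threshold of 0.8 and the single-linkage grouping are all hypothetical choices made for brevity.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def cluster_by_correlation(profiles, threshold=0.8):
    """Single-linkage grouping: two drugs end up in the same cluster
    if a chain of pairwise correlations >= threshold connects them."""
    names = sorted(profiles)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(profiles[a], profiles[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


# Hypothetical log-fold changes of four genes after treatment.
profiles = {
    "drug_A": [2.0, 1.5, -1.0, -2.0],
    "drug_B": [1.8, 1.2, -0.9, -2.2],
    "drug_C": [-1.0, -2.0, 1.5, 2.0],
    "drug_D": [-1.2, -1.8, 1.4, 2.1],
}
clusters = cluster_by_correlation(profiles)
print(clusters)
```

In this toy data set, drugs A and B induce similar changes, as do C and D, so two groups are expected. Real expression compendia would additionally require normalization and handling of cell-line and batch effects, which this sketch ignores.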

SIDER resource at EMBL | STITCH resource at EMBL | Drug modules at EMBL | CMap project at the BROAD

Metagenomic Markers of Host States

since 2010 | Carried out at EMBL

Metagenomics projects targeting the microbiota living in and on the human body seek to reveal interactions between states of the host (such as health status, dietary habits, age, body mass index, etc.) and its commensal microbes. The composition of gut microbiota can differ markedly between individuals, but there appear to be only a few community states ("enterotypes"), which can be revealed by clustering (Arumugam et al. Nature, 2011). However, the biological role of these enterotypes and how they relate to host properties is less clear. Together with many colleagues from Peer Bork's group and the MetaHIT consortium, I work to derive signatures from the abundances of certain microbes or from functions encoded in the metagenome that are predictive of host states. To this end, I am experimenting with regression and classification techniques that are well suited for feature selection and interpretation.

MetaHIT project page

Transcript Identification from RNA-seq

since 2009 | Carried out at FML

Together with colleagues from Gunnar Rätsch's group at the FML, I have been working on a Hidden Markov SVM-based method called mTIM for transcript reconstruction from RNA-seq read alignments. We obtained promising initial results when participating in the RGASP challenge on genome annotation and gene building using RNA-seq data. While technically related to many gene finding systems, mTIM does not require an open reading frame and can detect noncoding transcripts as well as coding genes. Due to the underlying machine learning framework, it is relatively robust to errors in the RNA-seq read alignments, in the sense that the accuracy of its transcript predictions suffers less than that of other methods in our comparisons.

mTIM on the FML galaxy server

Gene Finding

since 2006 | Carried out at FML

In Gunnar Rätsch's group we developed a novel and very accurate gene finding system called mGene, which uses a discriminative structure prediction technique from machine learning called hidden semi-Markov SVMs (Schweikert et al., Genome Res., 2009). mGene outperformed other gene finders in the nGASP competition on nematode genome annotation (on average over all evaluation criteria) (Coghlan et al., BMC Bioinf., 2008). It was later extended to additionally take advantage of direct transcriptome measurements from tiling arrays or RNA-seq alignments, leading to improved prediction performance. mGene.web makes this system available as a Galaxy web service (Schweikert et al., NAR, 2009).

For the initial version of mGene, with which we participated in nGASP, I contributed to the development of sensors recognizing the polyadenylation site, as well as sensors that distinguish between different segments (for example, exons and introns) based on DNA sequence content. Moreover, I developed sensors discriminating in-frame nucleotide composition from sequences (frames) that are not translated (Schweikert et al., Genome Res., 2009).

More recently, we have been investigating how discriminative structure prediction algorithms can be applied to gene prediction in prokaryotes and how techniques from multi-task learning can help to transfer gene predictor models from one species to another (even to distantly related ones; see Görnitz et al., NIPS, 2011).

mGene page at FML | mGene.web page at FML | nGASP website

Tiling Array-Based Transcriptomics

2006 - 2011 | Carried out at FML / MPI

Using whole-genome tiling arrays, we studied the transcriptomes of model organisms and their dynamics. Their comprehensive representation by Affymetrix tiling arrays allowed us to monitor the expression of known genes as well as to identify new, previously uncharacterized (non-coding) genes and alternative transcript isoforms. We leveraged the tiling array technology in conjunction with machine learning methods for data analysis to i) create an inventory of differential expression of known and new genes for various Arabidopsis thaliana organs and several developmental stages (Laubinger et al., Genome Biol., 2008), as well as under changing environmental conditions (Zeller et al., Plant J., 2009); ii) obtain a detailed expression map for Caenorhabditis elegans tissues and organs at cellular resolution (Spencer et al., Genome Res., 2011) as part of the modENCODE project, which aims to gain a detailed understanding of the functional elements encoded in the worm genome (Gerstein et al., Science, 2010); and iii) characterize on a global scale the transcriptome changes resulting from deficiencies in regulators of mRNA capping, splicing, and biogenesis of small RNAs (Laubinger et al., PNAS, 2008; Laubinger et al., PNAS, 2010).

I developed machine learning-based methods for normalization and segmentation of tiling array data with the twofold goal of reducing probe sequence effects on hybridization intensity and accurately segmenting transcribed exons out of the intergenic/intronic background (Zeller et al., Pac. Symp. Biocomput., 2008). These normalization and segmentation methods allowed for very accurate de novo transcript identification (Laubinger et al., Genome Biol., 2008) and were crucial tools for completing the above-mentioned projects.

A further subproject identified alternative splicing events, primarily intron retentions, from tiling array data. We approached this task as a two-step classification problem: First, a classifier was trained to discriminate between exons and introns. Subsequently, introns with a large inclusion probability in some tissue were identified as candidates for alternative splicing by a second classifier (Eichner et al., BMC Bioinf., 2011).
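The two-step idea can be sketched in a few lines. This is a deliberately simplified illustration and not the published method: the first step is reduced to a one-dimensional logistic-style scorer on a single intensity value, the second classifier is replaced by a plain probability cutoff, and all names, intensities and the cutoff of 0.8 are hypothetical.

```python
from math import exp


def train_exon_intron_scorer(exon_values, intron_values):
    """Step 1 (simplified): place a decision boundary midway between the
    mean exon and mean intron intensity and return a function that maps
    an intensity to a probability of being expressed like an exon."""
    mean_exon = sum(exon_values) / len(exon_values)
    mean_intron = sum(intron_values) / len(intron_values)
    midpoint = (mean_exon + mean_intron) / 2
    # Scale chosen so that the class means map to logits of +2 and -2;
    # fall back to 1.0 if both means coincide.
    scale = (mean_exon - mean_intron) / 4 or 1.0

    def inclusion_probability(intensity):
        return 1 / (1 + exp(-(intensity - midpoint) / scale))

    return inclusion_probability


def retained_intron_candidates(intron_intensities, inclusion_probability,
                               cutoff=0.8):
    """Step 2 (simplified): flag introns whose inclusion probability
    reaches the cutoff in at least one tissue."""
    candidates = {}
    for intron, by_tissue in intron_intensities.items():
        probs = {t: inclusion_probability(v) for t, v in by_tissue.items()}
        best_tissue = max(probs, key=probs.get)
        if probs[best_tissue] >= cutoff:
            candidates[intron] = (best_tissue, probs[best_tissue])
    return candidates


# Hypothetical training intensities (log scale) for annotated segments.
scorer = train_exon_intron_scorer(
    exon_values=[9.5, 10.2, 10.3],
    intron_values=[1.8, 2.1, 2.1],
)

# Hypothetical per-tissue intensities of two annotated introns.
introns = {
    "intron_1": {"root": 2.0, "leaf": 2.3},
    "intron_2": {"root": 2.2, "flower": 9.5},
}
candidates = retained_intron_candidates(introns, scorer)
print(candidates)
```

Here only the second intron is reported, because it shows exon-like signal in one tissue while remaining at background level in another. A real analysis would use many probe-level features per segment rather than a single intensity.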

Tiling array page at FML | At-TAX at Weigel lab (MPI) | Worm expression map at U. Vanderbilt

Array-Based Polymorphism Discovery

2005 - 2008 | Carried out at FML / MPI

Using resequencing array technology, this project aimed at characterizing common sequence polymorphisms in 20 diverse strains of the plant model organism Arabidopsis thaliana (Clark et al., Science, 2007). Based on this work, a 250k SNP chip was developed for cost-effective genotyping, which enables high-resolution genome-wide association studies in Arabidopsis thaliana (e.g. Atwell et al., Nature, 2010). A similar project catalogued and analyzed genetic variation between rice varieties (McNally et al., PNAS, 2009). My focus has been on detecting highly polymorphic regions with a machine learning technique called Hidden Markov Support Vector Machines (Zeller et al., Genome Res., 2008). Polymorphic regions typically correspond to clusters of small polymorphisms (SNPs and indels), but also include large deletions. As these types of polymorphisms are impossible or very difficult to identify with SNP calling methods, polymorphic region predictions ideally complement SNP data by adding an inventory of variations expected to have pronounced phenotypic effects when affecting genes and other functional genomic elements (Zeller et al., Genome Res., 2008).
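To give an intuition for segmentation-based region detection, the sketch below labels each position along a sequence as conserved or polymorphic with the Viterbi algorithm. Note that it uses a plain generative two-state hidden Markov model, not the discriminatively trained Hidden Markov Support Vector Machine of the cited work; the two states share only the decoding idea. The observation alphabet ("match"/"mismatch") and all probabilities are hypothetical values chosen for illustration.

```python
from math import log


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path for the observations,
    computed in log space to avoid numerical underflow."""
    scores = [{s: log(start_p[s]) + log(emit_p[s][observations[0]])
               for s in states}]
    backpointers = []
    for obs in observations[1:]:
        prev = scores[-1]
        current, pointers = {}, {}
        for s in states:
            # Best predecessor state for ending in state s at this position.
            best = max(states, key=lambda p: prev[p] + log(trans_p[p][s]))
            current[s] = prev[best] + log(trans_p[best][s]) + log(emit_p[s][obs])
            pointers[s] = best
        scores.append(current)
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical model parameters.
states = ("conserved", "polymorphic")
start_p = {"conserved": 0.9, "polymorphic": 0.1}
trans_p = {
    "conserved": {"conserved": 0.9, "polymorphic": 0.1},
    "polymorphic": {"conserved": 0.2, "polymorphic": 0.8},
}
emit_p = {
    "conserved": {"match": 0.95, "mismatch": 0.05},
    "polymorphic": {"match": 0.4, "mismatch": 0.6},
}

# Hypothetical per-position signals derived from array hybridization.
observations = ["match"] * 3 + ["mismatch"] * 4 + ["match"] * 3
path = viterbi(observations, states, start_p, trans_p, emit_p)
print(path)
```

With these parameters, the run of four mismatches in the middle is decoded as one polymorphic region flanked by conserved sequence. The transition probabilities control how readily the model opens or closes a region, which is what allows clustered small polymorphisms to be reported as a single segment.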

Sequence variation page at FML | Natural variation page of the Weigel lab (MPI)

Georg Zeller