Extract single copy marker genes

Phylogenetic markers are genes (and proteins) that can be used to reconstruct the phylogenetic history of different organisms. One classical phylogenetic marker is the 16S ribosomal RNA gene, which is widely used but is known to be a suboptimal marker for some organisms. Efforts to find a good set of protein-coding phylogenetic marker genes (Ciccarelli et al., Science, 2006; Sorek et al., Science, 2007) led to the identification of 40 universal single-copy marker genes (MGs). These 40 marker genes occur in single copy in the vast majority of known organisms and were used to successfully reconstruct a three-domain phylogenetic tree (Ciccarelli et al., Science, 2006).

fetchMG extracts the 40 MGs from genomes and metagenomes in an easy and accurate manner. It does so using Hidden Markov Models (HMMs) trained on protein alignments of known members of the 40 MGs, together with calibrated cutoffs for each of the 40 MGs. Please note that these cutoffs are only accurate when complete protein sequences are used as input. The program outputs the protein sequences of the identified marker genes and, if the nucleotide sequences of all complete genes are supplied as an additional input, their nucleotide sequences as well.
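
The idea of combining HMM hits with per-marker cutoffs can be illustrated with a short Python sketch. This is not fetchMG's actual code: the function name assign_markers, the Hit record, and the cutoff values below are hypothetical and chosen only for illustration. The sketch assumes that an HMM search has already produced bit scores and that a higher score means a better match.

```python
from collections import namedtuple

# One HMM search hit: a protein matched against one marker gene model.
Hit = namedtuple("Hit", ["protein_id", "marker", "bitscore"])


def assign_markers(hits, cutoffs):
    """Assign each protein to its best-scoring marker that passes the cutoff.

    hits    -- iterable of Hit records
    cutoffs -- dict mapping marker name to its minimum bit score
    Returns a dict mapping protein_id to marker name.
    """
    best = {}
    for hit in hits:
        cutoff = cutoffs.get(hit.marker)
        # Skip hits for unknown markers or hits below the marker's cutoff.
        if cutoff is None or hit.bitscore < cutoff:
            continue
        current = best.get(hit.protein_id)
        # Keep only the highest-scoring passing hit per protein.
        if current is None or hit.bitscore > current.bitscore:
            best[hit.protein_id] = hit
    return {pid: h.marker for pid, h in best.items()}


# Illustrative cutoffs and hits only; these are NOT fetchMG's calibrated values.
cutoffs = {"COG0012": 210.0, "COG0016": 240.0}
hits = [
    Hit("prot_1", "COG0012", 350.2),
    Hit("prot_1", "COG0016", 245.0),
    Hit("prot_2", "COG0016", 120.5),
    Hit("prot_3", "COG0016", 410.9),
]

assigned = assign_markers(hits, cutoffs)
print(assigned)
```

In this toy input, prot_2 is discarded because its score falls below the cutoff for its marker, and prot_1 is assigned to the marker with the higher of its two passing scores. A truncated protein would typically score lower than its full-length counterpart, which is one way to see why cutoffs calibrated on complete sequences may not transfer to fragments.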

fetchMG is available as a standalone package and as a built-in part of MOCAT.