Name

downsampleAnalysis.pl - Downsample result files (such as BLAST reports or 16S classifications) by downsampling the underlying reads

Synopsis

        downsampleAnalysis.pl [options]

Options

--analysis

type of analysis to perform (functional|refgenome|16S|refcoverage|refbasecoverage)

--list

file containing entities

--size

downsample size

--refdb

name of reference db (looks for <metagenome.<refdb>.blastn.filtered> for refgenome or <genepred.<refdb>mapping.txt> for functional)
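As an illustration of how these four options fit together, the following is a minimal sketch using the core Getopt::Long module. It is not the script's own parsing code: it assumes that --size takes an integer and the other options take strings, and the argument values (including the file name entities.txt and the database name kegg) are made up.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical command line; every value here is invented for illustration.
my @args = qw(--analysis functional --list entities.txt --size 1000 --refdb kegg);

# Assumption: --size is an integer, the remaining options are strings.
my %opt;
GetOptionsFromArray(\@args, \%opt, 'analysis=s', 'list=s', 'size=i', 'refdb=s')
    or die "Could not parse options\n";

# The analysis types listed in the documentation above.
my %valid = map { $_ => 1 } qw(functional refgenome 16S refcoverage refbasecoverage);
die "Unknown analysis type: $opt{analysis}\n" unless $valid{ $opt{analysis} };

print "analysis=$opt{analysis} list=$opt{list} size=$opt{size} refdb=$opt{refdb}\n";
```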

Functions

downsample_reads($metagenome, $downsample_size, \%read_lengths)

Downsamples the metagenome sample $metagenome to $downsample_size using the read lengths in %read_lengths. The third argument is optional: if it is missing, the function queries SmashDB to populate the read lengths. Passing it is useful when generating multiple downsamples from the same sample, because the read lengths are then fetched from SmashDB only once. When %read_lengths is passed, $metagenome is not used, since it is needed only to look up the read lengths.

%read_lengths has the following format:

        $read_lengths{$template}{$read} = $length;

For paired-end sequencing, there are ideally two reads per template.
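To make the nested hash layout concrete, here is a small sketch that builds such a structure and draws a random subset from it. It is not the module's implementation. It assumes that the downsample size is a number of bases (the need for read lengths suggests this, but the documentation does not say so) and that all reads of a template are kept together. The helper name sketch_downsample_reads and the read names are hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(shuffle sum);

# Hypothetical helper: pick random templates until the selected reads
# cover at least $target_bases bases (assumption: size is in bases).
sub sketch_downsample_reads {
    my ($target_bases, $read_lengths) = @_;
    my %selected;
    my $total = 0;
    for my $template (shuffle keys %$read_lengths) {
        last if $total >= $target_bases;
        # Keep mates together: take every read of the chosen template.
        for my $read (keys %{ $read_lengths->{$template} }) {
            $selected{$read} = $read_lengths->{$template}{$read};
            $total += $selected{$read};
        }
    }
    return (\%selected, $total);
}

# Toy data keyed by template, then by read, with the length as the value.
my %read_lengths = (
    T1 => { 'T1.f' => 100, 'T1.r' => 90 },
    T2 => { 'T2.f' => 120, 'T2.r' => 110 },
    T3 => { 'T3.f' => 80 },
);

my ($selected, $total) = sketch_downsample_reads(200, \%read_lengths);
printf "selected %d reads, %d bases\n", scalar(keys %$selected), $total;
```

Because the hash reference is passed in, the same %read_lengths can be reused for any number of downsamples without fetching the lengths again.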

downsample_kegg_maps_lite_sqlite_but_slow($genepred)

Generates downsampled data of the given size as follows:

  1. Gets Read2Gene information for a genepred_id into a table
  2. For each downsample:
       i.   Downsamples reads for a given size and inserts them into a table
       ii.  Generates "downsampled" genes as the inner join of gene2read and the downsampled reads
       iii. Selects the corresponding functional annotation lines

Writing into a database tends to be slower than working in memory. When the information fits into memory, use downsample_kegg_maps_lite() instead.

downsample_kegg_maps_lite($genepred)

Generates downsampled data of the given size as follows:

  1. Gets Read2Gene information for a genepred_id into a hash
  2. For each downsample:
       i.   Downsamples reads for a given size
       ii.  Generates "downsampled" genes as the genes overlapping the downsampled reads
       iii. Selects the corresponding functional annotation lines

Labelled 'lite' because it keeps the Read2Gene information in a hash instead of writing it into a database, which makes it faster when the information fits into memory.
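The hash-based steps can be sketched as follows. This is only an illustration with toy data, not the function's actual code. The read and gene names, the tab-separated annotation format, and the variable names are all assumptions made for the example.

```perl
use strict;
use warnings;

# Step 1 (toy stand-in): Read2Gene information held in a hash.
my %read2genes = (
    r1 => ['geneA'],
    r2 => ['geneA', 'geneB'],
    r3 => ['geneC'],
    r4 => ['geneD'],
);

# Step i (toy stand-in): the reads that survived one downsample.
my @downsampled_reads = qw(r2 r4);

# Step ii: a gene is "downsampled" if any sampled read maps to it.
my %kept;
for my $read (@downsampled_reads) {
    $kept{$_} = 1 for @{ $read2genes{$read} || [] };
}
my @kept_genes = sort keys %kept;

# Step iii: keep the annotation lines of those genes.
# Assumption: one tab-separated line per gene, gene name in column one.
my @annotation_lines = (
    "geneA\tK00001",
    "geneB\tK00002",
    "geneC\tK00003",
    "geneD\tK00004",
);
my @kept_lines = grep { my ($gene) = split /\t/; $kept{$gene} } @annotation_lines;

print "$_\n" for @kept_lines;
```

Every lookup here is a hash access, which is why this approach avoids the insert and join overhead of the database variant as long as the Read2Gene data fits into memory.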
