NAME

Smash::Utils::Entrez - Utility for communicating with the NCBI Entrez server using the Entrez Programming Utilities (E-utilities).

SYNOPSIS

        use Smash::Utils::Entrez qw(:all);
        my $database = "genome";
        my $query    = "Bacteria[organism]+AND+WGS";
        retrieve_gbff_for_query($query, $database);

DESCRIPTION

Smash::Utils::Entrez provides several useful functions to communicate with the NCBI Entrez server. Most of these are queries to specific databases using search terms or specific ids. Here are some examples:

Fetching the search results for a specific search term

Suppose you want to see all the whole-genome shotgun (WGS) projects for bacterial species. You would search the NCBI genome database using the search term "bacteria[organism] AND WGS". To do this programmatically, you would say:

        my @ids = get_ids_for_query("bacteria[organism] AND WGS", "genome");

This retrieves a list of the primary ids of the results in the genome database. Note that a primary id is not the Genome Project ID assigned by GenBank; the two are unrelated. If the list of ids is all you wanted, you are done. If you need more information, read on.
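To make the search step more concrete, here is a minimal sketch of the E-utilities ESearch URL that such a query could correspond to. The helper esearch_url is hypothetical and not part of Smash::Utils::Entrez, and whether the module builds its request this way is an assumption; only the ESearch endpoint and its db, term and retmax parameters come from NCBI's E-utilities.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Smash::Utils::Entrez: builds the
# ESearch URL that a call like get_ids_for_query() could translate to.
my $EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils";

sub esearch_url {
    my ($query, $db, $retmax) = @_;
    $retmax = 10000 unless defined $retmax;
    # Percent-encode everything except unreserved characters.
    (my $term = $query) =~ s/([^A-Za-z0-9\-_.~])/sprintf("%%%02X", ord($1))/ge;
    return "$EUTILS_BASE/esearch.fcgi?db=$db&term=$term&retmax=$retmax";
}

print esearch_url("bacteria[organism] AND WGS", "genome"), "\n";
```

Pasting the printed URL into a browser is a quick way to check what a search term matches before running it through the module.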

Fetching the details of a given entry in the database

Suppose you want to see the details for one (or all) of the results of the search you performed above. You can get the details for the entry 6994 in the genome database by saying:

        my $details = get_details_for_id(6994, "genome");

This returns a reference to a hash containing the details as key-value pairs. To get a specific detail, look up the corresponding key. For example:

        my $tax_id = $details->{TaxId};
        my $accession = $details->{Caption};
        my $update_date = $details->{UpdateDate};
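If you are not sure which keys are available, you can list everything in the hash. The sketch below uses a stand-in hash in place of a real get_details_for_id() result: the TaxId and Caption values are taken from the example in this document, the UpdateDate value is invented, and detail_lines is a hypothetical helper.

```perl
use strict;
use warnings;

# Stand-in for the hash reference returned by get_details_for_id();
# the UpdateDate value is invented for illustration.
my $details = {
    TaxId      => 679926,
    Caption    => "NC_014507",
    UpdateDate => "2010/01/01",
};

# Format every key-value pair, sorted by key, one per line.
sub detail_lines {
    my ($details) = @_;
    return map { "$_\t$details->{$_}" } sort keys %$details;
}

print "$_\n" for detail_lines($details);
```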

Retrieving a whole record in GenBank format

Suppose you want to retrieve the whole GenBank record for genome entry 6994. You would then say:

        write_gbff_for_id(6994, "genome");

This will write the GenBank record to a file called 679926.NC_014507.gbff. Here 679926 is the NCBI taxonomy id for this genome and NC_014507 is the accession number of this record.
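The filename pattern can be reproduced from the details hash, which is handy if you want to know the output name without writing the file. This is a sketch that assumes the TaxId and Caption keys shown earlier are the ones used for the name; gbff_filename is a hypothetical helper, not a function of this module.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the <tax_id>.<accession>.gbff pattern.
sub gbff_filename {
    my ($details) = @_;
    return join(".", $details->{TaxId}, $details->{Caption}, "gbff");
}

print gbff_filename({ TaxId => 679926, Caption => "NC_014507" }), "\n";
```

Combined with a -e file test, this could be used to skip records that have already been downloaded.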

Retrieving all records for a search term

Given the above examples, it should be trivial to get all the records for a search term and write them into individual GenBank files:

        my @ids = get_ids_for_query("bacteria[organism] AND WGS", "genome");
        foreach my $id (@ids) {
                my $filename = write_gbff_for_id($id, "genome");
                print "Wrote $id to $filename\n" if $filename;
        }
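For long id lists you may want the loop to survive individual failures and to pause between requests; NCBI asks clients without an API key to stay at or below three requests per second. The sketch below is one way to do that. The write_all wrapper and the stub writer are hypothetical, and it is an assumption that write_gbff_for_id does not already throttle or trap errors itself.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep);

# Hypothetical wrapper: calls $writer->($id) for each id, pausing between
# requests and collecting ids that failed instead of aborting the run.
sub write_all {
    my ($writer, $pause, @ids) = @_;
    my (%written, @failed);
    foreach my $id (@ids) {
        my $filename = eval { $writer->($id) };
        if ($filename) { $written{$id} = $filename; }
        else           { push @failed, $id; }
        sleep($pause) if $pause;
    }
    return (\%written, \@failed);
}

# Stub writer standing in for: sub { write_gbff_for_id($_[0], "genome") }
my $stub = sub { my $id = shift; die "no record\n" if $id == 2; return "$id.gbff" };
my ($written, $failed) = write_all($stub, 0, 1, 2, 3);
print "wrote ", scalar(keys %$written), " files, failed: @$failed\n";
```

With the real function you would pass a pause of about 0.34 seconds instead of 0.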

FUNCTIONS

retrieve_gbff_for_lsof($lsof, $db)

Retrieves the GenBank-formatted files for the ids in the given list and writes them all locally.

retrieve_gbff_for_query($query, $db)

Retrieves the GenBank-formatted files for the ids that are results for the given query and writes them all locally.

write_gbff_for_id($id, $db)

Writes the GenBank-formatted file for the given id in the given database to <tax_id>.<accession>.gbff and returns the filename.

get_details_for_id($id, $db)

Gets the following details for the given id in the given database:

        my ($tax_id, $accession, $update_date) = get_details_for_id(6994, "genome");

get_ids_for_query($query, $db)

Gets the list of ids that result from searching the given database with the given query. NOTE: at most 10000 ids are returned. If your query would retrieve more hits, change the limit in the code, restrict your query, or use a different method to get your results.
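One "different method" is to page through the results using the ESearch retstart and retmax parameters. The sketch below only computes the page boundaries; esearch_pages is a hypothetical helper, and fetching each page and parsing the ids is left out. It also assumes you already know the total hit count, which ESearch reports in its response.

```perl
use strict;
use warnings;

# Hypothetical helper: given the total hit count reported by ESearch and a
# page size, return the (retstart, retmax) pairs needed to fetch every id.
sub esearch_pages {
    my ($total, $page_size) = @_;
    my @pages;
    for (my $start = 0; $start < $total; $start += $page_size) {
        my $left = $total - $start;
        push @pages, [$start, $left < $page_size ? $left : $page_size];
    }
    return @pages;
}

foreach my $page (esearch_pages(25000, 10000)) {
    my ($retstart, $retmax) = @$page;
    print "retstart=$retstart&retmax=$retmax\n";
}
```

For 25000 hits and a page size of 10000 this yields three pages, the last one covering the remaining 5000 ids.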
