Information Extraction Results

Large-scale extraction of regulatory gene/protein networks from Medline


Information extraction results

  • Organism-specific partitions of Medline annotated with gene expression regulation
    We provide annotation of sentences from Medline in the following bracketed structure:
    • Named entities are bracketed as [nx... ... ], where nx abbreviates noun chunk. The letters that follow identify the semantic type of the noun chunk, e.g. nxprot abbreviates protein noun chunk.
    • Relations are bracketed as [ev..., where ev refers to an event. The letters following ev indicate the template; for gene expression this is expr. The specific type of protein-gene interaction is indicated as well: act for activation, rep for repression, and reg for neutral regulation.
      The last group of letters indicates whether it is a verbal relation (v) or a nominal relation, whether it is active or passive (a vs. p), and whether it is negated (n).
    Annotations are provided for the following model organisms:

  • Organism-specific partitions of Medline with protein phosphorylation and dephosphorylation annotated
    We provide annotation of sentences from Medline in the following bracketed structure:
    • Named entities are bracketed as [nx... ... ], where nx abbreviates noun chunk. The letters that follow identify the semantic type of the noun chunk, e.g. nxprot abbreviates protein noun chunk.
    • Relations are bracketed as [ev..., where ev refers to an event. The letters following ev indicate the template, i.e. the phosphorylation (phos), the dephosphorylation (dphos), or the autophosphorylation (autophos) template.
      The last group of letters indicates whether it is a verbal relation (v) or a nominal relation, whether it is active or passive (a vs. p), and whether it is negated (n).
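
The label scheme described above can be read mechanically. The following is a minimal Python sketch, not the tooling used to produce the annotations. It assumes that the parts of an ev label are simply concatenated (optionally separated by underscores), and it leaves the trailing flag letters undecoded because their exact layout is not specified here. The sample sentence, the label nxgene, and the names describe_label and labels_in are invented for illustration.

```python
import re

# Template codes named in the text; the concatenated layout is an assumption.
TEMPLATES = ("autophos", "dphos", "phos", "expr")
INTERACTIONS = {"act": "activation", "rep": "repression", "reg": "neutral regulation"}

# An opening bracket followed directly by a label, e.g. "[nxprot" or "[evexpr...".
OPEN_LABEL = re.compile(r"\[([A-Za-z_]+)")

def describe_label(label):
    """Split one bracket label into the parts described in the text."""
    label = label.lower()
    if label.startswith("nx"):
        return {"kind": "noun chunk", "semantic_type": label[2:]}
    if label.startswith("ev"):
        rest = label[2:].lstrip("_")
        template = next((t for t in TEMPLATES if rest.startswith(t)), None)
        interaction = None
        if template is not None:
            rest = rest[len(template):].lstrip("_")
        if template == "expr":
            for code, name in INTERACTIONS.items():
                if rest.startswith(code):
                    interaction = name
                    rest = rest[len(code):].lstrip("_")
                    break
        # Whatever remains is the flag group (v, a/p, n); it is returned undecoded.
        return {"kind": "event", "template": template,
                "interaction": interaction, "flags": rest}
    return {"kind": "other", "label": label}

def labels_in(sentence):
    """Describe every bracket label in an annotated sentence, in order."""
    return [describe_label(m.group(1)) for m in OPEN_LABEL.finditer(sentence)]

# Invented sample; real annotated sentences may differ in layout.
sample = "[nxprot ProteinA ] [evexprrepva represses ] [nxgene geneB ]"
for entry in labels_in(sample):
    print(entry)
```

Matching only the opening "[label" keeps the sketch independent of how brackets nest or close, which is why it does not attempt to recover the text spans themselves.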


Revised version of the part-of-speech annotated GENIA 3.02 corpus

    We provide a revised version of the PoS annotation of the GENIA 3.02 corpus (gzipped). We have applied the following changes:
    • Re-tokenising the corpus. Example: B/CD28-responsive was formerly split into separate tokens, i.e. B, /, CD28, and -responsive.
    • Disambiguating the PoS annotation. Example: IN|CC for of/or has been changed such that of, /, and or each occur as a separate token with its own PoS tag.
    • Correcting the PoS annotation. A series of wrong PoS annotations has been changed. Example: the tags -, XT, CT, and N occur in the annotation but are not part of the UPenn tagset; we have replaced them with the correct PoS tags.
    • Adapting the tagset. We have adapted the tags such that auxiliary verbs that derive from be are annotated with VB..., verbs that derive from have are annotated with VH..., and all other verbs are annotated with VV....
    Please contact us if you would like more details about the changes to the PoS annotation.
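
To illustrate the tagset adaptation listed above, here is a minimal Python sketch, not the procedure actually applied to the corpus. It assumes the input carries UPenn verb tags (VB, VBD, VBG, VBN, VBP, VBZ) and that the letter after VB is carried over unchanged to the VH... and VV... tags. The word lists and the function name adapt_verb_tag are hypothetical, and the sketch decides by word form alone rather than by auxiliary use.

```python
# Assumed inflected forms; contractions and spelling variants are left out.
BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}
HAVE_FORMS = {"have", "has", "had", "having"}

def adapt_verb_tag(token, penn_tag):
    """Return the adapted tag for one token/tag pair."""
    if not penn_tag.startswith("VB"):
        return penn_tag          # non-verbal tags are left untouched
    suffix = penn_tag[2:]        # "", "D", "G", "N", "P" or "Z"
    word = token.lower()
    if word in BE_FORMS:
        return "VB" + suffix     # forms of "be" keep the VB... prefix
    if word in HAVE_FORMS:
        return "VH" + suffix     # forms of "have" become VH...
    return "VV" + suffix         # every other verb becomes VV...

# Invented token/tag pairs for demonstration.
for token, tag in [("was", "VBD"), ("has", "VBZ"), ("binds", "VBZ"), ("protein", "NN")]:
    print(token, tag, "->", adapt_verb_tag(token, tag))
```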


Rule and parameter files

  • Part-of-speech tagging was performed using TreeTagger with a custom parameter file:
  • Entity and relation chunking was performed using the following CASS grammars:
    • Entity chunking (CASS)
    • Extraction of expression regulation relations (CASS)
    • Extraction of (de-)phosphorylation relations (CASS)


Available literature

  • Jasmin Saric, Lars J. Jensen, and Isabel Rojas
    "Large-scale Extraction of Gene Regulation for Model Organisms in an ontological context"
    In Silico Biology, 5, 0004, 2004
    (Available online)

  • Jasmin Saric, Lars J. Jensen, Rossitza Ouzounova, Isabel Rojas, and Peer Bork
    "Extraction of regulatory gene expression networks from PubMed"
    Proceedings of the ACL 2004 Conference, Barcelona, Spain, 2004
    (PDF)