Corpora for natural language processing in the biomedical literature


Word sense disambiguation in the biomedical domain

This corpus contains a collection of terms that are ambiguous with respect to six classes of biomedical named entities. We currently provide 793 terms whose meaning differs among the classes cell, disease, drug, protein, species, and tissue, or that are also common English words. For each term, we collected a sample of texts that use the term in the given sense. These texts originate from PubMed, UniProt (descriptions), and MedlinePlus.

See [here >>] for more details.

BioCreAtIvE-PPI: a corpus for protein-protein interactions

This corpus originated from the BioCreAtIvE task 1A data set for named entity recognition of gene/protein names. We randomly selected 1000 sentences from this set and annotated them for interactions between genes/proteins. Of these, 173 sentences contain at least one interaction and 589 sentences contain at least one gene/protein. There are 255 interactions, some of which include more than two partners (e.g., when one partner occurs both with its full name and abbreviated).

 750k   bc_sentences.xml             - 1000 sentences
 750k   bc_sentences_xs.xml          - 1000 sentences, with style sheet information
  15k   bc_solution.txt              - gold standard
  15k   bc_pubmeds.txt               - mapping from sentence IDs to PubMed-IDs
Sentences are formatted like
    ..<interactor pos="VBZ">regulates</interactor> <gene>rsmB</gene> <token pos="NN">transcription</token>..
where
    interactor   marks the evidence for an interaction (with its POS tag)
    gene       denotes a gene or protein (and thus a possible interaction partner)
    token       is an arbitrary token (with its POS tag)
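
A minimal sketch of how such a fragment could be read in Python with the standard library. The enclosing <sentence> element and the helper name read_tokens are assumptions made for illustration; the full schema of bc_sentences.xml is not shown here, so the real files may need a different root or extra handling.

```python
import xml.etree.ElementTree as ET

# Fragment built from the example above. The <sentence> wrapper is an
# assumption; only the three child element types are documented here.
fragment = (
    '<sentence>'
    '<interactor pos="VBZ">regulates</interactor> '
    '<gene>rsmB</gene> '
    '<token pos="NN">transcription</token>'
    '</sentence>'
)

def read_tokens(xml_text):
    """Return (tag, text, pos) triples for each child element, in order.

    pos is None for elements without a pos attribute (e.g., <gene>).
    """
    root = ET.fromstring(xml_text)
    return [(el.tag, el.text or "", el.get("pos")) for el in root]

tokens = read_tokens(fragment)
# Possible interaction partners are the <gene> elements.
genes = [text for tag, text, _ in tokens if tag == "gene"]
```
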
The solution is formatted as follows:
Example: 0006|RsmC|5|rsmB|8|regulates|7|+|comment

    0006       is the sentence ID
    RsmC      is the (full) name of the first interaction partner (gene or protein)
    5          is the index of the first partner
    rsmB       name of the second interaction partner
    8          index of the second partner
    regulates   type of interaction, as mentioned in the text
    7          index of type descriptor
    +          agent/target relation: (+, agent=1st partner; -, 2nd is agent; *, equal)
    comment    is an optional comment

Example: 0187|meis1|4|cAMP-responsive sequence;CRS1|13;16|bind|11|+|

    full name and abbreviation occur in the text -> two relations
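
The two examples above could be parsed roughly as follows. This is a sketch under assumptions: the names Interaction and parse_solution_line are hypothetical, partner names and indices are assumed to be split on ";" as in the second example, and the type index is assumed to be a single integer as in both examples.

```python
from typing import List, NamedTuple

class Interaction(NamedTuple):
    sentence_id: str
    partner1: List[str]       # name(s) of the first partner
    partner1_idx: List[int]   # word index for each name
    partner2: List[str]
    partner2_idx: List[int]
    interaction_type: str
    type_idx: int
    direction: str            # "+", "-", or "*"
    comment: str

def parse_solution_line(line):
    """Split one '|'-separated gold-standard line into its nine fields."""
    fields = line.rstrip("\n").split("|", 8)
    if len(fields) == 8:      # tolerate a missing trailing comment field
        fields.append("")
    if len(fields) != 9:
        raise ValueError("expected 9 fields, got %d" % len(fields))
    sid, p1, i1, p2, i2, itype, tidx, direction, comment = fields
    return Interaction(
        sid,
        p1.split(";"), [int(i) for i in i1.split(";")],
        p2.split(";"), [int(i) for i in i2.split(";")],
        itype, int(tidx), direction, comment,
    )

first = parse_solution_line("0006|RsmC|5|rsmB|8|regulates|7|+|comment")
second = parse_solution_line(
    "0187|meis1|4|cAMP-responsive sequence;CRS1|13;16|bind|11|+|"
)
```

Keeping names and indices as parallel lists leaves it to the caller whether to expand a full name plus abbreviation into two relations.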
We copied both tokenization and part-of-speech tagging from the BioCreAtIvE data set. Note that gene and protein names in particular are tagged as one token (<gene>cAMP-responsive sequence</gene>), but the indexing counts every single word in such a phrase:
    .. cAMP-responsive sequence (  CRS1 )  ..
    .. 13              14       15 16   17 ..
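
The word-based indexing shown above can be reproduced with a short sketch. The function name word_indices is hypothetical, and the index of the first word is passed in explicitly because the base index used by the corpus is not stated here. Words are assumed to be separated by whitespace inside a multi-word token.

```python
def word_indices(tokens, start):
    """Pair each token with the index of its first word.

    tokens: token strings in sentence order; a multi-word gene name
            such as "cAMP-responsive sequence" is one string.
    start:  index of the first word of the first token (assumption:
            supplied by the caller, since the base is not documented).
    """
    result = []
    position = start
    for text in tokens:
        result.append((text, position))
        # Every whitespace-separated word advances the index by one.
        position += len(text.split())
    return result

example = ["cAMP-responsive sequence", "(", "CRS1", ")"]
indexed = word_indices(example, start=13)
```

For the example phrase this assigns 13 to the full name and 16 to the abbreviation, which agrees with the indices 13;16 in the gold standard line for sentence 0187.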


DIPPPI: a corpus for protein-protein interactions

This corpus is based on protein-protein interactions from the Database of Interacting Proteins (DIP), restricted to proteins from yeast. Edges in DIP contain references to PubMed abstracts, so the goal is to find the evidence for a relation in the text. Whenever possible, we include the full text in the corpus, rather than the abstract only. All plain texts were converted from PDF using pdftotext (D. Noonburg). DIP uses IDs from the Saccharomyces Genome Database (SGD) for nodes, and we use a list of names and synonyms provided by SGD to find the nodes in text.

2990261 dip_plaintexts.tar.gz   - full texts and abstracts
   9435 dip_solution.txt        - gold standard
 195874 dip_synonyms_from_sgd   - gene/protein names from SGD
The solution (tab-separated columns) is formatted as follows:
Example: PEdgeID  SGD_ID    SGD_ID    PubMed
         2862     S0005677  S0002266  8001136

    PEdgeID  DIP edge ID for this relation
    SGD_ID   SGD ID of the first partner
    SGD_ID   SGD ID of the second partner
    PubMed   reference to PubMed for this edge
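
A minimal sketch for reading this tab-separated format in Python. The function name read_dip_edges is hypothetical. Whether dip_solution.txt starts with a header row is not stated here, so the sketch simply skips any row that does not have four columns with a numeric edge ID in the first one.

```python
import csv
import io

def read_dip_edges(handle):
    """Yield (edge_id, sgd_id_1, sgd_id_2, pubmed_id) string tuples."""
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines, a possible header row, and malformed rows.
        if len(row) != 4 or not row[0].strip().isdigit():
            continue
        edge_id, sgd1, sgd2, pubmed = (field.strip() for field in row)
        yield edge_id, sgd1, sgd2, pubmed

# In-memory sample built from the example above (header row included
# only to show that it is skipped).
sample = (
    "PEdgeID\tSGD_ID\tSGD_ID\tPubMed\n"
    "2862\tS0005677\tS0002266\t8001136\n"
)
edges = list(read_dip_edges(io.StringIO(sample)))
```

For the real file, the same function could be given an open file object, for example one returned by open("dip_solution.txt", newline="").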


Large Protein-Protein Interaction Sample

This "corpus" is a large set of sentences that contain protein-protein interactions. We collected this set by searching for protein pairs from IntAct (a protein-interaction database) in all of Medline. This yielded typical sentences, or parts of sentences, that are often used to describe PPIs, such as

  PTN blocks PTN
  PTN suppressed PTN
  PTN binds the PTN
  PTN that regulates PTN
  ..
  PTN, but neither was influenced by PTN
  PTN requires the presence of comparable amounts of PTN or PTN

Published with "Collecting a large corpus from all of Medline" at SMBM'06.
Please see [here >>]


Hereditary diseases - a text classification corpus

Published with the paper "Tuning Text Classification for Hereditary Diseases with Section Weighting" at SMBM'05.
Please see [here >>]

2006-01-10, JH