This corpus contains a collection of terms that have ambiguous meanings with respect to six classes of biomedical named entities. We currently provide 793 terms that have different meanings across the classes cell, disease, drug, protein, species, and tissue, or that are also common English words. For each term, we collected a sample of texts that discuss this term in the given meaning. These texts originate from PubMed, UniProt (descriptions), and MedlinePlus.
See [here >>] for more details.
This corpus originated from the BioCreAtIvE task 1A data set for named entity recognition of gene/protein names. We randomly selected 1000 sentences from this set and added annotation for interactions between genes/proteins. 173 sentences contain at least one interaction, and 589 sentences contain at least one gene/protein. There are 255 interactions, some of which include more than two partners (e.g., when one partner occurs with both its full name and its abbreviation).
751013 bc_sentences.xml - 1000 sentences
751216 bc_sentences_xs.xml - 1000 sentences, with style sheet information
15248 bc_solution.txt - gold standard
15125 bc_pubmeds.txt - mapping from sentence IDs to PubMed IDs

Sentences are formatted like
..<interactor pos="VBZ">regulates</interactor> <gene>rsmB</gene> <token pos="NN">transcription</token>..
where
interactor depicts the evidence for an interaction (with its POS tag)
gene denotes a gene or protein (and thus a possible interaction partner)
token is an arbitrary token (with its POS tag)
The solution is formatted as follows:
Example: 0006|RsmC|5|rsmB|8|regulates|7|+|comment
0006 depicts the sentence ID
RsmC is the (full) name of the first interaction partner (gene or protein)
5 is the index of the first partner
rsmB is the name of the second interaction partner
8 is the index of the second partner
regulates is the type of interaction, as mentioned in the text
7 is the index of the type descriptor
+ is the agent/target relation (+: the first partner is the agent; -: the second partner is the agent; *: both partners have equal roles)
comment is an optional comment
Example: 0187|meis1|4|cAMP-responsive sequence;CRS1|13;16|bind|11|+|
Here the full name and the abbreviation of the second partner both occur in the text (separated by semicolons), which yields two relations.
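A minimal sketch of reading one line of the gold standard, based only on the two examples above. The names `SolutionRecord`, `parse_solution_line`, and `relations` are hypothetical and not part of the corpus. We assume that names and indexes of one partner are paired position by position when separated by semicolons, and that a line with several mentions expands into one relation per combination of mentions.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple

Mention = Tuple[str, int]  # (name as it occurs in the text, word index)


@dataclass
class SolutionRecord:
    sentence_id: str
    first: List[Mention]      # mentions of the first partner
    second: List[Mention]     # mentions of the second partner
    interaction: str          # type of interaction, as mentioned in the text
    interaction_index: str    # kept as text: the source does not say if it can hold several values
    direction: str            # "+", "-" or "*"
    comment: str


def _mentions(names: str, indexes: str) -> List[Mention]:
    """Pair semicolon-separated names with semicolon-separated indexes."""
    name_list = names.split(";")
    index_list = [int(value) for value in indexes.split(";")]
    if len(name_list) != len(index_list):
        raise ValueError(f"{len(name_list)} names but {len(index_list)} indexes")
    return list(zip(name_list, index_list))


def parse_solution_line(line: str) -> SolutionRecord:
    """Parse one '|'-separated line of the gold standard."""
    # Split at most 8 times so that a comment may itself contain '|'.
    fields = line.rstrip("\r\n").split("|", 8)
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 fields, got {len(fields)}")
    if len(fields) == 8:  # assumption: the trailing comment field may be missing
        fields.append("")
    sid, names1, idx1, names2, idx2, interaction, type_idx, direction, comment = fields
    if direction not in {"+", "-", "*"}:
        raise ValueError(f"unknown agent/target marker: {direction!r}")
    return SolutionRecord(sid, _mentions(names1, idx1), _mentions(names2, idx2),
                          interaction, type_idx, direction, comment)


def relations(record: SolutionRecord) -> List[Tuple[Mention, Mention]]:
    """Expand a record into one relation per combination of mentions."""
    return list(product(record.first, record.second))


record = parse_solution_line(
    "0187|meis1|4|cAMP-responsive sequence;CRS1|13;16|bind|11|+|")
for first, second in relations(record):
    print(record.sentence_id, first, record.interaction, second)
```

For the second example this prints two relations, one for the full name and one for the abbreviation, both with meis1 as the first partner.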
We copied both tokenization and part-of-speech tagging from the BioCreAtIvE data set. Note that gene and protein names in particular are tagged as one token (<gene>cAMP-responsive sequence</gene>), but the indexing counts every single word in such a phrase:
.. cAMP-responsive sequence ( CRS1 ) ..
.. 13 14 15 16 17 ..
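The word counting can be sketched as follows. This is an illustration under assumptions, not the procedure used to build the corpus: the function `index_words` and the wrapper element `<s>` are hypothetical, the fragment is simplified (the POS attributes are left out), words inside a phrase are assumed to be separated by whitespace, and the index of the first word is passed in explicitly because the example above does not show where counting starts.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def index_words(fragment: str, start: int) -> List[Tuple[str, str, int]]:
    """Return (tag, text, index of first word) for each tagged element.

    `fragment` is a run of <gene>, <token> and <interactor> elements;
    `start` is the word index of the first element in the fragment.
    """
    # Wrap the fragment in a hypothetical root so that it parses as XML.
    root = ET.fromstring(f"<s>{fragment}</s>")
    result = []
    position = start
    for element in root:
        text = (element.text or "").strip()
        words = text.split()
        if not words:
            continue
        result.append((element.tag, text, position))
        # A multi-word gene name is one element but advances the index
        # by the number of words it contains.
        position += len(words)
    return result


fragment = ("<gene>cAMP-responsive sequence</gene> <token>(</token> "
            "<gene>CRS1</gene> <token>)</token>")
for tag, text, index in index_words(fragment, start=13):
    print(index, tag, text)
```

With `start=13` the phrase "cAMP-responsive sequence" is reported at index 13 and "CRS1" at index 16, which matches the indexes in the gold standard example.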
This corpus is based on protein-protein interactions from the Database of Interacting Proteins (DIP), restricted to proteins from yeast. Edges in DIP contain references to PubMed abstracts, so the goal is to find the evidence for a relation in the text. Whenever possible, we include the full text in the corpus, rather than the abstract only. All plain texts were converted from PDF using pdftotext (D. Noonburg). DIP uses IDs from the Saccharomyces Genome Database (SGD) for nodes, and we use a list of names and synonyms provided by SGD to find the nodes in text.
2990261 dip_plaintexts.tar.gz - full texts and abstracts
9435 dip_solution.txt - gold standard
195874 dip_synonyms_from_sgd - gene/protein names from SGD

The solution (tab-separated columns) is formatted as follows:
Example: PEdgeID SGD_ID SGD_ID PubMed
2862 S0005677 S0002266 8001136
PEdgeID DIP edge ID for this relation
SGD_ID SGD ID of the first partner
SGD_ID SGD ID of the second partner
PubMed reference to PubMed for this edge
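A minimal sketch of reading this tab-separated gold standard. The names `DipEdge` and `read_dip_solution` are hypothetical. We assume that edge IDs are numeric, as in the example, and use this to skip a possible header line and blank lines; whether the file actually contains a header is not stated here.

```python
import io
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class DipEdge(NamedTuple):
    edge_id: str     # DIP edge ID for this relation
    sgd_first: str   # SGD ID of the first partner
    sgd_second: str  # SGD ID of the second partner
    pubmed_id: str   # PubMed reference for this edge


def read_dip_solution(lines: Iterable[str]) -> List[DipEdge]:
    """Read tab-separated gold standard lines into DipEdge records."""
    edges = []
    for line in lines:
        columns = [column.strip() for column in line.rstrip("\r\n").split("\t")]
        # Skip blank lines and anything whose first column is not a numeric
        # edge ID (for example a header line such as "PEdgeID SGD_ID ...").
        if len(columns) < 4 or not columns[0].isdigit():
            continue
        edges.append(DipEdge(*columns[:4]))
    return edges


def edges_by_pubmed(edges: Iterable[DipEdge]) -> Dict[str, List[DipEdge]]:
    """Group edges by the PubMed ID in which the evidence should be found."""
    grouped = defaultdict(list)
    for edge in edges:
        grouped[edge.pubmed_id].append(edge)
    return dict(grouped)


sample = io.StringIO("PEdgeID\tSGD_ID\tSGD_ID\tPubMed\n"
                     "2862\tS0005677\tS0002266\t8001136\n")
for edge in read_dip_solution(sample):
    print(edge)
```

With the real file, `open("dip_solution.txt")` could be passed in place of the `io.StringIO` sample. Grouping by PubMed ID is convenient because each text in the corpus is the place to look for the evidence of the edges that reference it.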
This "corpus" is a large set of sentences that contain protein-protein interactions. We collected this set by searching for protein pairs from IntAct (a protein-interaction database) in all of Medline. This led to typical (parts of) sentences often used to describe PPIs, like
PTN blocks PTN
PTN suppressed PTN
PTN binds the PTN
PTN that regulates PTN
.. PTN, but neither was influenced by PTN
PTN requires the presence of comparable amounts of PTN or PTN
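To illustrate how a sentence relates to such a pattern, here is a minimal sketch that masks known protein names with the PTN placeholder. It is an illustration only and not necessarily how this corpus was built: the function `mask_proteins`, the example sentence, and the name list are hypothetical, and matching is assumed to be case-sensitive on whole words.

```python
import re
from typing import Iterable


def mask_proteins(sentence: str, names: Iterable[str],
                  placeholder: str = "PTN") -> str:
    """Replace every whole-word occurrence of a known name by the placeholder."""
    # Try longer names first so that a short name cannot split a longer one.
    ordered = sorted(set(names), key=len, reverse=True)
    if not ordered:
        return sentence
    alternatives = "|".join(re.escape(name) for name in ordered)
    # The lookarounds require that a match is not part of a longer word.
    pattern = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)")
    return pattern.sub(placeholder, sentence)


# Hypothetical sentence and name list, for illustration only.
print(mask_proteins("RsmC regulates rsmB transcription", ["RsmC", "rsmB"]))
```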
Published with "Collecting a large corpus from all of Medline" at SMBM'06.
Please see [here >>]
Published with the paper "Tuning Text Classification for Hereditary Diseases with Section Weighting" at SMBM'05.
Please see [here >>]
2006-01-10, JH