Collecting a Large Corpus from all of Medline
Supplementary information
This sample is a large set of sentences that describe protein-protein interactions (PPIs). We collected it by searching all of Medline for protein pairs from IntAct (a protein-interaction database). The search yielded typical sentence fragments that are often used to describe PPIs, such as
PTN blocks PTN
PTN suppressed PTN
PTN binds the PTN
PTN that regulates PTN
(where PTN is a wildcard for a protein name). In addition, we found many more 'exotic' examples, such as
PTN, but neither was influenced by PTN
PTN requires the presence of comparable amounts of PTN or PTN
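The wildcard notation above can be illustrated with a minimal sketch. It is not the pipeline used to build the corpus (protein names there were recognized with the tools listed under Software); the function name mask_proteins and the example protein names are hypothetical and serve only to show how recognized names collapse into the PTN wildcard.

```python
import re


def mask_proteins(sentence, protein_names, wildcard="PTN"):
    """Replace every listed protein name in `sentence` with the wildcard.

    Hypothetical helper for illustration; matching is case-sensitive and
    restricted to whole tokens.
    """
    if not protein_names:
        return sentence
    # Try longer names first so a multi-word name wins over its prefix.
    names = sorted(protein_names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in names)
    # The lookarounds keep a name from matching inside a longer token.
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")
    return pattern.sub(wildcard, sentence)


if __name__ == "__main__":
    # Illustrative names only; they are not taken from the corpus.
    print(mask_proteins("Mdm2 binds the p53", {"Mdm2", "p53"}))
```

Run on the illustrative sentence, the sketch produces a phrase of the same shape as "PTN binds the PTN" above.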
Data
- lists of words indicating interactions: nouns, verbs, adjectives
- phrases from IntAct pairs
  - filtered for verbs: zero-word boundary, uniques only (20439 phrases, 4.5MB) - as txt file
  - filtered for verbs: one-word boundary, uniques only (23573 phrases, 5.3MB) - as txt file
  - filtered for nouns: zero-word boundary, uniques only (19954 phrases, 4.5MB) - as txt file
  - filtered for nouns: one-word boundary, uniques only (24284 phrases, 5.4MB) - as txt file
  - filtered for nouns: zero/one-word boundary, uniques only (23008 phrases, 5.0MB) - as txt file
  - filtered for nouns: two/zero-word boundary, uniques only (24498 phrases, 5.4MB) - as txt file
  - filtered for adjectives: zero-word boundary, uniques only (2509 phrases, 0.6MB) - as txt file
- 'uniques only' means that duplicate phrases, which arise when all protein names are treated as the same term, are excluded
- these phrases contain the markup 'ANY/PTN' that is short for a 'protein with any name'
- relation="1,0,-1,0" refers to a relation between the second (1) and the first (0) protein in the pattern; the interaction word is unknown (-1), as is the direction (0)
- part-of-speech tags are given in CIS notation; see some examples and the conversion to Brown/Penn tags
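The notes on the relation attribute and on unique phrases can be read as follows. This is a minimal sketch under stated assumptions: the field names (protein_a, protein_b, interaction_word, direction) and both function names are hypothetical and do not come from the data files or from wbi-tm.jar. Only the meaning of the example value "1,0,-1,0" is taken from the note above; the meaning of non-zero direction values is not specified there and is left uninterpreted.

```python
from typing import List, NamedTuple, Optional


class Relation(NamedTuple):
    # Field names are assumptions made for this sketch.
    protein_a: int                    # zero-based index of a protein in the pattern
    protein_b: int                    # zero-based index of the other protein
    interaction_word: Optional[int]   # None when the file gives -1 (unknown)
    direction: int                    # 0 means the direction is unknown


def parse_relation(value: str) -> Relation:
    """Parse a relation attribute value such as '1,0,-1,0'."""
    parts = [int(field) for field in value.split(",")]
    if len(parts) != 4:
        raise ValueError("expected four comma-separated integers: %r" % value)
    a, b, word, direction = parts
    return Relation(a, b, None if word == -1 else word, direction)


def unique_phrases(phrases: List[str]) -> List[str]:
    """Drop exact duplicates while keeping the first occurrence of each phrase."""
    return list(dict.fromkeys(phrases))


if __name__ == "__main__":
    print(parse_relation("1,0,-1,0"))
    print(unique_phrases(["PTN binds the PTN", "PTN blocks PTN", "PTN binds the PTN"]))
```

For "1,0,-1,0" the parsed relation links the second protein (index 1) to the first (index 0), with no known interaction word and no known direction.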
Software
- NER for protein names: Whatizit; we use a list of ca. 200,000 protein names taken from UniProt (protein names, gene names, synonyms)
- monq.jfa - Java Finite Automata class library
- Algorithms are available through an information extraction package: wbi-tm.jar, API documentation
Please send any questions and requests to hakenberg(a)informatik.hu-berlin.de.