A collection of links to corpora and benchmarks available for BioNLP research. We provide some corpora ourselves; an overview can be found here.
Also see my collection of BioNLP papers for publications on corpora.
Named entity recognition
- BioCreAtIvE: task 1A (2003) and GM (2007) - see GENETAG
- The BioCreAtIvE corpus originates from an evaluation of current NER systems, held in 2003/2004. The corpus contains 15,000 sentences from MedLine citations, and annotates names referring to genes and proteins, without distinguishing between the two.
The gene/protein NER task was repeated in 2007/08, adding 7,500 more sentences that were used as an evaluation set. For more information on BioCreative, see biocreative.org.
- A new and updated version of the corpus used for the BioCreAtIvE challenge. It is contained in the data set called 'MedTag', which also includes an updated version of 'MedPost'.
- This corpus contains annotations for more than 40 "entities involved in reactions": substances (e.g. atom, nucleotide, peptide) and sources (cell line, virus, etc.)
- 950 abstracts tagged for proteins; this set includes 31609 interactions from Reactome and BIND, and an annotated set of interactions.
- A set of 2258 abstracts annotated for paragraphs, sentences, part of speech, and a set of biomedical entity types; 642 of the abstracts have been syntactically annotated. The corpus covers two specific domains, one on Cyp450, the other on oncology, with annotations for substances (proteins) and malignancies (types of cancer).
- Yapex is originally an NER tagger for protein names, but the authors also provide a complete test collection of 201 MedLine abstracts with labeled protein names; the collection is often referred to by the same name.
- Linnaeus provides both a tagger and a corpus for species mention detection and identification. The corpus consists of 100 full-text articles from PubMed Central and is thus open access. The 4260 species mentions therein have been manually tagged and mapped to NCBI Taxonomy IDs.
Further data sets are available from the authors upon request, depending on the licensing of the underlying documents.
- Arizona Disease Corpus (AZDC)
- 2856 PubMed abstracts annotated for disease names (including symptoms etc.) and mapped to UMLS CUIs.
- Also see relation extraction corpora
- Most of the corpora designed for relation extraction can also be used for plain NER, e.g.:
- EDGAR: genes, cells, drugs, abbreviation(-resolution)
- Ray/Craven: genes, diseases, subcellular locations, proteins
Entity mention normalization
- BioCreative 1, task 1B
- Gene mention normalization, three different data sets (fly, yeast, mouse genes). Train and test collection each.
- BioCreative 2, GN task
- Gene mention normalization, human genes. Train and test collection each.
- 597 sentences with annotated diseases, mapped to UMLS CUIs.
Paper at LBM 2007: Assessment of diseases named entity recognition on a corpus of annotated sentences.
- GNAT comes with a data set derived from BioCreative 1 and 2 GN tasks, with abstracts annotated for genes from any species.
- Linnaeus, see NER
- Species mention normalization using NCBI Taxonomy IDs
- AZDC, see NER
- Diseases from UMLS, mentions mapped to CUIs.
- BioCreative III
- The BioCreative III challenge (2010) includes a task on species-independent gene normalization; please see biocreative.org.
Relation extraction
- Annotations for drugs, genes, and relations; also for cells and abbreviation(-resolution).
- A corpus with annotations for protein-protein interactions. Consists of 836 abstracts, with 2531 relations.
- Corpus for protein-protein interaction extraction; 230 PubMed abstracts.
- "Automatically find, assess, and gather information about proteins with experimentally verified functions."
Consists of 177 full text journal articles, with 591 experiments on wild types and 82 different mutants of 77 proteins.
- The LLL05 challenge task corpus consists of a training and test set (55+25/86 sentences, respectively) with annotations for protein/gene interactions. There are two different versions of the corpus, a basic and an enriched data set. The enriched data set contains further linguistic information (lemmas and syntactic dependencies). The training corpus is organized in two separate parts, one containing only 'simple' sentences, the other including coreferences and ellipsis. The LLL05 corpus distinguishes between agents and targets for each relation. In addition, the different types of relations are grouped (explicit action, protein-gene promoter binding, regulon family membership).
Benchmarking is done via an online scoring service. The gold standard for the training set is not publicly available (at the moment).
- BioText is a corpus for evaluation of mining disease-treatment relations. The corpus contains markup for all entities, and annotates relations between them. Not useful for NER because annotations are not consistent (e.g., "ovarian cancer" is sometimes tagged, sometimes not).
- Contains 303 MedLine abstracts, with annotations for protein-protein interactions for each.
- see above description for NER
- IE Data Sets
- Consists of three different data sets for relation extraction: subcellular locations of proteins, 780 sentences; gene-disease relations, 856 sentences; protein-protein interactions, 5456 sentences. Each set is split into five pre-defined folds. Here is a readme file with more details.
- BC-PPI and DIP-PPI
- Corpora with protein-protein interactions, derived from the BioCreAtIvE data set and DIP. The BC-PPI consists of 1000 sentences, with annotated genes/proteins and interactions; the DIP data set contains about 300 interactions from DIP.
- "Five protein-protein interaction corpora"
- A commonly used set of five protein-protein interaction corpora was described in Pyysalo et al., 2008: A Comparative Analysis of Five Protein-protein Interaction Corpora. It consists of AIMed, BioInfer, HPRD50, IEPA, and LLL05, all converted into the same format.
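Several of the corpora above come with fixed splits; the IE Data Sets, for example, are each divided into five pre-defined folds. The following is a minimal sketch of how such folds might be used for cross-validation. The `(sentence, label)` layout, the function names, and the majority-class baseline are illustrative assumptions, not the actual file format or evaluation protocol of any corpus listed here.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical in-memory layout: one (sentence, label) pair per example.
Example = Tuple[str, int]
Model = Callable[[str], int]


def cross_validate(
    folds: Sequence[List[Example]],
    train: Callable[[List[Example]], Model],
    score: Callable[[Model, List[Example]], float],
) -> List[float]:
    """Hold out each pre-defined fold once and train on the remaining folds."""
    results = []
    for i, held_out in enumerate(folds):
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(training)
        results.append(score(model, held_out))
    return results


def majority_baseline(examples: List[Example]) -> Model:
    """Toy 'classifier' that always predicts the most frequent training label."""
    labels = [label for _, label in examples]
    majority = max(set(labels), key=labels.count)
    return lambda sentence: majority


def accuracy(model: Model, examples: List[Example]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    return sum(model(s) == y for s, y in examples) / len(examples)
```

Using the folds as distributed, rather than re-splitting at random, keeps results comparable with other work evaluated on the same corpus.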
NLP for biomedical texts
- Word Sense Disambiguation (WSD) Test Collection
- Test collection of 50 highly frequent ambiguous UMLS concepts from 1998 Medline; each of the 50 ambiguous cases has 100 ambiguous instances randomly selected from the 1998 Medline citations
- BLLIP Brown-GENIA treebank
- The Brown-GENIA treebank contains hand-parses for 21 abstracts (215 sentences) from the GENIA corpus of MEDLINE abstracts related to transcription factors in human blood cells; no overlap with the GENIA treebank
Other link collections and resources
- Corpora for biomedical natural language processing
- "This site concerns recent and ongoing work in our lab on corpus construction for biomedical natural language processing. It provides information on obtaining corpora, links to our and other papers on the subject, and supplementary materials for our papers."
- Bio-NLP corpora
- From BioCreative.sourceforge.net
- BioNLP resources
- Collected by Alex Morgan
- Corpora and Terminological resources for BioNLP
- Collected by Jasmin Saric
- IE Relation Data
- Collection by Ben Hachey
- Bio Text Mining Resources
- Collection by Tamara Polajnar
- Developing Linguistic Corpora: a Guide to Good Practice
- Offers advice on basic principles, linguistic annotation, metadata, character encoding, and archiving linguistic corpora.
- http://tiny.cc/corpora - formerly known as devoted.to/corpora
- David Lee's huge collection of bookmarks for corpus-based linguists
- The Linguist List
- a "Definitive Resource of Organizations, Programs and Centers in Linguistics"
Papers on corpora, corpus development, etc.
- CALBC Silver Standard Corpus
- Rebholz-Schuhmann D, Yepes AJ, Van Mulligen EM, Kang N, Kors J, Milward D, Corbett P, Buyko E, Beisswanger E, Hahn U. Journal of Bioinformatics and Computational Biology, Volume: 8, Issue: 1 (February 2010), Page: 163-179.
[DOI: 10.1142/S0219720010004562] - [Abstract]
- A Comparative Analysis of Five Protein-protein Interaction Corpora
- Sampo Pyysalo, Antti Airola, Juho Heimonen, Jari Björne, Filip Ginter and Tapio Salakoski. BMC Bioinformatics, 9(Suppl. 3):S6, 2008. [Full text]
- Also see my collection of BioNLP papers