Links: corpora and benchmarks for BioNLP

A collection of links to corpora and benchmarks available for BioNLP research.
We provide some corpora ourselves; an overview can be found here. Also see my collection of BioNLP papers for publications on corpora.

Categories: NER - EMN - relationship extraction - various (bio-)NLP tasks - other collections - papers

Named entity recognition

BioCreAtIvE: task 1A (2003) and GM (2007) - see GENETAG
The BioCreAtIvE corpus originates from an evaluation of current NER systems, held in 2003/2004. The corpus contains 15,000 sentences from MedLine citations and annotates names referring to genes and proteins, without distinguishing between the two.
The gene/protein NER task was repeated in 2007/08, adding 7,500 more sentences that were used as an evaluation set. For more information on BioCreative, see
GENETAG at BioC (2015)
"The GENETAG corpus contains 20K sentences of manually annotated gene/protein names. The first 15K sentences were used for the BioCreative 1 (Task 1A) competition in 2004, and the rest, 5K sentences were used as test data for BioCreative II (Gene Mention Task) competition in 2005. [..] This corpus has now been converted to BioC format and is available for download at the BioC website on SourceForge."
A new and updated version of the corpus used for the BioCreAtIvE challenge. It is contained in the data set called 'MedTag', which also includes an updated version of 'MedPost'.
This corpus contains annotations for more than 40 "entities involved in reactions": substances (e.g. atom, nucleotide, peptide) and sources (cell line, virus, etc.)
950 abstracts tagged for proteins; this set includes 31609 interactions from Reactome and BIND, and an annotated set of interactions.
A set of 2258 abstracts annotated for paragraphs, sentences, part of speech, and a set of biomedical entity types; 642 of the abstracts have been syntactically annotated. Falls into two specific domains, one covering abstracts on Cyp450, the other on oncology. Annotations for substances (proteins) and malignancies (types of cancer).
Yapex is originally an NER tagger for protein names, but the authors also provide a complete test collection of 201 MedLine abstracts with labeled protein names; this collection is often referred to by the same name.
Linnaeus provides both a tagger and a corpus for species mention detection and identification. The corpus consists of 100 full text articles from PubMedCentral, thus open access. The 4260 species mentions therein have been manually tagged and mapped to NCBI Taxonomy IDs.
Upon request, further data sets (which depend on licensing of the underlying documents) are available from the authors.
Arizona Disease Corpus (AZDC)
2856 PubMed abstracts annotated for disease names (including symptoms etc.) and mapped to UMLS CUIs.
Also see relation extraction corpora
Of course, you can also use most of the corpora designed for relation extraction for plain NER, e.g.:
- EDGAR: genes, cells, drugs, abbreviation(-resolution)
- Ray/Craven: genes, diseases, subcellular locations, proteins

Entity mention normalization

BioCreative 1, task 1B
Gene mention normalization on three different data sets (fly, yeast, and mouse genes), each with a training and a test collection.
BioCreative 2, GN task
Gene mention normalization for human genes, with a training and a test collection.
597 sentences with annotated diseases, mapped to UMLS CUIs.
Paper at LBM 2007: Assessment of disease named entity recognition on a corpus of annotated sentences.
GNAT comes with a data set derived from BioCreative 1 and 2 GN tasks, with abstracts annotated for genes from any species.
Linnaeus, see NER
Species mention normalization using NCBI Taxonomy IDs
Diseases from UMLS, mentions mapped to CUIs.
BioCreative III
The BioCreative III challenge (2010) includes a task on species-independent gene normalization; please see

Relationship extraction

Annotations for drugs, genes, and relations; also for cells and abbreviation(-resolution).
A corpus with annotations for protein-protein interactions. Consists of 836 abstracts, with 2531 relations.
Corpus for protein-protein interaction extraction; 230 PubMed abstracts.
"Automatically find, assess, and gather information about proteins with experimentally verified functions."
Consists of 177 full text journal articles, with 591 experiments on wild types and 82 different mutants of 77 proteins.
The LLL05 challenge task corpus consists of a training and test set (55+25/86 sentences, respectively) with annotations for protein/gene interactions. There are two different versions of the corpus, a basic and an enriched data set. The enriched data set contains further linguistic information (lemmas and syntactic dependencies). The training corpus is organized in two separate parts, one containing only 'simple' sentences, the other including coreferences and ellipsis. The LLL05 corpus distinguishes between agents and targets for each relation. In addition, the different types of relations are grouped (explicit action, protein-gene promoter binding, regulon family membership).
Benchmarking is done via an online scoring service. The gold standard for the training set is not publicly available (at the moment).
BioText is a corpus for evaluation of mining disease-treatment relations. The corpus contains markup for all entities, and annotates relations between them. Not useful for NER because annotations are not consistent (e.g., "ovarian cancer" is sometimes tagged, sometimes not).
Contains 303 MedLine abstracts, with annotations for protein-protein interactions for each sentence.
see above description for NER
IE Data Sets
Consists of three different data sets for relation extraction: subcellular locations of proteins, 780 sentences; gene-disease relations, 856 sentences; protein-protein interactions, 5456 sentences. Each set is split into five pre-defined folds. Here is a readme file with more details.
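Pre-defined folds like these are usually used for cross-validation: each fold is held out once for testing while the others are used for training. A minimal sketch, assuming the folds have already been loaded into a dictionary; the fold names, the sentences, and the function `leave_one_fold_out` are hypothetical placeholders and do not reflect the actual file layout of the data sets.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_fold_out(
    folds: Dict[str, List[str]],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_name, train_sentences, test_sentences) per fold."""
    names = sorted(folds)
    for held_out in names:
        # The held-out fold is the test set for this round.
        test = list(folds[held_out])
        # All remaining folds together form the training set.
        train = [s for name in names if name != held_out for s in folds[name]]
        yield held_out, train, test


# Placeholder data: five folds with three dummy sentences each (an assumption,
# not the real corpus content).
folds = {
    f"fold{i}": [f"sentence {i}.{j}" for j in range(3)] for i in range(1, 6)
}

splits = list(leave_one_fold_out(folds))
for name, train, test in splits:
    print(name, len(train), len(test))
```

Keeping the pre-defined folds, rather than re-splitting at random, keeps results comparable with other work that reports on the same splits.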
Corpora with protein-protein interactions, derived from the BioCreAtIvE data set and DIP. The BC-PPI consists of 1000 sentences, with annotated genes/proteins and interactions; the DIP data set contains about 300 interactions from DIP.
"Five protein-protein interaction corpora"
A commonly used set of five protein-protein interaction corpora was described in Pyysalo et al., 2008: A Comparative Analysis of Five Protein-protein Interaction Corpora. It consists of AIMed, BioInfer, HPRD50, IEPA, and LLL05, all converted into the same format.

NLP for biomedical texts

Word Sense Disambiguation (WSD) Test Collection
Test collection of 50 highly frequent ambiguous UMLS concepts from 1998 Medline; each of the 50 ambiguous cases has 100 ambiguous instances randomly selected from the 1998 Medline citations
BLLIP Brown-GENIA treebank
The Brown-GENIA treebank contains hand-parses for 21 abstracts (215 sentences) from the GENIA corpus of MEDLINE abstracts related to transcription factors in human blood cells; no overlap with the GENIA treebank

Other link collections and resources

Corpora for biomedical natural language processing
"This site concerns recent and ongoing work in our lab on corpus construction for biomedical natural language processing. It provides information on obtaining corpora, links to our and other papers on the subject, and supplementary materials for our papers."
Bio-NLP corpora
BioNLP resources
Collected by Alex Morgan
Corpora and Terminological resources for BioNLP
Collected by Jasmin Saric
IE Relation Data
Collection by Ben Hachey
Bio Text Mining Resources
Collection by Tamara Polajnar
Developing Linguistic Corpora: a Guide to Good Practice
Offers advice on basic principles, linguistic annotation, metadata, character encoding, and archiving linguistic corpora. - formerly known as
David Lee's huge collection of bookmarks for corpus-based linguists
The Linguist List
a "Definitive Resource of Organizations, Programs and Centers in Linguistics"

Papers on corpora, corpus development, etc.

CALBC Silver Standard Corpus
Rebholz-Schuhmann D, Yepes AJ, Van Mulligen EM, Kang N, Kors J, Milward D, Corbett P, Buyko E, Beisswanger E, Hahn U. Journal of Bioinformatics and Computational Biology, Volume: 8, Issue: 1 (February 2010), Page: 163-179.
[DOI: 10.1142/S0219720010004562] - [Abstract]
A Comparative Analysis of Five Protein-protein Interaction Corpora
Sampo Pyysalo, Antti Airola, Juho Heimonen, Jari Björne, Filip Ginter and Tapio Salakoski. BMC Bioinformatics, 9(Suppl. 3):S6, 2008. [Full text]
- also see my collection of BioNLP papers