BioCreAtIvE: task 1A (2003) and GM (2007) - see GENETAG
The BioCreAtIvE corpus originates from an evaluation of then-current NER systems, held in 2003/2004. The corpus contains 15,000 sentences from MEDLINE citations and annotates names referring to genes and proteins, without distinguishing between the two.
The gene/protein NER task was repeated in 2007/08, adding 7,500 more sentences that were used as an evaluation set. For more information on BioCreative, see biocreative.org.
"The GENETAG corpus contains 20K sentences of manually annotated gene/protein names. The first 15K sentences were used for the BioCreative 1 (Task 1A) competition in 2004, and the rest, 5K sentences were used as test data for BioCreative II (Gene Mention Task) competition in 2005. [..] This corpus has now been converted to BioC format and is available for download at the BioC website on SourceForge."
A set of 2258 abstracts annotated for paragraphs, sentences, part of speech, and a set of biomedical entity types; 642 of the abstracts have also been syntactically annotated. The corpus falls into two specific domains, one covering abstracts on Cyp450, the other on oncology. Annotations cover substances (proteins) and malignancies (types of cancer).
Yapex was originally an NER tagger for protein names, but its authors also provide a complete test collection of 201 MEDLINE abstracts with labeled protein names. This collection is often referred to by the same name.
Linnaeus provides both a tagger and a corpus for species mention detection and identification. The corpus consists of 100 open-access full-text articles from PubMed Central. The 4260 species mentions therein have been manually tagged and mapped to NCBI Taxonomy IDs.
Upon request, further data sets (which depend on licensing of the underlying documents) are available from the authors.
Most of the corpora designed for relation extraction can also be used for plain NER, e.g.:
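Mapping a tagged species mention to an NCBI Taxonomy ID can be pictured as a dictionary lookup. The snippet below is only a minimal sketch of that idea and not how the Linnaeus tagger itself works; the `SPECIES_IDS` table and the `normalize` function are hypothetical names, and the table holds just a few well-known taxonomy IDs.

```python
# Hypothetical lookup table: lower-cased species names -> NCBI Taxonomy IDs.
SPECIES_IDS = {
    "human": 9606,
    "homo sapiens": 9606,
    "mouse": 10090,
    "mus musculus": 10090,
    "drosophila melanogaster": 7227,
}

def normalize(mention):
    """Return the taxonomy ID for a species mention, or None if it is unknown."""
    return SPECIES_IDS.get(mention.strip().lower())

print(normalize("Homo sapiens"))
print(normalize("unicorn"))
```

A real system additionally has to resolve ambiguous names and abbreviations, which a plain lookup cannot do.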
- EDGAR: genes, cells, drugs, abbreviation(-resolution)
- Ray/Craven: genes, diseases, subcellular locations, proteins
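Reusing a relation extraction corpus for NER usually means dropping the relation annotations and keeping only the entity spans. The following is a minimal sketch of that step, assuming entities are available as character offsets with a type label; the example sentence, the offsets, the `GENE` label, and the `bio_tags` function are all made up for illustration, and whitespace tokenization is a simplification.

```python
import re

def bio_tags(text, entities):
    """Return (token, tag) pairs; entities are (start, end, label) character spans."""
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in entities:
            # A token belongs to an entity if it lies completely inside the span.
            if match.start() >= start and match.end() <= end:
                tag = ("B-" if match.start() == start else "I-") + label
                break
        tagged.append((match.group(), tag))
    return tagged

# Hypothetical sentence with two annotated gene mentions.
sentence = "SigK activates transcription of gerE"
entities = [(0, 4, "GENE"), (32, 36, "GENE")]
for token, tag in bio_tags(sentence, entities):
    print(token, tag)
```

The actual file formats differ from corpus to corpus, so a format-specific reader would have to produce the `(start, end, label)` triples first.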
Entity mention normalization
BioCreative 1, task 1B
Gene mention normalization on three different data sets (fly, yeast, and mouse genes), each with a training and a test collection.
"Automatically find, assess, and gather information about proteins with experimentally verified functions."
Consists of 177 full text journal articles, with 591 experiments on wild types and 82 different mutants of 77 proteins.
The LLL05 challenge task corpus consists of a training and test set (55+25/86 sentences, respectively) with annotations for protein/gene interactions. There are two different versions of the corpus, a basic and an enriched data set. The enriched data set contains further linguistic information (lemmas and syntactic dependencies). The training corpus is organized in two separate parts, one containing only 'simple' sentences, the other including coreferences and ellipsis. The LLL05 corpus distinguishes between agents and targets for each relation. In addition, the different types of relations are grouped (explicit action, protein-gene promoter binding, regulon family membership).
Benchmarking is done via an online scoring service. The gold standard for the training set is not publicly available (at the moment).
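Because each relation distinguishes an agent from a target, a prediction with the two roles swapped does not match the gold annotation. The sketch below illustrates this with a local precision/recall/F1 computation over directed pairs; it is an assumption-laden illustration, not the method used by the online scoring service, and the `Interaction` type, the `score` function, and the protein names `A` to `D` are hypothetical.

```python
from typing import NamedTuple

class Interaction(NamedTuple):
    """A directed interaction: the agent acts on the target."""
    agent: str
    target: str

def score(gold, predicted):
    """Return (precision, recall, F1) over exactly matching directed pairs."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

gold = [Interaction("A", "B"), Interaction("C", "D")]
# The second prediction has agent and target swapped, so it counts as an error.
predicted = [Interaction("A", "B"), Interaction("D", "C")]
print(score(gold, predicted))
```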
BioText is a corpus for evaluation of mining disease-treatment relations. The corpus contains markup for all entities, and annotates relations between them. Not useful for NER because annotations are not consistent (e.g., "ovarian cancer" is sometimes tagged, sometimes not).
Consists of three different data sets for relation extraction: subcellular locations of proteins, 780 sentences; gene-disease relations, 856 sentences; protein-protein interactions, 5456 sentences. Each set is split into five pre-defined folds. A readme file provides more details.
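Pre-defined folds make results comparable across systems, since everyone trains and tests on the same splits. A minimal sketch of iterating over such folds is shown below; it assumes the folds have already been loaded into a list of lists (the on-disk format is not modelled here), and `cross_validation_splits` is a hypothetical helper name.

```python
def cross_validation_splits(folds):
    """Yield (train, test) pairs, holding out each pre-defined fold once."""
    for i, held_out in enumerate(folds):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, list(held_out)

# Placeholder data: five folds of sentence identifiers.
folds = [[1, 2], [3], [4, 5], [6], [7]]
for train, test in cross_validation_splits(folds):
    print(len(train), "training items,", len(test), "test items")
```

Keeping the given fold boundaries, rather than reshuffling the sentences, is what allows a direct comparison with previously reported numbers.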
Corpora with protein-protein interactions, derived from the BioCreAtIvE data set and DIP. The BC-PPI consists of 1000 sentences, with annotated genes/proteins and interactions; the DIP data set contains about 300 interactions from DIP.
The Brown-GENIA treebank contains hand-parses for 21 abstracts (215 sentences) from the GENIA corpus of MEDLINE abstracts related to transcription factors in human blood cells. There is no overlap with the GENIA treebank.
"This site concerns recent and ongoing work in our lab on corpus construction for biomedical natural language processing. It provides information on obtaining corpora, links to our and other papers on the subject, and supplementary materials for our papers."
a "Definitive Resource of Organizations, Programs and Centers in Linguistics"
Papers on corpora, corpus development, etc.
CALBC Silver Standard Corpus
Rebholz-Schuhmann D, Yepes AJ, Van Mulligen EM, Kang N, Kors J, Milward D, Corbett P, Buyko E, Beisswanger E, Hahn U. Journal of Bioinformatics and Computational Biology, 8(1):163-179, February 2010.
DOI: 10.1142/S0219720010004562
A Comparative Analysis of Five Protein-protein Interaction Corpora
Sampo Pyysalo, Antti Airola, Juho Heimonen, Jari Björne, Filip Ginter and Tapio Salakoski. BMC Bioinformatics, 9(Suppl. 3):S6, 2008.