Collecting a Large Corpus from all of Medline

Jörg Hakenberg1*, Ulf Leser1, Harald Kirsch2, Dietrich Rebholz-Schuhmann2

1 Humboldt-Universität zu Berlin, Department of Computer Science, Knowledge Management Group, Unter den Linden 6, 10099 Berlin, Germany.
2 European Bioinformatics Institute, Hinxton/Cambridge, UK.
* Corresponding author. Current affiliation: Knowledge Management in Bioinformatics, Dept. of Computer Science, Humboldt-Universität zu Berlin, Rudower Chaussee 25, 12489 Berlin, Germany. Phone: +49.30.2093.3903, Email: hakenberg(a)informatik.hu-berlin.de


Abstract

We present our ideas and first results for a system that extracts protein-protein interactions from scientific publications. The system consists of three main stages. First, we extract a large sample of sentences from unannotated text. Second, we generate language patterns, using multiple sentence alignment to identify consensus phrases. Third, we apply these patterns to arbitrary text, again using sentence alignment. In this paper, we concentrate on the first stage: extracting a large training sample from Medline. We search for sentences in which both partners of a known protein-protein interaction occur, and then refine the resulting set to exclude false positives. In this way, we extract almost 68,000 example sentences that discuss protein-protein interactions.
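The co-occurrence search described above can be illustrated with a minimal sketch. It is not the authors' implementation: the sentence splitter is a naive regular expression, the name matching is a plain case-insensitive string match, and the function names (`split_sentences`, `mentions`, `cooccurrence_sentences`) as well as the protein names `ProtA` and `ProtB` are hypothetical placeholders chosen for illustration. The refinement step that removes false positives is not shown.

```python
import re
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def split_sentences(text: str) -> List[str]:
    # Naive splitter (an assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def mentions(sentence: str, name: str) -> bool:
    # Case-insensitive match; the name must not be part of a longer alphanumeric token.
    pattern = r"(?<![A-Za-z0-9])" + re.escape(name) + r"(?![A-Za-z0-9])"
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None


def cooccurrence_sentences(text: str, known_pairs: Iterable[Pair]) -> List[Tuple[str, Pair]]:
    # Keep every sentence that names both partners of at least one known interaction.
    pairs = list(known_pairs)
    hits = []
    for sentence in split_sentences(text):
        for a, b in pairs:
            if mentions(sentence, a) and mentions(sentence, b):
                hits.append((sentence, (a, b)))
    return hits


# Hypothetical input: placeholder protein names, not data from the paper.
example_text = (
    "ProtA binds ProtB in vivo. "
    "ProtB is expressed in liver. "
    "ProtAB was also studied together with ProtB."
)
example_hits = cooccurrence_sentences(example_text, [("ProtA", "ProtB")])
```

In this toy input only the first sentence is kept: the second lacks one partner, and in the third `ProtA` appears only as part of the longer token `ProtAB`. Real protein names have many synonyms and spelling variants, so a practical system would need a proper name recognizer in place of `mentions`.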


Supplementary information

Please see [here >>]


To appear in
Proceedings of the Second International Symposium on Semantic Mining in Biomedicine (SMBM). Jena, Germany, April 2006.

@InProceedings{Hakenberg:2006a,
  author = {J\"org Hakenberg and Ulf Leser and Harald Kirsch and Dietrich Rebholz-Schuhmann},
  title = {Collecting a Large Corpus from all of Medline},
  booktitle = {Second International Symposium on Semantic Mining in Biomedicine, SMBM},
  address = {Jena, Germany},
  month = {April},
  year = 2006
}