What's in a gene name? Automated refinement of gene name dictionaries

Jörg Hakenberg

  Current affiliation: Bioinformatics group, Biotechnological Centre, Technische Universität Dresden, Tatzberg 47-51, 01307 Dresden, Germany.
  eMail: hakenberg(a)informatik.hu-berlin.de


Abstract

Many approaches for named entity recognition rely on dictionaries gathered from entries in hand-curated databases (such as Entrez Gene for gene names). Strategies for matching entries in a dictionary against arbitrary text use either inexact string matching that allows for known deviations, dictionaries enriched according to some observed rules, or a combination of both. Such refined dictionaries cover potential structural, lexical, orthographical, or morphological variations.
In this paper, we present an approach to automatically analyze dictionaries to discover how names are composed and which variations typically occur. This knowledge can be constructed by looking at single entries (names and synonyms for one gene), and then be transferred to entries that show similar patterns in one or more synonyms. For instance, knowledge about words that are frequently missing in (or added to) a name ("antigen", "protein", "human") could automatically be extracted from dictionaries. Also, knowledge about structural changes ("CD95 receptor", "receptor of CD95") and abbreviations ("FasL", "Fas ligand") can be learned from single entries and then transferred to all other gene names that initially lack this variation.
This paper should be seen as a vision paper, although we have implemented most of the ideas presented and show results for the task of gene name recognition. The automatically extracted name composition rules can easily be included in existing approaches and provide valuable insights into the biomedical sub-language.
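
To make the idea of learning a variation from single entries and transferring it to others more concrete, here is a minimal Python sketch of one case only: words that are frequently missing in (or added to) a name. It is an illustration under stated assumptions, not the implementation described in the paper. The function names (`optional_words`, `expand`), the `min_support` threshold, the whitespace tokenization, and the toy entries with their identifiers are all hypothetical.

```python
from collections import Counter


def optional_words(dictionary, min_support=2):
    """Find words whose removal turns one synonym of a gene into
    another synonym of the same gene, e.g. "CD95 antigen" -> "CD95".

    `dictionary` maps a gene identifier to a list of names/synonyms.
    `min_support` (an assumed parameter) is the number of distinct
    entries that must show the pattern before a word is accepted.
    """
    counts = Counter()
    for synonyms in dictionary.values():
        names = {s.lower() for s in synonyms}
        found = set()
        for name in names:
            tokens = name.split()
            for i, token in enumerate(tokens):
                shorter = " ".join(tokens[:i] + tokens[i + 1:])
                # The word is optional here if dropping it yields
                # another known synonym of the same gene.
                if shorter and shorter in names:
                    found.add(token)
        counts.update(found)  # count each word once per entry
    return {word for word, n in counts.items() if n >= min_support}


def expand(dictionary, words):
    """Transfer the learned pattern: add a variant without the
    optional words to every entry, including entries that
    initially lack this variation."""
    expanded = {}
    for gene_id, synonyms in dictionary.items():
        variants = set(synonyms)
        for name in synonyms:
            tokens = name.split()
            kept = [t for t in tokens if t.lower() not in words]
            if kept and len(kept) < len(tokens):
                variants.add(" ".join(kept))
        expanded[gene_id] = variants
    return expanded


# Hypothetical toy dictionary; identifiers and entries are illustrative only.
toy = {
    "gene_a": ["CD95", "CD95 antigen", "Fas"],
    "gene_b": ["CD4", "CD4 antigen"],
    "gene_c": ["CD19 antigen"],
}
learned = optional_words(toy)
refined = expand(toy, learned)
```

In this toy dictionary, "antigen" is learned from the first two entries and then transferred to the third, which gains the variant "CD19". Structural changes and abbreviations could plausibly follow the same discover-then-transfer shape, but they would need their own pattern definitions, which this sketch does not attempt.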


Proceedings of the BioNLP 2007 workshop at ACL 2007, pp. 153-160, June 29, 2007, Prague


@InProceedings{Hakenberg:2007c,
  author = {J\"org Hakenberg},
  title = {What's in a gene name? Automated refinement of gene name dictionaries},
  booktitle = {Proc BioNLP 2007 workshop at ACL 2007},
  pages = {153--160},
  address = {Prague},
  month = {June 29},
  year = 2007
}