An overview of some of my projects that are published and openly available. Find more detailed descriptions below.

Text Mining

SNPshot - A repository of genetic variants linked to phenotypic effects on drug response
Main publication: J. Hakenberg et al., A SNPshot of PubMed to find associations between genetic variants, drugs, and diseases, J Biomed Info, 45(5):842-50, 2012; also GPD-Rxn Workshop at PSB 2010.
[Webpage] offline
GNAT - Gene mention recognition and normalization
Main publication: J. Hakenberg et al., Inter-species normalization of gene mentions with GNAT, Bioinformatics 24(16):i126-32, 2008; update in Bioinformatics 2011.
[GNAT on SourceForge]
CBioC2 - Collaborative Bio Curation, version 2
[Webpage] offline
AliBaba - PubMed as a graph
Main publication: C. Plake et al., AliBaba: PubMed as a graph, Bioinformatics 22(19):2444-5, 2007.

Resequencing algorithms and variant annotation

Exome sequencing data analysis pipeline
RNA-seq data analysis pipeline
Knowledge base for genetic variant annotation


MAPPP - MHC class I antigenic peptide processing prediction
Main publication: J. Hakenberg et al., MAPPP: MHC class I antigenic peptide processing prediction, Applied Bioinformatics 2(3):155-8, 2003.

Text mining in biomedical literature

Mining attributed interactions between biological entities

The vast majority of knowledge in the life sciences is not represented in databases, but in research articles. There are currently about 16 million texts, containing highly relevant research findings in semistructured form. We study the extraction of relations between various objects of interest for biological, chemical, or clinical research. These relations refer, e.g., to interactions between genes or proteins, or to the influence of drugs on cells and diseases. All kinds of relations are described in the literature, and parsing and analyzing data expressed in natural language is not an easy task. Starting with the extraction of interactions between proteins, we seek to apply computational linguistics and machine learning techniques to extract other relations as well. Mining relations between diseases and treatments is another issue we address in this project. Descriptions in publications often follow certain 'patterns', i.e., the syntax and word choice used by the authors. We learn and refine such patterns from examples, and apply them to arbitrary data.

BioCreAtIvE II.5 - Critical Assessment of Information Extraction Systems in Biology

BioCreative II.5 addressed the biomedical text mining tasks of finding articles discussing protein-protein interactions, extracting these interactions, and extracting their detection methods. Our group participated in the "IPS" task, scanning full text articles for evidence of protein-protein interactions, and mapping interacting proteins to UniProt.

BioCreAtIvE 2 - Critical Assessment of Information Extraction Systems in Biology

BioCreAtIvE is an evaluation / challenge cup organized by the BioLINK group. It aims to provide common benchmarks for the performance of natural language processing systems working on biomedical literature. We participated with solutions for the gene name normalization (GN) and protein-interaction pairs (IPS) tasks. Our system for the GN task achieved the top-scoring results among 20 participants. Our system for the IPS task yielded the highest recall among all participants (ca. 15), and ranked in the top 5 for f-score. See also our project description for BioCreAtIvE 2003.
[BioCreAtIvE 2]

Obesity associated genes

In this project we seek to identify genes associated with obesity/adipositas using text mining techniques. Experimentally verified or claimed dependencies are described in published articles. We search the Medline abstract database to extract such relations from text. The idea we pursue in a sub-project is to identify meaningful contexts (e.g., sentences) describing a relation between a gene and the disorder using as few manually annotated examples as possible. We start with relatively simple, but precise, keyword searches that reliably return contexts that i) discuss the disorder, ii) contain a gene that has a known association with the disorder, or iii) contain the evidence for such an association. These contexts then provide a sample to infer models from. For example, such keywords might be unambiguous names of disorders or genes, which cannot have a second meaning (in Medline). A gene name like "PCR" may not be a good starting point, while "uncoupling protein 1" refers only to the protein or gene.
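The seed-selection step described above can be sketched as follows. This is only an illustration, not the project's actual implementation: the seed lists, the function names, and the example sentences are assumptions, and a sentence is kept here only if it mentions both a seed disorder and a seed gene.

```python
import re

# Hypothetical seed lists: names assumed to be unambiguous in Medline.
DISORDER_TERMS = ["obesity", "adipositas"]
GENE_TERMS = ["uncoupling protein 1"]


def find_terms(sentence, terms):
    """Return the seed terms that occur in the sentence as whole words or phrases."""
    lowered = sentence.lower()
    return [t for t in terms
            if re.search(r"\b" + re.escape(t) + r"\b", lowered)]


def select_seed_contexts(sentences):
    """Keep sentences that mention both a seed disorder and a seed gene."""
    contexts = []
    for sentence in sentences:
        disorders = find_terms(sentence, DISORDER_TERMS)
        genes = find_terms(sentence, GENE_TERMS)
        if disorders and genes:
            contexts.append((sentence, genes, disorders))
    return contexts


# Made-up example sentences, not taken from Medline.
sentences = [
    "Uncoupling protein 1 expression is reduced in obesity.",
    "PCR was used to amplify the fragment.",
    "Obesity is a growing health problem.",
]
seed_contexts = select_seed_contexts(sentences)
```

Only the first sentence is kept, since it is the only one containing both a seed gene and a seed disorder. The retained contexts could then serve as the initial sample from which models are inferred.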


The LLL05 challenge task is to learn rules to extract protein/gene interactions from biology abstracts from the Medline bibliography database. Training and test sets consist of sentences extracted from Medline abstracts, annotated with agent/target relations between proteins and genes. These relations are grouped into actions, bindings, regulons, and no interactions. A basic data set contains interactions only; another data set was enriched with linguistic information. Six groups participated in the challenge. On the basic data set, our system scored best, with an f-measure of 52%. The best overall system used the linguistic information, and scored an f-measure of 53%.
[LLL'05 Challenge] [LLL'05 Workshop]

Related genes in M. tuberculosis

The focus of this work is the extraction of interacting genes from Mycobacterium tuberculosis (and related species) from texts. This refers not only to spatially interacting proteins, but also to paralogues, genes appearing in the same operon, gene products sharing similar functions, and some others. This project aims at the automated annotation of a database for proteins from M. tuberculosis, 2D-PAGE.

Text mining for systems biology

This project aims at developing an automated information retrieval system for kinetic data from online publications. Such data are necessary for in silico modeling of cells and whole organisms (systems biology). Our first research tasks were concerned with text classification for finding publications relevant to the topic. We developed a text processing pipeline to classify documents with a support vector machine approach. We currently study the recognition of biologically relevant objects in texts, i.e., finding names referring to enzymes or substrates, reaction rates, and other kinetic data. The mining of relations between such entities is our ultimate goal in this project.

Ali Baba

We develop GUIs for automated processing and information extraction from texts. These tools provide different levels of human intervention, either relying on defined processing pipelines for various predefined biological research questions, or allowing for subsequent manual refinement of texts.

Named entity recognition

Projects on this topic deal with the recognition of biologically relevant objects (entities) in scientific publications. Such objects can be names referring to genes, proteins, cell types, drugs, diseases, or treatments, among many others.

BioCreAtIvE - Critical Assessment of Information Extraction Systems in Biology

BioCreAtIvE is an evaluation / challenge cup organized by the BioLINK group. It aims to provide common benchmarks for the performance of natural language processing systems working on biomedical literature. Three different problems have to be solved, with all participating groups working on the same training and evaluation data, to compare and evaluate the best performing systems and methods. We participated with a solution for the named entity recognition task, building automated systems for the recognition of gene and protein names. About 20 groups participated in this particular task, where the best systems achieved an f-measure of 82%. Our initial solution scored an f-measure of 72%; after the workshop, we improved our system, which now scores 78% on the BioCreAtIvE data set.
[BioCreAtIvE] [BioLINK]

Learning with few labeled examples

With this project, we aim at developing methods for learning models from only a few labeled examples. In particular, we address the problem of annotating a sparsely labeled text collection step by step, to obtain more and more labeled examples with high precision. We start with a small set of precise annotations, where we are sure that the selected examples (words, phrases, short text passages) occur with only one single meaning throughout the whole collection. Looking at the context of these examples, we try to find similar or related passages in the remaining text, skipping the part that corresponds to the example itself. We deduce that the skipped part has the same meaning as the original example, and thus we have automatically identified a new one, without manual intervention.
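A minimal sketch of this context-based bootstrapping, assuming whitespace tokenization and exact matching of a fixed-width window of words on both sides of an example. The function names, the window width, and the example sentences are illustrative assumptions; the project's actual similarity measure is not specified here.

```python
def contexts_of(tokens, phrase, width=2):
    """Yield (left, right) word contexts around each occurrence of phrase."""
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            left = tuple(tokens[max(0, i - width):i])
            right = tuple(tokens[i + n:i + n + width])
            # Only keep complete contexts of the requested width.
            if len(left) == width and len(right) == width:
                yield left, right


def bootstrap(sentences, seeds, width=2, max_len=3):
    """Propose new phrases that occur in the same contexts as the seeds."""
    tokenized = [s.lower().split() for s in sentences]
    seed_phrases = [s.lower().split() for s in seeds]

    # Step 1: collect the contexts in which the trusted seeds occur.
    known = set()
    for tokens in tokenized:
        for phrase in seed_phrases:
            known.update(contexts_of(tokens, phrase, width))

    # Step 2: look for the same contexts elsewhere and read off the skipped part.
    found = set()
    for tokens in tokenized:
        for i in range(width, len(tokens)):
            left = tuple(tokens[i - width:i])
            for n in range(1, max_len + 1):
                right = tuple(tokens[i + n:i + n + width])
                if len(right) == width and (left, right) in known:
                    if tokens[i:i + n] not in seed_phrases:
                        found.add(" ".join(tokens[i:i + n]))
    return found


# Made-up example sentences with one trusted seed.
sentences = [
    "expression of the ucp1 gene was increased in mice",
    "expression of the leptin gene was increased in rats",
    "we measured the weight of all animals",
]
new_phrases = bootstrap(sentences, seeds=["ucp1"])
```

Here the seed "ucp1" occurs between "of the" and "gene was", so "leptin", which fills the same slot in the second sentence, is proposed as a new example. Newly found phrases could be added to the seeds and the procedure repeated, which is one way to read the step-by-step annotation described above.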

Text classification

We use classification of texts, i.e., the association of texts with categories or topics, as a component for building information extraction systems. In a current project, we study improvements of text classification using section weighting. The approach uses data on information density and coverage, which are heterogeneous in scientific texts. Depending on the research question, individual sections of a text can be more or less relevant. We evaluate this approach with texts from OMIM, which we try to automatically assign to their proper topic, i.e., a hereditary disease.
In the projects on literature mining for systems biology (see above), we use text classification as a filtering technique to find and rank relevant documents. Subsequent information extraction steps may then work on pre-selected texts only.
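To illustrate the section weighting mentioned above, here is a minimal sketch of how per-section weights could enter a bag-of-words representation before classification. The section names, the weights, and the function name are hypothetical; the weighting actually used in the project is not specified here.

```python
from collections import Counter

# Hypothetical weights; in practice they would depend on the research question.
SECTION_WEIGHTS = {"title": 3.0, "abstract": 2.0, "methods": 0.5}


def weighted_term_vector(document, weights=SECTION_WEIGHTS, default=1.0):
    """Sum term counts over all sections, scaling each section by its weight."""
    vector = Counter()
    for section, text in document.items():
        w = weights.get(section, default)
        for term in text.lower().split():
            vector[term] += w
    return vector


# Made-up document, split into named sections.
doc = {
    "title": "hereditary disease",
    "abstract": "a hereditary disease of the liver",
    "methods": "liver samples were sequenced",
}
vec = weighted_term_vector(doc)
```

Terms from highly weighted sections contribute more to the resulting vector than terms that appear only in sections considered less relevant. Such a vector could then be passed to any standard classifier.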

BioNLP corpora

Our aim is to collect and provide corpora for natural language processing in the biomedical literature. These corpora are annotated by domain experts, and range from general tasks (named entity recognition for genes) to very specific questions (articles discussing particular genetic disorders).
So far, we have gathered text samples concerning:

If you are interested in obtaining any of these data sets, you can find most of them as supplementary information for one of our papers. If you have any questions, feel free to contact us.

MAPPP - MHC class I Antigenic Peptide Processing Prediction

This project deals with the prediction of T cell epitopes. These epitopes are presented to CD8-positive cytotoxic T lymphocytes on the cell surface by the major histocompatibility complex class I (MHC class I). We combine different mathematical models for predicting possible proteasomal processing and use these to calculate the epitopes most likely to form complexes with MHC class I transporter molecules. Currently, we try to use statistical machine learning techniques to generate models from labeled samples and use them for the prediction of new epitopes.

Positional dependencies in protein domains

Methods from information theory and statistics are used to identify important residues of protein domains. Protein sequences forming similar proteins with similar functions in different organisms are retrieved from various databases. Their alignment is then used to characterize conserved parts of the structure. As an example, conserved positions can be identified by calculating the statistical entropy for each position in the protein domain. To find positional dependencies within the protein domain, we use covariance and correlation measures.
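The per-position entropy calculation can be sketched as follows, assuming an ungapped alignment of equal-length sequences and Shannon entropy in bits. The function names and the toy alignment are illustrative assumptions; handling of gaps and the dependency measures between positions are not shown.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues observed in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def alignment_entropies(alignment):
    """Entropy for every position of an alignment of equal-length sequences."""
    return [column_entropy(col) for col in zip(*alignment)]


# Toy alignment of four made-up sequences.
alignment = ["MKVA", "MKLA", "MRIA", "MRVA"]
entropies = alignment_entropies(alignment)
```

Fully conserved positions (the first and last columns here) have an entropy of zero, while more variable positions have higher values, so low-entropy positions are candidates for conserved residues.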