Our goal is to develop and study algorithms that discover knowledge in large databases and large document archives. We analyze learning problems and explore the foundations, design principles, and properties of learning algorithms. We investigate applications mainly in information retrieval and bioinformatics.
| Active and
semi-supervised learning from text. |
A central challenge in
classification learning is making effective use of unlabeled
training data. We investigate algorithms that learn classifiers from
a few labeled and many unlabeled examples. Research papers and websites can
be described by their context in the citation graph, in addition to
their intrinsic content. Multi-view learning methods make effective use of both
unlabeled training examples and the additional context information from the
citation network. They are based on an elementary
principle: the error risk of a consensus decision among multiple decision
makers is lower than the individual risk of each single decision
maker. |
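The consensus principle can be illustrated with a simplified calculation. The sketch below is not the multi-view method itself; it assumes an odd number of decision makers that err independently with the same error rate below 0.5 and combines them by majority vote. The function name `majority_error` is hypothetical and chosen for illustration only.

```python
from math import comb


def majority_error(p: float, n: int) -> float:
    """Probability that a majority vote of n voters is wrong.

    Assumptions (not taken from the text above): the voters err
    independently, each with the same error rate p, and n is odd
    so that ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd so that ties cannot occur")
    # The vote is wrong when more than half of the voters are wrong.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Compare a single decision maker with consensus decisions of 3 and 5.
for n in (1, 3, 5):
    print(n, round(majority_error(0.3, n), 5))
```

Under these assumptions the consensus error shrinks as decision makers are added. If the individual errors are strongly correlated, or the individual error rate exceeds 0.5, the advantage can disappear.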
| Information retrieval: spam
identification and user assistance. |
Machine learning can contribute to several
problems in information retrieval. We view spam identification as a
game between two opponents (spam sender and spam filter) who react to
each other's moves. We are looking for a winning strategy that allows
us to identify spam email that will be sent in the future. We develop user assistance systems that utilize knowledge contained in previously edited text documents in order to support a user, for instance, in writing an email or editing a document. |
| Text mining in bioinformatics. |
To build biological
models that, for instance, predict the function of certain genes, it is
necessary to consider information that is scattered across a large
number of scientific publications. We investigate methods that extract
relevant information automatically from research papers and utilize
this information for model building. |
| Mining data streams. |
We investigate the principles of
algorithms that analyze large databases and discover and
explicate hidden knowledge. Learning from very
large databases is among the challenges of knowledge discovery.
Sampling algorithms process databases that are too large to iterate
over every record, yet still provide optimality guarantees. We analyze
learning methods and are interested in the methodology of evaluating
learners. |
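One simple way to see how sampling can come with a guarantee is Hoeffding's inequality. The sketch below is an illustration under stated assumptions, not the sampling algorithms referred to above: it estimates the fraction of records that satisfy a predicate from a uniform random sample drawn with replacement, so that the estimate is within `epsilon` of the true fraction with probability at least `1 - delta`. The names `hoeffding_sample_size` and `estimate_fraction` are hypothetical.

```python
import math
import random


def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    """Sample size n with 2 * exp(-2 * n * epsilon**2) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon**2))


def estimate_fraction(records, predicate, epsilon, delta, rng):
    """Estimate the fraction of records satisfying predicate.

    Assumes records supports len() and indexing, and that a uniform
    sample with replacement is acceptable. The required sample size
    does not depend on the number of records.
    """
    n = hoeffding_sample_size(epsilon, delta)
    hits = 0
    for _ in range(n):
        record = records[rng.randrange(len(records))]
        if predicate(record):
            hits += 1
    return hits / n


# Illustrative data: one million integers, a quarter of them divisible by 4.
rng = random.Random(0)
records = range(1_000_000)
estimate = estimate_fraction(records, lambda r: r % 4 == 0, 0.02, 0.05, rng)
print(hoeffding_sample_size(0.02, 0.05), estimate)
```

The sample size depends only on the requested accuracy and confidence, which is why this kind of guarantee remains usable when the database is too large to scan completely.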