This seminar is used by the members of the two research groups
as a forum for discussion and exchange. Students and
guests are cordially invited. The research seminar "Neue
Entwicklungen im Datenbankbereich" (New Developments in Databases)
by Prof. Freytag takes place afterwards in room 3.113.
| Date
|
Topic
|
Speaker
|
|
23.10.2003
|
litsift: Automated Text Categorization in Bibliographic
Search
In bioinformatics there are research topics that
cannot be uniquely characterized by a set of keywords because
relevant keywords are (i) also heavily used in other contexts and
(ii) often omitted in relevant documents because the context is
clear to the target audience. Information retrieval interfaces
such as Entrez/PubMed produce either low precision or low recall
in this case. To achieve high recall at reasonable precision,
the results of a broad information retrieval search have to be
filtered to remove irrelevant documents. We use automated text
categorization for this purpose. In this study we use the
topic of conserved secondary RNA structures in viral genomes as a
running example. PubMed result sets for two virus groups,
Picornaviridae and Flaviviridae, were manually labeled by
human experts. We evaluated various classifiers from the Weka
toolkit together with different feature selection methods to
assess whether classifiers trained on documents dedicated to one
virus group can be successfully applied to filter literature on
other virus groups. Our results indicate that in this domain a
bibliographic search tool trained on a reference corpus may
significantly reduce the amount of time needed for extensive
literature searches.
|
Lukas Faulstich
|
|
30.10.2003
|
Integrative approaches of the Biomedical knowledge in a
frame of liver transcriptome analysis
The rapid emergence of new biotechnological platforms for
large-scale investigations in human health (genome, transcriptome
and proteome) calls for further advances in bioinformatics
techniques to handle the data and knowledge generated by these
technologies. A tremendous and steadily growing amount of this
knowledge is deposited in disparate public web resources, and is
in turn essential for sharing information, interpreting results
and reasoning about or suggesting new hypotheses.
Given these challenging requirements, the talk will focus
on the example of studying liver pathologies in silico, using
expression levels of genes in different physiopathological
situations, enriched with annotations extracted from a variety of
scientific sources and standards. How can data on
liver genes from heterogeneous sources be integrated, organized,
refreshed and analyzed efficiently with respect to a target
question, which in our case is specific to an organ and a
pathology? This depends less on building the data warehouse itself
than on the challenging questions that the design of such an
environment may raise: knowledge representation and modeling,
integration issues and integrated analyses.
|
Fouzia Moussouni
|
|
6.11.2003
|
--
|
TBA
|
|
13.11.2003
|
--
|
TBA
|
|
20.11.2003
|
Leger: An Interactive Bioinformatics Tool for Large-Scale
Data Exploration and Knowledge Discovery in Proteome Research
We describe an integrated proteome database, termed Leger, which can
store, retrieve and analyse various information based on LC-MS
data. Leger is designed as a laboratory information platform that
manages an entire set of experimental data. It also provides query
functions, data mining tools that analyse the statistical
significance of experimental data, and links to other available
public and non-public databases. The user interface is web-based,
so that data acquisition and queries can be performed on remote
computers within the intranet. In particular, the subproteomes of
different mutants or species can be compared with each other,
resulting in lists of exclusively expressed, up/down-regulated or
equally expressed proteins. These lists can be linked to diverse
information. Our current research on Listeria will be used as a
working example.
|
Dr. Guido Dieterich
|
|
27.11.2003
|
--
|
TBA
|
|
4.12.2003
|
Integration / Mapping of Biological Databases Using
BioSQL as an Intermediate Schema
The talk covers the
integration of two new data sources (SwissProt, Gene Ontology)
into the database project Columba (www.columba-db.de),
which is co-developed at the chair "Wissensmanagement in der
Bioinformatik" (Knowledge Management in Bioinformatics)
of Prof. Dr. Ulf Leser. In particular, it discusses
experiences with BioSQL as an intermediate schema, the mapping from
BioSQL to the target schema within the Columba database, and the
integration into the database via Python.
|
Raphael Bauer
(Slides - pdf, 530k)
|
|
11.12.2003
|
--
|
TBA
|
|
18.12.2003
|
Text Mining in Java and Lessons from the
BioCreAtIvE-Cup
The talk presents the core components of a new
Java text mining API developed at the chair and
shows how they can be integrated into a process (Document
Retrieval, Preprocessing, Representation and Classification).
The BioCreAtIvE-Cup (Critical Assessment of
Information Extraction Systems in Biology) addresses various
questions in the area of text mining in biomedical
publications. This year's tasks were the recognition of gene
and protein names in free text (Named Entity Recognition), the
mapping of these gene names to unique identifiers (Entity
Mention Normalization), and the automated annotation of
entities/proteins with their (Gene Ontology) annotations.
The seminar presents a solution approach developed in cooperation
with the chair for Knowledge Management.
|
Conrad Plake
|
|
25.12.2003
|
(Christmas break)
|
|
|
1.1.2004
|
(Christmas break)
|
|
|
8.1.2004
|
--
|
TBA
|
|
15.1.2004
|
Database Research Topics - A Report on the
"Lowell Database Research Self Assessment"
Meeting
Leading database researchers meet every
few years to discuss the current state of research and
to take a look into the future:
Which topics will become important in the coming years?
What should one work on? What are the big
problems? In May 2003 the most recent of these
meetings took place in Lowell, MA, with 29 participants. We
want to briefly present the results of this workshop
and then discuss them together: Are these
really the big topics of the coming years? Were
topics forgotten? What can we contribute to
solving them?
|
Jens Bleiholder, Felix Naumann
(Slides - pdf, 600k)
|
|
22.1.2004
|
Text Mining and its Applications to the Life
Science Domain
This talk gives an overview of our work on text mining for the life sciences during the last year and our plans for the near future.
We start with a closer look at problems from the biochemical/biomedical domain for which solutions using text mining, natural language processing, and machine learning have been proposed. Some of these solutions are quite useful, but none of them answers every aspect of a given query. We will present possible shortcomings of the systems existing today and try to find hints at the reasons for their limitations.
+++ Classification of publications with respect to sets of pre-defined categories +++ Clustering of large volumes +++ Discovering protein-protein-interaction networks +++ Finding evidence for facts or entities in texts +++
Addressing a couple of unsolved questions leads to the particular topics we try to concentrate on in ongoing and forthcoming projects. Some of the ideas we present have been described in principle, but were never implemented and evaluated in real-world scenarios.
+++ Multi-class categorization +++ Weighting schemes +++ Extensions of the vector space model +++ Contexts vs. concepts +++
Possible topics concern the representation of documents as an essential basis for the application of text mining techniques; the extraction and generation of object properties appropriate for discrimination and similarity metrics; and the transformation and reduction of these property sets.
|
Jörg Hakenberg
|
|
29.1.2004
|
DDL2XSD – Automatic Generation of an XML Schema for a
given Relational Schema
Relational databases are the prevalent means of storing data, and XML is the de facto standard for publishing data on the World Wide Web. The interaction of a relational database application with the WWW requires transforming data stored in a flat relational structure into a hierarchical XML structure.
This mapping is tedious, time-consuming and error-prone when done manually,
which motivates algorithms and tools that perform this task
automatically. DDL2XSD, which I developed as part of my diploma thesis at IBM
Almaden Research Center, creates an XML Schema for a relational schema
specified in an SQL DDL file. This talk will give an overview of the different
components of DDL2XSD as well as algorithms developed to determine the
hierarchical XML structure.
|
Melanie Weis
|
|
5.2.2004
|
--
|
TBA
|
|
10.2.2004
|
BCB Lecture: The IMB Jena Image Library of Biological
Macromolecules (Location: Humboldt-Kabinett)
The IMB Jena Image Library of Biological Macromolecules (www.imb-jena.de/IMAGE.html) is
aimed at a better dissemination of information on three-dimensional biopolymer
structures with an emphasis on visualization and analysis [1]. It consists of
two parts. A division on Basic Information on Biological Macromolecules
includes, for example, the Amino Acid Repository, the Base Pair Directory and
information on Nucleic Acids Nomenclature and Structure as well as on
Structural Elements in Proteins. Also introductions to X-ray Crystallography,
NMR Spectroscopy, Fourier Transform Infrared Spectroscopy and Circular
Dichroism are available. On the other hand, the Atlas of Macromolecule
Structures provides access to all structure entries deposited at the Protein
Data Bank (PDB) and the Nucleic Acid Database (NDB). It offers many tools for
an in-depth analysis of individual structures with an emphasis on
visualization. Analysis tools are designed both for complete structures and for
structure parts such as domains, ligands, active sites, cis-peptide and SS
bonds.
Given the increasing number of known three-dimensional biopolymer structures
and with the data explosion in other fields of biology there is an urgent need
for classification systems and for up-to-date and reliable cross-referencing
schemes to other databases. To fulfil these needs the Image Library includes
among other classification schemes the Hetero Components and Site Databases, a
Comprehensive Bending Classification of Nucleic Acid Structures and a
Genus/Species Classification. Databases that cannot directly be accessed by the
PDB code are linked via a SWISS-PROT/PDB cross-reference scheme. The most
recent additions to the Image Library's features are a Gene Ontology (GO)
interface to the PDB and a SCOP domain viewer.
The atlas pages include links to about 30 other databases. More recently,
disease-related information resources such as OMIM have been
included. Finally, there are ongoing efforts to link PDB structures to genomics
resources. Examples that are already operational are GeneCensus, PRESAGE and
the eukaryotic genome browser Ensembl. Many leading biological databases, such
as the PDB and Swiss-Prot, reference the Image Library.
The IMB Jena Image Library has now served the scientific community for about 10
years. In addition to this scientific usage, the database is also of value as
an educational resource and has received attention from the general
public. Images from the database have been used in many books, journals,
newspapers and exhibitions. The database was featured as Site of the
Month by Science in April 2001 and is also included in the listing of The Web's
Best Sites of the Encyclopedia Britannica.
|
Jürgen Sühnel, IMB Jena
|
|
12.2.2004
|
Generating Dirty XML Data
To test data cleansing algorithms, data containing specific
amounts of predefined errors is needed. With the ever-growing volume
of XML data, developing such algorithms for nested
XML data has become a highly topical issue as well. The talk will address
errors in nested XML data and present a tool, developed as part of a
student research project (Studienarbeit), that generates "dirty" XML data
for testing these algorithms.
|
Sven Puhlmann
|
|
19.2.2004
|
(cancelled)
|
TBA
|