Information Integration Group | Knowledge Management in Bioinformatics Group

New Developments in Bioinformatics and Information Integration (research seminar, winter semester 2003/04)

Prof. Felix Naumann and Prof. Ulf Leser

Thursdays, 11 a.m. to 1 p.m., RUD 25, Room 4.111

This seminar serves the members of both research groups as a forum for discussion and exchange. Students and guests are cordially invited. The research seminar "Neue Entwicklungen im Datenbankbereich" (New Developments in Databases) by Prof. Freytag takes place immediately afterwards in Room 3.113.

The following dates and talks are scheduled so far:

Date Topic Speaker

23.10.2003

litsift: Automated Text Categorization in Bibliographic Search
In bioinformatics there are research topics that cannot be uniquely characterized by a set of keywords, because relevant keywords are (i) also heavily used in other contexts and (ii) often omitted in relevant documents because the context is clear to the target audience. In this case, information retrieval interfaces such as Entrez/PubMed produce either low precision or low recall. To achieve high recall at reasonable precision, the results of a broad information retrieval search have to be filtered to remove irrelevant documents. We use automated text categorization for this purpose.
In this study we use the topic of conserved secondary RNA structures in viral genomes as a running example. PubMed result sets for two virus groups, Picornaviridae and Flaviviridae, have been manually labeled by human experts. We evaluated various classifiers from the Weka toolkit together with different feature selection methods to assess whether classifiers trained on documents dedicated to one virus group can be successfully applied to filter literature on other virus groups. Our results indicate that in this domain a bibliographic search tool trained on a reference corpus may significantly reduce the time needed for extensive literature searches.
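The filtering idea described above can be sketched in a few lines. This is only an illustration, not the litsift implementation: the study used classifiers from the Weka toolkit, whereas the sketch below swaps in a hand-written multinomial naive Bayes classifier, and all example texts, labels and function names are made up for the purpose of the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train_naive_bayes(labeled_docs):
    """Fit a multinomial naive Bayes model on (text, label) pairs."""
    class_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for token in tokenize(text):
            counts[token] += 1
            vocabulary.add(token)
    return class_counts, word_counts, vocabulary

def classify(model, text):
    """Return the label with the highest log posterior for a text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        counts = word_counts[label]
        # Laplace smoothing: add one to every word count.
        denominator = sum(counts.values()) + len(vocabulary)
        score = math.log(class_counts[label] / total_docs)
        for token in tokenize(text):
            if token in vocabulary:  # ignore words never seen in training
                score += math.log((counts[token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data standing in for labeled abstracts on one virus group.
train_docs = [
    ("conserved RNA secondary structure elements in the genome", "relevant"),
    ("stem loop structure predicted in the untranslated region", "relevant"),
    ("clinical outcome of vaccination in patients", "irrelevant"),
    ("epidemiology and seroprevalence survey in children", "irrelevant"),
]
model = train_naive_bayes(train_docs)

# Toy abstracts standing in for a broad search result on another virus group.
candidates = [
    "predicted stem loop structure in the genome",
    "vaccination survey in children",
]
filtered = [doc for doc in candidates if classify(model, doc) == "relevant"]
```

Here `filtered` keeps only the candidates that the model labels as relevant, which mirrors the step of filtering a broad, high-recall result set to raise precision.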

Lukas Faulstich

30.10.2003

Integrative approaches of the Biomedical knowledge in a frame of liver transcriptome analysis
The rapid emergence of new biotechnological platforms for large-scale investigations in human health (genome, transcriptome and proteome) calls for further advances in bioinformatics techniques to handle the data and knowledge these technologies generate. A tremendous and steadily growing amount of this knowledge is deposited in disparate public web resources, and it is in turn essential for sharing information, interpreting results, and reasoning about or suggesting new hypotheses. Given these challenging requirements, the talk will focus on the example of studying liver pathologies in silico, using gene expression levels in different physiopathological situations enriched with annotations extracted from a variety of scientific sources and standards. How can data on liver genes from heterogeneous sources be integrated, organized, refreshed and analyzed efficiently with respect to a target question, which in our case is specific to an organ and a pathology? The focus lies less on building the data warehouse itself than on the challenging questions that the design of such an environment raises: knowledge representation and modeling, integration issues, and integrated analyses.

Fouzia Moussouni

6.11.2003

--

TBA

13.11.2003

--

TBA

20.11.2003

Leger: An Interactive Bioinformatics Tool for Large-Scale Data Exploration and Knowledge Discovery in Proteome Research
We describe an integrated proteome database, termed Leger, which can store, retrieve and analyse a variety of information based on LC-MS data. Leger is designed as a laboratory information platform that manages an entire set of experimental data. It also provides query functions, data mining tools that analyse the statistical significance of experimental data, and links to other public and non-public databases. The user interface is web-based, so that data acquisition and querying can be performed from remote computers within the intranet. In particular, the subproteomes of different mutants or species can be compared with each other, resulting in lists of exclusively expressed, up- or down-regulated, or equally expressed proteins. These lists can be linked to diverse information.
Our current research on Listeria will be used as a working example.
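The subproteome comparison mentioned in the abstract can be illustrated with a minimal sketch. It is not Leger's method: the function name, the protein identifiers, the abundance values and the log2 fold-change threshold are all assumptions chosen for illustration, and the statistical significance analysis described in the abstract is not modelled here.

```python
import math

def compare_subproteomes(abundance_a, abundance_b, log2_threshold=1.0):
    """Sort proteins into exclusive, up/down-regulated and equal lists.

    Both arguments map a protein identifier to a positive abundance value.
    The fold-change threshold is an assumed, illustrative cut-off.
    """
    groups = {"only_a": [], "only_b": [], "up_in_b": [], "down_in_b": [], "equal": []}
    for protein in sorted(set(abundance_a) | set(abundance_b)):
        a = abundance_a.get(protein, 0.0)
        b = abundance_b.get(protein, 0.0)
        if a > 0 and b > 0:
            # Protein detected in both samples: compare by log2 fold change.
            log_ratio = math.log2(b / a)
            if log_ratio >= log2_threshold:
                groups["up_in_b"].append(protein)
            elif log_ratio <= -log2_threshold:
                groups["down_in_b"].append(protein)
            else:
                groups["equal"].append(protein)
        elif a > 0:
            groups["only_a"].append(protein)
        elif b > 0:
            groups["only_b"].append(protein)
    return groups

# Hypothetical identifiers and made-up abundances for two samples.
wild_type = {"P1": 10.0, "P2": 4.0, "P3": 8.0, "P4": 5.0}
mutant = {"P2": 16.0, "P3": 2.0, "P4": 5.5, "P5": 7.0}
groups = compare_subproteomes(wild_type, mutant)
```

Each resulting list could then serve as a starting point for linking the proteins to further information, as the abstract describes.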

Dr. Guido Dieterich

27.11.2003

--

TBA

4.12.2003

Integration / Mapping of Biological Databases Using BioSQL as an Intermediate Schema
The talk covers the integration of two new data sources (SwissProt, Gene Ontology) into the database project Columba (www.columba-db.de), which is co-developed at the chair "Wissensmanagement in der Bioinformatik" (Knowledge Management in Bioinformatics) of Prof. Dr. Ulf Leser. In particular, it discusses experiences with BioSQL as an intermediate schema, the mapping from BioSQL to the target schema within the Columba database, and the integration into the database via Python.

Raphael Bauer

(Slides - PDF, 530k)

11.12.2003

--

TBA

18.12.2003

Text Mining in Java and Experiences from the BioCreAtIvE Cup
The talk presents the core components of a new Java text mining API developed at the chair and shows how they can be combined into a process (document retrieval, preprocessing, representation and classification). The BioCreAtIvE Cup (Critical Assessment of Information Extraction Systems in Biology) addresses various questions in the area of text mining in biomedical publications. This year's tasks were the recognition of gene and protein names in free text (named entity recognition), the mapping of these gene names to unique identifiers (entity mention normalization), and the automated annotation of entities/proteins with their Gene Ontology annotations. The seminar presents a solution developed in cooperation with the Chair of Knowledge Management.

Conrad Plake

25.12.2003

(Christmas break)

 

1.1.2004

(Christmas break)

 

8.1.2004

--

TBA

15.1.2004

Database Research Topics: A Report on the "Lowell Database Research Self Assessment" Meeting
Leading database researchers meet every few years to discuss the current state of research and to look ahead: Which topics will become important in the coming years? What should one work on? What are the big problems?
The most recent of these meetings took place in May 2003 in Lowell, MA, with 29 participants. We will briefly present the results of this workshop and then discuss them together: Are these really the major topics of the coming years? Were any topics forgotten? What can we contribute to solving them?

Jens Bleiholder, Felix Naumann

(Slides - PDF, 600k)

22.1.2004

Text Mining and its Applications to the Life Science Domain
This talk gives an overview of our work on text mining for the life sciences during the last year and our plans for the near future.
We start with a closer look at problems from the biochemical/biomedical domain for which solutions using text mining, natural language processing, and machine learning have been proposed. Some of these solutions are quite useful, but none of them answers every aspect of a given query. We will present possible shortcomings of today's systems and try to find hints as to the reasons for their limitations.
+++ Classification of publications with respect to sets of pre-defined categories +++ Clustering of large volumes +++ Discovering protein-protein interaction networks +++ Finding evidence for facts or entities in texts +++
Addressing a number of unsolved questions leads to the particular topics we try to concentrate on in ongoing and forthcoming projects. Some of the ideas we present have been described in principle but were never implemented and evaluated in real-world scenarios.
+++ Multi-class categorization +++ Weighting schemes +++ Extensions of the vector space model +++ Contexts vs. concepts +++
Possible topics concern the representation of documents as an essential basis for applying text mining techniques; the extraction and generation of object properties suitable for discrimination and similarity metrics; and the transformation and reduction of these property sets.

Jörg Hakenberg

29.1.2004

DDL2XSD – Automatic Generation of an XML Schema for a given Relational Schema
Relational databases are the prevalent form of data storage, and XML is the de facto standard for publishing data on the World Wide Web. For a relational database application to interact with the WWW, data stored in a flat relational structure must be transformed into a hierarchical XML structure.
Done manually, this mapping is tedious, time-consuming and error-prone, which motivates algorithms and tools that perform the task automatically. DDL2XSD, which I developed as part of my diploma thesis at IBM Almaden Research Center, creates an XML Schema for a relational schema specified in an SQL DDL file. This talk will give an overview of the different components of DDL2XSD as well as the algorithms developed to determine the hierarchical XML structure.
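To make the flat-to-hierarchical mapping concrete, here is a minimal sketch. It does not reproduce the DDL2XSD algorithms, which are not described in this abstract. As a simplifying assumption, it nests each table under the table it references through a foreign key. The toy DDL subset, the type mapping and all function names are hypothetical, and cyclic or self-referencing foreign keys are not handled.

```python
import re
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
# Assumed, deliberately small mapping from SQL types to XML Schema types.
SQL_TO_XSD = {"INTEGER": "xs:integer", "VARCHAR": "xs:string", "DATE": "xs:date"}

def parse_ddl(ddl):
    """Parse a toy DDL subset into {table: (columns, parent_table_or_None)}."""
    tables = {}
    for name, body in re.findall(r"CREATE TABLE (\w+)\s*\((.*?)\);", ddl, re.S):
        columns, parent = [], None
        for part in body.split(","):
            part = part.strip()
            fk = re.match(r"FOREIGN KEY \(\w+\) REFERENCES (\w+)", part)
            if fk:
                parent = fk.group(1)
            else:
                col_name, col_type = part.split()[:2]
                # Drop a length suffix such as "(50)" from the type name.
                columns.append((col_name, col_type.split("(")[0].upper()))
        tables[name] = (columns, parent)
    return tables

def build_xsd(tables):
    """Emit an XML Schema that nests each child table inside its parent."""
    schema = ET.Element(f"{{{XS}}}schema")

    def add_table(container, name):
        columns, _ = tables[name]
        element = ET.SubElement(container, f"{{{XS}}}element", name=name)
        complex_type = ET.SubElement(element, f"{{{XS}}}complexType")
        sequence = ET.SubElement(complex_type, f"{{{XS}}}sequence")
        for col_name, col_type in columns:
            ET.SubElement(sequence, f"{{{XS}}}element", name=col_name,
                          type=SQL_TO_XSD.get(col_type, "xs:string"))
        # Tables referencing this one become repeatable nested elements.
        for child, (_, parent) in tables.items():
            if parent == name:
                nested = add_table(sequence, child)
                nested.set("minOccurs", "0")
                nested.set("maxOccurs", "unbounded")
        return element

    for name, (_, parent) in tables.items():
        if parent is None:  # tables without a foreign key become top level
            add_table(schema, name)
    return schema

ddl = """
CREATE TABLE customer (
    id INTEGER,
    name VARCHAR(50)
);
CREATE TABLE purchase (
    id INTEGER,
    customer_id INTEGER,
    placed DATE,
    FOREIGN KEY (customer_id) REFERENCES customer(id)
);
"""
ET.register_namespace("xs", XS)
schema = build_xsd(parse_ddl(ddl))
xsd_text = ET.tostring(schema, encoding="unicode")
```

In this toy example the `purchase` element ends up nested inside `customer`. A real tool has to decide between several possible hierarchies, for instance when a table references more than one parent.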

Melanie Weis

5.2.2004

--

TBA

10.2.2004

BCB Lecture: The IMB Jena Image Library of Biological Macromolecules (Location: Humboldt-Kabinett)
The IMB Jena Image Library of Biological Macromolecules (www.imb-jena.de/IMAGE.html) is aimed at a better dissemination of information on three-dimensional biopolymer structures with an emphasis on visualization and analysis [1]. It consists of two parts. A division on Basic Information on Biological Macromolecules includes, for example, the Amino Acid Repository, the Base Pair Directory and information on Nucleic Acids Nomenclature and Structure as well as on Structural Elements in Proteins. Also introductions to X-ray Crystallography, NMR Spectroscopy, Fourier Transform Infrared Spectroscopy and Circular Dichroism are available. On the other hand, the Atlas of Macromolecule Structures provides access to all structure entries deposited at the Protein Data Bank (PDB) and the Nucleic Acid Database (NDB). It offers many tools for an in-depth analysis of individual structures with an emphasis on visualization. Analysis tools are designed both for complete structures and for structure parts such as domains, ligands, active sites, cis-peptide and SS bonds.
Given the increasing number of known three-dimensional biopolymer structures and the data explosion in other fields of biology, there is an urgent need for classification systems and for up-to-date and reliable cross-referencing schemes to other databases. To fulfil these needs the Image Library includes, among other classification schemes, the Hetero Components and Site Databases, a Comprehensive Bending Classification of Nucleic Acid Structures and a Genus/Species Classification. Databases that cannot directly be accessed by the PDB code are linked via a SWISS-PROT/PDB cross-reference scheme. The most recent additions to the Image Library's features are a Gene Ontology (GO) interface to the PDB and a SCOP domain viewer.
The atlas pages include links to about 30 other databases. More recently, disease-related information resources such as OMIM have been included. Finally, there are ongoing efforts to link PDB structures to genomics resources. Examples that are already operational are GeneCensus, PRESAGE and the eukaryotic genome browser Ensembl. Many leading biological databases, such as the PDB and Swiss-Prot, reference the Image Library.
The IMB Jena Image Library has now served the scientific community for about 10 years. In addition to this scientific usage, the database is also of value as an educational resource and has received attention from the general public. Images from the database have been used in many books, journals, newspapers and exhibitions. The database was featured as Site of the Month by Science in April 2001 and is also included in the listing of The Web's Best Sites of the Encyclopedia Britannica.

Jürgen Sühnel, IMB Jena

12.2.2004

Generating Dirty XML Data
Testing data cleansing algorithms requires data that contain specified amounts of predefined errors. With the ever-growing volume of XML data, developing such algorithms for nested XML data has become a highly topical issue as well. The talk will discuss errors in nested XML data and present a tool, developed as part of a student research project (Studienarbeit), that generates "dirty" XML data for testing these algorithms.
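A minimal sketch of the general idea, not of the tool presented in the talk: starting from a clean XML tree, it duplicates some records and injects typos into text values at configurable rates, and it keeps a log of the injected errors so that a cleansing algorithm can later be scored against it. The error types, rates, function names and sample data are assumptions made for illustration.

```python
import copy
import random
import xml.etree.ElementTree as ET

def swap_adjacent_chars(text, rng):
    """Introduce a typo by swapping two differing neighbouring characters."""
    positions = [i for i in range(len(text) - 1) if text[i] != text[i + 1]]
    if not positions:
        return text  # nothing to swap, e.g. a single character
    i = rng.choice(positions)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def pollute(root, rng, typo_rate=0.3, duplicate_rate=0.5):
    """Return a dirty copy of an XML tree plus a log of injected errors."""
    dirty = copy.deepcopy(root)  # the clean input tree stays untouched
    log = []
    # Step 1: duplicate some records (direct children of the root).
    for record in list(dirty):
        if rng.random() < duplicate_rate:
            dirty.append(copy.deepcopy(record))
            log.append(("duplicate", record.tag))
    # Step 2: corrupt some text values, including those inside duplicates,
    # so that duplicates are not necessarily exact copies.
    for element in dirty.iter():
        if element.text and element.text.strip() and rng.random() < typo_rate:
            corrupted = swap_adjacent_chars(element.text, rng)
            if corrupted != element.text:
                element.text = corrupted
                log.append(("typo", element.tag))
    return dirty, log

# Made-up sample data with one level of nesting.
clean = ET.fromstring(
    "<movies>"
    "<movie><title>Metropolis</title><year>1927</year></movie>"
    "<movie><title>Nosferatu</title><year>1922</year></movie>"
    "</movies>"
)
dirty, error_log = pollute(clean, random.Random(42))
```

Passing a seeded `random.Random` makes the generated errors reproducible, which helps when comparing different cleansing algorithms on the same dirty data.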

Sven Puhlmann

19.2.2004

(cancelled)

TBA