DILS 2006 poster session

Sponsored by: Microsoft Research, IBM, metanomics, metanomics health, EBI Industry Program, Schering, Genologics


Abstracts

Supporting data integration from project planning through to publication

Tim Booth, Bela Tiwari, Dawn Field NERC Environmental Bioinformatics Centre - CEH Oxford

Data tracking and integration are key concerns from the very inception of any experiment where large volumes of complex data will be collected. Scientists need tools, standards and guidance to manage their data effectively. Knowing from long experience the importance of good data management, the NERC created the NERC Environmental Bioinformatics Centre (NEBC) to provide such support to researchers across the UK.
Here we describe BarcodeBase and EnvBase, two core tools developed by the NEBC. We show how using these, alongside other tools and the extensive outreach the NEBC offers to bench scientists, facilitates the data integration required for experimental analysis and interpretation.
BarcodeBase is a distributed sample tracking system that allows information on physical samples to be logged and shared, and the items themselves to be tracked for future use. When the samples are analysed, all relevant metadata can be located and exported. EnvBase is our data cataloguing system. Researchers enter and update information via a set of forms to outline a research project and provide descriptions for all data and associated outputs from that project. Wherever relevant, the user associates accession numbers in public repositories with the EnvBase entry to provide direct links to the data.

An interaction knowledgebase for biomolecular modeling

Gabor Bereczki and Masaru Tomita

Motivation: Many modeling problems in cell simulation require access to information provided by various public biological databases. Providing unified access to these databases, and integrating them with model-building software, is a standing challenge for designers of modeling systems.

Results: We set up a knowledgebase integrating several of the most influential public biomolecular databases, a web front end for browsing and searching the unified data warehouse, SOAP RPC services providing an application interface to the data, and a Java client for graphical model editing.
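The abstract does not describe the SOAP interface itself, so the following is a purely hypothetical sketch of how an application might consume such an RPC service; the WSDL location and operation name are invented for illustration, and the third-party zeep client library is assumed.

```python
# Hypothetical sketch only: the poster does not document its SOAP API, so the
# WSDL URL and operation name below are invented. Assumes the "zeep" library.
from zeep import Client

# A model-editing client could fetch interaction data over SOAP RPC like this:
client = Client("http://example.org/knowledgebase/interactions?wsdl")  # hypothetical WSDL
interactions = client.service.getInteractionsForProtein("P04637")      # hypothetical operation
for interaction in interactions:
    print(interaction)
```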

VisGenome - genome visualisation at different levels of detail

Joanna Jakubowska1, Ela Hunt2, Matthew Chalmers1, and Anna F. Dominiczak3

1 Department of Computing Science, University of Glasgow
2 Database Technology Research Group, Department of Informatics, University of Zurich
3 BHF Glasgow Cardiovascular Research Centre, University of Glasgow

We surveyed a number of genome visualisation tools and found that there is no universal tool showing all the relevant data that geneticists looking for candidate disease genes would like to see. The biological researchers we collaborate with would like to view integrated data from a variety of sources and be able to see both overviews and details. We therefore developed a new visualisation tool, VisGenome.

VisGenome visualises single and comparative representations of rat, mouse, and human chromosomes, and can easily be used for other genomes. Currently, the application loads any data types specified by a query sent to the Ensembl database, or to a local mirror of RGD which holds additional data. Users are first presented with a view of all rat, mouse, and human chromosomes, from which they select the chromosomes of interest; a new view with the chromosome detail is then created. VisGenome shows chromosome bands, clones, markers, QTLs, microarray probes, SNPs and genes in a single representation, similar to DerBrowser (citation), but not limited in terms of zooming. In the future we are going to combine VisGenome with the comparative displays from SyntenyVista (add reference). We will also allow users to read data of interest not only from Ensembl but also from external files or web services.

VisGenome offers smooth zooming and panning. Users can keep an area of interest in focus while zooming, which preserves context and helps them avoid getting lost. The application shows data in natural scaling; in the future we will extend it to use cartoon scaling, and we are developing an algorithm which offers different kinds of cartoon scaling.

Virtual Integration of existing web databases for the genotypic selection of cereal cultivars

Sonia Bergamaschi, Antonio Sala

Dipartimento di Ingegneria dell'Informazione
via Vignolese 905
Universita` di Modena e Reggio Emilia

The poster presents the development of a virtual database for the genotypic selection of cereal cultivars starting from phenotypic traits. The database is realized by integrating two existing web databases, Gramene (http://www.gramene.org/) and Graingenes (http://wheat.pw.usda.gov/). The integration process gives rise to a virtual integrated view of the underlying sources that represents the information needed. This integration is obtained using the MOMIS system (Mediator envirOnment for Multiple Information Sources), a framework developed by the Database Group of the University of Modena and Reggio Emilia (www.dbgroup.unimo.it). MOMIS performs information extraction and integration from both structured and semistructured data sources. Information integration is performed in a semi-automatic way, by exploiting the knowledge in a Common Thesaurus (defined by the framework) and descriptions of source schemas, with a combination of clustering techniques and Description Logics. Mapping rules and integrity constraints are then specified to handle heterogeneity. MOMIS also allows information to be queried in a way that is transparent to the user, regardless of the specific languages of the sources. The result obtained by applying MOMIS to the Gramene and Graingenes web databases is a queryable virtual view that integrates the two sources and allows genotypic selection of cultivars of barley, wheat and rice based on phenotypic traits, regardless of the specific languages of the web databases. The project is conducted in collaboration with the Agrarian Faculty of the University of Modena and Reggio Emilia and funded by the Regional Government of Emilia Romagna.

Biofacets: Solution Towards Leveraging the Wealth of Online Biological Databases

M. Mahoui, Z. B. Miled, A. Godse, H. Kulkarni, N. Li

Searching online biological databases has become a ubiquitous task in today's research investigations. Numerous publicly available databases exist; however, only a handful of these are effectively used. This mainly stems from the inadequate, or sometimes non-existent, metadata describing the available databases, which undermines efficient data searching and result presentation. Yet each database has its own peculiarities and thus the potential to be of use in biological research investigations. The huge volume of data available presents several issues related to data integration and result presentation. To address these issues we propose Biofacets, a faceted classification system for querying biological databases. Biofacets adopts a wrapper-mediator based approach for integrating web-based biological databases. The system features a meta-search engine which facilitates an integrated search across multiple online biological databases. The main feature of Biofacets is an innovative faceted classification system that provides dynamic and hierarchical categorization of the data resulting from querying multiple and diverse biological databases. Query optimization and cache management are provided to improve system performance.
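As a minimal sketch of the faceted idea (illustrative records and field names only; this is not Biofacets code), results pooled from several source databases can be regrouped on demand along any facet, such as organism or source:

```python
# Illustrative sketch of faceted classification (not Biofacets code): records
# returned by several source databases are grouped dynamically by facet value.
from collections import defaultdict

results = [
    {"id": "P04637",    "organism": "human", "source": "UniProt"},
    {"id": "NM_000546", "organism": "human", "source": "RefSeq"},
    {"id": "P02340",    "organism": "mouse", "source": "UniProt"},
]

def facet(records, field):
    """Group record identifiers by the value of one facet field."""
    groups = defaultdict(list)
    for record in records:
        groups[record[field]].append(record["id"])
    return dict(groups)

print(facet(results, "organism"))  # {'human': ['P04637', 'NM_000546'], 'mouse': ['P02340']}
print(facet(results, "source"))    # {'UniProt': ['P04637', 'P02340'], 'RefSeq': ['NM_000546']}
```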

Software Support & Models For Meta-data Integration

David Hancock(1,2), Norman Morrison(1,2), Bela Tiwari(2), Dawn Field(2)

(1) Microarray Bioinformatics Group, The University of Manchester, UK
(2) NERC Environmental Genomics Data Centre, Oxford, UK

Data integration is a key issue when experiments are performed using high-throughput 'omic technologies (genomics, transcriptomics, proteomics, metabolomics, etc.). Researchers wish to combine the results of these different assays and it is vital that the supporting meta-data helps rather than hinders such integration.

Integration is needed in two places: first, when one or more biological materials (a.k.a. samples) are used in several different experiments and, second, when different experiments use similar, but not identical, samples. Successful integration relies on access to knowledge about both the commonalities and the differences in the materials and the processes that produced the data. Here, an integrated view of sample meta-data can significantly enhance the value and accuracy of subsequent data integration.

This poster discusses several aspects of our current work in the area of data integration for studies in which multiple 'omics analyses are employed. This research is funded as part of the data management plan for the NERC "Post-genomics and Proteomics Science Programme". We are developing a common representation for experimental designs and their attendant materials and methods, and a simple yet expressive generic mechanism for meta-data capture.

We present preliminary details of a novel method for modelling meta-data in situations where a rigid framework cannot be established a priori. We propose a system based on a composable building block called the 'characteristic'. We show how complex meta-data representations can be achieved by combining characteristics using composition (to represent mixtures) and binary predicates (to express relationships between characteristics). We also introduce the notion of inheritance of characteristics, whereby annotations are automatically propagated from one level to the next, thereby both reducing the data entry burden on the user and avoiding the introduction of errors through redundant replication of data.
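A minimal sketch of how such composable characteristics might behave (all class and attribute names here are hypothetical illustrations, not the poster's actual model):

```python
# Illustrative sketch only: class and method names are hypothetical,
# not taken from the poster.

class Characteristic:
    """A single building block of sample meta-data, e.g. ('organism', 'E. coli')."""
    def __init__(self, name, value):
        self.name, self.value = name, value

class Mixture:
    """Composition of characteristics, e.g. a growth medium made of components."""
    def __init__(self, *parts):
        self.parts = list(parts)

class Predicate:
    """Binary relationship between two characteristics, e.g. 'derived_from'."""
    def __init__(self, relation, subject, obj):
        self.relation, self.subject, self.object = relation, subject, obj

class Material:
    """A sample whose annotations are inherited by materials derived from it."""
    def __init__(self, own, parent=None):
        self.own = own          # characteristics entered for this material
        self.parent = parent    # material it was derived from, if any

    def characteristics(self):
        # Inheritance: annotations propagate from parent to child automatically,
        # reducing data entry and avoiding redundant replication.
        inherited = self.parent.characteristics() if self.parent else []
        return inherited + self.own

# Usage: a culture inherits the organism annotation of its source strain.
strain = Material([Characteristic("organism", "Escherichia coli K-12")])
culture = Material([Characteristic("temperature", "37 C")], parent=strain)
print([c.name for c in culture.characteristics()])  # ['organism', 'temperature']
```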

It is hoped that providing a common data model and annotation framework will have two benefits in the context of data integration: first, that experimental meta-data can be conveniently handled within a single architecture and, second, that integration of the subsequently generated assay data can be achieved automatically and with a high degree of reliability.

The Systems Biology Ontology

Mélanie Courtot, Nicolas Le Novère

The rise of Systems Biology, which seeks to understand biological processes as a whole, has highlighted the need not only to develop corresponding quantitative models, but also to create standards allowing their exchange and integration. This concern drove the community to design common data formats such as SBML [1] and CellML [2]. SBML is now largely accepted and used in the field. However, as important as the definition of a common syntax was, we also need to tackle the semantics of the models.

The Systems Biology Ontology (SBO) is developed by the EBI and the SBML team, within the framework of the international initiative BioModels.net. SBO aims to strictly index, define and relate terms used in quantitative modelling, and by extension quantitative biochemistry. SBO is currently made up of four different vocabularies: quantitative parameters (catalytic constant, thermodynamic temperature, ...), participant roles (substrate, product, catalyst, ...), modelling frameworks (discrete, continuous, ...) and mathematical expressions that refer to the other three.

The new version of SBML provides a mechanism to annotate model components with SBO terms, therefore enriching the semantics of the model beyond the mere topology of interactions and mathematical expressions. Simulation tools can check the consistency of a rate law, convert reactions from one modelling framework to another (e.g. continuous to discrete), or distinguish between identical mathematical expressions based on different assumptions (e.g. Henri-Michaelis-Menten vs. Briggs-Haldane). Other tools such as SBMLmerge can use the SBO annotation to integrate individual models into a larger one. The use of SBO is not restricted to the development of models. Resources providing quantitative experimental information, such as SABIO Reaction Kinetics [3], will be able to annotate the parameters (what they mean exactly, how they were calculated) and determine relationships between them.
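As a rough illustration of this annotation mechanism (a sketch assuming the libsbml Python bindings, which the abstract does not mention; the SBO identifier is purely an example value):

```python
# Sketch assuming the libsbml Python bindings (not named in the abstract);
# the SBO identifier used here is only an illustrative example.
import libsbml

doc = libsbml.SBMLDocument(2, 2)   # the sboTerm attribute appears in SBML Level 2 Version 2
model = doc.createModel()
model.setId("example_model")

reaction = model.createReaction()
reaction.setId("R1")
reaction.setSBOTerm(176)           # attaches sboTerm="SBO:0000176" to the reaction element

print(libsbml.writeSBMLToString(doc))
```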

To curate and maintain SBO, we developed a dedicated resource. A relational database management system (MySQL) at the back-end is accessed through a web interface based on JSP and JavaBeans. Its content is encoded in UTF-8, therefore supporting a large set of characters in the definitions of terms. Distributed curation is made possible by a custom-tailored locking system allowing concurrent access. This system allows continuous update of the ontology with immediate availability and avoids merging problems.

The ontology can currently be retrieved in the Open Biomedical Ontologies (OBO) flat file format [4], and SBO is now listed on the OBO ontology browser. The addition of an export in the Web Ontology Language (OWL) [5] in the near future, and the development of Web Services, will help spread the use of SBO.

References
[1] M Hucka, H Bolouri, A Finney, HM Sauro, JC Doyle, H Kitano, AP Arkin, BJ Bornstein, D Bray, A Cornish-Bowden, AA Cuellar, S Dronov, M Ginkel, V Gor, II Goryanin, WJ Hedley, TC Hodgman, PJ Hunter, NS Juty, JL Kasberger, A Kremling, U Kummer, N Le Novère, LM Loew, D Lucio, P Mendes, ED Mjolsness, Y Nakayama, MR Nelson, PF Nielsen, T Sakurada, JC Schaff, BE Shapiro, TS Shimizu, HD Spence, J Stelling, K Takahashi, M Tomita, J Wagner, and J Wang. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics, 19:524–531, 2003.
[2] CM Lloyd, MD Halstead, and PF Nielsen. CellML: its future, present and past. Prog. Biophys. Mol. Biol., 85:433–450, 2004.
[3] Ulrike Wittig, Martin Golebiewski, Renate Kania, Olga Krebs, Saqib Mir, Andreas Weidemann, Stefanie Anstein, Jasmin Saric, Isabel Rojas. SABIO-RK: Integration and curation of reaction kinetics data. Proceedings of the 3rd International Workshop on Data Integration in the Life Sciences 2006 (DILS’06), 2006.
[4] John Day-Richter. The OBO Flat File format specification, version 1.2. 2004. http://www.godatabase.org/dev/doc/obo_format_spec.html.
[5] DL McGuinness and F van Harmelen. OWL Web Ontology Language. 2004. http://www.w3.org/TR/owl-features/.

A workflow enactment portal for bioinformatics

Domenico Marra and Paolo Romano

Bioinformatics and Structural Proteomics, National Cancer Research Institute (IST), Genova, Italy

Background: Data integration in biology is a difficult task. Data integration is usually well accomplished for established knowledge domains, where the data are clearly defined, as are the integration goals and processes. In biology, this is not the case: a pre-analysis and reorganization of the data is very difficult, because data and related knowledge change very quickly. Moreover, the complexity of the information makes it difficult to design data models which remain valid across domains and over time. Finally, the goals and needs of researchers evolve very quickly in response to new theories and discoveries, leading to frequent changes in procedures and processes. At the same time, data integration is needed to achieve a better and wider view of all available information, to automatically carry out analyses and searches involving several databases and software tools, and to perform analyses involving large data sets. Finally, only a tight integration of data and analysis tools can lead to real data mining. As a consequence, data integration in biology can only be accomplished by using modular, flexible systems that can easily be customized and updated. ICT standards and tools, such as Web Services and Workflow Management Systems (WMS), can support the creation and deployment of such systems. An example is presented in [1]. Many Web Services are already available and some WMS have been proposed. They assume that researchers know which bioinformatics resources can be reached through a programmatic interface and that they are skilled in programming and building workflows. They are therefore not usable by the majority of researchers, who lack these skills. A portal enabling such researchers to profit from the new technologies is still missing. We present here a workflow enactment portal for bioinformatics applications.

Methods: A workflow enactment portal should provide end users with a user-friendly, personalized tool where they can register their preferences and interests, easily identify workflows, and keep track of results. We therefore designed a system that manages user profiles, allows searching in the repository of workflows, and allows interesting results to be stored. The ideal portal would also be able to enact workflows that are available through any workflow management system. Our portal is currently able to enact workflows developed using the Taverna Workbench [2] and BioWMS [3] systems, which we took as de-facto and de-jure standards respectively, since the former is the most used and the latter is based on the XPDL standard. We implemented biowep (see http://bioinformatics.istge.it/biowep/), a web based client application that allows for the selection and execution of a set of predefined workflows; its architecture was previously presented in [4]. biowep is partially based on open source software, namely Taverna Workbench and Freefluo, MySQL, and Tomcat, and is itself available under the LGPL license. The user interface supports authentication and profiling of researchers, and workflows can be selected on the basis of this information. Moreover, the main processing steps of the workflows are annotated on the basis of their input and output data, elaboration type and application domain, using a classification of bioinformatics data and tasks. These annotations can then be searched by the users. Enactment of workflows is carried out by FreeFluo for Taverna workflows and by BioAgent/Hermes [5], a mobile agent-based middleware, for BioWMS ones. Results can be saved for later analysis.

Conclusions: We developed a web system that supports the selection and execution of predefined workflows, thus simplifying access for all researchers. The implementation of Web Services allowing specialized software to interact with an exhaustive set of biomedical databases and analysis tools, together with the creation of effective workflows, can significantly improve the automation of in-silico analysis. biowep is available to interested researchers as a reference portal, and they are invited to submit their workflows to the workflow repository. biowep is being further developed within the Laboratory of Interdisciplinary Technologies in Bioinformatics (LITBIO).

Acknowledgements: This work was partially supported by the Italian Ministry of Education, University and Research (MIUR), projects "Oncology over Internet (O2I)" and "Laboratory of Interdisciplinary Technologies in Bioinformatics (LITBIO)". Our system is partially based on open source and we wish to thank all developers of these tools. biowep is itself available under the GNU Lesser General Public Licence (LGPL).

References
[1] Romano P., Marra D. and Milanesi L., Web services and workflow management for biological resources, BMC Bioinformatics 2005, 6(Suppl 4):S24 (doi:10.1186/1471-2105-6-S4-S24)
[2] T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, M. Greenwood, T. Carver, K. Glover, M. R. Pocock, A. Wipat and P. Li, Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics, 20(17):3045-3054, 2004
[3] Bartocci E., Corradini F., Merelli E., Scortichini L., BioWMS: a web based Workflow Management System for Bioinformatics. Proceedings of the 3rd Annual Meeting of the Bioinformatics Italian Society BITS 2006, Abstract 13.
[4] P. Romano, G. Bertolini, F. De Paoli, M. Fattore, D. Marra, G. Mauri, E. Merelli, I. Porro, S. Scaglione and L. Milanesi, Network integration of data and analysis of oncology interest. Journal of Integrative Bioinformatics, 0021, 2006.
[5] F. Corradini and E. Merelli. Hermes: agent-based middleware for mobile computing. In Mobile Computing, volume 3465, pages 234-270. LNCS, 2005.

InterMine - a biological data warehouse system

R. Smith(1), F. Guillier(1), W. Ji(1), P. McLaren(1), H. Janssens(1), R. Lyne(1), T. Riley(1), F. Reisinger(1), K. Rutherford(1), M. Wakeling(1), X. Watkins(1), G. Micklem(1)(2).

(1) Department of Genetics, University of Cambridge, Cambridge, UK
(2) Cambridge Computational Biology Institute, Department of Applied Mathematics and Theoretical Physics, Cambridge, UK

InterMine (www.intermine.org) is an open source data warehouse system developed as part of the FlyMine project (www.flymine.org). It is able to integrate many common biological formats and provides a framework for adding other sources. Data is imported into a query-optimised data warehouse and is accessible via a powerful web interface.

Data from Ensembl, UniProt, InterPro, InParanoid (orthologues), PSI (protein interactions) and GO annotation can be imported simply by specifying the organism/experiment. GFF3, FASTA and MAGE can be added with some additional configuration. InterMine also provides a framework for adding new data sources. The data model (the core of which is derived from the Sequence Ontology) is defined by an XML file and is easily extendable. All model-specific parts of the system are generated from this XML, so it is easy to incorporate new types of data.

The model-independent web application is designed to be a powerful query tool, giving users the ability to build complex queries without being constrained by fill-in-the-blanks forms. It is also possible to create and publish 'templates' to make common queries easy to run. All queries can be executed on lists of data, either uploaded by users or created from the results of other queries. Queries and lists can be saved between sessions and shared with other users. The web application provides a platform for integration of visualisation and analysis tools; GBrowse and Cytoscape are already available for viewing genome annotation and protein interaction networks.

A generic query optimisation system means that InterMine can turn any data model into a query-optimised data warehouse. Incoming queries are automatically analysed and re-written to use pre-computed tables in the underlying database. Generation of pre-computed tables is configurable, so the system can be optimised for any user query and performance can be adapted to actual usage. Template queries are pre-computed once created, so they always return results quickly.
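The rewriting idea can be sketched in miniature (a purely conceptual illustration, not InterMine code; table and column names are invented): when an incoming query joins tables covered by a pre-computed join table, it is redirected to that table instead.

```python
# Purely conceptual sketch (not InterMine code; table names are invented):
# if an incoming query joins tables for which a pre-computed join table
# exists, rewrite it to read from that table and skip the join at query time.
PRECOMPUTED = {
    frozenset({"gene", "protein"}): "precomp_gene_protein",
}

def rewrite(tables, where):
    """Return SQL that uses a pre-computed table when one covers the query."""
    key = frozenset(tables)
    if key in PRECOMPUTED:
        return f"SELECT * FROM {PRECOMPUTED[key]} WHERE {where}"
    return f"SELECT * FROM {' NATURAL JOIN '.join(tables)} WHERE {where}"

print(rewrite(["gene", "protein"], "symbol = 'zen'"))
# -> SELECT * FROM precomp_gene_protein WHERE symbol = 'zen'
```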

InterMine is able to operate on any data model so could be used to provide a data warehouse and query interface for an existing data management system. Extensive use of automatic code generation means little development is required.

InterMine is written in Java with a Struts/JSP web application; PostgreSQL is used as the underlying relational database. The system has been in development for over three years by a team of software engineers and biologists. All software is freely available under the LGPL license.

MGI Database: Integrated Knowledgebase for the Laboratory Mouse

James A. Kadin, Judith A. Blake, Carol J. Bult, Mary E. Dolan, Joel E. Richardson, Martin Ringwald, Janan T. Eppig, and the Mouse Genome Informatics Group

The Jackson Laboratory, Bar Harbor, ME, USA 04609

As the model organism of choice for studying human development and disease, the laboratory mouse plays a pivotal role in modern biomedical research. The Mouse Genome Informatics (MGI) system is a public data resource providing integrated, comprehensive data on the laboratory mouse, from sequence (genotype) to phenotype. We gather, distill, and integrate information from diverse sources about sequences, maps, genes, gene function, gene families, gene expression, strains, mutant phenotypes, disease models, mammalian orthologies, SNPs and other polymorphisms. We integrate these data through a combination of expert human curation and automated processes that determine object identities and shared relationships and use a variety of controlled/structured vocabularies (ontologies) such as Gene Ontology (GO), Mammalian Phenotype (MP), Anatomical Dictionary of Mouse Development, OMIM, InterPro, and PIR super families. Our web-based query forms allow users to explore these data from many different starting points. Recently added functionality will be highlighted including graphical views of GO annotations for mouse genes, comparative views of GO annotations of mouse-human-rat genes, mouse models of human disease, and mouse SNP data.

The MGI database is publicly accessible at http://www.informatics.jax.org

Supported by NIH HG00330, HD33745, HG002273.

Integrating Mouse SNP data into MGI

Sharon C. Giannatto, Richard M. Baldarelli, Jonathan Beal, Lori E. Corbin, James A. Kadin, Yunxia Zhu, Janan T. Eppig, Carol J. Bult, and the Mouse Genome Informatics Group

The Jackson Laboratory, Bar Harbor, ME, USA 04609

Mouse Genome Informatics (MGI), a premier mouse model organism database resource offering integrated access to mouse biological data, now offers integrated access to mouse SNP data. The MGI mouse SNP query form allows convenient, robust queries of mouse SNP data from dbSNP. Mouse SNP data can be accessed by mouse strain and strain comparisons, mouse genome coordinates, marker range, SNP attributes (variation type, functional class, accession ID) or by association to MGI genes and genetic markers, all from a single query form. There is also an option to look for RefSNPs within specified distances from an entered gene.

SNP query results include a summary of each returned RefSNP, including consensus strain allele calls derived from SNP assays. Results can be returned in tab-delimited format for easy downloading and further analysis. The summary page links to RefSNP detail pages which provide comprehensive information about individual RefSNPs including the reference flanking sequence, a complete list of the SNP assays that comprise the RefSNP, and gene/marker associations with their corresponding function class annotations.

We will discuss aspects of our software and database design for SNP data and how it is integrated with the rest of MGI mouse data including query form functionality, strain translations, efficient querying by SNP to gene distances, dual database loading strategy, and weekly recomputation of SNP to MGI Marker associations based on EntrezGene IDs.

The MGI database can be accessed freely at http://www.informatics.jax.org

Supported by NIH HG00330.

CCPNGrid: A Framework for High Throughput Computing in NMR Spectroscopy

Alan Wilter1, Wim Vranken2, Mark Hayes3, Andrew Parker4, Ernest D. Laue1

1 University of Cambridge, Dept. of Biochemistry
2 European Molecular Biology Laboratory, European Bioinformatics Institute, MSD Group
3 University of Cambridge, Dept. of Physics, Cavendish Laboratory
4 University of Cambridge, Dept. of Mathematics

The principal objective of this joint Collaborative Computing Project for the NMR community (CCPN)/Cambridge/E-Science Centre application is to provide the UK Nuclear Magnetic Resonance (NMR) community with a framework for executing state-of-the-art structure calculation and validation software on a compute cluster. To work out the best approach, we will initially implement established automated NOE assignment and NMR structure calculation software (ARIA/CNS), using as input information stored in the data model framework provided by CCPN. So far, we have implemented a solution, using the VEGA software, for adding new data into the CCPN XML repository, and have carried out some case-study model examples with ARIA/CNS in order to automate the whole procedure.

Enhancing Drug Discovery Informatics through Multi-Platform Data Integration Workflows

Jonathan Dry, Tim French, Sajan Khosla

Cancer Bioinformatics, AstraZeneca, Mereside, Alderley Park, Cheshire, England, SK10 4TG

A range of platforms including 'omics technologies are being deployed at AstraZeneca to address the needs of Translational Science research, particularly product differentiation, tumour selection and personalised medicine strategies. Optimising bioinformatics support for such approaches involves integration of diverse data types and tools for analysis and interpretation. Retrieving and interlinking data from bespoke systems with varied architecture, and achieving the necessary flexibility to handle evolving methodologies and tools, is a complex problem with no current off-the-shelf solution. Manual processes are undesirable for reasons of speed, efficiency and data integrity.

We are currently investigating workflow technology to automate a range of data handling tasks surrounding the merging and resulting manipulation of pre-clinical and clinical data to support translational studies. Here we describe the application of InforSense KDE to automate a typical approach, linking experimental information from multiple data sources, and applying quality control and multivariate analysis to visualize the relationship between microarray data, cell line annotation, genetic and pharmacological profiles.

Such automation will improve the standardisation and efficiency of 'omic analyses. Importantly, this will be achieved while retaining a flexible format that remains interpretable by the bioscientist to enable scientific input adapted to a specific problem.

The Integr8 Web Portal

Rajesh Radhakrishnan, Paul Kersey, Alan Horne, Jorge Duarte, and Rolf Apweiler.

EMBL Outstation, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD

The development of large scale, high throughput technologies capable of determining the nature of the genome, transcriptome and proteome of living organisms has led to the concomitant development of new ways of interpreting information about individual genes and proteins in a wider context. However, the enormous quantities of data generated through such experimental approaches have increasingly been spread over many different primary resources, making it harder to draw scientific conclusions. Coherently integrating such data, and offering access to it, has thus emerged as one of the most important challenges in bioinformatics. The Integr8 web portal has been developed to provide a single point of access to data obtained from many different primary sources, integrated in a model reflecting the central dogma of biology, in order to describe the current state of knowledge about the biology of individual species.

The focus of Integr8 is species with completely deciphered genomes, in which it is possible to interpret information about particular genes in the wider context of the genome as a whole. Integr8 is built by taking reference genome sequences (selected in a species-specific fashion from resources such as the EMBL Nucleotide Sequence Database, SGD, and Ensembl) and incorporating data from protein-centric resources such as the UniProt Knowledgebase and IPI into this framework. The Integr8 portal provides public access to this data, and offers tools for its interpretation and analysis, including statistical analysis of genomes/proteomes, views of homology relationships within and between species, a clear view of the relationship between genes and their products, text and sequence-based searches that can be customised for application within a specified taxonomic range, and hierarchical GO-based searching.

Each release of Integr8 is constructed in a step-wise process: data from a primary source of genome-related data is fitted into a generalised model; data from additional sources is compared to the primary data and contradictions are resolved (to identify a definitive list of entities to be represented in Integr8). Various strategies are employed to resolve contradictions, including sequence analysis, the application of precedence hierarchies, and the identification of anomalies; the diversity of biological data (and the range of possible errors) means that this process often needs to be customised for particular data sets, according to their quality and scope. Once a definitive list of features has been identified, data is imported from additional data sources as supplementary annotation for these entries.

The Integr8 web portal is implemented using Java servlets, supplemented by client-side technologies (such as AJAX) to ensure a seamless user experience. A web service has recently been made available to provide programmatic access to the data. The data from Integr8 is also incorporated into a BioMart-style database, offering user-customisable downloads through a simple wizard-style interface, in addition to pre-formatted downloads.

SHARQ Guide: Finding relevant biological data and queries in a peer data management system

Sarah Cohen-Boulakia, Olivier Biton, Shirley Cohen, Zachary Ives, Val Tannen and Susan Davidson

Department of Computer and Information Science
University of Pennsylvania, Philadelphia, USA

Scientists are now faced with an explosion of information which must be rapidly analyzed to form hypotheses and create knowledge. A major challenge lies in how to effectively share information among collaborating, yet autonomous, biological partners. These partners are peers in the sense that they have a diversity of perspectives (and hence heterogeneous and complementary schemas), dynamic data, and the possibility of intermittent connectivity or participation.

SHARQ (Sharing Heterogeneous and Autonomous Resources and Queries) is a collaborative project between the database group at the University of Pennsylvania (Penn) and two biological research groups from Penn Center for Bioinformatics and the Children's Hospital of Philadelphia. The goal of SHARQ is to develop generic tools and technologies for creating and maintaining confederations of peers whose purpose is distributed data sharing. In response to the difficulties outlined, our solution will emphasize: (i) decentralization achieving both scalability and flexibility, (ii) incremental development of resources such as schemas, mappings between different schemas, and queries, (iii) rapid discovery mechanisms for finding the resources relevant to a topic, and (iv) tolerance for intermittent participation of members and for approximate consistency of mappings.

The core engine of SHARQ is the Orchestra system (Ives et al, 2002 & 2005, Taylor et al, 2006, Green et al, 2006), which builds upon concepts from the Piazza peer data management system (PDMS) (Halevy et al, 2004 & 2005). Orchestra supports the exchange of data and updates among cooperating, heterogeneous databases, making use of policies to quickly and automatically manage disagreement among conflicting data.

However, it is difficult to know what information is available in the peer network. The SHARQ Guide is therefore being designed to enable biologists to find relevant information within a peer data management system. It provides assistance not only for users who ask queries, but also for owners of peers who wish to be registered within the Guide. Key ideas of the SHARQ Guide include: (i) representing biological entities and relationships as a graph, following the approach of BioGuide (Cohen-Boulakia et al, 2005); this graph can be extended in a collaborative way by the peer administrators; (ii) helping administrators of peers to register their schema; (iii) expressing queries without having to know or cite the schemas used for querying (transparent queries); (iv) proposing new features to maximize the amount of data returned to the user, by allowing some fields in the query to be optional.
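A toy sketch of ideas (i), (iii) and (iv) above (entity names, peer names and registrations are all hypothetical; this is not SHARQ code): peers register which entities of the shared graph they can serve, and a transparent query names entities rather than peer schemas, with optional entities never excluding a peer that lacks them.

```python
# Toy illustration only (not SHARQ code; entity and peer names are invented).
# Shared graph of biological entities and relationships, extensible by peer
# administrators (shown for context; not traversed in this toy example).
ENTITY_GRAPH = {
    ("gene", "encodes", "protein"),
    ("protein", "annotated_with", "go_term"),
}

# Entities each peer registers as being able to serve.
PEERS = {
    "peer_A": {"gene", "protein"},
    "peer_B": {"gene", "protein", "go_term"},
}

def relevant_peers(required, optional=frozenset()):
    """Peers must cover every required entity; optional entities never exclude a peer."""
    return {peer: sorted(set(optional) & served)
            for peer, served in sorted(PEERS.items())
            if set(required) <= served}

print(relevant_peers({"gene", "protein"}, optional={"go_term"}))
# {'peer_A': [], 'peer_B': ['go_term']} -- both peers answer the query;
# peer_B can additionally supply the optional GO annotation.
```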

More information about SHARQ is available at http://db.cis.upenn.edu/research/SHARQ.html.

A semantic web approach to data integration for the histone code case

M. Scott Marshall, Lennart Post, Marco Roos, Timo Breit

In the context of the Virtual Laboratory for e-Science project, we investigate a knowledge-based approach to data integration in order to elucidate the relationship between the 'histone code', DNA sequence, and gene expression. This use case is an example of a multifaceted biological problem that can take advantage of new integrative approaches in order to enable systems-level research. Our goal is to facilitate analysis of data in terms of the biology behind an experiment. We create 'data models' describing the data schema, 'semantic models' describing the biology relevant to our experiment, and 'linking models' that link the biological semantics to the data models. We present a scenario using semantic web formats (e.g. RDF, RDF-S, and OWL) and semantic web tools (e.g. Protégé, Sesame) for a showcase experiment that finds DNA regions where the binding of a specific type of histone (H3K4Me3; dataset 1) overlaps with transcription factor binding sites (dataset 2). We suggest that the translation to RDF is best achieved in two steps: 1) a syntactic translation that preserves things like table structure and column names, and 2) semantic annotation, where OWL terms are linked to the given column names.
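A minimal sketch of that two-step translation (assuming the rdflib library, which the abstract does not name; the URIs, column names and ontology terms below are hypothetical):

```python
# Sketch of the two-step table-to-RDF translation described above, assuming
# the rdflib library; namespaces, column names and OWL terms are hypothetical.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/data/")       # hypothetical data namespace
BIO = Namespace("http://example.org/ontology/")  # hypothetical biological OWL vocabulary

g = Graph()

# Step 1: syntactic translation -- each table row becomes a resource and each
# column becomes a predicate named after the column.
row = {"region_id": "chr1_001", "start": 10500, "end": 11200}
subject = URIRef(EX["row/" + row["region_id"]])
for column, value in row.items():
    g.add((subject, EX[column], Literal(value)))

# Step 2: semantic annotation -- link the row type and the column-derived
# predicates to terms from the biological model.
g.add((subject, RDF.type, BIO.HistoneBindingRegion))
g.add((EX.start, RDFS.subPropertyOf, BIO.genomicStart))

print(g.serialize(format="turtle"))
```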

Status of the Functional Genomics investigation Ontology (FuGO)

Susanna-Assunta Sansone on behalf of The FuGO Working Group
http://fugo.sf.net

The application of high-throughput functional genomics technologies such as microarrays and mass spectrometry to a wide variety of biological conditions has facilitated the investigation of new kinds of biological questions and has also resulted in the generation of massive amounts of data and metadata. The Functional Genomics investigation Ontology (FuGO) is being developed as a collaborative, international effort to build an ontology that will provide a common resource of terms to annotate functional genomics investigations as well as assays that generate large amounts of data and metadata (Whetzel et al., The Development of FuGO - An Ontology for Functional Genomics Investigations, 2006, OMICS (in press)). Those involved in the FuGO project represent biological and technological domains and currently include representatives from the following communities: Metabolomics, Proteomics, Transcriptomics, Crop Sciences, Environmental Genomics, Flow Cytometry, Genome Sequence, Immunogenomics, Nutrigenomics, Polymorphism, Toxicogenomics. Although these are diverse communities, there is a shared need for terms to describe universal features of investigations such as the design, protocols, assays, material, and units of measurement. The scope of FuGO will cover both terms that are universal to all functional genomics investigations and those that are domain specific. In addition, FuGO will reference out to existing mature ontologies, e.g. the Foundational Model of Anatomy, for additional domain content, in order to avoid the need to duplicate these resources, and will do so in such a way as to foster their ease of use in annotation.

The initial ontology building process includes the collection of use cases from communities, the identification of terms of interest for annotation from these use cases, and the binning of terms into top-level containers such as Object and Process. In the process of building FuGO, a style guide is being developed to provide recommendations for naming conventions for terms, formats for definitions, and policies for the development of the ontology, such as those for the addition of synonyms and alphanumeric identifiers for terms. In addition to these style recommendations, FuGO is being developed as an OBO Foundry Application ontology and will meet OBO requirements (http://obofoundry.org). During our first FuGO workshop in February, a draft version of FuGO was developed using Protégé/OWL. This initial version of the ontology focuses on the need for universal terms from the involved communities and includes an extension for technology-specific terms for the Proteomics community. Thus far, the ontology building process has focused on the development of is_a relations, while noting where other relations, such as part_of, are needed. Future development of FuGO will proceed as an iterative process in which a section of the ontology will be reviewed by all interested communities, allowing terms, both universal and community-specific, to be added to the ontology. In addition, non-is_a relations will also be added to join the branches of the ontology. In conclusion, FuGO is being developed to provide a common resource for annotation of functional genomics investigations and, in this way, the ontology will serve as the "semantic glue" providing a common understanding of the annotated data from across disparate data sources.

WormMart - the BioMart data warehouse for WormBase

William Spooner, Todd W. Harris, Lincoln D. Stein

WormBase is a well established model organism database tasked with consolidating genetic and genomic data for C. elegans and related nematodes. WormMart exposes these comprehensive, high-quality data from within a BioMart data warehouse. A generic tool has been developed to transform AceDB object schemas and instances into query-optimised BioMart datasets. Cross-dataset queries are supported via BioMart's query federation capabilities. Integration with other BioMart databases such as those at Ensembl and UniProt are similarly supported. Data access is provided from WormMart's MartView interface. WormMart is also suitable for use with other BioMart compatible software such as MartExplorer and Taverna.

AliBaba - Visualizing biological networks from PubMed query results

Conrad Plake, Jörg Hakenberg, Torsten Schiemann, Ulf Leser

Knowledge Management in Bioinformatics
Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany.

Motivation: Today, the most complete knowledge in the life sciences is contained in scientific publications. Searching for information buried in natural language text is a laborious and hence expensive task. When searching for relevant publications using PubMed, users have to scan all abstracts one by one.

Results: We present AliBaba, an interactive graphical tool that visualizes the content of PubMed abstracts retrieved by user queries. We parse abstracts for biological and medical entities (cells, diseases, drugs, proteins/genes, species, and tissues) and for pairwise associations between these entities (protein-protein interactions; (sub)cellular locations of proteins; proteins mentioned with species, diseases, or tissues; drugs co-occurring with diseases). For recognizing protein-protein interactions and cellular locations of proteins in texts, a sophisticated pattern matching strategy is used; for all other associations, AliBaba searches for sentence co-occurrences. AliBaba shows all associations as a single graph, and links nodes and edges (entities and their respective relations) to external databases (UniProt, MeSH, NCBI Taxonomy, MedlinePlus, PubMed) for further information. All source texts can be accessed by browsing through the graph, and are highlighted according to the detected entities and associations.
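The sentence co-occurrence strategy can be illustrated with a small sketch (not AliBaba code; the entity dictionaries and example text are invented): every pair of recognised entities mentioned in the same sentence yields a candidate association.

```python
# Illustrative sketch only (not AliBaba code): the sentence co-occurrence
# strategy pairs every two entities recognised in the same sentence.
import itertools
import re

# Hypothetical mini-dictionaries of recognisable entities.
ENTITIES = {
    "protein": {"p53", "MDM2"},
    "disease": {"sarcoma"},
}

def cooccurrences(abstract):
    """Return (entity, entity) pairs found together in a sentence."""
    pairs = []
    for sentence in re.split(r"(?<=[.!?])\s+", abstract):
        found = [(kind, name) for kind, names in ENTITIES.items()
                 for name in names if name in sentence]
        # every pair of entities mentioned in one sentence becomes an association
        pairs.extend(itertools.combinations(found, 2))
    return pairs

text = "MDM2 binds p53. Mutations of p53 occur in sarcoma."
for a, b in cooccurrences(text):
    print(a, "--", b)
```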

Availability: http://wbi.informatik.hu-berlin.de/alibaba/

logo © Alex Eckman-Lawn (capinecho@aol.com)