Silke Trissl1, Kristian Rother2, Ulf Leser
1trissl@informatik.hu-berlin.de, Institute of Informatics, Humboldt-Universität zu Berlin, Germany;
2kristian.rother@charite.de, Institute of Biochemistry, Charité, Berlin, Germany
We present COLUMBA, a database of information on protein structures that integrates data from twelve different biological databases, including ENZYME, KEGG, SCOP, CATH, DSSP, and SwissProt. COLUMBA allows for the quick computation of sets of protein structures that share interesting properties according to the different data sources.
The number of protein structures deposited in the Protein
Data Bank (PDB; Berman et al. 2000) is increasing rapidly. This allows researchers
in life science to study complex relationships between macromolecular structures
and their properties, such as biological function, folding classification,
or secondary structure. To undertake such studies, not only the three-dimensional
(3D) structures have to be known, but also the folding classification and
several other properties of a protein. Gathering such information from web
resources by following hyperlinks is a tedious and time-consuming task.
We have created COLUMBA (Rother et
al. 2004), a database of information on protein structures that physically
integrates information from twelve different data sources into a single
relational data warehouse. We enrich the protein structures from the PDB
with
We have created a user-friendly web interface, which is
available at http://www.columba-db.de.
The web interface allows full-text search as well as data-source-specific
queries. It uses a "query refinement" paradigm to return
the set of PDB entries that fulfill the stated conditions. A query is defined
by entering restriction conditions in the form for the data-source-specific
annotation. The user can combine queries on different data sources, which
act as filters, to obtain the desired subset of PDB entries. The interface
supports interactive and exploratory usage: conditions can easily be added,
deleted, tightened, or relaxed. The user is supported by a header,
called the "filter chain", which states the number of PDB entries remaining
after each filter step.
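
The query refinement idea described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not COLUMBA's
implementation: the Entry record, its field names, the function
run_filter_chain, the toy entries, and the reading of the resolution condition
as "at most 2.0" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Entry:
    """Hypothetical, simplified record for one structure entry."""
    pdb_id: str
    scop_fold: str
    resolution: float  # in Angstrom


# A filter is a label (shown in the filter chain) plus a predicate.
Filter = Tuple[str, Callable[[Entry], bool]]


def run_filter_chain(
    entries: Iterable[Entry], filters: List[Filter]
) -> Tuple[List[Entry], List[Tuple[str, int]]]:
    """Apply the filters in order and record how many entries remain after each step."""
    remaining = list(entries)
    chain = [("all entries", len(remaining))]
    for label, predicate in filters:
        remaining = [e for e in remaining if predicate(e)]
        chain.append((label, len(remaining)))
    return remaining, chain


# Made-up toy data; the identifiers are placeholders, not real PDB codes.
entries = [
    Entry("e001", "TIM barrel", 1.8),
    Entry("e002", "TIM barrel", 2.5),
    Entry("e003", "Rossmann fold", 1.5),
    Entry("e004", "TIM barrel", 2.0),
]

filters = [
    ("SCOP fold = TIM barrel", lambda e: e.scop_fold == "TIM barrel"),
    # Assumption: the resolution condition means "at most 2.0 Angstrom".
    ("resolution <= 2.0", lambda e: e.resolution <= 2.0),
]

result, chain = run_filter_chain(entries, filters)
for label, count in chain:
    print(f"{label}: {count}")
```

Adding, deleting, tightening, or relaxing a condition corresponds to editing
the filters list and rerunning the chain; the recorded counts play the role of
the "filter chain" header.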
The result set gives basic information on each entry returned. The user
can see the full scope of COLUMBA in the view of
a single entry, where all the annotated information for that entry is
shown.
Through the web interface it is fairly simple to answer the following two questions:
The first query is answered by entering the phrase
'TIM barrel' in one of the Protein Fold forms (as fold in SCOP or as keyword
in CATH) and then entering '2.0' for the resolution condition in
the PDB Structure form. This currently yields the desired set of
370 entries for CATH and 381 for SCOP.
The second query can be answered by using the Metabolism form of COLUMBA. The option 'path coverage' not only
shows the enzymes participating in the selected pathway, but also the number
of structures known for each enzyme.
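
As a rough sketch of what a 'path coverage' style summary computes, the
following Python counts the distinct structures annotated with each enzyme of
a pathway. The function name path_coverage, the input layout, and the toy
enzyme and structure labels are assumptions made for illustration only; they
do not reflect COLUMBA's schema or data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def path_coverage(
    pathway_enzymes: List[str],
    structure_annotations: Iterable[Tuple[str, str]],
) -> Dict[str, int]:
    """Return, for each enzyme in the pathway, the number of distinct structures known for it.

    structure_annotations holds (structure_id, enzyme) pairs; pairs for enzymes
    outside the pathway are ignored, and duplicate pairs are counted once.
    """
    structures_by_enzyme = defaultdict(set)
    for structure_id, enzyme in structure_annotations:
        structures_by_enzyme[enzyme].add(structure_id)
    return {
        enzyme: len(structures_by_enzyme.get(enzyme, ()))
        for enzyme in pathway_enzymes
    }


# Made-up toy data with placeholder labels.
pathway = ["enzyme_A", "enzyme_B", "enzyme_C"]
annotations = [
    ("s1", "enzyme_A"),
    ("s2", "enzyme_A"),
    ("s3", "enzyme_B"),
    ("s1", "enzyme_A"),  # duplicate annotation, counted once
    ("s4", "enzyme_X"),  # enzyme not part of the pathway
]

coverage = path_coverage(pathway, annotations)
for enzyme, count in coverage.items():
    print(f"{enzyme}: {count} structure(s)")
```

Enzymes with a count of zero are kept in the result on purpose, since a
coverage view is most useful when it also shows which pathway steps have no
known structure.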
Berman, H.M., Westbrook, J., Feng, Z., Gilliland,
G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The
Protein Data Bank. Nucleic Acids Research 28: 235-242.
Rother, K., Müller, H., Trissl, S., Koch, I., Steinke, T.,
Preissner, R., Frömmel, C., and Leser, U. 2004. COLUMBA: Multidimensional
Data Integration of Protein Annotations. E. Rahm (Ed.): DILS 2004, LNBI
2994, 156-171.