Stefan Günther1, Kristian Rother2, Silke Trissl
1stefanxguenther@web.de, Institute of Biochemistry, Charité, Berlin, Germany;
2kristian.rother@charite.de, Institute of Biochemistry, Charité, Berlin, Germany
In our poster we show possible applications of COLUMBA by highlighting several aspects of linkage quality in biological databases. Moreover, we used the database to generate sets of PDB entries containing Protein-DNA complexes and their homologues in order to study motifs of DNA-binding sites in proteins.
Coverage of PDB with external information
We investigated to what extent PDB entries are covered by second-party annotation. For each secondary database that is based on methods involving manual steps, we found an 'annotation gap' for structures less than seven years old. The examined databases overlap each other well, dividing the PDB into two well-annotated thirds and one poorly annotated third. Poorly annotated structures are either very new or contain molecules that are often not desired in datasets, such as mere DNA, small molecules, or low-resolution structures.
Analysis of DNA-binding regions
From the COLUMBA database
we retrieved a dataset of 650 Protein-DNA complexes and 555 highly homologous
DNA-binding proteins (> 90 % identity) that were crystallized without
nucleic acid. The advantage of using COLUMBA
over web resources was that more complex search conditions can be
formulated in the SQL query language. As a result, the dataset contained
no unwanted protein structures except for two immunoglobulins and one
thrombin.
We excluded entries containing single nucleotides or RNA as nucleotide
sequence. In addition, we removed entries containing just a short
peptide. In the remaining set we identified the DNA chains and the
polypeptide chains. The dataset was cross-checked with other publicly available
Protein-DNA complex resources from PDB, NDB, and IMB. We found that, according
to our criteria, those sets contain several wrongly identified complexes.
Finally, all entries were validated manually by visual inspection.
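The selection and exclusion criteria above could be phrased as a single SQL query, roughly as in the following sketch. It is only an illustration: the two-table schema, the entry identifiers, and the length thresholds (min_protein_length, min_dna_length) are hypothetical and are not taken from COLUMBA, whose real tables and columns differ.

```python
import sqlite3

# Hypothetical miniature schema, NOT the real COLUMBA schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entry (entry_id TEXT PRIMARY KEY);
CREATE TABLE chain (entry_id TEXT, chain_id TEXT, kind TEXT, length INTEGER);
""")

# Made-up toy entries (not real PDB identifiers).
toy_chains = [
    ("e001", "A", "protein", 120), ("e001", "B", "dna", 12),  # complex
    ("e002", "A", "protein", 200),                            # no nucleic acid
    ("e003", "A", "protein", 150), ("e003", "B", "rna", 30),  # RNA only
    ("e004", "A", "protein", 8),   ("e004", "B", "dna", 10),  # short peptide
    ("e005", "A", "protein", 90),  ("e005", "B", "dna", 1),   # single nucleotide
    ("e006", "A", "protein", 100), ("e006", "B", "dna", 14),
    ("e006", "C", "rna", 20),                                 # contains RNA
    ("e007", "A", "protein", 300), ("e007", "B", "dna", 20),  # complex
]
conn.executemany("INSERT INTO entry VALUES (?)",
                 [(e,) for e in sorted({c[0] for c in toy_chains})])
conn.executemany("INSERT INTO chain VALUES (?, ?, ?, ?)", toy_chains)


def find_protein_dna_complexes(conn, min_protein_length=20, min_dna_length=2):
    """Return entries with a protein chain and a DNA chain, and without RNA.

    Both length thresholds are assumptions made for this sketch.
    """
    query = """
    SELECT e.entry_id
    FROM entry e
    WHERE EXISTS (SELECT 1 FROM chain c
                  WHERE c.entry_id = e.entry_id
                    AND c.kind = 'protein' AND c.length >= ?)
      AND EXISTS (SELECT 1 FROM chain c
                  WHERE c.entry_id = e.entry_id
                    AND c.kind = 'dna' AND c.length >= ?)
      AND NOT EXISTS (SELECT 1 FROM chain c
                      WHERE c.entry_id = e.entry_id AND c.kind = 'rna')
    ORDER BY e.entry_id
    """
    rows = conn.execute(query, (min_protein_length, min_dna_length))
    return [row[0] for row in rows]


print(find_protein_dna_complexes(conn))
```

Combining several EXISTS and NOT EXISTS conditions in one statement is the kind of search phrase that is hard to express through a simple web form, which is presumably why direct SQL access helped here.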
The final dataset was divided into 218 protein families according to
their sequence identity. An amino acid was assigned to the binding region
of a polypeptide chain in a complex if it lay less than 5 Å from a
nucleotide chain.
We then compared the three-dimensional structures of the binding regions
with the Needle-Haystack search algorithm (Hoppe & Frömmel 2003)
to find interesting properties.
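The 5 Å criterion for binding regions can be sketched as follows. This is a minimal illustration, not the procedure actually used for the dataset: it assumes that the distance is measured between any protein atom and any DNA atom, which the text does not specify, and the function name, input layout, and coordinates are made up for the example.

```python
def binding_residues(protein_atoms, dna_atoms, cutoff=5.0):
    """Return sorted residue ids having at least one atom closer than
    `cutoff` (in Angstrom) to any DNA atom.

    protein_atoms: list of (residue_id, (x, y, z)) tuples, one per atom
    dna_atoms:     list of (x, y, z) tuples, one per atom
    Assumption: any atom-atom pair counts; other definitions
    (e.g. C-alpha only) would give different regions.
    """
    cutoff_sq = cutoff * cutoff  # compare squared distances, no sqrt needed
    binding = set()
    for residue_id, (px, py, pz) in protein_atoms:
        if residue_id in binding:
            continue  # residue already accepted through another atom
        for dx, dy, dz in dna_atoms:
            dist_sq = (px - dx) ** 2 + (py - dy) ** 2 + (pz - dz) ** 2
            if dist_sq < cutoff_sq:
                binding.add(residue_id)
                break
    return sorted(binding)


# Toy coordinates, invented for illustration only.
protein = [
    (10, (0, 0, 0)),   # 4 A from the first DNA atom: inside the cutoff
    (10, (1, 0, 0)),
    (11, (9, 0, 0)),   # exactly 5 A from the first DNA atom: not inside
    (12, (6, 4, 0)),   # about 2.2 A from the second DNA atom: inside
    (13, (20, 0, 0)),  # far from both DNA atoms
]
dna = [(4, 0, 0), (4, 3, 0)]

print(binding_residues(protein, dna))
```

The double loop is quadratic in the number of atoms. For whole-dataset runs a spatial index such as a grid or k-d tree would be the usual replacement, with the same acceptance rule.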
We found that within a family the local conformation of DNA-binding regions
can vary strongly as long as no DNA is bound to that region. As soon
as DNA is bound to the protein, the members of a family share a common
subfold for that binding pocket. In addition, we used the binding
pockets of helix-turn-helix proteins, which represent a prominent
DNA-binding motif, for a similarity screening against the entire dataset. We found
closely related proteins; however, the specificity of the matches does not
allow distant relatives to be identified reliably.
References