DFG Forschungszentrum Matheon,
Algorithmen und Komplexität
The goal of this project is to design realistic stochastic models for networks representing data bases in life sciences and to develop efficient algorithms which perform crucial tasks on these networks such as searching, storing and sorting.
The underlying idea is that the analysis of these models will determine typical properties of the networks and thus firstly provide means to evaluate how close the models are to real-world situations and secondly allow the algorithms to make use of these statistical properties.
The Dictionary of Interfaces in Proteins (DIP) is a data base collecting the three-dimensional structure of interacting parts of proteins that are called patches. It serves as a repository in which patches similar to given query patches can be found. The computation of the similarity of two patches is time consuming. In this work we address the question how the patches in DIP, that are similar to a given query, can be identified by scanning only a small part of the database.
The score values describing the similarity of two patches can
roughly be divided into three ranges that
correspond to different levels of spatial similarity.
Interestingly, the two iso-score lines separating the three classes can be determined
by two different approaches.
Applying a concept of the theory of random graphs reveals significant structural properties
of the data in DIP.
These can be used to accelerate scanning the DIP for patches similar to a given query.
Scans for similarity scores in the highest range could be accelerated by a factor of more than 25.
In the middle range, nearly all of the relevant patches could be found ten times faster than by
brute-force search.
Accelerating screening of 3D protein data with a graph theoretical approach
In a database of about 2000 approved drugs, represented by 105 structural
conformers, we have performed 2D comparisons (Tanimoto coefficients) and 3D superpositions.
For one class of drugs the correlation between structural resemblance and
similar action was analysed in detail. In general Tanimoto coefficients
and 3D scores give similar results, but we find that 2D similarity
measures neglect important structural/funtional features. Examples for
both over- and underestimation of similarity by 2D metrics are discussed.
The required additional effort for 3D superpositions is assessed by
implementation of a fast algorithm with a processing time below 0.01
seconds and a more sophisticated approach (0.5 seconds per superposition).
COSMOS
According to the improvement of similarity detection compared to 2D
screening full-atom 3D superposition will be an upcoming
method of choice for library prioritization or similarity screening approaches.
Comparison of 2D similarity and 3D Superposition. Application to Searching a Conformational Drug Database
COSMOS is a computer program to compare (small) molecules for 3D similarity by
superposition. It was shown in
Comparison of 2D similarity and 3D Superposition. Application to Searching a Conformational Drug Database
, that the algorithm is able, even with a
purely geometry based scoring function, to find nearly all relevant hits in a database search
with drug-like molecules.
We provide a WEB-interface to the COSMOS algorithm which allows to submit the 3D-structures of two molecules and returns a visualization of the superposition found by the algorithm. Several parameters of the algorithm can be adjusted by the user.
The DTP (Developmental Therapeutical Program, National Cancer Institut) Human Tumor Cell Line Screen has checked tens of thousands of compounds for evidence of the ability to inhibit the growth of human tumor cell lines. The aim of this project is to detect groups of potential anti-cancer drugs that are both similar in experimental data (growth inhibition) and in 3D. This would allow to conclude that molecules of high 3D similarity should also tend to have similar functional properties. A redesign of our 3D superposition algorithms makes it possible to even screen this amount of data. This is work in progress. (joint with R.Preissner, charite)
As graphs are the canonical model for networks, random graphs seem to be appropriate candidates for the stochastic models. The classical random graph model was introduced by Erdös and Renyi in the early 1960s. It considers a fixed set of n vertices and edges that exist with a certain probability p=p(n), independently from each other.
However, it lacks many of the commonly observed properties of real-world networks (e.g. scale free degree distribution and clustering). One of the underlying reasons that are responsible for this mismatch is precisely the independence of the edges, in other words the missing transitivity. In a real-world network, relations between vertices x and y on the one hand and between vertices y and z on the other hand suggest a connection of some sort between vertices x and z.
To take better care of this fact, different models have been investigated. The Linearized Chord Diagram (LCD) model of Bollobas and Riordan goes back to an idea of Albert and Barabsi using a dynamic approach with preferential attachment of new nodes to the graph. Furthermore Chung et al. analyzed graphs with planted (expected) degree distributions similar to the real-world-networks.
Focussing on the missing transitivity, we investigate
stochastic models of so-called intersection graphs.
An intersection graph is a graph G=(V,E) together with a so-called
universal feature set W.
Every vertex in V has an assigned feature set (subset of W), and
the characteristic property of an intersection graph is that
two vertices are connected by an edge if and only
if their feature sets have non-empty intersection.
The idea is that objects (vertices) with common features bear strong
similarities, and this is reflected by the edges. It has been observed
(Michael Behrisch, Anusch Taraz: Clique finding in random intersection graphs)
that real-world networks are of the same nature as intersection graphs.

In an random intersection graph the vertices choose their features
with probability p uniformaly at random.
We study the evolution (depending on p) of these graphs with respect to
certain important parameters such as component size, diameter and degree
distribution. Structural information of this kind allows us to develop
algorithms and rigorously prove their asymptotic optimality.

To give an example, we consider a network that has been generated as a random intersection graph and analyse simple greedy strategies for retrieving feature configurations. In this way, we are able to search and organize data of this kind efficiently, and can gain new insight into the similarity structure of the network.