Protein/RNA Informatics

1. Introduction

The goal of this project is to identify and distinguish biomolecules and develop clustering methods based on certain signatures or fingerprints. The fingerprints are computed from both shapes and properties such as electrostatic potentials.

The MolFinger database stores MACTs as unique protein chain metadata for 494 protein chains. MACTs are used to support queries for cluster-based similarity. Given a definition of norms, a similarity metric can be defined for molecular properties such as electron density and electrostatic potential. MolFinger is built on MOBIOS (Molecular Biological Information System), which uses metric space indexing techniques and provides a database query language. The database is searched with two inputs, a PDB ID and a distance value, and it uses the distance value to compute similarity. Metric space indexing reduces the number of iterations from O(n^2) to O(n log n). MOBIOS also supports range queries, which extract from the 494 protein chains those that fall within a given similarity range. Similarity ranges from 0 to 1, where a value of 0 indicates that the molecules are dissimilar. The metric space index, however, works with distance values, where a distance of 0 represents the highest similarity (i.e. a similarity value of 1).
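The relationship between similarity, distance, and range queries can be illustrated with a small sketch. It is not the MOBIOS implementation or its query language: the class PivotIndex, the conversion d = 1 - s, the toy chain identifiers, and the two-component fingerprints are all assumptions made for illustration. It only shows how a metric index can use the triangle inequality to skip distance evaluations during a range query; a single pivot does not by itself give the O(n log n) behavior mentioned above.

```python
import math


def similarity_to_distance(similarity):
    """Map a similarity in [0, 1] to a distance (assumed mapping: d = 1 - s)."""
    return 1.0 - similarity


def euclidean(a, b):
    """Euclidean distance between two fingerprint vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class PivotIndex:
    """Hypothetical single-pivot metric index over fingerprint vectors."""

    def __init__(self, items, metric):
        self.metric = metric
        self.items = dict(items)  # chain id -> fingerprint vector
        self.pivot_id = next(iter(self.items))
        pivot = self.items[self.pivot_id]
        # Precompute the distance from the pivot to every stored item.
        self.to_pivot = {k: metric(pivot, v) for k, v in self.items.items()}

    def range_query(self, query, radius):
        """Return (sorted ids within radius, number of distance evaluations)."""
        d_qp = self.metric(query, self.items[self.pivot_id])
        evaluations = 1
        hits = []
        for key, d_pk in self.to_pivot.items():
            # Triangle inequality: |d(q,p) - d(p,k)| is a lower bound on d(q,k).
            # If that lower bound exceeds the radius, the item cannot match.
            if abs(d_qp - d_pk) > radius:
                continue
            evaluations += 1
            if self.metric(query, self.items[key]) <= radius:
                hits.append(key)
        return sorted(hits), evaluations


# Toy data: made-up chain ids and fingerprints, not real PDB entries.
chains = {
    "chain_1": (0.0, 0.0),
    "chain_2": (0.1, 0.0),
    "chain_3": (0.9, 0.9),
    "chain_4": (0.0, 0.2),
}
index = PivotIndex(chains, euclidean)
hits, evaluations = index.range_query((0.05, 0.0), radius=0.1)
print(hits, evaluations)
```

In this toy run two of the four chains are rejected from their precomputed pivot distances alone, so fewer distance evaluations are needed than in a full scan.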

We work with data from the PDB, mainly proteins and nucleic acids.
Two different representations of biomolecules are used:

  1. Flexible Chain Complex (FCC).
  2. Blobby model.

FCC contains bone-level structures, and the blobby model contains blurred structures. Since FCC contains too much information, a reduced FCC representation is useful for defining bone-level signatures.

The specific steps in the project are to:

  1. Compute various signatures or fingerprints of biomolecules, mainly proteins and nucleic acids.
  2. Define a distance (similarity) metric based on the fingerprints (metadata).
  3. Cluster a large number of biomolecules.
  4. Compare clustering methods (CATH, DALI, SCOP, CE, Pfam, etc.).
  5. Use combinations of fingerprints for improved clustering.
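Steps 2 and 3 above can be sketched as follows. The project text does not name a clustering algorithm, so this example swaps in a simple threshold-based single-linkage grouping (via union-find) purely for illustration. The function names, the L1 metric, the threshold, and the toy fingerprints are all assumptions. The pairwise loop is quadratic, which is fine for a toy example but is exactly what an index is meant to avoid at scale.

```python
def l1_distance(a, b):
    """L1 (Manhattan) distance between two fingerprint vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster_by_threshold(fingerprints, metric, threshold):
    """Group ids whose fingerprints are linked by distances <= threshold.

    Single-linkage: two molecules share a cluster if a chain of
    pairwise-close molecules connects them.
    """
    ids = sorted(fingerprints)
    parent = {i: i for i in ids}

    def find(i):
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for pos, a in enumerate(ids):
        for b in ids[pos + 1:]:
            if metric(fingerprints[a], fingerprints[b]) <= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy fingerprints with made-up molecule ids.
toy = {
    "m1": (1.0, 2.0),
    "m2": (1.1, 2.0),
    "m3": (5.0, 5.0),
    "m4": (5.0, 5.2),
}
print(cluster_by_threshold(toy, l1_distance, threshold=0.5))
```

Combining fingerprints (step 5) could be approximated in this sketch by concatenating vectors or by summing weighted per-fingerprint distances, though the source does not specify how the combination is done.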

Geometric, topological, and combinatorial properties of a biomolecule define its fingerprints under a volumetric representation. For example, the distribution of area, volume, and gradient integral over isosurfaces characterizes the geometric properties. The contour tree and Betti numbers provide both a topological and a combinatorial characterization. Fingerprints based on these properties define distance metrics, which are used for clustering biomolecules. We build a database that stores protein metadata and uses it to support queries for clustering based on fingerprint similarity.
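As a rough illustration of a geometric fingerprint, the sketch below samples one of the quantities mentioned above, the enclosed volume, at a few isovalues of a volumetric grid and compares two such signatures with an L2 norm. It is a simplified stand-in, not the project's fingerprint computation: the function names, the voxel-counting approximation of volume, the choice of isovalues, and the toy grids are assumptions. Area and gradient integral would need an isosurface extraction step that is omitted here.

```python
import math


def volume_signature(grid, isovalues, voxel_volume=1.0):
    """Approximate the volume enclosed by each level set of a 3-D grid.

    The volume at an isovalue is estimated by counting the voxels whose
    value is at least that isovalue (a coarse approximation).
    """
    values = [v for plane in grid for row in plane for v in row]
    return [
        voxel_volume * sum(1 for v in values if v >= iso)
        for iso in isovalues
    ]


def l2_distance(sig_a, sig_b):
    """L2 norm of the difference between two signatures of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sig_a, sig_b)))


# Two toy 2x2x2 density grids (made-up values).
grid_a = [[[0.1, 0.5], [0.9, 0.3]], [[0.7, 0.2], [0.4, 0.8]]]
grid_b = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
isovalues = [0.25, 0.5, 0.75]

sig_a = volume_signature(grid_a, isovalues)
sig_b = volume_signature(grid_b, isovalues)
print(sig_a, sig_b, l2_distance(sig_a, sig_b))
```

Because the comparison is a norm of a difference, it satisfies the metric properties (symmetry, triangle inequality) that a metric space index relies on.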

We developed accelerated visualization techniques for each biomolecular representation, the FCC and blobby models, using programmable graphics hardware. Millions of atoms, bonds, cylinders, and helices are rendered at interactive rates.

  1. PDB → FCC (description)
    Structure and Skeletal Graph Representation (Atomic level approach)
  2. PDB → Rawiv, RawV
    Volume Representation (Blobby level approach)
    In this model, a molecule is represented as an electron density map. The feature resolution of the model is controlled by a blobbyness parameter. A level set of the density map approximates the molecular surface. The geometry (e.g. contour spectrum) and topology (e.g. contour tree, Morse graph, …) of the density map are useful for capturing and comparing the structures of molecules.
  3. PDB → Raw, Rawn, Rawc, and Rawnc
    Surface Representation
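The role of the blobbyness parameter in the volume representation (item 2 above) can be sketched with a sum of atom-centered Gaussians. The exact kernel used by the project is not given in the text, so the form exp(B * (r^2 / R^2 - 1)) with a negative blobbyness B is an assumption, as are the function name, the default value of B, and the toy atom. With this form, the isovalue 1 passes through the sphere of radius R around an isolated atom.

```python
import math


def blobby_density(point, atoms, blobbyness=-2.0):
    """Evaluate an assumed Gaussian density at a 3-D point.

    atoms is a list of (center, radius) pairs. blobbyness must be negative;
    values closer to zero give a more blurred (blobbier) density.
    """
    total = 0.0
    for center, radius in atoms:
        r2 = sum((p - c) ** 2 for p, c in zip(point, center))
        total += math.exp(blobbyness * (r2 / radius ** 2 - 1.0))
    return total


# One made-up atom at the origin with radius 1.5.
atoms = [((0.0, 0.0, 0.0), 1.5)]

on_surface = blobby_density((1.5, 0.0, 0.0), atoms)
inside = blobby_density((0.0, 0.0, 0.0), atoms)
far_sharp = blobby_density((3.0, 0.0, 0.0), atoms, blobbyness=-2.0)
far_blurred = blobby_density((3.0, 0.0, 0.0), atoms, blobbyness=-0.5)
print(on_surface, inside, far_sharp, far_blurred)
```

Sampling this function on a regular grid would give a volumetric map from which a level set can be extracted as an approximate molecular surface.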

Additional information about the Molecular Signature Database.