A metric model of amino acid substitution

Weijia Xu and Daniel P. Miranker

BIOINFORMATICS Vol. 20 no. 0 2004, pages 1-8

ABSTRACT
Motivation: We address the question of whether there exists an effective evolutionary model of amino-acid substitution that forms a metric-distance function. There is always a trade-off between speed and sensitivity among competing computational methods of determining sequence homology. A metric model of evolution is a prerequisite for the development of an entire class of fast sequence analysis algorithms that are both scalable, O(log n) and sensitive.
Results: We have reworked the mathematics of the point accepted mutation model (PAM) by calculating the expected time between accepted mutations in lieu of calculating logodds probabilities. The resulting substitution matrix (mPAM) forms a metric. We validate the application of the mPAM evolutionary model for sequence homology by executing sequence queries from a controlled yeast protein homology search benchmark. We compare the accuracy of the results of mPAM and PAM similarity matrices as well as three prior metric models. The experiment shows that mPAM significantly outperforms the other three metrics and sufficiently approaches the sensitivity of PAM250 to make it applicable to the management of protein sequence databases.
full-text.pdf

Using MoBIoS Scalable Genome Join to Find Conserved Primer Pair Candidates Between Two Genomes

Weijia Xu, Willard J Briggs, Joanna Padolina, Wenguo Liu, C. Randal Linder and Daniel P. Miranker

(to appear, Proceedings of the Conference on Intelligent Systems for Molecular Biology) (ISMB-04)

ABSTRACT
Motivation: For the purpose of identifying evolutionary reticulation events in flowering plants, we determine a large number of paired, conserved DNA oligomers that may be used as primers to amplify orthologous DNA regions using the polymerase-chain reaction (PCR).
Results: We develop an initial candidate set by comparing the Arabidopsis and rice genomes using MoBIoS (Molecular Biological Information System). MoBIoS is a metric-space database management system targeting life science data. Through the use of metric-space indexing techniques, two genomes can be compared in O(mlog n), where m and n are the lengths of the genomes, versus O(mn) for BLAST based analysis. The filtering of low complexity regions may also be accomplished by directly assessing the uniqueness of the region. We describe mSQL, a SQL extension being developed for MoBIoS that encapsulates the algorithmic details in a common database programming language, shielding endusers from esoteric programming. emergence of computers in the commercial world.
full-text.pdf


Indexing Protein Sequences in Metric Space

Weijia Xu, Daniel P. Miranker, Rui Mao, Shu Wang

Technical Report, TR-04-06,Department of Computer Sciences, University of Texas at Austin

Abstract
The hyper-exponential growth of biological sequence data and complex queries demand new approaches of managing sequence databases where sequence data is preprocessed off-line and organized in data structures such that on-line queries can be executed quickly. Due to the complications of computing biological similarity based on local alignments, such index structures are typically constructed on q-grams (substrings of length q) and embody a three-way trade-off between speed, accuracy and scalability. The development of a biologically effective distance metric on amino-acid substitution, mPAM, permits this approach to be extended beyond direct nucleotide comparison to codon similarity and protein sequences. We consider the storage and retrieval of protein q-grams using a metric-space index structure, the MVP-tree. Using a controlled sequence homology benchmark, we evaluate the trade-off between sensitivity and selectivity as a function of speed and length of the q-grams. We conclude that the system only slightly penetrates the curse of dimensionality and can be expected to offer scalable performance. We will discuss our experimental results to show that the protein sequences can be indexed in metric space with accuracy and scalability.
full-text.pdf

Sequence View: A Database Mechanism for Biosequences

D.P. Miranker, W. Liu, W. Xu, and R. Mao.

Technical Report TR-04-05 The Univ. of Texas at Austin, Dept. of Computer Sciences. TR-04-05. (submitted for publication)

ABSTRACT
Biologically effective retrieval and analysis of sequences is much more complicated than finding matching strings. The identification and storage of biological sequences usually comprises long functional units (e.g. genes, proteins and chromosomes). The analysis and retrieval of those sequences primarily concerns finding ordered sets of short matching subsequences. This characterization applies to both local-alignment, (BLAST searches) and a growing tool kit of algorithms in comparative genomics that are tantamount to executing joins on pairs of whole genome sequences (whole genome joins). To support the dual biological semantics we introduce sequence views, a component of MoBIoS, (Molecular Biological Information System), a metric-space database management system supporting complex life science data types. MoBIoS stores the primary representation of sequences as strings. Sequence views provide access to sequences as a decomposition of those strings into overlapping substrings: q-grams. Thus, sequence views provide a programmatic way to capture the two different logical perspectives. We describe materializing sequence views as a metric-space index and the optimization of paged MVP-trees to support a biologically validated metric for the retrieval of protein sequences. Empirical results demonstrate that for local-alignments the system tends to O(log n) retrieval times. Such a physical structure provides an access path for indexed-nested loop and merge style whole genome joins.
full-text.pdf

An extension to SQL92 for Biological Databases

W. Liu and D. P. Miranker

Technical Report TR-04-05 The University of Texas at Austin, Department of Computer Sciences.

Abstract
MoBIoS is a project that aims at inventing a new generation database management system (DBMS) for molecular biological data. This DBMS has a set of features including storage of sequences, retrieval of sequences based on similarity metric, and a query language (mSQL) embodying the semantics of Genomics and Proteomics and allowing for concise expression of Bioinformatics studies. This paper focuses on designing and implementing mSQL. mSQL is based on the standard relational operators ( join,,¹? , union), with extensions in operators to enable splitting sequences into fragments, merging overlapped fragments, grouping fragments that can be merged, interior fragment select, searching, and joining optimized with metric space index. Query optimization rules are given based on these operator extensions. The concept of Sequence View is introduced, which has all the merits of standard database View, and provides for the explicit fragmentation of a sequence into overlapping substrings of a fixed length and the concurrent construction of a metric-space index on the fragments in order to accelerate matching of those fragments. Three sample biological queries (Whole Genome Join, Conserved Primer Pair Discovery, and Electronic PCR) are expressed in mSQL, query plan trees based on mSQL biological operator extensions and Sequence View are discussed. Biological applications show that mSQL is a suitable and efficient declarative querying language for querying primary structure of biological data sets with the potential to be extended for querying secondary structure.
full-text.pdf