A metric model of amino acid substitution
Weijia Xu and Daniel P. Miranker
BIOINFORMATICS Vol. 20 no. 0 2004, pages 1-8
ABSTRACT
Motivation: We address the question of whether there exists an
effective evolutionary model of amino-acid substitution that forms a
metric-distance function. There is always a trade-off between speed
and sensitivity among competing computational methods of determining
sequence homology. A metric model of evolution is a prerequisite for
the development of an entire class of fast sequence analysis
algorithms that are both scalable, O(log n) and sensitive.
Results: We
have reworked the mathematics of the point accepted mutation model
(PAM) by calculating the expected time between accepted mutations in
lieu of calculating logodds probabilities. The resulting substitution
matrix (mPAM) forms a metric. We validate the application of the mPAM
evolutionary model for sequence homology by executing sequence queries
from a controlled yeast protein homology search benchmark. We compare
the accuracy of the results of mPAM and PAM similarity matrices as
well as three prior metric models. The experiment shows that mPAM
significantly outperforms the other three metrics and sufficiently
approaches the sensitivity of PAM250 to make it applicable to the
management of protein sequence databases.
full-text.pdf
Using MoBIoS Scalable Genome Join to Find Conserved Primer Pair Candidates Between Two Genomes
Weijia Xu, Willard J Briggs, Joanna Padolina, Wenguo Liu, C. Randal Linder and Daniel P. Miranker
(to appear, Proceedings of the Conference on Intelligent Systems for Molecular Biology) (ISMB-04)
ABSTRACT
Motivation: For the purpose of identifying evolutionary reticulation
events in flowering plants, we determine a large number of paired,
conserved DNA oligomers that may be used as primers to amplify
orthologous DNA regions using the polymerase-chain reaction (PCR).
Results: We develop an initial candidate set by comparing the
Arabidopsis and rice genomes using MoBIoS (Molecular Biological
Information System). MoBIoS is a metric-space database management
system targeting life science data. Through the use of metric-space
indexing techniques, two genomes can be compared in O(mlog n), where
m and n are the lengths of the genomes, versus O(mn) for BLAST based
analysis. The filtering of low complexity regions may also be
accomplished by directly assessing the uniqueness of the region. We
describe mSQL, a SQL extension being developed for MoBIoS that
encapsulates the algorithmic details in a common database
programming language, shielding endusers from esoteric programming.
emergence of computers in the commercial world.
full-text.pdf
Indexing Protein
Sequences in Metric Space
Weijia Xu, Daniel P. Miranker, Rui
Mao, Shu Wang
Technical Report, TR-04-06,Department of
Computer Sciences,
University
of Texas at Austin
Abstract
The hyper-exponential growth of
biological sequence data and complex queries demand new approaches of
managing sequence databases where sequence data is preprocessed
off-line and organized in data structures such that on-line queries
can be executed quickly. Due to the complications of computing
biological similarity based on local alignments, such index structures
are typically constructed on q-grams (substrings of length q) and
embody a three-way trade-off between speed, accuracy and scalability.
The development of a biologically effective distance metric on
amino-acid substitution, mPAM, permits this approach to be extended
beyond direct nucleotide comparison to codon similarity and protein
sequences. We consider the storage and retrieval of protein q-grams
using a metric-space index structure, the MVP-tree. Using a controlled
sequence homology benchmark, we evaluate the trade-off between
sensitivity and selectivity as a function of speed and length of the
q-grams. We conclude that the system only slightly penetrates the
curse of dimensionality and can be expected to offer scalable
performance. We will discuss our experimental results to show that the
protein sequences can be indexed in metric space with accuracy and
scalability.
full-text.pdf
Sequence View: A Database
Mechanism for Biosequences
D.P. Miranker, W. Liu, W. Xu, and R. Mao.
Technical Report TR-04-05
The Univ. of Texas at Austin, Dept. of
Computer Sciences. TR-04-05. (submitted for publication)
ABSTRACT
Biologically effective retrieval and analysis of sequences is much
more complicated than finding matching strings. The identification
and storage of biological sequences usually comprises long functional
units (e.g. genes, proteins and chromosomes). The analysis and
retrieval of those sequences primarily concerns finding ordered sets
of short matching subsequences. This characterization applies to both
local-alignment, (BLAST searches) and a growing tool kit of algorithms
in comparative genomics that are tantamount to executing joins on
pairs of whole genome sequences (whole genome joins). To support the
dual biological semantics we introduce sequence views, a component of
MoBIoS, (Molecular Biological Information System), a metric-space
database management system supporting complex life science data types.
MoBIoS stores the primary representation of sequences as strings.
Sequence views provide access to sequences as a decomposition of those
strings into overlapping substrings: q-grams. Thus, sequence views
provide a programmatic way to capture the two different logical
perspectives. We describe materializing sequence views as a
metric-space index and the optimization of paged MVP-trees to support
a biologically validated metric for the retrieval of protein
sequences. Empirical results demonstrate that for local-alignments
the system tends to O(log n) retrieval times. Such a physical
structure provides an access path for indexed-nested loop and merge
style whole genome joins.
full-text.pdf
An extension to SQL92 for Biological
Databases
W. Liu and D. P. Miranker
Technical Report TR-04-05
The University of Texas at Austin, Department of Computer
Sciences.
Abstract
MoBIoS is a project that aims at inventing a new generation database
management system (DBMS) for molecular biological data. This DBMS has
a set of features including storage of sequences, retrieval of
sequences based on similarity metric, and a query language (mSQL)
embodying the semantics of Genomics and Proteomics and allowing for
concise expression of Bioinformatics studies. This paper focuses on
designing and implementing mSQL. mSQL is based on the standard
relational operators ( join,,¹? , union), with extensions in operators
to enable splitting sequences into fragments, merging overlapped
fragments, grouping fragments that can be merged, interior fragment
select, searching, and joining optimized with metric space
index. Query optimization rules are given based on these operator
extensions. The concept of Sequence View is introduced, which has all
the merits of standard database View, and provides for the explicit
fragmentation of a sequence into overlapping substrings of a fixed
length and the concurrent construction of a metric-space index on the
fragments in order to accelerate matching of those fragments. Three
sample biological queries (Whole Genome Join, Conserved Primer Pair
Discovery, and Electronic PCR) are expressed in mSQL, query plan trees
based on mSQL biological operator extensions and Sequence View are
discussed. Biological applications show that mSQL is a suitable and
efficient declarative querying language for querying primary structure
of biological data sets with the potential to be extended for querying
secondary structure.
full-text.pdf