# Research

My interests in computer science lie mostly in its intersection with biology, in the field of bioinformatics. Currently, I am focusing on developing probabilistic models to describe the uncertainty for complex functionals computed on objects with native error/uncertainty. A major focus is on generating intelligent (low-discrepancy) samples in high-dimensional spaces. The applications are to protein-protein docking and protein homology modeling.

I have recently entered into candidacy for my PhD. Please see here for a link to my PhD Proposal presentation slides (best viewed in Adobe Reader), and here for a link to my PhD Proposal document.

In the past, I spent a lot of time in (and still often dabble with) the field of Next-Generation Sequencing, and the future of massivly-parallel DNA sequencing.

During my time as an undergrad and now graduate student, I've been part (and continue to be a major part) of several research projects. These include, but are not limited to, those seen below.### Uncertainty Quantification for Protein Functionals

When used in computational processes, proteins (and other biological molecules)
are traditionally represented as single, static molecules. However, this is far from the
truth. For example, computing the binding free energy of two proteins,
\(\Delta E(A+B) = E(A + B) - E(A) - E(B)\),
is useful for a host of problems, e.g. computational drug discovery. However, the
free energy of a single protein, \(E(A)\), is computed on a *static* molecule.

This research aims to answer the question: Can functionals (such as binding free energy)
be improved by looking at *several* static molecules, instead of just a single one?
If so, how many static molecules is sufficient to obtain high confidence on the results?

Preliminary results have been published in the following papers:

- "Quantifying and Visualizing Uncertainties in Molecular Models" (available here)
- "Uncertainty Quantified Computational Analysis of the Energetics of Virus Capsid Assembly" (available here)

### Generating Low-Discrepancy Sequences in High-Dimensional Spaces

As proteins consist of a linear backbone structure, they can be represented by hinges and angles, moving in much the same way as a robotic arm. If this representation is used, then a protein can be thought to live in a high-dimensional space (each degree of freedom is a single dimension).

It is well known that sampling in high dimensions is very difficult. For example, as
the number of dimensions approaches \(\infty\), the ratio of the volume of the high-
dimensional sphere to high-dimensional cube approaches 0.
(See here
for a more rigorous explanation.) Thus in high-dimensional spaces, it is not trivial
to generate good samples. For this reason, much of my prior work has been in developing
algorithms with provable bounds on their *discrepancy* (a measure of the
evenness of sampling). However, calculating the discrepancy of a given set of points
is a difficult (provably NP-hard) problem. There are some good approximations, but
these algorithms run very slow on high-dimensional spaces.

Part of my current (on-going) research is also implementing discrepancy calculations on parallel architectures, such as GPUs and the latest Intel KNL platform. Our hope is that by formulating the discrepancy problem as a tweaked matrix multiplication, we can achieve massive speedups that will allow us to come to a closer measure of discrepancy.

### ADaM

My current research project involves mapping short DNA fragments to a reference
genome. The key to this research is converting DNA sequences into a point
in metric space, and then constructing an intelligent database (we use a modified MVP tree, which is an adaptation
of the *k*-d tree) to reduce the overall search time. While not as fast as
some of the current DNA mappers, we hope that our *exact search* algorithm will
be able to shed light on some datasets that have relatively poor mapping percentages.

This research was published in BMC Bioinformatics as "ADaM: Augmenting existing approximate fast matching algorithms with efficient and exact range queries." The publication can be found here.

### RNA Folding

I spent some time working with the Gutell Lab on a progressive RNA folding algorithm. It's still in the development stage, but it's looking like it will be a significant improvement over current algorithms (which perform quite poorly). I'm hoping to be able to translate some of this information to the RNA-Seq problem, which is fast becoming one of the more important problems in next-generation sequencing and drug discovery.

### GNUMAP

My undergraduate research project was the development of GNUMAP, a program used to find the best alignment for short DNA sequences on a reference genome. This research was used to spur one first-author peer-reviewed journal paper, two conference papers, and several presentations. The webpage for this project (including download instructions, etc), is here.