Mini-Rosetta Project Milestone 3
Inference of Rosetta Relations
Due: 11:59PM,Friday
April 20
In
this milestone, you are going to compute Rosetta relations. You DON'T need to
identify putative new gene functions now.
You
need to write a SQL query to infer Rosetta relations.
Recall
a Rosetta (gene fusion) relationship exists among three proteins, if two
proteins from the same organism are both homologous to a protein in a second
organism. Previously you computed a homology score between every pair of
proteins in your database. Thus, a Rosetta relationship can be determined
by a SQL query, as follows… (Let, 50 be the minimum homology score to properly
conclude a homologue, you are welcome to experiment with other values.)
SELECT
la1.t_protein_id, la1.t_protein_t_protein_id,
la2.t_protein_id, la2.t_protein_t_protein_id
FROM
t_protein p1, t_protein p2, t_localalignment
la1, t_localalignment la2, t_match
m1, t_match m2
WHERE la1.totalscore
> 50
AND la2.totalscore > 50
AND
p1.organism
= p2.organism
AND
m2.t_protein_id
= la2.t_protein_id
AND
m2.t_protein_t_protein_id
= la2.t_protein_t_protein_id
AND
m1.t_protein_id
= la1.t_protein_id
AND
m1.t_protein_t_protein_id
= la1.t_protein_t_protein_id
AND
((la1.t_protein_t_protein_id =
la2.t_protein_t_protein_id
AND
p1.t_protein_id
= la1.t_protein_id
AND
p2.t_protein_id
= la2.t_protein_id
AND
m1.toB
in
(select Max(toB) from t_match
where t_protein_id =
la1.t_protein_id
and t_protein_t_protein_id = la1.t_protein_t_protein_id)
AND
m2.fromB
in
(select Min(fromB) from t_match
where
t_protein_id = la2.t_protein_id
and
t_protein_t_protein_id =
la2.t_protein_t_protein_id)
AND
m1.toB<m2.fromB )
OR
(la1.t_protein_id = la2.t_protein_id
AND
p1.t_protein_id
= la1.t_protein_t_protein_id
AND
p2.t_protein_id
= la2.t_protein_t_protein_id
AND
m1.toA
in
(select Max(toA) from
t_match
where
t_protein_id = la1.t_protein_id
and
t_protein_t_protein_id =
la1.t_protein_t_protein_id)
AND
m2.fromA
in
(select Min(fromA) from t_match
where
t_protein_id = la2.t_protein_id
and
t_protein_t_protein_id =
la2.t_protein_t_protein_id)
AND m1.toA<m2.fromA));
This query above returns a
set of rows corresponding to the discovered Rosetta relations. You may
take precisely that query and use it as an argument to a JDBC call, take the
answers and, in Java, iterate through them inserting them one at a time back
into the database.
Alternatively, a little
knowledge of SQL results in the following SQL statement. This statement
computes the query above, and the set of results are inserted into the
database. You may use this SQL statement and complete this part of the
homework by executing a single JDBC statement.
INSERT INTO t_genefusionevent
(t_protein_id, t_protein_t_protein_id,
t_localalignment_t_protein_id, t_localalignment_t_protein_t_protein_id)
SELECT
la1.t_protein_id, la1.t_protein_t_protein_id,
la2.t_protein_id, la2.t_protein_t_protein_id
FROM
t_protein p1, t_protein p2, t_localalignment la1,
t_localalignment la2, t_match
m1, t_match m2
WHERE la1.totalscore
> 50
AND la2.totalscore > 50
AND
p1.organism
= p2.organism
AND
m2.t_protein_id
= la2.t_protein_id
AND
m2.t_protein_t_protein_id
= la2.t_protein_t_protein_id
AND
m1.t_protein_id
= la1.t_protein_id
AND
m1.t_protein_t_protein_id
= la1.t_protein_t_protein_id
AND
((la1.t_protein_t_protein_id =
la2.t_protein_t_protein_id
AND
p1.t_protein_id
= la1.t_protein_id
AND
p2.t_protein_id
= la2.t_protein_id
AND
m1.toB
in
(select Max(toB) from
t_match
where t_protein_id =
la1.t_protein_id
and t_protein_t_protein_id =
la1.t_protein_t_protein_id)
AND
m2.fromB
in
(select Min(fromB) from t_match
where
t_protein_id = la2.t_protein_id
and
t_protein_t_protein_id = la2.t_protein_t_protein_id)
AND
m1.toB<m2.fromB )
OR
(la1.t_protein_id = la2.t_protein_id
AND
p1.t_protein_id
= la1.t_protein_t_protein_id
AND
p2.t_protein_id
= la2.t_protein_t_protein_id
AND
m1.toA
in
(select Max(toA) from
t_match
where
t_protein_id = la1.t_protein_id
and
t_protein_t_protein_id =
la1.t_protein_t_protein_id)
AND
m2.fromA
in
(select Min(fromA) from t_match
where
t_protein_id = la2.t_protein_id
and
t_protein_t_protein_id =
la2.t_protein_t_protein_id)
AND m1.toA<m2.fromA));
Note:
The above
is for an asymmetry database, where each pair of local alignment only stored
once.(To think: is the above query complete for asymmetry database?) However, if you created
database that store the local alignment twice for
each pair,
then you don’t need OR clause.
(You can use this query to turn an asymmetry database into a
symmetry one.) Then the query can be similipfied as:
INSERT INTO t_genefusionevent
(t_protein_id, t_protein_t_protein_id,
t_localalignment_t_protein_id, t_localalignment_t_protein_t_protein_id)
SELECT
la1.t_protein_id, la1.t_protein_t_protein_id,
la2.t_protein_id, la2.t_protein_t_protein_id
FROM
t_protein p1, t_protein p2, t_localalignment la1,
t_localalignment la2, t_match
m1, t_match m2
WHERE la1.totalscore
> 50
AND la2.totalscore > 50
AND
p1.organism
= p2.organism
AND
m2.t_protein_id
= la2.t_protein_id
AND
m2.t_protein_t_protein_id
= la2.t_protein_t_protein_id
AND
m1.t_protein_id
= la1.t_protein_id
AND
m1.t_protein_t_protein_id
= la1.t_protein_t_protein_id
AND
la1.t_protein_t_protein_id
= la2.t_protein_t_protein_id
AND p1.t_protein_id = la1.t_protein_id
AND
p2.t_protein_id
= la2.t_protein_id
AND
m1.toB
in
(select Max(toB) from
t_match
where t_protein_id =
la1.t_protein_id
and t_protein_t_protein_id =
la1.t_protein_t_protein_id)
AND
m2.fromB
in
(select Min(fromB) from t_match
where t_protein_id =
la2.t_protein_id
and t_protein_t_protein_id = la2.t_protein_t_protein_id)
AND
m1.toB<m2.fromB ;
Turn
in: Email to the T.A. 3 of the gene
fusion events you computed.(Please include the protien
accession Number (t_protein.accessionNumber) of your results)