Mini-Rosetta Project Milestone 3

 

Inference of Rosetta Relations

 

Due: 11:59PM,Friday April 20

 

In this milestone, you are going to compute Rosetta relations. You DON'T need to identify putative new gene functions now.

You need to write a SQL query to infer Rosetta relations.

Recall a Rosetta (gene fusion) relationship exists among three proteins, if two proteins from the same organism are both homologous to a protein in a second organism. Previously you computed a homology score between every pair of proteins in your database.  Thus, a Rosetta relationship can be determined by a SQL query, as follows… (Let, 50 be the minimum homology score to properly conclude a homologue, you are welcome to experiment with other values.)

SELECT

la1.t_protein_id, la1.t_protein_t_protein_id, la2.t_protein_id, la2.t_protein_t_protein_id

FROM

t_protein p1, t_protein p2, t_localalignment la1, t_localalignment la2, t_match m1, t_match m2

WHERE la1.totalscore > 50

      AND la2.totalscore > 50

AND p1.organism = p2.organism

AND m2.t_protein_id = la2.t_protein_id

AND m2.t_protein_t_protein_id = la2.t_protein_t_protein_id

AND m1.t_protein_id = la1.t_protein_id

AND m1.t_protein_t_protein_id = la1.t_protein_t_protein_id

AND

       ((la1.t_protein_t_protein_id = la2.t_protein_t_protein_id

AND p1.t_protein_id = la1.t_protein_id

AND p2.t_protein_id = la2.t_protein_id

AND m1.toB in

       (select Max(toB) from  t_match

         where t_protein_id = la1.t_protein_id

         and t_protein_t_protein_id = la1.t_protein_t_protein_id)

AND m2.fromB in

      (select Min(fromB) from t_match

where t_protein_id = la2.t_protein_id

and t_protein_t_protein_id = la2.t_protein_t_protein_id)

AND m1.toB<m2.fromB )

            OR

 (la1.t_protein_id = la2.t_protein_id

AND p1.t_protein_id = la1.t_protein_t_protein_id

AND p2.t_protein_id = la2.t_protein_t_protein_id

AND m1.toA in

(select Max(toA) from  t_match

where t_protein_id = la1.t_protein_id

and t_protein_t_protein_id = la1.t_protein_t_protein_id)

AND m2.fromA in

(select Min(fromA) from t_match

where t_protein_id = la2.t_protein_id

and t_protein_t_protein_id = la2.t_protein_t_protein_id)

   AND m1.toA<m2.fromA));

 

This query above returns a set of rows corresponding to the discovered Rosetta relations.  You may take precisely that query and use it as an argument to a JDBC call, take the answers and, in Java, iterate through them inserting them one at a time back into the database.

 

Alternatively, a little knowledge of SQL results in the following SQL statement.  This statement computes the query above, and the set of results are inserted into the database.  You may use this SQL statement and complete this part of the homework by executing a single JDBC statement.

 

INSERT INTO t_genefusionevent (t_protein_id, t_protein_t_protein_id,

            t_localalignment_t_protein_id, t_localalignment_t_protein_t_protein_id)

SELECT

la1.t_protein_id, la1.t_protein_t_protein_id, la2.t_protein_id, la2.t_protein_t_protein_id

FROM

t_protein p1, t_protein p2, t_localalignment la1, t_localalignment la2, t_match m1, t_match m2

WHERE la1.totalscore > 50

      AND la2.totalscore > 50

AND p1.organism = p2.organism

AND m2.t_protein_id = la2.t_protein_id

AND m2.t_protein_t_protein_id = la2.t_protein_t_protein_id

AND m1.t_protein_id = la1.t_protein_id

AND m1.t_protein_t_protein_id = la1.t_protein_t_protein_id

AND

       ((la1.t_protein_t_protein_id = la2.t_protein_t_protein_id

AND p1.t_protein_id = la1.t_protein_id

AND p2.t_protein_id = la2.t_protein_id

AND m1.toB in

       (select Max(toB) from  t_match

         where t_protein_id = la1.t_protein_id

         and t_protein_t_protein_id = la1.t_protein_t_protein_id)

AND m2.fromB in

      (select Min(fromB) from t_match

where t_protein_id = la2.t_protein_id

and t_protein_t_protein_id = la2.t_protein_t_protein_id)

AND m1.toB<m2.fromB )

            OR

 (la1.t_protein_id = la2.t_protein_id

AND p1.t_protein_id = la1.t_protein_t_protein_id

AND p2.t_protein_id = la2.t_protein_t_protein_id

AND m1.toA in

(select Max(toA) from  t_match

where t_protein_id = la1.t_protein_id

and t_protein_t_protein_id = la1.t_protein_t_protein_id)

AND m2.fromA in

(select Min(fromA) from t_match

where t_protein_id = la2.t_protein_id

and t_protein_t_protein_id = la2.t_protein_t_protein_id)

   AND m1.toA<m2.fromA));

 

Note: The above is for an asymmetry database, where each pair of local alignment only stored once.(To think: is the above query complete for asymmetry database?) However, if you created database that store the local alignment twice for each pair, then you don’t need OR clause.

(You can use this query to turn an asymmetry database into a symmetry one.) Then the query can be similipfied as:

INSERT INTO t_genefusionevent (t_protein_id, t_protein_t_protein_id,

            t_localalignment_t_protein_id, t_localalignment_t_protein_t_protein_id)

SELECT

la1.t_protein_id, la1.t_protein_t_protein_id, la2.t_protein_id, la2.t_protein_t_protein_id

FROM

t_protein p1, t_protein p2, t_localalignment la1, t_localalignment la2, t_match m1, t_match m2

WHERE la1.totalscore > 50

      AND la2.totalscore > 50

AND p1.organism = p2.organism

AND m2.t_protein_id = la2.t_protein_id

AND m2.t_protein_t_protein_id = la2.t_protein_t_protein_id

AND m1.t_protein_id = la1.t_protein_id

AND m1.t_protein_t_protein_id = la1.t_protein_t_protein_id

AND la1.t_protein_t_protein_id = la2.t_protein_t_protein_id

      AND p1.t_protein_id = la1.t_protein_id

AND p2.t_protein_id = la2.t_protein_id

AND m1.toB in

       (select Max(toB) from  t_match

         where t_protein_id = la1.t_protein_id

         and t_protein_t_protein_id = la1.t_protein_t_protein_id)

AND m2.fromB in

      (select Min(fromB) from t_match

where t_protein_id = la2.t_protein_id

and t_protein_t_protein_id = la2.t_protein_t_protein_id)

AND m1.toB<m2.fromB ;

 

Turn in: Email to the T.A. 3 of the gene fusion events you computed.(Please include the protien accession Number (t_protein.accessionNumber) of your results)