Milestone 4: Gene Ontology and Rosetta Relation

Milestone 4 overview

Instructions

  1. Data Model Update
  2. MySQL JDBC Driver
  3. Retrieving GO Terms
  4. Verifying Rosetta Relations

1. Data Model Update

Please refer to the solution given here to create the table.


2. MySQL JDBC Driver

The GO (Gene Ontology) database resides on a MySQL database server, so you need to install the MySQL JDBC driver to access it.

Download the MySQL JDBC driver from here.

Add the MySQL JDBC driver to your Java classpath in the same way you set up the classpath for the PostgreSQL JDBC driver.
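One quick way to confirm the classpath change took effect is to try loading the driver class by name. The sketch below is only an illustration: the class DriverCheck and its method are hypothetical helpers, not part of the provided skeleton. The two driver class names are the ones used by MySQL Connector/J (com.mysql.jdbc.Driver in 5.x, com.mysql.cj.jdbc.Driver in 8.x); which one applies depends on the driver version you downloaded.

```java
// Hypothetical helper (not part of the course skeleton): reports whether a
// class, such as a JDBC driver, can be loaded from the current classpath.
final class DriverCheck {

    /** Returns true if the named class can be loaded, false otherwise. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Class is missing or could not be linked/initialized.
            return false;
        }
    }

    public static void main(String[] args) {
        // Connector/J 8.x and 5.x driver class names, respectively.
        String[] candidates = {"com.mysql.cj.jdbc.Driver", "com.mysql.jdbc.Driver"};
        for (String name : candidates) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "missing"));
        }
    }
}
```

If both names are reported as missing, the driver JAR is most likely not on the classpath used to launch the program.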


3. Retrieving GO Terms

The code for connecting to the MySQL database is provided to you. You can make the same kind of JDBC calls as you did with PostgreSQL. The only differences are in the SQL dialect, and for basic queries there should be none.

You can check the GO database schema at http://www.godatabase.org/dev/sql/doc/tables.html

The tables we are concerned about are

Given a gene product symbol (in our case, the name column of the t_protein table) and assuming the symbol uniquely identifies a gene product, you need to write a function that

For example, given a t_protein.name = 'CEAC_ECOLI',

Run this function for all proteins in the t_GeneFusionEvent table. Turn in your code.

Because of differences between versions, some names associated with the given sequences are not found in the newer version of the GO database. You can run this file to update the t_protein table first.

Use GoTerm.java, GeneOntology.java and ProteinFunction.java as the skeleton of your program. You have to fill in the detailed steps (the TODO sections). The main() method is in ProteinFunction.java.


4. Verifying Rosetta Relations

Let A and B be the sets of GO terms associated with two proteins. Then $|A \cap B|$ is the number of common terms between the two proteins and $|A \cup B|$ is the total number of distinct terms. The overlap ratio is defined as $|A \cap B| / |A \cup B|$.
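The definition (common terms divided by distinct terms) can be sketched with java.util.Set operations. This is a minimal illustration, not the provided RosettaRelation code: the class name OverlapRatioSketch is hypothetical, the term identifiers are placeholders rather than real GO accessions, and returning 0.0 when both sets are empty is a convention chosen here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of the overlap ratio between two term sets.
final class OverlapRatioSketch {

    /** Number of common terms divided by number of distinct terms. */
    static double overlapRatio(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);               // intersection of a and b
        Set<String> distinct = new HashSet<>(a);
        distinct.addAll(b);                // union of a and b
        // Assumed convention: two empty sets have ratio 0.0.
        return distinct.isEmpty() ? 0.0 : (double) common.size() / distinct.size();
    }

    public static void main(String[] args) {
        // Placeholder identifiers, not real GO accessions.
        Set<String> a = new HashSet<>(Arrays.asList("T1", "T2", "T3"));
        Set<String> b = new HashSet<>(Arrays.asList("T2", "T3", "T4", "T5"));
        // 2 common terms out of 5 distinct terms.
        System.out.println(overlapRatio(a, b));
    }
}
```

The inputs are copied before retainAll and addAll are called, so the caller's sets are left unmodified.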

We provide two methods in RosettaRelation.java:

	int overlapCount(int pID1, int pID2);
	int overlapCount(int pID1, int pID2, int pID3);

You can use them to count the number of common GO terms between two or three proteins.
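Since these methods return only the number of common terms, the number of distinct terms has to be derived separately. One possible route is inclusion-exclusion: for two sets, distinct = |A| + |B| - common. The sketch below assumes the per-protein term counts are already known (how you obtain them is up to your implementation), and it assumes the three-protein ratio is the three-way common count over the three-way distinct count, which the text above does not state explicitly. The class RatioFromCounts is hypothetical.

```java
// Hypothetical helper: overlap ratios computed from counts alone,
// using inclusion-exclusion to obtain the number of distinct terms.
final class RatioFromCounts {

    /** Two proteins: common / (sizeA + sizeB - common). */
    static double ratio(int sizeA, int sizeB, int common) {
        int distinct = sizeA + sizeB - common;
        return distinct == 0 ? 0.0 : (double) common / distinct;
    }

    /**
     * Three proteins (assumed definition): abc / |A u B u C|, where
     * |A u B u C| = sizeA + sizeB + sizeC - ab - bc - ac + abc.
     */
    static double ratio(int sizeA, int sizeB, int sizeC,
                        int ab, int bc, int ac, int abc) {
        int distinct = sizeA + sizeB + sizeC - ab - bc - ac + abc;
        return distinct == 0 ? 0.0 : (double) abc / distinct;
    }

    public static void main(String[] args) {
        // A = {1,2,3}, B = {2,3,4}, C = {3,4,5} give the counts below.
        System.out.println(ratio(3, 3, 2));             // A and B: 2 common, 4 distinct
        System.out.println(ratio(3, 3, 3, 2, 2, 1, 1)); // all three: 1 common, 5 distinct
    }
}
```

Here ab, bc, ac and abc stand for the values the overlapCount methods would return for the corresponding protein IDs.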

Things you need to do

Your final output should be a table of the form

Rosetta Relation Overlap Ratios

Protein ID 1   Protein ID 2   Protein ID 3   P1 & P2   P2 & P3   P1 & P3   P1 & P2 & P3
18             17             9              0.143     0.194     0.472     0.133

You may choose to display protein names instead of IDs. The result shown in the above table is for TOP2_YEAST, HIS2_YEAST and GYRA_ECOLI.
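A row of this table could be produced with String.format, printing each ratio to three decimal places as in the example row. This is just one possible layout: the class RowFormat is hypothetical, and the tab separator is a choice made here, not a requirement of the assignment.

```java
import java.util.Locale;

// Hypothetical helper: formats one result row with tab-separated columns.
final class RowFormat {

    /** Three protein IDs followed by four ratios rounded to three decimals. */
    static String formatRow(int p1, int p2, int p3,
                            double r12, double r23, double r13, double r123) {
        // Locale.ROOT keeps '.' as the decimal separator on every machine.
        return String.format(Locale.ROOT, "%d\t%d\t%d\t%.3f\t%.3f\t%.3f\t%.3f",
                p1, p2, p3, r12, r23, r13, r123);
    }

    public static void main(String[] args) {
        // Values taken from the example row above.
        System.out.println(formatRow(18, 17, 9, 0.143, 0.194, 0.472, 0.133));
    }
}
```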