MP6: GeneAnalysis

Introduction

A gene sequence is composed of four nucleotides which are represented as {A,G,T,C}. Suppose we want to know the similarity between these two sequences:

Gene1: ATCTGATC
Gene2: TGCATAC

During evolution, some nucleotides are deleted ("deletions") and some are inserted ("insertions"). These two events change the gene sequences. Fortunately, this problem is the equivalent as finding the *Longest Common Subsequence* of two sequences which as a Computer Scientist you know how to solve.

  1. Question: What is the Longest Common Subsequence?
  2. Answer: *word1* is a subsequence of *word2* if the letters in word1 appear in the same order (but not necessarily next to each other) within word2.

For example, {free} is a subsequence of {reference} and {cast} is a subsequence of {encapsulation}. However, {hint} is not a subsequence of {inheritance} because although {inheritance} contains all the letters in {hint}, they do not occur in the same order.

So back to our first example, we can align the two genes as follows:

Gene 1: A T C T G A T C
Gene 2: T G C A T A C

In this example, the Longest Common Subsequence (LCS) is TCTAC. That is, there are *five* places where both genes have the same character. In other words, the length of the LCS is 5. So, we say that the alignment score = 5.

Recursive Description

Here's a formal recursive description of how to solve this problem:

Implementation

You need to create two methods in the GeneAnalysis class (we've started it for you).