Team:Johns Hopkins-Software/Cloud
From 2012.igem.org
(Difference between revisions)
Line 17: | Line 17: | ||
We have been working on integrating the sequence alignment function with the cloud. This is the task of taking two genetic sequences and finding the area of best fit between the two. Alignments are often conducted by biologists to scan genes from various organisms for certain features, and study the significance of these traits, and how they may have arose. | We have been working on integrating the sequence alignment function with the cloud. This is the task of taking two genetic sequences and finding the area of best fit between the two. Alignments are often conducted by biologists to scan genes from various organisms for certain features, and study the significance of these traits, and how they may have arose. | ||
<br><br>Though there are a number of ways to obtain an alignment, we opted to utilize dynamic programming with the Smith-Waterman algorithm. This algorithm performs a local alignment, meaning it searches through the two sequences for matches of all sizes and finds the highest similarity using a scoring system of assigning points based on matching letters, mismatched letters, or skipped letters (a.k.a. gaps). Varying the scoring system could also vary the results. Such an algorithm takes mn operations given two sequences with lengths m and n, so the worst case scenario would involve a complexity of m squared, implying that this task becomes exponentially more time consuming as our sequences get longer. | <br><br>Though there are a number of ways to obtain an alignment, we opted to utilize dynamic programming with the Smith-Waterman algorithm. This algorithm performs a local alignment, meaning it searches through the two sequences for matches of all sizes and finds the highest similarity using a scoring system of assigning points based on matching letters, mismatched letters, or skipped letters (a.k.a. gaps). Varying the scoring system could also vary the results. Such an algorithm takes mn operations given two sequences with lengths m and n, so the worst case scenario would involve a complexity of m squared, implying that this task becomes exponentially more time consuming as our sequences get longer. | ||
+ | <br> | ||
+ | <a href="https://static.igem.org/mediawiki/2012/f/f6/Waterman_Scoring.png"><img style="display: block; margin-left: auto; margin-right: auto; width: 720px; padding:8px;" src="https://static.igem.org/mediawiki/2012/f/f6/Waterman_Scoring.png"></img></a> | ||
+ | |||
<br><br> | <br><br> | ||
The manual process of this algorithm involves setting each of the two sequences on an axis of a grid. Each box | The manual process of this algorithm involves setting each of the two sequences on an axis of a grid. Each box | ||
is given a score based on the maximum outcome derived from its interaction with a preceding box on the grid. It may be rewarded points for a match when traveling diagonally down and to the right along the grid, or punished points for mismatching when traveling diagonally or jumping right or jumping down (follow the sample given). When the entirety of the grid is filled, the box with the highest score is identified. This becomes the ending index of the alignment. The path of this obtaining this score is retraced back to 0, and the resultant path can be used to determine the alignment of the sequences. Clearly this task can be very laborious, and practically impossible when considering how real genes can consist of sequences thousands or even millions of letters long. Even using computers, it can take a painfully long time to complete a run of this algorithm. That is why we have enlisted the use of cloud computation to revolutionize this process. | is given a score based on the maximum outcome derived from its interaction with a preceding box on the grid. It may be rewarded points for a match when traveling diagonally down and to the right along the grid, or punished points for mismatching when traveling diagonally or jumping right or jumping down (follow the sample given). When the entirety of the grid is filled, the box with the highest score is identified. This becomes the ending index of the alignment. The path of this obtaining this score is retraced back to 0, and the resultant path can be used to determine the alignment of the sequences. Clearly this task can be very laborious, and practically impossible when considering how real genes can consist of sequences thousands or even millions of letters long. Even using computers, it can take a painfully long time to complete a run of this algorithm. That is why we have enlisted the use of cloud computation to revolutionize this process. | ||
<br> | <br> | ||
- | + | ||
- | + | ||
<br> | <br> |