Team:Johns Hopkins-Software/Cloud

From 2012.igem.org

(Difference between revisions)
Line 17: Line 17:
We have been working on integrating the sequence alignment function with the cloud. This is the task of taking two genetic sequences and finding the area of best fit between the two. Alignments are often conducted by biologists to scan genes from various organisms for certain features, and study the significance of these traits, and how they may have arose.
We have been working on integrating the sequence alignment function with the cloud. This is the task of taking two genetic sequences and finding the area of best fit between the two. Alignments are often conducted by biologists to scan genes from various organisms for certain features, and study the significance of these traits, and how they may have arose.
<br><br>Though there are a number of ways to obtain an alignment, we opted to utilize dynamic programming with the Smith-Waterman algorithm. This algorithm performs a local alignment, meaning it searches through the two sequences for matches of all sizes and finds the highest similarity using a scoring system of assigning points based on matching letters, mismatched letters, or skipped letters (a.k.a. gaps). Varying the scoring system could also vary the results. Such an algorithm takes mn operations given two sequences with lengths m and n, so the worst case scenario would involve a complexity of m squared, implying that this task becomes exponentially more time consuming as our sequences get longer.
<br><br>Though there are a number of ways to obtain an alignment, we opted to utilize dynamic programming with the Smith-Waterman algorithm. This algorithm performs a local alignment, meaning it searches through the two sequences for matches of all sizes and finds the highest similarity using a scoring system of assigning points based on matching letters, mismatched letters, or skipped letters (a.k.a. gaps). Varying the scoring system could also vary the results. Such an algorithm takes mn operations given two sequences with lengths m and n, so the worst case scenario would involve a complexity of m squared, implying that this task becomes exponentially more time consuming as our sequences get longer.
 +
<br>
 +
<a href="https://static.igem.org/mediawiki/2012/f/f6/Waterman_Scoring.png"><img style="display: block; margin-left: auto; margin-right: auto; width: 720px; padding:8px;" src="https://static.igem.org/mediawiki/2012/f/f6/Waterman_Scoring.png"></img></a>
 +
<br><br>
<br><br>
The manual process of this algorithm involves setting each of the two sequences on an axis of a grid. Each box
The manual process of this algorithm involves setting each of the two sequences on an axis of a grid. Each box
is given a score based on the maximum outcome derived from its interaction with a preceding box on the grid. It may be rewarded points for a match when traveling diagonally down and to the right along the grid, or punished points for mismatching when traveling diagonally or jumping right or jumping down (follow the sample given). When the entirety of the grid is filled, the box with the highest score is identified. This becomes the ending index of the alignment. The path of this obtaining this score is retraced back to 0, and the resultant path can be used to determine the alignment of the sequences. Clearly this task can be very laborious, and practically impossible when considering how real genes can consist of sequences thousands or even millions of letters long. Even using computers, it can take a painfully long time to complete a run of this algorithm. That is why we have enlisted the use of cloud computation to revolutionize this process.
is given a score based on the maximum outcome derived from its interaction with a preceding box on the grid. It may be rewarded points for a match when traveling diagonally down and to the right along the grid, or punished points for mismatching when traveling diagonally or jumping right or jumping down (follow the sample given). When the entirety of the grid is filled, the box with the highest score is identified. This becomes the ending index of the alignment. The path of this obtaining this score is retraced back to 0, and the resultant path can be used to determine the alignment of the sequences. Clearly this task can be very laborious, and practically impossible when considering how real genes can consist of sequences thousands or even millions of letters long. Even using computers, it can take a painfully long time to complete a run of this algorithm. That is why we have enlisted the use of cloud computation to revolutionize this process.
<br>
<br>
-
<br>
+
 
-
<a href="https://static.igem.org/mediawiki/2012/f/f6/Waterman_Scoring.png"><img style="display: block; margin-left: auto; margin-right: auto; width: 720px; padding:8px;" src="https://static.igem.org/mediawiki/2012/f/f6/Waterman_Scoring.png"></img></a>
+
<br>
<br>

Revision as of 08:07, 3 October 2012

Cloud Technology

Autogene harnesses the power of the cloud to perform computationally intense tasks at record speeds. Cloud computing is known as the use of software and hardware services across a network, often the internet. The advantages of using the cloud is that an organization would not have to maintain their own hardware, so they can save on the cost of the technology while ensuring the quality of performances. They can increase access as essentially anyone with the authorized credentials could access the data or software through the internet, and are not limited to any physical location. And of course, the cloud can handle many demanding tasks. Using multiple machines to process work in parallel, performance could be sped up to a small fraction of the time.

Smith-Waterman Algorithm

We have been working on integrating the sequence alignment function with the cloud. This is the task of taking two genetic sequences and finding the area of best fit between the two. Alignments are often conducted by biologists to scan genes from various organisms for certain features, and study the significance of these traits, and how they may have arose.

Though there are a number of ways to obtain an alignment, we opted to utilize dynamic programming with the Smith-Waterman algorithm. This algorithm performs a local alignment, meaning it searches through the two sequences for matches of all sizes and finds the highest similarity using a scoring system of assigning points based on matching letters, mismatched letters, or skipped letters (a.k.a. gaps). Varying the scoring system could also vary the results. Such an algorithm takes mn operations given two sequences with lengths m and n, so the worst case scenario would involve a complexity of m squared, implying that this task becomes exponentially more time consuming as our sequences get longer.


The manual process of this algorithm involves setting each of the two sequences on an axis of a grid. Each box is given a score based on the maximum outcome derived from its interaction with a preceding box on the grid. It may be rewarded points for a match when traveling diagonally down and to the right along the grid, or punished points for mismatching when traveling diagonally or jumping right or jumping down (follow the sample given). When the entirety of the grid is filled, the box with the highest score is identified. This becomes the ending index of the alignment. The path of this obtaining this score is retraced back to 0, and the resultant path can be used to determine the alignment of the sequences. Clearly this task can be very laborious, and practically impossible when considering how real genes can consist of sequences thousands or even millions of letters long. Even using computers, it can take a painfully long time to complete a run of this algorithm. That is why we have enlisted the use of cloud computation to revolutionize this process.

Autodesk Saturn

In the case of the Autogene alignment algorithm, we wrote a client script that communicates with the cloud backend, running two tiers of algorithms that splits up the job into many subjobs running in parallel.


Cloud Performance

We have tested this on an alignment of the PUC18 gene, which consists of a sequence of 2,680 letters, against a library of 17,498 yeast features, each about 400 base-pairs long. Running conventionally without the cloud, we found that it takes a local machine 39 minutes to complete this alignment. Running it on the cloud with 10 processors we cut the time to three minutes, and running it with 30 processors we cut it to nearly one minute. PUC18 is a relatively unintimidating-sized sequence. Considering how many sequences of interest can be up to thousands of letters in length, and how libraries can have countless features, which could cause alignments to take weeks to complete, certain alignment tasks would require more memory than a local machine would be able to handle, so this is the kind of job that could only be done through a cloud server. With this kind of improvement, we are making the impossible in biology possible.






































































































































































































Autogene

Retrieved from "http://2012.igem.org/Team:Johns_Hopkins-Software/Cloud"