Team:Johns Hopkins-Software/Cloud

Cloud Technology

Autogene harnesses the power of the cloud to perform computationally intensive tasks quickly. Cloud computing is the use of software and hardware services across a network, often the internet. One advantage of the cloud is that an organization does not have to maintain its own hardware, so it can save on the cost of the technology while still getting reliable performance. It also broadens access: anyone with authorized credentials can reach the data or software through the internet, without being limited to a physical location. And of course, the cloud can handle demanding tasks. By using multiple machines to process work in parallel, a job can be completed in a small fraction of the time.

Smith-Waterman Algorithm

We have been working on integrating the sequence alignment function with the cloud. This is the task of taking two genetic sequences and finding the region of best fit between the two. Biologists often run alignments to scan genes from various organisms for certain features, and to study the significance of these traits and how they may have arisen.

Though there are a number of ways to obtain an alignment, we opted to use dynamic programming with the Smith-Waterman algorithm. This algorithm performs a local alignment, meaning it searches through the two sequences for matches of all sizes and finds the region of highest similarity, using a scoring system that assigns points for matching letters, mismatched letters, and skipped letters (a.k.a. gaps). Varying the scoring system can also vary the results. The algorithm takes on the order of mn operations for two sequences of lengths m and n, so for two sequences of equal length m the cost grows as m squared. The task therefore becomes quadratically, not linearly, more time consuming as our sequences get longer: doubling both lengths roughly quadruples the work.
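The scoring scheme described above can be written as a recurrence. This is a sketch under simplifying assumptions: s(a_i, b_j) stands for the match or mismatch score, and g is a single linear gap penalty, whereas alignment tools commonly use separate gap-open and gap-extend penalties. The symbols H, s, and g are introduced here for illustration only.

```latex
\begin{aligned}
H_{i,0} &= H_{0,j} = 0, \\
H_{i,j} &= \max\bigl\{\, 0,\;
            H_{i-1,j-1} + s(a_i, b_j),\;
            H_{i-1,j} - g,\;
            H_{i,j-1} - g \,\bigr\},
\qquad 1 \le i \le m,\; 1 \le j \le n .
\end{aligned}
```

Each of the mn interior cells is filled with a constant amount of work, which is where the mn operation count comes from.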

The manual process of this algorithm involves setting each of the two sequences on an axis of a grid. Each box is given a score based on the best outcome derived from one of its preceding boxes on the grid. It may be awarded points for a match when traveling diagonally down and to the right along the grid, or penalized for a mismatch when traveling diagonally, or for a gap when jumping right or jumping down (follow the sample given). When the entire grid is filled, the box with the highest score is identified. This becomes the ending index of the alignment. The path that produced this score is retraced until it reaches a box with a score of 0, and the resulting path determines the alignment of the sequences. Clearly this task is very laborious by hand, and practically impossible when considering that real genes can consist of sequences thousands or even millions of letters long. Even using computers, it can take a painfully long time to complete a run of this algorithm. That is why we have enlisted cloud computation to speed up this process.
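The grid-filling and retracing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code Autogene runs: the scores (match = 2, mismatch = -1, gap = -1) and the function name `smith_waterman` are assumptions chosen for the example, and the gap penalty is linear for simplicity.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment of strings a and b.

    Returns (best_score, aligned_a, aligned_b). The scoring values are
    illustrative assumptions, not those used by any particular tool.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # Score grid with an extra first row and column of zeros.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the grid: each box takes the best of diagonal, up, left, or 0.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # diagonal: match/mismatch
                          H[i - 1][j] + gap,      # down: gap in b
                          H[i][j - 1] + gap)      # right: gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Retrace from the highest-scoring box until a score of 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTAAA", "GGACGTCC"))
```

The two nested loops visit every box once, so the running time grows with the product of the two sequence lengths.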

Autodesk Saturn

We collaborated with Autodesk and implemented our cloud algorithm through their Project Saturn API and Autodesk Cloud services. Saturn is a new framework designed to provide customers with single- and multi-objective global optimization algorithms. It can be fully integrated into engineering products as a multi-language, multi-platform optimization library, and it can communicate with the framework running on the Autodesk Cloud. This makes it possible to integrate custom solutions and scalable systems that carry out the optimization tasks users demand.

In our project, we wrote both frontend and backend components to carry out Smith-Waterman local alignments on the cloud, on demand, using a two-tier algorithm that splits up the task and runs the subjobs in parallel. Users upload a plasmid sequence and specify the number of subjobs (n) into which to split the alignment process. The plasmid sequence is then temporarily stored as a resource in the cloud, alongside an existing table of features that was uploaded permanently beforehand to cut down on file transfer time. The first tier of the algorithm accesses the stored features table, splits it into n parts, and initiates n subjobs to run in the second tier. This requires activating n machines on the cloud so that the alignment runs as efficiently as possible. In each branch of the second tier, we use the EMBOSS Water tool to run the Smith-Waterman algorithm on the plasmid sequence against the designated set of features. A unique result resource is created for each subjob, and alignment results are filtered through a threshold of 98% identity before being appended to this file. The results are then reformatted into a JSON array before being passed back to the first tier. As each subjob completes, the first tier sends partial results back to the client, which appends these JSON arrays together and returns a final result containing all the completed alignments.
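The two-tier flow above can be sketched locally as follows. This is only an illustration under stated assumptions: threads stand in for cloud machines, and the hypothetical `percent_identity` function is a simple ungapped stand-in for an EMBOSS Water run, not a call into EMBOSS or the Saturn API. All function names here are made up for the example; only the 98% identity threshold and the JSON hand-off come from the description above.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed

IDENTITY_THRESHOLD = 98.0  # percent identity cutoff from the description


def split_features(features, n):
    """Tier 1: deal the feature table into n roughly equal parts."""
    return [features[i::n] for i in range(n)]


def percent_identity(plasmid, feature):
    """Hypothetical stand-in for an alignment run.

    Slides the feature along the plasmid without gaps and returns the
    best percentage of matching positions.
    """
    if not feature or len(feature) > len(plasmid):
        return 0.0
    best = 0
    for start in range(len(plasmid) - len(feature) + 1):
        window = plasmid[start:start + len(feature)]
        best = max(best, sum(x == y for x, y in zip(window, feature)))
    return 100.0 * best / len(feature)


def run_subjob(plasmid, chunk):
    """Tier 2: align one part, keep hits at or above the threshold,
    and return them as a JSON array string."""
    hits = []
    for name, seq in chunk:
        identity = percent_identity(plasmid, seq)
        if identity >= IDENTITY_THRESHOLD:
            hits.append({"feature": name, "identity": identity})
    return json.dumps(hits)


def align_in_parallel(plasmid, features, n):
    """Client side: start n subjobs and append partial results as each
    one completes."""
    results = []
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(run_subjob, plasmid, chunk)
                   for chunk in split_features(features, n)]
        for future in as_completed(futures):
            results.extend(json.loads(future.result()))
    return results


if __name__ == "__main__":
    demo_features = [("f1", "CCCGGG"), ("f2", "TTTAAA"), ("f3", "GGGTTT")]
    print(align_in_parallel("AAACCCGGGTTT", demo_features, n=2))
```

Because subjobs finish in no fixed order, the combined list is not guaranteed to follow the order of the feature table; sort it afterwards if order matters.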

Cloud Performance

We have tested this on an alignment of the pUC18 plasmid, which consists of a sequence of 2,680 letters, against a library of 17,498 yeast features, each about 400 base pairs long. Running conventionally without the cloud, we found that it takes a local machine 39 minutes to complete this alignment. In over 80 timed executions on the cloud, with very little standard error, we found that running it with 10 processors cut the time to three minutes, and running it with 30 processors cut it to nearly one minute. pUC18 is a relatively small sequence. Many sequences of interest are thousands of letters long or more, and libraries can hold a very large number of features, so some alignments could take weeks to complete. Certain alignment tasks would also require more memory than a local machine can handle, so this is the kind of job that could only be done through a cloud server. With this kind of improvement, we are making the impossible in biology possible.
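The timings above translate into speedup figures with a little arithmetic. The sketch below uses the rounded values from the text (39 minutes locally, 3 minutes on 10 processors, about 1 minute on 30), so the outputs are approximate; the helper names `speedup` and `efficiency` are ours.

```python
def speedup(local_minutes, cloud_minutes):
    """How many times faster the cloud run is than the local run."""
    return local_minutes / cloud_minutes


def efficiency(local_minutes, cloud_minutes, processors):
    """Speedup per processor; 1.0 would mean ideal linear scaling
    relative to the local machine."""
    return speedup(local_minutes, cloud_minutes) / processors


if __name__ == "__main__":
    # Rounded timings from the text; "nearly one minute" is taken as 1.
    for procs, minutes in [(10, 3), (30, 1)]:
        print(procs, "processors:",
              round(speedup(39, minutes), 1), "x speedup,",
              round(efficiency(39, minutes, procs), 2), "per processor")
```

With these rounded numbers the per-processor figure comes out slightly above 1. That may simply reflect rounding, or the cloud machines may each be faster than the local one; the measurements reported here do not distinguish between the two.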