Team:Johns Hopkins-Software/BiobrickAnalysis

From 2012.igem.org

Biobrick Analysis
When creating AutoPlasmid, we recognized the need for a streamlined, automated method of annotating sequences in synthetic biology. Hand-annotated sequences are prone to errors, and even with many eyes overseeing the annotation process, those errors can be difficult to spot. The Registry of Standard Parts currently has over 20,000 parts: roughly 7,000 are categorized as available, 11,000 as planning, and 1,500 as deleted, and the rest are categorized as unavailable, missing, or informational. New biobrick parts are characterized every year and annotated by hand, which often leads to errors in the sequence annotations. To this end, we used AutoPlasmid’s annotation capabilities to check the annotations of all the biobricks in the Registry of Standard Parts (as of September 1, 2012).

Methodology
We read the XML file that provides the data for each biobrick part, converted it into a format accepted by AutoPlasmid, and cross-checked the annotations given in the XML file against the annotations produced by AutoPlasmid. In this test, we used only perfect (exact-match) alignments, not imperfect ones. Any biobrick that had a notable mistake in its annotations was flagged, and the mistake was recorded. Other properties of each biobrick, such as its status (available, planning, deleted, etc.), were also recorded.
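The cross-check described above can be illustrated with a minimal sketch. This is not AutoPlasmid's code: the XML layout, the tag names (`seq_data`, `feature`, `startpos`, and so on), the sample part `BBa_EXAMPLE`, and the small reference library are all assumptions made for illustration, and the real Registry files may be organized differently. The sketch parses one part, then flags any annotation whose coordinates do not coincide with an exact occurrence of the reference sequence.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified part record; the real Registry XML may differ.
SAMPLE_XML = """
<part>
  <part_name>BBa_EXAMPLE</part_name>
  <status>A</status>
  <seq_data>ttgacaaaagaggagaaatactag</seq_data>
  <features>
    <feature>
      <title>rbs</title>
      <direction>forward</direction>
      <startpos>7</startpos>
      <endpos>18</endpos>
    </feature>
  </features>
</part>
"""


def parse_part(xml_text):
    """Turn one part's XML into a plain dict (assumed tag names)."""
    root = ET.fromstring(xml_text)
    # Remove any whitespace inside the sequence and normalize the case.
    sequence = "".join(root.findtext("seq_data", default="").split()).lower()
    features = []
    for feat in root.iter("feature"):
        features.append({
            "title": feat.findtext("title", default=""),
            "direction": feat.findtext("direction", default="forward"),
            "start": int(feat.findtext("startpos")),  # 1-based, inclusive
            "end": int(feat.findtext("endpos")),      # 1-based, inclusive
        })
    return {
        "name": root.findtext("part_name", default=""),
        "status": root.findtext("status", default=""),
        "sequence": sequence,
        "features": features,
    }


def exact_hits(sequence, reference):
    """Return 1-based inclusive (start, end) spans of every exact match."""
    hits = []
    pos = sequence.find(reference)
    while pos != -1:
        hits.append((pos + 1, pos + len(reference)))
        pos = sequence.find(reference, pos + 1)
    return hits


def flag_part(part, reference_library):
    """List feature titles whose coordinates match no exact hit."""
    flagged = []
    for feat in part["features"]:
        reference = reference_library.get(feat["title"])
        if reference is None:
            continue  # nothing to compare against
        hits = exact_hits(part["sequence"], reference)
        if (feat["start"], feat["end"]) not in hits:
            flagged.append(feat["title"])
    return flagged


part = parse_part(SAMPLE_XML)
print(flag_part(part, {"rbs": "aaagaggagaaa"}))
```

Because only exact matches count as hits, a single-base difference between the reference and the part is enough to flag an annotation, which mirrors the perfect-alignment restriction of this test.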



Results
We noticed two very common errors in the biobrick annotations from the XML data. The first was an incorrectly defined strand: the annotation stated that the feature was on the reverse strand when it was actually on the forward strand, or vice versa (Wrong Strand). The second was an annotated sequence that didn’t match the correct sequence (Mismatch). Other, more surprising errors included annotations that lay outside the biobrick part’s sequence (Out of Bounds) and biobrick parts that were less than 3 base pairs long (Empty).
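One way to tell the four error types apart is sketched below. The function name `classify`, its parameters, the 1-based inclusive coordinates, and the order in which the checks run are our assumptions for illustration; the category labels themselves are the ones used above.

```python
_COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Reverse complement of a lowercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def classify(part_seq, start, end, direction, reference):
    """Classify one annotation against the part sequence.

    start and end are 1-based and inclusive (an assumption);
    direction is "forward" or "reverse"; reference is the sequence
    the annotated feature is expected to have.
    """
    part_seq = part_seq.lower()
    reference = reference.lower()
    if len(part_seq) < 3:
        return "Empty"
    if start < 1 or end > len(part_seq) or start > end:
        return "Out of Bounds"
    region = part_seq[start - 1:end]
    # Read the region on the strand the annotation claims.
    if direction == "forward":
        as_annotated = region
    else:
        as_annotated = reverse_complement(region)
    if as_annotated == reference:
        return "OK"
    # Matching only on the opposite strand means the strand label is wrong.
    if reverse_complement(as_annotated) == reference:
        return "Wrong Strand"
    return "Mismatch"


part = "ttgacaaaagaggagaaatactag"
print(classify(part, 7, 18, "forward", "aaagaggagaaa"))
print(classify(part, 7, 18, "reverse", "aaagaggagaaa"))
```

Checking the opposite strand before reporting a mismatch keeps the two common error types separate: a feature that is present but labeled with the wrong direction is not counted as a sequence mismatch.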







Conclusion
Our analysis was a very quick scan of the biobrick parts: we checked only perfect alignments and didn’t take into account potential mutations in annotation sequences that could still produce the same result. Some flagged annotations may therefore not be truly incorrect. Even so, the scan isolates the parts that may have annotation errors and need to be checked, a task that would have taken several hours if a single person were doing it. We see this as a reason for synthetic biologists to use software to help annotate their constructed sequences rather than annotating by hand, since computer-generated annotations from simple alignment algorithms should be much more accurate and would reduce the number of errors currently seen in the Parts Registry. Because the Parts Registry is constantly growing, and constructs will become more complicated as synthetic biology advances, we expect that annotating with software will help to reduce both future errors and the man-hours spent correcting incorrect annotations.

A complete list of our results can be found here as a comma-separated values (CSV) file. Each entry lists the biobrick part, what was annotated, the sequence the annotation should have, the sequence on the biobrick part, where the error was found, the type of error, and the status of the biobrick part (A: available, P: planning, U: unavailable, I: informational, M: missing, D: deleted).
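A file with this layout could be summarized with a short script such as the sketch below. The column names in `COLUMNS`, the assumption that the file has no header row, and the two sample rows are ours; the sample rows are made-up placeholders, not entries from the actual results.

```python
import csv
import io
from collections import Counter

# Assumed column order, following the description of the entries above.
COLUMNS = ["part", "annotation", "expected_seq", "observed_seq",
           "location", "error_type", "status"]

STATUS_NAMES = {"A": "available", "P": "planning", "U": "unavailable",
                "I": "informational", "M": "missing", "D": "deleted"}


def tally_errors(csv_file):
    """Count rows per (error type, part status) pair."""
    counts = Counter()
    for row in csv.DictReader(csv_file, fieldnames=COLUMNS):
        status = STATUS_NAMES.get(row["status"].strip(), "unknown")
        counts[(row["error_type"].strip(), status)] += 1
    return counts


# Made-up placeholder rows, only to show the expected shape of the file.
sample = io.StringIO(
    "BBa_EXAMPLE1,rbs,aaagaggagaaa,tttctcctcttt,7..18,Wrong Strand,A\n"
    "BBa_EXAMPLE2,rbs,aaagaggagaaa,aaagaggagaat,1..12,Mismatch,P\n"
)
for (error_type, status), n in sorted(tally_errors(sample).items()):
    print(error_type, status, n)
```

To run it on a downloaded copy of the results, pass an open file object (opened with `newline=""`, as the `csv` module recommends) instead of the in-memory sample.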

A zip file containing all of the biobrick XML files that we used can be found here.

Retrieved from "http://2012.igem.org/Team:Johns_Hopkins-Software/BiobrickAnalysis"