Minimal Genome Designer

- Analysis

1. Introduction

1-1. Suggestion

Since the Genome project started in 2002, genetic information for many species has become easy to obtain. As laboratory techniques have developed, it has also become possible to insert and synthesize genomes. If we could design a whole genome, we would be able to build a unique and useful genome. To date, however, a minimal genome composed of essential genes has been synthesized successfully, but it did not last.

1-2. Objective

To design a genome, we have to analyze the pattern of the genome and the distribution of its genes.

1-3. Method and scope of the study

The study used information on Streptococcus species from the PATRIC database (http://www.patricbrc.org/portal/portal/patric/Home) and SynbUID.
The database was built with MySQL 5.5.27, and the statistical analysis was performed with SAS 9.3.

2. Design

2-1. Preparation

1) Building the database

The Genome name entity consists of the attributes ID, Genome_name, COG, Start, End, Strand, and Size.
The Annotation Table_EG entity consists of the attributes ID, locus, and SynbUID. The two entities are paired one-to-one by Locus_tag.

2) Representative sample size

To check how many specimens are needed for a representative sample, we used simple random sampling and assumed that the complete genomes are a random sample. We used a significance level of α = 0.05 and a limit of error of b = 0.1. Streptococcus comprises 494 species in total, of which 82 have complete genomes. According to our calculation, a sample of 81 species is sufficient. The 82 complete species are therefore representative of Streptococcus.
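The calculation itself is not shown here. As a minimal sketch, the standard sample-size formula for estimating a proportion under simple random sampling, with a finite population correction, gives the same figure of 81. The formula, the worst-case proportion p = 0.5, and the function name `required_sample_size` are assumptions for illustration and are not taken from the original analysis.

```python
import math
from statistics import NormalDist


def required_sample_size(population, alpha=0.05, error=0.1, p=0.5):
    """Sample size for estimating a proportion under simple random sampling.

    population: total number of units (494 species in the text)
    alpha:      significance level (0.05 in the text)
    error:      limit of error (0.1 in the text)
    p:          assumed proportion; 0.5 is the worst case (an assumption)
    """
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # Sample size for an infinite population.
    n0 = z ** 2 * p * (1 - p) / error ** 2
    # Finite population correction.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


print(required_sample_size(494))  # prints 81
```

Under these assumptions the 82 complete species exceed the required 81, which is consistent with the conclusion above.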

3) Standard

3-1) Dividing the genome into intervals

The number of genes and the size of the genome differ between species. To compensate for this, we divided each genome into sections so that gene positions are expressed as a proportion of the genome's size. When we used fewer than a hundred sections, the patterns were hard to see because the data were diluted. When we used more than a hundred sections, the results differed little from those obtained with a hundred. We therefore decided to divide each genome into a hundred sections.
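The sectioning step can be sketched as follows. This is an illustrative sketch only: it assumes 1-based coordinates, uses each gene's Start coordinate as its position, and takes the total genome length as a separate input. The function names `section_index` and `genes_per_section` are hypothetical.

```python
from collections import Counter


def section_index(position, genome_length, n_sections=100):
    """Map a 1-based genome coordinate to a section index in 0..n_sections-1."""
    if not 1 <= position <= genome_length:
        raise ValueError("position must lie within the genome")
    # Integer arithmetic keeps the last base of the genome in the last section.
    return (position - 1) * n_sections // genome_length


def genes_per_section(gene_starts, genome_length, n_sections=100):
    """Count genes per section, using each gene's start coordinate (assumption)."""
    counts = Counter(
        section_index(start, genome_length, n_sections) for start in gene_starts
    )
    return [counts.get(i, 0) for i in range(n_sections)]


# Hypothetical example: three genes on a 2,000,000 bp genome.
profile = genes_per_section([1, 10, 1_000_001], genome_length=2_000_000)
print(profile[0], profile[50])  # prints 2 1
```

Because every genome is reduced to the same number of sections, profiles from genomes of different sizes can be compared position by position.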

3-2) Identifying the starting point

The first ORF in the gene sequence analysis data differs from species to species. We therefore needed a common standard to align the beginning of the data. We checked the strand pattern of each genome and used it to identify the starting point.

Team:CBNU-Korea/Project/GD/Analysis

From 2012.igem.org