Team:CBNU-Korea/Project/GD/Analysis

From 2012.igem.org

Minimal Genome Designer

- Analysis

1. Introduction

1-1. Suggestion

Since the Genome project started in 2002, genetic information for many species has become easy to obtain. As scientific techniques have developed, it has also become possible to insert and synthesize genomes. If we could design a whole genome, we would be able to build a unique genome made for a useful purpose. To date, however, a minimal genome composed of essential genes has been synthesized successfully, but it did not last.

1-2. Objective

To design a genome, we have to analyze the patterns of the genome and the distribution of its genes.

1-3. Method and scope of the study

The study used information on Streptococcus species from the PATRIC database (http://www.patricbrc.org/portal/portal/patric/Home) and SynbUID.
The database was built with MySQL 5.5.27, and the statistical analysis was carried out with SAS 9.3.

2. Design

2-1. Preparation

1) Build database

The Genome entity has the attributes ID, Genome_name, COG, Start, End, Strand, and Size.
The Annotation entity (Table_EG) has the attributes ID, locus, and SynbUID. The two entities are paired one-to-one by Locus_tag.
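The two entities and their one-to-one pairing can be sketched as follows. This is only an illustration: it uses Python's built-in SQLite module as a stand-in for MySQL, the table names `genome` and `annotation_eg` are hypothetical, the explicit `Locus_tag` column on the genome side is an assumption about how the pairing key is stored, and the rows are made-up demo data.

```python
import sqlite3

# Hypothetical schema: column names follow the attribute lists in the text,
# table names and the Locus_tag join column are assumptions.
SCHEMA = """
CREATE TABLE genome (
    "ID"          INTEGER PRIMARY KEY,
    "Genome_name" TEXT NOT NULL,
    "Locus_tag"   TEXT NOT NULL UNIQUE,   -- assumed pairing key
    "COG"         TEXT,
    "Start"       INTEGER NOT NULL,
    "End"         INTEGER NOT NULL,
    "Strand"      TEXT CHECK ("Strand" IN ('+', '-')),
    "Size"        INTEGER
);
CREATE TABLE annotation_eg (
    "ID"      INTEGER PRIMARY KEY,
    "locus"   TEXT NOT NULL UNIQUE REFERENCES genome("Locus_tag"),
    "SynbUID" TEXT
);
"""


def build_demo_db():
    """Create an in-memory database with two demo genes, one of them essential."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO genome VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        [
            (1, "Streptococcus demo strain", "DEMO_0001", "J", 1, 900, "+", 900),
            (2, "Streptococcus demo strain", "DEMO_0002", "K", 1000, 1600, "-", 601),
        ],
    )
    conn.execute("INSERT INTO annotation_eg VALUES (1, 'DEMO_0001', 'SYNB_DEMO_1')")
    return conn


def essential_genes(conn):
    """Join the two entities on the locus tag (one row per annotated gene)."""
    query = """
        SELECT g."Locus_tag", g."Strand", a."SynbUID"
        FROM genome AS g
        JOIN annotation_eg AS a ON a."locus" = g."Locus_tag"
        ORDER BY g."Start"
    """
    return conn.execute(query).fetchall()


demo_conn = build_demo_db()
demo_rows = essential_genes(demo_conn)
```

The `UNIQUE` constraint on both sides of the join is what makes the pairing one-to-one in this sketch.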

2) Representative sample size

To check how many specimens are needed for a representative sample, we used the simple random sampling method and assumed that the completed genomes are a random sample. We used a significance level (a = 0.05) and a limit of error (b = 0.1). There are 494 Streptococcus species in total, of which 82 are complete. According to our calculation, a sample of 81 species is sufficient. Therefore, the 82 complete species are representative of Streptococcus.
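The text does not state which sample-size formula was used. As a hedged illustration, the standard formula for estimating a proportion under simple random sampling with a finite population correction, taking the most conservative proportion p = 0.5 (an assumption), gives a value consistent with the 81 species quoted above. The function name and the choice of p are not from the original.

```python
import math
from statistics import NormalDist


def required_sample_size(population, alpha=0.05, margin=0.1, p=0.5):
    """Sample size for a proportion under simple random sampling.

    Assumed formula (not confirmed by the source):
        n0 = z^2 * p * (1 - p) / margin^2
        n  = n0 / (1 + (n0 - 1) / population)
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)      # two-sided critical value
    n0 = z ** 2 * p * (1 - p) / margin ** 2      # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)         # finite population correction
    return math.ceil(n)


needed = required_sample_size(494, alpha=0.05, margin=0.1)
print(needed)  # 81 under the assumptions above
```

With 494 species, a = 0.05 and b = 0.1, the uncorrected size is about 96 and the corrected size rounds up to 81, so the 82 completed genomes meet the requirement under this formula.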

3) Standardization

3-1) Dividing the genome into intervals

The number of genes and the size of the genome differ between species. To compensate for this, we divided each genome into sections so that gene positions are expressed as a proportion of the genome size. When we divided the genome into fewer than a hundred sections, the patterns were hard to see because the data were diluted. When we divided it into more than a hundred sections, the result was not much different from the result with a hundred sections. We therefore decided to divide each genome into a hundred sections.
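A minimal sketch of this normalization, assuming each gene is assigned to a section by the midpoint of its Start and End coordinates (the original does not say whether the start, end, or midpoint was used). The function name is hypothetical.

```python
def section_index(start, end, genome_size, n_sections=100):
    """Map a gene to one of n_sections equal-width bins along the genome.

    Assumption: the gene is placed by the midpoint of its coordinates.
    The result is in the range 0 .. n_sections - 1.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    midpoint = (start + end) / 2
    fraction = midpoint / genome_size            # relative position, 0..1
    # Clamp so that a gene ending exactly at the genome end stays in the last bin.
    return min(int(fraction * n_sections), n_sections - 1)


# Genes at the same relative position land in the same section,
# even when the genomes differ in size.
small = section_index(500_000, 500_900, 1_000_000)
large = section_index(1_000_000, 1_001_800, 2_000_000)
```

Because the position is divided by the genome size before binning, genomes of different lengths become directly comparable section by section.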

3-2) Identifying the starting point

The first ORF in the sequence analysis data differs from species to species. We therefore needed a specific standard to align the beginning of the data. We checked the strand pattern of each genome and used the strands to identify the starting point.

2-2. Analysis

1) Strand

- We randomly chose 77 of the 82 species, estimated the pattern of the strand ratio in each section, and verified the estimate with the other 5 species.

- We checked the strand ratio of the essential genes.
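The two steps above can be sketched as follows, assuming each gene has already been assigned a section index and a strand sign. The function names, the fixed random seed, and the demo data are illustrative assumptions, not part of the original analysis.

```python
import random


def split_species(names, n_holdout=5, seed=0):
    """Randomly split species into an estimation set and a verification set."""
    shuffled = list(names)
    random.Random(seed).shuffle(shuffled)        # seed fixed only for repeatability
    return shuffled[n_holdout:], shuffled[:n_holdout]


def plus_strand_percent(genes, n_sections=100):
    """Percentage of '+' strand genes in each section.

    genes: iterable of (section_index, strand) pairs, strand being '+' or '-'.
    Sections without genes yield None instead of a percentage.
    """
    plus = [0] * n_sections
    total = [0] * n_sections
    for section, strand in genes:
        total[section] += 1
        if strand == "+":
            plus[section] += 1
    return [100.0 * p / t if t else None for p, t in zip(plus, total)]


species = [f"species_{i:02d}" for i in range(82)]   # placeholder names
estimation, verification = split_species(species)

demo_genes = [(0, "+"), (0, "-"), (0, "+"), (0, "+"), (1, "-")]
demo_ratio = plus_strand_percent(demo_genes, n_sections=3)
```

Within a section, the '+' percentage and the '-' percentage add up to 100, so only one of the two needs to be stored.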

2) Region

- We determined the COG and the frequency of the essential genes, because these are referred to when designing a genome.

2-1) We checked where the essential genes are distributed.
2-2) We checked where the genes that were assigned a COG are distributed.

3. Results

3-1. Strand

When we checked the strand pattern of the 82 species, the genes were distributed in 4 regions with different tendencies. We therefore set the number of sections in SAS PROC TRANSREG to 4 and ran the analysis.

1) Estimating the transformation linear regression

We tested the null hypothesis that the regression model does not fit the data against the alternative hypothesis that it does. In SAS, the F-value was 3093.13 and the p-value was < .0001. Therefore, at a significance level of 0.01, the null hypothesis is rejected. In other words, the regression model is suitable.

2) Estimated coefficients β0, β1

The smaller the significance probability (p-value), the stronger the evidence that a coefficient affects the dependent variable. For β0, the F-value is 10.13 and Pr > F is 0.0015, so the null hypothesis is rejected, and the estimate is 1.22744031. For β1, the F-value is 3093.13 and Pr > F is < .0001, so the null hypothesis is again rejected, and the estimate is 0.97546498.

3) Estimated transformation regression model

When we looked at the strand distribution of each species, we found a similar pattern. We therefore studied the pattern of the strand distribution after aligning the sections at the points where the strand's sign changes. Using the transformation regression method (PROC TRANSREG), we obtained a spline regression model that expresses the strand distribution pattern of the 82 species.

Identity(spercent) = 1.22744031 + 0.97546498*spline(interval)
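As a small worked example, the prediction equation can be evaluated once the spline-transformed interval value is known. The spline transformation itself is produced by SAS and is not given in the text, so this sketch takes the transformed value as an input. The function name is hypothetical; only the two coefficients come from the results above.

```python
BETA_0 = 1.22744031   # estimated intercept reported above
BETA_1 = 0.97546498   # estimated slope reported above


def predict_spercent(spline_value):
    """Predicted '+' strand percentage for a spline-transformed interval value.

    spline_value stands for spline(interval); how SAS computes it from the
    raw section number is not described in the text.
    """
    return BETA_0 + BETA_1 * spline_value


at_zero = predict_spercent(0.0)
at_fifty = predict_spercent(50.0)
```

Because the slope is close to 1 and the intercept is small, the predicted percentage stays close to the spline-transformed value itself.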

The X axis shows the 100 sections of the genome for the 77 randomly selected species, and the Y axis shows the ratio of the + pattern in each section. In every section, the + and - patterns sum to 100. According to the graph, when the reference value is 50, the + pattern appears at about 25 on the left, and the - pattern on the right is higher than 80. So in this case, it is a + pattern.

4) Verifying that the estimated prediction equation is adequate

We used a chi-square test to verify whether the transformation linear regression prediction equation, estimated from the 77 randomly selected species, is adequate. The null hypothesis is that the prediction equation and the other 5 species that were not selected are independent. The alternative hypothesis is that the prediction equation and the 5 species are dependent. If the p-value for the 5 species leads us to reject the null hypothesis, we conclude that they depend on the estimated prediction equation.
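The text does not describe how the contingency table for the chi-square test was built. As a hedged sketch, suppose each section of a held-out species is classified by the predicted dominant strand and by the observed dominant strand, giving a 2x2 table. The table layout, the function name, and the counts below are assumptions for illustration only, and the p-value shortcut used here is valid only for one degree of freedom.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table.

    table: [[a, b], [c, d]] of observed counts.
    Returns (statistic, p_value). The p-value uses the identity
    P(X > x) = erfc(sqrt(x / 2)), which holds for 1 degree of freedom only.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (table[i][j] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Made-up counts: rows = predicted '+' / '-', columns = observed '+' / '-'.
stat, p = chi_square_2x2([[20, 5], [5, 20]])
reject_independence = p < 0.05
```

A small p-value rejects independence, meaning the held-out observations depend on (agree with) the prediction, which is the outcome the verification step looks for.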

3-2. Region

We estimated the origin, which is the part where the strand pattern changes.

1) The distribution of the essential genes is shown below.

The X axis shows the 20 sections of the genome size, and the Y axis shows the frequency of the genes. The graph shows the estimated distribution of the essential genes. Of the 485 essential genes, 322 are distributed with bilateral symmetry around the origin. We can distinguish the Synb_IDs that occur with high frequency at the origin from the Synb_IDs that occur with high frequency on both sides of it.

2) The distribution of the genes that were assigned a COG is shown below.

The X axis shows the 20 sections of the genome size, and the Y axis shows the frequency of the genes that were assigned a COG.
The graph of the estimated distribution shows that the genes with a COG are distributed symmetrically around the origin.
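The frequency counts behind both distributions can be sketched as below. The sketch assumes that gene positions are expressed as fractions of the genome size and that the origin has been placed in the middle of the axis, so symmetry can be read off by comparing mirrored sections. The function names and the demo positions are hypothetical.

```python
from collections import Counter


def section_frequencies(fractions, n_sections=20):
    """Count genes per section from relative positions in the range 0..1."""
    counts = Counter(
        min(int(f * n_sections), n_sections - 1) for f in fractions
    )
    return [counts.get(i, 0) for i in range(n_sections)]


def mirror_pairs(frequencies):
    """Pair each section with its mirror image about the middle of the axis."""
    n = len(frequencies)
    return [(frequencies[i], frequencies[n - 1 - i]) for i in range(n // 2)]


demo_positions = [0.01, 0.49, 0.51, 0.99, 0.5]   # made-up relative positions
freq = section_frequencies(demo_positions)
pairs = mirror_pairs(freq)
```

Similar counts within each mirrored pair indicate a symmetric distribution around the origin. How closely the counts must agree to count as symmetric is a judgment the original does not specify.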