Team:CBNU-Korea/Project/GD/Analysis
From 2012.igem.org
Minimal Genome Designer
- Analysis
1. Introduction
1-1. Suggestion
Since the Genome project started in 2002, the genetic information of many species has become easy to obtain. As scientific techniques have developed, it has also become possible to insert and synthesize genomes. If we could design a whole genome, we would be able to create a unique genome tailored to a useful purpose. To date, a minimal genome composed of essential genes has been synthesized successfully, but it did not last.
1-2. Objective
To design a genome, we have to analyze the pattern of the genome and the distribution of its genes.
1-3. Method and scope of the study
The study used information on Streptococcus species from the PATRIC database (http://www.patricbrc.org/portal/portal/patric/Home) and SynbUID. The data were stored in MySQL 5.5.27, and the statistical analysis was performed with SAS 9.3.
2. Design
2-1. Prepare
1) Build database
The Genome name entity consists of the attributes ID, Genome_name, COG, Start, End, Strand, and Size.
The annotation entity Table_EG consists of the attributes ID, locus, and SynbUID. The two entities are paired one-to-one by Locus_tag.
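The two entities can be sketched as tables as follows. This is only an illustration using Python's built-in sqlite3 module, not the MySQL 5.5.27 database used in the study. The table names `genome` and `table_eg`, the column types, the sample row, and the choice of `genome.id` as the column holding the locus tag are all assumptions made for the example.

```python
import sqlite3

def build_schema(conn):
    """Create the two tables; names and types are illustrative assumptions."""
    conn.executescript("""
        CREATE TABLE genome (
            id          TEXT PRIMARY KEY,  -- assumed to hold the locus tag
            genome_name TEXT NOT NULL,
            cog         TEXT,
            "start"     INTEGER NOT NULL,
            "end"       INTEGER NOT NULL,
            strand      TEXT CHECK (strand IN ('+', '-')),
            size        INTEGER NOT NULL
        );
        CREATE TABLE table_eg (
            id      INTEGER PRIMARY KEY,
            locus   TEXT NOT NULL UNIQUE REFERENCES genome(id),  -- one-to-one
            synbuid TEXT
        );
    """)

conn = sqlite3.connect(":memory:")
build_schema(conn)

# Hypothetical sample rows, only to show the one-to-one pairing.
conn.execute(
    "INSERT INTO genome VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("LOCUS_0001", "example genome", "J", 1, 900, "+", 900),
)
conn.execute(
    "INSERT INTO table_eg (locus, synbuid) VALUES (?, ?)",
    ("LOCUS_0001", "SYNB_EXAMPLE"),
)

# Join each gene with its essential-gene annotation via the locus tag.
rows = conn.execute(
    "SELECT g.id, g.strand, e.synbuid "
    "FROM genome g JOIN table_eg e ON e.locus = g.id"
).fetchall()
```

The UNIQUE constraint on `locus` is what makes the pairing one-to-one in this sketch: each gene can have at most one annotation row.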
2) Representative sample size
To check how many specimens are needed for a representative sample, we used simple random sampling and assumed that the completed genomes are a random sample of the species. We used a significance level of α = 0.05 and a limit of error of b = 0.1. Streptococcus has 494 species in total, of which 82 are completed. According to our calculation, 81 species are sufficient. Therefore, the 82 complete species are representative of Streptococcus.
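The sample-size figure can be reproduced with the usual formula for estimating a proportion under simple random sampling with a finite population correction. The text does not state which formula was used, so the formula itself and the worst-case proportion p = 0.5 are assumptions; with them, the significance level 0.05, the limit of error 0.1, and the 494 species give the same answer as the text.

```python
import math
from statistics import NormalDist

def required_sample_size(population, alpha=0.05, error=0.1, p=0.5):
    """Sample size for a proportion under simple random sampling.

    Assumes the worst-case proportion p = 0.5 and applies the
    finite population correction for a population of the given size.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)          # two-sided critical value
    n0 = z * z * p * (1 - p) / (error * error)       # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)             # finite population correction
    return math.ceil(n)

needed = required_sample_size(494)   # 81 for the values quoted in the text
```

Under these assumptions the uncorrected size is about 96, and the correction for a population of 494 brings it down to 81, one fewer than the 82 completed species.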
3) Standard
3-1) Dividing the genome into intervals
Genome size and the number of genes differ between species. To compensate for this, we divided each genome into sections so that gene positions are expressed as a proportion of the genome's size. When we used fewer than a hundred sections, the patterns were hard to see because the data were diluted. When we used more than a hundred sections, the result was not much different from the result with a hundred. So we decided to divide each genome into a hundred sections.
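A minimal sketch of this normalization is shown below. It maps a 1-based genome coordinate to one of the hundred sections so that genomes of different lengths share the same X axis. The function name `section_index` and the use of the gene start coordinate as the position are assumptions for illustration.

```python
def section_index(position, genome_length, n_sections=100):
    """Map a 1-based genome coordinate to a section number 0..n_sections-1."""
    if not 1 <= position <= genome_length:
        raise ValueError("position must lie within the genome")
    # Integer arithmetic avoids floating-point edge cases at section borders.
    return (position - 1) * n_sections // genome_length

first = section_index(1, 2_000_000)            # first base of the genome
middle = section_index(1_000_001, 2_000_000)   # first base of the second half
last = section_index(2_000_000, 2_000_000)     # last base of the genome
```

Because the index depends only on the ratio of position to genome length, a gene halfway along a 1.8 Mb genome and a gene halfway along a 2.2 Mb genome land in the same section.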
3-2) Identifying the starting point
The first ORF in the sequence analysis data differs from species to species. We therefore needed a common standard to align the beginning of the data. We checked the strand pattern of each genome and used the strands to identify the starting point.
2-2. Analysis
1) Strand
1-1) Method
- We chose 77 of the 82 species at random, estimated the pattern of the strand ratio in each section, and verified the estimate with the other 5 species.
- We checked the strand ratio of the essential genes.
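The random split into 77 species for estimation and 5 for verification can be sketched as below. The species labels, the function name `split_species`, and the fixed seed are placeholders; the text does not say how the random selection was actually carried out.

```python
import random

def split_species(species, n_holdout=5, seed=0):
    """Randomly split species into an estimation set and a held-out set."""
    rng = random.Random(seed)        # fixed seed only to make the sketch repeatable
    shuffled = list(species)
    rng.shuffle(shuffled)
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Placeholder labels standing in for the 82 completed species.
species = [f"species_{i:02d}" for i in range(1, 83)]
fit_set, holdout_set = split_species(species)
```

The two sets are disjoint, so the 5 held-out species play no part in estimating the strand pattern they are later used to check.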
1-2) Region
- We checked where the genes are distributed in the genome.
3. Result
3-1. Strand
When we checked the strand pattern of the 82 species, the genes were distributed in 4 regions with different tendencies. We therefore set the number of sections in PROC TRANSREG to 4 for the analysis.
1) Estimating the transformation regression
We tested the null hypothesis that the regression model does not fit the data against the alternative hypothesis that it does. SAS reported an F-value of 3093.13 with a p-value < .0001. Therefore, at a significance level of 0.01, the null hypothesis is rejected. In other words, the regression model is suitable.
2) Estimating the coefficients β0, β1
The smaller the p-value of a coefficient, the stronger the evidence that it affects the dependent variable. For β0, the F-value is 10.13 and Pr > F is 0.0015, so the null hypothesis is rejected; the estimate is 1.22744031. For β1, the F-value is 3093.13 and Pr > F is < .0001, so the null hypothesis is again rejected; the estimate is 0.97546498.
3) Estimating the transformation regression model
Looking at the strand distribution of each species, we found a similar pattern. We therefore studied the pattern of the strand distribution after aligning each genome to a common standard, namely the section where the strand's sign changes. Using the transformation regression method, we obtained a spline regression model that expresses the strand distribution pattern of the 82 species.
Identity(spercent) = 1.22744031
+ 0.97546498*spline(interval)
The X axis shows the 100 sections of the genome for the 77 randomly selected species, and the Y axis shows the percentage of + strand genes in each section. The + and - percentages of each section sum to 100. In the graph above, taking 50 as the reference value, the + pattern is about 25 on the left, and the - pattern on the right is higher than 80. So in this case, it is a + pattern.
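The Y-axis quantity can be computed per section as in the sketch below. The input format (a list of `(start, strand)` pairs with 1-based start coordinates) and the function name `plus_strand_percent` are assumptions for illustration; sections that contain no genes are reported as `None` rather than as 0.

```python
def plus_strand_percent(genes, genome_length, n_sections=100):
    """Percentage of + strand genes in each section of one genome.

    genes: iterable of (start, strand) with 1-based start and strand '+' or '-'.
    """
    plus = [0] * n_sections
    total = [0] * n_sections
    for start, strand in genes:
        idx = (start - 1) * n_sections // genome_length   # section of this gene
        total[idx] += 1
        if strand == "+":
            plus[idx] += 1
    # The - percentage of a non-empty section is 100 minus the + percentage.
    return [100.0 * p / t if t else None for p, t in zip(plus, total)]

# Tiny made-up genome: 1000 bases, 10 sections, 4 genes.
example = plus_strand_percent(
    [(1, "+"), (5, "-"), (10, "+"), (995, "-")], genome_length=1000, n_sections=10
)
```

In the made-up example, the first section holds two + genes and one - gene, the last section holds a single - gene, and the sections in between are empty.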
4) Verifying that the estimated prediction equation is adequate
We used a chi-square test to verify whether the prediction equation estimated by transformation regression from the 77 randomly selected species is adequate. The null hypothesis is that the other 5 species, which were not selected, are independent of the prediction equation. The alternative hypothesis is that the 5 species are dependent on the prediction equation. The p-value for the 5 species is computed under the null hypothesis of independence; when the null hypothesis is rejected, we conclude that the 5 species are dependent on the prediction equation.
3-2. Region
We estimated the origin, which is the region where the strand pattern changes.
1) The spread of the essential genes is shown in the table below.
Estimating the distribution of the essential genes produced the graph shown. Of the 485 essential genes, 322 are distributed with bilateral symmetry around the origin. The Synb_IDs can be divided into those with high frequency at the origin and those with high frequency on both sides of the origin.
2) The spread of the genes with COG annotations is shown in the table below.