Minimal Genome Designer


1. Method

1-1. Selection of DB and Reasons

To provide a range of information about genomes, we combined databases from several sources. As a basis, we used the complete-genome information provided by NCBI. NCBI is useful because it provides not only genome sequences but also COG information on gene function. Because we consider it important that our results do not differ much from experimental results, we also used the essential-gene data compiled by DEG. In addition, to overcome the limitations of the COG functional classification, we used Gene Ontology information to describe gene products. Because most of the selected data are linked to PATRIC, we used the PATRIC data as our base.

1-2. Composition of DB

We built our database from two main sites: NMPDR (http://www.patricbrc.org/portal/portal/patric/Home) and DEG (http://tubic.tju.edu.cn/deg/). NMPDR in particular links its own data to the NCBI and GO (http://www.geneontology.org/) databases, so in effect our database is built from four different databases. The NMPDR and NCBI databases provide genome and gene sequence information, DEG provides information on essential genes determined by experiment, and GO provides information on gene function.

1-3. Selection of the Subject of Analysis and the Reasons

1) Selection of the Subject of Analysis and the Reason

Even with a computer, finding the genes that all species have in common is known to take a long time. We also found that, to establish our own methods and analysis standards that reproduce experimental results, we needed to restrict the analysis to a specific target. We therefore selected a subject for analysis.

2) The Subject of Analysis

The subject of analysis has to be one for which the accuracy of our analysis can be checked. There should therefore be more than two species in the same genus whose essential genes have been identified in vitro. Among the data provided by DEG, which catalogs experimentally determined essential genes, the genera that meet this requirement are Escherichia, Mycoplasma, Salmonella, Staphylococcus, and Streptococcus. We chose Streptococcus, considering the number of specimens and the analysis time.

1-4. Selecting Methods of Analysis and Basis

1) The Method of Analysis

We conducted the analysis to find essential genes in the order shown in the following flow chart.

2) The First Analysis (Determination of BLAST Standard) and Reliability

As emphasized earlier, we want our results to match the in vitro results, and we tested this in our first analysis. We used Streptococcus pneumoniae TIGR4 and Streptococcus sanguinis SK36, because both belong to the genus Streptococcus and their essential genes have been identified by in vitro experiments. We verified the reliability of the results and the accuracy of the analysis method by running BLAST between the two data sets using our BLAST standards.

3) BLAST Analysis Results

We labeled genes that DEG lists as essential as (+) and genes that it does not list as (-). Likewise, we labeled genes judged essential by our analysis as (+) and the rest as (-), and built a 2x2 cross table from the two sets of labels. Because BLAST results on the same data can differ depending on which data set is used as the query, we swapped the two data sets and repeated the analysis.
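The 2x2 cross table described above can be sketched in Python as follows. This is only an illustration: the function name `cross_table`, the gene IDs, and the two gene sets are hypothetical toy values, not the team's data or code.

```python
from collections import Counter

def cross_table(all_genes, deg_essential, predicted_essential):
    """Count genes in each (DEG label, our label) cell of the 2x2 table."""
    counts = Counter()
    for gene in all_genes:
        deg_label = "+" if gene in deg_essential else "-"
        our_label = "+" if gene in predicted_essential else "-"
        counts[(deg_label, our_label)] += 1
    cells = [("+", "+"), ("+", "-"), ("-", "+"), ("-", "-")]
    return {cell: counts[cell] for cell in cells}

# Hypothetical toy data: six genes, three essential in DEG, three predicted by us.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
deg = {"g1", "g2", "g3"}
predicted = {"g1", "g2", "g4"}

table = cross_table(genes, deg, predicted)
for (deg_label, our_label), n in table.items():
    print(f"DEG({deg_label}) / ours({our_label}): {n}")
```

To repeat the analysis with the query and subject swapped, the same function would be called again with the gene list and label sets of the other genome.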

4) Verification of the credibility of BLAST Analysis

4-1) Verification of the BLAST Analysis Standard

To verify the reliability of our BLAST standards, we used Sensitivity, Specificity, and Accuracy. 'Sensitivity' is the probability that a gene listed as essential in DEG is also judged essential in our analysis. 'Specificity' is the probability that a gene that is non-essential according to DEG is also judged non-essential in our analysis. 'Accuracy' is the proportion of all genes for which the two results agree. If the Accuracy value is over 80%, we regard the results produced with the BLAST analysis standard as reliable, and therefore the results we infer from them as reliable as well. We then conducted a likelihood test to check the validity of the Accuracy value.
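As a minimal sketch, the three measures can be computed from the four cells of the 2x2 table. The cell counts below are made-up numbers for illustration only, and the function name `blast_metrics` is hypothetical. The 80% cutoff is the one stated in the text.

```python
def blast_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity and accuracy from 2x2 cell counts.

    tp: DEG(+) / ours(+)    fn: DEG(+) / ours(-)
    fp: DEG(-) / ours(+)    tn: DEG(-) / ours(-)
    """
    total = tp + fn + fp + tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / total,
    }

# Made-up counts, not results from the project.
metrics = blast_metrics(tp=80, fn=20, fp=30, tn=270)
reliable = metrics["accuracy"] > 0.80  # 80% threshold from the text

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print("reliable:", reliable)
```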

4-2) Verification of the reliability of BLAST Analysis Results

To confirm the reliability of the analysis results, we conducted the McNemar test. With this test we can check that the results of our analysis do not differ significantly from the experimental results.
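A McNemar test on the 2x2 table uses only the two discordant cells. The sketch below assumes the chi-square form without continuity correction and a 0.05 significance level; the text does not say which variant or level the team used. The counts are made-up, and `mcnemar_test` is a hypothetical helper name.

```python
import math

def mcnemar_test(b, c):
    """McNemar chi-square test (1 degree of freedom, no continuity correction).

    b: genes that are DEG(+) but ours(-)
    c: genes that are DEG(-) but ours(+)
    Returns (statistic, p_value).
    """
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs, nothing to test
    statistic = (b - c) ** 2 / (b + c)
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Made-up discordant counts, not results from the project.
stat, p_value = mcnemar_test(b=20, c=30)
print(f"statistic = {stat:.2f}, p = {p_value:.4f}")
if p_value > 0.05:  # assumed significance level
    print("No significant difference between DEG and our labels.")
```

A large p-value here means the test finds no evidence that the two labelings disagree systematically in one direction, which is the outcome the analysis is looking for.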

4-3) The Second Analysis (Annotation)

We applied the BLAST standards confirmed in the first analysis to 82 complete genomes in the genus Streptococcus. Using the BLAST results, we grouped genes with similar sequences and annotated each group with our own ID, so genes annotated with the same ID are regarded as the same gene. We call this ID a 'Synb UID'. If a Synb UID is present in all 82 genomes, we infer that it is an 'Essential Gene'. If a Synb UID is found in only one genome, we call it a 'Specific Gene'. The second analysis yielded about 478 essential genes across the 82 Streptococcus genomes.
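The grouping and classification step can be sketched as below. The text does not specify the grouping algorithm, so this example assumes single-linkage grouping (any chain of BLAST hits that pass the standard joins genes into one group) implemented with union-find. The ID format, the label "other", the function names, and the three-genome toy data are all hypothetical.

```python
from collections import defaultdict

def assign_uids(genes, similar_pairs):
    """genes: dict gene_id -> genome_id.
    similar_pairs: (gene_a, gene_b) BLAST hits that passed the standard.
    Returns dict gene_id -> group ID (standing in for a 'Synb UID')."""
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    uid_of_root, uids = {}, {}
    for g in genes:
        root = find(g)
        if root not in uid_of_root:
            uid_of_root[root] = f"UID{len(uid_of_root) + 1:04d}"
        uids[g] = uid_of_root[root]
    return uids

def classify(genes, uids, n_genomes):
    """Label each ID by how many genomes contain it."""
    genomes_per_uid = defaultdict(set)
    for gene, genome in genes.items():
        genomes_per_uid[uids[gene]].add(genome)
    labels = {}
    for uid, genomes in genomes_per_uid.items():
        if len(genomes) == n_genomes:
            labels[uid] = "essential"   # present in every genome
        elif len(genomes) == 1:
            labels[uid] = "specific"    # present in only one genome
        else:
            labels[uid] = "other"
    return labels

# Toy data: three genomes instead of 82.
genes = {"A1": "gA", "B1": "gB", "C1": "gC", "A2": "gA", "B2": "gB", "C3": "gC"}
pairs = [("A1", "B1"), ("B1", "C1"), ("A2", "B2")]

uids = assign_uids(genes, pairs)
labels = classify(genes, uids, n_genomes=3)
print(labels)
```

With the real data, `n_genomes` would be 82, and the number of IDs labeled "essential" would correspond to the count reported above.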

Team:CBNU-Korea/Project/GD/Method

From 2012.igem.org