Team:USTC-Software/project.html

From 2012.igem.org

Project USTC-Software

Introduction

Synthetic biology is like building a house: BioBricks are the basic components, genetic circuits provide the overall structure and the main method of construction, and the finished plasmid is the house that embodies the designer's ideas and purposes. A complete iGEM project, however, is more complex. The factors involved in an iGEM project can be divided into three major areas (Figure 1):

Biology: how biological molecules interact with each other, how genetic circuits are constructed, which parts to build or use, etc.

Experiments: the experimental details, such as temperature, reaction rates and pH conditions, needed to realize the designed genetic circuits under laboratory conditions.

Mathematics: using mathematical methods to simulate the behavior of a biological system, and using models to analyze its robustness and sensitivity.

Figure 1. The three major areas of an iGEM project: biology, experiments and mathematics.

A typical wet-lab brainstorming session concentrates on the biological area: discussing the function of a system and how to build a genetic circuit that accomplishes it. For software development, however, we must frame a project in a global context. From designing genetic circuits to controlling experimental conditions, from acquiring data to generating mathematical models, software can be found everywhere doing simple yet important jobs. Software that aims to combine all three major areas is particularly difficult to build and, at the same time, particularly valuable.

That is our goal and our belief: the power of software can make synthetic biology much easier while helping researchers work at a much higher level.

Technology Basics

Reverse Engineering

This century has seen the increasing importance of quantitative approaches in molecular biology. Methods from engineering and physics have made the birth of synthetic biology possible. While engineering focuses on building biological motifs, devices and systems that have certain functions, reverse engineering is becoming more and more important for inferring the internal genetic regulatory networks (GRNs) from externally observable properties.

To realize reverse engineering, a variety of methods and algorithms have been used and revised, including ODE methods, Bayesian and dynamic Bayesian networks, and information-theoretic or correlation-based methods (Penfold et al., 2011). Each method has pros and cons. For example, static Bayesian networks can reflect gene co-expression but, being acyclic graphs, cannot represent feedback loops.
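
As a minimal illustration of the correlation-based family mentioned above, the sketch below computes the Pearson correlation between two expression series. The data and the names `pearson`, `gene_x` and `gene_y` are hypothetical and are not taken from our software.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two equally long series."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    spread_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    spread_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (spread_u * spread_v)

# Hypothetical expression levels of two genes at five time points.
gene_x = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_y = [2.1, 3.9, 6.2, 8.0, 9.8]

score_xy = pearson(gene_x, gene_y)
score_yx = pearson(gene_y, gene_x)
```

Because the score is symmetric in its two arguments, it can suggest that two genes are related but not which one regulates the other.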

ODEs are widely used in iGEM projects for mathematical modeling because of their simplicity and their connection to both biochemical reactions and time-course datasets. Furthermore, ODE methods can reflect the direction of the edges connecting biological molecules (Bansal et al., 2007). The downside is the difficulty of abstracting genetic regulatory networks from chemical reaction networks, a process that may require knowledge of graph theory and visualization techniques. Nevertheless, the ODE method is the main part of our project for inferring the GRN from time-course series.
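
One simple way to see how an ODE model yields directed edges is to assume linear dynamics dx/dt = A x, where entry A[i][j] describes the effect of gene j on gene i, and to estimate A from a time series by least squares on finite differences. The code below is only an illustrative sketch under that linearity assumption, not the algorithm used in our software; the function names and the two-gene example are hypothetical.

```python
def simulate_linear(A, x0, dt, steps):
    """Forward-Euler simulation of dx/dt = A x; returns the list of states."""
    n = len(x0)
    states = [list(x0)]
    for _ in range(steps):
        x = states[-1]
        dx = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        states.append([x[i] + dt * dx[i] for i in range(n)])
    return states

def solve(M, b):
    """Solve M y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(M[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    y = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (aug[i][n] - tail) / aug[i][i]
    return y

def infer_matrix(states, dt):
    """Least-squares estimate of A from finite differences of a time series."""
    n = len(states[0])
    X = states[:-1]
    D = [[(states[k + 1][i] - states[k][i]) / dt for i in range(n)]
         for k in range(len(states) - 1)]
    # Normal equations (X^T X) a_i = X^T d_i, solved once per row i of A.
    XtX = [[sum(x[p] * x[q] for x in X) for q in range(n)] for p in range(n)]
    A = []
    for i in range(n):
        Xtd = [sum(x[p] * d[i] for x, d in zip(X, D)) for p in range(n)]
        A.append(solve(XtX, Xtd))
    return A

# Hypothetical two-gene system: gene 0 activates gene 1, and both degrade.
true_A = [[-1.0, 0.0],
          [2.0, -1.0]]
series = simulate_linear(true_A, x0=[1.0, 0.0], dt=0.1, steps=50)
estimated_A = infer_matrix(series, dt=0.1)
```

In this toy example the data are generated with the same Euler scheme used for the finite differences, so the matrix is recovered almost exactly. With real, noisy measurements the estimate would only be approximate. The sign of each off-diagonal entry gives the type of the inferred edge (positive for activation, negative for repression), and its position gives the direction.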

Genetic Regulatory Network (GRN)

A genetic regulatory network (GRN) is a collection of DNA segments in a cell which interact with each other indirectly (through their RNA and protein expression products) and with other substances in the cell, thereby governing the rates at which genes in the network are transcribed into mRNA. In general, each mRNA molecule goes on to make a specific protein (or set of proteins). In some cases this protein will be structural, and will accumulate at the cell membrane or within the cell to give it particular structural properties. In other cases the protein will be an enzyme, i.e., a micro-machine that catalyzes a certain reaction, such as the breakdown of a food source or toxin. Some proteins though serve only to activate other genes, and these are the transcription factors that are the main players in regulatory networks or cascades. By binding to the promoter region at the start of other genes they turn them on, initiating the production of another protein, and so on. Some transcription factors are inhibitory.

Research on genetic regulatory networks has also drawn the interest of many iGEM teams. How to deduce a GRN with a specific function is a fundamental problem that remains largely unsolved. This year our team, inspired by this fundamental topic, decided to develop software that helps researchers better analyze a GRN. In a GRN, genes are viewed as simple nodes, and the regulatory interactions between genes are the connections (edges) between them. Given this mathematical model, tools such as topology and graph theory can be applied to better understand the whole GRN, which makes the problem much more interesting and attractive.
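
The node-and-edge view described above can be sketched as a small directed graph. The three-gene network and the helper names below are hypothetical and serve only to show how a graph-theoretic question, here whether the network contains a feedback loop, can be asked of a GRN.

```python
# Hypothetical network stored as regulator -> {target: sign},
# where +1 means activation and -1 means repression.
grn = {
    "geneA": {"geneB": +1},
    "geneB": {"geneC": +1},
    "geneC": {"geneA": -1},
}

def regulators_of(network, target):
    """Return the sorted names of all genes that directly regulate target."""
    return sorted(src for src, targets in network.items() if target in targets)

def has_feedback_loop(network):
    """Detect a directed cycle with a depth-first search."""
    UNSEEN, ACTIVE, DONE = 0, 1, 2
    nodes = set(network) | {t for targets in network.values() for t in targets}
    state = {node: UNSEEN for node in nodes}

    def visit(node):
        state[node] = ACTIVE
        for nxt in network.get(node, {}):
            if state[nxt] == ACTIVE:
                return True  # reached a gene that is still on the current path
            if state[nxt] == UNSEEN and visit(nxt):
                return True
        state[node] = DONE
        return False

    return any(state[node] == UNSEEN and visit(node) for node in nodes)

# The same genes without the geneC -> geneA edge form a plain cascade.
cascade = {"geneA": {"geneB": +1}, "geneB": {"geneC": +1}}
```

Here `has_feedback_loop(grn)` is true because geneC represses geneA and closes the loop, while the cascade has no cycle.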

Mathematical models of GRNs have been developed to capture the behavior of the system being modeled and, in some cases, to generate predictions that agree with experimental observations. In other cases, models have made accurate novel predictions that can be tested experimentally, suggesting experiments that might not otherwise have been considered when designing a laboratory protocol. The most common modeling technique involves coupled ordinary differential equations (ODEs). Several other promising modeling techniques have been used, including Boolean networks, Petri nets, Bayesian networks, graphical Gaussian models, stochastic models and process calculi. Conversely, techniques have been proposed for generating models of GRNs that best explain a set of time-series observations.
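
As a small example of the coupled-ODE approach, the sketch below simulates a hypothetical two-gene system in which protein X represses the production of protein Y through a Hill function: dx/dt = alpha_x - delta * x and dy/dt = alpha_y / (1 + (x / K)^n) - delta * y. The parameter values, the function names and the choice of forward-Euler integration are assumptions made for illustration only.

```python
def hill_repression(x, K, n):
    """Fraction of maximal production remaining at repressor level x."""
    return 1.0 / (1.0 + (x / K) ** n)

def simulate(alpha_x, alpha_y, delta, K, n, dt=0.01, steps=5000):
    """Forward-Euler integration; returns a list of (time, x, y) tuples."""
    x, y = 0.0, 0.0
    trajectory = [(0.0, x, y)]
    for step in range(1, steps + 1):
        dx = alpha_x - delta * x
        dy = alpha_y * hill_repression(x, K, n) - delta * y
        x, y = x + dt * dx, y + dt * dy
        trajectory.append((step * dt, x, y))
    return trajectory

# Hypothetical parameters in arbitrary units.
trajectory = simulate(alpha_x=2.0, alpha_y=10.0, delta=1.0, K=1.0, n=2)
t_end, x_end, y_end = trajectory[-1]

# Analytical steady state for comparison with the end of the simulation.
x_steady = 2.0 / 1.0
y_steady = 10.0 * hill_repression(x_steady, K=1.0, n=2) / 1.0
```

Setting both derivatives to zero gives the steady state x = alpha_x / delta and y = alpha_y / (delta * (1 + (x / K)^n)), so comparing `x_end` and `y_end` with `x_steady` and `y_steady` is a quick sanity check on the simulation.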

Machine Learning

Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that take as input empirical data, such as that from sensors or databases, and yield patterns or predictions thought to be features of the underlying mechanism that generated the data. A learner can take advantage of examples (data) to capture characteristics of interest of their unknown underlying probability distribution. Data can be seen as instances of the possible relations between observed variables. A major focus of machine learning research is the design of algorithms that recognize complex patterns and make intelligent decisions based on input data. One fundamental difficulty is that the set of all possible behaviors given all possible inputs is too large to be included in the set of observed examples (training data). Hence the learner must generalize from the given examples in order to produce a useful output in new cases.

Machine learning algorithms can be organized into a taxonomy based on the desired outcome of the algorithm.

Supervised learning generates a function that maps inputs to desired outputs (also called labels, because they are often provided by human experts labeling the training examples). For example, in a classification problem, the learner approximates a function mapping a vector into classes by looking at input-output examples of the function.

Unsupervised learning models a set of inputs without using labels; clustering is a typical example.

Semi-supervised learning combines both labeled and unlabeled examples to generate an appropriate function or classifier.

Reinforcement learning learns how to act given an observation of the world. Every action has some impact in the environment, and the environment provides feedback in the form of rewards that guides the learning algorithm.

Transduction, or transductive inference, tries to predict new outputs on specific and fixed (test) cases from observed, specific (training) cases.

Learning to learn learns its own inductive bias based on previous experience.
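
To make the supervised case in the taxonomy above concrete, the sketch below trains a nearest-centroid classifier: it averages the labeled training vectors of each class and assigns a new vector to the class with the closest average. The two-feature "expression profiles", the labels and the function names are hypothetical, and this is just one of the simplest possible supervised learners.

```python
def fit_centroids(samples, labels):
    """Average the training vectors of each class (the 'training' step)."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Hypothetical labeled examples: (feature vector, expert-provided label).
train_x = [(0.1, 0.2), (0.2, 0.1), (0.9, 1.0), (1.0, 0.8)]
train_y = ["low", "low", "high", "high"]

centroids = fit_centroids(train_x, train_y)
label_a = predict(centroids, (0.15, 0.15))
label_b = predict(centroids, (0.8, 0.9))
```

The learner never sees the query points during training, so classifying them correctly is a small instance of generalizing from the given examples to new cases.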

The development of computer science and biology has been accompanied by the widespread use of machine learning. Today, nearly every topic in these fields has some connection to machine learning. In synthetic biology research in particular, machine learning is a powerful tool for gaining deeper insight into biological problems and challenges.