Team:Yale/Modeling

From 2012.igem.org

(Difference between revisions)
(Algorithm optimization)
 
(55 intermediate revisions not shown)
Line 11: Line 11:
|}
|}
-
==Modeling the evolution of a population during MAGE==
 
-
===Predicting the distribution of mutations===
 
-
Modeling mutations across evolving populations
+
To help design MAGE experiments, both ours and others, Team Yale has developed a mathematical model of the outcomes of multiplexed recombineering, and efficient methods for its computation. A script implementing these functions is available on request.
-
Replacement efficiency calculation. For each ssDNA oligo produced by CAD-GENOME, the allelic replacement efficiency in E. coli is estimated using empirical relationships previously published in the documentation for optMAGE [5]. CAD-GENOME tabulates these predicted efficiencies  in the ssDNA oligo output files and uses these efficiencies values to guide  the user  to the number of MAGE cycles  necessary to create a population of cells including all of the desired set of mutations, as described below.
+
 
-
Estimated distribution of targeted mutations
+
<gallery widths=410px perrow=2 heights=380px caption="Figures generated by our MAGE model">
-
The distribution of specific mutations in a population over cycles of multiplexed recombination is naturally a stochastic process controlled by diverse physiological effects and binding interactions. CAD-GENOME models this process as a function of each oligo's predicted efficiency of allelic replacement, assuming that each recombination event is always either complete or insignificant and that no two oligos compete for binding at the same site. This model returns the relative prevalence of each replacement, of each possible number of replacements per cell, of each possible minimum number of replacements per cell, the mean number of mutations per cell, and the variance of the same. These estimates will guide users seeking to guarantee, for instance, that their experiment will mutate some fraction of their specimens to carry all desired replacements before subjecting them to some selection pressure.
+
Image:MAGEfig1.png|How the number of mutations per cell increases from 0 toward 9 over cycles 0, 1, ..., 50, 90
 +
Image:MAGEfig2.png|How the minimum number of mutations per cell increases from 0 toward 9 over cycles 0, 1, ..., 50, 90
 +
Image:MAGEfig3.png|How each of nine mutations yeaL, ..., yliE accumulates over 90 cycles
 +
Image:MAGEhistogram.png|How many off-target binding events might happen, and how spontaneously
 +
</gallery>
 +
 
 +
==Modeling the evolution of a population during MAGE==
 +
[[Image:MAGE_model_scheme.png|right|300px|thumb|Schematic of model. Oligos 1, 2 and 3 each carry some number of mutations ''r'' and bind their corresponding loci on the genome with probability ''p'', creating eight possible progeny, of which three are shown.]]
 +
 
 +
The distribution of specific mutations in MAGE is a discrete, stochastic process. How prevalent will each possible mutant be after some number of cycles? We estimate these prevalences by assuming that each oligo binds only
 +
*at its target on the genome,
 +
*completely,
 +
*at a sequence-dependent frequency, empirically estimated for ''E. coli'' [1].
-
Survey for off-target binding sites
+
Given these assumptions, then a population after ''c'' cycles is a weighted sum of ''n'' Bernoulli trials ''X'', each zero if the oligo does not mutate its target ''i'' and otherwise equal to the number ''r'' of mutations it induces. Given efficiencies of allelic replacement ''p'', this probability mass function ''K'' becomes:
-
Off-target binding could undercut our assumption that oligos do not compete or lead to other confounding phenotypes, so as a warning to the user of such exceptions, CAD-GENOME searches the specimen genome for regions with four base pairs or more of homology with any oligo, records each region of potential hybridization, then tabulates those regions' positions and sequences with an estimated free energy of hybridization and plots a log-scaled histogram of the same.
+
-
===Implementation===
+
[[Image:Eqns1.png|center]]
-
Single-stranded DNA hybridization and minimum folding free energies are calculated using the UNAFold software package. This allows for the creation of optimized ssDNA oligos for optimal efficiency in producing desired mutants, and for estimating the significance of off-target binding sites. The BLAST+ alignment package is used for providing users with the top alignments for oligos to confirm that the expected locus is being targeted correctly [16]. In addition, BLAST+ is used for finding possible alternative hybridization events which may lower the efficiency of MAGE experiments.
+
-
Given our assumptions, we calculate the distribution of mutations in a population after $c$ cycles of multiplexed recombination as a weighted sum of $n$ independent Bernoulli trials $r_iX_i$, each equal to $0$ if an oligo does not replace the sequence at its targeted position $i$ and equal to $r_i$ if it does, thus inducing its $r_i$ encoded mutations. Given an accurate array of estimated efficiencies of allelic replacement $p_i$, then, we find the probability mass function <math> \mathrm{Pr}(K=k; c) </math>
+
where each ''D'' is the set of all sets of indices ''A'' for ''r'' that sum to ''k''. In doing this, we have derived a more general form of the binomial distribution. Computing this PMF involves solving the subset sum problem, an NP-complete problem, and so we optimized our algorithm to avoid slowdowns. In the occasional case when all oligos carry the same number of mutations, we used a recursive formula [2], and in cases not so degenerate, we used a branched, dynamic programming algorithm [3].
-
as follows from elementary probability theory and as derived from generating functions in S1. The sum in (1) is over the set $D_k$ of all sets of oligos $A$ which carry a total of $k$ mutations; $\bar{A}$ is the set of oligos in the pool but not in $A$.  If each oligo carries a single replacement with the same efficiency of allelic replacement, then (1) can condense to form the binomial distribution.
+
We also derived the moments of this distribution. The moments of independent events being independent, we found them as the sums of the moments of mutation at individual loci ''M'', each determined by their generating functions ''h'':
 +
[[Image:Eqns2.png|center]]
-
Computing the probability mass function of $K(c)$ can be intensive, as enumerating the sets in $D_k$ is equivalent to solving the NP-complete subset sum problem, but we speed the process in two ways: when $r=1$, by using a formula recursive in $k$ taking $O(c \cdot \mathrm{max}(k))$(Wadyicki, Shah et al. 1973), and otherwise, a branched, dynammic programming algorithm taking $O(c \cdot \mathrm{max}(2^{n/2}, n|D_k|))$ (Horowitz and Sahni 1974). The complementary cumulative distribution function of $K$ (that is, $\mathrm{Pr}(K \ge k; c)$) is then determined recursively in $O(c \cdot \mathrm{max}(k))$. To find the mean and variance of $K(c)$, we simply find the sums of the means and variances of the definitely independent terms $M_i(c)$ constituting $K(c)$. These can be determined from their generating functions $h_{M_i}$,
+
==Survey for off-target binding sites==
 +
Though our results do agree with most experimental observations, not all MAGE-induced mutations occur at the intended sites; to identify likely unintended mutations, we scripted a search of the genome using the alignment package BLAST+ [4] to find subsequences with four base pairs or more matching oligos in the MAGE oligo pool, and estimates the change in Gibbs energy likely upon hybridization at each such off-target pairing, using the program UNAFold [5].
-
===Detecting of off-target binding sites===
+
==References==
 +
# Wang, H. H., F. J. Isaacs, et al. (2009). "Programming cells by multiplex genome engineering and accelerated evolution." Nature 460(7257): 894-898.
 +
# Wadycki, W. J., B. K. Shah, et al. (1973). "Letters to the Editor." The American Statistician 27(3): 123-127.
 +
# Horowitz, E. and S. Sahni (1974). "Computing Partitions with Applications to the Knapsack Problem." J. ACM 21(2): 277-292.
 +
# Altschul, S. F., T. L. Madden, et al. (1997). "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res 25(17): 3389-3402.
 +
# Markham, N. R. and M. Zuker (2008). "UNAFold: software for nucleic acid folding and hybridization." Methods Mol Biol 453: 3-31.

Latest revision as of 02:28, 27 October 2012

Home Team Official Team Profile Project Parts Submitted to the Registry Modeling Notebook Safety Attributions


To help design MAGE experiments, both ours and others, Team Yale has developed a mathematical model of the outcomes of multiplexed recombineering, and efficient methods for its computation. A script implementing these functions is available on request.

Modeling the evolution of a population during MAGE

Schematic of model. Oligos 1, 2 and 3 each carry some number of mutations r and bind their corresponding loci on the genome with probability p, creating eight possible progeny, of which three are shown.

The distribution of specific mutations in MAGE is a discrete, stochastic process. How prevalent will each possible mutant be after some number of cycles? We estimate these prevalences by assuming that each oligo binds only

  • at its target on the genome,
  • completely,
  • at a sequence-dependent frequency, empirically estimated for E. coli [1].

Given these assumptions, then a population after c cycles is a weighted sum of n Bernoulli trials X, each zero if the oligo does not mutate its target i and otherwise equal to the number r of mutations it induces. Given efficiencies of allelic replacement p, this probability mass function K becomes:

Eqns1.png

where each D is the set of all sets of indices A for r that sum to k. In doing this, we have derived a more general form of the binomial distribution. Computing this PMF involves solving the subset sum problem, an NP-complete problem, and so we optimized our algorithm to avoid slowdowns. In the occasional case when all oligos carry the same number of mutations, we used a recursive formula [2], and in cases not so degenerate, we used a branched, dynamic programming algorithm [3].

We also derived the moments of this distribution. The moments of independent events being independent, we found them as the sums of the moments of mutation at individual loci M, each determined by their generating functions h:

Eqns2.png

Survey for off-target binding sites

Though our results do agree with most experimental observations, not all MAGE-induced mutations occur at the intended sites; to identify likely unintended mutations, we scripted a search of the genome using the alignment package BLAST+ [4] to find subsequences with four base pairs or more matching oligos in the MAGE oligo pool, and estimates the change in Gibbs energy likely upon hybridization at each such off-target pairing, using the program UNAFold [5].

References

  1. Wang, H. H., F. J. Isaacs, et al. (2009). "Programming cells by multiplex genome engineering and accelerated evolution." Nature 460(7257): 894-898.
  2. Wadycki, W. J., B. K. Shah, et al. (1973). "Letters to the Editor." The American Statistician 27(3): 123-127.
  3. Horowitz, E. and S. Sahni (1974). "Computing Partitions with Applications to the Knapsack Problem." J. ACM 21(2): 277-292.
  4. Altschul, S. F., T. L. Madden, et al. (1997). "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res 25(17): 3389-3402.
  5. Markham, N. R. and M. Zuker (2008). "UNAFold: software for nucleic acid folding and hybridization." Methods Mol Biol 453: 3-31.