AlphaSim: Software for Breeding Program Simulation
Assigned to Associate Editor Jianming Yu.
Abstract
Core Ideas
- AlphaSim allows breeders and researchers to simulate genomic data with specific user criteria.
- AlphaSim is flexible, computationally efficient, and easy to use for a wide range of possible scenarios.
- AlphaSim can also be used in animal breeding, human genetics, and population genetics.
This paper describes AlphaSim, a software package for simulating plant and animal breeding programs. AlphaSim enables the simulation of multiple aspects of breeding programs with a high degree of flexibility. AlphaSim simulates breeding programs in a series of steps: (i) simulate haplotype sequences and pedigree; (ii) drop haplotypes into the base generation of the pedigree and select single‐nucleotide polymorphism (SNP) and quantitative trait nucleotide (QTN); (iii) assign QTN effects, calculate genetic values, and simulate phenotypes; (iv) drop haplotypes into the burn‐in generations; and (v) perform selection and simulate new generations. The program is flexible in terms of historical population structure and diversity, recent pedigree structure, trait architecture, and selection strategy. It integrates biotechnologies such as doubled‐haploids (DHs) and gene editing and allows the user to simulate multiple traits and multiple environments, specify recombination hot spots and cold spots, specify gene jungles and deserts, perform genomic predictions, and apply optimal contribution selection. AlphaSim also includes restart functionalities, which increase its flexibility by allowing the simulation process to be paused so that the parameters can be changed or to import an externally created pedigree, trial design, or results of an analysis of previously simulated data. By combining the options, a user can simulate simple or complex breeding programs with several generations, variable population structures and variable breeding decisions over time. In conclusion, AlphaSim is a flexible and computationally efficient software package to simulate biotechnology enhanced breeding programs with the aim of performing rapid, low‐cost, and objective in silico comparison of breeding technologies.
Abbreviations
- DH, doubled‐haploid
- G × E, genotype × environment
- gEBV, genomic‐estimated breeding value
- MAGIC, multiparent advanced‐generation intercross
- pEBV, pedigree‐estimated breeding value
- QTN, quantitative trait nucleotide
- RIL, recombinant inbred line
- SNP, single‐nucleotide polymorphism
- TBV, true breeding value
- TGV, true genotypic value
This paper introduces AlphaSim, a software package for simulating breeding programs. AlphaSim combines features from three previous simulation packages, AlphaDrop (Hickey and Gorjanc, 2012), AlphaSimPlant, and AlphaMPSim (Hickey et al., 2014), with new features to form a comprehensive software package capable of simulating a wide range of mating designs, biotechnologies, and selection strategies. This allows a user to perform plant or animal breeding simulations in any species using a wide range of strategies. AlphaSim offers the user a high degree of simulation flexibility making it a useful tool for designing and optimizing new breeding strategies using newly developed technologies.
Simulation has been an effective platform for the evaluation and development of new breeding strategies. Large‐scale field‐testing of breeding strategies is either impractical or impossible because of the time and resources needed; simulation offers a comparatively quick and inexpensive alternative. Many software packages for plant breeding simulations are currently available (Sun et al., 2011). These packages have been useful for evaluating existing breeding strategies in actual field‐based breeding programs (Wang et al., 2003) and have been used to develop new breeding strategies. For example, the PLABSIM software package aided in the development of an efficient marker‐assisted backcross design used to transfer the stripe rust (Puccinia striiformis f. sp. tritici) resistance gene Yr15 to the spring wheat (Triticum aestivum L.) cultivar Zak (Randhawa et al., 2009). The historical use of simulation and the expanding range of technological options for breeding programs indicate that simulation will continue to play a role, and indeed an increasingly relevant one, in future research focusing on the design and optimization of plant and animal breeding programs.
New breeding strategies are required to efficiently optimize the implementation of new technologies in breeding programs. Genomic selection (Meuwissen et al., 2001; Bernardo and Yu, 2007) and genome editing (Shan et al., 2014; Jenko et al., 2015) are two such technologies. Genomic selection in particular has been widely promoted as a technology of great value to plant breeding (Bernardo and Yu, 2007; Heffner et al., 2009; Jannink et al., 2010). While it has a large potential to improve plant breeding, implementation of genomic selection requires optimization to maximize return on investment. Simulation is the ideal tool to develop optimal breeding strategies while assessing costs and benefits (e.g., Hickey et al., 2014; Gorjanc et al., 2016). However, to our knowledge, existing software packages lack the ability to simulate breeding programs with genomic selection with sufficient flexibility and computational efficiency.
We designed AlphaSim to fill the need for a software package that is capable of simulating new breeding designs and application of biotechnologies, such as genomic selection and gene editing, in a flexible and computationally efficient manner. This paper describes the simulation method and operation of AlphaSim in varied plant breeding applications with an emphasis on its main features (Fig. 1) and computational performance. Examples of how to use the software are included with measures of computational efficiency. Table 1 gives a list of symbols used throughout this paper.

Some of the AlphaSim parameters that can be specified by the user, of which most can be changed during the course of a simulation. Asterisk denotes parameters that are immutable.
| Symbol | Definition† |
|---|---|
| a | Vector of breeding values |
| $\hat{a}$ | Vector of estimated breeding values |
| A | Pedigree or genomic numerator relationship matrix |
| ak | Additive effect at QTN k simulated for a given trait |
| αk | Average allele substitution effect at QTN k computed for a given trait |
| b | Vector of SNP effects |
| $\hat{b}_j$ | Estimated effect of SNP j |
| dk | Dominance effect at QTN k computed for a given trait |
| δk | Dominance degree at QTN k simulated for a given trait |
| e | Vector of residual effects |
| ei,r | Residual effect for individual i and trait r |
| gEBVi | Genomic estimated breeding value of individual i |
| H2 | Broad‐sense heritability |
| h2 | Narrow‐sense heritability |
| i | Indicates a given individual, varies from 1 to nIndiv |
| j | Indicates a given SNP, varies from 1 to nSNP |
| k | Indicates a given QTN, varies from 1 to nQTN |
| LA, LE | Lower triangular matrix obtained from the Cholesky decomposition of VA and VE |
| $l_{A_{r,s}}$, $l_{E_{r,s}}$ | Entry of LA and LE between traits r and s |
| λ | Penalty factor applied on the loss of genetic diversity in optimal contribution selection |
| mδ | User‐specified mean dominance degree |
| μ | Intercept of the regression models (Eq. [10] and [12]) |
| μ0 | Mean value of the base generation of the pedigree |
| nIndiv | No. of individuals |
| nQTN | Total no. of unrestricted or frequency‐restricted QTN in the genome |
| nSNP | Total no. of SNP in the genome |
| nTraits | No. of simulated traits |
| pEBVi | Pedigree‐estimated breeding value of individual i |
| pk, qk | Frequencies of the nonzero and zero alleles, respectively, at QTN k in the base generation of the pedigree |
| r, s | Indicate two distinct traits: r varies from 1 to nTraits, and s varies from 1 to r |
| RandDev | Random deviate sampled from a Gaussian or Gamma distribution |
| $\sigma^2_{A,\mathrm{prior}}$ | A priori additive genetic variance specified by the user for a given trait |
| $\sigma^2_A$ | Additive genetic variance computed for a given trait in the base generation |
| $\sigma^2_{A,\mathrm{train}}$ | Additive genetic variance computed in the training population using the TBV of the training individuals |
| $\sigma^2_b$ | Variance of the SNP effects |
| $\sigma^2_D$ | Dominance genetic variance computed for a given trait in the base generation |
| $\sigma^2_\delta$ | User‐specified variance of the dominance degrees |
| $\sigma^2_e$ | Residual variance computed for a given trait |
| $\sigma^2_G$ | Genotypic variance computed for a given trait in the base generation |
| TBVi, TDVi, TGVi | True breeding value, true dominance value, and true genotypic value of individual i for a trait characterized by a given set of QTN, unrestricted or frequency‐restricted, and a given distribution, Gaussian or Gamma |
| VA, VE | Additive genetic and residual correlation matrices, respectively, of dimensions nTraits × nTraits |
| x | Vector providing the contribution of each selection candidate to the next generation |
| X | Incidence matrix linking phenotypes to b |
| xi,k, xi,j | Genotype of individual i at QTN k or SNP j, coded as 0, 1, or 2 according to the number of copies of the nonzero allele |
| y | Vector of phenotype records |
| Z | Incidence matrix linking phenotypes to a |
- † QTN, quantitative trait nucleotide; SNP, single‐nucleotide polymorphism.
Materials and Methods
Method
- Simulate haplotype sequences and pedigree.
- Drop haplotypes into the base generation and select SNP and QTN.
- Assign QTN effects, calculate genetic values, and simulate phenotypes.
- Drop haplotypes into the burn‐in generations.
- Perform selection and simulate new generations.

Principle of an AlphaSim simulation illustrated using a pedigree structured in four burn‐in generations and one selection generation for two traits characterized by an additive genetic model. (1) Haplotype sequences and an internal pedigree are simulated. (2) Haplotypes are recombined and dropped into the base generation of the pedigree. At this step, single‐nucleotide polymorphisms (SNPs) and quantitative trait nucleotides (QTNs) are selected. (3) An effect is assigned to each QTN, and, for each individual of the base generation of the pedigree, genetic values are calculated and phenotypes simulated. (4) Haplotypes of the base generation are recombined and dropped into the burn‐in generations of the pedigree successively. Similar to the base generation, genetic values are calculated and phenotypes simulated for each individual of the burn‐in generations. (5) A selection generation is simulated according to the selection method and strategy as defined by the user.
For each generation, AlphaSim writes information about the haplotype sequences, SNP and QTN genotypes, and breeding values in output files, which can be used for further analysis or for running alternative scenarios. This also helps to keep the memory requirements of AlphaSim low. The following five subsections provide details of each step of the method considering the simulation of a single trait. The remainder of this section describes additional features, output, and the data storage system of the software.
First Step: Simulate Haplotype Sequences and Pedigree (Fig. 2, Step 1)
By default, haplotype sequences are simulated through a system call to the program MaCS. MaCS is a coalescent simulation program that simulates, for each chromosome successively, a sample of haplotype sequences according to a specified ancestral population with, at a minimum, a specified chromosome size, mutation rate, recombination rate, and effective population size. Alternatively, the user can generate haplotype sequences externally and import them into AlphaSim. This external source can be either real sequences or sequences simulated using other methods.
Second Step: Drop Haplotypes into the Base Generation of the Pedigree and Select Single‐Nucleotide Polymorphism and Quantitative Trait Nucleotide (Fig. 2, Step 2)
AlphaSim samples haplotypes with replacement from the base set of haplotype sequences and drops them into the first generation of the pedigree. Dropping of haplotypes involves recombination events, which are randomly distributed across the genome ignoring interference. Should the user prefer nonrandom distribution of recombination events, a file can be supplied that specifies the proportion of recombination events in specific regions of the genome, that is, recombination hot spots and cold spots.
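The dropping step described above can be sketched in a few lines. This is a minimal illustration, not AlphaSim's implementation: the function names `sample_crossovers` and `make_gamete` are hypothetical, the number of crossovers is supplied by the caller, and crossover positions are assumed to be uniform along the chromosome (no interference).

```python
import random


def sample_crossovers(n_sites, n_crossovers, rng):
    """Place crossovers uniformly at random between sites (no interference)."""
    return sorted(rng.sample(range(1, n_sites), n_crossovers))


def make_gamete(hap_a, hap_b, crossovers, start_with_a=True):
    """Copy one parental haplotype, switching strands at each crossover site."""
    switch_sites = set(crossovers)
    use_a = start_with_a
    gamete = []
    for site, (allele_a, allele_b) in enumerate(zip(hap_a, hap_b)):
        if site in switch_sites:
            use_a = not use_a
        gamete.append(allele_a if use_a else allele_b)
    return gamete


rng = random.Random(1)
hap_a = [0] * 10  # first haplotype of the parent (toy data)
hap_b = [1] * 10  # second haplotype of the parent (toy data)

# Deterministic example: crossovers at sites 3 and 7.
fixed = make_gamete(hap_a, hap_b, [3, 7])

# Random example: two crossovers, random starting strand.
random_gamete = make_gamete(
    hap_a, hap_b, sample_crossovers(10, 2, rng), start_with_a=rng.random() < 0.5
)
```

Recombination hot spots and cold spots could be mimicked in such a sketch by replacing the uniform sampling with sampling weighted by the user-supplied regional proportions.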
After the haplotypes are dropped into the base generation, AlphaSim samples segregating sites to become either SNP markers or QTNs. The SNP markers constitute distinct SNP panels, and the user has control over the number of panels, their density, the minimum and maximum allele frequency of SNP, whether the panels are nested within each other or not, whether these panels include QTN or not, and which panel will be used in selection. The user can also control whether the full sequence and phased data are provided as output.
AlphaSim samples two sets of segregating sites to become biallelic QTN. Both sets include the same user‐specified number of QTN, denoted as nQTN. The first set, referred to as unrestricted, is comprised of QTN selected at random from across the genome. The second set, referred to as restricted, is comprised of QTN selected at random from across the genome with the restriction that the minor allele frequency must be in a specified range. The restrictions in allele frequency of both SNP markers and QTN allow the user to manage the possibility that QTN have different allele frequencies than SNP. Should the user prefer nonrandom distribution of QTN, a file can be supplied that specifies the proportions of QTN in specific regions of the genome, that is, gene jungles and deserts.
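The two QTN sets can be illustrated with a short sketch. The function `sample_qtn` and its arguments are hypothetical names chosen for this example; the allele frequencies are toy values, and the restriction is assumed to be an inclusive range on the minor allele frequency.

```python
import random


def sample_qtn(allele_freqs, n_qtn, rng, maf_range=None):
    """Sample n_qtn segregating sites to act as QTN.

    Without maf_range every site is eligible (unrestricted set); with
    maf_range only sites whose minor allele frequency lies in the
    inclusive range are eligible (restricted set).
    """
    if maf_range is None:
        eligible = list(range(len(allele_freqs)))
    else:
        low, high = maf_range
        eligible = [
            i for i, p in enumerate(allele_freqs)
            if low <= min(p, 1.0 - p) <= high
        ]
    if len(eligible) < n_qtn:
        raise ValueError("not enough eligible segregating sites")
    return sorted(rng.sample(eligible, n_qtn))


freqs = [0.02, 0.10, 0.45, 0.70, 0.95, 0.30, 0.50, 0.99]  # toy frequencies
rng = random.Random(42)
unrestricted = sample_qtn(freqs, 3, rng)
restricted = sample_qtn(freqs, 3, rng, maf_range=(0.25, 0.50))
```

Both calls request the same number of QTN, mirroring the description that the two sets share nQTN.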
Third Step: Assign Quantitative Trait Nucleotide Effects, Calculate Genetic Values and Simulate Phenotypes (Fig. 2, Step 3)

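AlphaSim assigns an additive effect $a_k$ to each QTN $k$ using a random deviate, RandDev, sampled from a Gaussian or Gamma distribution, together with $\sigma^2_{A,\mathrm{prior}}$, where $\sigma^2_{A,\mathrm{prior}}$ is the a priori additive genetic variance specified by the user. Dominance degrees $\delta_k$ are simulated with a mean $m_\delta$ and a variance $\sigma^2_\delta$ specified by the user. The user can specify no dominance by setting both $m_\delta$ and $\sigma^2_\delta$ to zero. Values for $\delta_k$ are obtained as follows: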
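One form consistent with the definitions in Table 1 is sketched below. It rests on two assumptions that are not taken from the text: the dominance degree is drawn from a Gaussian distribution, and the dominance effect is the dominance degree scaled by the absolute additive effect.

```latex
\delta_k = m_\delta + \mathrm{RandDev}_k \,\sigma_\delta ,
\qquad \mathrm{RandDev}_k \sim N(0, 1),
\qquad d_k = \delta_k \,\lvert a_k \rvert .
```

Under this form, setting both $m_\delta$ and $\sigma^2_\delta$ to zero gives $\delta_k = 0$ and $d_k = 0$ at every QTN, which agrees with the stated way of switching dominance off.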



The total genotypic variance in the base generation, $\sigma^2_G$, is calculated by taking the variance of TGV in the base generation.



The additive genetic variance in the base generation, $\sigma^2_A$, is calculated by taking the variance of the TBV, and the dominance genetic variance in the base generation, $\sigma^2_D$, is calculated by taking the variance of the TDV (Bernardo, 2010). Since these calculations, except TGV, depend on allele frequencies, AlphaSim recalculates each in subsequent generations using the generation‐specific allele frequencies. This means that only TGV, and not TBV and TDV, should be compared across generations.
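The dependence of TBV and TDV on allele frequencies can be made concrete with a small sketch. It assumes the classical single-locus parameterization (genotypic values of -a, d, and +a for 0, 1, and 2 copies of the nonzero allele, and an average allele substitution effect of a + d(q - p)); the function name `genetic_values` and the toy data are illustrative, and AlphaSim's own equations may differ in detail.

```python
import numpy as np


def genetic_values(x, a, d, mu0=0.0):
    """Return (TGV, TBV, TDV) for genotypes x coded 0/1/2 (individuals x QTN).

    Allele frequencies are taken from x itself, so TBV and TDV are
    specific to the generation that x represents.
    """
    x = np.asarray(x, dtype=float)
    a = np.asarray(a, dtype=float)
    d = np.asarray(d, dtype=float)
    p = x.mean(axis=0) / 2.0           # frequency of the nonzero allele
    q = 1.0 - p
    alpha = a + d * (q - p)            # average allele substitution effect
    # Genotypic value: -a, d, +a for 0, 1, 2 copies (assumed coding).
    tgv = mu0 + (x - 1.0) @ a + (x == 1.0).astype(float) @ d
    tbv = (x - 2.0 * p) @ alpha
    # Dominance deviations for 0, 1, 2 copies of the nonzero allele.
    dev = np.where(
        x == 0.0, -2.0 * p**2 * d,
        np.where(x == 1.0, 2.0 * p * q * d, -2.0 * q**2 * d),
    )
    tdv = dev.sum(axis=1)
    return tgv, tbv, tdv


x = np.array([[0, 1, 2], [2, 1, 0], [1, 1, 2], [2, 0, 1]])  # toy genotypes
a = np.array([1.0, 0.5, -0.3])
d = np.array([0.2, 0.0, 0.1])
tgv, tbv, tdv = genetic_values(x, a, d)
var_g, var_a, var_d = tgv.var(), tbv.var(), tdv.var()
```

In this parameterization each TGV equals the population mean plus TBV plus TDV, so a change in allele frequencies shifts TBV and TDV while leaving TGV untouched.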
The residual variance is calculated so as to obtain the user defined value for trait heritability. The user defines the heritability as either broad‐sense heritability, H2, or narrow‐sense heritability, h2. If the user defines broad‐sense heritability, the calculation for residual variance is as follows:
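Under the usual definitions, with phenotypic variance taken as the sum of genotypic and residual variance (an assumption of this sketch), broad-sense heritability is var_g / (var_g + var_e) and narrow-sense heritability is var_a / (var_g + var_e). Solving each for var_e gives the two cases below. The function name `residual_variance` is hypothetical.

```python
def residual_variance(var_g, heritability, var_a=None):
    """Residual variance that yields the requested heritability.

    Broad sense  (var_a is None): H2 = var_g / (var_g + var_e)
    Narrow sense (var_a given):   h2 = var_a / (var_g + var_e)
    """
    if not 0.0 < heritability <= 1.0:
        raise ValueError("heritability must be in (0, 1]")
    numerator = var_g if var_a is None else var_a
    var_e = numerator / heritability - var_g
    if var_e < 0.0:
        raise ValueError("requested heritability is not attainable")
    return var_e


# Broad-sense heritabilities of 0.8 and 0.2 with a genotypic variance of 1.0.
var_e_high = residual_variance(1.0, 0.8)
var_e_low = residual_variance(1.0, 0.2)
# Narrow-sense heritability of 0.5 with var_a = 1.0 and var_g = 1.2.
var_e_narrow = residual_variance(1.2, 0.5, var_a=1.0)
```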


Fourth Step: Drop Haplotypes into the Burn‐In Generations (Fig. 2, Step 4)
AlphaSim distinguishes burn‐in and selection generations. If the pedigree is internally simulated, burn‐in generations are generated by mating randomly selected parents. Internally simulated selection generations are generated by mating parents selected via different selection methods. The generation size and number of parents in each generation can be constant or variable. AlphaSim allows for three distinct types of matings regarding the sex of parents: (i) crosses between male and female individuals, (ii) crosses between bisexual individuals used as male and female parents interchangeably while preventing selfing, and (iii) selfing. Note that an external pedigree can also be imported for any selection generation and combined with the internally simulated pedigrees so that almost any pedigree structure can be defined. If an external pedigree is provided, the user has the option to run the breeding program using only the individuals in the external pedigree or to extend the external pedigree with simulated generations. Extension of the supplied pedigree or internal simulation of the pedigree from the first generation both require the user to specify the number of generations, the size of each generation, the number of parents for each generation, and the mating design to be used in each generation.
Fifth Step: Perform Selection and Simulate New Generations (Fig. 2, Step 5)
Selection in AlphaSim proceeds by selecting individuals in a given generation to become parents of the next generation. Truncation selection is used by default, that is, the best‐performing individuals are selected. The number of individuals to be selected can be made constant or variable across generations, and selection can be performed with or without considering gender. AlphaSim enables the selection of individuals based on their TGV, TBV, genomic‐estimated breeding values (gEBVs), pedigree‐estimated breeding values (pEBVs), or phenotypes; all are obtained using the set of pedigree, SNP, and QTN that characterize the trait under selection as specified by the user.
For computation of both gEBV and pEBV, a training population and a test population are defined. The training population is used to estimate the model parameters, while the test population is used to quantify the accuracy of selection. The test population includes the individuals that will become parents of the next generation. There are several options for constructing the training population: (i) include all individuals in all generations up to the current generation, (ii) include all individuals in all generations up to and including the current generation, (iii) include all individuals in the previous generation only, (iv) as in (ii) but using information from males only, (v) random sampling of a given number of individuals from a range of generations, or (vi) user‐specified set of individuals. For each of these options, AlphaSim allows the use of the same training set across different user‐specified selection generations. This latter possibility can be used in combination with an externally defined training population to simulate complex selection processes.

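The gEBVs are obtained from the ridge regression model

$$\mathbf{y} = \mathbf{1}\mu + \mathbf{X}\mathbf{b} + \mathbf{e} \qquad [10]$$

where $\mathbf{y}$ is the vector of phenotype records, $\mu$ is the intercept, $\mathbf{b}$ is a vector of allele substitution effects, $\mathbf{X}$ is the incidence matrix linking phenotypes to $\mathbf{b}$, $\mathbf{e}$ is a vector of residuals, and $\sigma^2_e$ and $\sigma^2_b$ are, respectively, the variances of residuals and SNP allele substitution effects. The ridge regression is solved through a call to the program AlphaBayes with the variance components set to the simulated values; $\sigma^2_e$ is the residual variance and $\sigma^2_b$ is derived from $\sigma^2_{A,\mathrm{train}}$, where $\sigma^2_{A,\mathrm{train}}$ is the additive genetic variance computed in the training population using the TBV of the training individuals. The gEBV of individual $i$ is then computed from the estimated SNP effects:

$$\mathrm{gEBV}_i = \sum_{j=1}^{n_{\mathrm{SNP}}} x_{i,j}\,\hat{b}_j \qquad [11]$$

The pEBVs are obtained from the pedigree regression model

$$\mathbf{y} = \mathbf{1}\mu + \mathbf{Z}\mathbf{a} + \mathbf{e} \qquad [12]$$

where $\mathbf{a}$ is a vector of breeding values with $\mathbf{A}$ as the pedigree numerator relationship matrix calculated from an optional number of ancestral generations, $\mathbf{Z}$ is the incidence matrix linking phenotypes to $\mathbf{a}$, and $\mathbf{e}$ is a vector of residuals. The pedigree regression is solved through a call to the program AlphaBayes with the variance components set to the simulated values.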
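A ridge regression with known variance components can be sketched as a single linear solve. This is a generic textbook formulation, not the AlphaBayes implementation: the function names are hypothetical, the shrinkage parameter is assumed to be the ratio of residual to SNP-effect variance, and the intercept is left unpenalized.

```python
import numpy as np


def ridge_snp_effects(y, X, var_e, var_b):
    """Solve y = 1*mu + X b + e for (mu, b) with fixed variance components."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    W = np.hstack([np.ones((n, 1)), X])     # intercept column plus genotypes
    lam = var_e / var_b                     # shrinkage applied to SNP effects
    penalty = np.diag([0.0] + [lam] * m)    # the intercept is not shrunk
    solution = np.linalg.solve(W.T @ W + penalty, W.T @ y)
    return solution[0], solution[1:]


def gebv(X_candidates, b_hat):
    """Genomic estimated breeding values as sums of estimated SNP effects."""
    return np.asarray(X_candidates, dtype=float) @ b_hat


# Toy training data: six individuals, three SNPs, noiseless phenotypes.
X_train = np.array([[0, 1, 2], [1, 0, 2], [2, 1, 0],
                    [0, 0, 1], [1, 2, 0], [2, 2, 1]])
b_true = np.array([0.5, -0.2, 0.1])
y_train = 2.0 + X_train @ b_true

mu_hat, b_hat = ridge_snp_effects(y_train, X_train, var_e=1e-8, var_b=1.0)
candidate_gebv = gebv([[2, 0, 1], [0, 2, 2]], b_hat)
```

With almost no shrinkage and noiseless phenotypes the estimates recover the simulated effects; with realistic variance components the estimates are shrunk toward zero.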
Additional Features
Multiple Traits
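Given the additive genetic correlation matrix $V_A$, the Cholesky factor $L_A$ is used to compute additive genetic effects for each QTN $k$ and trait $r$. Likewise, given the residual correlation matrix $V_E$, the Cholesky factor $L_E$ is used to compute residual effects $e_{i,r}$ for each individual $i$ and trait $r$.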

When simulating multiple traits and performing selection, a selection index is used to rank individuals. The selection index weights the values of each of the traits as specified by the user. The input values for the selection index can be TBV, gEBV, pEBV, or phenotypes.
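The use of a Cholesky factor to induce correlation between traits can be sketched as follows. The example assumes standard Gaussian deviates and omits any variance scaling, so it illustrates only the correlation step; the function name `correlated_deviates` is hypothetical. The correlation of 0.8 matches the genetic correlation used in Example 1.

```python
import numpy as np


def correlated_deviates(corr, n, rng):
    """Return an n x nTraits array of deviates with correlation matrix corr.

    With L the lower-triangular Cholesky factor (corr = L L'), the value for
    trait r is the sum over s <= r of L[r, s] times independent deviate s.
    """
    L = np.linalg.cholesky(np.asarray(corr, dtype=float))
    z = rng.standard_normal((n, L.shape[0]))   # independent deviates
    return z @ L.T


corr = np.array([[1.0, 0.8],
                 [0.8, 1.0]])
L = np.linalg.cholesky(corr)
rng = np.random.default_rng(0)
effects = correlated_deviates(corr, 20000, rng)   # rows: QTN, columns: traits
sample_corr = np.corrcoef(effects, rowvar=False)[0, 1]
```

The same function applied to the residual correlation matrix would give correlated residual deviates for each individual.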
Doubled‐Haploids
AlphaSim allows the use of DHs. Doubling can be achieved for all individuals included in any given generation as specified by the user. Operationally, AlphaSim simulates DHs by first generating a recombined gamete from the two haplotypes of an individual and then doubling this gamete to produce a diploid individual with identical haplotypes.
Genome Editing
Genome editing is a new technology that has great potential for empowering breeding programs. In recent years, several applications of genome editing have been demonstrated in plant breeding. For example, heritable resistance to powdery mildew has been conferred to bread wheat by simultaneously editing three homeologs (Wang et al., 2014). In maize (Zea mays L.), editing technologies were used to modify endogenous loci and add an herbicide tolerance gene at a targeted locus (Shukla et al., 2009). To evaluate the potential of genome editing in breeding programs, genome editing functionality has been added to AlphaSim and demonstrated in an animal breeding application by Jenko et al. (2015). This functionality lets the user determine the number of individuals to be edited, whether these are the top‐ or bottom‐ranked individuals among those selected, and the number of QTN to be edited for each individual. The QTN to be edited are selected in descending order of the magnitude of their effect, that is, the QTN with the largest effects in absolute value are preferentially edited. AlphaSim then performs gene editing such that each edited individual bears the favorable allele in a homozygous state at the edited QTN.
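The editing rule can be sketched for a single individual. In this illustration the function name `edit_individual` is hypothetical, effects are assumed to refer to the nonzero allele (so a positive effect makes genotype 2 the favorable homozygote and a negative effect makes it genotype 0), and QTN that are already favorable homozygotes are not skipped, which may differ from AlphaSim's behavior.

```python
def edit_individual(genotype, effects, n_edits):
    """Return a copy of a 0/1/2 genotype with the n_edits largest-effect QTN
    set to the favorable homozygote."""
    # Rank QTN by the absolute size of their effect, largest first.
    order = sorted(range(len(effects)), key=lambda k: abs(effects[k]), reverse=True)
    edited = list(genotype)
    for k in order[:n_edits]:
        edited[k] = 2 if effects[k] > 0 else 0
    return edited


genotype = [0, 1, 2, 1]
effects = [0.1, -0.9, 0.5, 0.3]
edited = edit_individual(genotype, effects, n_edits=2)
```

Here the two QTN with the largest absolute effects (the second and third) are edited, and the original genotype list is left unchanged.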
Breeding by Optimal Contribution Selection
In addition to truncation selection, AlphaSim can perform optimal contribution selection, which seeks to find the balance between maximizing the response to selection and minimizing the loss of genetic variance and thereby increases the opportunity for greater response to selection in the long term (Wray and Goddard, 1994; Meuwissen, 1997).

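Optimal contributions are obtained through a call to the program AlphaMate, which maximizes

$$\mathbf{x}'\hat{\mathbf{a}} - \lambda\,\tfrac{1}{2}\,\mathbf{x}'\mathbf{A}\mathbf{x} \qquad [15]$$

where $\mathbf{x}'\hat{\mathbf{a}}$ is the mean genetic merit passed to the next generation, $\lambda$ is an unknown penalty on the loss of genetic diversity, $\mathbf{A}$ is the pedigree or genomic numerator relationship matrix between the selection candidates, and $\tfrac{1}{2}\mathbf{x}'\mathbf{A}\mathbf{x}$ is the average expected inbreeding in progeny (Wray and Goddard, 1994; Meuwissen, 1997). AlphaMate searches for the value of the penalty that gives the user‐specified allowed increase in the rate of inbreeding and, given that value, solves Eq. [15] for the vector of contributions $\mathbf{x}$.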
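The trade-off between merit and inbreeding can be illustrated by evaluating a penalized objective for two candidate contribution vectors. This is a sketch under stated assumptions, not the AlphaMate algorithm: the objective is taken to be mean merit minus the penalty times half the quadratic form in the relationship matrix, contributions are assumed to sum to one, and all values are toy data.

```python
import numpy as np


def ocs_objective(x, merit, A, lam):
    """Return (objective, mean merit, expected inbreeding) for contributions x."""
    x = np.asarray(x, dtype=float)
    mean_merit = float(x @ merit)
    inbreeding = float(0.5 * x @ A @ x)
    return mean_merit - lam * inbreeding, mean_merit, inbreeding


merit = np.array([2.0, 1.8, 1.0])            # breeding values of three candidates
A = np.array([[1.0, 0.5, 0.0],               # candidates 1 and 2 are related
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

best_two = [0.5, 0.5, 0.0]                   # truncation: the two best candidates
unrelated = [0.5, 0.0, 0.5]                  # best candidate plus an unrelated one

no_penalty = [ocs_objective(x, merit, A, 0.0)[0] for x in (best_two, unrelated)]
with_penalty = [ocs_objective(x, merit, A, 4.0)[0] for x in (best_two, unrelated)]
```

Without a penalty the two best candidates give the higher objective; with a penalty of 4 the pairing with the unrelated candidate wins, because its lower expected inbreeding outweighs the loss in merit.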
Flexibility
AlphaSim includes three restart functionalities, which make it more flexible than the packages from which it is derived. The first restart functionality enables a simulation process to be stopped after a user‐specified generation and to be resumed with some program parameters changed. For example, truncation selection could be used for a number of generations, and then the simulation is stopped and resumed with optimal contribution selection activated or using an alternative genomic selection training population or SNP panel. This feature also enables the simulation of a base population from which different scenarios can be derived or the combination of external and internally simulated pedigrees for both burn‐in and selection generations.
The second restart functionality makes AlphaSim flexible in terms of the method used to perform selection. Selection methods or statistical methods that are not implemented in AlphaSim (e.g., marker‐assisted selection) can be applied using third‐party software to analyze simulated data, select individuals, and mate them, and the externally created pedigree can then be imported into AlphaSim. This functionality thus allows the use of any user‐defined pedigree structure in one or more selection generations. For this purpose, AlphaSim provides the user with information about the genotypes, phenotypes, TGV, and TBV of both selection candidates and training individuals as well as the gEBV or pEBV of the selection candidates obtained through a call to the program AlphaBayes.
The third restart functionality enables output from different AlphaSim runs to be merged into a single run. This enables further flexibility and parallel processing. The merge functionality can be used in two ways. The first is to merge information across a range of AlphaSim runs, that is, by run directory merge. The second way is to merge information from specified sets of individuals from a range of runs, that is, by individual merge. This means that the user has the choice of independently performing selection at the end of each run and then combining specific individuals to form a new merged population.
For example, one AlphaSim run can be performed to generate a base population. From this base population, 100 AlphaSim runs can be spawned in which each run would generate a biparental family from two inbred parents. Because these runs all spawn from the same base population, the genetic architecture of traits and other parameters of the founding population is shared between the biparental families. Once these 100 runs are finished, another run of AlphaSim can be performed in which a subset of the individuals from each of the biparental families can be selected and merged into a single population, forming a selected set of lines that can serve as parents of a new set of biparental families. This process of splitting and merging can be repeated several times in many different ways.
Output and Data Storage
The output files of AlphaSim are organized in three directories: Chromosomes, Selection, and SimulatedData (Fig. 3). The Chromosomes directory stores detailed information about the segregating sites, SNP panels, and QTN as well as the phased haplotypes and genotypes of the simulated individuals for each chromosome and for each generation. The Selection directory stores the information required to perform selection for each selection cycle: the TGV of the selection candidates when selection is based on TGV, the TBV when selection is based on TBV, their phenotypes when selection is based on phenotypes, and the SNP genotypes of both the training individuals and selection candidates and the phenotypes of the training individuals when selection is based on gEBV. It also stores the input and output files of AlphaBayes when selection is based on estimated breeding values and the input and output files of AlphaMate when optimal contribution selection is used. The SimulatedData directory stores results of the simulation process. This directory includes the pedigree; the gender of each individual; the simulated TGV, TBV, and phenotypes; the allele frequency and physical position of each SNP and QTN; the simulated QTN effects; the SNP and QTN genotypes; and the trait variance components.

Output of AlphaSim by directory.
AlphaSim has an efficient system of data storage that makes the simulation of whole‐chromosome haplotype sequences in very large pedigrees computationally feasible. This system includes, among other aspects, the representation of strings of zeros and ones in segments of genome as long integers, meaning that more sequence information can be stored in a given segment of memory. Also, the user can define a rate at which the genome is reduced in its representation, which specifically means that only a portion of the segregating sites in the base haplotypes are used in the subsequent part of the simulation. This option allows a reduction in the computational time and memory requirements for the simulation while maintaining all or most of its properties depending on the aims of the simulation. Additionally, standard file zipping procedures are used to compress the larger files.
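The idea of storing strings of zeros and ones as integers can be sketched as follows. Python's arbitrary-precision integers stand in for the fixed-width long integers mentioned above, and the function names are hypothetical; the sketch shows the principle rather than AlphaSim's storage format.

```python
def pack_haplotype(alleles):
    """Pack a sequence of 0/1 alleles into one integer (site 0 is the lowest bit)."""
    value = 0
    for site, allele in enumerate(alleles):
        if allele:
            value |= 1 << site
    return value


def allele_at(packed, site):
    """Read the allele stored at a given site."""
    return (packed >> site) & 1


def count_differences(packed_a, packed_b):
    """Number of sites at which two packed haplotypes differ."""
    return bin(packed_a ^ packed_b).count("1")


haplotype = [1, 0, 1, 1, 0, 0, 0, 1]
packed = pack_haplotype(haplotype)            # bits 0, 2, 3 and 7 are set
unpacked = [allele_at(packed, site) for site in range(len(haplotype))]
n_diff = count_differences(pack_haplotype([1, 0, 1]), pack_haplotype([0, 0, 1]))
```

Besides saving memory, the packed form allows whole segments to be compared or copied with single bitwise operations.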
AlphaSim makes extensive use of the hard disk to store files, which allows the required virtual memory to be managed. The files are stored in the Chromosomes directory and account for the largest part of disk space that is used by the simulation process. To release this disk space, the user has the option to discard the files stored in the Chromosomes directory once the simulating process has ended. Finally, the user can use the flexibility of AlphaSim to further reduce memory and storage requirements by breaking large simulations into manageable blocks (by generation or biparental family, etc.) using the restart functionality.
Results
In this section we provide four examples of plant breeding programs simulated using AlphaSim and illustrate the computational requirements of the software. We have demonstrated some examples of the animal breeding applications elsewhere (e.g., Hickey et al., 2011; Gorjanc et al., 2015a,b; Jenko et al., 2015).
Example 1: Genomic Best Linear Unbiased Prediction selection and Genotype × Environment Interactions in Biparental Families
We simulated a pedigree comprising five biparental families in which recombinant inbred lines (RILs) were selected and evaluated in contrasting environments using an experimental design with several replicates. Each biparental family was derived from a cross between two DH lines. Selfing each of the five F1 individuals was simulated to result in four F2 individuals per family, that is, 20 F2 individuals in total. The F2 individuals were then selfed through a single‐seed descent process for eight generations to simulate 20 F10 RILs. Five F10 RILs were selected based on their gEBV to generate a new generation. In total, the pedigree included 12 burn‐in generations and one selection generation (Fig. 4). Because performing genomic selection requires a population of individuals for which both phenotypes and SNP genotypes are available to train the prediction equation (Eq. [10]), a base generation including a large number of individuals (e.g., 1000 individuals) was simulated. As illustrated in Fig. 4, the number of individuals and the number of parents to be selected for each generation were specified according to the applied mating design. AlphaSim then simulated the pedigree so that the plants in a given generation were equally distributed among the matings.

Simulation of the plant breeding pedigree in Example 1. The pedigree includes 12 burn‐in generations (from founders to F10) and one selection generation (F11). Ten randomly selected founders are used to generate 10 doubled haploids (DHs). These DHs are crossed to simulate five unique F1 individuals. Selfing the F1 individuals results in 20 F2 individuals (i.e., four per F1). The F2 genotypes are selfed through single‐seed descent for eight generations to generate 20 F10 recombinant inbred lines (RILs). Five RILs are then selected and selfed to create three F11 individuals each.
Genotype × environment interactions were simulated using the multiple traits capability of AlphaSim with each trait representing a distinct environment. The heritability was set to 0.8 and 0.2 for Environments 1 and 2, respectively. The a priori additive genetic variance was set to 1.0 in both environments, the genetic correlation between the environments was set to 0.8, and the residual correlation was set to zero. Dominance effects were assumed to be null. This setting simulated a correlated G + G × E value for each individual in each of the two environments. These G + G × E values were the sum of the main genotypic effect and its interaction with the environment. Adding independent residuals gave rise to phenotypic values in each of the two environments.
Genomic selection was conducted in F10 by using the default truncation selection method. The five best‐performing F10 RILs were selected based on their gEBV. For this purpose, the training population was comprised of the 1000 individuals from the first generation of the pedigree. The SNP effects were estimated in each environment independently using the genotypes of the training individuals and their phenotypes, which were simulated in each environment. The gEBVs were then computed in each environment independently before being integrated into selection indices using the provided index weights. Here, the same importance was given to each environment by setting the index weights to 0.5. Finally, selection was performed in F10 based on the genomic‐estimated selection indices.
The five selected F10 plants were selfed to generate F11 seeds. These were tested in the two distinct environments with three replicates so that three phenotypic values were simulated for each RIL in each environment. The five simulated RILs showed contrasting performance in the two distinct environments (Fig. 5).

Results of genotype × environment interaction simulated in Example 1. Five recombinant inbred lines (RILs) tested in three replicates in two contrasting environments with heritabilities of 0.8 and 0.2.
Example 2: User‐Defined Selection in Biparental Families
A pedigree including five burn‐in generations and one generation derived from selection was generated with a structure similar to Example 1 (Fig. 6). The best‐performing F3 individual in each of the five biparental families was selected and selfed to create three F4 individuals. The selection of a single F3 individual in each family was achieved using the restart functionalities of AlphaSim. Specifically, the simulation process was stopped after the gEBVs of the F3 individuals had been calculated, enabling selection decisions to be made outside the program. The pedigree of Generation 6, that is, the pedigree of the F4 individuals, was created externally and then imported into AlphaSim before resuming the simulation process.

Simulation of the plant breeding pedigree in Example 2. The pedigree includes five burn‐in generations (from founders to F3) and one selection generation. Ten randomly selected founders are used to generate 10 doubled haploids (DHs). These DHs are crossed to simulate five unique F1 genotypes. Selfing the F1 individuals results in 20 F2 individuals, four per F1. The F2 individuals are selfed through single‐seed descent for one generation to simulate 20 F3 individuals, and the best‐performing F3 in each of the five biparental families is selected and selfed to create three F4 individuals.
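The external selection step of Example 2 could be scripted along the following lines. This is a sketch under stated assumptions: the individual identifiers, the placeholder gEBVs, and the three-column (offspring, parent 1, parent 2) pedigree layout are hypothetical and are not taken from the AlphaSim file specification, which should be checked in the user manual.

```python
def best_per_family(candidates):
    """Pick the candidate with the highest gEBV within each family.

    candidates: iterable of (individual_id, family_id, gebv) tuples.
    Returns a dict mapping family_id to the selected individual_id.
    """
    best = {}
    for ind, fam, gebv in candidates:
        if fam not in best or gebv > best[fam][1]:
            best[fam] = (ind, gebv)
    return {fam: ind for fam, (ind, _) in best.items()}

def selfed_offspring_pedigree(parents, n_offspring, first_id):
    """Build (offspring, parent1, parent2) rows; selfing repeats the parent."""
    rows = []
    next_id = first_id
    for parent in parents:
        for _ in range(n_offspring):
            rows.append((next_id, parent, parent))
            next_id += 1
    return rows

# Hypothetical F3 generation: 20 individuals (ids 101-120), four per family.
# The gEBVs are arbitrary deterministic placeholders, not simulated values.
f3 = [(ind, (ind - 101) // 4 + 1, ((ind * 7) % 11) / 10) for ind in range(101, 121)]

selected = best_per_family(f3)
f4_pedigree = selfed_offspring_pedigree(
    [selected[fam] for fam in sorted(selected)], n_offspring=3, first_id=121
)
```

The resulting 15 rows describe the F4 generation (three selfed offspring for each of the five selected F3 individuals) and would then be written out in whatever pedigree format the simulation expects before the run is resumed.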
Example 3: Eight‐Parent Multiparent Advanced‐Generation Intercross Population
Some populations used in plant breeding have a pedigree structure that includes a very specific crossing scheme. For this example, we used the pedigree of an eight‐parent multiparent advanced‐generation intercross (MAGIC) population, whose power for the dissection of the genetics of traits has been demonstrated (Mackay et al., 2014). The pedigree included a total of 561 individuals: the eight parental varieties, the 28 possible F1 individuals derived from crossing two parents (excluding reciprocal crosses), the 210 possible F2 individuals derived from crossing two unrelated F1 parents, and the 315 F3 individuals derived from crossing two unrelated F2 parents (Mackay et al., 2014). Because of the specificity of the crossing structure, the pedigree of the eight‐parent MAGIC population was constructed externally and then imported into AlphaSim. Since externally imported pedigrees and internally simulated pedigrees are compatible in AlphaSim, additional generations can be integrated into the MAGIC pedigree. This feature could be used to derive RILs from the 315 F3 individuals as demonstrated in Example 1.
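The generation sizes quoted above can be reproduced by enumerating the crosses. The following Python sketch is an assumption-laden illustration rather than the procedure used to build the published pedigree: it interprets "unrelated" as sharing no founder, ignores reciprocal crosses, and uses hypothetical function names.

```python
from itertools import combinations

def founders(individual):
    """Return the set of founders that an individual descends from.

    Founders are integers; a cross is a nested tuple (parent1, parent2).
    """
    if isinstance(individual, int):
        return frozenset([individual])
    return founders(individual[0]) | founders(individual[1])

def cross_unrelated(individuals):
    """All pairwise crosses between individuals that share no founder."""
    return [
        (a, b)
        for a, b in combinations(individuals, 2)
        if not (founders(a) & founders(b))
    ]

parents = list(range(1, 9))   # eight parental varieties
f1 = cross_unrelated(parents)  # two-way crosses
f2 = cross_unrelated(f1)       # four-way crosses
f3 = cross_unrelated(f2)       # eight-way crosses

pedigree_size = len(parents) + len(f1) + len(f2) + len(f3)
```

Under this interpretation the enumeration yields 28, 210, and 315 crosses for the three generations, giving the total of 561 individuals. A list built this way could serve as the starting point for an externally constructed pedigree.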
Example 4: Plant Breeding Programs
A further plant breeding capability of AlphaSim is demonstrated with a simulation of the development of (pseudo) F4 RILs using single‐seed descent combined with recurrent selection on F2 plants. The simulation included three scenarios that differed from each other in the number of cycles of recurrent selection: 0, 2, or 4 (Fig. 7). The scenarios began with a common pair of initial parents simulated using a single run of AlphaSim. Crossing these parents generated the F1 population, which was then selfed to generate F2 plants. The output from this run was copied to three distinct new locations, in which each scenario was run using the flexibility option. Recurrent selection consisted of selecting the two best‐performing F2 individuals based on their gEBV and crossing them to generate new F2 plants (Fig. 7). Because the simulation did not include a training population for genomic selection, we took gEBV to be a phenotype with a heritability of 0.6. After completing all cycles of recurrent selection, F4 RILs were developed using single‐seed descent. The resulting F4 RILs were used to compare the performance of the three scenarios (Table 2).

Simulation of the development of F4 derived recombinant inbred lines (RILs) using single‐seed descent combined with recurrent selection on F2 plants. The simulation included three scenarios differing from each other by the number of cycles of recurrent selection, which was 0, 2, or 4. The scenarios begin with a common pair of parents, which were crossed to generate the F1 plants. Selfing the F1 generated F2 plants. Recurrent selection consisted of selecting the two best performing F2 individuals based on their genomic estimated breeding values and crossing them to generate new F2 plants. The F4 RILs were then developed using single‐seed descent.
| | Base | Scenario 1 | Scenario 2 | Scenario 3 |
|---|---|---|---|---|
| User features | | | | |
| Number of chromosomes | 7 | | | |
| Number of segregating sites | 30,356 | | | |
| Start–stop generation | 1–2 | 3–5 | 3–9 | 3–13 |
| Number of individuals | 52 | 300 | 504 | 708 |
| Results | | | | |
| Genetic variance F4 stage | – | 0.333 | 0.110 | 0.002 |
| Mean gEBV† F4 stage | – | 1.315 | 2.182 | 2.584 |
| Computational feature | | | | |
| Running time | 0 m 6 s | 0 m 4 s | 0 m 7 s | 0 m 10 s |
- † gEBV, genomic estimated breeding value.
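The use of a phenotype with a heritability of 0.6 as a stand-in for gEBV in Example 4 can be sketched as follows. This is a hedged illustration, not AlphaSim code: the function names are hypothetical, the true genetic values are drawn from a standard normal distribution for convenience, and the residual variance is derived from the sample genetic variance.

```python
import math
import random

def proxy_gebv(true_values, heritability, rng):
    """Add residual noise so that the proxy has the requested heritability.

    Assumption: Ve = Vg * (1 - h2) / h2, with Vg estimated from the sample.
    """
    n = len(true_values)
    mean = sum(true_values) / n
    genetic_var = sum((g - mean) ** 2 for g in true_values) / n
    sd_e = math.sqrt(genetic_var * (1.0 - heritability) / heritability)
    return [g + rng.gauss(0, sd_e) for g in true_values]

def two_best(ids, scores):
    """Return the ids of the two individuals with the highest scores."""
    ranked = sorted(zip(scores, ids), reverse=True)
    return [ind for _, ind in ranked[:2]]

rng = random.Random(7)
true_values = [rng.gauss(0, 1) for _ in range(100)]  # hypothetical F2 plants
gebv = proxy_gebv(true_values, heritability=0.6, rng=rng)
parents = two_best(list(range(100)), gebv)  # crossed to start the next cycle
```

Under an additive model with independent residuals, the expected correlation between such a proxy and the true genetic value is the square root of the heritability, about 0.77 for a heritability of 0.6.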
Computational Requirement
AlphaSim was benchmarked by simulating two distinct scenarios, each run twice: once with and once without requesting the full genome sequence to be written out (Table 3). The two scenarios differed from each other in the number of segregating sites in the genome, the numbers of SNP and QTN, the size of the pedigree, and the size of the genomic selection training population, all of which were larger in Scenario 2 than in Scenario 1 (Table 3). The genome consisted of 10 chromosomes, each 1 Morgan in length. In Scenario 1, MaCS used parameters relating to the historical effective population size, mutation rate, and recombination rate, resulting in an average of 71,190 segregating sites across the genome, while in Scenario 2, there was an average of 163,590 segregating sites across the genome. Totals of 5000 and 20,000 SNP, and 2500 and 10,000 QTN, were sampled from the segregating sites of Scenarios 1 and 2, respectively. Two traits were simulated with heritability and variance–covariance components as described in Example 1. The structures of the pedigrees were as described in Fig. 4. In Scenario 1, the pedigree included 1210 individuals distributed along the pedigree as shown in Fig. 4. In Scenario 2, the pedigree included 234,500 individuals: 50,000, 2000, and 1000 individuals in Generations 1, 2, and 3, respectively; 20,000 in each of Generations 4 to 12; and 1500 in Generation 13. Genomic selection was performed in Generation 12 using training populations of 1000 and 30,000 individuals sampled from the first generation of Scenarios 1 and 2, respectively.
| | Scenario 1 | Scenario 1 | Scenario 2 | Scenario 2 |
|---|---|---|---|---|
| User features | | | | |
| Write out the full genome sequence | No | Yes | No | Yes |
| Number of segregating sites | 70,140 | 72,240 | 162,780 | 164,400 |
| Number of SNP† | 5000 | 5000 | 20,000 | 20,000 |
| Number of QTN‡ | 2500 | 2500 | 10,000 | 10,000 |
| Pedigree size (no. of individuals) | 1210 | 1210 | 234,500 | 234,500 |
| Size of the training population | 1000 | 1000 | 30,000 | 30,000 |
| Computational feature | | | | |
| Running time | 1 m 34 s | 3 m 39 s | 4 h 9 m 54 s | 19 h 6 m 11 s |
- † SNP, single‐nucleotide polymorphism.
- ‡ QTN, quantitative trait nucleotide.
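The pedigree sizes of the two benchmark scenarios can be checked by summing the generation sizes. In the sketch below, the Scenario 2 sizes are those given in the text; the Scenario 1 breakdown is an assumption read from the Example 1 pedigree (1000 first‐generation individuals, 10 DHs, five F1, 20 individuals in each of F2 to F10, and 15 F11). The variable and function names are hypothetical.

```python
def pedigree_size(generation_sizes):
    """Total number of individuals across all generations."""
    return sum(generation_sizes.values())

# Generation number -> number of individuals.
# Generations 4 to 12 correspond to F2 through F10 in the Example 1 pedigree.
scenario_1 = {1: 1000, 2: 10, 3: 5, **{g: 20 for g in range(4, 13)}, 13: 15}
scenario_2 = {
    1: 50_000,
    2: 2_000,
    3: 1_000,
    **{g: 20_000 for g in range(4, 13)},
    13: 1_500,
}

sizes = {
    "Scenario 1": pedigree_size(scenario_1),
    "Scenario 2": pedigree_size(scenario_2),
}
```

Both totals agree with the pedigree sizes reported for the benchmark, 1210 and 234,500 individuals across 13 generations.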
Computations were performed on a Linux server. Scenario 1 was run using one CPU core with 2 GB of RAM available from a dual Intel Westmere E5620 2.4‐GHz quad‐core processor. Scenario 2 was run using 12 CPU cores with 5 GB of RAM available from dual Intel Westmere E5645 2.4‐GHz six‐core processors. For Scenario 1, running time was 1 min 34 s and increased to 3 min 39 s when the full sequence information was written to disk (Table 3). For Scenario 2, the running time was 4 h 9 min 54 s and increased to 19 h 6 min 11 s when the full sequence was written to disk.
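The relative cost of writing the full sequence to disk follows directly from the reported running times. The short sketch below only performs this arithmetic; the function and variable names are hypothetical.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert a running time to seconds."""
    return 3600 * hours + 60 * minutes + seconds

# (without full sequence output, with full sequence output), in seconds.
runs = {
    "Scenario 1": (to_seconds(minutes=1, seconds=34), to_seconds(minutes=3, seconds=39)),
    "Scenario 2": (to_seconds(4, 9, 54), to_seconds(19, 6, 11)),
}

# Ratio of running time with sequence output to running time without it.
slowdown = {
    name: with_seq / without_seq
    for name, (without_seq, with_seq) in runs.items()
}
```

Writing the full sequence increased the running time roughly 2.3‐fold in Scenario 1 and roughly 4.6‐fold in Scenario 2, which suggests that the cost of sequence output grows faster than the rest of the simulation as the pedigree and genome become larger. The two scenarios were run on different hardware configurations, so the comparison between them is indicative only.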
Discussion
AlphaSim is a new software package for simulating breeding program designs that use sequence data, pedigrees, genotypes, and phenotypes. Different mating systems enable simulation of plant or animal populations. AlphaSim extends the scope of the currently available plant breeding simulation packages because of its wide flexibility, enabling the design of almost any pedigree structure and the application of many selection methods, in particular genomic selection and genome editing as demonstrated in the above examples and other previously published work (Clark et al., 2012; Daetwyler et al., 2013; Hickey et al., 2014, 2015; Gorjanc et al., 2015a).
AlphaSim can simulate small datasets in a very short time. Simulating large pedigrees with large genome sequences significantly increases the running time, particularly when writing the full sequence data to disk (Table 3). However, when simulating distinct scenarios characterized by the same SNP panels, QTN, and trait information, the simulation time can be significantly reduced using the restart functionality of the software, that is, by deriving each scenario from a common base generation. For example, we have successfully used this approach to simulate a wheat breeding program with genomic selection spanning 41 yr (overlapping generations) with 1.7 million unique genotypes per year or a pig (Sus scrofa domesticus) breeding program with genomic selection spanning 30 yr (overlapping generations) with 35,000 unique genotypes per year (R.C. Gaynor and J.M. Hickey, unpublished data, 2016).
In conclusion, we make three points: (i) AlphaSim allows breeders and researchers to simulate genomic data controlled by very specific user criteria, to evaluate the power of diverse breeding programs, and to optimize their requirements in terms of sequencing, genotyping, and phenotyping resources; (ii) AlphaSim is flexible, computationally efficient, and easy to use for a wide range of possible scenarios; and (iii) AlphaSim was designed to simulate plant breeding programs; however, it can be used in many other fields of genetics, including animal breeding, human genetics, and population genetics.
Finally, we plan to continue to add new features to AlphaSim to increase its functionality and to respond to the requirements of simulation technology and breeding program designs that will emerge in the coming years.
Availability
AlphaSim is available from http://www.alphagenes.roslin.ed.ac.uk/alphasuite/alphasim/. The available material includes the compiled programs for 64‐bit Linux, Mac OSX, and Windows, together with a 47‐page user manual and a 51‐page set of simple examples with step‐by‐step instructions aimed at prospective users who have no experience with Linux or Mac OSX.
Acknowledgments
The development of AlphaSim has been funded under multiple projects. The authors would like to acknowledge specific contribution from The Australian Research Council, Aviagen LTD, Genus PLC, Pfizer Inc., The Sheep CRC, CIMMYT, Advanta Semillas, the Seeds of Discovery Project supported by SAGARPA (La Secretaría de Agricultura, Ganadería, Desarrollo Rural, Pesca y Alimentación), Mexico under the MasAgro (Sustainable Modernization of Traditional Agriculture) initiative, the ISPG funding from the BBSRC to The Roslin Institute, and the GplusE project funded by the BBSRC.
References