Estimation of realized rates of genetic gain and indicators for breeding program assessment

Routine estimation of the rate of genetic gain (ΔGt) realized by a breeding program has been proposed as a means to monitor its effectiveness. Several methods of realized ΔGt estimation have been utilized in other studies, but none have been objectively evaluated in a plant breeding context. Stochastic simulations of 80 rice (Oryza sativa) breeding programs over 28 years were done to generate data used to evaluate five methods of realized ΔGt estimation in terms of error, precision, efficiency and correlation between true and predicted annual mean breeding values. Two indicators of ΔGt, the expected ΔGt and the average number of equivalent complete generations (EqCg), were described and evaluated. At best, estimates of realized ΔGt were over or underestimated by 15% and 27% when considering all 28 years and the past 15 years of breeding respectively. The best methods were the control population, estimated breeding value, and ERA trial methods. Among these, correlations between true and estimated ΔGt were at best 0.59, indicating that these methods cannot very accurately rank breeding programs in terms of realized ΔGt. The expected ΔGt and the average EqCg were shown to be useful indicators for determining if a non-zero genetic gain is expected. Determining which of the three best realized ΔGt estimation methods evaluated, if any, would be appropriate for any given breeding program should be done with careful consideration of the objectives, resources, seed stocks, and structure of the data available.


Introduction
Realized genetic gain is the change in average breeding value of a population over at least one cycle of selection for a particular trait or index of traits. Change in breeding or genetic values of populations over many cycles or years is referred to as genetic trend. When a genetic trend is linear, the rate of genetic gain per year, Δ G t , that is realized can be estimated by regressing the average breeding value on year (Eberhart, 1964). Assuming the breeding process remains unchanged and the trait of interest is quantitatively inherited according to the infinitesimal model (Fisher, 1918), this estimate is can be used to predict future genetic gain.
Analyses of genetic trends, often accompanied by estimation of realized Δ G t , have been conducted to analyze the outcomes of breeding programs or selection experiments in model species, livestock, and crops. In experimental populations of non-model and model species such as fruit fly (Drosophila melanogaster) and laboratory mouse (Mus musculus), selection experiments and assessments of genetic trends have been used to study the genetics of complex traits, selection limits, and to test quantitative genetic models and assumptions (Hill and Caballero, 1992). In livestock, reports of genetic trends are used to assess the effectiveness of breeding programs and to determine if adjustments are needed (Oldenbroek and van der Waaij, 2015). In crop species, estimation of realized Δ G t is typically done after conducting selection experiments with the objective of comparing different breeding strategies (Hallauer et al., 2010) and to a lesser extent to assess the effectiveness of breeding programs. This may be because the adoption of varieties by growers in the targeted region has been considered to be the ultimate measure of success of the program (Brennan and Byerlee, 1991). 3 Various methods have been used to assess genetic trends depending on the species, data and germplasm that are available. Typically, for the analysis of genetic trends in model species, such as those described by Manning (1961) , Kraaijeveld and Godfray (1997), and Swallow et al. (1998), an unselected control population is maintained and included in the phenotypic evaluations to be able to determine what changes in phenotypic value were due to artificial selection imposed by the experiment and what changes are due to environmental or other causes. In livestock breeding programs, because maintaining an unselected control population would be very costly, animal breeders assess genetic trends using estimated breeding values (EBVs), using methods reviewed by Garrick, 2010. EBVs are estimated using historical breeding program data and a mixed linear model with genetic covariance between relatives modeled according to additive relationships estimated using pedigrees. Then for each year, the average EBV for animals born in that year is computed and trends in average EBVs are analyzed. Genetic trends estimated in this way have been shown to be accurate under certain conditions (Sorensen and Kennedy, 1984) including accurate estimation of variance components and adequate connectedness across years. In crop selection experiments, analysis of genetic trends and estimation of realized Δ G t is typically done by growing a sample of germplasm from each selection cycle in a common set of environments and regressing population means against year or cycle number (Eberhart, 1964). By growing all germplasm in one set of environments, trends in the average phenotypic values per-cycle can be attributed to genetic rather than environmental causes. In crop breeding programs on the other hand, samples of germplasm originating from each selection cycle are not usually available and so varieties released over time are used to represent the breeding germplasm.
Such experiments are commonly referred to as ERA trials, popularized by Duvick who published a series of ERA trial studies between 1972 and 2004 to understand and quantify yield gains due to hybrid corn (Zea mays) breeding (reviewed by Duvick, 2005). In few studies, mixed models have been used to estimate realized Δ G t in crop breeding programs using historical variety trial data (Mackay et al., 2011;Piepho et al., 2014).
Recently there has been significant interest in routinely analyzing genetic trends in order to estimate realized Δ G t due to crop breeding programs to monitor their effectiveness (G.N. Atlin, personal communication, 2018). In order to do this effectively, careful planning and analysis is required to determine what data or germplasm should be archived or collected, and what experimental and statistical methods would be suitable given the resources and constraints of the breeding program. Potential methods of estimating realized Δ G t should be evaluated based on their level of error and precision of realized Δ G t estimation, and the resources they require. The ideal realized Δ G t estimation method would be accurate, precise, and would utilize data already being generated by the breeding program.
Because estimation of realized Δ G t requires that the performance of breeding germplasm be observed across time, in order for a breeding program to obtain an estimate of its realized Δ G t , the breeding program will ultimately need to have been operating for at least a few years or cycles of selection and it must have either kept samples of germplasm developed in different years or cycles, or accurate phenotypic data that is suitable for acrossyear analysis. In cases where these requirements are not met, indicators of realized Δ G t may be useful for understanding if realized Δ G t >0 is expected from the breeding effort.
Indicators of realized Δ G t could also be useful to monitor each time new parents are selected to help identify cases when the selection decisions will not lead to genetic gain.
However, successful examples of this approach have not yet been reported.
In order to determine if a realized Δ G t estimation method or indicator of realized Δ G t is performing well, true values of realized Δ G t should be known for comparison.
Because true Δ G t can only be known when data are from a simulated breeding program, computer simulation of plant breeding programs, reviewed by Sun et al., (2011) was used as a tool to evaluate realized Δ G t estimation methods and indicators to determine which methods and indicators can be recommended for assessment of breeding programs and to understand their limitations. To identify the best method for estimation of medium to long term realized Δ G t , I evaluated a set methods similar to those previously reported for estimating medium to long term realized Δ G t using 15 and 28 years of data generated from stochastic simulations of rice (Oryza sativa) breeding programs. Lastly, I examine potential indicators of realized Δ G t that could be measured when obtaining reliable estimates of realized Δ G t may not be immediately possible.

Stochastic simulations for comparison of medium to long term Δ G t estimation methods
A founder population of 140 individuals was simulated to have genome structure similar to that of elite Indica rice germplasm. To do so, 12 chromosomes of 110 centimorgans in length and an effective population size of 33 was assumed. The trait under selection was assumed to be quantitative, conferred by 1000 additive loci. To initiate the breeding program ( Figure S1) founders underwent four generations of self-pollination. Thirty lines were selected as parents and intermated each year. Until at least two years worth of phenotypic data was available on new breeding lines, parents were randomly selected from the initial set of 140 inbred lines.
After parents were selected, they were crossed in a half-diallel and then if at least two years worht of phenotypic data was available the top 100 F1s were selected, otherwise 100 F1s were selected at random. From the 100 F1s, 10 F2 progeny were generated per F1 and F2s were selfed for three more generations, resulting in 1000 F5s.
The 1000 F5s entered the testing and advancement process which followed one of two possible schemes (Table 1). In scheme A, 1000 F5s were phenotyped in location one with one replicate, and then the best 100 were selected for advancement. These 100 were phenotyped the following year at locations one through two with two replicates per location. From these 100, the best 50 were selected for advancement and phenotyped the following year at locations one through four with two replicates per location. In scheme B, 1000 F5s were phenotyped in location one with one replicate, and then the best 267 were selected for advancement. The 267 lines were phenotyped in the following year at locations two through four with one replicate per location. In both schemes, advancement from one stage to the next and selection of parents was restricted by family such that no more than three lines could be selected per family. In both schemes, once parent selection was based on performance and not at random, in the same year that parents were selected, parental lines were phenotypically evaluated with the same number of locations and replications as the previous stage of testing.
In both schemes, selections were based on best linear unbiased prediction BLUP assuming genotypes were independent and identically distributed (i.i.d.). For advancement, only phenotypic data from the current year was utilized whereas for parent selection, phenotypic data from the last three years and last two years were utilized in scheme A and B respectively. After the first round of the testing and advancement process, the best five lines were selected as checks that were henceforth phenotyped along with breeding lines at each stage of testing. Simulation of phenotypic data followed one of two different sets of assumptions of variance components (Table 2) in order to simulate high and low levels of heritability. The first stage of testing was always assumed to have a higher error variance compared to the subsequent stages. To simulate either a negative or positive non-genetic trend, 0.1x was added to or subtracted from the phenotypic values each year where x is the year of the breeding program. In total there were eight different breeding program scenarios consisting of all possible combinations of testing scheme, heritability level, and non-genetic trend. Each breeding scenario was repeated 10 times for a total of 80 simulations.

Comparison of breeding scenarios based on true Δ G t
To compare true Δ G t from each breeding scenario per time period, average true breeding value per line-year for each simulation and time period was used to fit the mixed model where y i is the average true breeding value for line-year i, breeding scenario j, and simulation replicate k, β is a fixed regression coefficient for the average true Δ G t across all simulations, r i is a continuous covariate for the line-year i, s j is a fixed factor for the breeding scenario j, is a fixed regression coefficient for the average true Δ G t per breeding scenario, rs ij is a fixed factor for the interaction between line-year i and breeding scenario j, t k ( j) is a random factor for the simulation replicate k nested within breeding scenario j with , and ε ijk is the residual error with ε ijk ∼ N (0, R σ ε 2 ) . The diagonal of R contains a weighting factor to model a unique error variance for testing schemes A and B.
Estimated marginal means and standard errors of linear trends were then then computed per breeding scenario. Statistical comparisons between pairs of trend estimates from different breeding scenarios were made using a Tukey's multiplicity adjustment.

Control population:
Using data exclusively from the final stage of testing, stage three in scheme A and stage two in scheme B, realized Δ G t was estimated by fitting the mixed model: Where y ijlk is the phenotypic value from population i observed in year j and location k on genotype l, p i is the fixed effect of the ith population (either control, consisting of five longterm checks, or selected, consisting of the breeding lines), s j is the random effect of the jth year with s j ∼ N (0,σ s 2 ) , l k is the fixed effect of the kth location, g l ( j ) is the random effect of the lth genotype nested within the jth year with g l ( j ) ∼ N (0,σ g 2 ) , β is a fixed regression coefficient for Δ G t , r l is a continuous covariate for the line-year of genotype l, p s ij is the random interaction effect between the ith population and the jth year with ps ij ∼ N (0, σ ps 2 ) , p l ik is the fixed interaction effect between the ith population and the kth location, l g kl is the random interaction effect between the kth location and the lth genotype with lg kl ∼ N (0,σ lg 2 ) , and ε ijkl is the residual error with ε ijkl ∼ N (0, R σ ε 2 ) where the diagonal of R contains the residual variances of y ijkl to account for heterogeneous error variance.

EBV:
After excluding terminal years of the breeding programs where phenotypic data from all testing stages was not present, breeding values were estimated based on the model y ijk =μ+ g i +l j +s k +ε ijk where y ijk is the phenotypic value of genotype i observed in location j and year k, g i is the effect of the ith genotype with g i ∼ N (0, A σ g 2 ) where A is the additive relationship matrix estimated using pedigree records from the simulated breeding program of interest, l j is the effect of the jth location, s k is the effect of the kth year, and ε ijk is the residual error with ε ijk ∼ N (0, R σ ε 2 ) . The diagonal of R contains the residual variances of y ijk . Interactions between genotype and location and genotype and environment were not included due to software and hardware limitations. Next, the line-year, which is the year that a line entered the first stage of testing, was determined and the average EBV per line-year was computed. These average EBVs were then used as the response variable in the model: Where y i is the average EBV for line-year i, β is a fixed regression coefficient for Δ G t , r i is a continuous covariate for the line-year and ε i is the residual error with ε i ∼ N (0, σ ε 2 ) . The EBV method was repeated assuming different subsets of phenotypic data were available. When conducting statistical tests to compare methods, the EBV method applied to different phenotypic data subsets was considered as separate methods.

Variety trial:
In each year of each simulated rice breeding program where phenotypic data on breeding lines was available prior to parent selection, the 30 lines selected as parents for crossing in the that were never selected as parents, because only the top performing lines would be included in a variety trial. Then, according to Piepho et al. (2014), the simulated variety trial dataset was analyzed using the mixed model: where y ijk is the phenotypic value of genotype i observed in location j and year k, g i is the random effect of the ith genotype with g i ∼ N (0, σ g 2 ) , β is a fixed regression coefficient for Δ G t , r i is a continuous covariate for the line-year of genotype i, s j is the random effect of the jth year with s j ∼ N (0,σ s 2 ) , is a fixed regression coefficient for the non-genetic gain or loss per year, t j is a continuous covariate for the jth year, l k is a random effect of the kth location with l k ∼ N (0, σ l 2 ) , l s jk is the random effect of the interaction between location and year with ls jk ∼ N (0,σ ls 2 ) , g l ik is a random effect of the interaction between genotype and location with gl ik ∼ N (0, σ gl 2 ) , g s ij is a random effect of the interaction between genotype and year with g s ij ∼ N (0, σ gs 2 ) , and ε ijk is the residual error with , where the diagonal of R contains the residual variances of y ijk to account for heterogeneous error variance.

ERA trial:
For each line-year, genotypes belonging to that line-year were ranked according to their means calculated using phenotypic data from the final stage of testing, which was stage three of one genotype was selected per line-year. Using the simulated trial data, the phenotypic mean for each line-year was then computed and then used as the response variable in the model: Where y i is the phenotypic mean for line-year i, β is a fixed regression coefficient for Δ G t , r i is a continuous covariate for the line-year and ε i is the residual error with

Phenotype:
Using phenotypic data from the final stage of testing, stage three of Scheme A and stage two of Scheme B, phenotypic means were computed per year and then used as the response variable in the model Where y i is the phenotypic mean for year i within the final stage of testing, β is a fixed regression coefficient for Δ G t , r i is a continuous covariate for the line-year and ε i is the residual error with ε i ∼ N (0, σ ε 2 ) . where y i is the average true breeding value for line-year i, β is a fixed regression coefficient for true realized Δ G t , r i is a continuous covariate for the line-year and ε i is the residual error with ε i ∼ N (0, σ ε 2 ) . The standard error of estimated realized Δ G t was used as the measure of precision, efficiency was estimated as p t +n t / N ×100 where p t is the number of true positives and n t is the number of true negatives, and N is the total number of significance tests. True positives and true negatives were cases where the p-value of the true and estimated Δ G t were both < 0.05 and both ≥ 0.05 respectively. For each of the metrics error, precision, efficiency, and correlation, means for each of the realized Δ G t estimation methods were compared by fitting the mixed model

Comparison of medium to long term Δ G t estimation methods
where y ijk is the value of the metric of interest method i, breeding scenario j, and simulation replicate k, m i is a fixed factor for realized Δ G t estimation method i, s j is a fixed factor for breeding scenario j, ms ij is a fixed factor for the interaction between realized Δ G t estimation method i and breeding scenario j, t k ( j) is a random factor for the simulation replicate k nested within breeding scenario j with t k ( j) ∼ N (0,σ t 2 ) , and ε ijk is the residual error with ε ijk ∼ N (0, R σ ε 2 ) the diagonal of R contains a weighting factor to model a unique error variance each Δ G t estimation method i. For each metric, estimated marginal means per Δ G t estimation method were compared using a Tukey's multiplicity adjustment.

Equivalent complete generations:
The number of equivalent complete generations (EqCg), which is a measure of the equivalent number of discrete breeding cycles that have been completed in a breeding program where generations may overlap, was calculated per-line according to ∑ where n is the total number of known ancestors of the line, and g i is the number of generations between the line and its ancestor i (Boichard et al., 1997). Only parents of crosses, as opposed to selfpollinations, were considered as ancestors. Furthermore, for any cross appearing in the pedigree, the individuals involved in that cross were counted as ancestors regardless of whether they had appeared in other crosses in the same pedigree. For example, a line that appears in a pedigree as both a parent and great-grandparent is counted as two different ancestors. Mean EqCg per line-year was then computed and trends in mean EqCg and the association of mean EqCg with average true breeding value was inspected and quantified within each simulated breeding program based on the Pearson's correlation to determine if trend in EqCg could be used as an indicator of non-zero Δ G t .

Expected rate of genetic gain:
For two time periods, past 15 years and all 28 years, the expected Δ G t was computed each time parents were selected for crossing according to R/L where R is the expected response to selection in the current cycle which is the weighted average BLUP of the selected individuals minus the overall average BLUP and L is breeding cycle duration which is the weighted average age of the selected individuals. For the weighted averages, the relative contribution of a selected individual to the next generation was used as the weight. BLUPs were computed using all phenotypic data available within the time period of interest. Age of an individual was computed as the year of selection minus the year that the individual's most recent parents were crossed. Self-pollinations were not considered as crosses. Average R/L was the overall estimate of the expected Δ G t . I considered average R/L as an additional method of realized Δ G t estimation in some of the comparisons.

Evaluation of the use of estimates of Δ G t for comparing breeding programs
Associations between the expected Δ G t or the estimated realized Δ G t and true Δ G t were inspected among the set of 80 simulated breeding programs to determine how well the methods of realized or expected Δ G t could be used to compare breeding programs based on their true rates of Δ G t . The ability of the expected rate of genetic gain and estimated Δ G t to accurately predict the true Δ G t was quantified using the Pearson's correlation, r.
Absolute values of Pearson's correlation greater than 0.22 were considered significant at an ⍺ = 0.05 level of significance. To understand the cause of differences in levels of association between true and estimated Δ G t . Bias in estimated Δ G t by scenario was inspected.

Software
EBV estimation was implemented in REMLF90 (Misztal et al., 2014). All other analyses for the evaluation of medium to long term Δ G t estimation methods were carried out using the R

Genetic trends and realized Δ G t of simulated breeding programs
Genetic trends were linear for all eight scenarios (Figure 1). In all scenarios there was an initial period of seven years where there was no improvement in average breeding value because parents were selected at random from lines derived from the the initial population. Once selections could be made based on BLUP, annual improvements in average true breeding value were observed, however there was a year to year variability such that average true breeding value did not improve every year. This year to year variability was greater for scheme B compared to scheme A. Rates of realized Δ G t in units of genetic standard deviations ranged from 0.19 to 0.32 (Table 3). The average rate of realized Δ G t was 0.28 ± 0.0016 across all 28 years and 0.25 ± 0.0031 across the past 15 years. Within a scheme, high heritability led to higher rates of realized Δ G t compared low heritability. When the nongenetic trend was negative and the heritability was low, scheme A lead to higher realized Δ G t compared to scheme B in both time periods.

Comparison of medium to long term realized Δ G t estimation methods
Error of realized Δ G t estimation methods (Table 4)  had the lowest level of error followed by the ERA trial, EBV, or variety trial method. Excluding checks from the dataset prior to using the EBV or variety trial methods led to a significant increase in error of these methods, with the error of the variety trial method affected to a greater extent. In fact, when the checks were excluded, the variety trial method had higher error than the phenotype method regardless of the time period. The best method that did not require that long-term checks be present in the breeding trial dataset was the ERA trial method, however, in the past 15 year time period the error of the ERA trial method was not significantly less than that of the phenotype method.
Precision, measured using the standard error, of realized Δ G t estimation methods (Table 4) ranged from 0.019 to 0.081 when using data from the past 15 years and from 0.013 to 0.08 when using data from all 28 years. The variety trial method was the most precise method followed by the EBV method. Excluding the checks from the dataset significantly reduced the precision obtained for the variety trial method. The least precise method in both time periods was the ERA trial method.
Efficiency of the Δ G t estimation methods (Table S2) ranged from 79 to 100% when using data from the past 15 years and from 78 to 100% when using data from all 28 years.
When using data from the past 15 years there was greater variability in efficiencies among the different Δ G t estimation methods. Regardless of whether checks were included, the EBV and variety trial methods had efficiency levels of at least 98%. The next most efficient method in this time period was the control population method followed by the phenotype and ERA trial methods. When using data from all 28 years, all methods had 100% efficiency except for the ERA trial method which had an efficiency level of 78%.
Correlations between the annual true mean breeding values and the predicted annual mean breeding values (Table S2) were at least 0.9 with the exception of three cases. When using data from the past 15 years, correlations were 0.74 and 0.51 for the phenotype and ERA trial methods respectively and when using data from all 28 years, the ERA trial method had a correlation of 0.49.

Impact of incomplete data on the effectiveness of the EBV method
Accuracies of the EBV method excluding different subsets of data ranged from 0.052 to 0.11 in the last 15 year time period and 0.057 to 0.1 in the 28 year time period (Table 5). Excluding data on lines discarded in stage one either improved or had no impact on error. Excluding data on the last testing stage or the checks led to a significant reduction in error, except when data was also excluded on the lines discarded in stage one.
Levels of precision of the EBV method when using different subsets of data (Table 5) ranged from 0.028 to 0.031 in the past 15 year time period and from 0.015 to 0.018 in the 28 year time period. None of the differences in precision were significant in the 15 year time period. In the 28 year time period the highest levels of precision were observed when data on the lines discarded in the stage one and the checks were excluded.
Efficiency of the EBV method was unaffected by the exclusion of different subsets of data (Table S3). In all cases 100% efficiency was observed. Similarly, correlations between the annual true mean breeding values and the predicted annual mean breeding values based on the EBV method were also mostly unaffected by the subset of data used (Table S3).
Correlations ranged from 0.97 to 0.98 in the past 15 year time period and from 0.97 to 0.99 in the 28 year time period.

Evaluation of Δ G t indicators
Trends in average EqCg (Figure 2) reflected the trends in average true breeding values ( Figure   1). Within a breeding program, the average correlation between mean EqCg per line-year and mean true breeding value per line-year was 0.98 ± 0.0010 over all 28 years and 0.98 ± 0.001 over the last 15 years. Average R/L had an error level of 0.079 ± 0.0033 in the past 15 year time period which was not significantly different from the control population, EBV, and ERA trial methods. In the 28 year time period average R/L had an error of 0.064 ± 0.0029 which was less accurate than the control population method but just as accurate as the EBV, ERA trial, and variety trial methods.

Association between the true realized Δ G t and estimated or expected realized Δ G t
Positive linear relationships between the true realized Δ G t and estimated or expected realized Δ G t were observed for the average R/L, control population, EBV, ERA trial, and phenotype methods in the 15 year ( Figure 3) time period. This observation was consistent in the 28 year time period ( Figure S2), except that estimated realized Δ G t based on the EBV method was not associated with true realized Δ G t . Interestingly, in the 28 year time period the variety trial method with data on checks excluded provided estimates of realized Δ G t which were significantly negatively correlated with the true realized Δ G t . The best method based on association between true realized Δ G t and estimated or expected realized Δ G t was the control population method in the 15 year time period and the expected gain method based on average R/L in the 28 year time period. In the 28 year time period the control population method was second best. Systematic error (bias) in expected or estimated realized Δ G t was detected. For example among the methods that lead to the highest associations between true and expected or estimated realized Δ G t in the 15 year time period, bias was associated with the breeding scenario ( Figure 4). This association was most pronounced for the EBV method and least pronounced for the ERA trial method.

Simulated breeding scenarios
The breeding scenarios simulated in this study were designed to mimic actual medium to low budget breeding programs of rice and other self-pollinated crops. As in real breeding programs, the testing and selection practices imposed in the simulations may not have been ideal for maximizing Δ G t , however all simulated breeding scenarios lead to a linear increase in true breeding values over time. Testing scheme A lead to less variability in genetic trend compared to testing scheme B. This is probably because with testing scheme B there was more disconnectivity in the data leading to confounding of breeding value, year, and location effects, reducing my ability to achieve the expected level of selection error on a consistent basis. The stair-step pattern in the genetic trend observed (Figure 1) especially in the earlier years of the breeding programs was not surprising. This occurred because at the outset of the breeding program generations were mostly discrete and waves of improved material appear corresponding to the completion of a complete breeding cycle. The stair-step pattern diminishes towards the end of the simulations because the generations begin to overlap to a greater extent. This phenomena can also be observed upon inspection of the trends EqCg. Change in EqCg becomes more continuous towards the end of the breeding programs. In real breeding programs, it should be kept in mind that when a breeding program is first initiated, there may be a period of time where genetic improvement will come in waves and not a follow a smooth linear trend. In other words, a measurable improvement in the population is not to be expected every year, especially if a breeding program was recently initiated.

Estimation of medium to long term realized Δ G t
Error of realized Δ G t estimation was considered to be the most important factor when comparing the effectiveness of different methods because error indicates how close the estimates are from the true values of realized Δ G t . Error of different methods varied to a large extent, thus it was useful for discriminating among methods. Precision, based on the standard error of realized Δ G t estimates, was considered an important metric, however evaluating methods based on precision alone would be misleading because methods with high levels of precision often had unacceptable levels of error. Often in assessment of statistical methods in real datasets precision is used as a primary indicator of a method's performance, but based on what was observed, evaluating realized Δ G t estimation methods based on precision alone is not sufficient. A method may provide an estimate of realized Δ G t which is highly precise but highly inaccurate. Efficiency was found to be a useful metric for discriminating between different realized Δ G t estimation methods and it is considered important because significance thresholds are commonly used when determining if realized Δ G t > 0. Lower levels of efficiency indicate that a method is more likely to misdiagnose the presence or absence of non-zero realized Δ G t based on the p-value of the estimate. As with precision, methods with high levels of efficiency sometimes had low levels of error, and so evaluation of methods based on efficiency alone is not recommended. Finally, correlation with the average true breeding values over time is important if one is interested in predicting and visualizing the population means over time. It should be noted that the control population and variety trial methods only predicted the population mean breeding value based on a the linear rate of realized Δ G t estimated, so these method do not provide a visual showing how the population means over time deviate from linearity. Conveniently, correlation was associated with efficiency, so it was not difficult to identify efficient methods that also work well for predicting the actual population means per year.
Considering all the metrics evaluated, the best method for estimation of realized Δ G t in the 15 year time period was the EBV method either including data on long-term checks or excluding data on lines discarded after the first stage of testing. This method had the lowest level of error, the second highest level of precision, 100% efficiency, and the highest correlation between and true and estimated annual average breeding values (0.98). The main limitation to utilizing the EBV method in practice is that it requires pedigree records have been kept since the beginning of the breeding program so that additive relationships between lines can be estimated. The second factor potentially limiting the application of this method is the data structure required. There must be good connectivity between years for the EBV method to work well. Ideally the connectivity should be as good as or better than that of the data simulated in this study. Our results suggest that the use of long-term checks and discarding data on lines tested for only one year can help to increase the connectivity and the performance of the EBV method. Because it can be difficult to judge whether or not a dataset is suitable for estimation of realized Δ G t based on the EBV method. The best way to do so may be to simulate the breeding and testing procedure used in the breeding program of interest assuming a realistic non-genetic trend, and then evaluate the effectiveness of the EBV method using the simulation data. The added benefit of this step would be that the expected rate of Δ G t could also be estimated for comparison with that which was realized. Agreement between expected and realized Δ G t would provide stronger evidence that the estimate of 22 realized Δ G t is accurate.
In the 28 year time period, the control population method was the best method for estimation of realized Δ G t considering all the metrics evaluated. This method had the lowest error, moderately high precision, 100% efficiency, and high correlation between true and predicted annual average breeding values (0.97). The main limitation to utilizing this method in practice is that in order to estimate realized Δ G t over long time periods it requires that the same check(s) have been phenotyped every year, a condition that is almost never met in data from real breeding programs. Furthermore, in real field experiments, checks that are grown for many years become increasingly susceptible to disease which can prevent meaningful phenotyping of other important traits and increase the magnitude of the non-genetic trend.
While pathogen evolution can be a legitimate source of a non-genetic trend, if the objective is to measure realized Δ G t for the trait of interest in the absence of disease, then loss of disease resistance in the checks would lead to an overestimation of realized Δ G t , leading to even higher levels of error than what was observed in this study. Due to these issues, the control population method for the estimation of realized Δ G t over long time periods may not always be suitable, even if the data structure permits its use. The next best method in this time period was the EBV method excluding data on the lines discarded in the first stage of testing.
Performance of this method was similar to that of the control population method except that it was slightly less accurate, but more precise.
Over shorter time periods, the control population method would be a good alternative to the EBV method assuming the same check(s) have been evaluated each year in the time period of interest. In the past 15 year time period of our experiment, the control population method was just as accurate as the EBV method. Precision was moderate to low relative to the other methods, but efficiency and correlation between true and estimated annual mean breeding values was 91% and 0.93 respectively which I consider to be acceptable. The advantage of using this method rather than the EBV method is that it is more flexible in that it can be used with datasets that are disconnected across years and it does not require that pedigree records have been kept from the beginning of the breeding program. As mentioned earlier, disadvantages of this method are that it cannot be used to estimate realized Δ G t for the trait of interest in the absence of disease and it requires that the same check(s) be grown for many years consecutively.
If the breeding program data do not meet the requirements for neither the EBV nor the control population method, then the next best option would be to conduct an ERA trial. The advantage of conducting an ERA trial for estimation of realized Δ G t is that the design of the experiment can be adjusted to achieve desired levels of error, precision, efficiency and correlation. For example, the number of lines used to represent a single year and the number of environments and replicates within an environment can be increased. The main disadvantage of this method is that it requires additional resources for designing and conducting the trial, and it relies on the availability of germplasm representative of the germplasm generated in each year of the breeding program. Most ERA trials that have been reported in the literature utilize very few varieties, less than one per year, to represent breeding germplasm developed over many years, possibly due to lack of seed sources of additional germplasm. In these cases, ERA trials are likely to be inadequate for estimating realized Δ G t . As an example, in this study, the ERA trial method with one line representing each year was just as accurate and less efficient as the simple phenotype method for estimating realized Δ G t in the past 15 year time period. Thus the purpose of the ERA trial, which is to estimate realized Δ G t more accurately compared to regression of average phenotype over years, was defeated entirely. An ERA trial with more lines sampled each year would have been more effective in this case.

Indicators of realized Δ G t
Evaluating indicators of realized Δ G t can be a useful first step prior to embarking on estimation of realized Δ G t as it may indicate that realized Δ G t > 0 is not expected, potentially eliminating unnecessary analyses or experiments attempting to estimate realized Likewise, if average R/L ≯ 0, then realized Δ G t also cannot be greater than zero. This study demonstrated that average R/L does in fact accurately predict the rate of realized Δ G t . The advantage of utilizing average R/L as an indicator of future realized Δ G t is that it would enable routine monitoring of the selection decisions of breeding programs on a yearly basis so that potential problems or mistakes can be detected and corrected immediately rather than many years later when faulty selection decisions are finally reflected in estimates of realized

Comparing breeding programs based on estimates of realized Δ G t
One of the most interesting findings of this study was that positive linear relationships between the true and estimated realized Δ G t across breeding scenarios were not always observed.
In fact, only the estimates of realized Δ G t based on average R/L, control population, and ERA trial methods were at least moderately associated with true realized Δ G The lack of association between estimates of realized Δ G t and true Δ G t was due to systematic error, also known as bias. Bias, the difference between true and estimated Δ G t , was associated with the testing scheme, heritability, and non-genetic trend. This makes it difficult to compare breeding programs effectively based on estimated realized Δ G t because each breeding program will be subject to bias in a different way. For instance breeding program X may have a testing scheme that is associated with overestimation of realized Δ G t while the testing scheme of breeding program Y is associated with underestimation of realized Δ G t . Even if these two breeding programs have the same true rate of realized Δ G t , breeding program X will appear to be performing better than breeding program Y.
Even comparison of estimated realized Δ G t between two different time periods within the same breeding program will be problematic if the breeding strategy differs between the two time periods.
While I do not recommend using the realized Δ G t estimation or prediction methods presented in this paper for comparing breeding programs, itf would be relatively safe to use average R/L, control population, and ERA trial methods to make very general comparisons among breeding programs if the observed differences in estimated realized Δ G t are very large. If breeding programs are compared based on estimated realized Δ G t , it will also be important to express estimated realized Δ G t in units of genetic standard deviations because each breeding population will have a unique level of genetic variance that will have a major influence on the rate of realized Δ G t when expressed in units of the trait of interest.
Expressing Δ G t in units of genetic standard deviations removes this effect. If comparing breeding programs or breeding schemes is a major objective, methods other than those presented in this study should be utilized. The method proposed by Eberhart (1964), which has been used frequently in maize to evaluate realized Δ G t from selection experiments, would be an effective method for this purpose. With this method, a sample of germplasm is taken each year or cycle, archived, and then tested years later in a common set of environments. While this is similar to the ERA trial method, it is better in that the sample of germplasm taken each year is relatively large and taken at random so that it can be considered a representative sample of the breeding population in a given year which should eliminate any association between sampling error and the breeding scheme. Another option would be to simply utilize simulation experiments to compare breeding schemes or programs.
The major advantage of this approach are that the virtual breeding programs can be replicated and completed quickly at a low cost. Replication eliminates random noise that is inevitable due to random genetic drift, and the time and cost saved means that more breeding strategies can be evaluated and improved strategies can be implemented right away, avoiding a potentially major opportunity costs of not implementing an improved strategy once it is identified.

Conclusion
Overall, this study showed that while its relatively easy to obtain estimates of realized Δ G t using data generated from breeding programs or from ERA experiments, these estimates are often very inaccurate. Furthermore, the realized Δ G t estimation methods evaluated in this study should not be used to compare breeding programs or to evaluate how changes made to breeding programs have affected realized Δ G t because error in the estimates of realized Δ G t is associated with the breeding scheme, non-genetic trend, and heritability. If the goal is simply to determine if a positive upward trend exists or to obtain a rough estimate of realized