Genome‐based prediction of Bayesian linear and non‐linear regression models for ordinal data

Linear and non‐linear models used in applications of genomic selection (GS) can fit different types of responses (e.g., continuous, ordinal, binary). In recent years, several genomic‐enabled prediction models have been developed for predicting complex traits in genomic‐assisted animal and plant breeding. These models include linear, non‐linear and non‐parametric models, mostly for continuous responses and less frequently for categorical responses. Several linear and non‐linear models are special cases of a more general family of statistical models known as artificial neural networks, which in some cases provide better prediction ability than other models. In this paper, we propose a Bayesian Regularized Neural Network for Ordinal data (BRNNO). The proposed model was fitted in a Bayesian framework, using a data augmentation algorithm to facilitate computations, and estimated with the Gibbs Maximum a Posteriori (GMAP) and Generalized EM (GEM) algorithms implemented by combining code written in the C and R programming languages. The new model was tested with two real datasets (wheat and maize) evaluated for Septoria and GLS diseases and was compared with the Bayesian Ordered Probit Model (BOPM). Results indicated that the BRNNO model performed better in terms of genomic‐based prediction than the BOPM model.


INTRODUCTION
populations. Practical and simulation studies have favored the use of GS as a way to increase genetic gains in less time (Crossa et al., 2017). For continuous traits, models have been developed to regress phenotypes on all markers using a linear model (Meuwissen, Hayes, & Goddard, 2001). However, in plant and animal breeding, the response variables for many traits are discrete counts (y = 0, 1, 2, …). In classical probability theory, counts can arise from a binomial, multinomial or Poisson distribution. For small counts, data analysts often recommend logarithmic or square-root transformations. In GS it is still common practice to apply linear regression models to categorical or transformed data (Montesinos-López et al., 2015a). Transformations, however, do not account for the fact that many count-data distributions are positively skewed: when many observations take the value 0, the excess of zeros prevents the skewed distribution from being transformed into a normal one, and the regression model may even produce negative predicted values. When a transformation is used, it is not always possible to obtain normally distributed transformed data, and oftentimes transformations are counterproductive (Stroup, 2012).
Ordinal data are very common in many research fields, such as econometrics, finance, agriculture, animal and plant breeding, genetics, etc. In agricultural applications, resistance/susceptibility to diseases is usually measured on an ordinal scale (e.g., 1 = no disease, 2 = low infection, 3 = moderate infection, 4 = high infection, and 5 = totally infected). With data on this scale, it is possible to obtain relative or absolute frequencies, but the usual sample mean or sample standard deviation cannot be obtained, since the distance between classes is not defined (Stevens, 1946). An important application with ordinal data is ordinal regression, where a response variable that is measured on an ordinal scale is predicted by using several covariates. Ordinal regression has been widely used in plant and animal breeding (e.g., Gianola, 1982) and is mainly based on linear mixed models. However, ignoring the ordinal nature of the data can cause several problems related to biased estimates, which could potentially lead to wrong and misleading conclusions (Montesinos-López et al., 2015a). In plant breeding, several economically important traits are categorical, and in general, threshold models have not been considered in GS. One of the first studies that introduced the threshold model to GS incorporating genomic × environment interaction was Montesinos-López et al. (2015a) when 278 maize lines scored for resistance to gray leaf spot (GLS) were measured using an ordered categorical scale in three environments.
In GS, a response variable is predicted based on dense molecular markers. Most of the models used to predict the response variable are linear, and the following assumptions hold: (1) the response variable is normally distributed, (2) the variance of the residuals is constant, and (3) the expected value of the response variable is a linear function of the covariates (Montesinos-López et al., 2015a; Pérez-Rodríguez et al., 2012). Recently, linear models used in GS have been extended to model ordered phenotypic categories (e.g., Montesinos-López et al., 2015a; Montesinos-López, Montesinos-López, Crossa, Burgueño, & Eskridge, 2015b; Wang et al., 2013).
Core Ideas
• Neural networks are universal approximation machines that can be adapted to predict ordinal responses
• Neural networks outperform the predictive power of linear models
• Data augmentation is a valuable tool for fitting neural network models

In order to make predictions on new data, machine learning builds models from sample data using several algorithms (Samuel, 1959). There are some applications of machine learning in genomic selection (e.g., González-Camacho et al., 2018) with a connected network architecture, a type of machine learning algorithm that uses an artificial neural network with input, hidden and output layers linked non-linearly. The layers consist of stages of non-linear data transformations. Non-linear kernel-based models and neural networks have also been used in the context of GS for continuous traits with promising results (e.g., González-Camacho et al., 2018; Pérez-Rodríguez et al., 2012). Recently, Montesinos-López et al. (2019) compared genome-based prediction accuracies for ordinal traits using several deep machine learning methods and the threshold model on an extensive number of ordinal datasets; the authors found that the threshold genomic best linear unbiased prediction was the most consistent model in terms of genomic-enabled prediction accuracy. However, the scientific literature does not show results on many non-linear GS models and methods for categorical data. Based on the previous considerations, and motivated by the fact that (1) artificial neural networks are considered universal approximators, which in some cases exhibit better predictive ability than linear and non-linear models, and that (2) several traits commonly used in GS are measured on an ordinal scale, the main objective of this study was to introduce a Bayesian Regularized Neural Network for Ordinal data (BRNNO) with a Single Hidden Layer (SHL). We used a data augmentation algorithm (Tanner, 1993) to make part of the computations feasible. We assumed that the process that leads to the observed response variable is driven by a latent (unobserved) continuous random variable that follows a standard normal distribution (e.g., Albert & Chib, 1993; Gianola, 1982; Montesinos-López et al., 2015a). We combined Markov Chain Monte Carlo (MCMC) simulation and EM with other optimization algorithms that allowed us to fit the proposed model. The BRNNO model can also be considered a generalization of the probit ordered regression in the context of non-linear models and can also be extended to generalize the logit ordered regression model (Montesinos-López et al., 2015b).
This paper is organized as follows: In the Materials and Methods section we introduce the Bayesian ordered probit model (BOPM) and the Bayesian ordered logit model (BOLM) for ordinal data; next, we introduce the Bayesian Regularized Neural Network for Ordinal data (BRNNO) and describe the procedures used to assess the predictive power of the proposed model. Then, we present an application of the BRNNO using two real plant disease datasets from CIMMYT maize and wheat breeding trials (https://www.cimmyt.org). Finally, we present the results and discussion.

Bayesian ordered probit model (BOPM)
The BOPM assumes that the observed ordinal response y_i arises from a continuous latent variable (liability) l_i = x_i'β + e_i, where e_i is a random normal variable with mean 0 and variance 1. Using this representation, it can be shown that P(y_i = j) = Φ(λ_j − x_i'β) − Φ(λ_{j-1} − x_i'β), with Φ(⋅) the standard normal cumulative distribution function and λ_0 = −∞ < λ_1 < ⋯ < λ_{C-1} < λ_C = ∞ a set of thresholds. Albert and Chib (1993) assigned a diffuse prior for (β, λ) and showed that all conditional distributions necessary to implement the Gibbs sampler algorithm (Geman & Geman, 1984) have closed form and can be used to draw samples from the posterior distribution p(β, λ | data). Montesinos-López et al. (2015b) argue that the ordinal logistic regression model is often preferred over the ordinal probit model because it provides regression coefficients that can be more easily interpreted, as they are connected with the odds ratio. Montesinos-López et al. (2015b) proposed using a latent variable (liability) with logistic distribution, which can be represented as follows:
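As an illustration, the ordered probit class probabilities above can be computed directly from the thresholds and the linear predictor. The sketch below is ours (function names and example values are illustrative, not part of the brnn package; the authors' implementation is in R and C):

```python
import math

def phi(z):
    """Standard normal CDF computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ordered_probit_probs(eta, thresholds):
    """P(y = j) = Phi(lambda_j - eta) - Phi(lambda_{j-1} - eta).

    eta        -- linear predictor x'beta for one observation
    thresholds -- interior thresholds lambda_1 < ... < lambda_{C-1}
    """
    lam = [-math.inf] + list(thresholds) + [math.inf]
    return [phi(lam[j] - eta) - phi(lam[j - 1] - eta) for j in range(1, len(lam))]

# Example: three ordinal classes; lambda_1 = 0 reflects the identification
# constraint used later in the paper, lambda_2 = 1 is illustrative
p = ordered_probit_probs(eta=0.5, thresholds=[0.0, 1.0])
```

Because the class probabilities are differences of the CDF evaluated at consecutive thresholds, they are non-negative and sum to one by construction.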

Bayesian ordered logit model (BOLM)
The BOLM assumes the latent variable l_i = x_i'β + ẽ_i, where ẽ_i is a standard logistic random variable; therefore, the probabilities P(y_i = j) are computed from the distribution function of a logistic random variable. The authors proposed using Pólya-Gamma data augmentation, which facilitates the computations when sampling from the posterior distribution of the parameters of interest with the Gibbs sampler (Geman & Geman, 1984).

The proposed non-linear model: Bayesian Regularized Neural Networks for Ordinal data (BRNNO)
Following Albert and Chib (1993), Gianola (1982), González-Camacho et al. (2012) and Pérez-Rodríguez, Gianola, Weigel, Rosa, and Crossa (2013), in this study we propose using the latent variable approach to model ordinal data with a Single Hidden Layer Neural Network (SHLNN), as shown in Figure 1. The structure of the neural network is analogous to that shown in Pérez-Rodríguez et al. (2013): the input layer receives the input variables x_i = (x_{i1}, …, x_{ip}), which are combined linearly in each neuron k = 1, …, S and transformed by applying a non-linear activation function g(⋅) in the hidden layer, and the results are then combined linearly in the output layer to regress the latent variable, that is:

l_i = Σ_{k=1}^{S} w_k g(b_k + x_i' β^[k]) + e_i, (5)

where w_k, b_k and β^[k] denote the weight, bias and vector of connection strengths of neuron k, and e_i is a standard normal random error. The activation function maps the inputs from the real line into the bounded open interval (−1, 1); for example, the hyperbolic tangent function is given by g(x) = tanh(x) = 2 / (1 + exp(−2x)) − 1. There are several options for activation functions, but tanh(⋅) is a popular alternative implemented in most software packages that fit neural networks. The output from equation (5) is related to the observed data using the approach employed in the probit and logit ordered models, that is, y_i = j if λ_{j-1} < l_i < λ_j, with λ_0 = −∞, λ_C = ∞ and λ_1, …, λ_{C-1} the set of unknown thresholds. By using this approach, the probabilities given in equations (2) and (3) can be computed as follows:

P(y_i = j) = Φ(λ_j − Σ_{k=1}^{S} w_k g(b_k + x_i' β^[k])) − Φ(λ_{j-1} − Σ_{k=1}^{S} w_k g(b_k + x_i' β^[k])). (6)

FIGURE 1 Structure of a single-layer feedforward neural network (SLNN) adapted from González-Camacho et al. (2012) and Pérez-Rodríguez et al. (2013) for ordinal data using the latent variable approach

Note that the ordered probit model (Albert & Chib, 1993) is a special case of the single-layer perceptron, which in turn is a special case of the proposed Single Hidden Layer Neural Network; it is obtained by setting the number of neurons equal to one (S = 1), w_1 = 1, g(x) = x (identity function), b_1 = 0 and β = β^[1].
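The forward pass of equation (5), combined with the threshold rule, can be sketched as follows. The toy weights, biases, connection strengths and thresholds below are arbitrary illustrative values, not estimates from the paper, and the helper names are ours:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def latent(x, weights, biases, betas):
    """Latent predictor of eq. (5): sum_k w_k * tanh(b_k + x' beta^[k])."""
    return sum(w * math.tanh(b + sum(xi * bi for xi, bi in zip(x, beta)))
               for w, b, beta in zip(weights, biases, betas))

def class_probs(x, weights, biases, betas, thresholds):
    """P(y = j) = Phi(lambda_j - l) - Phi(lambda_{j-1} - l)."""
    l = latent(x, weights, biases, betas)
    lam = [-math.inf] + list(thresholds) + [math.inf]
    return [phi(lam[j] - l) - phi(lam[j - 1] - l) for j in range(1, len(lam))]

# Toy network: p = 3 inputs, S = 2 neurons, 4 ordinal classes
x = [0.2, -1.0, 0.5]
w = [0.8, -0.3]                           # output-layer weights
b = [0.1, -0.2]                           # hidden-layer biases
B = [[0.5, 0.1, -0.4], [0.2, 0.3, 0.0]]   # connection strengths beta^[k]
probs = class_probs(x, w, b, B, thresholds=[0.0, 0.7, 1.5])
```

Setting `w = [1.0]`, `b = [0.0]` and replacing `tanh` with the identity recovers the ordered probit special case described above.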
A Single Hidden Layer Neural Network that generalizes the ordered logit model proposed by Montesinos-López et al. (2015b) can be obtained by replacing the random error term in equation (5) with a random error term with standard logistic distribution.
The BRNNO model can be fitted using the Generalized Expectation-Maximization (GEM) algorithm (Kärkkäinen & Sillanpää, 2013; Neal & Hinton, 1998) or by using Bayesian methods. Let θ = (w_1, …, w_S, b_1, …, b_S, β^[1]′, …, β^[S]′)′ be the vector of weights, biases and connection strengths, and let p(θ | σ²) be the prior distribution assigned to the elements of this vector. Here we assume that p(θ | σ²) = N(μ, Σ), where N(μ, Σ) stands for the multivariate normal distribution with mean μ and variance-covariance matrix Σ; in this case, μ = 0 and Σ = Iσ², with σ² a variance parameter common to all the elements in θ, assumed known for the moment. In the data augmentation framework for ordinal data (e.g., Albert & Chib, 1993), the joint posterior distribution of {θ, λ_1, …, λ_{C-1}, l_1, …, l_n} given the observed data, after assigning a diffuse prior to the thresholds, can be obtained as follows:

p(θ, λ, l | y) ∝ [Π_{i=1}^{n} ϕ(l_i; Σ_{k=1}^{S} w_k g(b_k + x_i' β^[k]), 1) 1(λ_{y_i − 1} < l_i < λ_{y_i})] p(θ | σ²), (8)

where ϕ(⋅; μ, σ²) is the density function of a normal random variable with mean μ and variance σ², and 1(x ∈ A) is the indicator function that takes a value of 1 if x is contained in the set A, and a value of 0 otherwise. Note that the joint posterior distribution of the parameters of interest does not have a closed form, so we must use numerical algorithms and simulation techniques to fit this model. In order to ensure the identification of the parameters, λ_1 is set to 0 (Albert & Chib, 1993).
Next, we describe two numerical algorithms that can be employed to fit the BRNNO model: GMAP and GEM (Generalized EM algorithm). The GMAP algorithm (Kim, Hall, & Li, 2009) combines the Gibbs sampler algorithm (G = Gibbs; see Geman & Geman, 1984) and an optimization algorithm to obtain the posterior modes of some parameters of interest in the model (MAP = Maximum A Posteriori). Pursuant to Kim et al. (2009), it is necessary to divide the parameters into two parts, labeled 'tractable' and 'intractable'. Tractable parameters are those whose full conditional distributions have a closed form, whereas the remaining parameters are 'intractable'. Thus, 'tractable' parameters are sampled using the Gibbs sampler and 'intractable' parameters are estimated by MAP; the G and MAP steps are repeated until convergence. For the problem of interest, the tractable parameters are the latent variables and the thresholds {l, λ}, while the intractable parameters are the weights, biases, connection strengths and the variance parameter σ², that is, {θ, σ²}.
Next, we derive the full conditional distributions for tractable parameters and briefly explain how to obtain the posterior modes of intractable parameters. Appendix 1 shows computational details for implementing the GMAP algorithm.
The full conditional distribution of the latent variables can be obtained from (8); it has a closed form:

l_i | else ∼ N(Σ_{k=1}^{S} w_k g(b_k + x_i' β^[k]), 1) 1(λ_{y_i − 1} < l_i < λ_{y_i}), (9)

which corresponds to a truncated normal random variable truncated at the left (right) by λ_{y_i − 1} (λ_{y_i}), with location parameter Σ_{k=1}^{S} w_k g(b_k + x_i' β^[k]) and scale parameter 1. The full conditional distribution of threshold λ_j (j = 2, …, C − 1) is uniform on the interval (max{l_i : y_i = j}, min{l_i : y_i = j + 1}) (Albert & Chib, 1993). (10)
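The Gibbs update of each latent variable therefore amounts to drawing from a unit-variance truncated normal. A minimal inverse-CDF sketch in Python (the authors' implementation is in C and R; the helper name and example values are ours) is:

```python
import random
from statistics import NormalDist

def sample_truncated_normal(mu, lower, upper, rng=random):
    """Draw l ~ N(mu, 1) truncated to (lower, upper) via inverse-CDF sampling."""
    nd = NormalDist(mu, 1.0)
    a = nd.cdf(lower) if lower > float("-inf") else 0.0
    b = nd.cdf(upper) if upper < float("inf") else 1.0
    u = a + (b - a) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)   # keep inv_cdf well defined
    return nd.inv_cdf(u)

# One data-augmentation step for an observation with y_i = 2 and
# illustrative thresholds lambda_1 = 0.0, lambda_2 = 0.7
rng = random.Random(1)
draw = sample_truncated_normal(mu=0.3, lower=0.0, upper=0.7, rng=rng)
```

The inverse-CDF approach is simple and adequate here because the scale is fixed at 1; more specialized truncated-normal samplers exist for extreme truncation regions.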

a. Conditional posterior modes of 'intractable' parameters
The set of intractable parameters is {θ, σ²}. From equation (5), and by taking into account the prior distribution assigned to θ, it is clear that the problem reduces to fitting a Bayesian Regularized Neural Network with a Single Hidden Layer; the details of the algorithm can be found elsewhere, for example in Foresee and Hagan (1997) and Pérez-Rodríguez et al. (2013), among others. The algorithms used to fit the Bayesian Regularized Neural Network need to be slightly modified to take into account that the residual variance for e_i in equation (5) is set to 1. The Appendix shows the computational details for estimating the posterior modes of these parameters. One interesting output from this algorithm is the effective number of parameters η, which can give some guidance about the number of neurons to include in the neural network (see, for example, Pérez-Rodríguez et al., 2013). This parameter is computed as follows (see Appendix 1 for more details):

η = m − 2α tr(Ĥ⁻¹),

where m is the number of elements in θ, α = 1/(2σ²) and Ĥ is the Hessian of the penalized objective function evaluated at the posterior mode. When this parameter shows little change for S and S + 1 neurons, the more parsimonious model is preferred.
The GEM algorithm (Kärkkäinen & Sillanpää, 2013; Neal & Hinton, 1998) is similar to the GMAP algorithm. The 'tractable' parameters (latent variables and thresholds) are updated according to the conditional expectations of the distributions given in (9) and (10), and the 'intractable' parameters are estimated using MAP; the updating of 'tractable' and 'intractable' parameters is repeated until convergence. As the number of parameters to estimate increases, the GMAP algorithm becomes slow; therefore, our preferred algorithm for fitting the proposed model is GEM. Appendix 2 shows computational details for implementing the GEM algorithm.
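In the GEM algorithm, the Gibbs draw of each latent variable is replaced by its conditional expectation. For a N(μ, 1) variable truncated to (lower, upper), this expectation has the standard closed form shown below; the helper name is ours, not the package API:

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal

def truncated_normal_mean(mu, lower, upper):
    """E-step update: E[l] for l ~ N(mu, 1) truncated to (lower, upper), i.e.
    mu + (phi(a) - phi(b)) / (Phi(b) - Phi(a)), with a = lower - mu, b = upper - mu,
    where phi and Phi are the standard normal pdf and CDF."""
    a, b = lower - mu, upper - mu
    num = _STD.pdf(a) - _STD.pdf(b)
    den = _STD.cdf(b) - _STD.cdf(a)
    return mu + num / den

# E-step imputation for an observation whose latent variable lies above
# the (illustrative) threshold lambda_1 = 0 with current predictor mu = 0
l_hat = truncated_normal_mean(0.0, 0.0, math.inf)
```

Each GEM iteration would impute every l_i this way (E-step) and then refit the regularized network to the imputed latent variables (M-step), repeating until convergence.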

The software
The ordered probit model described above can be fitted using the BGLR library functions in R (Pérez & de los Campos, 2014). The ordered logit model proposed by Montesinos-López et al. (2015b) can also be fitted in R (R Core Team, 2019); the authors provided the code in the supplementary materials included in the published paper. The GEM algorithm for fitting the ordinal Bayesian Regularized Neural Network proposed in this study was implemented using the C programming language (Kernighan & Ritchie, 1988) and the R statistical package (R Core Team, 2019) to speed up computations and facilitate user interaction with the software. We included the resulting routines for fitting the proposed model in the brnn package version 0.8 with the GEM algorithm (Pérez-Rodríguez & Gianola, 2020), while the code for fitting the model with the GMAP algorithm is included in supplementary materials. We decided not to include it in the brnn package, as it is very slow.

Measurements of genome-based prediction accuracy of models for ordinal data
In order to evaluate the proposed model's predictive ability, we partitioned the data at random into training and testing sets. The basic idea was to fit the model with the training data and then predict the phenotypes (observed ordinal responses) of the individuals in the testing set. For a continuous response, a model's predictive ability is usually measured by Pearson's correlation coefficient between observed and predicted phenotypic values or by the mean squared error of prediction (MSEP).
However, in the case of categorical data, several metrics are widely used, such as sensitivity, specificity, the ROC curve, etc. (Tharwat, 2018). Montesinos-López et al. (2015b) suggested using the Brier score (BS; Brier, 1950), arguing that this statistic uses all the information contained in the predictive distribution. The Brier score can be computed as follows:

BS = (1 / (2n)) Σ_{i=1}^{n} Σ_{j=1}^{C} (π̂_{ij} − d_{ij})²,

where n is the total number of observations in the set of interest, π̂_{ij} is the estimated probability that observation i = 1, …, n belongs to class j = 1, …, C, which can be computed from the fitted model using equations (6)-(7), and d_{ij} = 1 if observation i belongs to class j and d_{ij} = 0 otherwise. As defined above, BS takes values between 0 and 1, and small values are associated with better prediction ability.
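A minimal sketch of the Brier score computation follows; the 1/(2n) scaling is an assumption chosen so that BS lies in [0, 1] as the text states (some authors omit the factor 2), and the function name is ours:

```python
def brier_score(probs, labels, n_classes):
    """BS = (1 / (2n)) * sum_i sum_j (pi_hat_ij - d_ij)^2, scaled to [0, 1].

    probs     -- per-observation probability vectors of length n_classes
    labels    -- observed classes coded 1, ..., n_classes
    n_classes -- number of ordinal classes C
    """
    n = len(labels)
    total = 0.0
    for p, y in zip(probs, labels):
        for j in range(1, n_classes + 1):
            d = 1.0 if y == j else 0.0      # d_ij indicator
            total += (p[j - 1] - d) ** 2
    return total / (2.0 * n)
```

A perfect probabilistic prediction gives BS = 0, while assigning all mass to the wrong class gives the maximum of 1.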
Other genome-based prediction performance measures used with ordinal data include the misclassification error rate (MER), which is obtained by counting the number of classification errors and dividing the result by the number of test cases. Mathieson (1995) argued that for ordinal responses, performance measures that take into account the difference between class numbers are preferred over MER in the absence of better information. Two performance measures that satisfy this criterion are the Mean Absolute Error (MAE), computed as (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|, and the Root Mean Square Error (RMSE), obtained as √((1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²) (see da Costa & Cardoso, 2005, for more details). Pearson's correlation coefficient, widely used with continuous data, can be replaced by Spearman's rank correlation coefficient or Kendall's tau coefficient; here, we used Spearman's coefficient. It is important to note that the predicted class ŷ_i in the testing set is obtained by first predicting the value of the latent variable l̂_i using the estimated parameters and associated covariates, and then setting ŷ_i = j if λ̂_{j-1} < l̂_i < λ̂_j, where λ̂ is the vector of estimated threshold parameters.
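The threshold rule for turning predicted latent values into predicted classes, together with MAE and RMSE, can be sketched as follows (helper names are ours; ties at a threshold are assigned to the lower class, an arbitrary convention since the latent variable is continuous):

```python
import math

def classify(latents, thresholds):
    """y_hat = j if lambda_{j-1} < l_hat <= lambda_j, open-ended at both tails."""
    lam = [-math.inf] + list(thresholds) + [math.inf]
    out = []
    for l in latents:
        for j in range(1, len(lam)):
            if lam[j - 1] < l <= lam[j]:
                out.append(j)
                break
    return out

def mae(y, y_hat):
    """Mean Absolute Error between observed and predicted class codes."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Root Mean Square Error between observed and predicted class codes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

# Illustrative thresholds (lambda_1 = 0 by the identification constraint)
y_hat = classify([-0.5, 0.3, 2.0], thresholds=[0.0, 1.5])
```

Unlike MER, both MAE and RMSE penalize a prediction two classes away more than a prediction one class away, which is the property Mathieson (1995) advocates for ordinal responses.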

The Septoria dataset
Septoria is a fungus that causes leaf spot diseases in field crops, forage crops and vegetables. The Septoria dataset includes information for 268 wheat lines from CIMMYT (http://www.cimmyt.org) and was previously analyzed by Montesinos-López et al. (2015b). The lines were planted in 2010 in Toluca, Mexico. The dataset includes information about disease severity, which was measured on an ordinal scale with four points. The lines were genotyped using Genotyping by Sequencing (GBS; Poland et al., 2012). Markers with more than 50% missing values were removed, missing genotypes were imputed using observed allelic frequencies, and markers with minor allele frequency smaller than 0.05 were removed. After that, 6,787 markers remained available for further analysis. The dataset can be downloaded from http://hdl.handle.net/11529/10254.

The GLS dataset
Gray Leaf Spot (GLS) is a disease caused by the fungus Cercospora zeae-maydis. This dataset consists of genotypic and phenotypic information for 278 maize lines from the Drought Tolerance Maize (DTMA) project of CIMMYT's Global Maize Program. The dataset was originally analyzed by Crossa et al. (2011), and re-analyzed later by González-Camacho et al. (2012), Montesinos-López et al. (2015b) and Pérez-Rodríguez et al. (2018) using different statistical models. The dataset includes information on disease severity measured on an ordinal scale with 5 points: 1 = no disease, 2 = low infection, 3 = moderate infection, 4 = high infection and 5 = totally infected. The lines were initially genotyped with 1,152 SNPs and re-genotyped later with 55k SNPs using the Illumina platform. After removing SNPs with more than 10% missing values, imputing missing genotypes, and filtering out markers with minor allele frequency smaller than 0.05, a total of 46,347 markers remained available for further analysis. The dataset can be downloaded from http://hdl.handle.net/11529/10254.
We evaluated the predictive ability of the proposed model using 100 partitions generated at random with 90% of observations in the training set and the remaining 10% in the testing set. For each partition, we fitted the Bayesian Regularized Neural Network (BRNN) for ordinal data with two neurons. In order to expedite computations, we first computed a genomic relationship matrix G between centered and standardized genotypes (Lopez-Cruz et al., 2015); we then performed the eigenvalue decomposition of G, that is, G = ΓΛΓ′, and used ΓΛ^{1/2} (principal components) as our matrix of covariates to fit the models (see Gianola et al., 2011, for other computing strategies). The model was fitted using the GMAP and GEM algorithms. In the case of GMAP, due to high computational times, inferences were based on 5,000 MCMC iterations obtained after discarding 5,000 samples that were taken as burn-in.

TABLE 1 Performance measures (BS, MAE, RMSE, r and MER) for the Bayesian Ordered Probit Model (BOPM) and Bayesian Regularized Neural Network for Ordinal data (BRNNO) for Septoria and GLS at three locations (Colombia, Harare and Mexico). Standard deviation (sd). GMAP denotes the Gibbs Maximum a Posteriori algorithm and GEM the Generalized EM algorithm
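The dimension-reduction strategy above can be sketched with a toy marker matrix; this is an illustration with simulated genotypes, not the CIMMYT data:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.integers(0, 3, size=(20, 100)).astype(float)  # toy marker matrix (lines x markers)
M = M[:, M.std(axis=0) > 0]                           # drop monomorphic markers

# Center and standardize marker columns, then G = X X' / p
X = (M - M.mean(axis=0)) / M.std(axis=0)
G = X @ X.T / X.shape[1]

# Eigen-decomposition G = Gamma Lambda Gamma'; Gamma Lambda^{1/2} gives the
# principal-component covariates, so their cross-product recovers G
vals, vecs = np.linalg.eigh(G)
vals = np.clip(vals, 0.0, None)        # guard tiny negative eigenvalues
PC = vecs @ np.diag(np.sqrt(vals))
```

This replaces the p-column marker matrix with an n-column covariate matrix (n = number of lines), which is what makes fitting the network to tens of thousands of markers computationally feasible.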
In the case of ordered probit regression, we used the same computational strategy as before to speed up computations. We fitted the model using the BGLR package in R (Pérez & de los Campos, 2014), and inferences were based on 5,000 MCMC iterations obtained after discarding 25,000 samples that were taken as burn-in. For each partition, we computed several statistics that allowed us to assess the performance of the models: Brier score, MAE, RMSE, Spearman's correlation (r) and MER. Table 1 shows the results for Septoria and GLS in Colombia, Harare and Mexico, with the five criteria used to determine the best models; for all of them except the correlation, lower values are better, whereas for the correlation, higher values are better. Results show that the BRNNO model performed better than the probit model (BOPM), except in the case of the BS criterion for the Septoria disease (BS = 0.3205). In most of the remaining cases, the performance measured by the MAE, RMSE and MER criteria was smaller for BRNNO than for the ordered probit model (BOPM). In addition, the associated Spearman's rank correlation coefficient was higher for BRNNO than for the probit model (BOPM). The best prediction of the BRNNO algorithm occurred for GLS measured in Colombia using the GEM algorithm, with values of 0.5745, 0.8683, 0.7295 and 0.4942 for the MAE, RMSE, r and MER criteria, respectively.

FIGURE 3 Scatter plot matrix of MER for BRNNO and BOPM. Points above the 45° line are associated with the worst performance of the model on the y-axis (BOPM) in comparison with the model presented on the x-axis (BRNNO). When the worst model is BOPM, it is represented by a filled circle, and when the worst model is BRNNO, it is represented by an open circle. The plots also show the number of times that BOPM is inferior to BRNNO as a percentage
Furthermore, the second-best model was usually also BRNNO (fitted with either of the two algorithms, GEM or GMAP). Table 2 shows the number of parameters to be estimated for the neural network and the estimated effective number of parameters. We fitted the models with S = 3 neurons and concluded that it was not necessary to include more neurons. Figures 2 and 3 show the values of the MAE and MER metrics for Septoria and GLS in Colombia, Harare and Mexico, respectively, for each of the 100 partitions generated at random and obtained with the GEM algorithm; results show similar patterns for the GMAP algorithm. These figures are a graphical representation of the summary shown in Table 1. The 45° line makes it possible to quickly compare the performance of the two models, from which it is clear that MAE and MER are worse (higher) for the probit model than for the proposed BRNNO model.

RESULTS
For the MAE criterion, BRNNO was superior to BOPM for GLS because the latter exhibited higher MAE 89% of the time in Mexico, 85% in Colombia and 69% in Harare. The MAE for Septoria was slightly better for BOPM, since BRNNO was worse 52% of the time. Similar patterns were observed for MER: BOPM was worse than BRNNO for GLS, with higher MER 87% of the time in Mexico, 84% in Colombia and 73% in Harare. Finally, in the case of Septoria, BOPM performed slightly better than the proposed model, since it had smaller MER values 56% of the time.

DISCUSSION
Results suggest that the proposed BRNNO model performs better than the BOPM model with both algorithms, GMAP and GEM; however, GEM is substantially faster than GMAP. For example, for the GLS dataset, fitting the proposed model with the GEM algorithm took ∼8 minutes on a computer with an Intel quad-core i7 processor at 2.8 GHz, whereas 10,000 MCMC iterations with GMAP on the same computer took ∼3 hours. Thus, it is clear that computing times are reduced substantially by using the GEM algorithm.
As already mentioned, the single hidden layer neural network (BRNNO) proposed in this study generalizes the Bayesian ordered probit model (BOPM), as well as the Bayesian ordered logit model (BOLM) of Montesinos-López et al. (2015b). It should be noted that the Brier score results from Montesinos-López et al. (2015b) are not directly comparable with those obtained for the proposed BRNNO method in this study. However, the pooled BS of the real GLS data in Table 4 of Montesinos-López et al. (2015b) was slightly higher, around 0.37 for BOLM and BOPM, than the BS estimated for the BRNNO, which ranged from 0.33 to 0.35 for the three locations, Colombia, Mexico and Harare (Table 1). The BS values for the Bayesian ordered probit model (BOPM) of this study for the three locations, Colombia, Harare, and Mexico, were 0.3550, 0.3458, and 0.3359, respectively. Few studies have reported on the genomic-enabled prediction of ordinal traits. Recently, Montesinos-López et al. (2019) proposed a deep learning method for the simultaneous prediction of mixed phenotypes (binary, ordinal and continuous) in plant breeding. The proposed neural network BRNNO and the open-source R package brnn referred to in this study for the genomic-enabled prediction of ordinal data are important because there is a lack of genomic-enabled studies predicting categorical data in plant breeding that simultaneously offer ready-to-use R software.

CONCLUSIONS
We introduced a neural network that generalizes existing models for the prediction of ordinal responses. We explored two algorithms (GEM and GMAP) that can be used to fit the proposed model. The GMAP algorithm is able to fit the model, but it is very slow because at each iteration it is necessary to fit a Bayesian Regularized Neural Network using the latent variable as a response; additionally, several thousand MCMC iterations are needed to make inferences. In contrast, although the GEM algorithm is also slow in the empirical evaluation, it requires far fewer iterations to converge; these results are in agreement with Kärkkäinen and Sillanpää (2013). Thus, we conclude there is ample room for improvement in both algorithms.
Regarding the genomic-enabled predictive accuracy of the models, most of the selected indices (MAE, RMSE, r and MER), with the exception of the Brier score, favored the proposed BRNNO model, fitted either with the GMAP or the GEM algorithm. Improvements are limited but consistent with the findings of other studies (Montesinos-López et al., 2015b). We should point out that this approach could be applied not only in the GS context, but also in conventional phenotypic plant breeding for disease resistance and many other ordinal traits.