Non-Homology-Based Prediction of Gene Functions

Advances in genome sequencing and annotation have eased the diﬃculty of identifying new gene sequences. Predicting the functions of these newly identiﬁed genes remains challenging. Genes descended from a common ancestral sequence are likely to have common functions. As a result homology is widely used for gene function prediction. This means functional annotation errors also propagate from one species to another. Several approaches based on machine learning classiﬁcation algorithms were evaluated for their ability to accurately predict gene function from non-homology gene features. Among the eight supervised classiﬁcation algorithms evaluated, random forest-based prediction consistently provided the most accurate gene function prediction. Non-homology-based functional annotation provides complementary strengths to homology-based annotation, with higher average performance in Biological Process GO terms, the domain where homology-based functional annotation performs the worst, and weaker performance in Molecular Function GO terms, the domain where the accuracy of homology-based functional annotation is highest. Non-homology-based functional annotation based on machine learning may ultimately prove useful both as a method to assign predicted functions to orphan genes which lack functionally characterized homologs, and to identify and correct functional annotation errors which were propagated through homology-based functional annotations.


Introduction
The rapid acceleration in genome sequencing is providing complete sequences for dozens of new species each year (Michael and Jackson, 2013;Chen et al., 2018). Advances in both de novo and extrinsic evidence based gene structure annotation, combined with low cost and abundant RNA sequence datasets, aid the identification and definition gene models across each new genome assembly (Campbell et al., 2014;Del Angel et al., 2018;Cook et al., 2019;Monnahan et al., 2019). However, while the accuracy and throughput of methods to define the structure of genes have grown rapidly, methods to experimentally determine the function of individual genes have not. Existing annotations are taken from a small set of proteins with direct experimental evidence and then these annotations are extrapolated to not only homologous genes in the same genome but homologous genes in the genomes of other species (Valencia, 2005). Among eukaryotes, fission yeast Schizosaccharomyces pombe has perhaps the most comprehensive set of functional gene annotations (Aslett and Wood, 2006). There are currently 41,912 gene associations for 5,397 gene products available on Sz. pombe GeneDB (Lock et al., 2018). Of these, 16,657 functional annotations for 2,302 genes (42.6% of 5,397 annotated genes) are directly derived from experiments. Of those, a subset of 4,761 functional annotations for 1,459 genes (27.0% of all annotated gene models in Sz. pombe) are supported by mutant phenotype analysis. Among flowering plants, the model species Arabidopsis thaliana has been the subject of intensive and comprehensive genetic investigation. However, of the 28,775 annotated gene models in the TAIR10 A. thaliana reference genome, only 14.2% have functional annotations supported by mutant phenotypes, 25.8% have functional annotations supported by other types of experimental evidence (e.g. biochemical assays of enzymatic function, protein-protein interactions, expression data, etc), 51.1% are functionally annotated based on solely protein features, sequence similarity, or other homology-based forms of evidence, and 8.9% of gene models lack any functional annotation (Lamesch et al., 2011).
Two decades ago, a significant challenge of homology-based functional annotation, as generally employed, was recognized: functional annotations are propagated from one sequence to the next without data on provenance. Thus it is often impossible or impractical to track a computationally assigned functional annotation back to the original source of experimental evidence. This presents a challenge, as mistaken findings related to protein functions will be published from time to time (Iyer et al., 2001), and once an experimentally derived functional annotation is assigned to homologs in other species, there is no way to "recall" this annotation. In fact, the annotation is likely to continue to propagate to new genome assemblies and to reannotations of existing assemblies (Brenner, 1999;Valencia, 2005;Gilks et al., 2002Gilks et al., , 2005. Curated annotations made based on non-sequence similarity evidence have been estimated to have an error rate of 13%-18%, while curated annotations made based on sequence similarity evidence had an estimated error rate of 49% (Jones et al., 2007). In short, "functional annotations are propagated repeatedly from one sequence to the next, to the next, with no record made of the source of a given annotation, leading to a potential transitive catastrophe of erroneous annotations" (Karp, 1998). Homology-based functional annotation also rests on the basic assumption that sequence similarity and functional similarity is highly correlated, which is an assumption that is not always correct as demonstrated by many cases of sub-and neo-functionalization between homologous genes (Clark and Radivojac, 2011;Brown et al., 2006;Radivojac et al., 2013).
In addition to concerns with annotation accuracy, many species also contain a signficant number of genes where homology-based annotation is not possible. The genomes of arabidopsis and rice, respectively, are reported to contain 1,430 and 1,926 orphan genes which lack known homologs in other species (Guo et al., 2007;Guo, 2013). By definition, homology-based methods are only able to make predictions when the function of at least one related sequence -whether detected through direct nucleotide or protein sequence similarity (Conesa et al., 2005), or more sensitive methods such as the presence of a shared protein domain or protein domain architecture (Thomas et al., 2003;Quevillon et al., 2005;Hulo et al., 2006;Finn et al., 2015) -has been experimentally characterized. As the result, genes belonging to orphan gene families and/or carrying only domains of unknown functions are likely to lack predicted or potential functions. This in turn contributes to the noted pattern of clustering of research efforts on more detailed characterization of genes with existing well characterized functions (Stoeger et al., 2018).
However, there exists a parallel set of nonhomology based approaches to predict the function of uncharacterized genes Marcotte, 2000;Gabaldón and Huynen, 2004). Chromosomal context has been widely employed for functional prediction in prokaryotes where operons of genes involved in a single metabolic pathway or biological process are common (Edwards et al., 2005;Enault et al., 2005). High rates of gene loss and horizontal gene transfer in prokaryotes can also be employed to assign predicted functions to genes with either similar or complementary phylogenetic distributions (Gaasterland and Ragan, 1998;Pellegrini et al., 1999;Morett et al., 2003). In eurkaryotes such as maize and arabidopsis, mRNA co-expression analysis has been show to improve the prioritization of GWAS hits (Chan et al., 2011;Angelovici et al., 2017;Schaefer et al., 2018;Zheng et al., 2019). Protein co-expression networks are also beginning to become more widely available and appear to capture different information content from mRNA co-expression networks (Walley et al., 2016). Nonhomology based methods have been used to systematically develop functional predictions in prokaryotes and have been to prioritize individual sets of candidate genes in plants (Chan et al., 2011;Angelovici et al., 2017;Schaefer et al., 2018;Zheng et al., 2019). However, genome-wide functional annotation in plants still relies primarily on homology-based methods.
This initial analysis of the potential for non-homology-based genome-wide functional annotation in eukaryotes focuses on maize (Zea mays), a widely studied genetic model and economically vital crop species. As the result, maize has an extensive collection of functional genomic datasets, including large RNA and protein expression atlases (Walley et al., 2016;Stelpflug et al., 2016), methylation and histone modification profiling datasets (Dong et al., 2017), one of the largest collection of characterized and cloned loss of function mutants of any plant species (Schnable and Freeling, 2011;Oellrich et al., 2015). However, maize also exhibits substantial genomic structural complexity with thousands of genes which are present in the genomes and genome assemblies of some inbreds but not others (Springer et al., 2009;Swanson-Wagner et al., 2010;Sun et al., 2018;Springer et al., 2018). The majority of these genes are evolutionarily young, with the evidence that the majority of them are conserved between maize and its wild progenitor teosinte, but only a small minority were found to be conserved at syntenic locations the genome of Sorghum bicolor a related species which diverged from the lineage leading to maize approximately 12 million years ago (Swigoňová et al., 2004;Sun et al., 2018;Springer et al., 2018). The need for non-homology-based functional annotations is pressing in maize, particularly as there is evidence these new and variably-present genes may be involved in hybrid vigor Baldauf et al., 2016Baldauf et al., , 2018. In this project, we evaluated the potential for using supervised machine learning-based classification algorithms to predict the function of annotated maize genes using purely non-homolog-based predictive variables.

Results
We assembled a set of descriptors for each gene, including gene structure, population genetics, expression, histone modification and DNA methylation features (Table S1). This dataset included features calculated from the alignment of sequence reads to the maize AGPv4 reference genome and data mined from previously published papers (Jiao et al., 2017;Hoopes et al., 2019;Walley et al., 2016;Dong et al., 2017;Bukowski et al., 2017;Liang et al., 2019). Some important features, for example, protein abundance data for maize vegetative and reproductive stages, are only available for prior versions of the maize reference genome. As the result, we constrained this analysis and only considered a set of 29,428 gene models which had a 1:1 relationship between a single gene model in the maize reference genome version AGPv2 and a single gene model in the maize reference genome version AGPv4 (Schnable et al., 2009;Wang et al., 2016;Liang et al., 2019). Many algorithms for making predictions from input feature sets are intolerant of missing values. While the overall rate of missing data for this 1:1 gene set was low, a small number of genes (1,995 genes) have missing values for more than half of the total set of 369 features ( Figure S1). These genes were omitted from subsequent analyses. For the remaining 27,433 genes, missing values were imputed using the median value for that feature across all the genes where that feature was successfully scored.

Potential for Dimensional Reduction Among Nonhomology Features
Spearman correlation coefficients were caclulated among the 369 features ( Figure 1). The two largest classes of features, i.e. RNA abundance (274 features) and protein abundance (33 features), showed substantial between-feature correlation. Supervised machine learning classification models tend to overfit when trained with excessively large numbers of features. This overfitting decreases predictive performance on non-training datasets. Dimensional reduction techniques seek to to address this problem by reducing the total number of features available for training without significantly reducing the overall information content of the dataset. Dimensional reduction algorithms can generally be divided into the categories of feature extraction and feature selection. Principal Component Analysis (PCA) is a widely used feature extraction technique that improves learning performance, reduces computational complexity, builds better models and decreases the required memory space (Tang et al., 2014). More than 90% of the variance in RNA abundance and protein abundance could be captured by 20 and 10 principal components respectively ( Figure 1). These principal components were used in place of the original RNA and protein abundance features, which decreased the number of possible predictive variables from 369 to 92. This decrease also substantially reduced the degree of correlations between possible predictive variables. The 95 th percentile and 99 th percentile for the absolute values of Spearman correlation (r s ) dropped from 0.76 and 0.94 to 0.22 and 0.51, respectively ( Figure 1B).
Multiple distinct sets of GO predictions exist for the maize reference genome (Goodstein et al., 2011;Tello-Ruiz et al., 2015;Wimalanathan et al., 2018). We chose to use the maize GAMER dataset as our starting point for training and evaluating non-homology-based prediction algorithms (Wimalanathan et al., 2018). The published maize GAMER dataset includes 9,336 GO terms which are directly assigned to one or more gene models, and an additional 2,757 GO terms which are implicitly assigned to one or more gene models. An implicit GO term assignment can occur when a specific GO term is explicitly assigned to a gene. In this case, each parent of that GO term is also assigned to the same gene implicitly. We utilized both implicit and explicit GO term assignments. The initial dataset thus consisted of 12,093 GO terms and  Table S2. each go term was assigned to one or more of the 39,324 gene models in the B73 AGPv4 maize reference genome. We chose to exclude both extremely common GO terms (e.g. GO:0008150 "Biological Process") and extremely rare GO terms. Extremely common GO terms tend to be low information content. Extremely rare GO terms are unlikely to possess enough known positive genes to accurately train prediction algorithms. After excluding GO terms assigned to <100 genes in our set of 27,433 genes with feature data and those assigned to >5,000 genes in the same set, 1,562 GO terms -including 1,148 Biological Process, 151 Cellular Component and 263 Molecular Function terms -remained for downstream analyses.
Classifiers were trained using either all 369 predictor features or the reduced set of 92 predictor features remaining after targeted dimensional reduction. Across the models trained of each of the 1,562 GO terms in this analysis, those trained with the full set of 369 features exhibited prediction accuracies of 0.35 to 0.93 with a median of 0.67. Models trained using the reduced set of 92 predictor features exhibited prediction accuracies of 0.41 to 0.93 with a median of 0.68. The increase in prediction accuracy for the targeted dimensional reduction models relative to the full models was statistically significant (p=0.0002; two tailed paired t-test). While this targeted dimensional reduction increased prediction accuracy, untargeted dimensional reduction did not. Models trailed using a set of 50 principal components extracted from the complete set of 369 features provided significantly lower prediction accuracy than either the total feature set (p = 5.96e-158; two tailed paired t-test) or the targeted dimensional reduction feature set (p = 3.89e-192; two tailed paired t-test) ( Figure S2). Prediction accuracy was evaluated using shuffled data to test whether, even after dimension reduction, over fitting might be occurring. Prediction accuracy balanced gene sets with shuffled functional annotations ranged from 0.47 to 0.57 with a median of 0.51, only slightly higher than expected of random predictions. All subsequent analyses used the set of 92 features obtained by targeted dimension reduction of protein and RNA abundance features.

Selection of Random Forest for Gene Function Prediction
Eight machine learning-based supervised classification algorithms including random forest (Liaw and Wiener, 2002), stochastic gradient boosting machines (Ridgeway et al., 2013), Lasso and Elastic-Net Regularized Generalized Linear Models (Friedman et al., 2010;Simon et al., 2011), radial kernel SVM (support vector machine) (Karatzoglou et al., 2004(Karatzoglou et al., , 2018, partial least squares (Wehrens and Mevik, 2007), neural network (Venables and Ripley, 2002;Ripley et al., 2016), penalized multinomial regression (Venables and Ripley, 2002;Ripley et al., 2016), and linear discriminant analysis (Venables and Ripley, 2002;Ripley et al., 2013) were evaluated for their accuracy in predicting GO annotations. Benchmark genes for every GO terms were divided into the sets of 80% training and 20% testing. For training data, 10-fold cross validation was performed for all machine learning methods. Validation accuracy and testing accuracy, i.e. the accuracy in testing dataset, were both calculated for the 8 algorithms and comparisons of the algorithms were based on the accuracy in testing dataset. Based on the average accuracy across all GO terms tested, random forest and gbm methods performed the best and second best respectively (Figure 2A, Figure 2B). This ranking was consistent across sets of GO terms with different annotation frequencies, as well as for GO terms within each of the three GO ontology domains: Biological Process, Cellular Component, Molecular Function (Table S3). This ranking was also consistent when performance was calculated in different ways. Random Forest exhibited the best performance based on calculations of precision (proportion of predicted genes that are truly positive), recall (proportion of true positive genes recovered), F-measure (harmonic mean of precision and recall), consistency score, and AUC-ROC (Area Under Curve-Receiver Operator Curve) ( Figure 2B). Ensemble methods were also evaluated however these did not show a significant increase in prediction accuracy compared to pure random forest based prediction ( Figure S3).

Higher Prediction Accuracy for Biological Process GO Terms
AUC-ROC values (area under curve-receiver operating characteristic) values were calculated for each individual GO term ROC (receiver operating characteristic) curves for the prediction accuracy of random forest based prediction -determined from 10-fold cross validation -were plotted for individual GO terms ( Figure  3A). Details for the performance measures of every GO term provided in Table S4. As a control, AUC-ROC values were also calculated for genes with shuffled functional annotations. The 5 th and 95 th percentile of AUC-ROC values from 4 times of gene label shuffling for all GO terms were 0.45 and 0.56 with a median of 0.5. These values were consistent with expectations for random labeling of balanced data.
Random forest testing accuracy for individual GO terms ranged from 0.41 to 0.93 with a median of 0.68. The single best performing GO term prediction model, assessed based on accuracy, was for GO:0006270 (replication initiation) using random forest (precision = 95.2%, recall = 90.9%, FPR = 4.5%, Accuracy = 0.93, AUC-ROC = 0.92, Consistency score = 0.87). GO terms related to DNA replication (GO:0006270, DNA replication initiation, Accu = 0.93), modification (GO:0016556, mRNA modification, Accu = 0.90; GO:0006304, DNA modification, Accu = 0.85), methylation (GO:0006346, methylation-dependent chromatin silencing, Accu = 0.93; GO:0001510, RNA methylation, Accu = 0.91) and metabolic process (GO:0009220, pyrimidine ribonucleotide biosynthetic process, Accu = 0.86) are well predicted using nonhomology features. On the other end of the distribution, examples of GO terms with the lowest prediction accuracy were (GO:0022832, voltage-gated channel activity, Accu = 0.48; GO:0005216, ion channel activity, Accu = 0.48) and regulation of a process (GO:0050778, positive regulation of immune response, Accu = 0.52; GO:0051348, negative regulation of transferase activity). GO terms with higher prediction accuracy were drawn primarily from the Biological Process domain while GO terms with the lowest prediction accuracy belonged primarily to the Molecular Function domain.
To test whether this finding represented a consistent pattern, the distribution of prediction accuracies was evaluated separately for GO terms belonging to each of the three domains (Biological Process, Cellular Component, and Molecular Function). GO terms involved in Cellular Component have the highest median accuracy ( Figure 3B). Cellular Component GO terms were the rarest of the three domains (151 GO terms of 1,562 total terms tested). Median accuracy for Biological Process GO terms was modestly lower than for Cellular Component. Biological Process GO terms were much more abundant (73%) in 1,562 GO terms test, which may explain why the most accurate individual GO terms were drawn from this domain. GO terms from the Molecular Function domain had the lowest median accuracy, and there were many Molecular Function GO terms, particularly those related to channel, transporter, enzyme activity or binding with extremely low accuracy (Table S4). This ranking of accuracy across GO domains was largely consistent across GO terms with different population sizes of genes carrying the annotation and across the results from predicting using different machine learning algorithms (Table S3).

Contribution of Different Feature Types to Prediction Accuracy
Separate predictions were conducted using distinct subsets of features to assess relative contributions of different types of features to the overall accuracy of non-homology-based functional prediction. The ranking of prediction accuracy was largely consistent across the three primary GO term domains: Biological Process, Cellular Component, and Molecular Function. Models trained using only gene model structure features or trained using only RNA expression features provided approximately equal independent prediction accuracy. One exception was in the Molecular Function domain where models trained using only gene structure features performed almost equivalently to the complete model (AUC-ROC = 0.69 and 0.70 for the models using structural data only and full models, respectively). Models for predicting Molecular Function GO terms trained using only RNA expression features performed significantly worse than the complete model. Models trained using only chromatin features or only co-expression features did not perform well in any of the three domains ( Figure 4A). However, at the level of individual GO terms, there were a number of GO terms where models trained using only protein expression features (99 GO terms 6.3%) or chromatin state features (35 GO terms 2.2%) had better performance than any of the other component models ( Figure 4B). In a minor number of cases (15 GO terms 0.96%) the model trained using only co-expression features provided the highest accuracy of any of the component models (Table S5).

Discussion
Accurate and precise annotation of gene model functions in the absence of gene by gene genetic analysis remains challenging. In most species, the vast majority of genes have not been studied or characterized directly. Instead, when functional annotations are present, they are drawn from functional characterization of homologous genes. Homology-based approaches may also introduce erroneous and misleading functional annotations. Firstly, genes which are homologous will not always perform the same biological function or be localized to the same cellular compartments. For example, R2R3-MYB transcription factors are all homologous to each other yet play different roles regulating responses to multiple stress conditions, controlling plant development and cell fate, or regulating secondary metabolism (Du et al., 2012). Secondly, because homology-based functional annotations are often drawn from datasets and databases which were originally also annotated based on homology, it is possible for incorrect functional annotations to propagate through biological databases indefinitely. Estimates of the mis-annotation using experimentally well-characterized sets of enzyme can range from about 25% to over 60% (Schnoes et al., 2009). Finally, 5% to 15% of annotated gene models in the genomes of many species are "orphans" without detectable homology to any protein with a characterized function. Here we sought to evaluate whether using machine learning methods and a set of non-homology-based features can complement existing methods for functional annotation. Non-homologybased methods may ultimately be able to correctly assign new functional annotations to gene models and identify potentially inaccurate existing functional annotations.
It is important to discuss one critical limitation of the analyses employed here. While non-homology-based annotation approaches ultimately hold the potential to identify and correct errors introduced by homologybased annotation, in this study a set of functional annotations derived from homology-based annotation were treated as ground truth. As the result, the true recall of non-homology-based methods may be higher than the estimated recall in this study, as some false negatives may in fact represent errors in the underlying functional annotations. At the same time, models trained using functional annotations currently assigned to only one or several homologous gene families may learn signatures of those gene families rather than the annotated function itself. In these cases, the accuracies calculated may be higher than the true values (Washburn et al., 2019). Going forward, there is a clear need for curated sets of experimentally supported functional annotations for maize equivalent to those previously generated for species such as yeast and arabidopsis (Aslett and Wood, 2006;Lamesch et al., 2011). While acknowledging these limitations some intriguing initial patterns are still apparent in this initial trial of non-homology based function annotation.
Machine learning-based functional annotation showed strengths which are complementary to known accuracy patterns of primarily homology-based methods. Specifically, homology-based functional annotation has been reported to show higher accuracy for GO terms in the Molecular Function domain (Radivojac et al., 2013;Jiang et al., 2016). In contrast, we found that non-homology-based predictions exhibited the highest prediction accuracy in the Cellular Component and Biological Process domains, and the lowest accuracy in Molecular Function ( Figure 3B). Molecular functions (e.g. transcription factors, transporters, structural proteins) are likely to be conserved between homologous sequences. In contrast, the cellular localization and biological role of a given transcription factor or signal transduction component can vary and diverge substantially between even closely related homologs (Du et al., 2012). Genes involved in the same biological process or localized to a specific cellular compartment may be more likely to exhibit shared features such as co-expression than specific classes of transcription factors or transporters which may be localized to different cell types or expressed only in response to different environmental stimuli.
Going forward there are a number of potential avenues to improve the accuracy of genome-wide nonhomology-based functional annotation. As discussed above, the incorporation of more detailed provenance information for existing functional annotations will serve both to train more accurate models, and to more accurately quantify the performance of these models. There are also additional types of non-homology-based predictive variables which could be incorporated in the future. These include more extensive protein and mRNA expression data, particularly from different stress conditions, experimentally derived protein-protein interaction data, descriptors of population genetic features including different types of selection and diversity, and as well as incorporating the results of quantitative genetic analyses using different types of phenotypes in different environments. Two challenges for future studies are how to integrate these heterogeneous data sources and how to deal with incomplete and noisy data.

Composition of the Prediction Variable Dataset
Predictive variables were divided into five categories: Gene Model Structure, RNA Expression, Protein Expression, Chromatin, Co-Expression, and Natural Diversity. Gene structural features included gene length from transcription start site to transcription stop site, including introns, exon number, CDS length, 3' UTR and 5' UTR length. These values were calculated for each gene using the published AGPv4 maize genome sequence and annotation (Jiao et al., 2017). Nucleotide composition and the GC content were calculated using all sequence from the annotated transcription start site to the annotated transcription stop site.
For protein-coding genes, a codon usage bias score which describes the degree of bias towards the most frequently used codons for multiple encoding amino acids in a given species was calculated following the method described in (Sharp and Li, 1987) as implemented in the SeqIO module in biopython (v1.72) package (Cock et al., 2009).
The initial set of RNA expression features included data from 2-3 replicates of 79 distinct tissue types in the maize inbred B73 (222 total samples) (Stelpflug et al., 2016) and 52 samples from biotic and biotic stress studies of B73 in different labs (Opitz et al., 2014;Makarevitch et al., 2015;Swart et al., 2017) for a total of 274 distinct samples. Normalized (FPKM: fragments per kilobase of exon per million aligned reads) expression values for each gene in each experiment were obtained from (Hoopes et al., 2019).
For each of these chromatin features, scores were calculated for three regions: one using the gene body, defined as the region from the annotated transcription start site to the annotated transcription stop site, a second for the upstream region, defined as a 2 KB region directly upstream of the transcription start site, and a third for the downstream region, defined as the 2 KB region directly downstream of the transcription stop site. For each BS-seq dataset, for each of the three regions relative to each gene and each of three methylation contexts (CG, CHG, CHH), a single percentage score was calculated. This percentages was calculated as the ratio of all cytosines in that context in that genomic interval which were classified as "methylated" (≥ 5 mapped reads and with >50% of mapped reads showing methylation) to the total number of cytosines in that context in that genomic interval. For each ChIP-seq and ATAC-seq dataset, two features were calculated for each genomic interval: the maximum intensity among peaks overlapping that interval and the proportion of that interval covered by peaks using methods previously described in (Lloyd et al., 2017).
For each of these chromatin features, three scores were calculated: one using the gene body, defined as the region from the annotated transcription start site to the annotated transcription stop site, a second for the upstream region, defined as a 2KB region directly upstream of the transcription start site, and a third for the downstream region, defined as the 2 KB region directly downstream of the transcription stop site.
The co-expression set of features consisted of 12 binary variables defining membership in each of the 12 co-expression models defined by (Hoopes et al., 2019).
For population features, raw genotype calls of 277 resequenced inbreds in maize 282 association panel (Bukowski et al., 2017) were downloaded from Panzea (https://www.panzea.org/). Only biallelic SNPs were considered as variations in the given population for this study. SNP filtering, imputation and assignment to maize AGPv4 gene body region were processed as a previous study (Liang et al., 2019). SNP number per gene was determined by the number of final detected SNPs per AGPv4 gene.

Dimension Reduction
Principal component-based dimension reduction was evaluated for RNA abundance and protein abundance data using R prcomp() function with parameters "center=TRUE, scale=TRUE". For each set of features, 50 principal components were calculated. In each case, the decision on how many principal components to include was based on the cumulative proportion of variance explained.

Defining the Subset of Gene Models and Functional Annotations
Several important features like protein abundance data for maize vegetative and reproductive stages, are only available for maize AGPv2. As the result, we constrained this analysis and only considered a set gene models which had a 1:1 relationship between a single gene model in the maize reference genome version AGPv2 and a single gene model in the maize reference genome version AGPv4 (Liang et al., 2019). A small number of genes with missing values for more than half of the total set of 369 features were omitted from subsequent analyses. For the remaining genes, features were centered, scaled and imputed (for missing values) using preProcess() function in caret (v6.0-80) R package (Kuhn, 2015).
An implicit GO term assignment can occur when a specific GO term is explicitly assigned to a gene, each parent of that GO term is also implicitly assigned to the same gene. The get all parents() function in goatools (v0.8.9) python package was used to add the implicit GO terms to each gene (Klopfenstein et al., 2018). After explicitly assigning implicit GO annotations to genes, GO terms which were assigned to less than 100 genes or more than 5,000 gene models were excluded.

Implementing Machine Learning Algorithms
The eight machine learning algorithms, i.e. random forest, neural network, radial kernel SVM, glmnet, lda, penalized multinomial regression, partial least squares and gbm with parameter "tuneLength=5", evaluated as part of this study were all implemented in the R package caret (v6.0-80) (Kuhn, 2015). For each GO term, a balanced training data was constructed using the set of maize genes assigned with that annotation as the "positive" set and a randomly selected equal number of genes not assigned with that annotation as the "negative" set. 20% of the negative and positive genes from each training set were set aside as the hold-out testing data to assess model performance. The remaining 80% of data was used to train each algorithm for each GO term. 10-fold cross validation was used. The three stacking ensemble methods evaluated in our study were also tested using implementations in the caret package. Each of the three was employed as a supervisor model and was provided with the output of three primary predictive methods (random forest, gbm, and glm) with "tuneLength=3".

Evaluating Prediction Accuracy
Accuracy, FPR (false positive rate), recall, precision, F1 score, AUC-ROC, and consistency score were calculated. Accuracy = (TP+TN)/(TP+TN+FP+FN) (where: TP = True positive; FP = False positive; TN = True negative; FN = False negative). The FPR was calculated as the ratio between the number of negative events wrongly categorized as positive and the total number of actual negative events (FP/FP+TN). Recall was defined as the fraction of relevant instances that have been retrieved over the total amount of relevant instances (TP/TP+FN). Precision was defined as the fraction of relevant instances among the retrieved instances (TP/TP+FP). F1 score was calculated as the harmonic mean of precision and recall. AUC-ROC was calculated as the ratio of the total the area under the plot of receiver operating characteristic curve, to the total area contained within the plot. For permutation testing to evaluate the potential of over-fitting, the same training and testing datasets were used and the same algorithms employed, but genes were shuffled between the positive and negative categories.   Table S2: List of all 369 and 92 features in the order they are show in Figure 1 (TableS2-feature-names.xlsx)   Table S5: AUC-ROC values for models trained with subsets of data (TableS5-AUC-summary-featureimp.xlsx) Figure S1: Distribution of missing rates, i.e. the proportion of 369 features for which a given gene has no reported value, among the initial set of 29,428 gene models with 1:1 matches between maize RefGen2 and maize RefGen4. Genes with high missing rates were excluded from downstream analyses.  Figure S3: Distribution of prediction accuracies in all 1,562 tested GO terms for 3 ensemble methods aggregating the outputs from the three best-performing individual models, compared to the best-performing primary method (Random Forest) on its own. Abbreviations for the three ensemble methods are GBM (Gradient Boosting Machine); GLM (Generalized Linear Model); RF (Random Forest).