Random Forest Regression for Optimizing Variable Planting Rates for Corn and Soybean Using High-Resolution Topographical and Soil Data

In recent years, planting machinery that enables precise control of the planting rates has become available for corn (Zea mays L.) and soybean (Glycine max L.). With increasingly available topographical and soil information, there is a growing interest in developing variable rate planting strategies to exploit variation in the agri-landscape in order to maximize production. A random forest regression-based approach was developed to model the interactions between planting rate, topography, and soil characteristics and their effects on yield based on on-farm variable rate planting trials for corn and soybean conducted at 27 sites in New York between 2014 and 2018 (57 site-years) in collaboration with the New York Corn and Soybean Growers Association. Planting rate ranked highly in terms of random forest regression variable importance while explaining relatively minimal yield variation in the linear context, indicating that yield response to planting rate likely depends on complex interactions with agri-landscape features. Models were moderately predictive of yield within site-years and across years at a particular site, while the ability to predict yield across sites was low. Relatedly, variable importance measures for the topographical and soil features varied considerably across sites. Together, these results suggest that local testing may provide the most accurate optimized planting rate designs due to the unique set of conditions at each site. The proposed method was extended to identify the optimal variable rate planting design for maximizing yield at each site given the topographical and soil data, and empirical validation of the resulting designs is currently underway.


INTRODUCTION 1
In agricultural production systems, site-specific management involves the 2 development of crop management strategies at a finer spatial scale than that of the whole 3 field area (Plant, 2001). Compared to early farming, which was performed by hand and 4 thereby facilitated site-specific management, the advent of mechanization in farming 5 ushered in an era of uniform application of inputs on large areas of cropland. Studies 6 have shown that uniform applications can result in suboptimal input use efficiency in 7 parts of the field that may require greater or fewer inputs than the applied fixed rated 8 (Mulla & Schepers, 1997;Moore & Tyndale-Biscoe, 1999), though increases in farm 9 productivity generally outweighed the associated economic losses. However, the growing 10 costs of agricultural inputs (USDA NASS, 2017) and negative environmental impacts of 11 intensive production practices (Tilman et al., 2002) threaten the long-term economic and 12 environmental sustainability of uniform crop management. 13 Variable rate application systems enable growers to apply inputs at a range of 14 user-defined levels within a field area. Variable rate as it applies to planting is not a new 15 concept. The first variable rate planting systems emerged in the United States during the 16 1970s but did not become popular due to their initial lack of automation, which 17 necessitated that growers trigger planting rate transitions manually (Lowenberg-DeBoer, 18 1999). In the late 1990s, the integration of GPS and automated variable rate systems for 19 commercial planting equipment renewed interest in variable rate planting technologies. 20 However, these improvements did not immediately drive a widespread adoption 21 of variable rate planting, as users remained uncertain of how to develop variable rate 22 planting designs to maximize returns given the underlying field conditions. Some early as 23

Data Types and Quality Control 1
For each site-year, five spatial data layers were used to assess the relationship 2 between yield, planting rate, hybrid/variety, topographical characteristics, and soil 3 characteristics. The layers were as follows: 1) harvest, 2) target/as-applied planting rate 4 3) topography, 4) grid soil sampling, and 5) Natural Resources Conservation Service 5 (NRCS) Soil Survey (Fig. 5). 6 The harvest point layer contained the levels of dry volume yield harvested 7 throughout the field. In the form of polygons, the target planting rate layer contained the 8 randomized 0.81 ha block design with which the trials were sown, while the as-applied 9 planting rate point layer recorded the precise rates applied by the planter as well as where 10 each hybrid/variety was sown. These three layers were used in conjunction with one 11 another to perform various quality control measures to the as-applied and yield data (Fig. 12 S1A-F). Briefly, the headlands were removed from both the as-applied planting and 13 harvest layers in QGIS. The remaining data points were then assigned ID numbers 14 corresponding to the experimental unit grid cell in which they appeared. The as-applied 15 planting and harvest layers were then analyzed in R (R Core Team, 2019) to remove 16 outlying grid cells based on the mean, variance, and number of data points within each 17 cell. Grid cells were also removed from the analysis if they contained data corresponding 18 to more than one target planting rate or hybrid/variety. Finally, grid cells in which the as-19 applied planting rate deviated from the target planting rate beyond a threshold of ±1011.7 20 seeds ha -1 for corn and ±8093.7 seeds ha -1 for soybean were also removed. 21 Topographical variables for elevation, slope, aspect, and curvature were derived 22 from elevation measurements recorded by the GPS systems aboard the planting 23 machinery (Fig. S1G). Using the as-applied planting rate layer, elevation means were 1 calculated in QGIS using all of the data points within each experimental grid cell. Slope, 2 aspect, and curvature were then calculated in R using 3×3 neighborhoods of cells 3 following Burrough and McDonell (1998) and Zevenbergen and Thorne (1987). 4 Integrated Ag Services (Mildford Center, OH) conducted grid soil sampling along 5 a 0.20-ha grid pattern, and Spectrum Analytic (Washington Court House, OH) provided 6 the topsoil analysis. Point data on up to 12 soil features were provided for each site: 7 aluminum (Al), phosphorus (P), potassium (K), potassium saturation (KSt), calcium (Ca), 8 calcium saturation (CaSt), magnesium (Mg), magnesium saturation (MgSt), pH, buffer 9 pH (BpH), cation exchange capacity (CEC), and soil organic matter (OM). K, Mg, and 10 CEC were recorded for all 27 sites (57 site-years). The remaining variables were 11 unavailable for ≤ 4 sites (≤ 12 site-years) with the exception of Al, which was available 12 for only 17 of the 27 sites (37 site-years). Using the "gstat" package in R, the soil sample 13 measurements were block kriged to the experimental unit grid (Fig. S1G). If a 14 semivariogram for a particular variable could not be fit due to low spatial autocorrelation, 15 the values for that variable were set to missing. 16 The NRCS Soil Survey layer in the form of polygons was obtained for each sites 17 using the SSURGO database (Soil Survey Staff, Natural Resources Conservation Service, 18 USDA, 2015). The soil type attribute was extracted for each experimental grid cell using 19 QGIS. If an experimental unit grid cell spanned multiple NRCS Soil Survey polygons 20 with differing reported soil types, the soil type variable for that cell was set to missing 21 Final datasets containing variables describing the yield, target and as-applied 1 planting rates, topography, grid soil sampling, and NRCS Soil Survey were merged into a 2 combined data frame in R according to the experimental unit grid cell IDs. Excluding 3 yield as the response variable, the dimensions of the final datasets for each site-year used 4 for model fitting were n × p where n is the number of experimental unit grid cells and p is 5 the number of predictors. 6

Variable Correlation, ANOVA, and Linear Regression 7
Pearson's correlations were calculated between all variables. Treatment effects for 8 planting rate and hybrid/variety were estimated using ANOVA. Yield was regressed on 9 each topographical and soil variable to estimate the percent of yield variation explained 10 (R 2 ). Yield was also regressed on planting rate, hybrid/variety, and their interaction to 11 assess their collective influence on yield in a linear context. All correlation, ANOVA, and 12 linear regression analyses were carried out in R. 13

Random Forest Regression 14
Random forest regression models were fit to assess prediction accuracy, calculate 15 variable importance, and develop optimized planting rate designs. Forests were grown 16 using the "cforest" function of the "party" package for R (Hothorn et al. 2006;Strobl et 17 al. 2007;Strobl et al. 2008 20 Strobl et al. (2007) proposed "cforest" as an alternative implementation of the commonly 21 used regression functions in the "randomForest" package. The algorithm is based on 22 conditional inference trees and applies subsampling without replacement, which was 23 shown to produce reliable variable importance measures when predictors varied in their 1 scale of measurement or number of categories. 2 Random forest regression models were fit using each site-year as a training set. 3 The hybrid/variety variable was excluded as a predictor because predictions could not be 4 made across site-years for which different hybrids/varieties were sown. The models may 5 therefore be described as "agnostic" to the hybrid/variety planted. Fitted models were 6 then used to predict yield for each remaining site-year planted with the same crop as the 7 test set. If the soil type for a given experimental unit grid cell in the test set was not 8 observed in the training set, soil type for that grid cell was set to missing. Prediction 9 accuracy was assessed as the Pearson's correlation between the predicted and observed 10 values for yield. Once prediction accuracy had been assessed for all pairs of site-years, 11 the mean prediction accuracy was calculated for each site-year as the training set using its 12 accuracies for all other sites-years as the test sets. 13 Variable importance measures for each predictor were calculated for each site-14 year using the "varimp" function in the "party" package for R. Briefly, variable 15 importance is calculated by permuting each predictor variable to determine the difference 16 in prediction accuracy before and after the permutation. Variable importance measures 17 were scaled to enable comparisons across site-years. To serve as a metric for the 18 similarity between pairs of site-years, Euclidean distances between site-years in terms of 19 their scaled random forest variable importance measures were calculated using the "dist" 20 function in R. 21 A subset of the full dataset was utilized to evaluate the effect of the mtry 22 hyperparameter on model accuracy (data not shown). In summary, prediction models were trained for each site-year using all possible values for mtry {1…p}. The fitted 1 models were then applied to the remaining site-years in the subset planted with the same 2 crop. Overall, prediction accuracies showed minimal variation among the evaluated mtry 3 values. For each fitted model, there was no mtry value that consistently provided the 4 highest accuracies across the predicted site-years. Therefore, the default mtry value of p/3 5 was used for this study (Breiman, 2001). The ntree value was set to 1000. 6

Optimized Planting Rate Designs 7
Optimal planting rate designs were developed for each site-year using its own 8 data as the training set to fit the random forest regression model. For each experimental 9 unit grid cell, trained models were used to predict yield at each of the four tested planting 10 rates, given the underlying soil and topographical features of that grid cell. The planting 11 rate provided the highest yield prediction within each grid cell was then identified to 12 create the optimized design. 13

Grain Yield Summary Statistics 15
The average grain yields across all site-years were 11,566 kg ha -1 for corn and 16 3,676 kg ha -1 for soybean, which were above the average statewide yields of 9,953 and 17 3,026 kg ha -1 for corn and soybean, respectively, (USDA-NASS, 2019) during the 2014-18 2018 period. Corn yields were lowest and most variable during the 2015 growing season, 19 which was characterized by extremely wet conditions during May and June followed by 20 dry conditions from early July through mid-September (Fig. 2). Scouting reports recorded 21 in early July noted multiple sites with standing water in some low-lying sections (data not 22 shown). Corn yields were likewise variable in 2016, due to severe drought conditions in 1 the most variable with respect to within-site-year standard deviation, while yields were 2 the lowest in 2016. 3

Effect of Planting Rate and Hybrid/Variety on Yield 4
Although ANOVA of planting rate showed a significant difference in yield 5 between at least one pair of planting rate levels (p-value < 0.05) for 31 out of 32 site-6 years for corn, planting rate on average explained only 4.1 percent of yield variation and 7 did not exceed 17.4 percent for any site-year for corn ( Fig. 6). For soybean, a significant 8 (p-value < 0.05) treatment effect for planting rate was observed in all 25 site-years, but 9 the average amount of yield variation explained was likewise low at 10.8 percent. In 10 contrast with the corn trials, however, planting rate explained a large amount of yield 11 variation in some site-years (up to 40.8 percent). 12 Significant treatment effects (p-value < 0.05) were less prevalent for 13 hybrid/variety, appearing in 13 out of the 19 site-years for which multiple hybrids of corn 14 were sown and for 5 out of 9 site-years for which multiple varieties of soybean were 15 sown. As with planting rate, the amount of yield variation explained by the hybrid 16 variable was very low for corn, averaging 1.0 percent with a range of 0.0 to 4.1 percent. 17 For soybean, variety explained 8.5 percent of yield variation on average, though this 18 ranged as high as 30.1 percent. Notably, hybrid/variety explained over 25 percent of 19 variation in soybean yields in 2 site-years, Vc1_2015 and Mc1_2018, in which the 20 difference in performance between the varieties grown were 771 and 395 kg ha -1 , 21 respectively.
Considering the composite effects of planting rate, hybrid/variety, and their 1 interaction, as before, the percent of yield variation explained was much greater for 2 soybean than corn. These variables explained 2.8 and 22.8 percent of yield variation in 3 corn and soybean, respectively. Most notably, these variables explained 61.9 percent of 4 the variation in yield for Vc1_2015 with both varieties P22T41 and P92Y51 showing 5 positive linear responses to planting rate. 6

Topographical and Soil Summary Statistics 7
The 27 field sites tested varied greatly in their topographical and soil feature 8 profiles. The lowest and highest elevations recorded were 105 and 368 masl, and the 9 difference between lowest and highest points at a site was 16 m on average. The largest 10 gradient within a single site with respect to elevation was observed at Ke, which ranged 11 from 116 to 164 masl for a difference of 48 m. The United States Environmental 12 Protection Agency classifies crop production on slopes ≥ 9 percent, or 5.14°, as 13 "agricultural land cover on steep slopes." Of the 27 field sites, 19 contained areas of 14 sloped terrain at grades of 5.14° or above. 15 Soils in New York State are generally acidic. The soil grid sampling data showed 16 that the pH ranged from 5.8 to 7.2 for the tested field sites. High levels of aluminum were 17 likewise observed, ranging from 657 to 767 ppm. CEC levels were moderate, ranging 18 from 6.2 to 16.5 meq/100 g, while soil OM was moderate to low, averaging 2.3 percent. 19 According to the NRCS Soil Survey, 48 unique soil types were present across the tested 20 sites, with 13 appearing in only one site. Each site contained an average of 6 different soil 21 types, with type "Cazenovia" as the most common, appearing in 10 of the sites.
In the context of yield prediction, it would be desirable for the correlation for a 1 pair of topographical or soil characteristics to have a high magnitude mean and low 2 standard deviation across the testing sites, indicating that the two variables are not only 3 strongly related, but their relationship is consistent across sites. A wide variety of 4 relationships were observed between the measured variables (Fig. 7). As expected, pH 5 and Al were observed to have a consistent, strong negative correlation, while CEC and 6 OM were positively correlated. The relationship between OM and elevation was observed 7 to be slightly negative, on average, although highly variable across the sites. When 8 considering each variable's relationship with yield, soil type had the most consistent 9 relationship for both corn and soybean. It explained an average of 6.8 and 10.4 percent of 10 the variation in corn and soybean yields, respectively, across the site-years. All other 11 variables typically explained moderate amounts of yield variation within some site-years, 12 but these relationships were not consistent across site-years overall. 13

Random Forest Regressions 14
Random forest regression models were fit to predict yield using each site-year as 15 the training set with moderate success. Overall, within site-year prediction accuracy was 16 moderate, with R 2 values ranged from 0.16 to 0.78 and averaging 0.47 for corn and 17 ranging from 0.17 to 0.72 and averaging 0.50 for soybean. Planting rate consistently had 18 higher values for random forest regression variable importance for both crops but 19 particularly for soybean (Fig. 8). Elevation was likewise important for both crops in most 20 site-years. Organic matter appeared to be more consistently important for soybean than 21 for corn. The soil nutrient variables were more frequently of greater importance for corn than for soybean. For most other variables, the level of importance was inconsistent 1 across site-years. 2

Yield Predictions Across Environments 3
When random forest regression was applied to predict yield across site-years, the 4 prediction accuracies varied greatly based on the pair of site-years used to train and test 5 (Fig. 9). This is evident in the mean prediction accuracy for each site-year, which 6 averaged 0.03 for corn and 0.08 for soybean. However, when considering pairs when predicting Sc_2017, which was planted under a re-randomized design. The 20 prediction accuracies for each pair of site-years were correlated with the Euclidian 21 distances between the variable importance measures for each pair of site-years at a level of -0.23 for corn and -0.48 for soybean, suggesting that site-year pairs with similar 1 variable importance profiles tend to predict one another at a higher level of accuracy. 2

Planting Rate Optimization 3
Optimized planting rate designs were developed from the random forest 4 regression prediction models to maximize yields (Fig. 10). In general, the optimized 5 designs favored higher rates for both corn and soybean. For corn, the four planting rates 6 in increasing order were assigned to, on average, 18.6, 21.8, 33.3, and 28.2 percent of the 7 field. For soybean, optimizing for yield overwhelming favored the highest rate, which 8 was assigned to 60.1 percent of the field, on average, while the remaining three rates in 9 increasing order were assigned to 9.7, 13.8, and 18.9 percent of the field, on average. 10 Corn and soybean were evaluated in more than one year in 7 and 6 of the field 11 sites, respectively. The optimized designs showed a considerable amount of differences 12 depending on which year was used as the training set. For corn, the proportion of grid 13 cells assigned to the same rate across years was 31.5 percent on average, ranging from 14 12.7 to 44.3 percent. For soybean, the average was 33.2 percent with a range of 5.3 to 15 74.3 percent. 16

DISCUSSION 17
In the context of linear modeling, planting rate explained relatively low levels of 18 the yield variation. In most site-years, there was no consistent yield advantage of one of 19 the tested planting rates over the others. This lack of an apparent site-wide "optimum" 20 planting rate would suggest that subsections of the field may have different optimums, 21 according to underlying field conditions. Furthermore, for random forest regression, 22 which is capable of modeling complex and nonlinear interactions, planting rate ranked highly in terms of variable importance for both corn and soybean. Together, these results 1 provide support for the hypothesis that there may be important complex interactions of 2 planting rate with topographical and soil variables that are driving response in yield. 3 These observations also highlight the importance of selecting an appropriate 4 modeling approach for large agronomic datasets. A similar study found artificial neural 5 networks, which -like random forest regression -are capable of modeling nonlinear 6 interactions, were superior to linear modeling approaches when predicting corn yields 7 using soil and landscape features (Miao et al., 2006). Random forest regression is among The proposed random forest regression model was extended to build optimized 16 variable rate planting designs for maximizing yields. Marked differences were observed 17 between corn and soybean with respect to the resulting optimized designs. For corn, 18 although higher rates were generally favored, all four tested planting rates were assigned 19 to considerable parts of the field in most of the optimized designs. For soybean, many of 20 the optimized designs were predominantly assigned the highest of the four rates. While 21 this may suggest that fixed rate planting may be more suitable for those sites, it is also 22 important to consider the trade-off between marginal yield gains from higher planting rates and the cost of the seed, as it would be no longer profitable to plant at higher rates if 1 revenue from the yield gains sold at the market price does not exceed the cost of the 2 additional seed. The proposed random forest-based approach may also be utilized to 3 optimize planting rate designs for maximized profit. An example of optimized designs 4 based on theoretical seed cost-to-market price ratios is shown in Fig. 10, demonstrating 5 that lower planting rates are assigned to greater proportions of the field as the cost-price 6 ratio becomes less favorable to growers. This highlights the importance of leveraging 7 market trends to inform management decisions concerning variable rate, and the 8 proposed approach is fully capable of accommodating user-defined commodity price and 9 seed cost information to optimize planting rate designs for profit. 10 For sites that were evaluated under the same crop in multiple years, the resulting 11 optimized planting rate designs showed considerable differences depending on the data 12 used to train the model. These inconsistencies are likely due to variations in 13 environmental conditions across years (Wells, 1993;Reeves & Cox, 2013) or differential 14 sensitivities to planting rate among the varieties sown (Agudamu et al., 2016).  year variation in weather conditions and its impact on yield pose a challenge for the 16 development of variable rate planting strategies. With sufficient training data, weather 17 information could be incorporated into variable rate planting models, as with many of the 18 currently available variable rate input management schemes (Melkonian et al., 2008, 19 Hedley & Yule, 2009). However, this may be of limited use because, whereas variable 20 rate input management strategies can be designed to respond to in-season weather status, 21 planting decisions must be made at the beginning of the season without prior information 22 about eventual weather outcomes. Therefore, an alternative approach may be to train 23 models with data characterizing the range of possible weather scenarios in order to 1 identify and base variable rate planting designs on main effects. Further long-term testing 2 is needed to sample a greater range of seasonal weather conditions and to assess the 3 stability of variable rate planting optimizations across years. 4 Short product life cycles also complicate the design and implementation of 5 variable rate planting technologies. Results from this study and other works show that 6 corn hybrids and soybean varieties can differ in their response to planting rate (Rutger 7 and Crowder, 1967;Modarres et al., 1998;Agudamu et al., 2016). To address this, some 8 seed companies conduct planting rate trials across a range of environments and publish 9 optimal rate recommendations for the hybrids and varieties they release. This information 10 cannot, however, be customized to the conditions of an individual site, nor does it 11 provide insight into effective variable rate strategies to impose within a site. Empirical 12 testing of variable rate response, as in this study, is time-consuming, and advances in 13 agricultural biotechnologies and molecular breeding have accelerated the rate at which 14 improved hybrids and varieties enter the market. A 2010 study estimated that US corn 15 hybrids reach maximum sales with 2-3 years before declining in the following years 16 (Magnier et al., 2010). New hybrids or varieties may therefore become obsolete during 17 the time required to conduct thorough on-farm trials of variable rate response if carried 18 out by the grower. Greater integration of variable rate testing in the germplasm 19 development pipeline may serve to address this issue. Alternatively, though not ideal, 20 variable rate optimization models could be trained using data collected on a range of 21 hybrids/varieties so as to become "agnostic" the hybrid or variety sown and base planting rate optimizations on main effects. Empirical evaluations are needed to establish the 1 value of variable rate planting if hybrid/variety effects are ignored. 2 In addition, the expense of conducting on-farm trials of variable rate response is 3 not trivial. Grid soil sampling was carried out at each field site, and research associates 4 trained in data science were needed to process and analyze the data. In addition, the 5 current approach requires growers to plant and evaluate variable rate randomized block 6 designs for at least one year before optimized designs can be generated. The ability to 7 build optimized designs for sites that have not undergone variable rate randomized block 8 testing would therefore be of great value. Overall, using the fitted random forest 9 regression models to predict yield across site-years produced low prediction accuracies. 10 The site-years with the highest mean prediction accuracies across all other site-years for 11 corn and soybean were Mc3_2015 and Du3_2015 with accuracies of 0.11 and 0.20, 12 respectively. Low accuracies are perhaps expected given the variation across site-years in 13 the direction and magnitude of the relationships between predictor variables. Within site-14 year prediction accuracies were much higher, indicating that local randomized block 15 testing is likely to provide the most accurate optimizations; however, there may be 16 alternative approaches for building the most useful training sets for a given untested site. 17 For example, there was a significant negative correlation between the across site-year 18 yield prediction accuracies and the Euclidean distances between site-years for random 19 forest regression variable importance. Provided yield, topographical, and soil data are 20 available for an untested site, random forest regression models predicting yield could be 21 fit, and the resulting variable importance measures could be utilized to identify the most 22 similar site-year(s) evaluated under the randomized block design to use as a training set.
Additionally, data from multiple site-years could be combined in an optimal fashion to 1 maximize prediction accuracy. Heslot et al. (2013) provides an innovative approach for 2 this type of training set optimization in the context of plant breeding that identifies and 3 removes less predictive site-years from the complete set of site-years in order to train a 4 composite model. 5 Prior to testing planting rate optimization for untested sites, a critical next step is 6 to empirically validate optimized designs based on local testing. Within site-year 7 prediction accuracies suggest that the data collected are capturing the relationships 8 between yield, planting rate, topography, and soil at a reasonable level, but in order to 9 determine whether the proposed approach offers advantage over fixed rate sowing across 10 years, empirical testing of the optimized designs is needed. The NYCSGA is currently 11 conducting on-farm validation trials in which the optimized designs developed from 12 previous years' randomized block variable rate trials are sown. 13

CONCLUSIONS 14
In this study, we have proposed a random forest regression-based approach to 15 predict yields given high resolution topographical, soil, and variable rate planting rate 16 data. We have likewise developed a method for optimizing planting rate designs to 17 maximize yields, thought it may be further extended to accommodate user-defined seed 18 costs and crop market prices for maximizing profits. While planting rate explained 19 relatively low amounts of yield variation in the linear context, it ranked highly in terms of 20 random forest regression variable importance, suggesting that there may be important 21 complex, non-linear interactions of planting rate with topographical and soil features 22 driving yield response that are not captured by linear model. The resulting optimized planting rate designs showed limited consistency across years, likely due to temporal 1  The top row for each crop is the average variable importance over all site-years of that crop,

4
White indicates that the variable was not measured or could not be included in the analysis due to 5 low spatial autocorrelation. Site-years are ordered with respect to their mean prediction accuracy across all other site-years.

3
The diagonal corresponds to the R 2 of the random forest regression training model for each site- I Setting the soil type to "missing" if more than one soil type was present within a grid cell 1 2 For experimental grid cells in which there was more than one soil type present, the soil type was 3 set to missing. In the illustrated example, the yellow asterisks denote grid cells in which more 4 than one soil type is present.