Crop growth and soil water fluxes at erosion‐affected arable sites: Using weighing lysimeter data for model intercomparison

Agroecosystem models need to reliably simulate all biophysical processes that control crop growth, particularly the soil water fluxes and nutrient dynamics. As a result of the erosion history, truncated and colluvial soil profiles coexist in arable fields. The erosion‐affected field‐scale soil spatial heterogeneity may limit agroecosystem model predictions. The objective was to identify the variation in the importance of soil properties and soil profile modifications in agroecosystem models for both agronomic and environmental performance. Four lysimeters with different soil types were used that cover the range of soil variability in an erosion‐affected hummocky agricultural landscape. Twelve models were calibrated on crop phenological stages, and model performance was tested against observed grain yield, aboveground biomass, leaf area index, actual evapotranspiration, drainage, and soil water content. Despite considering identical input data, the predictive capability among models was highly diverse. Neither a single crop model nor the multi‐model mean was able to capture the observed differences between the four soil profiles in agronomic and environmental variables. The model's sensitivity to soil‐related parameters was apparently limited and dependent on model structure and parameterization. Information on phenology alone seemed insufficient to calibrate crop models. The results demonstrated model‐specific differences in the impact of soil variability and suggested that soil matters in predictive agroecosystem models. Soil processes need to receive greater attention in field‐scale agroecosystem modeling; high‐precision weighable lysimeters can provide valuable data for improving the description of soil–vegetation–atmosphere process in the tested models.


INTRODUCTION
Agroecosystem models are used for an increasing range of applications, including the assessment of risk of climate change on food security and the development of suitable management strategies for adaptation to and mitigation of climate change (Challinor et al., 2018;Martre et al., 2015). A large number of agroecosystem models, differing in model complexity and functionality to simulate and predict the major physiological processes that determine crop growth and yield and their responses to environmental and management factors, have been developed and used for risk assessment studies on agricultural production (Asseng et al., 2013;O'Leary et al., 2015). Each of these models comes with an inherent prediction error when used to predict the response of crop growth and soil water flux dynamics to climatic conditions such as changing seasonal rainfall patterns, rising temperatures, and elevated atmospheric CO 2 concentrations. Therefore, comparative studies using different models are required to evaluate and quantify the uncertainties of simulations results. Several recent studies intercompared crop models and systematically analyzed the sensitivity to different climatic conditions for different crops: wheat (Triticum aestivum L.; Asseng et al., 2013Asseng et al., , 2014Asseng et al., , 2019Frieler et al., 2017;Hussain, Khaliq, Ahmad, & Akhtar, 2018;Martre et al., 2015;O'Leary et al., 2015;Palosuo et al., 2011;Schauberger et al., 2017;Wallach et al., 2018), barley (Hordeum vulgare L.; Rötter et al., 2012;Salo et al., 2016), rice (Oryza sativa L.; Frieler et al., 2017;Hasegawa et al., 2017;Li et al., 2015), maize (Zea mays L.; Bassu et al., 2014;Durand et al., 2018;Frieler et al., 2017;Schauberger et al., 2017), potato (Solanum tuberosum L.; Fleisher et al., 2017), soybean [Glycine max (L.) Merr.; Battisti, Sentelhas, & Boote, 2017;Battisti et al., 2018;Frieler et al., 2017;Schauberger et al., 2017], sugarcane (Saccharum spp.; Marin, Thorburn, Nassif, & Costa, 2015), and crop rotations (Kollas et al., 2015;Yin, Kersebaum, Kollas, Baby, et al., 2017;Yin, Kersebaum, Kollas, Manevski, et al., 2017). In several of these studies (Kimball et al., 2019;Martre et al., 2015;Palosuo et al., 2011;Yin, Kersebaum, Kollas, Baby, et al., 2017;Yin, Kersebaum, Kollas, Manevski, et al., 2017), model calibration was mostly based on experimental data with the focus on agronomic in-season variables such as leaf area index (LAI), phenological stages (BBCH), and end-ofseason variables such as plant N content, grain yield (GY), or total aboveground biomass (AgBio) at harvest. The explicit consideration of soil water flux dynamics in crop growth intercomparison studies has been largely neglected despite the importance of water and nutrient availability for crop growth in field conditions and simulations. Regardless of its relevance to increase the reliability and confidence of such modeling tools, few simulation studies have characterized the implications of using, in addition to agronomic variables such as crop growth, environmentaland ecosystem-related variables such as actual evapotranspiration (ET a ; Cammarano et al., 2016;Kimball et al., 2019;Wöhling, Schöniger, Gayler, & Nowak, 2015), spatial soil textural information (Hoffmann et al., 2016;Maharjan et al., 2019), subsoil N content (Wallor et al., 2018), and soil water content (SWC; Wallor et al., 2018;Wöhling et al., 2013Wöhling et al., , 2015. To accurately describe and understand the crop and soil agroecosystem processes, soil water (and matter) flux-related processes should also be part of the model calibration strategy, because these variables (i.e., ET a , SWC, storage, and drainage) are critical, as they influence the crop model predictions. Unfortunately, most model calibrations including those water and matter fluxes are performed with only a few temporally or spatially resolved data (Martre et al., 2015;Žydelis, Weihermüller, Herbst, Klosterhalfen, & Lazauskas, 2018). Thus, classical agroecosystem model intercomparison studies focused their model calibration on crop-relevant data only, rather than on soil water dynamics (Seidel, Palosuo, Thorburn, & Wallach, 2018). This lack in the calibration is mostly because datasets that include all necessary information are often not available . Considering a broader range of data types in the crop model calibration can lead to a more accurate description of agroecosystem or land surface models (Wöhling et al., 2013). It should be noted that the way models are calibrated may also be important for the reliability of the models predictions. However, a recent investigation using ET a in crop model intercomparison studies showed that much of the difference and variability in predicting ET a by crop models was mainly a consequence of using different approaches to estimate potential ET (Kimball et al., 2019). This example demonstrates that in addition to differences in crop model complexity and model functionality, the selection of boundary conditions (Diamantopoulos et al., 2017) can largely affect the outcome of crop models. Thus, not only the selection of the upper boundary (e.g., reference ET) but also the selection of the bottom boundary condition might be relevant in model intercomparison studies if water from deeper soil layers or the water table is available for capillary rise and may serve as an additional water supply for root water uptake and evaporation (Groh, Vanderborght, Pütz, & Vereecken, 2016;Luo & Sophocleous, 2010). The bottom boundary condition can also indirectly affect the water uptake from aboveground sources (irrigation and precipitation), because the plant consumption of infiltrating water differs in the case of upward-directed flow in the subsoil layer, and this eventually affects the crop model predictions. This highlights the need to use standardized model boundary conditions in model intercomparison studies. We therefore hypothesize that agronomic and environmental observations obtained with cropped weighing lysimeters are suitable for testing agroecosystem models, as they provide information about all water and matter fluxes, and information about boundary conditions and crop performance. When applying agroecosystem models, the field-scale soil variability needs to be considered. In arable fields, for example, spatial soil heterogeneity can result from long-term soil management in combination with erosion effects. In hummocky soil landscapes, erosion-affected soil profile modifications can lead to truncated soil profiles at exposed and hillcrest positions and to colluvial soils at toeslopes or in local depressions ( Figure 1). Both profile truncation and colluviation changes the existence and thickness of soil horizons, and even the soil hydraulic properties of the same soil horizons can substantially differ (Rieckh, Gerke, & Sommer, 2012). Because of the close interactions of soil-and croprelated biophysical processes, these soil modifications may lead to differences in water and matter fluxes and consequently to differences in crop performance (Gerke, Rieckh, & Sommer, 2016). Quantitative studies on the limitations of model predictions of field-scale agroecosystem simulations caused by different soil characteristics, especially for truncated and colluviated profiles, are rare, even if such phenomena can be detected frequently. The classical representation of arable fields by using dominant soil profiles may limit predictions of both the agronomic

Core Ideas
• We performed model intercomparison based on lysimeter data of erosion-affected soils. • Uniform crop model calibration was performed by providing only phenological growth stages. • We compared individual models predictions with those of the multi-model mean. • Differences between erosion-affected soils could not be described well by models and the multimodel mean. • We used lysimeter data for testing how soil matters in agroecosystem models.
crop performance as well as the environmental fluxes in erosion-affected soil landscapes (Rieckh, Gerke, Siemens, & Sommer, 2014). The objective of this crop model comparison study was to quantify differences in the role of soil properties in agroecosystem modeling when trying to predict both agronomic aspects such as crop development and yield and environmental aspects such as water and matter fluxes. Specific aims were (a) to compare forward simulation results after minimal calibration on phenology of crop growth and soil water flux-related ecosystem variables based on lysimeter observations from erosionaffected soils, and (b) to evaluate how well the models reproduce the agronomic and environmental variables of erosion-affected soil profiles.
Data are provided from the TERrestrial ENviromental Observatories (TERENO)-SOILCan lysimeter network site Dedelow, which is located in the hummocky ground moraine landscape of northeastern Germany (Heinrich et al., 2018;Pütz et al., 2016). This landscape type is important for global crop production, as it covers ∼10% of the arable land at the global temperate climatic zone (Chapman et al., 2020;Sommer, Fiedler, Glatzel, & Kleber, 2004). It reveals a regular, erosion-and deposition-related soil pattern across continents (Pennock, Bedard-Haughn, Kiss, & van der Kamp, 2014;Sommer, Gerke, & Deumlich, 2008) and has been intensively studied under aspects of erosion feedbacks on crop biomasses and yields (Papiernik et al., 2005), as well as on greenhouse gas fluxes (Hoffmann et al., 2017;Pennock et al., 2010). We selected lysimeter observations from four soil profiles, which differed in both soil type and soil erosion history and capture a broad range of the field-scale spatial variability of soils in erosionaffected, hummocky agricultural landscapes (Miller et al., 2016;Sommer et al., 2008). For each soil monolith, data were provided for (a) typical agronomic observations (GY, Ten agroecosystem models and two crop-soil hydrology models were tested with limited calibration, based on phenology stages only, to predict crop growth and water flux dynamics. The dataset for the simulations included weather, reference ET, soil water retention characteristics, soil initial conditions, site management, targeted range of GY, and BBCH.

Site and soil descriptions
The data for this study were collected at the experimental field station in Dedelow, Germany (53 • 22′2.45′′ N, 13 • 48′10.91′′ W, 50-60 m asl) and are maintained by the Leibniz Centre for Agricultural Landscape Research (ZALF). The station is located in a hummocky arable soil landscape and is part of the Northeast German Lowland Observatory of TERENO (Heinrich et al., 2018) and the German wide lysimeter network SOILCan . In total, 12 high-precision weighing lysimeters (METER Group), each with a surface area of 1 m 2 and a depth of 1.5 m, were installed at the experimental field. Four characteristic soil types, which evolved from glacial till, were selected for this study. They represent a typical catena of hummocky ground moraines ( Figure 1) showing an extremely eroded Calcaric Regosol (Gr_K) at the exposed hilltop, a strongly eroded Luvisol (Nudiargic Luvisol Dd_5) and a noneroded Luvisol (Calcic Luvisol Dd_1) from slopes, and a depositional Colluvic Regosol (Hd_S) at the depression. Physical and chemical soil characteristics can be taken from Supplemental Tables S1 and S2 and were provided to each modeling group. The soil water retention characteristics were determined from undisturbed soil cores with the evaporation method using the HYPROP technique (METER Group). This method provides drying soil water retention data, which were fitted to the standard van Genuchten-type soil water retention model. Soil texture and soil chemical properties were obtained from disturbed samples. More information on the site and soil properties and the lysimeter setup of SOILCan can be found in , Herbrich, Gerke, Bens, and Sommer (2017), and Pütz et al. (2016). The arable-land lysimeters were surrounded by a control field (∼2.3 ha). The plant management (crop type, fertilizer, plant protection application, crop growth regulators, and tillage) on the lysimeters and the surrounding field was identical during the observation period (1 Aug. 2014 until 31 Oct. 2018). Soil tillage was carried out manually each year (depth of tillage = 8-12 cm), and fertilizer applications during the observation period are listed in Supplemental Table S3, which was also provided to each modeling group. The lysimeters were cultivated with winter wheat (T. aestivum) cultivar 'Julius', winter wheat (T. aestivum ) cultivar 'Pionier', winter rye (Secale cereale L.) cultivar 'SU Cossani', winter rye (S. cereale) cultivar 'SU Cossani', and spring oats (Avena sativa L.) cultivar 'Ivory'. It should be noted that the winter rye was not harvested in 2018, and the entire plant material was incorporated into the soil by plowing before sowing oats on 11 Apr. 2018. Seasonal crop development was characterized by BBCH and LAI (m 2 m −2 ) measured by a plant canopy analyzer probe (SunScan Probe Type SS1, Delta-T Devices). Dry matter AgBio and GY were gravimetrically determined at harvest after drying at 60 • C for >24 h after reaching constant weight.

Lysimeter data
Observations on water fluxes (ET a , precipitation, and drainage) were obtained for each soil profile from the corresponding lysimeter mass data. The weighing precision of the lysimeters was 10 g (∼0.01 mm); for the water reservoir tank used to capture drainage, a precision of 1 g (∼0.001 mm) was specified. For the Calcaric Regosol (Gr_K) and the Colluvic Regosol (Hd_S), the field pressure head values were obtained from tensiometers installed at the original landscape position of the soil monolith extraction in order to more realistically represent ET processes under conditions of upward-directed soil water fluxes . At the bottom of the lysimeters (i.e., 1.4to 1.5-m depth), a porous suction rake device was installed and connected with a pumping system that allowed the bidirectional transfer of water between lysimeter soil and the weighable water reservoir. The pumping system operated according to an algorithm that controlled pumping direction and duration according to differences in pressure heads observed in the lysimeter at 1.4-m depth and those measured in same soil depth at the monolith extraction site's landscape position. The lysimeter weights were logged every minute and post-processed using the AWAT noise filter of Peters et al. (2017). For this study, water fluxes obtained from lysimeter and water reservoir tank mass data at 1-min resolution were aggregated to daily sums of ET a , precipitation, and net water flux across the lysimeter bottom at 1.5-m soil depth (NetQ). Lysimeters were equipped with time domain reflectometry (TDR, CS610, Campbell Scientific) sensors to determine SWC at 10-, 30-, and 50-cm soil depths. Average daily SWC of the first 60 cm was calculated by SWC observations from 10-, 30-, and 50-cm depths by using a depth interval weighed sum for the first 60 cm.
Standard meteorological data of air temperature, global radiation, relative humidity, air pressure, and wind speed were provided in 10-min intervals by a weather station (Type WXT510, Vaisala Oyj) that was located ∼4 m from the lysimeters in Dedelow. A linear regression model between the corresponding meteorological variables at the lysimeter station and surrounding weather stations (Dedelow, Heydenhof, Christianhof, Kuntzow, and Muessentin), obtained from the publicly accessible TERENO platform (http://www.tereno.net/ddp/), was used for gap filling of missing data in the time series. We applied a stepwise gap-filling approach, so that missing meteorological values were filled based on the observation that had the highest correlation (adjusted R 2 ). The gap-filled meteorological parameters were used to estimate reference ET for grass (ET 0 ) for hourly time steps according to the FAO56 Penman-Monteith method (Allen, Pereira, Raes, & Smith, 1998) and was summed up to daily values. The parameterization of the bulk surface resistance (soil and plant) and aerodynamic resistance (plant) values in the Penman-Monteith model followed recommendations of Allen et al. (2006). The maximum rooting depth for each crop and the corresponding soil was set in a first step arbitrarily to the depth of the lysimeter (1.5 m). In a second step, the maximum rooting depth was determined based on soil profile observations, assuming that roots penetrate the soil solum down to the C horizon (i.e., highly compacted glacial till) plus the upper 0.2 m of the C horizon along cracks and earthworm burrows, resulting in maximum rooting depths of 0.47 m for Gr_K, 0.85 m for Dd_5, 1.0 m for Dd_1, and 1.50 m for Hd_S.

Crop models
This intercomparison study considered 12 simulation models, which have been described in detail in separate pub Model simulations were performed on a daily time step, except for the DY and THGS model, which simulated crop growth (DY) and/or water flow (THGS) on an hourly basis. The models represent different degrees of crop model complexity and model functionality to simulate agronomicand environmental ecosystem fluxes and states. Four crop models shared the same modular structure as EN to simulate soil water flow, soil heat, and N transport but differed in structure to simulate crop growth. The simulation programs THD and THGS used crop outcomes (LAI and root depth) from TH and thus did not have a direct feedback between crop and soil status. However, those models were included in the study to serve as independent reference for Richards equation-based numerical soil hydraulic models as compared with crop models coupled with bucket-type and Richards equation-based soil water models. All other crop models do not share a common system with respect to model structure to simulate crop growth and water flux dynamics.

Setup up of model intercomparisons
Each individual modeling group received basic key soil parameters and variables (physical and chemical soil characteristics; Supplemental Tables S1 and S2), daily meteorological conditions (air temperature [maximum, minimum, and mean], wind speed, global radiation, relative humidity, and air pressure), ET 0 , precipitation, bottom boundary condition (distance to groundwater table or pressure head), maximum rooting depth for all crops (fixed and depth adjusted), information on the crop-soil management (sowing density, sowing, and harvesting date), and observed crop phenology (BBCH; Meier, 2001). Hereby, BBCH was provided as the only opportunity allowing for a minimal crop model calibration (Supplemental Table S3). Conversion from simulated development stages to BBCH followed the table of Wang and Engel (1998), except for the models DY and DDC. In DDC, the management events are defined in the input files. The different stages are simulated by a phenological model with crop specific parameters, which were provided in the crop datasets. Information on the range of regional GY was provided for winter wheat (range: 6-12 t ha −1 ), winter rye (range: 5 -11 t ha −1 ), and oat (range: 2-5 t ha −1 ) based on long-term yield observations carried out in the ZALF project "AgroScapeLab Quillow" (http://www.zalf.de/en/forschung_lehre). The initial SWCs supplied to the modeling groups were calculated from the vertical matric potential distribution profile obtained by linearly interpolating between measured values at the top (10 cm) and the bottom (140 cm) of the lysimeter at the starting day. The matric potential values were converted to volumetric water contents using the van Genuchten soil water retention model and the parameters of each soil horizon (Supplemental Table S2).

Statistical analysis
The evaluation of the individual model and the multimodel mean (MMM) performance was done for the agronomic ecosystem-related in-season and end-of-season variables (LAI, GY, and AgBio) and the main in-season soil environmental ecosystem variables (ET a , NetQ, and SWC). We considered the corresponding simulation outputs of 10 crop models to calculate MMM of the model simulations and thus exclude the soil hydraulic models because of missing interactive coupling of crop development with soil moisture status (i.e., THD and THGS) as in the other crop models. The R software (R Core Team, 2016) and the function nrmse of the package hydroGOF (Zambrano-Bigiarini, 2017) were used to compare observed with simulated variables. The normalized RMSE (nRMSE) was calculated as nRMSE = 100 SD (Obs) where SD is standard deviation, N is the number of samples, and Sim and Obs are the simulated and observed values. The nRMSE provides a statistical criterion that allows comparison of the overall model performance variables of each model irrespective of the different dimension of the variable (Wallach, 2006). The normalization in nRMSE was done with the standard deviation rather than the mean of the observation variable. This was necessary, because one variable contains positive and negative values of the water fluxes (NetQ) and SD permits accounting for the range of the variability of the observation and thus enables the comparison of the model performance for different variables of a heterogeneous set of observation data (Coucheney et al., 2015). Spearmanťs correlation coefficient, ρ, was used to investigate the relationship between the nRMSE of the corresponding agronomic and environmental ecosystem variable.

Evaluation of agronomic and environmental fluxes and states
The models were calibrated only on phenology data, and model performance values of nRMSE (Table 1) ranged between 17% (model AGC) and 83% (ENGE). However, most models achieved relatively low nRMSE (∼30%) and were thus able to describe the observed BBCH well. Note that the variable BBCH was not available from results of DDC, Hydrus-1D, and HydrogeoSphere. For the observation period (August 2014-October 2018), the in-season variable LAI was overestimated by all individual models and the MMM for all four soils (e.g., Figures 2a and 2b). Large overestimation of LAI was especially visible for oats in 2018. Most crop models do not provide well-tested parameter values for spring oats, including their water deficit response, which may be the reason why dry conditions (early ripening) in 2018 resulted in this large deviation from observations. The variability in crop models for in-season variables (LAI, SWC, ET a , and NetQ) was generally large, and observations were, in part, outside of the 10th to 90th percentile range of the simulations. For example, in season 2016-2017 the cumulative simulated MMM and observed ET a and NetQ showed a discrepancy of 178 and −185 mm, respectively (see Figures 2f and 2h). However, the MMM captured the temporal dynamics of the  SWC well, despite the large range of simulated SWC of the individual models. Crop models generally underestimated the seasonal ET a and overestimated the seasonal NetQ (e.g., Figures 2f and 2h). The overestimation of the LAI, on the one hand, and the underestimation of ET a , on the other hand, suggest that the partitioning of ET a into its constituent components, actual evaporation (E a ) and actual transpiration (T a ), and thus the crop water stress conditions in terms of the ratio T a to potential transpiration (T p ), was perhaps insufficiently described by the crop models. The LAI and ET a are dependent variables, at least for models HER and MON, which means that a failure in simulating LAI consequently leads to a failure in simulating ET a . However, for some models ET is not linked to the crop, because T a is not calculated as a function of LAI directly (e.g., DDC, Hydrus-1D, and HydroGeoSphere). The discrepancy between simulated and observed ET a is highly relevant when estimating crop water use efficiency, and crop productivity (Cammarano et al., 2016).
The observations showed (see Figure 3a, blue lines) that ET a increased along the catena from the Calcaric Regosol (Gr_K, extremely eroded) towards the Colluvic Regosol (Hd_S, depositional). The opposite was found for NetQ ( Figure 3b, blue lines), where the amount of water that drained from the soil column was larger for the Calcaric Regosol than for the Colluvic Regosol. The dependency of water flux rates on the erosion-affected differences in the soil profiles can be related to the soil water storage capacities, which differ due to erosion or deposition processes. This is in agreement with the results reported by Herbrich, Gerke, Bens, and Sommer (2017), who showed for various Luvisols that the cumulative drainage increased with the degree of erosion-induced soil profile truncation.
The range in measured total ET a and NetQ between soil profile Gr_K and Hd_S was 2,046 and 2,548 mm for ET a and −509 and −94 mm for NetQ (negative flux = outflow of the soil profile). In comparison, simulated total ET a and NetQ ranged only between 2,124 (Gr_K) and 2,227 mm (Hd_S) and between −657 (Gr_K) and −534 mm (Dd_5), respectively.
Long-term soil erosion and deposition changed soil horizon depths (Bt or C nearer to the soil surface in case of erosion) and created new soil horizons (fertile, colluvial soil layers in depositional area). This results in changes of depth functions of soil properties, which are crucial for plant growth (e.g. texture, bulk density, continuity of the pore system, soil organic C, and nutrients). The magnitude of these changes is related to spatially distributed intensities of soil erosion by water and tillage (Gerke & Hierold, 2012), the distance to the ground water table (Rieckh et al., 2012), and their combined effect on the above-and belowground plant growth and development (Große, 1963;Li, Lobb, Lindstrom, & Farenhorst, 2007). As the simulation results indicate, the average range of the individual crop model simulations and the MMM cannot capture such complex interactions on the variables ET a and NetQ (see Figures 3a and 3b), when using the provided soil hydraulic properties and calibrating on phenology stages only. However, the field-scale heterogeneity of ET a and NetQ, which is related to soil spatial variability (e.g., by variations in structure and depth of soil horizons), needs to be considered when simulating crop growth in hummocky landscapes with erosionaffected pedogenesis. Interactions between erosion effects on the water fluxes can further result from heterogeneity of root distribution at specific landscape positions and might also affect solute transport (e.g., dissolved C; Gerke et al., 2016).

Vadose Zone Journal
TA B L E 2 Observed grain yield (GY) for each soil profile for winter wheat, winter rye, and oat and mean GY over the period from 2014 until 2018 A similar effect of erosion-affected soil profiles could be observed for the measured variable GY. The mean GY, which was calculated over all three crops (winter wheat, winter barley, and oat) from the observation period (2014)(2015)(2016)(2017)(2018), clearly shows an increase of mean GY from the Calcaric Regosol (Gr_K, 4.9 t ha −1 ) to the Colluvic Regosol (Hd_S, 8.5 t ha −1 , Table 2). A similar effect could be observed for individual vegetation periods of the winter crops in 2014-2015 and 2016-2017, which showed, in general, the highest GY for the depositional soil Hd_S (Table 2) and the smallest GY for the most eroded soil (Gr_K) along the idealized catena. This was also observed for seasonal measured maximum LAI (not shown). Earlier simulations by Gerke et al. (2016) showed that truncated soils at the hilltop achieved a reduced crop growth and an increased percolation in comparison to other soils. For winter wheat (2015-2016) observations on GY of the Colluvic Regosol (Hd_S, 7.9 t ha −1 ), Calcic Luvisol (Dd_1, 8.2 t ha −1 ), and Nudiargic Luvisol (Dd_5, 7.8 t ha −1 ) achieved similar values but exceeded the GY of the Gr_K (5.8 t ha −1 ) by far. The observed range of GY for winter wheat and winter rye agreed well with the range of observed GY from the region (Raatz et al., 2019;Schils et al., 2018;Verch, Kächele, Höltl, Richter, & Fuchs, 2009). The GY for oats in the dry year, 2018, was highest for the Calcic Luvisol (Dd_1, 3.9 t ha −1 ), followed by the Colluvic Regosol (Hd_S, 3.3 t ha −1 ), and the Nudiargic Luvisol (Dd_5, 2.9 t ha −1 ). The Calcaric Regosol showed also for oats the lowest GY (Gr_K, 2.0 t ha −1 ). Water stress during high temperature periods in June and July caused an earlier ripening of oats in 2018 (Dr. Verch,  Rezaei et al. (2018), who showed that a combined occurrence of heat and drought stress can lead to a significant decrease in GY.
The deviation of the simulated GY from the 1:1 line in Figure 4 clearly shows that most crop models could not capture how GYs were affected by different erosionaffected soils, as simulated differences in GY between the soil profiles and crops were relatively small for most models. Some of the crop models underestimated the impact of drought stress on simulated yields. For example, TH predicted hardly any crop water stress conditions in some years in terms of the ratio T a to T p (not shown here) for all soil profiles. The representation of these interactions between soil-and crop-related biophysical processes needs to be considered when applying agroecosystem models to landscapes with a high soil-spatial variability.
Differences between simulations and observations were also large for seasonal cumulative NetQ (hydrological years, see Figure 5). Most models had problems capturing observed seasonal cumulative NetQ. However, devi-ations between simulated and observed drainage are not surprising, as crop models were only calibrated on phenology and have rarely been tested with drainage data, simply because of a lack of data from most field experiments . The use of calibrated instead of measured soil hydraulic properties (Žydelis et al., 2018) could have reduced the deviation of the simulated from the measured NetQ ( Figure 5). This approach was not pursued here because a model-specific calibration of soil hydraulic parameters would have limited the comparability between the model results. However, we note that the method of obtaining knowledge about the soil hydraulic properties using the evaporation method might be sensitive to the simulated water fluxes for the following reasons: the intact soil cores used for this method represent only a relatively small volume of the soil horizon; soil hydraulic parameters were determined from the main drying curve of the soil water retention function, and thus dynamic changes of the soil pore structure were due to swelling and shrinkage or wetting and drying; and topsoil tillage operations were neglected. Earlier analyses of data from the same lysimeters showed by comparing laboratory-measured soil water retention curves with field-measured suction and water content data that nonequilibrium conditions in the soil water retention dynamics can be relevant, especially in the upper soil horizons . Nevertheless, incorporation of these complex processes was out of the scope of this study.
The nRMSE values of the agronomic and environmental ecosystem variables ranged between different crop models from 10 to 428%. The nRMSE values ranged from 117 to 410% for LAI, 10 to 148% for AgBio, 26 to 186% for GY, 36 to 388% for SWC, 50 to 129% for ET a , and 95 to 428% for NetQ. This shows that the errors in simulated agronomic ecosystem variables were, in general, much larger for the in-season variables (nRMSE = 36-428%) than for the end-of-season variables. AgBio and GY (nRMSE = 10-186%). As mentioned, the in-season variables LAI and NetQ showed the largest uncertainties. The partially large variability among crop model outputs reflects the differences in model structure and relationships, or simply the use of other model parameter values to parameterize the same process. The results agree with previous investigations for simulations of ET a (Kimball et al., 2019), phenology (Wallach et al., 2019), and yield (Asseng et al., 2013). For the agronomic ecosystem variables, different individual crop models, mostly from the EN model family, yielded the lowest nRMSE values, but for the environmental ecosystem-related variables, no individual crop model outperformed the others in terms of nRMSE. A comparison between Richards-and non-Richards-based models showed that non-Richards-based models achieved higher nRMSE values for SWC and NetQ. The overestimation of NetQ goes along with a lower plant water availability, which indirectly limits the root water uptake and yields lower T a . However, no clear effect on the nRMSE values could be seen for other variables (e.g., for ET a ), as here water fluxes might be compensated by the use of different root uptake functions and root distributions (Wu & Kersebaum, 2008), partitioning of ET a , as well as water stress, which is included in the root water uptake and heat stress functions. Hence, the results emphasize the importance of how soil physics and root water uptake are represented in the different crop models.
Overall, the MMM showed the most agreement between the corresponding observation and simulation, indicating that the MMM is the best predictor. Two models, which use similar routines to simulate water fluxes (by the use of the Richards equation) and also share the same crop growth scheme (SUCROS), are AGC and ENSU. Notably, both models show the same pattern of deviations from the measured GY, even though they differ in the absolute values (Figure 4). The outputs of both models differ with respect to LAI, AgBio, and SWC, particularly for the Calcaric Regosol (Gr_K; see Figure 6). For ET a and NetQ, the differences between the two models are less pronounced. Given that the two models are quite similar, this could only result from using different crop parameters, which also points to the fact that the simulation results are always affected by the inherently biased choice of model parameters selected by the user (Diekkrüger, Söndgerath, Kersebaum, & McVoy, 1995;Herbst et al., 2005). Outcome from different EN models (with the exception of ENGE) showed similar nRMSE values, with respect to environmental variables, for ET a , SWC, and NetQ. This reflects the relatively similar functionality and parameterization of the soil in the EN models. However, the use of different crop growth schemes (CERES, GECROS, SPASS, and SUCROS) in EN resulted in a larger spread of the model performance for agronomic variables (LAI, AgBio, and GY).   Figure 7 summarizes the individual model performance, pointing to the large differences among models and between the models and the MMM to predict agronomic and environmental ecosystem-related fluxes and states. For all responses, the equal-weighted mean nRMSE values of the MMM were lower than any individual average nRMSE over a specific crop model and soil. The better predictions for ET a and NetQ by the MMM, as compared with a particular crop model, were already reported for other agronomic variables in earlier studies (Asseng et al., 2013;Martre et al., 2015;Wallach et al., 2018).
The low predictability of the observed effects by the single crop models or the MMM within this model exercise might be related to the assumed maximum rooting depth of each soil, as root development in such a specific soil landscape depends on the erosion-induced soil profile modifications such as horizon thickness, vertical sequence of horizons, and structural development of the horizon . To account for the different potential rooting depths, the rooting depth was adjusted to the soil profile in a second step and ranged from 0.47 at the hilltop to 1.5 m at depression. Unfortunately, the adapted maximum rooting depth, which accounts for the erosion-induced soil profile modifications, did not lead to a significant improvement of the average nRMSE of the corresponding soil profile observations. Moreover, the nRMSE averaged over all soils of the MMM increased from 81 to 92%, when accounting for soil-specific maximum rooting depth. Comparing the nRMSE values for MMM showed that an adjusted maximum rooting depth led, on average, to an improved description of the in-season variables LAI, SWC, and NetQ, but at the same time to larger mismatch of the end-of-season variables AgBio and GY. The latter might be related to a more pronounced occurrence of water and nutrient stress during the growing season.

Relationship between the errors of agronomic and environmental fluxes and states
The analysis of the relationship between the model errors of different variables (Figure 8) revealed significant correlations between specific in-and end-of-season variables. We found significant positive correlations (>0.5) between end-of-season variables GY and AgBio and between the in-and end-of-season variables ET a and GY. Close correlations (>0.3 and <0.5) were identified between end-of-and in-season variables AgBio and LAI, and GY and BBCH, and between in-season variables LAI and BBCH (negative), BBCH and SWC (negative), SWC and ET a (positive), LAI and NetQ (negative), and ET and NetQ (positive). This is in line with results from Martre et al. (2015), who showed that end-of-season variables like AgBio were highly related to GY, whereas in-season variables like SWC and LAI were solely related to the total aboveground N and the N harvest index. The correlations between GY and ET a and between AgBio and LAI demonstrate that errors in simulating the most important end-of-season values are related to errors in simulating in-season growth processes. This suggests that in-season variables like LAI, SWC, NetQ, and especially ET a might be of high relevance to improve predictions of crop model simulations. This is in line with Specka, Nendel, and Wieland (2019), who showed that parameters of process-based agroecosystem models were sensitive during a specific period of the growing season. Including in-season observations in the calibration helps simulating and describing more realistically in-season growth processes, which finally lead to end-of-season values of AgBio and GY (Martre et al., 2015).

CONCLUSION
Our study used data from the TERENO-SOILCan lysimeter network site Dedelow to compare 12 agroecosystem models for simulating agronomic and environmental variables with limited calibration, based on phenology stages only. Simulation results and data for GY, AgBio, LAI, ET a , drainage (NetQ), and SWC were compared with data for a period from August 2014 until October 2018. The results of the crop model intercomparison suggest that the predictive capability of crop models is highly diverse for both agronomic and environmental variables. So far, combined analysis of model performance for different data representing agronomic and environmental variables has rarely been done in model comparison studies. The divergence between predicting LAI and ET a suggests that the partitioning of ET a into actual evaporation (E a ) and actual transpiration (T a ), which is of crucial importance for terrestrial water balance and for the prediction of future ecosystem feedbacks, is insufficiently described by most crop models. The weighable lysimeters observed interactions between crop yield, plant development, and water fluxes were largely controlled by erosion-affected soil states and related soil properties. This demonstrates that soil is important when attempting to describe both the agronomic and environmental fluxes and states. However, this can be tested only where drainage data are available, such as in this weighing lysimeter study.
Simulation results clearly showed the need to calibrate crop models on a much broader range of agronomic and environmental variables, because neither the individual crop models nor the MMM were able to predict the observed representative agronomic and environmental variables (e.g., LAI, GY, ET a , and NetQ) for the four lysimeters with truncated and colluvial soil profiles. Regardless of questions about model parameterization and model structure, the results suggest that the data on soil hydraulic properties, phenological crop stages, and boundary conditions are insufficient for calibrating crop models to represent field-scale heterogeneity in erosion-affected agricultural soil landscapes. In line with previous studies, the MMM approach was useful here, as it largely reduced the errors in predicting water fluxes.
Our analysis showed that: 1. The predictive capability of the models was highly diverse for simulating both crop development and environmental fluxes. 2. Soil does matter in agroecosystem models, and lysimeters provide such soil-related data for testing modeling of soil-vegetation-atmosphere processes. 3. Erosion-and deposition-induced changes in depth functions of soil properties are relevant in understanding biomass production, water fluxes, and soil states. 4. Differences between erosion-affected soils in crop yield, water fluxes, and states could not satisfactorily be described by individual models and the MMM when calibrated for crop BBCH only.
The results herald the need to account for spatial variability of soils within a detailed calibration procedure of the crop models when trying to improve predictions of agronomic and environmental fluxes and states. This especially holds true for erosion-affected soil landscapes of hummocky ground moraines. The varying thickness and sequence of soil horizons of the soil monoliths in the presently studied lysimeters reflect soil profiles that result from long-term effects of erosion and tillage, which is typical for intensively cultivated, post-glacial morainic soil landscapes.

A C K N O W L E D G M E N T S
We acknowledge the support of TERENO and SOIL-Can, which were funded by the Helmholtz Association (HGF) and the Federal Ministry of Education and Research (BMBF). The TERENO-SOILCan lysimeters in Dedelow are operated by the Leibniz Centre for Agricultural Landscape Research (ZALF) Müncheberg. Evelyn Wallor was additionally funded by the I4S Project within the BMBF BonaRes Program (031B0513I). Tobias Weber was financially supported by the Collaborative Research Center 1253 CAMPOS (Project 7: Stochastic Modeling Framework), funded by the German Research Foundation (DFG). We thank Jörg Haase, Dr. Gernot Verch, Ingrid Onasch, and Gudrun Buddrus for data collection and maintenance of the experimental setup at the ZALF Research Station Dedelow.
Open access funding enabled and organized by Projekt DEAL.