Combining visible near-infrared spectroscopy and water vapor sorption for soil specific surface area estimation

The soil specific surface area (SSA) is a fundamental property governing a range of soil processes relevant to engineering, environmental, and agricultural applications. A method for SSA determination based on a combination of visible near-infrared spectroscopy (vis-NIRS) and vapor sorption isotherm measurements was proposed. Two models for water vapor sorption isotherms (WSIs) were used: the Tuller–Or (TO) and the Guggenheim–Anderson–de Boer (GAB) model. They were parameterized with sorption isotherm measurements and applied for SSA estimation for a wide range of soils (N = 270) from 27 countries. The generated vis-NIRS models were compared with models where the SSA was determined with the ethylene glycol monoethyl ether (EGME) method. Three regression techniques were tested: partial least squares (PLS), support vector machines (SVM), and artificial neural networks (ANN). The effect of dataset subdivision based on EGME values on model performance was also tested. Successful calibration models for SSA TO and SSA GAB were generated, and their performance was nearly identical to that of the SSA EGME model. The performance of the models depended on the range and variation in SSA values. However, the comparison using selected validation samples indicated no significant differences in the estimated SSA TO, SSA GAB, and SSA EGME, with an average standardized RMSE (SRMSE = RMSE/range) of 0.07, 0.06, and 0.07, respectively. Differences among the regression techniques were small, yet SVM performed best. The results of this study indicate that the combination of vis-NIRS with the WSI as a reference technique for vis-NIRS models provides SSA estimations akin to the EGME method.


Vadose Zone J. 2020;19:e20007. wileyonlinelibrary.com/journal/vzj2 https://doi.org/10.1002/vzj2.20007

INTRODUCTION
The soil specific surface area (SSA) plays a crucial role in a wide range of soil processes, including the movement and retention of water, nutrient and contaminant dynamics, ion exchange reactions, microbial activity, heat transport, development of soil structure, and geotechnical soil behavior (Pennell, 2002; Petersen, Moldrup, Jacobsen, & Rolston, 1996). The SSA is expressed as surface area per unit mass of soil (m² g⁻¹). Depending on the organic and mineral composition and particle size distribution of the soil, the values of SSA can differ greatly (Pennell, 2002). In general, soils with elevated clay contents exhibit large SSA, whereas sandy soils have much smaller SSA (Petersen et al., 1996). Moreover, for a given sample, the measurement technique itself can affect the estimates of SSA. The techniques to measure SSA include both direct and indirect methods. Direct estimations are performed by measuring the size and shape of soil particles (Borkovec, Wu, Degovics, Laggner, & Sticher, 1993). Indirect techniques comprise gas-phase adsorption (N₂, CO₂, C₂H₆, C₂H₄, and C₂H₂) (de Jonge & Mittelmeijer-Hazeleger, 1996; de Jonge, de Jonge, & Mittelmeijer-Hazeleger, 2000; Kim, Yoon, & Bae, 2016) and retention of polar liquids such as water (Amali, Petersen, & Rolston, 1994; Arthur et al., 2018; Tuller & Or, 2005), ethylene glycol, ethylene glycol monoethyl ether (EGME) (Cerato & Lutenegger, 2002; Knadel et al., 2018; Petersen et al., 1996), and methylene blue (Hang & Brindley, 1970), with the EGME method being the most common (Pennell, 2002). Apart from water, polar liquid-based methods have weaknesses such as complicated measurement protocols, long measurement times, and environmental problems with chemical disposal (Heister, 2014).
Considering these limitations, the use of water to estimate SSA is a better alternative and has been previously applied (Newman, 1983; Puri & Murari, 1964). Estimation of SSA from water sorption or retention is often achieved by combining water vapor sorption isotherm (WSI) measurements with physically based (e.g., Tuller & Or, 2005) or empirical (e.g., Resurreccion et al., 2011) models. The isotherms represent the relationship between relative humidity (water activity) and the equilibrium soil-water content at a given temperature, obtained along an adsorption (wetting) or desorption (drying) path. Recent technological advances have led to faster, more detailed, and more reliable measurements of the WSI. Arthur, Tuller, Moldrup, and de Jonge (2014) reported the great potential of an automated vapor sorption analyzer (VSA) for soil exploration, including estimations of clay content and SSA, as well as solute percolation threshold and cation exchange capacity. To estimate SSA, WSIs are often used in conjunction with different modeling approaches.
The Brunauer-Emmett-Teller (BET) model is a monolayer approach to estimate SSA, usually applied in conjunction with gas adsorption (N₂ or other gases) (Brunauer, Emmett, & Teller, 1938), and works well but only for nonswelling soils (Khorshidi, Lu, Akin, & Likos, 2017). The Guggenheim-Anderson-de Boer (GAB) model is similar to the BET equation but accounts for multilayer molecules relative to the bulk liquid. It presents a good alternative to the BET model and was reported to be accurate for both natural and swelling soils (Arthur, Tuller, Moldrup, & de Jonge, 2016; Akin & Likos, 2014; Arthur et al., 2013; Khorshidi et al., 2017; Leão & Tuller, 2014; Tuller & Or, 2005). However, it fails to accurately describe the drier parts of the adsorption isotherms (Resurreccion et al., 2011).
Visible near-infrared spectroscopy (vis-NIRS) is another promising alternative technique for SSA estimation. It is a versatile and robust analytical technique with high repeatability and a demonstrated record of successful application to soil analysis. Vis-NIRS is based on the interaction of light with the soil sample under investigation. The output is a vis-NIR reflectance spectrum (400-2500 nm), represented as measured vis-NIR intensities vs. wavelength of electromagnetic radiation. The vis-NIR spectrum reflects the presence of chemical functional groups related to the mineral and organic composition of the sample and is thus relevant for the estimation of physical and chemical soil properties. It is a very efficient method (short measurement time and minimal sample preparation) that does not require chemicals and does not destroy the sample. From a single spectrum, multiple soil properties can be determined (Pasquini, 2003). The vibrational modes in the vis-NIR region are, however, weak and typically cause broad and overlapping absorption features. In order to assign specific features to specific chemical components, multivariate calibrations are used (Martens & Naes, 1989).
Different methods can be applied to correlate soil spectra with the soil constituents of interest. The most common include linear models such as principal component regression, partial least squares (PLS) regression, multiple linear regression, and stepwise multiple linear regression (Soriano-Disla, Janik, Viscarra Rossel, Macdonald, & McLaughlin, 2014). Nonlinear models include machine-learning techniques such as multivariate adaptive regression splines, artificial neural networks (ANN), regression trees, or support vector machines (SVM) (Viscarra Rossel & Behrens, 2010). Although PLS is the most prevalent technique in the soil spectroscopy literature, machine-learning algorithms have been reported to provide higher estimation accuracy for a range of soil properties (Viscarra Rossel & Behrens, 2010).
Extensive research efforts have been devoted in the last decade to using vis-NIRS in combination with multivariate techniques as a powerful means to overcome the time-consuming and often complicated classical analysis of both fundamental and functional soil properties (Katuwal et al., 2017; Knadel et al., 2016; Nocita et al., 2012; Paradelo et al., 2016; Pittaki et al., 2018, 2019; Viscarra Rossel et al., 2016). However, the application of vis-NIRS to SSA determination is still relatively rare. The few successful attempts to determine the SSA from vis-NIR spectra included the predictions of SSA obtained from the EGME method only (Ben-Dor & Banin, 1995; Ben-Dor, Heller, & Chudnovsky, 2008; Knadel et al., 2018).
To further investigate the applicability of vis-NIRS for SSA estimation, the objectives of this study are (i) to test the feasibility of vis-NIRS for SSA estimation, where the SSA is determined with the TO (SSA TO) and GAB (SSA GAB) models parameterized with WSIs measured with a VSA, using three types of regression techniques (PLS, ANN, and SVM); (ii) to compare the generated vis-NIRS models for SSA with models where SSA was estimated with the EGME method (SSA EGME); and (iii) to investigate the effect of dataset subdivision according to EGME values on the performance of the vis-NIRS models.

Table S1), others from different agroecological regions within a country, and some from the large soil database of the International Soil Reference and Information Centre (Wageningen). Further descriptions of individual samples and their properties, soil type, and sampling locations are provided in Supplemental Table S1.

Reference soil measurements
All soil samples were air dried and sieved to 2 mm prior to the analyses described below. After removal of organic matter and carbonates, particle size fractions were determined with a combination of wet sieving and pipette or hydrometer methods (Gee & Or, 2002). The soil organic C (SOC) was determined either by C oxidation at 1800 °C using an elemental analyzer with a thermal conductivity detector (Thermo Fisher Scientific) or by wet combustion using the Walkley-Black method (Nelson & Sommers, 1982). The SSA was determined in the laboratory via retention of EGME at monolayer coverage (Pennell, 2002) without organic C removal or ion saturation.

Water vapor sorption measurements
Soil WSIs were obtained with a fully automated VSA (METER Group). The VSA system dries and wets the air-dry sample (∼3.5 g soil) and measures the water potential using a chilled-mirror dewpoint method. The sample mass is automatically recorded during the drying and wetting process with a high-precision magnetic force balance (Arthur et al., 2013; Likos, Lu, & Wenszel, 2011). The isotherms were measured in dynamic dewpoint mode for adsorption and desorption for a water activity range from 0.03 to 0.93 and a temperature of 25 °C. The reference water content for all samples was calculated after oven drying at 105 °C for 48 h. For a detailed description of the VSA, interested readers are referred to Arthur et al. (2014).

Tuller-Or model
The physically based TO model (Equation 1) was parameterized with water adsorption data for the matric potential (ψ) range from −470 to −10 MPa (corresponding to the water activity range from 0.03 to 0.93). The TO model relates the equilibrium water content, M (kg kg⁻¹), to ψ (cm H₂O) and the SSA (m² kg⁻¹) as

$$M = \left[\frac{A_{svl}}{6\pi\rho_w g\psi}\right]^{1/3}\mathrm{SSA} \qquad (1)$$

where A_svl (J) is the Hamaker constant for solid-vapor interactions through the intervening liquid, ρ_w is the density of water (kg m⁻³), and g is acceleration due to gravity (m s⁻²). The value of A_svl was set to −6 × 10⁻²⁰ J, as suggested in Tuller and Or (2005) and Maček, Mauko, Mladenovič, Majes, and Petkovšek (2013).
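Because the TO model is linear in SSA, fitting it to an adsorption isotherm has a closed-form least-squares solution. The sketch below is an illustration only, not the fitting routine used in the study. It rests on several assumptions that are not stated in the text: water activity is converted to matric potential with the Kelvin equation, ψ is handled as a pressure in Pa (so that the product ρ_w g ψ for ψ expressed as a head reduces to ψ in Pa), the film thickness is multiplied by ρ_w to obtain a gravimetric water content, and the molar volume of water is taken as 1.8 × 10⁻⁵ m³ mol⁻¹. All function names are hypothetical.

```python
import math

A_SVL = -6e-20   # Hamaker constant (J), value quoted in the text
RHO_W = 1000.0   # density of water (kg m^-3), assumed
R_GAS = 8.314    # universal gas constant (J mol^-1 K^-1)
V_W = 1.8e-5     # molar volume of water (m^3 mol^-1), assumed


def matric_potential_pa(aw, temp_k=298.15):
    """Kelvin equation: water activity -> matric potential in Pa (negative)."""
    return R_GAS * temp_k / V_W * math.log(aw)


def film_thickness_m(psi_pa):
    """Water film thickness (m) for a matric potential given in Pa."""
    # Both A_SVL and psi_pa are negative, so the ratio is positive.
    return (A_SVL / (6.0 * math.pi * psi_pa)) ** (1.0 / 3.0)


def fit_ssa_to(aw_values, m_values):
    """Least-squares SSA (m^2 kg^-1) for M = rho_w * h(psi) * SSA (assumed form)."""
    f = [RHO_W * film_thickness_m(matric_potential_pa(aw)) for aw in aw_values]
    return sum(fi * mi for fi, mi in zip(f, m_values)) / sum(fi * fi for fi in f)
```

Dividing the result by 1000 gives the SSA in m² g⁻¹. In practice the fit would be restricted to the potential range over which the model describes the measured isotherm.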

Guggenheim-Anderson-de Boer model
The GAB model relates the water activity (a_w) to the equilibrium water content (M, kg kg⁻¹) via three model parameters (M₀, C_G, and K):

$$M = \frac{M_0 C_G K a_w}{(1 - K a_w)(1 - K a_w + C_G K a_w)} \qquad (2)$$

where M₀ (kg kg⁻¹) is the monolayer water content, C_G is an energy constant, and K represents the difference of free enthalpy of the water molecules in the pure liquid and the layers above the monolayer. Although the GAB model can be parameterized with both adsorption and desorption data, here we applied desorption data, because adsorption data are not always reproducible due to their sensitivity to initial water content, hydrophobicity, and stronger intermolecular forces than experienced for desorption (Johansen & Dunning, 1957; Lu & Khorshidi, 2015). The SSA GAB was calculated with Equation 3 (Newman, 1983; Quirk & Murray, 1999):

$$\mathrm{SSA}_{GAB} = \frac{M_0 N A}{w_M} \qquad (3)$$

where M₀ is the monolayer water content (kg kg⁻¹) from the GAB equation, N is Avogadro's number (6.02 × 10²³ mol⁻¹), A is the area covered by one water molecule (10.8 × 10⁻²⁰ m²), and w_M is the molecular weight of water (0.018 kg mol⁻¹).
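As a worked illustration of the two steps described above, the following sketch evaluates the standard three-parameter GAB isotherm and converts a monolayer water content to SSA using the constants quoted in the text. The function names are hypothetical, and the parameter values passed in would come from fitting the isotherm to desorption data, which is not shown here.

```python
N_AVOGADRO = 6.02e23   # mol^-1, as quoted in the text
AREA_WATER = 10.8e-20  # area covered by one water molecule (m^2)
MW_WATER = 0.018       # molecular weight of water (kg mol^-1)


def gab_water_content(aw, m0, c_g, k):
    """Equilibrium water content (kg kg^-1) from the GAB isotherm."""
    ka = k * aw
    return m0 * c_g * ka / ((1.0 - ka) * (1.0 - ka + c_g * ka))


def ssa_gab(m0):
    """SSA (m^2 kg^-1) from the GAB monolayer water content m0 (kg kg^-1)."""
    return m0 * N_AVOGADRO * AREA_WATER / MW_WATER
```

With these constants, each 0.01 kg kg⁻¹ of monolayer water corresponds to 36 120 m² kg⁻¹, that is, about 36 m² g⁻¹.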

Vis-NIRS measurements
Spectral measurements were performed in the visible and near-infrared range (400-2500 nm) with a NIRS DS2500 spectrophotometer (FOSS) in a temperature- and humidity-controlled room (temperature of 23 °C, humidity of 48%). Air-dried and 2-mm-sieved soil samples (∼50 g) were scanned in seven spots each through a quartz window of the sample holder. An average of the seven scans, expressed as an absorbance spectrum (Abs = log[1/R], where R is reflectance), was used further in the modeling phase.
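The conversion and averaging step can be sketched as follows. This is a minimal illustration under two assumptions not confirmed by the text: the logarithm is base 10, and each scan is converted to absorbance before the seven scans are averaged. The function names are hypothetical.

```python
import math


def absorbance(reflectance):
    """Convert a reflectance spectrum (values in (0, 1]) to Abs = log10(1/R)."""
    return [math.log10(1.0 / r) for r in reflectance]


def mean_spectrum(scans):
    """Average several spectra wavelength by wavelength."""
    n = len(scans)
    return [sum(column) / n for column in zip(*scans)]


def sample_spectrum(reflectance_scans):
    """Mean absorbance spectrum of one sample from its replicate scans."""
    return mean_spectrum([absorbance(scan) for scan in reflectance_scans])
```

Here `reflectance_scans` would hold the seven replicate scans of one sample, each a list of reflectance values over the same wavelengths.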

Datasets
Calibration models were generated to demonstrate the potential of vis-NIRS for SSA estimation for this diverse dataset and were based on the full dataset, as well as on datasets obtained after subsetting, where the distribution of SSA values was considered. Due to skewness in the SSA EGME values (almost 70% of the samples exhibited SSA EGME values < 100 m² g⁻¹), the data were divided into two subsets, with SSA EGME < 100 m² g⁻¹ (N = 180) and SSA EGME > 100 m² g⁻¹ (N = 90).
To ensure a representative selection of calibration sets for vis-NIRS modeling, a principal component analysis (PCA) was performed (Webster & Oliver, 2001) on the spectral data of each dataset considered above, and the Kennard-Stone algorithm (Kennard & Stone, 1969) was applied to the scores of the first three principal components. The algorithm was set to select 80% of the samples for calibration, with the remaining 20% assigned to a validation dataset. This resulted in a calibration and a validation set for the entire dataset including 216 and 54 samples, respectively (validation samples are marked in gray in Supplemental Table S1), and four subsets considering the SSA distribution: calibration (N = 144) and validation (N = 36) subsets for the set with SSA EGME < 100 m² g⁻¹, and calibration (N = 70) and validation (N = 20) subsets for the set with SSA EGME > 100 m² g⁻¹. To avoid pseudoreplicates in the calibration and validation subsets (in a few cases, samples spanning a gradient in SSA were obtained from a single field), all samples from such fields were kept in the calibration datasets.
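The selection step can be sketched with a plain implementation of the Kennard-Stone algorithm. This is an illustrative version, not the routine used in the study: it assumes the principal component scores have already been computed and are passed in as one tuple per sample, and it uses Euclidean distance. The function name is hypothetical.

```python
import math


def kennard_stone(scores, n_select):
    """Return indices of n_select samples (2 <= n_select <= len(scores))."""
    n = len(scores)
    dist = [[math.dist(a, b) for b in scores] for a in scores]
    # Start with the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in (first, second)]
    while len(selected) < n_select:
        # Add the candidate whose nearest selected sample is farthest away.
        nxt = max(remaining, key=lambda i: min(dist[i][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected
```

For the full dataset, `n_select` would be 216 of 270 samples, and the indices not returned would form the validation set.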

Multivariate data analysis
In order to derive information on soil constituents from the weak and broad absorptions in vis-NIR spectra, three types of regression techniques were used: PLS, ANN, and SVM. All three used the calibration samples to generate models for SSA determined by the TO and GAB methods and by the EGME method. Moreover, models for texture (clay, silt, and sand) and SOC were also generated (but only for the first calibration and validation approach on the entire dataset). All calibration models were trained with a single 10-fold venetian blinds cross-validation. In this cross-validation method, 10% of the data were withheld and used to validate the calibration model built on the remaining samples. This was repeated until all samples had been left out once. All calibration models were further validated with the independent validation sets. Modeling was performed with the Matlab PLS Toolbox 8.7 (Eigenvector Research).
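The venetian blinds scheme can be illustrated with a short index generator. The sketch assumes one common definition of the method, in which every tenth sample (in dataset order) is assigned to the same fold; the exact settings used in the toolbox are not given in the text, and the function name is hypothetical.

```python
def venetian_blinds_folds(n_samples, n_splits=10):
    """Return (train_indices, test_indices) pairs for venetian blinds CV."""
    folds = []
    for k in range(n_splits):
        # Fold k holds out samples k, k + n_splits, k + 2 * n_splits, ...
        test = list(range(k, n_samples, n_splits))
        held_out = set(test)
        train = [i for i in range(n_samples) if i not in held_out]
        folds.append((train, test))
    return folds
```

With 216 calibration samples and 10 splits, each fold withholds 21 or 22 samples, and every sample is withheld exactly once.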

Partial least squares regression
Partial least squares regression is one of the most commonly used regression methods and has produced satisfactory calibration results for a variety of soil constituents. It models both the X (spectra) and Y (soil constituent of interest) matrices simultaneously by compressing and regressing the data to find the latent variables (factors) in X that best predict the latent variables in Y. This regression technique reduces data dimensionality and noise, is computationally fast, and is suited to highly collinear predictor variables. Here, PLS with a noniterative partial least squares algorithm was applied (Martens & Naes, 1989; Wold, Sjöström, & Eriksson, 2001).

Artificial neural networks
An ANN is a framework for a range of machine-learning algorithms designed to imitate the way a brain performs different tasks. It is a group of three layers of interconnected nodes (artificial neurons). The three layers include input (here, vis-NIR spectra), hidden (a layer between the input and output), and output (the property to be predicted). The nodes from one layer are connected with the nodes from the adjacent layer with a strength referred to as a weight. Each input within one layer is multiplied by a corresponding weight and is passed through an activation function in the hidden layer to produce an output, which is then used as an input to the next layer. The weights are optimized through a training procedure performed on a calibration set (Goldshleger, Chudnovsky, & Ben Dor, 2012). A feedforward ANN trained with backpropagation, which aims at minimizing the network error, was used. It finds the optimal number of iteration cycles by choosing the lowest RMSE of cross-validation based on the training dataset and iteration values (here, 1-20) (Rumelhart, Hinton, & Williams, 1986). To shorten the computation time, the vis-NIR spectra were compressed using PLS regression to three principal components. An ANN with two nodes in the first layer was then trained on the principal component scores.
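The forward pass of such a small network can be written in a few lines. The sketch below assumes a hyperbolic tangent activation in the hidden layer and a linear output node, neither of which is specified in the text, and it omits training by backpropagation. The weights and biases are placeholders that a training procedure would supply, and the function name is hypothetical.

```python
import math


def ann_forward(scores, w_hidden, b_hidden, w_out, b_out):
    """Predict one value from component scores with a single hidden layer."""
    # Each hidden node: weighted sum of the inputs plus bias, then tanh.
    hidden = [
        math.tanh(sum(w * x for w, x in zip(weights, scores)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    # Linear output node: weighted sum of the hidden activations plus bias.
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out
```

For the configuration described above, `scores` would hold three values per sample and `w_hidden` two weight vectors of length three.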

Support vector machines
Support vector machines are nonlinear kernel-based learning methods. Here, the Gaussian radial basis function kernel type was used. Support vector machine regression trains nonlinear data by mapping them into a multidimensional kernel space and derives optimal bounds for regression (Vapnik, 1995). It defines the loss function, which ignores errors situated within a given distance of the true value. Models are built with a smaller set of representative observations close to the regression boundary (support vectors) (Suykens & Vandewalle, 1999). This algorithm requires model optimization by adjusting two parameters: ε (used values: 1.0, 0.1, 0.01), which is the upper tolerance on prediction errors, and C (11 values from 10⁻³ to 100, spaced uniformly on the log scale), which determines the tradeoff between the model complexity and the degree to which deviations larger than ε are tolerated.

The performance of the regression models was evaluated using the RMSE of cross-validation, the RMSE of prediction, and the R². Due to differences in the SSA range resulting from the use of different determination methods (TO or GAB model and the EGME method), the standardized RMSE was additionally calculated as SRMSE = RMSE/range, to enable the comparison between the performance of different models.
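The error measures can be computed as in the following sketch. It assumes that "range" refers to the difference between the largest and smallest reference (measured) value in the set being evaluated, which the text does not state explicitly. The function names are hypothetical.

```python
import math


def rmse(measured, predicted):
    """Root mean square error between reference and predicted values."""
    squared = [(m - p) ** 2 for m, p in zip(measured, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def srmse(measured, predicted):
    """Standardized RMSE: RMSE divided by the range of the reference values."""
    return rmse(measured, predicted) / (max(measured) - min(measured))
```

Because the SRMSE is dimensionless, it allows models whose reference values span different ranges, such as SSA TO and SSA EGME, to be compared on a common scale.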

Soils
The investigated samples represent a wide range of soil types (Figure 1), with clay contents ranging from 1 to 95% and sand contents ranging from 0 to 96%. The samples covered both mineral and more organic soils, with some containing >8% organic C (Table 1). Because of the diverse geographic origin of the considered soils, distinct differences in mineralogy are also expected. This high variability in soil properties resulted in a wide range of SSA EGME values (6-445 m² g⁻¹) (Table 1).

Note. The first value is for the entire dataset (N = 270), the first value in brackets is for the calibration dataset (N = 216), and the second value in the brackets is for the validation dataset (N = 54).
a SSA TO, soil specific surface area (SSA) determined from vapor sorption isotherms using the Tuller-Or model; SSA GAB, SSA determined from vapor sorption isotherms using the Guggenheim-Anderson-de Boer model; SSA EGME, SSA determined using the ethylene glycol monoethyl ether method; SOC, soil organic C.
b Gen.stat, general statistics; Q1, the first quartile; Q3, the third quartile.

FIGURE 2 (a) Example of measured water vapor sorption isotherms for three samples with different soil specific surface areas (SSAs), (b) fit of the Tuller and Or (2005) model to the adsorption isotherms, and (c) fit of the Guggenheim-Anderson-de Boer (GAB) model to the desorption isotherms. The ethylene glycol monoethyl ether estimates of SSA for the high-, medium-, and low-SSA samples were 307, 111, and 45 m² g⁻¹, respectively. The numbers in the legend of Panels b and c are the SSA estimates in m² g⁻¹ from the two models

SSA had higher soil water sorption for any given water activity value. The fits of the TO model (fitted to the adsorption isotherms) and the GAB model (fitted to the desorption isotherms) reflect the same behavior for the three soils (Figures 2b and 2c). The GAB model predicted water content well, regardless of the soil type and water activity value (Figure 2c), whereas the TO model (Figure 2b) described the adsorption isotherms well only up to −200 MPa for the soils with the medium (SSA EGME = 111 m² g⁻¹) and low (SSA EGME = 45 m² g⁻¹) SSA values, and up to −120 MPa for the soil with the highest SSA value (SSA EGME = 307 m² g⁻¹). Above these thresholds, a clear overprediction can be observed. This is in line with previous findings, where the TO-predicted water contents were up to 50% higher than foreseen and were attributed to higher errors for the finer-textured soils (Arthur et al., 2013; Resurreccion et al., 2011). This is perhaps the reason why the SSA TO estimated for large-surface-area samples was less than the SSA GAB and SSA EGME, with the maximum SSA values obtained being 374, 428, and 445 m² g⁻¹, respectively.

FIGURE 3 Relationships between soil specific surface areas (SSAs) determined with the ethylene glycol monoethyl ether (EGME) method, with SSA derived with the Tuller-Or (TO) and Guggenheim-Anderson-de Boer (GAB) models, for (a) the entire dataset (N = 270), (b) the subset with SSA EGME < 100 m² g⁻¹ (N = 180), and (c) the subset with SSA EGME > 100 m² g⁻¹ (N = 90). For all sets, p < .001 was reported for the regression analyses
The correlations of the SSA TO and SSA GAB values with the SSA EGME values were very high (R² = .95 and .96, respectively) (Figure 3a). As discussed above, the TO model does not work optimally for soils with high SSA values and thus started deviating from the SSA EGME values at ∼150 m² g⁻¹ (Figure 3a).
Due to skewed SSA EGME values, the dataset was further divided into two subsets: SSA EGME < 100 m² g⁻¹ (N = 180) and SSA EGME > 100 m² g⁻¹ (N = 90) (Supplemental Tables S2 and S3). In general, lower correlations of the SSA TO and SSA GAB values with the SSA EGME values were observed after subsetting than with the full dataset (Figures 3b and 3c). For the subset with SSA EGME values < 100 m² g⁻¹, the SSA values estimated with both the TO and GAB models were larger than the values obtained with the EGME method. In turn, for the subset with values > 100 m² g⁻¹, the TO model yielded lower values than the EGME method. Moreover, for the subset with SSA EGME values > 100 m² g⁻¹, higher correlations with SSA EGME (R² of .88 and .92 for SSA TO and SSA GAB, respectively) were observed than for the subset with SSA EGME values < 100 m² g⁻¹ (R² of .73 and .60 for SSA TO and SSA GAB, respectively) (Figures 3b and 3c).

Full dataset
For the full dataset, the best vis-NIRS calibration models for SSA TO and SSA GAB exhibited identical estimation accuracy to the SSA EGME model (SRMSE = 0.10) (Figure 4). The best texture and SOC models obtained here had lower precision for calibration (average SRMSE of 0.18 for texture and 0.13 for SOC estimations) than the SSA models (Supplemental Table S4). Among the three regression techniques, SVM was the most accurate for estimating SSA TO, clay, silt, sand, and SOC, and PLS generated the best results for SSA GAB and SSA EGME, whereas ANN showed the lowest estimation accuracy of the calibration models (Figure 4, Supplemental Table S4).
The independent validation of the developed calibration models for the three SSA estimates reflected the accuracy of the calibration models (Figure 5). The validation results from the best calibration model for SSA TO (SRMSE = 0.08) slightly outperformed those for SSA GAB (SRMSE = 0.10) but were similar to those of the SSA EGME model (SRMSE = 0.09). High R² values (>.89) were obtained for all SSA estimations. The SSA EGME validation results exhibit higher accuracy than those obtained by Knadel et al. (2018) (SRMSE = 0.13), who validated with a set representing a smaller range of SSA EGME values (4-116 m² g⁻¹) and a lower SD (27) than the validation set used in this study (range: 10-392 m² g⁻¹, SD = 94) (Table 1).
When comparing the performance of these models with the best texture and SOC validation results (Supplemental Table  S4), the SSA models showed better estimation accuracy than the texture and SOC models, which had average SRMSEs of prediction of 0.49 and 0.66, respectively.

Subsets according to SSA EGME values
To test how the range of the SSA affects the performance of SSA models, the same modeling analysis was performed on the sets with SSA EGME < 100 m² g⁻¹ (N = 180) and SSA EGME > 100 m² g⁻¹ (N = 90) (Supplemental Tables S2 and S3). Detailed results of calibration and validation for both subsets are presented in Supplemental Figures S1-S4. Both calibration and validation results for the best SSA models of the subset with SSA EGME values < 100 m² g⁻¹ exhibited much lower estimation accuracy, with higher SRMSE (0.18 on average) and lower R² values (.22-.45), than for the set with SSA EGME values > 100 m² g⁻¹ (on average, SRMSE of 0.14 and R² values between .63 and .78). The discrepancies in model performance for the two subsets can be related to the effects of variation in the SSA values themselves, the organo-mineral composition, and their interactions. The subset with SSA EGME values > 100 m² g⁻¹ presents higher standard deviations (Supplemental Table S3), and this was previously related to elevated R² values (Stenberg, 2010). Additionally, this subset includes samples with the highest clay contents (on average, 51%), whereas the subset with SSA EGME values < 100 m² g⁻¹ includes soils with an average clay content of 21%. Higher clay content results in more pronounced absorptions from molecular bonds related to clay minerals, but also to SOC (Stevens, Nocita, Tóth, Montanarella, & Van Wesemael, 2013), the two soil properties greatly affecting SSA.

FIGURE 4 Visible-near-infrared spectroscopy calibration results (N = 216) for the soil specific surface area (SSA) presented as predicted (cross-validation [cv]) vs. measured for the Tuller-Or (TO) and Guggenheim-Anderson-de Boer (GAB) models, and the ethylene glycol monoethyl ether (EGME) method generated using partial least squares (PLS), artificial neural networks (ANN), and support vector machine (SVM) regression techniques. SRMSE = RMSE/range

FIGURE 5 Visible-near-infrared spectroscopy validation results (N = 54) for the soil specific surface area (SSA) presented as predicted versus measured for the Tuller-Or (TO) and Guggenheim-Anderson-de Boer (GAB) models and the ethylene glycol monoethyl ether (EGME) method generated using partial least squares (PLS), artificial neural networks (ANN), and support vector machine (SVM) regression techniques. SRMSE = RMSE/range. RMSEP, RMSE of prediction

Thus, we found improved SSA model performance for the set with SSA EGME values > 100 m² g⁻¹, which was also characterized by a higher clay content. In contrast, the subset with SSA EGME values < 100 m² g⁻¹ represented mostly sandy soils (average sand content of 49%), for which signals from clay minerals in the vis-NIR range were weak. Moreover, high sand content increases light scattering and was reported to have a negative effect on the performance of SOC models (Stenberg, 2010; Stevens et al., 2013). Knadel et al. (2018) showed that, aside from the differences in texture and SOC content, the complexation status of SOC also affects the vis-NIRS estimation of SSA. The subset with SSA EGME values > 100 m² g⁻¹ represents soils with clay/SOC ratios (n) above the value of n = 10 that Dexter et al. (2008) defined as the capacity of clay to complex SOC (Supplemental Figure S5), meaning that the clay in these soils is not saturated with SOC and all SOC is in complexed form. The subset with SSA EGME values < 100 m² g⁻¹, in turn, represents soils with both noncomplexed and complexed forms of SOC (n < 10 and n > 10) (Supplemental Figure S5). Therefore, the mineral surfaces of the samples with noncomplexed SOC have the potential to be coated with SOC, which can potentially mask a portion of the SSA. This, together with the fact that both complexation forms were present, as well as the above-listed confounding effects of other soil constituents (such as different clay mineralogy and the negative effect of sand fractions) and the range of SSA values, potentially led to degraded SSA models for the subset with SSA EGME values < 100 m² g⁻¹.
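The complexation argument can be made concrete with a short calculation. The sketch assumes the partition commonly attributed to Dexter et al. (2008), in which clay can complex SOC up to clay/n with n = 10 and any SOC beyond that amount is noncomplexed; the function names and the example inputs are hypothetical and are not data from this study.

```python
def clay_soc_ratio(clay_pct, soc_pct):
    """Clay/SOC ratio n for one sample (both contents in %)."""
    return clay_pct / soc_pct


def noncomplexed_soc(clay_pct, soc_pct, n=10.0):
    """SOC (%) in excess of the amount the clay can complex (clay / n)."""
    return max(0.0, soc_pct - clay_pct / n)
```

For a hypothetical sample with 51% clay and 2% SOC, the ratio is 25.5 and no SOC is left noncomplexed, whereas a sample with 10% clay and 2% SOC has a ratio of 5 and 1% noncomplexed SOC.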
The SRMSE values obtained from the calibration models for each regression technique (PLS, ANN, SVM) for the full dataset and SSA subsets, and for each measure of SSA, are presented in Figure 6. For the models based on the full dataset, PLS resulted in the lowest errors for SSA GAB and SSA EGME estimation, whereas SVM provided the best estimates for SSA TO. After subsetting the data according to SSA EGME values, SVM performed better than the two remaining techniques for all three SSA estimates. Thus, on average, SVM resulted in higher accuracy. This points to an advantage of using machine-learning techniques over PLS and is in line with other studies where machine-learning algorithms such as SVM outperformed PLS regression for soil property determination (Goldshleger et al., 2012; Kuang, Tekin, & Mouazen, 2015; Morellos et al., 2016; Tekin, Zeynal, & Mouazen, 2011; Viscarra Rossel & Behrens, 2010). However, the differences in the values of SRMSE among the different techniques were small and dependent on the dataset.
FIGURE 6 Comparison of the SRMSE (RMSE/range) for the (a) Tuller-Or (TO), (b) Guggenheim-Anderson-de Boer (GAB), and (c) ethylene glycol monoethyl ether (EGME) estimates of soil specific surface area (SSA) based on visible-near-infrared spectroscopy modeling results using calibration datasets for all samples (N = 54), samples with SSA EGME < 100 m² g⁻¹ (N = 36), and samples with SSA EGME > 100 m² g⁻¹ (N = 20), generated with partial least squares (PLS), artificial neural networks (ANN), and support vector machine (SVM) regression techniques

The comparison of results based on different validation datasets is somewhat problematic. Each of these sets consisted of different samples, a different total number of samples, and samples covering different ranges of SSA values. Therefore, even when using a standardized error, the comparison is not optimal. Thus, in order to perform a fair comparison, common validation samples existing in the validation set for the full dataset (N = 54), as well as in one of the validation sets for the subsets with SSA EGME < 100 m² g⁻¹ (N = 36) and SSA EGME > 100 m² g⁻¹ (N = 20), were extracted. In total, 28 common samples were found, and their estimations from the three calibration approaches were compared (Figure 7).
FIGURE 7 Comparison of model performance (standardized root mean square error [SRMSE] = RMSEP/range and R²) for the (a) Tuller-Or (TO), (b) Guggenheim-Anderson-de Boer (GAB), and (c) ethylene glycol monoethyl ether (EGME) estimates of soil specific surface area (SSA) based on visible-near-infrared spectroscopy calibration models for common validation samples (N = 25) occurring in the full dataset (full, black circle) and subsets according to EGME values (sub, open triangle)

In general, higher estimation accuracy was obtained after subsetting the data, with the greatest improvement seen for SSA GAB (SRMSE of 0.07 and R² of .92 before subsetting, and SRMSE of 0.04 and R² of .95 after subsetting). Nevertheless, there were no significant differences between the subsetting methods when the differences between the reference values and the predicted SSA values were compared for each SSA estimate (Mann-Whitney rank sum test, P = .617 for SSA TO, P = .7 for SSA GAB, and P = .8 for SSA EGME). Moreover, the estimation accuracy of vis-NIRS models for the SSA obtained by the two WSI models (TO and GAB, with an average SRMSE of 0.07 and 0.06, respectively) for the 28 common samples was nearly identical to that of vis-NIRS models for SSA EGME (average SRMSE of 0.06).
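A rank sum comparison of this kind can be sketched as below. This is a simplified illustration, not the statistical routine used in the study: it computes the Mann-Whitney U statistic with average ranks for ties and a two-sided P value from the normal approximation, without tie or continuity corrections. The function name is hypothetical, and the two inputs would be the prediction errors of the common validation samples from the full-dataset model and from the subset models.

```python
import math


def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided P value by normal approximation)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over a run of tied values and give them the average rank.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2.0 + 1.0
        i = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u_x - mean_u) / sd_u
    return u_x, math.erfc(abs(z) / math.sqrt(2.0))
```

A large P value, as reported above, indicates that the error distributions of the two calibration approaches cannot be distinguished by this test.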

CONCLUSIONS
In this study, vis-NIRS combined with different modeling techniques (PLS, ANN, and SVM) was applied to estimate SSA determined with two WSI-based models (SSA TO and SSA GAB) for a heterogeneous soil sample set. The vis-NIRS SSA estimates were successful and indicated a similar estimation ability to a vis-NIRS model of SSA determined with the often-used EGME method. Furthermore, the performance of the models was mainly dependent on the range and variation in SSA values, as well as the organo-mineral composition and its interactions. However, for common validation samples, no significant differences were found between the performance of calibration models based on the entire dataset and those based on the subsets defined by SSA EGME values. Moreover, in most cases, the application of the SVM technique in the vis-NIRS modeling resulted in the best performance, yet the differences among the three types of regression techniques tested were small.
The elevated interest in SSA, which governs numerous soil processes and behaviors, calls for rapid, more accurate, and repeatable alternative methods for its determination. Given the results from this study, we suggest a combination of vis-NIRS, known for its reliable results, and the WSI as a reference technique for training vis-NIRS models, which does not involve the use of chemicals and provides SSA estimations similar to the EGME method. Although no significant differences in the estimation of SSA from vis-NIRS based on TO and GAB models have been observed, we recommend the use of the latter, as it is known to predict water contents well regardless of the soil type and water activity value. The performance of regression techniques applied, as well as spectral preprocessing methods, is usually dataset dependent, and we suggest testing different methods including both linear and nonlinear techniques to find the best option for the dataset under consideration.