Characterization of Select Wild Soybean Accessions in the USDA Germplasm Collection for Seed Composition and Agronomic Traits

The relatively low genetic variation of current US soybean [Glycine max (L.) Merr.] cultivars constrains the improvement of grain yield and other agronomic and seed composition traits. Recently, a substantial effort has been undertaken to introduce novel genetic diversity present in wild soybean (Glycine soja Siebold & Zucc.) into elite cultivars, in both public and private breeding programs. The objectives of this research were to evaluate the phenotypic diversity within a collection of 80 G. soja plant introductions (PIs) in the USDA National Genetic Resources Program and to analyze the correlations between agronomic and seed composition traits. Field tests were conducted in Missouri and North Carolina during 3 yr (2013, 2014, and 2015) in a randomized complete block design. The phenotypic data collected included plant maturity date, seed weight, and the seed concentration of protein, oil, essential amino acids, fatty acid, and soluble carbohydrates. We found that genotype was a significant (P < 0.0001) source of variation for maturity date, seed weight, seed protein and amino acids, seed oil and fatty acids, and seed carbohydrates, and significant correlations were observed between numerous traits. The G. soja PIs generally had lower seed weight, higher seed contents of protein, linolenic acid, raffinose, and stachyose, and lower seed contents of oil and oleic acid than the cultivated soybean G. max lines. The information and data collected in this study will be invaluable in guiding soybean breeders and geneticists in selecting promising G. soja PIs for research and cultivar improvement. T. La, E. Large, H.T. Nguyen, G. Shannon, and A. Scaboo, College of Agriculture, Food & Natural Resources; Division of Plant Sciences, Univ. of Missouri, Columbia, MO 65211; E. Taliercio, USDA-ARS, Soybean and Nitrogen Fixation Research Unit, Raleigh, NC 27607; Q. 
Song, USDA-ARS, Soybean Genomics and Improvement Lab., Beltsville Agricultural Research Center, Beltsville, MD 20705; J.D. Gillman, USDA-ARS, Plant Genetics Research Unit, Univ. of Missouri, Columbia, MO 65211; D. Xu, Dep. of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, Univ. of Missouri, Columbia, MO 65211. Received 29 Aug. 2017. Accepted 20 Aug. 2018. *Corresponding author (scabooa@missouri.edu). Assigned to Associate Editor Zenglu Li.
Abbreviations: ELSD, evaporative light scattering detection; GRIN, Genetic Resources Information Network; GWAS, genome-wide association study; HPLC, high-performance liquid chromatography; LD, linkage disequilibrium; MG, maturity group; PI, plant introduction; QTL, quantitative trait locus; SNP, single nucleotide polymorphism.
Published in Crop Sci. 58:1–19 (2019). doi: 10.2135/cropsci2017.08.0514 © Crop Science Society of America | 5585 Guilford Rd., Madison, WI 53711 USA. This is an open access article distributed under the CC BY license (https://creativecommons.org/licenses/by/4.0/). Published online October 25, 2018.

of soybean to self-pollinate combined with the narrow genetic base of North American public soybean cultivars creates a need for genetically diverse germplasm to allow for improvement of agronomic and seed quality traits.
To address the problem of narrow genetic diversity in soybean, the USDA maintains a soybean germplasm collection of 1168 wild soybean (Glycine soja Siebold & Zucc.) accessions and 18,480 G. max accessions (Song et al., 2015). Crop relatives and exotic germplasm are important genetic resources for improving agricultural productivity, yet wild soybean has been largely underused in breeding efforts focused on broadening the narrow genetic background of cultivated soybeans. Wild soybean germplasm utilization is limited in breeding programs because potentially beneficial genes controlling valuable agronomic and seed composition traits are often genetically linked and co-inherited with undesirable traits such as prostrate growth habit, hard seed coat, low seed quality, and seedpod shattering (Concibido et al., 2003).
Utilization of >1100 accessions in the USDA wild soybean germplasm collection for soybean breeding is unmanageable and impractical for public and private breeders using conventional breeding techniques and marker-assisted selection. Although the use of marker-assisted selection has increased the utility of wild soybean to breeders, the undesirable agronomic traits of wild soybean germplasm can be avoided during population development by backcrossing with elite cultivars and by evaluating large segregating populations (Ertl and Fehr, 1985; Carpenter and Fehr, 1986; LeRoy et al., 1991; Sebolt et al., 2000; Kabelka et al., 2006; Zhang and Huang, 2011; Akpertey et al., 2014; Shivakumar et al., 2016). To address this problem, Frankel and Brown (1984) suggested the establishment of a core collection with a limited number of accessions derived from the original collection, representing approximately 10% of the full collection. The selected core collection should represent the genetic diversity of the original collection with the lowest number of redundant accessions. A core collection is easier to evaluate and more efficient to use. Core collections were successfully developed with multiple crops including maize (Zea mays L.), rice (Oryza sativa L.), wheat (Triticum aestivum L.), and peanut (Arachis hypogaea L.) (Holbrook et al., 2000; Coimbra et al., 2009; Bordes et al., 2011; Liu et al., 2015). Soybean core collections exist in East Asia (Qiu et al., 2013) and Brazil (Priolli et al., 2013). Domesticated soybean core collections composed of a portion of the 18,480 USDA G. max accessions have also been developed using the standard 10% selection threshold (Oliveira et al., 2010). Even smaller mini-core collections that represent the most diverse 1% of the accessions have been developed for multiple crops including maize, rice, wheat, and peanut (Holbrook et al., 2000; Coimbra et al., 2009; Bordes et al., 2011; Liu et al., 2015). However, to our knowledge, there are no published core collections derived from the USDA G. soja collection.
There is a substantial body of literature on soybean seed composition (Bellaloui et al., 2009; Medic et al., 2014; Yu et al., 2016; Lee et al., 2017). Even so, there have been few studies about seed quality and composition profiles within wild soybean germplasm collections, particularly the seed content of soluble sugars and amino acids (Takahashi et al., 2003; Krishnan, 2005; Wang et al., 2015; Warrington et al., 2015). Soybean seed protein is valuable because it contains all of the essential amino acids for human and animal consumption; however, soybean seed has relatively low contents of the S-containing amino acids cysteine and methionine (George and De Lumen, 1991). Sucrose, fructose, and glucose induce sweet taste and are easily digestible, whereas raffinose and stachyose are indigestible by monogastric animals and cause digestion problems such as flatulence and diarrhea (Hou et al., 2009; Kumar et al., 2010). Hence, increasing S-containing amino acids and the soluble sugars sucrose, glucose, or fructose while reducing stachyose and raffinose content in soybean seed is an important objective for improving soybean seed quality (Yu et al., 2016).
Improving soybean seed oil content and the respective fatty acid profile of the oil is also an important objective in many breeding programs. Hydrogenation of soybean oil has been used to improve oil stability by reducing the number of double bonds in polyunsaturated fatty acid molecules (Yadav, 1996). This process increases the cost of oil and produces trans fats, which are associated with increased risk of heart disease, stroke, and diabetes (Mozaffarian and Rimm, 2006). Oleic acid is more oxidatively stable, and oil with high oleic acid content is desirable for various applications such as cooking oil and biofuel. Therefore, two of the goals of soybean breeding to improve oil quality are (i) to reduce the content of linolenic acid and (ii) to increase the content of oleic acid (Lee et al., 2007; Yoon et al., 2009).
The entire USDA G. soja collection has been genotyped, and a diverse collection of genetically and phenotypically defined wild soybeans would be useful to identify and utilize accessions with favorable agronomic and seed quality traits. Even so, only one recent study examined the phenotypic variation of primarily Korean and Japanese wild soybean seed compositions for protein, oil, and five fatty acids from the USDA G. soja collection (Leamy et al., 2017).
This study focuses on characterizing the phenotypic variation of maturity, seed weight, and seed composition, including total protein and oil, amino acids, fatty acids, and soluble carbohydrates, of accessions in a G. soja collection representing the majority of the known single nucleotide polymorphism (SNP) diversity within the entire USDA G. soja collection. The objectives of this study were to characterize agronomic and seed composition traits in a genetically diverse collection of wild soybean plant introductions (PIs) in replicated, multienvironment field experiments, and to make these data available to other soybean breeders and researchers for cultivar development and genetic studies through the Genetic Resources Information Network (GRIN) maintained by the USDA. In addition, we evaluated the correlations between these traits and identified genomic regions associated with seed composition and agronomic traits in a genome-wide association study (GWAS).

Plant Materials
The USDA soybean collection includes 1168 G. soja PIs from China, Korea, Japan, and Russia (www.ars-grin.gov), and the majority were previously genotyped with the SoySNP50K BeadChips (Song et al., 2013, 2015). Analysis of the pairwise genetic distances among the G. soja accessions based on 42,509 SNPs showed that a total of 806 G. soja accessions from China, Korea, Japan, and Russia were nonredundant (Song et al., 2015). Thus, a total of 80 G. soja PIs (Supplemental Table S1), which is approximately 10% of the total number of nonredundant G. soja accessions in the collection, were chosen to represent maximal diversity. The 806 accessions were clustered into a predefined number of clusters based on their genetic distances, and one accession from each cluster was selected to form a core set. The PIs have maturity group (MG) assignments ranging from MG 000 to MG X, with nearly half of the collection consisting of MG V lines (www.ars-grin.gov). The geographic range of the lines is broad, consisting of lines from eastern China (19 PIs), Japan (22 PIs), eastern Russia (11 PIs), and South Korea (28 PIs) (www.ars-grin.gov). Seeds were obtained from the USDA Soybean Germplasm Collection via GRIN (www.ars-grin.gov). Eight G. max cultivars were planted in all Missouri location-years as checks (Supplemental Table S2). The maturity of these checks ranges from MG 0 to MG VII. Because PI 245331 had a late maturity (MG X) assignment, this genotype was not harvested in any environment and was excluded from further analysis; thus, 79 PIs were characterized for agronomic and seed quality traits.
Carolina, 79 and 64 PIs were used, respectively, in the analysis due to a lack of seed production of PI 245331. In Missouri, all genotypes were planted in single-row plots of 2.43-m length, plot spacing was 1.22 m, and spacing was 1.52 m between rows. At the Novelty location, seeds were sown at the rate of 30 seeds m⁻¹. At other locations, the seeds were sown at the rate of 20 seeds m⁻¹. Plots were seeded using a four-row ALMACO cone planter with Kinze row units. Wild soybean seeds have hard seed coats and possess extended dormancy periods (late germination), so seeds were scarified before planting by using a razor blade to make a small incision in the seed coat on the side opposite the hilum. Lines were planted in a randomized complete block design with three replicates at each location per year.
In North Carolina, seeds were planted with a funnel dropper in 2.43-m-long rows at 10 seeds m⁻¹. Seeds were scarified in a coffee mill with the blades replaced with a sandpaper disk using 10 1-s pulses. If the seeds were not visibly scarified, five more 1-s pulses were used. Lines were planted in a randomized complete block design with three replicates at all locations per year.

Measurement of Agronomic Traits
Plant maturity was recorded as the number of days between planting date and the date when approximately 95% of the pods had changed to mature pod color (R8) (Fehr et al., 1971). The maturity was determined for all Missouri plots at each location. Plants of the same plot were harvested together by hand and threshed by an ALMACO small bundle thresher at all locations. One hundred-seed weight was measured by randomly picking and weighing 100 seeds from each plot for all locations three times with replacement.

Crude Protein and Amino Acid Analysis
Approximately 9 g of soybean seeds from each plot were ground using a Thomas Wiley Mini-Mill (Thomas Scientific) and filtered with a 20-mesh screen. A Labconco freeze dry system was used to lyophilize the ground powder for 48 h. Samples containing approximately 3 g of ground seeds from each plot in all locations were sent to the University of Missouri Agricultural Experiment Station Chemical Laboratory, University of Missouri (Columbia), to determine the crude protein and amino acid contents. The seed protein and amino acid content (12 amino acids) were evaluated for two out of three replicates from each location.
Crude N was determined by combustion analysis (AOAC Official Method 990.03, 2006). The N content in a 200-mg subsample was measured using the Dumas method and a LECO truSpec model FP-428 N analyzer following the manufacturer's recommendations. The protein content of soybean seed was estimated by multiplying the total N concentration by 6.25.
The contents of 12 amino acids were measured by a single oxidation 4-h hydrolysis method (Gehrke et al., 1987). The 12 amino acids are alanine, aspartic acid, cysteine, glutamic acid, glycine, isoleucine, leucine, lysine, methionine, proline, threonine, and valine. Hydrolysis of the samples was performed using 6 M HCl for 4 h at 145°C, and the amino acid concentration was determined by cation exchange chromatography in a Beckman 6300 amino acid analyzer (Beckman Instruments).

Oil Analysis
Approximately 5 g of ground soybean seed was used to determine oil content with an XDS Rapid Content Analyzer (FOSS) and the ISIscan software (FOSS, 2005) at the University of Missouri's northern soybean breeding laboratory located at the Bay Farm Research Facility in Columbia, MO. A certified 80% reflectance reference was used to create the reference standard. The performance test was performed by running four segments 10 times and compiling the spectra.

Fatty Acid Analysis
The fatty acid profiles of total oil for each plot in Columbia and Novelty, MO, were evaluated at the University of Missouri's northern soybean breeding laboratory located at the Bay Farm Research Facility in Columbia, MO, using a previously described procedure (Yoon et al., 2009). The five fatty acids that were measured are palmitic acid (C16:0), stearic acid (C18:0), oleic acid (C18:1), linoleic acid (C18:2), and linolenic acid (C18:3). The fatty acid levels were determined as a percentage of the total fatty acids in soybean seeds. The oil in 0.2 g of ground soybean seed was extracted by placing the soybean seed powder in 2 mL of extraction buffer (chloroform/hexane/methanol [8:5:2, v/v/v]) for 12 h. One hundred microliters of the extract was transferred to vials containing 75 µL of methylating reagent (0.25 M methanolic sodium methoxide/petroleum ether/ethyl ether [1:5:2, v/v/v]). Extraction buffer was added to bring the sample volume to 1 mL. An Agilent Series 6890 capillary gas chromatograph with a flame ionization detector (275°C) and an AT-silar capillary column (Alltech Associates) was used. Standard fatty acid mixtures (Animal and Vegetable Oil Reference Mixture 1, AOCS) were used as calibration reference standards.

Sugar Analysis
The concentrations of glucose, fructose, sucrose, raffinose, and stachyose were determined at the University of Missouri's northern soybean breeding laboratory located at the Bay Farm Research Facility in Columbia, MO, using a high-performance liquid chromatography-evaporative light scattering detection (HPLC-ELSD) procedure (Valliyodan et al., 2015). Approximately 90 mg of lyophilized seed powder was mixed with 900 µL HPLC-grade water (Fisher Scientific) and incubated at 55°C with 250 rpm agitation for 30 min. After incubation, vials were vortexed, cooled down to room temperature, and blended with 900 µL HPLC-grade acetonitrile (Fisher Scientific). The suspension was centrifuged for 30 min at 13.3 × 1000g min⁻¹. The supernatant was diluted five times with an acetonitrile/water mixture (65:35, v/v). The Agilent 1200 Series HPLC-ELSD system was used with 250-mm × 4.6-mm Prevail Carbohydrate ES columns (5 µm) and 7.5-mm × 4.6-mm guard columns (Grace Davison Discovery Sciences). Sugar standards [D-fructose, D-(+) glucose, sucrose, D-(+) raffinose pentahydrate, and stachyose hydrate] were prepared in water with concentrations of 50, 100, 300, and 500 µg mL⁻¹ and run to generate a standard curve for prediction.

Statistical Analysis
Each location-year was considered as a single environment (Table 1). The ANOVA was performed by using PROC MIXED in SAS version 9.4 (SAS Institute, 2013). Genotype was treated as a fixed effect to test for significant genotypic differences among accessions for all traits. Environment was treated as a fixed effect to test for significant environmental differences for all traits. The heritability (h²) of each trait was calculated on an entry-mean basis following Nyquist and Baker (1991):

$$h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \dfrac{\sigma_{ge}^2}{t} + \dfrac{\sigma_e^2}{rt}}$$

where $\sigma_g^2$ is the variance among genotypes, $\sigma_{ge}^2$ is the variance of genotype × environment interaction, $\sigma_e^2$ is experimental error, t is the number of test environments, and r is the number of replications.
PROC CORR of SAS was used to determine significance and correlation coefficients between studied traits according to the mean value of individual genotypes across replications and locations.
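The entry-mean heritability calculation can be illustrated with a minimal Python sketch. It assumes the standard entry-mean form, h² = σ²g / (σ²g + σ²ge/t + σ²e/(rt)), with variance components already estimated elsewhere (for example, by a mixed model); the function name and the example variance components are hypothetical and are not values from this study.

```python
def entry_mean_heritability(var_g, var_ge, var_e, n_env, n_rep):
    """Entry-mean heritability from variance components.

    var_g  : variance among genotypes
    var_ge : genotype x environment interaction variance
    var_e  : experimental error variance
    n_env  : number of test environments (t)
    n_rep  : number of replications per environment (r)
    """
    if n_env <= 0 or n_rep <= 0:
        raise ValueError("n_env and n_rep must be positive")
    # Phenotypic variance of an entry mean across t environments and r reps
    var_p = var_g + var_ge / n_env + var_e / (n_rep * n_env)
    if var_p <= 0:
        raise ValueError("phenotypic variance must be positive")
    return var_g / var_p


# Hypothetical variance components, for illustration only
h2 = entry_mean_heritability(var_g=4.0, var_ge=2.0, var_e=6.0, n_env=2, n_rep=3)
print(f"entry-mean h2 = {h2:.2f}")
```

Because the interaction and error terms are divided by t and rt, adding environments or replications raises the entry-mean estimate even when the underlying variance components are unchanged.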

Genotyping and Quality Control
The genotypic data, including 42,509 SNP markers for all 79 genotypes, were downloaded from the SoyBase website (www.soybase.org). The information on these SNP markers was retrieved from the study of Song et al. (2013). In their study, the Illumina SoySNP50k iSelect BeadChip was used for genotyping. We filtered the genotypic data by removing SNPs not located on any of the 20 chromosomes, as well as those with missing rates >5% or with minor allele frequencies <5%. A total of 35,285 SNPs across 79 genotypes were used in the genome-wide association analysis for all observed traits in this study.
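The missing-rate and minor-allele-frequency filters can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline used in the study: genotypes are assumed to be coded as counts of one allele (0, 1, or 2) with `None` for a missing call, and the function names and toy data are hypothetical. The second function reproduces the Bonferroni cutoff reported for the GWAS (0.05 divided by the 35,285 retained SNPs).

```python
import math


def filter_snps(genotypes, max_missing=0.05, min_maf=0.05):
    """Return IDs of SNPs passing missing-rate and MAF filters.

    genotypes: dict mapping SNP ID -> list of allele counts (0, 1, 2),
               with None marking a missing call (assumed coding).
    """
    kept = []
    for snp_id, calls in genotypes.items():
        if not calls:
            continue
        observed = [g for g in calls if g is not None]
        missing_rate = (len(calls) - len(observed)) / len(calls)
        if not observed or missing_rate > max_missing:
            continue  # drop SNPs with too many missing calls
        freq = sum(observed) / (2 * len(observed))  # counted-allele frequency
        maf = min(freq, 1 - freq)
        if maf < min_maf:
            continue  # drop rare variants
        kept.append(snp_id)
    return kept


def bonferroni_neg_log10(alpha, n_tests):
    """-log10 of the Bonferroni-corrected significance threshold."""
    return -math.log10(alpha / n_tests)


# Toy data with 20 samples per SNP (hypothetical)
toy = {
    "snp_ok": [0] * 10 + [2] * 10,
    "snp_rare": [0] * 20,
    "snp_missing": [None] * 2 + [0] * 9 + [2] * 9,
}
print(filter_snps(toy))                                # ['snp_ok']
print(round(bonferroni_neg_log10(0.05, 35285), 2))     # 5.85
```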

Linkage Disequilibrium Estimation
TASSEL version 5 was used to calculate r² and measure pairwise linkage disequilibrium (LD) (Bradbury et al., 2007). The LD was calculated for genome-wide, euchromatic, and heterochromatic regions. A marker was assigned to a euchromatic or heterochromatic region based on the physical position of the marker and the boundaries of these regions, which were downloaded from SoyBase (www.soybase.org). The physical distance between two markers at which the average r² reached half of its maximum value (Huang et al., 2010) was used as the LD decay rate.
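One way to read the half-maximum rule is sketched below in Python. The binning scheme, bin width, and function name are assumptions made for illustration; the text does not state how pairwise r² values were averaged over distance, and TASSEL itself is not called here. The input is a list of (physical distance, r²) pairs such as those exported from an LD analysis.

```python
from collections import defaultdict


def ld_decay_distance(pairs, bin_size):
    """Distance at which binned mean r2 first drops to half its maximum.

    pairs    : iterable of (distance_bp, r2) for marker pairs
    bin_size : width of distance bins in bp (assumed averaging scheme)
    Returns the midpoint of the first qualifying bin, or None if the
    mean r2 never falls to half of its maximum.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for distance, r2 in pairs:
        bin_index = int(distance // bin_size)
        totals[bin_index][0] += r2
        totals[bin_index][1] += 1
    means = {b: s / n for b, (s, n) in totals.items()}
    if not means:
        return None
    peak_bin = max(means, key=means.get)
    half_max = means[peak_bin] / 2
    # Scan outward from the bin holding the maximum mean r2
    for b in sorted(means):
        if b >= peak_bin and means[b] <= half_max:
            return (b + 0.5) * bin_size
    return None


# Hypothetical marker pairs: (distance in bp, r2)
toy_pairs = [(500, 0.8), (1500, 0.6), (2500, 0.5), (3500, 0.3), (4500, 0.2)]
print(ld_decay_distance(toy_pairs, bin_size=1000))  # 3500.0
```

Running the same function separately on euchromatic and heterochromatic marker pairs would give region-specific decay distances.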

Genome-Wide Association Study
PROC GLM in SAS version 9.4 (SAS Institute, 2013) was used to estimate least squares means for all traits in each environment to account for missing plot data. The least squares means estimates were used for all phenotypic data in the GWAS analysis. The Fixed and Random Model Circulating Probability Unification (FarmCPU) method was used to perform genome-wide association analysis (Liu et al., 2015). The first three principal components were used as covariates to correct for stratification, and a Bonferroni threshold was used as the significance cutoff for the GWAS. The threshold was set as 0.05/total SNPs [−log10(P) = 5.85] (Price et al., 2006). Due to the relatively small sample size of the selected G. soja collection in this study, there were limitations in the ability to detect marker associations with small effects. Therefore, we were mainly interested in identifying large-effect quantitative trait loci (QTLs) associated with major genes of interest for all analyzed traits in the GWAS analysis.

RESULTS AND DISCUSSION
A phylogenetic tree of the abbreviated (excluding redundant PIs with >99% similarity) USDA G. soja PI collection is presented in Fig. 1, along with the total (θ) and average (π) nucleotide diversity estimates for the entire USDA G. soja collection, the abbreviated G. soja collection, and the G. soja collection used in this study. These data illustrate the SNP diversity and distribution of PIs in the collection used in this study relative to the entire USDA collection of G. soja PIs. Although the total nucleotide diversity for the collection used in this study was slightly higher than in the entire and abbreviated collections, the average nucleotide diversity did not change, and the selected PIs for the study are evenly distributed across the phylogenetic tree of the abbreviated collection (Fig. 1). These data show that the collection of G. soja PIs used in this study is representative of the entire USDA collection of G. soja PIs based on SNP diversity and genetic distance.
The collected and analyzed data from the field experiments are categorized into two sets. Set 1 includes the data of 79 genotypes in three out of six studied environments, including 13CLM, 14CLM, and 15NOV (where the number indicates the year [e.g., "13" for 2013] and CLM and NOV stand for Columbia and Novelty, respectively; Table 1). The measured traits in Set 1 were maturity (R8 date), 100-seed weight, seed protein and amino acid content, seed oil and fatty acid content, and seed soluble sugar content. Set 2 includes 100-seed weight and seed protein and amino acid content of 64 of the 79 PIs in all six environments, including 13CLA, 13CLM, 14CLA, 14CLM, 14SAN, and 15NOV (where SAN and CLA stand for Sandhills and Clayton, respectively; Table 1). With the exception of 100-seed weight, the measured traits in Set 2 showed means, ranges, and variation similar to those in Set 1 (data not shown), which indicates that both sets of PIs showed similar phenotypic data and either of them can be used to determine significant differences among genotypes for the traits measured in this collection of G. soja PIs. The entry-mean-based heritability estimates ranged from 0.52 to 0.99 for maturity, 100-seed weight, seed protein and oil contents, seed fatty acid composition, seed amino acid composition, and seed soluble carbohydrate composition (Table 2). The exceptions were the traits whose heritability estimates were <0.50, including seed contents of glucose, fructose, valine, and proline. The medium to high heritability estimates (0.52 to 0.99) indicate that genetic factors are largely responsible for most predicted variance, which suggests that genotypic differences are heritable and can be exploited for breeding and research when using wild soybean for improving valuable traits in cultivated soybean.

Amino Acid Profiles
In terms of nutrition, the amino acid profile of soybean seed protein is a more compelling trait for selection than protein content per se. Due to deficiencies in methionine, lysine, and threonine of soybean protein (Pelaez and Walker, 1979; Erickson et al., 1989), these amino acids have been supplemented to improve the quality of soybean meal. Methionine is the most limiting amino acid (Erickson et al., 1989; Wang and Li, 2012), and Imsande (2001) reported that roughly US$100 million were spent annually by poultry and swine producers to supplement animal feed with methionine. In addition, the leaching of methionine supplements may lead to the formation of undesirable volatile sulfides due to bacterial degradation (George and De Lumen, 1991). Therefore, developing soybean cultivars with improved amino acid composition is desirable both to improve soybean seed value and to avoid negative environmental effects caused by supplementing amino acids.
In this study, the contents of cysteine and methionine in wild soybean seed ranged from 14.9 to 19.2 g kg⁻¹ with a mean of 16.9 g kg⁻¹ and from 14.0 to 16.9 g kg⁻¹ with a mean of 15.1 g kg⁻¹, respectively (Table 2). Banaszkiewicz (2011) reported a narrower range of cysteine and methionine (15.0-17.0 and 13.6-15.8 g kg⁻¹, respectively) when the author studied soybean and soybean meal for animals. George and De Lumen (1991) also reported a lower range of methionine in cultivated soybean samples that they had collected. This range was 11 to 16 g kg⁻¹, and most soybean samples were in the range of 12 to 14 g kg⁻¹. The ranges of cysteine and methionine content in our study of wild soybean were higher than the range reported by Warrington et al. (2015) (14.7-16.2 g kg⁻¹ for cysteine content and 13.8-14.7 g kg⁻¹ for methionine content) when they studied a recombinant inbred line population developed from a cross between 'Benning' and 'Danbaekkong'. When Kwanyuen et al. (1997) studied a wild soybean germplasm, the seed content of cysteine and methionine (4-8 and 7-11 g kg⁻¹, respectively) showed lower ranges than those in our study. The variance estimates in our study for these two amino acids were significantly affected by genotype (P < 0.0001) (Table 2). The heritability estimates for amino acids on an entry-mean basis for Set 2 were relatively high, with a range of 0.30 to 0.84 (Table 2). The relatively high entry-mean heritability of 10 of the 12 amino acids measured suggests that genetic gains for amino acids can be achieved using wild soybean germplasm in breeding programs. On average, 9 out of 12 amino acids, based on crude protein content, showed significant differences across six studied environments (Table 3). The highest average content of 6 of the 12 amino acids was observed in 15NOV. Grieshop and Fahey (2001) and Karr-Lilienthal et al. (2005) suggested that differences in temperature among studied environments may contribute to variation in amino acid content, and Singh et al. (2016) stated that free amino acid content increased when there was a deficit of phosphate, and seed amino acid content decreased under elevated CO₂ conditions. To our knowledge, the results of this study are some of the first published and publicly available amino acid data sets in wild soybean seed grown across multiple replications and environments. The data indicate that there is significant genetic variation for amino acids in the USDA G. soja collection, and soybean researchers and breeders should be able to use this variation for cultivar improvement and research to better understand the genetic architecture.

Fatty Acid Profiles
Table 2 footnotes: † Fatty acids for each genotype were calculated as the proportion of each fatty acid for the total oil fraction. ‡ Amino acids for each genotype were calculated as the proportion of each amino acid for the total protein fraction.
Table 3. Summary of the mean crude protein and amino acid contents calculated as the proportion of each amino acid for the total protein fraction across six studied environments.

Genotype showed significant influence on the variation of all five fatty acids (P < 0.0001, Table 2). The average fatty acid content of soybean oil in this study was 127.9 g kg⁻¹ for palmitic acid, 32.3 g kg⁻¹ for stearic acid, 122.1 g kg⁻¹ for oleic acid, 554.1 g kg⁻¹ for linoleic acid, and 163.8 g kg⁻¹ for linolenic acid. The oleic acid content (122.1 g kg⁻¹) was lower, whereas the linolenic acid content (163.8 g kg⁻¹) was higher than values reported by Guo and Petrovic (2005) (233 and 76 g kg⁻¹, respectively), who studied the oil extracted from cultivated soybean G. max. The higher concentrations of linolenic acid in wild soybean compared with those in cultivated soybean in our study were consistent with the findings of Asekova et al. (2014) and Pantalone et al. (1997). There was a significant genotype × environment interaction effect for oleic, linoleic, and linolenic acid in the across-location analysis, and this effect was largely due to a change in magnitude rather than a change in rank of individuals (Supplemental Table S3). Soybean lines with high linolenic acid content are not desirable for many breeding programs because oil with a higher content of linolenic acid readily oxidizes, resulting in the formation of off-flavors when the oil is used for cooking (Yoon et al., 2009). In addition, when the oil is used as biofuel, the oxidized oil may result in viscous materials that clog the oil filter and obstruct fuel flow (Yadav, 1996).
Linoleic acid and linolenic acid are also omega-6 and omega-3 fatty acids, respectively, which are essential for human health and development (Covington, 2004). The lack of these fatty acids can make humans more vulnerable to health problems such as heart diseases, asthma, allergies, and other syndromes or diseases (Simopoulos, 2002; Simopoulos, 2008). However, the biological activities of these fatty acids in humans are different. When a high amount of omega-3 fatty acids is ingested, inflammation and thrombosis may be suppressed; in contrast, high intake of omega-6 fatty acids may lead to increased inflammation (Asif, 2011). Due to the differences in biological activities between these two fatty acids, a healthy range of ratios of omega-6 to omega-3 (1:1 to 4:1) was reported by Mattson and Grundy (1985) and Simopoulos (2002). Dhakal et al. (2013) and Asekova et al. (2014) stated that this ratio in commodity soybean oil was 6:1 to 7:1. The selected collection of wild soybean used in this study had a lower ratio (3.4:1); therefore, the wild soybean genotypes in the USDA collection may be useful when trying to improve the ratio of omega-6 to omega-3 fatty acids in cultivated soybean for human health benefits.
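The 3.4:1 ratio follows directly from the mean linoleic (omega-6) and linolenic (omega-3) acid contents reported in this section. The short Python sketch below reproduces that arithmetic and checks it against the 1:1 to 4:1 range cited in the text; the function name is hypothetical.

```python
def omega6_to_omega3_ratio(linoleic, linolenic):
    """Ratio of linoleic (omega-6) to linolenic (omega-3) acid content.

    Both inputs must be in the same units (here g per kg of oil).
    """
    if linolenic <= 0:
        raise ValueError("linolenic acid content must be positive")
    return linoleic / linolenic


# Mean contents reported in this study (g kg^-1 of total oil)
ratio = omega6_to_omega3_ratio(linoleic=554.1, linolenic=163.8)
print(f"omega-6:omega-3 = {ratio:.1f}:1")           # omega-6:omega-3 = 3.4:1
print("within 1:1 to 4:1:", 1.0 <= ratio <= 4.0)    # True
```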

Soluble Sugar Profiles
The seed contents of fructose, glucose, sucrose, raffinose, and stachyose are shown in Table 2. Five water-soluble carbohydrates, including fructose, glucose, sucrose, raffinose, and stachyose, were analyzed, and significant variation was observed in sucrose, raffinose, and stachyose (P < 0.001, Table 2). Glucose, fructose, and raffinose are present at low concentrations (<15 g kg⁻¹) in wild soybean seeds. The sucrose, raffinose, and stachyose concentrations ranged from 14.6 to 39.5 g kg⁻¹ with a mean of 21.5 g kg⁻¹ for sucrose, 6.6 to 9.3 g kg⁻¹ with a mean of 7.8 g kg⁻¹ for raffinose, and 37.2 to 58.9 g kg⁻¹ with a mean of 47.8 g kg⁻¹ for stachyose. Among the five studied sugars, sucrose and stachyose exhibited the highest concentrations, while fructose, glucose, and raffinose were at lower concentrations. The average sucrose content was lower than those reported by Yu et al. (2016) and Hou et al. (2009) for cultivated soybeans (52.1 and 46.8 g kg⁻¹, respectively). However, the average stachyose content in this study was higher than those reported by Yu et al. (2016) and Hou et al. (2009) (39.3 and 31.7 g kg⁻¹, respectively). One possible explanation for the differences between our reported results and previous studies is that sugar profiles are strongly influenced by the environment (e.g., the CV for sugar traits is quite high, Table 2), and the genotype × environment source of variation was significant for all sugars measured (Supplemental Table S3). The ranges of fructose, glucose, and raffinose seed content in the core collection were 5.1 to 11.6 g kg⁻¹ with a mean of 7.1 g kg⁻¹ for fructose, 4.3 to 6.5 g kg⁻¹ with a mean of 5.1 g kg⁻¹ for glucose, and 6.6 to 9.3 g kg⁻¹ with a mean of 7.8 g kg⁻¹ for raffinose (Table 2). The variation and average concentrations of fructose, glucose, and raffinose in this study were similar to those in the previous reports for domesticated soybeans (Hou et al., 2009; Kumar et al., 2010; Yu et al., 2016). Although genotype × environment interaction was significant, G. soja PIs possess unique seed profiles for sucrose and stachyose, which could be useful for understanding the underlying genetic architecture of these carbohydrates in soybean.

Total Protein and Oil
The protein and oil contents of the core collection ranged from 392.6 to 481.7 g kg⁻¹ for protein and from 157.6 to 175.8 g kg⁻¹ for oil (Table 2). PI 407228 had the highest seed protein content in a single plot (493 g kg⁻¹), and PI 549048 had the highest seed oil content in a single plot (176 g kg⁻¹). The average contents of seed protein and oil were within the ranges of seed protein and oil of the soybean germplasm collection (Wilson, 2004). Compared with the checks, the seed of wild soybean lines in this collection had lower oil and higher protein content (Table 2). Heritability was 0.91 for seed protein and 0.86 for oil (Table 2). Although high, these values are similar to the entry-mean heritabilities reported by Jarquin et al. (2016) using 18,500 accessions of G. max in the USDA Soybean Germplasm Collection; this may indicate a significant potential for selecting for seed protein and oil composition in future cycles of breeding using wild soybean germplasm. The plot-based heritabilities were also high for protein, indicating an increased possibility of selecting for these traits using a single plot per generation (Table 2). These data indicate significant differences among G. soja PIs in this collection, which suggests that more studies about the potential for novel protein and oil genetic contributions in this collection are needed.
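To illustrate how entry-mean and plot-basis heritabilities differ, the sketch below uses the standard variance-component formulas for a multi-environment trial. It is a minimal illustration, not the analysis used in this study: the function names and all variance components, as well as the numbers of environments and replications, are hypothetical values chosen only for the example.

```python
def entry_mean_heritability(var_g, var_ge, var_err, n_env, n_rep):
    """Broad-sense heritability on an entry-mean basis.

    H2 = var_g / (var_g + var_ge / e + var_err / (r * e)),
    where e is the number of environments and r the replications per environment.
    """
    phenotypic = var_g + var_ge / n_env + var_err / (n_rep * n_env)
    return var_g / phenotypic


def plot_basis_heritability(var_g, var_ge, var_err):
    """Broad-sense heritability for a single plot in a single environment."""
    return var_g / (var_g + var_ge + var_err)


# Hypothetical variance components (not taken from Table 2).
h2_mean = entry_mean_heritability(400.0, 60.0, 180.0, n_env=3, n_rep=3)
h2_plot = plot_basis_heritability(400.0, 60.0, 180.0)
print(f"entry-mean H2 = {h2_mean:.2f}, plot-basis H2 = {h2_plot:.2f}")
```

Because the interaction and error variances are divided by the number of environments and plots, the entry-mean estimate is never smaller than the plot-basis estimate for the same variance components.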

Seed Weight
The 100-seed weight showed a wide phenotypic range from 0.9 to 3.5 g, with a mean of 1.78 g (Table 2). Glycine soja has a much lower range of 100-seed weight than the cultivated soybeans (11.9 to 16.2 g 100 seed⁻¹) used as checks in this study. One explanation for the wide variation in seed weight could be that a plot of wild soybean may have been composed of individuals with varying seed weights. Seed weights in a population of wild soybean can differ by up to 4.9-fold (Wang et al., 2014). Thus, our results are consistent with the study of Wang et al. (2014), in which the variation in seed weight among wild soybean genotypes was high but could still be genetically distinguished.

Correlations among Traits
Maturity showed significant but weak correlations with seed weight and crude protein (0.22, P < 0.05 and 0.37, P < 0.001, respectively; Table 4). These results are inconsistent with a recent study (Vaughn et al., 2014), which did not observe any relationship between protein levels and MGs using 3258 G. max accessions in MGs I to IX from the USDA Soybean Germplasm Collection. This could be explained by the loss of genes or alleles relating to these traits during domestication and improvement selection, in which cultivars were developed to achieve the highest yield potential in certain regions with specific MGs. Bellaloui et al. (2009) studied the relationship between maturity and seed composition of near-isogenic soybean lines derived from the cultivars 'Clark' (Johnson, 1958) and 'Harosoy' (Weiss and Stevenson, 1955) and observed a positive correlation between maturity and protein for Clark isolines and a nonsignificant correlation between these traits for Harosoy isolines. The relationship between maturity, protein, and seed weight in wild and domesticated soybeans may need to be refined with further studies.
The 100-seed weight showed significant associations with maturity and with the contents of oil, sucrose, and most fatty acids, with the exception of linoleic acid and palmitic acid (Table 4). Seed oil and oleic acid were both positively correlated with seed weight, whereas stearic acid and linolenic acid were negatively correlated with seed weight (Table 4). This result is consistent with the observations of Kumar et al. (2006), Guleria et al. (2008), Poeta et al. (2016), and Lee et al. (2017), who observed that seed size had positive correlations with oil and oleic acid but a negative correlation with linolenic acid when they studied different collections of Glycine max accessions.
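As an illustration of how correlations between trait means can be computed and tested, the sketch below implements a Pearson correlation coefficient and its t statistic. It assumes Pearson correlations on entry means, which is not stated explicitly in this section; the function names and the five data points are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have the same length (at least 3)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r = 0 with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical entry means for two traits measured on five accessions.
trait_a = [1.0, 2.0, 3.0, 4.0, 5.0]
trait_b = [1.0, 3.0, 2.0, 5.0, 4.0]
r = pearson_r(trait_a, trait_b)
print(f"r = {r:.2f}, t = {t_statistic(r, len(trait_a)):.2f}")
```

The t statistic grows with the number of accessions, which is why a weak correlation can still be statistically significant when it is based on the means of many PIs.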
Seed protein and oil contents showed a significant correlation (0.66, P < 0.001; Table 4). Due to the negative relationship between oil and protein, we also observed relationships between protein or oil and some of their corresponding negatively and positively correlated traits, such as oleic acid, linoleic acid, linolenic acid, fructose, and sucrose (Table 4). The negative correlation between protein and oil has been well documented (Hymowitz, 1972; Burton, 1987; Wilcox, 1998; Chung et al., 2003; Vaughn et al., 2014; Leamy et al., 2017; Wu et al., 2017). Chung et al. (2003) reported that the negative relationship between oil and protein may be due either to the two traits being controlled by the same genes (pleiotropy) or to traits controlled by different, yet linked, alleles. Other possible explanations include differences in soybean accessions or cultivation practices (Lee et al., 2017). Dornbos and McDonald (1986) and Saldivar et al. (2011) observed seed oil accumulation at early developmental stages and protein accumulation at later stages of soybean seed development. The strong negative correlation between seed protein and oil implies that improving both seed protein and oil content may be difficult (Chung et al., 2003; Nichols et al., 2006). Hwang et al. (2014) recently found a biallelic SNP for which one variant was significantly associated with increased protein and oil content and the other variant with decreased content of both. Among the five fatty acids, oleic acid and linolenic acid showed significant correlations with seed weight, seed protein, linoleic acid, and between themselves (Table 4). Brace et al. (2011) and La et al. (2014) also reported that oleic acid was negatively correlated with linoleic acid and linolenic acid. These negative correlations between oleic acid content and other fatty acid contents may be due to their roles in the fatty acid biosynthesis pathway, where one fatty acid is the direct precursor of the other (Ohlrogge and Browse, 1995). Oil showed positive correlations with oleic acid and linoleic acid (0.39, P < 0.001 and 0.40, P < 0.001, respectively) but a negative correlation with linolenic acid (−0.62, P < 0.001) (Table 4). This result is consistent with results reported by La et al. (2014) and Wu et al. (2017), in which there was a positive correlation between oil and oleic acid but a negative correlation between oil and linolenic acid. Inconsistently, La et al. (2014) and Wu et al. (2017) reported a negative correlation between oil and oleic acid. This may be explained by the materials they used in their studies. La et al. (2014) studied G. max soybean lines with high and normal oleic content, and Wu et al. (2017) characterized soybean cultivars. These findings illustrate the need to understand the phenotypic correlations among seed oil traits in a wide array of soybean germplasm, including wild soybean, to improve our ability to develop methods that avoid negative consequences associated with selection for desired traits. In our study, linolenic acid was significantly correlated with seed content of protein, oil, oleic acid, and linoleic acid (0.46, P < 0.001; −0.62, P < 0.001; −0.54, P < 0.001; and −0.45, P < 0.001, respectively; Table 4). Because of these strong correlations, breeding for increased seed content of linolenic acid would lead to an increase in seed protein content and decreases in seed content of oil, oleic acid, and linoleic acid.
** Significant at the 0.01 probability level. *** Significant at the 0.001 probability level. † Seed compositions were calculated as the proportion of each seed composition for seed dry weight. ‡ Fatty acids for each genotype were calculated as the proportion of each fatty acid for the total oil fraction. § ns, nonsignificant.
In this study, sucrose showed a positive correlation with raffinose but a nonsignificant correlation with stachyose in wild soybeans. This result is not consistent with the study of Neus et al. (2005), who used two populations developed from crosses between two elite lines and PI 200508 and observed that sucrose was negatively correlated with both raffinose and stachyose. We also found significant correlations between protein and glucose (0.29, P < 0.05), protein and stachyose (−0.29, P < 0.01), oil and fructose (−0.29, P < 0.01), and oil and raffinose (−0.24, P < 0.05) (Table 4). The positive correlation between protein and stachyose in the Neus et al. (2005) study suggests that it may be challenging to develop soybean lines with low stachyose content and high protein content. We found a higher ratio of raffinose and stachyose to sucrose than the ratios observed in cultivated soybeans. We also found a significant correlation between maturity and stachyose (0.76, P < 0.001), a significant negative correlation between maturity and raffinose (−0.22, P < 0.05), and a strong correlation between seed weight and sucrose (0.65, P < 0.001) (Table 4).
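The sugar ratios mentioned above can be worked through with the mean concentrations reported in this paper. The sketch below uses the wild soybean means for sucrose, raffinose, and stachyose and the cultivated soybean means for sucrose and stachyose cited from Yu et al. (2016) and Hou et al. (2009). The variable and function names are illustrative, and because raffinose means for the cultivated reports are not given here, only the stachyose-to-sucrose ratio is compared across sources.

```python
# Mean seed concentrations in g per kg, as reported in the text.
WILD_MEANS = {"sucrose": 21.5, "raffinose": 7.8, "stachyose": 47.8}
CULTIVATED_MEANS = {
    "Yu et al. (2016)": {"sucrose": 52.1, "stachyose": 39.3},
    "Hou et al. (2009)": {"sucrose": 46.8, "stachyose": 31.7},
}


def stachyose_to_sucrose(means):
    """Ratio of mean stachyose to mean sucrose concentration."""
    return means["stachyose"] / means["sucrose"]


# Combined raffinose + stachyose relative to sucrose in the wild collection.
wild_rfo_ratio = (WILD_MEANS["raffinose"] + WILD_MEANS["stachyose"]) / WILD_MEANS["sucrose"]
wild_ratio = stachyose_to_sucrose(WILD_MEANS)
cultivated_ratios = {src: stachyose_to_sucrose(m) for src, m in CULTIVATED_MEANS.items()}

print(f"wild (raffinose + stachyose) / sucrose = {wild_rfo_ratio:.2f}")
print(f"wild stachyose / sucrose = {wild_ratio:.2f}")
for src, value in cultivated_ratios.items():
    print(f"{src} stachyose / sucrose = {value:.2f}")
```

With these means, stachyose exceeds sucrose roughly twofold in the wild collection, whereas it is below sucrose in both cultivated reports.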
When the amino acid contents of the core collection are calculated according to total seed dry weight, they tend to increase in parallel with protein content (Fig. 2a and 2c). When these amino acids are calculated according to the content of total seed protein, all amino acids except glutamic acid show an inverse relationship with seed protein content (Fig. 2b and 2d). The exception of glutamic acid may be due to the role of this amino acid in the biosynthetic pathway, in which glutamic acid acts as a basic precursor for the synthesis of arginine, aspartate, glutamine, and proline (Taiz and Zeiger, 2013). All studied amino acids showed strong and positive correlations among themselves but negative correlations (except glutamic acid) with the content of seed crude protein (Table 5). Soybean lines with higher seed protein content will also typically have a lower concentration of essential amino acids (cysteine, lysine, methionine, threonine, and tryptophan) than soybean lines with lower protein content (Medic et al., 2014). These correlations may, in part, be due to the determination of soybean seed amino acid content via N assimilation during the seed-filling stage (Hernández-Sebastià et al., 2005).

Linkage Disequilibrium and Genome-Wide Association Study
The LD, indicated as "R-squared" in Fig. 3, decreased to half of its highest value at 7.5 kb when calculated for the whole genome. The LD decay rate in this study was calculated at 0.22. This decay distance is shorter than those reported for wild soybean populations by Zhou et al. (2015) and Leamy et al. (2017) (27 and >100 kb, respectively). Because LD in wild soybean was lower than in domesticated soybean, Leamy et al. (2017) suggested that more QTLs would be found if more markers were used in wild soybean populations. We also observed a difference in LD decay between euchromatic and heterochromatic regions (5.2 and 148.6 kb, respectively). Zhou et al. (2015) and Hyten et al. (2007) reported similar results when they studied different soybean populations.
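The LD decay distance described above, the physical distance at which average r² falls to half of its highest value, can be estimated from binned averages as in the following sketch. This is one simple way to do it, assuming linear interpolation between bins; it is not necessarily the procedure used in this study, and the function name and the binned values are hypothetical.

```python
def half_decay_distance(distances_kb, mean_r2):
    """Distance at which mean r2 first drops to half of its maximum.

    distances_kb: bin distances in kb, sorted in ascending order.
    mean_r2: average r2 of marker pairs in each distance bin.
    Returns the linearly interpolated distance, or None if r2 never
    falls to half of its maximum within the supplied bins.
    """
    if len(distances_kb) != len(mean_r2) or len(mean_r2) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    target = max(mean_r2) / 2.0
    for i in range(1, len(mean_r2)):
        d0, d1 = distances_kb[i - 1], distances_kb[i]
        r0, r1 = mean_r2[i - 1], mean_r2[i]
        if r0 > target >= r1:
            # Interpolate linearly between the two bins that bracket the target.
            return d0 + (r0 - target) * (d1 - d0) / (r0 - r1)
    return None


# Hypothetical binned averages (not the values plotted in Fig. 3).
bins_kb = [1.0, 5.0, 10.0, 50.0, 100.0]
avg_r2 = [0.50, 0.35, 0.20, 0.10, 0.05]
decay_kb = half_decay_distance(bins_kb, avg_r2)
print(f"half-decay distance = {decay_kb:.1f} kb")
```

Running the same function separately on marker pairs from euchromatic and heterochromatic regions would give region-specific decay distances of the kind compared above.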
For maturity, five SNPs on chromosomes 1, 2, 6, 9, and 12 passed the genome-wide significance threshold (Table 6, Fig. 4). All of these SNPs had estimated effects toward later maturity. The SNP on chromosome 6 would delay maturity by up to an estimated 10 d, and the physical location of this marker was approximately 4 Mb away from the maturity gene E1 with reference to the Williams 82 sequence (www.soybase.org). Another SNP associated with maturity, on chromosome 12, was located approximately 1 Mb away from Satt317 (www.soybase.org), which was reported to be significantly associated with maturity in soybean (Eskandari et al., 2013). These results agree with published mapping efforts of known genes and QTLs associated with maturity in soybean and provide a proof of concept that this collection of G. soja PIs is useful for a GWAS aimed at identifying marker associations with large-effect QTLs. The significant SNPs on chromosomes 1, 2, and 9 may be linked to unidentified QTLs associated with soybean maturity.
Ten different SNPs were significantly associated with seed content of oleic and linoleic acids (Table 6, Fig. 5). These SNPs were located on nine different chromosomes, with one SNP each on chromosomes 2, 3, 4, 7, 8, 13, 14, and 20 and two SNPs on chromosome 11 that are likely associated with the same QTL region. The seed content of the corresponding fatty acid could be changed by 2 to 9 g kg⁻¹ due to the effect of these SNPs.
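For readers who want to relate the genome-wide threshold of −log10(P) = 5.85 shown in Fig. 4 to a P value, the sketch below converts between the two scales. It also shows what a Bonferroni threshold would look like for a given number of tests. How the threshold was derived is not stated in this section, so the Bonferroni calculation is only an assumption for comparison; the marker count of 39,298 is borrowed from the Fig. 1 caption and may differ from the number of SNPs used in the GWAS. The function names are illustrative.

```python
import math


def neg_log10_to_p(neg_log10_p):
    """Convert a -log10(P) value back to a P value."""
    return 10.0 ** (-neg_log10_p)


def bonferroni_neg_log10(alpha, n_tests):
    """-log10 of the Bonferroni-corrected threshold alpha / n_tests."""
    return -math.log10(alpha / n_tests)


# Threshold reported in the Fig. 4 caption.
p_threshold = neg_log10_to_p(5.85)

# Assumed comparison only: alpha = 0.05 and 39,298 tests.
bonferroni_line = bonferroni_neg_log10(0.05, 39298)

print(f"P at -log10(P) = 5.85: {p_threshold:.3e}")
print(f"Bonferroni -log10(P) for 39,298 tests: {bonferroni_line:.2f}")
```

Under these assumptions the two thresholds are close but not identical, so the reported value should not be read as a plain Bonferroni correction over 39,298 markers.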
No marker was found to be significantly associated with seed content of protein or oil. This could be explained by the high phenotypic variance of these traits (Gibson, 2012; Korte and Farlow, 2013). Yan et al. (2017) also stated that increasing the frequency of the genotypes containing causative alleles would increase the chance of detecting a variant with major effects. The G. soja collection used in this study was chosen to represent a maximum genetic diversity of 806 G. soja accessions; therefore, it could have a low frequency of specific alleles. Due to the high genetic diversity of the selected G. soja PIs in this collection, the extent of LD was lower than in other GWAS (Hwang et al., 2014; Zhou et al., 2015; Leamy et al., 2017; Zeng et al., 2017), and a higher marker density could be required to ensure that the genome is covered adequately (Nordborg and Tavaré, 2002). Ingvarsson and Street (2011) reported that a larger sample size might be required to facilitate the discovery of associations between genetic markers and studied traits. Thus, sample size and marker coverage are major limitations in the ability of this GWAS analysis to identify marker associations with QTLs that have minor effects or are at low allelic frequencies.
It is necessary to enhance germplasm in soybean breeding programs to improve the genetic base of newly developed cultivars. Evaluation and characterization of the entire USDA wild soybean collection (1168 G. soja PIs) using multiple replications and environments would be prohibitively costly due to the practically unmanageable size of the entire collection. In this study, the G. soja collection consists of 79 accessions that represent virtually all the genetic diversity present in 1168 G. soja PIs. This collection of 79 G. soja PIs was characterized across multiple replications and environments and will provide a useful framework to evaluate and identify valuable parental lines for soybean breeding and research programs. Similar core collections in chickpea (Cicer arietinum L.), peanut, foxtail millet [Setaria italica (L.) P. Beauv.], and sorghum [Sorghum bicolor (L.) Moench] have [...] (Sharma et al., 2012; Upadhyaya et al., 2013) and accessions tolerant to drought (Sakhi et al., 2014; Upadhyaya et al., 2017). Similarly, the core collection of foxtail millet was also the source of accessions that have drought tolerance and salinity tolerance (Krishnamurthy et al., 2016; Upadhyaya et al., 2017). Likewise, our wild soybean collection, with a modest number of accessions (79), could be evaluated more extensively for traits of agronomic and economic importance, which would be impractical for the entire USDA-GRIN G. soja collection. In addition, due to the small size of the collection, soybean breeders can simplify their management and enhance their utilization of wild soybean genetic resources. The characterization of this collection is an important step toward further study and exploration of the genetic resources in wild soybean, and all data from this study are publicly available via GRIN. In our genome-wide association analysis of maturity, seed weight, and seed composition, we observed both known and novel markers with significant associations with maturity, fatty acids, amino acids, and 100-seed weight in wild soybean. The information about these markers suggests marker associations with important traits and will assist further studies to identify genes controlling maturity, seed weight, and seed contents of aspartic acid, glutamine, palmitic acid, oleic acid, and linoleic acid in wild soybean that could be used to improve cultivated soybean around the globe.

project, including Dr. Xiaofan Niu, Erin Grannemann, Dennis Yungbluth, Josh Dakota, and Sam McDonald.

Fig. 1. A phylogenetic tree of the abbreviated USDA G. soja collection based on 39,298 single nucleotide polymorphisms (SNPs) from the SoySNP50K iSelect BeadChip. The abbreviated G. soja collection is the entire USDA G. soja collection minus accessions with >99.9% similarity to other accessions. Color-coded branch tip labels are aligned in a circle surrounding the tree. Text coloration corresponds to the country of origin, except for the plant introductions (PIs) in this study, which are denoted by black circles and labels. The branches of the tree within the circle are colored according to the clade. Branch lengths equal the number of nucleotide substitutions per site and are proportional to the scale bar on the lower right side of the tree. The nucleotide genetic diversity of the entire USDA collection (All), the abbreviated USDA collection (All*), and the selected collection of PIs evaluated in this study (Core) are described in the box underneath the tree. Both the total nucleotide diversity θ (theta) and the average nucleotide diversity π (pi) are included in the box. The following colors and abbreviations were used for non-core-collection tip labels and branch clades: China (C or CHN) = red, South Korea (SK or KOR) = blue, Russia (R or RUS) = green, and Japan (JPN) = yellow.

Table 4. Correlations between maturity, seed weight, seed oil and protein, fatty acids, and soluble carbohydrates based on the means of 79 G. soja plant introductions (PIs) across three environments. * Significant at the 0.05 probability level.

Fig. 2. The relationship between total crude protein and amino acid compositions presented as the proportion of each amino acid (a, c) based on total dry weight of the soybean and (b, d) based on the total crude protein fraction of the soybean for the G. soja plant introductions in Set 2 across six environments in Missouri and North Carolina during 2013 to 2015.

Table 5. Correlations between seed protein and amino acid composition based on the means of 63 G. soja plant introductions (PIs) across six environments. Amino acid contents were determined based on seed crude protein content. * Significant at the 0.05 probability level. ** Significant at the 0.01 probability level. *** Significant at the 0.001 probability level. † ns, nonsignificant.

Fig. 3.The average linkage disequilibrium decay of (a) the whole genome (the genome-wide average), (b) heterochromatic regions, and (c) euchromatic regions estimated for the core collection of wild soybean.

Fig. 4. Manhattan plots for maturity, seed weight, and seed content of aspartic acid and glutamine. The yellow horizontal line indicates the genome-wide threshold [−log10(P) = 5.85].

Table 1. Environments used in this study for evaluation of the selected G. soja plant introductions (PIs).

Table 2. Performance of the selected G. soja plant introductions (PIs) across three environments (Set 1, 79 PIs) and six environments (Set 2, 64 PIs) in Missouri and North Carolina from 2013 to 2015.

Table 6. Single nucleotide polymorphisms (SNPs) significantly associated with studied traits as identified by genome-wide association mapping using 79 G. soja plant introductions (PIs) grown in Missouri during 2013 to 2015.