Volume 60, Issue 2 p. 516-529
REVIEW AND INTERPRETATION
Open Access

Knowledge representation and data sharing to unlock crop variation for nutritional food security

Liliana Andrés-Hernández

Southern Cross Plant Science, Southern Cross University, Lismore, NSW, 2480 Australia

Abdul Baten

Southern Cross Plant Science, Southern Cross University, Lismore, NSW, 2480 Australia

current address, AgResearch, Grasslands Research Centre, Palmerston North, 4410 New Zealand

Razlin Azman Halimi

Southern Cross Plant Science, Southern Cross University, Lismore, NSW, 2480 Australia

Ramona Walls

CyVerse, Bio5 Institute, University of Arizona, Tucson, AZ, 85719 USA

Corresponding Author

Graham J. King

Southern Cross Plant Science, Southern Cross University, Lismore, NSW, 2480 Australia

Correspondence

Graham J. King, Southern Cross Plant Science, Southern Cross University, Lismore, NSW 2480, Australia.

Email: [email protected]

First published: 16 March 2020
Citations: 6

Assigned to Associate Editor M. Paul Scott.

Abstract

Meeting the challenge of food and nutritional security requires ongoing innovation, particularly in managing dietary nutritional information for pre-breeding analysis, selection, and cultivation of specific food crops and cultivars. At present, the ability to compare the relative nutritional value of crops is limited, with data management systems for most crops often inconsistent and poorly integrated. Here, we review generic efforts to standardize the description and management of crop trait data and discuss several issues currently constraining their exchange and comparison, with a focus on knowledge representation related to dietary nutrition. These issues include lack of consistency within or between crop specific databases, as well as limited data standardization and interoperability. At present, the use of common descriptors or controlled vocabularies between crops is fragmentary, with only partial implementation or uptake of formal ontologies, particularly for dietary nutritional composition. Although development of the existing Crop Ontology (CO) system has improved data sharing and reuse, it represents only a limited set of trait classes and crops. We identify the need for more robust and generic ontologies, particularly those that may address crop contributions to human dietary nutrition. We propose development of a Crop Dietary Nutrition Ontology (CDNO) as a robust structured controlled vocabulary for dietary nutritional composition and function, and provide examples of specific use cases and different end users who would benefit from using CDNO terms in their database searches. This development is likely to transform the way in which crops may be compared in terms of optimal dietary nutritional values.

Abbreviations

  • BIP: Brassica Information Portal
  • BMS: Breeding Management System
  • BrAPI: Breeding Application Programming Interface
  • CDNO: Crop Dietary Nutrition Ontology
  • ChEBI: Chemical Entities of Biological Interest
  • CO: Crop Ontology
  • FAIR: findable, accessible, interoperable, and reusable
  • FCT/FCDB: food composition databases and tables
  • FIX: physicochemical methods and properties
  • GI: glycemic index
  • GO: Gene Ontology
  • MIAPPE: Minimum Information About Plant Phenotyping Experiments
  • OBO: Open Biological and Biomedical Ontologies
  • PO: Plant Ontology
  • TD: trait dictionary
  • TO: Trait Ontology

    1 INTRODUCTION

    The domestication of crop plants has underpinned the development and expansion of human civilization by providing the necessary human and livestock dietary nutrition in a wide range of cultivation environments. Although the challenge of increasing global and regional nutritional security is widely recognized (Cole, Augustin, Robertson, & Manners, 2018; Dillard, 2019; Kumar, Kumar, Das, & Rajkhowa, 2019; Martin, 2018; Nyathi et al., 2019; Vinoth & Ravindhran, 2017), in many cases, insufficient data have been collated to provide a detailed path that connects dietary requirements with the practice of crop genetic improvement. While development of new cultivars was traditionally achieved through selection of high-yielding phenotypes, recent decades have seen increasing sophistication in identification of component traits, as well as direct selection of subsets of gene alleles made possible through the availability of detailed genetic and genomic information. Although molecular and biotechnological breeding techniques have improved the efficiency of genetic improvement, strategic innovation is still required to meet the increasing concerns of food and nutritional security (Mayes et al., 2012). Here, we outline the gaps and opportunities for information-based improvement of crop nutritional composition.

    Although major crops such as wheat (Triticum aestivum L.), maize (Zea mays L.), and rice (Oryza sativa L.) (Seck, Diagne, Mohanty, & Wopereis, 2012; Tilman, 1999) account for the majority of dietary energy and nutritional resources, these starch-rich cereals tend to have relatively low concentrations of minerals, vitamins, and essential amino acids (Grusak & DellaPenna, 1999). Their increased consumption contributes to low-quality diets and may lead to malnutrition, which is also evident in under- and overweight children and adults (Fanzo, Hawkes, Udomkesmalee, Afshin, & Allemandi, 2018; Mclean, Cogswell, Egli, Wojdyla, & De Benoist, 2008; Miller & Welch, 2013; Wakeel, Farooq, Bashir, & Ozturk, 2018). At present, the ability to compare the relative nutritional value of crops is limited, with data management systems for most crops often inconsistent and poorly integrated. There is recent recognition that extending the portfolio of domesticated crop species and resilient cultivars is hindered by the lack of a systematic framework and tools to represent, exchange, and analyze relevant knowledge (Mabhaudhi et al., 2019; Rasheed, Mujeeb-Kazi, Ogbonnaya, He, & Rajaram, 2018). An information-led approach is expected to make a substantial and direct contribution to decision making, from policy decisions to individual farmers and consumers (Fanzo et al., 2018). Although there are various sources of collated information that describe dietary nutritional composition for different crop products (Clancy, Woods, McMahon, & Probst, 2015), these are typically oversimplified and mostly do not represent cultivar- or region-specific variation (Azman Halimi, Barkla, Mayes, & King, 2019). For many crops, there are also distinct data sources describing genetic variation in terms of crop phenotypic traits for a wide range of germplasm, breeding materials, and cultivars. However, these are poorly integrated with those that describe nutritional composition. This highlights a pressing need for data and tools that simplify comparative analysis of alternative crops and cultivars and bridge the gap between plant breeders, farmers, food processors, and nutritionists, as well as national and regional policy and other decision makers.

    As crop-related datasets become increasingly large, complex, and dispersed (Eppig, Blake, Bult, Kadin, & Richardson, 2012), coordination of their curation and analysis becomes a major challenge, particularly where there are perceived benefits from data reuse and sharing (Lai, Lorenc, & Edwards, 2012b). A number of coordinated initiatives and adoption of shared approaches to data integration, as well as knowledge representation using ontologies, are being developed to address this and will hopefully assist in the development of crops that can address concerns of food and nutritional security. Ontologies have been developed in computer science as a tool to represent and share complex knowledge about a specific domain through formal controlled vocabularies that are both human and machine readable (Stevens, Goble, & Bechhofer, 2000).

    Here, we review generic efforts to standardize the description and management of crop trait data and discuss several issues currently constraining their exchange and comparison, with a focus on knowledge representation using ontologies, and their application to crop-based dietary nutrition. This review is intended to be of value to plant scientists and breeders, statisticians, data managers, bioinformaticians, and others involved in pre-breeding and crop genetic improvement, as well as those involved in dietary nutrition, dietary assessment, and analysis. It focuses on three major issues that currently constrain the integration and use of dietary nutritional information for pre-breeding analysis, selection, and cultivation of food crops and cultivars. In the context of knowledge representation, these are (i) a lack of consistency within or between crop-specific databases that manage trait performance and cultivar information; (ii) limited data standardization and interoperability; and (iii) gaps in the use of common descriptors or controlled vocabularies, particularly for dietary nutritional composition.

    We present an overview of the current state of the art and recent developments in facilitating management of crop-related trait data and knowledge systems. We identify the need for more robust controlled vocabularies and generic systems of knowledge representation, particularly those that may address crop contributions to human dietary nutrition. Based on this, we propose development of a Crop Dietary Nutrition Ontology (CDNO) to assist in management and navigation of nutritional components and outline how this is likely to transform the way in which crops may be compared in terms of optimal dietary nutritional values. This innovation is likely to benefit crop breeding, as well as downstream value through the supply chain (Figure 1). Although some specific content may be of particular interest to bioinformaticians and data scientists, in multidisciplinary teams, it is increasingly important for key stakeholders involved in the generation and analysis of data to understand the opportunities and constraints of data mining and knowledge management.

    Figure 1. Schematic diagram representing the relationship between different data sources and knowledge representation in the crop dietary nutrition domain. Primary data sources may include those that record germplasm (genetic resources, breeding material, and cultivars), trial data reflecting environmental factors (location, climate, and agronomy), and phenotypic traits associated with yield and quality. Quality traits of harvested products may include nutrient composition. Similar dietary nutritional components may be described for marketed food products in food composition tables and databases but at a lower level of granularity. Data may be managed or available in a variety of forms, including .txt, .csv, .xlsx, and .pdf files or structured databases. For crop-specific databases, there is opportunity to facilitate adherence to findable, accessible, interoperable, and reusable (FAIR) principles (Wilkinson, 2016), by ensuring data comply with the Minimum Information About Plant Phenotyping Experiments (MIAPPE) standard (Ćwiek-Kupczyńska et al., 2016). The adoption of a Crop Dietary Nutrition Ontology (CDNO) as a formal system of knowledge representation with structured controlled vocabularies for distinct classes of information is expected to increase the ease with which different user groups, such as those involved in pre-breeding and nutritionists, may navigate, interrogate, and compare diverse datasets. The CDNO includes three major classes for dietary nutritional components, dietary function, and analytical method and provides the opportunity for checking semantic consistency between different data descriptors and records. Where possible, existing ontological terms and relationships may be reused. Thus, a subset of chemical components is derived from the Chemical Entities of Biological Interest (ChEBI)

    To provide some explicit examples of where improved knowledge representation can help in the navigation of relevant “big data,” we take the example of two fictional characters: Gene Smith, a plant breeder, and Di Etkins, a nutritionist. Gene wants to understand how he can increase the nutritional value of the bean (Phaseolus vulgaris L.) cultivars he develops, and to compare these with major commodity crops such as soybean [Glycine max (L.) Merr.]. Di works within a regional development agency and is keen to maximize the nutritional and market value of local crops to provide evidence-based advice to local farmers. Each is aware of the other's discipline but lacks detailed knowledge of how to mine relevant datasets.

    2 NAVIGATING CROP GENETIC DIVERSITY

    Establishing a common worldview is essential for communication between different disciplines and can benefit from systematic approaches to data definition and knowledge representation, especially where these avoid “reinventing the wheel” for each discipline. It has been recognized for some time that improving the standardization of crop-related data would enhance communication between plant breeders and bioinformaticians, as well as laboratory- and field-based researchers involved in pre-breeding (Shrestha et al., 2010). In particular, species- or crop-specific databases can provide a valuable framework for systematic comparative studies when based on genetic, trait, and cultivar-specific information (Lai et al., 2012b). However, it is surprising that even for major commodity crops, there is often relatively poor coordination between multiple data management platforms, leading to poor integration and an inability to compare equivalent data entities (Lai et al., 2012a).

    Over the past two decades, a range of public-domain, crop-specific data management systems have been developed that aim to compile reference data for experimental or genetic resource collections of germplasm, genomes, genetic maps and markers, phenotypes, and associated trials (Bombarely et al., 2011). Those currently in the public domain tend to be associated with major crops or model plant species (Supplemental Table S1) and often benefit from the ability to combine trait and marker data with reference to linkage maps or genome sequence for genetic analysis, pre-breeding, and marker-assisted breeding.

    In recent years, genome-wide association studies (GWAS) and derivation of values for genomic and phenotypic prediction have been successful for a number of crops (Crossa et al., 2017; Huang et al., 2010; Wang, Xu, Hu, & Xu, 2018). However, the widespread integration and comparison of such datasets between and within crops has been limited by a lack of consistent systematic description of metadata related to genetic resources, phenotype, or environment.

    A number of initiatives have begun to address the issue of data integration (Supplemental Table S1), including the Wheat Information System (Wheat IS) (http://wheatis.org/), the Arabidopsis Information Portal (ARAPORT) (https://www.araport.org/) (Krishnakumar et al., 2015), the Global Grape Information System (GrapeIS) (https://www6.inra.fr/iggp) (Adam-Blondon et al., 2016), and the transPLANT project (http://www.transplantdb.eu/) (Spannagl et al., 2016). However, these are seldom integrated with each other, although there is increasing effort to standardize and harmonize approaches within research communities such as DivSeek (Meyer, 2015). For example, the Brassica Information Portal (BIP) (Eckes et al., 2017) is an open-source repository for managing phenotypic trait data in brassica crops such as canola (Brassica napus L.) and is based on the generic CropStoreDB (http://www.cropstoredb.org/) relational schema (Love et al., 2012). To enhance interoperability and accessibility, BIP makes use of the Breeding Application Programming Interface (BrAPI) (https://brapi.org/) for data download. The BrAPI specifies a standard interface for plant phenotype and genotype databases to serve their data to crop breeding applications. A number of tools have been incorporated into the BrAPI application showcase (BRAPPs) (https://brapi.org/brapps.php) that enable users to search and filter queries and to compare germplasm across studies, alongside various visualization tools. The BrAPI is also used by Germinate 3 (https://ics.hutton.ac.uk/get-germinate/), a generic database platform for management of passport and other data relating to plant genetic resources, as well as more advanced data types representing phenotypic, genotypic, and field trial data (Shaw et al., 2017).
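    To make the idea of a standard interface concrete, the following minimal Python sketch shows how a client might build a paginated germplasm request and read records from a response. It is an illustration under stated assumptions, not a reference client: the server address is hypothetical, the canned response is invented, and the endpoint, parameter, and field names (germplasm, page, pageSize, commonCropName, germplasmName) reflect our reading of BrAPI version 2 and should be checked against the current specification.

```python
from urllib.parse import urlencode

# Hypothetical server root; substitute the address of a real BrAPI-compliant database.
BASE_URL = "https://example.org/brapi/v2"


def build_germplasm_query(base_url, page=0, page_size=100, **filters):
    """Build the URL for a paginated germplasm listing request."""
    params = {"page": page, "pageSize": page_size}
    params.update(filters)
    return f"{base_url}/germplasm?{urlencode(params)}"


def extract_records(response):
    """Return the list of records from a response with 'metadata' and 'result' parts."""
    return response.get("result", {}).get("data", [])


# Canned response standing in for a server reply (all values invented).
sample_response = {
    "metadata": {
        "pagination": {"currentPage": 0, "pageSize": 50, "totalCount": 2, "totalPages": 1}
    },
    "result": {
        "data": [
            {"germplasmDbId": "g1", "germplasmName": "Line A"},
            {"germplasmDbId": "g2", "germplasmName": "Line B"},
        ]
    },
}

url = build_germplasm_query(BASE_URL, page_size=50, commonCropName="canola")
names = [record["germplasmName"] for record in extract_records(sample_response)]
print(url)
print(names)
```

    Because every compliant database would answer the same request shape, the same two functions could in principle be pointed at BIP, Germinate, or another platform by changing only the server root.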

    The collection of software within the Breeding Management System (BMS) (https://www.integratedbreeding.net/15/breeding-management-system) from the Integrated Breeding Platform (IBP) is also emerging as one of the more popular tools for crop-specific data management for pre-breeding and breeding efforts. The BMS is able to incorporate germplasm, trial and seed inventory data, and statistical analysis to improve plant breeding efficiency, thus facilitating the use of genetic markers. Some other platforms have addressed specific aspects of crop data.

    3 THE “BABEL FISH” IDEAL: DATA STANDARDIZATION AND INTEROPERABILITY

    A fictional universal translator introduced 40 yr ago by Douglas Adams (1980), the Babel fish neatly crossed language barriers between species. This concept has since increasingly materialized in different contexts to facilitate human and machine communication. For scientific data management, the recent emergence of findable, accessible, interoperable, and reusable (FAIR) principles (Wilkinson, 2016) is proving useful in supporting data and knowledge integration, particularly for complex data sources, and increasingly is being addressed through adoption of common sets of controlled vocabularies (Pommier et al., 2019). The FAIR principles endorse sharing and data reuse with an emphasis on machine-readable data, and metadata readable by humans (Leonelli, Davey, Arnaud, Parry, & Bastow, 2017; Marsden & Shahtout, 2014; Rodríguez-Iglesias et al., 2016). The transPLANT project and an increasing number of other initiatives have adopted these principles by providing search services for a broad range of databases (Spannagl et al., 2016). Such genome-centric platforms are also now facilitating prediction of gene function for crops (Dong, Schlueter, & Brendel, 2004; Krishnakumar et al., 2015; Nussbaumer et al., 2013; Rhee et al., 2003).

    There are various generic approaches to ensure data are more “interoperable” (I), which can increase both semantic value and the ability to make inferences at higher levels of information. At the string or syntactic level, data standardization may be based on consistent naming, standard identifiers, nomenclature conventions, agreed definitions, and well-defined formats (Krajewski et al., 2015). These greatly facilitate systematic data management and exchange, including between different crop species (Jaiswal et al., 2005), although there remain considerable challenges in managing simple synonyms (Supplemental Table S2), an issue that, as we will see, is important for nutritional components. Beyond syntactical matching, formal approaches to knowledge representation developed in computational science are being adopted to facilitate data exchange and inference of semantic content and include the development of controlled vocabularies or terminologies, represented as ontologies (see below).
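    The synonym problem at the string level can be illustrated with a short Python sketch. It assumes a hand-built synonym table in which each group of labels maps to one identifier; the identifiers (EX:0001, EX:0002) are placeholders invented for this example and are not terms from ChEBI or any other ontology.

```python
import re

# Hypothetical synonym table; the identifiers are placeholders, not real ontology IDs.
SYNONYMS = {
    "EX:0001": {"vitamin c", "ascorbic acid", "l-ascorbic acid"},
    "EX:0002": {"vitamin b1", "thiamine", "thiamin"},
}


def normalize(label):
    """Trim, lower-case, and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", label.strip().lower())


# Invert the table so that each normalized label points to its identifier.
LOOKUP = {
    normalize(name): identifier
    for identifier, names in SYNONYMS.items()
    for name in names
}


def to_identifier(label):
    """Map a free-text nutrient label to its identifier, or None if unknown."""
    return LOOKUP.get(normalize(label))


for raw in ["  Vitamin   C ", "Thiamin", "zinc"]:
    print(repr(raw), "->", to_identifier(raw))
```

    A lookup of this kind handles spelling and spacing variants, but it cannot express how terms relate to one another, which is the gap that ontologies are intended to fill.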

    Within the FAIR principles of data management, “reusability” (R) relies on well-described metadata, accessible by both humans and machines. However, manual integration, use, and analysis of crop trait and associated metadata is usually time consuming and often not funded in research programs. This is particularly apparent where genomic data are used to facilitate trait prediction or to select ideal allelic combinations for breeding (Lee et al., 2005; Zamir, 2013). The need for adopting standardized approaches to data curation is starting to be recognized as a prerequisite for data reuse, exchange, and description (King, 2004).

    Recent progress in the ability to record massive amounts of plant trait and environmental data has led to standards such as the open community Minimum Information About Plant Phenotyping Experiments (MIAPPE), which aims to harmonize the collection and presentation of data from plant phenotyping experiments and thus facilitate experimental reproducibility and data interpretation (Ćwiek-Kupczyńska et al., 2016). This comprises an evolving conceptual checklist of metadata required for adequate description of plant phenotyping experiments (Fiehn et al., 2007) and includes a set of attributes that encompass experimental design, biosource, and observed variables. Each category allows for one or more descriptive terms covering different aspects of environment associated with the experimental design, and different aspects of sample collection, processing, or management of biosource samples (Figure 2). The MIAPPE standards are presented as a formal set of guidelines using the Investigation/Study/Assay tab delimited (ISA-Tab) format (Rocca-Serra et al., 2011) for experimental metadata and exchange. The preparation and adoption of this type of checklist is challenging, and it will take some time before MIAPPE becomes stable and as widely used as the routine use of GenBank/ENA accession numbers, required prior to publishing a DNA sequence (Parkinson & Brazma, 2006).
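    A metadata checklist of this kind lends itself to automated checking. The Python sketch below tests a record against required fields grouped under the three attribute categories named above. The field names and the example record are hypothetical and chosen only for illustration; they are not taken from the MIAPPE checklist itself.

```python
# Hypothetical required fields, grouped by the attribute categories named in the text.
# The field names are illustrative and are not copied from the MIAPPE checklist.
REQUIRED_FIELDS = {
    "experimental_design": ["description"],
    "biosource": ["species", "material_id"],
    "observed_variables": ["trait", "method", "scale"],
}


def missing_fields(record):
    """Return 'section.field' names that are absent or empty in a metadata record."""
    missing = []
    for section, fields in REQUIRED_FIELDS.items():
        values = record.get(section, {})
        for field in fields:
            if not values.get(field):
                missing.append(f"{section}.{field}")
    return missing


# Invented example record with two gaps: no material identifier and an empty method.
record = {
    "experimental_design": {"description": "randomized complete block, three replicates"},
    "biosource": {"species": "Phaseolus vulgaris"},
    "observed_variables": {
        "trait": "seed protein content",
        "method": "",
        "scale": "g/100 g",
    },
}

print(missing_fields(record))
```

    Running such a check at the point of data entry is one way in which the curation burden could be shifted away from later manual integration.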

    Figure 2. The latest version of the Minimum Information About Plant Phenotyping Experiments (MIAPPE; Ćwiek-Kupczyńska et al., 2016) provides a checklist of metadata that can assist in describing datasets relevant to crop science (MIAPPE version 1.1, https://github.com/MIAPPE/MIAPPE/tree/master/MIAPPE_Checklist-Data-Model-v1.1). Here we represent the checklist as a schema of entity relationships, consisting of four primary objects (represented in orange) including investigation metadata, observation unit, and observed (trait) variables. The latter are also described within the trait dictionary format from the Crop Ontology (CO) (www.cropontology.org). One-to-many relationships are indicated between the major descriptive objects within the schema

    4 FORMAL REPRESENTATION OF COMPLEX KNOWLEDGE

    Although individual breeders and research communities may be able to share common assumptions and datasets with relative ease, the scope for ambiguity increases considerably once a wider set of people is involved. To address this, ontologies provide the opportunity to adopt well-structured controlled vocabularies that are able to represent complex relationships. Ontologies provide identifiers for terms represented within controlled vocabularies that are both human and machine readable (Courtot et al., 2011; Leonelli, 2008; Oellrich et al., 2015). One of the benefits of representing knowledge in ontological form (Figure 3) is the ability to provide a structured framework to help find, reuse, and organize information from a broad range of sources. Thus, data collected in different file formats, such as .pdf, .csv, or .xls sheets, may be indexed according to ontology terms. This can increase findability and discoverability through search engines, and help achieve a wider user base in specific knowledge domains.

    Figure 3. A directed acyclic graph (DAG; Healy & Nikolov, 2001) for the Plant Ontology (PO) term “gynoecium ridge” in the “plant anatomical entity,” indicating the terms and relationships from the most specific to the most general. Within DAGs, descriptive terms occur as nodes (boxes), and the relationships between nodes occur as edges (arrows), such as “is a” and “part of,” that link every node in the ontology, with the topology constrained so that multiple parent–child relationships may exist. Each node (child) can have one or more parents. More generally, this illustrates the representation of ontological concepts where a set of entities are classified hierarchically within a domain, and the assignment of relationships refines and increases the contextual meaning of information (Rhee, Wood, Dolinski, & Draghici, 2008)

    The properties of ontologies allow representation of complex multidimensional biological relationships, something that is not possible via typical branching trees or other hierarchical structures. The successful application of ontologies in biology (bio-ontologies) was pioneered over 20 yr ago with the establishment of the widely used Gene Ontology (GO) (http://www.geneontology.org), which provides a unified vocabulary for annotation of genes and attributes of gene products (Ashburner et al., 2000). Widespread adoption of GO has helped to increase the awareness amongst biologists of the value and descriptive power of formal knowledge representation and has stimulated a major advance in the development of other biological ontologies (Bada et al., 2004). For example, the Plant Ontology (PO) provides structured terminologies related to developmental stages in plants (Figure 3), and a vocabulary to compare data generated for different plant species (Walls et al., 2012).

    The PO highlights the advantages of ontological representation, where the terms or nodes may be related to more than one parent or ancestor term (e.g., “gynoecium” is a “collective phyllome structure” and is [also] part of a “flower”), and so can model many different types of information. Overall, this topology enables powerful and efficient grouping, searching, and analysis of terms (Blake et al., 2013; Kim, Caralt, & Hilliard, 2007) (see also Supplemental Figure S1).
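    The practical consequence of allowing more than one parent can be shown with a small Python sketch of a DAG traversal. The two edges from “gynoecium” follow the example given above; the remaining edges are a simplified toy hierarchy assumed for illustration and omit the intermediate classes present in the actual PO.

```python
# Each term maps to a list of (relationship, parent) edges.
# Only the two "gynoecium" edges come from the text; the others are a simplified
# toy hierarchy that omits intermediate classes found in the real Plant Ontology.
EDGES = {
    "gynoecium": [
        ("is_a", "collective phyllome structure"),
        ("part_of", "flower"),
    ],
    "collective phyllome structure": [("is_a", "plant structure")],
    "flower": [("is_a", "plant structure")],
    "plant structure": [("is_a", "plant anatomical entity")],
}


def ancestors(term, edges=EDGES):
    """Collect every term reachable from 'term' by following parent edges upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in edges.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def annotated_under(annotations, query, edges=EDGES):
    """Return items whose annotated term is the query term or one of its descendants."""
    return sorted(
        item
        for item, term in annotations.items()
        if term == query or query in ancestors(term, edges)
    )


# Invented dataset annotations: a search for "flower" also finds the gynoecium record.
annotations = {"sample_1": "gynoecium", "sample_2": "flower", "sample_3": "plant structure"}
print(sorted(ancestors("gynoecium")))
print(annotated_under(annotations, "flower"))
```

    In a strict tree, “gynoecium” could sit under only one of its two parents, so a search by the other parent would miss it; the DAG allows both routes to be followed.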

    Of the online tools available to identify and navigate existing ontologies, the most comprehensive and current are the Ontology LookUp Service (OLS) with 214 ontologies (Côté, Jones, Apweiler, & Hermjakob, 2006), and Ontobee with 190 ontologies (Xiang, Mungall, Ruttenberg, & He, 2011). These manage definitions within the Open Biological and Biomedical Ontologies (OBO) system (Smith et al., 2007) developed by the OBO Foundry. This organization manages the development of new ontologies, as well as maintenance and updates, thus avoiding redundancy in ontology terms. Although the Bioportal (https://bioportal.bioontology.org/) repository contains 773 ontologies (Noy et al., 2011), many of these are redundant, poorly documented, or not maintained. The AgroPortal repository (Jonquet et al., 2018) includes a selected subset of 106 ontologies and provides some additional search capabilities, including the ability to identify corresponding terms across different agriculture-related ontologies (http://agroportal.lirmm.fr/).

    4.1 Existing plant and crop ontologies do not adequately address dietary nutrition

    The PO (Jaiswal et al., 2005) was originally established for the model plant Arabidopsis, and then for rice and maize, to provide terms describing flowering plant anatomy, morphology, and developmental stages recognized as being generic properties of most flowering plants (Avraham et al., 2008; Cooper & Jaiswal, 2016). The PO has been extended with the establishment of the Plant Trait Ontology (TO), which is used for phenotypic traits and their comparison (Cooper et al., 2018). First integrated into the Gramene platform for rice (Jaiswal et al., 2002), TO has different classes to help sharing and reuse of different descriptive plant trait terminologies, including some relating to nutrition. For example, the class “quality trait” (https://archive.gramene.org/db/ontology/search?id=TO:0000162) includes the class “seed composition based quality trait” with subclasses “carbohydrate composition related trait,” “protein composition related trait,” and “fat and essential oil composition related trait.” However, the TO-derived class “TO:biochemical trait” has also been used to represent nutritional composition within a plant or plant part (Supplemental Table S3), including terms that equate to concentration of specific nutritional components in seeds or whole plants. Unfortunately, this multiple representation in different classes leads to an inherent ambiguity in assigning such terms as proxies for either commodity raw material quality or end-use nutritional composition for human or livestock diet.

    To address some of the specific challenges associated with representing crop-specific knowledge, the Crop Ontology (CO) (Matteis et al., 2013) was initiated in 2008 for chickpea (Cicer arietinum L.), maize, banana (Musa spp.), potato (Solanum tuberosum L.), rice, and wheat and has since been extended to 25 different crops (https://www.cropontology.org). The CO terms have been synchronized with the integrated breeding (IB) field books developed by the Generation Challenge Programme (Lugo-Espinosa et al., 2013) and have been generated from trait dictionaries (TDs) (Shrestha et al., 2010). Trait dictionaries consist of a list of phenotypic traits in an Excel spreadsheet template, with associated descriptions of traits, methods, and scoring scales (Jaiswal et al., 2006). The TDs have typically been generated by a representative group of researchers familiar with particular crop-specific datasets. More generally, they have been designed to help maintain quality control across records for phenotypic characteristics (Cooper et al., 2016).

    Trait names and allocated classes within TDs are used as the input to create CO terms, providing an accessible common format intelligible for breeders and researchers, and so maintain data accessibility and reusability. However, in practice, the adoption, curation, and collation of TDs remains a challenge, as it requires adequate knowledge of the entities described to assign sufficient metadata with actual data records accurately.

    It is generally recognized that if descriptive data are not standardized and interoperable between different ontologies from the outset, then comparison becomes inefficient (Lonsdale, Embley, Ding, Xu, & Hepp, 2010). For comparative nutritional analysis, as well as other interrogation, it is unfortunate that the current “CO:quality” and “CO:biochemical” trait classes within the CO have virtually no common usage across crops for specific trait names (Supplemental Information X). This is evident from the disparities between terms entered from TDs, which are then automatically integrated by the CO curation tool in development of the independent ontologies. For example, the class “quality traits” is available in the cowpea [Vigna unguiculata (L.) Walp.] ontology but is not present for soybean (Figure 4). Conversely, the class “biochemical traits” is present in the soybean ontology but not for cowpea.

    Figure 4. Comparison of domains for cowpea and soybean ontologies in Crop Ontology (http://www.cropontology.org/). The image was created based on the Crop Ontology webpage. Each ontology has six classes (e.g., abiotic stress traits, agronomic traits), with “quality traits” only appearing in cowpea, and “biochemical traits” only appearing in soybean (Supplemental Information). For each parent class, there exist sets of child terms

    This lack of standardized vocabulary between crops has led to a proliferation of synonymous trait terms used in different species. Indeed, we found that the most widely shared trait name is common to only 12 of 19 crops, and the next most prevalent term is shared by only seven crops. This lack of cohesiveness presents a barrier to comparative analysis and is of particular concern where minor crops could benefit from the accumulated knowledge and data sharing from other crops. Although the establishment of the CO has been of value to researchers working on specific crops, from the outset, it was not intended to facilitate direct comparative analysis (Jonquet et al., 2018). While the TO and derived CO have some capacity to describe certain nutritional composition attributes for crops, they lack consistency and completeness and do not provide adequate depth of knowledge representation for crop-derived nutritional information.
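    The kind of comparison behind these counts can be sketched in a few lines of Python: for each trait name, count the number of crop ontologies in which it appears. The trait lists below are invented toy data, not the actual CO content, and are chosen to show how near-synonyms such as “seed protein content” and “protein content” fail to match.

```python
from collections import Counter

# Toy trait lists invented for illustration; this is not actual Crop Ontology content.
CROP_TRAITS = {
    "cowpea": {"plant height", "seed protein content", "grain yield"},
    "soybean": {"plant height", "protein content", "grain yield"},
    "chickpea": {"plant height", "seed protein", "seed yield"},
}


def crops_per_trait(crop_traits):
    """Count, for each trait name, the number of crops whose list includes it."""
    counts = Counter()
    for traits in crop_traits.values():
        counts.update(traits)
    return counts


counts = crops_per_trait(CROP_TRAITS)
most_shared_trait, n_crops = counts.most_common(1)[0]
print(f"{most_shared_trait}: {n_crops} of {len(CROP_TRAITS)} crops")

# Trait names used by a single crop are candidates for unrecognized synonyms.
singletons = sorted(name for name, n in counts.items() if n == 1)
print(singletons)
```

    In this toy example, three differently worded protein traits each appear in only one crop, so an exact-match comparison reports no shared protein trait at all.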

    5 CONNECTING CROPS AND DIETARY NUTRITION

    We have identified several challenges confronting our data miners Gene and Di when they attempt comparison and reuse of data relating to crops and dietary nutrition. For most nutritionists, as well as policy and other decision makers, food composition databases and tables (FCT/FCDB) are the primary data platform available to source, share, and compare nutritional data. Although FCT/FCDB are supplemented and supported by research literature, databases, and food industry datasheets, as well as data sources maintained as unpublished spreadsheets, they mostly do not present drilled-down regional or cultivar-specific data (Azman Halimi et al., unpublished data, 2019).

    More generally, we have found very few available resources that provide relevant compilations of detailed peer-reviewed structured datasets or controlled vocabularies. There are surprisingly few ontologies that provide comprehensive representation of dietary nutritional information in terms of chemical, functional, and analytical attributes (Supplemental Table S4). Although superficially many bio-ontologies, including those for crops and food, are available for reuse, in practice, we have found it necessary to carry out extensive exploration and analysis. This is required to identify existing structured ontology components suitable for representing nutritional composition. While some ontologies have been developed to describe dietary nutritional information for specific end-user domains, most do not appear to be efficiently managed, updated, or easily accessible (Table 1).

    Table 1. Comparison of nutrition and food ontologies, sorted alphabetically. Some ontologies, such as the Ontologies for Nutritional Studies (ONS), Food Ontology (FOODON), and the Ontology for Nutritional Epidemiology (ONE), are very broad in scope and more focused on food items and food groups for the benefit of dieticians, restaurants, or hospitals. Others provide limited nutritional function (e.g., ONS, FOODON, and ONE) of value to nutritionists or researchers
    | Ontology | Status | No. of classes | Scope | Description | Format | Repository | URL or reference (footnote a) |
    |---|---|---|---|---|---|---|---|
    | Bionutrition Ontology (BNO) | Active | 100 | Nutrition capabilities | Human nutrition biomedical context | Web Ontology Language (OWL) | Bioportal | U[1] |
    | Food Ontology (FOODON) | Active | 27,097 | Obsolete class, entity | Food products | OWL | Bioportal, GitHub, Ontology Lookup Service (OLS) | U[2] |
    | Food-Oriented Ontology-driven System (FOODS) | Not found | Not found | Regional cuisine, dishes, ingredients, availability, nutrients, nutrition-based diseases, preparation methods, utensils, price | Web-based food menu recommender for patients with diabetes in Thailand | Resource Description Framework (RDF), SPARQL | Web-based | URL not found (Snae & Brückner, 2008) |
    | OntoFood (OF) | Active | 292 | Entity, entity | Nutritional rules for diabetic patients | OWL | Bioportal | U[3] |
    | Ontology-Driven Mobile Safe Food Consumption System (FoodWiki) | Not found | 3000 | Diseases, person, ingredients, product | Processed food | OWL | Mobile application (no URL or source found) | URL not found (Çelik, 2015) |
    | Ontology for Nutritional Epidemiology (ONE) | Active | 339 | Case studies, descriptors for nutritional epidemiological data, extra list of terms for description, nutritional epidemiological terms | Nutritional epidemiological studies | OWL | Bioportal | U[4] |
    | Ontology for Nutritional Studies (ONS) | Active | 3442 | Entity, obsolete class, version | Nutritional studies | OWL, RDF | Bioportal | U[5] |
    | Ontology for Pediatric Nutrition Desktop Application (open-desktop application) | Active | 45 | Malnutrition, malnutrition_caution, food, nutrient, nutrition_function, person | Pediatric nutrition | OWL | GitHub | U[6] |
    | Underutilized Crops Ontologies (UC-ONTO) | Not recently updated | 111 | Date time description | Nutrition and other information for underutilized crops | OWL | GitHub | U[7] |

    Although a number of ontologies have been created to represent knowledge related to nutritional information or food, only one appears to be of particular relevance for comparative study of dietary value for crop plants. OntoFood (Nachabe, El Hassan, AlMouhammad, & Girod Genet, 2017) manages 292 ontology terms relevant to diabetic patients, and contains the “material entity” subclass. This includes the major macronutrients such as “carbohydrates,” “fat,” or “protein,” each of which may be further subclassified, such as carbohydrate → complex and carbohydrate → simple. This hierarchical subclassification allows the structured representation of nutritional components, with up to six levels of increasing specificity in terms of chemical identity (Supplemental Figure S2, https://bioportal.bioontology.org/ontologies/OF/?p=classes&conceptid=root).
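    The idea of hierarchical subclassification with increasing specificity can be made concrete with a small sketch. The child-to-parent links below are a hypothetical arrangement loosely following the carbohydrate example in the text; they are not copied from the OntoFood source file, and the function names are our own.

```python
# Hypothetical child -> parent links, loosely following the example in the text.
parent_of = {
    "macronutrient": "material entity",
    "carbohydrate": "macronutrient",
    "fat": "macronutrient",
    "protein": "macronutrient",
    "complex carbohydrate": "carbohydrate",
    "simple carbohydrate": "carbohydrate",
}


def ancestors(term, parents):
    """Return the chain of parent terms from `term` up to the root."""
    chain = []
    seen = {term}
    while term in parents:
        term = parents[term]
        if term in seen:
            # Guard against malformed input; a class hierarchy should be acyclic.
            raise ValueError(f"cycle detected at {term!r}")
        seen.add(term)
        chain.append(term)
    return chain


def depth(term, parents):
    """Number of levels between `term` and the root (the root has depth 0)."""
    return len(ancestors(term, parents))


print(ancestors("complex carbohydrate", parent_of))
print(depth("complex carbohydrate", parent_of))
```

    Walking up the parent chain in this way is what allows a query on a broad term such as “carbohydrate” to retrieve data annotated with any of its more specific subclasses.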

    Although initially promising, we found that some of the “macronutrient” components in OntoFood were not appropriately classified. For example, “mineral” is assigned to the micronutrient class by Latham (1997), but in OntoFood, “mineral” is allocated to the macronutrient class. In addition, OntoFood is incomplete, with some obvious omissions such as amylose or amylopectin concentrations of starch, and antinutritional factors known as the raffinose family of oligosaccharides (RFO), which includes raffinose, verbascose, and ajugose. Likewise, the Bionutrition Ontology (BNO) (https://bioportal.bioontology.org/ontologies/BNO) focuses on (functional) nutritional capabilities but appears to be very limited in scope and incomplete (Supplemental Figure S3).

    The Underutilized Crops Ontology and related linked data (UC-ONTO) (https://github.com/Abbalawan/Semantologies) appeared to be relevant by including a “nutrition” class, although this was very limited in scope and accessibility. We also found a lack of specificity in detailed subclassification of terms within classes for a range of other nutrition-related ontologies, such as the Pediatric Nutrition Ontology (PNO) (Sari, Sihwi, & Anggrainingsih, 2014), which likely contributes to the relatively scant evidence for their practical implementation (Table 1).

    6 PROPOSED DEVELOPMENT OF AN ONTOLOGY FOR CROP DIETARY NUTRITION

    It is clear that currently available tools and resources are limited in their ability to manage, represent, and compare knowledge relating to specific crops, cultivars, and dietary nutrition. This suggests the need to establish a new ontology that would be of value for data sharing amongst crop scientists and nutritionists, particularly in terms of the ability to understand variation in composition and nutritional value both between and within crops. We therefore propose the development of a CDNO. Generating a formal human- and machine-readable controlled vocabulary to navigate crop-related nutritional information should greatly enhance the findability and connection of terminologies within external databases or online repositories (Figure 1).

    To achieve this and connect knowledge between a range of key stakeholders such as Gene and Di, we propose establishment of three different ontologies (domains or major classes) (Figure 1, bottom):

    [CDNO: dietary nutritional components], for chemical composition as typically analyzed to assess dietary nutritional components.

    [CDNO: dietary function], for the dietary functional role of crop-derived nutritional components.

    [CDNO: analytical method], for the analytical methodology used for each nutritional component.
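    One way to picture how the three proposed classes could relate to one another is sketched below. This is an illustrative data structure only, not the CDNO implementation: the class and method names, the relation names ("associated_with", "measured_by"), and the assay label are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Domain(Enum):
    """The three proposed CDNO classes."""
    COMPONENT = "dietary nutritional components"
    FUNCTION = "dietary function"
    METHOD = "analytical method"


@dataclass
class Vocabulary:
    domain_of: dict = field(default_factory=dict)  # term label -> Domain
    links: set = field(default_factory=set)        # (subject, relation, object)

    def add(self, label, domain):
        self.domain_of[label] = domain

    def link(self, subject, relation, obj):
        """Record a relation between two terms that are already registered."""
        for label in (subject, obj):
            if label not in self.domain_of:
                raise KeyError(f"unknown term: {label}")
        self.links.add((subject, relation, obj))

    def related(self, label, domain):
        """Terms in `domain` linked to `label`, in either direction."""
        found = set()
        for subject, _relation, obj in self.links:
            if subject == label and self.domain_of[obj] is domain:
                found.add(obj)
            elif obj == label and self.domain_of[subject] is domain:
                found.add(subject)
        return sorted(found)


vocab = Vocabulary()
vocab.add("amylose", Domain.COMPONENT)
vocab.add("low glycemic index", Domain.FUNCTION)
vocab.add("placeholder starch assay", Domain.METHOD)  # hypothetical method label
vocab.link("amylose", "associated_with", "low glycemic index")
vocab.link("amylose", "measured_by", "placeholder starch assay")

print(vocab.related("low glycemic index", Domain.COMPONENT))
```

    Keeping the three domains separate but linked means a user can enter from any one of them, for example starting from a dietary function and arriving at the components and methods connected to it.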

    Prior to developing any ontology, it is important to take into account terms that may be available for reuse from existing ontologies (Bontas, Mochol, & Tolksdorf, 2005). Such reuse can reduce the time and cost required in developing ontologies from scratch (Lonsdale et al., 2010).

    The population of terms within the [CDNO: dietary nutritional components] will make use of data definitions structured within a branched hierarchical nutrition schema recently developed by Azman Halimi et al. (2019), which presents a systematic and relevant well-defined classification, drawing on the classification already described by Latham (1997) and the USDA Food Composition databases (https://ndb.nal.usda.gov/ndb/). For the reuse of existing ontologies, there is opportunity to derive a large proportion of relevant nutritional component information from one of the most complete and successful ontologies. Established to share chemical information, the Chemical Entities of Biological Interest (ChEBI) (http://www.ebi.ac.uk/chebi) ontology contains 55,800 (last release 2019) annotations including chemical structure and synonyms (Hastings et al., 2016; Supplemental Table S3), which are based on hierarchies and relationships associated with molecular entities and biosynthesis.

    However, we have found that ChEBI represents chemicals at a high and exhaustive level of granularity (Supplemental Figure S1), primarily from the point of view of organic chemistry and biosynthesis, and so is overspecified for an adequate representation of chemical components in the context of dietary intake. Nevertheless, the comprehensive coverage of chemical components within ChEBI is an excellent starting point. We are able to reuse relationships between terms where entities within ChEBI are semantically equivalent to nutritional components, and we have identified the need to insert additional terms where nutritional subclasses are incomplete. We have also found other gaps in coverage, with scope for inclusion of a wider range of plant-specific components, mostly relating to secondary metabolism and nutrition. For example, vitamins are represented in ChEBI as the subclass “water soluble vitamin,” which includes vitamins C and B, and the subclass “fat soluble vitamin,” which has no members represented.
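    To illustrate what reducing an overspecified hierarchy might involve, the sketch below reattaches each retained term to its nearest retained ancestor and drops the intermediate levels. This is one possible approach, offered as an assumption rather than the procedure used for the CDNO; the intermediate class names are placeholders and do not reflect the actual ChEBI structure.

```python
# Hypothetical source hierarchy (child -> parent) with placeholder intermediates.
source_parent_of = {
    "amylose": "intermediate_class_2",
    "intermediate_class_2": "intermediate_class_1",
    "intermediate_class_1": "polysaccharide",
    "polysaccharide": "carbohydrate",
}

# Terms judged relevant to dietary nutrition (an illustrative selection).
keep = {"amylose", "polysaccharide", "carbohydrate"}


def collapse(parent_of, retained):
    """Map each retained term to its nearest retained ancestor.

    Assumes the input hierarchy is acyclic. Retained terms with no retained
    ancestor become roots and therefore have no entry in the result.
    """
    slim = {}
    for term in retained:
        node = parent_of.get(term)
        while node is not None and node not in retained:
            node = parent_of.get(node)  # skip over dropped intermediate levels
        if node is not None:
            slim[term] = node
    return slim


slim_parent_of = collapse(source_parent_of, keep)
print(slim_parent_of)
```

    Terms missing from the source hierarchy, such as additional nutritional subclasses, would then be added to the reduced hierarchy by hand.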

    The development of the [CDNO: dietary function] class has the potential to increase the value of comparative crop-based data, particularly in decision making for nutritionists, or those wishing to gain market advantage for functional foods. As well as encompassing aspects of functional foods and nutrigenomics (Sikalidis, 2018) such as glycemic index (GI), antioxidants, phytonutrients, and other beneficial compounds, it would allow representation of antinutritional factors and known crop-based food toxins. Ideally, one would wish to see incorporation of indicators of bioavailable nutrients, although at present there are limited datasets available at the level of crop cultivar (Bechoff & Dhuique-Mayer, 2017; Jeong & Guerinot, 2008).

    To bridge the gap between nutritional, genetic, and agronomic information relevant to crop composition, it is important to have a clear representation of the analytical protocols used to acquire such knowledge. The [CDNO: analytical method] class would provide a structured framework for representation of knowledge describing the methodologies and protocols used for sampling, extraction, and analysis of each nutritional component. Some existing ontologies have been developed to share analytical methods, such as the chemical methods ontology (CHMO) (https://www.ebi.ac.uk/ols/ontologies/chmo), and the physicochemical methods and properties (FIX) (https://www.ebi.ac.uk/ols/ontologies/fix). The FIX ontology has a more complete representation of analytical methods. However, it is necessary to extend the scope of FIX by including classes related to sample preparation, which may include extraction and isolation, and additional terminologies for analysis. We therefore propose reusing and integrating relevant FIX subsets into the CDNO analytical method category.

    In practice, we would expect the CDNO to be used as a tool that enables consistent linking and searching of crop-specific databases using a controlled vocabulary. Our data miners Gene and Di can expect to benefit from the increased linkage between disparate data sources and datasets (Figure 1), and the ability to navigate structured nutritional information within the CDNO. For example, our legume breeder Gene may start by searching a crop-specific database for a major crop such as soybean based on the nutritional term “digestible_starch,” and then wish to find corresponding data described by the same CDNO term in other legume crops to compare reported concentrations (Figure 1). On another occasion, Gene may be searching for nutritional components within a minor legume crop that may be correlated or anticorrelated with the same “digestible_starch” term and be drawn to an interesting relationship with amylose content. Meanwhile, as a nutritionist, Di is very keen to understand what crops or locally suited cultivars may be available that provide a low-GI option for regional diets, to help manage an increase in diabetes in the regional population. For this, she would make use of the dietary function class within CDNO to navigate a range of crop-specific and comparative databases. Her search may therefore start with the attribute “low glycemic index.” This dietary function search term would help her find datasets related to low GI and in turn uncover associated relationships to compositional variation in “digestible_starch” or “amylose” content.
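    The two searches described for Gene and Di can be sketched as simple lookups over records annotated with shared terms. Everything in the example below is hypothetical: the database and crop labels, the record layout, the function names, and the numeric values, which are placeholders rather than measured concentrations.

```python
from collections import defaultdict

# Hypothetical records pooled from separate crop databases; values are placeholders.
records = [
    {"database": "db_major", "crop": "major_legume", "cultivar": "cv1",
     "term": "digestible_starch", "value": 1.0},
    {"database": "db_minor", "crop": "minor_legume", "cultivar": "cv2",
     "term": "digestible_starch", "value": 2.0},
    {"database": "db_minor", "crop": "minor_legume", "cultivar": "cv2",
     "term": "amylose", "value": 3.0},
]

# Hypothetical links from a dietary function term to component terms.
function_to_components = {"low glycemic index": {"digestible_starch", "amylose"}}


def by_term(rows, term):
    """Gene's search: values for one component term, grouped by crop."""
    grouped = defaultdict(list)
    for row in rows:
        if row["term"] == term:
            grouped[row["crop"]].append((row["cultivar"], row["value"]))
    return dict(grouped)


def by_function(rows, function, links):
    """Di's search: start from a dietary function and gather linked components."""
    return {term: by_term(rows, term) for term in sorted(links.get(function, ()))}


print(by_term(records, "digestible_starch"))
print(by_function(records, "low glycemic index", function_to_components))
```

    Both searches depend on every source database annotating its records with the same term, which is the role the shared controlled vocabulary is intended to play.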

    7 CONCLUSION AND OUTLOOK

    We have found that, in general, data standardization across different crop species has yet to be fully achieved. In particular, the ability to share and represent knowledge relating to crop dietary nutritional composition remains a challenge. Existing ontologies for dietary nutrition have not yet been widely used or updated, and relevant data sources are poorly connected to those that manage crop- and cultivar-specific information, in the context of genetic resources and pre-breeding germplasm. This is evident from the lack of consistency and very limited coverage of information relating to dietary nutrition within the CO system.

    We anticipate that future crop data management systems will adhere to FAIR principles and provide the opportunity for transdisciplinary analysis that refers to the CDNO. This is likely to have more general impact, given the current lack of cohesion in the presentation of nutrition data relating to major crops (ten Berge et al., 2019; Wheeler & von Braun, 2013).

    To ensure that the CDNO is adopted and developed to be a valuable tool for a wide range of crop-related researchers and stakeholders, it is important to identify several specific public-access use cases. We are currently collating a series of existing (Azman Halimi et al., 2019) and new datasets for bambara groundnut [Vigna subterranea (L.) Verdc.] for 60 nutritional components, alongside a comparative study of five legume species that incorporates 97 components and 155 studies. In addition, we will apply the CDNO terms to existing and new datasets associated with the BIP (Eckes et al., 2017) that include various seed fatty acids (Barker, Larson, Graham, Lynn, & King, 2007), leaf Ca+ and Mg+ (Broadley et al., 2008), and Zn (Broadley et al., 2010). Based on these initial use cases, we expect wider adoption within communities of practice such as DivSeek and the Multinational Brassica Genome Project (www.brassica.info), and dissemination via networks such as the Global Action Plan for Agricultural Diversification and the Food Security Information Network.

    As indicated with our example use-case queries posed by Gene and Di, we expect the CDNO will facilitate far wider exploration of crop-related nutritional datasets and place greater emphasis on identifying links between cultivar-specific variation in composition and dietary function. This would contribute to meeting some key objectives outlined in Fanzo et al. (2018), which highlights the need to make data easy to interpret by policymakers and businesses, as well as nongovernmental organizations who are making decisions about where to invest and to intervene (Zamir, 2013). More generally, practical approaches to increasing food and nutritional security require common knowledge representation to assist in decision making for breeders, farmers, and nutritionists in relation to crop and cultivar choice.

    CONFLICT OF INTEREST

    The authors declare that there is no conflict of interest.

    ACKNOWLEDGMENTS

    We would like to thank Ramil Mauleon for comments that improved the manuscript, Cyril Pommier for contributions to refining Figure 2, and Sadaf Naz for contributions to Supplemental Table S1.