A tool for integrating agrometeorological observation data for digital agriculture: A Minnesota case study
Assigned to Associate Editor E. RoTimi Ojo.
Abstract
Agrometeorological data are essential for understanding production using digital agriculture techniques. However, integrating agrometeorological observations from multiple sources remains a challenge. Often, digital agriculture scientists download and clean the same datasets many times. We present a prototype system that simplifies the process of collecting, cleaning, integrating, and aggregating data from meteorological data sources by providing a simplified user interface, database, and application programming interface. The prototype provides a standard interface for querying multiple geospatial formats (raster and vector) and integrates observation networks including the National Oceanic and Atmospheric Administration Global Historical Climatology Network (NOAA GHCN), NOAA NClimGrid-Daily (NOAA's gridded daily climate dataset), and Ameriflux BASE. The system automatically checks and updates data, saving storage space and processing time, and allows users to summarize data spatially and temporally. Provided as open source code with a browser-based user interface, the application and integration system can be run across Windows, Linux, and Mac environments to support broader use of multi-source agrometeorology data.
Core Ideas
- Digital agriculture scientists often re-download and re-clean the same agrometeorology datasets.
- Prototype provides a standard interface for querying multiple geospatial formats and data sources.
- System checks if data are already downloaded and keeps local version of data in sync with remote versions.
- Provides spatial and temporal summarization across computing platforms.
Abbreviations
- API, application programming interface
- CSV, comma-separated value
- ETL, extract, transform, load
- GDD, growing degree day
- GHCNd, Global Historical Climatology Network daily
- JSON, JavaScript Object Notation
- NetCDF, Network Common Data Form
- NOAA, National Oceanic and Atmospheric Administration
- OGC, Open Geospatial Consortium
- PRCP, precipitation
- REST, Representational State Transfer
1 INTRODUCTION
Digital agriculture is an essential tool to help feed and clothe the world by providing cross-scale estimates of yields and environmental impacts (Fuller et al., 2023). One common approach is to use regional agricultural models to partition variation among genetics, management, and climate (Henry, 2020), and then to project how crops will perform in the future. Such models are often parameterized with relatively little data (Roberts et al., 2017), and such sparse data have implications for the types of inference that can be made beyond the specific site-years studied (Elias et al., 2016). There are many techniques to improve generalizability, ranging from improved system representation in statistical and machine learning models to the use of process-based modeling (Karpatne et al., 2022), but one established and straightforward approach is to increase the amount of data for parameterization, training, and validation (Basso & Antle, 2020; Runck et al., 2024). This increase in data leads to a new challenge for digital agriculture scientists, namely, that such data are often in disparate formats, varied in form, and of differential quality.
A particular example of this is agrometeorological data, among the most common data required to characterize crop-growing environments (Fischer, 2015; Frazier et al., 2022; Jägermeyr et al., 2021; Lobell & Gourdji, 2012). Such data are used in simple growing degree day (GDD) models, complex simulation models, and statistical and machine-learning approaches (Ramirez-Villegas & Challinor, 2012). Despite their commonplace nature, it remains a challenge to obtain analysis-ready meteorological readings.
Reliable climate data at the right spatial and temporal resolution are key to creating accurate and useful models (Fick & Hijmans, 2017). Many options exist for digital agriculture scientists, ranging from climate grids such as the Oregon State Parameter-elevation Regressions on Independent Slopes Model (PRISM Climate Group, Oregon State University, n.d.), to public observation networks such as the National Weather Service Cooperative Observer Program (https://www.weather.gov/coop/), to regional mesoscale networks such as the Oklahoma Mesonet (Brock et al., 1995; McPherson et al., 2007), to private weather networks. The problem is not a dearth of data; the problem is that it remains challenging and time consuming to integrate data from across these many sources.
Each data source has its own unique interface, sometimes machine-readable and other times in non-standardized formats. Even when machine-readable, each source has a different application programming interface (API), naming conventions, license requirements, data quality, and spatial and temporal resolution. Thus, a key need is to manage this complexity within already complex local workflows so that the quirks of each individual data source are hidden and data are returned in a consistent format. However, many of the current sources of quality data are difficult for non-specialists to interact with, many data products lack high temporal specificity (e.g., annual), and each has a different spatial resolution (1 m–10 km to county, state, or country).
There have been numerous efforts to create data standards (e.g., findability, accessibility, interoperability, and reusability) to ensure data products are both accessible and easy to use (Jacobsen et al., 2020; McCord et al., 2023; Runck et al., 2022; Wilkinson et al., 2016). For data to be useful, they must be accessible in different contexts. For location-specific research studies, point-based weather station data might be best, whereas a regional study of growing conditions may need gridded climate data products. Different research contexts may need time-series data in daily, weekly, monthly, or yearly summaries. Sourcing the right dataset is often a challenge, as the type of question asked changes the appropriate type of data. However, too often data are still found in the supplemental materials of articles rather than in easily accessible, archived, and searchable databases.
The objectives of this study were to make various data sources more findable and accessible through a common interface by (1) creating a data collection pipeline for the National Oceanic and Atmospheric Administration (NOAA) weather station API, NOAA gridded weather data products, and Ameriflux data, and (2) providing easily accessible data aggregations of these data at user-defined data types, locations, and spatial resolutions. We provide open source code to show how any individual user can generate temporally and spatially explicit queries for a target geography. Further, we have generated a test case with an example of a functional graphical user interface to show how the data search could be operationalized.
2 MATERIALS AND METHODS
2.1 Climate data acquisition
The initial implementation of the data integration tool provides data products and aggregation from two NOAA data sources and one Ameriflux data source. During the design process, Open Geospatial Consortium (OGC) (Matheus et al., 2021) and Representational State Transfer (REST) (Richardson & Ruby, 2007) design principles and standards were considered. The intended users of the system are agricultural scientists, who interact with it through a graphical user interface. RESTful design principles were followed in development but were not strictly adhered to. While OGC standards offer robust and well-established interfaces for data access and interoperability, adhering strictly to these standards was not within scope. The main objective was to create a user-centric interface and download process in which complex data standards are abstracted away. Integrating OGC standards requires additional development time and resources, which did not align with the initial proof-of-concept goals of simplicity and ease of use for non-technical users. The system's architecture is designed to be extendable, allowing for future integration with OGC standards should the need for broader interoperability arise.
The NOAA and Ameriflux APIs provide access to historical climate and flux data but have limited usability and aggregation capabilities. Accessing and processing these data can be a time-consuming step in research workflows. This code automates data downloading and aggregation as an on-demand service. The initial implementation provides access to the NOAA Global Historical Climatology Network daily (GHCNd) (https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-daily), NOAA NClimGrid-Daily (https://www.ncei.noaa.gov/products/land-based-station/nclimgrid-daily), and Ameriflux BASE (https://ameriflux.lbl.gov/data/flux-data-products/base-publish/) datasets. The GHCNd dataset contains daily point-based historical weather station readings for more than 100,000 stations globally. These source data are accessed via NOAA's Climate Data Online API, which provides JavaScript Object Notation (JSON) responses. The NClimGrid-Daily dataset contains interpolated gridded data for historical temperature and precipitation (PRCP) across the continental United States. These source data are accessed through a file transfer protocol server providing Network Common Data Form (NetCDF) files. The Ameriflux BASE dataset contains 30-min historical flux station readings for hundreds of stations across the Americas. These source data are accessed through an R API library (Chu & Hufkens, 2021). The state of Minnesota is used as a proof-of-concept case study.
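As an illustration of the kind of request the GHCNd pipeline issues, the sketch below builds a paginated query URL for NOAA's Climate Data Online API. The token header and per-request record cap are documented API behaviors; the station identifier and dates are illustrative only, and this is not the project's exact code.

```python
# Hedged sketch: building one page of a NOAA Climate Data Online (CDO) v2
# request for GHCNd observations. The station and dates are examples.
from urllib.parse import urlencode

CDO_BASE = "https://www.ncei.noaa.gov/cdo-web/api/v2/data"

def build_cdo_query(station_id, start, end, offset=1, limit=1000):
    """Return the full CDO request URL for one page of GHCNd data."""
    params = {
        "datasetid": "GHCND",
        "stationid": station_id,  # e.g., a Minnesota station
        "startdate": start,
        "enddate": end,
        "limit": limit,           # CDO caps each response at 1000 records
        "offset": offset,         # record offset used for pagination
        "units": "metric",
    }
    return f"{CDO_BASE}?{urlencode(params)}"

url = build_cdo_query("GHCND:USW00014922", "2023-05-01", "2023-05-31")
# The actual fetch (omitted) would send this URL with a {"token": ...}
# header holding the user's CDO token and parse the JSON response.
```

Because each response is capped at 1000 records, a full download loops over `offset` until the response metadata indicates no records remain.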
2.2 System architecture
The system design is an aggregation of multiple data sources; the system architecture is shown in Figure 1. A custom-coded extract, transform, load (ETL) manager was created for each of the three data sources to provide data, download missing data, and aggregate data based on user preferences. These ETL pipelines are available on GitHub (https://github.com/RTGS-Lab/project_zero_prototype). A Flask API application then provides a user interface to the ETL managers, which can be accessed directly or through a web user interface for easy data downloading to comma-separated value (CSV) or JSON format.
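The Flask layer described above can be pictured as a thin dispatcher over the ETL managers. The sketch below is a minimal, hypothetical version: the route path, query parameters, and manager interface are assumptions for illustration, not the project's actual API.

```python
# Minimal sketch of a Flask dispatcher over per-source ETL managers.
# Endpoint and manager names are illustrative, not the project's API.
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical registry mapping a source name to its ETL manager object,
# e.g., {"ghcnd": GhcndEtl(), "nclimgrid": NclimGridEtl()}.
ETL_MANAGERS = {}

@app.route("/data/<source>")
def get_data(source):
    manager = ETL_MANAGERS.get(source)
    if manager is None:
        return jsonify({"error": f"unknown source: {source}"}), 404
    # A real manager would download any missing data, aggregate it, and
    # return rows in the requested format (CSV or JSON).
    rows = manager.query(
        start=request.args.get("start"),
        end=request.args.get("end"),
        bbox=request.args.get("bbox"),
    )
    return jsonify(rows)
```

In this shape, adding a new data source means registering one more manager object; the HTTP interface stays the same.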

The ETL manager for the NOAA GHCNd implements a user-provided Structured Query Language (SQL) database to cache data. It works by copying weather station data from NOAA's API as needed upon request. The NOAA API limits responses to 1000 records per request, which slows data acquisition significantly. This can be troublesome, especially for data that may need to be accessed regularly. Upon the user's request, the ETL manager aggregates and serves data kept in the database. The ETL manager for the NOAA NClimGrid-Daily follows a similar structure to the GHCNd manager but does not implement database functionality. The NOAA regularly updates the gridded data products, which are stored in NetCDF file format; each NetCDF file contains a given month's daily grids. The ETL manager checks for and downloads the necessary files for each user request. The Ameriflux ETL manager follows a similar structure to the NOAA NClimGrid-Daily manager but has modified metadata outputs. The Ameriflux API provides completeness data for each reported variable, so a CSV file can be requested to view variables and completeness information.
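The cache-then-fetch idea behind the GHCNd manager can be sketched as follows: before issuing slow, rate-limited API calls, check which dates are already stored locally and fetch only the gaps. An in-memory dictionary stands in for the project's SQL database; function and key names are assumptions for illustration.

```python
# Sketch of cache-aware fetching: request only dates not yet stored
# locally. A dict of per-station date sets stands in for the SQL cache.
from datetime import date, timedelta

def daterange(start, end):
    """Yield each date from start to end, inclusive."""
    d = start
    while d <= end:
        yield d
        d += timedelta(days=1)

def missing_dates(cache, station, start, end):
    """Return dates in [start, end] not yet cached for this station."""
    have = cache.get(station, set())
    return [d for d in daterange(start, end) if d not in have]

cache = {"GHCND:X": {date(2023, 5, 1), date(2023, 5, 2)}}
todo = missing_dates(cache, "GHCND:X", date(2023, 5, 1), date(2023, 5, 4))
# Only 2023-05-03 and 2023-05-04 still need to be fetched from the API.
```

Only the `todo` dates are then requested from the remote API, which is what keeps the local copy in sync while avoiding repeated downloads.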
More APIs and data sources can be implemented with slight modifications to these ETL managers. The NClimGrid-Daily and Ameriflux ETL managers were themselves created by taking the open source GHCNd ETL code and modifying it, and users can adapt the code in the same way for their own specific needs.
3 RESULTS AND DISCUSSION
The output of this tool is a CSV or JSON file aggregated and formatted for ease of data analytics. The tool can provide a single interface to multiple data stores and output data in a consistent format. This makes data acquisition easier and quicker, and simplifies users' code. Each observation contains spatiotemporal metadata (latitude, longitude, timestamp, etc.) along with the given observation and data type (Table 1).
| Parameter | Description |
|---|---|
| Start and end date | The beginning and ending dates for the data collection period. |
| Location (bounding box or list) | The geographical area for data collection, specified either as a bounding box or a list of locations. |
| Data types | The specific types of data to be collected, such as average temperature (TAVG), minimum temperature (TMIN), and precipitation (PRCP). |
| Aggregation time scale | The time intervals over which data are aggregated, such as daily, weekly, monthly, or yearly. |
| Type of aggregation per data type | The method of aggregating each type of data, for example, the mean of TAVG or the sum of PRCP. |
| Data format | The structure of the data output, either as a tall dataframe (long format) or a wide dataframe (wide format). |
| Output | The preferred format for the output: metadata, comma-separated value (CSV), or JavaScript Object Notation (JSON). |
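A concrete request touching every parameter in Table 1 might look like the dictionary below. The field names and bounding-box convention are assumptions for the example, not the tool's exact schema.

```python
# Illustrative request mirroring the Table 1 parameters; field names are
# assumptions for this example, not the tool's exact API schema.
request_params = {
    "start_date": "2022-04-01",
    "end_date": "2022-09-30",
    # Bounding box as [min_lon, min_lat, max_lon, max_lat],
    # here covering roughly the state of Minnesota.
    "location": {"bbox": [-97.2, 43.5, -89.5, 49.4]},
    "data_types": ["TAVG", "TMIN", "PRCP"],
    "aggregation_time_scale": "weekly",
    # Per-type aggregation: average the temperatures, total the rainfall.
    "aggregation_per_type": {"TAVG": "mean", "PRCP": "sum"},
    "data_format": "long",
    "output": "CSV",
}
```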
This tool harmonizes all data types, including converting gridded data into point-based data. While this removes the tessellated, visual aspect of the data, it provides easier access for point-based comparisons and analytics. This tool's output differs from other widely used climate data sources in that it provides granular control of data outputs. A similar data product, WorldClim, provides comparable temperature and PRCP interpolations, but as downscaled rasters that were, at the time of writing, only updated through 2021 (Fick & Hijmans, 2017). This project pulls data as soon as they are updated by the NOAA or Ameriflux, ensuring near-current information. Compared to NOAA's data tools, this provides an alternative with customized aggregations (i.e., sum, mean, and median) and data cleaned to match the user request (i.e., dataframe formatting and JSON file type). It also provides access to the gridded data in point form, allowing for alternative forms of analysis, such as point comparisons across a wide area. The overall program is designed to be straightforward for an end-user with minimal programming experience, who only needs to start a Flask application and have a database connection to begin working with data.
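The grid-to-point conversion described above amounts to flattening each grid cell into a (latitude, longitude, value) record. The sketch below shows the idea on a tiny in-memory grid; a real implementation would read the NClimGrid-Daily NetCDF files (e.g., with a library such as xarray), and the coordinates here are illustrative.

```python
# Sketch of grid-to-point harmonization: each grid cell becomes one
# (lat, lon, value) record. A small in-memory grid stands in for the
# NClimGrid-Daily NetCDF data.
def grid_to_points(lats, lons, values):
    """Flatten a 2-D grid into point records (lat, lon, value)."""
    return [
        (lat, lon, values[i][j])
        for i, lat in enumerate(lats)
        for j, lon in enumerate(lons)
    ]

lats = [46.0, 46.05]             # grid row centers (degrees north)
lons = [-94.0, -93.95]           # grid column centers (degrees east)
prcp = [[1.2, 0.0], [3.4, 0.8]]  # daily PRCP (mm) per cell
points = grid_to_points(lats, lons, prcp)
# points[0] == (46.0, -94.0, 1.2)
```

Once in this form, gridded PRCP can be compared directly against station observations at nearby coordinates.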
3.1 How to access and use the tool
The code for this tool is publicly available on GitHub at https://github.com/RTGS-Lab/project_zero_prototype. A user can clone the repository and follow the provided instructions in the README file to run the program and get the data. The tool is designed to run locally or is easily adaptable to a remote environment with the Flask application. Once fully set up, the Flask application will be the only necessary file to run to work the tool.
The tool can be accessed directly through the application's Flask API. This can be done programmatically and is useful for crop modeling at scale. The tool can also be accessed using a graphical web interface, shown in Figure 2, that gives users a direct way to define parameters and locations based on their needs. When using the API, a user picks the dates, time scale, output data types, and the spatial extent of the study area. After running the tool, the user receives processed data products built from the most recent data available.

Using the graphical user interface, a user can define a bounding box area for weather stations and gridded data products. The API tool is designed to work in a two-call process: first to preview metadata, then to perform the data download. This process ensures data completeness and serves as a sanity check for users. A user first defines the API call parameters they plan to download, then selects the “metadata” option for “Direct Download”. This runs a test case for the requested data and reports the number of observations (GHCNd), the files available (NClimGrid-Daily), or variable completeness (Ameriflux). This “metadata” check outputs a metadata JSON that notifies the user of any missing data; for the GHCNd interface, it can also start loading missing or new observations into the database. When a user is ready to download data directly, they can change “Direct Download” to “CSV” or “JSON” depending on output preference. This performs all the data aggregation and outputs all available data. Examples of this and a typical user process flow can be found in the project's GitHub repository (https://github.com/RTGS-Lab/project_zero_prototype).
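The two-call process above can be sketched as a small client: request the metadata preview first, and download only if the preview shows enough data. The `fetch` callable stands in for the HTTP layer, and all parameter names are illustrative, not the tool's exact API.

```python
# Sketch of the two-call workflow: preview metadata, then download.
# `fetch` is a stand-in for an HTTP call to the Flask API; parameter
# names are illustrative.
def two_call_download(fetch, params, min_observations=1):
    """Call 1: preview metadata. Call 2: download if enough data exist."""
    preview = fetch({**params, "output": "metadata"})
    if preview.get("observations", 0) < min_observations:
        raise ValueError(f"too little data available: {preview}")
    return fetch({**params, "output": "CSV"})

# Fake fetcher standing in for the HTTP layer, for demonstration only.
def fake_fetch(params):
    if params["output"] == "metadata":
        return {"observations": 42, "missing": 0}
    return "date,station,TAVG\n2023-05-01,GHCND:X,12.5\n"

csv_text = two_call_download(fake_fetch, {"datatype": "TAVG"})
```

The metadata call doubles as the trigger for cache filling: in the real system, a GHCNd preview can start loading missing observations into the database before the full download is requested.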
3.2 Potential for use in digital agriculture
Cropping calendars developed from heuristics are common across the world. They define growing seasons for specific crops and target stages of growth and development. The calendars are often highly region-specific and are not always formalized for specialty and niche crops. Climate change is rapidly altering cropping calendars. This tool provides a way to examine spatial and temporal trends in cropping calendars based on temperature and PRCP, providing a potential starting point for decision support for regional and local climate mitigation. By using high-resolution historical climate data, users can analyze past weather patterns to understand how temperature and PRCP have influenced crop growth and development over time. This analysis can help identify shifts in growing seasons and inform adjustments to planting and harvesting schedules. Additionally, it can help refine irrigation schedules, GDD calendar creation, and pest management strategies based on historical climate conditions.
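As one example of the GDD calendar work the harmonized data enables, the sketch below accumulates growing degree days from daily maximum and minimum temperatures using the standard averaging formula GDD = max(0, (TMAX + TMIN)/2 − base). The base temperature of 10 °C and the readings are illustrative.

```python
# Cumulative growing degree days (GDD) from daily TMAX/TMIN (deg C),
# using the standard formula GDD = max(0, (tmax + tmin)/2 - base).
# The base temperature of 10 C is an illustrative choice.
def cumulative_gdd(daily, base=10.0):
    """daily: list of (tmax, tmin) pairs; returns running GDD totals."""
    total, out = 0.0, []
    for tmax, tmin in daily:
        total += max(0.0, (tmax + tmin) / 2.0 - base)
        out.append(total)
    return out

readings = [(22.0, 10.0), (8.0, 2.0), (30.0, 18.0)]
print(cumulative_gdd(readings))  # [6.0, 6.0, 20.0]
```

Fed with multi-decade daily data from the tool, the same calculation reveals how quickly a location accumulates heat units in different years, which is the raw material for tracking shifts in growing seasons.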
4 CONCLUSION
This prototype is an example of a streamlined system for gathering climate data, offering an accessible interface for users to download meteorological data from user-defined points or visually bounded areas on a web-map interface. The tool simplifies the process of obtaining data important to agricultural research and decision-making. Users can specify the exact dates and locations for their data needs, allowing for precise temporal and spatial analysis. Users can select their desired data types, aggregation methods (such as daily, weekly, or monthly averages), and output formats (JSON or CSV), making the data readily analyzable. This flexibility ensures that users can tailor data retrieval to their specific research requirements, whether for point-based studies or broader regional analyses. The tool's capacity to deliver current data, updated as soon as it is available from the NOAA and Ameriflux, further enhances its utility. By automating the data collection and aggregation processes, this project not only saves researchers time and effort but also enhances the accuracy and reliability of the data used in agricultural and climate research.
AUTHOR CONTRIBUTIONS
Logan Gall: Data curation; formal analysis; visualization; writing—original draft; writing—review and editing. Tom Glancy: Data curation; formal analysis; writing—review and editing. Michael Kantar: Project administration; writing—review and editing. Bryan C. Runck: Conceptualization; data curation; funding acquisition; project administration; writing—review and editing.
ACKNOWLEDGMENTS
Funding for this project was provided by the Minnesota Environment and Natural Resources Trust Fund as recommended by the Legislative-Citizen Commission on Minnesota Resources (LCCMR) project ML 2021, Chp6, Art6, Sec. 2, 04e-E812SIM 2021–266.
CONFLICT OF INTEREST STATEMENT
The authors declare no conflicts of interest.
Open Research
DATA AVAILABILITY STATEMENT
Code is available at https://github.com/RTGS-Lab/project_zero_prototype.