How useful is Active Learning for Image-based Plant Phenotyping?

Deep learning models have been successfully deployed for a diverse array of image-based plant phenotyping applications including disease detection and classification. However, successful deployment of supervised deep learning models requires large amount of labeled data, which is a significant challenge in plant science (and most biological) domains due to the inherent complexity. Specifically, data annotation is costly, laborious, time consuming and needs domain expertise for phenotyping tasks, especially for diseases. To overcome this challenge, active learning algorithms have been proposed that reduce the amount of labeling needed by deep learning models to achieve good predictive performance. Active learning methods adaptively select samples to annotate using an acquisition function to achieve maximum (classification) performance under a fixed labeling budget. We report the performance of four different active learning methods, (1) Deep Bayesian Active Learning (DBAL), (2) Entropy, (3) Least Confidence, and (4) Coreset, with conventional random sampling-based annotation for two different image-based classification datasets. The first image dataset consists of soybean [Glycine max L. (Merr.)] leaves belonging to eight different soybean stresses and a healthy class, and the second consists of nine different weed species from the field. For a fixed labeling budget, we observed that the classification performance of deep learning models with active learning-based acquisition strategies is better than random sampling-based acquisition for both datasets. The integration of active learning strategies for data annotation can help mitigate labelling challenges in the plant sciences applications particularly where deep domain knowledge is required.


Introduction and Related Work
Deep learning architectures have advanced the state-of-the-art performance for image-based classification tasks [21], and have been successfully deployed for a arXiv:2006.04255v2 [cs.CV] 1 Jul 2020 diverse array of image-based plant phenotyping applications applications including disease detection, classification and quantification [27,34]. However, one of the critical drawbacks of deep learning models is its necessity to have a large amount of labeled data to achieve good model accuracy. This is especially true for plant science applications, where annotating data can be costly, laborious, and time consuming to obtain, and generally need domain expertise (for instance, for plant stress image labeling that requires trained plant pathologists). To overcome this drawback, one effective and practical strategy is to use Active Learning (AL) based image annotation [9]. Weak supervision [16], domain adaptation [36,4], domain randomization [39], synthetic dataset creation [37,6,40], and transfer learning [35] are some of the other methods available to reduce the amount of labeling needed. However, when large amounts of unlabeled data is available but the task of labeling is hard or infeasible, AL methods are very useful. With the advent of high throughput phenotyping in plant sciences [33,3], this dichotomy between increasingly large corpus of sensor and image data but our inability to exhaustively label them is expanding. AL methods adaptively select the most informative samples for labeling for the highest improvement in test accuracy. The goal of AL is to achieve maximum predictive performance under a fixed labeling budget, which makes it desirable for plant science applications.
Many AL methods have been proposed with different heuristics [31] to reduce the amount of labeling needed for training machine learning (ML) models for classification tasks. A small amount of data is randomly chosen initially for labeling; and this labeled dataset is used to train a neural network model. Then, a batch of data from the remaining unlabeled data set is adaptively selected using an acquisition function for labeling by human domain experts. The acquisition function serves to select the most useful samples in the unlabeled dataset for improving neural network model performance. This process of choosing limited samples from unlabeled data sets, having the human expert annotate/label these limited samples, adding them to the labeled set, and retraining the model continues until one of two termination criteria is met -a desired performance threshold of the model is achieved, or the labeling budget is exhausted.
Recently, in non-plant sciences problems, AL methods have been successfully applied for improving the performance of deep learning models, for example, deep learning based image classification [38], biomedical image segmentation [41], text classification [41] and object detection [19]. In the field of plant phenotyping, uncertainty based sampling method was used to select samples for training Faster R-CNN model for panicle detection in cereal crops [7]. The continual improvement of AL strategies in the ML community, can be leveraged to significantly augment plant phenotyping efforts through state-of-the-art AL techniques. As a first step, there is a need to perform a comparative evaluation of the available sophisticated AL strategies in the context of canonical plant phenotyping applications. We compare four active learning methods defined by different acquisition functions: least confidence [10], entropy [32], Deep Bayesian Active Learning [14], and core-set [30] on two disparate plant phenotyping problemssoybean stress identification [15] and weed species classification [24].

Datasets
Soybean Stress Dataset The dataset consists of 16,573 RGB images of soybean leaves across nine different classes (eight different soybean stresses, and the ninth class containing healthy soybean leaf). Details on the dataset can be found in [15]. Briefly, these classes cover a diverse spectrum of biotic and abiotic foliar stresses in soybean.

Experimental Setup
We trained a neural network-based classification model for identifying the class labels for input images in the two data sets, with the goal to achieve maximum classification performance for a fixed labeling budget. We evaluated each of the four active learning strategies (corset, DBAL, entropy, and least confidence) based on how well the neural network performed -i.e. using the classification accuracy on the complete dataset. We used MobileNetV2 [29] architecture for dataset #1 (soybean stress classification) and ResNet-50 [17] architecture for dataset #2 (weed species classification). These networks were specifically chosen because of their popularity and strong performance, as well as to test the capability of AL on two distinct and well used networks. MobileNetV2 is a smaller, more compact network, while ResNet-50 is a large network, and were appropriate for dataset #1 (controlled condition imaging) and dataset #2 (field based imaging), respectively.
Each data set was analyzed separately. The AL approach was repeated ten times for each dataset. For each run, we randomly selected and labeled 5% of the samples from the complete data to create a validation set . Each run starts with an initial random batch of 1000 samples spread across different classes, which was used for the evaluation of all four active learning methods. After training the neural network model for 100 epochs, a query batch of 1000 samples from the remaining unlabeled dataset was selected. This selection was performed using the acquisition function of each of the four active learning algorithms (so each AL approach will potentially select distinct set of 1000 samples to next annotate). These 1000 samples were added to the labeled dataset, to retrain the neural network model. This process was repeated until the labeling budget was exhausted (labeling budget was 10,000 samples for the soybean stress classification, and 9000 samples for the weed species classification). We saved the model with best validation accuracy. The model was retrained from scratch after every selection of new labeled samples for 100 epochs. We optimized the model using Adam [20] optimizer with the default learning rate of 0.001. We used Keras [8] with Tensorflow [1] backend for the implementation. A schematic of the approach is shown in Fig. 3.

Evaluated Methods
Formally, let x be the input and y (1, . . . , N ) be the output of the classification model in the active learning setup. The neural network was trained using labelled set L pool . The active learning methods selects a batch of b points [x * 1 , . . . , x * b ] from the unlabeled pool U pool for expert annotation according to an acquisition criterion, and add these b points to the labeled set L pool . The four active learning methods are described below: Random. The samples are chosen at random from the unlabeled data. This represents the baseline, if AL methods are not used.
Least Confidence. The unlabeled samples are sorted in ascending order according to maximum predicted classification probability for the input x (p( y x )) and the samples with the lower rank are chosen for labeling [10].
Entropy. The unlabeled samples with highest entropy H of the predicted classification probability distribution p are chosen for labeling [32].
Core-set. A set of diverse samples that best represents the distribution of the entire dataset in the representation space (penultimate layer) learned by the neural network model are chosen for labeling. The greedy approximation method was used to implement the core-set selection [30].
Deep Bayesian Active Learning (DBAL). The Monte Carlo dropout (MCdropout) [13] based uncertainty estimation is combined with Bayesian Active Learning by Disagreement (BALD) [18] acquisition framework for selecting the samples in DBAL [14]. The MC-dropout based uncertainty estimates were computed by averaging the outputs of T different forward stochastic passes through the trained neural network model with weights w t for the pass t during the test time. A new dropout mask was applied during each of the T forward passes. The BALD acquisition function calculates the mutual information between the data samples and the model weights. Unlabeled data samples with larger mutual information between the predicted label and model weights were selected for labeling. The uncertainty estimate p is: The BALD acquisition criterion I is:

Results and Discussion
Mean accuracy for different active learning methods for the two canonical problems on soybean stress classification and weed species classification are presented in Fig. 4. and Fig. 5 respectively. For the soybean stress classification dataset, we clearly observe that all the uncertainty sampling based active learning methods outperform random sampling whereas diversity sampling based Coreset method underperforms. For the weed species classification dataset, all active learning algorithms outperform random sampling. The performance gain due to AL methods over random sampling for plant domain datasets is similar to the improvement observed in other domain datasets like MNIST and CIFAR10 [5]. The overall performance gains of active learning algorithms were higher for weed species dataset than soybean stress dataset. One reason for this could be the challenging nature of the weed species dataset which was collected under diverse field conditions whereas the soybean dataset was collected under indoor conditions with constant illumination. Additionally, the field images for weed data set had more background objects and obscurity compared to the soybean dataset, which was images under more controlled conditions [15]. Hence, the random sampling-based annotation method provides a stronger baseline for the soybean stress dataset. The dip in accuracy at 2000 samples for the soybean dataset was due to high class-imbalance in the labeled dataset after the sample selection.
Which samples are selected?: A random selection strategy is expected to blindly pick new samples for annotation, therefore the distribution of selected   points is expected to be uniform. In contrast, we anticipate the AL based methods to pick fewer samples from classes that are well predicted, and instead pick more points from classes that are not well predicted, for example class 0 and 7 are both bacterial diseases with very similar and confounding symptoms, which are at times even difficult for human raters to distinguish. We visualize this expected behavior in Fig. 6 (for soybean stress classification) and Fig. 7 (for weed classification). The per-class classification accuracy of different active learning methods and random sampling is shown in Fig. 6a and Fig. 7a respectively. The per class sample selection percentage, i.e. how many samples are used per class (calculated as number of sample selected from a class/total number of samples available in a class) of different active learning algorithms are presented in Fig. 6b and Fig. 7b respectively. The accuracy plot of random method indicates the classes that are hard and easy to predict. The clear inverse relationship between the performance of individual classes shown in the accuracy plot of random and the number of samples chosen is apparent across all uncertainty based AL methods. This is in stark contrast to a nave random sampling (Figure of Fig. 6b and Fig. 7b). Class 0 and Class 7 have low per-class classification accuracy from random samplingbased annotation (Fig. 6). Least Confidence, Entropy and DBAL methods chose more samples from the Class 0 and Class 7 and obtained better per-class accuracy than random sampling. AL based like LC, Entropy and DBAL methods selected more samples from stresses which are highly confusing when compared to less confusing stresses. This is very promising from a domain perspective because confounding symptoms classes are more extensively sampled by these three AL methods. The uncertainty-based methods sampled only a small percentage of samples from classes that have high per-class accuracy for random sampling (Classes 3, 5, 6 and 8) method. In contrast to the uncertainty based acquisition functions of LC, Entropy, and DBAL, core-set uses a diversity based sampling. Its comparatively poor performance can be explained by the fact that it chooses less samples from classes exhibiting less diversity, even if that class is difficult to classify. Uncertainty based AL algorithms adaptively sampled more from the low accuracy classes of random sampling method (Classes 0, 1, and 7) and sampled less from the high accuracy classes of random sampling methods (Classes 6 and 8). There was no consistent trend for leaf variation (narrow and broad leaved).The AL methods show promising results for plant sciences problems where extensive data are needed to train useable models. These include diverse applications, such as cluttered image problem for soybean cyst nematode egg detection [2], hyperspectral imaging [22,28], abiotic stress disease rating [23,42], root imaging [11,12], yield predictors [25,26] et cetera.

Conclusions
In this paper, we compared four different active learning algorithms for classification of soybean stress and weed species datasets. We empirically showed that uncertainty based active learning methods outperform random sampling-based annotation for classification of two distinct datasets using MobileNetV2 and ResNet50 architectures. Uncertainty based active learning strategy reasonably outperformed random sampling, especially in the lower end of the labeled dataset availability. For the two cases shown here, Entropy sampling is marginally better than the other AL strategies. We believe that active learning methods can be quite helpful in reducing the amount of labeling needed for image-based plant phenotyping tasks. In future, AL methods should be combined with unsupervised representation learning methods to further increase the label efficiency.