Zhao (2022)
Topics covered
Crop Protection
journal homepage: www.elsevier.com/locate/cropro
ARTICLE INFO

Keywords: Plant disease classification; CAST-Net; Lightweight neural network; Dynamic learning rate function; Self-distillation

ABSTRACT

In agriculture, early detection and identification of plant disease categories can help growers take timely countermeasures. The use of deep learning techniques for plant disease category detection prevents further spread of the disease and helps to prevent crop production losses. In this paper, based on the Next-Vit neural network model, we propose a lightweight neural network, CAST-Net, built on the combination of convolution and self-attention, and we adopt self-distillation based on this model to increase accuracy in classifying plant leaf diseases while reducing the number of model parameters and FLOPs. Our model and method achieved 98.4% accuracy on the tomato subset of the data-enhanced PlantVillage dataset, a 4.9% improvement over the Next-Vit model, and 99.0% accuracy on the full PlantVillage set, a 6.9% improvement over the Next-Vit model. We also propose a new dynamic learning rate function that is applied in the training phase to prevent the model from skipping over the optimal value of the loss. The results show that our model and method have higher accuracy, fewer parameters, shorter training time and lower computational complexity than existing models.
1. Introduction

As a significant agricultural nation, China is bound to face pests and diseases during the crop growth process. Failing to detect and recognize the exact type of pests and categorize them correctly may result in inadequate measures, causing substantial losses. Manual identification of pests and diseases is highly resource-intensive, requiring significant manpower and material resources. The process may take a prolonged period to produce the results. As a result, the utilization of machine learning in plant pest and disease classification and identification is of utmost importance to achieve effective and precise diagnosis and classification. It remains a current research priority and a topic of considerable interest. Researchers have been carrying out ongoing investigations on computer vision technology to attain accurate plant pest classification. Machine learning algorithms can be categorized primarily as supervised learning, unsupervised learning, and semi-supervised learning. Among the commonly utilized supervised learning algorithms are the support vector machine, the K nearest neighbor algorithm, and the decision tree algorithm. In terms of algorithms, the K-means clustering algorithm is used for image segmentation, the GLCM algorithm and the BDA optimization algorithm are used for feature extraction and feature selection respectively, and finally the ELM algorithm is used for plant leaf disease classification (Aqel et al., 2022). Disease classification of tomato, potato and chilli crops has been performed using hybrid machine learning techniques (Bhagat and Kumar, 2023). An inter-class similarity analysis based method to assess the contribution of sub-image information, combined with an active learning image selection strategy, is proposed to resolve classification inaccuracies in intelligent identification of plant diseases (Yang et al., 2022). Supervised learning and image classification are applied to the early detection of potato late blight (Suarez Baron et al., 2022). The plant disease features were extracted using various statistical features based on the classification evaluation of six machine learning models. The improved grey-level co-occurrence matrix (GLCM) technique was employed. The highest classification accuracy rates were achieved using the light gradient boosting machine (LGBM) and support vector machine (SVM) models, at 94.39% and 93.15% respectively (Tabbakh and Barpanda, 2022). The combination of a supervised learning model and a support vector machine was used to achieve pomegranate plant disease feature extraction through region of interest (ROI) feature extraction techniques. Pomegranate leaf disease
* Corresponding author.
E-mail addresses: [email protected] (Y. Zhao), [email protected] (Y. Li), [email protected] (N. Wu), [email protected] (X. Xu).
1
This is the first author footnote.
https://doi.org/10.1016/j.cropro.2024.106637
Received 18 September 2023; Received in revised form 21 February 2024; Accepted 27 February 2024
Available online 5 March 2024
0261-2194/© 2024 Elsevier Ltd. All rights reserved.
Y. Zhao et al. Crop Protection 180 (2024) 106637
classification was achieved with a final accuracy rate of 98.07% (Madhavan et al., 2021). The authors proposed an optimal deep learning model based on an adaptive genetic algorithm to diagnose olive leaf diseases. With an accuracy of around 96% in the multi-class classification task and 98% in the binary classification task (Alshammari et al., 2022), the model proves to be effective. The authors used an improved deep transfer learning model to extract multiple leaf diseases for classification. Multiple support vector machine (SVM) models were used to improve the feature recognition and processing speed (Saberi Anari et al., 2022).

Because machine learning techniques do not perform well on images with real, complex environmental backgrounds, people continue to conduct research and exploration in the direction of deep learning techniques. Some traditional convolutional neural networks, such as ResNet (He et al., 2016), DenseNet (Huang et al., 2017), and VGG (Simonyan and Zisserman, 2014), have gradually emerged. These network models improve the accuracy of image classification by continuously increasing the depth or width of the neural network model, but they also greatly increase the complexity of the model. Subsequently, the MobileNet family of lightweight networks (Howard et al., 2017, 2019; Sandler et al., 2018) was developed to reduce the complexity of the model while minimizing the impact on accuracy. SENet (Hu et al., 2018) was proposed to adaptively recalibrate the channel features based on the interdependencies between channels. Later, researchers implemented the Transformer structure in the visual domain based on the attention mechanism of the ViT (Dosovitskiy et al., 2020) and SwinViT (Liu et al., 2021) neural network architectures. The classification accuracy of this structure on large datasets was again improved.

Recent studies (Srinivas et al., 2021; Wu et al., 2021; Guo et al., 2022a; Mehta and Rastegari, 2021; Chen et al., 2022; Li et al., 2022) achieve better performance by combining the advantages of convolution and Transformer. BoTNet (Srinivas et al., 2021) uses a multi-head self-attention mechanism to replace the last three bottleneck blocks of ResNet. CvT (Wu et al., 2021) uses depth-separable convolution combined with self-attention. CMT (Guo et al., 2022a) uses the Transformer structure to capture long-range dependencies and convolution to capture local features. MobileViT (Mehta and Rastegari, 2021) uses spatial inductive bias to learn representations with fewer parameters in different visual tasks. MobileFormer (Chen et al., 2022) implements a parallel structure for MobileNet and Transformer fusion through a bilinear bridge. Next-Vit (Li et al., 2022) combines CNN and Transformer to capture local and global feature information through a deployable mechanism.

With the development of deep learning convolutional neural networks, deep learning classification models have also been applied to agricultural development problems. The AlexNet model was explored to detect maize leaf diseases quickly and accurately, eventually achieving 99.16% accuracy (Singh et al., 2022). A new Reconstructed Disease Aware Convolutional Neural Network (RDA-CNN) was proposed to take low resolution image input of rice crop and convert the high resolution output to recover the disease condition of different parts of the rice plant, and the experimental results showed that the RDA-CNN improved the classification performance by 4%–6% (Sathya and Rajalakshmi, 2022). A deep convolutional neural network (DCNN) model was proposed to optimize the hyperparameters of the DCNN model for plant leaf disease classification using stochastic search techniques, and experiments showed that the overall performance of the model outperformed advanced machine learning techniques (Pandian et al., 2022). A method for accurate identification of plant leaf diseases based on deep convolutional neural networks and gating units has been proposed, which was measured on PlantVillage data and achieved better results than other models (Alguliyev et al., 2021). Embedding the improved CBAM convolutional attention module into the improved Inception network improves the effectiveness of plant leaf disease classification, and the results show that the model has fewer parameters, shorter training time, and higher detection accuracy (Zhao et al., 2022). An attention mechanism is proposed to represent the visual information of local regions of an image by labelling, calculate the information correlation between local regions using the attention mechanism, and finally integrate the global information for classification, which effectively emphasizes the maize leaf lesion information and suppresses the background noise, facilitating fine-grained maize leaf disease recognition under complex backgrounds (Qian et al., 2022). The ghost network was used as the convolutional backbone to generate intermediate feature maps with linear operations, followed by the transformer encoder with integrated multi-head attention to extract deep semantic features, and the results showed that the method was effective and accurate for grape leaf field diagnosis (Lu et al., 2022). The proposed CST, based on Swin Transformer, can identify the degree and type of disease with high testing accuracy and excellent robustness (Guo et al., 2022b). An Attention Dense Learning (ADL) mechanism is proposed that combines hybrid S-type attentional learning with the basic dense learning process of deep CNN. This helps to obtain robustness and higher testing accuracy for plant leaf disease classification. The proposed mechanism has achieved 97.33% classification accuracy in its real-world environment in the RGB leaf dataset (Pandey and Jain, 2022). In order to improve plant disease classification performance, the number of images in the training set was reduced by using Plant Image Generative Adversarial Network (PI-GAN) for data augmented training (Batchuluun et al., 2022).

In order to improve the accuracy of image classification, deep learning models are employed, which in turn increases the model's complexity. To address this, people utilize knowledge distillation in machine learning. They use a complex teacher model to teach distilled knowledge to a simpler student model, thereby allowing this lightweight model to achieve a high level of accuracy in classification. The knowledge gained from model combination is consolidated into a single model (Hinton et al., 2015). The res-student approach utilizes the knowledge gap between teachers and students to train lightweight students (Li et al., 2021). A self-distillation framework was proposed to compress deep network model knowledge into shallow knowledge, which enhances image classification by 2.65% in average accuracy (Zhang et al., 2019). To help the model better learn the distribution and features of the data, the dynamic soft target knowledge of the previous data sampling during training is constrained to be provided to the current iteration of training. According to experiments, this strategy is compatible with image classification and can be applied (Shen et al., 2022). The accuracy of plant pest and disease classification recognition increased by 2.12% when knowledge distillation was applied to the MobileNet lightweight network model (Ghofrani and Mahdian Toroghi, 2022).

The previous studies have shown that it is possible to apply deep learning techniques to plant disease classification and identification tasks, but in order to achieve more accurate identification of plant leaf disease categories, most of the studies have relied on more complex models. In another part of the research, the model is made lightweight, but the classification accuracy is not improved. Based on the above problem analysis, this paper proposes a lightweight deep learning classification network model based on the Next-Vit model and applies it to plant disease classification and detection, which reduces the complexity of the model and improves the classification accuracy at the same time. Combined with a self-distillation technique, the model can better learn the features of plant disease spots. We also propose a new dynamic learning rate change function. The main contributions are:

1. A shallow plant disease feature extraction module is proposed, which adopts a multi-scale feature fusion technique to extract the input plant disease image features from four branches respectively, the first branch of maximum pooling for local significant feature extraction, the second branch of average pooling for local average
3.1. ShallowBlock

When using convolutional operations for feature extraction, a small kernel has a limited receptive field, which results in better feature extraction for small areas. Conversely, a large kernel has a larger receptive field, which allows for better feature extraction of entire leaf diseases. In existing neural network models, shallow extraction typically consists of a single convolutional layer with a large convolutional kernel, or of multiple convolutional layers with small kernels stacked on the input image, to extract edge information. However, this approach can only extract feature information at a single scale. We employ multi-scale feature fusion for extracting shallow feature map information. In the shallow layer of the network, we aim to extract image information features of varying granularity by incorporating multiple scales. We optimize the weight coefficients of the extracted features from each branch scale and then perform weighted fusion. Thus, we can establish a basis for extracting sophisticated semantic information from the network model and avoid the loss of crucial features initially.

In Fig. 2, we apply a 1 × 1 convolution at the start of each branch to enhance the network model's ability to extract additional features and increase the image's channel dimension. The first branch applies maximum pooling to the local region of the feature map to extract important features. The second branch uses average pooling to extract overall data information features from the feature map. The third branch employs two convolutional sub-branches to extract the local features of the feature map by effectively halving the channels. The resulting feature maps are concatenated, and the original feature information is preserved in the residual branch. Finally, the features of the four branches are multiplied by adaptively learned weight coefficients and then added to achieve feature fusion, which yields image feature information of varying levels extracted at different scales. The feature fusion formula is:

$$F = w_1 \times f_r + w_2 \times f_m + w_3 \times f_a + w_4 \times f_c \tag{1}$$

where $w_1, w_2, w_3, w_4$ are the normalized weight coefficients, $f_m$ is the first branch feature, $f_a$ is the second branch feature, $f_c$ is the third branch feature and $f_r$ is the residual branch feature.
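The weighted fusion of Eq. (1) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the paper only states that the weights are normalized and adaptively learned, so the softmax normalization, the function names `normalize_weights` and `fuse_branches`, and the flattened list representation of each branch feature are assumptions made for illustration.

```python
import math

def normalize_weights(raw_weights):
    # Assumption: softmax normalization, so the four weights are positive and sum to 1.
    peak = max(raw_weights)
    exps = [math.exp(w - peak) for w in raw_weights]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_branches(f_r, f_m, f_a, f_c, raw_weights):
    # Eq. (1): F = w1*f_r + w2*f_m + w3*f_a + w4*f_c, applied element-wise.
    # Each branch feature is a flat list of equal length (a flattened feature map).
    w1, w2, w3, w4 = normalize_weights(raw_weights)
    return [w1 * r + w2 * m + w3 * a + w4 * c
            for r, m, a, c in zip(f_r, f_m, f_a, f_c)]

# Toy usage: equal raw weights give each branch a weight of 0.25.
fused = fuse_branches([4.0, 0.0], [8.0, 0.0], [0.0, 4.0], [4.0, 0.0],
                      raw_weights=[0.0, 0.0, 0.0, 0.0])
print(fused)
```

In a trainable model the raw weights would be learnable parameters updated by backpropagation; here they are fixed numbers so that the arithmetic is easy to follow.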
3.2. SCBlock

Since the NCB module in Next-Vit only considers the extraction of local spatial feature information without taking into account the information correlation between different channels, we added the channel shuffle operation to the improved NCB module, so that inter-channel shuffling is performed each time the network is trained through the module, making the inter-channel information correlated. We replaced the GroupConv 3 × 3 group convolution (the number of groups is equal to the number of output channels/32) in the original NCB module with a DW 3 × 3 convolution (depthwise separable convolution). As the DW convolution is a special group convolution (the number of groups is equal to the number of output channels) that extracts local features, it is combined with channel shuffle to correlate the global channel information, resulting in better feature extraction of the feature map. This form of convolution is the most efficient convolution compared to group convolution, with the same number of parameters and the same number of operations, as it can produce Cout (the number of output channels) feature maps, whereas GConv can only produce Cout/32 feature maps.

We found that most of the parameters in the model come from the MLP (multilayer perceptron) layer. The Next-Vit model adopts a convex-structure MLP whose middle hidden layer has twice as many units as the input and output layers, so the number of parameters in the MLP layer is very large. We change the MLP layer to a concave structure, as shown in Fig. 3, in which the number of neurons in the middle hidden layer is only half of the number in the input and output layers, or less. This greatly reduces the number of parameters in the MLP layer; after experimental testing, the model accuracy does not decrease, while the number of parameters in the model is reduced by two-thirds of the original. We call the improved NCB module the SCB module. As shown in Fig. 4, the SCB module first shuffles the channels of the input feature information, so that the features extracted in the previous iteration take into account the information connection between the channels when they enter the SCB for feature extraction again. The features then enter a depthwise convolution with a filter size of 3 × 3 for the extraction of local features, followed by normalization and nonlinear activation, then a pointwise convolution with a filter size of 1 × 1 for feature integration, followed by BN and ReLU, and finally a concave MLP layer.

3.3. DTBlock

We suggest a DCB mini-module comprising a 3 × 3 dilated (inflated) convolution with a dilation rate of 2, which expands the receptive field of the kernel for extracting local features while preventing the extraction of excessive redundant information. This is followed by normalization and nonlinear activation, and then a 1 × 1 convolution for integrating features. Additionally, we incorporate a residual structure to avoid gradient vanishing as the network depth increases and during model optimization. We have also developed the DTB module, which combines the DCB dilated convolution module with the multi-head attention mechanism to extract global feature information. The DCB module enhances the receptive field of the initial 3 × 3 convolution without incurring additional computation costs and maintains information integrity during the network's convolution process as it deepens. This approach ensures minimal information loss.

As displayed in Table 2, we conducted a computational analysis comparing the number of parameters and the computation of the grouped convolution and the DCB dilated convolution modules. Our findings indicate that the DCB module enhances the receptive field of feature extraction without increasing the computational load. For the comparison presented in Table 2, the input channel is configured to 64 and the output channel is set to 96, the convolution kernel size is 3, the input image size is 224 × 224, and the number of groups is 3.

In the DTB module, as illustrated in Fig. 5, the process initially involves a convolutional layer that employs a 1 × 1 filter. Subsequently, an attention mechanism is utilized for the global extraction of information from the input features of plant disease images. After this, feature integration takes place in a 1 × 1 convolutional layer, and the process then progresses to our DCB module, which starts with a 3 × 3 dilated convolution layer with a dilation rate of 2 in order to extract local information from plant disease image features using a large receptive field. This is followed by a 1 × 1 convolutional layer for feature integration and, to address the issue of gradient vanishing, residual merging is also applied before passing through the concave MLP layer.

3.4. CAST-net

Since SCBlock extracts local feature information and DTBlock extracts global feature information of plant disease images, we connect them in series to integrate local and global feature information. The input image first passes through the ShallowBlock to obtain rich shallow feature information and then flows into the series-connected SCBlock and DTBlock to extract deep feature information, thereby enabling the entire CAST-Net model to classify plant leaf disease categories.

As depicted in Fig. 6, n1, n2, n3, and n4 determine the number of stacked blocks for each Stage. As the number of layers in the network model increases, the model's complexity increases, and its ability to accurately identify plant leaf diseases improves. The configuration mainly utilized in our experiment is [3,4,10,3], denoting the number of stacked blocks per Stage. In Stage 1, the SCB was stacked three times. In Stage 2, the SCB was stacked three times and the DTB once. Stage 3 involved stacking the SCB eight times and the DTB twice. Stage 4 saw the SCB being stacked twice and the DTB once. Finally, the categories of the input plant disease images are discriminated after the fully connected layer. Fig. 6 provides an illustration of the overall CAST-Net network model structure, and Table 3 describes the CAST-Net network model's detailed configuration information.

When training the model with a fixed learning rate, we found that the model tends to jump around the minimum value repeatedly in the later stage as the number of epochs increases, which may be caused by too large a learning rate in the later stage. We therefore used a dynamically changing learning rate to train the model. When we use the gradient descent

Fig. 3. MLP convex-to-concave.
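To make the effect of the convex-to-concave change in Fig. 3 concrete, the following Python sketch counts the parameters of a two-layer MLP for both shapes. It is an illustrative calculation under stated assumptions, not the paper's code: the MLP is assumed to consist of two fully connected layers with biases, the helper name `mlp_params` is hypothetical, the hidden width is taken as 2C for the convex form and C/2 for the concave form (the upper bound of "half or less"), and C = 96 is borrowed from the Stage 1 channel width in Table 3.

```python
def mlp_params(channels, hidden):
    # Parameters of a two-layer MLP: channels -> hidden -> channels, with biases.
    first_layer = channels * hidden + hidden
    second_layer = hidden * channels + channels
    return first_layer + second_layer

channels = 96  # assumed example width (Stage 1 in Table 3)

convex = mlp_params(channels, hidden=2 * channels)    # hidden layer twice as wide
concave = mlp_params(channels, hidden=channels // 2)  # hidden layer half as wide

print("convex MLP parameters: ", convex)
print("concave MLP parameters:", concave)
print("concave / convex ratio: %.3f" % (concave / convex))
```

Under these assumptions the concave MLP keeps roughly a quarter of the convex MLP's parameters. This figure concerns the MLP layers only, so it is not directly comparable with the two-thirds reduction the text reports for the whole model.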
Fig. 4. (a) shows the original module NCB, (b) shows our modified module SCB.
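The channel shuffle step that opens the SCB module can be sketched in plain Python. The paper does not specify which shuffle variant or how many groups are used, so this sketch assumes the common reshape, transpose and flatten scheme known from ShuffleNet, applied to a list of channel indices; the function name `channel_shuffle` and the group count are hypothetical.

```python
def channel_shuffle(channels, groups):
    # View the channel list as a (groups, per_group) grid, transpose it to
    # (per_group, groups), and flatten, so that neighbouring output channels
    # come from different groups.
    n = len(channels)
    if n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]

# Eight channels split into two groups: [0..3] and [4..7].
print(channel_shuffle(list(range(8)), groups=2))
# [0, 4, 1, 5, 2, 6, 3, 7]
```

Because a depthwise convolution processes every channel independently, interleaving the channels in this way is one inexpensive route for information from different channel groups to meet in the subsequent 1 × 1 pointwise convolution.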
Table 3
CAST-Net detailed configuration.

Stages    Input Size    Block            CAST-Net Layers
Stem      224 × 224     Shallow Block    Shallow-B × 1, C = 32, S = 2
                                         Shallow-B × 1, C = 64, S = 2
Stage 1   64 × 64       Down Sampling    Conv 1 × 1, C = 96, S = 1
                        CAST Block       [ SCB × 3, C = 96 ]
Stage 2   64 × 64       Down Sampling    Avgpool, S = 2
                                         Conv 1 × 1, C = 96
                        CAST Block       [ SCB × 3, C = 192 ], [ DTB × 1, C = 256 ]
Stage 3   32 × 32       Down Sampling    Avgpool, S = 2
                                         Conv 1 × 1, C = 384
                        CAST Block       [ SCB × 8, C = 384 ], [ DTB × 2, C = 512 ]
Stage 4   16 × 16       Down Sampling    Avgpool, S = 2
                                         Conv 1 × 1, C = 768
                        CAST Block       [ SCB × 2, C = 768 ], [ DTB × 1, C = 1024 ]

$$loss_{ce} = -\frac{1}{m} \cdot \sum_{k=1}^{m} \sum_{i=1}^{n} y_i \cdot \log P_i \tag{4}$$

$$loss_{lb} = \frac{1}{n} \cdot \sum_{i=1}^{n} T^{2} \cdot D_{kl}\left(P_i^{T,t-1} \,\|\, P_i^{T,t}\right) \tag{5}$$

$$loss = loss_{ce} + \alpha \cdot loss_{lb} \tag{6}$$

where $P_i$ is the probability that the current sample belongs to category $i$, $x_i$ is the logit of the corresponding category $i$ for the current sample, $n$ is the total number of sample categories, and $loss_{ce}$ is the cross entropy loss. $T$ is the distillation temperature, $P_i^{T,t-1}$ is the probability that the current sample belongs to category $i$ for the $(t-1)$-th iteration of soft labelling at temperature $T$, $\alpha$ is a hyperparameter controlling how much of the total loss is accounted for by the distillation loss $loss_{lb}$, and $loss$ is the total loss in model training.
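As a rough illustration of how Eqs. (4) to (6) fit together, the sketch below computes the combined loss for a single sample in plain Python. It is a simplified reading rather than the authors' training code: the batch and category averaging factors of Eqs. (4) and (5) are omitted, the label is assumed to be one-hot, the helper names are hypothetical, and the defaults T = 3 and α = 0.5 are simply the fixed values quoted in the caption of Fig. 7.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; larger temperatures give softer distributions.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    # Eq. (4) for one sample with a one-hot label: -log P_label.
    return -math.log(softmax(logits)[label])

def kl_divergence(p, q):
    # D_kl(p || q) summed over categories.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def self_distillation_loss(logits, prev_logits, label, temperature=3.0, alpha=0.5):
    # Eq. (6): loss = loss_ce + alpha * loss_lb, where loss_lb is the
    # T^2-scaled KL divergence between the soft labels of the previous
    # iteration (t - 1) and the current iteration (t), as in Eq. (5).
    ce = cross_entropy(logits, label)
    prev_soft = softmax(prev_logits, temperature)
    curr_soft = softmax(logits, temperature)
    lb = temperature ** 2 * kl_divergence(prev_soft, curr_soft)
    return ce + alpha * lb

current = [2.0, 0.5, -1.0]    # logits from the current iteration
previous = [1.5, 0.8, -0.5]   # logits stored from the previous iteration
print(self_distillation_loss(current, previous, label=0))
```

When the previous and current logits coincide, the distillation term vanishes and the total loss reduces to the cross entropy, which is a convenient sanity check for any implementation of this scheme.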
Fig. 7. (a) shows the effect of different Temperature values on CAST-Net, tested with Alpha fixed at 0.5. (b) shows the effect of different Alpha values on CAST-Net, tested with Temperature fixed at 3.
Table 4
Comparison on Tomato10 dataset and PlantVillage dataset.
Dataset NetWork Accuracy Loss Recall Precision F1-score Params(M) Flops(G)
Table 5
CAST-Net’s recognition of single diseases.
Type of disease | Predict the total quantity | Predict the correct quantity | Predict the wrong quantity | Predict accuracy
plant patches in the pre-training phase. As the model gains knowledge, the learning rate gradually decreases to allow the model to successfully learn advanced semantic information. Gradually reducing the learning rate prevents the model from skipping over the optimal value, which would cause erratic oscillations in the loss curve, and ensures that the model reaches its optimal performance. Our dynamic learning rate function therefore allows the model to gradually reduce its loss and stabilize the curve.

4.3. Ablation experiment

The paper presents a lightweight model built on the Next-Vit model that merges CNN with attention. In addition, a ShallowBlock is devised to capture abnormal global features of plant disease images through multi-scale fusion in the bottom layer of the model. The SCB module, an improved version of the NCB module, uses depth-separable convolution instead of the original grouped convolution. This change results in a larger number of feature maps while maintaining the same number of parameters and operations. In addition, a new module, DCB, is introduced that uses dilated convolution to expand the receptive field without increasing the complexity of the model. Finally, we propose a dynamic learning rate function that keeps the model from skipping over the optimal value in the final training phase and accelerates the convergence of the model.

4.3.1. ShallowBlock module performance analysis

We use a heatmap to compare the efficiency of the ShallowBlock module with that of the shallow feature extraction module in Next-Vit in extracting image features of plant diseases. Fig. 9 shows the extraction for Apple_Black_rot and Corn_Cercospora_leaf_spot in PlantVillage. Apple Cedar apple rust, Pepper bell Bacterial spot, and Tomato Bacterial spot are further types of plant diseases that we compared to evaluate the effectiveness of our ShallowBlock and Next-Vit's shallow feature extraction module in extracting leaf spot feature information. Our ShallowBlock uses multiple scales to extract features from input plant disease images, which enables feature fusion that makes the extracted features richer and reduces the loss of useful information. The heatmaps indicate that our ShallowBlock outperforms the shallow feature extraction module of Next-Vit (consisting of four 3 × 3 convolution layers) in the shallow layer of the plant disease network.

4.3.2. Comparative experiments in which a module is excluded separately

We separately excluded our original module and the improved version before conducting our experiments, as shown in Table 6. Our final model is denoted by CAST, while our final model combined with the self-distillation strategy is denoted by CAST+SD. Static_lr indicates the use of a static learning rate, which gives 0.9% lower accuracy compared to our dynamic learning rate function. We excluded the DCB module when training and testing our model for classification of leaf disease categories, resulting in a 1.5% decrease in accuracy. No_SCB indicates that the model's classification accuracy drops and its parameters and computational complexity increase by 1.5M and 0.56G, respectively, when the improved SCB module is removed. It can be observed that our final model surpasses the three models that each lack one improvement in all evaluation indices. Our final model, CAST-Net, attains a test accuracy of 97.5%, which rises by 0.9% to 98.4% when integrated with the self-distillation training method. Although the overall recognition accuracies of our models decreased after excluding one of our improved modules, they still outperformed Next-Vit by 2.9%, 3.4%, and 2.9%, respectively, in the presence of the other two improved modules. This further highlights that each of our proposed modules is better than the original Next-Vit model for disease recognition and classification, emphasizing the indispensability of our proposed modules.

5. Discussion

In this study, we applied four image enhancement techniques to generate images of plant diseases in the laboratory simulating real shooting conditions before training the model. Our results indicate that pre-processing all data images to simulate authentic shooting conditions does not result in effective training. Therefore, we employed a random proportion method for data processing, which improves the generalization ability of the model. We present a shallow feature extraction module that uses the multi-scale fusion technique with adaptive weight coefficients to obtain extended feature information on plant leaf diseases in the initial stage of the network. The NCB and NTB components in the Next-Vit model have been improved. To simplify the model and enhance connectivity between channel features, we have substituted the original group convolution with depthwise convolution and pointwise convolution, and included the channel correlation operation. To better extract global information, we have employed the fusion of dilated convolution and the self-attention mechanism, resulting in the enhanced CAST-Net model exhibiting significantly improved test accuracy.

We have investigated knowledge distillation of the model in the context of implementing a lighter model to enhance classification accuracy. We found that the traditional teacher-student distillation
Table 6
Comparison on Tomato10 dataset and PlantVillage dataset.
NetWork Accuracy Loss Recall Precision F1-score Params(M) Flops(G)
technique used in applying knowledge distillation methods to the task of classifying agricultural plant diseases demands a significant amount of time to identify and prepare a robust and intricate teacher model in the knowledge that will be learned and transferred to our lighter model, and it may not be effective. Therefore, we implemented a self-distillation technique that leverages the impact of previous iterations during the training phase of our proprietary network model to direct the next iteration, thereby enhancing the precision of the test when compared to training the CAST-Net model independently.
Finally, we discovered that using a fixed learning rate during the model’s training phase hindered its ability to converge to the optimal value. Therefore, we introduced a new dynamic learning rate function to enable the model to adjust the learning rate size based on the function. We also evaluated the impact of various components of the dynamic learning function during training. According to our experimental results, the model converged more efficiently while using our learning rate function.
Our research aims to improve outcomes for classifying and recognizing plant leaf diseases and pests by employing a lighter model. The training dataset includes various plant species and disease categories. However, every photo only contains images of a single disease feature, lacking the intricacies and uncertainties of real-life conditions. Nevertheless, it is important to note that our model was not exclusively trained using images captured under laboratory conditions. We used a variety of stochastic simulations to mimic factors encountered in actual shooting scenarios, resulting in promising outcomes. Our forthcoming goals entail incorporating our model into mobile platforms for rapid detection and categorization of plant disease varieties in natural environments.
Compared to previous research findings, the dataset utilized in our study during the training phase only incorporates 13 plant categories and 26 disease categories. As this data is relatively limited in comparison to larger databases, it is essential that we incorporate additional plant species and disease categories into the model’s training to improve the model’s generalization ability and robustness. However, our model and methodology offer additional benefits, including fewer parameters, reduced computational complexity, and improved test accuracy for detecting diseases in single and multiple plant classifications compared to existing research.
6. Conclusion
In this paper, we propose a light neural network called CAST-Net, which combines Convolution and Self-Attention. Utilizing this model, we achieve single plant disease classification recognition and multi-plant disease classification recognition through the use of the self-distillation method. The initial feature extraction module in the model extracts additional disease feature details using an adaptive multi-scale feature fusion technique. The SCB module in CAST-Net utilizes a combination of deep and point convolution, as well as channel association to reduce the number of model parameters and link channel features. The DTB module achieves the recognition of leaf disease images with a large sensory field for global feature extraction by combining the expansion convolution module, DCB, and the self-attention mechanism. We implemented a novel learning rate function during the training phase to enhance the model’s ability to converge towards the optimum. We also employed a self-distillation method to further refine the accuracy of our model for plant disease recognition. In order to validate our model and methodology, we carried out experiments on the tomato 10-class dataset and PlantVillage dataset. These datasets have been subjected to simulated real-world conditions. We then compared our results with those obtained from established models such as Next-Vit, ResNet34, MobileNet, ConvNeXt, Vit-base, SwinVit-Tiny and MobileVit. The results demonstrated that our model and methodology achieved better accuracy, had fewer model parameters, and lower computational complexity. Our model has a size of 11.69 million, with 1.84 billion FLOPs, and yields an accuracy of 98.4% for the 10-class Tomato dataset and 99.0% for the enhanced PlantVillage dataset.
Funding
This work was supported by National Key Research and Development Program of China (2019YFE0126100); the Key Research and Development Program in Zhejiang Province of China (2019C54005); National Natural Science Foundation of China (61605173) and (61403346); Natural Science Foundation of Zhejiang Province (LY16C130003).
Acknowledgments
Appreciations are given to the editors and reviews of the Journal.
CRediT authorship contribution statement
Yun Zhao: Writing – review & editing, Formal analysis, Data curation, Conceptualization. Yang Li: Writing – review & editing, Writing – original draft, Methodology, Data curation, Conceptualization. Na Wu: Supervision, Investigation. Xing Xu: Project administration, Investigation, Funding acquisition.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Data availability
Data will be made available on request.
References
Alguliyev, R., Imamverdiyev, Y., Sukhostat, L., Bayramov, R., 2021. Plant disease detection based on a deep model. Soft Comput. 25, 13229–13242.
Alshammari, H., Gasmi, K., Krichen, M., Ammar, L.B., Abdelhadi, M.O., Boukrara, A., Mahmood, M.A., 2022. Optimal deep learning model for olive disease diagnosis based on an adaptive genetic algorithm. Wireless Commun. Mobile Comput. 1–13, 2022.
Aqel, D., Al-Zubi, S., Mughaid, A., Jararweh, Y., 2022. Extreme learning machine for plant diseases classification: a sustainable approach for smart agriculture. Cluster Comput. 1–14.
Batchuluun, G., Nam, S.H., Park, K.R., 2022. Deep learning-based plant-image classification using a small training dataset. Mathematics 10, 3091.
Bhagat, M., Kumar, D., 2023. Efficient feature selection using bows and surf method for leaf disease identification. Multimed. Tool. Appl. 1–25.
Chen, Y., Dai, X., Chen, D., Liu, M., Dong, X., Yuan, L., Liu, Z., 2022. Mobile-former: bridging mobilenet and transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5270–5279.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al., 2020. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale arXiv preprint arXiv:2010.11929.
Ghofrani, A., Mahdian Toroghi, R., 2022. Knowledge distillation in plant disease recognition. Neural Comput. Appl. 34, 14287–14296.
Guo, J., Han, K., Wu, H., Tang, Y., Chen, X., Wang, Y., Xu, C., 2022a. Cmt: convolutional neural networks meet vision transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12175–12185.
Guo, Y., Lan, Y., Chen, X., 2022b. Cst: convolutional swin transformer for detecting the degree and types of plant diseases. Comput. Electron. Agric. 202, 107407.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Hinton, G., Vinyals, O., Dean, J., 2015. Distilling the Knowledge in a Neural Network arXiv preprint arXiv:1503.02531.
Howard, A., Sandler, M., Chu, G., Chen, L.C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., et al., 2019. Searching for mobilenetv3. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1314–1324.
Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H., 2017. Mobilenets: Efficient Convolutional Neural Networks for Mobile Vision Applications arXiv preprint arXiv:1704.04861.
Hu, J., Shen, L., Sun, G., 2018. Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q., 2017. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
Hughes, D., Salathé, M., et al., 2015. An Open Access Repository of Images on Plant Health to Enable the Development of Mobile Disease Diagnostics arXiv preprint arXiv:1511.08060.
Li, J., Xia, X., Li, W., Li, H., Wang, X., Xiao, X., Wang, R., Zheng, M., Pan, X., 2022. Next-vit: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios arXiv preprint arXiv:2207.05501.
Li, X., Li, S., Omar, B., Wu, F., Li, X., 2021. Reskd: residual-guided knowledge distillation. IEEE Trans. Image Process. 30, 4735–4746.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B., 2021. Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022.
Lu, X., Yang, R., Zhou, J., Jiao, J., Liu, F., Liu, Y., Su, B., Gu, P., 2022. A hybrid model of ghost-convolution enlightened transformer for effective diagnosis of grape leaf disease and pest. Journal of King Saud University-Computer and Information Sciences 34, 1755–1767.
Madhavan, M.V., Thanh, D.N.H., Khamparia, A., Pande, S., Malik, R., Gupta, D., 2021. Recognition and classification of pomegranate leaves diseases by image processing and machine learning techniques. Comput. Mater. Continua (CMC) 66, 2939–2955.
Mehta, S., Rastegari, M., 2021. Mobilevit: Light-Weight, General-Purpose, and Mobile-Friendly Vision Transformer arXiv preprint arXiv:2110.02178.
Pandey, A., Jain, K., 2022. A robust deep attention dense convolutional neural network for plant leaf disease identification and classification from smart phone captured real world images. Ecol. Inf. 70, 101725.
Pandian, J.A., Kanchanadevi, K., Kumar, V.D., Jasińska, E., Goňo, R., Leonowicz, Z., Jasiński, M., 2022. A five convolutional layer deep convolutional neural network for plant leaf disease detection. Electronics 11, 1266.
Qian, X., Zhang, C., Chen, L., Li, K., 2022. Deep learning-based identification of maize leaf diseases is improved by an attention mechanism: self-attention. Front. Plant Sci. 13, 864486.
Saberi Anari, M., et al., 2022. A hybrid model for leaf diseases classification based on the modified deep transfer learning and ensemble approach for agricultural aiot-based monitoring. Comput. Intell. Neurosci. 2022.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C., 2018. Mobilenetv2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
Sathya, K., Rajalakshmi, M., 2022. Rda-cnn: enhanced super resolution method for rice plant disease classification. Comput. Syst. Sci. Eng. 42.
Shen, Y., Xu, L., Yang, Y., Li, Y., Guo, Y., 2022. Self-distillation from the last mini-batch for consistency regularization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11943–11952.
Simonyan, K., Zisserman, A., 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition arXiv preprint arXiv:1409.1556.
Singh, R.K., Tiwari, A., Gupta, R.K., 2022. Deep transfer modeling for classification of maize plant leaf disease. Multimed. Tool. Appl. 81, 6051–6067.
Srinivas, A., Lin, T.Y., Parmar, N., Shlens, J., Abbeel, P., Vaswani, A., 2021. Bottleneck transformers for visual recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16519–16529.
Suarez Baron, M.J., Gomez, A.L., Diaz, J.E.E., 2022. Supervised learning-based image classification for the detection of late blight in potato crops. Appl. Sci. 12, 9371.
Tabbakh, A., Barpanda, S.S., 2022. Evaluation of machine learning models for plant disease classification using modified glcm and wavelet based statistical features. Trait. Du. Signal 39, 1893.
Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L., Zhang, L., 2021. Cvt: introducing convolutions to vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22–31.
Yang, J., Yang, Y., Li, Y., Xiao, S., Ercisli, S., 2022. Image information contribution evaluation for plant diseases classification via inter-class similarity. Sustainability 14, 10938.
Zhang, L., Song, J., Gao, A., Chen, J., Bao, C., Ma, K., 2019. Be your own teacher: improve the performance of convolutional neural networks via self distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3713–3722.
Zhao, Y., Sun, C., Xu, X., Chen, J., 2022. Ric-net: a plant disease classification model based on the fusion of inception and residual structure and embedded attention mechanism. Comput. Electron. Agric. 193, 106644.
The lightweight deep learning classification network model introduces several innovations to improve plant disease classification accuracy. Firstly, it uses a shallow plant disease feature extraction module that applies multi-scale feature fusion across four branches, enhancing the extraction of global lesion information. Secondly, the model replaces the group convolution in the NCB module with depthwise convolution, pointwise convolution, and channel information correlation, and names the resulting module SCB, to better extract local features with fewer parameters. Thirdly, a new DCB module uses expansion (dilated) convolution to enlarge the sensory (receptive) field and is fused with a multi-head self-attention mechanism to form the DTB module for better global feature extraction. Additionally, a dynamic learning rate function is proposed to prevent the model from falling into local optima and to improve convergence. Finally, self-distillation during training lets the model learn from its previous iterations to enhance classification precision.
The critical improvements made to the NCB and NTB components of the Next-Vit model include replacing group convolutions with depthwise and pointwise convolutions, along with channel information correlation operations, forming the SCB module. These changes simplify the model and make feature extraction more efficient. Additionally, the DCB module, which combines expansion (dilated) convolution with the self-attention mechanism, strengthens the model's capacity to extract and process global feature information, improving plant disease recognition and classification.
The proposed research addresses the challenge of achieving accurate plant disease classification with less complex models by developing a Next-Vit-based lightweight deep learning classification network. The model incorporates modules such as SCB and DCB that optimize feature extraction while reducing parameters compared to traditional methods. A dynamic learning rate function improves training convergence, and self-distillation refines classification precision without increasing model complexity. Together, these strategies reduce computational demands and improve classification accuracy, balancing model complexity against performance.
Self-distillation enhances the precision of plant disease classification models by using the outputs of earlier training iterations to guide subsequent iterations. This refines the learning process, enabling the model to better recognize intricate patterns and features within plant disease images. By drawing on knowledge gained from prior iterations, self-distillation raises the model's classification accuracy above that of the same model trained independently without distillation.
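The idea of letting an earlier iteration guide the current one can be sketched as a loss function. This is a minimal illustration and not the paper's exact formulation, which is not given in this section. It assumes the common soft-target recipe (temperature-scaled softmax plus a KL term), and the names `prev_logits`, `alpha`, and `temperature` are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Cross-entropy of the hard label under the current prediction."""
    return -math.log(softmax(logits)[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def self_distillation_loss(logits, label, prev_logits=None,
                           alpha=0.5, temperature=3.0):
    """Blend the hard-label loss with a soft-target term.

    prev_logits: the model's own output for the same sample from an
    earlier iteration (assumed to be cached by the training loop).
    alpha and temperature are illustrative values, not the paper's.
    """
    ce = cross_entropy(logits, label)
    if prev_logits is None:  # first pass: nothing to distil from yet
        return ce
    teacher = softmax(prev_logits, temperature)
    student = softmax(logits, temperature)
    # The T^2 factor keeps the soft-target gradients on a comparable scale.
    kd = kl_divergence(teacher, student) * temperature ** 2
    return (1 - alpha) * ce + alpha * kd

# Usage: the soft-target term is zero when the prediction has not changed.
current = [2.0, 0.5, -1.0]
print(self_distillation_loss(current, label=0))
print(self_distillation_loss(current, label=0, prev_logits=[1.5, 0.8, -0.5]))
```

In a real training loop the earlier outputs would come from the previous mini-batch or epoch and would be treated as constants (no gradient flows through them).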
A static learning rate might hinder model optimization in plant disease classification because it can cause the model to converge to suboptimal local minima, failing to reach the optimal solution. Static rates do not adapt to the nuances of training dynamics, potentially causing either slow convergence if too low, or oscillations and divergence if too high. By contrast, a dynamic learning rate allows the model to adjust learning rates during training, providing a better balance between exploration and convergence throughout the optimization process.
Using a dynamic learning rate function in training neural network models for plant disease detection offers several advantages. It helps the model bypass local optima and accelerate convergence during the training phase. This function adjusts the learning rate throughout the training process, allowing the model to explore more optimal solutions and improve its learning efficiency. As a result, the model achieves better convergence and can reach higher classification accuracy compared to using a static learning rate.
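The paper's own learning rate function is not reproduced in this section, so the sketch below uses a widely known stand-in, linear warmup followed by cosine decay, purely to show what "adjusting the learning rate during training" looks like in code. The function name `dynamic_lr` and all default values are assumptions for illustration.

```python
import math

def dynamic_lr(step, total_steps, base_lr=1e-3, min_lr=1e-5, warmup_steps=100):
    """Stand-in schedule (NOT the paper's function): warmup, then cosine decay.

    During warmup the rate rises linearly to base_lr; afterwards it
    follows half a cosine wave down to min_lr.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # clamp if training runs past total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Usage: sample the schedule at a few points of a 1000-step run.
for step in (0, 50, 100, 550, 1000):
    print(step, dynamic_lr(step, total_steps=1000))
```

A fixed rate corresponds to returning `base_lr` for every step; the schedule differs by taking large steps early and small steps near the end, which is the behaviour the text credits with better convergence.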
The SCB module improves feature extraction in plant disease classification models by using depthwise separable convolution instead of the original grouped convolution. This modification allows the SCB module to produce a larger number of feature maps while keeping the same number of parameters and computational complexity. The change enhances the model's ability to capture more detailed feature information, thus improving the classification accuracy of plant disease images.
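The parameter savings behind this design can be checked with simple arithmetic. The sketch below counts weights (biases omitted) for a standard, a grouped, and a depthwise separable convolution. The channel width of 96 and the group count of 4 are hypothetical values chosen for illustration, not CAST-Net's actual configuration.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weights in a k x k convolution with the given number of groups."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k convolution followed by a pointwise 1 x 1 convolution."""
    depthwise = conv_params(c_in, c_in, k, groups=c_in)  # one filter per channel
    pointwise = conv_params(c_in, c_out, 1)              # mixes channels
    return depthwise + pointwise

# Usage with illustrative sizes: 96 channels in and out, 3 x 3 kernels.
print(conv_params(96, 96, 3))                 # standard: 82944
print(conv_params(96, 96, 3, groups=4))       # grouped:  20736
print(depthwise_separable_params(96, 96, 3))  # separable: 10080
```

Under these assumed sizes the separable variant needs roughly half the weights of the grouped one, so the same parameter budget can be spent on more output channels.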
The use of expansion (dilated) convolution in the DCB module improves the model's ability to process plant disease images by broadening the convolutional sensory (receptive) field, allowing the network to capture more extensive spatial information without increasing model complexity. This lets the model extract significant global features from plant disease images more effectively, contributing to better classification performance than convolutions with a narrower scope.
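How dilation widens the receptive field without adding weights can be made concrete with the standard formula for the effective kernel size, k + (k - 1)(d - 1). The sketch below is generic; the dilation rates 1, 2, 3 are an illustrative choice, since the DCB module's actual rates are not stated in this section.

```python
def effective_kernel_size(k, dilation):
    """Span covered by a k-tap kernel whose taps are `dilation` pixels apart."""
    return k + (k - 1) * (dilation - 1)

def receptive_field(layers):
    """Receptive field of stacked convolutions.

    layers: list of (kernel_size, stride, dilation) tuples, input to output.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for k, stride, dilation in layers:
        rf += (effective_kernel_size(k, dilation) - 1) * jump
        jump *= stride
    return rf

# Usage: three 3 x 3 layers have the same weight count in both stacks.
plain = [(3, 1, 1), (3, 1, 1), (3, 1, 1)]
dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 3)]  # hypothetical dilation rates
print(receptive_field(plain))    # 7
print(receptive_field(dilated))  # 13
```

Both stacks use nine weights per filter in every layer, yet the dilated stack sees a 13-pixel window instead of 7, which is the effect the text describes as a larger sensory field.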
The ShallowBlock module outperforms Next-Vit's shallow feature extraction module by employing multiple scales to extract and fuse features from input plant disease images. This approach enhances the richness of the extracted features and reduces the loss of useful information. In comparative experiments, ShallowBlock demonstrated superior performance by achieving a more comprehensive extraction of plant disease features compared to the four 3 × 3 convolution layers utilized by Next-Vit's module.
The CAST model and its enhanced version, CAST+SD, differ primarily in the application of self-distillation. While CAST serves as the base model, CAST+SD includes self-distillation in its training process, contributing to an increase in test accuracy from 97.5% to 98.4%. This self-distillation technique allows the model to leverage previous learning phases, improving overall classification precision compared to using the CAST model independently.