
Zhao (2022)

The paper presents CAST-Net, a lightweight neural network model that combines convolution and self-attention for the recognition of plant leaf diseases, achieving 98.4% accuracy on the tomato subset of the PlantVillage dataset. The model incorporates a dynamic learning rate function and self-distillation to enhance classification accuracy while reducing parameters and computational complexity. This approach demonstrates significant improvements over existing models, making it a promising tool for early detection and management of plant diseases in agriculture.

Crop Protection 180 (2024) 106637

Contents lists available at ScienceDirect

Crop Protection
journal homepage: www.elsevier.com/locate/cropro

Neural network based on convolution and self-attention fusion mechanism for plant leaves disease recognition

Yun Zhao 1, Yang Li, Na Wu, Xing Xu *
School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou, China

ARTICLE INFO

Keywords: Plant disease classification; CAST-Net; Lightweight neural network; Dynamic learning rate function; Self-distillation

ABSTRACT

In agriculture, early detection and identification of plant disease categories can help growers take timely countermeasures. The use of deep learning techniques for plant disease category detection prevents further spread of the disease and helps to prevent crop production losses. In this paper, based on the Next-Vit neural network model, we propose CAST-Net, a lightweight neural network that combines convolution and self-attention, and we adopt self-distillation on this model to increase accuracy in classifying plant leaf diseases while reducing the number of model parameters and FLOPs. Our model and method achieved 98.4% accuracy on the tomato subset of the data-enhanced PlantVillage dataset, a 4.9% improvement over the Next-Vit model, and 99.0% accuracy on the full PlantVillage set, a 6.9% improvement over the Next-Vit model. We also propose a new dynamic learning rate function that is applied in the training phase to prevent the model from falling into a local optimum. The results show that our model and method have higher accuracy, fewer parameters, shorter training time and lower computational complexity than existing models.

1. Introduction

As a significant agricultural nation, China is bound to face pests and diseases during the crop growth process. Failing to detect pests and diseases and categorize them correctly may result in inadequate countermeasures and substantial losses. Manual identification of pests and diseases is highly resource-intensive, requiring significant manpower and material resources, and it may take a prolonged period to produce results. The use of machine learning in plant pest and disease classification and identification is therefore of utmost importance for achieving effective and precise diagnosis, and it remains a current research priority and a topic of considerable interest. Researchers have been carrying out ongoing investigations on computer vision technology to attain accurate plant pest classification. Machine learning algorithms can be categorized primarily as supervised learning, unsupervised learning, and semi-supervised learning. Commonly used supervised learning algorithms include the support vector machine, the K-nearest neighbor algorithm, and the decision tree algorithm. For example, the K-means clustering algorithm is used for image segmentation, the GLCM algorithm and the BDA optimization algorithm are used for feature extraction and feature selection respectively, and finally the ELM algorithm is used for plant leaf disease classification (Aqel et al., 2022). Hybrid machine learning techniques have been used for disease classification of tomato, potato and chilli crops (Bhagat and Kumar, 2023). A method based on inter-class similarity analysis, which assesses the contribution of sub-image information and is combined with an active learning image selection strategy, has been proposed to resolve classification inaccuracies in intelligent identification of plant diseases (Yang et al., 2022). Supervised learning and image classification have been applied to the early detection of potato late blight (Suarez Baron et al., 2022). In another study, plant disease features were extracted using various statistical features, an improved grey-level co-occurrence matrix (GLCM) technique was employed, and six machine learning models were evaluated for classification. The highest classification accuracy rates were achieved using the light gradient boosting machine (LGBM) and support vector machine (SVM) models, at 94.39% and 93.15% respectively (Tabbakh and Barpanda, 2022). A supervised learning model combined with a support vector machine was used to extract pomegranate plant disease features through region of interest (ROI) feature extraction techniques. Pomegranate leaf disease

* Corresponding author.
E-mail addresses: [email protected] (Y. Zhao), [email protected] (Y. Li), [email protected] (N. Wu), [email protected] (X. Xu).
1
This is the first author footnote.

https://doi.org/10.1016/j.cropro.2024.106637
Received 18 September 2023; Received in revised form 21 February 2024; Accepted 27 February 2024
Available online 5 March 2024
0261-2194/© 2024 Elsevier Ltd. All rights reserved.

classification was achieved with a final accuracy rate of 98.07% (Madhavan et al., 2021). An optimal deep learning model based on an adaptive genetic algorithm was proposed to diagnose olive leaf diseases. With an accuracy of around 96% in the multi-class classification task and 98% in the binary classification task (Alshammari et al., 2022), the model proves to be effective. An improved deep transfer learning model was used to extract features of multiple leaf diseases for classification, and multiple support vector machine (SVM) models were used to improve feature recognition and processing speed (Saberi Anari et al., 2022).

Because traditional machine learning techniques do not perform well on images with complex real-world backgrounds, researchers have continued to explore deep learning techniques. Convolutional neural networks such as ResNet (He et al., 2016), DenseNet (Huang et al., 2017), and VGG (Simonyan and Zisserman, 2014) gradually emerged. These network models improve image classification accuracy by continuously increasing the depth or width of the neural network, but this also greatly increases the complexity of the model. Subsequently, the MobileNet family of lightweight networks (Howard et al., 2017, 2019; Sandler et al., 2018) was developed to reduce the complexity of the model while minimizing the impact on accuracy. SENet (Hu et al., 2018) was proposed to adaptively recalibrate the channel features based on the interdependencies between channels. Later, researchers brought the Transformer structure into the visual domain, based on the attention mechanism, with the ViT (Dosovitskiy et al., 2020) and Swin Transformer (Liu et al., 2021) architectures. This structure again improved classification accuracy on large datasets.

Recent studies (Srinivas et al., 2021; Wu et al., 2021; Guo et al., 2022a; Mehta and Rastegari, 2021; Chen et al., 2022; Li et al., 2022) achieve better performance by combining the advantages of convolution and Transformer. BoTNet (Srinivas et al., 2021) uses a multi-head self-attention mechanism to replace the last three bottleneck blocks of ResNet. CvT (Wu et al., 2021) uses depthwise separable convolution combined with self-attention. CMT (Guo et al., 2022a) uses the Transformer structure to capture long-range dependencies and convolution to capture local features. MobileViT (Mehta and Rastegari, 2021) uses spatial inductive bias to learn representations with fewer parameters in different visual tasks. MobileFormer (Chen et al., 2022) implements a parallel structure for MobileNet and Transformer fusion through a bilinear bridge. Next-Vit (Li et al., 2022) combines CNN and Transformer to capture local and global feature information through a deployable mechanism. With the development of deep convolutional neural networks, deep learning classification models have also been applied to agricultural problems. The AlexNet model was explored to detect maize leaf diseases quickly and accurately, eventually achieving 99.16% accuracy (Singh et al., 2022). A new Reconstructed Disease Aware Convolutional Neural Network (RDA-CNN) was proposed to take low-resolution images of rice crops as input and convert them to high-resolution output to recover the disease condition of different parts of the rice plant, and the experimental results showed that the RDA-CNN improved classification performance by 4% to 6% (Sathya and Rajalakshmi, 2022). A deep convolutional neural network (DCNN) model was proposed for plant leaf disease classification, with its hyperparameters optimized using stochastic search techniques, and experiments showed that the overall performance of the model outperformed advanced machine learning techniques (Pandian et al., 2022). A method for accurate identification of plant leaf diseases based on deep convolutional neural networks and gating units has been proposed, which was evaluated on PlantVillage data and achieved better results than other models (Alguliyev et al., 2021). Embedding an improved CBAM convolutional attention module into an improved Inception network improves the effectiveness of plant leaf disease classification, and the results show that the model has fewer parameters, shorter training time, and higher detection accuracy (Zhao et al., 2022). An attention-based method was proposed that represents the visual information of local regions of an image by labelling, calculates the information correlation between local regions using the attention mechanism, and finally integrates the global information for classification. This effectively emphasizes the maize leaf lesion information and suppresses the background noise, facilitating fine-grained maize leaf disease recognition under complex backgrounds (Qian et al., 2022). The ghost network was used as the convolutional backbone to generate intermediate feature maps with linear operations, followed by a transformer encoder with integrated multi-head attention to extract deep semantic features, and the results showed that the method was effective and accurate for grape leaf field diagnosis (Lu et al., 2022). The proposed CST, based on Swin Transformer, can identify the degree and type of disease with high testing accuracy and excellent robustness (Guo et al., 2022b). An Attention Dense Learning (ADL) mechanism was proposed that combines hybrid S-type attentional learning with the basic dense learning process of a deep CNN. This helps to obtain robustness and higher testing accuracy for plant leaf disease classification. The proposed mechanism achieved 97.33% classification accuracy on the RGB leaf dataset in a real-world environment (Pandey and Jain, 2022). In order to improve plant disease classification performance, the number of images in the training set was reduced by using the Plant Image Generative Adversarial Network (PI-GAN) for data augmented training (Batchuluun et al., 2022).

Deep learning models are employed to improve the accuracy of image classification, which in turn increases the model's complexity. To address this, researchers use knowledge distillation. A complex teacher model teaches distilled knowledge to a simpler student model, allowing the lightweight model to achieve a high level of classification accuracy. The knowledge gained from a combination of models is consolidated into a single model (Hinton et al., 2015). The res-student approach uses the knowledge gap between teachers and students to train lightweight students (Li et al., 2021). A self-distillation framework was proposed to compress deep network model knowledge into shallow knowledge, which improves average image classification accuracy by 2.65% (Zhang et al., 2019). To help the model better learn the distribution and features of the data, the dynamic soft target knowledge from the previous data sampling during training is provided as a constraint to the current training iteration. According to experiments, this strategy is compatible with image classification and can be applied to it (Shen et al., 2022). The accuracy of plant pest and disease classification increased by 2.12% when knowledge distillation was applied to the MobileNet lightweight network model (Ghofrani and Mahdian Toroghi, 2022).

Previous studies have shown that deep learning techniques can be applied to plant disease classification and identification tasks, but most of them achieve more accurate identification of plant leaf disease categories by using more complex models. Other studies make the model lightweight but do not improve classification accuracy. Based on this analysis, this paper proposes a lightweight deep learning classification network model based on the Next-Vit model and applies it to plant disease classification and detection, reducing the complexity of the model while improving classification accuracy. Combined with a self-distillation technique, the model can better learn the features of plant disease spots. We also propose a new dynamic learning rate function. The main contributions are:

1. A shallow plant disease feature extraction module is proposed, which adopts a multi-scale feature fusion technique to extract the input plant disease image features from four branches respectively, the first branch of maximum pooling for local significant feature extraction, the second branch of average pooling for local average


feature extraction, the third branch of 3 × 3 convolution for richer feature information extraction, and the fourth, residual branch for original feature information. Finally, the four branches are multiplied by adaptive weight coefficients before feature fusion. This module enables the model to extract more abundant global lesion information in the shallow part of the network.
2. Based on Next-Vit, we improved two modules. First, to reduce the number of model parameters while better extracting the local feature information of plant leaf diseases, we replaced the group convolution in NCB with depthwise convolution and pointwise convolution, added a channel information correlation operation, and renamed the module SCB. Second, to extract the global feature information of plant diseases, we propose a module that extracts features by expansion convolution, then performs normalization and nonlinear activation, and then fuses the result through a residual connection; we call it DCB. This module enlarges the convolutional receptive field when extracting feature maps. We fused the DCB with a multi-head self-attention mechanism and named the result DTB.
3. Since the model will fall into a local optimum if a static learning rate is used in the training phase, we propose a new dynamic learning rate function to solve the problem, and compare it with other dynamic learning rate functions to analyze its advantages.
4. To further improve our model's learning of leaf disease image information and its classification accuracy, we adopted a self-distillation training method in the training stage, so that the softened target results of the model from the previous epoch guide learning in the next epoch.

We applied the proposed modules and the improved model CAST-Net to the task of plant leaf disease classification and recognition, and conducted experimental analyses on the enhanced tomato 10-class dataset and the PlantVillage dataset, respectively. Compared with Next-Vit and other advanced classification network models, our model significantly improved the accuracy of plant disease class recognition with a lower number of parameters and lower computational complexity. In addition, using a self-distillation training method for CAST-Net in the training phase helped the model learn leaf disease information better and further improved classification accuracy.

2. Data processing

2.1. Dataset

In this paper, we use the PlantVillage dataset (Hughes and Salathé, 2015), which contains 55448 images of plant disease leaves taken in the laboratory. It contains a total of 26 categories of disease leaves for 13 plant species. There are 39 categories of sample images in this dataset, including 38 plant disease categories and one background-image category, all of which have an image size of 256 pixels × 256 pixels. We conducted experiments on the full set and on its tomato subset, respectively. The tomato subset has a total of 10 categories and 18,160 images, as shown in Table 1.

Table 1
Tomato10 dataset.

Type                              Numbers
Tomato_Bacterial_Spot             2127
Tomato_Early_Blight               1000
Tomato_Healthy                    1591
Tomato_Late_Blight                1909
Tomato_Leaf_Mold                  952
Tomato_Septoria_Leaf_Spot         1771
Tomato_Spider_Mites               1676
Tomato_Target_Spot                1404
Tomato_Mosaic_Virus               373
Tomato_Yellow_Leaf_Curl_Virus     5357
Total                             18160

2.2. Data processing

The current data were taken under laboratory conditions against a single background, which differs from the real environment in which images are collected, so a model trained only on these images often does not achieve the desired plant disease recognition performance in real environments. The real environment has a complex background, with exposure variation, noise, and obstacles obscuring the leaves. Therefore, since real-environment data is lacking, we perform five pre-processing steps: (1) resizing each image to 224 × 224, (2) randomly offsetting the brightness, contrast, and saturation of the whole image by up to 50% of the original values, (3) adding Gaussian noise to the image with 50% probability, (4) erasing pixels of the image with 50% probability over a region whose size is 0.02 to 0.13 of the input size, and (5) normalization. These five steps simulate images taken in real environments where there are different levels of exposure, noise blur, and clutter occlusion.

As shown in Fig. 1, A applies random exposure to the original image with 50% probability, B randomly adds different levels of Gaussian noise to the original image with 50% probability, and C randomly erases pixels within a certain range of the original image with 50% probability. A+B+C is the fusion of the three data pre-processing methods that we perform before training, where a single original laboratory image is randomly processed at a certain ratio to simulate images taken in a real environment. Our model is then trained on the processed dataset to obtain better generalization ability.

2.3. Experimental equipment

All experiments in this paper were conducted using the following hardware and software: an AMD Ryzen 9 5900X 12-core processor running at 3.70 GHz, an Nvidia GeForce RTX 3090 graphics card, 128 GB of onboard RAM, Windows 10 as the operating system, and PyCharm 2022, Python 3.8, and PyTorch 1.12.0.

3. Material and methods

Currently, the accuracy of pure convolutional neural networks for image classification and recognition is inferior to that of Transformer models. Nonetheless, Transformer models require excessive computational power and parameters to support model training. To address this, we improve Next-Vit, a model that combines convolution and Transformer, and design a multi-scale feature fusion module. The NCB and NTB modules of the Next-Vit network were modified to decrease model parameters and boost plant disease classification accuracy. The shallow feature extraction module extracts richer feature information in the early stage. Additionally, a new dynamic learning rate function was proposed for stability during training and testing, which further enhanced plant disease classification accuracy. To enhance classification precision further, we implemented a self-distillation technique, which enables the model to learn plant spot features more thoroughly and achieve superior classification performance.
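The self-distillation idea, in which the softened outputs of the previous epoch guide learning in the next epoch, can be sketched as a loss that mixes cross-entropy on the true label with a KL term toward the previous epoch's softened prediction. This is a minimal pure-Python illustration rather than the training code used in the paper; the temperature `T`, the mixing weight `alpha`, and the function names are hypothetical choices.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-softened softmax over a list of logits."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def self_distill_loss(logits, label, prev_logits, T=3.0, alpha=0.5):
    """Mix hard-label cross-entropy with KL(previous soft target || current).

    logits      -- current-epoch logits for one sample
    label       -- ground-truth class index
    prev_logits -- logits the same model produced for this sample in the
                   previous epoch (treated as a fixed soft target)
    T, alpha    -- hypothetical temperature and mixing weight (assumptions)
    """
    ce = -math.log(softmax(logits)[label])
    soft_prev = softmax(prev_logits, T)
    soft_cur = softmax(logits, T)
    kl = sum(q * math.log(q / s) for q, s in zip(soft_prev, soft_cur) if q > 0)
    # T*T rescales the soft term, a common convention in distillation losses.
    return (1 - alpha) * ce + alpha * (T * T) * kl
```

When the current logits equal those of the previous epoch, the KL term vanishes and only the weighted cross-entropy remains, so the soft target only contributes where the model's prediction has drifted.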


Fig. 1. Data image enhancement.
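The enhancement steps illustrated in Fig. 1 can be sketched on a small grayscale image stored as rows of floats in [0, 1]. This is an illustrative sketch only: resizing is assumed to have been done already, only brightness is jittered (contrast and saturation are omitted), the erased region is assumed to be a square covering 2% to 13% of the image area, and the noise level `sigma` and the normalization constants `mean` and `std` are hypothetical values not given in the text.

```python
import math
import random

def clamp01(v):
    """Clip a pixel value to the [0, 1] range."""
    return max(0.0, min(1.0, v))

def augment(img, rng, p=0.5, sigma=0.05, mean=0.5, std=0.5):
    """Apply brightness jitter, noise, erasing and normalization to one image."""
    h, w = len(img), len(img[0])

    # Brightness offset of up to +/-50% of the original value.
    factor = rng.uniform(0.5, 1.5)
    out = [[clamp01(v * factor) for v in row] for row in img]

    # Additive Gaussian noise, applied with probability p.
    if rng.random() < p:
        out = [[clamp01(v + rng.gauss(0.0, sigma)) for v in row] for row in out]

    # Erase a square patch covering 2%-13% of the image, with probability p.
    if rng.random() < p:
        area = rng.uniform(0.02, 0.13) * h * w
        side = max(1, min(h, w, int(math.sqrt(area))))
        top = rng.randrange(h - side + 1)
        left = rng.randrange(w - side + 1)
        for r in range(top, top + side):
            for c in range(left, left + side):
                out[r][c] = 0.0

    # Normalization with the assumed mean and standard deviation.
    return [[(v - mean) / std for v in row] for row in out]
```

Passing a seeded `random.Random` instance as `rng` makes the randomly chosen operations reproducible, which helps when comparing training runs.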

3.1. ShallowBlock

When using convolutional operations for feature extraction, a small kernel has a limited receptive field, which favors feature extraction from small areas. Conversely, a large kernel has a larger receptive field, which favors feature extraction of entire leaf diseases. The shallow extraction stage of existing neural network models typically consists of a single convolutional layer with a large kernel, or several stacked convolutional layers with small kernels, applied to the input image to extract edge information. However, this approach can only extract feature information at a single scale. We employ multi-scale feature fusion to extract shallow feature map information. In the shallow layer of the network, we aim to extract image features of varying granularity by incorporating multiple scales. We optimize the weight coefficients of the features extracted from each branch scale and then perform weighted fusion. This establishes a basis for the network to extract sophisticated semantic information later and avoids losing crucial features early on. As shown in Fig. 2, we apply a 1 × 1 convolution at the start of each branch to enhance the network model's ability to extract additional features and to increase the image's channel dimension. The first branch applies maximum pooling to the local region of the feature map to extract important features. The second branch uses average pooling to extract overall data information features from the feature map. The third branch employs two convolutional sub-branches to extract the local features of the feature map by effectively halving the channels. The resulting feature maps are concatenated, and the original feature information is preserved in the residual branch. Finally, the output of each of the four branches is multiplied by its adaptively learned weight coefficient and the results are added to achieve feature fusion, which yields image feature information extracted at different scales. The feature fusion formula is:

$$F = w_1 \times f_r + w_2 \times f_m + w_3 \times f_a + w_4 \times f_c \tag{1}$$

where w_1, w_2, w_3, and w_4 are the normalized weight coefficients, f_m is the first branch feature, f_a is the second branch feature, f_c is the third branch feature, and f_r is the residual branch feature.

Fig. 2. ShallowBlock structure.
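The weighted fusion of Eq. (1) can be illustrated with a few lines of pure Python operating on flattened feature vectors. The text states only that the weight coefficients are normalized; the softmax normalization used here is an assumption, and the function names are hypothetical.

```python
import math

def normalize_weights(raw):
    """Softmax-normalize raw learnable weights (assumed normalization)."""
    m = max(raw)
    exps = [math.exp(r - m) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(f_r, f_m, f_a, f_c, raw_weights):
    """Compute F = w1*f_r + w2*f_m + w3*f_a + w4*f_c elementwise.

    f_r, f_m, f_a, f_c -- residual, max-pool, avg-pool and conv branch
                          features as equal-length flat lists
    raw_weights        -- four unnormalized weights, one per branch
    """
    w1, w2, w3, w4 = normalize_weights(raw_weights)
    return [w1 * r + w2 * m + w3 * a + w4 * c
            for r, m, a, c in zip(f_r, f_m, f_a, f_c)]
```

With equal raw weights every branch receives a coefficient of 0.25, so the fused feature is the plain average of the four branches; training would move the weights away from this starting point.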


3.2. SCBlock optimization. We also have developed the DTB module, which combines
the DCB expansion convolution module with the multi-head attention
Since the NCB module in Next-Vit only considers the extraction of mechanism to extract global feature information. The DCB module en­
local spatial feature information without taking into account the infor­ hances the receptive field of the initial 3 × 3 convolution without
mation correlation between different channels, we added the channel incurring additional computation costs and maintains information
shuffle operation to the improved NCB module, so that inter-channel integrity during the network’s convolution process as it deepens. This
shuffling is performed each time the network is trained by the mod­ approach ensures minimal information loss.
ule, making the inter-channel information correlated. We replaced the As displayed in Table 2, we conducted a computational analysis
GroupConv3 × 3 group convolution (the number of groups is equal to comparing the number of parameters and computation between the
the number of output channels/32) in the original NCB module with grouped convolution and DCB expansion convolution modules. Our
DW3 × 3 convolution (depth separable convolution). As the DW findings indicate that the DCB module enhances the sensory field of
convolution is a special group convolution (the number of groups is feature extraction without increasing computational load. For instance,
equal to the number of output channels) to extract local features, it is the input channel is configured to 64, and the output channel is set to 96.
combined with channel shuffle to correlate the global channel infor­ The convolution kernel size is established as 3, while the input image
mation, resulting in better feature extraction of the feature map. This size is defined as 224 × 224. The number of groups is predetermined at
form of convolution is the most efficient convolution compared to group 3. A comparison between the number of parameters and computations
convolution, with the same number of parameters and the same number for grouped convolution and inflated convolution is presented in
of operations, as it can produce Cout (the number of output channels) of Table 2.
feature maps, whereas GConv can only produce Cout/32 feature maps. In the DTB module, as illustrated in Fig. 5, the process initially in­
We found that most of the parameters in the model come from the volves a convolutional layer that employs a 1 × 1 filter. Subsequently, an
MLP multilayer perceptron layer, and the Next-Vit model adopts a attentional mechanism is utilized for the global extraction of informa­
convex structure MLP, and the hidden layer units in the middle of the tion from the input features of plant disease images. After this, the
convex structure MLP is twice as many as the number of inputs and feature integration takes place in a 1 × 1 convolutional layer, and the
outputs, so the number of parameters in the MLP layer is also very large, process then progresses to our DCB module, which starts with a 3 × 3
and we change the MLP layer to a concave structure as shown in Fig. 3, inflated convolution layer with a nulling rate of 2 in order to extract
and the number of neurons in the middle of the hidden layer is only half local information from plant disease image features using a large
of the number of neurons in the input-output layer or less. The number receptive field. This is followed by a 1 × 1 convolutional layer for feature
of neurons in the middle hidden layer is only half of the number in the integration and, to address the issue of gradient vanishing, residual
input and output layers or less, which greatly reduces the number of merging is also applied before passing through the concave MLP layer.
parameters in the MLP layer, and after experimental testing, the model
accuracy does not decrease, but the amount of parameters in the model 3.4. CAST-net
is reduced by two-thirds of the original.We call the improved NCB
module the SCB module. As shown in Fig. 4, the SCB module will first Since SCBlock extracts local feature information and DTBlock ex­
shuffle the channels of the input feature information so that the features tracts global feature information of plant disease images, we combine
extracted in the previous iteration are concerned with the information them to achieve a combination of local and global feature information.
connection between the channels when entering the SCB to extract the features again. The features then pass through a depthwise convolution with a 3 × 3 filter for local feature extraction, followed by nonlinear activation and normalization, then a pointwise convolution with a 1 × 1 filter for feature integration, followed by BN and ReLU, and finally a concave MLP layer.

3.3. DTBlock

We propose a DCB mini-module built around a 3 × 3 dilated (expansion) convolution with a dilation rate of 2, in which zeros are inserted between the kernel elements. This expands the receptive field of the kernel for extracting local features while preventing the extraction of excessive redundant information. Normalization and nonlinear activation follow, and then a 1 × 1 convolution integrates the features. Additionally, we incorporate a residual structure to avoid gradient vanishing as network depth increases and model

The input image first passes through the ShallowBlock to obtain rich shallow feature information. It then flows into the series-connected SCBlock and DTBlock, which we employ in tandem to integrate local and global feature information and to extract deep features, enabling the entire CAST-Net model to classify plant leaf disease categories.
As depicted in Fig. 6, n1, n2, n3, and n4 determine the number of stacked blocks in each Stage. In general, as the number of layers in the network increases, the model's complexity increases and its ability to identify plant leaf diseases accurately tends to improve. The configuration mainly used in our experiments is [3,4,10,3], denoting the number of stacked blocks per Stage. In Stage 1, the SCB is stacked three times. In Stage 2, the SCB is stacked three times and the DTB once. In Stage 3, the SCB is stacked eight times and the DTB twice. In Stage 4, the SCB is stacked twice and the DTB once. Finally, the fully connected layer discriminates the category of the input plant disease image. Fig. 6 illustrates the overall structure of the CAST-Net model, and Table 3 gives its detailed configuration.
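As a small illustration of how the [3,4,10,3] configuration follows from the per-stage stacking described above, the sketch below sums the SCB and DTB repetitions of each Stage. The `Stage` class and its field names are hypothetical and used only for this example; they are not taken from the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """Hypothetical record of how often each block is stacked in one Stage."""
    name: str
    scb_repeats: int  # number of stacked SCB modules
    dtb_repeats: int  # number of stacked DTB modules

    @property
    def depth(self) -> int:
        # Total number of stacked blocks in this Stage.
        return self.scb_repeats + self.dtb_repeats


# Repetition counts as stated in the text (Stage 1 uses no DTB).
STAGES = [
    Stage("Stage 1", scb_repeats=3, dtb_repeats=0),
    Stage("Stage 2", scb_repeats=3, dtb_repeats=1),
    Stage("Stage 3", scb_repeats=8, dtb_repeats=2),
    Stage("Stage 4", scb_repeats=2, dtb_repeats=1),
]


def stage_depths(stages):
    """Return the number of stacked blocks per Stage."""
    return [stage.depth for stage in stages]


print(stage_depths(STAGES))  # [3, 4, 10, 3]
```

The printed list matches the configuration [n1, n2, n3, n4] reported for the experiments.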

3.5. New dynamic learning rate function

When training the model with a fixed learning rate, we found that the model tends to jump around the minimum repeatedly in the later stage as the number of epochs increases, which may be caused by a learning rate that is too large at that stage. We therefore trained the model with a dynamically changing learning rate. When the gradient descent algorithm is used to optimize the objective function, the learning rate should decrease as the loss approaches its global minimum, so that the model gets as close to that point as possible. Therefore, we adopt e^{-x}, the mirror image of the exponential function, to construct the new dynamic learning rate function. For x from 0 to 1, the value of e^{-x} decreases slowly and does not drop to 0, which ensures that the model can still learn new feature information in the later stage. In addition, a constant learning rate η_max is used in the first 5 epochs, so that the model learns more in the early stage. After 5 epochs, the proposed dynamic learning rate function decays the learning rate as the epoch count increases, so that the loss decreases steadily and the accuracy increases:

\eta = \eta_{min} + \frac{1}{2} (\eta_{max} - \eta_{min}) \times \left[ 1 + e^{\left( -N \times \frac{E_{cur}}{E_{max}} \right)} \right]    (2)

where η is the dynamic learning rate at the current epoch, η_min is the set minimum learning rate, η_max is the set maximum learning rate, E_cur is the current epoch number, E_max is the set maximum epoch number, and N is the parameter controlling how fast the e^{-x} function decays. The larger N is, the faster the e^{-x} function decays and the faster the learning rate decays. The smaller N is, the slower the e^{-x} function decays and the slower the learning rate decays.

Fig. 3. MLP convex-to-concave.

Y. Zhao et al. Crop Protection 180 (2024) 106637

Fig. 4. (a) shows the original module NCB, (b) shows our modified module SCB.

Table 2
Comparison of the number of parameters and operations between group convolution and expansion convolution.
Parameters or Flops                       Value
Grouped Convolution Parameters            3 × 3 × (96/3) × 3 = 864
Grouped Convolution Calculations          3 × 3 × (96/3) × 3 × 224 × 224 = 43352064
DCB Expansion Convolution Parameters      3 × 3 × (96/3) × 3 = 864
DCB Expansion Convolution Calculations    3 × 3 × (96/3) × 3 × 224 × 224 = 43352064

Fig. 5. DTBlock structure.

3.6. Method of self-distillation

In traditional knowledge distillation, a large number of experiments are usually conducted beforehand to find a complex teacher model, which is trained in advance. The knowledge learned by the teacher is then transferred to the lightweight student model, controlled by the temperature hyperparameter T, to improve the accuracy of the lightweight model for leaf classification. We used self-distillation to make our leaf disease classification model even more accurate, starting from the already improved lightweight model.
The self-distillation method we use distils the soft targets that overlap with the next batch of samples, so that the training samples in each forward propagation are connected to a backward propagation, improving learning efficiency. Traditional supervised image classification methods use one-hot encoding to calculate the cross-entropy loss, so only the loss at the correct label position is considered and the loss at incorrect label positions is ignored. We therefore apply self-distillation in each iteration of the training phase. Our lightweight network acts as both the student and the teacher: the teacher's role is to generate soft targets for the next batch, and the student's role is to learn from the labels softened in the previous batch.

P_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}    (3)
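The dynamic learning rate of Eq. (2) in Section 3.5, together with the constant phase during the first 5 epochs, can be sketched in a few lines. This is a minimal sketch rather than the authors' implementation: the function name, the zero-based epoch index, and the numeric values of η_max, η_min and N used in the example are assumptions made for illustration only.

```python
import math


def dynamic_lr(epoch, eta_max, eta_min, max_epochs, n=5.0, constant_epochs=5):
    """Learning rate for a zero-based epoch index, following Eq. (2).

    The values of n and the zero-based indexing are illustrative assumptions.
    """
    if epoch < constant_epochs:
        # The first epochs use the constant maximum learning rate.
        return eta_max
    # e^{-x} with x = N * E_cur / E_max; it shrinks as training proceeds.
    decay = math.exp(-n * epoch / max_epochs)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + decay)


if __name__ == "__main__":
    # Hypothetical settings: 50 epochs, eta_max = 0.1, eta_min = 0.001.
    for epoch in (0, 4, 5, 25, 49):
        print(epoch, round(dynamic_lr(epoch, 0.1, 0.001, 50), 5))
```

Under this reading of Eq. (2), the rate never falls below η_min + (η_max − η_min)/2, because the bracketed term stays above 1. That is consistent with the statement that the model keeps learning in the later stage.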

Fig. 6. CAST-Net structure.

Table 3
CAST-Net detailed configuration.
Stages    Input Size    Block            CAST-Net Layers
Stem      224 × 224     Shallow Block    Shallow-B × 1, C = 32, S = 2
                                         Shallow-B × 1, C = 64, S = 2
Stage 1   64 × 64       Down Sampling    Conv 1 × 1, C = 96, S = 1
                        CAST Block       [ SCB × 3, C = 96 ]
Stage 2   64 × 64       Down Sampling    Avgpool, S = 2
                                         Conv 1 × 1, C = 96
                        CAST Block       [ SCB × 3, C = 192 ], [ DTB × 1, C = 256 ]
Stage 3   32 × 32       Down Sampling    Avgpool, S = 2
                                         Conv 1 × 1, C = 384
                        CAST Block       [ SCB × 8, C = 384 ], [ DTB × 2, C = 512 ]
Stage 4   16 × 16       Down Sampling    Avgpool, S = 2
                                         Conv 1 × 1, C = 768
                        CAST Block       [ SCB × 2, C = 768 ], [ DTB × 1, C = 1024 ]

loss_{ce} = -\frac{1}{m} \sum_{k=1}^{m} \sum_{i=1}^{n} y_i \cdot \log P_i    (4)

loss_{lb} = \frac{1}{n} \sum_{i=1}^{n} T^2 \cdot D_{kl}\left( P_i^{T,t-1} \,\|\, P_i^{T,t} \right)    (5)

loss = loss_{ce} + \alpha \cdot loss_{lb}    (6)

where P_i is the probability that the current sample belongs to category i, x_i is the logit of category i for the current sample, n is the total number of sample categories, and loss_ce is the cross-entropy loss. T is the distillation temperature, P_i^{T,t-1} is the probability that the current sample belongs to category i in the soft labels of iteration t − 1 at temperature T, α is a hyperparameter controlling how much of the total loss is accounted for by the distillation loss loss_lb, and loss is the total loss in model training.
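To make Eqs. (3)-(6) concrete, the following minimal sketch computes the total loss for one batch using plain Python lists. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the temperature is applied by dividing the logits by T before the softmax (the usual convention in knowledge distillation), and the distillation term is averaged over the samples of the batch.

```python
import math


def softmax(logits, temperature=1.0):
    """Eq. (3), with logits divided by the temperature (assumed convention)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Per-sample term of Eq. (4) for a one-hot label."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """D_kl(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def self_distillation_loss(logits, prev_logits, labels, temperature=3.0, alpha=0.5):
    """Total loss of Eq. (6) for one batch.

    prev_logits are the logits produced for the same samples in the previous
    iteration (the teacher role); logits are the current ones (the student).
    """
    batch = len(labels)
    ce = sum(cross_entropy(x, y) for x, y in zip(logits, labels)) / batch
    lb = sum(
        temperature ** 2
        * kl_divergence(softmax(p, temperature), softmax(x, temperature))
        for p, x in zip(prev_logits, logits)
    ) / batch
    return ce + alpha * lb
```

When the previous and current logits are identical, the distillation term vanishes and the total loss reduces to the cross-entropy loss; any disagreement between the two iterations adds a positive penalty weighted by α.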

The temperature determines how much attention is paid to negative labels during model training. At lower temperatures, less attention is paid to negative labels, especially those that are significantly lower than the average. At higher temperatures, the values associated with negative labels increase relatively and the model pays relatively more attention to them. However, too high a T can also make the model pay less attention to the positive labels, degrading model performance. To find the most suitable distillation temperature and α for CAST-Net, we conducted a series of experiments with different temperature and α settings for comparison. We concluded that the combination of T = 3 and α = 0.5 gives the best results (see Fig. 7).

Fig. 7. (a) indicates the effect of different Temperature on CAST-Net tested under the condition of Alpha fixed at 0.5. (b) Denotes the effect of different Alpha on CAST-Net tested under the condition of Temperature fixed at 3.

4. Experiments and results

We investigated the performance of the deep learning based neural network model CAST-Net in classifying single and multiple plant leaf diseases, and conducted comparative experiments with some representative network models: Next-Vit, ResNet34, MobileNet, ViT-base, ConvNeXt, SwinViT-Tiny and MobileViT. All of them were tested on a single leaf disease dataset, Tomato10, and a multiple leaf disease dataset, PlantVillage. If BatchSize is set too low, the model converges slowly, and if it is set too high, the model can fall into a local minimum, resulting in a strong oscillation of the loss curve. After numerous experiments, we found that a BatchSize of 32 gave the best results. The experimental results show that the weight updates in the neural network gradually reach a stable state before the training epoch reaches 50. Therefore, the number of epochs is uniformly set to 50, and the best model during training is stored automatically. We use the SGD optimizer with momentum set to 0.9 and weight decay set to 0.01. When the self-distillation method is not used, the cross-entropy loss function is used to calculate the loss for multi-class classification.

4.1. Comparison of different model experiments

To assess the benefits of our model and method more effectively, we conducted comparative experiments on seven evaluation indices: accuracy, loss, recall, precision, F1-score, model parameters, and computational complexity (flops). The precision, recall, accuracy, and F1-score are calculated as follows.

Precision = \frac{TP}{TP + FP}    (7)

Recall = \frac{TP}{TP + FN}    (8)

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}    (9)

F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}    (10)

The use of evaluation metrics such as Recall and Precision in multi-class classification tasks assumes that only the category currently being predicted is a positive plant disease sample, while all other categories serve as negative samples. TP represents the number of correctly detected positive plant disease samples, FP represents the number of incorrectly detected positive samples, TN represents the number of correctly detected negative samples, and FN represents the number of incorrectly detected negative samples. Since there is a negative correlation between Precision and Recall, the F1 value is used to balance these two metrics. The Loss value quantifies the deviation between the categories predicted by our model and the actual categories. Smaller Loss values indicate that the predicted categories are closer to the actual categories, and therefore that the model and method perform better. The model parameter count 'params' represents model complexity: the smaller it is, the simpler the model. Computational complexity is represented by 'flops': smaller values indicate a computationally cheaper model.
As shown in Table 4, on the Tomato10 dataset the Next-Vit model has an accuracy of 93.5%, while our CAST-Net model achieves 97.5%, an improvement of 4 percentage points over Next-Vit. After applying the self-distillation technique, the accuracy of our model improved further to 98.4%. We also trained and tested other well-known deep learning models for image classification. ResNet34, MobileNet, ViT-base, ConvNeXt, SwinViT-Tiny and MobileViT achieved accuracy rates of 95.3%, 92.4%, 86.2%, 80.8%, 83.6% and 88.8%, respectively. In terms of model parameters and computational complexity (flops), CAST-Net has 11.69 M and 1.84 G, respectively. Although our model has more parameters than MobileNet and MobileViT, and more flops than MobileNet, it outperforms both in classification accuracy. Compared to the other models, our model has fewer parameters and lower computational complexity. To demonstrate our model's generalization ability, we conducted training and testing experiments on the entire PlantVillage dataset after data enhancement. The accuracy of Next-Vit is 92.1%, while our CAST-Net achieved 98.0%, which is 5.9 percentage points higher than that of Next-Vit. The self-distillation method further improved the accuracy to 99.0%. The results show the effectiveness of our model with the applied approach. ResNet34, MobileNet, ViT-base, ConvNeXt, SwinViT-Tiny and MobileViT achieved 96.0%, 96.8%, 77.9%, 84.5%, 82.2% and 93.5% accuracy on the updated PlantVillage dataset, respectively, as shown in Table 4.

Table 4
Comparison on Tomato10 dataset and PlantVillage dataset.
Dataset        Network        Accuracy   Loss    Recall   Precision   F1-score   Params(M)   Flops(G)
Tomato10       Next-Vit       0.935      0.209   0.921    0.915       0.914      30.78       5.79
Tomato10       CAST           0.975      0.105   0.967    0.964       0.965      11.69       1.84
Tomato10       CAST+SD        0.984      0.076   0.980    0.978       0.979      11.69       1.84
Tomato10       ResNet34       0.953      0.148   0.925    0.944       0.933      21.36       3.68
Tomato10       MobileNet      0.924      0.257   0.894    0.917       0.901      3.22        0.77
Tomato10       ViT-base       0.862      0.434   0.833    0.844       0.836      85.65       16.86
Tomato10       ConvNeXt       0.808      0.574   0.760    0.779       0.765      27.81       4.45
Tomato10       SwinViT-Tiny   0.836      0.488   0.795    0.805       0.796      27.53       4.37
Tomato10       MobileViT      0.888      0.391   0.870    0.886       0.864      4.94        1.89
PlantVillage   Next-Vit       0.921      0.318   0.893    0.914       0.896      30.78       5.79
PlantVillage   CAST           0.980      0.102   0.966    0.976       0.970      11.69       1.84
PlantVillage   CAST+SD        0.990      0.075   0.986    0.985       0.984      11.69       1.84
PlantVillage   ResNet34       0.960      0.161   0.942    0.954       0.946      21.36       3.68
PlantVillage   MobileNet      0.968      0.154   0.949    0.964       0.954      3.22        0.77
PlantVillage   ViT-base       0.779      0.731   0.701    0.774       0.714      85.65       16.86
PlantVillage   ConvNeXt       0.845      0.500   0.794    0.830       0.800      27.81       4.45
PlantVillage   SwinViT-Tiny   0.822      0.567   0.759    0.805       0.763      27.53       4.37
PlantVillage   MobileViT      0.935      0.278   0.920    0.927       0.917      4.94        1.89

Our model uses depthwise convolution and pointwise convolution within the SCB module to extract more feature maps from plant patch feature images than the grouped convolution of Next-Vit. In addition, the channel association operation improves the ability of our model to detect the disease category associated with each plant disease image.
As shown in Table 5, we examined the detection performance of our model for each of the ten tomato categories in the test set of the Tomato10 dataset. The results show that our model achieves high accuracy in each category. We attribute this to the DCB expansion convolution of the DTB module working together with the attention mechanism, which enables the extraction of global information from plant disease images through a larger receptive field. The detection accuracy for bacterial spot, late blight, leaf mold, mosaic virus, and yellow leaf is 98.8%, 98.2%, 98.1%, 98.2%, and 99.1%, respectively. In addition, the detection accuracy for healthy tomato leaves is 100%.
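The one-vs-rest convention behind Eqs. (7)-(10), in which the category under evaluation is treated as positive and all others as negative, can be sketched as follows. The function name and the small confusion matrix are hypothetical and serve only to show the bookkeeping; they are not data from the paper.

```python
def one_vs_rest_metrics(confusion, cls):
    """Precision, recall, accuracy and F1 for one class.

    confusion[t][p] counts samples whose true class is t and whose predicted
    class is p. Class `cls` is positive; every other class is negative.
    """
    n = len(confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp                        # positives predicted as another class
    fp = sum(confusion[t][cls] for t in range(n)) - tp   # negatives predicted as cls
    tn = sum(sum(row) for row in confusion) - tp - fn - fp

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


# Toy 3-class confusion matrix (made-up counts, rows are true classes).
TOY_CONFUSION = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 1, 9],
]

print(one_vs_rest_metrics(TOY_CONFUSION, 0))
```

Averaging the per-class values over all categories is one common way to obtain a single figure per metric; whether the reported results use such an average is not stated in the text.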

Table 5
CAST-Net's recognition of single diseases.
Type of disease       Total predicted   Correctly predicted   Wrongly predicted   Prediction accuracy
Bacterial_spot        345               341                   4                   0.988
Early_blight          147               142                   5                   0.965
Late_blight           290               285                   5                   0.982
Leaf_mold             160               157                   3                   0.981
Septoria_leaf_spot    279               272                   7                   0.974
Spider_mites          265               258                   7                   0.973
Target_spot           233               220                   13                  0.944
Yellow_leaf           855               848                   7                   0.991
Mosaic_virus          58                57                    1                   0.982
Healthy               257               257                   0                   1.000

4.2. Different learning rate functions

In Fig. 8, the static learning rate is shown in purple, the multi-step dynamic learning rate in green, our dynamic learning rate in red, and the single-step learning rate in blue. The other learning rate functions cause significant fluctuations in the loss curve, indicating insufficient model convergence. Our learning rate function adjusts the rate dynamically so that the model efficiently learns the edge features of plant patches in the early phase of training. As the model gains knowledge, the learning rate gradually decreases, allowing the model to learn advanced semantic information. Reducing the learning rate gradually prevents the model from skipping over the optimal value, which would cause erratic oscillations of the loss curve, and helps the model reach its optimal performance. Our dynamic learning rate function therefore allows the model to reduce its loss steadily and stabilizes the curve.

Fig. 8. Comparison of changes in different learning rate functions. Loss function plots for different learning rate functions.

4.3. Ablation experiment

The paper presents a lightweight model built on the Next-Vit model that merges CNN with attention. A ShallowBlock is devised to capture abnormal global features of plant disease images through multi-scale fusion in the bottom layer of the model. The SCB module, an improved version of the NCB module, uses depthwise separable convolution instead of the original grouped convolution. This change yields a larger number of feature maps while keeping the same number of parameters and operations. In addition, a new module, DCB, is introduced that uses expansion convolution to enlarge the receptive field without increasing the complexity of the model. Finally, we propose a dynamic learning rate function that keeps the model from overshooting the optimal value in the final training phase and accelerates convergence.

4.3.1. ShallowBlock module performance analysis
We use heatmaps to compare how well the ShallowBlock module and the shallow feature extraction module in Next-Vit extract image features of plant diseases. Fig. 9 shows the extraction results for Apple_Black_rot and Corn_Cercospora_leaf_spot in PlantVillage. We also compared Apple Cedar apple rust, Pepper bell Bacterial spot, and Tomato Bacterial spot to evaluate the effectiveness of our ShallowBlock and Next-Vit's shallow feature extraction module in extracting leaf spot feature information. Our ShallowBlock extracts features from input plant disease images at multiple scales, and the resulting feature fusion makes the extracted features richer and reduces the loss of useful information. In these heatmaps, our ShallowBlock outperforms the shallow feature extraction module of Next-Vit (consisting of four 3 × 3 convolution layers) in the shallow layer of the plant disease network.

4.3.2. Comparative experiments in which a module is excluded separately
We removed each of our new or improved components in turn before conducting our experiments, as shown in Table 6. Our final model is denoted by CAST, while our final model combined with the self-distillation strategy is denoted by CAST+SD. Static_lr denotes training with a static learning rate, which lowers accuracy by 1.1 percentage points compared to our dynamic learning rate function (0.964 versus 0.975 in Table 6). No_DCB denotes the model trained and tested without the DCB module, which lowers accuracy by 0.6 percentage points (0.969 versus 0.975). No_SCB denotes the model without the improved SCB module: its classification accuracy drops, and its parameters and computational complexity increase by 1.5 M and 0.56 G, respectively. Our final model surpasses these three ablated models in accuracy, recall, precision, and F1-score. CAST-Net attains a test accuracy of 97.5%, which rises by 0.9 percentage points to 98.4% when combined with the self-distillation training method. Although the overall recognition accuracy decreased after excluding any one of our improvements, the ablated models still outperformed Next-Vit by 2.9, 3.4, and 2.9 percentage points, respectively, with the other two improvements in place. This indicates that each of our proposed modules improves on the original Next-Vit model for disease recognition and classification, and that each contributes to the final result.

5. Discussion

In this study, we applied four image enhancement techniques to laboratory images of plant diseases to simulate real shooting conditions before training the model. Our results indicate that pre-processing all images to simulate authentic shooting conditions does not result in effective training. Therefore, we applied the enhancements to a random proportion of the data, which improves the generalization ability of the model. We present a shallow feature extraction module that uses multi-scale fusion with adaptive weight coefficients to obtain richer feature information on plant leaf diseases in the initial stage of the network. The NCB and NTB components of the Next-Vit model have been improved. To simplify the model and enhance connectivity between channel features, we replaced the original grouped convolution with depthwise and pointwise convolution and added the channel correlation operation. To better extract global information, we combined expansion convolution with the self-attention mechanism. The resulting CAST-Net model shows clearly improved test accuracy.
We have investigated knowledge distillation as a way to enhance the classification accuracy of a lightweight model. We found that the traditional teacher-student distillation

Fig. 9. The effect of extracting plant spots from shallow network.

Table 6
Comparison on Tomato10 dataset and PlantVillage dataset.
Network      Accuracy   Loss    Recall   Precision   F1-score   Params(M)   Flops(G)
CAST+SD      0.984      0.076   0.980    0.978       0.979      11.69       1.84
CAST         0.975      0.105   0.967    0.964       0.965      11.69       1.84
Static_lr    0.964      0.122   0.914    0.942       0.922      11.69       1.84
No_DCB       0.969      0.103   0.961    0.962       0.960      10.84       1.74
No_SCB       0.964      0.124   0.956    0.959       0.957      13.19       2.40
Next-ViT     0.935      0.209   0.921    0.915       0.914      30.78       5.79

technique, when applied to the task of classifying agricultural plant diseases, demands a significant amount of time to identify and prepare a robust and complex teacher model whose knowledge is then transferred to our lighter model, and it may not be effective. Therefore, we implemented a self-distillation technique in which the previous iteration guides the next iteration during the training of our own network, thereby improving test accuracy compared with training the CAST-Net model on its own.
Finally, we discovered that using a fixed learning rate during training hindered the model's ability to converge to the optimal value. Therefore, we introduced a new dynamic learning rate function that adjusts the learning rate during training. We also evaluated the impact of various components of the dynamic learning rate function during training. According to our experimental results, the model converged more efficiently with our learning rate function.
Our research aims to improve the classification and recognition of plant leaf diseases and pests by employing a lighter model. The training dataset includes various plant species and disease categories. However, every photo shows only a single disease feature, lacking the intricacies and uncertainties of real-life conditions. Nevertheless, our model was not trained exclusively on images captured under laboratory conditions. We used a variety of stochastic simulations to mimic factors encountered in actual shooting scenarios, with promising outcomes. Our next goal is to incorporate our model into mobile platforms for rapid detection and categorization of plant diseases in natural environments.
Compared to previous research, the dataset used in our study during the training phase covers only 13 plant categories and 26 disease categories. As this data is limited in comparison to larger databases, additional plant species and disease categories need to be incorporated into training to improve the model's generalization ability and robustness. However, our model and methodology offer additional benefits, including fewer parameters, reduced computational complexity, and improved test accuracy for detecting diseases in single and multiple plant classifications compared to existing research.

6. Conclusion

In this paper, we propose a lightweight neural network called CAST-Net, which combines Convolution and Self-Attention. Using this model together with the self-distillation method, we achieve single plant disease classification and multi-plant disease classification. The initial feature extraction module in the model extracts additional disease feature details using an adaptive multi-scale feature fusion technique. The SCB module in CAST-Net uses a combination of depthwise and pointwise convolution, as well as channel association, to reduce the number of model parameters and link channel features. The DTB module recognizes leaf disease images with a large receptive field for global feature extraction by combining the expansion convolution module, DCB, with the self-attention mechanism. We implemented a novel learning rate function during the training phase to enhance the model's ability to converge towards the optimum. We also employed a self-distillation method to further refine the accuracy of our model for plant disease recognition. To validate our model and methodology, we carried out experiments on the 10-class tomato dataset and the PlantVillage dataset, both of which were subjected to simulated real-world conditions. We then compared our results with those obtained from established models such as Next-Vit, ResNet34, MobileNet, ConvNeXt, ViT-base, SwinViT-Tiny and MobileViT. The results demonstrated that our model and methodology achieved better accuracy, with fewer model parameters and lower computational complexity than most of these models. Our model has 11.69 million parameters and 1.84 billion FLOPs, and yields an accuracy of 98.4% on the 10-class Tomato dataset and 99.0% on the enhanced PlantVillage dataset.

Funding

This work was supported by National Key Research and Development Program of China (2019YFE0126100); the Key Research and Development Program in Zhejiang Province of China (2019C54005); National Natural Science Foundation of China (61605173) and (61403346); Natural Science Foundation of Zhejiang Province (LY16C130003).

Acknowledgments

Appreciation is given to the editors and reviewers of the Journal.

CRediT authorship contribution statement

Yun Zhao: Writing – review & editing, Formal analysis, Data curation, Conceptualization. Yang Li: Writing – review & editing, Writing – original draft, Methodology, Data curation, Conceptualization. Na Wu: Supervision, Investigation. Xing Xu: Project administration, Investigation, Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

11
Y. Zhao et al. Crop Protection 180 (2024) 106637

Chen, Y., Dai, X., Chen, D., Liu, M., Dong, X., Yuan, L., Liu, Z., 2022. Mobile-former: Mehta, S., Rastegari, M., 2021. Mobilevit: Light-Weight, General-Purpose, and Mobile-
bridging mobilenet and transformer. In: Proceedings of the IEEE/CVF Conference on Friendly Vision Transformer arXiv preprint arXiv:2110.02178.
Computer Vision and Pattern Recognition, pp. 5270–5279. Pandey, A., Jain, K., 2022. A robust deep attention dense convolutional neural network
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., for plant leaf disease identification and classification from smart phone captured real
Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al., 2020. An Image Is Worth world images. Ecol. Inf. 70, 101725.
16x16 Words: Transformers for Image Recognition at Scale arXiv preprint arXiv: Pandian, J.A., Kanchanadevi, K., Kumar, V.D., Jasińska, E., Goňo, R., Leonowicz, Z.,
2010.11929. Jasiński, M., 2022. A five convolutional layer deep convolutional neural network for
Ghofrani, A., Mahdian Toroghi, R., 2022. Knowledge distillation in plant disease plant leaf disease detection. Electronics 11, 1266.
recognition. Neural Comput. Appl. 34, 14287–14296. Qian, X., Zhang, C., Chen, L., Li, K., 2022. Deep learning-based identification of maize
Guo, J., Han, K., Wu, H., Tang, Y., Chen, X., Wang, Y., Xu, C., 2022a. Cmt: convolutional leaf diseases is improved by an attention mechanism: self-attention. Front. Plant Sci.
neural networks meet vision transformers. In: Proceedings of the IEEE/CVF 13, 864486.
Conference on Computer Vision and Pattern Recognition, pp. 12175–12185. Saberi Anari, M., et al., 2022. A hybrid model for leaf diseases classification based on the
Guo, Y., Lan, Y., Chen, X., 2022b. Cst: convolutional swin transformer for detecting the modified deep transfer learning and ensemble approach for agricultural aiot-based
degree and types of plant diseases. Comput. Electron. Agric. 202, 107407. monitoring. Comput. Intell. Neurosci. 2022.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C., 2018. Mobilenetv2:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on
pp. 770–778. Computer Vision and Pattern Recognition, pp. 4510–4520.
Hinton, G., Vinyals, O., Dean, J., 2015. Distilling the Knowledge in a Neural Network Sathya, K., Rajalakshmi, M., 2022. Rda-cnn: enhanced super resolution method for rice
arXiv preprint arXiv:1503.02531. plant disease classification. Comput. Syst. Sci. Eng. 42.
Howard, A., Sandler, M., Chu, G., Chen, L.C., Chen, B., Tan, M., Wang, W., Zhu, Y., Shen, Y., Xu, L., Yang, Y., Li, Y., Guo, Y., 2022. Self-distillation from the last mini-batch
Pang, R., Vasudevan, V., et al., 2019. Searching for mobilenetv3. In: Proceedings of for consistency regularization. In: Proceedings of the IEEE/CVF Conference on
the IEEE/CVF International Conference on Computer Vision, pp. 1314–1324. Computer Vision and Pattern Recognition, pp. 11943–11952.
Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Simonyan, K., Zisserman, A., 2014. Very Deep Convolutional Networks for Large-Scale
Andreetto, M., Adam, H., 2017. Mobilenets: Efficient Convolutional Neural Networks Image Recognition arXiv preprint arXiv:1409.1556.
for Mobile Vision Applications arXiv preprint arXiv:1704.04861. Singh, R.K., Tiwari, A., Gupta, R.K., 2022. Deep transfer modeling for classification of
Hu, J., Shen, L., Sun, G., 2018. Squeeze-and-excitation networks. In: Proceedings of the maize plant leaf disease. Multimed. Tool. Appl. 81, 6051–6067.
IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141. Srinivas, A., Lin, T.Y., Parmar, N., Shlens, J., Abbeel, P., Vaswani, A., 2021. Bottleneck
Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q., 2017. Densely connected transformers for visual recognition. In: Proceedings of the IEEE/CVF Conference on
convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision Computer Vision and Pattern Recognition, pp. 16519–16529.
and Pattern Recognition, pp. 4700–4708. Suarez Baron, M.J., Gomez, A.L., Diaz, J.E.E., 2022. Supervised learning-based image
Hughes, D., Salathé, M., et al., 2015. An Open Access Repository of Images on Plant classification for the detection of late blight in potato crops. Appl. Sci. 12, 9371.
Health to Enable the Development of Mobile Disease Diagnostics arXiv preprint Tabbakh, A., Barpanda, S.S., 2022. Evaluation of machine learning models for plant
arXiv:1511.08060. disease classification using modified glcm and wavelet based statistical features.
Li, J., Xia, X., Li, W., Li, H., Wang, X., Xiao, X., Wang, R., Zheng, M., Pan, X., 2022. Next- Trait. Du. Signal 39, 1893.
vit: Next Generation Vision Transformer for Efficient Deployment in Realistic Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L., Zhang, L., 2021. Cvt: introducing
Industrial Scenarios arXiv preprint arXiv:2207.05501. convolutions to vision transformers. In: Proceedings of the IEEE/CVF International
Li, X., Li, S., Omar, B., Wu, F., Li, X., 2021. Reskd: residual-guided knowledge distillation. Conference on Computer Vision, pp. 22–31.
IEEE Trans. Image Process. 30, 4735–4746. Yang, J., Yang, Y., Li, Y., Xiao, S., Ercisli, S., 2022. Image information contribution
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B., 2021. Swin evaluation for plant diseases classification via inter-class similarity. Sustainability
transformer: hierarchical vision transformer using shifted windows. In: Proceedings 14, 10938.
of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022. Zhang, L., Song, J., Gao, A., Chen, J., Bao, C., Ma, K., 2019. Be your own teacher:
Lu, X., Yang, R., Zhou, J., Jiao, J., Liu, F., Liu, Y., Su, B., Gu, P., 2022. A hybrid model of improve the performance of convolutional neural networks via self distillation. In:
ghost-convolution enlightened transformer for effective diagnosis of grape leaf Proceedings of the IEEE/CVF International Conference on Computer Vision,
disease and pest. Journal of King Saud University-Computer and Information pp. 3713–3722.
Sciences 34, 1755–1767. Zhao, Y., Sun, C., Xu, X., Chen, J., 2022. Ric-net: a plant disease classification model
Madhavan, M.V., Thanh, D.N.H., Khamparia, A., Pande, S., Malik, R., Gupta, D., 2021. based on the fusion of inception and residual structure and embedded attention
Recognition and classification of pomegranate leaves diseases by image processing mechanism. Comput. Electron. Agric. 193, 106644.
and machine learning techniques. Comput. Mater. Continua (CMC) 66, 2939–2955.


Common questions


The lightweight classification network introduces several changes aimed at improving plant disease classification accuracy. First, a shallow feature extraction module fuses multi-scale features across four branches, which improves the extraction of global lesion information. Second, the group convolution in the NCB module is replaced with depthwise convolution, pointwise convolution, and a channel information correlation operation; the resulting module, named SCB, extracts local features with fewer parameters. Third, a new DCB module uses dilated convolution to enlarge the receptive field and is fused with a multi-head self-attention mechanism to form the DTB module, which improves global feature extraction. In addition, a dynamic learning rate function is proposed to keep the model from getting stuck in local optima and to improve convergence. Finally, self-distillation during training lets the model learn from its own earlier iterations, which improves classification precision.

The key changes to the NCB and NTB components of Next-Vit are as follows. Group convolutions are replaced with depthwise and pointwise convolutions plus a channel information correlation operation, forming the SCB module; this simplifies the model and makes feature extraction more efficient. The new DCB module combines dilated convolution with a self-attention mechanism, which strengthens the model's ability to extract global feature information and improves plant disease recognition and classification.

The research addresses the challenge of classifying plant diseases accurately with a less complex model by developing a lightweight classification network based on Next-Vit. Its SCB and DCB modules optimize feature extraction while using fewer parameters than traditional methods. A dynamic learning rate function improves training convergence, and self-distillation refines classification precision without adding model complexity. Together, these strategies lower computational cost and raise classification accuracy, balancing model complexity against performance.

Self-distillation improves the precision of plant disease classification models by using the outputs of earlier training iterations to guide later ones. This refines the learning process and helps the model recognize subtle patterns and features in plant disease images. By reusing knowledge from prior iterations, a self-distilled model reaches higher classification accuracy than the same model trained without distillation.
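As a rough illustration of this idea, the sketch below combines an ordinary cross-entropy term with a KL-divergence term that pulls the current predictions toward softened predictions saved from an earlier iteration. It follows the generic knowledge-distillation recipe. The temperature `T`, the weight `alpha`, and all function names are illustrative assumptions, and the paper's exact loss may differ.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax, computed in a numerically stable way."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def self_distillation_loss(curr_logits, prev_logits, label, T=3.0, alpha=0.5):
    """Blend the hard-label loss with a soft-target loss.

    prev_logits are the model's own outputs for the same sample from an
    earlier iteration (assumed to have been stored); they act as the teacher.
    """
    hard = cross_entropy(curr_logits, label)
    teacher = softmax(prev_logits, T)
    student = softmax(curr_logits, T)
    # The T*T factor is the usual gradient-scale correction for soft targets.
    soft = (T * T) * kl_divergence(teacher, student)
    return (1 - alpha) * hard + alpha * soft

# Hypothetical logits for one image of a 3-class problem.
loss = self_distillation_loss([2.0, 1.0, 0.1], [1.5, 1.2, 0.3], label=0)
```

When the current and earlier logits agree, the soft term vanishes and only the weighted cross-entropy remains, so the extra term penalizes only predictions that drift from the earlier ones.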

A static learning rate can hinder optimization in plant disease classification because the model may converge to a suboptimal local minimum instead of the optimal solution. A fixed rate cannot adapt to training dynamics: if it is too low, convergence is slow; if it is too high, training can oscillate or diverge. A dynamic learning rate, by contrast, adjusts during training and gives a better balance between exploration and convergence.

Using a dynamic learning rate function when training neural networks for plant disease detection has several advantages. It helps the model escape local optima and speeds up convergence. Because the learning rate changes throughout training, the model can explore better solutions and learn more efficiently. As a result, it converges better and can reach higher classification accuracy than with a static learning rate.
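The text here does not give the form of the paper's learning rate function. As a stand-in, the sketch below shows one widely used dynamic schedule, cosine annealing with periodic restarts: the rate decays smoothly within each cycle and then jumps back up, and the jump is what can help training leave a poor local optimum. The function name and the values of `base_lr`, `min_lr`, and `cycle_len` are assumptions chosen for illustration.

```python
import math

def cosine_restart_lr(step, base_lr=0.01, min_lr=1e-5, cycle_len=50):
    """Learning rate at `step` under cosine annealing with periodic restarts."""
    pos = step % cycle_len  # position inside the current cycle
    cosine = 0.5 * (1.0 + math.cos(math.pi * pos / cycle_len))  # falls from 1 toward 0
    return min_lr + (base_lr - min_lr) * cosine

# Learning rates for three cycles; each cycle starts again at base_lr.
schedule = [cosine_restart_lr(s) for s in range(150)]
```

In a training loop, the optimizer's learning rate would be set to `cosine_restart_lr(step)` before each update.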

The SCB module improves feature extraction by using depthwise separable convolution in place of the original grouped convolution. With this change, the SCB module produces more feature maps at the same parameter count and computational complexity. The model therefore captures more detailed feature information, which improves classification accuracy on plant disease images.
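To make the parameter arithmetic behind depthwise separable convolution concrete, the sketch below counts the weights of a standard convolution and of a depthwise-plus-pointwise pair. The channel counts and kernel size are hypothetical and are not taken from the SCB module; biases are ignored.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Number of weights in a k x k convolution (bias ignored)."""
    assert c_in % groups == 0 and c_out % groups == 0
    return k * k * (c_in // groups) * c_out

def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise one."""
    depthwise = conv_params(c_in, c_in, k, groups=c_in)  # one k x k filter per channel
    pointwise = conv_params(c_in, c_out, 1)              # mixes information across channels
    return depthwise + pointwise

c_in, c_out, k = 64, 64, 3  # hypothetical layer sizes
standard = conv_params(c_in, c_out, k)                  # 3*3*64*64 = 36864
separable = depthwise_separable_params(c_in, c_out, k)  # 576 + 4096 = 4672
```

The same `conv_params` helper also covers grouped convolution through its `groups` argument, so the designs can be compared at any group count.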

Dilated convolution in the DCB module enlarges the convolutional receptive field, so the network captures more spatial information without increasing model complexity. The model can therefore extract important global features from plant disease images more effectively, which yields better classification performance than convolutions with a narrower receptive field.
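The effect of dilation on the receptive field can be shown in one dimension. In the sketch below, a kernel with `k` taps spaced `dilation` apart spans `k + (k - 1) * (dilation - 1)` input positions while still using only `k` weights. The function names and the toy signal are illustrative assumptions, not part of the DCB module.

```python
def effective_kernel_size(k, dilation):
    """Input span covered by a k-tap kernel with the given dilation."""
    return k + (k - 1) * (dilation - 1)

def dilated_conv1d(signal, kernel, dilation=1):
    """Valid 1-D cross-correlation with kernel taps spaced `dilation` apart."""
    span = effective_kernel_size(len(kernel), dilation)
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(w * signal[start + i * dilation]
                       for i, w in enumerate(kernel)))
    return out

signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 1, 1]  # three weights in both cases

dense = dilated_conv1d(signal, kernel, dilation=1)    # each output sees 3 inputs
dilated = dilated_conv1d(signal, kernel, dilation=2)  # each output spans 5 inputs
```

With `dilation=2`, each output sums positions two apart (for example 1 + 3 + 5), so the same three weights cover a five-element window.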

The ShallowBlock module outperforms Next-Vit's shallow feature extraction module by extracting features from the input plant disease images at multiple scales and fusing them. This enriches the extracted features and reduces the loss of useful information. In comparative experiments, ShallowBlock extracted plant disease features more comprehensively than the four 3 × 3 convolution layers used by Next-Vit's module.

The CAST model and its enhanced version, CAST+SD, differ primarily in the use of self-distillation. CAST is the base model; CAST+SD adds self-distillation to the training process, which raises test accuracy from 97.5% to 98.4%. Self-distillation lets the model draw on its earlier training phases, improving classification precision over CAST alone.
