SCSegamba: Lightweight Structure-Aware Vision Mamba for Crack Segmentation in Structures
Figure 2. Overview of our proposed method. (a) illustrates the overall architecture of SCSegamba and the processing flow for crack
images. (b) displays the structure of the SAVSS block. The input crack image undergoes comprehensive morphological and texture feature
extraction through SAVSS, while MFS produces a high-quality pixel-level segmentation map.
combines Swin Transformer and CNN for automatic feature extraction, while TBUNet [60], a Transformer-based knowledge distillation model, achieves high-performance crack segmentation with a hybrid loss function. Although Transformer-based methods are highly effective at capturing crack texture cues and suppressing background noise, their self-attention mechanism introduces computational complexity that grows quadratically with sequence length. This results in a high parameter count and significant computational demands, which limit their deployment on resource-constrained edge devices.

2.2. Selective State Space Model

The introduction of the Selective State Space Model (S6) in the Mamba model [14] has highlighted the potential of SSMs [12, 13]. Unlike the linear time-invariant S4 model, S6 efficiently captures complex long-distance dependencies while preserving computational efficiency, achieving strong performance in NLP, audio, and genomics. Consequently, researchers have adapted Mamba to the visual domain, creating various VSS blocks. ViM [62] achieves comparable modeling to ViT [10] without attention mechanisms, using fewer computational resources, while VMamba [36] prioritizes efficient computation and high performance. PlainMamba [55] employs a fixed-width layer stacking approach, excelling in tasks such as instance segmentation and object detection. However, the VSS block and scanning strategy require specific optimizations for each visual task, as tasks differ in their reliance on long- and short-distance information, necessitating customized VSS block designs to ensure optimal performance.

Currently, no high-performing Mamba-based model exists for crack segmentation. Thus, designing an optimized VSS structure specifically for crack segmentation is essential to improve performance and efficiency. Given the intricate details and irregular textures of cracks, the VSS block requires strong shape extraction and directional awareness to effectively capture crack texture cues. Additionally, it should facilitate efficient crack segmentation while minimizing computational resource requirements.

Figure 3. Architecture of GBC. It employs bottleneck convolution to efficiently reduce the parameters and computational load, while the gating mechanism enhances the model's adaptability in processing diverse crack patterns and complex backgrounds. GN represents group normalization.

3. Methodology

3.1. Preliminary

The complete architecture of our proposed SCSegamba is depicted in Figure 2. It includes two main components: the SAVSS for extracting crack shape and texture cues, and the MFS for efficient feature processing. To capture key crack region cues, we integrate the GBC at the initial stage of SAVSS and the final stage of MFS.

For a single RGB image E \in \mathbb{R}^{3 \times H \times W}, spatial information is divided into n patches, forming a sequence \{B_1, B_2, \dots, B_n\}. This sequence is processed through the SAVSS block, embedding key crack pixel cues into multi-scale feature maps \{F_1, F_2, F_3, F_4\}. Finally, in the MFS, all information is consolidated into a single tensor, producing a refined segmentation output W \in \mathbb{R}^{1 \times H \times W}.

3.2. Lightweight Gated Bottleneck Convolution

The gating mechanism enables dynamic features for each spatial position and channel, enhancing the model's ability to capture details [8, 57]. To further reduce parameter count and computational cost, we embedded a bottleneck convolution (BottConv) with low-rank approximation [28], mapping matrices from high- to low-dimensional spaces and significantly lowering computational complexity.

In the convolution layer, assuming the spatial size of the filter is p, the number of input channels is d, and the input is s, the convolution response can be represented as:

z = Qs + c  (1)

where Q is a matrix of size f \times (p^2 \times d), f is the number of output channels, and c is the original bias term. Assuming z lies in a low-rank subspace of rank f_0, it can be represented as z = V(z - z_1) + z_1, where z_1 abstracts the mean vector of features, acting as an auxiliary variable to facilitate theoretical derivation and correct feature offsets, and V = LM^T (L \in \mathbb{R}^{f \times f_0}, M \in \mathbb{R}^{(p^2 d) \times f_0}) represents the low-rank projection matrix. The simplified response then becomes:

z = LM^T s + c'  (2)

Since f_0 < f, the computational complexity reduces from O(fp^2d) to O(f_0 p^2 d) + O(f f_0), where O(f f_0) \ll O(f p^2 d), indicating that the reduction in computational complexity is proportional to the ratio f_0 / f.

In BottConv, pointwise convolutions project features into and out of the low-rank subspace, significantly reducing complexity, while the depthwise convolution that performs adequate spatial feature extraction within the subspace adds only negligible complexity. As shown in Figure 5, BottConv in our GBC design significantly reduces parameter count and computational load compared to the original convolution, with minimal performance impact.

As shown in Figure 3, the input feature x \in \mathbb{R}^{C \times H \times W} is retained as x_{residual} = x to facilitate the residual connection. Subsequently, the feature x is passed through the BottConv layer, followed by normalization and activation functions, resulting in the features x_1 and g_2(x) as shown below:

g_1(x) = ReLU(Norm_1(f_1(x)))  (3)

x_1 = ReLU(Norm_2(BottConv_2(g_1(x))))  (4)

g_2(x) = ReLU(Norm_3(BottConv_3(x)))  (5)

To generate the gating feature map, x_1 and g_2(x) are combined through the Hadamard product:

m(x) = x_1 \odot g_2(x)  (6)

The gating feature map m(x) is subsequently processed through BottConv once again to further refine fine-grained details. After the residual connection is applied, the resulting output is:

y = ReLU(Norm_4(BottConv_4(m(x))))  (7)

Output = y + x_{residual}  (8)

The design of BottConv and the deeper gated branch enables the model to preserve basic crack features while dynamically refining the fine-grained feature characterization of the main branch, resulting in more accurate segmentation maps in detailed regions.

3.3. Structure-Aware Visual State Space Module

Our designed SAVSS features a two-dimensional selective scan (SS2D) tailored for visual tasks. Different scanning strategies impact the model's ability to capture continuous crack textures. As shown in Figure 4, current vision Mamba networks use various scanning directions, including parallel, snake, bidirectional, and diagonal scans [36, 55, 62]. Parallel and diagonal scans lack continuity across rows or diagonals, which limits their sensitivity to crack directions. Although bidirectional and snake scans maintain semantic continuity along horizontal or vertical paths, they struggle to capture diagonal or interwoven textures. To address this, our proposed diagonal snake scanning is designed to better capture complex crack texture cues.

SASS consists of four paths: two parallel snake paths and two diagonal snake paths. This design enables the effective extraction of continuous semantic information in regular crack regions while preserving texture continuity in multiple directions, making it suitable for multi-scenario crack images with complex backgrounds.

After the RGB crack image undergoes Patch Embedding and Position Encoding, it is input as a sequence into the SAVSS block. To maintain a lightweight network, we use only 4 layers of SAVSS blocks. The processing equations are as follows:

\overline{P} = e^{\Delta P}  (9)

\overline{Q} = (\Delta P)^{-1} (e^{\Delta P} - I) \cdot \Delta Q  (10)

z_k = \overline{P} z_{k-1} + \overline{Q} w_k  (11)

u_k = R z_k + S w_k  (12)

In these equations, the input w \in \mathbb{R}^{t \times D}, P \in \mathbb{R}^{G \times D} controls the hidden spatial state, S \in \mathbb{R}^{D \times D} is used to initialize the skip connection for the input, z_k represents the specific hidden state at time step k, and Q \in \mathbb{R}^{G \times D} and R \in \mathbb{R}^{G \times D} are matrices with hidden spatial dimension G and temporal dimension D, respectively, obtained through the selective scan SS2D. These are trainable parameters that are updated accordingly. u_k represents the output at time step k. SASS establishes multi-directional adjacency relationships, allowing the hidden state z_k to capture more intricate topological and textural details.
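To make the low-rank saving behind BottConv concrete, here is a quick multiply-accumulate count for Eqs. (1)-(2) in pure Python. The layer sizes (3x3 filter, 64 input/output channels, rank 16) are illustrative assumptions, not values from the released model:

```python
def full_conv_cost(f, p, d):
    # Dense response z = Qs: Q has f rows and p^2 * d columns -> O(f * p^2 * d)
    return f * p * p * d

def lowrank_cost(f, p, d, f0):
    # Factorized response z = L(M^T s): M^T s costs p^2 * d * f0, then L(.) costs f * f0
    return f0 * p * p * d + f * f0

full = full_conv_cost(64, 3, 64)   # 36864 multiply-accumulates
low = lowrank_cost(64, 3, 64, 16)  # 9216 + 1024 = 10240
print(full, low, low / full)       # ratio approaches f0/f as p^2*d grows
```

With p^2*d dominating, the ratio tends to f_0/f = 0.25, matching the complexity argument in the text.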
Figure 4. Illustration of our proposed SASS and other scanning strategies. The first row presents four commonly used single scanning
paths, along with our proposed diagonal snake path. The second row illustrates the execution flow of our proposed SASS scanning strategy.
o = MLP(Conv(o_1)) (15)
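The state-update recurrence of Eqs. (9)-(12) can be sketched in NumPy. This is a minimal sequential, diagonal-state version under stated assumptions (zero initial state, a scalar step size delta, elementwise discretization); real Mamba-style implementations fuse this loop into a parallel selective scan:

```python
import numpy as np

def ssm_scan(w, P, Q, R, S, delta=0.1):
    """Sequential form of Eqs. (9)-(12): z_k = Pbar z_{k-1} + Qbar w_k, u_k = R z_k + S w_k."""
    G, D = P.shape
    dP = delta * P
    Pbar = np.exp(dP)                          # Eq. (9), elementwise (diagonal state)
    Qbar = (np.expm1(dP) / dP) * (delta * Q)   # Eq. (10): (dP)^-1 (e^{dP} - I) dQ
    z = np.zeros((G, D))
    outs = []
    for w_k in w:                              # w: (t, D) token sequence
        z = Pbar * z + Qbar * w_k              # Eq. (11); w_k broadcast over G
        outs.append((R * z).sum(axis=0) + S @ w_k)  # Eq. (12): contract state dim G
    return np.stack(outs)                      # (t, D)

t, G, D = 6, 4, 8
rng = np.random.default_rng(0)
P = -np.abs(rng.standard_normal((G, D)))       # negative entries give a stable decay
u = ssm_scan(rng.standard_normal((t, D)), P,
             rng.standard_normal((G, D)),
             rng.standard_normal((G, D)),
             rng.standard_normal((D, D)))
print(u.shape)  # (6, 8)
```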
Figure 6. Visual comparison of typical cracks with 9 SOTA methods across four datasets. Red boxes highlight critical details, and green
boxes mark misidentified regions.
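The mIoU metric used throughout the evaluation (Eq. (19)) reduces to a confusion-matrix computation; a minimal pure-Python sketch for the binary case (N = 1) on flattened label arrays:

```python
def miou(pred, gt, n_classes=2):
    # p[l][t] counts pixels predicted as class l whose ground truth is class t
    p = [[0] * n_classes for _ in range(n_classes)]
    for l, t in zip(pred, gt):
        p[l][t] += 1
    total = 0.0
    for l in range(n_classes):
        inter = p[l][l]
        union = sum(p[l]) + sum(p[t][l] for t in range(n_classes)) - inter
        total += inter / union if union else 1.0  # absent class: count as perfect
    return total / n_classes                      # 1/(N+1) average over classes

print(miou([0, 0, 1, 1], [0, 1, 1, 1]))  # (1/2 + 2/3) / 2 = 0.5833...
```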
During processing, all datasets were divided into training, validation, and test sets with a 7:1:2 ratio.

4.2. Implementation Details

Experimental Settings. We built our SCSegamba network using PyTorch v1.13.1 and trained it on an Intel Xeon Platinum 8336C CPU with eight NVIDIA GeForce RTX 4090 GPUs. The AdamW optimizer was used with an initial learning rate of 5e-4, PolyLR scheduling, a weight decay of 0.01, and a random seed of 42. The network was trained for 50 epochs, and the model with the best validation performance was selected for testing.

Comparison Methods. To comprehensively evaluate our model, we compared SCSegamba with 9 SOTA methods. The CNN- or Transformer-based models included RIND [38], SFIAN [5], CTCrackSeg [44], DTrCNet [48], Crackmer [46], and SimCrack [20]. Additionally, we compared it with other Mamba-based models, including CSMamba [37], PlainMamba [55], and MambaIR [16].

Evaluation Metrics. We used six metrics to evaluate SCSegamba's performance: Precision (P), Recall (R), F1 Score (F1 = 2RP / (R + P)), Optimal Dataset Scale (ODS), Optimal Image Scale (OIS), and mean Intersection over Union (mIoU). ODS measures the model's adaptability to datasets of varying scales at a fixed threshold m, while OIS evaluates adaptability across image scales at an optimal threshold n. The calculation formulas are as follows:

ODS = \max_m \frac{2 \cdot P_m \cdot R_m}{P_m + R_m}  (17)

OIS = \frac{1}{N} \sum_{i=1}^{N} \max_n \frac{2 \cdot P_{n,i} \cdot R_{n,i}}{P_{n,i} + R_{n,i}}  (18)

mIoU is used to measure the mean proportion of the intersection over union between the ground truth and the predicted results. The calculation is given by the formula:

mIoU = \frac{1}{N+1} \sum_{l=0}^{N} \frac{p_{ll}}{\sum_{t=0}^{N} p_{lt} + \sum_{t=0}^{N} p_{tl} - p_{ll}}  (19)

where N is the number of classes, which we set as N = 1; t represents the ground truth, l represents the predicted value, and p_{tl} represents the count of pixels classified as l but belonging to t.

Additionally, we evaluated our method's complexity using three metrics: FLOPs, Params, and Model Size, representing computational complexity, parameter complexity, and memory footprint.

4.3. Comparison with SOTA Methods

As listed in Table 1, compared with 9 other SOTA methods, our proposed SCSegamba achieves the best performance across four public datasets. Specifically, on the Crack500 [56] and DeepCrack [35] datasets, which contain larger and more complex crack regions, SCSegamba achieved the highest performance. Notably, on the DeepCrack dataset, it surpassed the next best method by 1.50% in F1 score and 1.09% in mIoU. This improvement is due to the robust ability of GBC to capture morphological clues in large crack areas, enhancing the model's representational power. On the CrackMap [22] dataset, which features thinner and more elongated cracks, our method surpasses all other SOTA methods in every metric, outperforming the next best method by 2.06% in F1 and 1.65% in mIoU. This demonstrates the effectiveness of SASS in capturing fine textures and elongated crack structures.
Methods | Crack500 (ODS OIS P R F1 mIoU) | DeepCrack (ODS OIS P R F1 mIoU)
RIND [38] | 0.6469 0.6483 0.6998 0.7245 0.7119 0.7381 | 0.8087 0.8267 0.7896 0.8920 0.8377 0.8391
SFIAN [5] | 0.6977 0.7348 0.6983 0.7742 0.7343 0.7604 | 0.8616 0.8928 0.8549 0.8692 0.8620 0.8776
CTCrackSeg [44] | 0.6941 0.7059 0.6940 0.7748 0.7322 0.7591 | 0.8819 0.8904 0.9011 0.8895 0.8952 0.8925
DTrCNet [48] | 0.7012 0.7241 0.6527 0.8280 0.7357 0.7627 | 0.8473 0.8512 0.8905 0.8251 0.8566 0.8661
Crackmer [46] | 0.6933 0.7097 0.6985 0.7572 0.7267 0.7591 | 0.8712 0.8785 0.8946 0.8783 0.8864 0.8844
SimCrack [20] | 0.7127 0.7308 0.7093 0.7984 0.7516 0.7715 | 0.8570 0.8722 0.8984 0.8549 0.8761 0.8744
CSMamba [37] | 0.6931 0.7162 0.6858 0.7823 0.7315 0.7592 | 0.8738 0.8766 0.9025 0.8863 0.8943 0.8863
PlainMamba [55] | 0.7035 0.7173 0.7170 0.7557 0.7358 0.7682 | 0.8646 0.8668 0.9050 0.8659 0.8850 0.8788
MambaIR [16] | 0.7043 0.7189 0.7204 0.7681 0.7435 0.7663 | 0.8796 0.8840 0.9056 0.8895 0.8975 0.8907
SCSegamba (Ours) | 0.7244 0.7370 0.7270 0.7859 0.7553 0.7778 | 0.8938 0.8990 0.9097 0.9124 0.9110 0.9022

Methods | CrackMap (ODS OIS P R F1 mIoU) | TUT (ODS OIS P R F1 mIoU)
RIND [38] | 0.6745 0.6943 0.6023 0.7586 0.6699 0.7425 | 0.7531 0.7891 0.7872 0.7665 0.7767 0.8051
SFIAN [5] | 0.7200 0.7465 0.6715 0.7668 0.7160 0.7748 | 0.7290 0.7513 0.7715 0.7367 0.7537 0.7896
CTCrackSeg [44] | 0.7289 0.7373 0.6911 0.7669 0.7270 0.7785 | 0.7940 0.7996 0.8202 0.8195 0.8199 0.8301
DTrCNet [48] | 0.7328 0.7413 0.6912 0.7681 0.7276 0.7812 | 0.7987 0.8073 0.7972 0.8441 0.8202 0.8331
Crackmer [46] | 0.7395 0.7437 0.7229 0.7467 0.7346 0.7860 | 0.7429 0.7640 0.7501 0.7656 0.7578 0.7966
SimCrack [20] | 0.7559 0.7625 0.7380 0.7672 0.7523 0.7963 | 0.7984 0.8090 0.8051 0.8371 0.8208 0.8334
CSMamba [37] | 0.7371 0.7413 0.7053 0.7663 0.7346 0.7841 | 0.7879 0.7946 0.7947 0.8353 0.8146 0.8263
PlainMamba [55] | 0.7150 0.7189 0.6649 0.7616 0.7099 0.7699 | 0.7867 0.7967 0.7701 0.8523 0.8102 0.8253
MambaIR [16] | 0.7332 0.7347 0.7569 0.7013 0.7280 0.7834 | 0.7861 0.7930 0.7877 0.8387 0.8125 0.8249
SCSegamba (Ours) | 0.7741 0.7766 0.7629 0.7727 0.7678 0.8094 | 0.8204 0.8255 0.8241 0.8545 0.8390 0.8479

Table 1. Comparison with 9 SOTA methods across 4 datasets. Best results are in bold, and second-best results are underlined.
As illustrated in Figure 6, our method produces clearer and more precise feature maps, with superior detail capture in typical scenarios such as cement and bitumen, compared to other methods.

For the TUT dataset [33], which includes eight diverse scenarios, our method achieved the best performance, surpassing the next best method by 2.21% in F1 and 1.74% in mIoU. As shown in Figure 6, whether in the complex crack topology of plastic tracks, the noise-heavy backgrounds of metallic materials and turbine blades, or the low-contrast, dimly lit underground pipeline images, SCSegamba consistently produced high-quality segmentation maps while effectively suppressing irrelevant noise. This demonstrates that our method, with the enhanced crack morphology and texture perception from GBC and SASS, exhibits exceptional robustness and stability. Additionally, leveraging MFS for feature aggregation improves multi-scale perception, making our model particularly suited for diverse, interference-rich scenarios.

4.4. Complexity Analysis

Methods | Year | FLOPs↓ | Params↓ | Size↓
RIND [38] | 2021 | 695.77G | 59.39M | 453MB
SFIAN [5] | 2023 | 84.57G | 13.63M | 56MB
CTCrackSeg [44] | 2023 | 39.47G | 22.88M | 174MB
DTrCNet [48] | 2023 | 123.20G | 63.45M | 317MB
Crackmer [46] | 2024 | 14.94G | 5.90M | 43MB
SimCrack [20] | 2024 | 286.62G | 29.58M | 225MB
CSMamba [37] | 2024 | 145.84G | 35.95M | 233MB
PlainMamba [55] | 2024 | 73.36G | 16.72M | 96MB
MambaIR [16] | 2024 | 47.32G | 10.34M | 79MB
SCSegamba (Ours) | 2024 | 18.16G | 2.80M | 37MB

Table 2. Comparison of complexity with other methods. Best results are in bold, and second-best results are underlined.

Table 2 shows a comparison of the complexity of our method with other SOTA methods when the input image size is uniformly set to 512. With only 2.80M parameters and a model size of 37MB, our method surpasses all others, being 52.54% and 13.95% lower than the next best result, respectively. Additionally, compared to Crackmer [46], which prioritizes computational efficiency, our method's FLOPs are only 3.22G higher. This demonstrates that the combination of lightweight SAVSS and MFS enables high-quality segmentation in noisy crack scenes with minimal parameters and low computational load, which is essential for resource-constrained devices.

4.5. Ablation Studies

We performed ablation experiments on the representative multi-scenario dataset TUT [33].

Ablation study of segmentation heads. As listed in Table 3, with our designed MFS, SCSegamba achieved the best results across all six metrics, with F1 and mIoU scores 1.57% and 1.21% higher than those of the second-best method. In
Seg Head | ODS | OIS | P | R | F1 | mIoU | Params↓ | FLOPs↓ | Model Size↓
UNet [42] | 0.8055 | 0.8151 | 0.8148 | 0.8376 | 0.8260 | 0.8378 | 2.92M | 19.27G | 39MB
Ham [11] | 0.7703 | 0.7784 | 0.7962 | 0.7838 | 0.7909 | 0.8124 | 2.86M | 35.08G | 38MB
SegFormer [50] | 0.7947 | 0.7983 | 0.8170 | 0.8174 | 0.8172 | 0.8307 | 2.79M | 17.87G | 35MB
MFS | 0.8204 | 0.8255 | 0.8241 | 0.8545 | 0.8390 | 0.8479 | 2.80M | 18.16G | 37MB

Table 3. Ablation study of different segmentation heads. UNet [42], Ham [11], and SegFormer [50] are high-performance heads.
GBC PAF Res ODS OIS P R F1 mIoU Params ↓ FLOPs ↓ Model Size ↓
0.8136 0.8196 0.8213 0.8461 0.8335 0.8434 2.49M 16.75G 34MB
0.7998 0.8084 0.7918 0.8524 0.8222 0.8343 2.28M 14.91G 33MB
0.7936 0.8069 0.7952 0.8438 0.8197 0.8313 2.48M 15.65G 35MB
0.8047 0.8102 0.8174 0.8379 0.8275 0.8377 2.54M 17.08G 35MB
0.8116 0.8200 0.8156 0.8522 0.8334 0.8425 2.75M 17.82G 37MB
0.8023 0.8076 0.8219 0.8302 0.8260 0.8360 2.54M 15.99G 35MB
0.8204 0.8255 0.8241 0.8545 0.8390 0.8479 2.80M 18.16G 37MB
Table 4. Ablation study of components within the SAVSS block. Best results are in bold, and second-best results are underlined.
terms of complexity, although Params, FLOPs, and Model Size are only 0.01M, 0.29G, and 2MB larger than those of the SegFormer head, our method surpasses it in F1 and mIoU by 2.67% and 2.07%, respectively. This demonstrates that MFS enhances SAVSS output integration, significantly improving performance while keeping the model lightweight.

Ablation study of components. Table 4 shows the impact of each component in SAVSS on model performance. When fully utilizing GBC, PAF, and residual connections, our model achieved the best results across all metrics. Notably, adding GBC led to significant improvements in F1 and mIoU of 1.57% and 1.42%, respectively, highlighting its strength in capturing crack morphology cues. Similarly, residual connections boosted F1 and mIoU by 0.13% and 2.47%, indicating their role in focusing on essential crack features. Although using only PAF resulted in the lowest Params, FLOPs, and Model Size, it significantly reduced performance. These findings demonstrate that our fully integrated SAVSS effectively captures crack morphology and texture cues, achieving top pixel-level segmentation results while maintaining a lightweight model.

Scan | ODS | OIS | P | R | F1 | mIoU
Parallel | 0.8123 | 0.8184 | 0.8146 | 0.8523 | 0.8330 | 0.8427
Diag | 0.8091 | 0.8148 | 0.8225 | 0.8417 | 0.8320 | 0.8410
ParaSna | 0.8102 | 0.8162 | 0.8219 | 0.8365 | 0.8291 | 0.8408
DiagSna | 0.8153 | 0.8215 | 0.8237 | 0.8497 | 0.8365 | 0.8451
SASS | 0.8204 | 0.8255 | 0.8241 | 0.8545 | 0.8390 | 0.8479

Table 5. Ablation studies with different four-route scanning strategies in the SAVSS block, comparing parallel, diagonal, parallel snake, and diagonal snake scanning. The impact of different scanning strategies on complexity is negligible; thus, complexity analysis is omitted from this table. Best results are in bold, and second-best results are underlined.

Ablation studies of scanning strategies. As listed in Table 5, under the same conditions of using four different directional scanning paths, the model achieved the best performance with our designed SASS scanning strategy, improving F1 and mIoU by 0.30% and 0.33% over the diagonal snake strategy. This demonstrates SASS's ability to construct semantic and dependency information suited to crack topology, enhancing crack pixel perception in subsequent modules. More comprehensive experiments and real-world deployments are available in the Appendix.

5. Conclusion

In this paper, we proposed SCSegamba, a lightweight structure-aware Vision Mamba for precise pixel-level crack segmentation. SCSegamba combines SAVSS and MFS to enhance crack shape and texture perception with a low parameter count. Equipped with GBC and SASS scanning, SAVSS captures irregular crack textures across various structures. Experiments on four datasets show SCSegamba's exceptional performance, especially in complex, noisy scenarios. On the challenging multi-scenario dataset, it achieved an F1 score of 0.8390 and mIoU of 0.8479 with only 18.16G FLOPs and 2.8M parameters, demonstrating its effectiveness for real-world crack detection and suitability for edge devices. Future work will incorporate multi-modal cues to enhance segmentation quality, while further optimizing VSS design and scan strategies to achieve high-quality results with low computational resources.

6. Acknowledgement

This work was supported by the National Natural Science Foundation of China (NSFC) under Grants 62272342, 62020106004, 62306212, and T2422015; the Tianjin Natural Science Foundation under Grants 23JCJQJC00070 and 24PTLYHZ00320; and the Marie Skłodowska-Curie Actions (MSCA) under Project No. 101111188.
SCSegamba: Lightweight Structure-Aware Vision Mamba for Crack Segmentation in Structures
Supplementary Material
7. Details of SASS and Ablation Experiments

As described in Subsection 3.3, the SASS strategy enhances semantic capture in complex crack regions by scanning texture cues from multiple directions. SASS combines parallel snake and diagonal snake scans, aligning the scanning paths with the actual extension and irregular shapes of cracks, ensuring comprehensive capture of texture information.

To evaluate the necessity of using four scanning paths in SASS, we conducted ablation experiments with different path numbers across various scanning strategies on the multi-scenario dataset TUT. As listed in Table 6, all strategies performed significantly better with four paths than with two, likely because four paths allow SAVSS to capture finer crack details and topological cues. Notably, aside from SASS, the diagonal snake-like scan consistently achieved the second-best results, with two-path configurations yielding F1 and mIoU scores 0.48% and 0.45% higher than the diagonal unidirectional scan. This indicates that the diagonal snake-like scan provides more continuous semantic information, enhancing segmentation. Importantly, our proposed SASS achieved the best results with both two-path and four-path setups, demonstrating its effectiveness in capturing diverse crack topologies.

To clarify the implementation of our proposed SASS, we present its execution process in Algorithm 1.

8. Details of Objective Function and Analysis

The calculation formulas for the BCE [29] loss and Dice [43] loss are as follows:

L_{Dice} = 1 - \frac{2 \sum_{j=1}^{M} p_j \hat{p}_j + \epsilon}{\sum_{j=1}^{M} p_j + \sum_{j=1}^{M} \hat{p}_j + \epsilon}  (20)

L_{BCE} = -\frac{1}{M} \sum_{j=1}^{M} \left[ p_j \log(\hat{p}_j) + (1 - p_j) \log(1 - \hat{p}_j) \right]  (21)

where M denotes the number of samples, p_j is the ground truth label for the j-th sample, and \hat{p}_j is the predicted probability for the j-th sample.

As listed in Table 7, setting the α to β ratio at 1:5 yields the best performance, with improvements of 0.65% in F1

α:β | ODS | OIS | P | R | F1 | mIoU
BCE | 0.8099 | 0.8151 | 0.8207 | 0.8457 | 0.8330 | 0.8414
Dice | 0.8022 | 0.8072 | 0.8038 | 0.8430 | 0.8231 | 0.8358
5:1 | 0.8125 | 0.8168 | 0.8207 | 0.8432 | 0.8319 | 0.8428
4:1 | 0.8144 | 0.8184 | 0.8217 | 0.8442 | 0.8328 | 0.8437
3:1 | 0.8180 | 0.8229 | 0.8293 | 0.8436 | 0.8364 | 0.8463
2:1 | 0.8098 | 0.8152 | 0.8204 | 0.8392 | 0.8297 | 0.8408
1:1 | 0.8123 | 0.8184 | 0.8141 | 0.8507 | 0.8320 | 0.8423
1:2 | 0.8152 | 0.8214 | 0.8210 | 0.8484 | 0.8345 | 0.8443
1:3 | 0.8109 | 0.8163 | 0.8226 | 0.8396 | 0.8310 | 0.8418
1:4 | 0.8133 | 0.8185 | 0.8163 | 0.8515 | 0.8336 | 0.8433
1:5 | 0.8204 | 0.8255 | 0.8241 | 0.8545 | 0.8390 | 0.8479

Table 7. Sensitivity analysis experiments with different α and β ratios. Best results are in bold, and second-best results are underlined.
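Eqs. (20)-(21) and the combined objective L = α·L_BCE + β·L_Dice can be sketched in pure Python. The ε value and the probability clamping below are illustrative assumptions for numerical stability; the default 1:5 weighting follows the best ratio in Table 7:

```python
import math

def dice_loss(p, p_hat, eps=1.0):
    # Eq. (20): 1 - (2*sum(p*p_hat) + eps) / (sum(p) + sum(p_hat) + eps)
    num = 2.0 * sum(a * b for a, b in zip(p, p_hat)) + eps
    den = sum(p) + sum(p_hat) + eps
    return 1.0 - num / den

def bce_loss(p, p_hat, clamp=1e-7):
    # Eq. (21), averaged over the M pixels, with clamping so log() stays finite
    s = 0.0
    for a, b in zip(p, p_hat):
        b = min(max(b, clamp), 1.0 - clamp)
        s += a * math.log(b) + (1.0 - a) * math.log(1.0 - b)
    return -s / len(p)

def total_loss(p, p_hat, alpha=1.0, beta=5.0):
    return alpha * bce_loss(p, p_hat) + beta * dice_loss(p, p_hat)

gt = [1, 1, 0, 0]
pred = [0.9, 0.8, 0.1, 0.2]
print(total_loss(gt, pred))
```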
Figure 7. Visual comparison with 9 SOTA methods across four public datasets. Red boxes highlight critical details, and green boxes mark
misidentified regions.
primarily include bitumen, concrete, and brick scenarios with minimal background noise and a range of crack thicknesses, our method consistently achieves accurate segmentation, even capturing intricate fine cracks. This is attributed to GBC's strong capability in capturing crack morphology. In contrast, other methods show weaker performance in continuity and fine segmentation, resulting in discontinuities and expanded segmentation areas that do not align with actual crack images.

For the TUT [33] dataset, which includes diverse scenarios and significant background noise, our method excels at suppressing interference. For instance, in images of cracks
Layer Num | ODS | OIS | P | R | F1 | mIoU | Params↓ | FLOPs↓ | Model Size↓
2 | 0.8102 | 0.8165 | 0.8181 | 0.8420 | 0.8299 | 0.8413 | 1.56M | 12.26G | 20MB
4 | 0.8204 | 0.8255 | 0.8241 | 0.8545 | 0.8390 | 0.8479 | 2.80M | 18.16G | 37MB
8 | 0.8174 | 0.8222 | 0.8199 | 0.8579 | 0.8387 | 0.8461 | 5.23M | 29.27G | 68MB
16 | 0.8126 | 0.8187 | 0.8226 | 0.8475 | 0.8349 | 0.8430 | 10.08M | 51.51G | 127MB
32 | 0.5203 | 0.5365 | 0.5830 | 0.5680 | 0.5754 | 0.6785 | 19.79M | 95.97G | 247MB

Table 8. Experiments with different numbers of SAVSS layers. Best results are in bold, and second-best results are underlined.
Table 9. Experiments with different patch sizes. Best results are in bold, and second-best results are underlined.
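The Patch Size trade-off explored in Table 9 follows directly from the patch-embedding arithmetic. A sketch in pure Python: the 512x512 RGB input matches the paper's setting, while the embedding width of 64 is an assumption for illustration only:

```python
def patch_embed_stats(patch, img=512, in_ch=3, embed_dim=64):
    seq_len = (img // patch) ** 2         # tokens handed to the SAVSS scan
    per_patch = in_ch * patch * patch     # flattened values in one patch
    proj_weights = per_patch * embed_dim  # linear patch-projection parameters
    return seq_len, per_patch, proj_weights

for ps in (4, 8, 16, 32):
    print(ps, patch_embed_stats(ps))
# patch 4  -> 16384 tokens, 48 values per patch (long sequence, fine detail)
# patch 32 ->   256 tokens, 3072 values per patch (per-patch load rises)
```

A larger patch shortens the scanned sequence but raises the computational load per patch, consistent with the behavior reported around Figure 9.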
Table 10. Comparison experiments of different Mamba-based methods using 4 VSS layers. Best results are in bold, and second-best results
are underlined.
on generator blades and steel pipes, it effectively minimizes irrelevant noise and provides precise crack segmentation. This performance is largely attributed to SAVSS's accurate capture of crack topologies. In contrast, CNN-based methods like RIND [38] and SFIAN [5] struggle to distinguish background noise from crack regions, highlighting their limitations in contextual dependency capture. Other Transformer- and Mamba-based methods also fall short in segmentation continuity and detail handling compared to our approach.

10. Additional Analysis

To provide a thorough demonstration of the necessity of each component in our proposed SCSegamba, we conducted more extensive analysis experiments.

Comparison with different numbers of SAVSS layers. In our SCSegamba, we used 4 layers of SAVSS blocks to balance performance and computational requirements. As listed in Table 8, 4 layers achieved optimal results, with F1 and mIoU scores 0.036% and 0.21% higher than with 8 layers, while reducing parameters by 2.43M, computation by 11.11G, and model size by 31MB. Although using only 2 layers minimized resource demands, with 1.56M parameters, performance decreased. Conversely, using 32 layers increased resource use and reduced performance due to redundant features, which impacted generalization. Thus, 4 SAVSS layers strike an effective balance between performance and resource efficiency, making the model ideal for practical applications.

Comparison with different Patch Sizes. In our SAVSS, we set the Patch Size to 8 during Patch Embedding. To verify its effectiveness, we conducted experiments with various Patch Sizes. As listed in Table 9, a Patch Size of 8 yields the best performance, with F1 and mIoU scores 1.16% and 1.17% higher than a Patch Size of 4. Although a smaller Patch Size of 4 reduces parameters and model size, it limits the receptive field and hinders the effective capture of longer textures, impacting segmentation. As shown in Figure 9, as the Patch Size increases, parameter count and model size decrease, but the computational load per patch rises, affecting efficiency. At a Patch Size of 32, performance drops significantly due to reduced fine-grained detail capture and sensitivity to contextual variations. Thus, a Patch Size of 8 balances detail accuracy and generalization while maintaining model efficiency.

Comparison under the same number of VSS layers. In Subsection 4.3, we compare SCSegamba with other SOTA methods, using default VSS layer settings for Mamba-based models like MambaIR [16], CSMamba [37], and PlainMamba [55]. To examine complexity and performance under uniform VSS layer counts, we set all Mamba-based models to 4 VSS layers and conducted comparisons. As listed in Tables 2 and 10, although the computational requirements for MambaIR, CSMamba, and PlainMamba de-
Figure 8. Schematic of real-world deployment. The intelligent vehicle is placed on an outdoor road surface, and we use the server terminal
to remotely control it. The vehicle transmits the video data in real-time to the server, where it is processed to obtain the final output.
crease, their performance drops significantly. For example, CSMamba's F1 and mIoU scores drop to 0.7503 and 0.7773. While PlainMamba with 4 layers achieves reductions of 0.60M in parameters, 4.07G in FLOPs, and 19MB in model size, SCSegamba surpasses it by 4.04% in F1 and 3.39% in mIoU. Thus, with 4 SAVSS layers, SCSegamba balances performance and efficiency, capturing crack morphology and texture for high-quality segmentation.

Figure 9. Comparison of computing resources required for different Patch Sizes.

11. Real-world Deployment Applications

To validate the effectiveness of our proposed SCSegamba in real-world applications, we conducted a practical deployment and compared its real-world performance with other SOTA methods. Specifically, our experimental system consists of two main components: the intelligent vehicle and the server. The intelligent vehicle used is a Turtlebot4 Lite driven by a Raspberry Pi 4, equipped with a LiDAR and a camera. The camera model is OAK-D-Pro, fitted with an OV9282 image sensor capable of capturing high-quality crack images. The server is a laptop equipped with a Core i9-13900 CPU running Ubuntu 22.04. The intelligent vehicle and server communicate via the internet. This setup simulates resource-limited equipment to evaluate the performance of our SCSegamba in real-world deployment scenarios.

As shown in Figure 8, in the real-world deployment process, the intelligent vehicle was placed on an outdoor road surface filled with cracks. We remotely controlled the vehicle from the server terminal, directing it to move forward in a straight line at a speed of 0.15 m/s. The camera captured video at a frame rate of 30 frames per second. The vehicle transmitted the recorded video data to the server in real time via the network. To accelerate data transmission from the vehicle to the server, we set the recording resolution to 512 × 512. Upon receiving the video data, the server first segmented it into frames, then fed each frame into the pre-trained SCSegamba model, which was trained on all datasets, for inference. After segmentation, the server
Methods Inf Time↓
RIND [38] 0.0909s
SFIAN [5] 0.0286s Algorithm 1 SASS execution process
CTCrackseg [44] 0.0357s 1: Input: Patch matrix dimensions H, W
DTrCNet [48] 0.0213s 2: Output: O = (o1, o2, o3, o4), O inverse =
Crackmer [46] 0.0323s (o1 inverse, o2 inverse, o3 inverse, o4 inverse),
SimCrack [20] 0.0345s D = (d1, d2, d3, d4)
CSMamba [37] 0.0625s 3: Initialize: L = H × W
PlainMamba [55] 0.1667s 4: Initialize (i, j) ← (0, 0) for o1, (H − 1, W − 1) if H is
MambaIR [16] 0.0400s odd else (H − 1, 0) for o2
SCSegamba (Ours) 0.0313s 5: id ← down, jd ← lef t if H is odd else right
6: while j < W or i ≥ 0 do
Table 11. Comparison of inference time with other SOTA methods
7: idx ← i × W + j, append idx to o1, set
on resource-constrained server.
o1 inverse[idx]
8: if id = down and i < H − 1 then
recombined the processed frames into a video, yielding the 9: i ← i + 1, add down to d1
final output. This setup simulates real-time crack segmen- 10: else
tation in an real-world production process. 11: j ← j + 1, id ← up if i = H − 1 else down, add
Additionally, we deployed the weight files of other right to d1
SOTA methods on the server for comparison. As listed 12: end if
in Table 11, our SCSegamba achieved an inference speed 13: idx ← i × W + j, append idx to o2, set
of 0.0313 seconds per frame on the resource-constrained o2 inverse[idx]
server, outperforming most other methods. This demon- 14: if jd = right and j < W − 1 then
strates that our method has excellent real-time performance, 15: j ← j + 1, add right to d2
making it suitable for real-time segmentation of cracks in 16: else
video data. 17: i ← i − 1, jd ← lef t if j = W − 1 else right,
As shown in Figure 10, compared to other SOTA meth- add up to d2
ods, our SCSegamba better suppresses irrelevant noise in 18: end if
video data and generates continuous crack region segmen- 19: end while
tation maps. For instance, although SSM-based methods 20: d1 ← [dstart ] + d1[: −1], d2 ← [dstart ] + d2[: −1]
like PlainMamba [55], MambaIR [16], and CSMamba [37] 21: for diag ← 0 to H + W − 2 do
achieve continuous segmentation, they tend to produce false 22: direction ← right if diag is even else down
positives in some irrelevant noise spots. Additionally, while 23: for k ← 0 to min(diag + 1, H, W ) − 1 do
CNN and Transformer-based methods achieve high metrics 24: i, j ← (diag−k, k) if diag is even else (k, diag−k)
and performance on datasets with faster inference speed,
their performance on video data is suboptimal, often show- 25: if j < W then
ing discontinuous segmentation and poor background sup- 26: idx ← i × W + j
pression. For example, cracks segmented by DTrCNet [48] 27: Append idx to o3, set o3 inverse[idx], add
and CTCrackSeg [44] exhibit significant discontinuities, direction to d3
and Crackmer [46] struggles to distinguish between crack 28: end if
and background regions. Based on the above real-world 29: i, j ← (diag − k, W − k − 1) if diag is even else
deployment results, our SCSegamba produces high-quality (k, W − diag + k − 1)
segmentation results on crack video data with low param- 30: if j < W then
eters and computational resources, making it more suit- 31: idx ← i × W + j
able for deployment on resource-constrained devices and 32: Append idx to o4, set o4 inverse[idx], add
demonstrating its strong performance in practical produc- direction to d4
tion scenarios. 33: end if
34: end for
35: end for
36: d3 ← [dstart ] + d3[: −1], d4 ← [dstart ] + d4[: −1]
37: Return: O, O inverse, D
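As a rough illustration of the kind of scan paths Algorithm 1 constructs — serpentine traversals plus diagonal traversals, each paired with an inverse permutation for scattering features back to image order — the following sketch generates index orders for an H × W patch grid. The function names (`snake_cols`, `snake_rows`, `diagonal`, `inverse`) are illustrative; this is a simplified stand-in, not the authors' exact traversal, which additionally records the per-step direction lists d1–d4.

```python
def snake_cols(H, W):
    """Column-serpentine scan: down even columns, up odd columns."""
    order = []
    for j in range(W):
        rows = range(H) if j % 2 == 0 else range(H - 1, -1, -1)
        order.extend(i * W + j for i in rows)
    return order

def snake_rows(H, W):
    """Row-serpentine scan: right on even rows, left on odd rows."""
    order = []
    for i in range(H):
        cols = range(W) if i % 2 == 0 else range(W - 1, -1, -1)
        order.extend(i * W + j for j in cols)
    return order

def diagonal(H, W, anti=False):
    """Visit the H + W - 1 diagonals in turn; anti=True mirrors the
    columns so the scan starts from the top-right corner instead."""
    order = []
    for d in range(H + W - 1):
        for i in range(max(0, d - W + 1), min(d + 1, H)):
            j = d - i
            order.append(i * W + (W - 1 - j if anti else j))
    return order

def inverse(order):
    """inv[idx] = position of flat patch index idx within the scan."""
    inv = [0] * len(order)
    for pos, idx in enumerate(order):
        inv[idx] = pos
    return inv

# Reorder flattened patch tokens along a scan path, then undo the reordering.
H, W = 4, 5
tokens = list(range(H * W))               # stand-ins for patch feature vectors
path = snake_cols(H, W)
seq = [tokens[i] for i in path]           # sequence order fed to the SSM
restored = [seq[p] for p in inverse(path)]
assert restored == tokens
```

Every path is a permutation of the L = H × W flat indices, so gathering along a path and scattering with its inverse is lossless — which is what lets the four scan directions be fused back into a single spatial feature map.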
Figure 10. Visualisation comparison on video data keyframes (rows: Original, GT, SCSegamba, CSMamba, PlainMamba, MambaIR, SimCrack, Crackmer, DTrCNet, CTCrackseg, SFIAN, RIND; columns: Frame 001 through Frame 501). The interval between keyframes is 100 frames in order to ensure continuity of observation. Red boxes highlight critical details, and green boxes mark misidentified regions.
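The per-frame inference speeds reported in Table 11 can be collected with a minimal timing harness along these lines; `mean_inference_time`, the `model` callable, and the warm-up count are illustrative choices, not part of the paper's tooling:

```python
import time

def mean_inference_time(model, frames, warmup=3):
    """Average per-frame latency in seconds. The first `warmup` calls are
    executed but excluded, so one-off costs (weight loading, cache and
    kernel initialization) do not inflate the per-frame estimate."""
    for frame in frames[:warmup]:
        model(frame)
    start = time.perf_counter()
    for frame in frames[warmup:]:
        model(frame)
    elapsed = time.perf_counter() - start
    return elapsed / max(len(frames) - warmup, 1)
```

Excluding warm-up iterations matters on a laptop-class server like the one described above, since first-call overheads would otherwise dominate an average taken over a short clip; averaging over several hundred frames further smooths out scheduler jitter.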
References

[1] Zaid Al-Huda, Bo Peng, Riyadh Nazar Ali Algburi, Mugahed A Al-antari, AL-Jarazi Rabea, Omar Al-maqtari, and Donghai Zhai. Asymmetric dual-decoder-u-net for pavement crack semantic segmentation. Automation in Construction, 156:105138, 2023. 2
[2] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision, pages 801–818, 2018. 1
[3] Zhuangzhuang Chen, Jin Zhang, Zhuonan Lai, Jie Chen, Zun Liu, and Jianqiang Li. Geometry-aware guided loss for deep crack recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4703–4712, 2022. 1
[4] Zhuangzhuang Chen, Zhuonan Lai, Jie Chen, and Jianqiang Li. Mind marginal non-crack regions: Clustering-inspired representation learning for crack segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12698–12708, 2024. 1
[5] Xu Cheng, Tian He, Fan Shi, Meng Zhao, Xiufeng Liu, and Shengyong Chen. Selective feature fusion and irregular-aware network for pavement crack detection. IEEE Transactions on Intelligent Transportation Systems, 2023. 1, 2, 6, 7, 3, 5
[6] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019. 2
[7] Wooram Choi and Young-Jin Cha. Sddnet: Real-time crack segmentation. IEEE Transactions on Industrial Electronics, 67(9):8016–8025, 2019. 2
[8] Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In International conference on machine learning, pages 933–941. PMLR, 2017. 4
[9] Jiaxiu Dong, Niannian Wang, Hongyuan Fang, Wentong Guo, Bin Li, and Kejie Zhai. Mfafnet: An innovative crack intelligent segmentation method based on multi-layer feature association fusion network. Advanced Engineering Informatics, 62:102584, 2024. 1, 2
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In The International Conference on Learning Representations, 2021. 1, 3
[11] Zhengyang Geng, Meng-Hao Guo, Hongxu Chen, Xia Li, Ke Wei, and Zhouchen Lin. Is attention better than matrix decomposition? In International Conference on Learning Representations, 2021. 8
[12] Karan Goel, Albert Gu, Chris Donahue, and Christopher Ré. It's raw! audio generation with state-space models. In International Conference on Machine Learning, pages 7616–7633. PMLR, 2022. 3
[13] Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. On the parameterization and initialization of diagonal state space models. Advances in Neural Information Processing Systems, 35:35971–35983, 2022. 3
[14] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. In The International Conference on Learning Representations, 2022. 2, 3
[15] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. In The International Conference on Learning Representations, 2022. 2
[16] Hang Guo, Jinmin Li, Tao Dai, Zhihao Ouyang, Xudong Ren, and Shu-Tao Xia. Mambair: A simple baseline for image restoration with state-space model. In European Conference on Computer Vision, pages 222–241. Springer, 2024. 2, 6, 7, 3, 5
[17] Jing-Ming Guo, Herleeyandi Markoni, and Jiann-Der Lee. Barnet: Boundary aware refinement network for crack detection. IEEE Transactions on Intelligent Transportation Systems, 23(7):7343–7358, 2021. 2
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 2
[19] Yung-An Hsieh and Yichang James Tsai. Machine learning for crack detection: Review and model performance comparison. Journal of Computing in Civil Engineering, 34(5):04020038, 2020. 1
[20] Achref Jaziri, Martin Mundt, Andres Fernandez, and Visvanathan Ramesh. Designing a hybrid neural system to learn real-world crack segmentation from fractal-based simulation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 8636–8646, 2024. 6, 7, 5
[21] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020. 2
[22] Iason Katsamenis, Eftychios Protopapadakis, Nikolaos Bakalos, Andreas Varvarigos, Anastasios Doulamis, Nikolaos Doulamis, and Athanasios Voulodimos. A few-shot attention recurrent residual u-net for crack segmentation. In International Symposium on Visual Computing, pages 199–209. Springer, 2023. 5, 6, 1
[23] Narges Kheradmandi and Vida Mehranfar. A critical review and comparative study on image segmentation-based techniques for pavement crack detection. Construction and Building Materials, 321:126162, 2022. 1
[24] Hong Lang, Ye Yuan, Jiang Chen, Shuo Ding, Jian John Lu, and Yong Zhang. Augmented concrete crack segmentation: Learning complete representation to defend background interference in concrete pavements. IEEE Transactions on Instrumentation and Measurement, 2024. 1
[25] David Lattanzi and Gregory R Miller. Robust automated concrete damage detection algorithms for field applications. Journal of Computing in Civil Engineering, 28(2):253–262, 2014. 2
[26] Qin Lei, Jiang Zhong, and Chen Wang. Joint optimization of crack segmentation with an adaptive dynamic threshold module. IEEE Transactions on Intelligent Transportation Systems, 2024. 2
[27] Huaiyuan Li, Hui Li, Chuang Li, Baohai Wu, and Jinghuai Gao. Hybrid swin transformer-cnn model for pore-crack structure identification. IEEE Transactions on Geoscience and Remote Sensing, 2024. 2
[28] Jialin Li, Qiang Nie, Weifu Fu, Yuhuan Lin, Guangpin Tao, Yong Liu, and Chengjie Wang. Lors: Low-rank residual structure for parameter-efficient network stacking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15866–15876, 2024. 4
[29] Qiufu Li, Xi Jia, Jiancan Zhou, Linlin Shen, and Jinming Duan. Rediscovering bce loss for uniform classification. arXiv preprint arXiv:2403.07289, 2024. 5, 1
[30] Jianghai Liao, Yuanhao Yue, Dejin Zhang, Wei Tu, Rui Cao, Qin Zou, and Qingquan Li. Automatic tunnel crack inspection using an efficient mobile imaging module and a lightweight cnn. IEEE Transactions on Intelligent Transportation Systems, 23(9):15190–15203, 2022. 1
[31] Huajun Liu, Xiangyu Miao, Christoph Mertz, Chengzhong Xu, and Hui Kong. Crackformer: Transformer network for fine-grained crack detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3783–3792, 2021. 2
[32] Huajun Liu, Jing Yang, Xiangyu Miao, Christoph Mertz, and Hui Kong. Crackformer network for pavement crack segmentation. IEEE Transactions on Intelligent Transportation Systems, 24(9):9240–9252, 2023. 2
[33] Hui Liu, Chen Jia, Fan Shi, Xu Cheng, Mianzhao Wang, and Shengyong Chen. Staircase cascaded fusion of lightweight local pattern recognition and long-range dependencies for structural crack segmentation. arXiv preprint arXiv:2408.12815, 2024. 1, 5, 7, 2
[34] Wenze Liu, Hao Lu, Hongtao Fu, and Zhiguo Cao. Learning to upsample by learning to sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6027–6037, 2023. 5
[35] Yahui Liu, Jian Yao, Xiaohu Lu, Renping Xie, and Li Li. Deepcrack: A deep hierarchical feature learning architecture for crack segmentation. Neurocomputing, 338:139–153, 2019. 2, 5, 6, 1
[36] Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, and Yunfan Liu. Vmamba: Visual state space model. arXiv preprint arXiv:2401.10166, 2024. 2, 3, 4
[37] Liu Mushui, Jun Dan, Ziqian Lu, Yunlong Yu, Yingming Li, and Xi Li. Cm-unet: Hybrid cnn-mamba unet for remote sensing image semantic segmentation. arXiv preprint arXiv:2405.10530, 2024. 2, 6, 7, 3, 5
[38] Mengyang Pu, Yaping Huang, Qingji Guan, and Haibin Ling. Rindnet: Edge detection for discontinuity in reflectance, illumination, normal and depth. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6879–6888, 2021. 6, 7, 3, 5
[39] Haochen Qi, Xiangwei Kong, Zhibo Jin, Jiqiang Zhang, and Zinan Wang. A vision-transformer-based convex variational network for bridge pavement defect segmentation. IEEE Transactions on Intelligent Transportation Systems, 2024. 2
[40] Zhong Qu, Wen Chen, Shi-Yan Wang, Tu-Ming Yi, and Ling Liu. A crack detection algorithm for concrete pavement based on attention mechanism and multi-features fusion. IEEE Transactions on Intelligent Transportation Systems, 23(8):11710–11719, 2021. 2
[41] Jianing Quan, Baozhen Ge, and Min Wang. Crackvit: a unified cnn-transformer model for pixel-level crack extraction. Neural Computing and Applications, 35(15):10957–10973, 2023. 2
[42] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015. 8
[43] Carole H Sudre, Wenqi Li, Tom Vercauteren, Sebastien Ourselin, and M Jorge Cardoso. Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: Third International Workshop, DLMIA 2017, and 7th International Workshop, ML-CDS 2017, Held in Conjunction with MICCAI 2017, pages 240–248. Springer, 2017. 5, 1
[44] Huaqi Tao, Bingxi Liu, Jinqiang Cui, and Hong Zhang. A convolutional-transformer network for crack segmentation with boundary awareness. In 2023 IEEE International Conference on Image Processing, pages 86–90. IEEE, 2023. 2, 6, 7, 5
[45] A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017. 1, 2
[46] Jin Wang, Zhigao Zeng, Pradip Kumar Sharma, Osama Alfarraj, Amr Tolba, Jianming Zhang, and Lei Wang. Dual-path network combining cnn and transformer for pavement crack segmentation. Automation in Construction, 158:105217, 2024. 1, 6, 7, 5
[47] Chunlong Xia, Xinliang Wang, Feng Lv, Xin Hao, and Yifeng Shi. Vit-comer: Vision transformer with convolutional multi-scale feature interaction for dense predictions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5493–5502, 2024. 1
[48] Chao Xiang, Jingjing Guo, Ran Cao, and Lu Deng. A crack-segmentation algorithm fusing transformers and convolutional neural networks for complex detection scenarios. Automation in Construction, 152:104894, 2023. 1, 2, 6, 7, 5
[49] Xiao Xiao, Shen Lian, Zhiming Luo, and Shaozi Li. Weighted res-unet for high-quality retina vessel segmentation. In 2018 9th international conference on information technology in medicine and education, pages 327–331. IEEE, 2018. 2
[50] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in neural information processing systems, 34:12077–12090, 2021. 8
[51] Xinyu Xie, Yawen Cui, Tao Tan, Xubin Zheng, and Zitong Yu. Fusionmamba: Dynamic feature enhancement for multimodal image fusion with mamba. Visual Intelligence, 2(1):37, 2024. 2
[52] Zhaohu Xing, Tian Ye, Yijun Yang, Guang Liu, and Lei Zhu. Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 578–588. Springer, 2024. 2
[53] Kangmin Xu, Liang Liao, Jing Xiao, Chaofeng Chen, Haoning Wu, Qiong Yan, and Weisi Lin. Boosting image quality assessment through efficient transformer adaptation with local feature enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2662–2672, 2024. 2
[54] Tomoyuki Yamaguchi, Shingo Nakamura, Ryo Saegusa, and Shuji Hashimoto. Image-based crack detection for real concrete surfaces. IEEJ Transactions on Electrical and Electronic Engineering, 3(1):128–135, 2008. 2
[55] Chenhongyi Yang, Zehui Chen, Miguel Espinosa, Linus Ericsson, Zhenyu Wang, Jiaming Liu, and Elliot J Crowley. Plainmamba: Improving non-hierarchical mamba in visual recognition. arXiv preprint arXiv:2403.17695, 2024. 2, 3, 4, 6, 7, 5
[56] Fan Yang, Lei Zhang, Sijia Yu, Danil Prokhorov, Xue Mei, and Haibin Ling. Feature pyramid and hierarchical boosting network for pavement crack detection. IEEE Transactions on Intelligent Transportation Systems, 21(4):1525–1535, 2019. 2, 5, 6, 1
[57] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Free-form image inpainting with gated convolution. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4471–4480, 2019. 4
[58] Hang Zhang, Allen A Zhang, Zishuo Dong, Anzheng He, Yang Liu, You Zhan, and Kelvin CP Wang. Robust semantic segmentation for automatic crack detection within pavement images using multi-mixing of global context and local image features. IEEE Transactions on Intelligent Transportation Systems, 2024. 1
[59] Tianjie Zhang, Donglei Wang, and Yang Lu. Ecsnet: An accelerated real-time image segmentation cnn architecture for pavement crack detection. IEEE Transactions on Intelligent Transportation Systems, 2023. 1
[60] Xiaohu Zhang and Haifeng Huang. Distilling knowledge from a transformer-based crack segmentation model to a light-weighted symmetry model with mixed loss function for portable crack detection equipment. Symmetry, 16(5):520, 2024. 3
[61] Jian Zhou, Peisen S Huang, and Fu-Pen Chiang. Wavelet-based pavement distress detection and evaluation. Optical Engineering, 45(2):027007–027007, 2006. 2
[62] Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision mamba: Efficient visual representation learning with bidirectional state space model. In International Conference on Machine Learning, 2024. 2, 3, 4