Scene Change Detection
Department of Computer Science and Engineering, Shanghai Jiao Tong University, China
Abstract
Change detection (CD) aims to identify changes that occur in an image pair taken at different times. Prior methods devise specific networks from scratch to predict change masks at the pixel level, and struggle with general segmentation problems. In this paper, we propose a new paradigm that reduces CD to semantic segmentation, which means tailoring an existing and powerful semantic segmentation network to solve CD. This new paradigm conveniently leverages mainstream semantic segmentation techniques to deal with the general segmentation problems in CD, so that we can concentrate on studying how to detect changes. We propose a novel and important insight that different change types exist in CD and that they should be learned separately. Based on it, we devise a module named MTF to extract the change information and fuse temporal features. MTF enjoys high interpretability and reveals the essential characteristic of CD, and most segmentation networks can be adapted to solve CD problems with our MTF module. Finally, we propose C-3PO, a network to detect changes at the pixel level. C-3PO achieves state-of-the-art performance without bells and whistles. It is simple yet effective and can be considered as a new baseline in this field. Our code is at https://github.com/DoctorKey/C-3PO.
Keywords: Change Detection, Semantic Segmentation, Feature Fusion
∗ Corresponding author
Email address: [email protected] (Bin-Bin Gao)
Figure 1: Illustration of the change detection task. We propose that there are three possible change types (i.e., “appear”, “disappear” and “exchange”) in this task. The network should model these three change types separately.
Experiments verify that our MTF module indeed extracts these three changes (cf. Fig. 8). Compared with previous fusion manners, our MTF enjoys high interpretability and reveals the essential characteristic of CD. Moreover, it generalizes well and can be easily inserted between the backbone and the segmentation head (as in Fig. 2).
Finally, we propose C-3PO, namely Combine 3 POssible change types, to solve the change detection task. C-3PO utilizes the mature techniques of semantic segmentation, so it is reliable and easy to deploy. With our MTF module, C-3PO effectively solves the CD task with high interpretability. It outperforms previous methods and achieves state-of-the-art performance.
Our contributions can be summarized as follows.
• We propose a new paradigm to solve the CD problem: reducing change detection to semantic segmentation. Compared with the previous paradigm that devises specific networks from scratch, ours is more advantageous in that it can make the most of the techniques in the field of semantic segmentation.
• We argue that three possible change types are contained in the CD task,
and they should be learned separately. The MTF module is proposed to com-
bine two temporal features and extract all three change types. Our MTF
Figure 2: Illustration of our paradigm of reducing change detection to semantic segmentation. We propose the MTF module to fuse temporal features. The semantic segmentation network is transformed to solve CD by inserting MTF.
generalizes well and can work with many segmentation networks. Experiments demonstrate the importance of distinguishing the three change types and show that MTF indeed extracts the information needed to identify them.
• With our paradigm and MTF, we propose C-3PO, a network to detect changes at the pixel level. C-3PO is simple yet effective. It achieves state-of-the-art performance without bells and whistles, and can be considered as a new baseline network in the CD field.
2. Related Works
tion and conditional random field based multimodal change detection method.
SDACD [10] unifies image adaptation and feature adaptation in an end-to-end
trainable manner, and handles cross-domain change detection. SSCD focuses on street scene images. In this paper, we concentrate on SSCD because it has broader applications. Early research utilized handcrafted features and feature matching to solve the CD problem [11, 12]. Recently, deep neural networks have been adopted and achieve better performance. However, most researchers design specific networks from scratch and struggle with the general segmentation problems.
problems. HPCFNet [2] hierarchically combines features at multiple levels to
deal with multi-scaled objects. Further, it introduces the MPFL [2] strategy
to alleviate the problem that the locations and the scales of changed regions
are unbalanced. CosimNet [13] proposes a threshold contrastive loss to deal
with the noise in the change masks. DR-TANet [3] introduces the attention
mechanism into the CD field and proposes CHVA [3] to refine the strip entity
changes. The performance of their methods is determined by both segmentation and change detection, and it is hard to distinguish which one contributes more. Although previous methods also use the framework of semantic segmentation for change detection, they consider these two problems at the same time, i.e., they couple semantic segmentation and CD. The previous paradigm can thus be considered as designing specific networks from scratch. In this paper, we propose a new paradigm of reducing CD to semantic segmentation, which means directly using a segmentation network to solve CD with minimal changes. Ours explicitly decouples CD into feature fusion and semantic segmentation. MTF is proposed and used to reduce CD to semantic segmentation, whereas previous methods cannot make this reduction explicitly. We can directly adopt a powerful semantic segmentation network, and therefore enjoy its techniques for dealing with general segmentation problems. On the other hand, we can concentrate on how to improve change detection, i.e., how to design and improve our MTF.
Previous works also study how to fuse two temporal features. The corre-
lation layer is utilized by CSCDNet [14] to estimate optical flow for localizing
changes. HPCFNet [2] concatenates features and applies parallel atrous group
convolution to capture multi-scale information. Inspired by the self-attention
mechanism, DR-TANet [3] introduces temporal attention for the feature fusion.
However, these fusion manners are heavily coupled with their specific network structures and are hard to apply to general semantic segmentation networks. In this paper, we focus on how to build features that allow a general segmentation network to identify changes. We propose a novel and essential insight that distinguishing the three change types results in better performance. Finally, the proposed MTF module generalizes well and can be easily inserted into most semantic segmentation networks.
Reduction is one of the most important concepts in theoretical computer science, e.g., NP-Complete problems can be reduced to each other [15]. Transfer learning [16] can be considered as reduction in the field of machine learning. When it comes to computer vision, RCNN [17] reduces object detection
to classification by extracting region proposals and classifying which proposal
contains objects. FCOS [18] reduces object detection to semantic segmentation
by predicting bounding boxes in a per-pixel fashion. ViT [19] reduces image
classification to natural language processing (NLP) by splitting an image into
patches and treating them as words in NLP. In this paper, we study how to
reduce change detection to semantic segmentation. This reduction is reasonable because both tasks predict labels at the pixel level, and the techniques from the semantic segmentation field can indeed help solve CD.
Semantic segmentation is a basic computer vision task that predicts the
semantic classes at the pixel level. FCN [20] is an early deep-learning-based work that achieved remarkable success in this task. Then a series of DeepLab models [21, 22] were proposed and gradually became the mainstream approach. CENet [23] introduces a novel encoder-decoder architecture to capture multi-scale context via ensemble deconvolution. GPNet [24] proposes a gated pyramid module to incorporate both low-level and high-level features. Recently, attention-based models [25, 26] have been introduced in this field, e.g., SETR [27] uses the transformer to deal with this task. In this paper, for robustness and reliability, we adopt FCN and DeepLab to solve the CD task. Note that with our proposed MTF module, most segmentation networks can be adapted to solve the CD task.
Figure 3: The overall architecture of C-3PO. Two temporal feature pyramids are extracted
by the same backbone and merged by MTF. Then, MSF generates a single feature map for
the segmentation head to predict change masks.
Fig. 3 illustrates the overall architecture of our network. The main difference between our network and previous ones is that ours decouples change detection into feature fusion and semantic segmentation. Without the MTF module, the remaining parts form a semantic segmentation network, so we can directly adopt the mainstream architectures of semantic segmentation for these parts. Our MTF module aims at extracting the requisite information from two temporal features so that a general segmentation network can identify changes. The design of MTF is based on our insight that different change types should be learned separately.
Backbone: A general-purpose backbone network is used to extract hierar-
chical feature maps in our framework. We adopt VGG [28], ResNet [29], and
MobileNetV2 [30] in this paper. All backbones can generate pyramid features of
multiple scales from 1/32 to 1/4 resolution. Given paired images t0 and t1, two feature pyramids f0 and f1 are extracted by the same backbone, respectively.
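Before detailing each module, a minimal sketch of the data flow in Fig. 3 may help; applying one MTF per pyramid level is our reading of the figure, and the MTF and MSF sketches follow below.

```python
import torch.nn as nn

class C3PO(nn.Module):
    """Overall data flow of Fig. 3 (sketch)."""

    def __init__(self, backbone, mtfs, msf, head):
        super().__init__()
        self.backbone = backbone         # shared by t0 and t1 (symmetry)
        self.mtfs = nn.ModuleList(mtfs)  # one MTF per pyramid level (assumption)
        self.msf = msf
        self.head = head

    def forward(self, t0, t1):
        f0 = self.backbone(t0)           # pyramid: maps at 1/4 ... 1/32 resolution
        f1 = self.backbone(t1)           # same weights, hence symmetric features
        fused = [m(a, b) for m, a, b in zip(self.mtfs, f0, f1)]
        return self.head(self.msf(fused))
```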
Merge temporal features (MTF): Fig. 4 shows the detailed architecture of our MTF module, which has five branches. The upper three branches use the subtraction operation to model the three different change types, while the lower two branches directly use the raw features to provide the basic semantic information of the two input images.
Figure 4: The detailed architecture of the MTF module. This module aims at merging two temporal feature maps into one. The upper three branches model “appear”, “disappear” and “exchange”, respectively.
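To make the branch design concrete, here is a minimal PyTorch sketch of MTF. The signed-difference change branches follow the description above, but the per-branch layer widths and the way the five branch outputs are merged (summation before a 1 × 1 fusion convolution) are our assumptions rather than the paper's verified configuration.

```python
import torch
import torch.nn as nn

class MTF(nn.Module):
    """Merge Temporal Features (sketch): five branches fuse f0 and f1."""

    def __init__(self, channels: int):
        super().__init__()
        conv = lambda: nn.Conv2d(channels, channels, 3, padding=1)
        self.appear, self.disappear, self.exchange = conv(), conv(), conv()
        self.info = conv()                 # shared by both info branches (symmetry)
        self.fuse = nn.Conv2d(channels, channels, 1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, f0: torch.Tensor, f1: torch.Tensor) -> torch.Tensor:
        app = self.appear(self.relu(f1 - f0))     # "appear": present only at t1
        dis = self.disappear(self.relu(f0 - f1))  # "disappear": present only at t0
        exc = self.exchange(torch.abs(f0 - f1))   # "exchange": bidirectional change
        i0, i1 = self.info(f0), self.info(f1)     # raw-feature info branches
        return self.fuse(self.relu(app + dis + exc + i0 + i1))
```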
Figure 5: The detailed architecture of the MSF module. This module aims at merging the
pyramid into a single feature map.
Merge spatial features (MSF): Fig. 5 shows the detailed architecture of our MSF module. Following a top-down pathway, the smaller feature maps are enlarged by 2× bilinear upsampling and added to the next larger feature map, and 3 × 3 convolutions are used to further process these feature maps. To construct the final single feature map, the upper three feature maps are enlarged by bilinear upsampling to the resolution of the lowest-level (largest) feature map. All feature maps are then concatenated and reduced to 512 channels by a 1 × 1 convolution.
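The following sketch mirrors this description: a top-down pathway with 2× bilinear upsampling and addition, a 3 × 3 convolution per level, then concatenation at the highest resolution followed by a 1 × 1 convolution down to 512 channels. For simplicity it assumes all pyramid levels share one channel width; a real backbone would require per-level projections first.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSF(nn.Module):
    """Merge Spatial Features (sketch): collapse a 4-level pyramid into one map."""

    def __init__(self, channels: int, out_channels: int = 512):
        super().__init__()
        self.smooth = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(4))
        self.reduce = nn.Conv2d(4 * channels, out_channels, 1)

    def forward(self, pyramid):  # pyramid: maps at 1/4, 1/8, 1/16, 1/32 resolution
        merged = [pyramid[-1]]                   # start from the 1/32 map
        for feat in reversed(pyramid[:-1]):      # 1/16 -> 1/8 -> 1/4
            up = F.interpolate(merged[-1], scale_factor=2,
                               mode='bilinear', align_corners=False)
            merged.append(feat + up)             # add the enlarged smaller map
        merged = [conv(m) for conv, m in zip(self.smooth, merged)]
        size = merged[-1].shape[-2:]             # target: the 1/4 resolution
        merged = [F.interpolate(m, size=size, mode='bilinear',
                                align_corners=False) for m in merged]
        return self.reduce(torch.cat(merged, dim=1))  # concat + 1x1 conv -> 512
```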
Semantic segmentation head: Once the final feature map is constructed, we can adopt existing semantic segmentation heads to predict the change mask. In this paper, FCN [20] and DeepLabv3 [22] are used. The FCN head first uses a 3 × 3 convolution to reduce the number of channels by a factor of 4, and then uses a 1 × 1 convolution to further reduce the channels to N classes. Finally, 4× bilinear upsampling and softmax are used to generate the prediction mask at the original image resolution. Compared with FCN, DeepLabv3 is more advantageous in that it contains ASPP [21] and can effectively capture multi-scale information.
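A sketch of the FCN head as described (3 × 3 convolution reducing channels by 4×, 1 × 1 convolution to N classes, 4× bilinear upsampling, softmax); the ReLU between the two convolutions is an assumption.

```python
import torch.nn as nn
import torch.nn.functional as F

class FCNHead(nn.Module):
    """FCN head sketch: 512-channel MSF output -> N-class change mask."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, in_channels // 4, 3, padding=1)
        self.conv2 = nn.Conv2d(in_channels // 4, num_classes, 1)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.conv2(x)
        # the MSF output is at 1/4 resolution, so 4x upsampling restores the input size
        x = F.interpolate(x, scale_factor=4, mode='bilinear', align_corners=False)
        return x.softmax(dim=1)
```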
3.1. Training
The change detection task requires the model to predict a change mask that indicates the changed regions. This can be considered as a multi-class semantic segmentation task. Hence, the softmax cross-entropy loss function is used to train our network. However, the change/background distribution is heavily imbalanced (e.g., 0.06 : 0.94 on VL-CMU-CD). Following previous works [2], we use the weighted cross-entropy loss to alleviate this problem. Assume there are N classes in total, from 0 to N − 1 (the 0-th class denotes the background); the loss function on each pixel is defined as
$$L = -\sum_{i=0}^{N-1} w_i y_i \log p_i, \tag{4}$$

where $p_i$ and $y_i$ are the prediction and the ground-truth of the $i$-th class region, respectively, and

$$w_i = \frac{\sum_{j=0}^{N-1} n_j - n_i}{(N-1)\sum_{j=0}^{N-1} n_j}$$

is the balance weight, where $n_i$ is the number of $i$-th class pixels. We use the training set to calculate these statistics. Finally,
the loss averaged over pixels is used for optimization.
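As a sanity check of Eq. 4, the sketch below derives the balance weights from class pixel statistics and plugs them into PyTorch's cross-entropy; note that PyTorch's weighted "mean" reduction normalizes by the weight sum rather than the raw pixel count, a minor deviation from the plain per-pixel average.

```python
import torch
import torch.nn.functional as F

def balance_weights(pixel_counts):
    """Eq. 4 weights: w_i = (sum_j n_j - n_i) / ((N - 1) * sum_j n_j)."""
    n = torch.as_tensor(pixel_counts, dtype=torch.float)
    return (n.sum() - n) / ((len(n) - 1) * n.sum())

# e.g. the VL-CMU-CD background/change ratio quoted in the text:
w = balance_weights([0.94, 0.06])     # -> tensor([0.06, 0.94])

def loss_fn(logits, target):          # logits: (B, N, H, W), target: (B, H, W)
    return F.cross_entropy(logits, target, weight=w.to(logits.device))
```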
Typically, only two classes (i.e., change/background) are considered in the CD field. Recently, a multi-class dataset has been proposed [7]. Note that our method can solve both binary and multi-class CD problems with Eq. 4.
3.2. Analysis
CD task. Note that it is difficult for previous CD networks to be decomposed into segmentation networks and pretrained on the semantic segmentation task.
4. Experimental Results
256 × 256 for testing.
Implementation details. To train the network, we use Adam [36] with an initial learning rate of 0.0001, decayed with a cosine schedule. We use 4 V100 GPUs with batch size 16 (i.e., 4 images per GPU). For a fair comparison, the backbone is pretrained on ImageNet [37] while the other parts are initialized randomly, except for the experiments in Section 4.3. All networks are trained for 100 epochs. Note that all experiments use the same training strategy and there are no more hyperparameters to tune.
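This setup translates directly to PyTorch; the stand-in module below merely makes the snippet runnable.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 2, 1)  # stand-in module; C-3PO would go here

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # initial lr 0.0001
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):    # all networks are trained for 100 epochs
    # ... one pass over the training set would go here ...
    scheduler.step()        # cosine learning-rate decay
```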
Evaluation metrics. Following previous works [3, 2], we evaluate the performance on VL-CMU-CD and PCD with the F1-score, which is calculated from Precision and Recall. In these two binary CD tasks, the change regions represent the positive class while the background (unchanged regions) represents the negative class. Given true positives (TP), false positives (FP) and false negatives (FN), the F1-score is defined as
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \tag{5}$$

where $\text{Precision} = \frac{TP}{TP+FP}$ and $\text{Recall} = \frac{TP}{TP+FN}$. In addition, some ground-
truth masks are not accurate in these change detection tasks (cf. Fig. 6). Following the research on noisy labels [38, 39], we evaluate the network after each epoch and report both the best and the last performance.
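A small sketch of Eq. 5 on binary masks, where change pixels are the positive class; the epsilon guarding against empty masks is our addition.

```python
import torch

def f1_score(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Eq. 5 on binary change masks (True/1 = change, the positive class)."""
    pred, gt = pred.bool(), gt.bool()
    tp = (pred & gt).sum().float()
    fp = (pred & ~gt).sum().float()
    fn = (~pred & gt).sum().float()
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return 2 * precision * recall / (precision + recall + eps)
```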
ChangeSim contains two CD tasks (i.e., binary and multi-class). Follow-
ing [7], we evaluate the performance with mIoU, which is defined as
$$\text{mIoU} = \frac{1}{C}\sum_{i=1}^{C} \frac{|P_i \cap G_i|}{|P_i \cup G_i|}, \tag{6}$$
where C is the total number of categories, and Pi and Gi denote the prediction and ground-truth masks for the i-th class.
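Likewise, a sketch of Eq. 6 on label maps; how classes absent from both masks are handled follows [7] in practice, so skipping them here is an assumption.

```python
import torch

def miou(pred: torch.Tensor, gt: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Eq. 6: mean per-class IoU over label maps with values in [0, num_classes)."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = (p | g).sum().float()
        if union > 0:                       # skip classes absent from both masks
            ious.append((p & g).sum().float() / union)
    return torch.stack(ious).mean()
```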
To evaluate our insight that different change types should be learned separately, we conduct ablation experiments using different branches of the MTF module. Fig. 7a shows the F1-score on VL-CMU-CD for C-3PO with just one
Figure 6: Examples of noisy labels. Images in (a) are from VL-CMU-CD, where the changed cars are not annotated. Examples in (b) and (c) are from PCD, where the changed trees are not annotated well.
type branch after each epoch. Because this task only contains “disappear” changes, both the appear and disappear branches perform well, according to the analysis in Section 3.2. The info branch lacks the discrepancy information needed to detect changes. And the exchange branch struggles with the dilemma of whether to detect changes according to objects or background (i.e., whether to make the object or the background feature larger during training). These results demonstrate that the feature fusion manner needs to suit the task.
GSV differs from VL-CMU-CD in that it contains all three change types. Fig. 7b shows the results. First, the exchange branch performs best among the three change branches, because all three change types exist in this task. Second, because changes in GSV are heavily object-based, C-3PO works well even with only the info branches; for example, as in Fig. 1, cars change with high probability in image pairs. Third, it is unsuitable to use only a unidirectional change branch such as appear or disappear.
Fig. 8 shows the change masks predicted by C-3PO with different branches.
Figure 7: F1-score of C-3PO after each training epoch. (a) presents the results on VL-CMU-
CD which only contains the “disappear” change. (b) shows the results on GSV which contains
three change types. “I”, “A”, “D” and “E” denote the info, appear, disappear and exchange
branches in Fig. 4, respectively.
Figure 8: Visual illustration of change masks predicted by C-3PO with different branches.
We train C-3PO with all three change branches, then use just one branch to identify changes.
“disappear”, “appear” and “exchange” denote three change branches, respectively. “predict”
uses all branches and “GT” is the ground-truth change mask.
15
Table 1: F1-score (%) on benchmarks for C-3PO using different branches in the MTF module. “I”, “A”, “D” and “E” denote the info, appear, disappear and exchange branches in Fig. 4, respectively.
16
Table 2: F1-score (%) for C-3PO with different positions for merging temporal information and different pretrained models. T and S denote temporal and spatial fusion, respectively. B and H denote the backbone and the semantic segmentation head, respectively. ImageNet pretraining means B is pretrained on ImageNet, while COCO pretraining means B, S and H are pretrained on COCO.
and TSUNAMI, these two datasets contain all change types. Hence, the model
should be equipped with all change branches, especially the exchange branch for
the bidirectional changes. Considering the difference among these benchmarks,
in the rest of this paper we use “I + D” on VL-CMU-CD, and “I + A + D + E”
on GSV, TSUNAMI, and ChangeSim. Note that these structures are not the best combinations in Tab. 1: we choose the model's structure according to the training set rather than the performance on the testing set. From the training sets, VL-CMU-CD only contains “disappear” changes, while the other datasets contain all change types. Our choices achieve performance comparable to the best combinations (within 1% F1-score). We believe these choices are reasonable and give sound conclusions in the later experiments.
Merging temporal information too early easily results in overfitting, because it does not utilize the backbone's ability to detect objects. Inserting MTF between the backbone and MSF achieves the best performance, and placing MTF after MSF also works well: these two settings utilize the backbone to detect objects and the head to predict change masks. Merging temporal information before spatial information lets the MSF module obtain more information for detecting changes. Finally, placing MTF after the head expects the head to predict the semantic mask of each image so that the change mask can be obtained by simply combining the two semantic masks. That is difficult because only change masks are available to train the network.
4.4. Symmetry
The symmetry of C-3PO comes from the weights shared by the backbone, the info branches, and the appear/disappear branches. We conduct experiments for C-3PO without sharing specific parts, and Tab. 3 summarizes the results. Symmetry matters significantly on GSV, and sharing weights in all three parts achieves the best performance. Without symmetry, the network easily overfits the individual information in the two input images (e.g., the weather in each image). A symmetrical framework makes the network concentrate on the changes in the image pair. It also implicitly augments the image pair by exchanging their order, which further alleviates the overfitting problem.
Table 3: F1-score (%) for C-3PO with sharing specific parts. “B”, “I” and “C” denote the
backbone, the info branches and the appear/disappear branches, respectively. “X” denotes
sharing weights. Note that we use “I+D” on VL-CMU-CD, so it does not need to consider
the symmetry for appear/disappear branches in this model. We evaluate networks on the first
cross-validation fold.
Table 4: F1-score (%) for C-3PO with/without weighted loss. “X” denotes using balance
weight in Eq. 4. We evaluate networks on the first cross-validation fold.
Following previous works [3, 14, 2], we adopt the weighted cross-entropy loss to alleviate the imbalance problem. Tab. 4 compares it with the unweighted loss. The balance weight matters significantly on VL-CMU-CD, because its change/background distribution is about 0.06 : 0.94, more imbalanced than that of PCD (0.29 : 0.71). Overall, as argued by previous works, we also recommend using the weighted cross-entropy loss on these tasks.
Table 5: F1-score (%) for C-3PO with different configurations of the backbone, MSF and the
head.
Tab. 5 summarizes the results. First, the number for MSF denotes how many feature maps are used in the MSF module; “1” means only using the feature map with spatial scale 1/32. The F1-score increases as more low-level feature maps are added, since the low-level feature maps contain more details that help C-3PO segment more accurately. Second, we try FCN and DeepLabv3 as the semantic segmentation head; both perform well on these benchmarks. Third, different backbones are adopted; a backbone with larger capacity results in better performance.
In the case of street scene change detection, the viewpoints of the two input images may differ slightly. We apply perspective transformations to the input images to study this scenario. Fig. 9 presents the visualization results. In each sub-figure, the top two images are the images captured at t0 and t1, respectively, and the bottom two masks are the ground truth and the model's prediction, respectively. We transform one image while keeping the other unchanged.
[Figure 9 panels: “w/o transformation”, “transform t0”, “transform t1”, “transform t0 and GT”, “transform t1 and GT”.]
Note that all predictions are from the same model, trained without this transformation augmentation. First, to evaluate accuracy, the crucial problem is how to define the ground-truth (GT) change mask, i.e., which viewpoint to choose for it. We find that this easily results in misalignment between the model's prediction and the GT mask. Our model predicts according to the viewpoint of t0, because the disappeared object exists in this image, and it is robust enough to detect this object in all cases. Overall, we suggest adopting techniques from multi-view geometry to align the two images first; this alleviates the perspective problem and avoids the dilemma of how to define GT masks.
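One way to reproduce this probing setup is with torchvision's perspective transform; the distortion strength and the random tensors standing in for a real image pair are arbitrary.

```python
import torch
import torchvision.transforms as T

t0_image = torch.rand(3, 256, 256)   # stand-ins for a real image pair
t1_image = torch.rand(3, 256, 256)

warp = T.RandomPerspective(distortion_scale=0.2, p=1.0)  # arbitrary strength
t0_warped = warp(t0_image)           # warp t0 only; t1 and the GT mask stay fixed
# a trained C-3PO would then be evaluated on (t0_warped, t1_image)
```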
Figure 10: Visualization of C-3PO on testing images with different object sizes.
Table 7: F1-score (%) for C-3PO and previous methods on VL-CMU-CD. “last” and “best” denote the F1-score after the last and the best epoch, respectively.

Method              Backbone     last    best
FC-EF [31]          U-Net        43.2    44.6
FC-Siam-diff [31]   U-Net        65.1    65.3
FC-Siam-conc [31]   U-Net        65.6    65.6
DR-TANet [3]        ResNet-18    72.5    75.1
CSCDNet [14]        ResNet-18    75.4    76.6
C-3PO               ResNet-18    77.6    79.4
HPCFNet [2]         VGG-16       –       75.2
C-3PO               VGG-16       79.6    80.0
To study the performance w.r.t. the size of the changed objects, we split the testing set of VL-CMU-CD into three folds. Given the binary change masks, we calculate the ratio of the changed area over the whole area. According to this ratio, “Small”, “Medium”, and “Large” testing sets are obtained (cf. Tab. 6) with comparable data amounts. F1-scores are evaluated on each testing set. As shown in Tab. 6, it is more challenging to detect small objects. Fig. 10 visualizes results with different sizes. Overall, our model detects all objects successfully.
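A sketch of how such a split can be computed from a binary change mask; the bin thresholds below are hypothetical, since the paper does not state them.

```python
import torch

def change_ratio(mask: torch.Tensor) -> float:
    """Fraction of changed pixels over the whole image."""
    return mask.bool().float().mean().item()

def size_bin(mask: torch.Tensor, small: float = 0.02, large: float = 0.10) -> str:
    r = change_ratio(mask)                  # thresholds here are hypothetical
    return "Small" if r < small else ("Large" if r > large else "Medium")
```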
Table 8: F1-score (%) for C-3PO and previous methods on PCD.
Table 9: The mIoU (%) performance on the ChangeSim benchmark for C-3PO and previous
methods. “S”, “C”, “N”, “M”, “Ro” and “Re” denote “static”, “change”, “new”, “missing”,
“rotated” and “replaced”, respectively.
                    Binary                  Multi-class
Method           S      C     mIoU      S      N     M     Ro    Re    mIoU
ChangeNet [32]   73.3   17.6  45.4      80.6   9.1   6.9   11.6  6.6   23.0
CSCDNet [14]     87.3   22.9  55.1      90.2   12.4  6.0   17.5  7.9   26.8
C-3PO            90.4   28.8  59.6      92.6   13.3  8.0   16.9  8.0   27.8
We simply set different numbers of final output channels according to the different settings. Finally, C-3PO outperforms the other methods on both tasks by a significant margin. Although ours improves the state-of-the-art results on this benchmark, it remains challenging and needs more effort to overcome its difficulties. Different from PCD and VL-CMU-CD, ChangeSim contains more change types, e.g., “rotated”. To improve the performance on ChangeSim, a more sophisticated MTF module should be proposed. However, this paper focuses on a general model for all change detection datasets; we leave a model dedicated to ChangeSim as future work.
Overall, our C-3PO outperforms the state of the art by a significant margin. Due to its simplicity, we believe it can serve as a new baseline in this field and inspire more research.
4.10. Visualization
We visualize the predicted masks of C-3PO. Figs. 11, 12, 13, and 14 illustrate the results on VL-CMU-CD, GSV, TSUNAMI, and ChangeSim, respectively. In each sub-figure, the top two images are the images captured at t0 and t1, respectively, and the bottom two masks are the ground truth and the model's prediction, respectively.
Figure 11: Visualization of C-3PO on the testing set of VL-CMU-CD. The last row shows
some failure cases.
Figure 14: Visualization of C-3PO on the testing set of ChangeSim.
5. Conclusions and Future Work
References
[2] Y. Lei, D. Peng, P. Zhang, Q. Ke, H. Li, Hierarchical paired channel fusion
network for street scene change detection, IEEE TIP 30 (2021) 55–67.
[5] K. Sakurada, T. Okatani, Change detection from a street image pair using
CNN features and superpixel segmentation, in: BMVC, Vol. 61, 2015, pp.
1–12.
[7] J.-M. Park, J.-H. Jang, S.-M. Yoo, S.-K. Lee, U.-H. Kim, J.-H. Kim, ChangeSim: Towards end-to-end online scene change detection in industrial indoor environments, in: IROS, 2021, pp. 1–8.
[14] K. Sakurada, M. Shibuya, W. Wang, Weakly supervised silhouette-based
semantic scene change detection, in: ICRA, 2020, pp. 6861–6867.
[24] Y. Zhang, X. Sun, J. Dong, C. Chen, Q. Lv, GPNet: gated pyramid network
for semantic segmentation, Pattern Recognition 115 (2021) 107940.
[25] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, H. Lu, Dual attention
network for scene segmentation, in: CVPR, 2019, pp. 3146–3154.
[29] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recogni-
tion, in: CVPR, 2016, pp. 770–778.
[35] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN, in: ICCV, 2017,
pp. 2961–2969.
[38] K. Yi, J. Wu, Probabilistic end-to-end noise correction for learning with
noisy labels, in: CVPR, 2019, pp. 7017–7025.
[39] J. Li, R. Socher, S. C. Hoi, DivideMix: Learning with noisy labels as semi-
supervised learning, in: ICLR, 2020, pp. 1–14.