0% found this document useful (0 votes)

48 views14 pages

Deepfake - Detection - and - Localization - Using - Multi-View - Inconsistency - Measurement

This article presents a novel Multi-View Inconsistency Measurement (MVIM) network for detecting and localizing tampered regions in deepfake videos, addressing challenges in multi-face scenarios where only some faces are manipulated. The MVIM network incorporates two key modules: Noise Inconsistency Measurement (Noise-IM) for analyzing noise patterns and Temporal Inconsistency Measurement (Temporal-IM) for capturing inter-frame differences. Extensive experiments demonstrate the effectiveness of the proposed method compared to existing state-of-the-art techniques.

Uploaded by

blumoongamer.0123

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

48 views14 pages

Deepfake - Detection - and - Localization - Using - Multi-View - Inconsistency - Measurement

Uploaded by

blumoongamer.0123

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 14

This article has been accepted for publication in IEEE Transactions on Dependable and Secure Computing.

This is the author's version which has not been fully edited and
102srm2 content may change prior to final publication. Citation information: DOI 10.1109/TDSC.2024.3472064

Deepfake Detection and Localization Using

Multi-View Inconsistency Measurement
Bolin Zhang, Qilin Yin, Wei Lu, Member, IEEE, Xiangyang Luo

Abstract—As deepfake technology advances, forgery detection

techniques have evolved beyond simple classification to include
fine-grained localization. However, existing deepfake localization
methods struggle with with real-world deepfake videos, which
are often multi-face scenarios with only some parts manipulated.
To address the above-mentioned problems, we propose a Multi-
View Inconsistency Measurement (MVIM) network that simulta-
neously measures inconsistencies from noise and temporal view
to detect and locate tampered regions. Specifically, considering
the noise inconsistencies in multi-face scenarios where fake faces
have inconsistent noise patterns compared to real faces and
backgrounds, we design a Noise Inconsistency Measurement
(Noise-IM) module that measures noise similarity among faces
and between faces and backgrounds using a masked attention
mechanism to identify suspected tampered regions in noise
domain. Since facial jitter of tampered regions in deepfake videos
is observed to be more intense than that of real regions, we
design a Temporal Inconsistency Measurement (Temporal-IM)
module which adopts self-attention mechanism and fine-grained (a) Noise inconsistency between real (green) and fake (red) faces
bi-direction convolutions to capture tampering traces between
frames in temporal domain. Inconsistency features obtained by
the two modules are fused for detecting and locating tampered
regions. The superiority of our MVIM network is verified by
extensive experiments with many state-of-the-art methods in
different benchmark datasets.
Index Terms—Deepfake detection, face forgery localization,
multi-face forensics, inconsistency measurement. (b) Temporal inconsistency in the mouth and eye regions
Fig. 1. Examples of noise inconsistency and temporal inconsistency in multi-
face forgery videos where more than one person is active in the scene with
I. I NTRODUCTION only some of them being manipulated. White areas in tampering mask indicate
tampered faces in that area, while black areas indicate real untampered areas.

T HE continuous development of deep learning techniques

and the widespread dissemination of multi-media have
brought about a situation where seeing is no longer believing.
likely to be misused for certain malicious purposes, resulting
in serious security and ethical issues (e.g., the promotion of
With advancements in generative models represented by Vari- celebrity pornography and political persecution). Therefore,
ational Autoencoders (VAEs) [1], Generative Adversarial Net- in order to mitigate the negative impact on public safety and
works (GANs) [2], [3] and Diffusion Models (DMs) [4], [5], it personal privacy, it is crucial to develop effective solutions to
has become relatively straightforward to alter one person’s face counteract these face forgery attacks.
to another while maintaining the original facial expression and To address this issue, there are some methods aiming at ex-
head pose easily. However, these forgery techniques [6]–[8] are ploring spatial visual artifacts caused by forgery methods such
as abnormal face regions (e.g., eyes with different colors [9],
This work is supported by the National Natural Science Foundation of asymmetric faces [10]) and capturing specific artifacts (e.g.,
China (No. U2001202, No.62072480, No.U23A20305 and No.62172435), the
Guangdong Provincial Key Laboratory of Information Security Technology blending boundaries [11], illumination artifacts [12]). These
(No. 2023B1212060026) and the Open Research Project of the State Key methods solely exploited spatial information while ignoring
Laboratory of Media Convergence and Communication (Communication Uni- valuable temporal information, which is of great significance
versity of China) (NO.SKLMCC2022KF003). (Corresponding author :
W ei Lu.) to deepfake detection.
Bolin Zhang, Qilin Yin and Wei Lu are with the School of Com- Since deepfake videos are generated frame-by-frame, there
puter Science and Engineering, Ministry of Education Key Laboratory of often exist temporal inconsistencies such as inconsistent char-
Information Technology, Guangdong Province Key Laboratory of Infor-
mation Security Technology, Sun Yat-sen University, Guangzhou 510006, acter movement and abnormal jittering of textures. Prior meth-
China (e-mail: [email protected], [email protected], ods adopted general temporal modeling networks including
[email protected]). LSTM [13] and 3DCNN [14] to solve this problem but they
Xiangyang Luo is with the State Key Laboratory of Mathematical En-
gineering and Advanced Computing, Zhengzhou 450002, China (e-mail: are not specifically designed for deepfake detection. Later
luoxy [email protected]). methods aimed to detect deepfake by means of biosignal

Authorized licensed use limited to: Raj Dixit. Downloaded on November 24,2024 at 04:18:47 UTC from IEEE Xplore. Restrictions apply.
© 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
This article has been accepted for publication in IEEE Transactions on Dependable and Secure Computing. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TDSC.2024.3472064

inconsistencies such as movement of lip [15] and heart rate [37], [42] , indicating that noise is effective in identifying
[16]. More recent methods [17]–[20] concentrated on in- suspicious tampering areas. From the perspective of temporal
consistencies between adjacent frames through well-designed inconsistency, deepfake videos are generated frame-by-frame,
modules. Moreover, some works [21]–[23] combined spatial which inevitably introduces inter-frame differences [17], [18].
and temporal information and achieved decent results, but Fig. 1(b) illustrates temporal inconsistencies in the mouth and
they have a relatively fixed composition of network modules eye region of fake faces across consecutive frames. Forgery
and lack a fine-grained and comprehensive design of both manipulation would leave abrupt inter-frame differences and
spatial and temporal inconsistent information. Some studies normal facial motions are more consistent leaving fewer inter-
[24] have compared these methods fairly to provide more frame differences. Temporal inconsistencies reflected in inter-
accurate practice guidance. frame differences are quite instructive for deepfake localiza-
However, all of the above approaches treated deepfake tion.
detection as a binary classification problem, while fine-grained More specifically, in this paper we design a Noise Incon-
localization is more significant and valuable in the research sistency Measurement (Noise-IM) module to identify noise
field of multi-media forensics as it is fundamental to uncover inconsistent regions from noise domain, considering the incon-
the intention of a forger. Some approaches attempted to sistent noise patterns of fake faces compared to real faces and
perform localization for image forgery detection [25]–[28]. the background. Noise-IM first calculates the noise similarity
[25] proposed a two-stream network using RGB features by means of an attention mechanism. Then the attention
and noise features, [26] utilized both frequency and spatial map is filtered by facemask to measure noise inconsistency
domain features to locate tampered regions and [27] modeled among faces and between faces and background. Finally, we
the relationships within image patches on multiple scales for compute the consistency loss based on noise similarity scores
manipulation detection. These methods were developed to according to whether their corresponding image locations
detect image forgeries rather than face forgeries. contain consistent noise patterns. We penalize areas containing
In terms of deepfake localization, previous method [29] inconsistent noise for having a low-similarity score while
highlighted manipulated regions with the assistance of atten- awarding consistent areas a high-similarity score. Furthermore,
tion mechanism under the supervision of tampering masks. we design a Temporal Inconsistency Measurement (Temporal-
Other methods [30]–[32] aimed to improve the classification IM) module to capture suspicious tampering traces between
performance of the model with the help of localization tasks. frames in temporal domain. Temporal-IM incorporates a self-
Meanwhile, [33]–[35] utilized the imperfections associated attention mechanism and calculates the cosine similarity be-
with the upsampling process in GAN-based forgeries to locate tween corresponding regions in successive frames to measure
manipulated regions. [36]–[38] introduce noise features into the degree of inconsistency. A fine-grained denoising operation
the two-stream branch and combine them with multiscale fea- is then carried to mitigate the influence of normal facial
tures to accomplish detection. The existing deepfake localiza- motions. To better exploit temporal inconsistency information,
tion methods mentioned primarily focus on spatial aspects and Temporal-IM extracts more comprehensive representations us-
overlook the significant contributions of temporal information ing convolution along both the height and width directions,
in deepfake localization. thereby expanding the temporal receptive field. Temporal-IM
Besides, commonly used public datasets, i.e., FaceForen- highlights the inconsistent position in temporal domain so that
sics++ [39], Celeb-DF [40] have a strong selection bias [41] to the network can focus on suspected areas. Noise inconsistency
favor trimmed videos, each of which involves only one person. features from Noise-IM and temporal inconsistency features
As a result, they are not competent enough to represent true from Temporal-IM are then fused by Feature Fusion Module
visual world, multi-face circumstances. As shown in Fig. 1(a), (FFM). Finally, multi-level fused features are used to detect
multi-face video frames often contain many people active in and locate tampered regions.
the scene with only a small subset having been manipulated The main contributions of this work can be summarized as
[41]. Existing deepfake detection methods based on these follows:
datasets didn’t take multi-face forensics into consideration and • We propose a novel Multi-View Inconsistency Measure-
there still exists some distance between these methods and real ment (MVIM) network that simultaneously measures
application scenarios. noise inconsistencies and temporal inconsistencies for
To address the aforementioned limitations, we propose a detecting and localizing tampered regions in deepfake
novel Multi-View Inconsistency Measurement (MVIM) net- videos, making it more in line with real-world application
work that simultaneously explores inconsistencies caused by of multi-face scenarios.
face manipulation from noise view and inconsistencies caused • A novel Noise Inconsistency Measurement (Noise-IM)
by inter-frame differences from temporal view for detecting module is devised to finely capture the noise co-
and localizing tampered regions in deepfake videos. From the occurrence of faces and their backgrounds by masked
perspective of noise inconsistency, face manipulation results attention mechanism, allowing better use of noise in-
in changes to the spatial distribution of noise features at consistency information to guide the network in locating
different positions of the image, causing inconsistencies in tampered regions.
noise patterns. Fig. 1(a) shows the median noise of real • A new Temporal Inconsistency Measurement (Temporal-
and fake video and it can be seen that noise pattern of IM) module is designed to capture suspected tampering
real face is more consistent and homogeneous than others traces reflected in inter-frame inconsistencies using self-

H W3 Η W Η W Η W Η W
 3   2C   4C   8C
4 4 8 8 16 16 32 32

Temporal-IM Temporal-IM Temporal-IM Temporal-IM

Block I Block II Block III Block IV

Temporal Branch
FFM UP
FFM UP
Bayer
Conv Feature Fusion FFM UP
Real
F
FFM or
C
Noise Branch Fake

Noise-IM Noise-IM Noise-IM Noise-IM

Block I Block II Block III Block IV

Temporal Inconsistency Measurement Block Feature Fusion Module Noise Inconsistency Measurement Block
Temporal-IM

enhancement

enhancement
7×7, Conv

1×1, Conv

7×7, Conv

1×1, Conv

Noise-IM
Channel
Concat

Spatial

Fig. 2. The overall framework of the proposed MVIM model. Given a video frame under investigation, MVIM isC× capable
H×W
of performing both classification and
localization tasks. Adjacent frames are fed intoT×the
C×H×upper
W temporal branch and noise features are sent into noise branch at the bottom. The proposed Noise-IM
and Temporal-IM modules are inserted into the modules of the two ConvNeXt branches respectively Splitinto the Noise-IM and Temporal-IM
and turn them
C×H×W C×H×W
blocks. Different resolution features are fused by FFM module.
W×C×H×T H×C×T×W
Fface Fback
Query-Conv Key-Conv Reshape

attention and bi-direction convolutions, ultimately leading both branches,Key-Conv

which are utilized by classifier
Query-Conv and decoder. Bi-
Key-Conv
Value-Conv HT-Conv TW-Conv
to noteworthy
Similarity improvement in localization and classifica- nary classification decision and tamper localization prediction
calculation
tion performance. mask of the given video frame will be output simultaneously
1×1,Conv 1×1,Conv Value-Conv classification and localization tasks.
In the following, we will introduce our proposed method to complete
Denoising

MVIM in detail in Section II. Experiments on public datasets C×H×W

Sum
are presented in Section III, and finallyT×C×the conclusion is Given a video with T frames{It }Tt=1 , each frame It of size
H×W
summarizedH×in Section IV. H ×W ×3, MVIM aims to determine whether H×W each face in the
W×T×T T×C×H×W α HW×HW HW×HW
frame is manipulated or not, and predicts the pixel-level mask
II. P ROPOSED METHOD of manipulated C× regions.
H×W For frame It to be detected, the noise
T×C×H×W branch takes the corresponding noise components obtained by
In this section, we first present the overall framework of C×H×W
BayerConv as input. Features Fnoise obtained by backbone
theTemporal
proposed model, which
Inconsistency includes
Measurement Noise Inconsistency Mea-
Module Noise Inconsistency Measurement Module
convolution layers are processed through masked attention by
surement (Noise-IM) module, Temporal Inconsistency Mea-
Noise Inconsistency Measurement (Noise-IM) module. The
surement (Temporal-IM) module, and Feature Fusion Module
module measures the similarity among faces and between
(FFM). Then we provide the technical details for each of these
faces and backgrounds and attention maps generated by Noise-
modules. Finally, we introduce the classifier and decoder of
IM are under the supervision of consistent maps produced by
the network, along with its loss function.
segmentation maps. Adjacent frames It−1 , It and It+1 are put
together into temporal branch. Features extracted by backbone
A. Overview convolution layers are denoted as Ftemp , which is processed by
We propose a Multi-View Inconsistency Measurement Temporal Inconsistency Measurement (Temporal-IM) module
(MVIM) network for deepfake detection and localization from in two ways. Firstly, negative cosine similarity is calculated to
noise and temporal view. The architecture, as shown in Fig. measure the degree of inconsistency between adjacent frames
2, comprises a noise branch and a temporal branch in a two- through self-attention and denoising operation. Secondly, fine-
stream multi-scale framework. To detect a video frame as in- grained representations are obtained by convolution along
put, the adjacent frames and the noise components extracted by both the height and width directions. Features from both
the BayerConv [43] are fed into their respective branches. Each branches are fused at different scales through Feature Fusion
branch employs a backbone network to extract features that Module (FFM), which consists of channel enhancement and
are reduced in scale but increased in dimensionality. Features spatial enhancement. Finally, fused features from the last
are then fused at different scales to aggregate information from stage are used for classification, and the decoder aggregates
Split

Authorized licensed use limited to: Raj Dixit. Downloaded on November 24,2024 at 04:18:47 UTC from IEEE Xplore. Restrictions apply.
© 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
This article has been accepted for publication in IEEE Transactions on Dependable and Secure Computing. This is the author's version which has not been fully edited and
Noise
NoiseInconsistency Measurement
Inconsistency Measurement Temporal Inconsistency
Temporal Measurement
Inconsistency Measurement
content may change prior to final publication. Citation information: DOI 10.1109/TDSC.2024.3472064

FnoiseFnoise FtempFtemp
C×H×
C×W
H×W T×C×
T×H×
C×W
H×W

HW×C C
HW× C×WH
C×WH H×W×T×CT×C
H×W× H×W×C×TC×T
H×W×
W×W×
C×H×
C×TH×T H×C×
H×T×
C×WT×W
Query-Conv
Query-Conv Key-Conv
Key-Conv Query-Conv
Query-Conv Key-Conv
Key-Conv Reshape
Reshape
h w
Qn Qn Kn Kn C×H× W×W×
C×H×
C×TH×T Ftemp h
Ftemp Ftemp w
Ftemp
C×W
H×W Qt Qt Kt Kt

Value-Conv
Value-Conv Value-Conv
Value-Conv HT-Conv
HT-Conv TW-Conv
TW-Conv
S S Similarity
Similarity Vt Vt
Vn Vn

Reshape
calculation

Reshape
facemask calculation
facemask M̂ M̂ MM
filtering
filtering HW×1 1
HW×
1×1,Conv
1×1,Conv 1×1,Conv
1×1,Conv
H×W

Denoising
H×W

Denoising
AtempAtemp
Anoise

Norm
Anoise

Norm
MM
Sum
Sum

Rt1 Rt1 β β Rt2 R 2

T×
t C×
T×H×
C×W
H×W
H×W×T×TT×T
H×W× T×C×
T×H×
C×W
H×W M temp
M temp
H×W
H×W
HW×HWHW
HW×
α α

C×H× T×C×
T×H×
C×W
H×W
C×W
H×W

(a) Noise inconsistency measurement (b) Temporal inconsistency measurement

Fig. 3. The architecture of the proposed Temporal-IM and Noise-IM modules. Adjacent frames are fed into Temporal-IM on the left and noise features are
sent into Noise-IM on the right. HT-Conv and TW-Conv represent convolutions along height and width directions. ⊙ stands for matrix multiplication.⊕ and
⊗ denote
Noiseelement-wise
Inconsistency
Noise addition and multiplication respectively.
Measurement
Inconsistency Measurement
Temporal
TemporalInconsistency Measurement
Inconsistency Measurement

the fused features Ffi use from different stages to accomplish c ∈ RH×W , where positions with a value of 1 are turned
M
localization. More details are described as follows. into 0 and positions with a value of 0 become -1e9 as follws:
(
0, if Mi,j = 1
B. Noise inconsistency measurement Mi,j =
c (2)
−1e9, otherwise
It is observed that noise can be considered as an intrinsic
characteristic of images, tampering operation will disrupt the We aim to use M c to zero out the values in S corresponding to
coherence of the original features and leave different kinds non-face regions. To achieve this, Mc is reshaped to RHW ×1
of traces in noise domain, which is reflected as noise in- ′ HW ×HW
and then expanded to M ∈ R
c by repeating along
consistencies. In multi-face scenarios, the noise patterns of c′ is then
real faces and backgrounds tend to be more consistent than the columns to match the size of S. The expanded M
those of fake faces. Thus we design Noise-IM to uncover added to the negative dot-product, ensuring that after sof tmax
inconsistent information at a finer granularity from noise view. activation, the inconsistency scores for non-face regions are
As shown in Fig. 3(a), Noise-IM uses a face mask M obtained close to 0 and thus filtered out:
during the data preprocessing stage where the value of each c′ − Qn K T
M n
pixel indicates whether the place is a face area or not. Then Anoise = sof tmax( √ ). (3)
d
masked attention mechanism is applied to measure the noise
inconsistency among faces and between faces and background. Then we sum along each row and normalize Anoise to
Specifically, we use one 1×1 convolution to map input noise RHW ×1 , which is then reshaped to M f ∈ RH×W . Each
features Fnoise ∈ RC×H×W into Qn ∈ RHW ×C (Query), and entry m̃i,j (1 ≤ i ≤ H, 1 ≤ j ≤ W ) in M f measures
use two 1 × 1 convolutions to map noise features into K ∈ how inconsistent the noise pattern is with other faces or
RHW ×C , Vn ∈ RC×H×W (Key and Value). Attention maps background noise features, and a value close to 1 indicates
are then obtained to measure inconsistency between different more inconsistency and close to 0 otherwise. This summation
positions by computing their negative dot-product similarity: process allows us to model the relationships between fake
faces, real faces, and the background from a global perspective.
−Qn KnT Overall Mf will highlight areas of suspected tampering where
S = sof tmax( √ ), (1)
d the noise is inconsistent. Noise-IM updates noise features
√ Fnoise in the end:
where d denotes the scaling factor, S ∈ RHW ×HW indicates
the noise inconsistency score among every two patches in f ⊗ Vn + Fnoise ,
Fnoise = αM (4)
noise features.
To measure the noise similarity among faces and between where α is a trainable parameter to adaptively adjust the
faces and the background, we filtered the attention map influence of Noise-IM module.
using a face mask, retaining only the attention scores at To better facilitate the learning of Noise-IM module, at-
locations corresponding to face regions. More specifically, we tention map Mf is under the supervision of consistency maps
H×W
downsample M to match the size of input feature Fnoise in M̄gt ∈ R , which is generated from ground truth manip-
each stage. The elements of face mask M are updated to get ulation mask Mgt through bi-linear down-sampling to match

the size. Noise inconsistency loss is calculated using binary Furthermore, in order to extract inter-frame differences from
cross-entropy loss Lbce : different perspective, temporal features Ftemp is reshaped into
two coordinate-wise representations Ftemp h
∈ RW ×C×H×T
Lnoise = Lbce (M
f, M̄gt ). (5) w
and Ftemp ∈ R H×C×T ×W
, followed by convolution along
both height and width directions:
C. Temporal inconsistency measurement Rt2 = f1×1 (fht (Ftemp
h w
)) + f1×1 (ftw (Ftemp )), (7)
Video tampering process will unavoidably cause temporal
where f1×1 represents 1 × 1 convolution and fht , ftw are
inconsistencies since deepfake videos are synthesized frame-
stripe convolutions. Unlike traditional convolutions that focus
by-frame. Temporal inconsistencies caused by inter-frame dif-
on spatial dimensions, strip convolutions can simultaneously
ferences can be reflected in face motions and head movements,
capture temporal and spatial features along the height or
which is crucial for Deepfake detection and localization as
width to focus on inter-frame inconsistencies from different
they indicate the areas where tampering traces can be found.
directions, and correlate spatial information with temporal
In order to effectively capture suspicious traces of tampering
information to obtain Rt2 . Combining two temporal incon-
left behind during the tampering process, which is reflected in
sistency features to get more comprehensive representations,
inconsistencies in temporal domain, we introduce Temporal-
Temporal-IM will eventually update Ftemp :
IM to measure inconsistent information at a finer granularity.
For frame It to be detected, we utilize its adjacent frames Ftemp = (βRt1 + (1 − β)Rt2 ) + Ftemp , (8)
It−1 and It+1 to extract temporal inconsistency information.
where β is a trainable parameter to adaptively adjust the influ-
Denote features extracted by temporal backbone as Ftemp ∈
ence of two inconsistent features. Finally, extracting interme-
RT ×C×H×W , where T is the number of frames and is set to
diate frames Mtemp ∈ RC×H×W from inconsistency features
3 in this paper, C is the number of channels and H, W are
Ftemp to represent the inconsistent relationship between It
spatial dimensions.
and It−1 , It+1 , which is beneficial for indicating suspected
As shown in Fig. 3(b), we use different 1 × 1 convolutions
tampering areas. We downsample ground truth mask to match
to project Ftemp to Qt ∈ RH×W ×T ×C , Kt ∈ RH×W ×C×T ′
the corresponding size and get Mgt , the learning process is
(Query and Key). The negative cosine similarity score between
supervised by:
corresponding image patches is then calculated to represent
the degree of inconsistency between adjacent frames, denoted ′
Ltemp = Lbce (Mtemp , Mgt ). (9)
as Atemp ∈ RH×W ×T ×T . By summing the attention map
Atemp along the time dimension to ∈ RH×W ×T , we can D. Feature fusion module
measure the temporal inconsistency degree of each video
It is widely acknowledged that high-level features possess
region from a global perspective. However, the inter-frame
a relatively large perceptual field and strong ability to charac-
differences characterized in this way include not only the
terize semantic information, while displaying weaknesses in
abnormal inter-frame jitter introduced by video forgery but
the characterization of geometric information, i.e., the lack
also the normal facial motion differences. Considering that
of local texture feature details. On the contrary, low-level
real regions exhibit continuous and consistent motion changes,
features, which have smaller perceptual fields, possess strong
while forged regions may show random and intense jitter, we
abilities in the characterization of geometric detail information,
further use a fine-grained denoising operation to mitigate the
but exhibit limitations in characterizing semantic information.
interference of normal facial motion.
In order to fully utilize the information of different hierarchical
The denoising operation on Atemp first calculates the vari-
features, FFM is designed to fuse inconsistency features from
ance of each position, which characterizes the variability of
noise branch and temporal branch at different scales to get
inter-frame differences at each position. Regions with high
Ffi use , which is used for final classification and localization:
variance have large inter-frame differences and are more likely
to be forged regions, while real regions exhibit the opposite Ffi use = F F M (Fnoise
i i
, Ftemp ), i = 1, ..., N (10)
behavior. The variance is then compared to a threshold θ, and i
regions smaller than the threshold are filtered out. Now the where N indicates stage number, Fnoise
represents features
i
attention weights corresponding to real regions are reduced, from Noise-IM and Ftemp represents features from Temporal-
allowing the network to focus on regions with greater inter- IM. As is shown in Fig. 4, FFM consists of channel enhance-
frame differences, finally completing the denoising process. ment and spatial enhancement to accomplish feature fusion.
i
The threshold θ is manually set to the H-th largest variance in Specifically, noise features Fnoise ∈ RC×H×W and temporal
i C×H×W
our implementation. The denoised attention map is multiplied features Ftemp ∈ R are first concatenate by channel
with the value feature Vt to obtain the temporal inconsistency F̄ i ∈ R2C×H×W . Channel enhancement is then applied from
feature Rt1 ∈ RT ×C×H×W . The overall process can be global and local perspective:
expressed as follows: W = σ(fact (GAP (F̄ i ) + fact (F̄ i ))), (11)
Rt1 = fdenoise (mcos (Qt , Kt ))Vt , (6) bi
F =W⊗ i
Fnoise + (1 − W ) ⊗ i
Ftemp , (12)
where fdenoise represents the denoising operation and mcos where fact represents operation of a 1×1 convolution followed
denotes the negative cosine distance. by ReLU function and another 1×1 convolution, GAP stands

MaxPool
1×1,Conv

1×1,Conv
ReLU
GAP
i
Fnoise

7×7,Conv
Sigmoid

Sigmoid
Concat
Fˆ i
i
Fi Ffuse

1×1,Conv

1×1,Conv
ReLU
i
Ftemp

AvgPool
i
Fig. 4. The architecture of the FFM module. Features from noise branch Fnoise i
and features from temporal branch Ftemp are fused through channel
enhancement and spatial enhancement. ⊕ and ⊗ denote element-wise addition and multiplication respectively.

where Mpd is localization mask predicted by our model and

F i −1
F i −1 Mgt is the ground truth mask. The final loss is defined as
fuse
Channel fuse

enhancement follows:
L = λ1 Lnoise + λ2 Ltemp + λ3 Lcls + λ4 Lloc ,
Upsample

(15)
i Concat 1×1,Conv
F fuse where λ1 , λ2 , λ3 and λ4 are parameters, which are optimized
during the network training process.

Fig. 5. Schematic diagram of the up process. Ffi use represents fused features III. E XPERIMENTS
of smaller size and Ffi−1
use stands for larger size.
A. Datasets
for global average pooling operation. In order to enhance local In this paper, we conduct experiments on the widely-used
details and suppress irrelevant regions, spatial enhancement is datasets FFIW [41], DF-Platter [45], DFD [46] and FF++ [39].
1) FFIW: is a large-scale multi-face face forgery dataset,
1×1,Conv

1×1,Conv

adopted from CBAM [44] block:

ReLU
GAP

containing 10,000 high-fidelity real and synthetic videos re-

Ffi use = σ(f7×7 ([favg (Fbi ); fmax (Fbi )])) ⊗ Fbi , (13) spectively. It is constructed for multi-face face forgery detec-
Sigmoid

where σ represents sigmoid activation, favg and fmax are tion and the number of identities in each frame ranges from
1×1,Conv

1×1,Conv

one to fifteen, with three on average. There are three different

ReLU

average pooling and max pooling respectively.

face forgery techniques in FFIW, including two learning-
based methods(DeepFaceLab [47] and FSGAN [48]), and one
Upsample

E. Classification and localization

Features from theConcat
last stage of1×FFM
1,Conv n
Ff use are fed into graphic-based method(FaceSwap [49]) [41]. Both video- and
fully-connected layer for binary decisions and the binary cross face-level annotations are provided for forgery classification
entropy loss is calculated as forgery classification loss Lcls . and localization tasks.
To generate the final localization mask, fused features Ffi use 2) DF-Platter: is a large multi-face heterogeneous dataset,
from different stages are needed to avoid the noise caused by which improves the forgery quality of low-resolution videos. It
direct upsampling. As shown in Fig. 5, we upsample small contains a total of 133,260 videos in two resolutions and three
size features from stage i to match size of stage i − 1, e.g., video compression specifications, and the means of forgery
Ffi use of size 16H
× W i−1 H W include FSGAN [48], FaceSwap [49], and FaceShifter [50].
16 and Ff use of size 8 × 8 , then
i−1 In this paper, we choose the multi-face subsets SetB and SetC
Ffi use is upsampled to the size of Ff use . Two features are
for experiments.
concatenated followed by channel enhancement to highlight
3) DFD: is a forensic dataset containing over 363 original
important information and a 1 × 1 convolution is used to
sequences from 28 actors in 16 different scenes as well as
align channel dimension. The obtained feature Fefi−1 use of size
H W over 3000 manipulated videos using DeepFakes. Besides, three
× is used for next up process and output the predicted
8 8 video quality levels are provided in DFD(i.e., raw, c23, and
localization map eventually. Features at different scales can
c40) and manipulation masks are also provided. Considering
focus on inconsistent information at different levels, which is
the deployment in real-world application scenarios, our exper-
beneficial for aggregating global inconsistent information by
iments are conducted on both c23 and c40 videos.
progressive upsampling.
4) FaceForensics++(FF++): is by far the most popular
We use Dice loss to calculate localization loss to allevi-
dataset that contains 1000 original videos from YouTube
ate data imbalance since manipulated pixels are typically in
as well as 4000 associated manipulated videos created by
minority in a given image:
four forgery techniques: DeepFakes [51], Face2Face [52],
Lloc = LDice (Mpd , Mgt ), (14) FaceSwap [49] and NeuralTextures [53]. Three video quality

levels are provided in FF++(i.e., raw, c23, and c40) and TABLE I
manipulation masks are also provided. T HE QUANTITATIVE COMPARISONS AMONG RECENT METHODS AND THE
PROPOSED ON FFIW DATASET. F1 SCORE (%) AND I O U(%) ARE ADOPTED
FOR LOCALIZATION AND ACC(%) AND AUC(%) ARE ADOPTED FOR
B. Implementation details CLASSIFICATION . T HE BEST PERFORMANCES ARE MARKED AS BOLD .

1) Training strategy: The proposed MVIM network is im- Localization Classification

plemented using the PyTorch framework and the experiments Methods
F1 IoU ACC AUC
are performed on NVIDIA GTX GeForce 3090 GPU platform. FFD 71.71 67.02 86.62 86.53
We take ConvNeXt-Tiny [54] as our backbone network, which D&L 51.46 46.67 90.32 90.34
is widely used in computer vision field. MTCNN [55] is M2TR 65.32 61.19 96.79 96.80
used to detect faces and generate face masks while images SFFs 92.18 90.65 94.74 94.75
are resized to 512 × 512 as input to the network. During SDIML 94.17 92.68 93.65 93.61
the training process, the model is optimized by AdamW [56] LVNet 68.57 61.52 93.86 93.86
MVIM(Ours) 96.38 94.83 99.63 99.62
optimizer, the batch size is set to be 16, and the learning rate
is initialized by 5e-4 with a cosine annealing schedule.
2) Evaluation metrics: We evaluate the performance of TABLE II
the proposed model on both deepfake classification task and T HE QUANTITATIVE COMPARISONS AMONG RECENT CLASSIFICATION
METHODS AND THE PROPOSED ON FFIW DATASET. ACC(%) AND
localization task. Two commonly used metrics in deepfake AUC(%) ARE ADOPTED FOR CLASSIFICATION , AND THE BEST
classification task, including Acuracy(Acc) and Area Under PERFORMANCES ARE MARKED AS BOLD .
the ROC Curve(AUC) is reported. For deepfake localization
task, we adopt F1 score and Intersection over Union(IoU) as Methods ACC AUC
our evaluation metric. Xception 93.52 93.51
3) Baseline models: We compare our method with various PCL 90.92 91.93
GFFD 98.62 98.61
baseline methods as below:
LDAM 90.07 90.05
• FFD [29] produces an attention map using attention DDLM 94.10 94.09
mechanism to make the extracted features focus on MINet 89.70 89.67
manipulated regions. MVIM(Ours) 99.63 99.62
• D&L [37] introduces a two-stream multi-scale framework
based on the fusion of semantic-level and noise-level
localization ability of different methods. As illustrated in Table
features, which is capable of detecting and locating the
I, we observe that all methods show good classification results,
manipulated regions.
but the localization performances exhibit relatively substantial
• M2TR [31] captures the subtle manipulation artifacts at
variability, underscoring the significance of localization detec-
different scales using transformer models in a multi-task
tion.
learning approach.
FFD leverages an attention mechanism to facilitate effective
• SFFs [57] introduces a feature extractor through classi-
localization, however, its classification performance is some-
fication, localization and triplet loss training to extract
what limited. D&L and M2TR achieve better classification
specific forgery features with good discriminability.
accuracy at the cost of some localization performance. With
• SDIML [58] proposes an all-in-one solution to detect
the benefits of contrastive learning, SFFs has shown improved
and locate both shallow- and deep-fake images, which
classification and localization results. LVNet uses localization
is accomplished by modifying UPerNet [59] with noise
branche to enhance the classification features and achieves
features.
better classification results but there is more room for im-
• LVNet [38] introduces a two-stream network with three
provement in localization. Our proposed approach, MVIM,
well-designed functional modules to combine multi-
can better capture the suspected tampering traces and make
modal features and multi-scale patch features, effectively
the network focus on the tampering areas by measuring the
focusing on the potential forged regions.
degree of inconsistency from both noise and temporal views.
To be more specific, MVIM outperforms the state-of-the-art
C. Intra-dasaset comparison results M2TR by 2.84% ACC and 2.82% AUC for classification task,
To verify the superiority of the method in this paper, we first which is beneficial from multi-view inconsistency features.
carried out intra-dataset comparison experiments on FFIW, As for localization task, MVIM exceeds the state-of-the-art
DFD and FF++ datasets with some state-of-the-art methods. SDIML by 2.21% F1 score and 2.15% IoU, respectively. This
1) Comparison results on FFIW dataset: FFIW dataset is because we take the consistency of the noise distribution
contains multi-face tampered videos representing real applica- among faces and between faces and background into account,
tion scenarios, with many people active in the frame but only more in line with multi-face scenario. Moreover, we fuse
a small subset being manipulated. We conduct experiments on features from different stages to better aggregate inconsistency
this dataset to evaluate the performance of various forensic information at different scales, which is of great help for
methods in multi-face tampered video detection and local- localization.
ization tasks. Different methods are trained on FFIW dataset In addition, we have conducted comparative experiments on
and tested on its test subset. We report both classification and the FFIW dataset with several recent classification methods.

TABLE III
T HE QUANTITATIVE COMPARISONS AMONG RECENT METHODS AND THE PROPOSED ON DFD DATASETS . F1 SCORE (%) AND I O U(%) ARE ADOPTED FOR
LOCALIZATION AND ACC(%) AND AUC(%) ARE ADOPTED FOR CLASSIFICATION . T HE BEST PERFORMANCES ARE MARKED AS BOLD .

High Quality(c23) Low Quality(c40)

Methods Localization Classification Localization Classification
F1 IoU ACC AUC F1 IoU ACC AUC
FFD 67.15 60.62 80.57 77.42 53.36 46.76 67.42 61.57
D&L 59.63 51.80 90.03 88.25 46.76 39.34 73.84 68.38
M2TR 51.51 44.46 79.42 74.41 44.37 36.64 68.29 61.39
SFFs 85.83 82.48 89.16 86.48 66.21 61.70 73.98 68.43
SDIML 86.26 82.94 90.22 89.75 66.67 62.61 73.47 68.66
LVNet 48.25 37.14 84.15 84.56 41.06 30.59 74.12 68.46
MVIM(Ours) 87.20 83.96 90.99 90.85 67.49 63.23 75.39 69.80

TABLE IV
T HE QUANTITATIVE COMPARISONS AMONG RECENT METHODS AND THE PROPOSED ON FF++ DATASET C 40 VERSION . F1 SCORE (%) AND I O U(%) ARE
ADOPTED FOR LOCALIZATION AND ACC(%) AND AUC(%) ARE ADOPTED FOR CLASSIFICATION . T HE BEST PERFORMANCES ARE MARKED AS BOLD .

Deepfakes Face2Face FaceSwap

Methods Localization Classification Localization Classification Localization Classification
F1 IoU ACC AUC F1 IoU ACC AUC F1 IoU ACC AUC
FFD 87.64 85.31 91.83 91.84 80.58 78.22 88.47 88.48 79.70 77.15 88.15 87.48
D&L 66.69 91.76 92.93 92.94 63.96 61.01 89.95 89.96 68.73 66.12 90.14 90.15
M2TR 70.57 66.76 93.21 93.22 69.35 66.89 88.10 88.10 71.87 69.42 90.38 90.37
SFFs 88.98 87.33 91.31 91.32 85.01 83.84 88.76 88.77 86.05 84.43 88.42 87.63
SDIML 91.86 90.29 93.76 93.77 83.60 82.33 87.67 87.68 87.44 85.96 89.97 89.01
LVNet 77.12 69.64 94.01 94.02 71.65 64.82 87.91 87.91 76.68 70.18 92.06 91.93
MVIM(Ours) 93.74 93.71 93.86 93.87 87.86 86.38 90.76 90.77 91.56 91.43 92.36 91.99

The selected comparison methods include the classic classifi- view detection of tampering traces.
cation method Xception [60], the consistency learning method 3) Comparison results on FF++ dataset: To further il-
PCL [30], the high-frequency noise-based method GFFD [42], lustrate the validity and practicality of the proposed method,
the long-distance attention-based method LDAM [22], the we also carried out experiments on the most popular dataset
spatio-temporal dynamic difference learning method DDLM FF++, which is a single face dominated dataset with multiple
[23], and the latest representation learning method MINet [61]. tampering methods. Notice that the tamper masks provided
The experimental results are shown in Table II. All methods for NeuralTextures cover the entire facial area, while the
are capable of effectively judging the authenticity of video actual tampered area is limited to the mouth region only, so
samples, but our method achieves the best classification detec- we performed experiments on the remaining three methods.
tion results due to the fine-grained inconsistency information Considering the compression of common social platforms in
measurement module design. real life, we only carry out experiments on the c40 version
here. As shown in Table IV, proposed method MVIM achieves
2) Comparison results on DFD dataset: We also conduct state-of-the-art performance on FF++ dataset in the case of
comparison experiments on another popular datasets DFD, Face2Face and FaceSwap tampering methods for classification
most of videos contain only one person active in the scene. and localization tasks, and sub-optimal classification results
There are two compression versions in DFD dataset, c23 on Deepfakes. Comparison methods achieved decent classifi-
stands for low-level compression with high visual quality and cation accuracy for each tampering method, but there is still
c40 on the contrary. In general, detection performance on some room for improvement in the accuracy of localization.
low-quality videos is not as good as that on high-quality Since our method focuses on the generic tampering features
videos. This is because high compression rate will cause left by the tampering process in both noise and temporal
some texture details of the video to be lost, which is one domain, MVIM obtains better classification and localization
of the important information that the network needs to pay detection results under different tampering methods.
attention to. As listed in Table III, proposed method MVIM To compare the localization performance of different meth-
achieves state-of-the-art performance on both versions of DFD ods more intuitively, we present the predicted localization
dataset for classification and localization tasks. Comparison masks of different methods in Fig. 6. Notably, the compar-
methods mainly capture suspicious tampering traces from ison methods sometimes exhibit mislocalizations in different
spatial domain, which may be lost to some extent during the regions and fail to maintain consistent performance across
compression process. However, our approach offers improved diverse circumstances. For instance, FFD tends to overde-
performance by incorporating inconsistency information from tect appearing faces, such as the cases in the first and last
both noise and temporal view, thereby enhancing the multi- two columns. D&L and M2TR face challenges in accurately

9
66-42
Train23_189 785_99 25-393 Train168_43 Train69_132

Input
Ground-truth

大区域漏检多区域漏检
真实误检
小区域误检
FFD
D&L
M2TR
SFFs
SDIML
LVNet
MVIM(Ours)

Fig. 6. Visualization of the predicted localization mask by different methods. From top to bottom, we show input frames, GT masks, predictions of FFD,
D&L, M2TR, SFFs, SDIML, LVNet and our MVIM, respectively.

locating tampered areas, as shown in columns two through acquire all types of forged video samples for training in
six. SFFs may occasionally overlook frontal or minor faces practical applications. Cross-dataset experiments are carried
in the third and last column, and sometimes overdetect faces out to evaluate the generalization performance of different
in the remaining columns. Leveraging noise information and methods in the face of unknown samples.
the use of multi-scale features, SDIML performs well in the 1) Comparison results from FF++ to FFIW dataset:
first two columns but still experiences mislocalizations in the Firstly, we trained different methods on the c23 version of
last few columns. LVNet is able to locate the approximate the traditional FF++ dataset and tested them on the multi-face
location of the forged area in most cases except for the third FFIW dataset. The experimental results are shown in Table
and fifth columns, but is not sufficiently accurate. In contrast, V. The performance of all methods significantly decreased,
our method can not only achieve more accurate tampering indicating considerable differences between traditional single-
region localization but also more precise boundary estimation, face datasets and multi-face datasets. Cross-dataset detection
benefitting from inconsistency information derived from noise poses significant challenges and raises higher requirements
and temporal views. for future detection schemes. Nevertheless, our method still
achieved leading cross-dataset detection performance.
D. Cross-dataset comparison results 2) Comparison results from FFIW to DF-Platter dataset:
As generative techniques and forgery methods continue Furthermore, we selected the DF-Platter dataset, in addition
to evolve, it becomes increasingly difficult to anticipate and to the FFIW dataset, to further evaluate cross-dataset per-

TABLE V TABLE VII

T HE QUANTITATIVE COMPARISONS AMONG RECENT METHODS AND THE T HE QUANTITATIVE COMPARISONS AMONG RECENT METHODS AND THE
PROPOSED TRAINED ON FF++ AND TESTED ON FFIW DATASET. F1 PROPOSED ON DFD DATASET, WHICH ARE TRAINED ON C 23 VERSION AND
SCORE (%) AND I O U(%) ARE ADOPTED FOR LOCALIZATION AND ACC(%) TESTED ON C 40 VERSION OF DFD.
AND AUC(%) ARE ADOPTED FOR CLASSIFICATION . T HE BEST
PERFORMANCES ARE MARKED AS BOLD .
Localization Classification
Methods
F1 IoU ACC AUC
Localization Classification FFD 52.47 45.76 67.39 59.25
Methods
F1 IoU ACC AUC D&L 33.95 26.58 68.12 70.15
FFD 39.78 37.67 53.94 53.73 M2TR 45.45 38.09 69.93 63.08
D&L 32.44 28.06 55.64 55.65 SFFs 60.98 57.57 68.53 68.71
M2TR 35.15 32.75 57.47 57.39 SDIML 58.09 54.84 61.99 65.97
SFFs 48.02 46.89 54.60 54.35 LVNet 37.45 28.99 64.89 65.44
SDIML 46.89 45.52 56.73 56.60 MVIM(Ours) 68.76 63.52 76.41 71.43
LVNet 37.32 35.20 53.90 53.89
MVIM(Ours) 48.54 46.30 57.52 57.49

TABLE VI
T HE QUANTITATIVE COMPARISONS AMONG RECENT METHODS AND THE
PROPOSED TRAINED ON FFIW AND TESTED ON DF-P LATTER DATASET.
ACC(%) AND AUC(%) ARE ADOPTED FOR CLASSIFICATION , AND THE
BEST PERFORMANCES ARE MARKED AS BOLD .

Methods Classification
ACC AUC
FFD 56.90 59.01
D&L 83.40 84.31
M2TR 71.36 73.25 (a) Performance curves w.r.t. Gaussian Blur
SFFs 76.49 78.16
SDIML 83.00 83.69
LVNet 60.51 62.59
MVIM(Ours) 85.56 84.91

formance. Since this dataset does not provide precise tam-

pering masks, we trained our model on the FFIW dataset
and tested it on the DF-Platter dataset to compare the cross-
dataset classification performance of different methods. As
shown in Table VI, it can be seen that benefiting from the
modeling of relationships between faces in the noise space and
the introduction of temporal inconsistency information, our (b) Performance curves w.r.t. Gaussian Noise

proposed method MVIM still achieves the best cross-dataset

classification performance.

E. Robustness evaluation
1) Generalization results on DFD: To examine the robust-
ness of the proposed method MVIM under video compression,
we train different methods on high-quality(c23) version of
DFD, and test them on low-quality(c40). As shown in Table
VII, video compression has a relatively large impact on the
detection performance of different methods, leading to worse
results than trained and tested on low-quality version of (c) Performance curves w.r.t. JPEG compression
DFD. It is worth noting that SDIML performs well at a Fig. 7. Robustness evaluation against Gaussian Blur, Gaussian Noise and
specific compression rate, but the classification accuracy drops JPEG compression on FFIW. Localization F1 score and classification ACC
score are reported.
more when trained and tested at different compression rates,
indicating that the stability of the method is not robust enough.
The remaining methods demonstrate a relatively similar per- localization results, which illustrates its robust performance
formance on the classification task, while our method achieves under video compression.
the highest ACC and AUC metrics. Our MVIM considers both 2) Performance under various distortion: To further ex-
noise and temporal domains to effectively extract suspicious plore the performance of different methods under various
tampered traces, thereby mitigating the negative impact of distortions, we apply different image distortion methods to
video compression. Moreover, our approach exhibits the best raw frames from FFIW dataset and evaluate their localization

TABLE VIII TABLE IX

A BLATION STUDIES REGARDING DIFFERENT ARCHITECTURES ON FFIW A BLATION STUDIES REGARDING DIFFERENT MODULE DESIGNS FOR
DATASET. CAPTURING INCONSISTENCIES ON FFIW DATASET.

Module settings Localization Classification Localization Classification

Module settings
Noise-IM Temporal-IM FFM F1 IoU ACC AUC F1 IoU ACC AUC
92.64 91.07 95.62 95.61 Noise-IM w/o facemask filtering 95.47 93.97 99.26 99.26
! 94.47 93.04 96.82 96.81 Temporal-IM w/o denoising 96.11 94.30 99.54 99.53
! 94.61 93.23 96.99 96.98 Temporal-IM w/o bi-direction conv 95.93 94.20 99.38 99.38
! 93.69 92.26 96.20 96.19 MVIM(Ours) 96.38 94.83 99.63 99.62
! ! 95.02 93.49 98.55 98.54
! ! 95.16 93.71 98.99 98.98
! ! 95.47 93.97 98.50 98.49
! ! ! 96.38 94.83 99.63 99.62

and classification performance. The distortion types include:

1) Gaussian blurring with a kernel size k, 2) Gaussian noise
with a standard deviation σ, and 3) JPEG compression with a
quality factor q. As shown in Fig. 7, we compare the localiza-
tion performance (F1 scores) and classification performance
(ACC scores) of different methods on these corrupted data.
Gaussian blur is a data smoothing technique, that causes
loss of texture details in an image. The degree of distortion
144-43
caused by Gaussian blur is directly proportional to the kernel
size, and the degradation of the different methods in terms
of classification was generally greater than that observed in 168-76

localization. The most significant reduction in detection perfor-

mance of the different methods is observed when the Gaussian
kernel is 29. Our method maintains relatively stable detection Fig. 8. Visualization of averaged feature maps after fusion of noise and
under Gaussian blur of different sizes, outperforming the temporal features, brighter color indicating larger inconsistency in the region
and are more likely to be tampered areas. From left to right, we show input
comparison method. Gaussian noise is a common type of noise frames, GT masks, feature maps of our MVIM, respectively.
that appears at almost every point of the image. The larger 4352-44

the standard deviation of the Gaussian noise, the higher the

intensity of the noise and the more interference information on significance of both modules. The enhancement delivered
the image. The performance of the contrast methods degrades by Noise-IM emphasizes the beneficial role of noise co-
more significantly under the influence of Gaussian noise. JPEG 4380-22
occurrence within faces and their surroundings for deepfake
compression is a mainstream lossy image compression method detection and localization purposes. On the other hand, the
that can compress images into a small storage space and improvement resulting from Temporal-IM demonstrates the
is also the compression method used by some online social effectiveness of temporal inconsistency information in pro-
networks. As the compression factor decreases, the quality viding better results. In addition, FFM enables better fusion
of the image progressively reduces, consequently causing a of noise and temporal inconsistency features through channel
decrease in detection performance. In view of the fact that enhancement and spatial enhancement, which can further
our approach measures inconsistent information from both the improve the performance of the model.
noise and temporal view, our method demonstrates relatively To further validate the effectiveness of our module designs,
stable performance with desirable robustness in the face of we conducted ablation studies within each module, as shown
Gaussian blur, Gaussian noise, and JPEG compression. in Table IX. Specifically, we removed the facemask filtering
phase in the Noise-IM module, the denoising operation in
the Temporal-IM module, and the bi-direction convolution.
F. Ablation study Each modification led to a decrease in detection performance,
To demonstrate the benefit of each module, we evaluate demonstrating that our meticulously designed attention mecha-
the performance of our proposed model and its variants on nisms significantly enhance the model’s ability to capture noise
FFIW dataset. As presented in Table VIII, the baseline model and temporal inconsistencies, thereby improving detection
is implemented using ConvNeXt and does not encompass performance.
the proposed Noise-IM and Temporal-IM modules, alongside
FFM being substituted with bilinear pooling operation. With
the help of noise information and a progressive upsampling G. Visualization analysis
design, the baseline model is able to achieve decent results. In order to further demonstrate the superiority of our
With the incorporation of either the Noise-IM or Temporal- method, we present some visual results of MVIM on FFIW
IM module, our model emphasizes improved performance dataset. As shown in Fig. 8, we compare the feature maps
in both classification and localization tasks, highlighting the after fusion of noise and temporal features from MVIM

FFIW DFD
fuse noise inconsistency features from Noise-IM and temporal
inconsistency features from Temporal-IM to detect and locate
tampered regions. Extensive experiments with multiple state-
Original

Original
of-the-art methods in different benchmark datasets have been
performed to verify the superiority of our MVIM network.

ACKNOWLEDGMENTS
This work is conducted on RTAI cluster, which is supported
by School of Computer Science and Engineering and Institute
of Artificial Intelligence, Sun Yat-sen University.
FFD

FFD
R EFERENCES
[1] Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin,
“Variational autoencoder for deep learning of images, labels and cap-
tions,” in Proceedings of the 30th International Conference on Neural
Information Processing Systems (NIPS), 2016, pp. 2360–2368.
MVIM(Ours)

MVIM(Ours)
[2] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image
translation using cycle-consistent adversarial networks,” in Proceedings
of the IEEE International Conference on Computer Vision (ICCV), 2017,
pp. 2223–2232.
[3] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for
generative adversarial networks,” in Proceedings of the IEEE/CVF Con-
FFIW DFD
ference on Computer Vision and Pattern Recognition (CVPR). IEEE,
2019, pp. 4396–4405.
[4] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic mod-
Fig. 9. t-SNE visualization of features derived from different methods on the els,” in Proceedings of the 34th International Conference on Neural
test set of DFD and FFIW datasets. From top to bottom, we show original Information Processing Systems (NIPS), 2020, pp. 6840–6851.
features, FFD features and our MVIM features, respectively. The orange points [5] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-
represent the forgery samples and blue points stand for real samples. resolution image synthesis with latent diffusion models,” in Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion (CVPR), 2022, pp. 10 674–10 685.
with the ground truth localization mask. The results indicate [6] J. Morris, S. Newman, K. Palaniappan, J. Fan, and D. Lin, “”do
that MVIM assigns higher activation weights to suspected you know you are tracked by photos that you didnt take: Large-scale
tampered regions, allowing the network to focus on crucial location-aware multi-party image privacy protection,” IEEE Transactions
on Dependable and Secure Computing, 2021.
areas and effectively achieving the localization task.
[7] C. Liu, H. Chen, T. Zhu, J. Zhang, and W. Zhou, “Making deepfakes
We also visualize the feature distribution derived from more spurious: evading deep face forgery detection via trace removal
both the comparison method FFD and our MVIM by t-SNE attack,” IEEE Transactions on Dependable and Secure Computing, 2023.
algorithm [62] on two datasets in Fig. 9. Initially, the two [8] C. Yu, X. Zhang, Y. Duan, S. Yan, Z. Wang, Y. Xiang, S. Ji, and W. Chen,
“Diff-id: An explainable identity difference quantification framework
datasets exhibit a random distribution. However, both FFD for deepfake detection,” IEEE Transactions on Dependable and Secure
and our method can effectively distinguish between forgery Computing, 2024.
and real samples. Our method produces a more compact [9] F. Matern, C. Riess, and M. Stamminger, “Exploiting visual artifacts to
expose deepfakes and face manipulations,” in Proceedings of the IEEE
feature distribution on FFIW dataset and a more discriminative Winter Applications of Computer Vision Workshops (WACVW). IEEE,
distribution on DFD dataset, which benefits from our multi- 2019, pp. 83–92.
view inconsistency measurement to ensure the discriminative [10] D.-T. Dang-Nguyen, G. Boato, and F. G. De Natale, “Discrimination be-
tween computer generated and natural human faces based on asymmetry
power of feature representations learned by our method. information,” in Proceedings of the 20th European Signal Processing
Conference (EUSIPCO). IEEE, 2012, pp. 1234–1238.
IV. C ONCLUSION [11] L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen, and B. Guo, “Face
x-ray for more general face forgery detection,” in Proceedings of the
In this paper, we propose a Multi-View Inconsistency IEEE/CVF Conference on Computer Vision and Pattern Recognition
Measurement (MVIM) network to measure inconsistencies (CVPR), 2020, pp. 5001–5010.
[12] B. Peng, W. Wang, J. Dong, and T. Tan, “Optimized 3D lighting
from both noise and temporal view. Taking into account environment estimation for image forgery detection,” IEEE Transactions
the defects of existing deepfake video generation methods, a on Information Forensics and Security, vol. 12, no. 2, pp. 479–494, 2016.
novel Noise Inconsistency Measurement (Noise-IM) module [13] S. J. Sohrawardi, A. Chintha, B. Thai, S. Seng, A. Hickerson, R. Ptucha,
and M. Wright, “Poster: Towards robust open-world detection of
is designed to identify regions with inconsistent noise patterns deepfakes,” in Proceedings of the 2019 ACM SIGSAC Conference on
in multi-face scenarios where the noise patterns of fake faces Computer and Communications Security, 2019, pp. 2613–2615.
are inconsistent with those of real faces and backgrounds. [14] B. Zi, M. Chang, J. Chen, X. Ma, and Y.-G. Jiang, “Wilddeepfake: A
challenging real-world dataset for deepfake detection,” in Proceedings
And a Temporal Inconsistency Measurement (Temporal-IM) of the 28th ACM International Conference on Multimedia, 2020, pp.
module is developed to capture suspicious tampering traces 2382–2390.
between frames in temporal view, observing that facial jitter [15] A. Haliassos, K. Vougioukas, S. Petridis, and M. Pantic, “Lips don’t
lie: A generalisable and robust approach to face forgery detection,”
of tampered regions is more intense than real regions. In in Proceedings of the IEEE/CVF Conference on Computer Vision and
addition, a Feature Fusion Module (FFM) is proposed to Pattern Recognition (CVPR), 2021, pp. 5039–5049.

[16] S. Fernandes, S. Raj, E. Ortiz, I. Vintila, M. Salter, G. Urosevic, and [37] C. Kong, B. Chen, H. Li, S. Wang, A. Rocha, and S. Kwong, “Detect
S. Jha, “Predicting heart rate variations of deepfake videos using neural and locate: Exposing face manipulation by semantic-and noise-level
ode,” in Proceedings of the IEEE/CVF International Conference on telltales,” IEEE Transactions on Information Forensics and Security,
Computer Vision Workshops (ICCVW), 2019, pp. 1721–1729. vol. 17, pp. 1741–1756, 2022.
[17] Z. Gu, T. Yao, C. Yang, R. Yi, S. Ding, and L. Ma, “Region- [38] C. Shuai, J. Zhong, S. Wu, F. Lin, Z. Wang, Z. Ba, Z. Liu, L. Cavallaro,
aware temporal inconsistency learning for deepfake video detection,” and K. Ren, “Locate and verify: A two-stream network for improved
in Proceedings of the 31th International Joint Conference on Artificial deepfake detection,” in Proceedings of the 31st ACM International
Intelligence (IJCAI), vol. 1, 2022. Conference on Multimedia, 2023, pp. 7131–7142.
[18] Z. Gu, Y. Chen, T. Yao, S. Ding, J. Li, and L. Ma, “Delving into the [39] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and
local: Dynamic inconsistency learning for deepfake video detection,” in M. Nießner, “Faceforensics++: Learning to detect manipulated facial
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, images,” in Proceedings of the IEEE International Conference on
no. 1, 2022, pp. 744–752. Computer Vision (ICCV), 2019, pp. 1–11.
[19] Z. Yu, R. Cai, Z. Li, W. Yang, J. Shi, and A. C. Kot, “Benchmarking [40] Y. Li, X. Yang, P. Sun, H. Qi, and S. Lyu, “Celeb-df: A large-
joint face spoofing and forgery detection with visual and physiological scale challenging dataset for deepfake forensics,” in Proceedings of
cues,” IEEE Transactions on Dependable and Secure Computing, 2024. the IEEE/CVF Conference on Computer Vision and Pattern Recognition
[20] C. Zhu, B. Zhang, Q. Yin, C. Yin, and W. Lu, “Deepfake detection (CVPR), 2020, pp. 3207–3216.
via inter-frame inconsistency recomposition and enhancement,” Pattern [41] T. Zhou, W. Wang, Z. Liang, and J. Shen, “Face forensics in the wild,”
Recognition, p. 110077, 2023. in Proceedings of the IEEE/CVF Conference on Computer Vision and
[21] D. Zhang, F. Lin, Y. Hua, P. Wang, D. Zeng, and S. Ge, “Deepfake video Pattern Recognition (CVPR), 2021, pp. 5778–5788.
detection with spatiotemporal dropout transformer,” in Proceedings of [42] Y. Luo, Y. Zhang, J. Yan, and W. Liu, “Generalizing face forgery de-
the 30th ACM International Conference on Multimedia, 2022, pp. 5833– tection with high-frequency features,” in Proceedings of the IEEE/CVF
5841. Conference on Computer Vision and Pattern Recognition (CVPR), 2021,
[22] W. Lu, L. Liu, B. Zhang, J. Luo, X. Zhao, Y. Zhou, and J. Huang, pp. 16 317–16 326.
“Detection of deepfake videos using long-distance attention,” IEEE [43] B. Bayar and M. C. Stamm, “Constrained convolutional neural networks:
Transactions on Neural Networks and Learning Systems, 2023. A new approach towards general purpose image manipulation detection,”
[23] Q. Yin, W. Lu, B. Li, and J. Huang, “Dynamic difference learning IEEE Transactions on Information Forensics and Security, vol. 13,
with spatio-temporal correlation for deepfake video detection,” IEEE no. 11, pp. 2691–2706, 2018.
Transactions on Information Forensics and Security, vol. 18, pp. 4046– [44] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “Cbam: Convolutional
4058, 2023. block attention module,” in Proceedings of the European Conference on
[24] J. Deng, C. Lin, P. Hu, C. Shen, Q. Wang, Q. Li, and Q. Li, “Towards Computer Vision (ECCV), 2018, pp. 3–19.
benchmarking and evaluating deepfake detection,” IEEE Transactions [45] K. Narayan, H. Agarwal, K. Thakral, S. Mittal, M. Vatsa, and R. Singh,
on Dependable and Secure Computing, pp. 1–16, 2024. “Df-platter: Multi-face heterogeneous deepfake dataset,” in Proceedings
[25] P. Zhou, X. Han, V. I. Morariu, and L. S. Davis, “Learning rich features of the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
for image manipulation detection,” in Proceedings of the IEEE/CVF tion (CVPR), 2023, pp. 9739–9748.
Conference on Computer Vision and Pattern Recognition (CVPR), 2018, [46] Google ai blog:contributing data to deepfake detection
pp. 1053–1061. research. [Online]. Available: https://ai.googleblog.com/2019/09/
[26] J. H. Bappy, C. Simons, L. Nataraj, B. Manjunath, and A. K. Roy- contributing-data-to-deepfake-detection.html.
Chowdhury, “Hybrid lstm and encoder–decoder architecture for de- [47] I. Perov, D. Gao, N. Chervoniy, K. Liu, S. Marangonda, C. Umé,
tection of image forgeries,” IEEE Transactions on Image Processing, M. Dpfks, C. S. Facenheim, L. RP, J. Jiang et al., “Deepfacelab:
vol. 28, no. 7, pp. 3286–3300, 2019. Integrated, flexible and extensible face-swapping framework,” arXiv
[27] X. Hu, Z. Zhang, Z. Jiang, S. Chaudhuri, Z. Yang, and R. Nevatia, preprint arXiv:2005.05535, 2020.
“Span: Spatial pyramid attention network for image manipulation lo- [48] Y. Nirkin, Y. Keller, and T. Hassner, “Fsgan: Subject agnostic face
calization,” in Proceedings of the European Conference on Computer swapping and reenactment,” in Proceedings of the IEEE International
Vision (ECCV), 2020, pp. 312–328. Conference on Computer Vision (ICCV), 2019, pp. 7184–7193.
[28] W. Lu, W. Xu, and Z. Sheng, “An interpretable image tampering [49] Faceswap. [Online]. Available: https://www.github.com/MarekKowalski/
detection approach based on cooperative game,” IEEE Transactions on FaceSwap.
Circuits and Systems for Video Technology, vol. 33, no. 2, pp. 952–962, [50] L. Li, J. Bao, H. Yang, D. Chen, and F. Wen, “Faceshifter: To-
2023. wards high fidelity and occlusion aware face swapping,” arXiv preprint
[29] H. Dang, F. Liu, J. Stehouwer, X. Liu, and A. K. Jain, “On the arXiv:1912.13457, 2019.
detection of digital face manipulation,” in Proceedings of the IEEE/CVF [51] Deepfakes. [Online]. Available: https://github.com/deepfakes/faceswap.
Conference on Computer Vision and Pattern Recognition (CVPR), 2020, [52] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Niessner,
pp. 5781–5790. “Face2face: Real-time face capture and reenactment of rgb videos,”
[30] T. Zhao, X. Xu, M. Xu, H. Ding, Y. Xiong, and W. Xia, “Learning self- in Proceedings of the IEEE/CVF Conference on Computer Vision and
consistency for deepfake detection,” in Proceedings of the IEEE/CVF Pattern Recognition (CVPR), 2016, pp. 2387–2395.
Conference on Computer Vision and Pattern Recognition(CVPR), 2021, [53] J. Thies, M. Zollhöfer, and M. Nießner, “Deferred neural rendering:
pp. 15 023–15 033. Image synthesis using neural textures,” Acm Transactions on Graphics
[31] J. Wang, Z. Wu, W. Ouyang, X. Han, J. Chen, Y.-G. Jiang, and S.-N. Li, (TOG), vol. 38, no. 4, pp. 1–12, 2019.
“M2tr: Multi-modal multi-scale transformers for deepfake detection,” in [54] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A
Proceedings of the International Conference on Multimedia Retrieval convnet for the 2020s,” in Proceedings of the IEEE/CVF Conference on
(ICMR), 2022, pp. 615–623. Computer Vision and Pattern Recognition (CVPR), 2022, pp. 11 976–
[32] J. Wang, Y. Sun, and J. Tang, “Lisiam: Localization invariance siamese 11 986.
network for deepfake detection,” IEEE Transactions on Information [55] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and
Forensics and Security, vol. 17, pp. 2425–2436, 2022. alignment using multitask cascaded convolutional networks,” IEEE
[33] K. Songsri-in and S. Zafeiriou, “Complement face forensic detection and signal processing letters, vol. 23, no. 10, pp. 1499–1503, 2016.
localization with faciallandmarks,” arXiv preprint arXiv:1910.05455, [56] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,”
2019. arXiv preprint arXiv:1711.05101, 2017.
[34] B. Chen, X. Ju, B. Xiao, W. Ding, Y. Zheng, and V. H. C. de Albu- [57] P. Yu, J. Fei, Z. Xia, Z. Zhou, and J. Weng, “Improving generalization
querque, “Locally gan-generated face detection based on an improved by commonality learning in face forgery detection,” IEEE Transactions
xception,” Information Sciences, vol. 572, pp. 16–28, 2021. on Information Forensics and Security, vol. 17, pp. 547–558, 2022.
[35] Y. Huang, F. Juefei-Xu, Q. Guo, Y. Liu, and G. Pu, “Fakelocator: Robust [58] J. Zhang, H. Tohidypour, Y. Wang, and P. Nasiopoulos, “Shallow-
localization of gan-based face manipulations,” IEEE Transactions on and deep-fake image manipulation localization using deep learning,” in
Information Forensics and Security, vol. 17, pp. 2657–2672, 2022. Proceedings of the International Conference on Computing, Networking
[36] P. Chen, J. Liu, T. Liang, C. Yu, S. Zou, J. Dai, and J. Han, “Dlfmnet: and Communications (ICNC). IEEE, 2023, pp. 468–472.
End-to-end detection and localization of face manipulation using multi- [59] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, “Unified perceptual
domain features,” in Proceedings of the IEEE International Conference parsing for scene understanding,” in Proceedings of the European
on Multimedia and Expo (ICME). IEEE, 2021, pp. 1–6. Conference on Computer Vision (ECCV), 2018, pp. 418–434.

[60] F. Chollet, “Xception: Deep learning with depthwise separable convolu-

tions,” in Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition(CVPR), 2017, pp. 1251–1258.
[61] Z. Ba, Q. Liu, Z. Liu, S. Wu, F. Lin, L. Lu, and K. Ren, “Exposing the
deception: Uncovering more forgery clues for deepfake detection,” in
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38,
no. 2, 2024, pp. 719–728.
[62] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” Journal
of machine learning research, vol. 9, no. 11, 2008.

Bolin Zhang received the B.S. degree in computer

science and technology from Sun Yat-sen University,
China, in 2021, and the M.S degree in computer
science and technology from Sun Yat-sen University,
Guangzhou, China, in 2024. His research interests
include multimedia forensics and security.

Qilin Yin received the B.S. degree and M.S. degree

from Nanjing University of Information Science
and Technology, in 2018 and 2021. He is currently
pursuing the Ph.D. degree in School of Computer
Science and Engineering, Sun Yat-sen University,
Guangzhou, China. His research interests include
multimedia forensics and security.

Wei Lu received the B.S. degree in automation

from Northeast University, China in 2002, the M.S.
degree and the Ph.D. degree in computer science
from Shanghai Jiao Tong University, China in 2005
and 2007 respectively. He was a research assistant
at Hong Kong Polytechnic University from 2006 to
2007. He is currently a Professor with the School
of Computer Science and Engineering, Sun Yat-sen
University, Guangzhou, China. His research interests
include multimedia forensics and security, AI secu-
rity.

Xiangyang Luo received his B.S., M.S., and Ph.D.

degrees from the State Key Laboratory of Math-
ematical Engineering and Advanced Computing,
Zhengzhou, China, in 2001, 2004, and 2010, respec-
tively. He is the author or co-author of more than
100 refereed international journal and conference
papers. He is currently a Professor of the State
Key Laboratory of Mathematical Engineering and
Advanced Computing. His research interests include
multimedia forensics and security.

Mover: Mask and Recovery Based Facial Part Consistency Aware Method For Deepfake Video Detection
No ratings yet
Mover: Mask and Recovery Based Facial Part Consistency Aware Method For Deepfake Video Detection
10 pages
ATTENDANCE 2nd Quarter (AutoRecovered)
No ratings yet
ATTENDANCE 2nd Quarter (AutoRecovered)
3 pages
Kombolcha
No ratings yet
Kombolcha
22 pages
IEEE Xplore Reference Download 2025.5.29.11.51.9
No ratings yet
IEEE Xplore Reference Download 2025.5.29.11.51.9
2 pages
2405.08463v1
No ratings yet
2405.08463v1
25 pages
ISTVT Interpretable Spatial-Temporal Video Transformer For Deepfake Detection
No ratings yet
ISTVT Interpretable Spatial-Temporal Video Transformer For Deepfake Detection
14 pages
Evolving From Single-Modal To Multi-Modal Facial D
No ratings yet
Evolving From Single-Modal To Multi-Modal Facial D
22 pages
CViT
No ratings yet
CViT
29 pages
RPS Read Aloud
No ratings yet
RPS Read Aloud
3 pages
2410.07888v1
No ratings yet
2410.07888v1
10 pages
Major Report
No ratings yet
Major Report
46 pages
Final Repot Face Project
No ratings yet
Final Repot Face Project
46 pages
Ap21 SG Seminar Eoc
No ratings yet
Ap21 SG Seminar Eoc
12 pages
Uia-Vit: Unsupervised Inconsistency-Aware Method Based On Vision Transformer For Face Forgery Detection
No ratings yet
Uia-Vit: Unsupervised Inconsistency-Aware Method Based On Vision Transformer For Face Forgery Detection
17 pages
GTA-Net A Robust Method For Deepfake Face Image Detection
100% (1)
GTA-Net A Robust Method For Deepfake Face Image Detection
6 pages
It Wasnt Me Irregular Identity in Deepfake Videos
No ratings yet
It Wasnt Me Irregular Identity in Deepfake Videos
5 pages
Fang 2022 J. Phys. Conf. Ser. 2224 012014
No ratings yet
Fang 2022 J. Phys. Conf. Ser. 2224 012014
7 pages
DIP Lab 2
No ratings yet
DIP Lab 2
1 page
Large Scale Benchmark For Content Driven
No ratings yet
Large Scale Benchmark For Content Driven
19 pages
Deepfake Detection Based On Cross-Domain Local Characteristic Analysis With Multi-Domain Transformer
No ratings yet
Deepfake Detection Based On Cross-Domain Local Characteristic Analysis With Multi-Domain Transformer
18 pages
Researchpaper New
No ratings yet
Researchpaper New
17 pages
Wang AltFreezing For More General Video Face Forgery Detection CVPR 2023 Paper
No ratings yet
Wang AltFreezing For More General Video Face Forgery Detection CVPR 2023 Paper
10 pages
Jaya Term-1
No ratings yet
Jaya Term-1
38 pages
Gujarat Technological University
No ratings yet
Gujarat Technological University
1 page
Vod: Learning Volume of Differences For Video-Based Deepfake Detection
No ratings yet
Vod: Learning Volume of Differences For Video-Based Deepfake Detection
9 pages
Identity Consistency For Enhanced Deep Fake Detection in Video Content Using Deep Learning
No ratings yet
Identity Consistency For Enhanced Deep Fake Detection in Video Content Using Deep Learning
6 pages
F2Trans High-Frequency Fine-Grained Transformer For Face Forgery Detection
No ratings yet
F2Trans High-Frequency Fine-Grained Transformer For Face Forgery Detection
13 pages
2307.08317v1
No ratings yet
2307.08317v1
11 pages
MINTIME Multi-Identity Size-Invariant Video Deepfake Detection
No ratings yet
MINTIME Multi-Identity Size-Invariant Video Deepfake Detection
13 pages
Robust Manipulated Media Localization and Detection Based
No ratings yet
Robust Manipulated Media Localization and Detection Based
17 pages
Exploring Spatial - Temporal Features Fusion Model For Deepfake Video Detection
No ratings yet
Exploring Spatial - Temporal Features Fusion Model For Deepfake Video Detection
12 pages
Detection of Deep Fake in Face Images Based Machin
No ratings yet
Detection of Deep Fake in Face Images Based Machin
12 pages
Chapter - 7:referesce
No ratings yet
Chapter - 7:referesce
3 pages
Detection of Deepfake Videos Using Long Distance Attention 2
No ratings yet
Detection of Deepfake Videos Using Long Distance Attention 2
10 pages
Bita-Net Bi-Temporal Attention Network For Facial Video Forgery Detection
No ratings yet
Bita-Net Bi-Temporal Attention Network For Facial Video Forgery Detection
8 pages
Information Security UNIT-3 Notes
No ratings yet
Information Security UNIT-3 Notes
48 pages
Deep Fake Detection - Finalized
No ratings yet
Deep Fake Detection - Finalized
8 pages
LSM6DS3 Datasheet
No ratings yet
LSM6DS3 Datasheet
100 pages
Paper 91-Exploiting Deepfakes by Analyzing Temporal Feature Inconsistency
No ratings yet
Paper 91-Exploiting Deepfakes by Analyzing Temporal Feature Inconsistency
9 pages
Forensic Symmetry For DeepFakes
No ratings yet
Forensic Symmetry For DeepFakes
16 pages
Aisc Presentation
No ratings yet
Aisc Presentation
15 pages
Deep Fake
No ratings yet
Deep Fake
7 pages
2020 04 16 ICPR 2004.07676v1
No ratings yet
2020 04 16 ICPR 2004.07676v1
8 pages
v1 Covered
No ratings yet
v1 Covered
17 pages
Deepfake Detection Based On The Discrepancy Between The Face and Its Context
No ratings yet
Deepfake Detection Based On The Discrepancy Between The Face and Its Context
11 pages
Detecting Deepfake, Faceswap and Face2Face Facial Forgeries Using Frequency CNN
No ratings yet
Detecting Deepfake, Faceswap and Face2Face Facial Forgeries Using Frequency CNN
18 pages
An Ensemble of CNNs With Self-Attention Mechanism For DeepFake
No ratings yet
An Ensemble of CNNs With Self-Attention Mechanism For DeepFake
17 pages
Horary Astrology - : Horarya Strologyor Prasnasashtra
No ratings yet
Horary Astrology - : Horarya Strologyor Prasnasashtra
3 pages
没代码ProActive DeepFake Detection Using GAN-based Visible
No ratings yet
没代码ProActive DeepFake Detection Using GAN-based Visible
27 pages
In-The-Wild Deepfake Detection Using Adaptable CNN Models With Visual Class Activation Mapping For Improved Accuracy
No ratings yet
In-The-Wild Deepfake Detection Using Adaptable CNN Models With Visual Class Activation Mapping For Improved Accuracy
6 pages
Interactive Two-Stream Network Across Modalities For Deepfake Detection
No ratings yet
Interactive Two-Stream Network Across Modalities For Deepfake Detection
13 pages
Protecting Celebrities From Deepfake With Identity Consistency Transformer
No ratings yet
Protecting Celebrities From Deepfake With Identity Consistency Transformer
11 pages
Video Deepfake Detection Using Particle Swarm Optimization Improved Deep Neural Networks
No ratings yet
Video Deepfake Detection Using Particle Swarm Optimization Improved Deep Neural Networks
37 pages
Research Paper Draft
No ratings yet
Research Paper Draft
12 pages
Leveraging Real Talking Faces Via Self-Supervision For Robust Forgery Detection
No ratings yet
Leveraging Real Talking Faces Via Self-Supervision For Robust Forgery Detection
17 pages
2019 01 25 Faceforensic PP 1901.08971v3
No ratings yet
2019 01 25 Faceforensic PP 1901.08971v3
14 pages
8607
50% (2)
8607
45 pages
NeurIPS 2022 Delving Into Sequential Patches For Deepfake Detection Paper Conference
No ratings yet
NeurIPS 2022 Delving Into Sequential Patches For Deepfake Detection Paper Conference
14 pages
ERSE All PYQs
No ratings yet
ERSE All PYQs
6 pages
Global-Local Facial Fusion Based GAN Generated Fake Faces
No ratings yet
Global-Local Facial Fusion Based GAN Generated Fake Faces
18 pages
Paper (Related Project-3)
No ratings yet
Paper (Related Project-3)
9 pages
Essentials of Organizational Behavior 14th Edition Robbins Test Bank Instant Download
100% (4)
Essentials of Organizational Behavior 14th Edition Robbins Test Bank Instant Download
51 pages
Deepfake Video Detection Using Convolutional Vision Transformer
No ratings yet
Deepfake Video Detection Using Convolutional Vision Transformer
9 pages
Unit Test-1 Portion and Time Table Grade IX - Compressed
No ratings yet
Unit Test-1 Portion and Time Table Grade IX - Compressed
3 pages
IJRPR7765
No ratings yet
IJRPR7765
5 pages
Deepfake Video Detection Using Convolutional Visio
No ratings yet
Deepfake Video Detection Using Convolutional Visio
9 pages
Subudhi Techno Engineers PVTLTD Checklist For Reiki Survey: SN Details Remarks Option Choice (Y/N)
No ratings yet
Subudhi Techno Engineers PVTLTD Checklist For Reiki Survey: SN Details Remarks Option Choice (Y/N)
16 pages
Effective Technical Communication in English - All U - 230502 - 172912
No ratings yet
Effective Technical Communication in English - All U - 230502 - 172912
149 pages
Multi-Task Learning For Detecting and Segmenting Manipulated Facial Images and Videos
No ratings yet
Multi-Task Learning For Detecting and Segmenting Manipulated Facial Images and Videos
8 pages
Erse Unit III
No ratings yet
Erse Unit III
9 pages
Information Security UNIT-1 Notes
No ratings yet
Information Security UNIT-1 Notes
26 pages
2 Polynomials Objective WS CBSE
No ratings yet
2 Polynomials Objective WS CBSE
4 pages
WLP - PR2 - Week 2
No ratings yet
WLP - PR2 - Week 2
3 pages
Copper Oxide Nanoparticles Thesis
No ratings yet
Copper Oxide Nanoparticles Thesis
8 pages
Chem 201 Experiment 5 - Lab Report
No ratings yet
Chem 201 Experiment 5 - Lab Report
3 pages
National Security Strategy of Japan 12:2022
No ratings yet
National Security Strategy of Japan 12:2022
36 pages
Mesonet: A Compact Facial Video Forgery Detection Network
No ratings yet
Mesonet: A Compact Facial Video Forgery Detection Network
7 pages
Deepfake Detection Techniques: A Review: Neeraj Guhagarkar, Sanjana Desai Swanand Vaishyampayan, Ashwini Save
No ratings yet
Deepfake Detection Techniques: A Review: Neeraj Guhagarkar, Sanjana Desai Swanand Vaishyampayan, Ashwini Save
10 pages
Effective Technical Communication Skills Manuals Handwritten
No ratings yet
Effective Technical Communication Skills Manuals Handwritten
11 pages
EMERITUS PROFESSOR OF EMINENCE Professor M. M. Sharma
No ratings yet
EMERITUS PROFESSOR OF EMINENCE Professor M. M. Sharma
1 page
Where Do Deep Fakes Look? Synthetic Face Detection Via Gaze Tracking
No ratings yet
Where Do Deep Fakes Look? Synthetic Face Detection Via Gaze Tracking
14 pages
Unit 2 NLP Part 1
No ratings yet
Unit 2 NLP Part 1
20 pages
Chapter 5 - Short-Term and Working Memory
100% (1)
Chapter 5 - Short-Term and Working Memory
32 pages
C1 Advanced Speaking Part 1: Teacher's Notes
No ratings yet
C1 Advanced Speaking Part 1: Teacher's Notes
5 pages
Learning To Recognize Patch-Wise Consistency For Deepfake Detection
No ratings yet
Learning To Recognize Patch-Wise Consistency For Deepfake Detection
13 pages
HW1 Solution
No ratings yet
HW1 Solution
5 pages
Quick Guide FDX Console
No ratings yet
Quick Guide FDX Console
5 pages
Beam String 5
No ratings yet
Beam String 5
19 pages
Q2 Whirlpool Case
17% (6)
Q2 Whirlpool Case
21 pages
Unit 2 NLP Part 2
No ratings yet
Unit 2 NLP Part 2
12 pages
Unit 1 NLP
No ratings yet
Unit 1 NLP
10 pages
Unit 3 and 4 NLP
No ratings yet
Unit 3 and 4 NLP
10 pages
CIVWARE Lecture Topic 5.1 (Water Treatment) PDF
No ratings yet
CIVWARE Lecture Topic 5.1 (Water Treatment) PDF
19 pages
Information Security UNIT-5 Notes
No ratings yet
Information Security UNIT-5 Notes
20 pages
Best & Tyler 2007
No ratings yet
Best & Tyler 2007
12 pages
Data Science Dan Artificial Intelligence:: Konsep, Teori, Teknik, Tools, Dan Aplikasi
No ratings yet
Data Science Dan Artificial Intelligence:: Konsep, Teori, Teknik, Tools, Dan Aplikasi
57 pages
Catalogue - Contact Rivets
No ratings yet
Catalogue - Contact Rivets
10 pages
Civil Engineering Broucher
No ratings yet
Civil Engineering Broucher
13 pages
Project Report Final
No ratings yet
Project Report Final
27 pages

Deepfake - Detection - and - Localization - Using - Multi-View - Inconsistency - Measurement

Uploaded by

Deepfake - Detection - and - Localization - Using - Multi-View - Inconsistency - Measurement

Uploaded by

This article has been accepted for publication in IEEE Transactions on Dependable and Secure Computing.

Deepfake Detection and Localization Using

Abstract—As deepfake technology advances, forgery detection

T HE continuous development of deep learning techniques

Temporal-IM Temporal-IM Temporal-IM Temporal-IM

Noise-IM Noise-IM Noise-IM Noise-IM

attention and bi-direction convolutions, ultimately leading both branches,Key-Conv

MVIM in detail in Section II. Experiments on public datasets C×H×W

Rt1 Rt1 β β Rt2 R 2

(a) Noise inconsistency measurement (b) Temporal inconsistency measurement

where Mpd is localization mask predicted by our model and

adopted from CBAM [44] block:

containing 10,000 high-fidelity real and synthetic videos re-

one to fifteen, with three on average. There are three different

average pooling and max pooling respectively.

E. Classification and localization

1) Training strategy: The proposed MVIM network is im- Localization Classification

High Quality(c23) Low Quality(c40)

Deepfakes Face2Face FaceSwap

TABLE V TABLE VII

formance. Since this dataset does not provide precise tam-

proposed method MVIM still achieves the best cross-dataset

TABLE VIII TABLE IX

Module settings Localization Classification Localization Classification

and classification performance. The distortion types include:

localization. The most significant reduction in detection perfor-

the standard deviation of the Gaussian noise, the higher the

[60] F. Chollet, “Xception: Deep learning with depthwise separable convolu-

Bolin Zhang received the B.S. degree in computer

Qilin Yin received the B.S. degree and M.S. degree

Wei Lu received the B.S. degree in automation

Xiangyang Luo received his B.S., M.S., and Ph.D.

You might also like