
Bita-Net: Bi-temporal Attention Network for Facial Video Forgery Detection

Yiwei Ru,1,2 Wanting Zhou,3 Yunfan Liu,1,2 Jianxin Sun,1,2 Qi Li*1,2


2021 IEEE International Joint Conference on Biometrics (IJCB) | 978-1-6654-3780-6/21/$31.00 ©2021 IEEE | DOI: 10.1109/IJCB52358.2021.9484408

1 Center for Research on Intelligent Perception and Computing, CASIA, Beijing, China
2 National Laboratory of Pattern Recognition, CASIA, Beijing, China
3 Beijing University of Posts and Telecommunications, Beijing, China
{yiwei.ru, wanting.zhou, yunfan.liu, jianxin.sun}@cripac.ia.ac.cn, {qli}@nlpr.ia.ac.cn
* Corresponding author

Abstract

Deep forgery detection on video data has attracted remarkable research attention in recent years due to its potential in defending against forgery attacks. However, existing methods either only focus on the visual evidence within individual images, or are too sensitive to fluctuations across frames. To address these issues, this paper proposes a novel model, named Bita-Net, to detect forged faces in video data. The network design of Bita-Net is inspired by the mechanism of how human beings detect forged data, i.e. browsing and scrutinizing, which is reflected by the two-pathway architecture of Bita-Net. Concretely, the browsing pathway scans the entire video at a high frame rate to check the temporal consistency, while the scrutinizing pathway focuses on analyzing key frames of the video at a lower frame rate. Furthermore, an attention branch is introduced to improve the forgery detection ability of the scrutinizing pathway. Extensive experimental results demonstrate the effectiveness and generalization ability of Bita-Net on various popular face forensics detection datasets, including FaceForensics++, CelebDF, DeepfakeTIMIT and UADFV.

1. Introduction

With the rapid development of deep generative models, significant progress has been made in the task of facial image manipulation, including swapping, re-enactment, and attribute editing [4, 5, 11, 12, 30]. Although the high quality of manipulation results has facilitated the application of deep generative models in various aspects, concerns are raised as it may also pose a great threat to public safety. For example, pornographic movies, political framing videos, financial fraud and even fake evidence in criminal investigations could be produced by simply swapping faces with only a few publicly available facial images. Therefore, it is necessary to develop facial manipulation detection (or face forgery detection) methods to reduce the negative effects brought by synthesized facial images or videos.

While many effective methods have already been proposed for face forgery image detection in recent years, e.g. MesoNet [2], face X-ray [18], Frequency-aware Decomposition (FAD) [25], frequency domain analysis [8] and attention mechanisms [6], forgery video detection has drawn much less research attention and still remains largely unsolved.

Generally, existing forgery video detection methods can be divided into two categories: single frame based approaches and temporal information based approaches. As the name suggests, single frame based approaches consider video data as a sequence of image frames, and seek evidence within a single frame for forgery detection, e.g. warping artifacts [20], visual biometric artifacts [22], photo response non-uniformity (PRNU) [15], etc. However, with the rapid improvement of the quality of synthetic images, it has become much more difficult to distinguish real images from fake ones with image-level evidence within a single frame. Moreover, these methods are usually designed for certain types of forgery attacks and thus lack generalization ability, which prevents them from being applicable in real-world scenarios. To solve this issue, [23] introduces a capsule network to detect various kinds of attacks, but its effectiveness is challenged by the rapidly improving visual quality of synthetic images.

On the other hand, temporal information based approaches perform forgery detection based on inter-frame features across the whole image sequence. Some methods focus on physiological signals of human faces, e.g. 3D head poses [31], frequency of eye blinking [19], and heartbeat rhythms [24], to distinguish real videos from forged ones. These methods require input videos to be of high resolution for capturing subtle discriminative features, which limits their practical applications. Other studies attempt to solve the video forgery detection problem based on the temporal consistency of image content across frames.

Sabir et al. propose a recurrent convolutional model (RCN) to exploit temporal discrepancies [27]. Guera and Delp propose a temporal-aware pipeline method that involves convolutional neural networks (CNNs) and long short-term memory (LSTM) [13]. However, these methods are too sensitive to changes between frames, which results in low detection accuracy and overall robustness.

To overcome the weaknesses of existing methods, we propose a Bi-temporal Attention Network (Bita-Net) for facial video forgery detection, which makes full use of both spatial and temporal information. Closer inspection suggests that although different forgery methods tend to produce different ghosting artifacts within a frame, similar temporal characteristics across frames, e.g., fluctuations of high frequency information, are shared among most synthetic video data. Based on this observation, our model involves two pathways to extract features at a low frame rate and a high frame rate, respectively. Moreover, we also introduce an attention module to help the model focus on the manipulated region of each frame. The proposed method successfully captures representative patterns of synthetic video data, which improves its generalization ability on various face forgery datasets.

2. Related Work

2.1. Facial Image Forgery Detection Methods

Galbally et al. [8] utilize traditional frequency domain analysis to reveal the discrepancy at high frequencies, and they have achieved great performance on forgery detection with a simple classifier which relies on a few annotated samples. Dang et al. [6] propose to make use of an attention mechanism to make features better classified, under the hypothesis that highlighting the informative regions will help the network to make a better decision. Li et al. [18] focus on face forgeries, and introduce face X-ray to detect forgery based on the fact that most face manipulation methods blend the fake face into a background image. Qian et al. [25] discover that the artifacts of forged faces can be exposed in the frequency domain, and in order to integrate the frequency-aware clues with CNN models, they propose a new framework with two frequency-aware branches.

2.2. Facial Video Forgery Detection Methods

Fake video detection is more challenging in that videos have temporal characteristics, and the frame data degrade much after video compression. Fake video detection methods can be divided into two groups: frame based methods and temporal feature based methods.

Single Frame based Methods. Similar to fake image detection, some methods process single frames of a video to obtain the discriminative features. Most deepfake videos require warping the target face to match the original one. Based on this observation, Li et al. [20] detect fake videos by focusing on the inconsistency between the facial area and the background. Recently, a capsule network [14] is used to detect manipulated images and videos in [23], which introduces a dynamic routing algorithm to integrate the outputs of three capsules to obtain the discriminative features. Some methods also make full use of the features of faces in the video for forgery detection. By estimating 68 facial landmarks and observing the differences in 3D head poses, Yang et al. [31] extract features from 3D head poses and feed them into an SVM to detect deepfake videos. Matern et al. [22] extract eye, teeth and facial contour feature vectors to detect the missing details in these areas. However, this method can only be applied to videos with open eyes and mouth. Koopman et al. [15] leverage photo response non-uniformity (PRNU) analysis to detect deepfakes. The frames are cropped and grouped into eight groups, and then the PRNU pattern of each group is compared by calculating the normalized cross correlation scores.

Temporal Feature based Methods. Frame based methods do not consider the correlation between video frames, which is a waste of information. Based on the fact that low level features between adjacent frames are inconsistent in fake videos, Sabir et al. [27] introduce a recurrent convolutional network (RCN) to extract spatio-temporal features from video streams, which are used to detect fake videos. Similarly, Guera and Delp [13] make use of a CNN to extract features from frames, followed by an LSTM to create a sequence descriptor and a fully connected layer to distinguish between fake and real ones. Since most face images available online are with open eyes, the blink rate is much lower in deepfakes compared with authentic videos. Li et al. [19] send the sequences of cropped eye areas into long-term recurrent convolutional networks [7] to predict the blinking rate, which has achieved great progress in deepfake video detection. Different from the frame based and temporal methods, our method takes into account both the information across frames and within a frame. To imitate the human process of observation, the authenticity of the input video is detected from the two aspects of browsing and scrutinizing.

3. Bita-Net

The overall framework of Bita-Net is illustrated in Figure 1. According to the bionics principle, Bita-Net imitates the mechanism of how human beings process visual data, i.e. browsing and scrutinizing, using a two-pathway network with different sampling rates. Concretely, the browsing pathway focuses on inspecting the inter-frame changes of the input video, while the scrutinizing pathway pays more attention to the intra-frame information.

Figure 1. The overall framework of Bita-Net. The input video undergoes different down-sampling rates to obtain the input data for the browsing pathway and the scrutinizing pathway. The browsing network processes more frames with fewer convolutional kernels for each, while the scrutinizing network extracts more feature maps on fewer frames. Moreover, a U-Net is used to generate attention maps, indicating suspicious regions consisting of manipulated image content. Afterwards, these attention maps are injected into the scrutinizing pathway to improve the detection accuracy. In addition, lateral connections are adopted to enable information communication between the browsing and scrutinizing pathways.

These two pathways communicate with each other via a lateral connection at each stage, and their output results are combined by a fully connected layer to produce the final prediction. Furthermore, an attention module is introduced to help the model better concentrate on highly suspicious image regions.

3.1. The Two-pathway Network

As shown in Figure 1, the main trunk of Bita-Net is a two-pathway network, including a browsing pathway and a scrutinizing pathway. Although both of these two networks receive the same video clip as input, they sample the given frames with different time intervals. Concretely, after dividing the input video into T × τb individual frames, the browsing pathway selects one frame out of every τb frames, resulting in an input tensor containing T frames. For each input frame, a relatively small number of feature maps are computed to capture the temporal information of the input video promptly. Compared to the browsing pathway, the scrutinizing network samples the input video at a lower frame rate (time interval denoted as τs, τs > τb), while computing more feature maps at each stage to gain better expressive ability.

The advantage of the two-pathway architecture is that discriminative features along both the spatial and temporal dimensions can be fully explored to perform forgery detection. Although some advanced single frame based image manipulation methods could eliminate the fingerprint of modifications in the spatial dimension, evidence of forgery could still be discovered in the temporal dimension. Therefore, it is feasible to distinguish manipulated videos from intact ones via the browsing pathway with the temporal evidence. Notably, no restriction is imposed on the network structure of these two pathways, and thus various popular backbone models, e.g. ResNet18, ResNet34, ResNet50, ResNet101, and EfficientNet [29], could be adopted to balance the tradeoff between the computational resources available and the requirement on detection accuracy. Concretely, in this paper, ResNet50 is used as the feature extracting network in the browsing pathway. According to the experimental observation in [10], using temporal convolutions in earlier convolutional layers degrades the overall performance of the model. Therefore, temporal convolutions in the first four blocks of the browsing pathway are removed, so as to capture more detailed forgery fingerprints with high temporal resolution.
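As an illustration of the sampling scheme just described, the following is a minimal sketch in PyTorch-style Python; it is not the authors' code, and the clip length, tensor layout (frames first) and the concrete values of τb and τs (2 and 8, as reported in Section 4.1) are only illustrative.

```python
import torch

def sample_pathway_inputs(clip: torch.Tensor, tau_b: int = 2, tau_s: int = 8):
    """Split one decoded clip into browsing / scrutinizing inputs.

    clip: tensor of shape (num_frames, C, H, W), frames in temporal order.
    The browsing pathway keeps one frame out of every tau_b frames
    (high temporal resolution, few feature maps later on), while the
    scrutinizing pathway keeps one frame out of every tau_s frames
    (low temporal resolution, more feature maps later on).
    """
    browsing_input = clip[::tau_b]
    scrutinizing_input = clip[::tau_s]
    return browsing_input, scrutinizing_input

# Example: a 64-frame clip of 224 x 224 face crops (the input resolution used in Section 4.1).
clip = torch.randn(64, 3, 224, 224)
b_in, s_in = sample_pathway_inputs(clip)
print(b_in.shape, s_in.shape)  # torch.Size([32, 3, 224, 224]) torch.Size([8, 3, 224, 224])
```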

3.2. The Attention Branch

Similar to the inspiration of the two-pathway network, the attention mechanism is also inspired by the bionics principle and imitates the human visual attention mechanism. Given an image, people tend to focus only on specific regions that are interesting and attractive, rather than the entire image. If those regions appear regularly at certain spatial locations, then this pattern can be learned and can facilitate the model to concentrate on the highly suspicious image areas. To this end, we adopt a U-Net model to predict the attention maps based on input video frames (as shown in Figure 2), and these attention maps are injected into the scrutinizing pathway after resizing to improve the accuracy of forgery detection.

Figure 2. Details of the U-Net in the attention branch. An hourglass-shaped network with skip connections is adopted to compute attention maps from input video frames.

We subtract each fake frame from its corresponding real frame to obtain the attention area of the fake video. For ease of description, the fake video frame sequence is represented as {Vf1, Vf2, ..., Vfn}, and the real one as {Vr1, Vr2, ..., Vrn}. The frame-wise subtraction of the two is {Vf1 − Vr1, Vf2 − Vr2, ..., Vfn − Vrn}, represented as {Vatt1, Vatt2, ..., Vattn}. The attention branch adopts the U-Net structure shown in Fig. 2. If the input of the attention branch network is the fake sequence {Vf1, Vf2, ..., Vfn}, then the corresponding label is {Vatt1, Vatt2, ..., Vattn}; if the input is a real video sequence, then the corresponding label is the soft difference sequence (Sec. 4.2 for details). Finally, the attention sequence obtained via the U-Net is weighted into the scrutinizing pathway.

In order to strengthen the ability of the attention branch, we use a soft attention label instead of a pure black image as the label of a real face. For convenience of expression, the real face is represented by Rimage, the compression rate by ρ, and the soft label of the real image by Slabel. ρ × Rimage indicates that the image is downscaled by a factor of ρ, which means the size of the image is reduced from (W, H) to (ρ × W, ρ × H), 0 < ρ < 1. Calculation results are shown in Fig. 4.

Slabel = (ρ × Rimage) × (1/ρ) − Rimage    (1)
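A hedged sketch of how the attention targets above could be computed; this is one reading of Eq. (1), not the authors' implementation. The bilinear interpolation mode and the use of absolute differences are assumptions not stated in the text.

```python
import torch
import torch.nn.functional as F

def fake_attention_target(fake_frame: torch.Tensor, real_frame: torch.Tensor) -> torch.Tensor:
    """Attention label for a fake frame: the frame-wise difference V_att = V_f - V_r."""
    return (fake_frame - real_frame).abs()

def soft_attention_label(real_frame: torch.Tensor, rho: float = 0.5) -> torch.Tensor:
    """Soft attention label for a real frame, following Eq. (1):
    S_label = (rho * R_image) * (1 / rho) - R_image.
    The frame is shrunk to (rho*W, rho*H), resized back to (W, H), and the
    original is subtracted, so only resampling residue remains.
    """
    n, c, h, w = real_frame.shape
    small = F.interpolate(real_frame, size=(int(rho * h), int(rho * w)),
                          mode="bilinear", align_corners=False)
    restored = F.interpolate(small, size=(h, w),
                             mode="bilinear", align_corners=False)
    return (restored - real_frame).abs()

# Example with single 224 x 224 RGB frames (batch dimension first).
real = torch.rand(1, 3, 224, 224)
fake = torch.rand(1, 3, 224, 224)
label_for_fake_input = fake_attention_target(fake, real)
label_for_real_input = soft_attention_label(real, rho=0.5)
```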
3.3. Lateral Connections

As discussed in Section 3.1, the final prediction is made jointly based on the outputs of both the browsing and scrutinizing pathways, but they have different channel sizes and frame sampling frequencies. To solve this issue, lateral connections are proposed to enable information communication between these two networks and to match the shape of feature maps before fusion.

For ease of understanding, let {T, W, H, C} denote the shape of the input data for the browsing pathway, where T refers to the number of frames sampled and W, H, C stand for the width, height, and channels of a single frame, respectively. Similarly, the shape of the input to the scrutinizing pathway can be denoted as {αT, W, H, βC}, where α and β are ratio values satisfying α · β = 1. In Section 4.3, we discuss the influence of hyper-parameter selection and network architecture on the performance of Bita-Net, and empirically set α as 1/4 and β as 4 according to experimental results. In the rest of this subsection, we describe details of several candidate implementations of the lateral connection module, which are illustrated in Figure 3.

Figure 3. Lateral connection methods. The output of the lateral connection is fused into the scrutinizing pathway via summation or concatenation. (a) Channel element addition: the recombined channels are added element-wise. (b) Temporal stride: each frame in the browsing pathway extracts a channel from the feature map and concatenates it with the feature map in the scrutinizing pathway frame by frame. (c) Temporal dense conv.: the feature map of each frame in the browsing pathway is reduced to 2 channels through 3D convolution, then concatenated with the feature map of the scrutinizing pathway frame by frame. (d) Temporal sparse conv.: the feature map of each frame in the browsing pathway is reduced to 1 channel through 3D convolution, then concatenated with the feature map of the scrutinizing pathway frame by frame.

Channel element addition. For feature maps of shape {T, W, H, C} in the browsing pathway, we first reshape them into the shape of {αT, W, H, βC} (α · β = 1), and then add them to the feature map in the scrutinizing pathway at the corresponding stage.

Temporal stride. One frame out of every 1/α frames is sampled from the feature maps in the browsing pathway, resulting in tensors with the shape of {αT, W, H, C}, where the feature map for each frame is kept unchanged. Afterwards, the output of the lateral connection module and the feature map from the scrutinizing pathway are concatenated in a frame-by-frame manner.

Temporal sparse convolution. Instead of direct sampling, 3D convolutions with a kernel of size (C, 5, 1, 1) and stride of 1/α are used in this connection method. Therefore, for feature maps in the browsing pathway with size {T, W, H, C}, the shape of the output tensor is {αT, W, H, C}. Afterwards, the same concatenation method as in 'Temporal stride' is adopted to produce the fusion result.

Temporal dense convolution. Similar to 'Temporal sparse convolution', the same 3D convolutions but with doubled output channels (2C) are adopted to further enhance the expressive ability of the network, followed by the same frame-by-frame concatenation method.
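To make the channel and frame bookkeeping concrete, here is a sketch of the 'temporal dense convolution' lateral connection for feature maps laid out as (batch, channels, frames, height, width). The kernel shape follows the (C, 5, 1, 1) description above with temporal stride 1/α = 4 and 2C output channels; the padding choice and the use of concatenation for fusion are assumptions.

```python
import torch
import torch.nn as nn

class TemporalDenseLateral(nn.Module):
    """Sketch of the 'temporal dense conv.' lateral connection."""

    def __init__(self, channels: int, alpha_inv: int = 4):
        super().__init__()
        # 5-tap temporal kernel, stride 1/alpha along time, doubled (2C) output channels.
        self.conv = nn.Conv3d(channels, 2 * channels,
                              kernel_size=(5, 1, 1),
                              stride=(alpha_inv, 1, 1),
                              padding=(2, 0, 0))

    def forward(self, browsing_feat: torch.Tensor, scrutinizing_feat: torch.Tensor) -> torch.Tensor:
        # browsing_feat:     (B, C,      T,           H, W) - many frames, few channels
        # scrutinizing_feat: (B, beta*C, T/alpha_inv, H, W) - few frames, many channels
        lateral = self.conv(browsing_feat)            # (B, 2C, T/alpha_inv, H, W)
        # Frame-by-frame concatenation along the channel dimension.
        return torch.cat([scrutinizing_feat, lateral], dim=1)

# Example with C = 16, beta = 4, T = 32 browsing frames and 8 scrutinizing frames.
fuse = TemporalDenseLateral(channels=16, alpha_inv=4)
browsing = torch.randn(2, 16, 32, 56, 56)
scrutinizing = torch.randn(2, 64, 8, 56, 56)
fused = fuse(browsing, scrutinizing)  # shape (2, 64 + 32, 8, 56, 56)
```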

Model FF++(Deepfakes) FF++(Face2Face) FF++(FaceShifter) FF++(FaceSwap) FF++(NeuralTextures) UADFV Celeb-DF TIMIT
I3D 97.29% 96.18% 97.53% 96.64% 97.79% 97.31% 96.54% 96.33%
C2D 96.31% 94.63% 95.45% 95.37% 96.10% 95.17% 93.87% 92.74%
Meso-Net 95.04% 94.75% 94.77% 95.91% 95.15% 94.39% 95.28% 94.85%
EfficientNet-B0 97.09% 95.50% 96.17% 96.15% 96.06% 96.67% 95.34% 96.12%
EfficientNet-B4 98.53% 97.81% 97.39% 97.82% 97.54% 97.69% 98.14% 96.23%
Bita-Net-4x16,R50 98.04% 96.75% 97.77% 97.91% 97.15% 97.39% 97.28% 96.85%
Bita-Net-16x64,R50 99.16% 97.85% 98.76% 97.43% 98.60% 99.47% 98.26% 97.88%
Bita-Net-4x16,R101 98.81% 97.57% 99.42% 98.27% 98.24% 98.61% 98.45% 98.33%
Bita-Net-16x64,R101 99.83% 98.54% 99.11% 98.65% 98.87% 99.26% 98.75% 98.67%
Table 1. Results of comparison between Bita-Net and four open-source state-of-the-art benchmark methods on different datasets.
To ensure the fairness of the comparison, all methods use DFDC as the training dataset and share the same data augmentation methods.
The naming principle for variants of Bita-Net: 'Bita-Net-⟨τb⟩x⟨τs⟩, R⟨number of layers in the ResNet backbone⟩'.

3.4. Objective Functions

The training process of the attention branch is independent of the main trunk of the browsing and scrutinizing pathways. A U-Net model is adopted in the attention branch, in which the real and fake images with the soft labels referred to in Sec. 3.2 are set as input, and the output is the attention weight. The weight is directly multiplied with the feature maps of the corresponding size shown in Fig. 1. Therefore, we only discuss the loss function of the browsing and scrutinizing pathways. Concretely, the outputs of the browsing pathway and the scrutinizing pathway, represented as F_browsing and F_scrutinizing, are obtained through the last convolutional layer and the global average pooling layer, and then they are combined by a concatenation ⊕ and a fully connected layer operation (FC). The final prediction result of the model is expressed as follows:

prediction = FC(F_browsing ⊕ F_scrutinizing)    (2)

Using cross entropy as the loss function, with L representing the label of the video to be detected and Pred representing the prediction of Eq. 2, the loss function is expressed as follows:

Loss = −(L · log(Pred) + (1 − L) · log(1 − Pred))    (3)
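Equations (2) and (3) amount to a concatenation, a fully connected layer and a binary cross entropy loss; the sketch below spells this out, with the feature dimensions and the sigmoid placement being assumptions.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """prediction = FC(F_browsing ⊕ F_scrutinizing), as in Eq. (2)."""

    def __init__(self, dim_browsing: int, dim_scrutinizing: int):
        super().__init__()
        self.fc = nn.Linear(dim_browsing + dim_scrutinizing, 1)

    def forward(self, f_browsing: torch.Tensor, f_scrutinizing: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([f_browsing, f_scrutinizing], dim=1)  # the ⊕ operation
        return torch.sigmoid(self.fc(fused)).squeeze(1)         # real/fake probability

head = FusionHead(dim_browsing=256, dim_scrutinizing=2048)
f_b = torch.randn(4, 256)    # globally pooled browsing-pathway features
f_s = torch.randn(4, 2048)   # globally pooled scrutinizing-pathway features
pred = head(f_b, f_s)

# Eq. (3): Loss = -(L * log(Pred) + (1 - L) * log(1 - Pred)), i.e. binary cross entropy.
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = nn.functional.binary_cross_entropy(pred, labels)
```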
4. Experiments

4.1. Preliminaries

Training Configurations. All networks in the proposed model are trained from scratch and no pre-trained weights are adopted for initialization. For each frame, facial image regions are first detected and then scaled to a resolution of 224 × 224. Afterwards, the resultant face sequences are augmented by combinations of random cropping, scaling, rotation, and adjustments of contrast and intensity, to improve the generalization ability of the model. The average length of videos for both training and testing is 11.35 seconds (338 frames). In our experiments, τb and τs are set to 2 and 8, respectively.

Datasets. Five publicly available video forgery datasets are involved in our experiments, one for training (DFDC [1]) and the others for testing (FaceForensics++ [26], UADFV [20], DeepfakeTIMIT [16], and Celeb-DF [21]). The DFDC dataset contains 119,197 videos, where 19,197 videos are real clips of 430 subjects, and the remaining 100,000 videos are fake samples generated from the real ones. In FaceForensics++, five facial manipulation algorithms, i.e. Face2Face [30], FaceShifter [17], FaceSwap [9], DeepFakes [28] and NeuralTextures [30], are applied to process 1,000 real videos, resulting in 510,207 images containing both real and fake faces. The UADFV dataset contains 49 real videos and 49 fake videos, consisting of 32,752 frames in total. The DeepfakeTIMIT dataset contains low-quality videos with a resolution of 64 × 64 and high-quality videos with a resolution of 128 × 128, including 10,537 original images and 34,023 synthetic images extracted from 320 videos. The Celeb-DF dataset contains 590 original videos collected from YouTube and 5,639 videos correspondingly generated by deep forgery methods.

Testing. In order to ensure the consistency of testing, unless otherwise specified, the data used in the experiments are raw data, and no other operations are performed on the test datasets. Since the duration of a video to be tested is longer than the duration of our sample, we use random sampling to construct the test set: the start frame of each input video is chosen randomly, and the detection sequences are extracted at equal intervals. In order to ensure the fairness of testing, each test video is copied ten times to randomly recreate a test set.
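A rough sketch of the test-set construction described above (random start frame, equal-interval extraction, ten copies per video); the clip length and sampling interval are illustrative values, not taken from the paper.

```python
import random

def sample_test_clips(num_frames: int, clip_len: int = 64, interval: int = 2,
                      copies: int = 10, seed: int = 0):
    """Build `copies` lists of frame indices for one test video.

    Each copy starts at a random frame and gathers `clip_len` frames at an
    equal `interval`, mirroring the testing protocol of Section 4.1.
    """
    rng = random.Random(seed)
    span = clip_len * interval
    clips = []
    for _ in range(copies):
        start = rng.randint(0, max(0, num_frames - span))
        clips.append([start + i * interval for i in range(clip_len)])
    return clips

# Example: a 338-frame test video (the average length reported above).
clips = sample_test_clips(num_frames=338)
print(len(clips), clips[0][:5])  # 10 copies; first indices of the first clip
```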

4.2. Results of Video Forgery Detection

The comparison results of Bita-Net and four state-of-the-art benchmark algorithms on four facial video forgery detection datasets are reported in Table 1. It is clear that Bita-Net outperforms all other methods on all datasets (over 98.5%), demonstrating the effectiveness of our method. The superior generalization ability of Bita-Net in forgery detection comes from two factors: 1) the architecture of the bi-temporal network, which combines the temporal information across frames with spatial features within a frame; 2) the introduction of the attention branch, which further enhances the sensitivity of the model to fingerprints of forgery.

In addition, we also investigate the influence of various configurations of the network structure on the detection accuracy. We conduct experiments with τb increased to both 4 and 16, with α and β unchanged. Moreover, we also check how the backbone structure of the two-pathway network (ResNet50 and ResNet101) affects the overall performance. According to the comparison results in Table 1, substituting ResNet50 with ResNet101 generally improves the detection accuracy (except for UADFV), and the detailed discussion of experimental results on each dataset is elaborated as follows:

FaceForensics++. To evaluate the accuracy of detecting fake videos generated by a variety of manipulation methods, the experiments are performed with each forgery method on different sub-datasets of FaceForensics++. The experimental results on the FF++ (Deepfakes), FF++ (Face2Face), FF++ (FaceShifter), FF++ (FaceSwap) and FF++ (NeuralTextures) datasets show that the best accuracy with Bita-Net is up to 99.83%, 98.54%, 99.42%, 98.56% and 98.87%, which is 1.30%, 0.73%, 1.89%, 0.83% and 1.08% higher than the other open-source detection algorithms respectively. It follows that Bita-Net generalizes well across a variety of forgery methods from a comprehensive perspective.

UADFV. The results on the UADFV dataset reveal that the I3D, C2D [3], Meso-Net [2], EfficientNet-B0 and EfficientNet-B4 [29] models achieve performances of 97.31%, 95.17%, 94.39%, 96.67% and 97.69% respectively. It is notable that the detection accuracy using Bita-Net-16x64,R101 increases by about 2%. Due to the residual connections and richer input sequences, the learning process is more effective.

DeepfakeTIMIT. DeepfakeTIMIT is not a large-scale dataset containing a large number of fake videos, so it poses a challenge for evaluating the generalization of face forensics detection algorithms. However, Bita-Net still achieves 1.78%-3.08% better performance than other algorithms.

CelebDF. The CelebDF [21] database has a great number of videos, and the quality of videos in Celeb-DF is close to that of daily TV shows, so the performance of a detection algorithm on this dataset is more representative of daily data. From Table 1, it can be clearly seen that the performance of our algorithm tends to be better, with an increase of about 2% over the others.

In conclusion, the above experimental results verify that Bita-Net performs well in facial fake video detection, even though the videos are generated by a variety of forgery methods. The accuracy can be increased by the combination of the bi-temporal structure and the attention branch. Actually, the discrimination of fake videos is complex in daily life. Nevertheless, Bita-Net has advantages in tackling sophisticated outdoor scenes compared with other detection algorithms, achieving at least a 1% performance improvement over current detection algorithms on multi-task and multi-scenario datasets.

4.3. Ablation Experiments

This part mainly presents the ablation experiments on the FaceForensics++ dataset to demonstrate the effectiveness of our Bita-Net method. In Sec. 4.2, detailed experiments show the significant improvement of Bita-Net compared with state-of-the-art methods on a variety of public test sets. To further discuss the effectiveness of each module in Bita-Net, ablation experiments are carried out on the FaceForensics++ dataset. Different pathways and lateral connections, the attention label, and the attention branch are examined as follows.

Different pathways and lateral connections. Table 2 shows that the Bita-Net models with connections all perform better than the browsing-only pathway (whose accuracy is 81.09%) and the scrutinizing-only pathway (whose accuracy is 95.77%), while the variant without connections performs clearly worse. According to the comparison of different connection methods, the best choice is temporal dense convolution, whose detection accuracy is 2.59% and 17.27% higher than the scrutinizing-only and browsing-only pathways respectively. We therefore set temporal dense convolution as the default connection method. The reason why the temporal dense convolution lateral connection performs better than the other connection methods is that it retains the information of the browsing pathway more effectively.

Model            Lateral Connects             Acc.
Browse only      -                            81.09%
Scrutinize only  -                            95.77%
Bita-Net         without connection           89.53%
Bita-Net         Channel element addition     96.18%
Bita-Net         Temporal stride              97.84%
Bita-Net         Temporal sparse conv.        98.05%
Bita-Net         Temporal dense conv.         98.36%
Table 2. Bi-temporal fusion. Comparison of different lateral connections. To facilitate comparison, all experiments use Bita-Net-16x64,R50 as the comparison network.

Soft Attention Label. Experiments show that the performance of the detection algorithm drops drastically due to image compression, because "artifacts" that are not forged fingerprints are generated from the original images during compression. The soft label method (Fig. 4) achieves higher accuracy than simply setting the attention label of a real image to a pure black image. Moreover, it improves the generalization of the model on compressed images.

Figure 4. Soft Attention Label. The first row shows the true image, the fake image, the difference between the true and fake image, and a pure black image (panels (a) real, (b) fake, (c) diff, (d) black). The second and third rows show the difference between the true and fake image after different compression ratios, named ρ-⟨compression rate⟩ (panels (e) ρ-0.2, (f) ρ-0.3, (g) ρ-0.4, (h) ρ-0.5, (i) ρ-0.6, (j) ρ-0.7, (k) ρ-0.8, (l) ρ-0.9).

The impact of the soft attention label on detection accuracy is shown in Table 3. In order to compare the generalization performance, the models are trained with uncompressed data. Compared with the hard label, the soft label improves the accuracy of the model by about 0.13% on the raw test set, 0.71% on the c23 test set and 1.08% on the c40 test set respectively.

Model      Att. Label   Raw      C23      C40
Bita-Net   Hard         98.23%   97.46%   96.47%
Bita-Net   Soft         98.36%   98.17%   97.55%
Table 3. Soft Attention Label vs Hard Attention Label. The training set uses raw real and fake data.

Attention branch. Table 4 compares the impact of attaching the attention branch to different parts of the model. According to the comparison results, the accuracy of the bi-temporal method without any attention branch is 96.53%. Attaching the attention branch to the browsing pathway and the scrutinizing pathway separately increases the accuracy to 97.35% and 98.36% respectively, the latter clearly being the best. However, if the attention mechanism is introduced into both pathways, the accuracy drops to 98.11%. The reason is probably that the browsing pathway mainly focuses on temporal information, so the attention branch on the browsing pathway does not improve the accuracy significantly.

Model      Attention Method       Accuracy
Bita-Net   Bi-temporal            96.53%
Bita-Net   Browsing pathway       97.35%
Bita-Net   Scrutinizing pathway   98.36%
Bita-Net   Bi-pathway             98.11%
Table 4. Attention branch. Accuracy comparison with the attention branch added to different pathways. Bi-temporal means no attention branch; Bi-pathway means the attention branch is added to both pathways.

5. Conclusion

The introduction of the bi-temporal structure and the attention branch has greatly improved the accuracy of forgery detection, obtaining better detection results compared with the current open-source mainstream methods. The frame number settings of the two pathways have a great impact on the results. It is expected that there will be a NAS-like scheme in the future so that the model can select the frame numbers of the two pathways adaptively according to the training data.

6. Acknowledgements

This work was supported in part by the National Key Research and Development Program of China under Grant 2020AAA0140002, and in part by the Natural Science Foundation of China under Grant No. 62076240, Grant No. U1836217 and Grant No. 61721004.

References

[1] https://www.kaggle.com/c/deepfake-detection-challenge.
[2] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen. Mesonet: a compact facial video forgery detection network. In IEEE International Workshop on Information Forensics and Security, pages 1–7, 2018.
[3] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[4] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8789–8797, 2018.
[5] H. Dang, F. Liu, J. Stehouwer, X. Liu, and A. K. Jain. On the detection of digital face manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5781–5790, 2020.
[6] H. Dang, F. Liu, J. Stehouwer, X. Liu, and A. K. Jain. On the detection of digital face manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5781–5790, 2020.
does not improve the accuracy significantly. rell. Long-term recurrent convolutional networks for

[7] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.
[8] R. Durall, M. Keuper, F.-J. Pfreundt, and J. Keuper. Unmasking deepfakes with simple features. arXiv preprint arXiv:1911.00686, 2019.
[9] H. Farid. Image forgery detection. IEEE Signal Processing Magazine, 26(2):16–25, 2009.
[10] C. Feichtenhofer, H. Fan, J. Malik, and K. He. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6202–6211, 2019.
[11] J. Galbally, S. Marcel, and J. Fierrez. Biometric antispoofing methods: A survey in face recognition. IEEE Access, 2:1530–1552, 2014.
[12] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. arXiv preprint arXiv:1406.2661, 2014.
[13] D. Güera and E. J. Delp. Deepfake video detection using recurrent neural networks. In IEEE International Conference on Advanced Video and Signal Based Surveillance, pages 1–6, 2018.
[14] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51, 2011.
[15] M. Koopman, A. M. Rodriguez, and Z. Geradts. Detection of deepfake video manipulation. In Irish Machine Vision and Image Processing Conference, pages 133–136, 2018.
[16] P. Korshunov and S. Marcel. Deepfakes: a new threat to face recognition? assessment and detection. arXiv preprint arXiv:1812.08685, 2018.
[17] L. Li, J. Bao, H. Yang, D. Chen, and F. Wen. Faceshifter: Towards high fidelity and occlusion aware face swapping. arXiv preprint arXiv:1912.13457, 2019.
[18] L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen, and B. Guo. Face x-ray for more general face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5001–5010, 2020.
[19] Y. Li, M.-C. Chang, and S. Lyu. In ictu oculi: Exposing ai created fake videos by detecting eye blinking. In IEEE International Workshop on Information Forensics and Security, pages 1–7, 2018.
[20] Y. Li and S. Lyu. Exposing deepfake videos by detecting face warping artifacts. arXiv preprint arXiv:1811.00656, 2018.
[21] Y. Li, X. Yang, P. Sun, H. Qi, and S. Lyu. Celeb-df (v2): a new dataset for deepfake forensics. arXiv preprint arXiv:1909.12962, 2019.
[22] F. Matern, C. Riess, and M. Stamminger. Exploiting visual artifacts to expose deepfakes and face manipulations. In IEEE Winter Applications of Computer Vision Workshops, pages 83–92, 2019.
[23] H. H. Nguyen, J. Yamagishi, and I. Echizen. Use of a capsule network to detect fake images and videos. arXiv preprint arXiv:1910.12467, 2019.
[24] H. Qi, Q. Guo, F. Juefei-Xu, X. Xie, L. Ma, W. Feng, Y. Liu, and J. Zhao. Deeprhythm: exposing deepfakes with attentional visual heartbeat rhythms. In Proceedings of the ACM International Conference on Multimedia, pages 4318–4327, 2020.
[25] Y. Qian, G. Yin, L. Sheng, Z. Chen, and J. Shao. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In European Conference on Computer Vision, pages 86–103, 2020.
[26] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1–11, 2019.
[27] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, and P. Natarajan. Recurrent convolutional strategies for face manipulation detection in videos. Interfaces, 3(1), 2019.
[28] M. C. Stamm and K. R. Liu. Forensic detection of image manipulation using statistical intrinsic fingerprints. IEEE Transactions on Information Forensics and Security, 5(3):492–506, 2010.
[29] M. Tan and Q. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114, 2019.
[30] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner. Face2face: Real-time face capture and reenactment of rgb videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2387–2395, 2016.
[31] X. Yang, Y. Li, and S. Lyu. Exposing deep fakes using inconsistent head poses. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8261–8265, 2019.

