Bita-Net: Bi-Temporal Attention Network for Facial Video Forgery Detection

1 Center for Research on Intelligent Perception and Computing, CASIA, Beijing, China
2 National Laboratory of Pattern Recognition, CASIA, Beijing, China
3 Beijing University of Posts and Telecommunications, Beijing, China
{yiwei.ru, wanting.zhou, yunfan.liu, jianxin.sun}@cripac.ia.ac.cn, {qli}@nlpr.ia.ac.cn
Sabir et al. propose a recurrent convolutional model (RCN) to exploit temporal discrepancies [27]. Güera and Delp propose a temporal-aware pipeline that combines convolutional neural networks (CNNs) and long short-term memory (LSTM) [13]. However, these methods are too sensitive to changes between frames, which results in low detection accuracy and limited overall robustness.

To overcome the weaknesses of existing methods, we propose a Bi-temporal Attention Network (Bita-Net) for facial video forgery detection, which makes full use of both spatial and temporal information. Closer inspection suggests that although different forgery methods tend to produce different ghosting artifacts within a frame, similar temporal characteristics across frames, e.g., fluctuations of high-frequency information, are shared among most synthetic video data. Based on this observation, our model involves two pathways that extract features at a low frame rate and a high frame rate, respectively. Moreover, we also introduce an attention module to help the model focus on the manipulated region of each frame. The proposed method successfully captures representative patterns of synthetic video data, which improves its generalization ability on various face forgery datasets.

2. Related Work

2.1. Facial Image Forgery Detection Methods

Durall et al. [8] utilize traditional frequency domain analysis to reveal the discrepancy at high frequencies, and they achieve strong forgery detection performance with a simple classifier that relies on only a few annotated samples. Dang et al. [6] propose to use an attention mechanism to make features more discriminative, under the hypothesis that highlighting the informative regions helps the network make better decisions. Li et al. [18] focus on face forgeries and introduce Face X-ray, which detects forgery based on the fact that most face manipulation methods blend a fake face into a background image. Qian et al. [25] discover that the artifacts of forged faces can be exposed in the frequency domain, and, in order to integrate frequency-aware clues with CNN models, they propose a new framework with two frequency-aware branches.

2.2. Facial Video Forgery Detection Methods

Fake video detection is more challenging in that videos have temporal characteristics, and the frame data is heavily degraded after video compression. Fake video detection methods can be divided into two groups: frame based methods and temporal feature based methods.

Single Frame based Methods. Similar to fake image detection, some methods process single frames of a video to obtain discriminative features. Most deepfake videos require warping the target face to match the original one. Based on this observation, Li et al. [20] detect fake videos by focusing on the inconsistency between the facial area and the background. Recently, the capsule network [14] has been used to detect manipulated images and videos in [23], which introduces a dynamic routing algorithm that integrates the outputs of three capsules to obtain discriminative features. Some methods also exploit specific facial features in the video for forgery detection. By estimating 68 facial landmarks and observing the differences in 3D head poses, Yang et al. [31] extract features from 3D head poses and feed them into an SVM to detect deepfake videos. Matern et al. [22] extract an eye feature vector, a teeth feature vector, and a facial contour feature vector to detect missing details in these areas; however, this method can only be applied to videos with open eyes and mouths. Koopman et al. [15] leverage photo response non-uniformity (PRNU) analysis to detect deepfakes: the frames are cropped and grouped into eight groups, and the PRNU patterns of the groups are then compared by calculating normalized cross-correlation scores.

Temporal Feature based Methods. Frame based methods do not consider the correlation between video frames, which wastes information. Based on the fact that the low-level features of adjacent frames are inconsistent in fake videos, Sabir et al. [27] introduce a recurrent convolutional network (RCN) that extracts spatio-temporal features from video streams to detect fake videos. Similarly, Güera and Delp [13] use a CNN to extract features from frames, followed by an LSTM that creates a sequence descriptor and a fully connected layer that distinguishes fake videos from real ones. Since most face images available online have open eyes, the blink rate is much lower in deepfakes than in authentic videos; Li et al. [19] therefore feed sequences of the cropped eye area into a long-term recurrent convolutional network [7] to predict the blinking rate, which has achieved great progress in deepfake video detection. Different from the frame based and temporal methods, our method takes into account both the information across frames and the information within a frame. To imitate the human process of observation, the authenticity of the input video is assessed from two aspects: browsing and scrutinizing.
Figure 1. The overall framework of Bita-Net. The input video undergoes different down-sampling rates to obtain the input data for the browsing pathway and the scrutinizing pathway. The browsing network processes more frames with fewer convolutional kernels for each, while the scrutinizing network extracts more feature maps from fewer frames. A U-Net is used to generate attention maps indicating suspicious regions that contain manipulated image content; these attention maps are injected into the scrutinizing pathway to improve the detection accuracy. Finally, lateral connections are adopted to enable information exchange between the browsing and scrutinizing pathways.
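To make the dataflow summarized in Figure 1 concrete, the following PyTorch-style sketch mirrors it at a toy scale; the lateral connections and the U-Net attention branch are omitted, and all class names, layer widths, and the stub backbones are illustrative assumptions rather than the authors' released code.

```python
# Minimal sketch of the Bita-Net dataflow in Figure 1 (assumed names and shapes).
import torch
import torch.nn as nn

class PathwayStub(nn.Module):
    """Tiny 3D-conv stand-in for a real backbone (the paper uses ResNet-style networks)."""
    def __init__(self, width):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, width, kernel_size=(1, 7, 7), stride=(1, 2, 2), padding=(0, 3, 3)),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),          # global average pooling over (T, H, W)
        )

    def forward(self, x):                     # x: (B, 3, T, H, W)
        return self.features(x).flatten(1)    # (B, width)

class BitaNetSketch(nn.Module):
    def __init__(self, tau_b=2, tau_s=8, width_b=8, width_s=64):
        super().__init__()
        self.tau_b, self.tau_s = tau_b, tau_s
        self.browse = PathwayStub(width_b)       # more frames, fewer feature maps
        self.scrutinize = PathwayStub(width_s)   # fewer frames, more feature maps
        self.fc = nn.Linear(width_b + width_s, 1)

    def forward(self, video):                    # video: (B, 3, num_frames, H, W)
        fast = video[:, :, ::self.tau_b]         # browsing input: every tau_b-th frame
        slow = video[:, :, ::self.tau_s]         # scrutinizing input: every tau_s-th frame
        feats = torch.cat([self.browse(fast), self.scrutinize(slow)], dim=1)
        return torch.sigmoid(self.fc(feats))     # FC fusion of the two pathways (sigmoid assumed)

pred = BitaNetSketch()(torch.randn(2, 3, 32, 224, 224))   # toy clip of 32 frames
```

Only the two sampling strides and the concatenation-plus-FC head are specific to the description in this paper; everything else is a placeholder to keep the sketch runnable.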
3. Bita-Net

The overall framework of Bita-Net is illustrated in Figure 1. Following a bionics principle, Bita-Net imitates the mechanism by which human beings process visual data, i.e., browsing and scrutinizing, using a two-pathway network with different sampling rates. Concretely, the browsing pathway focuses on inspecting the inter-frame changes of the input video, while the scrutinizing pathway pays more attention to the intra-frame information. The two pathways communicate with each other via a lateral connection at each stage, and their outputs are combined by a fully connected layer to produce the final prediction. Furthermore, an attention module is introduced to help the model concentrate on highly suspicious image regions.

3.1. The Two-pathway Network

As shown in Figure 1, the main trunk of Bita-Net is a two-pathway network consisting of a browsing pathway and a scrutinizing pathway. Although both networks receive the same video clip as input, they sample the given frames with different time intervals. Concretely, after dividing the input video into T × τb individual frames, the browsing pathway selects one frame out of every τb frames, resulting in an input tensor containing T frames. For each input frame, a relatively small number of feature maps are computed to capture the temporal information of the input video promptly. Compared to the browsing pathway, the scrutinizing network samples the input video at a lower frame rate (time interval denoted τs, with τs > τb), while computing more feature maps at each stage to gain better expressive ability.

The advantage of the two-pathway architecture is that discriminative features along both the spatial and the temporal dimensions can be fully explored to perform forgery detection. Although some advanced single-frame image manipulation methods could eliminate the fingerprint of modifications in the spatial dimension, evidence of forgery could still be discovered in the temporal dimension. Therefore, it is feasible to distinguish manipulated videos from intact ones via the browsing pathway with this temporal evidence. Notably, no restriction is imposed on the network structure of the two pathways, and thus various popular backbone models, e.g., ResNet18, ResNet34, ResNet50, ResNet101, and EfficientNet [29], can be adopted to balance the tradeoff between the available computational resources and the required detection accuracy. Concretely, in this paper, ResNet50 is used as the feature extracting network in the browsing pathway. According to the experimental observation in [10], using temporal convolutions in earlier convolutional layers degrades the overall performance of the model; therefore, temporal convolutions in the first four blocks of the browsing pathway are removed, so as to capture more detailed forgery fingerprints at high temporal resolution.

3.2. The Attention Branch

Similar to the two-pathway network, the attention branch is also inspired by the bionics principle and imitates the human visual attention mechanism. Given an image, people tend to focus only on specific regions …
(Figure: lateral connection variants. (a) Channel element addition. (b) Temporal stride.)
Model FF++(Deepfakes) FF++(Face2Face) FF++(FaceShifter) FF++(FaceSwap) FF++(NeuralTextures) UADFV Celeb-DF TIMIT
I3D 97.29% 96.18% 97.53% 96.64% 97.79% 97.31% 96.54% 96.33%
C2D 96.31% 94.63% 95.45% 95.37% 96.10% 95.17% 93.87% 92.74%
Meso-Net 95.04% 94.75% 94.77% 95.91% 95.15% 94.39% 95.28% 94.85%
EfficientNet-B0 97.09% 95.50% 96.17% 96.15% 96.06% 96.67% 95.34% 96.12%
EfficientNet-B4 98.53% 97.81% 97.39% 97.82% 97.54% 97.69% 98.14% 96.23%
Bita-Net-4x16,R50 98.04% 96.75% 97.77% 97.91% 97.15% 97.39% 97.28% 96.85%
Bita-Net-16x64,R50 99.16% 97.85% 98.76% 97.43% 98.60% 99.47% 98.26% 97.88%
Bita-Net-4x16,R101 98.81% 97.57% 99.42% 98.27% 98.24% 98.61% 98.45% 98.33%
Bita-Net-16x64,R101 99.83% 98.54% 99.11% 98.65% 98.87% 99.26% 98.75% 98.67%
Table 1. Comparison between Bita-Net and four open-source state-of-the-art benchmark methods on different datasets. To ensure a fair comparison, all methods use DFDC as the training dataset and share the same data augmentation methods. Variants of Bita-Net are named 'Bita-Net-⟨τb⟩x⟨τs⟩, R⟨number of layers in the ResNet backbone⟩'.
'Temporal stride' is adopted to produce the fusion result.

Temporal dense convolution. Similar to 'temporal sparse convolution', the same 3D convolutions, but with doubled output channels (2C), are adopted to further enhance the expressive ability of the network, followed by the same frame-by-frame concatenation method.
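The opening of Sec. 3.3 falls outside the pages reproduced here, so the exact definition of 'temporal sparse convolution' is not visible; the sketch below is therefore only one plausible reading of the 'temporal dense convolution' lateral connection, modeled on the time-strided convolution fusion of [10]: a 3D convolution with temporal stride τs/τb and 2C output channels maps the browsing features to the scrutinizing pathway's temporal resolution, and the result is concatenated frame by frame along the channel dimension. Kernel size, stride, and channel counts are assumptions.

```python
# Hedged sketch of a 'temporal dense convolution' lateral connection.
# Assumption: modeled on the time-strided convolution fusion of SlowFast [10];
# kernel size, stride, and channel counts are illustrative, not the paper's exact values.
import torch
import torch.nn as nn

class TemporalDenseConvFusion(nn.Module):
    def __init__(self, c_browse, tau_b=2, tau_s=8, kernel_t=5):
        super().__init__()
        stride_t = tau_s // tau_b                       # match the scrutinizing temporal resolution
        self.proj = nn.Conv3d(c_browse, 2 * c_browse,   # doubled output channels (2C)
                              kernel_size=(kernel_t, 1, 1),
                              stride=(stride_t, 1, 1),
                              padding=(kernel_t // 2, 0, 0))

    def forward(self, f_browse, f_scrutinize):
        # f_browse: (B, C, T, H, W) from the browsing pathway (more frames, fewer channels)
        # f_scrutinize: (B, C', T/stride_t, H, W) from the scrutinizing pathway
        lateral = self.proj(f_browse)                      # (B, 2C, T/stride_t, H, W)
        return torch.cat([f_scrutinize, lateral], dim=1)   # frame-by-frame channel concatenation

fused = TemporalDenseConvFusion(c_browse=8)(torch.randn(1, 8, 16, 56, 56),
                                            torch.randn(1, 64, 4, 56, 56))
```

Under this reading, 'temporal stride' and 'channel element addition' would differ mainly in how the browsing features are brought to the scrutinizing resolution before fusion.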
3.4. Objective Functions

The training process of the attention branch is independent of the main trunk, i.e., the browsing and scrutinizing pathways. A U-Net model is adopted in the attention branch: it takes real and fake images, paired with the soft labels described in Sec. 3.2, as input and outputs attention weights. Each attention weight map is directly multiplied with the feature maps of the corresponding size, as shown in Figure 1. Therefore, we only discuss the loss function of the browsing and scrutinizing pathways here. Concretely, the outputs of the browsing and scrutinizing pathways, denoted F_browsing and F_scrutinizing, are obtained through the last convolutional layer followed by global average pooling; they are then combined by concatenation (⊕) and a fully connected layer (FC). The final prediction of the model is expressed as follows:

prediction = FC(F_browsing ⊕ F_scrutinizing)    (2)

Using cross entropy as the loss function, with L denoting the label of the video to be detected and Pred denoting the prediction of Eq. (2), the loss is expressed as follows:

Loss = −(L · log(Pred) + (1 − L) · log(1 − Pred))    (3)
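As a sanity check on Eqs. (2) and (3), here is a minimal PyTorch sketch of the fusion head and the binary cross-entropy loss; the feature width of 2048 and the sigmoid that turns the FC output into a probability are assumptions on our part, and the helper names are hypothetical.

```python
# Hedged sketch of Eqs. (2) and (3); names, widths, and the sigmoid are assumptions.
import torch
import torch.nn as nn

def bita_prediction(f_browsing, f_scrutinizing, fc):
    """Eq. (2): concatenate the pathway features and apply a fully connected layer."""
    fused = torch.cat([f_browsing, f_scrutinizing], dim=1)   # the '⊕' operation
    return torch.sigmoid(fc(fused)).squeeze(1)               # assumed sigmoid -> probability

def bita_loss(pred, label):
    """Eq. (3): binary cross entropy between the prediction and the video label L."""
    return -(label * torch.log(pred) + (1 - label) * torch.log(1 - pred)).mean()

fc = nn.Linear(2048 + 2048, 1)          # assumed feature width after global average pooling
f_b, f_s = torch.rand(4, 2048), torch.rand(4, 2048)
label = torch.randint(0, 2, (4,)).float()
loss = bita_loss(bita_prediction(f_b, f_s, fc), label)
```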
4. Experiments

4.1. Preliminaries

Training Configurations. All networks in the proposed model are trained from scratch; no pre-trained weights are adopted for initialization. For each frame, the facial image region is first detected and then scaled to a resolution of 224 × 224. Afterwards, the resulting face sequences are augmented by combinations of random cropping, scaling, rotation, and contrast and intensity adjustment to improve the generalization ability of the model. The average length of the videos used for training and testing is 11.35 seconds (338 frames). In our experiments, τb and τs are set to 2 and 8, respectively.

Datasets. Five publicly available video forgery datasets are involved in our experiments, one for training (DFDC [1]) and the others for testing (FaceForensics++ [26], UADFV [20], DeepfakeTIMIT [16], and Celeb-DF [21]). The DFDC dataset contains 119,197 videos, of which 19,197 are real clips of 430 subjects and the remaining 100,000 are fake samples generated from the real ones. In FaceForensics++, five facial manipulation algorithms, i.e., Face2Face [30], FaceShifter [17], FaceSwap [9], DeepFakes [28], and NeuralTextures [30], are applied to 1,000 real videos, resulting in 510,207 images containing both real and fake faces. The UADFV dataset contains 49 real videos and 49 fake videos, consisting of 32,752 frames in total. The DeepfakeTIMIT dataset contains low-quality videos with a resolution of 64 × 64 and high-quality videos with a resolution of 128 × 128, including 10,537 original images and 34,023 synthetic images extracted from 320 videos. The Celeb-DF dataset contains 590 original videos collected from YouTube and 5,639 corresponding videos generated by deep forgery methods.

Testing. To ensure the consistency of the tests, unless otherwise specified, all data used in the experiments are raw data, and no other operations are performed on the test datasets. Since the duration of a video to be tested is longer than the duration of our sample clip, we use random sampling to construct the test set: the start frame of each input video is chosen randomly, and the detection sequences are extracted at equal intervals. To ensure fairness, each test video is copied ten times to randomly recreate a test set.
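The test-set construction described above (a random start frame per copy, frames taken at equal intervals, ten copies per video) can be summarized in a few lines. The function below is a hypothetical helper, not the authors' code, and the clip length and frame interval are assumed parameters.

```python
# Hedged sketch of the test-sampling protocol: for each test video, draw a random
# start frame and extract a fixed-length sequence at equal intervals, repeated ten times.
import random

def sample_test_clips(num_frames, clip_len=64, interval=2, copies=10, seed=0):
    rng = random.Random(seed)
    clips = []
    span = (clip_len - 1) * interval                 # frames covered by one clip
    for _ in range(copies):                          # each test video is copied ten times
        start = rng.randint(0, max(0, num_frames - span - 1))
        clips.append([start + i * interval for i in range(clip_len)])
    return clips

clips = sample_test_clips(num_frames=338)            # 338 frames is the reported average length
```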
4.2. Results of Video Forgery Detection

The comparison results of Bita-Net and four state-of-the-art benchmark algorithms on four facial video forgery detection datasets are reported in Table 1. It is clear that Bita-Net outperforms all other methods on all datasets (over 98.5%), demonstrating the effectiveness of our method. The superior generalization ability of Bita-Net comes from two factors: 1) the architecture of the bi-temporal network, which combines the temporal information across frames with the spatial features within a frame; and 2) the introduction of the attention branch, which further enhances the sensitivity of the model to forgery fingerprints.

In addition, we investigate the influence of different configurations of the network structure on the detection accuracy. We conduct experiments with τb increased to 4 and to 16, with α and β unchanged. Moreover, we also check how the backbone of the two-pathway network (ResNet50 or ResNet101) affects the overall performance. According to the comparison results in Table 1, substituting ResNet50 with ResNet101 generally improves the detection accuracy (except on UADFV). A detailed discussion of the experimental results on each dataset follows:

FaceForensics++. To evaluate the accuracy of detecting fake videos generated by a variety of manipulation methods, experiments are performed with each forgery method on the corresponding sub-dataset of FaceForensics++. The results on the FF++ (Deepfakes), FF++ (Face2Face), FF++ (FaceShifter), FF++ (FaceSwap), and FF++ (NeuralTextures) datasets show that the best accuracy of Bita-Net reaches 99.83%, 98.54%, 99.42%, 98.65%, and 98.87%, which is 1.30%, 0.73%, 1.89%, 0.83%, and 1.08% higher than the best of the other open-source detection algorithms, respectively. It follows that Bita-Net generalizes well across a variety of forgery methods.

UADFV. On the UADFV dataset, the I3D, C2D [3], Meso-Net [2], EfficientNet-B0, and EfficientNet-B4 [29] models achieve accuracies of 97.31%, 95.17%, 94.39%, 96.67%, and 97.69%, respectively. It is notable that the detection accuracy of Bita-Net-16x64,R101 is about 2% higher; due to the residual connections and the richer input sequences, the learning process is more effective.

DeepfakeTIMIT. DeepfakeTIMIT is not a large-scale dataset with a large number of fake videos, so it poses a challenge for evaluating the generalization of face forensics detection algorithms. Nevertheless, Bita-Net still achieves 1.78%-3.08% better performance than the other algorithms.

Celeb-DF. The Celeb-DF [21] database contains a large number of videos, and their quality is close to that of daily TV shows, so performance on this dataset is more representative of everyday data. From Table 1, it can be clearly seen that our algorithm performs better, with an increase of about 2% over the others.

In conclusion, the above experimental results verify that Bita-Net performs well in facial fake video detection even though the videos are generated by a variety of forgery methods, and that combining the bi-temporal structure with the attention branch is an effective way to increase accuracy. In practice, discriminating fake videos in daily life is complex; nevertheless, compared with other detection algorithms, Bita-Net is better suited to sophisticated outdoor scenes, and it shows at least a 1% performance improvement over current detection algorithms on multi-task and multi-scenario datasets.

4.3. Ablation Experiments

This part reports the ablation experiments carried out on the FaceForensics++ dataset to demonstrate the effectiveness of each module in Bita-Net. Sec. 4.2 has already shown the significant improvement of Bita-Net over state-of-the-art methods on a variety of public test sets; here, the different pathways and lateral connections, the attention label, and the attention branch are analyzed in turn.

Different pathways and lateral connections. Table 2 shows that all Bita-Net models with lateral connections perform better than the browsing-only pathway (81.09% accuracy) and the scrutinizing-only pathway (95.77% accuracy), whereas the variant without connections does not. Comparing the different connection methods, the best one is temporal dense convolution, whose detection accuracy is 2.59% and 17.27% higher than that of the scrutinizing-only and browsing-only pathways, respectively. We therefore set temporal dense convolution as the default connection method. The reason why the temporal dense convolution lateral connection performs better than the other connection methods is that it retains the information of the browsing pathway more effectively.

Model Lateral connection Acc.
Browse only - 81.09%
Scrutinize only - 95.77%
Bita-Net without connection 89.53%
Bita-Net Channel element addition 96.18%
Bita-Net Temporal stride 97.84%
Bita-Net Temporal sparse conv. 98.05%
Bita-Net Temporal dense conv. 98.36%
Table 2. Bi-temporal fusion: comparison of different lateral connections. To facilitate comparison, all experiments use Bita-Net-16x64,R50 as the base network.
Soft Attention Label. Experiments show that the performance of detection algorithms drops drastically under image compression, because the resulting 'artifacts' are generated from the original images during compression and are not forged fingerprints. The soft label method (Fig. 4) achieves higher accuracy than simply setting the attention label of real images to a pure black image. Moreover, it improves the generalization of the model on compressed images.

Model Attention method Accuracy
Bita-Net Bi-temporal 96.53%
Bita-Net Browsing pathway 97.35%
Bita-Net Scrutinizing pathway 98.36%
Bita-Net Bi-pathway 98.11%
Table 4. Attention branch: accuracy comparison with the attention branch added to different pathways. 'Bi-temporal' means no attention branch; 'Bi-pathway' means the attention branch is added to both pathways.
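For reference, injecting the attention map produced by the U-Net into a pathway, as compared in Table 4, amounts to a simple element-wise reweighting of the feature maps. The snippet below is a hedged sketch of that step; the interpolation mode, tensor shapes, and the use of a single map per clip are assumptions.

```python
# Hedged sketch: resize a U-Net attention map to a feature map's spatial size
# and multiply it in, as described for the attention branch (shapes are assumptions).
import torch
import torch.nn.functional as F

def apply_attention(features, attention):
    # features:  (B, C, T, H, W) feature maps from a pathway
    # attention: (B, 1, h, w) attention map from the U-Net (one map per clip for simplicity)
    b, c, t, h, w = features.shape
    att = F.interpolate(attention, size=(h, w), mode="bilinear", align_corners=False)
    att = att.unsqueeze(2)                     # (B, 1, 1, H, W), broadcast over channels and time
    return features * att                      # element-wise reweighting of suspicious regions

out = apply_attention(torch.randn(2, 64, 4, 56, 56), torch.rand(2, 1, 224, 224))
```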
References

[7] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.
[8] R. Durall, M. Keuper, F.-J. Pfreundt, and J. Keuper. Unmasking deepfakes with simple features. arXiv preprint arXiv:1911.00686, 2019.
[9] H. Farid. Image forgery detection. IEEE Signal Processing Magazine, 26(2):16–25, 2009.
[10] C. Feichtenhofer, H. Fan, J. Malik, and K. He. SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6202–6211, 2019.
[11] J. Galbally, S. Marcel, and J. Fierrez. Biometric antispoofing methods: A survey in face recognition. IEEE Access, 2:1530–1552, 2014.
[12] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. arXiv preprint arXiv:1406.2661, 2014.
[13] D. Güera and E. J. Delp. Deepfake video detection using recurrent neural networks. In IEEE International Conference on Advanced Video and Signal Based Surveillance, pages 1–6, 2018.
[14] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51, 2011.
[15] M. Koopman, A. M. Rodriguez, and Z. Geradts. Detection of deepfake video manipulation. In Irish Machine Vision and Image Processing Conference, pages 133–136, 2018.
[16] P. Korshunov and S. Marcel. DeepFakes: a new threat to face recognition? Assessment and detection. arXiv preprint arXiv:1812.08685, 2018.
[17] L. Li, J. Bao, H. Yang, D. Chen, and F. Wen. FaceShifter: Towards high fidelity and occlusion aware face swapping. arXiv preprint arXiv:1912.13457, 2019.
[18] L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen, and B. Guo. Face X-ray for more general face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5001–5010, 2020.
[19] Y. Li, M.-C. Chang, and S. Lyu. In ictu oculi: Exposing AI created fake videos by detecting eye blinking. In IEEE International Workshop on Information Forensics and Security, pages 1–7, 2018.
[20] Y. Li and S. Lyu. Exposing deepfake videos by detecting face warping artifacts. arXiv preprint arXiv:1811.00656, 2018.
[21] Y. Li, X. Yang, P. Sun, H. Qi, and S. Lyu. Celeb-DF (v2): A new dataset for deepfake forensics. arXiv preprint arXiv:1909.12962, 2019.
[22] F. Matern, C. Riess, and M. Stamminger. Exploiting visual artifacts to expose deepfakes and face manipulations. In IEEE Winter Applications of Computer Vision Workshops, pages 83–92, 2019.
[23] H. H. Nguyen, J. Yamagishi, and I. Echizen. Use of a capsule network to detect fake images and videos. arXiv preprint arXiv:1910.12467, 2019.
[24] H. Qi, Q. Guo, F. Juefei-Xu, X. Xie, L. Ma, W. Feng, Y. Liu, and J. Zhao. DeepRhythm: Exposing deepfakes with attentional visual heartbeat rhythms. In Proceedings of the ACM International Conference on Multimedia, pages 4318–4327, 2020.
[25] Y. Qian, G. Yin, L. Sheng, Z. Chen, and J. Shao. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In European Conference on Computer Vision, pages 86–103, 2020.
[26] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner. FaceForensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1–11, 2019.
[27] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, and P. Natarajan. Recurrent convolutional strategies for face manipulation detection in videos. Interfaces, 3(1), 2019.
[28] M. C. Stamm and K. R. Liu. Forensic detection of image manipulation using statistical intrinsic fingerprints. IEEE Transactions on Information Forensics and Security, 5(3):492–506, 2010.
[29] M. Tan and Q. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114, 2019.
[30] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner. Face2Face: Real-time face capture and reenactment of RGB videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2387–2395, 2016.
[31] X. Yang, Y. Li, and S. Lyu. Exposing deep fakes using inconsistent head poses. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8261–8265, 2019.