Narayan DF-Platter Multi-Face Heterogeneous Deepfake Dataset CVPR 2023 Paper

This CVPR paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version; the final published version of the proceedings is available on IEEE Xplore.

DF-Platter: Multi-Face Heterogeneous Deepfake Dataset

Kartik Narayan*, Harsh Agarwal*, Kartik Thakral*, Surbhi Mittal*, Mayank Vatsa, and Richa Singh
IIT Jodhpur, India
{narayan.2, agarwal.10, thakral.1, mittal.5, mvatsa, richa}@iitj.ac.in

* Equal contribution by student authors.

Abstract

Deepfake detection is gaining significant importance in the research community. While most research efforts are focused on high-quality images and videos with controlled appearance of individuals, deepfake generation algorithms now have the capability to generate deepfakes with low resolution, occlusion, and manipulation of multiple subjects. In this research, we emulate the real-world scenario of deepfake generation and propose the DF-Platter dataset, which contains (i) both low-resolution and high-resolution deepfakes generated using multiple generation techniques and (ii) single-subject and multiple-subject deepfakes, with face images of Indian ethnicity. Faces in the dataset are annotated for various attributes such as gender, age, skin tone, and occlusion. The dataset was prepared in 116 days with continuous usage of 32 GPUs, accounting for 1,800 GB of cumulative memory. With over 500 GB in size, the dataset contains a total of 133,260 videos encompassing three sets. To the best of our knowledge, this is one of the largest datasets containing vast variability and multiple challenges. We also provide benchmark results under multiple evaluation settings using popular and state-of-the-art deepfake detection models, for c0 images and videos along with c23 and c40 compression variants. The results demonstrate a significant performance reduction in the deepfake detection task on low-resolution deepfakes. Furthermore, existing techniques yield declined detection accuracy on multiple-subject deepfakes. It is our assertion that this database will improve the state-of-the-art by extending the capabilities of deepfake detection algorithms to real-world scenarios. The database is available at: http://iab-rubric.org/df-platter-database.

Figure 1. Samples showcasing multi-face deepfakes circulated on social media. (a) A Zoom call with a deepfake of Elon Musk [8]. (b) Real-time deepfake generation at America's Got Talent [9]. (c) Deepfake round-table with multiple deepfake subjects [33].

1. Introduction

With the advent of diverse deep learning architectures, significant breakthroughs have been made in the field of image/video forgery. This has led to an incredible rise in the amount of fake multimedia content being generated, due to increased accessibility and lower training requirements. Not only has the amount of such media risen, but the sophistication of such content has also improved drastically, making it indistinguishable from real videos. While most deepfakes are used for entertainment purposes, like parody films and filters in apps, they can also be used to illicitly defame someone, spread misinformation or propaganda, or conduct fraud. In the 2020 Delhi state elections in India, a deepfake video of a popular political figure was created [34] and, according to some estimates, the deepfake was disseminated to about 15 million people in the state [13]. Given the abuse of deepfakes and their possible impact, the necessity for better and more robust deepfake detection methods is unavoidable.

Designing a dependable deepfake detection system requires the availability of comprehensive deepfake datasets for training. Table 1 summarizes the key characteristics of the publicly available deepfake datasets. Most of the datasets

Table 1. Quantitative comparison of DF-Platter with existing deepfake datasets.

| Dataset | Real Videos | Fake Videos | Total Videos | Total Real Subjects | Source | Multiple faces per image/video | Face Occlusion | Generation Techniques | Low Resolution¹ | Annotations² |
|---|---|---|---|---|---|---|---|---|---|---|
| FF++ [30] | 1,000 | 4,000 | 5,000 | N/A | YouTube | ✗ | ✗ | 4 | ✗ | ✗ |
| Celeb-DF [21] | 590 | 5,639 | 6,229 | 59 | YouTube | ✗ | ✗ | 1 | ✗ | ✗ |
| UADFV [36] | 49 | 49 | 98 | 49 | YouTube | ✗ | ✗ | 1 | ✗ | ✗ |
| DFDC [6] | 23,654 | 104,500 | 128,154 | 960 | Self-Recording | ✗ | ✗ | 8 | ✗ | ✗ |
| DeepfakeTIMIT [15] | 640 | 320 | 960 | 32 | VidTIMIT | ✗ | ✗ | 2 | ✓ | ✗ |
| DF-W [29] | N/A | 1,869 | 1,869 | N/A | YouTube & Bilibili | ✗ | ✗ | 4 | ✗ | ✗ |
| KoDF [16] | 62,166 | 175,776 | 237,942 | 403 | Self-Recording | ✗ | ✗ | 6 | ✗ | ✗ |
| WildDeepfake [39] | 707 | 707 | 1,414 | N/A | Internet | ✗ | ✗ | N/A | ✗ | ✗ |
| OpenForensics [17] | 45,473* | 70,325* | 115,325* | N/A | Google Open Images | ✓ | ✓ | 1 | ✗ | ✗ |
| DeePhy [26] | 100 | 5,040 | 5,140 | N/A | YouTube | ✗ | ✓ | 3 | ✗ | ✓ |
| DF-Platter (ours) | 764 | 132,496 | 133,260 | 454 | YouTube | ✓ | ✓ | 3 | ✓ | ✓ |

¹ Low resolution means the dataset contains low-resolution deepfakes generated using low-resolution videos and not by down-sampling.
² The dataset provides annotations such as skin tone, facial attributes and face occlusion.
* The number of images has been reported since the dataset contains only images.

contain high-resolution images with single faces in the image, while some of them contain deepfakes generated through multiple generation techniques at multiple levels of compression. In the online era, where most content is shared over the web and social media channels, the videos and images shared are of low resolution to provide transmission efficiency. There are increasing instances of deepfake videos in unconstrained settings, for instance, with occlusions on the face (such as a pair of spectacles, hat, cap, turban, or hijab) and multiple faces with pose variations. While there have been several works [11, 22, 25, 37, 38] related to deepfake detection, we empirically observe that state-of-the-art detection techniques fail to detect such deepfakes. This demonstrates the need to enhance deepfake detection technology to address such upcoming challenges.

Existing datasets traditionally comprise single-subject deepfakes generated using a single generation technique [14, 17, 21, 36]. However, developing a deepfake video with multiple forged subjects is also possible. Recently, developers at Collider [33] published a deepfake with multiple fake faces in a single frame. The video, titled "Deepfake Roundtable", envisions a discussion involving deepfakes of 5 celebrities. A state-of-the-art model trained on the FaceForensics++ dataset is unable to identify the deepfake faces in the video. Some of these examples are shown in Figure 1. The recently published OpenForensics dataset [17] contains deepfakes with multiple faces and occlusion; however, it uses only one generation technique and does not contain any annotations for the skin tone and age of subjects. Further, the OpenForensics dataset is a segmentation-based dataset, while the DF-Platter dataset is a new detection dataset with multiple generation techniques and low-resolution variations.

Contributions: This research proposes a novel deepfake detection dataset, titled the DF-Platter dataset, to promote the capabilities of deepfake detection for the upcoming challenges. The dataset contains a total of 133,260 videos encompassing different sets. The subjects in the real videos are annotated for various attributes such as gender, age, skin tone, and occlusion. The video samples comprise occluded deepfakes, low-resolution (LR) deepfakes, and multi-face (multiple-subject) deepfakes. The following are the key characteristics of the research work:

• The dataset utilizes low-resolution videos for creating deepfakes. While existing datasets synthetically interpolate deepfakes from high-resolution videos, we generate low-resolution deepfakes by utilizing low-resolution videos. This improves the visual quality of low-resolution deepfakes.

• The dataset contains multi-face (multiple-subject) deepfake sets created using multiple generation techniques, where each face in the video frame is annotated as real or fake. We also use three metrics for thorough evaluation on multi-face deepfakes.

• The dataset provides a gender-balanced distribution of deepfakes with subjects of Indian ethnicity and is annotated for various attributes like gender, age, skin tone, and occlusion.

2. The DF-Platter Dataset

In this work, we introduce a large-scale deepfake dataset termed DF-Platter. This dataset contains a total of 133,260 videos with an approximate duration of 20 seconds each (an estimated total of 30.67 days). It is the second largest dataset in terms of the total number of videos, behind only KoDF [16]. The dataset contains deepfake videos curated and generated at high resolution (HR) as well as low resolution (LR). It comprises three sets: Set A, Set B, and Set C. Set A contains single-subject deepfakes. For generating single-subject deepfakes, there is a source video and a target video containing one subject each. The background in the target video is preserved while the face in the target video is swapped with the face in the source video.
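At a high level, this single-subject construction can be sketched as below; `make_deepfake` and `toy_swap` are illustrative stand-ins of ours, not the authors' pipeline (which uses FSGAN and FaceShifter, described in Section 2.2):

```python
# Sketch of single-subject deepfake construction: for every frame of the
# target video, only the face region is replaced with the source identity;
# the background is left untouched. The swap function is a hypothetical stub.
from typing import Callable, List

def make_deepfake(target_frames: List[str], source_identity: str,
                  swap_face: Callable[[str, str], str]) -> List[str]:
    """Apply the face swap frame-by-frame, preserving everything else."""
    return [swap_face(frame, source_identity) for frame in target_frames]

def toy_swap(frame, identity):
    # Stand-in for a real generator (e.g. FSGAN): tags the frame with the
    # identity that was swapped in.
    return f"{frame}+face:{identity}"

fake_video = make_deepfake(["frame0", "frame1"], "source_subject", toy_swap)
```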

Figure 2. Samples from the DF-Platter dataset in Set A: Occluded and low-resolution deepfakes, Set B: Multi-face intra-deepfakes, and Set C: Multi-face deepfakes with celebrities as the target.

Sets B and C contain videos with multiple subjects, and we create multi-face deepfakes where the faces of more than one subject in the video are manipulated. Set B consists of intra-deepfakes, where the faces of one or more subjects within a particular video are swapped, whereas Set C consists of multi-face deepfakes where faces in the (source) videos are manipulated to look like celebrity (target) faces. The dataset is generated using the FSGAN, FaceShifter and FaceSwap techniques to diversify the dataset in terms of generation techniques. The dataset consists of subjects of Indian ethnicity and is richly annotated with attributes such as resolution, gender, age, skin tone, and facial occlusion. While most publicly available datasets have an imbalance in terms of different attributes such as gender, skin tone, and age [24, 35], the DF-Platter dataset is balanced across resolution and gender. Further, all videos in the dataset are provided at two additional compression levels: c23 and c40.

Table 2. Summarizing the details of the DF-Platter dataset.

| Sets | Low-res. videos | High-res. videos | c0 | c23 | c40 | Train | Test |
|---|---|---|---|---|---|---|---|
| Set A | 65,649 | 65,649 | ✓ | ✓ | ✓ | ✓ | ✓ |
| Set B | 500 | 500 | ✓ | ✓ | ✓ | ✗ | ✓ |
| Set C | 481 | 481 | ✓ | ✓ | ✓ | ✗ | ✓ |
| Total | 66,630 | 66,630 | ✓ | ✓ | ✓ | - | - |

2.1. Dataset Statistics

The dataset statistics for the DF-Platter dataset are presented in Table 2. The DF-Platter dataset comprises 764 real videos, encompassing LR as well as HR videos, with 454 different subjects. Many of the existing deepfake datasets consist of source videos filmed in an extremely controlled environment with limited variation in expressions, poses, background, and illumination conditions [6, 16]. To closely mimic real-world scenarios, the videos in our dataset are collected in the wild, specifically from YouTube, with diversity in gender, orientation, skin tone (measured on the Fitzpatrick scale [7]), size of face (in pixels), lighting conditions, background, and the presence of occlusion. Occlusion occurs when hands, hair, spectacles, or any other object blocks part of the source or target face. The videos

are downloaded at 720p resolution for HR videos and at 360p resolution for LR videos. For Set A, a total of 602 real videos are used for the generation of deepfakes. These videos have a nearly equal distribution in gender and resolution, i.e., 151 videos for the male gender and 150 videos for the female gender. All videos are collected at both low and high resolutions. The duration of each video is approximately 20 seconds. For Set B, we employ 100 real videos for deepfake generation, with an equal split of low-resolution and high-resolution videos. These videos have multiple subjects within one frame of a video. Set C is generated using 62 real videos. These deepfakes are generated with celebrity faces pasted over multiple target subjects in each frame. The dataset contains all sets at three compression levels (c0, c23, and c40), where c0 denotes the default compression of the videos when downloaded from the YouTube platform.

2.2. Dataset Generation Techniques

The deepfakes are generated using three state-of-the-art synthesis methods: FSGAN [28], FaceSwap [2] and FaceShifter [19]. Set A contains deepfakes generated using FSGAN and FaceShifter, whereas Sets B and C consist of deepfakes generated using all three generation techniques. The details of the methods employed are described below:

FSGAN [28] is capable of face swapping as well as reenactment. It restores the missing attributes of the reenacted face and blends the whole face with the target. It captures the identity of the source subject and recreates it to suit the target subject's pose, expression, and angle. We fine-tune the FSGAN model on the real videos, as suggested by the authors, for producing more realistic results [3]. The FSGAN architecture is adopted because of its efficiency and good generation of occluded samples.

FaceSwap [2] is a popular, computationally expensive open-source deepfake generation software employed to swap faces across videos and images. It comprises an encoder-decoder-based architecture with a common encoder and two different decoders for source and target faces. It is one of the generation techniques used in the FaceForensics++ [30] dataset.

FaceShifter [19] is a two-stage framework to generate face-swapped videos in an occlusion-aware manner with high fidelity. In contrast to previous face-swapping algorithms, FaceShifter tries to thoroughly blend facial features by transferring localized feature maps between facial regions. For our dataset, we utilize publicly available pre-trained weights [1]. This method is adopted because of its occlusion-aware nature. The GAN-based framework of FaceShifter guarantees less training time and high realism in the generated fake videos.

2.3. Dataset Organization and Description

In this section, we discuss the organization of the dataset. An equal distribution across gender and resolution is maintained in all sets. For Set A, the dataset comprises 16,337 fake videos per generative model (FSGAN, FaceShifter), per gender (male, female), and per resolution (low-resolution, high-resolution). Similarly, for Sets B and C, there are 150 fake videos per model (FSGAN, FaceShifter, FaceSwap), per gender (male, female), and per resolution (low-resolution, high-resolution). Manual filtering of the sets is performed to ensure good-quality deepfakes. The dataset has been organized into different subsets on the basis of resolution and compression. The notations 'HR' and 'LR' indicate high resolution and low resolution, respectively. 'c0', 'c23' and 'c40' represent the three levels of compression, namely no compression, medium compression and hard compression, respectively. We generate LR deepfakes by using low-resolution videos instead of downsampling HR deepfakes. Samples from the dataset are shown in Figure 2. The entire dataset has been organized into three sets on the basis of distinct properties:

Set A: It consists of 130,696 single-subject deepfake videos synthesized using two generation techniques, FSGAN and FaceShifter. The set includes annotations for the skin tone, facial occlusion, and apparent age of each subject. We include a variety of facial occlusions, such as beard, spectacles, cap/turban, and hair, as present in an uncontrolled environment. The set is also gender-balanced, comprising 150 female and 151 male subjects. The distribution of attributes for real subjects in Set A is shown in Figure 3.

Set B: It consists of a total of 900 intra-deepfake videos and 100 real videos. The fake videos are synthesized using the three generation techniques (FSGAN, FaceSwap and FaceShifter). In each real video, a minimum of 2 and a maximum of 5 subjects are present, out of which a minimum of 2 and a maximum of 3 faces are swapped during the generation of fake videos.

Set C: This set is similar to Set B, focusing specifically on Indian celebrities as source faces in the deepfake videos. We utilize the real videos used in Set B and swap their faces with single-subject celebrity faces. The set contains 62 real and 900 deepfake videos. As in Set B, the videos contain a minimum of 2 and a maximum of 5 subjects, out of which 2 to 3 faces are swapped in each video.

Size and Format: The DF-Platter dataset is around 417 GB in its raw form. It contains a total of 133,260 videos, wherein each video is approximately 20 seconds in duration. The videos are made available in MPEG-4 format at high resolution as well as the corresponding low resolution. All videos have a frame rate of 25 fps. The dataset consistently contains the same videos across resolution, compression and generation technique. For compression at levels c23 and c40, H.264 video compression is utilized.
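Concretely, the c23 and c40 variants can be derived from a c0 video with standard H.264 tooling. The sketch below only builds the ffmpeg command lines; it assumes, by analogy with FaceForensics++ (whose c23/c40 naming this dataset follows), that the two levels correspond to libx264 constant rate factors 23 and 40, and the file paths are illustrative:

```python
# Sketch: building H.264 re-encoding commands for the c23/c40 variants.
# Assumption: c23/c40 correspond to libx264 CRF 23 and 40; the paper only
# states that H.264 compression is used for these levels.

def h264_compress_cmd(src, dst, crf):
    """Return an ffmpeg argv list that re-encodes `src` at the given CRF."""
    if not 0 <= crf <= 51:  # libx264 accepts CRF values 0..51
        raise ValueError("CRF must be in [0, 51]")
    return ["ffmpeg", "-y", "-i", src, "-c:v", "libx264", "-crf", str(crf), dst]

cmd_c23 = h264_compress_cmd("set_a/video_c0.mp4", "set_a/video_c23.mp4", 23)
cmd_c40 = h264_compress_cmd("set_a/video_c0.mp4", "set_a/video_c40.mp4", 40)
# Run with, e.g., subprocess.run(cmd_c23, check=True) when ffmpeg is installed.
```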

Figure 3. Distribution of real subjects of the DF-Platter dataset across multiple attributes. The dataset is gender-balanced, and the subjects are majorly young adults. The subjects are distributed across skin tone types as per the Fitzpatrick scale [7]. The rightmost pie chart shows the occlusion types and their distribution across the dataset.

Annotation and Diversity: DF-Platter has subjects of Indian ethnicity and is annotated with the attributes gender, resolution, occlusion, and skin tone. Gender has been annotated in two classes: Male or Female. The skin tone is annotated on a scale of 1 to 6 for each subject using the Fitzpatrick scale [7]. The skin tone annotations are generated automatically by utilizing the method introduced in Groh et al. [10] and then verified by a human annotator. The apparent age attribute is classified into three classes: Young Adult (evident age between 18 and 30 years), Adult (evident age between 30 and 55 years), and Old (evident age above 55 years). 51.33% of subjects are classified as "Young Adult", 42% as "Adult" and 6.66% as "Old". Facial occlusion is annotated into eight broad categories: 5 o'clock shadow, Beard, Moustache, Spectacles, Shades, Microphone, Cap/Turban/Hijab/Scarf, and Hair Occlusion. These attributes are binary in nature. Moustache and Beard are the most common types of occlusion in males, with around 90% of subjects having them. Figure 3 summarizes the different types of annotations with their distribution.

2.4. Visual Quality Assessment

To evaluate the visual quality of the proposed dataset, we use the BRISQUE [23] quality metric for all (HR, c0) sets, as shown in Figure 4. On a scale of 0 (best) to 100 (worst), the average BRISQUE score for the dataset is 43.25. The set-wise BRISQUE scores for Set A (Train), Set A (Test), Set B, and Set C are 42.69, 43.80, 52.46 and 51.66, respectively. The BRISQUE scores for FaceForensics++, CelebDF, DFDC, and OpenForensics are approximated from Le et al. [17]. These scores highlight that the proposed dataset is of high quality and is, therefore, challenging with multiple covariates.

Figure 4. Comparison of BRISQUE score with existing deepfake detection datasets. The blue colored bars signify the scores for the entire dataset (best viewed in color).

We also perform a user study with 28 participants having prior experience in the computer vision domain. A total of 200 samples from the dataset (gender-balanced) were randomly selected to perform the study. Each participant was asked to classify each sample as real or fake, along with their confidence level out of 5. We observed that the deepfakes in our dataset are hard for humans to detect, with an overall detection accuracy of 59.94% and an average classification confidence of 3.9.

2.5. Computational Setup

The real videos for the DF-Platter dataset are collected from YouTube. The videos are generated using FaceSwap [2], FaceShifter [19], and FSGAN [28] through their publicly available GitHub repositories. For FaceSwap, each video was generated after 8 hours of training on sixteen Nvidia A100 GPUs of 80 GB memory each and twelve Nvidia V100 GPUs of 32 GB memory each. Similarly, deepfake videos using FaceShifter were generated using pre-trained weights with default parameters on three Nvidia RTX 3090 GPUs of 24 GB memory each. Further, for gen-

erating deepfake videos using FSGAN, the re-enactment generator of FSGAN was fine-tuned for each source video. The inferencing was performed using twelve Nvidia V100 GPUs of 32 GB memory each and an Nvidia DGX A40 GPU of 48 GB memory. The dataset generation was completed in over 116 days with parallel usage of the above-mentioned GPUs. The benchmark experiments on the dataset are performed on an Nvidia DGX station with four V100 GPUs in a multi-GPU fashion.

2.6. Implementation Details

In this section, we provide the implementation details for reproducibility of the benchmarking experiments. We use the DSFD detector [18] to extract faces from the frames of each video. For all the protocols, the models are trained for 30 epochs with early stopping, and the model with the best validation accuracy is selected. Each experiment is performed by training the same model three times with different training and validation splits. The performance of the three models is averaged across the test set and reported. We use the Adam optimizer with an initial learning rate of 0.0001. A batch size of 256 is used for distributed training.

3. Experimental Setup

In this section, we describe the protocols designed for training and testing on the DF-Platter dataset, followed by the deepfake detection algorithms and evaluation metrics used for benchmarking. The proposed dataset aims to address the following research questions:
RQ1: Can we detect occluded deepfakes?
RQ2: Can we detect multi-face deepfakes in videos?
RQ3: Can we detect low-resolution and compressed deepfakes on the web and on social media channels?

3.1. Evaluation Protocol

DF-Platter comprises three sets. Set A is used for training as well as evaluation, whereas Sets B and C are evaluation sets only. Since the dataset comprises videos, we conduct experiments by extracting frames from the videos. We extract 10 frames from each fake video and all the frames from each real video. This is done in order to mitigate the imbalance between the number of real and fake videos in the dataset (refer to Table 2). The details of the training and testing sets are summarized in Table 3.

Table 3. The dataset protocol used for training, validation, and testing in the DF-Platter dataset (HR, c0). The numbers will be approximately the same for other resolutions and compressions.

| Split | Set | Real frames | Fake frames | Total frames | Videos |
|---|---|---|---|---|---|
| Training | Set A | 307,221 | 323,120 | 630,341 | 32,824 |
| Testing | Set A | 85,135 | 330,360 | 415,495 | 32,825 |
| Testing | Set B | 500 | 4,500 | 5,000 | 500 |
| Testing | Set C | 310 | 4,500 | 4,810 | 481 |

Protocol 1 - Occluded Deepfakes: This protocol employs Set A as the test set. It is divided in a subject-disjoint manner such that there are 130 subjects in the training set, 10 subjects in the validation set, and 150 in the testing set. In terms of the number of frames, this set contains 677,980 frames in the training split, 119,320 frames in the validation split, and 840,408 frames in the testing split. There is a significant difference between the two classes, "Real" and "Fake", in the number of videos (the real:fake ratio is nearly 1:4), which leads to a skew in the dataset. To counter this, we repeat the real videos while training the different architectures so as to get a nearly equal number of real and fake samples. The state-of-the-art models are then tested to detect occluded deepfakes. The results are provided in three compression settings (c0, c23, and c40) and help gauge the quality of the deepfakes in the dataset relative to existing datasets.

Protocol 2 - Multi-Face Deepfakes: This protocol employs Set B and Set C as the test sets. Set B consists of a total of 500 videos. In terms of frames, it consists of 500 real frames and 4,500 fake frames. Each frame can have one or more real (or fake) faces. A video (or frame) is considered fake if at least one face is manipulated. Similar to Set B, Set C is also an evaluation set. It consists of 481 videos, out of which 31 are real and 450 are fake. The models are tested for performance on Sets B and C, which have multiple subjects, using a variety of metrics.

Protocol 3 - Cross-Resolution and Cross-Compression: We perform experiments to analyze the cross-resolution and cross-compression performance of existing deepfake detectors in real-world settings where deepfakes are shared on the web and social media. In the cross-resolution experiment, the models are trained on (c0, HR) samples and tested on (c0, LR) samples. In the cross-compression experiment, the models are trained on (c23, HR) samples and tested on (c40, HR) samples. In both experiments, the training samples are taken from Set A and testing is performed on all three sets.

3.2. DeepFake Detection Methods

We employ six state-of-the-art deepfake detection models to benchmark the dataset, for both single-subject as well as multiple-subject deepfake detection.
MesoNet [4] takes a mesoscopic approach to detecting facial forgery. Both its variants, Meso-4 and MesoInception-4, are utilized for benchmarking.
FWA [20] is a CNN-based detection technique that focuses on the artifacts stemming from affine and other transforms applied during deepfake generation.
XceptionNet [5] is based on depthwise separable convo-

Table 4. All the methods are trained and tested in the same setting of compression and resolution for this experiment. We report the Accuracy (%) and AUC for Set A (Protocol 1), and FaceWA (%), FaceAUC, FLA (%) and VLA (%) for Sets B and C (Protocol 2).

| Trained & Tested on | Models | Set A Accuracy | Set A AUC | Set B FaceWA | Set B FaceAUC | Set B FLA | Set B VLA | Set C FaceWA | Set C FaceAUC | Set C FLA | Set C VLA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| c0, HR | MesoNet [4] | 84.90 ± 2.25 | 0.67 ± 0.08 | 78.46 ± 2.92 | 0.57 ± 0.01 | 58.13 ± 3.87 | 62.6 ± 5.60 | 78.57 ± 2.42 | 0.69 ± 0.03 | 57.90 ± 4.02 | 58.76 ± 6.96 |
| c0, HR | Meso-Inception [4] | 86.62 ± 0.40 | 0.70 ± 0.01 | 79.92 ± 1.95 | 0.58 ± 0.00 | 60.89 ± 2.01 | 65.41 ± 1.48 | 79.68 ± 1.58 | 0.69 ± 0.01 | 60.81 ± 2.51 | 62.60 ± 2.18 |
| c0, HR | FWA [20] | 82.47 ± 1.50 | 0.59 ± 0.04 | 84.83 ± 2.35 | 0.55 ± 0.01 | 71.98 ± 7.01 | 79.90 ± 8.34 | 83.71 ± 1.04 | 0.64 ± 0.06 | 66.75 ± 1.35 | 78.72 ± 7.79 |
| c0, HR | Xception [5] | 84.76 ± 0.77 | 0.64 ± 0.02 | 86.02 ± 1.60 | 0.56 ± 0.01 | 74.07 ± 4.29 | 80.41 ± 5.10 | 86.00 ± 0.74 | 0.71 ± 0.03 | 71.36 ± 1.41 | 78.12 ± 4.94 |
| c0, HR | DSP-FWA [20] | 91.92 ± 0.57 | 0.81 ± 0.01 | 81.59 ± 1.60 | 0.42 ± 0.26 | 62.29 ± 2.95 | 65.41 ± 4.11 | 83.08 ± 0.46 | 0.77 ± 0.02 | 65.52 ± 0.97 | 64.16 ± 3.26 |
| c0, HR | Capsule [27] | 92.70 ± 1.92 | 0.83 ± 0.05 | 84.09 ± 2.57 | 0.64 ± 0.03 | 66.91 ± 5.48 | 70.96 ± 7.53 | 85.02 ± 2.91 | 0.81 ± 0.01 | 69.01 ± 4.75 | 67.04 ± 7.61 |
| c23, HR | MesoNet [4] | 85.05 ± 1.21 | 0.65 ± 0.03 | 82.21 ± 0.47 | 0.57 ± 0.00 | 64.37 ± 1.36 | 70.07 ± 2.37 | 82.35 ± 1.34 | 0.69 ± 0.05 | 64.16 ± 2.98 | 68.59 ± 2.41 |
| c23, HR | Meso-Inception [4] | 86.49 ± 0.30 | 0.68 ± 0.02 | 83.33 ± 0.42 | 0.58 ± 0.01 | 68.92 ± 0.53 | 75.09 ± 0.73 | 82.24 ± 1.11 | 0.67 ± 0.02 | 66.14 ± 1.83 | 71.47 ± 1.21 |
| c23, HR | FWA [20] | 83.93 ± 0.63 | 0.58 ± 0.02 | 83.49 ± 0.75 | 0.53 ± 0.00 | 68.65 ± 1.66 | 75.54 ± 2.02 | 82.83 ± 0.51 | 0.61 ± 0.03 | 65.03 ± 0.75 | 76.79 ± 3.26 |
| c23, HR | Xception [5] | 84.22 ± 0.84 | 0.59 ± 0.03 | 86.05 ± 1.75 | 0.55 ± 0.01 | 74.01 ± 4.70 | 81.30 ± 6.10 | 84.20 ± 1.76 | 0.65 ± 0.02 | 68.72 ± 3.01 | 78.12 ± 5.77 |
| c23, HR | DSP-FWA [20] | 87.44 ± 4.51 | 0.68 ± 0.13 | 85.27 ± 4.27 | 0.56 ± 0.04 | 72.47 ± 13.15 | 78.57 ± 16.05 | 85.73 ± 1.90 | 0.69 ± 0.14 | 70.85 ± 4.19 | 78.20 ± 16.60 |
| c23, HR | Capsule [27] | 90.09 ± 0.83 | 0.74 ± 0.02 | 83.71 ± 2.45 | 0.60 ± 0.01 | 66.27 ± 5.83 | 71.10 ± 7.06 | 83.78 ± 1.90 | 0.74 ± 0.02 | 66.55 ± 4.40 | 68.07 ± 6.99 |
| c40, HR | MesoNet [4] | 82.98 ± 0.28 | 0.59 ± 0.01 | 80.20 ± 3.74 | 0.54 ± 0.01 | 61.62 ± 7.82 | 67.04 ± 9.38 | 79.74 ± 3.05 | 0.61 ± 0.02 | 58.53 ± 4.75 | 66.44 ± 8.18 |
| c40, HR | Meso-Inception [4] | 84.21 ± 0.41 | 0.63 ± 0.01 | 81.04 ± 1.58 | 0.57 ± 0.02 | 61.92 ± 3.53 | 67.41 ± 4.62 | 79.55 ± 1.25 | 0.64 ± 0.01 | 57.49 ± 2.25 | 61.05 ± 3.25 |
| c40, HR | FWA [20] | 82.61 ± 0.18 | 0.55 ± 0.01 | 84.80 ± 1.47 | 0.53 ± 0.01 | 71.67 ± 4.00 | 79.23 ± 4.94 | 83.59 ± 0.44 | 0.60 ± 0.01 | 65.74 ± 1.27 | 78.49 ± 1.73 |
| c40, HR | Xception [5] | 82.64 ± 0.06 | 0.55 ± 0.00 | 87.47 ± 0.30 | 0.53 ± 0.01 | 78.65 ± 1.13 | 86.62 ± 1.51 | 85.47 ± 0.78 | 0.59 ± 0.02 | 69.27 ± 1.92 | 84.11 ± 2.72 |
| c40, HR | DSP-FWA [20] | 85.14 ± 2.24 | 0.63 ± 0.05 | 82.93 ± 0.84 | 0.56 ± 0.00 | 66.45 ± 1.88 | 73.39 ± 2.39 | 81.89 ± 0.55 | 0.63 ± 0.04 | 62.39 ± 1.23 | 70.66 ± 1.86 |
| c40, HR | Capsule [27] | 87.40 ± 0.18 | 0.68 ± 0.00 | 83.90 ± 0.90 | 0.60 ± 0.01 | 68.09 ± 1.35 | 74.20 ± 1.86 | 83.40 ± 0.80 | 0.68 ± 0.02 | 65.39 ± 1.39 | 71.25 ± 4.07 |
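The FaceWA, FLA, and VLA columns follow the aggregation rules given in Section 3.3. The sketch below shows one plausible implementation over per-face binary predictions (1 = fake, 0 = real); the rule for deriving a frame-level fake prediction from its face predictions is our assumption, not spelled out in the paper:

```python
# Sketch of the three multi-face accuracy notions used for Sets B and C.
# A "frame" is a list of (prediction, label) pairs, one per face.

def face_wa(frames):
    """Face-level accuracy: every face prediction is counted individually."""
    faces = [pair for frame in frames for pair in frame]
    return sum(p == y for p, y in faces) / len(faces)

def fla(frames):
    """Frame-level accuracy: a frame is correct only if ALL its faces are."""
    return sum(all(p == y for p, y in frame) for frame in frames) / len(frames)

def vla(videos):
    """Video-level accuracy: a video is called fake when more than 50% of its
    frames are predicted fake. Assumption: a frame is predicted fake if any
    of its faces is predicted fake."""
    correct = 0
    for frames, video_label in videos:
        fake_frames = sum(any(p == 1 for p, _ in frame) for frame in frames)
        predicted = 1 if fake_frames > 0.5 * len(frames) else 0
        correct += predicted == video_label
    return correct / len(videos)
```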

lutions and skip-connections as in ResNet [12]. It comprises depthwise convolution followed by pointwise convolution.
DSP-FWA [20] is an improvement over the aforementioned FWA that utilizes a dual spatial pyramid strategy.
Capsule [27] is a deep neural network that is able to compensate for the information lost during pooling operations by utilizing spatial information. It is a VGG19-based [32] architecture consisting of various capsules [31] and employs a dynamic routing algorithm to calculate the agreement between extracted features.

3.3. Evaluation Metrics

We utilize the following evaluation metrics to evaluate the performance of the different algorithms on different subsets of the DF-Platter dataset.
Set A: We report the frame-level accuracy (Accuracy) and the frame-level ROC-AUC score (AUC). Each frame is used for computation and classified as fake or real.
Set B and Set C: For Sets B and C, accuracy is computed under three settings: face-level, frame-level, and video-level. At the face level (FaceWA), the predictions corresponding to each face are used for computation. We also report the face-level ROC-AUC score (FaceAUC). At the frame level (FLA), a frame is considered correctly classified only if the predictions corresponding to all the faces in the frame are correct. At the video level, we combine the predictions obtained for the frames of a particular video, and if more than 50% of frames are classified as fake (or real), we classify the video as fake (or real).

4. Results and Analysis

This section summarizes the benchmark results obtained using the state-of-the-art deepfake detection models mentioned in Section 3.2, when trained and evaluated on the DF-Platter dataset. The evaluation is performed on the three protocols described in Section 3.1.
Protocol 1 - Occluded Deepfakes: The results of deepfake detection for high-resolution videos from Set A over different compressions are summarized in Table 4. At the (c0, HR) setting, all architectures are seen to perform well on classical single-subject deepfake videos. However, a significant drop in performance is observed for c23 and c40 compressed videos. Capsule outperforms all other network architectures, achieving mean AUC scores of 0.83 for c0, 0.74 for c23, and 0.68 for (c40, HR) videos, followed by DSP-FWA. The AUC scores obtained across the different models are in the range of 0.55 to 0.83, indicating that existing deepfake detectors fall short of detecting deepfakes with facial occlusions. This also reflects the high quality of the deepfakes in the DF-Platter dataset and the scope for improvement in deepfake detection.
Protocol 2 - Multi-Face Deepfakes: The results for the task of multi-face deepfake detection on (c0, HR) videos are shown in the (c0, HR) block of Table 4 for Sets B and C. In comparison to Set A, these are more challenging sets, which is also confirmed by the frame-level accuracies achieved by the different models on them. We also observe consistently low FLA and VLA performance by all the deepfake detectors on Set B and Set C due to the strict nature of correct classification. For Set B, the face-wise ROC-AUC score is just above 0.5 for most models, which indicates near-random classification performance. We also test state-of-the-art deepfake detectors on Set C, where we observe that XceptionNet provides mean VLA of 78.12%, 78.12%, and 84.11% on c0, c23, and c40 videos. The (c0, HR) block of Table 4 for Set C shows the results for the task of multi-face deepfake detection on (c0, HR) videos. Similar to Set B, a decrease in the frame-level accuracy is observed when compared to Set A. Though the performance of these models is comparable to Set B for (c0, HR) videos, there is a significant dip in performance for the
Table 5. All the methods are trained on (c0, HR) and tested on (c0, LR) videos (Protocol 3: Cross-resolution).

| Model | Set A Accuracy | Set A AUC | Set B FaceWA | Set B FaceAUC | Set B FLA | Set B VLA | Set C FaceWA | Set C FaceAUC | Set C FLA | Set C VLA |
|---|---|---|---|---|---|---|---|---|---|---|
| MesoNet [4] | 82.00 ± 0.20 | 0.58 ± 0.02 | 81.89 ± 2.34 | 0.53 ± 0.01 | 66.07 ± 4.79 | 71.62 ± 6.79 | 81.06 ± 2.91 | 0.58 ± 0.01 | 61.10 ± 5.12 | 73.17 ± 8.81 |
| Meso-Inception [4] | 82.12 ± 0.46 | 0.57 ± 0.01 | 84.25 ± 1.43 | 0.54 ± 0.01 | 71.13 ± 3.11 | 77.61 ± 3.49 | 82.76 ± 0.77 | 0.56 ± 0.01 | 63.64 ± 1.11 | 77.68 ± 1.94 |
| FWA [20] | 80.80 ± 0.27 | 0.53 ± 0.01 | 85.62 ± 2.38 | 0.51 ± 0.00 | 75.43 ± 5.80 | 84.85 ± 6.78 | 84.41 ± 1.48 | 0.55 ± 0.02 | 66.79 ± 2.69 | 85.29 ± 5.61 |
| Xception [5] | 81.12 ± 0.46 | 0.53 ± 0.02 | 88.05 ± 1.38 | 0.53 ± 0.01 | 80.79 ± 3.97 | 89.95 ± 4.90 | 86.33 ± 1.19 | 0.56 ± 0.02 | 70.46 ± 2.44 | 89.65 ± 5.28 |
| DSP-FWA [20] | 84.11 ± 1.21 | 0.61 ± 0.03 | 82.71 ± 2.24 | 0.54 ± 0.02 | 67.38 ± 4.84 | 72.95 ± 6.19 | 81.77 ± 1.61 | 0.58 ± 0.04 | 62.03 ± 3.52 | 73.39 ± 2.51 |
| Capsule [27] | 85.37 ± 1.50 | 0.64 ± 0.04 | 82.79 ± 1.51 | 0.56 ± 0.01 | 67.61 ± 3.26 | 74.06 ± 4.11 | 82.08 ± 1.78 | 0.59 ± 0.04 | 63.29 ± 3.28 | 74.13 ± 3.24 |
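The FaceWA, FLA, and VLA columns in Tables 5 and 6 follow the face-, frame-, and video-level aggregation defined in Section 3.3. A minimal sketch of that aggregation (a paraphrase of the stated protocol, not the authors' code; function and variable names are illustrative):

```python
def face_level_accuracy(frames_pred, frames_true):
    """FaceWA: every face prediction counts individually."""
    pairs = [(p, t) for fp, ft in zip(frames_pred, frames_true)
             for p, t in zip(fp, ft)]
    return sum(p == t for p, t in pairs) / len(pairs)

def frame_level_accuracy(frames_pred, frames_true):
    """FLA: a frame is correct only if all of its faces are correct."""
    correct = [fp == ft for fp, ft in zip(frames_pred, frames_true)]
    return sum(correct) / len(correct)

def video_level_label(frame_labels):
    """VLA voting: a video is fake (1) if more than 50% of its
    frames are classified as fake, and real (0) otherwise."""
    return 1 if sum(frame_labels) / len(frame_labels) > 0.5 else 0
```

For example, with per-frame face predictions `[[1, 1], [1, 0], [0, 0]]` against labels `[[1, 1], [1, 1], [0, 0]]`, FaceWA counts 5 of 6 faces correct while FLA counts only 2 of 3 frames correct, which is why FLA is consistently the stricter (lower) number in the tables.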

Table 6. All the methods are trained on (c23, HR) videos and tested on (c40, HR) videos (Protocol 3: Cross-compression).

| Model | Set A Accuracy | Set A AUC | Set B FaceWA | Set B FaceAUC | Set B FLA | Set B VLA | Set C FaceWA | Set C FaceAUC | Set C FLA | Set C VLA |
|---|---|---|---|---|---|---|---|---|---|---|
| MesoNet [4] | 82.77 ± 0.59 | 0.59 ± 0.02 | 84.21 ± 0.48 | 0.55 ± 0.01 | 69.48 ± 1.04 | 75.31 ± 1.15 | 83.51 ± 0.94 | 0.64 ± 0.03 | 65.12 ± 2.30 | 75.09 ± 2.29 |
| Meso-Inception [4] | 83.34 ± 0.18 | 0.61 ± 0.01 | 84.75 ± 0.19 | 0.56 ± 0.01 | 72.09 ± 0.73 | 78.49 ± 1.55 | 83.25 ± 0.84 | 0.63 ± 0.03 | 65.66 ± 2.03 | 76.42 ± 2.10 |
| FWA [20] | 82.49 ± 0.39 | 0.55 ± 0.01 | 84.91 ± 1.46 | 0.52 ± 0.01 | 72.01 ± 3.79 | 80.12 ± 4.34 | 83.56 ± 1.36 | 0.58 ± 0.01 | 64.43 ± 2.55 | 80.05 ± 4.39 |
| Xception [5] | 82.02 ± 0.33 | 0.54 ± 0.02 | 87.21 ± 1.43 | 0.54 ± 0.01 | 77.35 ± 4.19 | 85.81 ± 5.72 | 84.85 ± 1.89 | 0.59 ± 0.01 | 67.71 ± 3.68 | 82.85 ± 6.39 |
| DSP-FWA [20] | 83.76 ± 1.79 | 0.61 ± 0.08 | 85.41 ± 4.44 | 0.54 ± 0.03 | 73.43 ± 12.97 | 79.67 ± 15.96 | 84.73 ± 2.33 | 0.63 ± 0.10 | 67.73 ± 5.06 | 78.94 ± 15.90 |
| Capsule [27] | 85.48 ± 0.60 | 0.66 ± 0.02 | 80.06 ± 3.43 | 0.55 ± 0.01 | 60.55 ± 7.17 | 65.56 ± 7.83 | 79.41 ± 2.61 | 0.64 ± 0.01 | 58.59 ± 4.38 | 63.86 ± 6.41 |
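The AUC and FaceAUC values in Tables 5 and 6 are standard ROC-AUC scores computed over per-frame (or per-face) fakeness scores; a value of 0.5 corresponds to chance, which is why Set B scores only slightly above 0.5 indicate near-random classification. As a reminder of what the metric measures, here is a small self-contained pairwise (Mann-Whitney) implementation; in practice a library routine such as scikit-learn's `roc_auc_score` would be used:

```python
def roc_auc(scores, labels):
    """ROC-AUC as the probability that a randomly drawn fake sample
    receives a higher score than a randomly drawn real sample
    (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]  # fake samples
    neg = [s for s, y in zip(scores, labels) if y == 0]  # real samples
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

`roc_auc` returns 1.0 when every fake sample outscores every real one and 0.5 for an uninformative scorer, independent of any decision threshold.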

low-resolution test set. These results are available in the supplementary file.

Protocol 3 - Cross-Resolution and Cross-Compression:

(a) Cross-Resolution: Table 5 demonstrates the performance of various architectures when trained on HR videos and tested on LR videos. It is observed that the performance of these architectures drops significantly on the LR test set. The artifacts induced by the generative models in low-resolution videos are different from those induced in high-resolution videos. The models have not seen these artifacts in HR videos during training and thus perform poorly on LR videos. For instance, Capsule shows a deterioration of 0.19 in AUC score on Set A, 0.08 in FaceAUC on Set B, and 0.22 in FaceAUC on Set C.

(b) Cross-Compression: Table 6 showcases the results of the cross-compression experiments for all three sets. The detection models are trained on c23 compression and tested on c40 compression. From Table 6, it can be inferred that as the compression increases, the performance of the models shows a slight dip. In particular, models trained and tested on the same level of compression perform better than those tested on a different compression (refer to Table 4). The FaceAUC scores for Set B remain marginally above 0.5, which denotes that the performance is close to random classification. For Set C, we observe that Capsule gives a FaceWA value of 83.40% when trained and tested on c40, which is ∼4% greater than when tested in the HR cross-compression setting. We also evaluate the performance of these networks on the low-resolution test sets and observe a dip in accuracy for all networks except MesoNet and FWA, indicating their robustness to resolution. These results are provided in the supplementary file.

5. Conclusion and Broader Impact

To aid researchers in developing robust and generalized deepfake detection methods, we curate the novel large-scale DF-Platter dataset. It introduces the novel concept of intra-deepfakes, and generates low-resolution deepfakes from low-resolution videos instead of downsampling high-resolution videos. The DF-Platter dataset also contains occluded deepfakes to make the problem of deepfake detection more challenging. The dataset is balanced across gender and resolution and provides annotations for age, gender, resolution, occlusion, and skin tone. The benchmark results provided for various state-of-the-art deepfake detectors demonstrate that there is still large scope for improvement in deepfake detection. We anticipate that this novel dataset will introduce new avenues in deepfake detection research and serve as a building block for further exploration.

Ethics Statement

The collection and generation of the DF-Platter dataset is approved by the Institutional Ethics Review Committee. The dataset will be provided only to academic institutions for research purposes. Further, the collection and generation of the proposed dataset are in accordance with YouTube's fair use policy (http://bitly.ws/CaoJ) since (1) we use the material in a transformative way and for non-commercial purposes, (2) we only use a small portion (∼20s) of each YouTube video in our dataset, and (3) we do not use the collected videos to cause harm to the copyright owner's ability to profit from their original work in any way.

Acknowledgement

This research is supported by a grant from the Ministry of Home Affairs, Government of India. Thakral is partially supported by the PMRF Fellowship. Mittal is partially supported by the UGC-NET JRF Fellowship and an IBM fellowship. Vatsa is partially supported through the Swarnajayanti Fellowship.

References

[1] FaceShifter. https://github.com/richarduuz/Research_Project/tree/master/ModelC. [Accessed: 06-June-2022].
[2] FaceSwap. https://github.com/MarekKowalski/FaceSwap/. [Accessed: 27-April-2022].
[3] FSGAN. https://github.com/YuvalNirkin/fsgan. [Accessed: 06-June-2022].
[4] Darius Afchar, Vincent Nozick, Junichi Yamagishi, and Isao Echizen. MesoNet: A compact facial video forgery detection network. In IEEE WIFS, pages 1–7, 2018.
[5] François Chollet. Xception: Deep learning with depthwise separable convolutions. In IEEE/CVF CVPR, pages 1251–1258, 2017.
[6] Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton-Ferrer. The DeepFake Detection Challenge dataset. arXiv preprint arXiv:2006.07397, 2020.
[7] Thomas B. Fitzpatrick. Soleil et peau. J Med Esthet, 2:33–34, 1975.
[8] Fanatical Futurist. DeepFake Elon Musk bombs a Zoom call. https://www.youtube.com/watch?v=JiJKXCkWH3w. [Accessed: 2022-06-04].
[9] Got Talent Global. ELVIS AND THE AMERICA'S GOT TALENT JUDGES SING?! All Auditions and Performances from Metaphysic! https://www.youtube.com/watch?v=JiJKXCkWH3w. [Accessed: 2022-06-04].
[10] Matthew Groh, Caleb Harris, Luis Soenksen, Felix Lau, Rachel Han, Aerin Kim, Arash Koochek, and Omar Badri. Evaluating deep neural networks trained on clinical images in dermatology with the Fitzpatrick 17k dataset. In IEEE/CVF CVPR, pages 1820–1828, 2021.
[11] Luca Guarnera, Oliver Giudice, and Sebastiano Battiato. Deepfake detection by analyzing convolutional traces. In IEEE/CVF CVPRw, June 2020.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE/CVF CVPR, pages 770–778, 2016.
[13] Charlotte Jee. Deepfakes enter Indian election campaigns. https://www.technologyreview.com/2020/02/19/868173/an-indian-politician-is-using-deepfakes-to-try-and-win-voters/, 2020. [Accessed: 06-June-2022].
[14] Liming Jiang, Ren Li, Wayne Wu, Chen Qian, and Chen Change Loy. DeeperForensics-1.0: A large-scale dataset for real-world face forgery detection. In IEEE/CVF CVPR, pages 2889–2898, 2020.
[15] Pavel Korshunov and Sébastien Marcel. DeepFakes: A new threat to face recognition? Assessment and detection. arXiv preprint arXiv:1812.08685, 2018.
[16] Patrick Kwon, Jaeseong You, Gyuhyeon Nam, Sungwoo Park, and Gyeongsu Chae. KoDF: A large-scale Korean deepfake detection dataset. In IEEE/CVF ICCV, pages 10744–10753, 2021.
[17] Trung-Nghia Le, Huy H. Nguyen, Junichi Yamagishi, and Isao Echizen. OpenForensics: Large-scale challenging dataset for multi-face forgery detection and segmentation in-the-wild. In IEEE/CVF ICCV, pages 10117–10127, October 2021.
[18] Jian Li, Yabiao Wang, Changan Wang, Ying Tai, Jianjun Qian, Jian Yang, Chengjie Wang, Jilin Li, and Feiyue Huang. DSFD: Dual shot face detector. In IEEE/CVF CVPR, pages 5060–5069, 2019.
[19] Lingzhi Li, Jianmin Bao, Hao Yang, Dong Chen, and Fang Wen. FaceShifter: Towards high fidelity and occlusion aware face swapping. arXiv preprint arXiv:1912.13457, 2019.
[20] Yuezun Li and Siwei Lyu. Exposing DeepFake videos by detecting face warping artifacts. arXiv preprint arXiv:1811.00656, 2018.
[21] Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-DF: A new dataset for DeepFake forensics. arXiv preprint arXiv:1909.12962, 2019.
[22] Aman Mehra, Akshay Agarwal, Mayank Vatsa, and Richa Singh. Motion magnified 3-D residual-in-dense network for deepfake detection. IEEE Transactions on Biometrics, Behavior, and Identity Science, 5(1):39–52, 2022.
[23] Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik. No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing, 21(12):4695–4708, 2012.
[24] Aakash Varma Nadimpalli and Ajita Rattani. GBDF: Gender balanced DeepFake dataset towards fair DeepFake detection. arXiv preprint arXiv:2207.10246, 2022.
[25] Kartik Narayan, Harsh Agarwal, Surbhi Mittal, Kartik Thakral, Suman Kundu, Mayank Vatsa, and Richa Singh. DeSI: Deepfake source identifier for social media. In IEEE/CVF CVPRw, pages 2858–2867, 2022.
[26] Kartik Narayan, Harsh Agarwal, Kartik Thakral, Surbhi Mittal, Mayank Vatsa, and Richa Singh. DeePhy: On deepfake phylogeny. In IEEE International Joint Conference on Biometrics (IJCB), pages 1–10, 2022.
[27] Huy H. Nguyen, Junichi Yamagishi, and Isao Echizen. Use of a capsule network to detect fake images and videos. arXiv preprint arXiv:1910.12467, 2019.
[28] Yuval Nirkin, Yosi Keller, and Tal Hassner. FSGAN: Subject agnostic face swapping and reenactment. In IEEE/CVF ICCV, pages 7184–7193, 2019.
[29] Jiameng Pu, Neal Mangaokar, Lauren Kelly, Parantapa Bhattacharya, Kavya Sundaram, Mobin Javed, Bolun Wang, and Bimal Viswanath. Deepfake videos in the wild: Analysis and detection. In Proceedings of the Web Conference 2021, pages 981–992, 2021.
[30] Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Niessner. FaceForensics++: Learning to detect manipulated facial images. In IEEE/CVF ICCV, pages 1–11, 2019.
[31] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, 30, 2017.
[32] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[33] Collider Video. Deepfake Roundtable with George Lucas, Tom Cruise, Robert Downey Jr. and More. https://collider.com/deepfake-roundtable-george-lucas-tom-cruise-robert-downey-jr. [Accessed: 2022-06-04].
[34] John Xavier. Deepfakes enter Indian election campaigns. https://www.thehindu.com/news/national/deepfakes-enter-indian-election-campaigns/article61628550.ece, 2020. [Accessed: 06-June-2022].
[35] Ying Xu, Philipp Terhörst, Kiran Raja, and Marius Pedersen. A comprehensive analysis of AI biases in DeepFake detection with massively annotated databases. arXiv preprint arXiv:2208.05845, 2022.
[36] Xin Yang, Yuezun Li, and Siwei Lyu. Exposing deep fakes using inconsistent head poses. arXiv preprint arXiv:1811.00661, 2018.
[37] Hanqing Zhao, Wenbo Zhou, Dongdong Chen, Tianyi Wei, Weiming Zhang, and Nenghai Yu. Multi-attentional deepfake detection. In IEEE/CVF CVPR, pages 2185–2194, June 2021.
[38] Yipin Zhou and Ser-Nam Lim. Joint audio-visual deepfake detection. In IEEE/CVF ICCV, pages 14800–14809, October 2021.
[39] Bojia Zi, Minghao Chang, Jingjing Chen, Xingjun Ma, and Yu-Gang Jiang. WildDeepfake: A challenging real-world dataset for deepfake detection. In Proceedings of the 28th ACM International Conference on Multimedia, 2020.
