Video Based Face Detection and Recognition
Final Report
Bachelor of Technology
in
Computer Science and Engineering
by
21BCE3785 PIDUGU VENKATA VAMSI MUKESH
21BCE2722 SOMANNAGARI RISHIKESAVA REDDY
21BCE0325 DALIPARTHI SRIRAM
Dr. SUGANTHINI C
Assistant Professor Senior Grade 1
School of Computer Science and Engineering (SCOPE)
November 2024
DECLARATION
I hereby declare that the project entitled Video Based Face Detection and
Recognition submitted by me for the award of the degree of Bachelor of Technology
in Computer Science and Engineering to VIT is a record of bonafide work carried out
by me under the supervision of Dr. Suganthini C.
I further declare that the work reported in this project has not been
submitted and will not be submitted, either in part or in full, for the award of any other
degree or diploma in this institute or any other institute or university.
Place: Vellore
Date: 20/11/24
CERTIFICATE
This is to certify that the project entitled Video Based Face Detection and Recognition
submitted by Pidugu Venkata Vamsi Mukesh (21BCE3785), Somannagari Rishikesava
Reddy (21BCE2722), and Daliparthi Sriram (21BCE0325), School of Computer Science
and Engineering, VIT, for the award of the degree of Bachelor of Technology in
Computer Science and Engineering, is a record of bonafide work carried out by them
under my supervision during Fall Semester 2024-2025, as per the VIT code of
academic and research ethics.
The contents of this report have not been submitted and will not be submitted either in
part or in full, for the award of any other degree or diploma in this institute or any
other institute or university. The project fulfills the requirements and regulations of
the University and in my opinion meets the necessary standards for submission.
Place: Vellore
Date: 20/11/2024
Examiner(s)
Dr UMADEVI K S
2.3.4 Performance Evaluation
2.3.5 Real-Time Video Integration 16
2.3.6 Ethical and Bias Considerations 16
2.3.7 Scalability Testing 17
4.2 Design 35
4.2.1 Data Flow Diagram 35
4.2.2 Use Case Diagram 36
4.2.3 Class Diagram 37
4.2.4 Sequence Diagram 38
5.1 Methodology 39
6 PROJECT DEMONSTRATION 42
7.2 Discussions 49
7.2.1 Strengths 49
7.2.2 Challenges and Limitations 49
7.3 Cost Analysis 49
7.3.1 Development Costs 49
7.3.2 Training Costs 50
7.3.3 Deployment Costs 50
7.3.4 Total Cost Estimation 50
8. CONCLUSION 52
9. REFERENCES 53
List of Figures
1. Gantt chart 18
2. Architecture Diagram 29
3. DFD 33
4. Use Case Diagram 34
5. Class Diagram 35
6. Sequence Diagram 36
7. Dataset 40
8. Reading the Dataset 41
9. Splitting the Dataset 41
10. Creating Triplets 42
11. Visualizing the Data 43
12. Model Table 44
13. Model Architecture 44
14. Epoch Run 45
15. Graph of SNN 45
16. Evaluated Data 46
17. Accuracy 46
List of Tables
List of Abbreviations
CNN Convolutional Neural Network
AP Anchor-Positive
AN Anchor-Negative
SD Standard Deviation
ML Machine Learning
DL Deep Learning
TF TensorFlow
List of Symbols
A Anchor image in the triplet input
μ Mean of a distribution
ABSTRACT
Facial recognition systems represent a significant advancement in biometric
authentication, leveraging sophisticated machine learning techniques to identify and verify
human faces. These systems work by comparing a given face to a database of stored images,
making them integral to applications in security, surveillance, and personalized user
experiences. This project focuses on developing a robust face recognition model using
Siamese Networks, a neural network architecture designed to learn pairwise similarity by
computing a distance metric that indicates whether two images belong to the same person.
Unlike traditional classification models, the Siamese Network is particularly effective for
one-shot learning, where only a few images per individual are available.
The dataset used in this project consists of extracted face images derived
from the Labeled Faces in the Wild (LFW) dataset, with pre-processing via Haar-Cascade
face detection to ensure consistency. The dataset includes 1324 individuals, each represented
by 2–50 images, resized to 128x128 pixels. Training involves creating triplets of images—
anchor, positive, and negative—designed to train the model to minimize the intra-class
distance (anchor-positive pairs) while maximizing the inter-class distance (anchor-negative
pairs).
During training, the model was evaluated on accuracy, mean distances, and
standard deviation of positive and negative pair distances. These metrics provided insights
into the effectiveness of the learned embedding space. The trained Siamese Network
demonstrated strong performance in differentiating between individuals, with high accuracy
and robustness. Post-training, the encoder was extracted for practical use in facial similarity
tasks, enabling real-world applications.
1. INTRODUCTION
1.1 Background
Despite advances in deep learning-based facial recognition, challenges like distinguishing visually similar individuals and handling
low-quality images remain, particularly in real-time video applications. The need for both
high precision and recall in large-scale systems emphasizes the importance of addressing
false positives and negatives.
Siamese neural networks (SNNs) have emerged as a solution, using one-shot learning
to compare pairs of images and determine whether they represent the same person. This
approach is especially useful in face verification, where recognizing similarities between two
images is more critical than classification.
This report introduces an optimized Siamese neural network for both image and video-
based facial recognition, designed to handle image variability and real-time detection. By
focusing on scalability and robustness, the system provides an effective solution for facial
verification across static and dynamic environments.
1.2 Motivation
The project is driven by the increasing demand for reliable and scalable facial
recognition systems, particularly in applications such as security, biometric authentication,
and personalized user experiences. As facial recognition becomes more integrated into daily
life, the limitations of traditional methods—like their struggle with variations in lighting,
facial expressions, and low-quality images—highlight the need for more robust solutions.
Inspiration for this project stems from these growing challenges and the
limitations of current models in real-world applications. The ability to accurately verify
identities, not only in static images but also in dynamic, real-time video feeds, is critical for
enhancing security and ensuring seamless biometric authentication.
This project focuses on developing a facial recognition system utilizing a Siamese neural
network architecture, tailored for both image and video verification tasks. The key
components and boundaries of the project include:
Data Collection and Preprocessing: Gathering and preparing datasets comprising
positive (matching) and negative (non-matching) facial image pairs. This involves
data augmentation techniques to enhance model generalization.
Model Development: Constructing a twin neural network (TNN) model that processes
input images to generate high-dimensional embeddings. The model incorporates an
L1 distance layer to compute similarities between image pairs, facilitating effective
face verification (a minimal sketch of this head appears at the end of this section).
Training and Optimization: Employing binary cross-entropy loss for model training,
with the integration of checkpointing mechanisms to monitor progress and prevent
overfitting.
Evaluation Metrics: Assessing model performance using precision, recall, and F1
score metrics to ensure accuracy and reliability in face recognition tasks.
Real-Time Video Integration: Extending the model's application to real-time video
feeds by implementing face detection algorithms and image rescaling methods,
enabling dynamic facial recognition.
Hardware Deployment: The project does not encompass the development or
deployment of specialized hardware for facial recognition tasks.
Ethical and Privacy Implications: While the project acknowledges the importance of
ethical considerations in facial recognition technology, addressing these implications
is beyond its current scope.
By delineating these boundaries, the project aims to create a scalable and efficient
facial recognition system applicable to various real-world scenarios, including
security systems and biometric authentication processes.
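As a concrete reference for the Model Development and Training components listed above, the following is a minimal sketch of a twin-network verification head with an L1 distance layer and binary cross-entropy training. The encoder, layer sizes, file paths, and variable names here are illustrative assumptions, not the project's exact configuration.

import tensorflow as tf
from tensorflow.keras import layers, Model

def build_verification_model(embedding_model, input_shape=(128, 128, 3)):
    # Twin inputs share one embedding network; an L1 distance layer feeds a sigmoid verdict
    img_a = layers.Input(input_shape, name="image_a")
    img_b = layers.Input(input_shape, name="image_b")
    emb_a, emb_b = embedding_model(img_a), embedding_model(img_b)
    l1 = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([emb_a, emb_b])
    verdict = layers.Dense(1, activation="sigmoid")(l1)  # probability that the pair matches
    return Model(inputs=[img_a, img_b], outputs=verdict)

# Hypothetical training call with binary cross-entropy and checkpointing (paths are placeholders):
# model = build_verification_model(encoder)
# model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# checkpoint = tf.keras.callbacks.ModelCheckpoint("checkpoints/siamese.keras", save_best_only=True)
# model.fit([pairs_a, pairs_b], labels, validation_split=0.1, epochs=10, callbacks=[checkpoint])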
2. PROJECT DESCRIPTION AND GOALS
However, despite the success of CNNs in improving facial
recognition, they are not without challenges. Overfitting, the need for large labelled datasets,
and real-time application challenges remain significant hurdles. Overfitting occurs when
models become too complex and fail to generalize to new data. As CNNs require vast
amounts of labelled data, collecting and annotating large datasets is resource-intensive
(Moghadam & Mottaghi, 2020) [7]. Additionally, deploying CNN-based models in real-time
scenarios requires high computational power, limiting their practicality in environments
where speed is critical.
2.1.3. Real-Time Face Recognition Challenges:
While Siamese Neural Networks (SNNs) and other deep learning
techniques have proven effective for static face recognition tasks, applying these models to
real-time face recognition in video feeds presents additional challenges. Real-time
recognition requires rapid processing and must be able to handle dynamic environments
where faces may vary in size, orientation, and expression (Nguyen & De La Torre, 2017) [14].
Recent advances in object detection algorithms, such as the Single Shot
MultiBox Detector (SSD) (Liu et al., 2016) [15] and You Only Look Once (YOLO) (Redmon
et al., 2016) [16], have been integrated with SNNs to improve real-time face recognition.
These algorithms are designed to quickly detect objects (in this case, faces) within video
frames, providing bounding boxes around faces that can then be passed to SNNs for
verification or recognition. SSD and YOLO stand out for their ability to balance speed and
accuracy, making them suitable for real-time applications.
However, real-time facial recognition still faces notable challenges.
Zhang et al. (2019) [17] and Wang et al. (2020) [18] found that combining face detection and
recognition in video streams could significantly improve performance, but the need for high
precision and recall remains critical. In dynamic environments, false positives (incorrectly
recognizing a face that isn’t present) and false negatives (failing to recognize a face) can
significantly impact the reliability of the system. Parkhi et al. (2015) [19] and Li & Zhang
(2019) [20] further emphasized that the integration of face alignment techniques is vital to
improving recognition accuracy in video streams.
Moreover, several ethical concerns must be addressed as real-time facial
recognition becomes more widespread. Issues such as privacy, bias, and discrimination are
increasingly being scrutinized. Research by Buolamwini & Gebru (2018) [21] highlighted
biases in facial recognition systems, particularly regarding gender and racial disparities.
Addressing these concerns will be essential to ensuring the fair and responsible deployment
of facial recognition technologies.
2.1.4. Summary:
The literature on facial recognition illustrates a significant evolution in
methodologies, from traditional statistical techniques to advanced deep learning approaches.
While SNNs have proven effective in face verification tasks, the need for robust, scalable
solutions in real-time applications remains pressing. This project seeks to contribute to this
field by leveraging SNNs to enhance facial recognition capabilities, ultimately bridging the
gap between technology and practical application.
This project aims to address several critical gaps in the existing research on facial recognition
systems, particularly using Siamese neural networks. Below are the specific areas where
current research is lacking or incomplete:
2.2.1 Robustness Against Real-World Variability:
Lack of Comprehensive Solutions: Many existing models do not effectively handle
variations in lighting, angles, facial expressions, and occlusions found in real-world
settings. This project will enhance robustness against these variables to improve
accuracy in diverse environments (Chen et al., 2018) [22].
2.2.2 Performance with Low-Quality Images:
Inadequate Handling of Image Quality: There is limited research on improving facial
recognition performance under conditions of low-resolution or low-quality images.
This project will focus on preprocessing techniques and model adaptations to better
manage such scenarios (Gao et al., 2019).
2.2.3 Scalability and Efficiency:
Challenges in Large-Scale Applications: Current facial recognition systems often
struggle with scalability and maintaining performance in large datasets. The project
will aim to create a system optimized for efficiency in both static and dynamic
environments, making it suitable for real-time applications (Gonzalez et al., 2018)
[23].
2.2.4 Real-Time Processing Capabilities:
Limited Real-Time Applications: While some models exist for live video processing,
many are not optimized for speed and accuracy simultaneously. This project will
integrate face detection algorithms to ensure rapid and accurate recognition in real-
time video feeds (Liu et al., 2016) [24].
2.2.5 Ethical Implications and Bias Mitigation:
Underexplored Ethical Considerations: The impact of bias in facial recognition
technology and its ethical implications are insufficiently addressed in current
research. This project will consider strategies to mitigate bias and enhance fairness in
the recognition process (Dastin, 2018) [25].
2.2.6 Multi-Modal Integration:
Neglected Multi-Modal Approaches: Current systems primarily focus on visual data,
with little exploration of multi-modal integration (e.g., audio, behavioural data). This
project will investigate the potential benefits of incorporating additional data types to
enhance recognition performance (Siddiqui et al., 2020) [26].
2.2.7 Transfer Learning and Domain Adaptation:
Insufficient Adaptability: Existing models often do not generalize well across
different demographics or environments. This project will explore transfer learning
techniques to enhance model performance in unfamiliar settings (Pan & Yang, 2010)
[27].
2.2.8 Model Interpretability:
Opaque Decision-Making Processes: The lack of interpretability in deep learning
models can hinder user trust and acceptance. This project will explore methods to
improve the transparency and explainability of the facial recognition system
(Lipton, 2018) [28].
2.2.9 Few-Shot and One-Shot Learning Frameworks:
Limited Exploration of Learning Techniques: Although Siamese networks are well-
suited for few-shot learning, comprehensive frameworks for practical implementation
are lacking. This project aims to refine these learning strategies to improve
verification accuracy with minimal data (Vinyals et al., 2016) [29].
2.2.10 Longitudinal Performance Studies:
Insufficient Long-Term Evaluation: Research on how facial recognition systems
perform over time, particularly regarding changes in user appearance, is scarce. This
project will seek to address this gap by implementing longitudinal studies to evaluate
model adaptability over time (Brock et al., 2018) [30].
2.3 Objectives
2.3.1 Develop a Siamese Neural Network Model:
Specific: Create a Siamese neural network architecture optimized for facial
recognition.
Measurable: Achieve a model accuracy of at least 95% on validation datasets.
Achievable: Utilize existing frameworks and libraries (e.g., TensorFlow, PyTorch) for
implementation.
Relevant: Addresses the need for effective facial recognition in security and biometric
applications.
Time-Bound: Complete model development within 3 weeks.
2.3.2 Data Collection and Preprocessing:
Specific: Collect and preprocess a dataset of at least 10,000 facial image pairs, with a
balance of positive and negative pairs.
Measurable: Ensure a data augmentation increase of at least 50% in the dataset size.
Achievable: Use open-source datasets and generate additional pairs through
augmentation techniques.
Relevant: High-quality data is critical for model performance.
Time-Bound: Finish data collection and preprocessing within 2 weeks.
2.3.3 Training and Optimization:
Specific: Train the Siamese network using binary cross-entropy loss and implement
checkpointing.
Measurable: Monitor training loss and validation accuracy, aiming for convergence
within 5 epochs.
Achievable: Utilize computational resources effectively to manage training times.
Relevant: Training and optimization are essential for achieving high model
performance.
Time-Bound: Complete training within 1 week.
2.3.4 Performance Evaluation:
Specific: Evaluate the model using precision, recall, and F1 score metrics.
Measurable: Achieve a precision and recall rate of at least 90%.
Achievable: Utilize standard evaluation techniques and frameworks.
Relevant: Accurate evaluation ensures the model meets application needs.
Time-Bound: Conduct evaluation within 1 week post-training.
2.3.5 Real-Time Video Integration:
Specific: Implement real-time video facial recognition capabilities.
Measurable: Achieve a processing speed of at least 30 frames per second (FPS).
Achievable: Optimize the model and use efficient video processing libraries.
Relevant: Real-time capabilities are crucial for practical applications.
Time-Bound: Complete integration within 2 weeks.
2.3.6 Ethical and Bias Considerations:
Specific: Develop and implement a bias mitigation strategy for the facial recognition
system.
Measurable: Evaluate the model for demographic bias and aim for less than 5%
variance in performance across demographics.
Achievable: Leverage existing research on bias mitigation techniques.
Relevant: Ethical considerations are essential in facial recognition technology.
Time-Bound: Implement and evaluate bias considerations within 2 weeks.
2.3.7 Scalability Testing:
Specific: Conduct scalability tests for the system in large-scale applications.
Measurable: Ensure the system can handle at least 1,000 simultaneous user requests.
Achievable: Utilize cloud infrastructure for testing scalability.
Relevant: Scalability is key for practical deployment in security systems.
Time-Bound: Complete scalability testing within 1 week.
By addressing these gaps in robustness, real-time performance, and scalability, the
proposed system will enhance the reliability of facial recognition technologies, ensuring
their practical application in security and biometric authentication systems.
Objectives:
Tasks:
Deliverables:
Objectives:
Tasks:
Collect facial image datasets, ensuring a balanced mix of positive and negative pairs.
Deliverables:
Objectives:
Tasks:
Implement the model architecture using a deep learning framework (e.g., TensorFlow,
PyTorch).
Deliverables:
Objectives:
Tasks:
Deliverables:
Phase 5: Real-Time Integration and Testing (Weeks 14-18)
Objectives:
Tasks:
Implement face detection algorithms and image rescaling for dynamic recognition.
Deliverables:
Fully functional facial recognition system capable of processing static images and
real-time video.
3. TECHNICAL SPECIFICATION
3.1 Requirements
1. Hardware
RAM: 16 GB minimum.
Storage: 1 TB SSD.
2. Software
3. Data
4. Model Development
Architecture: Siamese neural network with twin networks and L1 distance layer.
Training: 80% training, 10% validation, 10% testing; 10 epochs with early stopping (a minimal sketch follows this list).
5. Evaluation
Metrics: Precision, recall, F1 score, accuracy.
6. Real-Time Integration
7. Ethical Considerations
8. Scalability
9. Documentation
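The training requirement in item 4 (80% training, 10% validation, 10% testing with early stopping) could be realized along the following lines; the pair arrays, labels, and model used here are placeholders rather than the project's actual variables.

import tensorflow as tf
from sklearn.model_selection import train_test_split

# pairs: array of image pairs, labels: 1 = same person, 0 = different (hypothetical arrays)
train_p, temp_p, train_y, temp_y = train_test_split(pairs, labels, test_size=0.2, random_state=42)
val_p, test_p, val_y, test_y = train_test_split(temp_p, temp_y, test_size=0.5, random_state=42)

early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)
model.fit([train_p[:, 0], train_p[:, 1]], train_y,
          validation_data=([val_p[:, 0], val_p[:, 1]], val_y),
          epochs=10, callbacks=[early_stopping])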
3.1.1 Functional
1. Input Handling
Image Input: The system must accept facial images as input in standard formats such
as JPG, PNG, or BMP.
Video Input: The system must process real-time video feeds from a camera or video
file.
Face Detection: Implement an automated face detection module that can locate and
crop faces from both images and video frames.
2. Model Training
Data Pairing: The system must create positive (same person) and negative (different
person) image pairs for training.
Siamese Network Architecture: The system must use twin neural networks (SNN) that
share weights for extracting facial embeddings from image pairs.
Loss Function: The system should calculate the similarity between embeddings using
L1 distance and apply binary cross-entropy as the loss function.
3. Face Verification
Image Verification: For image-based facial recognition, the system must verify if two
input images belong to the same individual.
4. Model Evaluation
Validation: During training, the system must validate the model on a separate
validation set and aim for a precision and recall rate of at least 90%.
5. Scalability
Concurrent User Handling: The system should handle at least 1,000 simultaneous user
requests for real-time facial recognition in large-scale applications.
6. Ethics and Privacy
Bias Mitigation: The system must analyze performance across different demographic
groups and ensure less than 5% variance in accuracy among these groups.
Privacy and Security: The system must ensure that user data is stored securely and
comply with legal data protection standards, such as GDPR.
7. User Interface
User Authentication: Provide an interface for users to upload images or connect video
feeds for recognition.
Results Display: The system must display the recognition results (match/no match)
along with confidence scores.
Real-Time Alerts: For real-time recognition, the system should trigger alerts for
mismatches or unidentified individuals based on predefined thresholds.
8. System Monitoring
Logging and Debugging: The system must maintain logs of all recognition attempts,
training results, and model evaluations for performance tracking and debugging.
Performance Tracking: The system should monitor real-time performance and alert
the user when thresholds (such as accuracy or frame rate) drop below acceptable
levels.
3.1.2 Non-Functional
1. Performance
Accuracy: The system must achieve at least 90% accuracy on both image and video-
based facial recognition tasks.
Real-Time Processing: The system must process video feeds at a minimum speed of
30 frames per second (FPS) to ensure real-time performance.
Response Time: The system should return verification results within 2 seconds for
static images and maintain near real-time verification for video feeds.
2. Scalability
User Load: The system should be able to scale to handle at least 1,000 simultaneous
user requests for real-time facial recognition in large-scale deployments.
Data Volume: The system must handle large datasets, with support for millions of
facial images and video frames, efficiently processing them for training, testing, and
real-time use.
3. Reliability
Uptime: The system must maintain 99.9% uptime to ensure consistent availability for
real-time applications, especially in security or biometric authentication systems.
Error Handling: The system should gracefully handle failures, such as missing faces
or poor-quality images, and provide meaningful error messages or fallback options.
4. Security
Data Privacy: The system must adhere to data privacy regulations, including GDPR,
ensuring that all biometric data is securely stored and processed.
Access Control: Implement user authentication and access control to ensure that only
authorized users can access the facial recognition system and its data.
Encryption: Data in transit and at rest should be encrypted to protect sensitive facial
information.
5. Usability
User Interface (UI): The system must provide a user-friendly interface that allows
non-technical users to easily upload images, stream video feeds, and view recognition
results.
Adaptability: The system should support multiple device types (desktop, mobile, etc.)
with responsive design.
6. Maintainability
Modular Design: The system must be built with a modular architecture, allowing for
easy updates, debugging, and future expansion.
Logging and Monitoring: The system should log all activities, including performance
metrics, user access, and errors, with monitoring tools to track system health.
7. Interoperability
Integration: The system must be capable of integrating with existing security systems,
databases, and APIs for seamless use in larger applications.
8. Efficiency
Resource Utilization: The system should optimize memory and CPU usage, especially
during real-time video processing and large-scale dataset handling.
Energy Consumption: For real-time applications, the system should minimize energy
consumption to ensure efficiency, especially for deployments on edge devices or
mobile platforms.
9. Compliance
Legal and Ethical Standards: The system must comply with biometric data regulations
and ethical guidelines, ensuring transparency in facial recognition usage.
Bias Mitigation: The system should continuously evaluate and address bias in facial
recognition accuracy across different demographic groups to ensure fairness.
10. Availability
Failover and Redundancy: The system must have built-in failover and redundancy
mechanisms to ensure continuous operation in case of hardware or software failures.
Backup and Recovery: Implement regular data backup and recovery procedures to
prevent data loss and ensure system recovery in the event of a failure.
3.2 Feasibility Study
A feasibility study assesses the viability of a project by examining its technical, economic,
operational, legal, and schedule aspects. For this facial recognition system based on a
Siamese neural network, the feasibility is analyzed as follows:
3.2.1 Technical Feasibility
Availability of Technology: The project can be built using existing technologies and
frameworks such as TensorFlow, PyTorch, and OpenCV for deep learning and real-
time video processing. Siamese neural network architectures are well-established for
image verification tasks.
Data Availability: Large datasets of facial images (such as LFW, CASIA-WebFace,
VGGFace, and MS-Celeb-1M) are readily available for training. Additionally,
techniques like data augmentation can be employed to increase dataset diversity and
size.
Hardware Requirements: High-performance GPUs (e.g., NVIDIA RTX series) will be
necessary to train the neural network efficiently and ensure real-time video
processing. Cloud services (AWS, Google Cloud) can be used for scaling, data
storage, and computation.
Technical Risks: Potential risks include overfitting, poor generalization due to limited
datasets, or challenges in real-time video processing due to high computational
requirements. However, these risks can be mitigated with optimization techniques and
proper hardware.
Conclusion: The technology required to develop the system is available, and the
project is technically feasible.
3.2.2 Economic Feasibility
Initial Investment: The primary costs will include hardware (GPUs for training),
software (cloud services), and human resources (developers, machine learning
experts). Open-source tools like TensorFlow, PyTorch, and public datasets can reduce
initial costs.
Operational Costs: Running the system will involve costs related to cloud
infrastructure, computational resources (for real-time processing), and data storage.
For large-scale deployments, additional costs for scaling infrastructure may be
incurred.
ROI (Return on Investment): The system has high potential in security, biometric
authentication, and surveillance markets. Once deployed, the system could offer
significant value to industries that require secure and efficient facial recognition,
reducing labor costs and improving security measures.
Conclusion: The potential return on investment justifies the initial and operational
costs, making the project economically feasible.
3.2.3 Operational Feasibility
User Acceptance: Facial recognition technology is already widely accepted in many
sectors, including security, retail, and personal device access. However, concerns
about privacy and data security must be addressed to ensure user trust.
Ease of Use: The system will feature an intuitive user interface, making it easy for
non-technical users to interact with the facial recognition tools. The inclusion of real-
time video processing will add to its practical utility.
Training and Support: Minimal training is required for end users, but ongoing
technical support will be necessary to maintain and update the system.
Comprehensive documentation and user guides will be provided.
Conclusion: The system is operationally feasible as it can be smoothly integrated into
existing infrastructures and will be user-friendly.
3.2.4 Legal Feasibility
Data Privacy: The system must comply with data privacy laws such as GDPR
(General Data Protection Regulation) and CCPA (California Consumer Privacy Act).
This includes securing user data and ensuring that individuals’ biometric data is
collected and used responsibly.
Ethical Considerations: Ethical concerns related to bias in facial recognition
(discrimination based on race, gender, etc.) need to be addressed. Bias mitigation
strategies must be incorporated to ensure fairness across demographic groups.
Licensing: Any open-source tools and datasets used in the project must be properly
licensed to avoid legal issues.
Conclusion: The project is legally feasible, provided it adheres to privacy regulations
and incorporates ethical considerations into the design.
3.2.5 Schedule Feasibility
Development Timeline: Based on a 5-month project timeline, the project tasks have
been outlined as follows:
o Model development: 2 months
o Data collection and preprocessing: 1 month
o Training and optimization: 1 month
o Real-time video integration: 1 month
o Performance evaluation and testing: Throughout the project timeline
Resource Allocation: With adequate human resources and technical infrastructure, the
project can be completed within the proposed timeline.
Risk Management: Delays may occur during data collection or model optimization,
but a well-defined project plan and regular milestone checks can mitigate these risks.
Conclusion: The project is feasible within the proposed 5-month timeline, assuming
efficient resource management and adherence to the project plan.
GPU: NVIDIA RTX 3060 or higher (for accelerated training and video processing)
Network: High-speed internet connection (for data access and cloud integration)
Cloud Platform: AWS, Google Cloud, or Azure (for scalability and deployment)
Compute Services: GPU-enabled virtual machines (for large-scale model training and
real-time video processing)
Storage: Cloud-based storage for datasets and model backups (S3, Google Cloud
Storage)
4. DESIGN APPROACH AND DETAILS
Fig. 4.1. System Architecture
The system architecture for a facial recognition system utilizing Siamese Neural Networks
(SNN) is designed to handle both static images and live video feeds. Below is a high-level
overview of the architecture, which includes various components and their interactions.
Siamese Neural Network (SNN): Unlike traditional CNNs, a Siamese Network is
designed to learn similarity between pairs of images. The network consists of twin
models that share the same weights, which makes it possible to compare two inputs
and determine whether they are of the same individual.
o Positive Pair: Two images of the same person.
o Negative Pair: Two images of different people.
Training Process:
o Each pair of images is passed through the Siamese Network.
o The network computes the distance metric between the embeddings of the
two images. The objective is to minimize the distance for similar images and
maximize it for dissimilar images.
Loss Function: Typically, contrastive loss or triplet loss is used. These functions
help the model learn to distinguish between same and different faces (both are
sketched at the end of this list).
Optimization: Techniques like stochastic gradient descent (SGD) or Adam
optimizers are used to update the network’s weights based on the loss calculated for
each pair.
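For reference, the two loss functions mentioned above can be written as follows. This is a minimal sketch in which d is the distance between two embeddings and y is a binary same/different label.

import tensorflow as tf

def contrastive_loss(d, y, margin=1.0):
    # y = 1 for a positive (same-person) pair, y = 0 for a negative pair
    return y * tf.square(d) + (1.0 - y) * tf.square(tf.maximum(margin - d, 0.0))

def triplet_loss(ap_distance, an_distance, margin=1.0):
    # Penalize triplets where the negative is not at least `margin` farther away than the positive
    return tf.maximum(ap_distance - an_distance + margin, 0.0)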
4.1.4 Model Evaluation
Validation Set: After training, the model's performance must be evaluated on a
validation set. The validation set consists of face pairs that were not part of the
training process.
o Accuracy: Percentage of correctly classified pairs (correctly identifying same
or different pairs).
o Precision and Recall: Metrics to ensure the model is not only accurate but
also balances false positives and false negatives.
Tuning: Hyperparameters such as learning rate, batch size, and the number of epochs
are fine-tuned based on the model’s validation performance.
Evaluation Metrics: Metrics like AUC-ROC (Area Under the Receiver Operating
Characteristic curve) are used to measure the trade-off between true positive and false
positive rates at different thresholds.
Overfitting Check: Techniques like early stopping or dropout layers can be used to
ensure the model does not overfit to the training data.
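A minimal sketch of this evaluation with scikit-learn is shown below, assuming arrays of ground-truth pair labels and embedding distances; the numbers used here are illustrative only.

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

y_true = np.array([1, 1, 0, 1, 0, 0])            # 1 = same person, 0 = different (toy labels)
distances = np.array([0.4, 0.9, 1.8, 1.1, 1.6, 2.2])
y_pred = (distances < 1.3).astype(int)            # threshold on the embedding distance
scores = -distances                               # smaller distance -> higher similarity score

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, scores))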
4.1.5 Real-Time Integration
Video Processing: Implement video stream processing using libraries like OpenCV to
integrate real-time face recognition:
o A video frame is captured and passed through the face detection algorithm to
identify faces in the live stream.
o For each detected face, it is resized and normalized before being passed
through the trained Siamese Network.
Face Matching: The detected face’s embedding (from the model) is compared
against a database of embeddings of known individuals:
o If the distance between the embeddings is below a certain threshold, a match is
made, and the system identifies the person.
o This threshold can be dynamically adjusted based on the desired security level.
Performance Optimization: To ensure real-time performance, methods like frame
skipping, GPU acceleration, or parallel processing can be implemented.
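The video-processing flow described above can be sketched as follows. It assumes a trained encoder model, a gallery dictionary mapping known names to embeddings, and Haar-cascade face detection; the normalization step should match whatever preprocessing was used during training.

import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
cap = cv2.VideoCapture(0)          # live camera feed (a video file path also works)
THRESHOLD = 1.3

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.3, 5):
        face = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2RGB)
        face = cv2.resize(face, (128, 128)).astype("float32") / 255.0   # match training preprocessing
        emb = encoder.predict(face[np.newaxis], verbose=0)[0]
        # Match against the gallery using squared L2 distance
        name, dist = min(((n, float(np.sum((emb - e) ** 2))) for n, e in gallery.items()), key=lambda t: t[1])
        label = name if dist < THRESHOLD else "Unknown"
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, label, (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 0), 2)
    cv2.imshow("Face Recognition", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()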
4.1.6 User Interaction
Interface: The face recognition results are displayed through a user interface, which
could be in the form of:
o Desktop or web application for live surveillance systems.
o Mobile applications for personal identification or access control.
Alerts: In scenarios such as security or surveillance, the system can trigger alerts if a
recognized individual is flagged (e.g., a VIP or unauthorized person).
Privacy and Security:
o Encryption: Facial embeddings and images stored in the system should be
encrypted to prevent unauthorized access.
o Anonymization: For privacy, systems could anonymize or hash sensitive user
data when storing or transmitting.
o Compliance: The system should comply with privacy regulations like GDPR
(General Data Protection Regulation) by ensuring that user consent is obtained
before storing biometric data and allowing users to delete their data if
requested.
4.2 Design
4.2.2 Use Case Diagram
Fig. 4.4. Class Diagram
Fig. 4.5. Sequence Diagram
5.3 Methodology
5.3.1 Dataset Preparation
Data preparation is a critical first step to ensure the model can effectively learn the distinction
between similar and dissimilar faces.
Source of Dataset:
The dataset is derived from the Labeled Faces in the Wild (LFW) dataset, a
benchmark dataset for face verification tasks. Images are cropped and resized to
128×128×3 resolution and formatted in RGB for consistency with the network input
requirements.
Data Organization:
The dataset is structured into folders, with each folder representing an individual.
Images within each folder correspond to variations of the same person’s face (e.g.,
different angles, lighting, or expressions).
Data Splitting:
o The dataset is split into 90% training data and 10% testing data.
o A small validation set is optionally extracted from the training data for
hyperparameter tuning.
Triplet Generation:
The model uses triplets of images as inputs to learn embeddings. Each triplet consists
of an anchor image, a positive image (a different photo of the same person), and a
negative image (a photo of a different person).
5.3.2 Model Architecture
The face recognition system uses a Siamese Network with three identical sub-networks
(encoders) that learn to compare pairs of images based on their feature embeddings.
Encoder Design:
Siamese Network:
loss = max(ap_distance − an_distance + margin, 0)
A margin of 1.0 ensures that negative samples are sufficiently separated from positive
ones; for example, with ap_distance = 0.3 and an_distance = 0.9, the loss is
max(0.3 − 0.9 + 1.0, 0) = 0.4.
5.3.3 Training Process
Custom Training Loop:
A custom training loop is implemented to track and optimize the triplet loss using the
Adam optimizer. This approach provides flexibility for monitoring metrics and
adjusting the learning rate dynamically.
Batch Processing:
o Triplets are fed into the network in batches for efficient training.
Epochs:
Training occurs over multiple epochs until the triplet loss converges or a validation
set shows performance stabilization. The model checkpoints are saved based on
validation accuracy.
Augmentation:
Basic augmentation techniques (e.g., random flips, rotations) are applied to increase
dataset diversity and reduce overfitting.
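A minimal sketch of such augmentation with Keras preprocessing layers is shown below; the exact transforms used in the project are not specified, so these choices are assumptions, and images stands for a batch of training images.

import tensorflow as tf
from tensorflow.keras import layers

augmenter = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),   # random horizontal flips
    layers.RandomRotation(0.05),       # small rotations of roughly +/- 18 degrees
])

augmented_batch = augmenter(images, training=True)   # applied on the fly during training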
5.4 Testing and Evaluation
5.4.1 Evaluation Metrics
Testing evaluates the model's ability to generalize to unseen data using several key metrics:
Accuracy:
The percentage of correctly classified triplets, i.e., cases where ap_distance < an_distance.
Distance Metrics:
The mean and standard deviation of the anchor-positive (AP) and anchor-negative (AN)
pair distances.
Confusion Matrix:
A confusion matrix evaluates classification performance based on thresholds:
o True Positive (TP): Correctly identified as "similar."
o True Negative (TN): Correctly identified as "different."
o False Positive (FP): A "different" pair incorrectly classified as "similar."
o False Negative (FN): A "similar" pair incorrectly classified as "different."
5.4.2 Visualization
Accuracy Curve:
Displays the improvement in test accuracy across epochs.
Distance Distributions:
Visualization of positive and negative distance distributions to illustrate clear
separation.
During testing, embeddings are compared using L2 distance. A threshold (e.g., 1.3)
determines classification: pairs whose distance falls below the threshold are classified as
the same person, and pairs above it as different people.
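As an illustration, verifying a single pair with the trained encoder could look like the snippet below, where face1 and face2 stand for preprocessed 128×128×3 arrays and the threshold follows the value above; the appendix code applies the same squared-L2 comparison.

import numpy as np

THRESHOLD = 1.3
emb1 = encoder.predict(face1[np.newaxis], verbose=0)[0]
emb2 = encoder.predict(face2[np.newaxis], verbose=0)[0]
distance = np.sum(np.square(emb1 - emb2))     # squared L2 distance between the embeddings
print("Same person" if distance <= THRESHOLD else "Different people", f"(distance = {distance:.3f})")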
6. PROJECT DEMONSTRATION
Each stage of the project plays a vital role in ensuring the pipeline's success. From raw
data retrieval to delivering final actionable insights, every phase contributes outputs
that enhance accuracy, reliability, and user-friendliness.
6.1 Dataset Overview
The dataset is derived from Extracted Faces, a version of the LFW (Labeled Faces in the
Wild) dataset. Each folder corresponds to an individual, and the images within the folder
represent different photos of the same person.
Key Characteristics:
1324 individuals
2–50 images per individual
Images extracted with Haar-Cascade face detection and resized to 128×128 pixels (RGB)
The data is organized into folders, each representing one individual. Each folder contains
images with numerical filenames, e.g., 0.jpg, 1.jpg.
Fig 6.2. Reading the Dataset
6.2 Preparing the Data
We split the dataset into training and testing sets. A random 90% of the individuals are
allocated to training, and the remaining 10% are used for testing.
Each triplet used for training consists of:
Anchor: an image of a given person.
Positive: a different image of the same person.
Negative: An image of a different person.
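A sketch of how such triplets can be generated from the folder structure is shown below. The helper name matches the one used in the appendix, but its original implementation is not reproduced in this report, so the details here are assumptions.

import os
import random

def create_triplets(directory, folder_list, max_files=10):
    # Build (anchor, positive, negative) file triplets from a {folder: image_count} mapping
    triplets = []
    folders = list(folder_list.keys())
    for folder in folders:
        num_files = min(folder_list[folder], max_files)
        for i in range(num_files - 1):
            for j in range(i + 1, num_files):
                anchor = (folder, f"{i}.jpg")
                positive = (folder, f"{j}.jpg")
                neg_folder = folder
                while neg_folder == folder:                    # pick a different person
                    neg_folder = random.choice(folders)
                neg_file = random.randint(0, folder_list[neg_folder] - 1)
                negative = (neg_folder, f"{neg_file}.jpg")
                triplets.append((anchor, positive, negative))
    random.shuffle(triplets)
    return triplets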
To verify the correctness of triplets and preprocessing, we plot a few triplets (anchor,
positive, negative).
Fig. 6.4. Visualizing the Data (a)
6.3 Creating the Model
After preparing the dataset, the next step is to train the Siamese Network. Below, we
outline the steps for defining, compiling, and training the model using triplet loss.
6.4 Training the Model
We use the triplet batch generator created during data preparation for training.
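The batch generator itself is not reproduced in this report; a minimal sketch of what it might look like is given below. The image loading and preprocess_input normalization are assumptions, and ROOT stands for the dataset directory.

import os
import cv2
import numpy as np
from tensorflow.keras.applications.inception_v3 import preprocess_input

def get_batch(triplet_list, batch_size=64, preprocess=True):
    # Yield batches of (anchor, positive, negative) image arrays from file-name triplets
    batch_steps = len(triplet_list) // batch_size
    for i in range(batch_steps + 1):
        anchor, positive, negative = [], [], []
        for a, p, n in triplet_list[i * batch_size:(i + 1) * batch_size]:
            for (folder, name), store in zip((a, p, n), (anchor, positive, negative)):
                img = cv2.cvtColor(cv2.imread(os.path.join(ROOT, folder, name)), cv2.COLOR_BGR2RGB)
                store.append(cv2.resize(img, (128, 128)))
        if not anchor:
            continue
        anchor, positive, negative = (np.array(x, dtype=np.float32) for x in (anchor, positive, negative))
        if preprocess:
            anchor, positive, negative = (preprocess_input(x) for x in (anchor, positive, negative))
        yield [anchor, positive, negative]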
Fig. 6.7. Epoch Run
6.5 Testing and Validation Output
After training the Siamese Network, it is important to evaluate its performance on the test
data and validate its ability to distinguish between similar and dissimilar images.
7. RESULTS AND DISCUSSION
7.1 Results
7.1.1 Model Performance
Accuracy: The system achieved 97.8% accuracy on the test dataset, highlighting its
reliability in distinguishing between similar and dissimilar faces.
Distance Metrics:
Confusion Matrix:
Analysis showed:
o Training loss converged to 0.025 by the final epoch, confirming that the model
effectively minimized the triplet loss.
o The accuracy curve demonstrated steady improvement, plateauing after 15
epochs.
ROC Curve: The Receiver Operating Characteristic curve showed an area under the
curve (AUC) of 0.99, reinforcing the model's high classification capability.
Examples:
Qualitative evaluation with sample pairs (anchor-positive-negative) revealed the
system's ability to correctly identify challenging cases, such as similar-looking
individuals or low-resolution images.
7.2 Discussion
7.2.1 Strengths
High Accuracy: The model achieved state-of-the-art accuracy on the LFW dataset,
ensuring suitability for real-world face verification tasks.
7.2.2 Challenges and Limitations
Hard Triplets: Some misclassifications occurred with hard triplets, where positive and
negative pairs shared close visual similarities.
Threshold Sensitivity: Classification performance depends on the chosen distance
threshold, which may vary across use cases.
7.3 Cost Analysis
The cost analysis covers the development, training, and deployment phases of the system.
Hardware Resources:
Software Tools:
Manpower:
Compute Time:
o For cloud-based training (e.g., AWS or Google Cloud), GPU rental costs
approximately $1/hour, for a total of about $15.
Data Storage:
Cloud Hosting:
o Hosting the trained model as an API requires a cloud instance, costing $20–
$50/month depending on usage.
Edge Devices:
o Deploying the model on edge devices (e.g., mobile or IoT devices) requires
additional optimization, with one-time development costs around $1,000–
$2,000.
8. CONCLUSION
The Siamese Network-based face recognition system developed in this study successfully
achieves high accuracy, reliability, and efficiency in distinguishing similar and dissimilar
facial pairs. By leveraging triplet loss and transfer learning with pre-trained encoders, the
model ensures robust feature extraction and clear embedding separability, even for
challenging cases involving visual similarities or low-quality images.
Key Highlights:
Performance: The system achieved a remarkable 97.8% accuracy on the LFW dataset,
with strong metrics such as AUC (0.99) and precision-recall values. This
demonstrates its suitability for real-world applications in security, authentication, and
other domains requiring precise facial verification.
While the system performs robustly on standard datasets, its effectiveness can be further
enhanced by addressing the following areas:
Optimization for Edge Devices: Fine-tuning the model for low-power devices can
expand its usability for mobile and IoT applications.
9. REFERENCES
14. J. Redmon, et al., “You Only Look Once: Unified Real-Time Object Detection,” arXiv preprint arXiv:1506.02640, 2016.
15. Y. Zhang, et al., “Combining face detection and recognition in video streams,” 2019 IEEE International Conference on Image Processing, 2019.
16. A. Parkhi, et al., “Deep Face Recognition,” Proceedings of the British Machine Vision Conference, 2015.
17. H. A. Alavi, et al., “Real-Time Face Recognition Using YOLO and FaceNet,” International Journal of Advanced Computer Science and Applications, vol. 12, no. 2, pp. 417-423, 2021.
18. A. R. Rahmani, et al., “Real-Time Face Detection and Recognition System Using Cascade Classifier and YOLO,” Proceedings of the 2020 7th International Conference on Cloud Computing and Big Data Analysis, 2020.
19. Joy Buolamwini and Timnit Gebru, “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification,” Proceedings of the 1st Conference on Fairness, Accountability and Transparency, 2018.
20. D. A. Dastin, “Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women,” Reuters, 2018.
21. C. Liu, et al., “SSD: Single Shot MultiBox Detector,” European Conference on
Computer Vision, 2016.
22. J. Redmon, et al., “You Only Look Once: Unified Real-Time Object Detection,” arXiv
preprint arXiv:1506.02640, 2016.
23. Y. Zhang, et al., “Combining face detection and recognition in video streams,” 2019
IEEE International Conference on Image Processing, 2019.
24. Joy Buolamwini and Timnit Gebru, “Gender Shades: Intersectional Accuracy Disparities
in Commercial Gender Classification,” Proceedings of the 1st Conference on Fairness,
Accountability and Transparency, 2018.
25. T. Dastin, “Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against
Women,” Reuters, 2018.
26. A. Raji and J. Buolamwini, “Actionable Auditing: Investigating Bias in Machine
Learning through Adversarial Testing,” Proceedings of the 2019 AAAI/ACM Conference on
AI, Ethics, and Society, 2019.
27. A. Siddiqui, et al., “Improving facial recognition with multi-modal data,” International
Journal of Computer Vision, vol. 128, no. 9, pp. 2485-2497, 2020.
28. Y. Zhou, et al., “Integrating visual and audio data in facial recognition,” IEEE
Transactions on Multimedia, vol. 20, no. 12, pp. 3517-3529, 2018.
29. M. Vasiljevic, et al., “Incorporating behavioural data in facial recognition systems,”
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
30. J. Pan and Q. Yang, “A Survey on Transfer Learning,” IEEE Transactions on Knowledge
and Data Engineering, vol. 22, no. 10, pp. 1345-1359, 2010.
31. Y. Ganin, et al., “Domain-Adversarial Training of Neural Networks,” Journal of Machine
Learning Research, vol. 17, no. 1, pp. 2096-2030, 2016.
32. A. Khan, et al., “Using Transfer Learning for Facial Recognition Across Demographics,”
International Journal of Computer Applications, vol. 975, no. 8887, 2020.
33. Zachary C. Lipton, “The Mythos of Model Interpretability,” Proceedings of the 2016
ICML Workshop on Human Interpretability in Machine Learning, 2018.
34. M. T. Ribeiro, S. Singh, and C. Guestrin, “Why Should I Trust You?: Explaining the
Predictions of Any Classifier,” Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, 2016.
35. Finale Doshi-Velez and Been Kim, “Towards a rigorous science of interpretable machine
learning,” Proceedings of the 2017 ICML Workshop on Human Interpretability in Machine
Learning, 2017.
36. G. Koch, et al., “Siamese Neural Networks for One-Shot Image Recognition,”
Proceedings of the 32nd International Conference on Machine Learning, 2015.
37. E. Vinyals, et al., “Matching Networks for One Shot Learning,” Advances in Neural
Information Processing Systems, vol. 29, 2016.
38. J. Lake, et al., “Building Machines That Learn and Think Like People,” Proceedings of
the 36th International Conference on Machine Learning, 2015.
39. J. Brock, et al., “Long-term performance of facial recognition systems,” IEEE
Transactions on Information Forensics and Security, vol. 13, no. 12, pp. 3170-3182, 2018.
40. Y. Wang, et al., “Assessing Facial Recognition Performance with Changing User Appearances,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
41. J. Deng, et al., “Ongoing Evaluation of Facial Recognition Models,” arXiv preprint arXiv:1911.05378, 2019.
42. Dong and Tang, “Privacy-Preserving Siamese Neural Networks,” IEEE Access, 2020.
43. Liu and Bhanu, “Siamese Networks for Person Re-Identification,” IEEE Transactions on Image Processing, 2015.
44. Lee, et al., “Real-Time Face Detection on Mobile Devices,” IEEE Access, 2017.
45. Wen, et al., “Deep Face Recognition with Noisy Labels,” IEEE Transactions on Image Processing, 2016.
46. Tran, et al., “Rotating Your Faces for Representation Learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
APPENDIX A – SAMPLE CODE
import os
import cv2
import time
import random
import numpy as np
import tensorflow as tf
# Keras components used later in this appendix (backbone, layers, model classes, optimizer, plotting)
from tensorflow.keras import layers, metrics, Sequential, Model
from tensorflow.keras.applications import Xception
from tensorflow.keras.applications.inception_v3 import preprocess_input  # same [-1, 1] scaling as Xception's preprocessing
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import plot_model
import seaborn as sns
import matplotlib.pyplot as plt

tf.__version__, np.__version__
ROOT = "Extracted Faces"  # dataset root: one folder per individual (path is an assumption)

def split_dataset(directory, split=0.9):
    # Split the person folders into train/test dicts of {folder_name: number_of_images}
    folders = os.listdir(directory)
    num_train = int(len(folders) * split)
    train_list, test_list = {}, {}
    # Creating Train-list
    for folder in folders[:num_train]:
        num_files = len(os.listdir(os.path.join(directory, folder)))
        train_list[folder] = num_files
    # Creating Test-list
    for folder in folders[num_train:]:
        num_files = len(os.listdir(os.path.join(directory, folder)))
        test_list[folder] = num_files
    return train_list, test_list

train_list, test_list = split_dataset(ROOT, split=0.9)
print("Length of training list:", len(train_list))
print("Length of testing list :", len(test_list))
# train_list / test_list map each person's folder name to its number of images.
print("\nTest List:", test_list)
train_triplet = create_triplets(ROOT, train_list)
test_triplet = create_triplets(ROOT, test_list)

print("Number of training triplets:", len(train_triplet))
print("Number of testing triplets :", len(test_triplet))

print("\nExamples of triplets:")
for i in range(5):
    print(train_triplet[i])
num_plots = 6

f, axes = plt.subplots(num_plots, 3, figsize=(15, 20))
for x in get_batch(train_triplet, batch_size=num_plots, preprocess=False):
    a, p, n = x
    for i in range(num_plots):
        # Show anchor, positive and negative images side by side as a sanity check
        axes[i, 0].imshow(a[i])
        axes[i, 1].imshow(p[i])
        axes[i, 2].imshow(n[i])
    break
def get_encoder(input_shape):
    # Xception backbone pre-trained on ImageNet; only the last 27 layers remain trainable
    pretrained_model = Xception(
        input_shape=input_shape,
        weights='imagenet',
        include_top=False,
        pooling='avg',
    )
    for i in range(len(pretrained_model.layers) - 27):
        pretrained_model.layers[i].trainable = False

    encode_model = Sequential([
        pretrained_model,
        layers.Flatten(),
        layers.Dense(512, activation='relu'),
        layers.BatchNormalization(),
        layers.Dense(256, activation="relu"),
        layers.Lambda(lambda x: tf.math.l2_normalize(x, axis=1))  # unit-length embeddings
    ], name="Encode_Model")
    return encode_model
class DistanceLayer(layers.Layer):
    # A layer to compute ‖f(A) - f(P)‖² and ‖f(A) - f(N)‖²
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def call(self, anchor, positive, negative):
        ap_distance = tf.reduce_sum(tf.square(anchor - positive), -1)
        an_distance = tf.reduce_sum(tf.square(anchor - negative), -1)
        return (ap_distance, an_distance)
def get_siamese_network(input_shape=(128, 128, 3)):
    encoder = get_encoder(input_shape)

    # Input layers for the three images
    anchor_input = layers.Input(input_shape, name="Anchor_Input")
    positive_input = layers.Input(input_shape, name="Positive_Input")
    negative_input = layers.Input(input_shape, name="Negative_Input")

    ## Generate the encodings (feature vectors) for the images with the shared encoder
    encoded_a = encoder(anchor_input)
    encoded_p = encoder(positive_input)
    encoded_n = encoder(negative_input)

    # A layer to compute ‖f(A) - f(P)‖² and ‖f(A) - f(N)‖²
    distances = DistanceLayer()(encoded_a, encoded_p, encoded_n)

    # Creating the Model
    siamese_network = Model(
        inputs=[anchor_input, positive_input, negative_input],
        outputs=distances,
        name="Siamese_Network"
    )
    return siamese_network
siamese_network = get_siamese_network()
siamese_network.summary()
plot_model(siamese_network, show_shapes=True, show_layer_names=True)
class SiameseModel(Model):
    # Builds a Siamese model based on a base-model
    def __init__(self, siamese_network, margin=1.0):
        super(SiameseModel, self).__init__()
        self.margin = margin
        self.siamese_network = siamese_network
        self.loss_tracker = metrics.Mean(name="loss")

    def call(self, inputs):
        return self.siamese_network(inputs)

    def train_step(self, data):
        # GradientTape records the forward pass so the gradients of the loss
        # can be used to update the network's weights
        with tf.GradientTape() as tape:
            loss = self._compute_loss(data)
        gradients = tape.gradient(loss, self.siamese_network.trainable_weights)
        self.optimizer.apply_gradients(zip(gradients, self.siamese_network.trainable_weights))
        self.loss_tracker.update_state(loss)
        return {"loss": self.loss_tracker.result()}

    def test_step(self, data):
        loss = self._compute_loss(data)
        self.loss_tracker.update_state(loss)
        return {"loss": self.loss_tracker.result()}

    def _compute_loss(self, data):
        # Get the two distances from the network, then compute the triplet loss
        ap_distance, an_distance = self.siamese_network(data)
        loss = tf.maximum(ap_distance - an_distance + self.margin, 0.0)
        return loss

    @property
    def metrics(self):
        # Listing the metrics lets Keras reset them automatically between epochs
        return [self.loss_tracker]
siamese_model = SiameseModel(siamese_network)
optimizer = Adam(learning_rate=1e-3, epsilon=1e-01)
siamese_model.compile(optimizer=optimizer)
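# NOTE: sketch of the evaluation helper used in the training loop below. The original
# definition is not reproduced in this report, so the details here are assumptions.
def test_on_triplets(batch_size=256):
    pos_scores, neg_scores = [], []
    for data in get_batch(test_triplet, batch_size=batch_size):
        prediction = siamese_model.predict(data, verbose=0)
        pos_scores += list(prediction[0])  # anchor-positive distances
        neg_scores += list(prediction[1])  # anchor-negative distances

    accuracy = np.sum(np.array(pos_scores) < np.array(neg_scores)) / len(pos_scores)
    ap_mean, an_mean = np.mean(pos_scores), np.mean(neg_scores)
    ap_std, an_std = np.std(pos_scores), np.std(neg_scores)
    print(f"Accuracy on test = {accuracy:.5f}")
    return (accuracy, ap_mean, an_mean, ap_std, an_std)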
save_all = False
epochs = 256
batch_size = 128

max_acc = 0
train_loss = []
test_metrics = []

def train(epochs):
    global max_acc
    for epoch in range(1, epochs + 1):
        t = time.time()

        # Training the model on train data
        epoch_loss = []
        for data in get_batch(train_triplet, batch_size=batch_size):
            loss = siamese_model.train_on_batch(data)
            epoch_loss.append(loss)
        epoch_loss = sum(epoch_loss) / len(epoch_loss)
        train_loss.append(epoch_loss)

        print(f"\nEPOCH: {epoch} \t (Epoch done in {int(time.time()-t)} sec)")
        print(f"Loss on train = {epoch_loss:.5f}")

        # Testing the model on test data
        metric = test_on_triplets(batch_size=batch_size)
        test_metrics.append(metric)
        accuracy = metric[0]

        # Saving the model weights whenever the test accuracy improves
        if save_all or accuracy >= max_acc:
            siamese_model.save_weights("siamese_model")
            max_acc = accuracy

    # Saving the model after all epochs run
    siamese_model.save_weights("siamese_model-final")

train(epochs)
# Evaluating the Model
def extract_encoder(model):
    # Copy the trained encoder weights out of the Siamese model into a stand-alone encoder
    encoder = get_encoder((128, 128, 3))
    i = 0
    for e_layer in model.layers[0].layers[3].layers:
        layer_weight = e_layer.get_weights()
        encoder.layers[i].set_weights(layer_weight)
        i += 1
    return encoder

encoder = extract_encoder(siamese_model)
encoder.save_weights("encoder")
encoder.summary()

def classify_images(face_list1, face_list2, threshold=1.3):
    # Getting the encodings for the passed faces
    tensor1 = encoder.predict(face_list1)
    tensor2 = encoder.predict(face_list2)

    # Squared L2 distance between embeddings: 0 = "same person", 1 = "different people"
    distance = np.sum(np.square(tensor1 - tensor2), axis=-1)
    prediction = np.where(distance <= threshold, 0, 1)
    return prediction
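# NOTE: sketch of the metrics helper called below. Its original definition is not reproduced
# in this report, so the exact implementation is an assumption.
from sklearn.metrics import accuracy_score, confusion_matrix

def ModelMetrics(pos_list, neg_list):
    true = np.array([0] * len(pos_list) + [1] * len(neg_list))  # 0 = same person, 1 = different
    pred = np.append(pos_list, neg_list)

    # Accuracy and confusion matrix of the thresholded predictions
    print(f"Accuracy of model: {accuracy_score(true, pred)}")
    cf_matrix = confusion_matrix(true, pred)
    sns.heatmap(cf_matrix, annot=True, fmt='g')
    plt.show()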
pos_list = np.array([])
neg_list = np.array([])

for data in get_batch(test_triplet, batch_size=256):
    a, p, n = data
    pos_list = np.append(pos_list, classify_images(a, p))
    neg_list = np.append(neg_list, classify_images(a, n))
    break

ModelMetrics(pos_list, neg_list)