Report
Anshu Kumar, Univ. Roll No.- 21301221009, Univ. Reg. No.- 212131001210036
Survi Pandey, Univ. Roll No.- 21301221046, Univ. Reg. No.- 212131001210024
Siddh Kumar, Univ. Roll No.- 21301221100, Univ. Reg. No.- 212131001210043
2023-2024
ACKNOWLEDGEMENT
We would like to take this opportunity to thank Prof. (Dr.) Gour Banerjee, Principal, The
Heritage Academy, for allowing us to form a group of four people and for supporting us
with the necessary facilities to make our project worthwhile.
We are thankful to Prof. Madhurima Banerjee (Assistant Professor, BCA), our Project
Guide who constantly supported us, and Prof. Atindra Nag (Assistant Professor,
BCA), the Project Coordinator, for providing information and clarifying the
administrative formalities related to project proceedings. Their words of encouragement
have given us impetus to excel.
We thank all our other faculty members and technical assistants at The Heritage
Academy for playing a significant role during the development of the project. Last but
not least, we thank all our friends for the cooperation and encouragement they
have extended to us.
The Heritage Academy, Kolkata
of 3rd Year 2nd Semester in BCA(H) have successfully completed their Minor Project Work
on Movie Recommendation System (Machine Learning) towards partial fulfilment of
Bachelor of Computer Applications from Maulana Abul Kalam Azad University of
Technology, West Bengal in the year 2023-2024.
___________________ _________________________
Prof. (Dr). Gour Banerjee Ms. Madhurima Banerjee
Principal Project Guide, Asst.Professor
The Heritage Academy The Heritage Academy
___________________________
Abstract
In the era of abundant digital content, the demand for personalized recommendations has
become increasingly crucial, especially in the realm of cinematic experiences. This abstract
introduces a cutting-edge movie recommendation system that leverages machine learning
algorithms to provide users with tailored suggestions, maximizing their enjoyment and
engagement with the diverse world of film.
Our proposed system employs a collaborative filtering approach, analysing user preferences
and behaviours to generate accurate and personalized movie recommendations. By
harnessing the power of advanced algorithms, such as matrix factorization and deep learning,
the system transcends traditional genre-based recommendations, taking into account nuanced
user tastes and evolving viewing patterns.
The methodology involves collecting and processing vast datasets of user interactions,
including ratings, watch histories, and implicit feedback, to train the recommendation model.
This extensive dataset is harnessed to create a robust and adaptable system capable of
continuously learning and improving its predictive accuracy over time.
To enhance the user experience, the system incorporates feature engineering techniques that
consider contextual information such as time of day, viewing platform, and social
interactions. The integration of these factors ensures that recommendations not only reflect
individual preferences but also adapt to the dynamic nature of users' viewing habits.
Furthermore, the system addresses the challenge of the cold-start problem by incorporating
content-based recommendations for new users or items with limited historical data. This
holistic approach enables the recommendation engine to provide valuable suggestions even
in scenarios with sparse user interactions.
Table of Contents
● Introduction
● Objectives
● Data Description
● Data Preprocessing
● Exploratory Data Analysis (EDA)
● Lemmatise words and text preprocessing to analyze Sentiment of Users
● Building Machine Learning Models
● Recommendation System
○ Property Based
○ Popularity Based
○ User Collaborative
■ Pearson Correlation
● Deployment of the Application
● Hardware Requirements
● Software Requirements
● Limitations
● Brief Descriptions
● Conclusion
● References
1. Introduction:
In the contemporary landscape of digital entertainment, the Movie Recommendation System
to Recommend Movies to a Like User project combines natural language processing and
machine learning to decipher the user sentiments embedded in movie reviews. The
project's focal points are sentiment analysis, to understand user emotions, and a
personalized movie recommendation system tailored to individual preferences.
1.2 Objectives:
The project unfolds with two primary objectives:
1.3 Motivation:
The motivation behind this undertaking lies in the aspiration to redefine the user experience
in the realm of movie recommendations. By going beyond traditional rating systems and
embracing sentiment analysis, the project endeavours to provide users with recommendations
that resonate with their emotional responses to movies, resulting in a more enriching and
personalized cinematic journey.
2. Objectives:
The Movie Recommendation System to Recommend Movies to a Like User project is
designed with two principal objectives, aiming to revolutionize the movie
recommendation landscape through the integration of sentiment analysis and
personalized user preferences.
Sentiment Analysis:
Tasks:
● Implement natural language processing techniques for tokenization,
stemming, and removal of stop words.
● Categorize sentiments into positive, neutral, or negative classes.
● Gain insights into the emotional nuances conveyed in user reviews.
Movie Recommendation System:
Description:
● Develop a dynamic and personalized recommendation system that
suggests movies based on aligned sentiments and user preferences.
Tasks:
● Utilize sentiment analysis results to understand user emotions and
preferences.
● Implement a collaborative filtering approach to identify correlations
between users.
● Generate movie recommendations for users based on sentiment-aligned
preferences.
2.2 Project Scope:
The project's scope extends beyond traditional movie recommendation systems,
incorporating a human-centric understanding of sentiments to enhance the quality and
relevance of suggestions. By considering user emotions expressed in reviews, the
system aims to provide a more engaging and personalized movie-watching experience.
2.4 Impact:
The successful fulfilment of these objectives is anticipated to contribute to the evolution
of recommendation systems, offering a novel approach that considers the emotional
context of user reviews. The impact is reflected in the potential enhancement of user
engagement and satisfaction within the diverse landscape of cinematic content. 
3. Dataset Description:
The custom dataset used in the Movie Recommendation System to Recommend Movies
to a Like User project contains 4000+ entries, each capturing essential
information about a user-generated movie review. The dataset's structure includes seven
key features, providing a foundation for sentiment analysis and recommendation system
development.
3.1 Structure:
The dataset is structured with the following key features:
1. Movie Name:
The title of the movie for which the review is provided.
2. Movie ID:
Unique identifiers assigned to each movie in the dataset.
3. User ID:
Identification numbers corresponding to individual users submitting
reviews.
4. Reviews:
Textual content containing the user-generated movie reviews.
5. Ratings:
Numeric ratings given by users, providing a quantitative measure of their
overall satisfaction with the movies.
6. Genre:
The genre of the movie, for example action, thriller, etc.
7. Overview:
A description or summary of the movie.
3.2 Size:
The dataset comprises 4000+ records, ensuring a sufficiently diverse set of reviews for
analysis and model development.
3.3 Making of the data:
The dataset was built by web scraping, using the Python library Beautiful Soup.
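As a rough illustration of the scraping step, the sketch below pulls a rating and a review text out of an HTML fragment. To stay dependency-free it uses Python's built-in html.parser module as a stand-in for Beautiful Soup, which is the library the project actually used. The markup, the class names "rating" and "review-text", and the ReviewParser class are all hypothetical, since the report does not name the scraped site.

```python
from html.parser import HTMLParser

# Hypothetical markup: the real site's structure is not given in the report.
SAMPLE_HTML = """
<div class="review">
  <span class="rating">8</span>
  <p class="review-text">A gripping thriller with a great cast.</p>
</div>
<div class="review">
  <span class="rating">3</span>
  <p class="review-text">Slow and predictable.</p>
</div>
"""


class ReviewParser(HTMLParser):
    """Collects rating/review records from the sample markup."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished {"rating": ..., "review": ...} records
        self._field = None    # field currently being read, if any
        self._current = {}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "rating":
            self._field = "rating"
        elif css_class == "review-text":
            self._field = "review"

    def handle_data(self, data):
        if self._field and data.strip():
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if self._field == "review" and tag == "p":
            # In this sample layout the review paragraph closes one record.
            self._current["rating"] = int(self._current["rating"])
            self.rows.append(self._current)
            self._current = {}
        self._field = None


def scrape_reviews(html):
    parser = ReviewParser()
    parser.feed(html)
    return parser.rows


reviews = scrape_reviews(SAMPLE_HTML)
```

Each scraped record would then be stored as one row of the dataset, alongside the movie name, movie ID, and user ID.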
fig. 1
3.4 Preprocessing:
Prior to analysis, the dataset underwent preprocessing to ensure data quality and
integrity. Key preprocessing tasks included:
● Handling Duplicates:
○ Duplicate entries were identified and addressed to maintain the accuracy of
the analysis.
● Missing Values:
○ Steps were taken to handle any missing values to prevent data gaps that
could impact the results.
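The two preprocessing tasks above can be sketched in plain Python as follows. This is a minimal illustration, not the project's code: the sample rows are invented, the column names are loosely based on section 3.1, and dropping incomplete rows is only one possible way of handling missing values (an assumption, since the report does not say which strategy was used).

```python
# Invented sample rows; column names loosely follow section 3.1.
rows = [
    {"movie_id": 1, "user_id": 10, "review": "Loved it", "rating": 9},
    {"movie_id": 1, "user_id": 10, "review": "Loved it", "rating": 9},   # exact duplicate
    {"movie_id": 2, "user_id": 11, "review": None, "rating": 4},          # missing review
    {"movie_id": 3, "user_id": 12, "review": "Average", "rating": None},  # missing rating
    {"movie_id": 4, "user_id": 13, "review": "Great fun", "rating": 8},
]


def drop_duplicates(records):
    """Keep the first occurrence of each identical record, preserving order."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items(), key=lambda kv: kv[0]))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def count_missing(records):
    """Count missing (None) values per column."""
    counts = {}
    for record in records:
        for column, value in record.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts


def drop_missing(records):
    """Drop any record that has at least one missing value (assumed strategy)."""
    return [r for r in records if all(v is not None for v in r.values())]


deduplicated = drop_duplicates(rows)
missing_counts = count_missing(deduplicated)
clean_rows = drop_missing(deduplicated)
```

With a dataframe library such as pandas, the same steps are usually one-liners, but the logic is the same: remove repeated rows first, then count and handle the gaps.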
3.5 Sample Data:
Here's a glimpse of the first few entries in the dataset:
fig. 2
4. Data Preprocessing:
The dataset underwent thorough preprocessing to ensure data quality and integrity
before conducting sentiment analysis and building the recommendation system. The
following steps were taken:
fig. 3
4.2 Missing Values:
Steps were taken to handle any missing values to prevent data gaps that could impact
the results. The dataset was examined for missing values, and the corresponding counts
were recorded.
fig. 4
fig. 5
5.1 Distribution of Ratings
fig. 6
fig. 7
5.2 Distribution of Ratings according to Genre:
This graph shows the distribution of the ratings according to the genre of the movies.
fig. 8
fig. 9
5.3 Positive, Neutral and Negative Words in Review
The number of words in each review was analysed and visualized by sentiment
category: positive, neutral, and negative.
fig. 10
fig. 11
6. Lemmatise words and text preprocessing to analyze Sentiment of Users:
Lemmatization is a natural language processing (NLP) technique that reduces words to
their base or root form, known as a "lemma." For instance, the words "running," "ran,"
and "runs" are all forms of the lemma "run." Lemmatization aims to group together
different forms of a word so they can be analyzed as a single item. This process
considers the context and the word's part of speech, making it more sophisticated than
simple stemming, which might cut off prefixes or suffixes but doesn't consider the
word's meaning or context.
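The difference between lemmatization and stemming can be illustrated with a toy sketch. The lookup table and the suffix list below are invented for illustration only; a real lemmatizer, such as the one shipped with spaCy, relies on a much larger vocabulary and on part-of-speech information.

```python
# Toy lemma dictionary, for illustration only.
LEMMAS = {
    "running": "run",
    "ran": "run",
    "runs": "run",
    "movies": "movie",
    "better": "good",
}

# Suffixes the naive stemmer strips, tried in this order.
SUFFIXES = ("ing", "es", "s")


def lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)


def naive_stem(word):
    """Strip the first matching suffix without looking at meaning."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


words = ["running", "ran", "runs"]
lemmas = [lemmatize(w) for w in words]   # all three map to "run"
stems = [naive_stem(w) for w in words]   # "runn", "ran", "run"
```

The stemmer leaves "ran" untouched and turns "running" into "runn", whereas the lemmatizer groups all three forms under the single lemma "run".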
fig. 12
6.1 Categorizing movies based on Sentiments:
This method categorises the movies into three sentiment categories: positive, neutral,
and negative.
fig. 13
Output of the previous snippet.
fig. 14
6.2 Visual Representation of Most Frequent Words in Positive Reviews
fig. 15
fig. 16
fig. 17
7.1 Random Forest Classifier:
A Random Forest classifier is a machine learning algorithm used for classification tasks. It is a
type of ensemble learning method that builds multiple decision trees during training and
outputs the class that is the mode of the classes predicted by the individual trees.
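The voting idea behind the ensemble described above (commonly called a random forest) can be sketched with a toy implementation. This is a simplified illustration under stated assumptions, not the project's model: each "tree" is a one-split decision stump, the data are invented, and the function names are hypothetical. A library implementation grows full trees and offers many more options.

```python
import random
from collections import Counter


def majority(labels):
    """Return the most common label (the mode)."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y, feature):
    """Fit a one-split tree on a single feature by minimising training errors."""
    best = None
    for threshold in sorted({row[feature] for row in X}):
        left = [lab for row, lab in zip(X, y) if row[feature] <= threshold]
        right = [lab for row, lab in zip(X, y) if row[feature] > threshold]
        if not left or not right:
            continue
        left_label, right_label = majority(left), majority(right)
        errors = sum(lab != left_label for lab in left)
        errors += sum(lab != right_label for lab in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:  # all feature values identical: predict the majority class
        label = majority(y)
        return (feature, float("inf"), label, label)
    return (feature, best[1], best[2], best[3])


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def fit_forest(X, y, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        sample_X = [X[i] for i in idx]
        sample_y = [y[i] for i in idx]
        feature = rng.randrange(len(X[0]))
        forest.append(fit_stump(sample_X, sample_y, feature))
    return forest


def predict_forest(forest, row):
    """Each tree votes; the forest outputs the mode of the votes."""
    return majority([predict_stump(tree, row) for tree in forest])


# Invented two-feature data: class 0 has small values, class 1 has large values.
X = [[1, 1], [2, 1], [1, 2], [2, 2], [8, 8], [9, 8], [8, 9], [9, 9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X, y)
low_prediction = predict_forest(forest, [0, 0])
high_prediction = predict_forest(forest, [10, 10])
```

Because every tree sees a different bootstrap sample, individual trees can disagree; taking the mode of their votes makes the final prediction more stable than any single tree.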
7.2 spaCy:
spaCy is designed specifically for production use and helps you build applications that
process and understand large volumes of text. It can be used to build information extraction
or natural language understanding systems or to pre-process text for deep learning. It
provides advanced capabilities to conduct natural language processing (NLP) on large
volumes of text at high speed. It helps you build models and production applications that
can underpin document analysis, chatbot capabilities and all other forms of text analysis.
7.3 en_core_web_sm:
fig. 18
Importing the libraries.
fig. 19
fig. 20
fig. 21
fig. 23
fig. 24
fig. 25
Output of the model built from the three distinct models. 0.65 is used as the
threshold value.
8. Recommendation System:
This is a movie recommendation system based on Machine Learning and Natural Language
Processing.
First, we perform sentiment analysis on the user reviews using natural language
processing (NLP) to extract each user's sentiment per movie, and we check the accuracy of
the sentiment analysis using Logistic Regression, Random Forest Classifier and Decision
Tree models. Then we proceed to develop the recommender system.
You can choose the recommendation type from the dropdown menu.
There are 5 different types of recommendation systems, each with a different role.
fig. 26
fig. 27
Adding data of two dataframes together.
fig. 28
Vectorising the data
fig. 29
fig. 30
Recommending Movies Based on the similarity of the movies.
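A similarity-based recommender of this kind can be sketched as follows. The catalogue is invented, and the approach shown (bag-of-words counts compared with cosine similarity) is an assumption about the vectorising step; the figures above may use a library vectoriser with different settings.

```python
import math
import re
from collections import Counter

# Invented catalogue: genre and overview text merged into one string per movie.
MOVIES = {
    "Space Quest": "science fiction space adventure crew explores distant planet",
    "Star Voyage": "science fiction space crew travels to distant star",
    "Love in Paris": "romance comedy couple falls in love in paris",
}


def vectorise(text):
    """Turn text into a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend_similar(title, top_n=2):
    """Rank every other movie by cosine similarity to the given title."""
    target = vectorise(MOVIES[title])
    scores = [
        (other, cosine_similarity(target, vectorise(text)))
        for other, text in MOVIES.items()
        if other != title
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


similar = recommend_similar("Space Quest")
```

Here "Star Voyage" shares five of its eight words with "Space Quest", so it ranks first, while "Love in Paris" shares none and scores zero.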
fig. 31
Recommending movies based on the ratings provided by the users (popularity).
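A popularity-based ranking can be sketched as below. The ratings are invented, the 1-10 scale is assumed, and the min_ratings cut-off is a hypothetical addition (not mentioned in the report) that keeps a movie with a single high rating from topping the list.

```python
from collections import defaultdict

# Invented (movie, rating) pairs; ratings are on an assumed 1-10 scale.
RATINGS = [
    ("Movie A", 9), ("Movie A", 8), ("Movie A", 10),
    ("Movie B", 10),
    ("Movie C", 6), ("Movie C", 7), ("Movie C", 5),
]


def popular_movies(ratings, min_ratings=2, top_n=5):
    """Rank movies by mean rating, ignoring movies with too few ratings."""
    by_movie = defaultdict(list)
    for movie, rating in ratings:
        by_movie[movie].append(rating)
    averages = [
        (movie, sum(values) / len(values), len(values))
        for movie, values in by_movie.items()
        if len(values) >= min_ratings
    ]
    return sorted(averages, key=lambda row: row[1], reverse=True)[:top_n]


ranking = popular_movies(RATINGS)
```

The same list is shown to every user, which makes this approach a reasonable fallback for new users who have no rating history yet.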
It is important to note that while Pearson correlation measures linear relationships, it
may not capture nonlinear associations between variables.[1]
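As a minimal sketch, the Pearson correlation between two users can be computed over the movies that both have rated. The user dictionaries are invented, and returning 0.0 when the correlation is undefined (fewer than two shared movies, or a user whose shared ratings are all identical) is a design choice made here, not something stated in the report.

```python
import math


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the movies both users have rated."""
    common = sorted(set(ratings_a) & set(ratings_b))
    n = len(common)
    if n < 2:
        return 0.0  # not enough overlap to measure a relationship
    xs = [ratings_a[m] for m in common]
    ys = [ratings_b[m] for m in common]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # identical ratings throughout: no variation to correlate
    return cov / math.sqrt(var_x * var_y)


# Invented users: movie title -> rating.
user_1 = {"Movie A": 9, "Movie B": 4, "Movie C": 7}
user_2 = {"Movie A": 8, "Movie B": 3, "Movie C": 6, "Movie D": 10}
user_3 = {"Movie A": 2, "Movie B": 9, "Movie C": 5}

score_12 = pearson(user_1, user_2)  # same taste, shifted by one point
score_13 = pearson(user_1, user_3)  # roughly opposite taste
```

User 2 rates every shared movie exactly one point lower than user 1, so their correlation is 1 up to rounding, while user 3's opposite preferences give a strongly negative value. Mathematically the coefficient always lies between -1 and 1.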
fig. 32
Code to Find Pearson Correlation between two Users.
fig. 33
Finding Pearson Correlation Between the Users.
fig. 34
Movies are recommended on the basis of the correlation value between user 1 and
the other users. If the score is above 0.75, the two users are treated as correlated. We
then remove the movies already seen by user 1 from the movies seen by the correlated
users and display the remaining movies to user 1.
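The procedure above can be sketched end to end as follows. The 0.75 threshold comes from the text; the ratings, the function names, and the choice to return 0.0 for an undefined correlation are assumptions made for illustration.

```python
import math

THRESHOLD = 0.75  # correlation cut-off stated in the report

# Invented ratings: user id -> {movie: rating}.
RATINGS = {
    1: {"Movie A": 9, "Movie B": 4, "Movie C": 7},
    2: {"Movie A": 8, "Movie B": 3, "Movie C": 6, "Movie D": 10},
    3: {"Movie A": 2, "Movie B": 9, "Movie C": 5, "Movie E": 9},
}


def pearson(a, b):
    """Pearson correlation over the movies rated by both users."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    xs, ys = [a[m] for m in common], [b[m] for m in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = math.sqrt(
        sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    )
    return cov / spread if spread else 0.0


def recommend(target, ratings, threshold=THRESHOLD):
    """Suggest movies seen by correlated users but not by the target user."""
    seen = set(ratings[target])
    suggestions = set()
    for other, other_ratings in ratings.items():
        if other == target:
            continue
        if pearson(ratings[target], other_ratings) > threshold:
            suggestions |= set(other_ratings) - seen
    return sorted(suggestions)


recommendations = recommend(1, RATINGS)
```

User 2 correlates strongly with user 1, so the unseen "Movie D" is recommended; user 3 correlates negatively, so "Movie E" is not.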
Hardware Requirements
The hardware requirements for a machine learning system can vary based on the
specific tasks and models involved, as well as the scale of the data being processed.
Here's a general outline of hardware components commonly considered for a machine
learning system:
● Central Processing Unit (CPU):
○ Multi-core processors are essential for parallel processing, which is beneficial
for tasks like data preprocessing and some aspects of model training.
○ CPUs with high clock speeds can speed up sequential operations.
● Graphics Processing Unit (GPU):
○ GPUs are crucial for accelerating deep learning model training. They excel at
handling the matrix operations involved in neural network computations.
○ NVIDIA GPUs, especially those from the Tesla and GeForce series, are widely
used for deep learning tasks, and software frameworks like TensorFlow and
PyTorch are optimized for GPU acceleration.
● Random Access Memory (RAM):
○ Sufficient RAM is necessary to handle the size of datasets and the memory
requirements of machine learning models.
○ Large datasets and complex models may require tens or hundreds of gigabytes
of RAM.
● Storage:
○ Fast and ample storage is crucial for storing large datasets, model parameters,
and intermediate results.
○ Solid State Drives (SSDs) are preferred over Hard Disk Drives (HDDs) for
faster data access.
● Network Interface Card (NIC):
○ A high-speed network interface is essential for efficiently transferring data
between distributed components in a machine learning system, especially in a
cluster or cloud environment.
● Dedicated Hardware for Inference (Optional):
○ For systems where real-time inference is a requirement, dedicated hardware
such as Field-Programmable Gate Arrays (FPGAs) or specialized hardware like
Google's Tensor Processing Units (TPUs) can be considered.
● Cluster or Distributed System (Optional):
○ For large-scale machine learning tasks, a cluster of machines may be required
to distribute the workload and handle parallel processing.
○ Technologies like Apache Spark, Kubernetes, and Hadoop can be employed to
manage distributed computing resources.
● Cooling System:
○ Given the intensive computational nature of machine learning tasks, adequate
cooling is necessary to prevent overheating of components.
● Power Supply:
○ A stable and sufficient power supply is critical, especially for systems running
resource-intensive machine learning tasks for extended periods.
● Hardware Accelerators (Optional):
○ Specialized hardware accelerators, such as TPUs or custom ASICs
(Application-Specific Integrated Circuits), can be used for specific machine
learning workloads.
Software Requirements
Software requirements for a machine learning project encompass a combination of
libraries, frameworks, and tools that facilitate the development, training, evaluation,
and deployment of machine learning models. The specific requirements can vary based
on the project's goals and the chosen machine learning approach. Here's a general list of
software requirements for a machine learning project:
● Programming Language:
○ Choose a programming language suitable for machine learning. Python is
widely used for its extensive libraries and frameworks, including TensorFlow,
PyTorch, scikit-learn, and Keras.
● Integrated Development Environment (IDE):
○ Select an IDE that supports the chosen programming language. Popular choices
for Python include Jupyter Notebooks, PyCharm, and VS Code.
● Machine Learning Libraries/Frameworks:
○ Depending on the project requirements, include relevant machine learning
libraries and frameworks:
■ TensorFlow or PyTorch for deep learning.
■ scikit-learn for traditional machine learning algorithms.
■ Keras as a high-level neural networks API.
■ XGBoost, LightGBM, or CatBoost for gradient boosting.
● Data Processing Libraries:
○ Pandas for data manipulation and analysis.
○ NumPy for numerical operations on arrays.
● Data Visualization Tools:
○ Matplotlib or Seaborn for static visualizations.
○ Plotly or Bokeh for interactive visualizations.
● Version Control:
○ Git for version control to track changes and collaborate with team members.
○ Platforms like GitHub or GitLab for hosting repositories.
● Database Management System (DBMS):
○ If applicable, choose a DBMS to store and retrieve data efficiently. Common
choices include MySQL, PostgreSQL, or MongoDB.
● Data Annotation Tools (if applicable):
○ Tools for labeling and annotating data, such as Labelbox or Prodigy.
● Model Evaluation Metrics:
○ Implement metrics for evaluating model performance, depending on the
project's objectives (e.g., accuracy, precision, recall, F1-score, ROC-AUC).
● Testing Framework:
○ Implement unit tests and integration tests using a testing framework like PyTest
or unittest.
● Containerization and Orchestration (Optional):
○ Docker for containerizing applications.
○ Kubernetes or Docker Compose for orchestration in a distributed environment.
● Continuous Integration/Continuous Deployment (CI/CD) Tools:
○ Jenkins, GitLab CI, or Travis CI for automating the testing and deployment
processes.
● Documentation Tools:
○ Use tools like Sphinx or MkDocs for creating project documentation.
● Collaboration Tools:
○ Communication and collaboration tools, such as Slack, Microsoft Teams, or
other project management tools.
● Cloud Services (if applicable):
○ Cloud platforms like AWS, Google Cloud, or Azure for scalable computing
resources and services.
● Security Tools (if applicable):
○ Implement security measures, including encryption and access controls,
depending on the sensitivity of the data.
● Model Deployment Tools:
○ Platforms like TensorFlow Serving, Flask, FastAPI, or container orchestration
tools for deploying machine learning models.
● Monitoring and Logging Tools:
○ Tools like Prometheus or ELK Stack for monitoring and logging model
performance and system behavior.
Limitations
If the Pearson correlation value in the coefficient column falls outside the range -1 to 1,
it is treated as a data anomaly.
Brief Description:
Sentiment Analysis, a powerful tool in natural language processing, takes centre stage in
revolutionizing the movie recommendation landscape. This innovative system harnesses the
emotional undercurrents within user reviews to offer movie recommendations that resonate
with individual sentiments. The foundation of this approach lies in the extraction and analysis
of sentiment from user-generated reviews. By employing advanced sentiment analysis
algorithms, the system deciphers the emotional tone of reviews, identifying whether
sentiments are positive, negative, or neutral. This nuanced understanding of user emotions
allows for a more insightful and personalized movie recommendation process.
The system also adapts to the evolving sentiments of users over time, ensuring that
recommendations remain relevant and align with changing preferences. By continuously
learning from user feedback and sentiment patterns, the system enhances its accuracy and
responsiveness, creating a dynamic and personalized movie recommendation ecosystem.
Conclusion
In conclusion, the Movie Recommendation System employing machine learning models
stands as a testament to the transformative potential of advanced technologies in enhancing
user experiences within the vast landscape of cinematic exploration. By harnessing the power
of collaborative filtering, deep learning, and sentiment analysis, this system goes beyond
conventional genre-based recommendations, delving into the intricacies of individual
preferences and emotions.
The integration of sophisticated algorithms, coupled with comprehensive datasets, allows the
system to not only accurately predict user preferences but also adapt and evolve with
changing viewing habits. The inclusion of contextual information, such as time, platform,
and social interactions, ensures that recommendations remain dynamic and attuned to the
nuances of each user's cinematic journey.
Moreover, the system addresses challenges such as the cold-start problem by incorporating
content-based recommendations, ensuring that even new users or items with limited
historical data receive valuable and relevant suggestions. The holistic approach taken by the
Movie Recommendation System underscores its commitment to providing a personalized
and immersive cinematic experience for users of diverse tastes and backgrounds.
As technology continues to advance, the Movie Recommendation System represents a
significant stride toward redefining how audiences engage with films. By creating a bridge
between users and a vast sea of cinematic content, this system not only streamlines the
discovery process but also fosters a deeper connection between viewers and the stories that
resonate with their emotions and preferences. In essence, the Movie Recommendation
System marks a compelling fusion of artificial intelligence and entertainment, paving the
way for a more enriched and tailored cinematic journey for audiences worldwide.
References:
[1] https://stackabuse.com/calculating-pearson-correlation-coefficient-in-python-with-numpy/, 22 May 2024.