Analytic Project Report APR
Table of Contents
1. Abstract
2. Introduction
3. Problem Statement
4. Objective
5. Tools & Technologies
6. Dataset Description
7. Data Collection
8. Data Preparation
9. Exploratory Data Analysis (EDA)
10. Visualization Techniques
11. Genre-Wise Analysis
12. Rating Distribution Analysis
13. Votes vs Rating Correlation
14. Release Year Trends
15. Key Insights
16. Challenges Faced
17. Limitations
18. Future Scope
19. Conclusion
20. References
21. Appendix A (Full Code)
22. Appendix B (Graph Outputs)
Chapter-1
Abstract
The rise of digital platforms and global access to entertainment has led to an unprecedented
explosion in the production and consumption of movies. In this context, understanding
viewer preferences through data analytics becomes increasingly vital for filmmakers, critics,
streaming services, and content creators. This project, titled “Movie Ratings Analysis,”
focuses on analysing a curated dataset of movie attributes—such as genre, rating, number of
votes, and release year—using Python and visualization libraries to extract valuable insights
and trends.
The primary goal of the project is to discover how factors like movie genre, public votes, and
year of release influence a film's average rating. By employing libraries such as Pandas,
Matplotlib, and Seaborn, we perform structured data cleaning, exploration, and
visualization. The dataset consists of 10 sample movies from various genres like Action,
Romance, Drama, Thriller, Sci-Fi, and Crime. Though the dataset is small and synthetic, it
mirrors the patterns commonly seen in real-world data and serves as a prototype for larger
analyses.
The analysis reveals several interesting patterns: Drama and Sci-Fi movies tend to receive
higher ratings on average, while Action movies, though more frequent, show moderate
rating values. No strong linear correlation is observed between the number of votes and the
rating: some heavily voted movies received moderate ratings and vice versa, so higher viewer
engagement does not by itself mean a better rating. Additionally, average ratings fluctuate
across release years, peaking in 2016 and 2020.
Through bar charts, scatter plots, histograms, and line graphs, we present these insights in a
visual, easy-to-understand format. These visualizations help stakeholders make informed
decisions about audience preferences, genre popularity, and potential content strategies.
Although the project is based on a small dataset, it successfully demonstrates the
capabilities of data analytics in transforming raw data into strategic intelligence. The study
concludes with suggestions for future improvement, including the integration of real-time
data from sources like IMDb APIs, prediction models using machine learning, and interactive
dashboards for non-technical users.
This project serves as a foundational template for academic, professional, and business use-
cases where media data analytics is crucial. It highlights the importance of combining
statistical understanding with visual communication to unlock the full potential of data-
driven storytelling.
Chapter-2
Introduction
The global entertainment industry, particularly the film sector, has seen exponential growth
over the past few decades. With the emergence of online streaming platforms such as
Netflix, Amazon Prime, Disney+, and others, the accessibility and consumption of movies
have expanded across diverse audiences and regions. As the number of films released every
year increases, so does the volume of data associated with these films: ratings, reviews,
genres, cast, runtime, budgets, box office collections, and more.
Among these data points, movie ratings—usually out of 10—are one of the most critical
indicators of a film's success and public perception. Ratings provide a quick snapshot of how
a movie was received by its audience. When combined with additional features like genres,
number of votes, and release year, these ratings can offer powerful insights into trends in
viewer behaviour, genre popularity, and audience engagement.
This project, titled "Movie Ratings Analysis," is an exploratory data analytics project
designed to uncover meaningful patterns in a dataset of fictional but realistic movie entries.
The dataset includes key variables:
• Movie Title
• Genre
• Release Year
• Rating (on a scale of 0 to 10)
• Number of Votes
Using Python and data visualization libraries such as Pandas, Matplotlib, and Seaborn, the
project walks through a complete data analysis pipeline: from data loading and preparation,
to cleaning, transformation, exploration, and visualization.
The purpose of this study is to answer several core questions:
• What genres tend to receive higher ratings?
• Do movies with more votes tend to have better ratings?
• Which years saw more movie releases?
• What is the overall distribution of ratings in this sample?
Though the dataset is relatively small (10 entries), the techniques and concepts applied are
scalable to large real-world datasets such as those from IMDb or TMDb. This project not only
provides hands-on experience in data visualization and interpretation but also offers a
template for how to perform movie analytics using a structured, replicable approach.
Furthermore, this report emphasizes the importance of visual communication in data
science. By representing complex numeric relationships through bar charts, histograms, and
scatter plots, it becomes easier for both technical and non-technical stakeholders to
understand and act upon the insights derived from the data.
In the modern era of content recommendation engines, targeted advertising, and machine-
learning-driven entertainment platforms, such analytics play a crucial role in personalizing
the user experience and predicting trends. This project lays the foundation for such
advanced analyses by starting with exploratory techniques and ending with actionable
insights based on viewer ratings and preferences.
Chapter-3
Problem Statement
The movie industry produces thousands of films across multiple genres every year. While
some movies achieve massive critical and commercial success, others fail to meet audience
expectations. For producers, directors, streaming platforms, and marketing teams,
understanding why certain movies perform better than others is crucial for making informed
business decisions.
With the rise of online platforms such as IMDb, Rotten Tomatoes, and Metacritic, millions of
viewers now contribute their ratings and reviews for movies they've watched. These ratings,
combined with information about genres, release years, and the number of votes, represent
valuable data that can help stakeholders answer questions like:
• Which genres are consistently well-received by audiences?
• Do movies with a higher number of votes generally have higher ratings?
• Is there a relationship between the year of release and movie success?
• What is the typical rating distribution across the movie industry?
• How can this information guide future movie production and marketing strategies?
However, despite the abundance of available data, many organizations struggle to extract
actionable insights from it due to:
• The large volume of unstructured data.
• The complexity involved in analysing multiple variables.
• The lack of expertise in data visualization and interpretation.
This project aims to address these challenges by providing a structured approach to movie
ratings analysis using data science techniques.
Chapter-4
Objective
The primary objective of this project is to perform a comprehensive analysis of movie
ratings data using Python-based data analytics and visualization techniques. The study aims
to extract meaningful patterns, trends, and insights from the dataset, which can help
stakeholders in the entertainment industry better understand audience behaviour, genre
popularity, and movie success factors.
Key Objectives of the Study:
1. Data Preparation and Cleaning
• Load the provided dataset into a suitable analysis environment.
• Inspect the data for missing values, inconsistencies, or errors.
• Clean and preprocess the data to ensure it is analysis-ready.
• Convert relevant fields (e.g., Release Year) into correct data types.
2. Exploratory Data Analysis (EDA)
• Use descriptive statistics to summarize the data.
• Examine the central tendencies, dispersion, and shape of the data distribution.
• Understand the distribution of ratings across different movies and genres.
3. Genre-wise Analysis
• Group movies by their genre to compute average ratings.
• Identify which genres consistently achieve higher ratings from audiences.
• Compare performance across genres using bar plots and statistical summaries.
4. Rating Distribution Analysis
• Visualize how movie ratings are distributed across the dataset.
• Identify common rating ranges and outliers.
• Analyse whether ratings follow a normal, skewed, or bimodal distribution.
5. Votes vs. Ratings Correlation
• Investigate whether the number of votes influences the movie’s rating.
• Use scatter plots to visualize any positive or negative correlation.
• Interpret whether highly rated movies tend to attract more votes.
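As an illustration of objective 1, the cleaning steps listed above might look like the following minimal sketch. The rows, the stray whitespace, and the missing rating are hypothetical values chosen to exercise each step; they are not taken from the project's dataset, and only the column names follow the report.

```python
import pandas as pd

# Hypothetical raw rows: placeholder values, not the project's data.
raw = pd.DataFrame({
    "Movie Title": ["The Great Adventure", "Placeholder B", "Placeholder C"],
    "Genre": ["Action", " drama", "Sci-Fi "],   # inconsistent case and spacing
    "Release Year": ["2015", "2020", "2016"],   # years read in as text
    "Rating": [7.2, 9.0, None],                 # one missing rating
    "Votes": [3500, 4200, 6100],
})

# Inspect: count missing values in each column.
missing = raw.isna().sum()
print(missing)

clean = raw.copy()
# Normalise genre labels so that " drama" and "Drama" fall in the same group.
clean["Genre"] = clean["Genre"].str.strip().str.title()
# Convert Release Year from text to an integer type.
clean["Release Year"] = clean["Release Year"].astype(int)
# Drop rows without a rating; imputing a value would be an alternative choice.
clean = clean.dropna(subset=["Rating"]).reset_index(drop=True)

print(clean.dtypes)
print(clean)
```

Dropping the incomplete row keeps the sketch simple; with only a handful of movies, losing a row is costly, so a real analysis may prefer to recover the missing value from the source instead.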
➢ Ultimate Goal:
To demonstrate the power of data analytics in converting raw movie ratings data into
meaningful insights that can inform better decisions in:
• Movie production planning
• Content recommendation algorithms
• Audience targeting strategies
• Marketing and promotional efforts.
Chapter-5
Tools & Technologies Used
In this project, several tools and technologies were employed to handle data processing,
analysis, and visualization. Each tool plays a specific role in the overall data analytics
pipeline, from data loading to generating insights.
• Why VS Code?
It's lightweight, highly customizable, free, and widely used by both beginners and
professionals in the data science community.
Chapter-6
Dataset Description
Dataset Overview
The dataset is a fictional, sample dataset created to simulate real-world movie ratings data.
While small in size, it contains all the essential attributes commonly found in actual movie
databases such as IMDb, Rotten Tomatoes, or TMDb.
Dataset Structure
The dataset consists of 10 movie entries, each described by five attributes: Movie Title,
Genre, Release Year, Rating, and Votes.
Attributes in Detail
1. Movie Title
• Type: Text/String
• Description: The unique title of each movie.
• Example: The Great Adventure
2. Genre
• Type: Categorical/Text
• Description: The category or genre the movie falls under.
• Categories Present:
o Action
o Romance
o Thriller
o Sci-Fi
o Drama
o Crime
• Importance: Helps identify viewer preferences across different genres.
3. Release Year
• Type: Integer
• Range: 2015 – 2021
• Description: The year in which the movie was released.
• Importance: Used to analyze trends in movie production over time.
4. Rating
• Type: Float
• Range: 6.9 – 9.0
• Description: Average rating based on viewer feedback, on a scale of 0 to 10.
• Importance: Main performance metric for movie quality and reception.
5. Votes
• Type: Integer
• Range: 1200 – 6100
• Description: Total number of audience votes submitted for each movie.
• Importance: Used to assess viewer engagement and correlation with ratings.
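To make the structure concrete, the sketch below builds a 10-row table with the five attributes and checks the types and ranges described above. Every row is a placeholder invented for illustration (only the title "The Great Adventure" appears in this report); the values were chosen to respect the stated ranges and should not be read as the project's actual data.

```python
import pandas as pd

# Placeholder rows that follow the documented schema and ranges.
rows = [
    ("The Great Adventure", "Action",   2015, 7.2, 3500),
    ("Placeholder B",       "Romance",  2016, 7.8, 1200),
    ("Placeholder C",       "Thriller", 2017, 7.4, 2800),
    ("Placeholder D",       "Sci-Fi",   2016, 8.5, 6100),
    ("Placeholder E",       "Drama",    2020, 9.0, 4200),
    ("Placeholder F",       "Action",   2018, 6.9, 5000),
    ("Placeholder G",       "Crime",    2019, 8.6, 3900),
    ("Placeholder H",       "Sci-Fi",   2020, 8.2, 2500),
    ("Placeholder I",       "Action",   2021, 7.1, 4500),
    ("Placeholder J",       "Sci-Fi",   2017, 7.9, 3100),
]
columns = ["Movie Title", "Genre", "Release Year", "Rating", "Votes"]
df = pd.DataFrame(rows, columns=columns)

# Genre is categorical, so store it with the category dtype.
df["Genre"] = df["Genre"].astype("category")

# Column types: text, category, integer, float, integer.
print(df.dtypes)

# Minimum and maximum of each numeric attribute.
ranges = df[["Release Year", "Rating", "Votes"]].agg(["min", "max"])
print(ranges)
```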
Chapter-7
Data Collection
Introduction to Data Collection
Data collection is one of the most critical steps in any data analytics
project. The accuracy, completeness, and reliability of the dataset directly
affect the validity of the analysis and the conclusions that can be drawn
from it. In real-world scenarios, data about movies can be collected from
multiple sources such as online databases, APIs, web scraping, or through
licensed datasets.
For the purposes of this project, the data collection process was simplified
by creating a synthetic sample dataset that closely resembles actual data
structures used by popular movie databases like IMDb, TMDb, and
Rotten Tomatoes.
Chapter-8
Data Preparation
The dataset is loaded into a pandas DataFrame, a tabular format that allows
easy manipulation and exploration.
3. Initial Data Inspection
To get an overview of the dataset:
head() shows the first few records.
info() shows data types and non-null counts.
describe() provides summary statistics for numerical fields.
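A minimal sketch of loading and inspecting the data is shown here. It assumes the dataset is stored as CSV text with the five columns named in this report; the CSV content below is a set of placeholder rows (not the project's data), read from an in-memory string so the example runs on its own. With a real file, a path would be passed to read_csv instead.

```python
import io
import pandas as pd

# Placeholder CSV content standing in for the dataset file.
csv_text = """Movie Title,Genre,Release Year,Rating,Votes
The Great Adventure,Action,2015,7.2,3500
Placeholder B,Romance,2016,7.8,1200
Placeholder C,Thriller,2017,7.4,2800
Placeholder D,Sci-Fi,2016,8.5,6100
Placeholder E,Drama,2020,9.0,4200
Placeholder F,Action,2018,6.9,5000
Placeholder G,Crime,2019,8.6,3900
Placeholder H,Sci-Fi,2020,8.2,2500
Placeholder I,Action,2021,7.1,4500
Placeholder J,Sci-Fi,2017,7.9,3100
"""

# Load the CSV text into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# First five records.
print(df.head())

# Column names, data types, and non-null counts (printed directly by info()).
df.info()

# Summary statistics for the numeric columns only.
summary = df.describe()
print(summary)
```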
Chapter-9
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is the phase in which we explore the dataset to
understand its structure, detect patterns, spot anomalies, and form hypotheses. In this
project, EDA is performed using both numerical summaries and graphical
representations to gain valuable insights about the movies dataset.
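As one example of a numerical summary, the rating distribution can be described with summary statistics, a skewness value, and histogram-style counts. The sketch below uses placeholder ratings within the documented 6.9 to 9.0 range (not the project's actual values), and the one-point bands are an arbitrary choice made for illustration.

```python
import pandas as pd

# Placeholder ratings on the 0 to 10 scale; not the project's actual values.
ratings = pd.Series(
    [7.2, 7.8, 7.4, 8.5, 9.0, 6.9, 8.6, 8.2, 7.1, 7.9], name="Rating"
)

# Count, mean, standard deviation, minimum, quartiles, and maximum.
summary = ratings.describe()

# Skewness: positive means a longer tail toward high ratings,
# negative means a longer tail toward low ratings.
skewness = ratings.skew()

# Histogram-style counts per one-point band; each band includes its upper edge.
bands = pd.cut(ratings, bins=[6, 7, 8, 9, 10])
band_counts = bands.value_counts().sort_index()

print(summary)
print(f"skewness: {skewness:.3f}")
print(band_counts)
```

With only 10 values, the shape of the distribution is indicative at best, so the band counts are more informative than any formal test of normality.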
Genre Distribution
Objective:
Insight:
• Action and Sci-Fi were the most frequent genres.
• Drama and Crime were less represented in this sample.
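A genre count like the one described here can be produced with value_counts and drawn as a bar chart. The genre labels below are placeholders arranged so that Action and Sci-Fi are the most frequent; they are an assumption for illustration, not the project's rows.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder genre column for 10 movies.
genres = pd.Series(
    ["Action", "Romance", "Thriller", "Sci-Fi", "Drama",
     "Action", "Crime", "Sci-Fi", "Action", "Sci-Fi"],
    name="Genre",
)

# Number of movies per genre, most frequent first.
genre_counts = genres.value_counts()
print(genre_counts)

# One bar per genre.
fig, ax = plt.subplots(figsize=(6, 4))
genre_counts.plot(kind="bar", ax=ax, color="steelblue")
ax.set_xlabel("Genre")
ax.set_ylabel("Number of movies")
ax.set_title("Movies per genre (placeholder data)")
fig.tight_layout()
# fig.savefig("genre_counts.png")  # uncomment to write the chart to disk
```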
Votes vs Rating Correlation
Objective:
• To determine if popular movies (more votes) tend to have higher or lower ratings.
Insight:
• There is no strong linear correlation between votes and rating.
• Some highly voted movies had moderate ratings and vice versa.
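The strength of a linear relationship can be quantified with the Pearson correlation coefficient, which pandas computes with Series.corr. The sketch below uses placeholder votes and ratings within the documented ranges; the numbers are assumptions for illustration, so the coefficient it prints is not the project's result.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values; not the project's actual data.
df = pd.DataFrame({
    "Votes":  [3500, 1200, 2800, 6100, 4200, 5000, 3900, 2500, 4500, 3100],
    "Rating": [7.2, 7.8, 7.4, 8.5, 9.0, 6.9, 8.6, 8.2, 7.1, 7.9],
})

# Pearson correlation (the default method): values near 0 indicate a weak
# linear relationship, values near +1 or -1 a strong one.
r = df["Votes"].corr(df["Rating"])
print(f"Pearson r between votes and rating: {r:.3f}")

# Scatter plot: one point per movie.
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(df["Votes"], df["Rating"], color="darkorange")
ax.set_xlabel("Number of votes")
ax.set_ylabel("Rating")
ax.set_title("Votes vs rating (placeholder data)")
fig.tight_layout()
```

With 10 points, a single movie can move the coefficient noticeably, which is one reason to read the scatter plot alongside the number.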
Release Year Trends
Objective:
Insight:
• Ratings fluctuated across years, peaking in 2016 and 2020.
• 2017 and 2021 had relatively lower average ratings.
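Year-level averages of this kind come from grouping by release year. The sketch below uses placeholder years and ratings (assumptions for illustration only, so its averages will not reproduce the peaks reported above) and also counts the movies behind each average, since a yearly mean based on one or two films is fragile.

```python
import pandas as pd

# Placeholder values; not the project's actual data.
df = pd.DataFrame({
    "Release Year": [2015, 2016, 2017, 2016, 2020, 2018, 2019, 2020, 2021, 2017],
    "Rating":       [7.2, 7.8, 7.4, 8.5, 9.0, 6.9, 8.6, 8.2, 7.1, 7.9],
})

# Average rating and number of movies for each release year.
per_year = (
    df.groupby("Release Year")["Rating"]
      .agg(avg_rating="mean", n_movies="count")
      .sort_index()
)
print(per_year)
```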
Chapter-10
Data Visualization & Insights
Purpose of this Section
While data visualization is part of Exploratory Data Analysis (EDA), this section focuses on
interpreting the meaning behind the graphs. Good visualizations are not just about making
charts — they help you communicate complex ideas clearly and effectively. Here, we
analyze the key visual outputs generated and extract actionable insights.
1. Genre-wise Movie Count
Visualization Recap:
A bar plot showing the count of movies per genre.
Interpretation:
• The Action genre appears most frequently, suggesting its popularity in production.
• Genres like Drama and Crime are underrepresented but can be high performers in
terms of audience satisfaction.
• Strategic Insight: A high number of movies in a genre doesn’t guarantee better
audience ratings (as we’ll see later). Quality over quantity is critical.
2. Average Rating by Genre
Visualization Recap:
A bar plot comparing average ratings across genres.
Interpretation:
• Drama had the highest average rating in the dataset, followed by Crime and Sci-Fi.
• Action movies had the lowest average rating, despite being the most frequent.
• Strategic Insight: For studios aiming at critical acclaim or higher audience approval,
investing in Drama and Sci-Fi content may offer better returns in reputation and
awards.
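The genre comparison behind this chart can be sketched as a grouped average. The rows below are placeholders chosen to follow the ordering described in this section (Drama highest, Action lowest); they are not the project's data. The movie count is reported next to each average because several genres contain a single film.

```python
import pandas as pd

# Placeholder values; not the project's actual data.
df = pd.DataFrame({
    "Genre":  ["Action", "Romance", "Thriller", "Sci-Fi", "Drama",
               "Action", "Crime", "Sci-Fi", "Action", "Sci-Fi"],
    "Rating": [7.2, 7.8, 7.4, 8.5, 9.0, 6.9, 8.6, 8.2, 7.1, 7.9],
})

# Average rating and number of movies per genre, best-rated genre first.
by_genre = (
    df.groupby("Genre")["Rating"]
      .agg(avg_rating="mean", n_movies="count")
      .sort_values("avg_rating", ascending=False)
)
print(by_genre.round(2))
```

A genre average built from one movie is simply that movie's rating, so rankings like this one are best treated as hypotheses to test on a larger dataset.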
Summary Table
Chapter-16
Challenges Faced
Every data analysis project comes with its own set of challenges — technical,
analytical, and sometimes even creative. In this project, we faced several
obstacles while analyzing the movie dataset using Python and data visualization
tools.
Here are the major challenges explained in simple terms:
1. Small Dataset
What Happened:
• The dataset had only 10 movies, which is very small.
• With fewer data points, some insights (like correlations and trends) may
not be fully accurate or strong.
How We Handled It:
• We focused on qualitative insights (like genre performance) instead of
heavy statistics.
• Used visualization and interpretation to make meaningful observations
from limited data.
2. Balancing Genres
What Happened:
• Some genres had only one movie (e.g., Drama, Romance, Crime), while
others had more.
• This made it hard to compare genres fairly.
How We Handled It:
• Compared average ratings and votes instead of total counts.
• Made sure insights were clearly explained, keeping the data imbalance
in mind.
3. Weak Correlation Results
What Happened:
• Correlation analysis didn’t show any strong relationships between
variables.
• This could make it seem like the data has no story to tell.
How We Handled It:
• We looked beyond correlation — using scatter plots and genre-wise
analysis to find hidden patterns.
• Explained that popularity and quality are not always linked — which is
an insight in itself.
4. Visual Clarity
What Happened:
• With limited data, some charts looked flat or less meaningful.
• For example, a bar chart with only one value (like Drama or Crime) can
be hard to interpret.
How We Handled It:
• Added text-based insights below each graph.
• Used color schemes and labels to improve visual appeal and
understanding.
5. Choosing the Right Questions
What Happened:
• It was challenging to choose good questions to explore, especially with a
small dataset.
• We had to be careful to avoid over-analyzing limited data.
How We Handled It:
• Focused on basic but useful questions:
o What genre gets the highest ratings?
o Which year had the best movies?
o Is there a link between votes and ratings?
6. Code and Plot Adjustments
What Happened:
• At times, plot labels overlapped or charts didn’t show well in VS Code.
• Minor bugs in plotting code (e.g., incorrect axis labels or colors) slowed
progress.
How We Handled It:
• Used plt.tight_layout() and label rotation for clean visuals.
• Regularly debugged and improved the code for better presentation.
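The two fixes mentioned above, label rotation and plt.tight_layout(), can be sketched as follows. The bar heights are placeholder averages and the figure size and colour are arbitrary choices, not the settings used in the project.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Placeholder genre averages; not the project's actual values.
genres = ["Action", "Romance", "Thriller", "Sci-Fi", "Drama", "Crime"]
avg_ratings = [7.1, 7.8, 7.4, 8.2, 9.0, 8.6]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(genres, avg_ratings, color="steelblue")
ax.set_xlabel("Genre")
ax.set_ylabel("Average rating")
ax.set_title("Average rating by genre (placeholder data)")

# Rotate the x-axis labels so long genre names do not overlap.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right")

# Recompute margins so rotated labels and titles are not clipped.
plt.tight_layout()
# fig.savefig("avg_rating_by_genre.png")  # uncomment to write the chart to disk
```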
Chapter-17
Limitations
1. Small Number of Movies
• The dataset had only 10 movies.
• That’s too small to find very strong or accurate results.
This means the results might change if we use a bigger dataset.
2. No Audience Information
• We don’t know who voted (age, gender, or country).
So, we can’t say which group of people liked which movies the most.
Chapter-19
Conclusion
What We Did:
• Cleaned and organized the movie data.
• Created different charts to understand trends.
• Compared genres like Action, Drama, Sci-Fi, etc.
• Looked at which years had the best-rated movies.
• Checked if more votes meant better ratings.
What We Found:
• Drama and Sci-Fi movies had the highest ratings.
• Action movies were the most common but had lower ratings.
• More votes didn’t always mean a better movie.
• 2016 was the best year for movie ratings.
Why It Matters:
• This analysis helps understand what kind of movies people like.
• It can guide movie makers or streaming platforms to improve content.
• Even a small dataset can give valuable insights when analyzed properly.