IMDB Movie Analysis

The goal of this project is to analyze a dataset of IMDB movies and draw
insights from the data. The dataset includes various columns such as movie
names, budgets, gross revenue, and IMDB ratings. To complete the project, you
will need to use a combination of Excel formulas and SQL commands to clean
and manipulate the data. You will be asked to complete specific tasks, such as
identifying the movie with the highest profit or the top IMDB movies, as well as
share your own insights by identifying any problems or trends in the data. You
may also be asked to use charts and visualizations to present your findings.
The overall objective of the project is to gain a better understanding of the
movie industry by analyzing the data and drawing meaningful conclusions.
1. Understand the data: Before beginning the analysis, I took some time to familiarize myself with the data, looking at
its structure and getting a sense of the overall content. This helped me identify potential issues or challenges
that I might need to address as I proceeded with the analysis.

2. Check for missing or incomplete data: I checked the dataset for blank values and missing data.

3. Identify and handle outliers: Outliers are data points that differ significantly from the rest of the data.
They can have a large impact on summary statistics and distort the results of the analysis. It is important
to identify any outliers and decide how to handle them, for example by excluding them from the analysis or by treating
them as separate cases.

4. Communicate the findings: Once the analysis was complete, I presented the findings in a clear
and concise way, using visualizations such as charts and graphs to communicate the results and
explaining the methodology and the implications of the results.

Tools used: MS Excel, MySQL, Power BI


A. Cleaning the data:
This is one of the most important steps to perform before moving forward with the analysis. Use the
knowledge you have learned so far to do this (dropping columns, removing null values, etc.).
Your task: Clean the data.
Ans: To clean the data, I arranged the columns in the correct format and increased the column width to improve
readability. I also removed null values and duplicates by selecting them using the "Find & Select" option and
deleting the entire row. This helped to ensure that the data was accurate and ready for analysis.

Before Cleaning: 5044 rows and 28 columns
After Cleaning: 3850 rows and 13 columns
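
The cleaning above was done by hand in Excel. As a rough illustration of the same two operations (dropping rows with blank cells and dropping duplicates), here is a minimal Python sketch. The sample rows, the column list in KEEP, and the function name clean are assumptions made for this example and are not taken from the original workbook.

```python
import csv
import io

# Hypothetical miniature of the raw file; the real dataset has 5044 rows and 28 columns.
RAW = """movie_title,budget,gross,imdb_score,color
Avatar,237000000,760505847,7.9,Color
Avatar,237000000,760505847,7.9,Color
Unknown Film,,1000000,6.0,Color
Titanic,200000000,658672302,7.7,Color
"""

# Assumed columns of interest; every other column is dropped.
KEEP = ["movie_title", "budget", "gross", "imdb_score"]


def clean(rows, keep):
    """Keep only the chosen columns, then drop incomplete and duplicate rows."""
    seen = set()
    cleaned = []
    for row in rows:
        record = tuple(row[col].strip() for col in keep)
        if any(value == "" for value in record):
            continue  # drop rows that have a blank cell
        if record in seen:
            continue  # drop exact duplicates of an earlier row
        seen.add(record)
        cleaned.append(dict(zip(keep, record)))
    return cleaned


rows = list(csv.DictReader(io.StringIO(RAW)))
cleaned_rows = clean(rows, KEEP)
print(len(rows), "rows before,", len(cleaned_rows), "rows after")
```

Comparing the row counts before and after, as the report does, is a quick sanity check that the cleaning removed only what was intended.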
B. Movies with highest profit:
Create a new column called profit which contains the difference of the two columns: gross and budget. Sort
the data using the profit column as reference. Plot profit (y-axis) vs budget (x-axis) and observe the
outliers using the appropriate chart type.
Your task: Find the movies with the highest profit.
Ans: For this task, I first created a new column to store the profit of each movie by taking the
difference between gross and budget.

To identify outliers in the data, I plotted a chart and looked for any unusually high or low values. One example
of an outlier that I observed was a value of -1.2E+10.
I used Power BI to create a visualization showing that the movie with the highest profit was "Avatar".
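
The profit column and the sort described above can be sketched in a few lines of Python. The titles and figures below are made up for illustration, and the helper name add_profit is hypothetical; the original work was done in Excel and Power BI.

```python
# Hypothetical sample records; the figures are invented for illustration only.
movies = [
    {"movie_title": "Film B", "budget": 50, "gross": 300},
    {"movie_title": "Film C", "budget": 12000, "gross": 20},
    {"movie_title": "Film A", "budget": 200, "gross": 700},
]


def add_profit(records):
    """Return copies of the records with profit = gross - budget, highest profit first."""
    with_profit = [dict(m, profit=m["gross"] - m["budget"]) for m in records]
    return sorted(with_profit, key=lambda m: m["profit"], reverse=True)


ranked = add_profit(movies)
for movie in ranked:
    print(movie["movie_title"], movie["profit"])
```

After sorting, the most profitable movie is the first record, and any large negative values (the kind of outlier noted above) collect at the end of the list, where they are easy to inspect.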
C. Top 250: Create a new column IMDb_Top_250 and store the top 250 movies with the highest IMDb
Rating (corresponding to the column: imdb_score). Also make sure that for all of these movies,
the num_voted_users is greater than 25,000. Also add a Rank column containing the values 1 to 250
indicating the ranks of the corresponding films.

Extract all the movies in the IMDb_Top_250 column which are not in the English language and store
them in a new column named Top_Foreign_Lang_Film. You can also use your own imagination!

Your task: Find the IMDB Top 250.

I used an SQL query to identify the 250 movies with the highest IMDB scores and
more than 25,000 voted users. Here is the list:

(250 rows)

I used an SQL query to create a table of top foreign-language films from the top 250 IMDB movies. The films
in this table are those whose language is not English.

Of the top 250 IMDB movies, only 37 are not in the English language. This
suggests that English-language films dominate the top-rated movies in this dataset.

(37 rows)
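
The report does not reproduce the SQL that was used. The sketch below shows one way such a query might look, run through Python's built-in sqlite3 module rather than MySQL so that it is self-contained. The table name movies, the sample rows, the alphabetical tie-break on title, and the small TOP_N value are all assumptions for illustration; the column names follow those mentioned in the task.

```python
import sqlite3

# Hypothetical sample rows: (movie_title, language, imdb_score, num_voted_users).
SAMPLE = [
    ("Film A", "English", 9.0, 100000),
    ("Film B", "French", 8.8, 50000),
    ("Film C", "English", 9.5, 1000),   # high score but too few votes
    ("Film D", "Japanese", 8.5, 30000),
    ("Film E", "English", 8.0, 40000),
]
TOP_N = 3  # the task uses 250; kept small so the sample data is meaningful

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movies ("
    "movie_title TEXT, language TEXT, imdb_score REAL, num_voted_users INTEGER)"
)
conn.executemany("INSERT INTO movies VALUES (?, ?, ?, ?)", SAMPLE)

# Highest-rated movies with more than 25,000 votes; ties broken by title (an assumption).
top = conn.execute(
    """
    SELECT movie_title, language, imdb_score
    FROM movies
    WHERE num_voted_users > 25000
    ORDER BY imdb_score DESC, movie_title
    LIMIT ?
    """,
    (TOP_N,),
).fetchall()

# Add the Rank column (1..N), then pick out the non-English entries.
ranked = [(rank, *row) for rank, row in enumerate(top, start=1)]
foreign = [row for row in ranked if row[2] != "English"]

for row in ranked:
    print(row)
print("Foreign-language films:", [row[1] for row in foreign])
conn.close()
```

Filtering on num_voted_users before ordering matters here: without the vote threshold, a movie with a very high score from only a handful of voters would enter the list.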
D. Best Directors: Group the column using the director_name column.

Find out the top 10 directors for whom the mean of imdb_score is the highest and store them in a new
column top10director. In case of a tie in IMDb score between two directors, sort them alphabetically.

Your task: Find the best directors

Based on the data provided, it appears that Tony Kaye and Charles Chaplin are the best directors, with an
average IMDB score of 8.6 for their movies.
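
The grouping and tie-breaking rule from the task (highest mean imdb_score first, alphabetical order on ties) can be sketched as follows. The director names and scores are placeholders, and the function name top_directors is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (director_name, imdb_score) pairs.
SAMPLE = [
    ("Director B", 8.0),
    ("Director A", 8.0),
    ("Director C", 9.0),
    ("Director C", 7.0),
    ("Director D", 8.5),
]


def top_directors(movies, n=10):
    """Return up to n (director, mean score) pairs, best first, ties sorted by name."""
    scores = defaultdict(list)
    for director, score in movies:
        scores[director].append(score)
    averages = [(director, mean(values)) for director, values in scores.items()]
    # Negating the mean sorts it in descending order while names stay ascending.
    averages.sort(key=lambda item: (-item[1], item[0]))
    return averages[:n]


best = top_directors(SAMPLE)
for director, average in best:
    print(director, average)
```

One thing this sketch makes visible is that a director with a single highly rated movie can outrank a director with many movies, so it can be worth reporting the number of movies next to each mean.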
E. Popular Genres: Perform this step using the knowledge gained while performing previous steps.
Your task: Find popular genres

Based on the data provided, it appears that the Crime|Drama|Fantasy|Mystery genre has the highest
average IMDB score, indicating that it is the best-rated genre in this dataset.
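
The genre values in this dataset are pipe-separated combinations such as Crime|Drama|Fantasy|Mystery. The sketch below averages scores either per full combination, which matches the result reported above, or per individual genre after splitting on the pipe. The split option is an assumption added for comparison and is not something the original analysis states it did; the sample data and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (genres, imdb_score) pairs.
SAMPLE = [
    ("Crime|Drama", 9.0),
    ("Drama", 7.0),
    ("Drama", 8.0),
    ("Comedy", 6.0),
]


def average_score_by_genre(movies, split=False):
    """Return (genre, mean score, movie count) tuples, highest mean first.

    With split=False each full combination string is one group; with
    split=True a movie counts once for every genre it lists.
    """
    groups = defaultdict(list)
    for genres, score in movies:
        keys = genres.split("|") if split else [genres]
        for key in keys:
            groups[key].append(score)
    result = [(genre, mean(scores), len(scores)) for genre, scores in groups.items()]
    result.sort(key=lambda item: (-item[1], item[0]))
    return result


by_combination = average_score_by_genre(SAMPLE)
by_single_genre = average_score_by_genre(SAMPLE, split=True)
print(by_combination)
print(by_single_genre)
```

Returning the movie count alongside the mean helps when reading the ranking, since a combination that appears on only one or two movies can top the list on the strength of a single title.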
F. Charts: Create three new columns namely, Meryl_Streep, Leo_Caprio, and Brad_Pitt which contain the movies
in which the actors: 'Meryl Streep', 'Leonardo DiCaprio', and 'Brad Pitt' are the lead actors. Use only
the actor_1_name column for extraction. Also, make sure that you use the names 'Meryl Streep', 'Leonardo
DiCaprio', and 'Brad Pitt' for the said extraction.

Append the rows of all these columns and store them in a new column named Combined.

Group the combined column using the actor_1_name column.

Find the mean of the num_critic_for_reviews and num_users_for_review and identify the actors which have the
highest mean.

Observe the change in number of voted users over decades using a bar chart. Create a column
called decade which represents the decade to which every movie belongs. For example, the title_year values
1923 and 1925 should be stored as 1920s. Sort the data based on the column decade, group it by decade and find
the sum of users voted in each decade. Store this in a new data frame called df_by_decade.

Your task: Find the critic-favorite and audience-favorite actors


Based on the data provided, it appears that, of the three actors compared, Leonardo DiCaprio is both the
audience favorite and the critic favorite.
Johnny Depp is also highly popular among both critics and audiences.
During the 2000s, a high number of users voted for movies.
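
The two calculations in this task, mean review counts per lead actor and total votes per decade, can be sketched in plain Python as below. All numbers are invented for illustration, the function names are hypothetical, and the result is an ordinary dictionary rather than the df_by_decade data frame named in the task.

```python
from collections import defaultdict
from statistics import mean

LEADS = {"Meryl Streep", "Leonardo DiCaprio", "Brad Pitt"}

# Hypothetical (actor_1_name, critic review count, user review count) rows.
REVIEWS = [
    ("Leonardo DiCaprio", 400, 1000),
    ("Leonardo DiCaprio", 600, 2000),
    ("Brad Pitt", 300, 900),
    ("Meryl Streep", 200, 300),
    ("Johnny Depp", 500, 1200),  # not one of the three leads, so filtered out
]

# Hypothetical (title_year, num_voted_users) rows.
VOTES = [(1923, 100), (1925, 50), (2001, 900), (2009, 1100)]


def mean_reviews_by_actor(rows, leads):
    """Return {actor: (mean critic reviews, mean user reviews)} for the chosen leads."""
    grouped = defaultdict(list)
    for actor, critic, user in rows:
        if actor in leads:
            grouped[actor].append((critic, user))
    return {
        actor: (mean(c for c, _ in pairs), mean(u for _, u in pairs))
        for actor, pairs in grouped.items()
    }


def decade_label(year):
    """Map a year to its decade label, e.g. 1923 and 1925 both become '1920s'."""
    return f"{year // 10 * 10}s"


def votes_by_decade(rows):
    """Sum the number of voted users per decade, ordered by decade."""
    totals = defaultdict(int)
    for year, votes in rows:
        totals[decade_label(year)] += votes
    return dict(sorted(totals.items()))


actor_means = mean_reviews_by_actor(REVIEWS, LEADS)
critic_favorite = max(actor_means, key=lambda a: actor_means[a][0])
audience_favorite = max(actor_means, key=lambda a: actor_means[a][1])
decade_totals = votes_by_decade(VOTES)
print(critic_favorite, audience_favorite, decade_totals)
```

The decade totals map directly onto the bar chart the task asks for, with one bar per decade label.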
• It appears that movies like "Avatar" and "Jurassic Park" have the potential to earn high profits.
If the goal is to maximize profit, it may be advisable to consider making movies with similar
themes or characteristics.

• It appears that the movie "The Shawshank Redemption" has the highest IMDB score
among those with a minimum of 25,000 voted users.

• Of the top 250 IMDB movies, only 37 are not in the English
language. This suggests that English-language films dominate the top-rated movies in this dataset.

• Consider working with Tony Kaye or Charles Chaplin as a director on future projects, as their past
work has received high ratings from audiences and critics.

• It appears that the Crime|Drama|Fantasy|Mystery genre has the highest average IMDB score,
indicating that it is the best-rated genre in this dataset.

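• It appears that Leonardo DiCaprio is the audience favorite and critic favorite among the three actors
compared, and that Johnny Depp is also highly popular with both critics and audiences.
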
During this project, I discovered that a range of factors contribute to the success of a movie. I also
learned how to utilize various tools, such as SQL, Excel, and Power BI, to analyze and understand
data. By using these tools together, I gained a more comprehensive understanding of what makes
a movie successful. This project helped me to see the importance of considering multiple
variables and viewpoints when analyzing data.

Completing this project allowed me to improve my skills in crafting and executing queries, as well
as giving me a glimpse into the tasks and responsibilities of a data analyst in a professional
setting.
