
Example Project

This Python project analyzes whether Netflix has transitioned to a more family-friendly streaming platform by examining content age ratings, the addition of kids and family content, and duration trends of family content. The dataset used contains 8,807 entries and focuses on various aspects of Netflix's content library, including types, countries of filming, and age ratings. The report aims to uncover Netflix's content strategy in response to audience demands and viewing habits.

Uploaded by

nsba
Copyright
© All Rights Reserved

python-project-netflix-1

October 3, 2024

A PYTHON PROJECT

1 Has Netflix Transitioned to a More Family-Friendly Streaming Platform?

INTRODUCTION
Netflix, a leader in global streaming, has evolved significantly since its inception. This project
examines whether Netflix has become more family-friendly over time by analyzing three key aspects
of its content library:
• Added Content Age Ratings: Assessing Family-Friendliness Trends
• Additions of Kids & Family Content: Yearly Trends
• Duration Trends of Family Content
This report aims to uncover Netflix’s content strategy and determine if it has shifted towards
becoming more family-friendly in response to audience demands and viewing habits.
The dataset used for this project is the Netflix Movies and TV Shows Dataset.

2 I. DATA EXPLORATION
First, let’s import libraries to help with processing and visualizing the data.
[ ]: # Importing "pandas" library for reading the dataset and working with it.
import pandas as pd

# Importing "matplotlib.pyplot" library for visualization of the dataset.
import matplotlib.pyplot as plt

Now, we will import the “Netflix Movies and TV Shows” dataset using pandas and get an initial
overview.
[ ]: df = pd.read_csv('/content/drive/MyDrive/netflix_titles.csv')
df.shape

[ ]: (8807, 12)

This dataset has 8807 rows and 12 columns.
Now let’s display the first few rows of the dataset to get a sense of its structure.
[ ]: df.head()

[ ]: show_id type title director \


0 s1 Movie Dick Johnson Is Dead Kirsten Johnson
1 s2 TV Show Blood & Water NaN
2 s3 TV Show Ganglands Julien Leclercq
3 s4 TV Show Jailbirds New Orleans NaN
4 s5 TV Show Kota Factory NaN

cast country \
0 NaN United States
1 Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban… South Africa
2 Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi… NaN
3 NaN NaN
4 Mayur More, Jitendra Kumar, Ranjan Raj, Alam K… India

date_added release_year rating duration \


0 September 25, 2021 2020 PG-13 90 min
1 September 24, 2021 2021 TV-MA 2 Seasons
2 September 24, 2021 2021 TV-MA 1 Season
3 September 24, 2021 2021 TV-MA 1 Season
4 September 24, 2021 2021 TV-MA 2 Seasons

listed_in \
0 Documentaries
1 International TV Shows, TV Dramas, TV Mysteries
2 Crime TV Shows, International TV Shows, TV Act…
3 Docuseries, Reality TV
4 International TV Shows, Romantic TV Shows, TV …

description
0 As her father nears the end of his life, filmm…
1 After crossing paths at a party, a Cape Town t…
2 To protect his family from a powerful drug lor…
3 Feuds, flirtations and toilet talk go down amo…
4 In a city of coaching centers known to train I…

The dataset generally looks well-structured and mostly straightforward. We will only
look into columns that need further clarification.
First, let’s have a look at column ‘type’.
[ ]: df.type.value_counts()

[ ]: type
Movie 6131
TV Show 2676
Name: count, dtype: int64

Netflix categorizes its offerings into two main types: “TV Show” and “Movie”. It is
clear that the majority of Netflix content is movies (6,131 of 8,807 titles).
Let’s check the column ‘country’.
[ ]: df.country.value_counts()

[ ]: country
United States 2818
India 972
United Kingdom 419
Japan 245
South Korea 199

Romania, Bulgaria, Hungary 1
Uruguay, Guatemala 1
France, Senegal, Belgium 1
Mexico, United States, Spain, Colombia 1
United Arab Emirates, Jordan 1
Name: count, Length: 748, dtype: int64

The result shows 748 distinct values, far more than the roughly 200 countries that exist
in the world. The reason is that a single row can list several countries, and each distinct
combination (for example, “France, Senegal, Belgium”) is counted as its own value; inconsistent
country name formatting may inflate the count further. Because one title can be tied to several
countries, the column appears to describe filming or production locations rather than countries
of release. Therefore, we’ll rename “country” to “filming_countries” for clarity.
[ ]: #Rename the 'country' column into 'filming_countries'
df.rename(columns={'country': 'filming_countries'}, inplace=True)

[ ]: #New look of the dataframe


df.head()

[ ]: show_id type title director \


0 s1 Movie Dick Johnson Is Dead Kirsten Johnson
1 s2 TV Show Blood & Water NaN
2 s3 TV Show Ganglands Julien Leclercq
3 s4 TV Show Jailbirds New Orleans NaN
4 s5 TV Show Kota Factory NaN

cast filming_countries \
0 NaN United States
1 Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban… South Africa

2 Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi… NaN
3 NaN NaN
4 Mayur More, Jitendra Kumar, Ranjan Raj, Alam K… India

date_added release_year rating duration \


0 September 25, 2021 2020 PG-13 90 min
1 September 24, 2021 2021 TV-MA 2 Seasons
2 September 24, 2021 2021 TV-MA 1 Season
3 September 24, 2021 2021 TV-MA 1 Season
4 September 24, 2021 2021 TV-MA 2 Seasons

listed_in \
0 Documentaries
1 International TV Shows, TV Dramas, TV Mysteries
2 Crime TV Shows, International TV Shows, TV Act…
3 Docuseries, Reality TV
4 International TV Shows, Romantic TV Shows, TV …

description
0 As her father nears the end of his life, filmm…
1 After crossing paths at a party, a Cape Town t…
2 To protect his family from a powerful drug lor…
3 Feuds, flirtations and toilet talk go down amo…
4 In a city of coaching centers known to train I…

Now let’s see the distribution of Netflix content based on countries involved in production.
[ ]: # Distribution of content by countries involved in filming process.
df['filming_countries'].value_counts(normalize=True) * 100

[ ]: filming_countries
United States 35.330993
India 12.186560
United Kingdom 5.253260
Japan 3.071715
South Korea 2.494985

Romania, Bulgaria, Hungary 0.012538
Uruguay, Guatemala 0.012538
France, Senegal, Belgium 0.012538
Mexico, United States, Spain, Colombia 0.012538
United Arab Emirates, Jordan 0.012538
Name: proportion, Length: 748, dtype: float64

Titles attributed solely to the United States make up more than 35% of the entries that
list a country (value_counts ignores missing values by default), making it the most
dominant source in the platform’s overall catalog.
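The percentages above treat every combination of countries as a separate value, so a title listed under "United States, Canada" is not counted toward the United States. If we want to know in how many titles each country is involved, one option is to split the entries first. The following is a minimal sketch on a small toy DataFrame (the values are illustrative, not taken from the dataset), assuming the countries in a row are separated by commas:

```python
import pandas as pd

# Toy stand-in for the 'filming_countries' column (illustrative values only).
toy = pd.DataFrame({
    "filming_countries": [
        "United States",
        "United States, Canada",
        "India",
        None,
        "France, United States",
    ]
})

# Split each entry on commas, give every country its own row,
# and strip the spaces left over after the split.
countries = (
    toy["filming_countries"]
    .dropna()
    .str.split(",")
    .explode()
    .str.strip()
)

# Number of titles in which each country appears.
country_counts = countries.value_counts()

# Share of titles with a known country that involve each country.
known_titles = toy["filming_countries"].notna().sum()
country_share = country_counts / known_titles * 100
print(country_share)
```

Because one title can involve several countries, these shares can add up to more than 100%.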
Since “release_year” is temporal data, we’ll use matplotlib.pyplot to visualize its
distribution rather than using the describe() function.

[ ]: # Distribution of content by release year


plt.figure(figsize=(10,6))
df['release_year'].hist(bins=30)
plt.title('Distribution of Content by Release Year')
plt.xlabel('Release Year')
plt.ylabel('Count')
plt.show()

Netflix’s content is predominantly from recent years: the histogram peaks in the bins
closest to 2020. Content from before 2000 is minimal.
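A histogram with 30 bins groups several release years into each bar, so it cannot tell us which single year has the most titles. To read exact numbers, we could count titles per year instead. This is a minimal sketch on toy data (the years below are made up for illustration); on the real DataFrame the same calls would be applied to df['release_year']:

```python
import pandas as pd

# Toy stand-in for the 'release_year' column (illustrative values only).
toy = pd.DataFrame({"release_year": [2018, 2018, 2019, 2020, 2020, 2020, 1995]})

# Exact number of titles per release year, in chronological order.
titles_per_year = toy["release_year"].value_counts().sort_index()

# Year with the highest number of titles.
peak_year = titles_per_year.idxmax()

print(titles_per_year)
print("Peak year:", peak_year)
```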
Let’s have a look at the “rating” column.
[ ]: df.rating.value_counts()

[ ]: rating
TV-MA 3207
TV-14 2160
TV-PG 863
R 799
PG-13 490
TV-Y7 334
TV-Y 307

PG 287
TV-G 220
NR 80
G 41
TV-Y7-FV 6
NC-17 3
UR 3
74 min 1
84 min 1
66 min 1
Name: count, dtype: int64

The ‘rating’ column indicates the age classification of the content (e.g., PG, R, or TV-MA)
rather than its likability or star rating. Therefore, we will rename the “rating”
column to “age_rating.”
It’s important to note that there are 3 entry errors that appear to belong to “duration”
(“74 min,” “84 min,” and “66 min”), but this does not affect the fact that the
column still represents age ratings.
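Instead of spotting such entries by eye in the value counts, we could flag them with a pattern. This is a minimal sketch on toy data (values are illustrative), assuming misplaced durations always look like a number followed by "min":

```python
import pandas as pd

# Toy stand-in for the 'rating' and 'duration' columns (illustrative values only).
toy = pd.DataFrame({
    "rating": ["TV-MA", "74 min", "PG", None, "84 min"],
    "duration": ["2 Seasons", None, "90 min", "1 Season", None],
})

# True where the whole rating value has the form "<digits> min".
looks_like_duration = toy["rating"].str.fullmatch(r"\d+ min", na=False)

# Rows whose rating is really a duration.
misplaced = toy[looks_like_duration]
print(misplaced)
```

Inspecting the flagged rows next to their 'duration' values helps decide whether to drop them or move the value to the right column.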
[ ]: #Rename the 'rating' column into 'age_rating'
df.rename(columns={'rating': 'age_rating'}, inplace=True)

[ ]: #New look of the dataframe


df.head()

[ ]: show_id type title director \


0 s1 Movie Dick Johnson Is Dead Kirsten Johnson
1 s2 TV Show Blood & Water NaN
2 s3 TV Show Ganglands Julien Leclercq
3 s4 TV Show Jailbirds New Orleans NaN
4 s5 TV Show Kota Factory NaN

cast filming_countries \
0 NaN United States
1 Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban… South Africa
2 Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi… NaN
3 NaN NaN
4 Mayur More, Jitendra Kumar, Ranjan Raj, Alam K… India

date_added release_year age_rating duration \


0 September 25, 2021 2020 PG-13 90 min
1 September 24, 2021 2021 TV-MA 2 Seasons
2 September 24, 2021 2021 TV-MA 1 Season
3 September 24, 2021 2021 TV-MA 1 Season
4 September 24, 2021 2021 TV-MA 2 Seasons

listed_in \

0 Documentaries
1 International TV Shows, TV Dramas, TV Mysteries
2 Crime TV Shows, International TV Shows, TV Act…
3 Docuseries, Reality TV
4 International TV Shows, Romantic TV Shows, TV …

description
0 As her father nears the end of his life, filmm…
1 After crossing paths at a party, a Cape Town t…
2 To protect his family from a powerful drug lor…
3 Feuds, flirtations and toilet talk go down amo…
4 In a city of coaching centers known to train I…

Now we move on to look at the dataset information.


[ ]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 show_id 8807 non-null object
1 type 8807 non-null object
2 title 8807 non-null object
3 director 6173 non-null object
4 cast 7982 non-null object
5 filming_countries 7976 non-null object
6 date_added 8797 non-null object
7 release_year 8807 non-null int64
8 age_rating 8803 non-null object
9 duration 8804 non-null object
10 listed_in 8807 non-null object
11 description 8807 non-null object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
Six of the 12 columns have missing values: their non-null counts are lower than
the total number of entries (8,807).
Let’s have a closer look at the number of null values each column has.
[ ]: #Count missing values for each column
df.isnull().sum()

[ ]: show_id 0
type 0
title 0
director 2634

cast 825
filming_countries 831
date_added 10
release_year 0
age_rating 4
duration 3
listed_in 0
description 0
dtype: int64

[ ]: #Visualize the missing value counts


plt.figure(figsize=(25, 10))
df.isnull().sum().plot.barh()
plt.show()

As shown in the chart, the director column has the most missing values, followed
by filming_countries and cast. The duration, age_rating, and date_added
columns also have missing values, but the number of missing values is minimal compared
to the overall size.

3 II. DATA PREPARATION


3.1 1. Check for missing values
In the exploration step, we found quite a few missing values across our dataset. Now
let’s investigate the columns that contain them.
[ ]: #Total missing values for each column of the dataset
missing_values = df.isnull().sum()

#Show only columns with missing values


columns_with_missing = missing_values[missing_values > 0]

columns_with_missing

[ ]: director 2634
cast 825
filming_countries 831
date_added 10
age_rating 4
duration 3
dtype: int64

Let’s look at these missing values numerically.


[ ]: # Total number of entries (rows)
total_entries = len(df)

# Count of missing values per column


missing_counts = df.isnull().sum()

# Percentage of missing values per column


missing_percentages = (missing_counts / total_entries) * 100

The result is:
[ ]: missing_percentages

[ ]: show_id 0.000000
type 0.000000
title 0.000000
director 29.908028
cast 9.367549
filming_countries 9.435676
date_added 0.113546
release_year 0.000000
age_rating 0.045418
duration 0.034064
listed_in 0.000000
description 0.000000
dtype: float64

[ ]: #Visualize the missing values in percentages


plt.figure(figsize=(25,10))
missing_percentages.plot.barh()
plt.show()

Approximately 30% of the director column contains missing values, while the cast and
filming_countries columns have missing percentages of around 10%. The date_added,
age_rating, and duration columns each have missing values of less than 1%.
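The same percentages can be obtained in a single step, because the mean of a True/False column equals the share of True values. A minimal sketch on toy data (column contents are illustrative):

```python
import pandas as pd

# Toy DataFrame with a few missing values (illustrative values only).
toy = pd.DataFrame({
    "director": ["A", None, None, "B"],
    "cast": ["X", "Y", None, "Z"],
    "title": ["t1", "t2", "t3", "t4"],
})

# isnull() gives True/False per cell; the column mean is the missing share.
missing_pct = toy.isnull().mean() * 100

# Keep only columns that have missing values, largest share first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```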

3.1.1 1.1. Column ‘director’


Let’s investigate why a significant 30% of the values are missing from the director column. We’ll
start by taking an overall look at the rows where the director column has null values.
[ ]: df[df['director'].isnull()].head()

[ ]: show_id type title director \


1 s2 TV Show Blood & Water NaN
3 s4 TV Show Jailbirds New Orleans NaN
4 s5 TV Show Kota Factory NaN
10 s11 TV Show Vendetta: Truth, Lies and The Mafia NaN
14 s15 TV Show Crime Stories: India Detectives NaN

cast filming_countries \
1 Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban… South Africa
3 NaN NaN
4 Mayur More, Jitendra Kumar, Ranjan Raj, Alam K… India
10 NaN NaN
14 NaN NaN

date_added release_year age_rating duration \


1 September 24, 2021 2021 TV-MA 2 Seasons
3 September 24, 2021 2021 TV-MA 1 Season
4 September 24, 2021 2021 TV-MA 2 Seasons
10 September 24, 2021 2021 TV-MA 1 Season
14 September 22, 2021 2021 TV-MA 1 Season

listed_in \

1 International TV Shows, TV Dramas, TV Mysteries
3 Docuseries, Reality TV
4 International TV Shows, Romantic TV Shows, TV …
10 Crime TV Shows, Docuseries, International TV S…
14 British TV Shows, Crime TV Shows, Docuseries

description
1 After crossing paths at a party, a Cape Town t…
3 Feuds, flirtations and toilet talk go down amo…
4 In a city of coaching centers known to train I…
10 Sicily boasts a bold "Anti-Mafia" coalition. B…
14 Cameras following Bengaluru police on the job …

[ ]: df[df['director'].isnull()].tail()

[ ]: show_id type title director \


8795 s8796 TV Show Yu-Gi-Oh! Arc-V NaN
8796 s8797 TV Show Yunus Emre NaN
8797 s8798 TV Show Zak Storm NaN
8800 s8801 TV Show Zindagi Gulzar Hai NaN
8803 s8804 TV Show Zombie Dumb NaN

cast \
8795 Mike Liscio, Emily Bauer, Billy Bob Thompson, …
8796 Gökhan Atalay, Payidar Tüfekçioglu, Baran Akbu…
8797 Michael Johnston, Jessica Gee-George, Christin…
8800 Sanam Saeed, Fawad Khan, Ayesha Omer, Mehreen …
8803 NaN

filming_countries date_added \
8795 Japan, Canada May 1, 2018
8796 Turkey January 17, 2017
8797 United States, France, South Korea, Indonesia September 13, 2018
8800 Pakistan December 15, 2016
8803 NaN July 1, 2019

release_year age_rating duration \


8795 2015 TV-Y7 2 Seasons
8796 2016 TV-PG 2 Seasons
8797 2016 TV-Y7 3 Seasons
8800 2012 TV-PG 1 Season
8803 2018 TV-Y7 2 Seasons

listed_in \
8795 Anime Series, Kids' TV
8796 International TV Shows, TV Dramas
8797 Kids' TV

8800 International TV Shows, Romantic TV Shows, TV …
8803 Kids' TV, Korean TV Shows, TV Comedies

description
8795 Now that he's discovered the Pendulum Summonin…
8796 During the Mongol invasions, Yunus Emre leaves…
8797 Teen surfer Zak Storm is mysteriously transpor…
8800 Strong-willed, middle-class Kashaf and carefre…
8803 While living alone in a spooky town, a young g…

There is a pattern where the ‘type’ column shows ‘TV Show’ when the ‘director’ column
is null.
Let’s explore if there are any other cases.
[ ]: # Total rows where 'director' is null
director_null = df[df['director'].isnull()]

# Count the unique values in the 'type' column when 'director' is null
type_when_null = director_null['type'].value_counts()
type_when_null

[ ]: type
TV Show 2446
Movie 188
Name: count, dtype: int64

Apparently, aside from ‘TV Show’ entries, 188 rows of type ‘Movie’ also have no
director listed.
Let’s check how many entries in total there are for ‘TV Show’ and ‘Movie’ type.
[ ]: df['type'].value_counts()

[ ]: type
Movie 6131
TV Show 2676
Name: count, dtype: int64

For ‘TV Show’ content, it’s common for the director column to be left blank: 2,446
out of 2,676 entries (about 91%) list no director. This frequent absence is plausible
because TV shows, particularly reality shows, are often unscripted and may not have a
designated director listed.
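The two value counts can also be combined into one step that reports the share of missing directors per content type. From the counts in this report, that share is about 91.4% for TV shows (2,446 of 2,676) and about 3.1% for movies (188 of 6,131). A minimal sketch on toy data (values are illustrative):

```python
import pandas as pd

# Toy stand-in for the 'type' and 'director' columns (illustrative values only).
toy = pd.DataFrame({
    "type": ["TV Show", "TV Show", "Movie", "Movie", "Movie", "TV Show"],
    "director": [None, None, "A", None, "B", "C"],
})

# Share of rows with a missing director within each content type, in percent.
missing_by_type = toy["director"].isnull().groupby(toy["type"]).mean() * 100
print(missing_by_type.round(2))
```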
Let’s dig deeper into the 188 movies that are missing directors.
[ ]: df[(df['director'].isnull()) & (df['type'] == 'Movie')].head()

[ ]: show_id type title director \
404 s405 Movie 9to5: The Story of a Movement NaN
470 s471 Movie Bridgerton - The Afterparty NaN
483 s484 Movie Last Summer NaN
641 s642 Movie Sisters on Track NaN
717 s718 Movie Headspace: Unwind Your Mind NaN

cast filming_countries \
404 NaN NaN
470 David Spade, London Hughes, Fortune Feimster NaN
483 Fatih Şahin, Ece Çeşmioğlu, Halit Özgür Sarı, … NaN
641 NaN NaN
717 Andy Puddicombe, Evelyn Lewis Prieto, Ginger D… NaN

date_added release_year age_rating duration \


404 July 22, 2021 2021 TV-MA 85 min
470 July 13, 2021 2021 TV-14 39 min
483 July 9, 2021 2021 TV-MA 102 min
641 June 24, 2021 2021 PG 97 min
717 June 15, 2021 2021 TV-G 273 min

listed_in \
404 Documentaries
470 Movies
483 Dramas, International Movies, Romantic Movies
641 Documentaries, Sports Movies
717 Documentaries

description
404 In this documentary, female office workers in …
470 "Bridgerton" cast members share behind-the-sce…
483 During summer vacation in a beachside town, 16…
641 Three track star sisters face obstacles in lif…
717 Do you want to relax, meditate or sleep deeply…

It looks like movies without directors are often documentaries (three of the five rows
shown above). Since documentaries are usually unscripted, that may be why directors
aren’t always listed for them.
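Five rows are a small sample, so it may be worth measuring how common documentaries are among all movies without a director. A minimal sketch on toy data (values are illustrative), assuming the genre label "Documentaries" in 'listed_in' identifies a documentary:

```python
import pandas as pd

# Toy stand-in for the relevant columns (illustrative values only).
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "Movie"],
    "director": [None, None, None, None, "A"],
    "listed_in": [
        "Documentaries",
        "Movies",
        "Documentaries, Sports Movies",
        "Docuseries",
        "Dramas",
    ],
})

# Movies with no director listed.
no_director_movies = toy[toy["director"].isnull() & (toy["type"] == "Movie")]

# Share of those movies whose genres include "Documentaries", in percent.
is_documentary = no_director_movies["listed_in"].str.contains("Documentaries", na=False)
documentary_share = is_documentary.mean() * 100
print(round(documentary_share, 2))
```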

3.1.2 1.2. Column ‘cast’


Let’s pull up some rows with missing values in the ‘cast’ column to look for any pattern.
[ ]: df[df['cast'].isnull()].head()

[ ]: show_id type title \
0 s1 Movie Dick Johnson Is Dead
3 s4 TV Show Jailbirds New Orleans
10 s11 TV Show Vendetta: Truth, Lies and The Mafia
14 s15 TV Show Crime Stories: India Detectives
16 s17 Movie Europe's Most Dangerous Man: Otto Skorzeny in …

director cast filming_countries \


0 Kirsten Johnson NaN United States
3 NaN NaN NaN
10 NaN NaN NaN
14 NaN NaN NaN
16 Pedro de Echave García, Pablo Azorín Williams NaN NaN

date_added release_year age_rating duration \


0 September 25, 2021 2020 PG-13 90 min
3 September 24, 2021 2021 TV-MA 1 Season
10 September 24, 2021 2021 TV-MA 1 Season
14 September 22, 2021 2021 TV-MA 1 Season
16 September 22, 2021 2020 TV-MA 67 min

listed_in \
0 Documentaries
3 Docuseries, Reality TV
10 Crime TV Shows, Docuseries, International TV S…
14 British TV Shows, Crime TV Shows, Docuseries
16 Documentaries, International Movies

description
0 As her father nears the end of his life, filmm…
3 Feuds, flirtations and toilet talk go down amo…
10 Sicily boasts a bold "Anti-Mafia" coalition. B…
14 Cameras following Bengaluru police on the job …
16 Declassified documents reveal the post-WWII li…

[ ]: df[df['cast'].isnull()].tail()

[ ]: show_id type title director cast \


8746 s8747 Movie Winnie Pascale Lamche NaN
8755 s8756 TV Show Women Behind Bars NaN NaN
8756 s8757 Movie Woodstock Barak Goodman NaN
8763 s8764 Movie WWII: Report from the Aleutians John Huston NaN
8803 s8804 TV Show Zombie Dumb NaN NaN

filming_countries date_added \
8746 France, Netherlands, South Africa, Finland February 26, 2018
8755 United States November 1, 2016

8756 United States August 13, 2019
8763 United States March 31, 2017
8803 NaN July 1, 2019

release_year age_rating duration \


8746 2017 TV-14 85 min
8755 2010 TV-14 3 Seasons
8756 2019 TV-MA 97 min
8763 1943 TV-PG 45 min
8803 2018 TV-Y7 2 Seasons

listed_in \
8746 Documentaries, International Movies
8755 Crime TV Shows, Docuseries
8756 Documentaries, Music & Musicals
8763 Documentaries
8803 Kids' TV, Korean TV Shows, TV Comedies

description
8746 Winnie Mandela speaks about her extraordinary …
8755 This reality series recounts true stories of w…
8756 For the 50th anniversary of the legendary Wood…
8763 Filmmaker John Huston narrates this Oscar-nomi…
8803 While living alone in a spooky town, a young g…

In the rows sampled above, the ‘cast’ column is empty mostly for documentaries and
docuseries (nine of the ten rows shown). This makes sense because these types of content often
feature regular people rather than professional actors, so the ‘cast’ column is typically left blank.

3.1.3 1.3. Column ‘filming_countries’


About 9.4% of the values in this column are missing, and we lack sufficient knowledge to fill in
the gaps. The missing entries are most likely due to incomplete data submission by producers or
decisions made during distribution, where filming locations were not disclosed. Since this column
will not be used in our analysis, it is simplest to remove the entire column; its absence will not
impact the overall analysis.
[ ]: df = df.drop(columns=['filming_countries'])

[ ]: #Dataset without the 'filming_countries' column


df

[ ]: show_id type title director \


0 s1 Movie Dick Johnson Is Dead Kirsten Johnson
1 s2 TV Show Blood & Water NaN
2 s3 TV Show Ganglands Julien Leclercq
3 s4 TV Show Jailbirds New Orleans NaN
4 s5 TV Show Kota Factory NaN

… … … … …
8802 s8803 Movie Zodiac David Fincher
8803 s8804 TV Show Zombie Dumb NaN
8804 s8805 Movie Zombieland Ruben Fleischer
8805 s8806 Movie Zoom Peter Hewitt
8806 s8807 Movie Zubaan Mozez Singh

cast date_added \
0 NaN September 25, 2021
1 Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban… September 24, 2021
2 Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi… September 24, 2021
3 NaN September 24, 2021
4 Mayur More, Jitendra Kumar, Ranjan Raj, Alam K… September 24, 2021
… … …
8802 Mark Ruffalo, Jake Gyllenhaal, Robert Downey J… November 20, 2019
8803 NaN July 1, 2019
8804 Jesse Eisenberg, Woody Harrelson, Emma Stone, … November 1, 2019
8805 Tim Allen, Courteney Cox, Chevy Chase, Kate Ma… January 11, 2020
8806 Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan… March 2, 2019

release_year age_rating duration \


0 2020 PG-13 90 min
1 2021 TV-MA 2 Seasons
2 2021 TV-MA 1 Season
3 2021 TV-MA 1 Season
4 2021 TV-MA 2 Seasons
… … … …
8802 2007 R 158 min
8803 2018 TV-Y7 2 Seasons
8804 2009 R 88 min
8805 2006 PG 88 min
8806 2015 TV-14 111 min

listed_in \
0 Documentaries
1 International TV Shows, TV Dramas, TV Mysteries
2 Crime TV Shows, International TV Shows, TV Act…
3 Docuseries, Reality TV
4 International TV Shows, Romantic TV Shows, TV …
… …
8802 Cult Movies, Dramas, Thrillers
8803 Kids' TV, Korean TV Shows, TV Comedies
8804 Comedies, Horror Movies
8805 Children & Family Movies, Comedies
8806 Dramas, International Movies, Music & Musicals

description

0 As her father nears the end of his life, filmm…
1 After crossing paths at a party, a Cape Town t…
2 To protect his family from a powerful drug lor…
3 Feuds, flirtations and toilet talk go down amo…
4 In a city of coaching centers known to train I…
… …
8802 A political cartoonist, a crime reporter and a…
8803 While living alone in a spooky town, a young g…
8804 Looking to survive in a world taken over by zo…
8805 Dragged from civilian life, a former superhero…
8806 A scrappy but poor boy worms his way into a ty…

[8807 rows x 11 columns]

3.1.4 1.4. Columns ‘date_added’, ‘age_rating’, and ‘duration’


Let’s handle the last 3 columns with missing values.
[ ]: df[['date_added', 'age_rating', 'duration']].isnull().sum()

[ ]: date_added 10
age_rating 4
duration 3
dtype: int64

The columns date_added, age_rating, and duration have 10, 4, and 3 missing values,
respectively, out of 8,807 entries each. These represent very small percentages of the
total data: approximately 0.11%, 0.05%, and 0.03%. Since we don’t have external
data to accurately fill these gaps, we’ll leave them as missing, as they are unlikely to
significantly impact the analysis.

3.2 2. Check for duplicates


Let’s check if our dataset has any duplicated records.
[ ]: df.duplicated().sum()

[ ]: 0

The result is 0, which means there are no duplicate entries. Therefore, no further action
is needed.
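Note that duplicated() compares entire rows. Since show_id looks like a unique identifier for every row, two entries for the same title would still differ in that column and would not be flagged. A minimal sketch on toy data of a stricter check, assuming (our assumption, not something the dataset guarantees) that type, title, and release_year together identify a title:

```python
import pandas as pd

# Toy rows: the first two describe the same movie under different ids.
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "type": ["Movie", "Movie", "TV Show"],
    "title": ["Zoom", "Zoom", "Zoom"],
    "release_year": [2006, 2006, 2006],
})

# Full-row comparison: the unique ids hide the repeated movie.
full_row_duplicates = toy.duplicated().sum()

# Comparison on the columns assumed to identify a title.
content_duplicates = toy.duplicated(subset=["type", "title", "release_year"]).sum()

print(full_row_duplicates, content_duplicates)
```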

4 III. ANALYSIS
In this section, we will examine whether Netflix is becoming more family-friendly by analyzing three
aspects: the trends in age ratings of newly added content, the number of kid-friendly titles added
each year, and the durations of kids’ content.

4.1 1. Added Content Age Ratings: Assessing Family-Friendliness Trends
Age ratings indicate content suitability for different audiences and can reveal if Netflix is becoming
more family-oriented.
In this section, we will examine the distribution of Netflix’s content based on age ratings to identify
any increases in the number of additions that are classified as more kid-friendly.
[ ]: df['age_rating'].value_counts()

[ ]: age_rating
TV-MA 3207
TV-14 2160
TV-PG 863
R 799
PG-13 490
TV-Y7 334
TV-Y 307
PG 287
TV-G 220
NR 80
G 41
TV-Y7-FV 6
NC-17 3
UR 3
74 min 1
84 min 1
66 min 1
Name: count, dtype: int64

We notice that there are 3 entries with invalid values in the age_rating column, where
the values mistakenly represent durations instead.
To continue the analysis, we will need to remove these three rows. This action is unlikely to have
a significant impact on our results, as the removed rows represent a very small fraction
of the total dataset.
[ ]: #Create a list of unwanted values
list1 = ['74 min','84 min', '66 min']

#Create new dataframe without the unwanted values
#(.copy() makes it an independent DataFrame, so new columns can be added safely later)
df = df[~df.age_rating.isin(list1)].copy()

#Total count of each unique value of the 'age_rating' column from new dataframe
new_rating = df['age_rating'].value_counts()
new_rating

[ ]: age_rating
TV-MA 3207

TV-14 2160
TV-PG 863
R 799
PG-13 490
TV-Y7 334
TV-Y 307
PG 287
TV-G 220
NR 80
G 41
TV-Y7-FV 6
NC-17 3
UR 3
Name: count, dtype: int64

Let’s look at it numerically and visualize the numbers.


[ ]: #Calculate the percentage for each rating type.
new_rating_percentages = new_rating / len(df) * 100
new_rating_percentages

[ ]: age_rating
TV-MA 36.426624
TV-14 24.534303
TV-PG 9.802363
R 9.075420
PG-13 5.565652
TV-Y7 3.793730
TV-Y 3.487051
PG 3.259882
TV-G 2.498864
NR 0.908678
G 0.465697
TV-Y7-FV 0.068151
NC-17 0.034075
UR 0.034075
Name: count, dtype: float64

[ ]: new_rating_percentages.plot.barh()
plt.show()

Netflix’s content is predominantly categorized as TV-MA (Mature Audiences) and TV-14
(Parents Strongly Cautioned), comprising 36.4% and 24.5% respectively, together
accounting for over 60% of the total content. TV-PG (Parental Guidance Suggested)
and R (Restricted: under 17 requires accompanying parent or adult guardian) ratings
follow, each making up just under 10%. PG-13 accounts for about 5.6%, and every other
rating type is below 4%.
Currently, over 70% of Netflix’s content is classified as R, TV-MA, and TV-14, emphasizing
a dominant focus on mature, complex, or intense content. This indicates that
the platform is primarily geared toward an audience seeking more adult-oriented content,
including both adults and older teenagers, with less emphasis on family-friendly
or younger content.
Let’s analyze the yearly trend in the number of titles Netflix adds, categorized by age rating.
[ ]: # DataFrame to analyze yearly trends of content added to Netflix, categorized by age rating.
df[['age_rating','date_added']]

[ ]: age_rating date_added
0 PG-13 September 25, 2021
1 TV-MA September 24, 2021
2 TV-MA September 24, 2021
3 TV-MA September 24, 2021
4 TV-MA September 24, 2021

… … …
8802 R November 20, 2019
8803 TV-Y7 July 1, 2019
8804 R November 1, 2019
8805 PG January 11, 2020
8806 TV-14 March 2, 2019

[8804 rows x 2 columns]

Since our ‘date_added’ column is of the object datatype, we need to convert it into a
datetime format to analyze trends effectively. We will then keep only the year portion
in a new column called ‘year_added’.
[ ]: # Convert the 'date_added' to datetime type

df['date_added'] = pd.to_datetime(df['date_added'])

# Extract just the year


df['year_added'] = df['date_added'].dt.year

# Display the result


df

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-39-87c447b46333> in <cell line: 3>()
1 # Convert the 'date_added' to datetime type
2
----> 3 df['date_added'] = pd.to_datetime(df['date_added'])
4
5 # Extract just the year

/usr/local/lib/python3.10/dist-packages/pandas/core/tools/datetimes.py in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
1106 result = arg.tz_localize("utc")
1107 elif isinstance(arg, ABCSeries):
-> 1108 cache_array = _maybe_cache(arg, format, cache, convert_listlike)
1109 if not cache_array.empty:
1110 result = arg.map(cache_array)

/usr/local/lib/python3.10/dist-packages/pandas/core/tools/datetimes.py in _maybe_cache(arg, format, cache, convert_listlike)
252 unique_dates = unique(arg)
253 if len(unique_dates) < len(arg):
--> 254 cache_dates = convert_listlike(unique_dates, format)
255 # GH#45319
256 try:

/usr/local/lib/python3.10/dist-packages/pandas/core/tools/datetimes.py in _convert_listlike_datetimes(arg, format, name, utc, unit, errors, dayfirst, yearfirst, exact)
486 # `format` could be inferred, or user didn't ask for mixed-format parsing.
487 if format is not None and format != "mixed":
--> 488 return _array_strptime_with_fallback(arg, name, utc, format, exact, errors)
489
490 result, tz_parsed = objects_to_datetime64ns(

/usr/local/lib/python3.10/dist-packages/pandas/core/tools/datetimes.py in _array_strptime_with_fallback(arg, name, utc, fmt, exact, errors)
517 Call array_strptime, with fallback behavior depending on 'errors'.
518 """
--> 519 result, timezones = array_strptime(arg, fmt, exact=exact, errors=errors, utc=utc)
520 if any(tz is not None for tz in timezones):
521 return _return_parsed_timezone_results(result, timezones, utc, name)

strptime.pyx in pandas._libs.tslibs.strptime.array_strptime()

strptime.pyx in pandas._libs.tslibs.strptime.array_strptime()

ValueError: time data " August 4, 2017" doesn't match format "%B %d, %Y", at position 1441. You might want to try:
- passing `format` if your strings have a consistent format;
- passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
- passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.

There’s an error due to inconsistent formatting, specifically whitespace at the beginning
of some values in the ‘date_added’ column (for example, " August 4, 2017"). To ensure
consistency across all entries, we’ll remove any leading or trailing whitespace.
[ ]: #Removing the whitespace
df['date_added'] = df['date_added'].str.strip()

[ ]: #Converting the 'date_added' column to datetime format again.


df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')
df['year_added'] = df['date_added'].dt.year

df
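Because some dates are missing, the extracted years are stored as floating-point numbers (for example 2021.0), since plain integer columns cannot hold missing values. A minimal sketch on toy data of an alternative, assuming every valid date follows the "%B %d, %Y" pattern quoted in the error message; the nullable "Int64" dtype is our choice here, not something the original notebook uses:

```python
import pandas as pd

# Toy stand-in for 'date_added': one clean value, one with a leading space, one missing.
toy = pd.DataFrame({"date_added": ["September 25, 2021", " August 4, 2017", None]})

# Remove surrounding whitespace, then parse with an explicit format.
# Values that still do not match become NaT instead of raising an error.
cleaned = toy["date_added"].str.strip()
toy["date_added"] = pd.to_datetime(cleaned, format="%B %d, %Y", errors="coerce")

# Nullable integer dtype keeps years as whole numbers and missing years as <NA>.
toy["year_added"] = toy["date_added"].dt.year.astype("Int64")
print(toy)
```

Counting the NaT values after parsing (toy["date_added"].isna().sum()) shows how many rows were missing or could not be parsed, which is a useful check when errors='coerce' is used.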

Now we create a chart that shows the trend of shows added to Netflix over the years, categorized
by age rating.
[ ]: import pandas as pd
import matplotlib.pyplot as plt

# Group by 'year_added' and 'age_rating', then count the shows
grouped_data = df.groupby(['year_added', 'age_rating'])['show_id'].count().unstack(fill_value=0)

# Create the line plot, drawing one line per age rating
plt.figure(figsize=(12, 6))
for rating in grouped_data.columns:
    plt.plot(grouped_data.index, grouped_data[rating], label=rating, marker='o')

plt.title('Number of Shows by Rating Over Years')
plt.xlabel('Year')
plt.ylabel('Number of Shows')
plt.legend(title='Rating', loc='upper left', bbox_to_anchor=(1, 1))
plt.tight_layout()
plt.grid(True, linestyle='--', alpha=0.7)

# Show the plot
plt.show()

The chart shows that while TV-MA and TV-14 titles remain dominant, additions with
these ratings have declined sharply since 2019. Additions rated TV-PG and G have
also been decreasing. Conversely, R, TV-Y7, and PG-13 rated content has seen steady
growth.
Conclusion
This trend suggests that Netflix is catering to a wider range of age groups, from children
and teenagers to adults, rather than focusing exclusively on mature audiences. By
adding more TV-Y7 and PG-13 rated titles, Netflix is balancing its content to appeal to
younger viewers and families while still providing R-rated content for mature audiences.
The significant decline in TV-MA and TV-14 additions indicates a shift away from
highly mature or teenage-oriented content, potentially making Netflix a more
family-friendly streaming platform.
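Raw counts per rating also depend on how many titles were added overall in a given year. As a complementary view, the sketch below converts counts into each rating's share of that year's additions. It runs on a toy DataFrame whose values are invented for illustration; only the column names `year_added` and `age_rating` come from the notebook.

```python
import pandas as pd

# Toy stand-in for the notebook's df (values are illustrative only).
toy = pd.DataFrame({
    "year_added": [2019, 2019, 2019, 2019, 2020, 2020],
    "age_rating": ["TV-MA", "TV-MA", "TV-14", "TV-Y7", "TV-MA", "TV-Y7"],
})

# Rows are years, columns are ratings, cells are title counts.
counts = pd.crosstab(toy["year_added"], toy["age_rating"])

# Divide each row by its total so that every year sums to 1.
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares.round(2))
```

Plotting `shares` instead of the raw counts would show whether a rating is gaining or losing ground relative to the rest of the catalog.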

4.2 2. Additions of Kids and Family Content: Yearly Trends.


Tracking yearly trends in kids and family content helps identify if Netflix is increasing its focus on
family-friendly offerings, signaling a shift toward a more inclusive platform.
In this section, we will create a chart to analyze yearly trends in Netflix’s kid-friendly offerings,
focusing on the number of additions to assess any increase or decrease in family-friendly content.

First, we need to filter our main DataFrame to isolate only the kids-related content before creating
the chart.
[ ]: # Define a list of keywords that are commonly found in kid-related genres
kid_keywords = ['Kids', 'Family', 'Children', 'Animated', 'Friendship',
                'Animation', 'Cartoon', 'Anime', 'Educational']

# Create a condition to filter for kid-related content
condition = df['listed_in'].str.contains('|'.join(kid_keywords), case=False, na=False)

# Filter the DataFrame to get only kid-related content
# (.copy() avoids SettingWithCopyWarning when columns are assigned later)
kids_content = df[condition].copy()
kids_content

[ ]: show_id type title \


6 s7 Movie My Little Pony: A New Generation
13 s14 Movie Confessions of an Invisible Girl
23 s24 Movie Go! Go! Cory Carson: Chrissy Takes the Wheel
34 s35 TV Show Tayo and Little Wizards
37 s38 TV Show Angry Birds
… … … …
8793 s8794 Movie Yours, Mine and Ours
8795 s8796 TV Show Yu-Gi-Oh! Arc-V
8797 s8798 TV Show Zak Storm
8803 s8804 TV Show Zombie Dumb
8805 s8806 Movie Zoom

director \
6 Robert Cullen, José Luis Ucha
13 Bruno Garotti
23 Alex Woo, Stanley Moore
34 NaN
37 NaN
… …
8793 Raja Gosnell
8795 NaN
8797 NaN
8803 NaN
8805 Peter Hewitt

cast date_added \
6 Vanessa Hudgens, Kimiko Glenn, James Marsden, … September 24, 2021
13 Klara Castanho, Lucca Picon, Júlia Gomes, Marc… September 22, 2021
23 Maisie Benson, Paul Killam, Kerry Gudjohnsen, … September 21, 2021
34 Dami Lee, Jason Lee, Bommie Catherine Han, Jen… September 17, 2021
37 Antti Pääkkönen, Heljä Heikkinen, Lynne Guagli… September 16, 2021

… … …
8793 Dennis Quaid, Rene Russo, Sean Faris, Katija P… November 20, 2019
8795 Mike Liscio, Emily Bauer, Billy Bob Thompson, … May 1, 2018
8797 Michael Johnston, Jessica Gee-George, Christin… September 13, 2018
8803 NaN July 1, 2019
8805 Tim Allen, Courteney Cox, Chevy Chase, Kate Ma… January 11, 2020

release_year age_rating duration \


6 2021 PG 91 min
13 2021 TV-PG 91 min
23 2021 TV-Y 61 min
34 2020 TV-Y7 1 Season
37 2018 TV-Y7 1 Season
… … … …
8793 2005 PG 88 min
8795 2015 TV-Y7 2 Seasons
8797 2016 TV-Y7 3 Seasons
8803 2018 TV-Y7 2 Seasons
8805 2006 PG 88 min

listed_in \
6 Children & Family Movies
13 Children & Family Movies, Comedies
23 Children & Family Movies
34 Kids' TV
37 Kids' TV, TV Comedies
… …
8793 Children & Family Movies, Comedies
8795 Anime Series, Kids' TV
8797 Kids' TV
8803 Kids' TV, Korean TV Shows, TV Comedies
8805 Children & Family Movies, Comedies

description
6 Equestria's divided. But a bright-eyed hero be…
13 When the clever but socially-awkward Tetê join…
23 From arcade games to sled days and hiccup cure…
34 Tayo speeds into an adventure when his friends…
37 Birds Red, Chuck and their feathered friends h…
… …
8793 When a father of eight and a mother of 10 prep…
8795 Now that he's discovered the Pendulum Summonin…
8797 Teen surfer Zak Storm is mysteriously transpor…
8803 While living alone in a spooky town, a young g…
8805 Dragged from civilian life, a former superhero…

[1301 rows x 11 columns]

Although the keyword list for identifying kid-friendly content may not be complete,
it still yields 1,301 kids-related entries out of a total dataset of over
8,000 entries. This sample is large enough to support a meaningful trend analysis,
though titles whose genre labels contain none of the keywords are not captured.
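Keyword filters of this kind are easy to get subtly wrong, so a quick check on toy data can help. The sketch below shows two things: Python silently concatenates adjacent string literals when a comma is missing, and `str.contains` with a `|`-joined pattern matches any keyword case-insensitively. The shortened keyword list and sample genre strings are assumptions for the example, and `re.escape` is an extra precaution that the notebook itself does not use.

```python
import re
import pandas as pd

# Adjacent string literals are concatenated, so a missing comma
# merges two keywords into one.
merged = ['Animation' 'Cartoon']
print(merged)

keywords = ['Kids', 'Family', 'Children', 'Anime']
# re.escape guards against keywords that contain regex metacharacters.
pattern = '|'.join(re.escape(k) for k in keywords)

# Toy genre strings in the style of the 'listed_in' column.
genres = pd.Series(["Kids' TV, TV Comedies", "Dramas", None, "anime series"])

# case=False ignores capitalization; na=False treats missing values as no match.
mask = genres.str.contains(pattern, case=False, na=False)
print(mask.tolist())
```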
Now, let’s generate a chart to illustrate the yearly trend of kids and family-friendly content added
over the years.
[ ]: import pandas as pd
import matplotlib.pyplot as plt

# Main DataFrame is kids_content, and date_added is in 'Month Day, Year' format
kids_content['date_added'] = pd.to_datetime(kids_content['date_added'])
kids_content['year_added'] = kids_content['date_added'].dt.year

# Group by the year and count the number of entries
content_trends = kids_content.groupby('year_added').size().reset_index(name='count')

# Plotting the trends


plt.figure(figsize=(10, 6))
plt.bar(content_trends['year_added'], content_trends['count'], color='skyblue')
plt.title('Trends of Added Kids Content Over the Years')
plt.xlabel('Year')
plt.ylabel('Number of Kids Content Added')
plt.xticks(content_trends['year_added'], rotation=45)
plt.grid(axis='y')

# Show the plot


plt.tight_layout()
plt.show()

The graph reveals a striking upward trend in kids content additions on Netflix over the
past decade. From just 25 new titles in 2015, Netflix increased its yearly
kids content additions to over 300 by 2020, representing a more than tenfold increase
in just five years.
Despite a slight dip in 2021, possibly due to pandemic-related production constraints, the
overall trend shows a substantial increase in kids content additions compared to previous
years.
Conclusion
This growth positions Netflix as an increasingly attractive option for family viewing
and supports the view that it is evolving into a more comprehensive, family-friendly
streaming service.
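The growth figure quoted above can be checked with a line of arithmetic. The helper below is a hypothetical function written for this illustration; the two counts are the approximate values read from the chart in the text (25 additions in 2015, about 300 in 2020).

```python
def growth_factor(start_count: int, end_count: int) -> float:
    """Return how many times larger end_count is than start_count."""
    if start_count <= 0:
        raise ValueError("start_count must be positive")
    return end_count / start_count


# Figures quoted in the text: 25 additions in 2015, roughly 300 in 2020.
factor = growth_factor(25, 300)
print(f"{factor:.0f}x")
```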

4.3 3. Duration Trends of Family Content
Shorter durations often reflect content designed for younger audiences or families, as children tend
to have shorter attention spans and families prefer content that can be consumed quickly in one
sitting, fitting into busy schedules. Let’s analyze recent Netflix content to see if the duration trends
align with this.
We will concentrate exclusively on kid and family-related content for our analysis, as examining
the duration trends of other genres is not relevant to our focus on family-friendly programming.
Our dataset for this duration analysis will again be “kids_content” from the previous section.
Let’s take an overall look.
[ ]: kids_content.head()

Let’s see how the duration values of this kids-related content are distributed.
[ ]: kids_content.duration.value_counts()

The content duration appears to be recorded in two formats: either in minutes or by
the number of seasons.
Before performing a duration trend analysis, we separate entries classified as seasons from those
listed in minutes to maintain consistency in comparing values. This is necessary because:
• Content listed in “seasons” cannot be directly compared to content measured in minutes.
• Including season and minute values in the same analysis would mix incompatible units,
leading to unclear or misleading results.
[ ]: # Create a DataFrame for entries with "min" in the duration
duration_minutes = kids_content[kids_content['duration'].str.contains("min", na=False)].copy()

# Create a DataFrame for entries with "Season" in the duration
duration_season = kids_content[kids_content['duration'].str.contains("Season", na=False)].copy()
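A quick sanity check on a split like this is to confirm that no row lands in both groups and to see how many rows land in neither. The sketch below does that on a toy column; the sample values and the names `minutes_part`, `seasons_part`, and `unclassified` are assumptions for the example.

```python
import pandas as pd

# Toy 'duration' values in the two formats seen in the dataset, plus a gap.
toy = pd.DataFrame({"duration": ["91 min", "1 Season", "2 Seasons", None]})

is_minutes = toy["duration"].str.contains("min", na=False)
is_seasons = toy["duration"].str.contains("Season", na=False)

# .copy() gives independent frames, so later column assignments do not
# trigger pandas' SettingWithCopyWarning.
minutes_part = toy[is_minutes].copy()
seasons_part = toy[is_seasons].copy()

# Rows matching neither mask (here, the missing value) are left out of both.
unclassified = toy[~(is_minutes | is_seasons)]
overlap = int((is_minutes & is_seasons).sum())
print(len(minutes_part), len(seasons_part), len(unclassified), overlap)
```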

Let’s analyze the duration trends in minutes for content related to kids and families.
It’s important to note that while shorter durations typically signify more kid-friendly content,
longer durations can still align with Netflix’s goal of being kid-friendly if extended episodes are
released during times like COVID-19, as they cater to children spending more time at home.
[ ]: import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Dataframe is "duration_minutes"

# Step 1: Convert 'date_added' to datetime format


duration_minutes['date_added'] = pd.to_datetime(duration_minutes['date_added'])

# Step 2: Clean the 'duration' column and convert to numeric
# Remove " minutes" and " min" from the duration column
duration_minutes['duration'] = (
    duration_minutes['duration']
    .str.replace(' minutes', '')
    .str.replace(' min', '')
    .str.replace('min', '')
    .str.strip()
)
duration_minutes['duration'] = pd.to_numeric(duration_minutes['duration'], errors='coerce')

# Step 3: Extract the year from the 'date_added' column


duration_minutes['year_added'] = duration_minutes['date_added'].dt.year

# Step 4: Group by year and calculate the average duration


trend_data = duration_minutes.groupby('year_added')['duration'].mean()

# Step 5: Create a line plot to show average duration trends over the years
plt.figure(figsize=(12, 6))
plt.plot(trend_data.index, trend_data.values, marker='o')
plt.title('Average Duration of Kid and Family-Related Content in Minutes Over the Years')

plt.xlabel('Year Added')
plt.ylabel('Average Duration (Minutes)')
plt.xticks(trend_data.index) # Show all years on the x-axis
plt.grid(True)
plt.show()

From 2017 to 2020, the average duration of kids-related content on Netflix stabilized
around 80 minutes, about 20% shorter than the roughly 100-minute average for
non-kids-related content. This is consistent with Netflix offering
shorter, more digestible content that likely appeals to families and children, reinforcing
its position as a more family-friendly platform.
However, there was a noticeable increase to nearly 90 minutes from 2020 to 2021,
suggesting that Netflix may have responded to viewer preferences for longer episodes or
films during this period, possibly due to increased family viewership as a result of
pandemic-related restrictions.
It’s worth noting that the trends from 2011 to 2013 may have been influenced by an
uneven number of entries each year, which could lead to potential outliers skewing the
average duration figures.
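One way to make the small-sample caveat concrete is to report the number of titles alongside each yearly average. The sketch below uses toy numbers, and `MIN_TITLES` is an arbitrary illustrative threshold rather than a value taken from the notebook.

```python
import pandas as pd

# Toy data: one title in 2012, three in 2019 (values are illustrative only).
toy = pd.DataFrame({
    "year_added": [2012, 2019, 2019, 2019],
    "duration": [120, 80, 75, 85],
})

# Named aggregation keeps the mean and the sample size side by side.
summary = toy.groupby("year_added")["duration"].agg(
    mean_minutes="mean", n_titles="count"
)

# MIN_TITLES is an arbitrary illustrative threshold, not a value from the notebook.
MIN_TITLES = 3
summary["small_sample"] = summary["n_titles"] < MIN_TITLES
print(summary)
```

Years flagged as `small_sample` could then be annotated or excluded when reading the trend line.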
Conclusion
The data suggests that Netflix initially emphasized shorter durations for kids-labeled
content, consistent with a family-friendly strategy. The increase in average duration
for kids-related content from 2020 to 2021 may reflect Netflix’s adaptation to
viewer preferences during the COVID-19 pandemic, which increased the demand for more
substantial content, and remains compatible with its commitment to being a family-friendly
platform.
For comparison, the average duration of non-kids-related content on Netflix is over 100 minutes,
calculated as follows.

[ ]: import pandas as pd

# Main DataFrame is df and kids_content DataFrame is 'kids_content'

# Step 1: Filter out entries that contain "Season" in the duration column
non_season_entries = df[~df['duration'].str.contains("Season", na=False)]

# Step 2: Exclude entries that are in the kids_content DataFrame
# (.copy() avoids SettingWithCopyWarning when 'duration' is reassigned below)
non_kids_entries = non_season_entries[
    ~non_season_entries['show_id'].isin(kids_content['show_id'])
].copy()

# Step 3: Clean the duration column and convert to numeric
# Remove " minutes" and " min" from the duration column
non_kids_entries['duration'] = (
    non_kids_entries['duration']
    .str.replace(' minutes', '')
    .str.replace(' min', '')
    .str.replace('min', '')
    .str.strip()
)
non_kids_entries['duration'] = pd.to_numeric(non_kids_entries['duration'], errors='coerce')

# Step 4: Drop rows with NaN values in 'duration'
non_kids_entries = non_kids_entries.dropna(subset=['duration'])

# Step 5: Calculate the average duration
average_duration = non_kids_entries['duration'].mean()

print(f'The average duration of all non-season, non-kid-related content is: '
      f'{average_duration:.1f} minutes')
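The relative difference between the two averages follows directly from them. The helper below is a hypothetical function written for this illustration, applied to the approximate figures quoted in the text (about 80 minutes for kids-related content, about 100 minutes otherwise).

```python
def percent_shorter(value: float, reference: float) -> float:
    """Return how much shorter value is than reference, as a percentage."""
    if reference <= 0:
        raise ValueError("reference must be positive")
    return (reference - value) * 100 / reference


# Approximate averages quoted in the text: 80 minutes vs. 100 minutes.
difference = percent_shorter(80, 100)
print(f"{difference:.0f}% shorter")
```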

Let’s analyze the duration trends in seasons for content related to kids and families.
Please note that in this section, since duration is measured in seasons, longer durations actually
align with Netflix’s goal of becoming more kid-friendly, as they signify a commitment to developing
engaging material for young audiences.
[ ]: import pandas as pd
import matplotlib.pyplot as plt

# DataFrame is "duration_season"

# Step 1: Convert 'date_added' to datetime format


duration_season['date_added'] = pd.to_datetime(duration_season['date_added'])

# Step 2: Extract the year from the 'date_added' column


duration_season['year_added'] = duration_season['date_added'].dt.year

# Step 3: Extract numeric part from the 'duration' column


# This assumes the duration format is something like "1 Season", "2 Seasons"
duration_season['duration'] = (
    duration_season['duration'].str.extract(r'(\d+)', expand=False).astype(float)
)

# Optional: Drop rows with NaN values in 'duration'
duration_season = duration_season.dropna(subset=['duration'])

# Step 4: Group by year and calculate the average duration in seasons


trend_data = duration_season.groupby('year_added')['duration'].mean()

# Step 5: Create a line plot to show the trend


plt.figure(figsize=(12, 6))
plt.plot(trend_data.index, trend_data.values, marker='o')
plt.title('Average Duration of Kid and Family-Related Content in Seasons Over the Years')

plt.xlabel('Year Added')
plt.ylabel('Average Duration (Seasons)')
plt.xticks(trend_data.index) # Show all years on the x-axis
plt.grid(True)
plt.show()
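To see what the season-count extraction does in isolation, the sketch below applies the same regular expression to a toy series (sample values are assumptions for the example). It uses a raw string so that `\d` is passed to the regex engine unchanged, and `expand=False` so that the result is a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Toy 'duration' strings in the season format, plus a missing value.
seasons = pd.Series(["1 Season", "2 Seasons", "10 Seasons", None])

# Capture the leading run of digits; missing inputs stay missing (NaN).
n_seasons = seasons.str.extract(r"(\d+)", expand=False).astype(float)
print(n_seasons.tolist())
```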

The significant drop from an average of 5 seasons to about 1 season between 2014 and
2016 could be attributed to the limited number of data samples during that period. This
fluctuation may not accurately represent the overall trends in kids’ TV show offerings
on Netflix.
The steady growth in the number of seasons for kids’ TV shows on Netflix from 2016
to 2021 is consistent with rising demand for family-oriented programming, with the average
increasing from about one season in 2016 to approximately 2.1 seasons by 2021.
Conclusion
This trend reflects Netflix’s commitment to long-term investments in family-oriented
series and the cultivation of audience loyalty among younger viewers. Overall, it
highlights Netflix’s strategic focus on becoming a more family-friendly platform.

5 IV. Summary
Key insights from the analysis include:
• Content Age Rating: Sharp decline in TV-MA and TV-14 ratings indicates a shift away from
mature content, making Netflix more family-friendly.
• Kids Content Additions: Significant increase in children’s content offerings positions Netflix
as an increasingly attractive option for family viewing.
• Duration Trends: Initially focused on shorter content for families, Netflix increased the
average duration during the pandemic, showcasing its effort to cater to and attract family and
children audiences.
• Seasonal Content: The growth in multi-season kids’ shows reflects Netflix’s long-term
investment in family-oriented series and the cultivation of younger audience loyalty, demonstrating
its shift toward becoming a more family-friendly platform.
These findings collectively demonstrate Netflix’s strategic evolution towards becoming
a more family-oriented streaming platform, adapting to viewer needs and preferences
while diversifying its content portfolio.

6 THANK YOU FOR YOUR ATTENTION!
