DS Final Project PDF

Uploaded by

Maikura
Data Gathering

This is where we upload the data to the notebook. In our case, we work with Spotify's
Top 10,000 Streamed Songs dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from google.colab import files


uploaded = files.upload()

<IPython.core.display.HTML object>

Saving Spotify_final_dataset.csv to Spotify_final_dataset.csv

df = pd.read_csv("Spotify_final_dataset.csv", sep=",")
df.head()

   Position   Artist Name                                  Song Name  Days  \
0         1   Post Malone  Sunflower SpiderMan: Into the SpiderVerse  1506
1         2    Juice WRLD                               Lucid Dreams  1673
2         3  Lil Uzi Vert                              XO TOUR Llif3  1853
3         4       J. Cole                             No Role Modelz  2547
4         5   Post Malone                                   rockstar  1223

   Top 10 (xTimes)  Peak Position Peak Position (xTimes)  Peak Streams  \
0            302.0              1                  (x29)       2118242
1            178.0              1                  (x20)       2127668
2            212.0              1                   (x4)       1660502
3              6.0              7                      0        659366
4            186.0              1                 (x124)       2905678

   Total Streams
0      883369738
1      864832399
2      781153024
3      734857487
4      718865961

Cleansing Data
Here we check that every column has a proper datatype. We also drop the columns that
will not be used later.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11084 entries, 0 to 11083
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Position 11084 non-null int64
1 Artist Name 11084 non-null object
2 Song Name 11080 non-null object
3 Days 11084 non-null int64
4 Top 10 (xTimes) 11084 non-null float64
5 Peak Position 11084 non-null int64
6 Peak Position (xTimes) 11084 non-null object
7 Peak Streams 11084 non-null int64
8 Total Streams 11084 non-null int64
dtypes: float64(1), int64(5), object(3)
memory usage: 779.5+ KB

Dropping the irrelevant columns


df.drop(['Top 10 (xTimes)', 'Peak Position', 'Peak Position (xTimes)',
         'Peak Streams'], axis=1, inplace=True)
df

       Position        Artist Name                                  Song Name  Days  Total Streams
0             1        Post Malone  Sunflower SpiderMan: Into the SpiderVerse  1506      883369738
1             2         Juice WRLD                               Lucid Dreams  1673      864832399
2             3       Lil Uzi Vert                              XO TOUR Llif3  1853      781153024
3             4            J. Cole                             No Role Modelz  2547      734857487
4             5        Post Malone                                   rockstar  1223      718865961
...         ...                ...                                        ...   ...            ...
11079     11080     The Band Perry                             If I Die Young     1          51321
11080     11081  Justin Timberlake                            Not a Bad Thing     1          49512
11081     11082     Mike WiLL Made                                      It 23     1          46547
11082     11083          The Vamps                            Somebody To You     1          44962
11083     11084                JAY                               Z Holy Grail     1          44323

[11084 rows x 5 columns]

Checking for null values


null_rows = df[df.isnull().any(axis=1)]
null_rows

Position Artist Name Song Name Days Total Streams


5506 5507 Jenny Duncan NaN 1 1737605
6217 6218 Dj Ozuna NaN 6 1198268
7177 7178 Daniel Marcy NaN 1 710534
8215 8216 Amy Kaylee NaN 2 412133

Check whether each of these artists has more than one hit song. If an artist has only one,
we can replace the missing song name with the artist name.
for i in null_rows['Artist Name']:
    # Count how many rows in the full dataset belong to this artist
    count = (df["Artist Name"] == i).sum()
    if count > 1:
        print(f"{i} has occurred {count} times.")
    else:
        print(f"{i} has occurred only once.")

Jenny Duncan has occurred only once.


Dj Ozuna has occurred only once.
Daniel Marcy has occurred only once.
Amy Kaylee has occurred only once.
df["Song Name"] = df["Song Name"].fillna(df["Artist Name"])

Double check if there are still null rows


null_rows = df[df.isnull().any(axis=1)]
null_rows

Empty DataFrame
Columns: [Position, Artist Name, Song Name, Days, Total Streams]
Index: []
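
The check above can also be written without an explicit loop. The following is a minimal sketch on a small made-up frame (the rows are illustrative and not taken from the dataset). It counts rows per artist with value_counts and fills a missing song name only when the artist appears once, which is the condition the text describes.

```python
import pandas as pd

# Illustrative rows only; the notebook itself uses Spotify_final_dataset.csv.
toy = pd.DataFrame({
    "Artist Name": ["Artist A", "Artist A", "Artist B", "Artist C"],
    "Song Name": ["Song 1", None, None, "Song 2"],
})

# Number of rows each artist has in the frame, aligned to every row.
songs_per_artist = toy["Artist Name"].map(toy["Artist Name"].value_counts())

# Fill a missing song name with the artist name only when the artist appears once.
fill_mask = toy["Song Name"].isna() & (songs_per_artist == 1)
toy.loc[fill_mask, "Song Name"] = toy.loc[fill_mask, "Artist Name"]

remaining_nulls = int(toy["Song Name"].isna().sum())
print(toy)
print("Rows still missing a song name:", remaining_nulls)
```

In this toy frame, Artist A has two rows, so its missing song name is left untouched for manual review, while Artist B's is filled.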

Exploratory Data Analysis


In this phase, we sort the data by total streams per song and per artist. We also plot the
relationship between a song's Position, the number of days since it was released, and its
number of Streams.
Displaying the Top 25 Artists with the most Hit Songs
plt.figure(figsize=(20,20))
artist_counts = df.groupby('Artist Name').size().sort_values(ascending=False)
top_artists = artist_counts.nlargest(25)
sns.barplot(y=top_artists.index, x=top_artists.values)
plt.title("Top 25 Artists with most Hit songs", size=20)
plt.ylabel("Artist", size=20)
plt.xlabel("Number of songs", size=20)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()
Drake, Future, and Taylor Swift are the top 3 artists with the most hit songs.
Visualizing the distribution of streams using a histogram
plt.figure(figsize=(20,10))
sns.histplot(data=df, x='Total Streams', bins=15)
plt.title("Streams Distribution", size=20)
plt.xlabel("Total Streams", size=20)
plt.ylabel("Frequency", size=20)
plt.show()
The histogram shows very little detail because the distribution is heavily skewed: most
songs were most likely streamed fewer than 50 million times, so almost every song falls
into the first bin. We redraw the histogram with the y-axis capped so that the smaller
bins become visible.
plt.figure(figsize=(20,10))
sns.histplot(data=df, x='Total Streams', bins=20)
plt.title("Streams Distribution", size=20)
plt.xlabel("Total Streams in hundred millions", size=20)
plt.ylim(0,100)
plt.ylabel("Frequency", size=20)
plt.show()
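Here the y-axis (frequency) is capped at 100 with plt.ylim(0,100), while the x-axis still
covers the full range of Total Streams in units of hundred millions. The number of songs
falls off quickly as Total Streams increases.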
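
An alternative way to make a heavily skewed distribution readable is to bin on a logarithmic scale instead of capping an axis. The sketch below uses synthetic log-normal values as a stand-in for stream counts (an assumption for illustration, not the real data) and compares equal-width bins with log-spaced bins using NumPy only.

```python
import numpy as np

# Synthetic, heavily right-skewed "stream counts" (log-normal); not the real data.
rng = np.random.default_rng(0)
streams = rng.lognormal(mean=14, sigma=2, size=10_000)

# Equal-width bins: most values pile into the first bin.
linear_counts, _ = np.histogram(streams, bins=15)

# Log-spaced bins: each bin covers the same ratio rather than the same width.
log_edges = np.logspace(np.log10(streams.min()), np.log10(streams.max()), num=16)
log_counts, _ = np.histogram(streams, bins=log_edges)

first_linear_share = linear_counts[0] / streams.size
largest_log_share = log_counts.max() / streams.size
print("share of values in the first equal-width bin:", first_linear_share)
print("largest share in any log-spaced bin:", largest_log_share)
```

In the notebook itself, passing log_scale=True to sns.histplot would be the analogous option, assuming a seaborn version that supports that parameter.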
Display the Top 25 songs that are streamed the most
plt.figure(figsize=(20,40))
top_songs = df.sort_values(by='Total Streams',
ascending=False).head(25)
top_songs = top_songs[['Song Name', 'Total Streams']]
sns.barplot(y='Song Name', x='Total Streams', data=top_songs)
plt.title("Top 25 songs that are streamed the most", size=20)
plt.ylabel("Song Name", size=20)
plt.xlabel('Total Streams', size=20)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()
Sunflower, Lucid Dreams, and XO Tour Llif3 are the top 3 songs with the highest streams
Display the Artist with highest total streams from their songs
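# regex=True is required on recent pandas versions, where str.replace treats the
# pattern as a literal string by default
df['Artist Name'] = df['Artist Name'].str.replace(r'[^a-zA-Z0-9 \n.]',
'', regex=True)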
plt.figure(figsize=(20,40))
artist_streams = df.groupby('Artist Name')['Total Streams'].sum().sort_values(ascending=False)
top_artists = artist_streams.nlargest(50)
sns.barplot(y=top_artists.index, x=top_artists.values)
plt.title("Artist with highest total streams", size=20)
plt.ylabel("Artist", size=20)
plt.xlabel("Total Streams in Billions", size=20)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()
Drake, Post Malone, and Juice WRLD are the top 3 artists with the highest total streams.
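
The two per-artist rankings used in this section (number of songs and summed streams) can be computed in a single pass. The following is a small sketch on made-up rows, using pandas named aggregation; the column names match the notebook, while the output names songs and total_streams are chosen here for illustration.

```python
import pandas as pd

# Illustrative rows only; column names match the notebook's DataFrame.
toy = pd.DataFrame({
    "Artist Name": ["Artist A", "Artist B", "Artist A", "Artist C", "Artist B", "Artist A"],
    "Song Name": ["s1", "s2", "s3", "s4", "s5", "s6"],
    "Total Streams": [100, 400, 50, 300, 10, 25],
})

# One groupby producing both the song count and the summed streams per artist.
summary = toy.groupby("Artist Name").agg(
    songs=("Song Name", "count"),             # non-null song names per artist
    total_streams=("Total Streams", "sum"),   # summed streams per artist
)

most_songs = summary["songs"].nlargest(2)
most_streams = summary["total_streams"].nlargest(2)
print(summary)
```

The two rankings need not agree: in this toy frame Artist A has the most songs, but Artist B has the most streams.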
Heatmap Correlation:
plt.figure(figsize=(20,10))
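# numeric_only=True skips the text columns (Artist Name, Song Name), which recent
# pandas versions no longer drop automatically
sns.heatmap(df.corr(numeric_only=True),annot=True)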
plt.xticks(size=20, rotation=90)
plt.yticks(size=20)
plt.show()

sns.pairplot(df, x_vars=['Days'], y_vars=['Total Streams'], height=10)


plt.xlabel('Days', size=20)
plt.ylabel('Total Streams', size=20)
plt.show()
Days and Total Streams have a strong positive correlation, as shown in the plot: the more
days a song has, the higher its streams.
sns.pairplot(df, x_vars=['Position'], y_vars=['Total Streams'],
height=10)
plt.xlabel('Position', size=20)
plt.ylabel('Total Streams', size=20)
plt.show()
As the graph shows, Position and Total Streams are negatively correlated: the lower the
position number (that is, the closer to rank 1), the higher the total streams.
Observation from EDA Graphs:
1. Days has a high positive correlation with Total Streams. According to this data, the
more days a song has, the more streams it gets.
2. Position has a negative correlation with Total Streams. Since Total Streams and Days
have a high positive correlation, songs with more days tend to hold a better
position.
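
The correlations behind these observations can be reproduced numerically. The sketch below builds synthetic data in which streams grow roughly linearly with days (an assumption for illustration, not the real dataset) and derives Position as the rank of Total Streams, which is how a chart position would relate to streams. It then prints both the Pearson and the Spearman correlation matrices.

```python
import numpy as np
import pandas as pd

# Synthetic data: streams grow roughly linearly with days, plus noise.
rng = np.random.default_rng(42)
days = rng.integers(1, 2000, size=500)
streams = 400_000 * days + rng.normal(0, 5e7, size=500)

toy = pd.DataFrame({"Days": days, "Total Streams": streams})
# Rank 1 goes to the most-streamed row, mirroring the Position column.
toy["Position"] = toy["Total Streams"].rank(ascending=False).astype(int)

pearson = toy.corr(method="pearson")    # linear association
spearman = toy.corr(method="spearman")  # monotonic (rank-based) association
print(pearson.round(2))
print(spearman.round(2))
```

Because Position is a ranking of Total Streams in this sketch, their rank correlation is negative by construction, so that particular correlation says more about how the position is defined than about listener behaviour.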

Modelling
This is where we model our data. We will use three different models for this project. These
are Linear Regression, Multi-Layer Perceptron, and Random Forest models.
This is where importing happens.
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

Here we instantiate the three models that will be fitted to the data.
df_LR = LinearRegression()
df_MLPR = MLPRegressor()
df_RFP = RandomForestRegressor()

df.head()

   Position   Artist Name                                  Song Name  Days  Total Streams
0         1   Post Malone  Sunflower SpiderMan: Into the SpiderVerse  1506      883369738
1         2    Juice WRLD                               Lucid Dreams  1673      864832399
2         3  Lil Uzi Vert                              XO TOUR Llif3  1853      781153024
3         4       J. Cole                             No Role Modelz  2547      734857487
4         5   Post Malone                                   rockstar  1223      718865961

x = df[['Total Streams']]
y = df[['Days']]

Total Streams
0 883369738
1 864832399
2 781153024
3 734857487
4 718865961
... ...
11079 51321
11080 49512
11081 46547
11082 44962
11083 44323
[11084 rows x 1 columns]

Days
0 1506
1 1673
2 1853
3 2547
4 1223
... ...
11079 1
11080 1
11081 1
11082 1
11083 1

[11084 rows x 1 columns]

We use a train/test split to evaluate the models: 80% of the rows are used for training and
the remaining 20% are held out for testing. This lets us check how well the relationship
seen in the graphs above holds on data the models have not seen.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(x,y, test_size=0.2,
train_size=None, random_state=None, shuffle=True, stratify=None)
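
With random_state=None, every run of the notebook produces a different split, so the scores change from run to run. A minimal sketch of a reproducible split follows; the arrays are placeholders with the same row count as the dataset, and the seed value 42 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 11084  # row count reported by df.info() above
x_demo = np.arange(n_rows).reshape(-1, 1)  # placeholder feature, not real data
y_demo = np.arange(n_rows)                 # placeholder target

# Fixing random_state makes the shuffle, and therefore the split, repeatable.
split_a = train_test_split(x_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(x_demo, y_demo, test_size=0.2, random_state=42)

x_train, x_test, y_train, y_test = split_a
same_split = np.array_equal(split_a[1], split_b[1])
print(len(x_train), len(x_test))
print("identical test rows across runs:", same_split)
```

The sizes match the 8867 training rows and 2217 test rows shown in the outputs that follow.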

Xtrain

Total Streams
6416 1084804
3045 8815796
2314 15412318
280 163082925
7877 520840
... ...
8686 317994
4446 3117596
8395 360406
5557 1691747
6866 825130

[8867 rows x 1 columns]

Ytrain

Days
6416 2
3045 34
2314 34
280 414
7877 7
... ...
8686 1
4446 7
8395 4
5557 5
6866 2

[8867 rows x 1 columns]

Xtest

Total Streams
7466 619903
1 864832399
1712 25512899
3259 7466960
1568 29258354
... ...
8792 306573
6679 933608
1838 23088066
4133 3909187
9499 259997

[2217 rows x 1 columns]

Ytest

Days
7466 2
1 1673
1712 56
3259 37
1568 106
... ...
8792 1
6679 4
1838 59
4133 29
9499 1

[2217 rows x 1 columns]

df_LR.fit(Xtrain, Ytrain)
df_MLPR.fit(Xtrain, Ytrain)
df_RFP.fit(Xtrain, Ytrain)

RandomForestRegressor()
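
Total Streams ranges from tens of thousands to hundreds of millions, and neural networks are usually sensitive to the scale of their inputs. One common remedy, which is an assumption here and not something the notebook does, is to standardize the feature before it reaches the network. The sketch below wraps StandardScaler and MLPRegressor in a pipeline and fits it on synthetic stand-in data; the hidden layer size, max_iter, and seed are illustrative choices.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's columns; not the real dataset.
rng = np.random.default_rng(0)
total_streams = rng.uniform(4e4, 9e8, size=(400, 1))
days = total_streams[:, 0] / 5e5 + rng.normal(0, 50, size=400)

# Standardize the feature (zero mean, unit variance) before the network sees it.
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
)
scaled_mlp.fit(total_streams, days)

predictions = scaled_mlp.predict(total_streams)
print("training R^2 with scaling:", scaled_mlp.score(total_streams, days))
```

Putting the scaler inside the pipeline means it is fitted only on the data passed to fit, so the same object can later be applied to a held-out test set without leaking information from it.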

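Evaluation
Now that the models are fitted, we compare them in graphs. For the R-squared score
comparison, Linear Regression got 85.92%, Multi-Layer Perceptron got 75.36%, and Random
Forest got 97.57%. For the Mean Squared Error comparison, Linear Regression got 41.78%,
Multi-Layer Perceptron got 44.57%, and Random Forest got 49.10%. For the Mean Absolute
Error comparison, Linear Regression got 20.22%, Multi-Layer Perceptron got 16.60%, and
Random Forest got 18.54%. The error figures are the metric values divided by 100, as in
the code below. Because the train/test split is not seeded (random_state=None), these
figures appear to come from a different run than the cell outputs printed below, so the
two sets of numbers do not match exactly.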
print(df_LR.score(Xtrain, Ytrain))
print(df_MLPR.score(Xtrain, Ytrain))
print(df_RFP.score(Xtrain, Ytrain))

0.8566285188125835
-2.039755857615953
0.9731578751057082
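
The scores above are computed on Xtrain and Ytrain, that is, on rows the models have already seen. A score on the held-out split is usually the more informative number. The sketch below shows both for a random forest on synthetic noisy data (illustrative only; the variable names and seed are assumptions).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic noisy data; not the real dataset.
rng = np.random.default_rng(1)
x_demo = rng.uniform(0, 100, size=(600, 1))
y_demo = 3 * x_demo[:, 0] + rng.normal(0, 40, size=600)

x_train, x_test, y_train, y_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=1
)

forest = RandomForestRegressor(n_estimators=100, random_state=1)
forest.fit(x_train, y_train)

train_r2 = forest.score(x_train, y_train)  # rows the model has already seen
test_r2 = forest.score(x_test, y_test)     # held-out rows
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```

A random forest can fit its training rows very closely, so a high training R-squared on its own does not show that the model generalizes. In the notebook, the equivalent call would be df_RFP.score(Xtest, Ytest).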

from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

df_LRprediction = df_LR.predict(Xtest)
df_MLPRprediction = df_MLPR.predict(Xtest)
df_RFPprediction = df_RFP.predict(Xtest)

# squared=False makes mean_squared_error return the root mean squared error (RMSE).
# Newer scikit-learn releases deprecate this flag in favour of root_mean_squared_error.
LR_sq_error = mean_squared_error(Ytest, df_LRprediction,
sample_weight=None, multioutput='uniform_average', squared=False)/100
MLPR_sq_error = mean_squared_error(Ytest, df_MLPRprediction,
sample_weight=None, multioutput='uniform_average', squared=False)/100
RFP_sq_error = mean_squared_error(Ytest, df_RFPprediction,
sample_weight=None, multioutput='uniform_average', squared=False)/100

print(LR_sq_error)
print(MLPR_sq_error)
print(RFP_sq_error)

0.47116017305214347
2.6123827992669333
0.5203310294005739

LR_ab_error = mean_absolute_error(Ytest, df_LRprediction,
sample_weight=None, multioutput='uniform_average')/100
MLPR_ab_error = mean_absolute_error(Ytest, df_MLPRprediction,
sample_weight=None, multioutput='uniform_average')/100
RFP_ab_error = mean_absolute_error(Ytest, df_RFPprediction,
sample_weight=None, multioutput='uniform_average')/100

print(LR_ab_error)
print(MLPR_ab_error)
print(RFP_ab_error)

0.20850917487160747
0.7469482505523841
0.19447871723550342
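
To make the three metrics concrete, the sketch below computes them by hand on four made-up values, so the arithmetic can be checked on paper. Note that with squared=False the notebook's mean_squared_error call returns the root of the mean squared error (RMSE), and that dividing by 100 rescales the value without turning it into a percentage.

```python
import numpy as np

# Tiny made-up example so the arithmetic can be checked by hand.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_pred            # [ 1.  0. -2. -1.]
mae = np.mean(np.abs(errors))       # (1 + 0 + 2 + 1) / 4
mse = np.mean(errors ** 2)          # (1 + 0 + 4 + 1) / 4
rmse = np.sqrt(mse)                 # square root of the MSE

ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                        # share of variance explained

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse, "R^2:", r2)
```

MAE and RMSE are in the units of the target (days, in this notebook), whereas R-squared is unitless. R-squared can be negative when a model predicts worse than always guessing the mean, which is what the Multi-Layer Perceptron's training score above indicates.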

Model_score = ('Linear Regression', 'Multi-layer Perceptron', 'Random Forest')
Ypos_score = np.arange(len(Model_score))
Values = [df_LR.score(Xtrain, Ytrain),df_MLPR.score(Xtrain,
Ytrain),df_RFP.score(Xtrain, Ytrain)]

plt.bar(Ypos_score, Values, align='center', alpha=0.5)


plt.xticks(Ypos_score, Model_score)
plt.ylabel('Values')
plt.title('R-squared score comparison (higher is better)')
plt.ylim([0,1])
plt.show()

Model_sq_error = ('Linear Regression', 'Multi-layer Perceptron',
'Random Forest')
Ypos_sq_error = np.arange(len(Model_sq_error))
Values = [LR_sq_error,MLPR_sq_error,RFP_sq_error]

plt.bar(Ypos_sq_error, Values, align='center', alpha=0.5)


plt.xticks(Ypos_sq_error, Model_sq_error)
plt.ylabel('Values')
plt.title('Mean squared error comparison (lower is better)')
plt.ylim([0,1])
plt.show()
Model_ab_error = ('Linear Regression', 'Multi-layer Perceptron',
'Random Forest')
Ypos_ab_error = np.arange(len(Model_ab_error))
Values = [LR_ab_error,MLPR_ab_error,RFP_ab_error]

plt.bar(Ypos_ab_error, Values, align='center', alpha=0.5)


plt.xticks(Ypos_ab_error, Model_ab_error)
plt.ylabel('Values')
plt.title('Mean absolute error comparison (lower is better)')
plt.ylim([0,1])
plt.show()
Conclusion
But what do all of these numbers mean for our evaluation? For the R-squared score
comparison, the model with the highest value is the more acceptable one to use. In this
case, it was the Random Forest model. For the Mean Squared Error and Mean Absolute Error
comparisons, the model with the lowest value is the more acceptable one, which was Linear
Regression for the squared error and Multi-Layer Perceptron for the absolute error. We can
conclude that when it comes to analyzing Spotify's songs and their corresponding number
of Streams and Days, a data scientist can use any of the three models presented in this
project.
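
The selection rule described above (highest R-squared, lowest error) can be written out directly. The sketch below applies it to the values printed in the cell outputs earlier in this notebook. Those outputs come from a run that differs from the percentages quoted in the text, so the winners for an individual metric can differ as well.

```python
# Values copied from the printed cell outputs above: R^2 on the training split,
# RMSE / 100 and MAE / 100 on the test split.
results = {
    "Linear Regression": {
        "r2": 0.8566285188125835, "rmse": 0.47116017305214347, "mae": 0.20850917487160747,
    },
    "Multi-layer Perceptron": {
        "r2": -2.039755857615953, "rmse": 2.6123827992669333, "mae": 0.7469482505523841,
    },
    "Random Forest": {
        "r2": 0.9731578751057082, "rmse": 0.5203310294005739, "mae": 0.19447871723550342,
    },
}

# Higher is better for R^2; lower is better for the two error metrics.
best_r2 = max(results, key=lambda name: results[name]["r2"])
best_rmse = min(results, key=lambda name: results[name]["rmse"])
best_mae = min(results, key=lambda name: results[name]["mae"])

print("Best R^2: ", best_r2)
print("Best RMSE:", best_rmse)
print("Best MAE: ", best_mae)
```

Since the ranking changes between runs, fixing the random seed and comparing all models on the same held-out split would make the comparison easier to interpret.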
