Trainity Data Analytics Trainee Task 9 - ADVAIT CHAVAN - Data Analysis Portfolio
Portfolio
Advait Chavan
Professional Background
I am currently in my final year of B.E. in Electronics Engineering, with a CGPA of 9.36 (up to the 6th semester). My skills include Data Analysis, Python, PSpice simulation, Multisim, Machine Learning, MySQL, and Excel.
1
Table of Contents
Professional Background ------------------- 1
Table of Contents ------------------- 2-3
Data Analytics Process
Description ------------------- 4
Design ------------------- 5
Conclusions ------------------- 6
Instagram User Analytics
Description ------------------- 7
The Problem ------------------- 8-9
Design ------------------- 10
Findings ------------------- 11-18
Analysis ------------------- 19-20
Conclusions ------------------- 21
Operation Analytics and Investigating Metric Spike
Description ------------------- 22
The Problem ------------------- 23-24
Design ------------------- 25
Findings ------------------- 26-38
Analysis ------------------- 39-40
Conclusions ------------------- 41
Hiring Process Analytics
Description ------------------- 42
The Problem ------------------- 43
Design ------------------- 44
Findings ------------------- 45-51
Analysis ------------------- 52
Conclusions ------------------- 53
2
Table of Contents (Cont..)
IMDB Movies Analysis
Description ------------------- 54
The Problem ------------------- 55-56
Design ------------------- 57
Findings ------------------- 58-67
Analysis ------------------- 68-69
Conclusions ------------------- 70
Bank Loan Case Study
Description ------------------- 71
The Problem ------------------- 72
Design ------------------- 73-77
Findings ------------------- 78-101
Analysis ------------------- 102-103
Conclusions ------------------- 104
XYZ Ads Airing Report
Description ------------------- 105
The Problem ------------------- 106
Design ------------------- 107
Findings ------------------- 108-124
Analysis ------------------- 125-127
Conclusions ------------------- 128-131
ABC Call Volume Trend
Description ------------------- 132
The Problem ------------------- 133
Findings ------------------- 134-141
Analysis ------------------- 142-144
Conclusions ------------------- 145-146
Appendix ------------------- 147-149
3
Data Analytics Process
Description
4
Data Analytics Process
Design
Scenario: Going from Andheri (a place in Mumbai) to Panvel (a place in Navi Mumbai)
If a person wants to go from one place to another, say from Andheri to Panvel in our case, he/she may have many options for the transit.
But he/she will only choose the best option according to his/her preference.
The person will go through all the available options and will do a step-by-step analysis while selecting the best route from his/her current location to the desired location.
The person would take the following steps while making the right decision:
Step 1: Plan: First, one will decide which mode of transport is most preferable, i.e. roadway or railway. By road one has two options: either take an Uber, Ola, Meru, etc. cab or a BEST/NMMT bus. By rail one has three options, viz. the Western line, the Harbour line and the Central line.
Step 2: Prepare: If one opts for the roadways, one has two options depending on how much one is willing to pay, viz. Uber/Ola/Meru or BEST/NMMT. If one opts for the railways, one has three travel classes (First class, AC class and Second class) and can select which railway line route to take.
Step 3: Process: One needs to think about which of the above-mentioned routes will make the journey fast, comfortable and cost-effective.
Step 4: Analyze: It is obvious that if one is travelling alone with little or no extra luggage, one will not opt to pay Rs. 800-1500 for an Uber/Ola/Meru cab along with the toll charges (at Vashi on the Mumbai-Pune Expressway). The time taken by road to reach Panvel from Andheri is almost 1.5-2 hours, but the same transit can be done by rail in just 50 minutes to 1.5 hours (depending on whether the train is fast or slow) [avoiding trains on Sunday due to mega-blocks]. One can opt for the first-class ticket, which costs around Rs. 100, or go for the second-class ticket, which costs just Rs. 20.
Step 5: Share: After going through the above-mentioned steps, one can book a direct cab through the respective app (Ola, Uber, Meru, etc.). Else one can go to the station and buy the ticket (first-class or second-class) from the vendor at the ticket window, use an ATVM (Automatic Ticket Vending Machine) and pay via the ATVM card or UPI, or book the ticket directly via the UTS app.
Step 6: Act: After buying the ticket / booking the cab, one takes a seat (which is sometimes not available in trains due to crowding during peak hours) and enjoys the ride until reaching the destination.
5
Data Analytics Process
Conclusions
Hence, we have seen how the 6 steps of the Data Analytics process can be used while making any decision in a real-life scenario (finding the best transit route from Andheri to Panvel in our case).
6
Instagram User Analytics
Description
You are working with the product team of Instagram and the
product manager has asked you to provide insights on the
questions asked by the management team.
7
Instagram User Analytics
The Problem
A) Marketing: The marketing team wants to launch some campaigns, and they need your help with the following.
Rewarding the Most Loyal Users: people who have been using the platform for the longest time.
Your task: Find the 5 oldest users of Instagram from the database provided.
8
Instagram User Analytics
The Problem (Cont..)
B) Investor Metrics: Our investors want to know if Instagram is performing well and is not becoming redundant like Facebook. They want to assess the app on the following grounds.
9
Instagram User Analytics
Design
10
Instagram User Analytics
Findings - I
To find the most loyal i.e. the top 5 oldest users of Instagram:
1. We will use the data from the users table by selecting the username
and created_at columns.
2. Then using the order by function we will order the desired output by
sorting with the created_at column in ascending order.
3. Then using the limit function, the output will be displayed for top 5
oldest Instagram users.
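A minimal SQL sketch of this query (assuming the users table with username and created_at columns described above):
SELECT username, created_at
FROM users
ORDER BY created_at ASC   -- oldest accounts first
LIMIT 5;                  -- keep only the 5 oldest users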
Output/Result
11
Instagram User Analytics
Findings - II
To find the most inactive users, i.e. the users who have never posted a single photo on Instagram:
1. First, we will select the id and username columns from the users table.
2. Then we will left join the photos table on the users table, on users.id = photos.user_id, because both users.id and photos.user_id refer to the same user.
3. Then we will keep the rows from the users table where photos.id IS NULL.
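A minimal SQL sketch of this query (assuming the users and photos tables joined on users.id = photos.user_id, as described):
SELECT u.id, u.username
FROM users u
LEFT JOIN photos p ON u.id = p.user_id   -- keep every user, with matched photos where they exist
WHERE p.id IS NULL;                      -- users with no matching photo have never posted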
Output/Result
12
Instagram User Analytics
Findings - III
To find the photo with the most likes (and the user who posted it):
1. First, we will select the username from the users table, the photo's id from the photos table, and count(*) as total to count the likes per photo.
2. Then, we will inner join the three tables, viz. photos, likes and users, on likes.photo_id = photos.id and photos.user_id = users.id.
3. Then, by using the group by function we will group the output on the basis of photos.id.
4. Then, using the order by function we will sort the data on the basis of total in descending order.
5. Then, to find the most liked photo we will use the limit function to view only the top photo's information.
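A minimal SQL sketch of this query (assuming the photos, likes and users tables joined as described above):
SELECT u.username, p.id AS photo_id, COUNT(*) AS total
FROM likes l
INNER JOIN photos p ON l.photo_id = p.id
INNER JOIN users  u ON p.user_id  = u.id
GROUP BY p.id, u.username      -- one row per photo
ORDER BY total DESC            -- most-liked photo first
LIMIT 1;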
Output/Result
13
Instagram User Analytics
Findings - IV
To find the top 5 most commonly used hashtags:
1. We need to select the tag_name column from the tags table and count(*) as total so as to count the number of times each tag has been used.
2. Then using the group by function we need to group the desired output on the basis of tags.tag_name.
3. Then using the order by function we need to sort the output on the basis of total (total number of tags per tag_name) in descending order.
4. Then, to find the top 5 most used tag names we will use the limit 5 function.
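A minimal SQL sketch of this query; the join through a photo_tags bridge table is an assumption, since only the tags table and the grouping on tags.tag_name are described above:
SELECT t.tag_name, COUNT(*) AS total
FROM photo_tags pt                 -- assumed bridge table linking photos to tags
INNER JOIN tags t ON pt.tag_id = t.id
GROUP BY t.tag_name
ORDER BY total DESC
LIMIT 5;                           -- top 5 most used hashtags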
Output/Result
14
Instagram User Analytics
Findings - V
To find the day of the week on which most users register on Instagram:
1. First we define the columns of the desired output table using select
dayname(created_at) as day_of_week and count(*) as
total_number_of_users_registered from the users table
2. Then using the group by function we group the output table on the basis of
day_of_week
3. Then using the order by function we order/sort the output table on the basis of
total_number_of_users_registered in descending order
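A minimal SQL sketch of this query (MySQL syntax, assuming the users table described above):
SELECT DAYNAME(created_at) AS day_of_week,
       COUNT(*) AS total_number_of_users_registered
FROM users
GROUP BY day_of_week
ORDER BY total_number_of_users_registered DESC;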
Output/Result
15
Instagram User Analytics
Findings - VI
To find how many times an average user posts on Instagram:
1. First, we need to count the number of photos (posts) present in the photos.id column of the photos table, i.e. count(*) from photos.
2. Similarly, we need to count the number of users present in the users.id column of the users table, i.e. count(*) from users.
3. Next, we need to divide the two values, i.e. count(*) from photos / count(*) from users, which gives us total number of photos / total number of users.
4. To find how many times each user posts on Instagram we need to find the total occurrences of each user_id in the photos table.
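A minimal SQL sketch of the average-posts calculation (assuming the photos and users tables described above):
SELECT (SELECT COUNT(*) FROM photos) /
       (SELECT COUNT(*) FROM users) AS avg_posts_per_user;   -- total photos / total users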
Output/Result
16
Instagram User Analytics
Findings - VI (Cont...)
17
Instagram User Analytics
Findings - VII
To find the bots and fake accounts:
1. First, we note the total number of photos on the platform, i.e. count(*) from the photos table (this will be the comparison value).
2. Then, we select count(*) as total_likes_per_user to count the total number of likes per user from the likes table.
3. Then we inner join the users and likes tables on users.id = likes.user_id, using the ON clause.
4. Then by using the group by function we group the desired output table on the basis of likes.user_id.
5. Then, we keep only those users whose total_likes_per_user equals the total number of photos, i.e. count(*) from photos.
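A minimal SQL sketch of this query (assuming the users, likes and photos tables described above); a user who has liked every single photo is flagged as a probable bot:
SELECT u.id, u.username, COUNT(*) AS total_likes_per_user
FROM users u
INNER JOIN likes l ON u.id = l.user_id
GROUP BY u.id, u.username
HAVING total_likes_per_user = (SELECT COUNT(*) FROM photos);  -- liked every photo on the platform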
Output/Result
18
Instagram User Analytics
Analysis
After performing the analysis, I have the following observations:
The most loyal users, i.e. the top 5 oldest users, are:
Out of the 100 total users there are 26 users who are inactive and have never posted any kind of content on Instagram, be it a photo, video or any type of text. So, the marketing team of Instagram needs to remind such inactive users.
The user named Zack_Kemmer93 with user_id 52 is the winner of the contest because his photo with photo_id 145 has the highest number of likes, i.e. 48.
The top 5 most commonly used #hashtags along with their total counts are smile (59), beach (42), party (39), fun (38) and concert (24).
Most users registered on Thursday and Sunday (16 each), and hence it would prove beneficial to start ad campaigns on these two days.
There are in total 257 rows, i.e. 257 photos, in the photos table and 100 rows, i.e. 100 ids, in the users table, which makes the desired output 257/100 = 2.57 (average posts per user on Instagram).
Out of the total user ids there are 13 user ids that have liked each and every post on Instagram (which is not practically possible), and so such user ids are considered BOTS and fake accounts.
19
Instagram User Analytics
Analysis (Cont...)
Using the 5 Whys approach I am finding the root cause of the following:
Why did the Marketing team want to know the most inactive users?
--> So they can reach out to those users via mail and ask them what is keeping them away from using Instagram.
Why did the Marketing team want to know the top 5 #hashtags used?
--> Maybe the tech team wanted to add some filter features for photos and videos posted using the top 5 mentioned #hashtags.
Why did the Marketing team want to know on which day of the week the platform had the most new users registered?
--> So that they can run more ads of various brands during such days and also profit from it.
Why did the Investors want to know the average posts per user on Instagram?
--> Every brand or social platform is judged by the user engagement on the platform, and the investors wanted to know whether the platform has the right, authenticated user base. It also helps the tech team determine how to handle the traffic on the platform with the latest tech without disrupting its smooth and efficient functioning.
Why did the Investors want to know the count of BOTS and fake accounts, if any?
--> So that the investors are assured that they are investing in an asset and not a future liability.
20
Instagram User Analytics
Conclusion
21
Operation Analytics and
Investigating Metric Spike
Description
22
Operation Analytics and
Investigating Metric Spike
The Problem
23
Operation Analytics and
Investigating Metric Spike
The Problem(Cont...)
24
Operation Analytics and
Investigating Metric Spike
Design
25
Operation Analytics and
Investigating Metric Spike
Job Data
Findings - I
1. First, we will count the job_id values (both distinct and non-distinct) from the job_data table.
2. Then we will divide the total count of job_id by (30 days * 24 hours) to find the number of jobs reviewed per hour per day.
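A minimal SQL sketch of this calculation; the job_data table name, the job_id column and the ds date column (covering November 2020) are assumptions based on the task description:
SELECT COUNT(DISTINCT job_id) / (30 * 24) AS distinct_jobs_reviewed_per_hour,
       COUNT(job_id)          / (30 * 24) AS total_jobs_reviewed_per_hour
FROM job_data
WHERE ds BETWEEN '2020-11-01' AND '2020-11-30';   -- the 30 days of November 2020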
Output /Result
26
Operation Analytics and
Investigating Metric Spike
Job Data
Findings - II
Output /Result
27
Operation Analytics and
Investigating Metric Spike
Job Data
Findings - III
Output /Result
28
Operation Analytics and
Investigating Metric Spike
Job Data
Findings - III(Cont..)
Output /Result
29
Operation Analytics and
Investigating Metric Spike
Job Data
Findings - IV
4. Then using the WHERE function we will find the row_num having value
greater than 1 i.e. row_num > 1 based on the occurrence of the job_id in the
table
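A minimal SQL sketch of the duplicate-row check, assuming a job_data table; the column used inside ORDER BY of the window is an assumption:
SELECT *
FROM (
    SELECT job_id, language,
           ROW_NUMBER() OVER (PARTITION BY job_id ORDER BY job_id) AS row_num
    FROM job_data
) AS numbered
WHERE row_num > 1;   -- any row numbered above 1 repeats an existing job_id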
Output /Result
30
Operation Analytics and
Investigating Metric Spike
Investigating Metric Spike
Findings - I
To find the weekly user engagement (number of distinct users active per week):
1. We will extract the week from the occurred_at column of the events table using the EXTRACT and WEEK functions.
2. Then we will count the number of distinct user_id values from the events table.
3. Then we will use the GROUP BY function to group the output w.r.t. the week extracted from occurred_at.
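A minimal SQL sketch of the weekly engagement count (assuming an events table with occurred_at and user_id columns, as described):
SELECT EXTRACT(WEEK FROM occurred_at) AS week_num,
       COUNT(DISTINCT user_id)        AS weekly_active_users
FROM events
GROUP BY week_num
ORDER BY week_num;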
Output /Result
31
Operation Analytics and
Investigating Metric Spike
Investigating Metric Spike
Findings - I (Cont...)
Output /Result
32
Operation Analytics and
Investigating Metric Spike
Investigating Metric Spike
Findings - II
To find the user growth (number of active users per week):-
1. First we will the extract the year and week for the occurred_at column of the users
table using the extract, year and week functions
2. Then we will group the extracted week and year on the basis of year and week number
3. Then we ordered the result on the basis of year and week number
4. Then we will find the cumm_active_users using the SUM, OVER and ROW function
between unbounded preceding and current row
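A minimal SQL sketch of the cumulative user-growth query, assuming a users table with occurred_at and user_id columns as described above:
SELECT year_num, week_num,
       SUM(weekly_users) OVER (ORDER BY year_num, week_num
                               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS cumm_active_users
FROM (
    SELECT EXTRACT(YEAR FROM occurred_at) AS year_num,
           EXTRACT(WEEK FROM occurred_at) AS week_num,
           COUNT(DISTINCT user_id)        AS weekly_users
    FROM users
    GROUP BY year_num, week_num
) AS weekly
ORDER BY year_num, week_num;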
Output /Result
33
Operation Analytics and
Investigating Metric Spike
Investigating Metric Spike
Findings - II (Cont...)
Output /Result
34
Operation Analytics and
Investigating Metric Spike
Investigating Metric Spike
Findings - III
The weekly retention of the user sign-up cohort can be calculated in two ways, i.e. either for a specific week number (18 to 35) or over the entire occurred_at column of the events table.
1. Firstly we will extract the week from occurred_at column using the extract, week
functions
2. Then, we will select out those rows in which event_type = 'signup_flow’ and
event_name = 'complete_signup’
3. If finding for a specific week we will specify the week number using the extract
function
4. Then using the left join we will join the two tables on the basis of user_id where
event_type = 'engagement’
5. Then we will use the Group By function to group the output table on the basis of
user_id
6. Then we will use the Order By function to order the result table on the basis of
user_id
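A minimal SQL sketch of the retention query, assuming an events table with user_id, occurred_at, event_type and event_name columns as described above (the self-join is one way to pair each sign-up with that user's later engagement events):
SELECT s.user_id,
       EXTRACT(WEEK FROM s.occurred_at)                  AS signup_week,
       COUNT(DISTINCT EXTRACT(WEEK FROM e.occurred_at))  AS weeks_with_engagement
FROM events s
LEFT JOIN events e
       ON e.user_id = s.user_id
      AND e.event_type = 'engagement'
WHERE s.event_type = 'signup_flow'
  AND s.event_name = 'complete_signup'
GROUP BY s.user_id, signup_week
ORDER BY s.user_id;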
Output /Result
Trainity_task_3_case_stuy_2_question_c.csv - Google Drive
35
Operation Analytics and
Investigating Metric Spike
Investigating Metric Spike
Findings - III (Cont...)
Output /Result
Trainity_task_3_case_stuy_2_question_c_18_week.csv - Google Drive
36
Operation Analytics and
Investigating Metric Spike
Investigating Metric Spike
Findings - IV
1. Firstly we will extract the year_num and week_num from the occurred_at column of the events table using the extract, year and week functions.
2. Then we will count the number of distinct user_id values from the events table for each device.
3. Then by using the Group By and Order By functions we will group and order the result on the basis of year_num, week_num and device.
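A minimal SQL sketch of the per-device weekly engagement query (assuming the events table described above):
SELECT EXTRACT(YEAR FROM occurred_at) AS year_num,
       EXTRACT(WEEK FROM occurred_at) AS week_num,
       device,
       COUNT(DISTINCT user_id)        AS weekly_users
FROM events
GROUP BY year_num, week_num, device
ORDER BY year_num, week_num, device;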
Output /Result
37
Operation Analytics and
Investigating Metric Spike
Investigating Metric Spike
Findings - V
The following event names are used for the email engagement metrics:
email_sent = ('sent_weekly_digest', 'sent_reengagement_email')
email_opened = 'email_open'
email_clicked = 'email_clickthrough'
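A minimal SQL sketch of the email-engagement metrics; the email_events table name and its user_id, occurred_at and action columns are assumptions, while the action values are the ones listed above:
SELECT EXTRACT(WEEK FROM occurred_at) AS week_num,
       COUNT(DISTINCT CASE WHEN action IN ('sent_weekly_digest', 'sent_reengagement_email')
                           THEN user_id END) AS email_sent,
       COUNT(DISTINCT CASE WHEN action = 'email_open'
                           THEN user_id END) AS email_opened,
       COUNT(DISTINCT CASE WHEN action = 'email_clickthrough'
                           THEN user_id END) AS email_clicked
FROM email_events
GROUP BY week_num
ORDER BY week_num;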
Output /Result
38
Operation Analytics and
Investigating Metric Spike
Analysis
From the tables and bar plots I have inferred the following:
The number of distinct jobs reviewed per hour per day is 0.0083.
The 7-day rolling average throughput for 25, 26, 27, 28, 29 and 30 Nov 2020 is 1, 1, 1, 1.25, 1.2 and 1.3333 respectively (for both distinct and non-distinct counts).
The percentage share of each language, i.e. Arabic, English, French, Hindi, Italian and Persian, is 12.5, 12.5, 12.5, 12.5, 12.5 and 37.5 respectively (for both distinct and non-distinct counts).
There are 2 duplicate rows, both having job_id = 23 and language = Persian.
Why should one use a 7-day rolling average for calculating throughput and not the daily metric average?
----> For calculating the throughput we use the 7-day rolling average because it averages over all the days from day 1 to day 7, whereas the daily metric gives us the average for only that particular day itself.
Why is it that the percentage share of all other languages is 12.5% but that of language = 'Persian' is 37.5%?
-----> In such cases there are two possibilities: either there were duplicate rows with the language 'Persian', or there really were two or more unique people speaking Persian.
39
Operation Analytics and
Investigating Metric Spike
Analysis (Cont...)
From the tables and bar plots I have inferred the following:
The weekly user engagement is the highest for week 31, i.e. 1685.
There are in total 9381 active users from the 1st week of 2013 to the 35th week of 2014.
40
Operation Analytics and
Investigating Metric Spike
Conclusion
41
Hiring Process Analytics
Description
You are working as a lead Data Analyst for an MNC such as Google. The company has provided the data records of its previous hirings and has asked you to answer certain questions by making sense of that data.
42
Hiring Process Analytics
The Problem
Design
Then I imputed the blank and NULL numerical cells with the mean of the column (if no outliers existed for that particular column) or with the median (if outliers existed for that column).
Then I looked for any outliers and replaced them with the median of the particular column in which the outlier existed.
44
Hiring Process Analytics
Findings - I
From the above table and bar plot I have inferred that:
There are 2563 males hired for different roles in the company, while there are only 1855 females hired for different roles in the company.
45
Hiring Process Analytics
Findings - II
Output/Result
49983.03223
46
Hiring Process Analytics
Findings - III
From the above Bar plot I have inferred that the highest number
of posts (both hired and rejected) is 406 for the salary range
41807 to 46907
From the above Bar plot I have inferred that the highest number
of posts (hired) is 315 for the salary range 42307 to 54107
47
Hiring Process Analytics
Findings - IV
48
Hiring Process Analytics
Findings - IV (Cont...)
From the above table, pie chart and Bar Plot I have inferred that
the Highest number of people were working in the Operations
Department i.e. 1843 which accounts for almost 39% of the total
workforce of the company
49
Hiring Process Analytics
Findings - V
50
Hiring Process Analytics
Findings - V (Cont...)
From the above table, Bar plot and Pie chart I have inferred that
the c9 post has the highest number of openings i.e. 1792 which
accounts for 25% of the total job openings of the company/firm
51
Hiring Process Analytics
Analysis
52
Hiring Process Analytics
Conclusion
For any company there will be some employees who have higher salary packages compared to other employees, and this is due to the fact that they have special skills and years of experience in their particular field of work.
53
IMDB Movie Analysis
Description
For this task, you will need to define a problem you want to shed
some light on.
Once you have defined a problem, clean the data as necessary, and
use your Data Analysis skills to explore the data set and derive
insights.
54
IMDB Movie Analysis
The Problem
Movies with highest profit: Create a new column called profit which
contains the difference of the two columns: gross and budget. Sort
the column using the profit column as reference. Plot profit (y-axis)
vs budget (x-axis) and observe the outliers using the appropriate
chart type.
Your task: Find the movies with the highest profit.
Top 250: Create a new column IMDb_Top_250 and store the top 250
movies with the highest IMDb Rating (corresponding to the column:
imdb_score). Also make sure that for all of these movies, the
num_voted_users is greater than 25,000. Also add a Rank column
containing the values 1 to 250 indicating the ranks of the
corresponding films.
Extract all the movies in the IMDb_Top_250 column which are not in
the English language and store them in a new column named
Top_Foreign_Lang_Film. You can use your own imagination also!
Your task: Find IMDB Top 250
55
IMDB Movie Analysis
The Problem (Cont...)
Append the rows of all these columns and store them in a new
column named Combined.
56
IMDB Movie Analysis
Design
1. Firstly I made a copy of the raw data on which I can perform the analysis, so that whatever changes I make will not affect the original data.
2. Then I dropped the columns which have no use for the analysis that we will be doing.
5. Then we need to get rid of the duplicate values in the dataset, which can be achieved by using the 'Remove Duplicates' option available in the 'Data' tab.
57
IMDB Movie Analysis
Findings - I
1. First, we will create a new column called profit containing the difference of the gross and budget columns, i.e. profit = gross - budget.
2. Then, using the scatter plot option, we will plot the values of profit (y-axis) against budget (x-axis).
58
IMDB Movie Analysis
Findings - I(Cont...)
After removing the outliers, from the above table I have inferred that 'Avatar' was the highest profit-making movie ever, with a profit of 523,505,847.
59
IMDB Movie Analysis
Findings - II
1. First we will keep only the movies whose num_voted_users is greater than 25,000, as required by the problem statement.
2. Then we will arrange the dataset on the basis of imdb_score in descending order.
3. Then we will select only the top 250 rows for the further analysis.
4. Then we will create a new rank column using the RANK() function and the formula
=RANK(N2,$N$2:$N$251,0)+COUNTIFS($N$2:N2,N2)-1
5. Then we will filter out (unselect 'English') from the language column and we will get the desired output.
60
IMDB Movie Analysis
Findings - II (Cont...)
From the above table I have inferred that the movie 'The Good, the Bad and the Ugly' had the highest IMDB rating among movies in all other languages (except English); its country of origin is Italy.
61
IMDB Movie Analysis
Findings - III
To find the top 10 best directors on the basis of the mean imdb_score we will:
3. Add director_name into the series section of the pivot table.
4. Then add the average of imdb_score into the values section of the pivot table.
From the above table I have inferred that Charles Chaplin and Tony Kaye had the highest mean IMDB score, i.e. 8.6.
62
IMDB Movie Analysis
Findings - IV
4. Then we will set the values as the count of the number of genres and sort it in descending order on the basis of that count.
From the above table I have inferred that the genre named 'Drama' was the most popular, with a count of 153.
63
IMDB Movie Analysis
Findings - V
2. Then we will append the above 3 created columns into 1 column named actor_1_name_combine.
4. Then, using a pivot table, we will find the average, sum and count for the critic-favorite and audience-favorite actors.
64
IMDB Movie Analysis
Findings - V(Cont..)
65
IMDB Movie Analysis
Findings - VI
66
IMDB Movie Analysis
Findings - VI (Cont..)
From the above table and bar plot I have inferred that the highest number of votes was in the decade 2001-2010, with a count of 178,592,461.
67
IMDB Movie Analysis
Analysis
Using the 5 Whys approach I am trying to find some useful insights.
Why are the highest-rated IMDB movie and the highest-profit movie not the same?
-----> Maybe because only registered users who know how to vote on IMDB have access to the IMDB portal, whereas the profit is calculated on the basis of the tickets sold in theatres worldwide.
68
IMDB Movie Analysis
Analysis (Cont..)
69
IMDB Movie Analysis
Conclusion
Also, it is not necessary that the movie with the highest IMDB rating will have the highest profit.
Profit is calculated purely on the basis of the number of tickets sold by theatres all over the world.
Most people are tired from their daily routines and prefer movies of the Comedy or Drama genre (or both), and they would not go for movies of the Action/Horror genre.
So, directors and production teams must keep the above points in mind and should do a pre-production analysis before the commencement of filming.
70
Bank Loan Case Study
Description
Loan-providing companies find it hard to give loans to people due to their insufficient or non-existent credit history. Because of that, some consumers use this to their advantage by becoming defaulters. Suppose you work for a consumer finance company which specializes in lending various types of loans to urban customers. You have to use EDA to analyze the patterns present in the data. This will ensure that the applicants capable of repaying the loan are not rejected.
When a client applies for a loan, there are four types of decisions that could be taken by the client/company:
1. Approved: The company has approved the loan application.
2. Cancelled: The client cancelled the application sometime during approval. Either the client changed her/his mind about the loan or, in some cases due to a higher risk profile, received worse pricing which he/she did not want.
3. Refused: The company rejected the loan (because the client does not meet its requirements, etc.).
4. Unused Offer: The loan was cancelled by the client, but at different stages of the process.
71
Bank Loan Case Study
The Problem
This case study aims to give you an idea of applying EDA in a real
business scenario. In this case study, apart from applying the
techniques that you have learnt in the EDA module, you will also
develop a basic understanding of risk analytics in banking and
financial services and understand how data is used to minimize the
risk of losing money while lending to customers.
72
Bank Loan Case Study
Design
Firstly, create a copy of the raw data.
Then the percentage of null values needs to be analyzed, and those columns that have more than 50% null data have to be dropped.
Those columns with less than 50% null data have their missing values replaced with the mean, the median, or the most frequently occurring categorical value.
LANDAREA_MODE
LIVINGAPARTMENTS_MODE
LIVINGAREA_MODE
NONLIVINGAPARTMENTS_MODE
NONLIVINGAREA_MODE
APARTMENTS_MEDIAN
BASEMENTAREA_MEDIAN
YEARS_BUILD_MEDIAN
COMMON_AREA_MEDIAN
ELEVATORS_MEDIAN
ENTRANCES_MEDIAN
FLOORSMAX_MEDIAN
FLOORSMIN_MEDIAN
LANDAREA_MEDIAN
LIVINGAPARTMENTS_MEDIAN
LIVINGAREA_MEDIAN
NONLIVINGAPARTMENTS_MEDIAN
NONLIVINGAREA_MEDIAN
FONDKAPREMONT_MODE
HOUSETYPE_MODE
WALLSMATERIAL_MODE
74
Bank Loan Case Study
Design (Cont...)
Then drop those columns which are irrelevant for the Data Analysis. The following columns need to be dropped:
FLAG_MOBILE
FLAG_EMPLOY_PHONE
FLAG_WORK_PHONE
FLAG_CONT_MOBILE
FLAG_PHONE
FLAG_EMAIL
CNT_FAMILY_MEMBERS
REGION_RATING_CLENT
REGION_RATING_CLENT_W_CITY
EXT_SOURCE_3
YEAR_BEGINEXPLUATATION_AVG
YEAR_BEGINEXPLUATATION_MODE
YEAR_BEGINEXPLUATATION_MEDIAN
TOTAL_AREA_MODE
EMERGENCYSTATE_MODE
DAYS_LAST_PHONE_CHANGE
FLAG DOC 2
FLAG DOC 3
FLAG DOC 4
FLAG DOC 5
FLAG DOC 6
FLAG DOC 7
FLAG DOC 8
FLAG DOC 9
FLAG DOC 10
FLAG DOC 11
FLAG DOC 12
75
Bank Loan Case Study
Design (Cont...)
FLAG DOC 13
FLAG DOC 14
FLAG DOC 15
FLAG DOC 16
FLAG DOC 17
FLAG DOC 18
FLAG DOC 19
FLAG DOC 20
FLAG DOC 21
76
Bank Loan Case Study
Design (Cont...)
WEEKDAY_APPR_PROCESS_START
Removing the rows with the values 'XNA' & 'XAP' for the column:
NAME_TYPE_SUITE
----> Replace blanks with 'Unaccompanied'
77
Bank Loan Case Study
Findings - I
The Target variable pie chart shows that almost 92% of the total clients had no problem during payment, while 8% of the clients had some problem or the other.
78
Bank Loan Case Study
Findings - II
79
Bank Loan Case Study
Findings - III
From the bar graphs of count and percentage, the bank can target those groups who do not have their own apartment, i.e. the bank may consider the people living in co-op apartments, municipal apartments, rented apartments, and those living with their parents.
80
Bank Loan Case Study
Findings - IV
From the above bar plot we can infer that most of the applicants
belong to the Age Group ’31-40’
81
Bank Loan Case Study
Findings - V
82
Bank Loan Case Study
Findings - VI
From the above Bar plot we can infer that clients belonging to ‘Low’ income
range have the highest count when it comes to clients with no payment issues
From the above Bar plot we can infer that clients belonging to ‘Medium’ income
range have the highest count when it comes to clients with payment issues
83
Bank Loan Case Study
Findings - VII
From the above bar plot we can infer that clients with occupation_type 'Laborers' have the highest count when it comes to clients with no payment issues.
From the above bar plot we can infer that clients with occupation_type 'Laborers' have the highest count when it comes to clients with payment issues.
84
Bank Loan Case Study
Findings - VIII
From the above Bar plot we can infer that clients having income_type as
‘WORKING’ have the highest count when it comes to clients with no payment
issues
From the above Bar plot we can infer that clients having income_type as
‘WORKING’ have the highest count when it comes to clients with payment issues
85
Bank Loan Case Study
Findings - IX
From the above Bar plot we can infer that clients having the total income range 'LOW' have the highest count when it comes to clients having no payment issues.
From the above Bar plot we can infer that clients having the total income range 'LOW' have the highest count when it comes to clients having payment issues.
86
Bank Loan Case Study
Findings - X
From the above Bar plot we can infer that clients having total count of family
members as 2 have the highest count when it comes to clients having no
payment issues
From the above Bar plot we can infer that clients having total count of family
members as 2 have the highest count when it comes to clients having payment
issues
87
Bank Loan Case Study
Findings - XI
From the above Bar Plot we can infer that Clients with CODE_GENDER
= ‘F’ have the highest number of non-defaulters i.e. 188278-14170 =
174108
88
Bank Loan Case Study
Findings - XII
From the above Bar Plot we can infer that clients having NAME_INCOME_TYPE = 'WORKING' have the highest count of non-defaulters, i.e. 143550 - 15224 = 128326.
89
Bank Loan Case Study
Findings - XIII
From the above Bar Plot we can infer that clients having NAME_EDUCATION_TYPE = 'SECONDARY/SECONDARY SPECIAL' have the highest count of non-defaulters, i.e. 198867 - 19524 = 179343.
90
Bank Loan Case Study
Findings - XIII
From the adjacent Bar Plot we can infer that clients having NAME_FAMILY_STATUS = 'MARRIED' have the highest count of non-defaulters, i.e. 181582 - 14850 = 166732.
91
Bank Loan Case Study
Findings - XIV
From the above Bar Plot we can infer that clients having
NAME_HOUSING_TYPE = ‘House/Apartment’ have the
highest count of Non-defaulters i.e.
251596-21272 = 230324
92
Bank Loan Case Study
Findings - XV
From the adjacent Bar plot we can infer that clients having
occupation_type = ‘Laborers’ have the highest count for Non-
defaulters i.e.
139461-12116 = 127345
93
Bank Loan Case Study
Findings - XVI
From the above Bar plot we can infer that Females belonging to Low
income group are the highest number of clients with no payment issues
94
Bank Loan Case Study
Findings - XVII
From the above Bar plot we can infer that Females belonging to Low
income group are the highest number of clients with payment issues
95
Bank Loan Case Study
Findings - XVIII
From the above Bar Plot we can infer that clients having
credit amt range as ‘Low’
and education status as ‘Secondary/ Secondary Special’
have the highest count for clients with no payment issues
96
Bank Loan Case Study
Findings - XIX
From the above Bar Plot we can infer that clients having
credit amt range as ‘Medium’
and education status as ‘Secondary/ Secondary Special’
have the highest count for clients with payment issues
97
Bank Loan Case Study
Findings - XX
From the above Bar plot we can infer that clients with
total_income_range as ‘Low’ and family_status as ‘Married’
have the highest count for clients having no payment issues
98
Bank Loan Case Study
Findings - XXI
From the adjacent Bar plot we can infer that clients with
total_income_range as ‘Low’ and family_status as
‘Married’
have the highest count for clients having payment issues
99
Bank Loan Case Study
Findings - XXII
100
Bank Loan Case Study
Findings - XXII (Cont...)
From the above table and Bar Plot we can infer that the loan purpose 'Repairs' has the highest count of approved loans.
101
Bank Loan Case Study
Analysis
Using the 5 Whys approach I am trying to find some more useful insights.
Why is the target variable of so much importance?
---> In this dataset the target variable represents whether the client had payment issues (1) or did not have payment issues (0). It is important because the target variable helps decide whether the bank should increase or decrease its interest rates on the various loans it offers. Also, in this case almost 92% of the clients did not have any payment issues and only 8% of them did, which tells us that the bank's credit quality is good and it has very few or no non-performing accounts.
Why is it that females in the low-income group have the lowest count of defaulters?
----> Females belonging to such groups take loans of small amounts just for starting their own start-ups, businesses or catering/parlour services, and they usually benefit from government schemes for such purposes.
103
Bank Loan Case Study
Conclusion
In conclusion, I would like to highlight the following:
The proportion/percentage of defaulters (target = 1) is around 8% and that of non-defaulters (target = 0) is around 92%.
The bank generally lends more to female clients as compared to male clients, as the count of female clients in the defaulters' list is lower than that of males. Still, the bank can look for more male clients if their credit criteria are satisfied.
Also, clients who belong to the working class tend to pay their loans on time, followed by clients who fall under commercial associate.
Clients having an education status of Secondary/Higher Secondary or above tend to pay their loans on time, so the bank can prefer lending to clients having such an education status.
Clients who fall in the age group 31-40 have the highest count for paying off their loans on time, followed by the clients who fall in the age group 41-60.
Clients having a LOW credit amount range tend to pay off their loans on time compared to those in the HIGH and MEDIUM credit ranges.
Clients living with their parents tend to pay off their loans quickly as compared to other housing types, so the bank can lend to clients having the housing type 'Living with Parents'.
Clients taking loans for purchasing a new home (home loans) or a new car (car loans), and clients who have the income type State Servant, tend to pay their loans on time, and hence the bank should prefer clients having such a background.
The bank should be more cautious when lending money to clients with the 'Repairs' purpose, because they have a high count of defaulters.
104
XYZ Ads Airing Report
Description
105
XYZ Ads Airing Report
The Problem
What is Pod Position? Does the Pod position number affect the
amount spent on Ads for a specific period of time by a company?
106
XYZ Ads Airing Report
Design
1. Firstly I made a copy of the raw data on which I can perform the analysis, so that whatever changes I make will not affect the original data.
2. Then I dropped the columns which have no use for the analysis that we will be doing.
4. Then we need to get rid of the duplicate values in the dataset, which can be achieved by using the 'Remove Duplicates' option available in the 'Data' tab.
107
XYZ Ads Airing Report
Findings - I
→So, in the above case the Pod positions of brands AMUL, Big Basket and
108
XYZ Ads Airing Report
Findings - I (Cont..)
109
XYZ Ads Airing Report
Findings - I (Cont..)
110
XYZ Ads Airing Report
Findings - II
From the above bar plot I have inferred that the brand named ‘Maruti
Suzuki’ has the highest share in each Quarter for TV Airings.
111
XYZ Ads Airing Report
Findings - III
112
XYZ Ads Airing Report
Findings - III (Cont...)
113
XYZ Ads Airing Report
Findings - III (Cont...)
114
XYZ Ads Airing Report
Findings - III (Cont...)
115
XYZ Ads Airing Report
Findings - III (Cont...)
For the cable type network the brand ‘Maruti Suzuki’ has the
highest share for all the DayParts of TV Ads
Also, the brand 'Maruti Suzuki' spent the highest sum of money on TV Ads Airings in all quarters (Q1, Q2, Q3 and Q4) for both network types: Broadcast and Cable.
The brand 'Maruti Suzuki' has the highest share and also spent the most on TV Ads Airings for all days of the week.
116
XYZ Ads Airing Report
Findings - IV
117
XYZ Ads Airing Report
Findings - IV (Cont...)
118
XYZ Ads Airing Report
Findings - IV (Cont...)
119
XYZ Ads Airing Report
Findings - IV (Cont...)
120
XYZ Ads Airing Report
Findings - IV (Cont...)
121
XYZ Ads Airing Report
Findings - V
122
XYZ Ads Airing Report
Findings - V (Cont...)
123
XYZ Ads Airing Report
Findings - V (Cont...)
Also most of the share in TV Ads Airing for each brand was on
Saturday Weekend show daypart
124
XYZ Ads Airing Report
Analysis
125
XYZ Ads Airing Report
Analysis (Cont...)
Why is it that most companies bid for Ads during the afternoon and evening time and less in the morning time?
---> Most people are in a hurry on weekday mornings to reach their office and don't have time to watch the Ads; they just read the main headlines and some don't even switch on their television. During the afternoon break people have some free time in the office, and in the evening/night, after having dinner, people can sit back and enjoy the news along with the Ads. So most companies bid for Ads during the evening and afternoon time and less during the morning time.
126
XYZ Ads Airing Report
Analysis (Cont...)
127
XYZ Ads Airing Report
Conclusions
In conclusion, I would like to highlight the following:
The POD position of different brands has some sort of relation with the amount spent. Firstly, the amount spent for a POD position increases up to a certain POD position; as the POD position tends towards 31 there is a gradual decrease in the amount spent for some brands, while for other brands the amount spent per POD position decreases drastically.
For a brand like Honda the avg_amt_spent is the highest, or is at its peak, for POD position around 10.
For a brand like Hyundai Motors the avg_amt_spent is the highest, or is at its peak, for POD positions around 20 and 22.
For a brand like Mahindra and Mahindra the avg_amt_spent is the highest, or is at its peak, for POD position around 26.
For a brand like Maruti Suzuki the avg_amt_spent is the highest, or is at its peak, for POD position around 19.
For a brand like TATA Motors the avg_amt_spent is the highest, or is at its peak, for POD positions around 25 and 27.
For a brand like Toyota the avg_amt_spent is the highest, or is at its peak, for POD positions around 18, 21 and 23.
We can infer from the bar plots and line plots that from POD position 28 onwards there is very little avg_amt_spent by any of the brands.
The brand 'Maruti Suzuki' had the highest Ads proportion in all the quarters, i.e. 38.78% in Q1, 37.31% in Q2, 36.55% in Q3 and 41.10% in Q4.
For the brand 'Honda' we can infer that it showed a decline in TV Ads Airings from Q1 (12.44%) to Q2 (9.77%), then from Q2 (9.77%) to Q3 (12.99%) it increased, and then from Q3 (12.99%) to Q4 (11.29%) it decreased again.
128
XYZ Ads Airing Report
Conclusions (Cont..)
For the brand 'Hyundai Motors' we can infer that it showed a decline in TV Ads Airings from Q1 (10.48%) to Q2 (9.84%), then from Q2 (9.84%) to Q3 (9.17%) it declined again, and then from Q3 (9.17%) to Q4 (9.23%) it showed some increase.
For a brand like 'Mahindra and Mahindra', it showed an increase in TV Ads Airings from Q1 (19.71%) to Q2 (24.01%), then from Q2 (24.01%) to Q3 (22.05%) it showed some decline, and then from Q3 (22.05%) to Q4 (13.57%) it showed a sharp decline.
For a brand like 'Maruti Suzuki', it showed a decline in TV Ads Airings from Q1 (38.78%) to Q2 (37.31%), then from Q2 (37.31%) to Q3 (36.55%) it again showed some decline, and then from Q3 (36.55%) to Q4 (40.10%) it showed a great increase of almost 5%.
For a brand like 'TATA Motors', it showed some decline in TV Ads Airings from Q1 (10.12%) to Q2 (7.62%), then from Q2 (7.62%) to Q3 (8.03%) it showed an increase, and then from Q3 (8.03%) to Q4 (20.93%) it showed spectacular growth of almost 13 percentage points.
For a brand like 'Toyota', it showed an increase in TV Ads Airings from Q1 (8.46%) to Q2 (11.45%), then from Q2 (11.45%) to Q3 (11.21%) it showed some decline, and then from Q3 (11.21%) to Q4 (3.87%) it showed a significant decline of over 7 percentage points.
→ From the comparative bar plots and tables we can infer that:
The brand 'Maruti Suzuki' has the highest share of TV Ads in both network types, viz. Broadcast (37.53%) and Cable (38.37%).
The avg_amt_spent on the broadcast type network is the highest for the brand 'Hyundai Motors India', i.e. $18,078, and on the cable type network it is the highest for the brand 'Mahindra and Mahindra', i.e. $1,612.
129
XYZ Ads Airing Report
Conclusions (Cont..)
For the broadcast type network the brand ‘Maruti Suzuki’ has the
highest share for the Early Morning(35.01%), Evening
News(36.33%), Late Fringe(52.88%), Overnight(49.68%), Prime
Access(50.38%), Prime Time(39.49%), Weekend(40.89%), but for
Daytime and Early Fringe the brand ‘Honda Cars’ has the highest
share i.e. 33.03% and 44.95% respectively.
For the cable type network the brand ‘Maruti Suzuki’ has the
highest share for all the DayParts of TV Ads
Also the brand ‘Maruti Suzuki’ has the highest share in TV Ads
Airings in all Quarters(Q1,Q2,Q3 and Q4) for both the network
types: Broadcast type and Cable Type
Also the brand ‘Maruti Suzuki’ spent the highest sum of amount in
TV Ads Airings in all Quarters(Q1,Q2,Q3 and Q4) for both the
network types: Broadcast type and Cable Type
The brand ‘Maruti Suzuki’ has the highest share and also spent the
most for TV Ads Airings for all days of week
→ For the brand 'Mahindra and Mahindra', from the bar plots and tables we can infer that:
Most of the share of TV Ads comes from the Late Fringe daypart (40%) for Northern India, Prime Time (42.86%) for Northern India and Daytime (22.80%) for North East India.
Most of the share of TV Ads is on Friday, Saturday and Sunday; on Sunday the share for Northern India is 74.29%.
In case of Cable network the share of Central India is 40.49%;
share of North East India is 90.70%; share of Northern India is 0%
and share of Southern India is 93.41%
In case of Broadcast network the share of Central India is 59.51%;
share of North East India is 9.30%; share of Northern India is 100%
and share of Southern India is 6.59%
130
XYZ Ads Airing Report
Conclusions (Cont..)
In Q1 Prime Time has the highest share from Northern India i.e.
14.29%
In Q2 Late Fringe has the highest share from Northern India i.e.
8.57%
In Q3 again Late Fringe has the highest share from Northern India
i.e. 28.57%
In Q4 Prime Time has the highest share from Northern India i.e.
8.57%
So, the CMO can select a tactic which can compete with the above conditions on a daily, weekly, monthly as well as quarterly basis.
Also, most of the share of TV Ads Airing for each brand was in the Saturday Weekend show daypart.
Most viewers have a day off on Sunday and many of them watch TV late into the night on Saturday, and so most of the brands bid a huge amount for POD positions in these Saturday weekend shows.
Also, if some viewers want more details about a particular model of a brand they came across during the Ad break of a Saturday Weekend show, they can visit the nearest showroom of that brand and even purchase the model if they like it, which would bring profits to the brand.
So, most of the brands find it profitable to bid on POD positions in the Saturday Weekend shows.
131
ABC Call Volume Trend Analysis
Description
A customer experience (CX) team consists of professionals who analyze
customer feedback and data, and share insights with the rest of the
organization. Typically, these teams fulfil various roles and
responsibilities such as: Customer experience programs (CX programs),
Digital customer experience, Design and processes, Internal
communications, Voice of the customer (VoC), User experiences,
Customer experience management, Journey mapping, Nurturing customer
interactions, Customer success, Customer support, Handling customer
data, Learning about the customer journey.
132
ABC Call Volume Trend Analysis
The Problem
Calculate the average call time duration for all incoming calls received by
agents (in each Time_Bucket).
Show the total volume/ number of calls coming in via charts/ graphs [Number
of calls v/s Time]. You can select time in a bucket form (i.e. 1-2, 2-3, …..)
Let's say customers also call this ABC insurance company at night but don't get an answer, as there are no agents available; this creates a bad customer experience for the insurance company. Suppose that for every 100 calls customers make during 9 AM to 9 PM, they also make 30 calls at night in the interval [9 PM to 9 AM], and the distribution of those 30 calls is as follows:
Now propose a manpower plan required during each time bucket in a day. The maximum abandon rate assumption remains the same, 10%.
133
ABC Call Volume Trend Analysis
Findings - I
From the above bar plot we can infer that the time bucket 19_20, i.e. 7 PM to 8 PM, had the highest average duration of answered calls, i.e. 203.4 seconds.
134
ABC Call Volume Trend Analysis
Findings - II
From the above Bar plot we can infer that the time bucket 12_13, i.e. 12 PM to 1 PM, had the highest total duration of answered calls, i.e. 1,819,327 seconds.
135
ABC Call Volume Trend Analysis
Findings - III
From the above bar plot we can infer that the time_bucket 12-13 i.e.
12PM to 1PM had the highest count of calls answered i.e. 9432
136
ABC Call Volume Trend Analysis
Findings - IV
From the above bar plot we can infer that the time bucket 11_12, i.e. 11 AM to 12 PM, has the highest count of total incoming calls, i.e. 14626.
137
ABC Call Volume Trend Analysis
Findings - V
From the above bar plot we can infer that the time bucket 11_12 i.e.
11 AM to 12 PM has the highest share for incoming calls i.e. 12.40%
138
ABC Call Volume Trend Analysis
Findings - VI
139
ABC Call Volume Trend Analysis
Findings - VII
140
ABC Call Volume Trend Analysis
Findings - VIII
The table above shows the desired distribution of the night calls to keep the
abandon rate at 10%
• Also agents who work during 9_10, 10_11 time bucket can be asked to work for
7_8 and 8_9 time bucket as well
• The agents who work in the time bucket 1_2, 2_3, 3_4 and 4_5 can be asked to
work in time buckets 6_7, 7_8 and 8_9 so as to keep the abandon rate at 10%
141
ABC Call Volume Trend Analysis
Analysis
Why is it that the time bucket 11_12 has the highest number of incoming calls but does not have the highest number of answered calls?
---> Maybe there was a higher number of incoming calls in the time bucket 11_12 and there were not enough personnel to handle most of the customer queries during that bucket.
142
ABC Call Volume Trend Analysis
Analysis (Cont...)
143
ABC Call Volume Trend Analysis
Analysis (Cont...)
144
ABC Call Volume Trend Analysis
Conclusion
In conclusion, I would like to highlight the following:
From the previous analysis we can derive that the average duration of an answered call is 198.6 seconds across the time buckets.
We need to reduce the abandon rate from 30% (current) to 10% (desired), i.e. we need to increase the answer rate from 70% (current) to 90%. So, 90% of the total incoming calls need to be answered in order to bring the abandon rate down to 10%.
Total average incoming calls per day = 5130
Average duration of an answered call = 198.6 seconds
Required answer rate = 90%, i.e. 0.9
Seconds per hour = 3600
So, the agent-hours required to answer 90% of the incoming calls = 5130 * 198.6 * 0.9 / 3600 = 254.7, i.e. approximately 255 hours per day.
So, the new total number of agents working per day is 255 divided by the number of hours an agent actually works on consumer calls, i.e. 4.5: 255 / 4.5 = 56.67, which rounds up to 57 agents working per day.
So, to have a 10% abandon rate we need 57 Agents working per
day
From the assumptions given, the following points were noted:
In a day an agent works for 9 hours, so the total agent working time is 9 hours.
Out of the total 9 hours, 1.5 hours go to lunch and coffee/tea breaks, so the remaining working time = 9 - 1.5 = 7.5 hours.
Out of the remaining 7.5 hours per day, an agent is occupied with consumer calls for only 60% of the time, i.e. 0.6 * 7.5 = 4.5 hours. So an agent spends only 4.5 hours per day out of the total 7.5 hours on consumer calls.
An agent works 6 days a week; in a month of 30 days there are about 4 weeks, which at 7 days per week gives a total of 28 days, out of which 4 days are unplanned leave.
145
ABC Call Volume Trend Analysis
Conclusion (Cont...)
146
Appendix
Data Analytics Process:-
---> Link for the shared PDF on Google Drive:
Data Analytics Trainee Assignement - 1.pdf - Google Drive
Trainity_Data_Analytics_Trainee/task3_case_sudy_2_Investigating_Metric_Spike.sql at main · ADVAIT135/Trainity_Data_Analytics_Trainee · GitHub
task3_case_sudy_2_Investigating_Metric_Spike.sql - Google Drive
148
Appendix (Cont...)
Link to GitHub Portfolio:-
ADVAIT135 (ADVAIT GURUNATH CHAVAN) · GitHub
149