Data Analysis Portfolio

Advait Chavan
Professional Background
I am currently in the final year of my B.E. in Electronics
Engineering. I have secured a 9.36 CGPA (up to the 6th semester) and
have several skills including Data Analysis, Python, PSpice
simulation, Multisim, Machine Learning, MySQL and
Excel.

I have worked on various personal projects related to
Data Analysis, Machine Learning, Excel and Python.
I have also participated in various challenges and
competitions on Kaggle and earned a decent rank on
the Kaggle leaderboard.

I have also worked on a Book Recommendation Engine
based on the collaborative filtering technique, as part of
a mini project in Semester 6.
Currently, I am working on Sign Language Detection for
deaf and mute people, which is a group project for the
final year.

As a fresher, it would be great to experience the real
challenges of the corporate world and understand how
things work. Being a fresher, I think I am very flexible and
quick to adapt to new things. I have theoretical
knowledge and am eager to apply it in a practical way,
and I believe that by putting in significant effort, I will learn.

1
Table of Contents
Professional Background ------------------- 1
Table of Contents ------------------- 2-3
Data Analytics Process
Description ------------------- 4
Design ------------------- 5
Conclusions ------------------- 6
Instagram User Analytics
Description ------------------- 7
The Problem ------------------- 8-9
Design ------------------- 10
Findings ------------------- 11-18
Analysis ------------------- 19-20
Conclusions ------------------- 21
Operation Analytics and Investigating Metric Spike
Description ------------------- 22
The Problem ------------------- 23-24
Design ------------------- 25
Findings ------------------- 26-38
Analysis ------------------- 39-40
Conclusions ------------------- 41
Hiring Process Analytics
Description ------------------- 42
The Problem ------------------- 43
Design ------------------- 44
Findings ------------------- 45-51
Analysis ------------------- 52
Conclusions ------------------- 53

2
Table of Contents (Cont..)
IMDB Movies Analysis
Description ------------------- 54
The Problem ------------------- 55-56
Design ------------------- 57
Findings ------------------- 58-67
Analysis ------------------- 68-69
Conclusions ------------------- 70
Bank Loan Case Study
Description ------------------- 71
The Problem ------------------- 72
Design ------------------- 73-77
Findings ------------------- 78-101
Analysis ------------------- 102-103
Conclusions ------------------- 104
XYZ Ads Airing Report
Description ------------------- 105
The Problem ------------------- 106
Design ------------------- 107
Findings ------------------- 108-124
Analysis ------------------- 125-127
Conclusions ------------------- 128-131
ABC Call Volume Trend
Description ------------------- 132
The Problem ------------------- 133
Findings ------------------- 134-141
Analysis ------------------- 142-144
Conclusions ------------------- 145-146
Appendix ------------------- 147-149

3
Data Analytics Process

Description

We use Data Analytics in everyday life without even knowing it.


Your task is to give the example(s) of such a real-life situation where we use Data Analytics and
link it with the data analytics process.

4
Data Analytics Process
Design
Scenario: Going from Andheri (place in Mumbai) to Panvel (Place in Navi Mumbai)

If a person wants to go from one place to another, say Andheri to Panvel in our case, he/she may have
many options during the transit.
But he/she will choose only the best option according to his/her preference.
The person will go through all the available options and will do a step-by-step analysis while
selecting the best route from his/her current location to the desired location.

The following steps would be taken by the person while making the right decision:
Step 1: Plan: -- First, one will decide which mode of transport is most preferable, i.e.
roadway or railway. For the roadway, one has two options: either take an Uber,
Ola, Meru, etc. or a BEST/NMMT bus. For the railway, one has three options, viz. the Western line, Harbour
line and Central line.

Step 2: Prepare: -- If one opts for the roadway, he/she has two options depending upon how much
he/she is willing to pay, viz. Uber/Ola/Meru or BEST/NMMT. If one opts for the railway, he/she has
three class options (First class, AC class and Second class) and can select
which railway line route to take.

Step 3: Process: -- One needs to think about which of the above-mentioned routes will
make his/her journey fast, comfortable as well as cost-effective.

Step 4: Analyze: -- It's obvious that if one is travelling alone with little or no extra luggage, he/she
will not opt to pay Rs. 800-1500 for an Uber/Ola/Meru cab along with the toll charges (at Vashi on the
Mumbai-Pune Expressway). The time taken by road to reach Panvel from Andheri is almost 1.5-2
hours. The same transit can be done via railway in just 50 minutes to 1.5 hours (depending upon
whether the train is fast or slow) [avoiding trains on Sundays due to mega-blocks]. One can opt for a first-
class ticket, which costs around Rs. 100, or go for a second-class ticket, which costs just
Rs. 20.

Step 5: Share: -- After going through the above-mentioned steps, one can book a direct
cab through the respective app (Ola, Uber, Meru, etc.). Else one can go to the station and ask
for the ticket (first-class or second-class) from the ticket vendor at the ticket window, use an
ATVM (Automatic Ticket Vending Machine) to buy the ticket via an ATVM card or just via
UPI payment, or book the ticket directly via the UTS app.

Step 6: Act: -- After buying the ticket/booking the cab, one takes his/her seat (which is
sometimes not available in the trains due to crowds/peak hours) and enjoys the ride till
he/she reaches the destination.

5
Data Analytics Process

Conclusions

Hence, we have seen how we can use the 6 steps of Data Analytics while making any decision
in real-life scenarios (finding the best transit route from Andheri to Panvel in our case).

The 6 steps used to take decisions in real-life scenarios are:


Plan
Prepare
Process
Analyze
Share
Act

6
Instagram User Analytics

Description

User analysis is the process by which we track how users


engage and interact with our digital product (software or
mobile application) in an attempt to derive business insights
for marketing, product & development teams.

These insights are then used by teams across the business to


launch a new marketing campaign, decide on features to build
for an app, track the success of the app by measuring user
engagement and improve the experience altogether while
helping the business grow.

You are working with the product team of Instagram and the
product manager has asked you to provide insights on the
questions asked by the management team.

7
Instagram User Analytics
The Problem
A) Marketing: The marketing team wants to launch some
campaigns, and they need your help with the following

Rewarding Most Loyal Users: People who have been using the
platform for the longest time. Your Task: Find the 5 oldest
users of Instagram from the database provided

Remind Inactive Users to Start Posting: By sending them


promotional emails to post their 1st photo. Your Task: Find
the users who have never posted a single photo on Instagram

Declaring Contest Winner: The team started a contest in which the
user who gets the most likes on a single photo wins, and
now they wish to declare the winner. Your Task:
Identify the winner of the contest and provide their details to
the team

Hashtag Researching: A partner brand wants to know which
hashtags to use in their posts to reach the most people on the
platform. Your Task: Identify and suggest the top 5 most
commonly used hashtags on the platform

Launch AD Campaign: The team wants to know which day
would be the best day to launch ads. Your Task: What day of
the week do most users register on? Provide insights on when
to schedule an ad campaign

8
Instagram User Analytics
The Problem (Cont..)
B) Investor Metrics: Our investors want to know if Instagram
is performing well and is not becoming redundant like
Facebook, they want to assess the app on the following
grounds

User Engagement: Are users still as active and posting on
Instagram, or are they making fewer posts? Your Task:
Provide how many times the average user posts on
Instagram, i.e. the total number of photos on
Instagram divided by the total number of users

Bots & Fake Accounts: The investors want to know if the
platform is crowded with fake and dummy accounts. Your
Task: Provide data on users (bots) who have liked every
single photo on the site (since no normal user would
be able to do this).

9
Instagram User Analytics
Design

Steps taken to load the data into the database:

Using the CREATE DATABASE statement of MySQL, create a database
Then add tables and column names
Then add the values into them using the INSERT INTO statement of MySQL
By using the SELECT statement we can query the desired output

Software used for querying the results


--> MySQL Workbench 8.0 CE
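As a rough illustration of these loading steps (the database, table and column names below are placeholders, not the exact schema used), the flow in MySQL looks like this:

CREATE DATABASE ig_clone;          -- create the database
USE ig_clone;

CREATE TABLE users (               -- add a table with its columns
    id INT PRIMARY KEY,
    username VARCHAR(255),
    created_at DATETIME
);

INSERT INTO users (id, username, created_at)   -- add the values
VALUES (1, 'sample_user', '2017-05-01 10:00:00');

SELECT * FROM users;               -- query the desired output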

10
Instagram User Analytics
Findings - I

To find the most loyal i.e. the top 5 oldest users of Instagram:
1. We will use the data from the users table by selecting the username
and created_at columns.

2. Then using the order by function we will order the desired output by
sorting with the created_at column in ascending order.

3. Then using the limit function, the output will be displayed for top 5
oldest Instagram users.
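A minimal sketch of the query described above (assuming the users table with username and created_at columns, MySQL 8.0):

SELECT username, created_at
FROM users
ORDER BY created_at ASC   -- earliest registrations first
LIMIT 5;                  -- keep only the 5 oldest users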

Output/Result

11
Instagram User Analytics
Findings - II

To Find the most inactive users i.e. the users who have never posted a
single photo on Instagram:

1. We will first select username column from the users table.

2. Then we will LEFT JOIN the photos table onto the users table on users.id =
photos.user_id, because users.id and photos.user_id refer to the same users.

3. Then we will find rows from the users table where the photos.id IS NULL
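A minimal sketch of this query, assuming the users and photos tables described above:

SELECT u.username
FROM users u
LEFT JOIN photos p ON u.id = p.user_id   -- keep every user, match photos where they exist
WHERE p.id IS NULL;                      -- users with no matching photo have never posted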
Output/Result

12
Instagram User Analytics
Findings - III

To find the winning photo's username, photo_id, image_url and
total_number_of_likes on that image:

1. First we will select the users.username, photos.id, photos.image_url and


count(*) as total

2. Then, we will inner join the three tables, viz. photos, likes and users, on
likes.photo_id = photos.id and photos.user_id = users.id

3. Then, by using group by function we will group the output on the basis of
photos.id

4. Then, using the ORDER BY function we will sort the data on the basis of the total
in descending order

5. Then, to find the most liked photo we will use the LIMIT function to view only the
top liked photo's information
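A minimal sketch of this query (the grouping columns are listed explicitly so the statement is valid under MySQL's default ONLY_FULL_GROUP_BY mode):

SELECT u.username, p.id AS photo_id, p.image_url, COUNT(*) AS total
FROM photos p
INNER JOIN likes l ON l.photo_id = p.id
INNER JOIN users u ON p.user_id = u.id
GROUP BY p.id, u.username, p.image_url   -- one row per photo
ORDER BY total DESC                      -- most-liked photo first
LIMIT 1;                                 -- the contest winner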

Output/Result

13
Instagram User Analytics
Findings - IV

To find the top 5 most commonly used hashtags on Instagram:

1. We need to select the tag_name column from the tags table along with COUNT(*) AS
total so as to count the number of times each tag has been used.

2. Then, we need to join the tags table and the photo_tags table on tags.id =
photo_tags.tag_id because they refer to the same tag ids.

3. Then using the group by function we need to group the desired output on the
basis of tags.tag_name

4. Then using the ORDER BY function we need to sort the output on the basis of
total (the total number of uses per tag_name) in descending order

5. Then, to find the top 5 most used tag names we will use the limit 5 function.
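A minimal sketch of this query, assuming the tags and photo_tags tables described above:

SELECT t.tag_name, COUNT(*) AS total
FROM tags t
INNER JOIN photo_tags pt ON t.id = pt.tag_id   -- one row per tag usage
GROUP BY t.tag_name
ORDER BY total DESC
LIMIT 5;                                       -- top 5 most used hashtags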

Output/Result

14
Instagram User Analytics
Findings - V
To find the day of week on which most users register on Instagram:

1. First we define the columns of the desired output table using select
dayname(created_at) as day_of_week and count(*) as
total_number_of_users_registered from the users table

2. Then using the group by function we group the output table on the basis of
day_of_week

3. Then using the order by function we order/sort the output table on the basis of
total_number_of_users_registered in descending order
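A minimal sketch of this query:

SELECT DAYNAME(created_at) AS day_of_week,
       COUNT(*) AS total_number_of_users_registered
FROM users
GROUP BY day_of_week
ORDER BY total_number_of_users_registered DESC;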

Output/Result

15
Instagram User Analytics
Findings - VI
To find how many times the average user posts on Instagram:
1. First, we need to find the count of photos (posts) present
in the photos.id column of the photos table, i.e. COUNT(*) FROM photos

2. Similarly, we need to find the number of users that are present in the users.id
column of the users table i.e. count(*) from users

3. Next, we need to divide the two values, i.e. COUNT(*) FROM photos / COUNT(*) FROM
users, which gives the total number of photos / total number of users

4. To find how many times each user posts on Instagram we need to find the total
occurrences of each user_id in the photos table
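A minimal sketch of these two queries:

-- total photos divided by total users gives the average posts per user
SELECT (SELECT COUNT(*) FROM photos) / (SELECT COUNT(*) FROM users) AS avg_posts_per_user;

-- number of posts made by each individual user
SELECT user_id, COUNT(*) AS posts_per_user
FROM photos
GROUP BY user_id;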

Output/Result

16
Instagram User Analytics
Findings - VI (Cont...)

17
Instagram User Analytics
Findings - VII
To find the bots and fake accounts :
1. First, we select the user_id column from the photos table

2. Then we select the username column from the users table

3. Then, we select the count(*) function to count total number of likes from the
likes table

4. Then we inner join users and likes table on the basis of users.id and
likes.user_id, using the on function/clause

5. Then by using the group by function we group the desired output table on the
basis of likes.user_id

6. Then, using the HAVING clause, we keep only those users whose total_likes_per_user
equals COUNT(*) FROM photos, i.e. users who have liked every photo
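A minimal sketch of this query:

SELECT u.id, u.username, COUNT(*) AS total_likes_per_user
FROM users u
INNER JOIN likes l ON u.id = l.user_id
GROUP BY u.id, u.username
HAVING total_likes_per_user = (SELECT COUNT(*) FROM photos);  -- liked every single photo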

Output/Result

18
Instagram User Analytics
Analysis
After performing the analysis I have the following points:-
The most loyal users i.e. the top 5 oldest users are:

Out of the 100 total users there are 26 users who are inactive and have
never posted anything on Instagram, be it a photo, video or any type of
text. So, the marketing team of Instagram needs to remind such inactive
users.

The user named Zack_Kemmer93 with user_id 52 is the winner of the
contest because his photo with photo_id 145 has the highest number of
likes, i.e. 48

The top 5 most commonly used #hashtags along with the total count are
smile(59), beach(42), party(39), fun(38) and concert(24)

Most of the users registered on Thursday and Sunday (16 each), and hence it
would prove beneficial to start the ad campaign on these two days

There are in total 257 rows i.e. 257 photos in the photos table and 100
rows i.e. 100 ids in the users table, which makes the desired output
257/100 = 2.57 (average posts per user on Instagram)

Out of the total user ids there are 13 user ids who have liked each
and every post on Instagram (which is not practically possible), and so
such user ids are considered bots and fake accounts

19
Instagram User Analytics
Analysis (Cont...)
Using the 5 Whys approach I am finding the root cause of the
following:-
Why did the marketing team want to know the most inactive
users?
--> So they can reach out to those users via mail and ask them
what is keeping them away from using Instagram.

Why did the marketing team want to know the top 5 #hashtags
used?
--> Maybe the tech team wanted to add some filter features for
photos and videos posted using the top 5 mentioned #hashtags

Why did the marketing team want to know on which day of the
week the platform had the most new users registered?
--> So that they can run more ads of various brands on such
days and also profit from it

Why did the investors want to know the average number of posts per
user on Instagram?
--> It is a fact that every brand or social platform is judged by
the user engagement on the platform; the investors also wanted to
know whether the platform has the right and authenticated user
base. It also helps the tech team determine how to handle the
traffic on the platform with the latest tech without disrupting the
smooth and efficient functioning of the platform

Why did the investors want to know the count of bots and fake
accounts, if any?
--> So that the investors are assured that they are investing in an
asset and not a future liability

20
Instagram User Analytics
Conclusion

In conclusion, not only Instagram but many other social media and
commercial firms use such analysis to find insights from their
customer data, which in turn helps the firms identify the
customers who will be an asset to the firm in the future and
not a liability.

Such analysis and segmentation of the customer base is done on a
weekly, monthly, quarterly or yearly basis as per the needs of
the business firms, so as to maximize their future profits with
minimal cost to the company

21
Operation Analytics and

Investigating Metric Spike

Description

Operation Analytics is the analysis done for the complete end-to-end
operations of a company. With its help, the company
finds the areas it must improve upon. You work
closely with the ops team, support team, marketing team, etc. and
help them derive insights out of the data they collect.

Being one of the most important parts of a company, this kind of


analysis is further used to predict the overall growth or decline of
a company’s fortune. It means better automation, better
understanding between cross-functional teams, and more
effective workflows.

Investigating metric spikes is also an important part of operation
analytics, as a Data Analyst must be able to understand,
or make other teams understand, questions like: Why is there a dip
in daily engagement? Why have sales taken a dip? Questions
like these must be answered daily, and for that it is very important
to investigate metric spikes.

You are working for a company like Microsoft, designated as Data
Analyst Lead, and are provided with different data sets and tables from
which you must derive certain insights and answer the
questions asked by different departments.

22
Operation Analytics and

Investigating Metric Spike

The Problem

Case Study 1 (Job Data)

Number of jobs reviewed: Amount of jobs reviewed over


time.
Your task: Calculate the number of jobs reviewed per hour
per day for November 2020?

Throughput: It is the no. of events happening per second.


Your task: Let’s say the above metric is called throughput.
Calculate 7 day rolling average of throughput? For
throughput, do you prefer daily metric or 7-day rolling and
why?

Percentage share of each language: Share of each language


for different contents.
Your task: Calculate the percentage share of each language in
the last 30 days?

Duplicate rows: Rows that have the same value present in


them.
Your task: Let’s say you see some duplicate rows in the data.
How will you display duplicates from the table?

23
Operation Analytics and

Investigating Metric Spike

The Problem(Cont...)

Case Study 2 (Investigating metric spike)

User Engagement: To measure the activeness of a user.


Measuring if the user finds quality in a product/service.
Your task: Calculate the weekly user engagement?

User Growth: Amount of users growing over time for a


product.
Your task: Calculate the user growth for the product?

Weekly Retention: Users getting retained weekly after


signing up for a product.
Your task: Calculate the weekly retention of users based on their
sign-up cohort?

Weekly Engagement: To measure the activeness of a user.


Measuring if the user finds quality in a product/service
weekly.
Your task: Calculate the weekly engagement per device?

Email Engagement: Users engaging with the email service.


Your task: Calculate the email engagement metrics?

24
Operation Analytics and

Investigating Metric Spike

Design

Steps taken to load the data into the database:

Using the CREATE DATABASE statement of MySQL, create a database
Then add tables and column names
Then add the values into them using the INSERT INTO statement of MySQL
By using the SELECT statement we can query the desired output

Software used for querying the results


--> MySQL Workbench 8.0 CE

Software used for analyzing using Bar plots


--> Microsoft Excel

25
Operation Analytics and
Investigating Metric Spike
Job Data
Findings - I

To find the number of jobs reviewed per hour per day of


November 2020:
1. We will use the data from the job_id column of the job_data table.

2. Then we will divide the total count of job_id (distinct and non-
distinct) by (30 days * 24 hours) to find the number of jobs
reviewed per hour per day
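A minimal sketch of this calculation, assuming job_data has a ds date column (as used in the next finding):

SELECT COUNT(job_id) / (30 * 24) AS jobs_reviewed_per_hour_per_day   -- 30 days * 24 hours
FROM job_data
WHERE ds BETWEEN '2020-11-01' AND '2020-11-30';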

Output /Result

26
Operation Analytics and
Investigating Metric Spike
Job Data
Findings - II

For calculating the 7-day rolling daily metric average of throughput:-

1. We will be first taking the count of job_id(distinct and non-distinct) and


ordering them w.r.t ds (date of interview)

2. Then by using a window frame of ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
we will consider the 6 preceding rows and the current row

3. Then we will be taking the average of the jobs_reviewed
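A minimal sketch of this query (MySQL 8.0 window functions; COUNT(DISTINCT job_id) can be used instead for the distinct variant):

SELECT ds,
       AVG(jobs_reviewed) OVER (ORDER BY ds
                                ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7_day_avg
FROM (
    SELECT ds, COUNT(job_id) AS jobs_reviewed   -- jobs reviewed on each day
    FROM job_data
    GROUP BY ds
) AS daily;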

Output /Result

27
Operation Analytics and
Investigating Metric Spike
Job Data
Findings - III

To Calculate the percentage share of each language (distinct and non-


distinct):-

1. We will first divide the count of rows for each language (distinct/non-distinct)
by the total number of rows present in the table

2. Then we will do the grouping based on the languages.
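A minimal sketch of this query:

SELECT language,
       COUNT(*) * 100.0 / (SELECT COUNT(*) FROM job_data) AS percentage_share
FROM job_data
GROUP BY language;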

Output /Result

28
Operation Analytics and
Investigating Metric Spike
Job Data
Findings - III(Cont..)

Output /Result

29
Operation Analytics and
Investigating Metric Spike
Job Data
Findings - IV

To view the duplicate rows having the same value we will:-

1. First decide in which column we need to find the duplicate row values

2. After deciding the column (parameter) we will use the ROW_NUMBER
function to find the row numbers having the same value

3. Then we will partition the ROW_NUMBER function over the column
(parameter) that we decided, i.e. job_id

4. Then using the WHERE clause we will keep the rows having row_num greater
than 1, i.e. row_num > 1, based on the occurrence of the job_id in the
table
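A minimal sketch of this query:

SELECT *
FROM (
    SELECT jd.*,
           ROW_NUMBER() OVER (PARTITION BY job_id ORDER BY job_id) AS row_num
    FROM job_data jd
) AS numbered
WHERE row_num > 1;   -- second and later occurrences of a job_id are duplicates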

Output /Result

30
Operation Analytics and
Investigating Metric Spike
Investigating Metric Spike
Findings - I

To find the weekly user engagement:-

1. We will extract the week from the occurred_at column of the events table using
the EXTRACT function and WEEK function

2. Then we will be counting the number of distinct user_id from the events table

3. Then we will use the GROUP BY function to group the output w.r.t week from
occurred_at
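A minimal sketch of this query:

SELECT EXTRACT(WEEK FROM occurred_at) AS week_num,
       COUNT(DISTINCT user_id) AS weekly_active_users
FROM events
GROUP BY week_num
ORDER BY week_num;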

Output /Result

31
Operation Analytics and
Investigating Metric Spike
Investigating Metric Spike
Findings - I (Cont...)

Output /Result

32
Operation Analytics and
Investigating Metric Spike
Investigating Metric Spike
Findings - II
To find the user growth (number of active users per week):-

1. First we will extract the year and week from the occurred_at column of the users
table using the EXTRACT, YEAR and WEEK functions

2. Then we will group the extracted week and year on the basis of year and week number

3. Then we will order the result on the basis of year and week number

4. Then we will find the cumm_active_users using a SUM ... OVER window with ROWS
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
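A minimal sketch of this query (the table and column names follow the steps above and may differ slightly in the actual dataset):

SELECT year_num, week_num,
       SUM(users_in_week) OVER (ORDER BY year_num, week_num
                                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumm_active_users
FROM (
    SELECT EXTRACT(YEAR FROM occurred_at) AS year_num,
           EXTRACT(WEEK FROM occurred_at) AS week_num,
           COUNT(DISTINCT user_id) AS users_in_week
    FROM users
    GROUP BY year_num, week_num
) AS weekly
ORDER BY year_num, week_num;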

Output /Result

33
Operation Analytics and
Investigating Metric Spike
Investigating Metric Spike
Findings - II (Cont...)

Output /Result

34
Operation Analytics and
Investigating Metric Spike
Investigating Metric Spike
Findings - III

The weekly retention of users-sign up cohort can be calculated by two means i.e.
either by specifying the week number (18 to 35) or for the entire column of
occurred_at of the events table.

1. Firstly we will extract the week from occurred_at column using the extract, week
functions

2. Then, we will select out those rows in which event_type = 'signup_flow' and
event_name = 'complete_signup'

3. If finding for a specific week we will specify the week number using the extract
function

4. Then using the left join we will join the two tables on the basis of user_id where
event_type = 'engagement'

5. Then we will use the Group By function to group the output table on the basis of
user_id

6. Then we will use the Order By function to order the result table on the basis of
user_id
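A minimal, hedged sketch of how such a query could be structured (a self-join of the events table; uncomment the commented line to restrict the cohort to one sign-up week):

SELECT s.user_id,
       EXTRACT(WEEK FROM s.occurred_at) AS signup_week,
       COUNT(DISTINCT EXTRACT(WEEK FROM e.occurred_at)) AS weeks_with_engagement
FROM events s
LEFT JOIN events e
       ON e.user_id = s.user_id
      AND e.event_type = 'engagement'
WHERE s.event_type = 'signup_flow'
  AND s.event_name = 'complete_signup'
  -- AND EXTRACT(WEEK FROM s.occurred_at) = 18
GROUP BY s.user_id, signup_week
ORDER BY s.user_id;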

Output /Result

Google Drive Link for saved result


(without specifying week number)

Trainity_task_3_case_stuy_2_question_c.
csv - Google Drive

35
Operation Analytics and
Investigating Metric Spike
Investigating Metric Spike
Findings - III (Cont...)

Output /Result

Google Drive Link for saved result


(specifying week number as 18)

Trainity_task_3_case_stuy_2_question_c_18_week.
csv - Google Drive

36
Operation Analytics and
Investigating Metric Spike
Investigating Metric Spike
Findings - IV

To find the weekly user engagement per device:-

1. Firstly we will extract the year_num and week_num from the occurred_at column of
the events table using the EXTRACT, YEAR and WEEK functions

2. Then we will select those rows where event_type = 'engagement' using the WHERE
clause

3. Then by using the GROUP BY and ORDER BY functions we will group and order the
result on the basis of year_num, week_num and device
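A minimal sketch of this query:

SELECT EXTRACT(YEAR FROM occurred_at) AS year_num,
       EXTRACT(WEEK FROM occurred_at) AS week_num,
       device,
       COUNT(DISTINCT user_id) AS weekly_engagement
FROM events
WHERE event_type = 'engagement'
GROUP BY year_num, week_num, device
ORDER BY year_num, week_num, device;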

Output /Result

Google Drive link for saved result

question D weekly user engagement per


device.csv - Google Drive

37
Operation Analytics and
Investigating Metric Spike
Investigating Metric Spike
Findings - V

To find the email engagement metrics(rate) of users:-

1. We will first categorize the action on the basis of email_sent, email_opened and
email_clicked using CASE ... WHEN ... THEN expressions

2. Then we select the sum of the email_opened category divided by the sum of the
email_sent category, multiply the result by 100.0 and name it email_opening_rate

3. Then we select the sum of the email_clicked category divided by the sum of the
email_sent category, multiply the result by 100.0 and name it email_clicking_rate

4. email_sent = ('sent_weekly_digest', 'sent_reengagement_email')

5. email_opened = 'email_open'

6. email_clicked = 'email_clickthrough'
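A minimal sketch of this query (the table name email_events and the action column are placeholders for the email events data described above):

SELECT 100.0 * SUM(CASE WHEN action = 'email_open' THEN 1 ELSE 0 END)
             / SUM(CASE WHEN action IN ('sent_weekly_digest', 'sent_reengagement_email')
                        THEN 1 ELSE 0 END) AS email_opening_rate,
       100.0 * SUM(CASE WHEN action = 'email_clickthrough' THEN 1 ELSE 0 END)
             / SUM(CASE WHEN action IN ('sent_weekly_digest', 'sent_reengagement_email')
                        THEN 1 ELSE 0 END) AS email_clicking_rate
FROM email_events;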

Output /Result

38
Operation Analytics and
Investigating Metric Spike
Analysis

From the tables and bar plots I have inferred the following:
The number of distinct jobs reviewed per hour per day is 0.0083

The number of non-distinct jobs reviewed per hour per day is 0.0111

The 7-day rolling average throughput for 25, 26, 27, 28, 29 and 30 Nov 2020 is 1, 1, 1,
1.25, 1.2 and 1.3333 respectively (for both distinct and non-distinct)

The percentage share of each language, i.e. Arabic, English, French, Hindi, Italian and
Persian, is 12.5, 12.5, 12.5, 12.5, 12.5 and 37.5 respectively (for both distinct and non-
distinct)

There are 2 duplicate rows, both having job_id = 23 and language = Persian

Using the Why's approach I am trying to find more insights


Why is there a difference between the number of distinct jobs reviewed per
day and the number of non-distinct jobs reviewed per day?
----> Maybe due to repeated values in two or more rows, or because the dataset
contained duplicate rows

Why should one use a 7-day rolling average for calculating throughput and not a daily
metric average?
----> For calculating the throughput we use the 7-day rolling average because it
smooths the metric over all the days from day 1 to day 7, whereas the daily
metric gives us the value for only that particular day.

Why is the percentage share of all other languages 12.5% but that of the language
'Persian' 37.5%?
-----> In such cases there are two possibilities: either there were duplicate rows having
the language 'Persian', or there really were two or more unique people
speaking the Persian language

Why do we need to look for duplicate rows in a dataset?

----> Duplicates have a direct influence on the analysis going wrong and may lead to
wrong business decisions, causing losses to the company or any entity; to avoid
this one must look for duplicates and remove them where necessary

39
Operation Analytics and
Investigating Metric Spike
Analysis (Cont...)

From the tables and bar plots I have inferred the following:
The weekly user engagement is the highest for week 31, i.e. 1685

There are in total 9381 active users from the 1st week of 2013 to the 35th week of 2014

The email_opening_rate is 33.5833 and the email_clicking_rate is 14.78988

I have used the Why's approach to gain a few more insights:

Why is the weekly user engagement so low in the beginning, and why did it then increase?
-----> It is a fact that for any newly launched product or service, during its initial
period in the market it is less known; only some people use the product, and based on
their experience the product/service engagement increases or decreases depending on
whether the consumer experience was good or bad. In this case, since the user
engagement increased after 2-3 weeks of the launch, the consumers must have had a
good experience with the product/service

Why is weekly retention so important?

---> Weekly retention helps firms convince and help those visitors who have just
completed the sign-up or left the sign-up process in between; such visitors may
become customers in the future if they are guided and convinced properly

Why does weekly engagement per device play an important role?

----> Based on reviews from users, weekly engagement per device helps firms decide
which devices they must focus on more and which devices need more improvement,
so that those devices also get a good review in the users' weekly engagement per device

Why does Email Engagement play an important role?

----> Email engagement helps firms decide the discounts and offers on specific
products. In this case the email_opening_rate is 33.58, i.e. out of every 100 mails
sent only about 34 were opened, and the email_clicking_rate is 14.789, i.e. out of
every 100 mails sent only about 15 were clicked through for more details regarding
the discount/product. This means that the firm needs catchier subject lines for its
mails, and also needs rigorous planning and content selection before
sending the mails

40
Operation Analytics and
Investigating Metric Spike
Conclusion

In conclusion, Operation Analytics and Investigating Metric Spikes are very necessary,
and they must be done on a daily, weekly, monthly, quarterly or yearly basis depending
on the business needs of the firm.

Also, any firm/entity must focus on email engagement with its customers; the firm must
use catchy headings along with reasonable discounts and coupons so as to grow its
existing customer base.

Also, any firm should have a separate department (if possible) to hear out the problems
of those visitors who left the sign-up process in between; the firm must guide them so
as to convert them from visitors into customers

41
Hiring Process Analytics

Description

The hiring process is a fundamental and most important
function of a company. Here, MNCs get to know about the
major underlying trends in their hiring process. Trends such
as the number of rejections, number of interviews, types of jobs,
vacancies, etc. are important for a company to analyse before
hiring freshers or any other individual. Thus, there is an
opportunity for a Data Analyst job here too!

Being a Data Analyst, your job is to go through these trends and
draw insights out of them for the hiring department to work upon.

You are working for an MNC such as Google as a lead Data Analyst,
and the company has provided you with the data records of their
previous hirings and asked you to answer certain questions
by making sense of that data.

42
Hiring Process Analytics

The Problem

Hiring: Process of intaking of people into an organization for


different kinds of positions.
Your task: How many males and females are hired?

Average Salary: Adding all the salaries for a select group of


employees and then dividing the sum by the number of
employees in the group.
Your task: What is the average salary offered in this company?

Class Intervals: The class interval is the difference between the


upper class limit and the lower class limit.
Your task: Draw the class intervals for salary in the company?

Charts and Plots: This is one of the most important part of


analysis to visualize the data.
Your task: Draw a pie chart / bar graph (or any other graph) to
show the proportion of people working in different departments?

Charts: Use different charts and graphs to perform the task


representing the data.
Your task: Represent different post tiers using a chart/graph?
43
Hiring Process Analytics

Design

Before starting the actual analysis:

First, I made a copy of the raw data on which to perform the
analysis, so that whatever changes I make will not affect the
original data

Second, I looked for blank spaces and NULL values, if any.

Then I imputed the numerical blank and NULL cells with the mean
of the column (if no outliers existed for that particular column) or
with the median (if outliers existed for that column)

Then I looked for outliers, if any, and replaced them with the
median of the particular column where the outlier existed

Then I replaced blank cells of categorical variables with the
category having the highest count

Then I looked for duplicate rows and removed them, if any

Then I removed the irrelevant columns (data) from the dataset
which were not necessary for doing the analysis

Software used for doing the overall Analysis:-


----> Microsoft Excel

44
Hiring Process Analytics

Findings - I

From the above table and bar plot I have inferred that:-
There are 2563 Males hired for different roles in the company
While there are only 1855 Females hired for different roles in
the company

45
Hiring Process Analytics

Findings - II

To find the average salary offered in this company:-


1. First, we need to remove the outliers i.e. to remove the salaries
below 1000 and above 100000

2. Then use the formula:
=AVERAGE(entire_column_of_salary_after_removing_outliers)

Output/Result
49983.03223

46
Hiring Process Analytics
Findings - III

From the above Bar plot I have inferred that the highest number
of posts (both hired and rejected) is 406 for the salary range
41807 to 46907

From the above Bar plot I have inferred that the highest number
of posts (hired) is 315 for the salary range 42307 to 54107

47
Hiring Process Analytics
Findings - IV

48
Hiring Process Analytics

Findings - IV (Cont...)

From the above table, pie chart and Bar Plot I have inferred that
the Highest number of people were working in the Operations
Department i.e. 1843 which accounts for almost 39% of the total
workforce of the company

49
Hiring Process Analytics
Findings - V

50
Hiring Process Analytics
Findings - V (Cont...)

From the above table, Bar plot and Pie chart I have inferred that
the c9 post has the highest number of openings i.e. 1792 which
accounts for 25% of the total job openings of the company/firm

51
Hiring Process Analytics

Analysis

Using the Why's approach I am trying to find some more
insights:

Why is there so much difference in the total number of males
and females hired?
---> Since the company is an MNC and people from all around
the world work here, such a difference exists because gender
equality has not yet reached every part of the
world. Some regions in the Gulf countries, the African
continent and some Asian countries still face this problem.

Why are there fewer people with salaries above 85000 and
more people with salaries between 35000 and 60000?
----> It is a fact that some positions in a company require a
specialist with years of experience in that particular field of work;
hence the company looks for such people and offers them higher
salary packages, and such people regularly prove themselves an
asset to the company.
For any company there are more people having a salary in
the range 35000 to 60000; such people have spent 3-4 years in
the company and their salary and increments are decided
based on their monthly, quarterly and yearly performance.

Why is it that the Operations department has the highest
number of people working?
----> The Operations department works like a central hub for all
other departments; all the execution tasks are carried out by
this department. The Operations department has the highest
workload compared to all other departments.

52
Hiring Process Analytics

Conclusion

In conclusion, Hiring Process Analytics plays an important part in
helping all companies and firms decide the job openings for the
near future.

Hiring Process Analytics is done on a monthly, quarterly or yearly
basis as per the needs and policies of the company.

For any company the Operations department has the largest
workforce due to the workload on this department, as it acts as a
central hub for all the executive tasks carried out.

For any company there will be some employees who have higher
salary packages compared to other employees, and this is due to
the fact that they have special skills and years of experience in
their particular field of work.

Hiring Process Analytics helps the company decide the salaries for
new freshers joining the company; it also tells the workforce
requirement of each department; and it helps the company decide
the appraisals and increments for its current employees.

53
IMDB Movie Analysis
Description

You are provided with a dataset having various columns describing different
IMDB movies. You are required to frame the problem.

For this task, you will need to define a problem you want to shed
some light on.

Once you have defined a problem, clean the data as necessary, and
use your Data Analysis skills to explore the data set and derive
insights.

54
IMDB Movie Analysis
The Problem

Movies with highest profit: Create a new column called profit which
contains the difference of the two columns: gross and budget. Sort
the column using the profit column as reference. Plot profit (y-axis)
vs budget (x-axis) and observe the outliers using the appropriate
chart type.
Your task: Find the movies with the highest profit?

Top 250: Create a new column IMDb_Top_250 and store the top 250
movies with the highest IMDb Rating (corresponding to the column:
imdb_score). Also make sure that for all of these movies, the
num_voted_users is greater than 25,000. Also add a Rank column
containing the values 1 to 250 indicating the ranks of the
corresponding films.
Extract all the movies in the IMDb_Top_250 column which are not in
the English language and store them in a new column named
Top_Foreign_Lang_Film. You can use your own imagination also!
Your task: Find IMDB Top 250

Best Directors: Group the column using the director_name column.


Find out the top 10 directors for whom the mean of imdb_score is
the highest and store them in a new column top10director. In case
of a tie in IMDb score between two directors, sort them
alphabetically. Your task: Find the best directors

Popular Genres: Perform this step using the knowledge gained


while performing previous steps.
Your task: Find popular genres

55
IMDB Movie Analysis
The Problem (Cont...)

Charts: Create three new columns namely, Meryl_Streep,


Leo_Caprio, and Brad_Pitt which contain the movies in which the
actors: 'Meryl Streep', 'Leonardo DiCaprio', and 'Brad Pitt' are the
lead actors. Use only the actor_1_name column for extraction.
Also, make sure that you use the names 'Meryl Streep', 'Leonardo
DiCaprio', and 'Brad Pitt' for the said extraction.

Append the rows of all these columns and store them in a new
column named Combined.

Group the combined column using the actor_1_name column.

Find the mean of the num_critic_for_reviews and


num_users_for_review and identify the actors which have the
highest mean.

Observe the change in number of voted users over decades using


a bar chart. Create a column called decade which represents the
decade to which every movie belongs. For example, the
title_year values 1923 and 1925 should be stored as 1920s. Sort the
column based on the column decade, group it by decade and find
the sum of users voted in each decade. Store this in a new data
frame called df_by_decade.

Your task: Find the critic-favorite and audience-favorite actors

56
IMDB Movie Analysis
Design

1. First, I made a copy of the raw data on which to perform the
analysis, so that whatever changes I make will not affect the
original data

2. Then I dropped the columns which are of no use for the analysis
that we will be doing

3. Columns like 'Color', 'director_facebook_likes',
'actor_3_facebook_likes', 'actor_2_name', 'actor_1_facebook_likes',
'cast_total_facebook_likes', 'actor_3_name', 'facenumber_in_posts',
'plot_keywords', 'movie_imdb_link', 'content_rating',
'actor_2_facebook_likes', 'aspect_ratio' and 'movie_facebook_likes'
contain irrelevant data for the analysis tasks
provided. So, these columns need to be dropped.

4. After dropping the irrelevant columns we need to remove the
rows from the dataset having any of their column values
blank/NULL

5. Then we need to get rid of the duplicate values in the dataset,
which can be achieved by using the 'Remove Duplicates' option
available in the 'Data' tab

57
IMDB Movie Analysis
Findings - I

To find the movies with the highest profit: -


1. First we need to subtract the budget value from the gross value to get the
profit.

2. Then, by using the scatter plot option we will plot the values of profit (y-axis)
against budget (x-axis)

3. Then with the help of the graph we will find the outliers

58
IMDB Movie Analysis
Findings - I(Cont...)

After removing the outliers, from the above table I have inferred
that 'Avatar' was the highest profit making movie ever with a
profit of 523505847

59
IMDB Movie Analysis
Findings - II

To find the IMDB Top 250 we will:-


1. First we will filter out those rows whose num_voted_users > 25000 using the
sort and filter option

2. Then we will arrange the dataset on the basis of imdb_score in descending order

3. Then we will select only the top 250 rows for the further analysis

4. Then we will create a new column rank using the RANK() function and using the
formula
=RANK(N2,$N$2:$N$251,0)+COUNTIFS($N$2:N2,N2)-1

5. Then we will filter out (unselect ‘English’) from the language column and we will
get the desired output

Top - 5 IMDB Movies all languages

From the above table I have inferred that 'The Shawshank


Redemption' had the highest IMDB ratings

60
IMDB Movie Analysis
Findings - II (Cont...)

Top - 5 IMDB Movies all languages (except English)

From the above table I have inferred that the movie 'The Good,
the Bad and the Ugly' had the highest IMDB rating among movies
in all other languages (except English); its country of origin is
Italy

61
IMDB Movie Analysis
Findings - III
To find the best top 10 directors on the basis of mean of imdb_score
we will:-

1. First select the imdb_score column of the cleaned dataset

2. Then we will click on pivot table

3. We will add director_name into the row labels section of the pivot table

4. Then we will add average imdb_score into the values section of the
pivot table

5. Then we will first sort the data on the basis of average of


imdb_score in descending order and then on the basis of director
name alphabetically

From the above table I have inferred that Charles Chaplin and
Tony Kaye had the highest mean of IMDB Score i.e. 8.6

62
IMDB Movie Analysis
Findings - IV

To find the Popular Genres we will:-


1. First select the genres column of the cleaned dataset

2. Then we will go for the pivot table option

3. Then we will Select the genres name as row labels

4. Then we will set the values as the count of the number of genres and
sort them in descending order on the basis of the count of the number
of genres

From the above table I have inferred that genre named 'Drama'
was the most popular with a count of 153

63
IMDB Movie Analysis
Findings - V

To find the critic-favorite and audience-favorite actors we will:-


1. First we will create three new columns, namely Meryl_Streep, Leo_Caprio, and
Brad_Pitt, which contain the movies in which the actors 'Meryl Streep',
'Leonardo DiCaprio', and 'Brad Pitt' are the lead actors, extracted from the
actor_1_name column

2. Then we will append the above 3 created columns into 1 column named
actor_1_name_combine

3. Then we will group the 3 columns of critic-favorite and audience-favorite


actors

4. Then using the pivot table we will find the average, sum and count of the
review columns for the critic-favorite and audience-favorite actors

64
IMDB Movie Analysis
Findings - V(Cont..)

From the above two graphs I have inferred that


‘Leonardo DiCaprio’ was both critic-favorite and audience-
favorite

65
IMDB Movie Analysis
Findings - VI

66
IMDB Movie Analysis
Findings - VI (Cont..)

From the above table and bar plot I have inferred that the most
votes were cast in the decade 2001-2010, with a
count of 178592461

67
IMDB Movie Analysis
Analysis
Using the Why's approach I am trying to find some useful insights

Why is it that the highest-rated IMDB movie and the highest-profit
movie are not the same?
-----> Maybe due to the fact that only registered users who know
how to vote on IMDB have access to the IMDB portal. On the other
hand, the profit is calculated on the basis of the tickets sold in
theatres worldwide.

Why are there more votes during the decade 2001-2010?
-----> The period 2001-2010 saw many scientific and
computer-graphics advancements; also during this interval there
was a splendid increase in the production of movies all over the
world, so a huge number of movies were produced and released
during this decade. Also, before 2000 many countries did not have
a separate ministry/board/committee from the government side
that looked into the matters of film production and release.

Why is it that only movies in the English language are the
top 5 ranked movies on IMDB?
-----> Movies in English mostly had the USA as their country of
origin; also, it is a well-known fact that the US economy was
robust during those days, so investors looked for directors
making movies in order to gain financially.

68
IMDB Movie Analysis

Analysis (Cont..)

Why is it that only Drama and Comedy had the highest
popularity?
----> Most people all over the world are stressed by their work
life, so they want relaxing refreshment and not something of the
action or horror type. So people prefer watching movies of the
Comedy or Drama genre, or both. But most of them preferred
Comedy-genre films.

Why were there more votes for the decade 2001-2010 than for
2011-2020, even though there was advancement in graphics and
animation during 2011-2020?
----> There was great and immense growth of technology not only
in the graphics and animation sector but in all aspects of life; it
was also during this interval that VPNs became widespread, and
VPNs enabled piracy (illegal distribution of films), due to which
many people avoided going to theatres.

69
IMDB Movie Analysis
Conclusion

In conclusion, IMDB movie analysis, or any such analysis, is done
not only by movie makers before production, but also by various
investors, stakeholders and theatre outlet owners.

Normal people would not bother to do such analysis, but it plays a
crucial part during the pre-production phase of a movie and also
during the post-production phase.

Also, it is not necessary that the movie with the highest IMDB
rating will have the highest profit.
Profit is calculated purely on the basis of the number of tickets
sold by theatres all over the world.

Most people are tired from their daily lives and prefer movies of
the Comedy/Drama genre, or both, and would not go for movies of
the Action/Horror genre.

So, directors and production teams must keep the above points in
mind and do the pre-production analysis before the
commencement of filming.

70
Bank Loan Case Study
Description

Loan-providing companies find it hard to give loans to people due to
their insufficient or non-existent credit history. Because of that,
some consumers use this to their advantage by becoming defaulters.
Suppose you work for a consumer finance company which specializes
in lending various types of loans to urban customers. You have to use
EDA to analyze the patterns present in the data. This will ensure that
the applicants capable of repaying the loan are not rejected.

When the company receives a loan application, the company has to


decide for loan approval based on the applicant’s profile. Two types
of risks are associated with the bank’s decision:
If the applicant is likely to repay the loan, then not approving the
loan results in a loss of business to the company.
If the applicant is not likely to repay the loan, i.e. he/she is likely to
default, then approving the loan may lead to a financial loss for the
company.

When a client applies for a loan, there are four types of decisions that
could be taken by the client/company:
1. Approved: The company has approved loan application
2. Cancelled: The client cancelled the application sometime during
approval. Either the client changed her/his mind about the loan, or in
some cases received worse pricing due to a higher risk assessment,
which he/she did not want.
3. Refused: The company had rejected the loan (because the client
does not meet their requirements etc.).
4. Unused Offer: The loan has been cancelled by the client, but at
different stages of the process.

71
Bank Loan Case Study
The Problem

This case study aims to give you an idea of applying EDA in a real
business scenario. In this case study, apart from applying the
techniques that you have learnt in the EDA module, you will also
develop a basic understanding of risk analytics in banking and
financial services and understand how data is used to minimize the
risk of losing money while lending to customers.

It aims to identify patterns which indicate if a client has difficulty


paying their installments which may be used for taking actions such
as denying the loan, reducing the amount of loan, lending (to risky
applicants) at a higher interest rate, etc. This will ensure that the
consumers capable of repaying the loan are not rejected.
Identification of such applicants using EDA is the aim of this case
study.

In other words, the company wants to understand the driving factors


(or driver variables) behind loan default, i.e. the variables which are
strong indicators of default. The company can utilize this knowledge
for its portfolio and risk assessment.

To develop your understanding of the domain, you are advised to
independently research a little about risk analytics (understanding
the types of variables and their significance should be enough).

72
Bank Loan Case Study
Design
First, create a copy of the raw data.
Then the percentage of null values needs to be analyzed, and those
columns that have more than 50% null data have to be dropped.
Columns with less than 50% null data have their missing values
replaced with the mean, the median or the highest-occurring
categorical value.

The following columns need to be dropped, as they have more than
50% of their values NULL:
OWN_CAR_AGE
EXT_SOURCE_1
APARTMENTS_AVG
BASEMENTAREA_AVG
YEARS_BUILD_AVG
COMMON_AREA_AVG
ELEVATORS_AVG
ENTRANCES_AVG
FLOORSMAX_AVG
FLOORSMIN_AVG
LANDAREA_AVG
LIVINGAPARTMENTS_AVG
LIVINGAREA_AVG
NONLIVINGAPARTMENTS_AVG
NONLIVINGAREA_AVG
APARTMENTS_MODE
BASEMENTAREA_MODE
YEARS_BUILD_MODE
COMMON_AREA_MODE
ELEVATORS_MODE
ENTRANCES_MODE
FLOORSMAX_MODE
FLOORSMIN_MODE
73
Bank Loan Case Study
Design (Cont...)

LANDAREA_MODE
LIVINGAPARTMENTS_MODE
LIVINGAREA_MODE
NONLIVINGAPARTMENTS_MODE
NONLIVINGAREA_MODE
APARTMENTS_MEDIAN
BASEMENTAREA_MEDIAN
YEARS_BUILD_MEDIAN
COMMON_AREA_MEDIAN
ELEVATORS_MEDIAN
ENTRANCES_MEDIAN
FLOORSMAX_MEDIAN
FLOORSMIN_MEDIAN
LANDAREA_MEDIAN
LIVINGAPARTMENTS_MEDIAN
LIVINGAREA_MEDIAN
NONLIVINGAPARTMENTS_MEDIAN
NONLIVINGAREA_MEDIAN
FONDKAPREMONT_MODE
HOUSETYPE_MODE
WALLSMATERIAL_MODE

74
Bank Loan Case Study
Design (Cont...)

Then drop those columns which are irrelevant for doing the data
analysis. The following columns need to be dropped:
FLAG_MOBILE
FLAG_EMPLOY_PHONE
FLAG_WORK_PHONE
FLAG_CONT_MOBILE
FLAG_PHONE
FLAG_EMAIL
CNT_FAMILY_MEMBERS
REGION_RATING_CLENT
REGION_RATING_CLENT_W_CITY
EXT_SOURCE_3
YEAR_BEGINEXPLUATATION_AVG
YEAR_BEGINEXPLUATATION_MODE
YEAR_BEGINEXPLUATATION_MEDIAN
TOTAL_AREA_MODE
EMERGENCYSTATE_MODE
DAYS_LAST_PHONE_CHANGE
FLAG DOC 2
FLAG DOC 3
FLAG DOC 4
FLAG DOC 5
FLAG DOC 6
FLAG DOC 7
FLAG DOC 8
FLAG DOC 9
FLAG DOC 10
FLAG DOC 11
FLAG DOC 12

75
Bank Loan Case Study
Design (Cont...)

FLAG DOC 13
FLAG DOC 14
FLAG DOC 15
FLAG DOC 16
FLAG DOC 17
FLAG DOC 18
FLAG DOC 19
FLAG DOC 20
FLAG DOC 21

Replacing Blanks in Occupation_Type column of the Application


Dataset with the highest occurring categorical variable -->
Highest occurring categorical variable is ‘Laborers’

Replacing blanks in the AMT_ANNUITY column of the Application dataset


with the median of AMT_ANNUITY, as outliers exist in the
AMT_ANNUITY column
--> Median of AMT_ANNUITY = 24903

Replacing Blanks in AMT_GOODS_PRICE column of the


Application Dataset with the median of the AMT_GOODS_PRICE as
there exists outliers in the AMT_GOODS_PRICE column --> Median
of AMT_GOODS_PRICE = 450000

Replacing Blanks in Name_Type_Suite column of the Application


Dataset with the highest occurring categorical variable
--> Highest occurring categorical variable is ‘Unaccompanied’

76
Bank Loan Case Study
Design (Cont...)

Replacing Blanks in Organization_type column of the Application


Dataset with the highest occurring categorical variable
--> Highest occurring categorical variable is ‘Business Entity Type 3’

The following columns of the previous application datasets need to


be dropped as they are irrelevant for doing the data analysis
HOUR_APPR_PROCESS_START
WEEKDAY_APPR_PROCESS_START_PREV
FLAG_LAST_APPL_PER_CONTRACT
NFLAG_LAST_APPL_IN_DAY SK_ID_CURR

WEEKDAY_APPR_PROCESS_START

Removing the rows with the values 'XNA' & 'XAP'; for the column
NAME_TYPE_SUITE
----> Replace blanks with 'Unaccompanied'

AMT_ANNUITY: Replace blanks with 21340 (median)

77
Bank Loan Case Study
Findings - I

The Target variable pie chart shows that almost 92% of the total
clients had no problem during payment, while 8% of the clients
had some problem or the other

78
Bank Loan Case Study
Findings - II

From the GENDER_VARIABLE pie chart we can infer that


almost 66% of the clients are female and 34% of the clients are
male. Around 0% of the applicants have gender XNA, which can
be ignored

79
Bank Loan Case Study
Findings - III

From the bar graphs of count and percentage The bank can target
those groups who do not have their own apartment i.e. the bank may
consider the people living in Co-op apartment, Municipal Apartment,
Rented Apartment and people living with their parents

80
Bank Loan Case Study
Findings - IV

From the above bar plot we can infer that most of the applicants
belong to the Age Group ’31-40’

81
Bank Loan Case Study
Findings - V

From the above bar plots we can infer that clients/applicants in


the age group '31-40' have the highest count when it comes to
making/returning payments to the bank; they also have the
highest count of payment issues when returning money to the
bank

82
Bank Loan Case Study
Findings - VI

From the above Bar plot we can infer that clients belonging to ‘Low’ income
range have the highest count when it comes to clients with no payment issues

From the above Bar plot we can infer that clients belonging to ‘Medium’ income
range have the highest count when it comes to clients with payment issues

83
Bank Loan Case Study
Findings - VII

From the above bar plot we can infer that clients with occupation_type ‘Laborers’ have the highest count among clients with no payment issues.

From the above bar plot we can infer that clients with occupation_type ‘Laborers’ also have the highest count among clients with payment issues.

84
Bank Loan Case Study
Findings - VIII

From the above Bar plot we can infer that clients having income_type as
‘WORKING’ have the highest count when it comes to clients with no payment
issues

From the above Bar plot we can infer that clients having income_type as
‘WORKING’ have the highest count when it comes to clients with payment issues

85
Bank Loan Case Study
Findings - IX

From the above Bar plot we can infer that clients having the total income range ‘LOW’ have the highest count among clients with no payment issues.

From the above Bar plot we can infer that clients having the total income range ‘LOW’ also have the highest count among clients with payment issues.

86
Bank Loan Case Study
Findings - X

From the above Bar plot we can infer that clients having total count of family
members as 2 have the highest count when it comes to clients having no
payment issues

From the above Bar plot we can infer that clients having total count of family
members as 2 have the highest count when it comes to clients having payment
issues

87
Bank Loan Case Study
Findings - XI

From the above Bar Plot we can infer that Clients with CODE_GENDER
= ‘F’ have the highest number of non-defaulters i.e. 188278-14170 =
174108
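Counts like the one above can be reproduced with a simple cross-tabulation of each categorical column against the target. This is a hedged sketch, not the original Excel pivot table; the file name 'application_data.csv' is an assumption.

import pandas as pd

# Defaulter / non-defaulter split by gender (TARGET: 0 = no payment issues, 1 = payment issues)
app = pd.read_csv("application_data.csv")  # file name assumed

by_gender = pd.crosstab(app["CODE_GENDER"], app["TARGET"], margins=True)
by_gender["non_defaulters"] = by_gender[0]
print(by_gender)
# For 'F' this corresponds to the figure quoted above: total minus defaulters,
# i.e. 188278 - 14170 = 174108 non-defaulters.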

88
Bank Loan Case Study
Findings - XII

From the above Bar Plot we can infer that clients having NAME_INCOME_TYPE = ‘WORKING’ have the highest count of non-defaulters, i.e. 143550 - 15224 = 128326

89
Bank Loan Case Study
Findings - XIII

From the above Bar Plot we can infer that clients having NAME_EDUCATION_TYPE = ‘SECONDARY/SECONDARY SPECIAL’ have the highest count of non-defaulters, i.e. 198867 - 19524 = 179343

90
Bank Loan Case Study
Findings - XIV

From the adjacent Bar Plot we can infer that clients having NAME_FAMILY_STATUS = ‘MARRIED’ have the highest count of non-defaulters, i.e. 181582 - 14850 = 166732

91
Bank Loan Case Study
Findings - XV

From the above Bar Plot we can infer that clients having
NAME_HOUSING_TYPE = ‘House/Apartment’ have the
highest count of Non-defaulters i.e.
251596-21272 = 230324

92
Bank Loan Case Study
Findings - XVI

From the adjacent Bar plot we can infer that clients having
occupation_type = ‘Laborers’ have the highest count for Non-
defaulters i.e.
139461-12116 = 127345

93
Bank Loan Case Study
Findings - XVII

From the above Bar plot we can infer that Females belonging to the Low income group account for the highest number of clients with no payment issues

94
Bank Loan Case Study
Findings - XVIII

From the above Bar plot we can infer that Females belonging to the Low income group account for the highest number of clients with payment issues

95
Bank Loan Case Study
Findings - XIX

From the above Bar Plot we can infer that clients having
credit amt range as ‘Low’
and education status as ‘Secondary/ Secondary Special’
have the highest count for clients with no payment issues

96
Bank Loan Case Study
Findings - XX

From the above Bar Plot we can infer that clients having
credit amt range as ‘Medium’
and education status as ‘Secondary/ Secondary Special’
have the highest count for clients with payment issues

97
Bank Loan Case Study
Findings - XXI

From the above Bar plot we can infer that clients with
total_income_range as ‘Low’ and family_status as ‘Married’
have the highest count for clients having no payment issues

98
Bank Loan Case Study
Findings - XXII

From the adjacent Bar plot we can infer that clients with
total_income_range as ‘Low’ and family_status as
‘Married’
have the highest count for clients having payment issues

99
Bank Loan Case Study
Findings - XXIII

100
Bank Loan Case Study
Findings - XXIII (Cont...)

From the above Table and Bar Plot we can infer that, among loan purposes, ‘Repairs’ has the highest count of approved loans.

101
Bank Loan Case Study
Analysis
Using the Why's approach I am trying to find some more useful insights.
Why is the target_variable of so much importance?
---> In this dataset the target_variable represents whether the client had payment issues (1) or did not have payment issues (0). It is important because the target_variable helps the bank decide whether to increase or decrease the interest rates on the various loans it offers. Also, in this case almost 92% of the clients did not have any payment issues and only 8% of them did, which suggests that the bank's credit quality is good and it has very few or no non-performing accounts.

Why is the proportion of Female clients higher than that of Male clients?
---> In countries like India especially, the Government has introduced schemes for women who want to establish their own start-up, business, classes, catering services, etc. These schemes offer loans to women clients at relatively low interest rates. Also, in some cases, applicants deliberately apply in the name of a retired or homemaker mother or wife so that they can get some concession, i.e. a lower interest rate, while applying for home loans.

Why should the bank prefer clients of other housing types even though clients with the House/Apartment housing type have the highest proportion of non-defaulters?
----> Because people in the other groups (Municipal apartment, Co-op apartment, Rented apartment, living with parents) are searching for a house of their own with their own nameplate. Also, nowadays in India the joint family system is declining and the younger generations opt to live in their own 1/2 BHKs rather than living together with all family members in big family apartments.
102
Bank Loan Case Study
Analysis (Cont..)
Why should the bank opt for working-class clients more than state-government clients even though state-government employees enjoy a lot of benefits and a regular salary?
----> It is true that state-government employees enjoy a lot of benefits, but they also get housing allowances greater than those of the working class, and in some cases they even get an apartment to live in with their families for as long as they work for the state government. On the other hand, the working class do not enjoy such housing allowances (or get much smaller ones) and do not get an apartment to live in for their entire professional life (i.e. until retirement), so working-class clients opt to purchase their own house by taking a home loan.

Why should the bank not focus on approving loans to clients with occupation_type ‘Laborers’ even though they have the highest non-defaulter count?
-----> Laborers mostly take only personal loans for marriage or house-repair purposes; their loan amounts are small and the interest on such loans is also lower compared to home loans, car loans, etc., which in turn brings less profit to the bank.

Why is it that females in the low income group have the lowest count of defaulters?
----> Females belonging to this group take loans of small amounts, usually for starting their own start-ups, businesses or catering/parlour services, and they typically benefit from government schemes for such purposes.

103
Bank Loan Case Study
Conclusion
In conclusion, I would like to highlight the following:-
The proportion/percentage of defaulters (target = 1) is around 8% and that of non-defaulters (target = 0) is around 92%.
The Bank generally lends more to Female clients than to Male clients, as the count of Female clients in the defaulters’ list is lower than that of Males. Still, the Bank can look for more Male clients if their credit amount criteria are satisfied.
Clients who belong to the Working class tend to pay their loans on time, followed by clients who fall under Commercial Associate.
Clients having an Education status of Secondary/Higher Secondary or above tend to pay their loans on time, so the Bank can prefer lending to clients with such an Education status.
Clients who fall in the Age Group 31-40 have the highest count of on-time repayments, followed by clients in the Age Groups 41-60.
Clients having a LOW credit amount range tend to pay off their loans on time compared to those in the HIGH and MEDIUM credit ranges.
Clients living with their Parents tend to pay off their loans quickly compared to other housing types, so the Bank can lend to clients having housing type → Living with Parents.
Clients taking loans for purchasing a New Home (i.e. Home Loans) or a New Car (i.e. Car Loans), and clients who have the income type State Servant, tend to pay their loans on time, and hence the Bank should prefer clients with such a background.
The Bank should be more cautious when lending money to clients with the Repairs purpose because, although this purpose has the highest count of approved loans, it also shows a high count of defaulters.

104
XYZ Ads Airing Report
Description

Advertising is a way of marketing your business in order to increase sales or make your audience aware of your products or services. Until a customer deals with you directly and actually buys your products or services, your advertising may help to form their first impression of your business. The target audience for a business could be local, regional, national or international, or a mixture, so businesses use different channels for advertisement. Some of the types of advertisement are: Internet/online directories, trade and technical press, radio, cinema, outdoor advertising, national papers, magazines and TV. The advertising business is very competitive, as many players bid large amounts of money in a single segment to target the same audience. This is where a company's analytical skills come in: targeting those audiences on the media platforms where they can be converted into customers at a low cost.

105
XYZ Ads Airing Report
The Problem

What is Pod Position? Does the Pod position number affect the
amount spent on Ads for a specific period of time by a company?

What is the share of various brands in TV airings and how has it changed from Q1 to Q4 in 2021?

Conduct a competitive analysis for the brands and define advertisement strategy of different brands and how it differs across the brands.

Mahindra and Mahindra wants to run a digital ad campaign to complement its existing TV ads in Q1 of 2022. Based on the data from 2021, suggest a media plan to the CMO of Mahindra and Mahindra. Which audience should they target? *Assume XYZ Ads has the ad viewership data and TV viewership for the people in India.

106
XYZ Ads Airing Report
Design

1. Firstly, I made a copy of the raw data on which to perform the analysis, so that whatever changes I make will not affect the original data.

2. Then I dropped the columns which have no use for the analysis that we will be doing.

3. After dropping the irrelevant columns, we need to remove the rows from the dataset having any of their column values blank/NULL.

4. Then we need to get rid of the duplicate values in the dataset, which can be achieved by using ‘Remove Duplicates’ available in the ‘Data’ tab.
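Expressed as a short pandas sketch, the four steps look roughly as follows; the file name and the column names passed to drop() are placeholders, since the actual cleaning was done in Excel on the XYZ ads airing workbook.

import pandas as pd

raw = pd.read_excel("ads_airing_raw.xlsx")   # placeholder file name
ads = raw.copy()                             # 1. work on a copy of the raw data

# 2. drop columns that are not needed for the analysis (placeholder names)
ads = ads.drop(columns=["Unused_Column_1", "Unused_Column_2"])

ads = ads.dropna(how="any")                  # 3. remove rows with any blank/NULL value
ads = ads.drop_duplicates()                  # 4. Excel's 'Remove Duplicates' equivalent

print(len(raw), "rows before cleaning;", len(ads), "rows after")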

107
XYZ Ads Airing Report
Findings - I

→ Pod position is the position of a commercial within an ad break.
e.g.: If in a commercial break one sees the AMUL butter ad first, then the Big Basket ad next, and then the Amazon ad,
→ then the pod positions of the brands AMUL, Big Basket and Amazon are 1, 2 and 3 respectively.

108
XYZ Ads Airing Report
Findings - I (Cont..)

109
XYZ Ads Airing Report
Findings - I (Cont..)

From the above bar plots I have inferred the following:-


For Honda Motors as the Pod Position tends towards 31, the
amount spent first increases till pod pos.10 and then decreases
from pod pos.11 onwards
For Hyundai Motors as the Pod Position tends towards 31, the
amount spent first increases till pod pos.22 and then decreases
from pod pos. 23 onwards
For Mahindra and Mahindra as the Pod Position tends towards 31,
the amount spent first increases till pod pos. 26 and then
decreases from pod pos. 27 onwards
For Maruti Suzuki as the Pod Position tends towards 31, the
amount spent first increases till pod pos. 18 and then decreases
from pod pos. 19 onwards
For Tata Motors as the Pod Position tends towards 31, the amount
spent first increases till pod pos. 27 and then decreases from pod
pos. 28 onwards
For Toyota as the Pod Position tends towards 31, the amount
spent first increases till pod pos. 18 and then decreases from pod
pos. 19 onwards
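A minimal sketch of how these per-brand curves can be derived is shown below: the average spend at each pod position per brand, from a pivot table. The column names 'Brand', 'Pod_Position' and 'Amount' (and the file name) are assumptions, not the actual headers of the XYZ workbook.

import pandas as pd

ads = pd.read_excel("ads_airing_cleaned.xlsx")   # placeholder file name

# Average amount spent at each pod position for every brand
spend_by_pod = ads.pivot_table(index="Pod_Position",
                               columns="Brand",
                               values="Amount",
                               aggfunc="mean")

# Pod position where each brand's average spend peaks
# (e.g. around 10 for Honda and around 26 for Mahindra and Mahindra, as noted above)
print(spend_by_pod.idxmax())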

110
XYZ Ads Airing Report
Findings - II

From the above bar plot I have inferred that the brand named ‘Maruti
Suzuki’ has the highest share in each Quarter for TV Airings.

111
XYZ Ads Airing Report
Findings - III

112
XYZ Ads Airing Report
Findings - III (Cont...)

113
XYZ Ads Airing Report
Findings - III (Cont...)

114
XYZ Ads Airing Report
Findings - III (Cont...)

115
XYZ Ads Airing Report
Findings - III (Cont...)

From the above Bar plots I have inferred the following:-
The brand ‘Maruti Suzuki’ has the highest share of TV ads in both network types, viz. Broadcast (37.53%) and Cable (38.37%).
The avg_amt_spent on the Broadcast network type is the highest for the brand ‘Hyundai Motors India’, i.e. $18,078, and on the Cable network type it is the highest for the brand ‘Mahindra and Mahindra’, i.e. $1,612.
For the Broadcast network type, the brand ‘Maruti Suzuki’ has the highest share for Early Morning (35.01%), Evening News (36.33%), Late Fringe (52.88%), Overnight (49.68%), Prime Access (50.38%), Prime Time (39.49%) and Weekend (40.89%), but for Daytime and Early Fringe the brand ‘Honda Cars’ has the highest share, i.e. 33.03% and 44.95% respectively.
For the Cable network type, the brand ‘Maruti Suzuki’ has the highest share for all dayparts of TV ads.
Also, the brand ‘Maruti Suzuki’ spent the highest sum on TV ad airings in all quarters (Q1, Q2, Q3 and Q4) for both network types: Broadcast and Cable.
The brand ‘Maruti Suzuki’ has the highest share and also spent the most on TV ad airings for all days of the week.

116
XYZ Ads Airing Report
Findings - IV

117
XYZ Ads Airing Report
Findings - IV (Cont...)

118
XYZ Ads Airing Report
Findings - IV (Cont...)

119
XYZ Ads Airing Report
Findings - IV (Cont...)

120
XYZ Ads Airing Report
Findings - IV (Cont...)

From the above bar plots I have inferred the following:-


Most of the share of TV Ads comes from the daypart Late Fringe (40%) for Northern India, Prime Time (42.86%) for Northern India and Daytime (22.80%) for North East India
Most of the share for TV Ads is on Friday, Saturday and Sunday;
On Sunday the share for Northern India is 74.29%
In case of Cable network the share of Central India is 40.49%;
share of North East India is 90.70%; share of Northern India is 0%
and share of Southern India is 93.41%
In case of Broadcast network the share of Central India is 59.51%;
share of North East India is 9.30%; share of Northern India is 100%
and share of Southern India is 6.59%
In Q1 Prime Time has the highest share from Northern India i.e.
14.29% ; In Q2 Late Fringe has the highest share from Northern
India i.e. 8.57% ; In Q3 again Late Fringe has the highest share from
Northern India i.e. 28.57% ; In Q4 Prime Time has the highest share
from Northern India i.e. 8.57%

121
XYZ Ads Airing Report
Findings - V

122
XYZ Ads Airing Report
Findings - V (Cont...)

123
XYZ Ads Airing Report
Findings - V (Cont...)

Also most of the share in TV Ads Airing for each brand was on
Saturday Weekend show daypart

124
XYZ Ads Airing Report
Analysis

Using the Why's approach I am trying to find some more useful insights:-
Why is POD position so important, and why are companies betting on pod positions which fall in the ranges 7-13, 15-20 and 21-23?
---> POD position is the position of a particular brand's ad in a series of commercials within an ad break. Most people get bored when the main content stops and the commercials start, and in some cases, after a continuous 5-8 minutes of main content, people get off their couch/seat to do some other work during the ad break. POD positions 1-6 together span approximately 1-2.5 minutes, during which the viewer may be doing some other work; POD positions 7-13, 15-20 and 21-23 are therefore favourable positions for showing a particular brand's ad, and so most companies bet on such POD positions.

Why is the share of 'Maruti Suzuki' higher in all quarters when compared to other brands?
----> Due to the efficient marketing strategy of Maruti Suzuki's marketing team.

125
XYZ Ads Airing Report
Analysis (Cont...)

Why do most companies bid for ads during the afternoon and evening and less in the morning?
---> Most people are in a hurry on weekday mornings to reach their office and don't have time to watch ads; they just read the main headlines, and some don't even switch ON their television. During the afternoon break people have some free time in the office, and during the evening/night, after dinner, they can sit back and enjoy the news along with the ads. So most companies bid for ads during the evening and afternoon and less during the morning.

Why is the share of Maruti Suzuki and TATA Motors highest in Q4 compared to their share in other quarters?
----> Most companies run their stock-clearance sales in Q4 and most of them give discounts on their respective brands. In this case the marketing teams of both 'TATA Motors' and 'Maruti Suzuki' planned a proper strategy to win most of the share of TV ad airings in Q4 of the financial year.

126
XYZ Ads Airing Report
Analysis (Cont...)

Why was most of the share of TV ad airings for each brand on the Saturday Weekend show daypart?
----> Most viewers have Sunday off and many of them watch TV late into the night on Saturday, so most brands bid a huge amount for POD positions in these Saturday weekend shows. Also, if some viewers want more details about a particular model of a brand they came across during the ad break of a Saturday weekend show, they can visit the nearest showroom of that brand and even purchase the model if they like it, which brings profit to the brand. So most brands find it profitable to bid on POD positions in the Saturday weekend shows.

127
XYZ Ads Airing Report
Conclusions
In conclusion, I would like to highlight the following:-
The POD position of the different brands has some relation with the amount spent. The amount spent per POD position first increases up to a certain POD position, and as the POD position tends towards 31 there is a gradual decrease in the amount spent for some brands, while for other brands the amount spent decreases drastically.
For brand like Honda the avg_amt_spent is the highest or is at the
peak for POD position around 10
For brand like Hyundai motors the avg_amt_spent is the highest or
is at the peak for POD position around 20 and 22
For brand like Mahindra and Mahindra the avg_amt_spent is the
highest or is at the peak for POD position for around 26
For brand like Maruti Suzuki the avg_amt_spent is the highest or is
at the peak for POD position for around 19
For brand like TATA motors the avg_amt_spent is the highest or is
at the peak for POD position for around 25 and 27
For brand like Toyota the avg_amt_spent is the highest or is at the
peak for POD position for around 18,21 and 23
We can infer from the bar plots and line plots that, from POD position 28 onwards, very little avg_amt_spent is recorded for all the brands
The brand ‘Maruti Suzuki’ had the highest Ads proportion in all the
quarters i.e. 38.78% in Q1; 37.31% in Q2; 36.55% in Q3 ;41.10% in
Q4
For brand ‘Honda’ We can infer that it has shown a decline in TV
Ads Airings from Q1 (12.44%) to Q2(9.77%), then from Q2(9.77%) to
Q3(12.99%) it has increased and then from Q3(12.99%) to
Q4(11.29%) it has again decreased.

128
XYZ Ads Airing Report
Conclusions (Cont..)
For brand ‘Hyundai Motors’ We can infer that it has shown a
decline in TV Ads Airings from Q1(10.48%) to Q2(9.84%), then
from Q2(9.84%) to Q3(9.17%) it has again shown a decline, then
from Q3(9.17%) to Q4(9.23%) it has shown some increase
For brand like ‘Mahindra and Mahindra’ It has shown an increase
in TV Ads Airings from Q1(19.71%) to Q2(24.01%), then from
Q2(24.01%) to Q3(22.05%) it has shown some decline, then from
Q3(22.05%) to Q4(13.57%) it has shown a sharp decline
For brand like ‘Maruti Suzuki’ It has shown a decline in TV Ads
Airings from Q1(38.78%) to Q2(37.31%), then from Q2(37.31%) to
Q3(36.55%) it has again shown some decline and then from
Q3(36.55%) to Q4(40.10%) it has shown a great increase of almost
5%.
For the brand ‘TATA Motors’, it has shown some decline in TV Ads Airings from Q1 (10.12%) to Q2 (7.62%), then from Q2 (7.62%) to Q3 (8.03%) it has shown an increase, and then from Q3 (8.03%) to Q4 (20.93%) it has shown spectacular growth of almost 13 percentage points.
For the brand ‘Toyota’, it has shown an increase in TV Ads Airings from Q1 (8.46%) to Q2 (11.45%), then from Q2 (11.45%) to Q3 (11.21%) it has shown some decline, and then from Q3 (11.21%) to Q4 (3.87%) it has shown a sharp decline of about 7 percentage points.

→ From the competitive Bar plots and Tables we can infer that:
The brand ‘Maruti Suzuki’ has the highest share of TV ads in both network types, viz. Broadcast (37.53%) and Cable (38.37%)
The avg_amt_spent on Broadcast type network is the highest for
the brand ‘Hyundai Motors India’ i.e. $18,078 and on cable type
network is the highest for the brand ‘Mahindra and Mahindra’ i.e.
$1,612

129
XYZ Ads Airing Report
Conclusions (Cont..)
For the broadcast type network the brand ‘Maruti Suzuki’ has the
highest share for the Early Morning(35.01%), Evening
News(36.33%), Late Fringe(52.88%), Overnight(49.68%), Prime
Access(50.38%), Prime Time(39.49%), Weekend(40.89%), but for
Daytime and Early Fringe the brand ‘Honda Cars’ has the highest
share i.e. 33.03% and 44.95% respectively.
For the cable type network the brand ‘Maruti Suzuki’ has the
highest share for all the DayParts of TV Ads
Also the brand ‘Maruti Suzuki’ has the highest share in TV Ads
Airings in all Quarters(Q1,Q2,Q3 and Q4) for both the network
types: Broadcast type and Cable Type
Also the brand ‘Maruti Suzuki’ spent the highest sum of amount in
TV Ads Airings in all Quarters(Q1,Q2,Q3 and Q4) for both the
network types: Broadcast type and Cable Type
The brand ‘Maruti Suzuki’ has the highest share and also spent the
most for TV Ads Airings for all days of week

→ For the brand ‘Mahindra and Mahindra’ from the bar plots and tables we can
infer that:-
Most of the share of TV Ads comes from the daypart Late Fringe (40%) for Northern India, Prime Time (42.86%) for Northern India and Daytime (22.80%) for North East India
Most of the share for TV Ads is on Friday, Saturday and Sunday;
On Sunday the share for Northern India is 74.29%
In case of Cable network the share of Central India is 40.49%;
share of North East India is 90.70%; share of Northern India is 0%
and share of Southern India is 93.41%
In case of Broadcast network the share of Central India is 59.51%;
share of North East India is 9.30%; share of Northern India is 100%
and share of Southern India is 6.59%
130
XYZ Ads Airing Report
Conclusions (Cont..)

In Q1 Prime Time has the highest share from Northern India i.e.
14.29%
In Q2 Late Fringe has the highest share from Northern India i.e.
8.57%
In Q3 again Late Fringe has the highest share from Northern India
i.e. 28.57%
In Q4 Prime Time has the highest share from Northern India i.e.
8.57%
So, the CMO can plan the digital campaign around these daily, weekly, monthly and quarterly patterns

Also most of the share in TV Ads Airing for each brand was on
Saturday Weekend show daypart
Also, most viewers have Sunday off and many of them watch TV late into the night on Saturday, which is why most brands bid a huge amount for POD positions in these Saturday weekend shows
Also if some viewers want to get into more details for a particular
model of the brand they came across during the Ad break on
Saturday Weekend show; they could visit the nearest showroom of
that particular brand and even purchase the model if they liked it,
which would gain profits for the brand
So, most of the brands find it profitable to bid on POD positions
on the Saturday Weekend Shows

131
ABC Call Volume Trend Analysis
Description
A customer experience (CX) team consists of professionals who analyze
customer feedback and data, and share insights with the rest of the
organization. Typically, these teams fulfil various roles and
responsibilities such as: Customer experience programs (CX programs),
Digital customer experience, Design and processes, Internal
communications, Voice of the customer (VoC), User experiences,
Customer experience management, Journey mapping, Nurturing customer
interactions, Customer success, Customer support, Handling customer
data, Learning about the customer journey.

Let’s look at some of the most impactful AI-empowered customer


experience tools you can use today:
Interactive Voice Response (IVR), Robotic Process Automation (RPA),
Predictive Analytics, Intelligent Routing

A Customer Experience team offers huge employment opportunities for customer service representatives, a.k.a. call centre agents or customer service agents. Some of their roles include: email support, inbound support, outbound support and social media support.

Inbound customer support is defined as the call centre which is


responsible for handling inbound calls of customers. Inbound calls are the
incoming voice calls of the existing customers or prospective customers
for your business which are attended by customer care representatives.
Inbound customer service is the methodology of attracting, engaging, and
delighting your customers to turn them into your business' loyal advocates.
By solving your customers' problems and helping them achieve success
using your product or service, you can delight your customers and turn
them into a growth engine for your business.

132
ABC Call Volume Trend Analysis
The Problem
Calculate the average call time duration for all incoming calls received by
agents (in each Time_Bucket).

Show the total volume/ number of calls coming in via charts/ graphs [Number
of calls v/s Time]. You can select time in a bucket form (i.e. 1-2, 2-3, …..)

As you can see current abandon rate is approximately 30%. Propose a


manpower plan required during each time bucket [between 9am to 9pm] to
reduce the abandon rate to 10%. (i.e. You have to calculate minimum number
of agents required in each time bucket so that at least 90 calls should be
answered out of 100.)

Let’s say customers also call this ABC insurance company at night but don’t get an answer as there are no agents available; this creates a bad customer experience for the insurance company. Suppose that for every 100 calls customers made during 9 AM to 9 PM, they also made 30 calls at night in the interval [9 PM to 9 AM], and the distribution of those 30 calls is as follows:

Now propose a manpower plan required during each time bucket in a day.
Maximum Abandon rate assumption would be same 10%.

Assumption: An agent works 6 days a week; on average, total unplanned leave per agent is 4 days a month; an agent's total working hours are 9, out of which 1.5 hours go into lunch and snacks at the office. On average an agent is occupied for 60% of the actual working hours (i.e. 60% of 7.5 hours) on calls with customers/users. Total days in a month is taken as 30.

133
ABC Call Volume Trend Analysis
Findings - I

From the above bar plot we can infer that the time bucket 19_20, i.e. 7 PM to 8 PM, had the highest average duration of answered calls, i.e. 203.4 seconds
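The per-bucket averages behind this chart can be computed with a small groupby; the file and column names below ('call_volume.xlsx', 'Time_Bucket', 'Call_Status', 'Call_Seconds') are assumptions, since the original analysis was done with Excel pivot tables.

import pandas as pd

calls = pd.read_excel("call_volume.xlsx")   # placeholder file name

# Average duration (in seconds) of answered calls in each time bucket
answered = calls[calls["Call_Status"] == "answered"]
avg_duration = answered.groupby("Time_Bucket")["Call_Seconds"].mean().round(1)
print(avg_duration.sort_values(ascending=False).head())   # 19_20 should top out near 203.4 s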

134
ABC Call Volume Trend Analysis
Findings - II

From the above Bar plot we can infer that the time bucket 12_13, i.e. 12 PM to 1 PM, had the highest total duration of answered calls, i.e. 1819327 seconds

135
ABC Call Volume Trend Analysis
Findings - III

From the above bar plot we can infer that the time_bucket 12-13 i.e.
12PM to 1PM had the highest count of calls answered i.e. 9432

136
ABC Call Volume Trend Analysis
Findings - IV

From the above bar plot we can infer that the time bucket 11_12, i.e. 11 AM to 12 PM, has the highest count of total incoming calls, i.e. 14626

137
ABC Call Volume Trend Analysis
Findings - V

From the above bar plot we can infer that the time bucket 11_12 i.e.
11 AM to 12 PM has the highest share for incoming calls i.e. 12.40%

138
ABC Call Volume Trend Analysis
Findings - VI

From the table we can infer that the current abandon rate is around 30%
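As a quick check, the overall abandon rate can be recomputed from the same call log; the file name, column name and status label used here are assumptions.

import pandas as pd

calls = pd.read_excel("call_volume.xlsx")   # placeholder file name

# Share of calls whose status is 'abandoned' (label assumed), as a percentage
abandon_rate = calls["Call_Status"].eq("abandoned").mean() * 100
print(f"Overall abandon rate: {abandon_rate:.1f}%")   # comes out near 30%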

139
ABC Call Volume Trend Analysis
Findings - VII

140
ABC Call Volume Trend Analysis
Findings - VIII

The table above shows the desired distribution of the night calls to keep the abandon rate at 10%.

• Since we have only 17 agents during the night, we need to distribute them in a non-analytical way, i.e. the agents who work in the 19_20 and 20_21 time buckets can wait and also work in the 21_22 and 22_23 time buckets.

• Also, agents who work during the 9_10 and 10_11 time buckets can be asked to work the 7_8 and 8_9 time buckets as well.

• The agents who work in the time buckets 1_2, 2_3, 3_4 and 4_5 can be asked to work in the time buckets 6_7, 7_8 and 8_9 so as to keep the abandon rate at 10%.

141
ABC Call Volume Trend Analysis
Analysis

Using the Why's approach I am trying to find some more insights:-
Why was the count of answered calls higher in the time buckets 10_11, 18_19, 19_20 and 20_21 as compared to other time buckets?
---> Most of the customers are office-goers and need to reach the office by 10 or 11 AM, so these customers call during the 10_11 time bucket, i.e. while they are in transit to the office or have reached the office and have some free time before they start their work. During the time buckets 18_19, 19_20 and 20_21 the customers have either left the office and reached home or are in transit home, and during this period, i.e. 6 PM to 9 PM, people have free time in which they can share their concerns with customer service. During these time buckets most of the calls are from individual people with small problems which can be resolved quickly.

Why does the time bucket 11_12 have the highest number of incoming calls but not the highest number of answered calls?
---> Maybe there were a larger number of incoming calls in the time bucket 11_12 and there were not enough personnel to handle most of the customers' queries during that bucket.

142
ABC Call Volume Trend Analysis
Analysis (Cont...)

Why did the total number of incoming calls reach its peak value during the time bucket 11_12 and decrease from the time bucket 12_13 onwards?
---> It is a general tendency of customers to want their query/complaint resolved on the same day they call the customer centre, so most customers try to place their complaint/query before 12 PM so that, depending on the complexity of the problem, it gets resolved by the end of the day.

Why is the monthly transfer rate lower than the monthly answered and abandon rates?
---> Most customer service centres have a dedicated toll-free number for each type of problem faced by the customer, and there are skilled people at the call centre who are well versed with the problems they come across while handling and guiding thousands of customers on a daily basis. So most of the calls get answered with a solution to the query, some of the calls get abandoned due to unavailability or shortage of skilled personnel, and very few calls get transferred from the junior level to the senior level when the problem is too complex for the junior-level expertise.

143
ABC Call Volume Trend Analysis
Analysis (Cont...)

Why can one not provide the exact distribution of agents during the night time, i.e. from 9 PM to 9 AM, if the number of agents available during the night shift is already defined, so as to keep the abandon rate at 10%?
---> For this particular case, since we have only 17 agents during the night, we need to distribute them in a non-analytical way, i.e. the agents who work in the 19_20 and 20_21 time buckets can wait and also work in the 21_22 and 22_23 time buckets. Also, agents who work during the 9_10 and 10_11 time buckets can be asked to work the 7_8 and 8_9 time buckets as well, and the agents who work in the time buckets 1_2, 2_3, 3_4 and 4_5 can be asked to work in the time buckets 6_7, 7_8 and 8_9 so as to keep the abandon rate at 10%. Also, the company needs to consider various factors, such as how far the agent's home is if he/she is made to do a night shift and whether transport is available during night hours from the agent's home to the company, and hence the exact distribution cannot be given using a purely analytical approach.

144
ABC Call Volume Trend Analysis
Conclusion
In conclusion, I would like to highlight the following:-
From the previous analysis we can derive that the average duration of an answered call is 198.6 seconds across the time buckets.
We need to reduce the abandon rate from 30% (current) to 10% (desired), i.e. we need to raise the answered rate from 70% (current) to 90%. So 90% of the total calls need to be answered in order to bring the abandon rate down to 10%.

Total average incoming calls per day = 5130 • Average duration of an answered call = 198.6 seconds • Answered rate = 90%, i.e. 0.9 • Seconds per hour = 3600 • So, the agent-hours required to answer 90% of the incoming calls = 5130 * 198.6 * 0.9 / 3600 = 254.7 (≈ 255) • Dividing this by the number of hours an agent actually spends on customer calls per day, i.e. 4.5, gives 255 / 4.5 = 56.67 ≈ 57 agents working per day.
So, to have a 10% abandon rate we need 57 agents working per day.
From the assumptions given, the following points were noted:-
In a day an agent works for 9 hours → total agent working hours = 9 hours.
Out of the total 9 hours, 1.5 hours go for lunch and coffee/tea breaks, so the remaining working hours = 9 - 1.5 = 7.5 hours.
Out of the remaining 7.5 hours per day, an agent is occupied with customer calls for only 60% of the time, i.e. 0.6 * 7.5 = 4.5. So an agent spends only 4.5 hours per day out of the 7.5 working hours on customer calls.
An agent works 6 days a week; for a month of 30 days we consider 4 weeks, i.e. 28 days, out of which 4 days are unplanned leave.

145
ABC Call Volume Trend Analysis
Conclusion (Cont...)

Total days left = 28 - 4 = 24 days. Per week there is one Sunday, which is an official holiday, so in the month there are 4 Sundays. Total days left for work = 24 - 4 = 20 days. So an agent is available to work for 20 days in a month of 30 days, which corresponds to (20*7)/28 = 5 days on the floor per week.
In a certain scenario there are calls from consumers not only
during the day time but also during the night time and if there are
no agents available during the night time to answer the call then it
creates a bad impression on the consumer regarding the company
Now we need to give the distribution of the total manpower
available for each time bucket right from 9AM to 9 PM and then
from 9 PM to 9 AM, keeping the abandon rate at 10% i.e. keeping
the answered rate at 90%
For every 100 day calls there are 30 night calls, so for 5130 day calls there will be 5130 * 30 / 100 = 1539 night calls.
So, the additional agent-hours needed, keeping the answered rate at 90%, will be 1539 * 198.6 (average call duration in seconds) * 0.9 / 3600 (seconds per hour) = 76.41.
So, the additional agents needed by the company to answer the night calls will be 76.41 / 4.5 = 16.98 ≈ 17.
So, we need 17 additional agents to answer the night calls as well, making the total number of agents working per day, keeping the answered rate at 90%, equal to 57 (day calls at 90%) + 17 (night calls at 90%) = 74 agents. So we need 74 agents per day to answer customer calls during the day as well as the night, keeping the answered rate at 90% / abandon rate at 10%.
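The whole manpower estimate above can be packed into a short, self-contained calculation. It only restates the figures already derived in this conclusion (5130 day calls, 198.6-second average call duration, 90% answer rate, 4.5 productive hours per agent, and 30 night calls per 100 day calls); none of the numbers are new.

import math

AVG_CALL_SECONDS = 198.6            # average duration of an answered call
ANSWER_RATE = 0.9                   # target answer rate (10% abandon rate)
PRODUCTIVE_HOURS = 0.6 * (9 - 1.5)  # 4.5 hours per day actually spent on calls
DAY_CALLS = 5130                    # average incoming calls between 9 AM and 9 PM

day_agent_hours = DAY_CALLS * AVG_CALL_SECONDS * ANSWER_RATE / 3600      # ~254.7
day_agents = math.ceil(day_agent_hours / PRODUCTIVE_HOURS)               # 57

night_calls = DAY_CALLS * 30 / 100                                       # 1539
night_agent_hours = night_calls * AVG_CALL_SECONDS * ANSWER_RATE / 3600  # ~76.4
night_agents = math.ceil(night_agent_hours / PRODUCTIVE_HOURS)           # 17

print(day_agents, night_agents, day_agents + night_agents)               # 57 17 74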

146
Appendix
Data Analytics Process:-
---> Link for the shared PDF on Google Drive:
Data Analytics Trainee Assignement - 1.pdf - Google Drive

Instagram User Analytics:-


----> Link for the shared file on Google Drive:
Data Analytics Trainee Task - 2.pdf - Google Drive
-----> Link for shared sql file on google drive:
Trainity_Data_Analytics_Trainee_Task_2.sql - Google Drive

Operation Analytics and Investigating Metric Spike


Analysis:-
-----> Link for the shared file on Google Drive:
Data Analytics Trainee Task - 3.pdf - Google Drive
-----> Link for shared sql files on Google Drive and GitHub:
Trainity_Data_Analytics_Trainee/Trainity_Data_Analytics_Tra
inee_task_3.sql at main ·
ADVAIT135/Trainity_Data_Analytics_Trainee · GitHub

Trainity_Data_Analytics_Trainee_task_3.sql - Google Drive

Trainity_Data_Analytics_Trainee/task3_case_sudy_2_Investig
ating_Metric_Spike.sql at main ·
ADVAIT135/Trainity_Data_Analytics_Trainee · GitHub

task3_case_sudy_2_Investigating_Metric_Spike.sql -
Google Drive

Hiring Process Analytics:-


----> Link for shared PDF on google drive:
Data Analytics Trainee Task - 4.pdf - Google Drive
-----> Link for Excel sheet on Google Drive(Analysis):
Statistics.xlsx - Google Sheets
147
Appendix (Cont...)
IMDB Movie Analysis-
---> Link for the shared PDF on Google Drive:
Data Analytics Trainee Task - 5.pdf - Google Drive
-----> Link for shared excel files folder on google drive:
(Download the excel sheets to view as they are huge files and
they can't be viewed online)
trainity_task_5_final_project_excel_files - Google Drive

Bank Loan Case Study:-


----> Link for the shared file on Google Drive:
Trainity Data Analytics Trainee Task 6.pdf - Google Drive
-----> Link for shared excel files folder on google drive:
(Download the excel sheets to view as they are huge files and
they can't be viewed online)
trainity_task_6_final_project_2 - Google Drive

XYZ Ads Airing Report :-


-----> Link for the shared file on Google Drive:
Trainity Data Analytics Trainee Task - 7.pdf - Google Drive
-----> Link for shared excel files folder on google drive:
(Download the excel sheets to view as they are huge files and
they can't be viewed online)
trainity_task_7_final_project_3 - Google Drive

ABC Call Volume Trend Analysis:-


-----> Link for the shared file on Google Drive:
Trainity Data Analytics Trainee Task 8.pdf - Google Drive
-----> Link for shared excel files folder on google drive:
(Download the excel sheets to view as they are huge files and
they can't be viewed online)
trainity_task_8_final_project_4 - Google Drive

148
Appendix (Cont...)
Link to GitHub Portfolio:-
ADVAIT135 (ADVAIT GURUNATH CHAVAN) · GitHub

Link to my Kaggle Portfolio:-


ADVAIT CHAVAN | Contributor | Kaggle

Link to HackerRank Profile:-


ADVAIT CHAVAN - advaitchavan135 | HackerRank

Link to LinkedIn Profile:-


https://www.linkedin.com/in/advait-chavan-69928b129/

149
