DM Project Report
Part 1: Clustering
Read the data and perform basic analysis such as printing a few rows (head and tail), info, data summary, null values, duplicate values, etc.
Treat missing values in CPC, CTR and CPM using the formula given.
Check if there are any outliers. Do you think treating outliers is necessary for K-Means clustering? Based on your judgement, decide whether to treat outliers and, if yes, which method to employ.
Perform z-score scaling and discuss how it affects the speed of the algorithm.
Perform hierarchical clustering by constructing a dendrogram using Ward linkage and Euclidean distance.
Make an elbow plot (up to n=10) and identify the optimum number of clusters for the k-means algorithm.
Print silhouette scores for up to 10 clusters and identify the optimum number of clusters.
Profile the ads based on the optimum number of clusters using the silhouette score and your domain understanding.
Part 2: PCA
Read the data and perform basic checks like checking head, info, summary, nulls, duplicates, etc.
Perform detailed exploratory analysis, e.g. (i) Which state has the highest gender ratio and which has the lowest? (ii) Which district has the highest and lowest gender ratio?
We choose not to treat outliers for this case. Do you think that treating outliers for this case is necessary?
Scale the data using the z-score method. Does scaling have any impact on outliers? Compare boxplots before and after scaling and comment.
Perform all the required steps for PCA (use sklearn only): create the covariance matrix, get eigenvalues and eigenvectors.
Identify the optimum number of PCs (for this project, take at least 90% explained variance). Show the scree plot.
Compare PCs with actual columns and identify which explains the most variance. Write inferences about all the principal components in terms of actual variables.
1.1.1 Read the data and perform basic analysis such as printing a few rows (head and tail), info, data summary,
null values, duplicate values, etc.
The data set is read from the file ‘Clustering Clean Ads_Data-2.xlsx’.
There are 23066 rows and 19 columns: 6 of the columns are of float64 data type, 7 are integers and 6 are of object (string) type.
Here is the data dictionary describing what each column represents.
1 Timestamp == The Timestamp of the particular Advertisement.
2 InventoryType == The Inventory Type of the particular Advertisement. Format 1 to 7. This is a Categorical Variable.
7 Platform == The platform in which the particular Advertisement is displayed. Web, Video or App. This is a
Categorical Variable.
8 Device Type == The type of the device which supports the particular Advertisement. This is a Categorical Variable.
9 Format == The Format in which the Advertisement is displayed. This is a Categorical Variable.
10 Available_Impression == How often the particular Advertisement is shown. An impression is counted each time an
Advertisement is shown on a search result page or other site on a Network.
11 Matched_Queries == Matched search queries data is pulled from Advertising Platform and consists of the exact
searches typed into the search Engine that generated clicks for the particular Advertisement.
12 Impressions == The impression count of the particular Advertisement out of the total available impressions.
13 Clicks == It is a marketing metric that counts the number of times users have clicked on the particular
advertisement to reach an online property.
14 Spend == It is the amount of money spent on specific ad variations within a specific campaign or ad set. This metric
helps regulate ad performance.
16 Revenue == It is the income that has been earned from the particular advertisement.
17 CTR == CTR stands for "Click through rate". CTR is the number of clicks that your ad receives divided by the number
of times your ad is shown. Formula used here is CTR = Total Measured Clicks / Total Measured Ad Impressions x 100.
Note that the Total Measured Clicks refers to the 'Clicks' Column and the Total Measured Ad Impressions refers to the
'Impressions' Column.
18 CPM == CPM stands for "cost per 1000 impressions." Formula used here is CPM = (Total Campaign Spend /
Number of Impressions) * 1,000. Note that the Total Campaign Spend refers to the 'Spend' Column and the Number
of Impressions refers to the 'Impressions' Column.
19 CPC == CPC stands for "Cost-per-click". Cost-per-click (CPC) bidding means that you pay for each click on your ads.
The Formula used here is CPC = Total Cost (spend) / Number of Clicks. Note that the Total Cost (spend) refers to the
'Spend' Column and the Number of Clicks refers to the 'Clicks' Column.
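The three derived metrics above follow directly from the Clicks, Impressions and Spend columns. A minimal sketch in pandas, using hypothetical sample values:

```python
import pandas as pd

# Hypothetical sample rows with the columns named in the data dictionary
df = pd.DataFrame({
    "Clicks": [500, 1200],
    "Impressions": [100000, 240000],
    "Spend": [850.0, 1900.0],
})

# CTR = Clicks / Impressions * 100
df["CTR"] = df["Clicks"] / df["Impressions"] * 100
# CPM = Spend / Impressions * 1000
df["CPM"] = df["Spend"] / df["Impressions"] * 1000
# CPC = Spend / Clicks
df["CPC"] = df["Spend"] / df["Clicks"]
```

The same three expressions are reused later when imputing the missing values of these columns.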
The first 5 records and last 5 records of the data set: -
Here is a description of the numeric fields, giving the count, mean, standard deviation, minimum, 25th
percentile, 50th percentile, 75th percentile and maximum values of each.
Here are the object type columns with their respective value counts for each: -
o InventoryType Column
Format4 7165
Format5 4249
Format1 3814
Format3 3540
Format6 1850
Format2 1789
Format7 659
o Ad Type
Inter224 1658
Inter217 1655
Inter223 1654
Inter219 1650
Inter221 1650
Inter222 1649
Inter229 1648
Inter227 1647
Inter218 1645
inter230 1644
Inter220 1644
Inter225 1643
Inter226 1640
Inter228 1639
o Platform
Video 9873
Web 8251
App 4942
o Device Type
Mobile 14806
Desktop 8260
o Format
Video 11552
Display 11514
The Timestamp column was ignored, as it is not relevant for unique-value counts.
Some of the numeric fields are converted to categorical, as they have only a few unique values and are not well suited to numerical analysis.
o Ad_Length (units)
120 - 7165
300 - 4473
720 - 4249
480 - 3540
336 - 1850
728 - 1789
o Ad_Width (units)
600 - 7824
250 - 5664
300 - 4249
70 - 3540
90 - 1789
o Ad_Size (Ad_Length * Ad_Width)
72000 - 7165
216000 - 4249
75000 - 3814
33600 - 3540
84000 - 1850
65520 - 1789
180000 - 659
The remaining numeric columns are left unchanged and can be used directly in calculations.
The Ad_Length, Ad_Width and Ad_Size columns are converted into categorical type.
Here is the information after changing the data types and renaming the columns so that the column names are easier to handle in Python.
Here are the count plots for the categorical columns; they show the unique values and the total count of each value for every categorical column.
For the graph showing the Fee column, 35% (the advertising fee payable by franchise entities) is the most commonly used percentage; the least common is 21%.
For the graph showing the Ad_Length column, 120 units is the most commonly used ad length.
For the graph showing the Ad_Width column, 600 units is the most commonly used ad width.
For the graph showing the Ad_Size column, which is the product of Ad_Length and Ad_Width, we can clearly infer that 72000 sq. units is the most commonly used ad size.
From the above Histogram plots for the numerical variables, we see that the columns are highly right skewed.
1.1.2 Treat missing values in CPC, CTR and CPM using the formula given.
As per the formula given in the question, the missing values in the CTR, CPM and CPC columns are filled in.
The total null values for CTR before imputing are 4736; after imputing, 0.
The total null values for CPC before imputing are 4736; after imputing, 0.
From the picture below on the left, we can see that the values were imputed; the picture on the right shows the resulting changes in those 3 columns.
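The imputation described above can be sketched as follows, assuming the column names from the data dictionary and hypothetical sample rows (the second row has the three metrics missing):

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the second row has CTR/CPM/CPC missing
df = pd.DataFrame({
    "Clicks": [500, 300],
    "Impressions": [100000, 60000],
    "Spend": [850.0, 240.0],
    "CTR": [0.5, np.nan],
    "CPM": [8.5, np.nan],
    "CPC": [1.7, np.nan],
})

# Fill each missing metric from its defining formula
df["CTR"] = df["CTR"].fillna(df["Clicks"] / df["Impressions"] * 100)
df["CPM"] = df["CPM"].fillna(df["Spend"] / df["Impressions"] * 1000)
df["CPC"] = df["CPC"].fillna(df["Spend"] / df["Clicks"])

# No nulls should remain in the three columns
assert df.isnull().sum().sum() == 0
```

Because `fillna` only touches the missing entries, rows that already have CTR/CPM/CPC values are left untouched.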
1.1.3 Check if there are any outliers. Do you think treating outliers is necessary for K-Means clustering?
Based on your judgement decide whether to treat outliers and if yes, which method to employ
Outliers are present in all the numeric columns.
Please find below the boxplot summary evaluated for each of the 9 numerical columns.
Boxplot values for Available_Impressions -
The Lower limit is -3707387.0
The Minimum value is 1
The Q1 (25th percentile) is 33672.0
The Q2 (50th percentile) (mid value) is 483771.0
The Q3 (75th percentile) is 2527712.0
The Upper limit is 6268771.0
The Maximum value is 27592861
In Available_Impressions column there are 2378 values above the upper range, roughly 10.31 % of the total records
Boxplot values for Matched_Queries -
The Lower limit is -1725344.0
The Minimum value is 1
The Q1 (25th percentile) is 18282.0
The Q2 (50th percentile) (mid value) is 258088.0
The Q3 (75th percentile) is 1180700.0
The Upper limit is 2924326.0
The Maximum value is 14702025
In Matched_Queries column there are 3192 values above the upper range, roughly 13.84 % of the total records
Boxplot values for Impressions -
The Lower limit is -1648666.0
The Minimum value is 1
The Q1 (25th percentile) is 7990.0
The Q2 (50th percentile) (mid value) is 225290.0
The Q3 (75th percentile) is 1112428.0
The Upper limit is 2769086.0
The Maximum value is 14194774
In Impressions column there are 3269 values above the upper range, roughly 14.17 % of the total records
Boxplot values for Clicks -
The Lower limit is -17416.0
The Minimum value is 1
The Q1 (25th percentile) is 710.0
The Q2 (50th percentile) (mid value) is 4425.0
The Q3 (75th percentile) is 12794.0
The Upper limit is 30919.0
The Maximum value is 143049
In Clicks column there are 1691 values above the upper range, roughly 7.33 % of the total records
Boxplot values for Spend -
The Lower limit is -4469.0
The Minimum value is 0.0
The Q1 (25th percentile) is 85.0
The Q2 (50th percentile) (mid value) is 1425.0
The Q3 (75th percentile) is 3121.0
The Upper limit is 7676.0
The Maximum value is 26931.87
In Spend column there are 2081 values above the upper range, roughly 9.02 % of the total records
Boxplot values for Revenue -
The Lower limit is -2999.0
The Minimum value is 0.0
The Q1 (25th percentile) is 55.0
The Q2 (50th percentile) (mid value) is 926.0
The Q3 (75th percentile) is 2091.0
The Upper limit is 5145.0
The Maximum value is 21276.18
In Revenue column there are 2325 values above the upper range, roughly 10.08 % of the total records
Boxplot values for CTR -
The Lower limit is -0.0
The Minimum value is 0.0001
The Q1 (25th percentile) is 0.0
The Q2 (50th percentile) (mid value) is 0.0
The Q3 (75th percentile) is 0.0
The Upper limit is 0.0
The Maximum value is 200.0
In CTR column there are 3487 values above the upper range, roughly 15.12 % of the total records
Boxplot values for CPM -
The Lower limit is -15.0
The Minimum value is 0.0
The Q1 (25th percentile) is 2.0
The Q2 (50th percentile) (mid value) is 8.0
The Q3 (75th percentile) is 13.0
The Upper limit is 30.0
The Maximum value is 715.0
In CPM column there are 208 values above the upper range, roughly 0.9 % of the total records
Boxplot values for CPC -
The Lower limit is -1.0
The Minimum value is 0.0
The Q1 (25th percentile) is 0.0
The Q2 (50th percentile) (mid value) is 0.0
The Q3 (75th percentile) is 1.0
The Upper limit is 1.0
The Maximum value is 7.26
In CPC column there are 568 values above the upper range, roughly 2.46 % of the total records
Please find the picture below which contains the boxplot of the 9 numerical columns
Outlier treatment was performed using the IQR method; the picture below shows the boxplots after treatment.
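One common way to implement the IQR-based capping described above is to clip each column at Q1 - 1.5*IQR and Q3 + 1.5*IQR; a sketch with a hypothetical skewed column:

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series) -> pd.Series:
    """Clip values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] to those limits."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Hypothetical right-skewed column with one extreme value
clicks = pd.Series([100, 200, 300, 400, 500, 100000])
capped = cap_outliers_iqr(clicks)
```

Values inside the whiskers are unchanged; only the extreme value is set to the upper limit, which matches the flattening at the upper range seen in the histograms after treatment.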
These are the boxplot summaries after the outlier treatment.
The histogram plots above show the effect of the treatment: all values above the upper range have been set equal to the upper-range value.
1.1.4 Perform z-score scaling and discuss how it affects the speed of the algorithm.
Z-score scaling was performed on the dataset. Scaling by itself does not change the number of rows or features, so it does not change the asymptotic cost of the algorithm; the runtime depends mainly on the number of records, features, clusters and iterations. That said, standardization typically helps k-means converge in fewer iterations, because without it the features with the largest magnitudes (such as Available_Impressions) dominate the Euclidean distance computations.
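Z-score scaling can be sketched with sklearn's StandardScaler; the column values here are hypothetical, chosen to mimic two features on very different scales:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical slice of two numeric columns on very different scales
df = pd.DataFrame({"Spend": [85.0, 1425.0, 3121.0], "Clicks": [710, 4425, 12794]})

# z = (x - mean) / std, applied column-wise
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
```

After scaling, each column has mean 0 and unit standard deviation, so both features contribute comparably to the Euclidean distances used by k-means.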
1.1.5 Perform hierarchical clustering by constructing a dendrogram using Ward linkage and Euclidean distance.
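The dendrogram construction can be sketched with scipy; the data here is a random stand-in for the scaled numeric columns (in scipy, the ward method implies Euclidean distance):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(42)
# Hypothetical stand-in for the scaled numeric columns
X = rng.normal(size=(60, 9))

# Ward linkage; scipy's "ward" method uses Euclidean distance
Z = linkage(X, method="ward")

# Z has n-1 merge rows: [idx1, idx2, merge distance, cluster size]
# dendrogram(Z) would draw the tree with matplotlib
```

Ward linkage merges the pair of clusters that minimizes the increase in within-cluster variance, so the merge distances in `Z` are non-decreasing from bottom to top of the tree.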
1.1.6 Make Elbow plot (up to n=10) and identify optimum number of clusters for k-means algorithm
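The elbow plot is built from the within-cluster sum of squares (inertia) for k = 1..10; a sketch on hypothetical blob data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical scaled data with three well-separated blobs
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 5, 10)])

# Within-cluster sum of squares (inertia) for k = 1..10
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 11)]
# plt.plot(range(1, 11), wss, marker="o") would draw the elbow plot
```

The inertia drops sharply until k reaches the true number of groups and only marginally after that; the "elbow" in the curve marks the candidate optimum.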
1.1.7 Print silhouette scores for up to 10 clusters and identify optimum number of clusters
Here is the list of silhouette scores for each number of clusters, together with the minimum silhouette width, which indicates how well the worst-placed point is clustered:
The silhouette score for 2 clusters is 0.4764675196030162. The Minimum value for the silhouette width is -0.07846579351816758
The silhouette score for 3 clusters is 0.41202775538030506. The Minimum value for the silhouette width is -0.1213613474025022
The silhouette score for 4 clusters is 0.4435830153133098. The Minimum value for the silhouette width is -0.05444643658905788
The silhouette score for 5 clusters is 0.43504173598022067. The Minimum value for the silhouette width is -0.09688768275287114
The silhouette score for 6 clusters is 0.4580612382343229. The Minimum value for the silhouette width is -0.2167164693199777
The silhouette score for 7 clusters is 0.4436337977966384. The Minimum value for the silhouette width is -0.2183479792508185
The silhouette score for 8 clusters is 0.4597928303095475. The Minimum value for the silhouette width is -0.15369094092604285
The silhouette score for 9 clusters is 0.4624159152234553. The Minimum value for the silhouette width is -0.15474939888054123
The silhouette score for 10 clusters is 0.44198839415394225. The Minimum value for the silhouette width is -0.136669270144576
The optimum number of clusters is identified as K = 4 (indicated by the red line): the silhouette score for 4 clusters is 0.4436, and its minimum silhouette width of -0.0544 is the closest to zero (least negative) compared with the other values of K.
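The scores above can be reproduced with sklearn's silhouette utilities; a sketch on hypothetical data with four well-separated groups:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(1)
# Hypothetical scaled data: four well-separated groups
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(40, 2)) for c in (0, 6, 12, 18)])

scores, min_widths = {}, {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)              # mean silhouette over all points
    min_widths[k] = silhouette_samples(X, labels).min()  # worst-clustered point
```

`silhouette_score` gives the average width, while the minimum of `silhouette_samples` flags whether any point is badly misassigned (a strongly negative minimum), which is the secondary criterion used above to pick K.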
1.1.8 Profile the ads based on optimum number of clusters using silhouette score and your domain
understanding
The cluster column and sil_width are joined to the data set, and the means of the columns are grouped cluster-wise. Clusters range from 0 to 3.
Below is the bar plot of the numerical variables, hued by Device_Type.
Desktop and mobile device usage looks almost equal for all the displayed columns.
Even though the click count for Cluster 0 is high, the Cluster 1 ads are the revenue-generating group.
The amount spent per 1000 impressions is high for the Cluster 0 and Cluster 3 ad groups.
The picture below shows the value counts of the items for each column, hued by cluster.
From the top picture, on the InventoryType graph, ads from Cluster 3 use Format4 of InventoryType the most.
The Ad_Size of 72000 is used most by the Cluster 3 ads.
An advertising fee of 35% is the most commonly chosen percentage paid to the franchise entities by the Cluster 3 and Cluster 2 groups of ads.
Here is the graph which shows the revenue for each column, broken down by unique value and hued by cluster group.
From the graph, the revenue generated by the Cluster 1 group of ads is the highest across all the columns.
Format2 of InventoryType and an Ad_Size of 65520 generate the most revenue for the Cluster 1 group of ads.
Within the Cluster 3 group, the revenue generated by ads paying a 21% advertising fee is the highest among the percentages, followed by the 23% fee.
Cluster 3 earns revenue across all the advertising-fee percentage levels, unlike the other cluster groups.
1.1.9 Conclude the project by providing summary of your learning.
Recommendations:
Further inferences from the analysed data are provided below to support the recommendations.
Groupby 1:
The table below shows the mean of the values in the columns, grouped by cluster (0 to 3).
The Cluster 3 group of ads has the highest count compared with the other clusters.
The Cluster 1 group of ads has the highest means of the Available_Impressions, Matched_Queries, Impressions, Spend, Revenue and CPC columns.
Even though the Cluster 3 group has the highest count, it has the lowest means of the Available_Impressions, Matched_Queries, Impressions, Clicks, Spend and Revenue columns.
The mean CPM (amount spent per 1000 impressions) is high for the Cluster 0 and 3 groups compared with the other clusters; Cluster 1 has the lowest mean CPM.
Even though the mean CTR (clicks received divided by the number of times the ad is shown) is high for the Cluster 3 ads, the revenue generated is the lowest compared with the other cluster groups.
Groupby 2:
The table below shows the sum of the values in the columns, grouped by cluster (0 to 3).
The Cluster 1 group of ads has the highest sums of the Available_Impressions, Matched_Queries, Impressions, Spend and Revenue columns.
Cluster 2 has the next highest sums of the Available_Impressions, Matched_Queries and Impressions columns.
Crosstab 1:
The Crosstab 1 picture on the left shows the cross-tabulation of Fee and Clusters with count as the value; the picture on the right shows the same cross-tabulation with Revenue as the value.
All of the ads in the Cluster 3 group paid the highest advertising fee of 35% (paid by franchise entities) and none of the other percentages.
The Cluster 1 group paid fees across all the available categories; franchise entities paid a 33% advertising fee for the highest number of ads (1734).
The 23% advertising fee accounts for the highest amount paid within the Cluster 1 group of ads.
Crosstab 2
Based on Crosstab 2, we can see that only certain combinations of Ad_Length and Ad_Width occur. Those are:
480*70 = 33600
728*90 = 65520
300*250 = 75000
336*250 = 84000
720*300 = 216000
120*600 = 72000
300*600= 180000
As Ad_Size = Ad_Length * Ad_Width, these are also the only available sizes.
Crosstab 3
Based on Crosstab 3, with Revenue as the value, Ad_Size 65520 (728*90) is the highest revenue-generating size; it is used by the Cluster 1 group of ads.
Crosstab 4a and Crosstab 4b:
Crosstab 4a shows the cross-tabulation with Device_Type, Platform and Format in the rows and Clusters in the columns, with Clicks as the value; Crosstab 4b shows the same with Revenue as the value.
Even though the Cluster 0 group of ads has a higher number of clicks than Cluster 1, Cluster 1 earns more revenue than Cluster 0.
Other Inferences:
The impression share is the ratio of Impressions to Available_Impressions.
Here is the data for the clusters which shows the Matched Queries per click
Matched_Queries per Click for Cluster 0 is 8.865
Matched_Queries per Click for Cluster 1 is 500.579
Matched_Queries per Click for Cluster 2 is 265.021
Matched_Queries per Click for Cluster 3 is 10.697
We can infer that Cluster 1 group has the highest Matched Queries per click
Here is the data for the clusters which shows the Revenue per click
Revenue per Click for Cluster 0 is 0.07
Revenue per Click for Cluster 1 is 0.566
Revenue per Click for Cluster 2 is 0.301
Revenue per Click for Cluster 3 is 0.063
We can infer that Cluster 1 group has the highest Revenue per click.
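Per-cluster ratios like these come from grouped sums; a minimal sketch with a tiny hypothetical cluster-labelled frame (the values are illustrative, not the report's data):

```python
import pandas as pd

# Hypothetical cluster-labelled rows
df = pd.DataFrame({
    "Clusters": [0, 0, 1, 1],
    "Clicks": [1000, 3000, 400, 600],
    "Revenue": [70.0, 210.0, 250.0, 316.0],
    "Matched_Queries": [9000, 26460, 200000, 300580],
})

# Sum within each cluster first, then take the ratio of the sums
g = df.groupby("Clusters").sum()
revenue_per_click = g["Revenue"] / g["Clicks"]
queries_per_click = g["Matched_Queries"] / g["Clicks"]
```

Summing before dividing gives the cluster-level ratio (total revenue over total clicks), which is generally what these per-click metrics mean; averaging row-level ratios would weight small ads the same as large ones.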
Recommendations
The Cluster 1 group of ads generates the highest revenue: spend more on these ads, strategically selecting an Ad_Size of 65520 (728*90) or 75000 (300*250), paying the 23% fee, with Format2 of the inventory type.
2.1 PCA FH (FT):
Primary census abstract for female headed households excluding institutional households (India & States/UTs - District Level), Scheduled tribes - 2011 PCA for
Female Headed Household Excluding Institutional Household.
The Indian Census has the reputation of being one of the best in the world. The first Census in India was conducted in the year 1872. This was conducted at
different points of time in different parts of the country. In 1881 a Census was taken for the entire country simultaneously. Since then, Census has been conducted
every ten years, without a break. Thus, the Census of India 2011 was the fifteenth in this unbroken series since 1872, the seventh after independence and the
second census of the third millennium and the twenty-first century. The census has continued uninterrupted despite several adversities such as wars, epidemics,
natural calamities and political unrest. The Census of India is conducted under the provisions of the Census Act, 1948 and the Census Rules, 1990.
The Primary Census Abstract, which is an important publication of the 2011 Census, gives basic information on Area, Total Number of Households, Total Population,
Scheduled Castes, Scheduled Tribes Population, Population in the age group 0-6, Literates, Main Workers and Marginal Workers classified by the four broad
industrial categories, namely, (i) Cultivators, (ii) Agricultural Labourers, (iii) Household Industry Workers, and (iv) Other Workers, and also Non-Workers.
The characteristics of the Total Population include Scheduled Castes, Scheduled Tribes, Institutional and Houseless Population and are presented by sex and rural-
urban residence. Census 2011 covered 35 States/Union Territories, 640 districts, 5,924 sub-districts, 7,935 Towns and 6,40,867 Villages.
The data collected has so many variables thus making it difficult to find useful details without using Data Science Techniques.
You are tasked to perform detailed EDA and identify Optimum Principal Components that explains the most variance in data. Use Sklearn only.
Note: The 24 variables given in the Rubric is just for performing EDA.
You will have to consider the entire dataset, including all the variables for performing PCA.
Read the data and perform basic checks like checking head, info, summary, nulls, and duplicates, etc.
Perform detailed Exploratory analysis by creating certain questions like (i) Which state has highest gender ratio and which has the lowest? (ii) Which
district has the highest & lowest gender ratio? (Example Questions). Pick 5 variables out of the given 24 variables below for EDA: No_HH, TOT_M,
TOT_F, M_06, F_06, M_SC, F_SC, M_ST, F_ST, M_LIT, F_LIT, M_ILL, F_ILL, TOT_WORK_M, TOT_WORK_F, MAINWORK_M, MAINWORK_F, MAIN_CL_M,
MAIN_CL_F, MAIN_AL_M, MAIN_AL_F, MAIN_HH_M, MAIN_HH_F, MAIN_OT_M, MAIN_OT_F
We choose not to treat outliers for this case. Do you think that treating outliers for this case is necessary?
Scale the Data using z-score method. Does scaling have any impact on outliers? Compare boxplots before and after scaling and comment.
Perform all the required steps for PCA (use sklearn only) Create the covariance Matrix Get eigen values and eigen vector.
Identify the optimum number of PCs (for this project, take at least 90% explained variance). Show Scree plot.
Compare PCs with Actual Columns and identify which is explaining most variance. Write inferences about all the Principal components in terms of
actual variables.
2.1.1 Read the data and perform basic checks like checking head, info, summary, nulls, and duplicates, etc.
Here are the last 5 rows of data.
Here are the column names for the data set
This is the list of the 61 column names.
2.1.2 Perform detailed Exploratory analysis by creating certain questions like
(i) Which state has highest gender ratio and which has the lowest?
(ii) Which district has the highest & lowest gender ratio? (Example Questions).
Pick 5 variables out of the given 24 variables below for EDA: No_HH, TOT_M, TOT_F, M_06, F_06, M_SC, F_SC, M_ST, F_ST, M_LIT, F_LIT, M_ILL, F_ILL,
TOT_WORK_M, TOT_WORK_F, MAINWORK_M, MAINWORK_F, MAIN_CL_M, MAIN_CL_F, MAIN_AL_M, MAIN_AL_F, MAIN_HH_M, MAIN_HH_F, MAIN_OT_M,
MAIN_OT_F
Q1 Which state has highest gender ratio and which has the lowest?
Q3 Which are top 5 State/District that has the highest Literate Males and Literate Females?
The top 5 State/District that has the highest Literate Males are
State Area_Name
Maharashtra Mumbai Suburban 403261.00
West Bengal North Twenty Four Parganas 384839.00
Kerala Malappuram 371829.00
Maharashtra Thane 332986.00
Karnataka Bangalore 325690.00
Name: M_LIT, dtype: float64
The top 5 State/District that has the highest Literate Females are
State Area_Name
Kerala Malappuram 571140.00
Maharashtra Mumbai Suburban 568736.00
West Bengal North Twenty Four Parganas 517061.00
Maharashtra Thane 486756.00
Karnataka Bangalore 471354.00
Name: F_LIT, dtype: float64
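The gender-ratio question can be answered with a groupby on state-level totals; a sketch with hypothetical numbers, assuming the conventional definition of gender ratio as females per 1000 males:

```python
import pandas as pd

# Hypothetical district rows; gender ratio = females per 1000 males
df = pd.DataFrame({
    "State": ["A", "A", "B"],
    "TOT_M": [1000, 2000, 1500],
    "TOT_F": [1100, 2100, 1400],
})

# Aggregate to state level before taking the ratio
by_state = df.groupby("State")[["TOT_M", "TOT_F"]].sum()
ratio = (by_state["TOT_F"] / by_state["TOT_M"] * 1000).round(0)
highest, lowest = ratio.idxmax(), ratio.idxmin()
```

Summing TOT_M and TOT_F per state before dividing is important: averaging district-level ratios would weight small districts the same as large ones. The district-level question is the same computation grouped by the district column instead.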
2.1.3 We choose not to treat outliers for this case. Do you think that treating outliers for this case is necessary?
The boxplot summary below shows, for each column, the upper limit of the boxplot whisker, the maximum value, and the number of records above the upper range:

No_HH - Upper limit 143004.0; Maximum 310450; 31 records above the upper range (roughly 4.84 % of the total records)
TOT_M - Upper limit 224454.0; Maximum 485417; 25 records above the upper range (roughly 3.91 %)
TOT_F - Upper limit 340853.0; Maximum 750392; 26 records above the upper range (roughly 4.06 %)
M_06 - Upper limit 34200.0; Maximum 96223; 32 records above the upper range (roughly 5.0 %)
F_06 - Upper limit 32747.0; Maximum 95129; 33 records above the upper range (roughly 5.16 %)
M_SC - Upper limit 43375.0; Maximum 103307; 29 records above the upper range (roughly 4.53 %)
F_SC - Upper limit 64545.0; Maximum 156429; 29 records above the upper range (roughly 4.53 %)
M_ST - Upper limit 18704.0; Maximum 96785; 51 records above the upper range (roughly 7.97 %)
F_ST - Upper limit 30556.0; Maximum 130119; 58 records above the upper range (roughly 9.06 %)
M_LIT - Upper limit 163027.0; Maximum 403261; 30 records above the upper range (roughly 4.69 %)
F_LIT - Upper limit 180601.0; Maximum 571140; 37 records above the upper range (roughly 5.78 %)
M_ILL - Upper limit 60896.0; Maximum 105961; 39 records above the upper range (roughly 6.09 %)
F_ILL - Upper limit 162627.0; Maximum 254160; 26 records above the upper range (roughly 4.06 %)
TOT_WORK_M - Upper limit 104937.0; Maximum 269422; 32 records above the upper range (roughly 5.0 %)
TOT_WORK_F - Upper limit 108939.0; Maximum 257848; 42 records above the upper range (roughly 6.56 %)
MAINWORK_M - Upper limit 85617.0; Maximum 247911; 36 records above the upper range (roughly 5.62 %)
MAINWORK_F - Upper limit 73405.0; Maximum 226166; 55 records above the upper range (roughly 8.59 %)
MAIN_CL_M - Upper limit 16202.0; Maximum 29113; 25 records above the upper range (roughly 3.91 %)
MAIN_CL_F - Upper limit 15335.0; Maximum 36193; 29 records above the upper range (roughly 4.53 %)
MAIN_AL_M - Upper limit 18563.0; Maximum 40843; 36 records above the upper range (roughly 5.62 %)
MAIN_AL_F - Upper limit 24431.0; Maximum 87945; 60 records above the upper range (roughly 9.38 %)
MAIN_HH_M - Upper limit 2467.0; Maximum 16429; 47 records above the upper range (roughly 7.34 %)
MAIN_HH_F - Upper limit 3216.0; Maximum 45979; 56 records above the upper range (roughly 8.75 %)
MAIN_OT_M - Upper limit 47128.0; Maximum 240855; 53 records above the upper range (roughly 8.28 %)
MAIN_OT_F - Upper limit 31207.0; Maximum 209355; 59 records above the upper range (roughly 9.22 %)
MARGWORK_M - Upper limit 20094.0; Maximum 47553; 43 records above the upper range (roughly 6.72 %)
MARGWORK_F - Upper limit 39061.0; Maximum 66915; 19 records above the upper range (roughly 2.97 %)
MARG_CL_M - Upper limit 2735.0; Maximum 13201; 55 records above the upper range (roughly 8.59 %)
MARG_CL_F - Upper limit 5703.0; Maximum 44324; 53 records above the upper range (roughly 8.28 %)
MARG_AL_M - Upper limit 9442.0; Maximum 23719; 48 records above the upper range (roughly 7.5 %)
MARG_AL_F - Upper limit 20619.0; Maximum 45301; 30 records above the upper range (roughly 4.69 %)
MARG_HH_M - Upper limit 784.0; Maximum 4298; 58 records above the upper range (roughly 9.06 %)
MARG_HH_F - Upper limit 2149.0; Maximum 15448; 39 records above the upper range (roughly 6.09 %)
MARG_OT_M - Upper limit 8560.0; Maximum 24728; 46 records above the upper range (roughly 7.19 %)
MARGWORK_0_3_M - Upper limit 7168.0; Maximum 20648; 48 records above the upper range (roughly 7.5 %)
NON_WORK_M - Upper limit 1270.0; Maximum 6456; 54 records above the upper range (roughly 8.44 %)
2.1.4 Scale the Data using z-score method. Does scaling have any impact on outliers? Compare boxplots before and after scaling and comment.
Z-score scaling is applied; the picture below shows the first 5 records of all the columns.
The picture below shows the boxplots after z-scaling; there is no change in the outliers, since scaling shifts and rescales every value in a column identically and so preserves their relative positions.
2.1.5 Perform all the required steps for PCA (use sklearn only) Create the covariance Matrix Get eigen values and eigen vector.
For PCA:
As the p-value is 0.0, we can reject the null hypothesis: significant correlations exist among the variables.
The KMO value (kmo_model) is 0.804; as it is above 0.7, it indicates an adequate sample for PCA.
Here is a sample view of the eigenvectors (left picture) and eigenvalues (right picture).
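The covariance matrix and its eigen decomposition can be computed alongside sklearn's PCA; a sketch on hypothetical correlated data standing in for the scaled census columns:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Hypothetical stand-in for the census variables (correlated columns)
raw = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
X = StandardScaler().fit_transform(raw)

cov = np.cov(X, rowvar=False)             # covariance matrix of the scaled data
eig_vals, eig_vecs = np.linalg.eigh(cov)  # eigenvalues/eigenvectors (ascending order)

pca = PCA().fit(X)
# pca.explained_variance_ equals the eigenvalues sorted in descending order
```

sklearn's `PCA.explained_variance_` is exactly the eigenvalues of the covariance matrix (with the n-1 denominator, matching `np.cov`), and `pca.components_` holds the corresponding eigenvectors as rows, so the two routes agree.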
Here is the heat map of the columns.
Here is a glimpse of the extracted_loadings table with the column names and 57 PC values (the entire table cannot be shown due to size restrictions).
2.1.6 Identify the optimum number of PCs (for this project, take at least 90% explained variance). Show Scree plot.
If the explained variance is taken as 90%, the Optimum number of PC can be taken as 6. PC 1 to PC 6.
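The cutoff can be read off the cumulative explained-variance curve; a sketch on hypothetical data with a few strong latent factors (so a small number of PCs crosses 90%):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Hypothetical correlated data: 3 latent factors driving 12 observed columns
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 12)) + 0.1 * rng.normal(size=(200, 12))

pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_pcs = int(np.argmax(cum_var >= 0.90) + 1)  # smallest PC count reaching 90%
# plt.plot(range(1, len(cum_var) + 1), cum_var) would give the cumulative scree plot
```

`np.argmax` on the boolean array returns the first index where the cumulative variance crosses the 90% threshold, so `n_pcs` is the minimal number of components to retain.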
Here are the PCs plotted against the cumulative explained variance.
2.1.7 Compare PCs with Actual Columns and identify which is explaining most variance. Write inferences about all the principal components in terms of actual
variables.
Here is how the original features influence the various PCs.
Here, the required number of PCs has been extracted (as per the cumulative explained variance); the expression below shows one PC as a linear combination of the original variables:
0.16*No_HH + 0.17*TOT_M + 0.17*TOT_F + 0.16*M_06 + 0.16*F_06 + 0.15*M_SC + 0.15*F_SC + 0.03*M_ST + 0.03*F_ST
+ 0.16*M_LIT + 0.15*F_LIT + 0.16*M_ILL + 0.17*F_ILL + 0.16*TOT_WORK_M + 0.15*TOT_WORK_F + 0.15*MAINWORK_M + 0.12*MAINWORK_F
+ 0.10*MAIN_CL_M + 0.07*MAIN_CL_F + 0.11*MAIN_AL_M + 0.07*MAIN_AL_F + 0.13*MAIN_HH_M + 0.08*MAIN_HH_F + 0.12*MAIN_OT_M + 0.11*MAIN_OT_F
+ 0.16*MARGWORK_M + 0.16*MARGWORK_F + 0.08*MARG_CL_M + 0.05*MARG_CL_F + 0.13*MARG_AL_M + 0.11*MARG_AL_F + 0.14*MARG_HH_M + 0.13*MARG_HH_F + 0.16*MARG_OT_M + 0.15*MARG_OT_F
+ 0.16*MARGWORK_3_6_M + 0.16*MARGWORK_3_6_F + 0.17*MARG_CL_3_6_M + 0.16*MARG_CL_3_6_F + 0.09*MARG_AL_3_6_M + 0.05*MARG_AL_3_6_F + 0.13*MARG_HH_3_6_M + 0.11*MARG_HH_3_6_F + 0.14*MARG_OT_3_6_M + 0.12*MARG_OT_3_6_F
+ 0.15*MARGWORK_0_3_M + 0.15*MARGWORK_0_3_F + 0.15*MARG_CL_0_3_M + 0.14*MARG_CL_0_3_F + 0.05*MARG_AL_0_3_M + 0.04*MARG_AL_0_3_F + 0.12*MARG_HH_0_3_M + 0.12*MARG_HH_0_3_F + 0.14*MARG_OT_0_3_M + 0.13*MARG_OT_0_3_F
+ 0.15*NON_WORK_M + 0.13*NON_WORK_F