World Development
Measurement
PROJECT TITLE: “WORLD DEVELOPMENT MEASUREMENT”
MENTOR NAME: Ms. Snehal Shinde
GROUP NO.: 4
START DATE: 6 NOV 2023
Group details
Mr. Chintha Gangadhara 8688620739
Mr. Sriharsha Noorbhasha 9959905437
Mr. Rohan Vasant Thorat 8793574733
Mr. Akshay Haridas Rautray 9637637670
Business Objective:
⮚ Create clusters on the global development measurement dataset.
⮚ The dataset contains important economic and development
metrics for countries across the globe.
Business Problem
⮚ Develop a clustering model for the world development measurement dataset
⮚ Evaluate the performance of the model
This is a cluster analysis (unsupervised learning) problem: the goal of clustering
is to find distinct groups, or “clusters”, within a data set.
Project Architecture / Project Flow
Understanding the Business Problem
Dataset Understanding
EDA process
Feature Engineering
Model Building
Model Evaluation and Feedback
Deployment
Data Set Details:
The data file contains 2708 observations with 25 variables and has information about important
economic and development metrics related to various countries across the globe.
The variables, or features, are the following:
Birth Rate, Business tax, CO2emissions, Country, Days to start a business, Ease of business,
energy usage, GDP, healthexp%GDP, healthexp/capita, hours to do tax, infant mortality,
internet usage, lending rate, life expectancy female, etc.
Data Set
Exploratory Data Analysis (EDA) and
Feature Engineering
⮚ The numeric variables in the dataset are stored as floating-point values.
⮚ The features contain null values, except for Country and Population Total.
⮚ There are no duplicate observations in the dataset.
⮚ The dataset contains a total of 12203 null values.
⮚Data Distribution:
skewness
Visualizing the null values for each
attribute with a Seaborn heatmap.
• Ease of Business has the largest number of
missing values.
• Population Urban has only a small number of missing
values.
• There are no missing values in Country and
Population Total.
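The null-value checks above can be sketched with pandas as follows. This is a minimal illustration, not the project's notebook: the four-row frame and its values are made up, and only the column names are taken from the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; column names follow the slides, values are invented.
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Population Total": [1.0e6, 2.5e6, 4.0e5, 9.0e6],
    "Population Urban": [0.40, np.nan, 0.75, 0.62],
    "Ease of Business": [np.nan, np.nan, 35.0, np.nan],
})

# Null count per column, sorted so the most incomplete column comes first.
null_counts = df.isna().sum().sort_values(ascending=False)
total_nulls = int(null_counts.sum())

# Number of fully duplicated rows.
duplicate_rows = int(df.duplicated().sum())

print(null_counts)
print("total nulls:", total_nulls, "| duplicate rows:", duplicate_rows)
```

The boolean mask `df.isna()` is the matrix that a heatmap of missing values would display, for example with `seaborn.heatmap(df.isna(), cbar=False)`.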
VISUALIZATION
Distribution plots for each column
• The 'BusinessTaxRate',
'EaseofBusiness',
'HealthExpGDP', 'HourstodoTax'
and 'Population0to14' columns
are approximately normally distributed, so we
replace their missing values with the mean.
• The remaining columns are
skewed, so we replace their missing
values with the median.
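The mean/median rule above could be automated roughly as follows. This is a sketch under assumptions: the helper name `impute_by_skewness` and the skewness cut-off of 0.5 are illustrative choices, not taken from the slides, and the sample values are invented.

```python
import numpy as np
import pandas as pd

def impute_by_skewness(df, threshold=0.5):
    """Fill NaNs with the mean for roughly symmetric columns, else the median.

    The threshold is an assumed cut-off on absolute skewness.
    """
    filled = df.copy()
    strategy = {}
    for col in filled.select_dtypes(include="number").columns:
        skew = filled[col].skew()  # NaNs are skipped by default
        use_mean = abs(skew) < threshold
        value = filled[col].mean() if use_mean else filled[col].median()
        filled[col] = filled[col].fillna(value)
        strategy[col] = "mean" if use_mean else "median"
    return filled, strategy

# Invented values: the first column is symmetric, the second is heavily right-skewed.
df = pd.DataFrame({
    "HealthExpGDP": [4.0, 5.0, 6.0, 7.0, 8.0, np.nan],
    "GDP": [1.0, 2.0, 3.0, 4.0, 500.0, np.nan],
})

filled, strategy = impute_by_skewness(df)
print(strategy)
print(filled)
```

The median is preferred for skewed columns because a few extreme values pull the mean away from the bulk of the data.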
OUTLIER DETECTION
• Some columns, such as "Population Total", "Tourism Inbound", and "Tourism Outbound", have a large number of outliers.
• Columns such as "Population Urban" and "Population 0 to 14" have fewer outliers.
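The slides do not state how outliers were detected. One common approach, shown here as an assumption, is the 1.5 × IQR rule that box plots use; the helper name `iqr_outlier_counts` and the data are hypothetical.

```python
import pandas as pd

def iqr_outlier_counts(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    below = numeric.lt(q1 - k * iqr, axis="columns")
    above = numeric.gt(q3 + k * iqr, axis="columns")
    return (below | above).sum()

# Invented values: one extreme entry in the first column, none in the second.
df = pd.DataFrame({
    "Population Total": [1, 2, 3, 4, 5, 6, 7, 1000],
    "Population Urban": [10, 11, 12, 13, 14, 15, 16, 17],
})

counts = iqr_outlier_counts(df)
print(counts)
```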
TOP 10 CO2 EMISSION COUNTRIES
• China and the United States are the highest CO2
emission countries.
• The Russian Federation, India, and Japan are
medium CO2 emission countries.
• Germany, Canada, the UK, Korea, and Iran are the lowest CO2
emission countries among the top 10.
TOP 10 HIGHEST BIRTH RATE COUNTRIES
Visualizing Relation between variables
• 'Population 0 to 14' and 'Birth Rate' have a strong
relation (0.94).
• 'Population 15 to 64' and 'Birth Rate' also have a strong
relation in magnitude (0.90).
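Relations like these are usually read from a Pearson correlation matrix. The sketch below uses invented numbers, so its coefficients are not the project's; it only shows the mechanics and that strength is judged by the absolute value of the coefficient, whatever its sign.

```python
import pandas as pd

# Invented values for illustration only.
df = pd.DataFrame({
    "Birth Rate": [0.010, 0.015, 0.020, 0.030, 0.040],
    "Population 0 to 14": [0.15, 0.20, 0.24, 0.35, 0.45],
    "Population 15 to 64": [0.68, 0.66, 0.63, 0.58, 0.52],
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr()
print(corr.round(2))

# Strength of a relation is the absolute value of the coefficient.
strong_pairs = corr.abs() > 0.8
print(strong_pairs)
```

Such a matrix can be passed to a heatmap for the visual form shown on the slide.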
Scatter plot for CO2 emissions and energy usage
Scatter plots for GDP-related sources
SCALING
• Scaling puts the independent features on a
comparable scale, so that no feature dominates
distance calculations only because of its units.
• Here we use StandardScaler, which transforms each
feature to zero mean and unit variance.
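A minimal sketch of standard scaling with scikit-learn's `StandardScaler`, on invented data with two features whose units differ by a factor of 1000:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two invented features on very different scales.
X = np.array([
    [1.0, 1000.0],
    [2.0, 2000.0],
    [3.0, 3000.0],
    [4.0, 4000.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # per column: (x - mean) / std

print(X_scaled)
print("means:", X_scaled.mean(axis=0))
print("stds: ", X_scaled.std(axis=0))
```

After scaling, both columns are identical here, which shows that the original difference in units no longer influences distance-based clustering.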
Feature Engineering
PCA Technique :
• Looking at the explained variance graph, we can decide what
percentage of the variance to retain and
choose the number of components
accordingly.
• Here, we keep 15 components because
together they explain more than 95% of the variance.
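Choosing the number of components from the cumulative explained variance can be sketched as follows. The data is synthetic (10 features generated from 3 hidden factors), so the resulting component count is not the project's 15; the 95% target is taken from the slide.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 correlated features driven by 3 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and accumulate the explained variance ratios.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
n_components = int(np.argmax(cumulative >= 0.95) + 1)
X_pca = PCA(n_components=n_components).fit_transform(X_scaled)

print("cumulative explained variance:", cumulative.round(3))
print("components kept:", n_components, "| reduced shape:", X_pca.shape)
```

Plotting `cumulative` against the component index gives the kind of graph the slide refers to.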
Model Building
Hierarchical Clustering with scaled data
Visualized with a dendrogram and a scatter plot
• Calculated the silhouette score for the cluster labels.
• Scaled data: silhouette score 0.4069002367119094
Hierarchical Clustering with complete linkage
• Calculated the silhouette score for complete linkage.
• Scaled data: silhouette score 0.44947005579625016
Agglomerative Clustering on PCA data
• Calculated the silhouette score for agglomerative clustering on PCA data.
• Agglomerative clustering on PCA data: silhouette score 0.4892175454522696
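The comparison of hierarchical variants by silhouette score can be sketched as below. The data is synthetic (`make_blobs`), and the set of linkages and the choice of 3 clusters are assumptions for illustration, so the printed scores are unrelated to the values reported above.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled dataset.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.8, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for linkage in ("ward", "complete", "average"):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels = model.fit_predict(X_scaled)
    # Silhouette ranges from -1 to 1; higher means tighter, better separated clusters.
    scores[linkage] = silhouette_score(X_scaled, labels)

for linkage, score in scores.items():
    print(f"{linkage:>8}: {score:.3f}")
```

A dendrogram for the same data can be drawn with `scipy.cluster.hierarchy.linkage` and `dendrogram`.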
K-Means clustering on PCA Data
• Based on the elbow curve, we consider 3 clusters.
• K-Means with 4 clusters: silhouette score
0.38589504828738286
• K-Means with (number of clusters not stated): silhouette score
0.4592576382823015
K-Means with PCA Data – 3 clusters
• Based on the elbow curve, we consider 3 clusters.
• K-Means with PCA data, 2 clusters: silhouette
score 0.4349454775483713
• K-Means with PCA data, 4 clusters: silhouette
score 0.3016847742272087
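The elbow curve and the per-k silhouette scores can be produced in one loop. This sketch runs on synthetic blobs with three true groups, so its numbers differ from the ones reported above; the range of k values is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the PCA-transformed data.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

inertia = {}
silhouette = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia[k] = km.inertia_  # within-cluster sum of squares, plotted in the elbow curve
    silhouette[k] = silhouette_score(X, km.labels_)

# The k with the highest silhouette score is one reasonable choice.
best_k = max(silhouette, key=silhouette.get)

for k in inertia:
    print(f"k={k}  inertia={inertia[k]:10.1f}  silhouette={silhouette[k]:.3f}")
print("best k by silhouette:", best_k)
```

The elbow is the k after which inertia stops dropping sharply; checking it against the silhouette score helps when the elbow is ambiguous.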
DBSCAN Using Original and PCA Data
• Calculated neighbor distances using the nearest neighbors
method.
• DBSCAN with original data: silhouette score
0.230808
• DBSCAN with PCA data: silhouette score
0.291135
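Nearest-neighbor distances are commonly used to pick DBSCAN's `eps`. The sketch below assumes that was the purpose here. The data is synthetic, `min_samples = 5` is an assumed setting, and taking the 95th percentile of the sorted k-distances is a crude stand-in for reading the knee of the k-distance plot by eye.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled dataset.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)
X = StandardScaler().fit_transform(X)

min_samples = 5  # assumed value

# Distance from each point to its farthest neighbor among the min_samples nearest
# (the query point itself counts as the first neighbor).
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# Crude automatic choice of eps from the k-distance curve.
eps = float(np.percentile(k_distances, 95))
labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)

# DBSCAN marks noise as -1; exclude noise before scoring.
clustered = labels != -1
n_clusters = len(set(labels[clustered]))
score = silhouette_score(X[clustered], labels[clustered]) if n_clusters > 1 else float("nan")

print(f"eps={eps:.3f}  clusters={n_clusters}  noise={int((~clustered).sum())}  silhouette={score:.3f}")
```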
Training and testing model accuracy using a Random Forest classifier
Accuracy: 0.96
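The slides do not say what the classifier predicts. A common pattern, assumed here, is to treat the cluster labels as targets and train a Random Forest so that new observations can be assigned to a cluster at deployment time. The data, the K-Means labelling, and the 75/25 split are all illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Assumption: cluster labels from an earlier clustering step serve as the target.
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, cluster_labels, test_size=0.25, random_state=0, stratify=cluster_labels
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))
print(f"train accuracy={train_acc:.2f}  test accuracy={test_acc:.2f}")
```

Accuracy measured this way shows how well the classifier reproduces the cluster assignment, not how good the clusters themselves are.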
Deployment Model
Developed country
Developing Country
References
1. https://pandas.pydata.org/
2. https://numpy.org/doc/
3. https://matplotlib.org/
4. https://seaborn.pydata.org/
5. https://scikit-learn.org/
6. https://www.statsmodels.org/
Thank You