
Final PPT File Cluster Analysis

The project titled 'World Development Measurement' aims to create clusters based on a dataset containing 2708 observations and 25 variables related to economic and development metrics across various countries. The analysis involves exploratory data analysis, feature engineering, and model building, with a focus on clustering techniques such as K-Means and Hierarchical Clustering. The project ultimately seeks to evaluate the performance of these models to understand global development trends.

Uploaded by

reddykarishma840

World Development Measurement

PROJECT TITLE: “WORLD DEVELOPMENT MEASUREMENT”
MENTOR NAME: Ms. Snehal Shinde
GROUP NO.: 4
START DATE: 6 NOV 2023
Group details
Mr. Chintha Gangadhara 8688620739
Mrs. Swapna Akella 8008994791
Mr. Nikhil Gowda K 8073285770
Mr. Sriharsha Noorbhasha 9959905437
Mr. Rohan Vasant Thorat 8793574733
Mr. Akshay Haridas Rautray 9637637670
Mrs. Jonnada Bindu 9505539842


Business Objective:

⮚Create clusters on the global development measurement dataset.

⮚The dataset contains important economic and development metrics for countries across the globe.
Business Problem

⮚Develop a clustering model for the world development measurement dataset.

⮚Evaluate the performance of the model.

Since the goal is to group countries with similar metrics rather than to predict a known label, this is a clustering problem. Clustering looks for distinct groups, or “clusters”, within a dataset.
Project Architecture / Project Flow
Understanding the Business Problem

Dataset Understanding

EDA process

Feature Engineering

Model Building

Model Evaluation and Feedback

Deployment
Data Set Details:
The data file contains 2708 observations with 25 variables and has information about important
economic and development metrics related to various countries across the globe.
The variables, or features, include:

Birth Rate, Business tax, CO2 emissions, Country, Days to start a business, Ease of business, energy usage, GDP, healthexp%GDP, healthexp/capita, hours to do tax, infant mortality, internet usage, lending rate, life expectancy female, etc.


Data Set
Exploratory Data Analysis (EDA) and
Feature Engineering
⮚In the given dataset, the observations in each variable are stored as floating-point values.
⮚Nearly every feature contains null values (Country and Population Total are the exceptions).
⮚There are no duplicate observations in the dataset.
⮚There are 12203 null values in the dataset in total.
⮚Data distribution: assessed through skewness.
Visualizing the null values for each attribute with a seaborn heatmap.

• Ease of Business has the most missing values.
• Population Urban has comparatively few missing values.
• There are no missing values in Country and Population Total.
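The checks listed in the EDA slides above (datatypes, null counts, duplicates, skewness) can be sketched with pandas roughly as follows. This is only an illustration: the small frame and its column names are hypothetical stand-ins for the real dataset, which is not reproduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (names are illustrative).
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "BirthRate": [0.020, 0.035, np.nan, 0.012, 0.041],
    "GDP": [1.2e9, np.nan, 3.4e11, 5.0e10, np.nan],
})

print(df.dtypes)                               # datatype of each variable
null_counts = df.isnull().sum()                # null values per column
total_nulls = int(null_counts.sum())           # null values in the whole frame
n_duplicates = int(df.duplicated().sum())      # fully duplicated rows
skewness = df.select_dtypes("number").skew()   # skewness of each numeric column

print(null_counts)
print("total nulls:", total_nulls, "duplicates:", n_duplicates)
print(skewness)
```

The per-column null counts are the same information the heatmap shows visually.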
VISUALIZATION
Distribution plots for each column
• The 'BusinessTaxRate', 'EaseofBusiness', 'HealthExpGDP', 'HourstodoTax' and 'Population0to14' columns are approximately normally distributed, so their missing values are replaced with the mean.
• The remaining columns are skewed, so their missing values are replaced with the median.
OUTLIER DETECTION
• Some columns, such as "Population Total", "Tourism Inbound" and "Tourism Outbound", have a large number of outliers.
• Columns such as "Population Urban" and "Population 0 to 14" have few outliers.
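The slides do not state how outliers were counted. One common choice, assumed here, is the 1.5 × IQR rule that box plots use for their whiskers. The helper name count_iqr_outliers and the sample values are hypothetical.

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return int(mask.sum())

# Toy data: the first column has two extreme values, the second has none.
df = pd.DataFrame({
    "PopulationTotal": [1, 2, 2, 3, 3, 3, 4, 4, 500, 900],
    "PopulationUrban": [40, 45, 50, 55, 60, 62, 65, 70, 72, 75],
})

outlier_counts = {col: count_iqr_outliers(df[col]) for col in df.columns}
print(outlier_counts)
```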
TOP 10 CO2 EMISSION COUNTRIES
• China and the United States are the highest CO2-emitting countries.
• The Russian Federation, India, and Japan are medium CO2-emitting countries.
• Germany, Canada, the UK, Korea, and Iran have the lowest CO2 emissions among the top 10.
TOP 10 HIGHEST BIRTH RATE COUNTRIES
Visualizing Relation between variables

• 'Population 0 to 14' and 'Birth Rate' have a strong relation (0.94).
• 'Population 15 to 64' and 'Birth Rate' have a slightly weaker, but still strong, relation (0.90).
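Pairwise relations like the ones above are typically read from a correlation matrix. The sketch below uses made-up numbers, not the project data; in particular, the negative sign for the 15-to-64 column is an assumption of this toy example, since the slides report only the magnitude.

```python
import pandas as pd

# Made-up values chosen so the columns move together (or in opposite directions).
df = pd.DataFrame({
    "BirthRate": [10, 15, 20, 25, 30, 35, 40],
    "Population0to14": [15, 20, 24, 30, 34, 41, 45],
    "Population15to64": [68, 66, 63, 60, 57, 54, 50],
})

corr = df.corr()  # Pearson correlation by default

# List every column pair whose absolute correlation is at least 0.9.
strong_pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) >= 0.9
]
print(corr.round(2))
print(strong_pairs)
```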
Scatter plot for CO2 emissions and energy usage
Scatter plots for GDP-related sources
SCALING
• Scaling puts the independent features on a comparable scale, so that no feature dominates the distance calculations just because of its units.
• Here we use Standard Scaler, which rescales each feature to zero mean and unit variance.
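As a small illustration of the scaling step, the following sketch applies scikit-learn's StandardScaler to a toy matrix whose two columns differ by three orders of magnitude. The values are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales.
X = np.array([
    [1.0, 1000.0],
    [2.0, 2000.0],
    [3.0, 3000.0],
    [4.0, 4000.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # subtract column mean, divide by column std

print(X_scaled)
print("means:", X_scaled.mean(axis=0))
print("stds:", X_scaled.std(axis=0))
```

After scaling, both columns contribute equally to Euclidean distances, which matters for K-Means, hierarchical clustering and DBSCAN alike.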

Feature Engineering
PCA Technique :
• From the cumulative explained-variance graph we decide how much variance to retain and keep the corresponding number of components.
• Here we keep 15 components, because together they explain more than 95% of the variance.
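The component selection described above can be sketched as follows. The synthetic data (three latent factors spread over ten columns) is an assumption for illustration, so the number of components it yields is unrelated to the 15 reported in the slides.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 observed columns driven by 3 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and accumulate the explained-variance ratios.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
X_pca = PCA(n_components=n_components).fit_transform(X_scaled)

print(cumulative.round(3))
print("components kept:", n_components, "shape:", X_pca.shape)
```

Plotting `cumulative` against the component index gives the kind of graph the slide refers to.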
Model Building
Hierarchical Clustering with scaled data

Visualized with a dendrogram and a scatter plot.

• Calculated the silhouette score for the resulting labels.

• Scaled data: silhouette score = 0.4069002367119094

Hierarchical Clustering with complete linkage

• Calculated the silhouette score for complete linkage.

• Scaled data: silhouette score = 0.44947005579625016

Agglomerative Clustering on PCA data

• Calculated the silhouette score for agglomerative clustering on PCA data.

• PCA data: silhouette score = 0.4892175454522696
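A hedged sketch of how such silhouette comparisons might be produced with scikit-learn. The blob data, the choice of three clusters and the set of linkages tried are assumptions for illustration; the slides do not state the cluster count or the default linkage used.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled dataset.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for linkage in ("ward", "complete", "average"):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels = model.fit_predict(X_scaled)
    # Silhouette ranges from -1 to 1; higher means tighter, better separated clusters.
    scores[linkage] = silhouette_score(X_scaled, labels)

for linkage, score in scores.items():
    print(f"{linkage:>8}: {score:.4f}")
```

The same loop can be run on the PCA-transformed matrix to compare against the scaled data.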
K-Means clustering on PCA Data

• Based on the elbow curve above, 3 clusters are considered.

• K Means with 4 Clusters: silhouette score = 0.38589504828738286
• K Means with Clusters: silhouette score = 0.4592576382823015

K-Means with PCA Data – 3 clusters

• Based on the elbow curve above, 3 clusters are considered.

• K Means with PCA data, 2 clusters: silhouette score = 0.4349454775483713

• K Means with PCA data, 4 clusters: silhouette score = 0.3016847742272087
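The elbow curve and the silhouette comparison for K-Means can be sketched together as below. The synthetic blobs and the range of k values are assumptions for the example, not the project's settings.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the PCA-transformed data.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)
X = StandardScaler().fit_transform(X)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X)
    inertias[k] = km.inertia_                      # within-cluster sum of squares (elbow curve)
    silhouettes[k] = silhouette_score(X, labels)   # cluster separation for this k

best_k = max(silhouettes, key=silhouettes.get)
for k in inertias:
    print(k, round(inertias[k], 2), round(silhouettes[k], 4))
print("k with highest silhouette:", best_k)
```

Inertia always tends to fall as k grows, so the elbow is read from where the drop flattens; the silhouette score gives a second, independent view of the same choice.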
DBSCAN Using Original Data
• Calculated distances using the nearest-neighbors method.
• DBSCAN with original data: silhouette score = 0.230808
• DBSCAN with PCA data: silhouette score = 0.291135
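Nearest-neighbor distances are commonly used to pick DBSCAN's eps parameter, and the sketch below assumes that is their role here. The percentile used to read eps off the sorted distances, the min_samples value and the synthetic data are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
X = StandardScaler().fit_transform(X)

min_samples = 5  # assumed value
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)        # column 0 is each point's distance to itself (0.0)
k_dist = np.sort(distances[:, -1])     # sorted distance to the farthest of the neighbors found

# Assumed heuristic: take a high percentile of the k-distance curve as eps.
eps = float(np.percentile(k_dist, 90))

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
mask = labels != -1                    # DBSCAN marks noise points with -1
n_clusters = len(set(labels[mask]))

# The silhouette score needs at least two clusters; noise is excluded here.
score = silhouette_score(X[mask], labels[mask]) if n_clusters >= 2 else None
print("eps:", round(eps, 3), "clusters:", n_clusters, "silhouette:", score)
```

In practice the sorted k-distance curve is usually plotted and eps is taken at its knee rather than at a fixed percentile.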
Training and testing model accuracy using a random forest classifier
Accuracy obtained: 0.96
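The slides do not say what the classifier was trained to predict. One plausible setup, assumed in this sketch, is to treat the cluster labels as the target so that new countries can be assigned to a cluster. The data, split ratio and hyperparameters are illustrative only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic features; the cluster labels found by K-Means become the target (assumption).
X, _ = make_blobs(n_samples=400, centers=2, random_state=1)
cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, cluster_labels, test_size=0.25, random_state=1, stratify=cluster_labels
)

clf = RandomForestClassifier(n_estimators=100, random_state=1)
clf.fit(X_train, y_train)

train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))
print("train accuracy:", round(train_acc, 3), "test accuracy:", round(test_acc, 3))
```

Comparing training and testing accuracy shows whether the classifier generalizes or has merely memorized the training rows.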
Deployment Model
Developed country
Developing Country
References
1. https://pandas.pydata.org/
2. https://numpy.org/doc/
3. https://matplotlib.org/
4. https://seaborn.pydata.org/
5. https://scikit-learn.org/
6. https://www.statsmodels.org/
Thank You
