Com747-Statistical Modelling and Data Mining: Data Model On Suicide Rates
DATA MINING
Sobin Siby
[email protected]
Introduction
• This work is based on a dataset of global suicide rates compiled by
the World Health Organization.
• The dataset contains the number of deaths by suicide, year, population,
gender and other fields.
• I used data cleaning, exploratory data analysis (EDA), data
visualisation and linear regression to build a data model.
Data Cleaning
Data cleansing, or data cleaning, is the process of detecting and correcting
corrupt or inaccurate records in a record set, table, or database. It involves
identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and
then replacing them with appropriate data.
Task Done:
• 7 countries with 3 years of data or fewer in total were removed.
• 2016 data was removed because few countries had any, and those that did
often had missing values.
• HDI was removed because about 2/3 of its values were missing.
• Continent was added to the dataset using the countrycode package.
• Africa has very few countries providing suicide data.
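The first three cleaning steps above can be sketched in plain Python. This is a minimal illustration, not the code used for the analysis: the field names (`country`, `year`, `hdi`), the toy rows, and the choice to count years after dropping 2016 are all assumptions.

```python
from collections import defaultdict

def clean_records(records, min_years=4, drop_year=2016, drop_fields=("hdi",)):
    """Apply the three cleaning steps to a list of dict rows (hypothetical schema)."""
    # Step 1: drop the sparsely reported year.
    rows = [r for r in records if r["year"] != drop_year]
    # Step 2: drop fields that are mostly missing.
    rows = [{k: v for k, v in r.items() if k not in drop_fields} for r in rows]
    # Step 3: keep only countries with more than 3 distinct years of data.
    years_by_country = defaultdict(set)
    for r in rows:
        years_by_country[r["country"]].add(r["year"])
    keep = {c for c, years in years_by_country.items() if len(years) >= min_years}
    return [r for r in rows if r["country"] in keep]

# Made-up rows: country "A" reports 2010-2014 and 2016, country "B" only 2014-2016.
records = [{"country": "A", "year": y, "hdi": None} for y in (2010, 2011, 2012, 2013, 2014, 2016)]
records += [{"country": "B", "year": y, "hdi": None} for y in (2014, 2015, 2016)]

cleaned = clean_records(records)
print(len(cleaned), sorted({r["country"] for r in cleaned}))
```

In this toy input, country "B" is dropped because only two of its years remain once 2016 is removed.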
Linear Regression
Linear regression is a linear approach to modelling the
relationship between a scalar response (or dependent variable) and one
or more explanatory variables (or independent variables).
Task Done:
I used linear regression to test whether richer countries have a higher
suicide rate and, in the by-country analysis, to find the p-value of each
country's trend over time.
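As a minimal sketch of what fitting a simple linear regression involves, the closed-form least-squares slope and intercept can be computed directly. The GDP and rate values below are invented purely to illustrate the calculation; they are not results from this dataset, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one explanatory variable: y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squares of x and cross-products of x and y, both centred on the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented values: GDP per capita (thousand USD) and suicides per 100k.
gdp_per_capita = [5.0, 10.0, 20.0, 40.0]
suicide_rate = [8.0, 9.0, 11.0, 15.0]

slope, intercept = fit_line(gdp_per_capita, suicide_rate)
print(slope, intercept)
```

A positive slope would indicate that the response rises with the explanatory variable; whether that slope is distinguishable from zero is what the p-value addresses.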
Global Analysis
Looking at the global analysis, we can come up with some insights. The
graph below shows the global suicide rate from 1985 to 2015. The average
over this period is 13.1 deaths per 100k per year.
Global Analysis – Cont.
The graph gives the following insights:
• Peak suicide rate was 15.3 deaths per 100k in 1995
• Decreased steadily, to 11.5 per 100k in 2015 (~25% decrease)
• Rates are only now returning to their pre-90’s rates
• Data for the 1980’s is limited, so it’s hard to say if the rate then was
truly representative of the global population
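The figures above rest on two small calculations: a rate per 100k people and a percentage change between two years. A short sketch follows, with hypothetical helper names; the 15.3 and 11.5 inputs are the 1995 and 2015 rates quoted above.

```python
def rate_per_100k(deaths, population):
    """Deaths per 100,000 people."""
    return deaths / population * 100_000

def percent_change(old, new):
    """Signed percentage change from old to new (negative means a decrease)."""
    return (new - old) / old * 100

# Change from the 1995 peak (15.3 per 100k) to 2015 (11.5 per 100k).
change = percent_change(15.3, 11.5)
print(round(change, 1))
```

The result is a decrease of just under 25%, consistent with the approximate figure in the bullet above.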
By Continent
Plotting the data by continent gives some findings. The graph obtained is
shown below.
By Continent – Cont.
The graph gives the following insights:
• European rate highest overall, but has steadily decreased ~40%
since 1995
• The European rate for 2015 is similar to Asia & Oceania
• The trendline for Africa is due to poor data quality - just 3 countries
have provided data
• Oceania & Americas trends are more concerning
By sex
The graph below shows the data plotted by sex.
By sex-Cont.
We can draw the following conclusions:
• Globally, the suicide rate for men has been ~3.5x that for women
• Both male & female suicide rates peaked in 1995, declining since
• This ratio of 3.5 : 1 (male : female) has remained relatively constant
since the mid 90’s
• However, during the 80’s this ratio was as low as 2.7 : 1 (male :
female)
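A male : female ratio like the one above can be derived by pooling deaths and population within each group and dividing the two rates. The sketch below assumes pooled rates (total deaths over total population) rather than an average of per-country rates, and the rows and field names are hypothetical, chosen only so the arithmetic is easy to follow.

```python
from collections import defaultdict

def rate_by_group(rows, key):
    """Pooled suicide rate per 100k for each value of `key` (e.g. sex, age, continent)."""
    deaths = defaultdict(int)
    population = defaultdict(int)
    for row in rows:
        deaths[row[key]] += row["suicides_no"]
        population[row[key]] += row["population"]
    return {group: deaths[group] / population[group] * 100_000 for group in deaths}

# Made-up rows, not taken from the dataset.
rows = [
    {"sex": "male", "suicides_no": 20, "population": 50_000},
    {"sex": "male", "suicides_no": 15, "population": 50_000},
    {"sex": "female", "suicides_no": 4, "population": 40_000},
    {"sex": "female", "suicides_no": 6, "population": 60_000},
]

rates = rate_by_group(rows, "sex")
ratio = rates["male"] / rates["female"]
print(rates, ratio)
```

The same function works for any grouping column, so it could equally be called with an age-band or continent key.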
By Age
The graph below shows the data plotted by age.
By Age-Cont.
Looking at age, we can conclude the following:
• Globally, the likelihood of suicide increases with age
• Since 1995, suicide rate for everyone aged >= 15 has been linearly
decreasing
• The suicide rate of those aged 75+ has dropped by more than 50%
since 1990
• Suicide rate in the ‘5-14’ category remains roughly static and small
(< 1 per 100k per year)
By Country
The graphs below show the breakdown by country and a geographical heat map
of suicide rates over the timeframe of this analysis.
By Country-Cont.
The output gives the following insights:
• Lithuania’s rate has been highest by a large margin: > 41 suicides per
100k (per year)
• Large overrepresentation of European countries with high rates, few
with low rates
By Country (Linear Regression)
• Instead of visualizing the rates of all 93 countries across time, I fit a
simple linear regression to each country’s data and extract those with a
‘year’ p-value of < 0.05.
• The output obtained is shown below.
By Country (Linear Regression)-Cont.
We can conclude as follows:
• In ~1/2 of all countries, the suicide rate is changing linearly as time
progresses.
• 32 (2/3) of these 48 countries show a decreasing trend
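The per-country filter described above can be sketched as follows: regress rate on year for each country, compute the two-sided p-value of the slope, and keep countries where it is below 0.05. This is a standard-library illustration under stated assumptions, not the code behind the reported counts. The helper names and the two data series are invented, and the t-distribution tail is integrated numerically with Simpson's rule, so very small p-values are only approximate.

```python
import math

def t_two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value of a Student t statistic, by Simpson's rule (steps must be even)."""
    # Log of the density's normalising constant; lgamma avoids overflow.
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t_stat)
    h = upper / steps
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * central)

def year_trend(years, rates):
    """Slope of rate on year and the p-value for the hypothesis that the slope is zero."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(rates) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, rates))
    slope = sxy / sxx
    # Residual sum of squares around the fitted line.
    sse = sum((y - mean_y - slope * (x - mean_x)) ** 2 for x, y in zip(years, rates))
    if sse == 0:
        return slope, 0.0  # perfect fit: treat as maximally significant
    std_err = math.sqrt(sse / (n - 2) / sxx)
    return slope, t_two_sided_p(slope / std_err, n - 2)

# Invented series: one clearly falling, one flat with noise.
years = list(range(2000, 2010))
data = {
    "Down": [20.0, 19.2, 18.1, 17.3, 15.9, 15.2, 14.1, 12.8, 12.1, 11.0],
    "Flat": [10.0, 10.4, 9.7, 10.2, 9.9, 10.3, 9.8, 10.1, 9.6, 10.0],
}

trends = {country: year_trend(years, rates) for country, rates in data.items()}
significant = {c: t for c, t in trends.items() if t[1] < 0.05}
decreasing = [c for c, (slope, _) in significant.items() if slope < 0]
print(significant, decreasing)
```

Only the falling series passes the p < 0.05 filter here. In practice a statistics library would normally supply the p-value directly.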
By Country (Linear Regression)-Cont.
Looking at the steepest increasing trends, 12 countries stand out. The
graph below shows these trends (p < 0.05).
By Country (Linear Regression)-Cont.