Outlier Analysis
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df=pd.read_csv("C:\\Users\\KIIT0001\\Desktop\\BDA_Documents\\StudentPerformance2.csv")
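# Outlier removal: keep only rows whose value in `column` is at or below `threshold`
def outlier_removal(df, column, threshold):
    removed = df[df[column] <= threshold]
    # Box plot of the remaining values to check the result visually
    sns.boxplot(x=removed[column])
    return removed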
thresholdValue=60
nooutlier=outlier_removal(ndf,"Math", thresholdValue)
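# Removal of outliers: drop rows in place where the Math score is at or below 60
outlierIndices=np.where(ndf["Math"] <= 60)
# np.where returns row positions, so map them to index labels before dropping
ndf.drop(ndf.index[outlierIndices[0]],inplace=True)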
ax=ndf.plot.scatter(x="Math",y="Reading Score",c="red")
# Using Z-score
The z-score, also known as the standard score, is a statistical measure that describes a
data point's position relative to the mean of a group of data points. It is measured in
terms of standard deviations from the mean. The formula for calculating a z-score is:
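z = (X − μ) / σ
where:
X is the data point,
μ is the mean of the data set,
σ is the standard deviation of the data set.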
Interpretation of Z-Scores:
A z-score of 0 indicates that the data point is exactly at the mean.
A positive z-score indicates that the data point is above the mean.
A negative z-score indicates that the data point is below the mean.
The magnitude of the z-score indicates how many standard deviations the data point is from
the mean.
Example:
Suppose you have a data set with a mean (μ) of 100 and a standard deviation (σ) of
15. If you want to find the z-score of a data point X=130:
z=(130−100)/15=30/15=2
This means the data point 130 is 2 standard deviations above the mean.
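The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the helper name `z_score` and the extra test points 100 and 85 are illustrative choices, not part of the original notes.

```python
def z_score(x, mean, std):
    """Return how many standard deviations x lies from the mean."""
    if std <= 0:
        raise ValueError("std must be positive")
    return (x - mean) / std


mu, sigma = 100, 15  # mean and standard deviation from the example above

# 130 is above the mean, 100 is exactly at the mean, 85 is below it
for x in (130, 100, 85):
    print(x, z_score(x, mu, sigma))
# prints:
# 130 2.0
# 100 0.0
# 85 -1.0
```

The three results also illustrate the interpretation rules: a positive score above the mean, zero at the mean, and a negative score below it.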
Application of Z-Scores:
Detecting Outliers: Data points with z-scores beyond ±3 are typically considered outliers, although a lower cutoff such as ±2 is sometimes used.
# Example
threshold_z = 2
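One way the z-score rule could be applied to a DataFrame column is sketched below. The function name `zscore_outliers`, the use of the population standard deviation (`ddof=0`), and the small `sample` table of made-up scores are all assumptions for illustration; they are not taken from StudentPerformance2.csv.

```python
import pandas as pd


def zscore_outliers(df, column, threshold_z=3.0):
    """Split df into (kept, outliers) using the absolute z-score of `column`."""
    values = df[column].astype(float)
    std = values.std(ddof=0)  # population standard deviation (an assumption)
    if std == 0:
        # All values are identical, so nothing can be an outlier
        return df.copy(), df.iloc[0:0].copy()
    z = (values - values.mean()) / std
    mask = z.abs() > threshold_z  # True where the point lies beyond the cutoff
    return df[~mask].copy(), df[mask].copy()


# Made-up scores for illustration only
sample = pd.DataFrame({"Math": [62, 65, 68, 70, 71, 73, 75, 78, 80, 5]})

kept, outliers = zscore_outliers(sample, "Math", threshold_z=2)
print(outliers)   # the row with Math == 5
print(len(kept))  # 9
```

In this sample the score 5 has a z-score of about -2.9, so it is flagged with a cutoff of 2 but not with the more common cutoff of 3. The function returns copies rather than modifying the input, so the original DataFrame stays available for comparison plots.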