
Central Tendency and Dispersion Analysis - 12212204

This document provides a statistical analysis of the Iris and Wine Quality datasets, focusing on central tendency, dispersion, and relationships between attributes using covariance and correlation matrices. It includes calculations of mean, median, mode, range, variance, standard deviation, and interquartile range, along with visualizations such as histograms and box plots. The report aims to uncover insights into the distribution and relationships of numeric attributes within the datasets.

Uploaded by

preetjashan326

Central Tendency and Dispersion Analysis

import pandas as pd

# Load datasets
iris_df = pd.read_csv('iris.csv', header=None, names=['sepal_length', 'sepal_width', 'petal_length',
'petal_width', 'species'])
wine_df = pd.read_csv('winequality-red.csv', sep=';') # CSV is semicolon-separated

# Display first few rows of both datasets


print(iris_df.head())
print(wine_df.head())

1. Mean, median, mode
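The report does not show the code used for this step. As a minimal sketch, assuming pandas is used as in the loading code above, the three measures could be computed as follows. The frame `sample_df` and its values are illustrative only and are not taken from either dataset.

```python
import pandas as pd

# Illustrative values only (not the real Iris measurements).
sample_df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 5.1, 6.3, 5.8],
    "petal_width": [0.2, 0.2, 0.4, 1.8, 1.8],
})

mean_values = sample_df.mean()      # arithmetic mean of each column
median_values = sample_df.median()  # middle value of each sorted column
mode_values = sample_df.mode()      # one row per tied mode; shorter columns are padded with NaN

print(mean_values)
print(median_values)
print(mode_values)
```

A column can have several modes (here `petal_width` has two), which is why `mode()` returns a frame rather than a single row. On a frame that also holds a text column such as `species`, passing `numeric_only=True` to `mean()` and `median()` restricts the calculation to the numeric columns.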


2. Range, quartiles, interquartile range (IQR), variance, and standard deviation
Iris Dataset - Range:
sepal_length 3.6
sepal_width 2.4
petal_length 5.9
petal_width 2.4
dtype: float64
Iris Dataset - IQR:
sepal_length 1.3
sepal_width 0.5
petal_length 3.5
petal_width 1.5
dtype: float64
Iris Dataset - Variance:
sepal_length 0.685694
sepal_width 0.188004
petal_length 3.113179
petal_width 0.582414
dtype: float64
Iris Dataset - Standard Deviation:
sepal_length 0.828066
sepal_width 0.433594
petal_length 1.764420
petal_width 0.763161
dtype: float64
Wine Quality Dataset - Range:
fixed acidity 11.30000
volatile acidity 1.46000
citric acid 1.00000
residual sugar 14.60000
chlorides 0.59900
free sulfur dioxide 71.00000
total sulfur dioxide 283.00000
density 0.01362
pH 1.27000
sulphates 1.67000
alcohol 6.50000
quality 5.00000
dtype: float64
Wine Quality Dataset - IQR:
fixed acidity 2.100000
volatile acidity 0.250000
citric acid 0.330000
residual sugar 0.700000
chlorides 0.020000
free sulfur dioxide 14.000000
total sulfur dioxide 40.000000
density 0.002235
pH 0.190000
sulphates 0.180000
alcohol 1.600000
quality 1.000000
dtype: float64
Wine Quality Dataset - Variance:
fixed acidity 3.031416
volatile acidity 0.032062
citric acid 0.037947
residual sugar 1.987897
chlorides 0.002215
free sulfur dioxide 109.414884
total sulfur dioxide 1082.102373
density 0.000004
pH 0.023835
sulphates 0.028733
alcohol 1.135647
quality 0.652168
dtype: float64
Wine Quality Dataset - Standard Deviation:
fixed acidity 1.741096
volatile acidity 0.179060
citric acid 0.194801
residual sugar 1.409928
chlorides 0.047065
free sulfur dioxide 10.460157
total sulfur dioxide 32.895324
density 0.001887
pH 0.154386
sulphates 0.169507
alcohol 1.065668
quality 0.807569
dtype: float64
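The code that produced the dispersion output above is not included in the report. The following is a sketch of how such figures could be computed with pandas, not the author's actual code. The helper `dispersion_summary` and the frame `demo_df` are hypothetical names, and the sample values are illustrative only.

```python
import pandas as pd


def dispersion_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return range, IQR, sample variance and sample standard deviation per numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    return pd.DataFrame({
        "range": numeric.max() - numeric.min(),
        "iqr": q3 - q1,
        "variance": numeric.var(),  # sample variance (ddof=1)
        "std": numeric.std(),       # sample standard deviation (ddof=1)
    })


# Illustrative values only (not the real datasets).
demo_df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 10.0, 11.0],
    "label": ["a", "b", "c", "d", "e"],  # non-numeric column, ignored by the helper
})
summary = dispersion_summary(demo_df)
print(summary)
```

Calling the helper on `iris_df` or `wine_df` would give one row per numeric attribute, with the text column `species` skipped automatically.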

Visualizations
Analysing the graphs

Iris Dataset Visualizations

1. Histograms:
○ Purpose: The histogram with KDE (Kernel Density Estimate) displays the distribution of
numeric attributes in the Iris dataset. It helps in understanding the frequency of data
points within specified bins and the overall distribution shape.
○ Insights: You can observe the spread and central tendency of attributes such as
sepal_length, sepal_width, petal_length, and petal_width. The KDE curve
overlays a smoothed version of the histogram to illustrate the data's distribution more
clearly.
2. Box Plots:
○ Purpose: The box plots visualize the spread and distribution of the data, including the
median, quartiles, and potential outliers.
○ Insights: This plot helps in identifying any skewness in the data and detecting outliers.
For each attribute, the box spans the interquartile range (IQR) with the median marked inside
it, and the whiskers extend to the most extreme data points that lie within 1.5 times the IQR
of the quartiles. Points beyond the whiskers are drawn individually as outliers.
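The 1.5 × IQR rule behind the box plots can be sketched in plain Python. This is an illustration of the rule, not the plotting code used for the report. The helpers `quartile` and `box_plot_stats` are hypothetical names, the quartiles use linear interpolation (the pandas default), and the sample values are made up.

```python
def quartile(sorted_values, q):
    """Quantile by linear interpolation between the two closest ranks."""
    position = (len(sorted_values) - 1) * q
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def box_plot_stats(values, whisker=1.5):
    """Return the quartiles, whisker ends and outliers a box plot would draw."""
    data = sorted(values)
    q1, median, q3 = (quartile(data, q) for q in (0.25, 0.5, 0.75))
    iqr = q3 - q1
    low_fence, high_fence = q1 - whisker * iqr, q3 + whisker * iqr
    # Whiskers stop at the most extreme points that are still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Illustrative values only (not the real Iris measurements).
stats = box_plot_stats([2.0, 2.9, 3.0, 3.0, 3.1, 3.2, 3.4, 4.4])
print(stats)
```

In this example the two extreme values fall outside the fences and would be plotted as individual points, while the whiskers end at the nearest values inside the fences.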

Wine Quality Dataset Visualizations

1. Histograms:
○ Purpose: The histogram with KDE shows the distribution of numeric attributes in the
Wine Quality dataset, such as fixed acidity, volatile acidity, citric acid,
etc.
○ Insights: Similar to the Iris dataset, this helps in visualizing the frequency distribution and
understanding the range of values for each attribute. The KDE helps in seeing the
smoothed distribution trend.
2. Box Plots:
○ Purpose: The box plots provide insights into the distribution of wine quality attributes,
highlighting the median, quartiles, and any outliers present.
○ Insights: Box plots reveal the spread and variability of attributes and highlight any
potential outliers. Attributes with a wide spread or many outliers may reflect genuine
differences between wines or measurement errors.

Summary of Insights
● Histograms reveal the shape of the data distribution and can indicate if the data is normally
distributed, skewed, or has multiple modes.
● Box Plots provide a summary of the data’s central tendency and variability, and they are
particularly useful for identifying outliers and understanding the data’s spread.

Correlation and covariance matrix

Iris Dataset - Covariance Matrix:


sepal_length sepal_width petal_length petal_width
sepal_length 0.685694 -0.039268 1.273682 0.516904
sepal_width -0.039268 0.188004 -0.321713 -0.117981
petal_length 1.273682 -0.321713 3.113179 1.296387
petal_width 0.516904 -0.117981 1.296387 0.582414
Iris Dataset - Correlation Matrix:
sepal_length sepal_width petal_length petal_width
sepal_length 1.000000 -0.109369 0.871754 0.817954
sepal_width -0.109369 1.000000 -0.420516 -0.356544
petal_length 0.871754 -0.420516 1.000000 0.962757
petal_width 0.817954 -0.356544 0.962757 1.000000
Wine Quality Dataset - Covariance Matrix:
fixed acidity volatile acidity citric acid \
fixed acidity 3.031416 -0.079851 0.227820
volatile acidity -0.079851 0.032062 -0.019272
citric acid 0.227820 -0.019272 0.037947
residual sugar 0.281756 0.000484 0.039434
chlorides 0.007679 0.000517 0.001869
free sulfur dioxide -2.800921 -0.019674 -0.124252
total sulfur dioxide -6.482346 0.450426 0.227697
density 0.002195 0.000007 0.000134
pH -0.183586 0.006495 -0.016298
sulphates 0.054010 -0.007921 0.010328
alcohol -0.114421 -0.038600 0.022815
quality 0.174424 -0.056476 0.035612

residual sugar chlorides free sulfur dioxide \


fixed acidity 0.281756 0.007679 -2.800921
volatile acidity 0.000484 0.000517 -0.019674
citric acid 0.039434 0.001869 -0.124252
residual sugar 1.987897 0.003690 2.758611
chlorides 0.003690 0.002215 0.002738
free sulfur dioxide 2.758611 0.002738 109.414884
total sulfur dioxide 9.416441 0.073387 229.737521
density 0.000945 0.000018 -0.000433
pH -0.018644 -0.001926 0.113653
sulphates 0.001321 0.002962 0.091592
alcohol 0.063219 -0.011092 -0.773698
quality 0.015635 -0.004900 -0.427907

total sulfur dioxide density pH sulphates \


fixed acidity -6.482346 0.002195 -0.183586 0.054010
volatile acidity 0.450426 0.000007 0.006495 -0.007921
citric acid 0.227697 0.000134 -0.016298 0.010328
residual sugar 9.416441 0.000945 -0.018644 0.001321
chlorides 0.073387 0.000018 -0.001926 0.002962
free sulfur dioxide 229.737521 -0.000433 0.113653 0.091592
total sulfur dioxide 1082.102373 0.004425 -0.337699 0.239471
density 0.004425 0.000004 -0.000100 0.000048
pH -0.337699 -0.000100 0.023835 -0.005146
sulphates 0.239471 0.000048 -0.005146 0.028733
alcohol -7.209298 -0.000998 0.033832 0.016907
quality -4.917237 -0.000267 -0.007198 0.034413

alcohol quality
fixed acidity -0.114421 0.174424
volatile acidity -0.038600 -0.056476
citric acid 0.022815 0.035612
residual sugar 0.063219 0.015635
chlorides -0.011092 -0.004900
free sulfur dioxide -0.773698 -0.427907
total sulfur dioxide -7.209298 -4.917237
density -0.000998 -0.000267
pH 0.033832 -0.007198
sulphates 0.016907 0.034413
alcohol 1.135647 0.409789
quality 0.409789 0.652168
Wine Quality Dataset - Correlation Matrix:
fixed acidity volatile acidity citric acid \
fixed acidity 1.000000 -0.256131 0.671703
volatile acidity -0.256131 1.000000 -0.552496
citric acid 0.671703 -0.552496 1.000000
residual sugar 0.114777 0.001918 0.143577
chlorides 0.093705 0.061298 0.203823
free sulfur dioxide -0.153794 -0.010504 -0.060978
total sulfur dioxide -0.113181 0.076470 0.035533
density 0.668047 0.022026 0.364947
pH -0.682978 0.234937 -0.541904
sulphates 0.183006 -0.260987 0.312770
alcohol -0.061668 -0.202288 0.109903
quality 0.124052 -0.390558 0.226373

residual sugar chlorides free sulfur dioxide \


fixed acidity 0.114777 0.093705 -0.153794
volatile acidity 0.001918 0.061298 -0.010504
citric acid 0.143577 0.203823 -0.060978
residual sugar 1.000000 0.055610 0.187049
chlorides 0.055610 1.000000 0.005562
free sulfur dioxide 0.187049 0.005562 1.000000
total sulfur dioxide 0.203028 0.047400 0.667666
density 0.355283 0.200632 -0.021946
pH -0.085652 -0.265026 0.070377
sulphates 0.005527 0.371260 0.051658
alcohol 0.042075 -0.221141 -0.069408
quality 0.013732 -0.128907 -0.050656

total sulfur dioxide density pH sulphates \


fixed acidity -0.113181 0.668047 -0.682978 0.183006
volatile acidity 0.076470 0.022026 0.234937 -0.260987
citric acid 0.035533 0.364947 -0.541904 0.312770
residual sugar 0.203028 0.355283 -0.085652 0.005527
chlorides 0.047400 0.200632 -0.265026 0.371260
free sulfur dioxide 0.667666 -0.021946 0.070377 0.051658
total sulfur dioxide 1.000000 0.071269 -0.066495 0.042947
density 0.071269 1.000000 -0.341699 0.148506
pH -0.066495 -0.341699 1.000000 -0.196648
sulphates 0.042947 0.148506 -0.196648 1.000000
alcohol -0.205654 -0.496180 0.205633 0.093595
quality -0.185100 -0.174919 -0.057731 0.251397

alcohol quality
fixed acidity -0.061668 0.124052
volatile acidity -0.202288 -0.390558
citric acid 0.109903 0.226373
residual sugar 0.042075 0.013732
chlorides -0.221141 -0.128907
free sulfur dioxide -0.069408 -0.050656
total sulfur dioxide -0.205654 -0.185100
density -0.496180 -0.174919
pH 0.205633 -0.057731
sulphates 0.093595 0.251397
alcohol 1.000000 0.476166
quality 0.476166 1.000000
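The matrices above have the layout that pandas prints for `DataFrame.cov()` and `DataFrame.corr()`. The report does not show the calls, so the following is only a sketch of how they could be produced. The frame `demo_df` and its values are illustrative and are not the real Iris measurements.

```python
import pandas as pd

# Illustrative values only (not the real Iris measurements).
demo_df = pd.DataFrame({
    "petal_length": [1.4, 1.5, 4.5, 5.1, 6.0],
    "petal_width": [0.2, 0.2, 1.5, 1.9, 2.5],
    "species": ["setosa", "setosa", "versicolor", "virginica", "virginica"],
})

numeric = demo_df.select_dtypes(include="number")  # drop the species label
cov_matrix = numeric.cov()    # sample covariance (divides by n - 1)
corr_matrix = numeric.corr()  # Pearson correlation by default

print(cov_matrix)
print(corr_matrix)
```

Both matrices are symmetric. The diagonal of the covariance matrix holds each attribute's variance, and the diagonal of the correlation matrix is always 1.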

FINAL REPORT

---

Introduction
This report presents a statistical analysis of two datasets: the Iris dataset and the Wine Quality
dataset. The goal of this analysis is to describe the central tendency and dispersion of the numeric
attributes and to uncover relationships between attributes using covariance and correlation matrices.

Statistical Measures
1. Measures of Central Tendency
Numeric Attributes:

● Mean: The average value of the attribute.


● Median: The middle value when the data is sorted.
● Mode: The most frequently occurring value.

2. Measures of Dispersion
Numeric Attributes:

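● Range: Difference between the maximum and minimum values.
● Variance: The average squared deviation from the mean. The values in this report are sample
variances (dividing by n - 1), which is the pandas default.
● Standard Deviation: The square root of the variance, expressed in the same units as the attribute.
● Interquartile Range (IQR): The difference between the 75th percentile (Q3) and the 25th
percentile (Q1).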
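As a small worked example of the variance and standard deviation definitions, the sketch below computes the sample variance by hand and compares it with Python's `statistics` module. The helper `sample_variance` is a hypothetical name and the values are illustrative, not taken from either dataset.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


# Illustrative values only (not taken from either dataset).
values = [2, 4, 4, 4, 5, 5, 7, 9]

variance = sample_variance(values)
std_dev = math.sqrt(variance)
value_range = max(values) - min(values)

print(variance, std_dev, value_range)
print(statistics.variance(values))  # standard-library sample variance for comparison
```

Dividing by n - 1 instead of n gives the sample variance, which is the convention behind the variance and standard deviation figures in the output above.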

3. Measures of Relationship
Numeric Attributes:

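● Covariance: Indicates the direction of the linear relationship between two attributes. Its
magnitude depends on the units of both attributes, so values are hard to compare across pairs.
● Correlation: Measures the strength and direction of the linear relationship between two
attributes on a unit-free scale from -1 to 1.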
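The two measures are linked: the Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this with the petal_length and petal_width figures from the Iris covariance matrix printed earlier in this report. The helper `correlation_from_covariance` is a hypothetical name.

```python
import math


def correlation_from_covariance(cov_xy, var_x, var_y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    return cov_xy / (math.sqrt(var_x) * math.sqrt(var_y))


# Values taken from the Iris covariance matrix printed earlier in this report.
cov_petal = 1.296387         # cov(petal_length, petal_width)
var_petal_length = 3.113179  # variance of petal_length
var_petal_width = 0.582414   # variance of petal_width

r = correlation_from_covariance(cov_petal, var_petal_length, var_petal_width)
print(round(r, 4))
```

The result agrees, up to rounding of the inputs, with the 0.962757 reported for this pair in the Iris correlation matrix.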

Covariance and Correlation Analysis


Iris Dataset
Covariance Matrix:

Attribute      sepal_length  sepal_width  petal_length  petal_width
sepal_length       0.685694    -0.039268      1.273682     0.516904
sepal_width       -0.039268     0.188004     -0.321713    -0.117981
petal_length       1.273682    -0.321713      3.113179     1.296387
petal_width        0.516904    -0.117981      1.296387     0.582414

Correlation Matrix:

Attribute      sepal_length  sepal_width  petal_length  petal_width
sepal_length          1.000       -0.109         0.872        0.818
sepal_width          -0.109        1.000        -0.421       -0.357
petal_length          0.872       -0.421         1.000        0.963
petal_width           0.818       -0.357         0.963        1.000


Insights:

● There is a strong positive correlation between petal_length and petal_width (r ≈ 0.963),
suggesting that as the petal length increases, the petal width also tends to increase.
● sepal_length and sepal_width have a weak negative correlation (r ≈ -0.109), indicating a minor
inverse relationship.

Wine Quality Dataset


Covariance Matrix:

The full 12 × 12 covariance matrix is listed in the output under "Correlation and covariance
matrix" above. The variance of each attribute (the diagonal of that matrix) and its covariance
with quality are:

Attribute               Variance      Covariance with quality
fixed_acidity           3.031416       0.174424
volatile_acidity        0.032062      -0.056476
citric_acid             0.037947       0.035612
residual_sugar          1.987897       0.015635
chlorides               0.002215      -0.004900
free_sulfur_dioxide     109.414884    -0.427907
total_sulfur_dioxide    1082.102373   -4.917237
density                 0.000004      -0.000267
pH                      0.023835      -0.007198
sulphates               0.028733       0.034413
alcohol                 1.135647       0.409789
quality                 0.652168       0.652168

Correlation Matrix:

The full 12 × 12 correlation matrix is listed in the output under "Correlation and covariance
matrix" above. The correlation of each attribute with quality, rounded to three decimals, is:

Attribute               Correlation with quality
fixed_acidity            0.124
volatile_acidity        -0.391
citric_acid              0.226
residual_sugar           0.014
chlorides               -0.129
free_sulfur_dioxide     -0.051
total_sulfur_dioxide    -0.185
density                 -0.175
pH                      -0.058
sulphates                0.251
alcohol                  0.476
quality                  1.000

Insights:

● The correlation matrix indicates a moderate positive correlation between fixed_acidity
and citric_acid (r ≈ 0.672), suggesting that higher fixed acidity tends to be associated with
higher citric acid content.
● residual_sugar shows almost no linear correlation with quality (r ≈ 0.014), so residual
sugar on its own says little about the quality score.
● alcohol has the strongest positive correlation with quality (r ≈ 0.476), and
volatile_acidity the strongest negative one (r ≈ -0.391).

Conclusion
The statistical analysis of the Iris and Wine Quality datasets provides insight into the
relationships between attributes. For the Iris dataset, the strong correlation between the petal
dimensions (r ≈ 0.963) shows that these attributes are closely related. In the Wine Quality dataset,
alcohol (r ≈ 0.476) and volatile acidity (r ≈ -0.391) show the strongest linear association with
quality, while most other chemical properties are only weakly correlated with it.
