12/9/2019 Understanding Boxplots - Towards Data Science

Understanding Boxplots
Michael Galarnyk
Sep 12, 2018 · 7 min read

Different parts of a boxplot

The image above is a boxplot. A boxplot is a standardized way of displaying the
distribution of data based on a five-number summary (“minimum”, first quartile (Q1),
median, third quartile (Q3), and “maximum”). It can tell you about your outliers and
what their values are. It can also tell you if your data is symmetrical, how tightly your
data is grouped, and if and how your data is skewed.

This tutorial will include:

What is a boxplot?

Understanding the anatomy of a boxplot by comparing a boxplot against the probability density function for a normal distribution.

How do you make and interpret boxplots using Python?

https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51 1/12

As always, the code used to make the graphs is available on my github. With that, let’s
get started!

What is a Boxplot?
For some distributions/datasets, you will find that you need more information than the
measures of central tendency (median, mean, and mode).

There are times when mean, median, and mode aren’t enough to describe a dataset (taken from here).

You need to have information on the variability or dispersion of the data. A boxplot is a
graph that gives you a good indication of how the values in the data are spread out.
Although boxplots may seem primitive in comparison to a histogram or density plot,
they have the advantage of taking up less space, which is useful when comparing
distributions between many groups or datasets.

Different parts of a boxplot

Boxplots are a standardized way of displaying the distribution of data based on a five
number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and
“maximum”).

median (Q2/50th Percentile): the middle value of the dataset.

first quartile (Q1/25th Percentile): the middle number between the smallest number
(not the “minimum”) and the median of the dataset.

third quartile (Q3/75th Percentile): the middle value between the median and the
highest value (not the “maximum”) of the dataset.

interquartile range (IQR): the distance from the 25th to the 75th percentile (IQR = Q3 - Q1).

whiskers (shown in blue)

outliers (shown as green circles)

“maximum”: Q3 + 1.5*IQR

“minimum”: Q1 - 1.5*IQR
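The definitions above can be sketched in a few lines of NumPy. This is only an illustration: the helper name `boxplot_summary` and the sample values are made up for this example, and `np.percentile` uses linear interpolation by default, so other quartile conventions can give slightly different numbers.

```python
import numpy as np

def boxplot_summary(values, whis=1.5):
    """Compute the boxplot quantities for a 1-D sample (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr  # the "minimum"
    upper_fence = q3 + whis * iqr  # the "maximum"
    # Anything beyond the fences is treated as an outlier
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "minimum": lower_fence,
        "maximum": upper_fence,
        "outliers": outliers.tolist(),
    }

# Made-up sample with one obviously extreme value
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
summary = boxplot_summary(sample)
print(summary)
```

Note that plotting libraries such as matplotlib draw the whiskers out to the most extreme data points that still lie inside these fences, so a drawn whisker can be shorter than the fence value computed here.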

What defines an outlier, “minimum”, or “maximum” may not be clear yet. The next
section will try to clear that up for you.

Boxplot on a Normal Distribution

Comparison of a boxplot of a nearly normal distribution and a probability density function (pdf) for a
normal distribution

The image above is a comparison of a boxplot of a nearly normal distribution and the
probability density function (pdf) for a normal distribution. The reason why I am
showing you this image is that looking at a statistical distribution is more commonplace
than looking at a box plot. In other words, it might help you understand a boxplot.

This section will cover many things including:

Why outliers make up about 0.7% of the data (for a normal distribution).

What a “minimum” and a “maximum” are

Probability Density Function

This part of the post is very similar to the 68–95–99.7 rule article, but adapted for a
boxplot. To be able to understand where the percentages come from, it is important to
know about the probability density function (PDF). A PDF is used to specify the
probability of the random variable falling within a particular range of values, as opposed
to taking on any one value. This probability is given by the integral of this variable’s
PDF over that range; that is, it is given by the area under the density function but
above the horizontal axis and between the lowest and greatest values of the range. This
definition might not make much sense so let’s clear it up by graphing the probability
density function for a normal distribution. The equation below is the probability
density function for a normal distribution:

$$f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

PDF for a Normal Distribution

Let’s simplify it by assuming we have a mean (μ) of 0 and a standard deviation (σ) of 1.

$$f(x) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{x^2}{2}}$$

PDF for a Normal Distribution

This can be graphed using anything, but I choose to graph it using Python.

# Import all libraries for this portion of the blog post

from scipy.integrate import quad
import numpy as np
import matplotlib.pyplot as plt
# IPython/Jupyter magic: omit this line when running as a plain Python script
%matplotlib inline

x = np.linspace(-4, 4, num = 100)

constant = 1.0 / np.sqrt(2*np.pi)
pdf_normal_distribution = constant * np.exp((-x**2) / 2.0)
fig, ax = plt.subplots(figsize=(10, 5));
ax.plot(x, pdf_normal_distribution);
ax.set_ylim(0);
ax.set_title('Normal Distribution', size = 20);
ax.set_ylabel('Probability Density', size = 20);

The graph above does not show you the probability of events but their probability
density. To get the probability of an event within a given range, we need to
integrate. Suppose we are interested in finding the probability of a random data point
landing within the interquartile range, i.e. within .6745 standard deviations of the
mean. We then need to integrate from -.6745 to .6745. This can be done with SciPy.

# Make PDF for the normal distribution a function

def normalProbabilityDensity(x):
    constant = 1.0 / np.sqrt(2*np.pi)
    return(constant * np.exp((-x**2) / 2.0) )

# Integrate PDF from -.6745 to .6745

result_50p, _ = quad(normalProbabilityDensity, -.6745, .6745, limit = 1000)
print(result_50p)

The same can be done for “minimum” and “maximum”.

# Make a PDF for the normal distribution a function

def normalProbabilityDensity(x):
    constant = 1.0 / np.sqrt(2*np.pi)
    return(constant * np.exp((-x**2) / 2.0) )

# Integrate PDF from -2.698 to 2.698

result_99_3p, _ = quad(normalProbabilityDensity,
                       -2.698,
                       2.698,
                       limit = 1000)
print(result_99_3p)

As mentioned earlier, outliers are the remaining 0.7% of the data.

It is important to note that for any PDF, the area under the curve must be 1 (the
probability of drawing any number from the function’s range is always 1).
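As a cross-check on the numbers used in this section, the values .6745 and 2.698 can be derived rather than typed in. The sketch below is not part of the original code; it assumes a standard normal distribution and uses `scipy.stats.norm` instead of integrating the PDF by hand.

```python
from scipy.stats import norm

# Q3 of a standard normal is the 75th percentile; Q1 is its mirror image
z_q3 = norm.ppf(0.75)
z_q1 = -z_q3
iqr = z_q3 - z_q1

# "maximum" = Q3 + 1.5*IQR ("minimum" is symmetric around 0)
upper_fence = z_q3 + 1.5 * iqr

# Probability mass inside the box and inside the fences
inside_box = norm.cdf(z_q3) - norm.cdf(z_q1)
inside_fences = norm.cdf(upper_fence) - norm.cdf(-upper_fence)
outlier_fraction = 1 - inside_fences

print(f"Q3 = {z_q3:.4f}, upper fence = {upper_fence:.3f}")
print(f"inside box = {inside_box:.3f}, outliers = {outlier_fraction:.4f}")
```

The box covers half of the probability mass by construction, and the mass outside the fences comes out at roughly 0.7%, matching the figures quoted above.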

Graphing and Interpreting a Boxplot

Free preview video from the Using Python for Data Visualization course

This section is largely based on a free preview video from my Python for Data
Visualization course. In the last section, we went over a boxplot on a normal
distribution, but as you obviously won’t always have an underlying normal distribution,
let’s go over how to utilize a boxplot on a real dataset. To do this, we will utilize the
Breast Cancer Wisconsin (Diagnostic) Dataset. If you don’t have a Kaggle account, you
can download the dataset from my github.

Read in the data

The code below reads the data into a pandas dataframe.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Put dataset on my github repo

df = pd.read_csv('https://raw.githubusercontent.com/mGalarnyk/Python_Tutorials/master/Kaggle/BreastCancerWisconsin/data/data.csv')

Graph Boxplot
A boxplot is used below to analyze the relationship between a categorical feature
(malignant or benign tumor) and a continuous feature (area_mean).

There are a couple of ways to graph a boxplot through Python. You can graph a boxplot
through seaborn, matplotlib, or pandas.

seaborn

The code below passes the pandas dataframe df into seaborn’s boxplot.

sns.boxplot(x='diagnosis', y='area_mean', data=df)

matplotlib

The boxplots you have seen in this post were made through matplotlib. This approach
can be far more tedious, but can give you a greater level of control.

malignant = df[df['diagnosis']=='M']['area_mean']
benign = df[df['diagnosis']=='B']['area_mean']

fig = plt.figure()
ax = fig.add_subplot(111)
ax.boxplot([malignant,benign], labels=['M', 'B'])

You can make this a lot prettier with a little bit of work

pandas

You can plot a boxplot by invoking .boxplot() on your DataFrame. The code below
makes a boxplot of the area_mean column grouped by diagnosis.

df.boxplot(column = 'area_mean', by = 'diagnosis');

plt.title('')

Notched Boxplot
The notched boxplot allows you to evaluate confidence intervals (by default a 95%
confidence interval) for the median of each box.

malignant = df[df['diagnosis']=='M']['area_mean']
benign = df[df['diagnosis']=='B']['area_mean']

fig = plt.figure()
ax = fig.add_subplot(111)
ax.boxplot([malignant,benign], notch = True, labels=['M', 'B']);

Not the prettiest yet.
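To make the notch less of a black box, here is a rough sketch of how its width can be computed by hand. It assumes the approximation matplotlib uses by default when no bootstrapping is requested (median ± 1.57 × IQR / √n). The helper name `notch_interval` and the two synthetic groups are made up for illustration and are not the breast cancer data.

```python
import numpy as np

def notch_interval(values):
    """Approximate 95% interval around the median (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    half_width = 1.57 * (q3 - q1) / np.sqrt(values.size)
    return median - half_width, median + half_width

# Synthetic stand-in groups with clearly different centers
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=200)
group_b = rng.normal(loc=14.0, scale=2.0, size=200)

lo_a, hi_a = notch_interval(group_a)
lo_b, hi_b = notch_interval(group_b)

# Two intervals overlap if each one starts before the other ends
notches_overlap = (lo_a <= hi_b) and (lo_b <= hi_a)
print(f"group A notch: ({lo_a:.2f}, {hi_a:.2f})")
print(f"group B notch: ({lo_b:.2f}, {hi_b:.2f})")
print("notches overlap:", notches_overlap)
```

Because the half-width shrinks with the square root of the sample size, larger groups get narrower notches even when their spread is the same.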

Interpreting a Boxplot
Data science is about communicating results so keep in mind you can always make your
boxplots a bit prettier with a little bit of work (code here).

Using the graph, we can compare the range and distribution of the area_mean for
malignant and benign diagnoses. We observe greater variability in area_mean for
malignant tumors, as well as larger outliers.

Also, since the notches in the boxplots do not overlap, you can conclude with 95%
confidence that the true medians differ.

Here are a few other things to keep in mind about boxplots:

1. Keep in mind that you can always pull out the data from the boxplot in case you
want to know what the numerical values are for the different parts of a boxplot.

2. Matplotlib does not first estimate a normal distribution and then calculate the
quartiles from the estimated distribution parameters. The median and the quartiles are
calculated directly from the data. In other words, your boxplot may look different
depending on the distribution of your data and the size of the sample, e.g.,
asymmetric and with more or fewer outliers.
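Both points can be illustrated with a short sketch. One way to get at the numbers (not necessarily the approach the author had in mind) is `matplotlib.cbook.boxplot_stats`, which computes the statistics directly from the data without drawing anything. The sample below is synthetic, with one extreme value appended on purpose.

```python
import numpy as np
from matplotlib import cbook

# Synthetic sample: 500 roughly normal values plus one obvious outlier
rng = np.random.default_rng(42)
data = np.append(rng.normal(loc=0.0, scale=1.0, size=500), [8.0])

# boxplot_stats returns one dict per dataset; we passed a single 1-D array
stats = cbook.boxplot_stats(data, whis=1.5)[0]

for key in ("q1", "med", "q3", "iqr", "whislo", "whishi"):
    print(f"{key}: {stats[key]:.3f}")
print("number of fliers:", len(stats["fliers"]))
```

Here `whislo` and `whishi` are the ends of the drawn whiskers (the most extreme points inside the fences), and `fliers` holds the points plotted as outliers.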

Conclusion
Hopefully this wasn’t too much information on boxplots. Future tutorials will take
some of this knowledge and go over how to apply it to understanding confidence
intervals. My next tutorial goes over How to Use and Create a Z Table (standard normal
table). If you have any questions or thoughts on the tutorial, feel free to reach out in the
comments below, through the YouTube video page, or through Twitter.
