
Fundamentals of Data Science2

The document introduces two main data types: Quantitative, which includes Continuous and Discrete data, and Categorical, which is further divided into Ordinal and Nominal data. It explains how to identify and analyze these data types, including measures of center (mean, median, mode) and how to handle outliers. Additionally, it discusses the importance of notation in mathematics and the structure of data in spreadsheets.

Uploaded by

adenewd86
Copyright: © All Rights Reserved

Data Types

In this video, two data types are introduced: Quantitative and Categorical.

Quantitative data takes on numeric values that allow us to perform mathematical operations
(like the number of dogs).

Categorical data is used to label a group or set of items (like dog breeds - Collies, Labs, Poodles,
etc.).

Quiz: Data Types (Quantitative vs. Categorical)

Question 1 of 2

For each variable below, identify each as either quantitative or categorical.

These are the correct matches.

Variable                                            Data Type

Zip Code                                            Categorical
Age                                                 Quantitative
Income                                              Quantitative
Marital Status (Single, Married, Divorced, etc.)    Categorical
Height                                              Quantitative

Question 2 of 2

For each variable below, identify each as either quantitative or categorical.

These are the correct matches.

Variable                                     Data Type

Letter Grades (A+, A, A-, B+, B, B-, ...)    Categorical
Travel Distance to Work                      Quantitative
Ratings on a Survey (Poor, Ok, Great)        Categorical
Temperature                                  Quantitative
Average Speed                                Quantitative

2. Data Types (Ordinal vs. Nominal)


Categorical Ordinal vs. Categorical Nominal

We can divide categorical data further into two types: Ordinal and Nominal.

Categorical Ordinal data take on a ranked ordering (like rating an interaction with the dogs on a scale
from Very Poor to Very Good).

Categorical Nominal data do not have an order or ranking (like the breeds of the dog).

3. Data Types (Continuous vs. Discrete)

We can think of quantitative data as being either continuous or discrete.

Continuous data can be split into smaller and smaller units, and still a smaller unit exists. An
example of this is the age of the dog - we can measure the units of the age in years, months, days,
hours, seconds, but there are still smaller units that could be associated with the age.

Discrete data only takes on countable values. The number of dogs we interact with is an example
of a discrete data type.

Summary of Video

The table below summarizes our data types. To expand on the information in the table, you can
look through the text that follows.

Data Types

Quantitative:
    Continuous: Height, Age, Income
    Discrete: Pages in a Book, Trees in Yard, Dogs at a Coffee Shop

Categorical:
    Ordinal: Letter Grade, Survey Rating
    Nominal: Gender, Marital Status, Breakfast Items

Below is a little more detail of the information shared in the above table.

Another Look

To break down our data types, there are two main blocks:

Quantitative and Categorical

Quantitative can be further divided into Continuous or Discrete.


Categorical data can be divided into Ordinal or Nominal.

You should now have mastered which types of data in the world around us fall into each of these
four buckets: Discrete, Continuous, Nominal, and Ordinal. In the next sections, we will work
through the numeric summaries that relate specifically to quantitative variables.

Quantitative vs. Categorical

Some of these can be a bit tricky - notice even though zip codes are a number, they aren’t really
a quantitative variable. If we add two zip codes together, we do not obtain any useful information
from this new value. Therefore, this is a categorical variable.

Height, Age, the Number of Pages in a Book, and Annual Income all take on values that we
can add, subtract and perform other operations with to gain useful insight. Hence, these
are quantitative.

Gender, Letter Grade, Breakfast Type, Marital Status, and Zip Code can be thought of as
labels for a group of items or individuals. Hence, these are categorical.

Continuous vs. Discrete

To consider if we have continuous or discrete data, we should see if we can split our data into
smaller and smaller units. Consider time - we could measure an event in years, months, days,
hours, minutes, or seconds, and even at seconds we know there are smaller units we could
measure time in. Therefore, we know this data type is continuous. Height, age, and income are
all examples of continuous data. Alternatively, the number of pages in a book, dogs I count
outside a coffee shop, or trees in a yard are discrete data. We would not want to split our dogs
in half.

Ordinal vs. Nominal

In looking at categorical variables, we found that Gender, Marital Status, Zip Code, and
Breakfast Items are nominal variables: there is no rank ordering associated with this
type of data. Whether you ate cereal, toast, eggs, or only coffee for breakfast, there is no
rank ordering associated with your breakfast.

Alternatively, Letter Grade and Survey Rating are ordinal data: each has a rank ordering associated
with it. If you receive an A, this is higher than an A-. An A- is ranked higher than a B+,
and so on. Ordinal variables frequently occur on rating scales from very poor to very good. In
many cases, we turn these ordinal variables into numbers, as we can more easily analyze them,
but more on this later!
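
As a rough illustration of turning ordinal labels into numbers, here is a minimal Python sketch. The responses and the 1-to-3 ranks are assumptions made up for this example (using the Poor / Ok / Great scale from the quiz above); the numbers only encode the order of the labels, not a measured distance between them.

```python
# Hypothetical survey responses on an ordinal scale (illustrative data only).
ratings = ["Poor", "Great", "Ok", "Ok", "Great"]

# Assumed mapping from each ordinal label to a rank; the numbers only encode order.
rank_of = {"Poor": 1, "Ok": 2, "Great": 3}

encoded = [rank_of[r] for r in ratings]
print(encoded)  # [1, 3, 2, 2, 3]

# Because the ranks preserve order, sorting by rank sorts the labels from worst to best.
sorted_ratings = sorted(ratings, key=rank_of.get)
print(sorted_ratings)
```

A nominal variable such as dog breed has no natural ranks to assign, which is why this kind of encoding is only meaningful for ordinal data.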

Final Words
In this section, we looked at the different data types we might work with in the world around us.
When we work with data in the real world, it might not be very clean - sometimes there are typos
or missing values. When this is the case, simply having some expertise regarding the data and
knowing the data type can assist in our ability to ‘clean’ this data. Understanding data types can
also assist in our ability to build visuals to best explain the data. But more on this very soon!

2. Introduction to Summary Statistics

In the next lessons, you will learn how to use statistics to describe quantitative data. You'll gain
insight into the process of how data is collected and how to answer questions using your data.
Throughout this lesson, you will learn to be critical of the analysis that happens "under the hood"
and understand what the numbers actually mean.

As an example of the analysis we do here at Udacity, we look at how long students take to
complete one of our courses or programs. We try to provide an estimate of the number of hours
or months that students will spend. One way to start is to report the average amount of time it
takes to complete a course. But that doesn't tell the whole story because there will be differences
in time spent depending on what students knew before beginning the course.

The shortest time might be just a few weeks and the longest might be a couple of years. What
proportion of students finishes within two months and what proportion takes longer than eight
months?

Using a variety of measures gives a fuller picture. Measures of center give you an idea of the
average student. Measures of spread give you an idea of how students differ. Visuals provide a more
complete picture of how long it takes any student to complete a course or program.

3. Analyzing Quantitative Data

Four Aspects for Quantitative Data

There are four main aspects to analyzing Quantitative data.

1. Measures of Center

2. Measures of Spread

3. The Shape of the data.

4. Outliers

Analyzing Categorical Data

Analyzing categorical data has fewer parts to consider. Categorical data is analyzed usually by
looking at the counts or proportion of individuals that fall into each group. For example, if we
were looking at the breeds of the dogs, we would care about how many dogs are of each breed,
or what proportion of dogs are of each breed type.
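
The counts and proportions described above can be sketched in a few lines of Python. The list of breeds is made-up illustrative data, and `Counter` from the standard library is just one convenient way to tally labels.

```python
from collections import Counter

# Hypothetical breed labels for six dogs (illustrative data only).
breeds = ["Collie", "Lab", "Lab", "Poodle", "Lab", "Collie"]

counts = Counter(breeds)          # how many dogs are of each breed
total = len(breeds)
proportions = {breed: count / total for breed, count in counts.items()}

for breed, count in counts.most_common():
    print(f"{breed}: {count} dogs, {proportions[breed]:.0%} of all dogs")
```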

Measures of Center

There are three measures of center:

1. Mean

2. Median

3. Mode

A. The Mean

In this video, we focused on the calculation of the mean. The mean is often called the average or
the expected value in mathematics. We calculate the mean by adding all of our values together
and dividing by the number of values in our dataset.

mon   tues   wed   thu   fri   sat   sun
5     3      8     3     15    45    9

Mean = 12.57 dogs
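
As a quick check of the calculation, here is a minimal Python sketch that adds the seven daily counts from the table and divides by the number of days. The variable names are chosen for this example only.

```python
# Dogs seen on each day of the week, taken from the table above.
dogs_per_day = {"mon": 5, "tues": 3, "wed": 8, "thu": 3, "fri": 15, "sat": 45, "sun": 9}

total = sum(dogs_per_day.values())   # add all of the values together
n = len(dogs_per_day)                # number of values in the dataset
mean = total / n

print(f"Mean = {mean:.2f} dogs")     # Mean = 12.57 dogs
```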

The remaining measures of the median and mode will be discussed in detail in the upcoming
quizzes and videos.

B. The Median

The median splits our data so that 50% of our values are lower and 50% are higher. We found in
this video that how we calculate the median depends on if we have an even number of
observations or an odd number of observations.

Median for Odd Values

If we have an odd number of observations, the median is simply the number in the direct
middle. For example, if we have 7 observations, the median is the fourth value when our
numbers are ordered from smallest to largest. If we have 9 observations, the median is the fifth
value.

Median for Even Values


If we have an even number of observations, the median is the average of the two values in the
middle. For example, if we have 8 observations, we average the fourth and fifth values together
when our numbers are ordered from smallest to largest.

In order to compute the median, we MUST sort our values first.

Whether we use the mean or median to describe a dataset is largely dependent on the shape of
our dataset and if there are any outliers. We will talk about this in just a bit!
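
The odd and even cases can be sketched as a small Python function. This is one possible implementation, written for illustration, and the two datasets are made up (they are not the quiz data).

```python
def median(values):
    """Return the middle value, or the average of the two middle values."""
    ordered = sorted(values)          # the values MUST be sorted first
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:                    # odd count: the single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average the middle pair

# Illustrative data (not from the quizzes).
print(median([9, 2, 7, 4, 6, 1, 8]))      # 6   (7 values: the fourth in sorted order)
print(median([9, 2, 7, 4, 6, 1, 8, 3]))   # 5.0 (8 values: average of the fourth and fifth)
```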

Question 1 of 2

If we have the data:

5, 8, 15, 7, 10, 22, 3, 1, 15

Sorted: 1, 3, 5, 7, 8, 10, 15, 15, 22

What is the median?

A) 7 B) 9.5 C) 15 D) 8 E) 7.5

Question 2 of 2

If we have the data:

5, 8, 15, 7, 10, 22, 3, 1, 15, 2

What is the median?

A) 7 B) 9.56 C) 15 D) 8 E) 7.5

The Mode

The mode is the most frequently observed value in our dataset.

There might be multiple modes for a particular dataset or no mode at all.

No Mode

If all observations in our dataset are observed with the same frequency, there is no mode. If we
have the dataset:

1, 1, 2, 2, 3, 3, 4, 4

There is no mode because all observations occur the same number of times.

Many Modes
If two (or more) values share the maximum frequency, then there is more than one mode. If we
have the dataset:

1, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 8, 9

There are two modes, 3 and 6, because these values share the maximum frequency of 3 occurrences,
while all other values appear only once.
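
The "no mode" and "many modes" rules can be sketched in Python as follows. The function is an illustration rather than a standard definition: in particular, returning an empty list when every value is equally frequent follows the convention described in this lesson, and a dataset with only one distinct value is treated here (by assumption) as having that value as its mode.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or [] when every value is equally frequent."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []                     # all frequencies equal: no mode
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 1, 2, 2, 3, 3, 4, 4]))                    # []
print(modes([1, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 8, 9]))     # [3, 6]
```
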
Quiz: Measures of Center (Mode)

Question 1 of 5

We want to summarize the number of dogs our friends have into a single number. We will
use the measures of center for this problem. Ashley has 1 dog, Steve has 1 dog, Jeff has 2
dogs, Kylie has 3 dogs, and Lisa has 8 dogs.

There is no measure of center that is always best, so we need to try all three to see what
makes sense in this situation. 1, 1, 2, 3, 8

What is the mean, median, and mode for the number of dogs our friends have?

Mean: 3 Median: 2 Mode: 1

Mean: 2 Median: 2 Mode: 8

Mean: 3 Median: 3 Mode: 1

Mean: 2 Median: 1 Mode: 8

Question 2 of 5

Check all of the below that are true with regards to our measures of center.

A) The mode is the middle number in the dataset when the numbers are rank ordered.

B) The median is the middle number in the dataset when the numbers are rank ordered.

C) The mean is always the best measure of center for any dataset.

D) The mean is always less than the median.

E) The median is always the best measure of center for any dataset.

F) The mode is always the best measure of center for any dataset.

Answer: B

Question 3 of 5

If we have the data:

5, 8, 15, 7, 10, 22, 3, 1, 15

What is the mode?

A) 7 B) 9.56 C) 9 D) 15 E) 5

Answer: D

Question 4 of 5

For the dataset below match the correct measure to the value:

8, 12, 32, 10, 3, 4, 4, 4, 4, 5, 12, 20

Sorted: 3, 4, 4, 4, 4, 5, 8, 10, 12, 12, 20, 32

These are the correct matches.

Mean = 9.83 Median = 6.5 and Mode = 4

Question 5 of 5

If we have the data:

5, 8, 15, 7, 10, 22, 3, 1, 15, 10

Mark all statements that are true.

A) The mode is 15 B) The mean is 15 C) The mode is 10 D) None of the above are true

Answer: A and C. There are two modes in this dataset, 10 and 15, each of which appears twice.
If all the values appear the same number of times, we usually say there is no mode. However, if
more than one value appears the greatest number of times, we count all of these values as modes.

4. What is Notation?

Notation is a common language used to communicate mathematical ideas. Think of notation as a
universal language used by academic and industry professionals to convey mathematical
ideas. In the next videos, you might see things that seem confusing. Use the quizzes to assist with
your understanding of the concepts.

You likely already know some notation. Plus, minus, multiply, division, and equal signs all have
mathematical symbols that you are likely familiar with. Each of these symbols replaces an idea
for how numbers interact with one another. In the coming concepts, you will be introduced to
some additional ideas related to notation. Though you will not need to use notation to complete
the project, it does have the following properties:

1. Understanding how to correctly use notation makes you seem really smart. Knowing how
to read and write in notation is like learning a new language. A language that is used to
convey ideas associated with mathematics.

2. It allows you to read documentation and apply an idea to your own
problem. Notation is used to convey how problems are solved all the time. One really
popular mathematical algorithm that is used to solve some of the world's most difficult
problems is known as Gradient Boosting. The way that it solves problems is explained
in the Wikipedia article listed under Supporting Materials. If you really want to understand
how this algorithm works, you need to be able to read and understand notation.

3. It makes ideas that are hard to say in words easier to convey. Sometimes we just don't
have the right words to say. For those situations, I prefer to use notation to convey the
message. Similar to the way an emoji or meme might convey a feeling better than words,
the notation can convey an idea better than words. Usually, those ideas are related to
mathematics, but I am not here to stifle your creativity.

Supporting Materials

• Wikipedia on Gradient boosting.

Random Variables

Example to Introduce Notation

There is a lot going on in this video - here is a recap of the big ideas.

Rows and Columns

If you aren't familiar with spreadsheets, this will be covered in detail in future lessons.
Spreadsheets are a common way to hold data. They are composed of rows and columns. Rows
run horizontally, while columns run vertically. Each column in a spreadsheet commonly holds a
specific variable, while each row is commonly called an instance or individual.

The example used in the video is shown below.

Date       Day of Week    Time Spent On Site (X)    Buy (Y)
June 15    Thursday       5                         No
June 15    Thursday       10                        Yes
June 16    Friday         20                        Yes

This is a row:

Date       Day of Week    Time Spent On Site (X)    Buy (Y)
June 15    Thursday       5                         No

This is a column:

Time Spent On Site (X)
5
10
20
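
To make the row and column idea concrete, here is a minimal Python sketch of the same example table. Representing each row as a dictionary, and the key names `date`, `day_of_week`, `time_on_site`, and `buy`, are assumptions made for this illustration; a spreadsheet stores the same information in cells.

```python
# Each dict is one row (an instance); each key is a column (a variable).
rows = [
    {"date": "June 15", "day_of_week": "Thursday", "time_on_site": 5, "buy": "No"},
    {"date": "June 15", "day_of_week": "Thursday", "time_on_site": 10, "buy": "Yes"},
    {"date": "June 16", "day_of_week": "Friday", "time_on_site": 20, "buy": "Yes"},
]

first_row = rows[0]                                     # one row: every variable for one instance
time_column = [row["time_on_site"] for row in rows]     # one column: one variable for every instance

print(first_row)
print(time_column)   # [5, 10, 20]
```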

Before Collecting Data

Before collecting data, we usually start with a question, or multiple questions, that we
would like to answer. The purpose of data is to help us in answering these questions.

Random Variables

A random variable is a placeholder for the possible values of some process (mostly... the term
'some process' is a bit ambiguous). As was stated before, notation is useful in that it helps us take
complex ideas and simplify (often to a single letter or single symbol). We see random variables
represented by capital letters (X, Y, or Z are common ways to represent a random variable).

We might have the random variable X, which is a holder for the possible values of the amount of
time someone spends on our site. Or the random variable Y, which is a holder for the possible
values of whether or not an individual purchases a product.
X is 'a holder' of the values that could possibly occur for the amount of time spent on our
website. Any number from 0 to infinity really.

Quiz: Variable Types

Example Dataset

An example of the data we might have collected in the previous video is shown here:

Date       Day of Week    Time Spent On Site (X)    Buy (Y)
June 15    Thursday       5                         No
June 15    Thursday       10                        Yes
June 16    Friday         20                        Yes

Question 1 of 2

What type of variable is the random variable X in the video in the previous concept?

A) Categorical - Ordinal

B) Categorical - Nominal

C) Quantitative - Continuous

D) Quantitative - Discrete

Answer: C

Question 2 of 2

What type of variable is the random variable Y in the video in the previous concept?

A) Categorical - Ordinal

B) Categorical - Nominal

C) Quantitative - Continuous

D) Quantitative - Discrete

Answer: B. Whether or not an individual 'buys' is a category with no order involved.

Capital vs. Lower Case Letters

Random variables are represented by capital letters. Once we observe an outcome of these
random variables, we notate it as a lower case of the same letter.

Example 1

For example, the amount of time someone spends on our site is a random variable (we are not
sure what the outcome will be for any particular visitor), and we would notate this with X. Then
when the first person visits the website, if they spend 5 minutes, we have now observed this
outcome of our random variable. We would notate any outcome as a lowercase letter with a
subscript associated with the order that we observed the outcome.

If 5 individuals visit our website, and the first spends 10 minutes, the second spends 20 minutes, the
third spends 45 minutes, the fourth spends 12 minutes, and the fifth spends 8 minutes, we can notate
this problem in the following way:

X is the amount of time an individual spends on the website.

$x_1 = 10$, $x_2 = 20$, $x_3 = 45$, $x_4 = 12$, $x_5 = 8$.

The capital X is associated with this idea of a random variable, while the observations of the
random variable take on lowercase x values.

Example 2

Taking this one step further, we could ask:

What is the probability someone spends more than 20 minutes on our website?

In notation, we would write:


P(X > 20)?

Here P stands for probability, while the parentheses encompass the statement for which we
would like to find the probability. Since X represents the amount of time spent on the website,
this notation represents the probability the amount of time on the website is greater than 20.

We could find this in the above example by noticing that only one of the 5 observations (the 45)
exceeds 20. So, we would say there is a 1 in 5, or 20%, chance that an individual spends more
than 20 minutes on our website (based on this dataset).

Example 3

If we asked: What is the probability of an individual spending 20 or more minutes on our
website? We could notate this as:

P(X ≥ 20)?

We could then find this by noticing there are two out of the five individuals that spent 20 or more
minutes on the website. So this probability is 2 out of 5 or 40%.
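
Examples 2 and 3 can be reproduced with a short Python sketch that counts the observations satisfying each condition and divides by the number of observations. The variable names are chosen for this illustration, and the results are only estimates based on these five visitors.

```python
# Observed minutes on the website for the five visitors in Example 1.
x = [10, 20, 45, 12, 8]
n = len(x)

p_greater_20 = sum(1 for xi in x if xi > 20) / n     # P(X > 20): strictly more than 20
p_at_least_20 = sum(1 for xi in x if xi >= 20) / n   # P(X >= 20): 20 or more

print(p_greater_20)    # 0.2
print(p_at_least_20)   # 0.4
```

The only difference between the two results is the visitor who spent exactly 20 minutes, which is why the choice between > and ≥ matters.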

Quiz: Introduction to Notation

Consider we have the following table:

Years Experience    Department    Part/Full-Time
5                   IT            Part-Time
10                  Finance       Full-Time
8                   HR            Full-Time
1                   Finance       Part-Time

Consider we have the following labels:

X= years of experience

Y= Department

Z= Part/Full-Time

Match each of the following notations to its corresponding value:

A. $x_1$ B. $y_2$

C. $z_3$ D. $n$

Quiz Question

Use the information above to match the correct notation label to its corresponding value.
These are the correct matches.

Notation      Value
A. $x_1$      5
B. $y_2$      Finance
C. $z_3$      Full-Time
D. $n$        4

A Better Way?

Notation for Calculating the Mean

We know that the mean is calculated as the sum of all our values divided by the number of
values in our dataset.

In our current notation, adding all of our values together can be extremely tedious. If we
want to add 3 values of some random variable together, we would use the notation:

$x_1 + x_2 + x_3$

If we want to add 6 values together, we would use the notation:

$x_1 + x_2 + x_3 + x_4 + x_5 + x_6$

To extend this to add one hundred, one thousand, or one million values would be
ridiculous! How can we make this easier to communicate?!

Summation

Aggregations

An aggregation is a way to turn multiple numbers into fewer numbers (commonly one number).

Summation is a common aggregation. The notation used to sum our values is a Greek symbol
called sigma, Σ.

Example 1

Imagine we are looking at the amount of time individuals spend on our website. We collect data
from nine individuals:

$x_1 = 10$, $x_2 = 20$, $x_3 = 45$, $x_4 = 12$, $x_5 = 8$, $x_6 = 12$, $x_7 = 3$, $x_8 = 68$, $x_9 = 5$

If we want to sum the first three values together in our previous notation, we write:

$x_1 + x_2 + x_3$

In our new notation, we can write:

$\sum_{i=1}^{3} x_i$

Notice, our notation starts at the first observation ($i = 1$) and ends at 3 (the number at the top
of our summation).

So all of the following are equal to one another:

$\sum_{i=1}^{3} x_i = x_1 + x_2 + x_3 = 10 + 20 + 45 = 75$

Example 2

Now, imagine we want to sum the last three values together.

$x_7 + x_8 + x_9$

In our new notation, we can write:

$\sum_{i=7}^{9} x_i$

Notice, our notation starts at the seventh observation ($i = 7$) and ends at 9 (the number at the
top of our summation).
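
The two summations above can be checked with a short Python sketch. The helper name `sigma` is made up for this illustration. One detail to watch is that the notation counts observations from 1, while Python lists count from 0, so $x_1$ is stored at position 0.

```python
# The nine observations from Example 1; Python lists are 0-indexed, so x_1 is x[0].
x = [10, 20, 45, 12, 8, 12, 3, 68, 5]

def sigma(values, start, end):
    """Sum x_start through x_end using the 1-based indices of the notation."""
    return sum(values[start - 1:end])

print(sigma(x, 1, 3))   # 75  (10 + 20 + 45)
print(sigma(x, 7, 9))   # 76  (3 + 68 + 5)
```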

Other Aggregations

The Σ sign is used for aggregating using summation, but we might choose to aggregate in other
ways. Summing is one of the most common ways to aggregate. If we wanted to multiply all of our
values together, we would use a product sign, Π, the capital Greek letter pi. The way we
aggregate continuous values is with something known as integration (a common technique in
calculus), which uses the symbol ∫, which is just a long s. We will not be using integrals or
products for quizzes in this class, but you may see them in the future!
Notation for the Mean

Final Steps for Calculating the Mean

To finalize our calculation of the mean, we introduce n as the total number of values in our
dataset. We can use this notation both at the top of our summation, as well as for the value that
we divide by when calculating the mean.

$\frac{1}{n}\sum_{i=1}^{n} x_i$

Instead of writing out all of the above, we commonly write $\bar{x}$ to represent the mean of a
dataset. As in the first video, though, we could use any variable. Therefore, we might also
write $\bar{y}$, or any other letter.

We also could index using any other letter, not just $i$. We could just as easily use $j$, $k$,
or $m$ to index each of our data values. The quizzes on the next concept will help reinforce this
idea.

Notice

At second 0:12, this should say $\sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5$. The $x_i$ is
missing here in front of the summation.

Quiz Question

Calculate the value of each expression where:

$x_1 = 5$
$x_2 = 15$
$x_3 = 3$
$x_4 = 3$
$x_5 = 8$
$x_6 = 10$
$x_7 = 12$

These are the correct matches.

Expression                                   Value
$n$                                          7
$\sum_{i=1}^{n} x_i$                         56
$\left(\sum_{j=2}^{7} x_j\right) + 6$        57
$x_5$                                        8
$\frac{\sum_{i=3}^{6} x_i}{n-1}$             4

Match The Notation

For this quiz, you will be matching the notation attached to the letters below to the corresponding
numeric value to make sure you understand exactly what is being done with each part of the
notation.

Notation for Quizzes

For the below quiz, let the following letters denote the corresponding notation:

A. $X$

B. $Y$

C. $x_1$

D. $n$

E. $\sum_{i=1}^{n} x_i$

Question 1 of 2

Use the letter next to the notation above to match the notation to the description of what the
notation represents.

These are the correct matches.

A = The notation for a random variable.

B = The notation for a random variable.

C = The notation for the first observed value of a random variable.

D = The notation for the number of rows in our dataset.

E = The notation for the sum of all the values in our dataset.

Notation for Quizzes

For the below quiz, let the following letters denote the corresponding notation:

A. $\sum_{i=1}^{n} x_i$

B. $\frac{\sum_{i=1}^{n} x_i}{n}$

C. $\bar{x}$

D. $\bar{y}$

E. $\frac{\sum_{j=1}^{n} y_j}{n}$

Question 2 of 2

If we wanted to provide notation for the mean of a particular dataset, which of the following
letters would correspond to the notation attached to calculating the mean? (Mark all that apply.)

Answer: B,C,D,E

Notation Recap

Notation is an essential tool for communicating mathematical ideas. We have introduced the
fundamentals of notation in this lesson that will allow you to read, write, and communicate with
others using your new skills!

Notation and Random Variables

As a quick recap, capital letters signify random variables. When we look at individual
instances of a particular random variable, we identify these as lowercase letters with subscripts
attached to each specific observation.

For example, we might have X be the amount of time an individual spends on our website. Our
first visitor arrives and spends 10 minutes on our website, and we would say $x_1$ is 10
minutes.

We might imagine the random variables as columns in our dataset, while a particular value
would be notated with the lower case letters.

Notation: $X$
English: A random variable
Example: Time spent on website

Notation: $x_1$
English: First observed value of the random variable X
Example: 15 mins

Notation: $\sum_{i=1}^{n} x_i$
English: Sum values beginning at the first observation and ending at the last
Example: 5 + 2 + ... + 3

Notation: $\frac{1}{n}\sum_{i=1}^{n} x_i$
English: Sum values beginning at the first observation and ending at the last and divide by the number of observations (the mean)
Example: (5 + 2 + 3)/3

Notation: $\bar{x}$
English: Exactly the same as the above - the mean of our data.
Example: (5 + 2 + 3)/3

Notation for the Mean

We took our notation even further by introducing the notation for summation, Σ. Using this we
were able to calculate the mean as:

$\frac{1}{n}\sum_{i=1}^{n} x_i$

In the next section, you will see this notation used to assist in your understanding of calculating
various measures of spread. Notation can take time to fully grasp. Understanding notation not
only helps in conveying mathematical ideas but also in writing computer programs - if you
decide you want to learn that too! Soon you will analyze data using spreadsheets. When that
happens, many of these operations will be hidden by the functions you will be using. But until
we get to spreadsheets, it is important to understand how mathematical ideas are commonly
communicated. This isn't easy, but you can do it!

Lesson Recap

This lesson covered some of the foundational statistical topics needed to use statistics in practice.
You can now:

• Evaluate data types and variable types

• Analyze measures of center

• Implement notation

2. Descriptive statistics two


Measures of Spread

Lesson Overview

In this lesson, we will continue to cover more topics related to analyzing quantitative variables
and you will learn to use measures of spread. Measures of spread are used to provide us an idea
of how spread-out our data are from one another.

In this lesson you will:

• Evaluate measures of spread
    • Range
    • Inter-quartile Range (IQR)
    • Standard Deviation
    • Variance

• Analyze outliers

• Evaluate descriptive and inferential statistics

Throughout this lesson, you will learn how to calculate these, as well as why we would use one
measure of spread over another.

A. Histograms

Histograms are super useful for understanding the different aspects of data and they are the most
common visual used for quantitative data. In the upcoming concepts, you will see histograms
used all the time to help you understand the four aspects we outlined earlier regarding a
quantitative variable:

• center

• spread

• shape

• outliers

How are Histograms constructed?

First, we need to bin our data. Each bin represents a range of values in a dataset. The number of
values that fall in the range of each bin determines the height of each histogram bar. As shown in
the video above, changing the range of our bins can result in slightly different visuals. However,
there is no right or wrong answer in choosing how to bin, and in most cases, the software you use
will choose the appropriate bins for you.
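
To show what binning means in practice, here is a minimal Python sketch that counts how many values fall in each bin and prints a text histogram. The dog counts and the bin width of 5 are assumptions chosen for this illustration, and the bin labels assume whole-number data.

```python
# Hypothetical daily dog counts (illustrative data only).
dogs = [3, 5, 8, 9, 12, 13, 13, 14, 15, 18, 21, 24]

bin_width = 5   # assumed bin width; other choices give a slightly different picture
bins = {}
for value in dogs:
    start = (value // bin_width) * bin_width   # left edge of the bin holding this value
    bins[start] = bins.get(start, 0) + 1

# The number of values in each bin sets the height (here, length) of each bar.
for start in sorted(bins):
    label = f"{start}-{start + bin_width - 1}"
    print(f"{label:>5} | {'#' * bins[start]}")
```

Changing `bin_width` to 10 and rerunning the sketch shows how the same data can produce a differently shaped visual.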

Weekdays vs. Weekends

The two histograms below illustrate the number of dogs Josh saw on weekdays versus weekends.
The measures of center for both histograms (mean, median, mode) are basically the same and
centered about the highest bin for both histograms, 13.

Visually, the difference between the histograms is the range or spread of dogs Josh sees during
each time period. In the upcoming lessons, we will discuss the most common ways to measure
the spread of our data.

Introduction to Five Number Summary

Calculating the 5 Number Summary

The five-number summary consists of 5 values:

1. Minimum: The smallest number in the dataset.

2. $Q_1$: The value such that 25% of the data fall below.

3. $Q_2$: The value such that 50% of the data fall below.

4. $Q_3$: The value such that 75% of the data fall below.

5. Maximum: The largest value in the dataset.

In the above video, we saw that calculating each of these values was essentially just finding the
median of a bunch of different datasets. Because we are essentially calculating a bunch of
medians, the calculation depends on whether we have an odd or even number of values.

Range

The range is then calculated as the difference between the maximum and the minimum.

IQR

The inter-quartile range is calculated as the difference between $Q_3$ and $Q_1$.
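
The "bunch of medians" idea can be sketched in Python as follows. This is one possible implementation of the median-of-halves approach (the middle value is left out of both halves when the number of values is odd); statistical software may use slightly different quartile rules and give slightly different answers. The function names are made up for this illustration, the sketch assumes at least two values, and the data is the first practice dataset from the quiz that follows.

```python
def median(ordered):
    """Median of an already-sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (minimum, Q1, Q2, Q3, maximum) using the median-of-halves approach."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]              # values below the median position
    upper = ordered[half + n % 2:]      # values above it (middle value skipped when n is odd)
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

minimum, q1, q2, q3, maximum = five_number_summary([1, 5, 10, 3, 8, 12, 4, 1, 2, 8])
print(minimum, q1, q2, q3, maximum)   # 1 2 4.5 8 12
print("Range:", maximum - minimum)    # Range: 11
print("IQR:", q3 - q1)                # IQR: 6
```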

In the upcoming sections, you will practice this with Katie and on your own.

Quiz: 5 Number Summary Practices

Question 1 of 2

Identify the following for this dataset:

1, 5, 10, 3, 8, 12, 4, 1, 2, 8

Sorted: 1, 1, 2, 3, 4, 5, 8, 8, 10, 12 (the median, 4.5, falls between 4 and 5)

These are the correct matches.

Item Number

Range 11

First Quartile 2

Third Quartile 8

Median 4.5

Question 2 of 2

Identify the following for this dataset:

5, 10, 3, 8, 12, 4, 1, 2, 8

Sorted: 1, 2, 3, 4, 5, 8, 8, 10, 12 (the first quartile, 2.5, falls between 2 and 3; the third quartile, 9, falls between 8 and 10)

These are the correct matches.

Item Number

Range 11
First Quartile 2.5

Third Quartile 9

Median 5

What if We Only Want One Number?

Looking back at the histograms Josh created for the number of dogs he recorded seeing on
weekdays and weekends, we can use the histograms to mark the values of the 5 number
summary and create a box plot.

• Box plots are useful for quickly comparing the spread of two data sets across some key
metrics, like quartiles, maximum, and minimum.

How do we create the box plot?

1. The beginning of the line to the left of the box and the end of the line to the right of the
box represent the minimum and maximum values in a dataset.

2. The visual distance between these markings is an indication of the range of the values.

3. The box itself represents the IQR. The box begins at the Q1 value, ends at the Q3 value,
and Q2, or the median, is represented by a line within the box.

From both the histograms and box plots, we can see that the number of dogs seen on weekends
varies much more than on weekdays.

However, instead of depending on a visual of the 5 number summary to compare our data, in the
next lesson, we will learn about using a single value to compare the two distribution spreads
- standard deviation.

Standard Deviation and Variance

The standard deviation is one of the most common measures for talking about the spread of data.
It can be thought of as the typical distance of each observation from the mean; formally, it is the
square root of the average squared distance from the mean.

In the above video, we saw this as how far individuals were from the average distance from work.
The distances shown are only a few examples from the full data set: the mean of just those 4
numbers is 38.5, while the mean of 18 shown later in the video is the mean of the full data set,
which is not shown in the video. In the next video, you will see exactly how this is calculated.

Example: Calculating the Standard Deviation

The dataset for the example is 10, 14, 10, 6.

1. First, calculate the mean:

$$\bar{x} = \frac{\sum_{i=1}^{4} x_i}{n} = \frac{40}{4} = 10$$

2. Next, calculate the distance of each observation from the mean and square the value:

$$(x_i - \bar{x})^2$$

$$(10 - 10)^2 = 0^2 = 0$$

$$(14 - 10)^2 = 4^2 = 16$$

$$(10 - 10)^2 = 0^2 = 0$$

$$(6 - 10)^2 = (-4)^2 = 16$$

3. Then calculate the variance, the average squared difference of each observation from the
mean:

$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{4}(0 + 16 + 0 + 16) = \frac{32}{4} = 8$$

4. Finally, calculate the standard deviation, the square root of the variance:

$$\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \sqrt{8} \approx 2.83$$

The standard deviation is, roughly speaking, how far a typical point in our dataset is from the mean.

Quiz: Measures of Spread (Calculation and Units)

Question 1 of 2

If we measure the variance associated with our sales in dollars for each month for 3 years, what
are the units associated with the variance?

A. Dollars
B. Years
C. Dollars per year
D. Dollars squared
E. Dollars per month

Answer: D

Question 2 of 2

For the following set of data match the correct values.

Remember to find the variance we first find the mean average of the values, then subtract the
mean from each value, then square each of these values, then add them up, then divide by the
number of values. (Round your answer to two decimal places at the end of your calculation -
don't round along the way.)

1, 5, 10, 3, 8, 12, 4 (sorted: 1, 3, 4, 5, 8, 10, 12)

These are the correct matches.

Measure Value

Variance 13.55

Standard Deviation 3.68

Other Measures of Spread

5 Number Summary

In the previous sections, we have seen how to calculate the values associated with the
five-number summary (min, Q1, Q2, Q3, max), as well as the measures of spread
associated with these values (range and IQR).

For datasets that are not symmetric, the five-number summary and a corresponding box plot are
a great way to get started with understanding the spread of your data. Although I still prefer a
histogram in most cases, box plots can make it easier to compare two or more groups. You will
see this in the quizzes towards the end of this lesson.

Variance and Standard Deviation


Two additional measures of spread that are used all the time are the variance and standard
deviation. At first glance, the variance and standard deviation can seem overwhelming. If you do
not understand the expressions below, don't panic! In this section, I just want to give you an
overview of what the next sections will cover. We will walk through each of these parts
thoroughly in the next few sections, but the big picture goal is to generally understand the
following:

1. How the mean, variance, and standard deviation are calculated.

2. Why the measures of variance and standard deviation make sense to capture the spread of
our data.

3. Fields where you might see these values used.

4. Why we might use the standard deviation or variance as opposed to the values associated
with the 5 number summary for a particular dataset.

Calculation

We calculate the variance in the following way:

$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

The variance is the average squared difference of each observation from the mean.

To calculate the variance of a set of 10 values in a spreadsheet application, with our 10 data
points in column A, we would create a new column B by typing in something like
=A1-AVERAGE(A$1:A$10) and copying this down for all 10 rows. This would find us the
difference between each data point and the mean average of all the data. Then we create a new
column C having the square of these differences, using the formula =B1^2 in cell C1, and
copying that down for all rows. Then in the cell below this new column, cell C11, type
in =SUM(C1:C10). This adds up all these values in column C. Finally in cell C12, we divide
this sum by the number of data points we have, in this case, ten: =C11/10. This cell C12 now
contains the variance for our 10 data points.

More detailed guidance on using spreadsheets like this may be included in a future lesson in your
program.

The standard deviation is the square root of the variance. Therefore, the formula for the standard
deviation is the following:

$$\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

In the same spreadsheet as above, to find the standard deviation of our same set of 10 data
values, we would use another cell like C13 to take the square root of our variance measure, by
typing in =sqrt(C12).

The standard deviation is a measurement that has the same units as our original data, while the
units of the variance are the square of the units in our original data. For example, if the units in
our original data were dollars, then units of the standard deviation would also be dollars, while
the units of the variance would be dollars squared.
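The spreadsheet steps described above can be mirrored in Python. This is only a sketch: the ten values are hypothetical stand-ins for cells A1:A10, and the variable names simply echo the cell references used in the text.

```python
import math

# Hypothetical values standing in for cells A1:A10 (not from the lesson).
column_a = [12, 15, 11, 18, 20, 14, 16, 13, 17, 14]

average = sum(column_a) / len(column_a)        # AVERAGE(A$1:A$10)
column_b = [a - average for a in column_a]     # =A1-AVERAGE(A$1:A$10), copied down
column_c = [b ** 2 for b in column_b]          # =B1^2, copied down
c11 = sum(column_c)                            # =SUM(C1:C10)
c12 = c11 / len(column_a)                      # =C11/10, the variance
c13 = math.sqrt(c12)                           # =sqrt(C12), the standard deviation

print(f"variance = {c12:.2f}, standard deviation = {c13:.2f}")
```

If the values in column A were dollars, `c12` would be in dollars squared and `c13` in dollars, which matches the point about units made above.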

Again, this section is designed as background knowledge for the following sections. If it doesn't
make sense on this first pass, do not worry. You will be guided in future sections in performing
these calculations, and building your intuition, as you work through an example using the salary
data. Then we will provide context about why these calculations are important, and where you
might see them!

Why the Standard Deviation?

Standard deviation is a common metric used to compare the spread of two datasets. The
benefits of using a single metric instead of the 5 number summary are:

• It simplifies the amount of information needed to give a measure of spread

• It is useful for inferential statistics

Important Final Points

1. The variance is used to compare the spread of two different groups. A set of data with
higher variance is more spread out than a dataset with lower variance. Be careful though,
there might just be an outlier (or outliers) that is increasing the variance when most of the
data are actually very close.

2. When comparing the spread between two datasets, the units of each must be the same.

3. When data are related to money or the economy, higher variance (or standard deviation)
is associated with higher risk.

4. The standard deviation is used more often in practice than the variance because it shares
the units of the original dataset.

Use in the World

The standard deviation is associated with risk in finance, assists in determining the significance
of drugs in medical studies, and measures the error of our results for predicting anything from
the amount of rainfall we can expect tomorrow to your predicted commute time tomorrow.
These applications are beyond the scope of this lesson as they pertain to specific fields, but know
that understanding the spread of a particular set of data is extremely important to many areas. In
this lesson, you mastered the calculation of the most common measures of spread.

Quiz: Standard Deviation and Variance

Question 1 of 3

Assume d1 and d2 are datasets both measured in the same units. We know that the standard
deviation of d1 is 5 and the variance of d2 is 36. Which of the following are certainly true? Mark
all that apply.

Remember the Standard Deviation is the square root of the variance. So if the Variance is 4 the
Standard Deviation would be 2.

A. The mean is larger for d1 than for d2.


B. The variance for d2 is larger than for d1.
C. The standard deviation for d2 is larger than for d1.
D. The median for d2 is larger than for d1.
E. The range for d2 is larger than for d1.

Answer: B and C

That's right! We can only talk about specific measures of spread, and not measures of center.
Additionally, the range isn't directly associated with the standard deviation, so we can't make a
claim that is always true like the final option.

Question 2 of 3

If a dataset has a standard deviation of zero, which of the following MUST be true?

A. All the data points must be zero.


B. All the data points must be the same
C. We made a calculation error because it is not possible for the standard deviation to be
zero.

Answer: B

That's right! Since the standard deviation is a measure of spread, a zero value means that all of
our data points are the same value.

Question 3 of 3

For each of the below: If the statement is true, mark the box next to the statement.

A. If two datasets have the same variance, they will also have the same standard deviation.
B. If I have two investment options with the same mean return, it really doesn't matter which
I invest in.
C. If I have two investment options with the same standard deviation associated with the
return, they will also have the same max possible return.

Answer: A

That is correct! Besides the mean return of an investment, we should also consider the spread
associated with the return. But just because the standard deviation associated with each
investment is the same, this does not mean the max you could make for each investment is the
same.

Quiz: Applied Standard Deviation and Variance

Investment Data

Suppose we have two investment opportunities:

Returns

              Year 1  Year 2  Year 3  Year 4  Year 5  Year 6

Investment 1  5%      5%      5%      5%      5%      5%

Investment 2  12%     -2%     10%     0%      7%      3%

The returns for 6 consecutive years for each investment are shown above. Use this information to
answer the questions below.

Question 1 of 3

Use the information above to match the mean/expected return for each investment.

These are the correct matches.

Investment Return

Investment 1 5%

Investment 2 5%

In the previous question, you should have found that these investments have the same
mean! That is, regardless of which investment opportunity you choose, you are expected to earn
the same amount. So how are they different? Let's look at some additional questions to see if we
can find some differences.

Question 2 of 3

Using the information above, mark all of the below that are true statements.

A. The risk associated with investment 1 is lower than the risk associated with
Investment 2.
B. The standard deviation associated with Investment 1 is smaller than the standard
deviation associated with Investment 2.
C. Knowing the mean return amount across all the years for each investment
provides us with all of the information necessary to understand which investment
we should choose.

Answer: A and B

That's right! Because the return is the same year over year for Investment 1, it has 'no spread' or a
standard deviation of 0. This smaller standard deviation is associated with smaller risk.
Understanding the spread of values we could earn is just as important as understanding the
expected return (mean return).

Question 3 of 3

Based on the observed data, which of the above two investments has the best opportunity of
earning more than 7%?

A. Investment 1
B. Investment 2
C. Neither.
D. We cannot tell.

Answer: B

That's right! Only Investment 2 has earned more than 7%, so it is more likely (with a 2/6 = 1/3 chance),
whereas Investment 1 has a 0/6 chance of earning more than 7% based on our observed data.
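The answers to the three questions above can be checked with a short Python sketch. It assumes the yearly returns from the investment table and uses the population formulas from this lesson (dividing by n), via the standard library's `statistics.pstdev`.

```python
from statistics import mean, pstdev

# Yearly returns in percent, copied from the investment table above.
investment_1 = [5, 5, 5, 5, 5, 5]
investment_2 = [12, -2, 10, 0, 7, 3]

for name, returns in (("Investment 1", investment_1), ("Investment 2", investment_2)):
    share_above_7 = sum(r > 7 for r in returns) / len(returns)
    print(f"{name}: mean = {mean(returns)}%, "
          f"standard deviation = {pstdev(returns):.2f}%, "
          f"share of years above 7% = {share_above_7:.2f}")
```

Both means come out to 5%, while only Investment 2 has a non-zero standard deviation and any years above 7%.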

Useful Insight

The above example is a simplified version of the real world but does point out something useful
that you may have heard before. Notice if you were not fully invested in either Investment 1 or
fully invested in Investment 2, but instead, you were diversified across both investment options,
you could earn more than either investment individually. This is the benefit of diversifying your
portfolio for long-term gains. For short-term gains, you might not need or want to diversify. You
could get lucky and hit short-term gains associated with the upswings (12%, 10%, or 7%) of
Investment 2. However, you might also get unlucky, and hit a down term and earn nothing or
even lose money on your investment using this same strategy.

Final Quiz on Measures of Spread


Question 1 of 2

For the following dataset, match each value to the appropriate label:

15, 4, 3, 8, 15, 22, 7, 9, 2, 3, 3, 12, 6

These are the correct matches.

Term Value

n 13

Median 7

First quartile 3

Third quartile 13.5

Mean 8.4

Mode 3

Question 2 of 2

For the following dataset, match each value to the appropriate label:

15, 4, 3, 8, 15, 22, 7, 9, 2, 3, 3, 12, 6

These are the correct matches.

Term Value

Inter-quartile range 10.5

range 20

variance 33.9

standard deviation 5.8

minimum 2

maximum 22
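As a cross-check on the two answer tables above, the same values can be reproduced with Python's standard `statistics` module. This is a sketch under one stated assumption: `statistics.quantiles` with `method="exclusive"` happens to agree with the median-of-halves quartiles for this 13-value dataset, but it is a different convention and will not match for every dataset.

```python
import statistics

data = [15, 4, 3, 8, 15, 22, 7, 9, 2, 3, 3, 12, 6]

# Quartile cut points; the convention is an assumption (see the note above).
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")

summary = {
    "n": len(data),
    "mean": round(statistics.mean(data), 1),
    "median": statistics.median(data),
    "mode": statistics.mode(data),
    "first quartile": q1,
    "third quartile": q3,
    "inter-quartile range": q3 - q1,
    "range": max(data) - min(data),
    "variance": round(statistics.pvariance(data), 1),          # divides by n
    "standard deviation": round(statistics.pstdev(data), 1),   # square root of the variance
    "minimum": min(data),
    "maximum": max(data),
}

for label, value in summary.items():
    print(f"{label}: {value}")
```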

Measures of Center and Spread Summary

Recap
Variable Types

We have covered a lot up to this point! We started with identifying data types as
either categorical or quantitative. We then learned we could identify quantitative variables as
either continuous or discrete. We also found we could identify categorical variables as
either ordinal or nominal.

Categorical Variables

When analyzing categorical variables, we commonly just look at the count or percent of a group
that falls into each level of a category. For example, if we had two levels of a dog
category: lab and not lab. We might say, 32% of the dogs were lab (percent), or we might say 32
of the 100 dogs I saw were labs (count).

However, the 4 aspects associated with describing quantitative variables are not used to describe
categorical variables.

Quantitative Variables

Then we learned there are four main aspects used to describe quantitative variables:

1. Measures of Center

2. Measures of Spread

3. Shape of the Distribution

4. Outliers

We looked at calculating measures of Center

1. Means

2. Medians

3. Modes

We also looked at calculating measures of Spread

1. Range

2. Interquartile Range

3. Standard Deviation

4. Variance

Calculating Variance

We saw that we could calculate the variance as:

$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

You will also see:

$$\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

The reason for this is beyond the scope of what we have covered thus far, but you can find an
explanation here.

You can commonly find answers to your questions with a quick Google search. Now is a great
time to get started with this practice! This answer should make more sense
at the completion of this lesson.
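The difference between the two formulas can be seen numerically. The sketch below reuses the four-value dataset from the worked standard deviation example earlier in this lesson; `statistics.pvariance` divides by n and `statistics.variance` divides by n - 1.

```python
import statistics

data = [10, 14, 10, 6]   # dataset from the worked example earlier in the lesson

n = len(data)
mean = sum(data) / n
squared_differences = sum((x - mean) ** 2 for x in data)   # 0 + 16 + 0 + 16

divide_by_n = squared_differences / n                # first formula
divide_by_n_minus_1 = squared_differences / (n - 1)  # second formula

# The standard library exposes both versions under different names.
print(f"{divide_by_n:.2f} {statistics.pvariance(data):.2f}")
print(f"{divide_by_n_minus_1:.2f} {statistics.variance(data):.2f}")
```

With only four values the two results differ noticeably (8 versus roughly 10.67); the gap shrinks as the number of values grows.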

Standard Deviation vs. Variance

The standard deviation is the square root of the variance. In practice, you usually use the
standard deviation rather than the variance. The reason for this is because the standard deviation
shares the same units with our original data, while the variance has squared units.

What Next?

In the next sections, we will be looking at the last two aspects of quantitative
variables: shape and outliers. What we know about measures of center and measures of spread
will assist in your understanding of these final two aspects.

Supporting Materials

• Calculating Variance

Shape

Histograms

We learned how to build a histogram in this video, as this is the most popular visual for
quantitative data.

Shape

From a histogram, we can quickly identify the shape of our data, which influences all of the
measures we learned in the previous concepts. We learned that the distribution of our data is
frequently associated with one of the three shapes:

1. Right-skewed

2. Left-skewed

3. Symmetric (frequently normally distributed)

Summary

Shape               Mean vs. Median           Real-World Applications

Symmetric (Normal)  Mean equals Median        Height, Weight, Errors, Precipitation

Right-skewed        Mean greater than Median  Amount of drug remaining in a bloodstream, Time between phone calls at a call center, Time until light bulb dies

Left-skewed         Mean less than Median     Grades as a percentage in many universities, Age of death, Asset price changes

The mode of a distribution is essentially the tallest bar in a histogram. There may be multiple
modes depending on the number of peaks in our histogram.
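To make the idea of the tallest bar concrete, here is a small text-based histogram in Python. The observations and the bin width are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical observations (for illustration only, not from the lesson).
values = [1.2, 1.7, 2.1, 2.3, 2.4, 2.8, 3.1, 3.3, 3.4, 3.6, 3.9, 4.5, 5.8]
bin_width = 1.0

# Assign each value to the left edge of its bin, e.g. 2.3 -> 2.0.
counts = Counter((v // bin_width) * bin_width for v in values)

for left_edge in sorted(counts):
    bar = "#" * counts[left_edge]
    print(f"[{left_edge:.1f}, {left_edge + bin_width:.1f}): {bar}")

# The mode of the histogram is the bin (or bins) with the highest count.
tallest = max(counts.values())
modal_bins = sorted(edge for edge, count in counts.items() if count == tallest)
print("tallest bar(s) start at:", modal_bins)
```

Changing `bin_width` can change which bar is tallest, so the mode read from a histogram depends on how the bins are chosen.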

The Shape For Data In The World

If you're working with data, you can always build a Quick Plot to see the shape.

Just to apply some context, some examples of approximately bell-shaped data include heights
and weights, standardized test scores, precipitation amounts, the mean of a distribution, or errors
in manufacturing processes. Common data that follow left-skewed distributions include GPAs,
the age of death, and asset price changes. Common data that follow approximately right-skewed
distributions include the amount of drug left in your bloodstream over time, the distribution of
wealth, and human athletic abilities.

There are links below in the instructor notes in case you want to learn more about each of these
cases.

Though these three shapes (right-skewed, left-skewed, and symmetric) are the most common,
data in the real world can be messy and might not follow any of these
distributions.

We will talk about this more in the next section.

When working with data, building a quick plot lets you quickly see the shape of your data.

Distribution Shape  Types of Data

Bell Shaped         Heights, Weight, Scores

Left Skewed         GPA, Age of Death, Price

Right Skewed        Distribution of Wealth, Athletic Abilities

References

These are the references used to pull the applications of each shape.

• Quora

• University of Texas

• Stack Exchange

Supporting Materials

• Quora

• Stack Exchange

Quiz: Shape and Outliers (What's the Impact?)


Question 1 of 2

Match the distribution shape with the correct relationship in comparing the mean to the median.

These are the correct matches.

Shape Comparison

Right-skewed Mean is greater than the Median.

Left-skewed Mean is less than the Median.


Symmetric Mean is equal to the Median.

Quiz: Shape and Outliers (Comparing Distributions)

Image Summary

In the below image, we have three box-plots. Each box-plot is for a different Iris
flower: setosa, versicolor, or virginica. On the y-axis, we are given the sepal length. Notice
that virginica has an outlier towards the bottom of the plot. Therefore, the minimum is not given
by the bottom line here; rather, it is provided by this point.

Box Plots of Sepal length for 3 Iris Flower Species

Quick Refresher: The measures of center and spread we can determine from a Box Plot are as
follows. Let's use Setosa for these examples.

Median is the centerline inside the box and is about 5.

IQR is the space between the first and third quartiles, which are the edges of the box. They are about
4.8 for the first quartile and 5.2 for the third, so the IQR is about 0.4.

Questions 1 - 2: Petal Length

The below plot will be used to answer the first two questions in this section.

Question 1 of 5

What is the name of the above plot?

A. Bar Chart
B. Box Plot
C. Histogram
D. Pie Chart

Answer: C

Question 2 of 5

What is the shape of the above distribution?

A. Right skewed
B. Left skewed
C. Symmetric
D. Bi-modal

Answer: D

Questions 3 - 5: Shape and Outliers

Use the below image to assist with answering the next three questions.
Plot image for the quiz below

Question 3 of 5

What is the name of the above plot?

A. Bar Chart
B. Box Plot
C. Histogram
D. Pie Chart

Answer: B

Question 4 of 5

What is the shape of the distribution?

A. Right skewed
B. Left skewed
C. Symmetric
D. Bi-modal

Answer: B

Question 5 of 5

Select the true statement for the box-plot above.

A. The mean is less than the median.


B. The mean is greater than the median.
C. The mean is approximately equal to the median.
D. It is impossible to tell the relationship between the mean and median.

Answer: A

Histograms

Let the histogram on the left be Histogram 1 and the histogram on the right be Histogram 2.

Histograms for the Quiz below.

Quick Notes

Pay attention to the scale of these two graphs. The first deals with much larger numbers.

The median is the middle number and is far less affected by outliers than the mean.

The mean factors in all the numbers, so outliers pull the mean towards them.

A left-skewed graph starts with low frequencies and then slopes up, leaving a long tail on the
left. A right-skewed graph starts with high frequencies and then slopes down, leaving a long tail on the right.
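The notes above about outliers can be illustrated with a few lines of Python. The commute times are hypothetical values chosen for illustration; the point is only how differently the mean and the median react to one extreme value.

```python
from statistics import mean, median

# Hypothetical commute times in minutes (illustrative values only).
commutes = [18, 20, 21, 22, 24]
with_outlier = commutes + [95]   # one unusually long commute

print("without outlier: mean =", mean(commutes), "| median =", median(commutes))
print("with outlier:    mean =", round(mean(with_outlier), 1),
      "| median =", median(with_outlier))
```

Adding the single long commute pulls the mean up by more than 12 minutes, while the median moves by only half a minute. The mean ends up greater than the median, the same relationship listed for right-skewed data.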

Quiz Question

Correctly match the histograms to the statements that are true about each.

These are the correct matches.

Statement Histogram

Mean is greater than the median. Histogram 1

Data has higher variance. Histogram 1

Binwidth is equal to 0.5. Histogram 2

The range is approximately 5.5. Histogram 2

Distribution is right-skewed. Histogram 1

The mean is approximately equal to the median. Histogram 2

Descriptive Statistics Summary


Recap

Variable Types

We have covered a lot up to this point! We started with identifying data types as
either categorical or quantitative. We then learned we could identify quantitative variables as
either continuous or discrete. We also found we could identify categorical variables as
either ordinal or nominal.

Categorical Variables

When analyzing categorical variables, we commonly just look at the count or percent of a group
that falls into each level of a category. For example, if we had two levels of a dog
category: lab and not lab. We might say, 32% of the dogs were lab (percent), or we might say 32
of the 100 dogs I saw were labs (count).

However, the 4 aspects associated with describing quantitative variables are not used to describe
categorical variables.

Quantitative Variables

Then we learned there are four main aspects used to describe quantitative variables:

1. Measures of Center

2. Measures of Spread

3. Shape of the Distribution

4. Outliers

Measures of Center

We looked at calculating measures of Center

1. Means

2. Medians

3. Modes

Measures of Spread

We also looked at calculating measures of Spread

1. Range

2. Interquartile Range

3. Standard Deviation

4. Variance

Shape

We learned that the distribution of our data is frequently associated with one of the three shapes:

1. Right-skewed

2. Left-skewed

3. Symmetric (frequently normally distributed)

Depending on the shape associated with our dataset, certain measures of center or spread may be
better for summarizing our dataset.

When we have data that follows a normal distribution, we can completely understand our
dataset using the mean and standard deviation.

However, if our dataset is skewed, the 5 number summary (and measures of center associated
with it) might be better to summarize our dataset.

Outliers

We learned that outliers have a larger influence on measures like the mean than on measures like
the median. We learned that we should work with outliers on a situation by situation basis.
Common techniques include:

1. At least note they exist and the impact on summary statistics.

2. If an outlier is a typo, remove or fix it.

3. Understand why they exist, and the impact on questions we are trying to answer about our
data.

4. Reporting the 5 number summary values is often a better indication than measures like the
mean and standard deviation when we have outliers.

5. Be careful in reporting. Know how to ask the right questions.

Histograms and Box Plots

We also looked at histograms and box plots to visualize our quantitative data. Identifying outliers
and the shape associated with the distribution of our data are easier when using a visual as
opposed to using summary statistics.

What Next?

Up to this point, we have only looked at Descriptive Statistics, because we are describing our
collected data. In the final sections of this lesson, we will be looking at the difference
between Descriptive Statistics and Inferential Statistics.

Descriptive vs. Inferential Statistics

Descriptive Statistics

Descriptive statistics is about describing our collected data.

Inferential Statistics

Inferential Statistics is about using our collected data to draw conclusions about a larger
population.

We looked at specific examples that allowed us to identify the following:

1. Population - our entire group of interest.

2. Parameter - numeric summary about a population

3. Sample - a subset of the population

4. Statistic - numeric summary about a sample

Quiz: Descriptive vs. Inferential


Question 1 of 3

Identify the population, parameter, sample, and statistic for the below scenario:

Suppose we are interested in the average number of hours slept by all Udacity students (100,000
students). I send an email to all Udacity students, but I only receive 5,000 response emails. The
average amount of sleep of those that responded was 6.8 hours of sleep.

These are the correct matches.

Term Description

Population All Udacity students

Parameter We cannot know for sure.

Sample 5,000 Udacity students

Statistic 6.8 hours of sleep

Question 2 of 3

Identify the population(s), parameter(s), sample(s), and statistic(s) for the below scenario:

Consider we own a bagel shop. We know that the average diameter of all of our bagels is 5.5
inches. A competitor moves right next door to us! We are interested in if they make larger bagels
than us. We obtain 100 of their bagels, and we find they have an average diameter of 6 inches.

These are the correct matches.

Description Term

5.5 inches Parameter

6 inches Statistic

All the bagels at our bagel shop. Population

All the bagels at our competitor's bagel shop. Population

The 100 bagels from the competitor's bagel shop. Sample

Answer: Amazing! This one is really tricky.

Essentially we have two populations - one is all the bagels from our competitor, and the second
is all the bagels from our shop. We know the diameter of all the bagels at our shop, so this is a
parameter. The 100 bagels from the competitor are now a sample, and we have a statistic, which
is our numeric summary from that sample of 6 inches.

Question 3 of 3

For the below, match the term to the correct description.

These are the correct matches.

Description Term

A numeric summary of a sample. Statistic

A numeric summary of a population. Parameter

Drawing conclusions regarding a population using information from a sample. Inference

Drawing conclusions regarding a sample using information from a population. None

A subset of a population. Sample

Our entire group of interest. Population


Frequently we do not know this value, so we must try and estimate. Parameter

Descriptive vs. Inferential Statistics

In this section, we learned about how Inferential Statistics differs from Descriptive Statistics.

Descriptive Statistics

Descriptive statistics is about describing our collected data using the measures discussed
throughout this lesson: measures of center, measures of spread, the shape of our distribution, and
outliers. We can also use plots of our data to gain a better understanding.

Inferential Statistics

Inferential Statistics is about using our collected data to draw conclusions about a larger
population. Performing inferential statistics well requires that we take a sample that accurately
represents our population of interest.

A common way to collect data is via a survey. However, surveys may be extremely biased
depending on the types of questions that are asked, and the way the questions are asked. This is a
topic you should think about when tackling the first project.

We looked at specific examples that allowed us to identify the following:

1. Population - our entire group of interest.

2. Parameter - numeric summary about a population

3. Sample - a subset of the population

4. Statistic - numeric summary about a sample
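The four terms above can be made concrete with a small simulation in Python. Everything here is hypothetical: the sleep hours are simulated rather than real survey data, and the sample is drawn at random, which a real email survey would not guarantee.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Population: simulated hours slept by 100,000 students (hypothetical values).
population = [random.gauss(7.0, 1.5) for _ in range(100_000)]
parameter = sum(population) / len(population)   # numeric summary of the population

# Sample: 5,000 students drawn at random from the population (an assumption).
sample = random.sample(population, 5_000)
statistic = sum(sample) / len(sample)           # numeric summary of the sample

print(f"parameter = {parameter:.2f}, statistic = {statistic:.2f}")
```

In practice the parameter is usually unknown, which is why the statistic from a representative sample is used to estimate it. If the people who respond differ systematically from those who do not, the statistic can be biased.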

Looking Ahead

Though we will not be diving deep into inferential statistics within this course, you are now
aware of the difference between these two branches of statistics. If you have ever conducted a
hypothesis test or built a confidence interval, you have performed inferential statistics. The way
we perform inferential statistics is changing as technology evolves. Many career paths
involving Machine Learning and Artificial Intelligence are aimed at using collected data to
draw conclusions about entire populations at an individual level. It is an exciting time to be a part
of this space, and you are now well on your way to joining the other practitioners!

Lesson Review

Congratulations on completing this lesson on descriptive statistics. You learned some
foundational metrics for understanding data, including how to:

• Evaluate measures of spread

  • Range

  • Interquartile Range (IQR)

  • Standard Deviation

  • Variance

• Analyze outliers

• Evaluate descriptive and inferential statistics

Part Three Spreadsheets

In this introductory lesson, we'll focus on:

• A little background on spreadsheets

• How to set up a spreadsheet

• Major features of spreadsheets

Spreadsheet Options

It's important to note that there are multiple options for spreadsheets to use in this course. I'll be
using Microsoft Excel, but much of the same functionality (with some minor differences in
menus used) is also available in Google Sheets, Apple Numbers, and Apache OpenOffice Calc.

Commas vs. Periods in Different Countries

A quick note before we get started with spreadsheet applications.

Depending on your location, decimals and commas may be treated differently in spreadsheet
applications (mostly in European countries) than what you will see in this course.

For this course, we will be using a convention where commas separate large values - like
1,000.00 or 1,000,000.00. On the other hand, the period will be used as the decimal separator, as
you can see from these previous numbers.

You may want to ensure that you are following along with these same conventions, or at least be
aware of the differences you may see should you be using a different convention for commas and
decimals.
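To see why the convention matters when numbers are read as text, here is a minimal Python sketch. The helper `parse_number` is hypothetical (it is not part of any spreadsheet application or library) and only handles plain positive or negative numbers.

```python
def parse_number(text, decimal_sep=".", thousands_sep=","):
    """Convert a formatted number string to a float under the given convention."""
    cleaned = text.replace(thousands_sep, "").replace(decimal_sep, ".")
    return float(cleaned)


# Convention used in this course: comma for thousands, period for decimals.
print(parse_number("1,000,000.00"))

# Convention common in many European locales: period for thousands, comma for decimals.
print(parse_number("1.000.000,00", decimal_sep=",", thousands_sep="."))
```

Both calls return the same value, one million. Reading the second string under the course convention would fail, which is the kind of mismatch the settings below are meant to avoid.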

Potential Solutions

Periods and Commas

Check out this link on switching between decimals and commas, or vice versa, if needed.

• Set Decimal Separator to .

• Set Thousands Separator to ,

Commas and Semicolons

You may also differ in using commas versus semicolons in equations.

• Note you should Select English (United States) to have your formatting match what is
shown in the course

After making any changes to the settings of Excel, you may need to restart your computer for the
changes to actually go into effect.

Why Spreadsheets?

Spreadsheets started off as a manual way to track things like sales, keeping running totals of
columns, summing across columns, etc. VisiCalc was the first spreadsheet program for
personal computers, released for the Apple II in 1979. Ported to the IBM
PC in 1981, it became one of the first truly popular apps across business functions.

Spreadsheet programs took over a lot of painstaking manual work (especially
calculations) and helped reduce human errors.

Spreadsheets do have limitations, such as having issues with lots of users and very large datasets,
as well as variation between programs in sheet sizes and in how many sheets a file can hold. On the plus
side, they are easy to obtain and use, and are quick at loading, modifying, and visualizing data.

Quiz Question

What benefits do spreadsheets provide? Check all that apply.

A. They're perfect for massively large data sets


B. You can quickly load and visualize data
C. You can add formulas to analyze data
D. They make it easy to manipulate data

Answer: B, C, and D

Spreadsheets make it possible to easily manipulate, analyze, and visualize data. There are some
limitations when using very large data sets.

Getting Started with Spreadsheets


This course demonstrates functions, tools, and techniques using the Microsoft Excel Desktop
version on Windows 10. The latest version at the time of writing is Excel within Microsoft 365,
formerly known as Office 365. If you don't have the latest version of Excel, you can use another
spreadsheet application for the quizzes and exercises. Although functions and formulas behave
similarly across various applications, there can be significant differences in menus and tools,
especially with more advanced tools such as box plots. Sometimes it's easy
to figure these differences out, but when in doubt, get in the habit of searching for your own
answers in help files and on the Internet—there is a ton of information available on spreadsheets!

Excel

Microsoft Office 365 purchasing information

There are several pricing options to choose from. Whatever your circumstance, be sure to
download the desktop version. It can do a few things that the browser version can't. Note that the
Windows and Mac versions of Excel are different, and the Mac version may not have certain
built-in functionalities, such as box plots. Also, the free Office Online version of Excel is limited
in functionality and is not recommended for this class. Once you have your Excel installed on a
PC or Mac, you will need to load the following Add-In:

 Analysis ToolPak.

Google Sheets

Link to Open a Sheet

Google Sheets is a free alternative with full functionality when you enhance it with an add-on.
You'll need a free Google account to get started. If you already have one, the link above will take
you where you need to go. If you don't have a Google account, click "More Options" and create
one. Once you have your account established, open a blank sheet and use the "Add-ons" menu to
add the following to your account:

 Remove Duplicates

Apple Numbers

Information about Apple Numbers

If you have a Mac, this application should already be there. Most features we present are
available in Numbers, and it is compatible with Excel files. There does not seem to be a way to
easily create box plots, however, which we will introduce in the Visualize Data lesson.

Open Source Alternatives

Apache Open Office Download

LibreOffice Download

Both Apache Open Office Calc and LibreOffice Calc are open source and freely available. They
will read Excel format and provide most of the same functionality. There are some important
differences to be aware of, however, including differences in formula
syntax. Some features, such as box plots, may require internet searches for solutions and a
number of steps to implement.

Navigation: Worksheet

Pro Tip: In Excel, you can create a new spreadsheet by clicking File -> New -> Blank
workbook. Other spreadsheet applications typically follow a similar process.

Spreadsheet columns use letters as their labels, A-Z, and then continue by adding an initial
letter and running through the alphabet again (so AA-AZ, then BA-BZ, etc.). Rows are labeled
with numbers. Each cell is addressed based on its column and row, so the cell in column D and
row 6 is D6.

Pro Tip: In Excel, you can add sheets with the "Add Sheets" ("+") button (near the bottom) and
rename sheets with a more specific title. Other spreadsheet applications typically follow a similar
process.

When working with multiple sheets, the tab name also becomes part of the cell name if the cell is
addressed from a different sheet (for example, Sheet2!D6).

The formula bar appears just above the spreadsheet cells, or can be accessed by clicking into a
given cell. We'll talk more on formulas later as well.

Pro Tip: Clicking on the formula bar highlights all the cells being used in the formula.

Quiz Question

What’s the label for the 28th column of a spreadsheet?

A. BB B. AB C. 28 D. A28

Answer: B

Columns are labeled with letters of the English alphabet. After the 26 letters are exhausted,
column labels follow the pattern AA, AB, AC, ...
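The labeling pattern can be sketched in a few lines of Python. This is only an illustration of the rule described above, not something the course uses; the function name column_label is made up for this example.

```python
def column_label(n):
    """Convert a 1-based column number to its letter label (1 -> A, 27 -> AA)."""
    if n < 1:
        raise ValueError("column numbers start at 1")
    label = ""
    while n > 0:
        # Shift to 0-based so that 26 maps to Z instead of rolling over.
        n, remainder = divmod(n - 1, 26)
        label = chr(ord("A") + remainder) + label
    return label


print(column_label(26))  # Z
print(column_label(27))  # AA
print(column_label(28))  # AB
```

The 28th column comes out as AB, which matches the quiz answer.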

Navigation: Menu Bar

The main menu bar contains many functions and features that will be more specific to the
spreadsheet application you are using. Many also allow you to customize which menu commands
are shown.

Pro Tip: In Excel, you can customize menu commands with File -> Options -> Customize
Ribbon.

Pro Tip: In Excel, you can press F1 for the help menu.

File commands are those that operate between the spreadsheet application and your computer
operating system, such as creating a new spreadsheet, or saving or loading an existing one.

The very top of the menu bar contains quick access options such as undo or redo.

Pro Tip: In Excel, you can customize the quick access toolbar with File -> Options -> Quick
Access Toolbar.

The Home menu contains options such as cut, copy and paste, various formatting options like
font changes and data types, and cell operations like insertions or deletions. We'll use these in
later lessons. It also has functionality to find or replace, which may work differently based on the
application.

The Insert Menu has items to create hyperlinks or charts, which we'll use when we discuss
visualizing data. We'll also cover pivot tables later.

The Data Menu has items such as sorting and filtering data.

Quiz: Menu Bar

Quiz Question

What are some examples of Data menu operations? Check all that apply.

A. Sort
B. Filter
C. Chart
D. Text to columns

Answer: All except C

Navigation: Shortcuts

Fill: Copy or continue a pattern of cells by dragging the mouse using a fill handle on the lower
right of a cell. The fill handle typically shows up as a little plus sign when your mouse is in the
right place.

Pro Tip: Open a Context Menu by right-clicking on a cell.

Pro Tip: Many keyboard shortcuts also work within spreadsheets. Undo with Ctrl+Z and Redo
with Ctrl+Y; note that Mac users should use the Cmd button instead of Ctrl.

Quiz Question

In which directions is it possible to “Fill” data? Check all that apply.

A. Up
B. Down
C. Left
D. Right

Answer: all

The “Fill” gesture works in four directions!

Copy Data

Note: Google no longer provides historical data on stocks. However, this information can be
found on Yahoo at:

https://finance.yahoo.com/quote/AAPL/history?p=AAPL

Exercise: Copy Data

If for any reason, you can't access the sites, you can use these Word documents as the
source:

Apple Historical Prices Doc

Wikipedia Spreadsheet webpage Doc

Table Paste

How many columns were created when you pasted the AAPL Historical prices table from the
website? Answer: 7

Correct! The data columns are Date, Open, High, Low, Close, Adj Close, and Volume.

Range Addressing

Range: A group of cells selected or addressed together. It is defined by the cells in the upper left
and lower right corners of the range.

You can select a range of cells by clicking on one cell on a corner of the range and dragging to
the other cells in the range. In the video, we also saw how this can be done in the formula with a
colon, such as B2:C7, which selects all cells in the B and C columns from rows 2 to 7.
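To make the idea of a range concrete, here is a rough Python sketch that lists every cell covered by an upper-left:lower-right address such as B2:C7. The helper names (column_number, column_label, expand_range) are invented for this illustration, and the sketch only handles cell-to-cell ranges, not whole-column or whole-row ranges like E:G or 16:18.

```python
import re


def column_number(label):
    """Convert a column label to its 1-based number (A -> 1, AA -> 27)."""
    n = 0
    for ch in label:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def column_label(n):
    """Convert a 1-based column number back to its letter label."""
    label = ""
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        label = chr(ord("A") + remainder) + label
    return label


def expand_range(address):
    """List every cell in a range written as upper-left:lower-right."""
    match = re.fullmatch(r"([A-Z]+)(\d+):([A-Z]+)(\d+)", address)
    if match is None:
        raise ValueError(f"not a cell-to-cell range: {address}")
    first_col, first_row, last_col, last_row = match.groups()
    cols = range(column_number(first_col), column_number(last_col) + 1)
    rows = range(int(first_row), int(last_row) + 1)
    return [f"{column_label(c)}{r}" for r in rows for c in cols]


cells = expand_range("B2:C7")
print(len(cells))           # 12 cells: 2 columns x 6 rows
print(cells[0], cells[-1])  # B2 C7
```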

Quiz Question

Which of the following is a valid range address? Check all that apply.

A. A1:A2
B. C8:A9
C. AA3:BB5
D. 9F:9G
E. E:G
F. 16:18

Answer: all except B and D

Correct! A1:A2 and AA3:BB5 are valid because they both start with an upper left cell address
and end with a lower right cell address. E:G and 16:18 are also valid… this was a bit of a trick
question. It is possible to define whole columns, such as columns E thru G, and whole rows, such
as rows 16 thru 18, as ranges.

Relative vs Absolute Addressing

Relative Address: The cell locations are relative to the answer cell that references them.
Copying or filling such locations will update the copied/fill location relative to the original (i.e.
dragging down from a cell using A1 would fill in A2 next). This is the default behavior.

Absolute Address: Fixed cell or range address that doesn't change when copied. Adding dollar
signs ($) to an address makes it absolute based on where the dollar signs are placed. A dollar
sign in front of the letter fixes a column, while the dollar sign in front of the number fixes the
row; if both are prefixed with a dollar sign, it is fixed to the exact cell.

Pro Tip: Use the F4 key to quickly toggle from relative to absolute addresses in Excel.
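If it helps to see the dollar-sign rule spelled out, the Python sketch below mimics what happens to a single reference when a formula is filled one row down and one column to the right. It is a simplified model written for these notes (fill_reference and its helpers are hypothetical names), not how any spreadsheet application is actually implemented.

```python
import re


def column_number(label):
    """Convert a column label to its 1-based number (A -> 1, AA -> 27)."""
    n = 0
    for ch in label:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def column_label(n):
    """Convert a 1-based column number back to its letter label."""
    label = ""
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        label = chr(ord("A") + remainder) + label
    return label


def fill_reference(ref, rows_down=0, cols_right=0):
    """Return the reference as it would read after filling by the given offset."""
    match = re.fullmatch(r"(\$?)([A-Z]+)(\$?)(\d+)", ref)
    if match is None:
        raise ValueError(f"not a cell reference: {ref}")
    col_lock, col, row_lock, row = match.groups()
    if not col_lock:  # no $ before the letter: the column is relative
        col = column_label(column_number(col) + cols_right)
    if not row_lock:  # no $ before the number: the row is relative
        row = str(int(row) + rows_down)
    return f"{col_lock}{col}{row_lock}{row}"


for ref in ["A1", "$A$1", "$A1", "A$1"]:
    print(ref, "->", fill_reference(ref, rows_down=1, cols_right=1))
# A1 -> B2
# $A$1 -> $A$1
# $A1 -> $A2
# A$1 -> B$1
```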

Quiz Question

Which of the following are valid cell addresses?

A. A1
B. $A$1
C. $A1
D. A$1

Answer: All

Correct! All of them are valid. In addition to fixing both the column and row as absolute, it is
possible to fix just the column or just the row.

Insert and Delete

The Home Menu and the Context Menu (available through a right-click) provide options to insert
or delete cells, rows, or columns.

Save Data

Always save your work! While Google Sheets will save each change to a cell for
you, most desktop applications, like Excel, require you to manually save.

Pro Tip: Save with Ctrl+S on PC, or Cmd+S on Mac.

The first time you save in Excel, you will have to "Save As", so that you can tell
Excel where to actually save the file. Afterward, it will normally "Save" to the
same location as the previous version.

Quiz Question

Which of the following file types can be usefully read by Excel and most other
spreadsheet applications?

A. Excel Workbook (*.xlsx)


B. Comma Delimited (*.csv)
C. Excel 2003 (*.xls)
D. JSON (*.json)

Answer: A, B, C

Correct! Current and past Excel formats *.xlsx and *.xls, plus comma delimited
files, *.csv, are common types of files that can be read into spreadsheets.
Although *.json files can be read as text files, they will not automatically be
parsed correctly.

Recap

Great work in this lesson! You've learned about:

 What spreadsheet applications do and their benefits


 What some of the universal features of spreadsheet applications are

 Columns, rows and addressing

 How to build and save a simple spreadsheet

Spreadsheets 2: Manipulating data


In this lesson, you'll learn to manipulate data in your spreadsheet, such as:

 Working with cell formulas

 Text data

 Math operations

 Statistical functions

 Performing data operations at the table level

 Duplicating rows

 Splitting columns

 Sorting data

Cell Formulas
Cell Formula: An expression, beginning with an equal sign (=), that defines the operations which
calculate the value for the cell. A formula may be a constant, may contain mathematical operators
or functions, and may reference a single cell, a range of cells, or no cells.

Parameter: A value or expression required by a function to determine a result, as defined by the
function. A parameter can be a constant, cell reference, range reference, other expression, or
another function.

In Excel, you can find different functions in the Formulas Library found on the Formulas tab of
the menu bar.

Quiz Question

Match the following functions with definitions. You may need to use "help"
in your spreadsheet application or look in the Formulas menu to find out more
about these functions.

These are the correct matches.

Definition Function

Converts a text string to upper case letters. UPPER(text)

Returns the logical value TRUE. TRUE()

Returns the number of characters in a text string. LEN(text)

Estimates standard deviation based on a sample (ignores logical values and text). STDEV(n1, n2, ...)

Adds all the numbers in a range of cells. SUM(n1, n2, ...)

Removes all spaces from a text string except single spaces between words. TRIM(text)

SUBSTITUTE
Text String: String of letters, numbers, and punctuation that is not treated
numerically.

While the SUBSTITUTE function sounds similar to find/replace, it is used for
different purposes. Find/replace overwrites the old data, while SUBSTITUTE will
not change the original cell, instead showing the transformed data in a new cell.

SUBSTITUTE uses the syntax SUBSTITUTE({text}, {old_text}, {new_text}),
where {text} is the cell or text to transform, {old_text} is the string sequence to be replaced,
and {new_text} is the new string in place of the old one.
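For readers who know a little Python, the behavior is close to the built-in str.replace: the original text stays as it is and the result lands somewhere new. The sketch below is only an analogy, with a made-up cell value, and it ignores the optional fourth argument that SUBSTITUTE accepts in Excel.

```python
def substitute(text, old_text, new_text):
    """Return a copy of text with every old_text replaced by new_text."""
    return text.replace(old_text, new_text)


original = "2019-01-15"  # hypothetical contents of cell A2
result = substitute(original, "-", "/")  # like =SUBSTITUTE(A2, "-", "/") in B2

print(original)  # 2019-01-15 (the source value is untouched)
print(result)    # 2019/01/15
```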

Pro Tip: Enter a formula by typing directly into a selected cell, beginning with "=".
This way is more direct and faster than the formula bar!

Quiz Question

Assume that you must write a formula for the B2 cell to get the result shown.
Choose all valid formulas for the B2 cell from the list below.
A. =SUBSTITUTE(A2,B5,B4)
B. =SUBSTITUTE(A2,B4,B5)
C. =SUBSTITUTE($A2,$B4,$B5)
D. =SUBSTITUTE($A2,B5,B4)
E. =SUBSTITUTE(A2,$B$4,$B$5)

Answer: B, C, E

Extract Text
FIND and LEFT can be used to extract text. FIND can be given a substring and a cell
to return the position in a string where the substring was found. LEFT can then be
used to extract a certain number of characters from a cell, starting from the left
side.

RIGHT therefore extracts from the right side, while MID can extract from some
starting point in the middle of a cell.

Pro Tip: To display the formula of one cell in another, use the FORMULATEXT function.

Exercise: Extract Text

In the exercise below, you'll combine what you've learned about the FIND and MID functions to
extract the first word found after the first occurrence of the word "data" in a series of sentences.
For example, in the following sentence, your target word is "is":

Some programming languages use 0-based indexing. However, Excel uses 1-based indexing. This
means that counting begins at the value of 1. Therefore, to find the index for the word data, we
would count as shown in the image below.

Then for the extraction for this quiz, since this extraction is in the middle of a
text string, you will use the function MID with the following syntax:

MID(text, start_num, num_chars)

text

text string containing the characters you want to extract

start_num

position of the first character you want to extract in text

num_chars

the number of characters you want MID to return from text

Think about how this can be solved. Before you can use the MID function, you
first need to FIND the word after data and the number of characters that are in
it. To figure out the number of characters in the target word, you will need
to FIND the space after it as well. The size of the target word will be equal to
the position of the space after it minus the position of the start of the target
word.

The exercise will step you through intermediate formulas to get to the final answer. As you
figure out the correct formula in the top cell in each column, you will be able to "fill" (or copy)
the cells down the column for all the cases you are given.

As a reminder, here is the syntax for using the FIND function:

FIND(find_text, within_text, [start_num])

find_text

the text you want to find

within_text

the text containing the text you want to find

start_num

OPTIONAL - character position at which to start the search
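The FIND-then-MID approach can be sketched in Python as a way to check your reasoning before building the spreadsheet formulas. Everything here is illustrative: find and mid are small stand-ins that use 1-based positions like the spreadsheet functions, word_after is a made-up helper, and the sample sentence is invented because the course's example image is not reproduced in these notes. The sketch also assumes the target word is followed by a space.

```python
def find(find_text, within_text, start_num=1):
    """1-based position of find_text in within_text, searching from start_num."""
    index = within_text.find(find_text, start_num - 1)
    if index == -1:
        raise ValueError(f"{find_text!r} not found")
    return index + 1


def mid(text, start_num, num_chars):
    """num_chars characters of text, starting at 1-based position start_num."""
    return text[start_num - 1:start_num - 1 + num_chars]


def word_after(sentence, keyword="data"):
    """Extract the word that follows the first occurrence of keyword."""
    # Skip over the keyword itself and the single space after it.
    start = find(keyword, sentence) + len(keyword) + 1
    # The target word ends just before the next space.
    end = find(" ", sentence, start)
    return mid(sentence, start, end - start)


sentence = "Clean data is easier to analyze."  # hypothetical example
print(find("data", sentence))  # 7
print(word_after(sentence))    # is
```

The length of the target word is the position of the space after it minus the position where the word starts, exactly as described above.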

Reformat Text

CONCATENATE will join together two or more strings. It's important to note that this will not
automatically add spaces between them, so make sure to add spaces as formula parameters if you
need them.

TRIM will help to remove excess whitespace from a string.

PROPER sets the first letter of each word to upper case, with the rest lowercase.

UPPER sets all letters to upper case, while LOWER sets all letters to lowercase.
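As a rough comparison, here is how the same text clean-up could look in Python. The functions concatenate and trim are small stand-ins written for this note, and str.title() is only an approximation of PROPER (the two can differ on text containing apostrophes or digits).

```python
def concatenate(*parts):
    """Join strings end to end; no spaces are added automatically."""
    return "".join(parts)


def trim(text):
    """Drop leading/trailing spaces and collapse runs of spaces to one."""
    return " ".join(part for part in text.split(" ") if part)


phrase = "It is a capital mistake to theorize before one has data"

print(concatenate("Sherlock", "Holmes"))       # SherlockHolmes
print(concatenate("Sherlock", " ", "Holmes"))  # Sherlock Holmes
print(trim("  too   many    spaces  "))        # too many spaces
print(phrase.title())  # similar to PROPER
print(phrase.upper())  # similar to UPPER
print(phrase.lower())  # similar to LOWER
```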

Quiz: PROPER, UPPER, LOWER

Quiz Question

Match the following functions to their result when applied to this phrase:

"It is a capital mistake to theorize before one has data"

- Sherlock Holmes

These are the correct matches.

Phrase Function applied

"It Is A Capital Mistake To Theorize Before One Has Data" - Sherlock Holmes PROPER

"IT IS A CAPITAL MISTAKE TO THEORIZE BEFORE ONE HAS DATA" - SHERLOCK HOLMES UPPER

"it is a capital mistake to theorize before one has data" - sherlock holmes LOWER

Math Functions
Math operations are one of the most common uses of spreadsheets. The operators work much as
one might expect (inside a formula with a leading equals sign):

 + for addition

 - for subtraction

 * for multiplication

 / for division

There are also the functions SUM and AVERAGE, which behave as their names suggest -
summing or averaging two or more cells, numbers or a range of cells.
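A small Python sketch of the same idea, using invented order data (these are not the numbers from the exercise): each row total plays the role of a formula like =SUM(B7:F7), and the final line plays the role of AVERAGE over the totals column.

```python
# Hypothetical data: each inner list is one order, one number per fruit type.
orders = [
    [3, 5, 2, 4, 1],
    [0, 2, 6, 1, 3],
    [4, 4, 4, 1, 2],
]

# Total pieces per order, like a SUM across each row.
totals = [sum(order) for order in orders]

# Average pieces per order, like AVERAGE over the totals column.
average = sum(totals) / len(totals)

print(totals)   # [15, 12, 15]
print(average)  # 14.0
```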

Exercise: Math Functions


In the exercise below, you'll use SUM and AVERAGE to find the total number of fruit pieces
per order and the average number of fruit pieces per order. The data set is very similar to the
classroom example, so refer back to the demonstration as needed if you run into problems.

The following list has a series of steps for this exercise. As you complete each step, check it off
the list. The quizzes in the task list can be found below.

What is the formula you used in G7? (Include the equal sign in your answer.) Answer: =SUM(B7:F7)

AVERAGE

What is the average number of pieces per order? Answer: 13.8

Duplicate Rows
Clean Data: Data that is free of corrupt or inaccurate data items.

In Excel, under the Data tab, you can use the "Remove Duplicates" feature to remove duplicated
values.
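To see what removing duplicates amounts to, here is a minimal Python sketch that keeps the first copy of each row and drops later repeats. The rows are made up, and the assumption that the first occurrence is the one kept is stated here for the sketch only; check your application's help for its exact behavior.

```python
def remove_duplicates(rows):
    """Keep the first occurrence of each row and drop later repeats."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical rows: city and country.
rows = [
    ["Tokyo", "Japan"],
    ["Paris", "France"],
    ["Tokyo", "Japan"],
    ["Lima", "Peru"],
    ["Paris", "France"],
]

cleaned = remove_duplicates(rows)
print(len(rows) - len(cleaned))  # 2 duplicates removed
print(cleaned)
```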

Removing Duplicate Rows with Google Sheets

If you are using Google Sheets, you can use Data cleanup.

Select Data > Data cleanup > Remove duplicates.

Next, select Data has header row and choose the columns you want analyzed.

Using the "Remove Duplicates" add-on


Exercise: Duplicate Rows

In this exercise, you'll open a data file and remove the duplicate rows.

Your data file is in comma-separated values format, or CSV format. The
file extension is .csv and the file can be read as plain text, unlike .xlsx and .xls formats. In a CSV
file, the first row may be column headers separated by commas, while all later rows are the data
rows. Each value that corresponds to a column is separated by a comma.
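Because CSV is plain text, it is easy to read outside a spreadsheet too. The sketch below uses Python's standard csv module on a tiny made-up sample (the column names and values are invented, not taken from the exercise file) to show the header row being separated from the data rows.

```python
import csv
import io

# Hypothetical CSV text: a header row followed by data rows.
csv_text = "city,country,visits\nTokyo,Japan,12\nParis,France,7\n"

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)  # first row holds the column names
rows = list(reader)    # every later row is data

print(header)  # ['city', 'country', 'visits']
print(rows)    # [['Tokyo', 'Japan', '12'], ['Paris', 'France', '7']]
```

Note that every value comes back as text; a spreadsheet application converts number-like values for you when it opens the file.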

Spreadsheet applications are designed to easily open this type of file, and it is often used for
storing tabular data. For example, the following text can be seen in the exercise CSV file
named worldcities.csv by looking at it with a plain text editor such as Microsoft Notepad
on Windows or TextEdit on Apple Mac:

If you open the same file with a spreadsheet application such as Excel, the application will
automatically separate the columns for you:

Separated Data

How many duplicates?


How many duplicates did you remove? 72

Split Columns
Splitting data is useful when you have data such as first and last names, or City and State,
separated by a delimiter (such as a comma). In Excel, this can be done by adding a column to the
right of the one you want to split, and then in the Data tab, selecting "Text to Columns".

Make sure you add the extra column first; otherwise, existing data in the next column may be
overwritten.
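The effect of splitting a column on a delimiter can be sketched in Python as follows. The function name split_column and the sample values are invented for illustration.

```python
def split_column(values, delimiter=","):
    """Split each value on the delimiter and strip surrounding spaces."""
    return [[part.strip() for part in value.split(delimiter)] for value in values]


# Hypothetical "City, State" column.
locations = ["Austin, TX", "Denver, CO", "Boise, ID"]

for city, state in split_column(locations):
    print(city, "|", state)
# Austin | TX
# Denver | CO
# Boise | ID
```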

Splitting Columns in Google Sheets

If you are working in Google Sheets, click Data > Split text to columns.

If you want to override the default separator, click in the Separator box. You can choose from
comma, semicolon, period, or space -- or you can select Custom to add your own separator.

Sort Data

Sorting is a very useful feature of spreadsheet applications. Select all your data, then on the Data
tab, click the "Sort" button. You can also choose which column to sort on, or even multiple
columns.

Sorting with Google Sheets

If you are using Google Sheets, the process is similar.

1. Select the range you want to sort.

2. Click Data > Sort range > Advanced range sorting options.

3. Select Data has header row.

4. Select sort columns from the drop down.

You can select Add another sort column to add another sort level.
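Sorting on more than one column means the second column only breaks ties in the first. A short Python sketch with made-up rows shows the idea: sort by country from A to Z, then by orders from largest to smallest within each country.

```python
# Hypothetical rows: (city, country, orders).
rows = [
    ("Lyon", "France", 3),
    ("Osaka", "Japan", 5),
    ("Paris", "France", 9),
    ("Tokyo", "Japan", 8),
]

# First sort level: country ascending. Second level: orders descending.
sorted_rows = sorted(rows, key=lambda row: (row[1], -row[2]))

for row in sorted_rows:
    print(row)
# ('Paris', 'France', 9)
# ('Lyon', 'France', 3)
# ('Tokyo', 'Japan', 8)
# ('Osaka', 'Japan', 5)
```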

Filter Data

Filter: A method to group data by selecting characteristics of one or more columns of a data set.

The filter method is used by clicking on the filter button, which has a little filter as its icon. You
can then select which items you want to filter down to. If you want to use a filter on
multiple columns, make sure you clear any old column filters that are no longer needed.
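Filtering keeps only the rows that match your selection, and filters on several columns stack on top of each other. The Python sketch below, using invented rows, shows why a forgotten old filter can hide rows you expected to see.

```python
# Hypothetical rows: (city, country, orders).
rows = [
    ("Lyon", "France", 3),
    ("Osaka", "Japan", 5),
    ("Paris", "France", 9),
    ("Tokyo", "Japan", 8),
]

# Filter on one column: country is Japan.
japan_only = [row for row in rows if row[1] == "Japan"]

# Add a second column filter on top: orders above 6.
japan_busy = [row for row in japan_only if row[2] > 6]

print(japan_only)  # [('Osaka', 'Japan', 5), ('Tokyo', 'Japan', 8)]
print(japan_busy)  # [('Tokyo', 'Japan', 8)]
```

Paris also has more than 6 orders, but it stays hidden until the country filter is cleared.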

Spreadsheets 4: Visualize Data

Pie Charts

Illustrating proportionality

A pie chart is used to illustrate proportionality. Think of it as slicing the pie into pieces, where
each piece matches a percentage of the whole list.

In spreadsheets, this is really easy, because all we need is a list of the categories and matching
values such as sums or counts.
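Behind the scenes, a pie chart just turns each category's value into a share of the total. A quick Python sketch with invented position counts:

```python
# Hypothetical counts of players per position.
counts = {"Pitcher": 12, "Catcher": 3, "Infielder": 6, "Outfielder": 4}

total = sum(counts.values())

# Each slice is the category's percentage of the whole.
shares = {position: 100 * count / total for position, count in counts.items()}

for position, share in shares.items():
    print(f"{position}: {share:.1f}%")
# Pitcher: 48.0%
# Catcher: 12.0%
# Infielder: 24.0%
# Outfielder: 16.0%
```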

When the chart is selected, a design and format menu is available on the Excel ribbon at the top
of the page. The design menu gives numerous chart options and choices, such as specific
coloring or displaying percentages. You can also change the chart title.

Pivot table

Using the pivot table we created earlier, with some careful selection, we want to highlight the
position categories in the top row and the totals in the bottom row. To do this:

1. Select the categories at the top of the table.

2. While holding down the control key on Windows or the Command key on Apple
keyboards, select the bottom row with your mouse.

3. Copy this highlighted data to another location.

4. Paste using the transpose feature (paste-transpose) so it creates columns instead of rows.

5. Select the pasted data and choose "insert pie chart" as before.
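If "transpose" is unfamiliar, this short Python sketch shows what the paste-transpose step does to the copied data: two rows become two columns. The category names and totals are made up.

```python
# Hypothetical copied data: a row of categories and a row of totals.
copied = [
    ["Pitcher", "Catcher", "Infielder"],
    [12, 3, 6],
]

# Transposing swaps rows and columns.
transposed = [list(column) for column in zip(*copied)]

for row in transposed:
    print(row)
# ['Pitcher', 12]
# ['Catcher', 3]
# ['Infielder', 6]
```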

Bar Charts

We could use the same information as before and choose a bar or column chart instead of a pie
chart. Instead of percentages, it would just show the values with longer bars or columns
representing larger values.

Bar charts do not show percentages for each category

In the chart above, we're comparing the category values against each other and we see their
relative sizes. However, we do not have much sense of the whole league or the percentage of
each category as we did with the pie charts.

Choosing which kind of chart to use really depends on what patterns you want to highlight and
what questions you want to answer.

Bar or pie chart?

 Use bar or column charts to compare category values with each other.

 Use a pie chart to show the proportionality of categories.

Scatter and Line Plots


Line Charts vs. Pie Charts

We use pie and bar charts to visualize categorical data. If we have a list of numerical data, such
as the list of stock prices over time, a line chart gives us a better picture of the data set.

Simple line charts

Using a table of data downloaded from a financial website, listing prices for AAPL stock, we can
explore line charts:

1. Notice that it has columns for date, open, high, low, close, and volume.

2. Select the date column and the close column.

3. Go to the insert menu and select a line chart.

4. Move the chart to its own sheet to see the detail better.

5. Choose a quick style fix in the design menu.

6. Change the title to AAPL Stock Price.

7. Verify that the horizontal axis shows the dates, and the vertical axis shows dollar values.

8. Observe that over the past year the stock has gone up with a little hiccup about a month
ago.

Multiple columns of data

To handle more than one column of data for the same dates, note that a line for each column can
be shown on the same chart:

1. Select the date plus the high and low values for AAPL stock.

2. Insert a line chart as before.

3. Select the vertical axis, right-click, and choose format axis.

4. Since the high and low aren't all that far apart, change the range for the dollar amount on
the left to start at 100.

5. Observe both the high and low-value lines now and see the spreads between them.

Scatterplot

1. To plot two different variables, closing price and volume, for AAPL stock, choose the
scatterplot.

2. Observe a graph with the closing price on the horizontal axis and the volume of trade that
day on the vertical axis.

3. Observe that the prices seem to cluster in a couple of areas, and that they have about the
same volume generally, though there are some high volume days at the lower price.

Quiz Question

Match the question posed to the type of chart to use (choose the best answer). We haven't talked
much about some of these yet. Just do your best.

These are the correct matches.

Question posed Chart

What are the relative percentages of different fruits sold this month? Pie

How does the number of apples, oranges, and pears sold this month compare to each other? Bar

How has the price for AAPL stock changed over time? Line

Is there a relationship I can see between weight and age in a population? Scatter

What is the frequency of salaries by millions across all major league baseball players? Histogram

What is the distribution of my numerical dataset from minimum to maximum, including the 1st,
2nd, and 3rd quartiles? Box Plot

Exercise: Scatter Plots


Scatter plots are useful for displaying bivariate numerical data. This means
a data set with two variables, such as height and weight measurements for a list of human beings.

If the data of both variables move up together, they have a positive correlation,
and this can be seen in the scatter plots, such as in the following plot of human height and
weight data. We can see that generally, as height increases, so does weight.
The line shown is the trend line, which can be added in Excel by selecting the scatter chart,
then Design > Add Chart Element > Trendline > Linear.

If one variable increases as the other decreases, the two variables have a negative correlation, as
in the following plot of depth vs velocity in the Columbia River:

Adding a trendline in Google Sheets

In Google Sheets, you can create a trendline by clicking the three-dot menu in
the chart to open the Chart Editor. Then click Customize > Series. Check the Trendline box and
select the type in the dropdown menu.
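The positive and negative correlation described in this exercise can also be checked with a number rather than by eye. The Python sketch below computes the Pearson correlation coefficient by hand; the measurements are invented for illustration and are not the data behind the course's plots.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical data: weight tends to rise with height.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 77, 88]

# Hypothetical data: velocity tends to fall as depth grows.
depths = [1, 2, 3, 4, 5]
velocities = [5.0, 4.1, 3.3, 2.2, 1.4]

print(pearson(heights, weights) > 0)    # True: positive correlation
print(pearson(depths, velocities) < 0)  # True: negative correlation
```

A value near 1 means the points hug an upward trend line, a value near -1 means they hug a downward one, and a value near 0 means no clear linear pattern.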

Chart Layout Tools


Note: If you are using Google Sheets, after you have inserted a chart, you can click on the
"Customize" tab on the top right of the Chart Editor menu (which will appear when you insert
the chart on the right side of the spreadsheet window). You can then use the different dropdowns,
like "Chart Style", to change the look of your chart.

Chart Layout Options in Excel

So far, we've used the Chart Design tab that appears when you select a chart to choose some pre-
made formatted styles.

In the top left corner of the design menu is the Quick Layout and the Add Chart Element menu
items:
 Quick Layout menu is a set of pre-made layouts that affect which chart elements are
included and where, like the axis labels and the legend. You can select various ones and
see the effect.

 Add Chart Element menu. With this menu, it's possible to add and remove elements of all
types from the chart. There is a lot here, and rather than try to learn all of it, we'll just use
some parts as we need them. It's good to know where to find it though, for those times
when you know what you want to change, and all those pre-made formats and layouts
just don't have what you need.

Quiz: Chart Layout


Spreadsheet charts have standard elements that can be added or removed by selecting a chart
using the Design menu, then Quick Layout and Add Chart Element sub-menus. The chart
below has several of these elements. Can you identify them by name?

Quiz Question

Match the chart element description with its location on the sample chart above.

These are the correct matches.


Chart element Location

Chart Title C

Horizontal Axis F

Vertical Axis None

Trend line B

Legend D

Vertical Axis Title A

Grid Lines E

Histograms
A Histogram is a column chart that measures the frequency of data in a data set and specifically
groups numerical values into bins we define.

Column Charts vs Histogram

 Recall that we previously created a column chart to compare counts of categories within
a data set. This kind of chart answers a question like: how many players are there in each
playing position in the league?

 But what if we want to ask the question: how many players made under $1 million in
salary, and between $1 and $2 million, and between $2 and $3 million in salary? This
kind of chart is called a histogram, and the groupings we choose such as, 1) all salaries
between $1 and $2 million, and 2) salaries between $2 and $3 million, are the bins.

There are two ways to do this in Excel.

Analysis ToolPak add-in:

1. We'll start with a method that works on both Windows and Mac using the histogram tool
in the Analysis ToolPak add-in. Instructions for loading the Analysis ToolPak add-in are
given in the Getting Started instructions.

2. To create the histogram:

a. Choose data analysis from the data menu on Windows or from the tools menu on
Mac. Choose histogram, which opens a dialog.

b. For the input range, select the data from the salaries column.
c. For the bin range, select the bin intervals you've created.

d. If you have a label at the top of your columns, click labels.

e. For the output options, select new worksheet and chart output.

f. Press OK.

Insert chart

 Available in Microsoft Excel. The ToolPak histogram requires two columns of data: one
for the data you want to analyze, and one for bin levels that represent the intervals for the
bins. In the video example, the bin levels start at $1 million, then $2 million, et cetera, up to $15
million. In the resulting histogram, the number of values in the salaries list that are
below $1 million will be in the first bin. The number of salaries between $1 and $2
million will be in the second bin, and so on.

 To create the histogram:

a. Select your data and click insert, recommended charts, and choose the histogram
chart.

b. To configure details about the bins, right-click the horizontal axis of the chart,
click format axis and then click axis options.

c. The dialog provides options for choosing categorical data like the player positions
or automatic for numerical data. You can specify the number of bins that you
might choose to experiment with a bit. Note: If you choose bins that are too
narrow, the result can be noisy. On the other hand, too few bins will hide details.

d. As with other charts, the design and layout can be further customized from the
design menu when the chart is selected.
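To get a feel for how bin levels turn a list of values into histogram counts, here is a small Python sketch. The salaries are made up, and the sketch assumes each bin level is an inclusive upper bound with one extra bin for everything above the last level; check your application's help if the exact boundary rule matters.

```python
from bisect import bisect_left


def histogram_counts(values, bin_levels):
    """Count values per bin; bin_levels are sorted upper bounds (inclusive)."""
    counts = [0] * (len(bin_levels) + 1)  # last slot is the "more" bin
    for value in values:
        # Index of the first bin level that is >= value.
        counts[bisect_left(bin_levels, value)] += 1
    return counts


# Hypothetical salaries in millions of dollars.
salaries = [0.5, 0.9, 1.0, 1.5, 2.7, 3.2, 0.6]
bin_levels = [1, 2, 3]

print(histogram_counts(salaries, bin_levels))  # [4, 1, 1, 1]
```

Try changing bin_levels to see the trade-off mentioned above: many narrow bins give a noisy picture, while a few wide bins hide detail.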

Box Plots
A box plot, which in our case is really a box and whisker plot, is the visualization of statistical
spread in a data set of values.

The five-number summary

A traditional box plot is built using the five-number summary. The five-number summary
consists of five values:

 maximum

 minimum
 1st quartile

 2nd quartile, aka 'median'

 3rd quartile

Box plot description

When we make a box and whisker plot:

 Maximum becomes the tip of the upper whisker.

 Minimum becomes the tip of the lower whisker.

 The box represents the middle half of the data with a line where the median is.

Note: Excel will give us a bonus of six numbers in the summary by placing an X at the mean or
average value of the set.
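The numbers behind a box plot can be worked out with Python's standard statistics module, as sketched below on a made-up list. Quartiles can be calculated in more than one way; the "inclusive" method used here is an assumption for this example and may not match the exact rule your spreadsheet application uses.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, 1st quartile, median, 3rd quartile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical values

print(five_number_summary(data))  # (1, 3.0, 5.0, 7.0, 9)
print(statistics.mean(data))      # 5, the value Excel would mark with an X
```

The box spans the 1st to 3rd quartile (3 to 7 here), the line inside it sits at the median, and the whiskers reach out to the minimum and maximum.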

Creating a box plot in Microsoft Excel is as easy as any other chart:

1. Select the appropriate columns of data.

2. Click Insert, then Recommended Charts.

3. Click the Box and Whisker chart. Remember that a box plot represents statistics for a
single list of numbers. So, each list you select will be represented by its own box plot.

4. Observe that the box plot visually gives a sense of the spread of the value list.

5. Adjust the range so that you can see the plots a little better, if needed.

6. Give the chart a title.

Professional Presentations
Excel and other spreadsheet applications can do a nice job of creating attractive tables and charts.
It's up to us to make them look their absolute best, though. They should:

 Be readable

 Be interesting

 Show the required information

 Not show extra information that doesn't matter

When you present data:

 Use fonts and layouts appropriately to include and emphasize the elements that matter
and exclude elements that don't.

 Ask some questions:

 Who is your audience?

 Is this a quick overview presentation on a slide with the data backup elsewhere, or
is it a written technical review where more in-depth data should be presented?

Improving charts

For categorical column data, like the counts of the baseball positions, the relative sizes are easier
to see at a glance if the data is sorted by size. This would not be something we would want to do
for sequential data like daily stock values or histogram bins, but it is more readable in this case.
There should be a prominent and descriptive title.

Pie chart

Since there is only one data series and the labels are already showing up in this pie
chart, an extra legend is just noise, so we remove it.

For the AAPL graph below:


 Do you need or want grid lines? In a detailed technical presentation, it is easier to see the
values if there are grid lines, but for quick overviews that are just emphasizing the trend,
those extra lines look busy.

 What about the axis labels? I hate it when I look at a graph and can't tell what the
numbers on the axis represent, so I would generally say put them there, but there are
always exceptions. If the units are redundant, somehow, they can be removed.

 The font size is a judgment call, but bigger is easier to read and draws attention.

Color
Keep in mind that your chart may be printed in grayscale or viewed by someone who's
colorblind. So if it's possible to distinguish groupings in additional ways, such as different shapes
in scatter plots or dashes in lines, it's a best practice to do so. The format data series
dialog has a number of options for changing the look of your graph.

This was just a sampling of some of the modifications you can make to your charts to give them
the impact you want in a professional presentation.

The best way to learn is to just try things, to explore.
