Fundamentals of Data Science2
In this video, two data types are introduced: Quantitative and Categorical.
Quantitative data takes on numeric values that allow us to perform mathematical operations
(like the number of dogs).
Categorical data is used to label a group or set of items (like dog breeds: Collies, Labs,
Poodles, etc.)
Question 1 of 2
Age Quantitative
Income Quantitative
Height Quantitative
Question 2 of 2
Temperature Quantitative
We can divide categorical data further into two types: Ordinal and Nominal.
Categorical Ordinal data take on a ranked ordering (like a ranked interaction on a scale
from Very Poor to Very Good with the dogs).
Categorical Nominal data do not have an order or ranking (like the breeds of the dog).
Continuous data can be split into smaller and smaller units, and still a smaller unit exists. An
example of this is the age of the dog - we can measure the units of the age in years, months, days,
hours, seconds, but there are still smaller units that could be associated with the age.
Discrete data only takes on countable values. The number of dogs we interact with is an example
of a discrete data type.
Summary of Video
The table below summarizes our data types. To expand on the information in the table, you can
look through the text that follows.
Data Types
Quantitative: Continuous, Discrete
Categorical: Ordinal, Nominal
Below is a little more detail of the information shared in the above table.
Another Look
To break down our data types, there are two main blocks:
You should have now mastered what types of data in the world around us fall into each of these
four buckets: Discrete, Continuous, Nominal, and Ordinal. In the next sections, we will work
through the numeric summaries that relate specifically to quantitative variables.
Some of these can be a bit tricky - notice even though zip codes are a number, they aren’t really
a quantitative variable. If we add two zip codes together, we do not obtain any useful information
from this new value. Therefore, this is a categorical variable.
Height, Age, the Number of Pages in a Book, and Annual Income all take on values that we
can add, subtract and perform other operations with to gain useful insight. Hence, these
are quantitative.
Gender, Letter Grade, Breakfast Type, Marital Status, and Zip Code can be thought of as
labels for a group of items or individuals. Hence, these are categorical.
To consider if we have continuous or discrete data, we should see if we can split our data into
smaller and smaller units. Consider time - we could measure an event in years, months, days,
hours, minutes, or seconds, and even at seconds we know there are smaller units we could
measure time in. Therefore, we know this data type is continuous. Height, age, and income are
all examples of continuous data. Alternatively, the number of pages in a book, dogs I count
outside a coffee shop, or trees in a yard are discrete data. We would not want to split our dogs
in half.
Ordinal vs. Nominal
In looking at categorical variables, we found that Gender, Marital Status, Zip Code, and
Breakfast Type are nominal variables: there is no order or ranking associated with this type of
data. Whether you ate cereal, toast, eggs, or only coffee for breakfast, there is no rank
ordering associated with your breakfast.
Alternatively, Letter Grade and Survey Ratings have a rank ordering associated with them, which
makes them ordinal data. If you receive an A, this is higher than an A-. An A- is ranked higher
than a B+, and so on. Ordinal variables frequently occur on rating scales from very poor to very
good. In many cases, we turn these ordinal variables into numbers so that we can more easily
analyze them, but more on this later!
Final Words
In this section, we looked at the different data types we might work with in the world around us.
When we work with data in the real world, it might not be very clean - sometimes there are typos
or missing values. When this is the case, simply having some expertise regarding the data and
knowing the data type can assist in our ability to ‘clean’ this data. Understanding data types can
also assist in our ability to build visuals to best explain the data. But more on this very soon!
In the next lessons, you will learn how to use statistics to describe quantitative data. You'll gain
insight into the process of how data is collected and how to answer questions using your data.
Throughout this lesson, you will learn to be critical of the analysis that happens "under the hood"
and understand what the numbers actually mean.
As an example of the analysis we do here at Udacity, we look at how long students take to
complete one of our courses or programs. We try to provide an estimate of the number of hours
or months that students will spend. One way to start is to report the average amount of time it
takes to complete a course. But that doesn't tell the whole story because there will be differences
in time spent depending on what students knew before beginning the course.
The shortest time might be just a few weeks and the longest might be a couple of years. What
proportion of students finishes within two months and what proportion takes longer than eight
months?
Using a variety of measures gives a fuller picture. Measures of center give you an idea of the
average student. Measures of spread give you an idea of how students differ. Visuals provide a
more complete picture of how long it takes any student to complete a course or program.
1. Measures of Center
2. Measures of Spread
3. The Shape of the Distribution
4. Outliers
Analyzing Categorical Data
Analyzing categorical data has fewer parts to consider. Categorical data is analyzed usually by
looking at the counts or proportion of individuals that fall into each group. For example, if we
were looking at the breeds of the dogs, we would care about how many dogs are of each breed,
or what proportion of dogs are of each breed type.
Measures of Center
1. Mean
2. Median
3. Mode
A. The Mean
In this video, we focused on the calculation of the mean. The mean is often called the average or
the expected value in mathematics. We calculate the mean by adding all of our values together
and dividing by the number of values in our dataset.
The remaining measures of the median and mode will be discussed in detail in the upcoming
quizzes and videos.
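The calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name `mean` and the sample values are assumptions, not something shown in the video.

```python
def mean(values):
    """Add all the values together and divide by how many values there are."""
    if not values:
        raise ValueError("the mean of an empty dataset is undefined")
    return sum(values) / len(values)


# Illustrative values only (not taken from the video).
pages_read = [12, 7, 3, 10, 8]
print(mean(pages_read))  # (12 + 7 + 3 + 10 + 8) / 5 = 8.0
```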
B. The Median
The median splits our data so that 50% of our values are lower and 50% are higher. We found in
this video that how we calculate the median depends on whether we have an even or an odd number
of observations.
If we have an odd number of observations, the median is simply the number in the direct
middle. For example, if we have 7 observations, the median is the fourth value when our
numbers are ordered from smallest to largest. If we have 9 observations, the median is the fifth
value.
If we have an even number of observations, the median is the average of the two values in the
middle. For example, if we have 8 observations, the median is the average of the fourth and
fifth values.
Whether we use the mean or median to describe a dataset is largely dependent on the shape of
our dataset and if there are any outliers. We will talk about this in just a bit!
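As a rough sketch of the median calculation, the Python function below sorts the values and then picks the middle one. For an even number of observations it averages the two middle values, which is the usual convention. The function name and the sample data are illustrative assumptions.

```python
def median(values):
    """Return the middle value of the data once it is ordered."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty dataset is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd count: the value in the direct middle.
        return ordered[middle]
    # Even count: the average of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


# Illustrative data: 7 observations, so the median is the fourth ordered value.
print(median([5, 8, 3, 2, 1, 3, 10]))  # ordered: 1, 2, 3, 3, 5, 8, 10 -> 3
print(median([1, 2, 3, 4]))            # (2 + 3) / 2 = 2.5
```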
Question 1 of 2
A) 7 B) 9.5 C) 15 D) 8 E) 7.5
Question 2 of 2
A) 7 B) 9.56 C) 15 D) 8 E) 7.5
C. The Mode
The mode is the most frequently observed value in our dataset.
No Mode
If all observations in our dataset are observed with the same frequency, there is no mode. If we
have the dataset:
1, 1, 2, 2, 3, 3, 4, 4
There is no mode because all observations occur the same number of times.
Many Modes
If two (or more) values share the highest frequency, then there is more than one mode. If we
have the dataset:
1, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 8, 9
There are two modes, 3 and 6, because these values share the maximum frequency of 3 times,
while all other values appear only once.
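The "no mode" and "many modes" cases can be sketched in Python as follows. The function name `modes` and the choice to return an empty list when there is no mode are assumptions made for this illustration.

```python
from collections import Counter


def modes(values):
    """Return every value that shares the highest frequency, or [] if there is no mode."""
    if not values:
        raise ValueError("the mode of an empty dataset is undefined")
    counts = Counter(values)
    highest = max(counts.values())
    # If several distinct values all occur equally often, there is no mode.
    if len(counts) > 1 and all(count == highest for count in counts.values()):
        return []
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([1, 1, 2, 2, 3, 3, 4, 4]))                  # [] (no mode)
print(modes([1, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 8, 9]))  # [3, 6] (two modes)
```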
Quiz: Measures of Center (Mode)
Question 1 of 5
We want to summarize the number of dogs our friends have into a single number. We will
use the measures of center for this problem. Ashley has 1 dog, Steve has 1 dog, Jeff has 2
dogs, Kylie has 3 dogs, and Lisa has 8 dogs.
There is no measure of center that is always best, so we need to try all three to see what
makes sense in this situation. 1, 1, 2, 3, 8
What is the mean, median, and mode for the number of dogs our friends have?
Question 2 of 5
Check all of the below that are true with regards to our measures of center.
The mode is the middle number in the dataset when the numbers are rank ordered.
The median is the middle number in the dataset when the numbers are rank ordered.
The mean is always the best measure of center for any dataset.
The median is always the best measure of center for any dataset.
The mode is always the best measure of center for any dataset.
Answer B
Question 3 of 5
A) 7 B) 9.56 C) 9 D) 15 E) 5 Answer D
Question 4 of 5
For the dataset below match the correct measure to the value:
Question 5 of 5
4. What is Notation?
You likely already know some notation. Plus, minus, multiply, division, and equal signs all have
mathematical symbols that you are likely familiar with. Each of these symbols replaces an idea
for how numbers interact with one another. In the coming concepts, you will be introduced to
some additional ideas related to notation. Though you will not need to use notation to complete
the project, it does have the following properties:
1. Understanding how to correctly use notation makes you seem really smart. Knowing how
to read and write in notation is like learning a new language. A language that is used to
convey ideas associated with mathematics.
3. It makes ideas that are hard to say in words easier to convey. Sometimes we just don't
have the right words to say. For those situations, I prefer to use notation to convey the
message. Similar to the way an emoji or meme might convey a feeling better than words,
the notation can convey an idea better than words. Usually, those ideas are related to
mathematics, but I am not here to stifle your creativity.
Supporting Materials
Random Variables
There is a lot going on in this video - here is a recap of the big ideas.
Rows and Columns
If you aren't familiar with spreadsheets, this will be covered in detail in future lessons.
Spreadsheets are a common way to hold data. They are composed of rows and columns. Rows
run horizontally, while columns run vertically. Each column in a spreadsheet commonly holds a
specific variable, while each row is commonly called an instance or individual.
Date      Day of Week   Time Spent On Site (X)   Buy (Y)
June 15   Thursday      5                        No
June 15   Thursday      10                       Yes
June 16   Friday        20                       Yes
This is a row:
Date      Day of Week   Time Spent On Site (X)   Buy (Y)
June 15   Thursday      5                        No
This is a column:
Time Spent On Site (X)
5
10
20
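To make the row and column idea concrete, here is a minimal Python sketch that stores the same three visits as a list of rows. The key names (`date`, `day_of_week`, `time_on_site`, `buy`) are assumptions chosen for this illustration.

```python
# Each dictionary is one row (an instance or individual); each key is a column (a variable).
visits = [
    {"date": "June 15", "day_of_week": "Thursday", "time_on_site": 5, "buy": "No"},
    {"date": "June 15", "day_of_week": "Thursday", "time_on_site": 10, "buy": "Yes"},
    {"date": "June 16", "day_of_week": "Friday", "time_on_site": 20, "buy": "Yes"},
]

# A row: everything recorded about the first visit.
first_row = visits[0]

# A column: the same variable taken from every row.
time_column = [row["time_on_site"] for row in visits]

print(first_row)
print(time_column)  # [5, 10, 20]
```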
Before collecting data, we usually start with a question, or multiple questions, that we
would like to answer. The purpose of data is to help us in answering these questions.
Random Variables
A random variable is a placeholder for the possible values of some process (mostly... the term
'some process' is a bit ambiguous). As was stated before, notation is useful in that it helps us take
complex ideas and simplify (often to a single letter or single symbol). We see random variables
represented by capital letters (X, Y, or Z are common ways to represent a random variable).
We might have the random variable X, which is a holder for the possible values of the amount of
time someone spends on our site. Or the random variable Y, which is a holder for the possible
values of whether or not an individual purchases a product.
X is 'a holder' of the values that could possibly occur for the amount of time spent on our
website. Any number from 0 to infinity really.
Example Dataset
An example of the data we might have collected in the previous video is shown here:
Date      Day of Week   Time Spent On Site (X)   Buy (Y)
June 15   Thursday      5                        No
Question 1 of 2
What type of variable is the random variable X in the video in the previous concept?
Categorical - Ordinal
Categorical - Nominal
Quantitative - Continuous
Quantitative - Discrete
Answer: C
Question 2 of 2
What type of variable is the random variable Y in the video in the previous concept?
Categorical - Ordinal
Categorical - Nominal
Quantitative - Continuous
Quantitative - Discrete
Random variables are represented by capital letters. Once we observe an outcome of these
random variables, we notate it as a lower case of the same letter.
Example 1
For example, the amount of time someone spends on our site is a random variable (we are not
sure what the outcome will be for any particular visitor), and we would notate this with X. Then
when the first person visits the website, if they spend 5 minutes, we have now observed this
outcome of our random variable. We would notate any outcome as a lowercase letter with a
subscript associated with the order that we observed the outcome.
If 5 individuals visit our website, and the first spends 10 minutes, the second spends 20
minutes, the third spends 45 minutes, the fourth spends 12 minutes, and the fifth spends 8
minutes, we can notate this problem in the following way:
$x_1 = 10$, $x_2 = 20$, $x_3 = 45$, $x_4 = 12$, $x_5 = 8$
The capital X is associated with this idea of a random variable, while the observations of the
random variable take on lowercase x values.
Example 2
What is the probability someone spends more than 20 minutes on our website? In notation, we
write this as $P(X > 20)$.
Here P stands for probability, while the parentheses encompass the statement for which we
would like to find the probability. Since X represents the amount of time spent on the website,
this notation represents the probability the amount of time on the website is greater than 20.
We could find this in the above example by noticing that only one of the 5 observations exceeds
20. So, we would say there is a 1 (the 45) in 5 or 20% chance that an individual spends more
than 20 minutes on our website (based on this dataset).
Example 3
$P(X \geq 20)$?
We could then find this by noticing there are two out of the five individuals that spent 20 or more
minutes on the website. So this probability is 2 out of 5 or 40%.
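Both probabilities above are simply the proportion of observations that satisfy a condition. A minimal Python sketch, using the five visit times from Example 1 (the helper name `proportion` is an assumption for this illustration):

```python
# Observed times on the website, in minutes (x_1 through x_5 from Example 1).
times = [10, 20, 45, 12, 8]


def proportion(observations, condition):
    """Share of the observations for which the condition holds."""
    matches = sum(1 for x in observations if condition(x))
    return matches / len(observations)


p_more_than_20 = proportion(times, lambda x: x > 20)   # only the 45 qualifies
p_at_least_20 = proportion(times, lambda x: x >= 20)   # the 20 and the 45 qualify

print(p_more_than_20)  # 0.2
print(p_at_least_20)   # 0.4
```

These values describe only this small dataset; a different set of visitors could give different proportions.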
X    Y         Z
5    IT        Part-Time
10   Finance   Full-Time
8    HR        Full-Time
1    Finance   Part-Time
X = years of experience
Y = Department
Z = Part/Full-Time
A. $x_1$   B. $y_2$
C. $z_3$   D. $n$
Quiz Question
Use the information above to match the correct notation label to its corresponding value.
These are the correct matches.
Notation   Value
$x_1$      5
$y_2$      Finance
$z_3$      Full-Time
$n$        4
A Better Way?
We know that the mean is calculated as the sum of all our values divided by the number of
values in our dataset.
In our current notation, adding all of our values together can be extremely tedious. If we
want to add 3 values of some random variable together, we would use the notation:
$x_1 + x_2 + x_3$
With six values, this becomes:
$x_1 + x_2 + x_3 + x_4 + x_5 + x_6$
To extend this to add one hundred, one thousand, or one million values would be
ridiculous! How can we make this easier to communicate?!
Summation
Aggregations
An aggregation is a way to turn multiple numbers into fewer numbers (commonly one number).
Summation is a common aggregation. The notation used to sum our values is a Greek letter
called sigma, $\Sigma$.
Example 1
Imagine we are looking at the amount of time individuals spend on our website. We collect data
from nine individuals:
$x_1 = 10$, $x_2 = 20$, $x_3 = 45$, $x_4 = 12$, $x_5 = 8$, $x_6 = 12$, $x_7 = 3$, $x_8 = 68$,
$x_9 = 5$
If we want to sum the first three values together in our previous notation, we write:
$x_1 + x_2 + x_3$
In summation notation, we write:
$\sum_{i=1}^{3} x_i$
Notice, our notation starts at the first observation ($i = 1$) and ends at 3 (the number at the
top of our summation).
$\sum_{i=1}^{3} x_i = x_1 + x_2 + x_3 = 10 + 20 + 45 = 75$
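Example 2
To sum the last three values in our previous notation, we write:
$x_7 + x_8 + x_9$
In summation notation, we write:
$\sum_{i=7}^{9} x_i$
Notice, our notation starts at the seventh observation ($i = 7$) and ends at 9 (the number at
the top of our summation).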
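The two summations above can be checked with a short Python sketch. The helper `summation` is a hypothetical name; it takes the 1-based start and end indices used in the notation and converts them to Python's 0-based list positions.

```python
# Time on the website for the nine individuals, x_1 through x_9.
x = [10, 20, 45, 12, 8, 12, 3, 68, 5]


def summation(values, start, end):
    """Sum x_start through x_end, where start and end are 1-based as in the notation."""
    return sum(values[start - 1:end])


print(summation(x, 1, 3))  # x_1 + x_2 + x_3 = 10 + 20 + 45 = 75
print(summation(x, 7, 9))  # x_7 + x_8 + x_9 = 3 + 68 + 5 = 76
```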
Other Aggregations
The $\Sigma$ sign is used for aggregating using summation, which is one of the most common ways
to aggregate. However, we might need to aggregate in other ways. If we wanted to multiply all of
our values together, we would use a product sign $\Pi$, the capital Greek letter pi. The way we
aggregate continuous values is with something known as integration (a common technique in
calculus), which uses the symbol $\int$, which is just a long s. We will not be using integrals
or products for quizzes in this class, but you may see them in the future!
Notation for the Mean
To finalize our calculation of the mean, we introduce n as the total number of values in our
dataset. We can use this notation both at the top of our summation, as well as for the value that
we divide by when calculating the mean.
$\frac{1}{n}\sum_{i=1}^{n} x_i$
Instead of writing out all of the above, we commonly write $\bar{x}$ to represent the mean of a
dataset. As in the first video, we could use any variable letter. Therefore, we might also
write $\bar{y}$, or any other letter.
We also could index using any other letter, not just $i$. We could just as easily use $j$, $k$,
or $m$ to index each of our data values. The quizzes on the next concept will help reinforce
this idea.
Notice
Quiz Question
$x_1 = 5$
$x_2 = 15$
$x_3 = 3$
$x_4 = 3$
$x_5 = 8$
$x_6 = 10$
$x_7 = 12$
Expression                                Value
$n$                                       7
$\sum_{i=1}^{n} x_i$                      56
$\left(\sum_{j=2}^{7} x_j\right) + 6$     57
$x_5$                                     8
$\frac{\sum_{i=3}^{6} x_i}{n-1}$          4
For this quiz, you will be matching the notation attached to the letters below to the corresponding
numeric value to make sure you understand exactly what is being done with each part of the
notation.
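The matches in the quiz table above can be checked with a small Python sketch. The variable names (`total`, `shifted`, `fifth`, `scaled`) and the helper `x_at` are assumptions made for this illustration. Note that the index letter ($i$ or $j$) does not change the result.

```python
# The seven quiz values, x_1 through x_7.
x = [5, 15, 3, 3, 8, 10, 12]
n = len(x)


def x_at(index):
    """Return x_index using the 1-based subscript from the notation."""
    return x[index - 1]


total = sum(x_at(i) for i in range(1, n + 1))          # sum of all values
shifted = sum(x_at(j) for j in range(2, 8)) + 6        # sum of x_2..x_7, then add 6
fifth = x_at(5)                                        # the fifth observation
scaled = sum(x_at(i) for i in range(3, 7)) / (n - 1)   # sum of x_3..x_6 divided by n - 1

print(n, total, shifted, fifth, scaled)  # 7 56 57 8 4.0
```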
For the below quiz, let the following letters denote the corresponding notation:
A. $X$
B. $Y$
C. $x_1$
D. $n$
E. $\sum_{i=1}^{n} x_i$
Question 1 of 2
Use the letter next to the notation above to match the notation to the description of what the
notation represents.
E = The notation for the sum of all the values in our dataset.
For the below quiz, let the following letters denote the corresponding notation:
A. $\sum_{i=1}^{n} x_i$
B. $\frac{\sum_{i=1}^{n} x_i}{n}$
C. $\bar{x}$
D. $\bar{y}$
E. $\frac{\sum_{j=1}^{n} y_j}{n}$
Question 2 of 2
If we wanted to provide notation for the mean of a particular dataset, which of the following
letters would correspond to the notation attached to calculating the mean? (Mark all that apply.)
Answer: B,C,D,E
Notation Recap
Notation is an essential tool for communicating mathematical ideas. We have introduced the
fundamentals of notation in this lesson that will allow you to read, write, and communicate with
others using your new skills!
As a quick recap, capital letters signify random variables. When we look at individual
instances of a particular random variable, we identify these as lowercase letters with
subscripts attached to each specific observation.
For example, we might have X be the amount of time an individual spends on our website. Our
first visitor arrives and spends 10 minutes on our website, and we would say $x_1$ is 10
minutes.
We might imagine the random variables as columns in our dataset, while a particular value
would be notated with the lower case letters.
website
$\bar{x}$ Exactly the same as the above - the mean of our data. (5 + 2 + 3)/3
We took our notation even further by introducing the notation for summation, $\sum$. Using this
we were able to calculate the mean as:
$\frac{1}{n}\sum_{i=1}^{n} x_i$
In the next section, you will see this notation used to assist in your understanding of calculating
various measures of spread. Notation can take time to fully grasp. Understanding notation not
only helps in conveying mathematical ideas but also in writing computer programs - if you
decide you want to learn that too! Soon you will analyze data using spreadsheets. When that
happens, many of these operations will be hidden by the functions you will be using. But until
we get to spreadsheets, it is important to understand how mathematical ideas are commonly
communicated. This isn't easy, but you can do it!
Lesson Recap
This lesson covered some of the foundational statistical topics needed to use statistics in practice.
You can now:
Implement notation
Lesson Overview
In this lesson, we will continue to cover more topics related to analyzing quantitative variables
and you will learn to use measures of spread. Measures of spread are used to provide us an idea
of how spread-out our data are from one another.
Range
Standard Deviation
Variance
Analyze outliers
Throughout this lesson, you will learn how to calculate these, as well as why we would use one
measure of spread over another.
A. Histograms
Histograms are super useful for understanding the different aspects of data and they are the most
common visual used for quantitative data. In the upcoming concepts, you will see histograms
used all the time to help you understand the four aspects we outlined earlier regarding a
quantitative variable:
center
spread
shape
outliers
First, we need to bin our data. Each bin represents a range of values in a dataset. The number of
values that fall in the range of each bin determines the height of each histogram bar. As shown in
the video above, changing the range of our bins can result in slightly different visuals. However,
there is no right or wrong answer in choosing how to bin, and in most cases, the software you use
will choose the appropriate bins for you.
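As a rough sketch of how binning works, the Python function below counts how many values fall into bins of a fixed width. The function name, the bin convention (each bin includes its left edge and excludes its right edge), and the sample values are all assumptions for illustration. They are not the data from the video.

```python
def bin_counts(values, bin_width, start=0):
    """Count the values in each bin [left, left + bin_width), keyed by the bin's left edge."""
    counts = {}
    for value in values:
        index = int((value - start) // bin_width)
        left = start + index * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical number of dogs seen on eight different days.
dogs_seen = [5, 8, 12, 13, 13, 14, 17, 21]

# The same data with two different bin widths gives two different pictures.
print(bin_counts(dogs_seen, bin_width=5))   # {5: 2, 10: 4, 15: 1, 20: 1}
print(bin_counts(dogs_seen, bin_width=10))  # {0: 2, 10: 5, 20: 1}
```

Each count is the height of one histogram bar. Bins that contain no values are simply left out of the result in this sketch.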
The two histograms below illustrate the number of dogs Josh saw on weekdays versus weekends.
The measures of center for both histograms (mean, median, mode) are basically the same and
centered about the highest bin for both histograms, 13.
Visually, the difference between the histograms is the range or spread of dogs Josh sees during
each time period. In the upcoming lessons, we will discuss the most common ways to measure
the spread of our data.
1. Minimum: The smallest value in the dataset.
2. $Q_1$: The value such that 25% of the data fall below.
3. $Q_2$: The value such that 50% of the data fall below.
4. $Q_3$: The value such that 75% of the data fall below.
5. Maximum: The largest value in the dataset.
In the above video, we saw that calculating each of these values was essentially just finding the
median of a bunch of different datasets. Because we are essentially calculating a bunch of
medians, the calculation depends on whether we have an odd or even number of values.
Range
The range is then calculated as the difference between the maximum and the minimum.
IQR
The inter-quartile range is calculated as the difference between $Q_3$ and $Q_1$.
In the upcoming sections, you will practice this with Katie and on your own.
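The "median of a bunch of different datasets" idea can be sketched in Python as below. This sketch assumes one common convention: $Q_1$ is the median of the lower half of the ordered data and $Q_3$ is the median of the upper half, with the middle value left out of both halves when the number of values is odd. Other software may use slightly different quartile rules. The function names and the sample data are illustrative.

```python
def median_of_sorted(ordered):
    """Median of a list that is already ordered from smallest to largest."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


def five_number_summary(values):
    """Return min, Q1, Q2 (median), Q3, and max using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = ordered[:half]
    # Skip the middle value when the count is odd.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return {
        "min": ordered[0],
        "Q1": median_of_sorted(lower),
        "Q2": median_of_sorted(ordered),
        "Q3": median_of_sorted(upper),
        "max": ordered[-1],
    }


summary = five_number_summary([5, 8, 3, 2, 1, 3, 10])  # illustrative data
data_range = summary["max"] - summary["min"]
iqr = summary["Q3"] - summary["Q1"]
print(summary)          # {'min': 1, 'Q1': 2, 'Q2': 3, 'Q3': 8, 'max': 10}
print(data_range, iqr)  # 9 6
```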
Question 1 of 2
Item Number
Range 11
First Quartile 2
Third Quartile 8
Median 4.5
Question 2 of 2
Item Number
Range 11
First Quartile 2.5
Third Quartile 9
Median 5
Looking back at the histograms Josh created for the number of dogs he recorded seeing on
weekdays and weekends, we can use the histograms to mark the values of the 5 number
summary and create a box plot.
Box plots are useful for quickly comparing the spread of two data sets across some key
metrics, like quartiles, maximum, and minimum.
1. The beginning of the line to the left of the box and the end of the line to the right of the
box represent the minimum and maximum values in a dataset.
2. The visual distance between these markings is an indication of the range of the values.
3. The box itself represents the IQR. The box begins at the Q1 value, ends at the Q3 value,
and Q2, or the median, is represented by a line within the box.
From both the histograms and box plots, we can see that the number of dogs seen on weekends
varies much more than on weekdays.
However, instead of depending on a visual of the 5 number summary to compare our data, in the
next lesson, we will learn about using a single value to compare the two distribution spreads
- standard deviation.
The standard deviation is one of the most common measures for talking about the spread of data.
It can be thought of as the typical distance of each observation from the mean.
In the above video, we saw this as how far individuals were from the average distance from work.
(The distances shown are examples from the full dataset; the mean of just those 4 numbers is
38.5. The mean of 18 shown later in the video is the mean of the full dataset, which is not
shown in the video.) In the next video, you will see exactly how this is calculated.
1. First, calculate the mean of the four values (10, 14, 10, 6):
$\bar{x} = \frac{\sum_{i=1}^{4} x_i}{n} = \frac{40}{4} = 10$
2. Next, calculate the distance of each observation from the mean and square the value:
$(x_i - \bar{x})^2$:
$(10 - 10)^2 = 0^2 = 0$
$(14 - 10)^2 = 4^2 = 16$
$(10 - 10)^2 = 0^2 = 0$
$(6 - 10)^2 = (-4)^2 = 16$
3. Then calculate the variance, the average squared difference of each observation from the
mean:
$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{4}(0 + 16 + 0 + 16) = \frac{32}{4} = 8$
4. Finally, calculate the standard deviation, the square root of the variance:
$\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \sqrt{8} \approx 2.83$
The standard deviation tells us roughly how far, on average, each point in our dataset is from
the mean.
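The steps of the worked example can be followed one at a time in Python. This is a minimal sketch using the same four values (10, 14, 10, 6); the variable names are assumptions for illustration.

```python
import math

data = [10, 14, 10, 6]
n = len(data)

# Step 1: the mean.
mean = sum(data) / n  # 40 / 4 = 10.0

# Step 2: the squared distance of each observation from the mean.
squared_distances = [(x - mean) ** 2 for x in data]  # [0.0, 16.0, 0.0, 16.0]

# Step 3: the variance, the average of the squared distances (dividing by n).
variance = sum(squared_distances) / n  # 32 / 4 = 8.0

# Step 4: the standard deviation, the square root of the variance.
standard_deviation = math.sqrt(variance)

print(mean, variance, round(standard_deviation, 2))  # 10.0 8.0 2.83
```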
Quiz: Measures of Spread (Calculation and Units)
Question 1 of 2
If we measure the variance associated with our sales in dollars for each month for 3 years, what
are the units associated with the variance?
Answer: D
Question 2 of 2
Remember to find the variance we first find the mean average of the values, then subtract the
mean from each value, then square each of these values, then add them up, then divide by the
number of values. (Round your answer to two decimal places at the end of your calculation -
don't round along the way.)
Measure Value
Variance 13.55
5 Number Summary
In the previous sections, we have seen how to calculate the values associated with the
five-number summary (min, $Q_1$, $Q_2$, $Q_3$, max), as well as the measures of spread
associated with these values (range and IQR).
For datasets that are not symmetric, the five-number summary and a corresponding box plot are
a great way to get started with understanding the spread of your data. Although I still prefer a
histogram in most cases, box plots can make it easier to compare two or more groups. You will
see this in the quizzes towards the end of this lesson.
2. Why the measures of variance and standard deviation make sense to capture the spread of
our data.
4. Why we might use the standard deviation or variance as opposed to the values associated
with the 5 number summary for a particular dataset.
Calculation
$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$
The variance is the average squared difference of each observation from the mean.
To calculate the variance of a set of 10 values in a spreadsheet application, with our 10 data
points in cells A1 through A10, we would create a new column B by typing
=A1-AVERAGE(A$1:A$10) in cell B1 and copying this down for all 10 rows. This finds the
difference between each data point and the mean of all the data. Then we create a new
column C holding the square of these differences, using the formula =B1^2 in cell C1 and
copying that down for all rows. Then in the cell below this new column, cell C11, we type
=SUM(C1:C10). This adds up all the values in column C. Finally, in cell C12, we divide
this sum by the number of data points we have, in this case ten: =C11/10. Cell C12 now
contains the variance for our 10 data points.
More detailed guidance on using spreadsheets like this may be included in a future lesson in your
program.
The standard deviation is the square root of the variance. Therefore, the formula for the standard
deviation is the following:
$\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
In the same spreadsheet as above, to find the standard deviation of our same set of 10 data
values, we would use another cell like C13 to take the square root of our variance measure, by
typing in =sqrt(C12).
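Outside of a spreadsheet, the same two quantities are available in Python's standard `statistics` module. The sketch below uses four illustrative values (10, 14, 10, 6) rather than the 10 spreadsheet values. One detail worth knowing: `pvariance` and `pstdev` divide by $n$, matching the formulas in this lesson, while `variance` and `stdev` divide by $n - 1$ and so give slightly larger results.

```python
import statistics

data = [10, 14, 10, 6]  # illustrative values

# Population versions: divide by n, as in the formulas in this lesson.
population_variance = statistics.pvariance(data)
population_std = statistics.pstdev(data)

# Sample version: divides by n - 1 instead of n.
sample_variance = statistics.variance(data)

print(population_variance)       # 8
print(round(population_std, 2))  # 2.83
print(round(sample_variance, 2)) # 10.67
```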
The standard deviation is a measurement that has the same units as our original data, while the
units of the variance are the square of the units in our original data. For example, if the units in
our original data were dollars, then units of the standard deviation would also be dollars, while
the units of the variance would be dollars squared.
Again, this section is designed as background knowledge for the following sections. If it doesn't
make sense on this first pass, do not worry. You will be guided in future sections in performing
these calculations, and building your intuition, as you work through an example using the salary
data. Then we will provide context about why these calculations are important, and where you
might see them!
Standard deviation is a common metric used to compare the spread of two datasets. The
benefits of using a single metric instead of the 5 number summary are:
1. The variance is used to compare the spread of two different groups. A set of data with
higher variance is more spread out than a dataset with lower variance. Be careful though,
there might just be an outlier (or outliers) that is increasing the variance when most of the
data are actually very close.
2. When comparing the spread between two datasets, the units of each must be the same.
3. When data are related to money or the economy, higher variance (or standard deviation)
is associated with higher risk.
4. The standard deviation is used more often in practice than the variance because it shares
the units of the original dataset.
The standard deviation is associated with risk in finance, assists in determining the significance
of drugs in medical studies, and measures the error of our results for predicting anything from
the amount of rainfall we can expect tomorrow to your predicted commute time tomorrow.
These applications are beyond the scope of this lesson as they pertain to specific fields, but know
that understanding the spread of a particular set of data is extremely important to many areas. In
this lesson, you mastered the calculation of the most common measures of spread.
Question 1 of 3
Assume d1 and d2 are datasets both measured in the same units. We know that the standard
deviation of d1 is 5 and the variance of d2 is 36, which of the following are certainly true. Mark
all that apply.
Remember the Standard Deviation is the square root of the variance. So if the Variance is 4 the
Standard Deviation would be 2.
Answer: B and C
That's right! We can only talk about specific measures of spread, and not measures of center.
Additionally, the range isn't directly associated with the standard deviation, so we can't make a
claim that is always true like the final option.
Question 2 of 3
If a dataset has a standard deviation of zero, which of the following MUST be true?
That's right! Since the standard deviation is a measure of spread, a zero value suggests that all of
our data points are the same value.
Question 3 of 3
For each of the below: If the statement is true, mark the box next to the statement.
A. If two datasets have the same variance, they will also have the same standard deviation.
B. If I have two investment options with the same mean return, it really doesn't matter which
I invest in.
C. If I have two investment options with the same standard deviation associated with the
return, they will also have the same max possible return.
That is correct! Besides the mean return of an investment, we should also consider the spread
associated with the return. But just because the standard deviation associated with each
investment is the same, this does not mean the max you could make for each investment is the
same.
Investment Data
Returns
               Year 1   Year 2   Year 3   Year 4   Year 5   Year 6
Investment 1   5%       5%       5%       5%       5%       5%
The returns for 6 consecutive years for each investment are shown above. Use this information to
answer the questions below.
Question 1 of 3
Use the information above to match the mean/expected return for each investment.
Investment Return
Investment 1 5%
Investment 2 5%
In the previous two questions, you should have found that these investments have the same
mean! That is, regardless of which investment opportunity you choose, you are expected to earn
the same amount. So how are they different? Let's look at some additional questions to see if we
can find some differences.
Question 2 of 3
Using the information above, mark all of the below that are true statements.
A. The risk associated with investment 1 is lower than the risk associated with
Investment 2.
B. The standard deviation associated with Investment 1 is smaller than the standard
deviation associated with Investment 2.
C. Knowing the mean return amount across all the years for each investment
provides us with all of the information necessary to understand which investment
we should choose.
Answer: A and B
That's right! Because the return is the same year over year for Investment 1, it has 'no spread' or a
standard deviation of 0. This smaller standard deviation is associated with smaller risk.
Understanding the spread of values we could earn is just as important as understanding the
expected return (mean return).
Question 3 of 3
Based on the observed data, which of the above two investments has the best opportunity of
earning more than 7%?
A. Investment 1
B. Investment 2
C. Neither.
D. We cannot tell.
Answer: B
That's right! Only Investment 2 has earned more than 7% in the observed data (in 2 of 6 years, a
1/3 chance), whereas Investment 1 earned more than 7% in 0 of 6 years.
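As a rough sketch of the comparison in these three questions, the Python snippet below computes the mean, the standard deviation, and the share of years above 7% for two return series. The values in investment_2 are hypothetical (they are not the course's data); they were chosen only so that the mean is 5% and two of the six years exceed 7%. The helper share_above is also just an illustrative name.

```python
from statistics import mean, pstdev

# Yearly returns in percent. investment_1 matches the constant 5% row above;
# investment_2 is a hypothetical series used only for illustration.
investment_1 = [5, 5, 5, 5, 5, 5]
investment_2 = [12, 10, 7, 2, 0, -1]

def share_above(returns, threshold):
    """Fraction of observed years with a return strictly above threshold."""
    return sum(r > threshold for r in returns) / len(returns)

for name, returns in [("Investment 1", investment_1), ("Investment 2", investment_2)]:
    print(name, mean(returns), round(pstdev(returns), 2), share_above(returns, 7))
```

Both series have the same mean, but only the second has a non-zero spread and a non-zero share of years above 7%.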
Useful Insight
The above example is a simplified version of the real world but does point out something useful
that you may have heard before. Notice if you were not fully invested in either Investment 1 or
fully invested in Investment 2, but instead, you were diversified across both investment options,
you could earn more than either investment individually. This is the benefit of diversifying your
portfolio for long-term gains. For short-term gains, you might not need or want to diversify. You
could get lucky and hit short-term gains associated with the upswings (12%, 10%, or 7%) of
Investment 2. However, you might also get unlucky, and hit a down term and earn nothing or
even lose money on your investment using this same strategy.
For the following dataset, match each value to the appropriate label:
Term Value
n 13
Median 7
First quartile 3
Mean 8.4
Mode 3
Question 2 of 2
For the following dataset, match each value to the appropriate label:
Term Value
range 20
variance 33.9
minimum 2
maximum 22
Recap
Variable Types
We have covered a lot up to this point! We started with identifying data types as
either categorical or quantitative. We then learned we could identify quantitative variables as
either continuous or discrete. We also found we could identify categorical variables as
either ordinal or nominal.
Categorical Variables
When analyzing categorical variables, we commonly just look at the count or percent of a group
that falls into each level of a category. For example, suppose we had two levels of a dog
category: lab and not lab. We might say that 32% of the dogs were labs (percent), or that 32
of the 100 dogs I saw were labs (count).
However, the 4 aspects associated with describing quantitative variables are not used to describe
categorical variables.
Quantitative Variables
Then we learned there are four main aspects used to describe quantitative variables:
1. Measures of Center
2. Measures of Spread
3. The Shape of the Distribution
4. Outliers
Measures of Center
1. Means
2. Medians
3. Modes
Measures of Spread
1. Range
2. Interquartile Range
3. Standard Deviation
4. Variance
Calculating Variance
Dividing by n: $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$
Dividing by n - 1: $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
The reason for this is beyond the scope of what we have covered thus far, but you can find an
explanation here.
You can commonly find answers to your questions with a quick Google search. Now is a great
time to get started with this practice! This answer should make more sense
at the completion of this lesson.
The standard deviation is the square root of the variance. In practice, you usually use the
standard deviation rather than the variance. The reason for this is because the standard deviation
shares the same units with our original data, while the variance has squared units.
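To make the two variance formulas and the square-root relationship concrete, here is a minimal Python sketch. The data values are hypothetical, and the variable names are chosen only for this illustration.

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

n = len(data)
x_bar = sum(data) / n
squared_deviations = [(x - x_bar) ** 2 for x in data]

# Divide by n when the data is the whole population ...
population_variance = sum(squared_deviations) / n
# ... and by n - 1 when the data is a sample.
sample_variance = sum(squared_deviations) / (n - 1)

# The standard deviation is the square root of the variance,
# so it is back in the same units as the original data.
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(population_variance, population_sd)
print(sample_variance, sample_sd)
```

Python's standard statistics module offers pvariance, variance, pstdev, and stdev for the same calculations.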
What Next?
In the next sections, we will be looking at the last two aspects of quantitative
variables: shape and outliers. What we know about measures of center and measures of spread
will assist in your understanding of these final two aspects.
Shape
Histograms
We learned how to build a histogram in this video, as this is the most popular visual for
quantitative data.
Shape
From a histogram, we can quickly identify the shape of our data, which helps influence all of the
measures we learned in the previous concepts. We learned that the distribution of our data is
frequently associated with one of the three shapes:
1. Right-skewed
2. Left-skewed
3. Symmetric
Summary
Shape    Mean vs. Median    Real-World Applications
The mode of a distribution is essentially the tallest bar in a histogram. There may be multiple
modes depending on the number of peaks in our histogram.
If you're working with data, you can always build a Quick Plot to see the shape.
Just to apply some context, some examples of approximately bell-shaped data include heights
and weights, standardized test scores, precipitation amounts, the mean of a distribution, or errors
in manufacturing processes. Common data that follow left-skewed distributions include GPAs,
the age of death, and asset price changes. Common data that follow approximately right-skewed
distributions include the amount of drug left in your bloodstream over time, the distribution of
wealth, and human athletic abilities.
There are links below in the instructor notes in case you want to learn more about each of these
cases.
Though these three shapes (right-skewed, left-skewed, and symmetric) are the most common
distributions, data in the real world can be messy, and it might not follow any of them.
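A common rule of thumb is that the long tail pulls the mean toward it: the mean tends to sit above the median for right-skewed data, below it for left-skewed data, and close to it for symmetric data. The small Python sketch below illustrates this with three hypothetical datasets; real data will not always follow the rule this cleanly.

```python
from statistics import mean, median

# Hypothetical datasets, one per shape.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 20]
left_skewed = [1, 17, 17, 18, 18, 18, 19, 19, 20]
symmetric = [1, 2, 3, 4, 5]

for name, data in [
    ("right-skewed", right_skewed),
    ("left-skewed", left_skewed),
    ("symmetric", symmetric),
]:
    print(name, "mean:", round(mean(data), 2), "median:", median(data))
```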
When working with data, building a quick plot lets you quickly see the shape of your data.
References
These are the references used to pull the applications of each shape.
Supporting Materials
Stack Exchange
Match the distribution shape with the correct relationship in comparing the mean to the median.
Shape Comparison
Image Summary
In the below image, we have three box-plots. Each box-plot is for a different Iris
flower: setosa, versicolor, or virginica. On the y-axis, we are given the sepal length. Notice
that virginica has an outlier towards the bottom of the plot. Therefore, the minimum is not given
by the bottom line here; rather, it is provided by this point.
Quick Refresher: The measures of center and spread we can determine from a Box Plot are as
follows. Let's use Setosa for these examples.
The IQR is the distance between the first and third quartiles, which are the edges of the box. For
Setosa, they are about 4.8 for the first quartile and 5.2 for the third, so the IQR is about 0.4.
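The values a box plot displays can also be computed directly. The sketch below uses Python's statistics.quantiles on a hypothetical list of sepal lengths (these are not the values behind the plot). Quartile conventions differ between tools, so the exact numbers may not match what a given spreadsheet reports.

```python
from statistics import quantiles

# Hypothetical sepal lengths; not the actual Iris measurements.
sepal_lengths = [4.6, 4.7, 4.8, 4.9, 5.0, 5.0, 5.1, 5.2, 5.4, 5.8]

# Cut the data into four parts; the three cut points are Q1, the median, and Q3.
q1, q2, q3 = quantiles(sepal_lengths, n=4, method="inclusive")

five_number_summary = (min(sepal_lengths), q1, q2, q3, max(sepal_lengths))
iqr = q3 - q1  # height of the box in a box plot

print(five_number_summary, iqr)
```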
The below plot will be used to answer the first two questions in this section.
Question 1 of 5
A. Bar Chart
B. Box Plot
C. Histogram
D. Pie Chart
Answer: C
Question 2 of 5
A. Right skewed
B. Left skewed
C. Symmetric
D. Bi-modal
Answer: D
Use the below image to assist with answering the next three questions.
Plot image for the quiz below
Question 3 of 5
A. Bar Chart
B. Box Plot
C. Histogram
D. Pie Chart
Answer: B
Question 4 of 5
A. Right skewed
B. Left skewed
C. Symmetric
D. Bi-modal
Answer: B
Question 5 of 5
Select the true statement for the box-plot above.
Answer: A
Histograms
Let the histogram on the left be Histogram 1 and the histogram on the right be Histogram 2.
Quick Notes
Pay attention to the scale of these two graphs. The first deals with much larger numbers.
The average factors in all the numbers, so outliers pull the average toward them.
A left-skewed graph starts with low frequencies and then slopes up. A right-skewed graph starts
with high frequencies and then slopes down.
Quiz Question
Correctly match the histograms to the statements that are true about each.
Statement Histogram
Variable Types
We have covered a lot up to this point! We started with identifying data types as
either categorical or quantitative. We then learned we could identify quantitative variables as
either continuous or discrete. We also found we could identify categorical variables as
either ordinal or nominal.
Categorical Variables
When analyzing categorical variables, we commonly just look at the count or percent of a group
that falls into each level of a category. For example, suppose we had two levels of a dog
category: lab and not lab. We might say that 32% of the dogs were labs (percent), or that 32
of the 100 dogs I saw were labs (count).
However, the 4 aspects associated with describing quantitative variables are not used to describe
categorical variables.
Quantitative Variables
Then we learned there are four main aspects used to describe quantitative variables:
1. Measures of Center
2. Measures of Spread
3. The Shape of the Distribution
4. Outliers
Measures of Center
1. Means
2. Medians
3. Modes
Measures of Spread
1. Range
2. Interquartile Range
3. Standard Deviation
4. Variance
Shape
We learned that the distribution of our data is frequently associated with one of the three shapes:
1. Right-skewed
2. Left-skewed
3. Symmetric
Depending on the shape associated with our dataset, certain measures of center or spread may be
better for summarizing our dataset.
When we have data that follows a normal distribution, we can completely understand our
dataset using the mean and standard deviation.
However, if our dataset is skewed, the 5 number summary (and measures of center associated
with it) might be better to summarize our dataset.
Outliers
We learned that outliers have a larger influence on measures like the mean than on measures like
the median. We learned that we should work with outliers on a situation by situation basis.
Common techniques include:
3. Understand why they exist, and the impact on questions we are trying to answer about our
data.
4. Reporting the 5 number summary values is often a better indication than measures like the
mean and standard deviation when we have outliers.
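The difference in how outliers affect the mean and the median can be seen with a small Python sketch. The numbers are hypothetical, and the single value 100 plays the role of the outlier.

```python
from statistics import mean, median

# Hypothetical measurements without and with one outlier.
without_outlier = [10, 12, 11, 13, 12]
with_outlier = without_outlier + [100]

print("mean:  ", mean(without_outlier), "->", round(mean(with_outlier), 2))
print("median:", median(without_outlier), "->", median(with_outlier))
```

The mean moves a long way toward the outlier, while the median stays where it was.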
We also looked at histograms and box plots to visualize our quantitative data. Identifying outliers
and the shape associated with the distribution of our data are easier when using a visual as
opposed to using summary statistics.
What Next?
Up to this point, we have only looked at Descriptive Statistics, because we are describing our
collected data. In the final sections of this lesson, we will be looking at the difference
between Descriptive Statistics and Inferential Statistics.
Descriptive Statistics
Inferential Statistics
Inferential Statistics is about using our collected data to draw conclusions about a larger
population.
Identify the population, parameter, sample, and statistic for the below scenario:
Consider we are interested in the average number of hours slept by all Udacity students (100,000
students). I send an email to all Udacity students, but I only receive 5,000 response emails. The
average amount of sleep of those that responded was 6.8 hours of sleep.
Term Description
Question 2 of 3
Identify the population(s), parameter(s), sample(s), and statistic(s) for the below scenario:
Consider we own a bagel shop. We know that the average diameter of all of our bagels is 5.5
inches. A competitor moves right next door to us! We are interested in if they make larger bagels
than us. We obtain 100 of their bagels, and we find they have an average diameter of 6 inches.
Description Term
6 inches Statistic
Essentially we have two populations - one is all the bagels from our competitor, and the second
is all the bagels from our shop. We know the diameter of all the bagels at our shop, so this is a
parameter. The 100 bagels from the competitor are now a sample, and we have a statistic, which
is our numeric summary from that sample of 6 inches.
Question 3 of 3
Description Term
Inference
In this section, we learned about how Inferential Statistics differs from Descriptive Statistics.
Descriptive Statistics
Descriptive statistics is about describing our collected data using the measures discussed
throughout this lesson: measures of center, measures of spread, the shape of our distribution, and
outliers. We can also use plots of our data to gain a better understanding.
Inferential Statistics
Inferential Statistics is about using our collected data to draw conclusions about a larger
population. Performing inferential statistics well requires that we take a sample that accurately
represents our population of interest.
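The relationship between a parameter and a statistic can be sketched in Python. Everything here is an assumption made for illustration: the population is simulated, and the sample is a simple random sample, which is an idealized case that real survey responses rarely match.

```python
import random
from statistics import fmean

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: hours slept by 100,000 students.
population = [random.gauss(7.0, 1.2) for _ in range(100_000)]
parameter = fmean(population)   # numeric summary of the whole population

# Simple random sample of 5,000 students (idealized assumption).
sample = random.sample(population, 5_000)
statistic = fmean(sample)       # numeric summary of the sample only

print(round(parameter, 2), round(statistic, 2))
```

When the sample represents the population well, the statistic lands close to the parameter, which is what makes inference possible.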
A common way to collect data is via a survey. However, surveys may be extremely biased
depending on the types of questions that are asked, and the way the questions are asked. This is a
topic you should think about when tackling the first project.
Looking Ahead
Though we will not be diving deep into inferential statistics within this course, you are now
aware of the difference between these two branches of statistics. If you have ever conducted a
hypothesis test or built a confidence interval, you have performed inferential statistics. The way
we perform inferential statistics is changing as technology evolves. Many career paths
involving Machine Learning and Artificial Intelligence are aimed at using collected data to
draw conclusions about entire populations at an individual level. It is an exciting time to be a part
of this space, and you are now well on your way to joining the other practitioners!
Lesson Review
Range
Standard Deviation
Variance
Analyze outliers
Spreadsheet Options
It's important to note that there are multiple options for spreadsheets to use in this course. I'll be
using Microsoft Excel, but much of the same functionality (with some minor differences in
menus used) is also available in Google Sheets, Apple Numbers, and Apache OpenOffice Calc.
Depending on your location, decimals and commas may be treated differently in spreadsheet
applications (mostly in European countries) than what you will see in this course.
For this course, we will be using a convention where commas separate large values - like
1,000.00 or 1,000,000.00. On the other hand, the period will be used as the decimal separator, as
you can see from these previous numbers.
You may want to make sure that you are following along with these same conventions, or at least be
aware of the differences you may see should you be using a different convention for commas and
decimals.
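To see why the separator convention matters, the Python sketch below reads the same amount written in two conventions. The function name parse_number and its parameters are hypothetical, and it handles only plain numbers with one kind of thousands separator and one kind of decimal separator.

```python
def parse_number(text, decimal_sep=".", thousands_sep=","):
    """Convert text such as '1,000.00' to a float under the given convention."""
    without_grouping = text.replace(thousands_sep, "")
    return float(without_grouping.replace(decimal_sep, "."))

# Convention used in this course: comma groups thousands, period marks decimals.
print(parse_number("1,000.00"))

# The opposite convention, common in many European locales.
print(parse_number("1.000,00", decimal_sep=",", thousands_sep="."))
```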
Potential Solutions
Periods and Commas
Check out this link on switching between decimals and commas, or vice versa, if needed.
You may also differ in using commas versus semicolons in equations.
Note that you should select English (United States) to have your formatting match what is
shown in the course.
After making any changes to the settings of Excel, you may need to restart your computer for the
changes to actually go into effect.
Why Spreadsheets?
Spreadsheets started off as a manual way to track things like sales, keeping running totals of
columns, summing across columns, etc. VisiCalc was the first
computerized version of a spreadsheet, for the Apple II, released in 1979. Ported over to IBM
PCs in 1981, it became one of the first truly popular apps across business functions.
Spreadsheet programs took over a lot of the painstaking manual work from before (especially
calculations), and helped reduce human errors.
Spreadsheets do have limitations, such as having issues with lots of users and very large datasets,
as well as variation between programs in sheet sizes and how many sheets there can be. On the plus
side, they are easy to obtain and use, and are quick at loading, modifying, and visualizing data.
Quiz Question
Spreadsheets make it possible to easily manipulate, analyze, and visualize data. There are some
limitations when using very large data sets.
Excel
There are several pricing options to choose from. Whatever your circumstance, be sure to
download the desktop version. It can do a few things that the browser version can't. Note that the
Windows and Mac versions of Excel are different, and the Mac version may not have certain
built-in functionalities, such as box plots. Also, the free Office Online version of Excel is limited
in functionality and is not recommended for this class. Once you have your Excel installed on a
PC or Mac, you will need to load the following Add-In:
Google Sheets
Google Sheets is a free alternative with full functionality when you enhance it with an add-on.
You'll need a free Google account to get started. If you already have one, the link above will take
you where you need to go. If you don't have a Google account, click "More Options" and create
one. Once you have your account established, open a blank sheet and use the "Add-ons" menu to
add the following to your account:
Apple Numbers
If you have a Mac, this application should already be there. Most features we present are
available in Numbers, and it is compatible with Excel files. There does not seem to be a way to
easily create box plots, however, which we will introduce in the Visualize Data lesson.
Both Apache OpenOffice Calc and LibreOffice Calc are open source and freely available. They
will read Excel format and provide most of the same functionality. There are some important
differences to be aware of, however, including differences in formula
syntax. Some features, such as box plots, may require internet searches for solutions and a
number of steps to implement.
Navigation: Worksheet
Pro Tip: In Excel, you can create a new spreadsheet by clicking File -> New -> Blank
Spreadsheet. Other spreadsheet applications typically follow a similar process.
Spreadsheet columns use letters as their labels, A-Z, and then continuing on by adding an initial
letter to again run through the alphabet (so AA-AZ, then BA-BZ, etc.). Rows are numbered
numerically. Each cell is addressed based on its column and row, so the cell in Column D and
row 6 is D6.
Pro Tip: In Excel, you can add sheets with the "Add Sheets" ("+") button (near the bottom) and
rename sheets with a more specific title. Other spreadsheet applications typically follow a similar
process.
When working with multiple sheets, the tab names also become part of the cell name, if
addressed from a different sheet.
The formula bar appears just above the spreadsheet cells, or can be accessed by clicking into a
given cell. We'll talk more on formulas later as well.
Pro Tip: Clicking on the formula bar highlights all the cells being used in the formula.
Quiz Question
A. BB
B. AB
C. 28
D. A28
Answer: B
Columns are labeled with letters of the English alphabet. After the 26 letters are exhausted,
column labels follow the pattern AA, AB, AC, ...
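The labeling pattern can be expressed as a short Python function. This is only a sketch of the pattern described above; the name column_label is hypothetical and is not part of any spreadsheet application.

```python
def column_label(index):
    """Convert a 1-based column number to its letter label (1 -> A, 27 -> AA)."""
    if index < 1:
        raise ValueError("column numbers start at 1")
    label = ""
    while index > 0:
        # Shift by one so that 1..26 map to A..Z rather than 0..25.
        index, remainder = divmod(index - 1, 26)
        label = chr(ord("A") + remainder) + label
    return label

for number in (1, 26, 27, 28, 52, 53):
    print(number, column_label(number))
```

For example, column 28 gets the label AB, the label that appears as option B in the quiz above.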
The main menu bar contains many functions and features that will be more specific to the
spreadsheet application you are using. Many also allow you to customize which menu commands
are shown.
Pro Tip: In Excel, you can customize menu commands with File -> Options -> Customize
Ribbon.
Pro Tip: In Excel, you can press F1 for the help menu.
File commands are those that operate between the spreadsheet application and your computer
operating system, such as creating a new spreadsheet, or saving or loading an existing one.
The very top of the menu bar contains quick access options such as undo or redo.
Pro Tip: In Excel, you can customize the quick access toolbar with File -> Options -> Quick
Access Toolbar.
The Home menu contains options such as cut, copy and paste, various formatting options like
font changes and data types, and cell operations like insertions or deletions. We'll use these in
later lessons. It also has functionality to find or replace, which may work differently based on the
application.
The Insert Menu has items to create hyperlinks or charts, which we'll use when we discuss
visualizing data. We'll also cover pivot tables later.
The Data Menu has items such as sorting and filtering data.
Quiz Question
What are some examples of Data menu operations? Check all that apply.
A. Sort
B. Filter
C. Chart
D. Text to columns
Navigation: Shortcuts
Fill: Copy or continue a pattern of cells by dragging the mouse using a fill handle on the lower
right of a cell. The fill handle typically shows up as a little plus sign when your mouse is in the
right place.
Pro Tip: Many keyboard shortcuts also work within spreadsheets. Undo with Ctrl+Z and Redo
with Ctrl+Y; note that Mac users should use the Cmd button instead of Ctrl.
Quiz Question
A. Up
B. Down
C. Left
D. Right
Answer: all
Copy Data
Note: Google no longer provides historical data on stocks. However this information can be
found on Yahoo at:
https://finance.yahoo.com/quote/AAPL/history?p=AAPL
If for any reason, you can't access the sites, you can use these Word documents as the
source:
Table Paste
How many columns were created when you pasted the AAPL Historical prices table from the
website? Answer: 7
Correct! The data columns are Date, Open, High, Low, Close, Adj Close, and Volume.
Range Addressing
Range: A group of cells selected or addressed together. It is defined by the cells in the upper left
and lower right corners of the range.
You can select a range of cells by clicking on one cell on a corner of the range and dragging to
the other cells in the range. In the video, we also saw how this can be done in the formula with a
colon, such as B2:C7, which selects all cells in the B and C columns from rows 2 to 7.
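To show what a range address such as B2:C7 covers, the Python sketch below lists every cell in a range. The function names are hypothetical, and the sketch only handles cell-to-cell ranges written from the upper left to the lower right corner, as in the definition above; it does not handle whole-column or whole-row ranges.

```python
import re

def column_number(letters):
    """Convert column letters (A, Z, AA, ...) to a 1-based column number."""
    number = 0
    for letter in letters:
        number = number * 26 + (ord(letter) - ord("A") + 1)
    return number

def column_letters(number):
    """Convert a 1-based column number back to column letters."""
    letters = ""
    while number > 0:
        number, remainder = divmod(number - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters

def expand_range(address):
    """List every cell in a range such as B2:C7, row by row."""
    match = re.fullmatch(r"([A-Z]+)([0-9]+):([A-Z]+)([0-9]+)", address)
    if match is None:
        raise ValueError(f"not a cell-to-cell range: {address}")
    first_col, last_col = column_number(match[1]), column_number(match[3])
    first_row, last_row = int(match[2]), int(match[4])
    if first_col > last_col or first_row > last_row:
        raise ValueError("range must run from upper left to lower right")
    return [
        f"{column_letters(col)}{row}"
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]

cells = expand_range("B2:C7")
print(len(cells), cells[0], cells[-1])
```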
Quiz Question
Which of the following is a valid range address? Check all that apply.
A. A1:A2
B. C8:A9
C. AA3:BB5
D. 9F:9G
E. E:G
F. 16:18
Correct! A1:A2 and AA3:BB5 are valid because they both start with an upper left cell address
and end with a lower right cell address. E:G and 16:18 are also valid… this was a bit of a trick
question. It is possible to define whole columns, such as columns E thru G, and whole rows, such
as rows 16 thru 18, as ranges.
Relative Address: The cell locations are relative to the answer cell that references them.
Copying or filling such locations will update the copied/fill location relative to the original (i.e.
dragging down from a cell using A1 would fill in A2 next). This is the default behavior.
Absolute Address: Fixed cell or range address that doesn't change when copied. Adding dollar
signs ($) to an address makes it absolute based on where the dollar signs are placed. A dollar
sign in front of the letter fixes a column, while the dollar sign in front of the number fixes the
row; if both are prefixed with a dollar sign, it is fixed to the exact cell.
Pro Tip: Use the F4 key to quickly toggle from relative to absolute addresses in Excel.
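The effect of the dollar signs when a formula is copied or filled can be imitated in Python. The sketch below is not how any spreadsheet is implemented; fill_reference and shift_column are hypothetical helpers that move the unlocked parts of a single-cell address by a row and column offset, assuming the result stays on the sheet.

```python
import re

def shift_column(letters, offset):
    """Move column letters right by offset columns (A + 1 -> B, Z + 1 -> AA)."""
    number = 0
    for letter in letters:
        number = number * 26 + (ord(letter) - ord("A") + 1)
    number += offset
    shifted = ""
    while number > 0:
        number, remainder = divmod(number - 1, 26)
        shifted = chr(ord("A") + remainder) + shifted
    return shifted

def fill_reference(reference, row_offset=0, col_offset=0):
    """Return the address a reference becomes after being filled by the offsets."""
    match = re.fullmatch(r"(\$?)([A-Z]+)(\$?)([0-9]+)", reference)
    if match is None:
        raise ValueError(f"not a single-cell address: {reference}")
    col_lock, column, row_lock, row = match.groups()
    if not col_lock:   # no $ before the letter: the column is relative
        column = shift_column(column, col_offset)
    if not row_lock:   # no $ before the number: the row is relative
        row = str(int(row) + row_offset)
    return f"{col_lock}{column}{row_lock}{row}"

# Fill one row down and one column to the right.
for reference in ("A1", "$A$1", "$A1", "A$1"):
    print(reference, "->", fill_reference(reference, row_offset=1, col_offset=1))
```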
Quiz Question
A. A1
B. $A$1
C. $A1
D. A$1
Answer: All
Correct! All of them are valid. In addition to fixing both the column and row as absolute, it is
possible to fix just the column or just the row.
The Home Menu and the Context Menu (available through a right-click) provide options to insert
cells, rows or columns.
Save Data
Always save your work! While Google Sheets will save each change to a cell for
you, most desktop applications, like Excel, require you to manually save.
The first time you save in Excel, you will have to "Save As", so that you can tell
Excel where to actually save the file. Afterward, it will normally "Save" to the
same location as the previous version.
Quiz Question
Which of the following file types can be usefully read by Excel and most other
spreadsheet applications?
Answer: A, B, C
Correct! Current and past Excel formats *.xlsx and *.xls, plus comma delimited
files, *.csv, are common types of files that can be read into spreadsheets.
Although *.json files can be read as text files, they will not automatically be
parsed correctly.
Recap
Text data
Math operations
Statistical functions
Duplicating rows
Splitting columns
Sorting data
Cell Formulas
Cell Formula: An expression, beginning with an equal sign (=), that defines the operations which
calculate the value for the cell. These may be a constant, contain mathematical operators or
functions, or may reference a single cell, a range of cells, or no cells.
In Excel, you can find different functions in the Formulas Library found on the Formulas tab of
the menu bar.
Quiz Question
Match the following functions with definitions. You may need to use "help"
in your spreadsheet application or look in the Formulas menu to find out more
about these functions.
These are the correct matches.
Definition Function
Estimates standard deviation based on a sample (ignores logical values and text).
Removes all spaces from a text string except single spaces between words.
TRIM(text)
SUBSTITUTE
Text String: String of letters, numbers, and punctuation that is not treated
numerically.
Pro Tip: Enter a formula by typing directly into a selected cell, beginning with "=".
This way is more direct and faster than the formula bar!
Quiz Question
Assume that you must write a formula for the B2 cell to get the result shown.
Choose all valid formulas for the B2 cell from the list below.
A. =SUBSTITUTE(A2,B5,B4)
B. =SUBSTITUTE(A2,B4,B5)
C. =SUBSTITUTE($A2,$B4,$B5)
D. =SUBSTITUTE($A2,B5,B4)
E. =SUBSTITUTE(A2,$B$4,$B$5)
Answer: B, C, E
Extract Text
FIND and LEFT can be used to extract text. FIND can be given a substring and a cell
to return the position in a string where the substring was found. LEFT can then be
used to extract a certain number of characters from a cell, starting from the left
side.
RIGHT therefore extracts from the right side, while MID can extract from some
starting point in the middle of a cell.
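The way FIND combines with LEFT and RIGHT can be imitated in Python. The functions below are hypothetical stand-ins written for this illustration (they are not the spreadsheet functions themselves), and positions are counted from 1 to match spreadsheet behavior.

```python
def find(find_text, within_text):
    """1-based position of find_text inside within_text."""
    position = within_text.find(find_text)
    if position == -1:
        raise ValueError(f"{find_text!r} not found")
    return position + 1

def left(text, num_chars):
    """First num_chars characters of text."""
    return text[:num_chars]

def right(text, num_chars):
    """Last num_chars characters of text."""
    return text[max(len(text) - num_chars, 0):]

name = "Sherlock Holmes"
space_at = find(" ", name)

first_word = left(name, space_at - 1)          # everything before the space
last_word = right(name, len(name) - space_at)  # everything after the space
print(space_at, first_word, last_word)
```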
In the exercise below, you'll combine what you've learned about the FIND and MID functions to
extract the first word found after the first occurrence of the word "data" in a series of sentences.
For example, in the following sentence, your target word is "is":
Some programming languages use 0-based indexing. However, Excel uses 1-based indexing. This
means that counting begins at the value of 1. Therefore, to find the index for the word data, we
would count as shown in the image below.
Then for the extraction for this quiz, since this extraction is in the middle of a
text string, you will use the function MID with the following syntax:
The exercise will step you through intermediate formulas to get to the final answer. As you
figure out the correct formula in the top cell in each column, you will be able to "fill" (or copy)
the cells down the column for all the cases you are given.
find_text
within_text
start_num
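The extraction described in this exercise can be sketched in Python. The helpers find and mid are hypothetical stand-ins that count positions from 1, as a spreadsheet does; their parameter names follow the labels listed above, with num_chars as an assumed name for the length argument. The sample sentence is also made up for illustration.

```python
def find(find_text, within_text, start_num=1):
    """1-based position of find_text in within_text, searching from start_num."""
    position = within_text.find(find_text, start_num - 1)
    if position == -1:
        raise ValueError(f"{find_text!r} not found")
    return position + 1

def mid(text, start_num, num_chars):
    """num_chars characters of text, starting at 1-based position start_num."""
    return text[start_num - 1 : start_num - 1 + num_chars]

def word_after(sentence, keyword="data"):
    """First word that follows the first occurrence of keyword."""
    # Step 1: where the keyword starts, then skip the keyword and one space.
    start = find(keyword, sentence) + len(keyword) + 1
    # Step 2: where the next space is (or the end of the sentence).
    try:
        end = find(" ", sentence, start)
    except ValueError:
        end = len(sentence) + 1
    # Step 3: extract the characters in between.
    return mid(sentence, start, end - start)

sentence = "Big data is everywhere"  # hypothetical example sentence
print(find("data", sentence), word_after(sentence))
```

Because Python itself uses 0-based indexing, the helpers subtract 1 internally; this is the same adjustment the paragraph on indexing describes.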
Reformat Text
CONCATENATE will join together two or more strings. It's important to note that this will not
automatically add spaces between them, so make sure to add spaces as formula parameters if you
need them.
TRIM will help to remove excess whitespace from a string.
PROPER sets the first letter of each word to upper case, with the rest lowercase.
UPPER sets all letters to upper case, while LOWER sets all letters to lowercase.
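Python's string methods offer rough equivalents of these functions, which can help when checking what a formula should return. The mapping below is an approximation chosen for this sketch: concatenate and trim are hypothetical helper names, and edge cases (for example, whitespace other than plain spaces) may behave differently in a spreadsheet.

```python
def concatenate(*parts):
    """Join strings exactly as given; no spaces are added automatically."""
    return "".join(parts)

def trim(text):
    """Drop leading/trailing whitespace and collapse repeated spaces between words."""
    return " ".join(text.split())

phrase = "it is a capital mistake"

print(concatenate("Sherlock", "Holmes"))       # no space unless you pass one
print(concatenate("Sherlock", " ", "Holmes"))
print(trim("  Sherlock    Holmes  "))
print(phrase.title())   # similar to PROPER
print(phrase.upper())   # similar to UPPER
print(phrase.lower())   # similar to LOWER
```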
Quiz Question
Match the following functions to their result when applied to this phrase:
- Sherlock Holmes
Result    Function
"It Is A Capital Mistake To Theorize Before One Has Data" - Sherlock Holmes    PROPER
"IT IS A CAPITAL MISTAKE TO THEORIZE BEFORE ONE HAS DATA" - SHERLOCK HOLMES    UPPER
"it is a capital mistake to theorize before one has data" - sherlock holmes    LOWER
Math Functions
Math operations are one of the most common spreadsheet usages. These are used similarly to
what one might expect (with a leading equals sign):
+ for addition
- for subtraction
* for multiplication
/ for division
There are also the functions SUM and AVERAGE, which behave as their names suggest -
summing or averaging two or more cells, numbers or a range of cells.
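For comparison, the same two calculations look like this in Python. The values are hypothetical and stand in for a row of cells such as B7 through F7.

```python
# Hypothetical values standing in for cells B7 through F7.
row_values = [4, 8, 15, 16, 23]

total = sum(row_values)              # like =SUM(B7:F7)
average = total / len(row_values)    # like =AVERAGE(B7:F7)

print(total, average)
```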
The following list has a series of steps for this exercise. As you complete each step, check it off
the list. The quizzes in the task list can be found below.
What is the formula you used in G7? (Include the equal sign in your answer.) Answer: =sum(b7:f7)
AVERAGE
Duplicate Rows
Clean Data: Data that is free of corrupt or inaccurate data items.
In Excel, under the Data tab, you can use the "Remove Duplicates" feature to remove duplicated
values.
If you are using Google Sheets, you can use Data cleanup.
Next, select Data has header row and choose the rows you want analyzed.
In this exercise, you'll open a data file and remove the duplicate rows.
Your data file is in comma-separated values format, or CSV format. The
file extension is .csv and the file can be read as plain text, unlike .xlsx and .xls formats. In a CSV
file, the first row may be column headers separated by commas, while all later rows are the data
rows. Each value that corresponds to a column is separated by a comma.
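The structure of a CSV file, and the idea of removing duplicate rows, can be sketched with Python's standard csv module. The text below is a made-up miniature file, not the contents of the exercise file, and the column names are assumptions.

```python
import csv
import io

# Hypothetical CSV text: a header row followed by data rows.
csv_text = """city,country
Tokyo,Japan
Delhi,India
Tokyo,Japan
Lima,Peru
"""

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)            # first row holds the column headers

unique_rows = []
seen = set()
for row in reader:
    key = tuple(row)             # a row is a duplicate if every value matches
    if key not in seen:
        seen.add(key)
        unique_rows.append(row)

print(header)
print(unique_rows)
```

This keeps the first occurrence of each row and preserves the original order.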
Spreadsheet applications are designed to easily open this type of file, and it is often used for
storing tabular data. For example, the following text can be seen in the exercise CSV file
named worldcities.csv by looking at it with a plain text editor such as Microsoft Notepad
on Windows or TextEdit on Apple Mac:
If you open the same file with a spreadsheet application such as Excel, the application will
automatically separate the columns for you:
Separated Data
Split Columns
Splitting data is useful when you have data such as first and last names, or City and State,
separated by a delimiter (such as a comma). In Excel, this can be done by adding a column to the
right of the one you want to split, and then in the Data tab, selecting "Text to Columns".
Make sure you add the extra column first, as existing data in the next column may otherwise be
overwritten.
If you are working in Google Sheets, you click Data > Split text to columns.
If you want to override the default separator, click in the Separator box. You can choose from
comma, semicolon, period, or space -- or you can select Custom to add your own separator.
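Splitting on a delimiter works the same way in code. The Python sketch below splits hypothetical "City, State" values into two columns; the data and variable names are made up for illustration.

```python
# Hypothetical single-column values with a comma delimiter.
city_state = ["Austin, TX", "Albany, NY", "Boise, ID"]

# Split each value once on the delimiter to get two columns.
split_rows = [value.split(", ", 1) for value in city_state]

cities = [row[0] for row in split_rows]
states = [row[1] for row in split_rows]
print(cities, states)
```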
Sort Data
Sorting is a very useful feature of spreadsheet applications. Select all your data, then on the Data
tab, click the "Sort" button. You can also choose which column to sort on, or even multiple
columns.
2. Click Data > Sort range > Advanced range sorting options.
You can select Add another sort column to add another sort level.
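Sorting on more than one column means sorting by the first column and using the next column to break ties. A minimal Python sketch, using hypothetical rows and column names:

```python
# Hypothetical rows with two columns.
rows = [
    {"city": "Austin", "state": "TX"},
    {"city": "Albany", "state": "NY"},
    {"city": "Dallas", "state": "TX"},
    {"city": "Buffalo", "state": "NY"},
]

# Sort by state first, then by city within each state.
sorted_rows = sorted(rows, key=lambda row: (row["state"], row["city"]))

for row in sorted_rows:
    print(row["state"], row["city"])
```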
Filter Data
Filter: A method to group data by selecting characteristics of one or more columns of a data set.
The filter method is used by clicking on the filter button, which has a little filter as its icon. You
can then select which items you want to filter down to. Make sure if you want to use a filter on
multiple columns that you clear old column filters no longer needed.
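Filtering keeps only the rows that match the selected values, and filters on several columns combine. The Python sketch below imitates this with hypothetical player data; the names and columns are assumptions made for illustration.

```python
# Hypothetical rows; each dictionary is one row of the data set.
players = [
    {"name": "Avery", "position": "Pitcher", "team": "Red"},
    {"name": "Blake", "position": "Catcher", "team": "Red"},
    {"name": "Casey", "position": "Pitcher", "team": "Blue"},
    {"name": "Devon", "position": "Pitcher", "team": "Red"},
]

# Filter on one column.
pitchers = [p for p in players if p["position"] == "Pitcher"]

# Filter on two columns at once: both conditions must hold.
red_pitchers = [p for p in pitchers if p["team"] == "Red"]

print([p["name"] for p in pitchers])
print([p["name"] for p in red_pitchers])
```

Forgetting that the first filter is still active is the code equivalent of leaving an old column filter in place.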
SPREADSHEET 4: VISUALIZE DATA
Pie Charts
Illustrating proportionality
A pie chart is used to illustrate proportionality. Think of it as slicing the pie into pieces, where
each piece matches a percentage of the whole list.
In spreadsheets, this is really easy, because all we need is a list of the categories and matching
values such as sums or counts.
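The proportions a pie chart displays are simply each category's share of the total. A short Python sketch with hypothetical category counts:

```python
# Hypothetical counts per category.
counts = {"Pitcher": 12, "Catcher": 3, "Infielder": 9, "Outfielder": 6}

total = sum(counts.values())

# Each slice of the pie is the category's percentage of the whole.
percentages = {category: 100 * count / total for category, count in counts.items()}

for category, percent in percentages.items():
    print(f"{category}: {percent:.1f}%")
```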
When the chart is selected, a design and format menu is available on the Excel ribbon at the top
of the page. The design menu gives numerous chart options and choices, such as specific
coloring or displaying percentages. You can also change the chart title.
Pivot table
Using the pivot table we created earlier with some careful selection, we want to highlight the
position categories in the top row and the totals in the bottom row. To do this:
2. While holding down the control key on Windows or the Command key on Apple
keyboards, select the bottom row with your mouse.
Bar Charts
We could use the same information as before and choose a bar or column chart instead of a pie
chart. Instead of percentages, it would just show the values with longer bars or columns
representing larger values.
Bar charts do not show percentages for each category
In the chart above, we're comparing the category values against each other and we see their
relative sizes. However, we do not have much sense of the whole league or the percentage of
each category as we did with the pie charts.
Choosing which kind of chart to use really depends on what patterns you want to highlight and
what questions you want to answer.
Use bar or column charts to compare category values with each other.
We use pie and bar charts to visualize categorical data. If we have a list of numerical data, such
as the list of stock prices over time, a line chart gives us a better picture of the data set.
Simple line charts
Using a table of data downloaded from a financial website, listing prices for AAPL stock, we can
explore line charts:
1. Notice that it has columns for date, open, high, low, close, and volume.
4. Move the chart to its own sheet to see the detail better.
7. Verify that the horizontal axis shows the dates, and the vertical axis shows dollar values.
8. Observe that over the past year the stock has gone up with a little hiccup about a month
ago.
To handle more than one column of data for the same dates:
2. Select the date plus the high and low values for AAPL stock.
3. Since the high and low aren't all that far apart, change the range for the dollar amount on
the left to start at 100.
5. Observe both the high and low-value lines now and see the spreads between them.
Scatterplot
1. To plot two different variables, closing price and volume, for AAPL stock, choose the
scatterplot.
2. Observe a graph with the closing price on the horizontal axis and the volume of trade that
day on the vertical axis.
3. Observe that the prices seem to cluster in a couple of areas, and that they have about the
same volume generally, though there are some high volume days at the lower price.
Quiz Question
Match the question posed to the type of chart to use (choose the best answer). We haven't talked
much about some of these yet. Just do your best.
What are the relative percentages of different fruits sold this month? Pie
How does the number of apples, oranges, and pears sold this month compare to each other? Bar
How has the price for AAPL stock changed over time? Line
Is there a relationship I can see between weight and age in a population? Scatter
What is the frequency of salaries by millions across all major league baseball players? Histogram
What is the distribution of my numerical dataset from minimum to maximum, including the 1st,
2nd, and 3rd quartiles? Box Plot
If the data of both variables move up together, they have a positive correlation, and this can
be seen in scatter plots, such as in the following plot of human height and weight data. We can
see that generally, as height increases, so does weight.
The line shown is the trend line which can be added in Excel by selecting the scatter chart,
then Design > Add Chart Element > Trendline > Linear.
If one variable increases as the other decreases, the two variables have a negative correlation, as
in the following plot of depth vs velocity in the Columbia River:
Adding a trendline in Google Sheets
In Google Sheets, you can create a trendline by clicking the three-dot menu in the chart to open
the Chart Editor. Then click Customize > Series. Check the Trendline box and select the type in
the dropdown menu.
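To make the ideas of positive correlation, negative correlation, and a linear trendline concrete, here is a minimal sketch that computes a correlation coefficient and a least-squares line by hand. The data values are invented for illustration, and the spreadsheet's own trendline feature may differ in details such as rounding.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: positive when xs and ys rise together, negative when one falls."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def linear_trendline(xs, ys):
    """Return (slope, intercept) of the least-squares straight line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


# Made-up sample data, not the data sets plotted in the lesson.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 88]
depths_m = [1, 2, 3, 4, 5]
velocities = [5.0, 4.1, 3.2, 2.4, 1.1]

print("height vs weight r:", round(pearson_r(heights_cm, weights_kg), 3))
print("depth vs velocity r:", round(pearson_r(depths_m, velocities), 3))
print("height/weight trendline (slope, intercept):", linear_trendline(heights_cm, weights_kg))
```

A trendline that slopes upward goes with a positive correlation, and one that slopes downward goes with a negative correlation.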
So far, we've used the Chart Design tab that appears when you select a chart to choose some pre-
made formatted styles.
In the top left corner of the design menu is the Quick Layout and the Add Chart Element menu
items:
Quick Layout menu is a set of pre-made layouts that affect which chart elements are
included and where, like the axis labels and the legend. You can select various ones and
see the effect.
Add Chart Element menu. With this menu, it's possible to add and remove elements of all
types from the chart. There is a lot here, and rather than try to learn all of it, we'll just use
some parts as we need them. It's good to know where to find it though, for those times
when you know what you want to change, and all those pre-made formats and layouts
just don't have what you need.
Quiz Question
Match the chart element description with its location on the sample chart above.
Chart Title C
Horizontal Axis F
Trend line B
Legend D
Grid Lines E
Histograms
A Histogram is a column chart that measures the frequency of data in a data set and specifically
groups numerical values into bins we define.
Recall that we previously created a column chart to compare counts of categories within
a data set. This kind of chart answers a question like: how many players are there in each
playing position in the league?
But what if we want to ask the question: how many players made under $1 million in
salary, and between $1 and $2 million, and between $2 and $3 million in salary? This
kind of chart is called a histogram, and the groupings we choose such as, 1) all salaries
between $1 and $2 million, and 2) salaries between $2 and $3 million, are the bins.
1. We'll start with a method that works on both Windows and Mac using the histogram tool
in the analysis tool pack add-in. Instructions for loading the analysis tool pack add-in are
given in the Getting Started instructions.
a. Choose data analysis from the data menu on Windows or from the tools menu on
Mac. Choose histogram, which opens a dialog.
b. For the input range, select the data from the salaries column.
c. For the bin range, select the bin intervals you've created.
e. For the output options, select new worksheet and chart output.
f. Press OK.
Insert chart
Available in Microsoft Excel. The tool pack histogram requires two columns of data: one
for the data you want to analyze, and one for the bin levels that define the intervals for
the bins. In the video example, I started at $1 million, then $2 million, and so on, up to
$15 million. When I create the histogram, the number of values in the salaries list that are
below $1 million will be in the first bin. The number of salaries between $1 and $2
million will be in the second bin, and so on.
a. Select your data and click insert, recommended charts, and choose the histogram
chart.
b. To configure details about the bins, right-click the horizontal axis of the chart,
click format axis and then click axis options.
c. The dialog provides options for binning by category, for categorical data like the
player positions, or automatically, for numerical data. You can also specify the number
of bins, which you might want to experiment with a bit. Note: If you choose bins that
are too narrow, the result can be noisy. On the other hand, too few bins will hide details.
d. As with other charts, the design and layout can be further customized from the
design menu when the chart is selected.
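The binning described in this section can be sketched in a few lines. This is only an illustration: the salary values are invented, and it assumes each bin level acts as an inclusive upper bound, with one extra bin for anything above the last level. Check how your spreadsheet treats values that fall exactly on a bin level.

```python
from bisect import bisect_left


def bin_counts(values, upper_edges):
    """Count values per bin, where upper_edges is a sorted list of bin levels.

    Bin i holds values greater than the previous level and at most upper_edges[i]
    (assumed convention). The final extra bin holds values above the last level.
    """
    counts = [0] * (len(upper_edges) + 1)
    for value in values:
        # bisect_left finds the first bin level that is >= value.
        counts[bisect_left(upper_edges, value)] += 1
    return counts


# Hypothetical salaries in millions of dollars.
salaries = [0.5, 0.9, 1.0, 1.5, 2.7, 3.2, 16.0]
bin_levels = [1, 2, 3]

counts = bin_counts(salaries, bin_levels)
labels = [f"up to {level}" for level in bin_levels] + ["more"]
for label, count in zip(labels, counts):
    print(f"{label}: {count}")
```

Trying a few different lists of bin levels with the same data is a quick way to see how bins that are too narrow or too wide change the shape of the histogram.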
Box Plots
A box plot, which in our case is really a box and whisker plot, is the visualization of statistical
spread in a data set of values.
A traditional box plot is built using the five-number summary. The five-number summary
consists of five values:
maximum
minimum
median
1st quartile
3rd quartile
The box represents the middle half of the data with a line where the median is.
Note: Excel gives us a bonus sixth number in the summary by placing an X at the mean, or
average value, of the set.
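The numbers behind a box plot can be computed directly. The sketch below uses Python's standard statistics module (version 3.8 or later) on a made-up list of values. Quartiles can be calculated by more than one convention, so the quartile values here may not match a given spreadsheet exactly. The "exclusive" method is used only as an assumption for the example.

```python
import statistics


def box_plot_summary(values):
    """Return the five-number summary plus the mean for a list of numbers."""
    # quantiles(..., n=4) returns the three cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return {
        "minimum": min(values),
        "1st quartile": q1,
        "median": median,
        "3rd quartile": q3,
        "maximum": max(values),
        "mean": statistics.mean(values),
    }


# Hypothetical data, not the salary list from the course.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = box_plot_summary(sample)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The box spans from the 1st quartile to the 3rd quartile, so the distance between those two values shows how spread out the middle half of the data is.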
4. Click the box and whisker chart. Remember that a box plot represents statistics for a
single list of numbers. So, each list you select will be represented by its own box plot.
5. Observe that the box plot visually gives a sense of the spread of the value list.
6. Adjust the range so that you can see the plots a little better, if needed.
Professional Presentations
Excel and other spreadsheet applications can do a nice job of creating attractive tables and charts.
It's up to us to make them look their absolute best, though. They should:
Be readable
Be interesting
Use fonts and layouts appropriately to include and emphasize the elements that matter
and exclude elements that don't.
Is this a quick overview presentation on a slide with the data backup elsewhere, or
is it a written technical review where more in-depth data should be presented?
Improving charts
For categorical column data, like the counts of the baseball positions, the relative sizes are easier
to see at a glance if the data is sorted by size. This would not be something we would want to do
for sequential data like daily stock values or histogram bins, but it is more readable in this case.
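Sorting categories by size before charting is a one-line operation in most tools. A minimal sketch, using invented position counts rather than the course data:

```python
# Hypothetical counts per category, in no particular order.
position_counts = {"Catcher": 3, "Pitcher": 12, "Outfielder": 4, "Infielder": 6}


def sort_by_size(counts):
    """Return (category, count) pairs ordered from largest to smallest count."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


ordered = sort_by_size(position_counts)
for name, count in ordered:
    # A simple text bar makes the relative sizes visible at a glance.
    print(f"{name:<10} {'#' * count} {count}")
```

The same sort should not be applied to dates or histogram bins, because their order carries meaning.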
There should be a prominent and descriptive title.
Pie chart
Since there is only one data series and the labels are already showing on the data in this pie
chart, an extra legend is just noise, so we remove it.
What about the axis labels? I hate it when I look at a graph and can't tell what the
numbers on the axis represent, so I would generally say put them there, but there are
always exceptions. If the units are redundant, somehow, they can be removed.
The font size is a judgment call, but bigger is easier to read and draws attention.
Color
Keep in mind that your chart may be printed in grayscale or viewed by someone who's
colorblind. So if it's possible to distinguish groupings in additional ways, such as different shapes
in scatter plots or dashes in lines, it's a best practice to do so. The Format Data Series
dialog has a number of options for changing the look of your graph. This was just a sampling
of some of the modifications you can make to your charts to give them the impact you want in a
professional presentation.