Lecture 2 - data wrangling_update (2)

The lecture focuses on data wrangling in data science, emphasizing the use of tabular data and the pandas library for data manipulation. Key concepts include DataFrames, Series, and methods for extracting, adding, modifying, and arranging data. The lecture also introduces the five essential verbs for data wrangling: filter, select, mutate, arrange, and summarise.


Lecture 2 : Data wrangling

2025 Spring Intro to Data Science


Congratulations!!!
You have collected or have been given
a box of data.

What does this "data" actually look like?


How will you work with it?
Data Scientists Love Tabular Data

• Tabular data = data in a table.

• Typically:
• A row represents one observation (here, a single person running for president in a particular year).
• A column represents some characteristic, or feature, of that observation (here, the political party of
that person).
Standard Python Data Science Tool: pandas

Using pandas, we can:

• Arrange data in a tabular format.


• Extract useful information filtered by specific conditions.
• Operate on data to gain new insights.
• Apply NumPy functions to our data.

Stands for "panel data"


DataFrames

• In the "language" of pandas, we call a table a DataFrame


• We think of DataFrames as collections of named columns, called Series.
Series - Custom Index

• We can provide index labels for items in a Series by passing an index list.

• A Series index can also be changed.
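As a small sketch of both points (the values here are made up for illustration):

```python
import pandas as pd

# Build a Series with a custom index by passing an index list
s = pd.Series([-1, 10, 2], index=["a", "b", "c"])

# The index can also be changed after creation
s.index = ["first", "second", "third"]
```

After the reassignment, `s["second"]` still refers to the value 10.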


Selection in Series

• We can select a single value or a set of values in a Series using:


• A single label
• A list of labels
• A filtering condition
Selection in Series

• Say we want to select values in the Series that satisfy a particular condition:
1. Apply a boolean condition to the Series. This creates a new Series of boolean values.
2. Index into our Series using this boolean condition. pandas will select only the entries in the
Series that satisfy the condition.
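The two steps above can be sketched on a toy Series (values made up):

```python
import pandas as pd

s = pd.Series([4, -2, 0, 6], index=["a", "b", "c", "d"])

# Step 1: a boolean condition creates a new Series of boolean values
mask = s > 0

# Step 2: indexing with that boolean Series keeps only the True entries
positive = s[mask]
```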
DataFrames of Series!

• Typically, we will work with Series using the perspective that they are columns in a DataFrame.

• We can think of a DataFrame as a collection of Series that all share the same Index.
Creating a DataFrame

• The syntax of creating DataFrame is:


pandas.DataFrame(data, index, columns)

• Many approaches exist for creating a DataFrame. Here, we will go over the
most popular ones.
• From a CSV file.
• Using a list and column name(s).
• From a dictionary.
• From a Series.
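A minimal sketch of these approaches (toy values; the CSV line is shown as a comment since it assumes a file such as the lecture's elections.csv exists on disk):

```python
import pandas as pd

# From a CSV file (assumes the file exists):
# elections = pd.read_csv("elections.csv")

# Using a list and column name(s)
df_list = pd.DataFrame([1, 2, 3], columns=["Number"])

# From a dictionary
df_dict = pd.DataFrame({"Fruit": ["Strawberry", "Orange"], "Price": [5.49, 3.99]})

# From a Series
s = pd.Series(["a", "b"], index=["r1", "r2"])
df_series = s.to_frame(name="Letter")
```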
Indices Are Not Necessarily Row Numbers

An Index (a.k.a. row labels) can also:

• Be non-numeric.

• Have a name, e.g. "Candidate".


Indices Are Not Necessarily Unique

The row labels that constitute an index do not have to be unique.


• Left: The index values are all unique and numeric, acting as a row number.
• Right: The index values are named and non-unique.
Modifying Indices

• We can select a new column and set it as the index of the DataFrame.

• Example: Setting the index to the "Party" column.


Resetting the Index

• We can change our mind and reset the Index back to the default list of integers.
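Setting and resetting the index can be sketched on a toy stand-in for the lecture's elections table (values made up):

```python
import pandas as pd

elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Party": ["Republican", "Democratic", "Independent"],
})

# Set the "Party" column as the index...
by_party = elections.set_index("Party")

# ...and reset back to the default list of integers
restored = by_party.reset_index()
```

Note that `reset_index` turns the old index back into a regular column.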
Column Names Are Usually Unique!

• Column names in pandas are almost always unique.


• Example: We really shouldn't have two columns named "Candidate".
Retrieving the Index, Columns, and shape

• Sometimes you'll want to extract the list of row and column labels.

• For row labels, use DataFrame.index:

• For column labels, use DataFrame.columns:

• For the shape of the DataFrame, use DataFrame.shape:
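On a small made-up DataFrame, the three attributes look like this:

```python
import pandas as pd

df = pd.DataFrame({"Candidate": ["Biden", "Trump"], "Year": [2020, 2020]})

rows = df.index      # the row labels (here a default RangeIndex)
cols = df.columns    # the column labels
shape = df.shape     # (number of rows, number of columns)
```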


The Relationship Between DataFrames, Series, and Indices

• We can think of a DataFrame as a collection of Series that all share the


same Index.
• Candidate, Party, %, Year, and Result Series all share an Index from 0 to
5.
Data Science Lifecycle
Recall the Data Science Lifecycle

Data wrangling / visualization


Intro to Data Wrangling

• Raw data is often messy, incomplete, and inconsistent

• Data wrangling is the process of cleaning, transforming, and organizing data into a
structured and usable format for analysis.

• The purpose is to ensure that data is accurate, consistent, and ready for modeling and
visualization.
Intro to data wrangling

• Hadley Wickham says that these five verbs help solve 90% of data wrangling challenges:
• filter: select rows (observations) in a data frame;
• select: select columns (variables) in a data frame;
• mutate: add new columns to a data frame;
• arrange: reorder rows in a data frame;
• summarise: collapses a data frame to a single row;

• We will cover essential tools for implementing these verbs using pandas.

* Grammar of Data Manipulation with R (2020)


1. Extracting Data (filter, select)

• One of the most basic tasks for manipulating a DataFrame is to extract rows and columns of
interest.
1. Extracting Data (filter, select)

• Common ways we may want to extract data:


• Grab the first or last k rows in the DataFrame.
• Grab data with a certain label.
• Grab data at a certain position.

• We'll find that all three of these methods are useful to us in data
manipulation tasks.
.head and .tail

• The simplest scenarios: We want to extract the first or last n rows from the
DataFrame.
• df.head(k) will return the first k rows of the DataFrame df.
• df.tail(k) will return the last k rows.
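A quick sketch on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

first3 = df.head(3)   # the first 3 rows: 0, 1, 2
last2 = df.tail(2)    # the last 2 rows: 8, 9
```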
Label-based Extraction: .loc

• A more complex task: We want to extract data with specific column or index labels.

• The .loc accessor allows us to specify the labels of rows and columns we wish to extract.

• We describe "labels" as the bolded text at the top and left of a DataFrame.
Label-based Extraction: .loc

• Arguments to .loc can be:


• A list.
• A slice (syntax is inclusive of the right hand side of the slice).
• A single value.

A series

A single string
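These argument types can be sketched on a toy stand-in for the lecture's elections table (values made up):

```python
import pandas as pd

elections = pd.DataFrame({
    "Year": [1980, 1984, 1988, 1992],
    "Candidate": ["Reagan", "Reagan", "Bush", "Clinton"],
    "Result": ["win", "win", "win", "win"],
})

# A list of row labels and a list of column labels -> DataFrame
sub = elections.loc[[0, 2], ["Year", "Candidate"]]

# A slice (inclusive of the right-hand side!) -> DataFrame
sliced = elections.loc[0:2, "Year":"Candidate"]

# A single row label and a single column label -> a single value
value = elections.loc[3, "Candidate"]

# All rows of a single column label -> a Series
col = elections.loc[:, "Candidate"]
```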
Integer-based Extraction: .iloc

• Arguments to .iloc can be:


• A list.
• A slice (syntax is exclusive of the right hand side of the slice).
• A single value.
Integer-based Extraction: .iloc

• Arguments to .iloc can be:


• A list.
• A slice (syntax is exclusive of the right hand side of the slice).
• A single value.

elections.iloc[[1, 2, 3], [0, 1, 2]]


Integer-based Extraction: .iloc

• Arguments to .iloc can be:


• A list.
• A slice (syntax is exclusive of the right hand side of the slice).
• A single value.

elections.iloc[[1, 2, 3], 0:3]


Integer-based Extraction: .iloc

• Arguments to .iloc can be:


• A list.
• A slice (syntax is exclusive of the right hand side of the slice).
• A single value.

elections.iloc[:, 0:3]
Integer-based Extraction: .iloc

• Arguments to .iloc can be:


• A list.
• A slice (syntax is exclusive of the right hand side of the slice).
• A single value.

elections.iloc[[1, 2, 3], 1]

elections.iloc[0, 1]
.loc vs .iloc

• Remember:
• .loc performs label-based extraction
• .iloc performs integer-based extraction

• When choosing between .loc and .iloc, you'll usually choose .loc.
• Safer: If the order of data gets shuffled in a public database, your code still works.
• Readable: Easier to understand what elections.loc[:, ["Year", "Candidate", "Result"]]
means than elections.iloc[:, [0, 1, 4]]

• .iloc can still be useful.


• Example: If you have a DataFrame of movie earnings sorted by earnings, you can use .iloc to get the
median earnings for a given year (index into the middle).
Context-dependent Extraction: [ ]

• Selection operators:
• .loc selects items by label. First argument is rows, second argument is columns.
• .iloc selects items by integer. First argument is rows, second argument is columns.
• [] only takes one argument, which may be:
• A slice of row numbers.
• A list of column labels.
• A single column label.

• That is, [] is context sensitive.


Context-dependent Extraction: [ ]

• [ ] only takes one argument, which may be:


• A slice of row integers.
• A list of column labels.
• A single column label.

elections[3:7]
Context-dependent Extraction: [ ]

• [ ] only takes one argument, which may be:


• A slice of row integers.
• A list of column labels.
• A single column label.

elections[["Year", "Candidate", "Result"]]


Context-dependent Extraction: [ ]

• [ ] only takes one argument, which may be:


• A slice of row integers.
• A list of column labels.
• A single column label.

elections["Candidate"]
Why Use []?

• In short: [ ] can be much more concise than .loc or .iloc

• Consider the case where we wish to extract the "Candidate" column. It is far simpler to write
elections["Candidate"] than it is to write elections.loc[:, "Candidate"]

• In practice, [ ] is often used over .iloc and .loc in data science work. Typing time adds up!
Boolean Array Input for .loc and [ ]

• We learned to extract data according to its integer position (.iloc) or its label (.loc)
• What if we want to extract rows that satisfy a given condition?
• .loc and [ ] also accept boolean arrays as input.
• Rows corresponding to True are extracted; rows corresponding to False are not.

babynames_first_10_rows = babynames.loc[:9, :]
Boolean Array Input for .loc and [ ]

We can perform the same operation using .loc.

babynames_first_10_rows.loc[[True, False, True, False, True, False, True, False, True, False], :]
Boolean Array Input

Useful because boolean arrays can be generated by using logical operators on Series.
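A sketch on a toy stand-in for the lecture's babynames table (values made up):

```python
import pandas as pd

babynames = pd.DataFrame({
    "Name": ["Emma", "Liam", "Olivia"],
    "Sex": ["F", "M", "F"],
    "Count": [500, 450, 480],
})

# A logical operator on a Series produces a boolean Series...
is_female = babynames["Sex"] == "F"

# ...which can be passed to [ ] (or .loc) to keep the matching rows
female_rows = babynames[is_female]
```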
Boolean Array Input

Can also use .loc.


Boolean Array Input

• Boolean Series can be combined using various operators, allowing filtering of results by
multiple criteria.
• The & operator keeps rows where condition_1 and condition_2 both hold.
• The | operator keeps rows where condition_1 or condition_2 holds.

babynames[(babynames["Sex"] == "F") | (babynames["Year"] < 2000)]


Bitwise Operators

• & and | are examples of bitwise operators. They allow us to apply multiple logical conditions.

• If p and q are boolean arrays or Series:


• Boolean array selection is a useful tool, but can lead to overly verbose code for complex conditions.
babynames[(babynames["Name"] == "Bella") | (babynames["Name"] == "Alex") |
(babynames["Name"] == "Narges") | (babynames["Name"] == "Lisa")]

• pandas provides many alternatives, for example:


• .isin
names = ["Bella", "Alex", "Narges", "Lisa"]
babynames[babynames["Name"].isin(names)]
• .str.startswith

babynames[babynames["Name"].str.startswith("N")]
2. Adding Data

• Remember the five verbs


• filter: select rows (observations) in a data frame;
• select: select columns (variables) in a data frame;
• mutate: add new columns to a data frame;
• arrange: reorder rows in a data frame;
• summarise: collapses a data frame to a single row;
Syntax for Adding a Column

• Adding a column is easy:


1. Use [ ] to reference the desired new column.
2. Assign this column to a Series or array of the appropriate length.
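The two steps can be sketched on a toy babynames table (values made up); here the new column is derived from an existing one:

```python
import pandas as pd

babynames = pd.DataFrame({"Name": ["Emma", "Liam"], "Count": [500, 450]})

# 1. Use [ ] to reference the desired new column.
# 2. Assign it a Series/array of the appropriate length.
babynames["name_lengths"] = babynames["Name"].str.len()
```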
Syntax for Modifying a Column

• Modifying a column is very similar to adding a column.


1. Use [ ] to reference the existing column.
2. Assign this column to a new Series or array of the appropriate length.

# Modify the "name_lengths" column to be one less than its original value
babynames["name_lengths"] = babynames["name_lengths"]-1
Syntax for Renaming a Column

• Rename a column using the (creatively named) .rename() method.

• .rename() takes in a dictionary that maps old column names to their new ones.
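A minimal sketch (toy data):

```python
import pandas as pd

babynames = pd.DataFrame({"Name": ["Emma"], "name_lengths": [4]})

# .rename takes a dictionary mapping old column names to new ones.
# Like most pandas methods, it returns a modified copy.
babynames = babynames.rename(columns={"name_lengths": "Length"})
```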
Syntax for Dropping a Column (or Row)

• Remove columns using the (also creatively named) .drop() method.

• The .drop() method assumes you're dropping a row by default. Use axis="columns" to drop a column
instead.

babynames = babynames.drop("Length", axis="columns")


An Important Note: DataFrame Copies
• Notice that we re-assigned babynames to an updated value on the previous slide.
babynames = babynames.drop("Length", axis="columns")
• By default, pandas methods create a copy of the DataFrame, without changing the original
DataFrame at all.

• To apply our changes, we must update our DataFrame to this new, modified copy.
3. Arrange rows

• Remember the five verbs


• filter: select rows (observations) in a data frame;
• select: select columns (variables) in a data frame;
• mutate: add new columns to a data frame;
• arrange: reorder rows in a data frame;
• summarise: collapses a data frame to a single row;
.sort_values()
• The DataFrame.sort_values and Series.sort_values methods sort a DataFrame (or Series).

• Series.sort_values( ) will automatically sort all values in the Series.


• DataFrame.sort_values(column_name) must specify the name of the column to be used for
sorting.

babynames["Name"].sort_values()
babynames.sort_values(by="Count", ascending=False)


Sorting By Length

• Suppose we want to sort names by their length. Let's try a few different approaches.


Approach 1: Create a Temporary Column and Sort Based on the New Column

• Add a column holding each name's length, then sort the DataFrame by it as usual.


Approach 2: Sorting Using the key Argument

Approach 3: Use Series.map to Create a Column for Sorting

• Suppose we want to sort by the number of occurrences of "dr"s and "ea"s.

• Use the Series.map method.

4. Summarize data

• Remember the five verbs


• filter: select rows (observations) in a data frame;
• select: select columns (variables) in a data frame;
• mutate: add new columns to a data frame;
• arrange: reorder rows in a data frame;
• summarise: collapses a data frame to a single row;
Why Group?

Our goal:

• Group together rows that fall under the same category.


• For example, group together all rows from the same year.
• Perform an operation that aggregates across all rows in the category.
• For example, sum up the total number of babies born in that year.
.groupby()

• A .groupby() operation involves some combination of splitting the object, applying a


function, and combining the results.
• Calling .groupby() generates DataFrameGroupBy objects → "mini" sub-DataFrames
• Each subframe contains all rows that correspond to the same group (here, a particular year)
.groupby().agg()

• We cannot work directly with DataFrameGroupBy objects! The diagram below is to help understand what goes on
conceptually – in reality, we can't "see" the result of calling .groupby.
• Instead, we transform a DataFrameGroupBy object back into a DataFrame using .agg
• .agg is how we apply an aggregation operation to the data.
Putting it all together

dataframe.groupby(column_name).agg(aggregation_function)
• babynames[["Year", "Count"]].groupby("Year").agg(sum) computes the total number of
babies born in each year.
Alternatives …

• Now, we create groups for each year.

babynames.groupby("Year")[["Count"]].agg(sum)
or
babynames.groupby("Year")[["Count"]].sum()
or
babynames.groupby("Year").sum(numeric_only=True)
Concluding groupby.agg

• A groupby operation involves some combination of splitting the object, applying a function, and combining the
results.

• So far, we've seen that df.groupby("Year").agg(sum):


• Split df into sub-DataFrames based on Year.
• Apply the sum function to each column of each sub-DataFrame.
• Combine the results of sum into a single DataFrame, indexed by Year.
Groupby Review Question
Answer
Case Study: Name "Popularity"

• Goal: Find the baby name with sex "F" that has fallen in popularity the most in California.

f_babynames = babynames[babynames["Sex"]=="F"]
f_babynames = f_babynames.sort_values(["Year"])
jenn_counts_series = f_babynames[f_babynames["Name"] == "Jennifer"]["Count"]

Number of Jennifers Born in California Per Year.


What Is "Popularity"?

Goal: Find the baby name with sex "F" that has fallen in popularity the most in California.

How do we define "fallen in popularity?"

• Let’s create a metric: "Ratio to Peak" (RTP).


• The RTP is the ratio of babies born with a given name in 2022 to the maximum number of babies born
with that name in any year.

Example for "Jennifer":

• In 1972, we hit peak Jennifer. 6,065 Jennifers were born.


• In 2022, there were only 114 Jennifers.
• RTP is 114 / 6065 = 0.018796372629843364.
Calculating RTP

max_jenn = max(f_babynames[f_babynames["Name"] == "Jennifer"]["Count"])
# 6065

curr_jenn = f_babynames[f_babynames["Name"] == "Jennifer"]["Count"].iloc[-1]
# 114
# Remember: f_babynames is sorted by year, so .iloc[-1] means "grab the latest year"

rtp = curr_jenn / max_jenn
# 0.018796372629843364

def ratio_to_peak(series):
    return series.iloc[-1] / max(series)

jenn_counts_ser = f_babynames[f_babynames["Name"] == "Jennifer"]["Count"]
ratio_to_peak(jenn_counts_ser)
# 0.018796372629843364
Calculating RTP Using .groupby()

• .groupby() makes it easy to compute the RTP for all names at once!

rtp_table = f_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)
Renaming Columns After Grouping

• By default, .groupby will not rename any aggregated columns (the column is still named "Count", even
though it now represents the RTP).

• For better readability, we may wish to rename "Count" to "Count RTP".

rtp_table = f_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)
rtp_table = rtp_table.rename(columns={"Count":"Count RTP"})
Some Data Science Payoff

• By sorting rtp_table we can see the names whose popularity has decreased the most.

rtp_table.sort_values("Count RTP")
Some Data Science Payoff

• We can get the list of the top 10 names and then plot popularity with:

top10 = rtp_table.sort_values("Count RTP").head(10).index


Raw GroupBy Objects and Other Methods

•The result of a groupby operation applied to a DataFrame is a DataFrameGroupBy object.

•It is not a DataFrame!


grouped_by_year = elections.groupby("Year")
type(grouped_by_year)

• Given a DataFrameGroupBy object, we can use various functions to generate DataFrames (or Series). .agg is
only one choice:
groupby.size() and groupby.count()
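A sketch of both methods on a toy elections table (values made up):

```python
import pandas as pd

elections = pd.DataFrame({
    "Year": [1980, 1980, 1984],
    "Candidate": ["Reagan", "Carter", "Reagan"],
    "%": [50.7, 41.0, 58.8],
})

grouped = elections.groupby("Year")  # a DataFrameGroupBy object, not a DataFrame

sizes = grouped.size()    # number of rows per group -> a Series indexed by Year
counts = grouped.count()  # number of non-null entries per column, per group
```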
Filtering by Group

• Another common use for groups is to filter data.

• groupby.filter takes an argument func.


• func is a function that:
• Takes a DataFrame as input.
• Returns either True or False.
• filter applies func to each group/sub-DataFrame:
• If func returns True for a group, then all rows belonging to the group are preserved.
• If func returns False for a group, then all rows belonging to that group are filtered out.
• Notes:
• Filtering is done per group, not per row. Different from boolean filtering.
• Unlike agg(), the column we grouped on does NOT become the index!
groupby.filter()
Filtering Elections Dataset

• Going back to the elections dataset.

• Let's keep only election year results where the max '%' is less than 45%.
elections.groupby("Year").filter(lambda sf: sf["%"].max() < 45)
groupby Quiz

• We want to know the best election by each party.


• Best election: The election with the highest % of votes.
• For example, Democrat’s best election was in 1964, with candidate Lyndon Johnson
winning 61.3% of votes.
Attempt #1

• Why does the table seem to claim that Woodrow Wilson won the presidency in 2020?

elections.groupby("Party").max().head(10)
Problem with Attempt #1

• Why does the table seem to claim that Woodrow Wilson won the presidency in 2020?
• Every column is calculated independently! Among Democrats:
• Last year they ran: 2020.
• Alphabetically the latest candidate name: Woodrow Wilson.
• Highest % of vote: 61.34%.
Attempt #2: Motivation

• We want to preserve entire rows, so we need an aggregate function that does that.
Attempt #2: Solution

• First sort the DataFrame so that rows are in descending order of %.

• Then group by Party and take the first item of each sub-DataFrame.
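The two steps can be sketched on a toy elections table (values made up); `.first()` keeps each group's first entire row:

```python
import pandas as pd

elections = pd.DataFrame({
    "Party": ["Democratic", "Democratic", "Republican"],
    "Candidate": ["Johnson", "Carter", "Reagan"],
    "%": [61.34, 50.08, 58.77],
})

# Sort so each party's best election comes first, then take
# the first row of each sub-DataFrame
best = (elections.sort_values("%", ascending=False)
                 .groupby("Party")
                 .first())
```

Because whole rows are preserved, the candidate and the % in each result row come from the same election.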
Grouping by Multiple Columns

• Suppose we want to build a table showing the total number of babies born of each sex in each year.

• One way is to groupby using both columns of interest:

babynames.groupby(["Year", "Sex"])[["Count"]].agg(sum).head(6)
Pivot function

• pivot() reshapes a DataFrame, but it cannot aggregate. When a group may contain multiple entries,
use pivot_table() with an aggfunc:

babynames_pivot = babynames.pivot_table(
    index = "Year",       # rows (turned into index)
    columns = "Sex",      # column values
    values = ["Count"],   # field(s) to aggregate in each group
    aggfunc = "sum",      # group operation
)
babynames_pivot.head(6)
groupby(["Year", "Sex"]) vs. pivot

• The pivot() output more naturally represents our data.


Pivot output

babynames.groupby(["Year", "Sex"])[["Count"]].agg(sum).head(6)
Pivot Tables with Multiple Values

• pivot_table() works like pivot(), but it can aggregate and can take multiple columns as values.

babynames_pivot = babynames.pivot_table(
    index = "Year",       # rows (turned into index)
    columns = "Sex",      # column values
    values = ["Count", "Name"],
    aggfunc = "max",      # group operation
)
babynames_pivot.head(6)
Pivot Table Mechanics
Where are we?

• 5 verbs for data manipulation

• Joining tables

• Tidy data
Joining Tables

• Suppose we want to know the popularity of presidential candidates' names in 2022.


• Example: Dwight Eisenhower's name Dwight is not popular today, with only 5 babies born
with this name in California in 2022.

• To solve this problem, we’ll have to join tables.


Creating Table 1: Babynames in 2022

• Let's set aside names in California from 2022 first:

babynames_2022 = babynames[babynames["Year"] == 2022]


babynames_2022
Creating Table 2: Presidents with First Names

• To join our table, we'll also need to set aside the first names of each candidate.

elections["First Name"] = elections["Candidate"].str.split().str[0]
Joining Our Tables: Two Options

merged = pd.merge(left = elections, right = babynames_2022, left_on = "First Name", right_on = "Name")
merged = elections.merge(right = babynames_2022, left_on = "First Name", right_on = "Name")
Tidy data

• You can represent the same underlying data in multiple ways.


• Let's look at an example table that records: country, year, population, and number of
documented cases of TB (tuberculosis).
• Which one is the best representation?
Another example

<Table 1>

<Table 2>
Tidy data

• Tidy data is data where:


• 1. Each variable is in a column.
• 2. Each observation is a row.
• 3. Each value is a cell.
Tidy data
Tidy data –cont.

• Why tidy data?


• 1. You can easily add observations
• 2. You can easily add columns
• 3. Many python packages are designed with tidy data in mind
Two verbs for tidy data

<pivot>

• It makes "long" data wider.
• Useful when observations are spread over multiple rows.

<melt>

• It makes "wide" data longer.
• Useful when multiple observations are stored in one row.
melt()
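Both verbs can be sketched on a small TB-style table (the country/year counts here are illustrative, patterned after the tidyr TB example):

```python
import pandas as pd

# "Wide" data: one row per country, one column per year
wide = pd.DataFrame({
    "country": ["Afghanistan", "Brazil"],
    "1999": [745, 37737],
    "2000": [2666, 80488],
})

# melt makes wide data longer: the year headers become values in a column
long = wide.melt(id_vars="country", var_name="year", value_name="cases")

# pivot makes long data wider again
back = long.pivot(index="country", columns="year", values="cases")
```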