Lecture 2 - data wrangling_update (2)
• Typically:
• A row represents one observation (here, a single person running for president in a particular year).
• A column represents some characteristic, or feature, of that observation (here, the political party of
that person).
Standard Python Data Science Tool: pandas
• We can provide index labels for items in a Series by passing an index list.
• Say we want to select values in the Series that satisfy a particular condition:
1. Apply a boolean condition to the Series. This creates a new Series of boolean values.
2. Index into our Series using this boolean condition. pandas will select only the entries in the
Series that satisfy the condition.
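The two steps above can be sketched as follows, using a small hypothetical Series of vote percentages:

```python
import pandas as pd

# Hypothetical Series with index labels
s = pd.Series([45.2, 60.1, 38.5], index=["A", "B", "C"])

# Step 1: the condition produces a new Series of booleans
mask = s > 40

# Step 2: indexing with the boolean Series keeps only the True entries
result = s[mask]
```

Here `result` contains only the entries for labels "A" and "B", since their values exceed 40.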
DataFrames of Series!
• Typically, we will work with Series using the perspective that they are columns in a DataFrame.
• We can think of a DataFrame as a collection of Series that all share the same Index.
Creating a DataFrame
• Many approaches exist for creating a DataFrame. Here, we will go over the
most popular ones.
• From a CSV file.
• Using a list and column name(s).
• From a dictionary.
• From a Series.
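A minimal sketch of each construction approach, using made-up data (the CSV path is a placeholder):

```python
import pandas as pd

# From a dictionary: keys become column names
df_dict = pd.DataFrame({"Candidate": ["Biden", "Trump"], "Year": [2020, 2020]})

# From a list plus column name(s)
df_list = pd.DataFrame([["Biden", 2020], ["Trump", 2020]],
                       columns=["Candidate", "Year"])

# From a Series: the Series name becomes the column name
s = pd.Series([2016, 2020], name="Year")
df_series = s.to_frame()

# From a CSV file (path is hypothetical):
# df_csv = pd.read_csv("elections.csv")
```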
Creating a DataFrame
• The row labels (the Index) of a DataFrame need not be integers; they can be non-numeric, such as strings.
• We can select a new column and set it as the index of the DataFrame.
• We can change our mind and reset the Index back to the default list of integers.
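A short sketch of setting and resetting the index, on a hypothetical table:

```python
import pandas as pd

df = pd.DataFrame({"Candidate": ["Biden", "Trump"], "Year": [2020, 2020]})

# Promote an existing column to be the index
df = df.set_index("Candidate")

# Change our mind: restore the default integer index
df = df.reset_index()
```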
Column Names Are Usually Unique!
• Sometimes you'll want to extract the list of row and column labels.
• Data wrangling is the process of cleaning, transforming, and organizing data into a
structured and usable format for analysis.
• The purpose is to ensure that data is accurate, consistent, and ready for modeling and
visualization.
Intro to data wrangling
• Hadley Wickham says that five verbs solve about 90% of data manipulation challenges:
• filter: select rows (observations) in a data frame;
• select: select columns (variables) in a data frame;
• mutate: add new columns to a data frame;
• arrange: reorder rows in a data frame;
• summarise: collapse a data frame to a single row.
• We will cover essential tools for implementing these verbs using pandas.
• One of the most basic tasks for manipulating a DataFrame is to extract rows and columns of
interest.
1. Extracting Data (filter, select)
• We'll find that all three of these methods (.loc, .iloc, and [ ]) are useful to us in data manipulation tasks.
.head and .tail
• The simplest scenario: we want to extract the first or last n rows from the DataFrame.
• df.head(k) will return the first k rows of the DataFrame df.
• df.tail(k) will return the last k rows.
Label-based Extraction: .loc
• A more complex task: We want to extract data with specific column or index labels.
• The .loc accessor allows us to specify the labels of rows and columns we wish to extract.
• We describe "labels" as the bolded text at the top and left of a DataFrame.
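A minimal sketch of .loc on a hypothetical mini elections table (note that label slices are inclusive of both endpoints):

```python
import pandas as pd

elections = pd.DataFrame({
    "Year": [1980, 1984],
    "Candidate": ["Reagan", "Mondale"],
    "Result": ["win", "loss"],
})

# Rows by index label (inclusive slice), columns by a list of names
sub = elections.loc[0:1, ["Year", "Candidate"]]

# A single row label and a single column label gives back one value
val = elections.loc[1, "Candidate"]
```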
Label-based Extraction: .loc
• Depending on its arguments, .loc can return a DataFrame, a Series, or a single value (such as a string).
Integer-based Extraction: .iloc
elections.iloc[:, 0:3]
Integer-based Extraction: .iloc
elections.iloc[[1, 2, 3], 1]
elections.iloc[0, 1]
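The calls above return different types depending on the arguments. A sketch on a hypothetical mini table:

```python
import pandas as pd

elections = pd.DataFrame({
    "Year": [1976, 1980, 1984, 1988],
    "Candidate": ["Carter", "Reagan", "Mondale", "Bush"],
})

block = elections.iloc[:, 0:2]      # a DataFrame: all rows, first two columns
col = elections.iloc[[1, 2, 3], 1]  # a Series: rows 1-3 of column 1
cell = elections.iloc[0, 1]         # a single value: one row, one column
```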
.loc vs .iloc
• Remember:
• .loc performs label-based extraction
• .iloc performs integer-based extraction
• When choosing between .loc and .iloc, you'll usually choose .loc.
• Safer: If the order of data gets shuffled in a public database, your code still works.
• Readable: Easier to understand what elections.loc[:, ["Year", "Candidate", "Result"]]
means than elections.iloc[:, [0, 1, 4]]
• Selection operators:
• .loc selects items by label. First argument is rows, second argument is columns.
• .iloc selects items by integer. First argument is rows, second argument is columns.
• [] only takes one argument, which may be:
• A slice of row numbers.
• A list of column labels.
• A single column label.
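The three argument forms above can be sketched on a hypothetical mini table:

```python
import pandas as pd

elections = pd.DataFrame({
    "Year": [1976, 1980, 1984, 1988],
    "Candidate": ["Carter", "Reagan", "Mondale", "Bush"],
})

rows = elections[1:3]                    # a slice of row numbers -> DataFrame
cols = elections[["Year", "Candidate"]]  # a list of column labels -> DataFrame
one = elections["Candidate"]             # a single column label -> Series
```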
elections[3:7]
Context-dependent Extraction: [ ]
elections["Candidate"]
Why Use []?
• Consider the case where we wish to extract the "Candidate" column. It is far simpler to write
elections["Candidate"] than it is to write elections.loc[:, "Candidate"]
• In practice, [ ] is often used over .iloc and .loc in data science work. Typing time adds up!
Boolean Array Input for .loc and [ ]
• We learned to extract data according to its integer position (.iloc) or its label (.loc)
• What if we want to extract rows that satisfy a given condition?
• .loc and [ ] also accept boolean arrays as input.
• Rows corresponding to True are extracted; rows corresponding to False are not.
babynames_first_10_rows = babynames.loc[:9, :]
Boolean Array Input for .loc and [ ]
• This is useful because boolean arrays can be generated by applying comparison operators to a Series.
Boolean Array Input
• Boolean Series can be combined using various operators, allowing filtering of results by
multiple criteria.
• The & operator combines two conditions with a logical and.
• The | operator combines two conditions with a logical or.
• & and | are bitwise operators; because they bind tightly, wrap each condition in parentheses when combining them.
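A sketch of combining conditions with & and |, on a hypothetical babynames table (note the parentheses around each condition):

```python
import pandas as pd

babynames = pd.DataFrame({
    "Name": ["Nora", "Liam", "Nina"],
    "Count": [120, 300, 80],
})

# Rows where BOTH conditions hold
both = babynames[(babynames["Name"].str.startswith("N")) & (babynames["Count"] > 100)]

# Rows where EITHER condition holds
either = babynames[(babynames["Name"].str.startswith("N")) | (babynames["Count"] > 100)]
```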
babynames[babynames["Name"].str.startswith("N")]
2. Adding Data
# Modify the "name_lengths" column to be one less than its original value
babynames["name_lengths"] = babynames["name_lengths"] - 1
Syntax for Renaming a Column
•.rename() takes in a dictionary that maps old column names to their new ones.
Syntax for Dropping a Column (or Row)
•The .drop() method assumes you're dropping a row by default. Use axis="columns" to drop a column
instead.
•To apply our changes, we must update our DataFrame to this new, modified copy.
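A sketch of renaming and dropping a column on a hypothetical table; note that both methods return a modified copy, so we reassign to keep the change:

```python
import pandas as pd

babynames = pd.DataFrame({"Name": ["Nora", "Liam"], "Count": [120, 300]})

# .rename takes a dictionary mapping old column names to new ones
babynames = babynames.rename(columns={"Count": "Births"})

# .drop removes rows by default; use axis="columns" for a column
babynames = babynames.drop("Births", axis="columns")
```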
3. Arrange rows
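Arranging (reordering) rows is typically done with sort_values. A minimal sketch on hypothetical data:

```python
import pandas as pd

babynames = pd.DataFrame({
    "Name": ["Nora", "Liam", "Nina"],
    "Count": [120, 300, 80],
})

# Reorder rows by one column, largest Count first
by_count = babynames.sort_values("Count", ascending=False)
```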
Our goal:
• We cannot work directly with DataFrameGroupBy objects! Conceptually, .groupby splits the DataFrame into sub-DataFrames, one per group, but in reality we can't "see" the result of calling .groupby.
• Instead, we transform a DataFrameGroupBy object back into a DataFrame using .agg
• .agg is how we apply an aggregation operation to the data.
Putting it all together
dataframe.groupby(column_name).agg(aggregation_function)
•babynames[["Year", "Count"]].groupby("Year").agg(sum) computes the total number of
babies born in each year.
Alternatives …
babynames.groupby("Year")[["Count"]].agg(sum)
or
babynames.groupby("Year")[["Count"]].sum()
or
babynames.groupby("Year").sum(numeric_only=True)
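All of these alternatives produce the same totals. A runnable sketch on a hypothetical mini table:

```python
import pandas as pd

babynames = pd.DataFrame({
    "Year": [2020, 2020, 2021],
    "Count": [10, 20, 30],
})

# Total number of babies born in each year
totals = babynames.groupby("Year")[["Count"]].sum()
```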
Concluding groupby.agg
• A groupby operation involves some combination of splitting the object, applying a function, and combining the
results.
• Goal: Find the baby name with sex "F" that has fallen in popularity the most in California.
f_babynames = babynames[babynames["Sex"]=="F"]
f_babynames = f_babynames.sort_values(["Year"])
jenn_counts_series = f_babynames[f_babynames["Name"]=="Jennifer"]["Count"]
Goal: Find the baby name with sex "F" that has fallen in popularity the most in California.
max_jenn = max(f_babynames[f_babynames["Name"]=="Jennifer"]["Count"])
6065
curr_jenn = f_babynames[f_babynames["Name"]=="Jennifer"]["Count"].iloc[-1]
114
• Remember: f_babynames is sorted by year, so .iloc[-1] means "grab the count for the latest year".
rtp = curr_jenn / max_jenn
0.018796372629843364
def ratio_to_peak(series):
    return series.iloc[-1] / max(series)
jenn_counts_ser = f_babynames[f_babynames["Name"]=="Jennifer"]["Count"]
ratio_to_peak(jenn_counts_ser)
0.018796372629843364
Calculating RTP Using .groupby()
• .groupby() makes it easy to compute the RTP for all names at once!
rtp_table = f_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)
Renaming Columns After Grouping
• By default, .groupby will not rename any aggregated columns (the column is still named "Count", even though it now represents the RTP).
rtp_table = f_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)
rtp_table = rtp_table.rename(columns={"Count":"Count RTP"})
Some Data Science Payoff
• By sorting rtp_table we can see the names whose popularity has decreased the most.
rtp_table.sort_values("Count RTP")
Some Data Science Payoff
• We can get the list of the top 10 names and then plot their popularity over time.
• Given a DataFrameGroupBy object, we can use various functions to generate DataFrames (or Series); agg is only one choice.
groupby.size() and groupby.count()
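The two differ in how they treat missing values. A sketch on a hypothetical table with a missing Count:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Sex": ["F", "F", "M"],
    "Count": [10, np.nan, 30],
})

sizes = df.groupby("Sex").size()    # counts ALL rows per group, including NaN
counts = df.groupby("Sex").count()  # counts non-null values per column
```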
Filtering by Group
• Let's keep only election year results where the max '%' is less than 45%.
elections.groupby("Year").filter(lambda sf: sf["%"].max() < 45)
groupby Quiz
• Why does the table seem to claim that Woodrow Wilson won the presidency in 2020?
elections.groupby("Party").max().head(10)
Problem with Attempt #1
• Why does the table seem to claim that Woodrow Wilson won the presidency in 2020?
• Every column is calculated independently! Among Democrats:
• Last year they ran: 2020.
• Alphabetically the latest candidate name: Woodrow Wilson.
• Highest % of vote: 61.34%.
Attempt #2: Motivation
• We want to preserve entire rows, so we need an aggregate function that does that.
Attempt #2: Solution
• First sort the DataFrame so each party's best row comes first, then group by Party and take the first row of each sub-DataFrame.
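A sketch of this sort-then-first pattern, which keeps entire rows intact, on a hypothetical mini elections table:

```python
import pandas as pd

elections = pd.DataFrame({
    "Party": ["Democratic", "Democratic", "Republican"],
    "Candidate": ["Obama", "Clinton", "Romney"],
    "%": [51.1, 48.2, 47.2],
})

# Sort so each party's highest-% row comes first,
# then .first() keeps that whole row for each party
best = elections.sort_values("%", ascending=False).groupby("Party").first()
```

Because .first() grabs an entire row, the Candidate and % values stay consistent with each other, unlike taking .max() of each column independently.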
Grouping by Multiple Columns
• Suppose we want to build a table showing the total number of babies born of each sex in each year.
babynames.groupby(["Year", "Sex"])[["Count"]].agg(sum).head(6)
Pivot function
babynames_pivot = babynames.pivot_table(
    index="Year",      # rows (turned into index)
    columns="Sex",     # column values
    values=["Count"],  # field(s) to process in each group
    aggfunc="sum",     # group operation
)
babynames_pivot.head(6)
groupby(["Year", "Sex"]) vs. pivot
babynames.groupby(["Year", "Sex"])[["Count"]].agg(sum).head(6)
Pivot Tables with Multiple Values
babynames_pivot = babynames.pivot_table(
    index="Year",              # rows (turned into index)
    columns="Sex",             # column values
    values=["Count", "Name"],  # fields to process in each group
    aggfunc="max",             # group operation
)
babynames_pivot.head(6)
Pivot Table Mechanics
Where are we?
• Joining tables
• Tidy data
Joining Tables
•To join our tables, we'll first need a column with the first name of each candidate to match on.
elections["First Name"] = elections["Candidate"].str.split().str[0]
Joining Our Tables: Two Options
merged = pd.merge(left = elections, right = babynames_2022, left_on = "First Name", right_on = "Name")
merged = elections.merge(right = babynames_2022, left_on = "First Name", right_on = "Name")
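Both calls are equivalent. A self-contained sketch, with small hypothetical stand-ins for the elections and babynames_2022 tables:

```python
import pandas as pd

# Hypothetical mini versions of the two tables being joined
elections = pd.DataFrame({
    "Candidate": ["Joseph Biden", "Donald Trump"],
    "Year": [2020, 2020],
})
elections["First Name"] = elections["Candidate"].str.split().str[0]

babynames_2022 = pd.DataFrame({
    "Name": ["Joseph", "Donald"],
    "Count": [5000, 400],
})

# Inner join: keep rows where First Name matches Name
merged = elections.merge(babynames_2022, left_on="First Name", right_on="Name")
```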
Tidy data
<Table 1>
<Table 2>
Tidy data
<pivot> <melt>
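The pivot and melt operations above convert between wide and long (tidy) layouts. A round-trip sketch on a hypothetical table:

```python
import pandas as pd

# Wide layout: one column per sex
wide = pd.DataFrame({"Year": [2020, 2021], "F": [10, 12], "M": [11, 13]})

# melt: wide -> long, stacking the F and M columns into Sex/Count pairs
long = wide.melt(id_vars="Year", var_name="Sex", value_name="Count")

# pivot: long -> wide again
back = long.pivot(index="Year", columns="Sex", values="Count")
```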