Cleaning Data in R
Data type constraints
Maggie Matsui
Content Developer @ DataCamp
Course outline
Why do we need clean data?
Data type constraints
Data type    Example                                   R data type
Text         First name, last name, address, ...       character
Integer      Subscriber count, # products sold, ...    integer
...          ...                                       numeric
Glimpsing at data types
sales <- read.csv("sales.csv")
head(sales)
library(dplyr)
glimpse(sales)
Checking data types
is.numeric(sales$revenue)
FALSE
library(assertive)
assert_is_numeric(sales$revenue)
assert_is_numeric(sales$quantity)
Checking data types
Logical checking (returns TRUE/FALSE)    assertive checking (errors when FALSE)
is.character()                           assert_is_character()
is.numeric()                             assert_is_numeric()
is.logical()                             assert_is_logical()
is.factor()                              assert_is_factor()
is.Date() (from lubridate)               assert_is_date()
...                                      ...
Why does data type matter?
class(sales$revenue)
"character"
mean(sales$revenue)
NA
Warning message:
In mean.default(sales$revenue) :
argument is not numeric or logical: returning NA
Comma problems
sales$revenue
Character to number
library(stringr)
revenue_trimmed <- str_remove(sales$revenue, ",")
revenue_trimmed
as.numeric(revenue_trimmed)
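A minimal base-R sketch of the same conversion, for comparison. The sample values are hypothetical. Note that `str_remove()` drops only the first match, so values with several separators need `str_remove_all()` or, as assumed here, `gsub()`.

```r
# Hypothetical revenue strings with thousands separators
revenue <- c("5,454", "5,668", "1,234,567")

# gsub() removes every comma, not just the first one
revenue_num <- as.numeric(gsub(",", "", revenue, fixed = TRUE))

is.numeric(revenue)      # FALSE: still character
is.numeric(revenue_num)  # TRUE
mean(revenue_num)
```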
Putting it together
sales %>%
mutate(revenue_usd = as.numeric(str_remove(revenue, ",")))
# A tibble: 100 x 4
order_id revenue quantity revenue_usd
<dbl> <chr> <dbl> <dbl>
1 7432 5,454 494 5454
2 7808 5,668 334 5668
3 4893 4,062 259 4062
4 6107 3,936 15 3936
5 7661 1,067 307 1067
# ... with 95 more rows
Same function, different outcomes
mean(sales$revenue)
NA
Warning message:
In mean.default(sales$revenue) :
argument is not numeric or logical: returning NA
mean(sales$revenue_usd)
5361.4
Converting data types
as.character()
as.numeric()
as.logical()
as.factor()
as.Date()
...
Watch out: factor to numeric
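product_type                            # a factor whose labels look like numbers
class(product_type)                     # "factor"
as.numeric(product_type)                # returns the internal integer codes, not the labels
as.numeric(as.character(product_type))  # converts the labels themselves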
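To make the pitfall concrete, here is a small sketch with a hypothetical `product_type` factor (the values are invented for illustration):

```r
# Hypothetical factor whose labels happen to be numbers
product_type <- factor(c("2000", "3000", "1000", "2000"))

# Direct conversion returns the level codes (position in the sorted levels)
codes <- as.numeric(product_type)

# Going through character first recovers the labelled values
values <- as.numeric(as.character(product_type))

codes
values
```

The levels are sorted as "1000", "2000", "3000", so "2000" is stored as code 2, which is what the direct conversion returns.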
Range constraints
What's an out of range value?
SAT score: 400-1600
Package weight: at least 0 lb/kg
Finding out of range values
movies
title avg_rating
<chr> <dbl>
1 A Beautiful Mind 4.1
2 La Vita e Bella 4.3
3 Amelie 4.2
4 Meet the Parents 3.5
5 Unbreakable 5.8
6 Gone in Sixty Seconds 3.3
...
Finding out of range values
library(ggplot2)
breaks <- c(min(movies$avg_rating), 0, 5, max(movies$avg_rating))
ggplot(movies, aes(avg_rating)) +
geom_histogram(breaks = breaks)
Finding out of range values
library(assertive)
assert_all_are_in_closed_range(movies$avg_rating, lower = 0, upper = 5)
Handling out of range values
Remove rows
Treat as missing (NA)
Replace with another value based on domain knowledge and/or knowledge of the dataset
Removing rows
movies %>%
filter(avg_rating >= 0, avg_rating <= 5) %>%
ggplot(aes(avg_rating)) +
geom_histogram(breaks = c(min(movies$avg_rating), 0, 5, max(movies$avg_rating)))
Treat as missing
replace(col, condition, replacement)
movies %>%
mutate(rating_miss =
replace(avg_rating, avg_rating > 5, NA))
title rating_miss
<chr> <dbl>
1 A Beautiful Mind 4.1
2 La Vita e Bella 4.3
3 Amelie 4.2
4 Meet the Parents 3.5
5 Unbreakable NA
6 Gone in Sixty Seconds 3.3
...
Replacing out of range values
movies %>%
mutate(rating_const =
replace(avg_rating, avg_rating > 5, 5))
title rating_const
<chr> <dbl>
1 A Beautiful Mind 4.1
2 La Vita e Bella 4.3
3 Amelie 4.2
4 Meet the Parents 3.5
5 Unbreakable 5.0
6 Gone in Sixty Seconds 3.3
...
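The three ways of handling out-of-range values can be compared side by side. This is a base-R sketch on a three-row subset of the `movies` ratings shown earlier, not the course's dplyr pipeline:

```r
movies <- data.frame(
  title = c("A Beautiful Mind", "Unbreakable", "Amelie"),
  avg_rating = c(4.1, 5.8, 4.2)
)

# Option 1: remove rows outside the valid 0-5 range
in_range <- movies$avg_rating >= 0 & movies$avg_rating <= 5
movies_removed <- movies[in_range, ]

# Option 2: treat out-of-range values as missing
rating_miss <- replace(movies$avg_rating, movies$avg_rating > 5, NA)

# Option 3: cap out-of-range values at the upper limit
rating_const <- replace(movies$avg_rating, movies$avg_rating > 5, 5)
```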
Date range constraints
assert_all_are_in_past(movies$date_recorded)
library(lubridate)
movies %>%
filter(date_recorded > today())
Removing out-of-range dates
library(lubridate)
movies <- movies %>%
filter(date_recorded <= today())
library(assertive)
assert_all_are_in_past(movies$date_recorded)
Uniqueness constraints
What's a duplicate?
First name Last name Address Credit score
1 Miriam Day 6042 Sollicitudin Avenue 313
2 Miriam Day 6042 Sollicitudin Avenue 313
Why do duplicates occur?
Full duplicates
First name Last name Address Credit score
1 Harper Taylor P.O. Box 212, 6557 Nunc Road 655
2 Miriam Day 6042 Sollicitudin Avenue 313
3 Eagan Schmidt 507-6740 Cursus Avenue 728
4 Miriam Day 6042 Sollicitudin Avenue 313
5 Katell Roy Ap #434-4081 Mi Av. 455
6 Katell Roy Ap #434-4081 Mi Av. 455
... ... ... ... ...
Finding full duplicates
duplicated(credit_scores)
sum(duplicated(credit_scores))
Finding full duplicates
filter(credit_scores, duplicated(credit_scores))
Dropping full duplicates
credit_scores_unique <- distinct(credit_scores)
sum(duplicated(credit_scores_unique))
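A small runnable sketch of finding and dropping full duplicates, using four of the rows shown earlier. It uses base R's `duplicated()` for the drop as a stand-in for `distinct()`:

```r
credit_scores <- data.frame(
  first_name = c("Harper", "Miriam", "Eagan", "Miriam"),
  last_name = c("Taylor", "Day", "Schmidt", "Day"),
  credit_score = c(655, 313, 728, 313)
)

# duplicated() flags the second and later copies of an identical row
n_dups <- sum(duplicated(credit_scores))

# Keep only the first copy of each row
credit_scores_unique <- credit_scores[!duplicated(credit_scores), ]
n_dups_after <- sum(duplicated(credit_scores_unique))
```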
Partial duplicates
First name Last name Address Credit score
1 Harper Taylor P.O. Box 212, 6557 Nunc Road 655
2 Eagan Schmidt 507-6740 Cursus Avenue 728
3 Tamekah Forbes P.O. Box 147, 511 Velit Street 356
4 Tamekah Forbes P.O. Box 147, 511 Velit Street 342
5 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 620
6 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 636
... ... ... ... ...
Finding partial duplicates
credit_scores %>%
count(first_name, last_name) %>%
filter(n > 1)
first_name last_name n
<fct> <fct> <int>
1 Katell Roy 2
2 Miriam Day 2
3 Tamekah Forbes 2
4 Xandra Barrett 2
Finding partial duplicates
dup_ids <- credit_scores %>%
count(first_name, last_name) %>%
filter(n > 1)
credit_scores %>%
filter(first_name %in% dup_ids$first_name, last_name %in% dup_ids$last_name)
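One caveat worth illustrating: matching first and last names independently with `%in%` can also select a row whose first name and last name each appear among the duplicates but not together. The sketch below matches on the name pair instead. It uses base R, and the "Tamekah Barrett" row is invented to show the difference:

```r
credit_scores <- data.frame(
  first_name = c("Tamekah", "Tamekah", "Xandra", "Xandra", "Tamekah"),
  last_name = c("Forbes", "Forbes", "Barrett", "Barrett", "Barrett"),
  credit_score = c(356, 342, 620, 636, 700)
)

name_cols <- credit_scores[, c("first_name", "last_name")]

# Flag every row whose (first_name, last_name) pair occurs more than once
is_dup <- duplicated(name_cols) | duplicated(name_cols, fromLast = TRUE)

# Independent matching, for comparison: also flags the hypothetical last row
dup_names <- name_cols[duplicated(name_cols), ]
loose <- credit_scores$first_name %in% dup_names$first_name &
  credit_scores$last_name %in% dup_names$last_name

credit_scores[is_dup, ]
```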
Handling partial duplicates: dropping
Drop all duplicates except one
Dropping partial duplicates
credit_scores %>%
distinct(first_name, last_name, .keep_all = TRUE)
Handling partial duplicates: summarizing
Summarize differing values using statistical summary functions (mean(), max(), etc.)
Summarizing partial duplicates
credit_scores %>%
group_by(first_name, last_name) %>%
mutate(mean_credit_score = mean(credit_score))
Summarizing partial duplicates
credit_scores %>%
group_by(first_name, last_name) %>%
mutate(mean_credit_score = mean(credit_score)) %>%
distinct(first_name, last_name, .keep_all = TRUE) %>%
select(-credit_score)
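The summarizing step can also be sketched in base R with `aggregate()`, which collapses each name pair to one row holding the mean score. The rows are taken from the partial-duplicates table above; this is an alternative to the dplyr pipeline, not the course's own code:

```r
credit_scores <- data.frame(
  first_name = c("Harper", "Tamekah", "Tamekah", "Xandra", "Xandra"),
  last_name = c("Taylor", "Forbes", "Forbes", "Barrett", "Barrett"),
  credit_score = c(655, 356, 342, 620, 636)
)

# One row per (first_name, last_name), with the mean of the differing scores
mean_scores <- aggregate(credit_score ~ first_name + last_name,
                         data = credit_scores, FUN = mean)
mean_scores
```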
Checking membership
Categorical data
Categorical variables have a fixed and known set of possible values
T-shirt size: S, M, L, XL
Factors
In a factor, each category is stored as a number and has a corresponding label
T-shirt size: labels S, M, L, XL; numeric representation 1, 2, 3, 4
Factor levels
tshirt_size
L XL XL L M M M L XL L S M M S S M XL S L S ...
Levels: S M L XL
levels(tshirt_size)
Values that don't belong
Factors cannot have values that fall outside of the predefined ones
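A short sketch of what this means in practice: when a factor is built with explicit levels, any value outside them becomes `NA`. The "XXL" entry is a hypothetical stray value:

```r
sizes <- c("L", "XL", "M", "S", "XXL")

# Only S, M, L, XL are allowed; "XXL" is not a level
tshirt_size <- factor(sizes, levels = c("S", "M", "L", "XL"))

levels(tshirt_size)
size_codes <- as.integer(tshirt_size)  # internal codes; NA for "XXL"
size_codes
```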
How do we end up with these values?
Filtering joins: a quick review
Keeps or removes observations from the first table without adding columns
Blood type example
study_data blood_types
Finding non-members
Anti-join
study_data %>%
anti_join(blood_types, by = "blood_type")
Removing non-members
Semi-join
study_data %>%
semi_join(blood_types, by = "blood_type")
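For a single key column, the effect of both filtering joins can be sketched in base R with `%in%`. The table contents below are hypothetical, including the invalid "Z+" blood type:

```r
study_data <- data.frame(
  name = c("Beth", "Ignatius", "Paul", "Helen"),
  blood_type = c("B-", "A+", "Z+", "O-")
)
blood_types <- data.frame(
  blood_type = c("O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-")
)

is_member <- study_data$blood_type %in% blood_types$blood_type

# Like anti_join(): rows of study_data with no match in blood_types
non_members <- study_data[!is_member, ]

# Like semi_join(): rows of study_data that do have a match
members <- study_data[is_member, ]
```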
Categorical data problems
Inconsistency within a category
Too many categories
Example: animal classification
animals
# A tibble: 68 x 9
animal_name hair eggs fins legs tail type
<chr> <fct> <fct> <fct> <int> <fct> <fct>
1 mole 1 0 0 4 1 mammal
2 chicken 0 1 0 2 1 bird
3 capybara 1 0 0 2 1 Mammal
4 tuna 0 1 1 0 1 fish
5 ostrich 0 1 0 2 1 bird
# ... with 63 more rows
Checking categories
animals %>%
count(type)
type n
1 " mammal " 1
2 "amphibian" 2
3 "bird" 20
4 "bug" 1
5 "fish" 2
6 "invertebrate" 1
7 "mammal" 38
8 "MAMMAL" 1
9 "Mammal " 1
10 "reptile" 1
Variants of the same category: "mammal", " mammal ", "MAMMAL", "Mammal "
Case inconsistency
library(stringr)
animals %>%
mutate(type_lower = str_to_lower(type))
Case inconsistency
animals %>%
mutate(type_lower = str_to_lower(type)) %>%
count(type_lower)
type_lower n
<chr> <int>
1 " mammal " 1
2 "amphibian" 2
3 "bird" 20
4 "bug" 1
5 "fish" 2
6 "invertebrate" 1
7 "mammal" 39
8 "mammal " 1
9 "reptile" 1
"MAMMAL" → "mammal"
Case inconsistency
animals %>%
mutate(type_upper = str_to_upper(type)) %>%
count(type_upper)
type_upper n
<chr> <int>
1 " MAMMAL " 1
2 "AMPHIBIAN" 2
3 "BIRD" 20
4 "BUG" 1
5 "FISH" 2
6 "INVERTEBRATE" 1
7 "MAMMAL" 39
8 "MAMMAL " 1
9 "REPTILE" 1
Whitespace inconsistency
animals %>%
mutate(type_trimmed = str_trim(type_lower))
Whitespace inconsistency
animals %>%
mutate(type_trimmed = str_trim(type_lower)) %>%
count(type_trimmed)
type_trimmed n
<chr> <int>
1 amphibian 2
2 bird 20
3 bug 1
4 fish 2
5 invertebrate 1
6 mammal 41
7 reptile 1
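Both fixes can be combined in one step. This sketch uses the base-R equivalents `tolower()` and `trimws()` on the four "mammal" variants seen above:

```r
type <- c("mammal", " mammal ", "MAMMAL", "Mammal ", "bird")

# Lowercase first, then strip leading and trailing whitespace
type_clean <- trimws(tolower(type))

table(type_clean)
```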
Too many categories
animals %>%
count(type_trimmed, sort = TRUE)
type_trimmed n
1 mammal 41
2 bird 20
3 amphibian 2
4 fish 2
5 bug 1
6 invertebrate 1
7 reptile 1
Collapsing categories
other_categories <- c("amphibian", "fish", "bug", "invertebrate", "reptile")
library(forcats)
animals %>%
mutate(type_collapsed = fct_collapse(type_trimmed, other = other_categories))
Collapsing categories
animals %>%
count(type_collapsed)
type_collapsed n
<fct> <int>
1 other 7
2 bird 20
3 mammal 41

animals %>%
group_by(type_collapsed) %>%
summarize(avg_legs = mean(legs))
type_collapsed avg_legs
<fct> <dbl>
1 other 3.71
2 bird 2
3 mammal 3.37
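The collapsing idea itself does not depend on forcats. A rough base-R sketch on a hypothetical six-value vector, using `ifelse()` in place of `fct_collapse()`:

```r
type_trimmed <- c("mammal", "bird", "fish", "bug", "mammal", "reptile")
other_categories <- c("amphibian", "fish", "bug", "invertebrate", "reptile")

# Map every rare category to "other"; keep the rest unchanged
type_collapsed <- ifelse(type_trimmed %in% other_categories, "other", type_trimmed)

table(type_collapsed)
```

Unlike `fct_collapse()`, this returns a character vector rather than a factor.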
Cleaning text data
What is text data?
Type of data Example values
Names "Veronica Hopkins", "Josiah", ...
Unstructured data problems
Formatting inconsistency
"6171679912" vs. "(868) 949-4489"
Information inconsistency
+1 617-167-9912 vs. 617-167-9912
Invalid data
Phone number "0492" is too short
Customer data
customers
# A tibble: 99 x 3
name company credit_card
<chr> <chr> <chr>
1 Galena In Magna Associates 5171 5854 8986 1916
2 MacKenzie Iaculis Ltd 5128-5078-8008-5824
3 Megan Acosta Semper LLC 5502 4529 0732 1744
4 Phoebe Delacruz Sit Amet Nulla Limited 5419-7308-7424-0944
5 Jessica Pellentesque Sed Ltd 5419 2949 5508 9530
# ... with 95 more rows
Detecting hyphenated credit card numbers
str_detect(customers$credit_card, "-")
FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE ...
customers %>%
filter(str_detect(credit_card, "-"))
Replacing hyphens
customers %>%
mutate(credit_card_spaces = str_replace_all(credit_card, "-", " "))
Removing hyphens and spaces
credit_card_clean <- customers$credit_card %>%
str_remove_all("-") %>%
str_remove_all(" ")
customers %>%
mutate(credit_card = credit_card_clean)
Finding invalid credit cards
str_length(customers$credit_card)
16 16 16 16 16 16 16 16 16 16 16 16 12 16 16 16 16 16 16 16 16 16 16 16 16 ...
customers %>%
filter(str_length(credit_card) != 16)
Removing invalid credit cards
customers %>%
filter(str_length(credit_card) == 16)
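The whole cleaning sequence (strip separators, then validate length) can be sketched end to end in base R. The first two card numbers come from the table above; the third, too-short value is hypothetical:

```r
credit_card <- c("5171 5854 8986 1916", "5128-5078-8008-5824", "5502 4529 0732")

# Remove hyphens and spaces in one pass
credit_card_clean <- gsub("[- ]", "", credit_card)

# A valid number is assumed to have exactly 16 characters
is_valid <- nchar(credit_card_clean) == 16
valid_cards <- credit_card_clean[is_valid]
valid_cards
```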
More complex text problems
A regular expression is a sequence of characters that allows for robust searching within a string.
Learn more in String Manipulation with stringr in R & Intermediate Regular Expressions in R
Uniformity
Different units or formats
Temperature: °C vs. °F
Where do uniformity issues come from?
Finding uniformity issues
head(nyc_temps)
date temp
1 2019-04-01 4.2
2 2019-04-02 7.5
3 2019-04-03 12.2
4 2019-04-04 11.1
5 2019-04-05 41.5
6 2019-04-06 11.9
Finding uniformity issues
library(ggplot2)
ggplot(nyc_temps, aes(x = date, y = temp)) +
geom_point()
What to do?
There's no one best option. It depends on your dataset!
Do your research to understand where your data comes from
Data from Apr 7, 16, and 23 is from an external source that measured temps in °F
Unit conversion
C = (F − 32) × 5 / 9
ifelse(condition, value_if_true, value_if_false)
nyc_temps %>%
mutate(temp_c = ifelse(temp > 50, (temp - 32) * 5 / 9, temp))
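A quick check of the conversion on a plain vector. The temperatures are hypothetical, and 59 is chosen because it converts to exactly 15:

```r
temp <- c(4.2, 7.5, 59, 11.9)

# Values above 50 are assumed to be Fahrenheit and are converted to Celsius
temp_c <- ifelse(temp > 50, (temp - 32) * 5 / 9, temp)
temp_c
```

The threshold of 50 is a judgment call that relies on knowing the plausible range for the data. It would not work for a dataset where genuine Celsius readings can exceed it.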
Unit conversion
nyc_temps %>%
mutate(temp_c = ifelse(temp > 50, (temp - 32) * 5 / 9, temp)) %>%
ggplot(aes(x = date, y = temp_c)) +
geom_point()
Date uniformity
nyc_temps
Parsing multiple formats
library(lubridate)
parse_date_time(nyc_temps$date,
orders = c("%Y-%m-%d", "%m/%d/%y", "%B %d, %Y"))
NA
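The idea behind trying several formats can be sketched in base R: try each format in turn and fill in only the values that are still unparsed. `parse_multi()` is a hypothetical helper written for this illustration, not a lubridate function:

```r
# Hypothetical helper: try each format in order, keeping the first that parses
parse_multi <- function(x, formats) {
  out <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    todo <- is.na(out)
    out[todo] <- as.Date(x[todo], format = fmt)
  }
  out
}

dates <- c("2019-04-01", "04/05/19", "not a date")
parsed <- parse_multi(dates, c("%Y-%m-%d", "%m/%d/%y"))
parsed
```

Values that match none of the formats stay `NA`, so they can be inspected afterwards.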
Ambiguous dates
Is 02/04/2019 in February or April?
Options include:
Treat as missing
Cross field validation
What is cross field validation?
Cross field validation = a sanity check
Does this value make sense based on other values?
1 https://www.buzzfeednews.com/article/katienotopoulos/graphs-that-lied-to-us
Credit card data
head(credit_cards)
Validating numbers
credit_cards %>%
select(dining_cb:total_cb)
Validating numbers
credit_cards %>%
mutate(theoretical_total = dining_cb + groceries_cb + gas_cb) %>%
filter(theoretical_total != total_cb) %>%
select(dining_cb:theoretical_total)
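A self-contained sketch of the same check on hypothetical cash-back amounts. It compares with a small tolerance rather than `!=`, a precaution (not part of the original slide) against floating-point rounding in summed decimals:

```r
credit_cards <- data.frame(
  dining_cb = c(10.5, 20, 5),
  groceries_cb = c(30.25, 15, 7.5),
  gas_cb = c(9.25, 5, 2.5),
  total_cb = c(50, 45, 15)
)

credit_cards$theoretical_total <- with(credit_cards,
                                       dining_cb + groceries_cb + gas_cb)

# Rows where the recorded total disagrees with the sum of its parts
mismatch <- abs(credit_cards$theoretical_total - credit_cards$total_cb) > 1e-8
credit_cards[mismatch, ]
```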
Validating date and age
credit_cards %>%
select(date_opened, acct_age)
date_opened acct_age
1 2018-07-05 1
2 2016-01-23 4
3 2016-03-25 4
4 2018-06-20 1
5 2017-02-08 3
6 2014-11-18 5
Calculating age
library(lubridate)
date_difference <- as.Date("2015-09-04") %--% today()
date_difference
as.numeric(date_difference, "years")
4.511978
floor(as.numeric(date_difference, "years"))
Validating age
credit_cards %>%
mutate(theor_age = floor(as.numeric(date_opened %--% today(), "years"))) %>%
filter(theor_age != acct_age)
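The age calculation can also be sketched without lubridate by counting completed years directly. `age_years()` is a hypothetical helper, and a fixed reference date (an assumption, chosen so the result is reproducible) stands in for `today()`:

```r
# Hypothetical helper: whole years elapsed between two dates
age_years <- function(from, to) {
  from_lt <- as.POSIXlt(from)
  to_lt <- as.POSIXlt(to)
  years <- to_lt$year - from_lt$year
  # Subtract one year if the anniversary has not been reached yet
  before_anniversary <- (to_lt$mon < from_lt$mon) |
    (to_lt$mon == from_lt$mon & to_lt$mday < from_lt$mday)
  years - as.integer(before_anniversary)
}

reference_date <- as.Date("2020-03-09")  # assumed stand-in for today()
date_opened <- as.Date(c("2015-09-04", "2018-07-05", "2016-01-23"))
theor_age <- age_years(date_opened, reference_date)
theor_age
```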
What next?
Completeness
What is missing data?
Air quality
head(airquality)
Finding missing values
is.na(airquality)
Counting missing values
# Count missing vals in entire dataset
sum(is.na(airquality))
44
Visualizing missing values
library(visdat)
vis_miss(airquality)
Investigating missingness
airquality %>%
mutate(miss_ozone = is.na(Ozone)) %>%
group_by(miss_ozone) %>%
summarize(across(everything(), ~ median(.x, na.rm = TRUE)))
Investigating missingness
airquality %>%
arrange(Temp) %>%
vis_miss()
Types of missingness
Dealing with missing data
Simple approaches:
1. Drop missing data
2. Impute (fill in) with statistical measures (mean, median, mode, ...) or domain knowledge
Dropping missing values
airquality %>%
filter(!is.na(Ozone), !is.na(Solar.R))
Replacing missing values
airquality %>%
mutate(ozone_filled = ifelse(is.na(Ozone), mean(Ozone, na.rm = TRUE), Ozone))
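Because `airquality` ships with R, the counting, dropping, and imputing steps can be tried directly. This sketch uses base R in place of the dplyr verbs:

```r
# Count missing values overall and per column
n_missing <- sum(is.na(airquality))
missing_by_col <- colSums(is.na(airquality))

# Drop rows where Ozone or Solar.R is missing
complete <- airquality[!is.na(airquality$Ozone) & !is.na(airquality$Solar.R), ]

# Impute missing Ozone values with the mean of the observed ones
ozone_filled <- ifelse(is.na(airquality$Ozone),
                       mean(airquality$Ozone, na.rm = TRUE),
                       airquality$Ozone)

n_missing
missing_by_col
nrow(complete)
```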
Comparing strings
Measuring distance between values
Minimum edit distance
How many typos are needed to get from one string to another?
Edit distance = 1
A more complex example
baboon → typhoon
Insert h
Substitute b → t
Substitute a → y
Substitute b → p
Total: 4
Types of edit distance
Damerau-Levenshtein: what you just learned
Levenshtein: considers only substitution, insertion, and deletion
Others: Jaro-Winkler, Jaccard
Which is best?
String distance in R
library(stringdist)
stringdist("baboon",
"typhoon",
method = "dl")
Other methods
# LCS
stringdist("baboon", "typhoon",
method = "lcs")
# Jaccard
stringdist("baboon", "typhoon",
method = "jaccard")
0.75
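Two of these distances can be reproduced without the stringdist package, which helps show what is being measured. Base R's `adist()` computes plain Levenshtein distance (no transpositions), and the Jaccard distance is computed by hand here on sets of single characters, which is an assumption about the default setting:

```r
# Levenshtein distance: 3 substitutions + 1 insertion
lev <- as.numeric(adist("baboon", "typhoon"))

# Jaccard distance on the sets of distinct characters
chars <- function(s) unique(strsplit(s, "")[[1]])
a <- chars("baboon")   # b, a, o, n
b <- chars("typhoon")  # t, y, p, h, o, n
jaccard_dist <- 1 - length(intersect(a, b)) / length(union(a, b))

lev
jaccard_dist
```

The two strings share 2 of 8 distinct characters, giving 1 - 2/8 = 0.75, which agrees with the value shown above.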
Comparing strings to clean data
In Chapter 2:
"EU", "eur", "Europ" → "Europe"
Comparing strings to clean data
survey cities
Remapping using string distance
library(fuzzyjoin)
stringdist_left_join(survey, cities, by = "city", method = "dl")
Remapping using string distance
stringdist_left_join(survey, cities, by = "city", method = "dl", max_dist = 1)
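What the fuzzy join does can be approximated by hand: compute the distance from each survey value to every valid city, pick the nearest, and accept it only if it is within the maximum distance. This sketch uses base R's `adist()` (Levenshtein, not Damerau-Levenshtein) and hypothetical city names:

```r
survey_city <- c("chicgo", "los angles", "chicago", "new york", "seattle")
cities <- c("chicago", "los angeles", "new york")

# Distance matrix: one row per survey value, one column per valid city
d <- adist(survey_city, cities)

# Nearest valid city and its distance for each survey value
nearest <- apply(d, 1, which.min)
min_dist <- d[cbind(seq_along(survey_city), nearest)]

# Accept the match only when it is at most one edit away
max_dist <- 1
city_actual <- ifelse(min_dist <= max_dist, cities[nearest], NA_character_)
city_actual
```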
Generating and comparing pairs
When joins won't work
What is record linkage?
Pairs of records
Generating pairs
Generating pairs in R
library(reclin)
pair_blocking(df_A, df_B)
Simple blocking
No blocking used.
First data set: 5 records
Second data set: 5 records
Total number of pairs: 25 pairs
ldat with 25 rows and 2 columns
x y
1 1 1
2 2 1
3 3 1
...
Too many pairs
Blocking
Only consider pairs when they agree on the blocking variable (State)
Pair blocking in R
pair_blocking(df_A, df_B, blocking_var = "state")
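To see why blocking shrinks the problem, the pair generation can be sketched in base R: all pairs come from a cross product, while blocked pairs come from a join on the blocking variable. The three-row tables are hypothetical, and this mimics the idea rather than the reclin implementation:

```r
df_A <- data.frame(name = c("Keaton Z Snyder", "Christine M. Conner", "Arthur Potts"),
                   state = c("PA", "NY", "NJ"))
df_B <- data.frame(name = c("Keaton Snyder", "Garrison, Brenda", "Maia Collier"),
                   state = c("PA", "NJ", "NJ"))

# No blocking: every record in A is paired with every record in B
all_pairs <- expand.grid(x = seq_len(nrow(df_A)), y = seq_len(nrow(df_B)))

# Blocking on state: keep only pairs that agree on the blocking variable
a_idx <- data.frame(x = seq_len(nrow(df_A)), state = df_A$state)
b_idx <- data.frame(y = seq_len(nrow(df_B)), state = df_B$state)
blocked_pairs <- merge(a_idx, b_idx, by = "state")

nrow(all_pairs)
nrow(blocked_pairs)
```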
Comparing pairs
Comparing pairs
pair_blocking(df_A, df_B, blocking_var = "state") %>%
compare_pairs(by = "name", default_comparator = lcs())
Comparing multiple columns
pair_blocking(df_A, df_B, blocking_var = "state") %>%
compare_pairs(by = c("name", "zip"), default_comparator = lcs())
Different comparators
default_comparator = lcs()
default_comparator = jaccard()
default_comparator = jaro_winkler()
Scoring and linking
Last lesson
df_A df_B
Where we left off
pair_blocking(df_A, df_B, blocking_var = "state") %>%
compare_pairs(by = c("name", "zip"), default_comparator = lcs())
x y name zip
1 1 1 0.3529412 0.4
2 1 4 0.3030303 0.2
3 2 3 0.9285714 1.0
4 2 5 0.2307692 0.2
5 3 2 0.2142857 0.2
6 4 2 0.2857143 0.4
7 5 1 0.1935484 0.4
8 5 4 0.3333333 0.2
Scoring pairs
Scoring with sums
x y name zip
1 1 1 0.3529412 + 0.4 =
2 1 4 0.3030303 + 0.2 =
3 2 3 0.9285714 + 1.0 =
4 2 5 0.2307692 + 0.2 =
5 3 2 0.2142857 + 0.2 =
6 4 2 0.2857143 + 0.4 =
7 5 1 0.1935484 + 0.4 =
8 5 4 0.3333333 + 0.2 =
pair_blocking(df_A, df_B, blocking_var = "state") %>%
compare_pairs(by = c("name", "zip"), default_comparator = lcs()) %>%
score_simsum()
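The sum score is simply the row-wise total of the similarity columns. Below is a base-R sketch that recomputes it from the comparison table shown earlier, to illustrate the arithmetic rather than the reclin internals:

```r
pairs <- data.frame(
  x = c(1, 1, 2, 2, 3, 4, 5, 5),
  y = c(1, 4, 3, 5, 2, 2, 1, 4),
  name = c(0.3529412, 0.3030303, 0.9285714, 0.2307692,
           0.2142857, 0.2857143, 0.1935484, 0.3333333),
  zip = c(0.4, 0.2, 1.0, 0.2, 0.2, 0.4, 0.4, 0.2)
)

# Score each pair by summing its similarity scores
pairs$score <- pairs$name + pairs$zip

# The highest-scoring pair is the most likely match
best <- pairs[which.max(pairs$score), ]
best
```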
Disadvantages of summing
Two records with a similar name (Keaton Z Snyder & Keaton Snyder) are more likely to be a match
Two records with the same sex (Male & Male) are not as likely to be a match
Scoring probabilistically
pair_blocking(df_A, df_B, blocking_var = "state") %>%
compare_pairs(by = c("name", "zip"), default_comparator = lcs()) %>%
score_problink()
Linking pairs
Selecting matches
pair_blocking(df_A, df_B, blocking_var = "state") %>%
compare_pairs(by = c("name", "zip"), default_comparator = lcs()) %>%
score_problink() %>%
select_n_to_m()
Linking the data
pair_blocking(df_A, df_B, blocking_var = "state") %>%
compare_pairs(by = c("name", "zip"), default_comparator = lcs()) %>%
score_problink() %>%
select_n_to_m() %>%
link()
Linked data
name.x zip.x state.x name.y zip.y state.y
1 Keaton Z Snyder 15020 PA Keaton Snyder 15020 PA
2 Christine M. Conner 10456 NY <NA> <NA> <NA>
3 Arthur Potts 07799 NJ <NA> <NA> <NA>
4 Maia Collier 07960 NJ <NA> <NA> <NA>
5 Atkins, Alice W. 10603 NY <NA> <NA> <NA>
6 <NA> <NA> <NA> Jerome A. Yates 11743 NY
7 <NA> <NA> <NA> Garrison, Brenda 08611 NJ
8 <NA> <NA> <NA> Stuart, Bert F 12211 NY
9 <NA> <NA> <NA> Hayley Peck 19134 PA
What you learned
Chapter 1: Common Data Problems
Chapter 2: Text and Categorical Data
Chapter 3: Advanced Data Problems
Chapter 4: Record Linkage
Expand and build upon your new skills
Categorical Data
Categorical Data in the Tidyverse
Text Data
String Manipulation with stringr in R