Cleaning Data in R

The document outlines the process of cleaning data in R, focusing on data type constraints, checking and converting data types, handling out-of-range values, and managing duplicates. It emphasizes the importance of clean data for accurate analysis and provides examples of common data issues and solutions. Techniques such as using libraries like dplyr and assertive are highlighted for effective data manipulation.


Data type constraints
CLEANING DATA IN R

Maggie Matsui
Content Developer @ DataCamp
Course outline

Chapter 1 - Common data problems

CLEANING DATA IN R
Why do we need clean data?

Data type constraints
Data type   Example                                  R data type
Text        First name, last name, address, ...      character
Integer     Subscriber count, # products sold, ...   integer
Decimal     Temperature, exchange rate, ...          numeric
Binary      Is married, new customer, yes/no, ...    logical
Category    Marriage status, color, ...              factor
Date        Order dates, date of birth, ...          Date

Glimpsing at data types
sales <- read.csv("sales.csv")
head(sales)

  order_id revenue quantity
1     7432   5,454      494
2     7808   5,668      334
3     4893   4,062      259
4     6107   3,936       15
5     7661   1,067      307
6     5908   6,635      235

library(dplyr)
glimpse(sales)

Observations: 100
Variables: 3
$ order_id <dbl> 7432, 7808, ...
$ revenue  <chr> "$5454", "$5668", ...
$ quantity <dbl> 494, 334, ...

Checking data types
is.numeric(sales$revenue)

FALSE

library(assertive)
assert_is_numeric(sales$revenue)

Error: is_numeric : sales$revenue is not of class 'numeric'; it has class 'character'.

assert_is_numeric(sales$quantity)

Checking data types
Logical checking (returns TRUE / FALSE)   Assertive checking (errors when FALSE)
is.character()                            assert_is_character()
is.numeric()                              assert_is_numeric()
is.logical()                              assert_is_logical()
is.factor()                               assert_is_factor()
is.Date()                                 assert_is_date()
...                                       ...

Why does data type matter?
class(sales$revenue)

"character"

mean(sales$revenue)

NA
Warning message:
In mean.default(sales$revenue) :
argument is not numeric or logical: returning NA

Comma problems
sales$revenue

"5,454" "5,668" "4,062" "3,936" "1,067" ...

Character to number
library(stringr)
revenue_trimmed = str_remove(sales$revenue, ",")
revenue_trimmed

"5454" "5668" "4062" "3936" "1067" ...

as.numeric(revenue_trimmed)

5454 5668 4062 3936 1067 ...
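
The same conversion can be sketched in base R. Note that `str_remove()` strips only the first comma it finds, so a value such as "1,234,567" would keep a comma and convert to NA; `gsub()` removes every match. The sample vector below is made up for illustration and is not the course's `sales` data.

```r
# Hypothetical revenue strings, including one with two commas
revenue <- c("5,454", "5,668", "1,234,567")

# gsub() replaces every comma; fixed = TRUE treats "," as a literal character
revenue_clean <- gsub(",", "", revenue, fixed = TRUE)

# Convert the cleaned strings to numbers
revenue_usd <- as.numeric(revenue_clean)
print(revenue_usd)
```

With stringr, `str_remove_all()` plays the same role as `gsub()` here.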

Putting it together
sales %>%
mutate(revenue_usd = as.numeric(str_remove(revenue, ",")))

# A tibble: 100 x 4
order_id revenue quantity revenue_usd
<dbl> <chr> <dbl> <dbl>
1 7432 5,454 494 5454
2 7808 5,668 334 5668
3 4893 4,062 259 4062
4 6107 3,936 15 3936
5 7661 1,067 307 1067
# ... with 95 more rows

Same function, different outcomes
mean(sales$revenue)

NA
Warning message:
In mean.default(sales$revenue) :
argument is not numeric or logical: returning NA

mean(sales$revenue_usd)

5361.4

Converting data types
as.character()

as.numeric()

as.logical()

as.factor()

as.Date()

...

Watch out: factor to numeric
product_type

1000 1000 3000 2000 3000
Levels: 1000 2000 3000

class(product_type)

"factor"

as.numeric(product_type)

1 1 3 2 3

as.numeric(as.character(product_type))

1000 1000 3000 2000 3000

Range constraints
What's an out of range value?
SAT score: 400-1600
Package weight: at least 0 lb/kg

Adult heart rate: 60-100 beats per minute

Finding out of range values
movies

title avg_rating
<chr> <dbl>
1 A Beautiful Mind 4.1
2 La Vita e Bella 4.3
3 Amelie 4.2
4 Meet the Parents 3.5
5 Unbreakable 5.8
6 Gone in Sixty Seconds 3.3
...

Finding out of range values
library(ggplot2)
breaks <- c(min(movies$avg_rating), 0, 5, max(movies$avg_rating))
ggplot(movies, aes(avg_rating)) +
  geom_histogram(breaks = breaks)

Finding out of range values
library(assertive)
assert_all_are_in_closed_range(movies$avg_rating, lower = 0, upper = 5)

Error: is_in_closed_range : movies$avg_rating are not all in the range [0,5].


There were 3 failures:
Position Value Cause
1 5 5.8 too high
2 8 6.2 too high
3 9 -4.4 too low

Handling out of range values
Remove rows
Treat as missing ( NA )

Replace with range limit

Replace with other value based on domain knowledge and/or knowledge of dataset
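
The first three options can be sketched in base R as follows. The ratings vector is hypothetical, and `pmin()`/`pmax()` are used here as one possible way to clamp to the range limits; they handle values that are too low as well as too high.

```r
# Hypothetical ratings on a 0-5 scale, with three out-of-range values
avg_rating <- c(4.1, 4.3, 5.8, 3.5, 6.2, -4.4)
in_range <- avg_rating >= 0 & avg_rating <= 5

# Option 1: remove out-of-range values
rating_removed <- avg_rating[in_range]

# Option 2: treat out-of-range values as missing
rating_miss <- replace(avg_rating, !in_range, NA)

# Option 3: replace with the nearest range limit (clamp to [0, 5])
rating_limit <- pmin(pmax(avg_rating, 0), 5)

print(rating_limit)
```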

Removing rows
movies %>%
  filter(avg_rating >= 0, avg_rating <= 5) %>%
  ggplot(aes(avg_rating)) +
  geom_histogram(breaks = c(min(movies$avg_rating), 0, 5, max(movies$avg_rating)))

Treat as missing
movies

  title                 avg_rating
  <chr>                      <dbl>
1 A Beautiful Mind             4.1
2 La Vita e Bella              4.3
3 Amelie                       4.2
4 Meet the Parents             3.5
5 Unbreakable                  5.8
6 Gone in Sixty Seconds        3.3
...

replace(col, condition, replacement)

movies %>%
  mutate(rating_miss = replace(avg_rating, avg_rating > 5, NA))

  title                 rating_miss
  <chr>                       <dbl>
1 A Beautiful Mind              4.1
2 La Vita e Bella               4.3
3 Amelie                        4.2
4 Meet the Parents              3.5
5 Unbreakable                    NA
6 Gone in Sixty Seconds         3.3
...

Replacing out of range values
movies %>%
mutate(rating_const =
replace(avg_rating, avg_rating > 5, 5))

title rating_const
<chr> <dbl>
1 A Beautiful Mind 4.1
2 La Vita e Bella 4.3
3 Amelie 4.2
4 Meet the Parents 3.5
5 Unbreakable 5.0
6 Gone in Sixty Seconds 3.3
...

Date range constraints
assert_all_are_in_past(movies$date_recorded)

Error: is_in_past : movies$date_recorded are not all in the past.


There was 1 failure:
Position Value Cause
1 3 2064-09-22 20:00:00 in future

library(lubridate)
movies %>%
filter(date_recorded > today())

title avg_rating date_recorded


1 Amelie 4.2 2064-09-23

Removing out-of-range dates
library(lubridate)
movies <- movies %>%
filter(date_recorded <= today())

library(assertive)
assert_all_are_in_past(movies$date_recorded)

Remember, no output = passed!

Uniqueness constraints
What's a duplicate?
First name Last name Address Credit score
1 Miriam Day 6042 Sollicitudin Avenue 313
2 Miriam Day 6042 Sollicitudin Avenue 313

First name Last name Address Credit score


1 Tamekah Forbes P.O. Box 147, 511 Velit St 356
2 Tamekah Forbes P.O. Box 147, 511 Velit St 342

Why do duplicates occur?

Full duplicates
First name Last name Address Credit score
1 Harper Taylor P.O. Box 212, 6557 Nunc Road 655
2 Miriam Day 6042 Sollicitudin Avenue 313
3 Eagan Schmidt 507-6740 Cursus Avenue 728
4 Miriam Day 6042 Sollicitudin Avenue 313
5 Katell Roy Ap #434-4081 Mi Av. 455
6 Katell Roy Ap #434-4081 Mi Av. 455
... ... ... ... ...

Finding full duplicates
duplicated(credit_scores)

FALSE FALSE FALSE TRUE FALSE ...

sum(duplicated(credit_scores))

Finding full duplicates
filter(credit_scores, duplicated(credit_scores))

first_name last_name address credit_score


1 Miriam Day 6042 Sollicitudin Avenue 313
2 Katell Roy Ap #434-4081 Mi Av. 455

Dropping full duplicates
credit_scores_unique <- distinct(credit_scores)
sum(duplicated(credit_scores_unique))

Partial duplicates
First name Last name Address Credit score
1 Harper Taylor P.O. Box 212, 6557 Nunc Road 655
2 Eagan Schmidt 507-6740 Cursus Avenue 728
3 Tamekah Forbes P.O. Box 147, 511 Velit Street 356
4 Tamekah Forbes P.O. Box 147, 511 Velit Street 342
5 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 620
6 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 636
... ... ... ... ...

Finding partial duplicates
credit_scores %>%
count(first_name, last_name) %>%
filter(n > 1)

first_name last_name n
<fct> <fct> <int>
1 Katell Roy 2
2 Miriam Day 2
3 Tamekah Forbes 2
4 Xandra Barrett 2

Finding partial duplicates
dup_ids <- credit_scores %>%
count(first_name, last_name) %>%
filter(n > 1)
credit_scores %>%
filter(first_name %in% dup_ids$first_name, last_name %in% dup_ids$last_name)

first_name last_name address credit_score


1 Xandra Barrett P.O. Box 309, 2462 Pharetra, Rd. 620
2 Tamekah Forbes P.O. Box 147, 511 Velit Street 356
3 Miriam Day 6042 Sollicitudin Avenue 313
4 Xandra Barrett P.O. Box 309, 2462 Pharetra, Rd. 636
5 Tamekah Forbes P.O. Box 147, 511 Velit Street 342
...
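
One caveat about the filter above: the two `%in%` tests check each column independently, so a row could pass when its first name belongs to one duplicated person and its last name to another. A base R sketch that treats the name pair as a single key is shown below; the data frame is hypothetical and includes such a mixed row on purpose.

```r
# Hypothetical data: row 5 mixes a duplicated first name with a duplicated last name
credit_scores <- data.frame(
  first_name   = c("Tamekah", "Tamekah", "Xandra", "Xandra", "Tamekah"),
  last_name    = c("Forbes", "Forbes", "Barrett", "Barrett", "Barrett"),
  credit_score = c(356, 342, 620, 636, 500)
)

keys <- credit_scores[, c("first_name", "last_name")]

# TRUE for every row whose (first_name, last_name) pair occurs more than once
is_dup <- duplicated(keys) | duplicated(keys, fromLast = TRUE)

partial_dups <- credit_scores[is_dup, ]
print(partial_dups)
```

Row 5 is left out because the pair "Tamekah Barrett" occurs only once, even though both names appear among the duplicates.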

Handling partial duplicates: dropping
Drop all duplicates except one

First name Last name Address Credit score


1 Tamekah Forbes P.O. Box 147, 511 Velit Street 356
2 Tamekah Forbes P.O. Box 147, 511 Velit Street 342
3 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 620
4 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 636

Handling partial duplicates: dropping
Drop all duplicates except one

First name Last name Address Credit score


1 Tamekah Forbes P.O. Box 147, 511 Velit Street 356
2
3 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 620
4

Handling partial duplicates: dropping
Drop all duplicates except one

First name Last name Address Credit score


1 Tamekah Forbes P.O. Box 147, 511 Velit Street 356
3 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 620

Dropping partial duplicates
credit_scores %>%
distinct(first_name, last_name, .keep_all = TRUE)

first_name last_name address credit_score


1 Harlan Hebert P.O. Box 356, 3869 Non Av. 305
2 Drake Soto 643-1409 Ac Avenue 642
3 Felix Morales 741-1497 Velit Ave 780
4 Brynne Charles 313-3757 Ultrices St. 513
5 Aquila Dillon P.O. Box 945, 5550 Aliquam Street 748
...

Handling partial duplicates: summarizing
Summarize differing values using statistical summary functions ( mean() , max() , etc.)

First name Last name Address Credit score


1 Tamekah Forbes P.O. Box 147, 511 Velit Street 356
2 Tamekah Forbes P.O. Box 147, 511 Velit Street 342
3 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 620
4 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 636

Handling partial duplicates: summarizing
Summarize differing values using statistical summary functions ( mean() , max() , etc.)

  First name  Last name  Address                          Credit score  Mean credit score
1 Tamekah     Forbes     P.O. Box 147, 511 Velit Street   356           349
2 Tamekah     Forbes     P.O. Box 147, 511 Velit Street   342
3 Xandra      Barrett    P.O. Box 309, 2462 Pharetra Rd.  620           628
4 Xandra      Barrett    P.O. Box 309, 2462 Pharetra Rd.  636

Handling partial duplicates: summarizing
Summarize differing values using statistical summary functions ( mean() , max() , etc.)

  First name  Last name  Address                          Credit score  Mean credit score
1 Tamekah     Forbes     P.O. Box 147, 511 Velit Street   356           349
2
3 Xandra      Barrett    P.O. Box 309, 2462 Pharetra Rd.  620           628
4

Handling partial duplicates: summarizing
Summarize differing values using statistical summary functions ( mean() , max() , etc.)

First name Last name Address Credit score


1 Tamekah Forbes P.O. Box 147, 511 Velit Street 349
2
3 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 628
4

Handling partial duplicates: summarizing
Summarize differing values using statistical summary functions ( mean() , max() , etc.)

First name Last name Address Credit score


1 Tamekah Forbes P.O. Box 147, 511 Velit Street 349
3 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 628

Summarizing partial duplicates
credit_scores %>%
group_by(first_name, last_name) %>%
mutate(mean_credit_score = mean(credit_score))

first_name last_name address credit_score mean_credit_score


1 Tamekah Forbes P.O. Box 147, 511 Velit Street 356 349
2 Tamekah Forbes P.O. Box 147, 511 Velit Street 342 349
3 Xandra Barrett P.O. Box 309, 2462 Pharetra, Rd. 636 628
4 Xandra Barrett P.O. Box 309, 2462 Pharetra, Rd. 620 628
5 Katell Roy Ap #434-4081 Mi Av. 455 455
...

Summarizing partial duplicates
credit_scores %>%
group_by(first_name, last_name) %>%
mutate(mean_credit_score = mean(credit_score)) %>%
distinct(first_name, last_name, .keep_all = TRUE) %>%
select(-credit_score)

first_name last_name address mean_credit_score


<fct> <fct> <fct> <dbl>
1 Tamekah Forbes P.O. Box 147, 511 Velit Street 349
2 Xandra Barrett P.O. Box 309, 2462 Pharetra, Rd. 628
3 Katell Roy Ap #434-4081 Mi Av. 455
4 Miriam Day 6042 Sollicitudin Avenue 313
...
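
For comparison, a base R sketch of the same summarizing step using `aggregate()`. The scores are the ones shown on the slides; the addresses are left out to keep the example short, and the column name `mean_credit_score` is chosen here for illustration.

```r
# Credit scores from the slides, without the address column
credit_scores <- data.frame(
  first_name   = c("Tamekah", "Tamekah", "Xandra", "Xandra", "Katell"),
  last_name    = c("Forbes", "Forbes", "Barrett", "Barrett", "Roy"),
  credit_score = c(356, 342, 620, 636, 455)
)

# One row per (first_name, last_name) pair, holding the mean credit score
mean_scores <- aggregate(credit_score ~ first_name + last_name,
                         data = credit_scores, FUN = mean)
names(mean_scores)[3] <- "mean_credit_score"
print(mean_scores)
```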

Checking membership
Categorical data
Categorical variables have a fixed and known set of possible values

Data                        Example values
Marriage status             unmarried, married
Household income category   0-20K, 20-40K, ...
T-shirt size                S, M, L, XL

Factors
In a factor, each category is stored as a number and has a corresponding label

Data                        Labels                Numeric representation
Marriage status             unmarried, married    1, 2
Household income category   0-20K, 20-40K, ...    1, 2, ...
T-shirt size                S, M, L, XL           1, 2, 3, 4

Factor levels
tshirt_size

L XL XL L M M M L XL L S M M S S M XL S L S ...
Levels: S M L XL

levels(tshirt_size)

"S" "M" "L" "XL"

Values that don't belong
Factors cannot have values that fall outside of the predefined ones

Data                        Levels                Not allowed
Marriage status             unmarried, married    divorced
Household income category   0-20K, 20-40K, ...    10-30K
T-shirt size                S, M, L, XL           S/M

How do we end up with these values?

Filtering joins: a quick review
Keeps or removes observations from the first table without adding columns

Blood type example
study_data

  name     birthday   blood_type
1 Beth     2019-10-20 B-
2 Ignatius 2020-07-08 A-
3 Paul     2019-08-12 O+
4 Helen    2019-03-17 O-
5 Jennifer 2019-12-17 Z+
6 Kennedy  2020-04-27 A+
7 Keith    2019-04-19 AB+

blood_types

  blood_type
1 O-
2 O+
3 A-
4 A+
5 B+
6 B-
7 AB+
8 AB-

Blood type example
study_data

  name     birthday   blood_type
1 Beth     2019-10-20 B-
2 Ignatius 2020-07-08 A-
3 Paul     2019-08-12 O+
4 Helen    2019-03-17 O-
5 Jennifer 2019-12-17 Z+   <--
6 Kennedy  2020-04-27 A+
7 Keith    2019-04-19 AB+

blood_types

  blood_type
1 O-
2 O+
3 A-
4 A+
5 B+
6 B-
7 AB+
8 AB-

Finding non-members

Anti-join
study_data %>%
anti_join(blood_types, by = "blood_type")

name birthday blood_type


1 Jennifer 2019-12-17 Z+

Removing non-members

Semi-join
study_data %>%
semi_join(blood_types, by = "blood_type")

name birthday blood_type


1 Beth 2019-10-20 B-
2 Ignatius 2020-07-08 A-
3 Paul 2019-08-12 O+
4 Helen 2019-03-17 O-
5 Kennedy 2020-04-27 A+
6 Keith 2019-04-19 AB+
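
The membership check behind both joins can be sketched in base R with `%in%`. The rows below are a subset of the slide's `study_data`; this is an illustration of the idea, not how dplyr implements its joins.

```r
# Subset of the slide's study_data, plus the valid blood types
study_data <- data.frame(
  name       = c("Beth", "Ignatius", "Jennifer"),
  blood_type = c("B-", "A-", "Z+")
)
blood_types <- data.frame(
  blood_type = c("O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-")
)

# TRUE where the row's blood type is one of the valid values
is_member <- study_data$blood_type %in% blood_types$blood_type

non_members <- study_data[!is_member, ]  # same rows an anti-join would keep
members     <- study_data[is_member, ]   # same rows a semi-join would keep
print(non_members)
```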

Categorical data problems
Categorical data problems
Inconsistency within a category Too many categories

Example: animal classification
animals

# A tibble: 68 x 9
animal_name hair eggs fins legs tail type
<chr> <fct> <fct> <fct> <int> <fct> <fct>
1 mole 1 0 0 4 1 mammal
2 chicken 0 1 0 2 1 bird
3 capybara 1 0 0 2 1 Mammal
4 tuna 0 1 1 0 1 fish
5 ostrich 0 1 0 2 1 bird
# ... with 63 more rows

Checking categories
animals %>%
  count(type)

   type             n
 1 " mammal "       1
 2 "amphibian"      2
 3 "bird"          20
 4 "bug"            1
 5 "fish"           2
 6 "invertebrate"   1
 7 "mammal"        38
 8 "MAMMAL"         1
 9 "Mammal "        1
10 "reptile"        1

Variants of the same category: "mammal", " mammal ", "MAMMAL", "Mammal "

Case inconsistency
library(stringr)
animals %>%
mutate(type_lower = str_to_lower(type))

animal_name hair eggs fins legs tail type type_lower


<fct> <int> <int> <int> <int> <int> <fct> <chr>
1 mole 1 0 0 4 1 "mammal" "mammal"
2 chicken 0 1 0 2 1 "bird" "bird"
3 capybara 1 0 0 2 1 " Mammal" " mammal"
4 tuna 0 1 1 0 1 "fish" "fish"
5 ostrich 0 1 0 2 1 "bird" "bird"

Case inconsistency
animals %>%
mutate(type_lower = str_to_lower(type)) %>%
count(type_lower)

  type_lower         n
  <chr>          <int>
1 " mammal "         1
2 "amphibian"        2
3 "bird"            20
4 "bug"              1
5 "fish"             2
6 "invertebrate"     1
7 "mammal"          39
8 "mammal "          1
9 "reptile"          1

"MAMMAL" → "mammal"

Case inconsistency
animals %>%
mutate(type_upper = str_to_upper(type)) %>%
count(type_upper)

  type_upper         n
  <chr>          <int>
1 " MAMMAL "         1
2 "AMPHIBIAN"        2
3 "BIRD"            20
4 "BUG"              1
5 "FISH"             2
6 "INVERTEBRATE"     1
7 "MAMMAL"          39
8 "MAMMAL "          1
9 "REPTILE"          1

Whitespace inconsistency
animals %>%
mutate(type_trimmed = str_trim(type_lower))

animal_name hair eggs fins legs tail type_lower type_trimmed


<fct> <int> <int> <int> <int> <int> <chr> <chr>
1 mole 1 0 0 4 1 "mammal" mammal
2 chicken 0 1 0 2 1 "bird" bird
3 capybara 1 0 0 2 1 " mammal" mammal
4 tuna 0 1 1 0 1 "fish" fish
5 ostrich 0 1 0 2 1 "bird" bird

Whitespace inconsistency
animals %>%
mutate(type_trimmed = str_trim(type_lower)) %>%
count(type_trimmed)

  type_trimmed     n
  <chr>        <int>
1 amphibian        2
2 bird            20
3 bug              1
4 fish             2
5 invertebrate     1
6 mammal          41
7 reptile          1
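
The case and whitespace fixes can be combined in one step. Below is a base R sketch with `tolower()` and `trimws()` standing in for `str_to_lower()` and `str_trim()`; the label vector is hypothetical.

```r
# Hypothetical category labels with inconsistent case and whitespace
type <- c("mammal", " mammal ", "MAMMAL", "Mammal ", "bird", "fish")

# tolower() fixes case; trimws() strips leading and trailing whitespace
type_clean <- trimws(tolower(type))

# Count rows per cleaned category
type_counts <- table(type_clean)
print(type_counts)
```

All four spellings of "mammal" end up in a single category.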

Too many categories
animals %>%
count(type_trimmed, sort = TRUE)

type_trimmed n
1 mammal 41
2 bird 20
3 amphibian 2
4 fish 2
5 bug 1
6 invertebrate 1
7 reptile 1

Collapsing categories
other_categories = c("amphibian", "fish", "bug", "invertebrate", "reptile")

library(forcats)
animals %>%
mutate(type_collapsed = fct_collapse(type_trimmed, other = other_categories))

  animal_name  hair  eggs  fins  legs  tail type_trimmed type_collapsed
  <fct>       <int> <int> <int> <int> <int> <chr>        <fct>
1 mole            1     0     0     4     1 mammal       mammal
2 chicken         0     1     0     2     1 bird         bird
3 capybara        1     0     0     2     1 mammal       mammal
4 tuna            0     1     1     0     1 fish         other
5 ostrich         0     1     0     2     1 bird         bird

Collapsing categories
animals %>%
  count(type_collapsed)

  type_collapsed     n
  <fct>          <int>
1 other              7
2 bird              20
3 mammal            41

animals %>%
  group_by(type_collapsed) %>%
  summarize(avg_legs = mean(legs))

  type_collapsed avg_legs
  <fct>             <dbl>
1 other              3.71
2 bird               2
3 mammal             3.37

Cleaning text data
What is text data?
Type of data Example values
Names "Veronica Hopkins" , "Josiah" , ...

Phone numbers "6171679912" , "(868) 949-4489" , ...

Emails "[email protected]" , "[email protected]" , ...

Passwords "JZY46TVG8SM" , "iamjosiah21" , ...

Comments/Reviews "great service!" , "This product broke after 2 days" , ...

Unstructured data problems
Formatting inconsistency
"6171679912" vs. "(868) 949-4489"

"9239 5849 3712 0039" vs. "4490459957881031"

Information inconsistency
+1 617-167-9912 vs. 617-167-9912

"Veronica Hopkins" vs. "Josiah"

Invalid data
Phone number "0492" is too short

Zip code "19888" doesn't exist

Customer data
customers

# A tibble: 99 x 3
name company credit_card
<chr> <chr> <chr>
1 Galena In Magna Associates 5171 5854 8986 1916
2 MacKenzie Iaculis Ltd 5128-5078-8008-5824
3 Megan Acosta Semper LLC 5502 4529 0732 1744
4 Phoebe Delacruz Sit Amet Nulla Limited 5419-7308-7424-0944
5 Jessica Pellentesque Sed Ltd 5419 2949 5508 9530
# ... with 95 more rows

Detecting hyphenated credit card numbers
str_detect(customers$credit_card, "-")

FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE ...

customers %>%
filter(str_detect(credit_card, "-"))

name company credit_card


1 MacKenzie Iaculis Ltd 5128-5078-8008-5824
2 Phoebe Delacruz Sit Amet Nulla Limited 5419-7308-7424-0944
3 Abel Lorem PC 5211-6023-0805-0217
...

Replacing hyphens
customers %>%
mutate(credit_card_spaces = str_replace_all(credit_card, "-", " "))

name company credit_card_spaces


1 Galena In Magna Associates 5171 5854 8986 1916
2 MacKenzie Iaculis Ltd 5128 5078 8008 5824
3 Megan Acosta Semper LLC 5502 4529 0732 1744
4 Phoebe Delacruz Sit Amet Nulla Limited 5419 7308 7424 0944
5 Jessica Pellentesque Sed Ltd 5419 2949 5508 9530
...

Removing hyphens and spaces
credit_card_clean <- customers$credit_card %>%
str_remove_all("-") %>%
str_remove_all(" ")
customers %>%
mutate(credit_card = credit_card_clean)

name company credit_card


1 Galena In Magna Associates 5171585489861916
2 MacKenzie Iaculis Ltd 5128507880085824
3 Megan Acosta Semper LLC 5502452907321744
...

Finding invalid credit cards
str_length(customers$credit_card)

16 16 16 16 16 16 16 16 16 16 16 16 12 16 16 16 16 16 16 16 16 16 16 16 16 ...

customers %>%
filter(str_length(credit_card) != 16)

name company credit_card


1 Jerry Russell Sed Eu Company 516294099537
2 Ivor Christian Ut Tincidunt Incorporated 544571330015
3 Francesca Drake Etiam Consulting 517394144089

Removing invalid credit cards
customers %>%
filter(str_length(credit_card) == 16)

name company credit_card


1 Galena In Magna Associates 5171585489861916
2 MacKenzie Iaculis Ltd 5128507880085824
3 Megan Acosta Semper LLC 5502452907321744
4 Phoebe Delacruz Sit Amet Nulla Limited 5419730874240944
5 Jessica Pellentesque Sed Ltd 5419294955089530
...

More complex text problems
A regular expression is a sequence of characters that allows for robust searching within a string.

Certain characters are treated differently in a regular expression:
( ) [ ] $ . + * and others

stringr functions use regular expressions

Searching for these characters literally requires escaping them or using fixed():


str_detect(column, fixed("$"))
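
To see why the literal match matters, here is a base R sketch using `grepl()`, whose `fixed = TRUE` argument plays a role similar to stringr's `fixed()`. The `prices` vector is made up for illustration.

```r
# Hypothetical strings; only the first contains a dollar sign
prices <- c("$5454", "5668", "USD 4062")

# As a regular expression, "$" means "end of string", so every element matches
regex_match <- grepl("$", prices)

# fixed = TRUE searches for the literal dollar sign instead
literal_match <- grepl("$", prices, fixed = TRUE)

print(regex_match)
print(literal_match)
```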

Learn more in String Manipulation with stringr in R & Intermediate Regular Expressions in R

Uniformity
Uniformity
Different units or formats
Temperature: °C vs. °F

Weight: kg vs. g vs. lb

Money: USD $ vs. GBP £ vs. JPY ¥

Date: DD-MM-YYYY vs. MM-DD-YYYY vs. YYYY-MM-DD

Where do uniformity issues come from?

Finding uniformity issues
head(nyc_temps)

date temp
1 2019-04-01 4.2
2 2019-04-02 7.5
3 2019-04-03 12.2
4 2019-04-04 11.1
5 2019-04-05 41.5
6 2019-04-06 11.9

Finding uniformity issues
library(ggplot2)
ggplot(nyc_temps, aes(x = date, y = temp)) +
geom_point()

What to do?
There's no one best option. It depends on your dataset!
Do your research to understand where your data comes from

Data from Apr 7, 16, and 23 is from an external source that measured temps in °F

Unit conversion
$C = (F - 32) \times \frac{5}{9}$
ifelse(condition, value_if_true, value_if_false)

nyc_temps %>%
mutate(temp_c = ifelse(temp > 50, (temp - 32) * 5 / 9, temp))

date temp temp_c


1 2019-04-01 4.2 4.20000
...
7 2019-04-07 58.5 14.72222
...

Unit conversion
nyc_temps %>%
mutate(temp_c = ifelse(temp > 50, (temp - 32) * 5 / 9, temp)) %>%
ggplot(aes(x = date, y = temp_c)) +
geom_point()

Date uniformity
nyc_temps

  date            temp_c
1 2019-11-23        5.12
2 01/15/19         -0.67
3 April 24, 2019   17.46
4 08/30/19         26.46
5 October 3, 2019  14.63
6 2019-03-17        3.47

Date string        Date format
"2019-11-23"       "%Y-%m-%d"
"01/15/19"         "%m/%d/%y"
"April 24, 2019"   "%B %d, %Y"

See ?strptime in the R console for more format codes.

Parsing multiple formats
library(lubridate)
parse_date_time(nyc_temps$date,
orders = c("%Y-%m-%d", "%m/%d/%y", "%B %d, %Y"))

"2019-11-23 UTC" "2019-01-15 UTC" "2019-04-24 UTC" "2019-08-30 UTC"


"2019-10-03 UTC" "2019-03-17 UTC"

parse_date_time("Monday, January 3",
                orders = c("%Y-%m-%d", "%m/%d/%y", "%B %d, %Y"))

NA
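
The idea of trying several formats in turn can be sketched in base R. `parse_mixed()` below is a hypothetical helper written for this illustration, not part of lubridate; it tries each format per element and keeps the first that parses. Month-name formats such as "%B" depend on the session locale, so the example sticks to the two numeric formats.

```r
# Hypothetical helper: try each format in turn and keep the first that parses
parse_mixed <- function(x, formats) {
  out <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    todo <- is.na(out)                          # elements not parsed yet
    out[todo] <- as.Date(x[todo], format = fmt) # NA where the format does not fit
  }
  out
}

dates <- c("2019-11-23", "01/15/19", "08/30/19", "Monday, January 3")
parsed <- parse_mixed(dates, c("%Y-%m-%d", "%m/%d/%y"))
print(parsed)
```

As on the slide, a string that fits none of the formats stays NA.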

Ambiguous dates
Is 02/04/2019 in February or April?

Depends on your data!

Options include:

Treat as missing

If your data comes from multiple sources, infer based on source

Infer based on other data in the dataset

Cross field validation
What is cross field validation?
Cross field validation = a sanity check
Does this value make sense based on other values?

1 https://www.buzzfeednews.com/article/katienotopoulos/graphs-that-lied-to-us

Credit card data
head(credit_cards)

date_opened dining_cb groceries_cb gas_cb total_cb acct_age


1 2018-07-05 26.08 83.43 78.90 188.41 1
2 2016-01-23 1309.33 4.46 1072.25 2386.04 4
3 2016-03-25 205.84 119.20 800.62 1125.66 4
4 2018-06-20 14.00 16.37 18.41 48.78 1
5 2017-02-08 98.50 283.68 281.70 788.33 3
6 2014-11-18 889.28 2626.34 2973.62 6489.24 5

Validating numbers
credit_cards %>%
select(dining_cb:total_cb)

dining_cb groceries_cb gas_cb total_cb


1 26.08 83.43 78.90 188.41
2 1309.33 4.46 1072.25 2386.04
3 205.84 119.20 800.62 1125.66
4 14.00 16.37 18.41 48.78
5 98.50 283.68 281.70 788.33
6 889.28 2626.34 2973.62 6489.24

Validating numbers
credit_cards %>%
mutate(theoretical_total = dining_cb + groceries_cb + gas_cb) %>%
filter(theoretical_total != total_cb) %>%
select(dining_cb:theoretical_total)

dining_cb groceries_cb gas_cb total_cb theoretical_total


1 98.50 283.68 281.70 788.33 663.88
2 3387.53 363.85 2706.42 4502.94 6457.80

Validating date and age
credit_cards %>%
select(date_opened, acct_age)

date_opened acct_age
1 2018-07-05 1
2 2016-01-23 4
3 2016-03-25 4
4 2018-06-20 1
5 2017-02-08 3
6 2014-11-18 5

Calculating age
library(lubridate)
date_difference <- as.Date("2015-09-04") %--% today()
date_difference

2015-09-04 UTC--2020-03-09 UTC

as.numeric(date_difference, "years")

4.511978

floor(as.numeric(date_difference, "years"))

Validating age
credit_cards %>%
mutate(theor_age = floor(as.numeric(date_opened %--% today(), "years"))) %>%
filter(theor_age != acct_age)

date_opened acct_age dining_cb groceries_cb gas_cb total_cb theor_age


1 2016-03-25 4 814.34 471.58 3167.41 4453.33 3
2 2018-03-06 3 238.48 186.05 213.84 638.37 2

What next?

Completeness
What is missing data?

Can be represented as NA, nan, 0, 99, ., ...

Air quality
head(airquality)

Ozone Solar.R Wind Temp Month Day


1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6

Finding missing values
is.na(airquality)

Ozone Solar.R Wind Temp Month Day


[1,] FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE
[5,] TRUE TRUE FALSE FALSE FALSE FALSE
[6,] FALSE TRUE FALSE FALSE FALSE FALSE

Counting missing values
# Count missing vals in entire dataset
sum(is.na(airquality))

44
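
The total can also be broken down by column. A base R sketch using the `airquality` dataset that ships with R:

```r
# airquality ships with base R, so no file or package is needed
missing_total <- sum(is.na(airquality))

# colSums() adds up the TRUE values in each column of the logical matrix
missing_per_column <- colSums(is.na(airquality))
print(missing_per_column)

# Share of rows with a missing Ozone value
ozone_missing_share <- mean(is.na(airquality$Ozone))
print(ozone_missing_share)
```

In the built-in dataset, all missing values fall in the Ozone and Solar.R columns.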

Visualizing missing values
library(visdat)
vis_miss(airquality)

Investigating missingness
airquality %>%
mutate(miss_ozone = is.na(Ozone)) %>%
group_by(miss_ozone) %>%
summarize(across(everything(), ~ median(.x, na.rm = TRUE)))

miss_ozone Ozone Solar.R Wind Temp Month Day


<lgl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 FALSE 31.5 207 9.7 65 7 16
2 TRUE NA 194 9.7 99 6 15

Investigating missingness
airquality %>%
arrange(Temp) %>%
vis_miss()

Types of missingness

Dealing with missing data
Simple approaches:

1. Drop missing data

2. Impute (fill in) with statistical measures (mean, median, mode..) or domain knowledge

More complex approaches:

1. Impute using an algorithmic approach

2. Impute with machine learning models

Learn more in Dealing with Missing Data in R

Dropping missing values
airquality %>%
filter(!is.na(Ozone), !is.na(Solar.R))

Ozone Solar.R Wind Temp Month Day


<int> <int> <dbl> <int> <int> <int>
1 41 190 7.4 67 5 1
2 36 118 8 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 23 299 8.6 65 5 7
6 19 99 13.8 59 5 8

Replacing missing values
airquality %>%
mutate(ozone_filled = ifelse(is.na(Ozone), mean(Ozone, na.rm = TRUE), Ozone))

Ozone Solar.R Wind Temp Month Day ozone_filled


<int> <int> <dbl> <int> <int> <int> <dbl>
1 41 190 7.4 67 5 1 41
2 36 118 8 72 5 2 36
3 12 149 12.6 74 5 3 12
4 18 313 11.5 62 5 4 18
5 NA NA 14.3 56 5 5 42.1

Comparing strings
Measuring distance between values

What's the distance between typhoon and baboon?

Minimum edit distance
How many typos are needed to get from one string to another?

Edit distance = 1

A more complex example
baboon → typhoon
Insert h

Substitute b → t

Substitute a → y

Substitute b → p

Total: 4
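
The count can be checked in base R with `adist()` from the utils package. `adist()` computes the Levenshtein distance (insertion, deletion, substitution), which does not count a swap of two adjacent characters as a single edit; for this pair no swap is involved, so the result matches the total above.

```r
# adist() returns a matrix of edit distances; here it is 1 x 1
d <- adist("baboon", "typhoon")
print(d[1, 1])

# A swap of adjacent characters costs two edits under plain Levenshtein
d_swap <- adist("ab", "ba")
print(d_swap[1, 1])
```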

Types of edit distance
Damerau-Levenshtein
What you just learned

Levenshtein
Considers only substitution, insertion, and deletion

LCS (Longest Common Subsequence)


Considers only insertion and deletion

Others
Jaro-Winkler

Jaccard

Which is best?

String distance in R
library(stringdist)
stringdist("baboon",
"typhoon",
method = "dl")

Other methods
# LCS
stringdist("baboon", "typhoon",
method = "lcs")

# Jaccard
stringdist("baboon", "typhoon",
method = "jaccard")

0.75
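
The Jaccard result can be reproduced by hand: it compares the sets of characters in each string. `jaccard_dist()` below is a hypothetical helper written for this illustration (single characters only), not the stringdist implementation.

```r
# Hypothetical helper: Jaccard distance on the sets of single characters
jaccard_dist <- function(a, b) {
  chars_a <- unique(strsplit(a, "")[[1]])
  chars_b <- unique(strsplit(b, "")[[1]])
  shared <- length(intersect(chars_a, chars_b))  # characters in both strings
  total  <- length(union(chars_a, chars_b))      # characters in either string
  1 - shared / total
}

# "baboon" has {b, a, o, n}; "typhoon" has {t, y, p, h, o, n}
# 2 shared out of 8 distinct characters, so the distance is 1 - 2/8
print(jaccard_dist("baboon", "typhoon"))
```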

Comparing strings to clean data
In Chapter 2:
"EU" , "eur" , "Europ" → "Europe"

What if there are too many variations?


"EU" , "eur" , "Europ" , "Europa" , "Erope" , "Evropa" , ... → "Europe" ?

Use string distance!

Comparing strings to clean data
survey

  city       move_score
1 chicgo              4
2 los angles          4
3 chicogo             5
4 new yrk             5
5 new yoork           2
6 seatttle            3
7 losangeles          4
8 seeatle             2
...

cities

  city
1 new york
2 chicago
3 los angeles
4 seattle

Remapping using string distance
library(fuzzyjoin)
stringdist_left_join(survey, cities, by = "city", method = "dl")

city.x move_score city.y


1 chicgo 4 chicago
2 los angles 4 los angeles
3 chicogo 5 chicago
4 new yrk 5 new york
5 new yoork 2 new york
6 seatttle 3 seattle
7 losangeles 4 los angeles
8 seeatle 2 seattle
9 siattle 1 seattle
...

Remapping using string distance
stringdist_left_join(survey, cities, by = "city", method = "dl", max_dist = 1)

city.x move_score city.y


1 chicgo 4 chicago
2 los angles 4 los angeles
3 chicogo 5 chicago
4 new yrk 5 new york
5 new yoork 2 new york
6 seatttle 3 seattle
7 losangeles 4 los angeles
8 seeatle 2 <NA>
9 siattle 1 seattle
...
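
The remapping idea can be sketched without fuzzyjoin: compute the distance from each survey value to each valid city, pick the closest, and keep it only if it is within the threshold. The sketch uses base R's `adist()` (Levenshtein) rather than the "dl" method above, and only four of the slide's survey values.

```r
survey_city <- c("chicgo", "los angles", "new yrk", "seeatle")
cities <- c("new york", "chicago", "los angeles", "seattle")

# Distance from every survey value (rows) to every valid city (columns)
dist_matrix <- adist(survey_city, cities)

nearest      <- apply(dist_matrix, 1, which.min)  # index of the closest city
nearest_dist <- apply(dist_matrix, 1, min)        # distance to that city

# Keep the match only when it is within max_dist edits
max_dist <- 1
city_clean <- ifelse(nearest_dist <= max_dist, cities[nearest], NA)
print(city_clean)
```

As in the join output, "seeatle" is two edits away from "seattle" and is left unmatched.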

Generating and comparing pairs
When joins won't work

What is record linkage?

Pairs of records

Generating pairs

Generating pairs in R
library(reclin)
pair_blocking(df_A, df_B)

Simple blocking
No blocking used.
First data set: 5 records
Second data set: 5 records
Total number of pairs: 25 pairs
ldat with 25 rows and 2 columns
x y
1 1 1
2 2 1
3 3 1
...

Too many pairs

Blocking

Only consider pairs when they agree on the blocking variable (State)

Pair blocking in R
pair_blocking(df_A, df_B, blocking_var = "state")

Simple blocking
Blocking variable(s): state
First data set: 5 records
Second data set: 5 records
Total number of pairs: 8 pairs

ldat with 8 rows and 2 columns
  x y
1 1 1
2 1 4
3 2 3
4 2 5
5 3 2
6 4 2
7 5 1
8 5 4
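To see where the 25 and the 8 come from, the pairing can be imitated in base R. This is an illustrative sketch, not how reclin is implemented. It uses only the state columns of df_A and df_B as listed on the "Last lesson" slide, and the variable names are made up for the example.

```r
# State columns of the two data sets (values from the df_A / df_B slide)
state_A <- c("NY", "PA", "NJ", "NJ", "NY")
state_B <- c("NY", "NJ", "PA", "NY", "PA")

# No blocking: every record in A is paired with every record in B
all_pairs <- expand.grid(x = seq_along(state_A), y = seq_along(state_B))
nrow(all_pairs)      # 25

# Blocking: keep only the pairs whose states agree
blocked_pairs <- all_pairs[state_A[all_pairs$x] == state_B[all_pairs$y], ]
blocked_pairs <- blocked_pairs[order(blocked_pairs$x, blocked_pairs$y), ]
nrow(blocked_pairs)  # 8
```

The saving grows quickly with data size: without blocking the number of pairs is the product of the two row counts.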

Comparing pairs

Comparing pairs
pair_blocking(df_A, df_B, blocking_var = "state") %>%
compare_pairs(by = "name", default_comparator = lcs())

Compare
By: name

Simple blocking
Blocking variable(s): state
First data set: 5 records
Second data set: 5 records
Total number of pairs: 8 pairs

ldat with 8 rows and 3 columns
  x y      name
1 1 1 0.3529412
2 1 4 0.3030303
3 2 3 0.9285714
4 2 5 0.2307692
...
8 5 4 0.3333333

Comparing multiple columns
pair_blocking(df_A, df_B, blocking_var = "state") %>%
compare_pairs(by = c("name", "zip"), default_comparator = lcs())

Compare
By: name, zip

Simple blocking
Blocking variable(s): state
First data set: 5 records
Second data set: 5 records
Total number of pairs: 8 pairs

ldat with 8 rows and 4 columns
  x y      name zip
1 1 1 0.3529412 0.4
2 1 4 0.3030303 0.2
3 2 3 0.9285714 1.0
4 2 5 0.2307692 0.2
...
8 5 4 0.3333333 0.2

Different comparators
default_comparator = lcs()

default_comparator = jaccard()

default_comparator = jaro_winkler()
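The similarity scores in the comparison output can be checked by hand with the stringdist package. This is a sketch under an assumption: that the comparators behave like `stringdist::stringsim()`, which rescales a distance to a 0-1 similarity. The lcs values below agree with the name and zip columns shown earlier, but whether reclin computes them exactly this way is not confirmed by the slides.

```r
library(stringdist)

# Name similarity for the pair x = 2, y = 3: 1 - 2 / (15 + 13)
name_sim <- stringsim("Keaton Z Snyder", "Keaton Snyder", method = "lcs")
name_sim   # 0.9285714

# Zip similarity for the pair x = 1, y = 1 (zips kept as text): 1 - 6 / 10
zip_sim <- stringsim("10456", "11743", method = "lcs")
zip_sim    # 0.4

# The same name pair under the other two measures
jaccard_sim <- stringsim("Keaton Z Snyder", "Keaton Snyder", method = "jaccard")
jw_sim <- stringsim("Keaton Z Snyder", "Keaton Snyder", method = "jw", p = 0.1)
```

Storing zip codes as character keeps leading zeros such as "07799", which would be lost in a numeric column.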

Scoring and linking
CLEANING DATA IN R

Maggie Matsui
Content Developer @ DataCamp
Last lesson
df_A

  name                zip   state
1 Christine M. Conner 10456 NY
2 Keaton Z Snyder     15020 PA
3 Arthur Potts        07799 NJ
4 Maia Collier        07960 NJ
5 Atkins, Alice W.    10603 NY

df_B

  name             zip   state
1 Jerome A. Yates  11743 NY
2 Garrison, Brenda 08611 NJ
3 Keaton Snyder    15020 PA
4 Stuart, Bert F   12211 NY
5 Hayley Peck      19134 PA

Where we left off
pair_blocking(df_A, df_B, blocking_var = "state") %>%
compare_pairs(by = c("name", "zip"), default_comparator = lcs())

x y name zip
1 1 1 0.3529412 0.4
2 1 4 0.3030303 0.2
3 2 3 0.9285714 1.0
4 2 5 0.2307692 0.2
5 3 2 0.2142857 0.2
6 4 2 0.2857143 0.4
7 5 1 0.1935484 0.4
8 5 4 0.3333333 0.2

Scoring pairs

Scoring with sums
x y name zip
1 1 1 0.3529412 + 0.4 =
2 1 4 0.3030303 + 0.2 =
3 2 3 0.9285714 + 1.0 =
4 2 5 0.2307692 + 0.2 =
5 3 2 0.2142857 + 0.2 =
6 4 2 0.2857143 + 0.4 =
7 5 1 0.1935484 + 0.4 =
8 5 4 0.3333333 + 0.2 =

pair_blocking(df_A, df_B, blocking_var = "state") %>%
compare_pairs(by = c("name", "zip"), default_comparator = lcs()) %>%
score_simsum()

  x y      name zip    simsum
1 1 1 0.3529412 0.4 0.7529412
2 1 4 0.3030303 0.2 0.5030303
3 2 3 0.9285714 1.0 1.9285714 <--
4 2 5 0.2307692 0.2 0.4307692
5 3 2 0.2142857 0.2 0.4142857
6 4 2 0.2857143 0.4 0.6857143
7 5 1 0.1935484 0.4 0.5935484
8 5 4 0.3333333 0.2 0.5333333
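The simsum column is just a row-wise sum of the comparison scores. The base R sketch below recreates it from the values in the output. The data frame name `pairs` is chosen for illustration.

```r
# Comparison scores copied from the output above
pairs <- data.frame(
  x = c(1, 1, 2, 2, 3, 4, 5, 5),
  y = c(1, 4, 3, 5, 2, 2, 1, 4),
  name = c(0.3529412, 0.3030303, 0.9285714, 0.2307692,
           0.2142857, 0.2857143, 0.1935484, 0.3333333),
  zip = c(0.4, 0.2, 1.0, 0.2, 0.2, 0.4, 0.4, 0.2)
)

# Score each pair by adding its column similarities
pairs$simsum <- rowSums(pairs[, c("name", "zip")])

# The highest-scoring pair is the most likely match (x = 2, y = 3)
best_pair <- pairs[which.max(pairs$simsum), ]
best_pair
```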

Disadvantages of summing
Summing treats every column as equally informative, but:

Two records with a similar name (Keaton Z Snyder & Keaton Snyder) are more likely to be a match

Two records with the same sex (Male & Male) are not as likely to be a match

Use probabilistic scoring!
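The idea behind probabilistic scoring can be sketched with agreement weights in the Fellegi-Sunter style: a column counts for more when agreement is common among true matches (m) but rare among non-matches (u). The m and u values below are invented for illustration and the helper `agreement_weight` is hypothetical. This shows the principle only and is not the implementation of `score_problink()`.

```r
# Hypothetical probabilities, chosen only to illustrate the idea:
# m = P(column agrees | records are a true match)
# u = P(column agrees | records are not a match)
m <- c(name = 0.90, sex = 0.95)
u <- c(name = 0.01, sex = 0.50)

# Log-likelihood-ratio weight earned when a column agrees
agreement_weight <- function(m, u) log(m / u)

weights <- agreement_weight(m, u)
weights
# Name agreement earns a far larger weight than sex agreement, because two
# unrelated people rarely share a name but often share a sex.
```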

Scoring probabilistically
pair_blocking(df_A, df_B, blocking_var = "state") %>%
compare_pairs(by = c("name", "zip"), default_comparator = lcs()) %>%
score_problink()

  x y      name zip    weight
1 1 1 0.3529412 0.4 -1.011599
2 1 4 0.3030303 0.2 -2.219198
3 2 3 0.9285714 1.0 16.019278
4 2 5 0.2307692 0.2 -2.590260
5 3 2 0.2142857 0.2 -2.685570
6 4 2 0.2857143 0.4 -1.321753
7 5 1 0.1935484 0.4 -1.832576
8 5 4 0.3333333 0.2 -2.079436

Linking pairs

Selecting matches
pair_blocking(df_A, df_B, blocking_var = "state") %>%
compare_pairs(by = c("name", "zip"), default_comparator = lcs()) %>%
score_problink() %>%
select_n_to_m()

  x y      name zip    weight select
1 1 1 0.3529412 0.4 -1.011599  FALSE
2 1 4 0.3030303 0.2 -2.219198  FALSE
3 2 3 0.9285714 1.0 16.019278   TRUE
4 2 5 0.2307692 0.2 -2.590260  FALSE
5 3 2 0.2142857 0.2 -2.685570  FALSE
6 4 2 0.2857143 0.4 -1.321753  FALSE
...
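One way to picture the selection step is a greedy pass over the scored pairs: take the highest weight first and skip any pair whose record is already used. This sketch is a simplification written for illustration. The helper `select_greedy_sketch` and the threshold of 0 are assumptions, and `select_n_to_m()` may choose pairs differently.

```r
# Scored pairs copied from the probabilistic scoring output
scored <- data.frame(
  x = c(1, 1, 2, 2, 3, 4, 5, 5),
  y = c(1, 4, 3, 5, 2, 2, 1, 4),
  weight = c(-1.011599, -2.219198, 16.019278, -2.590260,
             -2.685570, -1.321753, -1.832576, -2.079436)
)

# Hypothetical helper: mark pairs so each x and each y is used at most once
select_greedy_sketch <- function(pairs, threshold = 0) {
  pairs$select <- FALSE
  used_x <- c()
  used_y <- c()
  for (i in order(pairs$weight, decreasing = TRUE)) {
    if (pairs$weight[i] <= threshold) break   # remaining pairs score too low
    if (!(pairs$x[i] %in% used_x) && !(pairs$y[i] %in% used_y)) {
      pairs$select[i] <- TRUE
      used_x <- c(used_x, pairs$x[i])
      used_y <- c(used_y, pairs$y[i])
    }
  }
  pairs
}

selected <- select_greedy_sketch(scored)
selected[selected$select, c("x", "y")]   # only the pair x = 2, y = 3
```

With this data only one pair clears the threshold, which is consistent with the single TRUE in the select column above.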

Linking the data
pair_blocking(df_A, df_B, blocking_var = "state") %>%
compare_pairs(by = c("name", "zip"), default_comparator = lcs()) %>%
score_problink() %>%
select_n_to_m() %>%
link()

Linked data
  name.x              zip.x state.x name.y           zip.y state.y
1 Keaton Z Snyder     15020 PA      Keaton Snyder    15020 PA
2 Christine M. Conner 10456 NY      <NA>             <NA>  <NA>
3 Arthur Potts        07799 NJ      <NA>             <NA>  <NA>
4 Maia Collier        07960 NJ      <NA>             <NA>  <NA>
5 Atkins, Alice W.    10603 NY      <NA>             <NA>  <NA>
6 <NA>                <NA>  <NA>    Jerome A. Yates  11743 NY
7 <NA>                <NA>  <NA>    Garrison, Brenda 08611 NJ
8 <NA>                <NA>  <NA>    Stuart, Bert F   12211 NY
9 <NA>                <NA>  <NA>    Hayley Peck      19134 PA

What you learned

Chapter 1: Common Data Problems

Chapter 2: Text and Categorical Data

Chapter 3: Advanced Data Problems

Chapter 4: Record Linkage

Expand and build upon your new skills
Categorical Data
  Categorical Data in the Tidyverse

Text Data
  String Manipulation with stringr in R
  Intermediate Regular Expressions in R

Writing Clean Code
  Defensive R Programming
