Cleaning Data in R

The document outlines the process of cleaning data in R, focusing on data type constraints, checking and converting data types, handling out-of-range values, and managing duplicates. It emphasizes the importance of clean data for accurate analysis and provides examples of common data issues and solutions. Techniques such as using libraries like dplyr and assertive are highlighted for effective data manipulation.

Uploaded by

Kasper B.

Data type constraints
CLEANING DATA IN R

Maggie Matsui
Content Developer @ DataCamp
Course outline

Chapter 1 - Common data problems

Why do we need clean data?

Data type constraints
Data type   Example                                  R data type
Text        First name, last name, address, ...      character
Integer     Subscriber count, # products sold, ...   integer
Decimal     Temperature, exchange rate, ...          numeric
Binary      Is married, new customer, yes/no, ...    logical
Category    Marriage status, color, ...              factor
Date        Order dates, date of birth, ...          Date
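
As a rough illustration of the table above, the sketch below creates one value of each type in base R and inspects it with class(). The variable names and values are made up for this example.

```r
# Hypothetical example values, one per row of the table above
first_name  <- "Miriam"                  # text
n_products  <- 12L                       # the L suffix makes an integer
temperature <- 21.5                      # decimal
is_married  <- TRUE                      # binary
color       <- factor("red", levels = c("red", "green", "blue"))  # category
order_date  <- as.Date("2020-03-15")     # date

# class() reports the R data type of each value
class(first_name)
class(n_products)
class(temperature)
class(is_married)
class(color)
class(order_date)
```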

Glimpsing at data types
sales <- read.csv("sales.csv")
head(sales)

  order_id revenue quantity
1     7432   5,454      494
2     7808   5,668      334
3     4893   4,062      259
4     6107   3,936       15
5     7661   1,067      307
6     5908   6,635      235

library(dplyr)
glimpse(sales)

Observations: 100
Variables: 3
$ order_id <dbl> 7432, 7808, ...
$ revenue  <chr> "5,454", "5,668", ...
$ quantity <dbl> 494, 334, ...

Checking data types
is.numeric(sales$revenue)

FALSE

library(assertive)
assert_is_numeric(sales$revenue)

Error: is_numeric : sales$revenue is not of class 'numeric'; it has class 'character'.

assert_is_numeric(sales$quantity)

Checking data types
Logical checking - returns TRUE / FALSE   Assertive checking - errors when FALSE
is.character()                            assert_is_character()
is.numeric()                              assert_is_numeric()
is.logical()                              assert_is_logical()
is.factor()                               assert_is_factor()
is.Date()                                 assert_is_date()
...                                       ...
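
The difference between the two styles can be sketched in base R alone. Here stopifnot() stands in for the assert_is_*() functions; it is not the assertive API, only a base R function with the same "silent on success, error on failure" behavior. The small sales data frame is a made-up stand-in for the one in the slides.

```r
# Hypothetical stand-in for the sales data used in the slides
sales <- data.frame(
  order_id = c(7432, 7808, 4893),
  revenue  = c("5,454", "5,668", "4,062"),
  quantity = c(494, 334, 259),
  stringsAsFactors = FALSE
)

# Logical check: returns TRUE or FALSE and lets the script continue
revenue_ok  <- is.numeric(sales$revenue)
quantity_ok <- is.numeric(sales$quantity)

# Assertion-style check: no output when the condition holds
stopifnot(is.numeric(sales$quantity))

# The same check on revenue raises an error; tryCatch() captures it
# so that this sketch can run to the end
revenue_check <- tryCatch(
  {
    stopifnot(is.numeric(sales$revenue))
    "passed"
  },
  error = function(e) "failed"
)
revenue_check
```

A logical check suits an if statement; an assertion suits a pipeline that should stop as soon as the data breaks an expectation.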

Why does data type matter?
class(sales$revenue)

"character"

mean(sales$revenue)

NA
Warning message:
In mean.default(sales$revenue) :
argument is not numeric or logical: returning NA

Comma problems
sales$revenue

"5,454" "5,668" "4,062" "3,936" "1,067" ...

Character to number
library(stringr)
revenue_trimmed = str_remove(sales$revenue, ",")
revenue_trimmed

"5454" "5668" "4062" "3936" "1067" ...

as.numeric(revenue_trimmed)

5454 5668 4062 3936 1067 ...
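
One caveat worth sketching: str_remove() drops only the first match in each string, which is enough for the values shown here but not for a figure such as "1,234,567". The example below uses base R's sub() and gsub(), which behave like str_remove() and str_remove_all() respectively for a fixed pattern. The revenue vector is made up for illustration.

```r
# Hypothetical revenue strings, including one with two commas
revenue <- c("5,454", "1,067", "1,234,567")

# sub() replaces only the first comma, like str_remove()
first_only <- sub(",", "", revenue, fixed = TRUE)

# gsub() replaces every comma, like str_remove_all()
all_commas <- gsub(",", "", revenue, fixed = TRUE)

# Only the fully cleaned strings convert without producing NA
revenue_usd <- as.numeric(all_commas)
revenue_usd
```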

Putting it together
sales %>%
  mutate(revenue_usd = as.numeric(str_remove(revenue, ",")))

# A tibble: 100 x 4
  order_id revenue quantity revenue_usd
     <dbl> <chr>      <dbl>       <dbl>
1     7432 5,454        494        5454
2     7808 5,668        334        5668
3     4893 4,062        259        4062
4     6107 3,936         15        3936
5     7661 1,067        307        1067
# ... with 95 more rows

Same function, different outcomes
mean(sales$revenue)

NA
Warning message:
In mean.default(sales$revenue) :
argument is not numeric or logical: returning NA

mean(sales$revenue_usd)

5361.4

Converting data types
as.character()

as.numeric()

as.logical()

as.factor()

as.Date()

...

Watch out: factor to numeric
product_type

1000 1000 3000 2000 3000
Levels: 1000 2000 3000

class(product_type)

"factor"

as.numeric(product_type)

1 1 3 2 3

as.numeric(as.character(product_type))

1000 1000 3000 2000 3000

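
The factor pitfall above can be reproduced with a few lines of base R. This is a minimal sketch using the same five values as the slide; the variable names other than product_type are chosen here for illustration.

```r
# Same values as the slide, stored as a factor
product_type <- factor(c(1000, 1000, 3000, 2000, 3000))

# Direct conversion returns the internal level codes, not the labels
codes <- as.numeric(product_type)

# Converting to character first recovers the original numbers
values <- as.numeric(as.character(product_type))

# An equivalent idiom: convert the levels once, then index by the factor
values_alt <- as.numeric(levels(product_type))[product_type]
```

The codes follow the sorted order of the levels, which is why 1000 maps to 1, 2000 to 2, and 3000 to 3.
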
Range constraints
What's an out of range value?
SAT score: 400-1600
Package weight: at least 0 lb/kg

Adult heart rate: 60-100 beats per minute

Finding out of range values
movies

title avg_rating
<chr> <dbl>
1 A Beautiful Mind 4.1
2 La Vita e Bella 4.3
3 Amelie 4.2
4 Meet the Parents 3.5
5 Unbreakable 5.8
6 Gone in Sixty Seconds 3.3
...

Finding out of range values
library(ggplot2)
breaks <- c(min(movies$avg_rating), 0, 5, max(movies$avg_rating))
ggplot(movies, aes(avg_rating)) +
  geom_histogram(breaks = breaks)

Finding out of range values
library(assertive)
assert_all_are_in_closed_range(movies$avg_rating, lower = 0, upper = 5)

Error: is_in_closed_range : movies$avg_rating are not all in the range [0,5].

There were 3 failures:
Position Value Cause
1 5 5.8 too high
2 8 6.2 too high
3 9 -4.4 too low
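
The same positions and values can be located without assertive by building a logical vector in base R. In this sketch the ratings at positions 5, 8, and 9 mirror the failures listed above; the value at position 7 is made up to fill the gap.

```r
# Hypothetical ratings; positions 5, 8 and 9 match the failures above
avg_rating <- c(4.1, 4.3, 4.2, 3.5, 5.8, 3.3, 4.0, 6.2, -4.4)

# TRUE wherever a rating falls outside the closed range [0, 5]
out_of_range <- avg_rating < 0 | avg_rating > 5

# Positions of the offending values
bad_positions <- which(out_of_range)

# The offending values themselves
bad_values <- avg_rating[out_of_range]
```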

Handling out of range values
Remove rows
Treat as missing ( NA )

Replace with range limit

Replace with other value based on domain knowledge and/or knowledge of dataset

Removing rows
movies %>%
  filter(avg_rating >= 0, avg_rating <= 5) %>%
  ggplot(aes(avg_rating)) +
  geom_histogram(breaks = c(min(movies$avg_rating), 0, 5, max(movies$avg_rating)))

Treat as missing
movies

  title                  avg_rating
  <chr>                       <dbl>
1 A Beautiful Mind              4.1
2 La Vita e Bella               4.3
3 Amelie                        4.2
4 Meet the Parents              3.5
5 Unbreakable                   5.8
6 Gone in Sixty Seconds         3.3
...

movies %>%
  mutate(rating_miss =
    replace(avg_rating, avg_rating > 5, NA))

  title                  rating_miss
  <chr>                        <dbl>
1 A Beautiful Mind               4.1
2 La Vita e Bella                4.3
3 Amelie                         4.2
4 Meet the Parents               3.5
5 Unbreakable                     NA
6 Gone in Sixty Seconds          3.3
...

replace(col, condition, replacement)
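
The condition avg_rating > 5 catches only values that are too high. Where the valid range is 0 to 5, a rating below 0 would pass through unchanged. One possible variant, sketched here on a made-up vector, combines both bounds in the condition passed to replace().

```r
# Hypothetical ratings with one value too high and one too low
avg_rating <- c(4.1, 5.8, 3.3, -4.4)

# Combine both conditions so values below 0 are also set to NA
rating_miss <- replace(avg_rating, avg_rating < 0 | avg_rating > 5, NA)
rating_miss
```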

Replacing out of range values
movies %>%
  mutate(rating_const =
    replace(avg_rating, avg_rating > 5, 5))

title rating_const
<chr> <dbl>
1 A Beautiful Mind 4.1
2 La Vita e Bella 4.3
3 Amelie 4.2
4 Meet the Parents 3.5
5 Unbreakable 5.0
6 Gone in Sixty Seconds 3.3
...
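
When both ends of the range need a limit, base R's pmin() and pmax() can clamp a vector in one expression. This is an alternative to replace(), not the approach shown on the slide, and the vector below is made up for illustration.

```r
# Hypothetical ratings with one value too high and one too low
avg_rating <- c(4.1, 5.8, 3.3, -4.4)

# pmax() lifts values below 0 up to 0; pmin() caps values above 5 at 5
rating_const <- pmin(pmax(avg_rating, 0), 5)
rating_const
```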

Date range constraints
assert_all_are_in_past(movies$date_recorded)

Error: is_in_past : movies$date_recorded are not all in the past.

There was 1 failure:
Position Value Cause
1 3 2064-09-22 20:00:00 in future

library(lubridate)
movies %>%
  filter(date_recorded > today())

title avg_rating date_recorded

1 Amelie 4.2 2064-09-23

Removing out-of-range dates
library(lubridate)
movies <- movies %>%
  filter(date_recorded <= today())

library(assertive)
assert_all_are_in_past(movies$date_recorded)

Remember, no output = passed!
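
For readers without lubridate or assertive installed, the same filter-then-verify pattern can be sketched in base R. Sys.Date() plays the role of today(), and stopifnot() plays the role of the assertion. The first two dates in this data frame are invented; only the future date comes from the slides.

```r
# Hypothetical recording dates; the third one lies far in the future
movies <- data.frame(
  title = c("A Beautiful Mind", "La Vita e Bella", "Amelie"),
  date_recorded = as.Date(c("2011-06-01", "2012-04-17", "2064-09-23")),
  stringsAsFactors = FALSE
)

# Sys.Date() is the base R counterpart of lubridate's today()
in_future <- movies$date_recorded > Sys.Date()
movies[in_future, ]

# Keep only rows recorded today or earlier
movies_past <- movies[!in_future, ]

# Assertion-style check: silent when every date passes
stopifnot(all(movies_past$date_recorded <= Sys.Date()))
```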

Uniqueness constraints
What's a duplicate?
First name Last name Address Credit score
1 Miriam Day 6042 Sollicitudin Avenue 313
2 Miriam Day 6042 Sollicitudin Avenue 313

First name Last name Address Credit score

1 Tamekah Forbes P.O. Box 147, 511 Velit St 356
2 Tamekah Forbes P.O. Box 147, 511 Velit St 342

Why do duplicates occur?

Full duplicates
First name Last name Address Credit score
1 Harper Taylor P.O. Box 212, 6557 Nunc Road 655
2 Miriam Day 6042 Sollicitudin Avenue 313
3 Eagan Schmidt 507-6740 Cursus Avenue 728
4 Miriam Day 6042 Sollicitudin Avenue 313
5 Katell Roy Ap #434-4081 Mi Av. 455
6 Katell Roy Ap #434-4081 Mi Av. 455
... ... ... ... ...

Finding full duplicates
duplicated(credit_scores)

FALSE FALSE FALSE TRUE FALSE ...

sum(duplicated(credit_scores))

2

Finding full duplicates
filter(credit_scores, duplicated(credit_scores))

first_name last_name address credit_score

1 Miriam Day 6042 Sollicitudin Avenue 313
2 Katell Roy Ap #434-4081 Mi Av. 455
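
duplicated() marks a row only from its second occurrence onward, which is why each duplicated person appears once in the output above. The sketch below rebuilds the first six rows of the slide's table (the address column is left out for brevity) and also shows one way to flag every copy, including the first.

```r
# First six rows from the slide, without the address column
credit_scores <- data.frame(
  first_name   = c("Harper", "Miriam", "Eagan", "Miriam", "Katell", "Katell"),
  last_name    = c("Taylor", "Day", "Schmidt", "Day", "Roy", "Roy"),
  credit_score = c(655, 313, 728, 313, 455, 455),
  stringsAsFactors = FALSE
)

# TRUE for the second and later occurrences of an identical row
dup_flags <- duplicated(credit_scores)
n_dups <- sum(dup_flags)

# Scanning from both ends flags every copy, including the first one
all_copies <- dup_flags | duplicated(credit_scores, fromLast = TRUE)
credit_scores[all_copies, ]
```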

Dropping full duplicates
credit_scores_unique <- distinct(credit_scores)
sum(duplicated(credit_scores_unique))
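
A comparable result can be had in base R with unique(), which also keeps the first occurrence of each identical row. This sketch uses a small made-up subset of the credit score data and repeats the duplicate count as a check.

```r
# Small hypothetical subset with two fully duplicated rows
credit_scores <- data.frame(
  first_name   = c("Miriam", "Eagan", "Miriam", "Katell", "Katell"),
  last_name    = c("Day", "Schmidt", "Day", "Roy", "Roy"),
  credit_score = c(313, 728, 313, 455, 455),
  stringsAsFactors = FALSE
)

# unique() on a data frame drops rows that repeat an earlier row
credit_scores_unique <- unique(credit_scores)

n_before  <- nrow(credit_scores)
n_after   <- nrow(credit_scores_unique)

# No full duplicates should remain after the drop
remaining <- sum(duplicated(credit_scores_unique))
```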

Partial duplicates
First name Last name Address Credit score
1 Harper Taylor P.O. Box 212, 6557 Nunc Road 655
2 Eagan Schmidt 507-6740 Cursus Avenue 728
3 Tamekah Forbes P.O. Box 147, 511 Velit Street 356
4 Tamekah Forbes P.O. Box 147, 511 Velit Street 342
5 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 620
6 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 636
... ... ... ... ...

Finding partial duplicates
credit_scores %>%
  count(first_name, last_name) %>%
  filter(n > 1)

first_name last_name n
<fct> <fct> <int>
1 Katell Roy 2
2 Miriam Day 2
3 Tamekah Forbes 2
4 Xandra Barrett 2

Finding partial duplicates
dup_ids <- credit_scores %>%
  count(first_name, last_name) %>%
  filter(n > 1)
credit_scores %>%
  filter(first_name %in% dup_ids$first_name, last_name %in% dup_ids$last_name)

first_name last_name address credit_score

1 Xandra Barrett P.O. Box 309, 2462 Pharetra, Rd. 620
2 Tamekah Forbes P.O. Box 147, 511 Velit Street 356
3 Miriam Day 6042 Sollicitudin Avenue 313
4 Xandra Barrett P.O. Box 309, 2462 Pharetra, Rd. 636
5 Tamekah Forbes P.O. Box 147, 511 Velit Street 342
...
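
One thing to watch with two separate %in% tests: a row can pass if its first name belongs to one duplicated person and its last name to another. The sketch below, in base R, matches on the name pair as a unit instead. The "Tamekah Barrett" row is invented to show the difference; in a dplyr pipeline, semi_join() on both columns is one way to get the same effect.

```r
# Four partial duplicates from the slides plus one made-up unique person
credit_scores <- data.frame(
  first_name   = c("Tamekah", "Tamekah", "Xandra", "Xandra", "Tamekah"),
  last_name    = c("Forbes", "Forbes", "Barrett", "Barrett", "Barrett"),
  credit_score = c(356, 342, 620, 636, 500),
  stringsAsFactors = FALSE
)

name_cols <- credit_scores[c("first_name", "last_name")]

# Flag every row whose (first_name, last_name) pair occurs more than once
is_partial_dup <- duplicated(name_cols) | duplicated(name_cols, fromLast = TRUE)
partial_dups <- credit_scores[is_partial_dup, ]

# For comparison: testing each column separately also matches the
# unique "Tamekah Barrett" row
dup_names <- unique(name_cols[duplicated(name_cols), ])
loose_match <- credit_scores$first_name %in% dup_names$first_name &
  credit_scores$last_name %in% dup_names$last_name
```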

Handling partial duplicates: dropping
Drop all duplicates except one

First name Last name Address Credit score

1 Tamekah Forbes P.O. Box 147, 511 Velit Street 356
2 Tamekah Forbes P.O. Box 147, 511 Velit Street 342
3 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 620
4 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 636

Handling partial duplicates: dropping
Drop all duplicates except one

First name Last name Address Credit score

1 Tamekah Forbes P.O. Box 147, 511 Velit Street 356
2
3 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 620
4

Handling partial duplicates: dropping
Drop all duplicates except one

First name Last name Address Credit score

1 Tamekah Forbes P.O. Box 147, 511 Velit Street 356
3 Xandra Barrett P.O. Box 309, 2462 Pharetra Rd. 620

Dropping partial duplicates
credit_scores %>%
  distinct(first_name, last_name, .keep_all = TRUE)

first_name last_name address credit_score

1 Harlan Hebert P.O. Box 356, 3869 Non Av. 305
2 Drake Soto 643-1409 Ac Avenue 642
3 Felix Morales 741-1497 Velit Ave 780
4 Brynne Charles 313-3757 Ultrices St. 513
5 Aquila Dillon P.O. Box 945, 5550 Aliquam Street 748
...
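
The "keep the first row per person" behavior can be sketched in base R by negating duplicated() on the name columns. The data frame here holds the four partial duplicates from the earlier slides, without the address column.

```r
# The four partial duplicates from the slides, without the address column
credit_scores <- data.frame(
  first_name   = c("Tamekah", "Tamekah", "Xandra", "Xandra"),
  last_name    = c("Forbes", "Forbes", "Barrett", "Barrett"),
  credit_score = c(356, 342, 620, 636),
  stringsAsFactors = FALSE
)

# Keep the first row for each name pair, with all other columns intact
keep <- !duplicated(credit_scores[c("first_name", "last_name")])
first_per_person <- credit_scores[keep, ]
first_per_person
```

Which row survives depends on row order, so it may be worth sorting the data first if a particular record (for example, the most recent) should be the one kept.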