DSA2101

Essential Data Analytics Tools: Data Visualization

Yuting Huang

AY24/25

Week 4: Importing Data II

Importing data into R

Week 3:
1. CSV files
2. Flat files
3. Excel files

Week 4:
4. R data files
5. JSON files
6. Data from the web
7. Using APIs

Recap: Troubleshooting errors

When we first start using R, it is common to see error messages.


▶ File path issue.

Warning in file(file, "rt") :
  cannot open file 'data/heights.csv': No such file or directory
Error in file(file, "rt") : cannot open the connection

▶ Using a function that’s not installed or loaded.

Error in read_excel("../data/Workplace_injuries.xlsx") :
could not find function "read_excel"

▶ Other common issues: Misspellings, missing commas, unmatched parentheses, objects that don’t exist.
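The first two errors can often be caught with a quick check before reading a file. The following is a minimal sketch, not part of the course code: the path is the one from the error message above, and the checks are only an illustration.

```r
# Path taken from the error message above; adjust to your own folder structure
path <- "data/heights.csv"

# Check that the file exists before trying to read it
if (file.exists(path)) {
  heights <- read.csv(path)
} else {
  message("Cannot find '", path, "'. Current working directory: ", getwd())
}

# Check that a package is installed before calling its functions
if (!requireNamespace("readxl", quietly = TRUE)) {
  message("Package 'readxl' is not installed. Run install.packages(\"readxl\").")
}
```

Printing getwd() in the message helps to see which folder a relative path is resolved against.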

Warning messages

Warnings are different from errors. They alert us about something, but usually won’t prevent the code from running.

x <- c(1, 2, 3, "adam")
as.numeric(x)

## Warning: NAs introduced by coercion

## [1] 1 2 3 NA
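To find out which entries triggered the warning, we can compare the input with the converted result. This is a small sketch using the same vector x; the names x_num and bad are introduced here for illustration only.

```r
x <- c(1, 2, 3, "adam")

# suppressWarnings() hides the coercion warning once we know its cause
x_num <- suppressWarnings(as.numeric(x))

# Entries that were not missing originally but became NA after conversion
bad <- x[is.na(x_num) & !is.na(x)]
bad
```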

Additional RMarkdown questions

1. Can I use setwd() in RMarkdown to set my working directory?

▶ No. Unlike in R scripts, setwd() is not persistent in RMarkdown. It only applies to that specific code chunk.
▶ This also creates issues when the file is run on a different machine.

2. How about using attach() for data sets?

▶ No. attach() in RMarkdown is also not persistent between code chunks.
▶ This also makes debugging difficult! To ensure clarity and avoid conflicts, always refer to variables in a data frame as df$variable.

Additional help with RMarkdown

All tutorials and exams for DSA2101 require the use of R Markdown (.Rmd).
▶ To provide you with additional help, we will ask you to submit your Rmd file at the end of your tutorial day from Week 4 onwards.
▶ This is part of your preparation for the upcoming midterm exam.

Ensure your file is executable: it should render successfully into HTML without errors.
▶ This includes loading all necessary data and libraries.
▶ Use our standard folder structure and relative paths when storing and reading data.
▶ Make sure your code is complete and runs without errors.

Additional help with RMarkdown

Due to time constraints, we will not check the correctness of your answers.
Instead, we focus on testing a minimal set of things including:

1. Whether your Rmd can knit to HTML successfully on our machines, and
2. Whether the required object(s) and plot(s) are generated in the knitted HTML.

We will do a few quick checks and provide you some feedback.

Recap: Cleaning data

Best to catch problems in data at an early stage.

A key step is to open the spreadsheet, quickly scroll through the content, and look for warning signs of “bad data”.
▶ Missing values: Blank entries, suspicious values.
▶ Inconsistent date formats: 3/2/2025 or 2/3/2025?
▶ Duplicated or redundant rows and values.
▶ Inconsistent spellings in categorical variables.
▶ Numbers stored as text.
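Several of these warning signs can also be checked in R after the data are read in. The sketch below uses a hypothetical toy data frame (the column names and values are made up for illustration) and base R functions only.

```r
# Hypothetical toy data with several typical problems
df <- data.frame(
  name   = c("Ann", "Ben", "Ben", "Cat", "Dan"),
  gender = c("F", "M", "M", "female", "M"),
  height = c("160", "175", "175", "", "n/a"),
  stringsAsFactors = FALSE
)

colSums(is.na(df) | df == "")   # blank or missing entries per column
sum(duplicated(df))             # number of fully duplicated rows
table(df$gender)                # inconsistent spellings show up as extra categories
sapply(df, class)               # numbers stored as text appear as "character"

# Convert text to numbers; entries that cannot be converted become NA
height_num <- suppressWarnings(as.numeric(df$height))
sum(is.na(height_num))
```

These checks complement, rather than replace, scrolling through the spreadsheet by eye.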

R data files

R has two native data formats:

Format          Description         To read     To write
Rdata (or Rda)  Multiple R objects  load()      save()
Rds             A single R object   readRDS()   saveRDS()

▶ To read in an Rdata file with multiple R objects:

load("../data/wk4_data.RData")

▶ There are two objects in the environment now: heights and injuries.
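The difference between the two formats is easiest to see in a round trip. The sketch below writes to temporary files and uses hypothetical objects (heights_demo and injuries_demo are made up and unrelated to the course data).

```r
# Hypothetical objects, used only for illustration
heights_demo  <- data.frame(id = 1:3, height = c(160, 172, 181))
injuries_demo <- c(minor = 10, major = 2)

rdata_path <- tempfile(fileext = ".RData")
rds_path   <- tempfile(fileext = ".rds")

# save() stores several objects together with their names
save(heights_demo, injuries_demo, file = rdata_path)

# saveRDS() stores one object without its name
saveRDS(heights_demo, rds_path)

rm(heights_demo, injuries_demo)

# load() restores the objects under their original names
# and returns those names as a character vector
restored <- load(rdata_path)
restored

# readRDS() returns the object, so we choose the name when assigning
heights_copy <- readRDS(rds_path)
identical(heights_copy, heights_demo)
```

Because load() creates objects silently, keeping its return value is a simple way to see what was added to the environment.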

▶ Let’s remove the redundant rows and visualize the total injuries.

# Keep the first four rows and drop the redundant ones
injuries_clean <- injuries[1:4, ]
# Sum each year's column, excluding the first column
injuries_sum <- colSums(injuries_clean[, -1])
years <- names(injuries_sum)
plot(years, injuries_sum, type = "b", xaxt = "n", las = 2,
     main = "Workplace injuries over time",
     xlab = "Years", ylab = "Number of injuries")
# Draw the x-axis with one tick per year
axis(1, at = years, labels = years)

[Figure: line plot "Workplace injuries over time"; x-axis: Years (2019 to 2022); y-axis: Number of injuries (about 10000 to 12000)]

R data files
▶ To read in an Rds file with a single R object:

hawkers <- readRDS("../data/hawker_ctr.rds")
str(hawkers[[1]])

## List of 12
## $ ADDRESSBUILDINGNAME : chr ""
## $ ADDRESSFLOORNUMBER : chr ""
## $ ADDRESSPOSTALCODE : chr "141001"
## $ ADDRESSSTREETNAME : chr "Commonwealth Drive"
## $ ADDRESSUNITNUMBER : chr ""
## $ DESCRIPTION : chr "HUP Standard Upgrading"
## $ HYPERLINK : chr ""
## $ NAME : chr "Blks 1A/ 2A/ 3A Commonwealth Drive"
## $ PHOTOURL : chr ""
## $ ADDRESSBLOCKHOUSENUMBER: chr "1A/2A/3A"
## $ XY : chr "24055.5,31341.24"
## $ ICON_NAME : chr "HC icons_Opt 8.jpg"

Retrieving street names
hawkers contains an object: A list of length 116.
▶ Each sub-list contains information on a hawker center.

Retrieving street names

▶ We can retrieve the street name from the first sub-list with the following code:

hawkers[[1]]$ADDRESSSTREETNAME

## [1] "Commonwealth Drive"

street_name <- sapply(hawkers, function(x) x$ADDRESSSTREETNAME)
head(street_name, 2)

## [1] "Commonwealth Drive" "Marsiling Lane"

▶ function(x) x$ADDRESSSTREETNAME is an anonymous function defined in sapply() for this specific task.
▶ It is applied to each element in hawkers as sapply() iterates through it.
▶ x represents one element in hawkers at a time during the iteration.
▶ x$ADDRESSSTREETNAME extracts the street name from the current input x.
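The same idea can be tried on a small list without the hawker data. The sketch below uses a hypothetical two-element list that mimics the structure of hawkers; vapply() is shown as a stricter alternative and is not used in the slides.

```r
# Hypothetical toy list that mimics the structure of hawkers
centres <- list(
  list(NAME = "Centre A", ADDRESSSTREETNAME = "Commonwealth Drive"),
  list(NAME = "Centre B", ADDRESSSTREETNAME = "Marsiling Lane")
)

# Anonymous function, as in the text
sapply(centres, function(x) x$ADDRESSSTREETNAME)

# Equivalent: pass the extraction function `[[` and the element name
sapply(centres, `[[`, "ADDRESSSTREETNAME")

# vapply() additionally checks that every result is a single character string
streets <- vapply(centres, function(x) x$ADDRESSSTREETNAME, character(1))
streets
```

vapply() stops with an error if any element is missing the field, whereas sapply() would quietly return a list instead of a vector.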

Converting to a data frame
Using the same trick on different components, we can extract the information and store it in vectors.
▶ Then combine them as a data frame.

postal_code <- sapply(hawkers, function(x) x$ADDRESSPOSTALCODE)
street_name <- sapply(hawkers, function(x) x$ADDRESSSTREETNAME)
name <- sapply(hawkers, function(x) x$NAME)
hawkers_df <- data.frame(postal_code, street_name, name)
head(hawkers_df, 4)

##   postal_code        street_name                               name
## 1      141001 Commonwealth Drive Blks 1A/ 2A/ 3A Commonwealth Drive
## 2      730020     Marsiling Lane          Blks 20/21 Marsiling Lane
## 3      641221     Boon Lay Place         Blks 221A/B Boon Lay Place
## 4      161022      Havelock Road          Blks 22A/B Havelock Road

JavaScript Object Notation (JSON)

JSON (JavaScript Object Notation) is a standard text-based format for storing and sharing structured data.
▶ Lightweight, easy to share, and supported in nearly all programming languages.
▶ The full description of the format can be found at http://www.json.org/

We shall work with the jsonlite package.

# install.packages("jsonlite")
library(jsonlite)

JSON

▶ JSON data are nested and hierarchical.
▶ An object is an unordered collection of name/value pairs.
▶ An array is an ordered list of values.
▶ By repeatedly stacking these structures on top of one another, we can store quite complex data structures.

JSON syntax

JSON supports the following data types or values:
▶ String (in double quotes), number, Boolean (true and false), and null.

JSON syntax
JSON also supports arrays, which are ordered, comma-separated lists of values.
▶ Surrounded with square brackets: [ and ]
▶ Values are separated by a comma ,

Example:
▶ [12, 3, 7] is a JSON array with three elements, all of which are numbers.
▶ ["Hello", 3, 7] is also valid.

Reading JSON array

The fromJSON() function from jsonlite reads data in JSON format and converts them into an R object.
▶ If the JSON array contains only simple values such as numbers, fromJSON() reads it into a vector.

json1 <- "[12, 3, 7]"
fromJSON(json1)

## [1] 12 3 7
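An R vector holds a single type, so it is worth checking what happens with arrays that mix types or contain true, false, and null. The sketch below is an illustration with made-up inputs; it relies on the default simplification behaviour of fromJSON().

```r
library(jsonlite)

# A mixed array: the numbers are converted to character strings
mixed <- fromJSON('["Hello", 3, 7]')
class(mixed)

# true/false become logical values; null inside an array becomes NA
flags <- fromJSON('[true, false, null]')
flags

# simplifyVector = FALSE keeps each value with its own type, in a list
fromJSON('["Hello", 3, 7]', simplifyVector = FALSE)
```

If the original types matter, reading into a list with simplifyVector = FALSE avoids the silent conversion to character.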

JSON syntax
Another type of data is the JSON object, which stores values as key-value pairs.
▶ Surrounded with curly braces { and }
▶ Each key is followed by a colon :
▶ Key-value pairs are separated by a comma ,

Example:
▶ {"name": "John"} is a valid JSON object.
▶ {"name": "John", "age": 27} contains two key-value pairs.

Reading JSON objects

▶ fromJSON() reads JSON objects as a list.

json2 <- '{"name": "John", "age": 27}'
fromJSON(json2)

## $name
## [1] "John"
##
## $age
## [1] 27

▶ In the R object, each key becomes the name of a list element, and each value becomes the content of that element.
▶ We can transform the list into a data frame with data.frame().
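As a small illustration of the conversion to a data frame, the sketch below first converts a single object and then reads an array of objects. The second record ("Mary", 31) is made up for this example.

```r
library(jsonlite)

# One JSON object becomes a named list, which data.frame() turns into one row
person <- fromJSON('{"name": "John", "age": 27}')
data.frame(person)

# An array of objects with the same keys is simplified to a data frame directly
people <- fromJSON('[{"name": "John", "age": 27}, {"name": "Mary", "age": 31}]')
people
dim(people)
```

Many data sets are delivered as an array of objects, in which case no manual conversion is needed.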

JSON from file

Let’s now try to read from a file that contains JSON.

fromJSON("../data/wk4_colors_json.txt")

## $aliceblue
## [1] "#f0f8ff"
##
## $antiquewhite
## [1] "#faebd7"
##
## $aqua
## [1] "#00ffff"
##
## $aquamarine
## [1] "#7fffd4"

JSON from the web

We can also import JSON data from the web.
▶ Here is an example from the website Advice Slip: https://api.adviceslip.com/advice
▶ As you can tell, the result contains a JSON object with a nested structure.

JSON from the web

We can read it in by passing the URL to fromJSON().

fromJSON("https://api.adviceslip.com/advice")

## $slip
## $slip$id
## [1] 142
##
## $slip$advice
## [1] "If you don’t like the opinion you’ve been given, get another one
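Nested results like this one are nested lists in R, so individual values can be reached step by step. The sketch below uses a local string with the same structure as the response, so that it runs without an internet connection; the advice text is a placeholder.

```r
library(jsonlite)

# A local string with the same structure as the API response shown above
json_slip <- '{"slip": {"id": 142, "advice": "Example advice text"}}'
res <- fromJSON(json_slip)

# Nested elements can be reached with $ or with [[ ]]
res$slip$id
res[["slip"]][["advice"]]
```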

Data from the web

We can read data files directly from a URL into R.

TidyTuesday is a weekly data project born out of the R for Data Science textbook and its online learning community.
▶ It posts raw data set(s) and a related article every week.
▶ It emphasizes summarizing and manipulating data to make meaningful visualizations in the tidyverse ecosystem.

The full list of data sets can be found at https://github.com/rfordatascience/tidytuesday

TidyTuesday data
Let’s explore the data set posted on April 20, 2021.
▶ A data set on TV shows and movies available on Netflix.
▶ We can find an overview of the data at: https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-04-20/readme.md

We can read in the data manually via a specific URL (available under the Get the data here section).
https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-20/netflix_titles.csv

url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-20/netflix_titles.csv"
netflix <- readr::read_csv(url)
head(netflix, 2)

## # A tibble: 2 x 12
## show_id type title director cast country date_added release_
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <
## 1 s1 TV Show 3% <NA> João~ Brazil August 14~
## 2 s2 Movie 7:19 Jorge Mich~ Demi~ Mexico December ~
## # i 3 more variables: duration <chr>, listed_in <chr>, description <c

TidyTuesday data
The Data dictionary section provides the necessary documentation about the variables in the data set.

A bar plot on the types of Netflix titles.

netflix$type <- as.factor(netflix$type)
barplot(table(netflix$type), main = "Distribution of types")

[Figure: bar plot "Distribution of types"; bars for Movie and TV Show; vertical axis from 0 to about 5000]

A histogram of movie duration.

movie <- netflix[netflix$type == "Movie", ]
movie$minute <- stringr::str_replace(movie$duration, " min", "")
movie$minute <- as.numeric(movie$minute)
hist(movie$minute,
     xlab = "Duration (minutes)",
     main = "Distribution of movie duration")

[Figure: histogram "Distribution of movie duration"; x-axis: Duration (minutes), 0 to 300; y-axis: Frequency, 0 to about 2000]

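The text-to-number step for the duration can be tried on a few values first. The sketch below is a base R alternative to stringr::str_replace() and uses hypothetical values; the "2 Seasons" entry is an assumption about how a non-movie title might be recorded.

```r
# Hypothetical values in the same format as the duration column
duration <- c("90 min", "125 min", "2 Seasons")

# Remove the trailing " min"; the pattern is anchored with $ to match only at the end
minute <- sub(" min$", "", duration)

# Text that is not a number becomes NA
minute <- suppressWarnings(as.numeric(minute))
minute
```

Filtering to movies first, as in the slide, avoids such NA values in the histogram.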
Data from the web

▶ In 2011, data.gov.sg was launched as Singapore’s national open data portal.
▶ 4000+ data sets from 70 government agencies, in the fields of economy, education, environment, housing, health, etc.
▶ Data can be downloaded in various formats.
▶ It is possible to query the data with a script. The data are then returned as a JSON object.

Covid-19 vaccination
Data can be downloaded directly as a CSV file.
▶ Convenient for a static data set, that is, one that is finalized and won’t be updated anymore.
https://data.gov.sg/datasets?topics=health&resultId=d_713e8c4fd88c64a7b7e55e9c2643e936

Monthly rainfall
The National Environment Agency (NEA) publishes monthly data on
total rainfall at certain climate stations.
▶ Available from Jan 1982 and updated on a monthly basis.

APIs

▶ API stands for Application Programming Interface.
▶ Allows us to automate the repetitive, time-consuming work of querying data.
▶ Can provide an additional layer of security against unauthorized access by requiring authentication.

Source: Adapted from Postman.

Dataset APIs

The data.gov.sg website provides sample code written in Python.
▶ The key requirement is to identify the dataset id.

We can write our own query in R by specifying the query parameters (i.e., dataset id).
▶ Detailed instructions can be found on the User guide page.

Visit the page: https://data.gov.sg/datasets/d_b16d06b83473fdfcc92ed9d37b66ba58/view
▶ Click Help -> User guide in the upper right corner.

▶ This will direct you to the developer documentation page.
▶ Click on the Dataset APIs section.
▶ It explains that to query a data set, we need to first find its dataset id.
▶ For the rainfall data, the id is d_b16d06b83473fdfcc92ed9d37b66ba58.

The API uses the endpoint https://data.gov.sg/api/action/datastore_search.
▶ We can pass the dataset id into the API query:
https://data.gov.sg/api/action/datastore_search?resource_id=d_b16d06b83473fdfcc92ed9d37b66ba58
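A query URL is the endpoint, a question mark, and then name=value pairs. The sketch below builds the URL above from its parts; build_query() is a hypothetical helper written for this illustration, not a function from any package.

```r
base_url   <- "https://data.gov.sg/api/action/datastore_search"
dataset_id <- "d_b16d06b83473fdfcc92ed9d37b66ba58"

# build_query() is a hypothetical helper: it joins name=value pairs with "&"
# and attaches them to the base URL after "?"
build_query <- function(base, params) {
  pairs <- paste0(names(params), "=", unlist(params))
  paste0(base, "?", paste(pairs, collapse = "&"))
}

url <- build_query(base_url, list(resource_id = dataset_id))
url
```

Writing the parameters as a named list makes it easy to swap in a different dataset id.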

Download data from data.gov.sg

dataset_id <- "d_b16d06b83473fdfcc92ed9d37b66ba58"
base_url <- "https://data.gov.sg/api/action/datastore_search?resource_id="
url <- paste0(base_url, dataset_id)
results_json <- fromJSON(url)

▶ The fromJSON() command converts the input into a list.
▶ In the Environment tab, we can see that the results_json object is a list of length 3.

Working with the results
str(results_json)

## List of 3
## $ help : chr "https://data.gov.sg/api/3/action/help_show?name=data
## $ success: logi TRUE
## $ result :List of 5
## ..$ resource_id: chr "d_b16d06b83473fdfcc92ed9d37b66ba58"
## ..$ fields :'data.frame': 3 obs. of 2 variables:
## .. ..$ type: chr [1:3] "text" "text" "int4"
## .. ..$ id : chr [1:3] "month" "total_rainfall" "_id"
## ..$ records :'data.frame': 100 obs. of 3 variables:
## .. ..$ _id : int [1:100] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ month : chr [1:100] "1982-01" "1982-02" "1982-03" "1
## .. ..$ total_rainfall: chr [1:100] "107.1" "27.8" "160.8" "157" ...
## ..$ _links :List of 2
## .. ..$ start: chr "/api/action/datastore_search?resource_id=d_b16d0
## .. ..$ next : chr "/api/action/datastore_search?resource_id=d_b16d0
## ..$ total : int 516

Data are stored in the result element.
As with most APIs, there is a limit on the number of records that can
be retrieved per query.
▶ result -> records: A data frame with 100 rows and 3 columns.
▶ result -> total: The latest data set contains 516 rows.

dim(results_json[["result"]][["records"]])

## [1] 100 3

results_json[["result"]][["total"]]

## [1] 516

▶ result -> _links contains two links: The starting query and
the next query.

results_json[["result"]][["_links"]]

The next query

results_json[["result"]][["_links"]][["next"]]

Take a look at the link for the next query.
▶ We can make use of this pattern as we extract the remaining rows.

Querying data
We want to continue submitting queries until the required number of rows is obtained.
▶ Use a while loop to repeatedly fetch data until the number of rows matches the total available rows.

# Current number of rows
results <- results_json[["result"]][["records"]]

# Total expected number of rows
total_records <- results_json[["result"]][["total"]]

# Keep requesting the next page until all rows are collected
while (nrow(results) < total_records) {
  next_url <- paste0("https://data.gov.sg",
                     results_json[["result"]][["_links"]][["next"]])
  results_json <- fromJSON(next_url)
  results <- rbind(results, results_json[["result"]][["records"]])
}
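The loop pattern can be tested offline before it is pointed at a real API. In the sketch below, fetch_page() is a hypothetical stand-in for fromJSON(next_url) that serves a fake data set in pages of 100 rows; the extra break is an added safeguard and is not part of the loop above.

```r
# fetch_page() is a hypothetical stand-in for fromJSON(next_url):
# it returns up to 100 rows of a fake data set with 250 rows in total
fetch_page <- function(offset, page_size = 100, total = 250) {
  ids <- seq_len(total)
  ids <- ids[ids > offset & ids <= offset + page_size]
  list(records = data.frame(id = ids), total = total)
}

page <- fetch_page(offset = 0)
results <- page$records
total_records <- page$total

while (nrow(results) < total_records) {
  page <- fetch_page(offset = nrow(results))
  # Stop if a page comes back empty, to avoid looping forever
  if (nrow(page$records) == 0) break
  results <- rbind(results, page$records)
}

nrow(results)
```

A guard like this is useful when the reported total and the rows actually served could disagree, since the while condition alone would then never become false.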

Confirm that we have the correct data:

str(results)

## 'data.frame': 516 obs. of 3 variables:
## $ _id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ month : chr "1982-01" "1982-02" "1982-03" "1982-04" ...
## $ total_rainfall: chr "107.1" "27.8" "160.8" "157" ...

After that, we can write the data frame to a file with the base R function write.csv().

write.csv(results, "../data/rainfall.csv", row.names = FALSE)
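Reading the file back is a quick way to check what was written. The sketch below uses a temporary file and two hypothetical rows in the same format as the rainfall records; note that column types are guessed again by read.csv().

```r
# Hypothetical rows in the same format as the rainfall records
rain <- data.frame(month = c("1982-01", "1982-02"),
                   total_rainfall = c("107.1", "27.8"))

csv_path <- tempfile(fileext = ".csv")
write.csv(rain, csv_path, row.names = FALSE)

# read.csv() guesses column types when reading the file back
rain_back <- read.csv(csv_path)
str(rain_back)
```

A CSV file stores no type information, so the column types after reading can differ from those of the data frame that was written.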

Visualization

Let us visualize monthly rainfall from 2020 onwards.
▶ The following syntax will become comprehensible after the next lecture.
▶ For now, we only need to understand its purpose.

library(tidyverse)
library(lubridate)
df <- results %>%
mutate(month = ym(month),
year = year(month),
total_rainfall = as.numeric(total_rainfall)) %>%
filter(year >= 2020)
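For readers who want to follow the purpose of each step now, here is a rough base R equivalent. It is a sketch on three hypothetical rows in the same format as results, not the approach used in the course.

```r
# Hypothetical rows in the same format as results
results_demo <- data.frame(
  month = c("2019-12", "2020-01", "2020-02"),
  total_rainfall = c("421.5", "88.4", "65.0")
)

df_demo <- results_demo
# Append a day so that as.Date() can parse the year-month text
df_demo$month <- as.Date(paste0(df_demo$month, "-01"))
df_demo$year <- as.numeric(format(df_demo$month, "%Y"))
df_demo$total_rainfall <- as.numeric(df_demo$total_rainfall)
# Keep rows from 2020 onwards
df_demo <- df_demo[df_demo$year >= 2020, ]
df_demo
```

The three assignments correspond to the mutate() call, and the row subsetting corresponds to filter().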

▶ Tweak the arguments to see their effects on the plot.

plot(df$month, df$total_rainfall,
type = "l", lwd = 1, col = "purple",
xlab = "Month", ylab = "Monthly rainfall",
main = "Monthly rainfall across years")

[Figure: line plot "Monthly rainfall across years"; x-axis: Month (2020 to 2024); y-axis: Monthly rainfall (0 to 600)]

Your turn: Graduate Employment Survey

The Ministry of Education (MOE) conducts the Graduate Employment Survey across educational institutions in Singapore.
Data are available at https://data.gov.sg/datasets?topics=education&page=1&resultId=d_3c55210de27fcccda2ed0c63fdd2b352

1. Query the data through the API.
2. Store the data in your data folder as ges_sg.csv.

Median income across universities

Let’s compare the median gross income across universities.
▶ levels() renames the factor levels in university.
▶ reorder() sorts the levels in university based on the median of gross_monthly_median.

df <- read.csv("../data/ges_sg.csv",
               stringsAsFactors = TRUE, na.strings = "na")
df2022 <- df[df$year == 2022, ]

# New names are assigned by position, following the existing level order
levels(df2022$university) <- c("NTU", "NUS", "SIT",
                               "SMU", "SUSS", "SUTD")

# Order universities by their median gross monthly income
df2022$university <- reorder(df2022$university,
                             df2022$gross_monthly_median,
                             FUN = median, na.rm = TRUE)
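Both functions can be tried on a small factor to see what they change. The sketch below uses hypothetical values; the point to note is that levels() assigns new names by position, so it helps to print the existing levels first.

```r
# Hypothetical toy data
uni <- factor(c("Univ B", "Univ A", "Univ B", "Univ C", "Univ A", "Univ C"))
pay <- c(3000, 5000, 3400, 4000, 5200, 4100)

levels(uni)                       # existing levels, in alphabetical order
levels(uni) <- c("A", "B", "C")   # new names are matched by position
uni

# Order the levels by the median of pay within each level
uni <- reorder(uni, pay, FUN = median)
levels(uni)
```

After reorder(), the level with the lowest median comes first, which is why a box plot drawn from the reordered factor appears sorted.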

Median income across universities
boxplot(df2022$gross_monthly_median ~ df2022$university,
horizontal = TRUE, las = 1,
main = "Median monthly gross income, 2022",
xlab = "Gross income", ylab = "")
grid()

[Figure: horizontal box plots "Median monthly gross income, 2022"; universities from top to bottom: SMU, SUTD, NUS, NTU, SIT, SUSS; x-axis: Gross income (3000 to 6500)]

The ggplot2 way
library(tidyverse) # ggplot2 is included in tidyverse
ggplot(df2022) +
geom_boxplot(aes(x = gross_monthly_median, y = university)) +
labs(title = "Median monthly gross income, 2022",
x = "Gross income", y = "")

[Figure: ggplot2 box plots "Median monthly gross income, 2022"; universities from top to bottom: SMU, SUTD, NUS, NTU, SIT, SUSS; x-axis: Gross income (3000 to 6000)]

▶ Later in the semester, we shall learn about visualization with ggplot().

Summary

We learned about importing data from different formats and sources:

4. R data files.
5. JSON files with jsonlite.
6. Data from the web with readr.
7. Using APIs.

We also saw a few more ways to clean and visualize data.
