Week4 Slides
Yuting Huang
AY24/25
Importing data into R
Recap: Troubleshooting errors
Error in read_excel("../data/Workplace_injuries.xlsx") :
could not find function "read_excel"
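This error usually means that the package providing the function has not been attached in the current session. read_excel() comes from the readxl package, so the usual fix is to call library(readxl) first. A minimal sketch that reproduces the same kind of message with a deliberately undefined function (the name not_a_real_function is made up for illustration):

```r
# Calling a function that R cannot find raises the same kind of error.
# `not_a_real_function` is a hypothetical name used only for illustration.
msg <- tryCatch(
  not_a_real_function(),
  error = function(e) conditionMessage(e)
)
print(msg)

# The usual fix is to attach the package that provides the function.
# read_excel() lives in readxl (this branch assumes readxl is installed).
if (requireNamespace("readxl", quietly = TRUE)) {
  library(readxl)
  print(exists("read_excel"))  # TRUE once readxl is attached
}
```

Alternatively, readxl::read_excel() calls the function without attaching the whole package.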
Warning messages
## [1] 1 2 3 NA
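The code behind this output is not shown. One common way to get a vector ending in NA together with a warning is coercing text to numbers, so the input below is an assumption chosen to match the printed result:

```r
# Hypothetical input: the last element cannot be parsed as a number.
x <- c("1", "2", "3", "four")

# as.numeric() still returns a result, but warns that NAs were introduced.
# The handler prints the warning text and then lets the computation continue.
y <- withCallingHandlers(
  as.numeric(x),
  warning = function(w) {
    message("Warning caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(y)
```

Unlike an error, a warning does not stop execution: the code runs to completion, but the result deserves a second look.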
Additional RMarkdown questions
Additional help with RMarkdown
All tutorials and exams for DSA2101 require the use of R Markdown
(.Rmd).
▶ To provide you with additional help, we will ask you to submit
your Rmd file at the end of your tutorial day from Week 4
onwards.
▶ This is part of your preparation for the upcoming midterm exam.
Additional help with RMarkdown
Recap: Cleaning data
R data files
load("../data/wk4_data.RData")
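An .RData file can hold several named objects, and load() restores them under their original names. A minimal round-trip sketch with two made-up objects (scores and info are not part of the course data):

```r
# Two small example objects (hypothetical, not the course data).
scores <- c(80, 92, 75)
info <- data.frame(id = 1:3, passed = c(TRUE, TRUE, FALSE))

# save() can store several objects in one .RData file.
path <- tempfile(fileext = ".RData")
save(scores, info, file = path)

# Remove them, then restore them by name with load().
rm(scores, info)
loaded_names <- load(path)  # load() returns the names of the restored objects
print(loaded_names)
print(scores)
```

Because load() silently overwrites existing objects with the same names, it helps to check the returned names after loading.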
▶ Let’s remove the redundant rows and visualize the total injuries.
[Figure: plot of the number of injuries (y-axis, roughly 10000 to 12000) against years (x-axis).]
R data files
▶ To read in an Rds file, which stores a single R object:
## List of 12
## $ ADDRESSBUILDINGNAME : chr ""
## $ ADDRESSFLOORNUMBER : chr ""
## $ ADDRESSPOSTALCODE : chr "141001"
## $ ADDRESSSTREETNAME : chr "Commonwealth Drive"
## $ ADDRESSUNITNUMBER : chr ""
## $ DESCRIPTION : chr "HUP Standard Upgrading"
## $ HYPERLINK : chr ""
## $ NAME : chr "Blks 1A/ 2A/ 3A Commonwealth Drive"
## $ PHOTOURL : chr ""
## $ ADDRESSBLOCKHOUSENUMBER: chr "1A/2A/3A"
## $ XY : chr "24055.5,31341.24"
## $ ICON_NAME : chr "HC icons_Opt 8.jpg"
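The reading code itself is not shown. In base R, the function for Rds files is readRDS(), paired with saveRDS() for writing. A minimal sketch using a small stand-in list (the object center and the temporary file are assumptions for illustration):

```r
# A stand-in list with two of the fields shown above.
center <- list(
  NAME = "Blks 1A/ 2A/ 3A Commonwealth Drive",
  ADDRESSPOSTALCODE = "141001"
)

# saveRDS() writes exactly one object and does not store its name.
path <- tempfile(fileext = ".rds")
saveRDS(center, path)

# readRDS() returns the object, so we choose the name on assignment.
center_copy <- readRDS(path)
str(center_copy)
```

This is the main difference from load(): the object name is decided by the reader, not by the file.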
Retrieving street names
hawkers is a list of length 116.
▶ Each sub-list contains information on a hawker center.
Retrieving street names
▶ We can retrieve the street names of the first sub-list with the
following code:
hawkers[[1]]$ADDRESSSTREETNAME
street_name <- sapply(hawkers, function(x) x$ADDRESSSTREETNAME)
head(street_name, 2)
Converting to a data frame
Using the same trick on other components, we can extract each piece of
information as a vector.
▶ Then combine them as a data frame.
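A minimal sketch of this step, using a toy list with the same shape as hawkers (the values and the name toy are made up):

```r
# A toy list with the same shape as `hawkers` (values are made up).
toy <- list(
  list(NAME = "Center A", ADDRESSSTREETNAME = "First Street",
       ADDRESSPOSTALCODE = "100001"),
  list(NAME = "Center B", ADDRESSSTREETNAME = "Second Street",
       ADDRESSPOSTALCODE = "100002")
)

# Extract one component at a time; each call returns a character vector.
name        <- sapply(toy, function(x) x$NAME)
street_name <- sapply(toy, function(x) x$ADDRESSSTREETNAME)
postal_code <- sapply(toy, function(x) x$ADDRESSPOSTALCODE)

# Combine the vectors column-wise into a data frame.
toy_df <- data.frame(name, street_name, postal_code)
print(toy_df)
```

Each sub-list becomes one row and each extracted component becomes one column.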
JavaScript Object Notation (JSON)
# install.packages("jsonlite")
library(jsonlite)
JSON
JSON syntax
JSON syntax
JSON also supports arrays, which are ordered, comma-separated lists of
values.
▶ Surrounded with square brackets: [ and ]
▶ Values are separated by a comma ,
Example:
▶ [12, 3, 7] is a JSON array with three elements, all of which are
numbers.
▶ ["Hello", 3, 7] is also valid.
Reading JSON array
## [1] 12 3 7
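The call that produced this output is not shown. A minimal sketch, assuming the array is passed to fromJSON() as a string:

```r
library(jsonlite)

# A JSON array of numbers becomes an atomic vector in R.
nums <- fromJSON("[12, 3, 7]")
print(nums)

# A mixed array is still valid JSON. With the default simplification,
# jsonlite returns a character vector because R vectors hold one type.
mixed <- fromJSON('["Hello", 3, 7]')
print(mixed)
```

Single quotes around the R string make it easy to keep the double quotes that JSON requires for text values.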
JSON syntax
Another data type is the JSON object, which stores values as
key-value pairs.
▶ Surrounded with curly braces { and }
▶ Each key is followed by a colon :
▶ Key-value pairs are separated by a comma ,
Example:
▶ {"name": "John"} is a valid JSON object.
▶ {"name": "John", "age": 27} contains two key-value pairs.
Reading JSON objects
## $name
## [1] "John"
##
## $age
## [1] 27
▶ In the R object, each key becomes the name of a list element, and
each value becomes the content of that element.
▶ We can transform the list into a data frame with data.frame().
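A minimal sketch of both steps, assuming the JSON object from the earlier example is passed to fromJSON() as a string (the names person and person_df are chosen for illustration):

```r
library(jsonlite)

# A JSON object becomes a named list: keys turn into element names.
person <- fromJSON('{"name": "John", "age": 27}')
str(person)

# data.frame() turns the named list into a one-row data frame,
# with one column per key.
person_df <- data.frame(person)
print(person_df)
```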
JSON from file
fromJSON("../data/wk4_colors_json.txt")
## $aliceblue
## [1] "#f0f8ff"
##
## $antiquewhite
## [1] "#faebd7"
##
## $aqua
## [1] "#00ffff"
##
## $aquamarine
## [1] "#7fffd4"
JSON from the web
▶ As you can tell, the result contains a JSON object with a nested
structure.
JSON from the web
fromJSON("https://api.adviceslip.com/advice")
## $slip
## $slip$id
## [1] 142
##
## $slip$advice
## [1] "If you don’t like the opinion you’ve been given, get another one
Data from the web
TidyTuesday data
Let’s explore the data set posted on April 20, 2021.
▶ A data set on TV shows and movies available on Netflix.
▶ We can find an overview of the data at:
https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-04-20/readme.md
We can read in data manually via a specific URL (available under the
Get the data here section).
https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-20/netflix_titles.csv
## # A tibble: 2 x 12
## show_id type title director cast country date_added release_
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <
## 1 s1 TV Show 3% <NA> João~ Brazil August 14~
## 2 s2 Movie 7:19 Jorge Mich~ Demi~ Mexico December ~
## # i 3 more variables: duration <chr>, listed_in <chr>, description <c
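The reading code is not shown. Since the output is a tibble, a reasonable guess is read_csv() from readr, which accepts a URL in the same way as a local path. The sketch below uses a tiny stand-in file with made-up contents so that it runs offline; reading every column as text is an assumption chosen to mirror the <chr> columns above.

```r
library(readr)

# A tiny stand-in file (three columns only) so the example runs offline.
path <- tempfile(fileext = ".csv")
writeLines(c("show_id,type,title",
             "s1,TV Show,3%",
             "s2,Movie,7:19"), path)

# read_csv() takes a file path or a URL as its first argument.
# Reading all columns as character avoids "7:19" being guessed as a time.
titles <- read_csv(path, col_types = cols(.default = col_character()))
print(titles)

# For the real data, pass the raw GitHub URL shown above instead of `path`.
```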
TidyTuesday data
The Data dictionary section documents the variables in the data set.
A bar plot on the types of Netflix titles.
[Figure: bar plot titled "Distribution of types" with two bars, Movie and TV Show.]
A histogram of movie duration.
[Figure: histogram of movie duration; x-axis: Duration (minutes).]
Data from the web
Covid-19 vaccination
Data can be downloaded directly as a CSV file.
▶ Convenient for static data sets that are finalized and will no longer
be updated.
https://data.gov.sg/datasets?topics=health&resultId=d_713e8c4fd88c64a7b7e55e9c2643e936
Monthly rainfall
The National Environment Agency (NEA) publishes monthly data on
total rainfall at certain climate stations.
▶ Available from January 1982 and updated monthly.
APIs
Dataset APIs
We can write our own query in R by specifying the query parameters
(i.e., dataset id).
▶ Detailed instructions can be found on the User guide page.
▶ This will direct you to the developer documentation page.
▶ Click on the Dataset APIs section.
The API endpoint is
https://data.gov.sg/api/action/datastore_search.
▶ We can pass the dataset id into the API query as the resource_id parameter:
https://data.gov.sg/api/action/datastore_search?resource_id=d_b16d06b83473fdfcc92ed9d37b66ba58
Download data from data.gov.sg
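A minimal sketch of the download step, assuming the query is built from the endpoint and dataset id given above and parsed with jsonlite. The helper fetch_page and the variable names are hypothetical, and the actual request is left commented out because it needs internet access.

```r
# Endpoint and dataset id as given in the query above.
base_url   <- "https://data.gov.sg/api/action/datastore_search"
dataset_id <- "d_b16d06b83473fdfcc92ed9d37b66ba58"

# Append the id as the resource_id query parameter.
query_url <- paste0(base_url, "?resource_id=", dataset_id)
print(query_url)

# Hypothetical helper: fetch and parse one page of results.
fetch_page <- function(url) jsonlite::fromJSON(url)

# Uncomment to send the request (requires internet access):
# results_json <- fetch_page(query_url)
```

Building the URL from its parts makes it easy to swap in a different dataset id later.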
Working with the results
str(results_json)
## List of 3
## $ help : chr "https://data.gov.sg/api/3/action/help_show?name=data
## $ success: logi TRUE
## $ result :List of 5
## ..$ resource_id: chr "d_b16d06b83473fdfcc92ed9d37b66ba58"
## ..$ fields :'data.frame': 3 obs. of 2 variables:
## .. ..$ type: chr [1:3] "text" "text" "int4"
## .. ..$ id : chr [1:3] "month" "total_rainfall" "_id"
## ..$ records :'data.frame': 100 obs. of 3 variables:
## .. ..$ _id : int [1:100] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ month : chr [1:100] "1982-01" "1982-02" "1982-03" "1
## .. ..$ total_rainfall: chr [1:100] "107.1" "27.8" "160.8" "157" ...
## ..$ _links :List of 2
## .. ..$ start: chr "/api/action/datastore_search?resource_id=d_b16d0
## .. ..$ next : chr "/api/action/datastore_search?resource_id=d_b16d0
## ..$ total : int 516
Data are stored in the result element.
As with most APIs, there is a limit on the number of records that can
be retrieved per query.
▶ result -> records: A data frame with 100 rows and 3 columns.
▶ result -> total: The latest data set contains 516 rows.
dim(results_json[["result"]][["records"]])
## [1] 100 3
results_json[["result"]][["total"]]
## [1] 516
▶ result -> _links contains two links: The starting query and
the next query.
results_json[["result"]][["_links"]]
The next query
results_json[["result"]][["_links"]][["next"]]
Querying data
We want to keep submitting queries until the required number of
rows has been retrieved.
▶ Use a while loop to repeatedly fetch data until the number of
rows matches the total available rows.
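The loop described above can be sketched as follows. This is one possible version, not necessarily the one used in class: the function fetch_all, its fetch argument, and the offline mock_fetch are hypothetical, and the sketch assumes that the next link is a path relative to https://data.gov.sg, as the str() output suggests.

```r
site <- "https://data.gov.sg"

# Keep following result$_links$next until all rows are collected.
# `fetch` defaults to jsonlite::fromJSON; a mock can be passed for testing.
fetch_all <- function(first_query, fetch = jsonlite::fromJSON) {
  page    <- fetch(paste0(site, first_query))
  results <- page[["result"]][["records"]]
  total   <- page[["result"]][["total"]]

  while (nrow(results) < total) {
    next_query <- page[["result"]][["_links"]][["next"]]
    page       <- fetch(paste0(site, next_query))
    new_rows   <- page[["result"]][["records"]]
    if (NROW(new_rows) == 0) break  # safety guard against an endless loop
    results <- rbind(results, new_rows)
  }
  results
}

# Offline stand-in for the API: 250 rows served 100 at a time.
mock_fetch <- function(url) {
  offset <- if (grepl("offset=", url)) {
    as.integer(sub(".*offset=([0-9]+).*", "\\1", url))
  } else {
    0L
  }
  ids <- seq_len(250)
  idx <- ids[ids > offset & ids <= offset + 100]
  list(result = list(
    records  = data.frame(`_id` = idx, check.names = FALSE),
    `_links` = list(`next` = paste0("/q?offset=", offset + 100)),
    total    = 250L
  ))
}

results <- fetch_all("/q", fetch = mock_fetch)
print(dim(results))
```

With the real API, the first query would be the datastore_search path with the resource_id parameter, and the default fetch function would be used.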
Confirm that we have retrieved the correct data:
str(results)
Visualization
library(tidyverse)
library(lubridate)
df <- results %>%
mutate(month = ym(month),
year = year(month),
total_rainfall = as.numeric(total_rainfall)) %>%
filter(year >= 2020)
▶ Tweak the arguments to see their effects on the plot.
plot(df$month, df$total_rainfall,
type = "l", lwd = 1, col = "purple",
xlab = "Month", ylab = "Monthly rainfall",
main = "Monthly rainfall across years")
[Figure: line plot titled "Monthly rainfall across years"; x-axis: Month, y-axis: Monthly rainfall.]
Your turn: Graduate Employment Survey
Median income across universities
df <- read.csv("../data/ges_sg.csv",
stringsAsFactors = TRUE, na.strings = "na")
df2022 <- df[df$year == 2022, ]
Median income across universities
boxplot(df2022$gross_monthly_median ~ df2022$university,
horizontal = TRUE, las = 1,
main = "Median monthly gross income, 2022",
xlab = "Gross income", ylab = "")
grid()
[Figure: horizontal boxplots of median monthly gross income in 2022, one per university (SMU, SUTD, NUS, NTU, SIT, SUSS); x-axis: Gross income.]
The ggplot2 way
library(tidyverse) # ggplot2 is included in tidyverse
ggplot(df2022) +
geom_boxplot(aes(x = gross_monthly_median, y = university)) +
labs(title = "Median monthly gross income, 2022",
x = "Gross income", y = "")
[Figure: the same boxplots drawn with ggplot2, one per university (SMU, SUTD, NUS, NTU, SIT, SUSS).]
4. R data file.
5. JSON file with jsonlite.
6. Data from the web with readr.
7. Using APIs.