EM622 Data Analysis and Visualization Techniques For Decision-Making

This document provides an introduction to data analysis and visualization techniques in R. It covers importing and manipulating data, basic operations in R like installing packages and exporting data, and different data structures like vectors, arrays, and data frames. The document contains code examples for importing data from files, the web, and other sources. It also demonstrates accessing and subsetting elements within different data structures.

Uploaded by

Ridhi B

EM622 Data Analysis and Visualization Techniques for Decision-Making

Introduction to R and Data Manipulation

Getting Started
RStudio console

[Figure: the RStudio window, with its panes labeled Options (Import dataset), File Viewer (Data & Code), Console (for typing commands), and Plots]


Your first graph
Copy and paste:
data(iris)
plot(Sepal.Width ~ Sepal.Length, data=iris,
col=c("red","orange","blue")[iris$Species],pch=16,
xlab="Sepal Length", ylab="Sepal Width")
legend("topright", legend=levels(iris$Species),
col=c("red","orange","blue"), bty="n",pch=16)

Agenda

1. Basic operations
2. Data structures
3. Data Manipulation
4. Your First Graph

Basic Operation - Import data
1. Import data from the drop-down menu in RStudio:

2. Import data from SAS, SPSS, etc.: http://www.statmethods.net/input/importingdata.html
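
The menu-based import can also be done from the console. The following is a minimal sketch, assuming a comma-separated file; the temporary file and its columns are made up here to keep the example self-contained and are not a course data set.

```r
# Create a small CSV file to stand in for a real data set (hypothetical data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("team,first,second",
             "A,92,70",
             "B,89,73",
             "C,94,77"), csv_path)

# read.csv() reads a comma-separated file into a data frame
scores <- read.csv(csv_path, stringsAsFactors = FALSE)
str(scores)
```

In practice, csv_path would be replaced by the path to your own file, relative to the working directory.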

Intermediate - Import data

## install.packages(c("tseries","lubridate"))
library(tseries)
library(lubridate)
amazon <- as.data.frame(get.hist.quote("amzn",
start="2013-1-1", end="2018-9-15", quote=c("Cl")))

## time series starts 2013-01-02


## time series ends 2018-09-14

amazon$Date<-ymd(row.names(amazon))
tail(amazon)

## Close Date
## 2018-09-07 1952.07 2018-09-07
## 2018-09-10 1939.01 2018-09-10
## 2018-09-11 1987.15 2018-09-11
## 2018-09-12 1990.00 2018-09-12
## 2018-09-13 1989.87 2018-09-13
## 2018-09-14 1970.19 2018-09-14

Advanced - Import data
# list of addresses for raw data.
addressList <- list(
  drives_address = "http://stats.nba.com/js/data/sportvu/drivesData.js",
  defense_address = "http://stats.nba.com/js/data/sportvu/defenseData.js",
  catchshoot_address = "http://stats.nba.com/js/data/sportvu/catchShootData.js")

# function that grabs the data from the website and converts to R data frame
readIt <- function(address) {
  web_page <- readLines(address)

  ## regex to strip javascript bits and convert raw to csv format
  x1 <- gsub("[\\{\\}\\]]", "", web_page, perl = TRUE)
  x2 <- gsub("[\\[]", "\n", x1, perl = TRUE)
  x3 <- gsub("\"rowSet\":\n", "", x2, perl = TRUE)
  x4 <- gsub(";", ",", x3, perl = TRUE)

  # read the resulting csv with read.table()
  nba <- read.table(textConnection(x4), header = TRUE,
                    sep = ",", skip = 2, stringsAsFactors = FALSE)
  return(nba)
}
# download the data
df_list <- lapply(addressList, readIt)

Advanced (Cont.) - Import data

# check the data


catchshoot<-df_list$catchshoot_address
#str(catchshoot) # Get information about structure
head(catchshoot)

## PLAYER_ID PLAYER FIRST_NAME LAST_NAME TEAM_ABBREVIATION GP MIN


## 1 202691 Klay Thompson Klay Thompson GSW 78 34.0
## 2 1717 Dirk Nowitzki Dirk Nowitzki DAL 53 26.3
## 3 2594 Kyle Korver Kyle Korver CLE 35 24.6
## 4 201586 Serge Ibaka Serge Ibaka TOR 23 30.9
## 5 201567 Kevin Love Kevin Love CLE 60 31.4
## 6 202331 Paul George Paul George IND 74 35.8
## PTS FGM FGA FG_PCT FG3M FG3A FG3_PCT EFG_PCT PTS_TOT X
## 1 11.5 4.2 9.3 0.454 3.1 7.1 0.438 0.621 899 NA
## 2 8.1 3.4 7.5 0.446 1.3 3.5 0.388 0.535 427 NA
## 3 7.6 2.7 5.7 0.470 2.2 4.7 0.470 0.662 265 NA
## 4 7.5 2.9 6.9 0.424 1.7 4.3 0.394 0.547 173 NA
## 5 7.5 2.6 6.6 0.388 2.3 5.8 0.395 0.561 448 NA
## 6 7.4 2.7 6.1 0.437 2.0 4.8 0.420 0.603 546 NA

Advanced: scraping the web using R

# install.packages("rvest")
library(rvest)
# Store web url
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
# Scrape the website for the movie rating
rating <- lego_movie %>%
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()
# rating
# Scrape the website for the cast
cast <- lego_movie %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()
# cast

https://stat4701.github.io/edav/2015/04/02/rvest_tutorial/

Advanced (Cont.): scraping the web using R

#Scrape the website for the movie rating


rating

## [1] 7.8

# Scrape the website for the cast


cast

## character(0)

https://stat4701.github.io/edav/2015/04/02/rvest_tutorial/

Basic Operation - Export data

- To export a data frame to a spreadsheet, the easiest way is to use write.csv().
- By default, write.csv() includes row names, but these are usually
  unnecessary and may cause confusion.
- The exported file is stored in the working directory.
# export 'mydf' as a .csv file:
write.csv(mydf,"test.csv")
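
To leave the row names out, write.csv() accepts row.names = FALSE. A minimal sketch, assuming a small stand-in data frame; the name scores_df and the temporary file are illustrative and not from the slides.

```r
# Stand-in data frame (hypothetical values)
scores_df <- data.frame(team = c("A", "B"), first = c(92, 89))

# Write to a temporary file so the working directory is left untouched
out_path <- tempfile(fileext = ".csv")
write.csv(scores_df, out_path, row.names = FALSE)

# Inspect the raw file, then read it back to check the columns
readLines(out_path)
exported <- read.csv(out_path)
names(exported)
```

Without row.names = FALSE, the file would carry an extra unnamed first column holding the row numbers.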

- How do you find your working directory?

# returns the absolute file path of the current working directory
getwd()
## [1] "/Users/annieyu/Dropbox/622 visualization/lectures/Lecture 3_intro_t

- To write data to files in other formats, see:
  http://www.cookbook-r.com/Data_input_and_output/Writing_data_to_a_file/

Basic Operation - Install packages
Two ways to install a package:
1. From the drop-down menu in RStudio:

2. Using a command:
# Download and install packages from CRAN-like repositories or from local files
install.packages(c("ggplot2","tidyr","dplyr"))
# Always load a package before calling its functions:
library(ggplot2)

Basic Operation - Update packages
1. To update all your installed packages to the latest versions available:

update.packages()

2. To store your R code, always create an R script:

3. Export your images to PDF/PNG format:
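
Besides the Export button in the Plots pane, images can be exported from code with base R graphics devices. This is a minimal sketch, assuming the built-in iris data and temporary output files; the file names and sizes are illustrative choices.

```r
# Save a plot to a PDF file; tempfile() keeps the example self-contained
pdf_path <- tempfile(fileext = ".pdf")
pdf(pdf_path, width = 7, height = 5)      # open the PDF device (size in inches)
plot(Sepal.Width ~ Sepal.Length, data = iris, pch = 16)
dev.off()                                 # close the device to finish the file

# The same pattern works for PNG output (size in pixels)
png_path <- tempfile(fileext = ".png")
png(png_path, width = 800, height = 600)
plot(Sepal.Width ~ Sepal.Length, data = iris, pch = 16)
dev.off()

file.exists(c(pdf_path, png_path))
```

Everything drawn between opening the device and dev.off() goes to the file instead of the Plots pane.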

Getting Started
R programming style

- R is case sensitive: a and A are two different objects.
- The assignment symbol is <-. Alternatively, the classical = symbol
  can be used.
- The symbol # starts a comment that runs to the end of the line:

# This is a comment
# The two following statements are equivalent;
# both assign the value 1 to object a:
a <- 1
a = 1

Data Structure
1. Vector
2. Matrix
3. Array
4. Data Frame
5. List

http://venus.ifca.unican.es/Rintro/dataStruct.html

Data Structure - Variable
Like most other languages, R lets you assign values to variables and refer
to them by name:
x <- 1
# x gets 1
y <- 2
# c(...): a generic function which combines values into a vector
z <- c(x,y)
# evaluate z to see what's stored as z
z

## [1] 1 2

Notice that the substitution is done at the time that the value is assigned
to z, not the time that z is evaluated:
y <- 5
z

## [1] 1 2

Data Structure - Vector
Fetch element(s) by location in a vector:

a <- c(1,2,3,4,5,6,7,8)
a

## [1] 1 2 3 4 5 6 7 8

# fetch the 5th item in vector a:


a[5]

## [1] 5

# fetch item 1 through 6:


a[1:6]

## [1] 1 2 3 4 5 6

# fetch item 1, 3, 7:
a[c(1,3,7)]

## [1] 1 3 7
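
Beyond positive positions, vectors can also be indexed with negative positions and with logical conditions. A short sketch using the same vector a as above:

```r
a <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Negative indices drop elements: everything except the 1st and 2nd items
a[-c(1, 2)]

# A logical condition keeps only the elements for which it is TRUE
a[a > 5]

# which() returns the positions that satisfy the condition
which(a > 5)
```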

Data Structure - Array
- In R, you can construct more complicated data structures than just
  vectors.
- An array object is just a vector that’s associated with a dimension
  attribute.

# Define an array
a <- array(c(1, 2, 3, 4, 5, 6, 7, 8), dim=c(2, 4))
a

##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8

# fetch one cell in array a:


a[2,3]

## [1] 6

# fetch 1st row only


a[1,]

## [1] 1 3 5 7
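
The statement that an array is a vector with a dimension attribute can be checked directly. A minimal sketch; the object names v and b are illustrative.

```r
v <- 1:8
is.null(dim(v))      # a plain vector has no dimension attribute

# Assigning a dim attribute turns the same values into a 2 x 4 array
dim(v) <- c(2, 4)
v

# Arrays can have more than two dimensions: 2 rows, 2 columns, 2 layers
b <- array(1:8, dim = c(2, 2, 2))
b[1, 2, 2]           # row 1, column 2, layer 2
```

Values are filled column by column, which is why a[2,3] above is 6 rather than 7.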

Data Structure - Data frame
- A data frame is a list that contains multiple named vectors of
  the same length.
- It is like a spreadsheet or a database table, and is particularly good for
  representing experimental data.
# data.frame() is a function that creates data frames
team <-c("A","B","C","D","E")
first <- c(92, 89, 94, 72, 59)
second <- c(70, 73, 77, 90, 102)
mydf <- data.frame(team, first, second)
mydf

##   team first second
## 1    A    92     70
## 2    B    89     73
## 3    C    94     77
## 4    D    72     90
## 5    E    59    102

# refer to the components of a data frame by name:


mydf$team

## [1] A B C D E
## Levels: A B C D E

Data Structure - List
- R has a built-in data type for mixing objects of different types, called
  lists.

# list() function to construct R lists.
# Example: a list containing two character vectors and a data frame
e <- list(thing=c("hat","shoes"), size=c("8.25","5"), myData=mydf)
e

## $thing
## [1] "hat" "shoes"
##
## $size
## [1] "8.25" "5"
##
## $myData
##   team first second
## 1    A    92     70
## 2    B    89     73
## 3    C    94     77
## 4    D    72     90
## 5    E    59    102

Data Structure - List (Cont.)

# fetch the 1st item in the list:


e$thing

## [1] "hat" "shoes"

e[1]

## $thing
## [1] "hat" "shoes"

# fetch the 1st row in the data frame,
# which is the third component in the list:
e$myData[1,]

##   team first second
## 1    A    92     70
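
Note that e$thing and e[1] above print differently: single brackets return a one-element list, while $ and double brackets return the component itself. A minimal sketch with a smaller list of the same shape:

```r
e <- list(thing = c("hat", "shoes"), size = c("8.25", "5"))

# Single brackets return a list containing the selected component
class(e[1])

# Double brackets (like $) return the component itself
class(e[[1]])
identical(e[[1]], e$thing)

# So individual elements are reached through [[ ]] or $
e[[1]][2]
```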

Data Structure - Get Info about structure
# Here are some sample variables for example:
n <- 1:4
let <- LETTERS[1:4]
let

## [1] "A" "B" "C" "D"

df <- data.frame(n, let)


df

##   n let
## 1 1   A
## 2 2   B
## 3 3   C
## 4 4   D

# Get information about structure


str(df)

## 'data.frame': 4 obs. of 2 variables:


## $ n : int 1 2 3 4
## $ let: Factor w/ 4 levels "A","B","C","D": 1 2 3 4

Data Structure - Get Info about structure

# Get the length of a vector


length(n)

## [1] 4

# Number of rows
nrow(df)

## [1] 4

# Number of columns
ncol(df)

## [1] 2

# Get num of rows and columns


dim(df)

## [1] 4 2

Data Exploration [1]
“Happy families are all alike; every unhappy family is unhappy in its own
way.” Leo Tolstoy

“Tidy datasets are all alike, but every messy dataset is messy in its own
way.” Hadley Wickham

[1] Hadley Wickham. http://r4ds.had.co.nz/tidy-data.html


Working with NA and NaN
R has some special values:
- NA: Not Available (i.e., missing values)
- NaN: Not a Number
- Inf: Infinity
- -Inf: Minus Infinity

# For instance:
0/0

## [1] NaN

1/0

## [1] Inf

# Here's how to test whether a variable has one of these values:


y <- NA
# Is y NA?
is.na(y)

## [1] TRUE
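
The test functions treat the special values differently, which matters when cleaning data. A short sketch; the vector x is made up for illustration.

```r
x <- c(1, NA, NaN, Inf)

# is.na() is TRUE for both NA and NaN
is.na(x)

# is.nan() is TRUE only for NaN
is.nan(x)

# is.finite() is FALSE for NA, NaN, Inf and -Inf
is.finite(x)
```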

Working with NA and NaN
Ignoring "bad" values in vector summary functions:
- If you run functions like mean() or sum() on a vector or data frame
  containing NA or NaN, they return NA or NaN (a "bad" value).
- Many of these functions take the flag na.rm, which tells them to
  ignore these values:
df1 <- c(1, 2, 3, NA, 5)
mean(df1)

## [1] NA

mean(df1, na.rm=TRUE)

## [1] 2.75

df2 <- c(1, 2, 3, NaN, 5)


sum(df2)

## [1] NaN

sum(df2, na.rm=TRUE)

## [1] 11

Example: Import Data
library(readr)
HW <- read_csv("dataSets/Student_List_HW.csv")
HW<-as.data.frame(HW)
summary(HW)

## Last_Name First_Name Status


## Length:20 Length:20 Length:20
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Home Homework_1 Homework_2 Homework_3
## Length:20 Min. :58.00 Min. :77.00 Min. : 80.00
## Class :character 1st Qu.:70.50 1st Qu.:80.00 1st Qu.: 85.50
## Mode :character Median :74.50 Median :88.00 Median : 90.50
## Mean :77.39 Mean :87.35 Mean : 90.90
## 3rd Qu.:84.25 3rd Qu.:93.00 3rd Qu.: 98.25
## Max. :99.00 Max. :99.00 Max. :100.00
## NA's :2

Example: Replace Missing Values
HW$Homework_1[is.na(HW$Homework_1)]<-0
HW$Home[which(HW$Last_Name=="Garcia")]<-"NJ"
HW$Home[is.na(HW$Home)]<-"Unknown"
HW<-HW[complete.cases(HW),]
summary(HW)

## Last_Name First_Name Status


## Length:18 Length:18 Length:18
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Home Homework_1 Homework_2 Homework_3
## Length:18 Min. : 0.00 Min. :77.00 Min. : 80.00
## Class :character 1st Qu.:66.75 1st Qu.:80.00 1st Qu.: 86.25
## Mode :character Median :74.50 Median :86.00 Median : 90.50
## Mean :70.28 Mean :86.39 Mean : 91.33
## 3rd Qu.:84.25 3rd Qu.:91.75 3rd Qu.: 98.75
## Max. :99.00 Max. :98.00 Max. :100.00
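
The same pattern can be tried on a small data frame where every step is visible. This is a sketch with made-up values; the data frame hw and its columns are stand-ins, not the course file.

```r
# Stand-in for the homework data (hypothetical values)
hw <- data.frame(name  = c("Ann", "Bob", "Cy"),
                 home  = c("NJ", NA, "PA"),
                 score = c(90, NA, 75),
                 final = c(88, 92, NA),
                 stringsAsFactors = FALSE)

# Replace missing values in selected columns
hw$score[is.na(hw$score)] <- 0
hw$home[is.na(hw$home)] <- "Unknown"

# complete.cases() flags rows with no NA left in any column
complete.cases(hw)

# Keep only the complete rows
hw_complete <- hw[complete.cases(hw), ]
nrow(hw_complete)
```

Here the third row is dropped because its final value is still missing, which mirrors how the summary above went from 20 rows to 18.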

Subset Observations (Rows) [2]

[2] https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

Subset Observations (Rows) Cont.

#load dplyr
library(dplyr)
Subset_HW_1 <- filter(HW,Status == "Master")
head(Subset_HW_1)

## Last_Name First_Name Status Home Homework_1 Homework_2 Homework_3


## 1 Brown Susan Master NJ 74 88 98
## 2 Wilson Karen Master NJ 0 93 84
## 3 Moore Nancy Master PA 74 91 89
## 4 Taylor Betty Master GA 93 92 88
## 5 Anderson Anthony Master CA 96 98 100
## 6 Thomas Donald Master NJ 82 77 96

Subset Variables (Columns)

There are many options to choose columns

Subset Variables (Columns) Cont.

Subset_HW_2 <- select(HW,contains("Name"),contains("Homework"))


head(Subset_HW_2)

## Last_Name First_Name Homework_1 Homework_2 Homework_3


## 1 Smith Patricia 82 97 82
## 2 Johnson Jennifer 0 77 99
## 3 Williams Robert 99 80 80
## 4 Jones Michael 75 82 86
## 5 Brown Susan 74 88 98
## 7 Miller Richard 85 78 82
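
contains() is one of several column-selection helpers in dplyr. A minimal sketch of a few others, assuming the dplyr package is installed; the data frame hw is a made-up stand-in for HW.

```r
library(dplyr)

# Stand-in data frame (hypothetical values)
hw <- data.frame(Last_Name = c("Smith", "Jones"),
                 First_Name = c("Pat", "Mike"),
                 Status = c("Master", "PhD"),
                 Homework_1 = c(82, 75),
                 Homework_2 = c(97, 82))

# Columns whose names start with a prefix
names(select(hw, starts_with("Homework")))

# Columns whose names end with a suffix
names(select(hw, ends_with("Name")))

# Drop a column with a minus sign
names(select(hw, -Status))

# A range of adjacent columns
names(select(hw, Last_Name:Status))
```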

Subset Observations (Rows) and Variables (Columns)

Subset_HW_3 <- subset(HW, Status == "Master",
                      select=c("Last_Name","First_Name",
                               "Homework_1","Homework_2","Homework_3"))
head(Subset_HW_3)

## Last_Name First_Name Homework_1 Homework_2 Homework_3


## 5 Brown Susan 74 88 98
## 8 Wilson Karen 0 93 84
## 9 Moore Nancy 74 91 89
## 10 Taylor Betty 93 92 88
## 11 Anderson Anthony 96 98 100
## 12 Thomas Donald 82 77 96

Pipe Operator

Piping makes code more readable and lets us chain several actions, such as
sorting, filtering, or creating a variable, in a single statement.

Pipe Operator Cont.

HW %>%
filter(Status == "Master") %>%
select(contains("Name"),contains("Homework"))%>%
arrange(desc(Homework_1))%>%
head()

## Last_Name First_Name Homework_1 Homework_2 Homework_3


## 1 Anderson Anthony 96 98 100
## 2 Taylor Betty 93 92 88
## 3 Garcia Linda 93 91 100
## 4 Thomas Donald 82 77 96
## 5 Brown Susan 74 88 98
## 6 Moore Nancy 74 91 89
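
To see what the pipe buys, compare the same steps written as nested calls and as a pipeline. A sketch using the built-in mtcars data instead of HW, assuming dplyr is installed:

```r
library(dplyr)

# Nested calls are read from the inside out
nested <- head(arrange(filter(mtcars, cyl == 4), desc(mpg)), 3)

# The same steps with the pipe are read top to bottom
piped <- mtcars %>%
  filter(cyl == 4) %>%
  arrange(desc(mpg)) %>%
  head(3)

# Both forms give the same result
identical(nested, piped)
```

The pipe passes the result on its left as the first argument of the function on its right, so x %>% f(y) is the same as f(x, y).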

Create New Columns and Re-order
The mutate() function will add new columns to the data frame.
Arrange or re-order rows using arrange().

HW_update<-HW %>%
filter(Status != "Unknown") %>%
mutate(Homework_Average = 0.2*Homework_1+0.3*Homework_2+0.5*Homework_3)%>%
arrange(desc(Homework_Average))
head(HW_update)

## Last_Name First_Name Status Home Homework_1 Homework_2


## 1 Anderson Anthony Master CA 96 98
## 2 Garcia Linda Master NJ 93 91
## 3 Wang Thomas PhD CHINA 72 98
## 4 Martin Morgan Undergraduate NJ 72 88
## 5 Brown Susan Master NJ 74 88
## 6 Taylor Betty Master GA 93 92
## Homework_3 Homework_Average
## 1 100 98.6
## 2 100 95.9
## 3 95 91.3
## 4 99 90.3
## 5 98 90.2
## 6 88 90.2

Split-Apply-Combine
Idea: split up a big problem into manageable pieces, apply a function to
each piece and then combine all the pieces together.

Data         Split (by X)    Apply (average)    Combine
X  Y         X  Y            X  Y               X  Y
A  2         A  2            A  3               A  3
A  4         A  4
B  0         B  0            B  2.5             B  2.5
B  5         B  5
C  5         C  5            C  7.5             C  7.5
C  10        C  10
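
The three steps in the diagram can be sketched in base R with the same six observations. The object names (d, pieces, averages, result) are illustrative.

```r
# The six observations from the diagram
d <- data.frame(X = c("A", "A", "B", "B", "C", "C"),
                Y = c(2, 4, 0, 5, 5, 10))

# Split: one vector of Y values per level of X
pieces <- split(d$Y, d$X)

# Apply: compute the mean of each piece
averages <- sapply(pieces, mean)

# Combine: put the results back into a single data frame
result <- data.frame(X = names(averages), Y = unname(averages))
result

# tapply() does all three steps in one call
tapply(d$Y, d$X, mean)
```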

Group Data
Implement group operations in the “split-apply-combine” concept:

Group Data

Group_Summarise_HW<- HW %>%
filter(Status != "Unknown") %>%
mutate(Homework_Average = 0.2*Homework_1+0.3*Homework_2+0.5*Homework_3)%>%
group_by(Status) %>%
summarise(Homework_Average=mean(Homework_Average),
Number_of_Student=length(Status))%>%
arrange(desc(Homework_Average))
head(Group_Summarise_HW)

## # A tibble: 3 x 3
## Status Homework_Average Number_of_Student
## <chr> <dbl> <int>
## 1 Master 87.4 8
## 2 PhD 86.4 2
## 3 Undergraduate 83.7 8
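
A smaller version of the same pipeline makes each split-apply-combine step easy to check by hand. This sketch assumes dplyr is installed and uses made-up scores; it also uses n(), dplyr's row counter, in place of length(Status).

```r
library(dplyr)

# Stand-in data (hypothetical values)
scores <- data.frame(Status = c("Master", "Master", "PhD", "PhD", "PhD"),
                     Score  = c(90, 70, 70, 85, 100))

by_status <- scores %>%
  group_by(Status) %>%                  # split by Status
  summarise(Mean_Score = mean(Score),   # apply mean() to each group
            Count = n()) %>%            # n() counts the rows in the group
  arrange(desc(Mean_Score))             # order the combined result
by_status
```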

Reshape Data [3]
Let's change the layout of a data set. Our tools from the tidyr library are:

[3] https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

Reshape Data Cont.

- gather() makes "wide" data longer
- unite() combines two variables into one variable

#load tidyr
library(tidyr)
tidyr_HW<- HW %>% unite(Name, First_Name, Last_Name, sep = " ")%>%
select(-c(Status,Home)) %>%
gather(Homework, Score, Homework_1:Homework_3)
head(tidyr_HW)

## Name Homework Score


## 1 Patricia Smith Homework_1 82
## 2 Jennifer Johnson Homework_1 0
## 3 Robert Williams Homework_1 99
## 4 Michael Jones Homework_1 75
## 5 Susan Brown Homework_1 74
## 6 Richard Miller Homework_1 85
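
The two reshaping steps are easier to follow on a two-row example. A sketch assuming tidyr (version 1.0.0 or later) is installed; the data frame wide is made up. pivot_longer() is shown alongside gather() because it is tidyr's newer interface for the same operation.

```r
library(tidyr)

# Stand-in wide data: one row per student, one column per homework
wide <- data.frame(First_Name = c("Pat", "Mike"),
                   Last_Name  = c("Smith", "Jones"),
                   Homework_1 = c(82, 75),
                   Homework_2 = c(97, 82))

# unite() pastes the two name columns into one
named <- unite(wide, Name, First_Name, Last_Name, sep = " ")

# gather() stacks the homework columns into key/value pairs
long <- gather(named, Homework, Score, Homework_1:Homework_2)
long

# pivot_longer() is the newer tidyr interface for the same reshaping
long2 <- pivot_longer(named, cols = Homework_1:Homework_2,
                      names_to = "Homework", values_to = "Score")
long2
```

The two results hold the same rows in a different order: gather() stacks column by column, while pivot_longer() goes row by row.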

Merge Data
Exam<- read_csv("dataSets/Student_List_Exam.csv")
Exam<-as.data.frame(Exam)
head(Exam,3)

## Last_Name First_Name Exam Project


## 1 Smith Patricia 77 65
## 2 Johnson Jennifer 100 96
## 3 Williams Robert 92 53

HW_update<-mutate(HW,Homework_Average =
0.2*Homework_1+0.3*Homework_2+0.5*Homework_3)
Merged_df<-inner_join(HW_update, Exam,by=c("Last_Name","First_Name"))
head(Merged_df,3)

## Last_Name First_Name Status Home Homework_1 Homework_2 Homework_3


## 1 Smith Patricia Undergraduate MD 82 97 82
## 2 Johnson Jennifer Undergraduate NY 0 77 99
## 3 Williams Robert Undergraduate NY 99 80 80
## Homework_Average Exam Project
## 1 86.5 77 65
## 2 72.6 100 96
## 3 83.8 92 53
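
inner_join() keeps only the rows whose keys appear in both tables. The sketch below, assuming dplyr is installed and using made-up tables, contrasts it with left_join(), which keeps every row of the first table.

```r
library(dplyr)

# Stand-in tables (hypothetical values)
homework <- data.frame(Last_Name = c("Smith", "Jones", "Brown"),
                       Homework  = c(86.5, 81.0, 90.2))
exam     <- data.frame(Last_Name = c("Smith", "Jones", "Lee"),
                       Exam      = c(77, 88, 95))

# inner_join() keeps only the keys present in both tables
inner <- inner_join(homework, exam, by = "Last_Name")
inner

# left_join() keeps every row of the first table and fills gaps with NA
left <- left_join(homework, exam, by = "Last_Name")
left
```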

ggplot2

- ggplot2 is an R package designed for creating high-quality plots.
- ggplot2 is based on the layered grammar of graphics, which means
  that plots can be constructed layer by layer.

# you need to install the package just once
install.packages('ggplot2')
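
Building a plot "layer by layer" can be sketched as follows, assuming ggplot2 is installed and using the built-in iris data. The labels and title are illustrative choices.

```r
library(ggplot2)

# Step 1: the data and the aesthetic mappings (no geometry yet)
p <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width))

# Step 2: add a layer that draws the observations as points, colored by species
p <- p + geom_point(aes(colour = Species))

# Step 3: add axis labels and a title
p <- p + labs(x = "Sepal Length", y = "Sepal Width",
              title = "Iris sepal measurements")

# Printing the object draws the plot
print(p)
```

Each + returns a new plot object, so intermediate versions can be stored and reused.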

Composition of plots in ggplot2
Plots have two main components: 1) data to use and 2) type of plot.

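ggplot(data=economics) + geom_point(aes(x=date, y=unemploy))

- ggplot(): the basic function for plotting
- data=economics: specifies the dataset (the data to use)
- geom_point(): we want points (the type of plot)
- aes(): the aesthetics
- x=date: specifies what goes on the X axis
- y=unemploy: specifies what goes on the Y axis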

Our first official graph
library(ggplot2)
ggplot(data=iris)+
geom_point(aes(x=Sepal.Width,y=Sepal.Length,colour=Species))

[Figure: scatter plot of Sepal.Length (y axis) against Sepal.Width (x axis), with points colored by Species: setosa, versicolor, virginica]

Resources

1. Rob Kabacoff, “R in Action”: https://www.amazon.com/Action-Data-Analysis-Graphics/dp/1617291382/ref=pd_sbs_14_t_0?_encoding=UTF8&psc=1&refRID=EEBN1DRHWQ6J09Z6TTBY
2. Michael J Crawley, “The R Book”: http://users.humboldt.edu/ygkim/CrawleyMJ_TheRBook.pdf
3. Joseph Adler, “R in a Nutshell”: http://www.amazon.com/R-Nutshell-Joseph-Adler/dp/144931208X
4. Quick-R tutorial: http://www.statmethods.net/input/datatypes.html
5. Cookbook for R, Data input and output: http://www.cookbook-r.com/Data_input_and_output/Writing_data_to_a_file/

What have we learned?

1. Defining data structures such as vectors, arrays, lists, and data frames
2. Basic operations such as installing packages and importing/exporting datasets
3. Common data manipulation operations such as filtering rows,
   selecting specific columns, re-ordering rows, adding new columns,
   summarizing data, and performing the "split-apply-combine" task
4. Drawing a graph