Al. I. Cuza University of Iași
Faculty of Economics and Business Administration
Department of Accounting, Information Systems and Statistics
Data Analysis & Data Science with R
Data structures in R. Built-in Datasets
By Marin Fotache
Data structures in R
Tutorials (and code) on Data Structures
Data structures (Advanced R by Hadley Wickham)
http://adv-r.had.co.nz/Data-structures.html
1.2 Variables (Variables and Data Structures)
https://www.youtube.com/watch?v=DG7YNf8kb3w
2 - Introduction to R : Atomic Classes
https://www.youtube.com/watch?v=271FKAYavYE
http://repidemiology.wordpress.com/introduction-to-r-code/
1.3 Vectors (Variables and Data Structures)
https://www.youtube.com/watch?v=QygSZw77Hs8
3 - Introduction to R : Vectors
https://www.youtube.com/watch?v=MGphwmXCCgM#t=12
http://repidemiology.wordpress.com/introduction-to-r-code/
1.4 Matrices (Variables and Data Structures)
https://www.youtube.com/watch?v=UakyyZSyuZU
Tutorials on Data Structures (cont.)
1.5 Lists and Data Frames (Variables and Data Structures)
https://www.youtube.com/watch?v=U6vbR4el3kQ
1.6 Logical Vectors and Operators (Variables and Data Structures)
https://www.youtube.com/watch?v=GQb735O2qjc
4 - Introduction to R : Matrix, List and Data Frame
https://www.youtube.com/watch?v=cEX4iXUPqoo
http://repidemiology.wordpress.com/introduction-to-r-code/
Common Data Structures in R
https://www.youtube.com/watch?v=q5YJUGTYUvI
Introduction to R Statistical Computing: Data Structures
https://www.youtube.com/watch?v=OZD4oLobjWM
Lecture 2b: Subsetting
https://www.youtube.com/watch?v=hWbgqzsQJF0&index=7&list=PLjTlxb-wKvXNSDfcKPFH2gzHGyjpeCZmJ
R script associated with this presentation
02b_data_structures__datasets.R
http://1drv.ms/1sYllLB
Vectors with c() function
Vectors are one-dimensional arrays that can hold numeric, character, logical, or date/time/timestamp data
Most frequently, the function c() is used to declare/form the vector
> x = c(1, 3, 5, 7, 25, -13, 47)
> x
[1]   1   3   5   7  25 -13  47
> y = c("one", "two", "three", "eight")
> y
[1] "one"   "two"   "three" "eight"
> z = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
> z
[1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE
The data in a vector must be of a single type (numeric, character, or logical)
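Because a vector holds a single type, R coerces mixed input to the most general type involved. A minimal sketch of this behaviour (the object names mixed and nums are illustrative, not part of the original script):

```r
# Mixing types in c() triggers implicit coercion:
# logical -> numeric -> character
mixed <- c(1, "two", TRUE)
class(mixed)   # "character"
mixed          # "1" "two" "TRUE"

nums <- c(1, TRUE, FALSE)
class(nums)    # "numeric"
nums           # 1 1 0
```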
Vectors of numbers with sequences
Vectors can also be created with a sequence
> ten_integers.1 <- 5:14
> ten_integers.1
[1] 5 6 7 8 9 10 11 12 13 14
or
> ten_integers.2 <- seq(from=5, to=14, by=1)
> ten_integers.2
[1] 5 6 7 8 9 10 11 12 13 14
Declare a vector of descending numbers
> seq(from=5, to=-5, by=-1)
[1] 5 4 3 2 1 0 -1 -2 -3 -4 -5
Combine sequences and the c function
> a_vector <- c( 2:4, 8:14)
> a_vector
[1] 2 3 4 8 9 10 11 12 13 14
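Two further variants of sequence generation may be useful alongside the ones above. This is only a sketch; the names evenly_spaced and first_five are illustrative:

```r
# length.out fixes the number of elements instead of the step
evenly_spaced <- seq(from = 0, to = 1, length.out = 5)
evenly_spaced            # 0.00 0.25 0.50 0.75 1.00

# seq_len(n) generates 1..n and is empty when n is 0
first_five <- seq_len(5)
first_five               # 1 2 3 4 5
length(seq_len(0))       # 0, whereas 1:0 would give c(1, 0)
```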
Vectors containing a range of dates
Generating a vector with dates between September 29th and October 2nd 2014 as "pure" dates
First solution:
> seq(as.Date("2014/09/29"), by = "day", length.out = 4)
Second solution:
> seq(as.Date("2014/09/29"), as.Date("2014/10/02"), "days")
In both cases the result is:
[1] "2014-09-29" "2014-09-30" "2014-10-01" "2014-10-02"
Vectors containing a range of timestamps
Generating a vector with dates between September 29th and October 2nd 2014 as timestamps
First solution
> seq(c(ISOdate(2014,9,29)), by = "DSTday", length.out = 4)
Second solution (this example starts on September 25th and spans eight days)
> x <- as.POSIXct("2014-09-25 23:59:59", tz="Turkey")
> format(seq(x, by="day", length.out=8), "%Y-%m-%d %Z")
Third solution (same range as the second)
> d1 <- ISOdate(year=2014, month=9, day=25, tz="GMT")
> seq(from=d1, by="day", length.out=8)
Vectors generated from the normal distribution
Vector object named x contains five random values drawn from the standard normal distribution; values are not ordered
> x <- rnorm(5)
> x
[1] -0.2766566  0.7262000 -0.3409396 -0.5192846  0.5508588
Numbers are drawn randomly, so calling the same function again yields five different numbers:
> x <- rnorm(5)
> x
[1]  1.9030714 -1.7139177 -0.2287666  0.8369275  0.4203014
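When the same random values are needed again (for instance, to reproduce an analysis), the random number generator can be seeded first. A minimal sketch; the seed value 2014 and the names draw_1 and draw_2 are arbitrary choices made for illustration:

```r
set.seed(2014)          # the seed value itself is arbitrary
draw_1 <- rnorm(5)
set.seed(2014)          # resetting the seed replays the same draws
draw_2 <- rnorm(5)
identical(draw_1, draw_2)   # TRUE
```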
Vectors created with function rep (repeat)
Vector x.rep contains a sequence of numbers (5, 7, 11) repeated three times
> x.rep <- rep(c(5, 7, 11), 3)
> x.rep
[1]  5  7 11  5  7 11  5  7 11
Compare with the version that also uses the each argument (every element is repeated twice, then the whole result three times):
> x.rep.2 <- rep(c(5, 7, 11), each=2, times=3)
> x.rep.2
[1]  5  5  7  7 11 11  5  5  7  7 11 11  5  5  7  7 11 11
Example of built-in (system defined) vectors
> letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n"
"o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
> LETTERS
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N"
"O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
> month.name
 [1] "January"   "February"  "March"     "April"     "May"       "June"      "July"      "August"    "September"
[10] "October"   "November"  "December"
> state.name
[1] "Alabama"  "Alaska"   "Arizona"  "Arkansas"
...
> state.area
[1]  51609 589757 113909  53104
...
Vectors of factors
Factors are nominal variables whose values have a number of levels
Very important in data analysis and visualization
Ex: two vectors:
student names
student genders
Both vectors initially contain characters
> names <- c("Popescu I. Valeria", "Ionescu V. Viorel",
+     "Genete I. Aurelia", "Lazar T. Ionut",
+     "Sadovschi V. Iuliana", "Dominte I. Nicoleta")
> genre <- c("Female", "Male", "Female", "Male",
+     "Female", "Female")
> class(names)
[1] "character"
> class(genre)
[1] "character"
Vectors of factors (cont.)
> unclass(genre)
[1] "Female" "Male"   "Female" "Male"   "Female" "Female"
genre can have only two values, so it is converted into a factor
> genre <- as.factor(genre)
> class(genre)
[1] "factor"
> unclass(genre)
[1] 1 2 1 2 1 1
attr(,"levels")
[1] "Female" "Male"
If a value that is not among the levels is appended to the vector genre with c(), the result is no longer a factor but a character vector (the existing elements then appear as their integer codes, such as "1" and "2", rather than as labels)
> genre <- c(genre, "Boy")
> class(genre)
[1] "character"
> unclass(genre)
Functions for getting vector type and length
class returns the data type of the elements; unclass returns the underlying values
> class(ten_integers.1)
[1] "integer"
> unclass(ten_integers.1)
[1] 5 6 7 8 9 10 11 12 13 14
Internally, factor levels are stored as integers
> class(genre)
[1] "factor"
> unclass(genre)
[1] 1 2 1 2 1 1
attr(,"levels")
[1] "Female" "Male"
> typeof(genre)
[1] "integer"
Function length returns the number of elements in a vector
> length(ten_integers.1)
[1] 10
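The integer storage of factors can be inspected directly. The sketch below rebuilds a small factor (named genre_f here only to keep the example self-contained) and shows how the integer codes map back to the labels:

```r
genre_f <- factor(c("Female", "Male", "Female", "Male", "Female", "Female"))
levels(genre_f)       # "Female" "Male"
as.integer(genre_f)   # 1 2 1 2 1 1
# Indexing the levels with the codes recovers the original labels
labels_back <- levels(genre_f)[as.integer(genre_f)]
labels_back
table(genre_f)        # counts per level: Female 4, Male 2
```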
Referencing vector elements
First element in vector ten_integers.1
> ten_integers.1 [1]
[1] 5
Last element in vector ten_integers.1
> ten_integers.1 [length(ten_integers.1)]
[1] 14
First three elements in vector ten_integers.1
> ten_integers.1 [1:3]
[1] 5 6 7
Last three elements in vector
> ten_integers.1 [(length(ten_integers.1)-2) : length(ten_integers.1)]
[1] 12 13 14
First, third, fifth and sixth elements
> ten_integers.1 [c(1, 3, 5, 6)]
[1] 5 7 9 10
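For the first or last few elements, head() and tail() offer a shorter notation than computing indices by hand. A small sketch, using an illustrative vector ten_integers with the same values as above:

```r
ten_integers <- 5:14
head(ten_integers, 3)    # first three: 5 6 7
tail(ten_integers, 3)    # last three: 12 13 14
tail(ten_integers, 1)    # last element: 14
```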
Referencing vector elements (cont.)
Indices of elements can be supplied through other vectors
Display first, third, fifth and sixth elements in vector ten_integers.1
Vector ind contains indices for elements of interest from vector ten_integers.1
> ind <- c(1, 3, 5, 6)
> ind
[1] 1 3 5 6
> ten_integers.1
[1] 5 6 7 8 9 10 11 12 13 14
Now the result:
> ten_integers.1 [ind]
[1] 5 7 9 10
Excluding elements from a vector
Basic idea: R will exclude from a vector the elements whose indices are negative (prefixed by minus)
Excluding first element:
> ten_integers.1 [-1]
[1] 6 7 8 9 10 11 12 13 14
Excluding first three elements:
> ten_integers.1 [-(1:3)]
[1] 8 9 10 11 12 13 14
Excluding first, third, and fourth elements:
> ten_integers.1 [-(c(1,3,4))]
[1] 6 9 10 11 12 13 14
Excluding elements from a vector (cont.)
Excluding first three elements, the 6th element and the 8th element
> ten_integers.1 [-(c(1:3,6,8))]
[1] 8 9 11 13 14
Excluding the first two elements and the last two elements of the vector:
> ten_integers.1 [-c((1:2),
+     (length(ten_integers.1)-1) : length(ten_integers.1))]
[1] 7 8 9 10 11 12
Vector filtering
Filter vector elements - select only elements greater than 10
> ten_integers.1 [ten_integers.1 > 10]
[1] 11 12 13 14
How many elements are greater than 10?
> length(ten_integers.1 [ten_integers.1 > 10])
[1] 4
Display INDICES of elements greater than 10
> which (ten_integers.1 > 10)
[1] 7 8 9 10
Filter vector elements - select only elements greater than 10, ver. 2
> ind <- which (ten_integers.1 > 10)
> ten_integers.1 [ind]
[1] 11 12 13 14
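Since TRUE counts as 1 and FALSE as 0 in arithmetic, the elements that satisfy a condition can also be counted without building the filtered vector first. A short sketch with an illustrative vector ten_integers:

```r
ten_integers <- 5:14
n_above    <- sum(ten_integers > 10)    # 4 elements are greater than 10
share_above <- mean(ten_integers > 10)  # 0.4, the share of such elements
n_above
share_above
```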
Sorting/ordering a vector
Initial vector
> names <- c("Popescu I. Valeria", "Ionescu V. Viorel",
+     "Genete I. Aurelia", "Lazar T. Ionut",
+     "Sadovschi V. Iuliana", "Dominte I. Nicoleta")
Sort the vector elements in ascending (default) order
> names <- sort(names)
> names
[1] "Dominte I. Nicoleta"  "Genete I. Aurelia"    "Ionescu V. Viorel"    "Lazar T. Ionut"
[5] "Popescu I. Valeria"   "Sadovschi V. Iuliana"
Sorting the vector in descending order
> names.desc <- rev(sort(names))
> names.desc
[1] "Sadovschi V. Iuliana" "Popescu I. Valeria"   "Lazar T. Ionut"       "Ionescu V. Viorel"
[5] "Genete I. Aurelia"    "Dominte I. Nicoleta"
R as a vectorized language
Lecture 2c: Vectorized Operations
https://www.youtube.com/watch?v=Fm8SORJQjPY&list=PLjTlxb-wKvXNSDfcKPFH2gzHGyjpeCZmJ&index=8
Operations are automatically applied on each element of the vector without looping among vector elements
> num.vec.1 <- c(1, 3, 5, 7, 25, -13, 47)
> num.vec.2 <- num.vec.1 + 100
> num.vec.2
[1] 101 103 105 107 125 87 147
> date.vec.1 <- c ("2013-10-01", "2013-10-03", "2013-10-10")
For the moment, elements are strings
> class(date.vec.1)
[1] "character"
as.Date() converts all of the vector elements into dates
> date.vec.1 <- as.Date(date.vec.1)
> class(date.vec.1)
[1] "Date"
R as a vectorized language (cont.)
Operations can be applied on two or more vectors
> num.vec.3 <- num.vec.1 + num.vec.2
> num.vec.3
[1] 102 106 110 114 150 74 194
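When the two vectors differ in length, R recycles the shorter one to match the longer one (and warns if the longer length is not a multiple of the shorter). A minimal sketch with illustrative names short_vec and long_vec:

```r
short_vec <- c(10, 20)
long_vec  <- c(1, 2, 3, 4)
# short_vec is recycled as 10, 20, 10, 20
recycled_sum <- long_vec + short_vec
recycled_sum      # 11 22 13 24
```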
Compare a vector with a value
> x
[1] -0.56757455 -0.90079348  0.24397156 -0.51325283  0.03209287
> x >= 0
[1] FALSE FALSE  TRUE FALSE  TRUE
> x.1 <- x >= 0
> x.1
[1] FALSE FALSE  TRUE FALSE  TRUE
Testing if at least one of the vector elements fulfils the predicate (function any)
> x
[1] -0.56757455 -0.90079348  0.24397156 -0.51325283  0.03209287
> any(x > 0)
[1] TRUE
R as a vectorized language (cont.)
Testing if all the vector elements fulfill the predicate (function all)
> all(x > 0)
[1] FALSE
> all(x > -25)
[1] TRUE
For a character vector, display the number of characters for each element
> y
[1] "one"   "two"   "three" "eight"
> nchar(y)
[1] 3 3 5 5
Naming vector elements
Provide a name for each vector element
> num_ro = c (one = "unu", two="doi", three="trei", four="patru")
> num_ro
    one     two   three    four
  "unu"   "doi"  "trei" "patru"
The same result can be accomplished with:
> num_ro = c ("unu", "doi", "trei", "patru")
> num_ro
[1] "unu"   "doi"   "trei"  "patru"
> names(num_ro) = c ("one", "two", "three", "four")
> num_ro
    one     two   three    four
  "unu"   "doi"  "trei" "patru"
Descriptive statistics on vectors
A vector (age) containing the age of 10 persons (Kabacoff, 2011)
> age = c(1,3,5,2,11,9,3,9,12,3)
Another vector containing the weight of the same persons
> weight = c(4.4,5.3,7.2,5.2,8.5,7.3,6.0,10.4,10.2,6.1)
Suppose the weights above were expressed in pounds (US customary units); we convert them from lbs into kg
> weight.kg <- weight * 0.454
Compute the mean of people's weight
> mean(weight)
[1] 7.06
Compute the standard deviation of people's weight
> sd(weight)
[1] 2.077498
Compute correlation between age and weight
> cor(age,weight)
Matrices
Two-dimensional arrays where each element has the same type (numeric, character, or logical)
Created with the matrix function. Format:
> mymatrix <- matrix(vector,
+     nrow=number_of_rows,
+     ncol=number_of_columns, byrow=logical_value,
+     dimnames=list(char_vector_rownames,
+     char_vector_colnames))
vector contains the elements for the matrix
nrow and ncol specify the row and column dimensions
dimnames contains optional row and column labels stored in character vectors.
byrow indicates whether the matrix should be filled in by row (byrow=TRUE) or by column (byrow=FALSE); the default is by column.
Matrices (cont.)
m.1 is a 5 x 4 matrix
> m.1 <- matrix(1:20, nrow=5, ncol=4)
> m.1
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
m.2 is a 2 x 2 matrix, filled by rows
> cells <- c(1,26,24,68)
> rownames <- c("Row1", "Row2")
> colnames <- c("Col1", "Col2")
> m.2 <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
+     dimnames=list(rownames, colnames))
Matrices (cont.)
Display m.2
> m.2
     Col1 Col2
Row1    1   26
Row2   24   68
m.3 is a 2 x 2 matrix, filled by columns
(list is a data structure presented after data frames)
> m.3 <- matrix(cells, nrow=2, ncol=2, byrow=FALSE,
+     dimnames=list(rownames, colnames))
> m.3
     Col1 Col2
Row1    1   24
Row2   26   68
Matrices (cont.)
m.4 is a 4 x 3 matrix, filled by rows
> m.4 <- matrix(1:12, nrow=4, ncol=3, byrow=TRUE)
> m.4
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
Naming rows: row.1, row.2, ... and columns: col.1, col.2, ...
> dimnames(m.4)=list(paste("row.", 1:nrow(m.4), sep=""),
+     paste("col.", 1:ncol(m.4), sep=""))
> m.4
      col.1 col.2 col.3
row.1     1     2     3
row.2     4     5     6
row.3     7     8     9
row.4    10    11    12
Accessing matrix elements
> m.1
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
Display the 3rd row
> m.1[3,]
[1] 3 8 13 18
Display the 3rd column
> m.1[,3]
[1] 11 12 13 14 15
Display the element at the intersection of the 2nd row and the 3rd column
> m.1 [2,3]
[1] 12
Accessing matrix elements (cont.)
Display two elements from the same row: m.1[2,3] and m.1[2,4]
> m.1 [2, c(3,4)]
[1] 12 17
Display three elements from the same column: m.1[1,2], m.1[2,2] and m.1[3,2]
> m.1 [c(1,2, 3), 2]
[1] 6 7 8
Display a "submatrix", from m.1[2,2] to m.1[4,4]
> m.1 [ c(2,3,4), c(2,3,4)]
     [,1] [,2] [,3]
[1,]    7   12   17
[2,]    8   13   18
[3,]    9   14   19
Basic statistics on matrix
> m.4
      col.1 col.2 col.3
row.1     1     2     3
row.2     4     5     6
row.3     7     8     9
row.4    10    11    12
Compute mean of all the cells in matrix m.4
> mean(m.4)
[1] 6.5
Compute mean of all the cells on the third column
> mean(m.4[,3])
[1] 7.5
Compute mean of all the cells on the third row
> mean(m.4[3,])
[1] 8
Basic statistics on matrix (cont.)
Compute sum of all the cells in matrix m.4
> sum(m.4)
[1] 78
Compute sum of all the cells on the third column
> sum(m.4[,3])
[1] 30
Compute sum of all the cells on the third row
> sum(m.4[3,])
[1] 24
rowSums/colSums
rowSums calculates the sum of the cells for each row of a matrix
> rowSums(m.4)
row.1 row.2 row.3 row.4
    6    15    24    33
colSums calculates the sum of the cells for each column of a matrix
> colSums(m.4)
col.1 col.2 col.3
   22    26    30
rowMeans/colMeans
rowMeans and colMeans calculate the mean of every row/column
> rowMeans(m.4)
row.1 row.2 row.3 row.4
    2     5     8    11
> colMeans(m.4)
col.1 col.2 col.3
  5.5   6.5   7.5
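For statistics that have no dedicated row/column function, the general apply() function can be used in the same spirit. A minimal sketch on an illustrative matrix m with the same values as m.4 (the names m, row_max and col_max are assumptions made for this example):

```r
m <- matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE)
row_max <- apply(m, 1, max)   # MARGIN = 1: one value per row
col_max <- apply(m, 2, max)   # MARGIN = 2: one value per column
row_max   # 3 6 9 12
col_max   # 10 11 12
```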
Adding total rows and columns to a matrix
> m.4
      col.1 col.2 col.3
row.1     1     2     3
row.2     4     5     6
row.3     7     8     9
row.4    10    11    12
Add total column
> m.4 <- cbind(m.4, rowSums(m.4))
Setting the name for the total column
> column.names <- colnames(m.4)
> column.names
[1] "col.1" "col.2" "col.3" ""
> column.names[length(column.names)] <- "col.total"
> colnames(m.4) <- column.names
Adding total rows and columns to a matrix (cont.)
Check the operation
> m.4
      col.1 col.2 col.3 col.total
row.1     1     2     3         6
row.2     4     5     6        15
row.3     7     8     9        24
row.4    10    11    12        33
Add total row
> m.4 <- rbind(m.4, colSums(m.4))
Setting the name for the total row
> row.names <- rownames(m.4)
> row.names
[1] "row.1" "row.2" "row.3" "row.4" ""
> row.names[length(row.names)] <- "row.total"
> rownames(m.4) <- row.names
Adding total rows and columns to a matrix (cont.)
Check the operation; notice the names of rows and columns and the content of last row and column
> m.4
          col.1 col.2 col.3 col.total
row.1         1     2     3         6
row.2         4     5     6        15
row.3         7     8     9        24
row.4        10    11    12        33
row.total    22    26    30        78
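The same totals can be obtained more compactly, because cbind() and rbind() use the argument name as the name of the new column or row. This is only an alternative sketch; the objects m and m_tot are illustrative and rebuild the matrix so that the example is self-contained:

```r
m <- matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE,
            dimnames = list(paste0("row.", 1:4), paste0("col.", 1:3)))
# Naming the argument names the new column / row directly
m_tot <- cbind(m, col.total = rowSums(m))
m_tot <- rbind(m_tot, row.total = colSums(m_tot))
m_tot["row.total", "col.total"]   # 78
```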
Arrays
Similar to matrices but can have more than two dimensions
Elements must be of the same type
Created with array function:
> myarray <- array(vector,
+     dimensions, dimnames)
vector contains the data for the array
dimensions is a numeric vector giving the maximal index for each dimension
dimnames - optional list of dimension labels.
Elements in arrays are accessed similarly to those in matrices
Create and access arrays
> dim1 <- c("A1", "A2")
> dim2 <- c("B1", "B2", "B3")
> dim3 <- c("C1", "C2",
+     "C3", "C4")
> a1 <- array(1:24, c(2, 3, 4),
+     dimnames=list(dim1, dim2, dim3))
> a1
, , C1
   B1 B2 B3
A1  1  3  5
A2  2  4  6
, , C2
   B1 B2 B3
A1  7  9 11
A2  8 10 12
, , C3
   B1 B2 B3
A1 13 15 17
A2 14 16 18
, , C4
   B1 B2 B3
A1 19 21 23
A2 20 22 24
Display element [2,2,3]
> a1 [2,2,3]
[1] 16
Create and access arrays (cont.)
Display a matrix from elements of A and B for first row/column of C
> a1 [,,1]
   B1 B2 B3
A1  1  3  5
A2  2  4  6
Display elements of A for the 3rd "row" of B and 2nd row/column of C
> a1 [,3,2]
A1 A2
11 12
Display a subarray containing all elements from first two rows/columns of A, B and C
> a1 [c(1,2),c(1,2),c(1,2)]
, , C1
   B1 B2
A1  1  3
A2  2  4
, , C2
   B1 B2
A1  7  9
A2  8 10
Data Frames
Most important data structure in R (at least for us)
A data frame is a structure in R that holds data and is similar to the datasets found in standard statistical packages (for example, SAS, SPSS, and Stata) and databases
The columns are variables and the rows are observations
Variables can have different types (for example, numeric, character) in the same data frame
Create an empty data frame
> student_gi <- data.frame(studentID = numeric(),
+     name = character(), age = numeric(),
+     scholarship = character(),
+     lab_assessment = character(),
+     final_grade = numeric())
> class(student_gi)
[1] "data.frame"
> str(student_gi)
'data.frame': 0 obs. of 6 variables:
 $ studentID     : num
 $ name          : Factor w/ 0 levels:
 $ age           : num
 $ scholarship   : Factor w/ 0 levels:
 $ lab_assessment: Factor w/ 0 levels:
 $ final_grade   : num
Create a data frame from vectors
Create the vectors
> studentID <- c(1, 2, 3, 4, 5)
> name <- c("Popescu I. Vasile", "Ianos W. Adriana",
+     "Kovacz V. Iosef", "Babadag I. Maria",
+     "Pop P. Ion")
> age <- c(23, 19, 21, 22, 31)
> scholarship <- c("Social", "Studiu1", "Studiu2",
+     "Merit", "Studiu1")
> lab_assessment <- c("Bine", "Foarte bine",
+     "Excelent", "Bine", "Slab")
> final_grade <- c(9, 9.45, 9.75, 9, 6)
Create the data frame using the above vectors
> student_gi <- data.frame(studentID, name, age,
+     scholarship, lab_assessment, final_grade)
Display data frame content
Display data frame (content)
> student_gi
  studentID              name age scholarship lab_assessment final_grade
1         1 Popescu I. Vasile  23      Social           Bine        9.00
2         2  Ianos W. Adriana  19     Studiu1    Foarte bine        9.45
3         3   Kovacz V. Iosef  21     Studiu2       Excelent        9.75
4         4  Babadag I. Maria  22       Merit           Bine        9.00
5         5        Pop P. Ion  31     Studiu1           Slab        6.00
Display one column of the data frame as a vector
> student_gi$name
[1] Popescu I. Vasile Ianos W. Adriana  Kovacz V. Iosef   Babadag I. Maria  Pop P. Ion
Levels: Babadag I. Maria Ianos W. Adriana Kovacz V. Iosef Pop P. Ion Popescu I. Vasile
Display one column of the data frame as a... column
> student_gi["name"]
               name
1 Popescu I. Vasile
2  Ianos W. Adriana
3   Kovacz V. Iosef
4  Babadag I. Maria
5        Pop P. Ion
Display data frame structure
Confirm student_gi is indeed a data frame
> class(student_gi)
[1] "data.frame"
Display structure of the data frame
> str(student_gi)
'data.frame': 5 obs. of 6 variables:
 $ studentID     : num 1 2 3 4 5
 $ name          : Factor w/ 5 levels "Babadag I. Maria",..: 5 2 3 1 4
 $ age           : num 23 19 21 22 31
 $ scholarship   : Factor w/ 4 levels "Merit","Social",..: 2 3 4 1 3
 $ lab_assessment: Factor w/ 4 levels "Bine","Excelent",..: 1 3 2 1 4
 $ final_grade   : num 9 9.45 9.75 9 6
Display type of individual variables within the data frame
> class(student_gi$studentID)
[1] "numeric"
> class(student_gi$name)
[1] "factor"
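The Factor columns in these printouts correspond to R versions before 4.0.0, in which data.frame() converted character vectors to factors by default; since R 4.0.0 the default is stringsAsFactors = FALSE and such columns stay character. A minimal sketch that makes the choice explicit (df_chr and df_fct are illustrative names, and only two students are used to keep it short):

```r
name <- c("Popescu I. Vasile", "Ianos W. Adriana")
age  <- c(23, 19)

# Keep strings as character (the default from R 4.0.0 onwards)
df_chr <- data.frame(name, age, stringsAsFactors = FALSE)
# Convert strings to factors (the default before R 4.0.0)
df_fct <- data.frame(name, age, stringsAsFactors = TRUE)

class(df_chr$name)   # "character"
class(df_fct$name)   # "factor"
```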
Useful functions for displaying some data frame properties
Number of observations (rows)
> nrow(student_gi)
[1] 5
Number of variables (columns)
> ncol(student_gi)
[1] 6
Both the number of observations (rows) and variables (columns)
> dim(student_gi)
[1] 5 6
Display the names of all the variables (columns)
> names(student_gi)
[1] "studentID"      "name"           "age"            "scholarship"
[5] "lab_assessment" "final_grade"
Display the names of the second, third and fourth variable
> names(student_gi[2:4])
Selecting columns
Select/display first two columns (studentID and name)
> student_gi [1:2]
  studentID              name
1         1 Popescu I. Vasile
2         2  Ianos W. Adriana
3         3   Kovacz V. Iosef
4         4  Babadag I. Maria
5         5        Pop P. Ion
or
> student_gi [, 1:2]
or
> student_gi [c("studentID", "name")]
or (see on next slide)
Selecting columns (cont.)
Select/display first two columns (studentID and name), other solutions
> student_gi [, c("studentID", "name")]
Using a vector for storing the names of the first two columns
> cols <- c("studentID", "name")
> student_gi[cols]
or
> student_gi[, names(student_gi) %in% cols]
Return "final_grade" variable (column) as a vector
> student_gi$final_grade
[1] 9.00 9.45 9.75 9.00 6.00
or ... (see on the next slide)
Selecting columns (cont.)
Return "final_grade" variable (column) as a vector (cont.)
> student_gi[ , 6]
or
> student_gi[ , "final_grade"]
Return "final_grade" variable (column) as a one-column data frame
> student_gi[ , "final_grade", drop=FALSE]
  final_grade
1        9.00
2        9.45
3        9.75
4        9.00
5        6.00
Selecting rows
Display first two observations (rows)
> student_gi [1:2,]
  studentID              name age scholarship lab_assessment final_grade
1         1 Popescu I. Vasile  23      Social           Bine        9.00
2         2  Ianos W. Adriana  19     Studiu1    Foarte bine        9.45
Display observations 1, 2 and 5
> student_gi [c(1:2, 5),]
  studentID              name age scholarship lab_assessment final_grade
1         1 Popescu I. Vasile  23      Social           Bine        9.00
2         2  Ianos W. Adriana  19     Studiu1    Foarte bine        9.45
5         5        Pop P. Ion  31     Studiu1           Slab        6.00
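Rows can also be selected by a condition, in the same way vectors are filtered with a logical expression. The sketch below uses a reduced, illustrative version of the data frame (only two columns) so that it runs on its own:

```r
student_gi <- data.frame(
  studentID = 1:5,
  final_grade = c(9, 9.45, 9.75, 9, 6)
)
# Rows where the condition is TRUE, all columns
high_grades <- student_gi[student_gi$final_grade > 9, ]
high_grades
# Same filter with subset(), which evaluates the condition inside the data frame
high_grades_2 <- subset(student_gi, final_grade > 9)
high_grades_2
```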
attach function
attach adds the data frame to the R search path
> search()
 [1] ".GlobalEnv"        "tools:rstudio"
 [3] "package:stats"     "package:graphics"
 [5] "package:grDevices" "package:utils"
 [7] "package:datasets"  "package:methods"
 [9] "Autoloads"         "package:base"
When a variable name is encountered, data frames in the search path are checked in order to locate the variable.
Commands without attach
> student_gi$final_grade
> table (student_gi$lab_assessment, student_gi$final_grade)
> summary(student_gi$final_grade)
attach vs. with
The same commands using attach
> attach(student_gi)
> final_grade
> table (lab_assessment, final_grade)
> summary(final_grade)
> plot(age, final_grade)
detach removes an object from the search path
> detach(student_gi)
It is advisable to use with instead of attach:
> with (student_gi, final_grade)
> with (student_gi, table (lab_assessment, final_grade))
> with (student_gi, plot(lab_assessment, final_grade) )
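Several expressions can be evaluated in a single with() call by grouping them in braces; the value of the last expression is returned. A minimal sketch on a reduced, illustrative copy of the data frame (the name grade_stats is an assumption made for this example):

```r
student_gi <- data.frame(
  age = c(23, 19, 21, 22, 31),
  final_grade = c(9, 9.45, 9.75, 9, 6)
)
# Columns are visible only inside with(), so no detach() is needed afterwards
grade_stats <- with(student_gi, {
  c(mean_grade = mean(final_grade), oldest = max(age))
})
grade_stats   # mean_grade 8.64, oldest 31
```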
Case (row) identifiers
Act like primary/unique keys in relational tables
Can be specified by the row.names option within the data.frame function
We allocate new values for studentID (to avoid confusion with row numbers); the remaining vectors are identical
> studentID <- c(1001, 1002, 1003, 1004, 1005)
> name <- c("Popescu I. Vasile",
+     "Ianos W. Adriana", "Kovacz V. Iosef",
+     "Babadag I. Maria", "Pop P. Ion")
> age <- c(23, 19, 21, 22, 31)
> scholarship <- c("Social", "Studiu1",
+     "Studiu2", "Merit", "Studiu1")
> lab_assessment <- c("Bine", "Foarte bine",
+     "Excelent", "Bine", "Slab")
Case (row) identifiers (cont.)
A (slightly) new version of the data frame:
> student_gi <- data.frame(studentID, name, age,
+     scholarship, lab_assessment,
+     final_grade, row.names = studentID)
studentID is the variable to use in labeling cases on various printouts and graphics produced with R.
Display the names of the rows (observations)
> rownames(student_gi)
[1] "1001" "1002" "1003" "1004" "1005"
Case (row) identifiers (cont.)
Notice the leftmost column of the data frame display
> student_gi
     studentID              name age scholarship lab_assessment final_grade
1001      1001 Popescu I. Vasile  23      Social           Bine        9.00
1002      1002  Ianos W. Adriana  19     Studiu1    Foarte bine        9.45
1003      1003   Kovacz V. Iosef  21     Studiu2       Excelent        9.75
1004      1004  Babadag I. Maria  22       Merit           Bine        9.00
1005      1005        Pop P. Ion  31     Studiu1           Slab        6.00
Case (row) identifiers (cont.)
Display the observation (row) corresponding to student Ianos W. Adriana using her case identifier ("1002")
> student_gi["1002",]
     studentID             name age scholarship lab_assessment final_grade
1002      1002 Ianos W. Adriana  19     Studiu1    Foarte bine        9.45
Display the observations corresponding to students Ianos W. Adriana and Pop P. Ion using their case identifiers ("1002" and "1005")
> student_gi[c("1002", "1005"),]
     studentID             name age scholarship lab_assessment final_grade
1002      1002 Ianos W. Adriana  19     Studiu1    Foarte bine        9.45
1005      1005       Pop P. Ion  31     Studiu1           Slab        6.00
Factors (reprise)
In presentation 02a, variables were described as nominal, ordinal, interval, and ratio
Nominal variables are categorical, without an implied order. Examples: MaritalStatus, Sex, Job, MasterProgramme
Ordinal variables imply order but not amount. Examples: Status (poor, improved, excellent), LabAssessment (slab, bine, foarteBine, excelent)
Interval and Ratio variables can take on any value within some range, and both order and amount are implied. Examples: LitersPer100Km, Height, Weight, FinalGrade (with decimals)
In R, categorical (nominal) and ordered categorical (ordinal) variables are called factors.
Function factor
Factors determine how data will be analyzed and presented visually
The function factor() stores the categorical values as a vector of integers in the range [1...k] (where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) mapped to these integers
Initially vector scholarship is a nominal variable stored as characters
> scholarship <- c("Social", "Studiu1", "Studiu2",
+     "Merit", "Studiu1")
Now it will be converted into a factor:
> scholarship_f <- factor(scholarship)
> scholarship_f
[1] Social  Studiu1 Studiu2 Merit   Studiu1
Levels: Merit Social Studiu1 Studiu2
Ordered factors
An ordinal variable
> lab_assessment <- c("Bine", "Foarte bine",
+     "Excelent", "Bine", "Slab")
Notice the way of displaying
> lab_assessment
[1] "Bine"        "Foarte bine" "Excelent"    "Bine"
[5] "Slab"
Now declare the vector as an ordered factor
> lab_assessment <- factor(lab_assessment,
+     ordered=TRUE, levels=c("Slab", "Bine",
+     "Foarte bine", "Excelent"))
Notice the new way of displaying the vector
> lab_assessment
[1] Bine        Foarte bine Excelent    Bine        Slab
Levels: Slab < Bine < Foarte bine < Excelent
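Because the levels of an ordered factor carry an order, comparison operators and functions such as max() become meaningful. A minimal, self-contained sketch (it rebuilds the factor; the names above_bine and best are illustrative):

```r
lab_assessment <- factor(
  c("Bine", "Foarte bine", "Excelent", "Bine", "Slab"),
  ordered = TRUE,
  levels = c("Slab", "Bine", "Foarte bine", "Excelent")
)
# Which assessments are strictly better than "Bine"?
above_bine <- lab_assessment > "Bine"
above_bine                             # FALSE TRUE TRUE FALSE FALSE
# Highest level present in the data
best <- as.character(max(lab_assessment))
best                                   # "Excelent"
```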
Factors in data frames
Re-create the data frame using factors
> studentID <- c(1001, 1002, 1003, 1004, 1005)
> name <- c("Popescu I. Vasile", "Ianos W. Adriana",
+     "Kovacz V. Iosef", "Babadag I. Maria",
+     "Pop P. Ion")
> age <- c(23, 19, 21, 22, 31)
> scholarship <- c("Social", "Studiu1", "Studiu2",
+     "Merit", "Studiu1")
> scholarship <- factor(scholarship)
> lab_assessment <- c("Bine", "Foarte bine",
+     "Excelent", "Bine", "Slab")
> lab_assessment <- factor(lab_assessment,
+     ordered=TRUE, levels=c("Slab", "Bine",
+     "Foarte bine", "Excelent"))
> final_grade <- c(9, 9.45, 9.75, 9, 6)
Factors in data frames (cont.)
Another version of the data frame
> student_gi <- data.frame(name, age, scholarship,
+     lab_assessment, final_grade,
+     row.names = studentID)
Display the structure of the data frame
> str(student_gi)
'data.frame': 5 obs. of 5 variables:
 $ name          : Factor w/ 5 levels "Babadag I. Maria",..: 5 2 3 1 4
 $ age           : num 23 19 21 22 31
 $ scholarship   : Factor w/ 4 levels "Merit","Social",..: 2 3 4 1 3
 $ lab_assessment: Ord.factor w/ 4 levels "Slab"<"Bine"<..: 2 3 4 2 1
 $ final_grade   : num 9 9.45 9.75 9 6
Factors in data frames (cont.)
Basic statistics about variables in data frame
> summary(student_gi)
                name        age        scholarship     lab_assessment  final_grade
 Babadag I. Maria :1   Min.   :19.0   Merit  :1    Slab       :1     Min.   :6.00
 Ianos W. Adriana :1   1st Qu.:21.0   Social :1    Bine       :2     1st Qu.:9.00
 Kovacz V. Iosef  :1   Median :22.0   Studiu1:2    Foarte bine:1     Median :9.00
 Pop P. Ion       :1   Mean   :23.2   Studiu2:1    Excelent   :1     Mean   :8.64
 Popescu I. Vasile:1   3rd Qu.:23.0                                  3rd Qu.:9.45
                       Max.   :31.0                                  Max.   :9.75
Factors and value labels
> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent",
+     "Poor")
> diabetes <- factor(diabetes)
> status <- factor(status, ordered=TRUE)
> gender <- c(1, 2, 2, 1)
> patientdata <- data.frame(patientID, age,
+     diabetes, status, gender)
For variable gender (coded 1 for males and 2 for females) the value labels are declared with options levels (indicating the values) and labels (indicating the labels):
> patientdata$gender <-
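The assignment above is cut off in the source. As a hedged sketch of how such a declaration could look, assuming code 1 is labelled "male" and code 2 "female" (consistent with the printout on the following slide), the levels and labels arguments of factor() can be combined as follows; the reduced patientdata built here is only for illustration:

```r
patientdata <- data.frame(patientID = c(1, 2, 3, 4), gender = c(1, 2, 2, 1))
# Assumed completion: map code 1 to "male" and code 2 to "female"
patientdata$gender <- factor(patientdata$gender,
                             levels = c(1, 2),
                             labels = c("male", "female"))
patientdata$gender   # male female female male
```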
Factors and value labels (cont.)
For gender, labels (instead of values) are displayed
> patientdata
  patientID age diabetes    status gender
1         1  25    Type1      Poor   male
2         2  34    Type2  Improved female
3         3  28    Type1 Excellent female
4         4  52    Type1      Poor   male
Data frame structure (see information about gender):
> str(patientdata)
'data.frame': 4 obs. of 5 variables:
 $ patientID: num 1 2 3 4
 $ age      : num 25 34 28 52
 $ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 1
 $ status   : Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 2 1 3
 $ gender   : Factor w/ 2 levels "male","female": 1 2 2 1
Lists
Lists are the most complex of the R data types
A list is an ordered collection of objects (components).
A list allows gathering a large variety of (possibly unrelated) objects under one name.
A list can contain a combination of vectors, matrices, data frames, and even other lists
Created using list() function:
mylist <- list(object1, object2, ...)
where the objects are any of the structures seen so far
Optionally, the objects in a list can be named:
mylist <- list(name1=object1,
    name2=object2, ...)
First example of list: POSIXlt variables
Variable t gets the current system timestamp:
> t = Sys.time()
POSIXlt objects are actually lists
> l.1 <- as.POSIXlt(t)
> l.1
[1] "2014-09-25 08:37:24 EEST"
> typeof(l.1)
[1] "list"
> names(l.1)
NULL
> unclass(l.1)
$sec
[1] 24.19267
$min
[1] 37
$hour
[1] 8
$mday
[1] 25
...
First example of list: POSIXlt variables (cont.)
Extract list components values (seconds, minutes, hours, ...), equivalent to l.1$sec, l.1$min ...:
> l.1[[1]]
[1] 24.19267
> l.1[[2]]
[1] 37
> l.1[[3]]
[1] 8
> l.1[[4]]
[1] 25
...
Display (horizontally) components of the timestamp object
> unlist(l.1)
      sec       min      hour      mday       mon      year      wday      yday     isdst
 24.19267  37.00000   8.00000  25.00000   8.00000 114.00000       ...
Matrices and lists
Matrix dimension names (dimnames) object is a list
> m.3 <- matrix(cells, nrow=2, ncol=2,
+     byrow=FALSE,
+     dimnames=list(rownames, colnames))
> m.3
     Col1 Col2
Row1    1   24
Row2   26   68
> dimnames(m.3)
[[1]]
[1] "Row1" "Row2"
[[2]]
[1] "Col1" "Col2"
> unlist(dimnames(m.3))
[1] "Row1" "Row2" "Col1" "Col2"
Creating and displaying simple lists
Create two simple lists
> list.1 = list ("unu", "doi", "trei")
> list.2 = list( c("doi", "trei", "patru"))
Visualizing lists
> list.1
[[1]]
[1] "unu"
[[2]]
[1] "doi"
[[3]]
[1] "trei"
> list.2
[[1]]
[1] "doi"   "trei"  "patru"
Create a more complex list
list.3 contains two previous lists, a vector (sequence) and a data frame:
> list.3 = list (list.1, list.2, 3:7, patientdata)
> list.3
[[1]]
[[1]][[1]]
[1] "unu"
[[1]][[2]]
[1] "doi"
[[1]][[3]]
[1] "trei"
[[2]]
[[2]][[1]]
[1] "doi"   "trei"  "patru"
[[3]]
[1] 3 4 5 6 7
[[4]]
  patientID age diabetes    status gender
1         1  25    Type1      Poor   male
2         2  34    Type2  Improved female
3         3  28    Type1 Excellent female
4         4  52    Type1      Poor   male
Create a more complex list (cont.)
Display the structure of list.3:
> str(list.3)
List of 4
 $ :List of 3
  ..$ : chr "unu"
  ..$ : chr "doi"
  ..$ : chr "trei"
 $ :List of 1
  ..$ : chr [1:3] "doi" "trei" "patru"
 $ : int [1:5] 3 4 5 6 7
 $ :'data.frame': 4 obs. of 5 variables:
  ..$ patientID: num [1:4] 1 2 3 4
  ..$ age      : num [1:4] 25 34 28 52
  ..$ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 1
  ..$ status   : Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 2 1 3
  ..$ gender   : Factor w/ 2 levels "male","female": 1 2 2 1
Accessing list components
Display the number of objects in a list
> length(list.3)
[1] 4
Access the first object of the list
> list.3[[1]]
[[1]]
[1] "unu"
[[2]]
[1] "doi"
[[3]]
[1] "trei"
> class(list.3[[1]])
[1] "list"
Accessing list components (cont)
Access the second component of the list
> list.3[[2]]
[[1]]
[1] "doi"   "trei"  "patru"
> class(list.3[[2]])
[1] "list"
... and the fourth component
> list.3[[4]]
  patientID age diabetes    status gender
1         1  25    Type1      Poor   male
2         2  34    Type2  Improved female
3         3  28    Type1 Excellent female
4         4  52    Type1      Poor   male
> class(list.3[[4]])
[1] "data.frame"
List component attributes/names
Function names displays the names of the designated components of a list
The first object of list.3 is a list whose components have no names:
> names(list.3[[1]])
NULL
The fourth object of list.3 is a data frame called patientdata; this data frame has five variables (columns) whose names can be displayed with function names:
> names(list.3[[4]])
[1] "patientID" "age"       "diabetes"  "status"    "gender"
Accessing components within components
Display the third object within the first component in list.3
> list.3[[1]][[3]]
[1] "trei"
Display, in the data frame patientdata (the data frame is the 4th component of the list), the values of column age (this column is the 2nd of the data frame)
> list.3[[4]][, 2]
[1] 25 34 28 52
or
> list.3[[4]][, "age"]
Display age as a column (not a vector)
> list.3[[4]][, "age", drop=FALSE]
  age
1  25
2  34
3  28
4  52
Display age of the third patient
> list.3[[4]][, 2][3]
> list.3[[4]][, "age", drop=FALSE]$age[3]
[1] 28
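The chained indexing above can be tried without the original script. In the sketch below, patientdata is rebuilt from the values printed on the slides, so its construction is an assumption about how the original data frame was created. The comparison of [ ] and [[ ]] is an addition to the slide examples.

```r
# Data frame rebuilt from the values printed on the slides
patientdata <- data.frame(
  patientID = c(1, 2, 3, 4),
  age       = c(25, 34, 28, 52),
  diabetes  = factor(c("Type1", "Type2", "Type1", "Type1")),
  status    = factor(c("Poor", "Improved", "Excellent", "Poor"),
                     levels = c("Excellent", "Improved", "Poor"),
                     ordered = TRUE),
  gender    = factor(c("male", "female", "female", "male"),
                     levels = c("male", "female"))
)
list.1 <- list("unu", "doi", "trei")
list.2 <- list(c("doi", "trei", "patru"))
list.3 <- list(list.1, list.2, 3:7, patientdata)

# [[ ]] returns the component itself; [ ] returns a shorter list
print(class(list.3[[3]]))   # the integer vector
print(class(list.3[3]))     # a list of length 1

# Chained indexing: component 4, then column age, then element 3
age_3a <- list.3[[4]][, "age"][3]
age_3b <- list.3[[4]]$age[3]
age_3c <- list.3[[4]][["age"]][3]
print(c(age_3a, age_3b, age_3c))
```

All three expressions reach the same value; the $ form is usually the shortest when the column name is known in advance.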
Tables in R
Not a full-fledged data structure, but a sort of labeled (named) array
Some functions (e.g. graphic functions, categorical data analysis functions) accept only tables as arguments
More about tables in script 06c
Two main types of tables:
tables of frequencies, which count the number of occurrences of each value of a (usually) categorical variable
tables of proportions, which divide the number of occurrences of each value by the total number of occurrences of a (usually) categorical variable
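The two types of tables can be sketched with base R functions table() and prop.table(). The vector scholarship below is hypothetical; its labels are borrowed from the examples that follow.

```r
# Hypothetical categorical variable
scholarship <- c("Merit", "Social", "Studiu1", "Studiu1", "Studiu2")

freq <- table(scholarship)   # table of frequencies (counts)
prop <- prop.table(freq)     # table of proportions (counts / total)

print(freq)
print(prop)
print(sum(prop))             # the proportions add up to 1
```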
Uni-dimensional tables
Create a table with frequencies of scholarship in data frame student_gi
> table.1 <- with(student_gi, table(scholarship))
> table.1
scholarship
  Merit  Social Studiu1 Studiu2
      1       1       2       1
Display structure of table.1
> str(table.1)
 'table' int [1:4(1d)] 1 1 2 1
 - attr(*, "dimnames")=List of 1
  ..$ scholarship: chr [1:4] "Merit" "Social" "Studiu1" "Studiu2"
> class(table.1)
[1] "table"
Unidimensional tables are vectors with labeled elements (each element's label is a value of the attribute used in function table)
> names(table.1)
[1] "Merit"   "Social"  "Studiu1" "Studiu2"
Access/display uni-dimensional tables
table.1 is not a data frame, so we cannot qualify the variable using $...
> table.1$Merit
Error in table.1$Merit : $ operator is invalid for atomic vectors
... but we can access with vector indices
> table.1[1]
Merit
    1
... or list indices
> table.1[[1]]
[1] 1
Display both the label and the value of the 3rd element in table.1:
> table.1[3]
Studiu1
      2
... or
> unlist(table.1)[3]
Studiu1
      2
Access/display uni-dimensional tables (cont.)
Display only the label of the 3rd element of table.1:
> names(table.1)[3]
[1] "Studiu1"
Display only the value of the 3rd element in table.1:
> unlist(table.1)[[3]]
[1] 2
Display both the name and the value of the 3rd element, by its name:
> table.1["Studiu1"]
Studiu1
      2
Display both names and values of two elements by their names:
> table.1[c("Merit", "Studiu1")]
scholarship
  Merit Studiu1
      1       2
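The access patterns for uni-dimensional tables can be collected in one runnable sketch. The vector scholarship is a hypothetical stand-in for student_gi$scholarship, with counts matching the output shown above.

```r
# Hypothetical stand-in for student_gi$scholarship
scholarship <- c("Merit", "Social", "Studiu1", "Studiu1", "Studiu2")
tab <- table(scholarship)

by_position <- tab[3]           # label and value
by_name     <- tab["Studiu1"]   # same element, selected by its label
value_only  <- tab[[3]]         # bare value
label_only  <- names(tab)[3]    # bare label

print(by_position)
print(value_only)
print(label_only)

# as.vector() drops the labels of the whole table
print(as.vector(tab))
```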
Bi-dimensional tables
Similar to pivot tables in Excel
Create a contingency (pivot) table with frequencies of scholarship by lab_assessment
> table.2 <- with(student_gi, table(scholarship, lab_assessment))
> table.2
           lab_assessment
scholarship Slab Bine Foarte bine Excelent
    Merit      0    1           0        0
    Social     0    1           0        0
    Studiu1    1    0           1        0
    Studiu2    0    0           0        1
Structure of table.2
> str(table.2)
 'table' int [1:4, 1:4] 0 0 1 0 1 1 0 0 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ scholarship   : chr [1:4] "Merit" "Social" "Studiu1" "Studiu2"
  ..$ lab_assessment: chr [1:4] "Slab" "Bine" "Foarte bine" "Excelent"
> class(table.2)
[1] "table"
Accessing bi-dimensional tables
Any cell can be accessed using indices of row and column...
> table.2[1,2]
[1] 1
... or the names/labels
> table.2["Merit", "Bine"]
[1] 1
Display the second column (associated with value Bine of lab_assessment) as a vector using the index (2)...
> table.2[, 2]
  Merit  Social Studiu1 Studiu2
      1       1       0       0
... or the name of the column (Bine)
> table.2[, "Bine"]
  Merit  Social Studiu1 Studiu2
      1       1       0       0
Accessing bi-dimensional tables (cont.)
Similarly, one can access individual (or groups of) rows
Access particular rows and columns in a table
> table.2[c("Merit", "Studiu1"), c("Slab", "Excelent")]
           lab_assessment
scholarship Slab Excelent
    Merit      0        0
    Studiu1    1        0
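A sketch of the same accesses on a small data frame follows. The data frame below is a hypothetical reconstruction of student_gi, inferred from the tables printed on the slides; the real data frame has more columns.

```r
# Hypothetical reconstruction of student_gi (only the two columns needed here)
student_gi <- data.frame(
  scholarship    = c("Merit", "Social", "Studiu1", "Studiu1", "Studiu2"),
  lab_assessment = factor(c("Bine", "Bine", "Slab", "Foarte bine", "Excelent"),
                          levels = c("Slab", "Bine", "Foarte bine", "Excelent"))
)
tab2 <- with(student_gi, table(scholarship, lab_assessment))

cell_by_index <- tab2[1, 2]
cell_by_name  <- tab2["Merit", "Bine"]
one_row       <- tab2["Studiu1", ]   # a whole row, as a named vector
block         <- tab2[c("Merit", "Studiu1"), c("Slab", "Excelent")]

print(tab2)
print(one_row)
print(block)
```

Selecting by label is usually more robust than selecting by position, because the order of the levels can change when the data changes.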
Tri-dimensional tables
Create a three-dimensional table with frequencies of scholarship by lab_assessment by final_grade
> table.3 <- with(student_gi, table(scholarship, lab_assessment,
+    final_grade))
Display table.3
> table.3
, , final_grade = 6
           lab_assessment
scholarship Slab Bine Foarte bine Excelent
    Merit      0    0           0        0
    Social     0    0           0        0
    Studiu1    1    0           0        0
    Studiu2    0    0           0        0
, , final_grade = 9
           lab_assessment
scholarship Slab Bine Foarte bine Excelent
    Merit      0    1           0        0
    Social     0    1           0        0
    Studiu1    0    0           0        0
    Studiu2    0    0           0        0
Tri-dimensional tables (cont.)
Display table.3 (cont.)
, , final_grade = 9.45
           lab_assessment
scholarship Slab Bine Foarte bine Excelent
    Merit      0    0           0        0
    Social     0    0           0        0
    Studiu1    0    0           1        0
    Studiu2    0    0           0        0
, , final_grade = 9.75
           lab_assessment
scholarship Slab Bine Foarte bine Excelent
    Merit      0    0           0        0
    Social     0    0           0        0
    Studiu1    0    0           0        0
    Studiu2    0    0           0        1
ftable
ftable improves the display of three-dimensional tables
> ftable(table.3)
                           final_grade 6 9 9.45 9.75
scholarship lab_assessment
Merit       Slab                       0 0    0    0
            Bine                       0 1    0    0
            Foarte bine                0 0    0    0
            Excelent                   0 0    0    0
Social      Slab                       0 0    0    0
            Bine                       0 1    0    0
            Foarte bine                0 0    0    0
            Excelent                   0 0    0    0
Studiu1     Slab                       1 0    0    0
            Bine                       0 0    0    0
            Foarte bine                0 0    1    0
            Excelent                   0 0    0    0
Studiu2     Slab                       0 0    0    0
            Bine                       0 0    0    0
            Foarte bine                0 0    0    0
            Excelent                   0 0    0    1
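The same flattening can be tried on a table that ships with R. The sketch below uses HairEyeColor (package datasets) instead of table.3, which is an assumption made only so that the example runs without student_gi; the row.vars argument is an addition to the slide example.

```r
# HairEyeColor is a 4 x 4 x 2 table of counts: Hair, Eye, Sex
flat <- ftable(HairEyeColor)   # default: all variables but the last label the rows

# row.vars chooses which variables label the rows
flat_by_sex <- ftable(HairEyeColor, row.vars = "Sex")

print(flat)
print(flat_by_sex)
print(dim(flat))               # the flat table is two-dimensional
```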
Accessing three-dimensional tables
Any cell can be accessed using indices of the three axes...
> table.3[3, 3, 3]
[1] 1
... or the names/labels
> table.3["Studiu2", "Excelent", "9.75"]
[1] 1
Display, as a one-dimensional table, the values of lab_assessment which correspond to value Studiu2 (4th) of scholarship and value 9.75 (4th) of final_grade
one can use the indexes ...
> table.3[4, , 4]
       Slab        Bine Foarte bine    Excelent
          0           0           0           1
... or the labels/names
> table.3[ "Studiu2", , "9.75" ]
       Slab        Bine Foarte bine    Excelent
          0           0           0           1
Accessing three-dimensional tables (cont.)
Display, as a bi-dimensional table, the values of the first (scholarship) and the third (final_grade) axes associated with the 4th value (Excelent) of the second axis (lab_assessment)
one can use the index...
> table.3[, 4, ]
           final_grade
scholarship 6 9 9.45 9.75
    Merit   0 0    0    0
    Social  0 0    0    0
    Studiu1 0 0    0    0
    Studiu2 0 0    0    1
... or the label/name
> table.3[, "Excelent", ]
           final_grade
scholarship 6 9 9.45 9.75
    Merit   0 0    0    0
    Social  0 0    0    0
    Studiu1 0 0    0    0
    Studiu2 0 0    0    1
Accessing three-dimensional tables (cont.)
One can access particular ranges on each axis
> table.3[c("Merit", "Studiu1"), c("Slab",
+    "Excelent"), c("9.45", "9.75") ]
, , final_grade = 9.45
           lab_assessment
scholarship Slab Excelent
    Merit      0        0
    Studiu1    0        0
, , final_grade = 9.75
           lab_assessment
scholarship Slab Excelent
    Merit      0        0
    Studiu1    0        0
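Indexing on three axes can also be practised on a table that is always available. The sketch below uses HairEyeColor (package datasets) as a stand-in for table.3; the variable names are hypothetical.

```r
# HairEyeColor is a 4 x 4 x 2 table: Hair, Eye, Sex
one_cell  <- HairEyeColor["Black", "Brown", "Male"]   # a single cell, by labels
same_cell <- HairEyeColor[1, 1, 1]                    # the same cell, by indices

eye_slice <- HairEyeColor["Black", , "Male"]          # one free axis: a named vector
hair_sex  <- HairEyeColor[, "Blue", ]                 # two free axes: Hair x Sex
sub_cube  <- HairEyeColor[c("Black", "Blond"), c("Brown", "Blue"), ]

print(one_cell)
print(eye_slice)
print(hair_sex)
print(dim(sub_cube))
```

Leaving an axis empty between the commas keeps every value on that axis, so the number of empty positions decides how many dimensions the result has.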
Built-in datasets
Some datasets are available in base (core) R (e.g. faithful)
> head(faithful, 3)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
Most data sets are available in packages (e.g. ggplot2, vcd, ...)
In most cases, data sets are stored as data frames, e.g. the dataset movies from package ggplot2 (since ggplot2 2.0.0 this dataset is distributed in the separate package ggplot2movies)
Every package must be installed (once per computer)
> install.packages("ggplot2")
After installation, a package must be loaded (once for every R session)
> library(ggplot2)
Built-in datasets (cont.)
Display the structure of dataset movies
> str(movies)
'data.frame': 58788 obs. of 24 variables:
 $ title : chr "$" "$1000 a Touchdown" "$21 a Day Once a Month" "$40,000" ...
 $ year  : int 1971 1939 1941 1996 1975 2000 2002 2002 1987 1917 ...
 $ length: int 121 71 7 70 71 91 93 25 97 61 ...
 $ budget: int NA NA NA NA NA NA NA NA NA NA ...
 $ rating: num 6.4 6 8.2 8.2 3.4 4.3 5.3 6.7 6.6 6 ...
 $ votes : int 348 20 5 6 17 45 200 24 18 51 ...
 $ r1    : num 4.5 0 0 14.5 24.5 4.5 4.5 4.5 4.5 4.5 ...
 $ r2    : num 4.5 14.5 0 0 4.5 4.5 0 4.5 4.5 0 ...
 $ r3    : num 4.5 4.5 0 0 0 4.5 4.5 4.5 4.5 4.5 ...
...
Built-in dataset stored as table
Data set HairEyeColor, used in the vcd tutorial (http://cran.us.r-project.org/web/packages/vcdExtra/vignettes/vcdtutorial.pdf), is stored as a three-dimensional table; it ships with base R (package datasets), so it is available even without vcd
> install.packages("vcd")
> library(vcd)
> head(HairEyeColor)
[1] 32 53 10 3 11 50
> str(HairEyeColor)
 table [1:4, 1:4, 1:2] 32 53 10 3 11 50 10 30 10 25 ...
 - attr(*, "dimnames")=List of 3
  ..$ Hair: chr [1:4] "Black" "Brown" "Red" "Blond"
  ..$ Eye : chr [1:4] "Brown" "Blue" "Hazel" "Green"
  ..$ Sex : chr [1:2] "Male" "Female"
> class(HairEyeColor)
[1] "table"
Package datasets
R has a special package called datasets
> library(datasets)
Function data lists the datasets of the currently loaded packages (package datasets included)
> data()
Visualize all the data sets available in all installed packages:
> data(package = .packages(all.available = TRUE))
Display the datasets available in package ggplot2
> try(data(package = "ggplot2") )
... or
> data(package = "ggplot2")$results
A list (made in 2012) of all datasets in R is available at
http://www.public.iastate.edu/~hofmann/data_in_r_sor
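The $results component returned by data() is a character matrix, so it can be filtered like any other matrix. The sketch below queries package datasets rather than ggplot2, an assumption made so that it runs on any R installation.

```r
# One row per dataset; columns include Package, Item and Title
info <- data(package = "datasets")$results

print(colnames(info))
print(head(info[, c("Item", "Title")], 3))

# Check whether a given dataset is listed
print("faithful" %in% info[, "Item"])
```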
Data structures conversion
Not all conversions from an object (of a data type) into another object (of another data type) are possible
Generally, function as.data.frame converts an object of another data type (vector, matrix, table, list) into a data frame
Ex: convert a vector into a data frame
> a_vector
[1]  2  3  4  8  9 10 11 12 13 14
> v_to_df.1 <- as.data.frame(a_vector)
> v_to_df.1
  a_vector
1        2
2        3
3        4
...
Data structures conversion (cont.)
Convert matrix m.4 into a data frame
> m_to_df.1 <- as.data.frame(m.4)
> m_to_df.1
          col.1 col.2 col.3 col.total
row.1         1     2     3         6
row.2         4     5     6        15
row.3         7     8     9        24
row.4        10    11    12        33
row.total    22    26    30        78
> str(m_to_df.1)
'data.frame': 5 obs. of 4 variables:
 $ col.1    : num 1 4 7 10 22
 $ col.2    : num 2 5 8 11 26
 $ col.3    : num 3 6 9 12 30
 $ col.total: num 6 15 24 33 78
Data structures conversion (cont.)
Convert a table into a data frame
> table_to_dataframe =
+    data.frame(unlist(HairEyeColor))
> head(table_to_dataframe, 3)
   Hair   Eye  Sex Freq
1 Black Brown Male   32
2 Brown Brown Male   53
3   Red Brown Male   10
Convert a list into a data frame
> df <- data.frame(matrix(unlist(list.1), nrow=132,
+    byrow=T))
> head(df,3)
  matrix.unlist.list.1...nrow...132..byrow...T.
1                                           unu
2                                           doi
3                                          trei
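The four conversions can be compared side by side in a short sketch. The input objects below are hypothetical, and the list conversion is a simpler variant than the one on the slide: it names the column explicitly (the name value is an arbitrary choice) and lets the number of rows follow the length of the list.

```r
# Hypothetical inputs, for illustration only
a_vector <- c(2, 3, 4, 8)
a_matrix <- matrix(1:6, nrow = 2,
                   dimnames = list(c("row.1", "row.2"),
                                   c("col.1", "col.2", "col.3")))
a_list   <- list("unu", "doi", "trei")

v_df <- as.data.frame(a_vector)       # one column, named after the vector
m_df <- as.data.frame(a_matrix)       # dimnames become row and column names
t_df <- as.data.frame(HairEyeColor)   # one row per cell, counts in column Freq

# A flat list of scalars: unlist() first, then build a one-column data frame
l_df <- data.frame(value = unlist(a_list), stringsAsFactors = FALSE)

str(v_df)
str(m_df)
print(head(t_df, 3))
print(l_df)
```

For a table, as.data.frame() produces the long format (one row per combination of labels), which is the layout most modelling and plotting functions expect.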