Data Preprocessing

Uploaded by

bauuaverma2002
Data Preprocessing

Managing Data with R


• One of the challenges faced while working
with massive datasets involves gathering,
preparing, and otherwise managing data from
a variety of sources.
Saving, loading, and removing R data structures
• To save a data structure to a file that can be
reloaded later or transferred to another system,
use the save() function.
• The save() function writes one or more R data
structures to the location specified by the file
parameter.
• Suppose you have three objects named x, y, and
z that you would like to save in a permanent file:
> save(x, y, z, file = "mydata.RData")
• The load() command can recreate any data
structures that have been saved to an .RData
file. To load the mydata.RData file we saved in
the preceding code, simply type:
> load("mydata.RData")
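A minimal sketch of the save/load round trip described above. The object values and the temporary file path are assumptions made for illustration; only the names x, y, and z come from the slide.

```r
# Illustrative objects; the values are invented for this example
x <- 1:3
y <- c("a", "b")
z <- list(flag = TRUE)

# Write to a temporary file so the working directory is left untouched
path <- tempfile(fileext = ".RData")
save(x, y, z, file = path)

# Remove the objects, then recreate them from the file
rm(x, y, z)
restored <- load(path)   # load() invisibly returns the names it recreated
print(restored)
```

Capturing the return value of load() is optional, but it is a quick way to see which objects a file brought back into the session.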
• After working on an R session for some time,
you may have accumulated a number of data
structures.
• The ls() listing function returns a character vector
naming all the data structures currently in memory.
> ls()
[1] "blood" "flu_status" "gender" "m"
[5] "pt_data" "subject_name" "subject1" "symptoms"
[9] "temperature"
• R will automatically remove these from its memory upon
quitting the session, but for large data structures, you may
want to free up the memory sooner.
• The rm() remove function can be used for this purpose. For
example, to eliminate the m and subject1 objects, simply
type:
> rm(m, subject1)
• The rm() function can also be supplied with a character vector
of the object names to be removed. This works with the ls()
function to clear the entire R session:
> rm(list=ls())
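The following sketch shows the same ls() and rm() calls. It works inside a separate scratch environment, which is an assumption made here so that running the example does not wipe a real session. The object values are invented.

```r
# A scratch environment stands in for the R session (assumption, for safety)
scratch <- new.env()
assign("m", matrix(1:4, nrow = 2), envir = scratch)
assign("subject1", list(name = "John Doe"), envir = scratch)
assign("temperature", c(98.1, 98.6, 101.4), envir = scratch)

before <- ls(envir = scratch)                    # sorted vector of object names
rm(list = c("m", "subject1"), envir = scratch)   # remove two objects by name
after_partial <- ls(envir = scratch)

# Same idiom as rm(list = ls()), restricted to the scratch environment
rm(list = ls(envir = scratch), envir = scratch)
after_all <- ls(envir = scratch)

print(before)
print(after_partial)
print(after_all)
```

At the console, the envir argument is simply omitted and both functions act on the global environment.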
Importing and saving data from CSV files
• The most common tabular text file format is the
CSV (Comma-Separated Values) file, which as the
name suggests, uses the comma as a delimiter.
• CSV files can be imported into and exported
from many common applications. A CSV file
representing the medical dataset constructed
previously could be stored as:
subject_name,temperature,flu_status,gender,blood_type
John Doe,98.1,FALSE,MALE,O
Jane Doe,98.6,FALSE,FEMALE,AB
Steve Graves,101.4,TRUE,MALE,A
• Given a patient data file named pt_data.csv
located in the R working directory, the read.csv()
function can be used as follows to load the file
into R:
> pt_data <- read.csv("pt_data.csv",
stringsAsFactors = FALSE)
• By default, R assumes that the CSV file includes a
header line listing the names of the features in
the dataset.
• If a CSV file does not have a header, specify the
option header = FALSE, as shown in the following
command, and R will assign default feature names
of the form V1, V2, and so on:
> mydata <- read.csv("mydata.csv",
stringsAsFactors = FALSE, header = FALSE)
• To save a data frame to a CSV file, use the
write.csv() function. If your data frame is
named pt_data, simply enter:
> write.csv(pt_data, file = "pt_data.csv",
row.names = FALSE)
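A short sketch of a CSV round trip using the three patients listed earlier. The temporary file path is an assumption made so that the example runs anywhere; the slides use pt_data.csv in the working directory.

```r
# Rebuild the small medical dataset shown above
pt_data <- data.frame(
  subject_name = c("John Doe", "Jane Doe", "Steve Graves"),
  temperature  = c(98.1, 98.6, 101.4),
  flu_status   = c(FALSE, FALSE, TRUE),
  gender       = c("MALE", "FEMALE", "MALE"),
  blood_type   = c("O", "AB", "A"),
  stringsAsFactors = FALSE
)

# Write without row names, then read the file back
csv_path <- tempfile(fileext = ".csv")
write.csv(pt_data, file = csv_path, row.names = FALSE)
pt_copy <- read.csv(csv_path, stringsAsFactors = FALSE)
str(pt_copy)

# Reading the same file with header = FALSE treats the header line as data
# and assigns the default names V1, V2, ...
no_header <- read.csv(csv_path, header = FALSE, stringsAsFactors = FALSE)
print(names(no_header))
```

The second read is only there to show the default naming; for a file that really has a header line, header = FALSE would be the wrong choice.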
Exploring and understanding data
• After collecting data and loading it into R's
data structures, the next step in the machine
learning process involves examining the data
in detail.
• We will explore the usedcars.csv dataset,
which contains actual data about used cars.
• Since the dataset is stored in the CSV form, we
can use the read.csv() function to load the
data into an R data frame:
> usedcars <- read.csv("usedcars.csv",
stringsAsFactors = FALSE)
Exploring the structure of data
• One of the first questions to ask is how the
dataset is organized.
• The str() function provides a method to display
the structure of R data structures such as data
frames, vectors, or lists. It can be used to create
the basic outline for our data dictionary:
> str(usedcars)
• Using such a simple command, we learn a wealth
of information about the dataset.
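Because usedcars.csv is not included here, the sketch below builds a small hypothetical stand-in. The column names follow the features mentioned in these slides, but every value is invented. It shows the kind of outline str() prints.

```r
# Hypothetical stand-in for usedcars; all values are invented for illustration
cars_demo <- data.frame(
  year         = c(2011L, 2011L, 2010L, 2009L),
  model        = c("SEL", "SE", "SE", "SES"),
  price        = c(21992, 17809, 13995, 9995),
  mileage      = c(7413L, 11613L, 39211L, 76931L),
  color        = c("Yellow", "Gray", "Black", "Silver"),
  transmission = c("AUTO", "AUTO", "MANUAL", "AUTO"),
  stringsAsFactors = FALSE
)

# One line per column: name, type, and the first few values
str(cars_demo)

# The same type information as a named character vector
col_types <- sapply(cars_demo, class)
print(col_types)
```

The str() output starts with the number of observations and variables, which is a quick check that the file was read as expected.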
Exploring numeric variables
• To investigate the numeric variables in the
used car data, we will employ a common set
of measurements for describing values, known
as summary statistics.
• The summary() function displays several
common summary statistics. Let's take a look
at a single feature, year:
> summary(usedcars$year)
• We can also use the summary() function to
obtain summary statistics for several numeric
variables at the same time:
> summary(usedcars[c("price", "mileage")])
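A minimal sketch of both forms of summary(), using a hypothetical four-row stand-in for the used car data (the values are invented, so the statistics do not match the real dataset).

```r
# Hypothetical stand-in for a few rows of usedcars; values are invented
cars_demo <- data.frame(
  year    = c(2011L, 2011L, 2010L, 2009L),
  price   = c(21992, 17809, 13995, 9995),
  mileage = c(7413, 11613, 39211, 76931)
)

# Six statistics for a single numeric vector
price_summary <- summary(cars_demo$price)
print(price_summary)

# Several columns at once: select them by name with a character vector
print(summary(cars_demo[c("price", "mileage")]))

# Individual statistics can be extracted by name
price_median <- price_summary[["Median"]]
price_mean   <- price_summary[["Mean"]]
```

Extracting values by name is handy when a statistic is needed in a later calculation rather than just on screen.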
Measuring the central tendency – mean and median
• Measures of central tendency are a class of
statistics used to identify a value that falls in
the middle of a set of data.
• You most likely are already familiar with one
common measure of center: the average. In
common use, when something is deemed
average, it falls somewhere between the
extreme ends of the scale.
• R also provides a mean() function, which
calculates the mean for a vector of numbers:
> mean(c(36000, 44000, 56000))
[1] 45333.33
• The summary() output lists mean values for the
price and mileage variables. The means suggest
that the typical used car in this dataset was
listed at a price of $12,962 and had a mileage of
44,261.
• Another commonly used measure of central
tendency is the median, which is the value that
occurs halfway through an ordered list of
values.
• As with the mean, R provides a median()
function, which we can apply to the same three
values, as shown in the following example:
> median(c(36000, 44000, 56000))
[1] 44000
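To make the two definitions concrete, the sketch below recomputes both measures by hand for the same three values and compares them with R's built-in functions. The four-value vector at the end is an added illustration, not from the slides.

```r
salaries <- c(36000, 44000, 56000)

# Mean: sum of the values divided by how many there are
manual_mean <- sum(salaries) / length(salaries)

# Median: middle element of the sorted values (this indexing is valid for odd lengths only)
sorted <- sort(salaries)
manual_median <- sorted[(length(sorted) + 1) / 2]

print(c(manual_mean, mean(salaries)))
print(c(manual_median, median(salaries)))

# With an even number of values, median() averages the two middle values
even_median <- median(c(36000, 44000, 56000, 64000))
print(even_median)
```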
Measuring spread – quartiles and the five-number summary
• To measure the diversity of the data, we need another type
of summary statistic, one concerned with the spread of the
data, or how tightly or loosely the values are spaced.
• The five-number summary is a set of five statistics that
roughly depict the spread of a feature's values.
1. Minimum (Min.)
2. First quartile, or Q1 (1st Qu.)
3. Median, or Q2 (Median)
4. Third quartile, or Q3 (3rd Qu.)
5. Maximum (Max.)
• Minimum and maximum are the most
extreme feature values, indicating the smallest
and largest values, respectively.
• R provides the min() and max() functions to
calculate these values on a vector of data.
• In R, the range() function returns both the
minimum and maximum values.
> range(usedcars$price)
• Combining range() with the diff() difference
function gives the span of the data in a single command:
> diff(range(usedcars$price))
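A small sketch of these four functions on a hypothetical price vector (the values are invented; the real usedcars prices are not reproduced here).

```r
price <- c(9995, 13995, 17809, 21992)   # hypothetical prices

lowest  <- min(price)
highest <- max(price)

price_range <- range(price)        # vector of length two: c(min, max)
price_span  <- diff(price_range)   # max minus min

print(price_range)
print(price_span)
```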
• The quartiles divide a dataset into four portions.
• The seq() function is used to generate vectors of
evenly-spaced values. Passed as the second argument
to quantile(), such a vector makes it easy to obtain
other slices of data, such as the quintiles (five
groups), as shown in the following command:
> quantile(usedcars$price, seq(from = 0, to = 1, by = 0.20))
     0%     20%     40%     60%     80%    100%
 3800.0 10759.4 12993.8 13992.0 14999.0 21992.0
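The sketch below applies quantile() to a hypothetical vector, 0 to 100, chosen only because its cut points are easy to check by eye. With no second argument, quantile() returns the five-number summary. IQR() is a base R helper not mentioned in the slides.

```r
values <- 0:100   # hypothetical data with easily verified cut points

# Default probabilities 0, 0.25, 0.5, 0.75, 1 give the five-number summary
five_number <- quantile(values)
print(five_number)

# Quintiles: five equal groups, so cut points every 20 percent
quintiles <- quantile(values, probs = seq(from = 0, to = 1, by = 0.20))
print(quintiles)

# Interquartile range: the difference between Q3 and Q1
middle_half <- IQR(values)
print(middle_half)
```

Changing the by argument of seq() gives other slices, for example by = 0.10 for deciles.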
Exploring categorical variables
• The used car dataset had three categorical
variables: model, color, and transmission.
• Additionally, we might consider treating the
year variable as categorical; although it has
been loaded as a numeric (int) type vector,
each year is a category that could apply to
multiple cars.
• A table that presents a single categorical
variable is known as a one-way table.
• The table() function can be used to generate
one-way tables for our used car data.
> table(usedcars$year)
> table(usedcars$model)
> table(usedcars$color)
• The table() output lists the categories of the
nominal variable and a count of the number of
values falling into each category.
• R can also perform the calculation of table
proportions directly, by using the prop.table()
command on a table produced by the table()
function:
> model_table <- table(usedcars$model)
> prop.table(model_table)
• The results of prop.table() can be combined
with other R functions to transform the output.
> color_pct <- table(usedcars$color)
> color_pct <- prop.table(color_pct) * 100
> round(color_pct, digits = 1)
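A minimal sketch of the same table() and prop.table() steps on a hypothetical vector of ten car colors (invented values, so the percentages are easy to verify).

```r
# Hypothetical colors: 4 Black, 1 Blue, 2 Red, 3 Silver
color <- c("Black", "Black", "Silver", "Red", "Black",
           "Silver", "Red", "Black", "Blue", "Silver")

# One-way table of counts; categories appear in alphabetical order
color_counts <- table(color)
print(color_counts)

# Proportions scaled to percentages and rounded to one decimal place
color_pct <- round(prop.table(color_counts) * 100, digits = 1)
print(color_pct)
```

Since prop.table() divides each count by the total, the unrounded proportions always sum to 1.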
Exploring relationships between
variables
• So far, we have examined variables one at a
time, calculating only univariate statistics.
• We may also be interested in bivariate relationships,
which consider the relationship between two variables.
• Relationships of more than two variables are
called multivariate relationships.
