Data Preprocessing

Uploaded by

bauuaverma2002

Managing Data with R

• One of the challenges faced while working
with massive datasets involves gathering,
preparing, and otherwise managing data from
a variety of sources.
Saving, loading, and removing R data structures
• To save a data structure to a file that can be
reloaded later or transferred to another system,
use the save() function.
• The save() function writes one or more R data
structures to the location specified by the file
parameter.
• Suppose you have three objects named x, y, and
z that you would like to save in a permanent file.
> save(x, y, z, file = "mydata.RData")
• The load() command can recreate any data
structures that have been saved to an .RData
file. To load the mydata.RData file we saved in
the preceding code, simply type:
> load("mydata.RData")
• After working on an R session for some time,
you may have accumulated a number of data
structures.
• The ls() listing function returns a character
vector naming all the data structures currently in
memory.
> ls()
[1] "blood" "flu_status" "gender" "m"
[5] "pt_data" "subject_name" "subject1" "symptoms"
[9] "temperature"
• R will automatically remove these from its memory upon
quitting the session, but for large data structures, you may
want to free up the memory sooner.
• The rm() remove function can be used for this purpose. For
example, to eliminate the m and subject1 objects, simply
type:
> rm(m, subject1)
• The rm() function can also be supplied with a character vector
of the object names to be removed. This works with the ls()
function to clear the entire R session:
> rm(list=ls())
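• The save(), rm(), ls(), and load() steps above can be tried end to end. This is a minimal sketch: the objects x, y, and z hold arbitrary example values, and the file goes to a temporary path (rdata_path is a name chosen here) rather than to mydata.RData.

```r
# Three small example objects (the values are arbitrary).
x <- 1:5
y <- c("a", "b")
z <- list(flag = TRUE)

# Save them to a temporary file instead of the working directory.
rdata_path <- tempfile(fileext = ".RData")
save(x, y, z, file = rdata_path)

# Remove the objects and confirm that ls() no longer lists them.
rm(x, y, z)
gone <- !any(c("x", "y", "z") %in% ls())

# load() recreates the objects under their original names.
load(rdata_path)
restored <- all(c("x", "y", "z") %in% ls())

print(c(gone = gone, restored = restored))
```

• Both flags should be TRUE: the objects disappear after rm() and return after load().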
Importing and saving data from CSV files
• The most common tabular text file format is the
CSV (Comma-Separated Values) file, which as the
name suggests, uses the comma as a delimiter.
• CSV files can be imported into and exported
from many common applications. A CSV file
representing the medical dataset constructed
previously could be stored as:
subject_name,temperature,flu_status,gender,blood_type
John Doe,98.1,FALSE,MALE,O
Jane Doe,98.6,FALSE,FEMALE,AB
Steve Graves,101.4,TRUE,MALE,A
• Given a patient data file named pt_data.csv
located in the R working directory, the read.csv()
function can be used as follows to load the file
into R:
> pt_data <- read.csv("pt_data.csv", stringsAsFactors = FALSE)
• By default, R assumes that the CSV file includes a
header line listing the names of the features in
the dataset.
• If a CSV file does not have a header, specify the
option header = FALSE, as shown in the following
command, and R will assign default feature names
of the form V1, V2, and so on:
> mydata <- read.csv("mydata.csv", stringsAsFactors = FALSE, header = FALSE)
• To save a data frame to a CSV file, use the
write.csv() function. If your data frame is
named pt_data, simply enter:
> write.csv(pt_data, file = "pt_data.csv", row.names = FALSE)
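• A round trip through write.csv() and read.csv() shows how the header option changes the result. This is a sketch: the data frame is rebuilt by hand from the three patient rows listed earlier, and the file goes to a temporary path (csv_path, pt_copy, and no_header are names chosen here).

```r
# Rebuild the small patient data frame from the CSV listing.
pt_data <- data.frame(
  subject_name = c("John Doe", "Jane Doe", "Steve Graves"),
  temperature  = c(98.1, 98.6, 101.4),
  flu_status   = c(FALSE, FALSE, TRUE),
  gender       = c("MALE", "FEMALE", "MALE"),
  blood_type   = c("O", "AB", "A"),
  stringsAsFactors = FALSE
)

# Write to a temporary file; row.names = FALSE omits the row-number column.
csv_path <- tempfile(fileext = ".csv")
write.csv(pt_data, file = csv_path, row.names = FALSE)

# Default read: the first line is treated as the header.
pt_copy <- read.csv(csv_path, stringsAsFactors = FALSE)

# header = FALSE: the header line is read as a data row and the
# columns receive the default names V1, V2, ...
no_header <- read.csv(csv_path, stringsAsFactors = FALSE, header = FALSE)

print(names(pt_copy))
print(names(no_header))
print(c(nrow(pt_copy), nrow(no_header)))
```

• Reading a file that does have a header with header = FALSE yields one extra row, which is an easy way to spot the wrong setting.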
Exploring and understanding data
• After collecting data and loading it into R's
data structures, the next step in the machine
learning process involves examining the data
in detail.
• We will explore the usedcars.csv dataset,
which contains actual data about used cars.
• Since the dataset is stored in the CSV form, we
can use the read.csv() function to load the
data into an R data frame:
> usedcars <- read.csv("usedcars.csv", stringsAsFactors = FALSE)
Exploring the structure of data
• One of the first questions to ask is how the
dataset is organized.
• The str() function provides a method to display
the structure of R data structures such as data
frames, vectors, or lists. It can be used to create
the basic outline for our data dictionary:
> str(usedcars)
• Using such a simple command, we learn a wealth
of information about the dataset.
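• Since usedcars.csv is not reproduced here, the sketch below runs str() on a small stand-in data frame. The column names follow the variables named in these slides, but cars_demo and every value in it are invented for illustration.

```r
# Hypothetical stand-in for usedcars.csv; all values are invented.
cars_demo <- data.frame(
  year         = c(2011L, 2010L, 2012L, 2009L),
  model        = c("SEL", "SE", "SES", "SE"),
  price        = c(21992, 17809, 13995, 9995),
  mileage      = c(7413L, 22189L, 45996L, 78948L),
  color        = c("Yellow", "Silver", "Black", "Red"),
  transmission = c("AUTO", "AUTO", "MANUAL", "AUTO"),
  stringsAsFactors = FALSE
)

# str() prints the number of observations and variables, then one line
# per column with its name, type, and first few values.
str(cars_demo)

# The same facts can also be retrieved programmatically.
demo_dim   <- dim(cars_demo)            # rows, columns
demo_types <- sapply(cars_demo, class)  # type of each column
print(demo_dim)
print(demo_types)
```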
Exploring numeric variables
• To investigate the numeric variables in the
used car data, we will employ a common set
of measurements, known as summary statistics,
to describe their values.
• The summary() function displays several
common summary statistics. Let's take a look
at a single feature, year:
> summary(usedcars$year)
• We can also use the summary() function to
obtain summary statistics for several numeric
variables at the same time:
> summary(usedcars[c("price", "mileage")])
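• The two forms of summary() can be compared on a small example. This is a sketch with invented numbers: demo and its values are hypothetical and are not taken from usedcars.csv.

```r
# Hypothetical prices and mileages, invented for illustration only.
demo <- data.frame(
  price   = c(10000, 12000, 14000, 20000),
  mileage = c(80000, 45000, 22000, 7000)
)

# On a single numeric vector, summary() returns six named statistics.
price_summary <- summary(demo$price)
print(price_summary)
print(names(price_summary))

# On several columns at once, it returns a table with one column
# per variable.
both_summary <- summary(demo[c("price", "mileage")])
print(both_summary)
```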
Measuring the central tendency – mean and median
• Measures of central tendency are a class of
statistics used to identify a value that falls in
the middle of a set of data.
• You most likely are already familiar with one
common measure of center: the average. In
common use, when something is deemed
average, it falls somewhere between the
extreme ends of the scale.
• R provides a mean() function, which calculates
the mean of a vector of numbers, such as the
following three salaries:
> mean(c(36000, 44000, 56000))
[1] 45333.33
• The summary() output listed mean values for the
price and mileage variables. The means suggest
that the typical used car in this dataset was
listed at a price of $12,962 and had a mileage of
44,261.
• Another commonly used measure of central
tendency is the median, which is the value that
occurs halfway through an ordered list of
values.
• As with the mean, R provides a median()
function, which we can apply to our salary data,
as shown in the following example:
> median(c(36000, 44000, 56000))
[1] 44000
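• To see what mean() and median() compute, the sketch below reproduces both results by hand for the same three salaries. The fourth salary in the last step is an invented value, added only to show the even-length case.

```r
salaries <- c(36000, 44000, 56000)

# Mean: the sum of the values divided by how many there are.
manual_mean <- sum(salaries) / length(salaries)

# Median for an odd number of values: the middle element after sorting.
sorted <- sort(salaries)
manual_median <- sorted[(length(sorted) + 1) / 2]

# With an even number of values, median() averages the two middle ones.
# The extra salary of 60000 is hypothetical.
even_median <- median(c(salaries, 60000))

print(c(manual_mean, mean(salaries)))
print(c(manual_median, median(salaries)))
print(even_median)
```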
Measuring spread – quartiles and the five-number summary
• To measure diversity, we need another type of summary
statistic, one concerned with the spread of the data, or
how tightly or loosely the values are spaced.
• The five-number summary is a set of five statistics that
roughly depict the spread of a feature's values.
1. Minimum (Min.)
2. First quartile, or Q1 (1st Qu.)
3. Median, or Q2 (Median)
4. Third quartile, or Q3 (3rd Qu.)
5. Maximum (Max.)
• Minimum and maximum are the most
extreme feature values, indicating the smallest
and largest values, respectively.
• R provides the min() and max() functions to
calculate these values on a vector of data.
• The range() function returns both the
minimum and maximum values.
> range(usedcars$price)
• Combining range() with the diff() difference
function gives the span between the minimum and
maximum in a single command:
> diff(range(usedcars$price))
• The quartiles divide a dataset into four portions,
each holding roughly the same number of values.
• The seq() function is used to generate vectors of
evenly-spaced values. Passed to quantile() as the
second argument, it makes it easy to obtain other
slices of data, such as the quintiles (five groups),
as shown in the following command:
> quantile(usedcars$price, seq(from = 0, to = 1, by = 0.20))
     0%     20%     40%     60%     80%    100%
 3800.0 10759.4 12993.8 13992.0 14999.0 21992.0
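• The spread functions above can be tried on a short vector. This is a sketch: prices holds six invented values, not the real usedcars prices, and the variable names are chosen here.

```r
# Hypothetical prices, invented for illustration only.
prices <- c(3800, 10759, 12995, 13992, 14999, 21992)

# range() returns the minimum and maximum; diff() gives their distance.
price_range <- range(prices)
price_span  <- diff(price_range)

# With no second argument, quantile() returns the five-number summary:
# minimum, Q1, median, Q3, and maximum.
five_num <- quantile(prices)

# seq() supplies evenly spaced cut points, here for quintiles.
quintiles <- quantile(prices, probs = seq(from = 0, to = 1, by = 0.20))

print(price_range)
print(price_span)
print(five_num)
print(quintiles)
```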
Exploring categorical variables
• The used car dataset had three categorical
variables: model, color, and transmission.
• Additionally, we might consider treating the
year variable as categorical; although it has
been loaded as a numeric (int) type vector,
each year is a category that could apply to
multiple cars.
• A table that presents a single categorical
variable is known as a one-way table.
• The table() function can be used to generate
one-way tables for our used car data.
> table(usedcars$year)
> table(usedcars$model)
> table(usedcars$color)
• The table() output lists the categories of the
nominal variable and a count of the number of
values falling into each category.
• R can also perform the calculation of table
proportions directly, by using the prop.table()
command on a table produced by the table()
function:
> model_table <- table(usedcars$model)
> prop.table(model_table)
• The results of prop.table() can be combined
with other R functions to transform the output.
> color_pct <- table(usedcars$color)
> color_pct <- prop.table(color_pct) * 100
> round(color_pct, digits = 1)
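• The counting, proportion, and rounding steps can be followed on a short vector. This is a sketch: car_colors holds eight invented values and is not taken from usedcars.csv.

```r
# Hypothetical colors for eight cars, invented for illustration only.
car_colors <- c("Black", "Silver", "Black", "Red",
                "Silver", "Black", "Blue", "Black")

# One-way table: a count for each category.
color_counts <- table(car_colors)

# Proportions: each count divided by the total, so they sum to 1.
color_props <- prop.table(color_counts)

# Percentages rounded to one decimal place.
color_pct <- round(color_props * 100, digits = 1)

print(color_counts)
print(color_pct)
```

• Here Black accounts for 4 of the 8 cars, so its proportion is 0.5 and its percentage is 50.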
Exploring relationships between variables
• So far, we have examined variables one at a
time, calculating only univariate statistics.
• Bivariate relationships consider the
relationship between two variables.
• Relationships of more than two variables are
called multivariate relationships.
