Introduction To Ds - 2024

The document discusses data preprocessing which involves collecting and cleaning raw data. It describes various tasks in data preprocessing including data collection, cleaning, integration, transformation, reduction and discretization. Exploratory data analysis techniques for understanding data are also covered.

Uploaded by

abebeyonas88


Chapter 2

Data Manipulation and Analysis

• Data collection
• Data preprocessing
   o Why data preprocessing?
   o Preprocessing
   o Reading data
   o Selecting and filtering data
   o Data cleaning
   o Filtering missing values
   o Dropping/replacing missing values
   o Data integration
   o Data transformation
   o Manipulating data
   o Data reduction
   o Data discretization and concept hierarchy generation
• Exploratory Data Analysis (EDA)
• Introduction to Pandas and NumPy

Data collection
• Data collection is a systematic process of gathering observations or measurements.
• Whether you are conducting research for business, governmental, or academic purposes, data collection allows you to gain first-hand knowledge and original insights into your research problem.

Steps in data collection:

1. Define the Aim of Your Research
2. Choose Your Data Collection Method
3. Plan Your Data Collection Procedures
4. Collect the Data

1. Define the Aim of Your Research
   o Clarify your research objectives.
   o Write a problem statement and formulate research questions.
   o Decide on the data type: quantitative (numeric), qualitative (expressed in words), or a mixed approach.

2. Choose Your Data Collection Method
Based on the data you want to collect, decide on the most appropriate method:
   o Surveys and questionnaires: Gather information through structured questions.
   o Observations: Observe and record behaviors, events, or phenomena.
   o Interviews: Conduct one-on-one or group interviews to gather in-depth insights.
   o Existing data: Use data that already exists (e.g., historical records, databases).
   o Experiments: Manipulate variables to observe their effects.
   o Case studies: Investigate a specific individual, group, or situation.
   o Sampling: Collect data from a subset of the population.
   o Sensor data: Use sensors or devices to collect real-time data.
   o Social media data: Analyze content from social platforms.
   o Field notes: Record observations during fieldwork.
   o Diaries or journals: Collect self-reported data over time.

3. Plan Your Data Collection Procedures
Develop a detailed plan for data collection:
   o Sampling strategy: Decide how to select participants or cases.
   o Data collection tools: Prepare surveys, interview guides, or observation protocols.
   o Data recording: Specify how you will record data (e.g., paper forms, digital tools).
   o Ethical considerations: Ensure informed consent and protect participants' privacy.
   o Pilot testing: Test your data collection procedures before full implementation.

4. Collect the Data
   o Execute your plan, following the established procedures.
   o Be consistent, accurate, and thorough in recording observations or measurements.
   o Address any unexpected challenges during data collection.

Why Data Preprocessing?
• Quality decisions must be based on quality data.
• Data in the real world is often dirty:
   o Incomplete:
      • lacking attribute values that are vital for decision making, so they have to be filled in,
      • lacking certain attributes of interest in some dimension, which again have to be added with the required values,
      • containing only aggregate data, so the primary source of the aggregation has to be included.
   o Noisy: containing errors or outliers that deviate from the expected values.
   o Inconsistent: containing discrepancies in codes or names across the organization or domain.
   o etc.
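The three kinds of dirty data listed above can be spotted with a few pandas calls. The following is a minimal sketch, not a complete data audit: the table, the column names, and the plausible age range of 0 to 120 are all assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical records, invented for illustration only.
records = pd.DataFrame({
    "customer": ["Abel", "Sara", "Musa", "Hana"],
    "age": [34.0, None, 29.0, 240.0],                # None is incomplete, 240 is noisy
    "branch": ["Addis", "addis", "ADDIS", "Adama"],  # inconsistent coding of one branch
})

# Incomplete: count the missing values in each column.
missing_per_column = records.isna().sum()

# Noisy: flag ages outside an assumed plausible range of 0 to 120.
noisy_rows = records[(records["age"] < 0) | (records["age"] > 120)]

# Inconsistent: the same branch appears under several spellings.
raw_branch_codes = records["branch"].nunique()
clean_branch_codes = records["branch"].str.lower().nunique()

print(missing_per_column)
print(noisy_rows)
print(raw_branch_codes, clean_branch_codes)
```

Comparing the number of distinct codes before and after lowercasing is only one cheap consistency check; real data usually needs domain-specific rules as well.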

• Incomplete, noisy, and inconsistent data are commonplace in large real-world databases and data sources.
• Data cleaning routines work to fix such problems so that the results of the analysis can be accepted.
• Before starting data preprocessing, it is advisable to get an overall picture of the data, that is, a high-level summary such as:
   o General properties of the data
   o Which data values should be considered noise or outliers
• This can be done with the help of exploratory data analysis.

Exploratory Data Analysis (EDA) or Descriptive Data Summarization
• A descriptive summary of the data can be generated with the help of:
   o measures of central tendency of the data,
   o measures of dispersion of the data, and
   o their graphical display.

• Measures of central tendency include:
   o Mean, median, and mode
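As a small illustration of the three measures of central tendency, the sketch below computes them with pandas. The list of unit prices is hypothetical and chosen only to make the numbers easy to follow.

```python
import pandas as pd

# Hypothetical unit prices, invented for illustration.
prices = pd.Series([40, 43, 47, 47, 52, 56, 60, 63, 70, 110])

mean_price = prices.mean()           # arithmetic average
median_price = prices.median()       # middle value of the sorted data
mode_price = prices.mode().tolist()  # most frequent value(s); there may be more than one

print(mean_price, median_price, mode_price)
```

The single large value (110) pulls the mean above the median, which is one reason to look at more than one measure.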

• Measures of dispersion include:
   o Range, quartiles, and interquartile range (IQR = Q3 − Q1)
   o The five-number summary (based on quartiles):
      • minimum, Q1, median (Q2), Q3, and maximum
   o Variance and standard deviation
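The measures of dispersion can be computed with NumPy along the following lines. This is a sketch on hypothetical unit prices. Note that several conventions exist for computing quartiles; the code uses NumPy's default (linear interpolation), so other tools may report slightly different Q1 and Q3 values.

```python
import numpy as np

# Hypothetical unit prices, invented for illustration.
prices = np.array([40, 43, 47, 47, 52, 56, 60, 63, 70, 110])

minimum, maximum = prices.min(), prices.max()
value_range = maximum - minimum

# Quartiles using NumPy's default linear interpolation.
q1, median, q3 = np.percentile(prices, [25, 50, 75])
iqr = q3 - q1

five_number_summary = (minimum, q1, median, q3, maximum)

# Sample variance and standard deviation (ddof=1 divides by n - 1).
variance = prices.var(ddof=1)
std_dev = prices.std(ddof=1)

print(five_number_summary, value_range, iqr, variance, std_dev)
```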

• Graphical methods
   o Boxplots
      • Can be plotted from the five-number summary
      • A useful tool for identifying outliers
      • One of the popular ways of visualizing a distribution
• Boxplots
   o The ends of the box are the quartiles Q1 and Q3, so the length of the box is the IQR.
   o The median is marked by a line within the box.
   o Two lines (called whiskers) outside the box extend to the smallest (minimum) and largest (maximum) observations.
   o The whiskers extend to the extreme low and high values only if these values are less than 1.5 × IQR beyond the quartiles. Otherwise, the whiskers terminate at the most extreme observations occurring within 1.5 × IQR of the quartiles.
   o The remaining observations are plotted individually to show outliers.
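The whisker rule above can be worked through numerically. The sketch below uses hypothetical unit prices and NumPy's default quartile convention; it computes where the whiskers end and which observations would be drawn individually as outliers, without drawing the plot itself.

```python
import numpy as np

# Hypothetical unit prices, invented for illustration.
prices = np.array([40, 43, 47, 47, 52, 56, 60, 63, 70, 110])

q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1

# Fences lie 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations that are still inside the fences.
inside = prices[(prices >= lower_fence) & (prices <= upper_fence)]
lower_whisker, upper_whisker = inside.min(), inside.max()

# Everything outside the fences is plotted individually as an outlier.
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print(lower_whisker, upper_whisker, outliers)
```

Here the lower whisker reaches the minimum because no value lies below the lower fence, while the upper whisker stops short of the maximum, which is flagged as an outlier.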

[Figure: Boxplot for the unit price data for items sold at four branches]

• Other graphical methods
   o Pie charts
   o Bar charts
   o Histograms
   o Quantile plots
   o Q-Q plots
   o Scatter plots
   o etc.

Major Tasks in Data Preprocessing
• In data analytics, data preprocessing refers to processing the various data elements to prepare them for the analytics operation.
• Any activity performed prior to mining the data to extract knowledge from it is called data preprocessing.
• This involves:
   o Data cleaning
   o Data integration
   o Data transformation
   o Data reduction
   o Data discretization and concept hierarchy generation

• Data cleaning refers to the process of:
   o filling in missing values,
   o smoothing noisy data,
   o identifying or removing outliers, and
   o resolving inconsistencies.
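A minimal pandas sketch of these cleaning steps follows. The table and column names are hypothetical, and the specific choices (filling with the median, a 3-point centered rolling mean for smoothing) are assumptions for illustration; the right choice depends on the data.

```python
import pandas as pd

# Hypothetical daily sales records, invented for illustration.
sales = pd.DataFrame({
    "branch": ["Addis", "addis ", "Adama", "Adama", "ADDIS"],
    "units": [10.0, None, 12.0, 11.0, 13.0],
})

# Resolve inconsistencies: normalize the spelling of branch names.
sales["branch"] = sales["branch"].str.strip().str.title()

# Missing values, option 1: drop the rows that lack a value.
dropped = sales.dropna(subset=["units"])

# Missing values, option 2: fill them in, here with the column median.
filled = sales.assign(units=sales["units"].fillna(sales["units"].median()))

# Smooth noisy data with a 3-point centered rolling mean.
smoothed = filled["units"].rolling(window=3, center=True, min_periods=1).mean()

print(dropped)
print(filled)
print(smoothed)
```

Dropping rows is simple but loses data; filling keeps every row at the cost of introducing estimated values.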

• Data integration
   o Combines data from multiple sources (databases, data cubes, or files) into a coherent store.
   o There are a number of issues to consider during data integration. Some of these are:
      • Schema integration
      • Entity identification
      • Data value conflicts
      • Avoiding redundancy
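Some of these integration issues can be illustrated with pandas. In the sketch below, the two source tables, their column names, and the assumption that `customer_number` and `cust_id` identify the same entity are all hypothetical. Data value conflicts (for example, amounts recorded in different units) are not handled here.

```python
import pandas as pd

# Two hypothetical sources describing the same customers.
crm = pd.DataFrame({
    "cust_id": [1, 2, 3],
    "name": ["Abel", "Sara", "Musa"],
})
billing = pd.DataFrame({
    "customer_number": [1, 2, 2, 4],
    "amount": [250.0, 90.0, 90.0, 400.0],
})

# Schema integration and entity identification: the key has a different
# name in each source, so map it onto a common column name.
billing = billing.rename(columns={"customer_number": "cust_id"})

# Avoiding redundancy: drop rows that are exact duplicates.
billing = billing.drop_duplicates()

# Combine into one coherent table; the indicator column records which
# source each row came from.
combined = crm.merge(billing, on="cust_id", how="outer", indicator=True)

print(combined)
```

An outer join keeps customers that appear in only one source, which makes gaps between the sources visible instead of silently dropping them.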

• Data transformation
   o Data transformation is the process of transforming or consolidating data into a form appropriate for mining, for example a form better suited to measuring similarity and distance.
   o This involves:
      • Smoothing
      • Aggregation
      • Generalization
      • Normalization
      • Attribute/feature construction
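Three of these transformations (normalization, attribute construction, and aggregation) might look as follows in pandas. The table and column names are hypothetical, and min-max and z-score are shown only as two common normalization choices.

```python
import pandas as pd

# Hypothetical sales table, invented for illustration.
sales = pd.DataFrame({
    "branch": ["A", "A", "B", "B"],
    "units": [10, 30, 20, 40],
    "unit_price": [5.0, 5.0, 8.0, 8.0],
})

units = sales["units"]

# Normalization, min-max: rescale values into the range [0, 1].
sales["units_minmax"] = (units - units.min()) / (units.max() - units.min())

# Normalization, z-score: zero mean and unit (population) standard deviation.
sales["units_zscore"] = (units - units.mean()) / units.std(ddof=0)

# Attribute construction: derive a new attribute from existing ones.
sales["revenue"] = sales["units"] * sales["unit_price"]

# Aggregation: summarize revenue per branch.
revenue_by_branch = sales.groupby("branch")["revenue"].sum()

print(sales)
print(revenue_by_branch)
```

Normalizing puts attributes on a comparable scale, so that no single attribute dominates a distance calculation simply because of its units.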

• Data reduction
   o Data sources may store terabytes of data.
   o Complex data analysis or mining may take a very long time to run on the complete data set.
   o Data reduction tries to obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same, or better) analytical results.
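One simple form of data reduction is random sampling. The sketch below uses synthetic data and an assumed 1% sample size to show that a much smaller sample can give almost the same summary statistic as the full data set. Sampling is only one reduction technique among several and is used here purely as an example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical full data set: 100,000 synthetic measurements.
full = rng.normal(loc=50.0, scale=10.0, size=100_000)

# Reduce the volume by simple random sampling without replacement (1%).
sample = rng.choice(full, size=1_000, replace=False)

full_mean = full.mean()
sample_mean = sample.mean()

print(full.size, sample.size)
print(full_mean, sample_mean)
```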

• Data discretization and concept hierarchy generation
   o Data discretization refers to transforming a data set, which is usually continuous, into discrete interval values.
   o Concept hierarchy generation refers to generating concept levels so that a data mining function can be applied at a specific concept level.
   o Discretization can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals.
   o Interval labels can then be used to replace the actual data values.
   o This leads to a concise, easy-to-use, knowledge-level representation of mining results.
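Both ideas can be sketched in pandas. The bin edges, the interval labels, and the city-to-region mapping below are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Discretization: replace continuous ages with interval labels.
ages = pd.Series([5, 17, 18, 35, 64, 65, 90])
age_group = pd.cut(
    ages,
    bins=[0, 17, 64, 120],                # intervals (0, 17], (17, 64], (64, 120]
    labels=["child", "adult", "senior"],  # assumed labels
)

# Concept hierarchy: roll a low-level concept (city) up to a higher level (region).
city_to_region = {"City A": "Region 1", "City B": "Region 1", "City C": "Region 2"}
cities = pd.Series(["City A", "City C", "City B", "City A"])
regions = cities.map(city_to_region)

print(age_group.tolist())
print(regions.value_counts())
```

After discretization, seven distinct ages are represented by three labels, and the four city records can be analyzed at the region level.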
Introduction to NumPy and Pandas
• NumPy
   o Is a fundamental package for scientific computing with Python.
   o It provides support for large, multi-dimensional arrays and matrices, along with a collection of high-level mathematical functions to operate on these arrays.
   o Some key features of NumPy include:
      • Multi-dimensional array objects (ndarray)
      • Mathematical functions for fast operations on arrays
      • Tools for reading and writing array data to disk
      • Linear algebra and random number generation capabilities
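The short sketch below touches each of the NumPy features listed above. The array values are arbitrary, and an in-memory buffer stands in for a file on disk so that the example has no side effects.

```python
import io

import numpy as np

# Multi-dimensional array object (ndarray).
matrix = np.array([[2.0, 1.0], [1.0, 3.0]])

# Fast element-wise and aggregate operations, without explicit loops.
squared = matrix ** 2
column_means = matrix.mean(axis=0)

# Writing and reading array data; BytesIO stands in for a file on disk.
buffer = io.BytesIO()
np.save(buffer, matrix)
buffer.seek(0)
restored = np.load(buffer)

# Linear algebra: solve the system matrix @ x = b.
b = np.array([3.0, 5.0])
x = np.linalg.solve(matrix, b)

# Random number generation with a fixed seed for reproducibility.
rng = np.random.default_rng(seed=42)
noise = rng.normal(size=3)

print(squared, column_means, x, noise)
```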

• Pandas
   o Is a powerful and easy-to-use open-source data analysis and manipulation tool built on top of the Python programming language.
   o It offers data structures and data analysis tools that are ideal for working with structured data.
   o Key features of pandas include:
      • DataFrame object for data manipulation with integrated indexing
      • Tools for reading and writing data between in-memory data structures and different file formats
      • Data alignment and handling of missing data
      • Reshaping and pivoting of data sets
      • Time series functionality
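The following sketch walks through several of these pandas features on a tiny, made-up CSV: reading data, selecting and filtering, handling missing values, pivoting, and a simple time series aggregation. The column names and values are hypothetical, and a string buffer stands in for a file path.

```python
import io

import pandas as pd

# Hypothetical CSV content; StringIO stands in for a file on disk.
csv_text = """date,branch,units
2024-01-01,A,10
2024-01-01,B,7
2024-01-02,A,
2024-01-02,B,9
"""

# Reading data into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Selecting and filtering rows.
branch_a = df[df["branch"] == "A"]

# Handling missing data: count the gaps, then replace them with 0.
missing_units = int(df["units"].isna().sum())
filled = df.fillna({"units": 0})

# Reshaping: one row per date, one column per branch.
wide = filled.pivot(index="date", columns="branch", values="units")

# Time series functionality: total units per day.
daily_total = filled.set_index("date")["units"].resample("D").sum()

print(wide)
print(daily_total)
```

Filling the missing value with 0 is an assumption made to keep the example short; dropping the row or filling with a typical value are equally valid options depending on the analysis.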
