
Introduction To Ds - 2024

The document discusses data preprocessing which involves collecting and cleaning raw data. It describes various tasks in data preprocessing including data collection, cleaning, integration, transformation, reduction and discretization. Exploratory data analysis techniques for understanding data are also covered.

Uploaded by

abebeyonas88

Chapter 2

Data Manipulation and Analysis
• Data collection
• Data Preprocessing
 Why Data Preprocessing?
 Preprocessing
 Reading data
 Selecting and filtering data
 Data cleaning
 Filtering missing values
 Dropping/replacing missing values
 Data integration
 Data transformation
 Manipulating data
 Data reduction
 Data Discretization and concept hierarchy generation
• Exploratory Data Analysis (EDA)
• Introduction to Pandas and Numpy


 Data collection:
 Is a systematic process of gathering observations or measurements.
 Whether you are conducting research for business, governmental, or academic purposes, data collection allows you to gain first-hand knowledge and original insights into your research problem.


 Steps in Data collection:


1. Define the Aim of Your Research
2. Choose Your Data Collection Method
3. Plan Your Data Collection Procedures
4. Collect the Data

 1. Define the Aim of Your Research
 Clarify your research objectives.
 Write a problem statement and formulate research questions.
 Decide on the data type: quantitative (numeric), qualitative (expressed in words), or a mixed approach.

2. Choose Your Data Collection Method:
o Based on the data you want to collect, decide on the most appropriate method:
 Surveys and Questionnaires: Gather information through structured questions.
 Observations: Observe and record behaviors, events, or phenomena.
 Interviews: Conduct one-on-one or group interviews to gather in-depth insights.
 Existing Data: Use data that already exists (e.g., historical records, databases).
 Experiments: Manipulate variables to observe their effects.
 Case Studies: Investigate a specific individual, group, or situation.
 Sampling: Collect data from a subset of the population.
 Sensor Data: Use sensors or devices to collect real-time data.
 Social Media Data: Analyze content from social platforms.
 Field Notes: Record observations during fieldwork.
 Diaries or Journals: Collect self-reported data over time.

 3. Plan Your Data Collection Procedures:
 Develop a detailed plan for data collection:
 Sampling Strategy: Decide how to select participants or cases.
 Data Collection Tools: Prepare surveys, interview guides, or observation protocols.
 Data Recording: Specify how you’ll record data (e.g., paper forms, digital tools).
 Ethical Considerations: Ensure informed consent and protect participants’ privacy.
 Pilot Testing: Test your data collection procedures before full implementation.

 4. Collect the Data:
 Execute your plan, following the established procedures.
 Be consistent, accurate, and thorough in recording observations or measurements.
 Address any unexpected challenges during data collection.

Why Data Preprocessing?
 Quality decisions must be based on quality data.
 Data in the real world is dirty:
 Incomplete:
 lacking attribute values that are vital for decision making, so they have to be added;
 lacking certain attributes of interest, which should be added with the required values;
 containing only aggregate data, so the primary source of the aggregation should be included.
 Noisy: containing errors or outliers that deviate from the expected values.
 Inconsistent: containing discrepancies in codes or names of the organization or domain.
 etc.
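
The three kinds of dirty data listed above could be spotted with pandas roughly as follows. This is only a sketch: the table, its column names, its values, and the "10 times the median" threshold are all invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records; all names and values are made up for illustration.
df = pd.DataFrame({
    "branch": ["Addis", "addis ", "Adama", "Adama", "Hawassa"],
    "unit_price": [12.5, 13.0, np.nan, 11.8, 950.0],
    "quantity": [3, 5, 2, np.nan, 4],
})

# Incomplete: count the missing attribute values in each column.
missing_per_column = df.isna().sum()

# Inconsistent: the same branch is written in different ways.
raw_branch_names = df["branch"].nunique()
clean_branch_names = df["branch"].str.strip().str.title().nunique()

# Noisy: a crude flag for prices far above the median
# (the 10x threshold is an assumption, not a standard rule).
median_price = df["unit_price"].median()
suspicious = df[df["unit_price"] > 10 * median_price]

print(missing_per_column)
print(raw_branch_names, clean_branch_names)
print(suspicious)
```

Here one price and one quantity are missing, the four raw branch spellings collapse to three real branches, and the 950.0 price is flagged as suspicious.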

 Incomplete, noisy, and inconsistent data are commonplace properties of large real-world databases and data sources.
 Data cleaning routines work to fix such problems so that the results can be accepted.
 Before starting data preprocessing, it is advisable to get an overall picture of the data, that is, a high-level summary such as:
 General properties of the data
 Which data values should be considered noise or outliers
 This can be done with the help of exploratory data analysis.

Exploratory Data Analysis (EDA) or Descriptive Data Summarization
 A descriptive summary of the data can be generated with the help of
 measures of central tendency of the data,
 measures of dispersion of the data, and
 their graphical display

 Measures of central tendency include
 Mean, median, and mode
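
As a small illustration, the three measures can be computed with pandas. The price values below are invented for the example.

```python
import pandas as pd

# Hypothetical unit prices; the values are invented for illustration.
prices = pd.Series([40, 43, 47, 52, 52, 56, 60, 63, 70, 110])

mean_price = prices.mean()      # arithmetic average
median_price = prices.median()  # middle value of the sorted data (54.0 here)
mode_price = prices.mode()      # most frequent value(s); can hold more than one

print(mean_price, median_price, mode_price.tolist())
```

The single large value 110 pulls the mean above the median, which is one reason the median is often reported alongside the mean for skewed data.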

 Measures of dispersion include
 Range, quartiles, and interquartile range (IQR = Q3 − Q1)
 The five-number summary (based on quartiles)
 minimum, Q1, median (Q2), Q3, and maximum
 Variance and standard deviation
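
A minimal NumPy sketch of these measures, using invented prices. Quartile conventions differ between textbooks and tools. This sketch relies on the default linear interpolation of `np.percentile`, so hand-computed quartiles may differ slightly.

```python
import numpy as np

# Hypothetical unit prices; the values are invented for illustration.
prices = np.array([40, 43, 47, 52, 52, 56, 60, 63, 70, 110])

value_range = prices.max() - prices.min()

# Quartiles with NumPy's default (linear interpolation) method.
q1, median, q3 = np.percentile(prices, [25, 50, 75])
iqr = q3 - q1

# Five-number summary: minimum, Q1, median, Q3, maximum.
five_number_summary = (prices.min(), q1, median, q3, prices.max())

# Sample variance and standard deviation (ddof=1 divides by n - 1).
variance = prices.var(ddof=1)
std_dev = prices.std(ddof=1)

print(value_range, iqr)
print(five_number_summary)
print(variance, std_dev)
```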

 Graphical Methods
 Boxplots
 Can be plotted based on the five-number summary
 Are a useful tool for identifying outliers
 Are also one of the popular ways of visualizing a distribution

 The ends of the box are the quartiles Q1 and Q3, so the length of the box is the IQR
 The median is marked by a line within the box
 Two lines (called whiskers) outside the box extend to the smallest (minimum) and largest (maximum) observations
 The whiskers extend to the extreme low and high values only if these values are less than 1.5 × IQR beyond the quartiles; otherwise, the whiskers terminate at the most extreme observations occurring within 1.5 × IQR of the quartiles
 The remaining observations are plotted individually to show outliers
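
The whisker rule above can be sketched numerically without drawing the plot. The prices are invented, and the quartiles use the default interpolation of `np.percentile`, which is only one of several conventions.

```python
import numpy as np

# Hypothetical unit prices; the values are invented for illustration.
prices = np.array([40, 43, 47, 52, 52, 56, 60, 63, 70, 110])

q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1

# Limits that lie 1.5 x IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences.
inside = prices[(prices >= lower_fence) & (prices <= upper_fence)]
lower_whisker = inside.min()
upper_whisker = inside.max()

# Everything outside the fences is plotted individually as an outlier.
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print(lower_whisker, upper_whisker, outliers.tolist())
```

For this data the whiskers end at 40 and 70, and the value 110 is the only outlier.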


Figure: Boxplot for the unit price data for items sold at four branches

 Other graphical Methods
 Pie charts
 Bar charts
 Histograms
 Quantile plots
 q-q plots
 Scatter plots
 etc.

Major Tasks in Data Preprocessing
 Data preprocessing in data analytics refers to processing the various data elements to prepare them for the analytics operation.
 Any activity performed on the data before mining it for knowledge is called data preprocessing.
 This involves:
 Data cleaning
 Data integration
 Data transformation
 Data reduction
 Data Discretization and concept hierarchy generation

 Data Cleaning: Refers to the process of
 filling in missing values,
 smoothing noisy data,
 identifying or removing outliers, and
 resolving inconsistencies
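
One possible way to carry out these cleaning steps with pandas is sketched below. The data are invented. Median filling, the 1.5 × IQR rule, and the 3-point rolling mean are illustrative choices, not the only valid ones.

```python
import numpy as np
import pandas as pd

# Hypothetical raw records; all names and values are invented for illustration.
raw = pd.DataFrame({
    "branch": ["Addis", "addis ", "Adama", "Adama", "Hawassa", "Hawassa"],
    "unit_price": [12.5, 13.0, np.nan, 11.8, 950.0, 12.2],
})

# Option 1 for missing values: drop the incomplete rows.
dropped = raw.dropna(subset=["unit_price"])

# Option 2 for missing values: replace them (here with the column median).
df = raw.copy()
df["unit_price"] = df["unit_price"].fillna(df["unit_price"].median())

# Resolve inconsistencies: normalize the spelling of branch names.
df["branch"] = df["branch"].str.strip().str.title()

# Identify and remove outliers with the 1.5 x IQR rule.
q1, q3 = df["unit_price"].quantile([0.25, 0.75])
iqr = q3 - q1
keep = df["unit_price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = df[keep].reset_index(drop=True)

# Smooth noisy data with a 3-point rolling mean.
cleaned["price_smoothed"] = (
    cleaned["unit_price"].rolling(window=3, min_periods=1).mean()
)

print(cleaned)
```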

 Data integration:
 Combines data from multiple sources (databases, data cubes, or files) into a coherent store
 There are a number of issues to consider during data integration. Some of these are:
 Schema integration issue
 Entity identification issue
 Data value conflict issue
 Avoiding redundancy issue
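
A small sketch of integrating two sources with pandas follows. Both tables and their column names are hypothetical. It touches schema integration, entity identification, and redundancy; data value conflicts (for example, different units) are not shown.

```python
import pandas as pd

# Two hypothetical sources that describe the same customers.
sales = pd.DataFrame({
    "cust_id": [1, 2, 2, 3],
    "amount": [100, 250, 250, 80],
})
profiles = pd.DataFrame({
    "customer_no": [1, 2, 4],
    "city": ["Addis Ababa", "Adama", "Hawassa"],
})

# Schema integration / entity identification:
# "customer_no" and "cust_id" are assumed to name the same entity key.
profiles = profiles.rename(columns={"customer_no": "cust_id"})

# Avoiding redundancy: remove exact duplicate records before combining.
sales = sales.drop_duplicates()

# Combine both sources into one coherent table, keeping every sale.
combined = sales.merge(profiles, on="cust_id", how="left")

print(combined)
```

Customer 3 has no profile, so its `city` is missing after the merge. This is the kind of gap that the data cleaning step then has to handle.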

 Data Transformation
• Data transformation is the process of transforming or consolidating data into a form that is appropriate for mining and better suited to measuring similarity and distance
 This involves:
 Smoothing
 Aggregation
 Generalization
 Normalization
 Attribute/feature construction
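
Three of these operations (attribute construction, normalization, and aggregation) are sketched below with pandas. The table is invented, and min-max and z-score are shown as two common normalization choices.

```python
import pandas as pd

# Hypothetical sales table; all names and values are invented for illustration.
df = pd.DataFrame({
    "branch": ["A", "A", "B", "B"],
    "unit_price": [10.0, 20.0, 30.0, 40.0],
    "quantity": [1, 2, 3, 4],
})

# Attribute/feature construction: derive a new attribute from existing ones.
df["revenue"] = df["unit_price"] * df["quantity"]

# Normalization (min-max): rescale values into the range [0, 1].
price = df["unit_price"]
df["price_minmax"] = (price - price.min()) / (price.max() - price.min())

# Normalization (z-score): zero mean and unit standard deviation.
df["price_zscore"] = (price - price.mean()) / price.std(ddof=0)

# Aggregation: summarize revenue at the branch level.
revenue_by_branch = df.groupby("branch")["revenue"].sum()

print(df)
print(revenue_by_branch)
```

Normalization matters for distance-based methods because an attribute with a large numeric range would otherwise dominate the distance.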

 Data reduction
 Data sources may store terabytes of data
 Complex data analysis/mining may take a very long time to run on the complete dataset
 Data reduction tries to obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same, or better) analytical results
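
Random sampling is one simple form of data reduction. The sketch below uses synthetic data generated only for this example and compares a summary statistic on the full data with the same statistic on a 1% sample.

```python
import numpy as np
import pandas as pd

# Synthetic "large" data set, generated only for this illustration.
rng = np.random.default_rng(seed=0)
full = pd.DataFrame({
    "unit_price": rng.normal(loc=50.0, scale=10.0, size=100_000),
})

# Reduced representation: a 1% simple random sample.
sample = full.sample(frac=0.01, random_state=0)

full_mean = full["unit_price"].mean()
sample_mean = sample["unit_price"].mean()

print(len(full), len(sample))
print(full_mean, sample_mean)
```

The sample is 100 times smaller, yet its mean should be close to the mean of the full data. Other reduction techniques, such as aggregation or dimensionality reduction, follow the same idea of trading volume for nearly the same result.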

 Data Discretization and concept hierarchy generation
 Data discretization refers to transforming a data set, which is usually continuous, into discrete interval values
 Concept hierarchy generation refers to generating concept levels so that a data mining function can be applied at a specific concept level
 Discretization can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
 Interval labels can be used to replace actual data values
 This leads to a concise, easy-to-use, knowledge-level representation of mining results
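
Both ideas can be sketched with pandas. The age values, the bin edges, and the labels are assumptions chosen for illustration, and the city-to-country mapping stands in for one level of a concept hierarchy.

```python
import pandas as pd

# Discretization: replace continuous ages with interval labels.
ages = pd.Series([5, 17, 18, 35, 64, 65, 90], name="age")
bins = [0, 18, 65, 120]                # assumed edges: [0, 18), [18, 65), [65, 120)
labels = ["child", "adult", "senior"]  # assumed interval labels
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

# Concept hierarchy: move from a low-level concept (city)
# to a higher-level concept (country).
city_to_country = {
    "Addis Ababa": "Ethiopia",
    "Adama": "Ethiopia",
    "Nairobi": "Kenya",
}
cities = pd.Series(["Addis Ababa", "Adama", "Nairobi", "Adama"])
countries = cities.map(city_to_country)
customers_per_country = countries.value_counts()

print(age_group.tolist())
print(customers_per_country)
```

Seven distinct ages are reduced to three labels, and the city-level records can now be analyzed at the country level.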
Introduction to Numpy and Pandas
 NumPy
 Is a fundamental package for scientific computing with Python.
 It provides support for large, multi-dimensional arrays and matrices, along with a collection of high-level mathematical functions to operate on these arrays.
 Some key features of NumPy include:
 Multi-dimensional array objects (ndarray)
 Mathematical functions for fast operations on arrays
 Tools for reading and writing array data to disk
 Linear algebra and random number generation capabilities
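
A brief sketch that touches each of the listed features. The numbers are arbitrary, and an in-memory buffer stands in for a file on disk so the example leaves nothing behind.

```python
import io

import numpy as np

# Multi-dimensional array object (ndarray).
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
shape = matrix.shape

# Fast element-wise and aggregate operations, without explicit loops.
doubled = matrix * 2
column_sums = matrix.sum(axis=0)

# Linear algebra: solve the system matrix @ x = b.
b = np.array([5.0, 11.0])
x = np.linalg.solve(matrix, b)

# Random number generation with a seeded generator.
rng = np.random.default_rng(seed=42)
samples = rng.normal(size=5)

# Writing and reading array data (a BytesIO buffer stands in for a file).
buffer = io.BytesIO()
np.save(buffer, matrix)
buffer.seek(0)
restored = np.load(buffer)

print(shape, column_sums, x)
```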

 Pandas
 Is a powerful and easy-to-use open-source data analysis and manipulation tool built on top of the Python programming language.
 It offers data structures and data analysis tools that are ideal for working with structured data.
 Key features of pandas include:
 DataFrame object for data manipulation with integrated indexing
 Tools for reading and writing data between in-memory data structures and different file formats
 Data alignment and handling of missing data
 Reshaping and pivoting of data sets
 Time series functionality
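
The features above can be sketched in a few lines. The CSV text is invented and read from memory rather than from a real file. The column names are assumptions for this example.

```python
import io

import pandas as pd

# Reading data: a small in-memory CSV stands in for a real file.
csv_text = """date,branch,unit_price
2024-01-01,A,10.0
2024-01-01,B,12.0
2024-01-02,A,11.0
2024-01-02,B,
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Handling missing data: count, then replace with the column mean.
missing = df["unit_price"].isna().sum()
df["unit_price"] = df["unit_price"].fillna(df["unit_price"].mean())

# Reshaping and pivoting: one row per date, one column per branch.
wide = df.pivot(index="date", columns="branch", values="unit_price")

# Data alignment: values are matched by index label, not by position.
s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([10, 20], index=["b", "c"])
aligned_sum = s1 + s2  # only label "b" exists in both, the rest become NaN

# Time series functionality: daily mean price.
daily_mean = df.set_index("date")["unit_price"].resample("D").mean()

print(wide)
print(aligned_sum)
print(daily_mean)
```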
