Introduction To Ds - 2024
Data Manipulation and Analysis
• Data collection
• Data Preprocessing
Why Data Preprocessing?
Preprocessing
Reading data
Selecting and filtering data
Data cleaning
Filtering missing values
Dropping/replacing missing values
Data integration
Data transformation
Manipulating data
Data reduction
Data Discretization and concept hierarchy generation
• Exploratory Data Analysis (EDA)
• Introduction to Pandas and Numpy
Data collection is a systematic process of gathering observations
or measurements.
Whether you are conducting research for
business, governmental, or academic purposes,
data collection allows you to gain first-hand
knowledge and original insights into your research
problem.
Steps in Data Collection
2. Choose Your Data Collection Method:
Based on the data you want to collect, decide on the most appropriate method:
Surveys and Questionnaires: Gather information through structured questions.
Observations: Observe and record behaviors, events, or phenomena.
Interviews: Conduct one-on-one or group interviews to gather in-depth insights.
Existing Data: Use data that already exists (e.g., historical records, databases).
Experiments: Manipulate variables to observe their effects.
Case Studies: Investigate a specific individual, group, or situation.
Sampling: Collect data from a subset of the population.
Sensor Data: Use sensors or devices to collect real-time data.
Social Media Data: Analyze content from social platforms.
Field Notes: Record observations during fieldwork.
Diaries or Journals: Collect self-reported data over time.
3. Plan Your Data Collection Procedures:
Develop a detailed plan for data collection:
4. Collect the Data:
Execute your plan, following the established
procedures.
Be consistent, accurate, and thorough in recording
observations or measurements.
Address any unexpected challenges during data
collection.
Why Data Preprocessing?
Quality decisions must be based on quality data. Real-world data is often:
incomplete:
lacking attribute values that are vital for decision making, so they have
to be filled in,
lacking certain attributes of interest, which have to
be added with the required values,
containing only aggregate data, so the primary source of the
aggregation has to be included
noisy: containing errors or outliers that deviate from what is expected
in the organization or domain
etc.
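As a rough illustration of spotting incomplete data, the sketch below counts missing values per column with pandas. The `sales` table, its column names, and its values are hypothetical and were made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records; NaN marks a missing attribute value.
sales = pd.DataFrame({
    "branch": ["A", "B", "C", "D"],
    "unit_price": [40.0, np.nan, 52.0, 60.0],
    "quantity": [3, 5, np.nan, 2],
})

# Count the missing values in each column.
missing_per_column = sales.isna().sum()
print(missing_per_column)

# Select the rows that have at least one missing value.
incomplete_rows = sales[sales.isna().any(axis=1)]
print(incomplete_rows)
```

A count like this is usually a first check; what to do with the incomplete rows depends on the analysis.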
Incomplete, noisy, and inconsistent data are commonplace in
large real-world databases and data sources.
Data cleaning routines work to fix such problems so that
the results can be trusted.
Before starting data preprocessing, it is advisable to get an
overall picture of the data, a high-level
summary such as:
General properties of the data
Which data values should be considered noise or outliers
This can be done with the help of exploratory data analysis.
Exploratory Data Analysis (EDA) or Descriptive
Data Summarization
Descriptive summary about data can be
generated with the help of
measure of central tendency of the data,
measure of dispersion of the data and
their graphic display
Measures of central tendency include
the mean, median, and mode
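The three measures can be computed directly with pandas. This is a minimal sketch; the `prices` values are made up for illustration.

```python
import pandas as pd

# Hypothetical unit prices (already sorted for readability).
prices = pd.Series([40, 43, 47, 52, 52, 56, 60, 63, 70, 110])

mean_price = prices.mean()      # arithmetic average: 593 / 10
median_price = prices.median()  # middle value: average of 52 and 56
mode_price = prices.mode()      # most frequent value(s), returned as a Series

print(mean_price, median_price, list(mode_price))
```

Note that the single large value (110) pulls the mean above the median, which is one reason both are usually reported.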
Measures of dispersion include
the range, quartiles, and interquartile range (IQR = Q3 - Q1)
The five-number summary (based on quartiles)
minimum, Q1, median (Q2), Q3, and maximum
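A small sketch of the five-number summary, range, and IQR using NumPy. The data are hypothetical, and `np.percentile` is used with its default linear interpolation; other quartile conventions can give slightly different values.

```python
import numpy as np

# Hypothetical unit prices.
prices = np.array([40, 43, 47, 52, 56, 60, 63, 70, 110])

# Five-number summary: minimum, Q1, median (Q2), Q3, maximum.
minimum, q1, median, q3, maximum = np.percentile(prices, [0, 25, 50, 75, 100])

iqr = q3 - q1                    # interquartile range
value_range = maximum - minimum  # range

print(minimum, q1, median, q3, maximum)
print("IQR:", iqr, "Range:", value_range)
```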
Graphical Methods
Boxplots
Can be plotted from the five-number summary
Are a useful tool for identifying outliers
Are also a popular way of visualizing a distribution
The ends of the box are the quartiles Q1 and Q3, so the length of
the box is the IQR
The median is marked by a line within the box
Two lines (called whiskers) outside the box extend toward the smallest
(minimum) and largest (maximum) observations
The whiskers extend to the extreme low and high values
only if these values are less than 1.5 × IQR beyond the quartiles
Otherwise, the whiskers terminate at the most extreme observations
occurring within 1.5 × IQR of the quartiles
The remaining observations are plotted individually to show outliers
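The whisker rule described above can be sketched numerically without drawing the plot. The data are hypothetical, and the names `lower_fence` and `upper_fence` are introduced here only for illustration.

```python
import numpy as np

# Hypothetical unit prices.
prices = np.array([40, 43, 47, 52, 56, 60, 63, 70, 110])

q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1

# Limits at 1.5 x IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the limits.
inside = prices[(prices >= lower_fence) & (prices <= upper_fence)]
lower_whisker, upper_whisker = inside.min(), inside.max()

# Everything outside the limits is plotted individually as an outlier.
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print(lower_whisker, upper_whisker, outliers)
```

Here the lowest value lies within the limit, so the lower whisker reaches the minimum, while the highest value falls outside and is flagged as an outlier.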
Figure: Boxplot for the unit price data for items sold at four branches
Other graphical Methods
Pie charts
Bar charts
Histograms
Quantile plots
q-q plots
Scatter plots
etc.
Major Tasks in Data Preprocessing
In data analytics, data preprocessing refers to
processing the various data elements to prepare them for the
analytics operation.
Any activity performed before mining the data to extract
knowledge from it is called data preprocessing.
This involves:
Data cleaning
Data integration
Data transformation
Data reduction
Data Discretization and concept hierarchy generation
Data cleaning: refers to the process of
filling in missing values,
smoothing noisy data,
identifying or removing outliers, and resolving inconsistencies
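A minimal sketch of two common cleaning choices in pandas, dropping or filling missing values, plus one simple inconsistency fix. The `records` table is hypothetical, and filling with the median is only one of several possible strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical records with an inconsistent label ("a") and a missing price.
records = pd.DataFrame({
    "branch": ["A", "a", "B", "B"],
    "unit_price": [40.0, np.nan, 52.0, 60.0],
})

# Resolve the inconsistency: use one spelling for branch labels.
records["branch"] = records["branch"].str.upper()

# Option 1: drop rows that contain missing values.
dropped = records.dropna()

# Option 2: replace missing prices with the column median.
filled = records.fillna({"unit_price": records["unit_price"].median()})

print(dropped)
print(filled)
```

Dropping keeps only fully observed rows, while filling keeps every row at the cost of introducing an estimated value.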
Data integration:
Combines data from multiple sources (databases, data
cubes, or files) into a coherent store
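As one possible illustration, two sources that share a key can be combined with `DataFrame.merge`. Both tables and the `branch_id` key are hypothetical.

```python
import pandas as pd

# Source 1: hypothetical sales figures per branch.
sales = pd.DataFrame({"branch_id": [1, 2, 3], "revenue": [1200, 950, 1430]})

# Source 2: hypothetical branch details from another file or database.
branches = pd.DataFrame({"branch_id": [1, 2, 4], "region": ["North", "South", "East"]})

# Keep every sales row and attach the region where the key matches.
combined = sales.merge(branches, on="branch_id", how="left")

print(combined)
```

Branch 3 has no match in the second source, so its `region` is left missing; integration often surfaces gaps like this that then need cleaning.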
Data Transformation
Data transformation is the process of transforming or
consolidating data into a form appropriate for mining, one that is
better suited to measuring similarity and distance
This involves:
Smoothing
Aggregation
Generalization
Normalization
Attribute/feature construction
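Two of the listed operations, normalization and aggregation, can be sketched in pandas as follows. The data and column names are hypothetical; min-max and z-score scaling are shown as common normalization choices, not the only ones.

```python
import pandas as pd

# Hypothetical sales rows.
sales = pd.DataFrame({
    "branch": ["A", "A", "B", "B"],
    "unit_price": [40.0, 60.0, 50.0, 100.0],
})
price = sales["unit_price"]

# Min-max normalization: rescale values to the interval [0, 1].
sales["price_minmax"] = (price - price.min()) / (price.max() - price.min())

# Z-score normalization: subtract the mean, divide by the standard deviation
# (pandas uses the sample standard deviation, ddof=1, by default).
sales["price_zscore"] = (price - price.mean()) / price.std()

# Aggregation: total unit price per branch.
totals = sales.groupby("branch")["unit_price"].sum()

print(sales)
print(totals)
```

Putting attributes on a common scale keeps one large-valued attribute from dominating distance computations.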
Data reduction
Data sources may store terabytes of data
Data Discretization and concept hierarchy generation
Data discretization refers to transforming the data set which
Introduction to NumPy and Pandas
Pandas is a powerful and easy-to-use open-source data analysis and
manipulation tool built on top of the Python programming
language.
It offers data structures and data analysis tools that are ideal for
working with structured data.
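A first taste of working with structured data, sketched with a NumPy array wrapped in a pandas DataFrame. The item names and prices are made up for illustration.

```python
import numpy as np
import pandas as pd

# A NumPy array holds the numeric values.
values = np.array([40, 52, 63, 110])

# A DataFrame is a table with labeled columns.
items = pd.DataFrame({
    "item": ["pen", "book", "bag", "lamp"],
    "unit_price": values,
})

# Selecting a column and filtering rows by a condition.
expensive = items[items["unit_price"] > 50]

# Quick descriptive summary (count, mean, std, min, quartiles, max).
summary = items["unit_price"].describe()

print(expensive)
print(summary)
```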