UNIT - 2 ML
Working with real data in data preparation for machine learning involves several steps
to ensure the data is properly formatted, cleaned, and preprocessed for use in training a
machine learning model. Here's a general guide to the data preparation process:
1. Data Collection: Obtain the dataset from reliable sources. This could be from
databases, APIs, CSV files, Excel spreadsheets, or any other structured data
format.
2. Exploratory Data Analysis (EDA): Perform EDA to understand the structure,
distribution, and characteristics of the data. This involves:
Checking for missing values.
Summarizing statistics (mean, median, min, max, etc.).
Creating visualizations (histograms, box plots, scatter plots, etc.) to understand
relationships and distributions.
Identifying outliers and anomalies.
3. Data Cleaning:
Handle missing values: Impute missing values (using mean, median, mode,
or more sophisticated methods), or remove rows/columns with missing
data depending on the amount of missingness and the nature of the
problem.
Deal with outliers: Decide whether to remove outliers or transform them to
mitigate their impact on the model.
Address inconsistencies and errors in data: This might involve correcting
typos, standardizing formats, or resolving inconsistencies in categorical
variables.
4. Feature Engineering:
Create new features: Combine existing features or derive new ones that
might be more informative for the model.
Encode categorical variables: Convert categorical variables into numerical
representations using techniques like one-hot encoding, label encoding,
or embeddings.
Feature scaling: Scale numerical features to a similar range (e.g., using
min-max scaling or standardization) to prevent features with large values
from dominating the model.
5. Data Transformation:
Standardize the data: scale the features to have a mean of 0 and a standard
deviation of 1 to improve convergence during training.
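The cleaning, encoding, and scaling steps above can be sketched with Pandas and scikit-learn. This is a minimal illustration; the dataset and the column names (age, city, income) are made up for the example:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset with missing values; column names are illustrative.
df = pd.DataFrame({
    "age": [25, None, 47, 31],
    "city": ["Pune", "Delhi", None, "Pune"],
    "income": [50000, 62000, 75000, None],
})

# Step 3 (Data Cleaning): impute missing values
# (median for numeric columns, mode for the categorical column).
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Step 4 (Feature Engineering): one-hot encode the categorical variable.
df = pd.get_dummies(df, columns=["city"])

# Steps 4-5 (Scaling/Transformation): standardize numeric features
# to mean 0 and standard deviation 1.
scaler = StandardScaler()
df[["age", "income"]] = scaler.fit_transform(df[["age", "income"]])
```

After these steps the frame has no missing values, the city column is replaced by one-hot indicator columns, and the numeric features are on a comparable scale.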
Looking at the big picture in data preparation for machine learning means keeping in
mind the overarching goals, challenges, and best practices that guide the entire
process, rather than treating each step in isolation.
Discovering and visualizing the data is a crucial step in data preparation for machine
learning. Here's a guide on how to perform exploratory data analysis (EDA) to gain
insights:
1. Load the Data: Start by loading your dataset into your preferred data analysis
environment such as Python with libraries like Pandas, NumPy, and
Matplotlib/Seaborn for visualization.
2. Basic Data Exploration:
Check the first few rows of the dataset using the .head() function to
understand its structure.
Check the dimensions of the dataset (number of rows and columns) using
the .shape attribute.
Use the .info() function to get a concise summary of the dataset, including
data types and missing values.
3. Summary Statistics:
Compute summary statistics such as mean, median, standard deviation,
minimum, and maximum values for numerical features using the .describe()
function.
For categorical features, you can use the .value_counts() function to get the
frequency distribution of unique values.
4. Data Visualization:
Histograms: Plot histograms to visualize the distribution of numerical
features. This helps in understanding the range and spread of values and
identifying potential outliers.
Box plots: Use box plots to visualize the distribution of numerical features,
identify outliers, and compare distributions across different categories.
Scatter plots: Plot scatter plots to visualize the relationship between pairs
of numerical features. This helps in identifying patterns, correlations, and
potential trends in the data.
Bar plots: Use bar plots to visualize the frequency distribution of
categorical features. This helps in understanding the distribution of
categories and identifying dominant categories.
Heatmaps: Plot heatmaps to visualize the correlation matrix between
numerical features. This helps in identifying multicollinearity and
understanding the strength and direction of correlations.
5. Feature Relationships:
Explore relationships between features using scatter plots, pair plots (for
multiple numerical features), and categorical plots (for categorical
features).
Look for patterns, trends, and correlations between features, which can
provide valuable insights for feature selection and engineering.
6. Missing Values and Outliers:
Visualize missing values using heatmaps or bar plots to identify patterns of
missingness across features.
Plot box plots or scatter plots to identify outliers in numerical features.
Decide whether to remove or impute outliers based on domain knowledge
and the impact on the model.
7. Interactive Visualizations:
Use interactive plotting libraries such as Plotly or Bokeh to explore the
data dynamically (zooming, hovering, and filtering), which is especially
useful for large or high-dimensional datasets.
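The EDA steps above can be sketched as follows. The small DataFrame and its columns (price, quantity, category) are invented for illustration; in practice the data would be loaded with pd.read_csv() or similar:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt

# Hypothetical dataset; normally loaded from a file or database.
df = pd.DataFrame({
    "price": [12.5, 15.0, 9.8, 22.1, 14.3],
    "quantity": [3, 5, 2, 8, 4],
    "category": ["A", "B", "A", "C", "B"],
})

# Step 2 (Basic Data Exploration): structure, dimensions, dtypes, missing values.
print(df.head())
print(df.shape)
df.info()

# Step 3 (Summary Statistics): numeric summary and categorical frequencies.
print(df.describe())
print(df["category"].value_counts())

# Step 4 (Data Visualization): histogram, box plot, scatter plot, correlation heatmap.
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].hist(df["price"])
axes[0, 0].set_title("Histogram of price")
axes[0, 1].boxplot(df["price"])
axes[0, 1].set_title("Box plot of price")
axes[1, 0].scatter(df["quantity"], df["price"])
axes[1, 0].set_title("price vs. quantity")
axes[1, 1].imshow(df[["price", "quantity"]].corr())
axes[1, 1].set_title("Correlation heatmap")
plt.tight_layout()
```

The same plots can be produced more concisely with Seaborn (e.g. sns.histplot, sns.boxplot, sns.heatmap), which the notes mention as an alternative.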
Preparing data for machine learning algorithms involves the steps described above:
formatting the dataset correctly, scaling features appropriately, and getting the data
ready for training a model.
Selecting and training a model in the data preparation phase of machine learning
involves choosing an appropriate algorithm, training it on the prepared dataset, and
evaluating its performance. Here's a step-by-step guide:
1. Choose a Model:
Select a machine learning algorithm suitable for your problem based on
factors such as the nature of the data (e.g., classification, regression), size
of the dataset, interpretability requirements, and computational resources.
Common algorithms include linear regression, logistic regression, decision
trees, random forests, support vector machines (SVM), k-nearest neighbors
(KNN), and neural networks.
2. Prepare the Data:
Ensure that the dataset is properly cleaned, preprocessed, and split into
training and testing sets as described earlier in the data preparation
process.
3. Train the Model:
Fit the selected model to the training data using the fit() method or
equivalent in your chosen machine learning library (e.g., scikit-learn in
Python).
Provide the training features (X_train) and the corresponding target labels
(y_train) as input to the fit() method.
4. Model Evaluation:
Evaluate the trained model's performance using appropriate evaluation
metrics based on the problem type (e.g., accuracy, precision, recall, F1-
score for classification; mean squared error, R-squared for regression).
Calculate the performance metrics on the test set using the predict()
method to generate predictions and compare them with the actual target
labels (y_test).
Visualize the model's performance using relevant plots such as confusion
matrices, ROC curves (for binary classification), or calibration plots.
5. Hyperparameter Tuning:
Fine-tune the model's hyperparameters to improve its performance. This
involves searching over a predefined hyperparameter space using
techniques like grid search or random search.
Use cross-validation to estimate the model's performance on different
subsets of the training data and avoid overfitting.
6. Model Selection:
Compare the performance of different models using cross-validation or a
separate validation set.
Select the model with the best performance based on the evaluation
metrics and your specific requirements (e.g., accuracy, interpretability,
computational efficiency).
7. Training Pipeline:
Create a training pipeline that encapsulates the data preparation, model
training, and evaluation steps. This ensures reproducibility and facilitates
experimentation with different algorithms and hyperparameters.
8. Documentation and Reporting:
Document the model selection process, including the chosen algorithm,
hyperparameters, and evaluation results.
Provide insights into the model's strengths, weaknesses, and potential
areas for improvement.
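The selection, training, tuning, and pipeline steps above can be sketched with scikit-learn. The synthetic dataset and the choice of random forest are assumptions made only to keep the example self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step 2 (Prepare the Data): synthetic classification data, split into train/test sets.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 7 (Training Pipeline): scaling and the model in one reproducible object.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(random_state=42)),
])

# Step 5 (Hyperparameter Tuning): grid search with 3-fold cross-validation.
grid = GridSearchCV(pipe, {"model__n_estimators": [50, 100]}, cv=3)

# Step 3 (Train the Model): fit on the training features and labels.
grid.fit(X_train, y_train)

# Step 4 (Model Evaluation): predict on the held-out test set and score.
y_pred = grid.predict(X_test)
print("Best params:", grid.best_params_)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```

Swapping in a different estimator (e.g. LogisticRegression or SVC) only changes the "model" entry of the pipeline and its hyperparameter grid, which is what makes the pipeline useful for comparing models (step 6).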