Data Preprocessing
EXP-5: Apply data pre-processing: handling missing data, transforming data, and converting categorical data into label and one-hot encodings. (Python/PowerBI)
Data Preparation is one of the indispensable steps in any Machine Learning development life cycle.
In today's world, data is present in structured as well as unstructured form.
To deal with such data, data scientists spend almost 70–80% of their time preparing data for further analysis, which includes:

Handling missing values
Encoding string values to integer values
Splitting data into train-test-validation datasets
Feature scaling
Deletion of outliers

Let's discuss each term one by one along with its implementation using Python packages.
For any real-world application, it is very common to have missing values in the dataset. Most machine learning algorithms cannot handle missing data in their training and testing datasets. To deal with such situations, Python provides a very useful library that will be discussed next.
Employee   Age   Salary   Purchased
Anjali     45    71000    No
Parul      28    48000    Yes
Kanisha    31    53000    No
Parul      35    61000    No
Kanisha    42    .        Yes
Anjali     35    59000    Yes
Parul      .     53000    No
Anjali     47    80000    Yes
Kanisha    51    81000    No
Anjali     36    68000    Yes

(A "." marks a missing cell.)
Save the above dataset as a raw .csv file on your local system (leave the missing cells empty so that pandas reads them as NaN) and read the file using the pandas library.
#Importing libraries
import pandas as pd
import numpy as np

#Mounting Google Drive, since the .csv file is stored there
from google.colab import drive
drive.mount('/content/drive')

#Importing dataset
df = pd.read_csv("/content/drive/MyDrive/data.csv")
print(df)
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
df[['Employee', 'Age', 'Salary', 'Purchased']]  # select columns with a list of names, not a single comma-separated string
df.iloc[0].values  # first row of the dataset as a NumPy array
As you can see, we have 2 cells with missing values in the above dataset, at row index 4 (missing Salary) and row index 6 (missing Age). We can handle these missing values by replacing them with one of the following:
Mean
Mode (most frequent)
Median
Constant value
Delete the entire row/column with missing values

There is no rule of thumb for selecting a specific option; it depends on the data and the problem statement to be solved. To select the best option, knowledge of both the data and the application is needed. A quick pandas-based sketch of some of these options is given below.
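The following is a minimal illustration of two of these options using pandas only, assuming the DataFrame df loaded above (the helper names df_dropped and df_filled are just for illustration):

# Option: delete every row that contains a missing value
df_dropped = df.dropna()

# Option: fill missing values with a column aggregate or a constant
df_filled = df.copy()
df_filled['Age'] = df_filled['Age'].fillna(df_filled['Age'].median())  # median of the Age column
df_filled['Salary'] = df_filled['Salary'].fillna(0)                    # constant value, purely illustrative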
Let's separate the independent and the dependent variables before handling the missing values.

X = df.iloc[:, :-1].values  # Employee, Age, Salary
y = df.iloc[:, -1].values   # Purchased
We will use the SimpleImputer class from the sklearn.impute library to replace the missing values with the mean value of the corresponding
columns.
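A minimal sketch of this step, assuming the X array created above (Employee in column 0, Age and Salary in columns 1 and 2):

from sklearn.impute import SimpleImputer
# Replace NaN entries in the numeric columns (Age, Salary) with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])
print(X)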
[['Anjali' 45.0 71000.0]
 ['Parul' 28.0 48000.0]
 ['Kanisha' 31.0 53000.0]
 ['Parul' 35.0 61000.0]
 ['Kanisha' 42.0 63777.77777777778]
 ['Anjali' 35.0 59000.0]
 ['Parul' 38.888888888888886 53000.0]
 ['Anjali' 47.0 80000.0]
 ['Kanisha' 51.0 81000.0]
 ['Anjali' 36.0 68000.0]]
We replaced the missing values with 38.89 (the mean Age) and 63777.78 (the mean Salary) respectively. Instead of the mean, we can also replace values with the corresponding mode, median, or any constant value by changing the 'strategy' parameter.
It is common to have string-based columns in a dataset, in the form of names, addresses, and so on. But no machine learning algorithm can train a model with string-based variables in it. Hence, we have to encode those variables into numeric variables before feeding the data into any machine learning algorithm.
There are several ways to handle such data. Here, we will use 2 scikit-learn encoders, LabelEncoder and OneHotEncoder, to convert the string-based variables into numeric ones.
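As a first attempt, here is a minimal label-encoding sketch, assuming the X array built above with the Employee names in column 0 (the variable name employee_encoded is just for illustration):

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Maps each unique name to an integer: Anjali -> 0, Kanisha -> 1, Parul -> 2
employee_encoded = le.fit_transform(X[:, 0])
print(employee_encoded)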
There is a lot happening in the above program. First, as you can see, we have 3 unique values in the Employee column, and the encoder maps each of them to an integer (0, 1, or 2).
The machine learning algorithm has nothing to do with the column names; instead, it tries to find patterns within the data. From the encoded output, it can easily infer that the value 2 is bigger than the values 1 and 0, and it may therefore assume that there is a numerical order within the data. Hence, some misinterpretation could happen between the independent variables and the dependent variable, which could lead to wrong correlations and wrong results.
To mitigate such issues, we will instead create 1 column for each value, containing 0s and 1s: 0 meaning the value is absent and 1 meaning it is present. This is why some extra columns appear in the encoded output that were not present before.
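A hedged sketch of that one-hot step, assuming X still holds the Employee names in column 0 and using scikit-learn's ColumnTransformer with OneHotEncoder:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode column 0 (Employee) and pass the remaining columns through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
print(X)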
If you noticed, the dependent variable is also string-based. But the dependent column has only 2 unique values; in that case, we can skip the one-hot step and directly encode the values from ['No', 'Yes'] to [0, 1]. After encoding, you will notice that we have only 0 and 1 in the dependent column, and that is what we want. Let's quickly convert the dependent variable 'y' to a numeric variable, as sketched below.
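A minimal sketch, assuming the y array created earlier from the Purchased column:

from sklearn.preprocessing import LabelEncoder
le_y = LabelEncoder()
y = le_y.fit_transform(y)  # 'No' -> 0, 'Yes' -> 1
print(y)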
[0 1 0 0 1 1 0 1 0 1]
That’s how we handle the string-based variables in the model building process. Let’s move to the next steps.
print(X)
print(y)
Split data into a train-test-validation dataset

In every model development process, we need to train the model on a subset of the original data; the model then predicts values for new, unseen data so that we can evaluate its performance in terms of accuracy, the ROC-AUC curve, and so on. We will not discuss any evaluation metrics here as they are out of scope for this tutorial, but they will be covered in a later tutorial, so until then stay tuned. :)
To achieve the above scenario, we need to split the original dataset into 2, or sometimes 3, splits, namely the training, validation, and testing datasets.
Let's discuss all 3 datasets and their significance in the machine learning life cycle.
Training dataset: It consists of more than half of the original dataset; its sole purpose is to train the model and update the model's weights.
Validation dataset: A small subset of the original data used to provide an unbiased evaluation of a model fit on the training dataset while tuning the model's hyperparameters. Having a validation dataset is optional.
Testing dataset: It ranges from 10–25% of the original data and is used to evaluate the model's performance based on the evaluation metrics mentioned above.
Generally, it is recommended to use a 70–30 or 80–20 ratio for a train-test split, and 60–20–20 or 70–15–15 for a train-validation-test split.
Consider the example below with 100 rows in the original dataset. First, an 80–20 split is made between the training and testing datasets. The training portion is then further split in a 75–25 ratio between the training and validation datasets, which gives a 60–20–20 split overall.
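A hedged sketch of that two-step 60–20–20 scenario, using hypothetical arrays named features and labels:

from sklearn.model_selection import train_test_split
# features and labels are placeholders for any feature matrix and target vector
# First split: 80% train+validation, 20% test
X_trainval, X_test, y_trainval, y_test = train_test_split(features, labels, test_size=0.20, random_state=0)
# Second split: 75% of the remaining 80% for training, 25% for validation (60-20-20 overall)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)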
In our case, we apply a simple 80–20 train-test split to X and y:

# Ensure X and y have the same number of samples before splitting
assert len(X) == len(y), "X and y must have the same number of samples"

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print('X_train.shape: ', X_train.shape)
print('X_test.shape: ', X_test.shape)
As you can see, it is very easy to split the dataset using the train_test_split function from the sklearn.model_selection module. Choose the 'test_size' parameter between 0 and 1; in our case we took 0.2 to get 20% testing data. It is recommended to set the seed parameter 'random_state' to achieve reproducibility of the results. If we do not set the seed, a different random split occurs every time and hence the results differ every time the model runs.
Feature scaling
It is a technique to standardize the independent variables into a fixed range. It is a very crucial step in the data preparation process because, if we skip it, variables with larger values tend to dominate variables with smaller values in distance-based models.
We can achieve this in various ways, but we will discuss here the 2 most popular feature scaling techniques, i.e. Min-Max scaling and Standardization. Let's discuss them one by one.
Min-Max scaling: In this technique, the features/variables are re-scaled to the range 0 to 1, using x' = (x − x_min) / (x_max − x_min).
Standardization: In this technique, the features are re-scaled so that the distribution has mean = 0 and standard deviation = 1, using z = (x − mean) / std.
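As a quick illustration of the Min-Max alternative (we apply standardization below), here is a minimal sketch on a small hand-made array taken from the first three Age/Salary rows of the table; the names sample and mm are just for illustration:

from sklearn.preprocessing import MinMaxScaler
import numpy as np

sample = np.array([[45.0, 71000.0], [28.0, 48000.0], [31.0, 53000.0]])
mm = MinMaxScaler()  # rescales each column to the [0, 1] range
print(mm.fit_transform(sample))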
In our case, we will apply the standardization technique to scale the independent features in the training and testing dataset using
sklearn.preprocessing library.
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)  # learn the mean/std from the training data only
X_test = sc_X.transform(X_test)        # reuse the same parameters on the testing data
print(X_train)
print(X_test)
Here, if you noticed, I used the fit_transform method on the training dataset and the transform method on the testing dataset. The reason is that we learn the scaling parameters from the training dataset and then use the same parameters to scale the testing dataset.
Note: In our case, we do not apply feature scaling to the dependent variable, as it is already within the required range of values.
Now the big question arises: where should we apply feature scaling and where not? A rule of thumb says to apply feature scaling to distance-based algorithms (e.g., KNN, SVM, K-Means), while tree-based algorithms are generally insensitive to the scale of the features.