
EXP-5: Apply data pre-processing with missing data handling, transformation of data, and converting categorical data into label and one-hot encoding. (Python/PowerBI)

Data preparation is one of the indispensable steps in any machine learning development life cycle. In today's world, data comes in both structured and unstructured form. To deal with such data, data scientists spend almost 70–80% of their time preparing it for further analysis, which includes:

Handling missing values
Encoding string values to integer values
Splitting data into train, validation, and test datasets
Feature scaling
Deletion of outliers

Let's discuss each step one by one along with its implementation using Python packages.

Handling missing values

In real-world applications, it is very common to have missing values in the dataset, and no machine learning algorithm can be trained properly with missing data in its training and testing sets. To deal with such situations, Python provides a very useful library that is discussed next.

Let’s take a demo dataset containing some missing values.

Employee   Age   Salary   Purchased
Anjali     45    71000    No
Parul      28    48000    Yes
Kanisha    31    53000    No
Parul      35    61000    No
Kanisha    42    .        Yes
Anjali     35    59000    Yes
Parul      .     53000    No
Anjali     47    80000    Yes
Kanisha    51    81000    No
Anjali     36    68000    Yes

Save the above dataset as a raw .csv file on your local system and read it using the pandas library. (The CSV actually used in this notebook has a Country column with France/Spain/Germany instead of employee names, as the output below shows, but the structure is the same.)


#Importing libraries
import pandas as pd
import numpy as np

#Importing dataset
df = pd.read_csv("/content/drive/MyDrive/data.csv")
print(df)

Country Age Salary Purchased


0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
4 Germany 40.0 NaN Yes
5 France 35.0 58000.0 Yes
6 Spain NaN 52000.0 No
7 France 48.0 79000.0 Yes
8 Germany 50.0 83000.0 No
9 France 37.0 67000.0 Yes
10 France 37.0 67000.0 Yes

from google.colab import drive


drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

df[['Country', 'Age', 'Salary', 'Purchased']]  # select columns by passing a list of column names

Note: passing the single string 'Country,Age,Salary,Purchased' raises a KeyError, because pandas then looks for one column literally named with that comma-separated string; column selection needs a single column name or a list of names.

df.iloc[0].values      # row 1 as a NumPy array

df.iloc[0:10, [0]]     # first 10 rows of column 1 (Country)

As you can see, the printed dataframe has 2 cells with missing values, at row #4 (Salary) and row #6 (Age). We can handle these missing values by replacing them with one of the following:

Mean
Mode (most frequent)
Median
A constant value
Deleting the entire row/column with missing values

There is no rule of thumb for selecting a specific option; it depends on the data and the problem statement to be solved. To select the best option, knowledge of both the data and the application is needed. The deletion option, for example, takes one line in pandas, as sketched below; the rest of this tutorial uses imputation instead.
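A minimal sketch of the deletion option, assuming the df loaded above (dropna is shown only for illustration and is not used later in this notebook):

# Deletion option (illustrative only), using the df read above
df_rows_dropped = df.dropna()          # drop every row that contains a NaN
df_cols_dropped = df.dropna(axis=1)    # drop every column that contains a NaN
print(df_rows_dropped.shape, df_cols_dropped.shape)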

Let's separate the independent and the dependent variables before handling missing values. The CSV read above contains a duplicated last row (index 10), so we take the first 10 rows for both X and y to keep them consistent.

X = df.iloc[:10, [0, 1, 2]].values  # first 10 rows, columns Country, Age, Salary
y = df.iloc[:10, 3].values          # first 10 rows, column Purchased
print(X)
print(y)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


We will use the SimpleImputer class from the sklearn.impute library to replace the missing values with the mean value of the corresponding
columns.

Handling missing values

from sklearn.impute import SimpleImputer


imputer = SimpleImputer(missing_values=np.nan,strategy="mean")
imputer.fit(X[:,1:3])
X[:,1:3] = imputer.transform(X[:,1:3])
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]

We replaced the missing values with 38.78 (mean Age) and 63777.78 (mean Salary) respectively. Instead of the mean, we can also replace missing values with the mode, median, or any constant value by changing the 'strategy' parameter, as sketched below.
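For illustration, here is a small self-contained sketch of the other strategies on a hypothetical toy array (the names toy, median_imputer, and constant_imputer are ours, not from the notebook):

import numpy as np
from sklearn.impute import SimpleImputer

toy = np.array([[25.0, 50000.0],
                [np.nan, 61000.0],
                [40.0, np.nan]])

median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
print(median_imputer.fit_transform(toy))     # NaNs replaced by the column medians

constant_imputer = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0)
print(constant_imputer.fit_transform(toy))   # NaNs replaced by the constant 0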

Encoding dataset features

It is common for a dataset to have string-based columns in the form of names, addresses, and so on. But machine learning algorithms cannot train on string-based variables directly, so we have to encode them into numeric variables before applying any algorithm.

There are several ways to handle such data. Here, we will use two scikit-learn classes to convert the string-based variables into numeric ones.

Encoding the independent variables

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode column 0 (Country); keep the remaining columns unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]

There is a lot happening in the above program. First, we have 3 unique values in the Country column: France, Germany, and Spain. A plain label encoding would simply replace these with the numeric values 0, 1, and 2.

You may wonder why extra columns appeared that were not present before. A machine learning algorithm has nothing to do with the column names; it only tries to find patterns within the data. With labels 0, 1, and 2, the algorithm could infer that the value 2 is bigger than 1 and 0, i.e. that there is a numerical order within the data. Such a misinterpretation of the independent variables can lead to wrong correlations and wrong results.

To mitigate such issues, we create one column for each value, containing 0 and 1: 0 meaning the value is absent and 1 meaning it is present. That is exactly what the extra columns produced by the OneHotEncoder above are. The same result can also be obtained directly in pandas, as sketched below.
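As an aside, pandas can produce the same kind of indicator columns with get_dummies; this is only an illustrative alternative on a hypothetical demo DataFrame, while the notebook itself uses ColumnTransformer with OneHotEncoder as shown above:

import pandas as pd

demo = pd.DataFrame({"Country": ["France", "Spain", "Germany"],
                     "Age": [44, 27, 30]})
# One indicator column per country; values are 0/1 (or booleans, depending on the pandas version)
print(pd.get_dummies(demo, columns=["Country"]))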


If you noticed, the dependent variable is also string-based. But the dependent column has only 2 unique values; in that case we can skip the one-hot step and directly encode the values from ['No', 'Yes'] to [0, 1]. After encoding, the dependent column contains only 0 and 1, which is exactly what we want. Let's quickly convert the dependent variable y into a numerical variable.

Encoding the dependent variables

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)   # 'No' -> 0, 'Yes' -> 1
print(y)

[0 1 0 0 1 1 0 1 0 1]


That’s how we handle the string-based variables in the model building process. Let’s move to the next steps.

print(X)
print(y)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
[0 1 0 0 1 1 0 1 0 1]

Split data into a train-test-validation dataset

In every model development process we train the model on a subset of the original data, and the model then predicts values for new, unseen data so that we can evaluate its performance in terms of accuracy, ROC-AUC, and so on. We will not discuss evaluation metrics here, as they are out of scope for this tutorial.

To achieve this, we split the original dataset into 2, or sometimes 3, parts: the training, validation, and testing datasets. Let's discuss all 3 datasets and their significance in the machine learning life cycle.

Training dataset: it consists of more than half of the original dataset; its sole purpose is to train the model and update the model's weights.

Validation dataset: a small subset of the original data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. A validation dataset is optional.

Testing dataset: typically 10–25% of the original data, used to evaluate the model's performance with the evaluation metrics mentioned above.

Generally, it is recommended to use a 70–30 or 80–20 train-test split, and a 60–20–20 or 70–15–15 split in the train-validation-test case.

Consider the example below with 100 rows in the original dataset. First, an 80–20 split separates the training and testing datasets. The training portion is then further split 75–25 into training and validation datasets, as sketched in the code that follows.
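An illustrative sketch of that 100-row example, using hypothetical X_full and y_full arrays that are not part of the notebook's dataset:

import numpy as np
from sklearn.model_selection import train_test_split

X_full = np.arange(200).reshape(100, 2)    # 100 dummy samples with 2 features
y_full = np.arange(100) % 2                # 100 dummy binary labels

# First split: 80 rows for train+validation, 20 rows for testing
Xf_trainval, Xf_test, yf_trainval, yf_test = train_test_split(
    X_full, y_full, test_size=0.2, random_state=0)

# Second split: 75-25 of the remaining 80 rows -> 60 training, 20 validation
Xf_train, Xf_val, yf_train, yf_val = train_test_split(
    Xf_trainval, yf_trainval, test_size=0.25, random_state=0)

print(len(Xf_train), len(Xf_val), len(Xf_test))   # 60 20 20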

# Ensure X and y have the same number of samples before splitting
assert len(X) == len(y), "X and y must have the same number of samples"

With X and y both taken from the first 10 rows, the assertion now passes silently. (In the original run, X had 10 samples while y had 11 because of the duplicated last row of the CSV, which is why this check failed.)

#Let’s implement the same in Python.

#Split data into train and test dataset

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print('X_train.shape: ', X_train.shape)
print('X_test.shape: ', X_test.shape)
print('y_train.shape: ', y_train.shape)
print('y_test.shape: ', y_test.shape)

X_train.shape:  (8, 5)
X_test.shape:  (2, 5)
y_train.shape:  (8,)
y_test.shape:  (2,)

As you can see, it is very easy to split the dataset using the train_test_split function from the sklearn.model_selection library. Choose the 'test_size' parameter between 0 and 1; in our case we used 0.2 to get 20% testing data. It is recommended to set the seed parameter 'random_state' to make the results reproducible: if we do not set the seed, a different random split occurs each run, and hence the results differ every time the model is trained. The small check below illustrates this.
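A quick, hypothetical check of the reproducibility point: splitting the same data twice with the same random_state gives identical partitions.

import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)
a_train, a_test = train_test_split(data, test_size=0.2, random_state=0)
b_train, b_test = train_test_split(data, test_size=0.2, random_state=0)
print(np.array_equal(a_train, b_train))   # True: the same seed reproduces the same split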


Feature scaling
It is a technique to standardize the independent variables into a fixed range. It is a crucial step in data preparation because, if we skip it, variables with larger values tend to dominate variables with smaller values in distance-based models.

We can achieve this in various ways, but here we discuss the 2 most popular feature scaling techniques: Min-Max scaling and Standardization.

Min-Max scaling: the features/variables are re-scaled into the range between 0 and 1, i.e. x' = (x - min) / (max - min).

Standardization: the features are re-scaled so that the resulting distribution has mean = 0 and standard deviation = 1, i.e. z = (x - mean) / std. A small comparison of the two is sketched below.
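A small self-contained comparison of the two scalers on a hypothetical toy column (the ages array below is ours, not from the dataset):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = np.array([[27.0], [35.0], [44.0], [50.0]])

print(MinMaxScaler().fit_transform(ages).ravel())    # values rescaled into the range [0, 1]
print(StandardScaler().fit_transform(ages).ravel())  # values rescaled to mean 0, std 1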

In our case, we will apply the standardization technique to scale the independent features in the training and testing datasets using the sklearn.preprocessing library.

Feature scaling of training dataset

from sklearn.preprocessing import StandardScaler


sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
print(X_train)
print(X_test)

This prints the standardized training and testing arrays; all columns (including the one-hot indicators) are rescaled using the mean and standard deviation learned from the training set.

Here, if you noticed, I used the fit_transform method on the training dataset and the transform method on the testing dataset. The reason is that we learn the scaling parameters from the training dataset only, and then use those same parameters to scale the testing dataset, as the sketch below shows.
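A minimal sketch of what that means in practice, on a hypothetical one-column example: the scaler stores the training mean and standard deviation and reuses them, unchanged, on the test data.

import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[25.0]])

sc = StandardScaler()
sc.fit_transform(train)          # learns mean_ and scale_ from the training data only
print(sc.mean_, sc.scale_)       # parameters estimated from the training set
print(sc.transform(test))        # test value scaled with those same parameters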

Note: in our case, we do not apply feature scaling to the dependent variable, as it is already within the required range of values (0 and 1).

Now the big question arises: where should feature scaling be applied and where not? A rule of thumb is to apply feature scaling for distance-based algorithms (e.g. KNN, K-Means, SVM), where features with larger magnitudes would otherwise dominate the distance computations; tree-based models such as decision trees and random forests are largely insensitive to feature scaling.

