Data Preprocessing
EXP-5: Apply data pre-processing: handling missing data, transforming data, and converting categorical data into label and one-hot encodings. (Python/PowerBI)
Data Preparation is one of the indispensable steps in any Machine Learning development life cycle.
In today's world, data is present in structured as well as unstructured form.
To deal with such data, data scientists spend almost 70–80% of their time preparing data for further analysis, which includes:

Handling missing values
Encoding string values to integer values
Splitting data into train-test-validation datasets
Feature scaling
Deletion of outliers

Let's discuss each term one by one along with its implementation using Python packages.
For any real-world application, it is very common to have missing values in the dataset. Most machine learning algorithms cannot handle missing data in their training and testing datasets. To deal with such situations, Python provides a very useful library that will be discussed next.
Employee   Age   Salary   Purchased
Anjali     45    71000    No
Parul      28    48000    Yes
Kanisha    31    53000    No
Parul      35    61000    No
Kanisha    42    .        Yes
Anjali     35    59000    Yes
Parul      .     53000    No
Anjali     47    80000    Yes
Kanisha    51    81000    No
Anjali     36    68000    Yes

(A "." marks a missing cell.)
Save the above dataset as a raw .csv file on your local system (leave the missing cells empty so that pandas reads them as NaN) and read the file using the pandas library.
#Importing libraries
import pandas as pd
import numpy as np

#Mounting Google Drive, since the .csv file is stored there
from google.colab import drive
drive.mount('/content/drive')

#Importing dataset
df = pd.read_csv("/content/drive/MyDrive/data.csv")
print(df)
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
df[['Employee', 'Age', 'Salary', 'Purchased']]  # select columns with a list of names, not a single comma-separated string
df.iloc[0].values  # first row of the dataset as a NumPy array
As you can see, we have 2 cells with missing values in the above dataset, at row index 4 (missing Salary) and row index 6 (missing Age). We can handle these missing values by replacing them with one of the following:
Mean
Mode (most frequent)
Median
Constant value
Delete the entire row/column with missing values

There is no rule of thumb for selecting a specific option; it depends on the data and the problem statement to be solved. To select the best option, knowledge of both the data and the application is needed. A quick pandas-based sketch of some of these options is given below.
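The following is a minimal illustration of two of these options using pandas only, assuming the DataFrame df loaded above (the helper names df_dropped and df_filled are just for illustration):

# Option: delete every row that contains a missing value
df_dropped = df.dropna()

# Option: fill missing values with a column aggregate or a constant
df_filled = df.copy()
df_filled['Age'] = df_filled['Age'].fillna(df_filled['Age'].median())  # median of the Age column
df_filled['Salary'] = df_filled['Salary'].fillna(0)                    # constant value, purely illustrative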
Let's separate the independent and the dependent variables before handling the missing values.

X = df.iloc[:, :-1].values  # Employee, Age, Salary
y = df.iloc[:, -1].values   # Purchased
We will use the SimpleImputer class from the sklearn.impute library to replace the missing values with the mean value of the corresponding
columns.
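A minimal sketch of this step, assuming the X array created above (Employee in column 0, Age and Salary in columns 1 and 2):

from sklearn.impute import SimpleImputer
# Replace NaN entries in the numeric columns (Age, Salary) with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])
print(X)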
[['Anjali' 45.0 71000.0]
 ['Parul' 28.0 48000.0]
 ['Kanisha' 31.0 53000.0]
 ['Parul' 35.0 61000.0]
 ['Kanisha' 42.0 63777.77777777778]
 ['Anjali' 35.0 59000.0]
 ['Parul' 38.888888888888886 53000.0]
 ['Anjali' 47.0 80000.0]
 ['Kanisha' 51.0 81000.0]
 ['Anjali' 36.0 68000.0]]
We replaced the missing values with 38.89 (the mean Age) and 63777.78 (the mean Salary) respectively. Instead of the mean, we can also replace values with the corresponding mode, median, or any constant value by changing the 'strategy' parameter.
It is common to have string-based columns in a dataset, in the form of names, addresses, and so on. But no machine learning algorithm can train a model with string-based variables in it. Hence, we have to encode those variables into numeric variables before feeding the data into any machine learning algorithm.
There are several ways to handle such data. Here, we will use 2 scikit-learn encoders, LabelEncoder and OneHotEncoder, to convert the string-based variables into numeric ones.
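As a first attempt, here is a minimal label-encoding sketch, assuming the X array built above with the Employee names in column 0 (the variable name employee_encoded is just for illustration):

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Maps each unique name to an integer: Anjali -> 0, Kanisha -> 1, Parul -> 2
employee_encoded = le.fit_transform(X[:, 0])
print(employee_encoded)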
There is a lot happening in the above program. First, as you can see, we have 3 unique values in the Employee column, and the encoder maps each of them to an integer (0, 1, or 2).
The machine learning algorithm has nothing to do with the column names; instead, it tries to find patterns within the data. From the encoded output, it can easily infer that the value 2 is bigger than the values 1 and 0, and it may therefore assume that there is a numerical order within the data. Hence, some misinterpretation could happen between the independent variables and the dependent variable, which could lead to wrong correlations and wrong results.
To mitigate such issues, we will instead create 1 column for each value, containing 0s and 1s: 0 meaning the value is absent and 1 meaning it is present. This is why some extra columns appear in the encoded output that were not present before.
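A hedged sketch of that one-hot step, assuming X still holds the Employee names in column 0 and using scikit-learn's ColumnTransformer with OneHotEncoder:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode column 0 (Employee) and pass the remaining columns through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
print(X)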
If you noticed, the dependent variable is also string-based. But the dependent column has only 2 unique values; in that case, we can skip the one-hot step and directly encode the values from ['No', 'Yes'] to [0, 1]. After encoding, you will notice that we have only 0 and 1 in the dependent column, and that is what we want. Let's quickly convert the dependent variable 'y' to a numeric variable, as sketched below.
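A minimal sketch, assuming the y array created earlier from the Purchased column:

from sklearn.preprocessing import LabelEncoder
le_y = LabelEncoder()
y = le_y.fit_transform(y)  # 'No' -> 0, 'Yes' -> 1
print(y)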
[0 1 0 0 1 1 0 1 0 1]
That’s how we handle the string-based variables in the model building process. Let’s move to the next steps.
print(X)
print(y)
Split data into a train-test-validation dataset

In every model development process, we need to train the model on a subset of the original data; the model then predicts values for new, unseen data so that we can evaluate its performance in terms of accuracy, the ROC-AUC curve, and so on. We will not discuss any evaluation metrics here as they are out of scope for this tutorial, but they will be covered in a later tutorial, so until then stay tuned. :)
To achieve the above scenario, we need to split the original dataset into 2, or sometimes 3, splits, namely the training, validation, and testing datasets.
Let's discuss all 3 datasets and their significance in the machine learning life cycle.
Training dataset: It consists of more than half of the original dataset; its sole purpose is to train the model and update the model's weights.
Validation dataset: A small subset of the original data used to provide an unbiased evaluation of a model fit on the training dataset while tuning the model's hyperparameters. Having a validation dataset is optional.
Testing dataset: It ranges from 10–25% of the original data and is used to evaluate the model's performance based on the evaluation metrics mentioned above.
Generally, it is recommended to use a 70–30 or 80–20 ratio for a train-test split, and 60–20–20 or 70–15–15 for a train-validation-test split.
Consider the example below with 100 rows in the original dataset. First, an 80–20 split is made between the training and testing datasets. The training portion is then further split in a 75–25 ratio between the training and validation datasets, which gives a 60–20–20 split overall.
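A hedged sketch of that two-step 60–20–20 scenario, using hypothetical arrays named features and labels:

from sklearn.model_selection import train_test_split
# features and labels are placeholders for any feature matrix and target vector
# First split: 80% train+validation, 20% test
X_trainval, X_test, y_trainval, y_test = train_test_split(features, labels, test_size=0.20, random_state=0)
# Second split: 75% of the remaining 80% for training, 25% for validation (60-20-20 overall)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)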
In our case, we apply a simple 80–20 train-test split to X and y:

# Ensure X and y have the same number of samples before splitting
assert len(X) == len(y), "X and y must have the same number of samples"

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print('X_train.shape: ', X_train.shape)
print('X_test.shape: ', X_test.shape)
As you can see, it is very easy to split the dataset using the train_test_split function from the sklearn.model_selection module. Choose the 'test_size' parameter between 0 and 1; in our case we took 0.2 to get 20% testing data. It is recommended to set the seed parameter 'random_state' to achieve reproducibility of the results. If we do not set the seed, a different random split occurs every time and hence the results differ every time the model runs.
Feature scaling
It is a technique to standardize the independent variables into a fixed range. It is a very crucial step in the data preparation process because, if we skip it, variables with larger values tend to dominate variables with smaller values in distance-based models.
We can achieve this in various ways, but we will discuss here the 2 most popular feature scaling techniques, i.e. Min-Max scaling and Standardization. Let's discuss them one by one.
Min-Max scaling: In this technique, the features/variables are re-scaled to the range 0 to 1, using x' = (x − x_min) / (x_max − x_min).
Standardization: In this technique, the features are re-scaled so that the distribution has mean = 0 and standard deviation = 1, using z = (x − mean) / std.
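As a quick illustration of the Min-Max alternative (we apply standardization below), here is a minimal sketch on a small hand-made array taken from the first three Age/Salary rows of the table; the names sample and mm are just for illustration:

from sklearn.preprocessing import MinMaxScaler
import numpy as np

sample = np.array([[45.0, 71000.0], [28.0, 48000.0], [31.0, 53000.0]])
mm = MinMaxScaler()  # rescales each column to the [0, 1] range
print(mm.fit_transform(sample))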
In our case, we will apply the standardization technique to scale the independent features in the training and testing dataset using
sklearn.preprocessing library.
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)  # learn the mean/std from the training data only
X_test = sc_X.transform(X_test)        # reuse the same parameters on the testing data
print(X_train)
print(X_test)
Here, if you noticed, I used the fit_transform method on the training dataset and the transform method on the testing dataset. The reason is that we learn the scaling parameters from the training dataset and then use the same parameters to scale the testing dataset.
Note: In our case, we do not apply feature scaling to the dependent variable, as it is already within the required range of values.
Now the big question arises: where should we apply feature scaling and where not? A rule of thumb says to apply feature scaling to distance-based algorithms (e.g., KNN, SVM, K-Means), while tree-based algorithms are generally insensitive to the scale of the features.