1 - DataPreparation.ipynb - Colaboratory


Data preparation is the first step after collecting data, or once the dataset is ready. We have to preprocess the raw data before we can use it for analysis.

Major steps:

Loading data
Cleaning data (removing unnecessary or erroneous data)
Transforming data
Possibly rearranging data

Load the data set and store it in a dataframe

#import pandas
import pandas as pd

# Set pandas' maximum row display to 1000
pd.set_option('display.max_rows', 1000)
# Set pandas' maximum column display to 50
pd.set_option('display.max_columns', 50)

#Loading an .xlsx data set into a dataframe with a meaningful name
#df = pd.read_excel('Data.xlsx', sheet_name="Sheet1")
#or read a csv file
#Loading a .csv data set, assuming the file is in the same folder; if not, refer to it by its path
#df_csvData = pd.read_csv('horror-train.csv')
df = pd.read_csv('toy_dataset.zip', compression='zip')
df.head() # first 5 rows

   Number    City Gender   Age   Income Illness
0       1  Dallas   Male  41.0  40367.0      No
1       2  Dallas   Male  54.0  45084.0      No
2       3  Dallas   Male  42.0  52483.0      No
3       4  Dallas   Male  40.0  40941.0      No
4       5  Dallas   Male  46.0  50289.0      No

Exploration phase

What does the data look like?

Is there missing data?
What is the distribution of the data?

# get info about the attributes and their types

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 6 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   Number   150000 non-null  int64
 1   City     149999 non-null  object
 2   Gender   149993 non-null  object
 3   Age      149994 non-null  float64
 4   Income   149992 non-null  float64
 5   Illness  149998 non-null  object
dtypes: float64(2), int64(1), object(3)
memory usage: 6.9+ MB

#we can also find the total number of non-null values per column using count

df.count()

Number     150000
City       149999
Gender     149993
Age        149994
Income     149992
Illness    149998
dtype: int64
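df.count() gives the number of non-null values per column; a complementary, minimal sketch (using the same df) to count the missing values directly:

#count missing (NaN) values in each column
df.isna().sum()

#or simply check whether each column has any missing values
df.isna().any()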

#get statistical summary info about numeric attributes

df.describe()

              Number            Age         Income
count  150000.000000  149994.000000  149992.000000
mean    75000.500000      44.950058   91253.820697
std     43301.414527      11.572402   24988.333860
min         1.000000      25.000000    -654.000000
25%     37500.750000      35.000000   80869.000000
50%     75000.500000      45.000000   93655.000000
75%    112500.250000      55.000000  104519.000000
max    150000.000000      65.000000  177157.000000
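Note that the minimum Income is -654, which looks like an erroneous entry (incomes should not be negative). A minimal sketch for inspecting such values, assuming negative incomes are invalid in this dataset:

#inspect the rows with a negative Income (likely data-entry errors)
df[df['Income'] < 0]

#one option (not applied here) is to mark them as missing so they are handled by the imputation steps below
#df.loc[df['Income'] < 0, 'Income'] = None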

#gender distribution: #of records for male and female and nan if any

df.groupby("Gender",dropna=False).size()

Gender
Female    66198
Male      83795
dtype: int64

#plot this distribution

#create a new variable that holds the Gender's value counts

bygender=df["Gender"].value_counts(dropna=False)

#!pip install matplotlib --upgrade #bar_label (used below to show the numbers above the bars) requires matplotlib 3.4+
#bar plot using the default plot function in pandas (based on matplotlib)

ax = bygender.plot(kind = 'bar') #returns Axes object

#or

#bygender.plot.bar()

#give it a title

ax.set_title("Bar Graph of Gender")

#assign a label to x axis

ax.set_xlabel('Gender')

#assign a label to y axis

ax.set_ylabel('Number of People')

#show the data labels

ax.bar_label(ax.containers[0]) #requires matplotlib version 3.4 or newer

[Text(0, 0, '83795'), Text(0, 0, '66198'), Text(0, 0, '7')]

#or using matplotlib pyplot

#bygender has np.nan among its index labels; we rename it, otherwise plt.bar raises an error

import matplotlib.pyplot as plt

import numpy as np

bygender = bygender.rename({np.nan:"NaN"})

#index returns the labels, values returns the count for each label

plt.bar(bygender.index,height=bygender.values)

plt.show()

How to handle missing data?

An interesting summary from DataCamp groups the main strategies into deletion and imputation.

Deletion

#deletion

#drop Age; use inplace=True only when you want to modify the original dataframe,
#otherwise assign the result to a new dataframe
newdf = df.drop(['Age'], axis=1)  #, inplace=True)
newdf.info()

#remove the Gender column, again on a copy, so df keeps Gender for the later steps
newdf_2 = df.drop(columns=['Gender'])
newdf_2.head(20)

#remove rows which contain null values in City, Gender or Age
df_1 = df.dropna(subset=['City', 'Gender', 'Age'])
df_1

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   Number   150000 non-null  int64
 1   City     149999 non-null  object
 2   Gender   149993 non-null  object
 3   Income   149992 non-null  float64
 4   Illness  149998 non-null  object
dtypes: float64(1), int64(1), object(3)
memory usage: 5.7+ MB

        Number    City  Gender   Age    Income Illness
0            1  Dallas    Male  41.0   40367.0      No
1            2  Dallas    Male  54.0   45084.0      No
2            3  Dallas    Male  42.0   52483.0      No
3            4  Dallas    Male  40.0   40941.0      No
4            5  Dallas    Male  46.0   50289.0      No
...        ...     ...     ...   ...       ...     ...
149995  149996  Austin    Male  48.0   93669.0      No
149996  149997  Austin    Male  25.0   96748.0      No
149997  149998  Austin    Male  26.0  111885.0      No
149998  149999  Austin    Male  25.0  111878.0      No
149999  150000  Austin  Female  37.0   87251.0      No

149986 rows × 6 columns
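Besides dropping rows with nulls in specific columns, dropna can also drop rows based on how many values are missing; a minimal sketch of two common options (using the same df):

#drop only the rows where every value is missing
df.dropna(how='all')

#keep only the rows that have at least 5 non-null values
df.dropna(thresh=5)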

Data Transformation: Imputation


Mean
Median
Mode (most common value)

#Fill missing Income with mean value - add a new column

df['IncomeFillNa'] = df['Income'].fillna(df['Income'].mean())

#Fill missing Age with median value - add a new column

df['AgeFillNa'] = df['Age'].fillna(df['Age'].median())

#Fill missing Gender with mode value - add a new column

#mode returns a Series, so we take its first value at index 0
df['GenderFillNa'] = df['Gender'].fillna(df['Gender'].mode()[0]) #mode: 0 Male

Male      83802
Female    66198
Name: Gender, dtype: int64

Male
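Instead of one global statistic, missing values can also be filled with a per-group statistic. A minimal sketch, filling missing Income with the mean income of the person's City (the new column name IncomeFillByCity is just illustrative):

#fill missing Income with the mean Income of the corresponding City
df['IncomeFillByCity'] = df['Income'].fillna(df.groupby('City')['Income'].transform('mean'))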

Check for duplicate data

#check if there are duplicates
df.duplicated().value_counts() #or df.duplicated().sum() to get just the number of duplicated rows

False    150000
dtype: int64

#show the duplicate rows, if any (the original dataframe is not affected unless you save the result)
df[df.duplicated(keep=False)] #keep=False marks all duplicates; 'first': all except the first occurrence; 'last': all except the last
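The line above only displays the duplicated rows; to actually remove them, a minimal sketch with drop_duplicates (this toy dataset has no duplicates, so the shape stays the same):

#drop duplicate rows, keeping the first occurrence of each
df_nodup = df.drop_duplicates()
df_nodup.shape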

Data Transformation: continued


From categorical to numerical
Normalization

#convert categorical Gender to numbers (one-hot encoding with get_dummies)
one_hot_gender = pd.get_dummies(df["Gender"])

#add the generated one-hot-encoded columns to a new dataframe
df_1 = pd.concat([df, one_hot_gender], axis=1)

df_1.head(10)

   Number    City  Gender   Age   Income Illness  Female  Male  Female  Male
0     1.0  Dallas    Male  41.0  40367.0      No     NaN   NaN       0     1
1     2.0  Dallas    Male  54.0  45084.0      No     NaN   NaN       0     1
2     3.0  Dallas    Male  42.0  52483.0      No     NaN   NaN       0     1
3     4.0  Dallas    Male  40.0  40941.0      No     NaN   NaN       0     1
4     5.0  Dallas    Male  46.0  50289.0      No     NaN   NaN       0     1
5     6.0  Dallas  Female   NaN  50786.0      No     NaN   NaN       1     0
6     7.0  Dallas  Female  32.0  33155.0      No     NaN   NaN       1     0
7     8.0  Dallas    Male  39.0  30914.0      No     NaN   NaN       0     1
8     9.0  Dallas    Male  51.0  68667.0      No     NaN   NaN       0     1
9    10.0  Dallas  Female  30.0  50082.0      No     NaN   NaN       1     0

#normalization: income and age have very different scales

#simple normalization for both - add new columns
df_1['SimpleIncome'] = df_1['Income'] / df_1['Income'].max()
df_1['SimpleAge'] = df_1['Age'] / df_1['Age'].max()

#min-max normalization for both - add new columns
min = df_1['Income'].min()
max = df_1['Income'].max()
df_1['MinMaxIncome'] = (df_1['Income'] - min) / (max - min)

min = df_1['Age'].min()

max= df_1['Age'].max()

df_1['MinMaxAge'] = (df_1['Age'] - min) / (max - min)

#z-score normalization for both - add new columns

df_1['ZscoreIncome'] = (df_1['Income'] - df_1['Income'].mean()) / df_1['Income'].std()

df_1['ZscoreAge'] = (df_1['Age'] - df_1['Age'].mean()) / df_1['Age'].std()

df_1.head()

   Number    City Gender   Age   Income Illness  Female  Male  Female  Male  Simpl
0     1.0  Dallas   Male  41.0  40367.0      No     NaN   NaN       0     1      0
1     2.0  Dallas   Male  54.0  45084.0      No     NaN   NaN       0     1      0
2     3.0  Dallas   Male  42.0  52483.0      No     NaN   NaN       0     1      0
3     4.0  Dallas   Male  40.0  40941.0      No     NaN   NaN       0     1      0
4     5.0  Dallas   Male  46.0  50289.0      No     NaN   NaN       0     1      0
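The same min-max and z-score normalizations are also available as ready-made transformers; a minimal sketch using scikit-learn, which is not used elsewhere in this notebook (the new column names are just illustrative):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

#min-max scaling of Income and Age in one step (NaN values stay NaN)
scaled = MinMaxScaler().fit_transform(df_1[['Income', 'Age']])
df_1['MinMaxIncome2'] = scaled[:, 0]
df_1['MinMaxAge2'] = scaled[:, 1]

#z-score scaling of Income and Age
zscaled = StandardScaler().fit_transform(df_1[['Income', 'Age']])
df_1['ZscoreIncome2'] = zscaled[:, 0]
df_1['ZscoreAge2'] = zscaled[:, 1]

Note that StandardScaler divides by the population standard deviation, so its z-scores differ very slightly from the pandas .std() version above, which uses the sample standard deviation.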

#normalization using apply and functions

#define a function to perform the simple feature normalization

def simpleNorm(old,max):

  return old/max

#apply the function to age
df_1["Age"].apply(lambda x: simpleNorm(x, df_1["Age"].max())) #slow: apply is not vectorized, and the max is recomputed for every row

#apply the function to income - use an anonymous (lambda) function and compute the max only once
max = df_1["Income"].max()
df_1["Income"].apply(lambda x: x/max)

0         0.227860
1         0.254486
2         0.296251
3         0.231100
4         0.283867
            ...
149995    0.528734
149996    0.546114
149997    0.631558
149998    0.631519
149999    0.492507
Name: Income, Length: 150000, dtype: float64

#export the updated dataframe to csv

df_1.to_csv("updated_data.csv")
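By default to_csv also writes the dataframe index as an extra unnamed column; if that is not wanted, it can be suppressed:

#export without the index column
df_1.to_csv("updated_data.csv", index=False)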
