1 - DataPreparation.ipynb - Colaboratory


Data preparation is the first step after collecting data, or once the dataset is ready. We have to preprocess the raw data before we can use it for analysis.

Major steps:

Loading data
Cleaning data (removing unnecessary or erroneous data)
Transforming data
Possibly rearranging data

Load the data set and store it in a dataframe

#import pandas
import pandas as pd

# Set pandas' maximum row display to 1000
pd.set_option('display.max_rows', 1000)
# Set pandas' maximum column display to 50
pd.set_option('display.max_columns', 50)

#Loading an .xlsx data set into a dataframe with a meaningful name
#df = pd.read_excel('Data.xlsx', sheet_name="Sheet1")
#or read a csv file
#Loading a .csv data set, assuming the file is in the same folder; if not, refer to it by its path
#df_csvData = pd.read_csv('horror-train.csv')
df = pd.read_csv('toy_dataset.zip', compression='zip')
df.head() # first 5 rows

   Number    City Gender   Age   Income Illness
0       1  Dallas   Male  41.0  40367.0      No
1       2  Dallas   Male  54.0  45084.0      No
2       3  Dallas   Male  42.0  52483.0      No
3       4  Dallas   Male  40.0  40941.0      No
4       5  Dallas   Male  46.0  50289.0      No

Exploration phase

What does the data look like?

Is there missing data?
What is the distribution of the data?

# get info about the attributes and their types

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 6 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   Number   150000 non-null  int64
 1   City     149999 non-null  object
 2   Gender   149993 non-null  object
 3   Age      149994 non-null  float64
 4   Income   149992 non-null  float64
 5   Illness  149998 non-null  object
dtypes: float64(2), int64(1), object(3)
memory usage: 6.9+ MB

#we can also find the total number of non-null values per column using count

df.count()

Number     150000
City       149999
Gender     149993
Age        149994
Income     149992
Illness    149998
dtype: int64
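df.count() gives the number of non-null values per column; a complementary, minimal sketch (using the same df) to count the missing values directly:

#count missing (NaN) values in each column
df.isna().sum()

#or simply check whether each column has any missing values
df.isna().any()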

#get statistical summary info about numeric attributes

df.describe()

              Number            Age         Income
count  150000.000000  149994.000000  149992.000000
mean    75000.500000      44.950058   91253.820697
std     43301.414527      11.572402   24988.333860
min         1.000000      25.000000    -654.000000
25%     37500.750000      35.000000   80869.000000
50%     75000.500000      45.000000   93655.000000
75%    112500.250000      55.000000  104519.000000
max    150000.000000      65.000000  177157.000000
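Note that the minimum Income is -654, which looks like an erroneous entry (incomes should not be negative). A minimal sketch for inspecting such values, assuming negative incomes are invalid in this dataset:

#inspect the rows with a negative Income (likely data-entry errors)
df[df['Income'] < 0]

#one option (not applied here) is to mark them as missing so they are handled by the imputation steps below
#df.loc[df['Income'] < 0, 'Income'] = None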

#gender distribution: #of records for male and female and nan if any

df.groupby("Gender",dropna=False).size()

Gender
Female    66198
Male      83795
dtype: int64

#plot this distribution

#create a new variable that holds the Gender's value counts

bygender=df["Gender"].value_counts(dropna=False)

#!pip install matplotlib --upgrade #bar_label (used below to show the numbers above the bars) requires matplotlib 3.4+
#bar plot using the default plot function in pandas (based on matplotlib)

ax = bygender.plot(kind = 'bar') #returns Axes object

#or

#bygender.plot.bar()

#give it a title

ax.set_title("Bar Graph of Gender")

#assign a label to x axis

ax.set_xlabel('Gender')

#assign a label to y axis

ax.set_ylabel('Number of People')

#show the data labels

ax.bar_label(ax.containers[0]) #requires matplotlib version 3.4 or newer

[Text(0, 0, '83795'), Text(0, 0, '66198'), Text(0, 0, '7')]

#or using matplotlib pyplot

#bygender has np.nan among its index labels; we rename it, otherwise plt.bar raises an error

import matplotlib.pyplot as plt

import numpy as np

bygender = bygender.rename({np.nan:"NaN"})

#index returns the labels, values returns the count for each label

plt.bar(bygender.index,height=bygender.values)

plt.show()

How to handle missing data?

An interesting summary from DataCamp groups the main strategies into deletion and imputation.

Deletion

#deletion

#drop Age; use inplace=True only when you want to modify the original dataframe,
#otherwise assign the result to a new dataframe
newdf = df.drop(['Age'], axis=1)  #, inplace=True)
newdf.info()

#remove the Gender column, again on a copy, so df keeps Gender for the later steps
newdf_2 = df.drop(columns=['Gender'])
newdf_2.head(20)

#remove rows which contain null values in City, Gender or Age
df_1 = df.dropna(subset=['City', 'Gender', 'Age'])
df_1

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   Number   150000 non-null  int64
 1   City     149999 non-null  object
 2   Gender   149993 non-null  object
 3   Income   149992 non-null  float64
 4   Illness  149998 non-null  object
dtypes: float64(1), int64(1), object(3)
memory usage: 5.7+ MB

        Number    City  Gender   Age    Income Illness
0            1  Dallas    Male  41.0   40367.0      No
1            2  Dallas    Male  54.0   45084.0      No
2            3  Dallas    Male  42.0   52483.0      No
3            4  Dallas    Male  40.0   40941.0      No
4            5  Dallas    Male  46.0   50289.0      No
...        ...     ...     ...   ...       ...     ...
149995  149996  Austin    Male  48.0   93669.0      No
149996  149997  Austin    Male  25.0   96748.0      No
149997  149998  Austin    Male  26.0  111885.0      No
149998  149999  Austin    Male  25.0  111878.0      No
149999  150000  Austin  Female  37.0   87251.0      No

149986 rows × 6 columns
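Besides dropping rows with nulls in specific columns, dropna can also drop rows based on how many values are missing; a minimal sketch of two common options (using the same df):

#drop only the rows where every value is missing
df.dropna(how='all')

#keep only the rows that have at least 5 non-null values
df.dropna(thresh=5)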

Data Transformation: Imputation


Mean
Median
Mode (most common value)

#Fill missing Income with mean value - add a new column

df['IncomeFillNa'] = df['Income'].fillna(df['Income'].mean())

#Fill missing Age with median value - add a new column

df['AgeFillNa'] = df['Age'].fillna(df['Age'].median())

#Fill missing Gender with mode value - add a new column

#mode returns a Series, so we take its first value at index 0
df['GenderFillNa'] = df['Gender'].fillna(df['Gender'].mode()[0]) #mode: 0 Male

Male      83802
Female    66198
Name: Gender, dtype: int64

Male
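Instead of one global statistic, missing values can also be filled with a per-group statistic. A minimal sketch, filling missing Income with the mean income of the person's City (the new column name IncomeFillByCity is just illustrative):

#fill missing Income with the mean Income of the corresponding City
df['IncomeFillByCity'] = df['Income'].fillna(df.groupby('City')['Income'].transform('mean'))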

Check for duplicate data

#check if there are duplicates
df.duplicated().value_counts() #or df.duplicated().sum() to get just the number of duplicated rows

False    150000
dtype: int64

#show the duplicate rows, if any (the original dataframe is not affected unless you save the result)
df[df.duplicated(keep=False)] #keep=False marks all duplicates; 'first': all except the first occurrence; 'last': all except the last
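The line above only displays the duplicated rows; to actually remove them, a minimal sketch with drop_duplicates (this toy dataset has no duplicates, so the shape stays the same):

#drop duplicate rows, keeping the first occurrence of each
df_nodup = df.drop_duplicates()
df_nodup.shape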

Data Transformation: continued


From categorical to numerical
Normalization

#convert categorical Gender to numbers (one-hot encoding with get_dummies)
one_hot_gender = pd.get_dummies(df["Gender"])

#add the generated one-hot-encoded columns to a new dataframe
df_1 = pd.concat([df, one_hot_gender], axis=1)

df_1.head(10)

   Number    City  Gender   Age   Income Illness  Female  Male  Female  Male
0     1.0  Dallas    Male  41.0  40367.0      No     NaN   NaN       0     1
1     2.0  Dallas    Male  54.0  45084.0      No     NaN   NaN       0     1
2     3.0  Dallas    Male  42.0  52483.0      No     NaN   NaN       0     1
3     4.0  Dallas    Male  40.0  40941.0      No     NaN   NaN       0     1
4     5.0  Dallas    Male  46.0  50289.0      No     NaN   NaN       0     1
5     6.0  Dallas  Female   NaN  50786.0      No     NaN   NaN       1     0
6     7.0  Dallas  Female  32.0  33155.0      No     NaN   NaN       1     0
7     8.0  Dallas    Male  39.0  30914.0      No     NaN   NaN       0     1
8     9.0  Dallas    Male  51.0  68667.0      No     NaN   NaN       0     1
9    10.0  Dallas  Female  30.0  50082.0      No     NaN   NaN       1     0

#normalization: income and age have very different scales

#simple normalization for both - add new columns
df_1['SimpleIncome'] = df_1['Income'] / df_1['Income'].max()
df_1['SimpleAge'] = df_1['Age'] / df_1['Age'].max()

#min-max normalization for both - add new columns
min = df_1['Income'].min()
max = df_1['Income'].max()
df_1['MinMaxIncome'] = (df_1['Income'] - min) / (max - min)

min = df_1['Age'].min()

max= df_1['Age'].max()

df_1['MinMaxAge'] = (df_1['Age'] - min) / (max - min)

#z-score normalization for both - add new columns

df_1['ZscoreIncome'] = (df_1['Income'] - df_1['Income'].mean()) / df_1['Income'].std()

df_1['ZscoreAge'] = (df_1['Age'] - df_1['Age'].mean()) / df_1['Age'].std()

df_1.head()

   Number    City Gender   Age   Income Illness  Female  Male  Female  Male  Simpl
0     1.0  Dallas   Male  41.0  40367.0      No     NaN   NaN       0     1      0
1     2.0  Dallas   Male  54.0  45084.0      No     NaN   NaN       0     1      0
2     3.0  Dallas   Male  42.0  52483.0      No     NaN   NaN       0     1      0
3     4.0  Dallas   Male  40.0  40941.0      No     NaN   NaN       0     1      0
4     5.0  Dallas   Male  46.0  50289.0      No     NaN   NaN       0     1      0
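The same min-max and z-score normalizations are also available as ready-made transformers; a minimal sketch using scikit-learn, which is not used elsewhere in this notebook (the new column names are just illustrative):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

#min-max scaling of Income and Age in one step (NaN values stay NaN)
scaled = MinMaxScaler().fit_transform(df_1[['Income', 'Age']])
df_1['MinMaxIncome2'] = scaled[:, 0]
df_1['MinMaxAge2'] = scaled[:, 1]

#z-score scaling of Income and Age
zscaled = StandardScaler().fit_transform(df_1[['Income', 'Age']])
df_1['ZscoreIncome2'] = zscaled[:, 0]
df_1['ZscoreAge2'] = zscaled[:, 1]

Note that StandardScaler divides by the population standard deviation, so its z-scores differ very slightly from the pandas .std() version above, which uses the sample standard deviation.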

#normalization using apply and functions

#define a function to perform the simple feature normalization

def simpleNorm(old,max):

  return old/max

#apply the function to age
df_1["Age"].apply(lambda x: simpleNorm(x, df_1["Age"].max())) #slow: apply is not vectorized, and the max is recomputed for every row

#apply the function to income - use an anonymous (lambda) function and compute the max only once
max = df_1["Income"].max()
df_1["Income"].apply(lambda x: x/max)

0         0.227860
1         0.254486
2         0.296251
3         0.231100
4         0.283867
            ...
149995    0.528734
149996    0.546114
149997    0.631558
149998    0.631519
149999    0.492507
Name: Income, Length: 150000, dtype: float64

#export the updated dataframe to csv

df_1.to_csv("updated_data.csv")
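By default to_csv also writes the dataframe index as an extra unnamed column; if that is not wanted, it can be suppressed:

#export without the index column
df_1.to_csv("updated_data.csv", index=False)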
