Chapter 02 Overview (Python)
Introduction to Business Analytics
Rasha Alahmad
Chapter 2: Overview of the Data Mining Process
Core Ideas in Data Mining
o Classification
o Prediction
o Association Rules & Recommenders
o Data and Dimension Reduction
o Data Exploration and Visualization
Core Ideas in Data Mining
Classification
Recommendation Systems
Analyze individual user preferences and behaviors to make
personalized suggestions.
o Amazon and Netflix use these systems to recommend products or
shows based on past interactions.
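A minimal sketch of the idea, assuming a simple "users with overlapping history" rule. The user names, show names, and the `recommend` helper are hypothetical and are not how Amazon or Netflix actually work.

```python
from collections import Counter

# Hypothetical viewing history: user -> set of shows already watched
history = {
    "ana": {"show_a", "show_b", "show_c"},
    "ben": {"show_a", "show_b"},
    "cy": {"show_b", "show_d"},
}

def recommend(user, history, top_n=2):
    """Suggest unseen items, weighted by how much other users overlap with `user`."""
    seen = history[user]
    scores = Counter()
    for other, items in history.items():
        if other == user:
            continue
        overlap = len(seen & items)  # shared past interactions
        if overlap == 0:
            continue
        for item in items - seen:    # only items the user has not seen
            scores[item] += overlap
    return [item for item, _ in scores.most_common(top_n)]

print(recommend("ben", history))  # ['show_c', 'show_d']
```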
Data Visualization
Explore data by creating charts and dashboards.
Steps in Data Mining
1. Develop an understanding of the purpose of your project
2. Obtain the dataset to be used in the analysis
3. Explore, clean, and preprocess the data
4. Reduce the data dimension, if necessary
5. Determine the data mining task (e.g., classification, prediction)
6. Choose the data mining techniques to be used (regression, neural nets, etc.)
7. Use algorithms to perform the task
8. Interpret the results of the algorithms
9. Deploy the model
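The nine steps above can be sketched end to end on a toy problem. This is an illustration only: the data are synthetic, the column names are borrowed from the housing example, and a straight-line fit with `numpy.polyfit` stands in for whichever technique a real project would choose.

```python
import numpy as np
import pandas as pd

# Steps 1-2: purpose = predict a home's value; obtain data (synthetic here)
rng = np.random.default_rng(0)
area = rng.uniform(800, 3000, size=50)
value = 100 + 0.15 * area + rng.normal(0, 5, size=50)
df = pd.DataFrame({"LIVING_AREA": area, "TOTAL_VALUE": value})
df.loc[3, "LIVING_AREA"] = np.nan  # simulate one missing entry

# Step 3: explore, clean, and preprocess (drop incomplete rows)
df = df.dropna()

# Step 4: no dimension reduction needed with a single predictor

# Steps 5-6: task = prediction; technique = simple linear regression
# Step 7: run the algorithm (least-squares fit of a straight line)
slope, intercept = np.polyfit(df["LIVING_AREA"], df["TOTAL_VALUE"], deg=1)

# Step 8: interpret the result
print(f"Each extra unit of living area adds about {slope:.3f} to the prediction")

# Step 9: deploy, i.e. apply the fitted model to a new record
def predict_value(living_area):
    return intercept + slope * living_area

print(predict_value(1500))
```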
Dataset: WestRoxbury.csv
14 Variables/Columns
5803 Records/Rows
Preliminary Exploration in Python
loading data, viewing it, summary statistics
Load data
import pandas as pd
housing_df = pd.read_csv('WestRoxbury.csv')
housing_df.shape  # number of rows and columns
housing_df.head()  # show the first five rows
print(housing_df)  # show the data (pandas abbreviates large tables)
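If WestRoxbury.csv is not at hand, the same calls can be tried on a small in-memory stand-in. The three rows below are made up for illustration, and `sample_df` is a hypothetical name.

```python
import io
import pandas as pd

# Made-up three-row stand-in for WestRoxbury.csv (not the real data)
csv_text = """TOTAL VALUE ,TAX,LOT SQFT ,ROOMS
344.2,4330,9965,6
412.6,5190,6590,10
330.1,4152,7500,8
"""

sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df.shape)          # (rows, columns)
print(sample_df.head(2))        # first two rows
print(list(sample_df.columns))  # header names keep their trailing spaces
```

Note that `read_csv` keeps any trailing spaces in the header, which is why the column names need cleaning before use.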
Column Names
'TOTAL VALUE ', 'TAX', 'LOT SQFT ', 'YR BUILT',
'GROSS AREA ', 'LIVING AREA', 'FLOORS ', 'ROOMS',
'BEDROOMS ', 'FULL BATH', 'HALF BATH', 'KITCHEN',
'FIREPLACE', 'REMODEL'
Data Exploration in Python, cont.
Rename columns: replace spaces with '_'
housing_df = housing_df.rename(columns={'TOTAL VALUE ': 'TOTAL_VALUE'})  # explicit
housing_df.columns = [s.strip().replace(' ', '_') for s in housing_df.columns]  # all columns
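A small sketch of both renaming approaches on a made-up two-row table (`demo_df` and its values are hypothetical):

```python
import pandas as pd

demo_df = pd.DataFrame({
    'TOTAL VALUE ': [344.2, 412.6],
    'LOT SQFT ': [9965, 6590],
    'TAX': [4330, 5190],
})

# Explicit: rename one column; rename() returns a new DataFrame
renamed = demo_df.rename(columns={'TOTAL VALUE ': 'TOTAL_VALUE'})
print(list(renamed.columns))  # ['TOTAL_VALUE', 'LOT SQFT ', 'TAX']

# All columns: strip outer whitespace, then replace inner spaces with '_'
demo_df.columns = [s.strip().replace(' ', '_') for s in demo_df.columns]
print(list(demo_df.columns))  # ['TOTAL_VALUE', 'LOT_SQFT', 'TAX']
```

Calling `strip()` first matters: without it, 'LOT SQFT ' would become 'LOT_SQFT_'.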
Show first four rows of the data
housing_df.loc[0:3] # inclusive
housing_df.iloc[0:4] # exclusive
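The two calls agree only because the default row labels are 0, 1, 2, ... A sketch with made-up values shows the difference: `loc` slices by label (end included), `iloc` slices by position (end excluded).

```python
import pandas as pd

demo_df = pd.DataFrame({'TOTAL_VALUE': [344.2, 412.6, 330.1, 498.6, 331.5, 337.4]})

by_label = demo_df.loc[0:3]      # labels 0, 1, 2, 3 (inclusive)
by_position = demo_df.iloc[0:4]  # positions 0, 1, 2, 3 (4 is excluded)
print(by_label.equals(by_position))  # True

# With row labels 10..15, the label slice must change but the position slice does not
shifted = demo_df.copy()
shifted.index = range(10, 16)
print(shifted.loc[10:13].equals(shifted.iloc[0:4]))  # True
```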
Data Exploration in Python, cont.
Show the first 10 values in column TOTAL_VALUE
housing_df['TOTAL_VALUE'].iloc[0:10]
# show length of first column
print('Number of rows', len(housing_df['TOTAL_VALUE']))
# show mean of column
print('Mean of TOTAL_VALUE', housing_df['TOTAL_VALUE'].mean())
Show summary statistics for each column
housing_df.describe()
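A quick sketch of what `describe()` returns, using a made-up four-row table (`demo_df` is hypothetical). Each numeric column gets a count, mean, standard deviation, minimum, quartiles, and maximum.

```python
import pandas as pd

demo_df = pd.DataFrame({
    'TOTAL_VALUE': [300.0, 400.0, 500.0, 600.0],
    'ROOMS': [6, 7, 8, 9],
})

summary = demo_df.describe()
print(summary)

# Individual statistics can be read by row label and column name
print(summary.loc['mean', 'TOTAL_VALUE'])  # 450.0, same as demo_df['TOTAL_VALUE'].mean()
```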