Lecture 2.2 Example Data Preparation Feature Engineering

The document provides an overview of machine learning, explaining its goal of learning patterns from examples and generalizing them to new instances. It distinguishes between supervised and unsupervised learning, detailing the processes involved in model fitting, including data splitting, tuning, and evaluation. Additionally, it discusses various algorithms used in supervised learning and emphasizes the importance of avoiding overfitting while maximizing model performance.

Uploaded by

revaldochetie092
Copyright
© © All Rights Reserved
Feature Engineering

What is Machine Learning?

Simple definition: How machines learn rules from examples.

Goal of any machine learning:

• Learn patterns from examples


• Be able to generalize them to new examples

Supervised and unsupervised machine learning:

→ In both cases learning is achieved through examples!


What is Machine Learning?
Formal definition: A program is said to learn from experience E with regard to task T and performance measure P if its performance on task T, as measured by P, improves with experience E.

#  Task                              Experience                  Performance measure
1  Recognize handwritten digits      Set of digits with labels   Percent of correct recognitions
2  Predict length of hospital stay   Patient histories           Mean prediction error
3  Recommend Netflix shows           Viewing histories           Number of users viewing show
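As a minimal sketch of the first two performance measures in the table, the functions below compute a percent-correct score and a mean prediction error. The function names and sample values are hypothetical, and "mean prediction error" is interpreted here as mean absolute error, which is only one possible reading.

```python
def percent_correct(predicted, actual):
    """Share of predictions that match the true labels, as a percentage."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * matches / len(actual)


def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and true numeric values."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical values: digit labels (task 1) and lengths of stay in days (task 2).
digit_accuracy = percent_correct([3, 7, 1, 4], [3, 7, 7, 4])
stay_error = mean_absolute_error([2.0, 5.0, 3.0], [3.0, 5.0, 1.0])
print(digit_accuracy, stay_error)
```

In the formal definition's terms, learning means these numbers get better (higher accuracy, lower error) as the program sees more experience E.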
Some motivating examples

• Early detection of disease outbreaks
• Identifying vulnerable buildings for retrofit
• Predicting transport demand
• Preventing violent crime
• Reducing CO2 emissions
• Targeting fire risk inspections

Acknowledgment: D. Neill, Machine Learning for Cities, CUSP NYU


An approach to model fitting
Five main steps:

1. Use Case & Data: Determine question of interest, get informative data.
2. Model training: Split data into training and test sets. Fit model to training set.
3. Tune (calibrate): Tune model parameters.
4. Predict: Use the tuned model to form predictions about your test set.
5. Evaluate: Compare the predictions with the actual values for the test set.

Variables of interest are either categorical (handled by classification) or numerical (handled by regression).
Unsupervised Learning
• The only thing we have is input data.
• Labels are not provided by a supervisor.

What does an unsupervised algorithm do?


Extract patterns in the data.
Create clusters whose members are similar (based on some set of measurements).

→ Example: Take raw data on visitors to my website.
→ Segment them into groups that share the same characteristics; target ads.
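The website example can be sketched with k-means, one common clustering algorithm (the slides do not name a specific one). The visitor features (pages per visit, minutes on site), the data values, and the starting centroids are all hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-centre."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Squared Euclidean distance from p to every centroid.
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Move each centroid to the mean of its members (keep it if the cluster is empty).
        centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical visitors: (pages per visit, minutes on site). No labels are given.
visitors = [(2, 1.0), (3, 1.5), (2, 0.5), (12, 9.0), (11, 8.0), (13, 10.0)]
centroids, clusters = kmeans(visitors, centroids=[visitors[0], visitors[3]])
print(clusters)
```

No labels are supplied anywhere: the grouping into casual and heavy visitors comes only from similarity between the measurements.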
Supervised Learning
• The machine learns from examples that have already been labelled.
• Each example has input values (attributes) and an output value.

Example: A spam classifier learns rules from a training set of labelled emails.

Goal:
→ Use known output values to learn the patterns of the input.
→ Predict the output value of new examples.

Image credit: Géron, Hands-On Machine Learning


Supervised Learning algorithms

Linear regression
• Models output as linear combination of inputs
→ Fast to train, effective on high-dimensional data.
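As a minimal sketch, the simplest case of a linear combination of inputs is a single input with a slope and an intercept, fitted by ordinary least squares. The function name and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y divided by variance of x gives the slope.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data generated from y = 2x + 1 without noise.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
prediction = slope * 5 + intercept  # predicted output for a new input x = 5
print(slope, intercept, prediction)
```

With several inputs the same idea applies, with one weight per input instead of a single slope.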

Support Vector Machines
• Learns a decision boundary (linear or non-linear)
→ Suits complex, medium-size datasets.

Decision trees and Random Forest
• Builds flow-chart style rules that maximize information gain
→ High predictive power, requires less data preparation.
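Information gain can be illustrated with the usual entropy-based definition: the entropy of the labels before a split minus the size-weighted entropy of the groups after it. The labels and the two candidate splits below are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(labels, groups):
    """Entropy before the split minus the weighted entropy of the resulting groups."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder


labels = ["bike", "bike", "car", "car"]
perfect_split = [["bike", "bike"], ["car", "car"]]   # each group is pure
useless_split = [["bike", "car"], ["bike", "car"]]   # groups are as mixed as before

gain_perfect = information_gain(labels, perfect_split)
gain_useless = information_gain(labels, useless_split)
print(gain_perfect, gain_useless)
```

A tree-building algorithm would evaluate many candidate rules at each node and keep the one with the largest gain.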

Neural networks
• Algorithms inspired by structure and function of the brain.
• Scalable, highly accurate on tasks like image recognition.
Building a model
Use Case & Data | Model training | Tune | Predict | Evaluate
Build labeled dataset for question of interest
Split training and test data

When fitting ML algorithms, it is common to separate data into training and test sets.
Dataset
• Split the dataset (e.g. 70/30 ratio) into a training set (70% of records) and a test set (30%)
• Build model on the training set
• Evaluate model on the test set

Image credit: D. Ziganto “Standard Deviations” blog
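A 70/30 split can be sketched with the standard library alone. The function name, the fixed seed, and the stand-in records are assumptions for illustration; libraries such as scikit-learn provide their own splitting utilities.

```python
import random


def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records, then hold out a fraction as the test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(100))  # stand-in for 100 labelled examples
train, test = train_test_split(records)
print(len(train), len(test))
```

Shuffling before splitting matters when the records are stored in a meaningful order, such as by date or by class.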


Complexity vs. accuracy
We can build models of lower or higher complexity by
changing their hyper-parameters.
Aim for the ‘sweet spot’ that maximizes performance but
avoids overfitting.

* Overfitting: a complex model that memorizes the training set (including noise in it) but fails to generalize to new data.
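Overfitting can be illustrated with a k-nearest-neighbours classifier, an algorithm the slides do not cover and that is used here only because its hyper-parameter k controls complexity directly (smaller k means a more complex model). The data are hypothetical, and one training point is deliberately mislabelled to act as noise.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Majority label among the k training points closest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, data, k):
    """Share of rows in `data` whose label the k-NN model predicts correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)


# Hypothetical data: commute distance in km -> mode of transport.
# The point (3, "car") is deliberate noise among short, bike-friendly commutes.
train = [(1, "bike"), (2, "bike"), (3, "car"), (4, "bike"),
         (7, "car"), (8, "car"), (9, "car")]
test = [(2.6, "bike"), (3.2, "bike"), (8.4, "car")]

for k in (1, 3):
    print(k, accuracy(train, train, k), accuracy(train, test, k))
```

With k = 1 the model reproduces every training label, including the noisy one, yet it does worse on the held-out points than the smoother k = 3 model. That gap between training and test performance is the signature of overfitting.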
Make predictions

With the model tuned and fitted to the training data, we can predict outcomes for the test set, check that its performance is satisfactory, and deploy.

Figure: Object detection in images

Image: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2016)
EXAMPLE: Predicting mode of transport and music taste
Decision tree model for transport planning
Scenario: The World Bank has hired a cohort of 100 new staff, who start this
summer. GSD needs to decide how many bike racks or parking spaces to build for
them.

Training set columns: attributes (X1 … XN) and target variable (y).

From this training set, construct a set of rules to predict mode of transport for unseen examples.
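As a rough sketch of learning rules from a training set, the code below fits a decision stump (a tree with a single split) and applies it to unseen examples. The attributes (age, commute distance in km), all data values, and the function names are hypothetical, and the split is chosen by counting misclassifications, a simpler criterion than information gain.

```python
from collections import Counter


def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels):
    """Find the single rule 'attribute <= threshold' that misclassifies the fewest rows."""
    best = None
    for col in range(len(rows[0])):
        for threshold in sorted({r[col] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[col] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[col] > threshold]
            if not left or not right:
                continue  # a split must put rows on both sides
            left_label, right_label = majority(left), majority(right)
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, col, threshold, left_label, right_label)
    return best[1:]


def predict(stump, row):
    """Apply the learned rule to one unseen example."""
    col, threshold, left_label, right_label = stump
    return left_label if row[col] <= threshold else right_label


# Hypothetical training set: (age, commute distance in km) -> mode of transport.
rows = [(25, 3), (31, 4), (45, 5), (29, 12), (52, 15), (38, 20)]
labels = ["bike", "bike", "bike", "car", "car", "car"]
stump = fit_stump(rows, labels)

# Hypothetical unseen staff; counting predictions sizes bike racks vs parking spaces.
new_staff = [(27, 2), (40, 18), (33, 4)]
predictions = [predict(stump, row) for row in new_staff]
print(stump, Counter(predictions))
```

A full decision tree repeats this search inside each resulting group, producing the flow-chart of rules described earlier.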
Decision tree model for transport planning
Scenario: The World Bank has hired a talented cohort of 100 new staff, who start
after Thanksgiving. GSD needs to decide how many bike racks or parking spaces to
build for them.

Mix of home states and ages



High enjoyment of Netflix

