Feature Engineering
in Machine Learning
Zdeněk Žabokrtský
Institute of Formal and Applied Linguistics,
Charles University in Prague
Used resources
http://www.cs.princeton.edu/courses/archive/spring10/cos424/slides/18-feat.pdf
http://stackoverflow.com/questions/2674430/how-to-engineer-features-for-machine-learning
https://facwiki.cs.byu.edu/cs479/index.php/Feature_engineering
documentation of scikit-learn
wikipedia
Human’s role when applying Machine Learning
Machine learning provides you with extremely powerful tools
for decision making ...
... but until there is a breakthrough in AI, the developer's
decisions will remain crucial.
Your responsibility:
setting up the correct problem to be optimized (it’s far from
straightforward in the real world)
choosing a model
choosing a learning algorithm (or a family of algorithms)
finding relevant data
designing features, feature representation, feature selection . . .
Feature
a feature - a piece of information that is potentially useful for
prediction
Feature engineering
feature engineering - not a formally defined term, just a
vaguely agreed space of tasks related to designing feature sets
for ML applications
two components:
first, understanding the properties of the task you’re trying to
solve and how they might interact with the strengths and
limitations of the model you are going to use
second, experimental work where you test your expectations and
find out what actually works and what doesn't.
Feature engineering in real life
Typically a cycle
1 design a set of features
2 run an experiment and analyze the results on a validation
dataset
3 change the feature set
4 go to step 1
Don’t expect any elegant answers today.
Causes of feature explosion
Feature templates: When designing a feature set, you
usually quickly turn from coding individual features (such as
’this word is preceded by a preposition and a determiner’) to
implementing feature templates (such as ’the two preceding
POSs are X and Y’)
Feature combination: linear models cannot handle some
dependencies between features (e.g. XOR of binary features,
polynomial dependencies of real-valued features) - feature
combinations might work better (see the sketch below).
Both lead to quick growth of the number of features.
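To illustrate feature combination, here is a minimal sketch using scikit-learn's PolynomialFeatures; the toy data are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# two real-valued features per sample (toy data, invented for illustration)
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# degree-2 combinations add x1^2, x1*x2, x2^2 on top of the original features,
# so a linear model can capture simple multiplicative interactions
poly = PolynomialFeatures(degree=2, include_bias=False)
X_comb = poly.fit_transform(X)
print(X_comb)   # columns: x1, x2, x1^2, x1*x2, x2^2
```

Note that with d original features, degree-2 combinations already add on the order of d^2 new columns - exactly the kind of growth discussed next.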
Stop the explosion
There must be some limits, because
Given the limited size of training data, the number of features
that can be efficiently used is hardly unbounded (overfitting).
Sooner or later speed becomes a problem.
Possible solutions to avoid the explosion
feature selection
regularization
kernels
Feature selection
Central assumption: we can identify features that are
redundant or irrelevant.
Let’s just use the best-working subset: arg max_f acc(f), where
acc(f) evaluates the prediction quality of feature subset f on held-out data
Rings a bell? Yes, there’s a set-of-all-subsets problem (NP
hard), so exhaustive search is clearly intractable.
(The former is implicitly used e.g. in top-down induction of
decision trees.)
A side effect of feature reduction: improved model
interpretability.
Feature selection
Basic approaches:
wrapper - search through the space of subsets, train a model
for the current subset, evaluate it on held-out data, iterate...
(a forward-selection sketch follows after this list)
simple greedy search heuristics:
forward selection - start with an empty set, gradually add the
“strongest” features
backward selection - start with the full set, gradually remove
the “weakest” features
computationally expensive
filter - use the N most promising features according to a ranking
resulting from a proxy measure, e.g. from
mutual information
Pearson correlation coefficient
embedded methods - feature selection is a part of model
construction
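A minimal sketch of the wrapper approach (greedy forward selection); the choice of classifier, dataset and cross-validation below is arbitrary and only meant to illustrate the loop:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0

# greedy forward selection: keep adding the feature that improves
# cross-validated accuracy the most, stop when nothing helps
while remaining:
    scores = [(np.mean(cross_val_score(LogisticRegression(max_iter=1000),
                                       X[:, selected + [j]], y, cv=5)), j)
              for j in remaining]
    score, best_j = max(scores)
    if score <= best_score:
        break
    best_score = score
    selected.append(best_j)
    remaining.remove(best_j)

print(selected, best_score)
```

As the slide notes, this is computationally expensive: every candidate subset requires training and evaluating a new model.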
Regularisation
regularisation = introducing penalty for complexity
the more features matter in the model, the bigger the complexity
in other words, try to concentrate the weight mass, don’t
scatter it too much
application of Occam’s razor: the model should be simple
Bayesian view: regularization = imposing this prior knowledge
(“the world is simple”) on parameters
Regularisation
In practice, regularisation is enforced by adding to the cost function
(typically the negative log-likelihood) a term that takes high values
for complex parameter settings:
cost(f) = −l(f) + regularizer(f)
L0 norm: ... + λ · count(w_j ≠ 0) - minimize the number of
features with non-zero weight, the fewer the better
L1 norm: ... + λ · Σ_j |w_j| - minimize the sum of the absolute
values of the weights
L2 norm: ... + λ · ||w|| - minimize the length of the weight
vector
L1/2 norm: ... + λ · √||w||
L∞ norm: ... + λ · max_j |w_j| - penalize only the largest weight
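A minimal sketch of switching between L1 and L2 penalties in scikit-learn's LogisticRegression; the dataset is arbitrary, and note that scikit-learn's C parameter is the inverse of the regularization strength λ:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# L2 (default): shrinks all weights, usually keeps them non-zero
# smaller C = stronger penalty (C is the inverse of lambda)
l2_model = LogisticRegression(penalty='l2', C=1.0, max_iter=1000).fit(X, y)

# L1: drives many weights exactly to zero, i.e. implicit feature selection
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=1.0).fit(X, y)

print(l2_model.coef_)
print(l1_model.coef_)
```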
Experimenting with features in scikit-learn
Encoding categorical features
Turning (a dictionary of) categorical features into a fixed-length
vector:
estimators can be fed only with numbers, not with strings
turn a categorical feature into a one-of-K vector of binary features
preprocessing.OneHotEncoder for one feature after
another
or sklearn.feature_extraction.DictVectorizer for the
whole dataset at once, two steps: fit_transform and
transform
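A minimal sketch of DictVectorizer; the feature names and values are invented for illustration:

```python
from sklearn.feature_extraction import DictVectorizer

# toy feature dictionaries, one per instance (names invented for illustration)
data = [{'pos': 'NOUN', 'case': 'nom'},
        {'pos': 'VERB'},
        {'pos': 'NOUN', 'case': 'acc'}]

vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(data)   # learn the feature space and transform
print(vec.feature_names_)           # e.g. ['case=acc', 'case=nom', 'pos=NOUN', 'pos=VERB']
print(X_train)

# new data must be mapped with transform only, so the columns stay aligned
X_test = vec.transform([{'pos': 'VERB', 'case': 'nom'}])
```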
Feature binarization
thresholding numerical features to get boolean values
preprocessing.Binarizer
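A minimal sketch of preprocessing.Binarizer; the threshold and data are arbitrary:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[0.2, 1.5, -0.3],
              [2.0, 0.0,  0.7]])

# values strictly greater than the threshold become 1, the rest become 0
binarizer = Binarizer(threshold=0.5)
print(binarizer.fit_transform(X))
# [[0. 1. 0.]
#  [1. 0. 1.]]
```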
Feature Discretization
converting continuous features to discrete features
Typically the data is discretized into K partitions of equal
length/width (equal intervals), or into partitions each containing
K% of the total data (equal frequencies).
? in sklearn?
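A partial answer to the question above: newer scikit-learn versions (0.20 and later) provide preprocessing.KBinsDiscretizer, which supports both strategies; a minimal sketch with toy data:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-3.0], [-1.0], [0.5], [2.0], [10.0]])

# strategy='uniform'  -> K bins of equal width (equal intervals)
disc = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
print(disc.fit_transform(X).ravel())

# strategy='quantile' -> K bins with roughly equal numbers of samples (equal frequencies)
disc = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
print(disc.fit_transform(X).ravel())
```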
Dataset standardization
some estimators might work badly if the distributions of values of
different features are radically different (e.g. differ by orders of
magnitude)
solution: transform the data by moving the center (toward
zero mean) and scaling (towards unit variance)
preprocessing.scale
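A minimal sketch of standardization, with preprocessing.scale for a one-shot transform and StandardScaler when the same shift and scale must be reused on new data; the toy data are arbitrary:

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1.0, 200.0],
                    [2.0, 400.0],
                    [3.0, 600.0]])

# one-shot: zero mean, unit variance per column
print(preprocessing.scale(X_train))

# scaler object: fit on training data, reuse the same transform on new data
scaler = preprocessing.StandardScaler().fit(X_train)
print(scaler.transform(np.array([[2.0, 500.0]])))
```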
Vector normalization
Normalization is the process of scaling individual samples to
have unit norm.
solution: rescale each sample (each row of the data matrix) so
that its norm (e.g. L1 or L2) equals one
preprocessing.normalize
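A minimal sketch of preprocessing.normalize; unlike standardization, it rescales each sample (row), not each feature (column):

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[3.0, 4.0],
              [1.0, 0.0]])

# each row is divided by its L2 norm, so every sample has unit length
print(preprocessing.normalize(X, norm='l2'))
# [[0.6 0.8]
#  [1.  0. ]]
```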
Feature selection
Scikit-learn exposes feature selection routines as objects that
implement the transform method:
SelectKBest removes all but the k highest scoring features
SelectPercentile removes all but a user-specified highest
scoring percentile of features
using common univariate statistical tests for each feature:
false positive rate SelectFpr, false discovery rate SelectFdr, or
family wise error SelectFwe.
These objects take as input a scoring function that returns
univariate p-values:
For regression: f_regression
For classification: chi2 or f_classif
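A minimal sketch of SelectKBest with the chi2 score; keeping k=2 features is an arbitrary choice here:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
print(X.shape)                    # (150, 4)

# keep the 2 features with the highest chi2 score w.r.t. the class labels
selector = SelectKBest(chi2, k=2)
X_new = selector.fit_transform(X, y)
print(X_new.shape)                # (150, 2)
print(selector.scores_)           # per-feature chi2 statistics
```

SelectPercentile, SelectFpr, SelectFdr and SelectFwe are used in the same way; only the criterion for keeping features differs.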