Machine Learning Lecture1 - 26-27 Aug

This document discusses machine learning concepts like supervised and unsupervised learning, classification, regression, and clustering. It then outlines the steps to take in a machine learning project, including defining the problem, preparing data, evaluating algorithms, and improving results.

Uploaded by

kundan kumar


PYTHON & MACHINE LEARNING - 1
Sarbani Maiti | [email protected]
Technology Leader (AI, ML, Cloud, Devops)
Associate Director at a Leading IT Company
Owner of NexZenBot Robotics & STEAM Education provider
PYTHON MODULES REQUIRED FOR ML
import scipy
import sklearn

MACHINE LEARNING
LECTURE – 26 & 27 AUG
MACHINE LEARNING
Supervised and Unsupervised Learning
Supervised learning: Applications in which the training data
comprises examples of the input vectors along with their
corresponding target vectors are known as supervised learning
problems.
Unsupervised learning: In other pattern recognition
problems, the training data consists of a set of input vectors x
without any corresponding target values. The goal in such
unsupervised learning problems may be to discover groups of
similar examples within the data; this is called clustering.
Classification – predicting an object's category:
assign a category to each object (OCR, text
classification, speech recognition).

Regression – predicting a specific point on a
numeric axis (prices, stock values, economic
variables, ratings).

Clustering – partitioning data into homogeneous groups
(analysis of very large data sets, market
segmentation).
HOW TO START A ML PROBLEM

Project Based Learning

Create your own Portfolio
The Systematic Process For Working Through Predictive Modeling
Problems that Delivers Above Average Results

• Define the Problem

Step 1: What is the problem? Describe the problem informally and formally
and list assumptions and similar problems.
Step 2: Why does the problem need to be solved? List your motivation for
solving the problem, the benefits a solution provides and how the solution
will be used.
Step 3: How would I solve the problem? Describe how the problem would
be solved manually to flush out domain knowledge.

• Prepare Data

Step 1: Data Selection: Consider what data is available, what data is missing
and what data can be removed.
Step 2: Data Preprocessing: Organize your selected data by formatting,
cleaning and sampling from it.
Step 3: Data Transformation: Transform preprocessed data ready for
machine learning by engineering features using scaling, attribute
decomposition and attribute aggregation.

• Spot Check Algorithms

Load a set of standard machine learning algorithms into the test harness and perform a
formal experiment: run roughly 10-20 standard algorithms from all the major algorithm families across all the
transformed and scaled versions of the prepared dataset.

The goal of spot checking is to flush out the types of algorithms and dataset combinations that are good
at picking out the structure of the problem so that they can be studied in more detail with focused
experiments.

More focused experiments with well-performing families of algorithms may be performed in this step, but
algorithm tuning is left for the next step.

• Improve Results / Optimization

Algorithm Tuning: where discovering the best models is treated like
a search problem through model parameter space.

Ensemble Methods: where the predictions made by multiple
models are combined.

Extreme Feature Engineering: where the attribute decomposition
and aggregation seen in data preparation is pushed to the limits.

• Present Results
Context (Why): Define the environment in which the problem exists and set up the motivation for the
research question.

Problem (Question): Concisely describe the problem as a question that you went out and answered.

Solution (Answer): Concisely describe the solution as an answer to the question you posed in the previous
section. Be specific.

Findings: Bulleted lists of discoveries you made along the way that interest the audience. They may be
discoveries in the data, methods that did or did not work, or the model performance benefits you
achieved along your journey.

Limitations: Consider where the model does not work or questions that the model does not answer. Do
not shy away from these questions; a claim about where the model excels is more trusted if you can also define
where it does not excel.

Conclusions (Why+Question+Answer): Revisit the “why”, the research question and the answer you discovered
in a tight little package that is easy to remember and repeat for yourself and others.
PROJECT-1
pima-indians-diabetes.data.csv
•Lesson 1: Download and Install Python and SciPy ecosystem.
•Lesson 2: Get Around In Python, NumPy, Matplotlib and Pandas.
•Lesson 3: Load Data From CSV.
•Lesson 4: Understand Data with Descriptive Statistics.
•Lesson 5: Understand Data with Visualization.
•Lesson 6: Prepare For Modeling by Pre-Processing Data.
•Lesson 7: Algorithm Evaluation With Resampling Methods.
•Lesson 8: Algorithm Evaluation Metrics.
•Lesson 9: Spot-Check Algorithms.
•Lesson 10: Model Comparison and Selection.
•Lesson 11: Improve Accuracy with Algorithm Tuning.
•Lesson 12: Improve Accuracy with Ensemble Predictions.
•Lesson 13: Finalize And Save Your Model.

Linear Regression, k Nearest Neighbor (KNN), LogisticRegression, Linear-discriminant-analysis (LDA)

Lesson 4: Understand Data with Descriptive Statistics

•Understand your data using the head() function to look at the first few rows.
•Review the dimensions of your data with the shape property.
•Look at the data types for each attribute with the dtypes property.
•Review the distribution of your data with the describe() function.
•Calculate pairwise correlation between your variables using the corr() function.
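
The calls listed above can be tried on any pandas DataFrame. The sketch below uses a small hand-typed table in place of the course CSV; the column names and values are illustrative assumptions, not the real file's header.

```python
import pandas as pd

# Small hand-typed table standing in for the real CSV (names and values are illustrative).
data = pd.DataFrame({
    "preg": [6, 1, 8, 1, 0],
    "plas": [148, 85, 183, 89, 137],
    "mass": [33.6, 26.6, 23.3, 28.1, 43.1],
    "class": [1, 0, 1, 0, 1],
})

print(data.head(3))      # first three rows
print(data.shape)        # (number of rows, number of columns)
print(data.dtypes)       # data type of each attribute
print(data.describe())   # count, mean, std, min, quartiles, max per column
print(data.corr())       # pairwise Pearson correlation between columns
```

With the real dataset, the same calls would follow a pandas.read_csv() of the file.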
Lesson 5: Understand Data with Visualization
•Use the hist() function to create a histogram of each attribute.
•Use the plot(kind='box') function to create box-and-whisker plots of each attribute.
•Use the pandas.plotting.scatter_matrix() function to create pairwise scatterplots of all attributes.

Lesson 6: Prepare For Modeling by Pre-Processing Data

• Sometimes you need to preprocess your data in order to best present the inherent structure of the problem
in your data to the modeling algorithms. In today’s lesson, you will use the pre-processing capabilities
provided by scikit-learn.
• The scikit-learn library provides two standard idioms for transforming data, each useful in
different circumstances: Fit and Multiple Transform (call fit() once, then transform() as often as needed)
and Combined Fit-And-Transform (fit_transform()).
• There are many techniques that you can use to prepare your data for modeling. For example, try out some
of the following:
• Standardize numerical data (e.g. mean of 0 and standard deviation of 1) using StandardScaler.
• Normalize numerical data (e.g. to a range of 0-1) using MinMaxScaler.
• Explore more advanced feature engineering such as binarizing with Binarizer.
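
As a minimal sketch of the two idioms and the three transforms above, the example below uses scikit-learn's StandardScaler, MinMaxScaler and Binarizer on a tiny made-up array. The array values and the binarizing threshold are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, MinMaxScaler, StandardScaler

# Made-up data: 4 rows, 2 numeric attributes on very different scales.
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [4.0, 500.0]])

# Idiom 1: fit once, then transform (the fitted scaler can be reused on new data).
scaler = StandardScaler().fit(X)
X_std = scaler.transform(X)          # each column now has mean 0 and std 1

# Idiom 2: combined fit-and-transform in a single call.
X_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Binarize: values above the threshold become 1, all others 0.
X_bin = Binarizer(threshold=2.5).fit_transform(X)

print(X_std)
print(X_minmax)
print(X_bin)
```

The fit-then-transform idiom matters when the same scaling must later be applied to test data without refitting on it.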
Lesson 7: Algorithm Evaluation With Resampling Methods

The dataset used to train a machine learning algorithm is called a training dataset. The dataset used to train an
algorithm cannot be used to give you reliable estimates of the accuracy of the model on new data. This is a big
problem because the whole idea of creating the model is to make predictions on new data.
You can use statistical methods called resampling methods to split your training dataset up into subsets: some are
used to train the model, and others are held back and used to estimate the accuracy of the model on unseen data.
Your goal with today’s lesson is to practice using the different resampling methods available in scikit-learn, for
example:
•Split a dataset into training and test sets.
•Estimate the accuracy of an algorithm using k-fold cross validation.
•Estimate the accuracy of an algorithm using leave one out cross validation.
The snippet below uses scikit-learn to estimate the accuracy of the Logistic Regression algorithm on the Pima Indians
onset of diabetes dataset using 10-fold cross validation.

The cross-validation score can be directly calculated using the cross_val_score helper. Given an estimator, the
cross-validation object and the input dataset, the cross_val_score splits the data repeatedly into a training and a
testing set, trains the estimator using the training set and computes the scores based on the testing set for each
iteration of cross-validation.
K-Folds cross-validator: provides train/test indices to split data into train/test sets. It splits the dataset into k
consecutive folds (without shuffling by default).
Each fold is then used once as a validation set while the k - 1 remaining folds form the training set.
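
A minimal sketch of 10-fold cross validation with KFold and cross_val_score follows. Because the CSV file is not bundled here, it assumes a synthetic stand-in dataset from make_classification with the same rough shape (768 rows, 8 numeric inputs, binary target); the random seeds and max_iter value are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the diabetes data (an assumption, not the real file).
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=7)

# 10 folds; shuffling is switched on explicitly because the default is no shuffling.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = LogisticRegression(max_iter=1000)

# One score per fold; for a classifier the default score is accuracy.
scores = cross_val_score(model, X, y, cv=kfold)
print("Accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```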
Lesson 8: Algorithm Evaluation Metrics

There are many different metrics that you can use to evaluate the skill of a machine learning algorithm on a
dataset.
You can specify the metric used for your test harness in scikit-learn via the scoring argument of
the model_selection.cross_val_score() function, and defaults can be used for regression and classification
problems. Your goal with today’s lesson is to practice using the different algorithm performance metrics
available in the scikit-learn package.

•Practice using the Accuracy and LogLoss metrics on a classification problem.

•Practice generating a confusion matrix and a classification report.
•Practice using RMSE and RSquared metrics on a regression problem.

The snippet below demonstrates calculating the LogLoss metric on the Pima Indians onset of diabetes dataset.
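
A minimal sketch of the LogLoss calculation is given here, assuming a synthetic stand-in for the dataset (make_classification with 8 inputs) since the CSV is not included. scikit-learn reports this metric as "neg_log_loss", so the scores are negative and values closer to 0 are better.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the diabetes data (an assumption, not the real file).
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=7)

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = LogisticRegression(max_iter=1000)

# scoring selects the metric; "neg_log_loss" is the negated logarithmic loss.
scores = cross_val_score(model, X, y, cv=kfold, scoring="neg_log_loss")
print("LogLoss: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```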
Lesson 9: Spot-Check Algorithms

We cannot possibly know which algorithm will perform best on our data beforehand.
We have to discover it using a process of trial and error.
Let's call this spot-checking algorithms.

The scikit-learn library provides an interface to many machine learning algorithms and tools to compare the
estimated accuracy of those algorithms.

In this lesson, we will practice spot checking different machine learning algorithms.

•Spot check linear algorithms on a dataset (e.g. linear regression, logistic regression and linear discriminant
analysis).
•Spot check some non-linear algorithms on a dataset (e.g. KNN, SVM and CART).
•Spot-check some sophisticated ensemble algorithms on a dataset (e.g. random forest and stochastic gradient
boosting).
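
One way to sketch a spot check is to put one representative of each family named above into a dictionary and score them all with the same folds. The dataset here is a synthetic stand-in, and the model settings (n_estimators, max_iter, 5 folds) are assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in dataset (an assumption, not the course data).
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           random_state=7)

models = {
    "LR": LogisticRegression(max_iter=1000),                # linear
    "LDA": LinearDiscriminantAnalysis(),                    # linear
    "KNN": KNeighborsClassifier(),                          # non-linear
    "SVM": SVC(),                                           # non-linear
    "CART": DecisionTreeClassifier(random_state=7),         # non-linear
    "RF": RandomForestClassifier(n_estimators=50, random_state=7),       # ensemble
    "GBM": GradientBoostingClassifier(n_estimators=50, random_state=7),  # ensemble
}

# Same folds for every model so the estimates are comparable.
kfold = KFold(n_splits=5, shuffle=True, random_state=7)
results = {name: cross_val_score(model, X, y, cv=kfold).mean()
           for name, model in models.items()}

# Print from best to worst mean accuracy.
for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print("%-5s %.3f" % (name, score))
```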
Lesson 9: Spot-Check Algorithms – WHAT IS KNN Algorithm - K-Nearest Neighbors
Notice in the image above that most of the time, similar data
points are close to each other. The KNN algorithm hinges on
this assumption being true enough for the algorithm to be
useful. KNN captures the idea of similarity (sometimes called
distance, proximity, or closeness) with some mathematics.
The k-nearest neighbors (KNN) algorithm is a simple,
supervised machine learning algorithm that can be used to
solve both classification and regression problems. It’s easy to
implement and understand, but has a major drawback of
becoming significantly slower as the size of the data in use
grows.
KNN works by finding the distances between a query and all
the examples in the data, selecting the specified number of
examples (K) closest to the query, then voting for the most
frequent label (in the case of classification) or averaging the
labels (in the case of regression).
In the case of classification and regression, we saw that
choosing the right K for our data is done by trying several Ks
and picking the one that works best.
Finally, we looked at an example of how the KNN algorithm
could be used in recommender systems, an application of KNN-
search.
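
The procedure described above (compute distances, keep the K closest, then vote or average) can be sketched in plain Python. The function name knn_predict, the use of Euclidean distance, and the toy points are assumptions made for illustration; they are not taken from the slides.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two points of equal length."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3, task="classification"):
    # Distance from the query to every training example.
    distances = [(euclidean(x, query), label) for x, label in zip(train_X, train_y)]
    # Keep the labels of the k closest examples.
    distances.sort(key=lambda pair: pair[0])
    nearest = [label for _, label in distances[:k]]
    if task == "classification":
        return Counter(nearest).most_common(1)[0][0]   # majority vote
    return sum(nearest) / len(nearest)                 # average for regression

# Toy data: two well-separated groups of points.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["a", "a", "a", "b", "b", "b"]
values = [1.0, 1.2, 1.1, 5.0, 5.5, 5.2]

print(knn_predict(train_X, labels, (2.0, 2.0), k=3))
print(knn_predict(train_X, values, (6.0, 6.5), k=3, task="regression"))
```

Every prediction scans the whole training set, which is the slowdown on large data mentioned above.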
Lesson 10: Model Comparison and Selection

Let's check how to compare the estimated performance of different algorithms and select the best model.
In today’s lesson, you will practice comparing the accuracy of machine learning algorithms in Python with
scikit-learn.

•Compare linear algorithms to each other on a dataset.

•Compare nonlinear algorithms to each other on a dataset.
•Compare different configurations of the same algorithm to each other.
•Create plots of the results comparing algorithms.

The example below compares Logistic Regression and Linear Discriminant Analysis to
each other on the Pima Indians onset of diabetes dataset.
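
A minimal sketch of such a comparison is shown here, assuming a synthetic stand-in dataset because the CSV is not included. Both models are scored on identical folds, and the per-fold scores are kept so they could later be plotted (for example as box plots).

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the diabetes data (an assumption, not the real file).
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=7)

candidates = [
    ("LR", LogisticRegression(max_iter=1000)),
    ("LDA", LinearDiscriminantAnalysis()),
]

# Fixed folds so both algorithms see exactly the same train/test splits.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

results = {}
for name, model in candidates:
    results[name] = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    print("%s: %.3f (%.3f)" % (name, results[name].mean(), results[name].std()))
```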
Linear Discriminant Analysis for Machine Learning

Logistic regression is a classification algorithm traditionally limited to only two-class classification problems.

If you have more than two classes then Linear Discriminant Analysis is the preferred linear classification
technique.

The representation of LDA is straightforward.

It consists of statistical properties of your data, calculated for each class. For a single input variable (x) this is
the mean and the variance of the variable for each class. For multiple variables, these are the same properties
calculated over the multivariate Gaussian, namely the means and the covariance matrix.
These statistical properties are estimated from your data and plugged into the LDA equation to make predictions.
These are the model values that you would save to file for your model.
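
For a single input variable, the stored statistics and the prediction step can be sketched in plain Python. This sketch assumes the common textbook formulation in which the class means and priors are estimated per class and one pooled variance is shared by all classes; the function names and toy data are illustrative, not taken from the slides.

```python
import math

def fit_lda_1d(xs, ys):
    """Estimate per-class means and priors plus one pooled variance."""
    classes = sorted(set(ys))
    n = len(xs)
    means, priors = {}, {}
    pooled_ss = 0.0
    for c in classes:
        values = [x for x, y in zip(xs, ys) if y == c]
        means[c] = sum(values) / len(values)
        priors[c] = len(values) / n
        pooled_ss += sum((v - means[c]) ** 2 for v in values)
    variance = pooled_ss / (n - len(classes))
    return means, priors, variance

def predict_lda_1d(x, means, priors, variance):
    """Pick the class with the largest discriminant score."""
    def score(c):
        return (x * means[c] / variance
                - means[c] ** 2 / (2 * variance)
                + math.log(priors[c]))
    return max(means, key=score)

xs = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
ys = [0, 0, 0, 1, 1, 1]
means, priors, variance = fit_lda_1d(xs, ys)
print(means, priors, variance)
print(predict_lda_1d(4.0, means, priors, variance))
print(predict_lda_1d(6.0, means, priors, variance))
```

The three returned values (means, priors, variance) are the whole model, which is what would be saved to file.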
Lesson 11: Improve Accuracy with Algorithm Tuning

We are using the metric of ‘accuracy‘ to evaluate models. This is the ratio of the number of
correctly predicted instances divided by the total number of instances in the dataset, multiplied
by 100 to give a percentage (e.g. 95% accurate). We will be using the scoring variable when we
build and evaluate each model next.
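
As a small worked example of that ratio, the sketch below counts matching predictions by hand. The function name accuracy_percent and the label lists are made up for illustration.

```python
def accuracy_percent(y_true, y_pred):
    """Correct predictions divided by all predictions, times 100."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]   # two of the ten predictions are wrong

print(accuracy_percent(y_true, y_pred))   # 8 correct out of 10 -> 80.0
```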

Once you have found one or two algorithms that perform well on your dataset, you may want to improve the
performance of those models.

One way to increase the performance of an algorithm is to tune its parameters to your specific dataset.
The scikit-learn library provides two ways to search for combinations of parameters for a machine learning
algorithm. Let’s practice each.

•Tune the parameters of an algorithm using a grid search that you specify.

•Tune the parameters of an algorithm using a random search.
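
Both searches can be sketched with GridSearchCV and RandomizedSearchCV. The example tunes the regularization strength C of a logistic regression on a synthetic stand-in dataset; the candidate values, the sampling range, and the number of iterations are assumptions picked for illustration.

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in dataset (an assumption, not the course data).
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           random_state=7)

model = LogisticRegression(max_iter=1000)

# Grid search: try every value that you list.
grid = GridSearchCV(model, param_grid={"C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print("Grid search best:", grid.best_params_, "%.3f" % grid.best_score_)

# Random search: sample C from a distribution a fixed number of times.
rand = RandomizedSearchCV(model,
                          param_distributions={"C": uniform(loc=0.01, scale=10)},
                          n_iter=10, cv=5, random_state=7)
rand.fit(X, y)
print("Random search best:", rand.best_params_, "%.3f" % rand.best_score_)
```

Grid search is exhaustive over the listed values, while random search trades completeness for a fixed budget of trials.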

Lesson 12: Improve Accuracy with Ensemble Predictions

Another way that you can improve the performance of your models is to combine the predictions from multiple
models.
Some models provide this capability built-in such as random forest for bagging and stochastic gradient
boosting for boosting. Another type of ensembling called voting can be used to combine the predictions from
multiple different models together.
In today’s lesson, you will practice using ensemble methods.

•Practice bagging ensembles with the random forest and extra trees algorithms.

•Practice boosting ensembles with the gradient boosting machine and AdaBoost algorithms.

•Practice voting ensembles by combining the predictions from multiple models together.

The snippet below demonstrates how you can use the Random Forest algorithm (a bagged ensemble of
decision trees) on the Pima Indians onset of diabetes dataset.
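
A minimal sketch of a Random Forest evaluation is given here, assuming a synthetic stand-in dataset since the CSV is not included. The number of trees and max_features are illustrative settings, not values prescribed by the lecture.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the diabetes data (an assumption, not the real file).
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=7)

# 100 trees, each split considering 3 randomly chosen features.
model = RandomForestClassifier(n_estimators=100, max_features=3, random_state=7)

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(model, X, y, cv=kfold)
print("Accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```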
Lesson 13: Finalize And Save Your Model

Once you have found a well-performing model on your machine learning problem, you need to finalize it.
In today’s lesson, you will practice the tasks related to finalizing your model.
Practice making predictions with your model on new data (data unseen during training and testing).
Practice saving trained models to file and loading them up again.
For example, the Pima Indians diabetes dataset can be used to show how you create a Logistic Regression model, save it to
file, then load it later and make predictions on unseen data.
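
The save-and-reload cycle can be sketched with Python's pickle module. The data is a synthetic stand-in, and the file name finalized_model.pkl and the temporary directory are hypothetical choices made for this example.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes data (an assumption, not the real file).
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=7)

# Train the final model on the training portion.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Save the fitted model to a file (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "finalized_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later: load it back and predict on data not used for training.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print("Accuracy of reloaded model: %.3f" % loaded.score(X_test, y_test))
```

Only load pickle files from sources you trust, and reload them with the same scikit-learn version that saved them.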
HANDS ON LAB

Internet access is required to install the machine learning modules.

Install Python 3.7.

PyCharm – available on a few machines.
Anaconda – not available; must be downloaded. Instructions to be provided.
Use of Python modules – machines that were not connected to the internet
could not be used. The rest of the machines installed them.
Use the command pip install:

pip install matplotlib
pip install pandas
pip install numpy
pip install scikit-learn
