
NUS ACE SUMMER PROGRAMME

AI & MACHINE
LEARNING

Manoranjan Dash
Professor and Dean
School of Computing and Data Science
FLAME University, Pune

Ex-Senior Research Fellow


Singapore Data Science Consortium
National University of Singapore
Day 2: Machine Learning Methods Using Orange
• Importing the data files
• Understanding our Data
• Missing Values and Imputation
• Training your First Model
• Supervised Learning Algorithms
• Logistic Regression
• Random Forest
• Support Vector Machine
• K-Nearest Neighbour
• Neural Network
• Comparison of Different Models
Day 2: Machine Learning Methods Using Orange
• Clustering Algorithms
• Hierarchical Clustering
• K-Means
• Visualization
• Data Table
• Scatter Plot
• Mosaic Display
• Sieve Diagram
• Rank
• Rad Viz
1. Why Orange?
• Orange is a platform built for data mining and analysis through a GUI-based
workflow. This signifies that you do not have to know how to code to be able
to work with Orange to mine data, crunch numbers and derive insights.

• You can perform tasks ranging from basic visuals to data manipulations,
transformations, and data mining. It consolidates all the functions of the
entire process into a single workflow.

• The best part and the differentiator about Orange is that it has some
wonderful visuals. You can try silhouettes, heat-maps, geo-maps and all sorts
of visualizations available.
2. Setting up your System
• Orange comes built-in with the Anaconda tool if you’ve previously
installed it. If not, follow these steps to download Orange.

• Step 1: Go to Orange Data Mining and click on Download


Step 2: Install the platform and set the working directory for
Orange to store its files

This is what the start-up page of Orange looks like. You have options that allow you to create new projects,
open recent ones or view examples and get started.
• Before we delve into how Orange works, let’s define a few key terms
to help us in our understanding:
• A widget is the basic processing point of any data manipulation. It can do a
number of actions based on what you choose in your widget selector on the
left of the screen.
• A workflow is the sequence of steps or actions that you take in your platform
to accomplish a particular task.
• For now, click on “New” and let’s start building your first workflow.
3. Creating Your First Workflow
• This is the first step towards building a solution to any problem. We
need to first understand what steps we need to take in order to
achieve our final goal. After clicking "New" in the previous step, this
is what you should see.
This is your blank Workflow on Orange. Now, you’re ready to explore and solve any problem by dragging
any widget from the widget menu to your workflow.
4. Familiarising yourself with the basics
• Orange is a platform that can help us solve most problems in Data
Science today, with topics ranging from the most basic visualizations
to training models. You can even evaluate models and perform
unsupervised learning on datasets.
• Problem
• The problem we’re looking to solve in this tutorial is the practice problem
Loan Prediction that can be accessed via this link
Loan Prediction (analyticsvidhya.com) on Datahack
Importing the data files
- We begin with the first and necessary step to understanding our data and making predictions: importing our data
- Step 1: Click on the “Data” tab on the widget selector menu and drag the widget “CSV File Import” to our blank workflow.
Directory for Orange Datasets:
C:\Users\mdash\Desktop\CG\ACE_TEACHING\AI_ML_Fundamentals_Dec2023\Orange3-3.36.1\Orange\Lib\site-packages\Orange\datasets
Step 2: Double click the “File” widget and select the file you want to load into the workflow. Import Iris dataset.
Step 3: Click on ‘Data Table’ widget
Understanding our Data
Click on the semicircle in front of the “File” widget and drag it to an empty space in the workflow and
select the “Scatter Plot” widget.
Another way to visualize our distributions would be the “Distributions” widget. Click on the semi-circle again, and
drag to find the widget “Distributions”.
Missing Values and Imputation

(Slide figure: an imputed entry is compared against the actual value, 1.4.)
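The slide shows this visually; as a minimal scripted sketch of the same idea, here is mean imputation with scikit-learn's SimpleImputer (Orange's Impute widget offers comparable strategies). The toy values below are our own:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature column with one missing entry (np.nan marks the gap)
X = np.array([[1.3], [np.nan], [1.5], [1.4]])

# Mean imputation: replace the missing entry with the column mean
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))
```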
Training your First Model
• Step 1: First, we need to set a target variable to apply Logistic Regression on it.

• Step 2: Go to the "File" widget and double click it.

• Step 3: Now, double click on the Iris column and select it as the target variable. Click Apply.
Step 4: Once we have set our target variable, find the clean data from the
"File" widget as follows and place the "Logistic Regression" widget.

- Logistic Regression is a classification algorithm commonly used for
binary classification problems, where the target variable has two
possible outcomes (e.g., 0 or 1, True or False).

- Despite its name, logistic regression is used for classification, not regression.
It's called "regression" because it's an extension of linear regression,
but it's adapted for classification purposes through the logistic function.
Logistic Regression Algorithm:
1. Linear Combination:
• The algorithm starts with a linear combination of the input features. The linear
combination is represented as z = b0 + b1·x1 + … + bn·xn, where b0, …, bn are
coefficients and x1, …, xn are input features.
2. Logistic Function (Sigmoid):
• The linear combination is then passed through a logistic function, also known as
the sigmoid function. The sigmoid function maps any real-valued number to the
range between 0 and 1. The formula for the sigmoid function is σ(z) = 1 / (1 + e^(−z)).
3. Probability Prediction:
• The output of the logistic function represents the probability that the given input
point belongs to the positive class (class 1). It can be interpreted as the
probability of success in a binary outcome.
4. Decision Threshold:
• A decision threshold is chosen (typically 0.5), and if the predicted probability is
greater than or equal to the threshold, the instance is classified as belonging to
the positive class; otherwise, it is classified as belonging to the negative class.
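To make the four steps above concrete, here is a minimal sketch using scikit-learn (which Orange's learners build on under the hood) with the Iris data; the split ratio and variable names are our own choices:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the Iris data and hold out a test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Steps 1-2: fitting learns the coefficients b0..bn; predict_proba applies
# the linear combination followed by the logistic function
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.predict_proba(X_test[:3]))  # step 3: class probabilities

# Step 4: predict() applies the decision rule (argmax over classes here)
print(model.predict(X_test[:3]))
```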
• Step 5: Double click the widget and select the type of
regularization you want to perform.

• Ridge Regression:
• Performs L2 regularization, i.e. adds a penalty equivalent to the square
of the magnitude of the coefficients
• Minimization objective = LS Obj + α * (sum of squares of coefficients)

• Lasso Regression:
• Performs L1 regularization, i.e. adds a penalty equivalent to the
absolute value of the magnitude of the coefficients
• Minimization objective = LS Obj + α * (sum of absolute values of coefficients)
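As a hedged sketch of the two penalties in code, scikit-learn's LogisticRegression exposes them via the penalty argument; note that its C parameter is the inverse of the α above, so a larger α corresponds to a smaller C:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# L2 (ridge-style) penalty: shrinks coefficients smoothly toward zero
ridge_like = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)

# L1 (lasso-style) penalty: can drive coefficients exactly to zero
# (the liblinear solver supports the L1 penalty)
lasso_like = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)

print(ridge_like.coef_)
print(lasso_like.coef_)  # expect some exact zeros here
```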
Step 6: Next, click on the “File” or the “Logistic Regression” widget and find the
“Test and Score” widget. Make sure you connect both the data and the model
to the testing widget.
Step 7: Click on “Test and Score” widget to see how well your model is doing.
Step 8: To visualize the results better, drag and drop from the "Test and
Score" widget to find "Confusion Matrix".
Step 9: Once you’ve placed it, click on it to visualize your findings!
Random Forest
• It is an ensemble learning method used for both classification and
regression tasks.
• It operates by constructing a multitude of decision trees during
training and outputs the mode (for classification) or mean prediction
(for regression) of the individual trees as the final prediction.
• Random Forest introduces randomness both in the data used to train
each tree and in the features considered when splitting each node of
the trees.
Random Forest Algorithm:
1. Bootstrapped Sampling (Bagging):
• Random Forest starts by creating multiple bootstrap samples from the original
dataset. Each sample is obtained by randomly sampling with replacement from
the original dataset.
2. Random Feature Selection:
• At each node of each tree, a random subset of features is selected to determine
the best split. This introduces diversity among the trees and helps prevent
overfitting
3. Decision Tree Construction:
• For each bootstrap sample and at each node of the tree, the algorithm constructs
a decision tree using the selected features. The tree is grown until a stopping
criterion is met (e.g., a maximum depth is reached).
4. Voting (Classification) or Averaging (Regression):
• For classification problems, the final prediction is determined by a majority vote
among the individual trees. For regression problems, it's the average of the
predictions.
Hyperparameters:
• Number of Trees (n_estimators): The number of decision trees in the forest.

• Max Depth: The maximum depth of each tree.

• Min Samples Split: The minimum number of samples required to split a node.

• Min Samples Leaf: The minimum number of samples required to be in a leaf node.

• Max Features: The number of features to consider when looking for the best split.
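These hyperparameter names map directly onto scikit-learn's RandomForestClassifier arguments, which Orange's Random Forest widget builds on; a minimal sketch with our own illustrative settings:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_depth=None,        # grow each tree until the stopping criterion
    min_samples_split=2,   # minimum samples to split an internal node
    min_samples_leaf=1,    # minimum samples required in a leaf
    max_features="sqrt",   # features considered at each split
    random_state=0,
)
forest.fit(X, y)
print(forest.predict(X[:3]))  # majority vote across the trees
```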
Support Vector Machine (SVM)
• SVM is a powerful supervised learning algorithm used for
classification and regression tasks.
• Primary objective of SVM is to find a hyperplane that best separates
the data into different classes while maximizing the margin between
the classes.
SVM Algorithm:
Data Representation: Given a dataset with labeled instances, where each instance belongs to one of two classes, SVM
represents each instance as a point in space, with the value of each feature being the value of a particular coordinate.

Hyperplane Definition: SVM aims to find the hyperplane that best separates the data into two classes. A hyperplane is a
decision boundary that maximizes the margin between the two classes. The margin is defined as the distance between the
hyperplane and the nearest data point from either class.

Optimization Objective: SVM formulates an optimization problem to maximize the margin while minimizing classification
errors. The optimal hyperplane is the one that satisfies this objective.

Soft Margin (C parameter): In some cases, it may not be possible to find a hyperplane that perfectly separates the classes.
SVM introduces a "soft margin" that allows for some misclassification. The parameter C controls the trade-off between
having a smooth decision boundary and classifying training points correctly.

Kernel Trick: SVM can handle non-linear decision boundaries by using the kernel trick. This involves mapping the input
features into a higher-dimensional space where a hyperplane can effectively separate the data. Common kernels include
linear, polynomial, and radial basis function (RBF) kernels.
Hyperparameters:
• Kernel Type: The choice of kernel (linear, polynomial, RBF, etc.) influences the decision boundary shape.

• C (Regularization Parameter): Controls the trade-off between having a smooth decision boundary and classifying training points correctly.

• Gamma (for RBF kernel): Controls the width of the Gaussian function and influences the decision boundary.
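A minimal sketch of these hyperparameters using scikit-learn's SVC; feature scaling is included because SVMs are distance-based, and the settings shown are our own illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# RBF-kernel SVM with soft-margin parameter C and kernel width gamma
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
print(cross_val_score(svm, X, y, cv=5).mean())  # mean cross-validated accuracy
```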
kNN (k Nearest Neighbors)
• It is a simple, yet powerful, supervised machine learning algorithm
used for both classification and regression tasks.
• It belongs to the category of instance-based or lazy learning
algorithms.
• The central idea behind kNN is to make predictions for a new data
point based on the majority class (for classification) or the average
(for regression) of its k-nearest neighbors in the feature space.
kNN Algorithm:
Data Representation: Given a dataset with labeled instances, kNN represents each instance
as a point in a multidimensional space, with the features of each instance being the
coordinates of that point.

Distance Calculation: When a prediction is needed for a new data point, kNN calculates the
distance between that point and all other points in the training dataset. Common distance
metrics include Euclidean distance, Manhattan distance, or other distance measures.

Finding Neighbors: The algorithm identifies the k-nearest neighbors of the new data point
based on the calculated distances.

Majority Voting (Classification) or Averaging (Regression): For classification, the algorithm
assigns the majority class among the k-nearest neighbors to the new data point. For
regression, it calculates the average of the target values of the k-nearest neighbors.
Hyperparameter: k (Number of Neighbors):
The choice of k is a critical hyperparameter. A smaller
k value makes the model more sensitive to noise,
while a larger k value makes it more robust but may
smooth out local patterns.
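A minimal sketch of this trade-off using scikit-learn's KNeighborsClassifier on Iris; the k values tried are our own illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Small k is sensitive to noise; large k smooths out local patterns
for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    print(k, cross_val_score(knn, X, y, cv=5).mean())
```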
Neural Network
• Neural Networks, also known as Artificial Neural Networks (ANNs),
are a class of machine learning models inspired by the structure and
functioning of the human brain.
• They are used for various tasks, including classification, regression,
and pattern recognition.
• A neural network consists of layers of interconnected nodes (neurons)
organized in input, hidden, and output layers.
Neural Network Architecture:
Input Layer:
Neurons in the input layer represent the features of the input data.

Hidden Layers:
Between the input and output layers, there can be one or more hidden layers. Each
neuron in a hidden layer takes input from the previous layer, applies a weighted sum
and an activation function, and produces an output for the next layer.

Output Layer:
The output layer produces the final prediction or classification. The number of neurons
in the output layer depends on the type of task (e.g., binary classification, multi-class
classification, regression).
Neural Network Training (Backpropagation):
Forward Propagation: During training, the input data is fed forward through the network,
and the predictions are computed.

Loss Function: A loss function is used to measure the difference between the predicted
output and the actual target. Common loss functions include mean squared error for
regression and cross-entropy for classification.

Backpropagation: The algorithm performs backpropagation to update the weights of the
connections between neurons. It computes the gradient of the loss with respect to the
weights and adjusts the weights to minimize the loss.

Optimization: Optimization algorithms (e.g., stochastic gradient descent) are used to find
the optimal weights that minimize the loss function.

Activation Functions: Neurons typically use activation functions (e.g., sigmoid, tanh, ReLU)
to introduce non-linearity into the model, enabling it to learn complex patterns.
Hyperparameters:
• Number of Layers: The choice of the number of hidden layers and neurons in each layer.

• Learning Rate: A small value that determines the step size in weight updates during optimization.

• Activation Functions: Choices include sigmoid, tanh, ReLU, etc.

• Number of Epochs: The number of times the entire dataset is passed forward and backward through the neural network during training.

• Batch Size: The number of data points used in each iteration of training.
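A minimal sketch wiring these hyperparameters into scikit-learn's MLPClassifier (a feed-forward network trained by backpropagation); the layer size and other settings are our own illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# One hidden layer of 10 ReLU neurons; max_iter caps the training epochs
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(10,), activation="relu",
                  learning_rate_init=0.01, batch_size=16,
                  max_iter=500, random_state=0),
)
mlp.fit(X_train, y_train)
print(mlp.score(X_test, y_test))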
Comparison of Different Models
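Orange's Test and Score widget reports cross-validated metrics for every learner connected to it. A rough scripted equivalent of that comparison, with our own choice of models and 5-fold cross-validation:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "kNN": KNeighborsClassifier(),
}
for name, model in models.items():
    # Mean 5-fold cross-validated accuracy, like a Test and Score column
    print(f"{name}: {cross_val_score(model, X, y, cv=5).mean():.3f}")
```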
Clustering Using Orange
• Clustering is an unsupervised learning technique that groups data
points together based on their similarity.
• Orange Data Mining provides a variety of widgets for clustering data,
including hierarchical clustering and k-means clustering.
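Hierarchical clustering is covered in detail below; for k-means, here is a minimal sketch with scikit-learn's KMeans (the cluster count is our own choice, matching the three Iris species):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)  # labels ignored: clustering is unsupervised

# k-means requires the number of clusters up front
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels[:10])               # cluster assignment per data point
print(kmeans.cluster_centers_)   # the learned cluster centroids
```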
Hierarchical Clustering
• It creates a hierarchy of clusters by iteratively merging or splitting
clusters.
• The hierarchy is represented as a dendrogram, which is a tree-like diagram
that shows the relationship between the clusters
To perform hierarchical clustering in Orange, follow these steps:
1. Load your data into Orange
2. Select the Hierarchical Clustering widget from the Clustering category
3. Connect the data source to the input pin of the Hierarchical Clustering widget
4. In the "Linkage method" drop-down list, select the desired linkage method. The options
include:
a. Single linkage: Defines the distance between two clusters as the smallest distance
between any two points in the two clusters
b. Complete linkage: Defines the distance between two clusters as the largest distance
between any two points in the two clusters
c. Average linkage: Defines the distance between two clusters as the average distance
between all pairs of points in the two clusters
d. Ward's linkage: Minimizes the overall within-cluster variance, creating clusters that
are internally homogeneous and externally heterogeneous
5. (Optional) In the "Distance metric" drop-down list, select the desired distance
metric. The options include:
a) Euclidean distance: The most common distance metric, calculated as the
straight-line distance between two points
b) Manhattan distance: Calculated as the sum of the absolute differences between
the corresponding coordinates of two points
c) Minkowski distance: A generalization of Euclidean and Manhattan distances,
where the power of the distance is specified by a parameter
6. Click the "Run" button.

The Hierarchical Clustering widget will output a new dataset with a cluster label
attached to each data point. You can then use this cluster label to visualize your data or
to perform other analyses.
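A minimal scripted sketch of the same steps using SciPy's hierarchy module, with Ward linkage and Euclidean distance as illustrative choices (Ward requires Euclidean):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# Build the cluster hierarchy: Ward linkage, Euclidean distance
Z = linkage(X, method="ward", metric="euclidean")

# The dendrogram is the tree-like diagram described above
dendrogram(Z, no_labels=True)
plt.show()
```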
In hierarchical clustering, the height ratio and top N are two methods for selecting the
number of clusters to extract from a dendrogram.

a. Height Ratio
• It is a measure of the relative distance between clusters.
• It is calculated by dividing the distance between two merged clusters by the total
height of the dendrogram.
• The height of the dendrogram is the maximum distance between any two clusters.
• A high height ratio indicates that the two clusters are very different, while a low
height ratio indicates that the two clusters are very similar.
• To use the height ratio to determine the optimal number of clusters, you can cut
the dendrogram at a height that corresponds to a desired level of similarity
between clusters. For example, if you want to extract five clusters, you would cut
the dendrogram at a height that corresponds to a height ratio of 0.2. This would
ensure that the five clusters are relatively different from each other.
b. Top N
• It selects the N largest clusters from the dendrogram.
• It is useful if you want to extract a specific number of clusters, regardless of the
similarity between the clusters.
• To use the top N method to determine the optimal number of clusters, you can
specify the desired number of clusters (N) in the Hierarchical Clustering widget.
The widget will then extract the N largest clusters from the dendrogram and assign
them to data points.
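Both selection methods can be sketched with SciPy's fcluster; the cluster count and height fraction below are illustrative assumptions:

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
Z = linkage(X, method="ward")

# Top N: extract exactly N clusters from the dendrogram
labels_top_n = fcluster(Z, t=3, criterion="maxclust")

# Height-based cut: Z[:, 2] holds the merge distances, so cutting at a
# fraction of the maximum height mimics the height-ratio idea
cut = 0.2 * Z[:, 2].max()
labels_height = fcluster(Z, t=cut, criterion="distance")

print(labels_top_n[:10])
print(labels_height[:10])
```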
Sieve Diagram
• It is a graphical visualization tool used in data mining to examine the
relationship between two categorical variables
• In the context of data clustering, the Sieve diagram can be employed to
assess the effectiveness of the clustering process by comparing the
observed frequencies of attribute combinations to the expected
frequencies under the assumption of independence
• Interpret the Sieve diagram
• It displays a grid of rectangles, where each rectangle represents a combination of
attribute values.
• The area of each rectangle corresponds to the expected frequency under the
assumption of independence, while the number of squares inside each rectangle
indicates the observed frequency.
• Deviations from the expected frequencies suggest potential dependencies
between the attributes.
Rectangles:
Each rectangle represents a combination of values for the two categorical variables. The
size of each rectangle is proportional to the expected frequency of that combination,
assuming the variables are independent

Squares:
The number of squares inside each rectangle represents the observed frequency of that
combination. If the observed frequency is significantly higher than the expected
frequency, it suggests a positive correlation between the variables. Conversely, if the
observed frequency is significantly lower than the expected frequency, it suggests a
negative correlation.

Coloring:
The squares are colored according to the deviation from the expected frequency. Red
squares indicate positive deviations, while blue squares indicate negative deviations.
Sieve Diagram
Box Plot
Visualization
• Data Table
• Scatter Plot
• Mosaic Display
• Sieve Diagram
• Rank
• Rad Viz
Mosaic Display: It is a graphical method for visualizing the association between two
categorical variables. It uses a grid of rectangles, with the area of each rectangle
proportional to the joint frequency of the corresponding categories.

Sieve Diagram: It is a visualization tool for examining the association between two
categorical variables. It compares the observed frequency of each combination of values
against the frequency expected if the variables were independent, so deviations between
the two reveal dependencies between the variables (see the detailed description above).

Rank: In the context of data analysis and visualization, "rank" typically refers to the ordering
of items based on a particular criterion. For example, you might rank items by their
frequency, importance, or some other measure. Rank visualizations often involve displaying
items in order, highlighting their relative positions in a list or chart.

Rad Viz (Radial Visualization): It is a method of data representation where data points are
arranged in a circular or radial pattern. It's particularly useful for visualizing multivariate
data, as different variables can be represented along the radial axes.
Petal length and petal width give better clusters.

Mosaic Display + Scatter Plot
Rank
• The Rank widget in Orange Data Mining is used to score variables
according to their correlation with a discrete or numeric target
variable.
• It utilizes various internal scorers, such as information gain, chi-square,
and linear regression, to assess the relevance of each variable to the
target variable.
• Additionally, it can incorporate scores from external models like linear
regression, logistic regression, random forest, and SGD.
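A hedged sketch of this kind of scoring using scikit-learn's feature_selection module; chi-square and mutual information (an information-gain-style measure) are shown, with our own formatting:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2, mutual_info_classif

data = load_iris()
X, y = data.data, data.target

# Chi-square scores (features must be non-negative) and
# mutual-information scores against the class variable
chi2_scores, _ = chi2(X, y)
mi_scores = mutual_info_classif(X, y, random_state=0)

for name, c, m in zip(data.feature_names, chi2_scores, mi_scores):
    print(f"{name}: chi2={c:.2f}, MI={m:.3f}")
```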
Rank
Radviz: Radial Visualization
• The Radviz widget in Orange Data Mining is a non-linear
multidimensional visualization technique that can display data
defined by three or more variables in a 2-dimensional projection.
• It utilizes a metaphor from physics, where data instances are
represented as points within a circle, and their positions are
determined by springs attached to attribute anchors located on the
circle's perimeter.
References
1. https://www.analyticsvidhya.com/blog/2017/09/building-machine-learning-model-fun-using-orange/
