Introduction to AI

The document provides an overview of data science, focusing on machine learning, including definitions, types of data, and various algorithms such as supervised, unsupervised, and reinforcement learning. It discusses model selection, evaluation, and techniques like regularization to prevent overfitting. Key algorithms covered include kNN, decision trees, SVM, and random forests, along with their applications and challenges in training AI models.

Uploaded by

Yasiru

EEX4373 - Data Science

Dr. H.M.M.Caldera
Electrical and Computer Engineering
The Open University of Sri Lanka
Agenda
• A general overview and terminology
• Variables
• Feature representation
• Introduction to machine learning
• Model selection and evaluation
• Classification: kNN, decision trees, SVM
• Ensemble methods: random forests
• Regularization
A general overview and terminology
Variables

• Independent variable: an explanatory variable is a type of independent variable.
• Dependent variable: a response variable is a type of dependent variable.
More...
• Your explanatory variable is
academic motivation at the start of
the school year.
• Your response variable is GPA at
the end of the school year.
In the AI context:

Variables = Features?
• A "variable" is a broader term that refers to any quantity that can be
measured or controlled. It can include not only features but also
other types of data, such as labels or target variables in supervised
learning.
• "Features" specifically refer to the input variables or attributes that
are used to describe the data in a machine learning model. These
features are the characteristics or properties of the data that the
model uses to make predictions or classifications.
Introduction to machine learning
Machine Learning
Machine Learning is a subfield of Artificial Intelligence
(AI) that empowers computers to learn from data and improve
their performance over time, all without explicit programming.

Image credit - https://i.stack.imgur.com/


Definition

Arthur Samuel coined the term “Machine Learning” in 1959 while at IBM.

His definition, in general terms, is given below.

“The field of study that gives computers the ability to learn without being explicitly programmed.”
Overview of AI
• Once you have a large set of data, you can perform many tasks, such as:
• Identification/Detection
• Classification
• Prediction
• How can AI transform “data” into a valuable asset?
• AI mainly involves:
• Predictive analytics
General Applications
How do machines learn?
• It is like the human learning process
• By Examples
• By Experience

• You need to tell whether the following object is a mop or not
Data Types in ML
• At a high level, data can be categorized into two main types:
• Numerical data
• Categorical data
More detailed:
1. Numeric Data: Numeric data is the most commonly used data
type in machine learning. It includes continuous variables like
age, height, weight, etc., and discrete variables like the number
of siblings, number of cars owned, etc.
2. Categorical Data: Categorical data includes variables that have
discrete values like color, gender, and nationality. These
variables can be nominal (no order) or ordinal (ordered).
3. Text Data: Text data includes text documents like emails, social
media posts, and customer reviews. Natural Language
Processing (NLP) is used to process this type of data.
4. Image Data: Image data includes images that are used for tasks
like image recognition, object detection, and facial recognition.
Convolutional Neural Networks (CNNs) are used to process this
type of data.
5. Audio Data: Audio data includes audio files that are used for
tasks like speech recognition and audio classification. Recurrent
Neural Networks (RNNs) are used to process this type of data.
6. Time-Series Data: Time-series data includes data that is
collected over time, like stock prices, weather data, and sensor
data. Time-series analysis is used to process this type of data.
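The nominal/ordinal distinction above affects how a categorical variable is turned into numbers. The following is a minimal sketch, assuming two made-up columns (`colors`, `sizes`) and two illustrative helper names (`one_hot`, `ordinal_encode`) that do not come from the slides.

```python
def one_hot(value, categories):
    """Encode a nominal value as a 0/1 vector with one slot per category."""
    return [1 if value == c else 0 for c in categories]


def ordinal_encode(value, ordered_levels):
    """Encode an ordinal value as its rank in the given order."""
    return ordered_levels.index(value)


colors = ["red", "green", "blue"]      # nominal: no natural order (assumed example)
sizes = ["small", "medium", "large"]   # ordinal: ordered levels (assumed example)

print(one_hot("green", colors))         # [0, 1, 0]
print(ordinal_encode("large", sizes))   # 2
```

One-hot encoding avoids implying an order between colors, while the rank encoding keeps the order that an ordinal variable actually has.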
ML process
• The ML process follows a sequence of steps.
• Training data: a set of data that is used to train the algorithm. It can be labelled or not.
Types of Machine Learning
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
Supervised Learning
• Under supervised learning, the algorithm is trained on a labeled
dataset, which means that the input data is paired with
corresponding output labels.
• i.e. Learn to predict target values from labelled data
• The goal is for the model to learn the mapping between inputs
and outputs, making predictions or classifications on new,
unseen data.
• Common tasks include regression (predicting a continuous
value) and classification (assigning a label to input data).
• Examples of supervised learning algorithms include linear
regression, support vector machines, decision trees, and neural
networks.
More - What is Supervised Learning? | IBM
Unsupervised Learning
• Unsupervised learning involves working with unlabeled data,
where the algorithm must discover patterns, relationships, or
structures within the data without explicit guidance.
• Clustering and dimensionality reduction are common tasks in
unsupervised learning.
• Examples of unsupervised learning algorithms include k-means
clustering, hierarchical clustering, principal component analysis
(PCA), and autoencoders.
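As an illustration of clustering on unlabeled data, here is a minimal k-means sketch in plain Python. The sample points, the function name `kmeans`, and the choice to use the first k points as initial centroids are assumptions made for this example; practical implementations usually pick initial centroids more carefully.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means on 2-D points; initial centroids are the first k points."""
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = (
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                )
    return centroids, clusters


# Two visually separate groups, with no labels given to the algorithm.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

The algorithm never sees a label; the two groups emerge only from the distances between points.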
Reinforcement Learning
• Reinforcement learning focuses on training models to make
sequential decisions by interacting with an environment. The
model receives feedback in the form of rewards or penalties
based on its actions, allowing it to learn optimal strategies over
time.
• Reinforcement learning is often used in scenarios where an
agent must learn to navigate an environment and take actions
to maximize cumulative rewards.
• Examples of reinforcement learning algorithms include
Q-learning, deep Q-networks (DQN), and policy gradient
methods.
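The reward-driven loop described above can be sketched with tabular Q-learning. Everything about the environment here is a made-up assumption for illustration: a corridor of five states, two actions (left/right), and a reward of 1 only on reaching the last state. The learning rate `ALPHA`, discount `GAMMA`, and episode count are likewise arbitrary choices.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the goal (terminal)
ACTIONS = [0, 1]      # 0 = move left, 1 = move right
ALPHA, GAMMA = 0.5, 0.9


def step(state, action):
    """Hypothetical environment: reward 1 only when the goal is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


random.seed(0)
q = [[0.0, 0.0] for _ in range(N_STATES)]
for _ in range(500):                      # episodes
    state = 0
    while state != N_STATES - 1:
        action = random.choice(ACTIONS)   # explore with a purely random policy
        next_state, reward = step(state, action)
        # Q-learning update: move Q(s, a) toward reward + discounted best next value.
        target = reward + GAMMA * max(q[next_state])
        q[state][action] += ALPHA * (target - q[state][action])
        state = next_state

# Greedy policy read off the learned table (one action per non-terminal state).
policy = [q[s].index(max(q[s])) for s in range(N_STATES - 1)]
print(policy)
```

Although the agent behaves randomly while learning, the greedy policy read from the table should prefer moving right in every state, since that maximizes the cumulative discounted reward.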
ML Algorithms
• Supervised learning
  • Regression: simple linear regression, multiple linear regression, polynomial regression
  • Classification
• Unsupervised learning
• Reinforcement learning
Model selection and evaluation
Supervised learning algorithms
• Linear regression
• Logistic regression
• Support vector machines
• K-NN
• Naïve Bayes
• Decision tree
• Random forest
Unsupervised learning algorithms

• K-means clustering
• Hierarchical clustering
• Principal Component Analysis
• Independent Component Analysis
• Singular Value Decomposition
Supervised learning
• Regression
• Classification
Regression problem

When the output to be predicted is a continuous value rather than a class, regression is the best fit.
Linear regression
• Mainly defined as a statistical method
• Models the relationship between a dependent variable and one or more
independent variables
• In the AI context, it is treated as a supervised
learning algorithm
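A minimal sketch of simple linear regression with one independent variable, fitted by ordinary least squares. The data (`hours`, `scores`) and the function name are assumptions chosen so that the points lie exactly on the line y = 2x + 1.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


hours = [1.0, 2.0, 3.0, 4.0]     # independent (explanatory) variable
scores = [3.0, 5.0, 7.0, 9.0]    # dependent (response) variable

slope, intercept = fit_simple_linear_regression(hours, scores)
print(slope, intercept)           # 2.0 1.0
print(slope * 5.0 + intercept)    # prediction for an unseen input: 11.0
```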
Regression
• Single explanatory variable (simple linear regression)
• Multiple explanatory variables (multiple linear regression)
Classification

When the output must be assigned to one of a set of classes, the problem is best treated as a classification problem.
Popular classification algorithms
• Logistic Regression: Despite its name, logistic regression is a classification
algorithm commonly used for binary classification tasks.
• Decision Trees: These are tree-like structures where each internal node
represents a decision based on a feature, and each leaf node represents a
class label.
• Support Vector Machines (SVM): SVM aims to find a hyperplane that best
separates data into different classes.
• K-Nearest Neighbors (KNN): This algorithm classifies data points based on
the majority class among their k-nearest neighbors in the feature space.
• Random Forest: A collection of decision trees that work together to
improve the overall accuracy and robustness of the model.
• Naive Bayes: Based on Bayes' theorem, this algorithm is particularly
effective for text classification tasks.
Logistic regression
• Useful when the dependent variable is categorical
• Output is a binary output (i.e., 0/1 or True/False)
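To show how a binary output arises, here is a minimal logistic regression sketch trained by gradient descent on one feature. The data are made up and assumed to be already centered around zero; the learning rate, epoch count, and the 0.5 decision threshold are illustrative choices.

```python
import math


def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.1, epochs=1000):
    """Fit weight w and bias b by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b


def predict(x, w, b):
    """Turn the predicted probability into a 0/1 class label."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]   # assumed, already-centered feature
ys = [0, 0, 0, 1, 1, 1]                  # binary labels

w, b = train_logistic(xs, ys)
print(predict(-1.5, w, b), predict(1.5, w, b))
```

The model itself outputs a probability; the binary label comes from thresholding that probability.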
Decision Tree
• An inverted tree design
• Used for both classification and regression tasks.
• Root Node: The topmost node in the tree, representing the first decision based on a specific feature. This decision splits the dataset into subsets.
• Internal Nodes: Nodes in the middle of the tree that represent decisions based on different features. Each internal node contributes to further partitioning the dataset.
• Branches: The edges connecting nodes, indicating the outcome of a decision. Each branch corresponds to a specific value or range of values for a feature.
• Leaf Nodes: The terminal nodes at the bottom of the tree, representing the final class label (in classification) or the predicted value (in regression). Each leaf node is associated with a specific decision or outcome.
More…
• Splitting Criteria:
• The criteria used to decide how to split the dataset at each internal
node. Common criteria include Gini impurity (for classification) and
mean squared error (for regression).
• Pruning:
• The process of removing branches or nodes from the tree to prevent
overfitting. Pruning helps create a more generalized model that
performs well on new, unseen data.
• Decision Rules:
• The paths from the root node to the leaf nodes represent decision rules.
These rules define how the algorithm makes predictions based on the
input features.
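The Gini splitting criterion mentioned above can be sketched for a single numeric feature. The data (`ages`, `buys`) and the helper names are assumptions for illustration; the search simply tries the midpoint between each pair of adjacent distinct values and keeps the split with the lowest weighted impurity.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best single split."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


ages = [22, 25, 30, 35, 40, 50]
buys = ["no", "no", "no", "yes", "yes", "yes"]

print(gini(buys))                      # 0.5 for a 50/50 mix of two classes
threshold, impurity = best_threshold(ages, buys)
print(threshold, impurity)             # 32.5 0.0
```

The resulting decision rule reads "if age <= 32.5 then no, else yes", which is one root-to-leaf path per outcome.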
Random Forest
• A collection of decision trees
• Considers many features across the trees when determining the output
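A rough sketch of the "collection of trees" idea, under strong simplifying assumptions: each tree is only a one-split stump, each stump is trained on a bootstrap sample, and the forest predicts by majority vote. A real random forest also samples features at each split and grows deeper trees; the data and function names below are made up for illustration.

```python
import random
from collections import Counter


def train_stump(xs, ys):
    """Fit a one-split tree: pick the threshold with the fewest misclassifications."""
    majority = Counter(ys).most_common(1)[0][0]
    best = None  # (errors, threshold, left_label, right_label)
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    if best is None:  # all sampled x values identical: always predict the majority
        return lambda x: majority
    _, t, left_label, right_label = best
    return lambda x: left_label if x <= t else right_label


def train_forest(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]
        forest.append(train_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict(forest, x):
    """Majority vote over the predictions of all trees."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]


xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = ["no"] * 5 + ["yes"] * 5
forest = train_forest(xs, ys)
print(predict(forest, 1), predict(forest, 9))
```

Because every tree sees a slightly different sample, individual thresholds differ, and the vote averages out their disagreements.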
Support Vector Machines
• Support Vector Machines (SVM) are a powerful and versatile class of
supervised machine learning algorithms used for both classification
and regression tasks.
• SVMs are particularly effective in high-dimensional spaces and are
well-suited for situations where the data has complex relationships.
• They work by finding a hyperplane that best separates the data into
different classes or predicts a numerical value in the case of
regression.
• Hyperplane: In a binary classification problem, SVM aims to find the hyperplane that best separates the data points of one class from another. A hyperplane is a subspace with one fewer dimension than the input space. In two dimensions, the hyperplane is a line, and in three dimensions, it is a plane.
• Margin: The margin is the distance between the hyperplane and the nearest data point from either class. SVM seeks to maximize this margin, providing a robust separation between classes.
• Support Vectors: Support vectors are the data points that lie closest to the decision boundary (hyperplane). These points are crucial in determining the position and orientation of the hyperplane.
Image credit - https://www.mdpi.com/1996-1073/14/12/3393
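The hyperplane, margin, and support vector definitions can be checked numerically. This sketch does not train an SVM; it assumes a hyperplane is already given (the hypothetical line x1 + x2 = 5) and computes each point's distance to it, the margin, and which points act as support vectors. The four data points are made up.

```python
import math


def distance_to_hyperplane(point, w, b):
    """Geometric distance |w.x + b| / ||w|| from a point to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, point))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm


w, b = (1.0, 1.0), -5.0   # assumed separating line: x1 + x2 - 5 = 0
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
labels = [-1, -1, +1, +1]

distances = [distance_to_hyperplane(p, w, b) for p in points]
margin = min(distances)
# Support vectors: the points whose distance equals the margin.
support_vectors = [p for p, d in zip(points, distances) if abs(d - margin) < 1e-9]

print(margin)
print(support_vectors)
```

For these points the two support vectors are (2, 2) and (3, 3), one from each class; moving either of them would change the best hyperplane, while moving (1, 1) or (4, 4) slightly would not.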
KNN
• K-Nearest Neighbors (KNN) is a simple and
intuitive machine learning algorithm used
for both classification and regression tasks.
• It falls under the category of
instance-based or lazy learning algorithms
because it doesn't explicitly learn a model
during the training phase. Instead, it makes
predictions based on the similarity of new
instances to existing labeled instances in
the training dataset.
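A minimal KNN classifier sketch that reflects the "lazy" behaviour described above: there is no training step, and each prediction sorts the stored examples by distance to the query. The training points, labels, and the choice k = 3 are assumptions for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """train is a list of (features, label) pairs; returns the majority label of the k nearest."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 6.5), "B"), ((6.5, 7.0), "B"),
]

print(knn_predict(train, (1.2, 1.3)))   # A
print(knn_predict(train, (6.4, 6.6)))   # B
```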
Some common issues in training an AI algorithm
• Overfitting
• Underfitting
• Hyperparameter Tuning
• Class Imbalance
• Ethical Concerns and Bias

Image credit -Interactive Regression Lens for Exploring Scatter Plots (researchgate.net)
Outliers
Regularization
• Regularization is a technique used in machine learning and artificial
intelligence to prevent overfitting and improve the generalization of a
model. Overfitting occurs when a model learns the training data too
well, including its noise and outliers, to the extent that it performs
poorly on new, unseen data.
• Regularization methods add a penalty term to the model's objective
function, discouraging it from fitting the training data too closely and
promoting a simpler, more generalized solution. The regularization
term is typically based on the model's parameters, and it penalizes
large parameter values.
Types of regularization
• L1 Regularization (Lasso): In L1 regularization, the penalty term is the
sum of the absolute values of the model's coefficients multiplied by a
regularization parameter. This type of regularization can lead to some
coefficients being exactly zero, effectively performing feature
selection by eliminating less important features.
• L2 Regularization (Ridge): L2 regularization adds a penalty term that
is the sum of the squared coefficients of the model multiplied by a
regularization parameter. It tends to shrink the coefficients toward
zero but rarely results in exactly zero coefficients. L2 regularization is
useful for preventing large weights that may cause numerical
instability.
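The different effects of the two penalties can be seen in the simplest possible case: one feature, no intercept. This is a sketch under those assumptions, using the closed-form minimizers of sum((y - w*x)^2) + lam * w^2 for ridge and sum((y - w*x)^2) + lam * |w| for lasso. The data and the values of `lam` are made up.

```python
def ridge_weight(xs, ys, lam):
    """Minimizer of sum((y - w*x)^2) + lam * w^2: the denominator grows with lam."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def lasso_weight(xs, ys, lam):
    """Minimizer of sum((y - w*x)^2) + lam * |w|: soft-thresholding of sum(x*y)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)   # becomes exactly 0 for large lam
    return (1.0 if sxy >= 0 else -1.0) * shrunk / sxx


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]     # unregularized least-squares weight is 2.0

print(ridge_weight(xs, ys, 0.0))    # 2.0 (no penalty)
print(ridge_weight(xs, ys, 14.0))   # 1.0 (shrunk, but not zero)
print(lasso_weight(xs, ys, 14.0))   # 1.5 (shrunk)
print(lasso_weight(xs, ys, 60.0))   # 0.0 (coefficient eliminated)
```

Ridge only ever divides by a larger number, so the weight approaches zero without reaching it, whereas the lasso threshold can set the weight to exactly zero.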
Learn More…
• Course | 6.036 | MIT Open Learning Library
• https://machinelearningmastery.com/start-here/#python
Questions
Thank you

PO Box 21, Nawala, Nugegoda, Sri Lanka


Phone: +94 11 288 100
www.ou.ac.lk
