Multimedia Application L9

The document discusses logistic regression as a foundational supervised machine learning tool for classification, highlighting its importance in natural and social sciences. It explains the differences between generative and discriminative classifiers, the components of a probabilistic machine learning classifier, and the process of training and testing in logistic regression. Additionally, it covers the use of the sigmoid function for probability estimation and the cross-entropy loss function for optimizing model performance.


Multimedia Application
By

Minhaz Uddin Ahmed, PhD


Department of Computer Engineering
Inha University Tashkent.
Email: [Link]@[Link]
Content
 The sigmoid function
 Classification with Logistic Regression .
 Multinomial logistic regression
 Learning in Logistic Regression
 The cross-entropy loss function
Logistic Regression

 Important analytic tool in natural and social sciences


 Baseline supervised machine learning tool for classification
 Is also the foundation of neural networks
Generative and Discriminative
Classifiers
 Naive Bayes is a generative classifier

 by contrast:

 Logistic regression is a discriminative classifier


Generative and Discriminative
Classifiers
Suppose we're distinguishing cat from dog images

[Two example images: a cat and a dog, from ImageNet]
Generative Classifier:

• Build a model of what's in a cat image


• Knows about whiskers, ears, eyes
• Assigns a probability to any image:
• how cat-y is this image?

Also build a model for dog images

Now given a new image:


Run both models and see which one fits better
Discriminative Classifier

 Just try to distinguish dogs from cats

Oh look, dogs have collars!


Let's ignore everything else
Finding the correct class c from a document d

Generative vs Discriminative Classifiers

• Naive Bayes (generative) picks the class using the likelihood and the prior:
  ĉ = argmax_c P(d|c) P(c)

• Logistic Regression (discriminative) models the posterior P(c|d) directly:
  ĉ = argmax_c P(c|d)
Components of a probabilistic
machine learning classifier
 Given m input/output pairs (x(i),y(i)):
1. A feature representation of the input. For each input
observation x(i), a vector of features [x1, x2, ... , xn]. Feature j for
input x(i) is xj, more completely xj(i), or sometimes fj(x).
2. A classification function that computes ŷ, the estimated class,
via p(y|x), like the sigmoid or softmax functions.
3. An objective function for learning, like cross-entropy loss.
4. An algorithm for optimizing the objective function: stochastic
gradient descent.
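As a small Python sketch of component 1, here is one possible feature representation; the two features (a count of "awesome" and the log of the review length) match the example features used later in these slides, and the function name extract_features is just an illustration:

import math

def extract_features(review):
    # x1: how many times "awesome" appears; x2: log of the review length in words
    tokens = review.lower().split()
    x1 = tokens.count("awesome")
    x2 = math.log(len(tokens))
    return [x1, x2]

# extract_features("this movie was awesome , simply awesome") -> [2, 1.94...]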
The two phases of logistic regression

• Training: we learn weights w and b using stochastic gradient descent and cross-entropy loss.

• Test: Given a test example x we compute p(y|x) using the learned weights w and b, and return whichever label (y = 1 or y = 0) has the higher probability.
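A minimal sketch of the two phases using scikit-learn on a hypothetical toy dataset (note that scikit-learn's default solver is LBFGS rather than plain stochastic gradient descent, but the training/test division of work is the same):

from sklearn.linear_model import LogisticRegression

# Training phase: learn w and b from input/output pairs (x(i), y(i)).
X_train = [[2.0, 0.1], [0.0, 3.5], [1.5, 0.3], [0.2, 2.8]]   # toy feature vectors
y_train = [1, 0, 1, 0]                                       # toy labels
clf = LogisticRegression()
clf.fit(X_train, y_train)       # learned weights in clf.coef_, bias in clf.intercept_

# Test phase: compute p(y|x) for a new example and return the more probable label.
x_test = [[1.8, 0.2]]
p_y0, p_y1 = clf.predict_proba(x_test)[0]
y_hat = 1 if p_y1 > p_y0 else 0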
Classification in Logistic Regression

• Positive/negative sentiment
• Spam/not spam
• Authorship attribution (Hamilton or Madison?)
[Image: portrait of Alexander Hamilton]
Text Classification: definition

• Input:
  ◦ a document x
  ◦ a fixed set of classes C = {c1, c2, …, cJ}

• Output: a predicted class ĉ ∈ C

Binary Classification in Logistic
Regression
 Given a series of input/output pairs:

(x(i), y(i))
 For each observation x(i)

We represent x(i) by a feature vector [x1, x2,…, xn]


We compute an output: a predicted class ŷ(i) ∈ {0,1}
Features in logistic regression

• For feature xi, weight wi tells us how important xi is


• xi ="review contains ‘awesome’": wi = +10
• xj ="review contains ‘abysmal’": wj = -10
• xk =“review contains ‘mediocre’": wk = -2
Features in logistic regression

 The weight wi represents how important that input feature is to the classification
decision, and can be positive (providing evidence that the instance being
classified belongs in the positive class) or negative (providing evidence that the
instance being classified belongs in the negative class). Thus we might expect in a
sentiment task the word awesome to have a high positive weight, and abysmal to
have a very negative weight.
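To make this concrete, here is a sketch that scores a review with the three example weights above, treating each feature as a binary "the review contains this word" indicator (the scoring function and the bias of 0 are illustrative):

# Example weights from the slide above
weights = {"awesome": 10.0, "abysmal": -10.0, "mediocre": -2.0}

def score(review, weights, bias=0.0):
    # xi = 1 if the word appears in the review, else 0, so the score is just
    # the sum of the weights of the words that are present, plus the bias.
    tokens = set(review.lower().split())
    return sum(w for word, w in weights.items() if word in tokens) + bias

score("an awesome , awesome film")        # 10.0 -> evidence for the positive class
score("mediocre plot , abysmal acting")   # -12.0 -> evidence for the negative class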
Logistic Regression for one observation x

• Input observation: vector x = [x1, x2, …, xn]
• Weights: one per feature: W = [w1, w2, …, wn]
  ◦ Sometimes we call the weights θ = [θ1, θ2, …, θn]

• Output: a predicted class ŷ ∈ {0, 1}

(multinomial logistic regression: ŷ ∈ {0, 1, 2, 3, 4})


How to do classification

• For each feature xi, weight wi tells us the importance of xi
  ◦ (Plus we'll have a bias b)

• We'll sum up all the weighted features and the bias: z = w∙x + b

• If this sum z is high, we say y = 1; if low, then y = 0


But we want a probabilistic classifier

 We need to formalize “sum is high”.


 We’d like a principled classifier that
gives us a probability, just like Naive
Bayes did
 We want a model that can tell us:
p(y=1|x; θ)
p(y=0|x; θ)
The problem: z isn't a probability,
it's just a number!

• Solution: use a function of z that goes from 0 to 1
The very useful sigmoid or logistic function

σ(z) = 1 / (1 + e^(−z))
Idea of logistic regression

We’ll compute w∙x+b


And then we’ll pass it through the sigmoid function:
σ(w∙x+b)
And we'll just treat it as a probability
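A minimal sketch of this idea in Python (the weights, bias, and input below are made up for illustration):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))   # squashes any real z into (0, 1)

w = [2.0, -1.0]     # hypothetical weights
b = 0.5             # hypothetical bias
x = [1.0, 1.0]      # hypothetical feature vector

z = sum(wi * xi for wi, xi in zip(w, x)) + b   # w.x + b = 1.5
p_y1 = sigmoid(z)                              # treat sigma(w.x + b) as p(y=1|x), about 0.82
p_y0 = 1.0 - p_y1                              # p(y=0|x); note 1 - sigmoid(z) == sigmoid(-z)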
Making probabilities with sigmoids

 If we apply the sigmoid to the sum of the weighted features, we get a number
between 0 and 1. To make it a probability, we just need to make sure that the
two cases, p(y = 1) and p(y = 0), sum to 1.
Making probabilities with sigmoids

p(y = 1) = σ(w∙x + b)
p(y = 0) = 1 − σ(w∙x + b) = σ(−(w∙x + b))

because the sigmoid function has the property 1 − σ(z) = σ(−z)
Turning a probability into a classifier

ŷ = 1 if P(y = 1|x) > 0.5, otherwise ŷ = 0

0.5 here is called the decision boundary
Turning a probability into a classifier

[Plot: P(y = 1) = σ(w∙x + b) as a function of w∙x + b]
Turning a probability into a classifier

ŷ = 1 if w∙x + b > 0
ŷ = 0 if w∙x + b ≤ 0

We've seen how logistic regression uses the sigmoid function to take weighted
features for an input example x and assign it to the class 1 or 0.
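A sketch of the resulting decision rule; thresholding the probability at 0.5 gives the same label as checking the sign of w∙x + b, since σ(0) = 0.5:

import math

def decide(x, w, b):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p_y1 = 1.0 / (1.0 + math.exp(-z))    # sigmoid(z) = p(y=1|x)
    # p_y1 > 0.5 exactly when z > 0, so either test returns the same class
    return 1 if p_y1 > 0.5 else 0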
Logistic Regression: a text example on
sentiment classification

 Sentiment example: does y=1 or y=0?

It's hokey . There are virtually no surprises , and the writing is second-rate .

So why was it so enjoyable ? For one thing , the cast is


great . Another nice touch is the music . I was overcome with the urge to
get off the couch and start dancing . It sucked me in , and it'll do the same
to you .
Classifying sentiment for input x

Suppose we are given a weight vector w (one weight per feature) and a bias b = 0.1.

We then compute p(y = 1|x) = σ(w∙x + b).
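Here is the same computation as a sketch with hypothetical feature values and weights for the review above (only the bias b = 0.1 is taken from the slide):

import math

x = [3, 2, 1, 3, 0, 4.19]                  # hypothetical feature values for the review
w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]       # hypothetical weights, one per feature
b = 0.1                                    # bias from the slide

z = sum(wi * xi for wi, xi in zip(w, x)) + b     # about 0.833
p_positive = 1.0 / (1.0 + math.exp(-z))          # sigma(w.x + b), about 0.70
p_negative = 1.0 - p_positive                    # about 0.30, so we predict y = 1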
We can build features for logistic regression for
any classification task: period disambiguation

End of sentence:
This ends in a period.

Not end of sentence (the period in "St."):
The house at 465 Main St. is new.
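A sketch of what such features could look like for this task (the specific features and the toy abbreviation list are illustrative, not from the slides):

# Classify a period as end-of-sentence (1) or not (0) from simple contextual features.
ABBREVIATIONS = {"st", "dr", "mr", "mrs", "etc"}          # toy abbreviation list

def period_features(prev_word, next_word):
    x1 = 1 if prev_word.lower().rstrip(".") in ABBREVIATIONS else 0   # "St." suggests not sentence-final
    x2 = 1 if next_word[:1].isupper() else 0                          # capitalized next word suggests sentence-final
    x3 = 1 if next_word == "" else 0                                  # nothing follows: end of the text
    return [x1, x2, x3]

period_features("St.", "is")      # [1, 0, 0] for the period in "465 Main St. is new."
period_features("period.", "")    # [0, 0, 1] for the final period in "This ends in a period."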
Classification in (binary) logistic
regression: summary
 Given:
a set of classes: (+ sentiment,- sentiment)
a vector x of features [x1, x2, …, xn]
x1= count( "awesome")
x2 = log(number of words in review)
A vector w of weights [w1, w2, …, wn]
wi for each feature fi
Learning: Cross-Entropy Loss

• Where do the w's come from?

Supervised classification:
• We know the correct label y (either 0 or 1) for
each x.
• But what the system produces is an estimate, ŷ(i).
• We want to set w and b to minimize the distance
between our estimate ŷ(i) and the true y(i).
• We need a distance estimator: a loss function
or a cost function
• We need an optimization algorithm to update w
and b to minimize the loss.
Learning components

A loss function:
◦ cross-entropy loss

An optimization algorithm:
◦ stochastic gradient descent
Learning components

• This requires two components. The first is a metric for how close the current label
ŷ is to the true gold label y. Rather than measure similarity, we usually talk about
the opposite of this: the distance between the system output and the gold output,
and we call this distance the loss function or the cost function. We'll introduce the
loss function that is commonly used for logistic regression and also for neural
networks, the cross-entropy loss.
• The second thing we need is an optimization algorithm for iteratively updating
the weights so as to minimize this loss function. The standard algorithm for this is
gradient descent.
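As a rough sketch of that second component, one stochastic gradient descent update for a single training example could look like this; the gradient of the cross-entropy loss, (ŷ − y)∙x, is the standard result for logistic regression and is not derived in these slides, and the learning rate of 0.1 is arbitrary:

import math

def sgd_step(w, b, x, y, lr=0.1):
    # Forward pass: the current estimate y_hat = sigmoid(w.x + b)
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    y_hat = 1.0 / (1.0 + math.exp(-z))
    # Step w and b a small amount against the gradient of the cross-entropy loss
    w = [wi - lr * (y_hat - y) * xi for wi, xi in zip(w, x)]
    b = b - lr * (y_hat - y)
    return w, b

# Repeating this over shuffled training examples gradually drives the loss down.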
The distance between ŷ and y

We want to know how far the classifier output
ŷ = σ(w∙x + b)
is from the true output
y [= either 0 or 1]

We'll call this difference:
L(ŷ, y) = how much ŷ differs from the true y
Intuition of negative log likelihood
loss
= cross-entropy loss
 A case of conditional maximum likelihood
estimation
 We choose the parameters w,b that maximize
• the log probability
• of the true y labels in the training data
• given the observations x
Deriving cross-entropy loss for a
single observation x
Goal: maximize probability of the correct label p(y|x)
Since there are only 2 discrete outcomes (0 or 1), we can express
the probability p(y|x) from our classifier (the thing we want to
maximize) as

p(y|x) = ŷ^y (1 − ŷ)^(1 − y)

noting:
if y = 1, this simplifies to ŷ
if y = 0, this simplifies to 1 − ŷ
Deriving cross-entropy loss for a
single observation x
Goal: maximize probability of the correct label p(y|x)
Maximize:
p(y|x) = ŷ^y (1 − ŷ)^(1 − y)

• Now take the log of both sides (mathematically handy)

Maximize:
log p(y|x) = y log ŷ + (1 − y) log(1 − ŷ)

• Whatever values maximize log p(y|x) will also maximize p(y|x)


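Negating this quantity (so that maximizing the log likelihood becomes minimizing a loss) gives the cross-entropy loss for a single observation; a minimal sketch:

import math

def cross_entropy_loss(y_hat, y):
    # L_CE(y_hat, y) = -[ y*log(y_hat) + (1 - y)*log(1 - y_hat) ]
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

cross_entropy_loss(0.9, 1)   # about 0.105: confident and correct, small loss
cross_entropy_loss(0.9, 0)   # about 2.303: confident but wrong, large loss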
Reference

Chapter 5
Question
Thank you
