Multimedia Application
By
Minhaz Uddin Ahmed, PhD
Department of Computer Engineering
Inha University Tashkent.
Email: [Link]@[Link]
Content
The sigmoid function
Classification with Logistic Regression
Multinomial logistic regression
Learning in Logistic Regression
The cross-entropy loss function
Logistic Regression
Important analytic tool in natural and social sciences
Baseline supervised machine learning tool for classification
Also the foundation of neural networks
Generative and Discriminative Classifiers
Naive Bayes is a generative classifier
by contrast:
Logistic regression is a discriminative classifier
Generative and Discriminative Classifiers
Suppose we're distinguishing cat from dog images
[Example cat and dog images from ImageNet]
Generative Classifier:
• Build a model of what's in a cat image
• Knows about whiskers, ears, eyes
• Assigns a probability to any image:
• how cat-y is this image?
Also build a model for dog images
Now given a new image:
Run both models and see which one fits better
Discriminative Classifier
Just try to distinguish dogs from cats
Oh look, dogs have collars!
Let's ignore everything else
Finding the correct class c from a document d in
Generative vs Discriminative Classifiers
Naive Bayes (generative): ĉ = argmax over c of P(d|c) P(c)   (likelihood × prior)
Logistic Regression (discriminative): ĉ = argmax over c of P(c|d)   (the posterior, modeled directly)
Components of a probabilistic machine learning classifier
Given m input/output pairs (x(i),y(i)):
1. A feature representation of the input. For each input
observation x(i), a vector of features [x1, x2, ... , xn]. Feature j for
input x(i) is xj, more completely xj(i), or sometimes fj(x).
2. A classification function that computes ŷ, the estimated class, via p(y|x), like the sigmoid or softmax functions.
3. An objective function for learning, like cross-entropy loss.
4. An algorithm for optimizing the objective function: stochastic
gradient descent.
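As a rough illustration, here is a minimal Python sketch of these four components together; the toy feature vector, label, and learning rate are assumptions for illustration, not part of the slides:

```python
import numpy as np

# 1. Feature representation: each input x is a vector of n features
x = np.array([3.0, 2.0, 1.0])                  # e.g. [x1, x2, x3]
y = 1                                          # true label for this example

# 2. Classification function: sigmoid of w.x + b gives p(y=1|x)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    return sigmoid(np.dot(w, x) + b)

# 3. Objective function: cross-entropy loss for one example
def cross_entropy(y_hat, y):
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# 4. Optimization: one stochastic gradient descent step on (x, y)
def sgd_step(w, b, x, y, lr=0.1):
    y_hat = predict_proba(w, b, x)
    w = w - lr * (y_hat - y) * x               # dLoss/dw = (y_hat - y) * x
    b = b - lr * (y_hat - y)                   # dLoss/db = (y_hat - y)
    return w, b

w, b = np.zeros(3), 0.0
print(cross_entropy(predict_proba(w, b, x), y))   # loss before the update (~0.693)
w, b = sgd_step(w, b, x, y)
print(cross_entropy(predict_proba(w, b, x), y))   # loss after one step (smaller)
```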
The two phases of logistic regression
Training: we learn weights w and b using stochastic
gradient descent and cross-entropy loss.
Test: Given a test example x, we compute p(y|x) using the learned
weights w and b, and return whichever label (y = 1 or y = 0) has the
higher probability.
Classification in Logistic Regression
Positive/negative sentiment
Spam/not spam
Authorship attribution (Hamilton or Madison?)
Text Classification: definition
Input:
a document x
a fixed set of classes C = {c1, c2,…, cJ}
Output: a predicted class ĉ ∈ C
Binary Classification in Logistic Regression
Given a series of input/output pairs:
(x(i), y(i))
For each observation x(i)
We represent x(i) by a feature vector [x1, x2,…, xn]
We compute an output: a predicted class ŷ(i) ∈ {0,1}
Features in logistic regression
• For feature xi, the weight wi tells us how important xi is
• xi ="review contains ‘awesome’": wi = +10
• xj ="review contains ‘abysmal’": wj = -10
• xk =“review contains ‘mediocre’": wk = -2
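To make the role of these example weights concrete, here is a small sketch; the presence/absence feature values and the dictionary-based setup are assumptions for illustration:

```python
# Binary indicator features for one review, using the illustrative weights above
features = {"awesome": 1, "abysmal": 0, "mediocre": 1}   # which words appear in the review
weights  = {"awesome": 10.0, "abysmal": -10.0, "mediocre": -2.0}

# Weighted sum of the features (bias omitted here for simplicity)
score = sum(weights[w] * features[w] for w in weights)
print(score)   # 10.0 - 2.0 = 8.0 -> evidence for the positive class
```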
Features in logistic regression
The weight wi represents how important that input feature is to the classification
decision, and can be positive (providing evidence that the instance being
classified belongs in the positive class) or negative (providing evidence that the
instance being classified belongs in the negative class). Thus we might expect in a
sentiment task the word awesome to have a high positive weight, and abysmal to
have a very negative weight.
Logistic Regression for one observation x
Input observation: vector x = [x1, x2,…, xn]
Weights: one per feature: W = [w1, w2,…, wn]
Sometimes we call the weights θ = [θ1, θ2,…, θn]
Output: a predicted class ŷ ∈ {0,1}
(multinomial logistic regression: ŷ ∈ {0, 1, 2, 3, 4})
How to do classification
For each feature xi, weight wi tells us importance of xi
(Plus we'll have a bias b)
We'll sum up all the weighted features and the bias: z = w1x1 + w2x2 + … + wnxn + b = w∙x + b
If this sum z is high, we say y=1; if low, then y=0
But we want a probabilistic classifier
We need to formalize “sum is high”.
We’d like a principled classifier that
gives us a probability, just like Naive
Bayes did
We want a model that can tell us:
p(y=1|x; θ)
p(y=0|x; θ)
The problem: z isn't a probability, it's just a number!
Solution: use a function of z that goes from 0 to 1
The very useful sigmoid or logistic function
σ(z) = 1 / (1 + e^(−z)), which maps any real-valued z to a value between 0 and 1
Idea of logistic regression
We’ll compute w∙x+b
And then we’ll pass it through the sigmoid function:
σ(w∙x+b)
And we'll just treat it as a probability
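A minimal numeric sketch of this idea; the weights, bias, and feature values below are invented for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = [0.5, -1.0, 2.0]        # one weight per feature (illustrative values)
b = 0.1                     # bias term
x = [1.0, 0.0, 3.0]         # feature vector for one observation

z = sum(wi * xi for wi, xi in zip(w, x)) + b    # z = w.x + b
p = sigmoid(z)              # treated as P(y=1 | x)
print(z, p)                 # 6.6, ~0.9986
```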
Making probabilities with sigmoids
If we apply the sigmoid to the sum of the weighted features, we get a number
between 0 and 1. To make it a probability, we just need to make sure that the
two cases, p(y = 1) and p(y = 0), sum to 1.
Making probabilities with sigmoids
Because the sigmoid function has the property 1 − σ(z) = σ(−z), we can write:
P(y=1) = σ(w∙x + b)
P(y=0) = 1 − σ(w∙x + b) = σ(−(w∙x + b))
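A quick numeric check of this property; the value of z is arbitrary:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z = 0.833
p1 = sigmoid(z)          # P(y=1) when z = w.x + b
p0 = sigmoid(-z)         # P(y=0), using 1 - sigmoid(z) = sigmoid(-z)
print(p1, p0, p1 + p0)   # ~0.697, ~0.303, sums to 1
```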
Turning a probability into a classifier
ŷ = 1 if P(y=1|x) > 0.5, otherwise ŷ = 0
0.5 here is called the decision boundary
Turning a probability into a classifier
[Plot: the sigmoid curve of P(y=1) against w∙x + b, crossing 0.5 where w∙x + b = 0]
Turning a probability into a classifier
ŷ = 1 if w∙x + b > 0
ŷ = 0 if w∙x + b ≤ 0
We've seen how logistic regression uses the sigmoid function to take weighted
features for an input example x and assign it to the class 1 or 0.
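Here is a small sketch of that decision rule; the helper name classify and the example weights are assumptions for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def classify(w, b, x, threshold=0.5):
    """Return 1 or 0 for input features x, using the 0.5 decision boundary."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if sigmoid(z) > threshold else 0

# Equivalent shortcut: sigmoid(z) > 0.5 exactly when z = w.x + b > 0
print(classify([2.0, -3.0], 0.5, [1.0, 1.0]))   # z = -0.5 -> class 0
```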
Logistic Regression: a text example on sentiment classification
Sentiment example: does y=1 or y=0?
It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is
great . Another nice touch is the music . I was overcome with the urge to
get off the couch and start dancing . It sucked me in , and it'll do the same
to you .
Classifying sentiment for input x
Suppose w = [w1, w2, …, wn], one weight per feature, and
b = 0.1
Classifying sentiment for input x
p(+|x) = P(y=1|x) = σ(w∙x + b)
p(−|x) = P(y=0|x) = 1 − σ(w∙x + b)
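A hedged worked version of this computation; the specific feature values and weights below follow the running sentiment example in the referenced Chapter 5 and are assumptions here, not taken from the slides:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Assumed feature values for the review above (Chapter 5 running example):
# positive-lexicon count, negative-lexicon count, "no" present, pronoun count,
# "!" present, log(word count)
x = [3, 2, 1, 3, 0, 4.19]
w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]   # assumed weights
b = 0.1

z = sum(wi * xi for wi, xi in zip(w, x)) + b
print(round(z, 3))               # ~0.833
print(round(sigmoid(z), 2))      # p(+|x) ~ 0.70
print(round(1 - sigmoid(z), 2))  # p(-|x) ~ 0.30
```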
We can build features for logistic regression for any classification task: period disambiguation
End of sentence: "This ends in a period."
Not end of sentence: the period after "St." in "The house at 465 Main St. is new."
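One possible sketch of such features; the particular features and the small abbreviation list are assumptions for illustration:

```python
def period_features(tokens, i):
    """Features for deciding whether the period at position i ends a sentence."""
    prev_word = tokens[i - 1] if i > 0 else ""
    next_word = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "prev_is_abbreviation": 1 if prev_word in {"St", "Mr", "Dr", "Inc"} else 0,
        "next_is_capitalized": 1 if next_word[:1].isupper() else 0,
        "prev_is_digit": 1 if prev_word.isdigit() else 0,
    }

tokens = ["The", "house", "at", "465", "Main", "St", ".", "is", "new", "."]
print(period_features(tokens, 6))   # period after "St" -> likely not end of sentence
```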
Classification in (binary) logistic regression: summary
Given:
a set of classes: (+ sentiment,- sentiment)
a vector x of features [x1, x2, …, xn]
x1 = count("awesome")
x2 = log(number of words in review)
A vector w of weights [w1, w2, …, wn], with one weight wi for each feature xi
Output: ŷ = 1 if σ(w∙x + b) > 0.5, else ŷ = 0
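Putting the summary together as a short end-to-end sketch; the two features mirror the bullets above, while the weight values and the example review are assumptions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def extract_features(review):
    words = review.lower().split()
    return [words.count("awesome"),            # x1 = count("awesome")
            math.log(len(words))]              # x2 = log(number of words in review)

w = [1.5, -0.3]   # assumed weights, one per feature
b = 0.0

x = extract_features("this movie was awesome , simply awesome !")
z = sum(wi * xi for wi, xi in zip(w, x)) + b
label = "+" if sigmoid(z) > 0.5 else "-"
print(x, round(sigmoid(z), 2), label)          # probability ~0.92 -> "+"
```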
Learning: Cross-Entropy Loss
Where did the w's come from?
Supervised classification:
• We know the correct label y (either 0 or 1) for
each x.
• But what the system produces is an estimate, ŷ
• We want to set w and b to minimize the distance between our estimate ŷ(i) and the true y(i).
• We need a distance estimator: a loss function
or a cost function
• We need an optimization algorithm to update w
and b to minimize the loss.
Learning components
A loss function:
◦ cross-entropy loss
An optimization algorithm:
◦ stochastic gradient descent
Learning components
This requires two components. The first is a metric for how close the current label
(ŷ) is to the true gold label y. Rather than measure similarity, we usually talk about
the opposite of this: the distance between the system output and the gold output,
and we call this distance the loss function or the cost function. We'll introduce the
loss function that is commonly used for logistic regression and also for neural
networks, the cross-entropy loss.
The second thing we need is an optimization algorithm for iteratively updating
the weights so as to minimize this loss function. The standard algorithm for this is
gradient descent.
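A minimal sketch of this training setup, combining the cross-entropy gradient with stochastic gradient descent on a toy dataset; the data, learning rate, and epoch count are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dataset: 2 features per example, binary labels
X = np.array([[1.0, 0.0], [2.0, 1.0], [0.0, 2.0], [0.5, 3.0]])
y = np.array([1, 1, 0, 0])

w = np.zeros(2)
b = 0.0
lr = 0.5

for epoch in range(100):
    for x_i, y_i in zip(X, y):                 # stochastic: one example at a time
        y_hat = sigmoid(np.dot(w, x_i) + b)
        # Gradient of cross-entropy loss: (y_hat - y) * x for w, (y_hat - y) for b
        w -= lr * (y_hat - y_i) * x_i
        b -= lr * (y_hat - y_i)

print(w, b, sigmoid(X @ w + b).round(2))       # probabilities roughly [1, 1, 0, 0]
```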
The distance between ŷ and y
We want to know how far the classifier output
ŷ = σ(w∙x + b)
is from the true output:
y [= either 0 or 1]
We'll call this difference:
L(ŷ, y) = how much ŷ differs from the true y
Intuition of negative log likelihood loss = cross-entropy loss
A case of conditional maximum likelihood estimation
We choose the parameters w,b that maximize
• the log probability
• of the true y labels in the training data
• given the observations x
Deriving cross-entropy loss for a single observation x
Goal: maximize probability of the correct label p(y|x)
Since there are only 2 discrete outcomes (0 or 1) we can express
the probability p(y|x) from our classifier (the thing we want to
maximize) as
p(y|x) = ŷ^y (1 − ŷ)^(1−y)
noting:
if y=1, this simplifies to ŷ
if y=0, this simplifies to 1 − ŷ
Deriving cross-entropy loss for a single observation x
Goal: maximize probability of the correct label p(y|x)
Maximize: p(y|x) = ŷ^y (1 − ŷ)^(1−y)
Now take the log of both sides (mathematically handy)
Maximize: log p(y|x) = y log ŷ + (1 − y) log(1 − ŷ)
Whatever values maximize log p(y|x) will also maximize p(y|x)
Flipping the sign turns this into a loss to minimize, the cross-entropy loss: L_CE(ŷ, y) = −[y log ŷ + (1 − y) log(1 − ŷ)]
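A small sketch of the resulting loss, showing that it is low when the classifier assigns high probability to the correct label and high when it does not; the example probabilities are made up:

```python
import math

def cross_entropy(y_hat, y):
    """L_CE(y_hat, y) = -[y*log(y_hat) + (1-y)*log(1-y_hat)]"""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(round(cross_entropy(0.70, 1), 3))   # ~0.357: fairly confident and correct -> low loss
print(round(cross_entropy(0.70, 0), 3))   # ~1.204: fairly confident and wrong -> high loss
print(round(cross_entropy(0.99, 1), 3))   # ~0.010: very confident and correct -> tiny loss
```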
Reference
Jurafsky, D. and Martin, J. H., Speech and Language Processing, Chapter 5 (Logistic Regression)
Questions?
Thank you