Text Summarization
(Natural Language Processing)
Advanced Data Science
Dr. Hetal V. Gandhi
Welcome to the Digital Regenesys course in Natural Language Processing, Session 6
We will be starting shortly …
REGENESYS’ INTEGRATED LEADERSHIP AND
MANAGEMENT MODEL
Holistic focus on the individual (SQ,
EQ, IQ, and PQ)
Interrelationships are dynamic
between individual, team, institution
and the external environment
(systemic)
Strategy affects individual, team,
organisational, and environmental
performance
Delivery requires alignment of
strategy,
structure, systems and culture
REGENESYS GRADUATE ATTRIBUTES
Ground Rules
• Be open-minded
• Listen carefully
• Avoid doing other unrelated tasks when attending the session so that
you can be FOCUSED.
• Raise your hand if you have any query so that we can ensure one
conversation at a time
• When speaking, use “I think”, “I feel”, etc. (you are a very important
aspect of this learning)
• Respect the opinions of others
• Give constructive feedback
• Build on the ideas of others rather than destroying them
• Take some risks and share new ideas
• Have fun and ENJOY the learning experience!
Know your facilitator
Dr. Hetal Gandhi
Teaching experience of 13 years in Computer Science and Engineering
Expertise:
o Data Science and Machine Learning
o Data Structures and Algorithms
o Natural Language Processing
o Programming skills
She was awarded a PhD for contributions towards “Aspect Based Sentiment Analysis of Hindi Reviews” in the area of Natural Language Processing.
She is actively involved in research and has had more than 10 papers published in renowned conferences and journals.
Contents
• Text data pre-processing
• Text vectorization
• Text embeddings
• Text classification
• Movies recommendation system
• Text Summarization
Contents for today’s session
• Text Summarization Presentation
o Introduction
o Types
o Use Cases
o Steps
• Hands-on on Text Summarization
• Hands-on for deriving vectors for each document using embeddings (Doc2Vec)
Text Summarization
Text summarization is the problem of creating a short, accurate, and fluent summary of a longer text document.
Advantages:
• Surfaces the most relevant and important information in a short amount of time
• Allows large amounts of text to be analyzed
Types of Text Summarization
Extractive Methods
• Selection of phrases and sentences from the source
• Involves RANKING of phrases to create a targeted summary
Abstractive Methods
• Generates entirely new phrases and sentences to form a summary
• Human-like approach but more challenging
Types of Text Summarization, Closer Look
Use Cases / Applications / Every-day examples
• News headlines, e.g. Inshorts.com
• Government or private organization reports
• Court orders, which can span more than 100 pages
• Research paper summaries
• Minutes of a meeting
• Reviews of movies
• Reviews of a book
• Biographies or resumes
• Bulletins of weather forecasts / stock market reports
• Sound bites of politicians on a current issue
• Histories or chronologies of salient events
Steps for Extractive Text Summarization
1) Tokenization
2) Get word_frequency
3) Get normalized_word_frequency
4) Sentence Tokenization
5) Calculate score for each sentence
sentence_score = sum of normalized_word_frequency of words in the sentence
6) Rank the sentences- Use sentence_score to get important sentences (to form a summary)
NOTE: Other than normalized word frequencies, more advanced methods consider factors like
sentence position, keyword importance, and semantic relevance for more accurate
summarization.
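The six steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code used in the hands-on session: the function names (tokenize, split_sentences, extractive_summary), the regex-based tokenizer, the lowercasing, and the max_sentences parameter are all choices made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; hyphenated words such as 'problem-solving' stay whole."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def split_sentences(text):
    """Naive split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def extractive_summary(text, max_sentences=1):
    tokens = tokenize(text)                        # 1) tokenization
    word_frequency = Counter(tokens)               # 2) word_frequency
    total = len(tokens)
    normalized_word_frequency = {                  # 3) normalized_word_frequency
        word: count / total for word, count in word_frequency.items()
    }
    sentences = split_sentences(text)              # 4) sentence tokenization
    scores = [                                     # 5) score for each sentence
        sum(normalized_word_frequency[w] for w in tokenize(s)) for s in sentences
    ]
    # 6) rank sentence indices by score, keep the best ones,
    #    then restore document order so the summary reads naturally
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


sample = "Cats chase mice. Cats sleep. Dogs bark loudly at cats and mice."
print(extractive_summary(sample, max_sentences=2))
```

Because the score is a sum over every word in a sentence, longer sentences tend to score higher. That bias is one reason more advanced methods bring in the additional factors listed in the note above.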
Goal of Extractive Summarization
Large Document:
Artificial intelligence, often abbreviated as AI, is the simulation of human intelligence processes by machines, especially computer systems. These processes include learning, reasoning, problem-solving, perception, and language understanding. AI has become a significant part of our daily lives, from voice assistants like Siri and Alexa to recommendation systems on streaming platforms.
Summary from ranked sentences:
Sentence | Score (sum of word_frequency) | Rank
Artificial intelligence, often abbreviated as AI, is the simulation of human intelligence processes by machines, especially computer systems. | 24 | 2
These processes include learning, reasoning, problem-solving, perception, and language understanding. | 12 | 3
AI has become a significant part of our daily lives, from voice assistants like Siri and Alexa to recommendation systems on streaming platforms. | 27 | 1
Step 1 - Tokenization: First, tokenize the document and collect the list of all UNIQUE TOKENS appearing in it.
[artificial, intelligence, often, abbreviated, as, AI, is, the, simulation, of, human, processes, by, machines, especially, computer, systems, these, include, learning, reasoning, problem-solving, perception, and, language, understanding, …]
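A small sketch of this tokenization step on the sample document. The regular expression, the lowercasing of every token (so "AI" becomes "ai"), and the variable names are assumptions made for illustration; a library tokenizer could be substituted.

```python
import re

document = (
    "Artificial intelligence, often abbreviated as AI, is the simulation of "
    "human intelligence processes by machines, especially computer systems. "
    "These processes include learning, reasoning, problem-solving, perception, "
    "and language understanding. AI has become a significant part of our daily "
    "lives, from voice assistants like Siri and Alexa to recommendation systems "
    "on streaming platforms."
)

# Assumption: lowercase the text and keep hyphenated words as a single token.
tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", document.lower())

# dict.fromkeys drops repeats while preserving first-seen order.
unique_tokens = list(dict.fromkeys(tokens))

print(len(tokens), "tokens,", len(unique_tokens), "unique")
print(unique_tokens)
```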
Step 2 - Get word_frequency: Count how often each unique word occurs in the overall document.
{artificial-1, intelligence-2, often-1, abbreviated-1, as-1, AI-2, is-1, the-1, simulation-1, of-2, human-1, processes-2, by-1, machines-1, especially-1, computer-1, systems-2, these-1, include-1, learning-1, reasoning-1, problem-solving-1, perception-1, and-2, language-1, understanding-1, …}
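One possible way to compute word_frequency is with collections.Counter from the Python standard library. The tokenizer (regex plus lowercasing) and the variable names are assumptions for this sketch; the counts cover all three sentences of the sample document.

```python
import re
from collections import Counter

document = (
    "Artificial intelligence, often abbreviated as AI, is the simulation of "
    "human intelligence processes by machines, especially computer systems. "
    "These processes include learning, reasoning, problem-solving, perception, "
    "and language understanding. AI has become a significant part of our daily "
    "lives, from voice assistants like Siri and Alexa to recommendation systems "
    "on streaming platforms."
)

# Assumption: lowercase the text and keep hyphenated words as a single token.
tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", document.lower())

# word_frequency maps each unique token to its count in the whole document.
word_frequency = Counter(tokens)

for word, count in word_frequency.most_common(6):
    print(f"{word}-{count}")
```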
Step 3 - Get normalized_word_frequency: Divide each word_frequency by Number_of_tokens, the total number of word tokens in the complete document (51 here, so a frequency of 1 becomes 1/51 ≈ 0.02 and a frequency of 2 becomes 2/51 ≈ 0.04).
{artificial-0.02, intelligence-0.04, often-0.02, abbreviated-0.02, as-0.02, AI-0.04, is-0.02, the-0.02, simulation-0.02, of-0.04, human-0.02, processes-0.04, by-0.02, machines-0.02, especially-0.02, computer-0.02, systems-0.04, these-0.02, include-0.02, learning-0.02, reasoning-0.02, problem-solving-0.02, perception-0.02, and-0.04, language-0.02, understanding-0.02, …}
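The normalization can be sketched as a dictionary comprehension. As before, the tokenizer and the variable names are assumptions for illustration. Since every count is divided by the same total, the normalized values sum to 1.

```python
import re
from collections import Counter

document = (
    "Artificial intelligence, often abbreviated as AI, is the simulation of "
    "human intelligence processes by machines, especially computer systems. "
    "These processes include learning, reasoning, problem-solving, perception, "
    "and language understanding. AI has become a significant part of our daily "
    "lives, from voice assistants like Siri and Alexa to recommendation systems "
    "on streaming platforms."
)

# Assumption: lowercase the text and keep hyphenated words as a single token.
tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", document.lower())
word_frequency = Counter(tokens)

# Number_of_tokens counts every token occurrence, not just unique words.
number_of_tokens = len(tokens)
normalized_word_frequency = {
    word: count / number_of_tokens for word, count in word_frequency.items()
}

for word in ("artificial", "intelligence"):
    print(word, round(normalized_word_frequency[word], 2))
```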
Step 4 - Sentence Tokenization: Use ‘.’ to divide the document into sentences.
No. | Sentence
1 | Artificial intelligence, often abbreviated as AI, is the simulation of human intelligence processes by machines, especially computer systems.
2 | These processes include learning, reasoning, problem-solving, perception, and language understanding.
3 | AI has become a significant part of our daily lives, from voice assistants like Siri and Alexa to recommendation systems on streaming platforms.
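Splitting on ‘.’ can be sketched with str.split. The variable names are assumptions for this example. Note that this simple rule would also break on abbreviations such as "Dr." or on decimal numbers, so it is only suitable for clean text like the sample document.

```python
document = (
    "Artificial intelligence, often abbreviated as AI, is the simulation of "
    "human intelligence processes by machines, especially computer systems. "
    "These processes include learning, reasoning, problem-solving, perception, "
    "and language understanding. AI has become a significant part of our daily "
    "lives, from voice assistants like Siri and Alexa to recommendation systems "
    "on streaming platforms."
)

# Split on '.', drop empty pieces, and put the full stop back on each sentence.
sentences = [part.strip() + "." for part in document.split(".") if part.strip()]

for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```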
Step 5 - Calculate the score for each sentence:
sentence_score = sum of normalized_word_frequency of words in the sentence
Sentence 1 score: 18 tokens; 12 have frequency 1 (1/51 ≈ 0.02 each) and 6 have frequency 2 (2/51 ≈ 0.04 each: intelligence twice, AI, of, processes, systems), so the score is 24/51 ≈ 0.47
Sentence 2 score: 10 tokens; 8 have frequency 1 and 2 have frequency 2 (processes, and), so the score is 12/51 ≈ 0.24
Sentence 3 score: 23 tokens; 19 have frequency 1 and 4 have frequency 2 (AI, of, and, systems), so the score is 27/51 ≈ 0.53
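The sentence scores can be checked with a short script. This is a sketch under the same assumptions as the earlier steps (regex tokenizer with lowercasing, split on ‘.’, illustrative names); it works with unrounded frequencies, so printed values may differ slightly from sums of the rounded 0.02 and 0.04 figures.

```python
import re
from collections import Counter

document = (
    "Artificial intelligence, often abbreviated as AI, is the simulation of "
    "human intelligence processes by machines, especially computer systems. "
    "These processes include learning, reasoning, problem-solving, perception, "
    "and language understanding. AI has become a significant part of our daily "
    "lives, from voice assistants like Siri and Alexa to recommendation systems "
    "on streaming platforms."
)


def tokenize(text):
    # Assumption: lowercase the text and keep hyphenated words as a single token.
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


tokens = tokenize(document)
normalized_word_frequency = {
    word: count / len(tokens) for word, count in Counter(tokens).items()
}
sentences = [part.strip() + "." for part in document.split(".") if part.strip()]

# sentence_score = sum of normalized_word_frequency over every token in the sentence
sentence_scores = [
    sum(normalized_word_frequency[word] for word in tokenize(sentence))
    for sentence in sentences
]

for number, score in enumerate(sentence_scores, start=1):
    print(f"Sentence {number} score: {score:.2f}")
```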
Step 6 - Rank the sentences
Summary from ranked sentences:
No. | Sentence | Score (sum of word_frequency) | Normalized score | Rank
1 | Artificial intelligence, often abbreviated as AI, is the simulation of human intelligence processes by machines, especially computer systems. | 24 | 0.47 | 2
2 | These processes include learning, reasoning, problem-solving, perception, and language understanding. | 12 | 0.24 | 3
3 | AI has become a significant part of our daily lives, from voice assistants like Siri and Alexa to recommendation systems on streaming platforms. | 27 | 0.53 | 1
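Ranking and selecting the summary can be sketched as below. The names (ranking, rank_of, top_k) and the choice of top_k = 1 are assumptions for illustration; the tokenizer and sentence splitter follow the same simple rules as the earlier steps.

```python
import re
from collections import Counter

document = (
    "Artificial intelligence, often abbreviated as AI, is the simulation of "
    "human intelligence processes by machines, especially computer systems. "
    "These processes include learning, reasoning, problem-solving, perception, "
    "and language understanding. AI has become a significant part of our daily "
    "lives, from voice assistants like Siri and Alexa to recommendation systems "
    "on streaming platforms."
)


def tokenize(text):
    # Assumption: lowercase the text and keep hyphenated words as a single token.
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


tokens = tokenize(document)
normalized = {word: count / len(tokens) for word, count in Counter(tokens).items()}
sentences = [part.strip() + "." for part in document.split(".") if part.strip()]
sentence_scores = [sum(normalized[w] for w in tokenize(s)) for s in sentences]

# Sentence indices ordered from highest to lowest score.
ranking = sorted(range(len(sentences)), key=lambda i: sentence_scores[i], reverse=True)
rank_of = {index: rank for rank, index in enumerate(ranking, start=1)}

# Keep the top_k sentences, then restore their original document order.
top_k = 1  # hypothetical summary length chosen for this example
summary = " ".join(sentences[i] for i in sorted(ranking[:top_k]))

for index, sentence in enumerate(sentences):
    print(f"Rank {rank_of[index]}: {sentence}")
print("Summary:", summary)
```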
Creating an Application
Practice Assignment on Text Summarization
Apply the Extractive Summarization method to the document below. The summary should contain no more than 20% of the document's sentences.
text = """Natural Language Processing (NLP) is an intricate field focused on the challenge of understanding human
language. One of its core aspects is handling ‘stop words’ – words which, due to their high frequency in text, often don’t
offer significant insights on their own.
Stop words like ‘the’, ‘and’, and ‘I’, although common, don’t usually provide meaningful information about a document’s
specific topic. By eliminating these words from a corpus, we can more easily identify unique and relevant terms.
It’s important to note that there isn’t a universally accepted list of stop words in NLP. However, the Natural Language
Toolkit, or NLTK, does offer a list for researchers and practitioners to utilize.
Throughout this guide, you’ll discover how to efficiently remove stop words using the nltk module, streamlining your text
data for better analysis.
We’ll be building upon code from a prior tutorial that dealt with tokenizing words.
Despite being crucial for sentence structure, most stop words don’t enhance our understanding of sentence semantics.
Thankfully, with NLTK, you don’t have to manually define every stop word. The library already includes a predefined list of
common words that typically don’t carry much semantic weight. NLTK’s default list contains 40 such words, for example:
“a”, “an”, “the”, and “of”.
This paragraph is taken from “https://pythonspot.com/nltk-stop-words/”
"""
References
A Gentle Introduction to Text Summarization by Jason Brownlee:
https://machinelearningmastery.com/gentle-introduction-text-summarization/
Text Summarization review research paper:
https://www.sciencedirect.com/science/article/pii/S1319157820303712
End of module
End of Session