
Session NLP Hetal 6

The document outlines a session on Natural Language Processing (NLP) focusing on text summarization, including its types, use cases, and steps for extractive summarization. It emphasizes the importance of summarization in efficiently analyzing large amounts of text and provides a structured approach to implementing extractive methods. Additionally, it introduces practical assignments and references for further reading on the topic.

Uploaded by

shubhra.goyal

Text Summarization

(Natural Language Processing)

Advanced Data Science

Dr. Hetal V. Gandhi


Welcome to the Digital Regenesys course in Natural Language Processing, Session 6

We will be starting shortly …


REGENESYS’ INTEGRATED LEADERSHIP AND MANAGEMENT MODEL

• Holistic focus on the individual (SQ, EQ, IQ, and PQ)
• Interrelationships are dynamic between individual, team, institution and the external environment (systemic)
• Strategy affects individual, team, organisational, and environmental performance
• Delivery requires alignment of strategy, structure, systems and culture
REGENESYS GRADUATE ATTRIBUTES
Ground Rules

• Be open-minded
• Listen carefully
• Avoid doing other unrelated tasks when attending the session so that
you can be FOCUSED.
• Raise your hand if you have any query so that we can ensure one
conversation at a time
• When speaking, use “I think”, “I feel”, etc. (you are a very important
aspect of this learning)
• Respect the opinions of others
• Give constructive feedback
• Build on the ideas of others rather than destroying them
• Take some risks and share new ideas
• Have fun and ENJOY the learning experience!
Know your facilitator

Dr. Hetal Gandhi
• Teaching experience of 13 years in Computer Science and Engineering
• Expertise:
  o Data Science and Machine Learning
  o Data Structures and Algorithms
  o Natural Language Processing
  o Programming skills
• She was awarded a PhD for her contributions towards “Aspect Based Sentiment Analysis of Hindi Reviews” in the area of Natural Language Processing.
• She is actively involved in research and has had more than 10 papers published in renowned conferences and journals.
Contents

• Text data pre-processing


• Text vectorization
• Text embeddings
• Text classification
• Movies recommendation system
• Text Summarization

Contents for today’s session

• Text Summarization Presentation
  o Introduction
  o Types
  o Use Cases
  o Steps
• Hands-on on Text Summarization
• Hands-on for deriving vectors for each document using embeddings (Doc2Vec)
Text Summarization

• Text summarization is the problem of creating a short, accurate, and fluent summary of a longer text document.
• Advantages:
  o Discover the most relevant and important information in a small amount of time
  o Allows analyzing large amounts of text
Types of Text Summarization

• Extractive Methods
  o Selection of phrases and sentences from the source
  o Involves RANKING of phrases to create a targeted summary
• Abstractive Methods
  o Generates entirely new phrases and sentences to form a summary
  o Human-like approach, but more challenging

Types of Text Summarization, Closer Look
Use Cases / Applications / Everyday examples

• News headlines, e.g. Inshorts.com
• Government or private organization reports
• Court orders spanning more than 100 pages
• Research paper summaries
• Minutes of a meeting
• Reviews of movies
• Reviews of a book
• Biographies or resumes
• Bulletins of weather forecasts / stock market reports
• Sound bites of politicians on a current issue
• Histories or chronologies of salient events
Steps for Extractive Text Summarization

1) Tokenization

2) Get word_frequency

3) Get normalized_word_frequency

4) Sentence Tokenization

5) Calculate score for each sentence

sentence_score = sum of normalized_word_frequency of words in the sentence

6) Rank the sentences- Use sentence_score to get important sentences (to form a summary)

NOTE: Other than normalized word frequencies, more advanced methods consider factors like
sentence position, keyword importance, and semantic relevance for more accurate
summarization.
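
The scoring rule in steps 2 to 5 can also be written compactly. The symbols f(w) (the number of times word w occurs in the document) and N (the total number of word tokens in the document) are notation introduced here for illustration only. The sum runs over every token occurrence in sentence s, so a word that appears twice in the sentence contributes twice.

```latex
\[
\text{normalized\_word\_frequency}(w) = \frac{f(w)}{N},
\qquad
\text{sentence\_score}(s) = \sum_{w \in s} \frac{f(w)}{N}
\]
```

Because every term is non-negative, a longer sentence tends to collect a higher score under this rule.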

Goal of Extractive Summarization

Large Document:
Artificial intelligence, often abbreviated as AI, is the simulation of human intelligence processes by machines, especially computer systems. These processes include learning, reasoning, problem-solving, perception, and language understanding. AI has become a significant part of our daily lives, from voice assistants like Siri and Alexa to recommendation systems on streaming platforms.

Summary from ranked sentences:

Sentence | Score | Rank
---------|-------|-----
Artificial intelligence, often abbreviated as AI, is the simulation of human intelligence processes by machines, especially computer systems. | 24 | 2
These processes include learning, reasoning, problem-solving, perception, and language understanding. | 12 | 3
AI has become a significant part of our daily lives, from voice assistants like Siri and Alexa to recommendation systems on streaming platforms. | 27 | 1

• Step 1 - Tokenization: First, we tokenize the document into words and collect the list of all UNIQUE TOKENS appearing in it.

[artificial, intelligence, often, abbreviated, as, AI, is, the, simulation, of, human, processes, by, machines, especially, computer, systems, these, include, learning, reasoning, problem-solving, perception, and, language, understanding, …]
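
A minimal sketch of this step in Python, using only the standard library. The `tokenize` helper and its regular expression are assumptions made for illustration: they lowercase the text (so "AI" becomes "ai") and keep hyphenated words such as "problem-solving" as a single token. A library tokenizer may split the text differently.

```python
import re

document = (
    "Artificial intelligence, often abbreviated as AI, is the simulation of "
    "human intelligence processes by machines, especially computer systems. "
    "These processes include learning, reasoning, problem-solving, perception, "
    "and language understanding. AI has become a significant part of our daily "
    "lives, from voice assistants like Siri and Alexa to recommendation systems "
    "on streaming platforms."
)


def tokenize(text):
    """Lowercase the text and return its word tokens, keeping hyphenated words whole."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


tokens = tokenize(document)

# Unique tokens in order of first appearance (dict keys keep insertion order).
unique_tokens = list(dict.fromkeys(tokens))

print(len(tokens), "tokens,", len(unique_tokens), "unique")
print(unique_tokens[:6])
```

Keeping the full token list as well as the unique list is deliberate: the later steps need the total token count, not only the vocabulary.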

• Step 2 - Get word_frequency: Count how often each unique word occurs in the whole document.

{artificial-1, intelligence-2, often-1, abbreviated-1, as-1, AI-2, is-1, the-1, simulation-1, of-2, human-1, processes-2, by-1, machines-1, especially-1, computer-1, systems-2, these-1, include-1, learning-1, reasoning-1, problem-solving-1, perception-1, and-2, language-1, understanding-1, …}
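
One way to obtain these counts is `collections.Counter` from the Python standard library. The sketch below assumes the same lowercase, hyphen-preserving tokenization as an illustrative choice; it is not necessarily the tokenizer used in the hands-on session.

```python
import re
from collections import Counter

document = (
    "Artificial intelligence, often abbreviated as AI, is the simulation of "
    "human intelligence processes by machines, especially computer systems. "
    "These processes include learning, reasoning, problem-solving, perception, "
    "and language understanding. AI has become a significant part of our daily "
    "lives, from voice assistants like Siri and Alexa to recommendation systems "
    "on streaming platforms."
)

# Assumed tokenization: lowercase words, hyphenated words kept whole.
tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", document.lower())

# word_frequency maps each unique token to its number of occurrences.
word_frequency = Counter(tokens)

# Show only the words that occur more than once.
repeated = {word: count for word, count in word_frequency.items() if count > 1}
print(repeated)
```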

• Step 3 - Get normalized_word_frequency: Divide each word_frequency by Number_of_tokens (the total number of word tokens in the complete document). The example document has 51 word tokens, so a word that occurs once gets 1/51 ≈ 0.02 and a word that occurs twice gets 2/51 ≈ 0.04.

{artificial-0.02, intelligence-0.04, often-0.02, abbreviated-0.02, as-0.02, AI-0.04, is-0.02, the-0.02, simulation-0.02, of-0.04, human-0.02, processes-0.04, by-0.02, machines-0.02, especially-0.02, computer-0.02, systems-0.04, these-0.02, include-0.02, learning-0.02, reasoning-0.02, problem-solving-0.02, perception-0.02, and-0.04, language-0.02, understanding-0.02, …}
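
The normalization can be sketched as a dictionary comprehension over the counts. The variable names mirror the slide's terms, and the tokenization is again an illustrative assumption (lowercase, hyphenated words kept whole).

```python
import re
from collections import Counter

document = (
    "Artificial intelligence, often abbreviated as AI, is the simulation of "
    "human intelligence processes by machines, especially computer systems. "
    "These processes include learning, reasoning, problem-solving, perception, "
    "and language understanding. AI has become a significant part of our daily "
    "lives, from voice assistants like Siri and Alexa to recommendation systems "
    "on streaming platforms."
)

tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", document.lower())
word_frequency = Counter(tokens)

# Number_of_tokens counts every occurrence, not just the unique words.
number_of_tokens = len(tokens)

normalized_word_frequency = {
    word: count / number_of_tokens for word, count in word_frequency.items()
}

for word in ("artificial", "intelligence", "systems"):
    print(word, round(normalized_word_frequency[word], 2))
```

Dividing by the total token count turns the counts into relative frequencies, so the values over the whole vocabulary add up to 1.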

• Step 4 - Sentence Tokenization: Use ‘.’ to divide the document into sentences.

No. | Sentence
----|---------
1 | Artificial intelligence, often abbreviated as AI, is the simulation of human intelligence processes by machines, especially computer systems.
2 | These processes include learning, reasoning, problem-solving, perception, and language understanding.
3 | AI has become a significant part of our daily lives, from voice assistants like Siri and Alexa to recommendation systems on streaming platforms.
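
Splitting on ‘.’ can be sketched with plain string methods. The helper name `split_sentences` is hypothetical. The rule is deliberately naive: it would also split inside abbreviations such as "e.g." or inside URLs, which is why a dedicated sentence tokenizer (for example NLTK's `sent_tokenize`) is usually preferred on real text.

```python
document = (
    "Artificial intelligence, often abbreviated as AI, is the simulation of "
    "human intelligence processes by machines, especially computer systems. "
    "These processes include learning, reasoning, problem-solving, perception, "
    "and language understanding. AI has become a significant part of our daily "
    "lives, from voice assistants like Siri and Alexa to recommendation systems "
    "on streaming platforms."
)


def split_sentences(text):
    """Split on '.', drop empty pieces, and restore the full stop on each sentence."""
    return [part.strip() + "." for part in text.split(".") if part.strip()]


sentences = split_sentences(document)

for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```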

• Step 5 - Calculate the score for each sentence:

sentence_score = sum of normalized_word_frequency of words in the sentence

• Sentence 1 score: (0.02 + 0.04 + 0.02 + 0.02 + 0.02 + 0.04 + 0.02 + 0.02 + 0.02 + 0.04 + 0.02 + 0.04 + 0.04 + 0.02 + 0.02 + 0.02 + 0.02 + 0.04) ≈ 0.48
• Sentence 2 score: (0.02 + 0.04 + 0.02 + 0.02 + 0.02 + 0.02 + 0.02 + 0.04 + 0.02 + 0.02) ≈ 0.24
• Sentence 3 score: (0.04 + 0.02 + 0.02 + 0.02 + 0.02 + 0.02 + 0.04 + 0.02 + 0.02 + 0.02 + 0.02 + 0.02 + 0.02 + 0.02 + 0.02 + 0.04 + 0.02 + 0.02 + 0.02 + 0.04 + 0.02 + 0.02 + 0.02) ≈ 0.54

• Step 6 - Rank the sentences

Summary from ranked sentences:

No. | Sentence | Score (sum of word_frequency) | Normalized score | Rank
----|----------|-------------------------------|------------------|-----
1 | Artificial intelligence, often abbreviated as AI, is the simulation of human intelligence processes by machines, especially computer systems. | 24 | 0.48 | 2
2 | These processes include learning, reasoning, problem-solving, perception, and language understanding. | 12 | 0.24 | 3
3 | AI has become a significant part of our daily lives, from voice assistants like Siri and Alexa to recommendation systems on streaming platforms. | 27 | 0.54 | 1
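
Scoring and ranking can be sketched by combining the earlier steps. The tokenization and the split on ‘.’ are the same illustrative assumptions as before, and the helper name `tokenize` is hypothetical. The sketch does not round each term to two decimals, so its scores can differ slightly from values summed by hand.

```python
import re
from collections import Counter

document = (
    "Artificial intelligence, often abbreviated as AI, is the simulation of "
    "human intelligence processes by machines, especially computer systems. "
    "These processes include learning, reasoning, problem-solving, perception, "
    "and language understanding. AI has become a significant part of our daily "
    "lives, from voice assistants like Siri and Alexa to recommendation systems "
    "on streaming platforms."
)


def tokenize(text):
    """Lowercase the text and return its word tokens, keeping hyphenated words whole."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


tokens = tokenize(document)
normalized = {word: count / len(tokens) for word, count in Counter(tokens).items()}

sentences = [part.strip() + "." for part in document.split(".") if part.strip()]

# Step 5: a sentence's score is the sum of the normalized frequency of each of its tokens.
scores = [sum(normalized[token] for token in tokenize(sentence)) for sentence in sentences]

# Step 6: sentence indices ordered from the highest score to the lowest.
ranking = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)

for rank, i in enumerate(ranking, start=1):
    print(f"rank {rank}: sentence {i + 1} (score {scores[i]:.2f})")
```

A summary is then formed by keeping the top-ranked sentences, usually printed in their original document order so the text still reads naturally.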
Creating an Application
Practice Assignment on Text Summarization

Apply the Extractive Summarization method to the document below. The summary should contain no more than 20% of the document's sentences.
text = """Natural Language Processing (NLP) is an intricate field focused on the challenge of understanding human
language. One of its core aspects is handling ‘stop words’ – words which, due to their high frequency in text, often don’t
offer significant insights on their own.
Stop words like ‘the’, ‘and’, and ‘I’, although common, don’t usually provide meaningful information about a document’s
specific topic. By eliminating these words from a corpus, we can more easily identify unique and relevant terms.
It’s important to note that there isn’t a universally accepted list of stop words in NLP. However, the Natural Language
Toolkit, or NLTK, does offer a list for researchers and practitioners to utilize.
Throughout this guide, you’ll discover how to efficiently remove stop words using the nltk module, streamlining your text
data for better analysis.
We’ll be building upon code from a prior tutorial that dealt with tokenizing words.
Despite being crucial for sentence structure, most stop words don’t enhance our understanding of sentence semantics.
Thankfully, with NLTK, you don’t have to manually define every stop word. The library already includes a predefined list of
common words that typically don’t carry much semantic weight. NLTK’s default list contains 40 such words, for example:
“a”, “an”, “the”, and “of”.
This paragraph is taken from “https://pythonspot.com/nltk-stop-words/”
"""
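
As a starting point for the assignment, the six steps can be wrapped in one function. This is a sketch under stated assumptions, not a reference solution: the function name `extractive_summary`, the `ratio` parameter, the regular expressions, and the short `sample` text are all hypothetical choices made for illustration. Sentences are split after ‘.’, ‘!’ or ‘?’ followed by whitespace, and at least one sentence is always kept, so a text with fewer than five sentences will exceed the 20% limit.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens, keeping hyphenated words whole."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def extractive_summary(text, ratio=0.2):
    """Return the top-scoring sentences (at most `ratio` of them, minimum one) in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    tokens = tokenize(text)
    if not sentences or not tokens:
        return ""

    # Steps 2 and 3: word counts divided by the total number of tokens.
    normalized = {word: count / len(tokens) for word, count in Counter(tokens).items()}

    # Step 5: score each sentence by summing the normalized frequency of its tokens.
    scores = [sum(normalized[t] for t in tokenize(s)) for s in sentences]

    # Step 6: keep the best sentences, then restore document order.
    keep = max(1, int(len(sentences) * ratio))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    return " ".join(sentences[i] for i in sorted(top))


# Hypothetical five-sentence sample; 20% of five sentences is one sentence.
sample = (
    "Stop words are very common words. "
    "Removing stop words makes the rare words in a text stand out. "
    "NLTK ships a list. "
    "Lists differ. "
    "Choose one."
)
print(extractive_summary(sample, ratio=0.2))
```

Because the score is a plain sum, long sentences and very common words (including stop words) tend to dominate. Removing stop words before counting, or dividing each score by the sentence length, are two common adjustments worth trying in the assignment.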

References

• A Gentle Introduction to Text Summarization by Jason Brownlee: https://machinelearningmastery.com/gentle-introduction-text-summarization/

• Text Summarization review research paper: https://www.sciencedirect.com/science/article/pii/S1319157820303712
End of module

End of Session