Natural Language Processing
Natural Language Understanding
Text Mining Applications – Unsupervised
• Text clustering
• Trend analysis
Trend for the Term “text mining” from Google Trends
Cluster No.   Comment    Key Words
1             1, 3, 4    doctor, staff, friendly, helpful
2             5, 6, 8    treatment, results, time, schedule
3             2, 7       service, clinic, fast
Text Mining Applications – Supervised
– Many typical predictive modeling or
classification applications can be
enhanced by incorporating textual data in
addition to traditional input variables.
• churn propensity models that include
customer center notes, website forms,
e-mails, and Twitter messages
• hospital admission prediction models
incorporating medical records notes as a
new source of information
• insurance fraud modeling using adjustor
notes
• sentiment categorization
• stylometry or forensic applications that
identify the author of a particular writing
sample
Sentiment Analysis
• The field of sentiment analysis deals with categorization (or
classification) of opinions expressed in textual documents
Green color represents positive tone, red color represents negative tone, and
product features and model names are highlighted in blue and brown, respectively.
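As a rough illustration of sentiment categorization, the sketch below labels a text by counting hits against two small word lists. The lists `POSITIVE` and `NEGATIVE` and the function name `classify_sentiment` are made up for this example; a real system would rely on a much larger lexicon or a trained classifier.

```python
import re

# Hypothetical toy lexicon, invented for this example only.
POSITIVE = {"good", "great", "friendly", "helpful", "fast", "excellent"}
NEGATIVE = {"bad", "slow", "rude", "poor", "terrible", "broken"}

def classify_sentiment(text: str) -> str:
    """Label text as positive, negative, or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify_sentiment("The staff was friendly and helpful."))
print(classify_sentiment("Slow service and a rude receptionist."))
```

Word counting of this kind ignores negation ("not helpful") and context, which is one reason practical sentiment models are usually learned from labeled data.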
Structured + Text Data in Predictive Models
• Use of both types of data in building predictive
models.
ROC Chart of Models With and Without Textual Comments
NLP Tasks
• NLP applications require several NLP analyses:
– Word tokenization
– Sentence boundary detection
– Part-of-speech (POS) tagging
• to identify the part-of-speech (e.g. noun, verb) of each word
– Named Entity (NE) recognition
• to identify proper nouns (e.g. names of person, location,
organization; domain terminologies)
– Parsing
• to identify the syntactic structure of a sentence
– Semantic analysis
• to derive the meaning of a sentence
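The first two tasks in the list can be sketched with regular expressions alone. This is a deliberately naive illustration, and the function names `split_sentences` and `tokenize` are chosen for this example; production tokenizers handle abbreviations, numbers, and clitics far more carefully.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence boundary detection.

    Splits at whitespace that follows '.', '!' or '?' and precedes
    an uppercase letter.
    """
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

def tokenize(sentence: str) -> list[str]:
    """Naive word tokenization: runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

for sent in split_sentences("The lead paint is unsafe. Mary ate the apple."):
    print(tokenize(sent))
```

The boundary rule fails on abbreviations followed by a capitalized word (for example "Dr. Smith"), which is why sentence boundary detection is treated as a task in its own right.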
1. Part-Of-Speech (POS) Tagging
• POS tagging is a process of assigning a POS or lexical
class marker to each word in a sentence (and all
sentences in a corpus).
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
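The input/output pair above can be reproduced with a simple dictionary lookup. This is only a sketch: the `LEXICON` table and the `pos_tag` function are hypothetical, and each word is given exactly one tag.

```python
# Hypothetical toy lexicon mapping each word to a single tag.
LEXICON = {"the": "Det", "lead": "N", "paint": "N", "is": "V", "unsafe": "Adj"}

def pos_tag(sentence: str, default: str = "N") -> str:
    """Tag each whitespace-separated word, falling back to `default` for unknown words."""
    return " ".join(
        f"{word}/{LEXICON.get(word.lower(), default)}" for word in sentence.split()
    )

print(pos_tag("the lead paint is unsafe"))  # the/Det lead/N paint/N is/V unsafe/Adj
```

A lookup table cannot tell that "lead" is a verb in "they lead the team"; practical taggers choose a tag from the surrounding context, typically with a statistical or neural model.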
Syntactic Analysis – Grammar
• sentence -> noun_phrase, verb_phrase
• noun_phrase -> proper_noun
• noun_phrase -> determiner, noun
• verb_phrase -> verb, noun_phrase
• proper_noun -> [mary]
• noun -> [apple]
• verb -> [ate]
• determiner -> [the]
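The grammar above is small enough to run directly. The sketch below encodes the rules as a dictionary and applies a simple top-down (recursive-descent) parser; the names `GRAMMAR`, `parse`, and `parse_sentence` are chosen for this example, and the parser takes the first rule that matches without further backtracking, which is sufficient for this grammar but not for ambiguous ones.

```python
# The slide's rules: each nonterminal maps to a list of alternative right-hand sides.
GRAMMAR = {
    "sentence": [["noun_phrase", "verb_phrase"]],
    "noun_phrase": [["proper_noun"], ["determiner", "noun"]],
    "verb_phrase": [["verb", "noun_phrase"]],
    "proper_noun": [["mary"]],
    "noun": [["apple"]],
    "verb": [["ate"]],
    "determiner": [["the"]],
}

def parse(symbol, tokens, pos=0):
    """Try to derive `symbol` starting at tokens[pos].

    Returns (tree, next_pos) for the first rule that matches, or None.
    """
    if symbol not in GRAMMAR:  # terminal: must equal the current token
        if pos < len(tokens) and tokens[pos] == symbol:
            return symbol, pos + 1
        return None
    for rule in GRAMMAR[symbol]:
        children, cur = [], pos
        for part in rule:
            result = parse(part, tokens, cur)
            if result is None:
                break  # this rule fails; try the next alternative
            child, cur = result
            children.append(child)
        else:
            return (symbol, children), cur
    return None

def parse_sentence(text):
    """Return a parse tree if the whole text is a sentence of the grammar, else None."""
    tokens = text.lower().split()
    result = parse("sentence", tokens)
    if result is not None and result[1] == len(tokens):
        return result[0]
    return None

print(parse_sentence("mary ate the apple"))
```

Note that the grammar also accepts "the apple ate mary": it checks syntactic structure only, not whether the sentence makes sense.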
2. Named Entity Recognition (NER)
• NER processes a text and identifies the named entities in
each sentence
– e.g. “U.N. official Ekeus heads for Baghdad.”
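One very simple way to approach the example sentence is a gazetteer lookup, sketched below. The `GAZETTEER` table, its entity types (ORG, PER, LOC), and the function `find_entities` are assumptions made for this illustration; the slide does not specify the entity labels.

```python
# Hypothetical gazetteer; the entity types are assumed for this example.
GAZETTEER = {"U.N.": "ORG", "Ekeus": "PER", "Baghdad": "LOC"}

def find_entities(sentence: str) -> list[tuple[str, str]]:
    """Return (token, entity_type) pairs for tokens found in the gazetteer."""
    entities = []
    for token in sentence.split():
        # Strip trailing punctuation only when the raw token is not itself listed,
        # so the abbreviation "U.N." keeps its final period.
        key = token if token in GAZETTEER else token.rstrip(".,;:!?")
        if key in GAZETTEER:
            entities.append((key, GAZETTEER[key]))
    return entities

print(find_entities("U.N. official Ekeus heads for Baghdad."))
```

A fixed list cannot recognize unseen names or resolve ambiguous ones, so practical NER systems are usually trained sequence labelers rather than lookups.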
Confusion matrix
• True Positive (TP):
You predicted positive and the actual class is positive.
• True Negative (TN):
You predicted negative and the actual class is negative.
• False Positive (FP, Type I error):
You predicted positive but the actual class is negative.
• False Negative (FN, Type II error):
You predicted negative but the actual class is positive.
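The four cells can be counted directly from paired lists of actual and predicted labels. The sketch below assumes binary labels with 1 as the positive class; the function name `confusion_counts` and the sample labels are made up for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for binary labels."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct if the actual class is positive.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted negative: correct if the actual class is negative.
            counts["FN" if a == positive else "TN"] += 1
    return counts

actual = [1, 0, 1, 1, 0, 0]      # made-up ground-truth labels
predicted = [1, 0, 0, 1, 1, 0]   # made-up model predictions
print(confusion_counts(actual, predicted))  # {'TP': 2, 'TN': 2, 'FP': 1, 'FN': 1}
```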