
Text Analysis in R

Dr Arun Julka

Table of Contents
1 Text Analysis: Introduction
2 Text Analysis in R
2.1 Text Analysis of a text ta(a)
2.2 Text Analysis of a text ta(b)
2.3 Text Analysis of a text ta(c)
2.4 Sentiment Analysis of a text ta(d)
2.5 Sentiment Analysis of a CSV file ta(e)
2.6 Sentiment Analysis of a PDF file ta(f)
2.7 Sentiment Analysis of Chapter 7.1 (HKD)
2.8 Sentiment Analysis of Chapter 7.2 (HKD)


1 Text Analysis: Introduction


Suppose you have a mountain of text data: customer reviews, news articles, books – the
possibilities are endless. Text analysis is like having a powerful magnifying glass and a set of
tools to sift through this mountain and uncover hidden patterns, understand the underlying
meaning, and extract valuable insights.

Think of it this way:

You have a box full of jigsaw puzzles. Each puzzle piece is a word, and the entire box is a
collection of texts.

Text analysis helps you:

• Find all the corner pieces: Identify the most frequent words (like "the," "a," "is")
– these are common but not always the most meaningful.
• Group similar pieces: Find words that often appear together (like "delicious" and
"food," "fast" and "delivery") to understand themes and topics.
• Determine the overall picture: Analyse the sentiment (positive, negative, neutral)
expressed in the text, identify the main topics discussed, and even predict future
trends.

Let’s take a simple example

Let's say you have a collection of customer reviews for a restaurant. You can use text analysis
to:

• Identify common words: "delicious," "tasty," "service," "slow," "friendly," "disappointed."
• Analyse sentiment: Determine if the overall sentiment of the reviews is positive, negative, or neutral.
• Find common themes: Identify recurring themes, such as slow service, delicious food, or friendly staff.

Key Techniques in Text Analysis:

• Tokenisation: Breaking down text into individual words or sentences (a short tidytext sketch follows this list).
• Sentiment Analysis: Determining the emotional tone of the text (positive, negative, neutral).
• Topic Modeling: Identifying the main topics discussed in the text.
• Named Entity Recognition: Identifying and classifying named entities (people, organizations, locations).
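For example, here is a minimal tokenisation sketch with tidytext (it reuses one of the sentences from the listings below; unnest_tokens() splits the text column into one lower-cased word per row):

library(dplyr)
library(tidytext)

# one document as a toy data frame
docs <- data.frame(line = 1,
                   text = "I love data science. Data science is amazing!",
                   stringsAsFactors = FALSE)

# one row per token (word), lower-cased, punctuation dropped
docs %>%
  unnest_tokens(word, text)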

Tools for Text Analysis:

• R: A powerful programming language with many libraries for text analysis (like
tidytext, tm, sentimentr).
• Python: Another popular language with libraries like NLTK, spaCy, and scikit-learn.


Text analysis is a rapidly growing field with applications in various domains, including
business, marketing, social sciences, and even healthcare.

Text Analysis Flowchart

Import Data (dplyr, tidyverse, pdftools, VCorpus)
→ Data Cleaning (tm, tidytext, textstem)
→ Lemmatisation
→ Plot (ggplot2)
→ Word cloud (wordcloud)
→ Sentiment Analysis (syuzhet)


2 Text Analysis in R
2.1 Text Analysis of a text ta(a)
library(dplyr)
library(tm)
library(ggplot2)
library(tidyverse)
library(tidytext)
library(pdftools)
library(wordcloud)
library(textstem)
# First Text mining Process
text_data <- data.frame(cbind(
  id = 1:6,
  text = c("I am Arun Julka - finance teacher eager to explore the potential of AI in education. ",
           " My goal is to provide the best possible learning experience for my students.",
           " I believe that AI tools can revolutionise the way we teach and learn.",
           " I am open to collaborating with anyone who shares my passion for using AI to enhance education.",
           " I am a firm believer in student-centred learning and inquiry-based learning. ",
           "I love data science. Data science is amazing!")
))

#A data frame source interprets each row of the data frame x as a document.
#The first column must be named "doc_id" and contain a unique string identifier for each document.
#The second column must be named "text"
colnames(text_data)=c('doc_id','text')
crp<- VCorpus(DataframeSource(text_data))
#VCorpus stands for Volatile Corpus (VCorpus is suitable for smaller datasets that can be comfortably held in memory)
print(crp[[1]])
#tm_map() allows you to apply a specified function to each document within a corpus.
crp<- tm_map(crp, content_transformer(tolower))
print(crp[[1]]$content)

2.2 Text Analysis of a text ta(b)


library(dplyr)
library(tm)
library(ggplot2)
library(tidyverse)
library(tidytext)
library(pdftools)
library(wordcloud)
library(textstem)
# First Text mining Process


text_data <- data.frame(cbind(
  id = 1:6,
  text = c("I am Arun Julka - finance teacher eager to explore the potential of AI in education. ",
           " My goal is to provide the best possible learning experience for my students.",
           " I believe that AI tools can revolutionise the way we teach and learn.",
           " I am open to collaborating with anyone who shares my passion for using AI to enhance education.",
           " I am a firm believer in student-centred learning and inquiry-based learning. ",
           "I love data science. Data science is amazing!")
))

#A data frame source interprets each row of the data frame x as a document.
#The first column must be named "doc_id" and contain a unique string identifier for each document.
#The second column must be named "text"
colnames(text_data)=c('doc_id','text')
crp<- VCorpus(DataframeSource(text_data))
#VCorpus stands for Volatile Corpus (VCorpus is suitable for smaller datasets that can be comfortably held in memory)
print(crp[[1]])
#tm_map() allows you to apply a specified function to each document within a corpus.
crp<- tm_map(crp, content_transformer(tolower))
print(crp[[1]]$content)
crp<- tm_map(crp, stripWhitespace) # removes whitespaces
crp<- tm_map(crp, removePunctuation) # removes punctuations
crp<- tm_map(crp, removeNumbers) # removes numbers
crp<- tm_map(crp, removeWords, stopwords("english"))
#Examples of common English stop words:
##Articles: a, an, the
##Prepositions: in, on, at, to, from, with, for
##Conjunctions: and, but, or, if, because
##Pronouns: I, you, he, she, it, they, we, me, him, her, them, us
##Other: no, not, only, very, this, that, these, those
mystopwords<- c(stopwords("english"),"arun")
crp<- tm_map(crp, removeWords, mystopwords)

# Lemmatization (Lemmatization in Natural Language Processing (NLP) is the process of reducing a word to its base or dictionary form, known as the lemma)
crp<- tm_map(crp, content_transformer(lemmatize_strings))
print(crp[[1]]$content)
review_corpus<- crp

tdm<- TermDocumentMatrix(review_corpus, control = list(wordLengths = c(1,Inf)))


# inspect frequent words
freq_terms<- findFreqTerms(tdm, lowfreq=1)

term_freq<- rowSums(as.matrix(tdm))
term_freq<- subset(term_freq, term_freq>=1)
df<- data.frame(term = names(term_freq), freq = term_freq)
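To see what the lemmatization step does on its own, lemmatize_strings() can be run on a small character vector (a toy illustration outside the corpus pipeline; the exact lemmas depend on the lexicon that ships with textstem):

library(textstem)
# each word is reduced to its base or dictionary form (lemma)
lemmatize_strings(c("studies", "studying", "ran", "better"))
# roughly: "study" "study" "run" "good"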


2.3 Text Analysis of a text ta(c)


library(dplyr)
library(tm)
library(ggplot2)
library(tidyverse)
library(tidytext)
library(pdftools)
library(wordcloud)
library(textstem)
# First Text mining Process
text_data <- data.frame(cbind(
  id = 1:6,
  text = c("I am Arun Julka - finance teacher eager to explore the potential of AI in education. ",
           " My goal is to provide the best possible learning experience for my students.",
           " I believe that AI tools can revolutionise the way we teach and learn.",
           " I am open to collaborating with anyone who shares my passion for using AI to enhance education.",
           " I am a firm believer in student-centred learning and inquiry-based learning. ",
           "I love data science. Data science is amazing!")
))

#A data frame source interprets each row of the data frame x as a document.
#The first column must be named "doc_id" and contain a unique string identifier for each document.
#The second column must be named "text"
colnames(text_data)=c('doc_id','text')
crp<- VCorpus(DataframeSource(text_data))

print(crp[[1]])

crp<- tm_map(crp, content_transformer(tolower))


print(crp[[1]]$content)
crp<- tm_map(crp, stripWhitespace) # removes whitespaces
crp<- tm_map(crp, removePunctuation) # removes punctuations
crp<- tm_map(crp, removeNumbers) # removes numbers
crp<- tm_map(crp, removeWords, stopwords("english"))
mystopwords<- c(stopwords("english"),"arun","julka") # the corpus is already lower-cased, so custom stop words must be lower-case to match
crp<- tm_map(crp, removeWords, mystopwords)

# Lemmatization
crp<- tm_map(crp, content_transformer(lemmatize_strings))
print(crp[[1]]$content)
review_corpus<- crp

tdm<- TermDocumentMatrix(review_corpus, control = list(wordLengths = c(1,Inf)))


# inspect frequent words
freq_terms<- findFreqTerms(tdm, lowfreq=1)


term_freq<- rowSums(as.matrix(tdm))
term_freq<- subset(term_freq, term_freq>=1)
df<- data.frame(term = names(term_freq), freq = term_freq)

#association of words
find_assocs= findAssocs(tdm,"text",corlimit = 0.1)

# Now plotting the top frequent words


library(ggplot2)

df_plot<- df %>%
top_n(10)

# Plot word frequency


ggplot(df_plot, aes(x = reorder(term, +freq), y = freq, fill = freq)) +
  geom_bar(stat = "identity") +
  scale_fill_gradientn(colors = terrain.colors(10)) + # the bars use the fill aesthetic, so a fill scale is needed
  xlab("Terms") + ylab("Count") + coord_flip()

# Create word cloud


wordcloud(words = df$term, freq = df$freq, min.freq = 1,
random.order = FALSE, colors = brewer.pal(8, "Dark2"))
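The findAssocs() call above stores its result without displaying it; the associations can be inspected directly (the term "learn" below is only an illustrative choice; any term present in the TDM works):

# for a given term, findAssocs() returns the other terms whose frequencies
# across documents correlate with it at or above corlimit
print(find_assocs)
findAssocs(tdm, "learn", corlimit = 0.1)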

2.4 Sentiment Analysis of a text ta(d)


library(dplyr)
library(tm)
library(ggplot2)
library(tidyverse)
library(tidytext)
library(pdftools)
library(wordcloud)
library(textstem)
# First Text mining Process
text_data <- data.frame(cbind(
  id = 1:6,
  text = c("I am Arun Julka - finance teacher eager to explore the potential of AI in education. ",
           " My goal is to provide the best possible learning experience for my students.",
           " I believe that AI tools can revolutionise the way we teach and learn.",
           " I am open to collaborating with anyone who shares my passion for using AI to enhance education.",
           " I am a firm believer in student-centred learning and inquiry-based learning. ",
           "I love data science. Data science is amazing!")
))

#A data frame source interprets each row of the data frame x as a document.
#The first column must be named "doc_id" and contain a unique string identifier for each document.
#The second column must be named "text"


colnames(text_data)=c('doc_id','text')
crp<- VCorpus(DataframeSource(text_data))

print(crp[[1]])

crp<- tm_map(crp, content_transformer(tolower))


print(crp[[1]]$content)
crp<- tm_map(crp, stripWhitespace) # removes whitespaces
crp<- tm_map(crp, removePunctuation) # removes punctuations
crp<- tm_map(crp, removeNumbers) # removes numbers
crp<- tm_map(crp, removeWords, stopwords("english"))
mystopwords<- c(stopwords("english"),"arun")
crp<- tm_map(crp, removeWords, mystopwords)

# Lemmatization
crp<- tm_map(crp, content_transformer(lemmatize_strings))
print(crp[[1]]$content)
review_corpus<- crp

tdm<- TermDocumentMatrix(review_corpus, control = list(wordLengths = c(1,Inf)))


# inspect frequent words
freq_terms<- findFreqTerms(tdm, lowfreq=1)

term_freq<- rowSums(as.matrix(tdm))
term_freq<- subset(term_freq, term_freq>=1)
df<- data.frame(term = names(term_freq), freq = term_freq)

#association of words
find_assocs= findAssocs(tdm,"text",corlimit = 0.1)

# Now plotting the top frequent words


library(ggplot2)

df_plot<- df %>%
top_n(10)

# Plot word frequency


ggplot(df_plot, aes(x = reorder(term, +freq), y = freq, fill = freq)) +
  geom_bar(stat = "identity") +
  scale_fill_gradientn(colors = terrain.colors(10)) + # the bars use the fill aesthetic, so a fill scale is needed
  xlab("Terms") + ylab("Count") + coord_flip()

# Create word cloud


wordcloud(words = df$term, freq = df$freq, min.freq = 1,
random.order = FALSE, colors = brewer.pal(8, "Dark2"))

# Get sentiment lexicon


sentiment_lexicon <- get_sentiments("bing")


colnames(df)[1]='word'
# Perform sentiment analysis
sentiment_analysis <- df %>%
inner_join(sentiment_lexicon, by=("word")) %>%
count(word, sentiment, sort = TRUE)

# View sentiment analysis


print(sentiment_analysis)
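As an optional extension (a sketch, not part of the original listing), the word-level results can be aggregated to see the overall balance of positive and negative terms:

library(ggplot2)

# total the word frequencies within each sentiment class
sentiment_summary <- df %>%
  inner_join(sentiment_lexicon, by = "word") %>%
  group_by(sentiment) %>%
  summarise(total = sum(freq))
print(sentiment_summary)

# simple bar chart of the positive vs negative totals
ggplot(sentiment_summary, aes(x = sentiment, y = total, fill = sentiment)) +
  geom_col() + xlab("Sentiment") + ylab("Total word count")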

2.5 Sentiment Analysis of a CSV file ta(e)


library(dplyr)
library(tm)
library(ggplot2)
library(tidyverse)
library(tidytext)
library(pdftools)
library(wordcloud)
library(textstem)
#Analysing text data from a CSV file
data=read.csv("C:/Users/ADMIN/OneDrive/Desktop/R/AM/amazon_vfl_reviews_session2.csv")
summary(data)
str(data)
data$sn=seq(1,nrow(data))

colnames(data)[c(6,5)]=c('doc_id','text')
# DataframeSource() expects doc_id as the first column and text as the second,
# so move them to the front (the remaining columns are kept as document metadata)
data=data[,c(6,5,1:4)]

crp<- VCorpus(DataframeSource(data))

print(crp[[1]])
crp<- tm_map(crp, content_transformer(tolower))
print(crp[[1]]$content)
crp<- tm_map(crp, stripWhitespace) # removes whitespaces
crp<- tm_map(crp, removePunctuation) # removes punctuations
crp<- tm_map(crp, removeNumbers) # removes numbers
crp<- tm_map(crp, removeWords, stopwords("english"))
mystopwords<- c(stopwords("english"),"book","people")
crp<- tm_map(crp, removeWords, mystopwords)

# Lemmatization
crp<- tm_map(crp, content_transformer(lemmatize_strings))
print(crp[[1]]$content)
review_corpus<- crp

tdm<- TermDocumentMatrix(review_corpus, control = list(wordLengths = c(1,Inf)))


# inspect frequent words
freq_terms<- findFreqTerms(tdm, lowfreq=1)


term_freq<- rowSums(as.matrix(tdm))
term_freq<- subset(term_freq, term_freq>=1)
df<- data.frame(term = names(term_freq), freq = term_freq)

#association of words
find_assocs= findAssocs(tdm,"text",corlimit = 0.1)

# Now plotting the top frequent words


library(ggplot2)

df_plot<- df %>%
top_n(10)

# Plot word frequency


ggplot(df_plot, aes(x = reorder(term, +freq), y = freq, fill = freq)) +
  geom_bar(stat = "identity") +
  scale_fill_gradientn(colors = terrain.colors(10)) + # the bars use the fill aesthetic, so a fill scale is needed
  xlab("Terms") + ylab("Count") + coord_flip()

# Create word cloud


wordcloud(words = df$term, freq = df$freq, min.freq = 1,
random.order = FALSE, colors = brewer.pal(8, "Dark2"))

# Get sentiment lexicon


sentiment_lexicon <- get_sentiments("bing")

colnames(df)[1]='word'
# Perform sentiment analysis
sentiment_analysis <- df %>%
inner_join(sentiment_lexicon, by=("word")) %>%
count(word, sentiment, sort = TRUE)

# View sentiment analysis


print(sentiment_analysis)

2.6 Sentiment Analysis of a PDF file ta(f)


#Read PDF Files
##Reading PDF Files From location
#identifying multiple pdf files from folder
library(pdftools)
library(tm)
stop_words2=c(stopwords("en"),"makes")
setwd("C:/Users/ADMIN/OneDrive/Desktop/R/AM/PDF")

files<- list.files(pattern = "pdf$")


files # files contains the character vector of PDF file names


read_function<- readPDF(control=list(text="-layout"))
read_corpus<- Corpus(URISource(files[1:5]),readerControl = list(reader=read_function))

read_corpus<-tm_map(read_corpus,removePunctuation)

dtm <- DocumentTermMatrix(read_corpus,
  control = list(removePunctuation = TRUE, stopwords = TRUE, tolower = TRUE,
                 removeNumbers = TRUE, stemming = TRUE,
                 bounds = list(global = c(3, Inf))))

dtm_matrix<-as.matrix(dtm) # converting dtm to a matrix so that data becomes viewable


#some inverted commas, hashtags etc. are not removed by removePunctuation,
# so the textclean package can be used for those cases.

View(dtm_matrix) # this may take a few seconds, as it shows the count of each word in each PDF

dtm_matrix<-t(dtm_matrix) # transpose to show the data in a structured format (terms as rows)

number_occurance<- rowSums(dtm_matrix) #use rowSums not rowsum as this is matrix


number_occurance[1:20] # square brackets limit the number of words shown

number_occurance_sorted <- sort(number_occurance,decreasing = TRUE)


number_occurance_sorted[1:20] # square brackets limit the number of words shown

library(wordcloud)
set.seed(123)
wordcloud(names(number_occurance_sorted), number_occurance_sorted, max.words=25,
scale=c(3, .1), colors=brewer.pal(6, "Dark2"))

cor_word <- findAssocs(dtm, "marketing" , corlimit = 0.2)


cor_word$marketing[1:20] #as we are correlating with "marketing"

library(treemap)

data_frame<- data.frame(word=names(number_occurance_sorted),
freq=number_occurance_sorted)
data_frame[1:20,]

#Keep only words above a minimum frequency in the treemap


treemap(subset(data_frame,number_occurance_sorted>10), index = c('word'), vSize = 'freq')

#Plot a chosen number of top words


treemap(data_frame[1:10,], index = c('word'), vSize = 'freq')

#cluster analysis
# dist() needs numeric input, so compute distances on the frequency column
# and keep the words as labels for the dendrogram
top_words<- data_frame[1:20,]
distance<- dist(top_words$freq)
distance
clust<- hclust(distance)
plot(clust, labels = top_words$word, hang = -1) # hang = -1 for symmetric cluster roots
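The listing above covers frequency analysis, the word cloud, the treemap and clustering; to match the section heading, a sentiment step in the spirit of section 2.8 could be appended (a sketch, assuming read_corpus still holds the PDF text):

library(syuzhet)

# pull the raw text back out of the corpus, one element per PDF
pdf_text_vec <- sapply(read_corpus, function(d) paste(content(d), collapse = " "))

# score the eight NRC emotions plus positive/negative for each document
pdf_sent <- get_nrc_sentiment(pdf_text_vec)
head(pdf_sent)

# total each sentiment across the PDFs and plot, as in section 2.8
barplot(colSums(pdf_sent), las = 2, col = rainbow(10),
        ylab = 'Count', main = 'Sentiment of PDF files')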

2.7 Sentiment Analysis of Chapter 7.1 (HKD)


#install readxl
install.packages("readxl")
library(readxl)
#Replace "our_pdf_file.xlxs" with the actual path to your EXCEL file
reviews<-read_excel("C:/Users/ADMIN/OneDrive/Desktop/R/R Data/socialmediareviews.xlsx")
#install tm
install.packages("tm")
library(tm)
review_corp<-VCorpus(VectorSource(reviews$reviews))
review_corp[[1]]$content # inspect the first review
review_corp<- tm_map(review_corp, removeWords,
  c("now","Know","took","that's","air","away","war","Know","job","one","like","actually","new","guy",
    "don't","things","lot","try","bit","don't","don't","anything","thing","say","also","can","get",
    "used","got","take","just","now","will","it's","want","whatever","become","that's","said",
    "given","give","much"))
head(reviews)
tail(reviews)
tdm <- TermDocumentMatrix (review_corp, control = list (removePunctuation = TRUE, stopwords =
TRUE))
tdm_matrix<- as.matrix(tdm)
tdm_matrix<-t(tdm_matrix)
tdm_matrix[1:20]
number_occurrence <- rowSums(tdm_matrix)
number_occurrence[1:20]
number_occurrence_sorted<-sort(number_occurrence ,decreasing=TRUE)
number_occurrence_sorted[1:60]

2.8 Sentiment Analysis of Chapter 7.2(HKD)


#install readxl
install.packages("readxl")
library(readxl)
#Replace "our_pdf_file.xlxs" with the actual path to your EXCEL file
reviews<-read_excel("C:/Users/ADMIN/OneDrive/Desktop/R/R Data/socialmediareviews.xlsx")
#install tm
install.packages("tm")
library(tm)
review_corp<-VCorpus(VectorSource(reviews$reviews))
review_corp[[1]]$content # inspect the first review
review_corp<- tm_map(review_corp, removeWords,
  c("now","Know","took","that's","air","away","war","Know","job","one","like","actually","new","guy",
    "don't","things","lot","try","bit","don't","don't","anything","thing","say","also","can","get",
    "used","got","take","just","now","will","it's","want","whatever","become","that's","said",
    "given","give","much"))
head(reviews)
tail(reviews)
tdm <- TermDocumentMatrix (review_corp, control = list (removePunctuation = TRUE, stopwords =
TRUE))
tdm_matrix<- as.matrix(tdm)
tdm_matrix<-t(tdm_matrix)
tdm_matrix[1:20]
number_occurrence <- rowSums(tdm_matrix)
number_occurrence[1:20]
number_occurrence_sorted<-sort(number_occurrence ,decreasing=TRUE)
number_occurrence_sorted[1:60]

library(wordcloud)
wordcloud(names(number_occurrence_sorted), number_occurrence_sorted, max.words=25,
scale=c(3, .1), colors=brewer.pal(6, "Dark2"))

#association of words
cor_word <- findAssocs(tdm, "time" , corlimit = 0.1)
cor_word$time

library(treemap)
data_frame<- data.frame(word=names(number_occurrence_sorted),
freq=number_occurrence_sorted)
data_frame[1:20,]

#Keep only words above a minimum frequency in the treemap


treemap(subset(data_frame,number_occurrence_sorted>10), index = c('word'), vSize = 'freq')

#Plot a chosen number of top words


treemap(data_frame[1:10,], index = c('word'), vSize = 'freq')

#cluster analysis
# dist() needs numeric input, so compute distances on the frequency column
# and keep the words as labels for the dendrogram
top_words<- data_frame[1:20,]
distance<- dist(top_words$freq)
distance
clust<- hclust(distance)
plot(clust, labels = top_words$word, hang = -1) # hang = -1 for symmetric cluster roots

#sentiment analysis
library(syuzhet)
sent_corpus<- iconv(reviews$reviews)
review_sent<- get_nrc_sentiment(sent_corpus)
head(review_sent)
sentiment_counts <- colSums(review_sent)
barplot(sentiment_counts, las=2, col=rainbow(10), ylab='Count', main= 'Sentiment reviews')


#Important Notes:
#Accuracy: Sentiment analysis accuracy depends heavily on the quality of the text data, the chosen lexicon or model, and the complexity of the text.
#Lexicon Selection: get_nrc_sentiment() from the syuzhet package uses the built-in NRC emotion lexicon. You can explore other lexicons (e.g., Bing Liu, AFINN) for potentially better results.
#Advanced Techniques: For more sophisticated sentiment analysis, consider using machine learning models like Naive Bayes or Support Vector Machines.
#Error Handling: Implement robust error handling for potential issues like invalid PDF files or unexpected text formats.
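Following the lexicon-selection note, syuzhet can score the same reviews under several lexicons for comparison (a sketch reusing the reviews object loaded above):

library(syuzhet)

sent_corpus<- iconv(reviews$reviews)

# overall polarity of each review under three different lexicons
afinn_scores <- get_sentiment(sent_corpus, method = "afinn")
bing_scores  <- get_sentiment(sent_corpus, method = "bing")
nrc_scores   <- get_sentiment(sent_corpus, method = "nrc")

# the sign of the scores should broadly agree; the magnitudes differ by lexicon
summary(afinn_scores)
summary(bing_scores)
summary(nrc_scores)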
