
Text Analysis in R

Dr Arun Julka

Table of Contents
1 Text Analysis: Introduction
2 Text Analysis in R
2.1 Text Analysis of a text ta(a)
2.2 Text Analysis of a text ta(b)
2.3 Text Analysis of a text ta(c)
2.4 Sentiment Analysis of a text ta(d)
2.5 Sentiment Analysis of a CSV file ta(e)
2.6 Sentiment Analysis of a PDF file ta(f)
2.7 Sentiment Analysis of Chapter 7.1 (HKD)
2.8 Sentiment Analysis of Chapter 7.2 (HKD)


1 Text Analysis: Introduction


Suppose you have a mountain of text data: customer reviews, news articles, books – the
possibilities are endless. Text analysis is like having a powerful magnifying glass and a set of
tools to sift through this mountain and uncover hidden patterns, understand the underlying
meaning, and extract valuable insights.

Think of it this way:

You have a box full of jigsaw puzzles. Each puzzle piece is a word, and the entire box is a
collection of texts.

Text analysis helps you:

• Find all the corner pieces: Identify the most frequent words (like "the," "a," "is")
– these are common but not always the most meaningful.
• Group similar pieces: Find words that often appear together (like "delicious" and
"food," "fast" and "delivery") to understand themes and topics.
• Determine the overall picture: Analyse the sentiment (positive, negative, neutral)
expressed in the text, identify the main topics discussed, and even predict future
trends.

Let’s take a simple example

Let's say you have a collection of customer reviews for a restaurant. You can use text analysis
to:

• Identify common words: "delicious," "tasty," "service," "slow," "friendly," "disappointed."
• Analyse sentiment: Determine if the overall sentiment of the reviews is positive, negative, or neutral.
• Find common themes: Identify recurring themes, such as slow service, delicious food, or friendly staff.

Key Techniques in Text Analysis:

• Tokenisation: Breaking down text into individual words or sentences (a short tidytext sketch follows this list).
• Sentiment Analysis: Determining the emotional tone of the text (positive, negative, neutral).
• Topic Modeling: Identifying the main topics discussed in the text.
• Named Entity Recognition: Identifying and classifying named entities (people, organizations, locations).
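For example, here is a minimal tokenisation sketch with tidytext (it reuses one of the sentences from the listings below; unnest_tokens() splits the text column into one lower-cased word per row):

library(dplyr)
library(tidytext)

# one document as a toy data frame
docs <- data.frame(line = 1,
                   text = "I love data science. Data science is amazing!",
                   stringsAsFactors = FALSE)

# one row per token (word), lower-cased, punctuation dropped
docs %>%
  unnest_tokens(word, text)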

Tools for Text Analysis:

• R: A powerful programming language with many libraries for text analysis (like
tidytext, tm, sentimentr).
• Python: Another popular language with libraries like NLTK, spaCy, and scikit-learn.


Text analysis is a rapidly growing field with applications in various domains, including
business, marketing, social sciences, and even healthcare.

Text Analysis Flowchart

Import Data (dplyr, tidyverse, pdftools, VCorpus)
→ Data Cleaning (tm, tidytext, textstem)
→ Lemmatisation
→ Plot (ggplot2)
→ Word cloud (wordcloud)
→ Sentiment Analysis (syuzhet)


2 Text Analysis in R
2.1 Text Analysis of a text ta(a)
library(dplyr)
library(tm)
library(ggplot2)
library(tidyverse)
library(tidytext)
library(pdftools)
library(wordcloud)
library(textstem)
# First Text mining Process
text_data <- data.frame(cbind(
  id = 1:6,
  text = c("I am Arun Julka - finance teacher eager to explore the potential of AI in education. ",
           " My goal is to provide the best possible learning experience for my students.",
           " I believe that AI tools can revolutionise the way we teach and learn.",
           " I am open to collaborating with anyone who shares my passion for using AI to enhance education.",
           " I am a firm believer in student-centred learning and inquiry-based learning. ",
           "I love data science. Data science is amazing!")
))

#A data frame source interprets each row of the data frame x as a document.
#The first column must be named "doc_id" and contain a unique string identifier for each document.
#The second column must be named "text"
colnames(text_data)=c('doc_id','text')
crp<- VCorpus(DataframeSource(text_data))
#VCorpus stands for Volatile Corpus (VCorpus is suitable for smaller datasets that can be comfortably held in memory)
print(crp[[1]])
#tm_map() allows you to apply a specified function to each document within a corpus.
crp<- tm_map(crp, content_transformer(tolower))
print(crp[[1]]$content)

2.2 Text Analysis of a text ta(b)


library(dplyr)
library(tm)
library(ggplot2)
library(tidyverse)
library(tidytext)
library(pdftools)
library(wordcloud)
library(textstem)
# First Text mining Process


text_data <- data.frame(cbind(
  id = 1:6,
  text = c("I am Arun Julka - finance teacher eager to explore the potential of AI in education. ",
           " My goal is to provide the best possible learning experience for my students.",
           " I believe that AI tools can revolutionise the way we teach and learn.",
           " I am open to collaborating with anyone who shares my passion for using AI to enhance education.",
           " I am a firm believer in student-centred learning and inquiry-based learning. ",
           "I love data science. Data science is amazing!")
))

#A data frame source interprets each row of the data frame x as a document.
#The first column must be named "doc_id" and contain a unique string identifier for each document.
#The second column must be named "text"
colnames(text_data)=c('doc_id','text')
crp<- VCorpus(DataframeSource(text_data))
#VCorpus stands for Volatile Corpus (VCorpus is suitable for smaller datasets that can be comfortably held in memory)
print(crp[[1]])
#tm_map() allows you to apply a specified function to each document within a corpus.
crp<- tm_map(crp, content_transformer(tolower))
print(crp[[1]]$content)
crp<- tm_map(crp, stripWhitespace) # removes whitespaces
crp<- tm_map(crp, removePunctuation) # removes punctuations
crp<- tm_map(crp, removeNumbers) # removes numbers
crp<- tm_map(crp, removeWords, stopwords("english"))
#Examples of common English stop words:
##Articles: a, an, the
##Prepositions: in, on, at, to, from, with, for
##Conjunctions: and, but, or, if, because
##Pronouns: I, you, he, she, it, they, we, me, him, her, them, us
##Other: no, not, only, very, this, that, these, those
mystopwords<- c(stopwords("english"),"arun")
crp<- tm_map(crp, removeWords, mystopwords)

# Lemmatization (Lemmatization in Natural Language Processing (NLP) is the process of reducing a word to its base or dictionary form, known as the lemma)
crp<- tm_map(crp, content_transformer(lemmatize_strings))
print(crp[[1]]$content)
review_corpus<- crp

tdm<- TermDocumentMatrix(review_corpus, control = list(wordLengths = c(1,Inf)))


# inspect frequent words
freq_terms<- findFreqTerms(tdm, lowfreq=1)

term_freq<- rowSums(as.matrix(tdm))
term_freq<- subset(term_freq, term_freq>=1)
df<- data.frame(term = names(term_freq), freq = term_freq)
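To see what the lemmatization step does on its own, lemmatize_strings() can be run on a small character vector (a toy illustration outside the corpus pipeline; the exact lemmas depend on the lexicon that ships with textstem):

library(textstem)
# each word is reduced to its base or dictionary form (lemma)
lemmatize_strings(c("studies", "studying", "ran", "better"))
# roughly: "study" "study" "run" "good"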


2.3 Text Analysis of a text ta(c)


library(dplyr)
library(tm)
library(ggplot2)
library(tidyverse)
library(tidytext)
library(pdftools)
library(wordcloud)
library(textstem)
# First Text mining Process
text_data <- data.frame(cbind(
  id = 1:6,
  text = c("I am Arun Julka - finance teacher eager to explore the potential of AI in education. ",
           " My goal is to provide the best possible learning experience for my students.",
           " I believe that AI tools can revolutionise the way we teach and learn.",
           " I am open to collaborating with anyone who shares my passion for using AI to enhance education.",
           " I am a firm believer in student-centred learning and inquiry-based learning. ",
           "I love data science. Data science is amazing!")
))

#A data frame source interprets each row of the data frame x as a document.
#The first column must be named "doc_id" and contain a unique string identifier for each document.
#The second column must be named "text"
colnames(text_data)=c('doc_id','text')
crp<- VCorpus(DataframeSource(text_data))

print(crp[[1]])

crp<- tm_map(crp, content_transformer(tolower))


print(crp[[1]]$content)
crp<- tm_map(crp, stripWhitespace) # removes whitespaces
crp<- tm_map(crp, removePunctuation) # removes punctuations
crp<- tm_map(crp, removeNumbers) # removes numbers
crp<- tm_map(crp, removeWords, stopwords("english"))
mystopwords<- c(stopwords("english"),"arun","julka") # the corpus is already lower-cased, so custom stop words must be lower-case to match
crp<- tm_map(crp, removeWords, mystopwords)

# Lemmatization
crp<- tm_map(crp, content_transformer(lemmatize_strings))
print(crp[[1]]$content)
review_corpus<- crp

tdm<- TermDocumentMatrix(review_corpus, control = list(wordLengths = c(1,Inf)))


# inspect frequent words
freq_terms<- findFreqTerms(tdm, lowfreq=1)


term_freq<- rowSums(as.matrix(tdm))
term_freq<- subset(term_freq, term_freq>=1)
df<- data.frame(term = names(term_freq), freq = term_freq)

#association of words
find_assocs= findAssocs(tdm,"text",corlimit = 0.1)

# Now plotting the top frequent words


library(ggplot2)

df_plot<- df %>%
top_n(10)

# Plot word frequency


ggplot(df_plot, aes(x = reorder(term, +freq), y = freq, fill = freq)) +
  geom_bar(stat = "identity") +
  scale_fill_gradientn(colors = terrain.colors(10)) + # the bars use the fill aesthetic, so a fill scale is needed
  xlab("Terms") + ylab("Count") + coord_flip()

# Create word cloud


wordcloud(words = df$term, freq = df$freq, min.freq = 1,
random.order = FALSE, colors = brewer.pal(8, "Dark2"))
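The findAssocs() call above stores its result without displaying it; the associations can be inspected directly (the term "learn" below is only an illustrative choice; any term present in the TDM works):

# for a given term, findAssocs() returns the other terms whose frequencies
# across documents correlate with it at or above corlimit
print(find_assocs)
findAssocs(tdm, "learn", corlimit = 0.1)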

2.4 Sentiment Analysis of a text ta(d)


library(dplyr)
library(tm)
library(ggplot2)
library(tidyverse)
library(tidytext)
library(pdftools)
library(wordcloud)
library(textstem)
# First Text mining Process
text_data <- data.frame(cbind(
  id = 1:6,
  text = c("I am Arun Julka - finance teacher eager to explore the potential of AI in education. ",
           " My goal is to provide the best possible learning experience for my students.",
           " I believe that AI tools can revolutionise the way we teach and learn.",
           " I am open to collaborating with anyone who shares my passion for using AI to enhance education.",
           " I am a firm believer in student-centred learning and inquiry-based learning. ",
           "I love data science. Data science is amazing!")
))

#A data frame source interprets each row of the data frame x as a document.
#The first column must be named "doc_id" and contain a unique string identifier for each document.
#The second column must be named "text"


colnames(text_data)=c('doc_id','text')
crp<- VCorpus(DataframeSource(text_data))

print(crp[[1]])

crp<- tm_map(crp, content_transformer(tolower))


print(crp[[1]]$content)
crp<- tm_map(crp, stripWhitespace) # removes whitespaces
crp<- tm_map(crp, removePunctuation) # removes punctuations
crp<- tm_map(crp, removeNumbers) # removes numbers
crp<- tm_map(crp, removeWords, stopwords("english"))
mystopwords<- c(stopwords("english"),"arun")
crp<- tm_map(crp, removeWords, mystopwords)

# Lemmatization
crp<- tm_map(crp, content_transformer(lemmatize_strings))
print(crp[[1]]$content)
review_corpus<- crp

tdm<- TermDocumentMatrix(review_corpus, control = list(wordLengths = c(1,Inf)))


# inspect frequent words
freq_terms<- findFreqTerms(tdm, lowfreq=1)

term_freq<- rowSums(as.matrix(tdm))
term_freq<- subset(term_freq, term_freq>=1)
df<- data.frame(term = names(term_freq), freq = term_freq)

#association of words
find_assocs= findAssocs(tdm,"text",corlimit = 0.1)

# Now plotting the top frequent words


library(ggplot2)

df_plot<- df %>%
top_n(10)

# Plot word frequency


ggplot(df_plot, aes(x = reorder(term, +freq), y = freq, fill = freq)) +
  geom_bar(stat = "identity") +
  scale_fill_gradientn(colors = terrain.colors(10)) + # the bars use the fill aesthetic, so a fill scale is needed
  xlab("Terms") + ylab("Count") + coord_flip()

# Create word cloud


wordcloud(words = df$term, freq = df$freq, min.freq = 1,
random.order = FALSE, colors = brewer.pal(8, "Dark2"))

# Get sentiment lexicon


sentiment_lexicon <- get_sentiments("bing")


colnames(df)[1]='word'
# Perform sentiment analysis
sentiment_analysis <- df %>%
inner_join(sentiment_lexicon, by=("word")) %>%
count(word, sentiment, sort = TRUE)

# View sentiment analysis


print(sentiment_analysis)
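As an optional extension (a sketch, not part of the original listing), the word-level results can be aggregated to see the overall balance of positive and negative terms:

library(ggplot2)

# total the word frequencies within each sentiment class
sentiment_summary <- df %>%
  inner_join(sentiment_lexicon, by = "word") %>%
  group_by(sentiment) %>%
  summarise(total = sum(freq))
print(sentiment_summary)

# simple bar chart of the positive vs negative totals
ggplot(sentiment_summary, aes(x = sentiment, y = total, fill = sentiment)) +
  geom_col() + xlab("Sentiment") + ylab("Total word count")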

2.5 Sentiment Analysis of a CSV file ta(e)


library(dplyr)
library(tm)
library(ggplot2)
library(tidyverse)
library(tidytext)
library(pdftools)
library(wordcloud)
library(textstem)
#Analysing text data from a CSV file
data=read.csv("C:/Users/ADMIN/OneDrive/Desktop/R/AM/amazon_vfl_reviews_session2.csv")
summary(data)
str(data)
data$sn=seq(1,nrow(data))

colnames(data)[c(6,5)]=c('doc_id','text')
# DataframeSource() expects doc_id as the first column and text as the second,
# so move them to the front (the remaining columns are kept as document metadata)
data=data[,c(6,5,1:4)]

crp<- VCorpus(DataframeSource(data))

print(crp[[1]])
crp<- tm_map(crp, content_transformer(tolower))
print(crp[[1]]$content)
crp<- tm_map(crp, stripWhitespace) # removes whitespaces
crp<- tm_map(crp, removePunctuation) # removes punctuations
crp<- tm_map(crp, removeNumbers) # removes numbers
crp<- tm_map(crp, removeWords, stopwords("english"))
mystopwords<- c(stopwords("english"),"book","people")
crp<- tm_map(crp, removeWords, mystopwords)

# Lemmatization
crp<- tm_map(crp, content_transformer(lemmatize_strings))
print(crp[[1]]$content)
review_corpus<- crp

tdm<- TermDocumentMatrix(review_corpus, control = list(wordLengths = c(1,Inf)))


# inspect frequent words
freq_terms<- findFreqTerms(tdm, lowfreq=1)


term_freq<- rowSums(as.matrix(tdm))
term_freq<- subset(term_freq, term_freq>=1)
df<- data.frame(term = names(term_freq), freq = term_freq)

#association of words
find_assocs= findAssocs(tdm,"text",corlimit = 0.1)

# Now plotting the top frequent words


library(ggplot2)

df_plot<- df %>%
top_n(10)

# Plot word frequency


ggplot(df_plot, aes(x = reorder(term, +freq), y = freq, fill = freq)) +
  geom_bar(stat = "identity") +
  scale_fill_gradientn(colors = terrain.colors(10)) + # the bars use the fill aesthetic, so a fill scale is needed
  xlab("Terms") + ylab("Count") + coord_flip()

# Create word cloud


wordcloud(words = df$term, freq = df$freq, min.freq = 1,
random.order = FALSE, colors = brewer.pal(8, "Dark2"))

# Get sentiment lexicon


sentiment_lexicon <- get_sentiments("bing")

colnames(df)[1]='word'
# Perform sentiment analysis
sentiment_analysis <- df %>%
inner_join(sentiment_lexicon, by=("word")) %>%
count(word, sentiment, sort = TRUE)

# View sentiment analysis


print(sentiment_analysis)

2.6 Sentiment Analysis of a PDF file ta(f)


#Read PDF Files
##Reading PDF Files From location
#identifying multiple pdf files from folder
library(pdftools)
library(tm)
stop_words2=c(stopwords("en"),"makes")
setwd("C:/Users/ADMIN/OneDrive/Desktop/R/AM/PDF")

files<- list.files(pattern = "pdf$")


files # files contains the character vector of PDF file names


read_function<- readPDF(control=list(text="-layout"))
read_corpus<- Corpus(URISource(files[1:5]),readerControl = list(reader=read_function))

read_corpus<-tm_map(read_corpus,removePunctuation)

dtm <- DocumentTermMatrix(read_corpus,
  control = list(removePunctuation = TRUE, stopwords = TRUE, tolower = TRUE,
                 removeNumbers = TRUE, stemming = TRUE,
                 bounds = list(global = c(3, Inf))))

dtm_matrix<-as.matrix(dtm) # converting dtm to a matrix so that data becomes viewable


#some inverted commas, hashtags etc. are not removed by removePunctuation,
# so the textclean package can be used for those cases.

View(dtm_matrix) # this may take a few seconds, as it shows the count of each word in each PDF

dtm_matrix<-t(dtm_matrix) # transpose to show the data in a structured format (terms as rows)

number_occurance<- rowSums(dtm_matrix) #use rowSums not rowsum as this is matrix


number_occurance[1:20] # square brackets limit the number of words shown

number_occurance_sorted <- sort(number_occurance,decreasing = TRUE)


number_occurance_sorted[1:20] # square brackets limit the number of words shown

library(wordcloud)
set.seed(123)
wordcloud(names(number_occurance_sorted), number_occurance_sorted, max.words=25,
scale=c(3, .1), colors=brewer.pal(6, "Dark2"))

cor_word <- findAssocs(dtm, "marketing" , corlimit = 0.2)


cor_word$marketing[1:20] #as we are correlating with "marketing"

library(treemap)

data_frame<- data.frame(word=names(number_occurance_sorted),
freq=number_occurance_sorted)
data_frame[1:20,]

#Keep only words above a minimum frequency in the treemap


treemap(subset(data_frame,number_occurance_sorted>10), index = c('word'), vSize = 'freq')

#Plot a chosen number of top words


treemap(data_frame[1:10,], index = c('word'), vSize = 'freq')

#cluster analysis
# dist() needs numeric input, so compute distances on the frequency column
# and keep the words as labels for the dendrogram
top_words<- data_frame[1:20,]
distance<- dist(top_words$freq)
distance
clust<- hclust(distance)
plot(clust, labels = top_words$word, hang = -1) # hang = -1 for symmetric cluster roots
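The listing above covers frequency analysis, the word cloud, the treemap and clustering; to match the section heading, a sentiment step in the spirit of section 2.8 could be appended (a sketch, assuming read_corpus still holds the PDF text):

library(syuzhet)

# pull the raw text back out of the corpus, one element per PDF
pdf_text_vec <- sapply(read_corpus, function(d) paste(content(d), collapse = " "))

# score the eight NRC emotions plus positive/negative for each document
pdf_sent <- get_nrc_sentiment(pdf_text_vec)
head(pdf_sent)

# total each sentiment across the PDFs and plot, as in section 2.8
barplot(colSums(pdf_sent), las = 2, col = rainbow(10),
        ylab = 'Count', main = 'Sentiment of PDF files')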

2.7 Sentiment Analysis of Chapter 7.1 (HKD)


#install readxl
install.packages("readxl")
library(readxl)
#Replace "our_pdf_file.xlxs" with the actual path to your EXCEL file
reviews<-read_excel("C:/Users/ADMIN/OneDrive/Desktop/R/R Data/socialmediareviews.xlsx")
#install tm
install.packages("tm")
library(tm)
review_corp<-VCorpus(VectorSource(reviews$reviews))
review_corp[[1]]$content # inspect the first review
review_corp<- tm_map(review_corp, removeWords,
  c("now","Know","took","that's","air","away","war","Know","job","one","like","actually","new","guy",
    "don't","things","lot","try","bit","don't","don't","anything","thing","say","also","can","get",
    "used","got","take","just","now","will","it's","want","whatever","become","that's","said",
    "given","give","much"))
head(reviews)
tail(reviews)
tdm <- TermDocumentMatrix (review_corp, control = list (removePunctuation = TRUE, stopwords =
TRUE))
tdm_matrix<- as.matrix(tdm)
tdm_matrix<-t(tdm_matrix)
tdm_matrix[1:20]
number_occurrence <- rowSums(tdm_matrix)
number_occurrence[1:20]
number_occurrence_sorted<-sort(number_occurrence ,decreasing=TRUE)
number_occurrence_sorted[1:60]

2.8 Sentiment Analysis of Chapter 7.2(HKD)


#install readxl
install.packages("readxl")
library(readxl)
#Replace "our_pdf_file.xlxs" with the actual path to your EXCEL file
reviews<-read_excel("C:/Users/ADMIN/OneDrive/Desktop/R/R Data/socialmediareviews.xlsx")
#install tm
install.packages("tm")
library(tm)
review_corp<-VCorpus(VectorSource(reviews$reviews))
review_corp[[1]]$content # inspect the first review
review_corp<- tm_map(review_corp, removeWords,
  c("now","Know","took","that's","air","away","war","Know","job","one","like","actually","new","guy",
    "don't","things","lot","try","bit","don't","don't","anything","thing","say","also","can","get",
    "used","got","take","just","now","will","it's","want","whatever","become","that's","said",
    "given","give","much"))
head(reviews)
tail(reviews)
tdm <- TermDocumentMatrix (review_corp, control = list (removePunctuation = TRUE, stopwords =
TRUE))
tdm_matrix<- as.matrix(tdm)
tdm_matrix<-t(tdm_matrix)
tdm_matrix[1:20]
number_occurrence <- rowSums(tdm_matrix)
number_occurrence[1:20]
number_occurrence_sorted<-sort(number_occurrence ,decreasing=TRUE)
number_occurrence_sorted[1:60]

library(wordcloud)
wordcloud(names(number_occurrence_sorted), number_occurrence_sorted, max.words=25,
scale=c(3, .1), colors=brewer.pal(6, "Dark2"))

#association of words
cor_word <- findAssocs(tdm, "time" , corlimit = 0.1)
cor_word$time

library(treemap)
data_frame<- data.frame(word=names(number_occurrence_sorted),
freq=number_occurrence_sorted)
data_frame[1:20,]

#Keep only words above a minimum frequency in the treemap


treemap(subset(data_frame,number_occurrence_sorted>10), index = c('word'), vSize = 'freq')

#Plot a chosen number of top words


treemap(data_frame[1:10,], index = c('word'), vSize = 'freq')

#cluster analysis
# dist() needs numeric input, so compute distances on the frequency column
# and keep the words as labels for the dendrogram
top_words<- data_frame[1:20,]
distance<- dist(top_words$freq)
distance
clust<- hclust(distance)
plot(clust, labels = top_words$word, hang = -1) # hang = -1 for symmetric cluster roots

#sentiment analysis
library(syuzhet)
sent_corpus<- iconv(reviews$reviews)
review_sent<- get_nrc_sentiment(sent_corpus)
head(review_sent)
sentiment_counts <- colSums(review_sent)
barplot(sentiment_counts, las=2, col=rainbow(10), ylab='Count', main= 'Sentiment reviews')


#Important Notes:
#Accuracy: Sentiment analysis accuracy depends heavily on the quality of the text data, the chosen lexicon or model, and the complexity of the text.
#Lexicon Selection: get_nrc_sentiment() from the syuzhet package uses the built-in NRC emotion lexicon. You can explore other lexicons (e.g., Bing Liu, AFINN) for potentially better results.
#Advanced Techniques: For more sophisticated sentiment analysis, consider using machine learning models like Naive Bayes or Support Vector Machines.
#Error Handling: Implement robust error handling for potential issues like invalid PDF files or unexpected text formats.
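Following the lexicon-selection note, syuzhet can score the same reviews under several lexicons for comparison (a sketch reusing the reviews object loaded above):

library(syuzhet)

sent_corpus<- iconv(reviews$reviews)

# overall polarity of each review under three different lexicons
afinn_scores <- get_sentiment(sent_corpus, method = "afinn")
bing_scores  <- get_sentiment(sent_corpus, method = "bing")
nrc_scores   <- get_sentiment(sent_corpus, method = "nrc")

# the sign of the scores should broadly agree; the magnitudes differ by lexicon
summary(afinn_scores)
summary(bing_scores)
summary(nrc_scores)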
