Text Analysis with Quanteda

This cheat sheet summarizes functions in the quanteda package for text analysis in R. It gives the general syntax for functions that manage text corpora and tokens, create document-feature matrices, fit text models, and create text visualizations, and it lists additional functions and companion packages for tasks such as reading text data, applying dictionaries, and sampling documents. It is intended as a concise reference for the core capabilities of quanteda.


General syntax

• corpus_* manage text collections/metadata
• tokens_* create/modify tokenized texts
• dfm_* create/modify document-feature matrices
• fcm_* work with co-occurrence matrices
• textstat_* calculate text-based statistics
• textmodel_* fit (un-)supervised text models
• textplot_* create text-based visualizations

Consistent grammar:
• object() constructor for the object type
• object_verb() inputs & returns object type

Extensions

quanteda works well with these companion packages:
• readtext: an easy way to read text data
• spacyr: NLP using the spaCy library
• quanteda.data: additional textual data
• stopwords: multilingual stopword lists in R

Create a corpus from texts (corpus_*)

Read texts (txt, pdf, csv, doc, docx, json, xml)
my_texts <- readtext::readtext("~/link/to/path/*")

Construct a corpus from a character vector
x <- corpus(data_char_ukimmig2010, text_field = "text")

Explore a corpus
summary(data_corpus_inaugural, n = 2)
# Corpus consisting of 58 documents, showing 2 documents:
#            Text Types Tokens Sentences Year  President FirstName
# 1789-Washington   625   1538        23 1789 Washington    George
# 1793-Washington    96    147         4 1793 Washington    George
#
# Source: Gerhard Peters and John T. Woolley. The American Presidency Project.
# Created: Tue Jun 13 14:51:47 2017
# Notes: http://www.presidency.ucsb.edu/inaugurals.php

Extract or add document-level variables
party <- docvars(data_corpus_inaugural, "Party")
docvars(x, "serial_number") <- 1:ndoc(x)

Bind or subset corpora
corpus(x[1:5]) + corpus(x[7:9])
corpus_subset(x, Year > 1990)

Change units of a corpus
corpus_reshape(x, to = c("sentences", "paragraphs"))

Segment texts on a pattern match
corpus_segment(x, pattern, valuetype, extract_pattern = TRUE)

Take a random sample of corpus texts
corpus_sample(x, size = 10, replace = FALSE)

Extract features (dfm_*; fcm_*)

Create a document-feature matrix (dfm) from a corpus
x <- dfm(data_corpus_inaugural,
         tolower = TRUE, stem = FALSE, remove_punct = TRUE,
         remove = stopwords("english"))

head(x, n = 2, nf = 4)
## Document-feature matrix of: 2 documents, 4 features (41.7% sparse).
## features
## docs              fellow-citizens senate house representatives
##   1789-Washington               1      1     2               2
##   1793-Washington               0      0     0               0

Create a dictionary
dictionary(list(negative = c("bad", "awful", "sad"),
                positive = c("good", "wonderful", "happy")))

Apply a dictionary
dfm_lookup(x, dictionary = data_dictionary_LSD2015)

Select features
dfm_select(x, dictionary = data_dictionary_LSD2015)

Randomly sample documents or features
dfm_sample(x, what = c("documents", "features"))

Weight or smooth the feature frequencies
dfm_weight(x, type = "prop") | dfm_smooth(x, smoothing = 0.5)

Sort or group a dfm
dfm_sort(x, margin = c("features", "documents", "both"))
dfm_group(x, groups = "President")

Combine identical dimension elements of a dfm
dfm_compress(x, margin = c("both", "documents", "features"))

Create a feature co-occurrence matrix (fcm)
x <- fcm(data_corpus_inaugural, context = "window", size = 5)
fcm_compress/remove/select/toupper/tolower are also available

Useful additional functions

Locate keywords-in-context
kwic(data_corpus_inaugural, "america*")

Utility functions
texts(corpus)            Show texts of a corpus
ndoc(corpus/dfm/tokens)  Count documents
nfeat(dfm/fcm)           Count features
summary(corpus/dfm)      Print summary
head(corpus/dfm)         Return first part
tail(corpus/dfm)         Return last part
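The consistent grammar means these steps compose: each constructor builds an object and each object_verb() takes and returns that object type, so a whole analysis chains together. A minimal sketch (the object names are illustrative, and argument names such as min_termfreq follow quanteda >= 1.3 and may differ in older versions):

library(quanteda)

# corpus() -> dfm() -> dfm_verb() -> textstat-style inspection:
# each verb returns the same object type it receives
inaug_corpus <- corpus(data_corpus_inaugural)        # corpus constructor
inaug_dfm <- dfm(inaug_corpus,
                 remove_punct = TRUE,
                 remove = stopwords("english"))      # dfm constructor
inaug_dfm <- dfm_trim(inaug_dfm, min_termfreq = 5)   # dfm in, dfm out
topfeatures(inaug_dfm, 10)                           # most frequent features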
Tokenize a set of texts (tokens_*)

Tokenize texts from a character vector or corpus
x <- tokens("Powerful tool for text analysis.",
            remove_punct = TRUE)

Convert sequences into compound tokens
myseqs <- phrase(c("powerful", "tool", "text analysis"))
tokens_compound(x, myseqs)

Select tokens
tokens_select(x, c("powerful", "text"), selection = "keep")

Create ngrams and skipgrams from tokens
tokens_ngrams(x, n = 1:3)
tokens_skipgrams(x, n = 2, skip = 0:1)

Convert case of tokens
tokens_tolower(x) | tokens_toupper(x)

Stem the terms in an object
tokens_wordstem(x)

Fit text models based on a dfm (textmodel_*)

Correspondence Analysis (CA)
textmodel_ca(x, threads = 2, sparse = TRUE, residual_floor = 0.1)

Naïve Bayes classifier for texts
textmodel_nb(x, y = training_labels, distribution = "multinomial")

Wordscores text model
refscores <- c(seq(-1.5, 1.5, .75), NA)
textmodel_wordscores(data_dfm_lbgexample, refscores)

Wordfish Poisson scaling model
textmodel_wordfish(dfm(data_corpus_irishbudget2010), dir = c(6, 5))

Textmodel methods: predict(), coef(), summary(), print()

Plot features or models (textplot_*)

Plot features as a wordcloud
data_corpus_inaugural %>%
  corpus_subset(President == "Obama") %>%
  dfm(remove = stopwords("english")) %>%
  textplot_wordcloud()
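Fitted textmodel_* objects share the methods predict(), coef(), summary(), and print(). A hedged sketch of a classification workflow (here x and train_labels are illustrative placeholders, not objects defined on this sheet):

library(quanteda)

# assume x is a dfm and train_labels gives one known class per document
nb_fit <- textmodel_nb(x, y = train_labels,
                       distribution = "multinomial")  # fit Naive Bayes
summary(nb_fit)               # inspect class priors and word likelihoods
predicted <- predict(nb_fit)  # predicted class for each document in x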
Calculate text statistics (textstat_*)

Tabulate feature frequencies from a dfm
textstat_frequency(x) | topfeatures(x)

Identify and score collocations from a tokenized text
toks <- tokens(c("quanteda is a pkg for quant text analysis",
                 "quant text analysis is a growing field"))
textstat_collocations(toks, size = 3, min_count = 2)

Calculate readability of a corpus
textstat_readability(data_corpus_inaugural, measure = "Flesch")

Calculate lexical diversity of a dfm
textstat_lexdiv(x, measure = "TTR")

Measure distance or similarity from a dfm
textstat_simil(x, "2017-Trump", method = "cosine")
textstat_dist(x, "2017-Trump", margin = "features")

Calculate keyness statistics
textstat_keyness(x, target = "2017-Trump")

Plot the dispersion of key word(s) (lexical dispersion plot)
data_corpus_inaugural %>%
  corpus_subset(Year > 1945) %>%
  kwic("american") %>%
  textplot_xray()

Plot word keyness
data_corpus_inaugural %>%
  corpus_subset(President %in% c("Obama", "Trump")) %>%
  dfm(groups = "President", remove = stopwords("english")) %>%
  textstat_keyness(target = "Trump") %>%
  textplot_keyness()

Plot Wordfish, Wordscores or CA models
textplot_scale1d(scaling_model,
                 groups = party, margin = "documents")

Convert dfm to a non-quanteda format
convert(x, to = c("lda", "tm", "stm", "austin", "topicmodels",
                  "lsa", "matrix", "data.frame"))

by Stefan Müller and Kenneth Benoit • [email protected], [email protected]
https://creativecommons.org/licenses/by/4.0/
Learn more at: http://quanteda.io • updated: 05/18
