quanteda Cheat Sheet

General syntax
• corpus_* manage text collections/metadata
• tokens_* create/modify tokenized texts
• dfm_* create/modify document-feature matrices
• fcm_* work with co-occurrence matrices
• textstat_* calculate text-based statistics
• textmodel_* fit (un-)supervised models
• textplot_* create text-based visualizations

Extensions
quanteda works well with these companion packages:
• readtext: an easy way to read text data
• spacyr: NLP using the spaCy library
• quanteda.data: additional textual data
• stopwords: multilingual stopword lists in R

Consistent grammar:
• object() constructor for the object type
• object_verb() inputs & returns object type
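A minimal sketch of the object()/object_verb() grammar (object names are illustrative):
toks <- tokens(data_corpus_inaugural)               # tokens() constructs a tokens object
toks <- tokens_remove(toks, stopwords("english"))   # tokens_*() verbs input and return tokens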
Create a corpus from texts (corpus_*)

Read texts (txt, pdf, csv, doc, docx, json, xml)
my_texts <- readtext::readtext("~/link/to/path/*")
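The resulting readtext object can be passed straight to corpus(); a minimal sketch:
my_corpus <- corpus(my_texts)   # document-level variables carry over as docvars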
Construct a corpus from a character vector
x <- corpus(data_char_ukimmig2010)
Explore a corpus
summary(data_corpus_inaugural, n = 2)
# Corpus consisting of 58 documents, showing 2 documents:
#
#             Text Types Tokens Sentences Year  President FirstName
#  1789-Washington   625   1538        23 1789 Washington    George
#  1793-Washington    96    147         4 1793 Washington    George
#
# Source: Gerhard Peters and John T. Woolley. The American Presidency Project.
# Created: Tue Jun 13 14:51:47 2017
# Notes: http://www.presidency.ucsb.edu/inaugurals.php
Extract or add document-level variables
party <- docvars(data_corpus_inaugural, "Party")
docvars(x, "serial_number") <- 1:ndoc(x)
Bind or subset corpora
corpus(x[1:5]) + corpus(x[7:9])
corpus_subset(x, Year > 1990)

Change units of a corpus
corpus_reshape(x, to = c("sentences", "paragraphs"))

Segment texts on a pattern match
corpus_segment(x, pattern, valuetype, extract_pattern = TRUE)

Take a random sample of corpus texts
corpus_sample(x, size = 10, replace = FALSE)

Extract features (dfm_*; fcm_*)

Create a document-feature matrix (dfm) from a corpus
x <- dfm(data_corpus_inaugural,
         tolower = TRUE, stem = FALSE, remove_punct = TRUE,
         remove = stopwords("english"))
head(x, n = 2, nf = 4)
## Document-feature matrix of: 2 documents, 4 features (41.7% sparse).
##                  features
## docs              fellow-citizens senate house representatives
##   1789-Washington               1      1     2               2
##   1793-Washington               0      0     0               0

Create a dictionary
dictionary(list(negative = c("bad", "awful", "sad"),
                positive = c("good", "wonderful", "happy")))

Apply a dictionary
dfm_lookup(x, dictionary = data_dictionary_LSD2015)
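To apply the toy dictionary above instead of the built-in Lexicoder dictionary, a minimal sketch (example texts are invented):
my_dict <- dictionary(list(negative = c("bad", "awful", "sad"),
                           positive = c("good", "wonderful", "happy")))
dfm_lookup(dfm(c("a sad, bad day", "a good, happy day")), dictionary = my_dict)
# returns one feature per dictionary key: negative, positive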
Select features
dfm_select(x, dictionary = data_dictionary_LSD2015)

Randomly sample documents or features
dfm_sample(x, what = c("documents", "features"))

Weight or smooth the feature frequencies
dfm_weight(x, type = "prop") | dfm_smooth(x, smoothing = 0.5)
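Recent quanteda versions also offer tf-idf weighting; a one-line sketch:
dfm_tfidf(x)   # term frequency-inverse document frequency weights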
Sort or group a dfm
dfm_sort(x, margin = c("features", "documents", "both"))
dfm_group(x, groups = "President")

Combine identical dimension elements of a dfm
dfm_compress(x, margin = c("both", "documents", "features"))

Create a feature co-occurrence matrix (fcm)
x <- fcm(data_corpus_inaugural, context = "window", size = 5)
fcm_compress/remove/select/toupper/tolower are also available
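An fcm can also be plotted as a semantic network; a sketch assuming the fcm x created above:
feats <- names(topfeatures(x, 30))                # 30 most frequent features
textplot_network(fcm_select(x, feats), min_freq = 0.5)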
Useful additional functions

Locate keywords-in-context
kwic(data_corpus_inaugural, "america*")
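The context window is adjustable (five tokens either side by default); sketch:
kwic(data_corpus_inaugural, "america*", window = 3)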
Utility functions
texts(corpus)             Show texts of a corpus
ndoc(corpus/dfm/tokens)   Count documents
nfeat(dfm/fcm)            Count features
summary(corpus/dfm)       Print summary
head(corpus/dfm)          Return first part
tail(corpus/dfm)          Return last part
Tokenize a set of texts (tokens_*)

Tokenize texts from a character vector or corpus
x <- tokens("Powerful tool for text analysis.", remove_punct = TRUE)
Convert sequences into compound tokens
myseqs <- phrase(c("powerful", "tool", "text analysis"))
tokens_compound(x, myseqs)
Select tokens
tokens_select(x, c("powerful", "text"), selection = "keep")
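tokens_remove() is shorthand for selection = "remove", e.g. for stopwords; sketch:
tokens_remove(x, stopwords("english"))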
Create ngrams and skipgrams from tokens
tokens_ngrams(x, n = 1:3)
tokens_skipgrams(x, n = 2, skip = 0:1)
Convert case of tokens
tokens_tolower(x) | tokens_toupper(x)
Stem the terms in an object
tokens_wordstem(x)
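The tokens_* verbs chain well; a sketch of a typical preprocessing pipeline:
data_corpus_inaugural %>%
    tokens(remove_punct = TRUE) %>%
    tokens_remove(stopwords("english")) %>%
    tokens_wordstem() %>%
    dfm()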
Calculate text statistics (textstat_*)

Tabulate feature frequencies from a dfm
textstat_frequency(x) | topfeatures(x)
Identify and score collocations from a tokenized text
toks <- tokens(c("quanteda is a pkg for quant text analysis",
                 "quant text analysis is a growing field"))
textstat_collocations(toks, size = 3, min_count = 2)
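Scored collocations can be passed back to tokens_compound() to join multi-word expressions; a sketch using toks from above:
colls <- textstat_collocations(toks, size = 2, min_count = 2)
tokens_compound(toks, colls)   # e.g. "text analysis" becomes "text_analysis"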
Calculate readability of a corpus
textstat_readability(data_corpus_inaugural, measure = "Flesch")
Calculate lexical diversity of a dfm
textstat_lexdiv(x, measure = "TTR")
Measure distance or similarity from a dfm
textstat_simil(x, "2017-Trump", method = "cosine")
textstat_dist(x, "2017-Trump", margin = "documents")
Calculate keyness statistics
textstat_keyness(x, target = "2017-Trump")
Fit text models based on a dfm (textmodel_*)

Correspondence Analysis (CA)
textmodel_ca(x, threads = 2, sparse = TRUE, residual_floor = 0.1)
Naïve Bayes classifier for texts
textmodel_nb(x, y = training_labels, distribution = "multinomial")
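A fitted classifier scores new documents with predict(); a minimal sketch (training_labels is a placeholder vector):
nb_model <- textmodel_nb(x, y = training_labels)
predict(nb_model, newdata = x)   # predicted class for each document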
Wordscores text model
refscores <- c(seq(-1.5, 1.5, .75), NA)
textmodel_wordscores(data_dfm_lbgexample, refscores)
Wordfish Poisson scaling model
textmodel_wordfish(dfm(data_corpus_irishbudget2010), dir = c(6, 5))

Textmodel methods: predict(), coef(), summary(), print()
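For example, applying these methods to the Wordscores model above (sketch):
ws <- textmodel_wordscores(data_dfm_lbgexample, refscores)
predict(ws, se.fit = TRUE)   # estimated document positions with standard errors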
Plot features or models (textplot_*)

Plot features as a wordcloud
data_corpus_inaugural %>%
    corpus_subset(President == "Obama") %>%
    dfm(remove = stopwords("english")) %>%
    textplot_wordcloud()
[Figure: word cloud of the most frequent terms, e.g. america, people, freedom, world, nation]
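textplot_wordcloud() can also contrast groups with comparison = TRUE; a sketch:
data_corpus_inaugural %>%
    corpus_subset(President %in% c("Obama", "Trump")) %>%
    dfm(groups = "President", remove = stopwords("english")) %>%
    textplot_wordcloud(comparison = TRUE)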
Lexical dispersion plot
Plot the dispersion of key word(s)
data_corpus_inaugural %>%
    corpus_subset(Year > 1945) %>%
    kwic("american") %>%
    textplot_xray()
[Figure: one x-ray panel per address (e.g. 2001-Bush through 2017-Trump), "american" plotted by relative token index, 0.00-1.00]

Plot word keyness
data_corpus_inaugural %>%
    corpus_subset(President %in% c("Obama", "Trump")) %>%
    dfm(groups = "President", remove = stopwords("english")) %>%
    textstat_keyness(target = "Trump") %>%
    textplot_keyness()
[Figure: chi2 keyness bars, Trump terms (e.g. america, dreams, protected, country) vs. Obama terms (e.g. freedom, journey, generation, common)]

Plot Wordfish, Wordscores or CA models
textplot_scale1d(scaling_model,
                 groups = party,
                 margin = "documents")
[Figure: estimated document positions of Irish budget-debate speakers (Kenny FG through Cowen FF), grouped by party]
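scaling_model stands for any fitted Wordfish, Wordscores or CA object; e.g., plotting the word parameters of the Wordfish model from above (sketch):
wf <- textmodel_wordfish(dfm(data_corpus_irishbudget2010), dir = c(6, 5))
textplot_scale1d(wf, margin = "features")   # word positions instead of documents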
Convert dfm to a non-quanteda format
convert(x, to = c("lda", "tm", "stm", "austin", "topicmodels",
                  "lsa", "matrix", "data.frame"))
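For example, the "stm" target returns the documents/vocab/meta list that stm::stm() expects (sketch; the topic count K is arbitrary):
stm_input <- convert(x, to = "stm")
# stm::stm(stm_input$documents, stm_input$vocab, K = 10, data = stm_input$meta)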
Learn more at: http://quanteda.io • updated: 05/18
https://creativecommons.org/licenses/by/4.0/