quanteda Cheat Sheet

General syntax
• corpus_* manage text collections/metadata
• tokens_* create/modify tokenized texts
• dfm_* create/modify document-feature matrices
• fcm_* work with co-occurrence matrices
• textstat_* calculate text-based statistics
• textmodel_* fit (un-)supervised models
• textplot_* create text-based visualizations

Extensions
quanteda works well with these companion packages:
• readtext: an easy way to read text data
• spacyr: NLP using the spaCy library
• quanteda.data: additional textual data
• stopwords: multilingual stopword lists in R

Consistent grammar:
• object() constructor for the object type
• object_verb() inputs & returns object type
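A minimal sketch of the object()/object_verb() grammar (object names are illustrative):
toks <- tokens(data_corpus_inaugural)               # tokens() constructs a tokens object
toks <- tokens_remove(toks, stopwords("english"))   # tokens_*() verbs input and return tokens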
Create a corpus from texts (corpus_*)

Read texts (txt, pdf, csv, doc, docx, json, xml)
my_texts <- readtext::readtext("~/link/to/path/*")
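The resulting readtext object can be passed straight to corpus(); a minimal sketch:
my_corpus <- corpus(my_texts)   # document-level variables carry over as docvars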
Construct a corpus from a character vector
x <- corpus(data_char_ukimmig2010)
Explore a corpus
summary(data_corpus_inaugural, n = 2)
# Corpus consisting of 58 documents, showing 2 documents:
#
#             Text Types Tokens Sentences Year  President FirstName
#  1789-Washington   625   1538        23 1789 Washington    George
#  1793-Washington    96    147         4 1793 Washington    George
#
# Source: Gerhard Peters and John T. Woolley. The American Presidency Project.
# Created: Tue Jun 13 14:51:47 2017
# Notes: http://www.presidency.ucsb.edu/inaugurals.php
Extract or add document-level variables
party <- docvars(data_corpus_inaugural, "Party")
docvars(x, "serial_number") <- 1:ndoc(x)
Bind or subset corpora
corpus(x[1:5]) + corpus(x[7:9])
corpus_subset(x, Year > 1990)

Change units of a corpus
corpus_reshape(x, to = c("sentences", "paragraphs"))

Segment texts on a pattern match
corpus_segment(x, pattern, valuetype, extract_pattern = TRUE)

Take a random sample of corpus texts
corpus_sample(x, size = 10, replace = FALSE)

Extract features (dfm_*; fcm_*)

Create a document-feature matrix (dfm) from a corpus
x <- dfm(data_corpus_inaugural,
         tolower = TRUE, stem = FALSE, remove_punct = TRUE,
         remove = stopwords("english"))
head(x, n = 2, nf = 4)
## Document-feature matrix of: 2 documents, 4 features (41.7% sparse).
##                  features
## docs              fellow-citizens senate house representatives
##   1789-Washington               1      1     2               2
##   1793-Washington               0      0     0               0

Create a dictionary
dictionary(list(negative = c("bad", "awful", "sad"),
                positive = c("good", "wonderful", "happy")))

Apply a dictionary
dfm_lookup(x, dictionary = data_dictionary_LSD2015)
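To apply the toy dictionary above instead of the built-in Lexicoder dictionary, a minimal sketch (example texts are invented):
my_dict <- dictionary(list(negative = c("bad", "awful", "sad"),
                           positive = c("good", "wonderful", "happy")))
dfm_lookup(dfm(c("a sad, bad day", "a good, happy day")), dictionary = my_dict)
# returns one feature per dictionary key: negative, positive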
Select features
dfm_select(x, dictionary = data_dictionary_LSD2015)

Randomly sample documents or features
dfm_sample(x, what = c("documents", "features"))

Weight or smooth the feature frequencies
dfm_weight(x, type = "prop") | dfm_smooth(x, smoothing = 0.5)
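Recent quanteda versions also offer tf-idf weighting; a one-line sketch:
dfm_tfidf(x)   # term frequency-inverse document frequency weights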
Sort or group a dfm
dfm_sort(x, margin = c("features", "documents", "both"))
dfm_group(x, groups = "President")

Combine identical dimension elements of a dfm
dfm_compress(x, margin = c("both", "documents", "features"))

Create a feature co-occurrence matrix (fcm)
x <- fcm(data_corpus_inaugural, context = "window", size = 5)
fcm_compress/remove/select/toupper/tolower are also available
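An fcm can also be plotted as a semantic network; a sketch assuming the fcm x created above:
feats <- names(topfeatures(x, 30))                # 30 most frequent features
textplot_network(fcm_select(x, feats), min_freq = 0.5)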
Useful additional functions

Locate keywords-in-context
kwic(data_corpus_inaugural, "america*")
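The context window is adjustable (five tokens either side by default); sketch:
kwic(data_corpus_inaugural, "america*", window = 3)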
Utility functions
texts(corpus)             Show texts of a corpus
ndoc(corpus/dfm/tokens)   Count documents
nfeat(dfm/fcm)            Count features
summary(corpus/dfm)       Print summary
head(corpus/dfm)          Return first part
tail(corpus/dfm)          Return last part
Tokenize a set of texts (tokens_*)

Tokenize texts from a character vector or corpus
x <- tokens("Powerful tool for text analysis.", remove_punct = TRUE)
Convert sequences into compound tokens
myseqs <- phrase(c("powerful", "tool", "text analysis"))
tokens_compound(x, myseqs)
Select tokens
tokens_select(x, c("powerful", "text"), selection = "keep")
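tokens_remove() is shorthand for selection = "remove", e.g. for stopwords; sketch:
tokens_remove(x, stopwords("english"))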
Create ngrams and skipgrams from tokens
tokens_ngrams(x, n = 1:3)
tokens_skipgrams(x, n = 2, skip = 0:1)
Convert case of tokens
tokens_tolower(x) | tokens_toupper(x)
Stem the terms in an object
tokens_wordstem(x)
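The tokens_* verbs chain well; a sketch of a typical preprocessing pipeline:
data_corpus_inaugural %>%
    tokens(remove_punct = TRUE) %>%
    tokens_remove(stopwords("english")) %>%
    tokens_wordstem() %>%
    dfm()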
Calculate text statistics (textstat_*)

Tabulate feature frequencies from a dfm
textstat_frequency(x) | topfeatures(x)
Identify and score collocations from a tokenized text
toks <- tokens(c("quanteda is a pkg for quant text analysis",
                 "quant text analysis is a growing field"))
textstat_collocations(toks, size = 3, min_count = 2)
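Scored collocations can be passed back to tokens_compound() to join multi-word expressions; a sketch using toks from above:
colls <- textstat_collocations(toks, size = 2, min_count = 2)
tokens_compound(toks, colls)   # e.g. "text analysis" becomes "text_analysis"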
Calculate readability of a corpus
textstat_readability(data_corpus_inaugural, measure = "Flesch")
Calculate lexical diversity of a dfm
textstat_lexdiv(x, measure = "TTR")
Measure distance or similarity from a dfm
textstat_simil(x, "2017-Trump", method = "cosine")
textstat_dist(x, "2017-Trump", margin = "documents")
Calculate keyness statistics
textstat_keyness(x, target = "2017-Trump")
Fit text models based on a dfm (textmodel_*)

Correspondence Analysis (CA)
textmodel_ca(x, threads = 2, sparse = TRUE, residual_floor = 0.1)
Naïve Bayes classifier for texts
textmodel_nb(x, y = training_labels, distribution = "multinomial")
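A fitted classifier scores new documents with predict(); a minimal sketch (training_labels is a placeholder vector):
nb_model <- textmodel_nb(x, y = training_labels)
predict(nb_model, newdata = x)   # predicted class for each document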
Wordscores text model
refscores <- c(seq(-1.5, 1.5, .75), NA)
textmodel_wordscores(data_dfm_lbgexample, refscores)
Wordfish Poisson scaling model
textmodel_wordfish(dfm(data_corpus_irishbudget2010), dir = c(6, 5))

Textmodel methods: predict(), coef(), summary(), print()
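For example, applying these methods to the Wordscores model above (sketch):
ws <- textmodel_wordscores(data_dfm_lbgexample, refscores)
predict(ws, se.fit = TRUE)   # estimated document positions with standard errors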
Plot features or models (textplot_*)

Plot features as a wordcloud
data_corpus_inaugural %>%
    corpus_subset(President == "Obama") %>%
    dfm(remove = stopwords("english")) %>%
    textplot_wordcloud()
[Figure: word cloud of the most frequent terms, e.g. america, people, freedom, world, nation]
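textplot_wordcloud() can also contrast groups with comparison = TRUE; a sketch:
data_corpus_inaugural %>%
    corpus_subset(President %in% c("Obama", "Trump")) %>%
    dfm(groups = "President", remove = stopwords("english")) %>%
    textplot_wordcloud(comparison = TRUE)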
Lexical dispersion plot
Plot the dispersion of key word(s)
data_corpus_inaugural %>%
    corpus_subset(Year > 1945) %>%
    kwic("american") %>%
    textplot_xray()
[Figure: one x-ray panel per address (e.g. 2001-Bush through 2017-Trump), "american" plotted by relative token index, 0.00-1.00]

Plot word keyness
data_corpus_inaugural %>%
    corpus_subset(President %in% c("Obama", "Trump")) %>%
    dfm(groups = "President", remove = stopwords("english")) %>%
    textstat_keyness(target = "Trump") %>%
    textplot_keyness()
[Figure: chi2 keyness bars, Trump terms (e.g. america, dreams, protected, country) vs. Obama terms (e.g. freedom, journey, generation, common)]

Plot Wordfish, Wordscores or CA models
textplot_scale1d(scaling_model,
                 groups = party,
                 margin = "documents")
[Figure: estimated document positions of Irish budget-debate speakers (Kenny FG through Cowen FF), grouped by party]
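scaling_model stands for any fitted Wordfish, Wordscores or CA object; e.g., plotting the word parameters of the Wordfish model from above (sketch):
wf <- textmodel_wordfish(dfm(data_corpus_irishbudget2010), dir = c(6, 5))
textplot_scale1d(wf, margin = "features")   # word positions instead of documents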
Convert dfm to a non-quanteda format
convert(x, to = c("lda", "tm", "stm", "austin", "topicmodels",
                  "lsa", "matrix", "data.frame"))
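For example, the "stm" target returns the documents/vocab/meta list that stm::stm() expects (sketch; the topic count K is arbitrary):
stm_input <- convert(x, to = "stm")
# stm::stm(stm_input$documents, stm_input$vocab, K = 10, data = stm_input$meta)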
Learn more at: http://quanteda.io • updated: 05/18
https://creativecommons.org/licenses/by/4.0/