Text and Document Visualization
Text data?
• Huge resources of information, from libraries to e-mail archives
• Documents
• Articles, books and novels
• Computer programs
• E-mails, web pages, blogs
• Tags, comments
Text data
Collection of documents
• Messages (e-mail, blogs, tags, comments)
• Social networks (personal profiles)
• Academic collaborations (publications)
Text as Data
• Words have meanings and relations
– Correlations: Hong Kong, San Francisco, Bay Area
– Order: April, February, January, June, March, May
– Membership: Tennis, Running, Swimming, Hiking, Piano
– Hierarchy, antonyms & synonyms, entities
• Is text data nominal or ordinal?
Text Processing Pipeline
Tokenization: segment text into terms
• Special cases? e.g., “San Francisco”, “L’ensemble”, “U.S.A.”
• Remove stop words? e.g., “a”, “an”, “the”, “to”, “be”?
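The two steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production tokenizer: the regex deliberately ignores the special cases noted above (it splits "San Francisco" into two terms), and the stop-word list here is a small hand-picked sample.

```python
import re
from typing import List

# A small sample stop-word list; real systems use larger curated lists.
STOP_WORDS = {"a", "an", "the", "to", "be", "of", "and", "in"}

def tokenize(text: str) -> List[str]:
    """Lowercase the text and segment it into word terms.

    Naive: multi-word terms like "San Francisco" are split apart,
    which is exactly the kind of special case a real tokenizer
    must handle explicitly.
    """
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens: List[str]) -> List[str]:
    """Drop common function words that carry little content."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("The quick brown fox jumps to the lazy dog")
print(remove_stop_words(tokens))  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```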
Stemming: one means of normalizing terms
• Reduce terms to their “root”; Porter’s algorithm for English
• e.g., automate(s), automatic, automation all map to automat
• For visualization, want to reverse stemming for labels
• Simple solution: map from stem to the most frequent word
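The "map from stem to the most frequent word" trick can be sketched as follows. The `toy_stem` function here is a crude suffix stripper standing in for Porter's algorithm (which applies a full cascade of rewrite rules); the point of the sketch is the label-recovery step.

```python
from collections import Counter
from typing import Dict, List

def toy_stem(word: str) -> str:
    """Crude suffix stripping -- a toy stand-in for Porter's algorithm."""
    for suffix in ("ing", "ion", "es", "s", "e", "ic"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

def stem_labels(words: List[str]) -> Dict[str, str]:
    """Map each stem back to its most frequent surface form, so a
    visualization can show a readable label instead of the raw stem."""
    counts = Counter(words)
    best: Dict[str, str] = {}
    for word, n in counts.items():
        stem = toy_stem(word)
        if stem not in best or n > counts[best[stem]]:
            best[stem] = word
    return best

words = ["automation", "automation", "automatic", "automates", "automate"]
print(stem_labels(words))  # {'automat': 'automation'}
```

All four surface forms collapse to the stem "automat", and the most frequent form, "automation", is chosen as the display label.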
Stemming vs. Lemmatization
• Stemming heuristically chops suffixes; lemmatization uses vocabulary and morphology to return dictionary forms (lemmas), e.g., "better" → "good"
Stop words
Bag of Words Model
• A document ≈ a vector of term weights
– Each dimension corresponds to a term (often 10,000+)
– Each value represents the term's relevance to the document
– For example, simple term counts
• Aggregate the vectors into a document x term matrix
• This is the document vector space model
Document x Term matrix
• Each document is a vector of term weights
• Simplest weighting is to just count occurrences
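A minimal sketch of building such a matrix with raw occurrence counts, using a tiny made-up corpus and whitespace tokenization for brevity:

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Vocabulary: one column per distinct term, in sorted order.
vocab = sorted({term for doc in docs for term in doc.split()})

# Document x term matrix: one row per document, raw counts as weights.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[term] for term in vocab])

print(vocab)      # ['cat', 'chased', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(matrix[0])  # [1, 0, 0, 0, 1, 1, 1, 2]
```

Note how "the" gets the largest weight in every row, which motivates the Tf-Idf weighting introduced next.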
Computing Weights
• Let Tf(w) be the term frequency: the number of times word w occurs in the document
• Let Df(w) be the document frequency: the number of documents that contain the word
• Let N be the total number of documents
• We define TfIdf(w) = Tf(w) · log(N / Df(w))
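A minimal sketch of the standard weighting TfIdf(w) = Tf(w) · log(N / Df(w)), again using whitespace tokenization on a tiny made-up corpus:

```python
import math
from collections import Counter
from typing import Dict, List

def tf_idf(docs: List[str]) -> List[Dict[str, float]]:
    """Return one {term: TfIdf weight} dict per document, where
    TfIdf(w) = Tf(w) * log(N / Df(w))."""
    N = len(docs)
    tokenized = [doc.split() for doc in docs]

    # Df(w): in how many documents does w appear?
    df: Counter = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # Tf(w) within this document
        weights.append({w: tf[w] * math.log(N / df[w]) for w in tf})
    return weights

weights = tf_idf(["the cat sat", "the dog sat", "the cat ran"])
# "the" occurs in every document (Df = N), so log(N/Df) = 0:
print(weights[0]["the"])  # 0.0
```

Terms that appear in every document are weighted down to zero, while rarer, more discriminative terms keep higher weights.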
Bag of Words Model
Example
Vector Space Representation
Visualizing Document Content
Single document visualization
Word Clouds
WordTree
TextArc
Arc Diagrams