Text and Document Visualization
Text data?
• Huge resources of information, from libraries to e-mail archives
• Documents
• Articles, books and novels
• Computer programs
• E-mails, web pages, blogs
• Tags, comments
Text data
Collection of documents
• Messages (e-mail, blogs, tags, comments)
• Social networks (personal profiles)
• Academic collaborations (publications)
Text as Data
• Words have meanings and relations
– Correlations: Hong Kong, San Francisco, Bay Area
– Order: April, February, January, June, March, May
– Membership: Tennis, Running, Swimming, Hiking, Piano
– Hierarchy, antonyms & synonyms, entities
• Is text data nominal or ordinal?
Text Processing Pipeline
Tokenization: segment text into terms
• Special cases? e.g., “San Francisco”, “L’ensemble”, “U.S.A.”
• Remove stop words? e.g., “a”, “an”, “the”, “to”, “be”?
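The two steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production tokenizer: the regex deliberately ignores the special cases noted above (it splits "San Francisco" into two terms), and the stop-word list here is a small hand-picked sample.

```python
import re
from typing import List

# A small sample stop-word list; real systems use larger curated lists.
STOP_WORDS = {"a", "an", "the", "to", "be", "of", "and", "in"}

def tokenize(text: str) -> List[str]:
    """Lowercase the text and segment it into word terms.

    Naive: multi-word terms like "San Francisco" are split apart,
    which is exactly the kind of special case a real tokenizer
    must handle explicitly.
    """
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens: List[str]) -> List[str]:
    """Drop common function words that carry little content."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("The quick brown fox jumps to the lazy dog")
print(remove_stop_words(tokens))  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```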
Stemming: one means of normalizing terms
• Reduce terms to their “root”; Porter’s algorithm for English
• e.g., automate(s), automatic, automation all map to automat
• For visualization, want to reverse stemming for labels
• Simple solution: map from stem to the most frequent word
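The "map from stem to the most frequent word" trick can be sketched as follows. The `toy_stem` function here is a crude suffix stripper standing in for Porter's algorithm (which applies a full cascade of rewrite rules); the point of the sketch is the label-recovery step.

```python
from collections import Counter
from typing import Dict, List

def toy_stem(word: str) -> str:
    """Crude suffix stripping -- a toy stand-in for Porter's algorithm."""
    for suffix in ("ing", "ion", "es", "s", "e", "ic"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

def stem_labels(words: List[str]) -> Dict[str, str]:
    """Map each stem back to its most frequent surface form, so a
    visualization can show a readable label instead of the raw stem."""
    counts = Counter(words)
    best: Dict[str, str] = {}
    for word, n in counts.items():
        stem = toy_stem(word)
        if stem not in best or n > counts[best[stem]]:
            best[stem] = word
    return best

words = ["automation", "automation", "automatic", "automates", "automate"]
print(stem_labels(words))  # {'automat': 'automation'}
```

All four surface forms collapse to the stem "automat", and the most frequent form, "automation", is chosen as the display label.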
Stemming vs. Lemmatization
• Stemming heuristically chops suffixes; lemmatization uses vocabulary and morphology to return dictionary forms (lemmas), e.g., "better" → "good"
Stop words
Bag of Words Model
• A document ≈ a vector of term weights
– Each dimension corresponds to a term (often 10,000+)
– Each value represents the term's relevance to the document
– For example, simple term counts
• Aggregate the vectors into a document x term matrix
• This is the document vector space model
Document x Term matrix
• Each document is a vector of term weights
• Simplest weighting is to just count occurrences
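A minimal sketch of building such a matrix with raw occurrence counts, using a tiny made-up corpus and whitespace tokenization for brevity:

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Vocabulary: one column per distinct term, in sorted order.
vocab = sorted({term for doc in docs for term in doc.split()})

# Document x term matrix: one row per document, raw counts as weights.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[term] for term in vocab])

print(vocab)      # ['cat', 'chased', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(matrix[0])  # [1, 0, 0, 0, 1, 1, 1, 2]
```

Note how "the" gets the largest weight in every row, which motivates the Tf-Idf weighting introduced next.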
Computing Weights
• Let Tf(w) be the term frequency: the number of times word w occurs in the document
• Let Df(w) be the document frequency: the number of documents that contain the word
• Let N be the total number of documents
• We define TfIdf(w) = Tf(w) · log(N / Df(w))
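A minimal sketch of the standard weighting TfIdf(w) = Tf(w) · log(N / Df(w)), again using whitespace tokenization on a tiny made-up corpus:

```python
import math
from collections import Counter
from typing import Dict, List

def tf_idf(docs: List[str]) -> List[Dict[str, float]]:
    """Return one {term: TfIdf weight} dict per document, where
    TfIdf(w) = Tf(w) * log(N / Df(w))."""
    N = len(docs)
    tokenized = [doc.split() for doc in docs]

    # Df(w): in how many documents does w appear?
    df: Counter = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # Tf(w) within this document
        weights.append({w: tf[w] * math.log(N / df[w]) for w in tf})
    return weights

weights = tf_idf(["the cat sat", "the dog sat", "the cat ran"])
# "the" occurs in every document (Df = N), so log(N/Df) = 0:
print(weights[0]["the"])  # 0.0
```

Terms that appear in every document are weighted down to zero, while rarer, more discriminative terms keep higher weights.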
Bag of Words Model
Example
Vector Space Representation
Visualizing Document Content
Single document visualization
Word Clouds
WordTree
TextArc
Arc Diagrams