Natural Language Processing

CSE4022

Lecture-04: Text Processing - Module 2

Dr. Durgesh Kumar

Assistant Professor, SCOPE, VIT Vellore
Table of contents

1 Recap from previous Module

2 Text Processing Module outline

3 Character Encoding types

Dr. Durgesh Kumar | Lecture-02 | NLP | CSE4022 | August 4, 2022

Recap from previous Module- Intro to NLP

Module 1 Summary

What is NLP?
Stages of NLP
Disambiguation in NLP
Challenges of NLP
Models and algorithm categories in NLP
Real-world applications of NLP
Heroes of NLP and related online courses

Text Processing Introduction

Text Processing
The task of converting a raw text file, essentially a sequence of digital
bits, into a well-defined sequence of linguistically meaningful units.

1 Grapheme: the smallest unit of writing that corresponds to a sound
(more accurately, a phoneme).
2 Phoneme: a unit of sound that can distinguish one word from
another in a particular language.
3 Word: consists of one or more characters.
4 Sentence: consists of one or more words.

Text Processing Introduction (Contd.)

At the lowest level, characters represent the individual graphemes of a
language's writing system; words consist of characters, and sentences
consist of words.
Text pre-processing is an essential part of any NLP system.
The characters, words, and sentences identified at this stage are the
fundamental units passed to all further processing stages, such as
document analysis, named-entity recognition (NER), part-of-speech (POS)
tagging, sentiment analysis, and information retrieval.

Progress of NLP (Text Processing)

Text Processing - Challenges & Need

Natural languages contain inherent ambiguities, and writing systems
both amplify these ambiguities and generate additional ones.
The explosion in corpus size and variety has necessitated techniques
for automatically harvesting and preparing text corpora for various
NLP tasks.

Outline of Text-processing

Figure: Text Processing types

Text Processing

Text pre-processing: converting a sequence of digital bits into a
well-defined sequence of linguistically meaningful units.
Text pre-processing can be divided into two stages:
1 Document triage: the process of converting a set of digital files into
well-defined text documents.
2 Text segmentation: the process of converting a well-defined text corpus
into its component words and sentences.

Text Processing - Document Triage
In order for any natural language document to be machine readable, its
characters must be represented in a character encoding, in which one or
more bytes in a file map to a known character.
1 Character encoding identification determines the character encoding (or
encodings) for any file and optionally converts between encodings.
2 Language identification determines the natural language of a
document; this step is closely linked to, but not uniquely determined
by, the character encoding.
3 Text sectioning identifies the actual content within a file while
discarding undesirable elements, such as images, tables, headers,
links, and HTML formatting.
The output of the document triage stage is a well-defined text corpus,
organized by language, suitable for text segmentation and further
analysis.
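As a rough illustration of two of the triage steps above (decoding and text sectioning), the sketch below decodes raw bytes with a given encoding and strips HTML markup using Python's standard html.parser module. The names triage and _TextExtractor are hypothetical helpers introduced here; the encoding is assumed to be known already, and language identification is not attempted.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def triage(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes (assumed encoding) and keep only the visible text."""
    text = raw.decode(encoding, errors="replace")
    extractor = _TextExtractor()
    extractor.feed(text)
    extractor.close()
    # Collapse runs of whitespace left behind by the removed markup.
    return re.sub(r"\s+", " ", " ".join(extractor.parts)).strip()


html_bytes = b"<p>Caf\xc3\xa9 menu</p><script>var x = 1;</script>"
print(triage(html_bytes))
```

A production pipeline would also need to drop navigation, tables, and similar page furniture; this sketch only removes tags and script or style content.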
Text Processing - Text segmentation
1 Word segmentation breaks up the sequence of characters in a text
by locating the word boundaries. In computational linguistics, words
are often referred to as tokens, and word segmentation as
tokenization.
2 Text normalization is a related step that involves merging different
written forms of a token into a canonical normalized form; e.g.
converting the tokens “Mr.”, “Mr”, “mister”, and “Mister” to a single
normalized form.
3 Sentence segmentation is the process of determining the longer
processing units consisting of one or more words. This task involves
identifying sentence boundaries between words in different
sentences. It is also called sentence boundary detection, sentence
boundary disambiguation, or sentence boundary recognition.
=> Most written languages have punctuation marks that can signal
these boundaries.
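The three segmentation steps above can be sketched in a few lines of Python. This is a minimal illustration for whitespace-delimited English text only; the CANONICAL and ABBREVIATIONS tables and the function names are assumptions made for this example, not part of any library.

```python
import re

# Hypothetical lookup tables; a real system would use much larger lexicons.
CANONICAL = {"mr.": "Mr.", "mr": "Mr.", "mister": "Mr."}
ABBREVIATIONS = {"mr.", "dr.", "e.g.", "etc."}
SENTENCE_END = {".", "!", "?"}


def tokenize(text):
    """Split on whitespace, then peel trailing punctuation off each chunk,
    except for known abbreviations such as "Mr."."""
    tokens = []
    for chunk in text.split():
        if chunk.lower() in ABBREVIATIONS:
            tokens.append(chunk)
            continue
        match = re.fullmatch(r"(.*?)([.!?,;:]*)", chunk)
        word, trailing = match.group(1), match.group(2)
        if word:
            tokens.append(word)
        tokens.extend(trailing)  # one token per punctuation character
    return tokens


def normalize(tokens):
    """Map variant spellings of a token to one canonical form."""
    return [CANONICAL.get(tok.lower(), tok) for tok in tokens]


def split_sentences(tokens):
    """Close a sentence at each '.', '!' or '?' token."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


tokens = normalize(tokenize("mister Smith arrived. Did Mr Jones leave?"))
print(split_sentences(tokens))
```

Because "Mr." is kept as a single token, its period is not mistaken for a sentence boundary, which is exactly the kind of ambiguity that sentence boundary disambiguation has to resolve.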
Module 2 outline- from Syllabus

Character Encoding: in the linguistic analysis of a digital natural
language text, it is necessary to clearly define the characters, words,
and sentences in any document.
Word Segmentation
Sentence Segmentation
Intro to Corpora
Corpora Analysis

Challenges of Text Processing

Types of Writing systems

1 Logo-graphic: individual symbols represent words; large number
(often thousands) of symbols, e.g. Chinese.
2 Syllabic: individual symbols represent syllables, e.g. Japanese kana.
3 Alphabetic: individual symbols represent sounds, e.g. English.

Syllabic and alphabetic systems typically have fewer than 100 symbols.
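One rough way to see the three system types in code is to inspect Unicode character names with Python's standard unicodedata module. The function writing_system below is a hypothetical helper for illustration; the name prefixes it checks cover only a handful of scripts and are a deliberate simplification.

```python
import unicodedata


def writing_system(ch):
    """Rough guess at the writing-system type of one character,
    based only on the prefix of its Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "logographic"
    if name.startswith(("HIRAGANA", "KATAKANA")):
        return "syllabic"
    if name.startswith(("LATIN", "GREEK", "CYRILLIC")):
        return "alphabetic"
    return "other"


# One Chinese character, one hiragana, one katakana, one Latin letter.
for ch in "\u4e2d\u304b\u30ab" + "a":
    print(ch, writing_system(ch))
```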

Challenges of Text Processing

Syllables

A syllable:
is the unit of spoken language next larger than a single speech sound;
is a sequence of speech sounds (formed from vowels and consonants)
organized into a single unit;
consists of one or more vowel sounds alone, or of a syllabic consonant
alone, or of either with one or more consonant sounds preceding or
following;
acts as a building block of a spoken word, determining the pace
and rhythm of how the word is pronounced.

Challenges of Text Processing
Table: Syllable examples

Suffix         Word          Written syllables    Spoken syllables
“-ing” words   alarming      a·larm·ing           uh-lahr-ming
               asking        ask·ing
               baking        bak·ing
               eating        eat·ing
               learning      learn·ing
               thinking      think·ing
“-ize” words   atomize       at·om·ize
               customize     cus·tom·ize
               internalize   in·ter·nal·ize
               moisturize    mois·tur·ize

Challenges of Text Processing

There are different types of writing systems, e.g. logo-graphic,
syllabic, and alphabetic.
The majority of all written languages use an alphabetic or syllabic
system [Comrie et al. 1996].
In practice, modern writing systems employ symbols of more than one type.
English predominantly uses the Roman script (an alphabetic writing
system), but also utilizes logo-graphic symbols:
Arabic numerals (0-9)
Currency symbols: dollar ($), euro (€), rupee (₹), pound (£), yen (¥)
Other symbols (%, &, #)
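The mix of symbol types in ordinary English text can be made visible with Unicode general categories, available through Python's standard unicodedata.category function. The helper name symbol_classes is hypothetical; the two-letter codes are the standard Unicode categories (for example Ll = lowercase letter, Nd = decimal digit, Sc = currency symbol, Po = other punctuation).

```python
import unicodedata


def symbol_classes(text):
    """Pair each non-space character with its Unicode general category."""
    return [(ch, unicodedata.category(ch)) for ch in text if not ch.isspace()]


# Letters, digits, a currency sign, and other symbols in one short string.
for ch, category in symbol_classes("Pay $5 & 10%"):
    print(ch, category)
```

A tokenizer can use these categories to decide, for instance, that a currency symbol should stay attached to or be split from the number that follows it.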

Challenges of Text Processing

The challenges of text processing fall into four categories:

1 Character set dependence
(a) Character encoding types
(b) Character encoding identification & its implication on tokenization.

2 Language dependence
3 Corpus dependence
4 Application dependence

1(A). Character Encoding types

ASCII encoding (7-bit)

Extended ASCII encoding (8-bit)
ISCII encoding (8-bit)
16-bit encodings (for Chinese and Japanese)
Unicode 5.0
UTF-8: variable-length encoding

1(A). Character Encoding Types - ASCII (7 bit)

American Standard Code for Information Interchange.

Historically, nearly all digital text was encoded in 7-bit ASCII.
ASCII includes 128 characters (2^7).
It includes only the Roman (Latin) alphabet and the other characters
essential for writing English, e.g. [0-9], [A-Z], [a-z].
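The 7-bit limit is easy to check with Python's built-in ord function. The sketch below uses two hypothetical helper names (is_ascii and to_codes) to show which strings fit in the 128 ASCII code points.

```python
# 7 bits give 2**7 = 128 code points, numbered 0..127.
ASCII_SIZE = 2 ** 7


def to_codes(text):
    """Return the numeric code point of each character."""
    return [ord(ch) for ch in text]


def is_ascii(text):
    """True if every character fits in the 7-bit ASCII range."""
    return all(ord(ch) < ASCII_SIZE for ch in text)


print(to_codes("Az09"))
print(is_ascii("NLP"), is_ascii("\u00fcber"))
```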

1(A). Character Encoding Types - ASCII (Contd.)

Figure: ASCII codes and their meanings.

Figure: ASCII codes and their meanings in rectangular format.

ASCII codes can be broadly classified into two categories:

1 ASCII control characters (character codes 0-31, plus 127 for Delete),
e.g. Null character (0), Start of Heading (1), Backspace (8),
Horizontal Tab (9), Line Feed (10), Vertical Tab (11),
Device Control 1 (17).
2 ASCII printable characters (character codes 32-126): represent
letters, digits, punctuation marks, and a few miscellaneous
symbols.
Digits [0-9]: 48-57
Capital letters [A-Z]: 65-90
Small letters [a-z]: 97-122
Punctuation: space (32), exclamation mark (33), double
quote (34), single quote (39), comma (44), hyphen (45), period, dot
or full stop (46), colon (58), semicolon (59), question mark (63)
Special symbols: # (35), $ (36), % (37), & (38), < (60), =
(61), > (62), @ (64)
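The ranges listed above translate directly into a small classifier. The function name ascii_class and its category labels are made up for this example; the numeric ranges are those of the ASCII table.

```python
def ascii_class(code):
    """Classify a 7-bit ASCII code (0..127) into a coarse category."""
    if not 0 <= code <= 127:
        raise ValueError("not a 7-bit ASCII code")
    if code < 32 or code == 127:  # 127 is Delete, also a control character
        return "control"
    ch = chr(code)
    if ch == " ":
        return "space"
    if ch.isdigit():
        return "digit"
    if ch.isupper():
        return "capital letter"
    if ch.islower():
        return "small letter"
    return "punctuation or symbol"


for code in (10, 32, 48, 65, 97, 64, 127):
    print(code, ascii_class(code))
```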
1(A). Character Encoding Types - ASCII Limitations

Limitations of 7-bit ASCII code

Characters not present in the ASCII table required “asciification”
or “romanization”, e.g. the German word über would be written as
u"ber or ueber, and the French word déjà as de'ja' or de1ja2.
Languages that do not use the Roman alphabet (e.g. Russian, Arabic,
Chinese) required much more elaborate asciification, often a phonetic
mapping of the source characters to Roman characters.
The Pinyin transliteration of Chinese writing is one such example.
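Two simple asciification strategies can be sketched with Python's standard unicodedata module: dropping accents after Unicode decomposition, and applying a hand-written digraph table. The GERMAN_DIGRAPHS table and both function names are assumptions made for this illustration, and neither strategy handles non-Roman scripts.

```python
import unicodedata

# Hypothetical table for the German convention (u-umlaut -> "ue", etc.).
GERMAN_DIGRAPHS = {
    "\u00e4": "ae",  # a-umlaut
    "\u00f6": "oe",  # o-umlaut
    "\u00fc": "ue",  # u-umlaut
    "\u00df": "ss",  # sharp s
}


def strip_accents(text):
    """Decompose accented letters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def asciify_german(text):
    """Replace German special letters with their usual ASCII digraphs."""
    return "".join(GERMAN_DIGRAPHS.get(ch, ch) for ch in text)


print(strip_accents("d\u00e9j\u00e0"))
print(asciify_german("\u00fcber"))
```

Both mappings lose information: once accents are stripped, the original spelling cannot be recovered from the ASCII form alone.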

1(A). Character Encoding Types - Extended ASCII

8 bits: can represent 256 characters (2^8).

The first 128 characters (0-127) are reserved for the original ASCII
characters.
Eight-bit encodings exist for all common alphabetic and some syllabic
writing systems; e.g. ISO-8859.
Drawback: the codes 128-255 are assigned differently in each encoding,
so the character sets of different languages overlap.
ISO-8859

It contains encoding definitions for most European characters.

HTML syntax: <meta charset="ISO-8859-1">
ISO-8859-1 (Western Europe), ISO-8859-2 (Central Europe),
ISO-8859-3 (Southern Europe), ISO-8859-4 (Baltic), ISO-8859-5
(Cyrillic),
ISO-8859-6 (Arabic), ISO-8859-7 (Greek), ISO-8859-8 (Hebrew),
ISO-8859-9 (Turkish), ISO-8859-15 (Latin 9)
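The overlap between 8-bit character sets can be observed by decoding the same byte under several ISO-8859 parts using Python's built-in codecs. The helper name decode_in and the choice of byte 0xE4 are arbitrary choices for this sketch.

```python
ENCODINGS = ["iso-8859-1", "iso-8859-5", "iso-8859-7"]


def decode_in(byte_value, encodings):
    """Decode one byte value under each of the given encodings."""
    raw = bytes([byte_value])
    return {enc: raw.decode(enc) for enc in encodings}


# A byte below 128 means the same thing everywhere (plain ASCII) ...
print(decode_in(0x41, ENCODINGS))
# ... but a byte in the upper half depends on the encoding in use.
print(decode_in(0xE4, ENCODINGS))
```

The second call yields a different letter (Latin, Cyrillic, Greek) for each encoding, which is why the encoding must be known before tokenization.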
1(A). Character Encoding Types - ISCII encoding (8-bit)

Indian Script Code for Information Interchange.

ISCII is an encoding scheme that represents various languages that
are written and spoken in India.
ISCII was introduced in the year 1991 by the Bureau of Indian
Standards (BIS).
ISCII includes 256 characters (2^8).
The first 128 characters (0-127) are the same as in ASCII.
The remaining characters (128-255) represent the characters
of the Indian scripts.
Most Indian scripts derive from the ancient Brahmi script and closely
resemble one another because they share a similar phonetic structure.
1(A). Character Encoding Types - ISCII encoding (8-bit)

Advantages

The majority of languages spoken in India are represented in it.
The character set is simple and easy to understand.
Easy transliteration between languages is possible.
Supported scripts and languages: Devanagari, Punjabi, Gujarati, Oriya,
Bengali, Assamese, Telugu, Kannada, Malayalam, Tamil

Disadvantages

A special keyboard with ISCII character keys is needed.
Unicode, which was introduced later, includes the characters of ISCII,
so ISCII became obsolete.

1(A). Character Encoding Types - A two byte character set - for Japanese and Chinese

Chinese and Japanese, which have several thousand distinct
characters, require multiple bytes to encode a single character.
A two-byte (16-bit) character set can represent 65,536 (2^16)
distinct characters.
Single-byte letters, spaces, punctuation marks (e.g., periods,
quotation marks, and parentheses), and Arabic numerals (0–9) are
commonly interspersed with 2-byte Chinese and Japanese characters.
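The mixture of 1-byte and 2-byte characters can be inspected with Python's built-in gb2312 and shift_jis codecs. The helper name byte_lengths is hypothetical, and these two codecs are used only as examples of legacy double-byte encodings.

```python
def byte_lengths(text, encoding):
    """Return (character, encoded length in bytes) pairs."""
    return [(ch, len(ch.encode(encoding))) for ch in text]


# Two bytes can distinguish 2**16 = 65,536 values.
print(2 ** 16)

# ASCII letters and digits take 1 byte; Chinese or Japanese characters take 2.
print(byte_lengths("a1\u4e2d\u6587", "gb2312"))
print(byte_lengths("a1\u65e5\u672c", "shift_jis"))
```

Since character widths vary within one byte stream, a tokenizer has to decode the bytes into characters before looking for word boundaries.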

1(A). Character Encoding Types - A two byte character set - Challenges

Challenges

Code switching: characters from many different writing systems
occur within the same text, e.g. “I am going to School”, where all
words except “School” are written in an Indian language such as Hindi,
Tamil, Telugu, or Bengali.
Multiple encodings: several encodings also exist for Chinese, e.g.
Big-5 for the complex-form (traditional) character set and GB for
the simple-form (simplified) set.
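The multiple-encoding problem can be reproduced with Python's built-in big5 and gb2312 codecs: the same two characters produce different byte sequences, and decoding with the wrong codec does not give the original text back. The helper names below are hypothetical.

```python
def encode_both(text):
    """Return the Big5 and GB2312 byte sequences for the same text."""
    return text.encode("big5"), text.encode("gb2312")


def decode_lossy(raw, encoding):
    """Decode with the given codec, replacing undecodable bytes with U+FFFD."""
    return raw.decode(encoding, errors="replace")


text = "\u4e2d\u6587"  # two characters that exist in both character sets
big5_bytes, gb_bytes = encode_both(text)
print(big5_bytes.hex(), gb_bytes.hex())

# Decoding Big5 bytes as GB2312 does not reproduce the original text.
print(decode_lossy(big5_bytes, "gb2312") == text)
```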

1(A). Character encoding - Unicode Encoding - UTF-8

The default character encoding for HTML5 is UTF-8.

To specify the encoding in HTML: <meta charset="UTF-8">

Figure: Character encoding in HTML

1(A). Character Encoding Types - Unicode encodings

The Unicode Standard seeks to eliminate this character set ambiguity by
specifying a Universal Character Set that includes over 100,000
distinct coded characters derived from over 75 supported scripts
representing all the writing systems commonly used today.
It is most commonly implemented in the UTF-8 variable-length
character encoding, in which each character is represented by 1 to
4 bytes.
UTF-8 allows all supported characters to be encoded with no
overlap or confusion between conflicting byte ranges.
It is rapidly replacing older character encodings for multilingual
applications.

1(A). Character encoding - Unicode Encoding - UTF-8

In the UTF-8 encoding:

ASCII characters require 1 byte.
Most other characters included in the ISO-8859 encodings, and other
alphabetic characters, require 2 bytes.
All other characters, including Chinese, Japanese, and Korean,
require 3 bytes (and very rarely 4 bytes).
import pandas as pd
# Read a CSV file, decoding its bytes as UTF-8.
df = pd.read_csv('fname.csv', encoding='utf-8')
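The byte counts stated for UTF-8 can be verified directly with Python's str.encode. The helper name utf8_length and the sample characters are chosen only for this illustration.

```python
def utf8_length(ch):
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


samples = {
    "ASCII letter": "A",
    "Latin letter with accent": "\u00e9",
    "Devanagari letter": "\u0915",
    "Chinese character": "\u4e2d",
    "Emoji": "\U0001F600",
}

for label, ch in samples.items():
    print(label, utf8_length(ch))
```

Because the length varies per character, the number of bytes in a UTF-8 file is generally not equal to the number of characters it contains.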

1 (B). Character Encoding Identification and Its Impact on Tokenization

The header of a digital document may contain information regarding
its character encoding.
This information is not always present, or even reliable, in which
case the encoding must be determined automatically.
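For HTML documents, the header information mentioned above is usually a meta tag. The sketch below looks for a declared charset in the first bytes of a file using a regular expression; the name declared_charset, the 1024-byte window, and the pattern itself are assumptions for this example and do not replace a full HTML parser.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?\s*([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def declared_charset(raw):
    """Return the charset named in an HTML meta tag, or None if absent."""
    match = META_CHARSET.search(raw[:1024])
    return match.group(1).decode("ascii").lower() if match else None


print(declared_charset(b'<html><head><meta charset="UTF-8"></head>'))
print(declared_charset(b"<html><body>no declaration</body></html>"))
```

A None result, or a declared encoding that fails to decode the rest of the file, is the case where automatic detection becomes necessary.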

1 (B). Character Encoding Identification and Its Impact on Tokenization

Challenges of Character Encoding Identification

Despite the popularity of Unicode (UTF-8), many document sources
still use other encoding schemes.
The same range of numeric values can represent different characters in
different encodings, so a tokenizer must know which encoding is in use.
For example, English and Spanish are both normally stored in the
common 8-bit encoding Latin-1 (ISO-8859-1).

Python libraries for character encoding detection

chardet
cchardet
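When no detection library is available, a crude fallback is to try a short list of candidate encodings in order and keep the first that decodes without error. This is only a sketch under that assumption: the function name guess_encoding and the candidate list are made up here, and it is not how statistical detectors such as chardet work.

```python
def guess_encoding(raw, candidates=("ascii", "utf-8", "iso-8859-1")):
    """Return the first candidate encoding that decodes raw without error.

    ISO-8859-1 accepts every byte sequence, so it must come last;
    a successful decode does not prove the guess is correct.
    """
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


word = "ma\u00f1ana"  # a Spanish word with one non-ASCII letter
print(guess_encoding(b"hello"))
print(guess_encoding(word.encode("utf-8")))
print(guess_encoding(word.encode("iso-8859-1")))
```

Trial decoding cannot tell apart two 8-bit encodings that both accept the bytes, which is where frequency-based detectors become useful.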
