Lecture 2: Tokenization

Tokenization
Tokenization in natural language processing (NLP) is the process of dividing a
sentence or phrase into smaller units known as tokens.
These tokens may be words, dates, punctuation marks, or even fragments of
words.
Tokenization is a critical step in many NLP tasks, including text processing, language
modelling, and machine translation.
Types of Tokenization

Tokenization can be classified into several types based on how the text is segmented.
Here are some types of tokenization:

I. Word Tokenization:
• Word tokenization divides the text into individual words. Many NLP tasks use this
approach, in which words are treated as the basic units of meaning.

• Example:
Input: "Tokenization is an important NLP task."
Output: ["Tokenization", "is", "an", "important", "NLP", "task", "."]
Types of Tokenization
II. Sentence Tokenization:
Sentence tokenization segments the text into sentences. This is useful for tasks
that require analysing or processing individual sentences.

Example:
Input: "Tokenization is an important NLP task. It helps break down text into smaller units."
Output: ["Tokenization is an important NLP task.", "It helps break down text into smaller
units."]
Types of Tokenization
III. Subword Tokenization:
Subword tokenization breaks words down into smaller units, which can be
especially useful when dealing with morphologically rich languages or rare words.

Example:
Input: "tokenization"
Output: ["token", "ization"]
Types of Tokenization
IV. Character Tokenization:

This process divides the text into individual characters, which can be useful for
character-level language modelling.

Example:

Input: "Tokenization"
Output: ["T", "o", "k", "e", "n", "i", "z", "a", "t", "i", "o", "n"]
Need of Tokenization

Effective Text Processing: Tokenization breaks raw text into smaller, manageable units
so that it can be processed and analysed more easily.
Feature Extraction: Tokens can serve as features in machine learning models, allowing
text data to be represented numerically so that algorithms can work with it.
Language Modelling: Tokenization in NLP facilitates the creation of organized
representations of language, which is useful for tasks like text generation and language
modelling.
Information Retrieval: Tokenization is essential for indexing and searching in systems
that store and retrieve information efficiently based on words or phrases.
Text Analysis: Tokenization is used in many NLP tasks, including sentiment analysis and
named entity recognition, to determine the function and context of individual words in a
sentence.
Implementation for Tokenization
The code below uses the sent_tokenize function from the NLTK library, which segments
a given text into a list of sentences.

from nltk.tokenize import sent_tokenize
# The Punkt sentence models must be available; if needed, run: import nltk; nltk.download('punkt')

text = "Hello everyone. Welcome to the Class. You are studying NLP."
sent_tokenize(text)

Output:

['Hello everyone.',
'Welcome to the Class.',
'You are studying NLP.']
Implementation for Tokenization
When working with large amounts of text, it is more efficient to load the PunktSentenceTokenizer from the NLTK data files directly.

import nltk.data
# Loading PunktSentenceTokenizer using the English pickle file
tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')
tokenizer.tokenize(text)  # 'text' is the same string as in the previous example

Sentences in another language can be tokenized by loading the corresponding language model:

import nltk.data
spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')
text = 'Hola amigo. Estoy bien.'
spanish_tokenizer.tokenize(text)
Output:
['Hola amigo.',
'Estoy bien.']
Implementation for Tokenization
Word Tokenization using word_tokenize
• The word_tokenize function is helpful for breaking down a sentence or text into its
constituent words, facilitating further analysis or processing at the word level in natural
language processing tasks.

from nltk.tokenize import word_tokenize


text = "Hello everyone. Welcome to Class."
word_tokenize(text)

Output:
['Hello', 'everyone', '.', 'Welcome', 'to', 'Class', '.']
Implementation for Tokenization
Word Tokenization Using TreebankWordTokenizer
• This tokenizer separates words using punctuation and spaces.
• As in the outputs above, it does not discard the punctuation, allowing the user to
decide what to do with it during pre-processing.

from nltk.tokenize import TreebankWordTokenizer

text = "Hello everyone. Welcome to Class."
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(text)

Output:
['Hello', 'everyone.', 'Welcome', 'to', 'Class', '.']
(The period after 'everyone' stays attached because the Treebank tokenizer assumes its input
has already been split into sentences, so only the final period is separated.)
Implementation for Tokenization
Word Tokenization using WordPunctTokenizer
• The WordPunctTokenizer is one of the NLTK tokenizers that splits words based on
punctuation boundaries.
• Each punctuation mark is treated as a separate token.

from nltk.tokenize import WordPunctTokenizer


tokenizer = WordPunctTokenizer()
tokenizer.tokenize("Let's see how it's working.")

Output:

['Let', "'", 's', 'see', 'how', 'it', "'", 's', 'working', '.']
Implementation for Tokenization
Using regular expressions allows for more fine-grained control over tokenization, and
you can customize the pattern based on your specific requirements.

from nltk.tokenize import RegexpTokenizer


tokenizer = RegexpTokenizer(r'\w+')
text = "Let's see how it's working."
tokenizer.tokenize(text)

Output:
['Let', 's', 'see', 'how', 'it', 's', 'working']
Implementation for Tokenization
• RegexpTokenizer: This class is used to tokenize text based on a regular expression
pattern.
• r'\w+': This is a regular expression pattern passed to RegexpTokenizer.
• \w: Matches any word character (alphanumeric character plus underscore, i.e., [a-zA-Z0-9_]).
• +: Matches one or more occurrences of the preceding element (in this case, word
characters).
• So, r'\w+' matches sequences of word characters (essentially words) separated by
non-word characters.
• Example:
• If you apply RegexpTokenizer(r'\w+') to the text "Hello, world! It's a beautiful
day.", it would tokenize the text into:
• ['Hello', 'world', 'It', 's', 'a', 'beautiful', 'day']
• In this example, punctuation and spaces are removed, and the text is split into
word tokens.
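As a small illustration of customizing the pattern (this particular expression is our own example, not from the slides), a pattern that allows an apostrophe-suffix keeps contractions together as single tokens:

from nltk.tokenize import RegexpTokenizer

# Optionally match an apostrophe followed by word characters (e.g. 's) after a word
tokenizer = RegexpTokenizer(r"\w+(?:'\w+)?")
tokenizer.tokenize("Let's see how it's working.")

Expected output: ["Let's", 'see', 'how', "it's", 'working']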
More Techniques for Tokenization
We can also implement tokenization using the following methods and libraries:
spaCy
BERT tokenizer
Byte-Pair Encoding
SentencePiece
A minimal spaCy example is sketched below.
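The following sketch assumes the spacy package and its small English model en_core_web_sm are installed (e.g. via: python -m spacy download en_core_web_sm); the exact token list may vary slightly between model versions.

import spacy

# Load spaCy's small English pipeline; tokenization runs when the text is processed
nlp = spacy.load("en_core_web_sm")
doc = nlp("Let's see how it's working.")
[token.text for token in doc]

Expected output (approximately): ['Let', "'s", 'see', 'how', 'it', "'s", 'working', '.']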
Limitations of Tokenization

Tokenization alone cannot capture the meaning of a sentence, which can lead to ambiguity.

Some languages, such as Chinese and Japanese, do not place spaces between words, and
others, such as Arabic, have complex word structure; the absence of clear word boundaries
complicates tokenization.

Text may also contain multi-part elements such as email addresses, URLs, and special
symbols, and it is difficult to decide how such elements should be tokenized.
