NLP Preprocessing Steps
1. Lowercasing
2. Tokenization
3. Removing Punctuation
4. Removing Stopwords
5. Stemming
6. Lemmatization
7. Removing Numbers
8. Removing Extra Spaces
9. Handling Contractions
10. Removing Special Characters
11. Part-of-Speech (POS) Tagging
12. Named Entity Recognition (NER)
13. Vectorization
14. Handling Missing Data
15. Normalization
16. Spelling Correction
17. Handling Emojis and Emoticons
18. Removing HTML Tags
19. Handling URLs
20. Handling Mentions and Hashtags
21. Sentence Segmentation
22. Handling Abbreviations
23. Language Detection
24. Text Encoding
25. Handling Whitespace Tokens
26. Handling Dates and Times
27. Text Augmentation
28. Handling Negations
29. Dependency Parsing
30. Handling Rare Words
31. Text Chunking
32. Handling Synonyms
33. Text Normalization for Social Media
1. Lowercasing
Purpose: Converts all text to lowercase to ensure uniformity.
Why: Reduces the vocabulary size and avoids treating the same word in different cases as different tokens (e.g., "Apple" vs. "apple").
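A minimal sketch using Python's built-in str.lower():
text = "Apple and apple should be the SAME token."
print(text.lower())  # apple and apple should be the same token.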
2. Tokenization
Purpose: Splits text into individual words, phrases, or sentences (tokens).
Why: Breaks down text into manageable units for further processing.
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize
text = "Natural Language Processing is fun!"
tokens = word_tokenize(text)
print(tokens)
3. Removing Punctuation
Purpose: Removes punctuation marks from the text.
Why: Punctuation usually adds noise for tasks like text classification.
import string
text = "Hello, world! This is a test."
text = text.translate(str.maketrans('', '', string.punctuation))
print(text)  # Hello world This is a test
4. Removing Stopwords
Purpose: Removes common words like "the," "is," and "and," which don't carry significant meaning.
Why: Reduces noise and focuses on meaningful words.
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
tokens = ["this", "is", "a", "sample", "sentence"]
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)  # ['sample', 'sentence']
5. Stemming
Purpose: Reduces words to their root form by chopping off suffixes (e.g., "running" → "run").
Why: Simplifies words to their base form, reducing vocabulary size.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ["running", "runner", "ran"]
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)
6. Lemmatization
Purpose: Converts words to their base or dictionary form (e.g., "better" → "good").
Why: More accurate than stemming as it uses vocabulary and morphological analysis.
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words = ["running", "runner", "ran"]
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]
print(lemmatized_words)
print(lemmatizer.lemmatize("better", pos='a'))  # adjective lookup: "good"
7. Removing Numbers
Purpose: Removes numeric values from the text.
Why: Numbers may not be relevant in certain NLP tasks like sentiment analysis.
import re
text = "There are 3 apples and 5 oranges."
text = re.sub(r'\d+', '', text)
print(text)  # "There are  apples and  oranges." (leftover double spaces are handled in step 8)
9. Handling Contractions
Purpose: Expands contractions (e.g., "can't" → "cannot").
Why: Standardizes text for better processing.
import contractions
text = "I'll be there ASAP."
expanded_text = contractions.fix(text)
print(expanded_text)
11. Part-of-Speech (POS) Tagging
Purpose: Labels each token with its grammatical role (noun, verb, adjective, etc.).
Why: POS tags provide grammatical context for later steps such as lemmatization and parsing.
import nltk
nltk.download('averaged_perceptron_tagger_eng')  # NLTK >= 3.9 resource name
from nltk import pos_tag
from nltk.tokenize import word_tokenize
tokens = word_tokenize("The quick brown fox jumps over the lazy dog.")
print(pos_tag(tokens))
12. Named Entity Recognition (NER)
Purpose: Identifies named entities such as people, organizations, and locations.
Why: Entities often carry the most task-relevant information in a text.
import nltk
nltk.download('maxent_ne_chunker_tab')  # NLTK >= 3.9 resource name
nltk.download('words')
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize
tokens = word_tokenize("Barack Obama was born in Hawaii.")
print(ne_chunk(pos_tag(tokens)))
13. Vectorization
Purpose: Converts text into numerical vectors (e.g., Bag of Words, TF-IDF, Word Embeddings).
Why: Machine learning models require numerical input.
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
corpus = ["the cat sat", "the dog ran"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)
print(pd.DataFrame(counts.toarray(), columns=vectorizer.get_feature_names_out()))
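The Purpose above also mentions TF-IDF; a minimal sketch with scikit-learn's TfidfVectorizer on the same toy corpus:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(["the cat sat", "the dog ran"])
# Rows are documents, columns are terms; "the" appears in both documents, so it gets a lower weight
print(weights.toarray().round(2))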
15. Normalization
Purpose: Standardizes text representations (e.g., Unicode normalization of accented characters).
Why: Ensures consistency in the dataset.
import unicodedata
text = "Café"
text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
print(text)  # "Cafe"
17. Handling Emojis and Emoticons
Purpose: Removes emojis and emoticons, or converts them to text.
Why: Emojis carry sentiment signal in some tasks and are noise in others.
import emoji
text = "I love NLP 😍"
# Remove emojis
text = emoji.replace_emoji(text, replace="")
print(text)
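To keep the signal rather than drop it, the emoji package also offers demojize, which converts each emoji into a text alias:
import emoji
print(emoji.demojize("I love NLP 😍"))  # the emoji becomes a text alias such as :smiling_face_with_heart-eyes: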
text = "Café"
text = text.encode('utf-8').decode('utf-8')
print(text)
tokens = ["This", " ", "is", " ", "a", " ", "sample", " "]
tokens = [token for token in tokens if token.strip()]
print(tokens)
26. Handling Dates and Times
Purpose: Standardizes or extracts date and time formats.
Why: Useful for time-sensitive analysis.
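A possible sketch, assuming the third-party python-dateutil package for parsing; the regex only matches simple numeric dates like 12/05/2023:
import re
from dateutil import parser  # third-party: python-dateutil

text = "The meeting is on 12/05/2023."
for match in re.findall(r'\d{1,2}/\d{1,2}/\d{4}', text):
    iso = parser.parse(match).strftime('%Y-%m-%d')  # standardize to ISO format
    text = text.replace(match, iso)
print(text)  # The meeting is on 2023-12-05.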
27. Text Augmentation
Purpose: Generates variations of existing text to expand the training data.
Why: Improves model robustness when labeled data is scarce.
from nlpaug.augmenter.word import SynonymAug  # third-party: nlpaug
aug = SynonymAug(aug_src='wordnet')
text = "This is a sample text."
augmented_text = aug.augment(text)
print(augmented_text)
29. Dependency Parsing
Purpose: Analyzes grammatical structure by linking each word to its syntactic head.
Why: Reveals relationships between words (subject, object, modifiers) for deeper analysis.
import spacy
nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm
doc = nlp("The cat sat on the mat.")
for token in doc:
    print(token.text, token.dep_, token.head.text)
word = "happy"
synonyms = wordnet.synsets(word)
print([syn.lemmas()[0].name() for syn in synonyms])
Task-Specific Importance
Sentiment Analysis: Handling negations, emojis, and emoticons is crucial.
Machine Translation: Sentence segmentation and POS tagging are important.
Named Entity Recognition (NER): Handling dates, times, and special characters is critical.
Social Media Analysis: Handling emojis, hashtags, and informal language is essential.
Text Classification: Removing stopwords, lowercasing, and vectorization are key.
Summary
Preprocessing turns raw text into a clean, consistent form that models can use. Not every step applies to every project: start with the basics (lowercasing, tokenization, stopword removal, vectorization) and add task-specific steps based on the guidance above.