NLP Preprocessing Steps
1. Lowercasing
2. Tokenization
3. Removing Punctuation
4. Removing Stopwords
5. Stemming
6. Lemmatization
7. Removing Numbers
8. Removing Extra Spaces
9. Handling Contractions
10. Removing Special Characters
11. Part-of-Speech (POS) Tagging
12. Named Entity Recognition (NER)
13. Vectorization
14. Handling Missing Data
15. Normalization
16. Spelling Correction
17. Handling Emojis and Emoticons
18. Removing HTML Tags
19. Handling URLs
20. Handling Mentions and Hashtags
21. Sentence Segmentation
22. Handling Abbreviations
23. Language Detection
24. Text Encoding
25. Handling Whitespace Tokens
26. Handling Dates and Times
27. Text Augmentation
28. Handling Negations
29. Dependency Parsing
30. Handling Rare Words
31. Text Chunking
32. Handling Synonyms
33. Text Normalization for Social Media
1. Lowercasing
Purpose: Converts all text to lowercase to ensure uniformity.
Why: Reduces the vocabulary size and avoids treating the same word in different cases as different tokens (e.g., "Apple" vs. "apple").
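A minimal sketch using Python's built-in str.lower():
text = "Apple and apple should be the SAME token."
print(text.lower())  # apple and apple should be the same token.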
2. Tokenization
Purpose: Splits text into individual words, phrases, or sentences (tokens).
Why: Breaks down text into manageable units for further processing.
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize
text = "Natural Language Processing is fun!"
tokens = word_tokenize(text)
print(tokens)
3. Removing Punctuation
Purpose: Removes punctuation marks from the text.
Why: Punctuation usually adds noise for tasks like text classification.
import string
text = "Hello, world! This is a test."
text = text.translate(str.maketrans('', '', string.punctuation))
print(text)  # Hello world This is a test
4. Removing Stopwords
Purpose: Removes common words like "the," "is," and "and," which don't carry significant meaning.
Why: Reduces noise and focuses on meaningful words.
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
tokens = ["this", "is", "a", "sample", "sentence"]
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)  # ['sample', 'sentence']
5. Stemming
Purpose: Reduces words to their root form by chopping off suffixes (e.g., "running" → "run").
Why: Simplifies words to their base form, reducing vocabulary size.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ["running", "runner", "ran"]
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)
6. Lemmatization
Purpose: Converts words to their base or dictionary form (e.g., "better" → "good").
Why: More accurate than stemming as it uses vocabulary and morphological analysis.
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words = ["running", "runner", "ran"]
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]
print(lemmatized_words)
print(lemmatizer.lemmatize("better", pos='a'))  # adjective lookup: "good"
7. Removing Numbers
Purpose: Removes numeric values from the text.
Why: Numbers may not be relevant in certain NLP tasks like sentiment analysis.
import re
text = "There are 3 apples and 5 oranges."
text = re.sub(r'\d+', '', text)
print(text)  # "There are  apples and  oranges." (leftover double spaces are handled in step 8)
9. Handling Contractions
Purpose: Expands contractions (e.g., "can't" → "cannot").
Why: Standardizes text for better processing.
import contractions
text = "I'll be there ASAP."
expanded_text = contractions.fix(text)
print(expanded_text)
11. Part-of-Speech (POS) Tagging
Purpose: Labels each token with its grammatical role (noun, verb, adjective, etc.).
Why: POS tags provide grammatical context for later steps such as lemmatization and parsing.
import nltk
nltk.download('averaged_perceptron_tagger_eng')  # NLTK >= 3.9 resource name
from nltk import pos_tag
from nltk.tokenize import word_tokenize
tokens = word_tokenize("The quick brown fox jumps over the lazy dog.")
print(pos_tag(tokens))
12. Named Entity Recognition (NER)
Purpose: Identifies named entities such as people, organizations, and locations.
Why: Entities often carry the most task-relevant information in a text.
import nltk
nltk.download('maxent_ne_chunker_tab')  # NLTK >= 3.9 resource name
nltk.download('words')
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize
tokens = word_tokenize("Barack Obama was born in Hawaii.")
print(ne_chunk(pos_tag(tokens)))
13. Vectorization
Purpose: Converts text into numerical vectors (e.g., Bag of Words, TF-IDF, Word Embeddings).
Why: Machine learning models require numerical input.
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
corpus = ["the cat sat", "the dog ran"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)
print(pd.DataFrame(counts.toarray(), columns=vectorizer.get_feature_names_out()))
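The Purpose above also mentions TF-IDF; a minimal sketch with scikit-learn's TfidfVectorizer on the same toy corpus:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(["the cat sat", "the dog ran"])
# Rows are documents, columns are terms; "the" appears in both documents, so it gets a lower weight
print(weights.toarray().round(2))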
15. Normalization
Purpose: Standardizes text representations (e.g., Unicode normalization of accented characters).
Why: Ensures consistency in the dataset.
import unicodedata
text = "Café"
text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
print(text)  # "Cafe"
17. Handling Emojis and Emoticons
Purpose: Removes emojis and emoticons, or converts them to text.
Why: Emojis carry sentiment signal in some tasks and are noise in others.
import emoji
text = "I love NLP 😍"
# Remove emojis
text = emoji.replace_emoji(text, replace="")
print(text)
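To keep the signal rather than drop it, the emoji package also offers demojize, which converts each emoji into a text alias:
import emoji
print(emoji.demojize("I love NLP 😍"))  # the emoji becomes a text alias such as :smiling_face_with_heart-eyes: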
text = "Café"
text = text.encode('utf-8').decode('utf-8')
print(text)
tokens = ["This", " ", "is", " ", "a", " ", "sample", " "]
tokens = [token for token in tokens if token.strip()]
print(tokens)
26. Handling Dates and Times
Purpose: Standardizes or extracts date and time formats.
Why: Useful for time-sensitive analysis.
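A possible sketch, assuming the third-party python-dateutil package for parsing; the regex only matches simple numeric dates like 12/05/2023:
import re
from dateutil import parser  # third-party: python-dateutil

text = "The meeting is on 12/05/2023."
for match in re.findall(r'\d{1,2}/\d{1,2}/\d{4}', text):
    iso = parser.parse(match).strftime('%Y-%m-%d')  # standardize to ISO format
    text = text.replace(match, iso)
print(text)  # The meeting is on 2023-12-05.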
27. Text Augmentation
Purpose: Generates variations of existing text to expand the training data.
Why: Improves model robustness when labeled data is scarce.
from nlpaug.augmenter.word import SynonymAug  # third-party: nlpaug
aug = SynonymAug(aug_src='wordnet')
text = "This is a sample text."
augmented_text = aug.augment(text)
print(augmented_text)
29. Dependency Parsing
Purpose: Analyzes grammatical structure by linking each word to its syntactic head.
Why: Reveals relationships between words (subject, object, modifiers) for deeper analysis.
import spacy
nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm
doc = nlp("The cat sat on the mat.")
for token in doc:
    print(token.text, token.dep_, token.head.text)
word = "happy"
synonyms = wordnet.synsets(word)
print([syn.lemmas()[0].name() for syn in synonyms])
Task-Specific Importance
Sentiment Analysis: Handling negations, emojis, and emoticons is crucial.
Machine Translation: Sentence segmentation and POS tagging are important.
Named Entity Recognition (NER): Handling dates, times, and special characters is critical.
Social Media Analysis: Handling emojis, hashtags, and informal language is essential.
Text Classification: Removing stopwords, lowercasing, and vectorization are key.
Summary
Preprocessing turns raw text into a clean, consistent form that models can use. Not every step applies to every project: start with the basics (lowercasing, tokenization, stopword removal, vectorization) and add task-specific steps based on the guidance above.