NLP lect 2
4. Feature Engineering: This step involves transforming the processed text into
numerical representations that can be understood by machine learning algorithms.
5. Modeling: This step involves training machine learning models using the extracted
features and data.
Text normalization techniques (like converting text to lowercase, expanding abbreviations, or handling
code-mixing) are used to deal with variations in how text is written across different platforms (e.g.,
social media).
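As a rough illustration, a minimal normalization pass might lowercase the text, collapse whitespace, and expand a small hand-made abbreviation map (the map below is a made-up example, not a standard resource):

```python
import re

# Illustrative abbreviation/slang map -- not a standard lexicon.
ABBREVIATIONS = {"u": "you", "gr8": "great", "idk": "i do not know"}

def normalize(text: str) -> str:
    text = text.lower()                       # case folding
    text = re.sub(r"\s+", " ", text).strip()  # collapse extra whitespace
    # expand simple abbreviations token by token
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

print(normalize("IDK   if this is GR8 or not"))
# -> "i do not know if this is great or not"
```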
The NLP pipeline is an iterative process, and depending on the nature of the data and task,
different steps may be prioritized or customized.
1 How can we get the data required for training an NLP model?
To gather data for training an NLP model, you can:
Use Public Datasets from platforms like Kaggle or Google Dataset Search.
Scrape Data from the Web using tools like BeautifulSoup or Scrapy.
Collect Internal Data from customer logs or internal systems.
Augment Data using techniques like back translation or synonym replacement.
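For example, a toy synonym-replacement augmenter might look like the sketch below (the synonym table and sentence are purely illustrative):

```python
import random

# Illustrative synonym table -- a real setup would use a lexicon such as WordNet.
SYNONYMS = {"quick": ["fast", "speedy"], "happy": ["glad", "pleased"]}

def augment(sentence: str, p: float = 0.9) -> str:
    out = []
    for tok in sentence.split():
        alts = SYNONYMS.get(tok.lower())
        if alts and random.random() < p:
            out.append(random.choice(alts))  # swap in a random synonym
        else:
            out.append(tok)                  # keep the original token
    return " ".join(out)

print(augment("the quick fox looked happy"))
# e.g. "the fast fox looked glad" (output varies with the random draw)
```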
2 Data can be collected from PDF files, HTML pages, and images; how can this data be
cleaned based on its source?
To clean data from different sources:
PDF Files: Use libraries like PyPDF2 or PDFMiner to extract text, then remove unwanted elements
(like headers/footers) and fix formatting issues.
HTML Pages: Use BeautifulSoup or Scrapy to parse and extract text, then remove HTML tags,
scripts, and non-relevant data.
Images: Use OCR tools like Tesseract to extract text, then clean any recognition errors and unwanted
characters.
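As one concrete case, cleaning an HTML page with BeautifulSoup might look roughly like this (the sample HTML string is made up for illustration):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<html><head><script>var x = 1;</script></head>
<body><h1>Title</h1><p>Useful text here.</p><footer>Site footer</footer></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Drop scripts, styles, and boilerplate elements before extracting text.
for tag in soup(["script", "style", "footer"]):
    tag.decompose()

text = soup.get_text(separator=" ", strip=True)
print(text)  # -> "Title Useful text here."
```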
3 Using a dot (.) to segment sentences can cause problems; explain how.
Using a dot (.) to segment sentences can cause problems because dots are also used in abbreviations
(e.g., "Dr.", "Mr.") and numbers (e.g., "3.14"). This can lead to incorrect segmentation.
To avoid this, more advanced methods like context-based rules or machine-learning-based
sentence tokenizers can be used.
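A quick sketch of the problem, contrasted with NLTK's pre-trained Punkt sentence tokenizer (note that the resource name to download may differ across NLTK versions):

```python
text = "Dr. Smith paid $3.14 for coffee. He thanked Mr. Jones."

# Naive rule: split on every period -- breaks on "Dr.", "Mr.", and "3.14".
naive = [s.strip() for s in text.split(".") if s.strip()]
print(naive)
# ['Dr', 'Smith paid $3', '14 for coffee', 'He thanked Mr', 'Jones']  <- wrong

# Punkt learns abbreviation/number patterns and segments by context.
import nltk
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)  # needed by newer NLTK versions
from nltk.tokenize import sent_tokenize

print(sent_tokenize(text))
# typically: ['Dr. Smith paid $3.14 for coffee.', 'He thanked Mr. Jones.']
```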
8 Explain the feature engineering for classical NLP versus DL-based NLP
Classical NLP: Manual Feature Extraction: Involves selecting and crafting specific features from the
raw text using domain knowledge.
Examples of Features:
Bag of Words (BoW): Counts occurrences of each word.
Part-of-Speech (POS) tags: Labels words based on their role (noun, verb, etc.).
N-grams: Sequences of N words or characters.
Purpose: These features are used as inputs for traditional machine learning models (e.g.,
SVM, Naive Bayes, or Logistic Regression).
DL-based NLP: Automatic Feature Learning: Deep learning models, especially neural networks,
automatically learn features from raw text data without manual intervention.
Example:
Word Embeddings: Word2Vec, GloVe, or FastText, which map words to dense vectors capturing
semantic relationships.
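A minimal sketch of classical feature extraction using scikit-learn's CountVectorizer for bag-of-words plus bigram counts (the tiny corpus is illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the movie was great", "the movie was terrible", "great acting"]

# Classical features: word counts plus word bigrams (ngram_range=(1, 2)).
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # unigram and bigram vocabulary (scikit-learn >= 1.0)
print(X.toarray())                         # one count vector per document
```

These count vectors would then be fed to a traditional classifier such as SVM or Naive Bayes, whereas a DL-based model would instead learn embeddings directly from the raw tokens.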
12 Which modeling technique can be used in the following cases: small data, large data,
poor data quality, and good data quality?
Small Data: Use Traditional ML models like Logistic Regression, SVM, or Decision Trees, which
perform well with limited data.
Large Data: Use Deep Learning models like Neural Networks, Convolutional Neural Networks
(CNNs), or Recurrent Neural Networks (RNNs), as they scale well with large datasets.
Poor Data Quality: Use Robust models like Random Forests or Gradient Boosting Machines
(GBM) that are less sensitive to noisy or incomplete data.
Good Data Quality: Use Simple models like Linear Regression, SVM, or Naive Bayes, which
perform well when the data is clean and well-structured.
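As a small-data illustration, a TF-IDF plus Logistic Regression pipeline in scikit-learn (the four labeled examples below are made up; with so little data a linear model is a more sensible default than a deep network):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative training set: 1 = positive, 0 = negative.
texts = ["great movie", "loved it", "terrible plot", "waste of time"]
labels = [1, 1, 0, 0]

# TF-IDF features into a simple linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["what a terrible movie"]))  # -> [0]
```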