NLP lect 2

The document outlines the essential steps in the Natural Language Processing (NLP) pipeline, including data acquisition, text cleaning, pre-processing, feature engineering, modeling, evaluation, deployment, and monitoring. It discusses various methods for data collection, cleaning techniques based on source types, and key concepts like tokenization, stop words, and part-of-speech tagging. Additionally, it highlights the differences between classical and deep learning approaches, evaluation types, and the importance of continuous model updates in response to changing data.


1. Data Acquisition:


Data is the foundation of any NLP system. Whether the project involves machine
learning or rule-based systems, the first step is always to collect relevant data.
This step involves gathering the necessary data for the system.
Data Acquisition Methods:
1 Use a Public Dataset: for example, one found through Google Dataset Search.
2 Scrape Data: scraping refers to extracting data from websites or online platforms.
3 Collect Internal Data (Product Intervention): scraped external data is often not enough
because it lacks context (e.g., missing product names or user-specific behavior). In such cases,
instrumenting the product itself to collect data becomes necessary.
4 Data Augmentation: taking a small dataset and applying simple tricks (e.g., synonym replacement,
back translation) to create more data; a minimal sketch follows below.
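Data augmentation can be as simple as swapping words for synonyms. Below is a minimal Python sketch of synonym-replacement augmentation; the synonym table and example sentence are made up for illustration, and a real system might instead use WordNet, back translation, or noise injection.

```python
import random

# Tiny hand-made synonym table (illustrative only; a real system might use
# WordNet or back translation instead).
SYNONYMS = {
    "small": ["tiny", "little"],
    "create": ["generate", "produce"],
    "data": ["samples", "examples"],
}

def augment(sentence, n_variants=2):
    """Create new sentences by randomly swapping words with synonyms."""
    words = sentence.split()
    variants = []
    for _ in range(n_variants):
        new_words = [
            random.choice(SYNONYMS[w]) if w in SYNONYMS else w
            for w in words
        ]
        variants.append(" ".join(new_words))
    return variants

print(augment("we create more data from a small dataset"))
```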

2. Text Cleaning:


Raw data often contains a lot of noise and irrelevant information.
Text cleaning refers to the process of removing these unwanted parts and ensuring that the
data is in a usable format.
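As a rough illustration, the sketch below cleans raw text with a few regular expressions; the specific rules (stripping leftover HTML tags, URLs, odd symbols, and extra whitespace) are examples, and the right cleaning steps always depend on the data source.

```python
import re

def clean_text(raw):
    """Remove common noise: HTML tags, URLs, odd symbols, extra whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)          # strip leftover HTML tags
    text = re.sub(r"http\S+", " ", text)         # drop URLs
    text = re.sub(r"[^\w\s.,!?]", " ", text)     # drop odd symbols, keep basic punctuation
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    return text

print(clean_text("<p>Check   https://example.com   NOW!!!</p>"))
```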
3. Pre-processing: Pre-processing is crucial because it prepares the data for further analysis.
Key steps include:
Sentence segmentation: Dividing the text into sentences.
Tokenization: Breaking sentences into individual words.
Stemming: Reducing words to their root forms (e.g., "running" to "run").
Lemmatization: Similar to stemming, but it returns the word's dictionary (base) form using
vocabulary and morphology (e.g., "better" to "good").
Stop word removal: Eliminating common, low-information words from the text to reduce noise
and improve the performance of downstream tasks. A short NLTK sketch of these steps follows below.
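The sketch below runs these pre-processing steps with NLTK, assuming the library is installed and the required resources (punkt, stopwords, wordnet; resource names can vary slightly between NLTK versions) can be downloaded.

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the required NLTK resources.
for pkg in ("punkt", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

text = "The runners were running faster. Better results came later."

sentences = sent_tokenize(text)                      # sentence segmentation
tokens = [w.lower() for w in word_tokenize(text)]    # tokenization + lowercasing

stop_words = set(stopwords.words("english"))
tokens = [w for w in tokens if w.isalpha() and w not in stop_words]  # stop word removal

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(w) for w in tokens])                    # stems, e.g. "running" -> "run"
print([lemmatizer.lemmatize(w, pos="v") for w in tokens])   # verb lemmas, e.g. "came" -> "come"
```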

4. Feature Engineering: This step involves transforming the processed text into
numerical representations that can be understood by machine learning algorithms.
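For example, one common way to turn processed text into numbers is a TF-IDF weighted bag of words. The scikit-learn sketch below uses a toy corpus and assumes a recent sklearn version (for get_feature_names_out).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the service was great",
    "the app keeps crashing",
    "great app, great service",
]

vectorizer = TfidfVectorizer()          # bag-of-words counts weighted by TF-IDF
X = vectorizer.fit_transform(docs)      # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())
print(X.shape)
```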

5. Modeling: This step involves training machine learning models using the data and
features extracted.

6. Evaluation: Evaluation is critical to assess how well the model performs.


Some common evaluation metrics include:
Accuracy, Precision, Recall, and F1-Score (used in classification tasks).
BLEU (for machine translation tasks), RMSE (for regression problems), and others.
Evaluation also includes intrinsic evaluation (measuring model performance directly on a task) and
extrinsic evaluation (measuring the model's impact on real-world business goals).
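Putting steps 5 and 6 together, the sketch below trains a simple classifier and reports accuracy, precision, recall, and F1. The toy sentiment data and the choice of TF-IDF plus logistic regression are illustrative assumptions, not a prescribed setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy sentiment data (illustrative only).
texts = ["love this product", "terrible support", "works great", "worst app ever",
         "very happy with it", "completely broken", "excellent quality", "waste of money"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.25, random_state=0)

# Modeling: vectorize the text and fit a classifier in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

# Evaluation: classification metrics on held-out data.
pred = model.predict(X_test)
acc = accuracy_score(y_test, pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_test, pred, average="binary", zero_division=0)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```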
7. Deployment: The best-performing model is deployed in a production environment to handle new,
unseen data.

8. Monitoring and Model Updating: After deployment, continuous monitoring is essential to ensure
the model remains effective as data changes over time.

Text normalization techniques (like converting text to lowercase, expanding abbreviations, or handling
code-mixing) are used to deal with variations in how text is written across different platforms (e.g.,
social media).

The NLP pipeline is an iterative process, and depending on the nature of the data and task,
different steps may be prioritized or customized.

Questions:
1 How can we get data required for training an NLP technique?
To gather data for training an NLP model, you can:
Use Public Datasets from platforms like Kaggle or Google Dataset Search.
Scrape Data from the Web using tools like BeautifulSoup or Scrapy.
Collect Internal Data from customer logs or internal systems.
Augment Data using techniques like back translation or synonym replacement.

2 Data can be collected from PDF files, HTML pages, and images. How can this data be
cleaned based on its source?
To clean data from different sources:
PDF Files: Use libraries like PyPDF2 or PDFMiner to extract text, then remove unwanted elements
(like headers/footers) and fix formatting issues.
HTML Pages: Use BeautifulSoup or Scrapy to parse and extract text, then remove HTML tags,
scripts, and non-relevant data.
Images: Use OCR tools like Tesseract to extract text, then clean any recognition errors and unwanted
characters.
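A rough sketch of all three cases is shown below; it assumes PyPDF2 (3.x API), BeautifulSoup (bs4), Pillow, and pytesseract are installed, with a Tesseract binary available, and the file names are placeholders.

```python
import re
from PyPDF2 import PdfReader
from bs4 import BeautifulSoup
from PIL import Image
import pytesseract

def clean(text):
    """Normalize whitespace after extraction."""
    return re.sub(r"\s+", " ", text).strip()

# PDF: extract page text, then normalize whitespace.
pdf_text = " ".join(page.extract_text() or "" for page in PdfReader("doc.pdf").pages)

# HTML: drop scripts/styles and keep only the visible text.
soup = BeautifulSoup(open("page.html", encoding="utf-8").read(), "html.parser")
for tag in soup(["script", "style"]):
    tag.decompose()
html_text = soup.get_text(separator=" ")

# Image: OCR with Tesseract, then clean up the recognized text.
ocr_text = pytesseract.image_to_string(Image.open("scan.png"))

print(clean(pdf_text), clean(html_text), clean(ocr_text), sep="\n")
```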

3 Using dot (.) to segment sentences can cause problems, explain how?
Using a dot (.) to segment sentences can cause problems because dots are also used in abbreviations
(e.g., "Dr.", "Mr.") and numbers (e.g., "3.14"). This can lead to incorrect segmentation.
To avoid this, more advanced methods such as context-aware rules or machine learning models are
used for sentence segmentation (see the sketch below).
4 What are the frequent steps in the data pre-processing phase?


Frequent steps in the data pre-processing phase include:
Sentence Segmentation: Dividing text into sentences.
Tokenization: Splitting sentences into words or subwords.
Stop Word Removal: Eliminating common words that don’t add much meaning.
Lowercasing: Converting all text to lowercase.
Punctuation Removal: Removing unnecessary punctuation marks.
Stemming/Lemmatization: Reducing words to their base form.
5 With examples, explain the differences between segmentation and lemmatization.
Segmentation: Divides text into meaningful units like sentences or words. Example: "I love
programming." → Segmented into: "I", "love", "programming".
Lemmatization: Reduces words to their base or root form. Example: "running" →
lemmatized to "run", "better" → lemmatized to "good".

6 What is the difference between code mixing and transliteration?
Code Mixing: The use of multiple languages within a single sentence or
conversation. Example: "I am going to the bazaar for shopping."
Transliteration: Converting words from one script to another while keeping the
original pronunciation. Example: writing "خوش" in Latin script as "khush".

7 Describe the concept coreference resolution.


Coreference resolution is the process of identifying and linking words or phrases in a text that refer to
the same entity. For example, in the sentences "John went to the store. He bought some milk," the word "He"
refers to "John." Coreference resolution helps a model understand that "he" and "John" are the same person.

8 Explain the feature engineering for classical NLP versus DL-based NLP
Classical NLP: Manual Feature Extraction: Involves selecting and crafting specific features from the
raw text using domain knowledge.
Examples of Features:
Bag of Words (BoW): Counts occurrences of each word.
Part-of-Speech (POS) tags: Labels words based on their role (noun, verb, etc.).
N-grams: Sequences of N words or characters.
Purpose: These features are used as inputs for traditional machine learning models (e.g.,
SVM, Naive Bayes, or Logistic Regression).
DL-based NLP: Automatic Feature Learning: Deep learning models, especially neural networks,
automatically learn features from raw text data without manual intervention.
Example:
Word Embeddings: Word2Vec, GloVe, or FastText that map words to dense vectors capturing semantic
relationships.
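The sketch below contrasts the two styles on a toy corpus: hand-built bag-of-words counts versus Word2Vec embeddings learned from context. It assumes scikit-learn and gensim 4.x are installed; real embeddings need far more text than two sentences.

```python
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import Word2Vec

corpus = ["the cat sat on the mat", "the dog sat on the rug"]

# Classical: hand-designed Bag-of-Words count features.
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())
print(bow.get_feature_names_out())

# DL-style: dense vectors learned from context (toy settings, tiny corpus).
tokenized = [doc.split() for doc in corpus]
w2v = Word2Vec(tokenized, vector_size=16, window=2, min_count=1, epochs=50)
print(w2v.wv["cat"][:5])            # first few dimensions of the learned vector
print(w2v.wv.most_similar("cat"))   # nearest words in embedding space
```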

9 How to combine heuristics directly or indirectly with the ML model?


You can combine heuristics with a machine learning (ML) model in the following ways:
Direct Integration: Use heuristic rules as features or input to the ML model. For example, adding
rule-based features (like keyword matches or POS tags) into a model's input features to help the model
make better predictions.
Indirect Integration: Use heuristics as a post-processing step. After the ML model makes
predictions, apply heuristic rules to refine or adjust the model's output (e.g., correcting errors or applying
domain-specific knowledge).
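A small sketch of both styles follows; the keyword heuristic, the toy features, and the "legal" override rule are invented purely to show where a heuristic plugs in before and after the model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

URGENT_WORDS = {"refund", "crash", "charged", "urgent"}   # hypothetical keyword heuristic

def rule_feature(text):
    """Direct integration: expose the heuristic to the model as an extra feature."""
    return int(any(w in text.lower() for w in URGENT_WORDS))

texts = ["app crash on login", "how do I change my photo", "urgent refund needed", "nice update"]
labels = [1, 0, 1, 0]

# Toy feature matrix: text length plus the heuristic flag.
X = np.array([[len(t.split()), rule_feature(t)] for t in texts])
model = LogisticRegression().fit(X, labels)

def predict_with_postprocessing(text):
    """Indirect integration: let a domain rule override the model after prediction."""
    x = np.array([[len(text.split()), rule_feature(text)]])
    pred = model.predict(x)[0]
    if "legal" in text.lower():      # rule applied after the model's output
        pred = 1
    return pred

print(predict_with_postprocessing("legal complaint about billing"))
```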
10 What is the difference between models ensembling and stacking?
Ensembling: Combines multiple models (e.g., decision trees, logistic regression) to make
a final prediction by methods like bagging (e.g., Random Forest) or boosting (e.g.,
XGBoost), often by averaging or voting.
Stacking: Involves training multiple models and using a meta-model to learn how to
combine their predictions. The meta-model is trained on the outputs (predictions) of the
base models to make a final prediction.
Key difference: Ensembling combines model predictions directly, while stacking uses
another model (meta-model) to combine them.
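scikit-learn exposes both ideas directly, as in the sketch below: VotingClassifier combines base predictions by voting, while StackingClassifier trains a meta-model on the base models' outputs (toy data from make_classification).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
base = [("rf", RandomForestClassifier(random_state=0)),
        ("svm", SVC(probability=True, random_state=0))]

# Ensembling: combine the base models' predictions directly (here by soft voting).
voting = VotingClassifier(estimators=base, voting="soft").fit(X, y)

# Stacking: a meta-model (logistic regression) learns how to combine the base outputs.
stacking = StackingClassifier(estimators=base, final_estimator=LogisticRegression()).fit(X, y)

print(voting.predict(X[:3]), stacking.predict(X[:3]))
```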

11 What is the difference between intrinsic and extrinsic evaluation?


Intrinsic Evaluation: Measures the performance of a model based on its direct output
(e.g., accuracy, precision, recall) without considering its impact on real-world tasks.
It focuses on the quality of the model's predictions.
Extrinsic Evaluation: Assesses the model's performance by its effectiveness in a real-
world application or task (e.g., how well a language model improves user experience in a
chatbot). It focuses on the model's practical use and its influence on outcomes.

12 Which modeling technique can be used in the following cases: small data, large data,
poor data quality, and good data quality?
Small Data: Use Traditional ML models like Logistic Regression, SVM, or Decision Trees, which
perform well with limited data.
Large Data: Use Deep Learning models like Neural Networks, Convolutional Neural Networks
(CNNs), or Recurrent Neural Networks (RNNs), as they scale well with large datasets.
Poor Data Quality: Use Robust models like Random Forests or Gradient Boosting Machines
(GBM) that are less sensitive to noisy or incomplete data.
Good Data Quality: Use Simple models like Linear Regression, SVM, or Naive Bayes, which
perform well when the data is clean and well-structured.

13 Explain how the NLP pipeline differs from one language to another.


The NLP pipeline differs from one language to another due to variations in grammar,
syntax, morphology, and other language-specific features. For example:
Tokenization: Different languages may have different word boundaries (e.g., Chinese
doesn't use spaces, so word segmentation is more complex).
Syntax: Sentence structure (e.g., word order in Japanese vs. English) affects how the
pipeline processes text.
Stop Words: Common words (like "the" in English or "le" in French) differ in each
language, requiring language-specific lists.

14 Describe the deploying, monitoring, and updating phases of the NLP pipeline.


Deploying: After training the NLP model, it is deployed into a production environment where it can
start processing real-world data.
Monitoring: Once deployed, the model’s performance is continuously tracked. This includes monitoring
evaluation metrics and checking that performance does not degrade as the data changes.
Updating: The model is periodically updated based on new data, performance feedback, or to improve
accuracy. This involves retraining the model with fresh data.
15 Describe the NLP pipeline for ranking tickets in a ticketing system by Uber.
1 Data Collection: Collect text data from customer tickets (e.g., complaints, requests, or inquiries).
2 Preprocessing: Clean the data by removing stop words, special characters, and irrelevant text.
Tokenize the text into words or phrases
3 Text Representation: Convert the text into numerical representations using word embeddings (e.g.,
Word2Vec, GloVe) or transformers (e.g., BERT) for context-aware representations.
4 Feature Extraction: Extract relevant features from the text (e.g., ticket category, urgency, keywords).
5 Classification or Ranking: Use classification models (e.g., Random Forest, SVM) or ranking
models (e.g., Learning to Rank) to assign priority scores or categories
6 Ranking: Rank tickets based on predicted scores, such as ticket urgency, customer sentiment, or
issue complexity.
7 Post-processing: Present the ranked tickets in the ticketing system for efficient handling by support
teams.
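The sketch below is not Uber's actual system; it stands in for steps 3 to 6 with TF-IDF features and a logistic-regression urgency score on made-up tickets, then sorts new tickets by predicted urgency.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up historical tickets labelled 1 = urgent, 0 = routine.
tickets = ["driver never arrived and I was charged", "how do I update my photo",
           "app crashes on payment screen", "love the new map colours",
           "account hacked, unauthorized rides", "question about invoice format"]
urgent = [1, 0, 1, 0, 1, 0]

# Text representation + classification in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(tickets, urgent)

new_tickets = ["charged twice for one ride", "feature request: dark mode"]
scores = model.predict_proba(new_tickets)[:, 1]        # probability of being urgent

# Ranking: highest predicted urgency first, ready for the support queue.
for score, ticket in sorted(zip(scores, new_tickets), reverse=True):
    print(f"{score:.2f}  {ticket}")
```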

What is tokenization in NLP?


Tokenization is the process of breaking text into smaller units, such as words or
subwords, to make it easier for the model to process.

What are stop words?


Stop words are common words (e.g., "the", "is", "and") that are usually removed in NLP
tasks because they don't carry significant meaning.

What is part-of-speech tagging?


Part-of-speech (POS) tagging involves labeling each word in a sentence with its
corresponding part of speech, such as noun, verb, or adjective.

What is the purpose of word embeddings?


Word embeddings represent words as dense vectors that capture semantic relationships
between words, allowing models to understand word meanings better.

What is sentiment analysis in NLP?


Sentiment analysis is the task of determining the sentiment expressed in a text, such as
positive, negative, or neutral.
