NLP lect 2
4. Feature Engineering: This step involves transforming the processed text into
numerical representations that can be understood by machine learning algorithms.
5. Modeling: This step involves training machine learning models using the extracted
features and data.
Text normalization techniques (like converting text to lowercase, expanding abbreviations, or handling
code-mixing) are used to deal with variations in how text is written across different platforms (e.g.,
social media).
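As a rough illustration, a minimal normalization pass might lowercase the text, collapse whitespace, and expand a small hand-made abbreviation map (the map below is a made-up example, not a standard resource):

```python
import re

# Illustrative abbreviation/slang map -- not a standard lexicon.
ABBREVIATIONS = {"u": "you", "gr8": "great", "idk": "i do not know"}

def normalize(text: str) -> str:
    text = text.lower()                       # case folding
    text = re.sub(r"\s+", " ", text).strip()  # collapse extra whitespace
    # expand simple abbreviations token by token
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

print(normalize("IDK   if this is GR8 or not"))
# -> "i do not know if this is great or not"
```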
The NLP pipeline is an iterative process, and depending on the nature of the data and task,
different steps may be prioritized or customized.
1 How can we get the data required for training an NLP model?
To gather data for training an NLP model, you can:
Use Public Datasets from platforms like Kaggle or Google Dataset Search.
Scrape Data from the Web using tools like BeautifulSoup or Scrapy.
Collect Internal Data from customer logs or internal systems.
Augment Data using techniques like back translation or synonym replacement.
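For example, a toy synonym-replacement augmenter might look like the sketch below (the synonym table and sentence are purely illustrative):

```python
import random

# Illustrative synonym table -- a real setup would use a lexicon such as WordNet.
SYNONYMS = {"quick": ["fast", "speedy"], "happy": ["glad", "pleased"]}

def augment(sentence: str, p: float = 0.9) -> str:
    out = []
    for tok in sentence.split():
        alts = SYNONYMS.get(tok.lower())
        if alts and random.random() < p:
            out.append(random.choice(alts))  # swap in a random synonym
        else:
            out.append(tok)                  # keep the original token
    return " ".join(out)

print(augment("the quick fox looked happy"))
# e.g. "the fast fox looked glad" (output varies with the random draw)
```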
2 Data can be collected from PDF files, HTML pages, and images; how can this data be
cleaned based on its source?
To clean data from different sources:
PDF Files: Use libraries like PyPDF2 or PDFMiner to extract text, then remove unwanted elements
(like headers/footers) and fix formatting issues.
HTML Pages: Use BeautifulSoup or Scrapy to parse and extract text, then remove HTML tags,
scripts, and non-relevant data.
Images: Use OCR tools like Tesseract to extract text, then clean any recognition errors and unwanted
characters.
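As one concrete case, cleaning an HTML page with BeautifulSoup might look roughly like this (the sample HTML string is made up for illustration):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<html><head><script>var x = 1;</script></head>
<body><h1>Title</h1><p>Useful text here.</p><footer>Site footer</footer></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Drop scripts, styles, and boilerplate elements before extracting text.
for tag in soup(["script", "style", "footer"]):
    tag.decompose()

text = soup.get_text(separator=" ", strip=True)
print(text)  # -> "Title Useful text here."
```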
3 Using a dot (.) to segment sentences can cause problems; explain how.
Using a dot (.) to segment sentences can cause problems because dots are also used in abbreviations
(e.g., "Dr.", "Mr.") and numbers (e.g., "3.14"). This can lead to incorrect segmentation.
To avoid this, more advanced methods like context-based rules or machine-learning-based
sentence tokenizers can be used.
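A quick sketch of the problem, contrasted with NLTK's pre-trained Punkt sentence tokenizer (note that the resource name to download may differ across NLTK versions):

```python
text = "Dr. Smith paid $3.14 for coffee. He thanked Mr. Jones."

# Naive rule: split on every period -- breaks on "Dr.", "Mr.", and "3.14".
naive = [s.strip() for s in text.split(".") if s.strip()]
print(naive)
# ['Dr', 'Smith paid $3', '14 for coffee', 'He thanked Mr', 'Jones']  <- wrong

# Punkt learns abbreviation/number patterns and segments by context.
import nltk
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)  # needed by newer NLTK versions
from nltk.tokenize import sent_tokenize

print(sent_tokenize(text))
# typically: ['Dr. Smith paid $3.14 for coffee.', 'He thanked Mr. Jones.']
```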
8 Explain the feature engineering for classical NLP versus DL-based NLP
Classical NLP: Manual Feature Extraction: Involves selecting and crafting specific features from the
raw text using domain knowledge.
Examples of Features:
Bag of Words (BoW): Counts occurrences of each word.
Part-of-Speech (POS) tags: Labels words based on their role (noun, verb, etc.).
N-grams: Sequences of N words or characters.
Purpose: These features are used as inputs for traditional machine learning models (e.g.,
SVM, Naive Bayes, or Logistic Regression).
DL-based NLP: Automatic Feature Learning: Deep learning models, especially neural networks,
automatically learn features from raw text data without manual intervention.
Example:
Word Embeddings: Word2Vec, GloVe, or FastText, which map words to dense vectors capturing
semantic relationships.
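A minimal sketch of classical feature extraction using scikit-learn's CountVectorizer for bag-of-words plus bigram counts (the tiny corpus is illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the movie was great", "the movie was terrible", "great acting"]

# Classical features: word counts plus word bigrams (ngram_range=(1, 2)).
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # unigram and bigram vocabulary (scikit-learn >= 1.0)
print(X.toarray())                         # one count vector per document
```

These count vectors would then be fed to a traditional classifier such as SVM or Naive Bayes, whereas a DL-based model would instead learn embeddings directly from the raw tokens.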
12 Which modeling technique can be used in the following cases: small data, large data,
poor data quality, and good data quality?
Small Data: Use Traditional ML models like Logistic Regression, SVM, or Decision Trees, which
perform well with limited data.
Large Data: Use Deep Learning models like Neural Networks, Convolutional Neural Networks
(CNNs), or Recurrent Neural Networks (RNNs), as they scale well with large datasets.
Poor Data Quality: Use Robust models like Random Forests or Gradient Boosting Machines
(GBM) that are less sensitive to noisy or incomplete data.
Good Data Quality: Use Simple models like Linear Regression, SVM, or Naive Bayes, which
perform well when the data is clean and well-structured.
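As a small-data illustration, a TF-IDF plus Logistic Regression pipeline in scikit-learn (the four labeled examples below are made up; with so little data a linear model is a more sensible default than a deep network):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative training set: 1 = positive, 0 = negative.
texts = ["great movie", "loved it", "terrible plot", "waste of time"]
labels = [1, 1, 0, 0]

# TF-IDF features into a simple linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["what a terrible movie"]))  # -> [0]
```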