Overview

In this assignment, you will implement an encoder-only Transformer (BERT) from scratch to
gain a deep understanding of its core components.

Next, you will work with the WildChat-1M-Full dataset, which contains conversations between
humans and ChatGPT. Your task is to classify each conversation as toxic or non-toxic.

OR

You will work with the Fakeddit dataset, which contains posts from Reddit and the comments
on those posts. Your task is to classify whether each post is fake or not.
Note: Please request access to the WildChat-1M-Full dataset immediately. Do not wait
until the last moment, as getting permission to access this dataset may take time.
Clearly mention in the report which dataset you are using.

Tasks

Task 1: Implement BERT class from scratch [40% of TOTAL]

Objective: Implement the essential components of a BERT model from scratch using PyTorch,
and test it on a sample sentence. The main goal of this task is to understand the data flow and
the shapes of the embeddings. I've provided the references to guide your understanding.
Please use them for reference only, not as-is.

Requirements:

1. Implement the following components in PyTorch. Consider number of heads = 12, embedding dimension = 768, number of layers = 2 (a minimal sketch of one possible layout appears after this list):
o Multi-Head Self-Attention
o Feed-Forward Layer
o Transformer Encoder Layer
o BertModel (stack of encoders with token embeddings, positional
embeddings, pooler, and classifier head)
2. Tokenizer: Use bert-base-uncased from Hugging Face to tokenize inputs.
3. Testing with a Sample Sentence:
o Consider any sentence of your choice with more than 10 words.
o Tokenize the sentence using the tokenizer.
o Pass the tokenized input through your implemented BERT model; for the
segment (token-type) information, if required, pass 0 for all tokens.
4. Report:
o Record and include the shapes of embeddings at each stage of the model:
■ Token embeddings
■ Positional embeddings
■ Output of Multi-Head Attention
■ Output of Feed-Forward layer
■ Output of Transformer Encoder Layer
■ Output probabilities from the classifier
■ Shape of each parameter.
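
For reference only (not to be copied as-is), here is a minimal sketch of one possible layout of these components, using the specified 12 heads, 768-dimensional embeddings, and 2 layers. Class names such as BertModelScratch and the internal wiring are illustrative assumptions, not the required design; you still need to print the intermediate shapes asked for in the report.

```python
# Minimal sketch of the Task 1 components (names and wiring are illustrative).
import torch
import torch.nn as nn
from transformers import BertTokenizer

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)   # joint projection for Q, K, V
        self.out = nn.Linear(dim, dim)

    def forward(self, x, mask=None):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape each to (B, heads, T, head_dim)
        q, k, v = (t.view(B, T, self.heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        if mask is not None:                 # mask out padding positions
            scores = scores.masked_fill(mask[:, None, None, :] == 0, float("-inf"))
        ctx = (scores.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, T, D)
        return self.out(ctx)

class FeedForward(nn.Module):
    def __init__(self, dim=768, hidden=3072):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)

class EncoderLayer(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn, self.ffn = MultiHeadSelfAttention(dim, heads), FeedForward(dim)
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, mask=None):
        x = self.ln1(x + self.attn(x, mask))   # residual connection + layer norm
        return self.ln2(x + self.ffn(x))

class BertModelScratch(nn.Module):
    def __init__(self, vocab_size=30522, dim=768, heads=12, layers=2, max_len=512, num_classes=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)       # token embeddings
        self.pos_emb = nn.Embedding(max_len, dim)          # positional embeddings
        self.seg_emb = nn.Embedding(2, dim)                # segment (token-type) embeddings
        self.layers = nn.ModuleList(EncoderLayer(dim, heads) for _ in range(layers))
        self.pooler = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, input_ids, token_type_ids, attention_mask=None):
        pos = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(pos) + self.seg_emb(token_type_ids)
        for layer in self.layers:
            x = layer(x, attention_mask)       # print x.shape here for the report
        pooled = self.pooler(x[:, 0])          # [CLS] representation
        return self.classifier(pooled).softmax(dim=-1)

# Test on a sample sentence (>10 words); segment IDs are all 0 as the task requires.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("The quick brown fox jumps over the lazy dog near the river bank today.",
                return_tensors="pt")
model = BertModelScratch()
probs = model(enc["input_ids"], torch.zeros_like(enc["input_ids"]), enc["attention_mask"])
print(probs.shape)  # (1, 2) class probabilities
```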
Task 2: Dataset Preparation

WildChat-1M-Full

Prepare the WildChat-1M-Full dataset for toxicity classification by filtering, cleaning, and
organizing the data. For reference, a minimal sketch of one possible preparation pipeline appears after the steps below.

Steps:

1. Filter by Language:
o Use the language column in the dataset and retain only conversations in English.
2. Use the column 'conversation' to access the user prompts and the responses from
ChatGPT. Anonymize the conversation by presenting the input as a conversation
chain between User A and User B, where User A = human and User B = ChatGPT.
o Report the input format you are using.
3. [BONUS - 10% of TOTAL] Apply data cleaning strategies (e.g., removing duplicates);
mention the methods used and the data size after each step.
4. Balance the Dataset:
o The toxic column indicates whether the conversation is toxic or not. Use
it as the ground truth.
o Create a balanced dataset by randomly sampling equal numbers (consider
at least 2500 each) of toxic and non-toxic conversations.
5. Train-Test Split:
o Split the filtered dataset into 80% training and 20% testing.
o Make sure that both sets have balanced classes.
6. Report Statistics:
o Number of English conversations in the original dataset
o Number of toxic and non-toxic conversations after filtering.
o Number of toxic and non-toxic conversations in the balanced dataset you built.
o Number of conversations in train and test sets
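
For reference only, a minimal sketch of one possible preparation pipeline is shown below. The Hugging Face repo id, the per-turn 'role'/'content' fields inside the conversation column, and the value "English" in the language column are assumptions; verify them against the dataset card before use.

```python
# Minimal sketch of the WildChat preparation steps; repo id and field names are
# assumptions, so check the dataset card for the exact structure.
import pandas as pd
from datasets import load_dataset
from sklearn.model_selection import train_test_split

ds = load_dataset("allenai/WildChat-1M-Full", split="train")   # assumed repo id (gated)

# 1. Keep only English conversations (assumed value of the 'language' column).
ds = ds.filter(lambda row: row["language"] == "English")

def to_text(conversation):
    # 2. Anonymize: human turns -> "User A", ChatGPT turns -> "User B".
    turns = []
    for turn in conversation:
        speaker = "User A" if turn["role"] == "user" else "User B"
        turns.append(f"{speaker}: {turn['content']}")
    return "\n".join(turns)

df = pd.DataFrame({
    "text": [to_text(c) for c in ds["conversation"]],
    "label": [int(t) for t in ds["toxic"]],    # 4. 'toxic' column as ground truth
})

# 4. Balance: equal numbers of toxic and non-toxic conversations
#    (assumes at least 2500 of each are available after filtering).
n = 2500
balanced = pd.concat([
    df[df["label"] == 1].sample(n, random_state=42),
    df[df["label"] == 0].sample(n, random_state=42),
]).sample(frac=1, random_state=42)

# 5. Stratified 80/20 split keeps both sets class-balanced.
train_df, test_df = train_test_split(balanced, test_size=0.2,
                                     stratify=balanced["label"], random_state=42)

# 6. Statistics for the report.
print(len(df), balanced["label"].value_counts().to_dict(), len(train_df), len(test_df))
```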
Fakeddit
Prepare the Fakeddit dataset for fake news classification by filtering, cleaning, and organizing
the data.

Steps:

1. Navigate to the folder 'all samples (also includes non multimodal)'. Load and combine
the files all_train.tsv, all_test_public.tsv and all_validate.tsv. Retain only the columns
'clean_title' for the post content, '2_way_label' for the classification, and 'id' for
identifying the post. (A minimal loading-and-balancing sketch is shown after this list.)
2. [BONUS - 10% of TOTAL] Post-Comment-Reply Structure: Obtain the
'all_comments.tsv' file from the Fakeddit GitHub repository. Load the following
columns:
• id - Unique comment ID
• submission_id - ID of the post this comment belongs to
• body - content of the comment
• parent_id - Parent of the comment
o Prefix t3_ - Comment directly on the post (top-level comment)
o Prefix t1_ - Reply to another comment

Use this information to add comments to the inputs to enrich the text. Mention the format
you use to feed the model (a small parsing sketch follows this paragraph). Note that you will
still classify whether the post is fake or not.
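
For reference only, a small sketch of parsing the comment structure described above. The file path, and the assumption that submission_id matches the post 'id' directly, should be verified against the downloaded files; the enriched input format itself is up to you.

```python
# Minimal sketch: attach top-level comments (parent_id prefixed "t3_") to each post.
import pandas as pd

comments = pd.read_csv("all_comments.tsv", sep="\t",
                       usecols=["id", "submission_id", "body", "parent_id"])

# "t3_" parents point at the post itself (top-level comments); "t1_" parents are replies.
top_level = comments[comments["parent_id"].str.startswith("t3_", na=False)]

# Group top-level comment bodies by the post they belong to
# (assumes submission_id matches the post 'id' column).
comments_by_post = top_level.groupby("submission_id")["body"].apply(list)

def build_input(post_id, title):
    # One possible enriched format: post title followed by its top-level comments.
    cs = comments_by_post.get(post_id, [])
    return "Post: " + title + "".join(f"\nComment: {c}" for c in cs)
```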

3. Balance the Dataset:
a. Create a balanced dataset by randomly sampling equal numbers (consider at
least 2500 each) of fake and non-fake posts.
4. Train-Test Split:
a. Split the filtered dataset into 80% training and 20% testing.
b. Make sure that both sets have balanced classes.
5. Report Statistics:
a. Number of posts in the original dataset
b. Number of posts which are fake and non-fake in the balanced dataset you built.
c. Distribution of posts in train and test sets.
d. For those who did step 2, report the following in addition to the above statistics:
i. Number of posts with at least one comment in the original dataset
ii. Mean, standard deviation of number of comments for fake vs non-fake
posts in the entire dataset.
iii. Mean, standard deviation of number of comments for fake vs non-fake
posts in the balanced dataset you built.
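
For reference only, a minimal sketch of the loading, balancing, and splitting steps. File paths and the pandas-based approach are assumptions, and the meaning of the 0/1 values in 2_way_label should be checked against the Fakeddit documentation before reporting results.

```python
# Minimal sketch of the Fakeddit preparation steps; paths assume the TSVs sit
# next to the notebook.
import pandas as pd
from sklearn.model_selection import train_test_split

# 1. Load and combine the three splits, keeping only the required columns.
files = ["all_train.tsv", "all_test_public.tsv", "all_validate.tsv"]
cols = ["id", "clean_title", "2_way_label"]
df = pd.concat([pd.read_csv(f, sep="\t", usecols=cols) for f in files], ignore_index=True)
df = df.dropna(subset=["clean_title"])

# 3. Balance: equal numbers (at least 2500 each) of the two 2_way_label classes.
n = 2500
balanced = pd.concat([
    df[df["2_way_label"] == 0].sample(n, random_state=42),
    df[df["2_way_label"] == 1].sample(n, random_state=42),
]).sample(frac=1, random_state=42)

# 4. Stratified 80/20 split keeps both sets class-balanced.
train_df, test_df = train_test_split(balanced, test_size=0.2,
                                     stratify=balanced["2_way_label"], random_state=42)

# 5. Statistics for the report.
print(len(df), balanced["2_way_label"].value_counts().to_dict(), len(train_df), len(test_df))
```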

Task 3: Fine-tune the Model and Report Metrics [40% of TOTAL]

Steps:

1. For this part, there is no need to use the class you implemented in Task 1.
2. Use BertForSequenceClassification from transformers and load bert-base-uncased
to fine-tune the model (a minimal fine-tuning sketch appears after this list).
3. If you are using only a subset of the input tokens, specify the exact values in your report
and clearly describe the method used for truncating the input.
4. Set Hyperparameters and Train:
a. Define hyperparameters (epochs, batch size, learning rate, optimizer).
b. Finetune the model using cross-entropy loss on the training set.
5. Evaluate Performance:
a. Test the model on the test set and compute Accuracy, Precision, Recall,
F1-score, confusion matrix.
6. Report Results:
a. Present a metrics table for the test set.
b. Document all hyperparameters used.
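
For reference only, a minimal fine-tuning sketch using the Hugging Face Trainer. The hyperparameter values shown are illustrative, and train_df/test_df are assumed to be the dataframes (with 'text' and 'label' columns) produced in Task 2; adapt the column names to whichever dataset you chose.

```python
# Minimal fine-tuning sketch; hyperparameters and dataframe names are illustrative.
import numpy as np
from datasets import Dataset
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support
from transformers import (AutoTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def tokenize(batch):
    # Truncate to 512 tokens (BERT's limit); report whatever truncation you use.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

train_ds = Dataset.from_pandas(train_df).map(tokenize, batched=True)
test_ds = Dataset.from_pandas(test_df).map(tokenize, batched=True)

def compute_metrics(eval_pred):
    # Accuracy, precision, recall, F1 and confusion matrix for the report.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    p, r, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    print(confusion_matrix(labels, preds))
    return {"accuracy": accuracy_score(labels, preds), "precision": p, "recall": r, "f1": f1}

args = TrainingArguments(output_dir="out", num_train_epochs=3, learning_rate=2e-5,
                         per_device_train_batch_size=16, per_device_eval_batch_size=32)

trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  eval_dataset=test_ds, compute_metrics=compute_metrics)
trainer.train()                      # cross-entropy loss is the default for 2 labels
print(trainer.evaluate())            # metrics table for the test set
```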

References
• [Link]
• [Link]

Submission Guidelines
• You have to use IPython Notebooks for coding. (Use Google Colab or Kaggle for
running your assignments.)
• Include brief documentation in each cell of the notebook. Further, properly
add comments to your code as and when required.
• Important - Make sure your IPython Notebook has the outputs from each cell. The
outputs must be present in the submitted notebook as well. Absence of these results will
lead to a straight deduction of 50% of the marks.
• You have to submit the IPython notebook and the Assignment report as mentioned
above. Name them NLP_Assignment_3_Roll_No.ipynb and
NLP_Assignment_3_Roll_No.pdf, respectively.



[For eg: NLP_Assignment_3_21CS30035.ipynb, NLP_Assignment_3_21CS30035.pdf]

• Submit these two files in .zip format on Google Form. Note: Your zip file must be
named NLP_Assignment_3_Roll_No.zip and if you submit more than once, only
your latest submission will be considered.

[For eg: NLP_Assignment_3_21CS30035.zip]
