Project Paper Submission B21CS045

Marri Bharadwaj
Computer Science & Engineering (CSE)
IIT Jodhpur
[email protected]

Dr. Chandana N
Center for Emerging Technologies for Sustainable Development (CETSD)
IIT Jodhpur
[email protected]
Abstract—Early identification of disaster-related events on social media channels like Twitter considerably improves emergency response and risk reduction efforts. In this paper, I studied and compared the performance of two deep learning models, namely Long Short-Term Memory (LSTM) and Bidirectional Encoder Representations from Transformers (BERT), for carrying out binary classification of tweets as disaster-related or non-disaster-related. Our LSTM model utilises FastText embeddings with gated memory cells to make full use of sequential information, while BERT relies on deep contextual embeddings enabled by attention. Despite the groundbreaking structure of BERT, I noted that our LSTM model performs better than BERT for short tweets, with an F1 value of 0.79 compared to BERT's value of 0.74. Most of this performance advantage arises from tweets being short, so that better training and optimisation could be achieved by the LSTM model. The study points out the possibility of using NLP-enabled deep learning models for real-time monitoring of disasters, with scalable solutions amenable to deployment in emergency management systems.

Index Terms—Disaster classification, LSTM, BERT, Natural Language Processing, Social Media Mining, Tweet Analysis, Emergency Response, Deep Learning, Real-Time Detection, FastText.
I. INTRODUCTION

Disasters, both natural and human-made, threaten human life, infrastructure, and environmental stability. Climate change and urban vulnerabilities are increasing the frequency and intensity of disasters, leading to a growing need for rapid detection, communication, and response mechanisms. Conventional models for disaster management typically rely on slow messages and formal alerts that do not always communicate what is happening on the ground at the time.

Social media, especially Twitter, has emerged as a dynamic space where people share live updates, personal experiences, and requests for assistance. These types of user-generated insights can act as signals of an impending crisis or an existing emergency situation. The challenge comes in filtering through the vast quantity of unstructured text data to locate relevant, credible information. [1]

This project seeks to assess and demonstrate that Natural Language Processing (NLP) techniques can be used to classify tweets as disaster-related or not, thus exhibiting a way to contribute to real-time disaster monitoring efforts. I used deep learning models trained on annotated data to demonstrate how machine learning models can read human language in real time and derive urgency and relevance from it.

The larger rationale behind this work is driven by public good. An effective tweet classification model may serve as a useful adjunct to aid emergency services, public health agencies, and humanitarian organisations in monitoring unfolding disasters, analysing their path, and coordinating urgent responses to the disaster. It reflects the increasing overlap of technology and society, where machine learning is becoming vital for humanitarian assistance and risk reduction. [2]

II. DATA COLLECTION

In order to build and test a sound disaster detection model utilising NLP approaches, I assembled a data set of differing social media posts specifically related to natural and man-made disasters. The data comprises social media posts collected mainly from platforms including Twitter, Facebook, and Reddit, with the express goal of including as much diversity in real-world context, informal language style, and reporting behaviour as possible.

The final data set comprises 7,613 posts, each of which is labelled as either a disaster-related post (target = 1) or a non-disaster post (target = 0). The posts covered a variety of disaster events, including:
• Natural disasters: earthquakes, hurricanes, floods, wildfires, tsunamis, tornadoes, blizzards, volcanic eruptions, landslides, heat waves, droughts, and storms.
• Man-made disasters: terrorist attacks, industrial accidents, oil spills, radiation emergencies, explosions, and transportation accidents.

The data set was collected in a CSV file and contained the following variables (a short loading sketch follows the list):
• Text: the post or tweet text.
• Keyword: a relevant keyword extracted from the post (when applicable).
• Location: the location in which the post was created (often partial or missing).
• Target: a binary variable indicating whether the post is about a real disaster (1) or not (0); only available in the training data set.
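As an illustration, the labelled data can be loaded and inspected with pandas. This is a minimal sketch, assuming the training file is named train.csv; the actual file name may differ.

import pandas as pd

# Load the labelled training data; the file name "train.csv" is an assumption.
df = pd.read_csv("train.csv")

print(df.shape)                      # expected: 7,613 rows
print(df.columns.tolist())           # should include text, keyword, location, target
print(df["target"].value_counts())   # label counts for the binary target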
Fig. 2. Frequency Histogram
A. Class Distribution
An initial analysis of the target labels revealed a fairly even split across the dataset, with around 43% of the posts referring to actual disasters and 57% to non-disaster events. This balance is beneficial for training machine learning models, as it reduces the risk of the model becoming biased towards either class.
Figure-1 visualises the class distribution.
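The class balance reported above can be reproduced directly from the dataframe; a minimal sketch, assuming the df object from the previous snippet.

import matplotlib.pyplot as plt

# Share of non-disaster (0) vs. disaster (1) posts; roughly 57% / 43% in this dataset.
class_share = df["target"].value_counts(normalize=True)
print(class_share)

# Bar chart corresponding to the class-distribution figure.
class_share.plot(kind="bar")
plt.xlabel("target")
plt.ylabel("proportion of posts")
plt.show()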
B. Keyword Analysis
Of the 7,613 posts, only 61 did not have a keyword. The dataset comprises 222 unique keywords, many of which are multi-word phrases corresponding to particular disaster types.

A histogram of keyword frequencies (Figure-2) shows a long-tail distribution: a few keywords appear frequently, while many appear only a handful of times. Figure-3 shows the proportion of posts classified as a real disaster for each keyword; the high variance of these proportions suggests that keywords are useful predictive features that correspond well with reports of real disasters.

Fig. 3. Proportion of Posts across the Classes
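Both keyword statistics can be computed in a few lines; a sketch assuming the df object from the earlier snippets.

# Keyword frequencies (long-tail distribution shown in Figure-2).
keyword_counts = df["keyword"].value_counts()
print(keyword_counts.head(10))        # the handful of frequent keywords
print((keyword_counts <= 5).sum())    # how many keywords occur only a few times

# Proportion of real-disaster posts per keyword (Figure-3).
disaster_share = df.groupby("keyword")["target"].mean().sort_values()
print(disaster_share.head())          # keywords rarely linked to real disasters
print(disaster_share.tail())          # keywords almost always linked to real disasters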
C. Location Analysis

Due to a high level of inconsistency and noise in the location variable of the data set, including 2,533 missing cases and over 3,300 unique named locations, I conducted extensive data cleaning and assessment of this variable. Examples of the issues to address are informal expressions found throughout the data, e.g., recorded locations of "Earth" and "Somewhere in the sky," inconsistent naming of the location feature, and special characters. Data cleaning involved considerable effort and included the following steps (a condensed code sketch follows the list):
• Removing non-Latin characters.
• Lowercasing all text.
• Standardising country names, e.g., changing "united states" and "us" to "usa".
• Removing excess white space and formatting issues.

In the end, while the location variable had been cleaned, it was still very sparse and highly ambiguous. As Figure-4 shows, all but a few locations are unique or missing completely, and the distribution across classes is so uniform that it was not worthwhile to train a model on this field. It was therefore omitted from model training because of its low pertinence.

Fig. 4. Frequency of Location Occurrences
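A condensed sketch of the cleaning steps listed above; the country-name mapping shown is illustrative, not the full replacement table used in the project.

import re

def clean_location(loc):
    """Apply the location-cleaning steps: strip non-Latin characters,
    lowercase, standardise a few country names, collapse whitespace."""
    if not isinstance(loc, str):
        return ""
    loc = re.sub(r"[^A-Za-z\s]", " ", loc)                 # remove non-Latin characters
    loc = loc.lower()                                      # lowercase all text
    loc = re.sub(r"\b(united states|us)\b", "usa", loc)    # standardise country names (excerpt)
    loc = re.sub(r"\s+", " ", loc).strip()                 # remove excess whitespace
    return loc

df["location"] = df["location"].apply(clean_location)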
III. METHODOLOGY AND PIPELINE

A. Text Pre-processing and Word Embeddings

To refine the model's capacity to comprehend and derive generalisations from textual data, I employed FastText embeddings to vectorise the text. FastText embeddings differ from older embeddings such as Word2Vec or GloVe in that FastText utilises subword (character n-gram) information, which permits it to generalise to out-of-vocabulary (OOV) words by building representations for new words. [3]

At first, the raw text data had comparatively low embedding coverage: FastText embeddings covered 51.5% of the vocabulary and 81.8% of the text. The limited coverage was primarily due to tokens that could not be matched because of irregular characters, symbols, and punctuation. To improve embedding coverage and generalisation, I executed a number of text pre-processing steps (sketched in code after the list):
• Removal of URLs: all hyperlinks were removed from the text using regular expressions, as these were predominantly noise and inconsequential for understanding content.
• Expansion of contractions: common contractions, such as "won't", "can't", and "I'm", were expanded to their full forms (e.g., "will not", "can not", "I am") to improve matching with the embedding vocabulary.
• Lower-casing: all text was converted to lower case, which improved coverage and reduced the vocabulary size.
• Removal of characters and symbols: all non-Latin characters (including punctuation and special characters) were removed using regular expressions.
• Removing stopwords: common stopwords (e.g., "the", "is", "and") were removed to keep only the words that carry meaning.
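A minimal sketch of these steps, assuming the NLTK stopword list is available; the contraction table is only an excerpt, not the full mapping used in the project.

import re
from nltk.corpus import stopwords   # assumes the NLTK stopword list has been downloaded

CONTRACTIONS = {"won't": "will not", "can't": "can not", "i'm": "i am"}  # excerpt only
STOPWORDS = set(stopwords.words("english"))

def preprocess(text):
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)            # remove URLs
    for short, full in CONTRACTIONS.items():                      # expand contractions
        text = re.sub(short, full, text, flags=re.IGNORECASE)
    text = text.lower()                                           # lower-case
    text = re.sub(r"[^a-z\s]", " ", text)                         # drop non-Latin characters/symbols
    tokens = [w for w in text.split() if w not in STOPWORDS]      # remove stopwords
    return " ".join(tokens)

df["text_clean"] = df["text"].apply(preprocess)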
After these preprocessing steps, vocabulary coverage reached 78% and text coverage improved to 93.8% according to the embedding coverage checks. This was an important step, since I wanted to ensure that the majority of the words in our dataset had valid vectors assigned to them, which directly helps the model retain meaning. After cleaning, the Crawl embeddings reached 0.780 vocabulary coverage and 0.938 text coverage.

I also preprocessed the keyword column by replacing encoded spaces (%20) with regular spaces and then replacing missing values with the placeholder "empty". This preprocessing allowed us to treat the keyword field as an additional feature in our model.
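A sketch of the coverage check and the keyword handling described above; fasttext_vectors is assumed to be a dict-like word-to-vector map loaded from the Crawl FastText file, which is not shown here.

from collections import Counter

def embedding_coverage(texts, vectors):
    """Fraction of the vocabulary and of the running text covered by embedding vectors."""
    vocab = Counter(word for text in texts for word in text.split())
    covered_vocab = sum(1 for w in vocab if w in vectors)
    covered_text = sum(c for w, c in vocab.items() if w in vectors)
    return covered_vocab / len(vocab), covered_text / sum(vocab.values())

vocab_cov, text_cov = embedding_coverage(df["text_clean"], fasttext_vectors)
print(f"vocabulary coverage: {vocab_cov:.3f}, text coverage: {text_cov:.3f}")

# Keyword column: decode %20 and fill missing values with the placeholder "empty".
df["keyword"] = df["keyword"].str.replace("%20", " ", regex=False).fillna("empty")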
B. Input Preparation

The cleaned and preprocessed text data was then shaped so it could be fed into our model for training. I split it into train and test sets (96-4 split) using scikit-learn's train_test_split method, which kept a small sample aside for final analysis and evaluation while leaving enough data to train effectively.

How to prepare inputs for the LSTM (a code sketch of these steps follows the list):
• Tokenisation:
  – The text column was tokenised using Tokenizer() and converted into sequences of integers.
  – The keyword column was tokenised separately and represented in matrix format.
• Padding: the token sequences were padded to ensure uniform input sizes across all data samples.
• Meta-data features: additional float-type metadata features (such as word count) were concatenated with the keyword matrix to create a larger set of float features.
• Embedding matrix: I initialised the embedding matrix with FastText's vectors, such that every word found in the vocabulary was given its corresponding 300-dimensional vector. Words not found in FastText's vectors were given the zero vector. I ended up with 3,386 unknown words, a significant drop from the count before preprocessing.
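A sketch of the input preparation, assuming the Keras Tokenizer and the fasttext_vectors map from earlier snippets; the keyword and metadata features are omitted for brevity, and the sequence length of 40 mirrors the fixed length mentioned in the next subsection.

import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# 96-4 train/test split on the cleaned text and labels.
X_train, X_test, y_train, y_test = train_test_split(
    df["text_clean"], df["target"], test_size=0.04, random_state=42)

# Tokenise the text column and pad the integer sequences to a fixed length.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
seq_train = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=40)
seq_test = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen=40)

# Embedding matrix: FastText vector where available, zero vector otherwise.
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, 300))
for word, idx in tokenizer.word_index.items():
    if word in fasttext_vectors:
        embedding_matrix[idx] = fasttext_vectors[word]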
C. LSTM-Based Model with Attention

To create a strong neural baseline for the tweet-level classification task, I used a Bidirectional Long Short-Term Memory (BiLSTM) network with an attention mechanism. This architecture enables the model to capture the forward and backward contextual dependencies in the text while also paying attention to the most informative parts of the tweet. [4]

1) Data Pre-processing: The preprocessing pipeline was as follows (a brief tokenisation sketch follows the list):
• Converting tweets to lowercase and removing URLs, hashtags, mentions, emojis, special characters, and extra whitespace.
• Tokenising the cleaned tweets with a torchtext "Field" object with a fixed maximum length of 40 tokens.
• Padding/truncating sequences to ensure uniform input lengths.
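A minimal sketch of this tokenisation step, assuming the legacy torchtext Field API (torchtext.legacy in recent releases, plain torchtext.data in versions up to 0.8).

import torch
from torchtext.legacy import data   # in torchtext <= 0.8 this module is simply torchtext.data

# Fixed-length (40 token) text field plus a float label field for the binary target.
TEXT = data.Field(sequential=True, tokenize=str.split, lower=True,
                  fix_length=40, batch_first=True)
LABEL = data.LabelField(dtype=torch.float)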
The final layer of the network produces an output probability for a tweet being positive (i.e., disaster-related). The ability of the BiLSTM-with-attention model to learn which parts of the sequence are relevant, rather than being constrained to only the first words, makes it a powerful approach for this classification task.
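A sketch of a BiLSTM encoder with a simple additive attention layer of the kind described above; the layer sizes are illustrative, not the exact configuration used in this project.

import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    def __init__(self, embedding_matrix, hidden_dim=128):
        super().__init__()
        vocab_size, emb_dim = embedding_matrix.shape
        self.embedding = nn.Embedding.from_pretrained(
            torch.tensor(embedding_matrix, dtype=torch.float), freeze=False)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)   # scores each timestep
        self.fc = nn.Linear(2 * hidden_dim, 1)     # binary output logit

    def forward(self, x):                          # x: (batch, seq_len) token ids
        h, _ = self.lstm(self.embedding(x))        # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)   # attention weights over timesteps
        context = (weights * h).sum(dim=1)         # weighted sum of hidden states
        return torch.sigmoid(self.fc(context)).squeeze(-1)  # probability of disaster class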
I set the following training parameters:
• Loss Function: Binary Cross-Entropy (BCE)
• Optimiser: Adam with a learning rate of 10^-3
• Batch Size: 32
• Epochs: 10
• Validation Split: 10% of training data
• Device: CUDA-enabled GPU for faster training
To alleviate overfitting, I monitored validation performance during training and applied early stopping with patience relative to the best validation loss.
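A condensed sketch of this training setup, reusing the BiLSTMAttention and embedding_matrix sketches above; the train_loader and val_loader objects (batch size 32, 10% validation split) and the patience value are assumptions.

import copy
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BiLSTMAttention(embedding_matrix).to(device)
criterion = nn.BCELoss()                                   # binary cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

best_val_loss, patience, bad_epochs = float("inf"), 2, 0   # patience value is an assumption
for epoch in range(10):
    model.train()
    for x, y in train_loader:                              # assumed DataLoader, batch size 32
        optimizer.zero_grad()
        loss = criterion(model(x.to(device)), y.float().to(device))
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():                                  # average loss on the validation split
        val_loss = sum(criterion(model(x.to(device)), y.float().to(device)).item()
                       for x, y in val_loader) / len(val_loader)

    if val_loss < best_val_loss:                           # early stopping on best validation loss
        best_val_loss, bad_epochs = val_loss, 0
        best_state = copy.deepcopy(model.state_dict())
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break

model.load_state_dict(best_state)                          # restore the best checkpoint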