
Project Report: Spam Email Classification Using Logistic Regression

1. Introduction
The rise of email communication has led to an increase in unwanted emails,
commonly known as spam. Spam detection is crucial for ensuring that users do
not waste time on irrelevant or malicious content. This project aims to build a
spam classification model that uses machine learning to classify messages as
either spam or ham (non-spam). We use the Logistic Regression algorithm, a
simple yet powerful model for binary classification tasks. The input data consists
of email messages, and the model is trained to differentiate between spam and
ham based on the content of the messages.
2. Objective
The objective of this project is to create a spam email classifier using machine
learning techniques. Specifically, we aim to:
 Preprocess the data to prepare it for model training.
 Extract relevant features from email text using TF-IDF (Term Frequency-
Inverse Document Frequency).
 Build and train a Logistic Regression model to classify messages into
spam and ham.
 Evaluate the model’s performance using accuracy metrics.
 Test the classifier on new, unseen email samples.
3. Dataset
The dataset used for this project is the SMS Spam Collection Dataset, which
contains 5,572 messages labeled as either ham (non-spam) or spam. The dataset
is available in CSV format and has the following columns:
 Message: The content of the email or text message.
 Category: The text label ('ham' or 'spam'), later encoded as 0 for spam and
1 for ham during preprocessing.
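Before any preprocessing, it is worth loading the file and checking its size and
label balance. The short snippet below assumes the same 'spam.csv' file and
column names used in the full script at the end of this report; the exact counts
are illustrative.
import pandas as pd

# Load the CSV and inspect its size and label distribution
df = pd.read_csv('spam.csv')
print(df.shape)                       # expected: (5572, 2)
print(df['Category'].value_counts())  # number of 'ham' vs 'spam' messages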
4. Data Preprocessing
Data preprocessing is essential for preparing the data for machine learning. The
steps performed include:
 Handling Missing Values: The dataset is cleaned by replacing any
missing or null values with empty strings.
data = df.where(pd.notnull(df), '')
 Label Encoding: The Category column, which contains text labels ('spam'
and 'ham'), is converted into numeric labels:
o 0 for spam
o 1 for ham
data.loc[data['Category'] == 'spam', 'Category'] = 0
data.loc[data['Category'] == 'ham', 'Category'] = 1
 Splitting the Data: The dataset is split into training (80%) and testing
(20%) sets using the train_test_split function from scikit-learn.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=3)
5. Feature Extraction
Text data cannot be directly used in machine learning algorithms. To handle this,
we apply TF-IDF (Term Frequency-Inverse Document Frequency) to
transform the text messages into numerical feature vectors. TF-IDF helps identify
the most important words in a document relative to the entire dataset by
weighing the frequency of words in a document against their general occurrence
in all documents.
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
x_train_features = feature_extraction.fit_transform(x_train)
x_test_features = feature_extraction.transform(x_test)
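To get a feel for what the vectorizer produces, the resulting sparse matrices can
be inspected. Each row is one message and each column one vocabulary term,
weighted by its frequency in that message and scaled down by how common the term
is across all messages (scikit-learn additionally applies smoothing and L2
normalization by default). The numbers in the comments below are illustrative and
depend on the split.
print(x_train_features.shape)   # e.g. (4457, <vocabulary size>)
print(x_test_features.shape)    # e.g. (1115, <vocabulary size>)

# A few of the learned vocabulary terms and their column indices
print(list(feature_extraction.vocabulary_.items())[:5])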
6. Model Selection and Training
The chosen machine learning model for this project is Logistic Regression.
Logistic regression is widely used for binary classification tasks because it
outputs probabilities that a message belongs to a particular class (spam or ham).
model = LogisticRegression()
model.fit(x_train_features, y_train)
The model is trained using the training features (x_train_features) and labels
(y_train).
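Because logistic regression produces class probabilities rather than only hard
labels, individual predictions can be inspected. The lines below are not part of
the original script; they assume the fitted model and the test features from the
surrounding sections.
# Probability estimates for the first three test messages:
# column 0 is P(spam) and column 1 is P(ham), following the 0/1 encoding above
print(model.predict_proba(x_test_features[:3]))

# The hard prediction is the class with the higher probability
print(model.predict(x_test_features[:3]))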
7. Model Evaluation
Once the model is trained, it is evaluated using the accuracy score, which
compares the predicted labels to the true labels of the test set.
prediction_on_test_data = model.predict(x_test_features)
accuracy_on_test_data = accuracy_score(y_test, prediction_on_test_data)
The resulting score is the fraction of test messages whose predicted label
matches the true label.
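Printing the score makes the result easy to read, and a confusion matrix (not
part of the original script, but a useful addition given that ham messages far
outnumber spam in this dataset) shows how the errors are distributed:
from sklearn.metrics import confusion_matrix

print(f"Accuracy on Test Data: {accuracy_on_test_data:.2f}")

# Rows are true classes (0 = spam, 1 = ham), columns are predicted classes;
# the off-diagonal entries count misclassified messages
print(confusion_matrix(y_test, prediction_on_test_data))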
8. Testing on New Data
The model can also be used to classify new, unseen messages. For this, we input
a new message, transform it using the same TF-IDF vectorizer, and then predict
its label using the trained model.
input_your_mail = ["Free entry in 2 a wkly comp to win FA Cup final tkts..."]
input_data_features = feature_extraction.transform(input_your_mail)
prediction = model.predict(input_data_features)
The result is printed as either "Spam mail" or "Ham mail".
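The mapping from the numeric prediction back to a readable label mirrors the
encoding from Section 4 (0 for spam, 1 for ham), as in the full script at the end
of this report:
# prediction is an array with one entry for the single input message
if prediction[0] == 1:
    print("Ham mail")
else:
    print("Spam mail")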
9. Results
The accuracy of the model on the test data is computed, and the results can vary
based on the dataset used and the parameters of the model. A sample output
might look like this:
Accuracy on Test Data: 0.98
Spam mail
This indicates that the model classifies messages on the test set with high
accuracy and has correctly flagged the example message from Section 8 as spam.
10. Conclusion
In this project, a Logistic Regression model was trained to classify messages
as either spam or ham using a TF-IDF feature extraction technique. The model
performed well with a high accuracy rate on the test data. The project
demonstrates the effectiveness of machine learning algorithms in natural
language processing tasks, such as spam detection.
11. Future Work
 Model Improvement: The model can be improved by using other
algorithms like Random Forests, Support Vector Machines (SVM), or
Naive Bayes for comparison.
 Hyperparameter Tuning: Further tuning of hyperparameters (such as
the regularization strength C in Logistic Regression) can help improve model
performance; a brief sketch follows this list.
 Deep Learning: Advanced techniques like Recurrent Neural Networks
(RNN) or Transformer-based models could be explored for better
accuracy, especially for more complex datasets.
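As a rough starting point for the first two items, a minimal sketch is shown
below. It assumes the x_train_features, x_test_features, y_train, and y_test
variables defined earlier in this report, and the parameter values are
illustrative rather than a tuned configuration.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Search over the inverse regularization strength C with 5-fold cross-validation
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(x_train_features, y_train)
print("Best C:", grid.best_params_, "Cross-validation accuracy:", grid.best_score_)

# A quick baseline comparison: Multinomial Naive Bayes on the same TF-IDF features
nb_model = MultinomialNB()
nb_model.fit(x_train_features, y_train)
print("Naive Bayes accuracy on Test Data:", nb_model.score(x_test_features, y_test))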
12. References
 SMS Spam Collection Dataset
 scikit-learn documentation, LogisticRegression:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
 TF-IDF (Wikipedia): https://en.wikipedia.org/wiki/Tf%E2%80%93idf

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the dataset
df = pd.read_csv('spam.csv')

# Clean up missing values and encode the 'Category' column:
# 0 for 'spam' and 1 for 'ham'
data = df.where(pd.notnull(df), '')
data.loc[data['Category'] == 'spam', 'Category'] = 0
data.loc[data['Category'] == 'ham', 'Category'] = 1

# Separate the features and the target variable
x = data['Message']
y = data['Category']

# Split the dataset into training and testing sets (80% train, 20% test)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=3)

# Initialize TfidfVectorizer for feature extraction
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

# Fit on the training data, then transform both the training and test sets
x_train_features = feature_extraction.fit_transform(x_train)
x_test_features = feature_extraction.transform(x_test)

# Convert target labels to integers (0 for spam, 1 for ham)
y_train = y_train.astype('int')
y_test = y_test.astype('int')

# Initialize the logistic regression model
model = LogisticRegression()

# Train the model on the training data
model.fit(x_train_features, y_train)

# Predict on training data to check accuracy
prediction_on_training_data = model.predict(x_train_features)
accuracy_on_training_data = accuracy_score(y_train, prediction_on_training_data)

# Print training accuracy
print(f"Accuracy on Training Data: {accuracy_on_training_data:.2f}")

# Predict on the held-out test data (the evaluation described in Section 7)
prediction_on_test_data = model.predict(x_test_features)
accuracy_on_test_data = accuracy_score(y_test, prediction_on_test_data)
print(f"Accuracy on Test Data: {accuracy_on_test_data:.2f}")

# Test the model on a custom email input
input_your_mail = ["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075 over 18's"]

# Transform the input data using the same feature extractor
input_data_features = feature_extraction.transform(input_your_mail)

# Make a prediction for the input email
prediction = model.predict(input_data_features)

# Print the result (1 = ham, 0 = spam)
if prediction[0] == 1:
    print("Ham mail")
else:
    print("Spam mail")
