
Project Report: Spam Email Classification Using Logistic Regression

1. Introduction
The rise of email communication has led to an increase in unwanted emails,
commonly known as spam. Spam detection is crucial for ensuring that users do
not waste time on irrelevant or malicious content. This project aims to build a
spam classification model that uses machine learning to classify messages as
either spam or ham (non-spam). We use the Logistic Regression algorithm, a
simple yet powerful model for binary classification tasks. The input data consists
of email messages, and the model is trained to differentiate between spam and
ham based on the content of the messages.
2. Objective
The objective of this project is to create a spam email classifier using machine
learning techniques. Specifically, we aim to:
 Preprocess the data to prepare it for model training.
 Extract relevant features from email text using TF-IDF (Term Frequency-
Inverse Document Frequency).
 Build and train a Logistic Regression model to classify messages into
spam and ham.
 Evaluate the model’s performance using accuracy metrics.
 Test the classifier on new, unseen email samples.
3. Dataset
The dataset used for this project is the SMS Spam Collection Dataset, which
contains 5,572 messages labeled as either ham (non-spam) or spam. The dataset
is available in CSV format and has the following columns:
 Message: The content of the email or text message.
 Category: The text label ('ham' or 'spam'), later encoded as 0 for spam and
1 for ham during preprocessing.
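Before any preprocessing, it is worth loading the file and checking its size and
label balance. The short snippet below assumes the same 'spam.csv' file and
column names used in the full script at the end of this report; the exact counts
are illustrative.
import pandas as pd

# Load the CSV and inspect its size and label distribution
df = pd.read_csv('spam.csv')
print(df.shape)                       # expected: (5572, 2)
print(df['Category'].value_counts())  # number of 'ham' vs 'spam' messages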
4. Data Preprocessing
Data preprocessing is essential for preparing the data for machine learning. The
steps performed include:
 Handling Missing Values: The dataset is cleaned by replacing any
missing or null values with empty strings.
data = df.where(pd.notnull(df), '')
 Label Encoding: The Category column, which contains text labels ('spam'
and 'ham'), is converted into numeric labels:
o 0 for spam
o 1 for ham
data.loc[data['Category'] == 'spam', 'Category'] = 0
data.loc[data['Category'] == 'ham', 'Category'] = 1
 Splitting the Data: The dataset is split into training (80%) and testing
(20%) sets using the train_test_split function from scikit-learn.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=3)
5. Feature Extraction
Text data cannot be directly used in machine learning algorithms. To handle this,
we apply TF-IDF (Term Frequency-Inverse Document Frequency) to
transform the text messages into numerical feature vectors. TF-IDF helps identify
the most important words in a document relative to the entire dataset by
weighing the frequency of words in a document against their general occurrence
in all documents.
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
x_train_features = feature_extraction.fit_transform(x_train)
x_test_features = feature_extraction.transform(x_test)
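To get a feel for what the vectorizer produces, the resulting sparse matrices can
be inspected. Each row is one message and each column one vocabulary term,
weighted by its frequency in that message and scaled down by how common the term
is across all messages (scikit-learn additionally applies smoothing and L2
normalization by default). The numbers in the comments below are illustrative and
depend on the split.
print(x_train_features.shape)   # e.g. (4457, <vocabulary size>)
print(x_test_features.shape)    # e.g. (1115, <vocabulary size>)

# A few of the learned vocabulary terms and their column indices
print(list(feature_extraction.vocabulary_.items())[:5])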
6. Model Selection and Training
The chosen machine learning model for this project is Logistic Regression.
Logistic regression is widely used for binary classification tasks because it
outputs probabilities that a message belongs to a particular class (spam or ham).
model = LogisticRegression()
model.fit(x_train_features, y_train)
The model is trained using the training features (x_train_features) and labels
(y_train).
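Because logistic regression produces class probabilities rather than only hard
labels, individual predictions can be inspected. The lines below are not part of
the original script; they assume the fitted model and the test features from the
surrounding sections.
# Probability estimates for the first three test messages:
# column 0 is P(spam) and column 1 is P(ham), following the 0/1 encoding above
print(model.predict_proba(x_test_features[:3]))

# The hard prediction is the class with the higher probability
print(model.predict(x_test_features[:3]))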
7. Model Evaluation
Once the model is trained, it is evaluated using the accuracy score, which
compares the predicted labels to the true labels of the test set.
prediction_on_test_data = model.predict(x_test_features)
accuracy_on_test_data = accuracy_score(y_test, prediction_on_test_data)
The resulting score is the fraction of test messages whose predicted label
matches the true label.
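Printing the score makes the result easy to read, and a confusion matrix (not
part of the original script, but a useful addition given that ham messages far
outnumber spam in this dataset) shows how the errors are distributed:
from sklearn.metrics import confusion_matrix

print(f"Accuracy on Test Data: {accuracy_on_test_data:.2f}")

# Rows are true classes (0 = spam, 1 = ham), columns are predicted classes;
# the off-diagonal entries count misclassified messages
print(confusion_matrix(y_test, prediction_on_test_data))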
8. Testing on New Data
The model can also be used to classify new, unseen messages. For this, we input
a new message, transform it using the same TF-IDF vectorizer, and then predict
its label using the trained model.
input_your_mail = ["Free entry in 2 a wkly comp to win FA Cup final tkts..."]
input_data_features = feature_extraction.transform(input_your_mail)
prediction = model.predict(input_data_features)
The result is printed as either "Spam mail" or "Ham mail".
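The mapping from the numeric prediction back to a readable label mirrors the
encoding from Section 4 (0 for spam, 1 for ham), as in the full script at the end
of this report:
# prediction is an array with one entry for the single input message
if prediction[0] == 1:
    print("Ham mail")
else:
    print("Spam mail")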
9. Results
The accuracy of the model on the test data is computed, and the results can vary
based on the dataset used and the parameters of the model. A sample output
might look like this:
Accuracy on Test Data: 0.98
Spam mail
This indicates that the model classifies messages on the test set with high
accuracy and has correctly flagged the example message from Section 8 as spam.
10. Conclusion
In this project, a Logistic Regression model was trained to classify messages
as either spam or ham using a TF-IDF feature extraction technique. The model
performed well with a high accuracy rate on the test data. The project
demonstrates the effectiveness of machine learning algorithms in natural
language processing tasks, such as spam detection.
11. Future Work
 Model Improvement: The model can be improved by using other
algorithms like Random Forests, Support Vector Machines (SVM), or
Naive Bayes for comparison.
 Hyperparameter Tuning: Further tuning of hyperparameters (such as
the regularization strength C in Logistic Regression) can help improve model
performance; a brief sketch follows this list.
 Deep Learning: Advanced techniques like Recurrent Neural Networks
(RNN) or Transformer-based models could be explored for better
accuracy, especially for more complex datasets.
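As a rough starting point for the first two items, a minimal sketch is shown
below. It assumes the x_train_features, x_test_features, y_train, and y_test
variables defined earlier in this report, and the parameter values are
illustrative rather than a tuned configuration.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Search over the inverse regularization strength C with 5-fold cross-validation
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(x_train_features, y_train)
print("Best C:", grid.best_params_, "Cross-validation accuracy:", grid.best_score_)

# A quick baseline comparison: Multinomial Naive Bayes on the same TF-IDF features
nb_model = MultinomialNB()
nb_model.fit(x_train_features, y_train)
print("Naive Bayes accuracy on Test Data:", nb_model.score(x_test_features, y_test))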
12. References
 SMS Spam Collection Dataset
 scikit-learn documentation, LogisticRegression:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
 TF-IDF (Wikipedia): https://en.wikipedia.org/wiki/Tf%E2%80%93idf

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the dataset
df = pd.read_csv('spam.csv')

# Clean up missing values and encode the 'Category' column:
# 0 for 'spam' and 1 for 'ham'
data = df.where(pd.notnull(df), '')
data.loc[data['Category'] == 'spam', 'Category'] = 0
data.loc[data['Category'] == 'ham', 'Category'] = 1

# Separate the features and the target variable
x = data['Message']
y = data['Category']

# Split the dataset into training and testing sets (80% train, 20% test)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=3)

# Initialize TfidfVectorizer for feature extraction
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

# Fit on the training data, then transform both the training and test sets
x_train_features = feature_extraction.fit_transform(x_train)
x_test_features = feature_extraction.transform(x_test)

# Convert target labels to integers (0 for spam, 1 for ham)
y_train = y_train.astype('int')
y_test = y_test.astype('int')

# Initialize the logistic regression model
model = LogisticRegression()

# Train the model on the training data
model.fit(x_train_features, y_train)

# Predict on training data to check accuracy
prediction_on_training_data = model.predict(x_train_features)
accuracy_on_training_data = accuracy_score(y_train, prediction_on_training_data)

# Print training accuracy
print(f"Accuracy on Training Data: {accuracy_on_training_data:.2f}")

# Predict on the held-out test data (the evaluation described in Section 7)
prediction_on_test_data = model.predict(x_test_features)
accuracy_on_test_data = accuracy_score(y_test, prediction_on_test_data)
print(f"Accuracy on Test Data: {accuracy_on_test_data:.2f}")

# Test the model on a custom email input
input_your_mail = ["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075 over 18's"]

# Transform the input data using the same feature extractor
input_data_features = feature_extraction.transform(input_your_mail)

# Make a prediction for the input email
prediction = model.predict(input_data_features)

# Print the result (1 = ham, 0 = spam)
if prediction[0] == 1:
    print("Ham mail")
else:
    print("Spam mail")
