Spam Email Classifier

Introduction

After working through regression projects, the natural next step in Machine Learning is classification. In this lesson, you will build a Spam Email Classifier that predicts whether an email is spam or not.

This is one of the most common real-world applications of Machine Learning.

Problem Statement

The goal is to classify emails into two categories:

Spam
Not Spam

This is a binary classification problem.

Step 1: Import Libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

Step 2: Load Dataset

df = pd.read_csv("spam.csv")

Explore Data

df.head()
df.info()
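
Before preprocessing, it also helps to check how many messages fall into each class, because spam datasets are often imbalanced. The sketch below uses a small made-up DataFrame as a stand-in for spam.csv; the rows are illustrative assumptions, and only the 'label' and 'message' column names come from this lesson.

```python
import pandas as pd

# Toy stand-in for spam.csv (hypothetical rows, lesson's column names)
df = pd.DataFrame({
    "label": ["ham", "ham", "ham", "spam"],
    "message": [
        "Are we still meeting at noon?",
        "Please review the attached report",
        "Call me when you get home",
        "Win a free prize now",
    ],
})

# Fraction of rows in each class; a heavy skew toward 'ham'
# means accuracy alone will look better than the model really is
class_share = df["label"].value_counts(normalize=True)
print(class_share)
```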

Step 3: Data Preprocessing

Convert Labels to Numeric

df['label'] = df['label'].map({'ham': 0, 'spam': 1})
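
One thing to watch with map: any label that is not a key in the dictionary becomes NaN, which will later break training. A minimal check, assuming a hypothetical label column that contains one inconsistently capitalized value:

```python
import pandas as pd

# Hypothetical raw labels; "Spam" does not match the key "spam"
labels = pd.Series(["ham", "spam", "Spam", "ham"])

mapped = labels.map({"ham": 0, "spam": 1})

# Rows where the mapping produced NaN point to unexpected label values
unmapped = labels[mapped.isna()]
print(unmapped.tolist())
```

If the list is not empty, normalizing the labels first (for example with .str.lower().str.strip()) is one way to fix it.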

Select Features and Target

X = df['message']
y = df['label']

Step 4: Text Vectorization

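Machine Learning models cannot work with raw text directly, so we first convert each message into numbers. CountVectorizer builds a vocabulary from the messages and represents each message as a vector of word counts.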

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(X)

Step 5: Train-Test Split

# random_state makes the split reproducible between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
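
When spam is much rarer than ham, a plain random split can leave the test set with very few spam messages. One option is the stratify parameter of train_test_split, which keeps the class proportions the same in both parts. A sketch on made-up data (20 placeholder messages, 4 of them labeled spam):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Placeholder data: 4 spam (1) and 16 ham (0)
messages = [f"message {i}" for i in range(20)]
labels = [1 if i < 4 else 0 for i in range(20)]

X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, stratify=labels, random_state=42
)

# Both parts keep the original 20% spam share
train_counts = Counter(y_train)
test_counts = Counter(y_test)
print(train_counts, test_counts)
```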

Step 6: Train Model

model = MultinomialNB()
model.fit(X_train, y_train)
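
Note that the steps above fit the vectorizer on the full dataset before splitting, so the vocabulary is learned partly from test messages. One common alternative is to split the raw text first and wrap the vectorizer and the model in a scikit-learn Pipeline, so the vocabulary comes from the training data only. The following is a sketch on made-up messages, not the lesson's dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up messages: 0 = ham, 1 = spam
messages = [
    "Are we meeting at noon",
    "Please send the report",
    "See you at lunch tomorrow",
    "Thanks for the update",
    "Claim your free prize now",
    "You won cash, claim today",
    "Free entry, win a prize",
    "Urgent, claim your reward",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the raw text first
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, stratify=labels, random_state=0
)

# The pipeline fits the vectorizer on the training text only,
# then trains Naive Bayes on the resulting counts
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(X_train, y_train)

test_predictions = pipeline.predict(X_test)
print(test_predictions)
```

A pipeline also accepts raw strings at prediction time, so there is no separate transform call to forget.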

Step 7: Make Predictions

predictions = model.predict(X_test)

Step 8: Evaluate Model

from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, predictions))
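
Accuracy alone can be misleading for spam detection, because most messages are usually not spam. Precision, recall, and the confusion matrix show what kind of mistakes the model makes. The labels below are made up to illustrate the point: 8 ham, 2 spam, and a model that misses one spam message.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical true labels and predictions (0 = ham, 1 = spam)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # one spam message missed

accuracy = accuracy_score(y_true, y_pred)    # share of correct predictions
precision = precision_score(y_true, y_pred)  # of predicted spam, how much is spam
recall = recall_score(y_true, y_pred)        # of actual spam, how much was caught

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(y_true, y_pred)

print(accuracy, precision, recall)
print(matrix)
```

Here accuracy is 90% even though the model caught only half of the spam, which is why recall is worth reporting alongside it.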

Step 9: Test with Custom Input

sample = ["Congratulations! You won a prize"]
sample_vector = vectorizer.transform(sample)
print(model.predict(sample_vector))
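
Besides the predicted label, MultinomialNB can report how confident it is through predict_proba. The sketch below trains on a handful of made-up messages so that it runs on its own; with the real dataset you would reuse the vectorizer and model from the earlier steps.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data: 0 = ham, 1 = spam
train_messages = [
    "Congratulations you won a free prize",
    "Claim your free prize now",
    "You won cash, claim now",
    "Are we meeting at noon",
    "Please send the report",
    "See you at lunch tomorrow",
]
train_labels = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()
model = MultinomialNB()
model.fit(vectorizer.fit_transform(train_messages), train_labels)

sample = ["Congratulations! You won a prize"]
probabilities = model.predict_proba(vectorizer.transform(sample))[0]

# Columns of predict_proba follow model.classes_, so look up the spam column
spam_probability = probabilities[list(model.classes_).index(1)]
print(f"P(spam) = {spam_probability:.3f}")
```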

Key Concepts Used

Text preprocessing
Feature extraction (CountVectorizer)
Classification algorithm (Naive Bayes)
Model evaluation

Real-World Applications

Email filtering systems
SMS spam detection
Fraud message detection

Improvements You Can Make

Use TF-IDF instead of CountVectorizer
Try different algorithms like Logistic Regression
Clean text (remove stopwords, punctuation)
Use advanced NLP techniques
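
As a starting point, the first three ideas can be combined in a few lines. This is a sketch under assumptions, not a tuned solution: the messages are made up, and the settings (English stop words, max_iter=1000) are illustrative choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data: 0 = ham, 1 = spam
messages = [
    "claim your free prize now",
    "free prize waiting claim today",
    "win cash prize claim free",
    "meeting moved to noon tomorrow",
    "please review the attached report",
    "lunch tomorrow with the team",
]
labels = [1, 1, 1, 0, 0, 0]

# TfidfVectorizer lowercases text, drops punctuation through its default
# tokenizer, and here also removes common English stop words
improved = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
improved.fit(messages, labels)

prediction = improved.predict(["Claim your free prize"])
print(prediction)
```

Whether this beats Naive Bayes depends on the dataset, so compare both on the same test split before switching.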

Conclusion

This project shows how Machine Learning can be used for text classification. Spam detection is widely used in real-world systems and is a great project for beginners.

In the next lesson, you will build a Customer Segmentation project using clustering techniques.

FAQs

What type of problem is spam detection?

It is a binary classification problem: each email is labeled either spam or not spam.

Which algorithm is used?

Multinomial Naive Bayes, using scikit-learn's MultinomialNB class.

What is CountVectorizer?

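It is a scikit-learn class that converts a collection of text messages into a matrix of word counts, with one row per message and one column per word in the vocabulary.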

Can this model be improved?

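Yes. You can clean the text (remove stopwords and punctuation), use TF-IDF instead of raw counts, or try other algorithms such as Logistic Regression.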

Is this project useful in real life?

Yes. Spam detection is widely used in email filtering, SMS spam detection, and fraud message detection.
