Spam Email Classifier
Introduction
After working through regression projects, the next step in Machine Learning is classification. In this lesson, you will build a Spam Email Classifier that predicts whether an email is spam or not.
This is one of the most common real-world applications of Machine Learning.
Problem Statement
The goal is to classify emails into two categories:
- Spam
- Not Spam
This is a binary classification problem.
Step 1: Import Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
Step 2: Load Dataset
df = pd.read_csv("spam.csv")
Explore Data
df.head()
df.info()
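Before moving on, it helps to confirm that the file really has the label and message columns the later steps rely on, and to see how many messages fall into each class. The sketch below is only an illustration: it uses a small in-memory CSV instead of spam.csv, and the column names v1 and v2 are an assumption about how some copies of the dataset are laid out, not something this lesson's file is known to contain.

```python
import io
import pandas as pd

# Hypothetical stand-in for spam.csv. The "v1"/"v2" headers are an assumption
# used only to show how to rename columns to the names this lesson expects.
raw_csv = io.StringIO(
    "v1,v2\n"
    "ham,See you at 5\n"
    "spam,Win cash now\n"
    "ham,Lunch tomorrow?\n"
)
df_demo = pd.read_csv(raw_csv)

# Rename to the column names used in the rest of the lesson and keep only those.
df_demo = df_demo.rename(columns={"v1": "label", "v2": "message"})[["label", "message"]]

# Class balance: spam datasets are often dominated by "ham" messages.
class_counts = df_demo["label"].value_counts()
print(class_counts)

# Missing values per column; rows with missing text should be dropped or fixed.
print(df_demo.isna().sum())
```

If your file already has label and message columns, the rename step is unnecessary and only the two checks at the end apply.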
Step 3: Data Preprocessing
Convert Labels to Numeric
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
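One thing to watch with map(): any label that is not a key in the dictionary becomes NaN, which would later break training. The following sketch shows one way to guard against that on a toy DataFrame; the data and the label_num column name are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column names follow the lesson.
toy = pd.DataFrame({
    "label": ["ham", "spam", "ham", "Spam "],  # last label has stray case and whitespace
    "message": ["See you at 5", "Win cash now", "Lunch tomorrow?", "Free prize inside"],
})

# Normalise case and whitespace first so "Spam " is still recognised.
cleaned = toy["label"].str.strip().str.lower()
toy["label_num"] = cleaned.map({"ham": 0, "spam": 1})

# map() returns NaN for values missing from the dictionary, so count them.
unmapped = int(toy["label_num"].isna().sum())
print(toy["label_num"].tolist(), "unmapped:", unmapped)
```

If unmapped is greater than zero, inspect the offending rows before continuing.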
Select Features and Target
X = df['message']
y = df['label']
Step 4: Text Vectorization
Machine Learning models work on numbers, not raw text, so we convert each message into a vector of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(X)
Step 5: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
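Spam datasets are usually imbalanced, so a purely random split can leave the test set with very few spam examples. A possible refinement, sketched here on made-up data, is to pass stratify so both splits keep the same spam ratio, and random_state so the split is reproducible. The variable names and the value 42 are arbitrary choices for the example.

```python
from sklearn.model_selection import train_test_split

# Made-up data: 10 spam (1) and 40 ham (0) messages.
texts = [f"message {i}" for i in range(50)]
labels = [1] * 10 + [0] * 40

# stratify keeps the 20% spam ratio in both splits; random_state fixes the shuffle.
tr_x, te_x, tr_y, te_y = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

print("train spam ratio:", sum(tr_y) / len(tr_y))
print("test spam ratio:", sum(te_y) / len(te_y))
```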
Step 6: Train Model
model = MultinomialNB()
model.fit(X_train, y_train)
Step 7: Make Predictions
predictions = model.predict(X_test)
Step 8: Evaluate Model
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, predictions))
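Accuracy alone can be misleading when most messages are not spam. As a rough illustration, the sketch below uses hand-written labels and predictions (not output from the model above) to show how precision, recall, and a confusion matrix reveal what accuracy hides.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hand-written example: 8 ham (0) and 2 spam (1) messages.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# Hypothetical predictions: one ham flagged as spam, one spam missed.
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)  # of messages flagged as spam, how many were spam
rec = recall_score(y_true, y_pred)      # of actual spam, how many were caught
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class

print("accuracy:", acc)
print("precision:", prec)
print("recall:", rec)
print(cm)
```

Here accuracy is 0.8 even though only half of the spam was caught, which is why precision and recall are worth reporting alongside it.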
Step 9: Test with Custom Input
sample = ["Congratulations! You won a prize"]
sample_vector = vectorizer.transform(sample)
print(model.predict(sample_vector))
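Besides the predicted class, it can be useful to see how confident the model is. The following self-contained sketch trains MultinomialNB on six made-up messages (far too few for a real classifier) and calls predict_proba on the same sample sentence. All names ending in _demo are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set: 1 = spam, 0 = not spam.
texts_demo = [
    "win a free prize now",
    "free cash prize claim now",
    "congratulations you won cash",
    "are we meeting for lunch",
    "see you at the office tomorrow",
    "please review the attached report",
]
labels_demo = [1, 1, 1, 0, 0, 0]

vectorizer_demo = CountVectorizer()
model_demo = MultinomialNB()
model_demo.fit(vectorizer_demo.fit_transform(texts_demo), labels_demo)

# Reuse the fitted vectorizer with transform(), never fit_transform(), on new text.
sample_demo = vectorizer_demo.transform(["Congratulations! You won a prize"])

pred_demo = model_demo.predict(sample_demo)
proba_demo = model_demo.predict_proba(sample_demo)  # columns follow model_demo.classes_

print("classes:", model_demo.classes_)
print("prediction:", pred_demo)
print("probabilities:", proba_demo)
```

The probability columns are ordered as in model_demo.classes_, so with these labels the second column is the estimated probability of spam.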
Key Concepts Used
- Text preprocessing
- Feature extraction (CountVectorizer)
- Classification algorithm (Naive Bayes)
- Model evaluation
Real-World Applications
- Email filtering systems
- SMS spam detection
- Fraud message detection
Improvements You Can Make
- Use TF-IDF instead of CountVectorizer
- Try different algorithms like Logistic Regression
- Clean text (remove stopwords, punctuation)
- Use advanced NLP techniques
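Several of these improvements can be combined in a scikit-learn Pipeline. The sketch below is one possible arrangement, not the only one: it swaps in TfidfVectorizer with English stopword removal and Logistic Regression, and trains on six made-up messages. A side benefit of a pipeline is that the vocabulary is learned only from the data passed to fit, so nothing from the test set leaks into the features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set: 1 = spam, 0 = not spam.
train_texts = [
    "win a free prize now",
    "free cash prize claim now",
    "congratulations you won cash",
    "are we meeting for lunch",
    "see you at the office tomorrow",
    "please review the attached report",
]
train_labels = [1, 1, 1, 0, 0, 0]

# TF-IDF down-weights words that appear in many messages;
# stop_words="english" drops common words such as "the" and "for".
pipeline = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(train_texts, train_labels)

# The pipeline vectorizes new text and classifies it in one call.
new_messages = ["Claim your free prize now", "Lunch meeting moved to noon"]
preds = pipeline.predict(new_messages)
print(preds)
```

On a real dataset, compare this pipeline against the CountVectorizer and Naive Bayes baseline using the same train-test split before deciding which to keep.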
Conclusion
This project shows how Machine Learning can be used for text classification. Spam detection is widely used in real-world systems and is a great project for beginners.
In the next lesson, you will build a Customer Segmentation project using clustering techniques.
FAQs
What type of problem is spam detection?
It is a binary classification problem: each email is either spam or not spam.
Which algorithm is used?
A Multinomial Naive Bayes classifier (MultinomialNB in scikit-learn).
What is CountVectorizer?
It converts text into a matrix of word counts, with one row per message and one column per word in the vocabulary.
Can this model be improved?
Yes, for example by cleaning the text, using TF-IDF features, or trying other algorithms such as Logistic Regression.
Is this project useful in real life?
Yes, it is widely used in email systems.