Spam filtering system based on machine learning

Foreword:

Some time ago I wrote a post about a spam filtering system based on machine learning, and some readers then asked me about the specific implementation, so here is a brief walkthrough.

Table of contents

Foreword:

1. Overview

2. Data collection

2.1 Get data from SpamAssassin public mail corpus

2.2. Data preprocessing

2.2.1 Read email data

2.2.2 Text Preprocessing

3. Select the model

3.1. Training model

4. Evaluate model performance

4.1. Split Dataset

4.2. Evaluation indicators

4.3. Cross Validation

5. Deploy the model to practical application

5.1. Save the model

5.2. Mail processing

6. Integrate into email client

6.1. Create an Outlook macro

6.2. Create a Python script

6.3. Call Python script

6.4. Create mail rules


1. Overview

In this post, we show how to build a spam filtering system based on machine learning, using the Python programming language and several common machine learning libraries. The project is divided into the following parts:

1. Overview
2. Data collection and preprocessing
3. Select and train machine learning models
4. Evaluate model performance
5. Deploy models to practical applications
6. Integrate into clients

2. Data collection

2.1 Get data from SpamAssassin public mail corpus

We can download spam and ham data from the SpamAssassin public mail corpus (https://spamassassin.apache.org/old/publiccorpus/). The corpus consists of multiple compressed archives containing a large amount of mail data, which we need to decompress and merge.

Here is a simple Python script to download and decompress the mail data:

import os
import tarfile
import urllib.request

# Download one archive (skipped if the file is already present)
def download_data(url, target_folder):
    if not os.path.exists(target_folder):
        os.makedirs(target_folder)
    file_name = url.split("/")[-1]
    target_path = os.path.join(target_folder, file_name)
    if not os.path.exists(target_path):
        urllib.request.urlretrieve(url, target_path)
        print(f"Downloaded {file_name}")
    return target_path

# Extract one archive. The corpus files are .tar.bz2, so let tarfile
# detect the compression ("r:*") instead of assuming gzip ("r:gz").
def extract_data(file_path, target_folder):
    with tarfile.open(file_path, "r:*") as tar:
        tar.extractall(target_folder)
        print(f"Extracted {os.path.basename(file_path)} to {target_folder}")

# Download and extract the datasets
url_list = [
    "https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2",
    "https://spamassassin.apache.org/old/publiccorpus/20021010_hard_ham.tar.bz2",
    "https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2"
]

target_folder = "data"
for url in url_list:
    file_path = download_data(url, target_folder)
    extract_data(file_path, target_folder)
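
The extraction step can be checked without downloading anything. The following is a minimal sketch, assuming only the standard library: it builds a tiny bz2-compressed archive in memory (the member name easy_ham/0001.sample is made up for illustration) and opens it with mode "r:*", which lets tarfile detect the compression on its own.

```python
import io
import tarfile

def build_sample_archive() -> bytes:
    """Create a tiny bz2-compressed tar archive in memory (stand-in for a corpus file)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:bz2") as tar:
        payload = b"Subject: hello\n\nbody\n"
        # Hypothetical member name, chosen only for this example.
        info = tarfile.TarInfo(name="easy_ham/0001.sample")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()

def list_members(archive_bytes: bytes) -> list:
    """Open the archive with transparent compression detection and list its files."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        return [member.name for member in tar.getmembers() if member.isfile()]

archive = build_sample_archive()
member_names = list_members(archive)
print(member_names)
```

Opening the same bytes with mode "r:gz" would raise tarfile.ReadError, which is why the compression mode has to match the .tar.bz2 files in the corpus.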

2.2. Data preprocessing

2.2.1 Read email data

We need to read the mail data from the unzipped folder. Here is a simple Python function to read a mail file:

import os
import email
import email.parser
import email.policy

def read_email(file_path):
    # Open in binary mode so messages in any encoding can be parsed
    with open(file_path, "rb") as f:
        return email.parser.BytesParser(policy=email.policy.default).parse(f)

ham_folder = "data/easy_ham"
spam_folder = "data/spam"
# Skip the "cmds" listing file that ships in the corpus folders; it is not an email
ham_files = [os.path.join(ham_folder, f) for f in os.listdir(ham_folder) if f != "cmds"]
spam_files = [os.path.join(spam_folder, f) for f in os.listdir(spam_folder) if f != "cmds"]
ham_emails = [read_email(f) for f in ham_files]
spam_emails = [read_email(f) for f in spam_files]
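
To see what the parser returns, here is a small sketch that parses a hand-written message from memory instead of a corpus file. The addresses and subject are invented for illustration; the parser call is the same BytesParser with the default policy used above.

```python
from email import policy
from email.parser import BytesParser

# Hypothetical raw message, used only for illustration.
RAW_MESSAGE = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: Lunch on Friday?\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Are you free at noon?\r\n"
)

# parsebytes() parses an in-memory bytes object; parse() does the same for a file.
message = BytesParser(policy=policy.default).parsebytes(RAW_MESSAGE)

subject = message["Subject"]                # header lookup by name
content_type = message.get_content_type()   # e.g. "text/plain"
is_multipart = message.is_multipart()       # False for a single-part message
print(subject, content_type, is_multipart)
```

The parsed object exposes headers like a dictionary, and its content type is what the body-extraction step later checks for "text/plain".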

2.2.2 Text Preprocessing

We need to preprocess the text of each email body, including cleaning, normalization, and vectorization. Here is some simple Python code that implements these operations:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords')
stemmer = SnowballStemmer("english")
stop_words = set(stopwords.words("english"))

# Clean the text
def clean_text(text):
    text = re.sub(r'\W+', ' ', text)  # replace runs of non-word characters with a space
    text = text.lower()  # convert to lowercase
    text = re.sub(r'\d+', '', text)  # remove digits
    # remove stop words and stem the remaining words
    text = ' '.join([stemmer.stem(word) for word in text.split() if word not in stop_words])
    return text

# Extract the plain-text body of an email
def get_email_text(email_obj):
    parts = []
    for part in email_obj.walk():
        if part.get_content_type() == 'text/plain':
            parts.append(part.get_payload())
    return ''.join(parts)

# Preprocess all emails
ham_texts = [clean_text(get_email_text(msg)) for msg in ham_emails]
spam_texts = [clean_text(get_email_text(msg)) for msg in spam_emails]

# Combine the data and the labels (0 = ham, 1 = spam)
texts = ham_texts + spam_texts
labels = [0] * len(ham_texts) + [1] * len(spam_texts)

# Vectorize the text
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
y = labels
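
To make the vectorization step less of a black box, the sketch below computes TF-IDF weights by hand with the standard library only. It is an illustration, not a replacement for TfidfVectorizer: the smoothed IDF and L2 normalization follow scikit-learn's documented defaults, but tokenization here is a plain whitespace split, so the numbers should not be expected to match the library's output exactly. The three sample documents are invented.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocabulary, rows

docs = ["free money now", "meeting agenda now", "free free offer"]
vocab, matrix = tfidf_vectors(docs)
print(vocab)
for row in matrix:
    print([round(v, 3) for v in row])
```

In the first document, "money" ends up with a larger weight than "now" because "now" also appears in another document and is therefore less distinctive.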


So far, we have completed the work of data collection and preprocessing. We get the vectorized email data X and the corresponding label y. Next, we can use this data to train a machine learning model for spam filtering.

3. Select the model

In this project we use a Naive Bayes model. A Naive Bayes classifier is a simple probabilistic classifier based on Bayes' theorem that assumes the features are independent of one another given the class. Although this independence assumption rarely holds in practice, Naive Bayes classifiers still perform well in many scenarios, especially text classification.
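
The idea can be sketched in a few lines of plain Python. The class below is a toy multinomial Naive Bayes over raw word counts with add-one (Laplace) smoothing, written only to illustrate the independence assumption: the score of a class is its log prior plus the sum of per-word log likelihoods. The class name TinyNaiveBayes and the training sentences are invented for this example, and this is not how scikit-learn implements MultinomialNB internally.

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Toy multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(documents, labels):
            self.word_counts[label].update(doc.split())
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_score(self, document, label):
        # log P(label) + sum over words of log P(word | label)
        total_docs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total_docs)
        denom = sum(self.word_counts[label].values()) + len(self.vocabulary)
        for word in document.split():
            if word in self.vocabulary:  # words never seen in training are ignored
                score += math.log((self.word_counts[label][word] + 1) / denom)
        return score

    def predict(self, document):
        return max(self.class_counts, key=lambda label: self.log_score(document, label))

train_docs = ["win free money", "free prize claim now",
              "project meeting tomorrow", "lunch meeting notes"]
train_labels = ["spam", "spam", "ham", "ham"]
classifier = TinyNaiveBayes().fit(train_docs, train_labels)
print(classifier.predict("claim free money"))
print(classifier.predict("meeting notes tomorrow"))
```

Each word contributes to the score independently of its neighbors, which is exactly the "naive" assumption described above.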

3.1. Training model

We will use the MultinomialNB class from the scikit-learn library to implement the Naive Bayes model and train it on the preprocessed dataset. The following is the Python code for training the model:

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the Naive Bayes model and train it
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Compute the evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

The code above selects and trains the Naive Bayes model and computes its accuracy, precision, recall, and F1 score on the test set. In a later step, we will use the trained model to predict whether actual emails are spam.

4. Evaluate model performance

4.1. Split Dataset

We split the preprocessed dataset into a training set (80%) and a test set (20%) using the train_test_split function from scikit-learn. This code already appeared in the model training section and is repeated here:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
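
Conceptually, the split shuffles the sample indices with a fixed seed and cuts off a fraction for testing. The sketch below shows that idea with the standard library; the helper name split_indices is hypothetical, and the shuffled order will differ from what train_test_split produces for the same seed.

```python
import math
import random

def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle sample indices and split them into (train, test) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # a fixed seed makes the split reproducible
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

Because ham greatly outnumbers spam in this kind of corpus, it may also be worth passing stratify=y to train_test_split so that both sets keep a similar class ratio.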

4.2. Evaluation indicators

We evaluate model performance using metrics such as accuracy, precision, recall, and F1-score. This part of the code has also been given in the part of training the model, as follows:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Predict on the test set
y_pred = model.predict(X_test)

# Compute the evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
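
For reference, the four metrics can be derived directly from the confusion counts. The sketch below uses a small made-up set of labels (1 = spam, 0 = ham) and a hypothetical helper binary_metrics; it is meant to show the arithmetic, not to replace the scikit-learn functions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # spam caught
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # spam missed
    tn = sum(1 for t, p in pairs if t != positive and p != positive)  # ham passed through

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Invented labels: 3 spam and 5 ham messages, with one miss and one false alarm.
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = binary_metrics(y_true_demo, y_pred_demo)
print(metrics)
```

For a spam filter, precision is often the metric to watch most closely, since a false positive means a legitimate email lands in the junk folder.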

4.3. Cross Validation

To get a more reliable estimate of model performance, we can use cross-validation. Here we use the cross_val_score function from scikit-learn. The following Python code evaluates the model with 5-fold cross-validation:

from sklearn.model_selection import cross_val_score

# Compute the evaluation metrics with cross-validation
cv_accuracy = cross_val_score(model, X, y, cv=5, scoring='accuracy').mean()
cv_precision = cross_val_score(model, X, y, cv=5, scoring='precision').mean()
cv_recall = cross_val_score(model, X, y, cv=5, scoring='recall').mean()
cv_f1 = cross_val_score(model, X, y, cv=5, scoring='f1').mean()

print(f"Cross-Validation Accuracy: {cv_accuracy:.4f}")
print(f"Cross-Validation Precision: {cv_precision:.4f}")
print(f"Cross-Validation Recall: {cv_recall:.4f}")
print(f"Cross-Validation F1 Score: {cv_f1:.4f}")
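
What cv=5 means can be illustrated with a short index-splitting sketch: the data is cut into five folds, and each fold serves once as the test set while the other four are used for training. The helper k_fold_indices below is hypothetical and produces plain contiguous folds; according to the scikit-learn documentation, cross_val_score uses stratified folds for classifiers when cv is an integer, so the actual fold membership will differ.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous, non-overlapping folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(12, k=5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every sample appears in exactly one test fold, so the averaged score uses all of the data for evaluation without ever testing on a sample the model was trained on in that round.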

The code above covers the evaluation of model performance: splitting the dataset, computing the evaluation metrics, and cross-validation. These results help us judge how well the model is likely to predict spam in real applications.

5. Deploy the model to practical application

5.1. Save the model

We can use Python's pickle library to save and load the model. The following is the Python code to save the model:

import pickle

# Save the model
with open('spam_classifier_model.pkl', 'wb') as file:
    pickle.dump(model, file)

# Save the vectorizer
with open('vectorizer.pkl', 'wb') as file:
    pickle.dump(vectorizer, file)
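
The save and load calls form a round trip, which the sketch below demonstrates with an in-memory buffer and a plain dictionary standing in for the fitted model (the dictionary contents are invented). The same pickle.dump and pickle.load calls apply to files opened with 'wb' and 'rb'.

```python
import io
import pickle

# Stand-in for a fitted model: any picklable Python object behaves the same way.
fake_model = {"classes": ["ham", "spam"], "class_log_prior": [-0.69, -0.69]}

buffer = io.BytesIO()
pickle.dump(fake_model, buffer)   # same call as writing to a file opened with "wb"
buffer.seek(0)                    # rewind before reading back
restored = pickle.load(buffer)    # same call as reading from a file opened with "rb"

print(restored == fake_model)     # equal content
print(restored is fake_model)     # but a separate object
```

Two practical caveats: only unpickle files you created or trust, because loading a pickle can execute arbitrary code, and load the model with the same scikit-learn version that was used to save it.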

5.2. Mail processing

In practice, we need to preprocess the incoming emails to fit our trained model. This requires us to implement an email processing function, which mainly includes the following steps:

  1. Extract Email Body: Extract the body content from the received email. You can use Python's email library to parse the email and get the body.

  2. Text preprocessing: Perform the same preprocessing operations on the extracted text content as before, including cleaning, normalization, and vectorization.

  3. Model Prediction: Use the trained model to predict the preprocessed email body. Determine whether an email is spam or not based on the prediction result.

The following is the Python code that implements the mail processing function:

import email
import pickle

# Load the model and the vectorizer
with open('spam_classifier_model.pkl', 'rb') as file:
    model = pickle.load(file)

with open('vectorizer.pkl', 'rb') as file:
    vectorizer = pickle.load(file)

# Email processing function
# (reuses get_email_text and clean_text from section 2.2.2)
def process_email(raw_email):
    # Parse the email and extract the body
    email_obj = email.message_from_string(raw_email)
    email_text = get_email_text(email_obj)

    # Preprocess the text
    cleaned_text = clean_text(email_text)
    vectorized_text = vectorizer.transform([cleaned_text])

    # Predict with the model
    prediction = model.predict(vectorized_text)

    return "Spam" if prediction[0] == 1 else "Ham"

# Example: read an email from a text file and check whether it is spam
with open('example_email.txt', 'r') as file:
    raw_email = file.read()

result = process_email(raw_email)
print(f"Result: {result}")
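
Real incoming mail is often multipart and not always valid text in the local encoding, so one possible variation of the body-extraction step is to parse raw bytes and ask the message for its plain-text part. The sketch below assumes the standard library's default policy; the sample message and the helper name extract_plain_text are invented for illustration.

```python
from email import message_from_bytes, policy

# Hypothetical multipart message with a plain-text and an HTML alternative.
RAW = (
    b"From: promo@example.com\r\n"
    b"Subject: Offer\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="XYZ"\r\n'
    b"\r\n"
    b"--XYZ\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Plain text version\r\n"
    b"--XYZ\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"\r\n"
    b"<p>HTML version</p>\r\n"
    b"--XYZ--\r\n"
)

def extract_plain_text(raw_bytes):
    """Return the decoded text/plain body, or an empty string if there is none."""
    message = message_from_bytes(raw_bytes, policy=policy.default)
    body = message.get_body(preferencelist=("plain",))
    return body.get_content() if body is not None else ""

plain_text = extract_plain_text(RAW)
print(plain_text.strip())
```

Unlike get_payload(), get_content() also decodes base64 or quoted-printable bodies, which matters because the classifier was trained on readable text.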

With the code above, the model can be used in an actual application. This code can be integrated into an email client or server for real-time spam filtering.

6. Integrate into email client

For ease of use, we can integrate spam filters into existing mail clients. This requires us to implement a plugin or extension that automatically invokes our spam filter for processing when new mail arrives. How this is done depends on the mail client being used.

Here we take Microsoft Outlook as an example of how to integrate the spam filter into a mail client. Outlook supports VBA (Visual Basic for Applications) macros, which we can use to call our Python spam filter. The steps are as follows:

6.1. Create an Outlook macro

  1. Open Outlook and click the "Developer" tab. If you don't see the Developer tab, you can enable it in File -> Options -> Customize the Ribbon.

  2. Click the "Visual Basic" button to open the VBA editor.

  3. On the left side of the VBA editor, double-click "ThisOutlookSession" to open the code editing window.

  4. In the code editing window, enter the following code:

    Option Explicit
    
    Sub ProcessNewEmail(Item As Outlook.MailItem)
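        ' Call the Python script to process the email and get the prediction
        Dim result As String
        result = RunPythonScript("process_email.py", Item.Body)
        
        ' Handle the email according to the prediction
        If result = "Spam" Then
            ' Move the email to the junk mail folder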
            Dim spamFolder As Outlook.MAPIFolder
            Set spamFolder = Application.Session.GetDefaultFolder(olFolderJunk)
            Item.Move spamFolder
        End If
    End Sub
    

  5. Click "File" -> "Save this Outlook Session".

6.2. Create a Python script

6.2.1 Create a new Python script file, process_email.py, and put the previously implemented mail processing function process_email into it. The function needs to be modified so that it accepts the email body as a parameter and prints the prediction result to standard output after processing.

import sys

def process_email(email_text):
    # ...
    # text preprocessing and model prediction code goes here
    # ...

    return "Spam" if prediction[0] == 1 else "Ham"

if __name__ == "__main__":
    email_text = sys.argv[1]
    result = process_email(email_text)
    print(result)
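
Passing the whole email body as a command-line argument can be fragile, since long bodies or bodies containing quotes and line breaks are hard to quote correctly. One possible variation, sketched below, falls back to reading the body from standard input when no argument is given. The helper names read_email_text and classify are hypothetical, and classify uses a placeholder keyword rule standing in for the real preprocessing and model.predict call.

```python
import io

def read_email_text(argv, stdin):
    """Return the body from argv[1] if present, otherwise read it from stdin."""
    if len(argv) > 1:
        return argv[1]
    return stdin.read()

def classify(email_text):
    # Placeholder rule standing in for the real preprocessing + model.predict call.
    return "Spam" if "free money" in email_text.lower() else "Ham"

# Simulated invocations; a real script would pass sys.argv and sys.stdin instead.
from_argument = read_email_text(["process_email.py", "Free money inside"], io.StringIO(""))
from_stdin = read_email_text(["process_email.py"], io.StringIO("Agenda for Monday"))
print(classify(from_argument), classify(from_stdin))
```

With this shape, the caller could write the body to a file and redirect it into the script rather than embedding it in the command line.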

6.2.2 Place process_email.py in the same folder as the trained model file and the vectorizer file.

6.3. Call Python script

6.3.1 To call a Python script from VBA, we need to create a function called RunPythonScript that executes the Python script and returns the result. Enter the following code in the code editing window of the VBA editor:

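Function RunPythonScript(scriptPath As String, emailBody As String) As String
    Dim shell As Object
    Dim command As String
    Dim tempFile As String
    Dim fso As Object
    Dim file As Object
    Dim result As String

    ' Create a temporary file to hold the Python script's output
    Set fso = CreateObject("Scripting.FileSystemObject")
    tempFile = fso.GetSpecialFolder(2) & "\" & fso.GetTempName

    ' Build the command line. Output redirection (>) is a cmd.exe feature,
    ' so the command is run through "cmd /c".
    command = "cmd /c python " & scriptPath & " """ & emailBody & """ > """ & tempFile & """"

    ' Run the command in a hidden window and wait for it to finish
    Set shell = CreateObject("WScript.Shell")
    shell.Run command, 0, True

    ' Read the output from the temporary file
    Set file = fso.OpenTextFile(tempFile, 1)
    result = file.ReadAll
    file.Close

    ' Delete the temporary file
    fso.DeleteFile tempFile

    ' Return the result without the line break that print() appends
    result = Replace(Replace(result, vbCr, ""), vbLf, "")
    RunPythonScript = Trim(result)
End Function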
 
6.4. Create mail rules

1. Return to the main interface of Outlook, click "Rules" -> "Manage Rules and Alerts".

2. Click "New Rule" and select "Mail received through a specific account".

3. Select your email account and click Next.

4. No conditions need to be set: click "Next", then select "Yes" in the confirmation prompt.

5. Select "Run a script" and click the "Script" link.

6. In the pop-up window, select the newly created `ProcessNewEmail` macro, and click "OK".

7. Click Finish to save the rule.

Now, when you receive new mail, Outlook will automatically call our spam filter to process it. If the email is judged as spam, it will be moved to the spam folder.

It should be noted that this method relies on the locally installed Python environment. Integration methods may vary in different mail clients. The specific implementation method depends on the mail client used and the extension methods it supports.
 


Origin blog.csdn.net/a871923942/article/details/130556593