Hejing Community Data Analysis Weekly Challenge [Issue 97: Technical Blog Text Analysis]

1. Background description


This dataset contains 34k+ article data on the GeeksforGeeks website.

GeeksforGeeks is a computer science portal that provides learning resources and reference material in various programming languages for both beginners and experienced programmers. The site covers all the core computer science topics, including data structures, algorithms, operating systems, computer networks, databases, and more.

If you are learning data structures and algorithms, you can find many articles on this site on how to solve different types of problems. These articles often include detailed explanations and sample code to help you better understand the problem and the solution.
Additionally, you can use the online editor to practice your programming skills and check out other users' answers for inspiration and ideas after completing challenges.

2. Data description

Field          Description
title          Title of the article
author_id      Author of the article
last_updated   Date the article was last updated
link           Link to the article on GeeksforGeeks
category       Category of the article

3. Problem description

Exploratory data analysis (EDA) of the website's article data

Text classification of article titles

4. Data import

import pandas as pd

# Load the article dataset
articles_df = pd.read_csv('articles.csv')

# Preview the first few rows
articles_df.head()

[Output: first five rows of articles_df]

The dataset contains the following fields:

  • title: the title of the article
  • author_id: the author of the article
  • last_updated: the date the article was last updated
  • link: the link to the article on GeeksforGeeks
  • category: the category of the article
# Check the overall structure of the data
articles_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34574 entries, 0 to 34573
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   title         34574 non-null  object
 1   author_id     34555 non-null  object
 2   last_updated  34556 non-null  object
 3   link          34574 non-null  object
 4   category      34574 non-null  object
dtypes: object(5)
memory usage: 1.3+ MB

Here is the basic information of our dataset:

  • We have 34,574 records.
  • The dataset contains 5 fields, all stored as object (string) type.
  • The title, link, and category fields have no missing values.
  • The author_id and last_updated fields have a small number of missing values (a quick check is sketched below).
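
Before moving on, it is worth confirming exactly where those missing values sit. A minimal sketch, reusing the articles_df loaded above:

# Count missing values per column
articles_df.isnull().sum()

# Peek at a few rows whose author_id is missing
articles_df[articles_df['author_id'].isna()].head()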

5. Data exploratory analysis

Next, we will examine these fields one by one. Starting with the category field, we will check how many distinct categories it contains and how many articles fall into each one. This tells us how articles are categorized and whether the class sizes are balanced. We will then visualize this information to make it easier to grasp at a glance.

# Count the number of articles in each category
category_counts = articles_df['category'].value_counts()

category_counts
medium    10440
easy       9663
basic      8144
hard       4232
expert     2095
Name: category, dtype: int64

From the analysis results of the category field, we can see that:

  • Articles are divided into five categories: medium, easy, basic, hard, and expert.

  • The medium category has the most articles with 10,440 articles.

  • The expert category has the lowest number of articles with 2,095 articles.

from pyecharts import options as opts
from pyecharts.charts import Bar

categories = category_counts.index.tolist()
article_counts = category_counts.values.tolist()

# Create the bar chart
bar_chart = Bar()
bar_chart.set_global_opts(
    title_opts=opts.TitleOpts(title="Article Category Distribution"),
    xaxis_opts=opts.AxisOpts(name="Category"),
    yaxis_opts=opts.AxisOpts(name="Number of Articles"),
)

# Add the data
bar_chart.add_xaxis(categories)
bar_chart.add_yaxis("", article_counts)

# Show the value labels above each bar
bar_chart.set_series_opts(
    label_opts=opts.LabelOpts(
        position="top", rotate=0
    )
)

# Render the chart in the notebook
bar_chart.render_notebook()

[Figure: bar chart of the article category distribution]

This is the distribution of article categories. From this bar chart, we can see:

  • The medium category has the most articles, which suggests that most articles on the site are of moderate difficulty.

  • The easy and basic categories also contain many articles, likely to serve readers who are new to programming.

  • The relatively small number of hard and expert articles may reflect lower demand for, or the higher cost of writing, advanced material.

Next, we'll analyze the author_id field. We'll check how many different authors there are and how many articles each author has published. This will help us understand which authors are more active, and the distribution of author and article counts.

# Count the number of articles published by each author
author_counts = articles_df['author_id'].value_counts()

# Show the 10 authors with the most articles
author_counts.head(10)
GeeksforGeeks          11957
ManasChhabra2            317
Striver                  265
manjeet_04               246
Chinmoy Lenka            192
pawan_asipu              157
sarthak_ishu11           151
anuupadhyay              148
Shubrodeep Banerjee      143
ankita_saini             125
Name: author_id, dtype: int64

Here are the top 10 authors and the number of articles they published:

  • GeeksforGeeks is the author with the most published articles (11,957).
  • ManasChhabra2 published 317 articles.
  • Striver published 265 articles.
  • manjeet_04 published 246 articles.
  • Chinmoy Lenka published 192 articles.
  • pawan_asipu published 157 articles.
  • sarthak_ishu11 published 151 articles.
  • anuupadhyay published 148 articles.
  • Shubrodeep Banerjee published 143 articles.
  • ankita_saini published 125 articles.

This information can help us understand which authors are more active on the platform.

authors = author_counts.head(10).index.tolist()
article_counts = author_counts.head(10).values.tolist()

# Create the bar chart
bar_chart = Bar()
bar_chart.set_global_opts(
    title_opts=opts.TitleOpts(title="Top 10 Authors by Number of Articles"),
    xaxis_opts=opts.AxisOpts(name="Author"),
    yaxis_opts=opts.AxisOpts(name="Number of Articles"),
)

# Add the data
bar_chart.add_xaxis(authors)
bar_chart.add_yaxis("", article_counts)

# Rotate the value labels shown above the bars
bar_chart.set_series_opts(
    label_opts=opts.LabelOpts(
        position="top", rotate=45
    )
)

# Render the chart in the notebook
bar_chart.render_notebook()

[Figure: bar chart of the top 10 authors by number of articles]

From the figure, we can clearly see:

  • GeeksforGeeks is the most active author, publishing far more articles than any other author.

  • The remaining authors in the top 10 have broadly similar article counts, ranging from roughly 125 to 320.

Next, we'll work on the last_updated field. It is a date field, so we need to convert it to a proper datetime format; we can then analyze when articles were updated, for example which months see the most updates and how the number of updates changes from year to year.

# Convert the 'last_updated' column to datetime
articles_df['last_updated'] = pd.to_datetime(articles_df['last_updated'], errors='coerce')

articles_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34574 entries, 0 to 34573
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   title         34574 non-null  object        
 1   author_id     34555 non-null  object        
 2   last_updated  34460 non-null  datetime64[ns]
 3   link          34574 non-null  object        
 4   category      34574 non-null  object        
dtypes: datetime64[ns](1), object(4)
memory usage: 1.3+ MB

The last_updated field has been successfully converted to a datetime type. However, it is worth noting that the number of non-null values in this field dropped from 34,556 to 34,460: some values that could not be parsed as valid dates became null during the conversion. This is because we passed errors='coerce' to pd.to_datetime; whenever a value cannot be converted to a date, Pandas turns it into NaT (a missing date).
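
If we want to see which raw values failed to parse, one option is a minimal sketch that re-reads the raw column and compares it with the parsed result (the raw strings in articles_df were already overwritten above):

# Re-read the raw last_updated column and find strings that could not be parsed
raw = pd.read_csv('articles.csv', usecols=['last_updated'])
parsed = pd.to_datetime(raw['last_updated'], errors='coerce')

# Values that were present but failed to convert
unparseable = raw.loc[raw['last_updated'].notna() & parsed.isna(), 'last_updated']
unparseable.unique()[:10]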

Next, we will extract the year and month information and analyze how articles are updated in different years and months. We can show this information through counting and visualization. This will help us understand the update trend of articles, such as which months have more articles updated, and whether the number of articles updated each year has a trend of increasing or decreasing, etc.

# Extract the year and month from the 'last_updated' column
articles_df['year'] = articles_df['last_updated'].dt.year
articles_df['month'] = articles_df['last_updated'].dt.month

articles_df.head()

[Output: articles_df with the new year and month columns]

The year and month information has been successfully extracted from the last_updated field and stored in the new fields year and month.

# Count the number of articles updated in each year
year_counts = articles_df['year'].value_counts().sort_index()

# Count the number of articles updated in each month
month_counts = articles_df['month'].value_counts().sort_index()

year_counts, month_counts
(2010.0        1
 2011.0        1
 2012.0        5
 2013.0       70
 2014.0       53
 2015.0      172
 2016.0      200
 2017.0     1021
 2018.0     2524
 2019.0     3985
 2020.0     4625
 2021.0    18616
 2022.0     3187
 Name: year, dtype: int64,
 1.0     3206
 2.0     2716
 3.0     1510
 4.0     2318
 5.0     3331
 6.0     4060
 7.0     2824
 8.0     3193
 9.0     2919
 10.0    2775
 11.0    2933
 12.0    2675
 Name: month, dtype: int64)
# Bar chart of the number of articles updated each year
bar_chart_year = Bar()
bar_chart_year.set_global_opts(
    title_opts=opts.TitleOpts(title="Article Updates per Year"),
    xaxis_opts=opts.AxisOpts(name="Year"),
    yaxis_opts=opts.AxisOpts(name="Number of Articles"),
)

# Add the data
bar_chart_year.add_xaxis(year_counts.index.tolist())
bar_chart_year.add_yaxis("", year_counts.values.tolist())

# Rotate the value labels shown above the bars
bar_chart_year.set_series_opts(
    label_opts=opts.LabelOpts(
        position="top", rotate=45
    )
)

# Render the chart in the notebook
bar_chart_year.render_notebook()

[Figure: bar chart of article updates per year]

# Bar chart of the number of articles updated each month
bar_chart_month = Bar()
bar_chart_month.set_global_opts(
    title_opts=opts.TitleOpts(title="Article Updates per Month"),
    xaxis_opts=opts.AxisOpts(name="Month"),
    yaxis_opts=opts.AxisOpts(name="Number of Articles"),
)

# Add the data
bar_chart_month.add_xaxis(month_counts.index.tolist())
bar_chart_month.add_yaxis("", month_counts.values.tolist())

# Show the value labels above each bar
bar_chart_month.set_series_opts(
    label_opts=opts.LabelOpts(
        position="top", rotate=0
    )
)

# Render the chart in the notebook
bar_chart_month.render_notebook()

[Figure: bar chart of article updates per month]

Here are the bar charts of the number of article updates per year and per month.

From the yearly chart, we can see:

  • From 2010 to 2017, the number of updated articles grew overall, though fairly slowly.
  • In 2018, the number of updates rose sharply, reaching 2,524.
  • In 2019 and 2020, the number of updated articles continued to increase, reaching 3,985 and 4,625 respectively.
  • In 2021, updates peaked at 18,616 articles.
  • In 2022, the number of updates fell back but remained high (3,187 articles).

From the monthly chart, we can see:

  • The number of updates per month does not vary dramatically, mostly between about 2,500 and 4,000.
  • June had the most article updates (4,060), while March had the fewest (1,510).

This information can help us understand the update trend of articles, and which times of the year more articles are updated.

6. Text classification prediction for article titles

1. Data preprocessing

Before making text classification predictions, we need to preprocess the data. Text preprocessing is an important step that helps us clean and prepare data for more efficient model training. Preprocessing steps may include:

  • Text cleaning: remove or replace specific characters or words, such as punctuation, numbers, and stop words.

  • Tokenization: break article titles down into words or phrases, called tokens (a simple sketch follows this list).

  • Vectorization: convert the tokens into numeric vectors that the model can work with.
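
As a concrete illustration of the cleaning and tokenization steps, here is a minimal sketch. In the pipeline below these steps are actually handled inside TfidfVectorizer, so this is for intuition only, and the lowercasing and regular expression used are illustrative choices:

import re

def simple_tokenize(title):
    # Lowercase, replace everything except letters and digits with spaces, then split
    cleaned = re.sub(r'[^a-z0-9\s]', ' ', title.lower())
    return cleaned.split()

simple_tokenize("Program to find GCD or HCF of two numbers")
# ['program', 'to', 'find', 'gcd', 'or', 'hcf', 'of', 'two', 'numbers']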

Before preprocessing, we need to split the dataset into a training set and a test set. The training set is used to train the model, and the test set is used to evaluate its performance. We'll use scikit-learn's train_test_split function to do this.

Next, we start preprocessing and dataset partitioning.

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    articles_df['title'], articles_df['category'], test_size=0.2, random_state=42
)

# Initialize a TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)

# Fit on the training data and transform it
X_train_vec = vectorizer.fit_transform(X_train)

# Transform the test data
X_test_vec = vectorizer.transform(X_test)

X_train_vec.shape, X_test_vec.shape
((27659, 1000), (6915, 1000))

We have successfully split the dataset into training and test sets and preprocessed the article titles. We used TfidfVectorizer to convert the titles into numeric vectors. This vectorizer computes a TF-IDF weight for each word, a measure widely used in information retrieval and text mining to reflect how important a word is to a document within a collection.

We kept only the 1,000 most frequent terms (max_features=1000) to represent the titles. This reduces the number of features and makes the model easier to train while still retaining most of the useful information. We also removed English stop words, words such as "the", "is", and "and" that are very common in text but usually carry little useful information.
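
To sanity-check what the vectorizer kept, we can peek at part of its vocabulary. This is an optional check; get_feature_names_out requires a reasonably recent version of scikit-learn:

# Inspect part of the 1,000-term vocabulary retained by the vectorizer
feature_names = vectorizer.get_feature_names_out()
len(feature_names), feature_names[:15]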

The training set contains 27,659 articles and the test set contains 6,915 articles. Each article is represented as a vector of length 1000.

Next, we will use the logistic regression model, support vector machine model and random forest model to predict the text classification of article titles.

2. Logistic regression model

Logistic regression is a simple but effective classification model suitable for binary or multi-class problems. We will use scikit-learn's LogisticRegression class to create this model.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

# Initialize the LogisticRegression model
model = LogisticRegression(solver='liblinear', random_state=42)

# Train the model
model.fit(X_train_vec, y_train)

# Predict the categories of the test data
y_pred = model.predict(X_test_vec)

lr_accuracy = accuracy_score(y_test, y_pred)

print(classification_report(y_test, y_pred))
print('Accuracy: ', lr_accuracy)
              precision    recall  f1-score   support

       basic       0.41      0.42      0.42      1607
        easy       0.34      0.36      0.35      1950
      expert       0.40      0.04      0.08       408
        hard       0.37      0.05      0.08       876
      medium       0.38      0.57      0.45      2074

    accuracy                           0.37      6915
   macro avg       0.38      0.28      0.27      6915
weighted avg       0.38      0.37      0.34      6915

Accuracy:  0.3745480838756327

Here is the performance report of our logistic regression model on the test set. It includes the precision, recall, and F1 score for each category, as well as the overall accuracy.

  • Precision is the proportion of samples predicted as positive that are actually positive; it measures how reliable the positive predictions are.
  • Recall is the proportion of actual positive samples that are correctly predicted as positive; it measures how well the positives are identified.
  • The F1 score is the harmonic mean of precision and recall and summarizes both in a single number (a quick worked check follows this list).
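
As a quick worked check of the F1 formula, plugging the rounded medium-class numbers from the report above into it gives approximately the reported value:

# F1 as the harmonic mean of precision and recall, using the 'medium' row above
precision, recall = 0.38, 0.57
f1 = 2 * precision * recall / (precision + recall)
round(f1, 2)  # 0.46, close to the reported 0.45 (the small gap comes from rounding P and R)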

From the report, we can see:

  • The medium category has the highest F1 score (0.45), indicating that the model performs best on articles in this category.
  • The expert and hard categories have much lower F1 scores (0.08 each), indicating that the model performs poorly on these two categories.
  • The overall accuracy is 0.37, meaning the model predicts the correct category about 37% of the time.

These results are likely affected by class imbalance: the medium category has far more articles than the others. To improve performance, we could try techniques for handling imbalanced data, such as oversampling the minority classes, undersampling the majority class, or generating synthetic samples; we could also try other models and parameters, or further refine the feature extraction. One lightweight option, class weighting, is sketched below.
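
As a sketch rather than a tuned solution, scikit-learn's class_weight parameter re-weights classes inversely to their frequency; whether this actually helps on this dataset would need to be verified:

# Logistic regression with class re-weighting to counter the imbalance
weighted_model = LogisticRegression(solver='liblinear', class_weight='balanced', random_state=42)
weighted_model.fit(X_train_vec, y_train)

weighted_pred = weighted_model.predict(X_test_vec)
print(classification_report(y_test, weighted_pred))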

We will use a confusion matrix to visualize the model's predictions. The confusion matrix is a commonly used way to present a model's performance: it shows, for each true category, how the model's predictions are distributed.

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Compute the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix
plt.figure(figsize=(10, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=model.classes_, yticklabels=model.classes_)
plt.title('Confusion Matrix')
plt.xlabel('Predicted category')
plt.ylabel('True category')
plt.show()

[Figure: confusion matrix heatmap for the logistic regression model]

Each row of the confusion matrix represents the true class, and each column represents the predicted class. Numbers on the diagonal indicate the number of samples the model correctly predicted, and numbers off the diagonal indicate the number of samples the model predicted incorrectly.
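
As a small optional check, the diagonal can be turned directly into a per-class recall (rows and columns follow the order of model.classes_):

# Per-class recall: correct predictions on the diagonal divided by the row totals
per_class_recall = cm.diagonal() / cm.sum(axis=1)
for cls, rec in zip(model.classes_, per_class_recall):
    print(f"{cls}: {rec:.2f}")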

From the confusion matrix, we can see:

  • The model performs best on the medium category; most medium articles are predicted correctly.

  • The model performs poorly on the expert and hard categories; most articles in those categories are misclassified into other categories.

3. Support vector machine model

A support vector machine is a commonly used classification model whose goal is to find a hyperplane that maximizes the margin between classes; LinearSVC handles our multi-class problem by training one-vs-rest classifiers.

from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Initialize the LinearSVC model
svm_model = LinearSVC(random_state=42)

# Train the model
svm_model.fit(X_train_vec, y_train)

# Predict the categories of the test data
svm_pred = svm_model.predict(X_test_vec)

# Compute the accuracy
svm_accuracy = accuracy_score(y_test, svm_pred)

print('Accuracy:', svm_accuracy)
Accuracy: 0.3683297180043384

4. Random Forest Model

Random forest is an ensemble model that builds many decision trees and aggregates their predictions, by majority vote for classification or by averaging for regression.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Initialize the random forest model
rf_model = RandomForestClassifier(random_state=42)

# Train the model
rf_model.fit(X_train_vec, y_train)

# Predict the categories of the test data
rf_pred = rf_model.predict(X_test_vec)

# Compute the accuracy
rf_accuracy = accuracy_score(y_test, rf_pred)

print('Accuracy:', rf_accuracy)
Accuracy: 0.3515545914678236

5. Model comparison

Here are the test set accuracies of our three models:

  • The logistic regression model has an accuracy of 37.45%

  • The SVM model has an accuracy of 36.83%

  • The accuracy of the random forest model is 35.16%

These results show that for this task, the logistic regression model performs best and the random forest model performs the worst.

Next, we will visualize these accuracies using graphs to more intuitively compare the performance of the models.

models = ['Logistic Regression', 'SVM', 'Random Forest']
accuracies = [lr_accuracy, svm_accuracy, rf_accuracy]

# Create the bar chart
bar_chart = Bar()
bar_chart.set_global_opts(
    title_opts=opts.TitleOpts(title="Model Comparison"),
    xaxis_opts=opts.AxisOpts(name="Model"),
    yaxis_opts=opts.AxisOpts(name="Accuracy", min_=0.3, max_=0.4),
)

# Add the data
bar_chart.add_xaxis(models)
bar_chart.add_yaxis("", accuracies)

# Render the chart in the notebook
bar_chart.render_notebook()

[Figure: bar chart comparing the test accuracy of the three models]

Here is a comparison of the accuracy of our three models on the test set, from which we can see:

  • The logistic regression model achieved the highest accuracy at 37.45%.
  • The accuracy of the support vector machine model was next at 36.83%.
  • The Random Forest model had the lowest accuracy at 35.16%.

Next, let's briefly discuss the pros and cons of these three models:

  • Logistic regression model:
    • Advantages: a simple model that trains and predicts quickly, is easy to interpret, and is not particularly prone to overfitting.
    • Disadvantages: it assumes the target follows a Bernoulli distribution given the features and that the decision boundary is linear, so it may predict poorly when these assumptions do not hold.
  • Support vector machine model:
    • Advantages: works well in high-dimensional (large) feature spaces; with kernels it can capture nonlinear feature interactions; its decision function depends only on the support vectors rather than the entire dataset.
    • Disadvantages: training can be slow when there are many samples; for nonlinear problems there is no universal recipe, and finding a suitable kernel function can be difficult.
  • Random forest model:
    • Advantages: handles both binary and multi-class problems; copes with high-dimensional data without requiring feature selection; after training it can report which features are important.
    • Disadvantages: it can overfit on noisy classification or regression problems; for categorical attributes, those with many distinct values can dominate the splits, so the feature importances computed on such data may not be reliable.

Source: blog.csdn.net/qq_52417436/article/details/131636642