1. Training tasks
1. Crawler writing: Choose a site whose content interests you and crawl text corpora of several different types (for example, reviews of different movie genres, microblog posts about different trending events, forum threads on different sports, or user posts on apps such as Xiaohongshu). Collect at least 500 items for each type.
2. Text feature extraction: Segment each text into words, count word frequencies, and compute the TF-IDF value of every word in every text. For the TF-IDF concept, see http://www.ruanyifeng.com/blog/2013/03/tf-idf.html
(Use jieba, the word segmentation tool used in class. You are encouraged to write TF-IDF, the other feature selection methods, and the classifiers used later yourself: solutions that rely on the sklearn module earn only the basic points, while self-written code scores higher.)
3. Visualization: Use matplotlib to show the data distribution. Any statistic may be plotted (for example, the distribution of TF-IDF values over sub-ranges of the [0, 1] interval, or the number of texts crawled per type), using line charts, histograms, or other chart types.
4. Perform PCA dimensionality reduction on the data and plot the result with matplotlib (sklearn may be used for PCA).
5. Understand the bag-of-words model and use it to represent each text. Classify the texts with two supervised methods (naive Bayes and K-nearest neighbors), cluster them with two unsupervised methods (hierarchical clustering and K-means), and show the resulting categories with matplotlib.
6. Compare the four models and evaluate their performance (recall, precision, accuracy, and F-score).
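The four metrics named in task 6 can be computed without sklearn. The following is a minimal sketch, assuming integer class labels and a support-weighted average over classes (the same averaging mode the report later requests from sklearn with average='weighted'); the function name weighted_scores is hypothetical.

```python
from collections import Counter

def weighted_scores(true_labels, predicted_labels):
    """Return (accuracy, precision, recall, f1), support-weighted over classes."""
    true_labels = list(true_labels)
    predicted_labels = list(predicted_labels)
    n = len(true_labels)
    support = Counter(true_labels)          # true samples per class
    predicted = Counter(predicted_labels)   # predictions per class
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)

    accuracy = sum(hits.values()) / n
    precision = recall = f1 = 0.0
    for label, count in support.items():
        # Per-class precision and recall; a class that is never predicted gets precision 0.
        p = hits[label] / predicted[label] if predicted[label] else 0.0
        r = hits[label] / count
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = count / n                  # share of samples in this class
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1

scores = weighted_scores([0, 0, 1, 1], [0, 1, 1, 1])
```

With this weighting, the weighted recall always equals the accuracy, which is worth keeping in mind when the two numbers are reported side by side.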
2. Training requirements
1. Describe the crawled content;
2. Include the code and results in the report, comment the code in detail, and explain your own understanding of each model;
3. Compare the models in detail.
3. Main work of the training
(1) Code implementation:
# Crawl Douban Movies short reviews for each movie genre
import requests
from lxml import etree
import json
import time

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36',
    # Douban returns 403 Forbidden to requests that are not logged in, so the Cookie
    # of a logged-in session is sent as well (value elided here).
    'Cookie': '...',
}

# Fetch the short reviews of one movie and append them to the open file f
def crawling(id, f):
    start = 0
    while True:
        try:
            url = ('https://movie.douban.com/subject/' + str(id) + '/comments?start=' + str(start)
                   + '&limit=20&status=P&sort=new_score')
            if start > 40:  # upper limit on the number of reviews crawled per movie
                break
            response = requests.get(url, headers=headers)
            response.encoding = 'utf-8'
            selector = etree.HTML(response.text)
            comments = selector.xpath("//div[@id='comments']/div[@class='comment-item ']/div[@class='comment']")
            if len(comments) == 0:
                break
            for comment in comments:
                # Reviewer name, star rating and time are extracted but not saved
                reviewer_name = comment.xpath("h3/span[@class='comment-info']/a/text()")[0]
                comment_star = comment.xpath("h3/span[@class='comment-info']/span")[1].xpath("@class")[0] + ""
                comment_time = comment.xpath("h3/span[@class='comment-info']/span[@class='comment-time ']")[0].xpath(
                    "string(.)").strip()
                # Review text, flattened to a single line
                comment_content = comment.xpath("p/span[@class='short']")[0].xpath("string(.)").strip()
                comment_content = comment_content.replace('\n', '').replace('\r', '')
                f.write(comment_content + '\n')
            start += 20
        except Exception as e:
            print(e)
            break

# Example search URL for one genre
sum_url = 'https://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=&start=0&genres=剧情'
# Genre names. Douban's genres parameter takes the Chinese genre name (for example
# '剧情' for drama in the URL above); the names are shown here in translation.
type_movie = ['drama', 'comedy', 'action', 'romance', 'science fiction', 'animation',
              'suspense', 'adventure', 'disaster', 'martial arts', 'fantasy', 'western',
              'war', 'history', 'biography', 'music', 'horror', 'crime']
# Get the movie ids of each genre, then crawl the reviews of every movie.
# The slice [-2:] restricts this run to the last two genres in the list.
for tt in type_movie[-2:]:
    sum_url = 'https://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=&start=0&genres=' + tt
    response = requests.get(sum_url, headers=headers).text
    response = json.loads(response)
    text_dir = './data/' + tt + '.txt'
    with open(text_dir, 'a+', encoding='utf-8') as f:
        for i in response['data']:
            print(i['id'], i['title'])
            # get movie reviews
            crawling(i['id'], f)
            time.sleep(2)
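The crawler above assembles its page URLs by string concatenation. As a hedged alternative, the sketch below builds the same review-page URL with urllib.parse.urlencode and lists the page offsets that the start > 40 limit allows. The helper names build_comment_url and page_offsets and the movie id in the usage line are hypothetical.

```python
from urllib.parse import urlencode

def build_comment_url(movie_id, start, limit=20):
    """Build the URL of one page of a movie's short-review list."""
    query = urlencode({'start': start, 'limit': limit, 'status': 'P', 'sort': 'new_score'})
    return f'https://movie.douban.com/subject/{movie_id}/comments?{query}'

def page_offsets(max_start=40, limit=20):
    """Offsets 0, limit, 2*limit, ... up to and including max_start."""
    return list(range(0, max_start + 1, limit))

# Hypothetical movie id, used only to show the resulting URLs.
urls = [build_comment_url(123456, start) for start in page_offsets()]
```

With the default limit, three pages (offsets 0, 20 and 40) are requested, so at most 60 reviews are collected per movie.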
# Read the data and build a DataFrame
import pandas as pd

type_movie = ['drama', 'comedy', 'action', 'romance']  # , 'science fiction', 'animation', 'suspense', 'adventure', 'disaster', 'martial arts', 'fantasy', 'western', 'war', 'history', 'biography', 'music', 'horror', 'crime']
text_list = []
for num in range(0, len(type_movie)):
    file_path = './data/' + type_movie[num] + '.txt'
    with open(file_path, encoding='utf-8') as f:
        content = f.read()
    for word in content.split('\n'):
        # One row per review: [review text, genre index used as the label]
        text_list.append([word, num])
tfidf = pd.DataFrame(text_list, columns=['text', 'label'])
tfidf.head()
# Segment the text with jieba and remove stop words
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

with open('stopwords.txt', encoding='utf-8') as f:
    s_content = f.read()
ss = s_content.split('\n')
# This step is slow (about 40 seconds) because the filtering is done in a plain Python loop rather than by an sklearn function
sum_out_str_list = []
for sentence in tfidf['text']:
    outstr = ''
    sentence_depart = jieba.cut(str(sentence).strip())
    for word in sentence_depart:
        if word not in ss:
            if word != '\t':
                outstr += word + ' '
    sum_out_str_list.append(outstr)
tfidf.loc[:, 'text'] = sum_out_str_list
# Preview: TF-IDF restricted to the 18 most frequent terms
tfidf_model = TfidfVectorizer(max_features=18)
tfidf_df = pd.DataFrame(tfidf_model.fit_transform(tfidf['text']).todense())
tfidf_df.columns = sorted(tfidf_model.vocabulary_)
tfidf_df.head()
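Much of the running time of the filtering loop above probably comes from testing each word against a Python list, which is a linear scan. A minimal sketch of the same filter with a set lookup follows. The stop-word list and the token list are made-up stand-ins for stopwords.txt and the output of jieba.cut, and the function name remove_stopwords is hypothetical.

```python
def remove_stopwords(tokens, stopwords):
    """Join the tokens that are neither stop words nor blank (tabs, spaces)."""
    return ' '.join(t for t in tokens if t.strip() and t not in stopwords)

# Hypothetical stop words; in the report they would be read from stopwords.txt.
stopwords = {'the', 'a', 'is'}
# Stand-in for the token stream that jieba.cut(sentence) yields.
tokens = ['the', 'plot', 'is', '\t', 'a', 'mess']
cleaned = remove_stopwords(tokens, stopwords)
```

Unlike the loop above, this version does not leave a trailing space, which makes no difference to a whitespace-based vectorizer.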
# Compute TF-IDF to extract text features
# 1000-term feature matrix
tfidf_model = TfidfVectorizer(max_features=1000)
tfidf_df = pd.DataFrame(tfidf_model.fit_transform(tfidf['text']).todense())
tfidf_df.columns = sorted(tfidf_model.vocabulary_)
# 50-term feature matrix
tfidf_model1 = TfidfVectorizer(max_features=50)
tfidf_df1 = pd.DataFrame(tfidf_model1.fit_transform(tfidf['text']).todense())
tfidf_df1.columns = sorted(tfidf_model1.vocabulary_)
tfidf_df.head()
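Since the task statement rewards hand-written feature extraction, here is a minimal TF-IDF sketch that does not use sklearn. It assumes the textbook definition, TF = count / document length and IDF = log(N / (df + 1)); sklearn's TfidfVectorizer uses a differently smoothed IDF and L2-normalizes each row by default, so the numbers will not match the table above. The function name tfidf_weights and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {word: tf-idf} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))        # count each word once per document
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            w: (c / total) * math.log(n_docs / (doc_freq[w] + 1))
            for w, c in counts.items()
        })
    return weights

docs = [s.split() for s in ['good plot good cast', 'bad plot', 'slow pace', 'weak ending']]
weights = tfidf_weights(docs)
```

Because of the +1 in the denominator, a word that occurs in every document gets a slightly negative weight under this definition.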
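# PCA dimensionality reduction
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import matplotlib
import matplotlib.font_manager as fm

# Project the TF-IDF features onto the first two principal components
pca = PCA(n_components=2)
pca.fit(tfidf_df)
reduced_tfidf = pca.transform(tfidf_df)
reduced_tfidf

myfont = fm.FontProperties(fname="msyh.ttc", size=14)  # font for Chinese text (defined here, not passed to any plot call)
matplotlib.rcParams["axes.unicode_minus"] = False
# Scatter plot of the two components, colored by genre label
scatter = plt.scatter(reduced_tfidf[:, 0], reduced_tfidf[:, 1], c=tfidf['label'], cmap='coolwarm')
plt.show()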
# Represent the texts with the bag-of-words model
from sklearn.feature_extraction.text import CountVectorizer

def vectorize_text(corpus, n):
    bag_of_words_model = CountVectorizer(max_features=n)
    # Count word frequencies
    dense_vec_matrix = bag_of_words_model.fit_transform(corpus).todense()
    # Convert to a DataFrame
    bag_of_word_df = pd.DataFrame(dense_vec_matrix)
    # Add column names
    bag_of_word_df.columns = sorted(bag_of_words_model.vocabulary_)
    return bag_of_word_df

df_1 = vectorize_text(sum_out_str_list, 2500)
df_2 = vectorize_text(sum_out_str_list, 25)
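For comparison with the CountVectorizer version above, the bag-of-words representation can be sketched in plain Python. This is a minimal illustration under the assumption that each text is already a space-separated string of tokens; the function name bag_of_words is hypothetical. Unlike CountVectorizer's default tokenizer, it also keeps single-character tokens.

```python
from collections import Counter

def bag_of_words(docs, max_features):
    """docs: space-separated token strings. Returns (vocabulary, count matrix)."""
    totals = Counter(w for doc in docs for w in doc.split())
    # Keep the max_features most frequent words, sorted so column order is stable.
    vocab = sorted(w for w, _ in totals.most_common(max_features))
    index = {w: i for i, w in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for w in doc.split():
            if w in index:          # words outside the vocabulary are ignored
                row[index[w]] += 1
        matrix.append(row)
    return vocab, matrix

vocab, matrix = bag_of_words(['good plot good cast', 'bad plot'], max_features=2)
```

Each row of the matrix is one text and each column counts one vocabulary word, which is the same layout as df_1 and df_2.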
# Naive Bayes model
from sklearn import metrics
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

# Print accuracy and the support-weighted precision, recall and F1 score
def get_metrics(true_labels, predicted_labels):
    print('Accuracy:', np.round(
        metrics.accuracy_score(true_labels,
                               predicted_labels), 2))
    print('Precision:', np.round(
        metrics.precision_score(true_labels,
                                predicted_labels,
                                average='weighted'), 2))
    print('Recall:', np.round(
        metrics.recall_score(true_labels,
                             predicted_labels,
                             average='weighted'), 2))
    print('F1 Score:', np.round(
        metrics.f1_score(true_labels,
                         predicted_labels,
                         average='weighted'), 2))

data = np.array(df_1.iloc[:])
# Features and labels. df_1 holds only word-count columns, so all of them are used as features.
x, y = data, tfidf['label']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)  # 80/20 train/test split
nb = MultinomialNB()
nb.fit(x_train, y_train)
predictions = nb.predict(x_test)
get_metrics(true_labels=y_test, predicted_labels=predictions)
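The task statement gives more credit for a self-written classifier, so the following is a minimal sketch of multinomial naive Bayes on word-count rows. It assumes Laplace (add-one) smoothing with alpha=1.0 and works in log space to avoid underflow; the class name TinyMultinomialNB and the toy data are hypothetical, and the sketch is not tuned for a 2500-column matrix.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Multinomial naive Bayes over word-count rows with Laplace smoothing."""

    def fit(self, rows, labels, alpha=1.0):
        n_features = len(rows[0])
        class_counts = Counter(labels)
        # Total count of every word inside each class.
        word_counts = defaultdict(lambda: [0] * n_features)
        for row, label in zip(rows, labels):
            for j, c in enumerate(row):
                word_counts[label][j] += c
        self.classes = sorted(class_counts)
        self.log_prior = {k: math.log(class_counts[k] / len(labels)) for k in self.classes}
        self.log_like = {}
        for k in self.classes:
            total = sum(word_counts[k]) + alpha * n_features
            self.log_like[k] = [math.log((c + alpha) / total) for c in word_counts[k]]
        return self

    def predict(self, rows):
        def score(row, k):
            # log P(class) + sum over words of count * log P(word | class)
            return self.log_prior[k] + sum(c * ll for c, ll in zip(row, self.log_like[k]))
        return [max(self.classes, key=lambda k: score(row, k)) for row in rows]

# Toy data with two word columns, e.g. counts of "good" and "bad".
model = TinyMultinomialNB().fit([[2, 0], [1, 0], [0, 2], [0, 1]], [0, 0, 1, 1])
predicted = model.predict([[3, 0], [0, 1]])
```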
# K-nearest neighbor model
from sklearn.neighbors import KNeighborsClassifier

# get_metrics is the helper defined in the naive Bayes step
data = np.array(df_1.iloc[:])
# df_1 holds only word-count columns, so all of them are used as features
X, y = data, tfidf['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)  # 90/10 train/test split
# print(X_train.shape)
# print(y_train)
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
y_predict = knn.predict(X_test)
get_metrics(y_test, y_predict)
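A self-written K-nearest neighbor classifier is short enough to sketch here. This version assumes Euclidean distance and a simple majority vote, scans all training rows for every query, and uses math.dist (Python 3.8 or later); the function name knn_predict and the toy data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k=3):
    """Majority label among the k training rows closest to query (Euclidean)."""
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

train_rows = [[0, 0], [0, 1], [5, 5], [6, 5]]
train_labels = [0, 0, 1, 1]
near_origin = knn_predict(train_rows, train_labels, [1, 0], k=3)
near_corner = knn_predict(train_rows, train_labels, [5, 6], k=3)
```

Every prediction computes the distance to every training row, which is the computational cost listed as a disadvantage of KNN in the conclusions.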
# K-means clustering
from sklearn.cluster import KMeans, MiniBatchKMeans

def train(X, true_k=10, minibatch=False, showLable=False):
    # Train k-means on sampled mini-batches or on the full data
    if minibatch:
        km = MiniBatchKMeans(n_clusters=true_k, init='k-means++', n_init=1,
                             init_size=1000, batch_size=1000, verbose=False)
    else:
        km = KMeans(n_clusters=true_k, init='k-means++', max_iter=300, n_init=1,
                    verbose=False)
    km.fit(X)
    result = list(km.predict(X))
    print('Cluster distribution:')
    print({i: result.count(i) for i in set(result)})
    # km.score(X) is the negative k-means objective, so this returns the sum of squared distances
    return -km.score(X)

# Choose the number of clusters k
def k_determin(tfidf_df):
    true_ks = []
    scores = []
    # Try 3 to 19 cluster centers (adjust the range to your own data volume)
    for i in range(3, 20, 1):
        score = train(tfidf_df, true_k=i)  # / len(dataset)
        print(i, score)
        true_ks.append(i)
        scores.append(score)
    plt.figure(figsize=(8, 4))
    plt.plot(true_ks, scores, label="error", color="red", linewidth=1)
    plt.xlabel("n_clusters")
    plt.ylabel("error")
    plt.legend()
    plt.show()

def main():
    '''Output the clustering result under the chosen parameters'''
    # get_dbdata and transform are helper functions that are not defined in this report
    dataset = get_dbdata()
    X, vectorizer = transform(dataset, n_features=500)
    score = train(X, true_k=25, showLable=True) / len(dataset)
    print(score)
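The K-means loop itself can be sketched without sklearn. The version below is plain Lloyd's algorithm with Euclidean distance and, as a simplifying assumption, takes the initial centers as an argument instead of using k-means++ initialization. The function name kmeans and the toy points are hypothetical.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Lloyd's algorithm. Returns (centers, assignments, sum of squared errors)."""
    centers = [list(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        assign = [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)   # an empty cluster keeps its center
        if new_centers == centers:        # converged: no center moved
            break
        centers = new_centers
    sse = sum(math.dist(p, centers[a]) ** 2 for p, a in zip(points, assign))
    return centers, assign, sse

points = [[0, 0], [0, 2], [10, 0], [10, 2]]
centers, assign, sse = kmeans(points, centers=[[0, 0], [10, 0]])
```

The returned sum of squared errors is the same kind of quantity that train reports through -km.score(X), so it could feed the same error-versus-k plot.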
# Hierarchical clustering
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
import matplotlib as mpl
from scipy.cluster.hierarchy import ward, dendrogram

mpl.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese labels
# Correlation matrix between the 25 bag-of-words columns of df_2
dist = df_2.corr()
# Ward linkage; each row of the correlation matrix is treated as one observation
linkage_matrix = ward(dist)
fig, ax = plt.subplots(figsize=(10, 6))  # set the figure size
ax = dendrogram(linkage_matrix, orientation="right")  # , labels=tfidf['label'][:25]);
plt.tick_params(
    axis='x',           # apply to the x axis
    which='both',       # both major and minor ticks
    bottom=False,       # hide the ticks on the bottom edge
    top=False,          # hide the ticks on the top edge
    labelbottom=False)  # hide the bottom tick labels
plt.tight_layout()  # use a compact layout
plt.show()
[Figure: dendrogram from the hierarchical cluster analysis of the text]
5. Results and lessons learned
Overall, the models built in this training did not achieve high text-classification accuracy. The likely reasons are:
(1) There are too many text types, and some types are very similar to each other;
(2) Each individual text is too short, so common words receive high weights during training while the features that actually characterize a text type are heavily diluted, which greatly reduces classification accuracy;
(3) The classification models are simple, and the bag-of-words feature vectors were not tuned for this corpus. This is the main shortcoming of this training;
(4) Even after word segmentation and stop-word removal, the texts still contain many noise words, which strongly affect the subsequent model training.
The two supervised methods used in this training, naive Bayes and K-nearest neighbors, showed their respective strengths and weaknesses clearly during text classification.
K-nearest neighbor model:
Advantages: high accuracy, insensitive to outliers
Disadvantages: high computational complexity and high space complexity
The main advantages of naive Bayes are:
1) The model is grounded in classical probability theory and gives stable classification performance.
2) It performs well on small datasets, handles multi-class tasks, and supports incremental training: when the data does not fit in memory, the model can be trained in batches.
3) It is not very sensitive to missing data, and the algorithm is simple, which is why it is often used for text classification.
The main disadvantages of naive Bayes are:
1) In theory, naive Bayes has the lowest error rate among classification methods, but in practice this is not always the case, because the model assumes that the attributes are independent of each other. This assumption often fails in real applications. When there are many attributes or the attributes are strongly correlated, classification quality suffers; naive Bayes works best when the correlation between attributes is small. Algorithms such as semi-naive Bayes improve on this moderately by taking some of the dependencies into account.
2) The prior probability must be known, and it often depends on the assumed model. Many prior models are possible, so a poorly chosen prior can lead to poor predictions.
3) Because the class is decided from a posterior that is computed from the prior and the data, the classification decision carries a certain error rate.
4) It is very sensitive to the form in which the input data is represented.
For the unsupervised clustering models, hierarchical clustering and K-means compare as follows:
K-means clustering: the algorithm is fast and simple, it is efficient and scalable on large datasets, and it gave the better result on this text-clustering task.
Hierarchical clustering: the time complexity is high. On a large bag-of-words representation it needs far more computing resources than K-means and takes considerably longer to run.