Python and data science (word segmentation training)

1. Training tasks

1. Crawler writing: pick a website whose content interests you and crawl several different categories of text corpus (for example, reviews of different movie genres, microblogs about different trending Weibo topics, posts on different sports in a sports forum, or user posts in apps such as Xiaohongshu). At least 500 items must be collected for each category.

2. Text feature extraction: segment each text, count word frequencies, and compute the TF-IDF value of each word in each text. For an introduction to TF-IDF, see http://www.ruanyifeng.com/blog/2013/03/tf-idf.html

(For word segmentation, use the jieba library covered in class. For TF-IDF and other feature-selection methods, as well as the classifiers used later, writing the code yourself is recommended; using only the sklearn module earns the basic score, while self-written implementations score higher.)

3. Visualization: use matplotlib to display the data distribution. Any statistic may be plotted (for example, the distribution of TF-IDF values over sub-intervals of [0, 1], or the number of texts crawled per category), using line charts, histograms, and other chart types.

4. Apply PCA dimensionality reduction to the data and display the result with matplotlib (sklearn may be used for the PCA step).

5. Understand the bag-of-words model and use it to represent each text. Apply the supervised learning methods (the naive Bayes model and the K-nearest-neighbour model) and the unsupervised learning methods (hierarchical clustering and K-means clustering) to classify or cluster the texts, and display the resulting categories with matplotlib.

6. Compare the four models and evaluate their performance (recall, precision, accuracy, F1 score); the metric formulas are recalled just below.
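For reference, using the per-class true-positive (TP), false-positive (FP) and false-negative (FN) counts, these metrics are computed as precision = TP / (TP + FP), recall = TP / (TP + FN), accuracy = correct predictions / all predictions, and F1 = 2 · precision · recall / (precision + recall); the code below reports sklearn's weighted averages of the per-class values.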

2. Requirements for practical training

1. Describe the crawled content;

2. Present the code and results in the report, add detailed comments to the code, and give your own understanding of the models;

3. Compare the models in detail.

3. Main work content of the training

(1) Code implementation:

# Crawl reviews of various movie genres from Douban Movies

import requests

from lxml import etree

import json

import time

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36',
    # a logged-in Douban session cookie is required here, otherwise the review pages return 403
    'cookie': '<your Douban cookie>',
}

 

# get movie reviews

def crawling(id,f):

    index = 0

    start = 0

    while True:

        try:

            # Without logging in, the Douban review server returns 403 Forbidden; log in and include the cookie in the headers to keep crawling

            url = 'https://movie.douban.com/subject/'+str(id)+'/comments?start=' + str(

                start) + '&limit=20&status=P&sort=new_score'

            if (start > 40):  # upper limit on how many review pages to crawl per movie (at most 3 pages of 20 here)

                break

            response = requests.get(url, headers=headers)

            response.encoding = 'utf-8'

            selector = etree.HTML(response.text)

            comments = selector.xpath("//div[@id='comments']/div[@class='comment-item ']/div[@class='comment']")

            if (len(comments) == 0):

                break

            for i in range(len(comments)):

                index += 1

                reviewer_name = comments[i].xpath("h3/span[@class='comment-info']/a/text()")[0]

                comment_star = comments[i].xpath("h3/span[@class='comment-info']/span")[1].xpath("@class")[0] + ""

                comment_time = comments[i].xpath("h3/span[@class='comment-info']/span[@class='comment-time ']")[0].xpath(

                    "string(.)").strip()

                comment_content = comments[i].xpath("p/span[@class='short']")[0].xpath("string(.)").strip().replace('\n',

                                                                                                                    '').replace(

                    '\r', '')

                f.write(comment_content+'\n')

            start += 20

        except Exception as e:

            print(e)

            break

sum_url='https://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=&start=0&genres=剧情'

type_movie = ['剧情', '喜剧', '动作', '爱情', '科幻', '动画', '悬疑', '冒险', '灾难', '武侠', '奇幻', '西部', '战争', '历史', '传记', '音乐', '恐怖', '犯罪']  # Douban genre tags: drama, comedy, action, romance, sci-fi, animation, suspense, adventure, disaster, martial arts, fantasy, western, war, history, biography, music, horror, crime

# Get the movie ids for each genre in turn

for tt in type_movie[-2:]:  # only the last two genres are crawled in this run; adjust the slice to cover the rest
    sum_url = 'https://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=&start=0&genres=' + tt
    response = json.loads(requests.get(sum_url, headers=headers).text)
    text_dir = './data/' + tt + '.txt'
    with open(text_dir, 'a+', encoding='utf-8') as f:
        for i in response['data']:
            print(i['id'], i['title'])
            # get the reviews of this movie
            crawling(i['id'], f)
            time.sleep(2)

 

# Read the crawled data and build a DataFrame
import pandas as pd
import jieba

type_movie = ['剧情', '喜剧', '动作', '爱情']  # drama, comedy, action, romance: only these four genres are used from here on

text_list=[]

label_list=[]

for num in range(0,len(type_movie)):

    file_path='./data/'+type_movie[num]+'.txt'

    with open(file_path,encoding='utf-8')as f:

        content=f.read()

        for line in content.split('\n'):
            if not line.strip():
                continue  # skip empty lines
            text_list.append([line, num])

tfidf=pd.DataFrame(text_list,columns=['text','label'])

tfidf.head()

 

# jieba word segmentation and stop-word removal

with open('stopwords.txt',encoding='utf-8')as f:

    s_content=f.read()

ss=s_content.split('\n')

# This step takes a while (about 40 seconds) because no sklearn helper function is used

sum_out_str_list = []
for sentence in tfidf['text']:
    outstr = ''
    sentence_depart = jieba.cut(str(sentence).strip())
    for word in sentence_depart:
        if word not in ss and word != '\t':
            outstr += word + ' '
    sum_out_str_list.append(outstr)  # keep the segmented, stop-word-filtered document
tfidf.loc[:, 'text'] = sum_out_str_list

from sklearn.feature_extraction.text import TfidfVectorizer

# quick first look with only 18 features
tfidf_model = TfidfVectorizer(max_features=18)
tfidf_df = pd.DataFrame(tfidf_model.fit_transform(tfidf['text']).todense())
tfidf_df.columns = sorted(tfidf_model.vocabulary_)
tfidf_df.head()

#Calculate tfidf to extract text features

tfidf_model=TfidfVectorizer(max_features=1000)

tfidf_df=pd.DataFrame(tfidf_model.fit_transform(tfidf['text']).todense())

tfidf_df.columns=sorted(tfidf_model.vocabulary_)

tfidf_model1=TfidfVectorizer(max_features=50)

tfidf_df1=pd.DataFrame(tfidf_model1.fit_transform(tfidf['text']).todense())

tfidf_df1.columns=sorted(tfidf_model1.vocabulary_)

tfidf_df.head()
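The task description recommends implementing TF-IDF by hand rather than relying only on sklearn's TfidfVectorizer. A minimal self-written sketch (not part of the original report; it assumes the tokenised, space-joined strings in sum_out_str_list produced above) could look like this:

import math
from collections import Counter

def tfidf_by_hand(tokenised_docs):
    # tokenised_docs: a list of documents, each given as a list of words
    n_docs = len(tokenised_docs)
    df = Counter()                      # document frequency of each word
    for doc in tokenised_docs:
        df.update(set(doc))
    idf = {w: math.log(n_docs / df[w]) for w in df}
    weights = []
    for doc in tokenised_docs:
        tf = Counter(doc)
        # tf(w, d) = count / document length; tfidf = tf * idf
        weights.append({w: (tf[w] / len(doc)) * idf[w] for w in tf})
    return weights

# example usage (documents that became empty after stop-word removal are skipped):
# hand_tfidf = tfidf_by_hand([s.split() for s in sum_out_str_list if s.split()])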

 

 

#PCA Dimensionality Reduction

from sklearn.decomposition import PCA

pca=PCA(2)

pca.fit(tfidf_df)

reduced_tfidf=pca.transform(tfidf_df)

reduced_tfidf

import matplotlib.pyplot as plt

import matplotlib

import matplotlib.font_manager as fm

myfont = fm.FontProperties(fname="msyh.ttc", size=14)

matplotlib.rcParams["axes.unicode_minus"] = False

scatter=plt.scatter(reduced_tfidf[:,0],reduced_tfidf[:,1],c=tfidf['label'],cmap='coolwarm')

plt.show()
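As an optional check (not in the original listing), the fraction of variance retained by the two principal components can be printed:

print(pca.explained_variance_ratio_)  # share of variance explained by each of the two components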

 

#Represent the text with the bag of words model

from sklearn.feature_extraction.text import CountVectorizer

def vectorize_text(corpus,n):

    bag_of_words_model=CountVectorizer(max_features=n)

    # Statistical word frequency

    dense_vec_matrix=bag_of_words_model.fit_transform(corpus).todense()

    # convert to dataframe

    bag_of_word_df=pd.DataFrame(dense_vec_matrix)

    #Add column names

    bag_of_word_df.columns=sorted(bag_of_words_model.vocabulary_)

    return bag_of_word_df

df_1=vectorize_text(sum_out_str_list,2500)

df_2=vectorize_text(sum_out_str_list,25)

# Naive Bayesian model

from sklearn import metrics

import numpy as np

from sklearn.naive_bayes import MultinomialNB

def get_metrics(true_labels, predicted_labels):

    print('Accuracy:', np.round(

        metrics.accuracy_score(true_labels,

                               predicted_labels),2))

    print('Precision:', np.round(

        metrics.precision_score(true_labels,

                                predicted_labels,

                                average='weighted'),2))

    print('Recall:', np.round(

        metrics.recall_score(true_labels,

                             predicted_labels,

                             average='weighted'),  2))

    print('F1 Score:', np.round(

        metrics.f1_score(true_labels,

                         predicted_labels,

                         average='weighted'), 2))

from sklearn.model_selection import train_test_split

data = np.array(df_1.iloc[:])
x, y = data[:, :-1], tfidf['label']  # features and labels (the slice drops the last bag-of-words column)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)  # 80/20 train/test split

nb=MultinomialNB()

nb.fit(x_train,y_train)

predictions=nb.predict(x_test)

get_metrics(true_labels=y_test,predicted_labels=predictions)

#K nearest neighbor model

from sklearn.neighbors import KNeighborsClassifier

import numpy as np

# the get_metrics evaluation helper defined in the naive Bayes section above is reused here

data = np.array(df_1.iloc[:])

X, y = data[:,:-1], tfidf['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)  # 90/10 train/test split

#print(X_train.shape)

#print(y_train)

knn = KNeighborsClassifier(n_neighbors=10)

knn.fit(X_train, y_train)

y_predict =knn.predict(X_test)

get_metrics(y_test,y_predict)
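The report fixes n_neighbors=10; a quick, optional way to compare a few other values of k (a sketch, not part of the original code) is cross-validation on the training split:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

for k in (3, 5, 10, 15):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    print(k, round(scores.mean(), 2))  # mean 5-fold accuracy for each candidate k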

# K-means clustering

from sklearn.cluster import KMeans, MiniBatchKMeans

def train(X, true_k=10, minibatch=False, showLable=False):
    # Train k-means, either the mini-batch variant (sampled batches) or full-batch KMeans
    if minibatch:
        km = MiniBatchKMeans(n_clusters=true_k, init='k-means++', n_init=1,
                             init_size=1000, batch_size=1000, verbose=False)
    else:
        km = KMeans(n_clusters=true_k, init='k-means++', max_iter=300, n_init=1,
                    verbose=False)

    km.fit(X)

    result = list(km.predict(X))

    print('Cluster distribution:')

    print(dict([(i, result.count(i)) for i in result]))

    return -km.score(X)

#Specify the number of clusters k

def k_determin(tfidf_df):

    true_ks=[]

    scores=[]

    # The number of cluster centres tested ranges from 3 to 20 here (adjust according to your own data volume)

    for i in range(3, 20, 1):

        score = train(tfidf_df, true_k=i)# / len(dataset)

        print(i, score)

        true_ks.append(i)

        scores.append(score)

    plt.figure(figsize=(8, 4))

    plt.plot(true_ks, scores, label="error", color="red", linewidth=1)

    plt.xlabel("n_features")

    plt.ylabel("error")

    plt.legend()

    plt.show()

def main():
    '''Output the clustering score for a chosen number of clusters'''
    score = train(tfidf_df, true_k=25, showLable=True) / len(tfidf_df)  # uses the 1000-feature TF-IDF matrix built above
    print(score)
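Note that neither k_determin nor main is actually invoked in the listing above. To reproduce the elbow-style curve, a call such as the following would be run (assuming the 50-feature TF-IDF matrix tfidf_df1 built earlier is the intended input):

k_determin(tfidf_df1)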

#hierarchical clustering

import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer

from pylab import mpl

mpl.rcParams['font.sans-serif'] = ['SimHei']  

dist = df_2.corr()  # 25x25 correlation matrix of the bag-of-words columns (the dendrogram therefore clusters these columns)

import matplotlib.pyplot as plt

import matplotlib as mpl

from scipy.cluster.hierarchy import ward, dendrogram

linkage_matrix = ward(dist)  # build the linkage matrix with Ward's method on the precomputed matrix

fig, ax = plt.subplots(figsize=(10, 6)) # set size

ax = dendrogram(linkage_matrix, orientation="right")#, labels=tfidf['label'][:25]);

plt.tick_params(
    axis='x',           # apply to the x axis
    which='both',       # affect both major and minor ticks
    bottom=False,       # hide the bottom ticks
    top=False,          # hide the top ticks
    labelbottom=False)  # hide the bottom tick labels
plt.tight_layout()  # compact layout
plt.show()

Hierarchical cluster analysis of the text (dendrogram figure omitted here).
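To turn the same linkage matrix into flat cluster labels for the 25 bag-of-words columns used above (a small optional sketch, not present in the original code):

from scipy.cluster.hierarchy import fcluster

flat_labels = fcluster(linkage_matrix, t=4, criterion='maxclust')  # cut the dendrogram into 4 flat clusters
print(flat_labels)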

 

5. Gains and experience from the training

Overall, the models built here do not achieve high text-classification accuracy, for the following reasons:

(1) There are too many text categories, and some of them are highly similar to one another;

(2) The corpus for each single category is too small, so common words end up with high weights during training while the features that genuinely characterise a category are heavily diluted, which sharply reduces classification accuracy;

(3) The classification models built here are not sophisticated, and the bag-of-words feature vectors were not tuned for this particular corpus; this is the main shortcoming, and pain point, of this training exercise;

(4) Even after word segmentation and stop-word removal, many noisy words remain in the texts, and they also strongly affect the subsequent model training.

The supervised learning methods used in this training, the naive Bayes model and the K-nearest-neighbour model, showed their respective strengths and weaknesses clearly during text classification.

K-nearest-neighbour model:

Advantages: high accuracy, insensitive to outliers.

Disadvantages: high computational complexity and high space complexity.

The main advantages of naive Bayes are:

1) The naive Bayes model originates from classical mathematical theory and gives stable classification performance.

2) It performs well on small datasets, handles multi-class tasks, and is suited to incremental training: when the data do not fit in memory, training can proceed batch by batch (see the sketch after this list).

3) It is relatively insensitive to missing data and the algorithm is simple, which is why it is often used for text classification.
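Point 2) refers to incremental, batch-by-batch training. sklearn's MultinomialNB supports this through partial_fit; a minimal sketch, assuming the x_train / y_train split from the naive Bayes code above (not part of the original report):

from sklearn.naive_bayes import MultinomialNB
import numpy as np

nb_inc = MultinomialNB()
classes = np.unique(y_train)                   # all class labels must be declared on the first call
for start in range(0, x_train.shape[0], 500):  # feed the training data in batches of 500 rows
    nb_inc.partial_fit(x_train[start:start + 500],
                       y_train.iloc[start:start + 500],
                       classes=classes)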

The main disadvantages of naive Bayes are:

1) In theory, naive Bayes has the lowest possible error rate compared with other classifiers, but in practice this is not always the case, because the model assumes the attributes are mutually independent given the class. That assumption often fails in real applications: when there are many attributes or strong correlations between them, classification quality suffers. Naive Bayes works best when attribute correlations are small; semi-naive Bayes and related algorithms improve on this by modelling some of the dependencies.

2) The prior probabilities must be known, and they often depend on modelling assumptions; since many candidate prior models exist, a poorly chosen prior can hurt prediction quality.

3) Because the classification decision comes from the posterior, which is computed from the prior and the data, the decision carries an inherent error rate (the decision rule is written out just after this list).

4) It is quite sensitive to the representation of the input data.
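For reference, the decision rule behind points 1) to 3) can be written out explicitly: for features x1, ..., xn, naive Bayes predicts the class c that maximises P(c) · P(x1 | c) · ... · P(xn | c), i.e. the class prior multiplied by the per-feature likelihoods under the conditional-independence assumption.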

For the unsupervised clustering methods, comparing K-means with hierarchical clustering:

K-means: the algorithm is fast and simple, efficient and scalable on large datasets, and it clusters this text corpus reasonably well.

Hierarchical clustering: the time complexity is high (roughly quadratic or worse in the number of items, versus roughly linear per iteration for K-means), so on the large bag-of-words representation it needs far more computation and takes considerably longer than K-means.

