How did I complete the classification of 300,000 after-sales orders in 10 minutes?

1. Background

(1) The requirement: the data analysis team needs to analyze the company's after-sales maintenance orders, filter out the top 10 problems, and then analyze and track them.

(2) The problem: the after-sales department provided its tracking sheet for the past 2 years, plain-text descriptions, roughly 300,000 rows. Split among 5 analysts, sorting out the top 10 problems by hand would take about 1-2 weeks.


(3) The leader's ask: come up with an approach that uses a program or algorithm to categorize these problems, count their popularity, and produce a top list. The algorithm will be needed frequently later on; summarizing and organizing by hand is far too tedious.

(4) Taking on the task

A brief summary of the requirement: similar items have to be grouped out of free-text descriptions; the volume is fairly large, over 200,000 rows, manual work is too slow, and an algorithm is needed.

A brief summary in software terms: build a text-similarity heat-statistics algorithm that groups descriptions of the same type, counts the occurrences in each group, sorts the counts in descending order, and picks the top 10.

2. Selection

(1) The selection process was not straightforward. I read a lot of material and tried quite a few algorithms (clustering, word segmentation, heat statistics, and so on), but the results were not ideal. Word segmentation itself worked, but it could not satisfy the business need: the after-sales descriptions are long, and word-level heat statistics only measure the popularity of individual words, while the business needs to look at the whole sentence in context. So the effect was not good enough.

(2) It was finally implemented in Python. Java implementations can also be found online; I verified them as well, but the results were not ideal and deployment was a bit more troublesome. The Python version is built on gensim combined with the jieba word segmentation library, and after some tuning of the results it works quite well. (gensim is a simple and efficient natural language processing library for Python, used to extract semantic topics from documents.)
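To make the choice concrete, here is a minimal sketch of the jieba + gensim combination on a few made-up one-line descriptions. Everything in this snippet is illustrative: the sentences are invented, and it uses a plain TF-IDF model for brevity, whereas the full implementation below later switched to LsiModel. The point is that it scores whole sentences against each other instead of counting individual words, which is what the business needed.

    # Illustrative only: made-up descriptions, TF-IDF instead of the final LSI model.
    import jieba
    from gensim import corpora, models, similarities

    docs = [
        "空调不制冷,需要加氟",    # "AC not cooling, needs refrigerant"
        "空调制冷效果差",          # "AC cools poorly"
        "洗衣机脱水时噪音大",      # "washing machine is noisy during spin"
    ]
    texts = [list(jieba.cut(d)) for d in docs]          # segment each description
    dictionary = corpora.Dictionary(texts)              # token -> id mapping
    corpus = [dictionary.doc2bow(t) for t in texts]     # bag-of-words vectors
    tfidf = models.TfidfModel(corpus)                   # weight tokens by TF-IDF
    index = similarities.SparseMatrixSimilarity(
        tfidf[corpus], num_features=len(dictionary.token2id))
    query = dictionary.doc2bow(list(jieba.cut("空调不制冷")))
    print(index[tfidf[query]])                          # similarity of the query against all three rows

The printed vector contains one similarity score per document, so the two air-conditioner complaints score close to each other while the washing-machine row does not.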

3. Algorithm description

With about 300,000 rows of text descriptions, an ordinary personal laptop (4 cores, 16 GB RAM) produces results in roughly 10 minutes, and detailed data is printed along the way to verify the effect.

How the text similarity algorithm is implemented:

(1) The input file is Excel, about 200,000 rows. First read the Excel content with pandas, then segment it with jieba. For jieba, a custom dictionary and stop words must be configured first, otherwise the results differ greatly; the segmented rows form a two-dimensional array.
(2) Use the corpora module in gensim to build a dictionary from the segmented two-dimensional array.
(3) Convert the two-dimensional array into a corpus of doc2bow sparse vectors.
(4) A TF-IDF model was used at first; it was later switched to an LsiModel built from the corpus.
(5) Get the number of features from the dictionary's token2id.
(6) Compute the sparse-matrix similarity and build an index.
(7) Read each Excel row and segment it with jieba.
(8) Compute the sparse vector of the query row with doc2bow.
(9) Compute the similarity between the query row and the sample data (a condensed sketch of these steps follows below).
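Here is a condensed sketch of steps (1)-(9); the complete, runnable version is in section 4 below. The file name orders.xlsx and the column name desc are placeholders invented for this sketch, not the real data.

    # Condensed sketch of steps (1)-(9); 'orders.xlsx' and the 'desc' column are placeholders.
    import jieba
    import pandas as pd
    from gensim import corpora, models, similarities

    df = pd.read_excel("orders.xlsx")                      # (1) read the order descriptions
    rows = df["desc"].astype(str).tolist()
    texts = [list(jieba.cut(s)) for s in rows]             # (1) segment each row
    dictionary = corpora.Dictionary(texts)                 # (2) build the dictionary
    corpus = [dictionary.doc2bow(t) for t in texts]        # (3) doc2bow sparse vectors
    model = models.LsiModel(corpus)                        # (4) LSI model over the corpus
    feature_cnt = len(dictionary.token2id)                 # (5) number of features
    index = similarities.SparseMatrixSimilarity(
        model[corpus], num_features=feature_cnt)           # (6) similarity index
    for row in rows:                                       # (7) loop over every row
        vec = dictionary.doc2bow(list(jieba.cut(row)))     # (8) sparse vector for this row
        sims = index[model[vec]]                           # (9) similarity against all rows

Each sims entry is the similarity of the current row against every other row; the full code then groups the rows above the 50% threshold and counts them.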


Notes on the algorithm:

(1) One point to note: steps 7-9 run in a loop. Each row of the description column is compared against the index built in step 6; every row whose similarity to it exceeds 50% is grouped with it and counted, and those rows are added to an already-matched array so that later iterations skip them and nothing is counted twice.
(2) In the first step, jieba uses a dictionary of professional terms and a stop-word dictionary. Steps 7-9 are executed in a loop, and the similarity threshold is currently set to 50%. There is not much to say about the Excel handling (the summary sheet gets a hyperlink that navigates to the corresponding detail sheet).
(3) In terms of efficiency, converting the vectors for the 200,000-plus rows takes about 10 minutes.
(4) That is the overall algorithm. This article mainly covers heat statistics over whole sentences; statistics over grouped sentences will be covered next time. (A minimal sketch of the final top-10 selection follows this list.)
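Once the loop has filled the counting dictionary (word_dict in the code below, mapping a representative description to the number of similar rows), the top 10 is just the head of the sorted counts. A minimal sketch of that last step, with placeholder values rather than real order data:

    # word_dict stands in for the {description: similar-row count} map built in section 4;
    # the entries below are placeholders, not real after-sales data.
    word_dict = {"问题描述A": 320, "问题描述B": 210, "问题描述C": 95}
    top10 = sorted(word_dict.items(), key=lambda kv: kv[1], reverse=True)[:10]
    for rank, (desc, cnt) in enumerate(top10, start=1):
        print(f"{rank:>2}. {cnt:>5} similar rows  {desc}")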

4. Complete code and description

Here is the complete code; anyone who needs it can take it and use it directly, no strings attached. It is explained below in parts 1-6.

import jieba.posseg as pseg
import jieba.analyse
import xlwt
import openpyxl
from gensim import corpora, models, similarities
import re

# Load the stop-word list (one word per line)
def StopWordsList(filepath):
    with open(filepath, 'r', encoding='utf8') as f:
        return [w.strip() for w in f.readlines()]

# Hex-encode a string; used below to make sheet names unique
def str_to_hex(s):
    return ''.join([hex(ord(c)).replace('0x', '') for c in s])

# Segment a sentence with jieba and drop stop words and unwanted POS tags
def seg_sentence(sentence, stop_words):
    stop_flag = ['x', 'c', 'u', 'd', 'p', 't', 'uj', 'f', 'r']  # POS tags to filter out
    sentence_seged = pseg.cut(sentence)
    outstr = []
    for word, flag in sentence_seged:
        if word not in stop_words and flag not in stop_flag:
            outstr.append(word)
    return outstr

if __name__ == '__main__':
    #1 Custom dictionaries for jieba: industry terms added by 软件老王, plain text files, one word per line.
    # The dictionaries themselves are not uploaded here, so these lines can be commented out.
    jieba.load_userdict("g1.txt")
    jieba.load_userdict("g2.txt")
    jieba.load_userdict("g3.txt")

    #2 Stop words: simply put, words that are skipped during segmentation; a generic list 软件老王 found online, provided separately.
    spPath = 'stop.txt'
    stop_words = StopWordsList(spPath)

    #3 Excel handling
    wbk = xlwt.Workbook(encoding='ascii')
    sheet = wbk.add_sheet("软件老王sheet")  # sheet name
    sheet.write(0, 0, '表头-软件老王1')
    sheet.write(0, 1, '表头-软件老王2')
    sheet.write(0, 2, '导航-链接到明细sheet表')
    wb = openpyxl.load_workbook('软件老王-source.xlsx')
    ws = wb.active
    col = ws['B']
    # 4 Similarity processing
    rcount = 1
    texts = []
    orig_txt = []
    key_list = []
    name_list = []
    sheet_list = []

    for cell in col:
        if cell.value is None:
            continue
        if not isinstance(cell.value, str):
            continue
        item = cell.value.strip('\n\r').split('\t')  # split on tabs
        string = item[0]
        if string is None or len(string) == 0:
            continue
        else:
            textstr = seg_sentence(string, stop_words)
            texts.append(textstr)
            orig_txt.append(string)
    dictionary = corpora.Dictionary(texts)
    feature_cnt = len(dictionary.token2id.keys())
    corpus = [dictionary.doc2bow(text) for text in texts]
    tfidf = models.LsiModel(corpus)  # a TF-IDF model at first, later switched to LSI (variable name kept)
    index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=feature_cnt)
    result_lt = []
    word_dict = {}
    count =0
    for keyword in orig_txt:
        count = count+1
        print('开始执行,第'+ str(count)+'行')
        if keyword in result_lt or keyword is None or len(keyword) == 0:
            continue
        kw_vector = dictionary.doc2bow(seg_sentence(keyword, stop_words))
        sim = index[tfidf[kw_vector]]
        result_list = []
        for i in range(len(sim)):
            if sim[i] > 0.5:  # similarity threshold: 50%
                if orig_txt[i] in result_lt and orig_txt[i] not in result_list:
                    continue
                result_list.append(orig_txt[i])
                result_lt.append(orig_txt[i])
        if len(result_list) >0:
            word_dict[keyword] = len(result_list)
        if len(result_list) >= 1:
            sname = re.sub(u"([^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a])", "", keyword[0:10])+ '_'\
                    + str(len(result_list)+ len(str_to_hex(keyword))) + str_to_hex(keyword)[-5:]
            sheet_t = wbk.add_sheet(sname)  # detail sheet for this group
            for i in range(len(result_list)):
                sheet_t.write(i, 0, label=result_list[i])

    #5 Sort by popularity (heat) - 软件老王
    with open("rjlw.txt", 'w', encoding='utf-8') as wf2:
        orderList = list(word_dict.values())
        orderList.sort(reverse=True)
        count = len(orderList)
        for i in range(count):
            for key in word_dict:
                if word_dict[key] == orderList[i]:
                    key_list.append(key)
                    word_dict[key] = 0
        wf2.truncate()
    #6 Write the target Excel file
    for i in range(len(key_list)):
        sheet.write(i+rcount, 0, label=key_list[i])
        sheet.write(i+rcount, 1, label=orderList[i])
        if orderList[i] >= 1:
            shname = re.sub(u"([^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a])", "", key_list[i][0:10]) \
                     + '_'+ str(orderList[i]+ len(str_to_hex(key_list[i])))+ str_to_hex(key_list[i])[-5:]
            link = 'HYPERLINK("#%s!A1";"%s")' % (shname, shname)
            sheet.write(i+rcount, 2, xlwt.Formula(link))
    rcount = rcount + len(key_list)
    key_list = []
    orderList = []
    texts = []
    orig_txt = []
    wbk.save('软件老王-target.xls')

Code description:

(1) #1 sets up the custom dictionaries for jieba segmentation. 软件老王 added industry terms here; the format is just a plain text file, one column with one word per line. The dictionaries themselves are not uploaded, so these lines can simply be commented out.

    jieba.load_userdict("g1.txt")
    jieba.load_userdict("g2.txt")
    jieba.load_userdict("g3.txt")

(2) #2 Stop words. Simply put, these are words that are skipped during segmentation. 软件老王 uses a generic list found online, which will be provided.

    spPath = 'stop.txt'
    stop_words = StopWordsList(spPath)

(3) #3 Excel handling. A new sheet named "软件老王sheet" is created with three headers, "表头-软件老王1", "表头-软件老王2", and "导航-链接到明细sheet表"; the last column holds a hyperlink that navigates to the corresponding detail sheet.

    wbk = xlwt.Workbook(encoding='ascii')
    sheet = wbk.add_sheet("软件老王sheet")  # sheet name
    sheet.write(0, 0, '表头-软件老王1')
    sheet.write(0, 1, '表头-软件老王2')
    sheet.write(0, 2, '导航-链接到明细sheet表')
    wb = openpyxl.load_workbook('软件老王-source.xlsx')
    ws = wb.active
    col = ws['B']

(4) #4 Similarity processing; this is the core of the algorithm.

    rcount = 1
    texts = []
    orig_txt = []
    key_list = []
    name_list = []
    sheet_list = []
    for cell in col:
        if cell.value is None:
            continue
        if not isinstance(cell.value, str):
            continue
        item = cell.value.strip('\n\r').split('\t')  # split on tabs
        string = item[0]
        if string is None or len(string) == 0:
            continue
        else:
            textstr = seg_sentence(string, stop_words)
            texts.append(textstr)
            orig_txt.append(string)
    dictionary = corpora.Dictionary(texts)
    feature_cnt = len(dictionary.token2id.keys())
    corpus = [dictionary.doc2bow(text) for text in texts]
    tfidf = models.LsiModel(corpus)  # a TF-IDF model at first, later switched to LSI (variable name kept)
    index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=feature_cnt)
    result_lt = []
    word_dict = {}
    count =0
    for keyword in orig_txt:
        count = count+1
        print('开始执行,第'+ str(count)+'行')
        if keyword in result_lt or keyword is None or len(keyword) == 0:
            continue
        kw_vector = dictionary.doc2bow(seg_sentence(keyword, stop_words))
        sim = index[tfidf[kw_vector]]
        result_list = []
        for i in range(len(sim)):
            if sim[i] > 0.5:  # similarity threshold: 50%
                if orig_txt[i] in result_lt and orig_txt[i] not in result_list:
                    continue
                result_list.append(orig_txt[i])
                result_lt.append(orig_txt[i])
        if len(result_list) >0:
            word_dict[keyword] = len(result_list)
        if len(result_list) >= 1:
            sname = re.sub(u"([^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a])", "", keyword[0:10])+ '_'\
                    + str(len(result_list)+ len(str_to_hex(keyword))) + str_to_hex(keyword)[-5:]
            sheet_t = wbk.add_sheet(sname)  # detail sheet for this group
            for i in range(len(result_list)):
                sheet_t.write(i, 0, label=result_list[i])

(5) #5 Sort by popularity (heat); this part mainly reorganizes the counted data for the Excel output.


    with open("rjlw.txt", 'w', encoding='utf-8') as wf2:
        orderList = list(word_dict.values())
        orderList.sort(reverse=True)
        count = len(orderList)
        for i in range(count):
            for key in word_dict:
                if word_dict[key] == orderList[i]:
                    key_list.append(key)
                    word_dict[key] = 0
        wf2.truncate()

(6) #6 Write the target Excel file.

    for i in range(len(key_list)):
        sheet.write(i+rcount, 0, label=key_list[i])
        sheet.write(i+rcount, 1, label=orderList[i])
        if orderList[i] >= 1:
            shname = re.sub(u"([^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a])", "", key_list[i][0:10]) \
                     + '_'+ str(orderList[i]+ len(str_to_hex(key_list[i])))+ str_to_hex(key_list[i])[-5:]
            link = 'HYPERLINK("#%s!A1";"%s")' % (shname, shname)
            sheet.write(i+rcount, 2, xlwt.Formula(link))
    rcount = rcount + len(key_list)
    key_list = []
    orderList = []
    texts = []
    orig_txt = []
    wbk.save('软件老王-target.xls')

5. Result screenshots

(1) 软件老王-source.xlsx: the Excel file with the text description data to be processed.


(2) 软件老王-target.xls: the result data produced by the algorithm.

(3) Brief description

In fact there were quite a few pitfalls along the way. One was tuning the algorithm, the other was shaping the Excel output into what the business wanted. Fortunately the results were good and the business side was satisfied. The real data cannot be published (it is the after-sales tracking sheet: what problem the customer reported, what problem was diagnosed, what it turned out to be, and how it was resolved), so a simple set of demo data was put together to show the effect.




Origin: blog.51cto.com/14130291/2536345