Machine Learning Text Classification - From Word Frequency Statistics to Neural Networks (1)

It is hard to keep doing what you love all the time, especially when there is nothing profitable driving it. Fortunately, I still have not given up today! - To myself
Shameless plug: the source code is at https://github.com/zzubqh/TextCategorization
=============Serious dividing line==============
A few notes up front:
1. All packages used in the program (numpy, yaml, jieba) need to be installed with pip first, for example:
pip install -i https://pypi.doubanio.com/simple/ pyyaml
2. Machine configuration: this code was modified from an earlier version. Previously, to save memory (mainly because my machine was not good enough), a dictionary was used for storage; this time it was changed to a plain matrix, so at least 16G of RAM is needed to run the code, and the better the CPU the better. For now the code does not need GPU support.
3. Regarding the data set: I forgot where I downloaded it. It contains 10 categories in total, such as military, history, humanities, economics, etc. (the labels in the training set look like C0000XX). The training set has 8036 files in total; 70% are sampled for training and the remaining 30% are used for validation. Dataset download address: http://pan.baidu.com/s/1c99UME
4. Folder organization in the program:
(figure: project folder layout)
5. Text features are extracted with TF-IDF (details later), and the classifier is Naive Bayes. The overall accuracy reaches 82.5%. The main reason it is this low is that there are articles under category '23' that even I cannot tell how to classify. The per-class accuracy is as follows:
    2017-11-24 10:17:16,651 - main - INFO - the class 0 rate is 81.2765957447%
    2017-11-24 10:17:16,651 - main - INFO - the class 1 rate is 72.4324324324%
    2017-11-24 10:17:16,651 - main - INFO - the class 2 rate is 97.7941176471%
    2017-11-24 10:17:16,651 - main - INFO - the class 3 rate is 3.42465753425%
    2017-11-24 10:17:16,651 - main - INFO - the class 4 rate is 84.4106463878%
    2017-11-24 10:17:16,651 - main - INFO - the class 5 rate is 92.6829268293%
    2017-11-24 10:17:16,651 - main - INFO - the class 6 rate is 70.4081632653%
    2017-11-24 10:17:16,651 - main - INFO - the class 7 rate is 87.8676470588%
    2017-11-24 10:17:16,651 - main - INFO - the class 8 rate is 99.4029850746%
    2017-11-24 10:17:16,651 - main - INFO - the class 9 rate is 89.3023255814%
    2017-11-24 10:17:16,653 - main - INFO - the total correct rate is 82.5436408978%
Note: 0, 1, 2, ... here are just serial numbers, not the category labels. For the actual categories, compare the prediction output file 'pridictLable.txt' under 'Data->output' with the corresponding true labels in 'trueLable.txt'.
6. Finally, I also tried word2vec text features, but the classification results were not ideal; my understanding of word vectors is probably not deep enough yet, so I will keep exploring.
7. Please point out any mistakes or areas for improvement in the article and the code. Thank you very much!
================ I am the lovely dividing line ================
1. Starting from word frequency statistics
For a computer to perform the classification task, we first need data that it can process, so we need a way to convert text files into something computable. The first idea that comes to mind is to count how many times each meaningful word appears in a text and use those counts as features. For example, "car" will surely appear more often in articles about cars than in articles about entertainment, and so on. Unfortunately, words like 'I' and 'you' appear very frequently and in almost every type of article, so we also need a 'stop word' list and remove those words before counting. Ready-made stop-word lists are available online; just download one (here is the one I use: http://pan.baidu.com/s/1i5mvijN ).
Now we can turn words into numbers, but how do we represent an article as a matrix? There are two common ways to represent the words of an article in vector form.

The first is the bag-of-words model: collect all the words in the sample set, use them as one long feature vector, and for each article count how many times each word occurs (or simply whether it occurs). For example, if all the words in the sample set are ['a','i','am','you','boy','girl'] and article 1 is "i am a boy", then article 1 can be represented as [1,1,1,0,1,0]. This representation has drawbacks. The sample set is usually very large, with dimensions ranging from hundreds of thousands to several million, so the resulting matrix is sparse, which wastes a lot of storage (or forces you to fall back to a dictionary). It also only captures the frequency of individual words and ignores the relationships between them; taking context into account should be of great benefit to text classification. Finally, the feature matrix produced by the bag-of-words model is too large to feed into today's well-developed deep neural networks.

The second is the word vector model (word2vec), which appeared as a response to these shortcomings. Roughly speaking, it uses a neural network to build a language model over the words in the sample set, with the output layer tied to a huge Huffman tree; once the language model is trained, you get a very useful by-product: word vectors. Every word gets a vector of the same dimension, e.g. 100 or 400 (specified when training). For details see https://www.cnblogs.com/iloveai/p/word2vec.html , whose author explains it far better than I could. To use it, just pip install gensim, a Python library that implements word2vec (the algorithm open-sourced by Google) and can be called directly from Python; the similarity of two words is easy to obtain with it. As for how to use it for text classification, I am still researching; this article still uses the bag-of-words model!
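
To make the bag-of-words idea above concrete, here is a minimal sketch (my own illustration, not code from the repository) that builds the count vector for one document against a fixed vocabulary:

    # Bag-of-words sketch: count how often each vocabulary word occurs in a document.
    def bag_of_words_vector(vocabulary, document_tokens):
        index = {word: i for i, word in enumerate(vocabulary)}
        vector = [0] * len(vocabulary)
        for word in document_tokens:
            if word in index:              # words outside the vocabulary are ignored
                vector[index[word]] += 1
        return vector

    vocab = ['a', 'i', 'am', 'you', 'boy', 'girl']
    print(bag_of_words_vector(vocab, 'i am a boy'.split()))   # -> [1, 1, 1, 0, 1, 0]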
2. Word segmentation
Now that we know how to represent an article, the first step of the whole program is to split each article into words, which can be done with jieba. Since the segmentation tasks are completely independent of each other, they are processed with multiple threads, one per category (a sketch of the MyThread helper used here follows the code below).

    # Thread worker for word segmentation: segments every file in one category folder
    def __splitWordswithjieba__(self,class_dir_path,seg_class_path):
        file_list = os.listdir(class_dir_path)
        for file_path in file_list:
            filename = os.path.join(class_dir_path, file_path)
            content = self.ReadFile(filename).strip()
            content = content.replace("\r\n","").strip()
            content_seg = jieba.cut(content)
            self.SaveFile(os.path.join(seg_class_path, file_path)," ".join(content_seg))

    def SplitWords(self,inputpath='',outpath=''):
        if inputpath == '':
            inputpath = self.__oripath__
        if outpath == '':
            outpath = self.__segpath__

        # one segmentation thread per category folder
        catelist = os.listdir(inputpath)
        cateNum = range(len(catelist))
        threads = []

        for dir in catelist:
            ori_class_path = os.path.join(inputpath,dir)
            seg_class_path = os.path.join(outpath , dir )
            if not os.path.exists(seg_class_path):
                os.makedirs(seg_class_path)
            t = MyThread(self.__splitWordswithjieba__,(ori_class_path,seg_class_path),self.__splitWordswithjieba__.__name__)
            threads.append(t)

        for index in cateNum:
            threads[index].start()

        for index in cateNum:
            threads[index].join()
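
The code above relies on a MyThread helper that is not shown in this excerpt (its real implementation lives in the repository). A minimal sketch of what such a wrapper typically looks like, purely for reference:

    # A minimal MyThread-style wrapper (a sketch; the class in the repo may differ).
    # It simply runs func(*args) on a standard Python thread and keeps a name for logging.
    import threading

    class MyThread(threading.Thread):
        def __init__(self, func, args, name=''):
            threading.Thread.__init__(self)
            self.func = func
            self.args = args
            self.name = name

        def run(self):
            self.func(*self.args)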

After word segmentation, the next step is to remove stop words and build a deduplicated word set.

    # Thread worker for filtering: remove stop words from every file under class_dir_path
    def __dofilder__(self,class_dir_path):
        logger = logging.getLogger(__name__)
        global threadSeq
        threadSeq += 1
        localseq = threadSeq
        logger.info( 'thread ' + str(localseq) + ' start')
        file_list = os.listdir(class_dir_path)
        pattern = re.compile(r'^\w+$') # matches tokens made up only of letters, digits and underscores; such tokens are dropped

        for file_path in file_list:
            filename = os.path.join(class_dir_path , file_path)
            content = self.ReadFile(filename)
            content = content.decode('utf-8')
            wordslist = content.split()
            newcontent = ''
            for word in wordslist:
                # skip stop words (the stop-word string and the word are both unicode here)
                if self.__stopwords__.find(word) != -1:
                    continue
                # skip pure letter/digit/underscore tokens
                elif pattern.match(word) != None:
                    continue
                with lock:                 # the word set is shared by all threads
                    self.wordset.add(word)
                newcontent = newcontent + ' ' + word
            self.SaveFile(filename,newcontent)
        logger.info( 'thread ' + str(localseq) + ' end')

    # Filter the segmented files with the Chinese stop-word list
    def Filter(self,inputpath = ''):
        if inputpath == '':
            inputpath = os.path.join(self.__segpath__, "train")
        # load the stop words
        stop_words_path = os.path.join(self.__rootpath__, "stopwords.txt")
        stop_words = self.ReadFile(stop_words_path)
        stop_words = stop_words.replace("\r\n","").strip()
        self.__stopwords__ = stop_words.decode('utf-8')

        # start one thread per category to process the segmented files
        catelist = os.listdir(inputpath)
        cateNum = range(len(catelist))
        threads = []

        for dir in catelist:
            seg_class_path = os.path.join(inputpath , dir)
            t = MyThread(self.__dofilder__,(seg_class_path,),self.__dofilder__.__name__)
            threads.append(t)

        for index in cateNum:
            threads[index].start()

        for index in cateNum:
            threads[index].join()
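
One possible improvement (my own suggestion, not something the original code does): checking stop words with find() on one long string is a linear scan per lookup and can give false positives when a word happens to be a substring of a longer stop word. Keeping the stop words in a set gives exact, constant-time membership tests:

    # My own sketch (not from the repo): store stop words in a set so the membership
    # test is exact and O(1) instead of a substring search over one long string.
    stopwords = set(u'的 了 和 是'.split())      # in practice, load and split stopwords.txt
    words = u'今天 的 天气 是 真 好'.split()
    filtered = [w for w in words if w not in stopwords]
    print(' '.join(filtered))                    # -> 今天 天气 真 好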

Finally, a high-level function ties these steps together and generates the word set (the bag-of-words vocabulary) we need:

    def CreateDataSet(self,wordset_fileName):
        self.SplitWords()
        self.Filter()
        # build the word set; self.wordset is a set, so it is already deduplicated
        content = ''
        for word in self.wordset:
            content = content + ' ' + word
        self.wordsetvec = content.split()
        self.SaveFile(wordset_fileName,content)
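
For reference, the methods above all live in one preprocessing class in the repository; a typical call sequence would look roughly like the following (the class name here is a placeholder, not the real one from the repo):

    # Hypothetical usage sketch; CorpusPreprocess stands in for the real class name.
    cp = CorpusPreprocess()
    cp.CreateDataSet('wordset.txt')      # segment, filter stop words, save the vocabulary
    print(len(cp.wordsetvec))            # number of distinct words in the vocabulary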

With that, the bag-of-words vocabulary of the sample set is ready; the next step is to use TF-IDF to compute a feature vector for each article.
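
TF-IDF itself is covered in the next part; as a quick preview, one common form of the weighting, shown as a minimal sketch of my own (not the repo's implementation):

    # TF-IDF sketch: tf = term frequency in the document, idf = log(N / document frequency).
    import math

    def tf_idf(term, doc_tokens, all_docs_tokens):
        tf = doc_tokens.count(term) / float(len(doc_tokens))
        df = sum(1 for d in all_docs_tokens if term in d)
        idf = math.log(len(all_docs_tokens) / float(df)) if df else 0.0
        return tf * idf

    docs = [['car', 'engine', 'car'], ['movie', 'star'], ['car', 'movie']]
    print(tf_idf('car', docs[0], docs))   # 'car' is frequent in doc 0 but also fairly common overall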
