Machine Learning for Text Classification - From Word Frequency Statistics to Neural Networks (2)

3. Leave text classification to me: TF-IDF (term frequency - inverse document frequency)
TF is easy to understand: it counts the number of times each word in the word bag appears in each article, with words that do not appear counted as 0. You can actually classify directly on raw TF, and on a small data set the performance is acceptable, but there is one very prominent problem: the length of an article has a great impact on the classification result, because the longer an article is, the more likely any word from the word bag is to appear in it. For example, an article in the humanities and social sciences may also contain the word 'car', and if the article is long, 'car' may appear more than once, so such an article is very likely to be classified as an automobile article, which is definitely wrong. So first normalize the raw count, that is, compute the frequency of the word within the article:

tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}

In the formula above, the numerator n_{i,j} is the number of times word i appears in article j, and the denominator is the total number of word occurrences in article j.
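As a minimal sketch of this normalization (counts is a made-up article-by-vocabulary matrix of raw word counts, not the original variable name), the row-wise division can be written with numpy broadcasting:

import numpy as np

# counts: one row per article, one column per vocabulary word (made-up numbers)
counts = np.array([[2., 0., 1.],
                   [0., 3., 3.]])

# term frequency: divide each row by that article's total word count
tf = counts / counts.sum(axis=1, keepdims=True)
# (the per-class code further below adds 1 to this denominator to avoid division by zero)
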
IDF measures the general importance of a word. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing that word, and then taking the logarithm of the quotient:

idf_i = \log \frac{|D|}{|\{\, j : n_{i,j} > 0 \,\}|}

Here the numerator |D| is the total number of articles in the sample set, and the denominator is the number of articles that contain word i (since every word's counts have already been tallied, this is computed as the total number of articles minus the number of articles in which the word does not appear). In practice, to prevent division by zero, the denominator is increased by 1, which is why the code reads

# document frequency of each word = docNum - number of articles in which its count is 0
idf = np.log(docNum / (1.0 + docNum - np.sum(self.__sample_data__ == 0, axis=0)))
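
A tiny self-contained check of that denominator trick (the numbers are made up purely for illustration):

import numpy as np

# 3 articles x 4 vocabulary words of raw counts (made-up)
X = np.array([[1, 0, 2, 0],
              [0, 0, 1, 3],
              [4, 0, 0, 1]])
docNum = X.shape[0]

# articles containing each word = all articles - articles where the count is 0
doc_freq = docNum - np.sum(X == 0, axis=0)   # -> array([2, 0, 2, 2])
# a word that appears in fewer articles gets a larger idf
idf = np.log(docNum / (1.0 + doc_freq))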

Then extend the tf calculation from a single article to the tf of each class, accumulating the normalized word frequencies of all articles belonging to that class:

for index, className in enumerate(self.__labelvec__):
    if className not in matrixDic:
        # one accumulator row per class, as wide as the vocabulary
        # (a numpy array, so that += below is an element-wise addition)
        matrixDic[className] = np.zeros(self.__sample_data__.shape[1])
    # add this article's normalized term frequencies (+1 avoids division by zero)
    matrixDic[className] += self.__sample_data__[index, :] / float(sum(self.__sample_data__[index, :]) + 1)

Finally, tf multiplied by the log-scaled idf computed above is the result we want: np.array(weightMatrix) * idf
A quick aside: when doing matrix operations, use vectorized expressions like the one above instead of for loops whenever you can; the efficiency difference is enormous. I ran a test this time: self.__sample_data__ is a matrix of roughly 5600 x 140000, and computing the idf of each column with a for loop took about 7 minutes. Of course there is still looping going on underneath, but the vectorized form above took only 11 seconds!
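As a rough illustration of that gap (on a smaller, randomly generated matrix; absolute timings will of course differ from the 5600 x 140000 case):

import numpy as np
import time

X = np.random.randint(0, 3, size=(2000, 5000))
docNum = X.shape[0]

# column-by-column for loop
start = time.time()
idf_loop = np.array([np.log(docNum / (1.0 + docNum - np.sum(X[:, j] == 0)))
                     for j in range(X.shape[1])])
print('loop       : %.2fs' % (time.time() - start))

# one vectorized expression over the whole matrix
start = time.time()
idf_vec = np.log(docNum / (1.0 + docNum - np.sum(X == 0, axis=0)))
print('vectorized : %.2fs' % (time.time() - start))

# both give exactly the same result
assert np.allclose(idf_loop, idf_vec)
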
4. Who is this Bayes?
So far the text has been turned into a matrix that can be computed with, so how do we compute with it? Bayes says: here I come! In fact, looking back at how we turned the articles into a matrix - by counting - it is natural to think of conditional probability: if words x1, x2 and x3 appear with high probability in class 1, then the probability of class 1 given that x1, x2 and x3 appear should also be high. The former can be estimated directly from the training data, and the latter is the posterior probability we actually want. Mr. Bayes gave the answer to this question long ago:

P(AB) = P(A) \cdot P(B \mid A) = P(B) \cdot P(A \mid B)

Going further, consider the general formula:

P(B \mid A_1, A_2, \ldots, A_n) = \frac{P(B)\,\prod_{i=1}^{n} P(A_i \mid B)}{P(A_1, A_2, \ldots, A_n)}

where A_i is the tf-idf value of the i-th word in the article, P(B | A_1, ..., A_n) is the probability of class B given that words A_1, A_2, ..., A_n appear, which is exactly the final result we want to compute, and P(A_i | B) is the probability of word A_i appearing given class B, which can be computed directly from the sample set. The denominator is the same for every class being scored, so it does not affect the comparison and need not be computed. We compute this score for the test sample under each class separately, and the class with the largest value is our classification result.
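In practice, multiplying many small probabilities quickly underflows floating point, so the code below works in the log domain (with an extra +1 smoothing term inside the logarithm); up to that smoothing, the quantity compared for each class B is

\log P(B) + \sum_{i=1}^{n} \log P(A_i \mid B)

and the class with the largest value wins.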

def Pridict(self, testVec):
    # prior term for each class, computed elsewhere in the class
    classProb = self.__classprob__
    test_array = np.array(testVec, dtype=float)
    # log-score of the test vector under each class:
    # sum of log(class weights * test tf-idf + 1) plus the class prior term
    vec = np.array([np.sum(np.log(self.dataMatrix[index, :] * test_array + 1)) + classProb[label]
                    for index, label in enumerate(self.labelVec)])
    # the class with the largest score is the prediction
    return self.labelVec[np.argmax(vec)]

This is what produced the results shown at the beginning of the article!
5. The word vector model
This part is only meant to throw out some rough ideas: I have only recently started working with word vectors and am afraid of misleading readers who don't know better, so take this as an introduction and as my own thinking, which may well be wrong! I installed gensim directly on Ubuntu 16 and trained a separate word2vec model for each class (maybe that idea is already wrong). Then I read the gensim source code and found that most_similar() essentially checks whether the query word is in the model and then returns similarity scores. With that idea, I ran every word of the article to be classified through each class model's most_similar(), took the first number of the result, summed over all the words, and assigned the article to whichever class got the largest total score; a rough sketch of this scoring is shown below.
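A minimal sketch of that scoring scheme (my own reconstruction; classModels and article_words are illustrative names, assuming one trained Word2Vec model per class and skipping words missing from a model's vocabulary):

from gensim.models import Word2Vec

def score_article(article_words, classModels):
    # classModels: dict mapping class name -> Word2Vec model trained on that class's articles
    scores = {}
    for className, model in classModels.items():
        total = 0.0
        for word in article_words:
            try:
                # similarity of the single most similar word, as described above
                total += model.most_similar(word, topn=1)[0][1]
            except KeyError:
                continue   # word not in this model's vocabulary
        scores[className] = total
    # the class with the largest accumulated score wins
    return max(scores, key=scores.get)
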
The classification results were, frankly, heartbreaking:

2017-11-27 10:24:22,104 - main - INFO - the class 0 rate is 2.12765957447%
2017-11-27 10:24:22,104 - main - INFO - the class 1 rate is 0.348432055749%
2017-11-27 10:24:22,104 - main - INFO - the class 2 rate is 2.98507462687%
2017-11-27 10:24:22,104 - main - INFO - the class 3 rate is 0.367647058824%
2017-11-27 10:24:22,104 - main - INFO - the class 4 rate is 1.08108108108%
2017-11-27 10:24:22,104 - main - INFO - the class 5 rate is 1.02040816327%
2017-11-27 10:24:22,104 - main - INFO - the class 6 rate is 0.0%
2017-11-27 10:24:22,104 - main - INFO - the class 7 rate is 0.735294117647%
2017-11-27 10:24:22,104 - main - INFO - the class 8 rate is 28.8372093023%
2017-11-27 10:24:22,104 - main - INFO - the class 9 rate is 31.5589353612%

A simple, practical introduction to gensim follows; the input is a file that has already been word-segmented:


import multiprocessing
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# train on a file with one segmented sentence per line; size is the vector dimension
model = Word2Vec(LineSentence(fileName), size=400, window=5, min_count=5, workers=multiprocessing.cpu_count())

# look up the vector of a word
print(model.wv[u'新车'])
[ -6.10132441e-02  -1.15644395e-01   5.97980022e-01   2.00566962e-01
   2.73794159e-02   1.44698828e-01   1.11371055e-01   2.28789579e-02
  ......
]

# words most similar to the query word
model.most_similar(u'新车')
[(u'\u62db\u6570', 0.9994064569473267), (u'\u7cfb\u5217', 0.9993598461151123),...]

# save the model
model.save(modelName)
model.wv.save_word2vec_format(vecName, binary=False)

# load the model
wvModel = Word2Vec.load(wvModelFile)

These are the basic operations; they are all very simple, and if you are interested you can go straight to the gensim source code.
6. Conclusion
Intuitively, I think classifying with word vectors should give better results, but if you are new to them you need to read more material first. The reason the title says 'from word frequency statistics to neural networks' is, first, that the word vector model itself is trained with a simple 3-layer network, and second, that word vectors can already be fed into an ANN for training; due to time constraints I have not explored this further, and hopefully the next article will be about training an ANN on word vectors. Please let me know if there are mistakes in the article or if anything is unclear. This was my first time using CSDN's formula editor, and I really couldn't get the hang of it; after fiddling for a long time, the formulas still ended up being posted as pictures, please don't judge ^_^
