TF-IDF & CNN


TF-IDF
----------------------------------------------------------------
The assumption is that the less frequently a word appears across the text collection, the greater its ability to distinguish between categories. This motivates the inverse document frequency (IDF): the product of TF and IDF is used as the coordinate value of the feature space.

Let Wi denote the weight of the i-th feature word, TFi(t, d) the frequency of term t in document d, N the total number of documents, and DF(t) the number of documents containing t. The TF-IDF weight is then

    Wi = TFi(t, d) × log(N / DF(t))

When a word occurs frequently in one document but rarely in the other documents, it distinguishes that document well, so its weight should be large.
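As a concrete illustration, here is a minimal sketch of the weight above, computed over a toy corpus of tokenized documents (the corpus itself is made up for illustration):

```python
import math

def tf_idf(term, doc, docs):
    """Weight of `term` in `doc`: term frequency times inverse document frequency."""
    tf = doc.count(term) / len(doc)            # TFi(t, d)
    df = sum(1 for d in docs if term in d)     # DF(t)
    idf = math.log(len(docs) / df)             # log(N / DF(t))
    return tf * idf

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]
# "cat" appears in 2 of 3 documents, so its IDF is log(3/2);
# "the" appears in every document, so its IDF (and weight) is 0.
w = tf_idf("cat", docs[0], docs)
```

Note how a word that occurs in every document gets weight 0 regardless of its term frequency, which is exactly the "suppress common words" behaviour the text describes.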

After sorting all words by weight, there are two common selection methods:
select a fixed number n of keywords with the largest weights
select the keywords whose weight exceeds a given threshold
Empirically, 10–15 keywords is an appropriate number for automatic selection, while 4–6 is more appropriate for manual selection; these counts generally give the best coverage and specificity.
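The two selection methods can be sketched as follows, using hypothetical weights standing in for the output of a TF-IDF pass:

```python
# Hypothetical TF-IDF weights for one document (illustrative values).
weights = {"neural": 0.42, "network": 0.35, "the": 0.0,
           "convolution": 0.51, "layer": 0.18, "a": 0.01}

# Method 1: keep the n highest-weighted words.
def top_n(weights, n):
    return [w for w, _ in sorted(weights.items(), key=lambda kv: -kv[1])[:n]]

# Method 2: keep every word whose weight exceeds a threshold.
def above(weights, threshold):
    return [w for w, v in weights.items() if v > threshold]
```

Method 1 gives a fixed-size keyword list; method 2 lets the list size vary with the document.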

TF-IDF also takes a word's ability to distinguish between categories into account: it assumes that the less frequently a word appears across texts, the better it separates different types of text. The inverse document frequency IDF is therefore introduced, the product of TF and IDF serves as the coordinate value of the feature space, and IDF is used to adjust the TF weights: the adjustment is meant to emphasize important words and suppress secondary ones. In essence, however, IDF is an attempt to suppress noise by weighting, and it simply assumes that words with low document frequency are more important and words with high document frequency are less useful, which is clearly not entirely correct. The simple structure of IDF cannot effectively reflect the importance of a word or the distribution of its features, so it does not handle the weight adjustment well, and the accuracy of the TF-IDF method is therefore not very high.

Document frequency (Document Frequency, DF) is one of the simplest feature selection algorithms: it is the number of texts in the whole data set that contain a given word. However, if a rare term appears only in a certain class of the training set yet reflects the characteristics of that class well, it will be filtered out for falling below the threshold, and the important information it carries for classification is discarded, so there is some impact on classification accuracy.
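A minimal sketch of DF-based selection (the document set and threshold are made up for illustration); note how the rare terms are dropped, which is precisely the risk the paragraph above describes:

```python
from collections import Counter

def df_select(docs, min_df):
    """Keep only the terms that appear in at least `min_df` documents."""
    df = Counter()
    for d in docs:
        df.update(set(d))        # each document counts a term at most once
    return {t for t, c in df.items() if c >= min_df}

docs = [["rare", "cat"], ["cat", "dog"], ["cat", "dog", "bird"]]
kept = df_select(docs, 2)        # "rare" and "bird" fall below the threshold
```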

When extracting text features, first remove the function words that are useless for text classification. Among content words, nouns and verbs are the most expressive, so it is possible to extract only the nouns and verbs of a text as its word-level text features characterizing the class.
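A sketch of keeping only nouns and verbs, assuming the tokens have already been tagged with Penn-style POS tags by some tagger (the tagged sentence here is made up; the tagger itself is out of scope):

```python
# Assumed input: (word, tag) pairs with Penn-style tags,
# where "NN*" marks nouns and "VB*" marks verbs.
tagged = [("network", "NN"), ("learns", "VBZ"), ("quickly", "RB"),
          ("the", "DT"), ("features", "NNS")]

# Keep only nouns and verbs as word-level features.
content_words = [w for w, tag in tagged
                 if tag.startswith("NN") or tag.startswith("VB")]
```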

According to statistics, words of more than one character are mostly common words and are not suitable as keywords, so a limit can be placed on the multi-character words among the extracted keywords. For example, if 5 keywords are extracted, at most 3 of them are allowed to be multi-character words.
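The cap on multi-character keywords can be sketched as a simple filter over a ranked candidate list (the candidates, the "multi-character" test, and the limits are all illustrative):

```python
def pick_keywords(ranked, total=5, max_multi=3):
    """Walk the ranked candidates, keeping at most `total` keywords and
    at most `max_multi` multi-character words among them."""
    picked, multi = [], 0
    for word in ranked:
        is_multi = len(word) > 1          # stand-in for "multi-character word"
        if is_multi and multi >= max_multi:
            continue                       # quota for multi-character words used up
        picked.append(word)
        multi += is_multi
        if len(picked) == total:
            break
    return picked

ranked = ["ab", "cd", "e", "fg", "hi", "j", "k"]   # best-first candidates
keys = pick_keywords(ranked)
```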


CNN (Convolutional Neural Network)
----------------------------------------------------------------
· Convolution operation & features
Take pictures as an example.
Several pictures all show the letter X; each picture looks different, but all of them are the letter X.
They always share several of the same features, such as line segments that resemble one another. Call these shared line segments features.
Convolution extracts each feature from the original image, giving a feature map (a feature map is simply a new image filled with the extracted values).
In a feature map, a value close to 1 means the corresponding position matches the feature completely, a value close to -1 means the position matches the reverse of the feature completely, and a value close to 0 means the position does not match or is uncorrelated.
For this picture of X we use three features, and therefore eventually produce three feature maps.
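The convolution step can be sketched as follows. The 5×5 "X" image (±1 pixels) and the diagonal feature kernel are illustrative choices, not taken from the original post; the kernel is normalised so that a perfect match scores close to 1, matching the ±1 convention above:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' cross-correlation: slide the kernel over the image and take
    the elementwise product-sum at each position (the CNN convolution)."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A 5x5 "X": 1 on both diagonals, -1 elsewhere.
image = np.where(np.eye(5) + np.fliplr(np.eye(5)) > 0, 1.0, -1.0)
# A 3x3 diagonal-stroke feature, normalised by its 9 entries.
kernel = (2 * np.eye(3) - 1) / 9
fmap = conv2d(image, kernel)    # 3x3 feature map
```

At the corners of the X, the window is exactly a diagonal stroke, so the feature map scores 1 there.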

· Non-linear activation layer
Through the action of a nonlinear activation function, every value < 0 in the feature map is set to 0.
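This "set everything below 0 to 0" rule is the ReLU activation; a minimal sketch (the feature-map values are made up):

```python
import numpy as np

def relu(feature_map):
    """ReLU: zero out every negative entry of the feature map."""
    return np.maximum(feature_map, 0)

fmap = np.array([[0.8, -0.3],
                 [-1.0, 0.2]])
activated = relu(fmap)
```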

· Pooling layer
Pooling reduces the amount of data in the feature map.
It comes in two kinds: Max Pooling and Average Pooling. As the names imply, max pooling takes the maximum value of each tile, and average pooling takes the mean.
Since max pooling keeps the maximum value in each tile, it is equivalent to keeping the best match in each tile (a value closer to 1 indicates a better match).
In this way a CNN can find out whether an image has a certain feature, which avoids the rigid pixel-by-pixel matching mentioned earlier.
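Max pooling over non-overlapping 2×2 tiles can be sketched as follows (the feature-map values are made up):

```python
import numpy as np

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: keep the largest value in each size x size tile."""
    h, w = fmap.shape
    trimmed = fmap[:h - h % size, :w - w % size]          # drop ragged edges
    return trimmed.reshape(h // size, size, w // size, size).max(axis=(1, 3))

fmap = np.array([[0.1,  0.9, -0.2, 0.4],
                 [0.3,  0.2,  0.8, 0.0],
                 [0.5, -1.0,  0.6, 0.7],
                 [0.2,  0.4,  0.1, -0.3]])
pooled = max_pool(fmap)     # 2x2 result: the best match in each tile survives
```

The 4×4 map shrinks to 2×2, but each tile's strongest response is preserved.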

· Fully connected layer
What the fully connected layer does is draw a conclusion from all the operations carried out before and give us the final result. Its main job is to change the dimensions of the feature maps in order to obtain a probability value for each classification category.

The convolution layer uses the idea of "local connection".
What about the parts outside the window that are left unconnected? As we know, a sliding window is used to connect them in turn. This relies on the idea of "parameter sharing": the parameters are the filter values, and as the window slides, the same filter is shared across every connected region of the original image for the convolution operation.

Looking back at the operations above, we obtain 2×2 feature maps and apply a fully connected network to them. The fully connected layer also has a very important component ---- Softmax, a classification function that outputs a probability value for each category. For example:
[0.5, 0.03, 0.89, 0.97, 0.42, 0.15] says there are six categories, and the fourth category has the maximum value, 0.97, so the input is judged to belong to the fourth category.
In this way the three-dimensional feature maps are turned directly into one-dimensional data, and that one-dimensional data is the probability values.
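The list above reads like raw per-class scores rather than a true probability distribution (the values do not sum to 1); an actual softmax, which does produce probabilities summing to 1, can be sketched as:

```python
import numpy as np

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    e = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return e / e.sum()

probs = softmax(np.array([1.0, 2.0, 0.5]))   # illustrative scores
predicted = int(np.argmax(probs))            # index of the most probable class
```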

· Training and optimization of the neural network
What is trained are the convolution kernels (filters).
The BP algorithm --- BackProp, the back-propagation algorithm --- is used, together with a large amount of training data.

For training, the data we use are generally pictures carrying a label. If the letter in the picture is X, then label = X; if the letter in the picture is A, then label = A. The label directly reflects the content of the picture.

At the beginning, before training, we define a convolution kernel of size 3×3. We do not know what its specific values should be, but they must not all be zero, so random initialization is used to assign them. At first there is an error, and the ultimate purpose of training is to minimize that error; the commonly used method is gradient descent.
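The training loop can be illustrated with a one-parameter gradient-descent sketch. This tunes a single scalar weight on made-up data; a real CNN back-propagates through every kernel value in the same fashion, so this is an illustration of the idea, not a full implementation:

```python
# Fit w so that w * x approximates the target, by minimising (w*x - target)^2.
x, target = 2.0, 6.0
w = 0.1                      # non-zero (random-ish) initialisation
lr = 0.05                    # learning rate
for _ in range(200):
    error = w * x - target   # prediction minus label
    grad = 2 * error * x     # gradient of the squared error w.r.t. w
    w -= lr * grad           # step against the gradient
```

After the loop, w converges to 3.0, the value that drives the error to its minimum.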

 


Origin www.cnblogs.com/luckcs/p/11237628.html