1. Knowledge points
""" Installation Module: bs4 nltk gensim nltk: handle English 1, the installation 2, nltk.download () module download the appropriate English data processing: 1, remove the html tags example = BeautifulSoup (df [ 'review'] [1000], 'html.parser'). Get_text () 2, removing punctuation example_letter = re.sub (r '[^ a-zA-Z]', '', example) 3, cut into word / token words = example_letter.lower (). Split () 4, for example, remove stop words: the a an it's stopwords = {}.fromkeys([line.rstrip() for line in open('./stopwords.txt')]) words_nostop = [w for w in words if w not in stopwords] 5, reorganized into a new sentence Word vector solution: 1, one-hot encoding Cons: This solution wasting storage space or minor, more importantly, there is no correlation between words (vector and vector), the computer can not even fully understand and deal with a little bit of 2, based on the method of singular value decomposition (SVD) of Steps: a) The first step is to form the word space matrix X by a large number of existing document statistics, there are two ways. One is the number of times an article documents the statistics of each word appears, the word is assumed that the number of W, the document number of articles is M, X at this time dimension is W * M; The second method is for a particular word, the statistics appear in the text before and after the other word frequency, thus forming an X W * W matrix. 
b) The second step is the SVD for the X matrix, eigenvalues, the first k eigenvalues and corresponding eigenvectors of the first k taken as required, Then the dimension of the matrix first k eigenvectors is W * k, which constitute all the W-word k-dimensional vector representation Disadvantages: 1, the need to maintain a great word space sparse matrix X, and with the advent of new words will change frequently; 2, SVD large amount of computation, and after each increase or decrease in a word or document, you need to be recalculated 3, build a word2vec model: parameters and results have been encoded word document in which learning through a lot of iterations, so that each new one document do not modify existing models, only requires an iterative calculation parameters and again to the word vector Example: I love the python and java a) CBOW algorithm: Input: I love target: python and java CBOW vector algorithm uses the context window word as input, after these vectors summed (or averaged), an output obtained by correlation distribution word space, Further use softmax function gets hit probability over the entire space of the output terms, cross entropy loss value and the target is the one-hot encoded word, Loss by the input and output words for the gradient vector, to use gradient descent (gradient descent) to give an iteration method for adjusting the input and output vectors of words. b) Skip-Gram algorithm: Input: python and java, target values: I love Skip-Gram target word vector algorithm uses as an input, which is determined with the output distribution associated word space, Further use softmax function gets hit probability over the entire space of the output terms, individually calculating the cross-entropy with the word context one-hot encoding, After loss value is summed by loss gradient vector for the inputs and outputs of the word, Method for the adjustment iteration to obtain the input and output words to use gradient descent vector (gradient descent) """
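The CBOW training step described above can be sketched in a few lines of numpy. This is a toy illustration of the mechanics, not gensim's implementation: one (context, target) pair from the example sentence, averaged context vectors, a softmax over the vocabulary, cross-entropy against the one-hot target, and gradient-descent updates on the input and output vectors. The vocabulary, dimensions, and learning rate are all made up for the demo.

```python
import numpy as np

np.random.seed(0)

sentence = "i love the python and java".split()
vocab = {w: i for i, w in enumerate(sorted(set(sentence)))}
V, dim = len(vocab), 8                   # vocabulary size, embedding dimension

W_in = np.random.randn(V, dim) * 0.5     # input (context) word vectors
W_out = np.random.randn(dim, V) * 0.5    # output word vectors


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def cbow_step(context, target, lr=0.5):
    """One CBOW step: average the context vectors, softmax over the vocabulary,
    cross-entropy loss against the one-hot target, then gradient descent."""
    global W_in, W_out
    ctx_ids = [vocab[w] for w in context]
    t = vocab[target]
    h = W_in[ctx_ids].mean(axis=0)            # averaged context representation
    p = softmax(h @ W_out)                    # hit probability over vocabulary
    loss = -np.log(p[t])                      # cross-entropy with one-hot target
    err = p - np.eye(V)[t]                    # softmax/cross-entropy error signal
    W_out -= lr * np.outer(h, err)            # adjust output word vectors
    grad_h = W_out @ err                      # gradient shared by context words
    for i in ctx_ids:
        W_in[i] -= lr * grad_h / len(ctx_ids)  # adjust input word vectors
    return loss


context, target = ["i", "love", "and", "java"], "python"
losses = [cbow_step(context, target) for _ in range(50)]
print(losses[0], losses[-1])  # the loss shrinks as the vectors adjust
```

Skip-Gram differs only in the direction: the input is the single target word's vector, and the cross-entropy is computed against each context word separately and summed.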
2. Chinese data cleaning (with stop words)
import os
import re
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
import nltk
from nltk.corpus import stopwords
import jieba


def clean_chineses_text(text):
    """
    Clean Chinese data; the stop words are stored in the local file
    stopwords_chineses.txt (collected from a cnblogs post)
    :param text: raw text, possibly containing HTML
    :return: cleaned text, tokens joined by spaces
    """
    text = BeautifulSoup(text, 'html.parser').get_text()  # remove HTML tags
    text = jieba.lcut(text)  # segment into words
    stopwords = {}.fromkeys([line.rstrip() for line in open('./stopwords_chineses.txt')])  # load stop words (Chinese)
    eng_stopwords = set(stopwords)  # deduplicate the stop words
    words = [w for w in text if w not in eng_stopwords]  # remove stop words from the text
    return ' '.join(words)
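Both cleaning functions in these notes load their stop words from a local file. For quick experiments, the same pipeline can be sketched in a self-contained form, with a tiny inline stop-word set standing in for the file and a crude regex instead of BeautifulSoup for tag removal (both stand-ins are illustrative only):

```python
import re

# Tiny illustrative stop-word set; the real code loads ./stopwords_chineses.txt
# or ./stopwords_english.txt from disk.
STOPWORDS = {"the", "a", "an", "is", "it"}


def clean_text_inline(text):
    """File-free sketch of the cleaning pipeline: strip tags, keep letters,
    lowercase, tokenize, drop stop words, rejoin."""
    text = re.sub(r'<[^>]+>', ' ', text)    # crude HTML tag removal
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # keep only English letters
    words = text.lower().split()            # lowercase and tokenize
    words = [w for w in words if w not in STOPWORDS]  # drop stop words
    return ' '.join(words)


print(clean_text_inline("It is <br/> a GOOD movie!"))  # -> good movie
```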
3. English data cleaning (with stop words)
import os
import re
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
import nltk
from nltk.corpus import stopwords
import jieba


def clean_english_text(text):
    """
    Clean English data; the stop words are stored in the local file
    stopwords_english.txt (collected from a cnblogs post)
    :param text: raw text, possibly containing HTML
    :return: cleaned text, tokens joined by spaces
    """
    text = BeautifulSoup(text, 'html.parser').get_text()  # remove HTML tags
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # keep only English letters
    words = text.lower().split()  # lowercase and split into words
    stopwords = {}.fromkeys([line.rstrip() for line in open('./stopwords_english.txt')])  # load stop words (English)
    eng_stopwords = set(stopwords)  # deduplicate the stop words
    words = [w for w in words if w not in eng_stopwords]  # remove stop words from the text
    print(words)
    return ' '.join(words)


if __name__ == '__main__':
    text = "ni hao ma ,hello ! my name is haha'. ,<br/> "
    a = clean_english_text(text)
    print(a)
    test1 = "What are you doing? Why don't you answer my messages! Your mother is looking for you."
    b = clean_chineses_text(test1)
    print(b)
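The sklearn imports above (CountVectorizer, LogisticRegression, RandomForestClassifier) point at the usual next step: turning the cleaned sentences into bag-of-words features and fitting a classifier. A minimal sketch of that step, using toy, made-up reviews and labels purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy, made-up data: already-cleaned reviews and sentiment labels (1 = positive).
reviews = [
    "great movie loved acting",
    "terrible plot boring scenes",
    "wonderful film great cast",
    "awful boring waste time",
]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer(max_features=5000)  # bag-of-words features
X = vectorizer.fit_transform(reviews)            # shape: (n_docs, n_terms)

clf = LogisticRegression()
clf.fit(X, labels)

# Classify a new cleaned sentence with the same vectorizer.
print(clf.predict(vectorizer.transform(["boring terrible movie"])))
```

In practice the reviews would come out of clean_english_text / clean_chineses_text, and a RandomForestClassifier can be swapped in for the LogisticRegression with the same fit/predict interface.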
4. stopwords_english.txt