1. Knowledge points
""" Install module: bs4 nltk gensim nltk: handle English 1, install 2, nltk.download() download the corresponding module English data processing: 1. Remove the html tag example = BeautifulSoup(df['review'][1000], 'html.parser').get_text() 2. Remove punctuation example_letter = re.sub(r'[^a-zA-Z]','',example) 3. Divide into words/token words = example_letter.lower ().split() 4. Remove the stop words. For example: the a an it's stopwords = {}.fromkeys([line.rstrip() for line in open('./ stopwords.txt ')]) words_nostop = [w for w in words if w not in stopwords] 5. Reorganize into new sentences. Word vector solutions: 1. One-hot encoding disadvantages: This kind of solution wastes storage space or is secondary, and more importantly, words and words (vector And vector), the computer is completely unable to understand and process even the slightest 2. A method based on singular value decomposition (SVD) Steps: a) The first step is to form the word space matrix X through the statistics of a large number of existing documents. There are two methods. One is to count the number of occurrences of each word in a document. Assuming that the number of words is W and the number of documents is M, then the dimension of X is W*M; the second method is for a specific word. Count the frequency of occurrence of other words in the text before and after it to form an X matrix of W*W. b) The second step is to perform SVD decomposition for the X matrix to obtain the eigenvalues, and intercept the first k eigenvalues and the corresponding first k eigenvectors as needed, then the dimension of the matrix formed by the first k eigenvectors is W*k, this It constitutes the k-dimensional representation vector of all W words. Disadvantages: 1. It is necessary to maintain a very large word space sparse matrix X, and it will often change with the emergence of new words; 2. SVD has a large amount of calculation and every After adding or subtracting a word or document, you need to recalculate 3. Build a word2vec model: Iteratively learn the parameters and encoding results of existing words through a large number of documents, so that every new document does not need to modify the existing model, only It requires iterative calculation parameters again and the word vector can be , for example: I love python and java a) CBOW algorithm: Input: I love, target value: python and java CBOW algorithm uses the word vector in the context window as input, and sums these vectors ( Or take the mean), find the correlation distribution with the output word space, and then use softmax get hit probability function over the entire output word space, and cross-entropy loss is the value of the target word one-hot encoding, through the loss for The gradient of the input and output word vectors can be adjusted using the gradient descent method to get an iterative adjustment for the input and output word vectors. b) Skip-Gram algorithm: Input: python and java, target value: I love the Skip-Gram algorithm uses the target word vector as input, finds its correlation distribution with the output word space, and then uses the softmax function to get the entire output word The hit probability in space is calculated one by one with the context word of one-hot encoding, and the sum is the loss value. Through the gradient of the input and output word vectors through loss, the gradient descent method can be used to get it once Iterative adjustment """ for input and output word vectors
2. Chinese data cleaning (using stop words)
import os
import re
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
import nltk
from nltk.corpus import stopwords
import jieba

def clean_chineses_text(text):
    """
    Chinese data cleaning (stopwords_chineses.txt is stored with the blog's files)
    :param text:
    :return:
    """
    text = BeautifulSoup(text, 'html.parser').get_text()  # Remove HTML tags
    text = jieba.lcut(text)  # Tokenize with jieba
    stopwords = {}.fromkeys([line.rstrip() for line in open('./stopwords_chineses.txt')])  # Load Chinese stop words
    eng_stopwords = set(stopwords)  # Deduplicate the stop words
    words = [w for w in text if w not in eng_stopwords]  # Remove stop words from the text
    return ''.join(words)
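As a hedged usage sketch, the cleaner can be applied column-wise to review data, mirroring the df['review'] example from the knowledge points; the DataFrame contents below are hypothetical and it assumes stopwords_chineses.txt is present.

# Hypothetical usage: apply the cleaner to every row of a review column
df = pd.DataFrame({'review': ['<br/>这部电影真的很好看！', '<p>剧情一般，不推荐。</p>']})
df['clean_review'] = df['review'].apply(clean_chineses_text)
print(df['clean_review'])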
3. English data cleaning (using stop words)
import os
import re
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
import nltk
from nltk.corpus import stopwords
import jieba

def clean_english_text(text):
    """
    English data cleaning (stopwords_english.txt is stored with the blog's files)
    :param text:
    :return:
    """
    text = BeautifulSoup(text, 'html.parser').get_text()  # Remove HTML tags
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # Keep English letters only
    words = text.lower().split()  # Lowercase and split into words
    stopwords = {}.fromkeys([line.rstrip() for line in open('./stopwords_english.txt')])  # Load English stop words
    eng_stopwords = set(stopwords)  # Deduplicate the stop words
    words = [w for w in words if w not in eng_stopwords]  # Remove stop words from the text
    print(words)
    return ' '.join(words)

if __name__ == '__main__':
    text = "ni hao ma ,hello! my name is haha'. ,<br/> "
    a = clean_english_text(text)
    print(a)

    test1 = "What are you doing, why don't you reply to my message! By the way, \"Your mother is looking for you\"."
    b = clean_chineses_text(test1)
    print(b)
4. Data cleaning with nltk's stop words
import re
from bs4 import BeautifulSoup
import nltk

def clean_english_text_from_nltk(text):
    """
    Clean English data using nltk's stop words
    :param text:
    :return:
    """
    text = BeautifulSoup(text, 'html.parser').get_text()  # Remove HTML tags
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # Remove punctuation, keep letters only
    words = text.lower().split()  # Lowercase and split into words
    stopwords = nltk.corpus.stopwords.words('english')  # nltk's built-in English stop words
    wordList = [word for word in words if word not in stopwords]  # Remove stop words from the text
    return ' '.join(wordList)
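A short usage sketch: nltk's stop word corpus must be downloaded once with nltk.download('stopwords') before the function can run; the sample sentence below is illustrative only.

# One-time download of nltk's stop word corpus (skip if already present locally)
nltk.download('stopwords')

sample = "<p>This is just an example, and it is not a real movie review!</p>"
print(clean_english_text_from_nltk(sample))
# Expected output along the lines of: "example real movie review"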