NLP data cleaning

1. Knowledge points


""" 
Install module: bs4 nltk gensim 
nltk: handle English 
    1, install 
    2, nltk.download() download the corresponding module 

English data processing:
    1. Remove HTML tags, e.g. example = BeautifulSoup(df['review'][1000], 'html.parser').get_text()
    2. Remove punctuation, e.g. example_letter = re.sub(r'[^a-zA-Z]', ' ', example)
    3. Split into words/tokens: words = example_letter.lower().split()
    4. Remove stop words such as "the", "a", "an", "it's":
                stopwords = {}.fromkeys([line.rstrip() for line in open('./stopwords.txt')])
                words_nostop = [w for w in words if w not in stopwords]
    5. Reassemble the remaining words into a new sentence.

Word vector solutions:
    1. One-hot encoding
        Disadvantages: the storage wasted on huge sparse vectors is only a secondary problem; more importantly,
        any two word vectors are orthogonal, so the computer cannot infer even the slightest relationship
        between words (vector and vector) from them.
    2. Methods based on singular value decomposition (SVD); see the SVD sketch after this block
        Steps: a) First, build the word-space matrix X from statistics over a large collection of existing
                  documents. There are two ways: one is to count how many times each word appears in each
                  document; if the number of words is W and the number of documents is M, X has dimension W*M.
                  The other is, for a given word, to count how often every other word appears in the text
                  before and after it, which gives a W*W matrix X.
               b) Second, apply SVD to X to obtain its singular values, keep the first k of them and the
                  corresponding first k singular vectors as needed; the matrix formed by those first k vectors
                  then has dimension W*k, which constitutes a k-dimensional representation vector for each of
                  the W words.
        Disadvantages:
            1. A very large, sparse word-space matrix X must be maintained, and it changes whenever new words
               appear;
            2. SVD is computationally expensive, and it has to be recomputed every time a word or document is
               added or removed.
    3. Build a word2vec model: the model parameters and the encodings of existing words are learned iteratively
       from a large number of documents, so a new document does not require modifying the existing model; the
       parameters are simply updated by further iterations and the word vectors are obtained.
       Example sentence: "I love python and java" (see the gensim sketch after this block)
            a) CBOW algorithm: input: the context "I love ... and java", target: "python"
               CBOW takes the word vectors in the context window as input, sums them (or takes their mean),
               computes their correlation with the output word space, and then applies softmax to get a hit
               probability over the whole output vocabulary; the cross-entropy loss against the one-hot
               encoding of the target word, through its gradients with respect to the input and output word
               vectors, drives one gradient-descent adjustment of those vectors.

            b) Skip-Gram algorithm: input: "python", targets: "I", "love", "and", "java"
               Skip-Gram takes the target word's vector as input, computes its correlation with the output
               word space, and then applies softmax to get a hit probability over the whole output vocabulary;
               the cross-entropy against the one-hot encoding of each context word is computed one by one and
               summed to give the loss value. The gradients of this loss with respect to the input and output
               word vectors again drive one gradient-descent adjustment of those vectors.
"""

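Point 3 can be tried directly with gensim (one of the modules listed above). The snippet below is a minimal sketch on a toy corpus; the corpus and the hyperparameters (vector_size, window, epochs) are illustrative choices. In gensim 4.x, sg=0 selects CBOW and sg=1 selects Skip-Gram; in gensim versions before 4 the vector_size parameter was called size.

# Minimal word2vec sketch with gensim (toy corpus; hyperparameters are illustrative)
from gensim.models import Word2Vec

sentences = [
    ["i", "love", "python", "and", "java"],
    ["i", "love", "nlp"],
    ["python", "and", "nlp", "are", "fun"],
]

# sg=0 -> CBOW: the context window predicts the centre word
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=50)

# sg=1 -> Skip-Gram: the centre word predicts each word in its context window
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(cbow_model.wv["python"][:5])                      # first 5 dimensions of the learned vector
print(skipgram_model.wv.most_similar("python", topn=3))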

2. Chinese data cleaning (using stop words)


import os
import re
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
import nltk
from nltk.corpus import stopwords
import jieba

def clean_chineses_text(text):
    """
    Chinese data cleaning; stopwords_chineses.txt is stored in the cnblogs file area
    :param text: raw text, possibly containing HTML
    :return: cleaned text with stop words removed
    """
    text = BeautifulSoup(text, 'html.parser').get_text()  # remove HTML tags
    words = jieba.lcut(text)  # tokenize with jieba
    stopwords = {}.fromkeys([line.rstrip() for line in open('./stopwords_chineses.txt', encoding='utf-8')])  # load stop words (Chinese)
    eng_stopwords = set(stopwords)  # de-duplicate the stop words
    words = [w for w in words if w not in eng_stopwords]  # remove stop words from the text
    return ''.join(words)
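
A minimal usage sketch (the sample sentence is illustrative; it assumes jieba is installed and ./stopwords_chineses.txt exists in the working directory):

# Hypothetical example call; the output depends on the contents of stopwords_chineses.txt
print(clean_chineses_text("<br/>今天天气真好，我们一起去公园散步吧！"))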


3. English data cleaning (using stop words)


import os
import re
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
import nltk
from nltk.corpus import stopwords
import jieba

def clean_english_text(text):
    """
    English data cleaning; stopwords_english.txt is stored in the cnblogs file area
    :param text: raw text, possibly containing HTML
    :return: cleaned text with stop words removed
    """
    text = BeautifulSoup(text, 'html.parser').get_text()  # remove HTML tags
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # keep only English letters
    words = text.lower().split()  # lowercase and split into words
    stopwords = {}.fromkeys([line.rstrip() for line in open('./stopwords_english.txt')])  # load stop words (English)
    eng_stopwords = set(stopwords)  # de-duplicate the stop words
    words = [w for w in words if w not in eng_stopwords]  # remove stop words from the text
    print(words)
    return ' '.join(words)

if __name__ == '__main__':
    text = "ni hao ma ,hello! my name is haha'. ,<br/> "
    a = clean_english_text(text)
    print(a)

    test1 = "What are you doing, why don't you reply to my message! By the way, 'your mother is looking for you'."
    b = clean_chineses_text(test1)
    print(b)


4. Data cleaning using nltk's stop words


import re
import nltk
from bs4 import BeautifulSoup

def clean_english_text_from_nltk(text):
    """
    Clean English data using nltk's stop word list
    :param text: raw text, possibly containing HTML
    :return: cleaned text with stop words removed
    """
    text = BeautifulSoup(text, 'html.parser').get_text()  # remove HTML tags
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # remove punctuation, keep only letters
    words = text.lower().split()  # lowercase and split into words
    stopwords = nltk.corpus.stopwords.words('english')  # nltk's English stop word list
    wordList = [word for word in words if word not in stopwords]  # remove stop words
    return ' '.join(wordList)
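
A minimal usage sketch (the sample sentence is illustrative): the nltk stop word list has to be downloaded once with nltk.download('stopwords') before nltk.corpus.stopwords.words('english') can be used.

# One-time download of nltk's stop word corpus, then an example call
import nltk
nltk.download('stopwords')

print(clean_english_text_from_nltk("Hello!<br/> This movie was not as good as I expected, but I loved the music."))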



Origin blog.csdn.net/u010451780/article/details/110957195