NLP data cleansing with Python

1, Knowledge points

"""
Modules to install: bs4 nltk gensim
nltk: for processing English text
    1, install the package
    2, use nltk.download() to download the required data modules

English data processing:
    1, remove HTML tags: example = BeautifulSoup(df['review'][1000], 'html.parser').get_text()
    2, remove punctuation: example_letter = re.sub(r'[^a-zA-Z]', ' ', example)
    3, split into words/tokens: words = example_letter.lower().split()
    4, remove stop words, e.g. the, a, an, it's:
                stopwords = {}.fromkeys([line.rstrip() for line in open('./stopwords.txt')])
                words_nostop = [w for w in words if w not in stopwords]
    5, join the remaining words back into a new sentence

Word vector approaches:
    1, one-hot encoding (see the one-hot sketch after this notes block)
        Drawback: the wasted storage space is the minor issue; more importantly, there is no correlation
        between word vectors, so the computer cannot capture any relationship between words at all
    2, methods based on singular value decomposition (SVD) (see the SVD sketch after this notes block)
        Steps: a) First build the word-space matrix X from a large collection of existing documents; there are two ways.
                One is to count how many times each word appears in each document: if the vocabulary size is W and
                the number of documents is M, then X has dimension W * M;
                The other is, for each word, to count how often every other word appears around it in the text,
                which gives a W * W matrix X.
              b) Second, apply SVD to X and take the first k singular values and their corresponding singular vectors;
                the matrix formed by the first k singular vectors has dimension W * k, which gives a k-dimensional
                vector representation for all W words
        Disadvantages:
            1, a huge, sparse word-space matrix X has to be maintained, and it changes frequently as new words appear;
            2, SVD is computationally expensive, and it has to be recomputed every time a word or document is added or removed
    3, build a word2vec model: the word vectors and model parameters are learned jointly over many iterations
        through the documents, so a new document does not require modifying the existing model; only a few more
        iterations are needed to update the parameters and word vectors (see the gensim sketch after this notes block)
            Example sentence: I love python and java
            a) CBOW algorithm: input: I, love; target: python and java
                   CBOW uses the context words in the window as input; their vectors are summed (or averaged) and
                   mapped to an output distribution over the whole word space,
                   a softmax then turns this output into hit probabilities over the entire vocabulary, and the loss
                   is the cross entropy against the one-hot encoding of the target word.
                   The gradient of the loss with respect to the input and output word vectors is then used with
                   gradient descent to iteratively adjust the input and output word vectors.

            b) Skip-Gram algorithm: input: python and java; targets: I, love
                    Skip-Gram uses the target word's vector as input and maps it to an output distribution over the
                    associated word space,
                    a softmax then gives hit probabilities over the entire vocabulary; the cross entropy with the
                    one-hot encoding of each context word is computed individually,
                    these losses are summed, and the gradient of the total loss with respect to the input and output
                    word vectors is used with gradient descent to iteratively adjust the input and output word vectors
"""

2, Chinese data cleaning (using stop words)

import os
import re
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
import nltk
from nltk.corpus import stopwords
import jieba
def clean_chineses_text(text):
    """
    Clean Chinese text using the stopwords_chineses.txt file saved from the blog garden (cnblogs)
    :param text:
    :return:
    """
    text = BeautifulSoup(text, 'html.parser').get_text()  # remove HTML tags
    text = jieba.lcut(text)  # cut the text into words with jieba
    stopwords = {}.fromkeys([line.rstrip() for line in open('./stopwords_chineses.txt')])  # load stop words (Chinese)
    eng_stopwords = set(stopwords)  # remove duplicate words
    words = [w for w in text if w not in eng_stopwords]  # remove stop words from the text
    return ' '.join(words)

3, English data cleaning (using stop words)

import os
import re
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
import nltk
from nltk.corpus import stopwords
import jieba
def clean_english_text(text):
    """
    Clean English text using the stopwords_english.txt file saved from the blog garden (cnblogs)
    :param text:
    :return:
    """
    text = BeautifulSoup(text, 'html.parser').get_text()  # remove HTML tags
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # keep only English letters
    words = text.lower().split()  # lowercase everything and split into words
    stopwords = {}.fromkeys([line.rstrip() for line in open('./stopwords_english.txt')])  # load stop words (English)
    eng_stopwords = set(stopwords)  # remove duplicate words
    words = [w for w in words if w not in eng_stopwords]  # remove stop words from the text
    print(words)
    return ' '.join(words)

if __name__ == '__main__':
    text = "ni hao ma ,hello ! my name is haha'. ,<br/> "
    a = clean_english_text(text)
    print(a)

    test1 = "What are you doing, why don't you reply to my messages! Oh, 'your mother is looking for you.'"
    b = clean_chineses_text(test1)
    print(b)

4, stopwords_english.txt
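
The original post lists the contents of the stop-word file here. Given how clean_english_text() reads it, the expected format is simply one stop word per line; the snippet below writes a small illustrative sample, not the author's actual list:

# Illustrative only: write a minimal English stop-word file, one word per line,
# in the format that clean_english_text() expects (not the author's actual list)
sample_stopwords = ["a", "an", "the", "is", "are", "and", "or", "my", "your", "i", "it"]
with open('./stopwords_english.txt', 'w') as f:
    f.write('\n'.join(sample_stopwords))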
