Review data extraction and processing with pandas and textrank

1. Extract the 11th column into contents

contents = []
contents = df.iloc[:, 10]  # column 11 of the data (position 10)

2. Extract data by type

good = df.loc[df["评价类型"] == "好评"]  # rows whose evaluation type is "Good"
good_contents = good.iloc[:, 10]
good_contents.index = list(range(good_contents.shape[0]))  # re-index from 0
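
The filtering step can be tried on a toy table. This is a minimal sketch, assuming a two-column DataFrame; the column names, labels, and review texts here are made-up stand-ins, not the author's data. It uses `reset_index(drop=True)`, which has the same effect as assigning a fresh `range` index by hand.

```python
import pandas as pd

# Hypothetical stand-in data: the real table has at least 11 columns.
df = pd.DataFrame({
    "evaluation type": ["Good", "Bad", "Good"],
    "content": ["clear screen", "weak battery", "fast delivery"],
})

# Boolean mask keeps only the rows labelled "Good".
good = df.loc[df["evaluation type"] == "Good"]

# Select the text column by position, then renumber the rows from 0.
good_contents = good.iloc[:, 1].reset_index(drop=True)

print(good_contents.tolist())
print(list(good_contents.index))
```

Without the re-indexing, the filtered Series keeps the original row labels (0 and 2 here), so a loop that reads `contents[i]` for `i` in `range(len(contents))` would fail on the missing label 1.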

3. Extract Keywords

import jieba.analyse
import pandas as pd

def textrank(contents, topK):
    cons = []
    for i in range(len(contents)):
        content = contents[i]
        # TextRank keyword extraction, filtered by part of speech
        keywords = jieba.analyse.textrank(content, topK=topK, allowPOS=('n', 'nz', 'v', 'vd', 'vn', 'l', 'a', 'd'))
        word_split = " ".join(keywords)  # join one review's keywords with spaces
        # print(word_split)
        cons.append(word_split)

    result = pd.DataFrame(cons)  # one row of keywords per review
    return result
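
To make the ranking step less of a black box, here is a simplified sketch of the TextRank idea on text that has already been segmented. It is not jieba's implementation: the function name `textrank_tokens`, the window size, the damping factor, and the iteration count are all assumptions chosen for illustration, and there is no part-of-speech filtering.

```python
from collections import defaultdict


def textrank_tokens(tokens, top_k=3, window=2, damping=0.85, iterations=50):
    """Rank pre-segmented tokens with a simplified TextRank (illustrative only)."""
    # Link every pair of distinct tokens that appear within `window` positions.
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + 1 + window]:
            if other != word:
                weights[word][other] += 1.0
                weights[other][word] += 1.0
    if not weights:
        return []

    # Total edge weight attached to each node, used to normalise contributions.
    out_weight = {w: sum(nbrs.values()) for w, nbrs in weights.items()}

    # Iterate the PageRank-style update a fixed number of times.
    scores = {w: 1.0 for w in weights}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) + damping * sum(
                wt / out_weight[nb] * scores[nb] for nb, wt in nbrs.items()
            )
            for w, nbrs in weights.items()
        }

    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


tokens = ["good", "screen", "good", "battery", "good", "price"]
print(textrank_tokens(tokens, top_k=1, window=1))
```

A word scores highly when it co-occurs with many other words, which is why the ranking depends on word order and neighbourhood rather than on raw counts alone.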

4. Write the extracted keywords to a file

result.to_csv("keywords.csv", encoding = 'utf-8',index=False)
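
To see what the output file will contain, the same call can be pointed at an in-memory buffer. This is a small sketch with made-up keyword strings; the variable names `buffer` and `lines` are only for illustration.

```python
import io

import pandas as pd

# Hypothetical first-stage output: one space-joined keyword string per review.
cons = ["screen battery", "price fast"]
result = pd.DataFrame(cons)

# Write to a text buffer instead of a file so the CSV text can be inspected.
buffer = io.StringIO()
result.to_csv(buffer, index=False)

lines = buffer.getvalue().splitlines()
print(lines)
```

Because the DataFrame is built from a plain list, its only column is named `0`, and that becomes the header row. Building it with `pd.DataFrame(cons, columns=["keywords"])` would give the file a readable header instead.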

5. Problems

Question: When content is passed in, the textrank algorithm first segments it into words. In the first stage, keywords are extracted from each comment text; in the second stage, keywords are extracted again from all of those keywords, which are already separated by spaces. Is this reasonable?

--> Alternatively, work directly from word frequency and output the most frequent keywords.

Question: In the second stage, the first-stage keyword sets are simply joined together with spaces. Is this reasonable?

--> To be solved.

--> Adjectives and adverbs are not output.
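
The word-frequency alternative mentioned above can be sketched with the standard library. This is only one possible reading of that idea: the function name `top_keywords_by_frequency` and the sample keyword strings are hypothetical, and the input is assumed to be the space-joined keyword strings produced in the first stage.

```python
from collections import Counter


def top_keywords_by_frequency(keyword_lines, top_k=3):
    """Count how often each first-stage keyword appears across all reviews."""
    counts = Counter()
    for line in keyword_lines:
        # Each line is one review's keywords, separated by spaces.
        counts.update(line.split())
    return counts.most_common(top_k)


# Hypothetical first-stage output for three reviews.
first_stage = ["screen clear battery", "battery durable", "screen battery price"]
print(top_keywords_by_frequency(first_stage, top_k=2))
# prints [('battery', 3), ('screen', 2)]
```

Counting avoids running a graph-based ranking on text whose word order is an artifact of joining keyword lists, which is the concern raised in the second question.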

  

  



Origin www.cnblogs.com/yuyu-blog/p/11569984.html