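Foreword

This is my first blog post of September, right at the start of the school term. As a student (hahaha), I figured I should contribute a properly substantive article for everyone.

I spent a lot of time working to understand the LDA algorithm, although this article does not cover the underlying theory. I do think you should work through the theory yourself; it is not that hard. Of course, if math gives you a headache and you shy away from difficulty, feel free to ignore that advice. But if you enjoy research, challenges, and thinking things through, I believe you will fall in love with LDA just as I did.

It is an ingenious algorithm with really elegant logic: a dazzling, jaw-dropping fireworks show of logical reasoning.

LDA is a classic topic-model algorithm. Today I focus on the code. The theory is page after page of math formulas, and typing it all out would cost me the whole day. Time is precious, so let's get straight to something practical.

This article relies on sklearn to implement LDA.

For the full documentation, see: sklearn.decomposition.LatentDirichletAllocation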

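Implementing and tuning the LDA topic model

1. Loading the corpus

When working with text, the most important thing is to be clear about what the corpus is. In topic modeling, the choice of corpus can strongly affect the results.

First we load the text we want to process.

Suppose all the data sits under some directory, and we assign that directory to path. It contains the many documents we have collected and organized ourselves.

In my code, I treat each document as a corpus of its own, so each line of a file becomes one input text for LDA. Depending on your needs, your corpus might instead be the entire collection of documents, in which case the code will differ a little.

So we start by loading the documents: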

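    import os

    # Python 2 code: str.decode below does not exist on Python 3 strings.
    fileList = os.listdir(path)

    for file in fileList:
        docList = []   # will hold the preprocessed lines of the current file
        filePath = os.path.join(path,file)

        f = open(filePath)

        doc = f.readlines()

        f.close()

        for line in doc:
            line = line.replace('\n',"")
            # drop bytes that are not valid UTF-8, then re-encode
            newline = line.decode('utf-8','ignore').encode('utf-8')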

One line of this code deserves a note:

  newline = line.decode('utf-8','ignore').encode('utf-8')

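Some input cannot be parsed correctly and raises an error, usually a decoding error. I have written detailed notes on decoding errors elsewhere on this blog, which you can look up.

If you still hit errors caused by a mismatch between UTF-8 text and the default ASCII encoding, you can put this at the top of the script (Python 2 only):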

import sys
reload(sys)
sys.setdefaultencoding("utf-8")
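
Both the decode/encode line and the setdefaultencoding workaround are Python 2 idioms. If you are on Python 3, a rough equivalent is to let open() do the decoding and skip invalid bytes. The sketch below is only an illustration: read_lines and the demo file are names I made up for it, not part of the original script.

```python
import os
import tempfile


def read_lines(file_path):
    """Return the lines of a text file, silently dropping bytes that are not valid UTF-8."""
    with open(file_path, encoding="utf-8", errors="ignore") as f:
        return [line.rstrip("\n") for line in f]


# Demo on a throwaway file that contains one invalid byte (0xff).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.txt")
with open(demo_path, "wb") as f:
    f.write(b"good line\nbad \xff byte\n")

lines = read_lines(demo_path)
print(lines)
```

The invalid byte simply disappears from the second line, which matches what decode('utf-8','ignore') does in the Python 2 code.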

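Here I preprocess every line. For preprocessing and the basic concepts behind it, see my earlier article: Common Preprocessing Operations in Natural Language Processing (自然语言处理中预处理的一些常用操作)

Each line is preprocessed on its own. The library we need for this is nltk, which you should install before using it.

One way to install it is: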

sudo pip install nltk

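You can also try other installation methods, for example the system package on Debian or Ubuntu:

sudo apt-get install python-nltk

You can search online for other ways to install nltk and try them out; it is not hard.

To preprocess the text we first need to import the library.
The text-processing imports used in this program are: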

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

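After importing, we can do the preprocessing. It generally consists of: lowercasing all text, removing punctuation, removing stop words, removing non-English characters (we are processing English text here), removing digits and special characters, tokenizing, and stemming. For what these terms mean, see my earlier articles, which explain them in detail.

If your text has not been proofread, you also need to fix errors. For example, if the text was typed in by the general public, it may contain many mistakes, and we can use TextBlob to correct them.

The version below is a fairly simple preprocessing routine: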

#***********************************
# The general preprocessing steps
#***********************************
def Preprocessing(text):
    # convert the text to lowercase
    text = text.lower()
    # replace every punctuation character with a space
    for c in string.punctuation:
        text = text.replace(c," ")
    # tokenize (needs the nltk 'punkt' data)
    wordList = nltk.word_tokenize(text)
    # remove stop words; build the set once instead of once per word
    stopSet = set(stopwords.words('english'))
    filtered = [w for w in wordList if w not in stopSet]

    # stem
    ps = PorterStemmer()
    filtered = [ps.stem(w) for w in filtered]

    return " ".join(filtered)

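For something a little more thorough, we can also strip digits, like this:

#***********************************
# The general preprocessing steps
#***********************************
def Preprocessing(text):
    # convert the text to lowercase
    text = text.lower()
    # replace every punctuation character with a space
    for c in string.punctuation:
        text = text.replace(c," ")

    # delete digits (Python 2 form of str.translate)
    text = text.translate(None,'0123456789')
    # tokenize (needs the nltk 'punkt' data)
    wordList = nltk.word_tokenize(text)
    # remove stop words; build the set once instead of once per word
    stopSet = set(stopwords.words('english'))
    filtered = [w for w in wordList if w not in stopSet]

    # stem
    ps = PorterStemmer()
    filtered = [ps.stem(w) for w in filtered]

    return " ".join(filtered)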

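A fuller version adds lemmatization after stemming (TextBlob is imported for optional spelling correction but is not called here):

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from textblob import TextBlob
from nltk.stem import WordNetLemmatizer


#***********************************
# The general preprocessing steps
#***********************************
def Preprocessing(text):
    # convert the text to lowercase
    text = text.lower()
    # replace every punctuation character with a space
    for c in string.punctuation:
        text = text.replace(c," ")

    # delete digits (Python 2 form of str.translate)
    text = text.translate(None,'0123456789')
    # tokenize (needs the nltk 'punkt' data)
    wordList = nltk.word_tokenize(text)
    # remove stop words; build the set once instead of once per word
    stopSet = set(stopwords.words('english'))
    filtered = [w for w in wordList if w not in stopSet]

    # stem
    ps = PorterStemmer()
    filtered = [ps.stem(w) for w in filtered]
    # lemmatize (needs the nltk 'wordnet' data)
    wordnet_lemmatizer = WordNetLemmatizer()
    filtered = [wordnet_lemmatizer.lemmatize(w) for w in filtered]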

    return " ".join(filtered)
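
One thing to watch out for: text.translate(None,'0123456789') only works on Python 2 strings. On Python 3 the same cleanup can be done with a translation table. The sketch below uses only the standard library so it runs anywhere; the tiny STOP_WORDS set and whitespace tokenization are stand-ins I chose for illustration, not replacements for nltk's stop-word list and word_tokenize, and it skips stemming entirely.

```python
import string

# Hypothetical mini stop-word list for illustration; the post uses nltk's English list.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "in", "to"}

# Map every punctuation character to a space and delete every digit.
_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation), string.digits)


def preprocess_py3(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation and digits, split on whitespace, drop stop words."""
    text = text.lower().translate(_TABLE)
    tokens = text.split()  # whitespace split stands in for nltk.word_tokenize
    return " ".join(w for w in tokens if w not in stop_words)


cleaned = preprocess_py3("The 3 cats are sleeping, in 2 boxes!")
print(cleaned)  # cats sleeping boxes
```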

Now we can call the preprocessing function directly:

newline = Preprocessing(newline)

The complete code looks like this:


def getStarter():

    fileList = os.listdir(path)

    for file in fileList:
        docList = []
        filePath = os.path.join(path,file)

        f = open(filePath)

        doc = f.readlines()

        f.close()

        for line in doc:
            line = line.replace('\n',"")
            newline = line.decode('utf-8','ignore').encode('utf-8')
            newline = Preprocessing(newline)
            docList.append(newline)


        LDA(docList,file)

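Now we write the LDA code itself.
We need to vectorize the text and then call sklearn's LDA class directly. Some parameters we may want to set ourselves; otherwise the library's defaults are used.

In the code below, the first two lines vectorize the text and the last two call sklearn to cluster it.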


    tf_vectorizer = CountVectorizer()

    tf = tf_vectorizer.fit_transform(docList)

    # n_components was called n_topics before scikit-learn 0.19
    lda = LatentDirichletAllocation(n_components=n_topic,max_iter=2000,learning_method='batch')

    lda.fit(tf)
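
To make the two steps concrete, here is a minimal Python 3 sketch that runs end to end. It assumes a recent scikit-learn, where the number of topics is passed as n_components. The four toy documents, the small max_iter, and random_state=0 are my own choices to keep the demo fast and repeatable; they are not settings from the original experiment.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents made up for this sketch; in the post, docList holds preprocessed lines.
docs = [
    "cat dog pet animal",
    "dog pet vet animal",
    "stock market trade price",
    "market price invest stock",
]
n_topic = 2

# Step 1: turn the texts into a document-term count matrix.
tf_vectorizer = CountVectorizer()
tf = tf_vectorizer.fit_transform(docs)

# Step 2: fit LDA on the counts.
lda = LatentDirichletAllocation(
    n_components=n_topic, max_iter=50, learning_method="batch", random_state=0
)
lda.fit(tf)

# components_ has one row per topic and one column per vocabulary word.
print(tf.shape, lda.components_.shape)
```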

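Once clustering is done, we may want to see what the clusters are. The number of clusters is something we set ourselves: it is the number of topics. For each one we can look at the topic words that represent it.

Printing the topic words

The code below outputs the keywords that LDA produces for each topic, writing them to a file: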

#***********************************

def print_top_words(model,feature_names,n_top_words,file,n_topic):
    # write the terms with the highest weight in each topic to a file
    # (outputPath is expected to be defined elsewhere in the script)
    
    filename = file.split(".")[0]+"Topic"
    filePath = os.path.join(outputPath,filename)
    fo = open(filePath,"a")
    line = "The total number of Topic is: {0}".format(n_topic) + "\n"
    fo.write(line)

    for topic_idx, topic in enumerate(model.components_):
        line1 = "Topic #{0}".format(topic_idx) + "\n"
        
        fo.write(line1)
        
#        cmd = '''echo "Topic #"{0}>>{1}'''.format(topic_idx,filePath)
#        os.system(cmd)
        line = " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1 : -1]])
        fo.write(line)
        fo.write("\n")

#        cmd = "echo {0}>>{1}".format(line,filePath)
#        os.system(cmd)


    fo.write("******************************\n")
    fo.close()
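
The slice topic.argsort()[:-n_top_words - 1 : -1] is the least obvious part of that function. argsort returns indices in ascending order of weight, and the negative-step slice walks backwards from the end, so you get the indices of the n_top_words largest weights, biggest first. A small numpy sketch with made-up weights and words:

```python
import numpy as np

# Made-up vocabulary and topic weights, just to trace the slice.
feature_names = ["apple", "bank", "cat", "dog", "eagle"]
topic = np.array([0.1, 2.5, 0.3, 1.7, 0.9])
n_top_words = 3

# argsort gives ascending order; the reversed slice keeps the last three, largest first.
top_idx = topic.argsort()[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]
print(top_words)  # ['bank', 'dog', 'eagle']
```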

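Assigning labels

This step is really about inspecting the clustering result.
Among topic models, LDA can be used as a (soft) clustering algorithm, so we want to show the clusters it finds.
LDA processes all the texts in a corpus and gives each text a probability for each topic. We take the topic with the highest probability as the one the text most likely belongs to.

The key here is two methods of the LDA model:
fit
and transform.
You need both: fit learns the topics from the data,
and transform returns, for each text, its probability distribution over the topics.
Concretely, we can do it like this: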


    topic_dist = lda.transform(tf)   # one row per text, one column per topic
    labels = [-1]*len(topic_dist)    # -1 means "no topic assigned"

    i = 0
    for dist in topic_dist:
        dist = dist.tolist()
        # skip rows where every topic has exactly the same probability
        if len(set(dist)) != 1:
            review_topic_index = dist.index(max(dist))
            review_max_topic_probality = dist[review_topic_index]
            labels[i]=review_topic_index
        i = i + 1

    print labels
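
The same labeling can be written more compactly with numpy. This is only a sketch on a hand-made matrix standing in for the output of lda.transform, so it runs without fitting a model; the numbers are invented for illustration. Rows where all topics tie keep the label -1, as in the loop above.

```python
import numpy as np

# Stand-in for lda.transform(tf): each row is one text's distribution over 3 topics.
topic_dist = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [1 / 3, 1 / 3, 1 / 3],  # all topics tie, so no label is assigned
])

# Index of the most probable topic in each row.
labels = topic_dist.argmax(axis=1)

# Rows whose largest and smallest probability coincide are left unlabeled (-1).
is_flat = np.isclose(topic_dist.max(axis=1), topic_dist.min(axis=1))
labels[is_flat] = -1

print(labels.tolist())  # [0, 2, -1]
```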

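Evaluating performance and choosing the best parameters

To evaluate the model and pick the best number of topics, we use perplexity as the criterion; the lowest value wins: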

    # set the parameters of the topic and find the best number of topics
    n_topics = range(1,81,2) 

    perplexityLst = [1.0]*len(n_topics)

    lda_models = []

    for idx, n_topic in enumerate(n_topics):

        # n_components was called n_topics before scikit-learn 0.19
        lda = LatentDirichletAllocation(n_components=n_topic,max_iter=2000,learning_method='batch')

        t0 = time.time()

        lda.fit(tf)

        perplexityLst[idx] = lda.perplexity(tf)
        lda_models.append(lda)

        cmd = "echo '# of Topic: '{0}>>{1}".format(str(n_topics[idx]),filePath)
        os.system(cmd)

        cmd = "echo 'done in' {0}'s', N-iter: {1}, perplexity Score: {2}>>{3}".format(time.time()-t0,lda.n_iter_,perplexityLst[idx],filePath)
        os.system(cmd)

        cmd = "echo {0},{1}>>{2}".format(str(n_topics[idx]),perplexityLst[idx],filePerplexityPath)
        os.system(cmd)
         	

        n_top_words = 20 
        tf_feature_names = tf_vectorizer.get_feature_names()
        print "feature names:"
        print tf_feature_names
        
        print_top_words(lda,tf_feature_names,n_top_words,file,n_topic)
       

    best_index = perplexityLst.index(min(perplexityLst))
    best_n_topic = n_topics[best_index]
    best_models = lda_models[best_index]
    cmd = "echo 'Best # of Topic'{0}>>{1}".format(best_n_topic,filePath)
    os.system(cmd)
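
Stripped of the logging, the selection loop boils down to: fit one model per candidate number of topics, record its perplexity, and keep the candidate with the lowest score. Here is a small Python 3 sketch of just that, assuming a recent scikit-learn; the toy documents, the candidate list, and max_iter=20 are placeholders I picked so it finishes quickly. Note that, like the code above, it scores perplexity on the training data itself; scoring on held-out documents is generally a more reliable guide, and on data this small the chosen number means little.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents made up for this sketch.
docs = [
    "cat dog pet animal",
    "dog pet vet animal",
    "stock market trade price",
    "market price invest stock",
    "goal match team player",
    "team player coach match",
]
tf = CountVectorizer().fit_transform(docs)

candidates = [2, 3, 4]  # candidate numbers of topics
scores = {}
for k in candidates:
    lda = LatentDirichletAllocation(
        n_components=k, max_iter=20, learning_method="batch", random_state=0
    )
    lda.fit(tf)
    scores[k] = lda.perplexity(tf)  # lower is better

# Keep the candidate with the smallest perplexity.
best_k = min(scores, key=scores.get)
print(scores)
print("best number of topics:", best_k)
```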


Afterword

Study hard and make progress every day.
I miss home, very, very much.

Reference

https://blog.csdn.net/TiffanyRabbit/article/details/76445909
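Without the contribution of the author of the post above, I would not be where I am today. My own write-up is still not great and needs revising. I will tidy it up when I have time.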
