Python3 Natural Language Processing - Language Processing and Python

Note: For reprints, please contact the blogger through the WeChat public account "citation space", submit a reprint request in the background, and wait for a reply. Unauthorized reposts will be reported as plagiarism.

"Python Natural Language Processing" is the Stanford University Steven Bird, Edward Loper and Ewan Klein, eds NLP practical books, the book is clear, detailed, suitable for all skill levels of readers, highly recommended, but book with Python version Python2, and now mainstream and learn to use Python Python3, which gives readers learn NLP may cause problems, because Python3 not backward compatible Python2.
Through the study of the book, and try to use Python3 to realize the function of the book will be published in their own learning experiences on public number "citation space", in order to enable more people to see, to reprint this platform, the content will constantly updated. Due to my limited level, there are some features may be ill-considered when using Python3 operation or method is not easy, I urge readers to understand, if readers have new ideas or new ways of content, welcome to explore. Here Insert Picture Description
Chapter 1: Language Processing and Python
First, NLTK must be installed. It can be downloaded from http://www.nltk.org or installed with pip:

pip install nltk

After installation completes, start the Python interpreter and enter the following code:

import nltk
nltk.download()
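
This call opens NLTK's graphical downloader. If you prefer to skip the GUI, the same data can be fetched directly by passing a collection identifier ('book' is the collection this chapter uses), as a non-interactive alternative:

import nltk
nltk.download('book')  # downloads the "book" collection without opening the GUI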

In the downloader window that opens, select the row marked "book" and download the data. Once the download completes, the data can be imported in Python:

from nltk.book import *

After running this, the interpreter prints the following:

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

Whenever we want to find one of these texts later, we just type its name (for example text1):

>>>text1
<Text: Moby Dick by Herman Melville 1851>

The sections that follow in the book cover basic features of Python and are not discussed here. In the "Frequency Distributions" section, however, Python 3 behaves differently from Python 2. The book's example uses FreqDist to look up the 50 most frequent words of Moby Dick; the original code is as follows:

fdist1=FreqDist(text1)
vocabulary1=fdist1.keys()
vocabulary1[:50]

Running this code in Python 3 produces the following error:

Traceback (most recent call last):
  File "<pyshell#34>", line 1, in <module>
    vocabulary1[:50]
TypeError: 'dict_keys' object is not subscriptable
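
For reference, here is a quick sketch of the first workaround described below; it runs without error, but note what happens to the ordering:

vocabulary1 = list(fdist1.keys())  # dict_keys -> list, so slicing works again
vocabulary1[:50]                   # but the words are not in frequency order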

In other words, we can put the dictionary keys into a list and then slice that list, but we then find that the list's elements are in arbitrary order, as if FreqDist had had no effect. Looking more closely, the value and type of fdist1 in the code above are:

>>>fdist1
FreqDist({',': 18713, 'the': 13721, '.': 6862, 'of': 6536, 'and': 6024, 'a': 4569, 'to': 4542, ';': 4072, 'in': 3916, 'that': 2982, ...})
>>>type(fdist1)
<class 'nltk.probability.FreqDist'>

We can see that fdist1 is a special class in nltk, similar to a dictionary, and that in fdist1 the keys are displayed ordered by value from largest to smallest. We can therefore sort by value to arrange the key list (that is, the word list) by frequency of occurrence. The code is as follows:

fdist1=FreqDist(text1)
vocabulary1=sorted(fdist1,key=lambda k:fdist1[k],reverse=True)
vocabulary1[:50]

Running the code gives the following result:

[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like']

This matches the result in the book.
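
As an aside, since FreqDist inherits from collections.Counter, the same list can also be obtained in one line with most_common; this is an equivalent alternative, not the book's code:

vocabulary1 = [word for word, count in fdist1.most_common(50)]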

In addition, we can turn these words into a chart. The book produces a cumulative frequency plot, with the code:

fdist1.plot(50,cumulative=True)

With Python 2 the book produces a chart whose horizontal axis holds the 50 high-frequency words and whose vertical axis is Cumulative Percentage, but in Python 3 the generated figure is different:
[Figure: cumulative count plot of the top 50 words; x-axis Samples, y-axis Cumulative Counts]
As you can see, this is not a cumulative percentage plot but a cumulative count plot: the horizontal axis is Samples and the vertical axis is Cumulative Counts. Where does the problem lie? The plot call contains cumulative=True, which plainly means "cumulative"; if we remove that argument and leave only:

fdist1.plot(50)

the figure becomes the following:
[Figure: count plot of the top 50 words; x-axis Samples, y-axis Counts]
This is a word count plot with no accumulation, showing the top 50 words in descending order of frequency. But it is still not the percentage plot from the book: the vertical axis still reads Counts.
To get to the bottom of this, I looked at the source code of the plot method. I am using Python 3.8 installed in the default location, so the nltk package lives at:

C:\Users\Administrator\AppData\Local\Programs\Python\Python38\Lib\site-packages\nltk
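
If your installation lives somewhere else, the package location can be found from Python itself rather than hard-coding the path:

import nltk
print(nltk.__file__)  # prints the path of site-packages\nltk\__init__.py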

In that directory there is a file named probability.py. Opening it and locating the plot method, we find the following code:

    def plot(self, *args, **kwargs):
        """
        Plot samples from the frequency distribution
        displaying the most frequent sample first.  If an integer
        parameter is supplied, stop after this many samples have been
        plotted.  For a cumulative plot, specify cumulative=True.
        (Requires Matplotlib to be installed.)

        :param title: The title for the graph
        :type title: str
        :param cumulative: A flag to specify whether the plot is cumulative (default = False)
        :type title: bool
        """
        try:
            import matplotlib.pyplot as plt
        except ImportError:
            raise ValueError(
                'The plot function requires matplotlib to be installed.'
                'See http://matplotlib.org/'
            )

        if len(args) == 0:
            args = [len(self)]
        samples = [item for item, _ in self.most_common(*args)]

        cumulative = _get_kwarg(kwargs, 'cumulative', False)
        percents = _get_kwarg(kwargs, 'percents', False)
        if cumulative:
            freqs = list(self._cumulative_frequencies(samples))
            ylabel = "Cumulative Counts"
            if percents:
                freqs = [f / freqs[len(freqs) - 1] * 100 for f in freqs]
                ylabel = "Cumulative Percents"
        else:
            freqs = [self[sample] for sample in samples]
            ylabel = "Counts"
        # percents = [f * 100 for f in freqs]  only in ProbDist?

        ax = plt.gca()
        ax.grid(True, color = "silver")

        if "linewidth" not in kwargs:
            kwargs["linewidth"] = 2
        if "title" in kwargs:
            ax.set_title(kwargs["title"])
            del kwargs["title"]

        ax.plot(freqs, **kwargs)
        ax.set_xticks(range(len(samples)))
        ax.set_xticklabels([text_type(s) for s in samples], rotation=90)
        ax.set_xlabel("Samples")
        ax.set_ylabel(ylabel)

        plt.show()

        return ax

Let's pull lines 26 and 27 of this method out on their own:

cumulative = _get_kwarg(kwargs, 'cumulative', False)
percents = _get_kwarg(kwargs, 'percents', False)

and then look at the _get_kwarg method:

def _get_kwarg(kwargs, key, default):
    if key in kwargs:
        arg = kwargs[key]
        del kwargs[key]
    else:
        arg = default
    return arg
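
A toy call with made-up values shows the helper's pop-with-default behavior:

kwargs = {'cumulative': True, 'linewidth': 3}
cumulative = _get_kwarg(kwargs, 'cumulative', False)  # True, and the key is removed from kwargs
percents = _get_kwarg(kwargs, 'percents', False)      # False: the key was absent, so the default is used
print(cumulative, percents, kwargs)                   # True False {'linewidth': 3}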

We can see that the default value of cumulative is False, and that alongside cumulative there is another parameter, percents, which stands for percentage frequencies. Next, let's analyze lines 28-33 of the method:

if cumulative:
    freqs = list(self._cumulative_frequencies(samples))
    ylabel = "Cumulative Counts"
    if percents:
        freqs = [f / freqs[len(freqs) - 1] * 100 for f in freqs]
        ylabel = "Cumulative Percents"

As can be seen, percents is only checked inside the cumulative branch, so any percentage plot drawn this way is necessarily cumulative. With this method, then, we can draw only these kinds of chart: a word count plot, a cumulative count plot, and a cumulative percentage plot. Examples of the first two were given above; to reproduce the cumulative percentage figure from the book, we add the percents parameter found in the source:

fdist1.plot(50,cumulative=True,percents=True)

The resulting figure is this:
[Figure: cumulative percentage plot of the top 50 words; x-axis Samples, y-axis Cumulative Percents, reaching 100%]
At first glance this seems fine, but the sample in this figure is only 50 words, so the cumulative percentage reaches 100% at the end, whereas the sample in the book's figure is all of the words, so its final cumulative percentage does not reach 100%. The reason for this difference is unclear to me; it may come from differences between Python 3 and Python 2 themselves, or it may be an oversight due to gaps in my own knowledge. Readers are welcome to comment and discuss.
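
If we want a figure closer to the book's, one option is to build the plot ourselves and normalize by the token count of the whole text instead of by the 50 plotted samples. The following is a minimal sketch, under my assumption that this is what the old version effectively did:

import matplotlib.pyplot as plt

samples = [word for word, count in fdist1.most_common(50)]
cum_counts = []
total = 0
for word in samples:
    total += fdist1[word]  # running total over the 50 most frequent words
    cum_counts.append(total)
percents = [c / fdist1.N() * 100 for c in cum_counts]  # N() is the token count of the whole text

plt.plot(percents, linewidth=2)
plt.xticks(range(len(samples)), samples, rotation=90)
plt.xlabel("Samples")
plt.ylabel("Cumulative Percentage")
plt.show()

With this normalization the curve ends below 100%, which matches the shape described in the book.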
