An introductory tutorial on natural language tools in Python

This article is an introductory tutorial on using some natural language tools in Python. It is adapted from a technical article on IBM's website; readers who need it can refer to the original.
NLTK is an excellent tool for teaching and practicing computational linguistics using Python. Computational linguistics is closely related to fields such as artificial intelligence, language and speech recognition, translation, and grammar checking.
What NLTK includes

NLTK is naturally seen as a series of layers arranged in a stack, each layer built on top of the one below. For readers familiar with the grammar and parsing of artificial languages (such as Python), understanding the similar, but more esoteric, layers of natural language models will not be too difficult.
Glossary

Corpora: Collections of related texts. For example, the works of Shakespeare might collectively be called a corpus, and the works of several authors, corpora.

Histogram: The statistical distribution of the frequency of occurrence of different words, letters, or other items in the data set.

Syntagmatic: The study of syntagma; that is, of the statistical relations of letters, words, or phrases that occur contiguously in a corpus.

Context-free grammar: Type 2 in Noam Chomsky's hierarchy of four types of formal grammars. See Resources for a detailed description.

Although NLTK comes with many corpora that have been preprocessed (often manually) to varying degrees, conceptually each layer relies on the processing of the adjacent lower layer. Tokenization comes first; then words are tagged; then groups of words are parsed into grammatical elements such as noun phrases or sentences (according to one of several techniques, each with its own advantages and drawbacks); and finally sentences or other grammatical units are classified. Along the way, NLTK lets you generate statistics about the occurrences of the various elements and draw graphs that represent either the processing itself or statistical aggregates.
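(A side note on versions: the listings in this article use an older NLTK interface, with modules such as nltk.tokenizer and nltk.parser. As a rough orientation only, here is a minimal sketch of the same layering written against a modern NLTK 3.x release; it assumes the punkt and averaged_perceptron_tagger data packages have been installed with nltk.download().)

import nltk

text = "The little cat sat on the mat."
tokens = nltk.word_tokenize(text)                                  # layer 1: tokenization
tagged = nltk.pos_tag(tokens)                                      # layer 2: part-of-speech tagging
chunks = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}").parse(tagged)   # layer 3: chunking into noun phrases
print(chunks)                                                      # a Tree of tagged words and NP groups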

In this article, you will see some fairly complete examples of the low-level capabilities, while most of the high-level capabilities will be described only abstractly. Let's now look closely at the first steps of text processing.

Tokenization

Much of what you can do with NLTK, particularly at the low level, is not very different from what you can do with Python's basic data structures. But NLTK provides a set of systematic interfaces that the higher layers depend on and use, rather than simply supplying convenience classes for handling annotated and tagged text.

Specifically, the nltk.tokenizer.Token class is widely used to store annotated segments of text; these annotations can mark many different features, including parts of speech, subtoken structures, a token's offset within a larger text, morphological stems, grammatical sentence components, and so on. In fact, a Token is a special kind of dictionary, and is accessed as one, so it can hold whatever keys you like. A few special keys are used within NLTK, different ones by different subpackages.

Let's look briefly at how to create a token and break it into subtokens:
Listing 1. First acquaintance with the nltk.tokenizer.Token class

>>> from nltk.tokenizer import *
>>> t = Token(TEXT='This is my first test sentence')
>>> WSTokenizer().tokenize(t, addlocs=True) # break on whitespace
>>> print t['TEXT']
This is my first test sentence
>>> print t['SUBTOKENS']
[<This>@[0:4c], <is>@[5:7c], <my>@[8:10c], <first>@[11:16c],
<test>@[17:21c], <sentence>@[22:30c]]
>>> t['foo'] = 'bar'
>>> t
<TEXT='This is my first test sentence', foo='bar',
SUBTOKENS=[<This>@[0:4c], <is>@[5:7c], <my>@[8:10c], <first>@[11:16c],
<test>@[17:21c], <sentence>@[22:30c]]>
>>> print t['SUBTOKENS'][0]
<This>@[0:4c]
>>> print type(t['SUBTOKENS'][0])
<class 'nltk.token.SafeToken'>

Probability

One fairly simple thing you are likely to want to do with a linguistic corpus is analyze the frequency distributions of various events in it, and make probability predictions based on those known distributions. NLTK supports a variety of methods for probabilistic prediction based on natural frequency-distribution data. I will not cover those methods here (see the probability tutorial listed in Resources), other than to say that the relationship between what you would definitely expect and what you already know is a somewhat fuzzy one (beyond the obvious scaling/normalization).

Fundamentally, NLTK supports two types of frequency distribution: histograms and conditional frequency distributions. The nltk.probability.FreqDist class is used to create histograms; for example, a word histogram can be created like this:
Listing 2. Use nltk.probability.FreqDist to create a basic histogram

>>> from nltk.probability import *
>>> article = Token(TEXT=open('cp-b17.txt').read())
>>> WSTokenizer().tokenize(article)
>>> freq = FreqDist()
>>> for word in article['SUBTOKENS']:
...   freq.inc(word['TEXT'])
>>> freq.B()
1194
>>> freq.count('Python')
12

The probability tutorial discusses the creation of histograms keyed on more complex features, such as "the length of words following words that end in vowels". The nltk.draw.plot.Plot class is useful for visual display of histograms. And, of course, you can analyze the frequency distribution of higher-level grammatical features, or even of data sets unrelated to NLTK, in the same way.
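As a sketch of one such "more complex feature" with a modern NLTK release (assuming the same cp-b17.txt file used in Listing 2 and a downloaded punkt tokenizer), the distribution of lengths of words that follow vowel-final words could be built like this:

import nltk

words = nltk.word_tokenize(open('cp-b17.txt').read())
cfd = nltk.ConditionalFreqDist(
    (prev[-1].lower(), len(word))              # condition: final letter of the preceding word
    for prev, word in nltk.bigrams(words)
    if prev[-1].lower() in 'aeiou')
print(cfd['e'].most_common(5))                 # commonest lengths of words that follow an 'e'-final word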

Conditional frequency distributions can be more interesting than plain histograms. A conditional frequency distribution is a kind of two-dimensional histogram: it shows one histogram for each initial condition, or "context". For example, the tutorial suggests examining the word-length distribution for each initial letter. We can analyze it this way:
Listing 3. Conditional frequency distribution: the length of the word corresponding to each initial letter

>>> cf = ConditionalFreqDist()
>>> for word in article['SUBTOKENS']:
...   cf[word['TEXT'][0]].inc(len(word['TEXT']))
...
>>> init_letters = cf.conditions()
>>> init_letters.sort()
>>> for c in init_letters[44:50]:
...   print "Init %s:" % c,
...   for length in range(1,6):
...     print "len %d/%.2f," % (length,cf[c].freq(length)),
...   print
...
Init a: len 1/0.03, len 2/0.03, len 3/0.03, len 4/0.03, len 5/0.03,
Init b: len 1/0.12, len 2/0.12, len 3/0.12, len 4/0.12, len 5/0.12,
Init c: len 1/0.06, len 2/0.06, len 3/0.06, len 4/0.06, len 5/0.06,
Init d: len 1/0.06, len 2/0.06, len 3/0.06, len 4/0.06, len 5/0.06,
Init e: len 1/0.18, len 2/0.18, len 3/0.18, len 4/0.18, len 5/0.18,
Init f: len 1/0.25, len 2/0.25, len 3/0.25, len 4/0.25, len 5/0.25,

An excellent linguistic application of conditional frequency distributions is to analyze syntagmatic distributions across a corpus: for example, given a particular word, which word is most likely to come next. Grammar imposes some constraints here, of course; but the study of the selection among syntactic options falls within the fields of semantics, pragmatics, and terminology.
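As a minimal sketch of that idea (again using a modern NLTK release and the hypothetical cp-b17.txt file from earlier listings), a conditional frequency distribution over word bigrams answers exactly the "what comes next" question:

import nltk

words = [w.lower() for w in nltk.word_tokenize(open('cp-b17.txt').read())]
cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))   # condition: a word; samples: the words that follow it
print(cfd['the'].max())                               # the single most likely word to follow 'the'
print(cfd['the'].most_common(3))                      # or the top few successors, with counts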

Stemming

The nltk.stemmer.porter.PorterStemmer class is an extremely handy tool for deriving grammatical (prefix) stems from English words. This capability is particularly exciting for me, because I previously created a general-purpose, full-text indexed search tool/library in Python (see the description in Developing a full-text indexer in Python), which has been used in quite a few other projects.

Although the ability to search a large collection of documents for a set of exact words is very practical (the work gnosis.indexer does), for many searches a little fuzziness helps. Perhaps you cannot be quite sure whether the email you are looking for used the word "complicated", "complications", "complicating", or "complicates", but you remember that it was roughly concerned with that topic (probably together with some other words that would make for a worthwhile search).

NLTK includes an excellent stemming algorithm, and lets you customize stemming behavior to your liking:
Listing 4. Extract word stems for morphological roots

>>> from nltk.stemmer.porter import PorterStemmer
>>> PorterStemmer().stem_word('complications')
'complic'

Exactly how you might use stemming in gnosis.indexer, its derivatives, or an entirely different indexing tool depends on your usage scenario. Fortunately, gnosis.indexer has an open interface that is easy to customize. Do you need an index composed entirely of stems? Should the index contain both full words and stems? Should stemmed matches be separated from exact matches in the results? In a future version of gnosis.indexer I will introduce some stemming capabilities, but end users may still want to customize it differently.

In any case, adding stemming is generally very simple: first, derive stems from a document by specializing gnosis.indexer.TextSplitter; then, when a search is performed, (optionally) stem the search terms before the index lookup, perhaps by customizing your MyIndexer.find() method.
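To make the idea concrete without relying on gnosis.indexer's actual interface, here is a deliberately toy sketch: the index maps Porter stems to document ids, and query terms are stemmed before lookup. The helper names are made up for illustration, and the stemmer import path is the modern NLTK one.

from collections import defaultdict
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
index = defaultdict(set)                    # maps a stem to the set of documents containing it

def add_document(doc_id, text):
    # hypothetical indexing step: store each word under its stem
    for word in text.lower().split():
        index[stemmer.stem(word)].add(doc_id)

def find(query):
    # hypothetical search step: stem the query terms before the lookup
    return set.intersection(*(index[stemmer.stem(w)] for w in query.lower().split()))

add_document(1, "complications arising from the complicated design")
add_document(2, "a simple design")
print(find("complicates"))                  # {1} -- matched through the shared stem 'complic'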

While experimenting with PorterStemmer, I found that the nltk.tokenizer.WSTokenizer class really is as limited as the tutorial warns. It is fine for the conceptual role, but for real-life text you can do much better at identifying what counts as a "word". Fortunately, gnosis.indexer.TextSplitter is a robust tokenizer. For example:
Listing 5. Stemming based on NLTK's crude tokenization

>>> from nltk.tokenizer import *
>>> article = Token(TEXT=open('cp-b17.txt').read())
>>> WSTokenizer().tokenize(article)
>>> from nltk.probability import *
>>> from nltk.stemmer.porter import *
>>> stemmer = PorterStemmer()
>>> stems = FreqDist()
>>> for word in article['SUBTOKENS']:
...   stemmer.stem(word)
...   stems.inc(word['STEM'].lower())
...
>>> word_stems = stems.samples()
>>> word_stems.sort()
>>> word_stems[20:40]
['"generator-bas', '"implement', '"lazili', '"magic"', '"partial',
'"pluggable"', '"primitives"', '"repres', '"secur', '"semi-coroutines."',
'"state', '"understand', '"weightless', '"whatev', '#', '#-----',
'#----------', '#-------------', '#---------------', '#b17:']

Looking over some of these stems, not all of them seem usable for indexing. Many are not real words at all; others are compounds joined by dashes, or have extraneous punctuation attached. Let's try again with a better tokenizer:
Listing 6. Stemming using the tokenizer's smarter heuristics

>>> from gnosis.indexer import TextSplitter as TS   # the gnosis.indexer tokenizer, aliased as TS
>>> article = TS().text_splitter(open('cp-b17.txt').read())
>>> stems = FreqDist()
>>> for word in article:
...   stems.inc(stemmer.stem_word(word.lower()))
...
>>> word_stems = stems.samples()
>>> word_stems.sort()
>>> word_stems[60:80]
['bool', 'both', 'boundari', 'brain', 'bring', 'built', 'but', 'byte',
'call', 'can', 'cannot', 'capabl', 'capit', 'carri', 'case', 'cast',
'certain', 'certainli', 'chang', 'charm']

Here you can see that several words have multiple possible expansions, and all of the entries look like words or morphemes. Good tokenization matters greatly for arbitrary text collections; to be fair, the corpora bundled with NLTK are packaged so that WSTokenizer() serves as an easy and accurate tokenizer for them. To get a robust, practical indexer, though, you need an equally robust tokenizer.
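If you do not want to pull in gnosis.indexer, one cheap way to get similar robustness with a modern NLTK is a regular-expression tokenizer that keeps only alphabetic runs. This is a sketch under my own assumptions; nltk.tokenize.RegexpTokenizer is a real class, but the pattern choice is mine:

from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer

tokenizer = RegexpTokenizer(r'[A-Za-z]+')   # keep alphabetic runs; drop digits, dashes, punctuation
stemmer = PorterStemmer()
text = '"generator-based" iterators -- #b17: a pluggable design'
print([stemmer.stem(w.lower()) for w in tokenizer.tokenize(text)])
# punctuation and markup debris never reach the stemmer, so junk "stems" like '#b17:' do not appear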

Tagging, chunking, and parsing

The largest portion of NLTK consists of various parsers of varying complexity. For the most part, this introduction will not explain their details, but I would like to give a rough sense of what they are meant to accomplish.

Remember, as background, that tokens are special dictionaries, in particular ones that can include a TAG key to indicate a word's grammatical role. NLTK corpus documents often come pre-tagged for parts of speech, but you can, of course, add your own tags to untagged documents.
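For an untagged document, a modern NLTK can add this kind of annotation with its bundled part-of-speech tagger. A minimal sketch, assuming the punkt and averaged_perceptron_tagger data packages are installed:

import nltk

tokens = nltk.word_tokenize("The little cat sat on the mat")
tagged = nltk.pos_tag(tokens)               # a list of (word, tag) pairs
print(tagged[2])                            # ('cat', 'NN'): the word together with its grammatical role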

Chunking is something like "parsing lite". That is, chunking works either from existing markup of grammatical components, or from markup that you add manually or generate semi-automatically using regular expressions and program logic. Strictly speaking, though, this is not true parsing (there are no production rules in the same sense). For example:
Listing 7. Chunk parsing/tagging: words and larger units

>>> from nltk.parser.chunk import ChunkedTaggedTokenizer
>>> chunked = "[ the/DT little/JJ cat/NN ] sat/VBD on/IN [ the/DT mat/NN ]"
>>> sentence = Token(TEXT=chunked)
>>> tokenizer = ChunkedTaggedTokenizer(chunk_node='NP')
>>> tokenizer.tokenize(sentence)
>>> sentence['SUBTOKENS'][0]
(NP: <the/DT> <little/JJ> <cat/NN>)
>>> sentence['SUBTOKENS'][0]['NODE']
'NP'
>>> sentence['SUBTOKENS'][0]['CHILDREN'][0]
<the/DT>
>>> sentence['SUBTOKENS'][0]['CHILDREN'][0]['TAG']
'DT'
>>> chunk_structure = TreeToken(NODE='S', CHILDREN=sentence['SUBTOKENS'])
>>> chunk_structure
(S:
 (NP: <the/DT> <little/JJ> <cat/NN>)
 <sat/VBD>
 <on/IN>
 (NP: <the/DT> <mat/NN>))

The chunking just described can be done with the nltk.tokenizer.RegexpChunkParser class, using pseudo-regular expressions to describe the series of tags that make up a grammatical element. Here is an example from the tutorial:
Listing 8. Chunking with regular expressions over tags

>>> rule1 = ChunkRule('<DT>?<JJ.*>*<NN.*>',
...        'Chunk optional det, zero or more adj, and a noun')
>>> chunkparser = RegexpChunkParser([rule1], chunk_node='NP', top_node='S')
>>> chunkparser.parse(sentence)
>>> print sentence['TREE']
(S: (NP: <the/DT> <little/JJ> <cat/NN>)
 <sat/VBD> <on/IN>
 (NP: <the/DT> <mat/NN>))

True parsing leads us into many theoretical areas. For example, top-down parsers are guaranteed to find every possible production, but can be very slow because of frequent (exponentially frequent) backtracking. Shift-reduce parsing is much more efficient, but can miss some productions. In either case, grammar rules are declared much as they are when parsing artificial languages. This column has looked at several such tools: SimpleParse, mx.TextTools, Spark, and gnosis.xml.validity (see Resources).
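Both strategies are still available as separate classes in a modern NLTK. A toy sketch with a deliberately cut-down grammar (Listing 9 below uses a richer one with the older chart-parser API):

import nltk

grammar = nltk.CFG.fromstring('''
  S -> NP VP
  VP -> V NP
  NP -> "John" | Det N
  V -> "saw"
  Det -> "a"
  N -> "cat"
''')
tokens = "John saw a cat".split()
for tree in nltk.RecursiveDescentParser(grammar).parse(tokens):   # top-down, with backtracking
    print(tree)
for tree in nltk.ShiftReduceParser(grammar).parse(tokens):        # bottom-up, no backtracking
    print(tree)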

In addition to top-down and shift-reduce parsers, NLTK also provides "chart parsers", which build partial hypotheses that a given sequence can later complete into a rule. This approach can be both efficient and complete. A quick (toy-sized) example:
Listing 9. Defining basic productions for a context-free grammar

>>> from nltk.parser.chart import *
>>> grammar = CFG.parse('''
...  S -> NP VP
...  VP -> V NP | VP PP
...  V -> "saw" | "ate"
...  NP -> "John" | "Mary" | "Bob" | Det N | NP PP
...  Det -> "a" | "an" | "the" | "my"
...  N -> "dog" | "cat" | "cookie"
...  PP -> P NP
...  P -> "on" | "by" | "with"
...  ''')
>>> sentence = Token(TEXT='John saw a cat with my cookie')
>>> WSTokenizer().tokenize(sentence)
>>> parser = ChartParser(grammar, BU_STRATEGY, LEAF='TEXT')
>>> parser.parse_n(sentence)
>>> for tree in sentence['TREES']: print tree
(S:
 (NP: <John>)
 (VP:
  (VP: (V: <saw>) (NP: (Det: <a>) (N: <cat>)))
  (PP: (P: <with>) (NP: (Det: <my>) (N: <cookie>)))))
(S:
 (NP: <John>)
 (VP:
  (V: <saw>)
  (NP:
   (NP: (Det: <a>) (N: <cat>))
   (PP: (P: <with>) (NP: (Det: <my>) (N: <cookie>))))))

A probabilistic context-free grammar (PCFG) is a context-free grammar that associates a probability with each of its productions. Again, parsers for probabilistic parsing are bundled with NLTK.
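As a rough sketch of what that looks like in a modern NLTK release, each production carries a bracketed probability, and the bundled Viterbi parser returns the most probable parse:

import nltk

pcfg = nltk.PCFG.fromstring('''
  S -> NP VP [1.0]
  VP -> V NP [1.0]
  NP -> "John" [0.5] | Det N [0.5]
  V -> "saw" [1.0]
  Det -> "a" [1.0]
  N -> "cat" [1.0]
''')
for tree in nltk.ViterbiParser(pcfg).parse("John saw a cat".split()):
    print(tree)                             # the most probable tree (a ProbabilisticTree carrying its probability)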

What are you waiting for?

NLTK has other important capabilities that this short introduction cannot cover. For example, NLTK has a complete framework for text classification using statistical techniques such as naive Bayes and maximum entropy models. Even if there were space, I could not explain their essence here yet. But I believe that even NLTK's lower layers make a practical framework for both teaching and real applications.
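For a small taste of that classification framework in a modern NLTK, each training instance is a dictionary of features plus a label, and a naive Bayes classifier can be trained directly on such a list; the features and labels below are made up purely for illustration:

import nltk

train = [({'contains_python': True,  'length': 'long'},  'technical'),
         ({'contains_python': False, 'length': 'short'}, 'chat'),
         ({'contains_python': True,  'length': 'short'}, 'technical'),
         ({'contains_python': False, 'length': 'long'},  'chat')]
classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify({'contains_python': True, 'length': 'long'}))   # 'technical'
classifier.show_most_informative_features()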

