[Data Analysis] Study Notes Day 28: Natural Language Processing — NLTK + jieba Segmentation, the Basic Idea of jieba, and a Case Study with Process Description

jieba Segmentation

jieba is an open-source Chinese word segmentation library written in Python and widely used in industry. Its GitHub address is https://github.com/fxsjy/jieba, and it can be installed with: pip install jieba

A simple example:

import jieba as jb

seg_list = jb.cut("我来到北京清华大学", cut_all=True)
print("全模式: " + "/ ".join(seg_list))  # 全模式

seg_list = jb.cut("我来到北京清华大学", cut_all=False)
print("精确模式: " + "/ ".join(seg_list))  # 精确模式

seg_list = jb.cut("他来到了网易杭研大厦")  
print("默认模式: " + "/ ".join(seg_list)) # 默认是精确模式

seg_list = jb.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造")  
print("搜索引擎模式: " + "/ ".join(seg_list)) # 搜索引擎模式

Output:

全模式: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
精确模式: 我/ 来到/ 北京/ 清华大学
默认模式: 他/ 来到/ 了/ 网易/ 杭研/ 大厦
搜索引擎模式: 小明/ 硕士/ 毕业/ 于/ 中国/ 科学/ 学院/ 科学院/ 中国科学院/ 计算/ 计算所/ ,/ 后/ 在/ 日本/ 京都/ 大学/ 日本京都大学/ 深造
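
Note that jb.cut and jb.cut_for_search return generators, which is why the examples join them with "/ ". If a plain list is more convenient, jieba also provides lcut and lcut_for_search:

import jieba as jb

# lcut / lcut_for_search behave like cut / cut_for_search but return lists directly.
print(jb.lcut("我来到北京清华大学"))             # ['我', '来到', '北京', '清华大学']
print(jb.lcut_for_search("小明硕士毕业于中国科学院计算所"))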

The basic idea of jieba segmentation

jieba has corresponding algorithms for handling both words that are already in its dictionary and words that are not. The underlying idea is quite simple; the main workflow is as follows (a toy sketch of the DAG + dynamic-programming step appears after the list):

  • Load the dictionary dict.txt
  • Build a DAG (directed acyclic graph) for the sentence from the in-memory dictionary
  • For words not covered by the dictionary, attempt segmentation with an HMM model using the Viterbi algorithm
  • Once both in-dictionary and out-of-dictionary words have been handled for the whole sentence, use dynamic programming on the DAG to find the maximum-probability path and output the segmentation result
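
To make the DAG + dynamic-programming step concrete, here is a minimal, self-contained sketch. The tiny dictionary and its frequencies are made up for illustration; this is not jieba's actual implementation, and the HMM/Viterbi handling of unknown words is omitted:

import math

# Toy dictionary with invented frequencies (real jieba loads dict.txt,
# which contains several hundred thousand entries).
FREQ = {"我": 5, "来到": 8, "北京": 20, "清华": 10, "大学": 15, "清华大学": 12}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    # For each start index i, collect every end index j such that
    # sentence[i:j+1] is a dictionary word; a lone character always gets a node.
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i, len(sentence)) if sentence[i:j + 1] in FREQ]
        dag[i] = ends or [i]
    return dag

def best_path(sentence, dag):
    # route[i] = (best log-probability for the suffix starting at i, chosen end index)
    n = len(sentence)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j + 1], 1) / TOTAL) + route[j + 1][0], j)
            for j in dag[i]
        )
    # Walk the chosen path to produce the segmentation.
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words

sentence = "我来到北京清华大学"
print(best_path(sentence, build_dag(sentence)))  # ['我', '来到', '北京', '清华大学']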

Case:

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import jieba
import requests
from bs4 import BeautifulSoup

def extract_text(url):
    # Send a request to the url and get the response content
    page_source = requests.get(url).content
    bs_source = BeautifulSoup(page_source, "lxml")

    # Parse out all the <p> tags
    report_text = bs_source.find_all('p')

    text = ''
    # Save the content of every <p> tag into a single string
    for p in report_text:
        text += p.get_text()
        text += '\n'

    return text

def word_frequency(text):
    from collections import Counter
    # Keep only segmented words whose length is >= 2
    words = [word for word in jieba.cut(text, cut_all=True) if len(word) >= 2]

    # Counter is a simple counter that tallies how many times each item appears;
    # the list of segmented words is turned into a dict-like mapping
    c = Counter(words)

    for word_freq in c.most_common(10):
        word, freq = word_freq
        print(word, freq)

if __name__ == "__main__":
    url = 'http://www.gov.cn/premier/2017-03/16/content_5177940.htm'
    text = extract_text(url)
    word_frequency(text)

Output:

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/dp/wxmmld_s7k9gk_5fbhdcr2y00000gn/T/jieba.cache
Loading model cost 0.843 seconds.
Prefix dict has been built succesfully.
发展 134
改革 85
经济 71
推进 66
建设 59
社会 49
人民 47
企业 46
加强 46
政策 46
Process Introduction
  1. First, we fetch the full text of the government work report from the web. This step is encapsulated in a simple function named extract_text, which accepts the url as a parameter. Because the report text on the target page sits entirely inside <p> elements, we only need to select all <p> elements with BeautifulSoup; the function finally returns a string containing the body of the report.
  2. Then we can segment the text with jieba. Here we choose full-mode segmentation: jieba's full mode scans out every word in the sentence that can form a dictionary word, which is very fast but cannot resolve ambiguity. We do it this way because, in the default precise mode, the returned word-frequency data is less accurate.
  3. When segmenting, we also need to filter out punctuation. Since punctuation marks have a length of 1, adding the condition len(word) >= 2 is enough.
  4. Finally, we use the Counter class to quickly turn the list of segmented words into a dict-like counter whose keys are the words and whose values are the number of times each word appears in the text (see the short sketch after this list).
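
As a small illustration of steps 3 and 4 (the sample sentence below is invented and independent of the report text):

import jieba
from collections import Counter

sample = "改革发展,发展经济,经济发展靠改革"   # invented sample text
# Keep only segmented words of length >= 2; this also drops punctuation.
words = [w for w in jieba.cut(sample, cut_all=True) if len(w) >= 2]
# Counter maps each word to its number of occurrences.
print(Counter(words).most_common(3))   # expected to rank 发展 / 改革 / 经济 by count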
