[Text information processing] Network text access and processing + word segmentation

1. Network text access and processing

1. re.findall()

Returns, as a list, all non-overlapping matches of pattern in string; if the pattern contains capturing groups, the list holds the captured groups instead.

def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""

    return _compile(pattern, flags).findall(string)
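For instance, the number of capturing groups controls the shape of the result. A minimal sketch (with made-up sample text, not from the page used below):

import re

text = "name=Alice; name=Bob"

# No capturing group: the full matches are returned
print(re.findall(r'name=\w+', text))      # ['name=Alice', 'name=Bob']

# One capturing group: only the captured part is returned
print(re.findall(r'name=(\w+)', text))    # ['Alice', 'Bob']

# Two capturing groups: a list of tuples is returned
print(re.findall(r'(name)=(\w+)', text))  # [('name', 'Alice'), ('name', 'Bob')]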

2. Code

import codecs
import re
import urllib.request

url = "https://www.gdufs.edu.cn/info/1397/59442.htm"
html_text = urllib.request.urlopen(url).read()
print(type(html_text), html_text)

html_text_new = codecs.decode(html_text, 'utf-8')
print(type(html_text_new), html_text_new)

# Extract the page title
# p1 = '<TITLE>(.*?)</TITLE>'

# With <title>.*?</title> (no capturing group), the result would need to be sliced afterwards
p1 = '<title>(.+?)</title>'
title1 = re.findall(p1, html_text_new)
print(title1[0])

# Extract the list of all URLs in the page
# p2 = 'HREF="(.*?)"'
p2 = 'href="(https.+?)"'
http = re.findall(p2, html_text_new)
print(http)
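The commented-out <TITLE> pattern suggests that tag case can differ between pages. A hedged alternative (not part of the original code) is to pass re.IGNORECASE so either case matches:

# Match <title> or <TITLE> with a single pattern
p1 = '<title>(.+?)</title>'
title1 = re.findall(p1, html_text_new, flags=re.IGNORECASE)
print(title1[0])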

output

(screenshot of the printed results, omitted here)

2. English word segmentation (word_tokenize)

1. Basic Usage

# Sentence segmentation
nltk.tokenize.sent_tokenize(txt)
# Word tokenization
nltk.tokenize.word_tokenize(txt)
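Both functions rely on NLTK's pre-trained Punkt sentence model; if it is not installed yet, a one-time download is needed first (a minimal sketch):

import nltk

# One-time download of the Punkt tokenizer models used by sent_tokenize/word_tokenize
nltk.download('punkt')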

2. Code

import nltk.tokenize

txt = 'On the morning of March 24, Ruben Espinoza, Consul General of the Consulate General of Peru in Guangzhou, together with his delegation, visited GDUFS. Jiao Fangtai, vice president of the university, welcomed the guests in the VIP Hall of the administration building of the Baiyunshan campus. The two sides exchanged views on talent training and cultural exchange.'
chinesetext = '这是一个很小的测试'
# Split into sentences
sents = nltk.tokenize.sent_tokenize(txt)
print(sents)
print(len(sents))

# Tokenize each sentence into words
for sent in sents:
    words = nltk.tokenize.word_tokenize(sent)
    print(words)

output

['On the morning of March 24, Ruben Espinoza, Consul General of the Consulate General of Peru in Guangzhou, together with his delegation, visited GDUFS.', 'Jiao Fangtai, vice president of the university, welcomed the guests in the VIP Hall of the administration building of the Baiyunshan campus.', 'The two sides exchanged views on talent training and cultural exchange.']
3
['On', 'the', 'morning', 'of', 'March', '24', ',', 'Ruben', 'Espinoza', ',', 'Consul', 'General', 'of', 'the', 'Consulate', 'General', 'of', 'Peru', 'in', 'Guangzhou', ',', 'together', 'with', 'his', 'delegation', ',', 'visited', 'GDUFS', '.']
['Jiao', 'Fangtai', ',', 'vice', 'president', 'of', 'the', 'university', ',', 'welcomed', 'the', 'guests', 'in', 'the', 'VIP', 'Hall', 'of', 'the', 'administration', 'building', 'of', 'the', 'Baiyunshan', 'campus', '.']
['The', 'two', 'sides', 'exchanged', 'views', 'on', 'talent', 'training', 'and', 'cultural', 'exchange', '.']
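The unused chinesetext variable above hints at the limitation that motivates the next section: word_tokenize splits on whitespace and punctuation, so it cannot find word boundaries in Chinese, which is written without spaces. A rough sketch of what to expect (the exact output is an assumption, not taken from the original run):

# word_tokenize has no notion of Chinese word boundaries, so the
# unspaced string likely comes back essentially unsplit
print(nltk.tokenize.word_tokenize(chinesetext))  # roughly: ['这是一个很小的测试']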

3. Chinese word segmentation (jieba)

1. Example

Read the text of 练习1.txt into the program, segment it with jieba, and write the segmentation result to the file result.txt.

练习1.txt (contents, shown in translation)
Recently, Ruanke released the 2022 ranking of the top 50 Chinese universities by approved major liberal-arts projects. Our school had 5 national major liberal-arts projects approved, ranking 23rd among universities nationwide and tied for 2nd in Guangdong Province. The projects include 1 major project of the National Social Science Fund, 3 annual major projects of the National Social Science Fund, and 1 major project of the Ministry of Education's philosophy and social science research, covering disciplines such as foreign literature, applied economics, law, journalism and communication, and international studies. Since the start of the "14th Five-Year Plan" period, the school has had 12 major national projects approved.

import jieba
import nltk
from jieba import posseg

# Read the text into the program
f = open("练习1.txt", "r", encoding='utf-8')
txt = f.read()
f.close()
print(txt)

# Segment the text with jieba
token = list(jieba.cut(txt))
seg_result = ""
print(token)
print(type(token))

for word in token:
    seg_result = seg_result + word + "/"

# Write the segmentation result to result.txt
f2 = open("result.txt", "w", encoding='utf-8')
f2.write(seg_result)
f2.close()
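An equivalent, slightly more compact sketch of the same steps (same file names assumed), using jieba.lcut, str.join, and with blocks so the files are closed automatically:

import jieba

# Read the source text, segment it, and join the words with "/"
with open("练习1.txt", "r", encoding='utf-8') as f:
    txt = f.read()

# jieba.lcut returns a list directly; unlike the loop above, join adds no trailing "/"
seg_result = "/".join(jieba.lcut(txt))

with open("result.txt", "w", encoding='utf-8') as f2:
    f2.write(seg_result)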

View result.txt (the segmentation result, shown here in translation)

Recently /, /Ruanke/Released/2022/China/Universities/Liberal Arts/Major Projects/50/Top/Ranking List/, /Our School/Project Approval/Liberal Arts/National/Major Projects/5/Items/, / Ranked /23/ in /National/ Universities/, /tied /2/ in /Guangdong Province/. /Project/Project/Include/Interpretation/Party/of/Nineteenth/Session/Sixth Plenary Session/Spirit/National/Social Science/Fund/Major Project/1/Item/, /National/Social Science/Fund/Annual/Major Project/3/item/, /Ministry of Education/Philosophy/Social Science/Research/Major/Problem/Project/1/item/, /involves/discipline/including/foreign literature/, /applied/economics/, /law/ , /journalism/ and /communication/, /international/issues/research/etc/disciplines/. /"/14th Five-Year/"/since/, /school/has/established/national/major project/12/item/. /

2. Segmentation with part-of-speech tagging

fenci = jieba.posseg.cut("这是一个小小的测试")
print(list(fenci))

output

[pair('这', 'r'), pair('是', 'v'), pair('一个', 'm'), pair('小小的', 'z'), pair('测试', 'vn')]
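Each pair carries a word and its part-of-speech flag; jieba's documented usage is to unpack them in a loop. A minimal sketch with the same test sentence:

from jieba import posseg

# Print each (word, flag) pair on its own line
for word, flag in posseg.cut("这是一个小小的测试"):
    print(word, flag)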
