Python Level 2 Programming Questions "From Tsinghua University to MIT"

1. Topic

Take the data.txt file and read an article "From Tsinghua University to MIT", use the full mode of the function Icut of the jieba library to do word segmentation, and the statistical vocabulary length is 2

The number of occurrences of the word, output the top 10 words with the most occurrences and the number of occurrences.

2. Effect display

Example 1 : ‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‪‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‭‬‫‬

输入:从data.txt文件读入
输出:"
大学:21‪‬‪‬‪‬‪‬‪‬‮‬‫‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‪‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‮‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‫‬
设计:20‪‬‪‬‪‬‪‬‪‬‮‬‫‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‪‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‮‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‫‬
美国:16‪‬‪‬‪‬‪‬‪‬‮‬‫‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‪‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‮‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‫‬
清华:15‪‬‪‬‪‬‪‬‪‬‮‬‫‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‪‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‮‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‫‬
学生:14‪‬‪‬‪‬‪‬‪‬‮‬‫‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‪‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‮‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‫‬
教授:12‪‬‪‬‪‬‪‬‪‬‮‬‫‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‪‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‮‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‫‬
课程:11‪‬‪‬‪‬‪‬‪‬‮‬‫‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‪‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‮‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‫‬
一个:10‪‬‪‬‪‬‪‬‪‬‮‬‫‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‪‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‮‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‫‬
国大:8‪‬‪‬‪‬‪‬‪‬‮‬‫‬‫‬‪‬‪‬‪‬‪‬‪‬‮‬‪‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‮‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‪‬‪‬‪‬‪‬‪‬‪‬‮‬‫‬‫‬
计算:8

3. Code display

1. Official code

import jieba        # 导入jieba中文分词库
dk = {}             # 定义dk字典变量 type(dk):<class 'dict'>

#使用with后不管with中的代码出现什么错误,都会进行对当前对象进行清理工作。
#例如file的file.close()方法,无论with中出现任何错误,都会执行file.close()方法

#以指定utf-8编码只读方式打开data.txt文件,文件句柄命名为f
with open('data.txt','r',encoding = "utf-8") as f:  
    sl = f.readlines()

#print(type(f))    f是<class '_io.TextIOWrapper'>文件句柄的类型
#print(type(sl))   sl是一个列表,包含了文件中每一行内容
#print(type(sl[0]))  sl[0]是列表sl中第一个元素,是文件中第一行所有内容

for s in sl:        #循环读取列表元素
    k =jieba.lcut(s, cut_all = True)
    #对每个s,使用jieba.lcut函数以全模式方式返回一个列表(由词语组成)
    for wo in k:    #对每个词语进行筛选
        if len(wo) == 2:    #如果词语的长度为2,进行统计
           dk[wo] = dk.get(wo,0) + 1
           #逐步构建统计字典,形式如{"大学":1,"设计":2,...},备注,这里的1、2是逐渐变化中

dp = list(dk.items())   #转换为列表,列表中元素为元组。
dp.sort(key= lambda x:int(x[1]), reverse = True)

for i in range(10):   #输出排序后的内容
   print("{}:{}".format(dp[i][0],dp[i][1]))

2. Personal solution

import jieba
dk = {}
with open('data.txt','r',encoding="utf-8") as f:
    text = f.read()
    lst = jieba.lcut(text,cut_all=True)
    for elm in lst:  
        if len(elm) == 2:
            dk[elm] = dk.get(elm,0)+1
dp = list(dk.items())
dp.sort(key= lambda x:int(x[1]), reverse = True)
for k,v in dp[:11]:  #这里列表只取前10个,因为已经排过序了。
    print(k+":"+str(v))

3. Optimized version

import jieba
dk = {}
with open('data.txt','r',encoding="utf-8") as f:
    text = f.read()
    lst = [elm for elm in jieba.lcut(text,cut_all=True) if len(elm) ==2]
    for i in lst:  
        dk[elm] = dk.get(elm,0)+1
dp = list(dk.items())
dp.sort(key= lambda x:int(x[1]), reverse = True)
for k,v in dp[:11]:  #这里列表只取前10个,因为已经排过序了。
    print(k+":"+str(v))

4. Post-school reflection

  1. Python secondary dictionary sorting is almost a must. You must use sort, lambda, and jieba well. Pay attention to whether the word segmentation result is a full pattern.
  2. Note the usage difference between sort and sorted, sort is to change the list, and sorted is to generate a copy without changing the original list.

Guess you like

Origin blog.csdn.net/henanlion/article/details/130889864