文章目录

第1章计算机基础及Python简介
第2章编写简单的程序
- 2.1 示例程序
- 2.2 标识符及命名规范
- 2.3 变量与赋值语句
- 2.4 数据的输入与输出
  - 2.4.1 输入函数input()
  - 2.4.2 输出函数print()
- 2.5 数值
  - 2.5.1 数值数据类型
  - 2.5.2 内置的数值操作
- 2.6 字符串
- 2.7 混合运算和类型转换
第3章程序流程控制
- 3.1 条件表达式
- 3.2 选择结构
- 3.3 循环结构
- 3.4 random库的基本应用
- 3.5 程序流程控制应用实例
第4章列表与元组
- 4.1 列表介绍与元素访问
- 4.2 操作列表元素
- 4.3 操作列表
- 4.4 数值列表
- 4.5 元组
- 4.6 转换函数
- 4.7 列表与元组应用实例
第5章字典与集合
- 5.1 字典的创建与访问
  - 5.1.1 创建字典
  - 5.1.2 访问字典
- 5.2 字典的基本操作
- 5.3 字典的整体操作
- 5.4 集合
- 5.5 字典与集合应用实例
第6章函数
- 6.1 函数的基本概念
- 6.2 函数的使用
- 6.3 lambda()函数
- 6.4 变量的作用域
- 6.5 递归函数
- 6.6 函数应用实例
第7章文件与异常
- 7.1 文件基础知识
  - 7.1.1 文件与文件类型
  - 7.1.2 目录与文件路径
- 7.2 文件操作
- 7.3 CSV文件操作
- 7.4 异常和异常处理
- 7.5 文件与异常应用实例
第8章中文文本分析基础
- 8.1 中文文本分析相关库
- 8.2 中文文本分析应用实例
第9章科学计算基础：numpy库和matplotlib库的应用
- 9.1 numpy库的使用
- 9.2 数组对象的常见操作
- 9.3 numpy库的专门应用
  - 9.3.1 numpy库在线性代数的应用
  - 9.3.2 多项式的应用
- 9.4 数组的文件输入与输出
- 9.5 matplotlib库的使用
- 9.6 科学计算相关库应用实例
第10章数据分析利器：pandas库的应用
- 10.1 pandas库简介
- 10.2 Series对象的应用
- 10.3 DataFrame对象的应用
- 10.4 数据分析相关库应用实例
第11章网络爬虫技术的应用
- 11.1 计算机网络基础知识
- 11.2 requests库的使用
- 11.3 BeautifulSoup库的使用
  - 11.3.4 搜索文档树
- 11.4 网络爬虫技术应用实例

第1章计算机基础及Python简介

第2章编写简单的程序

2.1 示例程序

例2-1 任意输入两个整数，求这两个整数的和及平均值。

m = eval(input("input first number: "))
n = eval(input("input second number: "))
sum = m + n
avg = (m+n)/2
print("和为： ",sum)
print("平均值为： ",avg)

input first number: 

SyntaxError: unexpected EOF while parsing

(The prompt was submitted with no input, so eval() received an empty string and raised this error.)

2.2 标识符及命名规范

2.3 变量与赋值语句

2.3.1 Python语言中的变量

例2-2 变量动态类型示例及讨论。

m = 2
type(m)

int

m = 2.6
type(m)

m = '你好'
type(m)

例2-3 变量的强数据类型示例。

a = 100
b = "30"
a + b

2.3.2 变量的赋值

例2-4 变量的赋值示例。

x = 100
print(x)

str = "I am a boy."
print(str)

print(y)

2.3.3 链式赋值语句

例2-5 链式赋值语句示例。

x = y = z =200
print(x,y,z)

x = x + 100
y = y - 100
print(x,y,z)

2.3.4 解包赋值语句

例2-6 解包赋值语句示例。

a,b = 100,200
print(a,b)

100 200

例2-7 利用解包赋值语句实现两个变量值的交换。

a = 100
b = 200
print("a=",a,"b=",b)

a= 100 b= 200

a,b = b,a
print('a=',a,"b=",b)

a= 200 b= 100

2.4 数据的输入与输出

2.4.1 输入函数input()

例2-8 input()函数输入交互示例。

name = input("请输入您的姓名：")

请输入您的姓名：Sandy

例2-9 使用eval()函数获取input()函数输入的数值类型数据。

m = input("请输入整数1：")
n = input("请输入整数2：")
print("m和n的差是：",m-n)

请输入整数1：6
请输入整数2：3



---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-28-93fffc5453d1> in <module>
      1 m = input("请输入整数1：")
      2 n = input("请输入整数2：")
----> 3 print("m和n的差是：",m-n)


TypeError: unsupported operand type(s) for -: 'str' and 'str'

#修改1
m = eval(input("请输入整数1："))
n = eval(input("请输入整数2："))
print("m和n的差是：",m-n)

请输入整数1：6
请输入整数2：3
m和n的差是： 3

#修改2
m = input("请输入整数1：")
n = input("请输入整数2：")
print("m和n的差是：",eval(m)-eval(n))

请输入整数1：6
请输入整数2：3
m和n的差是： 3

2.4.2 输出函数print()

例2-10 print()函数输出示例。

print(3+5)

print("3+5 =",3+5)

例2-11 print()函数中的换行控制。

print(3)
print(4)
print("the answer is ",end = "")  #使用end=""，输出字符串后不换行
print(3+4)                        #在上一行继续输出3+4的结果

3
4
the answer is 7

2.5 数值

2.5.1 数值数据类型

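Example 2-12: A student takes a physical fitness test with three events: sprint, 3-minute rope skipping, and long jump. Each event is scored out of 100 as an integer, and the three scores are weighted 0.4, 0.3, and 0.3 in the overall score. Read one student's three scores and compute the overall score.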

run = eval(input("短跑成绩："))
rope = eval(input("3分钟跳绳成绩："))
longJump = eval(input("跳远成绩："))

score = run * 0.4 + rope * 0.3 + longJump * 0.3
print("体育测试总评成绩：", score)

例2-13 整数类型数据与浮点类型数据示例。

x = 123
y = 12.3
m = 12.
n = .98
k = 12e-5
print(m,n,k)

12.0 0.98 0.00012

2.5.2 内置的数值操作

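Example 2-14: A shop needs to give change to a customer and has only 50-yuan, 5-yuan, and 1-yuan RMB notes. Read an integer amount and output a change plan, assuming the supply of notes is unlimited and larger denominations are used first.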

money = eval(input("输入金额："))
m50 = money // 50
money = money % 50
m5 = money // 5
money = money % 5
m1 = money
print("50元面额需要的张数：",m50)
print("5元面额需要的张数：",m5)
print("1元面额需要的张数：",m1)

输入金额：354
50元面额需要的张数： 7
5元面额需要的张数： 0
1元面额需要的张数： 4
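The same largest-note-first idea can be sketched as a reusable function. The name make_change and the denominations parameter are illustrative and not part of the original example. Taking the largest note first gives the fewest notes here because each denomination (1, 5, 50) divides the next larger one; that is not guaranteed for arbitrary denominations.

```python
def make_change(amount, denominations=(50, 5, 1)):
    """Return {denomination: count}, always using the largest note first."""
    plan = {}
    for note in sorted(denominations, reverse=True):
        # divmod gives the number of notes and the amount still to be paid
        plan[note], amount = divmod(amount, note)
    return plan


print(make_change(354))
```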

例2-15 复合赋值运算符示例。

a,b = 10,20
a += b
a %= 2
b **= 2
print(a)
print(b)

例2-16 内置数值运算函数使用示例。

#abs(-2)
divmod(28,12)
#round(3.1415,2)
#pow(2,3)
#max(2,5,0,-4)
#min(2,5,0,-4)

(2, 4)

例2-17 math库中数值函数使用示例。

import math
print(math.fabs(-3.2),math.fmod(21,5))
print(math.fsum([0.1,0.2,0.3]))
print("12和28的最大公约数：",math.gcd(12,28))

3.2 1.0

2.6 字符串

a = "我是{}班{}号的学生{}".format("化工1701",28,"赵帅")    
b = "我是{1}班{2}号的学生{0}".format("赵帅","化工1701",28)
print(a)
print(b)

例2-42 format方法对字符串的格式化示例

"{:*^20}".format("Mike")

"{:=<20}".format("Mike")

例2-43 format()方法对实数和整数的设置示例

"{:.2f}".format(3.1415926)

"{:.4f}".format(3.1415926)

"{:=^30.4f}".format(3.1415926)
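On Python 3.6 or later, the same format specifications can be written with f-strings; the part after the colon is identical to what format() accepts. The variable names below are illustrative only.

```python
name = "Mike"
pi_value = 3.1415926

centered = f"{name:*^20}"          # centered in 20 columns, padded with '*'
left = f"{name:=<20}"              # left-aligned in 20 columns, padded with '='
two_places = f"{pi_value:.2f}"     # two digits after the decimal point
padded = f"{pi_value:=^30.4f}"     # four decimals, centered in 30 columns

print(centered, left, two_places, padded, sep="\n")
```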

2.7 混合运算和类型转换

例2-46 在input()函数输入中将字符类型数据转换成数值类型数据

import math 
result = (math.pi)**2+3
print("the result of pi^2+3 is: ",result)

第3章程序流程控制

3.1 条件表达式

例 3-1 关系运算符使用示例

3.2 选择结构

例3-3 输入三角形三条边的边长，计算三角形的面积

import math
a = float(input("请输入三角形的边长a: "))
b = float(input("请输入三角形的边长b: "))
c = float(input("请输入三角形的边长c: "))
p = (a+b+c)/2
area = math.sqrt(p*(p-a)*(p-b)*(p-c))
print("三角形的面积为: {:.2f}".format(area))

例3-4 用户使用键盘输入两个任意整数a和b，比较a和b的大小，并输出a和b，其中a为输入的两个整数中的较大者。

a = int(input("请输入整数a: "))
b = int(input("请输入整数b: "))
print("输入值 a = {},b={}".format(a,b))
if a < b:
    a,b = b,a
print("比较后的值 a = {},b = {}".format(a,b))

例 3-5 判断回文字符串

str1 = input("请输入字符串：")
if (str1 == str1[::-1]):
    print(str1 + "为回文串")
else:
    print(str1 + "不是回文串")

例 3-6 输入三条线段的长度，对用户输入的数据做合法性检查，并求由这三条线段围成的三角形的面积。

import math
a = float(input("请输入三角形的边长a: "))
b = float(input("请输入三角形的边长b: "))
c = float(input("请输入三角形的边长c: "))
if(a+b>c and a+c>b and b+c>a and a>0 and b>0 and c>0):
    h = (a+b+c)/2
    area = math.sqrt(h*(h-a)*(h-b)*(h-c))
    print("三角形的面积为:{:.2f}".format(area))
else:
    print("用户输入数据有误！")

例3-7 根据用户的身高和体重，计算用户的BMI指数，并给出相应的健康建议。BMI指数，即身体质量指数，是用体重(kg)除以身高(m)的平方得出的数字。

过轻：低于18.5
正常：18.5-23.9
过重：24-27.9
肥胖：28-32
过于肥胖：32以上

height = eval(input("请输入您的身高(m): "))
weight = eval(input("请输入您的体重(kg): "))
BMI = weight/height/height
print("您的BMI指数是: {:.1f}".format(BMI))
if BMI < 18.5:
    print("您的体型偏瘦，要多吃多运动哦！")
elif 18.5 <= BMI < 24:
    print("您的体型正常，继续保持！")
elif 24 <= BMI < 28:
    print("您的体型偏胖，有发福迹象！")
elif 28 <= BMI < 32:
    print("不要悲伤，您是个迷人的胖子！")
else:
    print("什么也不说了，您照照镜子就知道了......")

例 3-8 使用键盘输入一个三位数的正整数，输出其中的最大的一位数字是多少。

num = int(input("请输入一个三位正整数："))
a = str(num)[0]
b = str(num)[1]
c = str(num)[2]
if a > b:
    if a > c:
        max_num = a
    else:
        max_num = c
else:
    if b > c:
        max_num = b
    else:
        max_num = c
print(str(num) + "中最大的数字是："+ max_num)

3.3 循环结构

例 3-9 统计英文句子中大写字符、小写字符和数字各有多少个。

str = input("请输入一句英文：")
count_upper = 0
count_lower = 0
count_digit = 0
for s in str:
    if s.isupper(): count_upper = count_upper + 1
    if s.islower(): count_lower = count_lower + 1
    if s.isdigit(): count_digit = count_digit + 1
print("大写字符：",count_upper)
print("小写字符：",count_lower)
print("数字字符：",count_digit)

例 3-10 利用for循环求1-100中所有整数的和。

sum = 0
for i in range(1,100+1):
    sum = sum + i
print("sum=", sum)

例3-11 利用for循环求1-100中所有的奇数和偶数的和分别是多少。

sum_odd = 0
sum_even = 0
for i in range(1,100+1):
    if i % 2 == 1:
        sum_odd = sum_odd + i
    else:
        sum_even = sum_even + i
print("1-100中所有的奇数和",sum_odd)
print("1-100中所有的偶数和",sum_even)

例 3-12 利用for循环求正整数n的所有约数，即所有能把n整除的数。例如，输入6，输出1，2，3，6。

n = int(input("请输入一个正整数："))
for i in range(1,n+1):
    if n % i == 0:
        print(i,end = '  ')

n = int(input("请输入一个正整数："))
for i in range(n,0,-1):
    if n % i == 0:
        print(i,end = '  ')

例3-13 利用while语句求1-100中所有整数的和

sum = 0
i = 1
while i <= 100:
    sum = sum + i
    i = i + 1
print("sum =",sum)

例3-14 求非负数字序列中的最小值、最大值和平均值。用户输入-1就表示序列终止。

count = 0
total = 0
print("请输入一个非负整数，以-1作为输入结束！")
num = int(input("输入数据："))
min_num = num
max_num = num
while(num != -1):
    count += 1
    total += num
    if num < min_num: min_num = num  
    if num > max_num: max_num = num
    num = int(input("输入数据："))
if count > 0:
    print("最小{},最大{},均值{:.2f}".format(min_num,max_num,total/count))
else:
    print("输入为空")

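Example 3-15: Some numerical problems call for a while loop. For such problems there is a rule, called an iteration rule, that is applied repeatedly. It has been proved that repeated application reaches the solution, but when to stop depends on how the computation actually progresses. The following rule computes the arithmetic square root of a real number:
(1) To find the arithmetic square root of a positive real number x, pick any positive real number y.
(2) If y×y = x, stop; y is the arithmetic square root of x.
(3) Otherwise let z = (y + x/y)/2.
(4) Set y to z and return to step (2).
Repeating these steps yields a sequence of y values that has been proved to converge to the arithmetic square root of x. This method is known as Newton's iteration. The code is as follows: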

### With input 2.0 this version never terminates: in floating-point arithmetic y*y never equals x exactly.
x = float(input("输入一个实数："))
y = 1.0
while y*y != x:
    y = (y + x / y)/2
    print(y,y*y)
print("算术平方根为：",y)

###用牛顿迭代法求算术平方根
import math
x = float(input("输入一个正实数："))
n = 0
y = 1.0
while abs(y*y-x) > 1e-8:
    y = (y+x/y)/2
    n = n+1
    print(n,y)
print("算术平方根为：",y)
print("sqrt求算术平方根为：",math.sqrt(x))
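The tolerance-based loop can also be packaged as a function. This is a sketch under assumptions: the name newton_sqrt, the relative tolerance rel_tol, and the iteration cap max_iter are not in the original. It stops when two successive guesses agree, rather than testing y*y against x.

```python
import math


def newton_sqrt(x, rel_tol=1e-12, max_iter=100):
    """Approximate the square root of a non-negative x with Newton's iteration."""
    if x < 0:
        raise ValueError("x must be non-negative")
    if x == 0:
        return 0.0
    y = x if x >= 1 else 1.0            # any positive starting guess works
    for _ in range(max_iter):
        z = (y + x / y) / 2             # the iteration rule from the text
        if abs(z - y) <= rel_tol * z:   # successive guesses agree: stop
            return z
        y = z
    return y                            # iteration cap reached; return best guess


print(newton_sqrt(2.0), math.sqrt(2.0))
```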

例3-16 判断一个正整数n (n>=2)是否为素数。称一个大于1且除了1和它自身外，不能被其他整数整除的数为素数；否则称为合数。

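n = int(input("输入一个正整数 n(n>=2):"))
is_prime = True               # assume prime until a divisor is found
for i in range(2,n):
    if n%i == 0:
        is_prime = False
        break
if is_prime:                  # also correct for n = 2, where the loop body never runs
    print(n,"是素数")
else:
    print(n,"不是素数")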

例3-17 求两个正整数m和n的最大公约数。

m = int(input("输入一个正整数m: "))
n = int(input("输入一个正整数n: "))
for i in range(min(m,n),0,-1):
    if m % i == 0 and n % i == 0:
        print("{}和{}的最大公约数为：{}".format(m,n,i))
        break
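The loop above tries every candidate from min(m, n) downward. A faster alternative, shown here as a sketch that replaces the original approach, is the Euclidean algorithm: gcd(m, n) equals gcd(n, m % n), and the remainder reaches 0 after a few steps. The function name gcd_euclid is illustrative.

```python
import math


def gcd_euclid(m, n):
    """Greatest common divisor of two positive integers (Euclidean algorithm)."""
    while n:
        m, n = n, m % n   # replace the pair by (n, remainder)
    return m


print(gcd_euclid(12, 28), math.gcd(12, 28))
```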

例3-18 用带else子句的循环结构判断正整数n是否为素数。

n = int(input("请输入一个正整数n(n>=2):"))
for i in range(2,n):
    if n%i == 0:
        print(n,"不是素数")
        break
else:
    print(n,"是素数")

例3-19 打印"*"组成的图形。

for i in range(5):
    for m in range(10):
        print("*", end = ' ')
    print()

#打印'* '组成的图形
for m in range(1,5+1):
    for i in range(1,m+1):
        print('*',end = ' ')
    print()
for m in range(1,5+1):
    for i in range(1,6-m+1):
        print('*',end = ' ')
    print()

例3-20 找出300以内的所有素数。

for n in range(2,300):
    for i in range(2,n):
        if n % i == 0:
            break
    else:
        print("{:>4}".format(n),end = ' ')

#统计2--300之间素数的个数
count = 0
for n in range(2,300):
    for i in range(2,n):
        if n % i == 0:
            break
    else:
        print("{:>4}".format(n),end = ' ')
        count += 1
print("\n共有{}个素数".format(count))
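Trial division does not have to run up to n-1: if n has a divisor other than 1 and itself, it has one no larger than the square root of n. A minimal sketch of that shortcut follows; the helper name is_prime is an assumption, not from the original.

```python
def is_prime(n):
    """Return True if n is prime, testing divisors only while i*i <= n."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


primes = [n for n in range(2, 300) if is_prime(n)]
print(primes)
print("count =", len(primes))
```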

3.4 random库的基本应用

例 3-21 使用random库示例

from random import *
random()

ls = [1,2,3,4,5,6,7,8]
shuffle(ls)
print(ls)

例3-22 赌场中有一种称为“幸运7”的游戏，游戏规则是玩家掷两枚骰子，如果其点数之和为7，玩家就赢4元；不是7，玩家就输1元。请你分析一下，这样的规则是否公平。

from random import *
count = 0
for i in range(100000):
    num1 = randint(1,6)
    num2 = randint(1,6)
    if num1 + num2 == 7:
        count += 1
print(count/100000)
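The simulation estimates the probability of rolling a 7. The same quantity can be computed exactly by enumerating all 36 equally likely outcomes of two fair dice, which also gives the expected gain per round under the stated rules (win 4, lose 1). The variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (die1, die2) outcomes
outcomes = list(product(range(1, 7), repeat=2))
wins = sum(1 for a, b in outcomes if a + b == 7)

p_win = Fraction(wins, len(outcomes))
expected_gain = p_win * 4 + (1 - p_win) * (-1)   # +4 on a 7, -1 otherwise

print("P(sum == 7) =", p_win)
print("expected gain per round =", expected_gain)
```

A negative expected gain means the player loses money on average, so the rule favors the house.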

#  "幸运7"游戏 例 3-22-2
from random import *
money = 10
max = money
while money > 0:
    num1 = randint(1,6)
    num2 = randint(1,6)
    if num1 + num2 == 7:
        money += 4
        if money > max: max = money
    else:
        money -= 1
    print(money, end = ' ')
print("\nmax = ", max)

3.5 程序流程控制应用实例

例 3-23 找出所有的水仙花数。水仙花数是指一个3位数，它的每一位上的数字的3次幂之和等于它本身(如 1×1×1 + 5×5×5 + 3×3×3 = 153)。

for i in range(100,999+1):
    a = i//100
    b = i//10%10
    c = i%10
    if a**3 + b**3 + c**3 == i:
        print(i, end = " ")

例 3-24 找出1000以内所有的完全数。完全数又称完美数或完备数，是一些特殊的自然数，它所有的真因子的和，恰好等于它本身。第一个完全数是6，第二个完全数是28，第三个完全数是496，后面的完全数还有8 128、33 550 336等。

for n in range(1,999+1):
    sum = 0
    for i in range(1,n):
        if n % i == 0:
            sum += i
    if sum == n:
        print(n, end = " ")

例3-25 无穷级数4/1-4/3+4/5-4/7+…的和是圆周率pi，请编写一个程序计算出这一级数前n项的和。

n = int(input("请输入项数："))
PI =  0
for i in range(1,n+1):
    PI = PI + (-1)**(i+1)*(1/(2*i-1))
print("PI = ", PI*4)

import math
PI =  0
i = 1
count = 0
while abs(PI*4 - math.pi) >= 1e-6:
    PI = PI + (-1)**(i+1)*(1/(2*i-1))
    i += 1
    count += 1
print("PI = {}, 级数的前{}项数 ".format(PI*4,count))

例3-26 斐波那契数列因数学家列昂纳多.斐波那契以兔子繁殖为例而引入，故又称为“兔子数列”。斐波那契数列指的是这样一个数列：1，1，2，3，5，8，13，21，34，…，这个数列从第3项开始，每一项都等于前两项之和。现要求输出该数列的前n项，每行输出4个数字。

n = int(input("输入数列项数： "))
x1 = 1
x2 = 1
count = 2
print("{:>8}{:>8}".format(x1,x2), end = " ")
for i in range(3,n+1):
    x3 = x1 + x2
    print("{:>8}".format(x3),end = " ")
    count += 1
    if count % 4 == 0: print()
    x1 = x2
    x2 = x3

第4章列表与元组

4.1 列表介绍与元素访问

例4-1 根据输入的数字输出对应的月份信息。例如，输入“6”，则输出“It’s June.”

months = ["January", "February", "March","April","May","June","July","August","September","October","November","December"]
m = eval(input("请输入月份："))
print("It's {}.".format(months[m-1]))

4.2 操作列表元素

4.2.1 修改元素

guests = ['萧峰','杨过','令狐冲','张无忌','郭靖']
print("guests原列表元素          {}：".format(guests))
guests[-1] = "黄蓉"
print("guests修改列表最后一个元素{}：".format(guests))

4.2.2 增加元素

guests = ['萧峰', '杨过', '令狐冲', '张无忌', '黄蓉']
guests.append("段誉")
print("guests增加列表元素{}：".format(guests))
len(guests)

guests.insert(0,"张三丰")
print("guests插入列表元素到指定位置{}：".format(guests))

4.2.3 删除元素

#del删除列表元素“黄蓉”
guests = ['萧峰', '杨过', '令狐冲', '张无忌', '黄蓉', '段誉']
print("删除列表元素“黄蓉”之前{}".format(guests))
del guests[-2]
print("删除列表元素“黄蓉”之后{}".format(guests))

#pop()方法通过指定索引从列表中删除对应元素，并返回该元素。
guests = ['张三丰','萧峰','杨过','令狐冲','张无忌','郭靖']
print("原列表元素{}".format(guests))
itemDel = guests.pop(4)
print("元素'{}'已从列表中成功删除！".format(itemDel))

#使用remove方法删除指定元素
guests = ['萧峰', '杨过', '令狐冲', '张无忌', '黄蓉', '段誉']
print("原列表元素{}".format(guests))
guests.remove("段誉")
print("现列表元素{}".format(guests))

#使用remove方法删除含有重复的元素
guests = ['萧峰', '杨过','黄蓉',  '令狐冲', '张无忌', '黄蓉', '段誉']
print("原列表元素{}".format(guests))
guests.remove("黄蓉")
print("现列表元素{}".format(guests))

#len() 函数
guests = ['张三丰','萧峰', '杨过', '令狐冲', '张无忌', '黄蓉', '段誉']
len(guests)

#  元素 in 列表
guests = ['萧峰', '杨过','黄蓉',  '令狐冲', '张无忌', '黄蓉', '段誉']
'萧峰' in guests

# index() 方法
guests = ['萧峰', '杨过','黄蓉',  '令狐冲', '张无忌', '黄蓉', '段誉']
guests.index('黄蓉')

# count()方法 统计并返回列表中指定元素的个数。
guests = ['萧峰', '杨过','黄蓉',  '令狐冲', '张无忌', '黄蓉', '段誉','萧峰']
guests.count('萧峰')

4.3 操作列表

例 4-2 警察抓了A、B、C、D四个偷窃嫌疑犯，其中只有一个人是真正的小偷，审问记录如下：
A 说：“我不是小偷。”
B 说：“C是小偷。”
C 说：“小偷肯定是D。”
D 说：“C在冤枉人。”
已知四个人中有三个人说的是真话，一个人说的是假话。请问到底谁是小偷？

4.3.1 遍历列表

suspects = ['A','B','C','D']
for x in suspects:
    if (x != 'A') + (x == 'C') + (x == 'D') + (x != 'D') == 3:
        print("小偷是：",x)
        break

4.3.2 列表排序

# 1.sort() 方法 表示列表元素从小到大按升序排列
guests = ['Lily', 'Sara', 'Peter', 'Zen']
guests.sort()
print(guests)

# 2.sorted()函数 对指定的列表进行排序： sorted(列表,reverse) 
guests = ['Lily', 'Sara', 'Peter', 'Zen']
guests_new = sorted(guests,reverse = True)
print("排序前：",guests)
print("排序后：",guests_new)

4.3.3 列表切片

guests = ['张三丰','萧峰','杨过','令狐冲','张无忌','郭靖']
guests[1:3]

guests = ['张三丰','萧峰','杨过','令狐冲','张无忌','段誉','虚竹']
print("guests: ",guests)
print("guests[:5]: ",guests[:5])
print("guests[3:]: ",guests[3:])
print("guests[:]: ",guests[:])
print("guests[:-1]: ",guests[:-1])
print("guests[1:5:2]: ",guests[1:5:2])
print("guests[::-1]: ",guests[::-1])

4.3.4 列表的扩充

# “+”运算
guests = ['张三丰','萧峰','杨过','令狐冲','张无忌','段誉','虚竹']
ls = ['李秋水','郭襄','赵敏','任盈盈','袁紫衣']
guests + ls

# extend() 方法
guests = ['张三丰','萧峰','杨过','令狐冲','张无忌','段誉','虚竹']
ls = ['李秋水','郭襄','赵敏','任盈盈','袁紫衣']
guests.extend(ls)
print(guests)

# “*”运算
ls = ['李秋水','郭襄','赵敏','任盈盈','袁紫衣']
print(ls*3)

4.3.5 列表的复制

#1 利用切片实现
guests = ['张三丰', '萧峰', '杨过', '令狐冲', '张无忌', '段誉', '虚竹', '李秋水', '郭襄', '赵敏', '任盈盈', '袁紫衣']
guestsCopy = guests[:]
print(guestsCopy)

#2 Copy 方法
guests = ['张三丰', '萧峰', '杨过', '令狐冲', '张无忌', '段誉', '虚竹', '李秋水', '郭襄', '赵敏', '任盈盈', '袁紫衣']
guestsCopy = guests.copy()
print(guestsCopy)

#3 列表之间的赋值
guests = ['张三丰', '萧峰', '杨过', '令狐冲', '张无忌', '段誉', '虚竹', '李秋水', '郭襄', '赵敏', '任盈盈', '袁紫衣']
guests1 = guests
print(guests1)

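# Copy vs. alias
# Slicing and copy() both create a new list object (a shallow copy)
# Plain assignment copies nothing: both names refer to the same list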
guests = ['张三丰', '萧峰', '杨过', '令狐冲', '张无忌', '段誉', '虚竹', '李秋水', '郭襄', '赵敏', '任盈盈', '袁紫衣']
guestsCopy = guests.copy()
del guestsCopy[0]
print("guests: ",guests)
print("guestsCopy: ", guestsCopy)

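# Copy vs. alias
# Slicing and copy() both create a new list object (a shallow copy)
# Plain assignment copies nothing: both names refer to the same list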
guests = ['张三丰', '萧峰', '杨过', '令狐冲', '张无忌', '段誉', '虚竹', '李秋水', '郭襄', '赵敏', '任盈盈', '袁紫衣']
guests1 = guests
del guests1[0]
print("guests: ",guests)
print("guests1: ", guests1)
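The difference between an alias, a shallow copy, and a deep copy only becomes visible when the list contains mutable elements. The sketch below uses a nested list and the standard copy module; the variable names are illustrative.

```python
import copy

teams = [['萧峰', '杨过'], ['令狐冲', '张无忌']]

alias = teams                  # same list object under a second name
shallow = teams.copy()         # new outer list, inner lists are shared
deep = copy.deepcopy(teams)    # new outer list and new inner lists

teams[0].append('郭靖')        # change an inner list
teams.append(['段誉'])         # change the outer list

print("alias:  ", alias)       # sees both changes
print("shallow:", shallow)     # sees only the inner-list change
print("deep:   ", deep)        # sees neither change
```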

4.3.6 列表的删除

guests = ['张三丰', '萧峰', '杨过', '令狐冲', '张无忌', '段誉', '虚竹']
del guests[2:4]
guests

#guests = ['张三丰', '萧峰', '杨过', '令狐冲', '张无忌', '段誉', '虚竹']
del guests[:]
guests

# 删除列表 del 列表名
guests = ['张三丰', '萧峰', '杨过', '令狐冲', '张无忌', '段誉', '虚竹']
del guests
guests

4.4 数值列表

4.4.1 创建数值列表

#1 通过input()函数输入
lnum = eval(input("请输入一个数值列表：\n"))
print(lnum)
type(lnum)
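eval() executes whatever expression the user types, which is risky outside a classroom setting. A more cautious sketch uses ast.literal_eval from the standard library, which accepts only Python literals. The function name parse_number_list is an assumption for illustration.

```python
import ast


def parse_number_list(text):
    """Parse text such as "[1, 2, 3]" or "1,2,3" into a list, literals only."""
    value = ast.literal_eval(text)          # raises on anything but a literal
    if not isinstance(value, (list, tuple)):
        raise ValueError("expected a list such as [1, 2, 3]")
    return list(value)


print(parse_number_list("[1, 2, 3]"))
print(parse_number_list("1,2,3"))
```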

#2 通过list()函数转换
lnum = list(range(1,11))
lnum

4.4.2 列表生成式

#利用循环生成列表
lnum = []
for i in range(1,11):
    lnum.append(i**2)
lnum

#列表生成式
lnum = [i**2 for i in range(1,11)]
lnum

4.4.3 简单统计计算

例 4-3 输入10位学生的考试成绩，统计并输出其中的最高分、最低分和平均分。

#根据输入的10位学生的考试成绩，求最高分、最低分和平均分
score = eval(input("请输入10个同学的分数列表\n"))
maxScore = max(score)
minScore = min(score)
aveScore = sum(score)/len(score)
print("这次考试的最高分是{},最低分是{},平均分是{}。\n".format(maxScore,minScore,aveScore))

请输入10个同学的分数列表
1,2,3,4,5,6,7,8,9,10
这次考试的最高分是10,最低分是1,平均分是5.5。

4.5 元组

4.5.1 定义元组

# 定义元组，将多元素用“,”隔开
tupScores = (98,86,95,94,92)
print(tupScores)
print(type(tupScores))

4.5.2 操作元组

4.5.3 元组充当列表元素

group1 = [("萧峰",98),("杨过",96)]
print(group1[0])
print(group1[0][0])

#不能修改元组元素
group1 = [("萧峰",98),("杨过",96)]
group1[0][1] = 92

#通过新的元组元素替换原元组元素
group1 = [("萧峰",98),("杨过",96)]
group1[0] = ("萧峰",92)
group1

4.6 转换函数

#1 元组与列表之间的转换
tupPlay1 = ("萧峰","男",98)
print("{} is a tuple.".format(tupPlay1))
listPlay1 = list(tupPlay1)
print("{} is a list.".format(listPlay1))

#2 字符串与列表之间的转换
name = "张三丰,萧峰"
guests = list(name)
guests

#3 split()方法
# 列表 = 字符串.split(分隔符)  缺省为空格
name = "张三丰,萧峰"
guests = name.split(',')
guests

4.7 列表与元组应用实例

例 4-4 筛选法求素数。

primes = [1] * 300
primes[0:2] = [0,0]
count  = 0
for i in range(2,300):
    if primes[i] == 1:
        for j in range(i+1,300):
            if primes[j] !=0 and j % i == 0:
                primes[j] = 0
print("300以内的素数包括： ")
for i in range(2,300):
    if primes[i]:
        print(i, end = " ")
        count += 1
print("\n count = ",count)
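The inner loop above tests every j for divisibility. The classic sieve of Eratosthenes instead steps directly through the multiples of each prime, starting at i*i. A sketch follows; the function name primes_below is illustrative.

```python
def primes_below(limit):
    """Return all primes smaller than limit using the sieve of Eratosthenes."""
    is_prime = [True] * limit
    is_prime[0:2] = [False, False]               # 0 and 1 are not prime
    for i in range(2, int(limit ** 0.5) + 1):
        if is_prime[i]:
            # cross out i*i, i*i + i, ...; smaller multiples are already crossed out
            for j in range(i * i, limit, i):
                is_prime[j] = False
    return [i for i, flag in enumerate(is_prime) if flag]


result = primes_below(300)
print(result)
print("count =", len(result))
```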

例 4-5 二分查找。

ls = [34,64,67,72,73,82,83,85,87,88,90,91,96,98]
x = int(input("请输入待查找的数："))
low = 0
high = len(ls) - 1
while low <= high:
    mid = (low + high) // 2
    if ls[mid] < x:
        low = mid + 1
    elif ls[mid] > x:
        high = mid - 1
    else:
        print("找到{},索引为{}!".format(x,mid))
        break
if low > high:
    print("没有找到{}!".format(x))
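For sorted lists, the standard bisect module offers the same search. The sketch below wraps bisect_left in a helper; the name binary_search and the convention of returning -1 when the value is absent are assumptions.

```python
from bisect import bisect_left


def binary_search(sorted_list, x):
    """Return the index of x in sorted_list, or -1 if x is not present."""
    i = bisect_left(sorted_list, x)     # leftmost position where x could be inserted
    if i < len(sorted_list) and sorted_list[i] == x:
        return i
    return -1


ls = [34, 64, 67, 72, 73, 82, 83, 85, 87, 88, 90, 91, 96, 98]
print(binary_search(ls, 82), binary_search(ls, 35))
```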

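Example 4-6: To monitor food quality, the canteen ran a short survey asking students to rate the day's food as one of “很满意”, “满意”, “一般”, or “不满意”. It collected 90 questionnaires and combined all the ratings into a single string: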

不满意，一般，很满意，一般，不满意，很满意，满意…很满意，一般，一般，满意，很满意，一般

# 食堂伙食质量问卷调查
comments = ['不满意','一般','满意','很满意']
result = "不满意,一般,很满意,一般,不满意,很满意,满意,一般,一般,"\
         "不满意,满意,满意,满意,满意,满意,一般,很满意,一般,满意"
resultList = result.split(',')
commentCnts = [0] * 4
for i in range(4):
    commentCnts[i] = resultList.count(comments[i])
most = max(commentCnts)
mostComment = comments[commentCnts.index(most)]
print("根据统计，对今天伙食感觉：")
print("'很满意'的学生{}人；".format(commentCnts[3]))
print("'满意'的学生{}人；".format(commentCnts[2]))
print("'一般'的学生{}人；".format(commentCnts[1]))
print("'不满意'的学生{}人；".format(commentCnts[0]))

print("调查结果中，出现次数最多的评语是：",mostComment)

例 4-7 编写一个程序，模拟掷两个骰子100000次，统计各点数出现的概率。

from random import *
seed()
faces = [0] * 13
for i in range(100000):
    face1 = randint(1,6)    # uniform integer in 1..6; int(random()*100) % 6 slightly favors small values
    face2 = randint(1,6)
    faces[face1 + face2] += 1
print("模拟掷两个骰子100000次结果如下：")
for i in range(2,13):
    rate = faces[i] / 100000
    print("点数{}共出现了{}次".format(i,faces[i]),end = ",")
    print("出现概率{:.2%}".format(rate))

例4-8 某餐厅推出了优惠下午茶套餐活动。顾客可以以优惠的价格从给定的糕点和给定的饮料中各选一款组成套餐。已知，指定的糕点包括松饼、提拉米苏、芝士蛋糕和三明治；指定的饮料包括红茶、咖啡和橙汁。请问，一共可以搭配出多少种套餐供客户选择？并请打印输出各种套餐详情。

# Build the set-menu combinations for the afternoon tea promotion
snacks = ['松饼','提拉米苏','芝士蛋糕','三明治']
drinks = ['红茶','咖啡','橙汁']
menus = []
for snack in snacks:
    for drink in drinks:
        menu = (snack,drink)
        menus.append(menu)
print("优惠下午茶可提供的搭配套餐如下：")
for menu in menus:
    print(menu)

# 用列表生成式为优惠下午茶搭配套餐
snacks = ['松饼','提拉米苏','芝士蛋糕','三明治']
drinks = ['红茶','咖啡','橙汁']
menus = [(snack,drink) for snack in snacks for drink in drinks]
print("优惠下午茶可提供的搭配套餐如下：")
for menu in menus:
    print(menu)
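Pairing every item of one list with every item of another is a Cartesian product, which itertools.product in the standard library computes directly. A short sketch using the same data:

```python
from itertools import product

snacks = ['松饼', '提拉米苏', '芝士蛋糕', '三明治']
drinks = ['红茶', '咖啡', '橙汁']

# product yields (snack, drink) tuples in the same order as the nested loops
menus = list(product(snacks, drinks))

print("number of set menus:", len(menus))
for menu in menus:
    print(menu)
```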

例4-9 武林大会胜利召开，引发江湖众多话题。有好事之人从“筋骨、敏捷、气势、反应、技巧、内力”几个角度详细分析了各与会嘉宾的武功属性，做了一份武林大侠武功属性得分表，传播甚广。有人截取了如表4-2所示的5位大侠的相关数据保存成了如下列表。
请统计分析：
（1）5位大侠的武功总得分。
（2）5位大侠在不同属性上的平均分。
（3）分别输出5位大侠的总分和不同属性的平均分。
（4）找出总分最高的大侠。

attrs = ["筋骨","敏捷","气势","反应","技巧","内力"]
tables = [['萧峰',20,17,20,20,18,19],
         ['杨过',18,19,17,20,18,18],
         ['令狐冲',12,17,14,20,19,13],
         ['张无忌',20,17,15,14,20,20],
         ['郭靖',19,18,19,18,19,20]]
# 提取大侠名字列表 names
names = [item[0] for item in tables]

# 提取评分列表 scores
scores = [item[1:] for item in tables]

# 生成各位大侠的总分列表 totals
totals = [sum(item) for item in scores]

# 生成各个属性平均分列表 avgs
avgs = []
for j in range(6):
    avgs.append(sum([scores[i][j] for i in range(5)])/5)
    
#输出五位大侠的总分
print("\n 五位大侠的总分是：")
for i in range(5):
    print("{:<6}:{:>4}".format(names[i],totals[i]))

#输出不同属性的平均分
print("\n 不同属性的平均分是：")
for i in range(6):
    print("{:<8}:{:>4}".format(attrs[i],avgs[i]))

#输出得分最高的大侠名字
print("\n总分最高的大侠是：", end = '--')
print(names[totals.index(max(totals))])

第5章字典与集合

5.1 字典的创建与访问

5.1.1 创建字典

#1 直接创建字典
dicAreas = {'俄罗斯':1707.5,'加拿大':997.1,'中国':960.1}
print(dicAreas)
print(type(dicAreas))

#2 使用内置函数dict()创建字典
items = [('俄罗斯',1707.5),('加拿大',997.1),('中国',960.1)]
dicAreas = dict(items)
dicAreas

5.1.2 访问字典

dicAreas = {'俄罗斯':1707.5,'加拿大':997.1,'中国':960.1}
dicAreas['中国']
#dicAreas['美国']

dicAreas = {'俄罗斯':[1707.5,'莫斯科'],'加拿大':[997.1,'渥太华'],'中国':[960.1,'北京']}
dicAreas['中国']

5.2 字典的基本操作

5.2.1 空字典与字典更新

#1 添加条目
dicAreas = {'俄罗斯':1707.5,'加拿大':997.1,'中国':960.1}
dicAreas['美国'] = 936.4
dicAreas

#2 修改条目
dicAreas = {'俄罗斯':1707.5,'加拿大':991.7,'中国':960.1}
dicAreas['加拿大'] = 997.1
dicAreas

5.2.2 删除字典条目

#1 使用del命令删除指定条目
dicAreas = {'俄罗斯':1707.5,'加拿大':991.7,'中国':960.1}
del dicAreas['加拿大']
dicAreas

{'俄罗斯': 1707.5, '中国': 960.1}

#2 使用pop()方法删除指定条目
dicAreas = {'俄罗斯':1707.5,'加拿大':991.7,'中国':960.1}
area = dicAreas.pop('加拿大')
area

991.7

#3 popitem() removes and returns one item; on Python 3.7+ it is the most recently inserted item, not a random one
dicAreas = {'俄罗斯':1707.5,'加拿大':991.7,'中国':960.1}
item = dicAreas.popitem()
print(item)
print(type(item))

('中国', 960.1)
<class 'tuple'>

#4 用clear()方法清空字典条目
dicAreas = {'俄罗斯':1707.5,'加拿大':991.7,'中国':960.1}
dicAreas.clear()
dicAreas

#5 直接删除整个字典
dicAreas = {'俄罗斯':1707.5,'加拿大':991.7,'中国':960.1}
del dicAreas
dicAreas

5.2.3 查找字典条目

#1 成员运算符in
dicAreas = {'俄罗斯':1707.5,'加拿大':991.7,'中国':960.1}
'中国' in dicAreas
#'美国' in dicAreas

True

#2 用get()方法获取条目的值
dicAreas = {'俄罗斯':1707.5,'加拿大':991.7,'中国':960.1}
dicAreas.get('俄罗斯')
#dicAreas.get('巴西','未知')

1707.5

例5-1 统计英文句子"Life is short, we need Python."中各字符出现的次数。

#统计英文句子中各字符出现的次数
sentence = "Life is short, we need Python."
sentence = sentence.lower()
counts = {}
for c in sentence:
    if c in counts:
        counts[c] = counts[c] + 1
    else:
        counts[c] = 1
print(counts)

#用get方法统计英文句子中各字符出现的次数
sentence = "Life is short, we need Python."
sentence = sentence.lower()
counts = {}
for c in sentence:
    counts[c] = counts.get(c,0) + 1
print(counts)
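The standard library also provides collections.Counter, a dict subclass built for exactly this kind of tally. A brief sketch on the same sentence:

```python
from collections import Counter

sentence = "Life is short, we need Python."
counts = Counter(sentence.lower())   # maps each character to its number of occurrences

print(counts)
print(counts.most_common(3))         # the three most frequent characters
```

Unlike a plain dict, a Counter returns 0 for a key it has never seen, so no get() default is needed.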

5.3 字典的整体操作

5.3.1 字典的遍历

#1 遍历字典中所有的键
dicAreas = {'俄罗斯':1707.5,'加拿大':991.7,'中国':960.1}
for key in dicAreas.keys():
    print(key)

俄罗斯
加拿大
中国

dicAreas = {'俄罗斯':1707.5,'加拿大':991.7,'中国':960.1}
for key in dicAreas.keys():
    print(key,dicAreas[key])

#2 遍历字典中所有的值
dicAreas = {'俄罗斯':1707.5,'加拿大':991.7,'中国':960.1}
for value in dicAreas.values():
    print(value)

#3 遍历字典中所有的条目
dicAreas = {'俄罗斯':1707.5,'加拿大':991.7,'中国':960.1}
for item in dicAreas.items():
    print(item)

for k,v in dicAreas.items():
    print("{}的面积是{}万平方公里。".format(k,v))

5.3.2 字典的排序

dicAreas = {'Russia':1707.5,'Canada':991.7,'China':960.1}
sorted(dicAreas)

例5-2 按照国家名的升序输出Russia、Canada、China三个国家和对应的国土面积。

dicAreas = {'Russia':1707.5,'Canada':997.1,'China':960.1}
ls = sorted(dicAreas)
for country in ls:
    print(country,dicAreas[country])

例5-3 按照国土面积的升序输出Russia、Canada、China三个国家和对应的国土面积。

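# Output the three countries and their areas in ascending order of area
dicAreas = {'Russia':1707.5,'Canada':997.1,'China':960.1}

# Use a list comprehension to build a list of (area, country) tuples
lsVK = [(v,k) for k,v in dicAreas.items()]

# Sort the new list by area
lsVK.sort()

# Use a list comprehension to swap each tuple back to (country, area)
lsVK = [(k,v) for v,k in lsVK]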
print(lsVK)
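The swap, sort, and swap-back steps can be avoided by passing a key function to sorted(). This sketch sorts the (country, area) items directly by their second field; the name by_area is illustrative.

```python
dicAreas = {'Russia': 1707.5, 'Canada': 997.1, 'China': 960.1}

# key=... tells sorted() to compare items by area (index 1) instead of by country name
by_area = sorted(dicAreas.items(), key=lambda item: item[1])

for country, area in by_area:
    print(country, area)
```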

5.3.3 字典的合并

#1 使用for循环
dicAreas = {'俄罗斯':1707.5,'加拿大':997.1,'中国':960.1}
dicOthers = {'美国':936.4,'巴西':854.7}
for k,v in dicOthers.items():
    dicAreas[k] = v
dicAreas

#2 使用字典的update()方法
dicAreas = {'俄罗斯':1707.5,'加拿大':997.1,'中国':960.1}
dicOthers = {'美国':936.4,'巴西':854.7}
dicAreas.update(dicOthers)
dicAreas

#3 使用dict()函数
dicAreas = {'俄罗斯':1707.5,'加拿大':997.1,'中国':960.1}
dicOthers = {'美国':936.4,'巴西':854.7}
ls = list(dicAreas.items())+list(dicOthers.items())
dicAreas = dict(ls)
dicAreas

#4 使用dict()函数的另一种形式
dicAreas = {'俄罗斯':1707.5,'加拿大':997.1,'中国':960.1}
dicOthers = {'美国':936.4,'巴西':854.7}
dicAreas = dict(dicAreas, **dicOthers)
dicAreas
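dict(dicAreas, **dicOthers) only works when every key of the second dictionary is a string. Since Python 3.5, dictionary unpacking inside a literal merges dictionaries with any hashable keys and leaves both inputs unchanged. A sketch (the name merged is illustrative):

```python
dicAreas = {'俄罗斯': 1707.5, '加拿大': 997.1, '中国': 960.1}
dicOthers = {'美国': 936.4, '巴西': 854.7}

# Build a new dictionary; if a key appears in both, the later value wins
merged = {**dicAreas, **dicOthers}

print(merged)
print(len(dicAreas), len(dicOthers))   # the originals are not modified
```

On Python 3.9 and later the same merge can also be written as dicAreas | dicOthers.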

例5-4 地理课上除了介绍了各个国家的国土面积信息外，还介绍了各国的首都。假设俄罗斯、加拿大、中国、美国、巴西五国的首都信息已经保存成了字典dicCapitals，请编写程序将字典dicAreas和字典dicCapitals合并成一个新的字典dicCountries，并保存着5个国家的首都和国土面积信息。

dicAreas = {'俄罗斯': 1707.5, '加拿大': 997.1, '中国': 960.1, '美国': 936.4, '巴西': 854.7}
dicCapitals = {'俄罗斯': '莫斯科', '加拿大': '渥太华', '中国': '北京', '美国': '华盛顿', '巴西': '巴西利亚'}
dicCountries = {}
for key in dicAreas.keys():
    dicCountries[key] = [dicAreas[key],dicCapitals[key]]
print(dicCountries[key])

for item in dicCountries.items():
    print(item)

5.4 集合

5.4.1 集合的创建与访问

#1 直接创建集合
# 元素必须是不重复的
set1 = {6,3,7,3}
set1

# 元素必须是不可变的
set2 = {(1,2),(3,4)}
set2

# 元素必须是不可变的
set3 = {[1,2],[3,4]}
set3

#2 使用set()函数创建集合
s1 = set("Hello,world!")
s1

s2 = set([1,2,4,2,1,5])
s2

s3 = set(1231)
s3

#3 创建空集合
s1 = {}
type(s1)

s2 = set()
type(s2)

#4 集合的访问

例5-5 生成20个0-20之间的随机数并输出其中互不相同的数。

import random
ls = []
for i in range(20):
    ls.append(random.randint(0,20))
s = set(ls)
print("生成的20个0-20随机数为：")
print(ls)
print("其中出现的数有：")
print(s)

5.4.2 集合的基本操作

#1  添加元素  S.add(item) 将参数item作为元素添加到集合S中， 
#  如果item是序列，则将其作为一个元素整体加入集合
#  作为参数的item只能是不可变的数据

S = {1,2,3}
S.add((2,3,4))
S

# 将参数序列items中的元素拆分去重后加入集合
# 参数items可以是可变数据

S = {1,2,3}
S.update({2,3,4})
S

#2 删除元素
# S.remove(item) 将指定元素item从集合S中删除。如果元素item在集合中不存在，系统将报错。
S = {1,2,3,4}
S.remove(5)

# S.discard(item)
# 将指定元素item从集合S中删除。如果item在集合中不存在，系统正常执行，无任何输出
S = {1,2,3,4}
S.discard(5)
S

# S.pop()
# Remove and return an arbitrary element of set S (raises KeyError if S is empty)
S = {1,2,3,4}
item = S.pop()
print("刚删除了元素 ",item)

# S.clear()
# 清空集合中所有的元素
S = {1,2,3,4}
S.clear()
S

#3 Membership test
# item in S tests whether item is in set S: True if it is, False otherwise.
S = {1,2,3,4}
if 5 in S:
    S.remove(5)
else:
    print("The item is not in the Set.")

5.4.3 集合的数学运算

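Example 5-6: IEEE and TIOBE publish two popular programming-language rankings. As of December 2018, the top five languages in the IEEE ranking were Python, C++, C, Java, and C#; the top five in the TIOBE ranking were Java, C, Python, C++, and VB.NET. Write a program to find: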

(1)上榜的所有语言。
(2)在两个榜单中同时排名前五的语言。
(3)只在IEEE榜排名前五的语言。
(4)只在一个榜单排名前五的语言。

#IEEE和TIOBE榜单前五名编程语言讨论
setI = {'Python','C++','C','Java','C#'}
setT = {'Java','C','Python','C++','VB.NET'}

print("IEEE2018排行榜前五的编程语言有：")
print(setI)

print("TIOBE2018排行榜前五的编程语言有：")
print(setT)

print("前五名上榜的所有语言有：")
print(setI | setT)

print("在两个榜单同时进前五的语言有：")
print(setI & setT)

print("只在IEEE榜进前五的语言有：")
print(setI - setT)

print("只在一个榜单进前五的语言：")
print(setI ^ setT)

5.5 字典与集合应用实例

例5-7 学生基本信息如表5-4所示，请编写程序分别统计男、女生的人数，并查找所有年龄超过18岁的学生的姓名。

disStus = {'李明':('男',19),'杨柳':('女',18),'张一凡':('男',18),'许可':('女',20),'王小小':('女',19),'陈心':('女',19)}
cnts = {}
names = []
for k,v in disStus.items():
    cnts[v[0]] = cnts.get(v[0],0) + 1
    #因为字典不可以按照值直接访问键，因此用if语句将>18岁的学生姓名逐个添加到列表names中。
    if v[1] > 18:
        names.append(k)
print("学生中女生共有{}名，男生共有{}名".format(cnts['女'],cnts['男']))    
print("其中年龄超过18岁的学生有：")
print(names)

例5-8 小夏和小迪接到一个调研任务，需要按省份统计班级同学的籍贯分布情况。他们决定两人分头统计男生和女生的籍贯分布，最后再汇总结果。已知小夏统计的女生籍贯分布是：江苏3人、浙江2人、吉林1人；小迪统计的男生籍贯分布是：江苏8人、浙江5人、山东5人、安徽4人、福建2人。请编写程序将两人的调研结果合并并输出。

dicBoys = {'江苏':8,'浙江':5,'山东':5,'安徽':4,'福建':2}
dicGirls = {'江苏':3,'浙江':2,'吉林':1}

for k,v in dicGirls.items():
    dicBoys[k] = dicBoys.get(k,0) + v
print(dicBoys)

例5-9 如表5-5所示的是2018年“双十一”期间对学生进行的匿名购物调查结果的一部分。请根据该表完成以下统计和分析工作：
(1)统计每一类消费项目的平均消费金额。
(2)分别统计男生、女生“双十一”的消费总金额的平均值。

               表5-5 “双十一”学生购物调查表

性别	书本	文具	服饰	零食	日用品
女	10	30	300	150	600
女	200	10	300	300	100
男	200	100	1000	100	200
男	50	20	300	100	200
男	200	50	400	100	200
女	100	10	500	150	800
女	200	100	500	300	200
男	300	50	0	10	50
男	100	10	500	40	500
男	200	50	200	100	100

ls = [{'性别':'女','书本':10,'文具':30,'服饰':300,'零食':150,'日用品':600},
      {'性别':'女','书本':200,'文具':10,'服饰':300,'零食':300,'日用品':100},
      {'性别':'男','书本':200,'文具':100,'服饰':1000,'零食':100,'日用品':200},
      {'性别':'男','书本':50,'文具':20,'服饰':300,'零食':100,'日用品':200},
      {'性别':'男','书本':200,'文具':50,'服饰':400,'零食':100,'日用品':200},
      {'性别':'女','书本':100,'文具':10,'服饰':500,'零食':150,'日用品':800},
      {'性别':'女','书本':200,'文具':100,'服饰':500,'零食':300,'日用品':200},
      {'性别':'男','书本':300,'文具':50,'服饰':0,'零食':10,'日用品':50},
      {'性别':'男','书本':100,'文具':10,'服饰':500,'零食':40,'日用品':500},
      {'性别':'男','书本':200,'文具':50,'服饰':200,'零食':100,'日用品':100}]

#统计每一类消费项目的平均消费金额
total = {}
for dic in ls:
    for k,v in dic.items():
        if k != '性别':          # skip the non-numeric gender field
            total[k] = total.get(k,0) + v

print("每一类消费项目的平均消费金额如下所示：")
for key in total.keys():
    print(key, end = '\t')
print()
for value in total.values():
    print(value/len(ls),end = '\t')
        
#分别统计男、女学生双十一的平均消费总金额
totalMale = []
totalFemale = []

for dic in ls:
    s = 0
    for k,v in dic.items():
        if k != '性别':
            s = s + v 
    if dic['性别'] == '女':
        totalFemale.append(s)
    else:
        totalMale.append(s)
avgBySex = {}
avgBySex['女'] = sum(totalFemale)/len(totalFemale)
avgBySex['男'] = sum(totalMale)/len(totalMale)

print("\n男、女生双十一的平均消费总金额是：")
print(avgBySex)

例5-10 请根据例4-9中的武功属性表，用字典统计分析：
(1) 各位大侠的武功总得分。
(2) 五位大侠在不同属性上的平均分。
(3) 分别输出五位大侠的总分和不同属性的平均分。
(4) 找出总分最高的大侠。

                               表5-6 武功属性表

姓名	筋骨	敏捷	气势	反应	技巧	内力
萧峰	20	17	20	20	18	19
杨过	18	19	17	20	18	18
令狐冲	12	17	14	20	19	13
张无忌	20	17	15	14	20	20
郭靖	19	18	19	18	19	20

tables = {'萧峰':{'筋骨':20,'敏捷':17,'气势':20,'反应':20,'技巧':18,'内力':19},
          '杨过':{'筋骨':18,'敏捷':19,'气势':17,'反应':20,'技巧':18,'内力':18},
          '令狐冲':{'筋骨':12,'敏捷':17,'气势':14,'反应':20,'技巧':19,'内力':13},
          '张无忌':{'筋骨':20,'敏捷':17,'气势':15,'反应':14,'技巧':20,'内力':20},
          '郭靖':{'筋骨':19,'敏捷':18,'气势':19,'反应':18,'技巧':19,'内力':20}}
for k,v in tables.items():
    tables[k]['总分'] = tables[k]['筋骨']+tables[k]['敏捷']+tables[k]['气势']+tables[k]['反应']+tables[k]['技巧']+tables[k]['内力']
    print("{:<6}:{:>4}".format(k,v['总分']))
    
#求解并输出不同属性平均分
dictri = {}
for k,v in tables.items():
    dictri['筋骨'] = dictri.get('筋骨',0) + v['筋骨']
    dictri['敏捷'] = dictri.get('敏捷',0) + v['敏捷']
    dictri['气势'] = dictri.get('气势',0) + v['气势']
    dictri['反应'] = dictri.get('反应',0) + v['反应']
    dictri['技巧'] = dictri.get('技巧',0) + v['技巧']
    dictri['内力'] = dictri.get('内力',0) + v['内力']
    
n = len(tables)
for k,v in dictri.items():
    dictri[k] = v/n

print("\n 不同属性的平均分：")

for k,v in dictri.items():
    print("{:<8}:{:>4}".format(k,v))
#求解并输出得分最高的大侠

totals = [(v['总分'],k) for k,v in tables.items()]
totals.sort(reverse = True)
print('\n 总分最高的大侠是',end = '--')
print(totals[0][1])  #排序后的第一元素

第6章函数

6.1 函数的基本概念

#1 内置函数    如abs()、len()等，在程序中可以直接使用。
data_abs = abs(-87)
print(data_abs)

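#2 Standard library functions: modules such as math and random are
#  imported with an import statement, after which their functions can be used.
import math
print(math.pi)

#3 Third-party library functions: packages such as jieba and numpy are installed first,
#  then imported with an import statement, after which their functions can be used.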
import numpy as np
data_arr = np.ones(3)
print(data_arr)

#4 用户自定义函数

6.2 函数的使用

6.2.1 函数的定义和调用

def max(a,b):
    if a >= b: return a
    else: return b

max(12,45)

例6-1 编写函数，求任意个连续整数的和。

#编写函数，求1+2+3+......+100的和。
def calSum():
    sum = 0
    for i in range(1,101):
        sum += i
    print("sum = ",sum)
calSum()

#编写函数，求任意个连续整数的和。
def calSum(n1,n2):
    sum = 0
    for i in range(n1,n2+1):
        sum += i
    print('sum = ',sum)
    
m1 = int(input('初值：'))
m2 = int(input('终值：'))
calSum(m1,m2)

#编写函数，求(2+3+......+19+20)+(11+12+......+99+100)的和。
def calSum(n1,n2):
    sum = 0
    for i in range(n1,n2+1):
        sum += i
    return sum
print("sum = ",calSum(2,20) + calSum(11,100))

例6-2 思考以下三个问题：
(1) 找出2-100中所有的素数。
(2) 找出2-100中所有的孪生素数。孪生素数是指相差2的素数对，如3和5、5和7、11和13等。
(3) 将4-20中所有的偶数分解成两个素数的和。例如，6=3+3、8=3+5、10=3+7等。

# Test whether n is prime
def prime(n):
    if n < 2:
        return False
    for i in range(2,n):
        if n % i == 0:
            return False
    return True      # reached only when no divisor in 2..n-1 was found

#找出2-100中所有的素数
for i in range(2,100+1):
    if prime(i) == True:
        print("{:^4}".format(i),end = '')

#找出2-100中所有的孪生素数
for i in range(2,100+1):
    if prime(i) == True and prime(i+2) == True and i+2 <= 100:
        print("({:^4},{:^4})".format(i,i+2))

#将4-20中所有的偶数分解成两个素数的和。
for i in range(4,20+1):
    for j in range(2,i):
        if prime(j) == True and prime(i-j) == True and i%2 == 0:
            print("{:^4}={:^4}+{:^4}".format(i,j,i-j))
            break     #只要找到一种分解方式就可以退出循环了
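
The three tasks above all rest on the primality test, so it is worth making it cheap. The sketch below tests divisors only up to the square root of n, which is enough because any larger divisor pairs with a smaller one. The names `is_prime`, `primes` and `twins` are illustrative and not part of the original example.

```python
def is_prime(n):
    """Return True if n is prime, trying divisors only up to sqrt(n)."""
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

# All primes in 2..100
primes = [n for n in range(2, 101) if is_prime(n)]
# Twin primes: pairs (p, p + 2) that both lie in 2..100
twins = [(p, p + 2) for p in primes if p + 2 <= 100 and is_prime(p + 2)]

print(primes)
print(twins)
```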

6.2.2 函数的参数

#1 默认值参数
# 函数babble的第二个参数指定了默认值
def babble(words,times = 1):
    print((words+" ")* times)
# 对babble()函数进行调用。
babble('hello',3)
babble('Tiger')

例6-3 基于期中成绩和期末成绩，按指定的权重计算总评成绩。

def mySum1(mid_score,end_score,rate = 0.4):
    score = mid_score * rate + end_score * (1-rate)
    return score
print("总评成绩：{:.2f}".format(mySum1(88,93)))
print("总评成绩：{:.2f}".format(mySum1(88,93,0.5)))
print("总评成绩：{:.2f}".format(mySum1(62,78,0.6)))

#2 名称传递参数

例6-4 基于期中成绩和期末成绩，按指定的权重计算总评成绩。

def mySum1(mid_score,end_score,rate = 0.4):
    score = mid_score * rate + end_score * (1-rate)
    return score
print(mySum1(88,93))                                            # arguments passed by position
print(mySum1(mid_score = 88, end_score = 93, rate = 0.5))       # arguments passed by name
print(mySum1(rate = 0.5, end_score = 93, mid_score = 88))       # named arguments may appear in any order

例6-5 在print()函数中使用名称传递参数控制输出格式。

print(1,2,3,sep = '-')    #用“-”分隔多项输出
print(23,5,34,sep = "/")  #用“/”分隔多项输出
for i in range(1,4):      #输出之后不换行
    print(i,end = "")

for i in range(1,10):
    print(i,end = "," if i % 3 != 0 else "\n")

#3 可变参数

例6-6 利用可变参数输出名单

def commonMultiple(*c):    # c为可变参数
    for i in c:
        print("{:^4}".format(i),end = '')
    return len(c)

count = commonMultiple("李白","杜甫")
print("共{}人".format(count))
count = commonMultiple("李白","杜甫","王维","袁枚")
print("共{}人".format(count))

例6-7 利用可变参数求人数和

def commonMultiple(**d):       #d为可变参数
    total = 0
    print(d)
    for key in d:
        total += d[key]
    return total
print(commonMultiple(group1 = 5, group2 = 20, group3 = 14 ,group4 = 22))
print(commonMultiple(male = 5, female = 12))

#4 形参与实参的讨论
# 银行账户管理，实现自动将利息添加到账户余额

def addInterest(money,rate):
    money = money * (1+rate)
    
amount = 1000
rate = 0.05
addInterest(amount,rate)
print("amount=",amount)

例6-8 累计单账户利息。

def addInterest(money,rate):
    money = money * (1+rate)
    return money               #使用返回值来修改实参amount
    
amount = 1000
rate = 0.05
amount = addInterest(amount,rate)
print("amount=",amount)

例6-9 累计多账户利息。

def addInterest(money,rate):
    for i in range(len(money)):
        money[i] = money[i] * (rate+1)
        
amount = [1200,1400,800,650,1600]
rate = 0.05
addInterest(amount,rate)        #实参是列表类型时，形参将是该列表的引用。
print("amount:",amount)

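# Lists and dicts are mutable, so a function can modify them in place through its parameter;
# ints, floats, strings and bools are immutable: assigning to the parameter inside the function
# only rebinds the local name and leaves the caller's variable unchanged.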
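
A minimal sketch of the difference described above, with made-up names (`rebind`, `mutate`, `balances`): assigning to a parameter leaves the caller's variable alone, while changing a list element is visible to the caller.

```python
def rebind(value):
    value = value * 2               # rebinds the local name only

def mutate(values):
    for i in range(len(values)):
        values[i] = values[i] * 2   # modifies the caller's list in place

amount = 100
rebind(amount)

balances = [100, 200]
mutate(balances)

print(amount, balances)
```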

6.2.3 返回多个值

例6-10 编写一个函数，返回两个整数本身，以及它们的商和余数。

def fun(a,b):
    return (a,b,a//b,a%b)    #通过元组返回多个值

n1, n2, m, d = fun(6,4)
print("两个整数是：{}和{}".format(n1,n2))
print("它们的商是：",m)
print("余数是：",d)

6.3 lambda()函数

例6-11 利用lambda()函数输出列表中所有的负数。

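f = lambda x : x < 0
nums = [3,5,-7,4,-1,0,-9]        # named nums so that the built-in list() is not shadowed
for i in filter(f,nums):         # filter() keeps the elements for which f returns True (the negatives) and returns an iterator
    print(i)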

例6-12 利用lambda()函数对字典元素按值或按键排序。

dict_data = {"化1704":33,"化1702":28,"化1701":34,"化1703":30}
print(sorted(dict_data))           # sort by key, output the keys only
print(sorted(dict_data.items()))   # sort by key, output key-value pairs
print(sorted(dict_data.items(),key = lambda x:x[1]))       # sort by value, output key-value pairs
print(sorted(dict_data.items(),key = lambda x:x[1] % 10))  # sort by the last digit of the value

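Note: the optional key parameter of sorted() takes a function that is applied to each element of the iterable; the elements are ordered by the values it returns.

nums = [-2,7,-3,2,9,-1,0,4]
print(sorted(nums,key = lambda x:x*x))

words = ['their','are','this','they','is']
print(sorted(words,key = lambda x : len(x)))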
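
A key function may also return a tuple, which gives a primary and a secondary sort criterion. The sketch below reuses the class-size data from Example 6-12 and sorts by value in descending order, breaking ties by key; the names `scores` and `by_value_desc` are illustrative.

```python
scores = {"化1704": 33, "化1702": 28, "化1701": 34, "化1703": 30}

# Negating the value sorts it in descending order; the key breaks ties
by_value_desc = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
print(by_value_desc)
```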

6.4 变量的作用域

#1 局部变量
def f():
    x = 10      #局部变量
    return x*x
f()
print(x)

#2 全局变量
def f():
    x = 10        #局部变量
    return x*x       
x = 1000          #全局变量
print(x)

#3 全局变量和局部变量同名
def f():
    x = 5
    print("f内部：x=",x)
    return x*x

x = 10
print("f()=",f())
print("f外部：x=",x)

#4 函数f()中访问全局变量
def f():
    global x
    x = 5
    print("f内部：x=",x)
    return x*x
x = 10
print("f()=",f())
print("f外部：x=",x)

6.5 递归函数

例6-13 递归方法求阶乘

def fact(n):
    if n <= 1: return 1      # base case; also covers 0! = 1 so fact(0) does not recurse forever
    else: 
        return n * fact(n-1)
    
for i in range(1,9+1):
    print("{}! = ".format(i),fact(i))

Example 6-14 Compute the Fibonacci sequence recursively.

def fibo(n):
    if n == 1 or n == 2:
        return 1
    else:
       return  fibo(n-1) + fibo(n-2)

for i in range(1,20+1):
    print("{:>8}".format(fibo(i)),end = " " if i % 5 !=0 else "\n")
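
The plain recursion recomputes the same terms many times, so its cost grows very quickly with n. One common remedy, sketched here with the standard-library `functools.lru_cache`, is to cache each result the first time it is computed. The name `fibo_cached` is illustrative.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fibo_cached(n):
    """Same recurrence as fibo(), but each term is computed only once."""
    if n == 1 or n == 2:
        return 1
    return fibo_cached(n - 1) + fibo_cached(n - 2)

print(fibo_cached(20))
print(fibo_cached(80))   # finishes instantly; the uncached version would not
```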

例6-15 递归方法求最大公约数。

def gcd(a,b):
    if b == 0: return a
    else: return gcd(b,a % b)
    
print("gcd(12,24) = ", gcd(12,24))
print("gcd(48,24) = ", gcd(48,24))
print("gcd(15,11) = ", gcd(15,11))
print("gcd(15,35) = ", gcd(15,35))

6.6 函数应用实例

例6-16 编写函数，接收任意多的参数，返回一个元组，其中第一个元素为所有参数的平均值，其他元素为所有参数中大于平均值的实数。

def fun(*para):
    avg = sum(para)/len(para)         #平均值
    g = [i for i in para if i > avg]  #列表生成式
    return (avg,g)

m,l = fun(6.7,2.4,-0.1,2.15,-5.8)
print("平均值：",m)
print("大于均值的数：",l)

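Example 6-17 Write a function that extracts the acronym of a phrase. The acronym is formed from the first letter of each word, in upper case; for example, the acronym of "very important person" is "VIP".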

def fun(s):
    lst = s.split()
    return [x[0].upper() for x in lst]

s = input("输入短语：")
print("".join(fun(s)))

例6-18 小明做打字测试，请编写程序计算小明输入字符串的准确率。

def rate(origin,userInput):
    right = 0
    for origin_char,user_char in zip(origin,userInput):
        if origin_char == user_char:
            right += 1
    return right/len(origin)

origin = "Your smile will make my whole world bright."
print(origin)
userInput = input("输入：")
if len(origin) != len(userInput):
    print("字符串长度不一致，请重新输入")
else:
    print("准确率为：{:.2%}".format(rate(origin,userInput)))

例6-19 输入一段英文文本，统计出现频率最高的前10个单词(除去 of、a、the等无意义词语)。

# Preprocessing function
def getText(text):
    text = text.lower()              # convert all letters to lower case
    for ch in ",.;?!-:\'\"":
        text = text.replace(ch," ")  # replace punctuation marks with spaces
    return text

# Reference code for counting word frequencies
def wordFreq(text,topn):
    words = text.split()
    counts = {}
    for word in words:
        counts[word] = counts.get(word,0) + 1
    # Set of uninformative words to drop from the counts
    excludes = {'the','and','to','of','a','be'}
    for word in excludes:
        counts.pop(word,None)        # pop() does not raise if the word never occurred
    items = list(counts.items())
    # Sort by frequency, highest first
    items.sort(key = lambda x:x[1],reverse = True)
    return items[:topn]

text = '''I have a dream today! I have a dream that one day every valley 
shall be exalted,and every hill and mountain shall be made low, the rough 
places will be made plain, and the crooked places will be made straight; 
"and the glory of the Lord 
shall be revealed and all flesh shall see it together." This is our hope,
and this is the faith that I go back to the South with.With this faith, 
we will be able to hew out of the mountain of despair a stone of hope.With 
this faith,we will be able to transform the jangling discords of our nation 
into a beautiful symphony of brotherhood. With this faith,we will be able to
work together,to pray together, to struggle together,to go to jail together,
to stand up for freedom together,knowing that we will be free one day.'''
text = getText(text)
for word,freq in wordFreq(text,10):      # top 10, as the exercise asks
    print("{:<10}{:>}".format(word,freq))
print("统计结束")
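
The counting step can also be written with `collections.Counter` from the standard library, which does the tallying and the top-n selection itself. This is an alternative sketch, not the textbook's method; `word_freq` and `sample` are made-up names, and the text is assumed to be already stripped of punctuation.

```python
from collections import Counter

def word_freq(text, topn, excludes=("the", "and", "to", "of", "a", "be")):
    """Return the topn most frequent words, skipping the excluded ones."""
    words = [w for w in text.lower().split() if w not in excludes]
    return Counter(words).most_common(topn)

sample = "the cat and the dog and the bird saw a cat"
print(word_freq(sample, 3))
```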

第7章文件与异常

7.1 文件基础知识

7.1.1 文件与文件类型

#文件名包括两部分：主文件名和扩展名，两者之间用"."分隔；
#文件名：由用户根据操作系统的命名规则自行命名，用来与其他文件加以区别；
#扩展名根据文件类型对应专属的缩写，用来指定打开和操作该文件的应用程序。

7.1.2 目录与文件路径

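#1 Path: the location where a file is stored.

#2 Absolute path: describes the location starting from the drive name (drive letter) that holds the file.
    ## Note: on Windows the backslash "\" separates the drive, the directories and the file name.
# A path is written in a Python program as a string. Inside a string the
# backslash "\" starts an escape sequence, so each separator has to be written
# as two backslashes, e.g. "F:\\documents\\python\\5-1.py".
# Alternatively, Python offers raw strings for paths:
# r"F:\documents\python\5-1.py"       Note: the r prefix turns off backslash escaping in the string that follows.

#3 Relative path: describes the location starting from the current working directory (cwd).
# Every running program has one; by default it is the directory the program was started from.
# It can be read with os.getcwd() and changed with os.chdir().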
import os
os.getcwd()
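
A short check of the two spellings described above. It confirms that the escaped string and the raw string denote the same path, then uses `pathlib.PureWindowsPath` (standard library) to split the path into its parts. `PureWindowsPath` never touches the file system, so the sketch runs on any operating system.

```python
from pathlib import PureWindowsPath

escaped = "F:\\documents\\python\\5-1.py"
raw = r"F:\documents\python\5-1.py"
print(escaped == raw)      # both literals produce the same string

p = PureWindowsPath(raw)
print(p.drive)             # drive letter
print(p.parent)            # directory part
print(p.name, p.suffix)    # file name and extension
```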

7.2 文件操作

7.2.1 文件的打开与关闭

file = open("mydata.txt","r")

file = open("mydata.txt","w")
file

file.close()
file

file.write("文件已关闭！")
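
The last call above fails because the file has already been closed. The sketch below reproduces this in a controlled way, using a `with` block so the file is closed automatically and catching the resulting `ValueError`. The temporary file name `mydata_demo.txt` is made up for the demonstration.

```python
import os
import tempfile

# Hypothetical demo file in the system's temporary directory
path = os.path.join(tempfile.gettempdir(), "mydata_demo.txt")

with open(path, "w", encoding="utf-8") as f:
    f.write("hello")
print(f.closed)            # the with block has closed the file

try:
    f.write("more")
    write_failed = False
except ValueError:
    write_failed = True    # writing to a closed file raises ValueError
print(write_failed)

os.remove(path)            # clean up the demo file
```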

7.2.2 写文件

#用file对象.write()方法写文件
file = open("mydata.txt","w")
file.write("飞雪连天射白鹿")
file.close()

#用file对象.writelines()方法写文件
file = open("mydata.txt","w")
file.writelines(["飞雪连天射白鹿\t","笑书神侠倚碧鸳\n"])
file.writelines(["横批：越女剑\n"])
file.close()

7.2.3 读文件

#1 用file对象的read()方法读文件
# 字符串变量 = file对象.read()
file = open("mydata.txt","r")
text = file.read()
file.close()
text

#2 用file对象的readline()方法读文件
# 字符串变量 = file对象.readline()
file = open("mydata.txt","r")
text = file.readline()
file.close()
print(text)

#3 Read a file with the file object's readlines() method
# Returns the whole file as a list, one element per line.
# list_variable = file_object.readlines()
file = open("mydata.txt","r")
ls = file.readlines()
file.close()
ls

7.3 CSV文件操作

7.3.1 CSV文件的打开

with open("mydata.txt",'r') as file:     #当文件读操作结束后，系统自动调用close方法。
    print(file.readline())
    print(file.readline())

7.3.2 reader对象

import csv
with open("stu.csv",'r') as stucsv:
    reader = csv.reader(stucsv)
    for row in reader:
        print(row)

7.3.3 writer对象

import csv
with open("stu.csv","a",newline = '') as stucsv:
    writer = csv.writer(stucsv)
    writer.writerow(['张芳','女','20'])
    writer.writerow(['王虎','男','18'])

# Write several rows at once with writerows()
import csv
with open("stu.csv",'a',newline = '') as stucsv:
    writer = csv.writer(stucsv)
    # writerows() takes a single sequence of rows (a list or a tuple); each row is itself a sequence.
    writer.writerows([['张芳','女','20'],['王虎','男','18']])
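
To see what `writerows()` expects without touching a file on disk, the sketch below writes a list of rows into an in-memory `io.StringIO` buffer and reads them back with `csv.reader`. The buffer stands in for the `stu.csv` file; `rows` and `read_back` are illustrative names.

```python
import csv
import io

rows = [['张芳', '女', '20'], ['王虎', '男', '18']]

buffer = io.StringIO(newline='')   # in-memory stand-in for the CSV file
writer = csv.writer(buffer)
writer.writerows(rows)             # one argument: a sequence of rows

buffer.seek(0)                     # rewind before reading
read_back = list(csv.reader(buffer))
print(read_back)
```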

7.4 异常和异常处理

例7-1 从键盘输入a和b，求a除以b的结果并输出。

try:
    a = int(input("a="))
    b = int(input("b="))
    c = a/b
except ZeroDivisionError:
    print("除数不能为0！")
else:
    print("c=",c)
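
Example 7-1 handles division by zero only; typing something that is not an integer makes `int()` raise `ValueError`, which goes uncaught. A possible extension is sketched below as a function that takes the two inputs as strings, so it can be tried without the keyboard. The function name `divide_text` and the messages are made up.

```python
def divide_text(a_text, b_text):
    """Parse two integer strings and divide them, reporting common errors."""
    try:
        a = int(a_text)
        b = int(b_text)
        c = a / b
    except ValueError:
        return "please enter integers"
    except ZeroDivisionError:
        return "the divisor cannot be 0"
    else:
        return "c= {}".format(c)

print(divide_text("6", "3"))
print(divide_text("six", "3"))
print(divide_text("6", "0"))
```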

例7-2 读取并输出F:\documents\python目录下data2.txt文件中的内容，如果文件不存在则提醒用户先创建文件。

import os
os.chdir(r'F:\documents\python')
try:
    file = open("data2.txt",'r')
except IOError:
    print("data2.txt文件不存在，请先创建！")
else:
    text = file.read()
    print("data2.txt内容:\n",text)
    file.close()

7.5 文件与异常应用实例

例7-3 《哈姆雷特》是莎士比亚的一部经典悲剧作品。已知该作品对应的hamlet.txt文件保存在F:\documents\python目录下，
请编写程序统计hamlet.txt中出现频率最高的前10个单词，并将结果用文件名hamlet_词频.txt保存在同目录下。

# Find the 10 most frequent words in Hamlet
def getText(text):
    text = text.lower()                 
    for ch in ",.;?-:\'|":
        text = text.replace(ch, " ")   
    return text

#编写函数统计单词出现频率
# text为待统计文本，topn表示取频率最高的单词个数
def wordFreq(text,topn): 
    words  = text.split()    # 将文本分词
    counts = {}
    for word in words:
        counts[word] = counts.get(word,0) + 1
    excludes = {'the','and','to','of','a','be','it','is','not','but','with'}
    for word in excludes:
        counts.pop(word,None)    # skip excluded words that never occurred instead of raising KeyError
    items = list(counts.items())
    items.sort(key=lambda x:x[1], reverse=True)
    return items[:topn]

#编写主程序，调用函数
try:
    with open(r"F:\documents\python\hamlet.txt",'r') as file:
        text = file.read()
        text = getText(text)
        freqs = wordFreq(text,10)
except IOError:
    print("文件不存在,请确认!\n")
else:
    try:
        with open(r"F:\documents\python\hamlet_词频.txt",'w')as fileFreq:
                items =[ word + '\t' + str(freq) + '\n' for word,freq in freqs]
                fileFreq.writelines(items)
    except IOError:
        print("写入文件出错")
        for word,freq in freqs:
            print("{:<10}{:>}".format(word, freq))

例7-4 一年级要举行一个猜谜比赛，需要从儿童谜语集中随机抽题组成5份试卷。已知谜语集存储在“F:\documents\python”目录下名为“儿童谜语集.csv”的文件中，内容如图7-12所示。现要求每一份试卷中包含10道谜语，请编写程序完成组卷，并生成试卷文件和答卷文件。

import os
import csv
import random

#打开文件,将谜语集读成字典
def getDic(fileName):
    dic = {
    
    }
    with open(fileName,'r',encoding='utf-8') as file:
        reader = csv.reader(file)
        next(reader)                #跳过文件中的表头
        for row in reader:
            dic[row[0]] = row[1]      #谜面作为key,谜底作为value
    return dic

#生成长度为n的试卷列表,每一个元素为一套试卷列表
def creatPapers(dic,n):
    tests = []
    items = list(dic.keys())
    for i in range(n):
        random.shuffle(items)       #打乱列表顺序取前10题
        ls = items[:10]
        tests.append(ls)
    return tests

#生成n个试卷文件和n个答卷文件
def createFiles(lsPapers,lsAnswers,n):
    for i in range(n):
        fpn = "paper" + str(i+1) + ".txt"
        with open(fpn,'w',encoding='utf-8') as filep:
            filep.writelines([item + '\n' for item in lsPapers[i]])
        fan = "answer" + str(i+1) + ".txt"
        with open(fan,'w',encoding='utf-8') as filea:
            filea.writelines([item + '\n' for item in lsAnswers[i]])
            

#主程序,生成n套试卷和答卷
os.chdir("F:\\documents\\python")  
fn = "儿童谜语集.csv"
n = 5
riddles = getDic(fn)                
papers = creatPapers(riddles,n)

answers = []           #根据谜面列表papers生成对应的答案列表
for paper in papers:
    ans = [riddles[item] for item in paper]
    answers.append(ans)
createFiles(papers,answers,n)
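
Shuffling the whole list and taking the first ten items works. `random.sample` expresses the same idea more directly: it draws k distinct items and leaves the source list untouched. The sketch below uses a made-up riddle dictionary in place of the CSV file; `create_papers` and `per_paper` are illustrative names.

```python
import random

def create_papers(riddles, n, per_paper=10):
    """Return n papers, each a list of per_paper distinct riddle texts."""
    questions = list(riddles.keys())
    return [random.sample(questions, per_paper) for _ in range(n)]

# Made-up data standing in for the riddle collection
riddles = {"riddle{}".format(i): "answer{}".format(i) for i in range(30)}

papers = create_papers(riddles, 5)
answers = [[riddles[q] for q in paper] for paper in papers]
print(papers[0])
print(answers[0])
```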

第8章中文文本分析基础

8.1 中文文本分析相关库

8.1.1 中文分词jieba库

#1 精确模式分词
import jieba
s = "我爱北京天安门"
for x in jieba.cut(s):
    print(x, end = ' ')

# 全模式分词
import jieba
s = "我爱北京天安门"
for x in jieba.cut(s,cut_all = True):
    print(x,end = ' ')

# 搜索引擎模式分词
import jieba
s = '李明硕士毕业于中国科学院计算所'
print(jieba.lcut(s))                  #精确模式
print()
print(jieba.lcut(s,cut_all = True))   #全模式
print()
print(jieba.lcut_for_search(s))       #搜索引擎模式

# 词性标注
import jieba.posseg as psg         #引入词性标注接口
text = "我和同学一起去北京故宫玩"
seg = psg.cut(text)                #词性标注
for ele in seg:
    print(ele,end = ' ')

text = "其实走曼谷线+海岛不是只有沙美岛，还有象岛也可以去，就是远了一点，上面的酒店价格很贵，\
格兰岛就算了，全是旅游团~我们是走的曼谷+芭提雅+象岛6天5晚的行程，比其他的行程要累一点，到\
芭提雅了坐车还要3个多小时，但是人少~不喜欢热闹的可以考虑一下这个行程，很安静~"
seg = psg.cut(text)
lst = [x.word for x in seg if  x.flag == 'ns']
lst

例8-1 利用jieba分词系统中的TF-IDF接口抽取关键词示例。

from jieba import analyse
#原始文本xiaoshu
text =  '''很多人不知道的是，金庸开始武侠小说的创作，是一次很偶然的机会。1955年，《大公报》
下一个晚报有个武侠小说写得很成功的年轻人，和金庸是同事，他名叫梁羽生。那年梁羽生的武侠小说即将完结
而他的创作又到了疲惫期，于是，报纸总编辑邀请金庸将武侠小说继续写下去。虽然此前从未写过小说，但凭借
他对武侠小说的了解与喜爱，金庸还是答应接替梁羽生的任务。他把自己名字中的镛字拆开，做了一个笔名
，《书剑恩仇录》正是他的第一部武侠作品，作品一炮而红。此书成功之后，金庸又在短短的几年内创作了
《碧血剑》、《雪山飞狐》和《射雕英雄传》等作品，一时间风靡全港。十余年间，他写下15部洋洋大作。'''
#基于TF-IDF算法进行关键词抽取
#topK表示最大抽取个数，默认为20个
#withWeight表示是否返回关键词权重值，默认值为False
keywords = analyse.extract_tags(text,topK = 10,withWeight = True)
print("keywords by tfidf:")
#输出抽取出的关键词
for keyword in keywords:
    print("{:<5} weight:{:4.2f}".format(keyword[0],keyword[1]))

8.1.2 词云绘制wordcloud库

例8-2 根据字典生成词云。

import wordcloud
import random
import string      # 导入string库
# string.ascii_uppercase可以获取所有的大写字母
lstChar = [x for x in string.ascii_uppercase]
# 使用randint获取26个随机整数
lstfreq = [random.randint(1,100) for i in range(26)]
# 使用字典生成式，产生形式如{'A': 80, 'B': 11, 'C': 38……}的字典
freq = {x[0]:x[1] for x in zip(lstChar,lstfreq)}
print(freq)
wcloud = wordcloud.WordCloud(
    background_color = "white",width=1000,
    max_words = 50,
    height = 860, margin = 1).fit_words(freq)# 利用字典freq生成词云
wcloud.to_file("resultcloud.png")            # 将生成的词云图片保存
print('结束')

8.1.3 社交关系网络networkx库

例8-3 社交关系网络图的绘制。

import matplotlib.pyplot as plt        # 引入pyplot模块用于画图
import networkx as nx                   # 引入networkx用于生成关系网络
plt.rcParams['font.sans-serif']=['SimHei'] # 用来正常显示图中中文标签
G = nx.Graph()                             # 生成一张空的图
lst = [['小花','明明',0.8],['小花','小灰',0.8],['小花','小白',0.2],['小白','小灰',0.1], ['小花','大李',0.4],['大李','小灰',0.8]]
for ii in lst:                           # 向图中添加边
    G.add_edge(ii[0], ii[1], weight = ii[2]) # 读入每条边的节点和权重
# 将权重大于0.5的边添加到elarge列表中
elarge = [(u,v) for (u,v,d) in G.edges(data=True) if d['weight']>0.5]
# 将权重小于等于0.5的边添加到esmall列表中
esmall = [(u,v) for(u,v,d) in G.edges(data=True) if d['weight']<=0.5]
pos = nx.spring_layout(G) 
# 设置节点
nx.draw_networkx_nodes(G, pos, node_size = 1500)
# 设置边
nx.draw_networkx_edges(G,pos,edgelist=elarge,width=3,edge_color='g')
nx.draw_networkx_edges(G,pos,edgelist=esmall,width=1,edge_color='g')
# Draw the node labels
nx.draw_networkx_labels(G, pos, font_size = 18)
# Hide the axes
plt.axis('off')
# Add the chart title
plt.title("人物社交网络图")
# Show the chart
plt.show()


8.2 中文文本分析应用实例

8.2.1 数据准备

#编写函数读取文本文件数据
def getText(filepath):    # 传入待读取文件的文件名
    f = open(filepath, "r",encoding='utf-8')
    text = f.read()
    f.close()
    return text            # 返回读出的文本数据

8.2.2 分词并统计词频

import jieba
def stopwordslist(filepath):
    # Read one stop word per line; the with block closes the file afterwards
    with open(filepath, 'r', encoding='utf-8') as f:
        stopwords = [line.strip() for line in f]
    return stopwords

def wordFreq(filepath,text,topn):
    words  = jieba.lcut(text.strip())
    counts = {}
    stopwords = stopwordslist('stop_words.txt')
    for word in words:
        if len(word) == 1:
            continue
        elif word not in stopwords:  
            if word == "凤姐儿" :
                word="凤姐"
            elif word=="林黛玉" or word=="林妹妹" or word=="黛玉笑":
                word="黛玉"
            elif word == "宝二爷":
                word="宝玉"
            elif word == "袭人道":
                word="袭人"
            counts[word] = counts.get(word,0) + 1        
    items = list(counts.items())
    items.sort(key = lambda x:x[1], reverse = True)
    with open(filepath[:-4]+'_词频.txt', "w") as f:
        for word, count in items[:topn]:     # slicing is safe even if fewer than topn words exist
            f.write("{}\t{}\n".format(word, count))

text=getText('红楼梦.txt')
wordFreq('红楼梦.txt',text,300)
print('统计结束')

8.2.3 制作词云

import matplotlib.pyplot as plt
import wordcloud
with open("红楼梦_词频.txt",'r') as f:
    text = f.read()
wcloud = wordcloud.WordCloud(background_color = 'white',width = 1000,max_words = 500,\
                            height = 860,margin = 2).generate(text)
# generate(text) tokenizes the text and counts the words itself; the counts stored in the
# frequency file are not used to size the words
wcloud.to_file("红楼梦cloud.png")
plt.imshow(wcloud)
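
If the word sizes should follow the counts saved in section 8.2.2, the frequency file can first be parsed back into a dictionary. The sketch below does only that parsing step, on made-up sample lines in the same "word, tab, count" layout; `parse_freq_lines` is a hypothetical helper. A dictionary of this shape is the kind of input that `WordCloud.generate_from_frequencies` accepts.

```python
def parse_freq_lines(lines):
    """Turn lines of the form 'word<TAB>count' into a {word: count} dict."""
    freqs = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue                 # skip blank lines
        word, count = line.split('\t')
        freqs[word] = int(count)
    return freqs

# Made-up counts, used only to show the file layout
sample_lines = ["宝玉\t3748\n", "凤姐\t1570\n", "\n"]
freqs = parse_freq_lines(sample_lines)
print(freqs)
```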

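import matplotlib.pyplot as plt
import wordcloud
# scipy.misc.imread is no longer available in recent SciPy releases;
# plt.imread loads the image as an array (JPEG support requires Pillow)
bg_pic = plt.imread('star.jpg')
with open("红楼梦_词频.txt",'r') as f:
    text = f.read()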
wcloud = wordcloud.WordCloud(font_path = r'C:\Windows\Fonts\simhei.ttf',background_color = 'white',width = 1000,max_words = 500,
                            mask = bg_pic,height = 860,margin = 2).generate(text)
wcloud.to_file("红楼梦cloud_star.png")

#显示词云图片
plt.imshow(wcloud)
plt.axis('off')
plt.show()

8.2.4 章回处理

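#1 Split the text into chapters
import re
with open('红楼梦.txt','r',encoding='utf-8') as f:
    s = f.read()
lst_chapter = []
# Match chapter headings such as "第一回"; with a group, "第([\u4E00-\u9FA5]+)回" would return only the text between 第 and 回
chapter = re.findall("第[\u4E00-\u9FA5]+回", s)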
for x in chapter:
    if x not in lst_chapter and len(x)<=5:
        lst_chapter.append(x)
print(lst_chapter)
lst_start_chapterindex = []
for x in lst_chapter:
     lst_start_chapterindex.append(s.index(x))
print(lst_start_chapterindex)
lst_end_chapterindex = lst_start_chapterindex[1:]+[len(s)]
lst_chapterindex=list(zip(lst_start_chapterindex,lst_end_chapterindex))
print(lst_chapterindex)

['第一回', '第二回', '第三回', '第四回', '第五回', '第六回', '第七回', '第八回', '第九回', '第十回', '第十一回', '第十二回', '第十三回', '第十四回', '第十五回', '第十六回', '第十七回', '第十八回', '第十九回', '第二十回', '第二十一回', '第二十二回', '第二十三回', '第二十四回', '第二十五回', '第二十六回', '第二十七回', '第二十八回', '第二十九回', '第三十回', '第三十一回', '第三十二回', '第三十三回', '第三十四回', '第三十五回', '第三十六回', '第三十七回', '第三十八回', '第三十九回', '第四十回', '第四十一回', '第四十二回', '第四十三回', '第四十四回', '第四十五回', '第四十六回', '第四十七回', '第四十八回', '第四十九回', '第五十回', '第五十一回', '第五十二回', '第五十三回', '第五十四回', '第五十五回', '第五十六回', '第五十七回', '第五十八回', '第五十九回', '第六十回', '第六十一回', '第六十二回', '第六十三回', '第六十四回', '第六十五回', '第六十六回', '第六十七回', '第六十八回', '第六十九回', '第七十回', '第七十一回', '第七十二回', '第七十三回', '第七十四回', '第七十五回', '第七十六回', '第七十七回', '第七十八回', '第七十九回', '第八十回', '第八十一回', '第八十二回', '第八十三回', '第八十四回', '第八十五回', '第八十六回', '第八十七回', '第八十八回', '第八十九回', '第九十回', '第九十一回', '第九十二回', '第九十三回', '第九十四回', '第九十五回', '第九十六回', '第九十七回', '第九十八回', '第九十九回', '第一零零回', '第一零一回', '第一零二回', '第一零三回', '第一零四回', '第一零五回', '第一零六回', '第一零七回', '第一零八回', '第一零九回', '第一一零回', '第一一一回', '第一一二回', '第一一三回', '第一一四回', '第一一五回', '第一一六回', '第一一七回', '第一一八回', '第一一九回', '第一二零回']
[11, 8066, 14154, 22837, 28897, 36984, 44422, 51910, 58616, 64296, 69437, 75397, 79712, 84940, 90612, 95704, 103346, 111187, 116756, 126041, 131384, 137548, 144736, 150353, 158936, 167189, 174594, 181152, 190800, 195378, 201209, 208401, 214279, 219081, 226411, 234304, 241067, 249254, 255371, 261660, 271118, 277689, 285306, 292066, 298825, 306664, 314978, 322287, 329028, 336445, 344095, 351207, 359558, 367725, 373131, 381279, 390048, 401327, 408379, 413451, 421033, 427649, 437520, 448715, 458398, 465285, 470372, 479699, 487183, 494429, 498226, 507219, 514763, 522259, 533290, 542880, 550551, 561104, 570432, 575238, 582113, 588999, 597475, 605642, 613326, 621445, 627550, 634219, 640968, 647097, 653345, 658727, 659729, 666220, 674566, 681144, 687907, 697843, 703768, 709926, 715571, 723581, 728553, 735462, 741756, 747187, 753046, 759333, 766398, 775546, 782216, 789538, 796692, 803645, 808606, 815615, 822504, 830144, 837747, 847515]
[(11, 8066), (8066, 14154), (14154, 22837), (22837, 28897), (28897, 36984), (36984, 44422), (44422, 51910), (51910, 58616), (58616, 64296), (64296, 69437), (69437, 75397), (75397, 79712), (79712, 84940), (84940, 90612), (90612, 95704), (95704, 103346), (103346, 111187), (111187, 116756), (116756, 126041), (126041, 131384), (131384, 137548), (137548, 144736), (144736, 150353), (150353, 158936), (158936, 167189), (167189, 174594), (174594, 181152), (181152, 190800), (190800, 195378), (195378, 201209), (201209, 208401), (208401, 214279), (214279, 219081), (219081, 226411), (226411, 234304), (234304, 241067), (241067, 249254), (249254, 255371), (255371, 261660), (261660, 271118), (271118, 277689), (277689, 285306), (285306, 292066), (292066, 298825), (298825, 306664), (306664, 314978), (314978, 322287), (322287, 329028), (329028, 336445), (336445, 344095), (344095, 351207), (351207, 359558), (359558, 367725), (367725, 373131), (373131, 381279), (381279, 390048), (390048, 401327), (401327, 408379), (408379, 413451), (413451, 421033), (421033, 427649), (427649, 437520), (437520, 448715), (448715, 458398), (458398, 465285), (465285, 470372), (470372, 479699), (479699, 487183), (487183, 494429), (494429, 498226), (498226, 507219), (507219, 514763), (514763, 522259), (522259, 533290), (533290, 542880), (542880, 550551), (550551, 561104), (561104, 570432), (570432, 575238), (575238, 582113), (582113, 588999), (588999, 597475), (597475, 605642), (605642, 613326), (613326, 621445), (621445, 627550), (627550, 634219), (634219, 640968), (640968, 647097), (647097, 653345), (653345, 658727), (658727, 659729), (659729, 666220), (666220, 674566), (674566, 681144), (681144, 687907), (687907, 697843), (697843, 703768), (703768, 709926), (709926, 715571), (715571, 723581), (723581, 728553), (728553, 735462), (735462, 741756), (741756, 747187), (747187, 753046), (753046, 759333), (759333, 766398), (766398, 775546), (775546, 782216), (782216, 789538), (789538, 796692), (796692, 803645), 
(803645, 808606), (808606, 815615), (815615, 822504), (822504, 830144), (830144, 837747), (837747, 847515), (847515, 855342)]
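
The same splitting idea can be tried on a few lines of text. The sketch below uses `re.finditer`, which yields the position of every match directly, so no separate `index()` lookup is needed. The sample string is a short made-up stand-in for the novel, and the pattern limits the chapter number to three characters, mirroring the `len(x)<=5` filter above.

```python
import re

# Short made-up stand-in for the full text
sample = "序言\n第一回 甄士隐梦幻识通灵\n正文一\n第二回 贾夫人仙逝扬州城\n正文二\n"

# Start offset of every chapter heading
starts = [m.start() for m in re.finditer("第[\u4E00-\u9FA5]{1,3}回", sample)]
# Each chapter ends where the next one starts; the last ends with the text
ends = starts[1:] + [len(sample)]
chapters = [sample[a:b] for a, b in zip(starts, ends)]

print(list(zip(starts, ends)))
print(chapters)
```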

#2 《红楼梦》之“刘姥姥三进荣国府”
cnt_liulaolao = []
for ii in range(120):
    start = lst_chapterindex[ii][0]
    end = lst_chapterindex[ii][1]
    cnt_liulaolao.append(s[start:end].count("刘姥姥"))
#用折线图将上述代码统计的数据呈现出来。
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.figure(figsize = (18,4))
plt.plot(cnt_liulaolao,label = "刘姥姥出场次数")
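# The keyword is lowercase fontproperties; recent Matplotlib versions reject capitalized property names
plt.xlabel("章节数",fontproperties = 'SimHei')
plt.ylabel("出现次数",fontproperties = 'SimHei')
plt.legend()
plt.show()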

#3 《红楼梦》之“哭说笑闹总关情”
#统计每一回中“笑”和“喜”、“哭”与“悲”的出现次数。
cnt_laugh = []
cnt_cry = []
for ii in range(120):
    start = lst_chapterindex[ii][0]
    end = lst_chapterindex[ii][1]
    cnt_laugh.append(s[start:end].count("笑")+s[start:end].count("喜"))
    cnt_cry.append(s[start:end].count("哭")+s[start:end].count("悲"))
#用折线图将上述代码统计的数据呈现出来。
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.figure(figsize = (18,4))
plt.plot(cnt_laugh,label = "笑和喜")
plt.plot(cnt_cry,label = "哭与悲")
plt.xlabel("章节数",fontproperties = 'SimHei')
plt.ylabel("出现次数",fontproperties = 'SimHei')
plt.legend()
plt.title("《红楼梦》120回悲喜变化图",fontproperties = 'SimHei')
plt.show()

#4 《红楼梦》之平均段落数与字数
import matplotlib.pyplot as plt
cnt_chap = [] #存放每一回的段落数 
cnt_word = [] #存放每一回的字数
for ii in range(120):
    start = lst_chapterindex[ii][0]
    end = lst_chapterindex[ii][1]
    cnt_chap.append(s[start:end].count('\n'))
    cnt_word.append(len(s[start:end]))
#绘制散点图。
plt.figure(figsize = (8,6))
plt.scatter(cnt_chap,cnt_word)
for ii in range(120):
    plt.text(cnt_chap[ii]-2,cnt_word[ii]+100,lst_chapter[ii],fontproperties = 'SimHei',size = 7)
plt.xlabel("章节段数",fontproperties = 'SimHei')
plt.ylabel("章节字数",fontproperties = 'SimHei')
plt.title("《红楼梦》120回",fontproperties = 'SimHei')
plt.show()

#5 《红楼梦》之人物社交关系网络
# 生成人物关系权重
Names=['宝玉','凤姐','贾母','黛玉','王夫人','老太太','袭人','贾琏','平儿','宝钗','薛姨妈','探春','鸳鸯',
       '贾政','晴雯','湘云','刘姥姥','邢夫人','贾珍','紫鹃','香菱','尤氏','薛蟠','贾赦']

import networkx as nx
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
relations={}
lst_para=s.split('\n') #按段落划分，假设在同一段落中出现的人物具有共现关系
for text in lst_para:
    for name1 in Names:
        if name1 in text:
            for name2 in Names:
                if name2 in text and name1!=name2 and (name2,name1) not in relations:
                    relations[(name1,name2)]=relations.get((name1,name2),0)+1
                    
print(relations.items())
maxRela=max([v for k,v in relations.items()])
relations={k:v/maxRela for k,v in relations.items()}
print(relations.items(),maxRela)

plt.figure(figsize=(15,15))
G=nx.Graph()

for k,v in relations.items():
    G.add_edge(k[0],k[1],weight = v)

elarge=[(u,v) for (u,v,d) in G.edges(data=True)
        if d['weight'] >0.6]

emidle = [(u,v) for (u,v,d) in G.edges(data=True)
          if (d['weight'] >0.3) & (d['weight'] <= 0.6)]

esmall=[(u,v) for (u,v,d) in G.edges(data=True)
        if d['weight'] <=0.3]

pos=nx.spring_layout(G) 

nx.draw_networkx_nodes(G,pos,alpha=0.8,node_size= 800)

nx.draw_networkx_edges(G,pos,edgelist=elarge,width=2.5,
                       alpha=0.9,edge_color='g')

nx.draw_networkx_edges(G,pos,edgelist=emidle,width=1.5,
                       alpha=0.6,edge_color='y')

nx.draw_networkx_edges(G,pos,edgelist=esmall,width=1,
                       alpha=0.4,edge_color='b',style='dashed')

nx.draw_networkx_labels(G,pos,font_size= 12)
plt.axis('off')
plt.title("《红楼梦》主要人物社交关系网络图")
plt.show()
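
The co-occurrence counting at the heart of this example can be isolated from the plotting. The sketch below uses `itertools.combinations` to enumerate each unordered pair of names once per paragraph, then scales the counts by the largest one, as the code above does. The three paragraphs are made up for illustration.

```python
from collections import Counter
from itertools import combinations

names = ['宝玉', '黛玉', '宝钗']
# Made-up paragraphs, standing in for the novel split on '\n'
paragraphs = ["宝玉同黛玉说话", "宝玉、黛玉、宝钗都在", "宝钗独坐"]

pair_counts = Counter()
for para in paragraphs:
    present = [n for n in names if n in para]   # names found in this paragraph
    for pair in combinations(present, 2):       # each unordered pair once
        pair_counts[pair] += 1

max_count = max(pair_counts.values())
weights = {pair: c / max_count for pair, c in pair_counts.items()}
print(weights)
```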

# Redraw the same network with a circular layout. Names, relations, G, elarge, emidle and esmall
# are reused from the code above.
plt.figure(figsize=(15,15))
#布局模型
pos=nx.circular_layout(G)
nx.draw_networkx_nodes(G,pos,alpha=0.6,node_size = 800)
#alpha是透明度，width是连接线的宽度
nx.draw_networkx_edges(G,pos,edgelist=elarge,width=2.5,
                       alpha=0.9,edge_color='g')
nx.draw_networkx_edges(G,pos,edgelist=emidle,width=1.5,
                       alpha=0.6,edge_color='y')
nx.draw_networkx_edges(G,pos,edgelist=esmall,width=1,
                       alpha=0.2,edge_color='b',style='dashed')
nx.draw_networkx_labels(G,pos,font_size=12)
plt.axis('off')
plt.title("《红楼梦》主要人物社交关系网络图")
plt.show()

第9章科学计算基础：numpy库和matplotlib库的应用

9.1 numpy库的使用

9.1.1 核心对象：ndarray

import numpy as np
#创建一维数组
aArray = np.array([1,2,3])
print(type(aArray))
print("数组的秩为：",aArray.ndim)
print("数组的形状为：",aArray.shape)
print("数组的元素个数为：",aArray.size)
print("数组元素的类型为：",aArray.dtype)
print("数组元素占用的字节数为：",aArray.itemsize)

9.1.2 创建数组的常用方法

#1 arange([start,]stop,[step,]dtype=None)
import numpy as np
Array1 = np.arange(0,10.0,0.1)
Array1

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2,
       1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2. , 2.1, 2.2, 2.3, 2.4, 2.5,
       2.6, 2.7, 2.8, 2.9, 3. , 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8,
       3.9, 4. , 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5. , 5.1,
       5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8, 5.9, 6. , 6.1, 6.2, 6.3, 6.4,
       6.5, 6.6, 6.7, 6.8, 6.9, 7. , 7.1, 7.2, 7.3, 7.4, 7.5, 7.6, 7.7,
       7.8, 7.9, 8. , 8.1, 8.2, 8.3, 8.4, 8.5, 8.6, 8.7, 8.8, 8.9, 9. ,
       9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 9.8, 9.9])

#2 linspace(start,stop,num=50,endpoint=True,retstep=False,dtype=None)
Array2 = np.linspace(1,10,4)
Array2

#3 ones(shape,dtype=None)
Array3 = np.ones(5)
Array3

#4 zeros(shape,dtype=float)
Array5 = np.zeros((2,3))
Array5

#5 full(shape,fill_value)
np.full((3,4),5)

#6 eye(N,M=None,dtype=float)
Array6 = np.eye(3,dtype=int)
Array6

#7 random.rand(d0,d1,...,dn) 创建n维数组，元素为0到1之间的随机小数。
Array8 = np.random.rand(3,4)
Array8

9.1.3 ndarray的数据类型

9.2 数组对象的常见操作

9.2.1 数组的基本运算

例9-1 假设有4个人共同参加一个测试，每个人分别做两次，两次测试成绩分别为99,98,80,60和80,75,65,80，请计算每个人两次测试的总分。

#方法1：利用列表来存储4个人一次测试成绩。
test1 = [99,98,80,60]
test2 = [80,75,65,80]
tsum = []
for i in range(4):
    tsum.append(test1[i]+test2[i])
print(tsum)

#方法2：利用numpy一维数组来存储4个人一次测试的成绩。
import numpy as np
tArray1 = np.array([99,98,80,60])
tArray2 = np.array([80,75,65,80])
tSum = tArray1 + tArray2
print(tSum)

9.2.2 ndarray的基本索引和切片

room = np.array([[[0,1,2,3],[4,5,6,7],[8,9,10,11]],[[12,13,14,15],[16,17,18,19],[20,21,22,23]]])
room

room[:,0,0]

room[0,:,:]

room[0,1,:]

room[0,1,0:4:2]

9.2.3 ndarray的形态变换操作

#1 reshape(shape)
import numpy as np
Array1 = np.arange(12)
Array2 = Array1.reshape((2,6))
Array2

Array3 = Array1.reshape((2,2,3))
Array3

Array1  #reshape()函数不会改变原数组

#2 resize()函数  会改变原数组
Array1.resize((3,4))
Array1

#3 transpose() 实现对数组的按轴转置
Array5 = np.arange(24).reshape((2,3,4))
Array5

Array5.transpose() #完全转置，轴的顺序变为(2,1,0)

Array5.transpose((0,2,1)) #执行的是行列转置

#4 flatten()函数
Array6 = np.arange(6).reshape((2,3))
Array6

Array7 = Array6.flatten()
Array7

9.2.4 ndarray常用的统计方法

Array6 = np.arange(6).reshape((2,3))
Array6

Array6.sum()

Array6.sum(axis=0) #按列求和

Array6.max()

Array6.max(axis=0) #按列求最大值

Array6.max(axis=1) #按行求最大值

Array6.cumsum(axis=0) #数组按0轴方向累积求和，即当前行是前面所有行元素和。

Array6.cumsum(axis=1)

9.3 numpy库的专门应用

9.3.1 numpy库在线性代数的应用

import numpy as np
Array1 = np.arange(6).reshape((2,3))
Array1

Array2 = np.arange(6).reshape((3,2))
Array2

Array1.dot(Array2)

Array3 = np.arange(4).reshape((2,2))
Array3

detArray3 = np.linalg.det(Array3)
detArray3

invArray3 = np.linalg.inv(Array3)
invArray3

eyeArray = Array3.dot(invArray3)  #互逆矩阵进行内积运算
eyeArray

eigenvalues,eigenvectors = np.linalg.eig(Array3) #求特征值和特征向量
eigenvalues
#eigenvectors

例9-2 “鸡兔同笼”问题：今有雉（鸡）兔同笼，上有三十五头，下有九十四足。问雉兔各几何。

# Let x be the number of chickens and y the number of rabbits; the linear system is: x + y = 35, 2x + 4y = 94
import numpy as np
heads,foots = 35,94
A = np.array([[1,1],[2,4]])  #方程组的系数矩阵
b = np.array([heads,foots])
X = np.linalg.solve(A,b)
print("鸡：{}，兔：{}".format(X[0],X[1]))
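
For a two-by-two system such as this one, the solution can also be written out by hand with Cramer's rule, which makes it easy to check what a linear solver returns. The sketch below is pure Python; `solve_2x2` is an illustrative name, not a numpy function.

```python
def solve_2x2(a11, a12, a21, a22, b1, b2):
    """Solve a11*x + a12*y = b1, a21*x + a22*y = b2 by Cramer's rule."""
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("the system has no unique solution")
    x = (b1 * a22 - a12 * b2) / det
    y = (a11 * b2 - b1 * a21) / det
    return x, y

# x + y = 35 (heads), 2x + 4y = 94 (feet)
chickens, rabbits = solve_2x2(1, 1, 2, 4, 35, 94)
print(chickens, rabbits)
```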

9.3.2 多项式的应用

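#1 poly1d(A) creates a polynomial from its coefficients, highest power first
# f(x) = x**3 - 2x + 1
import numpy as np
A = np.array([1,0,-2,1])
f = np.poly1d(A)
print(f)       # print the polynomial f as a mathematical expression
print(f(1))    # evaluate f(x) = x**3 - 2x + 1 at x = 1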

#2 polyval(p,k) computes the value of polynomial p at x = k.
np.polyval(f,2) # evaluate f(x) = x**3 - 2x + 1 at x = 2
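
Evaluating a polynomial from a coefficient list (highest power first) is commonly done with Horner's rule, sketched below in pure Python for the same f(x) = x**3 - 2x + 1. The function name `horner` is illustrative; the sketch does not claim to show how numpy implements `polyval` internally.

```python
def horner(coeffs, x):
    """Evaluate a polynomial given its coefficients, highest power first."""
    result = 0
    for c in coeffs:
        result = result * x + c   # fold in one coefficient per step
    return result

coeffs = [1, 0, -2, 1]            # x**3 - 2x + 1
print(horner(coeffs, 1))
print(horner(coeffs, 2))
```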

#3 polyder(p,m=1)函数用于求多项式p的m阶导数，m默认值为1。
fder1 = np.polyder(f)     #求多项式f的一阶导
print(fder1)

fder2 = np.polyder(f,2)    #求多项式f的二阶导
print(fder2)

#4 polyint(p,m=1)函数用于求多项式p的m重积分，m默认为1。
fint1 = np.polyint(f)     #求多项式f的一重积分
fint1

print(fint1)

fint2 = np.polyint(f,m=2)
print(fint2)

#5 polyadd(p1,p2) adds two polynomials
p1 = np.poly1d(np.array([1,2,3]))
p2 = np.poly1d(np.array([1,2,3,4]))
print(p1)
print()
print(p2)
print()
print(np.polyadd(p1,p2))

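#6 polysub(p1,p2) subtracts two polynomials
print(np.polysub(p2,p1))   # same result as p2 - p1

#7 polymul(p1,p2) multiplies two polynomials
print(np.polymul(p2,p1))

#8 polydiv(p1,p2) divides two polynomials and returns (quotient, remainder)
print(np.polydiv(p2,p1))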
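Because polydiv() returns two polynomials, it helps to unpack them and confirm the division identity p2 = quotient * p1 + remainder. A minimal sketch with the same p1 and p2:

```python
import numpy as np

p1 = np.poly1d([1, 2, 3])      # x^2 + 2x + 3
p2 = np.poly1d([1, 2, 3, 4])   # x^3 + 2x^2 + 3x + 4

quotient, remainder = np.polydiv(p2, p1)
print(quotient)    # the quotient polynomial
print(remainder)   # the remainder polynomial

# Division identity: p2 == quotient * p1 + remainder
rebuilt = quotient * p1 + remainder
assert np.allclose(rebuilt.coeffs, p2.coeffs)
```

For these inputs the quotient is x and the remainder is the constant 4.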

#9 polyfit(x,y,k) fits a polynomial to the data points x, y; k is the degree of the fitted polynomial
x = np.array([0.0,1.0,2.0,3.0,4.0,5.0])
y = np.array([0.0,0.8,0.9,0.1,-0.8,-1.0])
parray = np.polyfit(x,y,3)  # polyfit returns the coefficient array of the fitted polynomial
p = np.poly1d(parray)
print(parray)
print(p)

9.4 Reading and writing arrays to files

#1 numpy.savetxt(fname,X,fmt='%.18e',delimiter=' ')
# writes a 1-D or 2-D array to a text file
import numpy as np
data = np.arange(50).reshape(5,10)
np.savetxt('data_txt.txt',data,fmt='%d')
np.savetxt('d:\\data_csv.csv',data,fmt='%d',delimiter=',')

#2 numpy.loadtxt(fname,dtype=float,delimiter=None)
data1Fromfile = np.loadtxt('data_txt.txt')
data1Fromfile

data2Fromfile = np.loadtxt('d:\\data_csv.csv',dtype = int,delimiter = ',')  # np.int was removed in NumPy 1.24; use the built-in int
data2Fromfile

#3 numpy.ndarray.tofile(fname,sep = "",format = "%s")
data = np.arange(50).reshape((2,5,5))
data.tofile("data_tofile1.dat")  # sep="" writes raw binary; format only applies to text output
data

#4 numpy.fromfile(fname,dtype=float,count=-1,sep='')
dataNew = np.fromfile("data_tofile1.dat",dtype=data.dtype)  # the file stores neither dtype nor shape, so reuse the dtype it was written with
dataNew

dataShape = np.fromfile("data_tofile1.dat",dtype=data.dtype).reshape((2,5,5))
dataShape

#5 numpy.save(fname,X)
data = np.arange(50).reshape((2,5,5))
np.save("data_save.npy",data)
data

#6 numpy.load(fname)
dataLoad = np.load("data_save.npy")
dataLoad
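The three storage styles above differ in what they preserve. The sketch below writes a small array with each of them into a temporary folder and reads it back; the file names are hypothetical and the temporary folder is only there so that nothing is left on disk.

```python
import os
import tempfile
import numpy as np

data = np.arange(12).reshape((3, 4))

with tempfile.TemporaryDirectory() as tmp:      # scratch folder, removed afterwards
    txt_path = os.path.join(tmp, "demo.csv")    # hypothetical file names
    npy_path = os.path.join(tmp, "demo.npy")
    raw_path = os.path.join(tmp, "demo.dat")

    # Text: human readable, shape kept for 1-D/2-D arrays, dtype must be given on load.
    np.savetxt(txt_path, data, fmt="%d", delimiter=",")
    from_txt = np.loadtxt(txt_path, dtype=int, delimiter=",")

    # .npy: stores dtype and shape together with the values.
    np.save(npy_path, data)
    from_npy = np.load(npy_path)

    # Raw binary: values only, so dtype and shape have to be supplied again.
    data.tofile(raw_path)
    from_raw = np.fromfile(raw_path, dtype=data.dtype).reshape(data.shape)

print(from_txt.shape, from_npy.shape, from_raw.shape)
```

When the data only needs to be read back by numpy, save()/load() is usually the least error-prone of the three because nothing has to be remembered separately.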

9.5 Using the matplotlib library

9.5.1 Using the pyplot module

Example 9-3 Drawing a basic figure

import matplotlib.pyplot as plt
import matplotlib
import numpy as np
matplotlib.rcParams['font.family'] = 'SimHei'        # Chinese-capable font (assumes SimHei is installed)
matplotlib.rcParams['axes.unicode_minus'] = False    # display the minus sign correctly with this font
x = np.linspace(0,10,100)
y = np.sin(x)
f1 = plt.figure(1)     # create figure 1 before plotting into it
plt.plot(x,y,'r')
plt.title("我是图标题")
plt.xlabel("我是x轴标签")
plt.ylabel("我是y轴标签")
plt.text(np.pi,0.6,"我是图文字")
plt.ylim(-2,2)
plt.legend(labels=["我是图例"])
plt.show()

9.5.2 Using the pyplot.plot() function

#plot(x,y,s,linewidth)
#  x: the x-coordinates; when omitted, the indices of the y data are used.
# Draws Figure 9-5
import matplotlib.pyplot as plt
plt.plot([1,2,3])
plt.show()

# Draws Figure 9-6
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(10)
y = np.arange(10)
plt.plot(x,y,"r*:")
plt.show()


# Draws Figure 9-7
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(10)
y1 = x
y2 = 2*x
y3 = 3*x
plt.plot(x,y1,"ro--",x,y2,"gv:",x,y3,"bs-")
plt.show()

9.5.3 Setting axes, labels and other properties in pyplot

# Example 9-4 Draws Figure 9-8
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
##matplotlib.rcParams['font.family']='SimHei'       # use the SimHei font for Chinese text
##matplotlib.rcParams['axes.unicode_minus']=False   # display the minus sign on the axes correctly
x=np.arange(10)
y1=x
y2=2*x
y3=3*x
plt.plot(x,y1,"ro--",x,y2,"gv:",x,y3,"bs-")
plt.axis('scaled')    # use the same scale on the x and y axes
plt.xlim(0,10)        # set the x-axis range
plt.ylim(0,30)        # set the y-axis range
xmin,xmax,ymin,ymax=plt.axis()   # get the current x-axis and y-axis ranges
print("x轴[{},{}]y轴:[{},{}]".format(xmin,xmax,ymin,ymax))
plt.xticks([0,1,2,3,4,5,6,7,8,9,10])
plt.yticks([5,10,15,20,25,30],['a','b','c','d','e','f']) # 'a' labels 5, 'b' labels 10, and so on
y_ticks,labels=plt.yticks()     # returns the y-axis tick positions and their labels
print(y_ticks)          # tick positions on the y axis
print(type(labels))     # type of the returned tick labels
for label in labels:    # tick labels on the y axis
    print(label)
plt.xlabel("x-axis")    # set the x-axis label
plt.ylabel("y-axis")    # set the y-axis label
plt.legend(["y1=x","y2=2x","y3=3x"])   # add a legend
plt.text(4,2,"TEXT")  # replace TEXT with the text you want to show
plt.title("TITLE")    # replace TITLE with the title you want to show

plt.show()

x轴[0.0,10.0]y轴:[0.0,30.0]
[ 5 10 15 20 25 30]
<class 'matplotlib.cbook.silent_list'>
Text(0, 0, 'a')
Text(0, 0, 'b')
Text(0, 0, 'c')
Text(0, 0, 'd')
Text(0, 0, 'e')
Text(0, 0, 'f')


9.5.4 Examples of pyplot plotting functions

#1 bar(x,height)
# Example 9-5
import matplotlib.pyplot as plt
import numpy as np
x=np.arange(7)
height=[3, 4, 7, 6, 2, 8, 9]
plt.bar(x,height)
plt.show()

#2 scatter(x,y)
# Example 9-6
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(7)
y = [3,4,7,6,2,8,9]
plt.scatter(x,y)
plt.show()

#3 pie(x,explode=None,labels=None,autopct=None,shadow=False)
# Example 9-7
import matplotlib.pyplot as plt
Labels = 'Class-A', 'Class-B', 'Class-C', 'Class-D'
data = [15, 30, 45, 10]
Explode = (0, 0.1, 0, 0)  
plt.pie(data, explode=Explode,labels=Labels,autopct='%.2f%%')
plt.show()

9.5.5 Drawing subplots with subplot()

Example 9-8 Drawing several subplots

import matplotlib.pyplot as plt
plt.subplot(2,2,1)
plt.bar(range(7), [3, 4, 7, 6, 2, 8, 9])
plt.subplot(2,2,2)
plt.plot(range(7), [3, 4, 7, 6, 2, 8, 9])
plt.subplot(2,2,3)
plt.scatter(range(7), [3, 4, 7, 6, 2, 8, 9])
plt.subplot(2,2,4)
plt.barh(range(7), [3, 4, 7, 6, 2, 8, 9])
plt.show()
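The same 2x2 layout can also be written with plt.subplots(), which returns the figure and an array of Axes objects so that each panel is addressed explicitly. This is an alternative sketch, not the approach used in the example above; the Agg backend line is an assumption made so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")              # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt

values = [3, 4, 7, 6, 2, 8, 9]
positions = range(len(values))

fig, axes = plt.subplots(2, 2, figsize=(8, 6))   # axes is a 2x2 array of Axes objects
axes[0, 0].bar(positions, values)       # top left: vertical bars
axes[0, 1].plot(positions, values)      # top right: line
axes[1, 0].scatter(positions, values)   # bottom left: scatter
axes[1, 1].barh(positions, values)      # bottom right: horizontal bars
fig.tight_layout()                      # avoid overlapping tick labels
```

With an interactive backend, calling plt.show() afterwards displays the figure.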

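9.5.6 Displaying Chinese text in matplotlib

Example 9-9 Drawing a figure that contains Chinese text (with the default font, the Chinese characters are typically drawn as empty boxes)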

import matplotlib.pyplot as plt
plt.plot([1,2,4],[1,2,3])
plt.title("坐标系标题")
plt.xlabel('时间(s)')
plt.ylabel('范围(m)')
plt.show()

Example 9-9, revision 1: set a Chinese font for the whole figure through rcParams

import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams['font.family']='kaiti'
plt.plot([1,2,4], [1,2,3])
plt.title("坐标系标题")
plt.xlabel('时间(s)')
plt.ylabel('范围(m)')
plt.show()

Example 9-9, revision 2: set the font of individual text elements through fontproperties

import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams['font.family']='kaiti'
plt.plot([1,2,4], [1,2,3])
plt.title("坐标系标题",fontproperties="Simhei")
plt.xlabel('时间(s)',fontproperties="Kaiti")
plt.ylabel('范围(m)',fontproperties="Microsoft YaHei")
plt.show()

D:\Application\Anaconda\envs\py36\lib\site-packages\matplotlib\font_manager.py:1238: UserWarning: findfont: Font family ['Microsoft YaHei'] not found. Falling back to DejaVu Sans.
  (prop.get_family(), self.defaultFamily[fontext]))
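The warning above appears when a requested font is not installed. A possible way to avoid it is to ask Matplotlib which fonts it has registered and pick the first available one from a list of candidates. The helper pick_font and the candidate names below are illustrative assumptions, not part of the original example.

```python
from matplotlib import font_manager

def pick_font(candidates):
    """Return the first candidate font name that Matplotlib has registered, else None."""
    installed = {font.name for font in font_manager.fontManager.ttflist}
    for name in candidates:
        if name in installed:
            return name
    return None

# Candidate names are assumptions; adjust them for the fonts on your system.
chosen = pick_font(["SimHei", "KaiTi", "Microsoft YaHei", "Noto Sans CJK SC"])
print(chosen)
```

If chosen is not None, it can be assigned to matplotlib.rcParams['font.family'] as in revision 1.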


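9.6 Application example for the scientific computing libraries

Example 9-3 A class has 30 students, each taking three courses. The student IDs and course scores are shown in Figure 9-17. To simplify data entry, the IDs and scores are stored in the file "student_score.csv", whose content is shown in Figure 9-18. Compute the total score of each student over the three courses and the average, highest and lowest score of each course, then plot the score distribution of the three courses.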

import numpy as np
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams['font.family']='SimHei'
stuScore=np.loadtxt('student_score.csv',delimiter=',',skiprows =1)
sumEach=np.sum(stuScore[:,1:],axis=1)   # total score of each student over the three courses
avgEachCourse=np.average(stuScore[:,1:],axis=0) # average score of each course over all students
maxMath=np.max(stuScore[:,1])  # highest math score
maxEng=np.max(stuScore[:,2])   # highest English score
maxPython=np.max(stuScore[:,3])# highest Python score
minMath=np.min(stuScore[:,1])  # lowest math score
minEng=np.min(stuScore[:,2])   # lowest English score
minPython=np.min(stuScore[:,3])# lowest Python score
print("每个学生的三门课程总分：")
print(sumEach)
print("所有学生的每门课程平均分：")
print(avgEachCourse)
print("每门课程的最高分：")
print(maxMath,maxEng,maxPython)
print("每门课程的最低分：")
print(minMath,minEng,minPython)
mathScore=stuScore[:,1]  # math scores
engScore=stuScore[:,2]   # English scores
pythonScore=stuScore[:,3]# Python scores
plt.suptitle("课程成绩分布直方图")  # title for the whole figure
# Histogram of math scores
plt.subplot(3,1,1)
plt.hist(mathScore,bins=10,range=(0,100),color='red') # histogram with 10 bins from 0 to 100
plt.xlabel("高数成绩分数段") # x-axis label
plt.ylabel("人数") # y-axis label
plt.xlim(0,100)  # x-axis range
plt.xticks([0,10,20,30,40,50,60,70,80,90,100]) # x-axis ticks
plt.yticks([0,2,4,6,8,10,12,14,16,18,20]) # y-axis ticks
plt.grid()            # show grid lines
# Histogram of English scores
plt.subplot(3,1,2)
plt.hist(engScore,bins=10,range=(0,100),color='green')# same as above
plt.xlabel("英语成绩分数段")
plt.ylabel("人数")
plt.xlim(0,100)
plt.xticks([0,10,20,30,40,50,60,70,80,90,100])
plt.yticks([0,2,4,6,8,10,12,14,16,18,20])
plt.grid()
# Histogram of Python scores
plt.subplot(3,1,3)
plt.hist(pythonScore,bins=10,range=(0,100))  # same as above
plt.xlabel("Python成绩分数段")
plt.ylabel("人数")
plt.xlim(0,100)
plt.xticks([0,10,20,30,40,50,60,70,80,90,100])
plt.yticks([0,2,4,6,8,10,12,14,16,18,20])
plt.grid()
plt.show()
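The statistics in this example all depend on choosing the right axis. Since "student_score.csv" is not included here, the sketch below uses three made-up rows (an assumption, not the book's data) with the same column layout: student ID, math, English, Python.

```python
import numpy as np

# Made-up rows: student id, math, English, Python (not the book's data file).
scores = np.array([
    [1, 80, 70, 90],
    [2, 60, 85, 75],
    [3, 95, 65, 88],
], dtype=float)

courses = scores[:, 1:]                       # drop the id column

total_per_student = courses.sum(axis=1)       # across columns: one value per row
mean_per_course = courses.mean(axis=0)        # down the rows: one value per column
max_per_course = courses.max(axis=0)
min_per_course = courses.min(axis=0)

print(total_per_student)   # [240. 220. 248.]
print(max_per_course)      # [95. 85. 90.]
print(min_per_course)      # [60. 65. 75.]
```

axis=1 collapses the columns and gives one result per student, while axis=0 collapses the rows and gives one result per course.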

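Chapter 10 A powerful tool for data analysis: using the pandas library

10.1 Introduction to pandas

10.2 Working with Series objects

10.2.1 Creating Series objects

# Create from a list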
import pandas as pd
s1 = pd.Series([2,4,6,8])
s1

0    2
1    4
2    6
3    8
dtype: int64

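# The values attribute returns the data part of a Series
s1.values

# The index attribute returns the index part
s1.index

# Use the index parameter to give the object string labels
ls = ['a','b','c','d']
s2 = pd.Series([10,20,30,10],index = ls)
s2

10.2.2 Common operations on Series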

s2 ** 2

s2[s2>10]

import numpy as np
np.sqrt(s2)

# Create a Series from a dictionary
dic = {'郭靖':20,'萧峰':19,'杨过':18,'令狐冲':13,'张无忌':20}
s3 = pd.Series(dic)
s3

10.2.3 Indexing and accessing Series

# Slicing by integer position excludes the right endpoint
s3[1:4]

# Slicing by label includes the right endpoint
s3['萧峰':'令狐冲']

# Series addition: values are aligned by index label
dic = {'郭靖':20,'萧峰':19,'杨过':18,'令狐冲':13,'张无忌':20}
s3 = pd.Series(dic)
s3

dic2 = {'郭靖':18,'萧峰':17,'杨过':18,'令狐冲':19,'韦小宝':5}
s4 = pd.Series(dic2)
s4

s3+s4

# pandas provides isnull() and notnull() to detect missing data
pd.isnull(s3+s4)

pd.notnull(s3+s4)
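The missing values come from index alignment: a label that exists in only one of the two Series has no partner to add. A minimal sketch with the same data shows this, together with Series.add() and its fill_value parameter as one possible way to treat the missing side as 0.

```python
import pandas as pd

s3 = pd.Series({'郭靖': 20, '萧峰': 19, '杨过': 18, '令狐冲': 13, '张无忌': 20})
s4 = pd.Series({'郭靖': 18, '萧峰': 17, '杨过': 18, '令狐冲': 19, '韦小宝': 5})

plain = s3 + s4                    # labels present on only one side become NaN
filled = s3.add(s4, fill_value=0)  # treat a missing label as 0 instead

print(plain[plain.isnull()].index.tolist())   # the two unmatched labels
print(filled['张无忌'], filled['韦小宝'])
```

Whether filling with 0 is appropriate depends on what a missing score means in the data.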

10.3 Working with DataFrame objects

10.3.1 DataFrame basics

#1 Creating a DataFrame
# GDP data for cities in mainland China, 2017
dic = {'城市':['北京','上海','广州','深圳','重庆'],'人口':[2171,2418,1090,1404,3372],'GDP':[28000,30133,21500,22286,19530]}
df = pd.DataFrame(dic,columns = ['城市','GDP','人口'])
df

	城市	GDP	人口
0	北京	28000	2171
1	上海	30133	2418
2	广州	21500	1090
3	深圳	22286	1404
4	重庆	19530	3372

#2 The DataFrame index
# Custom index
# GDP data for cities in mainland China, 2017
dic = {'城市':['北京','上海','广州','深圳','重庆'],'人口':[2171,2418,1090,1404,3372],'GDP':[28000,30133,21500,22286,19530]}
df = pd.DataFrame(dic,index = [2,1,4,3,5],columns = ['城市','GDP','人口'])
print(df)

   城市    GDP    人口
2  北京  28000  2171
1  上海  30133  2418
4  广州  21500  1090
3  深圳  22286  1404
5  重庆  19530  3372

# Individual index labels cannot be changed in place (Index objects are immutable); set_index() and reindex() return a DataFrame with a new index
dic = {'城市':['北京','上海','广州','深圳','重庆'],'人口':[2171,2418,1090,1404,3372],'GDP':[28000,30133,21500,22286,19530]}
df = pd.DataFrame(dic,columns = ['城市','GDP','人口'])
df = df.set_index(['城市'])
df = df.reindex(['上海','北京','深圳','广州','重庆'])
print(df)

      GDP    人口
城市             
上海  30133  2418
北京  28000  2171
深圳  22286  1404
广州  21500  1090
重庆  19530  3372

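# Selecting data
#1 Selecting rows
print(df[0:1]) # first row
#print(df[1:3]) # second and third rows (positions 1 and 2)
#print(df['北京':'广州']) # label slice, both endpoints included
#print(df.head()) # first 5 rows
#print(df.tail(1)) # last row

#2 Selecting columns
print(df['GDP'])  # the GDP column

#3 Selecting a region
# Selection by row and column labels
#df.loc['北京']                        # the 北京 row
#df.loc['北京':'广州']                # the three rows 北京, 深圳, 广州
df.loc['北京':'广州','GDP':'人口']   # the same three rows, two columns

# Selection by row and column position
df.iloc[1]
#df.iloc[1:4]
#df.iloc[1:4,0:]

# Select the population of 深圳 by label
df.at['深圳','人口']

# Select the population of 深圳 by position
df.iat[2,1]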
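The difference between label-based and position-based selection is easiest to see side by side. The sketch below rebuilds the city-indexed table from above (same values, constructed directly for brevity) and compares the two styles.

```python
import pandas as pd

df = pd.DataFrame(
    {'GDP': [30133, 28000, 22286, 21500, 19530],
     '人口': [2418, 2171, 1404, 1090, 3372]},
    index=['上海', '北京', '深圳', '广州', '重庆'])

by_label = df.loc['北京':'广州']   # label slice: both endpoints are included
by_position = df.iloc[1:4]         # position slice: the stop position is excluded
assert by_label.equals(by_position)

# at/iat read a single cell, by label and by position respectively.
assert df.at['深圳', '人口'] == df.iat[2, 1] == 1404
print(by_label)
```

Both slices return the rows 北京, 深圳 and 广州, even though the label slice names its last row and the position slice stops one past it.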

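#4 Importing and exporting data files
# header=0 uses the first row as the header row; pass header=None if the file has no header row
import pandas as pd
df = pd.read_csv('stu.csv',encoding="gb18030",header = 0)      # gb18030 handles files that contain Chinese and special characters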
df.to_excel('result1.xlsx',sheet_name = 'sheet1')
df.head()

10.3.2 Data operations on DataFrame objects

#1 Filtering and sorting data
dic = {'城市':['北京','上海','广州','深圳','重庆'],'人口':[2171,2418,1090,1404,3372],'GDP':[28000,30133,21500,22286,19530]}
df = pd.DataFrame(dic,columns = ['城市','GDP','人口'])
df_filter = df[df['人口']>2000]
df_filter

# Cities whose GDP exceeds 20000 (unit: 100 million yuan) and whose population exceeds 2000 (unit: 10,000 people)
df_loc = df.loc[(df['GDP']>20000) & (df['人口']>2000)]
print(df_loc)

# Sort the cities by GDP from high to low
df = df.sort_values(by = ['GDP'],ascending = False) 
# Add a ranking column with the values 1 to 5
df['排名'] = list(range(1,6))
print(df)

#2 Grouping data
dic = {'省份':['广东','广东','江苏','浙江','江苏','浙江'],
       '城市':[ '深圳','广州','苏州','杭州','南京','宁波'],
       'GDP':[22286,21500,17319,12556,11715,9846],
       '人口':[1090,1404,1065,919,827,788]}
df = pd.DataFrame(dic)
print(df)

# Create a groupby object
group = df.groupby('省份')

# Mean GDP and population per province (select the numeric columns; city names cannot be averaged)
avg = group[['GDP','人口']].mean()
print(avg)

   省份  城市    GDP    人口
0  广东  深圳  22286  1090
1  广东  广州  21500  1404
2  江苏  苏州  17319  1065
3  浙江  杭州  12556   919
4  江苏  南京  11715   827
5  浙江  宁波   9846   788
        GDP      人口
省份                 
广东  21893.0  1247.0
江苏  14517.0   946.0
浙江  11201.0   853.5

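# Total GDP and population per province
total = group[['GDP','人口']].sum()
print(total)

# Row of the city with the highest GDP in each province
top_city = df.loc[group['GDP'].idxmax()]
print(top_city)

# GDP per capita of each province, rounded to 2 decimal places
avgPro = (total['GDP']/total['人口']).round(2)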
print(avgPro)
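Several statistics can also be computed in one pass with named aggregation, where each output column is declared as (source column, function). The output column names below (gdp_total and so on) are illustrative choices, not names from the original example.

```python
import pandas as pd

df = pd.DataFrame({
    '省份': ['广东', '广东', '江苏', '浙江', '江苏', '浙江'],
    '城市': ['深圳', '广州', '苏州', '杭州', '南京', '宁波'],
    'GDP': [22286, 21500, 17319, 12556, 11715, 9846],
    '人口': [1090, 1404, 1065, 919, 827, 788]})

# One row per province; each keyword defines one output column.
summary = df.groupby('省份').agg(
    gdp_total=('GDP', 'sum'),
    gdp_mean=('GDP', 'mean'),
    population_total=('人口', 'sum'))

# Derived column computed from the aggregated totals.
summary['gdp_per_capita'] = (summary['gdp_total'] / summary['population_total']).round(2)
print(summary)
```

Because only numeric columns are named, the text column 城市 never enters the calculation.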

10.3.3 Plotting DataFrame objects

Example 10-1 Draw a bar chart of the 2017 GDP and population data of six cities, including 深圳, 广州 and 苏州.

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib

dic = {'省份':['广东','广东','江苏','浙江','江苏','浙江'],
       '城市':[ '深圳','广州','苏州','杭州','南京','宁波'],
       'GDP(亿元)':[22286,21500,17319,12556,11715,9846],
       '人口(万)':[1090,1404,1065,919,827,788]}
df = pd.DataFrame(dic)
df = df.set_index('城市')   # use the city names as the row index
print(df)
matplotlib.rcParams['font.family'] = 'SimHei'
df.plot(kind = 'bar',title = '2017年城市GDP及人口数据')
plt.show()

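10.4 Application example for the data analysis libraries

Example 10-2 Merge two tables using their ID column (编号) as the key, derive several new columns, and save the final result to an Excel file.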

import numpy as np
import pandas as pd

df1 = pd.DataFrame({
        "编号": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009],
        "日期": pd.date_range('20181001', periods=9),
        "品牌": ['HW', 'Apple', 'samsung', 'HuaWei', 'xiaomi', 'OPPO', 'APPLE', 'NOKIA', 'vivo'],
        "型号": ['P20 Pro', 'iPhone XR', 'Note 9', 'Mate 20', 'MI 8',
                 'Find X', 'iPhone XS', 'NOKIA 8 Sirocco', 'NEX'],
        "配置": ['6G-128G', '4G-128G', '6G-128G', '6G-128G', '8G-128G', '8G-256G', '4G-256G', '6G-128G', '8G-128G'],
        "价格": [4988, 6999, 6999, np.nan, 3599, 5999, 10165, 4399, 4298]
        },
        columns = ['编号', '日期', '品牌', '型号', '配置', '价格'])

df2 = pd.DataFrame({
     "编号": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1010],
     "国家": ['China', 'USA', 'Korea', 'China', 'China', 'China', 'USA', 'Finland', 'Japan'],
     "系统": ['Android', 'IOS', 'Android', 'Android', 'Android', 'Android', 'IOS', 'Android', 'Android'],
     "屏幕尺寸": [6.1, 6.1, 6.4, 6.5, 6.2, 6.4, 5.8, 5.5, 6]
     })

# Data cleaning
ave_price = df1['价格'].mean()
df1['价格'] = df1['价格'].fillna(ave_price)   # fill the missing price with the mean price
df1['价格'] = df1['价格'].astype('int')   # convert the price column to integers
df1['品牌'] = df1['品牌'].replace('HW', 'HUAWEI')
df1['品牌'] = df1['品牌'].str.upper()     # convert all brand names to upper case

# Merge the two tables with an inner join on the shared key column
df_inner = pd.merge(df1, df2, on='编号', how='inner')

df_inner = df_inner.set_index('编号')    # set the index
df_inner = df_inner.sort_index()        # sort by index

# Split the configuration column into two columns, stored in a new DataFrame: df_split
df_split = pd.DataFrame([x.split('-') for x in df_inner['配置']],
                        index=df_inner.index,
                        columns=['运行内存','存储容量'])

# Use merge to join df_split into df_inner on the index
df_inner = pd.merge(df_inner, df_split, right_index=True, left_index=True)

# New column: price tier
df_inner['价格档次'] = np.where(df_inner['价格'] > 6000, '高档', '中档')

# New column: domestic large screen (rows that do not match stay NaN)
df_inner.loc[(df_inner['国家'] == 'China') & (df_inner['屏幕尺寸'] > 6.2), '国产大屏'] = 'YES'

# New column: overall performance score from screen size, RAM and storage
df_inner['综合性能'] = df_inner['屏幕尺寸'].astype('float32') * 100 + \
                      df_inner['运行内存'].str[0:-1].astype('int') * 25 + \
                      df_inner['存储容量'].str[0:-1].astype('int') 

# New column: value for money
df_inner['性价比'] = np.where(df_inner['综合性能']/df_inner['价格']>=0.18, '高','一般')

df_inner.to_excel('手机统计数据.xlsx', sheet_name='mobile_sheet') 
df_inner.to_csv('手机统计数据.csv')
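An inner join silently drops rows whose key appears in only one table (IDs 1009 and 1010 in the example). The sketch below uses two small made-up tables to contrast how='inner' with how='outer', and uses indicator=True to show where each row came from.

```python
import pandas as pd

# Small made-up tables that share the key column 编号.
left = pd.DataFrame({'编号': [1001, 1002, 1009], '品牌': ['HUAWEI', 'APPLE', 'VIVO']})
right = pd.DataFrame({'编号': [1001, 1002, 1010], '国家': ['China', 'USA', 'Japan']})

inner = pd.merge(left, right, on='编号', how='inner')                  # keys found in both tables
outer = pd.merge(left, right, on='编号', how='outer', indicator=True)  # all keys, with their origin

print(inner['编号'].tolist())
print(outer[['编号', '_merge']])
```

The _merge column takes the values both, left_only and right_only, which makes it easy to audit what an inner join would discard.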

Chapter 11 Applying web crawler techniques

11.1 Computer network basics

11.2 Using the requests library

11.2.1 Requesting a web page

#1 get(URL) requests data from the specified web page.

import requests
r = requests.get('http://www.baidu.com')
print(r.text)

<!DOCTYPE html>
<!--STATUS OK--><html> <head>...<title>ç™¾åº¦ä¸€ä¸‹ï¼Œä½ å°±çŸ¥é“</title></head> ...
(Output truncated. The Chinese text is garbled because the response body was decoded with the wrong character encoding; section 11.2.3 shows how to set the encoding.)

#2 post(URL,data={'key':'value'}) submits data to the specified web page for processing.
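To see what a POST request carries without contacting any server, the request can be built and prepared but not sent. The endpoint URL and the form field below are hypothetical placeholders.

```python
import requests

# Hypothetical endpoint and form field, used only to show how the request is built.
req = requests.Request('POST', 'http://example.com/submit', data={'key': 'value'})
prepared = req.prepare()          # builds the request; nothing is sent over the network

print(prepared.method)                    # POST
print(prepared.headers['Content-Type'])   # application/x-www-form-urlencoded
print(prepared.body)                      # key=value

# Sending the same data would look like:
# r = requests.post('http://example.com/submit', data={'key': 'value'})
```

The dictionary passed as data is form-encoded into the request body, which is the main difference from get(), where parameters travel in the URL.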

11.2.2 Requests with header parameters

import requests
# Put a User-Agent entry in the headers parameter to imitate a browser.
myHeaders = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36(KHTML,like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
url = 'http://www.zhihu.com'
r = requests.get(url,headers = myHeaders)
r.encoding = 'utf-8'
print(r.status_code)
print(r.text)

11.2.3 The Response object

#1 Using the encoding attribute
import requests
r = requests.get('http://www.baidu.com')
#r.encoding = 'utf-8'           # option 1: set the encoding that r.text uses for decoding
#print(r.text)
print(r.content.decode('utf-8')) # option 2: decode the raw bytes as utf-8 yourself
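The garbled page title seen earlier is what UTF-8 bytes look like when they are decoded as ISO-8859-1, which requests is likely to assume when the server does not declare a charset. The effect can be reproduced offline with plain strings; this is a sketch of the mechanism, not of the requests internals.

```python
raw = '百度一下，你就知道'.encode('utf-8')   # bytes as a UTF-8 server would send them

garbled = raw.decode('iso-8859-1')   # wrong codec: every byte becomes one Latin-1 character
correct = raw.decode('utf-8')        # right codec: the original text comes back

print(len(raw), len(garbled), len(correct))

# Garbled text can be repaired by reversing the wrong decoding step.
repaired = garbled.encode('iso-8859-1').decode('utf-8')
assert repaired == correct
```

Setting r.encoding = 'utf-8' before reading r.text avoids the wrong decoding step in the first place.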

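#2 Using the content attribute

Example 11-4 Download an image from a web page and save it using the content attribute.

import requests
r = requests.get('http://www.baidu.com/img/bd_logo1.png')
r.raise_for_status()           # stop here if the download failed
path = 'D:\\baidu.png'
with open(path,'wb') as file:
    file.write(r.content)      # content holds the raw bytes of the response

#3 raise_for_status()

Example 11-5 Collect the short reviews on Douban of the edition of 《天龙八部》 published by 三联书店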

import time
import requests

url = 'https://book.douban.com/subject/1255625/comments/hot?p='
for i in range(1, 11):
    try:
        r = requests.get(url + str(i))
        r.raise_for_status()
        r.encoding = 'utf-8'

        path = 'D:\\评论第{}页.html'.format(i)
        with open(path, 'w', encoding='utf-8') as file:
            file.write(r.text)
        time.sleep(3)   # after fetching one page of reviews, sleep 3 seconds before the next

    except Exception as ex:
        print("第{}页采集出错，出错原因:{}。".format(i, ex))

第1页采集出错，出错原因:418 Client Error:  for url: https://book.douban.com/subject/1255625/comments/hot?p=1。
第2页采集出错，出错原因:418 Client Error:  for url: https://book.douban.com/subject/1255625/comments/hot?p=2。
第3页采集出错，出错原因:418 Client Error:  for url: https://book.douban.com/subject/1255625/comments/hot?p=3。
第4页采集出错，出错原因:418 Client Error:  for url: https://book.douban.com/subject/1255625/comments/hot?p=4。
第5页采集出错，出错原因:418 Client Error:  for url: https://book.douban.com/subject/1255625/comments/hot?p=5。
第6页采集出错，出错原因:418 Client Error:  for url: https://book.douban.com/subject/1255625/comments/hot?p=6。
第7页采集出错，出错原因:418 Client Error:  for url: https://book.douban.com/subject/1255625/comments/hot?p=7。
第8页采集出错，出错原因:418 Client Error:  for url: https://book.douban.com/subject/1255625/comments/hot?p=8。
第9页采集出错，出错原因:418 Client Error:  for url: https://book.douban.com/subject/1255625/comments/hot?p=9。
第10页采集出错，出错原因:418 Client Error:  for url: https://book.douban.com/subject/1255625/comments/hot?p=10。

(Every request above was rejected with status code 418, and raise_for_status() turned each rejection into an exception. A likely cause is that the requests carry no browser-like User-Agent header; section 11.2.2 shows how to send one.)

11.3 Using the BeautifulSoup library

11.3.1 Choosing a parser

Example 11-6 Choose a parser to parse an HTML document.

import requests 
import bs4

r = requests.get('http://www.baidu.com')
r.encoding = 'utf-8'

# Create a BeautifulSoup object and assign it to soup
soup = bs4.BeautifulSoup(r.text, 'html.parser')

11.3.2 The four kinds of BeautifulSoup objects

#1 The Tag object

Example 11-7 Print a tag object of an HTML document and its type.

import requests 
import bs4

code = '''<html>
<body bgcolor="#eeeeee">
	<style>
		.css1 { background-color:yellow; color:green; font-style:italic;}
	</style>
	<h1 align="center">这里是标题行</h1>
	<p name="p1" class="css1">这是第一段</p>
	<p name="p2" class="css1">这是第二段</p>

	<img src="http://www.baidu.com/img/bd_logo1.png" style="width:200px;height:100px"></img>
	<a id='link' href="http://baidu.com">点我跳去百度</a>
</body>
</html>'''

# Create a BeautifulSoup object from code
soup = bs4.BeautifulSoup(code, 'html.parser')

print(soup.p)            # the tag object: <p class="css1" name="p1">这是第一段</p>
print(type(soup.p))      # the object type: <class 'bs4.element.Tag'>
print(soup.p.name)       # the tag name: p

print(soup.p.attrs)      # the attribute dictionary: {'name': 'p1', 'class': ['css1']}
print(soup.p['name'])    # the name attribute: p1
print(soup.p['class'])   # the class attribute: ['css1']

<p class="css1" name="p1">这是第一段</p>
<class 'bs4.element.Tag'>
p
{'name': 'p1', 'class': ['css1']}
p1
['css1']

#2 The BeautifulSoup object
# Represents the document as a whole; think of it as the root (top-level node) of the HTML document tree.

#3 The NavigableString object
soup = bs4.BeautifulSoup(code,'html.parser')
print(soup.h1.string)
print(type(soup.p.string))

这里是标题行
<class 'bs4.element.NavigableString'>

#4 The Comment object
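A Comment is the object Beautiful Soup creates for an HTML comment. The snippet below is a minimal sketch with a made-up one-line document; the variable names are illustrative.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical snippet with one HTML comment inside a paragraph.
soup_c = BeautifulSoup('<p><!-- reviewed in 2018 --></p>', 'html.parser')
node = soup_c.p.string

print(type(node))                  # <class 'bs4.element.Comment'>
print(node)                        # the comment text without the <!-- --> markers
print(isinstance(node, Comment))   # True
```

Comment is a subclass of NavigableString, so code that collects text from a page may want to check for it explicitly to avoid mixing comments into the visible text.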

11.3.3 Traversing the document tree

#1 Parsing from the top down

Example 11-8 Use the contents attribute to list all child nodes of the table tag.

from bs4 import BeautifulSoup, element

code = '''<html><head>
<title>网页标题</title></head>
<body><h2>金庸群侠传</h2>
<table width="400px" border="1">
        <tr><th>书名</th> <th>人物</th> <th>年份</th></tr>
        <tr><td>《射雕英雄传》</td> <td>郭靖</td> <td>1959年</td></tr>
        <tr><td>《倚天屠龙记》</td> <td>张无忌</td> <td>1961年</td></tr>
        <tr><td>《笑傲江湖》</td> <td>令狐冲</td> <td>1967年</td></tr>
        <tr><td>《鹿鼎记》</td> <td>韦小宝</td> <td>1972年</td></tr>
</table></body></html>'''

soup = BeautifulSoup(code, 'html.parser')
print(soup.table.contents)      # all child nodes of the <table> tag

['\n', <tr><th>书名</th> <th>人物</th> <th>年份</th></tr>, '\n', <tr><td>《射雕英雄传》</td> <td>郭靖</td> <td>1959年</td></tr>, '\n', <tr><td>《倚天屠龙记》</td> <td>张无忌</td> <td>1961年</td></tr>, '\n', <tr><td>《笑傲江湖》</td> <td>令狐冲</td> <td>1967年</td></tr>, '\n', <tr><td>《鹿鼎记》</td> <td>韦小宝</td> <td>1972年</td></tr>, '\n']

Example 11-9 Loop over the children attribute to print the child tags of the table tag.

from bs4 import BeautifulSoup, element

soup = BeautifulSoup(code, 'html.parser')

for child in soup.table.children:
    if isinstance(child, element.Tag):    # skip the newline strings between tags
        print(child)

<tr><th>书名</th> <th>人物</th> <th>年份</th></tr>
<tr><td>《射雕英雄传》</td> <td>郭靖</td> <td>1959年</td></tr>
<tr><td>《倚天屠龙记》</td> <td>张无忌</td> <td>1961年</td></tr>
<tr><td>《笑傲江湖》</td> <td>令狐冲</td> <td>1967年</td></tr>
<tr><td>《鹿鼎记》</td> <td>韦小宝</td> <td>1972年</td></tr>

Example 11-10 Loop over the descendants attribute to print the descendant tags of the table tag.

for des in soup.table.descendants:
    if isinstance(des, element.Tag):      # skip text nodes and keep only tags
        print(des)

<tr><th>书名</th> <th>人物</th> <th>年份</th></tr>
<th>书名</th>
<th>人物</th>
<th>年份</th>
<tr><td>《射雕英雄传》</td> <td>郭靖</td> <td>1959年</td></tr>
<td>《射雕英雄传》</td>
<td>郭靖</td>
<td>1959年</td>
<tr><td>《倚天屠龙记》</td> <td>张无忌</td> <td>1961年</td></tr>
<td>《倚天屠龙记》</td>
<td>张无忌</td>
<td>1961年</td>
<tr><td>《笑傲江湖》</td> <td>令狐冲</td> <td>1967年</td></tr>
<td>《笑傲江湖》</td>
<td>令狐冲</td>
<td>1967年</td>
<tr><td>《鹿鼎记》</td> <td>韦小宝</td> <td>1972年</td></tr>
<td>《鹿鼎记》</td>
<td>韦小宝</td>
<td>1972年</td>

#2 Parsing horizontally (siblings)

Example 11-11 Parse the table row by row.

from bs4 import BeautifulSoup, element

soup = BeautifulSoup(code, 'html.parser')

for child in soup.table.tr.next_siblings:     # siblings that follow the first row
    if isinstance(child, element.Tag):        # skip the newline strings between tags
        print(child)

<tr><td>《射雕英雄传》</td> <td>郭靖</td> <td>1959年</td></tr>
<tr><td>《倚天屠龙记》</td> <td>张无忌</td> <td>1961年</td></tr>
<tr><td>《笑傲江湖》</td> <td>令狐冲</td> <td>1967年</td></tr>
<tr><td>《鹿鼎记》</td> <td>韦小宝</td> <td>1972年</td></tr>

#3 Parsing from the bottom up
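Bottom-up traversal starts at a tag and walks towards the root through the parent and parents attributes. A minimal sketch with a shortened, made-up version of the table document:

```python
from bs4 import BeautifulSoup

# Shortened, made-up document with a single table cell.
html = '<html><body><table><tr><td>郭靖</td></tr></table></body></html>'
soup_p = BeautifulSoup(html, 'html.parser')

cell = soup_p.td
print(cell.parent.name)                      # tr
print([tag.name for tag in cell.parents])    # ['tr', 'table', 'body', 'html', '[document]']
```

The last entry, '[document]', is the BeautifulSoup object itself, which sits at the top of the tree.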

11.3.4 Searching the document tree

#1 name parameter
import requests 
import bs4

code = '''<html>
<body bgcolor="#eeeeee">
	<style>
		.css1 { background-color:yellow; color:green; font-style:italic;}
	</style>
	<h1 align="center">这里是标题行</h1>
	<p name="p1" class="css1">这是第一段</p>
	<p name="p2" class="css1">这是第二段</p>

	<img src="http://www.baidu.com/img/bd_logo1.png" style="width:200px;height:100px"></img>
	<a id='link' href="http://baidu.com">点我跳去百度</a>
</body>
</html>'''

soup = bs4.BeautifulSoup(code, 'html.parser')
print(soup.find_all('h1'))
print()
print(soup.find_all('p'))

[<h1 align="center">这里是标题行</h1>]

[<p class="css1" name="p1">这是第一段</p>, <p class="css1" name="p2">这是第二段</p>]

#2 **kwargs parameters
soup.find_all(id = 'link')

[<a href="http://baidu.com" id="link">点我跳去百度</a>]

soup.find_all('h1',align = 'center')

[<h1 align="center">这里是标题行</h1>]

from bs4 import BeautifulSoup
soup2 = BeautifulSoup('<p class = "css1 css2"></p>','html.parser')
soup2.find_all(class_ = 'css1')        # matches: class is a multi-valued attribute
soup2.find_all(class_ = 'css2')        # matches as well
soup2.find_all(class_ = 'css1 css2')   # the exact full class string also matches

[<p class="css1 css2"></p>]

#3 attrs parameter
soup.find_all(attrs = {'class':'css1'})

[<p class="css1" name="p1">这是第一段</p>, <p class="css1" name="p2">这是第二段</p>]

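#4 recursive parameter: by default find_all() searches all descendants of the current object; recursive=False restricts the search to direct children.

#5 text parameter (called string since Beautiful Soup 4.4; text is the older name): exact match
soup.find_all(text = '这是第一段')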

['这是第一段']

soup.find_all(text = '是')

[]

# Fuzzy matching with a regular expression; requires the re module
import re
soup.find_all(text = re.compile('是'))

['这里是标题行', '这是第一段', '这是第二段']

#6 limit parameter: the maximum number of results to return
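The recursive and limit parameters have no example above, so here is a minimal sketch with a small made-up document in which one paragraph is nested one level deeper than the others.

```python
from bs4 import BeautifulSoup

# Made-up document: the second <p> is nested inside a <section>.
html = '<div><p>one</p><section><p>two</p></section><p>three</p></div>'
soup_s = BeautifulSoup(html, 'html.parser')

all_p = soup_s.div.find_all('p')                      # every descendant <p>
direct_p = soup_s.div.find_all('p', recursive=False)  # direct children only
first_two = soup_s.div.find_all('p', limit=2)         # stop after two matches

print([str(p.string) for p in all_p])      # ['one', 'two', 'three']
print([str(p.string) for p in direct_p])   # ['one', 'three']
print([str(p.string) for p in first_two])  # ['one', 'two']
```

limit is useful on large pages because the search stops as soon as enough matches have been found.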

11.4 Web crawler application example

import time, requests, jieba
from bs4 import BeautifulSoup
from wordcloud import WordCloud 

# Function 1: fetch the html document at the given url
def getHtmlDoc(url, page):
    try:
        url = url + page
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = 'utf-8'
        return r.text
    except Exception as ex:
        print("第{}页采集出错，出错原因: {}。".format(page, ex))
        return ""

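# Function 2: extract the comment texts from an html document and return them as a list
def getComment(html):
    comment = []    # stores all comments found on the current page
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find('div', id='comments')
    if div is None:     # empty page (for example a failed request) or unexpected layout
        return comment

    # Find every <li> item inside the <div>, then its descendant <p> and <span> tags
    for li in div.find_all('li', {'class':'comment-item'}):
        p = li.find('p', {'class':'comment-content'})
        if p is None or p.span is None or p.span.string is None:
            continue    # skip items without plain comment text
        comment.append(str(p.span.string))
    return comment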

# Function 3: segment the given comment file with jieba and generate a word cloud image
def createWordCloud(fileName):
    with open(fileName, 'r', encoding='utf-8') as file:
        text = file.read()
    ls_word = jieba.lcut(text)      # segment all comments with jieba
    all_words = ','.join(ls_word)   # join the words into one comma-separated string
    wcloud = WordCloud(font_path = r'C:\Windows\fonts\simhei.ttf',  # a font with Chinese glyphs (Windows path)
                       width = 1000, height = 800,
                       background_color = 'white',
                       max_words = 200,
                       margin = 2) 
    wcloud.generate(all_words)
    # Save the word cloud image under the same base name as the text file
    fileCloud = fileName.split('.')[0] + '.png'
    wcloud.to_file(fileCloud)


# Main program
url = 'https://book.douban.com/subject/1255625/comments/hot?p='
all_comment = []    # all comments from all pages

for p in range(1,201):    
    html = getHtmlDoc(url, str(p))          # fetch the first 200 pages one by one
    page_comment = getComment(html)         # extract the comments from the page
    all_comment.extend(page_comment)        # add this page's comments to the overall list
    time.sleep(2)                           # pause 2 seconds after each page
    print('第{}页处理完成。'.format(p))

print('网页采集结束，开始写入文件、生成词云。')

# Write all comments to a file, one per line
fileName = 'd:\\天龙八部评论.txt'
with open(fileName, 'w', encoding='utf-8') as file:    
    file.write('\n'.join(c for c in all_comment if c))

# Generate the word cloud from the comment file
createWordCloud(fileName)
print('词云生成结束。')

Building prefix dict from the default dictionary ...


第200页处理完成。
网页采集结束，开始写入文件、生成词云。


Dumping model to file cache C:\Users\Lenovo\AppData\Local\Temp\jieba.cache
Loading model cost 1.471 seconds.
Prefix dict has been built succesfully.


词云生成结束。
