Day 8 of Learning the Great Python

1. Basic usage of BeautifulSoup

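# Parsing tools: re, selenium
# XML parsers
# BeautifulSoup is a parsing library and must be paired with a parser
# The main parsers today: the Python standard library's html.parser, and the lxml HTML parser (preferred)
# BeautifulSoup gives us methods for searching the document tree, and its filters accept re patterns
# 1. What bs4 is and why to use it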

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p>$37</p>

Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" >Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.

...
"""
from bs4 import BeautifulSoup  # import BeautifulSoup from bs4
# Call BeautifulSoup to instantiate a soup object
# Argument 1: the text to parse
# Argument 2: the parser (html.parser or lxml)
soup = BeautifulSoup(html_doc, 'lxml')
print(soup)
print(type(soup))
# Pretty-print the document
html = soup.prettify()
print(html)
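Beyond printing the whole document, the soup object can hand back individual pieces. The following is a minimal sketch on a shortened copy of the document; it assumes the built-in html.parser so that it runs without lxml installed, and the names sample, title, and hrefs are illustrative only.

```python
from bs4 import BeautifulSoup

# A shortened copy of the tutorial document
sample = """
<html><head><title>The Dormouse's story</title></head>
<body>
<a href="http://example.com/elsie" class="sister">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</body></html>
"""

# html.parser ships with Python; pass 'lxml' instead if it is installed
soup = BeautifulSoup(sample, 'html.parser')

title = soup.title.string                         # text inside <title>
hrefs = [a['href'] for a in soup.find_all('a')]   # href of every <a> tag

print(title)
print(hrefs)
```

Switching the second argument to 'lxml' should give the same values here, since the sample is well-formed.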

2. Searching the document tree with bs4

html_doc = """<html><head><title>The Dormouse's story</title></head><body><p>$37</p>Once upon a time there were three little sisters; and their names weretank<a href="http://example.com/elsie" class="sister" >Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.<hr></hr>..."""
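'''
Searching the document tree:
 find() returns the first match
 find_all() returns every match

Searching by tag and by attribute:
 Tag arguments:
 name matches the tag name
 attrs matches attributes
 text matches the text content (bs4 4.4 and later name this argument string)

 - String filter
 matches the whole string exactly

 - Regex filter
 matches with a pattern from the re module

 - List filter
 matches any item in the list

 - Bool filter
 True matches any value that is present

 - Function filter
 for custom conditions, such as requiring some attributes and excluding others

 Attribute arguments:
 - class_
 - id
'''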

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

# String filter
# name
p_tag = soup.find(name='p')
print(p_tag)  # the first tag whose name is p
# Find every node whose tag name is p
tag_s1 = soup.find_all(name='p')
print(tag_s1)

# attrs
# Find the first node whose class is sister
p = soup.find(attrs={"class": "sister"})
print(p)
# Find every node whose class is sister
tag_s2 = soup.find_all(attrs={"class": "sister"})
print(tag_s2)

# string (spelled text in bs4 releases before 4.4)
text = soup.find(string="$37")
print(text)

# Combined:
# Find the a tag whose id is link2 and whose text is Lacie
a_tag = soup.find(name="a", attrs={"id": "link2"}, string="Lacie")
print(a_tag)
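The class_ and id arguments listed in the notes above are shortcuts for the attrs dictionary. A minimal sketch, assuming the built-in html.parser and a sample that contains only the three links; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

sample = """
<a href="http://example.com/elsie" class="sister">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup = BeautifulSoup(sample, 'html.parser')

# class is a reserved word in Python, so bs4 spells the argument class_
sisters = soup.find_all(class_='sister')

# id is not reserved, so it needs no trailing underscore
lacie = soup.find(id='link2')

print(len(sisters))
print(lacie.get_text())
```

Both calls are equivalent to passing attrs={"class": "sister"} and attrs={"id": "link2"}; the keyword form is just shorter.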

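# # Regex filter
# import re
# # name: re.compile('p') matches any tag name that contains "p"
# p_tag = soup.find(name=re.compile('p'))
# print(p_tag)

# List filter
# import re
# # name: a tag matches if any item in the list matches it
# tags = soup.find_all(name=['p', 'a', re.compile('html')])
# print(tags)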
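The regex and list filters are easier to judge when the matched tag names are printed. The sketch below is a runnable variant of the commented-out snippets; it assumes html.parser and a small made-up sample that includes a pre tag to show that an unanchored pattern matches more than p.

```python
import re
from bs4 import BeautifulSoup

sample = ("<html><head><title>T</title></head>"
          "<body><p>$37</p><a id='link2'>Lacie</a><pre>x</pre></body></html>")
soup = BeautifulSoup(sample, 'html.parser')

# Regex filter: bs4 applies the pattern with search(), so 'p' matches
# every tag name that contains the letter p
contains_p = [t.name for t in soup.find_all(name=re.compile('p'))]

# Anchoring the pattern restricts the match to the p tag itself
exactly_p = [t.name for t in soup.find_all(name=re.compile('^p$'))]

# List filter: a tag matches if its name equals any item in the list
title_or_a = [t.name for t in soup.find_all(name=['title', 'a'])]

print(contains_p)
print(exactly_p)
print(title_or_a)
```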

# Bool filter
# True matches any value
# Find a p tag that has an id attribute
# p = soup.find(name='p', attrs={"id": True})
# print(p)

# Function filter
# Match a tags that have an id attribute but no class attribute
# def has_id_no_class(tag):
#     return tag.name == 'a' and tag.has_attr('id') and not tag.has_attr('class')
#
# tag = soup.find(name=has_id_no_class)
# print(tag)
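A runnable variant of the bool and function filters is sketched below. It assumes html.parser and a sample in which the second link has had its class removed, so that the function filter has something to find; in the tutorial's own document every link has a class, and the same filter would return None.

```python
from bs4 import BeautifulSoup

# Modified sample: the second link has an id but no class
sample = """
<a href="http://example.com/elsie" class="sister">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup = BeautifulSoup(sample, 'html.parser')

# Bool filter: True matches any value, so this keeps tags that have an id at all
with_id = [a.get_text() for a in soup.find_all('a', attrs={'id': True})]


def has_id_no_class(tag):
    """Function filter: return True for <a> tags with an id and without a class."""
    return tag.name == 'a' and tag.has_attr('id') and not tag.has_attr('class')


only_id = soup.find(has_id_no_class)

print(with_id)
print(only_id.get_text())
```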

3. Traversing the document tree with bs4

html_doc = """<html><head><title>The Dormouse's story</title></head><body><p>$37</p>Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" >Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well...."""
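from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
'''
Traversing the document tree:
1. Direct access
'''
# 1. Direct access
print(soup.p)  # the first <p> tag
print(soup.a)  # the first <a> tag
# 2. Get the tag name
print(soup.head.name)
# 3. Get the tag attributes
print(soup.a.attrs)  # as a dict
print(soup.a.attrs['href'])  # the href attribute of the a tag
# 4. Get the tag content
print(soup.p.text)  # $37
# 5. Nested selection
print(soup.html.head)
# 6. Children and descendants
# (for tags that are properly closed)
print(soup.body.children)  # all direct children of body; returns an iterator, which saves memory
print(list(soup.body.children))  # convert to a list
print(soup.body.descendants)  # all descendants, returned as a generator
print(list(soup.body.descendants))
# 7. Parent and ancestors
print(soup.p.parent)  # the parent node of the p tag
print(soup.p.parents)  # all ancestors of the p tag, returned as a generator
# 8. Siblings
# The next sibling
print(soup.p.next_sibling)
# All following siblings
print(soup.p.next_siblings)  # returns a generator, which saves memory
print(list(soup.p.next_siblings))
# Previous siblings; commas and other text nodes count as siblings too
print(soup.a.previous_sibling)  # the sibling just before the a tag
# All siblings before the a tag
print(soup.a.previous_siblings)
print(list(soup.a.previous_siblings))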
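Because text nodes count as children and siblings, traversal results often mix tags with bare strings. The sketch below shows one way to separate them; it assumes html.parser and a shortened sample, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

sample = ('<body><p>$37</p>Once upon a time'
          '<a href="http://example.com/elsie">Elsie</a>,'
          '<a href="http://example.com/lacie">Lacie</a></body>')
soup = BeautifulSoup(sample, 'html.parser')

# .children yields tags and text nodes alike; keep only the tags
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]

# next_sibling of <p> is the text node right after it, not the next tag
after_p = soup.p.next_sibling

# find_next_sibling('a') skips text nodes and returns the next <a> tag
next_link = soup.p.find_next_sibling('a')

# .parents walks upward from the tag toward the root of the tree
ancestors = [node.name for node in soup.a.parents]

print(child_tags)
print(repr(after_p))
print(next_link['href'])
print(ancestors)
```

When only tags matter, find_next_sibling and find_previous_sibling are usually more convenient than next_sibling and previous_sibling.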

4. Basic usage of MongoDB

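Relational databases: powerful query capabilities.
Non-relational databases: flexible schema, scalability, and performance. Data lives in collections, and records do not need a fixed one-to-one mapping to columns.

1. MongoDB
The global variable db shows the current database.
Creating a collection
SQL:
create table f1,f2...
MongoDB:
db.student (the collection is created on the first insert)
Inserting data
MongoDB:
Insert several documents
db.student.insert([{"name1":"tank1"},{"name2":"tank2"}])
Insert one document
db.student.insert({"name1":"tank1"})
Querying data
Find everything
db.student.find({})
Find the records whose name is tank
db.student.find({"name":"tank"})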

from pymongo import MongoClient

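# 1. Connect to the MongoDB server
# Argument 1: the IP address of the MongoDB server
# Argument 2: the MongoDB port, 27017 by default
client = MongoClient('localhost', 27017)
print(client)

# 2. Select the tank_db database; MongoDB creates it on the first write if it does not exist
print(client['tank_db'])

# 3. Select the people collection; it is likewise created on the first write
print(client['tank_db']['people'])

# 4. Insert data into tank_db

# 1. Insert one document
data1 = {
 'name': 'tank',
 'age': 18,
 'sex': 'male'
}
client['tank_db']['people'].insert_one(data1)

# 2. Insert several documents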
data1 = {
 'name': 'tank',
 'age': 18,
 'sex': 'male'
}
data2 = {
 'name': 'tank1',
 'age': 84,
 'sex': 'female'
}
data3 = {
 'name': 'tank2',
 'age': 73,
 'sex': 'male'
}
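client['tank_db']['people'].insert_many([data1, data2, data3])

# 5. Query data
# Fetch all documents
data_s = client['tank_db']['people'].find()
print(data_s) # <pymongo.cursor.Cursor object at 0x000002EEA6720128>
# find() returns a cursor, so loop over it to print every document
for data in data_s:
    print(data)

# Fetch one document
data = client['tank_db']['people'].find_one()
print(data)

# Officially recommended methods (the legacy insert() was removed in PyMongo 4):
# insert_one inserts a single document
# client['tank_db']['people'].insert_one(data1)
# insert_many inserts a list of documents
# client['tank_db']['people'].insert_many([data1, data2, data3])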
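The insert and query steps above can be wrapped in one small function that receives the collection as a parameter. This is only a sketch: save_and_find is a hypothetical helper name, not part of PyMongo, and the commented usage assumes a MongoDB server listening on localhost:27017.

```python
def save_and_find(collection, people, name):
    """Insert the documents in people, then return those whose name field equals name."""
    collection.insert_many(people)                  # send the whole list in one call
    return list(collection.find({'name': name}))    # turn the cursor into a list


# Usage against a real server (assumes MongoDB is listening on localhost:27017):
#   from pymongo import MongoClient
#   people = MongoClient('localhost', 27017)['tank_db']['people']
#   for doc in save_and_find(people, [{'name': 'tank', 'age': 18}], 'tank'):
#       print(doc)
```

Passing the collection in, rather than building the client inside the function, keeps the helper usable with any database or collection name.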
