Web Scraping, Lesson 4 --- Parsing Web Pages


Using BeautifulSoup4 (docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html)

1. Install: pip install beautifulsoup4

'''
Using bs4
'''
import re
from bs4 import BeautifulSoup

# Sample HTML for testing
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
# Instantiate the parser
soup = BeautifulSoup(html_doc, 'lxml')  # two arguments: the document and the parser (lxml must be installed)

# print(soup)  # the parsed document (bs4 auto-completes the HTML)
# Tag access
print(soup.a)          # get the first <a> tag
print(soup.a.attrs)    # get the tag's attributes
print(soup.a.string)   # get the tag's text
print(soup.find_all('a'))  # search the document; returns every matching element.
                           # The filter can take several forms: a string (tag name), a regular expression, keyword arguments, ...
print(soup.find_all(re.compile(r'^b')))     # regex match on tag names (compile returns a regex object)
print(soup.find_all(id='link1'))            # match by attribute value
print(soup.find_all('a', class_='sister'))  # class_ avoids clashing with the Python keyword
# CSS selectors
print(soup.select('a.sister'))  # 'a' is a tag, '.name' a class, '#name' an id; attribute forms like a[class="sister"] also work
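As a supplement, here is a short sketch that continues from the `soup` object built above: it combines `find_all` with attribute access to collect (text, href) pairs, and shows a few more CSS selector forms that `select()` accepts.

# Continuing from the soup object above: collect (text, href) pairs for every sister link
links = [(a.get_text(), a['href']) for a in soup.find_all('a', class_='sister')]
print(links)  # [('Elsie', 'http://example.com/elsie'), ...]

# Equivalent lookups with CSS selectors
print(soup.select('p.story > a'))                      # <a> children of <p class="story">
print(soup.select('#link2'))                           # select by id
print(soup.select('a[href^="http://example.com/"]'))   # attribute prefix match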

Using XPath (tutorial: http://zvon.org/xxl/XPathTutorial/General_chi/examples.html)

1. Install: pip install lxml

'''
XPath: a path language for selecting nodes in a document.
Common syntax:
/    select from the root node
//   select nodes anywhere in the document
.    the current node
@    select an attribute
'''
from lxml import etree
import requests
url = 'https://www.baidu.com'
resp = requests.get(url).content.decode()
# print(resp)
# Build the element tree
selector = etree.HTML(resp)
print(selector.xpath('//div[@id="u1"]/a[@class="mnav"]'))          # matching element nodes
print(selector.xpath('//div[@id="u1"]/a[@class="mnav"]/text()'))   # text of those nodes
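The live request above depends on Baidu's current markup. As a self-contained illustration of the `@` and `.` syntax listed earlier, here is a minimal sketch that assumes the `html_doc` string from the bs4 example is still in scope:

# Parse the sample HTML from the bs4 section with lxml
doc = etree.HTML(html_doc)
print(doc.xpath('//a[@class="sister"]/@href'))   # @href selects the attribute values
print(doc.xpath('//a[@id="link1"]/text()'))      # text of the link with id="link1"
first_p = doc.xpath('//p[@class="story"]')[0]
print(first_p.xpath('./a/@id'))                  # '.' = relative to the current node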
