Web Scraping (BeautifulSoup)

Official documentation:

http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/#

Ways for a scraper to make network requests: urllib (standard-library module), requests (third-party library), scrapy and pyspider (frameworks)

Ways for a scraper to extract data: regular expressions, bs4, lxml, XPath, CSS selectors

from bs4 import BeautifulSoup

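Argument 1: the HTML source, as a string or an open file object. BeautifulSoup parses it into a document tree object.

Argument 2: the parser to use. 'lxml' tells BeautifulSoup to parse the HTML source with the lxml library, which has to be installed separately.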

# html = BeautifulSoup(open('index.html', encoding='utf-8'), 'lxml')

(index.html is a page I wrote myself for testing; its source is in the note at the end.)

# print(html.title)
# print(html.a)
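
As a minimal sketch of the constructor call and the two lookups above, the snippet below parses an inline string instead of index.html. The SAMPLE markup is a hypothetical cut-down version of the test page, and it uses the standard-library 'html.parser' instead of 'lxml' so that nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical inline sample, a cut-down stand-in for index.html.
SAMPLE = """<html><head><title>Demo</title></head>
<body><a href="http://www.baidu.com" id="1" name="2">Baidu</a>
<a href="#">second link</a></body></html>"""

# Argument 1: the HTML source. Argument 2: the parser name.
# 'html.parser' ships with Python; the notes above use 'lxml'.
soup = BeautifulSoup(SAMPLE, "html.parser")

title_text = soup.title.string  # text inside <title>
first_link = soup.a             # attribute access returns only the FIRST <a>

print(title_text)
print(first_link["href"])
```

Accessing a tag by name (soup.a) only ever gives the first match, which is why the second link does not show up here.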

# Get all attributes of a tag (returned as a dict)
# {'href': 'http://www.baidu.com', 'id': '1', 'name': '2'}
# print(html.a.attrs)

# Get a single attribute
# print(html.a.get('id'))
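
A short sketch of attribute access, assuming a hypothetical one-tag document and the standard-library parser. It shows that .attrs is a plain dict, that .get() returns None for a missing attribute, and that multi-valued attributes such as class come back as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document; class="x y" is added only to show list values.
soup = BeautifulSoup(
    '<a href="http://www.baidu.com" id="1" name="2" class="x y">link</a>',
    "html.parser",
)
link = soup.a

all_attrs = link.attrs        # dict of every attribute on the tag
link_id = link.get("id")      # value of one attribute
missing = link.get("title")   # None, because the attribute does not exist
classes = link.get("class")   # class is multi-valued, so this is a list

print(all_attrs)
print(link_id, missing, classes)
```

Using link["title"] instead of link.get("title") would raise KeyError, so .get() is the safer choice when an attribute may be absent.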

# Getting multiple tags requires traversing the document tree
# print(html.head.contents) # a list of all direct children of <head>

# print(html.head.children) # list_iterator object
# for ch in html.head.children:
#     print(ch)

# descendants
# print(html.head.descendants) # generator object
# for de in html.head.descendants:
#     print(de)
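
The difference between .contents, .children and .descendants can be sketched as follows. The markup is a hypothetical fragment (not the <head> of the test page), chosen so that the direct children and the deeper nodes are easy to count.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two <p> children, one of which contains a nested <b>.
soup = BeautifulSoup(
    '<div id="box"><p>one <b>bold</b></p><p>two</p></div>', "html.parser"
)
box = soup.div

direct = box.contents                            # list of direct children
direct_names = [ch.name for ch in box.children]  # same nodes, as an iterator
all_nodes = list(box.descendants)                # every nested tag AND text node
tag_names = [n.name for n in all_nodes if n.name is not None]

print(len(direct), direct_names)
print(len(all_nodes), tag_names)
```

Text nodes count as nodes too, so .descendants yields more items than there are tags.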

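# find_all: returns every matching element
# find: returns the first matching element
# get_text: all text inside the tag, including the text of child tags
# select: looks elements up by CSS selector (class, id, attribute, ...); the result is a list, so index into it or iterate over it
# string: only usable when the tag holds a single piece of text; if it contains several children, it returns None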
# print(html.select('.two')[0].get_text())

# find_all: find a group of elements by tag name
# res = html.find_all('a')
# print(res)
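
To make the find / find_all and string / get_text distinctions concrete, here is a small sketch on a hypothetical fragment modelled on the two <div> elements of the test page, again using the standard-library parser.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the two <div> elements of the test page.
SAMPLE = (
    '<div class="one">123</div>'
    '<div class="two"><span>1111</span><a href="#">link</a></div>'
)
soup = BeautifulSoup(SAMPLE, "html.parser")

first_div = soup.find("div")          # first match only (None if nothing matches)
all_divs = soup.find_all("div")       # list-like result with every match
two = soup.find("div", class_="two")  # class_ filters on the class attribute

one_string = first_div.string  # the tag holds a single text node
two_string = two.string        # None: the tag has two child tags
two_text = two.get_text()      # text of all descendants, concatenated

print(len(all_divs), one_string, two_string, two_text)
```

When a tag may contain other tags, get_text() is the dependable way to read its text.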

# select: supports standard CSS selector syntax
# res = html.select('.one')[0]
# # print(res.get_text())
# # print(res.get('class'))

# res = html.select('.two')[0]
# print(res)
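# print('---', res.next_sibling) # open question: prints what looks like nothing, because next_sibling here is the whitespace text node between the tags, not the next tag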
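
A likely explanation for the next_sibling question, sketched on a hypothetical fragment: the line break between two tags is itself a text node, so next_sibling returns that whitespace. find_next_sibling() skips text nodes and returns the next tag. select_one() is shown as a shorthand for select(...)[0].

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a newline separates the <div> from the following <a>.
SAMPLE = """<body>
<div class="two"><span>1111</span></div>
<a href="#">after</a>
</body>"""
soup = BeautifulSoup(SAMPLE, "html.parser")

two = soup.select(".two")[0]           # select() returns a list of matches
first = soup.select_one(".two")        # first match, or None if nothing matches
raw_sibling = two.next_sibling         # the whitespace text node after </div>
tag_sibling = two.find_next_sibling()  # the next sibling that is a tag

print(repr(raw_sibling))
print(tag_sibling)
```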

import os

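os.makedirs('abc', exist_ok=True) # create abc in the current directory (6-7) if it does not exist yet
os.chdir('abc') # enter abc
# os.mkdir('123') # create directory 123 inside abc

os.chdir(os.path.pardir) # go back to the parent directory

os.makedirs('erf', exist_ok=True) # create erf; unlike os.mkdir, this does not fail if erf already exists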
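
The same directory steps can be tried without touching the real working directory. This is a sketch that assumes a throwaway base directory from tempfile; the names abc and erf follow the notes above, everything else is illustrative.

```python
import os
import tempfile

start = os.getcwd()  # remember where we started

with tempfile.TemporaryDirectory() as base:
    try:
        os.chdir(base)
        os.makedirs("abc", exist_ok=True)  # create abc if it is missing
        os.chdir("abc")                    # enter abc
        inside = os.path.basename(os.getcwd())
        os.chdir(os.path.pardir)           # back to the parent directory
        os.makedirs("erf", exist_ok=True)  # create erf next to abc
        created = sorted(os.listdir("."))
    finally:
        os.chdir(start)  # leave the temp directory before it is deleted

print(inside, created)
```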

Note: the test HTML (index.html)

<!DOCTYPE html>

<html>
<head>

    <meta http-equiv="content-type" content="text/html;charset=utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=Edge">
    <meta content="always" name="referrer">
    <meta name="theme-color" content="#2932e1">
    <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon"/>
    <title>百度一下，你就知道</title>
</head>

<body>
    <div class="one">123</div>
    <div class="two"><span>1111</span>
        <a href="http://www.baidu.com" id="1" name="2">百度一下</a>
    </div>
    <a href="#">我自己写的</a>
</body>
</html>
