A First Look: Web Scraping in Python with requests and BeautifulSoup

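When it comes to web scraping in Python, the lowest-level option is probably urllib, but its API is somewhat cumbersome and pulling data out of the response usually means writing regular expressions. Using requests and BeautifulSoup instead turns out to be a real gift for web developers: anyone with some front-end knowledge will find scraping with these two libraries a pleasant experience.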

requests is essentially a library that wraps HTTP functionality and handles the basic HTTP operations.
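
As a rough illustration of what "basic HTTP operations" means here, the sketch below builds a GET request without sending it, so it runs offline. The `q` query parameter and the User-Agent string are made up for the example and are not part of the original post.

```python
import requests

# Build (but do not send) a GET request. The "q" parameter and the
# User-Agent value are hypothetical, chosen only for illustration.
request = requests.Request(
    "GET",
    "https://www.sina.com.cn/",
    params={"q": "news"},
    headers={"User-Agent": "demo-crawler"},
)
prepared = request.prepare()

print(prepared.method)                 # the HTTP verb
print(prepared.url)                    # URL with the encoded query string appended
print(prepared.headers["User-Agent"])  # header that would be sent
```

Calling requests.get() with the same url, params and headers would prepare and send the request in one step.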

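BeautifulSoup provides a convenient way to parse HTML and XML pages. It parses each tag in the HTML as a tree node, so an HTML page can be treated as a tree structure. In other words, content is extracted by way of the DOM (Document Object Model).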
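
To make the tree idea concrete, here is a minimal sketch that walks a tiny hand-written page. The markup and the id `main` are invented for the example.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, used here instead of a real site.
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <div id="main"><p>First</p><p>Second</p></div>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

title = soup.title               # first <title> tag in the tree
main = soup.find(id="main")      # the <div id="main"> node
paragraphs = main.find_all("p")  # <p> nodes underneath that div

print(title.string)                  # text inside <title>
print(title.parent.name)             # name of the parent node
print([p.text for p in paragraphs])  # text of each <p> under the div
```

Each tag is a node with a parent and children, which is what lets the later examples address elements by their position and attributes rather than by pattern matching on raw text.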

Getting the page source:

import requests
res = requests.get('https://www.sina.com.cn/')
res.encoding = 'utf-8'
print(res.text)
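
The res.encoding line matters because res.text decodes the raw response bytes with whatever encoding requests has settled on; when a server sends a text content type without a charset, requests falls back to ISO-8859-1. The sketch below reproduces that effect with plain bytes, so it needs no network access. The sample string is arbitrary.

```python
# Bytes as a server might send them for a UTF-8 page (sample text is arbitrary).
raw = "新浪首页".encode("utf-8")

wrong = raw.decode("iso-8859-1")  # what a wrong encoding guess produces
right = raw.decode("utf-8")       # what the correct encoding gives

print(repr(wrong))  # garbled: every byte becomes its own character
print(right)        # the original text
```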

Extracting the content you need:

# Extract the content you need
from bs4 import BeautifulSoup
html_sample = res.text
soup = BeautifulSoup(html_sample, 'html.parser')
# print(soup.text)  # prints all the text in the document, not just the title
# Use select to find every h1 element
header = soup.select('h1')
print(header)  # a list of the matching h1 tags; use header[0] for a single tag, or a for loop if there are several
print(header[0].text)  # extract the text of the first h1
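
The for loop mentioned in the comment could look like the following sketch, which uses hand-written markup in place of a downloaded page. It also shows select_one, a related BeautifulSoup method that is not used in the original post.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup standing in for a downloaded page.
html_sample = "<h1>Top story</h1><p>intro</p><h1>Second story</h1>"
soup = BeautifulSoup(html_sample, "html.parser")

# select() returns a list of matches, even when there is only one.
headers = soup.select("h1")
titles = [h.text for h in headers]
print(titles)

# select_one() returns the first match, or None when nothing matches.
first = soup.select_one("h1")
missing = soup.select_one("h2")
print(first.text, missing)
```

Checking for None (or an empty list) before indexing avoids an IndexError on pages that lack the expected tag.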

Using select:

To find the element whose id is title with select: alink = soup.select('#title')
To find all elements whose class is link with select:
soup = BeautifulSoup(html_sample, 'html.parser')
for link in soup.select('.link'):
    print(link)
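
A self-contained sketch of the id and class selectors, run against invented markup (the `external` class and the hrefs are assumptions for the example):

```python
from bs4 import BeautifulSoup

# Invented markup: one element with id "title", three links, two of class "link".
html_sample = """
<h1 id="title">Hello</h1>
<a class="link" href="/a">A</a>
<a class="link external" href="/b">B</a>
<a href="/c">C</a>
"""
soup = BeautifulSoup(html_sample, "html.parser")

by_id = soup.select("#title")           # '#' matches on the id attribute
by_class = soup.select(".link")         # '.' matches on any one of the classes
both = soup.select("a.link.external")   # selectors can be combined

print([tag.text for tag in by_id])
print([tag.text for tag in by_class])
print([tag.text for tag in both])
```

The selector syntax is the same CSS syntax used in stylesheets, which is why front-end experience carries over directly.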

Examples:

Use select to find the href of every a tag:
alinks = soup.select('a')
for link in alinks:
    print(link.get('href'))  # None if the tag has no href

Example:
a = '<a href="#" qoo=123 abc=456>I am a link</a>'
soup2 = BeautifulSoup(a, 'html.parser')
alinks = soup2.select('a')[0]
print(alinks['href'])
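
The same tag can be used to look at attribute access more closely. This sketch reuses the sample string from the example; the attribute name `missing` is made up to show the failure case.

```python
from bs4 import BeautifulSoup

a = '<a href="#" qoo=123 abc=456>I am a link</a>'
link = BeautifulSoup(a, "html.parser").select("a")[0]

attrs = link.attrs             # every attribute, as a dict of strings
qoo = link["qoo"]              # index access raises KeyError if the attribute is absent
absent = link.get("missing")   # .get() returns None instead of raising

print(attrs)
print(qoo)
print(absent)
```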

A simple example of scraping content from Qiushibaike (糗事百科):

import requests
from bs4 import BeautifulSoup
content = requests.get('https://www.qiushibaike.com/').content
soup = BeautifulSoup(content,'html.parser')
story = soup.select('.content span')
for p in story:
    print(p.text)
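
The selector '.content span' is a descendant selector: it matches span tags anywhere inside an element of class content. The sketch below shows that behavior offline; the markup is a guess at the general shape of such a page, not the site's actual HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only spans inside class="content" should be picked up.
html = """
<div class="content"><span>First joke</span></div>
<div class="author"><span>someone</span></div>
<div class="content"><span>Second joke</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

stories = [span.text.strip() for span in soup.select(".content span")]
print(stories)
```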

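All of the examples above scrape text that is present in the document (Doc) response itself. On some pages the content is loaded by JS or arrives through XHR requests, and those cases have to be analyzed one by one.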
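
For the XHR case, the response body is often JSON rather than HTML, in which case no HTML parsing is needed at all. A minimal sketch, assuming a made-up payload (real endpoints define their own structure):

```python
import json

# Hypothetical body of an XHR response; the "items" and "title" keys are invented.
payload = '{"items": [{"title": "First"}, {"title": "Second"}]}'

data = json.loads(payload)                           # decode the JSON text
titles = [item["title"] for item in data["items"]]   # pull out the fields of interest
print(titles)
```

With requests, calling res.json() on a live response performs the same decoding.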

This post is only a first look at requests and BeautifulSoup; more posts will follow.
