Crawler workflow

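Send the request: use an HTTP library to send a Request to the target site. The request can carry extra information such as headers. Then wait for the server to respond.
Get the response: if the server responds normally, you receive a Response whose body is the content you want to collect. It may be HTML, JSON, an image, a video, and so on.
Parse the content: HTML can be parsed with regular expressions or with third-party parsers such as BeautifulSoup or lxml's etree; JSON can be parsed with the json module; binary data can be saved as is or processed further.
Save the data: there are many options, for example writing to a database or saving to files.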
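The four steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not a complete crawler: the function names `fetch`, `parse` and `save` are hypothetical, and the HTML branch only pulls out the `<title>` with a regular expression to keep the example short.

```python
import json
import re
import urllib.request


def fetch(url, user_agent="Mozilla/5.0", timeout=10):
    """Steps 1 and 2: send a Request with extra headers, return (content type, raw body)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get_content_type(), response.read()


def parse(content_type, body):
    """Step 3: pick a parsing strategy from the content type of the response."""
    if content_type == "application/json":
        return json.loads(body.decode("utf-8"))
    if content_type == "text/html":
        text = body.decode("utf-8", errors="replace")
        match = re.search(r"<title>(.*?)</title>", text, re.IGNORECASE | re.DOTALL)
        return match.group(1).strip() if match else None
    return body  # binary data (image, video, ...) is passed through unchanged


def save(path, data):
    """Step 4: write bytes as is, anything else as JSON text."""
    if isinstance(data, bytes):
        with open(path, "wb") as f:
            f.write(data)
    else:
        with open(path, "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False)
```

A typical call chain would be `content_type, body = fetch(url)`, then `save("out.json", parse(content_type, body))`. Keeping the network call in its own function makes the parsing step easy to try on saved pages.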

BeautifulSoup

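Put simply, BeautifulSoup is a Python library whose main job is to extract data from web pages. It parses HTML or XML that has already been downloaded; it does not fetch pages itself. The official documentation describes it roughly as follows: Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract, and because it is simple, a complete application takes very little code. Beautiful Soup works on top of Python parsers such as lxml and html5lib, so you can choose between different parsing strategies or trade flexibility for speed.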
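To make "navigating, searching, and modifying the parse tree" concrete, here is a small sketch on an inline HTML string. The markup and class names are made up for illustration, and the built-in `html.parser` backend is used so that only `bs4` needs to be installed.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only
html = """
<html><body>
  <article class="news"><h3> First headline </h3><p class="summary">First summary.</p></article>
  <article class="news"><h3>Second headline</h3><p class="summary">Second summary.</p></article>
  <article class="ad"><h3>Sponsored</h3></article>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Navigating: attribute access walks down to the first matching tag
first_title = soup.article.h3.get_text(strip=True)

# Searching: find_all returns every tag that matches the name and attributes
news = soup.find_all("article", attrs={"class": "news"})
summaries = [a.find("p", class_="summary").get_text(strip=True) for a in news]

# Modifying: decompose() removes a tag (and its children) from the tree
for ad in soup.find_all("article", class_="ad"):
    ad.decompose()
remaining = len(soup.find_all("article"))

print(first_title, summaries, remaining)
```

Note that `attrs` takes a dict that maps an attribute name to the value to match, and `class_` is the keyword form of the same filter.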

For detailed usage, see the Chinese documentation:

https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh

Hands-on: scraping the front-page news of Southern Weekly (南方周末)

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

def del_sign(s):
    'Remove spaces, newlines, tabs and carriage returns'
    s=s.replace(' ','').replace('\n','').replace('\t','').replace('\r','')
    return s

if __name__=='__main__':
    # Settings
    headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}  # request headers that mimic a browser
    url='http://www.infzm.com/news.shtml'  # page to scrape

    # Start scraping
    response=requests.get(url,headers=headers,timeout=10)  # send the request and get the response
    title_list,content_list=[],[]  # containers for article titles and summaries
    if response.status_code==200:
        response.encoding=response.apparent_encoding  # use the encoding detected from the page content
        html=response.text  # page source
        html_bs = bs(html,'lxml')  # parse the page source into a BeautifulSoup object
        # attrs is a dict mapping an attribute name to the value to match
        article_bs=html_bs.find_all('article',attrs={'class':'news_left_two_next'})  # article tags
        for i,x in enumerate(article_bs):
            print('Extracting news item {}'.format(i+1))
            title_tag,summary_tag=x.find('h3'),x.find('p',attrs={'class':'summary'})
            title_list.append(del_sign(title_tag.text) if title_tag else '')  # article title
            content_list.append(del_sign(summary_tag.text) if summary_tag else '')  # article summary

        # Save the data
        data=pd.DataFrame(list(zip(title_list,content_list)),columns=['title','content'])  # build a DataFrame
        data.to_csv('./南方周末新闻.csv',encoding='utf-8-sig')  # write CSV; utf-8-sig lets Excel display Chinese correctly
    else:
        print('Request failed!')
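One caveat about `del_sign`: it removes only four characters (space, `\n`, `\t`, `\r`). Chinese pages often also contain the full-width space `\u3000` or the non-breaking space `\xa0`, which it leaves in place. A possible alternative, shown here as a sketch with the hypothetical name `strip_whitespace`, relies on `str.split()` with no arguments, which splits on every Unicode whitespace character.

```python
def strip_whitespace(s):
    """Remove every character that str.split() treats as whitespace."""
    return "".join(s.split())


sample = "\u3000南方周末\xa0 首页\n\t新闻\r\n"
cleaned = strip_whitespace(sample)
print(cleaned)  # prints 南方周末首页新闻
```

Like `del_sign`, this also removes the spaces between words, so it suits Chinese text better than English text.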

阳望

