A Basic Overview of Web Crawlers

Article link: https://blog.csdn.net/weixin_43999482/article/details/101162960

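What is a web crawler

A web crawler is a program that automatically requests data from a server and extracts the information it needs from the response.

To see what a page is made of, press F12, or right-click the page and choose Inspect (Inspect Element); the Elements panel shows the site's source code.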

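The basic crawler workflow

1. Send the request

Use an HTTP library to send a request to the target site, that is, send a Request. The request includes headers and other information. Then wait for the server to respond.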

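2. Get the response content

If the server responds normally, it returns a Response, and the content of that Response is the page content you want to fetch.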

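3. Parse the content

The content may be HTML, which can be analyzed with regular expressions or an HTML parsing library. It may be JSON, which can be converted directly into a JSON object. It may also be binary data, which can be saved or processed further.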

4. Save the data

Data can be saved in many ways: as plain text, in a database, or in a specific file format.
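The four steps above can be sketched roughly as follows. This is a minimal sketch, not a complete crawler: it uses the standard-library urllib instead of requests so that it has no third-party dependency, and the function names, the sample HTML and the output file name are made up for illustration.

```python
import json
import re
import urllib.request


def fetch(url):
    """Steps 1 and 2: send a GET request and return the response body as text."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=10) as response:
        # Assumes the page is UTF-8; undecodable bytes are replaced.
        return response.read().decode("utf-8", errors="replace")


def parse_title(html):
    """Step 3: extract the <title> text with a regular expression."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def save(record, path):
    """Step 4: save the extracted data as a JSON text file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False)


def crawl(url, path):
    """Run all four steps for one page (needs network access)."""
    save({"url": url, "title": parse_title(fetch(url))}, path)


# Offline stand-in for a fetched page, so steps 3 and 4 can be tried without a network.
SAMPLE_HTML = "<html><head><title> Example Page </title></head><body></body></html>"
print(parse_title(SAMPLE_HTML))
```

With network access, something like crawl("http://www.baidu.com", "page.json") would run the whole pipeline for one page.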

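What are request and response

What your computer sends to the server is called a request.
What the server sends back to your computer after processing it is called a response.
With the developer tools open, click Network and refresh the page. You can then see the exchanges between your computer and the server (request headers, response headers, IP address information).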

Request

Request methods

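The main methods are GET and POST; there are also HEAD, PUT, DELETE and others.
The main differences between GET and POST:
GET: the parameters are shown in the URL, after the path, which makes filtering convenient
POST: the data is sent as form data in the request body and has to be submitted and validated. It does not appear in the URL, which makes it somewhat safer than GET, although only HTTPS actually protects it in transit
Request URL: URL stands for Uniform Resource Locator. A web page, an image or a video can each be identified by a URL
Request headers: the header information of the request, such as User-Agent, Host and Cookie
Request body: extra data carried with the request, such as the form data sent when a form is submitted
Generally, a GET request does not carry any data in its body.
For example: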

Request URL: https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png
Request Method: GET
Status Code: 200  (from memory cache)
Remote Address: 47.244.21.100:443
Referrer Policy: origin
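To make the GET/POST difference concrete, here is a small sketch that builds both kinds of request without sending them. It uses the standard-library urllib so nothing goes over the network; the example.com URL and the parameter names are placeholders, not taken from any real site.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"wd": "python", "pn": "10"}  # hypothetical query parameters

# GET: the parameters are encoded into the URL after "?"; there is no body.
get_request = Request(
    "http://www.example.com/s?" + urlencode(params),
    headers={"User-Agent": "Mozilla/5.0"},
)

# POST: the same parameters travel in the request body as form data.
post_request = Request(
    "http://www.example.com/s",
    data=urlencode(params).encode("utf-8"),
    headers={"User-Agent": "Mozilla/5.0"},
    method="POST",
)

# Nothing is sent here; we only inspect what each request would carry.
for request in (get_request, post_request):
    print(request.get_method(), request.full_url, request.data)
```

The GET request shows its parameters in the URL and has no body, while the POST request keeps the URL clean and carries the same data as bytes in the body.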

Response

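Response status:
The server can return many status codes: 200 means success, 301 means a permanent redirect, 404 means the page was not found, and 500 means the server hit an error while handling the request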
Status Code: 200
Response headers:
Such as the content type, content length, server information, Set-Cookie and so on
alt-svc: quic=":443"; ma=2592000; v="46,43,39"
cache-control: private, max-age=0
content-encoding: br
content-type: text/html; charset=UTF-8
date: Sun, 22 Sep 2019 12:20:45 GMT
expires: -1
server: gws
set-cookie: 1P_JAR=2019-09-22-12; expires=Tue, 22-Oct-2019 12:20:45 GMT; path=/; domain=.google.com; SameSite=none
status: 200
strict-transport-security: max-age=31536000
x-frame-options: SAMEORIGIN
x-xss-protection: 0
Response body:
The most important part. It contains the requested content, such as the HTML document or the binary data of an image
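The status code and the headers can also be inspected in code. The sketch below relies only on the standard library; the helper names describe_status and parse_content_type are invented for this example.

```python
from email.message import Message
from http import HTTPStatus


def describe_status(code):
    """Return the standard reason phrase and the broad class of a status code."""
    classes = {2: "success", 3: "redirection", 4: "client error", 5: "server error"}
    return HTTPStatus(code).phrase, classes.get(code // 100, "other")


def parse_content_type(value):
    """Split a Content-Type header value into (media type, charset)."""
    message = Message()
    message["Content-Type"] = value
    return message.get_content_type(), message.get_content_charset()


for code in (200, 301, 404, 500):
    print(code, *describe_status(code))

print(parse_content_type("text/html; charset=UTF-8"))
```

The charset part matters in practice: when a response header names no charset, the client has to guess how to decode the body.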

A quick demo

import requests  # HTTP library used to send the request

response = requests.get("http://www.baidu.com")  # request the URL
# The Content-Type header here names no charset, so requests would fall back
# to ISO-8859-1 and print the Chinese text as mojibake; set UTF-8 explicitly.
response.encoding = "utf-8"
print(response.text)         # print the page source
print(response.headers)      # print the response headers
print(response.status_code)  # print the status code


<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>

{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'Keep-Alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Sun, 22 Sep 2019 12:50:55 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:36 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
200

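What kinds of data can a crawler fetch

Web page text: HTML documents, JSON data and so on
Images: saved as binary files
Video files: likewise saved as binary files
Other: anything that can be requested can be fetched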

Parsing methods

Direct processing
JSON parsing
Regular expressions
BeautifulSoup
PyQuery
XPath
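A few of these approaches can be tried with the standard library alone. The sketch below covers JSON parsing, regular expressions and a tiny HTML parser; BeautifulSoup, PyQuery and XPath (for example through lxml) need third-party packages and are not shown. The sample strings and the LinkCollector class are invented for illustration.

```python
import json
import re
from html.parser import HTMLParser

# Hypothetical sample inputs standing in for real response bodies.
SAMPLE_JSON = '{"name": "baidu", "links": 2}'
SAMPLE_HTML = (
    '<div id="u1"><a href="http://news.baidu.com">news</a> '
    '<a href="http://map.baidu.com">map</a></div>'
)

# JSON parsing: the text becomes ordinary Python dicts and lists.
data = json.loads(SAMPLE_JSON)

# Regular expression: quick, but brittle if the markup changes.
hrefs_by_regex = re.findall(r'href="([^"]+)"', SAMPLE_HTML)


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href")


# HTML parser: follows the tag structure instead of matching raw text.
collector = LinkCollector()
collector.feed(SAMPLE_HTML)

print(data["links"], hrefs_by_regex, collector.hrefs)
```

For real pages, a dedicated parsing library is usually more convenient than either regular expressions or a hand-written HTMLParser subclass.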

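Why is the data we fetch different from what we see in the browser

Because JavaScript on the page calls backend endpoints and fills in more content after the initial HTML has loaded. A plain HTTP request only gets the initial HTML, without the parts rendered by JavaScript.

How to deal with JavaScript rendering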

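Analyze the Ajax requests and request those endpoints directly
Use Selenium/WebDriver, an automated-testing tool, to drive a browser and simulate loading the page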
Splash
PyV8，Ghost.py
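A small illustration of why the raw HTML can differ from the rendered page, and why requesting the Ajax endpoint directly helps. The HTML snippet, the JSON payload and their field names are all made up for this sketch; on a real site the endpoint would be found in the Network panel of the developer tools.

```python
import json
import re

# Hypothetical initial HTML: the list stays empty until JavaScript fills it in.
INITIAL_HTML = '<ul id="news"></ul><script src="/static/app.js"></script>'

# Hypothetical JSON that the page's script would fetch from an Ajax endpoint.
AJAX_PAYLOAD = '{"items": [{"title": "first"}, {"title": "second"}]}'

# Parsing the raw HTML finds nothing, because no <li> has been rendered yet.
titles_in_html = re.findall(r"<li>(.*?)</li>", INITIAL_HTML)

# Reading the Ajax response directly gives the data the browser would display.
titles_from_ajax = [item["title"] for item in json.loads(AJAX_PAYLOAD)["items"]]

print(titles_in_html)
print(titles_from_ajax)
```

When the endpoint cannot be called directly, for example because it needs tokens computed in the browser, driving a real browser with Selenium is the usual fallback.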

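How to save the data

Text: plain text, JSON, XML and so on
Relational databases: such as MySQL, Oracle and SQL Server, which store structured, tabular data
Non-relational databases: such as MongoDB and Redis, which store data as documents or key-value pairs
Binary files: images, video, audio and the like, saved directly in their own file formats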
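Three of these options can be sketched with the standard library. Note the substitutions: sqlite3 stands in for a relational database such as MySQL, no non-relational store is shown because MongoDB and Redis need a running server and a third-party driver, and the record, table name and file names are placeholders.

```python
import json
import os
import sqlite3
import tempfile

record = {"title": "Example Page", "url": "http://www.example.com"}
out_dir = tempfile.mkdtemp()  # scratch directory for this sketch

# Text: write the record as a JSON file.
json_path = os.path.join(out_dir, "record.json")
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(record, f, ensure_ascii=False)

# Relational database: sqlite3 used here as a stand-in for MySQL and similar.
conn = sqlite3.connect(os.path.join(out_dir, "pages.db"))
conn.execute("CREATE TABLE IF NOT EXISTS pages (title TEXT, url TEXT)")
conn.execute(
    "INSERT INTO pages (title, url) VALUES (?, ?)",
    (record["title"], record["url"]),
)
conn.commit()
rows = conn.execute("SELECT title, url FROM pages").fetchall()
conn.close()

# Binary file: write the raw bytes unchanged, in "wb" mode.
image_bytes = b"\x89PNG\r\n\x1a\n"  # only the PNG signature, as a placeholder
image_path = os.path.join(out_dir, "logo.png")
with open(image_path, "wb") as f:
    f.write(image_bytes)

print(rows)
```

In a real crawler, image_bytes would come from the response body of an image or video request rather than from a hard-coded placeholder.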