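General-purpose and focused crawlers: concepts
1. What a crawler is:
    A program that fetches information from the internet
2. Languages:
    C/C++   highest runtime efficiency
    PHP     not well suited to crawling
    Python  clean and concise
    Java    verbose, with a lot of redundant code that is costly to change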

General-purpose crawler:

    1. Fetch web pages
    2. Store the data
    3. Process the data
    4. Serve keyword searches

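How a site gets into keyword search results:

    Submit the URL manually (for example through Baidu Webmaster Tools)
    Exchange friendly links with other sites
    Pay for bid-based ranking
    A site publishes a robots.txt file that tells search engine crawlers which content they may fetch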
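
As a minimal sketch of how a crawler might honor robots.txt, the standard-library urllib.robotparser module can parse the rules and report whether a URL may be fetched. The rules, the crawler name "MyCrawler", and the example.com URLs below are made up for illustration; a real crawler would load the rules from the site's own /robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(SAMPLE_RULES)  # parse() accepts an iterable of lines

# "MyCrawler" is an assumed user-agent name.
allowed_public = parser.can_fetch("MyCrawler", "http://example.com/index.html")
allowed_private = parser.can_fetch("MyCrawler", "http://example.com/private/data.html")
print(allowed_public, allowed_private)
```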

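Focused crawler

    A crawler tailored to a specific need; it fetches only the data we actually want
Common characteristics of the pages being crawled:
    Every page has a URL
    Pages are made up of HTML, JavaScript, and CSS
    Pages are served over HTTP or HTTPS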

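Outline:

1. Python syntax
2. Fetching pages with Python libraries
    urllib.request, urllib.parse, requests
3. Parsing content
    Regular expressions, XPath, bs4, jsonpath
4. Scraping dynamically generated HTML
    selenium + phantomjs
5. scrapy
    A high-performance asynchronous networking framework
6. Distributed crawling with the scrapy-redis component
    A set of components built on top of scrapy that uses redis for storage and related functions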

http:

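        http: 80
        https: 443
        mysql: 3306
        ftp: 21
        Port numbers range from 0 to 65535; 0-1023 are the well-known (system) ports
        How HTTP works:
            Resources are addressed by URL (Uniform Resource Locator)
        URL components: http://www.baidu.com (scheme http, host www.baidu.com)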
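
To make the URL components concrete, here is a small sketch using urllib.parse.urlsplit. The port, path, query string, and fragment in the sample URL are assumptions added for illustration; the notes only give the scheme and host.

```python
from urllib.parse import urlsplit

# Hypothetical URL chosen so that every component is present.
parts = urlsplit("https://www.baidu.com:443/s?wd=python#top")

print(parts.scheme)    # protocol
print(parts.hostname)  # host name
print(parts.port)      # port number
print(parts.path)      # path on the server
print(parts.query)     # query string
print(parts.fragment)  # anchor inside the page
```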

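    HTTP status codes:

    200: The request succeeded.
    201: The request succeeded and the server created a new resource.
    202: The server has accepted the request but has not processed it yet.
    301: The requested resource has moved permanently to a new location.
    302: The requested resource is temporarily served from a different URI; the client should keep using the original location for future requests.
    304: The requested page has not been modified since the last request. The server returns no page content with this response.
    401: The request requires authentication. A server may return this for pages that require login.
    403: The server understood the request but refuses to carry it out.
    404: The request failed because the requested resource was not found on the server.
    500: The server hit an unexpected condition that prevented it from fulfilling the request. This usually points to an error in the server-side code.
    503: The server cannot handle the request right now because of temporary maintenance or overload. This is usually temporary and clears after a while.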
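
A crawler usually branches on the class of the status code rather than on each individual value. The helper below is one possible way to do that; describe_status is a hypothetical name, and the standard phrase for each code comes from the standard-library http.HTTPStatus enum.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a coarse category for an HTTP status code."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


# Print the standard phrase and the category for a few codes from the list.
for code in (200, 301, 404, 503):
    print(code, HTTPStatus(code).phrase, describe_status(code))
```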


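    Request headers:
        Accept: the MIME types the client can currently accept
        Accept-Encoding: the content encodings the client can accept
        Accept-Language: the languages the client can accept
        Cache-Control: caching directives
        Connection: keep-alive asks for a persistent connection
        Cookie: session data, used for session control
        Host: the host being requested
        User-Agent: the client's browser type
        Referer: the page the request came from
        X-Requested-With: XMLHttpRequest marks an asynchronous (Ajax) request
    Response headers:
        Content-Encoding: the content encoding
        Content-Type: the MIME type of the content
        Date: Fri, 16 Mar 2018 (when the response was generated)
        Expires: the expiry time
        Server: the server software and version
        Transfer-Encoding: chunked means the content is sent in chunks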
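
The response headers matter in practice because Content-Type tells the crawler how to decode the body. The sketch below parses a made-up header block with the standard-library email module; the header values are illustrative assumptions, not captured from a real server. The headers attribute of the object returned by urllib.request.urlopen is a message object of the same family, so the same accessor methods should be available there.

```python
from email import message_from_string

# Made-up response header block, for illustration only.
RAW_HEADERS = (
    "Content-Type: text/html; charset=utf-8\n"
    "Server: nginx\n"
    "Transfer-Encoding: chunked\n"
    "\n"
)

headers = message_from_string(RAW_HEADERS)

content_type = headers.get_content_type()  # MIME type without parameters
charset = headers.get_content_charset()    # charset parameter, lowercased
server = headers["server"]                 # header lookup is case-insensitive
print(content_type, charset, server)
```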

urlretrieve requests a URL and saves the response directly to a file

urllib.request.urlretrieve(url, filename=file_name)
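
A runnable sketch of urlretrieve follows. So that it works without network access, it first writes a small local file and addresses it with a file:// URL; that setup and the file names are assumptions for the demo, and a real crawler would pass an http or https URL instead.

```python
import os
import tempfile
import urllib.request
from pathlib import Path

# Create a small local "page" so the sketch does not need the network.
workdir = tempfile.mkdtemp()
source = Path(workdir) / "page.html"
source.write_text("<html><body>hello</body></html>", encoding="utf-8")

# Download the URL straight into the target file.
target = os.path.join(workdir, "saved.html")
saved_path, headers = urllib.request.urlretrieve(source.as_uri(), filename=target)

saved_text = Path(saved_path).read_text(encoding="utf-8")
print(saved_path == target, saved_text)
```

urlretrieve returns a tuple of the local file name and the response headers, which is why the call unpacks two values.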

Two ways to scrape data

The first

import urllib.request

# header is a dict of request headers, e.g. {'User-Agent': 'browser info'}
rep = urllib.request.Request(url, headers=header)
# send the request
resp = urllib.request.urlopen(rep)
data = resp.read().decode('UTF-8')

The second

rep = urllib.request.Request(url)
# add a request header
rep.add_header('User-Agent', 'browser info')
resp = urllib.request.urlopen(rep)
data = resp.read().decode('UTF-8')
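
The two approaches should produce equivalent Request objects. The sketch below checks this without sending anything over the network; the User-Agent value is a placeholder assumption.

```python
import urllib.request

URL = "http://www.baidu.com"      # URL used earlier in these notes
UA = "Mozilla/5.0 (placeholder)"  # hypothetical User-Agent string

# Way 1: headers passed to the constructor.
req1 = urllib.request.Request(URL, headers={"User-Agent": UA})

# Way 2: header added afterwards.
req2 = urllib.request.Request(URL)
req2.add_header("User-Agent", UA)

# Request normalizes header names with str.capitalize(), hence "User-agent".
same = req1.get_header("User-agent") == req2.get_header("User-agent") == UA
print(same)
```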

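Countering anti-crawler measures

When the same IP or the same request headers are used to crawl a site many times, the site may block that IP or those headers.
One countermeasure is to prepare several User-Agent values and pick one at random for each request.

import random

agentList = ['', '', '']  # fill in real User-Agent strings
agentStr = random.choice(agentList)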
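
Putting the random choice together with a request could look like the sketch below. build_request is a hypothetical helper name, and the User-Agent strings are placeholders to be replaced with real browser strings.

```python
import random
import urllib.request

# Placeholder User-Agent strings; substitute real browser strings in practice.
AGENT_LIST = [
    "Mozilla/5.0 (placeholder-agent-1)",
    "Mozilla/5.0 (placeholder-agent-2)",
    "Mozilla/5.0 (placeholder-agent-3)",
]


def build_request(url: str) -> urllib.request.Request:
    """Build a Request whose User-Agent is picked at random."""
    agent = random.choice(AGENT_LIST)
    return urllib.request.Request(url, headers={"User-Agent": agent})


req = build_request("http://www.baidu.com")
print(req.get_header("User-agent"))
```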

Request timeout

import urllib.request
for i in range(1, 100):
    try:
        res = urllib.request.urlopen(url, timeout=0.5)
    except Exception:
        print('No response for a long time')
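
A slightly more defensive variant of the loop above wraps the request in a helper that retries a few times and catches only network-related errors. fetch and its retries parameter are hypothetical names introduced for this sketch, not part of urllib.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=0.5, retries=3):
    """Return the response body as bytes, or None if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, socket.timeout) as exc:
            # URLError covers connection failures; socket.timeout covers
            # a read that exceeds the timeout.
            print(f"attempt {attempt} failed: {exc}")
    return None
```

Returning None instead of raising keeps the calling loop simple; a caller that needs the failure reason could log or re-raise the last exception instead.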

Fetch a page and save it

urllib.request.urlretrieve(url)
# clear the cache (temporary files left behind by urlretrieve)
urllib.request.urlcleanup()
