Web crawler study notes day1

day1

I. HTTP

1. Introduction:

https://www.cnblogs.com/vamei/archive/2013/05/11/3069788.html

http://blog.csdn.net/guyuealian/article/details/52535294

2. When a user enters a URL (such as www.baidu.com), what is the process of sending network requests?

 

a. Resolve www.baidu.com to the IP address of the corresponding server (DNS resolution):

(1) First obtain the MAC address of the default gateway (via the ARP protocol).

(2) Build the packet addressed to the DNS server's IP, but with the destination MAC set to the default gateway's MAC address.

(3) The default gateway, which can forward data, passes the packet on to the router.

(4) The router chooses a suitable path according to its routing protocol and forwards the packet toward the destination gateway.

(5) The destination gateway (the gateway of the network where the DNS server resides) delivers the packet to the DNS server.

(6) The DNS server resolves www.baidu.com to its IP address and returns it to the requesting client.

b. With the IP address obtained, the client performs the TCP three-way handshake with baidu.com's server to establish a connection.

c. The client sends the HTTP request for data over that connection to the web server.

d. The web server receives the request, looks up the requested resource on the server, and returns the result to the requester (the browser).

e. The browser receives the data and renders the page with its own rendering engine.

f. The browser closes the TCP connection, i.e. the four-way handshake (four waves).
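Steps a, b, c and f above can be reproduced from Python with the built-in socket module; a minimal sketch (the operating system and the network handle the ARP and routing details described above):

import socket

# step a: resolve the domain name to an IP address (the OS queries the DNS server for us)
ip = socket.gethostbyname("www.baidu.com")
print(ip)

# step b: establish a TCP connection (the three-way handshake happens inside connect)
sock = socket.create_connection((ip, 80))

# step c: send an HTTP request over the connection
request = "GET / HTTP/1.1\r\nHost: www.baidu.com\r\nConnection: close\r\n\r\n"
sock.sendall(request.encode("utf-8"))

# step d: read the beginning of the server's response (the browser would render this)
print(sock.recv(4096)[:200])

# step f: close the TCP connection (the four-way teardown happens inside close)
sock.close()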

3. HTTP request methods

(1) GET

Requests data from a specified resource. The parameters appear directly in the URL, so it is not secure; the request length is limited, and only ASCII data can be passed.

(2) POST

Submits data to a specified resource for processing (for example, submitting a form or uploading a file). The data (binary allowed) is carried in the request body, so it is relatively secure and has no length limit. A POST request may modify existing resources or create new ones.

(3) PUT

Uploads its content to the specified resource location, replacing the resource at the target URI with the uploaded content.

(4) DELETE

Requests that the server delete the resource identified by the Request-URI

(5) TRACE

Lets the client trace the path its request message takes: the server echoes the received request back to the client. Mainly used for testing and diagnostics.

(6) HEAD

The same as GET, except the server returns no message body, only the response headers.

(7) OPTIONS

Returns the HTTP request methods the server supports for the specified resource (i.e. the client asks the server which methods it may submit).

(8) CONNECT

Reserved in HTTP/1.1 for proxy servers that can switch the connection into tunnel mode.

CONNECT asks the proxy to establish a tunnel so that TCP-level communication can pass through it. It is mainly used with SSL (Secure Sockets Layer) and TLS (Transport Layer Security) to tunnel encrypted traffic through the proxy.

Although HTTP defines eight request methods, GET and POST are the ones commonly used in practice; the effects of the other methods can usually be achieved indirectly through these two.
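A minimal sketch of the two common methods with urllib.request; httpbin.org is used here only as an assumed public test endpoint:

import urllib.request
import urllib.parse

# GET: the parameters are carried in the URL itself
with urllib.request.urlopen("http://httpbin.org/get?name=test") as response:
    print(response.read().decode("utf-8"))

# POST: the parameters are carried in the request body, not in the URL
data = urllib.parse.urlencode({"name": "test"}).encode("utf-8")
with urllib.request.urlopen("http://httpbin.org/post", data=data) as response:
    print(response.read().decode("utf-8"))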

4. Differences between HTTP and HTTPS

(1) HTTPS requires a certificate from a CA (Certificate Authority), which involves some cost.

(2) HTTP transmits data in plain text, while HTTPS transmits encrypted data, so HTTPS is safer than HTTP.

(3) The default ports differ: HTTP uses 80, HTTPS uses 443.

5. Request header contents

(1) Accept: the content types the client can accept
(2) Accept-Encoding: the encodings the client can accept
(3) Connection: long (keep-alive) or short connection
(4) Cookie: used for verification and identifying the session
(5) Host: the domain being requested
(6) Referer: marks which page the request jumped from
(7) User-Agent: browser and client information
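These headers can be set explicitly on a request; a minimal sketch with urllib.request.Request (the header values are just example assumptions):

import urllib.request

url = "http://www.baidu.com/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # browser / client information
    "Accept": "text/html",          # content types the client accepts
    "Accept-Encoding": "identity",  # ask for an uncompressed body so read() stays simple
    "Connection": "close",          # short connection instead of keep-alive
    "Referer": "http://www.baidu.com/",  # the page we claim to have come from
}
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request)
print(response.read().decode("utf-8")[:200])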

II. Getting started with crawlers

(1) The value of crawlers:

1. Buying and selling data (data in high-end fields is especially expensive)
2. Data analysis: producing analysis reports
3. Traffic
4. Indexes, such as the Alibaba Index and the Baidu Index

(2) Legality: a gray-area industry.
There is no law that says crawling is illegal, and no law that says it is legal.
At the company level: if the company asks you to crawl a database (stealing trade secrets), the responsibility lies with the company.
(3) Can a crawler crawl everything? (No.) A crawler can only crawl data that the user is able to access.
Example: iQiyi videos (VIP vs. non-VIP):
1. A normal user can only watch non-VIP videos, so they can only crawl non-VIP videos.
2. A VIP user can crawl VIP videos.
3. A normal user who wants to crawl VIP videos is hacking.

III. Types of crawlers

(1) General-purpose crawlers
1. Used by search engines: Baidu, Google, 360, Yahoo, Sogou
Advantages: openness, speed
Disadvantages: the target is not specific
Returned content: roughly 90% of it is content the user does not need
They do not know where the user's real need lies
(2) Focused crawlers (what we are learning)
1. Clear target
2. Precisely matches the user's needs
3. The returned content is fixed
Incremental crawling: paging, requesting from the first page to the last page (see the sketch below)
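A minimal sketch of the incremental (paging) idea, building page URLs from the first page to the last; the wd and pn parameters and the 10-results-per-page step are only assumptions for illustration:

import urllib.parse

base_url = "http://www.baidu.com/s?"

# request page 1 through page 5 in turn; a real site defines its own paging parameter
for page in range(1, 6):
    params = {"wd": "python", "pn": (page - 1) * 10}  # assumed: 10 results per page
    page_url = base_url + urllib.parse.urlencode(params)
    print(page_url)  # each URL would then be fetched with urllib.request.urlopen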

Deep crawlers:

Static data: HTML, CSS
Dynamic data: JS code, encrypted JS

(3)robots

The Robots protocol (also called the crawler protocol or robot protocol), formally the "Robots Exclusion Protocol", is how a website tells search engines which pages may be crawled and which may not. You can create a plain-text file named robots.txt on your site and declare in it the parts of the site you do not want robots to visit; search engines will then leave some or all of the site out of their index, or index only the content you specify.

Focused crawlers do not have to obey robots.txt.
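A minimal sketch of reading a site's robots.txt with the built-in urllib.robotparser (this is what general-purpose crawlers honour; the focused crawlers discussed here may choose not to):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.baidu.com/robots.txt")
rp.read()  # download and parse the robots.txt file

# ask whether a given User-Agent may fetch a given path
print(rp.can_fetch("Googlebot", "http://www.baidu.com/s"))
print(rp.can_fetch("*", "http://www.baidu.com/"))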

IV. How a crawler works

Determine the target URL to crawl ---> send a request in code and obtain the data ---> parse the data ---> if there are new target URLs, go back to the first step ---> persist the data (for example, write it to a file)
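A minimal sketch of that loop, assuming a placeholder parse() that a real crawler would replace with actual HTML parsing:

import urllib.request

def parse(html):
    # placeholder: a real crawler would extract data and new target URLs from the HTML
    return []

def crawl(start_url):
    todo = [start_url]   # URLs waiting to be crawled
    seen = set()         # URLs that have already been crawled
    while todo:
        url = todo.pop()
        if url in seen:
            continue
        seen.add(url)
        # send the request and get the data
        html = urllib.request.urlopen(url).read().decode("utf-8")
        # parse the data; any new targets go back to the start of the loop
        todo.extend(parse(html))
        # persist the data, e.g. append it to a file
        with open("result.html", "a", encoding="utf-8") as f:
            f.write(html)

crawl("http://www.baidu.com/")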

1. Python 3 (built-in module): urllib.request

(1) urlopen:

a. returns a response object

b. response.read()

c. bytes.decode("utf-8")

(2) GET: passing parameters

1. Chinese characters cause an error: the URL may only contain ASCII characters, so Chinese characters in the URL must be percent-encoded
(3) POST
(4) Custom handlers
(5) URLError

(POST, custom handlers and URLError are sketched at the end of section V below.)

 

V. Code

Sending a request

import urllib.request

def load_data():
    url = "http://www.baidu.com/"
    response = urllib.request.urlopen(url)  # send the request
    print(response)
    data = response.read()  # the content read is bytes
    print(data)
    # convert the fetched bytes into a string
    str_data = data.decode("utf-8")
    print(str_data)
    # write the data to a file
    with open("baidu.html", "w", encoding="utf-8") as f:  # encoding="utf-8" must be given here, otherwise writing fails
        f.write(str_data)

load_data()

# Types a crawler gets back in Python: str or bytes
# If the fetched content is bytes but you need to write a string: decode("utf-8")
# If the fetched content is str but you need to write bytes: encode("utf-8")

 

Sending a request with parameters

import urllib.request

def get_method_params():
    url = "http://www.baidu.com/s?wd="
    name = "美女"
    final_url = url + name
    print(final_url)
    # send the network request in code
    response = urllib.request.urlopen(final_url)
    print(response)

get_method_params()

Running this raises UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-11: ordinal not in range(128)

Reason: the URL sent by urllib may only contain ASCII characters, and final_url here contains Chinese characters, so the URL has to be percent-encoded. For example, pasting the URL (https://www.baidu.com/s?wd=美女) into PyCharm will encode it automatically.

(When I paste it into my own PyCharm, however, it is not encoded.)

The encoding uses the parse and string modules, as shown below.

 
 

import urllib.request
import urllib.parse
import string

def get_method_params():
    url = "http://www.baidu.com/s?wd="
    name = "美女"
    final_url = url + name
    print(final_url)
    # percent-encode the URL that contains Chinese characters
    encode_new_url = urllib.parse.quote(final_url, safe=string.printable)
    print(encode_new_url)
    # send the network request in code
    response = urllib.request.urlopen(encode_new_url)
    print(response)  # an HTTP response object, e.g. <http.client.HTTPResponse object at 0x0000028324337780>
    # read the content
    data = response.read().decode()
    print(data)
    # save it locally
    with open("02-encode.html", "w", encoding="utf-8") as f:
        f.write(data)

get_method_params()
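POST, custom handlers and URLError from section IV are not covered by the code above; a minimal sketch of each (httpbin.org and the proxy address are assumed test values):

import urllib.request
import urllib.parse
import urllib.error

# (3) POST: the parameters travel in the request body
def post_demo():
    data = urllib.parse.urlencode({"key": "value"}).encode("utf-8")
    response = urllib.request.urlopen("http://httpbin.org/post", data=data)
    print(response.read().decode("utf-8"))

# (4) custom handler: build an opener, here one that goes through a local proxy
def handler_demo():
    proxy_handler = urllib.request.ProxyHandler({"http": "http://127.0.0.1:8888"})  # assumed proxy
    opener = urllib.request.build_opener(proxy_handler)
    response = opener.open("http://www.baidu.com/")
    print(response.read().decode("utf-8")[:200])

# (5) URLError: catch failures such as an unreachable host
def url_error_demo():
    try:
        urllib.request.urlopen("http://a-domain-that-does-not-exist-1234.com/")
    except urllib.error.URLError as e:
        print("request failed:", e.reason)

post_demo()
url_error_demo()
handler_demo()  # only works if the assumed proxy is actually running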

 
