Day 03: Crawler Basics

One, Crawler principles
1. The Internet: a bunch of network devices linking individual computers together into one network is what we call the Internet.
2. The purpose of building the Internet: to transmit and share data.
3. The full process of surfing the web
normal user:

  open the browser ----> send a request to the target site ----> fetch the response data ----> render it in the browser
crawler:

  simulate a browser ----> send a request to the target site ----> fetch the response data ----> extract the valuable data ----> persist (save) the data
4. What the browser actually sends: an HTTP-protocol request

the client:
        the browser is just a piece of software ----> client IP and port
the server:
        https://www.jd.com/
        www.jd.com (JD's domain name) ---> DNS resolution ---> IP and port of JD's server (a minimal DNS-lookup sketch follows this block)
    client IP and port ----> send a request to the server's IP and port; once the connection is established, the corresponding data can be obtained
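The DNS-resolution step above can be reproduced with Python's standard socket module. A minimal sketch, assuming only that the machine can reach a DNS server; www.jd.com is the domain from the example and 443 the standard HTTPS port:

import socket

# resolve the domain name to an IP address, just as the browser does
# before it opens a TCP connection to the server
ip = socket.gethostbyname('www.jd.com')
print(ip, 443)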
5. The full crawling workflow
- send a request (needs a request library: requests, selenium)
- fetch the response data (once the request reaches the server, the server returns the response data)
- parse and extract the data (needs a parsing library: re, BeautifulSoup4, XPath, ...)
- save the data locally (file handling, or a database such as MongoDB); a minimal end-to-end sketch follows this list
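A minimal sketch of those four steps with the requests library; the target url, regex, and output file name are illustrative placeholders, not part of the original notes:

import re
import requests

# 1. send a request
response = requests.get('https://example.com')

# 2. fetch the response data
html = response.text

# 3. parse and extract the data (here: whatever is inside the <title> tag)
titles = re.findall('<title>(.*?)</title>', html, re.S)

# 4. save the data locally
with open('result.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(titles))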

Two, The requests library
1. Installation and use
- open cmd
- type: pip3 install requests (a quick verification sketch follows)
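A quick sanity check that the install worked, assuming Python 3 and that pip put requests on the import path:

import requests

# if the import succeeds the library is installed; print its version to confirm
print(requests.__version__)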

import requests  # import the requests library

# send a request to the Baidu homepage and get the response object
response = requests.get(url='https://www.baidu.com')

# set the character encoding to utf-8
response.encoding = 'utf-8'

# print the response text
print(response.text)

# write the response text to a local file
with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(response.text)
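As a hedged refinement of the snippet above, it is worth checking that the request actually succeeded before writing the file; status_code and raise_for_status() are standard parts of the requests response object:

import requests

response = requests.get(url='https://www.baidu.com')
response.encoding = 'utf-8'

# 200 means the server answered successfully; raise_for_status() raises an
# exception on 4xx/5xx responses instead of silently saving an error page
print(response.status_code)
response.raise_for_status()

with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(response.text)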

2. Crawling a single video: Pear Video (pearvideo.com)

import requests

video_url = 'https://www.pearvideo.com/video_1570475'
response = requests.get(url=video_url)
print(response.text)

# send a request to the video's source address
response = requests.get('https://video.pearvideo.com/mp4/adshort/20190626/cont-1570475-14058832_adpkg-ad_hd.mp4')

# response.content is the raw binary stream, used for pictures, videos and other binary data
print(response.content)

# save the video locally
with open('video.mp4', 'wb') as f:
    f.write(response.content)
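For large videos, response.content buffers the whole file in memory. A hedged alternative, using the documented stream=True and iter_content() options of requests (same source url as above), writes the body to disk in chunks:

import requests

url = 'https://video.pearvideo.com/mp4/adshort/20190626/cont-1570475-14058832_adpkg-ad_hd.mp4'

# stream=True tells requests not to download the whole body up front
response = requests.get(url, stream=True)

with open('video.mp4', 'wb') as f:
    # write the video to disk 1 MB at a time instead of holding it all in memory
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        if chunk:
            f.write(chunk)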

3. Crawling videos in batches

(1) First send a request to the Pear Video homepage
    https://www.pearvideo.com/
    and parse out the id of every video on it, e.g.:
    video_1570302
    using re.findall()
(2) Build each video's detail-page url from its id and request it

import requests
import re  # regular expressions, used to parse text data

# 1. send a request to the Pear Video homepage
response = requests.get('https://www.pearvideo.com/')
print(response.text)

# 2. use a regular expression to extract all the video ids
# argument 1: the matching rule (regex pattern)
# argument 2: the text to parse
# argument 3: the matching mode

res_list = re.findall('<a href="video_(.*?)"', response.text, re.S)
print(res_list)

for v_id in res_list:
    detail_url = 'https://www.pearvideo.com/video_' + v_id
    print(detail_url)

    # send a request to each detail page to get that video's source url
    response = requests.get(url=detail_url)
    # print(response.text)

    # parse and extract the video url from the detail page
    video_url = re.findall('srcUrl="(.*?)"', response.text, re.S)[0]
    print(video_url)

    # video name
    video_name = re.findall('<h1 class="video-tt">(.*?)</h1>', response.text, re.S)[0]
    print(video_name)

    # send a request to the video url to get the binary video stream
    v_response = requests.get(video_url)

    with open('%s.mp4' % video_name, 'wb') as f:
        f.write(v_response.content)
        print(video_name, 'video crawled successfully')
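The loop above fires requests back to back. A small, hedged refinement (the one-second delay is an arbitrary choice, and res_list is stubbed here so the sketch is self-contained) is to pause between detail pages so the crawler does not hammer the server and risk being rate-limited:

import time
import requests

# stand-in ids of the kind the regex above extracts
res_list = ['1570475', '1570302']

for v_id in res_list:
    detail_url = 'https://www.pearvideo.com/video_' + v_id

    # wait one second before each detail-page request to stay polite
    time.sleep(1)
    response = requests.get(url=detail_url)
    print(detail_url, response.status_code)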

 

Three, Crawling the Douban Top 250

Regular-expression building blocks used below:

.    : match any single character, starting from the current position

*    : match the preceding pattern any number of times

?    : stop at the first match instead of searching further

.*?  : non-greedy match

.*   : greedy match

(.*?): extract (capture) the data inside the parentheses (a short greedy vs. non-greedy demo follows)
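A minimal sketch of the greedy vs. non-greedy difference with Python's re module; the sample html string is made up purely for illustration:

import re

html = '<span class="title">Movie A</span><span class="title">Movie B</span>'

# greedy: .* runs to the last </span>, so both titles end up in one match
print(re.findall('<span class="title">(.*)</span>', html, re.S))

# non-greedy: .*? stops at the first </span>, so each title is captured separately
print(re.findall('<span class="title">(.*?)</span>', html, re.S))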

Data to crawl:

movie rank, movie url, movie title, director / starring / genre, rating, number of reviews, synopsis

'''
Main page urls:

https://movie.douban.com/top250?start=0&filter=

https://movie.douban.com/top250?start=25&filter=

https://movie.douban.com/top250?start=50&filter=

1. Send the request

2. Parse the data

3. Save the data
'''

import requests
import re

# the three steps of crawling
# 1. send the request
def get_page(base_url):
    response = requests.get(base_url)
    return response

# 2. parse the text
def parse_index(text):
    res = re.findall(
        '<div class="item">.*?<em class="">(.*?)</em>.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>.*?导演:(.*?)</p>.*?<span class="rating_num".*?>(.*?)</span>.*?<span>(.*?)人评价</span>.*?<span class="inq">(.*?)</span>',
        text, re.S)
    return res

# 3. save the data
def save_data(data):
    with open('douban.txt', 'a', encoding='utf-8') as f:
        f.write(data)
# main + Enter key (editor shortcut that expands to the guard below)
if __name__ == '__main__':

# num = 10
# base_url = 'https://movie.douban.com/top250?start{}&filter='.format(num)
    num = 0
    for line in range(10):
        base_url = f'https://movie.douban.com/top250?start={num}&filter='
        num += 25
        print(base_url)

        # 1. send the request by calling the function
        response = get_page(base_url)

        # 2. parse the text
        movie_list = parse_index(response.text)

        # 3. save the data
        # format the data
        for movie in movie_list:
            # print(movie)

            # unpack the tuple into separate variables
            # movie rank, movie url, movie title, director/starring/genre, rating, number of reviews, synopsis
            v_top,v_url,v_name,v_daoyan,v_point,v_num,v_desc = movie
            # v_top = movie[0]
            # v_url = movie[1]
            movie_concent = f'''
            Movie rank: {v_top}
            Movie url: {v_url}
            Movie title: {v_name}
            Director / starring: {v_daoyan}
            Rating: {v_point}
            Number of reviews: {v_num}
            Synopsis: {v_desc}
            \n
            '''
            print(movie_concent)

            # save the data
            save_data(movie_concent)
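In practice Douban often rejects requests that carry no browser User-Agent header (requests sends its own default). A hedged variant of get_page that passes one explicitly; the header string is just an example browser signature:

import requests

def get_page(base_url):
    # pretend to be an ordinary browser; without a User-Agent the site may refuse the request
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    response = requests.get(base_url, headers=headers)
    return response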

Four, Packet capture analysis

 Open the browser's developer tools (Inspect) ----> select the Network tab
 Find the page that was requested, e.g. xxx.html (the response text)
 1) Request url (the address of the site you visited)
 2) Request method:
        GET:

         send the request directly and fetch the data

         https://www.cnblogs.com/kermitjam/articles/9692597.html
        POST:

         carry user information in the request sent to the target address

         https://www.cnblogs.com/logi
 3) Response status codes:
 2xx: success
 3xx: redirect
 4xx: resource not found
 5xx: server error
 4) Request headers:
 User-Agent: the user agent (proves the request was sent from a real computer and browser)
 Cookies: the logged-in user's real information (proves you are a user of the target site)
 Referer: the previously visited url (proves you navigated over from within the target site)
 5) Request body:
 only POST requests carry a request body (a hedged POST sketch follows this section)
 Form Data
    {
       'user':'monika',
       'pwd':'123'
    }
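A minimal sketch of sending such a POST request with the requests library; the login url is a made-up placeholder and the form fields simply mirror the Form Data example above:

import requests

# form fields exactly as in the Form Data example above (illustrative values only)
form_data = {
    'user': 'monika',
    'pwd': '123',
}

# headers a real browser would also send along with the form
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Referer': 'https://example.com/',
}

# requests.post encodes the dict as an application/x-www-form-urlencoded request body
response = requests.post('https://example.com/login', data=form_data, headers=headers)
print(response.status_code)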
