Introduction to the Python Requests Library for Web Crawling, with Usage Examples

1. What is a crawler?

A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Other, less common names include ant, automatic indexer, emulator, or worm.

In plain terms, a crawler is a program that obtains the data you want from web pages, that is, it grabs data automatically.

You can crawl pictures, the videos you want to watch, and any other data you are after. As long as you can access the data through a browser, you can get it with a crawler.

2. The nature of a crawler

A crawler simulates a browser opening a web page and extracts the data we want from that page.

How a browser opens a web page:
when you enter an address in the browser, the server host is located through a DNS server and a request is sent to it. The server processes the request and returns the result to the user's browser, including HTML, JS, CSS and other files. The browser parses these files and finally renders the page the user sees.

Therefore, the page the user sees in the browser is built from HTML code. A crawler obtains this HTML and, by analyzing and filtering it, extracts the resources we want (text, pictures, videos, ...).

3. The basic process of a crawler

Initiate a request
Send a Request to the target site through an HTTP library. The request can carry additional information such as headers; then wait for the server to respond.

Get the response content
If the server responds normally, you get a Response. Its body is the content of the requested page, which may be HTML, a JSON string, binary data (a picture or video), and so on.

Parse the content
The content obtained may be HTML, which can be parsed with regular expressions or a page-parsing library; it may be JSON, which can be converted directly into a JSON object for analysis; or it may be binary data, which can be saved or processed further.

Save the data
There are various ways to save the data: as plain text, in a database, or as a file in a specific format. A minimal sketch covering all four steps is shown below.
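The sketch below strings the four steps together. It is only an illustration, assuming a hypothetical target URL and a throwaway regular expression for the parsing step:

import re
import requests

url = "https://example.com/"                            # hypothetical target page
r = requests.get(url, timeout=10)                       # 1. initiate the request
r.raise_for_status()                                    # 2. get the response, fail on a bad status code
r.encoding = r.apparent_encoding

titles = re.findall(r"<title>(.*?)</title>", r.text)    # 3. parse the HTML with a regular expression
print(titles)

with open("page.html", "w", encoding="utf-8") as f:     # 4. save the data to a local file
    f.write(r.text)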

4. What is Requests?

Requests is an HTTP library written in Python on top of urllib and released under the Apache2 Licensed open source license.
If you have read the earlier article on the urllib library, you will have noticed that urllib is rather inconvenient. Requests is more convenient than urllib and saves us a lot of work (after using requests, you basically never want to go back to urllib). In short, requests is the simplest and easiest-to-use HTTP library implemented in Python, and it is the recommended library for crawlers.
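As a quick comparison (the URL here is only an example), fetching the same page with urllib and with requests looks roughly like this:

from urllib import request as urllib_request
import requests

url = "http://www.baidu.com"

# urllib: open the URL and decode the raw bytes yourself
with urllib_request.urlopen(url) as resp:
    html_urllib = resp.read().decode("utf-8")

# requests: one call; encoding detection and status handling helpers are built in
r = requests.get(url)
r.encoding = r.apparent_encoding
html_requests = r.text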

A default Python installation does not include the requests module; it needs to be installed separately with pip.

5. Basic knowledge of the Requests library

We get a returned object by calling a method of the Requests library. Two objects are involved: the Request object and the Response object.

The Request object represents the request we send for a URL, and the Response object holds the content that comes back.
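A small example of inspecting both objects (the URL is only a placeholder):

import requests

r = requests.get("http://www.baidu.com")   # r is the Response object
print(type(r))                             # <class 'requests.models.Response'>
print(r.status_code)                       # status code returned by the server
print(r.request.url)                       # the Request object that was actually sent
print(r.request.headers)                   # headers attached to that request
print(r.headers)                           # headers returned in the response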

6. Installing Requests

1. It is strongly recommended that you install it with pip: pip install requests

2. Installing in PyCharm: File → Default Settings → Project Interpreter → search for requests → Install Package → OK

7. Usage examples of the Requests library

1. Crawling a JD.com product page: the general crawling framework

import requests

url = "https://item.jd.com/2967929.html"
try:
    r = requests.get(url)                # send the GET request
    r.raise_for_status()                 # raise an exception for non-200 status codes
    r.encoding = r.apparent_encoding     # guess the encoding from the page content
    print(r.text[:1000])                 # print the first 1000 characters of the page
except:
    print("Crawl failed!")

2. Crawling an Amazon product page: modifying the headers field to simulate a browser request

import requests

url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
try:
    kv = {'user-agent': 'Mozilla/5.0'}   # pretend to be a browser
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.status_code)
    print(r.text[:1000])
except:
    print("Crawl failed")

3. Submitting search keywords to Baidu/360: passing keywords through the params argument

Baidu's keyword interface: http://www.baidu.com/s?wd=keyword
360's keyword interface: http://www.so.com/s?q=keyword

import requests

url = "http://www.baidu.com/s"
try:
    kv = {'wd': 'Python'}                # the keyword is passed as a URL parameter
    r = requests.get(url, params=kv)
    print(r.request.url)                 # the full URL that was actually requested
    r.raise_for_status()
    print(len(r.text))
    print(r.text[500:5000])
except:
    print("Crawl failed")

4. Crawling and saving a picture from the web: combining the os library with file operations

import requests
import os

url = "http://tc.sinaimg.cn/maxwidth.800/tc.service.weibo.com/p3_pstatp_com/6da229b421faf86ca9ba406190b6f06e.jpg"
root = "D://pics//"
path = root + url.split('/')[-1]         # file name taken from the last segment of the URL
try:
    if not os.path.exists(root):
        os.mkdir(root)                   # create the directory if it does not exist
    if not os.path.exists(path):
        r = requests.get(url)
        with open(path, 'wb') as f:      # write the binary content to the file
            f.write(r.content)
        print("File saved successfully")
    else:
        print("File already exists")
except:
    print("Crawl failed")

Finally: exception handling

When you are not sure which error may occur, use try...except to catch all requests exceptions:

import requests
from requests.exceptions import ReadTimeout, HTTPError, RequestException

try:
    response = requests.get('http://www.baidu.com', timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    # the server did not send data within the timeout
    print('timeout')
except HTTPError:
    # an HTTP error status, e.g. raised by raise_for_status()
    print('httperror')
except RequestException:
    # base class that catches any other requests exception
    print('reqerror')
