Getting started with Python crawlers

A crawler (also known as a web spider or web robot) is a program that simulates a client sending network requests and receiving the corresponding responses; it automatically collects information from the Internet according to certain rules.

In principle, anything a browser (client) can do, a crawler can do as well.

Based on the scope of the sites they crawl, crawlers are usually divided into:

General crawlers: typically the crawlers used by search engines, which crawl the web broadly

Focused crawlers: crawlers targeting specific websites or topics

Robots protocol: through robots.txt, a website tells search engines which pages may and may not be crawled; this is only a convention, however, and is not technically enforced. A minimal way to check it from Python is sketched below.
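As a minimal sketch (not part of the original article), the snippet below uses Python's standard urllib.robotparser to check whether a given user agent is allowed to crawl a URL; the site and user-agent name are only illustrative.

    from urllib.robotparser import RobotFileParser

    # Download and parse the site's robots.txt
    rp = RobotFileParser("https://example.com/robots.txt")
    rp.read()

    # Ask whether our (hypothetical) crawler may fetch a specific page
    print(rp.can_fetch("MyCrawler", "https://example.com/some/page"))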

A browser actively requests JS, CSS, and other resources; the JS can modify the page content and send further requests, so the page the browser finally renders (what you see under Elements) is the combined result of the CSS, images, JS, and the responses to all of those URL addresses.

A crawler, by contrast, requests only the URL address itself and receives that URL's response; it does not execute JS or fetch additional resources, so the page a crawler sees can differ from the page the browser renders. Data must therefore be extracted from the raw response of the requested URL, as the sketch below illustrates.
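As a simple illustration (assuming the requests library and an example URL), the sketch below shows what a crawler actually receives: only the raw, unrendered HTML of the requested URL, with no JavaScript executed.

    import requests

    response = requests.get("https://example.com")

    # response.text is the raw HTML source; content that the page would
    # normally build with JavaScript is not present here.
    print(response.status_code)
    print(response.text[:500])  # first 500 characters of the raw HTML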

URL format: scheme://host[:port]/path/…/[?query-string][#anchor] (each component can be pulled apart programmatically, as shown in the sketch after this list)

scheme: protocol (e.g. http, https, ftp)

host: IP address or domain name of the server

port: the server's port (may be omitted when it is the protocol's default, e.g. 80 for HTTP or 443 for HTTPS)

path: the path to the resource being accessed

query-string: parameters, i.e. data sent to the HTTP server

anchor: fragment identifier (jumps to a specified anchor position within the page)
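For reference, here is a small sketch (using Python's standard urllib.parse; the URL is only an example) that splits a URL into the components listed above.

    from urllib.parse import urlparse

    parts = urlparse("https://www.example.com:443/path/index.html?key=value#section1")

    print(parts.scheme)    # 'https'              -> protocol
    print(parts.hostname)  # 'www.example.com'    -> host
    print(parts.port)      # 443                  -> port
    print(parts.path)      # '/path/index.html'   -> path
    print(parts.query)     # 'key=value'          -> query-string
    print(parts.fragment)  # 'section1'           -> anchor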

HTTP: Hypertext Transfer Protocol, default port number: 80

HTTPS: HTTP + SSL (Secure Sockets Layer), default port number: 443 (HTTPS is more secure than HTTP, but the encryption adds some performance overhead)

Common HTTP request headers (a sketch of setting them on a request follows this list)

Host (host and port number)

Connection (connection type, e.g. keep-alive)

Upgrade-Insecure-Requests (asks the server to upgrade the request to HTTPS)

User-Agent (browser/client identification string)

Accept (content types the client can accept)

Referer (the page from which the request was made)

Accept-Encoding (compression/encoding formats the client accepts)

Cookie (cookies, used to maintain session state)

X-Requested-With: XMLHttpRequest (marks an Ajax asynchronous request)
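As a minimal sketch (not part of the original article), the snippet below sets a few of these headers on a request with the requests library; the header values and URL are illustrative.

    import requests

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # identify as a browser
        "Referer": "https://www.example.com/",
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Encoding": "gzip, deflate",
    }

    response = requests.get("https://www.example.com/", headers=headers)
    print(response.request.headers)  # the headers that were actually sent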

Common request methods (a GET vs. POST sketch follows the notes below)

GET

POST

Some other notes about GET requests:

GET requests can be cached

GET requests remain in the browser history

GET requests can be bookmarked

GET requests should not be used when handling sensitive data

GET requests have a length limit

GET requests should only be used to retrieve data

Some other notes about POST requests:

POST requests are not cached

POST requests are not kept in browser history

POST requests cannot be bookmarked

POST requests have no restriction on data length
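As a brief illustration (not part of the original article), the sketch below sends a GET request with query-string parameters and a POST request with form data using the requests library; httpbin.org is used only as a placeholder test endpoint.

    import requests

    # GET: parameters are appended to the URL as a query string
    resp_get = requests.get("https://httpbin.org/get", params={"q": "python"})
    print(resp_get.url)          # the final URL ends with ?q=python
    print(resp_get.status_code)

    # POST: the data travels in the request body, not in the URL
    resp_post = requests.post("https://httpbin.org/post", data={"q": "python"})
    print(resp_post.status_code)
    print(resp_post.text)

The original example below goes one step further and sends the request through an authenticated HTTP proxy tunnel; the proxy host, port, username, and password are placeholders to be replaced with real values.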

    # -*- coding: utf-8 -*-
    import requests

    # Target page to fetch
    targetUrl = "http://ip.hahado.cn/ip"

    # Proxy server
    proxyHost = "ip.hahado.cn"
    proxyPort = "39010"

    # Proxy tunnel authentication credentials
    proxyUser = "username"
    proxyPass = "password"

    proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
        "host": proxyHost,
        "port": proxyPort,
        "user": proxyUser,
        "pass": proxyPass,
    }

    proxies = {
        "http": proxyMeta,
        "https": proxyMeta,
    }

    # Send the request through the proxy and print the result
    resp = requests.get(targetUrl, proxies=proxies)

    print(resp.status_code)
    print(resp.text)
