Getting Started with Python Crawlers, Part 2: Prerequisite Knowledge

1. Computer network protocol basics

A complete network request process is as follows:
After you enter a domain name, the browser first queries a DNS server, which returns the site's IP address. The browser then establishes a TCP connection with the web server, sends an HTTP request over it, and the server returns the response data. Finally, the browser parses and renders the content.
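
The same flow can be traced in a few lines of Python. A minimal sketch, assuming only the standard-library socket module and the requests package introduced later in this article:

import socket

import requests

ip = socket.gethostbyname("www.baidu.com")  # DNS lookup: domain name -> IP address
print(ip)

res = requests.get("http://www.baidu.com")  # TCP connection + HTTP GET handled by requests
print(res.status_code)                      # 200 if the server responded normally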

Seven-layer network protocol:

  • Application layer: HTTP, FTP, POP3, DNS
  • Presentation layer
  • Session layer
  • Transport layer: TCP, UDP
  • Network layer: ICMP, IP, IGMP
  • Data link layer: ARP, RARP
  • Physical layer: the physical transmission medium

2. HTML, CSS, JavaScript

A webpage is built from three things: HTML, CSS, and JavaScript.
HTML is the skeleton that carries the page's content;
CSS defines the page's style;
JavaScript is the script that runs in the page.

The content we want to crawl is usually part of the page's HTML, so as a rule of thumb: whatever you can see on the page, you can crawl.

Browser loading process:
build the DOM tree → load sub-resources (external CSS, JS, images, etc.) → render styles (apply the CSS)

Page elements are generally located by walking the DOM tree.

Ajax asynchronous loading

With Ajax, JavaScript sends a request to the server in the background and dynamically inserts the returned data into the page. The page is never reloaded, which makes for a smooth user experience.
The data returned by Ajax may be JSON or a fragment of an HTML page.
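
For crawling, this means the Ajax endpoint can often be called directly instead of parsing the rendered page. A minimal sketch, assuming a hypothetical JSON endpoint https://example.com/api/list (on a real site you would find the actual URL in the browser's Network panel):

import requests

# Hypothetical endpoint used purely for illustration.
res = requests.get("https://example.com/api/list", params={"page": 1})

# If the endpoint returns JSON, it can be decoded straight into Python objects.
data = res.json()
print(data)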

Dynamic web pages vs. static web pages:

  • dynamic: the data is exchanged with the backend and can change at runtime (Ajax)
  • static: the data is fixed; changing it means editing the page source

Dynamic pages give a better user experience (partial loading), are friendlier to the server, and scale better; static pages are better for SEO.

GET request and POST request

GET parameters are carried in the URL, while POST passes parameters in the request body. The main differences are listed below; a short requests sketch follows the list.

  1. GET is harmless when the browser navigates back, while POST will resubmit the request
  2. GET parameters can only be URL-encoded, while POST supports multiple encodings
  3. GET parameters in the URL are limited in length, while POST parameters are not
  4. GET is less secure than POST because the parameters are exposed directly in the URL, so it should not carry sensitive information
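
A minimal sketch of both request types with requests, using the public httpbin.org test service as the target:

import requests

# GET: parameters are appended to the URL as a query string.
r1 = requests.get("https://httpbin.org/get", params={"kw": "python"})
print(r1.url)           # https://httpbin.org/get?kw=python

# POST: parameters travel in the request body, not in the URL.
r2 = requests.post("https://httpbin.org/post", data={"kw": "python"})
print(r2.request.body)  # kw=python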

Three common Content-Types (a requests sketch follows this list):

  1. application/x-www-form-urlencoded
    The browser's native form submission: if the form's enctype attribute is not set, POST data is ultimately submitted as application/x-www-form-urlencoded. The body is encoded as key1=val1&key2=val2, with both keys and values URL-encoded.

  2. multipart/form-data
    Used by forms that upload files.

  3. application/json
    Tells the server that the message body is a serialized JSON string.
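
With requests, these three content types correspond to three different keyword arguments. A short sketch, again against the httpbin.org test service:

import requests

url = "https://httpbin.org/post"

# 1. application/x-www-form-urlencoded: the default when a dict is passed to data=
requests.post(url, data={"key1": "val1", "key2": "val2"})

# 2. multipart/form-data: used automatically when uploading files via files=
requests.post(url, files={"file": ("demo.txt", b"hello")})

# 3. application/json: json= serializes the dict and sets the Content-Type header
requests.post(url, json={"key1": "val1"})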

3. Basic crawling methods

1. Types of collection schemes

Generally we only collect the specific data we need from the target website. Collection schemes fall into three categories:

  1. Collect over the HTTP protocol and analyze the pages
  2. Collect via an API interface, e.g. app data collection
  3. Use the target website's own API, e.g. Weibo, GitHub

2. requests library

Official document address: https://requests.readthedocs.io/zh_CN/latest/
Installation:

pip install requests

If you use a virtual environment, make sure to install it inside that environment so the project runs correctly.

First, crawl the Baidu homepage:

import requests

res = requests.get("http://www.baidu.com")
print(res.text)

Running this prints the HTML source of the Baidu homepage. The specifics will be covered in detail later.
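
Besides res.text, the response object exposes a few fields worth checking before parsing. A short sketch:

import requests

res = requests.get("http://www.baidu.com")

print(res.status_code)  # 200 means the request succeeded
print(res.encoding)     # encoding guessed from the response headers
res.encoding = "utf-8"  # set it explicitly to avoid garbled Chinese text
print(res.text[:200])   # first 200 characters of the decoded HTML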

3. Regular expressions

Regular expressions make it easier to process the strings we fetch and to pull out exactly the characters we need.
Commonly used regex syntax:

Syntax    Effect
.         Match any character (except newline)
^         Match the start position; in multi-line mode, the start of each line
$         Match the end position; in multi-line mode, the end of each line
*         Match the preceding element 0 or more times
+         Match the preceding element 1 or more times
?         Match the preceding element 0 or 1 times
{m,n}     Match the preceding element m to n times
\         Escape character
[ ]       Character set
|         Logical OR
\b        Match the empty string at the beginning or end of a word
\B        Match the empty string not at the beginning or end of a word
\d        Match a digit
\D        Match a non-digit
\s        Match any whitespace character
\S        Match any non-whitespace character
\w        Match a digit, letter, or underscore
\W        Match any character that is not a digit, letter, or underscore

A simple example of using a regex in Python to extract a birth year:

import re

info = "姓名:zhangsan 生日:1995年12月12日 入职日期:2020年12月12日"

# re.findall(r"\d{4}", info) would return every 4-digit number: ['1995', '2020']
# Use a non-greedy match after "生日" (birthday) to capture only the birth year.
match_result = re.match(r".*生日.*?(\d{4})", info)
print(match_result.group(1))  # 1995

4. BeautifulSoup usage

  1. Installation
    (if you are using a virtual environment, switch to it before installing)

pip install beautifulsoup4

  2. Official documentation
    https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
  3. Simple usage
from bs4 import BeautifulSoup
import requests

baidu = requests.get("http://www.baidu.com")
baidu.encoding = "utf-8"  # decode the response as UTF-8

# Parse the HTML with the built-in parser and grab the <title> tag.
bs = BeautifulSoup(baidu.text, "html.parser")
title = bs.find("title")
print(title.string)

# find_all returns every matching tag, here all <img> tags on the page.
imgs = bs.find_all("img")
for i in imgs:
    print(i)

Result: the page title and every <img> tag on the page are printed.
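
Each tag BeautifulSoup returns gives dictionary-style access to its attributes, and select() accepts CSS selectors. A small sketch continuing from the bs object above:

# Continuing from the example above: read attributes from the matched tags.
for img in bs.find_all("img"):
    print(img.get("src"))       # attribute value, or None if the attribute is missing

# select() takes a CSS selector and returns a list of matching tags.
for a in bs.select("a")[:5]:
    print(a.get_text(), a.get("href"))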

5. XPath basic syntax

Here we mainly use the Selector class from Scrapy.

Installation:
Python package download: https://www.lfd.uci.edu/~gohlke/pythonlibs/
If installing lxml directly fails, or the scrapy installation is unsuccessful, you can download the installation packages from the site above and then install them with pip:

pip install lxml
pip install Twisted-20.3.0-cp38-cp38-win32.whl
pip install Scrapy-1.8.0-py2.py3-none-any.whl

XPath uses path expressions to navigate XML and HTML documents.
Simple usage:

import requests
from scrapy import Selector

baidu = requests.get("http://www.baidu.com")
baidu.encoding = "utf-8"
html = baidu.text

# Build a Selector from the HTML and select the <img> inside the element with id="lg".
sel = Selector(text=html)
tag = sel.xpath("//*[@id='lg']/img").extract()[0]
print(tag)
# <img hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270" height="129">
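
A few more common XPath expressions, shown against a small inline snippet so the result does not depend on Baidu's current page structure (the snippet and element names below are made up for illustration):

from scrapy import Selector

html = """
<div id="main">
  <ul class="nav">
    <li><a href="/news">news</a></li>
    <li><a href="/map">map</a></li>
  </ul>
</div>
"""
sel = Selector(text=html)

print(sel.xpath("//ul[@class='nav']/li/a/text()").extract())            # ['news', 'map']
print(sel.xpath("//a/@href").extract())                                 # ['/news', '/map']
print(sel.xpath("//a[contains(@href, 'map')]/text()").extract_first())  # map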

6. Extracting elements with CSS selectors


import requests
from scrapy import Selector

baidu = requests.get("http://www.baidu.com")
baidu.encoding = "utf-8"
html = baidu.text

sel = Selector(text=html)

imgs = sel.css("img").extract()
for i in imgs:
    print(i)

# <img hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270" height="129">
# <img src="//www.baidu.com/img/gs.gif">
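
Scrapy's Selector also extends standard CSS with ::text and ::attr(name) pseudo-elements, which is often handier than extracting whole tags. A short sketch continuing from the sel object above:

# Continuing from the example above: pull out attribute values and text directly.
srcs = sel.css("img::attr(src)").extract()
print(srcs)                                  # the src attribute of every <img> tag

title = sel.css("title::text").extract_first()
print(title)                                 # the text of the <title> tag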
