Preliminary knowledge of crawlers
1. Computer network protocol foundation
A complete network request process is as follows:
After you enter a domain name, the browser first queries a DNS server, which returns the server's IP address. The browser then establishes a TCP connection with the web server and sends an HTTP request over it; the web server returns the response data to the browser, and finally the browser parses and renders the content.
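The four steps can be sketched with nothing but the standard library. The host name and plain port 80 below are illustrative; a real crawler would normally use a library such as requests instead of raw sockets:

```python
import socket

def build_request(host: str, path: str = "/") -> bytes:
    # Compose a minimal HTTP/1.1 GET request by hand.
    return (f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            f"Connection: close\r\n\r\n").encode()

def http_get(host: str, path: str = "/") -> bytes:
    # 1. DNS lookup: resolve the domain name to an IP address.
    ip = socket.gethostbyname(host)
    # 2. Establish a TCP connection to port 80 of the web server.
    with socket.create_connection((ip, 80), timeout=5) as sock:
        # 3. Send the HTTP request.
        sock.sendall(build_request(host, path))
        # 4. Receive the response data the server returns.
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

Parsing the returned bytes is then the browser's (or the crawler's) job.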
The seven-layer OSI model, with typical protocols at each layer:
- Application layer: HTTP, FTP, POP3, DNS
- Presentation layer
- Session layer
- Transport layer: TCP, UDP
- Network layer: IP, ICMP, IGMP
- Data link layer: ARP, RARP
- Physical layer: the physical transmission medium
2. HTML, CSS, JavaScript
The three elements of a webpage are HTML, CSS, and JavaScript:
- HTML is the skeleton that carries the page's content;
- CSS defines the page's styling;
- JavaScript is the script the page runs.
The content we need to crawl is usually part of the page's HTML, so as a rule of thumb: whatever is visible on the page can be crawled.
Browser loading process: build the DOM tree → load sub-resources (external CSS, JS, images, and other assets) → render styles (apply the CSS).
Page elements are generally located by traversing the DOM tree.
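As a rough illustration of the tree structure a parser builds, the standard library's HTMLParser can report each start tag together with its nesting depth (the tiny document below is made up):

```python
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    """Record start tags with their depth to visualize the DOM hierarchy."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.seen = []  # (depth, tag) pairs, in document order

    def handle_starttag(self, tag, attrs):
        self.seen.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

parser = TreePrinter()
parser.feed("<html><body><div><p>hello</p></div></body></html>")
for depth, tag in parser.seen:
    print("  " * depth + tag)
# html
#   body
#     div
#       p
```

This is only a sketch: real pages contain void tags and malformed markup that a production DOM builder has to handle.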
Ajax asynchronous loading
JavaScript requests some data from the server and, when the response arrives, dynamically inserts it into the page. The page is not refreshed, which makes for a good user experience.
The data returned by Ajax may be JSON or a fragment of an HTML page.
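For the JSON case, the returned string is deserialized before use. The payload below is a made-up example of what such an endpoint might return:

```python
import json

# A hypothetical JSON payload from an Ajax endpoint.
payload = '{"code": 0, "data": [{"title": "first post"}, {"title": "second post"}]}'

result = json.loads(payload)  # deserialize the JSON string into Python objects
titles = [item["title"] for item in result["data"]]
print(titles)  # ['first post', 'second post']
```

A crawler targeting such an endpoint can often skip HTML parsing entirely and consume the JSON directly.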
Dynamic vs. static web pages:
- dynamic: the data is exchanged with the backend and can change (e.g. via Ajax)
- static: the data is fixed (changing it means editing the source)
Dynamic pages offer a better user experience, load content partially, are lighter on the server, and scale well; static pages are better for SEO.
GET request and POST request
GET passes parameters in the URL, while POST carries them in the request body.
- GET is harmless when the browser navigates back; going back to a POST resubmits the request
- GET requests can only be URL-encoded, while POST supports multiple encodings
- GET parameters carried in the URL are limited in length; POST parameters are not
- GET is less secure than POST, because the parameters are exposed directly in the URL, so it should not carry sensitive information
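The first point, that GET parameters travel in the URL and are visible there, can be seen offline with urllib (the URL below is a made-up example):

```python
from urllib.parse import urlencode, urlparse, parse_qs

# GET: parameters are carried in the URL query string.
params = {"q": "web crawler", "page": 1}
url = "http://example.com/search?" + urlencode(params)
print(url)  # http://example.com/search?q=web+crawler&page=1

# Anyone who sees the URL can read the parameters straight back out:
print(parse_qs(urlparse(url).query))  # {'q': ['web crawler'], 'page': ['1']}
```

This visibility is exactly why sensitive values belong in a POST body, not a query string.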
Three common content types:
- application/x-www-form-urlencoded
The default for a native browser form POST when no enctype attribute is set. The submitted data is encoded as key1=val1&key2=val2, with both keys and values URL-escaped.
- multipart/form-data
Used when a form uploads files.
- application/json
Tells the server that the message body is a serialized JSON string.
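For the first and third encodings, the request bodies can be produced with the standard library (multipart/form-data is easiest left to a library such as requests, via its files= argument):

```python
import json
from urllib.parse import urlencode

data = {"key1": "val1", "key2": "val2"}

# application/x-www-form-urlencoded: key1=val1&key2=val2, URL-escaped
form_body = urlencode(data)
print(form_body)  # key1=val1&key2=val2

# application/json: a serialized JSON string in the message body
json_body = json.dumps(data)
print(json_body)  # {"key1": "val1", "key2": "val2"}
```

When sending either body, the matching Content-Type header tells the server how to decode it.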
3. Basic crawling methods
1. Collection plan classification
Generally we only collect the specific data we need from the target website. Collection schemes can be classified as:
- collecting over the HTTP protocol, with page parsing
- collecting through an API interface, e.g. app data collection
- using the target website's own API, e.g. Weibo or GitHub
2. requests library
Official document address: https://requests.readthedocs.io/zh_CN/latest/
Installation:
pip install requests
If you use a virtual environment, make sure to install it inside that environment as well, so the project running in it works correctly.
First, fetch the Baidu homepage:
import requests
res = requests.get("http://www.baidu.com")
print(res.text)
This prints the HTML source of the Baidu page. Specific projects will be introduced in detail later.
3. Regular expressions
Regular expressions are for better processing of the obtained strings and more convenient for obtaining the characters we need.
Commonly used regex syntax:
Syntax | Meaning |
---|---|
. | Match any character (except newline) |
^ | Match the start position; in multi-line mode, the start of each line |
$ | Match the end position; in multi-line mode, the end of each line |
* | Match the previous element 0 or more times |
+ | Match the previous element 1 or more times |
? | Match the previous element 0 or 1 times |
{m,n} | Match the previous element m to n times |
\\ | Escape character |
[ ] | Character set |
\| | Logical OR (alternation) |
\b | Match the empty string at the beginning or end of a word |
\B | Match the empty string not at the beginning or end of a word |
\d | Match a digit |
\D | Match a non-digit |
\s | Match any whitespace character |
\S | Match any non-whitespace character |
\w | Match a digit, letter, or underscore |
\W | Match any character that is not a digit, letter, or underscore |
A simple Python example that extracts the birthday with a regex:
import re

# 姓名 = name, 生日 = birthday, 入职日期 = hire date
info = "姓名:zhangsan 生日:1995年12月12日 入职日期:2020年12月12日"
# print(re.findall(r"\d{4}", info))
match_result = re.match(r".*生日.*?(\d{4})", info)
print(match_result.group(1))  # 1995
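The non-greedy .*? in that pattern matters. A greedy .* consumes as much of the string as it can, so the capture group would land on the wrong year; a sketch using the same string:

```python
import re

# 姓名 = name, 生日 = birthday, 入职日期 = hire date
info = "姓名:zhangsan 生日:1995年12月12日 入职日期:2020年12月12日"

# Greedy ".*" swallows as much as possible, so the group captures
# the LAST four-digit run in the string (the hire date's year).
print(re.match(r".*(\d{4})", info).group(1))  # 2020

# Non-greedy ".*?" stops as early as possible, so after the 生日
# (birthday) label the group captures the FIRST four-digit run.
print(re.match(r".*生日.*?(\d{4})", info).group(1))  # 1995
```

Anchoring on the 生日 label plus the non-greedy quantifier is what pins the match to the birthday year.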
4. Beautifulsoup usage
- Installation
(if you are using a virtual environment, you need to switch to the virtual environment for installation)
pip install beautifulsoup4
- Official document
https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
- Simple usage
from bs4 import BeautifulSoup
import requests
baidu = requests.get("http://www.baidu.com")
baidu.encoding = "utf-8"
bs = BeautifulSoup(baidu.text, "html.parser")
title = bs.find("title")
print(title.string)
navs = bs.find_all("img")
for i in navs:
    print(i)
The output is the page title followed by each img tag found.
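Because the markup Baidu actually serves changes over time, here is a self-contained sketch of the same find / find_all calls run against an inline HTML string (the document below is invented):

```python
from bs4 import BeautifulSoup

# A small inline document, so the example runs without network access.
html = """
<html><head><title>demo page</title></head>
<body>
  <a class="nav" href="/news">news</a>
  <a class="nav" href="/map">map</a>
</body></html>
"""
bs = BeautifulSoup(html, "html.parser")

print(bs.find("title").string)  # demo page

# find_all can also filter by tag attributes, e.g. by class:
links = [a["href"] for a in bs.find_all("a", class_="nav")]
print(links)  # ['/news', '/map']
```

The same calls apply unchanged to a page fetched with requests.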
5. Xpath basic syntax
Here we mainly introduce scrapy's Selector.
Installation:
Prebuilt Windows packages: https://www.lfd.uci.edu/~gohlke/pythonlibs/
If installing lxml or scrapy directly fails, download the matching wheel files from the site above in turn and install them with pip:
pip install lxml
pip install Twisted-20.3.0-cp38-cp38-win32.whl
pip install Scrapy-1.8.0-py2.py3-none-any.whl
XPath uses path expressions to navigate XML and HTML documents.
Simple usage:
import requests
from scrapy import Selector
baidu = requests.get("http://www.baidu.com")
baidu.encoding = "utf-8"
html = baidu.text
sel = Selector(text=html)
tag = sel.xpath("//*[@id='lg']/img").extract()[0]
print(tag)
# <img hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270" height="129">
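Scrapy's Selector needs the packages above, but the standard library's xml.etree.ElementTree understands a small subset of XPath, which is enough to illustrate path expressions offline. The snippet below is made up to mimic the Baidu logo markup:

```python
import xml.etree.ElementTree as ET

# A tiny XML fragment imitating the structure the example above targets.
doc = ET.fromstring(
    "<div id='lg'>"
    "<img src='//www.baidu.com/img/bd_logo1.png' width='270'/>"
    "<p class='tip'>hello</p>"
    "</div>"
)

# ".//img" means: any <img> descendant of the current node.
img = doc.find(".//img")
print(img.get("src"))  # //www.baidu.com/img/bd_logo1.png

# A predicate on an attribute, like //*[@id='lg'] in full XPath.
tip = doc.find(".//p[@class='tip']")
print(tip.text)  # hello
```

ElementTree only accepts well-formed XML, so for real, messy HTML you still want lxml or scrapy's Selector.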
6. Extracting elements with CSS selectors
import requests
from scrapy import Selector
baidu = requests.get("http://www.baidu.com")
baidu.encoding = "utf-8"
html = baidu.text
sel = Selector(text=html)
imgs = sel.css("img").extract()
for i in imgs:
    print(i)
# <img hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270" height="129">
# <img src="//www.baidu.com/img/gs.gif">