Basic knowledge needed for web crawlers



Preface

As a beginner, I am slowly starting to learn about crawlers. The following is some basic knowledge I found on the Internet that I thought was well written, so I organized it and saved it as a note. Thank you.

1. What is a crawler

Crawler: a program that automatically grabs information from the Internet, extracting the data that is valuable to us.

2. Python crawler architecture

The Python crawler architecture is mainly composed of five parts: the scheduler, the URL manager, the web page downloader, the web page parser, and the application (the valuable data that is crawled).

Scheduler: equivalent to the CPU of a computer; it is mainly responsible for coordinating the URL manager, the downloader, and the parser.

URL manager: keeps track of the URLs to be crawled and the URLs that have already been crawled, preventing the same URL from being crawled repeatedly or in a loop. A URL manager is mainly implemented in one of three ways: in memory, in a database, or in a cache database.
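To make this concrete, here is a minimal sketch of an in-memory URL manager, using two Python sets to separate URLs waiting to be crawled from URLs that have already been crawled (the class and method names are my own, not from any particular library):

```python
# Minimal in-memory URL manager sketch: two sets ensure that the same
# URL is never crawled twice.
class UrlManager:
    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already crawled

    def add_new_url(self, url):
        # Only accept URLs we have never seen before
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Hand out one URL and remember that it has been crawled
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```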

Web page downloader: downloads a web page given a URL and converts the page into a string. Common downloaders are urllib2 (Python 2's standard-library module, which supports login, proxies, and cookies; in Python 3 it is urllib.request) and requests (a third-party package). Here I prefer to use the requests package.
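As an illustration, a minimal downloader using the requests package might look like the sketch below; the User-Agent header and timeout value are illustrative assumptions:

```python
import requests

# Minimal downloader sketch: pass in a URL, get the page back as a string.
def download(url):
    headers = {"User-Agent": "Mozilla/5.0"}   # pretend to be a normal browser
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        return response.text                  # the web page as a string
    except requests.RequestException:
        return None                           # download failed
```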

Web page parser: parses the web page string so that we can extract useful information according to our requirements; it can also analyze the page as a DOM tree. Common parsers are: regular expressions (intuitively, the web page is treated as a string and valuable information is extracted by fuzzy matching; when the document is complex, extracting data this way becomes very difficult), html.parser (built into Python), BeautifulSoup (a third-party package that can use Python's built-in html.parser or lxml as its underlying parser, and is more powerful than either on its own), and lxml (a third-party package that can parse both XML and HTML). html.parser, BeautifulSoup, and lxml all parse the page as a DOM tree.
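The sketch below parses the same small, made-up HTML snippet twice, once with a regular expression and once with BeautifulSoup's DOM-tree approach, to show the difference:

```python
import re
from bs4 import BeautifulSoup

# A made-up HTML snippet used only for illustration.
html = '<div class="title"><a href="/article/1">Hello crawler</a></div>'

# Regular expression: fuzzy matching on the raw string
match = re.search(r'<a href="(.*?)">(.*?)</a>', html)
if match:
    print(match.group(1), match.group(2))    # /article/1 Hello crawler

# BeautifulSoup: build a DOM tree (here with the built-in html.parser;
# "lxml" could be passed instead if it is installed)
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")
print(link["href"], link.get_text())         # /article/1 Hello crawler
```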

Application: the application built from the useful data extracted from the web pages.
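Tying the five parts together, a minimal scheduler loop might look like the following sketch; it reuses the UrlManager and download() sketches above, and the seed URL and page limit are placeholders:

```python
from bs4 import BeautifulSoup

# Minimal scheduler sketch: pull a URL from the URL manager, hand it to
# the downloader, pass the page to the parser, and collect the data.
def crawl(seed_url, max_pages=10):
    manager = UrlManager()
    manager.add_new_url(seed_url)
    results = []                              # the "application" data

    while manager.has_new_url() and len(results) < max_pages:
        url = manager.get_new_url()
        html = download(url)
        if html is None:
            continue
        soup = BeautifulSoup(html, "html.parser")
        results.append(soup.title.get_text() if soup.title else url)
        # Feed newly found links back to the URL manager
        # (in practice, relative links would need urljoin() first)
        for link in soup.find_all("a", href=True):
            manager.add_new_url(link["href"])

    return results
```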

3. Basic knowledge required

Basic Python syntax and data structures; file and database knowledge; functions and object-oriented programming; exception handling; concurrency. In my opinion, with just some basics of the C language, the simplest crawler can also be built.

Related concepts: the HTTP protocol; GET/POST; cookies, User-Agent, proxies, and so on (the GET method is used to crawl pages, the POST method is used to log in; see the sketch after the module list below).

Modules: urllib / requests / scrapy; parsing: re / bs4 / XPath / CSS selectors.
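To illustrate these concepts, the hedged sketch below sends a GET request with a custom User-Agent to crawl a page and a POST request to log in, letting a requests session keep the returned cookie; the URLs and form field names are placeholders:

```python
import requests

# Hedged sketch: GET to crawl, POST to log in.
# The example.com URLs and the form field names are placeholders.
session = requests.Session()              # the session stores cookies for us
headers = {"User-Agent": "Mozilla/5.0"}   # pretend to be a normal browser

# GET: crawl a listing page
page = session.get("http://example.com/list", headers=headers, timeout=10)
print(page.status_code, len(page.text))

# POST: submit a login form; the cookie returned by the server is kept
# by the session and sent automatically on later requests
login_data = {"username": "user", "password": "secret"}
session.post("http://example.com/login", data=login_data,
             headers=headers, timeout=10)
```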

4. Quickly build a crawler

Detailed explanation of HTTP: use the GET/POST methods, the web browser, and packet-capture tools to analyze the HTTP protocol and locate page information;
Detailed explanation of the urllib module: the GET method to crawl, the POST method to log in;
Extracting page information with regular expressions and the bs4 module;
Anti-scraping measures and how to handle them;
Using proxies;
Implementing a high-concurrency proxy crawler;
The requests module and file upload (a short sketch follows this list).
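As a small taste of the last topics in this list, the sketch below shows how the requests package routes a request through a proxy and uploads a file as multipart/form-data; the URLs, proxy address, and file name are placeholders:

```python
import requests

# Proxy use: the request is routed through the proxy server.
proxies = {"http": "http://127.0.0.1:8080"}   # placeholder proxy address
resp = requests.get("http://example.com", proxies=proxies, timeout=10)
print(resp.status_code)

# File upload: requests builds a multipart/form-data body from `files`.
with open("report.txt", "rb") as f:           # placeholder file name
    resp = requests.post("http://example.com/upload",
                         files={"file": f}, timeout=10)
print(resp.status_code)
```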

HTTP: the Hypertext Transfer Protocol, the data transfer protocol between client and server (based on TCP). The server's default port is 80. The browser first performs DNS resolution of the domain name to obtain an IP address, then connects to the server to form a path over which data can be transmitted, and finally requests the HTML page.
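The following sketch mimics that flow at the lowest level with Python's socket module: resolve the domain name via DNS, open a TCP connection to port 80, send a GET request, and read back the response (example.com is used as a stand-in host):

```python
import socket

host = "example.com"                       # stand-in host
ip = socket.gethostbyname(host)            # DNS resolution -> IP address
print("resolved", host, "to", ip)

sock = socket.create_connection((ip, 80), timeout=10)   # TCP, port 80
request = (
    "GET / HTTP/1.1\r\n"
    f"Host: {host}\r\n"
    "Connection: close\r\n"
    "\r\n"
)
sock.sendall(request.encode())             # send the raw HTTP request

response = b""
while True:                                # read until the server closes
    chunk = sock.recv(4096)
    if not chunk:
        break
    response += chunk
sock.close()

# Status line, headers, and the beginning of the HTML
print(response.decode(errors="replace")[:200])
```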
