1. Introduction to crawlers

Crawlers: a first impression

Definition: a program that automatically fetches pages from the Internet and extracts the information that is valuable to us.

Python crawler architecture

The Python crawler architecture is mainly composed of five parts: the scheduler, the URL manager, the web page downloader, the web page parser, and the application (which works with the valuable data that was crawled).

  • Scheduler: the equivalent of a computer's CPU; it is mainly responsible for coordinating the URL manager, the downloader, and the parser.

  • URL manager: keeps track of both the URLs still to be crawled and the URLs that have already been crawled, preventing the same URL from being fetched repeatedly or the crawl from going in circles. There are three main ways to implement a URL manager: in memory, in a database, or in a cache database.

  • Web page downloader: downloads the page at a given URL and returns its content as a string. Common downloaders are urllib2 (Python 2's official basic module, with support for login, proxies, and cookies; in Python 3 the equivalent is urllib.request) and requests (a third-party package).

  • Web page parser: parses the web page string and extracts the information we need, either by fuzzy text matching or by analyzing the DOM tree. The common parsers are regular expressions (intuitive: the page is treated as a plain string and valuable information is pulled out by pattern matching, but this becomes very difficult when the document is complex), html.parser (built into Python), BeautifulSoup (a third-party package that can use either the built-in html.parser or lxml as its backend), and lxml (a third-party package that can parse both XML and HTML and is the most powerful of the three). html.parser, BeautifulSoup, and lxml all parse the page as a DOM tree.

  • Application: the application built from the useful data extracted from the web pages.
    [Figure: the Python crawler architecture ¹]
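Taken together, the five components can be sketched as one short loop. This is a minimal, offline illustration: the fake_pages dictionary, the UrlManager class, and the function names are made up for this sketch, and the downloader is stubbed with an in-memory dict so the code runs without network access; a real downloader would call urllib.request or requests instead.

```python
import re

# Hypothetical in-memory "Internet" so the sketch runs offline.
fake_pages = {
    "page1": '<a href="page2">next</a><p>data one</p>',
    "page2": '<p>data two</p>',
}

class UrlManager:
    """Tracks URLs still to crawl and URLs already crawled (memory-based)."""
    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    def add(self, url):
        # Prevent repeated crawling: skip URLs we have seen before.
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new(self):
        return bool(self.new_urls)

    def get(self):
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url

def download(url):
    """Web page downloader: return the page content as a string."""
    return fake_pages.get(url, "")

def parse(html):
    """Web page parser: extract new links and useful data with regexes."""
    links = re.findall(r'href="([^"]+)"', html)
    data = re.findall(r"<p>([^<]+)</p>", html)
    return links, data

# Scheduler: coordinates the URL manager, the downloader, and the parser.
manager = UrlManager()
manager.add("page1")
results = []
while manager.has_new():
    url = manager.get()
    links, data = parse(download(url))
    for link in links:
        manager.add(link)
    results.extend(data)

print(sorted(results))  # the "application" part: the collected data
```

The loop keeps running as long as the URL manager holds unvisited URLs, which is exactly the scheduler's job described above.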

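As a concrete example of the built-in html.parser mentioned in the parser bullet, the sketch below subclasses HTMLParser to pull links and paragraph text out of a page string; the page content and the LinkExtractor name are illustrative.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href attributes from <a> tags and text inside <p> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.texts = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)
        elif tag == "p":
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        # Only keep text that appears inside a <p> element.
        if self._in_p:
            self.texts.append(data)

page = '<a href="https://example.com">site</a><p>valuable data</p>'
extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links, extractor.texts)
```

For anything beyond a quick sketch like this, BeautifulSoup or lxml is usually more convenient, since they build a queryable tree instead of firing per-tag callbacks.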

Background knowledge for the later crawler projects: controlling the program's execution flow.

First look at the following code:

def main(var):
    print("hiya", var)

main(1)  # top-level call: runs whether the file is executed or imported

if __name__ == "__main__":
    main(2)  # runs only when the file is executed directly

The result of the execution is:

hiya 1
hiya 2
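Both lines are printed because the file was run directly. If the same file were imported as a module instead, the unguarded main(1) would still run but the guarded main(2) would not. The sketch below demonstrates this by writing the example to a temporary file and importing it; the module name demo_mod is arbitrary.

```python
import importlib.util
import io
import os
import tempfile
from contextlib import redirect_stdout

# The first example, saved as a module file (hypothetical name demo_mod.py).
source = '''
def main(var):
    print("hiya", var)

main(1)

if __name__ == "__main__":
    main(2)
'''

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_mod.py")
    with open(path, "w") as f:
        f.write(source)

    # Import the file as a module: __name__ becomes "demo_mod", not
    # "__main__", so the guarded main(2) call is skipped.
    spec = importlib.util.spec_from_file_location("demo_mod", path)
    module = importlib.util.module_from_spec(spec)
    buf = io.StringIO()
    with redirect_stdout(buf):
        spec.loader.exec_module(module)

    print(buf.getvalue())  # only "hiya 1" was printed
```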

Now look at another example:

print("test1")  # top-level statement: runs first

def Fun():
    print("Fun")

def main():
    print("main")
    Fun()

if __name__ == '__main__':
    main()  # when run directly, prints "main" and then "Fun"

The result of the execution is:

test1
main
Fun

Python is an interpreted language, and the execution order is determined by the following rules:

When a Python program is run directly as a .py file, its __name__ attribute is the string "__main__"; when the file is imported as a module, __name__ is the module name (the file name without the .py extension).

Python executes unindented top-level code (statements that are not inside a function or class definition) in order from top to bottom; function and class bodies only run when they are called.
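Both rules can be checked with a few top-level statements; the names below are illustrative. Top-level code runs immediately in source order, while a function body runs only when the function is called (here, inside the guard).

```python
order = []

order.append("top-level 1")   # runs immediately

def helper():                 # the def statement runs, the body does not
    order.append("inside helper")

order.append("top-level 2")   # runs before helper() is ever called

if __name__ == "__main__":    # true only when run as a script
    helper()

print(order)
```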

In the later projects we will use the following check to determine how the program is being run, and keep the overall control logic in main():

if __name__ == '__main__':

  1. Picture from the rookie tutorial ↩︎


Origin blog.csdn.net/qq_43808700/article/details/113549010