Getting started with the most popular Python3 web crawler

Author: Jack Cui
Source: http://cuijiahua.com/blog/2017/10/spider_tutorial_1.html

 

Introduction to web crawlers

 

Web crawlers, also called web spiders, crawl web content according to a web address (URL). A URL is simply the website link we type into the browser; for example, https://www.baidu.com/ is a URL.
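As a quick aside, a URL itself has structure: a protocol, a host, a path, and an optional query string. A minimal sketch using Python's standard urllib.parse module (the example URL is purely illustrative):

```python
# Split a URL into its components with the standard library.
from urllib.parse import urlparse

parts = urlparse('https://www.baidu.com/s?wd=python')
print(parts.scheme)  # 'https'          (protocol)
print(parts.netloc)  # 'www.baidu.com'  (host)
print(parts.path)    # '/s'             (path on the server)
print(parts.query)   # 'wd=python'      (query string)
```

A crawler's job starts from exactly this kind of address: given a URL, fetch the content the server returns for it.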

 

1. Review elements

 

 

Enter the URL in the browser's address bar, then right-click on the web page and choose the inspect option. (Different browsers name it differently: Chrome calls it "Inspect", Firefox calls it "Inspect Element", but the function is the same.)

 

image

 

We can see that a large block of code appears on the right side; this code is called HTML. What is HTML? An easy-to-understand analogy: our genes determine our original appearance, and the HTML returned by the server determines the original appearance of a website.

 

image

 

Why "original" appearance? Because people can get plastic surgery! Hits home, doesn't it? Can a website get plastic surgery too? It can! Please see the picture below:

 

image

 

Could I really have that much money? Obviously not. So how do we give a website a "facelift"? By modifying the HTML information the server returned. Each of us can be a plastic surgeon and modify page information: wherever we inspect an element on the page, the browser locates the corresponding HTML for us, and we can then change that HTML locally.

 

Another small example: we all know that when the browser remembers a password, it displays the password as a row of black dots that cannot be read. Can the password be made visible? Yes, with one small tweak to the page! Take Taobao as an example: right-click the password input box and click Inspect.

 

image

 

 

As you can see, the browser automatically locates the corresponding HTML for us. Change the attribute value password in the figure below to text (edit it directly in the code on the right):

 

 

image

 

The password the browser remembered now appears like this:

 

image

 

What is the point of all this? The browser, acting as a client, obtains information from the server, parses it, and displays it to us. We can modify the HTML locally to give the webpage a facelift, but our modifications are never sent back to the server, and the HTML stored on the server does not change. Refresh the page and it returns to its original appearance. This is just like plastic surgery: we can change superficial things, but we cannot change our genes.

 

2. Simple examples

 

 

The first step of a web crawler is to obtain the HTML of a web page from its URL. In Python3, you can use urllib.request or requests to fetch web pages.

 

  • The urllib library is built into Python; no additional installation is required. As long as Python is installed, this library is available.

  • The requests library is a third-party library that we need to install ourselves.

 

The requests library is powerful and easy to use, so this article uses requests to get the HTML of web pages. The GitHub address of the requests library: https://github.com/requests/requests

 

 

(1) Installing requests

 

In cmd, use the following command to install requests:

 

pip install requests

or:

 

easy_install requests

 

 

(2) Simple example

 

The basic methods of the requests library are as follows:

image
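In case the image above does not display, the core helpers of the requests library can be sketched as follows. This is a summary, not an exhaustive list; each helper mirrors an HTTP verb and returns a Response object:

```python
import requests  # third-party library: pip install requests

# The main entry points mirror the HTTP verbs; each returns a Response.
#   requests.get(url, params=None, **kwargs)   -- fetch a resource
#   requests.post(url, data=None, **kwargs)    -- submit data
#   requests.put(url, data=None, **kwargs)     -- replace a resource
#   requests.delete(url, **kwargs)             -- delete a resource
#   requests.head(url, **kwargs)               -- headers only, no body
#   requests.options(url, **kwargs)            -- ask which verbs are supported

# Confirm that each helper exists and is callable.
for name in ('get', 'post', 'put', 'delete', 'head', 'options'):
    print(name, callable(getattr(requests, name)))
```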

 

Official Chinese tutorial address: http://docs.python-requests.org/zh_CN/latest/user/quickstart.html

The developers of the requests library have provided a detailed Chinese tutorial that is very convenient to consult. This article will not cover all of it, only the parts needed for the hands-on examples.

First, let's look at the requests.get() method, which sends a GET request to the server. Don't worry if you are unfamiliar with GET requests; we can understand it this way: "get" means to fetch or grab, so requests.get() fetches data from the server. Let's look at an example (using www.gitbook.cn) to deepen our understanding:

 

# -*- coding:UTF-8 -*-
import requests

if __name__ == '__main__':
    target = 'http://gitbook.cn/'    # the URL whose HTML we want
    req = requests.get(url=target)   # send a GET request to the server
    print(req.text)                  # print the returned HTML

The one parameter that requests.get() must be given is the url, because we have to tell the GET request who our target is and whose information we want. Run the program and see the result:

 

image

 

On the left is the output of our program; on the right is the information obtained by inspecting elements on www.gitbook.cn. As you can see, we successfully obtained the HTML of the webpage. This is the simplest possible crawler. You may ask: I only crawled this page's HTML, what use is that? Please be patient; upcoming installments cover downloading online novels (a static website) and beautiful wallpapers (a dynamic website) as hands-on projects, so stay tuned.
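Besides the url, requests.get() also accepts a params argument that it encodes into the query string for you. A small offline sketch (the query key 'page' is made up for illustration) uses requests' own Request/prepare machinery to show the final URL that would actually be sent, without touching the network:

```python
import requests

# Prepare a GET request without sending it, to inspect the final URL.
prep = requests.Request('GET', 'http://gitbook.cn/',
                        params={'page': '1'}).prepare()
print(prep.url)     # http://gitbook.cn/?page=1
print(prep.method)  # GET
```

This is handy for debugging: you can see exactly what URL your parameters produce before a single byte goes over the wire.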

 


 


Origin: blog.csdn.net/bigzql/article/details/114867032