Python3 crawlers, step one: fetching basic information from a web page

Note: this series assumes a basic knowledge of the Python 3 language. What crawlers are for will not be explained here; readers who clicked into this series presumably already know what crawlers are and what they can do. Since these are articles published on the Internet, the series is not narrated from beginning to end with the kind of introductions a book would have; this article gets straight into crawler development.

Getting started

The general implementation process of a crawler is as follows:
First, send a request to a URL address; the remote server returns the entire web page. This is the same process a browser goes through when visiting a website: the user enters an address, the browser sends a request to the server, the server returns the requested content, and the browser parses it.
Second, once the request succeeds, you receive the content of the entire web page.
Finally, parse the web page according to your needs and extract the required data using regular expressions or other methods.
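The three steps above can be sketched in plain Python, using a hard-coded page in place of a real network request (the HTML string here is made up for illustration):

```python
import re

# Steps 1-2: pretend we sent a request and the server returned this page.
# A real crawler would obtain this via requests.get(url).text instead.
page = "<html><head><title>Example Page</title></head><body>hello</body></html>"

# Step 3: parse the page and extract the data we need -
# here, a regular expression captures the text inside <title>.
match = re.search(r"<title>(.*?)</title>", page)
title = match.group(1) if match else None
print(title)  # Example Page
```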

Sending a request to get the web page

In practice, sending the request and obtaining the page go hand in hand: once the request succeeds, the page data comes back in the response.
We use the requests library to make web requests.
The code is as follows:

import requests

url = "https://www.baidu.com/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36'
}
html = requests.get(url, headers=headers)
print(html.text)
  • import requests: imports the requests module
  • url = "https://www.baidu.com/": sets the URL to request, here Baidu's home page
  • headers: adds a User-Agent header so the request looks like it comes from a browser rather than a script
  • html = requests.get(url, headers=headers): sends a GET request to url with the headers set above and stores the response in html
  • print(html.text): prints the text of the response html, which is the source code of the web page

Parsing the web page

Next we need the BeautifulSoup library. BeautifulSoup is a flexible and convenient web page parsing library; with bs4 (BeautifulSoup) we can quickly extract common information from a page. For example, to get the title from the page source we just fetched, first import the library:

from bs4 import BeautifulSoup

Then parse it with BeautifulSoup. 'html.parser' names Python's built-in HTML parser, which can parse HTML code, and html.text is the page source to parse as HTML:

val = BeautifulSoup(html.text, 'html.parser')

After parsing, to get the title you can access .title directly:

print(val.title)
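Note that val.title returns the whole <title> element, tags included; .title.string gives just the text inside it. A self-contained illustration on a small hand-written page (the HTML string below is invented for the example):

```python
from bs4 import BeautifulSoup

# A tiny page standing in for html.text from a real request.
doc = "<html><head><title>Example Page</title></head><body></body></html>"

soup = BeautifulSoup(doc, 'html.parser')
print(soup.title)         # <title>Example Page</title>
print(soup.title.string)  # Example Page
```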

The result is the page's <title> element. The complete code is as follows:

import requests
from bs4 import BeautifulSoup

url = "https://www.baidu.com/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36'
}
html = requests.get(url, headers=headers)
val = BeautifulSoup(html.text, 'html.parser')
print(val.title)

If you want to save the fetched page to a file, you can write:

f = open(r'D:\html.html', mode='w')
f.write(html.text)
f.close()

The above code saves the page source to the root of drive D. The complete code is as follows:

import requests
from bs4 import BeautifulSoup

url = "https://www.baidu.com/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36'
}
html = requests.get(url, headers=headers)
val = BeautifulSoup(html.text, 'html.parser')
print(val.title)
f = open(r'D:\html.html', mode='w')
f.write(html.text)
f.close()
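As a side note, the file handling above can also be written with a with statement, which closes the file automatically even if an error occurs. A small sketch, writing to the system temporary directory instead of D:\ so it runs anywhere (the file name is arbitrary):

```python
import os
import tempfile

# Stands in for html.text from a real request.
page_source = "<html><head><title>demo</title></head></html>"

# 'with' closes the file automatically when the block exits.
path = os.path.join(tempfile.gettempdir(), "html.html")
with open(path, mode='w', encoding='utf-8') as f:
    f.write(page_source)

# Read the file back to confirm the write succeeded.
with open(path, encoding='utf-8') as f:
    print(f.read() == page_source)  # True
```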

The code above may run into an encoding mismatch and produce "garbled" output; this can be solved as follows:

f = open(r'D:\html.html', mode='w', encoding="utf-8")

In the open function, just specify the encoding as utf-8, and the saved file opens correctly.
(Because some resources on the page are loaded dynamically and their links are time-sensitive, they may not display when the saved file is opened.)
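The reason utf-8 fixes the garbling: Chinese characters cannot be represented in single-byte encodings, while UTF-8 can encode any Unicode text. A minimal round-trip illustration:

```python
text = "百度一下，你就知道"  # Baidu's page title, used here as sample text

# Encoding to UTF-8 bytes and decoding back recovers the text exactly...
data = text.encode("utf-8")
print(data.decode("utf-8") == text)  # True

# ...whereas a single-byte encoding such as latin-1 cannot represent it.
try:
    text.encode("latin-1")
except UnicodeEncodeError:
    print("latin-1 cannot encode Chinese characters")
```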

With that, the simplest possible crawler is done; the next article will continue exploring crawlers in more depth.

Origin blog.csdn.net/A757291228/article/details/107170282