Note: This series of columns assumes a basic knowledge of Python 3. What a crawler is and what it can do will not be explained here; I believe readers who clicked into this series already know. Because these are articles published on the Internet rather than a book, the series is not narrated strictly from beginning to end with the lengthy introductions a book would have. This article gives a quick walkthrough of developing a simple crawler.
Start
The general implementation process of a crawler is as follows:
First, send a request to a URL, and the remote server returns the whole web page. This is the same process as visiting a website with a browser: the user enters an address, the browser sends a request to the server, the server returns the requested content, and the browser parses and renders it.
Second, after the request succeeds, you have the content of the entire page.
Finally, parse the page according to your needs and extract the required data with regular expressions or other methods.
Send a request and get the web page
In general, sending the request and obtaining the page go together: once the request goes through, the page data comes back in the response.
We use the requests library to make web requests.
The code is written as follows:
import requests
url="https://www.baidu.com/"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36'}
html=requests.get(url,headers=headers)
print(html.text)
import requests: imports the requests module.
url="https://www.baidu.com/": sets the URL to request; here it is Baidu.
headers: adds a request header so that the visit looks like it comes from a browser rather than a script.
html=requests.get(url,headers=headers): sends a GET request to the value set as url, with headers as the request header.
print(html.text): prints the text attribute of the returned value html, which is the source code of the web page.
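Beyond .text, the response object returned by requests.get carries other useful attributes. A minimal sketch, assuming the same Baidu URL as above and a working network connection (the timeout and the extra printed attributes are illustrative additions, not part of the original code):

```python
import requests

url = "https://www.baidu.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36'}

# timeout stops the request from hanging forever if the server does not answer
html = requests.get(url, headers=headers, timeout=10)

print(html.status_code)                  # HTTP status code, 200 means success
print(html.encoding)                     # encoding requests guessed from the headers
print(html.headers.get('Content-Type'))  # content type reported by the server
```

Checking status_code before parsing avoids silently working with an error page.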
Parse webpage
Next, we need the BeautifulSoup library. BeautifulSoup is a flexible and convenient web-page parsing library; using bs4 (BeautifulSoup) lets us quickly extract common information from a page. For example, to get the title from the page source we just fetched, first import the library:
from bs4 import BeautifulSoup
Then parse it with BeautifulSoup. 'html.parser' is Python's built-in HTML parser, which can parse HTML code; html.text is the page source to be parsed as HTML, as follows:
val = BeautifulSoup(html.text, 'html.parser')
After parsing, if you want to get the title value, you can directly use .title to get it:
print(val.title)
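Besides .title, BeautifulSoup offers .string to get a tag's text and find_all to collect every matching tag. A self-contained sketch on a small hand-written HTML snippet (the snippet and its links are made up for illustration, not Baidu's real markup):

```python
from bs4 import BeautifulSoup

page = """
<html>
  <head><title>Example page</title></head>
  <body>
    <a href="https://example.com/a">first link</a>
    <a href="https://example.com/b">second link</a>
  </body>
</html>
"""

soup = BeautifulSoup(page, 'html.parser')

print(soup.title)            # the whole tag: <title>Example page</title>
print(soup.title.string)     # just the text: Example page
for link in soup.find_all('a'):   # every <a> tag in the page
    print(link.get('href'))       # the value of its href attribute
```

find_all plus .get is the usual pattern for collecting all links from a fetched page.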
The result is the page's <title> element. The complete code is as follows:
import requests
from bs4 import BeautifulSoup
url="https://www.baidu.com/"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36'}
html=requests.get(url,headers=headers)
val = BeautifulSoup(html.text, 'html.parser')
print(val.title)
If you want to save the fetched page to a file, you can write the following code:
f = open(r'D:\html.html',mode='w')
f.write(html.text)
f.close()
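The open/write/close sequence above works, but if an exception occurs between open and close, the file is never closed. Python's with statement closes the file automatically; a sketch, with html_text standing in for html.text and the file name chosen just for illustration:

```python
# html_text stands in for html.text from the request above
html_text = "<html><head><title>demo</title></head></html>"

# the with statement closes the file automatically, even if an error occurs
with open('saved_page.html', mode='w', encoding='utf-8') as f:
    f.write(html_text)
```

This form needs no explicit f.close() and is the idiomatic way to handle files in Python.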
The above code saves the source code of the web page to the root directory of drive D. The complete code is as follows:
import requests
from bs4 import BeautifulSoup
url="https://www.baidu.com/"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36'}
html=requests.get(url,headers=headers)
val = BeautifulSoup(html.text, 'html.parser')
print(val.title)
f = open(r'D:\html.html',mode='w')
f.write(html.text)
f.close()
The above code may produce "garbled" text because of an encoding mismatch; this can be solved as follows:
f = open(r'D:\html.html',mode='w',encoding="utf-8")
In the open function, just specify the encoding as utf-8; the saved file then opens and displays correctly. Note that because some resources are loaded dynamically and the obtained links are time-sensitive, they are not displayed in the saved page.
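The garbling comes from an encoding mismatch: the page's bytes were produced with one encoding but decoded or saved with another. A minimal self-contained illustration with a Chinese string; note that when requests itself guesses the wrong encoding, setting html.encoding = html.apparent_encoding before reading html.text is the usual fix:

```python
# the bytes a UTF-8-encoded Chinese page would send over the network
raw = "百度一下，你就知道".encode('utf-8')

print(raw.decode('utf-8'))        # decoded with the right encoding: readable text
print(raw.decode('iso-8859-1'))   # decoded with the wrong encoding: mojibake
```

The same bytes, two decodings: only the one matching the original encoding is readable.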
With this, the simplest crawler is done, and the next article will continue with a deeper look at crawlers.