Web Crawler and Automation

Two libraries are used: requests and beautifulsoup4.
A web crawler generally involves two steps:
1) Obtain the web page content (as a string) over the network - the requests library.
2) Parse the obtained content and extract useful information - the beautifulsoup4 library.
Use the pip3 command to install these two libraries:
Under Linux:
sudo pip3 install requests
sudo pip3 install beautifulsoup4

1) Use of the requests library
The following function is a recommended way to obtain the content of a web page:

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)  # fail if the server does not respond within 30 seconds
        r.raise_for_status()               # raise an exception if the status code is not 200
        r.encoding = 'utf-8'               # force utf-8, whatever encoding the page originally used
        return r.text
    except requests.RequestException:      # any network or HTTP error yields an empty string
        return ""

url = 'http://www.baidu.com'
print(getHTMLText(url))
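Two design choices in this function are worth noting: timeout=30 keeps the request from hanging indefinitely on an unresponsive server, and raise_for_status() turns HTTP error codes (4xx/5xx) into exceptions, so any failure falls through to the except branch and the caller simply receives an empty string.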

2) Use of the beautifulsoup4 (or bs4) library
After the requests library has fetched the HTML page as a string, the HTML structure still needs to be parsed to extract useful information, which requires a library for processing HTML and XML: beautifulsoup4.
beautifulsoup4 builds a parse tree based on the HTML and XML grammar and parses the content efficiently.
from bs4 import BeautifulSoup
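
As a minimal sketch of how the two steps fit together (the Baidu homepage from the example above is reused; exactly which tags and links the live page contains is assumed here for illustration), the parse tree can be queried for the page title and every hyperlink:

import requests
from bs4 import BeautifulSoup

# Step 1: fetch the page (same example URL as above)
r = requests.get('http://www.baidu.com', timeout=30)
r.encoding = 'utf-8'

# Step 2: build a parse tree with Python's built-in parser
soup = BeautifulSoup(r.text, 'html.parser')

print(soup.title.string)            # text of the <title> tag
for link in soup.find_all('a'):     # every <a> (hyperlink) tag on the page
    print(link.get('href'))         # the value of its href attribute, or None

'html.parser' is the parser bundled with Python; BeautifulSoup can also use third-party parsers such as lxml when they are installed.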
