1. Crawl the original page
The page to be crawled is the JD.com product page at http://item.jd.com/2967929.html.
2. Incorrect code
Perhaps you are using code like the following, just as I was:
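import requests

url = "http://item.jd.com/2967929.html"
try:
    r = requests.get(url)
    r.raise_for_status()  # raise an exception for 4xx/5xx responses
    print(r.status_code)
    r.encoding = r.apparent_encoding  # decode with the encoding detected from the body
    print(r.text[:1001])
except requests.RequestException:
    print("Crawl failed")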
Running the code above did not give me the product page.
The output contained only a single URL, and the page content was not displayed as expected. What is the reason? Let's analyze it.
3. Error analysis
We analyze the error in the IDLE interactive environment.
(1) Check the status code and encoding
Checking the status code and the encoding shows nothing obviously wrong. The next thing to consider is whether JD.com restricts crawlers based on the User-Agent header.
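The check in this step can be sketched as a small helper. The function names `summarize_response` and `fetch_summary` and the 10-second timeout are assumptions made for illustration; only `status_code`, `encoding`, and `apparent_encoding` come from the requests library itself.

```python
import requests


def summarize_response(r):
    """Collect the fields inspected in this step from a requests.Response."""
    return {
        "status_code": r.status_code,              # 200 means the server answered normally
        "header_encoding": r.encoding,             # encoding taken from the HTTP headers, may be None
        "detected_encoding": r.apparent_encoding,  # encoding guessed from the body bytes
    }


def fetch_summary(url, timeout=10):
    """Fetch url and summarize it. This performs a real network request."""
    return summarize_response(requests.get(url, timeout=timeout))
```

Calling `fetch_summary("http://item.jd.com/2967929.html")` in IDLE would show the same three values in one place. A normal-looking result here only rules out transport and decoding problems; it says nothing about whether the server chose to return the real page.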
(2) Output the header information submitted to JD.com
Printing the request headers shows that the request honestly told JD.com it came from a crawler: the User-Agent field identifies the client as python-requests. Because JD.com screens requests by their source, the crawled content is not returned to us.
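The default headers can also be inspected without sending anything. The sketch below prepares the request the way a default `requests.Session` would and prints the headers it would carry; after a real request, the same information is available as `r.request.headers`. The variable names are illustrative.

```python
import requests

url = "http://item.jd.com/2967929.html"

# Prepare the request with the session defaults, but do not send it.
session = requests.Session()
prepared = session.prepare_request(requests.Request("GET", url))

# Print every header that would be submitted to the server.
for name, value in prepared.headers.items():
    print(f"{name}: {value}")

# The default User-Agent has the form "python-requests/<version>".
default_agent = prepared.headers["User-Agent"]
```

The `python-requests/...` value is what gives the crawler away to a site that inspects the User-Agent field.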
(3) Solution
Now that the cause of the error has been found, the solution is straightforward. We only need to build a dictionary containing one key-value pair and pass it as the request headers, replacing the User-Agent value above with one that a browser would send.
headers = {"User-Agent": "Mozilla/5.0"}
Mozilla/5.0 is the token that mainstream browsers such as Firefox and Chrome put at the start of their User-Agent string, so a request carrying it looks as if it came from an ordinary browser.
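To confirm that the dictionary really replaces the default value, the request can be prepared locally and inspected before anything is sent. This is a minimal sketch; the names `sent_agent` and `kept_defaults` are illustrative.

```python
import requests

url = "http://item.jd.com/2967929.html"
headers = {"User-Agent": "Mozilla/5.0"}

# Merge the custom headers with the session defaults without sending the request.
session = requests.Session()
prepared = session.prepare_request(requests.Request("GET", url, headers=headers))

# The custom value overrides the default "python-requests/<version>".
sent_agent = prepared.headers["User-Agent"]
print(sent_agent)

# The other default headers are kept alongside the overridden one.
kept_defaults = [
    name for name in ("Accept", "Accept-Encoding", "Connection")
    if name in prepared.headers
]
print(kept_defaults)
```

Only the User-Agent field changes; everything else the library normally sends stays in place.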
4. Complete code
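import requests

url = "http://item.jd.com/2967929.html"
headers = {"User-Agent": "Mozilla/5.0"}
try:
    # JD.com restricts requests by User-Agent, so the header must be supplied
    r = requests.get(url, headers=headers)
    r.raise_for_status()  # raise an exception for 4xx/5xx responses
    print(r.status_code)
    r.encoding = r.apparent_encoding  # decode with the encoding detected from the body
    print(r.text[:1001])
except requests.RequestException:
    print("Crawl failed")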
With the header added, the output shows that the page content is now crawled normally.
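The before-and-after behavior can be reproduced without sending traffic to JD.com. The sketch below starts a small local server that screens requests by User-Agent and then fetches it twice. The server, its class name `UserAgentGate`, and both response bodies are hypothetical stand-ins written for this illustration; they are not JD.com's actual logic.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

# Hypothetical bodies: a real page and a short stub that only points elsewhere.
REAL_PAGE = b"<html><title>product page</title></html>"
REDIRECT_STUB = b"<script>window.location.href='http://example.invalid/login'</script>"


class UserAgentGate(BaseHTTPRequestHandler):
    """Toy server that answers 200 either way but varies the body by User-Agent."""

    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        body = REDIRECT_STUB if agent.startswith("python-requests") else REAL_PAGE
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_text(url, headers=None):
    """Fetch url and return the decoded body."""
    with requests.Session() as session:
        session.trust_env = False  # ignore proxy environment variables for this local demo
        r = session.get(url, headers=headers, timeout=5)
        r.raise_for_status()
        return r.text


# Port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), UserAgentGate)
threading.Thread(target=server.serve_forever, daemon=True).start()
local_url = f"http://127.0.0.1:{server.server_port}/"

default_text = fetch_text(local_url)                                # default python-requests agent
browser_text = fetch_text(local_url, {"User-Agent": "Mozilla/5.0"})  # browser-like agent

print(default_text)
print(browser_text)

server.shutdown()
server.server_close()
```

Both requests succeed with status 200, which mirrors why the status code alone did not reveal the problem; only the body differs.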
That is the end of this article. If you spot any errors, please point them out~