Troubleshooting: crawling a Jingdong (JD.com) product page returns only a URL instead of the page content

1. Crawl the original page

The page to be crawled is the Jingdong (JD.com) product page at http://item.jd.com/2967929.html.

2. The failing code

Are you using code like the following, just as I was?

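import requests

url = "http://item.jd.com/2967929.html"
try:
    r = requests.get(url)  # no custom headers: requests sends its default User-Agent
    r.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
    print(r.status_code)
    r.encoding = r.apparent_encoding  # decode using the encoding detected from the body
    print(r.text[:1001])  # print only the beginning of the page
except requests.RequestException:
    print("Crawl failed")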

When I ran the code above, the output contained only a single URL, and the page content was not displayed as expected. What is the reason? Let's analyze it.

3. Error analysis

I analyzed the problem in the IDLE interactive environment.

(1) Check the status code and encoding

Checking r.status_code and r.encoding shows nothing unusual, so the request itself appears to succeed. The next thing to consider is whether Jingdong restricts crawlers based on the User-Agent header.
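As a rough sketch of this check outside IDLE, the helper below collects the attributes worth looking at on a response object. The names `summarize_response` and `check_live` are my own, chosen for illustration, and are not part of requests. Keep in mind that a 200 status code only means the server answered; it does not guarantee that the body is the product page, which is why the body length and final URL are included as well.

```python
import requests


def summarize_response(r):
    """Collect the response attributes that are useful when a crawl looks wrong.

    `summarize_response` is a hypothetical helper name used for illustration.
    """
    return {
        "status_code": r.status_code,              # HTTP status returned by the server
        "encoding": r.encoding,                    # encoding derived from the HTTP headers (may be None)
        "apparent_encoding": r.apparent_encoding,  # encoding guessed from the body bytes
        "final_url": r.url,                        # URL after any redirects
        "redirects": [h.status_code for h in r.history],  # status codes of redirect hops
        "body_length": len(r.text),                # number of decoded characters in the body
    }


def check_live(url="http://item.jd.com/2967929.html"):
    """Fetch the page and print the summary (needs network access)."""
    r = requests.get(url, timeout=10)
    for key, value in summarize_response(r).items():
        print(f"{key}: {value}")
```

Calling `check_live()` prints one line per attribute. A very short body alongside a normal status code is a hint that the server sent something other than the real page.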

(2) Print the request headers sent to Jingdong

Printing the request headers shows that the request honestly tells JD.com it comes from a crawler: by default, the User-Agent identifies the requests library rather than a browser. Since JD.com checks the source of incoming requests, it does not return the page content to the crawler.
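The default headers can also be inspected without sending anything. The sketch below prepares the request through a `requests.Session` and prints the headers it would send; after a real request, `r.request.headers` exposes the same information. The variable names are my own.

```python
import requests

url = "http://item.jd.com/2967929.html"

# Prepare the request without sending it, so the headers can be inspected offline.
session = requests.Session()
prepared = session.prepare_request(requests.Request("GET", url))

for name, value in prepared.headers.items():
    print(f"{name}: {value}")

# The default User-Agent names the requests library, not a browser.
default_user_agent = prepared.headers["User-Agent"]
print(default_user_agent.startswith("python-requests/"))
```

The last line prints `True`, since the default value has the form `python-requests/<version>`. A server that filters on User-Agent can recognize such a request immediately.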

(3) Solution

Now that the cause has been found, the solution is straightforward. We only need to build a dictionary containing a User-Agent key-value pair and pass it as the request headers, so that the User-Agent value looks like it comes from a browser.

headers = {"User-Agent": "Mozilla/5.0"}

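Mozilla/5.0 is the token that mainstream browsers such as Firefox and Chrome place at the start of their User-Agent string, so a request carrying it looks like it comes from a standard browser. Real browser User-Agent strings are longer, but this prefix is enough here.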
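To confirm that the override takes effect, the request can be prepared without sending it. This is only a sketch, and the variable names are my own.

```python
import requests

url = "http://item.jd.com/2967929.html"
headers = {"User-Agent": "Mozilla/5.0"}

session = requests.Session()
# Headers passed with the request take precedence over the session defaults.
prepared = session.prepare_request(requests.Request("GET", url, headers=headers))

sent_user_agent = prepared.headers["User-Agent"]
print(sent_user_agent)  # Mozilla/5.0
```

Only the User-Agent is replaced; the other default headers that requests adds, such as Accept-Encoding, are still sent.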

4. Complete code

import requests

url = "http://item.jd.com/2967929.html"
headers = {"User-Agent": "Mozilla/5.0"}
try:
    # JD.com restricts requests by User-Agent, so send a browser-like header
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    print(r.status_code)
    r.encoding = r.apparent_encoding
    print(r.text[:1001])
except requests.RequestException:
    print("Crawl failed")

With the User-Agent header set, the output shows that the page content is now crawled normally.

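For reuse, the same steps can be wrapped in a function. The sketch below is one possible shape, not the only one: the name `fetch_html`, the `timeout` value, and returning `None` on failure are all my own assumptions, added so that a stalled connection or an HTTP error does not crash the caller.

```python
import requests


def fetch_html(url, user_agent="Mozilla/5.0", timeout=10):
    """Return the decoded page text, or None if the request fails.

    Hypothetical helper: the name, defaults, and None-on-failure
    convention are illustrative choices, not part of requests.
    """
    headers = {"User-Agent": user_agent}
    try:
        r = requests.get(url, headers=headers, timeout=timeout)
        r.raise_for_status()               # treat 4xx/5xx responses as failures
        r.encoding = r.apparent_encoding   # decode using the encoding detected from the body
        return r.text
    except requests.RequestException as exc:
        print(f"Crawl failed: {exc}")
        return None
```

A call such as `html = fetch_html("http://item.jd.com/2967929.html")` then yields either the page text or `None`, so the caller should check for `None` before slicing or parsing the result.
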
That's the end of this article. If you spot any errors, please point them out~