Problem solving: crawling a JD.com product page returns only a URL and no page content

1. Crawl the original page

  The page to be crawled is a JD.com product detail page (screenshot omitted).

2. Error code

  Like me, you may have started with code like the following:

import requests
url = "http://item.jd.com/2967929.html"
try:
    r = requests.get(url)
    r.raise_for_status()                # raise an exception for 4xx/5xx responses
    print(r.status_code)
    r.encoding = r.apparent_encoding    # use the encoding sniffed from the content
    print(r.text[:1001])
except requests.RequestException:       # catch requests errors, not a bare except
    print("Crawl failed")

  Running this code prints the status code 200 followed by a single URL (output screenshot omitted). Only a URL comes back; the page content is not displayed as expected. What is the reason? Let's analyze it.

3. Error analysis

  We can analyze the error in the IDLE interactive environment.

(1) View the status code and encoding method

  Checking the status code and the encoding (screenshot omitted) reveals nothing wrong: the request succeeded with status 200. At this point we should suspect that JD.com imposes a User-Agent restriction on crawlers.
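The point of this check can be reproduced offline. A minimal sketch, assuming an invented response body (no network access, and the bytes are purely illustrative):

```python
import requests

# Build a Response by hand instead of hitting the network (illustrative only).
r = requests.models.Response()
r.status_code = 200
r._content = "商品详情页".encode("gbk")  # pretend this is the body the server returned

print(r.status_code)                # 200 -- the status code alone looks healthy
r.encoding = r.apparent_encoding    # let requests sniff the charset from the bytes
print(isinstance(r.text, str))      # the body decodes, so encoding is not the problem
```

Since both the status code and the decoding behave normally, the fault must lie elsewhere, which is why the next step looks at the request headers.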

(2) Output the header information submitted to Jingdong

  Printing the request headers (screenshot omitted) shows that we honestly told JD.com that the request came from a Python crawler: the default User-Agent identifies the requests library. Because JD.com reviews the source of each request, our crawler is refused the real page content.
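You can see the telltale header without sending anything over the network, by preparing the request the same way requests.get() would. A sketch (the URL is the one from this article, but it is never contacted):

```python
import requests

# Prepare the request exactly as requests.get() would, without sending it.
session = requests.Session()
req = requests.Request("GET", "http://item.jd.com/2967929.html")
prepared = session.prepare_request(req)

# The default User-Agent openly identifies us as a Python crawler.
print(prepared.headers["User-Agent"])  # e.g. python-requests/2.x.y
```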

(3) Solution

  Now that the cause of the error has been found, the solution is obvious: construct a dictionary of header key-value pairs and replace the User-Agent with that of an ordinary browser.

headers = {"User-Agent": "Mozilla/5.0"}

  "Mozilla/5.0" is the standard browser identification field: to the server, the visit could be coming from any common browser, such as Firefox or Chrome.
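That the dictionary really overrides the default identifier can again be checked without touching the network, using the same request-preparation trick (a sketch; the URL is never contacted):

```python
import requests

headers = {"User-Agent": "Mozilla/5.0"}
req = requests.Request("GET", "http://item.jd.com/2967929.html", headers=headers)
prepared = requests.Session().prepare_request(req)

# Per-request headers take precedence over the session defaults,
# so the python-requests identifier is replaced by our own value.
print(prepared.headers["User-Agent"])  # Mozilla/5.0
```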

4. Complete code

import requests
url = "http://item.jd.com/2967929.html"
headers = {"User-Agent": "Mozilla/5.0"}
try:
    # JD.com restricts crawlers by User-Agent, so we must send browser-like headers
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    print(r.status_code)
    r.encoding = r.apparent_encoding
    print(r.text[:1001])
except requests.RequestException:
    print("Crawl failed")

The output (screenshot omitted) shows that the page content is now crawled normally.
  This concludes the article; please point out any errors you find~

Origin blog.csdn.net/weixin_44578172/article/details/109302571