Crawling JD.com (Jingdong) Reviews: Just Change the URL to Reuse the Code (Code Attached)

This article is reprinted from the WeChat official account 七天小码哥.


 

 

This is a hands-on Python project. The main goal is to use Python to crawl product reviews on JD.com, as shown above: we crawl the reviews of the "little blue book" recommended by the father of Python, collecting the user name, title, comment content, and other information.

 

The URL to crawl is https://item.jd.com/12531181.html, and the crawled results are saved in a CSV file for easy data analysis.

 

 

01

How to set up the crawler environment?

It's not difficult

 

 

Environment: macOS + Python 3.6; IDE: PyCharm. The required modules are as follows:

 

import requests
import re
import json
import csv  # needed later to write the results to a CSV file

 

If Anaconda is installed on your system, the requests module is already installed, but PyCharm may not recognize it.

 

In that case, install it directly through PyCharm's Preferences, as shown in the figure: click the + button and the module can be installed directly.
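To double-check that PyCharm is pointed at an interpreter where the modules are available, here is a quick sanity check (my own addition; the version string will vary on your machine):

# Run this in the PyCharm console: no ImportError means the environment is ready
import requests
import re, json, csv

print(requests.__version__)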

 

 

 

 

02

Crawler analysis is really important

Preparation

 

 

Our goal is to crawl the reviews of the JD.com book 零基础轻松学Python (Learn Python Easily from Zero). Opening the product page, we find that it has many comments.

 

This means we need to fetch multiple pages, so we plan to use a for loop to do it.

 

 

So how do we find the URL of the comments? First, open a browser such as Chrome, then right-click the page and choose Inspect to open the developer tools, as shown below:

 

 

Then click the Network tab, scroll to the comments section on the left side of the page, and finally search for "comment". As shown below, you will find a URL in the red box; that URL is the one we will crawl.

 

 

 

The specific URL is https://sclub.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98vv36&productId=12531181&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1. By inspecting it, we can spot parameters such as page=0 and pageSize=10.
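As a quick way to list every query parameter at once, you can parse the URL with the standard library (a small sketch, not part of the original article):

from urllib.parse import urlparse, parse_qs

url = ('https://sclub.jd.com/comment/productPageComments.action?'
       'callback=fetchJSON_comment98vv36&productId=12531181&score=0'
       '&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1')
# Prints each parameter, e.g. 'page': ['0'], 'pageSize': ['10']
print(parse_qs(urlparse(url).query))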

 

Moreover, when you click to the next page, you will find that page becomes 2 while the other information in the URL does not change, as shown in the figure:

 

 

Therefore, we can build a loop to crawl multiple pages, say 100 of them. The code is as follows:

 

if __name__ == '__main__':
    # loop over pages 0 through 100
    for i in range(101):
        main(start=i)
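Since this fires 101 requests back to back, it may be wise to pause between pages so the anti-crawler measures discussed below are less likely to trigger. A hedged variant of the loop (the one-second pause is my own assumption, not in the original):

import time

if __name__ == '__main__':
    for i in range(101):
        main(start=i)
        time.sleep(1)  # assumption: wait 1s between pages to stay under rate limits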

 

 

03

Actually crawling the comments

Two steps

 

 

Following the analysis in previous crawler articles (hands-on crawling), we split this task into two steps. Step one parses the web page; step two extracts the comments and saves them to a file.

 

To keep the code reusable and concise, we put the two steps into two functions, begain_scraping() and python_comments(). The code is as follows:

 

def main(start):
    """
    Start crawling.
    :param start: page number to crawl
    :return:
    """
    # Step 1: parse the web page
    comments_jd = begain_scraping(start)

    # Step 2: extract the comments and save them to a file
    python_comments(comments_jd)

 

 

 

04

Parsing the web page

Step one

 

 

Parsing the web page means writing begain_scraping(). The code is as follows:

 

 

First, from the URL we are crawling (https://sclub.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98vv36&productId=12531181&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1), we derive the following information:

 

# Build the product comments address
url_jd = 'https://sclub.jd.com/comment/productPageComments.action?callback'
# Query parameters
vari_p = {
    # product ID
    'productId': 12531181,  # replace with the ID of the product you want to crawl
    'score': 0,
    'sortType': 5,
    # page to crawl, passed in as the function's start argument
    'page': start,
    'pageSize': 10,
}

 

To avoid being blocked by anti-crawler measures, we construct headers that disguise the request as a normal browser, then start crawling. The code is as follows:

 

# Headers to evade anti-crawling measures; no need to change these
headers = {
    'cookie': 'shshshfpaJsAhpiXZzNtbFCHZXchb60B240F81702FF',
    'referer': 'https://item.jd.com/11993134.html',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
}

comment_jd = requests.get(url=url_jd, params=vari_p, headers=headers)
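For reference, here is how the fragments above might assemble into the complete begain_scraping() that main() calls. This is a minimal sketch: the original article only shows the pieces, and the final return statement is my addition so that main() actually receives the response:

def begain_scraping(start):
    """Fetch one page of JD comments and return the HTTP response."""
    url_jd = 'https://sclub.jd.com/comment/productPageComments.action?callback'
    vari_p = {
        'productId': 12531181,  # product ID
        'score': 0,
        'sortType': 5,
        'page': start,          # page number passed in from main()
        'pageSize': 10,
    }
    headers = {
        'cookie': 'shshshfpaJsAhpiXZzNtbFCHZXchb60B240F81702FF',
        'referer': 'https://item.jd.com/11993134.html',
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
    }
    comment_jd = requests.get(url=url_jd, params=vari_p, headers=headers)
    return comment_jd  # assumption: returned so python_comments() can parse it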

 

 

05

Crawling and saving the comments

Step two

 

 

Now we crawl the comments and save them, i.e. implement the function python_comments(). This function parses the fetched response and saves the results to a CSV file. This is modular programming: clear logic, concise and efficient code. The code is as follows:

 

def python_comments(comment_resp):
    """
    Parse the response and write the comments to a file.
    :param comment_resp: HTTP response returned by begain_scraping()
    :return:
    """
    comment_js = comment_resp.text

    comment_dict = json.loads(comment_js)
    comments_jd = comment_dict['comments']
    for comment in comments_jd:
        user = comment['nickname']
        color = comment['productColor']
        comment_python = comment['content']

        # append one row per comment to the CSV file
        with open('comments_jd.csv', 'a', newline='') as csv_file:
            rows = (user, color, comment_python)
            writer = csv.writer(csv_file)
            writer.writerow(rows)
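One caveat: if JD wraps the response in the JSONP callback (the body then looks like fetchJSON_comment98vv36({...});), json.loads() will fail on the raw text. That is presumably why re is imported at the top. Here is a hedged sketch of stripping the wrapper before parsing, with strip_jsonp() being my own hypothetical helper:

import json
import re

def strip_jsonp(body):
    """Extract and parse the JSON object from a possibly JSONP-wrapped body."""
    match = re.search(r'\{.*\}', body, re.S)  # everything between the outermost braces
    return json.loads(match.group()) if match else json.loads(body)

# Inside python_comments() you would then write:
# comment_dict = strip_jsonp(comment_resp.text)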

 

 

06

Showing the crawl results

The effect

 

 

First, in the PyCharm console, you can see the information for each crawled page, as shown below:

 

 

In addition, a new CSV file appears under the project directory; that is the file we saved. Open it to see the result:
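If you want to inspect the saved file without leaving Python, you can read it back with the csv module (a small sketch assuming comments_jd.csv sits in the working directory):

import csv

# Print every saved row: (nickname, productColor, content)
with open('comments_jd.csv', newline='') as f:
    for row in csv.reader(f):
        print(row)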

 

 

 

Source: www.cnblogs.com/finer/p/11261830.html