Analyzing JD.com underwear sales records to find out what size women really wear!

Today, in my spare time, I wrote a small web crawler example. The crawler scrapes user reviews from JD.com, and by analyzing the data we can learn quite a bit, for example which bra colors are the most popular and the average size of Chinese women (for reference only).

Open Developer Tools → Network on a product's review page, and you will find a request like this:

[Image: the review request shown in the Network panel]

The request takes three main parameters: productId, page, and pageSize. The last two control paging; productId identifies the product, and the review records are fetched through it. So as long as we know each item's productId, we can easily pull its reviews. Next, let's analyze the source code of the search results page.

[Image: search results page source, showing li tags with data-pid attributes]

Each product sits inside an li tag, and every li tag carries a data-pid attribute whose value is that product's productId.
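For illustration, here is a minimal sketch of extracting data-pid values from a sample of the search-page markup; the surrounding class names and ids below are made up, only the data-pid attribute matters.

import re

# A small, made-up sample of the search-page HTML
sample_html = '''
<li class="gl-item" data-pid="100000012345">...</li>
<li class="gl-item" data-pid="100000067890">...</li>
'''
# Pull out every data-pid value with a simple regex
print(re.findall('data-pid="(.*?)"', sample_html, re.S))
# ['100000012345', '100000067890']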

Now that we understand the whole flow, we can start writing the crawler.


First we need to request the search page and collect product ids, so that the review crawler below has productId values to work with. key_word is the search keyword, here 胸罩 (bra).

import requests
import re

"""
Find product ids for a search keyword
"""
def find_product_id(key_word):
    jd_url = 'https://search.jd.com/Search'
    product_ids = []
    # Crawl the first 3 pages of results
    for i in range(1, 4):
        param = {'keyword': key_word, 'enc': 'utf-8', 'page': i}
        response = requests.get(jd_url, params=param)
        # Product ids
        ids = re.findall('data-pid="(.*?)"', response.text, re.S)
        product_ids += ids
    return product_ids

This collects the product ids from the first three result pages into a list; with those ids we can start crawling the reviews.
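A quick usage sketch; the actual ids of course depend on what the search returns at the time.

ids = find_product_id('胸罩')
print(len(ids))   # number of product ids collected from the first 3 pages
print(ids[:3])    # a peek at the first few ids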

By inspecting the preview of the request, we find that the response is a JSON string wrapped in a callback (JSONP, as shown below), so we only need to strip the wrapper characters to get the JSON object we want.

The comments field of that JSON object contains the review records we are after.

[Image: preview of the JSONP response, showing the comments array]
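To make the stripping step concrete, here is a minimal sketch using a made-up one-comment payload; the real response contains many more fields.

import json

# A made-up JSONP response with a single comment
raw = 'fetchJSON_comment98vv53282({"comments": [{"content": "不错", "productColor": "肤色", "productSize": "75B"}]});'
# Strip the callback wrapper to leave plain JSON
cleaned = raw.replace('fetchJSON_comment98vv53282(', '').replace(');', '')
data = json.loads(cleaned)
print(data['comments'][0]['productColor'])   # 肤色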

"""
获取评论内容
"""
def get_comment_message(product_id):
    urls = ['https://sclub.jd.com/comment/productPageComments.action?' \
            'callback=fetchJSON_comment98vv53282&' \
            'productId={}' \
            '&score=0&sortType=5&' \
            'page={}' \
            '&pageSize=10&isShadowSku=0&rid=0&fold=1'.format(product_id, page) for page in range(1, 11)]
    for url in urls:
        response = requests.get(url)
        html = response.text
        # 删除无用字符
        html = html.replace('fetchJSON_comment98vv53282(', '').replace(');', '')
        data = json.loads(html)
        comments = data['comments']
        t = threading.Thread(target=save_mongo, args=(comments,))
        t.start()

This method only builds the review urls for the first 10 pages of a product and puts them in the urls list. It then loops over those urls to fetch the review records page by page, and starts a thread to save the data to MongoDB.

Digging further into the comment records returned by this interface, we find the two fields we want:

  • productColor: Color

  • productSize: Product Size

[Image: a comment record showing the productColor and productSize fields]

import pymongo

# Mongo client
client = pymongo.MongoClient('mongodb://127.0.0.1:27017/')
# jd database
db = client.jd
# product collection (created automatically on first insert)
product_db = db.product

# Save comments to MongoDB
def save_mongo(comments):
    for comment in comments:
        product_data = {}
        # Color (flush_data is the cleaning helper defined below)
        product_data['product_color'] = flush_data(comment['productColor'])
        # Size
        product_data['product_size'] = flush_data(comment['productSize'])
        # Review text
        product_data['comment_content'] = comment['content']
        # Creation time
        product_data['create_time'] = comment['creationTime']
        # Insert into MongoDB
        product_db.insert_one(product_data)


Because each product describes its colors and sizes a little differently, we do some simple data cleaning so the values can be aggregated. The code below is not very Pythonic, but it is only a small demo, so bear with it (a tidier mapping-based sketch follows the block).

# Normalize a color/size description to a small set of categories
def flush_data(data):
    if '肤' in data:
        return '肤色'
    if '黑' in data:
        return '黑色'
    if '紫' in data:
        return '紫色'
    if '粉' in data:
        return '粉色'
    if '蓝' in data:
        return '蓝色'
    if '白' in data:
        return '白色'
    if '灰' in data:
        return '灰色'
    if '槟' in data:
        return '香槟色'
    if '琥' in data:
        return '琥珀色'
    if '红' in data:
        return '红色'
    if 'A' in data:
        return 'A'
    if 'B' in data:
        return 'B'
    if 'C' in data:
        return 'C'
    if 'D' in data:
        return 'D'
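For reference, here is a slightly more Pythonic sketch of the same cleaning step using lookup tables. The categories and their order match the chain of ifs above; flush_data_v2 is just an illustrative name.

COLOR_KEYWORDS = [('肤', '肤色'), ('黑', '黑色'), ('紫', '紫色'), ('粉', '粉色'),
                  ('蓝', '蓝色'), ('白', '白色'), ('灰', '灰色'), ('槟', '香槟色'),
                  ('琥', '琥珀色'), ('红', '红色')]
SIZE_KEYWORDS = [('A', 'A'), ('B', 'B'), ('C', 'C'), ('D', 'D')]

def flush_data_v2(data):
    # Return the first category whose keyword appears in the description
    for keyword, category in COLOR_KEYWORDS + SIZE_KEYWORDS:
        if keyword in data:
            return category
    return None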

With these pieces written, we only need to wire them together as below.

# Create a thread lock
lock = threading.Lock()

# Review-crawling worker
def spider_jd(ids):
    while ids:
        # Acquire the lock before touching the shared list
        lock.acquire()
        if not ids:
            lock.release()
            break
        # Take the first element and remove it from the list,
        # so no product is crawled twice
        product_id = ids.pop(0)
        # Release the lock
        lock.release()
        # Fetch the review content
        get_comment_message(product_id)


product_ids = find_product_id('胸罩')
for i in range(1, 5):
    # Start one review-crawling thread per iteration
    t = threading.Thread(target=spider_jd, args=(product_ids,))
    t.start()

The lock in the code above is there so that multiple threads do not consume the same element of the shared list.
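As an aside, Python's queue.Queue is already thread-safe, so the same idea can be sketched without a manual lock. This is only an alternative sketch, not the code used above; it reuses find_product_id and get_comment_message from earlier.

import queue
import threading

def spider_jd_queue(id_queue):
    while True:
        try:
            # get_nowait raises queue.Empty once all ids are consumed
            product_id = id_queue.get_nowait()
        except queue.Empty:
            break
        get_comment_message(product_id)

id_queue = queue.Queue()
for pid in find_product_id('胸罩'):
    id_queue.put(pid)
for _ in range(4):
    threading.Thread(target=spider_jd_queue, args=(id_queue,)).start()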

After it runs, check MongoDB:

[Image: the saved review documents in MongoDB]
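If you would rather verify from Python than from a GUI client, a quick check might look like this, assuming the local MongoDB instance and the jd.product collection used above.

import pymongo

client = pymongo.MongoClient('mongodb://127.0.0.1:27017/')
product_db = client.jd.product
print(product_db.count_documents({}))   # total number of saved reviews
print(product_db.find_one())            # peek at one document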

Once we have the data, we can present it more intuitively by plotting it with matplotlib.

import pymongo
import matplotlib.pyplot as plt


client = pymongo.MongoClient('mongodb://127.0.0.1:27017/')
# jd database
db = client.jd
# product collection
product_db = db.product
# The colors we want to count
color_arr = ['肤色', '黑色', '紫色', '粉色', '蓝色', '白色', '灰色', '香槟色', '红色']

color_num_arr = []
for i in color_arr:
    num = product_db.count_documents({'product_color': i})
    color_num_arr.append(num)

# Colors used to draw the slices (kept separate from the labels above)
display_colors = ['bisque', 'black', 'purple', 'pink', 'blue', 'white', 'gray', 'peru', 'red']

# labeldistance: how far the labels sit from the center, 1.1 = 1.1 * radius
# autopct: format of the percentage text inside the pie; '%3.1f%%' keeps one decimal place
# shadow: whether the pie has a shadow
# startangle: starting angle; slices are drawn counter-clockwise, and starting at 90 usually looks better
# pctdistance: distance of the percentage text from the center
# patches, l_text, p_text: return values; p_text is the text inside the pie, l_text the labels outside
patches, l_text, p_text = plt.pie(color_num_arr, labels=color_arr, colors=display_colors,
                                  labeldistance=1.1, autopct='%3.1f%%', shadow=False,
                                  startangle=90, pctdistance=0.6)
# Change the font size by iterating over each text object and calling set_size
for t in l_text:
    t.set_size(30)
for t in p_text:
    t.set_size(20)
# Use equal x/y scales so the pie is a circle
plt.axis('equal')
plt.title("内衣颜色比例图", fontproperties="SimHei")
plt.legend()
plt.show()

Running the code, we find that skin tone is the most popular color, followed by black (this hardcore straight guy has no idea whether that's true...).

[Image: pie chart of underwear color proportions]

Next let's look at the size distribution, shown here as a bar chart.

import pymongo
import matplotlib.pyplot as plt

# The cup sizes we want to count
index = ["A", "B", "C", "D"]

client = pymongo.MongoClient('mongodb://127.0.0.1:27017/')
db = client.jd
product_db = db.product

value = []
for i in index:
    num = product_db.count_documents({'product_size': i})
    value.append(num)

plt.bar(index, height=value, color="green", width=0.5)

plt.show()


After running it, we find that B cup is the most common size among the buyers.

[Image: bar chart of the size distribution]


Finally, you are welcome to follow my WeChat public account (python3xxx), where I post different Python goodies every day.



Source: juejin.im/post/5d39d75d518825542956dcd6