[Python Crawler + Data Analysis] Collecting Data from an E-commerce Platform for Visual Presentation


Foreword

With the rise of e-commerce platforms, more and more people shop online. For e-commerce platforms, data such as product information, prices, and reviews are extremely valuable, so scraping this data is a worthwhile task. This article introduces how to write a Python crawler that scrapes product information, prices, reviews, and other data from an e-commerce platform.


1. Preparation

Before writing the crawler, we need to prepare the tools and environment:

Python 3.8
PyCharm
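
Scrapy and the MySQL driver used later are not part of the standard library. Assuming a pip-based setup, they can be installed like this (Pillow is included because Scrapy's image pipeline, used later, depends on it):

pip install scrapy pymysql Pillow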

2. Analyze the target website

Before writing the crawler, we need to analyze the structure and data of the target website. In this article, we scrape product information, prices, and reviews from JD.com.

1. Product information

   A product's listing includes the product name, number, category, brand, model, specification, place of origin, weight, packaging, and similar details. This information can be found on the product detail page.

2. Price

   Price data includes the original price, promotional price, discount, and so on. This information can also be found on the product detail page.

3. Reviews

   Product reviews on JD.com include the review text, user pictures, and follow-up reviews. This information is likewise shown on the product detail page.

3. Write a crawler program

After analyzing the structure and data of the target website, we can start writing the crawler. In this article, we use the Scrapy framework and save the scraped data to a MySQL database.

  1. Create a Scrapy project

First, we need to create a Scrapy project. Enter the following command at the command line:

scrapy startproject jingdong

This will create a Scrapy project called jingdong.
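
Scrapy generates a standard project skeleton, roughly like the following (the exact layout can vary slightly between Scrapy versions):

jingdong/
    scrapy.cfg          # deployment configuration
    jingdong/
        __init__.py
        items.py        # Item definitions (step (1) below)
        middlewares.py
        pipelines.py    # item pipelines (step (4) below)
        settings.py     # project settings
        spiders/
            __init__.py # spider modules live here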

  2. Create a crawler

Next, we need to create a crawler. Enter the following command at the command line:

scrapy genspider jingdong_spider jd.com

This creates a spider named jingdong_spider that is allowed to crawl the domain jd.com.
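
The genspider command writes a minimal template into spiders/jingdong_spider.py, roughly like this (the exact template varies by Scrapy version); note that the code in the next step renames the spider to jingdong, which is the name used with scrapy crawl at the end:

import scrapy

class JingdongSpiderSpider(scrapy.Spider):
    name = 'jingdong_spider'
    allowed_domains = ['jd.com']
    start_urls = ['http://jd.com/']

    def parse(self, response):
        pass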

  3. Write the crawler code

After creating the crawler, we need to write the crawler code. In the Scrapy framework, the crawler code mainly includes the following parts:

(1) Define Item

An Item is a Scrapy concept: a container that defines the structure of the data to be scraped. In this article, we define an Item to hold the product information, price, and review data. In the project's items.py file, add the following code:

import scrapy

class JingdongItem(scrapy.Item):
    name = scrapy.Field()             # product name
    sku = scrapy.Field()              # product number (SKU)
    category = scrapy.Field()         # category breadcrumb
    brand = scrapy.Field()            # brand
    model = scrapy.Field()            # model
    spec = scrapy.Field()             # specifications
    origin = scrapy.Field()           # place of origin
    weight = scrapy.Field()           # weight
    package = scrapy.Field()          # packaging
    price = scrapy.Field()            # original price
    promotion_price = scrapy.Field()  # promotional price
    discount = scrapy.Field()         # discount information
    comment = scrapy.Field()          # user reviews
    image_urls = scrapy.Field()       # image URLs (consumed by ImagesPipeline)
    images = scrapy.Field()           # image download results (filled by ImagesPipeline)

This defines an Item named JingdongItem with fields for the product name, number, category, brand, model, specification, origin, weight, packaging, price, promotional price, discount, reviews, and images.

(2) Write the crawler code

In the spiders directory of the project, open the jingdong_spider.py file and add the following code:

import scrapy
from jingdong.items import JingdongItem

class JingdongSpider(scrapy.Spider):
    name = 'jingdong'
    allowed_domains = ['jd.com']
    start_urls = ['https://www.jd.com/']

    def parse(self, response):
        # Collect all category links from the home page
        category_links = response.xpath('//div[@class="category-item"]/div[@class="item-list"]/ul/li/a/@href')
        for link in category_links:
            # response.follow resolves relative and protocol-relative URLs
            yield response.follow(link.extract(), callback=self.parse_category)

    def parse_category(self, response):
        # Collect all product links on the category page
        product_links = response.xpath('//div[@class="gl-i-wrap"]/div[@class="p-img"]/a/@href')
        for link in product_links:
            yield response.follow(link.extract(), callback=self.parse_product)

        # Follow the next-page link, if present
        next_page_link = response.xpath('//a[@class="pn-next"]/@href').extract_first()
        if next_page_link:
            yield response.follow(next_page_link, callback=self.parse_category)

    def parse_product(self, response):
        item = JingdongItem()

        # Product name
        item['name'] = response.xpath('//div[@class="sku-name"]/text()').extract_first('').strip()

        # Product number (SKU)
        item['sku'] = response.xpath('//div[@class="itemInfo-wrap"]/div[@class="clearfix"]/div[@class="sku"]/div[@class="item"]/div[@class="name"]/text()').extract_first('')

        # Category breadcrumb, joined with '>'
        category_list = response.xpath('//div[@class="breadcrumb"]/a/text()')
        item['category'] = '>'.join(category_list.extract())

        # Brand
        item['brand'] = response.xpath('//div[@class="itemInfo-wrap"]/div[@class="clearfix"]/div[@class="sku-name"]/a/@title').extract_first('')

        # Model
        item['model'] = response.xpath('//div[@class="Ptable"]/div[@class="Ptable-item"]/dl/dt/text()').extract_first('')

        # Specifications, joined with ','
        spec_list = response.xpath('//div[@class="Ptable"]/div[@class="Ptable-item"]/dl/dd/ul/li/text()')
        item['spec'] = ','.join(spec_list.extract())

        # Origin, weight, and packaging come from the parameter table;
        # guard against missing rows instead of indexing blindly
        param_values = response.xpath('//div[@class="Ptable"]/div[@class="Ptable-item"]/dl/dd/text()').extract()
        item['origin'] = param_values[0] if len(param_values) > 0 else ''
        item['weight'] = param_values[1] if len(param_values) > 1 else ''
        item['package'] = param_values[2] if len(param_values) > 2 else ''

        # Prices: original price first, promotional price second when present
        price_list = response.xpath('//div[@class="summary-price-wrap"]/div[@class="summary-price J-summary-price"]/div[@class="dd"]/span/text()')
        item['price'] = price_list[0].extract() if price_list else ''
        item['promotion_price'] = price_list[1].extract() if len(price_list) > 1 else ''
        item['discount'] = response.xpath('//div[@class="summary-price-wrap"]/div[@class="summary-price J-summary-price"]/div[@class="dd"]/div[@class="promo"]/span/text()').extract_first('')

        # Review text, one review per line
        comment_list = response.xpath('//div[@class="comment-item"]')
        comment_text_list = []
        for comment in comment_list:
            comment_text = comment.xpath('div[@class="comment-column J-comment-column"]/div[@class="comment-con"]/div[@class="comment-con-top"]/div[@class="comment-con-txt"]/text()').extract_first()
            if comment_text:
                comment_text_list.append(comment_text.strip())
        item['comment'] = '\n'.join(comment_text_list)

        # Image URLs: extract strings and resolve protocol-relative URLs
        item['image_urls'] = [response.urljoin(url) for url in response.xpath('//div[@class="spec-items"]/ul/li/img/@src').extract()]
        item['images'] = []

        yield item

This defines a spider named JingdongSpider. It first collects all category links from the home page, then visits each category page (following pagination) to collect product links, and finally visits each product page to extract the product information, price, review, and image data into an Item.
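
JD.com is a large production site, so the crawl should be throttled. The following are standard Scrapy settings for settings.py; the values are reasonable starting points rather than something prescribed by this project:

# settings.py -- polite crawling (illustrative values)
ROBOTSTXT_OBEY = True        # honor robots.txt
DOWNLOAD_DELAY = 1           # wait 1 second between requests
AUTOTHROTTLE_ENABLED = True  # adapt the delay to server load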

(3) Configure the database

In the project's settings.py file, add the following code:

ITEM_PIPELINES = {
    'jingdong.pipelines.JingdongPipeline': 300,
}

MYSQL_HOST = 'localhost'
MYSQL_PORT = 3306
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'
MYSQL_DBNAME = 'jingdong'

This registers the JingdongPipeline (defined in the next step), which saves the scraped data to the MySQL database, and configures the MySQL connection settings. Adjust the user, password, and database name to match your local setup.
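
The pipeline in the next step inserts into product and image tables, which must already exist. The article does not give their schema, so the following is a minimal sketch that matches the INSERT statements used below; the column types are assumptions:

CREATE DATABASE IF NOT EXISTS jingdong DEFAULT CHARSET utf8;
USE jingdong;

CREATE TABLE product (
    id              INT AUTO_INCREMENT PRIMARY KEY,
    name            VARCHAR(255),
    sku             VARCHAR(64),
    category        VARCHAR(255),
    brand           VARCHAR(128),
    model           VARCHAR(128),
    spec            TEXT,
    origin          VARCHAR(128),
    weight          VARCHAR(64),
    package         VARCHAR(255),
    price           VARCHAR(32),
    promotion_price VARCHAR(32),
    discount        VARCHAR(255),
    comment         TEXT
);

CREATE TABLE image (
    id         INT AUTO_INCREMENT PRIMARY KEY,
    product_id INT,
    url        VARCHAR(512)
);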

(4) Write pipeline code

In the project's pipelines.py file, add the following code:

import pymysql
from jingdong.items import JingdongItem

class JingdongPipeline:
    def __init__(self, host, port, user, password, dbname):
        self.host = host
        self.port = port
        self.user = user
        self.password = password
        self.dbname = dbname

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings configured in settings.py
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            port=crawler.settings.get('MYSQL_PORT'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            dbname=crawler.settings.get('MYSQL_DBNAME')
        )

    def open_spider(self, spider):
        self.conn = pymysql.connect(host=self.host, port=self.port, user=self.user, password=self.password, db=self.dbname, charset='utf8')
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        if not isinstance(item, JingdongItem):
            return item

        # Save the product information
        sql = 'INSERT INTO product(name, sku, category, brand, model, spec, origin, weight, package, price, promotion_price, discount, comment) VALUES(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'
        self.cursor.execute(sql, (item['name'], item['sku'], item['category'], item['brand'], item['model'], item['spec'], item['origin'], item['weight'], item['package'], item['price'], item['promotion_price'], item['discount'], item['comment']))
        product_id = self.cursor.lastrowid

        # Save the product image URLs
        for image_url in item['image_urls']:
            self.cursor.execute('INSERT INTO image(product_id, url) VALUES(%s, %s)', (product_id, image_url))

        # Commit once per item so products without images are persisted too
        self.conn.commit()

        return item

The JingdongPipeline defined here saves the scraped data to the MySQL database. In the process_item method, the product information is first inserted into the product table, then each image URL is inserted into the image table, and the transaction is committed.
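
If incomplete items should not reach the database, Scrapy's DropItem exception can be raised in process_item to discard them. A minimal sketch; the required-field check is an assumption, not part of the original pipeline:

from scrapy.exceptions import DropItem

# inside JingdongPipeline.process_item, before the INSERT (illustrative check)
if not item.get('name'):
    raise DropItem('missing product name in %r' % item)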

(5) Configure image download

In the project's settings.py file, update the ITEM_PIPELINES setting (replacing the earlier definition) and add the storage path:

ITEM_PIPELINES = {
    'jingdong.pipelines.JingdongPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

IMAGES_STORE = 'images'

This enables Scrapy's built-in ImagesPipeline with priority 1, so it runs before JingdongPipeline, and sets the directory where downloaded images are stored.
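
ImagesPipeline downloads every URL listed in an item's image_urls field and then fills its images field with one result dict per file, roughly of this shape:

# each entry of item['images'] after ImagesPipeline has run:
# {'url': <original URL>, 'path': <file path relative to IMAGES_STORE>, 'checksum': <MD5 of the file>}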

(6) Run the crawler

Enter the following command on the command line to run the crawler:

scrapy crawl jingdong

This starts the crawler, which scrapes product information, prices, reviews, and other data from JD.com and saves them to the MySQL database.
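
For a quick test without MySQL, Scrapy's built-in feed export can write the scraped items straight to a file instead:

scrapy crawl jingdong -o products.json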

4. Summary

This article introduced how to write a Python crawler that scrapes product information, prices, reviews, and other data from an e-commerce platform. Along the way, it covered the basic usage of the Scrapy framework and how to save scraped data to a MySQL database. Note that large e-commerce sites render much of their content dynamically with JavaScript and actively limit automated access, so the XPath expressions above may need to be adapted, and dynamic pages may require additional techniques such as browser automation. Hope this article helps you.
