Python crawler: scraping product information from JD.com | Unexpectedly, the highest sales volume is

Hello everyone, I am Xianyu

It has been a long time since I last posted about Python crawlers. Today we will use the selenium module to write a simple crawler that scrapes product information from JD.com.

URL link: https://www.jd.com/

The complete source code is at the end of the article

Locating elements

We need to find the location (XPath) of the elements on the web page.
First we need the locations of the search box and the search button, so that we can type the product name into the search box and click the search button.

Press F12 and inspect the corresponding elements with the developer tools to get the following XPath expressions:

# Search box:
//*[@id="key"]

# Search button:
//*[@class='form']/button

Take Python books as an example.

We need to obtain the name, price, number of reviews and store name of each product. Inspecting the corresponding elements with the developer tools gives the following XPath expressions:

# Product list on the current page
//*[@id="J_goodsList"]/ul/li

# Product name
.//div[@class="p-name"]/a/em | .//div[@class="p-name p-name-type-2"]/a/em

# Product price
.//div[@class="p-price"]/strong

# Number of reviews
.//div[@class="p-commit"]/strong

# Store name
.//div[@class="p-shopnum"] | .//div[@class="p-shop"]

Note that I used the union operator (|) in the XPath for the product name. While crawling other products I found that the product name can appear under more than one path:

.//div[@class="p-name"]/a/em or .//div[@class="p-name p-name-type-2"]/a/em

The same applies to the store name:

.//div[@class="p-shopnum"] or .//div[@class="p-shop"]

Several XPath expressions can be combined into one with the | operator. The syntax is:

xpath_expression_1 | xpath_expression_2 | xpath_expression_3
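
Two paths are needed because `@class="p-name"` compares the whole attribute string, so it does not match an element whose class is `p-name p-name-type-2`. The sketch below illustrates this on a hand-written fragment (an assumption for illustration, not real JD markup) using the standard library's `xml.etree.ElementTree`. Its limited XPath support has no `|` operator, so the hypothetical helper `find_first` tries each path in turn instead. With Selenium the union can be passed as a single expression, as in the paths above.

```python
import xml.etree.ElementTree as ET

# Hand-written sample markup (an assumption for illustration, not real JD HTML).
SAMPLE = """
<ul>
  <li><div class="p-name"><a><em>Book A</em></a></div></li>
  <li><div class="p-name p-name-type-2"><a><em>Book B</em></a></div></li>
</ul>
"""

NAME_PATHS = [
    './/div[@class="p-name"]/a/em',
    './/div[@class="p-name p-name-type-2"]/a/em',
]


def find_first(node, paths):
    """Return the first element matched by any of the paths, or None."""
    for path in paths:
        found = node.find(path)
        if found is not None:
            return found
    return None


items = ET.fromstring(SAMPLE.strip()).findall('li')

# With only the first path, the second <li> is missed, because the class
# attribute is compared as one whole string.
single = [li.find(NAME_PATHS[0]) is not None for li in items]

# With both paths, every <li> yields a product name.
names = [find_first(li, NAME_PATHS).text for li in items]
```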

With the automated search in place, the next step is to grab the product information from the results page.

Note that the full product list is only loaded once the page has been scrolled to the bottom.
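
A single scroll followed by a fixed pause may not be enough on a slow connection. Here is a minimal sketch of one alternative, assuming that lazily loaded items increase `document.body.scrollHeight`: keep scrolling until the height stops changing. The helper name `scroll_to_bottom` and its `pause` and `max_rounds` parameters are hypothetical; `browser` is any object that offers Selenium's `execute_script` method.

```python
import time


def scroll_to_bottom(browser, pause=1.0, max_rounds=10):
    """Scroll down repeatedly until the page height stops growing.

    Returns the number of scroll rounds performed.
    """
    last_height = browser.execute_script('return document.body.scrollHeight')
    rounds = 0
    while rounds < max_rounds:
        browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
        time.sleep(pause)  # give lazily loaded items time to appear
        rounds += 1
        new_height = browser.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break  # nothing new was loaded, so we are at the bottom
        last_height = new_height
    return rounds
```

The `max_rounds` cap is there so that a page which keeps growing cannot stall the crawler.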

We also need one more check: on the last page the "next page" button can no longer be clicked, and at that point the crawler exits.
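
This check can be isolated in a small function. The sketch below assumes, as the code in this article does, that the last page renders its "next page" button with the class string `pn-next disabled`. The function name `is_last_page` is hypothetical, and the two sample strings are hand-written rather than copied from the site.

```python
def is_last_page(page_source):
    """Return True when the "next page" button is marked as disabled."""
    return 'pn-next disabled' in page_source


# Hand-written fragments for illustration only.
middle_page = '<a class="pn-next" href="#">next</a>'
final_page = '<a class="pn-next disabled" href="#">next</a>'
```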

Code

First we define a class, JdSpider, and write its initializer:

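from selenium import webdriver
from selenium.webdriver.common.by import By
import time


class JdSpider(object):
    def __init__(self):
        self.url = 'http://www.jd.com/'
        self.options = webdriver.ChromeOptions()
        self.options.add_argument('--headless')  # run without showing a browser window
        self.browser = webdriver.Chrome(options=self.options)  # create the browser object
        self.i = 0  # counter: total number of products scraped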

Next comes the code that enters the product name and clicks the search button:

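    def get_html(self):
        # Open the JD home page
        self.browser.get(self.url)
        # Type the search keyword ('python书籍' means "Python books") into the search box
        self.browser.find_element(By.XPATH, '//*[@id="key"]').send_keys('python书籍')
        # Click the search button
        self.browser.find_element(By.XPATH, "//*[@class='form']/button").click()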

Getting the information

    def get_data(self):
        # Run a JavaScript statement to scroll to the bottom of the page
        self.browser.execute_script(
            'window.scrollTo(0,document.body.scrollHeight)'
        )
        # Allow time for the page elements to load
        time.sleep(2)
        # Use XPath to collect every product on the page into one list
        li_list = self.browser.find_elements(By.XPATH, '//*[@id="J_goodsList"]/ul/li')
        for li in li_list:
            # Build an empty dict for this product
            item = {}
            item['name'] = li.find_element(By.XPATH, './/div[@class="p-name"]/a/em | .//div[@class="p-name p-name-type-2"]/a/em').text.strip()
            item['price'] = li.find_element(By.XPATH, './/div[@class="p-price"]/strong').text.strip()
            item['count'] = li.find_element(By.XPATH, './/div[@class="p-commit"]/strong').text.strip()
            item['shop'] = li.find_element(By.XPATH, './/div[@class="p-shopnum"] | .//div[@class="p-shop"]').text.strip()
            print(item)
            self.i += 1
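
One thing to watch: `find_element` raises `NoSuchElementException` when nothing matches, so a single product card that lacks one of these fields would stop the whole crawl. A hedged sketch of one way around this relies on `find_elements`, which returns an empty list instead of raising. The helper name `text_or_default` is hypothetical; the string `'xpath'` is the value of `By.XPATH`, written out so that the snippet needs no imports.

```python
def text_or_default(node, xpath, default=''):
    """Return the stripped text of the first match, or `default` if none."""
    matches = node.find_elements('xpath', xpath)
    return matches[0].text.strip() if matches else default


# Inside get_data this could be used as, for example:
#     item['shop'] = text_or_default(
#         li, './/div[@class="p-shopnum"] | .//div[@class="p-shop"]')
```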

Entry function

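    def run(self):
        # Run the search to reach the results page we want to scrape
        self.get_html()
        # Keep clicking "next page" in a loop
        while True:
            # Scrape the data on the current page
            self.get_data()
            # Check for the last page: -1 means the marker was not found,
            # so this is not the last page and we click "next page"
            print(self.browser.page_source.find('pn-next disabled'))
            if self.browser.page_source.find('pn-next disabled') == -1:
                self.browser.find_element(By.CLASS_NAME, 'pn-next').click()
                # Allow time for the elements to load
                time.sleep(1)
            else:
                print('Total items:', self.i)
                break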

Let's take a look at the result.
Readers can clean the scraped data and then move on to analyzing it.
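
As a starting point for that cleaning step, here is a minimal sketch. It assumes, without confirmation from the site, that prices look like `¥59.80` and that review counts look like `2000+` or `10万+` (where 万 means 10,000). The function names and the sample records are made up for illustration.

```python
import re


def parse_price(text):
    """Extract the first decimal number from a price string, or None."""
    match = re.search(r'\d+(?:\.\d+)?', text.replace(',', ''))
    return float(match.group()) if match else None


def parse_count(text):
    """Convert a review-count string such as '2000+' or '10万+' to an int."""
    match = re.search(r'(\d+(?:\.\d+)?)(万)?', text.replace(',', ''))
    if not match:
        return 0
    number = float(match.group(1))
    if match.group(2):  # 万 = 10,000
        number *= 10_000
    return round(number)


# Made-up records in the shape printed by get_data().
items = [
    {'name': 'Book A', 'price': '¥59.80', 'count': '2000+', 'shop': 'Shop 1'},
    {'name': 'Book B', 'price': '¥89.00', 'count': '10万+', 'shop': 'Shop 2'},
]
for item in items:
    item['price'] = parse_price(item['price'])
    item['count'] = parse_count(item['count'])

# The product with the most reviews.
top = max(items, key=lambda item: item['count'])
```

Once the fields are numeric, sorting or filtering the records by price or review count is straightforward.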

The source code is as follows:

from selenium import webdriver
import time
from selenium.webdriver.common.by import By


class JdSpider(object):
    def __init__(self):
        self.url = 'http://www.jd.com/'
        self.options = webdriver.ChromeOptions()
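        self.options.add_argument('--headless')  # headless mode
        self.browser = webdriver.Chrome(options=self.options)  # create a browser object with the headless option
        self.i = 0  # counter: total number of products scraped

    # Open the URL, type the product name and click the button.
    # Note that the elements located here are the search box and search button on the JD home page.
    def get_html(self):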
        self.browser.get(self.url)
        self.browser.find_element(By.XPATH, '//*[@id="key"]').send_keys('python书籍')
        self.browser.find_element(By.XPATH, "//*[@class='form']/button").click()

    # Scroll to the bottom of the page, then extract the product information
    def get_data(self):
        # Run a JavaScript statement to scroll to the bottom of the page
        self.browser.execute_script(
            'window.scrollTo(0,document.body.scrollHeight)'
        )
        # Allow time for the page elements to load
        time.sleep(2)
        # Use XPath to collect every product on the page into one list
        li_list = self.browser.find_elements(By.XPATH, '//*[@id="J_goodsList"]/ul/li')
        for li in li_list:
            # Build an empty dict for this product
            item = {}
            item['name'] = li.find_element(By.XPATH, './/div[@class="p-name"]/a/em | .//div[@class="p-name p-name-type-2"]/a/em').text.strip()
            item['price'] = li.find_element(By.XPATH, './/div[@class="p-price"]/strong').text.strip()
            item['count'] = li.find_element(By.XPATH, './/div[@class="p-commit"]/strong').text.strip()
            item['shop'] = li.find_element(By.XPATH, './/div[@class="p-shopnum"] | .//div[@class="p-shop"]').text.strip()
            print(item)
            self.i += 1

    def run(self):
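        # Run the search to reach the results page we want to scrape
        self.get_html()
        # Keep clicking "next page" in a loop
        while True:
            # Scrape the data on the current page
            self.get_data()
            # Check for the last page: -1 means the marker was not found,
            # so this is not the last page and we click "next page"
            print(self.browser.page_source.find('pn-next disabled'))
            if self.browser.page_source.find('pn-next disabled') == -1:
                self.browser.find_element(By.CLASS_NAME, 'pn-next').click()
                # Allow time for the elements to load
                time.sleep(1)
            else:
                print('Total items:', self.i)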
                break


if __name__ == '__main__':
    spider = JdSpider()
    spider.run()


Origin blog.csdn.net/s_alted/article/details/131117734