Crawler in action: using Selenium to crawl JD product information

Some pages load their data via Ajax, but the Ajax interface parameters can be complicated and may involve encryption. For pages like this, the most convenient approach is Selenium: it drives a real browser, so we can simulate user actions and scrape JD product information from the fully rendered page.
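As a first step, here is a minimal sketch of the Selenium setup, assuming chromedriver has been downloaded and the path is adjusted to your machine:

from selenium import webdriver

driver = webdriver.Chrome(r'D:\chromedriver_win32\chromedriver.exe')  # driver path is machine-specific
driver.get('https://www.jd.com/')  # the browser renders the page, Ajax content included
print(driver.title)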
Page analysis

Today I use Selenium to simulate a browser and crawl the product information.

The search input box on JD's home page has the id key.

In Chrome, right-click the search button and choose Inspect to locate it in the Elements panel. Then right-click the highlighted element, choose Copy, and pick Copy selector to get the button's CSS selector. The result is
#search > div > div.form > button
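With these two selectors, typing a keyword and clicking the search button looks like this (a sketch; driver is the Chrome instance created in the script below):

driver.find_element_by_id('key').send_keys('underwear')  # type the keyword into the search box
driver.find_element_by_css_selector('#search > div > div.form > button').click()  # click search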

Enter a product name, such as underwear, and then scroll the result page all the way to the bottom so that every item on the page is loaded.
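The full script below does not scroll explicitly; if the second half of a page fails to load, a common workaround is to scroll with JavaScript (a sketch, using the time module the script already imports):

driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')  # jump to the bottom of the page
time.sleep(2)  # give the lazily loaded items time to appear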
Similarly, copy the CSS path of the element at the bottom of the page that shows the total page count (100 pages here):
#J_bottomPage > span.p-skip > em:nth-child(1) > b
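Reading that element and pulling out the digits gives the total page count; this is exactly what search_product does in the script below (it relies on the re module):

page_text = driver.find_element_by_css_selector('#J_bottomPage > span.p-skip > em:nth-child(1) > b').text
total_pages = int(re.findall(r'(\d+)', page_text)[0])  # e.g. '100' -> 100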

Inspecting the result list shows that each product is an li tag, and the ul contains all the products on the page. The CSS path of the li tags is
#J_goodsList > ul > li

driver.find_elements_by_css_selector('#J_goodsList > ul > li')

What we get back is a list of elements corresponding to li:nth-child(1), li:nth-child(2), and so on.

To extract product information, such as the name: the CSS path copied for the first product's name is
#J_goodsList > ul > li:nth-child(1) > div > div.p-name.p-name-type-2 > a > em
Since we already have each li element, we strip the prefix and keep only the relative path:
div > div.p-name.p-name-type-2 > a > em
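For example, looping over the li elements and applying the relative selector to each one:

lis = driver.find_elements_by_css_selector('#J_goodsList > ul > li')
for li in lis:
    # the relative selector searches inside each li, not the whole page
    name = li.find_element_by_css_selector('div > div.p-name.p-name-type-2 > a > em').text
    print(name)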

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Author: yudengwu(余登武)
# @Date  : 2021/2/18
#@email:[email protected]
from selenium import webdriver
import time
import csv
import re

def search_product(key):
    # type the keyword into the search box and click the search button
    driver.find_element_by_id('key').send_keys(key)
    driver.find_element_by_css_selector('#search > div > div.form > button').click()
    # maximize the browser window
    driver.maximize_window()
    time.sleep(3)
    # find the element that shows the total number of result pages
    page = driver.find_element_by_css_selector('#J_bottomPage > span.p-skip > em:nth-child(1) > b').text
    page = re.findall(r'(\d+)', page)[0]  # raw string avoids an invalid-escape warning
    return int(page)

def get_product():
    lis = driver.find_elements_by_css_selector('#J_goodsList > ul > li')
    for li in lis:
        # product name
        info = li.find_element_by_xpath('div/div[4]/a/em').text
        # product price
        price = li.find_element_by_css_selector('div > div.p-price > strong > i').text + ' yuan'
        # number of reviews
        evaluate = li.find_element_by_css_selector('div > div.p-commit > strong').text
        # shop name
        name = li.find_element_by_css_selector('div > div.p-shop > span > a').text
        # image URL
        photo = li.find_element_by_css_selector('div > div.p-img > a > img').get_attribute('src')
        print(info, price, evaluate, name, photo, sep='|')
        # append one row per product; utf-8 keeps Chinese text intact on Windows
        with open('京东商品.csv', 'a', newline='', encoding='utf-8') as fp:
            csvwriter = csv.writer(fp, delimiter=',')
            csvwriter.writerow([info, price, evaluate, name, photo])



def main():
    print('Crawling page 1...')

    page = search_product(keyword)
    get_product()
    page_num = 1
    while page_num != page:
        print('-*-' * 10)
        print('Crawling page {}...'.format(page_num + 1))
        print('*-*' * 10)
        driver.get('https://search.jd.com/Search?keyword={}&wq={}&page={}&s=116&click=0'.format(keyword, keyword, page_num))
        # implicit wait: poll up to 2 seconds for elements to appear
        driver.implicitly_wait(2)
        # maximize the browser window
        driver.maximize_window()
        get_product()
        page_num += 1


if __name__ == '__main__':
    keyword = input('Enter the keyword of the product to search for: ')
    path = r'D:\chromedriver_win32\chromedriver.exe'  # path to chromedriver
    driver = webdriver.Chrome(path)
    driver.get('https://www.jd.com/?cu=true&utm_source=c.duomai.com&utm_medium=tuiguang&utm_campaign=t_16282_137005883&utm_term=5b4355c849464534ae6a958b61187471')
    main()

Try another product keyword.

With this approach we can crawl any goods on JD.


Origin blog.csdn.net/kobeyu652453/article/details/113841793