Python + Selenium automatically crawls every category of data on JD.com's mobile site----"crawling from grandparents down to grandchildren"

1. Preamble

Hello everyone, I am Xiaolong. Today we are not talking about Java-related technologies; instead I want to share a crawler I wrote for a college competition project.

Here is how it came about:

The competition project has an e-commerce module but no data for it, so I first planned to crawl PC-side data from JD.com and Taobao. However, some of those images did not fit a mobile phone screen, so after weighing the options I decided to crawl the data from JD.com's mobile web pages instead.

Related Links:

Some functions of my project "Smart Campus Assistant v1.0.1 Based on Artificial Intelligence"

https://www.bilibili.com/video/BV1XT4y1w7Xc

"Jingdong Mall"

https://so.m.jd.com/webportal/channel/m_category?searchFrom=home


Let's first look at a screenshot of part of the final data. That's right: all the crawled data is saved as JSON, which makes it convenient to use anywhere.

Partial results (screenshot)

In the end I obtained several hundred thousand records, all of which were imported into Elasticsearch to build a powerful search system and recommendation center.

Along the way I ran into many problems and learned a lot of new things. My view is that even if Java is your main language, a language is only a tool that exists to serve us, so use whatever is most convenient for the scenario at hand. Of course Java can also crawl, but personally I find it less powerful and convenient than Python for this.

This article records my thinking throughout the process: how the ideas came about from start to finish, the various problems I ran into, how I solved them, and some of my own opinions and conclusions. I hope it gives readers some new understanding, both in terms of technology and of problem-solving approach. One more thing: my Python is not great, so if anything is lacking, please leave a message to correct me; we can also learn from each other and make progress together through technical exchange.

Enough talk. Follow Xiaolong's train of thought and let's go through it step by step!!

2. Database design

Category information table (product_category)

CREATE TABLE product_category(
  category_id SMALLINT UNSIGNED AUTO_INCREMENT NOT NULL COMMENT '分类ID',
  category_name VARCHAR(10) NOT NULL COMMENT '分类名称',
  img VARCHAR(100) NOT NULL COMMENT '分类图片logo',
  parent_id SMALLINT UNSIGNED NOT NULL DEFAULT 0 COMMENT '父分类ID--若为0则该层为父类',
  category_level TINYINT NOT NULL DEFAULT 1 COMMENT '分类层级--该层为该分类第几层',
  category_status TINYINT NOT NULL DEFAULT 1 COMMENT '分类状态--是否还可继续往下分,是1否0',
  PRIMARY KEY pk_categoryid(category_id)
)ENGINE=innodb COMMENT '商品分类表';

Product information table (product_info)

CREATE TABLE product_info(
  sku_id INT UNSIGNED AUTO_INCREMENT NOT NULL COMMENT '商品ID',
  product_name VARCHAR(20) NOT NULL COMMENT '商品名称',
  category_id1 SMALLINT UNSIGNED NOT NULL COMMENT '一级分类ID',
  category_id2 SMALLINT UNSIGNED NOT NULL COMMENT '二级分类ID',
  category_id3 SMALLINT UNSIGNED NOT NULL COMMENT '三级分类ID',
  price FLOAT NOT NULL COMMENT '商品销售价格',
  publish_status TINYINT NOT NULL DEFAULT 0 COMMENT '上下架状态:0下架1上架',
  descript VARCHAR(100) NOT NULL COMMENT '商品描述',
  spec_param VARCHAR(10000) NOT NULL COMMENT '商品规格参数',
  title VARCHAR(100) NOT NULL COMMENT '商品标题',
  PRIMARY KEY pk_skuid(sku_id)
)ENGINE = innodb COMMENT '商品信息表';

Product picture table (product_pic_info)

CREATE TABLE product_pic_info(
  product_pic_id INT UNSIGNED AUTO_INCREMENT NOT NULL COMMENT '商品图片ID',
  product_id INT UNSIGNED NOT NULL COMMENT '商品ID',
  pic_desc VARCHAR(50) COMMENT '图片描述',
  pic_url VARCHAR(200) NOT NULL COMMENT '图片URL',
  is_master TINYINT NOT NULL DEFAULT 0 COMMENT '是否主图:0.非主图1.主图',
  pic_status TINYINT NOT NULL DEFAULT 1 COMMENT '图片是否有效:0无效 1有效',
  PRIMARY KEY pk_picid(product_pic_id)
)ENGINE=innodb COMMENT '商品图片信息表';
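
To show how the crawled JSON connects to these tables, here is a minimal import sketch of my own (not from the original post), assuming pymysql and a database named mall. Since the crawled category ids are page element ids such as 'category7', this sketch lets MySQL assign numeric ids and leaves the parent_id mapping out.

import json
import pymysql  # assumed MySQL driver; any client would work the same way

def import_categories(path='categoryList.json'):
    # Load the categories produced by the crawler below and insert them
    # into product_category, mapping only name / img / level / status.
    with open(path, encoding='utf-8') as fp:
        categories = json.load(fp)
    conn = pymysql.connect(host='localhost', user='root', password='******',
                           database='mall', charset='utf8mb4')
    sql = ("INSERT INTO product_category "
           "(category_name, img, category_level, category_status) "
           "VALUES (%s, %s, %s, %s)")
    rows = [(c['category_name'], c.get('img', ''),
             c['category_level'], c['category_status']) for c in categories]
    with conn.cursor() as cur:
        cur.executemany(sql, rows)
    conn.commit()
    conn.close()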

3. Analysis

3.1. Page analysis

Each top-level category is divided into multiple second-level and third-level classifications.

Three-level category list (screenshot)

3.2. Review of page elements

Element inspection and analysis (screenshots)

Solution:

1. Implementation idea

The first level gets category1 (mobile & digital, home appliances, ...), the second level gets category2 (popular categories, mobile communication, ...), and the third level gets category3 (Xiaomi, Huawei, Honor, ...).

2. Data request

Every time a category1 entry is selected, the webpage requests data via ajax and the lower-level data is reloaded, so I chose Selenium to simulate the browser and crawl automatically.
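
The code below calls a small helper wait(driver, xpath) that is not shown in the post. Judging from how its return value is checked later (0 meaning the element appeared, 1 meaning timeout), it is presumably an explicit wait along these lines (my sketch, not the original implementation):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Assumed helper: block up to `timeout` seconds until the xpath is present.
# Returns 0 on success and 1 on timeout, matching checks like `if(flag == 1)`.
def wait(driver, path, timeout=20):
    try:
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.XPATH, path))
        )
        return 0
    except Exception:
        return 1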

# 总枢纽,控制每一级分类,详情爬取
def spider(driver):
    # 得到ul列表
    # 显示等待
    wait(driver, '//*[@id="category2"]//li')
    html = etree.HTML(driver.page_source)
    lis = html.xpath('//*[@id="category2"]//li')
    # 解析每个li标签
    # or
    for i, li in enumerate(lis):
        # 分类太多,爬取时间太长,选取有用类爬取
        if ( i == 1 or i == 3 or i == 8 or i == 26):
            # 保存一级分类
            category_id1 = li.xpath('./@id')[0]
            category_name1 = li.xpath('.//text()')[0]
            category_level1 = 1
            category_status1 = 1
            parent_id1 = 0
            # path = 'category%s' %str(i+6)
            path = '//*[@id="category%s"]/a' % str(i+6)
            # driver.find_element_by_xpath('//*[@id="category18"]/a').click()
            # time.sleep(2)
            # driver.find_element_by_xpath('//*[@id="category27"]/a').click()
            # time.sleep(2)
            item = driver.find_element_by_xpath(path)
            item.click()
            time.sleep(3)
            phone_html = etree.HTML(driver.page_source)
            branchList = phone_html.xpath('//*[@id="branchList"]/div')
            for j, item in enumerate(branchList):
                # 一个item就是一个div
                # 获取二级分类
                print(j)
                if(i == 14 and j == 0):
                    continue
                category_name2 = (item.xpath('./h4/text()'))[0]
                category_id2 = 'SecCategory'+str(category_id1).split('y')[1]
                category_level2 = 2
                category_status2 = 1
                parent_id2 = category_id1
                lis = item.xpath('./ul/li')
                for k, li in enumerate(lis):
                #这是由于数据太多,我控制category3只要6个就够了
                    if(k>5):
                        break
                    category_id3 = (li.xpath('./a/@id'))[0]
                    img = li.xpath('./a/img/@src')[0]
                    href = li.xpath('./a/@href')[0]
                    # 解析详情页
                    # ****detail(page_source)****
                    category_name3 = (li.xpath('.//span/text()'))[0]
                    category_level3 = 3
                    category_status3 = 0
                    parent_id3 = category_id2
                    print("进入第%s页爬取......" % (k+1))
                    detail_href = li.xpath('./a/@href')[0]
                    # 打开商品详情页,另开窗口
                    driver.execute_script("window.open ('%s')" % detail_href)
                    driver.switch_to.window(driver.window_handles[1])
                    time.sleep(2)
                    flag = isElementExist(driver, '//*[@id="pcprompt-viewpc"]')
                    if flag:
                        start = driver.find_element_by_xpath('//*[@id="pcprompt-viewpc"]')
                        start.click()
                    print('开始爬取第三级分类商品.....')
                    # driver.execute_script("arguments[0].click();", li)
                    # start = time.clock()
                    # 爬取详情信息
                    getList(driver, category_id1, category_id2, category_id3)
                    # end = time.clock()
                    print("爬取第三级分类商品完成.....")
                    driver.close()
                    driver.switch_to.window(driver.window_handles[0])
                    # 解析详情页
                    # time.sleep(2)
                    # 三级
                    category3 = {
                        "category_id": category_id3,
                        "category_name": category_name3,
                        "category_level": category_level3,
                        'category_status': category_status3,
                        "parent_id": parent_id3,
                        'img': img,
                        'href': href
                    }
                    list.append(category3)
                #     二级
                category2 = {
                    "category_id": category_id2,
                    "category_name": category_name2,
                    "category_level": category_level2,
                    'category_status': category_status2,
                    "parent_id": parent_id2
                }
                list.append(category2)
            #     一级
            category1 = {
                "category_id": category_id1,
                "category_name": category_name1,
                "category_level": category_level1,
                'category_status': category_status1,
                "parent_id": parent_id1
                }
            list.append(category1)
            with open('categoryList.json', 'w', encoding='utf-8') as fp:
                json.dump(list, fp, ensure_ascii=False)
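
For completeness, a minimal way to launch the spider above might look like the following (my sketch, not from the original post). It assumes the module-level list that spider() appends category dicts to, as used in the code above:

import time
from selenium import webdriver

list = []  # module-level list that spider() appends category dicts to

if __name__ == '__main__':
    driver = webdriver.Chrome()  # Selenium 3 style, matching find_element_by_xpath above
    driver.get('https://so.m.jd.com/webportal/channel/m_category?searchFrom=home')
    time.sleep(2)  # let the first screen render
    spider(driver)
    driver.quit()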


3. Pop-up window problem

As soon as I entered the website, a pop-up window (similar to the picture below) appeared and covered the page, so I could not use Selenium to get and click page elements. Sometimes, though, the pop-up did not appear at all, which was frustrating, so I added a function to check whether the element exists.

# 判断某元素是否存在
def isElementExist(driver, path):
    flag = True
    try:
        driver.find_element_by_xpath(path)
        return flag
    except:
        flag = False
        return flag
The pop-up window (screenshot)

4. Get categoryList

Analyze the page and find the required information:

# each third-level entry appended to the category list looks like:
category3 = {
    "category_id": category_id3,
    "category_name": category_name3,
    "category_level": category_level3,
    "category_status": category_status3,
    "parent_id": parent_id3,
    "img": img,
    "href": href
}

After analysis, it turns out that the a-tag hyperlink of each third-level category is the entry to the product list; we enter the goods_list through its href.

                for k, li in enumerate(lis):
                    if(k>5):
                        break
                    category_id3 = (li.xpath('./a/@id'))[0]
                    img = li.xpath('./a/img/@src')[0]
                    href = li.xpath('./a/@href')[0]
                    # 解析详情页
                    # ****detail(page_source)****
                    category_name3 = (li.xpath('.//span/text()'))[0]
                    category_level3 = 3
                    category_status3 = 0
                    parent_id3 = category_id2
                    print("进入第%s页爬取......" % (k+1))
                    detail_href = li.xpath('./a/@href')[0]
                    driver.execute_script("window.open ('%s')" % detail_href)
                    driver.switch_to.window(driver.window_handles[1])
                    time.sleep(2)
                    flag = isElementExist(driver, '//*[@id="pcprompt-viewpc"]')
                    if flag:
                        start = driver.find_element_by_xpath('//*[@id="pcprompt-viewpc"]')
                        start.click()
                    print('开始爬取第三级分类商品.....')
                    # driver.execute_script("arguments[0].click();", li)
                    # start = time.clock()
                    getList(driver, category_id1, category_id2, category_id3)
                    # end = time.clock()
                    print("爬取第三级分类商品完成.....")

5. Enter category4 list

Analyze the page and find the relevant information we need:

{
    skuid,
    price,
    title,
    url,
    img   # images are collected on the detail page together with the poster images
}
List of product divs and the entrance to the details page (screenshots)

After analysis, it turns out that following its hyperlink (tourl) redirects to the PC-side page.

Details page (screenshot)

Therefore, I analyzed the address each product entry points to:

1)https://item.m.jd.com/product/100004559325.html?sku=100004559325&price=1599.00&fs=1&sid=&sf=newM&sceneval=2&pos=1&csid=5e01c492804b348d18864f64a9d55e52_1583067332315_1_1583067332316&ss_symbol=10&ss_mtest=m-list-none,~&key=

2)https://item.m.jd.com/product/100008348542.html?sku=100008348542&price=5999.00&fs=1&sid=&sf=newM&sceneval=2&pos=2&csid=5e01c492804b348d18864f64a9d55e52_1583067332315_1_1583067332316&ss_symbol=10&ss_mtest=m-list-none,~&key=

......

I found that the links differ only in skuid and price; you can try it yourself. Occasionally there are other small variations, which I found troublesome. So, after studying the js and the links, I extracted a

universal address: https://item.m.jd.com/product/(skuid).html?m2wq_ismiao=1. As long as I have a product's skuid, I can open its details page without any problems.
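
In code, building the simplified address is just string formatting (the same pattern appears in the sku-collection code below):

def detail_url(skuid):
    # universal mobile detail-page address extracted above
    return "https://item.m.jd.com/product/%s.html?m2wq_ismiao=1" % skuid

# e.g. detail_url('100004559325')
# -> 'https://item.m.jd.com/product/100004559325.html?m2wq_ismiao=1'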

Code to collect the sku data:

# getList: collect the sku data for one third-level category
# (the signature matches the call getList(driver, category_id1, category_id2, category_id3) above)
def getList(driver, category_id1, category_id2, category_id3):
    print('开始爬取商品详情页.....')
    loadPage(driver, 1)
    html = etree.HTML(driver.page_source)
    wait(driver, "//*[@id='itemList']/div[@class='search_prolist_item']//div[@class='search_prolist_price']//span")
    divs = html.xpath("//*[@id='itemList']/div[@class='search_prolist_item']")
    skus = []
    for i, div in enumerate(divs):
        try:
            skuid = div.xpath('./@skuid')[0]
        except:
            skuid = 0
        try:
            title = div.xpath(".//div[@class='search_prolist_title']/text()")[0].strip()
        except:
            title = ""
        try:
            price = div.xpath(".//div[@class='search_prolist_price']//span/text()")[0].strip()
        except:
            price = 0
        # 这是当时写了个判断,但是出错,又嫌代码缩进不好处理,就改了个1
        if (1):
            sku = {
                'skuid': skuid,
                'title': title,
                'price': price,
                "category_id1": category_id1,
                'category_id2': category_id2,
                'category_id3': category_id3,
                'url': "https://item.m.jd.com/product/%s.html?m2wq_ismiao=1" % skuid,
            }
            skus.append(sku)
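
loadPage(driver, 1) in the snippet above is another helper that is not shown. My assumption is that it scrolls the goods list to trigger JD's lazy loading a given number of times, roughly like this:

import time

# Assumed helper: scroll to the bottom `times` times so lazily loaded items render.
def loadPage(driver, times=1):
    for _ in range(times):
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(1)  # give the ajax request time to append new items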

3.3. Details page analysis

Details page (screenshot)

Obtain product carousels, product posters, product store information, product introductions, and specifications

Shop information and specifications (screenshots)

1. Parse the pictures (product carousel and poster images)

Product carousel (screenshot)

The page does not provide an id for each image, and the ids that do appear simply repeat, so I ended up using the image's file name as the id and encapsulated a function to parse the img_id.

Parse img_id code:

# 根据图片地址截取图片名称划取img的id(主键)
def splitImgId(src):
    img_id = ''
    list = src.split('/')
    for i, li in enumerate(list):
        if(li.find('.jpg', 0, len(li)) != -1 or li.find('.png', 0, len(li)) != -1):
            for key in li:
                if(key == '.'):
                    break
                img_id = img_id + key
    return img_id
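
For reference, the same idea can be written more compactly (equivalent under the assumption that the image file name is the last path segment of the URL):

def split_img_id(src):
    # take the last path segment, then drop everything from the first '.'
    name = src.rsplit('/', 1)[-1]
    return name.split('.', 1)[0]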

Get product poster image – product introduction:

Note: this part was really frustrating. I looked for a long time but could not see where the picture was rendered; in the end I found that the URL is carried in the style attribute.

Finding the url in the style attribute (screenshot)

Then I encapsulated a function to extract the background-image url:

# 获取海报图imgList
def HB(html, driver):
    imgs = []
    img = ''
    wait(driver, '//*[@id="detail"]')
    style = str(html.xpath('//*[@id="commDesc"]/style/text()'))
    style = style.split('background-image:url(')
    for item in style:
        for i, key in enumerate(item):
            if (len(item) > 1 and item[0] == '/' and item[1] == '/'):
                if (item[i] == ')'):
                    if (img != ''):
                        imgs.append(img)
                        img = ''
                    break
                img = img + item[i]
    return imgs
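
As an aside, the same background-image URLs can also be pulled out with a regular expression instead of manual character scanning. This is only a sketch of an alternative, assuming the URLs are protocol-relative (start with //) as in the screenshot:

import re

def poster_imgs(html):
    style_text = ''.join(html.xpath('//*[@id="commDesc"]/style/text()'))
    # capture everything between 'background-image:url(' and the closing ')'
    return re.findall(r'background-image:\s*url\((//[^)]+)\)', style_text)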

Parse image code:

# 1.解析图片(包括商品的轮播图,海报图)
def detail_img(html, goods_id, driver):
    # 解析商品轮播图
    imgs = []
    flag = wait(driver, '//*[@id="loopImgUl"]/li')
    if(flag == 1):
        return imgs
    lis = html.xpath('//*[@id="loopImgUl"]/li')
    product_id = goods_id
    is_master = 0
    for i, li in enumerate(lis):
        pic_url = li.xpath('./img/@src')[0]
        product_pic_id = splitImgId(pic_url)
        pic_desc = '商品轮播图'
        pic_status = 1
        # 主图,爬取的第一个为
        if (i == 0):
            is_master = 1
        img = {
            'product_id': product_id,
            'is_master': is_master,
            'pic_url': pic_url,
            'pic_status': pic_status,
            'product_pic_id': product_pic_id
        }
        imgs.append(img)
    # 解析商品海报图
    for item in HB(html, driver):
        pic_url = item
        product_pic_id = splitImgId(item)
        pic_status = 1
        pic_desc = "商品海报图"
        img = {
            'product_id': product_id,
            'is_master': is_master,
            'pic_url': pic_url,
            'pic_status': pic_status,
            'product_pic_id': product_pic_id
        }
        imgs.append(img)
    return imgs

2. Obtain store information

There is not much to say about this; it is simple. The only thing is that in practice I found some broken shop pages, which caused program errors.

So I added an explicit wait of 20 seconds: if the shop element still cannot be found after that, it is probably not because the network is slow but because the shop simply does not exist.

    shop = {}
    # 显示等待---封装函数
    flag = wait(driver, '//*[@id="shopBaseInfo"]/div[2]/p[1]')
    if(flag == 1):
        return shop
    try:
        img = html.xpath('//*[@id="shopLogoInfo"]/span/img/@src')[0]
        shop_name = html.xpath('//*[@id="shopInfo"]/p/span/text()')[0]
        fan = html.xpath('//*[@id="shopBaseInfo"]/div[1]/p[1]/text()')[0]
        goods_num = html.xpath('//*[@id="shopBaseInfo"]/div[2]/p[1]/text()')[0]
    except:
        return {}
    shop = {
        'img': img,
        'shop_name': shop_name,
        'fan': fan,
        'goods_num': goods_num
    }
    return shop
3. Obtain specification parameters

There is nothing special to say here; just save the data. Errors rarely occur, but it is still worth wrapping it in an exception handler.

    # 有问题,就返回空字符串,程序继续执行
    try:
        specification = {'specification': spec_param(driver)}
    except:
        specification = {'specification': ''}
    try:
        sub_title = {'sub_title': html.xpath('//*[@id="itemName"]/text()')[0]}
    except:
        sub_title = {'sub_title': ''}
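
spec_param(driver) above is another helper that is not shown in the post. A plausible version (my assumption; the real container id may differ from the hypothetical //*[@id="specifications"] used here) simply joins the visible text of the specification block:

from lxml import etree

def spec_param(driver):
    html = etree.HTML(driver.page_source)
    # hypothetical container id for the specification table
    texts = html.xpath('//*[@id="specifications"]//text()')
    return ' '.join(t.strip() for t in texts if t.strip())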

The rest is just integrating, processing, and saving the parsed data, so I will not go into too much detail.

4. Partial results display

Partial results (screenshot)

1. For the first-level classification (category1) I selected only 5 major categories, and only 14 products for each third-level classification, yet it still took more than 7 hours. I deeply felt the pain and realized how much good coding discipline matters.

Because the code had many loopholes, a lot of extra time was wasted. After the project was completed, I decided to optimize the code and considered multi-threaded crawling; I estimate the crawl could then be finished in at most 1.5 hours.

2. It took me 3 days to analyze, write, fix bugs, and run this code, and the rounds of mistakes and fixes cost me a lot, but in the end it succeeded.

I learned a lot from it. If you happen to find this article and have a better solution, please give me your advice so we can improve together.

That is what I want to share today, haha! I know there are still shortcomings, and I hope you don't mind.

If you have any questions or want this script, you can contact me in the background or add me, or join the technical exchange group to discuss and learn together.

Come on! Come on!! The you of tomorrow will definitely thank the you who works hard today!!

5. Preview of the next episode

Haha, many fans have privately messaged me asking for my university competition project (first prize in the China Computer Design Competition, first prize in the Huawei Kunpeng Competition), which is also one of the main projects I used for my autumn recruitment internship.

I had wanted to find time to share it earlier, but I have been busy with projects recently and have not had time, so I plan to share it in the next issue.


Project part screenshot:

If you need the project or the Python script, you can follow the official account and reply [jd_spider] or [Smart campus assistant based on artificial intelligence].

Finally, if you think this article is good, remember to forward it, give it another read, and share it with more classmates!

Hello, I'm Xiaolong, a programming newbie from a second-tier university who successfully counterattacked into BAT.

Official account: an original tech account dedicated to helping everyone counterattack into BAT. It focuses on sharing Java technology, MySQL, Redis, distributed systems, networking, OS, and more. Here you will find not only detailed explanations of interview questions from big companies and internal referrals, but also my experience of counterattacking BAT from a second-tier university, my [Interview Notes] summarizing autumn-recruitment interviews with a dozen large and medium-sized companies, and some of my thoughts on investing and side projects!

Follow the official account and explore more exciting content with Xiaolong. Follow now and reply [Big Factory Interview] in the background to receive the [Boutique Interview Notes] to help you land an offer from a big company!! Reply [Smart Campus Assistant Based on Artificial Intelligence] to receive my boutique autumn-recruitment interview project.

Previous highlights:

Ali almost lost on this question: Why is the MySQL auto-increment primary key not continuous?

"Interview Notes" - The End of JVM (Hanging series)

Redis basics (starting from a high-rise building): the core underlying data structure

"Interview Notes" - the end of MySQL (30 questions and answers)
