[Java] Crawling all three levels of JD.com product category data (table creation statement + Jar dependencies + tree-structure encapsulation + crawler source code), with the data available for download in csv and sql formats


【Resource link】

Link: pan.baidu.com/s/15fuerPIQ…

Extraction code: 6psl

【Include files】


1. Description

The current project needs product category data. I looked at the homepages of Taobao and JD.com, and JD.com's data at www.jd.com/allSort.asp… is the easier of the two to obtain.


2. Implementation

2.1 Create table statement

The project uses a Greenplum database; readers on other databases can adapt the table creation statement themselves :smile:

-- Create the table
CREATE TABLE "data_commodity_classification" ( 
"id" VARCHAR ( 32 ), 
"parent_id" VARCHAR ( 32 ), 
"level" int2, 
"name" VARCHAR ( 64 ), 
"merger_name" VARCHAR ( 255 ) 
);
-- Comments
COMMENT ON TABLE "data_commodity_classification" IS 'Three-level product category table';
COMMENT ON COLUMN "data_commodity_classification"."level" IS 'Category level';
COMMENT ON COLUMN "data_commodity_classification"."name" IS 'Product category name';
COMMENT ON COLUMN "data_commodity_classification"."merger_name" IS 'Combined category name';

2.2 Jar package dependencies

jsoup is required. mybatis-plus is used only for its .saveBatch() call when persisting the objects, so it is optional.

<!-- No need to obsess over the exact version -->
<dependency>
	<groupId>org.jsoup</groupId>
	<artifactId>jsoup</artifactId>
	<version>1.10.2</version>
</dependency>
<!-- Not required -->
<dependency>
	<groupId>com.baomidou</groupId>
	<artifactId>mybatis-plus-boot-starter</artifactId>
	<version>3.3.0</version>
</dependency>

2.3 Object Encapsulation

lombok's @Builder is used to simplify the code when constructing objects:

@Data
@EqualsAndHashCode(callSuper = false)
@Accessors(chain = true)
@ApiModel(value="DataCommodityClassification object", description="")
@Builder
public class DataCommodityClassification implements Serializable {
    private static final long serialVersionUID=1L;
    private String id;
    private String parentId;
    @ApiModelProperty(value = "Category level")
    private Integer level;
    @ApiModelProperty(value = "Product category name")
    private String name;
    @ApiModelProperty(value = "Combined category name")
    private String mergerName;
}
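As a quick illustration of what @Builder provides, the sketch below hand-writes the equivalent builder for a trimmed-down stand-in class (lombok generates roughly this automatically; the class and field subset here are illustrative, not part of the project code):

```java
// Trimmed-down stand-in for the lombok-annotated entity, showing by hand
// roughly what @Builder generates. Only a subset of fields is included.
public class BuilderDemo {
    static class Classification {
        final String id;
        final String parentId;
        final Integer level;
        final String name;

        private Classification(String id, String parentId, Integer level, String name) {
            this.id = id;
            this.parentId = parentId;
            this.level = level;
            this.name = name;
        }

        static Builder builder() { return new Builder(); }

        static class Builder {
            private String id;
            private String parentId;
            private Integer level;
            private String name;

            Builder id(String id) { this.id = id; return this; }
            Builder parentId(String parentId) { this.parentId = parentId; return this; }
            Builder level(Integer level) { this.level = level; return this; }
            Builder name(String name) { this.name = name; return this; }
            Classification build() { return new Classification(id, parentId, level, name); }
        }
    }

    public static void main(String[] args) {
        // Same call chain as used in the crawler code below
        Classification c = Classification.builder()
                .id("0").parentId(null).level(0).name("家用电器").build();
        System.out.println(c.id + "," + c.level + "," + c.name);
    }
}
```

With the real entity, the identical chain works out of the box because lombok generates the builder at compile time.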

2.4 Crawler source code

The crawler walks the page's HTML tag structure (category-items → category-item m → dl/dt/dd):

Data acquisition logic: clear the historical data → crawl the latest data and encapsulate it → save the latest data.

	public boolean getCommodityClassificationData() throws IOException {

		// First clear out the historical data
		LambdaQueryWrapper<DataCommodityClassification> lambdaQuery = Wrappers.lambdaQuery(DataCommodityClassification.class);
		dataCommodityClassificationService.remove(lambdaQuery);

		// Counters used to build the tree-structure IDs (written off the cuff; there may be a better way)
		AtomicInteger atomicIntegerOne = new AtomicInteger();
		AtomicInteger atomicIntegerTwo = new AtomicInteger();
		AtomicInteger atomicIntegerThree = new AtomicInteger();

		// Result data
		List<DataCommodityClassification> dataCommodityClassificationList = new ArrayList<>();

		// ************* Crawler code starts here *************
		// Target address
		String url = "https://www.jd.com/allSort.aspx";
		Document document = Jsoup.parse(new URL(url), 300000);
		// Get the root element containing all category data
		Element root = document.getElementsByClass("category-items clearfix").get(0);
		// Get the first-level category elements
		Elements levelOne = root.getElementsByClass("category-item m");
		levelOne.forEach(one -> {
			String levelOneData = one.getElementsByClass("item-title").get(0).child(2).text();
			String oneId = "" + atomicIntegerOne.getAndIncrement();
			dataCommodityClassificationList.add(DataCommodityClassification.builder().id(oneId).parentId(null).level(0).name(levelOneData).build());
			// Get the second-level category elements
			Elements levelTwo = one.getElementsByClass("items").get(0).getElementsByTag("dl");
			levelTwo.forEach(two -> {
				String levelTwoData = two.getElementsByTag("dt").text();
				String twoId = oneId + atomicIntegerTwo.getAndIncrement();
				String mergerNameTwo = levelOneData + "," + levelTwoData;
				dataCommodityClassificationList.add(DataCommodityClassification.builder().id(twoId).parentId(oneId).level(1).name(levelTwoData).mergerName(mergerNameTwo).build());
				// Get the third-level category elements
				Elements levelThree = two.getElementsByTag("dd").get(0).children();
				levelThree.forEach(three -> {
					// Extract the third-level category info
					String levelThreeData = three.text();
					String threeId = twoId + atomicIntegerThree.getAndIncrement();
					String mergerNameThree = mergerNameTwo + "," + levelThreeData;
					dataCommodityClassificationList.add(DataCommodityClassification.builder().id(threeId).parentId(twoId).level(2).name(levelThreeData).mergerName(mergerNameThree).build());
				});
			});
		});

		// Save the latest data
		boolean isSaveSuccess = dataCommodityClassificationService.saveBatch(dataCommodityClassificationList);
		return isSaveSuccess;
	}
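One caveat about the concatenated-counter IDs above: because the counters are not zero-padded, two different parent/child pairs can in principle produce the same string once a counter passes 9 (e.g. parent "1" + child 10 and parent "11" + child 0 both yield "110"). A minimal, self-contained sketch of a zero-padded alternative (the helper name and width are hypothetical, not part of the original code):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class PaddedIdDemo {
    // Hypothetical helper: zero-pad each counter to a fixed width so that a
    // concatenated ID decomposes unambiguously into its per-level segments.
    static String paddedId(String parentId, AtomicInteger counter) {
        return (parentId == null ? "" : parentId) + String.format("%03d", counter.getAndIncrement());
    }

    public static void main(String[] args) {
        AtomicInteger one = new AtomicInteger();
        AtomicInteger two = new AtomicInteger();

        String oneIdA = paddedId(null, one);    // first-level ID "000"
        String oneIdB = paddedId(null, one);    // first-level ID "001"
        String twoIdA = paddedId(oneIdA, two);  // second-level ID "000000"
        String twoIdB = paddedId(oneIdB, two);  // second-level ID "001001"

        System.out.println(oneIdA + " " + oneIdB + " " + twoIdA + " " + twoIdB);
    }
}
```

A width of 3 allows up to 1000 children per node, which is more than enough for JD.com's category tree; the ID length then encodes the level directly.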

3. Results

The parent_id and merger_name of the first-level categories are left unset; I am not sure whether that causes any problems in actual business use.

Data in csv and sql formats is provided. The crawl date is 20220310; if you need the latest data, run the crawler code yourself.


Origin juejin.im/post/7087402591381880862