With Python you can get any product's data from Taobao without logging in

I. Introduction

1. Everything crawled here is data that any user can browse on Taobao's pages.
2. This is for communication and learning only and will not be used commercially.
3. If anything in this article infringes your rights, please contact me and I will delete it.

II. Libraries to import

import requests
from lxml import etree
import xlwt
from time import sleep
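
requests, lxml, and xlwt are third-party packages; if any of them is missing, a one-line pip install (standard practice, not part of the original post) brings them in:

pip install requests lxml xlwt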

III. Page analysis

First, open the Taobao home page and scroll down to the product recommendation area; clicking any item there opens its detail page. To crawl according to our own needs, we first have to find the pattern in the URL.
For example, clicking on men's watches (男手表) produces the following URL:

https://re.taobao.com/search?keyword=%E7%94%B7%E6%89%8B%E8%A1%A8&catid=&refpid=&_input_charset=utf8&spm=a21bo.2017.201874-p4p.9.5af911d9r0a6w7&clk1=4b099603c0fb26b731424628d9163765

In fact, only the keyword parameter matters: its value is the search term we typed, so the URL can be shortened to:

https://re.taobao.com/search?keyword=%E7%94%B7%E6%89%8B%E8%A1%A8

Scrolling to the bottom, I found that the results are a paginated list rather than an infinite waterfall feed, which makes things much easier.
Clicking through to the second and third pages reveals the pattern: a page parameter is appended after keyword, and its value is the page number:

https://re.taobao.com/search?keyword=%E7%94%B7%E6%89%8B%E8%A1%A8&page=2
https://re.taobao.com/search?keyword=%E7%94%B7%E6%89%8B%E8%A1%A8&page=3
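
As a side note, requests can build these paginated URLs by itself from a params dict, which also handles URL-encoding of the Chinese keyword. A minimal sketch (男手表, "men's watch", is the keyword from the example above):

import requests

params = {'keyword': '男手表', 'page': 2}
response = requests.get('https://re.taobao.com/search', params=params)
print(response.url)  # https://re.taobao.com/search?keyword=%E7%94%B7%E6%89%8B%E8%A1%A8&page=2
print(response.status_code)  # 200 if the request went through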

That completes the rough crawling plan; what remains is extracting the data.
The page layout is not complicated, so the fields can be pulled out easily with XPath. See the code below for the details.

IV. The code

# -*- coding: UTF-8 -*-
"""
@Author : 远方的星
@Time   : 2021/3/12 20:28
@CSDN   : https://blog.csdn.net/qq_44921056
@Tencent Cloud: https://cloud.tencent.com/developer/column/91164
"""
import requests
from lxml import etree
import xlwt
from time import sleep


# Fetch one page of results: shop name, product name, price, number of buyers
def get_content(url, param):
    response = requests.get(url=url, params=param)
    response.encoding = 'utf-8'
    html = etree.HTML(response.text)
    goods = html.xpath('//*[@id="J_waterfallWrapper"]/div')  # root node of each product card
    list_all = list()  # empty list to collect the rows
    for good in goods:
        price = good.xpath('./a/div[2]/p[1]/span[1]/strong/text()')[0]  # price
        goods_name = good.xpath('./a/div[2]/span/text()')[0]  # product name
        goods_shop = good.xpath('./a/div[2]/p[2]/span[1]/text()')[0]  # shop name
        # number of buyers; strip the trailing '人付款' ("people paid")
        number = good.xpath('./a/div[2]/p[2]/span[2]/text()')[0].replace('人付款', '')
        list_all.append([goods_shop, goods_name, price, number])  # one row per product
    return list_all


def main():
    keyword = input('Enter the product keyword to crawl: ')
    pages = int(input('Enter the number of pages to crawl: '))  # up to 100 pages, 200 items per page
    path = 'D:/taobao.xls'  # path of the output workbook
    workbook = xlwt.Workbook(encoding='utf-8', style_compression=0)
    worksheet = workbook.add_sheet('taobao', cell_overwrite_ok=True)  # sheet name; allow overwriting cells
    col = ('Shop', 'Product', 'Price/yuan', 'Buyers')
    for i in range(0, 4):
        worksheet.write(0, i, col[i])  # write the header row
    print('About to download the requested data, please wait')
    sleep(1)
    base_url = 'https://re.taobao.com/search?'
    for i in range(1, pages + 1):
        print('Downloading page {} ^-^'.format(i))
        sleep(1)
        param = {
            'keyword': keyword,
            'page': i
        }
        data_list = get_content(base_url, param)  # fetch this page's rows
        for j in range(len(data_list)):  # j: row index within page i
            data = data_list[j]
            for k in range(0, 4):  # k: column index
                # write(row, col, value); each page occupies a block of 200 rows
                worksheet.write((i - 1) * 200 + j + 1, k, data[k])
        print('Page {} finished!'.format(i))
        sleep(1)
    workbook.save(path)  # save the workbook
    print('All downloads complete!')


if __name__ == '__main__':
    main()
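
To sanity-check the export, the saved .xls file can be read back. A minimal sketch using xlrd (a separate package, not used in the original post) that prints the row count and the first rows:

import xlrd

workbook = xlrd.open_workbook('D:/taobao.xls')  # same path the crawler saved to
sheet = workbook.sheet_by_index(0)
print(sheet.nrows, 'rows in total (including the header)')
print(sheet.row_values(0))  # header row: shop, product, price, buyers
if sheet.nrows > 1:
    print(sheet.row_values(1))  # first product row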


V. Results display

(Screenshots: the exported Excel worksheet with the shop, product, price, and buyer-count columns.)

VI. Closing remarks

A single keyword yields at most 100 pages (every product I checked capped out at 100), with 200 product entries per page. Because some of these are hot promoted items, the data may be incomplete, but as an exercise it is more than enough.
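
Incomplete listings are also why a bare [0] index on an XPath result can raise an IndexError and abort the whole run. A small defensive tweak of my own (not in the original code): a helper that falls back to an empty string when a field is missing, used with the same XPath expressions as in get_content:

def first_or_blank(node, path):
    """Return the first text node matched by path, or '' when nothing matches."""
    matches = node.xpath(path)
    return matches[0] if matches else ''

# Inside get_content's loop, replace the direct [0] indexing, for example:
# price = first_or_blank(good, './a/div[2]/p[1]/span[1]/strong/text()')
# number = first_or_blank(good, './a/div[2]/p[2]/span[2]/text()').replace('人付款', '')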

If this article was even a little helpful to you, please like, follow, and bookmark it.

Author: 远方的星 (distant star)
CSDN: https://blog.csdn.net/qq_44921056
Tencent Cloud: https://cloud.tencent.com/developer/column/91164
This article is for learning and exchange only. Reprinting without the author's permission is prohibited, and it must not be used for any other purpose; violators will be held responsible.
