I. Introduction
1. The crawled content is limited to data that any user can browse on Taobao pages.
2. It is for communication and learning only and will not be used commercially.
3. If this article infringes any rights, please contact me and I will delete it.
II. Libraries to import
import requests
from lxml import etree
import xlwt
from time import sleep
III. Page analysis
First, open the Taobao homepage, scroll down to the product area, and click an item to reach its detail page. To crawl exactly what we need, we have to find the pattern in the URL.
For example, clicking on men's watches produces the following URL:
https://re.taobao.com/search?keyword=%E7%94%B7%E6%89%8B%E8%A1%A8&catid=&refpid=&_input_charset=utf8&spm=a21bo.2017.201874-p4p.9.5af911d9r0a6w7&clk1=4b099603c0fb26b731424628d9163765
In fact, only the `keyword` parameter matters: everything after `keyword=` is the URL-encoded search term we typed, so the URL can be trimmed to:
https://re.taobao.com/search?keyword=%E7%94%B7%E6%89%8B%E8%A1%A8
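The `%E7%94%B7%E6%89%8B%E8%A1%A8` part is simply the UTF-8 percent-encoding of the search term 男手表 ("men's watch"). A minimal standard-library sketch to confirm this and rebuild the simplified URL:

```python
from urllib.parse import quote, unquote

# Percent-encode the search term the same way the browser does
keyword = '男手表'  # "men's watch"
encoded = quote(keyword)
print(encoded)           # %E7%94%B7%E6%89%8B%E8%A1%A8
print(unquote(encoded))  # decodes back to 男手表

# Rebuild the simplified search URL from the keyword
url = 'https://re.taobao.com/search?keyword=' + encoded
print(url)
```

So any keyword can be plugged in; there is no need to copy the encoded string from the browser by hand.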
Then I scrolled to the bottom and found that the results are not an infinite-scroll waterfall but ordinary pagination, which makes things much easier. Clicking the second and third pages reveals the rule: a `page` parameter is appended after `keyword`, and its value is the page number:
https://re.taobao.com/search?keyword=%E7%94%B7%E6%89%8B%E8%A1%A8&page=2
https://re.taobao.com/search?keyword=%E7%94%B7%E6%89%8B%E8%A1%A8&page=3
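With both parameters identified, the URL for any page can be generated programmatically. A short sketch using only the standard library (requests does the same thing internally when you pass a `params` dict):

```python
from urllib.parse import urlencode

base_url = 'https://re.taobao.com/search'

# Build the URL for each result page from the two parameters found above
for page in range(2, 4):
    query = urlencode({'keyword': '男手表', 'page': page})
    print(base_url + '?' + query)
# https://re.taobao.com/search?keyword=%E7%94%B7%E6%89%8B%E8%A1%A8&page=2
# https://re.taobao.com/search?keyword=%E7%94%B7%E6%89%8B%E8%A1%A8&page=3
```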
This completes the rough crawling plan; all that remains is extracting the data. The page layout is not very complicated, so XPath can pull everything out easily. See the code section for details.
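To illustrate the XPath technique in isolation, here is a toy example. The HTML below is a made-up stand-in shaped loosely like one product card (Taobao's real markup differs); it only demonstrates the kind of relative XPath calls the crawler uses:

```python
from lxml import etree

# Illustrative HTML only: a fake product card, not Taobao's actual markup
html_text = '''
<div id="J_waterfallWrapper">
  <div><a><div></div><div>
    <span>Quartz watch</span>
    <p><span><strong>199.00</strong></span></p>
    <p><span>SomeShop</span><span>356人付款</span></p>
  </div></a></div>
</div>
'''
html = etree.HTML(html_text)
goods = html.xpath('//*[@id="J_waterfallWrapper"]/div')  # one node per product
for good in goods:
    name = good.xpath('./a/div[2]/span/text()')[0]
    price = good.xpath('./a/div[2]/p[1]/span/strong/text()')[0]
    buyers = good.xpath('./a/div[2]/p[2]/span[2]/text()')[0].replace('人付款', '')
    print(name, price, buyers)  # Quartz watch 199.00 356
```

The pattern is: select the list of product root nodes once, then use paths relative to each node (`./a/div[2]/…`) to pick out the individual fields.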
IV. The code
# -*- coding: UTF-8 -*-
"""
@Author: 远方的星
@Time: 2021/3/12 20:28
@CSDN: https://blog.csdn.net/qq_44921056
@Tencent Cloud: https://cloud.tencent.com/developer/column/91164
"""
import requests
from lxml import etree
import xlwt
from time import sleep


# Fetch one result page and extract shop name, product name, price, and buyer count
def get_content(url, param):
    response = requests.get(url=url, params=param)
    response.encoding = 'utf-8'
    html = etree.HTML(response.text)
    goods = html.xpath('//*[@id="J_waterfallWrapper"]/div')  # root node of each product
    list_all = list()  # collect one [shop, name, price, buyers] row per product
    for good in goods:
        price = good.xpath('./a/div[2]/p[1]/span[1]/strong/text()')[0]  # price
        goods_name = good.xpath('./a/div[2]/span/text()')[0]  # product name
        goods_shop = good.xpath('./a/div[2]/p[2]/span[1]/text()')[0]  # shop name
        # buyer count; strip the trailing "人付款" ("people paid")
        number = good.xpath('./a/div[2]/p[2]/span[2]/text()')[0].replace('人付款', '')
        list_all.append([goods_shop, goods_name, price, number])
    return list_all


def main():
    keyword = input('Enter the product keyword to crawl: ')
    pages = int(input('Enter the number of pages to crawl: '))  # up to 100 pages, 200 items per page
    path = 'D:/taobao.xls'  # output workbook path
    workbook = xlwt.Workbook(encoding='utf-8', style_compression=0)
    worksheet = workbook.add_sheet('taobao', cell_overwrite_ok=True)  # overwritable sheet
    col = ('Shop', 'Product', 'Price/yuan', 'Buyers')
    for i in range(0, 4):
        worksheet.write(0, i, col[i])  # header row
    print('About to download the requested data, please wait')
    sleep(1)
    base_url = 'https://re.taobao.com/search?'
    for i in range(1, pages + 1):
        print('Downloading page {} ^-^'.format(i))
        sleep(1)
        param = {
            'keyword': keyword,
            'page': i
        }
        data_list = get_content(base_url, param)  # rows for the current page
        for j, row in enumerate(data_list):  # j is the row index within this page
            for k in range(0, 4):  # k is the column index
                # worksheet.write(x, y, z): x = row, y = column, z = cell content
                worksheet.write((i - 1) * 200 + j + 1, k, row[k])
        print('Page {} downloaded!'.format(i))
        sleep(1)
    workbook.save(path)  # save the workbook
    print('All download tasks completed!')


if __name__ == '__main__':
    main()
V. Results display
VI. Blogger's remarks
A single keyword yields at most 100 pages (every product I checked capped at 100 pages), with 200 product entries per page. Because many listings are promoted items, the data may be incomplete, but it is enough for practice.
If this article helped you even a little, please give it a like, a follow, and a bookmark.
Author: 远方的星
CSDN: https://blog.csdn.net/qq_44921056
Tencent Cloud: https://cloud.tencent.com/developer/column/91164
This article is for learning and exchange only. Reprinting without the author's permission is prohibited, and it must not be used for any other purpose; violators will be held responsible.