Crawling is healthier

Definition of a crawler

A crawler is a program or script that automatically fetches information from the web according to certain rules.

Simply put, a web crawler is a program, developed around some algorithm, that discovers and captures data mainly by following URLs.

Let's cover the prerequisites first; this crawler uses the following libraries:

  1. requests, a very practical HTTP client library for Python

  2. json, for processing the response data

  3. csv, for storing the data


Analysis description

We will crawl Taobao product information; the data is mainly used to analyze market trends and to draw up marketing plans. The features are as follows:

  1. The user supplies a keyword, and we fetch the data returned by Taobao's search for it

  2. For each product, extract the title, price, sales volume, store name, and store location

  3. Store the data to a file

These features, in order, mirror the crawler development process:

Crawling rules -> Data cleaning -> Data storage
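As a structural sketch only (the function names and bodies here are illustrative, not from the original code), those three stages might map onto three functions:

import requests
import csv

def crawl(url):
    # Crawling rules: fetch the raw response body
    return requests.get(url).text

def clean(raw):
    # Data cleaning: extract the fields we need (implemented in the sections below)
    return raw

def store(rows, path):
    # Data storage: append the cleaned rows to a CSV file
    with open(path, 'a', newline='') as f:
        csv.writer(f).writerows(rows)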

Open the Taobao site in Google Chrome and use the search function with the keyword "四件套" (four-piece bedding set).

Use the browser's developer tools to capture the request information. If the data cannot be found in the HTML response, it was probably requested from the backend via Ajax and then rendered on the page by the front end.

Click the XHR tab to view the data requests, then click Preview to inspect the response data format of each URL.

We find that the data is in JSON format: the product title, price, sales volume, store name, and store location correspond to the fields raw_title, view_price, view_sales, nick, and item_loc.
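For orientation, the Preview pane shows a JSONP response of roughly the shape below (the field values are placeholders; the field names and the nesting path are the ones used later in the code):

jsonp227({
    "API.CustomizedApi": {
        "itemlist": {
            "auctions": [
                {
                    "raw_title": "...",
                    "view_price": "...",
                    "view_sales": "...",
                    "nick": "...",
                    "item_loc": "..."
                }
            ]
        }
    }
})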

Let's take the request link and study it:

https://s.taobao.com/api?_ksTS=1540176287763_226&callback=jsonp227&ajax=true&m=customized&sourceId=tb.index&_input_charset=utf-8&bcoffset=-1&commend=all&suggest=history_1&source=suggest&search_type=item&ssid=s5-e&suggest_query=&spm=a21bo.2017.201856-taobao-item.2&q=四件套&s=36&initiative_id=tbindexz_20170306&imgfile=&wq=&ie=utf8&rn=9e6055e3af9ce03b743aa131279aacfd

We can simplify this long link a bit:

https://s.taobao.com/api?callback=jsonp227&m=customized&q=%E5%9B%9B%E4%BB%B6%E5%A5%97&s=36

From the simplified URL we can see two parameters that can be set dynamically to fetch different products:

  1. q = 四件套 — the search keyword

  2. s = 36 — the paging parameter
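As a minimal sketch, the same simplified URL can be built with requests' params argument (parameter names taken directly from the URL above, nothing else assumed):

import requests

# q is the search keyword, s is the paging parameter
params = {
    'callback': 'jsonp227',
    'm': 'customized',
    'q': '四件套',
    's': 36,
}
r = requests.get('https://s.taobao.com/api', params=params)
print(r.url)  # reproduces the simplified URL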

Implementation

Based on our analysis of the site, the following code fetches a single page of product information for a single keyword:

import requests
import json

url = "https://s.taobao.com/api?callback=jsonp227&m=customized&q=四件套&s=36"
r = requests.get(url)
response = r.text

# Cut the response down to standard JSON.
# The Ajax response is a string of the form jsonp227(XXX), where the XXX
# part is the JSON data, so first use split() to extract the XXX part,
# then parse that string as JSON.
response = response.split('(')[1].split(')')[0]

# Parse the JSON
response_dict = json.loads(response)

# Locate the product information list
response_auctions_info = response_dict['API.CustomizedApi']['itemlist']['auctions']
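With the product list in hand, a quick way to sanity-check the result is to print the five fields identified earlier:

for item in response_auctions_info:
    # title, price, sales volume, store name, store location
    print(item['raw_title'], item['view_price'], item['view_sales'],
          item['nick'], item['item_loc'])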

To fetch multiple pages of data, add a loop around the code above:

for p in range(88):
    url = "https://s.taobao.com/api?callback=jsonp227&m=customized&q=四件套&s=%s" % (p)
    r = requests.get(url)
    # Get the response body as a string
    response = r.text
    # Extract the JSON part
    response = response.split('(')[1].split(')')[0]
    # Parse the data
    response_dict = json.loads(response)
    # Product information
    response_auctions_info = response_dict['API.CustomizedApi']['itemlist']['auctions']
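One caveat (my addition, not in the original): the loop above fires 88 requests back to back. A short pause between requests is the usual courtesy and reduces the chance of being blocked, e.g.:

import time

for p in range(88):
    # ... fetch and parse the page as above ...
    time.sleep(1)  # wait one second between requests (an assumed, illustrative value)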

The code above handles only a single search keyword. To support multiple keywords, add one more loop, as shown below:

 
 
for k in ['四件套', '手机壳']:
    for p in range(88):
        url = "https://s.taobao.com/api?callback=jsonp227&m=customized&q=%s&s=%s" % (k, p)
        r = requests.get(url)
        response = r.text
        response = response.split('(')[1].split(')')[0]
        response_dict = json.loads(response)
        # Product information
        response_auctions_info = response_dict['API.CustomizedApi']['itemlist']['auctions']
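Since the fetch-and-parse steps are now repeated verbatim in several places, it is natural to factor them into a helper. A sketch, with the name fetch_page being my own choice rather than anything from the original:

import requests
import json

def fetch_page(keyword, page):
    # Fetch one page of Taobao search results and return the product list
    url = "https://s.taobao.com/api?callback=jsonp227&m=customized&q=%s&s=%s" % (keyword, page)
    r = requests.get(url)
    body = r.text.split('(')[1].split(')')[0]
    return json.loads(body)['API.CustomizedApi']['itemlist']['auctions']

for k in ['四件套', '手机壳']:
    for p in range(88):
        response_auctions_info = fetch_page(k, p)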

Data storage

We store the data as a CSV file. Let's define a function whose parameters are response_auctions_info, the collection of product data, and file_name, the name of the file to save to:

 
 
def get_auctions_info(response_auctions_info, file_name):
    with open(file_name, 'a', newline='') as csvfile:
        # Create a CSV writer object for writing to the CSV file
        writer = csv.writer(csvfile)
        for i in response_auctions_info:
            # Check whether this item has already been recorded
            if str(i['raw_title']) not in auctions_distinct:
                # Write one row: title, price, sales, store name, store location.
                # Note that this function does not write a CSV header, so the
                # CSV file and its header should be created before the crawl starts.
                writer.writerow([i['raw_title'], i['view_price'], i['view_sales'],
                                 i['nick'], i['item_loc']])
                auctions_distinct.append(str(i['raw_title']))
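As the comments above note, the function writes no header and relies on a module-level auctions_distinct list for de-duplication, so both should be set up before the crawl starts. A minimal sketch (the file name taobao_data.csv is illustrative):

import csv

# De-duplication list consulted by get_auctions_info
auctions_distinct = []

# Create the CSV file and write its header once, before crawling
with open('taobao_data.csv', 'w', newline='') as f:
    csv.writer(f).writerow(['raw_title', 'view_price', 'view_sales', 'nick', 'item_loc'])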

Finally, here is what the crawl results look like: [screenshot omitted]



Origin blog.51cto.com/15067249/2576229