Python Crawler Primer, Example 5: Targeted Crawling of Taobao Product Information (Optimized Version)

Preface

  I wrote this example today while following the China University MOOC course taught by Song Tian (Beijing Institute of Technology). After writing the code exactly as shown in class, however, I found that something was wrong: no information could be crawled. After some fiddling I solved the problem and made a few optimizations, and I am writing this post for the record. The figure below shows the final crawling result.

(Figure: the final crawling result)

One, the original page to crawl

  The page crawled is a Taobao search results page, using women's clothing as the example; the original page is shown below. Since Taobao's product rankings update in real time, the order of the crawled results may differ from the order on the website. This example crawls each product's price and name and prefixes them with a serial number.

(Figure: the Taobao women's clothing search results page)

Two, programming ideas

  Teacher Song Tian explained this part in class; I have organized it here to share with you.

1. Function description

Goal: obtain the information on a Taobao search page and extract product names and prices.

Understanding:
(1). Obtain Taobao's search interface
(2). Handle page turning

Technical route: requests-re

2. Program structure design

Step 1: Submit a product search request and fetch pages in a loop.
Step 2: For each page, extract the product name and price information.
Step 3: Output the information to the screen.

Define three functions corresponding to the above three steps:

(1) getHTMLText(): fetch a page
(2) parsePage(): parse each fetched page
(3) printGoodsList(): print the product information to the screen

Three, the programming process

1. Solve the page turning problem

  First, let's take a look at the URLs of the first three search result pages:

(Figure: URLs of the first three pages)
  Each Taobao results page displays 44 products. Combining this with the URLs above, we can infer that the parameter s holds the index of the first product on the second, third, ... page (44, 88, ...). Based on this pattern we can construct the URL for any page.

The code is as follows:

for i in range(depth):  # construct the URL for each page
    url = start_url + '&s=' + str(44*i)
    html = getHTMLText(url)
    parsePage(infoList, html)
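As a quick sanity check, the loop above produces the following page offsets (the query string `dress` below is a made-up stand-in for the real search term):

```python
start_url = 'https://s.taobao.com/search?q=dress'  # toy query, not from the post
depth = 3

# Build one URL per page; s jumps by 44 because each page shows 44 products.
urls = [start_url + '&s=' + str(44 * i) for i in range(depth)]
for u in urls:
    print(u)
# https://s.taobao.com/search?q=dress&s=0
# https://s.taobao.com/search?q=dress&s=44
# https://s.taobao.com/search?q=dress&s=88
```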

2. Write the getHTMLText() function

def getHTMLText(url):  # fetch a page
    try:
        kv = {
            'user-agent': 'Mozilla/5.0',
            'cookie': ' '  # obtain your own cookie (see the link below)
        }
        r = requests.get(url, headers=kv, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        print("Failed to fetch the page")
        return ""  # return an empty string so the caller can continue

For how to obtain the cookie, please refer to my earlier post:
link: https://blog.csdn.net/weixin_44578172/article/details/109353017 .
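For reference, the cookie is passed as one long string inside the headers dict; the value below is a fabricated placeholder, not a working cookie:

```python
kv = {
    'user-agent': 'Mozilla/5.0',
    # Paste the cookie string copied from your browser's developer tools here.
    # This value is a made-up example and will NOT work:
    'cookie': 'thw=cn; t=example; _tb_token_=example',
}
```

Without a valid logged-in cookie, Taobao redirects the request to a login page and the regular expressions below find nothing, which is exactly the problem mentioned at the start of this post.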

3. Write the parsePage() function

(1). Content analysis and programming approach

  First look at the source code of the women's clothing search results page.
(Figure: source code of the search results page)
  Examining the source code, we find that the price and name of every Taobao product appear in key-value pairs, namely "view_price":"price" and "raw_title":"name". So to obtain these two pieces of information, we only need to search the returned text for view_price and raw_title and extract the content that follows each key. We use regular expressions for this.
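The idea can be demonstrated on a toy fragment of such a page (the string below is fabricated; a real Taobao response is far larger):

```python
import re

# A made-up fragment imitating the key-value pairs found in the page source.
html = '"view_price":"59.90","raw_title":"toy dress A","view_price":"128.00","raw_title":"toy dress B"'

plt = re.findall(r'"view_price":"[\d.]*"', html)   # greedy over digits and dots
tlt = re.findall(r'"raw_title":".*?"', html)       # non-greedy: stop at the first closing quote
print(plt)  # ['"view_price":"59.90"', '"view_price":"128.00"']
print(tlt)  # ['"raw_title":"toy dress A"', '"raw_title":"toy dress B"']
```

The non-greedy `.*?` in the title pattern matters: a greedy `.*` would run past the first closing quote and swallow the rest of the line.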

(2). Function code

def parsePage(ilt, html):  # parse each fetched page
# the two parameters are the result list and the HTML text of the page
    try:
        re1 = re.compile(r'\"view_price\"\:\"[\d\.]*\"')  # compiled regex for the product price
        re2 = re.compile(r'\"raw_title\"\:\".*?\"')  # compiled regex for the product name
        plt = re1.findall(html)
        tlt = re2.findall(html)
        #plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)  # equivalent without compiling
        #tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
        for i in range(len(plt)):
            # drop the view_price key and keep only the price; eval strips the
            # surrounding double quotes from the captured string
            price = eval(plt[i].split(':')[1])
            # split(':', 1) so that a colon inside a title does not break the split
            title = eval(tlt[i].split(':', 1)[1])
            ilt.append([price, title])
    except:
        print("Failed to parse the page")
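To see what the eval trick does, walk one matched fragment (toy data) through the two steps:

```python
fragment = '"view_price":"128.00"'

value = fragment.split(':', 1)[1]  # '"128.00"' - still wrapped in double quotes
price = eval(value)                # eval evaluates the string literal, stripping the quotes
print(repr(price))  # '128.00' - a plain str, ready for printing
```

A safer drop-in for this use is `ast.literal_eval`, which only accepts Python literals and cannot execute arbitrary expressions, but eval is what the course code uses.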

4. Write the printGoodsList() function

def printGoodsList(ilt):  # print the product information to the screen
    try:
        tplt = "{:4}\t{:8}\t{:16}"  # define the print template
        print(tplt.format("序号", "价格", "商品名称"))
        count = 0
        for s in ilt:
            count = count + 1
            print(tplt.format(count, s[0], s[1]))
    except:
        print("Failed to print the results")
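One cosmetic caveat: Chinese characters are usually rendered double-width in terminals, so the plain `{:8}` fields above may not line up perfectly. A common fix, not part of this post's code, is to center the fields and fill with the full-width space chr(12288):

```python
# Center each field; the nested {3} makes chr(12288) (U+3000, the full-width
# space) the fill character, so columns with Chinese text stay aligned.
tplt = "{0:^4}\t{1:{3}^8}\t{2:{3}^16}"
print(tplt.format("序号", "价格", "商品名称", chr(12288)))
print(tplt.format(1, "128.00", "连衣裙", chr(12288)))
```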

Four, the complete code

'''
Function description

Goal: obtain the information on a Taobao search page and extract product names and prices.

Understanding:
1. Obtain Taobao's search interface
2. Handle page turning

Technical route: requests-re

Program structure design
Step 1: Submit a product search request and fetch pages in a loop
Step 2: For each page, extract the product name and price information
Step 3: Output the information to the screen
'''
import requests
import re

def getHTMLText(url):  # fetch a page
    try:
        kv = {
            'user-agent': 'Mozilla/5.0',
            'cookie': ' '  # obtain your own cookie
        }
        r = requests.get(url, headers=kv, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        print("Failed to fetch the page")
        return ""  # return an empty string so the caller can continue

def parsePage(ilt, html):  # parse each fetched page
# the two parameters are the result list and the HTML text of the page
    try:
        re1 = re.compile(r'\"view_price\"\:\"[\d\.]*\"')  # compiled regex for the product price
        re2 = re.compile(r'\"raw_title\"\:\".*?\"')  # compiled regex for the product name
        plt = re1.findall(html)
        tlt = re2.findall(html)
        for i in range(len(plt)):
            # drop the view_price key; eval strips the surrounding double quotes
            price = eval(plt[i].split(':')[1])
            # split(':', 1) so that a colon inside a title does not break the split
            title = eval(tlt[i].split(':', 1)[1])
            ilt.append([price, title])
    except:
        print("Failed to parse the page")

def printGoodsList(ilt):  # print the product information to the screen
    try:
        tplt = "{:4}\t{:8}\t{:16}"  # define the print template
        print(tplt.format("序号", "价格", "商品名称"))
        count = 0
        for s in ilt:
            count = count + 1
            print(tplt.format(count, s[0], s[1]))
    except:
        print("Failed to print the results")

def main():
    goods = input("请输入想要搜索的商品:")  # the search keyword
    depth = int(input("请输入想要搜索商品的深度(整数):"))  # crawl depth, i.e. number of pages
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []  # the accumulated results
    for i in range(depth):  # construct the URL for each page
        try:
            url = start_url + '&s=' + str(44*i)
            html = getHTMLText(url)
            parsePage(infoList, html)
        except:
            continue
    printGoodsList(infoList)

main()  # call the main function
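Because a live Taobao page requires a valid login cookie, it helps to exercise the parse step offline with a canned string. The fragment below is fabricated and only imitates the page's key-value pairs:

```python
import re

def parsePage(ilt, html):  # same parsing logic as in the post
    re1 = re.compile(r'\"view_price\"\:\"[\d\.]*\"')
    re2 = re.compile(r'\"raw_title\"\:\".*?\"')
    plt = re1.findall(html)
    tlt = re2.findall(html)
    for i in range(len(plt)):
        ilt.append([eval(plt[i].split(':')[1]), eval(tlt[i].split(':', 1)[1])])

# Fabricated page fragment containing two products.
sample = '"view_price":"59.90","raw_title":"toy dress A","view_price":"128.00","raw_title":"toy dress B"'

infoList = []
parsePage(infoList, sample)
print(infoList)  # [['59.90', 'toy dress A'], ['128.00', 'toy dress B']]
```

If this offline check passes but the real crawl returns nothing, the cookie (not the parsing) is the likely culprit.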

  This is the end of the article; please point out any errors you find.

Reference

China University MOOC, Python Web Crawler and Information Extraction
https://www.icourse163.org/course/BIT-1001870001

Origin: blog.csdn.net/weixin_44578172/article/details/109356900