Python Taobao crawler: crawling Taobao product data with requests

I took a Python class at school and was impressed by how powerful the language is. Over the winter vacation I had time to tinker with it, and I hope to discuss and learn together with everyone. Without further ado, let's get straight to the point.

requests is an HTTP library for Python that can handle most HTTP-related tasks, and it is very convenient for routine data scraping.
See the manual for details:
http://docs.python-requests.org/zh_CN/latest/user/quickstart.html

For example, these two lines of code:
url = 'https://www.baidu.com'
resp = requests.get(url)

The return value of requests.get() is a Response object. This object has many attributes, such as text, encoding, status_code, links, and so on (use help() to see them all). The text attribute (resp.text) contains the returned HTML, which is where the data we want to crawl lives; encoding is the character encoding used to decode resp.text, and it can be modified.
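As a quick sketch of inspecting these attributes (the attribute names are standard requests API; the URL is just the baidu.com example from above):

import requests

resp = requests.get('https://www.baidu.com')
print(resp.status_code)   # HTTP status code, e.g. 200
print(resp.encoding)      # the encoding requests guessed from the response headers
resp.encoding = 'utf-8'   # override the encoding if the guess is wrong
print(resp.text[:200])    # first 200 characters of the decoded HTML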

requests also makes it very convenient to pass URL parameters. To explain briefly, URL parameters locate an information resource on the network. For example:
https://s.taobao.com/search?q=python  searches Taobao for python
https://s.taobao.com/search?q=java  searches Taobao for java

Comparing the two, the only difference is the URL parameter q. We can supply these URL parameters with a dictionary; see the manual for details!
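A minimal sketch of supplying the parameter as a dictionary (the params keyword is standard requests usage; the query value here is just an illustration):

import requests

payload = {'q': 'python'}
resp = requests.get('https://s.taobao.com/search', params=payload)
print(resp.url)   # https://s.taobao.com/search?q=python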

The next example does the following: search Taobao for python, and save the name, unit price, and seller location of the products on pages 1 to 100 into the file taobao_test.txt.
Preparation:
1. A Python development environment
2. The re library
3. The requests library

#coding=utf-8
import re
import requests

url = 'https://s.taobao.com/search'
payload = {'q': 'python', 's': '1', 'ie': 'utf8'}  # dictionary of URL parameters
file = open('taobao_test.txt', 'w', encoding='utf-8')

for k in range(0, 100):        # 100 iterations, one per page of product data

    payload['s'] = 44 * k + 1  # the URL parameter changed here is s: s=1 is page 1, s=45 is page 2, s=89 is page 3, and so on
    resp = requests.get(url, params=payload)
    print(resp.url)            # print the URL actually requested
    resp.encoding = 'utf-8'    # set the encoding
    title = re.findall(r'"raw_title":"([^"]+)"', resp.text, re.I)  # regex captures every raw_title (the product name); price and location follow
    price = re.findall(r'"view_price":"([^"]+)"', resp.text, re.I)
    loc = re.findall(r'"item_loc":"([^"]+)"', resp.text, re.I)
    x = len(title)             # number of products on this page

    for i in range(0, x):      # write the list data to the file
        file.write(str(k * 44 + i + 1) + 'Title: ' + title[i] + '\n' + 'Price: ' + price[i] + '\n' + 'Location: ' + loc[i] + '\n\n')


file.close()

[Screenshot: contents of the taobao_test.txt file]

The code is quite short, which shows how convenient the requests module is. It could be tightened up a bit, but the contents of the txt file come out fine.
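As one possible tightening (my own suggestion, not part of the original post): the inner write loop can zip the three lists together, which also guards against a page where the regexes return lists of different lengths:

# A sketch of the inner loop using zip(): it pairs up titles, prices and
# locations, enumerate() numbers the items, and zip() stops at the
# shortest list if their lengths ever differ.
for i, (t, p, l) in enumerate(zip(title, price, loc)):
    file.write(str(k * 44 + i + 1) + 'Title: ' + t + '\n'
               + 'Price: ' + p + '\n'
               + 'Location: ' + l + '\n\n')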

PS: This is my first technical blog post and a starting point. I hope we can learn from each other and grow together, and this also serves as a record of my study notes.
