A Python example: crawling Taobao to compare product prices (using the re library and working around Taobao's anti-crawler mechanism)

Example Overview

Goal: fetch search results for a given category of goods and extract each product's name and price.

Feasibility Analysis

1. Check Taobao's robots protocol at https://www.taobao.com/robots.txt

The robots.txt shows that Taobao does not allow anyone to crawl its pages. So, to stay a law-abiding citizen and avoid unnecessary trouble: first, do not crawl Taobao routinely; second, never use such a crawler for any commercial purpose. The code below is for learning the technique only.

Program Structure

1. Submit the product search request and fetch each results page in a loop.

2. Parse each page's content to extract product names and prices.

3. Print the collected information.

Structural Analysis

Take a search keyword as an example: say I want to look at sweaters (卫衣).


The search shows a hundred result pages, so we must decide how many pages to crawl. For a single page, one URL is enough; to crawl several pages we need several URLs, which means finding the pattern that relates the links to each other.

The first page's URL:

https://s.taobao.com/search?initiative_id=tbindexz_20170306&ie=utf8&spm=a21bo.2017.201856-taobao-item.2&sourceId=tb.index&search_type=item&ssid=s5e&commend=all&imgfile=&q=卫衣&suggest=history_1&_input_charset=utf8&wq=&suggest_query=&source=suggest&bcoffset=6&ntoffset=6&p4ppushleft=1%2C48&s=0

The second page URL:

https://s.taobao.com/search?initiative_id=tbindexz_20170306&ie=utf8&spm=a21bo.2017.201856-taobao-item.2&sourceId=tb.index&search_type=item&ssid=s5e&commend=all&imgfile=&q=卫衣&suggest=history_1&_input_charset=utf8&wq=&suggest_query=&source=suggest&bcoffset=3&ntoffset=3&p4ppushleft=1%2C48&s=44

The third page's URL:

https://s.taobao.com/search?initiative_id=tbindexz_20170306&ie=utf8&spm=a21bo.2017.201856-taobao-item.2&sourceId=tb.index&search_type=item&ssid=s5e&commend=all&imgfile=&q=卫衣&suggest=history_1&_input_charset=utf8&wq=&suggest_query=&source=suggest&bcoffset=0&ntoffset=6&p4ppushleft=1%2C48&s=88

Comparing three consecutive page URLs shows that the trailing s parameter increases by 44 per page (0, 44, 88). Once this pattern is found, the pages can be requested in a loop.
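The pattern above can be sketched in a few lines: only the s parameter changes between pages, so page i is generated by appending s=44*i to the base search URL (a simplified URL without the other query parameters).

```python
# Minimal sketch of the page-URL pattern: the trailing s parameter
# advances by 44 per results page. The keyword is the article's example query.
base_url = 'https://s.taobao.com/search?q=' + '卫衣'
urls = [base_url + '&s=' + str(44 * i) for i in range(3)]
for u in urls:
    print(u)
```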

A First Version

import requests
import re

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def parsePage(ilt, html):
    # Use regular expressions to extract product names and prices.
    try:
        # \" escapes the quotes around the "view_price" key and its value.
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
        # *? makes the match non-greedy (shortest match).
        tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
        for i in range(len(plt)):
            price = eval(plt[i].split(":")[1])
            title = eval(tlt[i].split(":")[1])
            ilt.append([price, title])
    except:
        print("")

def printGoodslist(ilt):
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Product Name"))
    count = 0
    for x in ilt:
        count = count + 1
        print(tplt.format(count, x[0], x[1]))

def main():
    goods = '卫衣'
    depth = 2
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44 * i)
            html = getHTMLText(url)
            parsePage(infoList, html)
        except:
            continue
    printGoodslist(infoList)

main()
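To see what the two regular expressions in parsePage actually match, here is a standalone demonstration on an invented string that mimics the JSON fragments Taobao embeds in its search results page (the sample data below is made up for illustration, not real page content):

```python
import re

# Invented fragment mimicking the embedded JSON of a Taobao results page.
sample = ('"raw_title":"hooded sweater","view_price":"128.00",'
          '"raw_title":"zip-up sweater","view_price":"99.90",')

# The same patterns used in parsePage:
prices = re.findall(r'\"view_price\"\:\"[\d\.]*\"', sample)
titles = re.findall(r'\"raw_title\"\:\".*?\"', sample)

# Each match still carries its key and quotes; splitting on ":" and
# eval-ing the second part strips the quotes from the value.
print(eval(prices[0].split(':')[1]))  # 128.00
print(eval(titles[0].split(':')[1]))  # hooded sweater
```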

 

The program compiled and ran without errors, but no product information appeared in the output, which was puzzling. I turned to Baidu for help. First I checked whether the URL was wrong, but pasting it into a browser opened the Taobao results page correctly, so the URL was fine. The problem was Taobao itself: it has an anti-crawler mechanism that requires users to log in. To crawl the data, the program must simulate a logged-in browser session.

1. Open Taobao in a browser and log in.

2. Search for what you want to crawl, e.g. sweaters, then press F12 to open the Developer Tools.

3. Select the Network tab, then filter by Doc.

4. Go back to the Taobao home page — do not close the Developer Tools!

A freshly loaded document appears in the file list.

5. Open that file, find the cookie and user-agent fields, and copy their contents in full.


Finally, in the original request code, build a headers dictionary that simulates the browser, containing the cookie and user-agent you just copied.
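The idea can be sketched as a plain dictionary; the cookie value below is a placeholder, not a real credential — paste in the string copied from your own browser:

```python
# Sketch of the headers dictionary passed to requests.get. The cookie
# value is a placeholder; replace it with the string copied from the
# browser's Developer Tools. Note the comma between the two entries.
headers = {
    'cookie': '<your copied cookie string>',
    'user-agent': 'Mozilla/5.0',
}
# Usage (not executed here): requests.get(url, headers=headers, timeout=30)
```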

Full Code:

import requests
import re

def getHTMLText(url):
    headers={'cookie':'td_cookie=18446744071423230592; thw=cn; v=0; cna=SCVpFkZXfCwCAT24cx3PJgie; t=6adef129ce0b98c6fcd52f3e83e3be03; cookie2=7de44eefb19e3e48e25b7349163592b7; _tb_token_=f1fae43e5e551; unb=3345403123; uc3=nk2=F6k3HMt8ZHbGobgMG0t6YMg7MKU%3D&vt3=F8dByuQFmIAq493a88Y%3D&lg2=W5iHLLyFOGW7aA%3D%3D&id2=UNN5FEBc3j%2FI9w%3D%3D; csg=07879b0c; lgc=t_1499166546318_0384; cookie17=UNN5FEBc3j%2FI9w%3D%3D; dnk=t_1499166546318_0384; skt=759aebdc118b2fc5; existShop=MTU3NTEwNzAyMg%3D%3D; uc4=id4=0%40UgQxkzEr7yNNkd0wQjAOQOK5hAra&nk4=0%40FbMocp0bShNOwIAboxPdw7pZW0Ru%2FnrngZiTM4a03Q%3D%3D; tracknick=t_1499166546318_0384; _cc_=UIHiLt3xSw%3D%3D; tg=0; _l_g_=Ug%3D%3D; sg=439; _nk_=t_1499166546318_0384; cookie1=B0TwtzQNNmewbhSpcaaRe7U24nc6DXOpwhexZLEN8Zo%3D; mt=ci=0_1; _m_h5_tk=ec0a32b82d6a8d5c46fe6f873373169b_1575114952532; _m_h5_tk_enc=cfea89ad4f02b520c3a094931d00e376; enc=CnjhIlaGaoA3J%2FSi2PeXU8%2FNC4cXQUAZjulyZI%2Bd9Z8JjGflldsE%2F%2B8F0Ty2oLD4v1wKgm3CuiGftr11IfyB5w%3D%3D; hng=CN%7Czh-CN%7CCNY%7C156; l=dBIBcdfeq5nSzFl5BOCa-urza77ThIRvfuPzaNbMi_5Ia1T6YV7OknJtce96cjWfTG8B4HAa5Iy9-etlwrZEMnMgcGAw_xDc.; uc1=cookie15=VFC%2FuZ9ayeYq2g%3D%3D&cookie14=UoTbmEp9zNxMrw%3D%3D; isg=BDk53DNPQMq9RRxe_Fnoei4wSKUTRi34hR8HPVturmDf4ll0o5Y9yKc0YOYUrsUw',
             'user-agent':'Mozilla/5.0'}
    try:
        r = requests.get(url, headers=headers, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def parsePage(ilt, html):
    # Use regular expressions to extract product names and prices.
    try:
        # \" escapes the quotes around the "view_price" key and its value.
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
        # *? makes the match non-greedy (shortest match).
        tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
        for i in range(len(plt)):
            price = eval(plt[i].split(":")[1])
            title = eval(tlt[i].split(":")[1])
            ilt.append([price, title])
    except:
        print("")

def printGoodslist(ilt):
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Product Name"))
    count = 0
    for x in ilt:
        count = count + 1
        print(tplt.format(count, x[0], x[1]))

def main():
    goods = '卫衣'
    depth = 2
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44 * i)
            html = getHTMLText(url)
            parsePage(infoList, html)
        except:
            continue
    printGoodslist(infoList)

main()

While writing the headers dictionary, the program kept raising an error at the colon after user-agent. I struggled with it for a long time, and even Baidu was no help.

Only at the end did I realize I had forgotten the comma between the dictionary's key-value pairs — a mistake you don't know whether to laugh or cry at.

The final run result:



Origin www.cnblogs.com/yangbiao6/p/11965408.html