Use python crawling Taobao commodity names and prices

Recent bored at home, due to open outlets at home, my mother was a headache for the commodity from the title, so I wanted to crawl some of the information in Taobao.

Station found a small break to learn video, with the video is found again in 2018, and Taobao in 2019 may join the anti-crawling mechanism, using the normal method climb less than the result.

But there is a way to crawl, first Taobao landing page version, and then to search, get cookie and user-agent.

code show as below:

import requests
import re


def getHTMLText(url):
    kv = {'cookie':'cna=54y8Fm+TyioCATzcP+BwvvDA; thw=cn; lgc=%5Cu58A8%5Cu8FF9%5Cu9519%5Cu54AF%5Cu548C; tracknick=%5Cu58A8%5Cu8FF9%5Cu9519%5Cu54AF%5Cu548C; tg=0; enc=p6YWWbSWACqr5t1PcdDiNADVd7zKpnQG9X%2FZ666%2Fl7CM9%2FsOLpiM1WX5QQNnS%2B5ydtOFYKtHlmwg9AgeUX0Rjg%3D%3D; mt=ci=25_1; hng=CN%7Czh-CN%7CCNY%7C156; _m_h5_tk=9564049e168909dda591afc00632fed0_1581060181038; _m_h5_tk_enc=69c0046fffbc1f3750258e3f8fb06eb6; v=0; t=a8c0eb2a0d2265808242379d7f81bf64; cookie2=1f0913be01a97ec8038fb3c5354c793a; _tb_token_=ed66e3be155ef; alitrackid=www.taobao.com; _samesite_flag_=true; sgcookie=Q0iMW4gaGTiQdKuBbG0Q; unb=2514996592; uc3=vt3=F8dBxdzxpjLCa1%2BSc2Y%3D&id2=UU2zVEbkyyO2WQ%3D%3D&nk2=p2NDIgVDkDZm7A%3D%3D&lg2=VT5L2FSpMGV7TQ%3D%3D; csg=5b7ad9c4; cookie17=UU2zVEbkyyO2WQ%3D%3D; dnk=%5Cu58A8%5Cu8FF9%5Cu9519%5Cu54AF%5Cu548C; skt=c0aa8f1a6e59edcd; existShop=MTU4MTU5NTkzOQ%3D%3D; uc4=nk4=0%40pVWU4YQkw8jOJbWHe0nFlK5IE6%2Bq&id4=0%40U2%2F0ltjMwUTM8KkFFqREWIu1Zr5o; _cc_=VFC%2FuZ9ajQ%3D%3D; _l_g_=Ug%3D%3D; sg=%E5%92%8C22; _nk_=%5Cu58A8%5Cu8FF9%5Cu9519%5Cu54AF%5Cu548C; cookie1=BxAV4i9dmFVSXeCQYeZRwnecoaXNB46utcHDDLh6ZgY%3D; lastalitrackid=i.taobao.com; uc1=cookie16=W5iHLLyFPlMGbLDwA%2BdvAGZqLg%3D%3D&cookie21=VT5L2FSpccLuJBreK%2BBd&cookie15=W5iHLLyFOGW7aA%3D%3D&existShop=false&pas=0&cookie14=UoTUO8VjZOZt2g%3D%3D&tag=8&lng=zh_CN; JSESSIONID=214606DAA803E833B2D19064DCFFE666; isg=BFZW9gNpFGQi5iC5GO6s0NHppwxY95oxuW-OAcC_eTnUg_cdPocWQaoxGxdvK5JJ; l=dBLXnIncQINxmF32BOCgC40XkGbTvIRfgukohvEHi_5Kv_8sGz_Oo7aMMEJ6cfWAMjxM4cULng2tieLYJiuKHdGJ4AadZxDDB',
          'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'}
    try:
        r = requests.get(url, headers=kv,timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""


def parsePage (ant, html):
    try:
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
        tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
        for i in range(len(plt)):
            price = eval(plt[i].split(':')[1])
            title = eval(tlt[i].split(':')[1])
            ilt.append([price, title])
    except:
        print("")


def printGoodsList(ilt):
    tplt = "{:4}\t{:8}\t{:16}"
    print (tplt.format ( "number", "price", "trade name"))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))


def main():
    goods = 'aged clothing'
    depth = 3
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44 * i)
            html = getHTMLText(url)
            parsePage (infoList, html)
        except:
            continue
    printGoodsList(infoList)


main()

 The end result of crawling as shown below:

 

 problem:

Need to replace the cookie intermittent, otherwise the data will not climb climb several times

Guess you like

Origin www.cnblogs.com/lijiahaoAA/p/12305034.html