Crawler: scraping Tmall product cover information

# Notes on the search fields used to scrape the cover information

'''
q: the search keyword; Chinese is accepted

Fields Tmall uses to force a login:
sort: sort order
s: offset, i.e. which product the listing starts from

Question 1: after deleting s you can skip the login, but you can only reach the first page. Why?
Solution: clicking through several pages shows that Taobao's check is incomplete. After using the page-jump box,
    the URL keeps only the three fields q, totalPage and jumpto, so the other pages can be reached
    simply by changing the value of jumpto.

Question 2: with the login problem solved, there is no way to know the total number of pages (totalPage) in advance.
Solution: as long as the page renders, the data can be pulled from it. Searching the page source for totalPage
    shows it sitting in an <input> box, so its value can be read from that element's attribute
    after locating the element with a CSS selector.
'''
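To make the three fields concrete, here is a small illustration (not from the original post; the keyword and page values are placeholders) of how q, totalPage and jumpto end up in the paged search URL:

from urllib.parse import urlencode

# Illustration only: what one paged request looks like once the three fields are filled in.
demo_params = {'q': 'iphone', 'totalPage': 100, 'jumpto': 3}   # placeholder values
print('https://list.tmall.com/search_product.htm?' + urlencode(demo_params))
# -> https://list.tmall.com/search_product.htm?q=iphone&totalPage=100&jumpto=3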

from requests_html import HTMLSession

session = HTMLSession()

keyword = input('Enter the product to crawl: ')


params = {
     'totalPage':12,
    'jumpto':2,
    'q':keyword
}

url = 'https://list.tmall.com/search_product.htm?'

# Get the total number of pages:
def get_totalPage(url,params):
    r = session.request(method='get',url=url,params=params)
    totalPage = int(r.html.find('[name="totalPage"]',first=True).attrs.get('value'))
    params['totalPage'] = totalPage

# Change the value of the jumpto field to visit the remaining pages
def get_params(params, totalPage):
    for i in range(1, totalPage + 1):
        params['jumpto'] += 1          # bump the page pointer before each request
        yield params


# Get the product information
def get_info(url, params):
    r = session.request(method='get', params=params, url=url)
    product_list = r.html.find('.product')
     for product_element in product_list:
        try:
            product_img_url = product_element.find('.productImg-wrap a img',first=True).attrs.get('src')
            product_title= product_element.find('.productTitle a',first=True).attrs.get('title')
            product_price = product_element.find('.productPrice em',first=True).attrs.get('title')
            product_shop_url = product_element.find('.productShop a', first=True).attrs.get('href')
            product_volume = product_element.find('.productStatus em', first=True).text

            print(product_img_url)
            print(product_title)
            print(product_price)
            print(product_shop_url)
            print(product_volume)
        except:
            print('Some products have incomplete details!')  # some products are missing fields, so the exception is caught

get_totalPage(url, params)   # fill in the real totalPage before paging
get_info(url, params)
for param in get_params(params, params['totalPage']):
    get_info(url, param)
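If you would rather keep the scraped fields than just print them, a minimal variation (not part of the original script) is to collect each product into a row and write it out with the standard csv module, reusing the same selectors as get_info:

import csv

# Sketch only: same selectors as get_info, but the fields are written to a CSV file.
def save_page(url, params, writer):
    r = session.request(method='get', url=url, params=params)
    for product_element in r.html.find('.product'):
        try:
            writer.writerow({
                'title': product_element.find('.productTitle a', first=True).attrs.get('title'),
                'price': product_element.find('.productPrice em', first=True).attrs.get('title'),
                'img': product_element.find('.productImg-wrap a img', first=True).attrs.get('src'),
            })
        except AttributeError:
            pass   # skip products with missing fields

# Usage sketch:
# with open('products.csv', 'w', newline='', encoding='utf-8') as f:
#     writer = csv.DictWriter(f, fieldnames=['title', 'price', 'img'])
#     writer.writeheader()
#     for param in get_params(params, params['totalPage']):
#         save_page(url, param, writer)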

 

Origin www.cnblogs.com/changwenjun-666/p/11355209.html