Crawling Taobao sales data with Python! Data is money this year!

 

Preface

This article records the process of learning to crawl the data of the highest-selling products under each category on the Taobao website. It uses the webdriver from the selenium library to drive the Google Chrome browser to log in, search, and click to sort by sales, then grabs the page content and parses it with the BeautifulSoup library.

1. Basic environment configuration

- Python version: Python 3.8.3 (under Anaconda3)
- Editor: Spyder
- Browser version: Google Chrome 87.0.4280.88
- Browser driver: this article uses webdriver from selenium to drive the browser and simulate human clicks, so the matching version of the browser driver also has to be downloaded. ChromeDriver download address: https://npm.taobao.org/mirrors/chromedriver. That address hosts drivers for many versions; find the one that matches your own Chrome (the version downloaded here is 87.0.4280.20), place it in the Chrome directory, and add that path to the environment variables.
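As a quick check that Chrome and the driver versions match, a minimal sketch like the following (path and URL are only examples) should open a Chrome window, print the page title, and close again without errors:

from selenium import webdriver

# If chromedriver has been added to PATH no argument is needed; otherwise pass the full
# path to the executable, e.g. webdriver.Chrome(executable_path=r"D:\chromedriver.exe")
driver = webdriver.Chrome()
driver.get("https://www.taobao.com/")
print(driver.title)    # a matching driver prints the Taobao page title
driver.quit()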

2. Usage steps

1. Import the libraries

The code is as follows (example):

from bs4 import BeautifulSoup
from selenium import webdriver
# ActionChains implements basic automation operations such as mouse movement and mouse
# clicks, and can chain several of these steps into a single call
from selenium.webdriver import ActionChains
import PIL
from PIL import Image
import time
import base64     # base64 encoding converts binary data to characters
import threading
import pandas as pd

The role of the main libraries:

- BeautifulSoup: a Python library that extracts data from HTML or XML files; it supports the usual document navigation, searching and modification through your preferred parser. In this article it parses the page source returned by webdriver to pull out the data we want. For details see https://beautifulsoup.readthedocs.io/zh_CN/latest/
- selenium.webdriver: Python usually crawls pages with urlopen from urllib, which returns a page object whose read method gives the HTML of the URL; BeautifulSoup combined with regular expressions can then grab a given tag, and urlopen can also carry the cookies obtained after logging in so you do not have to log in on every request. The limitation of urllib is that it can only get the static HTML of the page. Dynamically generated content is not included in the static HTML, so the HTML you crawl can differ from what you actually see in the browser. The selenium webdriver can obtain the content generated by dynamic pages.
- Image: Image.open() opens a picture from disk. Here it opens the Taobao login QR code that has been saved locally; scanning the enlarged QR code completes the scan-to-login step.
- time: time.sleep() pauses the script so that the access rate is not fast enough to be recognized as a crawler.
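To see the difference between static and dynamic content for yourself, a small sketch like the one below (the URL is just an example; Taobao may block plain urllib requests) compares the HTML returned by urlopen with the rendered page source from webdriver:

from urllib.request import urlopen
from selenium import webdriver

url = "https://www.taobao.com/"   # example URL

# Static HTML as urllib sees it (no JavaScript is executed)
static_html = urlopen(url).read().decode("utf-8", errors="ignore")

# Rendered HTML as the browser sees it (dynamic content has been generated)
driver = webdriver.Chrome()
driver.get(url)
rendered_html = driver.page_source
driver.quit()

# The rendered source is usually much longer because it contains the dynamic content
print(len(static_html), len(rendered_html))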

2. Actual case
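All the methods in this case belong to one crawler class and share state through instance attributes. The post never shows the class header, so the skeleton below is only an assumption that collects the attributes the methods rely on (the class name and the login URL are hypothetical):

class TaobaoCrawler:          # hypothetical class name, not from the original post
    def __init__(self):
        # assumed login URL; the post only says it is defined as a class attribute
        self.pagelogin_url = 'https://login.taobao.com/member/login.jhtml'
        self.driver = webdriver.Chrome()   # driven by the downloaded chromedriver
        self.cate_list_final = []          # all category names scraped from the homepage
        self.shop_cate_list = []           # category of each crawled item
        self.goods_name_list = []          # product names
        self.goods_pic_list = []           # product picture URLs
        self.goods_price_list = []         # prices
        self.goods_salenum_list = []       # sales counts
        self.goods_id_list = []            # product ids
        self.goods_href_list = []          # product links
        self.goods_store_list = []         # shop names
        self.goods_address_list = []       # shop locations
        self.data = []                     # rows assembled in get_Catinfo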

Log in to Taobao:

def login_first(self):
    # Taobao homepage link; request.get returned a status_code of 502 at first when https was used
    # pageview_url = 'http://www.taobao.com/?spm=a1z02.1.1581860521.1.CPoW0X' has been defined
    # in the class attributes
    # The difference between PhantomJS and Chrome: Chrome starts a visible browser, so every page
    # operation performed by the code can be observed; PhantomJS simulates a virtual browser,
    # i.e. it operates the browser without opening a window
    # driver = webdriver.PhantomJS()
    # driver = webdriver.Chrome()  -- already defined in the class attributes
    # The get() step can take very long (possibly a network problem), so stop loading after the timeout
    self.driver.set_page_load_timeout(40)
    self.driver.set_script_timeout(40)
    try:
        self.driver.get(self.pagelogin_url)
    except:
        print("Page loading is too slow, stop loading and continue to the next step")
        self.driver.execute_script("window.stop()")
    # A wait is needed here, otherwise the statements below may run before the page is cached
    # and the elements cannot be found
    time.sleep(40)
    # Find the login button and click it
    # The original plan was to take the cookies and access logged-in pages with the requests library,
    # but cookies obtained that way do not work for webdriver
    # Various ways webdriver locates elements: https://www.cnblogs.com/yufeihlf/p/5717291.html
    # The most basic requirement of page automation is to locate each element first and then operate
    # on it (input, click, clear, submit, etc.)
    # XPath is a language for locating elements in an XML document and is a commonly used method
    # Locate by attribute: find_element_by_xpath("//tagname[@attribute='attribute value']")
    #   possible attributes: id, class, name, maxlength
    # Locate by tag name: all input elements: find_element_by_xpath("//input")
    # Locate by parent and child: all input elements under span: find_element_by_xpath("//span/input")
    # Locate by element content: find_element_by_xpath("//p[contains(text(),'京公网')]")
    # '//' means the search starts from any node, not necessarily from the root node
    # How to find the XPath of a button or element: inspect the page, click the arrow icon in the
    # upper left corner, click the target content so it is highlighted in Elements, then right-click
    # and choose Copy XPath
    # self.driver.find_element_by_xpath('//*[@class="btn-login ml1 tb-bg weight"]').click()

    # time.sleep(40)
    # The code below finds the QR-code login element and switches to login by QR code
    try:
        self.driver.find_element_by_xpath('//*[@id="login"]/div[1]/i').click()
    except:
        pass
    # Usually a pause of a few seconds is needed, otherwise the script is detected as a crawler
    # Wait for the page to buffer
    time.sleep(20)
    # Execute JS to read the QR code out of the canvas element (located by tag name)
    JS = 'return document.getElementsByTagName("canvas")[0].toDataURL("image/png");'
    im_info = self.driver.execute_script(JS)   # execute JS to obtain the picture information
    im_base64 = im_info.split(',')[1]          # keep the base64-encoded picture data
    im_bytes = base64.b64decode(im_base64)     # convert to bytes
    time.sleep(2)
    with open('E:/Learn/login.png', 'wb') as f:
        # save the QR code picture
        f.write(im_bytes)
    # Open the QR-code picture; it has to be scanned manually to log in
    t = threading.Thread(target=self.opening, args=('E:/Learn/login.png',))
    t.start()
    print("Logining...Please sweep the code!\n")
    # Get the cookies after logging in (only the user name is visible, not the account or password)
    while True:
        c = self.driver.get_cookies()
        if len(c) > 20:   # enough cookies means the login succeeded
            # The lines below are commented out because keeping only name and value works for the
            # requests library but not for webdriver's add_cookie (InvalidCookieDomainException)
            # cookies = {}
            # for i in range(len(c)):
            #     cookies[c[i]['name']] = c[i]['value']
            # self.driver.close()  # closing here would end the session; later steps reuse self.driver
            print("Login in successfully!\n")
            # return cookies
            return c
        time.sleep(10)

Executing the code above completes the Taobao login and jumps to the "My Taobao" page.
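login_first hands self.opening to a thread so the saved QR code pops up while the loop keeps polling the cookies. The helper itself is not shown in the post; a minimal sketch based on the Image.open() usage described earlier might be:

def opening(self, img_path):
    # Sketch only: the original post does not show this helper
    img = Image.open(img_path)      # open the saved login.png with PIL
    img = img.resize((300, 300))    # enlarge it a little so it is easier to scan
    img.show()                      # pops up the system image viewer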

Jump from "My Taobao" to the Taobao homepage:

def my_split(self, s, seps):
    """Split a string on multiple separator characters"""
    res = [s]
    for sep in seps:
        t = []
        list(map(lambda ss: t.extend(ss.split(sep)), res))
        res = t
    return res

def is_Chinese(self, word):
    """Judge whether the word contains Chinese characters"""
    for ch in word:
        if '\u4e00' <= ch <= '\u9fff':
            return True
    return False

def get_Cates(self):
    """After logging in, the browser lands on the 'My Taobao' page by default.
    This method jumps to the Taobao homepage and obtains all Taobao product categories."""
    # The steps above finished the login and landed on the 'My Taobao' page;
    # click "Taobao homepage" to jump to the homepage
    time.sleep(10)
    # Inspect the source to find the attribute of the "Taobao.com homepage" button, then click it
    self.driver.find_element_by_xpath('//*[@class="site-nav-menu site-nav-home"]').click()
    # After jumping to the homepage, first get all the categories in the left column of the page

    # Taobao pages are dynamic JS: fetching them with the requests library returns data that differs
    # from what the browser shows, so the selenium package is needed.
    # How to tell static from dynamic pages: https://www.jianshu.com/p/236fc043db0b
    # Inspect the source to find the class of the target content (the category column)
    driver_data = self.driver.find_element_by_xpath('//*[@class="screen-outer clearfix"]')
    html_doc = self.driver.page_source
    # driver.quit()
    # Parse the page source with BeautifulSoup
    soup = BeautifulSoup(html_doc, "lxml")
    # Find all the themes on the Taobao homepage; the class can be determined by inspecting the page
    cate_list = []
    soup_data_list = soup.find("div", attrs={'class': 'screen-outer clearfix'})
    # Get the text in the source, i.e. the themes of all the items on Taobao,
    # and remove the illegal characters with the custom split function
    list_tuple = list(("\n", "\\", "\ue62e", "/", "\t", " ", " ", " "))
    cate_list = self.my_split(soup_data_list.text, list_tuple)
    # Keep only the Chinese text with the custom is_Chinese function
    keep_select = []
    # cate_list_final = []
    for i in cate_list:
        keep_select = self.is_Chinese(i)
        if keep_select:
            self.cate_list_final.append(i)
    time.sleep(10)
    return self.cate_list_final

Executing the code above obtains all the category names on the Taobao homepage.
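A quick illustration of the two helper functions on a made-up homepage string (standalone versions here; the class methods behave the same way):

def my_split(s, seps):
    res = [s]
    for sep in seps:
        t = []
        list(map(lambda ss: t.extend(ss.split(sep)), res))
        res = t
    return res

def is_Chinese(word):
    return any('\u4e00' <= ch <= '\u9fff' for ch in word)

raw = "\n女装/\t男装\ue62e鞋靴 ABC"                        # made-up sample text
parts = my_split(raw, ["\n", "\\", "\ue62e", "/", "\t", " "])
print([p for p in parts if is_Chinese(p)])                 # ['女装', '男装', '鞋靴']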

Enter the category you want to search in the search box, click Search, then click the sort-by-sales button and grab the page content for parsing:

def search_Taobao(self, cate):
    print("The category being searched is: %s" % cate)
    # Clicking "Taobao homepage" jumps back to the homepage. Whether on the 'My Taobao' page or on
    # the search result page of some category, the class of the element that jumps back to the
    # homepage stays the same, so the click-homepage code is placed here
    time.sleep(10)
    # Enter the content to search in the search bar on the homepage and click Search
    # There is one visible search box but two overlapping input elements: first click the input that
    # shows the prompt value, then the input that really takes the text is displayed and can be typed into
    # First click the search box '//*[@class="search-combobox-input-wrap"]' to make it interactive
    self.driver.find_element_by_xpath('//*[@class="search-combobox-input-wrap"]').click()
    # Then find the class name of the real input box, type the content to search, and click Search
    driver_input_data = self.driver.find_element_by_xpath('//*[@class="search-combobox-input"]')
    # Fill in the category that needs to be searched
    # driver_input_data.send_keys("Women's clothing")
    driver_input_data.send_keys(cate)
    # Pause, otherwise the speed will be recognized as a crawler
    time.sleep(8)
    # Find the "Search" button on the page and click it
    try:
        submit = self.driver.find_element_by_xpath('//*[@class="search-button"]')
        submit.click()
    except:
        pass

    time.sleep(5)

def get_Catinfo(self, cate):
    # self.login_first()
    time.sleep(20)
    self.search_Taobao(cate)
    # After the search, the page of the corresponding category is shown; get the product information
    # of the first page in descending order of sales
    time.sleep(50)
    # Find the element that sorts by sales and click it to list the products in descending order
    submit_order = self.driver.find_element_by_xpath('//*[@class="J_Ajax link"]')
    submit_order.click()
    time.sleep(5)
    # Get the source code of the whole page
    html_doc = self.driver.page_source
    # Extract the needed information of each product from the page source
    soup = BeautifulSoup(html_doc, "lxml")
    shop_data_list = soup.find('div', class_="grid g-clearfix").find_all_next('div', class_="items")
    for shop_data in shop_data_list:
        # Different pieces of information are distributed in the following two different classes
        shop_data_a = shop_data.find_all("div", class_="ctx-box J_MouseEneterLeave J_IconMoreNew")
        shop_data_b = shop_data.find_all("div", class_="pic-box J_MouseEneterLeave J_PicBox")
        for goods_contents_b in shop_data_b:
            # Record the product category being crawled
            self.shop_cate_list.append(cate)
            # 0. Get the product name
            goods_name = goods_contents_b.find("div", class_="pic").find_all("img", class_="J_ItemPic img")[0]["alt"]
            self.goods_name_list.append(goods_name)
            # 1. Get the product picture
            goods_pic = goods_contents_b.find("div", class_="pic").find_all("img", class_="J_ItemPic img")[0]["src"]
            self.goods_pic_list.append(goods_pic)

        for goods_contents_a in shop_data_a:
            # 2. Get the product price (trace-price attribute)
            goods_price = goods_contents_a.find_all("a", class_="J_ClickStat")[0]["trace-price"]
            self.goods_price_list.append(goods_price)
            # goods_price = goods_contents_a.find("div", class_="price g_price g_price-highlight")
            # goods_price_list.append(goods_price)
            # 3. Get the product sales
            goods_salenum = goods_contents_a.find("div", class_="deal-cnt")
            self.goods_salenum_list.append(goods_salenum)
            # 4. Get the product id
            goods_id = goods_contents_a.find_all("a", class_="J_ClickStat")[0]["data-nid"]
            self.goods_id_list.append(goods_id)
            # 5. Get the product link
            goods_href = goods_contents_a.find_all("a", class_="J_ClickStat")[0]["href"]
            self.goods_href_list.append(goods_href)
            # 6. Get the shop name
            goods_store = goods_contents_a.find("a", class_="shopname J_MouseEneterLeave J_ShopInfo").contents[3]
            # goods_store = goods_contents.find_all("span", class_="dsrs")
            self.goods_store_list.append(goods_store)
            # 7. Get the shop address
            goods_address = goods_contents_a.find("div", class_="location").contents
            self.goods_address_list.append(goods_address)

    # Arrange the crawled results into rows (dataframe-style)
    for j in range(min(
            len(self.goods_name_list), len(self.goods_id_list), len(self.goods_price_list),
            len(self.goods_salenum_list), len(self.goods_pic_list), len(self.goods_href_list),
            len(self.goods_store_list), len(self.goods_address_list))):
        self.data.append([self.shop_cate_list[j], self.goods_name_list[j], self.goods_id_list[j],
                          self.goods_price_list[j], self.goods_salenum_list[j], self.goods_pic_list[j],
                          self.goods_href_list[j], self.goods_store_list[j], self.goods_address_list[j]])
    # out_df = pd.DataFrame(self.data, columns=['goods_name', 'goods_id', 'goods_price',
    #                                           'goods_salenum', 'goods_pic', 'goods_href',
    #                                           'goods_store', 'goods_address'])

    # self.Back_Homepage()
    # Without this sleep the click may happen before the page has loaded, causing a click error
    time.sleep(20)
    self.driver.find_element_by_xpath('//*[@class="site-nav-menu site-nav-home"]').click()
    return self.data

Executing the code above searches each category and collects the product information on the first page of results, sorted by sales in descending order.
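The commented-out out_df lines hint at the last step: turning self.data into a table and saving it. A minimal sketch (the column names, file name, and class name are assumptions that simply follow the order of the fields appended above) could be:

import pandas as pd

columns = ['shop_cate', 'goods_name', 'goods_id', 'goods_price', 'goods_salenum',
           'goods_pic', 'goods_href', 'goods_store', 'goods_address']   # assumed names

crawler = TaobaoCrawler()            # the hypothetical class sketched earlier
# ... login_first(), get_Cates() and get_Catinfo() calls go here ...
out_df = pd.DataFrame(crawler.data, columns=columns)
out_df.to_csv('taobao_top_sales.csv', index=False, encoding='utf-8-sig')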

3. How to locate web content tags and attributes?

The above is the source code for crawling the top-selling items of each Taobao category with selenium. The remaining question is how to locate, in the web page, the tags and attributes of the content you need. The method, as introduced in a crawler tutorial on Tencent Video, is as follows: 1. Right-click on the web page and click Inspect (or "review element"); the panel with the page's attribute information appears. Click the arrow icon in the upper left corner of that panel (to the left of Elements, with the tooltip "select an element in the page to inspect it"), then click any place on the webpage and the corresponding attribute is located automatically;
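Once an XPath has been copied from the Elements panel, it can be pasted straight into webdriver; a tiny, purely illustrative example:

copied_xpath = '//*[@id="q"]'                          # illustrative XPath copied from Elements
element = driver.find_element_by_xpath(copied_xpath)   # locate the element
element.send_keys("女装")                               # then operate on it (type, click, ...)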

 

 

Afterword

This article uses the top sales of each Taobao category as a practical crawler exercise, but during the crawling process I did not yet have a complete picture of the structure of HTML, the element-locating methods in webdriver, element locating in BeautifulSoup, or the use of regular expressions. That material will be sorted out in follow-up posts.


 


Origin blog.csdn.net/weixin_43881394/article/details/112228530