Preface
This article records the process of learning to crawl the best-selling product data under each category on the Taobao website. It uses webdriver from the selenium library to drive the Chrome browser to log in, search, and click to sort by sales, then retrieves the page content and parses it with the BeautifulSoup library.
1. Basic environment configuration
Python version: Python 3.8.3 under Anaconda3. Editor: Spyder. Browser version: Google Chrome 87.0.4280.88. Browser driver: this article uses webdriver from selenium to drive the browser and simulate human click behavior to crawl information, so the matching version of the browser driver must also be downloaded. ChromeDriver download address: https://npm.taobao.org/mirrors/chromedriver. Various driver versions are available there; find the one matching your own Chrome version (the version I downloaded is 87.0.4280.20), download it, place it in the Chrome directory, and add the path to the environment variables.
2. Usage steps
1. Import the library
The code is as follows (example):
```python
from bs4 import BeautifulSoup
from selenium import webdriver
# ActionChains implements basic automation operations such as mouse movement
# and mouse clicks, and can chain multiple steps into one
from selenium.webdriver import ActionChains
import PIL
from PIL import Image
import time
import base64  # Base64 encoding converts binary data to characters
import threading
import pandas as pd
```
The roles of the main libraries:

BeautifulSoup: a Python library for extracting data from HTML or XML files; it supports the usual document navigation, searching, and modification through your preferred parser. This article uses it to parse the page source fetched by webdriver and extract the desired content. For details see https://beautifulsoup.readthedocs.io/zh_CN/latest/

selenium.webdriver: Python commonly uses urlopen from urllib to crawl web pages: it returns a response object whose read method yields the page's HTML, which is then parsed with BeautifulSoup combined with regular expressions to grab the content of a given tag. urlopen can also reuse the cookies obtained after logging in, avoiding the trouble of logging in on every request. The limitation of urllib, however, is that it can only fetch the static HTML of a page; content generated dynamically by JavaScript is not included, so the HTML you crawl can be inconsistent with what you actually see in the browser. The selenium module can obtain the content generated by dynamic pages.

Image: Image.open() opens an image file. This article uses it to open the Taobao login QR code saved locally; scanning the enlarged QR code completes the scan-to-login flow.

time: time.sleep() pauses execution for a while, which helps prevent the access rate from being recognized as a crawler.
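As a minimal sketch of how BeautifulSoup extracts data from fetched HTML, consider the fragment below. The markup is made up to imitate a product listing; the class names mirror the ones used later in this article:

```python
from bs4 import BeautifulSoup

# Hypothetical product-listing fragment; in the real crawler this string
# would come from driver.page_source.
html_doc = """
<div class="items">
  <a class="J_ClickStat" href="//item.example.com/1" data-nid="1001">item link</a>
  <div class="deal-cnt">1200 sold</div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
link = soup.find("a", class_="J_ClickStat")       # locate by tag + class
print(link["data-nid"])                           # 1001
print(soup.find("div", class_="deal-cnt").text)   # 1200 sold
```

Note that BeautifulSoup only sees markup that is present in the string it is given; for Taobao's dynamically generated pages that string must come from selenium, not urllib.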
2. Actual case
Log in to Taobao:
```python
def login_first(self):
    # Taobao homepage link; request.get returned a status_code of 502 when
    # https was used at first.
    # pagelogin_url = 'http://www.taobao.com/?spm=a1z02.1.1581860521.1.CPoW0X'
    # is already defined in the class attributes.
    # The difference between PhantomJS and Chrome: Chrome starts a visible
    # browser so the page operation of each code step can be observed;
    # PhantomJS simulates a virtual (headless) browser, i.e. it operates the
    # browser without opening one.
    # driver = webdriver.PhantomJS()
    # driver = webdriver.Chrome()  # already defined in the class attributes
    # The get() step can take very long to load (possibly a network problem),
    # so loading is limited by the timeouts below.
    self.driver.set_page_load_timeout(40)
    self.driver.set_script_timeout(40)
    try:
        self.driver.get(self.pagelogin_url)
    except:
        print("Page loading is too slow; stop loading and continue to the next step")
        self.driver.execute_script("window.stop()")
    # A wait is needed here; otherwise the statements below may run before the
    # page is cached and the elements cannot be found.
    time.sleep(40)
    # Find the login button and click it.
    # The original plan was to obtain the cookies and then access the
    # login-required pages with the requests library, but that does not work
    # for webdriver.
    # Various webdriver element-locating methods:
    # https://www.cnblogs.com/yufeihlf/p/5717291.html
    # The most basic requirement of web automation is to locate each element
    # first, then operate on it (input, click, clear, submit, etc.).
    # XPath is a language for locating elements in an XML document and is also
    # a commonly used locating method:
    #   by attribute:       find_element_by_xpath("//tag[@attribute='value']")
    #                       (possible attributes: id, class, name, maxlength)
    #   by tag name:        find_element_by_xpath("//input")      # all input tags
    #   by parent/child:    find_element_by_xpath("//span/input")
    #   by element content: find_element_by_xpath("//p[contains(text(),'京公网')]")
    # '//' means the search starts from any node rather than the root.
    # How to find the XPath of a button or page area: inspect the page, click
    # the arrow icon in the upper-left corner, then click the target content;
    # its tag is highlighted in Elements, where right-click > Copy > Copy XPath.
    # self.driver.find_element_by_xpath('//*[@class="btn-login ml1 tb-bg weight"]').click()
    # time.sleep(40)
    # Find the QR-code login element and switch to QR-code login.
    try:
        self.driver.find_element_by_xpath('//*[@id="login"]/div[1]/i').click()
    except:
        pass
    # Pause for a while, otherwise the speed is detected as a crawler; this
    # also waits for the page to buffer.
    time.sleep(20)
    # Execute JS to read the QR code from the canvas (located by tag name).
    JS = 'return document.getElementsByTagName("canvas")[0].toDataURL("image/png");'
    im_info = self.driver.execute_script(JS)   # picture information as a data URL
    im_base64 = im_info.split(',')[1]          # base64-encoded picture data
    im_bytes = base64.b64decode(im_base64)     # convert to bytes
    time.sleep(2)
    with open('E:/Learn/login.png', 'wb') as f:  # save the QR code
        f.write(im_bytes)
    # Open the QR-code picture; scan it manually to log in.
    t = threading.Thread(target=self.opening, args=('E:/Learn/login.png',))
    t.start()
    print("Logining...Please sweep the code!\n")
    # Poll for the cookies that appear after login (only the user name is
    # visible; the account and password are not).
    while True:
        c = self.driver.get_cookies()
        if len(c) > 20:  # login succeeded, cookies obtained
            cookies = {}
            # The lines below are commented out: keeping only name and value
            # works for later use with requests, but not for webdriver's
            # add_cookie, which raises InvalidCookieDomainException.
            # for i in range(len(c)):
            #     cookies[c[i]['name']] = c[i]['value']
            self.driver.close()
            print("Login in successfully!\n")
            # return cookies
            return c
        time.sleep(10)
```
Executing the above code completes the Taobao login and jumps to the "my Taobao" page.
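The canvas-to-PNG step in the login code can be checked offline. This sketch fakes the `data:image/png;base64,<payload>` string that `toDataURL("image/png")` returns (the payload bytes here are made up) and shows that splitting on the comma and base64-decoding recovers the raw bytes that get written to `login.png`:

```python
import base64

# Made-up stand-in for real PNG bytes.
fake_png = b"\x89PNG fake bytes"
# Fake the data URL that canvas.toDataURL("image/png") would return.
im_info = "data:image/png;base64," + base64.b64encode(fake_png).decode()

im_base64 = im_info.split(",")[1]       # keep only the base64 payload
im_bytes = base64.b64decode(im_base64)  # back to raw bytes, ready for open(..., "wb")
print(im_bytes == fake_png)  # True
```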
Jump from my Taobao to Taobao homepage:
```python
def my_split(self, s, seps):
    """Split a string on multiple separator characters."""
    res = [s]
    for sep in seps:
        t = []
        list(map(lambda ss: t.extend(ss.split(sep)), res))
        res = t
    return res

def is_Chinese(self, word):
    """Judge whether the word contains Chinese characters."""
    for ch in word:
        if '\u4e00' <= ch <= '\u9fff':
            return True
    return False

def get_Cates(self):
    """After logging in, the browser lands on the "my Taobao" page by
    default; this method jumps to the Taobao homepage and obtains all
    Taobao product categories."""
    # The steps above complete the Taobao login and land on the "my" page;
    # click "Taobao homepage" to jump to the homepage.
    time.sleep(10)
    # Check the source code for the attribute of the "Taobao.com homepage"
    # button, then click it.
    self.driver.find_element_by_xpath('//*[@class="site-nav-menu site-nav-home"]').click()
    # After jumping to the homepage, get all categories in the left column.
    # Taobao is a dynamic JS page: fetching it with the requests library
    # returns data inconsistent with what the browser shows, so the selenium
    # package is needed.
    # How to tell static from dynamic pages: https://www.jianshu.com/p/236fc043db0b
    # Locate the class of the target content (the category column).
    driver_data = self.driver.find_element_by_xpath('//*[@class="screen-outer clearfix"]')
    html_doc = self.driver.page_source
    # driver.quit()
    # Parse the page source with BeautifulSoup.
    soup = BeautifulSoup(html_doc, "lxml")
    # Find all themes on the Taobao homepage; the class can be determined by
    # inspecting the category area.
    cate_list = []
    soup_data_list = soup.find("div", attrs={'class': 'screen-outer clearfix'})
    # Get the text, i.e. the themes of all items, then strip stray
    # characters with the custom split function.
    list_tuple = list(("\n", "\\", "\ue62e", "/", "\t", " ", " ", " "))
    cate_list = self.my_split(soup_data_list.text, list_tuple)
    # Keep only Chinese text with the custom is_Chinese function.
    for i in cate_list:
        if self.is_Chinese(i):
            self.cate_list_final.append(i)
    time.sleep(10)
    return self.cate_list_final
```
Executing the above code obtains all category names on the Taobao homepage.
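The two helper functions above can be exercised on their own. The sample string below is made up to imitate the stray characters (`\n`, `\t`, `\ue62e`, slashes, spaces) that surround the category text on the page:

```python
def my_split(s, seps):
    """Split s on every separator character in seps."""
    res = [s]
    for sep in seps:
        t = []
        for part in res:
            t.extend(part.split(sep))
        res = t
    return res

def is_Chinese(word):
    """True if the word contains at least one Chinese character."""
    return any('\u4e00' <= ch <= '\u9fff' for ch in word)

# Made-up raw text imitating what soup_data_list.text looks like.
raw = "\n女装\\男鞋\t \ue62e数码/家电  "
tokens = my_split(raw, ["\n", "\\", "\ue62e", "/", "\t", " "])
cates = [t for t in tokens if is_Chinese(t)]
print(cates)  # ['女装', '男鞋', '数码', '家电']
```

Splitting on each separator in turn leaves empty strings and whitespace fragments in the list; the `is_Chinese` filter is what discards them.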
Enter the category you want to search in the search box, click search, then click the sort-by-sales button and retrieve the page content for parsing:
```python
def search_Taobao(self, cate):
    print("The category being searched is: %s" % cate)
    # Clicking "Taobao homepage" jumps back to the homepage. The class of that
    # button is the same on the "my" page and on every category's search page,
    # so the click-homepage code lives here.
    time.sleep(10)
    # Enter the search term in the search bar on the homepage and click search.
    # The search bar has two overlapping inputs: first click the prompt input,
    # which reveals the input that actually accepts text, then send the value
    # to that input.
    # Click the search box first so it becomes interactive.
    self.driver.find_element_by_xpath('//*[@class="search-combobox-input-wrap"]').click()
    # Find the class of the real input box and enter the search term.
    driver_input_data = self.driver.find_element_by_xpath('//*[@class="search-combobox-input"]')
    # Fill in the category to search.
    # driver_input_data.send_keys('Women's clothing')
    driver_input_data.send_keys(cate)
    # Pause, otherwise the speed is recognized as a crawler.
    time.sleep(8)
    # Find the "Search" button on the page and click it.
    try:
        submit = self.driver.find_element_by_xpath('//*[@class="search-button"]')
        submit.click()
    except:
        pass
    time.sleep(5)

def get_Catinfo(self, cate):
    # self.login_first()
    time.sleep(20)
    self.search_Taobao(cate)
    # After the search, the page for the category is shown; get the first
    # page of products in descending order of sales.
    time.sleep(50)
    # Find the sort-by-sales element and click it.
    submit_order = self.driver.find_element_by_xpath('//*[@class="J_Ajax link "]')
    submit_order.click()
    time.sleep(5)
    # Get the whole page source.
    html_doc = self.driver.page_source
    # Extract the necessary information of each product from the page source.
    soup = BeautifulSoup(html_doc, "lxml")
    shop_data_list = soup.find('div', class_="grid g-clearfix").find_all_next('div', class_="items")
    for shop_data in shop_data_list:
        # Different pieces of information are spread across two classes.
        shop_data_a = shop_data.find_all("div", class_="ctx-box J_MouseEneterLeave J_IconMoreNew")
        shop_data_b = shop_data.find_all("div", class_="pic-box J_MouseEneterLeave J_PicBox")
        for goods_contents_b in shop_data_b:
            # Also record the crawled category itself as a column.
            self.shop_cate_list.append(cate)
            # 0. Get the product name.
            goods_name = goods_contents_b.find("div", class_="pic").find_all("img", class_="J_ItemPic img")[0]["alt"]
            self.goods_name_list.append(goods_name)
            # 0. Get the product picture.
            goods_pic = goods_contents_b.find("div", class_="pic").find_all("img", class_="J_ItemPic img")[0]["src"]
            self.goods_pic_list.append(goods_pic)
        for goods_contents_a in shop_data_a:
            # 2. Get the product price (trace-price).
            goods_price = goods_contents_a.find_all("a", class_="J_ClickStat")[0]["trace-price"]
            self.goods_price_list.append(goods_price)
            # goods_price = goods_contents_a.find("div", class_="price g_price g_price-highlight")
            # goods_price_list.append(goods_price)
            # 1. Get the product sales volume.
            goods_salenum = goods_contents_a.find("div", class_="deal-cnt")
            self.goods_salenum_list.append(goods_salenum)
            # 2. Get the product id.
            goods_id = goods_contents_a.find_all("a", class_="J_ClickStat")[0]["data-nid"]
            self.goods_id_list.append(goods_id)
            # 2. Get the product link.
            goods_href = goods_contents_a.find_all("a", class_="J_ClickStat")[0]["href"]
            self.goods_href_list.append(goods_href)
            # 2. Get the shop name.
            goods_store = goods_contents_a.find("a", class_="shopname J_MouseEneterLeave J_ShopInfo").contents[3]
            # goods_store = goods_contents.find_all("span", class_="dsrs")
            self.goods_store_list.append(goods_store)
            # 4. Get the shop address.
            goods_address = goods_contents_a.find("div", class_="location").contents
            self.goods_address_list.append(goods_address)
    # Arrange the crawled results into rows, truncated to the shortest list.
    for j in range(min(len(self.goods_name_list), len(self.goods_id_list),
                       len(self.goods_price_list), len(self.goods_salenum_list),
                       len(self.goods_pic_list), len(self.goods_href_list),
                       len(self.goods_store_list), len(self.goods_address_list))):
        self.data.append([self.shop_cate_list[j], self.goods_name_list[j],
                          self.goods_id_list[j], self.goods_price_list[j],
                          self.goods_salenum_list[j], self.goods_pic_list[j],
                          self.goods_href_list[j], self.goods_store_list[j],
                          self.goods_address_list[j]])
    # out_df = pd.DataFrame(self.data, columns=['goods_name', 'goods_id',
    #                                           'goods_price', 'goods_salenum',
    #                                           'goods_pic', 'goods_href',
    #                                           'goods_store', 'goods_address'])
    # self.Back_Homepage()
    # Without a sleep the click may fire before the page has loaded, causing
    # a click error.
    time.sleep(20)
    self.driver.find_element_by_xpath('//*[@class="site-nav-menu site-nav-home"]').click()
    return self.data
```
Executing the above code completes the search for each category and retrieves the first page of product information sorted by sales in descending order.
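The row-assembly step at the end of `get_Catinfo` truncates the parallel lists to their shortest length before combining them into rows. A sketch of the same idea using `zip()`, which does the min-length pairing implicitly (the sample data is made up):

```python
# Made-up parallel result lists, as they might look after a partial parse.
goods_name_list = ["dress A", "dress B", "dress C"]
goods_price_list = ["99.00", "129.00"]  # one entry short
goods_salenum_list = ["1200 sold", "800 sold", "500 sold"]

# zip() stops at the shortest list, matching the explicit min(len(...)) loop.
data = [list(row) for row in zip(goods_name_list, goods_price_list, goods_salenum_list)]
print(len(data))  # 2
print(data[0])    # ['dress A', '99.00', '1200 sold']
```

Truncating this way keeps the columns aligned when one selector matched fewer elements than the others, at the cost of silently dropping the unmatched trailing items.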
3. How to locate web content tags and attributes?
The above is the source code for crawling Taobao's top-selling data by category with selenium. The key to extracting the data, however, is locating the tags and attributes of the target content within the web page. The method, as introduced in a crawler video on Tencent Video, is: right-click the web page and choose Inspect (review element); the page's attribute information appears; click the arrow icon in the upper-left corner of the inspector panel (to the left of Elements, with the tooltip "select an element in the page to inspect it"); then click anywhere on the web page to locate the corresponding attributes.
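The `tag[@attribute='value']` XPath pattern copied from the inspector can also be tried offline against a saved page fragment. Python's standard-library `xml.etree.ElementTree` only supports a subset of XPath, but it covers this attribute form; the markup below is made up to mirror the search-box elements used earlier:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment imitating the Taobao search bar markup.
doc = ET.fromstring(
    "<div>"
    "<input class='search-combobox-input' name='q'/>"
    "<button class='search-button'>Search</button>"
    "</div>"
)
# Same attribute-based pattern as find_element_by_xpath, minus the leading
# '//' (ElementTree uses './/' for "search from any node").
node = doc.find(".//input[@class='search-combobox-input']")
print(node.get("name"))  # q
```

For full XPath support (e.g. `contains(text(), ...)`) a library such as lxml, or selenium itself, is needed.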
Afterword
This article uses the sales of each Taobao category as a practical case for crawler practice. During the crawling process, however, I did not yet have a complete picture of the structure of HTML, the element-locating methods in webdriver, element locating in BeautifulSoup, or the use of regular expressions; that content will be sorted out in follow-up articles.