Crawler (XI): Selenium crawlers

1. Selenium basics

For the basics of Selenium, see the Selenium basics part I wrote earlier; since this article does not rely on much of it, I will not repeat it here.

Proxy IP:

Sometimes, when you crawl certain pages too frequently, the server will detect that you are a crawler and block your IP address. In that case we can switch to a proxy IP. Different browsers configure proxies in different ways; here I take the most popular Chrome browser as an example.

from selenium import webdriver

chromeOptions = webdriver.ChromeOptions()

# Set the proxy
chromeOptions.add_argument("--proxy-server=http://202.20.16.82:10152")
# Note: there must be no spaces around '=', i.e. not --proxy-server = http://202.20.16.82:10152
driver = webdriver.Chrome(chrome_options=chromeOptions)

# Check the IP the server sees, to verify the proxy is in effect
driver.get("http://httpbin.org/ip")
print(driver.page_source)

# Quit and clear the browser cache
driver.quit()

Precautions: 

First, choose a stable, fixed proxy IP rather than a dynamic one. The highly anonymous dynamic proxy IPs commonly used for crawling are generated by dial-up and are valid for only a very short time, usually about 3 minutes, so they are not suitable here.

Second, choose a fast proxy IP. Selenium crawlers rely on browser rendering, which is inherently slow; a slow proxy IP would increase the crawl time even further.

Third, make sure the machine has enough memory. Chrome uses a lot of memory, and under high concurrency the browser is likely to crash.

Fourth, at the end of the program, call driver.quit() to close the browser and clear its cache.
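
For the fourth point, a minimal sketch of wrapping the crawl in try/finally so the browser is closed even if something goes wrong part-way (crawl_pages is just a hypothetical placeholder for your own crawl logic):

from selenium import webdriver

driver = webdriver.Chrome()
try:
    crawl_pages(driver)  # hypothetical placeholder for the actual crawl logic
finally:
    driver.quit()  # always quit, so the browser process and its cache are cleaned up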

2. A Selenium crawler example

Picking a case really made me suffer. I first wanted to use Taobao, the most commonly used site, but it asks you to log in as soon as you search. Then I tried Tmall, where clicking to the next page also requires logging in, so I could only crawl the first page. In the end JD.com turned out fine: it allows everything.

2.1 Preliminary analysis

Sites such as JD.com, Taobao and Tmall are loaded dynamically: when the page is first opened only a few dozen items are loaded, and more are loaded as you scroll past a certain position. We can use Selenium to simulate the browser scrolling down and obtain the information for all the products on the page.

browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
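
A single jump to the bottom does not always trigger every lazy load, so in practice it helps to scroll down in several steps and pause between them. A rough sketch, assuming browser is an already opened webdriver and the step count and pause are values you tune yourself:

import time

def scroll_to_bottom(browser, steps=10, pause=0.5):
    # scroll to an increasing fraction of the page height so lazily loaded items have time to appear
    height = browser.execute_script("return document.body.scrollHeight")
    for i in range(1, steps + 1):
        browser.execute_script("window.scrollTo(0, arguments[0])", height * i // steps)
        time.sleep(pause)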

 

2.2 Simulating page turning

Previously, to crawl every result page of a query we could only analyze the URL, find its pattern, and use that to jump to the next page and retrieve the data.

Now we can use XPath positioning plus a Selenium click to simulate the browser's page-turning behavior.

Scroll down to the bottom of the page and you will find a Next Page button; we only need to locate that element and click it to turn the page.

browser.find_element_by_xpath('//a[@class="pn-next" and @onclick]').click()
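
Putting scrolling and clicking together, a paging loop looks roughly like this (only a sketch; the full version with explicit waits and exception handling is given in section 2.4):

import time
from selenium.common.exceptions import NoSuchElementException

while True:
    # ... parse the current page here ...
    browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
    time.sleep(2)  # give the lazily loaded items time to appear
    try:
        browser.find_element_by_xpath('//a[@class="pn-next" and @onclick]').click()
    except NoSuchElementException:
        break  # no clickable next-page button, so this was the last page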

 

2.3 Retrieving the data

Next, we need to parse each page to extract the data we need, namely the following fields (the elements can be selected with Selenium):

Product ID: browser.find_elements_by_xpath('//li[@data-sku]'), used to construct the item links

Product price: browser.find_elements_by_xpath('//div[@class="gl-i-wrap"]/div[2]/strong/i')

Product name: browser.find_elements_by_xpath('//div[@class="gl-i-wrap"]/div[3]/a/em')

Number of comments: browser.find_elements_by_xpath('//div[@class="gl-i-wrap"]/div[4]/strong')

 

2.4 Code implementation

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import selenium.common.exceptions
import json
import csv
import time

class JdSpider():
    def open_file(self):
        self.fm = input('Please enter the output file format (txt, json, csv): ')
        while self.fm!='txt' and self.fm!='json' and self.fm!='csv':
            self.fm = input('Invalid format, please re-enter (txt, json, csv): ')
        if self.fm=='txt' :
            self.fd = open('Jd.txt','w',encoding='utf-8')
        elif self.fm=='json' :
            self.fd = open('Jd.json','w',encoding='utf-8')
        elif self.fm=='csv' :
            self.fd = open('Jd.csv','w',encoding='utf-8',newline='')

    def open_browser(self):
        self.browser = webdriver.Chrome()
        self.browser.implicitly_wait(10)
        self.wait = WebDriverWait(self.browser,10)

    def init_variable(self):
        self.data = zip()
        self.isLast = False

    def parse_page(self):
        try:
            skus = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//li[@class="gl-item"]')))
            skus = [item.get_attribute('data-sku') for item in skus]
            links = ['https://item.jd.com/{sku}.html'.format(sku=item) for item in skus]
            prices = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//div[@class="gl-i-wrap"]/div[2]/strong/i')))
            prices = [item.text for item in prices]
            names = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//div[@class="gl-i-wrap"]/div[3]/a/em')))
            names = [item.text for item in names]
            comments = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//div[@class="gl-i-wrap"]/div[4]/strong')))
            comments = [item.text for item in comments]
            self.data = zip(links,prices,names,comments)
        except selenium.common.exceptions.TimeoutException:
            print('parse_page: TimeoutException')
            self.parse_page()
        except selenium.common.exceptions.StaleElementReferenceException:
            print('parse_page: StaleElementReferenceException')
            self.browser.refresh()

    def turn_page(self):
        try:
            self.wait.until(EC.element_to_be_clickable((By.XPATH,'//a[@class="pn-next"]'))).click()
            time.sleep(1)
            self.browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
            time.sleep(2)
        except selenium.common.exceptions.NoSuchElementException:
            self.isLast = True
        except selenium.common.exceptions.TimeoutException:
            print('turn_page: TimeoutException')
            self.turn_page()
        except selenium.common.exceptions.StaleElementReferenceException:
            print('turn_page: StaleElementReferenceException')
            self.browser.refresh()

    def write_to_file(self):
        if self.fm == 'txt':
            for item in self.data:
                self.fd.write('----------------------------------------\n')
                self.fd.write('link:' + str(item[0]) + '\n')
                self.fd.write('price:' + str(item[1]) + '\n')
                self.fd.write('name:' + str(item[2]) + '\n')
                self.fd.write('comment:' + str(item[3]) + '\n')
        if self.fm == 'json':
            temp = ('link','price','name','comment')
            for item in self.data:
                json.dump(dict(zip(temp,item)),self.fd,ensure_ascii=False)
        if self.fm == 'csv':
            writer = csv.writer(self.fd)
            for item in self.data:
                writer.writerow(item)

    def close_file(self):
        self.fd.close()

    def close_browser(self):
        self.browser.quit()

    def crawl(self):
        self.open_file()
        self.open_browser()
        self.init_variable()
        print('Crawl started')
        # the keyword parameter in the search URL below is "笔记本" (laptop), URL-encoded
        self.browser.get('https://search.jd.com/Search?keyword=%E7%AC%94%E8%AE%B0%E6%9C%AC&enc=utf-8')
        time.sleep(1)
        self.browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
        time.sleep(2)
        count = 0
        while not self.isLast:
            count += 1
            print('Crawling page ' + str(count) + ' ......')
            self.parse_page()
            self.write_to_file()
            self.turn_page()
        self.close_file()
        self.close_browser()
        print('Crawl finished')

if __name__ == '__main__':
    spider = JdSpider()
    spider.crawl()

Points to note in the code:

1.self.fd = open('Jd.csv','w',encoding='utf-8',newline='')

When opening a csv file, it is best to pass the parameter newline=''; otherwise blank lines appear in the output file, which gets in the way of later data processing.

2.self.browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")

When simulating the browser scrolling down, the page data is not always updated in time, so a StaleElementReferenceException is often raised. We can add time.sleep() around the operations to give the browser enough time to load, or catch the exception and handle it accordingly.
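
Besides sleeping, a small retry helper is one way to deal with stale elements (a sketch only; the number of attempts and the delay are arbitrary values):

import time
from selenium.common.exceptions import StaleElementReferenceException

def retry_on_stale(action, attempts=3, delay=1):
    # re-run the given zero-argument function if the element went stale while the page was updating
    # e.g. retry_on_stale(lambda: browser.find_element_by_xpath('//a[@class="pn-next"]').click())
    for _ in range(attempts):
        try:
            return action()
        except StaleElementReferenceException:
            time.sleep(delay)
    return action()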

3.skus = [item.get_attribute('data-sku') for item in skus]

When selecting elements with XPath in Selenium, you cannot read a node's attribute value directly in the XPath expression; you have to select the element and then call the get_attribute() method.
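
For example, asking for the attribute node in the XPath itself does not work in Selenium; select the elements first and then read the attribute from each one:

# this raises an error, because find_elements_by_xpath can only return elements, not attribute values
# skus = browser.find_elements_by_xpath('//li[@class="gl-item"]/@data-sku')

# select the elements, then read the attribute with get_attribute()
items = browser.find_elements_by_xpath('//li[@class="gl-item"]')
skus = [item.get_attribute('data-sku') for item in items]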

4. Starting the browser in headless mode speeds up the crawl; simply set the headless option when launching the browser.

opt = webdriver.chrome.options.Options()
opt.set_headless()
browser = webdriver.Chrome(chrome_options=opt)
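
On newer Selenium releases set_headless() is deprecated; an equivalent way (assuming a version that still accepts chrome_options, as in the code above) is to pass the --headless argument:

opt = webdriver.ChromeOptions()
opt.add_argument('--headless')  # run Chrome without opening a visible window
browser = webdriver.Chrome(chrome_options=opt)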


Origin www.cnblogs.com/liuhui0308/p/12079458.html