Crawling Taobao Product Information with Multiple Processes

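Approach and strategy: At first I tried to crawl Taobao by capturing its network traffic and replaying the requests, but not all of the data Taobao returned was correct. It mixes real and fake data as an anti-crawling measure. I found very little material online about crawling Taobao through direct requests, and after fumbling on my own for a while I still could not work out how to get past the anti-crawling mechanism. So I decided to use a headless browser driven by selenium to get the product information first, and I will come back to fill this gap later.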

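A headless browser driven by selenium fully simulates a person operating the browser, so Taobao's anti-crawling strategy has little effect on it. The crawl is still slower than requesting reverse-engineered URLs directly, so to improve throughput I crawl with multiple processes. I chose multiple processes rather than multiple threads because running several browsers at the same time is a CPU-bound job rather than an I/O-bound one. The flowchart of the crawl is as follows: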

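At first I planned to use only two processes, but Taobao's listings contain duplicate products, and deduplication is necessary for a crawler anyway. Here the MD5 digest of each product ID is stored in a set() and checked for membership to remove duplicates. One dedicated process handles both deduplication and uploading the data. Deduplication has to run in a separate process from crawling because each process runs in its own virtual memory: if each of the two crawling processes kept its own set(), there would be no unified deduplication and the result would be much worse. That is why a third process is needed for deduplication alone.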
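The deduplication idea described above can be sketched roughly as follows. This is only a minimal illustration and not the crawler's actual code: `md5_hex` and `is_new` are hypothetical helper names, and the product IDs are made up.

```python
import hashlib

def md5_hex(text):
    """Return the 32-character hex MD5 digest of a string."""
    return hashlib.md5(text.encode('utf-8')).hexdigest()

# The set lives in the single deduplication process only,
# so every crawled item is checked against the same data.
seen = set()

def is_new(product_id):
    """Return True the first time an ID is seen, False afterwards."""
    digest = md5_hex(str(product_id))
    if digest in seen:
        return False
    seen.add(digest)
    return True

sample_ids = ['1001', '1002', '1001', '1003', '1002']  # made-up IDs
unique_ids = [pid for pid in sample_ids if is_new(pid)]
print(unique_ids)
```

Storing fixed-length digests instead of raw values keeps every entry in the set the same size, which matters more when the key is a long URL than when it is a short ID.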

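Finally, a few words on multithreading versus multiprocessing. Because of Python's global interpreter lock (GIL), only one thread can run in the interpreter at any moment, so Python threads are a poor fit for CPU-bound tasks. For CPU-bound work, use the multiprocessing model. Processes have separate virtual address spaces and can only talk to each other through mechanisms such as Queue and Pipe, which is also why the crawler uses one dedicated process for deduplication. Synchronizing and stopping the processes is just as important. When deciding whether a process should stop, avoid relying on the queue's empty() method: you cannot know whether more data is still waiting to be put on the queue, so it easily gives a wrong answer. A better way is to put an end marker on the queue after the last item; as soon as another process sees the marker, it stops.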
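The end-marker approach can be sketched as below. This is a simplified illustration under assumptions, not the crawler itself: `producer`, `consumer` and `run_demo` are hypothetical names, the data is made up, and the consumer counts one marker per producer so that it only stops once every producer has finished.

```python
import multiprocessing as mp

STOP = 'StopProcess'  # the same marker string the crawler uses

def producer(queue, items):
    """Put every item on the queue, then one end marker."""
    for item in items:
        queue.put(item)
    queue.put(STOP)

def consumer(queue, n_producers, result_queue):
    """Collect unique items until every producer has sent its marker."""
    seen = set()
    finished = 0
    while finished < n_producers:
        item = queue.get()  # blocks until data arrives; empty() is not needed
        if item == STOP:
            finished += 1
            continue
        seen.add(item)
    result_queue.put(sorted(seen))

def run_demo():
    queue = mp.Queue()
    result_queue = mp.Queue()
    batches = [['a', 'b', 'c'], ['b', 'c', 'd']]  # made-up data
    workers = [mp.Process(target=producer, args=(queue, b)) for b in batches]
    sink = mp.Process(target=consumer,
                      args=(queue, len(batches), result_queue))
    for proc in workers + [sink]:
        proc.start()
    result = result_queue.get()  # read the result before joining
    for proc in workers + [sink]:
        proc.join()
    return result

if __name__ == '__main__':
    print(run_demo())
```

Counting markers is one way to handle several producers that finish at different times. With a single marker, the consumer could stop while another producer is still putting data on the queue.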

That was the description in words. Now for the code.

Crawling the first 50 pages:

    def parse_pre_50page_product(self, queue):
        options = Options()
        options.add_argument('-headless')
        # Raw string, so the backslashes in the Windows path are not read as escapes
        driver = Firefox(executable_path=r'D:\SeleniumDriver\geckodriver.exe', firefox_options=options)
        wait = WebDriverWait(driver, timeout=20)
        # Search for the keyword ('充电宝' means "power bank")
        driver.get('https://www.taobao.com/')
        wait.until(expected.visibility_of_element_located((By.CSS_SELECTOR, '.search-combobox-input-wrap input'))).send_keys('充电宝')
        wait.until(expected.visibility_of_element_located((By.CSS_SELECTOR, 'button.btn-search.tb-bg'))).click()

        print('---------------process 1 is running--------------')
        for i in range(50):
            page_num = i + 1
            current_url = driver.current_url
            # Click "next page"
            try:
                driver.get(current_url)
                print('Click the next page')
                wait.until(expected.visibility_of_element_located((By.CSS_SELECTOR, '.item.next a.J_Ajax.num.icon-tag'))).click()
            except Exception:
                print('no page to next')
                break
            # time.sleep(5)
            print('--------------parse page {}--------------'.format(page_num))
            # A lock would protect the critical section and keep output tidy:
            # if several processes call parse at the same moment, the printed
            # results (the processed data) can interleave
            # lock.acquire()
            self.parse(driver.page_source, queue)
            # lock.release()
        driver.quit()  # close the browser when this process is done

Crawling the last 50 pages:

    def parse_pro_50page_product(self, queue):
        options = Options()
        options.add_argument('-headless')
        # Raw string, so the backslashes in the Windows path are not read as escapes
        driver = Firefox(executable_path=r'D:\SeleniumDriver\geckodriver.exe', firefox_options=options)
        wait = WebDriverWait(driver, timeout=20)
        # Open the search results for the keyword directly, starting from a later page
        driver.get('https://s.taobao.com/search?q=%E5%85%85%E7%94%B5%E5%AE%9D&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.2&ie=utf8&initiative_id=tbindexz_20170306&bcoffset=-141&ntoffset=-141&p4ppushleft=1%2C48&s=2156')

        print('------------process 2 is running--------------')
        for i in range(50, 100):
            page_num = i + 1
            current_url = driver.current_url
            # Click "next page"
            try:
                driver.get(current_url)
                print('Click the next page')
                wait.until(expected.visibility_of_element_located((By.CSS_SELECTOR, '.item.next a.J_Ajax.num.icon-tag'))).click()
            except Exception as e:
                print('no page to next or', e)
                break
            # time.sleep(5)
            print('-------------parse page {}------------'.format(page_num))
            # lock.acquire()
            self.parse(driver.page_source, queue)
            # lock.release()
        # Always send the end marker, even if the loop stopped early;
        # otherwise the dedup process would block on queue.get() forever
        print('parse product completed!')
        queue.put('StopProcess')
        driver.quit()  # close the browser when this process is done

Parsing the product information:

    # Parse the product information
    def parse(self, page_source, queue):
        match_obj = re.search(r'g_page_config = (.*)', page_source)
        if match_obj:
            # Strip only the trailing semicolon; removing every ';'
            # would also change string values inside the JSON
            info_match = match_obj.group(1).strip().rstrip(';')
            info_dict = json.loads(info_match)
            product_items = info_dict['mods']['itemlist']['data']['auctions']
            for product in product_items:
                title = product['raw_title']
                price = float(product['view_price'])
                detail_url = product['detail_url']
                # detail_url_md5 = self.MD5(detail_url)
                product_ID = self.MD5(product['nid'])
                shop = product['nick']
                comment_count = product['comment_count']
                comment_count = int(comment_count) if comment_count else 0
                # view_sales looks like '1234人付款' ("1234 people paid")
                sales_match = re.match(r'(\d+)人付款', product['view_sales'])
                view_sales = int(sales_match.group(1)) if sales_match else 0
                comment_url = product['comment_url'] or None
                # Remove symbols that GBK cannot encode
                title_gbk = title.replace(u'\u2705', u' ')
                title_gbk = title_gbk.replace(u'\u2708', u' ')
                try:
                    print('parsing the product:', title_gbk)
                except UnicodeEncodeError as e:
                    print('error:', e)
                params = (title_gbk, price, product_ID, detail_url, shop, comment_url, comment_count, view_sales)
                queue.put(params)
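To make the extraction step easier to follow in isolation, here is a small sketch that runs against a made-up, heavily shortened page. `SAMPLE_PAGE`, `extract_page_config` and `parse_sales` are hypothetical names for illustration; the real page is far larger and its layout may differ.

```python
import json
import re

# Made-up miniature of a search results page (assumption: the
# g_page_config assignment sits on one line and ends with ';').
SAMPLE_PAGE = (
    '<script>\n'
    '    g_page_config = {"mods": {"itemlist": {"data": {"auctions": ['
    '{"raw_title": "Power bank; 20000mAh", "view_price": "99.00", '
    '"view_sales": "1234人付款"}'
    ']}}}};\n'
    '    g_srp_loadCss();\n'
    '</script>\n'
)

def extract_page_config(page_source):
    """Return the g_page_config object as a dict, or None if absent."""
    match = re.search(r'g_page_config = (.*)', page_source)
    if match is None:
        return None
    # Drop only the trailing semicolon so ';' inside values survives
    raw = match.group(1).strip().rstrip(';')
    return json.loads(raw)

def parse_sales(view_sales):
    """Turn a string like '1234人付款' into 1234; return 0 if no match."""
    match = re.match(r'(\d+)人付款', view_sales)
    return int(match.group(1)) if match else 0

config = extract_page_config(SAMPLE_PAGE)
item = config['mods']['itemlist']['data']['auctions'][0]
print(item['raw_title'], float(item['view_price']),
      parse_sales(item['view_sales']))
```

The sample title contains a semicolon on purpose: it shows why only the trailing `;` is stripped rather than every `;` in the matched text.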

Deduplicating and uploading the data:

    # Unified deduplication across the crawling processes
    def delDuplicate(self, queue):
        save_flag = 0
        while True:
            save_flag += 1
            # Polling empty() is unreliable, so it is left commented out:
            # if queue.empty():
            #     time.sleep(20)
            #     if queue.empty():
            #         break
            params = queue.get()
            if params == 'StopProcess':
                # Put the marker back on the queue before exiting
                queue.put('StopProcess')
                break
            product_ID = params[2]
            print('Deleting duplicate data')
            if product_ID in self.Unduplicate:
                continue
            self.Unduplicate.add(product_ID)
            self.insert_to_mysql(params)
            # Save the set of seen IDs to disk every 100 iterations
            if save_flag % 100 == 0:
                with open('Unduplicate.taobao', 'w') as wf:
                    wf.write(str(self.Unduplicate))
        # Save once more on exit
        with open('Unduplicate.taobao', 'w') as wf:
            wf.write(str(self.Unduplicate))

    def insert_to_mysql(self, params):
        insert_sql = '''
        insert into TaobaoProduct values (%s , %s , %s , %s , %s , %s , %s , %s)'''
        self.cursor.execute(insert_sql, params)
        self.conn.commit()  # the data is only written to the database after this call
        print('Insert to MySQL succeeded!')
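The code above saves the set of seen IDs to `Unduplicate.taobao` as `str(set)`, but the post does not show how that file is read back on a later run. One possible way, purely as an assumption, is to parse the text with `ast.literal_eval`; `save_seen_ids` and `load_seen_ids` are hypothetical helper names.

```python
import ast
import os
import tempfile

def save_seen_ids(path, seen):
    """Write the set the same way the crawler does: as str(set)."""
    with open(path, 'w') as wf:
        wf.write(str(seen))

def load_seen_ids(path):
    """Read a set written by save_seen_ids; return an empty set if missing."""
    if not os.path.exists(path):
        return set()
    with open(path) as rf:
        text = rf.read().strip()
    # str(set()) is the text 'set()', which is handled explicitly here
    if not text or text == 'set()':
        return set()
    return set(ast.literal_eval(text))

# Demo in a temporary directory with made-up digests
demo_path = os.path.join(tempfile.mkdtemp(), 'Unduplicate.taobao')
save_seen_ids(demo_path, {'a1', 'b2'})
restored = load_seen_ids(demo_path)
print(restored == {'a1', 'b2'})
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for reading such a file back.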

The crawled data:

The main difficulties I ran into were how to deduplicate in one place and how to coordinate the processes.
