Python crawler basics Ⅲ - selenium and data storage



Crawler basics, Part Ⅲ

Part Ⅱ covered analyzing and understanding Ajax requests. But some pages, such as Taobao, have Ajax interfaces with many encrypted parameters; it is hard to work out their pattern, and therefore hard to analyze the Ajax requests directly.
To get around these problems, we can simply simulate a running browser. That way, whatever you can see in the browser is what you crawl: what is shown is what you get from the source. You no longer need to care what algorithm the page's internal JavaScript uses to render it, nor what parameters the page's background Ajax interfaces actually take.

selenium

Too lazy to include screenshots this time (every screenshot needs touching up, TAT)

(1) What selenium is

It is a powerful library. First install it with pip install selenium. With just a few lines of code it can control the browser: open pages automatically, type text, click, and so on, just like a real user operating it.

For websites with complex interactions or sophisticated encryption, selenium simplifies the problem and makes dynamic pages as easy to crawl as static ones.

As for dynamic and static web pages, you have actually met both before.

The pages written in HTML in Part Ⅰ are static pages. We crawled that type of page with BeautifulSoup, because the page's source code contains all of the page's information (it is the 0th request; in the Preview tab you can already see everything), so the URL in the browser's address bar is also the URL of the page's source code.

Later, in Part Ⅱ, we met more complex pages, such as QQ Music, whose data cannot be crawled from the HTML source code but lives in JSON. You cannot use the address-bar URL directly; instead you have to find the real URL of the JSON data. That is a dynamic web page.

Wherever the data lives, the browser is always sending all kinds of requests to the server; when those requests complete, they are assembled together in the developer tools under Elements as the rendered, finished page source.
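As a quick reminder of the static-page approach from Part Ⅰ, here is a minimal sketch: requests fetches the 0th request and BeautifulSoup parses it. The URL is only a placeholder; any page whose data already sits in the HTML source will do.

import requests
from bs4 import BeautifulSoup

# Placeholder URL: substitute any static page whose data is already in the HTML source
res = requests.get('https://example.com')
soup = BeautifulSoup(res.text, 'html.parser')
print(soup.find('h1').text)  # print the first <h1> as a demonstration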


(2) Benefits and drawbacks

When a page has complex interactions or complicated URL encryption logic, selenium comes in handy: it really opens a browser and waits until all the data has been loaded into Elements, after which you can crawl the page just like a static one.

Having talked about the advantages of selenium, there is of course a fly in the ointment.

Because a real local browser is run, opening the browser and waiting for the network rendering to complete takes some time, so selenium inevitably sacrifices speed and uses more resources. Still, it is at least no slower than a human.


(3) How to use it

1. Download the browser driver

First make sure the Chrome browser is installed, then go to http://npm.taobao.org/mirrors/chromedriver/2.44/ to download the matching browser driver and put it into your Python installation directory. Then run the code below locally: a browser should pop open, go to the Baidu home page, search for "Python", scroll right to the bottom, and finally pop up an alert. If all that happens, the installation succeeded (/ ≧ ▽ ≦) /

# -*- coding:utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

driver = webdriver.Chrome()               # open a real Chrome browser
driver.get('http://www.baidu.com')        # go to the Baidu home page
Search = driver.find_element_by_id('kw')  # locate the search box
Search.send_keys('Python')                # type the keyword
time.sleep(1)
Search.send_keys(Keys.ENTER)              # press Enter to search
time.sleep(2)
driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')  # scroll to the very bottom
time.sleep(2)
driver.execute_script('alert("To Bottom")')  # pop up an alert
time.sleep(2)
driver.close()
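If you would rather not put chromedriver into the Python installation directory, you can (in the Selenium 3 style used throughout this post) tell webdriver where it is explicitly. A minimal sketch; the path below is only a hypothetical example, point it at wherever you actually saved chromedriver:

from selenium import webdriver

# Hypothetical path - replace it with the real location of your chromedriver
driver = webdriver.Chrome(executable_path='C:/python/chromedriver.exe')
driver.get('http://www.baidu.com')
print(driver.title)  # if the driver works, this prints the Baidu page title
driver.quit()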

2. Set the browser engine

# How to set up a local Chrome browser
from selenium import webdriver  # import the webdriver module from the selenium library

driver = webdriver.Chrome()
# Set the engine to Chrome and really open a Chrome browser
# (it is assigned to the variable driver, which is now an instantiated browser)

3. Obtain data

Previously we used BeautifulSoup to parse the page source and extract the data from it. The selenium library can also parse and extract data, and its underlying principle is consistent with BeautifulSoup's, though some details and syntax differ. What selenium extracts directly is all the data in Elements, whereas BeautifulSoup only parses the response of the 0th request in Network. Since opening a page with selenium loads all its information into Elements, you can then crawl the dynamic page with the same methods you would use on a static one (yes, I keep repeating this).

Let's use QQ Music as an example, https://y.qq.com/portal/search.html#page=1&searchid=1&remoteplace=txt.yqq.top&t=song&w=%E5%91%A8%E6%9D%B0%E4%BC%A6 - see the comments:

from selenium import webdriver
import time

driver = webdriver.Chrome()  # set up the browser
url = 'https://y.qq.com/portal/search.html#page=1&searchid=1&remoteplace=txt.yqq.top&t=song&w=%E5%91%A8%E6%9D%B0%E4%BC%A6'
driver.get(url)  # get(URL) is a webdriver method; its job is to open the page at the given URL
time.sleep(2)    # the browser needs some time to load the page, so it is safer to wait 1-2 seconds before parsing and extracting


driver.close()

4. Parse and extract data

Think back to the first time we used BeautifulSoup to extract data: the Response object first had to be parsed into a BeautifulSoup object, and only then could data be extracted from it.

In selenium, the web page is fetched by driver, and from then on fetching and parsing happen together: both are done by driver, the instantiated browser. So the parsed data sits inside driver, and extracting it is also done with driver's methods:

method | effect
find_element_by_tag_name | locate an element by its tag name
find_element_by_class_name | locate an element by its class attribute
find_element_by_id | locate an element by its id
find_element_by_name | locate an element by its name attribute
find_element_by_link_text | locate a link by its full link text
find_element_by_partial_link_text | locate a link by part of its link text

Each of these methods extracts a single element, and the names are self-explanatory. If you want to extract several elements at once, just change element to elements (e.g. find_elements_by_class_name). Also note that the class name cannot be compound: find_element_by_class_name('xxx xxx') will not work.

The extracted element is of the class <class 'selenium.webdriver.remote.webelement.WebElement'>. It is similar to BeautifulSoup's Tag object: you can also get text and attribute values from it. A comparison of WebElement and Tag usage:

WebElement | Tag | effect
WebElement.text | Tag.text | extract the text
WebElement.get_attribute() | Tag[ ] | pass in an attribute name, returns the attribute value

Here are some examples:

# Each of the following methods can extract the text '你好啊!' from a web page -------------------

find_element_by_tag_name: select by tag name
# e.g. <h1>你好啊!</h1>
# use find_element_by_tag_name('h1')

find_element_by_class_name: select by the element's class attribute
# e.g. <h1 class="title">你好啊!</h1>
# use find_element_by_class_name('title')

find_element_by_id: select by the element's id
# e.g. <h1 id="title">你好啊!</h1>
# use find_element_by_id('title')

find_element_by_name: select by the element's name attribute
# e.g. <h1 name="hello">你好啊!</h1>
# use find_element_by_name('hello')


# The following two methods extract hyperlinks ------------------------------------

find_element_by_link_text: get a hyperlink by its link text
# e.g. <a href="spidermen.html">你好啊!</a>
# use find_element_by_link_text('你好啊!')

find_element_by_partial_link_text: get a hyperlink by part of its link text
# e.g. <a href="https://localprod.pandateacher.com/python-manuscript/hello-spiderman/">你好啊!</a>
# use find_element_by_partial_link_text('你好')
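The comparison table above also listed WebElement.text and WebElement.get_attribute(). A minimal sketch of using them - this is a fragment, not a standalone script: it assumes driver is the instantiated browser and the current page contains the <a> tag from the last example:

link = driver.find_element_by_partial_link_text('你好')  # locate the link, as in the last example
print(link.text)                    # WebElement.text -> the link text: 你好啊!
print(link.get_attribute('href'))   # WebElement.get_attribute() -> the value of the href attribute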

5. Example

Let's grab Jay Chou's song names from QQ Music again, https://y.qq.com/portal/search.html#page=1&searchid=1&remoteplace=txt.yqq.top&t=song&w=%E5%91%A8%E6%9D%B0%E4%BC%A6 - only the first page is extracted here. With selenium there is no need to go looking for the JSON ~

# -*- coding: utf-8 -*-
from selenium import webdriver
import time

driver = webdriver.Chrome()  # set up the browser
url = 'https://y.qq.com/portal/search.html#page=1&searchid=1&remoteplace=txt.yqq.top&t=song&w=%E5%91%A8%E6%9D%B0%E4%BC%A6'
driver.get(url)  # get(URL) is a webdriver method; its job is to open the page at the given URL
time.sleep(2)    # the browser needs some time to load the page, so it is safer to wait 1-2 seconds before parsing and extracting

# locate and extract the data directly
song_ul = driver.find_element_by_class_name('songlist__list')
song_li = song_ul.find_elements_by_class_name('js_songlist__child')

for song in song_li:
    name = song.find_element_by_class_name('songlist__songname_txt')
    print(name.text.strip())

driver.close()

6. Using selenium together with BS

Use selenium to obtain the fully rendered page source first, then hand the returned string over to BS for parsing and extraction. Why bother? Because once the source is fully rendered, you can keep using the BeautifulSoup parsing and extraction syntax you already know.

How do you get the source? With another attribute of driver: page_source

HTML_source_string = driver.page_source

Example:

# -*- coding: utf-8 -*-
from selenium import webdriver
import time, bs4

driver = webdriver.Chrome()  # set up the browser
url = 'https://y.qq.com/portal/search.html#page=1&searchid=1&remoteplace=txt.yqq.top&t=song&w=%E5%91%A8%E6%9D%B0%E4%BC%A6'
driver.get(url)  # get(URL) is a webdriver method; its job is to open the page at the given URL
time.sleep(2)    # the browser needs some time to load the page, so it is safer to wait 1-2 seconds before parsing and extracting

# parse the page with BS
song = bs4.BeautifulSoup(driver.page_source, 'html.parser')

# locate and extract the data directly
song_ul = song.find(class_='songlist__list')
song_li = song_ul.find_all(class_='js_songlist__child')

for song in song_li:
    name = song.find(class_='songlist__songname_txt')
    print(name.text.strip())

driver.close()

Before, without selenium, we could not crawl the song information with BS alone; now we can. Think about why.


7. selenium node interaction methods

.send_keys('what you want to type')  # simulate keyboard input, automatically fill in a form
# just locate the input box, then call this method on it to type into the box

.click()  # click an element
# locate anything clickable, and you can basically click it

.clear()  # clear an element's content
# clears whatever you typed with send_keys()

More interaction operations can be found in the official selenium documentation.
For example, open Baidu and search for "Python":

# -*- coding: utf-8 -*-
from selenium import webdriver
import time
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()  # set up the browser
url = 'http://www.baidu.com'
driver.get(url)  # get(URL) is a webdriver method; its job is to open the page at the given URL
time.sleep(2)    # the browser needs some time to load the page, so it is safer to wait 1-2 seconds before parsing and extracting

# locate the <input> tags
baidu = driver.find_elements_by_tag_name('input')

# the search input box is baidu[7]
baidu[7].send_keys('Python')
time.sleep(1)

# the search button is baidu[8]; click it
baidu[8].click()

# pressing Enter, as below, could also replace clicking the search button
# baidu[7].send_keys(Keys.ENTER)

time.sleep(3)
driver.close()
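The .clear() method listed above is not used in this example, so here is a minimal sketch of it - again a fragment, assuming driver is still sitting on the Baidu home page as at the start of the script above (the search box has the id kw, just like in the first demo script):

box = driver.find_element_by_id('kw')  # Baidu's search box
box.send_keys('Java')                  # type one keyword...
time.sleep(1)
box.clear()                            # ...wipe the box clean again...
box.send_keys('Python')                # ...then type another
box.send_keys(Keys.ENTER)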

8. Headless (silent) mode

Chrome needs to be at version 59 or above.

# Silent (headless) mode for a local Chrome browser:
from selenium import webdriver  # import the webdriver module from the selenium library
from selenium.webdriver.chrome.options import Options  # import the Options class from the options module

chrome_options = Options()  # instantiate an Options object
chrome_options.add_argument('--headless')  # put Chrome into silent (headless) mode
driver = webdriver.Chrome(options=chrome_options)  # set the engine to Chrome, running quietly in the background

This way the browser runs silently in the background instead of popping up in front of us.
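As a quick check that headless mode behaves the same, here is the earlier QQ Music example rerun silently; only the browser setup changes, the locating code is the same sketch as before:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

chrome_options = Options()
chrome_options.add_argument('--headless')          # no visible browser window
driver = webdriver.Chrome(options=chrome_options)

url = 'https://y.qq.com/portal/search.html#page=1&searchid=1&remoteplace=txt.yqq.top&t=song&w=%E5%91%A8%E6%9D%B0%E4%BC%A6'
driver.get(url)
time.sleep(2)

song_ul = driver.find_element_by_class_name('songlist__list')
for song in song_ul.find_elements_by_class_name('js_songlist__child'):
    print(song.find_element_by_class_name('songlist__songname_txt').text.strip())

driver.close()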


Storing data

Example: grab the "title", "summary" and "link" of the articles of the Zhihu big V Zhang Jiawei (zhang-jia-wei) and store them in a local file. You need to install the relevant modules first. See the comments in the code:

(1) Writing an xlsx file

import requests
import openpyxl
from bs4 import BeautifulSoup
import csv

headers = {
    'referer': 'https://www.zhihu.com/people/zhang-jia-wei/posts/posts_by_votes',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36'
}

# The HTML itself only contains two articles, so look in XHR: there are two "articles" requests,
# and the second one holds the detail data for the first page, which can be extracted as JSON.
# Compare the parameters across pages, and a loop can then fetch the detail for multiple pages.
list1 = []  # used to store the details
n = 1
for i in range(3):  # only fetch the first few pages here, otherwise it puts a lot of pressure on the server
    params = {
        'include': 'data[*].comment_count,suggest_edit,is_normal,thumbnail_extra_info,thumbnail,can_comment,comment_permission,admin_closed_comment,content,voteup_count,created,updated,upvoted_followees,voting,review_info,is_labeled,label_info;data[*].author.badge[?(type=best_answerer)].topics',
        'offset': str(i*20),
        'limit': str(20),
        'sort_by': 'voteups'
    }
    res = requests.get('https://www.zhihu.com/api/v4/members/zhang-jia-wei/articles',headers=headers,params=params)
    #print(res.status_code)
    detaillist = res.json()
    articlelist = detaillist['data']
    for article in articlelist:
        # put title, url and excerpt into a list; append() writes them into Excel row by row later
        list1.append([article['title'],article['url'],article['excerpt']])

# start storing
# create a workbook
wb = openpyxl.Workbook()
# get the workbook's active sheet
sheet = wb.active
# rename the worksheet
sheet.title = 'zjw'
# add a header row: assign values to cells A1, B1, ...
sheet['A1'] = '文章标题'
sheet['B1'] = '文章链接'
sheet['C1'] = '摘要'

for i in list1:
    sheet.append(i)  # write one row at a time
wb.save('zhihuspider.xlsx')  # remember to save in xlsx format

Then, in the directory where the .py file lives, a zhihuspider.xlsx file will appear!
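To double-check that the data really landed in the file, you can read it back with openpyxl. A minimal sketch (values_only needs openpyxl 2.6 or newer):

import openpyxl

wb = openpyxl.load_workbook('zhihuspider.xlsx')
sheet = wb['zjw']  # the worksheet we renamed above
for row in sheet.iter_rows(values_only=True):
    print(row)     # each row is a tuple: (title, link, excerpt)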


(2) Writing a csv file

# or store it with csv instead
csv_file = open('zhihuspider.csv', 'w', newline='', encoding='utf-8')
writer = csv.writer(csv_file)  # create a writer object with the csv.writer() function
list2 = ['标题', '链接', '摘要']
# calling the writer object's writerow() method writes one row into the csv file: the header "标题", "链接" and "摘要"
writer.writerow(list2)

for article in list1:
    writer.writerow(article)  # write one row at a time
csv_file.close()
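Likewise, the csv file can be read back with csv.reader to verify it. A minimal sketch:

import csv

csv_file = open('zhihuspider.csv', 'r', newline='', encoding='utf-8')
reader = csv.reader(csv_file)  # create a reader object for the file
for row in reader:
    print(row)  # each row is a list: [title, link, excerpt]
csv_file.close()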



-------- Everyone complains that life is hard, yet everyone quietly works hard for a living --------
