Web Crawler Basics, Part Ⅲ
When analyzing and capturing Ajax requests, some pages, such as Taobao, encrypt many of their Ajax interface parameters. If we cannot find the pattern behind those parameters, it is hard to scrape the data by analyzing the Ajax interfaces directly.
To get around this, we can simulate a running browser instead: whatever you can see in the browser, you can crawl from the rendered source. What you see is what you crawl. You no longer need to care what algorithm the page's internal JavaScript uses to render it, or what parameters the page's background Ajax interfaces actually take; once the page has loaded, everything is there.
selenium
(Too lazy to take screenshots; every screenshot would need touching up TAT)
(1) What selenium is
It is a powerful library; first install it with pip install selenium
. With just a few lines of code it can control the browser: opening pages automatically, typing text, clicking, and so on, just like a real user performing the same operations.
For websites with complex interaction or sophisticated parameter encryption, selenium
simplifies the problem and makes dynamic pages as easy to crawl as static ones.
You have actually met both dynamic and static web pages before. Pages written with the html from Part Ⅰ are static. We crawled that type of page with BeautifulSoup, because the page source already contains all of the page's information (it all arrives in the 0th request; you can see everything in its Preview tab), so the URL in the browser's address bar is also the URL of the page's source code.
Later, in Part Ⅱ, we met more complex pages, such as QQ Music, where the data to be crawled is not in the HTML source but in json. There you cannot simply use the address-bar URL directly; you have to find the real URL of the json data. That is a dynamic page.
Wherever the data lives, the browser always obtains it by sending various requests to the server; when those requests complete, their responses are assembled together and shown in the developer tools' Elements panel as the rendered page source.
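To make the static case concrete, here is a toy sketch (the HTML string and class names are made up, not from a real site): because a static page's source already contains everything, BeautifulSoup alone can pull the data out.

```python
from bs4 import BeautifulSoup

# A made-up stand-in for a static page: everything the user sees
# is already present in the HTML that request 0 returns.
html = '''
<html><body>
  <ul class="songlist">
    <li class="song">晴天</li>
    <li class="song">七里香</li>
  </ul>
</body></html>
'''

soup = BeautifulSoup(html, 'html.parser')
songs = [li.text for li in soup.find_all(class_='song')]
print(songs)
```

On a dynamic page this approach would fail: the `<li>` entries would not be in request 0's HTML at all, only in json fetched afterwards.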
(2) Advantages and disadvantages
When you face a page with complex interaction, or a URL with complicated encryption logic, selenium
comes in handy: it really opens a browser, waits for all the data to be loaded into Elements
, and then lets you crawl the page as if it were static.
Having praised the advantages of selenium
, there is of course a fly in the ointment.
Because it really runs a local browser, opening the browser and waiting for network and rendering to finish takes time, so selenium
inevitably trades speed and extra resources for its convenience. Still, it is at least no slower than a human.
(3) How to use it
1. Download the browser driver
First install the Chrome browser, then download the matching browser driver from http://npm.taobao.org/mirrors/chromedriver/2.44/ and put it in your Python installation directory. Then run the code below locally: a browser window should pop up, open the Baidu home page, search for "Python", scroll to the very bottom, and finally show an alert. If all of that happens, the installation succeeded (/ ≧ ▽ ≦) /
# -*- coding:utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
driver = webdriver.Chrome()  # launch a real Chrome window
driver.get('http://www.baidu.com')  # open the Baidu home page
search = driver.find_element_by_id('kw')  # locate the search box by its id
search.send_keys('Python')  # type the keyword
time.sleep(1)
search.send_keys(Keys.ENTER)  # press Enter to search
time.sleep(2)
driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')  # scroll to the bottom
time.sleep(2)
driver.execute_script('alert("To Bottom")')  # pop up an alert
time.sleep(2)
driver.close()  # close the browser window
2. Set the browser engine
# How to set up a local Chrome browser
from selenium import webdriver  # import the webdriver module from the selenium library
driver = webdriver.Chrome()
# Set the engine to Chrome: this really opens a Chrome browser
# (assigned to the variable driver, which is an instantiated browser)
3. Obtain data
Previously we used BeautifulSoup
to parse the page source and extract the data from it. The selenium
library can also parse and extract data. Its underlying principle is consistent with BeautifulSoup
's, though some details and syntax differ. What selenium
extracts from directly is all the data in Elements
, whereas BeautifulSoup
parses only the response to request 0 in the Network
panel. Once selenium
has opened a web page, all of its information has been loaded into Elements
, so you can crawl a dynamic page with the same method as a static one (yes, I keep repeating this).
Let's take the QQ Music page https://y.qq.com/portal/search.html#page=1&searchid=1&remoteplace=txt.yqq.top&t=song&w=%E5%91%A8%E6%9D%B0%E4%BC%A6 as an example; see the comments:
from selenium import webdriver
import time
driver = webdriver.Chrome()  # set up the browser
url = 'https://y.qq.com/portal/search.html#page=1&searchid=1&remoteplace=txt.yqq.top&t=song&w=%E5%91%A8%E6%9D%B0%E4%BC%A6'
driver.get(url)  # get(URL) is a webdriver method; its job is to open the page at the given URL for you
time.sleep(2)  # the browser needs some time to load the page, so it is safer to wait 1-2 seconds before parsing and extracting
driver.close()
4. Parse and extract data
Think back to using BeautifulSoup
: after fetching the data, you first had to parse the Response
object into a BeautifulSoup
object, and only then extract the data.
In selenium
, the web page is fetched by driver
, and fetching and parsing then happen together: both are done by driver
, the instantiated browser. Parsing the data is handled by driver
automatically (we don't have to worry about it), and extracting the data is done with driver
's methods:
method | effect |
---|---|
find_element_by_tag_name | select an element by its tag name |
find_element_by_class_name | select an element by its class attribute |
find_element_by_id | select an element by its id |
find_element_by_name | select an element by its name attribute |
find_element_by_link_text | select a link by its full link text |
find_element_by_partial_link_text | select a link by part of its link text |
These are all methods for extracting a single element; the names say what they do. To extract every matching element, just change element to elements (e.g. find_elements_by_class_name, which returns a list). Note also that a compound class name cannot be used for locating: find_element_by_class_name('xxx xxx') will not work.
Each extracted element belongs to the class <class 'selenium.webdriver.remote.webelement.WebElement'>
. It is similar to BeautifulSoup
's Tag
object: it can likewise give you its text and its attribute values. The usage of the WebElement and Tag objects is compared below:
WebElement | Tag | effect |
---|---|---|
WebElement.text | Tag.text | extract the text |
WebElement.get_attribute() | Tag[ ] | pass the attribute name, returns the attribute value |
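As a quick illustration of the table, here is the Tag side in BeautifulSoup (a Tag needs no browser; the sample `<a>` tag is the same one used in the examples below). The WebElement side behaves the same way.

```python
from bs4 import BeautifulSoup

# Tag side of the comparison table; WebElement.text and
# WebElement.get_attribute('href') would return the same two values.
html = '<a href="spidermen.html">你好啊!</a>'
tag = BeautifulSoup(html, 'html.parser').find('a')

print(tag.text)     # extract the text
print(tag['href'])  # pass the attribute name, get its value
```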
Some examples:
# Each of the methods below can extract the text '你好啊!' from a page -------------------
find_element_by_tag_name: select by tag name
# e.g. <h1>你好啊!</h1>
# can be selected with find_element_by_tag_name('h1')
find_element_by_class_name: select by the element's class attribute
# e.g. <h1 class="title">你好啊!</h1>
# can be selected with find_element_by_class_name('title')
find_element_by_id: select by the element's id
# e.g. <h1 id="title">你好啊!</h1>
# can be selected with find_element_by_id('title')
find_element_by_name: select by the element's name attribute
# e.g. <h1 name="hello">你好啊!</h1>
# can be selected with find_element_by_name('hello')
# The two methods below extract hyperlinks ------------------------------------
find_element_by_link_text: select a hyperlink by its link text
# e.g. <a href="spidermen.html">你好啊!</a>
# can be selected with find_element_by_link_text('你好啊!')
find_element_by_partial_link_text: select a hyperlink by part of its link text
# e.g. <a href="https://localprod.pandateacher.com/python-manuscript/hello-spiderman/">你好啊!</a>
# can be selected with find_element_by_partial_link_text('你好')
5. Example
Extract Jay Chou's song names from QQ Music at https://y.qq.com/portal/search.html#page=1&searchid=1&remoteplace=txt.yqq.top&t=song&w=%E5%91%A8%E6%9D%B0%E4%BC%A6 (only the first page here). With selenium there is no need to hunt for the data in json!
# -*- coding: utf-8 -*-
from selenium import webdriver
import time
driver = webdriver.Chrome()  # set up the browser
url = 'https://y.qq.com/portal/search.html#page=1&searchid=1&remoteplace=txt.yqq.top&t=song&w=%E5%91%A8%E6%9D%B0%E4%BC%A6'
driver.get(url)  # get(URL) is a webdriver method; its job is to open the page at the given URL for you
time.sleep(2)  # the browser needs some time to load the page, so it is safer to wait 1-2 seconds before parsing and extracting
# locate and extract the data directly
song_ul = driver.find_element_by_class_name('songlist__list')  # the list holding all the songs
song_li = song_ul.find_elements_by_class_name('js_songlist__child')  # every song entry in the list
for song in song_li:
    name = song.find_element_by_class_name('songlist__songname_txt')
    print(name.text.strip())
driver.close()
6. Using selenium together with BS
Use selenium to obtain the fully rendered page source first, then hand that string to BS for parsing and extraction. (Think about why you would want this.)
How do you get the source? With another driver
attribute: page_source
html_source_string = driver.page_source
Example:
# -*- coding: utf-8 -*-
from selenium import webdriver
import time, bs4
driver = webdriver.Chrome()  # set up the browser
url = 'https://y.qq.com/portal/search.html#page=1&searchid=1&remoteplace=txt.yqq.top&t=song&w=%E5%91%A8%E6%9D%B0%E4%BC%A6'
driver.get(url)  # get(URL) is a webdriver method; its job is to open the page at the given URL for you
time.sleep(2)  # the browser needs some time to load the page, so it is safer to wait 1-2 seconds before parsing and extracting
# parse the page with BS
soup = bs4.BeautifulSoup(driver.page_source, 'html.parser')
# locate and extract the data directly
song_ul = soup.find(class_='songlist__list')
song_li = song_ul.find_all(class_='js_songlist__child')
for song in song_li:
    name = song.find(class_='songlist__songname_txt')
    print(name.text.strip())
driver.close()
Before, without selenium, we could not crawl the song information with BS alone, but now we can. Think about why: requests only receives the unrendered response of request 0, while driver.page_source hands BS the source after JavaScript has already rendered the song list into the page.
7. selenium's node interaction methods
.send_keys('what you want to type') # simulate keyboard input, fill in forms automatically
# just locate the input box, then call this method on it to type into the box
.click() # click an element
# locate anything clickable; almost anything can be clicked
.clear() # clear an element's content
# clears what you typed with send_keys()
More interaction operations are described in the official documentation: Here
For example, open Baidu and search for "Python":
# -*- coding: utf-8 -*-
from selenium import webdriver
import time
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome()  # set up the browser
url = 'http://www.baidu.com'
driver.get(url)  # get(URL) is a webdriver method; its job is to open the page at the given URL for you
time.sleep(2)  # the browser needs some time to load the page, so it is safer to wait 1-2 seconds before parsing and extracting
# locate the <input> tags
baidu = driver.find_elements_by_tag_name('input')
# the search box is number 7
baidu[7].send_keys('Python')
time.sleep(1)
# the search button is number 8; click it
baidu[8].click()
# you can also press Enter instead of clicking the search button:
# baidu[7].send_keys(Keys.ENTER)
time.sleep(3)
driver.close()
8. Headless mode (no browser window)
Chrome needs to be upgraded to version 59 or above.
# Silent (headless) mode setup for a local Chrome browser:
from selenium import webdriver  # import the webdriver module from the selenium library
from selenium.webdriver.chrome.options import Options  # import the Options class from the options module
chrome_options = Options()  # instantiate an Options object
chrome_options.add_argument('--headless')  # put Chrome into silent (headless) mode
driver = webdriver.Chrome(options=chrome_options)  # set the engine to Chrome, running quietly in the background
Now the browser runs silently in the background, and no window pops up to block our view.
Storing data
Example: crawl the titles, summaries, and links of articles by the Zhihu big V 张佳玮 (Zhang Jiawei) and store them in a local file. You need to install the relevant modules first. See the comments in the code:
(1) Write an xlsx file
import requests
import openpyxl
import csv
headers = {
'referer': 'https://www.zhihu.com/people/zhang-jia-wei/posts/posts_by_votes',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36'
}
# The HTML carries only two articles, so look in XHR instead: there are two "articles" requests,
# and the second one holds the details for page one, which can be extracted as json
# Compare the parameters across pages, then a loop fetches the details for multiple pages
list1 = []  # used to store the details
for i in range(3):  # only the first three pages here, to avoid putting heavy pressure on their server
    params = {
    'include': 'data[*].comment_count,suggest_edit,is_normal,thumbnail_extra_info,thumbnail,can_comment,comment_permission,admin_closed_comment,content,voteup_count,created,updated,upvoted_followees,voting,review_info,is_labeled,label_info;data[*].author.badge[?(type=best_answerer)].topics',
    'offset': str(i*20),
    'limit': str(20),
    'sort_by': 'voteups'
    }
    res = requests.get('https://www.zhihu.com/api/v4/members/zhang-jia-wei/articles',headers=headers,params=params)
    #print(res.status_code)
    detaillist = res.json()
    articlelist = detaillist['data']
    for article in articlelist:
        # put title, url, excerpt into a list; append() below writes them into Excel one row at a time
        list1.append([article['title'],article['url'],article['excerpt']])
# start storing
# create a workbook
wb = openpyxl.Workbook()
# get the workbook's active sheet
sheet = wb.active
# rename the worksheet
sheet.title = 'zjw'
# add a header row: assign values to cells A1, B1, C1
sheet['A1'] = '文章标题'
sheet['B1'] = '文章链接'
sheet['C1'] = '摘要'
for i in list1:
    sheet.append(i)  # write one row at a time
wb.save('zhihuspider.xlsx')  # remember to save in xlsx format
After it runs, a zhihuspider.xlsx file will appear in the same directory as the py file!
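If you want to double-check what was saved, openpyxl can read a workbook back with load_workbook. A minimal sketch of the round trip with made-up rows (the file name check.xlsx and the sample article are not from the scrape above):

```python
import openpyxl

# Write a tiny workbook, then read it back to verify the round trip.
wb = openpyxl.Workbook()
sheet = wb.active
sheet.title = 'zjw'
sheet.append(['文章标题', '文章链接', '摘要'])  # header row
sheet.append(['示例标题', 'https://example.com', '示例摘要'])  # one made-up article
wb.save('check.xlsx')

# load_workbook opens an existing file; iter_rows walks its saved rows
wb2 = openpyxl.load_workbook('check.xlsx')
rows = [[cell.value for cell in row] for row in wb2['zjw'].iter_rows()]
print(rows)
```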
(2) Write a csv file
# or store with csv
csv_file = open('zhihuspider.csv','w',newline='',encoding='utf-8')
writer = csv.writer(csv_file)  # create a writer object with the csv.writer() function
list2 = ['标题','链接','摘要']
# calling the writer object's writerow() method writes one row into the csv file:
# the header cells "标题", "链接" and "摘要"
writer.writerow(list2)
for article in list1:
    writer.writerow(article)  # write one row at a time
csv_file.close()
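The standard library's csv.reader reads such a file back the same way, row by row. A self-contained sketch with a made-up file name and rows:

```python
import csv

# Write two rows, then read them back with csv.reader.
with open('check.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['标题', '链接', '摘要'])
    writer.writerow(['示例标题', 'https://example.com', '示例摘要'])

with open('check.csv', newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))
print(rows)
```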
-------- Everyone complains that life is hard, yet everyone quietly works hard for a living --------