I'm a good guy, an honest, law-abiding citizen.
Whether a crawler is good or bad depends on the user!
Scrapy is a commonly used data-collection framework; Selenium is a browser automation testing tool. By combining Scrapy's scheduling and data-processing machinery with Selenium's ability to simulate a real browser (automatic login, automatic page turning, and so on), we can collect data that plain Scrapy cannot reach.
About Scrapy
Scrapy is one of the most common Web data-collection tools for developers. Fetching data through an API has become commonplace, but some websites, for performance or security reasons, deliberately avoid exposing their data through an API (serving static pages, using one-time tokens, and so on). To collect such data, we can analyze the site's page and tag structure and then scrape it with Scrapy.
So what role does the Scrapy framework play, and how exactly does it help us collect data? Take a look at Scrapy's architecture:
The data flow in Scrapy is controlled by the Scrapy Engine and runs as follows:

1. The Engine initializes and obtains the initial Requests from the Spider.
2. The Engine sends the Requests to the Scheduler.
3. The Scheduler returns the Requests one by one to the Engine.
4. The Engine forwards each Request to the Downloader, passing through the downloader middleware.
5. The Downloader fetches the page and returns the result to the Engine as a Response.
6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, through the spider middleware.
7. The Spider processes the Response and returns Items (and possibly new Requests) to the Engine.
8. The Engine sends the processed Items to the Item Pipeline and, at the same time, sends any new Requests to the Scheduler, asking for the next Requests to crawl.
9. These steps repeat until there are no new Requests left in the Scheduler.
Scrapy installation tutorial: https://doc.scrapy.org/en/lat...
Scrapy project creation
Today we'll take 清博大数据 (Qingbo Big Data) as an example and automate login, search, and data collection.
In the project root directory, run:

```shell
scrapy startproject qingbo
```
Then enter the qingbo/ directory and run:

```shell
scrapy genspider crawl gsdata.cn
```
This produces the following layout:

```
qingbo/
    scrapy.cfg            # deploy configuration file
    qingbo/               # project's Python module; you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # browser start-up and page access are handled here
        pipelines.py      # final data processing happens here
        settings.py       # project settings file
        spiders/          # a directory where you'll put your spiders later
            __init__.py
            crawl.py      # connection, crawling, and data extraction happen here
```
In fact, the key to combining Scrapy with Selenium lies in middlewares.py. For details on how to write the middleware wrapper, see: https://www.osgeo.cn/scrapy/t...
About Selenium
Selenium is an open-source automated-testing framework for verifying Web applications across different browsers and platforms. It can currently be driven from multiple languages, such as Python, Java, and PHP.
Selenium tests run directly in the browser, exactly as a real user would, and we can take advantage of this to collect data more effectively.
Python Selenium installation tutorial: https://selenium-python-zh.re...
Selenium Case
If you open 清博大数据's page for 腾讯视频 (Tencent Video) without a logged-in session, you will, unsurprisingly, be redirected to the login page. The Selenium environment setup was covered above, so let's get straight to the code.
Opening the site

```python
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = False
driver = webdriver.Firefox(options=options)
driver.get('https://u.gsdata.cn/member/login')
driver.implicitly_wait(10)  # the page takes time to load, so an implicit wait is recommended
```
Login operation
Two tabs can be found on the page: QR-code login and Qingbo account login.
The page is open, so how do we log in through the Qingbo account tab?
Here we need to know a little about XPath (XML Path Language), a language for addressing parts of an XML document.
Simply put, we can locate the "清博账号登录" (Qingbo account login) tab with XPath:

```python
driver.find_element_by_xpath(".//div[@class='loginModal-content']/div/a[2]").click()
```
Then locate the username and password boxes and fill in the credentials:

```python
driver.find_element_by_xpath(".//input[@name='username']").send_keys("username")
driver.find_element_by_xpath(".//input[@name='password']").send_keys("password")
```
Finally, click the login button:

```python
driver.find_element_by_xpath(".//div/button[@class='loginform-btn']").click()
driver.implicitly_wait(5)
```

Logged in successfully!
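If you log in often, the steps above can be wrapped in a small helper. This is just a sketch: the XPaths are copied from the snippets above and may break if the page changes.

```python
def login_qingbo(driver, username, password):
    """Log in to u.gsdata.cn through the account/password tab.

    `driver` is a Selenium WebDriver already pointed at the login page;
    the XPaths mirror the snippets above and may change with the page.
    """
    # Switch to the "清博账号登录" (account login) tab.
    driver.find_element_by_xpath(".//div[@class='loginModal-content']/div/a[2]").click()
    # Fill in the credentials.
    driver.find_element_by_xpath(".//input[@name='username']").send_keys(username)
    driver.find_element_by_xpath(".//input[@name='password']").send_keys(password)
    # Submit the form.
    driver.find_element_by_xpath(".//div/button[@class='loginform-btn']").click()
```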
Query operation
```python
driver.get('http://www.gsdata.cn/')
driver.find_element_by_xpath(".//input[@id='search_input']").send_keys("腾讯视频")
driver.find_element_by_xpath(".//button[@class='btn no btn-default fl search_wx']").click()
driver.implicitly_wait(5)
```
The search results look like this:
Locate the 腾讯视频 (Tencent Video) a tag with XPath, then click through to the Tencent Video content page:

```python
driver.find_element_by_xpath(
    ".//ul[@class='imgword-list']/li[1]/div[@class='img-word']/div[@class='word']/h1/a"
).click()
driver.implicitly_wait(5)
```
Contents page
Surprised? Delighted? From here you can use XPath to locate and extract whatever content you need; we won't elaborate further.
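As an illustration of that last step, here is a hypothetical extractor. The XPaths below are made up for the sketch and would need to be adjusted to the real markup of the detail page.

```python
def extract_account(driver):
    """Pull a few fields from the content page.

    Hypothetical: the XPaths are illustrative only and must be adapted
    to the actual gsdata.cn detail-page markup.
    """
    return {
        "name": driver.find_element_by_xpath("//h1").text,
        "wci": driver.find_element_by_xpath("//span[@class='wci']").text,
    }
```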
Close Operation
```python
driver.close()  # closes the current window; use driver.quit() to shut down the browser entirely
```

With data collection complete and nothing else to do, you can close the browser.
Summary
This article introduced the basic concepts and everyday usage of Scrapy and Selenium. Combined, they can offer new solutions and ideas for data-collection problems that neither tool solves well on its own.
Reference
https://www.cnblogs.com/luozx207/p/9003214.html
https://kite.com/blog/python/web-scraping-scrapy/
https://docs.scrapy.org/en/latest/intro/tutorial.html