Python + Scrapy + Selenium Data Acquisition

I'm a good person, an innocent civilian.

Whether a tool is used for good or for bad depends entirely on the user!

Scrapy is a widely used data-acquisition framework;

Selenium is a browser automation testing tool;

By combining Scrapy's data-processing machinery with Selenium's ability to simulate a real browser (for example: automatic login, automatic page turning), we can collect data more effectively.

About Scrapy

Scrapy is one of the most common data-collection tools for Web developers. Fetching data through an API has become commonplace, but some websites still deliberately avoid transferring data through an API, citing "performance or security" (for example: static pages, one-time tokens). To collect such data, we can analyze the site and its markup structure, and then gather the data with Scrapy.

So what role does the Scrapy framework play, and how exactly does it help us collect data? Take a look at Scrapy's architecture:

The data flow in Scrapy is controlled by the Scrapy Engine and proceeds as follows:

  1. The Engine initializes and obtains the initial Requests from the Spider.
  2. The Engine sends the Requests to the Scheduler.
  3. The Scheduler returns the Requests one by one to the Engine.
  4. The Engine sends each Request to the Downloader, passing through the Downloader Middleware.
  5. The Downloader fetches the page for the Request and returns it to the Engine as a Response.
  6. The Engine receives the Response from the Downloader and sends it to the Spider for processing (through the Spider Middleware).
  7. The Spider processes the Response and returns Items to the Engine.
  8. The Engine sends the processed Items to the Item Pipelines and, at the same time, asks the Scheduler for the next Requests.

The above steps repeat until the Scheduler has no more Requests.
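The loop above can be sketched in plain Python as a toy simulation (hypothetical names, not Scrapy's actual implementation): the scheduler is a queue of requests, and each response may yield items or new requests.

```python
from collections import deque

def crawl(start_requests, download, parse):
    """Toy version of the Engine/Scheduler loop described above."""
    scheduler = deque(start_requests)     # step 2: requests go to the scheduler
    items = []
    while scheduler:                      # repeat until the scheduler is empty
        request = scheduler.popleft()     # step 3: scheduler hands one request back
        response = download(request)      # steps 4-5: downloader fetches the page
        for result in parse(response):    # steps 6-7: spider processes the response
            if isinstance(result, dict):
                items.append(result)      # step 8: items go to the pipeline
            else:
                scheduler.append(result)  # new requests go back to the scheduler
    return items

# A fake downloader and spider, just to show the flow:
pages = {"/page1": "/page2", "/page2": None}

def download(url):
    return url, pages[url]

def parse(response):
    url, next_url = response
    yield {"url": url}
    if next_url:
        yield next_url

print(crawl(["/page1"], download, parse))
# → [{'url': '/page1'}, {'url': '/page2'}]
```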

Scrapy installation tutorial: https://doc.scrapy.org/en/lat ...

Scrapy project creation

Today we'll take 清博大数据 (Qingbo Big Data) as an example and automate login, search, and data collection.

Run the following in the project's root directory:

scrapy startproject qingbo

Then enter the qingbo/ directory and run:

scrapy genspider crawl gsdata.cn

This produces the following directory structure:

qingbo/
    scrapy.cfg            # deploy configuration file

    qingbo/               # project's Python module; you'll import your code from here
        __init__.py

        items.py          # project item definitions

        middlewares.py    # browser start-up and page access happen here

        pipelines.py      # final data processing happens here

        settings.py       # project settings file

        spiders/          # a directory where you'll put your spiders later
            __init__.py
            crawl.py      # the crawling logic: connections and data extraction

 

In fact, the key to combining Scrapy with Selenium lies in middlewares.py.

For the specifics of how to wrap this, see: https://www.osgeo.cn/scrapy/t ...

About Selenium

Selenium is an open-source automated testing framework for verifying Web applications across different browsers and platforms. It currently supports bindings in multiple languages, such as Python, Java, and PHP.

Selenium tests run directly in the browser, just as a real user would operate it; we can take advantage of this to collect data more effectively.

Python Selenium installation tutorial: https://selenium-python-zh.re ...

Selenium Case

If we open the Qingbo Data page for 腾讯视频 (Tencent Video) without being logged in, it will, unsurprisingly, redirect to the login page. The Selenium environment setup was covered above, so let's go straight to the code:

Opening the site

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = False
driver = webdriver.Firefox(options=options)
driver.get('https://u.gsdata.cn/member/login')
driver.implicitly_wait(10)  # the page takes time to load; adding an implicit wait is recommended

[screenshot: login page]

Login operation

There are two tabs on the page: QR-code login and Qingbo account login.

The page is open; how do we switch to the Qingbo-account-login tab?

Here we need some knowledge of XPath (XML Path Language), a language for addressing parts of an XML document.

Simply put, we can locate the "Qingbo account login" tab with an XPath expression:

[screenshot: the login tabs]

driver.find_element_by_xpath(".//div[@class='loginModal-content']/div/a[2]").click()

Then locate the username and password boxes and fill in the credentials:

driver.find_element_by_xpath(".//input[@name='username']").send_keys("username")
driver.find_element_by_xpath(".//input[@name='password']").send_keys("password")

Finally, click the login button:

driver.find_element_by_xpath(".//div/button[@class='loginform-btn']").click()
driver.implicitly_wait(5)

[screenshot: logged in]

Login successful!

Query operation

driver.get('http://www.gsdata.cn/')
driver.find_element_by_xpath(".//input[@id='search_input']").send_keys("腾讯视频")
driver.find_element_by_xpath(".//button[@class='btn no btn-default fl search_wx']").click()
driver.implicitly_wait(5)

[screenshot: searching for 腾讯视频]

The search results are as follows:

[screenshot: search results]

Use XPath to locate the 腾讯视频 (Tencent Video) link in the results, then click through to its content page:

driver.find_element_by_xpath(
    ".//ul[@class='imgword-list']/li[1]/div[@class='img-word']/div[@class='word']/h1/a").click()
driver.implicitly_wait(5)

Content page

[screenshot: Tencent Video content page]

Surprised and delighted? From here you can use XPath to locate and extract whatever content you need; we won't elaborate further.
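As an aside, the same XPath style can be exercised outside the browser with Python's standard library. The fragment below is made up (the real page's markup isn't reproduced here), but it mirrors the expressions used above:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, loosely shaped like a results page.
html = """
<ul class="imgword-list">
  <li>
    <div class="word">
      <h1><a href="/wx/tencent">腾讯视频</a></h1>
    </div>
  </li>
</ul>
"""

root = ET.fromstring(html)

# ElementTree supports a useful subset of XPath:
link = root.find(".//li[1]/div[@class='word']/h1/a")
print(link.text)         # → 腾讯视频
print(link.get("href"))  # → /wx/tencent
```

This is handy for testing an expression against a saved snippet before pointing Selenium at the live page.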

Close Operation

driver.close()

Data collection is complete; if there is nothing else to do, we can close the browser.
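To make sure the browser gets closed even when an XPath lookup raises midway, the driver can be wrapped with `contextlib.closing`, which calls `close()` on exit no matter what. A stand-in object is used below, since a real `webdriver.Firefox()` needs a browser installed:

```python
from contextlib import closing

class FakeDriver:
    """Stand-in with the same close() contract as a Selenium driver."""
    def __init__(self):
        self.closed = False
    def get(self, url):
        if not url.startswith("http"):
            raise ValueError("bad url")
    def close(self):
        self.closed = True

driver = FakeDriver()
try:
    with closing(driver):    # close() runs even though the body raises
        driver.get("bad-url")
except ValueError:
    pass

print(driver.closed)  # → True
```

With a real driver, note that `close()` only closes the current window, while `quit()` ends the whole session; a `try/finally: driver.quit()` is the common pattern.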

Summary

This article introduced the basic concepts and typical usage of Scrapy and Selenium. In general, combining them can offer new solutions and ideas for certain data-collection problems.


 


Origin www.cnblogs.com/wilburxu/p/12581049.html