Hands-on | From 0 to 1: teaching you to crawl a nationwide weather site with Python

Scrapy

Scrapy is a fast, high-level screen scraping and web crawling framework developed in Python, used to crawl web sites and extract structured data from their pages.

First, install Scrapy

Install Twisted
  • Twisted: an event-driven networking engine written in Python.

  • Download the Twisted package from the following URL and install it:

url: https://www.lfd.uci.edu/~gohlke/pythonlibs/
Install Scrapy
  • Open cmd and enter pip install scrapy

  • After installation, enter scrapy in cmd; if the usage information appears, the installation succeeded.

Second, understand scrapy

Scrapy components

  • Engine: processes the data flow of the whole system and triggers transactions.

  • Scheduler: receives requests sent over by the engine, pushes them into a queue, and returns them when the engine requests them again.

  • Downloader: downloads web content and returns it to the spider.

  • Spider: this is where the main work is done; you use it to write the rules for parsing a specific domain or page.

  • Item pipeline: responsible for processing the items the spider extracts from web pages; its main tasks are cleaning, validating, and storing the data. After a page is parsed by the spider, the items are sent to the pipeline and processed through several specific steps in order.

  • Downloader middleware: hooks positioned between the Scrapy engine and the downloader, mainly handling the requests and responses passing between them.

  • Spider middleware: hooks positioned between the Scrapy engine and the spiders; their main job is processing the spiders' input (responses) and output (requests).

  • Scheduler middleware: middleware positioned between the Scrapy engine and the scheduler, handling the requests and responses sent between them.

Its processing flow is:

  • When the engine opens a domain, it locates the spider that handles that domain and asks the spider for the first URLs to crawl.

  • The engine gets the first URL to crawl from the spider and schedules it as a request in the scheduler.

  • The engine asks the scheduler for the next URL to crawl.

  • The scheduler returns the next URL to crawl to the engine, and the engine sends it to the downloader through the downloader middleware.

  • When the page finishes downloading, the downloader sends a response back to the engine through the downloader middleware.

  • The engine receives the response from the downloader and sends it to the spider for processing through the spider middleware.

  • The spider processes the response, returns the extracted items, and sends new requests to the engine.

  • The engine sends the extracted items to the item pipeline and the new requests to the scheduler.

  • The system repeats from the second step until there are no more requests in the scheduler.

Third, project analysis

Crawl city weather information from the air quality network.
url : https://www.aqistudy.cn/historydata/
Key information to crawl: daily air quality information for each of the popular cities
Also click into each month and crawl its daily air quality information

Fourth, create the project

  • Create a new folder for the crawler, named Weather Network

  • cd into the root directory, open cmd, and run scrapy startproject weather_spider

  • Create a spider

cd into the root directory and run scrapy genspider weather www.aqistudy.cn/historydata
Here weather is the spider's name
  • The path created is as follows:
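The screenshot of the generated files is missing here; scrapy startproject plus genspider produce the standard layout below:

weather_spider/
    scrapy.cfg
    weather_spider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            weather.py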

Fifth, coding

For a scrapy project, the first step must be writing items.py, to make clear what we are going to crawl.
  • items.py


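The items.py code was a screenshot in the original; a minimal sketch is below. The exact field names are assumptions based on the columns of the site's daily air quality tables (date, AQI, quality level, PM2.5, PM10):

import scrapy


class WeatherSpiderItem(scrapy.Item):
    # field names are assumptions matching the site's daily table columns
    city = scrapy.Field()    # city name, carried along via meta
    date = scrapy.Field()    # date of the record
    aqi = scrapy.Field()     # air quality index
    level = scrapy.Field()   # quality grade
    pm2_5 = scrapy.Field()   # PM2.5 concentration
    pm10 = scrapy.Field()    # PM10 concentration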
To disguise the crawler well we need User-Agents; define MY_USER_AGENT in settings.py to hold a pool of UAs. Note that names in settings must be capitalized.

  • settings.py

MY_USER_AGENT = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",]复制代码
After defining the UA pool, create a RandomUserAgentMiddleware class in middlewares.py.
  • middlewares.py


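The middleware code was also a screenshot; the usual pattern, reading the MY_USER_AGENT list from settings and picking one at random per request, looks like this (a sketch, not necessarily the article's exact code):

import random


class RandomUserAgentMiddleware(object):
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # read the MY_USER_AGENT list defined in settings.py
        return cls(user_agents=crawler.settings.get('MY_USER_AGENT'))

    def process_request(self, request, spider):
        # set a random User-Agent on every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agents)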
Note: to activate it, register it in settings.py (900 is the priority given to our middleware) and remove scrapy's own UserAgentMiddleware by setting it to None.

  • settings.py


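The settings screenshot is missing; the activation presumably looks like the snippet below, with our middleware at priority 900 and the built-in UserAgentMiddleware disabled:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'weather_spider.middlewares.RandomUserAgentMiddleware': 900,
}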
Now start writing the most important file, weather.py; we recommend debugging step by step with scrapy shell.

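For example, the city-list xpath used in the code below can be tried out interactively before it goes into the spider (the city list page is static, so the shell works on it directly):

scrapy shell https://www.aqistudy.cn/historydata/
>>> response.xpath('//div[@class="all"]/div[@class="bottom"]//li/a/text()').extract()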
  • First, get all of the cities

The xpath method in scrapy uses the same xpath syntax as lxml.
We can see that the front part of the url is missing; the follow method can splice the url automatically. The city name we need to keep is passed along with the meta parameter, and the callback method schedules the next url to crawl.
  • weather.py

def parse(self, response):
    city_urls = response.xpath('//div[@class="all"]/div[@class="bottom"]//li/a/@href').extract()[16:17]
    city_names = response.xpath('//div[@class="all"]/div[@class="bottom"]//li/a/text()').extract()[16:17]
    self.logger.info('Crawling the url of city {}'.format(city_names[0]))
    for city_url, city_name in zip(city_urls, city_names):
        # the follow shortcut splices the relative url automatically
        yield response.follow(url=city_url, meta={'city': city_name}, callback=self.parse_month)
Then define the parse_month function to parse the month detail pages of each city and get the month urls.
Again, debug step by step in scrapy shell.
The city_name passed through is kept so that the url can be spliced with the follow method; the meta also carries selenium: True, which we ignore for now.
Then the callback method schedules the next url to crawl, i.e. the crawl of the daily detail page.
  • weather.py

def parse_month(self, response):
    """
    Parse the month urls of a city.
    :param response:
    :return:
    """
    city_name = response.meta['city']
    self.logger.info('Crawling the month urls of city {}'.format(city_name))
    # the full crawl is too large, so only take the first 5 months
    month_urls = response.xpath('//ul[@class="unstyled1"]/li/a/@href').extract()[0:5]
    for month_url in month_urls:
        yield response.follow(url=month_url, meta={'city': city_name, 'selenium': True}, callback=self.parse_day_data)
Extracting the daily detail page with xpath,
we find it comes back empty.
The information is not in the page source either.
This means the data is generated by js, and scrapy can only crawl static content. This is where hooking selenium into scrapy comes in: the meta parameter passed above is there to tell scrapy to use selenium for this crawl.
Override the process_request method of WeatherSpiderDownloaderMiddleware in the downloader middleware:

middlewares.py

import time
import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

class WeatherSpiderDownloaderMiddleware(object):
    def process_request(self, request, spider):
        if request.meta.get('selenium'):
            # let the browser work without a visible window
            chrome_options = Options()
            # set chrome to headless mode
            chrome_options.add_argument('--headless')
            driver = webdriver.Chrome(chrome_options=chrome_options)
            # visit the address with the browser
            driver.get(request.url)
            time.sleep(1.5)  # give the browser time to load and render
            html = driver.page_source
            driver.quit()
            return scrapy.http.HtmlResponse(url=request.url, body=html, encoding='utf-8', request=request)
        return None

Activate WeatherSpiderDownloaderMiddleware


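The activation screenshot is missing; in settings.py it presumably looks like this (543 is the default priority in the generated settings template and is an assumption here):

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'weather_spider.middlewares.RandomUserAgentMiddleware': 900,
    'weather_spider.middlewares.WeatherSpiderDownloaderMiddleware': 543,
}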
Finally, write the rest of the code in weather.py.

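The final weather.py code was a screenshot; a sketch of the missing parse_day_data method is below. The table xpath and the item fields are assumptions (they must match items.py), and WeatherSpiderItem needs to be imported at the top of weather.py (from weather_spider.items import WeatherSpiderItem):

def parse_day_data(self, response):
    """Parse the daily air-quality table rendered by selenium (sketch)."""
    rows = response.xpath('//table//tr')
    # the first row is the table header, so skip it
    for row in rows[1:]:
        item = WeatherSpiderItem()
        item['city'] = response.meta['city']
        item['date'] = row.xpath('./td[1]/text()').extract_first()
        item['aqi'] = row.xpath('./td[2]/text()').extract_first()
        item['level'] = row.xpath('./td[3]/text()').extract_first()
        item['pm2_5'] = row.xpath('./td[4]/text()').extract_first()
        item['pm10'] = row.xpath('./td[5]/text()').extract_first()
        yield item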


Sixth, run the project

We must pay attention to execute the command in the project's root directory; check that the project exists with scrapy list.
Scrapy's easiest way of saving information supports four formats; -o specifies the output file. The commands are as follows:
Default json
  • scrapy crawl weather -o spider.json

json lines format, encoded as Unicode by default
  • scrapy crawl weather -o spider.jl

csv, a comma-separated format that can be opened with Excel
  • scrapy crawl weather -o spider.csv

xml format
  • scrapy crawl weather -o spider.xml

But Chinese characters are not saved readably this way; FEED_EXPORT_ENCODING = 'utf-8' must be added in settings.

Seventh, storage operation

The database used here is MongoDB; configure it in settings.py:


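The settings screenshot is missing; placeholder connection settings might look like this (host, port, and names are assumptions):

MONGODB_HOST = '127.0.0.1'      # local MongoDB instance (assumption)
MONGODB_PORT = 27017            # default MongoDB port
MONGODB_DBNAME = 'weather'      # database name (assumption)
MONGODB_COLLECTION = 'weather'  # collection name (assumption)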
For beginners, the main work to tackle is in the pipelines:

  • pipelines.py

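The pipelines.py screenshot is missing; a sketch following the common pymongo pattern, reading the connection settings defined above, is:

import pymongo


class WeatherSpiderPipeline(object):
    def open_spider(self, spider):
        # connect once when the spider starts
        settings = spider.settings
        self.client = pymongo.MongoClient(
            host=settings.get('MONGODB_HOST'),
            port=settings.get('MONGODB_PORT'))
        db = self.client[settings.get('MONGODB_DBNAME')]
        self.collection = db[settings.get('MONGODB_COLLECTION')]

    def process_item(self, item, spider):
        # store each item as a plain dict
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()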
  • Activate the pipeline in settings

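The activation screenshot is also missing; the conventional form, with 300 as the template's default priority, is:

ITEM_PIPELINES = {
    'weather_spider.pipelines.WeatherSpiderPipeline': 300,
}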

Results are as follows

Eighth, conclusion

By crawling this weather site as a learning exercise, we have shown most of Scrapy's knowledge points. By rewriting the list slicing you can crawl all of Beijing's weather information, and of course the weather information of all cities, which amounts to crawling basically this whole weather network.


To get the source code, click "read the original text" or reply Scrapy in the background.




Origin: juejin.im/post/5d81c9cee51d4561b072ddc2