Using the Scrapy framework with Spynner to collect web pages that load their data dynamically through JS and Ajax, and to extract their content (using the article list of a WeChat public account as the example)

There are several types of web page collection:

1. Static web pages

2. Dynamic web pages (pages that load their data dynamically through JS and Ajax)

3. Web pages that can only be collected after a simulated login

4. Encrypted web pages

 

Solutions and ideas for types 3 and 4 will be covered in follow-up posts.

This post covers only types 1 and 2:

1. Static web pages

      There are many ways to collect and parse static web pages. Both Java and Python offer plenty of toolkits and frameworks, such as Java's HttpClient, HtmlUnit, Jsoup and HtmlParser, and Python's urllib, urllib2, BeautifulSoup and Scrapy. I won't go into detail here, since there is plenty of material online.

 

2. Dynamic web pages

      For collection purposes, dynamic web pages are pages whose data is loaded dynamically through JS and Ajax. There are two ways to collect them:

      1. Use a packet capture tool to analyze the JS and Ajax requests, then simulate those requests to obtain the data that the JS loads.

      2. Call a browser kernel, get the page source after loading has finished, and then parse that source.
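
      The first approach can be sketched with the standard library alone. This is only an illustration written for Python 3, not the method used in the rest of this article: the endpoint URL, the query parameters and the JSON layout (`items`, `title`, `url`) are hypothetical placeholders for whatever the packet capture actually shows.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_ajax_request(endpoint, params):
    """Rebuild the request that the page's JS would send."""
    query = urlencode(sorted(params.items()))
    headers = {
        # Many Ajax endpoints expect this header; whether a given one does is site-specific.
        "X-Requested-With": "XMLHttpRequest",
        "User-Agent": "Mozilla/5.0",
    }
    return Request(endpoint + "?" + query, headers=headers)


def parse_article_list(body):
    """Pull (title, url) pairs out of a JSON response body."""
    payload = json.loads(body)
    return [(item["title"], item["url"]) for item in payload.get("items", [])]


# Hypothetical endpoint and parameters, as they might appear in a packet capture.
request = build_ajax_request(
    "http://example.com/ajax/articles", {"page": 1, "openid": "abc"}
)

# Hypothetical response body. Actually sending the request would be
# urllib.request.urlopen(request).read().
sample_body = '{"items": [{"title": "First post", "url": "http://example.com/a/1"}]}'
articles = parse_article_list(sample_body)
```

      The trade-off is that this approach avoids running a browser, but it has to be redone whenever the site changes its request format.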

      Anyone who studies crawlers needs to know some JS. There is plenty of learning material online, so I won't cover it here; I mention it only for the sake of completeness.

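There are also several Java toolkits that can call a browser kernel, but they are not today's topic. Today's focus is the title of this article: using the Scrapy framework with Spynner to collect web pages that load their data dynamically through JS and Ajax, and extracting their content (using the article list of a WeChat public account as the example).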

 

 

Let's get started.

1. Create the project for collecting the WeChat public account article list (referred to below as "micro collection")

scrapy startproject weixin

 

2. Create the spider file in the spiders directory

vim weixinlist.py

    Put the following code in it:

import sys
sys.path.insert(0, '..')  # extend the module search path with the parent directory

import scrapy
from scrapy import Spider

from weixin.items import WeixinItem


class MySpider(Spider):
    name = 'weixinlist'
    allowed_domains = []
    start_urls = [
        'http://weixin.sogou.com/gzh?openid=oIWsFt5QBSP8mn4Jx2WSGw_rCNzQ',
    ]
    download_delay = 1
    print('start init....')  # runs once, when the class body is executed

    def parse(self, response):
        sel = scrapy.Selector(response)
        print('hello,world!')
        print(response)
        print(sel)
        # each article title is an <h4> inside <div class="txt-box">
        headings = sel.xpath('//div[@class="txt-box"]/h4')
        for single in headings:
            data = WeixinItem()
            # extract() returns a list of matches, which may be empty
            title = single.xpath('a/text()').extract()
            link = single.xpath('a/@href').extract()
            data['title'] = title
            data['link'] = link
            if len(title) > 0:
                print(title[0].encode('utf-8'))
                print(link)
            # data is only printed here; add `yield data` to pass it to the item pipelines
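
To see what the two XPath expressions in `parse` select, they can be tried on a small piece of markup. The sketch below is only an illustration: the HTML fragment is invented to mirror the `txt-box` structure the spider looks for, and the standard library's `xml.etree.ElementTree` stands in for `scrapy.Selector` (it supports only a subset of XPath and needs well-formed markup).

```python
import xml.etree.ElementTree as ET

# Invented sample markup, shaped like the part of the page the spider targets.
SAMPLE = """
<html><body>
  <div class="txt-box"><h4><a href="http://example.com/a/1">First article</a></h4></div>
  <div class="txt-box"><h4><a href="http://example.com/a/2">Second article</a></h4></div>
  <div class="other"><h4><a href="http://example.com/x">Not an article</a></h4></div>
</body></html>
"""


def extract_articles(markup):
    root = ET.fromstring(markup.strip())
    articles = []
    # counterpart of //div[@class="txt-box"]/h4
    for heading in root.findall(".//div[@class='txt-box']/h4"):
        anchor = heading.find("a")  # counterpart of the relative path a
        if anchor is None:
            continue
        articles.append({"title": anchor.text, "link": anchor.get("href")})
    return articles


articles = extract_articles(SAMPLE)
```

Unlike this sketch, `extract()` in the spider returns a list for each field, which is why the spider prints `link` as a list.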

 

 

3. Add the WeixinItem class to items.py

 

import scrapy


class WeixinItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()

 

 

4. Create a downloader middleware, downloadwebkit.py, in the same directory as items.py, and put the following code in it:

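import spynner
import pyquery
from scrapy.http import HtmlResponse


class WebkitDownloaderTest(object):
    def process_request(self, request, spider):
        # Check left disabled in the original, so every request is rendered with WebKit:
        #   if spider.name in settings.WEBKIT_DOWNLOADER:
        #       if type(request) is not FormRequest:
        browser = spynner.Browser()
        browser.create_webview()
        browser.set_html_parser(pyquery.PyQuery)
        browser.load(request.url, 20)  # 20 is the load timeout in seconds
        try:
            # give the JS/Ajax content up to 10 more seconds to finish loading
            browser.wait_load(10)
        except Exception:
            # a timeout here is not fatal; use whatever has been rendered so far
            pass
        # browser.html is the page source after the scripts have run
        rendered_body = browser.html.encode('utf-8')
        # returning a response here makes Scrapy skip its own download
        return HtmlResponse(request.url, body=rendered_body, encoding='utf-8')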

 

 

   This code calls the browser kernel and returns the page source as it is after the page has finished loading.
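
   Returning a response from `process_request` is enough because of how Scrapy treats the return value: when a downloader middleware's `process_request` returns a response, Scrapy does not call the remaining `process_request` methods or its own downloader for that request; when it returns `None`, processing continues. The toy model below only illustrates that rule. It is not Scrapy's code, and every name in it (`run_request_chain`, `passthrough`, `fake_render`, `fake_download`) is made up for the illustration.

```python
def run_request_chain(process_request_hooks, request, download):
    """Toy model: the first hook that returns something other than None wins."""
    for hook in process_request_hooks:
        result = hook(request)
        if result is not None:
            return result  # short-circuit: later hooks and the downloader are skipped
    return download(request)


calls = []


def passthrough(request):
    calls.append("passthrough")
    return None  # let the request continue down the chain


def fake_render(request):
    calls.append("render")
    return "rendered:" + request  # plays the role of the HtmlResponse


def fake_download(request):
    calls.append("download")
    return "raw:" + request


result = run_request_chain(
    [passthrough, fake_render], "http://example.com/", fake_download
)
```

   In this run `fake_download` is never called, which mirrors how the WebKit middleware replaces Scrapy's normal download.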

5. Edit settings.py to declare that downloads go through this downloader middleware

    Add the following code at the bottom:

# which spiders should use WebKit
WEBKIT_DOWNLOADER = ['weixinlist']

DOWNLOADER_MIDDLEWARES = {
    'weixin.downloadwebkit.WebkitDownloaderTest': 543,
}

import os
os.environ["DISPLAY"] = ":0"
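
The last line overwrites `DISPLAY` unconditionally. If the crawler may also run where a display is already configured (for example under a virtual X server), a variant that only fills in a missing value could look like the sketch below. The helper name `ensure_display` is hypothetical.

```python
import os


def ensure_display(environ, default=":0"):
    """Set DISPLAY only when it is missing or empty; return the value in effect."""
    if not environ.get("DISPLAY"):
        environ["DISPLAY"] = default
    return environ["DISPLAY"]


# In settings.py this would be called as ensure_display(os.environ).
# It is demonstrated here on plain dicts so the real environment is left alone.
unset_env = {}
preset_env = {"DISPLAY": ":99"}
first = ensure_display(unset_env)
second = ensure_display(preset_env)
```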

 

 

 

6. Run the program:

    Run the command:

 

scrapy crawl weixinlist

    Output:

kevinflynndeMacBook-Pro:spiders kevinflynn$ scrapy crawl weixinlist
start init....
2015-07-28 21:13:55 [scrapy] INFO: Scrapy 1.0.1 started (bot: weixin)
2015-07-28 21:13:55 [scrapy] INFO: Optional features available: ssl, http11
2015-07-28 21:13:55 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'weixin.spiders', 'SPIDER_MODULES': ['weixin.spiders'], 'BOT_NAME': 'weixin'}
2015-07-28 21:13:55 [py.warnings] WARNING: :0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'.  Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied.  Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification.  Many valid certificate/hostname mappings may be rejected.

2015-07-28 21:13:55 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-28 21:13:55 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, WebkitDownloaderTest, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-28 21:13:55 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-28 21:13:55 [scrapy] INFO: Enabled item pipelines:
2015-07-28 21:13:55 [scrapy] INFO: Spider opened
2015-07-28 21:13:55 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-28 21:13:55 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
QFont::setPixelSize: Pixel size <= 0 (0)
2015-07-28 21:14:08 [scrapy] DEBUG: Crawled (200) <GET http://weixin.sogou.com/gzh?openid=oIWsFt5QBSP8mn4Jx2WSGw_rCNzQ> (referer: None)
hello,world!
<200 http://weixin.sogou.com/gzh?openid=oIWsFt5QBSP8mn4Jx2WSGw_rCNzQ>
<Selector xpath=None data=u'<html><head><meta http-equiv="X-UA-Compa'>
Getting Started with Internet Protocols
[u'http://mp.weixin.qq.com/s?__biz=MzA4MjEyNTA5Mw==&mid=210032701&idx=1&sn=6b1fc2bc5d4eb0f87513751e4ccf610c&3rd=MzA3MDU4NTYzMw==&scene=6#rd']
Write your own Bayesian classifier to classify books
[u'http://mp.weixin.qq.com/s?__biz=MzA4MjEyNTA5Mw==&mid=210013947&idx=1&sn=1f36ba5794e22d0fb94a9900230e74ca&3rd=MzA3MDU4NTYzMw==&scene=6#rd']
10 Ways to Improperly Free Tech Support
[u'http://mp.weixin.qq.com/s?__biz=MzA4MjEyNTA5Mw==&mid=209998175&idx=1&sn=216106034a3b4afea6e67f813ce1971f&3rd=MzA3MDU4NTYzMw==&scene=6#rd']
Using Python as an example to introduce Bayesian theory
[u'http://mp.weixin.qq.com/s?__biz=MzA4MjEyNTA5Mw==&mid=209998175&idx=2&sn=2f3dee873d7350dfe9546ab4a9323c05&3rd=MzA3MDU4NTYzMw==&scene=6#rd']
I "stealed" 30 million QQ user data from Tencent, and made a very interesting...
[u'http://mp.weixin.qq.com/s?__biz=MzA4MjEyNTA5Mw==&mid=209980651&idx=1&sn=11fd40a2dee5132b0de8d4c79a97dac2&3rd=MzA3MDU4NTYzMw==&scene=6#rd']
How to quickly develop applications with Spark?
[u'http://mp.weixin.qq.com/s?__biz=MzA4MjEyNTA5Mw==&mid=209820653&idx=2&sn=23712b78d82fb412e960c6aa1e361dd3&3rd=MzA3MDU4NTYzMw==&scene=6#rd']
Let's write a simple interpreter together (1)
[u'http://mp.weixin.qq.com/s?__biz=MzA4MjEyNTA5Mw==&mid=209797651&idx=1&sn=15073e27080e6b637c8d24b6bb815417&3rd=MzA3MDU4NTYzMw==&scene=6#rd']
The guy who changed the bug directly in the machine code
[u'http://mp.weixin.qq.com/s?__biz=MzA4MjEyNTA5Mw==&mid=209762756&idx=1&sn=04ae1bc3a366d358f474ac3e9a85fb60&3rd=MzA3MDU4NTYzMw==&scene=6#rd']
Open source a library, what should you do
[u'http://mp.weixin.qq.com/s?__biz=MzA4MjEyNTA5Mw==&mid=209762756&idx=2&sn=0ac961ffd82ead6078a60f25fed3c2c4&3rd=MzA3MDU4NTYzMw==&scene=6#rd']
The programmer's dilemma
[u'http://mp.weixin.qq.com/s?__biz=MzA4MjEyNTA5Mw==&mid=209696436&idx=1&sn=8cb55b03c8b95586ba4498c64fa54513&3rd=MzA3MDU4NTYzMw==&scene=6#rd']
2015-07-28 21:14:08 [scrapy] INFO: Closing spider (finished)
2015-07-28 21:14:08 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 131181,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 7, 28, 13, 14, 8, 958071),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'log_count/WARNING': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2015, 7, 28, 13, 13, 55, 688111)}
2015-07-28 21:14:08 [scrapy] INFO: Spider closed (finished)
QThread: Destroyed while thread is still running
kevinflynndeMacBook-Pro:spiders kevinflynn$

 

 

    

 
