There are several types of web page collection:
1. Static web pages
2. Dynamic web pages (pages that need JS/AJAX to load their data)
3. Web pages that can only be collected after a simulated login
4. Encrypted web pages
Solutions and ideas for types 3 and 4 will be covered in a follow-up post. Here I only deal with types 1 and 2:
1. Static web pages
There are many ways to collect and parse static web pages. Both Java and Python provide plenty of toolkits and frameworks: Java has HttpClient, HtmlUnit, Jsoup, HtmlParser, and so on; Python has urllib, urllib2, BeautifulSoup, Scrapy, and so on. I won't go into detail, as there is plenty of material online.
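As a minimal illustration, a static page can even be parsed with nothing but the standard library. A Python 3 sketch (the sample HTML is made up for this example) that pulls every link out of a page:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in a static HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

# Sample HTML standing in for a downloaded static page.
html = '<html><body><a href="/a1">one</a><a href="/a2">two</a></body></html>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/a1', '/a2']
```

In practice you would feed it the bytes fetched with urllib (or a richer parser such as BeautifulSoup), but the extraction idea is the same.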
2. Dynamic web pages
For collection purposes, dynamic web pages are pages that must run JS or AJAX to load their data. There are two ways to collect them:
1. Use a packet-capture tool to analyze the JS/AJAX requests, then simulate those requests to fetch the data they load.
2. Drive a browser kernel to render the page, take the fully loaded source code, and parse that.
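A rough sketch of the first approach, in Python 3. The endpoint URL, the header set, and the JSON field names below are all invented for illustration; the real ones come from your packet capture. The response body is hard-coded here so the sketch runs offline:

```python
import json
from urllib.request import Request

# Hypothetical AJAX endpoint discovered with a packet-capture tool.
api_url = 'http://example.com/api/articles?page=1'

# Replaying the request usually means copying the headers the browser sent,
# especially X-Requested-With, Referer, and Cookie.
req = Request(api_url, headers={
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0',
})

# A typical JSON body such an endpoint might return; in a real run it would
# come from urlopen(req).read() instead of a literal string.
body = '{"items": [{"title": "hello", "link": "/a/1"}]}'
data = json.loads(body)
print([item['title'] for item in data['items']])  # ['hello']
```

The payoff of this approach is speed: you get structured JSON directly, with no browser rendering at all. The cost is that you must reverse-engineer the request format, which can change without notice.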
Anyone studying crawlers needs to know some JS; there is plenty of learning material online, so I won't elaborate. This part is included only for the article's completeness.
There are also several Java toolkits that can drive a browser kernel, but they are not today's focus. Today's focus is what the title says: collecting a WeChat public account's article list as the example.
Let's get started.
1. Create a project for collecting WeChat public account article lists (the project is simply named weixin):
scrapy startproject weixin
2. Create a spider file in the spiders directory:
vim weixinlist.py
and write the following code into it:
from weixin.items import WeixinItem
import sys
sys.path.insert(0, '..')
import scrapy
import time
from scrapy import Spider

class MySpider(Spider):
    name = 'weixinlist'
    allowed_domains = []
    start_urls = [
        'http://weixin.sogou.com/gzh?openid=oIWsFt5QBSP8mn4Jx2WSGw_rCNzQ',
    ]
    download_delay = 1
    print('start init....')

    def parse(self, response):
        sel = scrapy.Selector(response)
        print('hello,world!')
        print(response)
        print(sel)
        list = sel.xpath('//div[@class="txt-box"]/h4')
        items = []
        for single in list:
            data = WeixinItem()
            title = single.xpath('a/text()').extract()
            link = single.xpath('a/@href').extract()
            data['title'] = title
            data['link'] = link
            if len(title) > 0:
                print(title[0].encode('utf-8'))
                print(link)
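The XPath in parse() can be exercised on its own, without Scrapy. Here is a dependency-free sketch using the standard library's xml.etree, whose limited XPath subset happens to cover this expression; the sample markup is made up to imitate the txt-box structure of the Sogou result page:

```python
import xml.etree.ElementTree as ET

# Made-up markup mirroring the structure the spider selects with
# //div[@class="txt-box"]/h4 (must be well-formed XML for ElementTree).
page = '''<html><body>
<div class="txt-box"><h4><a href="http://example.com/1">First article</a></h4></div>
<div class="txt-box"><h4><a href="http://example.com/2">Second article</a></h4></div>
</body></html>'''

root = ET.fromstring(page)
items = []
for h4 in root.findall('.//div[@class="txt-box"]/h4'):
    a = h4.find('a')  # same as the relative a/text() and a/@href steps
    items.append({'title': a.text, 'link': a.get('href')})

print(items)
```

Testing selectors this way (or in the scrapy shell) before wiring them into a spider saves a lot of crawl-debug round trips.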
3. Add the WeixinItem class to items.py
import scrapy

class WeixinItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
4. Create a download middleware downloadwebkit.py in the same directory as items.py, and write the following code into it:
import spynner
import pyquery
import time
import BeautifulSoup
import sys
from scrapy.http import HtmlResponse

class WebkitDownloaderTest(object):
    def process_request(self, request, spider):
        # if spider.name in settings.WEBKIT_DOWNLOADER:
        #     if type(request) is not FormRequest:
        browser = spynner.Browser()
        browser.create_webview()
        browser.set_html_parser(pyquery.PyQuery)
        browser.load(request.url, 20)
        try:
            browser.wait_load(10)
        except:
            pass
        string = browser.html
        string = string.encode('utf-8')
        renderedBody = str(string)
        return HtmlResponse(request.url, body=renderedBody)
This middleware drives the browser kernel and returns the page source captured after the page has finished loading.
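What makes this work is Scrapy's downloader-middleware contract: if process_request() returns a Response object, Scrapy skips the real download and hands that response on toward the spider. A dependency-free sketch of that dispatch rule (the stub classes below are stand-ins for illustration, not Scrapy's actual implementation):

```python
class Request:
    def __init__(self, url):
        self.url = url

class HtmlResponse:
    def __init__(self, url, body):
        self.url, self.body = url, body

class RenderingMiddleware:
    """Stands in for WebkitDownloaderTest: pretends a browser rendered the page."""
    def process_request(self, request, spider=None):
        rendered = '<html>rendered: %s</html>' % request.url
        return HtmlResponse(request.url, body=rendered)

def download(request, middlewares):
    # Simplified version of Scrapy's rule: the first middleware that returns
    # a response short-circuits the real network download.
    for mw in middlewares:
        response = mw.process_request(request)
        if response is not None:
            return response
    raise RuntimeError('would hit the network here')

resp = download(Request('http://example.com'), [RenderingMiddleware()])
print(resp.body)  # <html>rendered: http://example.com</html>
```

That short-circuit is why the spider's parse() sees the fully rendered HTML instead of the raw server response.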
5. Configure settings.py to declare that downloads go through this middleware. Add the following at the bottom:
# which spider should use WEBKIT
WEBKIT_DOWNLOADER = ['weixinlist']
DOWNLOADER_MIDDLEWARES = {
    'weixin.downloadwebkit.WebkitDownloaderTest': 543,
}

import os
os.environ["DISPLAY"] = ":0"
6. Run the spider:
scrapy crawl weixinlist
Output:
kevinflynndeMacBook-Pro:spiders kevinflynn$ scrapy crawl weixinlist
start init....
2015-07-28 21:13:55 [scrapy] INFO: Scrapy 1.0.1 started (bot: weixin)
2015-07-28 21:13:55 [scrapy] INFO: Optional features available: ssl, http11
2015-07-28 21:13:55 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'weixin.spiders', 'SPIDER_MODULES': ['weixin.spiders'], 'BOT_NAME': 'weixin'}
2015-07-28 21:13:55 [py.warnings] WARNING: :0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
2015-07-28 21:13:55 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-28 21:13:55 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, WebkitDownloaderTest, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-28 21:13:55 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-28 21:13:55 [scrapy] INFO: Enabled item pipelines:
2015-07-28 21:13:55 [scrapy] INFO: Spider opened
2015-07-28 21:13:55 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-28 21:13:55 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
QFont::setPixelSize: Pixel size <= 0 (0)
2015-07-28 21:14:08 [scrapy] DEBUG: Crawled (200) <GET http://weixin.sogou.com/gzh?openid=oIWsFt5QBSP8mn4Jx2WSGw_rCNzQ> (referer: None)
hello,world!
<200 http://weixin.sogou.com/gzh?openid=oIWsFt5QBSP8mn4Jx2WSGw_rCNzQ>
<Selector xpath=None data=u'<html><head><meta http-equiv="X-UA-Compa'>
Getting Started with Internet Protocols
[u'http://mp.weixin.qq.com/s?__biz=MzA4MjEyNTA5Mw==&mid=210032701&idx=1&sn=6b1fc2bc5d4eb0f87513751e4ccf610c&3rd=MzA3MDU4NTYzMw==&scene=6#rd']
Write your own Bayesian classifier to classify books
[u'http://mp.weixin.qq.com/s?__biz=MzA4MjEyNTA5Mw==&mid=210013947&idx=1&sn=1f36ba5794e22d0fb94a9900230e74ca&3rd=MzA3MDU4NTYzMw==&scene=6#rd']
10 Ways to Improperly Free Tech Support
[u'http://mp.weixin.qq.com/s?__biz=MzA4MjEyNTA5Mw==&mid=209998175&idx=1&sn=216106034a3b4afea6e67f813ce1971f&3rd=MzA3MDU4NTYzMw==&scene=6#rd']
Using Python as an example to introduce Bayesian theory
[u'http://mp.weixin.qq.com/s?__biz=MzA4MjEyNTA5Mw==&mid=209998175&idx=2&sn=2f3dee873d7350dfe9546ab4a9323c05&3rd=MzA3MDU4NTYzMw==&scene=6#rd']
I "stole" 30 million QQ user data from Tencent, and made a very interesting...
[u'http://mp.weixin.qq.com/s?__biz=MzA4MjEyNTA5Mw==&mid=209980651&idx=1&sn=11fd40a2dee5132b0de8d4c79a97dac2&3rd=MzA3MDU4NTYzMw==&scene=6#rd']
How to quickly develop applications with Spark?
[u'http://mp.weixin.qq.com/s?__biz=MzA4MjEyNTA5Mw==&mid=209820653&idx=2&sn=23712b78d82fb412e960c6aa1e361dd3&3rd=MzA3MDU4NTYzMw==&scene=6#rd']
Let's write a simple interpreter together (1)
[u'http://mp.weixin.qq.com/s?__biz=MzA4MjEyNTA5Mw==&mid=209797651&idx=1&sn=15073e27080e6b637c8d24b6bb815417&3rd=MzA3MDU4NTYzMw==&scene=6#rd']
The guy who changed the bug directly in the machine code
[u'http://mp.weixin.qq.com/s?__biz=MzA4MjEyNTA5Mw==&mid=209762756&idx=1&sn=04ae1bc3a366d358f474ac3e9a85fb60&3rd=MzA3MDU4NTYzMw==&scene=6#rd']
Open source a library, what should you do
[u'http://mp.weixin.qq.com/s?__biz=MzA4MjEyNTA5Mw==&mid=209762756&idx=2&sn=0ac961ffd82ead6078a60f25fed3c2c4&3rd=MzA3MDU4NTYzMw==&scene=6#rd']
The programmer's dilemma
[u'http://mp.weixin.qq.com/s?__biz=MzA4MjEyNTA5Mw==&mid=209696436&idx=1&sn=8cb55b03c8b95586ba4498c64fa54513&3rd=MzA3MDU4NTYzMw==&scene=6#rd']
2015-07-28 21:14:08 [scrapy] INFO: Closing spider (finished)
2015-07-28 21:14:08 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 131181,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 7, 28, 13, 14, 8, 958071),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'log_count/WARNING': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2015, 7, 28, 13, 13, 55, 688111)}
2015-07-28 21:14:08 [scrapy] INFO: Spider closed (finished)
QThread: Destroyed while thread is still running
kevinflynndeMacBook-Pro:spiders kevinflynn$