Python Crawler Learning Diary, Part 2
冰冠 2018-06-14 08:59:27
Extracting data from a web page in order to do something with it is known as scraping.
1. Analyzing Web Page Data
The pages you want to scrape can be inspected with the various web developer tools, such as a browser's DevTools panel.
2. Three Web Scraping Approaches
2.1 Regular Expressions
For the details of regular expressions, see
https://blog.csdn.net/ice_cap1995/article/details/80694675
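As a minimal sketch of the regex approach (reusing the download helper from Diary 1 that the later snippets also use; the exact markup matched here is an assumption about the example page):

import re
from day01_crawl import crawling

url = 'http://example.webscraping.com/places/default/view/China-47'
html = crawling.download(url).decode('utf-8')

# pull the area value out of the places_area__row table row; the markup
# (<td class="w2p_fw">...</td>) is assumed from the example site
match = re.search(r'<tr id="places_area__row">.*?<td class="w2p_fw">(.*?)</td>',
                  html, re.DOTALL)
if match:
    print(match.group(1))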
2.2 Beautiful Soup
Beautiful Soup is a Python module for parsing web pages. Install it with:
pip install beautifulsoup4
Using Beautiful Soup takes two steps: (1) parse the downloaded HTML into a soup document; (2) search that document for the elements you want.
Beautiful Soup copes with broken HTML, correctly handling missing attribute quotes and unclosed tags.
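A quick illustration of that repair, as a minimal sketch (output abbreviated):

from bs4 import BeautifulSoup

# broken input: the class attribute is unquoted and the <li> tags never close
broken_html = '<ul class=country><li>Area<li>Population</ul>'
soup = BeautifulSoup(broken_html, 'lxml')
print(soup.prettify())
# the parsed tree has class="country" quoted and each <li> properly closed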
For example, extracting the area field from the example country page:

from day01_crawl import crawling
from bs4 import BeautifulSoup

url = 'http://example.webscraping.com/places/default/view/China-47'
html = crawling.download(url)
soup = BeautifulSoup(html, 'lxml')
tr = soup.find(attrs={'id': 'places_area__row'})  # the table row holding the area
td = tr.find(attrs={'class': 'w2p_fw'})           # the cell holding the value
area = td.text
print(area)
2.3 Lxml
Lxml is a Python wrapper around the C library libxml2. Because it is implemented in C, it parses faster than Beautiful Soup, but installation is also more involved; see the official instructions at http://lxml.de/installation.html
Like Beautiful Soup, lxml repairs broken HTML while parsing:

import lxml.html

broken_html = '<ul class=country><li>Area<li>Population</ul>'
tree = lxml.html.fromstring(broken_html)
fixed_html = lxml.html.tostring(tree, pretty_print=True)  # quotes added, <li> tags closed
print(fixed_html.decode('utf-8'))

Using lxml's CSS selectors:
import lxml.html
from day01_crawl import crawling

url = 'http://example.webscraping.com/places/default/view/China-47'
html = crawling.download(url)
tree = lxml.html.fromstring(html)
td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]  # select the value cell
area = td.text_content()
print(area)

Note: for CSS selector features that are not supported, see https://cssselect.readthedocs.io/en/latest/#id8
Common CSS selector patterns:
Select all tags: *
Select all <a> tags: a
Select all elements with class="link": .link
Select all <a> tags with class="link": a.link
Select the <a> tag with id="link": a#link
Select all <span> tags whose parent is an <a> tag: a > span
Select all <span> tags anywhere inside an <a> tag: a span
Select all <a> tags whose title attribute is "home": a[title=home]
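To see a few of these patterns in action, here is a small sketch over a made-up inline fragment:

import lxml.html

fragment = '<div><a id="link" class="link" title="home" href="/">Home<span>icon</span></a></div>'
tree = lxml.html.fromstring(fragment)

print(tree.cssselect('a.link')[0].get('href'))   # <a> with class="link"  -> '/'
print(tree.cssselect('a#link')[0].get('title'))  # <a> with id="link"     -> 'home'
print(tree.cssselect('a > span')[0].text)        # <span> child of an <a> -> 'icon'
print(tree.cssselect('a[title=home]')[0].text)   # <a> with title="home"  -> 'Home'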
2.4 Summary

Scraping approach     Performance   Ease of use   Installation difficulty
Regular expressions   Fast          Hard          Easy (built-in module)
BeautifulSoup         Slow          Easy          Easy (pure Python)
Lxml                  Fast          Easy          Relatively hard
In most cases lxml is the best choice: it is both fast and robust.
3. Adding a Scrape Callback to the Link Crawler
To reuse the crawler code we have already written on other sites, we add a callback, passed in as scrape_callback, which takes two parameters, url and html, and may return a list of further URLs to crawl.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
@author [email protected]
@function:
@create 18-6-14 10:48
"""
import re
import csv

import lxml.html

from day02_scraping.crawling import link_crawler


class ScrapeCallback:
    def __init__(self):
        self.fields = ('area', 'population', 'iso', 'country', 'capital',
                       'continent', 'tld', 'currency_code', 'currency_name',
                       'phone', 'postal_code_format', 'postal_code_regex',
                       'languages', 'neighbours')
        # start a fresh CSV with a header row
        self.write_2_file('countries.csv', self.fields, mode='w')

    def write_2_file(self, file_name, row, mode='a'):
        # append by default: opening with 'w' on every call would truncate
        # the file and keep only the last row written
        with open(file_name, mode, newline='') as csvfile:
            writer = csv.writer(csvfile, delimiter=' ', quotechar='|',
                                quoting=csv.QUOTE_MINIMAL)
            writer.writerow(row)

    # invoked when the instance is called as a function, i.e.
    # instance(url, html) is equivalent to instance.__call__(url, html)
    def __call__(self, url, html):
        if re.search('/view/', url):
            print(html.decode('utf-8'))  # debug: dump the downloaded page
            tree = lxml.html.fromstring(html)
            row = []
            for field in self.fields:
                row.append(tree.cssselect(
                    'table > tr#places_{0}__row > td.w2p_fw'.format(field))[0].text_content())
            self.write_2_file('countries.csv', row)


if __name__ == '__main__':
    link_crawler('http://example.webscraping.com/places/default/view/Zimbabwe-252',
                 '(.*?)/(index|view)', scrape_callback=ScrapeCallback())
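As a quick sanity check, the callback can also be invoked by hand on a single downloaded page (a sketch reusing the download helper from Diary 1):

from day01_crawl import crawling

callback = ScrapeCallback()
url = 'http://example.webscraping.com/places/default/view/China-47'
html = crawling.download(url)
callback(url, html)  # same as callback.__call__(url, html); appends one row to countries.csv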
Appendix: a working link-crawler example
https://github.com/ice1995/python_web_crawler-/tree/master/day02_scraping
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
@author [email protected]
@function:
@create 18-6-14 10:39
"""
import csv
import re
import time
import urllib.error
import urllib.parse
import urllib.request
import urllib.robotparser
from datetime import datetime

import lxml.html

FIELDS = ('area', 'population', 'iso', 'country', 'capital', 'continent',
          'tld', 'currency_code', 'currency_name', 'phone',
          'postal_code_format', 'postal_code_regex', 'languages', 'neighbours')


def link_crawler(seed_url, link_regex=None, delay=5, max_depth=-1, max_urls=-1,
                 headers=None, user_agent='wswp', proxy=None, num_retries=1,
                 scrape_callback=None):
    """Crawl from the given seed URL following links matched by link_regex"""
    # the queue of URLs that still need to be crawled
    crawl_queue = [seed_url]
    # the URLs that have been seen and at what depth
    seen = {seed_url: 0}
    # track how many URLs have been downloaded
    num_urls = 0
    rp = get_robots(seed_url)
    throttle = Throttle(delay)
    headers = headers or {}
    if user_agent:
        headers['User-agent'] = user_agent

    while crawl_queue:
        url = crawl_queue.pop()
        depth = seen[url]
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            throttle.wait(url)
            html = download(url, headers, proxy=proxy, num_retries=num_retries)
            links = []
            if scrape_callback:
                links.extend(scrape_callback(url, html) or [])

            if depth != max_depth:
                # can still crawl further
                if link_regex:
                    # filter for links matching our regular expression
                    links.extend(link for link in get_links(html)
                                 if re.match(link_regex, link))

                for link in links:
                    link = normalize(seed_url, link)
                    # check whether already crawled this link
                    if link not in seen:
                        seen[link] = depth + 1
                        # check link is within same domain
                        if same_domain(seed_url, link):
                            # success! add this new link to the queue
                            crawl_queue.append(link)

            # check whether we have reached the download maximum
            num_urls += 1
            if num_urls == max_urls:
                break
        else:
            print('Blocked by robots.txt:', url)


class Throttle:
    """Throttle downloading by sleeping between requests to the same domain"""
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        """Delay if this domain was accessed recently"""
        domain = urllib.parse.urlsplit(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (datetime.now() - last_accessed).seconds
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.domains[domain] = datetime.now()


def download(url, headers, proxy, num_retries, data=None):
    print('Downloading:', url)
    request = urllib.request.Request(url, data, headers)
    opener = urllib.request.build_opener()
    if proxy:
        proxy_params = {urllib.parse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib.request.ProxyHandler(proxy_params))
    try:
        response = opener.open(request)
        html = response.read()
        code = response.code
    except urllib.error.URLError as e:
        print('Download error:', e.reason)
        html = ''
        if hasattr(e, 'code'):
            code = e.code
            if num_retries > 0 and 500 <= code < 600:
                # retry 5XX HTTP errors
                html = download(url, headers, proxy, num_retries - 1, data)
        else:
            code = None
    return html


def normalize(seed_url, link):
    """Normalize this URL by removing the hash fragment and adding the domain"""
    link, _ = urllib.parse.urldefrag(link)  # remove hash to avoid duplicates
    return urllib.parse.urljoin(seed_url, link)


def same_domain(url1, url2):
    """Return True if both URLs belong to the same domain"""
    return urllib.parse.urlparse(url1).netloc == urllib.parse.urlparse(url2).netloc


def get_robots(url):
    """Initialize the robots parser for this domain"""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urllib.parse.urljoin(url, '/robots.txt'))
    rp.read()
    return rp


def get_links(html):
    """Return a list of links from html"""
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html.decode('utf-8'))


def scrape_callback(url, html):
    # function version of the callback: print the scraped row instead of saving it
    if re.search('/view/', url):
        tree = lxml.html.fromstring(html)
        row = [tree.cssselect('table > tr#places_%s__row > td.w2p_fw' % field)[0].text_content()
               for field in FIELDS]
        print(url, row)


class ScrapeCallback:
    def __init__(self):
        self.fields = ('area', 'population', 'iso', 'country', 'capital',
                       'continent', 'tld', 'currency_code', 'currency_name',
                       'phone', 'postal_code_format', 'postal_code_regex',
                       'languages', 'neighbours')
        self.write_2_file('countries.csv', self.fields)

    def write_2_file(self, file_name, row):
        with open(file_name, 'a+', newline='') as csvfile:
            writer = csv.writer(csvfile, delimiter=' ', quotechar='|',
                                quoting=csv.QUOTE_MINIMAL)
            writer.writerow(row)

    # invoked when the instance is called as a function, i.e.
    # instance(url, html) is equivalent to instance.__call__(url, html)
    def __call__(self, url, html):
        if re.search('/view/', url):
            tree = lxml.html.fromstring(html)
            row = []
            for field in self.fields:
                matches = tree.cssselect('table > tr#places_{0}__row > td.w2p_fw'.format(field))
                if len(matches) > 0:
                    row.append(matches[0].text_content())
            self.write_2_file('countries.csv', row)


if __name__ == '__main__':
    # link_crawler('http://example.webscraping.com', '/(index|view)', delay=0, num_retries=1, user_agent='BadCrawler')
    # link_crawler('http://example.webscraping.com', '/(index|view)', delay=0, num_retries=1, max_depth=1, user_agent='GoodCrawler')
    # link_crawler('http://example.webscraping.com/', '/(index|view)', scrape_callback=scrape_callback)
    link_crawler('http://example.webscraping.com/', '(.*?)/(index|view)',
                 scrape_callback=ScrapeCallback())