A Crawler for 权律二 Stickers and Wallpapers from WeChat Official Accounts

Sogou's search engine can surface WeChat official-account articles. It had been a while since I last wrote a crawler, and after recently buying 崔大神's《Python网络爬虫开发实战》I felt the same enthusiasm as a year ago when I first learned scraping. Below is a small exercise that uses a few basic tools, namely requests-html, XPath, urllib.request, and regular expressions, to grab some stickers and wallpapers.

First, let's see what the results look like:


Here is the full source; with small changes it can crawl other content too.

import os
import urllib.request
import re
import ssl
from requests_html import HTMLSession

import time
from lxml import etree

# Disable HTTPS certificate verification globally (some image hosts fail verification)
ssl._create_default_https_context = ssl._create_unverified_context


def getData(url):
    # Pretend to be a browser by sending a desktop User-Agent header
    headers = ("User-Agent",
               "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0")
    opener = urllib.request.build_opener()
    opener.addheaders = [headers]
    # Install the opener globally so urlretrieve() below uses the same headers
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(url).read().decode("utf-8")
    return data

def getcontent(url):
    data = getData(url)
    # Regex that captures the sticker/wallpaper image URLs (WeChat lazy-loads images via data-src)
    stickerpat = '<img.*?data-src="(.*?)"'
    stickerlist = re.compile(stickerpat, re.S).findall(data)
    # Regex that captures the article title from the <h2> tag
    titlepat = '<h2.*?>(.*?)</h2>'
    title = re.compile(titlepat, re.S).findall(data)
    # Clean the title so it can be used as a directory name
    title = title[0].replace('\n', '').replace('|', '').strip()
    return stickerlist, title


def download(stickerlist, title):
    path = title
    number = 1
    for sticker in stickerlist:
        # Keep only GIFs and JPEGs; skip anything else
        if sticker.endswith('gif'):
            ext = '.gif'
        elif sticker.endswith('jpeg'):
            ext = '.jpeg'
        else:
            continue
        filename = os.path.join(path, str(number) + ext)
        print("Downloading:", filename)
        urllib.request.urlretrieve(sticker, filename=filename)
        time.sleep(1)  # be polite: pause between downloads
        number += 1

def createDir(title):
    # Create a directory named after the article title, if it does not exist yet
    if not os.path.exists(title):
        os.makedirs(title)
        print(title + ' created')
        return True
    return False

def getUrlList():
    session = HTMLSession()
    for page in range(1, 11):
        # Sogou's WeChat article search for the keyword 权律二 (URL-encoded in the query string)
        url = 'http://weixin.sogou.com/weixin?query=%E6%9D%83%E5%BE%8B%E4%BA%8C&_sug_type_=&sut=4989&lkt=1%2C1530759390068%2C1530759390068&s_from=input&_sug_=y&type=2&sst0=1530759390170&page='+str(page)+'&ie=utf8&w=01019900&dr=1'
        time.sleep(5)  # slow down to avoid Sogou's anti-crawler checks
        r = session.get(url)
        dom = r.html
        print(dom)
        # Each search result link on the page has the id sogou_vr_11002601_title_0 .. _9
        for i in range(10):
            try:
                result = dom.xpath('//*[@id="sogou_vr_11002601_title_' + str(i) + '"]//@href')
                time.sleep(5)
                print(i, result)
                stickerlist, title = getcontent(result[0])
                if createDir(title):
                    download(stickerlist, title)
            except Exception:
                # Skip results that fail to parse or download
                continue





if __name__ == '__main__':
    getUrlList()
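
To see what the two regexes in getcontent actually capture without touching the network, here is a minimal offline check against a hand-written HTML fragment (the fragment and the example.com URLs are invented for illustration; real article markup is messier):

import re

sample = '''<h2 class="rich_media_title"> 权律二表情包 | 第1期 </h2>
<img data-src="http://example.com/1.gif">
<img data-src="http://example.com/2.jpeg">'''

# Same patterns as getcontent()
stickers = re.compile('<img.*?data-src="(.*?)"', re.S).findall(sample)
title = re.compile('<h2.*?>(.*?)</h2>', re.S).findall(sample)[0]
title = title.replace('\n', '').replace('|', '').strip()
print(stickers)  # ['http://example.com/1.gif', 'http://example.com/2.jpeg']
print(title)     # 权律二表情包  第1期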


While we're at it, let's review some fundamentals; I'll refine them properly when the summer break comes.

Regular Expression Basics

Basics 1:
Global-match usage:	re.compile(pattern).findall(source_string)

Ordinary characters	match literally
\n			matches a newline
\t			matches a tab
\w			matches a letter, digit, or underscore
\W			matches any character except a letter, digit, or underscore
\d			matches a decimal digit
\D			matches any character except a decimal digit
\s			matches a whitespace character
\S			matches any character except whitespace
[ab89x]		character class: matches any single one of a, b, 8, 9, x
[^ab89x]		negated class: matches any single character other than a, b, 8, 9, x

Example 1:
Source string: "aliyunedu"
Pattern: "yu"
What does it match?	yu


Source string: '''aliyun
edu'''
Pattern: "yun\n"
What does it match?	yun\n

Source string: "aliyu89787nedu"
Pattern: "\w\d\w\d\d\w"
What does it match?	u89787


Source string: "aliyu89787nedu"
Pattern: "\w\d[nedu]\w"
What does it match?	87ne
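
These answers are easy to verify in an interpreter (raw strings keep the backslashes intact):

import re

print(re.compile("yu").findall("aliyunedu"))                  # ['yu']
print(re.compile("yun\n").findall("aliyun\nedu"))             # ['yun\n']
print(re.compile(r"\w\d\w\d\d\w").findall("aliyu89787nedu"))  # ['u89787']
print(re.compile(r"\w\d[nedu]\w").findall("aliyu89787nedu"))  # ['87ne']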


Basics 2:
.	matches any character except a newline
^	matches at the start of the string
$	matches at the end of the string
*	the preceding character appears 0 or more times
?	the preceding character appears 0 or 1 times
+	the preceding character appears 1 or more times
{n}	the preceding character appears exactly n times
{n,}	the preceding character appears at least n times
{n,m}	the preceding character appears at least n and at most m times
|	alternation: match either side
()	group: put parentheses around whatever you want to extract

Example 2:
Source string: '''aliyunnnnji87362387aoyubaidu'''

Pattern: "ali..."
What does it match?	aliyun

Pattern: "^li..."
What does it match?	nothing ([])

Pattern: "^ali..."
What does it match?	aliyun

Pattern: "bai..$"
What does it match?	baidu

Pattern: "ali.*"
What does it match?	aliyunnnnji87362387aoyubaidu
Tip: matching is greedy by default, i.e. it consumes as much as possible

Pattern: "aliyun+"
What does it match?	aliyunnnn

Pattern: "aliyun?"
What does it match?	aliyun

Pattern: "yun{1,2}"
What does it match?	yunn

Pattern: "^al(i..)."
What does it match?	iyu (findall returns only the parenthesized group)
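
Again, all of these can be checked in one go:

import re

s = "aliyunnnnji87362387aoyubaidu"
print(re.compile("ali...").findall(s))     # ['aliyun']
print(re.compile("^li...").findall(s))     # []
print(re.compile("^ali...").findall(s))    # ['aliyun']
print(re.compile("bai..$").findall(s))     # ['baidu']
print(re.compile("aliyun+").findall(s))    # ['aliyunnnn']
print(re.compile("aliyun?").findall(s))    # ['aliyun']
print(re.compile("yun{1,2}").findall(s))   # ['yunn']
print(re.compile("^al(i..).").findall(s))  # ['iyu']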

Basics 3:
Greedy mode: match as much as possible
Lazy mode: match as little as possible (more precise)

Greedy is the default.
The following combinations switch to lazy mode:
*?
+?

Example 3:
Source string: "poytphonyhjskjsa"
Pattern: "p.*y"
What does it match?	poytphony
Why?	Greedy by default, so .* runs to the last y

Source string: "poytphonyhjskjsa"
Pattern: "p.*?y"
What does it match?	['poy', 'phony']
Why?	Lazy mode stops at the nearest y each time
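
The difference shows up clearly in findall:

import re

s = "poytphonyhjskjsa"
print(re.compile("p.*y").findall(s))   # ['poytphony'] - greedy runs to the last y
print(re.compile("p.*?y").findall(s))  # ['poy', 'phony'] - lazy stops at each nearest y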

Basics 4:
Pattern modifiers: change the match result without changing the pattern itself

re.S		lets . match newlines too, so a pattern can span multiple lines
re.I		makes matching case-insensitive

Example 4:
Source string: "Python"
Pattern: "pyt"
Call: re.compile("pyt").findall("Python")
Result: []

Source string: "Python"
Pattern: "pyt"
Call: re.compile("pyt", re.I).findall("Python")
Result: Pyt

Source string: string="""我是阿里云大学
欢迎来学习
Python网络爬虫课程
"""
Pattern: pat="阿里.*?Python"
Call: re.compile(pat).findall(string)
Result: []

Source string: string="""我是阿里云大学
欢迎来学习
Python网络爬虫课程
"""
Pattern: pat="阿里.*?Python"
Call: re.compile(pat, re.S).findall(string)
Result: ['阿里云大学\n欢迎来学习\nPython']
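
Both modifiers in action:

import re

print(re.compile("pyt").findall("Python"))        # []
print(re.compile("pyt", re.I).findall("Python"))  # ['Pyt']

string = "我是阿里云大学\n欢迎来学习\nPython网络爬虫课程\n"
pat = "阿里.*?Python"
print(re.compile(pat).findall(string))            # [] - . cannot cross the newlines
print(re.compile(pat, re.S).findall(string))      # ['阿里云大学\n欢迎来学习\nPython']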

XPath Basics

/	extract level by level from the root
text()	extract the text under a tag
//tagname	extract every tag with the given name
//tagname[@attr='value']	extract tags whose attribute equals the given value
@attr	take the value of an attribute

<html>
<head>
<title>
主页
</title>
</head>
<body>
<p>abc</p>
<p>bbbvb</p>
<a href="//qd.alibaba.com/go/v/pcdetail" target="_top">安全推荐</a>
<a href="//qd.alibaba.com/go/v/pcdetail" target="_top">安全推荐2</a>
<div class="J_AsyncDC" data-type="dr">
    <div id="official-remind">明月几时有</div>
</div>
</body>
</html>

Work out what each of the following XPath expressions extracts (a runnable check follows the examples below):
/html/head/title/text()
//p/text()
//a
//div[@id='official-remind']/text()
//a/@href

Example:
Extract the title: /html/head/title/text()
Extract all div tags: //div
Extract the content of a <div class="tools"> tag: //div[@class='tools']/text()
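
To check the answers against the sample document above, a quick sketch using lxml (already imported by the crawler) runs each expression and prints the result:

from lxml import etree

doc = '''<html>
<head><title>主页</title></head>
<body>
<p>abc</p>
<p>bbbvb</p>
<a href="//qd.alibaba.com/go/v/pcdetail" target="_top">安全推荐</a>
<a href="//qd.alibaba.com/go/v/pcdetail" target="_top">安全推荐2</a>
<div class="J_AsyncDC" data-type="dr">
    <div id="official-remind">明月几时有</div>
</div>
</body>
</html>'''

tree = etree.HTML(doc)
print(tree.xpath('/html/head/title/text()'))              # ['主页']
print(tree.xpath('//p/text()'))                           # ['abc', 'bbbvb']
print(tree.xpath('//a'))                                  # two <a> Element objects
print(tree.xpath("//div[@id='official-remind']/text()"))  # ['明月几时有']
print(tree.xpath('//a/@href'))                            # ['//qd.alibaba.com/go/v/pcdetail', '//qd.alibaba.com/go/v/pcdetail']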


Requests-HTML Basics

Make a GET request to 'python.org', using Requests:

>>> from requests_html import HTMLSession
>>> session = HTMLSession()

>>> r = session.get('https://python.org/')
Grab a list of all links on the page, as-is (anchors excluded):

>>> r.html.links
{'//docs.python.org/3/tutorial/', '/about/apps/', 'https://github.com/python/pythondotorg/issues', '/accounts/login/', '/dev/peps/', '/about/legal/', '//docs.python.org/3/tutorial/introduction.html#lists', '/download/alternatives', 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html', '/download/other/', '/downloads/windows/', 'https://mail.python.org/mailman/listinfo/python-dev', '/doc/av', 'https://devguide.python.org/', '/about/success/#engineering', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org', '/about/gettingstarted/', 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html', '/success-stories/industrial-light-magic-runs-python/', 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator', '/', 'http://pyfound.blogspot.com/', '/events/python-events/past/', '/downloads/release/python-2714/', 'https://wiki.python.org/moin/PythonBooks', 'http://plus.google.com/+Python', 'https://wiki.python.org/moin/', 'https://status.python.org/', '/community/workshops/', '/community/lists/', 'http://buildbot.net/', '/community/awards', 'http://twitter.com/ThePSF', 'https://docs.python.org/3/license.html', '/psf/donations/', 'http://wiki.python.org/moin/Languages', '/dev/', '/events/python-user-group/', 'https://wiki.qt.io/PySide', '/community/sigs/', 'https://wiki.gnome.org/Projects/PyGObject', 'http://www.ansible.com', 'http://www.saltstack.com', 'http://planetpython.org/', '/events/python-events', '/about/help/', '/events/python-user-group/past/', '/about/success/', '/psf-landing/', '/about/apps', '/about/', 'http://www.wxpython.org/', '/events/python-user-group/665/', 'https://www.python.org/psf/codeofconduct/', '/dev/peps/peps.rss', '/downloads/source/', '/psf/sponsorship/sponsors/', 'http://bottlepy.org', 'http://roundup.sourceforge.net/', 'http://pandas.pydata.org/', 'http://brochure.getpython.info/', 'https://bugs.python.org/', '/community/merchandise/', 'http://tornadoweb.org', '/events/python-user-group/650/', 'http://flask.pocoo.org/', '/downloads/release/python-364/', '/events/python-user-group/660/', '/events/python-user-group/638/', '/psf/', '/doc/', 'http://blog.python.org', '/events/python-events/604/', '/about/success/#government', 'http://python.org/dev/peps/', 'https://docs.python.org', 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', '/users/membership/', '/about/success/#arts', 'https://wiki.python.org/moin/Python2orPython3', '/downloads/', '/jobs/', 'http://trac.edgewall.org/', 'http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html', '/privacy/', 'https://pypi.python.org/', 'http://www.riverbankcomputing.co.uk/software/pyqt/intro', 'http://www.scipy.org', '/community/forums/', '/about/success/#scientific', '/about/success/#software-development', '/shell/', '/accounts/signup/', 'http://www.facebook.com/pythonlang?fref=ts', '/community/', 'https://kivy.org/', '/about/quotes/', 'http://www.web2py.com/', '/community/logos/', '/community/diversity/', '/events/calendars/', 'https://wiki.python.org/moin/BeginnersGuide', '/success-stories/', '/doc/essays/', '/dev/core-mentorship/', 'http://ipython.org', '/events/', '//docs.python.org/3/tutorial/controlflow.html', '/about/success/#education', '/blogs/', '/community/irc/', 'http://pycon.blogspot.com/', '//jobs.python.org', 'http://www.pylonsproject.org/', 'http://www.djangoproject.com/', 
'/downloads/mac-osx/', '/about/success/#business', 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html', 'http://wiki.python.org/moin/TkInter', 'https://docs.python.org/faq/', '//docs.python.org/3/tutorial/controlflow.html#defining-functions'}
Grab a list of all links on the page, in absolute form (anchors excluded):

>>> r.html.absolute_links
{'https://github.com/python/pythondotorg/issues', 'https://docs.python.org/3/tutorial/', 'https://www.python.org/about/success/', 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html', 'https://www.python.org/dev/peps/', 'https://mail.python.org/mailman/listinfo/python-dev', 'https://www.python.org/doc/', 'https://www.python.org/', 'https://www.python.org/about/', 'https://www.python.org/events/python-events/past/', 'https://devguide.python.org/', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org', 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html', 'https://docs.python.org/3/tutorial/introduction.html#lists', 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator', 'http://pyfound.blogspot.com/', 'https://wiki.python.org/moin/PythonBooks', 'http://plus.google.com/+Python', 'https://wiki.python.org/moin/', 'https://www.python.org/events/python-events', 'https://status.python.org/', 'https://www.python.org/about/apps', 'https://www.python.org/downloads/release/python-2714/', 'https://www.python.org/psf/donations/', 'http://buildbot.net/', 'http://twitter.com/ThePSF', 'https://docs.python.org/3/license.html', 'http://wiki.python.org/moin/Languages', 'https://docs.python.org/faq/', 'https://jobs.python.org', 'https://www.python.org/about/success/#software-development', 'https://www.python.org/about/success/#education', 'https://www.python.org/community/logos/', 'https://www.python.org/doc/av', 'https://wiki.qt.io/PySide', 'https://www.python.org/events/python-user-group/660/', 'https://wiki.gnome.org/Projects/PyGObject', 'http://www.ansible.com', 'http://www.saltstack.com', 'https://www.python.org/dev/peps/peps.rss', 'http://planetpython.org/', 'https://www.python.org/events/python-user-group/past/', 'https://docs.python.org/3/tutorial/controlflow.html#defining-functions', 'https://www.python.org/community/diversity/', 'https://docs.python.org/3/tutorial/controlflow.html', 'https://www.python.org/community/awards', 'https://www.python.org/events/python-user-group/638/', 'https://www.python.org/about/legal/', 'https://www.python.org/dev/', 'https://www.python.org/download/alternatives', 'https://www.python.org/downloads/', 'https://www.python.org/community/lists/', 'http://www.wxpython.org/', 'https://www.python.org/about/success/#government', 'https://www.python.org/psf/', 'https://www.python.org/psf/codeofconduct/', 'http://bottlepy.org', 'http://roundup.sourceforge.net/', 'http://pandas.pydata.org/', 'http://brochure.getpython.info/', 'https://www.python.org/downloads/source/', 'https://bugs.python.org/', 'https://www.python.org/downloads/mac-osx/', 'https://www.python.org/about/help/', 'http://tornadoweb.org', 'http://flask.pocoo.org/', 'https://www.python.org/users/membership/', 'http://blog.python.org', 'https://www.python.org/privacy/', 'https://www.python.org/about/gettingstarted/', 'http://python.org/dev/peps/', 'https://www.python.org/about/apps/', 'https://docs.python.org', 'https://www.python.org/success-stories/', 'https://www.python.org/community/forums/', 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', 'https://www.python.org/community/merchandise/', 'https://www.python.org/about/success/#arts', 'https://wiki.python.org/moin/Python2orPython3', 'http://trac.edgewall.org/', 'http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html', 
'https://pypi.python.org/', 'https://www.python.org/events/python-user-group/650/', 'http://www.riverbankcomputing.co.uk/software/pyqt/intro', 'https://www.python.org/about/quotes/', 'https://www.python.org/downloads/windows/', 'https://www.python.org/events/calendars/', 'http://www.scipy.org', 'https://www.python.org/community/workshops/', 'https://www.python.org/blogs/', 'https://www.python.org/accounts/signup/', 'https://www.python.org/events/', 'https://kivy.org/', 'http://www.facebook.com/pythonlang?fref=ts', 'http://www.web2py.com/', 'https://www.python.org/psf/sponsorship/sponsors/', 'https://www.python.org/community/', 'https://www.python.org/download/other/', 'https://www.python.org/psf-landing/', 'https://www.python.org/events/python-user-group/665/', 'https://wiki.python.org/moin/BeginnersGuide', 'https://www.python.org/accounts/login/', 'https://www.python.org/downloads/release/python-364/', 'https://www.python.org/dev/core-mentorship/', 'https://www.python.org/about/success/#business', 'https://www.python.org/community/sigs/', 'https://www.python.org/events/python-user-group/', 'http://ipython.org', 'https://www.python.org/shell/', 'https://www.python.org/community/irc/', 'https://www.python.org/about/success/#engineering', 'http://www.pylonsproject.org/', 'http://pycon.blogspot.com/', 'https://www.python.org/about/success/#scientific', 'https://www.python.org/doc/essays/', 'http://www.djangoproject.com/', 'https://www.python.org/success-stories/industrial-light-magic-runs-python/', 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html', 'http://wiki.python.org/moin/TkInter', 'https://www.python.org/jobs/', 'https://www.python.org/events/python-events/604/'}
Select an element with a CSS Selector:

>>> about = r.html.find('#about', first=True)
Grab an element's text contents:

>>> print(about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure
Introspect an Element's attributes:

>>> about.attrs
{'id': 'about', 'class': ('tier-1', 'element-1'), 'aria-haspopup': 'true'}
Render out an Element's HTML:

>>> about.html
'<li aria-haspopup="true" class="tier-1 element-1 " id="about">\n<a class="" href="/about/" title="">About</a>\n<ul aria-hidden="true" class="subnav menu" role="menu">\n<li class="tier-2 element-1" role="treeitem"><a href="/about/apps/" title="">Applications</a></li>\n<li class="tier-2 element-2" role="treeitem"><a href="/about/quotes/" title="">Quotes</a></li>\n<li class="tier-2 element-3" role="treeitem"><a href="/about/gettingstarted/" title="">Getting Started</a></li>\n<li class="tier-2 element-4" role="treeitem"><a href="/about/help/" title="">Help</a></li>\n<li class="tier-2 element-5" role="treeitem"><a href="http://brochure.getpython.info/" title="">Python Brochure</a></li>\n</ul>\n</li>'
Select Elements within Elements:

>>> about.find('a')
[<Element 'a' href='/about/' title='' class=''>, <Element 'a' href='/about/apps/' title=''>, <Element 'a' href='/about/quotes/' title=''>, <Element 'a' href='/about/gettingstarted/' title=''>, <Element 'a' href='/about/help/' title=''>, <Element 'a' href='http://brochure.getpython.info/' title=''>]
Search for links within an element:

>>> about.absolute_links
{'http://brochure.getpython.info/', 'https://www.python.org/about/gettingstarted/', 'https://www.python.org/about/', 'https://www.python.org/about/quotes/', 'https://www.python.org/about/help/', 'https://www.python.org/about/apps/'}
Search for text on the page:

>>> r.html.search('Python is a {} language')[0]
programming
More complex CSS Selector example (copied from Chrome dev tools):

>>> r = session.get('https://github.com/')
>>> sel = 'body > div.application-main > div.jumbotron.jumbotron-codelines > div > div > div.col-md-7.text-center.text-md-left > p'

>>> print(r.html.find(sel, first=True).text)
GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside millions of other developers.
XPath is also supported:

>>> r.html.xpath('/html/body/div[1]/a')
[<Element 'a' class=('px-2', 'py-4', 'show-on-focus', 'js-skip-to-content') href='#start-of-content' tabindex='1'>]
JavaScript Support
Let's grab some text that's rendered by JavaScript:

>>> r = session.get('http://python-requests.org')

>>> r.html.render()

>>> r.html.search('Python 2 will retire in only {months} months!')['months']
'<time>25</time>'
Note, the first time you ever run the render() method, it will download Chromium into your home directory (e.g. ~/.pyppeteer/). This only happens once.

Using without Requests
You can also use this library without Requests:

>>> from requests_html import HTML
>>> doc = """<a href='https://httpbin.org'>"""

>>> html = HTML(html=doc)
>>> html.links
{'https://httpbin.org'}


Reposted from blog.csdn.net/sinat_33487968/article/details/80926654