Crawling WeChat Public Account Information with Python

Sogou's WeChat search offers two kinds of keyword search: one searches the content of public account articles, and the other searches for public accounts directly. An account search returns the account's basic information along with its 10 most recently published articles. Today we'll use it to crawl WeChat public account information.

Crawler

Start from the Sogou WeChat homepage, where accounts can be grabbed by category; following the "More" link reveals the pagination URL pattern:

import requests as req
import re

def getUTF8(resp):
    # Decode the response body as UTF-8
    return resp.content.decode('utf-8', errors='ignore')

reTypes = r'id="pc_\d*" uigs="(pc_\d*)">([\s\S]*?)</a>'
Entry = "http://weixin.sogou.com/"
entryPage = req.get(Entry)
allTypes = re.findall(reTypes, getUTF8(entryPage))

for (pcid, category) in allTypes:
    for page in range(1, 100):
        url = 'http://weixin.sogou.com/pcindex/pc/{}/{}.html'.format(pcid, page)
        print(url)

        categoryList = req.get(url)
        if categoryList.status_code != 200:
            # No more pages in this category
            break

The code above walks through each category's paginated list pages; from each list page we then extract the links to the public account detail pages:

reProfile = r'<li id[\s\S]*?<a href="([\s\S]*?)"'
allProfiles = re.findall(reProfile, getUTF8(categoryList))
for profile in allProfiles:
    profilePage = req.get(profile)
    if profilePage.status_code != 200:
        continue

From the detail page we can obtain information such as the account's name, ID, description, account owner, avatar, QR code, and its 10 most recent articles.
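A minimal sketch of pulling a couple of these fields out of a detail page with regexes, in the same style as the list-page code. The HTML structure (and the sample snippet standing in for a fetched page) is an assumption for illustration; the real markup may differ and change over time.

```python
import re

# NOTE: these patterns are illustrative assumptions about the
# detail-page markup, not the page's actual structure.
reName = r'<strong class="profile_name[^"]*">([\s\S]*?)</strong>'
reWxid = r'微信号:\s*([-\w]+)'

# Hypothetical snippet standing in for getUTF8(profilePage)
sample = '''
<strong class="profile_name">Some Account</strong>
<p>微信号: some_account_id</p>
'''

name = re.search(reName, sample).group(1).strip()
wxid = re.search(reWxid, sample).group(1)
print(name, wxid)
```

In the real crawler you would run these searches against `getUTF8(profilePage)` instead of the sample string.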


Precautions

Details page link:http://mp.weixin.qq.com/profile?src=3&timestamp=1477208282&ver=1&signature=8rYJ4QV2w5FXSOy6vGn37sUdcSLa8uoyHv3Ft7CrhZhB4wO-bbWG94aUCNexyB7lqRNSazua-2MROwkV835ilg==

1. CAPTCHAs

Accessing the detail pages may trigger a CAPTCHA, and automatic CAPTCHA recognition is very hard, so the crawler should disguise itself as a normal browser.
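One common disguise is sending browser-like headers and pacing requests with a random delay; a minimal sketch (the header values and delay range are illustrative assumptions, not tuned values):

```python
import time
import random
import requests as req

def make_headers():
    # Hypothetical browser-like headers; values are illustrative.
    return {
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/55.0 Safari/537.36"),
        "Referer": "http://weixin.sogou.com/",
    }

def polite_get(url):
    # A random pause between requests lowers the chance of a CAPTCHA
    time.sleep(random.uniform(1.0, 3.0))
    return req.get(url, headers=make_headers())

if __name__ == "__main__":
    page = polite_get("http://weixin.sogou.com/")
    print(page.status_code)
```

`polite_get` can be dropped in wherever the crawler above calls `req.get`.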

2. Don't save detail page links

A detail page link carries two important parameters, timestamp and signature, which make the link time-sensitive: a saved link soon expires, so there is no point in persisting it.
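To see the two parameters concretely, you can pull them out of the query string of the example link above; a minimal sketch with the standard library:

```python
from urllib.parse import urlparse, parse_qs
import time

# The example detail-page link quoted earlier in this post
link = ("http://mp.weixin.qq.com/profile?src=3&timestamp=1477208282&ver=1"
        "&signature=8rYJ4QV2w5FXSOy6vGn37sUdcSLa8uoyHv3Ft7CrhZhB4wO-bbWG94a"
        "UCNexyB7lqRNSazua-2MROwkV835ilg==")

params = parse_qs(urlparse(link).query)
ts = int(params["timestamp"][0])

# How long ago the link was signed; once it is too old, the
# signature no longer validates and the link stops working.
age_seconds = time.time() - ts
print(ts, "signature" in params)
```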

3. QR codes

QR code image links are time-sensitive in the same way, so it is best to download the images right away.
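A small sketch of downloading the QR image immediately under a stable local name (the `qr/` directory and the account-id-based filename are assumptions for illustration):

```python
import os
import requests as req

def qr_path(account_id, out_dir="qr"):
    # A stable local filename derived from the account id,
    # since the remote QR URL itself expires.
    return os.path.join(out_dir, "{}.jpg".format(account_id))

def save_qr(qr_url, account_id, out_dir="qr"):
    os.makedirs(out_dir, exist_ok=True)
    resp = req.get(qr_url)
    path = qr_path(account_id, out_dir)
    with open(path, "wb") as f:
        f.write(resp.content)
    return path
```

`save_qr` would be called right after parsing the QR image URL out of the detail page, while the link is still valid.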

Show results with Flask

The Python community recently produced an async-enhanced framework in the spirit of Flask: Sanic. Built on uvloop and httptools, it is asynchronous and faster while keeping Flask's concise syntax. Although the project has only just started and many basic features are not yet implemented, it has already attracted a lot of attention (2,222 stars). I originally planned to build a simple interactive app on Sanic with the crawled WeChat public account data, but it has no template support yet, and the async Redis driver still has unresolved bugs, so after a quick try I switched back to Flask + SQLite to present the crawl results first; I'll update this later if I get the chance.

Installing Sanic

Debugging Sanic

Flask + SQLite App

from flask import g, Flask, render_template
import sqlite3

app = Flask(__name__)
DATABASE = "./db/wx.db"

def get_db():
    db = getattr(g, '_database', None)
    if db is None:
        db = g._database = sqlite3.connect(DATABASE)
    return db

@app.teardown_appcontext
def close_connection(exception):
    db = getattr(g, '_database', None)
    if db is not None:
        db.close()

@app.route("/<int:page>")
@app.route("/")
def hello(page=0):
    cur = get_db().cursor()
    cur.execute("SELECT * FROM wxoa LIMIT 30 OFFSET ?", (page*30, ))
    rows = cur.fetchall()
    return render_template("app.html", wx=rows, cp=page)

if __name__ == "__main__":
    app.run(debug=True, port=8000)
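The `wxoa` table's schema isn't shown in the original post; a hypothetical version that would work with the `SELECT` above might look like this (every column name here is an assumption, matching the fields listed in the crawling section):

```python
import sqlite3

# Hypothetical schema for the crawled accounts; the article's real
# column set is not shown and may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS wxoa (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT,
    wxid TEXT,
    intro TEXT,
    owner TEXT,
    avatar_path TEXT,
    qr_path TEXT
)
"""

def init_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn

conn = init_db()
conn.execute("INSERT INTO wxoa (name, wxid) VALUES (?, ?)",
             ("Some Account", "some_account_id"))
count = conn.execute("SELECT COUNT(*) FROM wxoa").fetchone()[0]
print(count)
```

For the Flask app above, you would pass the on-disk path (`./db/wx.db`) to `init_db` instead of the in-memory default used here for demonstration.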


Origin blog.csdn.net/weichen090909/article/details/91472762