Scraping WeChat IDs from Baidu Tieba Replies with Python (Using Plain HTTP Requests)

Author: 草小诚
When reposting, please credit the original: https://blog.csdn.net/cxcjoker7894/article/details/85685115

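A while ago my wife asked for a tool that collects the WeChat IDs posted in every reply to the recent threads of any given tieba (a Baidu Tieba forum), for use in her WeChat-based retail business; you get the idea. Some tieba exist purely for WeChat sellers to add each other or for customers to leave their WeChat IDs, and others serve a specific user group, so the audience is very well targeted. We both felt this would be more efficient than other ways of adding contacts, so a quick and convenient way to extract WeChat IDs would be quite valuable. (In hindsight, the conversion rate from WeChat ID to purchase was about 1%, which we were already happy with.)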

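The requirement was clear, so we got started right away, hoping to finish the tool that same evening. I opened my laptop and began looking into BeautifulSoup and Scrapy, while my wife compiled a list of target tieba and thought about keyword filtering and regular expressions for matching replies. Watching her click through the site, I noticed that Baidu Tieba's URLs and CSS class names are extremely regular, that pagination and feature buttons simply map to parameters in the GET request URL, and that nothing is obfuscated or encrypted. In that case a crawler framework is unnecessary: loop over plain HTTP requests and extract the WeChat IDs from the returned HTML.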

The examples below all use the “宝妈微信群” tieba:

1. Visit the tieba; the URL looks like this:

https://tieba.baidu.com/f?kw=宝妈微信群&ie=utf-8

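Other tieba follow the same format: the path is /f and the parameters are kw (keyword) and ie. kw is the tieba name, and ie is the encoding, which can be left alone. Since we only need recent threads, we do not page through the thread list.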
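
A minimal sketch of how such a URL can be assembled, using only the standard library. `build_index_url` is a hypothetical helper name introduced for illustration; it is not part of the script further down, which passes the same two parameters to `requests.get` instead.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

TIEBA_INDEX_URL = 'https://tieba.baidu.com/f'

def build_index_url(tieba_name):
    """Return the first-page URL of a tieba (hypothetical helper)."""
    # urlencode percent-encodes the Chinese name as UTF-8, matching ie=utf-8
    query = urlencode({'kw': tieba_name, 'ie': 'utf-8'})
    return TIEBA_INDEX_URL + '?' + query

url = build_index_url('宝妈微信群')
print(url)
```

The printed URL shows the tieba name in percent-encoded form; a browser displays the same address with the Chinese characters.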

2. Open a thread; the URL looks like this:

https://tieba.baidu.com/p/5981196309

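Compared with other threads, it really is very regular: /p/ denotes a thread, followed by the thread id, which is unique and never changes.
In the HTML of the thread list, an element whose markup contains ‘j_th_tit’ but not ‘threadlist_title’ is a thread link.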
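
The rule above can be sketched as follows. The sample line is a hypothetical reconstruction shaped like the markup the script further down expects, not a verbatim copy of Baidu's HTML, and `parse_thread_line` is an illustrative name. Note that this sketch uses regular expressions, whereas the full script splits the line on spaces and slices the pieces.

```python
import re

# Hypothetical sample line (assumed shape of a thread-link element)
sample_line = ('<a rel="noreferrer" href="/p/5981196309" '
               'title="宝妈互加" target="_blank" class="j_th_tit ">宝妈互加</a>')

LINK_RE = re.compile(r'href="(/p/\d+)"')
TITLE_RE = re.compile(r'title="([^"]*)"')

def parse_thread_line(line):
    """Return (url, title) for a thread-link line, or None otherwise."""
    # Keep lines with j_th_tit, but skip the wrapper that has threadlist_title
    if 'j_th_tit' not in line or 'threadlist_title' in line:
        return None
    link = LINK_RE.search(line)
    title = TITLE_RE.search(line)
    if not link or not title:
        return None
    return 'https://tieba.baidu.com' + link.group(1), title.group(1)

print(parse_thread_line(sample_line))
```

One reason to prefer a regular expression here is that a title containing spaces would still be captured whole, while splitting on spaces would cut it apart.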

3. What about pagination? Click page five:

https://tieba.baidu.com/p/5144424400?pn=5

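Bingo: adding a pn parameter takes you to page N. One catch: if pn exceeds the thread's last page, the last page is shown again, so for a five-page thread even pn=100 still shows page 5. When looping over pages, the loop therefore has to compare the page number displayed on the page with the previous one to tell whether it really advanced; if it did not, stop and move on to the next thread.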
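
The stopping rule can be sketched without any network access. Here `fetch_current_page` is a hypothetical callable standing in for "request page pn and read the page number the server actually displays", and `fake_fetch` simulates a five-page thread that clamps pn the way described above.

```python
def crawl_pages(fetch_current_page, max_requests=1000):
    """Request pn=1, 2, 3, ... until the displayed page stops advancing."""
    visited = []
    last_page = 0
    for _ in range(max_requests):  # hard cap as a safety net
        shown = fetch_current_page(last_page + 1)
        if shown == last_page:
            # The server showed the same page again: we are past the end
            break
        visited.append(shown)
        last_page = shown
    return visited

def fake_fetch(pn, total_pages=5):
    """Simulated server: any pn beyond the last page shows the last page."""
    return min(pn, total_pages)

print(crawl_pages(fake_fetch))  # [1, 2, 3, 4, 5]
```

The full script further down also has to handle threads that show no pager at all, which this sketch leaves out.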

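At this point all the HTML can be fetched. What remains is extracting the replies from it, which takes three steps:

Read the HTML line by line, split each line into elements, and use tag names and class names to pick out the useful HTML elements.
Remove noise by dropping the JavaScript snippets mixed into the markup.
Handle each useful element according to its type: extract WeChat IDs from replies, read the page number from the pager, and so on.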

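For example, the element with class="tP" shows the current page number, and elements containing ‘d_post_content j_d_post_content’ are replies. The HTML is large, so it is not reproduced here.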
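
A minimal sketch of that classification step. The two sample lines are hypothetical stand-ins for the real markup, and `classify_line` is an illustrative name; only the class names come from the text above.

```python
import re

TAG_RE = re.compile(r'<.*?>')

def classify_line(line):
    """Return ('page', n), ('reply', text) or None for one line of HTML."""
    # Strip all tags so only the visible text remains
    text = TAG_RE.sub('', line).strip()
    if 'class="tP"' in line:
        return ('page', int(text))
    if 'd_post_content j_d_post_content' in line:
        return ('reply', text)
    return None

# Hypothetical sample lines (assumed shapes, not copied from the site)
pager_line = '<span class="tP">3</span>'
reply_line = ('<div id="post_content_1" '
              'class="d_post_content j_d_post_content ">  加我 wx_abc123</div>')

print(classify_line(pager_line))
print(classify_line(reply_line))
```

Lines that match neither class, such as script fragments, come back as None and can simply be skipped.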

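I could not come up with a good rule for recognizing a WeChat ID, so the script simply filters out replies that contain no English letters or digits.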
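
As a rough illustration of that filter, the sketch below keeps a reply only if it contains at least one ASCII letter or digit. The sample replies are made up, and `may_contain_wechat_id` is a hypothetical name; the full script applies a similar test to each space-separated word instead of the whole reply.

```python
import string

ASCII_ALNUM = set(string.ascii_letters + string.digits)

def may_contain_wechat_id(reply):
    """True if the reply has at least one ASCII letter or digit."""
    return any(ch in ASCII_ALNUM for ch in reply)

# Made-up sample replies for illustration
replies = ['加我 wx_abc123', '谢谢楼主', '13800000000', '顶']
kept = [r for r in replies if may_contain_wechat_id(r)]
print(kept)  # ['加我 wx_abc123', '13800000000']
```

This is deliberately permissive: it lets phone numbers and unrelated English words through, so the output still needs a manual look.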

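Once the while True pagination loop is written, the job is done: write every valid reply that was scraped to a file, and you have all valid replies from every thread on the tieba's first page.

Here is a screenshot of the result; this tieba yielded about 2,000 entries:

Figure: WeChat IDs scraped from replies in the 宝妈微信群 tieba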

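The code follows; feel free to use it as a tool. The first three variables are:

tieba_name: the tieba name, without the trailing “吧” character.
tiezi_names: optional keywords for thread titles; if set, only threads whose title contains one of the keywords are scraped.
store_file_path: the file path where the results are stored.

Nothing else needs to change before use: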

#coding=utf-8
'''
@author: caoxiaocheng
'''
import requests
import re
import string
import time

tieba_name = '宝妈微信群'
tiezi_names = []
store_file_path = 'C:\\Users\\user\\Desktop\\贴吧爬取数据%s.txt' % str(int(time.time()))

def special_print(print_str, store_file=None):
    print('★★★★★★★★    ' + str(print_str), flush=True)
    if store_file:
        store_file.write('★★★★★★★★    ' + str(print_str)+'\n')

def tiezi_check(tiezi_names, tieba_title_line):
    if not tiezi_names:
        return True
    for tiezi_name in tiezi_names:
        if tiezi_name in tieba_title_line:
            return True
    return False

def word_check(word, check_words):
    # Drop words that contain known page text or script fragments
    for check_word in check_words:
        if check_word in word:
            return False
    # Drop words shaped like HH:MM timestamps
    if re.match(r'..:..', word):
        return False
    # Keep only words with at least one ASCII letter or digit
    for one_char in word:
        if one_char in string.digits or one_char in string.ascii_letters:
            return True
    return False
    
def get_resp_line(resp):
    resp_html = resp.content.decode('utf-8')
    resp_lines = resp_html.split('\n')
    return resp_lines

tieba_index_url = 'https://tieba.baidu.com/f'
tieba_detail_url = 'https://tieba.baidu.com'
tieba_index_params = {'kw': tieba_name,
                      'ie': 'utf-8',
                      }
detail_link_mapping = {}
check_words = ['该楼层疑似违规已被系统折叠',
               '送TA礼物',
               '收起回复',
               'function']

# Collect the links of all threads on the tieba's first page
special_print('Fetching thread list')
tieba_index_resp = requests.get(tieba_index_url, params=tieba_index_params)
tieba_index_lines = get_resp_line(tieba_index_resp)
for tieba_index_line in tieba_index_lines:
    if 'j_th_tit' in tieba_index_line and 'threadlist_title' not in tieba_index_line and tiezi_check(tiezi_names, tieba_index_line):
        title_line = tieba_index_line.strip()
        title_name = title_line.split(' ')[3][7:-1]
        title_link_tail = title_line.split(' ')[2][6:-1]
        title_link = tieba_detail_url + title_link_tail
        detail_link_mapping[title_link] = title_name
for detail_link in detail_link_mapping:
    print('Title: '+detail_link_mapping[detail_link])
    print('Link: '+detail_link)

# Walk through every thread and scrape all of its replies
with open(store_file_path, 'a', encoding='utf-8') as store_file:
    for detail_link in detail_link_mapping:
        title_name = detail_link_mapping[detail_link]
        special_print('Scraping thread: '+title_name, store_file)
        last_page = 0
        current_page = 1
        try:
            while True:
                special_print('Scraping page '+str(last_page+1), store_file)
                detail_link_resp = requests.get(detail_link+'?pn='+str(last_page+1))
                detail_link_lines = get_resp_line(detail_link_resp)
                more_page = False
                for detail_link_line in detail_link_lines:
                    detail_line = detail_link_line.strip()
                    # Strip HTML tags, leaving only the text
                    detail_line_fix = re.sub(r"<.*?>", "", detail_line)
                    # class="tP" holds the page number currently displayed
                    if 'class="tP"' in detail_link_line:
                        more_page = True
                        current_page = int(detail_line_fix.strip())
                        if current_page == last_page:
                            # pn went past the last page, so this thread is done
                            special_print('No more pages', store_file)
                            raise IndexError
                    # Reply bodies, excluding lines that carry topic_name
                    if 'd_post_content j_d_post_content' in detail_link_line and 'topic_name' not in detail_link_line:
                        # Drop everything from 'void' onward
                        if 'void' in detail_line_fix:
                            detail_line_fix = detail_line_fix[:detail_line_fix.index('void')]
                        detail_line_words = [x for x in detail_line_fix.split(' ') if x]
                        for detail_line_word in detail_line_words:
                            if word_check(detail_line_word, check_words):
                                print(detail_line_word)
                                store_file.write(detail_line_word+'\n')
                if not more_page:
                    # No pager found: the thread has only one page
                    special_print('No more pages', store_file)
                    raise IndexError
                last_page = current_page
        except IndexError:
            # IndexError is used only as a signal to move on to the next thread
            continue
