Say no to inefficiency! Use Python to crawl a WeChat Official Account's article titles and links

This article first appeared in the WeChat Official Account "Python Knowledge Circle". For reprints, please contact the author through the account for authorization.

Foreword

The previous article compiled navigation links to all of the account's articles. Doing that by hand is very laborious, because when editing a WeChat article you can only insert article links one at a time through a selection box.

Faced with several hundred articles, picking them out one by one is a real chore.

As a Pythoner, the author (PK哥) naturally refuses to work that inefficiently, so we use a crawler to extract the article titles, links, and other information instead.

Packet capture

We need to capture packets to find the request URL of the official account's article list. Packet-capture setup was covered in an earlier article on the preparation needed before crawling an APP with Python; here PK哥 captures the account's article list directly from the WeChat PC client, which is even simpler.

Take Charles as the packet-capture tool, for example. Make sure the option that allows capturing this computer's requests is checked; it usually is by default.

To filter out unrelated requests, set the domain we want to crawl in the filter box at the bottom left.

After opening WeChat on the PC and bringing up the article list of the "Python Knowledge Circle" official account, Charles captures a large number of requests. We need to find the one whose JSON response contains the article titles, digests, links, and other details under comm_msg_info.

Those are the entries returned by the request; the request URL itself can be read from Charles's Overview panel.
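For orientation, here is a rough sketch of the response nesting, limited to the fields this article actually touches; all values are illustrative, and note that general_msg_list arrives as a JSON string that is decoded a second time in the code further down.

# Approximate shape of the captured response; only the fields used later are shown.
captured_response = {
    'ret': 0,                    # 0 means the request succeeded
    'msg_count': 10,             # number of articles in this page
    'general_msg_list':          # a JSON *string*, decoded again with json.loads
        '{"list": [{"comm_msg_info": {...},'
        ' "app_msg_ext_info": {"title": "...", "digest": "...", "content_url": "..."}}]}',
}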

With all of this information from the capture, we can write a crawler to fetch every article's information and save it.

Initialization function

Scroll up through the account's history article list: as more articles load, only the offset parameter in the request link changes. So we write an initialization function that sets a proxy IP and the request headers; the headers include User-Agent, Cookie, and Referer.

All of these values can be read from the packet-capture tool.
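To make that concrete, here is a minimal sketch of such an initialization function. The class name WeChatArticleSpider, the request URL, the placeholder header values, and the proxy address are all stand-ins of mine; base_url, Cookie, Referer, and User-Agent must be copied from your own Charles capture. The imports also cover the request_data() and parse_data() methods shown below.

import csv
import json
import random
import time

import requests


class WeChatArticleSpider:
    """Hypothetical wrapper class; only `offset` changes between requests."""

    def __init__(self):
        # Article-list URL from the capture; `{}` is filled with the offset.
        # The real URL and query string are much longer -- copy them from Charles.
        self.base_url = 'https://example.invalid/article-list?offset={}'  # placeholder
        self.offset = 0
        # Header values must come from the captured request, or the API rejects us.
        self.headers = {
            'User-Agent': 'Mozilla/5.0 ...',   # placeholder
            'Cookie': '...',                   # copy from Charles
            'Referer': '...',                  # copy from Charles
        }
        # Optional proxy IP, in the format the requests library expects.
        self.proxy = {'http': 'http://127.0.0.1:8888',
                      'https': 'http://127.0.0.1:8888'}   # placeholder address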

Request data

With the request link worked out from the packet capture, we can issue the request with the requests library and check whether the status code is 200; a 200 means the response came back normally. We then build a parse_data() function to parse the response and pull out the information we need.

def request_data(self):
    try:
        # Request one page of the article list; `offset` selects the page.
        response = requests.get(self.base_url.format(self.offset), headers=self.headers, proxies=self.proxy)
        print(self.base_url.format(self.offset))
        if 200 == response.status_code:
            self.parse_data(response.text)
    except Exception as e:
        print(e)
        time.sleep(2)

Extract data

Analyzing the returned JSON data shows that everything we need sits under app_msg_ext_info.

We parse the returned JSON with json.loads and save the columns we need to a CSV file: title, digest, and article link. You can add other fields of your own as well.

def parse_data(self, responseData):
    all_datas = json.loads(responseData)
    if 0 == all_datas['ret'] and all_datas['msg_count'] > 0:
        # general_msg_list is itself a JSON string and needs a second decode
        summy_datas = all_datas['general_msg_list']
        datas = json.loads(summy_datas)['list']
        a = []
        for data in datas:
            try:
                title = data['app_msg_ext_info']['title']
                title_child = data['app_msg_ext_info']['digest']
                article_url = data['app_msg_ext_info']['content_url']
                info = {}
                info['标题'] = title
                info['小标题'] = title_child
                info['文章链接'] = article_url
                a.append(info)
            except Exception as e:
                print(e)
                continue

        print('正在写入文件')  # "writing to file"
        with open('Python公众号文章合集1.csv', 'a', newline='', encoding='utf-8') as f:
            fieldnames = ['标题', '小标题', '文章链接']  # controls the column order
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            if f.tell() == 0:          # write the header only once, when the file is empty
                writer.writeheader()
            writer.writerows(a)
            print("写入成功")  # "write successful"

        print('----------------------------------------')
        time.sleep(random.randint(2, 5))   # random pause between pages
        self.offset = self.offset + 10     # move on to the next page of 10 articles
        self.request_data()                # recurse until no more articles remain
    else:
        print('抓取数据完毕!')  # "finished crawling"

This way, the crawled results are saved in CSV format.
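Assuming the two methods above live on the hypothetical WeChatArticleSpider class sketched in the initialization section, kicking off the whole crawl is just:

if __name__ == '__main__':
    spider = WeChatArticleSpider()   # hypothetical class name from the earlier sketch
    spider.request_data()            # pages through the list until no articles remain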

When you run the code you may hit an SSLError; the quickest workaround is to drop the "s" from the https at the start of base_url and run it again.

Saving the links in Markdown format

Anyone who writes regularly knows that articles are usually written in Markdown, so the formatting stays the same no matter which platform they are published on.

In Markdown a link is written as [article title](article URL), so we only need to save one more column. Since the title and the article link are already extracted, building the Markdown URL is simple:

md_url = '[{}]'.format(title) + '({})'.format(article_url)
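A sketch of how this could be wired into parse_data() above: inside the for-loop, store the value under one more key, and add that key to fieldnames (the column name 'md链接' is the one referred to in the next paragraph).

# inside the for-loop of parse_data(), after building md_url:
info['md链接'] = md_url

# and when writing the CSV, add the new column:
fieldnames = ['标题', '小标题', '文章链接', 'md链接']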

When the crawl finishes, the result looks like this.

Just paste the whole md链接 column into a Markdown-format note; most note-taking apps can create Markdown files.

With that, turning these navigation links into an organized index is just a matter of categorizing them.

Have you used Python to solve small problems in everyday life? Feel free to leave a comment and discuss.

Origin www.cnblogs.com/pythoncircle/p/12297215.html