Reject inefficiency: use Python to crawl official account article titles and links

Foreword

The previous article organized navigation links to all of the official account's articles. Doing that by hand is very laborious, because when you add articles to an official account page you can only pick one article at a time (it is a single-choice selector).

 

With hundreds of articles, picking them one by one is a chore.

As a Pythoner, pk certainly won't work that inefficiently. Instead, we use a crawler to extract information such as each article's title and link.

Packet capture

We need to capture packets to find the URL that requests the official account's articles; for setup details, refer to the packet capture article written earlier.

To filter out irrelevant requests, set the domain name we want to crawl in the lower-left corner of the capture tool.

Open WeChat on the PC and browse the article list of the "Python Knowledge Circle" official account. Charles captures a large number of requests; among them we find the one we need. The returned JSON contains the title, abstract, link and other information for each article, all under comm_msg_info.

This is what comes back after requesting the link, and we can see the request URL itself in the Overview tab.

With this information from the packet capture, we can write a crawler to fetch all of the article information and save it.

Initialization function

Scrolling up through the official account's history list to load more articles, we find that only the offset parameter in the link changes. So we create an initialization function that stores the proxy IP, the request headers and the other request information. The headers contain User-Agent, Cookie and Referer.

This information can be seen in the packet capture tool.
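A minimal sketch of such an initialization, assuming a hypothetical class name WechatArticleSpider and placeholder values for the URL, Cookie, Referer and proxy (substitute the ones from your own capture); the request_data and parse_data methods shown below then live inside this class:

import csv
import json
import random
import time

import requests


class WechatArticleSpider:
    def __init__(self):
        # request URL taken from the packet capture; only the offset parameter changes
        self.base_url = 'https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&offset={}'  # placeholder
        self.headers = {
            'User-Agent': 'Mozilla/5.0 ...',            # copy from the capture tool
            'Cookie': 'your-cookie-here',               # copy from the capture tool
            'Referer': 'https://mp.weixin.qq.com/...',  # copy from the capture tool
        }
        self.proxy = {'https': 'http://127.0.0.1:8888'}  # proxy IP, placeholder
        self.offset = 0  # paging offset, increased by 10 after each page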

Request data

Having captured and analyzed the request link, we can use the requests library to make the request and check whether the status code is 200. If it is, the response is normal, and we pass it to a parse_data() function that extracts the fields we need from the returned message.

    def request_data(self):
        # request one page of the article list; self.offset controls paging
        try:
            response = requests.get(self.base_url.format(self.offset),
                                    headers=self.headers, proxies=self.proxy)
            print(self.base_url.format(self.offset))
            if 200 == response.status_code:
                self.parse_data(response.text)
        except Exception as e:
            print(e)
            time.sleep(2)
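A hedged sketch of how the crawler could be started once both methods are in place (using the placeholder class name from the sketch above):

if __name__ == '__main__':
    spider = WechatArticleSpider()
    spider.request_data()  # pages through the history via self.offset until no more articles are returned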

Extract data

By analyzing the returned JSON data, we can see that the data we need sits under app_msg_ext_info.

We use json.loads to parse the returned JSON and save the columns we need to a CSV file. There are three columns: title, abstract and article link; you can add other fields yourself.

    def parse_data(self, responseData):
        # parse one page of results and append the fields we need to a CSV file
        all_datas = json.loads(responseData)
        if 0 == all_datas['ret'] and all_datas['msg_count'] > 0:
            summy_datas = all_datas['general_msg_list']  # this field is itself a JSON string
            datas = json.loads(summy_datas)['list']
            a = []
            for data in datas:
                try:
                    title = data['app_msg_ext_info']['title']
                    title_child = data['app_msg_ext_info']['digest']       # abstract
                    article_url = data['app_msg_ext_info']['content_url']  # article link
                    info = {}
                    info['标题'] = title            # title
                    info['小标题'] = title_child    # abstract
                    info['文章链接'] = article_url  # article link
                    a.append(info)
                except Exception as e:
                    print(e)
                    continue

            print('正在写入文件')  # writing to file
            with open('Python公众号文章合集1.csv', 'a', newline='', encoding='utf-8') as f:
                fieldnames = ['标题', '小标题', '文章链接']  # controls column order
                writer = csv.DictWriter(f, fieldnames=fieldnames)
                writer.writeheader()  # note: in append mode this writes a header row for every batch
                writer.writerows(a)
                print("写入成功")  # written successfully

            print('----------------------------------------')
            time.sleep(random.randint(2, 5))  # random pause between requests
            self.offset = self.offset + 10
            self.request_data()  # fetch the next page
        else:
            print('抓取数据完毕!')  # finished crawling

In this way, the crawled results are saved in CSV format.

When running the code you may hit an SSLError. The quickest workaround is to change https to http in base_url and run it again.
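If you would rather keep HTTPS, an alternative workaround (not from the original article) is to skip certificate verification for this request:

import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)  # silence the insecure-request warning
response = requests.get(self.base_url.format(self.offset),
                        headers=self.headers, proxies=self.proxy, verify=False)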

Save links in Markdown format

People who write articles often will know that text is generally written in Markdown, so that no matter which platform the article is posted on, its formatting stays the same.

In Markdown a link is written as [article title](article url), so we just need to add one more column when saving the data: since we already have the title and the article link, building the Markdown-format URL is simple, as shown below.

md_url = '[{}]({})'.format(title, article_url)
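A sketch of how this extra column could be wired into parse_data (the column name 'md链接' is only an illustrative choice):

                    # inside the for loop of parse_data
                    md_url = '[{}]({})'.format(title, article_url)
                    info['md链接'] = md_url  # extra column holding the Markdown-formatted link

            # and extend the CSV column list accordingly
            fieldnames = ['标题', '小标题', '文章链接', 'md链接']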

After the crawling is completed, the effect is as follows.

We then paste the whole column of Markdown links into our notes. Most note-taking software can create new files in Markdown format.

In this way, these navigation article links are organized into categories.

Have you ever used Python to solve a small problem in your life? Feel free to leave a message and discuss.

