A small tool for fetching official-account articles

Automatically fetching WeChat official-account articles with Python

Foreword

The author wants to post an article on his official account every day, but doesn't have the energy or time to write one every day. So he set out to build a small tool that can publish an article each day with a few simple operations. This habit isn't exactly admirable, but it really is a lot more convenient.


Now you only need to enter a keyword and the number of articles, and the corresponding number of articles is saved locally in Word format. As you probably know, Word documents can be imported directly into an official-account post.

Doesn't this little gadget sound interesting?

Without further ado, let's get right to the tutorial!


1. Development environment

  • windows system

  • python 3.7

    Python libraries used:

    json, os, tkinter (for the GUI), pypandoc (note: this library requires pandoc to be installed), requests

    To get the pandoc installer, follow the official account Xiaolei Miaomiao and reply: pandoc or pdoc
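Since pypandoc fails if pandoc is missing, it can be worth checking up front whether a pandoc executable is reachable. A minimal stdlib-only sketch (the helper name is my own):

```python
import shutil

def pandoc_available() -> bool:
    """Return True if a pandoc executable is on the PATH."""
    return shutil.which("pandoc") is not None

print(pandoc_available())
```

`shutil.which` simply searches the PATH, so this check works the same on Windows and Linux.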


2. Development steps

2.1. Crawler part

This time the crawler is built as a standalone program. The crawling logic for this site is quite basic, so even complete beginners can follow along.

The workflow breaks down into four steps:

  1. Read the keyword and the number of articles from the console
  2. Send a request to the URL and extract the data
  3. Save the page source as an HTML file
  4. Convert the saved HTML file to a docx file

Import the external libraries:

```python
import json
import os
import pypandoc
import requests
```

Read the keyword:

```python
def main():
    serch_url = 'https://4l77k49qor-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(4.13.0)%3B%20Browser%20(lite)%3B%20docsearch%20(3.0.0)%3B%20docsearch-react%20(3.0.0)&x-algolia-api-key=0f8cb8d4dbe2581b6912018d4e33fb8d&x-algolia-application-id=4L77K49QOR'
    key = input('Enter the keyword to search for: ')
    try:
        num = int(input('How many articles to crawl: '))
    except Exception as e:
        print('Please enter a number', e)
        num = 1
    post_data = '{"requests":[{"query":"' + key + '","indexName":"mdnice","params":"attributesToRetrieve=%5B%22hierarchy.lvl0%22%2C%22hierarchy.lvl1%22%2C%22hierarchy.lvl2%22%2C%22hierarchy.lvl3%22%2C%22hierarchy.lvl4%22%2C%22hierarchy.lvl5%22%2C%22hierarchy.lvl6%22%2C%22content%22%2C%22type%22%2C%22url%22%5D&attributesToSnippet=%5B%22hierarchy.lvl1%3A10%22%2C%22hierarchy.lvl2%3A10%22%2C%22hierarchy.lvl3%3A10%22%2C%22hierarchy.lvl4%3A10%22%2C%22hierarchy.lvl5%3A10%22%2C%22hierarchy.lvl6%3A10%22%2C%22content%3A10%22%5D&snippetEllipsisText=%E2%80%A6&highlightPreTag=%3Cmark%3E&highlightPostTag=%3C%2Fmark%3E&hitsPerPage=20&clickAnalytics=true"}]}'
    post_data = post_data.encode('utf-8')
    run_(serch_url, post_data, num)  # call run_
```
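Note that main() builds post_data by concatenating the keyword into a JSON string, so a keyword containing a double quote would produce invalid JSON. A safer sketch using json.dumps, which escapes such characters (the params value is abbreviated here; the function name is my own):

```python
import json

def build_post_data(key: str, params: str) -> bytes:
    """Build the Algolia query payload; json.dumps escapes quotes in `key`."""
    payload = {"requests": [{"query": key, "indexName": "mdnice", "params": params}]}
    return json.dumps(payload).encode("utf-8")

data = build_post_data('he said "hi"', "hitsPerPage=20")
print(json.loads(data)["requests"][0]["query"])  # -> he said "hi"
```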

Send the POST request:

```python
def run_(serch_url, post_data, num=1):
    text = requests.post(url=serch_url, data=post_data)
    if text.status_code == 200:
        results = json.loads(text.text)['results'][0]['hits']
        if num > len(results):
            num = len(results)
        for r in results[:num]:
            file_name = r['hierarchy']['lvl1']
            url = r['url']
            html_name = download_html(file_name, url)
            to_docx(html_name)
    else:
        print('Request failed')
```
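The response parsing above can be exercised without the network. This sketch feeds run_'s extraction and clamping logic a toy response with the same field names the article relies on (the values are made up for illustration):

```python
import json

# A toy response shaped like the Algolia answer run_ parses.
sample = json.dumps({
    "results": [{"hits": [
        {"hierarchy": {"lvl1": "Intro"}, "url": "https://example.com/a"},
        {"hierarchy": {"lvl1": "Guide"}, "url": "https://example.com/b"},
    ]}]
})

hits = json.loads(sample)["results"][0]["hits"]
num = min(5, len(hits))          # clamp the request count, as run_ does
names = [h["hierarchy"]["lvl1"] for h in hits[:num]]
print(names)  # -> ['Intro', 'Guide']
```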

Crawl the data and save it as an HTML file:

```python
def download_html(filename, url):
    content = requests.get(url).text
    file_path = os.getcwd() + '\\file'
    if not os.path.exists(file_path):
        os.mkdir(file_path)
    filename = file_path + '\\' + filename + '.html'
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(content)
    return filename
```
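download_html joins paths with hard-coded backslashes, which only works on Windows (the article itself notes at the end that paths must be changed for Linux). A portable sketch using pathlib (the function name is my own):

```python
from pathlib import Path

def target_path(base: str, filename: str) -> Path:
    """Build <base>/file/<filename>.html portably (Windows and Linux)."""
    out_dir = Path(base) / "file"
    out_dir.mkdir(exist_ok=True)   # create the output folder if needed
    return out_dir / (filename + ".html")

print(target_path(".", "demo"))
```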

Convert the HTML file to docx

Tip: if pandoc is not installed, the program will raise an error, because pypandoc drives pandoc under the hood.

To get the pandoc installer, follow the official account Xiaolei Miaomiao and reply: Gadgets

You can also download it from Baidu yourself, though the download will be slower.

```python
def to_docx(html_name):
    new_name = html_name.split('.')[0]
    pypandoc.convert_file(html_name, 'docx', outputfile=f'{new_name}.docx')
```
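One caveat: to_docx strips the extension with html_name.split('.')[0], which truncates the name as soon as the article title itself contains a dot. os.path.splitext only removes the final extension and avoids this:

```python
import os

name = "python 3.7 notes.html"      # a title containing dots
print(name.split('.')[0])           # truncated: 'python 3'
print(os.path.splitext(name)[0])    # correct: 'python 3.7 notes'
```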

Call the main function:

```python
if __name__ == '__main__':
    main()
```

2.2. Wrapping the crawler with tkinter

This time we use Python's object-oriented programming; if you are not familiar with it, it is worth learning the basics first.

Import the external libraries:

```python
import json
import os
import tkinter as tk
from tkinter import messagebox  # GUI
import pypandoc
import requests
```

Create the class:

```python
class ToolGetArticle(tk.Tk):
    def __init__(self):
        super(ToolGetArticle, self).__init__()
        self.title('Article Fetcher')
        width, height = 300, 150
        screenwidth = self.winfo_screenwidth()
        screenheight = self.winfo_screenheight()
        size_geo = '%dx%d+%d+%d' % (width, height, (screenwidth - width) / 2, (screenheight - height) / 2)
        self.geometry(size_geo)
        # self.root_window.iconbitmap('C:/Users/Administrator/Desktop/favicon.ico')
        self["background"] = "#C9C9C9"
        # keyword to crawl
        self.mainkey = tk.StringVar()
        self.num = tk.IntVar()

    def add_kongjian(self):
        tk.Label(self, text="Keyword:").grid(row=0)
        tk.Label(self, text="Articles:").grid(row=1)
        self.e1 = tk.Entry(self)
        self.e2 = tk.Spinbox(self)
        self.e1.grid(row=0, column=1, padx=10, pady=5)
        self.e2.grid(row=1, column=1, padx=10, pady=5)
        tk.Button(self, text="Start", width=10, command=self.new_func).grid(row=3, column=0, sticky="w", padx=10, pady=5)
        tk.Button(self, text="Quit", width=10, command=self.quit).grid(row=3, column=1, sticky="e", padx=10, pady=5)

    def check_func(self):
        self.mainkey = self.e1.get()
        self.num = self.e2.get()
        # check that the article count is a number
        try:
            num = int(self.num)
            self.num1 = num
        except Exception as e:
            messagebox.showwarning(str(e), "Articles: please enter a number")
            self.e2.delete(0, tk.END)
            return False
        # check that a keyword was given
        if self.mainkey == '':
            messagebox.showwarning("Error", "Please enter a search keyword")
            self.e1.delete(0, tk.END)
            return False
        return True

    # ----------------- crawler -----------------
    def download_html(self, filename, url):
        content = requests.get(url).text
        file_path = os.getcwd() + '\\file'
        if not os.path.exists(file_path):
            os.mkdir(file_path)
        filename = file_path + '\\' + filename + '.html'
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(content)
        return filename

    def to_docx(self, html_name):
        new_name = html_name.split('.')[0]
        try:
            pypandoc.convert_file(html_name, 'docx', outputfile=f'{new_name}.docx')
        except Exception as e:
            print(e)
            messagebox.showwarning('Warning', 'Something went wrong, contact the admin')
        messagebox.showwarning('Success', 'Download finished; check the local "file" folder')
        print('Source URL:', self.url)
        print('File name:', new_name)

    def run_(self, serch_url, post_data, num=1):
        text = requests.post(url=serch_url, data=post_data)
        if text.status_code == 200:
            results = json.loads(text.text)['results'][0]['hits']
            if num > len(results):
                num = len(results)
            for r in results[:num]:
                file_name = r['hierarchy']['lvl1']
                self.url = r['url']
                html_name = self.download_html(file_name, self.url)
                self.to_docx(html_name)
        else:
            return False

    def main(self):
        serch_url = 'https://4l77k49qor-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(4.13.0)%3B%20Browser%20(lite)%3B%20docsearch%20(3.0.0)%3B%20docsearch-react%20(3.0.0)&x-algolia-api-key=0f8cb8d4dbe2581b6912018d4e33fb8d&x-algolia-application-id=4L77K49QOR'
        key = self.mainkey
        num = self.num1
        post_data = '{"requests":[{"query":"' + str(
            key) + '","indexName":"mdnice","params":"attributesToRetrieve=%5B%22hierarchy.lvl0%22%2C%22hierarchy.lvl1%22%2C%22hierarchy.lvl2%22%2C%22hierarchy.lvl3%22%2C%22hierarchy.lvl4%22%2C%22hierarchy.lvl5%22%2C%22hierarchy.lvl6%22%2C%22content%22%2C%22type%22%2C%22url%22%5D&attributesToSnippet=%5B%22hierarchy.lvl1%3A10%22%2C%22hierarchy.lvl2%3A10%22%2C%22hierarchy.lvl3%3A10%22%2C%22hierarchy.lvl4%3A10%22%2C%22hierarchy.lvl5%3A10%22%2C%22hierarchy.lvl6%3A10%22%2C%22content%3A10%22%5D&snippetEllipsisText=%E2%80%A6&highlightPreTag=%3Cmark%3E&highlightPostTag=%3C%2Fmark%3E&hitsPerPage=20&clickAnalytics=true"}]}'
        post_data = post_data.encode('utf-8')
        self.run_(serch_url, post_data, num)

    def new_func(self):
        if self.check_func():
            self.main()

    def run_main(self):
        self.mainloop()
```

The class inherits from tkinter.Tk, so it picks up all of the Tk class's methods and widgets, which makes working with the components more convenient.

Let's walk through it, starting with the constructor:

Constructor

The __init__() function

1. Set the size and position of the main window (centred on the screen):

```python
screenwidth = self.winfo_screenwidth()
screenheight = self.winfo_screenheight()
size_geo = '%dx%d+%d+%d' % (width, height, (screenwidth - width) / 2, (screenheight - height) / 2)
self.geometry(size_geo)
```

2. When tk widgets need to pass values dynamically, variables must be declared like this:

```python
self.mainkey = tk.StringVar()
self.num = tk.IntVar()
```
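The geometry string built above follows tkinter's 'WxH+X+Y' format, where X and Y offset the window so it sits centred. A standalone sketch of the same computation (the function name is my own; integer division replaces the original float division, which %d truncates anyway):

```python
def center_geometry(width, height, screen_w, screen_h):
    """Build tkinter's 'WxH+X+Y' geometry string for a centred window."""
    x = (screen_w - width) // 2
    y = (screen_h - height) // 2
    return '%dx%d+%d+%d' % (width, height, x, y)

print(center_geometry(300, 150, 1920, 1080))  # -> 300x150+810+465
```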
Adding the widgets

In the add_kongjian() function:

1. Add the Label widgets, plus an Entry and a Spinbox:

```python
tk.Label(self, text="Keyword:").grid(row=0)
tk.Label(self, text="Articles:").grid(row=1)
self.e1 = tk.Entry(self)
self.e2 = tk.Spinbox(self)
self.e1.grid(row=0, column=1, padx=10, pady=5)
self.e2.grid(row=1, column=1, padx=10, pady=5)
```

2. Add the Button widgets and bind their callbacks:

```python
tk.Button(self, text="Start", width=10, command=self.new_func).grid(row=3, column=0, sticky="w", padx=10, pady=5)
tk.Button(self, text="Quit", width=10, command=self.quit).grid(row=3, column=1, sticky="e", padx=10, pady=5)
```

Note that when laying out the widgets, grid options such as row and column handle the positioning.

The Start button is bound to a custom function, new_func().

Start crawling

Before starting the crawler, we need to check that the input meets the requirements. If it does, the crawler proceeds; if not, a warning dialog pops up.

1. The new_func function

It calls the check function:

```python
def new_func(self):
    if self.check_func():
        self.main()
```

2. The check_func function

```python
def check_func(self):
    self.mainkey = self.e1.get()
    self.num = self.e2.get()
    # check that the article count is a number
    try:
        num = int(self.num)
        self.num1 = num
    except Exception as e:
        messagebox.showwarning(str(e), "Articles: please enter a number")
        self.e2.delete(0, tk.END)
        return False
    # check that a keyword was given
    if self.mainkey == '':
        messagebox.showwarning("Error", "Please enter a search keyword")
        self.e1.delete(0, tk.END)
        return False
    return True
```
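check_func mixes the validation rules with GUI calls, which makes them hard to test. The same rules can be factored into a pure helper, a sketch under my own naming (the class itself keeps its messagebox behaviour):

```python
def validate(key: str, num_text: str):
    """check_func's rules as a pure function: key non-empty, num_text an integer.
    Returns (ok, num, error_message)."""
    try:
        num = int(num_text)
    except ValueError:
        return False, None, "number of articles must be an integer"
    if key == "":
        return False, None, "keyword must not be empty"
    return True, num, ""

print(validate("python", "3"))  # -> (True, 3, '')
print(validate("", "3"))        # -> (False, None, 'keyword must not be empty')
```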
3. main, the main crawler routine

The main difference from before is that the to_docx function now catches exceptions; if one occurs, a warning dialog pops up:

```python
def to_docx(self, html_name):
    new_name = html_name.split('.')[0]
    try:
        pypandoc.convert_file(html_name, 'docx', outputfile=f'{new_name}.docx')
    except Exception as e:
        print(e)
        messagebox.showwarning('Warning', 'Something went wrong, contact the admin')
    messagebox.showwarning('Success', 'Download finished; check the local "file" folder')
    print('Source URL:', self.url)
    print('File name:', new_name)
```
Let's run it:

```python
if __name__ == '__main__':
    tk1 = ToolGetArticle()
    tk1.add_kongjian()
    tk1.run_main()
```


If it runs successfully, the tool window pops up. Interesting, isn't it? Give it a try.

When running the code on Linux, you need to adjust the file paths (the code above uses Windows-style backslashes).

As for usage: after a successful run, a folder named file is created in the program's working directory, and the fetched HTML files and converted docx files are stored there. Go try it!



Origin: blog.csdn.net/qq_64047342/article/details/127480315