Automatically fetching official account articles with Python
Foreword
The author wants to post an article to his official account every day, but he doesn't want to write one every day and lacks the time and energy to do so. So I wrote a small tool that can publish an article each day with a few simple operations. This is not exactly good behavior, but it certainly is a lot more convenient.
See the picture:
Now you only need to enter a keyword and the number of articles, and that many articles are saved locally as Word documents. Needless to say, Word documents can be converted directly into an official account post.
Doesn't this little gadget sound interesting?
Without further ado, let's get right to the tutorial!
1. Development environment
- Windows system
- Python 3.7
The Python libraries used are:
json, os, tkinter (for the GUI), pypandoc (note: this library requires pandoc itself to be installed), requests
To get the pandoc installation package, follow the official account Xiaolei Miaomiao and reply: pandoc or pdoc
2. Development steps
2.1. Reptile part
This time, the crawler is developed as an independent project. The crawler program of this website is relatively basic. 0 basics can also be learned.
step:
Mainly divided into the following 4 steps
- Keyword and number of articles entered in the console
- Initiate a request to the url link and extract the data
- Save the data source code in html format
- Convert the saved html file to a docx file
Import the external libraries
```python
import json
import os

import pypandoc
import requests
```
Enter the keyword
```python
def main():
    serch_url = 'https://4l77k49qor-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(4.13.0)%3B%20Browser%20(lite)%3B%20docsearch%20(3.0.0)%3B%20docsearch-react%20(3.0.0)&x-algolia-api-key=0f8cb8d4dbe2581b6912018d4e33fb8d&x-algolia-application-id=4L77K49QOR'
    key = input('请输入要搜索的关键词:')  # prompt for the search keyword
    try:
        num = int(input('请输入你需要爬取几篇:'))  # number of articles to fetch
    except Exception as e:
        print('请正常输入好吧', e)  # invalid input, fall back to 1
        num = 1
    post_data = '{"requests":[{"query":"' + key + '","indexName":"mdnice","params":"attributesToRetrieve=%5B%22hierarchy.lvl0%22%2C%22hierarchy.lvl1%22%2C%22hierarchy.lvl2%22%2C%22hierarchy.lvl3%22%2C%22hierarchy.lvl4%22%2C%22hierarchy.lvl5%22%2C%22hierarchy.lvl6%22%2C%22content%22%2C%22type%22%2C%22url%22%5D&attributesToSnippet=%5B%22hierarchy.lvl1%3A10%22%2C%22hierarchy.lvl2%3A10%22%2C%22hierarchy.lvl3%3A10%22%2C%22hierarchy.lvl4%3A10%22%2C%22hierarchy.lvl5%3A10%22%2C%22hierarchy.lvl6%3A10%22%2C%22content%3A10%22%5D&snippetEllipsisText=%E2%80%A6&highlightPreTag=%3Cmark%3E&highlightPostTag=%3C%2Fmark%3E&hitsPerPage=20&clickAnalytics=true"}]}'
    post_data = post_data.encode('utf-8')
    run_(serch_url, post_data, num)  # call the run_ method
```
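Note that the payload above is assembled by string concatenation, which produces invalid JSON if the keyword contains a quote. As a side sketch (not part of the original program; the `params` value is shortened to a placeholder here), the same body can be built safely with `json.dumps`:

```python
import json

def build_post_data(key, params):
    # Build the Algolia query body as a dict, then serialize it;
    # json.dumps escapes any special characters in the keyword.
    body = {"requests": [{"query": key, "indexName": "mdnice", "params": params}]}
    return json.dumps(body).encode('utf-8')

# even a keyword containing a quote stays valid JSON
data = build_post_data('py"thon', 'hitsPerPage=20')
```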
Initiate the POST request
```python
def run_(serch_url, post_data, num=1):
    text = requests.post(url=serch_url, data=post_data)
    if text.status_code == 200:
        results = json.loads(text.text)['results'][0]['hits']
        if num > len(results):
            num = len(results)  # cap at however many results came back
        for r in results[:num]:
            file_name = r['hierarchy']['lvl1']
            url = r['url']
            html_name = download_html(file_name, url)
            to_docx(html_name)
    else:
        print('链接失效')  # the link is no longer valid
```
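For reference, the indexing in run_ assumes the response JSON has roughly the shape below. The sample is hand-made for illustration only; a real Algolia response carries many more fields:

```python
import json

# Hand-made sample mimicking only the fields the crawler reads
sample = '{"results": [{"hits": [{"hierarchy": {"lvl1": "Sample title"}, "url": "https://example.com/a"}]}]}'
hits = json.loads(sample)['results'][0]['hits']
for r in hits:
    print(r['hierarchy']['lvl1'], r['url'])
```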
Crawl the data and save it as an HTML file
```python
def download_html(filename, url):
    content = requests.get(url).text
    file_path = os.getcwd() + '\\file'
    if not os.path.exists(file_path):
        os.mkdir(file_path)
    filename = file_path + '\\' + filename + '.html'
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(content)
    return filename
```
Convert the HTML file to docx
Tip: if pandoc is not installed, the program will raise an error, because pypandoc works by invoking pandoc. To get the pandoc installation package, follow the official account Xiaolei Miaomiao and reply: Gadgets. You can also download it from Baidu yourself, but the download will be slower.
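Before converting, you can check whether the pandoc binary is actually reachable. This is a minimal standard-library sketch (pypandoc also ships a `download_pandoc()` helper if you prefer to fetch pandoc programmatically):

```python
import shutil

def pandoc_available():
    # pypandoc shells out to the pandoc executable, so it must be on PATH
    return shutil.which('pandoc') is not None

if not pandoc_available():
    print('pandoc not found; install it before running the converter')
```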
```python
def to_docx(html_name):
    new_name = html_name.split('.')[0]
    pypandoc.convert_file(html_name, 'docx', outputfile=f'{new_name}.docx')
```
Call the main function
```python
if __name__ == '__main__':
    main()
```
2.2. Wrap the crawler program with tkinter
This time we use Python's object-oriented programming; if you are not familiar with it, learn the basics first.
Import the external libraries
```python
import json
import os
import tkinter as tk
from tkinter import messagebox  # GUI dialogs

import pypandoc
import requests
```
Create the class
```python
class ToolGetArticle(tk.Tk):
    def __init__(self):
        super(ToolGetArticle, self).__init__()
        self.title('获取文章工具')
        width, height = 300, 150
        screenwidth = self.winfo_screenwidth()
        screenheight = self.winfo_screenheight()
        size_geo = '%dx%d+%d+%d' % (width, height, (screenwidth - width) / 2, (screenheight - height) / 2)
        self.geometry(size_geo)
        # self.root_window.iconbitmap('C:/Users/Administrator/Desktop/favicon.ico')
        self["background"] = "#C9C9C9"
        # the keyword to crawl
        self.mainkey = tk.StringVar()
        self.num = tk.IntVar()

    def add_kongjian(self):
        tk.Label(self, text="爬取关键字:").grid(row=0)
        tk.Label(self, text="篇数:").grid(row=1)
        self.e1 = tk.Entry(self)
        self.e2 = tk.Spinbox(self)
        self.e1.grid(row=0, column=1, padx=10, pady=5)
        self.e2.grid(row=1, column=1, padx=10, pady=5)
        tk.Button(self, text="开始", width=10, command=self.new_func).grid(row=3, column=0, sticky="w", padx=10, pady=5)
        tk.Button(self, text="退出", width=10, command=self.quit).grid(row=3, column=1, sticky="e", padx=10, pady=5)

    def check_func(self):
        self.mainkey = self.e1.get()
        self.num = self.e2.get()
        # check that the article count is a number
        try:
            num = int(self.num)
            self.num1 = num
        except Exception as e:
            messagebox.showwarning(str(e), "篇数:写个数字吧")
            self.e2.delete(0, tk.END)
            return False
        # check that a keyword was entered
        if self.mainkey == '':
            messagebox.showwarning("错误", "搜索关键词要写啊")
            self.e1.delete(0, tk.END)
            return False
        return True

    # ---------------- the crawler ----------------
    def download_html(self, filename, url):
        content = requests.get(url).text
        file_path = os.getcwd() + '\\file'
        if not os.path.exists(file_path):
            os.mkdir(file_path)
        filename = file_path + '\\' + filename + '.html'
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(content)
        return filename

    def to_docx(self, html_name):
        new_name = html_name.split('.')[0]
        try:
            pypandoc.convert_file(html_name, 'docx', outputfile=f'{new_name}.docx')
        except Exception as e:
            print(e)
            messagebox.showwarning('警告', '出错了联系管理员')
        messagebox.showwarning('成功', '下载成功查看本地同级文件夹file')
        print('原网址:', self.url)
        print('文件名:', new_name)

    def run_(self, serch_url, post_data, num=1):
        text = requests.post(url=serch_url, data=post_data)
        if text.status_code == 200:
            results = json.loads(text.text)['results'][0]['hits']
            if num > len(results):
                num = len(results)
            for r in results[:num]:
                file_name = r['hierarchy']['lvl1']
                self.url = r['url']
                html_name = self.download_html(file_name, self.url)
                self.to_docx(html_name)
        else:
            return False

    def main(self):
        serch_url = 'https://4l77k49qor-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(4.13.0)%3B%20Browser%20(lite)%3B%20docsearch%20(3.0.0)%3B%20docsearch-react%20(3.0.0)&x-algolia-api-key=0f8cb8d4dbe2581b6912018d4e33fb8d&x-algolia-application-id=4L77K49QOR'
        key = self.mainkey
        num = self.num1
        post_data = '{"requests":[{"query":"' + str(key) + '","indexName":"mdnice","params":"attributesToRetrieve=%5B%22hierarchy.lvl0%22%2C%22hierarchy.lvl1%22%2C%22hierarchy.lvl2%22%2C%22hierarchy.lvl3%22%2C%22hierarchy.lvl4%22%2C%22hierarchy.lvl5%22%2C%22hierarchy.lvl6%22%2C%22content%22%2C%22type%22%2C%22url%22%5D&attributesToSnippet=%5B%22hierarchy.lvl1%3A10%22%2C%22hierarchy.lvl2%3A10%22%2C%22hierarchy.lvl3%3A10%22%2C%22hierarchy.lvl4%3A10%22%2C%22hierarchy.lvl5%3A10%22%2C%22hierarchy.lvl6%3A10%22%2C%22content%3A10%22%5D&snippetEllipsisText=%E2%80%A6&highlightPreTag=%3Cmark%3E&highlightPostTag=%3C%2Fmark%3E&hitsPerPage=20&clickAnalytics=true"}]}'
        post_data = post_data.encode('utf-8')
        self.run_(serch_url, post_data, num)

    def new_func(self):
        if self.check_func():
            self.main()

    def run_main(self):
        self.mainloop()
```
The class here inherits from tkinter.Tk, which means it inherits the Tk class's parent methods and components; this makes it more convenient to work with the widgets.
Let's walk through it, starting with the constructor:
The constructor: the __init__() function
1. This function sets the size and position of the main window
```python
screenwidth = self.winfo_screenwidth()
screenheight = self.winfo_screenheight()
size_geo = '%dx%d+%d+%d' % (width, height, (screenwidth - width) / 2, (screenheight - height) / 2)
self.geometry(size_geo)
```
2. When tk components need to receive values dynamically, the variables are declared like this:
```python
self.mainkey = tk.StringVar()
self.num = tk.IntVar()
```
Add the controls
In the add_kongjian() function:
1. Add the Label widgets, plus the Entry and Spinbox widgets
```python
tk.Label(self, text="爬取关键字:").grid(row=0)
tk.Label(self, text="篇数:").grid(row=1)
self.e1 = tk.Entry(self)
self.e2 = tk.Spinbox(self)
self.e1.grid(row=0, column=1, padx=10, pady=5)
self.e2.grid(row=1, column=1, padx=10, pady=5)
```
2. Add the Button widgets and bind their callback functions
```python
tk.Button(self, text="开始", width=10, command=self.new_func).grid(row=3, column=0, sticky="w", padx=10, pady=5)
tk.Button(self, text="退出", width=10, command=self.quit).grid(row=3, column=1, sticky="e", padx=10, pady=5)
```
Note that when placing the widgets, row, column, and similar options handle the grid positioning.
The 开始 (Start) button is bound to a custom function, new_func().
Start crawling
Before starting the crawler, we need to check whether the input meets the requirements. If it does, the crawl proceeds; if not, a warning pops up.
1. The new_func function
It calls the check function:
```python
def new_func(self):
    if self.check_func():
        self.main()
```
2. The check_func function
```python
def check_func(self):
    self.mainkey = self.e1.get()
    self.num = self.e2.get()
    # check that the article count is a number
    try:
        num = int(self.num)
        self.num1 = num
    except Exception as e:
        messagebox.showwarning(str(e), "篇数:写个数字吧")
        self.e2.delete(0, tk.END)
        return False
    # check that a keyword was entered
    if self.mainkey == '':
        messagebox.showwarning("错误", "搜索关键词要写啊")
        self.e1.delete(0, tk.END)
        return False
    return True
```
3. The main function, the main crawler program
The main difference from before is that the to_docx function now catches exceptions; if one occurs, a warning window pops up.
```python
def to_docx(self, html_name):
    new_name = html_name.split('.')[0]
    try:
        pypandoc.convert_file(html_name, 'docx', outputfile=f'{new_name}.docx')
    except Exception as e:
        print(e)
        messagebox.showwarning('警告', '出错了联系管理员')
    messagebox.showwarning('成功', '下载成功查看本地同级文件夹file')
    print('原网址:', self.url)
    print('文件名:', new_name)
```
Let's run it
```python
if __name__ == '__main__':
    tk1 = ToolGetArticle()
    tk1.add_kongjian()
    tk1.run_main()
```
If it runs successfully, a tool window pops up. Interesting, right? Give it a try.
When running the code on a Linux system, you will need to modify the file paths.
As for usage, after a successful run a folder named file is created in the same directory as your program; the fetched HTML files and the converted docx files are stored there. Go try it!
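The Linux caveat exists because the code joins paths with hard-coded backslashes. A cross-platform sketch of the same path logic (same `file` folder name as the program above) uses os.path.join instead:

```python
import os

def make_output_path(filename):
    # os.path.join uses the right separator for the current OS
    file_path = os.path.join(os.getcwd(), 'file')
    os.makedirs(file_path, exist_ok=True)  # no error if it already exists
    return os.path.join(file_path, filename + '.html')
```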