WeChat public account image crawling

A friend asked me whether I could crawl the images from the WeChat official account "There is a dog next door"; he wanted the pictures from all of its historical articles. I had never scraped an official account before, but after a little analysis I decided the request was doable. Enough talk, let's get our hands dirty!
1. Preparations:
Open the desktop (PC) version of WeChat and find the official account. Click the icon in the upper-right corner, then choose the "View History" option. There are other ways to get there, but what we need is for the account's history page to appear.
Now right-click on a blank area of that page and you will find a "View source code" option. Clicking it automatically opens a txt file containing the source code of the page. Copy the title of any historical article visible on the current page and search for it in the txt file: it is there, and you will also find that the links to these articles are in the source code.
There is a trap here, though: the source viewed this way does not cover all historical articles, only the ones loaded on the page so far. So scroll the history page all the way to the bottom, down to the first article the account ever published.
Then right-click "View source code" in the blank area at the bottom, and you will be pleasantly surprised to find that this txt file contains noticeably more content than the previous one. Search the txt file for the titles of both the first and the last article; once both match, confidently save the source code locally. Our preparations are complete.
PS: Don't click into any article on this page, or you will have to drag the scroll wheel all the way down again.
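As a quick sanity check of the saved file (a minimal sketch; source.txt and the title strings are placeholders to replace with your own), a few lines of Python confirm that both titles really made it into the local copy:

# test.py: verify that the first and last article titles appear in the saved source
filename = 'source.txt'                                   # assumed name of the saved source code
titles = ['first article title', 'last article title']    # paste the real titles here
with open(filename, 'r', encoding='utf-8') as f:
    source = f.read()
for t in titles:
    print(t, '->', t in source)    # True for both means the page was fully loaded when saved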

2. Analyze the source code:
Take the first historical article as an example. The link obtained via "Copy link address" is the same as the one from "Open in default browser", yet it cannot be found in the source code. There is only one explanation: the article has more than one link.
So search for the link by article title instead; since the title is in the source code, it would be unreasonable for the link not to be nearby. Searching for the title, we find that the hrefs attribute holds the link we need. Opening it confirms that it works, and the structure of the article page is quite clear.
Next comes analyzing that page. Open the developer tools with F12 (or by right-clicking and inspecting an element) to view the page source. After a thorough search, we find that the image links are present in the page source itself, so there is no need to capture packets; parsing the page source is enough.
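Before writing the full crawler, both extraction steps can be confirmed with a short test.py-style sketch. The two XPath expressions below mirror what the analysis found; treat them (and the User-Agent placeholder) as assumptions that may need adjusting if WeChat changes its markup:

# test.py: confirm both extraction steps from the analysis above
import requests
from lxml import etree

# step 1: article links from the saved history-page source
with open('source.txt', 'r', encoding='utf-8') as f:
    html = etree.HTML(f.read())
hrefs = html.xpath('//div[contains(@data-type,"APPMSG")]/h4/@hrefs')
print(len(hrefs), hrefs[:1])                 # article count and a sample link

# step 2: image links inside one article page
headers = {'User-Agent': 'Mozilla/5.0'}      # replace with a real browser UA string
page = requests.get(hrefs[0], headers=headers)
imgs = etree.HTML(page.text).xpath(
    '//section[contains(@style,"text-align")]/section/img/@data-src')
print(len(imgs), imgs[:1])                   # image count and a sample URL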

3. Writing the crawler:
My personal habit is to create two .py files: one of them, named test.py, is used to verify pieces of code before they go into the real crawler. Writing a crawler is usually a circuitous, exploratory process, and modular programming makes it as fun as stacking building blocks. Straight to the code!

import requests                 # fetch web pages
import os                       # create the output folder
from lxml import etree          # for XPath parsing

def start():
    # Layer 1: collect the links of all the account's articles
    filename = 'source.txt'     # the history-page source code saved locally earlier
    with open(filename, 'r', encoding='utf-8') as f:
        source = f.read()       # read the whole file
    html_ele = etree.HTML(source)   # the usual XPath setup
    # the article links sit in the custom hrefs attribute found during analysis
    hrefs = html_ele.xpath('//div[contains(@data-type,"APPMSG")]/h4/@hrefs')
    num = 0
    for href in hrefs:
        num += 1
        try:
            apply_one(href)
        except Exception:
            continue
        print('Article %d crawled' % num)

# Layer 2: parse a single article
def apply_one(url):
    headers = {
        # use your own browser's User-Agent, or generate one with the
        # fake-useragent library:
        #   pip install fake-useragent
        #   from fake_useragent import UserAgent
        #   ua = UserAgent()
        #   headers = {'User-Agent': ua.random}
        'User-Agent': 'Mozilla/5.0',    # replace with a real UA string
    }
    response = requests.get(url, headers=headers)
    elements = etree.HTML(response.text)
    data_src = elements.xpath('//section[contains(@style,"text-align")]/section/img/@data-src')
    # print(len(data_src))
    data_src = data_src[2:-1]   # drop the leading and trailing non-content images
    # print(len(data_src))
    for src in data_src:
        try:
            download(src)       # download one image
        except Exception:
            continue

# Layer 3: the download layer
def download(src):
    headers = {
        'User-Agent': 'Mozilla/5.0',    # add your own UA string here
    }
    response = requests.get(src, headers=headers)
    name = src.split('/')[-2]           # take the file name from the URL
    # print(name)
    dtype = src.split('=')[-1]          # take the image type (e.g. wx_fmt=jpeg)
    name += '.' + dtype                 # rebuild the image file name
    os.makedirs('doge', exist_ok=True)  # create a doge folder in the current directory
    with open('doge/' + name, 'wb') as f:
        f.write(response.content)

if __name__ == '__main__':
    start()
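To see what the string slicing in download() is doing, here is a quick check against a typical image URL shape (the URL below is a made-up example that only follows the usual mmbiz.qpic.cn pattern):

src = 'https://mmbiz.qpic.cn/mmbiz_jpg/abc123def456/640?wx_fmt=jpeg'   # hypothetical URL
print(src.split('/')[-2])    # abc123def456 -> used as the file name
print(src.split('=')[-1])    # jpeg         -> used as the file extension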

4. Summary:
Writing a crawler like this one is a process of moving from a point to a surface: first make a single breakthrough, then iterate and extend outward, working from the outside in and from the shallow to the deep. Learning to write crawlers takes a lot of time and energy; when you want to give up, grit your teeth once more, and with that bit of perseverance I believe you can break through the difficulties. Learning doesn't have to happen in isolation, and it helps to talk with people who are better than you, but I still advocate solving problems yourself first and asking for help only when there is truly no other way, because you will find that in the end only you can save yourself. I hope you can learn something from this; please forgive me if it is badly written!

Origin blog.csdn.net/weixin_43594279/article/details/107191771