Python crawler: crawling the Yunqi community

This code has been tested and crawls the site normally; what follows is an overview of the whole project.
Crawling target: https://yq.aliyun.com/, specifically the titles and bodies of the posts about Python in the Yunqi community.
Tool used: the Python requests library.
Techniques involved: header camouflage, re regular expressions, URL splicing, automatic URL redirects, and storing the crawled data locally.
The whole project took about three hours to complete.
Preliminary preparation: install the requests library; look up the specific details yourself.
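A typical installation command, assuming a standard pip setup (see the requests documentation for alternatives):

pip install requests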

import time
import requests
import re
# import the packages the project needs
url = 'https://yq.aliyun.com/search/articles'
# crawl target: the search page of the Yunqi community
key = 'python'
# keyword to search for at the URL above
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'accept-encoding': 'gzip, deflate, sdch, br',
    'accept-language': 'zh-CN,zh;q=0.8'
}
# header camouflage; optional here, since the site does not yet use header-based anti-crawling checks
data = requests.get(url, params={'q': key}, headers=headers).text
# fetch the first results page
pat1 = '<div class="_search-info">找到(.*?)条关于'
# regex that captures the total number of posts
alllines = re.compile(pat1, re.S).findall(data)[0]
# extract the post count with the regex
# print(type(alllines), alllines)
allpage = int(alllines) // 15 + 1
# number of result pages (15 posts per page)
# print(type(allpage), allpage)
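# worked example of the page count: 152 posts gives 152 // 15 + 1 = 11 pages;
# if the count were an exact multiple of 15, the formula would request one
# extra, empty page at the end, which only costs one request that finds nothing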
for i in range(allpage):
    print('*' * 40)
    index = str(i + 1)
    getdata = {'q': key, 'p': index}
    # query parameters; requests splices them onto the URL for us
    data = requests.get(url, params=getdata).text
    pat2 = '<div class="media-body text-overflow">.*?<a href="(.*?)">'
    articles = re.compile(pat2, re.S).findall(data)
    # collect the URL of every post on this page
    # print(articles)
    for j in articles:
        thisurl = "https://yq.aliyun.com/" + j
        # full URL of a single post; requests follows any redirect automatically
        # print(type(thisurl))
        thisdata = requests.get(thisurl).text
        pat_title = '<p class="hiddenTitle">(.*?)</p>'
        # print(type(pat_title))
        pat_content = '<div class="content-detail unsafe markdown-body">(.*?)<div class="copyright-outer-line">'
        title = re.compile(pat_title, re.S).findall(thisdata)[0]
        # extract the post title
        content = re.compile(pat_content, re.S).findall(thisdata)[0]
        # extract the post body
        file = open('e:\\aaaaa\\' + str(i) + '_' + str(time.time()) + '.txt', 'w', encoding='utf-8')
        # save path (the folder must already exist); the file name is the
        # crawled page number plus the current time, and the txt suffix can
        # be swapped for any text format, such as html
        file.write(title + '<br/> <br/>' + content)
        # write the data; every pass of the inner loop produces one file,
        # with a line break inserted between the title and the body
        file.close()
        # writing finished, close the stream
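For reference, the "URL splicing" above is done by requests itself: the params dict is encoded into the query string appended to the base URL. A minimal sketch, with page 2 as an arbitrary example:

import requests

r = requests.get('https://yq.aliyun.com/search/articles', params={'q': 'python', 'p': '2'})
print(r.url)
# prints https://yq.aliyun.com/search/articles?q=python&p=2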

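As a possible refinement (not part of the original code), the open/write/close sequence can be wrapped in a with block, which closes the file even if the write raises an exception. save_post is a hypothetical helper name:

import time

def save_post(title, content, page, folder='e:\\aaaaa\\'):
    # hypothetical helper; same naming scheme as the crawler above,
    # and the folder is still assumed to exist
    path = folder + str(page) + '_' + str(time.time()) + '.txt'
    with open(path, 'w', encoding='utf-8') as f:
        f.write(title + '<br/> <br/>' + content)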
Origin blog.csdn.net/alwaysbefine/article/details/105001095