Batch web content crawler (with regular expressions)

Recently my lovely girlfriend was handed a new task: copy part of the content of a website, as many as 1,500 pages, into a Word document. Each page mixes text and pictures, so doing it by hand would be very tedious. Could a crawler solve it all in one go?


First analyze the web page:


Each page lists 30 news articles, so the first step is to crawl the article links on every listing page.


After inspecting the page source, work out the pattern of the article links and pick them out with a regular expression.

# Crawl the article links on each listing page
import re
import requests

list2=[]
url='https://www'   # listing-page URL prefix (truncated here)

for i in range(170,215):
        url3=url+str(i)+'.htm'
        r=requests.get(url3,timeout=30)
        r.raise_for_status()
        r.encoding=r.apparent_encoding
        demo=r.text
        pic_urll = re.findall(r'a href="../(.{5}\d{4}.*)"',demo)
        m=pic_urll[0:46]        # keep at most the first 46 matches
        for j in m:
           if j not in list2:
               list2.append(j)  # deduplicate while keeping order

A few points worth noting:

Deduplicating with list1=list(set(list1)) changes the order of the elements, so it is not a good fit here.

for j in m:
    if j not in list2:
        list2.append(j)  # deduplication that preserves order; this method works well
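
If a one-liner is preferred, dict.fromkeys also deduplicates while keeping the first occurrence of each element, since dictionaries preserve insertion order in Python 3.7+. This is not used in the original code; a minimal sketch with made-up sample links:

links = ['c123/4501/content.htm', 'c123/4502/content.htm', 'c123/4501/content.htm']  # sample data
deduped = list(dict.fromkeys(links))   # keeps first occurrences, preserves order
print(deduped)  # ['c123/4501/content.htm', 'c123/4502/content.htm']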

How the regular expression works (images from the China University MOOC course):


After collecting the links of all the articles, the next step is to analyze the tree structure of each article page.


The article text is extracted by walking the tag tree with BeautifulSoup, while the image links are extracted with regular expressions. Because pictures and text are crawled separately, the output cannot follow the original layout of each news item; all the pictures end up grouped at the front. If any experienced readers know how to preserve the original typesetting, please let me know.
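
For reference, here is a minimal sketch of this split approach; the URL and the tag selectors are placeholders, not the real ones from the target site:

import re
import requests
from bs4 import BeautifulSoup

r = requests.get('https://example.com/news/some-article.htm', timeout=30)  # placeholder URL
r.encoding = r.apparent_encoding
html = r.text

# Text: walk the tag tree with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
title_tag = soup.find('h1')                       # placeholder selector
title = title_tag.get_text(strip=True) if title_tag else ''
body_tag = soup.find('div', class_='content')     # placeholder selector
text = body_tag.get_text() if body_tag else ''

# Image links: pull the src attributes out with a regular expression
img_names = re.findall(r'src="/pub/news/images/content/(.*?)"', html)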

The pictures of each article are downloaded into a folder, written into the document, deleted, and then the pictures of the next article are downloaded. One point worth noting is that with

mm=os.listdir(root)    

the picture file names come back in arbitrary order, so remember to sort them before writing them into the Word document:

mm.sort(key=lambda x:int(x[:-4]))
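
The reason is that os.listdir returns names in arbitrary order, and a plain string sort would put '10.jpg' before '2.jpg'; stripping the '.jpg' suffix and converting to int gives the intended numeric order. A quick illustration:

names = ['10.jpg', '2.jpg', '0.jpg', '1.jpg']
print(sorted(names))                              # ['0.jpg', '1.jpg', '10.jpg', '2.jpg']  (lexicographic)
print(sorted(names, key=lambda x: int(x[:-4])))   # ['0.jpg', '1.jpg', '2.jpg', '10.jpg']  (numeric)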

The python-docx library is used to write the Word document; out of the box it is extremely unfriendly to Chinese text.

Create the document:   document = Document()

Write the title:       head0 = document.add_heading(level=1)
                       head0.alignment = WD_ALIGN_PARAGRAPH.CENTER
                       title_run = head0.add_run(head)

Page break:            document.add_page_break()

Write a paragraph:     p = document.add_paragraph()
                       run = p.add_run(body)

Save:                  document.save('171 to 215.docx')
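
The awkward part for Chinese text is that run.font.name only sets the Western font slot; the East Asian font has to be set separately on the rFonts element through qn('w:eastAsia'), which is what the full code below does. A minimal standalone sketch:

from docx import Document
from docx.shared import Pt
from docx.oxml.ns import qn

document = Document()
p = document.add_paragraph()
run = p.add_run('中文内容')
run.font.name = '仿宋'        # sets the Western font slot
run.font.size = Pt(14)
# set the East Asian font slot on the underlying rFonts XML element
run.font.element.rPr.rFonts.set(qn('w:eastAsia'), '仿宋')
document.save('demo.docx')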

On the whole, the approach is: crawl the links of every article, then crawl the text and the pictures of each article separately, and finally write everything into a Word document. There are plenty of shortcomings. The code is just a pile of whatever functions came to mind, with no real structure, no functions of its own, and almost no comments; honestly, it is ugly. Crawling is also extremely slow: the 1,500-page job takes about 30 minutes, with no distributed crawling and no crawler framework. Put another way, a novice crawler like this is practically a network attack on the target server.
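
One small step toward being gentler on the server, not in the original code, is to reuse a single requests.Session and pause between requests (the one-second delay is an arbitrary choice):

import time
import requests

session = requests.Session()   # reuse the underlying TCP connection across requests

def polite_get(url, delay=1.0):
    """Fetch a URL, then pause so requests are not fired back-to-back."""
    r = session.get(url, timeout=30)
    r.raise_for_status()
    time.sleep(delay)          # arbitrary pause between requests
    return r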

Output result:


At first I did not want to bother with the page-view count; it is dynamic data and I really did not want to write that part, but my girlfriend insisted, so I just fill it in with a random number~

 

Full code:

"""
Created on Thu Aug  4 07:52:15 2022

@author: 18705
"""
import requests
import os
import re
from bs4 import BeautifulSoup
from docx import Document
from docx.shared import Cm, Pt
from docx.oxml.ns import qn  # needed to set the East Asian (Chinese) font
import random
from docx.enum.text import WD_ALIGN_PARAGRAPH


# Crawl the article links on each listing page
list2=[]
url='listing-page URL prefix'   # the real prefix is omitted here

for i in range(170,215):
        url3=url+str(i)+'.htm'
        r=requests.get(url3,timeout=30)
        r.raise_for_status()
        r.encoding=r.apparent_encoding
        demo=r.text
        pic_urll = re.findall(r'a href="../(.{5}\d{4}.*)"',demo)
        m=pic_urll
        for j in m:
           if j not in list2:
               list2.append(j)   # deduplicate while keeping order



# Create a Word document
document = Document()
url11='image URL prefix'   # the real prefix is omitted here
# Build a table of contents
p = document.add_paragraph('目录')   # '目录' means "Table of Contents"
for j in range(0,len(list2)):
        try:
            url4='https://www.cup.edu.cn/news/'+list2[j]
            r=requests.get(url4,timeout=30)
            r.raise_for_status()
            r.encoding=r.apparent_encoding
            demo=r.text
            soup=BeautifulSoup(demo,'html.parser')
            head=soup.body.article.div.contents[3].div.div.div.contents[1].text
            p=document.add_paragraph(style='List Number')
            run = p.add_run(head)
            run.font.name = '仿宋'
            run.font.size = Pt(14)
            run.font.element.rPr.rFonts.set(qn('w:eastAsia'),'仿宋')
        except:
            pass
# Page break after the table of contents
document.add_page_break()
root='D://work//pics//'   # temporary folder for downloaded pictures
# Generate the article content
for j in range(0,len(list2)):
        try:
            url4='https://www.cup.edu.cn/news/'+list2[j]
            r=requests.get(url4,timeout=30)
            r.raise_for_status()
            r.encoding=r.apparent_encoding
            demo=r.text
            soup=BeautifulSoup(demo,'html.parser')
            head=soup.body.article.div.contents[3].div.div.div.contents[1].text
            body=soup.body.article.div.contents[3].div.div.div.contents[5].text
            data=soup.body.article.div.contents[3].div.div.div.contents[3].text
            
            
            head0 = document.add_heading(level=1)
            head0.alignment = WD_ALIGN_PARAGRAPH.CENTER
            title_run = head0.add_run(head)
            title_run.font.size = Pt(20)
# Chinese (East Asian) font for the title
            title_run.font.name = '微软雅黑'
            title_run.font.element.rPr.rFonts.set(qn('w:eastAsia'), '微软雅黑')
            
            
        except:
            pass
        
        
        # Crawl the pictures and insert them into the document
        try:
            r=requests.get(url4,timeout=30)
            r.raise_for_status()
            r.encoding=r.apparent_encoding
            demo=r.text
            pic_urll = re.findall('src="/pub/news/images/content/(.*?)"',demo)
            pic_urll = pic_urll[0:len(pic_urll)]
        except :
            pass
        for pp in range(0,len(pic_urll)):
            url5=(url11+pic_urll[pp])
            path=root+str(pp)+'.jpg'
            if not os.path.exists(path):
                r=requests.get(url5)
                with open(path,'wb') as f:
                    f.write(r.content)
                    f.close()
        mm=os.listdir(root)
        mm.sort(key=lambda x:int(x[:-4]))
        try:
            for l in range(0,len(mm)):
                document.add_picture('D:/work/pics/'+mm[l], width=Cm(15))       
        except :
             pass
        # Write the article text into the document
        try:
            sj=random.randint(50,1000)
            p = document.add_paragraph()
            data=data.split('\n')[:4]
            data.append(str(sj))
            run = p.add_run(data)
            run.font.name = '仿宋'
            run.font.size = Pt(14)
            run.font.element.rPr.rFonts.set(qn('w:eastAsia'),'仿宋')
            run.underline = False
        except AttributeError:
            pass
        p = document.add_paragraph()
        run = p.add_run(body)
        run.font.name = '仿宋'
        run.font.size = Pt(14)
        run.font.element.rPr.rFonts.set(qn('w:eastAsia'),'仿宋')
        run.underline = False
        
        document.add_page_break()
        # Delete the downloaded pictures after each article
        for jj in mm:   # mm holds the file names returned by os.listdir(root)
            file_data = root + jj   # absolute path of the file inside the pictures folder
            os.remove(file_data)
# Save the document
document.save('171到215.docx')





            

 

 

 

 
