Using a Python crawler to download 10,000+ PPT templates

1. Introduction

The text and images in this article come from the Internet and are for learning and discussion only, with no commercial use. If you have any concerns, please contact us and we will address them.



Whether you are a student or already working, you will have to deal with PPT sooner or later. Every time you make a presentation you need to find a template, and sometimes the templates cost money, which is a bit annoying!

Today I will show you how to use a Python crawler to download 10,000 PPT templates, so you will never again be stuck without a template.

2. Background

1. Template source


https://sc.chinaz.com/ppt/free_1.html

 

Each page lists 20 items and there are 500 pages, so 20 × 500 = 10,000 PPT templates in total!

2. Crawler ideas

  • First, traverse each listing page and get the URL of each PPT template.
  • Then, obtain the download address from each template's page.
  • Finally, download the file to local disk from the download address.
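
The first step, enumerating the listing pages, can be sketched as follows. This is only a minimal illustration: it assumes the listing pages follow the free_N.html pattern shown above, with N running from 1 to 500, and `page_urls` is a hypothetical helper name that does not appear in the original script.

```python
# URL pattern taken from the template source above
BASE = "https://sc.chinaz.com/ppt/free_{}.html"


def page_urls(first=1, last=500):
    """Yield listing-page URLs (hypothetical helper; assumes pages first..last exist)."""
    for k in range(first, last + 1):
        yield BASE.format(k)


if __name__ == "__main__":
    urls = list(page_urls())
    print(len(urls))  # number of listing pages
    print(urls[0])    # first listing page
```

Keeping the URL generation in its own function makes it easy to try the crawler on a small range first, for example `page_urls(1, 3)`, before running all 500 pages.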

3. Crawling data

1. Traverse every page

 

Using XPath, you can locate the elements with class=bot-div, each of which contains the URL and name of a PPT template.


import requests
from lxml import etree

# Traverse every listing page
def getlist():
    for k in range(1, 501):
        url = "https://sc.chinaz.com/ppt/free_" + str(k) + ".html"
        res = requests.get(url)
        res.encoding = 'utf-8'
        text = res.text

        selector = etree.HTML(text)
        # Each bot-div element holds one template's link and name
        items = selector.xpath('//*[@class="bot-div"]')
        for item in items:
            title = item.xpath('.//a/text()')[0].replace("\n", '').replace(" ", '')
            href = item.xpath('.//a/@href')[0].replace("\n", '').replace(" ", '')
            print(title)
            print(href)
            print("----------------")

While traversing, record the name (title) and URL (href) of each PPT template. The name is convenient to use as the saved file name when downloading.
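
Because the template name ends up as a file name, it may be worth cleaning it first. The sketch below is one possible approach, not part of the original script: `safe_filename` is a hypothetical helper, and the set of characters it replaces is an assumption based on what Windows rejects in file names.

```python
import re

# Characters Windows does not allow in file names, plus whitespace (an assumption
# for this sketch; adjust the set for your own platform)
_ILLEGAL = re.compile(r'[\\/:*?"<>|\s]+')


def safe_filename(title, default="untitled"):
    """Replace runs of illegal characters with "_" (hypothetical helper)."""
    cleaned = _ILLEGAL.sub("_", title).strip("_")
    # Fall back to a default name if nothing usable is left
    return cleaned or default


if __name__ == "__main__":
    print(safe_filename("a/b: c?"))
```

Non-ASCII titles such as Chinese template names pass through unchanged, since only the listed characters are replaced.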

 

2. Get the download address

Take the following URL as an example


https://sc.chinaz.com/ppt/210305465710.htm

 

Parse the download link:

Using XPath, you can locate the element with class=download-url, which contains four download addresses. In practice all four are equivalent, so just pick one of them.


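# Example template page from above
url = "https://sc.chinaz.com/ppt/210305465710.htm"
res = requests.get(url)
res.encoding = 'utf-8'
text = res.text
selector = etree.HTML(text)
# Take the first of the four download addresses
href = selector.xpath('//*[@class="download-url"]/a/@href')[0]
print(href)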

3. Download and save

Download the file according to the download address obtained and save it locally.


# href is the download address from the previous step; title is the template name
r = requests.get(href)
with open(str(title) + ".rar", "wb") as f:
    f.write(r.content)

 

 

OK, the PPT template is now saved locally.
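
The snippet above always saves the file with a .rar suffix. If some download addresses point to a different archive type, the suffix could instead be read from the address itself. The following is a hedged sketch: `extension_from_url` is a hypothetical helper, the example.com addresses are placeholders, and falling back to ".rar" simply mirrors the original code.

```python
import posixpath
from urllib.parse import urlparse


def extension_from_url(href, default=".rar"):
    """Return the file extension found in a download URL (hypothetical helper)."""
    # Only look at the path part so query strings such as ?token=1 are ignored
    path = urlparse(href).path
    ext = posixpath.splitext(path)[1]
    # Fall back to the original script's ".rar" when the URL has no extension
    return ext.lower() or default


if __name__ == "__main__":
    print(extension_from_url("https://example.com/files/demo.rar"))
```

It could then be used as `open(str(title) + extension_from_url(href), "wb")` in place of the hard-coded suffix.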

Now let's download in batches!

4. Batch download


import requests
from lxml import etree
from urllib.parse import urljoin

# Download one file
def download(url, title):
    # Fetch the template page and extract the first download address
    res = requests.get(url)
    res.encoding = 'utf-8'
    text = res.text
    selector = etree.HTML(text)
    href = selector.xpath('//*[@class="download-url"]/a/@href')[0]

    # Save the archive locally under the template name
    r = requests.get(href)
    with open(str(title) + ".rar", "wb") as f:
        f.write(r.content)
    print(str(title) + ": download complete!")


# Traverse every listing page
def getlist():
    for k in range(1, 501):
        url = "https://sc.chinaz.com/ppt/free_" + str(k) + ".html"
        res = requests.get(url)
        res.encoding = 'utf-8'
        text = res.text

        selector = etree.HTML(text)
        items = selector.xpath('//*[@class="bot-div"]')
        for item in items:
            title = item.xpath('.//a/text()')[0].replace("\n", '').replace(" ", '')
            href = item.xpath('.//a/@href')[0].replace("\n", '').replace(" ", '')
            # urljoin builds a valid absolute URL whether or not href starts with "/"
            download(urljoin("https://sc.chinaz.com/", href), title)


if __name__ == "__main__":
    getlist()

 

 

In this way, all 10,000 PPT templates can be downloaded!
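
With around 10,000 downloads in one loop, a single network error would stop the whole run. One way to make the batch more robust is to retry a failed call a few times before giving up. The sketch below is an illustration only: `with_retries` is a hypothetical helper, and the attempt count and delay are arbitrary assumptions rather than values from the original article.

```python
import time


def with_retries(func, attempts=3, delay=1.0, sleep=time.sleep):
    """Call func() up to `attempts` times, pausing `delay` seconds after each failure."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:  # deliberately broad for this sketch
            last_error = exc
            if attempt < attempts:
                sleep(delay)
    # Every attempt failed: re-raise the last error seen
    raise last_error


if __name__ == "__main__":
    print(with_retries(lambda: "ok"))
```

Inside the loop, the call could then be wrapped as `with_retries(lambda: download(url, title))`. The pause between attempts also slows the crawler slightly, which is gentler on the site.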

4. Summary

With a short Python program we crawled 10,000 PPT templates, so there is no longer any need to worry about making a presentation without a template!

Origin blog.csdn.net/pythonxuexi123/article/details/114528933