Crawl PPT


This post records my first attempt at writing a crawler, used to scrape PPT templates.
As a freshman, this is truly the first crawler I have written. The process may be a bit clumsy, but it is easy to follow. There may be mistakes in the write-up, and I hope you will correct them.

Preface

The purpose of writing this article is to keep myself on my toes, to clarify my thinking, and to learn to write crawlers better; my skills are admittedly still quite weak.

Usage steps

1. Import the libraries

The code is as follows (example):

from bs4 import BeautifulSoup   # HTML parsing
from lxml import etree          # lxml parser (used as BeautifulSoup's backend)
import requests                 # HTTP requests
from selenium import webdriver  # imported but not actually used below
import urllib                   # imported but not actually used below
import time                     # delays between requests
import os                       # could be used to create a download folder (see step 4)

2. Complete code

The website crawled this time is Youpin PPT (ypppt.com).

#http://www.ypppt.com/moban/lunwen/list-2.html
#http://www.ypppt.com/moban/lunwen/
#/html/body/div[2]/div/ul/li[1]/a

from bs4 import BeautifulSoup
from lxml import etree
import requests
from selenium import webdriver
import urllib
import time
import os

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400"
}

num = 1
for page in range(1, 6):
    time.sleep(4)  # pause between list pages so the requests are spaced out

    # Only the first page has no 'list-N' suffix; the rest follow a pattern
    if page == 1:
        new_url = 'http://www.ypppt.com/moban/lunwen/'
    else:
        new_url = 'http://www.ypppt.com/moban/lunwen/list-{}.html'.format(page)
    print('Crawling ' + new_url)
    response = requests.get(new_url, headers=headers)
    response.encoding = 'utf-8'

    # Parse the list page and collect every template entry
    jx = BeautifulSoup(response.content, 'lxml')
    mains = jx.find('ul', {'class': 'posts clear'})
    main_ppts = mains.find_all('li')
    for i in main_ppts:
        # Follow the entry's link to the template detail page
        a = i.a.attrs['href']
        b = requests.get('http://www.ypppt.com' + a)
        b.encoding = b.apparent_encoding
        c = BeautifulSoup(b.content, 'lxml')

        # The download button on the detail page leads to the download page
        down = c.find('div', {'class': 'button'})
        down1 = down.a.attrs['href']
        down_1 = requests.get('http://www.ypppt.com' + down1)
        down_1.encoding = down_1.apparent_encoding
        down_2 = BeautifulSoup(down_1.content, 'lxml')

        # The first item in the download list holds the actual file link
        e = down_2.find('ul', {'class': 'down clear'})
        li = e.find('li')
        download_url = li.a.attrs['href']
        download = requests.get(url=download_url, headers=headers).content

        # Save the archive under a running number
        with open(str(num) + '.zip', 'wb') as f:
            f.write(download)
        print(str(num) + ' downloaded successfully')
        num += 1
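As an aside, since lxml is imported anyway, the XPath noted in the comment at the top of the script could be used instead of BeautifulSoup. A minimal sketch, assuming the live page layout still matches the browser-copied path (dropping the [1] index selects every link in the list rather than just the first):

import requests
from lxml import etree

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('http://www.ypppt.com/moban/lunwen/', headers=headers)

# Build an element tree from the raw bytes and select every list link
tree = etree.HTML(response.content)
links = tree.xpath('/html/body/div[2]/div/ul/li/a/@href')
print(links)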

The first step is to write the request header:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400"}

A browser-like User-Agent makes the requests look like they come from a normal browser rather than a script. As for the time.sleep between requests, you can refer to other write-ups; I am honestly not sure about the right value yet.
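Since I was unsure about time.sleep, here is a minimal sketch of the idea: pausing before every request so consecutive hits are spaced out. The helper name polite_get and the 4-second delay are my own choices for illustration, not anything the site requires.

import time
import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # shortened placeholder UA for this sketch

def polite_get(url, delay=4):
    # Sleep first, then fetch, so every call is spaced out by `delay` seconds
    time.sleep(delay)
    return requests.get(url, headers=headers)

resp = polite_get('http://www.ypppt.com/moban/lunwen/')
print(resp.status_code)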


Once all the libraries are imported, you can start parsing the web pages.

  • Step 1

After browsing the site, I found that except for the first page, the page URLs follow a regular pattern, so I used a simple conditional to build them, as shown below.

for page in range(1, 6):
    # only the first page lacks the 'list-N' suffix
    if page == 1:
        new_url = 'http://www.ypppt.com/moban/lunwen/'
    else:
        new_url = 'http://www.ypppt.com/moban/lunwen/list-{}.html'.format(page)
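The same page URLs can also be built up front with a conditional expression; this is just a compact rewrite of the branch above and produces the identical list:

base = 'http://www.ypppt.com/moban/lunwen/'
urls = [base if page == 1 else base + 'list-{}.html'.format(page)
        for page in range(1, 6)]
print(urls)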
  • Step 2

Actually, I still prefer to use BeautifulSoup to parse web pages. The code is:

BeautifulSoup(response.content, 'lxml')

Remember to pass response.content (the raw bytes) here rather than response.text, otherwise the parse can error out or come back garbled. It took me a very long time to find this solution, but I finally found it.
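A small sketch of the difference: response.text decodes with whatever encoding requests guessed from the HTTP headers, which may be wrong, while response.content hands the raw bytes to lxml, which can sniff the charset from the HTML itself.

import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.ypppt.com/moban/lunwen/',
                        headers={'User-Agent': 'Mozilla/5.0'})

# Passing the raw bytes avoids relying on a bad guess at the text encoding
soup = BeautifulSoup(response.content, 'lxml')
print(soup.title.string if soup.title else 'no title found')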

  • Step 3

Finally, you can start formally extracting things from the web page. It's my favorite part of bs4: the principle is simply to find the tags step by step.

a = i.a.attrs['href']
b = requests.get('http://www.ypppt.com' + a)

I use this method to simulate clicking through into each page. After all, I don't know much about selenium, so I can only follow the links with requests like this.
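One detail worth noting: the href values on this site are root-relative, which is why the code prepends 'http://www.ypppt.com'. A slightly more robust sketch uses urljoin from the standard library, which handles relative and absolute links alike (the href below is a made-up example):

from urllib.parse import urljoin

base = 'http://www.ypppt.com/moban/lunwen/'
href = '/article/12345.html'  # hypothetical href, for illustration only
print(urljoin(base, href))    # -> http://www.ypppt.com/article/12345.html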

  • Step 4

This part is not in the code above: you could also create a folder and store all the downloads in it. That is what the imported os library is for, e.g. os.mkdir(). If you are interested, take a look at how more experienced people do it; a small sketch follows.
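Here is a minimal sketch of that idea; the folder name ppt_downloads is my own choice, not from the original code:

import os

folder = 'ppt_downloads'            # hypothetical folder name
os.makedirs(folder, exist_ok=True)  # like os.mkdir(), but no error if it already exists

# In the download loop, save into the folder instead of the current directory:
# with open(os.path.join(folder, str(num) + '.zip'), 'wb') as f:
#     f.write(download)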


Origin blog.csdn.net/weixin_52300580/article/details/110674818