Crawling PPT Templates
A record of my first attempt at writing a crawler to download PPT templates.
As a beginner, this is actually my first time writing a crawler. The process may be a bit clumsy, but it is easy to follow. There may be mistakes in the write-up; corrections are welcome.
Preface
The purpose of writing this article is to keep myself honest, clarify my thinking, and learn to write crawlers better; my current skills are admittedly quite weak.
Usage Steps
1. Import the library
The code is as follows (example):
from bs4 import BeautifulSoup
from lxml import etree
import requests
from selenium import webdriver
import urllib
import time
import os
The full code is shown below. The site crawled this time is Youpin PPT:
#http://www.ypppt.com/moban/lunwen/list-2.html
#http://www.ypppt.com/moban/lunwen/
#/html/body/div[2]/div/ul/li[1]/a
from bs4 import BeautifulSoup
from lxml import etree
import requests
from selenium import webdriver
import urllib
import time
import os

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400"
}

num = 1
for page in range(1, 6):
    # The first page has no list-N suffix; later pages do
    if page == 1:
        new_url = 'http://www.ypppt.com/moban/lunwen/'
    else:
        new_url = 'http://www.ypppt.com/moban/lunwen/list-{}.html'.format(page)
    print('Crawling ' + new_url)
    response = requests.get(new_url, headers=headers)
    response.encoding = 'utf-8'
    jx = BeautifulSoup(response.content, 'lxml')
    mains = jx.find('ul', {'class': 'posts clear'})
    main_ppts = mains.find_all('li')
    for i in main_ppts:
        # Follow the link on each list item to its detail page
        a = i.a.attrs['href']
        b = requests.get('http://www.ypppt.com' + a)
        b.encoding = b.apparent_encoding
        c = BeautifulSoup(b.content, 'lxml')
        # The detail page has a button leading to the download page
        down = c.find('div', {'class': 'button'})
        down1 = down.a.attrs['href']
        down_1 = requests.get('http://www.ypppt.com' + down1)
        down_1.encoding = down_1.apparent_encoding
        down_2 = BeautifulSoup(down_1.content, 'lxml')
        # The first entry in the download list is the actual archive link
        e = down_2.find('ul', {'class': 'down clear'})
        li = e.find('li')
        download_url = li.a.attrs['href']
        download = requests.get(url=download_url, headers=headers).content
        with open(str(num) + '.zip', 'wb') as f:
            f.write(download)
        print(str(num) + ' downloaded successfully')
        num += 1
        time.sleep(4)  # pause between downloads to go easy on the server
The first step is to write the request header:

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400"
}

As for the time.sleep in the middle, you can consult other articles for a sensible delay; to be honest, I am not sure about it myself.
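As a side note, if every request uses the same header, a requests.Session can carry it automatically so you set it only once. This is just a sketch of that idea (the Session is my own suggestion, not part of the original script):

```python
import requests

# A session that sends the browser-like User-Agent on every request
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36"
})

# Any session.get(url) call now includes this header automatically,
# and the session also reuses the underlying TCP connection.
ua = session.headers["User-Agent"]
```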
After importing all the libraries, you can start parsing the web pages.
- Step 1

After opening the website, I found that, except for the first page, the page URLs follow a regular pattern, so I used a simple conditional:
for page in range(1, 6):
    if page == 1:
        new_url = 'http://www.ypppt.com/moban/lunwen/'
    else:
        new_url = 'http://www.ypppt.com/moban/lunwen/list-{}.html'.format(page)
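The branch above can also be wrapped in a small helper function, which keeps the special case for page 1 in one place and makes it easy to test. This is just a sketch; the function name page_url is my own:

```python
def page_url(page):
    """Return the listing URL for a page number; page 1 has no list-N suffix."""
    base = 'http://www.ypppt.com/moban/lunwen/'
    if page == 1:
        return base
    return base + 'list-{}.html'.format(page)

print(page_url(1))  # http://www.ypppt.com/moban/lunwen/
print(page_url(3))  # http://www.ypppt.com/moban/lunwen/list-3.html
```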
- Step 2

Actually, I still prefer to use BeautifulSoup to parse web pages. The code is

BeautifulSoup(response.content, 'lxml')

Remember to pass response.content (the raw bytes) rather than the response object itself, otherwise it will raise an error. It took me a very long time to find this solution.
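To see this in isolation, you can parse a small page held in memory. The markup below is a made-up fragment shaped like the site's listing, and I use Python's built-in html.parser so the sketch has no lxml dependency:

```python
from bs4 import BeautifulSoup

# Made-up fragment mimicking the site's list structure, as raw bytes
# (this is exactly what response.content would give you)
raw = ('<html><head><meta charset="utf-8"></head><body>'
       '<ul class="posts clear"><li><a href="/article/1.html">demo</a></li></ul>'
       '</body></html>').encode('utf-8')

# Passing the bytes lets the parser handle the decoding itself
soup = BeautifulSoup(raw, 'html.parser')
link = soup.find('ul', {'class': 'posts clear'}).find('a')
print(link.attrs['href'])  # /article/1.html
```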
- Step 3

Finally, you can start formally extracting things from the webpage with my favorite, bs4. The principle is to find tags step by step:

a = i.a.attrs['href']
b = requests.get('http://www.ypppt.com' + a)

I use this method to simulate clicking through to each page. I don't know much about selenium, so this is the only method I can use.
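Concatenating 'http://www.ypppt.com' + a works as long as the href is an absolute path. A more robust option (my suggestion, not in the original) is urljoin from the standard library, which resolves any kind of relative link against the page URL:

```python
from urllib.parse import urljoin

base = 'http://www.ypppt.com/moban/lunwen/'

# An absolute path replaces everything after the host
print(urljoin(base, '/article/demo.html'))  # http://www.ypppt.com/article/demo.html

# A bare filename is resolved relative to the current directory
print(urljoin(base, 'list-2.html'))         # http://www.ypppt.com/moban/lunwen/list-2.html
```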
- Step 4

This part is not in the code above: you could create a folder and store all the downloads in it. That is what the imported os library is for, e.g. os.mkdir(). If you are interested, take a look at how more experienced people do it.
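A minimal sketch of that idea: create a download folder once, then build each file path inside it (the folder name ppt_downloads is my own choice). Using os.makedirs with exist_ok=True avoids the error os.mkdir() raises when the folder already exists:

```python
import os

folder = 'ppt_downloads'  # hypothetical folder name
os.makedirs(folder, exist_ok=True)  # create it; no error if it is already there

# Build each save path inside the folder instead of the working directory,
# e.g. open(path, 'wb') instead of open(str(num) + '.zip', 'wb')
path = os.path.join(folder, '1.zip')
```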