Python batch crawler: downloading PDF files, code implementation

The background of this article: a teacher at the university whom I know well asked me whether I could download the PDF files behind the 1,000 hyperlink URLs in an Excel sheet. They could be clicked and downloaded one by one, but that would be far too labor-intensive and time-consuming. Remembering my earlier web-crawler experience, I analyzed the feasibility for the teacher and then put it into practice.
  
Unexpectedly, I ran into difficulty right at the start: when the hyperlinks in Excel were read in Python, only the Chinese display text came through instead of the URLs. So the first step was to extract the URLs behind the hyperlinks, and the second was to use Python to crawl the PDF at each URL. The first step was described in detail in the previous article; this article covers the second step, batch-downloading the files with a crawler, with a detailed walkthrough of the code.


  

1. Read data

  
First, read in the data. The code is as follows:

import os
import pandas as pd

# Set the folder where the downloaded files will be stored
os.chdir(r'F:\老师\下载文件')
# Read the data (the CSV is GBK-encoded)
link_date = pd.read_csv('import.csv', encoding='gbk')
link_date.head(2)

The result:
  
[Image: the first two rows of link_date]
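Before launching the browser for a thousand pages, it is worth a quick sanity check that every entry in the URL column is actually a usable http(s) link. A small helper (hypothetical, not part of the original code; the column name '网址' follows the CSV above):

```python
# Sketch: list the rows whose value does not look like an http(s) URL.
import pandas as pd

def check_urls(df, column):
    """Return the index labels of rows whose value is not an http(s) URL."""
    urls = df[column].astype(str).str.strip()
    bad = df.index[~urls.str.match(r"https?://")]
    return list(bad)
```

Running `check_urls(link_date, '网址')` before the download loop makes it easy to fix bad rows up front instead of discovering them mid-run.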

  
  

2. Simulate logging in to the URL and clicking the button to download the PDF

  
Next, simulate logging in with the Chrome browser, open the first URL from the code, and simulate a human click to download. The specific code is as follows:

import time
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.common.by import By
# Import the libraries

print('Program start time:', datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
# Simulate logging in with the Chrome browser
options = webdriver.ChromeOptions()
options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(options=options)
driver.implicitly_wait(10)
# Open the search page ('网址' is the URL column)
driver.get(link_date['网址'][0])
time.sleep(20)  # pause for 20 s while the page loads
# Click the button that downloads the PDF (Selenium 4 syntax)
driver.find_element(By.XPATH, '//*[@id="mdiv"]/div[3]/div[2]/a').click()

The opened page is shown in the figure below. Left-click the three dots in the red box at the far right of the browser, click [More Tools], then [Developer Tools]; the panel on the right of the figure appears. Next, left-click the element-picker arrow in the red box, move the mouse to the PDF link in the leftmost red box and left-click it; the element containing the href is highlighted in the red box on the right. Right-click that element, choose [Copy], then [Copy XPath], and you get the path used in driver.find_element.
  
[Image: locating the download link's XPath in Chrome DevTools]
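A more robust variant of the click step above is worth sketching: Chrome can be configured to save PDFs straight to a folder instead of opening them in the built-in viewer, and an explicit wait can replace the fixed time.sleep(20). This is an untested sketch against a hypothetical site; the XPath is the one copied from DevTools above, and the download folder matches the one set earlier.

```python
# Sketch: Chrome saves PDFs directly to DOWNLOAD_DIR, and the click waits
# only as long as needed instead of a fixed 20 s.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

DOWNLOAD_DIR = r'F:\老师\下载文件'                # same folder as os.chdir above
PDF_XPATH = '//*[@id="mdiv"]/div[3]/div[2]/a'    # XPath copied from DevTools

options = webdriver.ChromeOptions()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("prefs", {
    "download.default_directory": DOWNLOAD_DIR,    # where PDFs are saved
    "plugins.always_open_pdf_externally": True,    # download instead of previewing
})
driver = webdriver.Chrome(options=options)

def download_pdf(url, timeout=30):
    """Open `url` and click the download link as soon as it is clickable."""
    driver.get(url)
    WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable((By.XPATH, PDF_XPATH))
    ).click()
```

With a thousand pages, replacing a fixed 20-second sleep with an explicit wait can save hours when pages load quickly, while still tolerating slow ones.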

  
  

3. Write a loop to download all the files in batches

  
Write a loop to download all the files in batches. The simplest way is to traverse all the URLs and simulate a click to download each PDF. The code is as follows:

for i in range(0, 1000):
    print(i)
    # Open the page
    driver.get(link_date['网址'][i])
    time.sleep(20)  # pause for 20 s while the page loads
    driver.find_element(By.XPATH, '//*[@id="mdiv"]/div[3]/div[2]/a').click()

But this code has a problem: as soon as one URL behaves unexpectedly, the loop is interrupted and an error like the following is raised:
  
[Image: error traceback when the loop is interrupted]

  
At that point you have to check manually which files have already been downloaded, and then adjust the values in range() to resume. If you don't want to watch the code run, wrap the loop body in a try block and record the index of each successful download in the list lab. If something unexpected happens, the loop skips straight to the next URL, and after the full run you can work out which URLs were not downloaded. The specific code is as follows:

lab = []
for i in range(1, 1000):
    try:
        print(i)
        # Open the page
        driver.get(link_date['网址'][i])
        time.sleep(20)  # pause for 20 s while the page loads
        driver.find_element(By.XPATH, '//*[@id="mdiv"]/div[3]/div[2]/a').click()
        lab.append(i)  # record the index of each successful download
    except Exception:
        continue  # skip a problem URL and move on to the next one
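Once the loop finishes, lab holds the indices that were clicked successfully, so the failures are simply the gap between the planned range and lab. A small helper (the names here are my own, not from the original code) to list what still needs downloading:

```python
# Sketch: compare the planned indices against the recorded successes.
def missing_indices(planned, done):
    """Return the planned indices that are not in `done`, in order."""
    done_set = set(done)
    return [i for i in planned if i not in done_set]
```

For example, `retry = missing_indices(range(1, 1000), lab)` gives the indices to feed back through the same try-loop for a second pass.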

The final download result looks like this:
  
[Image: the downloaded PDF files in the target folder]

  
So far, the code for batch-downloading PDF files with a Python crawler has been explained; interested readers can try it out for themselves.
  
  

Origin blog.csdn.net/qq_32532663/article/details/132395715