Project background
For work, a senior colleague needed to collect, for an existing list of schools, each school's English name, short description, and abbreviation (where one exists). There are 2,740 school names in total; searching them one by one and copy-pasting the results into a spreadsheet would take a whole mind-numbing day.
I spent a few minutes writing a small crawler to free her hands, and it worked in one pass. She praised me warmly and treated me to a cup of Americano.
Analysing the problem
There are 2,740 school names in total.
Search for Tsinghua University on Baidu Baike and view the page source: you are pleasantly surprised to find the short description sitting right at the top of the page.
The page structure turns out to be simple. You can build a URL from each school name, request the page source, and then extract the fields we want from it.
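A minimal sketch of that extraction step, run against a made-up HTML snippet that mimics the page head (the snippet and its description text are mine, not from the real page). Selecting the meta tag by its name attribute is a sturdier alternative to a positional index like meta[4]:

```python
from lxml import etree

# Hypothetical snippet standing in for the encyclopedia page's <head>
sample = '''<html><head>
<meta name="description" content="Tsinghua University is a public university in Beijing.">
</head><body></body></html>'''

html = etree.HTML(sample)
# Pull the short description out of the meta tag's content attribute
description = ''.join(html.xpath('//meta[@name="description"]/@content'))
print(description)
```

The same XPath call works unchanged on the real page source once it is fetched.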
Crawler code
Now it's the crawler's turn to shine.
Import the required libraries
```python
import requests
import pandas as pd
from random import choice
from lxml import etree
import openpyxl
import logging
```
Basic configuration parameters
```python
# Basic configuration of log output
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')

wb = openpyxl.Workbook()  # Create a workbook object
sheet = wb.active         # Get the active worksheet
# Add the header row
sheet.append(['School Name', 'Chinese Abbreviation', 'School Name (English)', 'Description', 'Baidu Encyclopedia Link'])

# Pool of request headers to rotate through at random
user_agent = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    # ......
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
```
Read the data and crawl the pages
```python
# Read the existing school names
df = pd.read_excel('School Name.xlsx')['School Name']
items = df.values

# Visit each school's encyclopedia page in turn
for item in items:
    try:
        # Pick a random request header
        headers = {'User-Agent': choice(user_agent)}
        # Construct the URL
        url = f'https://baike.baidu.com/item/{item}'
        # Send the request and get the response
        rep = requests.get(url, headers=headers, timeout=10)
        # Parse the page and extract the data with XPath
        html = etree.HTML(rep.text)
        # Description
        description = ''.join(html.xpath('/html/head/meta[4]/@content'))
        # English name
        en_name = ','.join(html.xpath('//dl[@class="basicInfo-block basicInfo-left"]/dd[2]/text()')).strip()
        # The Chinese abbreviation sits under the dd[3] tag
        simple_name = ''.join(html.xpath('//dl[@class="basicInfo-block basicInfo-left"]/dd[3]/text()')).strip()
        # Write one row per school, matching the header columns
        sheet.append([item, simple_name, en_name, description, url])
        logging.info([item, simple_name, en_name, description, url])
    except Exception as e:
        logging.info(e.args)

# Save the results
wb.save('Result.xlsx')
```
The program runs as expected.
There are 2,740 pages to crawl; to improve throughput, you can use multiple threads.
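One way to sketch that with the standard library's concurrent.futures: here `fetch` is a hypothetical stand-in (the name is mine) for the per-school request-and-parse logic above. `pool.map` returns results in input order, so rows can be appended to the worksheet from the main thread without any locking:

```python
from concurrent.futures import ThreadPoolExecutor

def crawl_all(items, fetch, max_workers=8):
    """Run fetch(item) for every item concurrently.

    Results preserve the input order, so the caller can safely
    write them to the worksheet one by one afterwards.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, items))

# Usage sketch, assuming a fetch_school_row(name) wrapping the code above:
# for row in crawl_all(items, fetch_school_row):
#     sheet.append(row)
```

Since the work is I/O-bound (waiting on HTTP responses), threads give a real speedup despite Python's GIL; just keep `max_workers` modest to avoid hammering the site.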