Python crawler automation: helping the young lady free her hands!

Project background:

For work, a senior colleague needed to collect the English name, description, and abbreviation (where one exists) for each school on an existing list. There are 2,740 school names in total; searching them one by one and copy-pasting the results into a spreadsheet would leave anyone numb after a full day of it.

I took a few minutes to write a small crawler program to free the young lady's hands, and it worked in one go. She praised me for quite a while and treated me to a cup of Americano.
Analysing the problem

There are 2,740 school names in total.

Take Tsinghua University as an example and search for it on Baidu Encyclopedia.

Looking at the page source, we are pleasantly surprised to find that the brief description sits right near the top, in a <meta> tag inside the page head!

After some analysis, the page structure turns out to be simple: build the URL for each entry, request the page source, and then extract the data we want from it.
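For example, here is a minimal sketch (using the same meta-tag XPath as the full script below) that fetches a single entry and prints its brief description; the entry name in the URL is assumed to be the Chinese school name as it appears on Baidu Encyclopedia:

import requests
from lxml import etree

# Fetch one encyclopedia entry (Tsinghua University as an example) and pull
# the brief description from the <meta> tag in the page head
url = 'https://baike.baidu.com/item/清华大学'
rep = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
html = etree.HTML(rep.text)
description = ''.join(html.xpath('/html/head/meta[4]/@content'))
print(description)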

Crawler code

Ta-da, now it's time for our crawler to take the stage.

Import the required libraries

import requests
import pandas as pd
from random import choice
from lxml import etree
import openpyxl
import logging

Basic configuration parameters

# Basic configuration of log output
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')

wb = openpyxl.Workbook()   # Create a workbook object
sheet = wb.active          # Get the active worksheet
# Add the header row
sheet.append(['School Name', 'Chinese Abbreviation', 'School Name (English)', 'Description', 'Baidu Encyclopedia Link'])

# Pool of User-Agent strings to switch between at random
user_agent = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    # ...... (more User-Agent strings omitted)
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]

 

Read the data and crawl the pages

# Read the existing school name data
df = pd.read_excel('School Name.xlsx')['School Name']
items = df.values

# Traverse the school names one by one
for item in items:
    try:
        # Pick a random request header
        headers = {
            'User-Agent': choice(user_agent)
        }
        # Construct the URL
        url = f'https://baike.baidu.com/item/{item}'
        # Send the request and get the response
        rep = requests.get(url, headers=headers)
        # Parse and extract data with XPath
        html = etree.HTML(rep.text)
        # Description (brief intro in the <meta> tag of the page head)
        description = ''.join(html.xpath('/html/head/meta[4]/@content'))
        # Foreign (English) name
        en_name = ','.join(html.xpath('//dl[@class="basicInfo-block basicInfo-left"]/dd[2]/text()')).strip()
        # Chinese abbreviation sits under the dd[3] tag
        simple_name = ''.join(html.xpath('//dl[@class="basicInfo-block basicInfo-left"]/dd[3]/text()')).strip()
        sheet.append([item, simple_name, en_name, description, url])
        logging.info([item, simple_name, en_name, description, url])
    except Exception as e:
        logging.info(e.args)

# Save the data
wb.save('Result.xlsx')

The script runs as follows:

There are 2,740 pages to crawl in total; to improve crawling efficiency, you can use multiple threads, as in the sketch below.
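If you go that route, here is a rough sketch using concurrent.futures. crawl() is a hypothetical helper that wraps the per-school request and XPath parsing from the script above and returns one result row; rows are appended in the main thread because openpyxl worksheets are not thread-safe:

from concurrent.futures import ThreadPoolExecutor

def crawl(item):
    # ... the same request + XPath parsing as above, returning
    # [item, simple_name, en_name, description, url], or None on failure ...
    ...

with ThreadPoolExecutor(max_workers=10) as executor:
    for row in executor.map(crawl, items):
        if row:  # append in the main thread; openpyxl is not thread-safe
            sheet.append(row)

wb.save('Result.xlsx')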

Origin blog.csdn.net/weixin_43881394/article/details/112360051