Python universal code template: crawler code

Hello, I am Yuechuang.

When many students hear "Python" or "programming language", their reflex is often "this is hard". But today's Python lesson is an exception, because **the Python skills taught here require no knowledge of computer internals and no complex programming models.** Even non-developers can get results by simply swapping in their own links and file paths.

These practical tips are among the best everyday uses of Python. For example:

  • Crawl documents, tables, and learning materials;
  • Play with charts and generate data visualizations;
  • Rename files in batches to automate office work;
  • Process pictures in batches: add watermarks and resize them.

Next, we will implement these with Python one by one. The code I provide is a general-purpose template: replace the web page link, file location, or picture with your own and it will work.

If you have not installed Python or set up the environment yet, you can refer to my previous article.

**Tips:** Because data from different chapters may be cross-referenced, I recommend first creating a work folder on the desktop, then creating a separate Python file for each chapter to experiment in. For example, create a new pytips directory, then a tips folder for each chapter inside it, each holding the corresponding .py files. (Your folder layout may differ from mine; adjust the paths to suit.)
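If you prefer to script this setup, here is a minimal sketch (the folder names are only examples; adapt them to your own layout):

```python
from pathlib import Path

# Create a pytips work folder on the desktop, with one subfolder per chapter.
base = Path.home() / "Desktop" / "pytips"
for chapter in ["tips_1", "tips_2", "tips_3"]:
    (base / chapter).mkdir(parents=True, exist_ok=True)
```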

1. Use Python crawlers skillfully to realize "wealth freedom"

First up: crawling with Python. What is a crawler? Simply put, it is a program that grabs data (documents, materials, pictures, and so on) from the network. For example, you can crawl documents and study materials for postgraduate entrance exams, analyze tables found online, or download pictures in batches.

Let's take a look at how to implement each one.

1.1 Crawling documents and learning materials

First, decide which website you want to crawl and what you want from it. For example, Xiaoyue wants to crawl the exam application guide on the Qingyan Gang website, collecting the titles and hyperlinks of all articles on the current page for later browsing.

Link to crawl: https://zkaoy.com/sions/exam
Goal: collect the titles and hyperlinks of all articles on the current page

With Python, you can implement this with the two-step code template below (reminder: you need to install the Python dependencies first).
Install the required libraries:

pip install urllib3 beautifulsoup4 requests lxml

The first step is to download the web page and save it to a file; the code is as follows.
**PS:** For clarity, I split this into two code files here; they will be merged into one later.

```python
# urllib3 version
# file_name: Crawler_urllib3.py
import urllib3


def download_content(url):
    """
    First function: download a web page and return its content.
    The parameter url is the address of the page to download.
    """
    http = urllib3.PoolManager()
    response = http.request("GET", url)
    response_data = response.data
    html_content = response_data.decode()
    return html_content


# Second function: save a string to a file.
# The first parameter is the file name to save to,
# the second is the variable holding the string content.
def save_to_file(filename, content):
    fo = open(filename, "w", encoding="utf-8")
    fo.write(content)
    fo.close()


def main():
    # Download the exam application guide page
    url = "https://zkaoy.com/sions/exam"
    result = download_content(url)
    save_to_file("tips1.html", result)


if __name__ == '__main__':
    main()
```


```python
# requests version
# file_name: Crawler_requests.py
import requests


def download_content(url):
    """
    First function: download a web page and return its content.
    The parameter url is the address of the page to download.
    """
    response = requests.get(url).text
    return response


# Second function: save a string to a file.
# The first parameter is the file name to save to,
# the second is the variable holding the string content.
def save_to_file(filename, content):
    with open(filename, mode="w", encoding="utf-8") as f:
        f.write(content)


def main():
    # Download the exam application guide page
    url = "https://zkaoy.com/sions/exam"
    result = download_content(url)
    save_to_file("tips1.html", result)


if __name__ == '__main__':
    main()
```

The second step is to parse the web page and extract the links and titles of the articles.

```python
# file_name: html_parse.py
# Parsing method one
from bs4 import BeautifulSoup

# Input: the name of the html file to analyse;
# returns the corresponding BeautifulSoup object.
def create_doc_from_filename(filename):
    with open(filename, "r", encoding='utf-8') as f:
        html_content = f.read()
        # No parser specified, so bs4 picks the best one available
        doc = BeautifulSoup(html_content)
    return doc

def parse(doc):
    post_list = doc.find_all("div", class_="post-info")
    for post in post_list:
        # The second <a> in each post block carries the title and link
        link = post.find_all("a")[1]
        print(link.text.strip())
        print(link["href"])

def main():
    filename = "tips1.html"
    doc = create_doc_from_filename(filename)
    parse(doc)

if __name__ == '__main__':
    main()
```


```python
# file_name: html_parse_lxml.py
# Parsing method two: explicitly specify the parser
from bs4 import BeautifulSoup

# Input: the name of the html file to analyse;
# returns the corresponding BeautifulSoup object.
def create_doc_from_filename(filename):
    with open(filename, "r", encoding='utf-8') as f:
        html_content = f.read()
        # Here the lxml parser is specified explicitly
        soup = BeautifulSoup(html_content, "lxml")
    return soup

def parse(soup):
    post_list = soup.find_all("div", class_="post-info")
    for post in post_list:
        # The second <a> in each post block carries the title and link
        link = post.find_all("a")[1]
        print(link.text.strip())
        print(link["href"])

def main():
    filename = "tips1.html"
    soup = create_doc_from_filename(filename)
    parse(soup)

if __name__ == '__main__':
    main()
```

**PS:** The two versions are almost identical; the only difference is that the second one explicitly specifies the parser, lxml.
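For reference, here is the difference in one place: `html.parser` ships with Python, while `lxml` is a faster third-party parser that must be installed separately. This is a minimal illustration, not part of the template:

```python
from bs4 import BeautifulSoup

html = "<div class='post-info'><a href='/a'>t1</a><a href='/b'>t2</a></div>"

# Built-in parser: no extra dependency.
soup1 = BeautifulSoup(html, "html.parser")
# lxml parser: faster and more lenient; requires `pip install lxml`.
soup2 = BeautifulSoup(html, "lxml")

print(soup1.find("a")["href"], soup2.find("a")["href"])  # /a /a
```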

After executing the code, you can see the titles and links from the web page printed to the screen:

```
敲黑板!这些省份往届生不能预报名!
https://zkaoy.com/15123.html
二战必须回户籍所在地考吗?
https://zkaoy.com/15103.html
这些同学不能参加预报名!不注意,有可能考研报名失败!
https://zkaoy.com/15093.html
呜呼~考研报名费,这种情况可以退款!
https://zkaoy.com/15035.html
注意:又发通知!22研招有4点变化??
https://zkaoy.com/14977.html
2021考研初试时间定了!正式网报时间有变化
https://zkaoy.com/14915.html
快码住!考研前的这些关键时间点,千万不能错过!
https://zkaoy.com/14841.html
近万名考生考研报名失败!问题出在这!22考研一定注意!
https://zkaoy.com/14822.html
往届生比应届生更容易上岸,你认同吗?
https://zkaoy.com/14670.html
各省市考研报名费用!
https://zkaoy.com/14643.html
要开始报名了?现在不需要担心,没你想的那么复杂……
https://zkaoy.com/14620.html
教育部公布重要数据:研究生扩招20.74%!
https://zkaoy.com/14593.html
虚假招生?这一高校临近开学取消奖学金!
https://zkaoy.com/14494.html
下个月要预报名了,高频问题早知道
https://zkaoy.com/14399.html
注意!这些网报信息要准备好,否则影响9月考研报名!
https://zkaoy.com/14352.html
想考上研,各科应该考多少分?
https://zkaoy.com/14273.html
选择报考点需要注意什么?报考点有限制!
https://zkaoy.com/14161.html
各地考研报名费汇总!快来看看你要交多少钱!
https://zkaoy.com/14158.html
考研高校推免人数公布,统考名额还剩多少?
https://zkaoy.com/14092.html
这几所高校考研参考书有变!参考书目要怎么搜集?
https://zkaoy.com/14061.html
院校指南
https://zkaoy.com/sions/zxgg1
这些要提前准备好!不然影响报名!
https://zkaoy.com/13958.html
救命!近万人因为这个,错失考研机会!
https://zkaoy.com/13925.html
考研如何看招生简章和招生目录?
https://zkaoy.com/13924.html
```

Above, I broke the process apart for clarity; now it can be merged into a single code file:

```python
# file_name: Crawler.py
import requests
from bs4 import BeautifulSoup


def download_content(url):
    """
    First function: download a web page and return its content.
    The parameter url is the address of the page to download.
    """
    response = requests.get(url).text
    return response


# Second function: save a string to a file.
# The first parameter is the file name to save to,
# the second is the variable holding the string content.
def save_to_file(filename, content):
    with open(filename, mode="w", encoding="utf-8") as f:
        f.write(content)

def create_doc_from_filename(filename):
    # Input: the html file to analyse; returns the corresponding BeautifulSoup object
    with open(filename, "r", encoding='utf-8') as f:
        html_content = f.read()
        soup = BeautifulSoup(html_content, "lxml")
    return soup

def parse(soup):
    post_list = soup.find_all("div", class_="post-info")
    for post in post_list:
        # The second <a> in each post block carries the title and link
        link = post.find_all("a")[1]
        print(link.text.strip())
        print(link["href"])


def main():
    # Download the exam application guide page
    url = "https://zkaoy.com/sions/exam"
    filename = "tips1.html"
    result = download_content(url)
    save_to_file(filename, result)
    soup = create_doc_from_filename(filename)
    parse(soup)

if __name__ == '__main__':
    main()
```

Code file: https://github.com/AndersonHJB/AIYC_DATA/tree/main/01-Python universal code template: 10 must-learn practical skills/1.1 Skillfully use Python crawlers to realize wealth freedom

So how do you adapt this when you want to crawl other web pages? You only need to replace a few places:

  1. Replace with the URL of the web page you want to download.
  2. Replace with the file name under which the web page is saved.
  3. Adjust the BeautifulSoup calls, which we use to pull the content we want out of the HTML structure step by step. Here we first find all div tags whose class is post-info, then read the second <a> tag inside each. If the structure of the page you parse is different, adapt the selectors accordingly (see the sketch after this list), and refer to https://www.aiyc.top/673.html#6 (basic operations of the Requests and BeautifulSoup libraries) for details on BeautifulSoup.
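For instance, suppose a different site wrapped each article title in an `<h2 class="entry-title">` heading (a made-up structure, purely for illustration); then only the parse function needs to change, along these lines:

```python
from bs4 import BeautifulSoup

def parse(soup):
    # Hypothetical structure: <h2 class="entry-title"><a href="...">Title</a></h2>
    for h2 in soup.find_all("h2", class_="entry-title"):
        link = h2.find("a")
        if link is None:
            continue  # skip headings without a link
        print(link.text.strip())
        print(link["href"])
```

The download and save steps stay exactly the same; only the selectors tracking the page's HTML structure change.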

1.2 Grab tables and do data analysis

We often come across useful tables while browsing and want to save them for later use, but copying them straight into Excel is prone to deformed layouts, garbled characters, or broken formats. With Python you can save a web page's table cleanly. (Hint: install the dependencies first: urllib3 and pandas; pandas also needs lxml to parse HTML tables and openpyxl to write .xlsx files.)

pip install urllib3 pandas lxml openpyxl

Take the foreign exchange page of China Merchants Bank as an example:
The Python code is as follows:

```python
# file_name: excel_crawler_urllib3.py
import urllib3
import pandas as pd

def download_content(url):
    # Create a PoolManager object named http
    http = urllib3.PoolManager()
    # Call the http object's request method: the first argument is the
    # string "GET", the second is the URL to download (our url variable).
    # request returns an HTTPResponse object, which we name response.
    response = http.request("GET", url)

    # Get the data attribute of response and store it in response_data
    response_data = response.data

    # Decode response_data to obtain the page content as a string,
    # stored in html_content
    html_content = response_data.decode()
    return html_content

def save_excel():
    html_content = download_content("http://fx.cmbchina.com/Hq/")
    # Call read_html with the page content and store the result in cmb_table_list.
    # read_html returns a list of DataFrames, one per <table> on the page.
    cmb_table_list = pd.read_html(html_content)
    # By printing each list element we confirmed that the one we need
    # is the second, i.e. index 1.
    cmb_table_list[1].to_excel("tips2.xlsx")

def main():
    save_excel()

if __name__ == '__main__':
    main()
```


```python
# file_name: excel_crawler_requests.py
import requests
import pandas as pd
from requests.exceptions import RequestException


def download_content(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        # Return None on a non-200 status so the caller can bail out
        return None
    except RequestException:
        return None


def save_excel(filename):
    html_content = download_content("http://fx.cmbchina.com/Hq/")
    if html_content is None:
        print("Download failed")
        return
    # read_html returns a list of DataFrames, one per <table> on the page
    cmb_table_list = pd.read_html(html_content)
    # By printing each list element we confirmed that the one we need
    # is the second, i.e. index 1.
    # print(cmb_table_list)
    cmb_table_list[1].to_excel(filename)


def main():
    filename = "tips2.xlsx"
    save_excel(filename)

if __name__ == '__main__':
    main()
```

After execution, an Excel file named tips2.xlsx is generated; open it and you will see the captured table.
When you want to grab a table of your own, just replace the following 3 parts.

  1. Replace with the name of the Excel file you want to save;
  2. Replace with the URL of the page containing the table you want to crawl;
  3. Replace with the index of the table, i.e. which table on the page you want to grab (see the sketch after this list).
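If you are not sure which index your table has, one quick way to check (a sketch, assuming the same China Merchants Bank page as above) is to print the shape and a preview of every DataFrame that read_html returns:

```python
import urllib3
import pandas as pd
from io import StringIO

http = urllib3.PoolManager()
html = http.request("GET", "http://fx.cmbchina.com/Hq/").data.decode()

# read_html returns one DataFrame per <table> on the page.
# (Wrapping the literal HTML in StringIO is what newer pandas versions prefer.)
for i, table in enumerate(pd.read_html(StringIO(html))):
    print(i, table.shape)  # index and (rows, columns) of each table
    print(table.head(2))   # a two-row preview to identify it
```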

Code link: https://github.com/AndersonHJB/AIYC_DATA/tree/main/01-Python%20Universal code template: 10%20 must learn practical skills/1.2%20 Grab tables and do data analysis

1.3 Batch download pictures

When we see a lot of pictures we like on a web page, saving them one by one is quite inefficient.

We can also download pictures quickly with Python. Take Duitang (duitang.com, the site used in the code below) as an example; we come across this page.
The pictures look good, and we would like to download them all. The approach is roughly the same as in 1.1.

We first download the web page, then parse out the img tags in it, and then download the pictures. Before running, create a folder tips_3 in the working directory to store the downloads (the script below also creates it if it is missing).

First, download the web page; the Python code is as follows.

```python
# -*- coding: utf-8 -*-
# @Author:
# @Date:   2021-09-13 20:16:07
# @Last Modified by:   aiyc
# @Last Modified time: 2021-09-13 21:02:58
import urllib3

# First function: download a web page and return its content.
# The parameter url is the address of the page to download.
def download_content(url):
    http = urllib3.PoolManager()
    response = http.request("GET", url)
    response_data = response.data
    html_content = response_data.decode()
    return html_content

# Second function: save a string to a file.
# The first parameter is the file name to save to,
# the second is the variable holding the string content.
def save_to_file(filename, content):
    fo = open(filename, "w", encoding="utf-8")
    fo.write(content)
    fo.close()

url = "https://www.duitang.com/search/?kw=&type=feed"
result = download_content(url)
save_to_file("tips3.html", result)
```

Then extract the img tags and download the pictures.

```python
import os

from bs4 import BeautifulSoup
from urllib.request import urlretrieve

# Input: the name of the html file to analyse;
# returns the corresponding BeautifulSoup object.
def create_doc_from_filename(filename):
    fo = open(filename, "r", encoding='utf-8')
    html_content = fo.read()
    fo.close()
    doc = BeautifulSoup(html_content, "lxml")
    return doc

# Make sure the target folder exists before downloading
os.makedirs("tips_3", exist_ok=True)

doc = create_doc_from_filename("tips3.html")
images = doc.find_all("img")
for i in images:
    src = i.get("src")
    if not src:
        continue  # skip <img> tags without a src attribute
    filename = src.split("/")[-1]
    urlretrieve(src, "tips_3/" + filename)
```

After the execution completes, open the tips_3 directory and you will see that the pictures have been downloaded.
Replacement instructions are as follows.

  1. Replace with the file name under which the web page is saved;
  2. Replace with the URL of the web page you want to download;
  3. Replace with the folder where the pictures should be saved (create it first, or let the script create it).

In addition, some websites render the page first and load their pictures dynamically afterwards; downloading such dynamically loaded images is not supported by this template.
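One partial workaround worth knowing: many lazy-loading pages put the real image URL in a `data-src` attribute instead of `src`. Here is a hedged variant of the extraction loop that tries both (it still cannot recover images injected purely by JavaScript):

```python
from bs4 import BeautifulSoup
from urllib.request import urlretrieve

with open("tips3.html", "r", encoding="utf-8") as f:
    doc = BeautifulSoup(f.read(), "lxml")

for img in doc.find_all("img"):
    # Prefer data-src (common with lazy loading), fall back to src.
    src = img.get("data-src") or img.get("src")
    if not src or not src.startswith("http"):
        continue  # this simple sketch skips relative or missing URLs
    filename = src.split("/")[-1].split("?")[0]  # drop any query string
    urlretrieve(src, "tips_3/" + filename)
```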
Code link: https://github.com/AndersonHJB/AIYC_DATA/tree/main/01-Python%20Universal code template: 10%20 must learn practical skills/1.3%20Batch download pictures
