A simple, easy-to-understand overview and practice of Python crawlers: a must-read for beginners!

Article directory

  • 1. First understand how users obtain network data
  • 2. Briefly understand the composition of web page source code
    • 1. Basic web programming languages
    • 2. Use a browser to view a web page's source code
  • 3. Overview of crawlers
    • 1. Getting to know crawlers
    • 2. Python crawlers
    • 3. Crawler classification
    • 4. Crawler applications
    • 5. Crawlers are a double-edged sword
    • 6. Python crawler tutorial
    • 7. The process of writing a crawler
  • 4. Python crawler practice: obtaining blog views

Preface: In a nutshell, a Python crawler obtains web page data and then extracts what you need from it. Although the process sounds simple, implementing it well requires combining several technologies and being proficient with crawler libraries so that you can write efficient crawler code.

1. First understand how users obtain network data

1. Via a browser: the browser submits a request -> downloads the web page code -> renders it into a page

2. Automatic acquisition with code: simulate a browser sending a request (to obtain the web page code: HTML, CSS, JavaScript) -> extract the useful data -> store it in a database or file

A Python crawler uses code to obtain the data automatically: it sends the request, downloads the page source, extracts the data it needs, and then stores it.
(figure: the Python crawler workflow)

2. Briefly understand the composition of web page source code

1. Basic web programming languages

For a quick introduction, you can follow any beginners' web tutorial.
1) HTML, CSS, and JavaScript are the three languages every web developer must learn. They work together to build all kinds of rich websites:
(1) HTML defines the content of the web page
(2) CSS describes the layout of the web page
(3) JavaScript controls the behavior of the web page

2) A simple HTML5 example
(figure: a sample HTML5 page)

3) JavaScript in HTML
(figure: JavaScript embedded in an HTML page)

4) CSS in HTML
(figure: CSS embedded in an HTML page)

2. Use a browser to view a web page's source code

1) Right-click on the web page and choose "View page source"; this source is exactly the content we want to obtain.
(figure: the browser's view-source window)

2) The HTML code: you only need a rough understanding of how it is structured so that you can extract the content you want from it, as in the sketch below.
(figure: the HTML source of an example page)
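
A minimal sketch of that extraction step, assuming the third-party beautifulsoup4 package is installed; the HTML snippet, tag names, and class names below are made up for illustration:

# Extract fields from a small HTML snippet once you know its structure.
# Requires: pip install beautifulsoup4
from bs4 import BeautifulSoup

html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1 class="post-title">Hello, crawler</h1>
    <span class="read-count">1024</span>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)                            # Demo page
print(soup.find("h1", class_="post-title").text)    # Hello, crawler
print(soup.find("span", class_="read-count").text)  # 1024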

3. Overview of crawlers

1. Getting to know crawlers

The search engines we are all familiar with are essentially large web crawlers, for example Baidu, Sogou, 360 Search, and Google. Each search engine has its own crawler program; for instance, 360's crawler is called 360Spider and Sogou's is called Sogou Spider.

2. Python crawlers

The Baidu search engine can, more vividly, be called "Baidu Spider". Every day it crawls and collects high-quality information from the massive amount of content on the Internet. When a user searches for a keyword on Baidu, Baidu first analyzes the keyword, then finds relevant pages among the pages it has indexed, sorts them according to its ranking rules, and finally presents the sorted results to the user. Throughout this process, Baidu Spider plays a very key role.

Baidu's engineers have written corresponding crawler algorithms for "Baidu Spider". By applying these algorithms, "Baidu Spider" can implement the corresponding search strategies, such as filtering out duplicate web pages and screening for high-quality pages. Different algorithms lead to different crawling efficiency and different crawling results.
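
As a toy illustration (not Baidu's actual algorithm), the simplest form of duplicate filtering is just remembering which URLs have already been crawled:

# Toy de-duplication strategy: skip URLs we have already crawled.
# Real search-engine crawlers use far more sophisticated methods.
seen_urls = set()

def should_crawl(url: str) -> bool:
    """Return True only for URLs that have not been crawled yet."""
    if url in seen_urls:
        return False
    seen_urls.add(url)
    return True

urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",  # duplicate, will be filtered out
]
print([u for u in urls if should_crawl(u)])
# ['https://example.com/a', 'https://example.com/b']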

3. Crawler classification

Crawlers can be divided into three major categories: general-purpose web crawlers, focused web crawlers, and incremental web crawlers.

1) General-purpose web crawler: an important component of a search engine. It was introduced above, so it will not be repeated here. General-purpose crawlers need to comply with the robots protocol, which a website uses to tell search engines which pages may be crawled and which may not.

Robots protocol: it is a customary convention with no legal force. It embodies the "contractual spirit" of the Internet community, and industry practitioners abide by it voluntarily, which is why it is also called a "gentlemen's agreement". (A short sketch of checking a site's robots.txt from Python appears after this list.)

2) Focused web crawler: a crawler program oriented to specific needs. The difference from a general-purpose crawler is that a focused crawler filters and processes page content while crawling, trying to ensure that only information relevant to the need is captured. Focused crawlers greatly save hardware and network resources, and because the number of pages they save is small, updates are very fast. This also meets the needs of specific groups for information in specific fields.

3) Incremental web crawler: refers to incrementally updating the set of downloaded pages, that is, a crawler that only fetches newly generated or changed pages, which guarantees to some extent that the crawled pages are the latest versions.
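
As mentioned in item 1), a well-behaved crawler should respect robots.txt. Here is a minimal sketch using the standard-library urllib.robotparser module; the site URL is just a placeholder:

# Check robots.txt before crawling, using only the standard library.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")  # placeholder site, replace with the target site
rp.read()

# can_fetch(user_agent, url) tells us whether this user agent may crawl the given URL
if rp.can_fetch("MyCrawler", "https://www.example.com/some/page"):
    print("Allowed to crawl this page")
else:
    print("robots.txt disallows crawling this page")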

4. Crawler applications

With the rapid development of the Internet, the World Wide Web has become the carrier of a huge amount of information, and effectively extracting and using this information has become an enormous challenge. This is where crawlers come in: they are widely used not only in the search-engine field but also in big-data analysis and in commercial applications.

1) Data analysis

In the field of data analysis, web crawlers are often an essential tool for collecting massive amounts of data. For data analysts, to conduct data analysis, they must first have data sources, and by learning crawlers, they can obtain more data sources. During the collection process, data analysts can collect more valuable data according to their own purposes and filter out invalid data.

2) Business field

For enterprises, it is crucial to obtain market dynamics and product information in a timely manner. Enterprises can purchase data through third-party platforms, such as Guiyang Big Data Exchange, Data Hall, etc. Of course, if your company has a crawler engineer, you can obtain the desired information through a crawler.

5. Crawlers are a double-edged sword

Crawlers are a double-edged sword: while they bring us convenience, they also create hidden dangers for network security. Some criminals use crawlers to illegally collect netizens' personal information, or use crawlers to maliciously attack other people's websites, with serious consequences such as taking the site down entirely. Regarding how to use crawlers legally, it is recommended to read the Cybersecurity Law of the People's Republic of China.

6. Python crawler tutorial

To limit the harm crawlers can cause, most websites have good anti-crawling measures in place. When writing crawlers, you must consciously abide by the robots protocol and must not illegally obtain other people's information or do things that damage other people's websites.

Why use Python to write crawlers?
First of all, be clear that Python is not the only language you can use: PHP, Java, and C/C++ can all be used to write crawler programs. By comparison, though, Python makes it easiest. Here is a brief comparison of their advantages and disadvantages:

PHP: its support for multi-threading and asynchronous processing is not very good, and its concurrency capability is weak. Java is also often used to write crawlers, but the language itself is cumbersome and verbose, so the entry barrier for beginners is high. C/C++ runs efficiently, but the cost of learning and development is high, and even a small crawler program can take a long time to write.

Python has elegant syntax, concise code, and high development efficiency, and it supports many crawler-related modules, such as urllib, requests, and bs4. Its request and parsing modules are rich and mature, and it also provides the powerful Scrapy framework, which makes writing crawler programs much easier. Using Python to write crawlers is therefore a very good choice.
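
For example, with the third-party requests library, fetching a page takes only a few lines (the URL below is just a placeholder):

# Fetch a page with the third-party requests library (pip install requests).
import requests

response = requests.get("https://example.com", timeout=10)  # placeholder URL
print(response.status_code)   # e.g. 200
print(response.text[:200])    # first 200 characters of the HTML source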

7. The process of writing a crawler

Crawler programs differ from other programs in that their overall logic is generally similar, so we do not need to spend much time designing the logic. The following briefly explains the process of writing a crawler program in Python; a minimal end-to-end sketch follows the list.

1. Use the urllib.request module to open the URL and obtain the HTML of the web page.
2. Open the page source in a browser and analyze the page structure and element nodes.
3. Extract the data with Beautiful Soup or regular expressions.
4. Store the data on local disk or in a database.
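
Here is a minimal sketch of those four steps; the URL, output file name, and extracted field are placeholders, and it assumes beautifulsoup4 is installed:

# Step 1: request the page; step 3: parse it; step 4: store the result.
# (Step 2, analysing the page structure, is done by hand in the browser.)
import urllib.request
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

url = "https://example.com"  # placeholder URL

# 1. Open the URL and read the HTML source
with urllib.request.urlopen(url, timeout=10) as resp:
    html = resp.read().decode("utf-8")

# 3. Extract data (here: the page title) with Beautiful Soup
soup = BeautifulSoup(html, "html.parser")
title = soup.title.string if soup.title else ""

# 4. Store the data in a local file
with open("result.txt", "w", encoding="utf-8") as f:
    f.write(title + "\n")

print("Saved title:", title)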

Of course, the process is not limited to the steps above. Writing a crawler also requires solid Python programming skills so that you are comfortable during the writing process. A crawler should try its best to look like a human visiting the website rather than a machine; otherwise it will be restricted by the site's anti-crawling policy, or its IP may even be blocked outright.

4. Python crawler practice: obtaining blog views

Without further ado, here is the code:

import re
import requests
from requests import RequestException

url = "https://blog.csdn.net/STCNXPARM/article/details/122297801"

def get_page(url):
    try:
        # Request headers: without them, anti-crawler sites will recognize the
        # request as coming from a crawler and return no data
        headers = {
            'Referer': 'https://blog.csdn.net',  # pretend we arrived from a CSDN blog search
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'  # pretend to be a browser
        }
        # Fetch the web page source code
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Request failed')
        return None


def parse_page(html):
    try:
        # Use a regular expression to match the view-count field in the HTML
        read_num = int(re.compile(r'<span.*?read-count.*?(\d+).*?</span>').search(html).group(1))
        # Return the view count
        return read_num
    except Exception:
        print('Parsing failed')
        return None


def main():
    try:
        html = get_page(url)
        if html:
            read_num = parse_page(html)
            if read_num:
                print('Current view count:', read_num)
    except Exception:
        print('Something went wrong!')

if __name__ == '__main__':
    main()

Running result:
(figure: console output showing the current view count of the post)

