[Python web crawler introductory tutorial 1] The first lesson in becoming a "Spider Man": HTML, the Requests library, and the Beautiful Soup library

Written at the front

A reader hoped to pick up practical web-crawling skills and wanted to try building a crawler environment of their own to collect data from the Internet.

I shared a blog post on this before, but its content felt too simple:
[A super simple crawler demo] Explore Sina: Use Python crawler to obtain dynamic web page data

For this installment I invited a friend who is good at crawling, @PoloWitty, to write this post. Through his professional perspective and hands-on experience, he will guide us step by step toward becoming a "Spider Man" of data exploration.

[Python web crawler introductory tutorial 1] The first lesson in becoming a "Spider Man": HTML, the Requests library, and the Beautiful Soup library
[Python web crawler introductory tutorial 2] The second lesson in becoming a "Spider Man": observing the target website and writing the code
[Python web crawler introductory tutorial 3] The third lesson in becoming a "Spider Man": from requests to Scrapy, crawling the target website


As data on the Internet grows exponentially, knowing how to extract this information effectively becomes increasingly important. Whether it is text models such as ChatGPT or vision models such as Stable Diffusion, most of their training data comes from the massive amounts of data on the Internet. In this fast-changing era of big data, crawling is a basic branch of the skill tree that is well worth learning.

This series of articles will introduce the basics of Python web crawling in a simple, approachable way, from the Requests library to getting started with the Scrapy framework, opening the door to Python web crawling for you and helping you become a "Spider Man". In the end, we will use the ScrapeMe website as the target example and crawl the cute and fun Pokémon pictures on it.

Before we start, a few words of caution. Although web crawlers are powerful, when using them you must comply with laws and regulations as well as each website's crawling policy (its robots.txt). Do not crawl data illegally~


This is the first article in this series. It will introduce the Internet front-end background knowledge you need to know in the process of writing web crawlers, as well as the use of two simple and easy-to-use related libraries.

Through this tutorial, you will learn how to start using Python for web crawling to better utilize available network resources in this data-driven era. We look forward to seeing you become a data exploration “Spider Man”!

Feel free to share the problems you run into while crawling in the comments, and let's discuss and learn together!

Background knowledge introduction

To write a successful Python crawler, the key is to tailor your design to the target website, and you also need to be ready to deal with common anti-crawling mechanisms. The skill tree of Python crawling is broad and deep; even the most basic crawler touches on front-end knowledge such as HTML, CSS and JavaScript. Here, we'll briefly cover these basics to lay a solid foundation for your crawling journey.

HTML is the skeleton of a web page; it defines the page content through various tags. For example, the <img> tag marks an image, the <a> tag marks a link, and text can be marked with <p> (paragraph) or <h1> to <h6> (heading) tags. When crawling, you will filter content based on these tags. CSS is the master of beautification: the most common approach is to add class names to HTML elements to define their styles. For example, you can assign the same class to all heading elements to keep their appearance consistent.

Next comes JavaScript, which makes the page come alive! A common pattern is to use methods such as document.getElementById and document.getElementsByClassName to grab elements on the page: document.getElementById('someId') returns the element with a specific ID, and document.getElementsByClassName('someClass') returns all elements sharing a class name. With these methods you can easily locate all kinds of things on the page, which in turn lets your crawler capture the information it needs more precisely!

Of course, what is covered here is only the knowledge you are most likely to need while writing crawlers. If you want more, you can look for related courses online (such as the Rookie Tutorial / Runoob); we will not go into further detail here.

Next, let's take a look at the first library we will use when writing a crawler: the Requests library.

The Web Shooter - the Requests Library

Now, let's talk about the requests library. The requests library is like Spider-Man's web shooter! Just as Spider-Man can easily latch onto tall buildings with his web shooter, the requests library can easily grab data from web pages!

Just as Spider-Man can quickly shoot a web at his target, the requests.get() method can quickly send a request to a website and grab the information you want. And just as Spider-Man can adjust the strength and angle of his webbing as needed, you can tune a request with the different parameters of the requests library to make it fit your needs!

Sometimes you need to give the server a hint and tell it who you are; for that, use the headers parameter of requests, just like leaving a note at the door. And if you want to send data to a website, such as login information or form content, the requests library can handle that too!
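
As a quick sketch of both ideas before the full example below, here is roughly what they look like in code. The URL points at httpbin.org, a public request-echo service often used for testing, and the header and form values are made up purely for illustration:

import requests

# pretend to be a regular browser by sending a User-Agent header (example value)
headers = {'User-Agent': 'Mozilla/5.0 (my-first-crawler)'}
response = requests.get('https://httpbin.org/get', headers=headers, params={'q': 'pokemon'})
print(response.status_code)  # 200 means the request succeeded

# send form data with a POST request (the field names here are made up)
payload = {'username': 'spider_man', 'password': 'not-a-real-password'}
response = requests.post('https://httpbin.org/post', data=payload)
print(response.json())  # httpbin echoes back what it received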

Let’s give an example to put it into practice:

import requests
import json

url = 'http://t.weather.sojson.com/api/weather/city/101010100'
response = requests.get(url)  # send the request and get the response

obj = json.loads(response.text)  # parse the returned JSON text into a Python object
print(obj)  

With the code above, you can use this weather API to get the weather for Beijing. What the API returns is actually text in JSON format, and we can use the json.loads() method to load it into a Python object.
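
As a small aside, the requests library also has a built-in shortcut for this: the response object's json() method parses the body for you, so you don't need to import json yourself. A minimal sketch:

import requests

url = 'http://t.weather.sojson.com/api/weather/city/101010100'
response = requests.get(url)
obj = response.json()  # equivalent to json.loads(response.text)
print(type(obj))       # typically a dict for a JSON object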

Okay, having learned this much, you can already combine the requests library with some API services to build cool things. For example, using the weather API plus a little front-end knowledge of web pages, you could make your own little weather forecast program ^o^/


However, if you try changing the URL above to https://www.baidu.com, you may find that what response.text gives you looks a bit strange. Don't panic! That's because what comes back is the raw text of the web page, possibly in a different character encoding than you expect.
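
If part of the strangeness is a character-encoding problem, requests can fix the decoding itself; the sketch below uses the Response object's encoding and apparent_encoding attributes. But even once it decodes cleanly, what you have is still a wall of HTML tags rather than readable data, which is where the next library comes in:

import requests

response = requests.get('https://www.baidu.com')
print(response.encoding)                         # encoding guessed from the HTTP headers
response.encoding = response.apparent_encoding   # re-guess the encoding from the page content
print(response.text[:200])                       # first 200 characters, now decoded more sensibly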

Fortunately, it's time to invite BeautifulSoup onto the stage! Just like Spider-Man's smart glasses, the BeautifulSoup library can instantly turn that jumble of characters into a language you can understand, letting you read the content of the web page with ease!

Smart Glasses - the Beautiful Soup Library

When you use the requests library to request https://www.baidu.com directly, what you get back is actually the text representation of the web page. This text is usually interpreted through the DOM (Document Object Model): when you open a page, the browser downloads its HTML, CSS and JavaScript files, parses them, and builds a DOM tree. This tree structure represents the page's hierarchy, such as headings, paragraphs, links and other elements and how they are nested.

Beautiful Soup is built precisely for parsing web pages, and it handles this DOM tree with ease. With Beautiful Soup you can walk the DOM tree like a real tree and easily find the content you want. For example, you can use the find() or find_all() methods to look up elements by tag name or class name, just like finding a particular kind of branch on a tree.
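
To make that concrete, here is a tiny, self-contained sketch. The HTML string is made up for illustration and simply echoes the tags from the background section:

from bs4 import BeautifulSoup

# a made-up snippet of HTML, just for demonstration
html = """
<html><body>
  <h1 class="title">Pokémon of the day</h1>
  <p class="intro">Today we meet Pikachu.</p>
  <a href="https://example.com/pikachu">Read more</a>
  <img src="https://example.com/pikachu.png">
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
print(soup.find('h1').text)            # the first <h1> element: "Pokémon of the day"
print(soup.find_all('a'))              # every <a> element in the document
print(soup.find_all(class_='intro'))   # look elements up by class name instead of tag name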

If we take a Baidu page as an example, assuming you want to find all the links in it, you can now use code similar to this:

import requests
from bs4 import BeautifulSoup

url = 'https://www.baidu.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# find all the links
links = soup.find_all('a')

# print the address of every link
for link in links:
    print(link.get('href'))

Besides getting all the links, we can use bs4 for all kinds of things. For example, if we replace 'a' in the code above with 'img' and 'href' with 'src', we get the URLs of all the images on the Baidu page. And if we then request those URLs with the requests library, we can easily crawl every image on the page!
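
Below is a minimal sketch of that idea, with the caveat that on a real page some image URLs may be missing or relative, so the filtering and file naming here are deliberately simplified:

import requests
from bs4 import BeautifulSoup

url = 'https://www.baidu.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# collect the src attribute of every <img> tag and download each image
for i, img in enumerate(soup.find_all('img')):
    src = img.get('src')
    if not src or not src.startswith('http'):
        continue  # skip missing or relative URLs to keep the sketch simple
    img_data = requests.get(src).content      # the raw bytes of the image
    with open(f'image_{i}.png', 'wb') as f:   # the file name chosen here is arbitrary
        f.write(img_data)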

Summary of the first lesson

Through this first lesson of the series, you should have picked up some basics of writing crawlers in Python and gained a working understanding of the requests library and the Beautiful Soup library. With what you have learned in this lesson, you can already write some simple crawler programs ^ o^y

In the next lesson, you will use the basic knowledge learned in this lesson to use the requests library and Beautiful Soup library to write a simple crawler program for Pokémon images.

Welcome to continue to pay attention to this series of courses!

