[Python web crawler introductory tutorial 3] The third lesson of becoming a "Spider Man": from requests to scrapy, crawling target websites

Introduction to Python web crawlers: the "Spider Man's" third lesson

Written up front

A reader hoped to learn practical web crawling skills and wanted to try building his own crawler environment to collect data from the Internet.

I had shared a blog post on this before, but its content felt too simple:
[A super simple crawler demo] Explore Sina: Use Python crawler to obtain dynamic web page data

For this installment, we invited @PoloWitty, a friend who is skilled at web crawling, to write this post. Through his professional perspective and hands-on experience, he guides us step by step toward becoming a "Spider Man" of data exploration.

[Python web crawler introductory tutorial 1] The first lesson to become a "Spider Man": HTML, Request library, Beautiful Soup library
[Python web crawler introductory tutorial 2] The second lesson of becoming a "Spider Man": observing the target website and writing code
[Python web crawler introductory tutorial 3] The third lesson of becoming a "Spider Man": from requests to scrapy, crawling the target website


As Internet data grows exponentially, it becomes increasingly important to understand how to effectively extract this information. Whether it is text models such as ChatGPT or visual models such as Stable Diffusion, most of their training data comes from massive data on the Internet. In this ever-changing era of big data, crawlers are a basic skill tree that must be learned.

This series of articles introduces the basics of Python web crawling in a simple, approachable way, from the requests library to getting started with the Scrapy framework, opening the door to Python web crawling for you and turning you into a "Spider Man". Finally, it uses the ScrapeMe website as a target example to crawl the cute and interesting Pokémon photos on the site.

Before we start, a few words of caution: although web crawlers are powerful, you must comply with applicable laws and regulations and with each website's robots.txt protocol when using them. Do not crawl data illegally~


This is the third article in the series. It uses the ScrapeMe website as an example to show how to use the Scrapy library to crawl the Pokémon images on the site more effectively.

From requests to scrapy

When it comes to web crawlers, we may initially opt for the simple approach, using tools like requests and Beautiful Soup the way a craftsman uses glue and scissors to finish a handicraft. However, as our crawling tasks become larger and more complex, we need more efficient and powerful tools.

This is where Scrapy comes in! It is like an all-round robot assistant that can handle all kinds of crawling tasks. With Scrapy, we can create an entire crawler project and define the crawling rules and workflow. It can crawl multiple pages in parallel, like a multi-threaded superstar! Even better, it has many built-in features, such as automatic request scheduling, page parsing, and data storage. It is like a tailor-made toolbox that lets us face the challenges of the crawler world with ease.

Take the simple crawler we implemented in the second lesson: if you have run it, you may have noticed that it is not very fast. To parallelize it yourself, you would not only need to understand concurrency, but also handle communication and scheduling between the parallel workers, and so on, as the sketch below illustrates. The Scrapy framework handles all of this for us automatically; we only need to focus on the single-threaded logic. Isn't that convenient?
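For comparison, here is a rough sketch of what do-it-yourself parallel downloading with plain requests might look like (the page URLs and worker count below are made up purely for illustration); Scrapy saves us from having to write and tune this kind of plumbing ourselves:

# A minimal sketch of hand-rolled parallel crawling with requests.
# The URLs and worker count are illustrative assumptions, not from this tutorial.
from concurrent.futures import ThreadPoolExecutor

import requests

urls = [f"https://scrapeme.live/shop/page/{n}/" for n in range(1, 5)]


def fetch(url: str) -> int:
    # Download one page and return its HTTP status code
    response = requests.get(url, timeout=10)
    return response.status_code


with ThreadPoolExecutor(max_workers=4) as pool:
    for url, status in zip(urls, pool.map(fetch, urls)):
        print(url, status)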

Using scrapy to crawl the target website

First, we need to install the Scrapy package with the pip install Scrapy command.
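To confirm the installation worked, a quick sanity check (optional, not required by the tutorial) is to import the package and print its version:

# Optional sanity check: make sure Scrapy is importable and see which version was installed
import scrapy

print(scrapy.__version__)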

To create a complete Scrapy project, we can use the command line tool Scrapy provides. Running scrapy startproject spider on the command line creates a new Scrapy project named spider. The project directory contains the following files:

.
├── scrapy.cfg
└── spider
    ├── __init__.py 
    ├── items.py 
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py # project configuration file
    └── spiders 	# directory that holds your spider programs
        └── __init__.py

Apart from the files annotated above, the contents of the other files can be ignored for now. Interested readers can look up the relevant details online.
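For reference, the generated settings.py usually starts out looking roughly like the following (the exact contents can differ between Scrapy versions, so treat this as a sketch); note ROBOTSTXT_OBEY, which keeps the crawler respectful of each site's robots.txt, echoing the note at the beginning of this article:

# spider/settings.py -- roughly what scrapy startproject generates (may vary by version)

BOT_NAME = "spider"

SPIDER_MODULES = ["spider.spiders"]
NEWSPIDER_MODULE = "spider.spiders"

# Obey robots.txt rules by default
ROBOTSTXT_OBEY = True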

Next, we go into the spider/spiders directory and run scrapy genspider pokemon scrapeme.live/shop/ on the command line to create a spider named pokemon (one crawler project can contain many spiders), which will be used to crawl the scrapeme.live/shop/ pages.

At this point, in the spider/spiders directory, we can find our spider program pokemon.py, whose current content is auto-generated:

import scrapy


class PokemonSpider(scrapy.Spider):
    name = "pokemon"
    allowed_domains = ["scrapeme.live"]
    start_urls = ["http://scrapeme.live/"]

    def parse(self, response):
        pass

Next, we fill in our own content and then run the crawler with the scrapy crawl pokemon -O image_urls.csv command to get the results we want:

import os

import scrapy
import requests

def download_from_url(url: str):
    '''
    Use the requests library to download the image at the given image URL.
    The result is saved into the results folder.
    '''
    filename = url.split('/')[-1]
    os.makedirs('../results', exist_ok=True)  # make sure the output folder exists
    with open(f'../results/{filename}', 'wb') as fp:
        fig_response = requests.get(url)
        fp.write(fig_response.content)


class PokemonSpider(scrapy.Spider):
    name = "pokemon"
    allowed_domains = ["scrapeme.live"]
    # all page links
    start_urls = [f"https://scrapeme.live/shop/page/{pageNum}/?orderby=popularity" for pageNum in range(1, 49)]

    def parse(self, response):
        image_urls = response.css('img')  # find all img elements
        for image_url in image_urls:
            url = image_url.attrib['src']
            download_from_url(url)
            yield {'image_url': url}

Here PokemonSpider inherits from the scrapy.Spider class. The name attribute names the crawler, allowed_domains restricts requests to the listed domains, and crawling starts from the URLs in start_urls. The parse() method receives a response object, which is very similar to the DOM tree parsed with Beautiful Soup: we can call response.css() on it directly to select the elements matching a CSS selector. For each image found, download_from_url() saves the picture into the results folder, and the yielded image link ends up in the image_urls.csv output file.
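One side note: calling requests inside parse() downloads each image synchronously, which blocks Scrapy's otherwise asynchronous engine. An alternative worth knowing about (not used in this tutorial) is Scrapy's built-in ImagesPipeline, which handles the downloads for you. A minimal sketch, assuming the default project layout, the Pillow package installed, and a hypothetical spider name pokemon_images, might look like this:

import scrapy


class PokemonImagesSpider(scrapy.Spider):
    # Hypothetical variant of the tutorial's spider that delegates image
    # downloading to Scrapy's built-in ImagesPipeline (requires Pillow).
    name = "pokemon_images"
    allowed_domains = ["scrapeme.live"]
    start_urls = [f"https://scrapeme.live/shop/page/{n}/?orderby=popularity" for n in range(1, 49)]

    custom_settings = {
        "ITEM_PIPELINES": {"scrapy.pipelines.images.ImagesPipeline": 1},
        "IMAGES_STORE": "results",  # downloaded images are stored under this directory
    }

    def parse(self, response):
        # The pipeline looks for a field named "image_urls" containing a list of URLs
        yield {"image_urls": response.css("img::attr(src)").getall()}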

If you want Scrapy to crawl faster with more concurrency, you only need to add one line, CONCURRENT_REQUESTS = 256, to settings.py. Isn't that easy?
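Concretely, the relevant lines in settings.py might look like this (CONCURRENT_REQUESTS = 256 is the value mentioned above; the DOWNLOAD_DELAY line is an optional extra, not from the tutorial, that you might add to stay polite to the target site):

# settings.py -- raise the concurrency as described above
CONCURRENT_REQUESTS = 256

# Optional (an assumption, not part of the tutorial): add a small per-request delay
# to reduce the load placed on the target website
DOWNLOAD_DELAY = 0.1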

More content

Scrapy itself still has a lot of room for extension, and many mature crawler programs today are written with it. As an introductory course, this lesson basically ends here. If you want to learn more about Scrapy, you can read the official Scrapy Tutorial or other advanced courses online.

Conclusion

Together we have taken our first steps into the wonderful world of web crawlers. We started with background knowledge of HTML and got to know two powerful tools, requests and Beautiful Soup. Then, in the second article, we used this knowledge and these tools to successfully crawl all the Pokémon pictures on a website and obtain a rich set of data.

The third article takes us into more advanced territory and teaches us how to use the Scrapy library to handle larger-scale crawling tasks. Scrapy lets us crawl the Pokémon images from the target website more efficiently and automatically, adding more magic to our crawling journey and laying a solid foundation for more complex tasks in the future.

Through these three lessons, we not only learned technical knowledge, but also came to appreciate the vastness of the crawler world. Crawlers are not only a means of obtaining data, but also a way of exploring the Internet in depth. I hope these lessons have added some fun to your learning journey and will help you in your future work of acquiring and applying data. May you swim freely in this vast ocean of data and discover more wonderful things!



Origin blog.csdn.net/wtyuong/article/details/134915132