New to Python - Spend 10 Minutes Learning Web Crawling

Foreword
Python beginners often write a crawler as a practice project. This tutorial uses the Scrapy framework to help you implement a crawler quickly and easily and save the data to a database. Data collection also matters a great deal in machine learning; as my data science teacher once said, a good algorithm is no match for good data.


1. Project setup: Scrapy

Install Scrapy (assuming Python is already installed):

$ pip install Scrapy
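
To quickly confirm that the installation worked, you can print the installed version:

$ scrapy version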

Create a new project in the current directory

$ scrapy startproject yourproject

The new project file structure is as follows:

yourproject/
----|scrapy.cfg             # deployment configuration file
    |yourproject/           # project directory
    |----__init__.py
    |----items.py           # data models for the project
    |----pipelines.py       # pipelines for the project
    |----settings.py        # project settings
    |----spiders/           # directory for our spiders
        |----__init__.py    # the main spider code goes here

A simple crawler mainly involves three of these files: spiders, items, and pipelines:

  • spiders: the main logic of the crawler.
  • items: the data models of the crawler.
  • pipelines: the processing plant for the data the crawler obtains; it can filter or save the data.

2. Data model: items

Analyzing the web page, we can see that its main body is a list, and each row of the list contains a quote, the author's name, tags, and so on. Clicking (about) to the right of the author's name shows the author's details, including an introduction, date of birth, birthplace, etc. Based on this, we can first create the following data models:

items.py

import scrapy

# QuoteItem: the main body we want to scrape
class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    tags = scrapy.Field()
    author = scrapy.Field()

    next_page = scrapy.Field()

# Author information for a quote, assigned to QuoteItem.author
class AuthorItem(scrapy.Item):
    name = scrapy.Field()
    birthday = scrapy.Field()
    address = scrapy.Field()
    description = scrapy.Field()

All models must inherit from scrapy.Item. After completing this step, we can start writing the logic of the crawler.

# The complete QuoteItem data structure looks like this:
{
    text,
    tags,
    author: {
        name,
        birthday,
        address,
        description
    }
}
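
As a side note, scrapy.Item instances behave much like dictionaries; below is a minimal sketch with placeholder values (not real scraped data) showing how they are constructed and read:

# Minimal sketch: placeholder values, only to show how the items are used
quote = QuoteItem(text='a quote', tags=['tag1', 'tag2'])
quote['author'] = AuthorItem(name='an author', birthday='', address='', description='')
print(quote['text'], quote['author']['name'])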

3. Crawler: spider

Since it is a crawler, it naturally needs to crawl the webpage. A few key points of the crawler part:

  1. Import the data models you created.
  2. The spider class must inherit from scrapy.Spider.
  3. Set the crawler's name; it is used when starting the crawler.
  4. Put the URLs you want to crawl into start_requests(), as the starting point of the crawler.
  5. After a page has been fetched successfully, the response is analyzed in parse().

spiders/__init__.py

  • Import the created data model at the top.
import scrapy
from ScrapySample.items import QuoteItem
from ScrapySample.items import AuthorItem
  • The crawler class: name -> the crawler's name, allowed_domains -> the whitelist of domains it is allowed to crawl.
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]

In start_requests(), record the URLs you want to crawl.
You can put in just a single URL and let the crawler follow the next-page link from that starting point on its own, or you can list every URL to be crawled here directly; since pages are usually numbered from 1 in order, a for loop can generate the URLs of all pages (a sketch of that alternative follows the code below).
This article lets the crawler find the URL of the next page by itself, so only one starting URL is written.

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
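
For reference, the alternative mentioned above, listing every page URL up front, could look like the sketch below (the page count of 10 is an assumption for illustration only):

    # Alternative sketch: enumerate the page URLs directly instead of following "next" links.
    # The number of pages (10) is assumed here purely for illustration.
    def start_requests(self):
        for page in range(1, 11):
            url = 'http://quotes.toscrape.com/page/%d/' % page
            yield scrapy.Request(url=url, callback=self.parse)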
  • In the following code, after a page has been fetched successfully, we analyze its structure to find the data we need.

Let's look at the XPath syntax first. //div[@class="col-md-8"]/div[@class="quote"] means: find the div node whose class is "col-md-8" and, under it, a child div node whose class is "quote". If such a node is found on the current page, the node information is returned; if not, None is returned.
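
Before writing an XPath into the spider, it is convenient to test it interactively in the Scrapy shell, for example:

$ scrapy shell 'http://quotes.toscrape.com/page/1/'
>>> response.xpath('//div[@class="col-md-8"]/div[@class="quote"]')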

    def parse(self, response):
        # Using the browser inspector, locate the node that contains the list
        courses = response.xpath('//div[@class="col-md-8"]/div[@class="quote"]')

        for course in courses:
            # Instantiate the data model and fill it with data found under this node
            item = QuoteItem()
            # Query all span nodes with class "text" under the course node and take their text().
            # The result is a list, so we take the first match by default.
            item['text'] = course.xpath('.//span[@class="text"]/text()').extract_first()
            item['author'] = course.xpath('.//small[@class="author"]/text()').extract_first()
            item['tags'] = course.xpath('.//div[@class="tags"]/a/text()').extract()

            # Request the author's detailed information
            author_url = course.xpath('.//a/@href').extract_first()
            # If the link to the author page exists (not None or empty), request the author's details
            if author_url:
                request = scrapy.Request(url='http://quotes.toscrape.com'+author_url, dont_filter=True, callback=self.authorParse)
                # Pass the QuoteItem we already filled to the request's callback authorParse(),
                # where the author-related data will be processed.
                request.meta['item'] = item
                yield request

        # Keep crawling to the next page; this helper is analyzed below
        next_page_request = self.requestNextPage(response)
        yield next_page_request
  • Crawl the author's detailed information.
    Once the author's details have been collected into an AuthorItem and assigned to the author field of the QuoteItem, a complete QuoteItem has been assembled.
    def authorParse(self, response):
        # First retrieve the QuoteItem passed over from parse()
        item = response.meta['item']
        # Using the inspector, locate the node that holds the author's details
        sources = response.xpath('//div[@class="author-details"]')

        # Instantiate a data model for the author information
        author_item = AuthorItem()
        # Fill the author model with data
        for source in sources:
            author_item['name'] = source.xpath('.//h3[@class="author-title"]/text()').extract_first()
            author_item['birthday'] = source.xpath('.//span[@class="author-born-date"]/text()').extract_first()
            author_item['address'] = source.xpath('.//span[@class="author-born-location"]/text()').extract_first()
            author_item['description'] = source.xpath('.//div[@class="author-description"]/text()').extract_first()

        # Finally put the author information into the QuoteItem
        item['author'] = author_item
        # Yield the fully assembled data model so it can be saved
        yield item
  • The crawler finds its own way forward (the next-page link).

Using the inspector, we can locate the next-page button element, extract the link from that node, and send the crawler off to the next page.

    def requestNextPage(self, response):
        next_page = response.xpath('.//li[@class="next"]/a/@href').extract_first()
        # Check whether the link of the next-page button exists
        if next_page is not None:
            if next_page != '':
                return scrapy.Request(url='http://quotes.toscrape.com'+next_page, callback=self.parse)
        return None
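
A slightly more robust variant, as a sketch: let Scrapy build the absolute URL with response.urljoin(), which resolves the relative href against the current page and handles slashes for you:

    # Sketch: same logic, but using response.urljoin() to build the absolute URL
    def requestNextPage(self, response):
        next_page = response.xpath('.//li[@class="next"]/a/@href').extract_first()
        if next_page:
            return scrapy.Request(url=response.urljoin(next_page), callback=self.parse)
        return None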

The main logic of the crawler ends here. As you can see, a simple crawler can be implemented with a fairly small amount of code. Mainstream websites, however, usually apply some anti-crawling measures, so the actual process may not go as smoothly: we may need to imitate a browser's User-Agent, add delays between requests to avoid hitting the site too frequently, and so on.
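
As a hedged sketch, such adjustments usually go into settings.py; the values below are illustrative only and not tuned for any particular site:

# settings.py - illustrative anti-blocking settings, adjust to the target site
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # pretend to be a regular browser
DOWNLOAD_DELAY = 2        # wait 2 seconds between requests
ROBOTSTXT_OBEY = True     # respect the site's robots.txt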

4. Data processing: pipelines

Pipelines are what Scrapy uses for post-processing scraped items. Multiple pipelines can exist at the same time and are executed in a configurable order; they are typically used for data cleaning and data storage. The pipelines to run, and their order, are set in the settings.py file.

# Add the following to settings.py
ITEM_PIPELINES = {
   'ScrapySample.pipelines.ScrapySamplePipeline': 300,
}
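
Several pipelines can be registered at once; below is a hedged sketch with a hypothetical second pipeline (DuplicatesPipeline is not part of this project, it only illustrates the ordering):

# Hypothetical ordering example; DuplicatesPipeline does not exist in this project
ITEM_PIPELINES = {
   'ScrapySample.pipelines.DuplicatesPipeline': 100,   # lower number, runs first
   'ScrapySample.pipelines.ScrapySamplePipeline': 300, # runs afterwards, saves to the database
}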

Here I only use one pipeline, ScrapySamplePipeline, to save the data to the database. The number 300 indicates the priority of the pipeline: the smaller the number, the higher the priority.
Since we want to save the data to a database, we first need a database service running locally. I use MySQL here. If you have never set one up, you can download the free version of MAMP, install it, and start Apache and MySQL with one click. The database and tables, of course, still have to be created by yourself.
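
The original post does not show the table definitions; below is a minimal sketch of what the two tables could look like, with column names inferred from the INSERT statements used later in pipelines.py:

-- Hedged sketch: schema inferred from the INSERT statements, adjust types as needed
CREATE TABLE author (
    id       INT AUTO_INCREMENT PRIMARY KEY,
    name     VARCHAR(255),
    birthday VARCHAR(64),
    address  VARCHAR(255),
    detail   TEXT
);

CREATE TABLE spider (
    id     INT AUTO_INCREMENT PRIMARY KEY,
    text   TEXT,
    tags   VARCHAR(255),
    author INT
);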

# Add the database configuration to pipelines.py
import pymysql

config = {
    'host': '127.0.0.1',
    'port': 8081,
    'user': 'root',
    'password': 'root',
    'db': 'xietao',
    'charset': 'utf8mb4',
    'cursorclass': pymysql.cursors.DictCursor,
}

We can do some initialization work in the __init__() function, such as connecting to the database.
process_item() is the function where the pipeline handles each item; this is where we save the data to the database, so the INSERT operations go in this function.
close_spider() is called when the crawler finishes, so we can close the database connection there.

class ScrapySamplePipeline(object):

    def __init__(self):
        # Connect to the database
        self.db = pymysql.connect(**config)
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        # Save the author information first
        sql = 'INSERT INTO author (name, birthday, address, detail) VALUES (%s, %s, %s, %s)'
        self.cursor.execute(sql, (item['author']['name'], item['author']['birthday'], item['author']['address'], item['author']['description']))
        # Get the id of the author row we just inserted
        author_id = self.cursor.lastrowid

        # Save the quote information
        sql = 'INSERT INTO spider (text, tags, author) VALUES (%s, %s, %s)'
        self.cursor.execute(sql, (item['text'], ','.join(item['tags']), author_id))
        self.db.commit()
        return item

    # Called when the crawler is about to finish
    def close_spider(self, spider):
        self.cursor.close()
        self.db.close()
        print('close db')
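
A small hedged refinement that is not in the original code: wrapping the inserts in try/except so a failed insert rolls back instead of leaving a half-written record.

    # Hedged variant of process_item() with basic error handling (not in the original code)
    def process_item(self, item, spider):
        try:
            sql = 'INSERT INTO author (name, birthday, address, detail) VALUES (%s, %s, %s, %s)'
            self.cursor.execute(sql, (item['author']['name'], item['author']['birthday'],
                                      item['author']['address'], item['author']['description']))
            author_id = self.cursor.lastrowid

            sql = 'INSERT INTO spider (text, tags, author) VALUES (%s, %s, %s)'
            self.cursor.execute(sql, (item['text'], ','.join(item['tags']), author_id))
            self.db.commit()
        except Exception:
            # Undo the partial insert if anything went wrong
            self.db.rollback()
        return item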

If you don't need to save to a database or process the data, the pipelines part can be skipped entirely. Now switch to the project directory on the command line and start the crawler:

$ scrapy crawl quotes

If you are not saving to a database, you can use the following command to export the crawled data to the project directory in JSON format.

$ scrapy crawl quotes -o quotes.json
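
The feed export also supports other formats based on the file extension; for example, CSV:

$ scrapy crawl quotes -o quotes.csv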

Finally, a screenshot of the data successfully written to the database.




Origin blog.csdn.net/weixin_45841831/article/details/130826162