An Alibaba veteran teaches you how to install Scrapy and create a project (introductory example)

Scrapy is a fast, high-level web scraping and crawling framework for Python, used to crawl websites and extract structured data from their pages. It has a wide range of uses, including data mining, monitoring, and automated testing. Programs like this are sometimes called spiders or spider robots, and are often referred to simply as crawlers.

Why use Scrapy?

Scrapy is an application framework written for crawling website data and extracting structured data.

It can be used in a range of programs, including data mining, information processing, and archiving historical data. Scrapy uses Twisted, an asynchronous networking library, to handle network communication. Its architecture is clean and exposes various middleware interfaces, making it flexible enough to meet many different requirements. The name comes from "scrape", meaning to extract data from web pages, which is presumably how the framework got the name Scrapy.

Scrapy is a framework written in Python that simplifies this process. With it you can build crawlers of varying degrees of complexity.

As a Scrapy demo, I will take a query string and crawl the related links from Google's search results.

Code

First, import the relevant libraries:

import scrapy
import re
from scrapy.linkextractors import LinkExtractor

Assign the Google search URL to a variable:

gURL = 'https://www.google.com/search?q='

Now define the spider class:

class webSpider(scrapy.Spider):

    name = 'webCrawler'
    start_urls = []
    QUERYSTRING = ''

webSpider is the name of the class, and webCrawler is the name of the spider; within the Scrapy framework, the crawl runs under this name. start_urls is the list of URLs that Scrapy will crawl, and QUERYSTRING will hold the Google query.

You can also run the spider inside the Scrapy framework and test it interactively through the Scrapy shell.

    def __init__(self, QUERY, *argv, **kwargv):
        super(webSpider, self).__init__(*argv, **kwargv)
        self.QUERYSTRING = QUERY
        self.start_urls = [gURL + QUERY]

Now I can pass a Google QUERY from an external script, and the base Google search URL is loaded into start_urls. When Scrapy sees start_urls, it requests each URL in the list and starts the crawl. The goal now is to parse the search results that Google returns.
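As a minimal standalone illustration of what the constructor does, here is how start_urls is assembled from gURL and a sample QUERY (the same values used later in this article):

```python
# How the spider's __init__ builds the first URL to crawl:
gURL = 'https://www.google.com/search?q='
QUERY = 'what+is+ethical+hacking'

start_urls = [gURL + QUERY]
print(start_urls[0])  # https://www.google.com/search?q=what+is+ethical+hacking
```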

Define the parse function:

    def parse(self, response):
        xlink = LinkExtractor()
        link_list = []

        for link in xlink.extract_links(response):
            if len(str(link)) > 200 or self.QUERYSTRING in link.text:
                surl = re.findall('q=(http.*)&sa', str(link))
                if surl:
                    link_list.extend(surl)

        print(link_list)

Scrapy loads the URLs listed in start_urls and issues standard HTTP requests. The parse function handles the response: it iterates over each link found by LinkExtractor, checks whether the link text contains our Google query, extracts the target URL with a regular expression, and stores it in the list.
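To see what the regular expression in parse actually captures, here is a small standalone sketch. The sample string mimics the redirect form Google uses for result links (an assumption for illustration, since Google's result markup changes over time):

```python
import re

# A string like the repr of a Link object wrapping a Google redirect URL
sample = "Link(url='https://www.google.com/url?q=https://example.com/page&sa=U')"

# Same pattern as in parse(): capture everything between 'q=' and '&sa'
surl = re.findall('q=(http.*)&sa', sample)
print(surl)  # ['https://example.com/page']
```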

Call and run it from an external Python script:

#!/usr/bin/env python3

from scrapy.crawler import CrawlerProcess
from myspider import webSpider

QUERY = 'what+is+ethical+hacking'

process = CrawlerProcess()
process.crawl(webSpider, QUERY)
process.start()

With this script, whenever you want to search Google, all you have to do is update the QUERY variable with your search string and run it. Note that the spider class written above must be importable from this file (here it lives in myspider.py). The retrieved links can then be used for further crawling as needed.



Origin blog.csdn.net/m0_55479420/article/details/114669837