A beginner's guide to web crawling using Python and Scrapy.

What is web crawling?

Web scraping is the process of collecting data from websites, either manually or with automated programs (bots). It extracts information from a page and converts it into structured data for further analysis. Web scraping is also called web harvesting or web data extraction.

Why web scraping?

Web scraping helps obtain data for trend analysis and for performance and price monitoring. It is used for consumer sentiment analysis, news article insights, market data aggregation, predictive analysis, and many natural language processing projects. Various Python libraries are used in web scraping, including:

  • Pattern
  • Scrapy
  • Beautiful soup
  • Requests, Mechanize, Selenium, etc.

Scrapy is a complete web-scraping framework written in Python: it handles both downloading the HTML and parsing it. Beautiful Soup, by contrast, is only a library for parsing and extracting data from HTML that has already been downloaded.

Steps involved in web scraping

  1. Document loading/downloading: load the entire HTML page.
  2. Parsing and extraction: interpret the document and gather information from it.
  3. Conversion: convert the collected data into the desired structured format.

For downloading, use the Python Requests library to fetch the HTML page. Scrapy has its own built-in request mechanism.
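As a minimal sketch of the download step, here is a standard-library version (the Requests equivalent would be `requests.get(url).text`); the `download` helper name is my own, not part of any library:

```python
from urllib.request import urlopen

def download(url: str) -> str:
    """Fetch a page and return its HTML as text."""
    with urlopen(url) as resp:
        # Fall back to UTF-8 when the server does not declare a charset
        return resp.read().decode(resp.headers.get_content_charset() or "utf-8")
```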

To parse documents, it is necessary to be familiar with Hypertext Markup Language (HTML). HTML is the standard markup language used to create web pages. It consists of a series of elements/tags that tell the browser how to display the content. An HTML element is made up of

<tagname>Content here</tagname>

HTML can be expressed as a tree structure containing tag names/nodes, where there are relationships between nodes, including parent, child, siblings, etc.
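The tree structure can be made visible with the standard library's `html.parser`: the sketch below (my own illustration, not a library feature) indents each tag by its nesting depth, so parent/child/sibling relationships show up directly.

```python
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    """Record each start tag indented by its depth in the tree."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

printer = TreePrinter()
printer.feed("<ul><li>Travis Scott</li><li>Pop Smoke</li></ul>")
print("\n".join(printer.lines))
```

Here `ul` is the parent of the two `li` elements, and the two `li` elements are siblings.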

After downloading, use a CSS selector or an XPath locator to extract data from the HTML source.

XPath stands for XML Path. It is a query language for locating any element on a web page using path expressions over the HTML DOM structure.

Getting started with XPATH locator

Absolute XPath: contains the complete path from the root element down to the desired element.

Relative XPath: starts from any matching element in the document rather than from the root. Relative paths are generally preferred when locating elements, because they don't break when unrelated parts of the page change.

XPATH example with instructions

I created the HTML document below for practice; copy it and save it as a .html file to follow along.

<html>
     <head><title>Store</title></head>
     <body>
         <musicshop id="music"><h2>MUSIC</h2>
            <genre><h3>Hip-Hop</h3>
                <ul>
                    <li>Travis Scott</li>
                    <li>Pop Smoke</li>
                </ul>
            </genre>
            <genre country="korea"><h3>K-pop</h3>
                <ul>
                    <li>G Dragon</li>
                    <li>Super Junior</li>
                </ul>
            </genre>
        </musicshop>
        <bookstore id="book"><h2>BOOKS</h2>
            <bookgenre class="fiction"><h3>Fiction</h3>
                <ul>
                    <li><booktitle>The Beetle</booktitle></li>
                    <li><booktitle>The Bell Jar</booktitle></li>
                    <li><booktitle>The Book Thief</booktitle></li>
                </ul>
            </bookgenre>
            <bookgenre class="horror"><h3>Horror</h3>
                <ul>
                    <li><booktitle><a href='www.goodreads.com/book/show/3999177-the-bad-seed'>The Bad Seed</a></booktitle></li>
                    <li><booktitle>House of Leaves</booktitle></li>
                    <li><booktitle>The Haunting of Hill House</booktitle></li>
                </ul>
            </bookgenre>
        </bookstore>
    </body>
</html>

Opened in a browser, this HTML renders a simple page with a MUSIC section and a BOOKS section.

Practicing XPath and CSS locators in the browser (Chrome)

  1. Press F12 to open Chrome DevTools.
  2. The "Elements" panel should be opened by default.
  3. Press Ctrl + F to enable DOM search in the panel.
  4. Enter XPath or CSS selector for evaluation.
  5. If there are matching elements, they will be highlighted in the DOM.

Characters:

  • nodename - selects nodes with the given name
  • "/" - selects from the root node
  • "//" - selects matching nodes anywhere beneath the current node, ignoring intervening ancestors
  • "@" - selects a node with the given attribute

Using XPath and the HTML document above, let's select the second Hip-Hop artist (Pop Smoke).

Absolute path: /html/body/musicshop/genre[1]/ul/li[2]. Without an index, genre matches every genre element; genre[1] restricts the path to the first one (XPath indices start at 1).

Relative path: //musicshop//li[2]. To extract the text itself, append /text(), giving //musicshop//li[2]/text()

Select by attribute name

//bookstore/bookgenre[@class='fiction']

//bookstore/bookgenre[contains(@class,'fiction')] can also be used
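The selections above can also be evaluated in code. The standard library's ElementTree supports a useful subset of XPath (for full XPath, including text() and contains(), you would use lxml, parsel, or Scrapy's selectors); the snippet embeds a trimmed-down copy of the practice document:

```python
import xml.etree.ElementTree as ET

# Compact, well-formed version of the practice HTML above
DOC = """
<body>
  <musicshop id="music">
    <genre><ul><li>Travis Scott</li><li>Pop Smoke</li></ul></genre>
  </musicshop>
  <bookstore id="book">
    <bookgenre class="fiction"><ul><li>The Beetle</li></ul></bookgenre>
    <bookgenre class="horror"><ul><li>House of Leaves</li></ul></bookgenre>
  </bookstore>
</body>
"""
root = ET.fromstring(DOC)

# //musicshop//li[2] -> second artist (ElementTree indexing is done in Python here)
second_artist = root.find(".//musicshop").findall(".//li")[1].text

# //bookstore/bookgenre[@class='fiction'] -> the fiction genre element
fiction = root.find(".//bookstore/bookgenre[@class='fiction']")

print(second_artist)
print(fiction.get("class"))
```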

Web crawling

We will extract news links and topics from the first page of Nairaland.

First, we inspect Nairaland and work out the XPath locators we are going to use.

For links: //table[@summary='links']//a/@href

For topics: //table[@summary='links']//a/text() would be the direct solution, but

the <a> tags contain nested tags with text, so we will use //table[contains(@class,'boards')][2]//tr[2]//a[descendant-or-self::text()] instead.
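The nesting problem is easy to see with a small stdlib sketch (the example markup is my own, not from Nairaland): in XPath, a/text() returns only the text nodes directly under <a> and misses text inside nested tags, which is what the descendant-or-self expression works around.

```python
import xml.etree.ElementTree as ET

# A link whose label is partly wrapped in a nested tag
a = ET.fromstring("<a>Breaking: <b>Big story</b></a>")

direct = a.text                   # text directly under <a> only
all_text = "".join(a.itertext())  # text of <a> and all its descendants
print(repr(direct))
print(repr(all_text))
```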

With the locators worked out, we import our libraries:

import scrapy

from scrapy.crawler import CrawlerProcess

We create a spider class that inherits from scrapy.Spider:

dc_dict = {}  # will map each news title to its link

class Spider(scrapy.Spider):
  name = 'yourspider'
  # start_requests yields the initial request(s)
  def start_requests(self):
    yield scrapy.Request(url="https://www.nairaland.com/", callback=self.parse)

  def parse(self, response):
    blocks = response.xpath("//table[contains(@class,'boards')][2]//tr[2]")
    News_Titles = blocks.xpath(".//a[descendant-or-self::text()]").extract()
    News_Links = blocks.xpath(".//a/@href").extract()
    for crs_title, crs_descr in zip(News_Titles, News_Links):
      dc_dict[crs_title] = crs_descr

Finally, we start the crawler:

process = CrawlerProcess()
process.crawl(Spider)
process.start()
print(dc_dict)




Origin blog.csdn.net/aaahtml/article/details/113029974