What is web scraping?
Web scraping involves collecting data available on a website. This can be done manually by humans or automatically by bots. It is the process of extracting information from a website and converting it into structured data for further analysis. Web scraping is also called web harvesting or web data extraction.
Why web scraping is needed
Web scraping helps to obtain data for trend analysis, performance tracking and price monitoring. It can be used for consumer sentiment analysis, extracting insights from news articles, market data aggregation, predictive analysis and many natural language processing projects. Various Python libraries are used in web scraping, including:
- Pattern
- Scrapy
- Beautiful Soup
- Requests, Mechanize, Selenium, etc.
Scrapy is a complete web scraping framework written in Python; it handles both downloading the HTML and parsing it. Beautiful Soup, by contrast, is a library only for parsing and extracting data from HTML.
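To make the "parsing and extracting" role concrete without assuming Beautiful Soup is installed, here is a minimal sketch using only Python's built-in html.parser that pulls the text out of `<li>` elements; with Beautiful Soup the same job would be a one-liner, so treat this purely as an illustration of what a parsing library does:

```python
from html.parser import HTMLParser

# Minimal sketch of the "parse and extract" step, using only the
# standard library so it runs anywhere.
class ListItemExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_li = False   # are we currently inside an <li> tag?
        self.items = []      # collected list-item texts

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_li and data.strip():
            self.items.append(data.strip())

html_doc = "<ul><li>Travis Scott</li><li>Pop Smoke</li></ul>"
parser = ListItemExtractor()
parser.feed(html_doc)
print(parser.items)  # ['Travis Scott', 'Pop Smoke']
```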
Steps involved in web scraping
- Document loading/downloading: load the entire HTML page
- Parse and extract: interpret the document and gather information from the document
- Conversion: convert the collected data into a structured format such as CSV, JSON or a database.
For downloading, use the Python Requests library to fetch the HTML page. Scrapy has its own built-in request mechanism.
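With Requests the download step is just requests.get(url).text. The sketch below uses the standard-library urllib instead so it runs with no extra installs, and it is fed a data: URL so no network access is needed; a real run would pass an http(s) URL:

```python
import urllib.request

def download(url: str) -> str:
    """Download a page and return its HTML as text.

    Stand-in for requests.get(url).text, built on the standard
    library's urllib so the example is self-contained.
    """
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

# A data: URL stands in for a real page so the demo needs no network.
page = download("data:text/html;charset=utf-8,<html><body>hello</body></html>")
print(page)
```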
It is necessary to be familiar with Hypertext Markup Language (HTML) when parsing documents. HTML is the standard markup language used to create web pages. It consists of a series of elements (tag names) that tell the browser how to display the content. An HTML element is made up of
<start tag>Content here</end tag>
HTML can be expressed as a tree structure containing tag names/nodes, where there are relationships between nodes, including parent, child, siblings, etc.
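These parent, child and sibling relationships can be seen by walking a small fragment with the standard library's xml.etree module (the assumption here is that the markup is well-formed enough to parse as XML):

```python
import xml.etree.ElementTree as ET

# A well-formed fragment: <ul> is the parent node, each <li> is a
# child of it, and the two <li> elements are siblings of each other.
root = ET.fromstring("<body><ul><li>first</li><li>second</li></ul></body>")

ul = root.find("ul")   # child of <body>
items = list(ul)       # children of <ul>: the sibling <li> nodes
print(items[0].text, items[1].text)  # first second
```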
After downloading, use a CSS selector or an XPath locator to extract data from the HTML source.
XPath stands for XML Path. It is the syntax, or language, for finding any element on a web page using XML path expressions, navigating the HTML DOM structure.
Getting started with the XPath locator
Absolute XPath: contains the complete path from the root element down to the desired element.
Relative XPath: starts from a reference to the required element at any position in the document, rather than from the root. Relative paths are generally preferred when locating elements.
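The difference between the two styles can be sketched with the standard library's xml.etree, whose limited XPath support still distinguishes a fully spelled-out path from a `.//`-style relative search:

```python
import xml.etree.ElementTree as ET

doc = "<html><body><ul><li>a</li><li>b</li></ul></body></html>"
root = ET.fromstring(doc)

# Absolute-style path: every step from the root is spelled out.
absolute = root.find("body/ul/li")
# Relative-style path: ".//" searches from the current node and
# matches the element at any depth.
relative = root.find(".//li")

print(absolute.text == relative.text)  # True
```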
XPath examples with instructions
I created this HTML script for practice; copy it and save it as a .html file to follow along with the descriptions.
<html>
<head><title>Store</title></head>
<body>
<musicshop id="music"><h2><p>MUSIC</p></h2>
<genre><h3><p>Hip-Hop</p></h3>
<ul>
<li>Travis Scott</li>
<li>Pop Smoke</li>
</ul>
</genre>
<genre country="korea"><h3><p>K-pop</p></h3>
<ul>
<li>G Dragon</li>
<li>Super Junior</li>
</ul>
</genre>
</musicshop>
<bookstore id="book"><h2><p>BOOKS</p></h2>
<bookgenre class="fiction"><h3><p>Fiction</p></h3>
<ul>
<li><booktitle><h5><p>The Beetle</p></h5></booktitle></li>
<li><booktitle><h5><p>The Bell Jar</p></h5></booktitle></li>
<li><booktitle><h5><p>The Book Thief</p></h5></booktitle></li>
</ul>
</bookgenre>
<bookgenre class="horror"><h3><p>Horror</p></h3>
<ul>
<li><booktitle><h5><p><a href='www.goodreads.com/book/show/3999177-the-bad-seed'>The Bad Seed</a></p></h5></booktitle></li>
<li><booktitle><h5><p>House of Leaves</p></h5></booktitle></li>
<li><booktitle><h5><p>The Haunting of Hill House</p></h5></booktitle></li>
</ul>
</bookgenre>
</bookstore>
</body>
</html>
The HTML above generates the web page shown in the picture below.
Practicing XPath and CSS locators in the browser (Chrome):
- Press F12 to open Chrome DevTools.
- The "Elements" panel should be opened by default.
- Press Ctrl + F to enable DOM search in the panel.
- Enter XPath or CSS selector for evaluation.
- If there are matching elements, they will be highlighted in the DOM.
Special characters:
- nodename: selects nodes with the given name
- "/": selects from the root node
- "//": selects matching nodes at any depth, regardless of the levels above them
- "@": selects a node with the given attribute
Using XPath and the HTML document above, let us select the second Hip-Hop artist.
Absolute path: /html/body/musicshop/genre/ul/li[2] (XPath indexing starts at 1; when no index is specified, all matching nodes are selected).
Relative path: //musicshop//li[2]. To extract the text, we append /text(), which gives //musicshop//li[2]/text()
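Scrapy and lxml evaluate //musicshop//li[2]/text() directly. The standard library's xml.etree supports only a subset of XPath (no // in the middle of a path and no text()), so this sketch adapts the query to that subset while still showing the 1-based index against the musicshop portion of the practice HTML:

```python
import xml.etree.ElementTree as ET

# The musicshop fragment of the practice HTML, trimmed to what the
# query needs so the example stays short.
doc = """
<html><body>
<musicshop id="music">
  <genre><ul><li>Travis Scott</li><li>Pop Smoke</li></ul></genre>
  <genre country="korea"><ul><li>G Dragon</li><li>Super Junior</li></ul></genre>
</musicshop>
</body></html>
"""
root = ET.fromstring(doc)

# li[2] is the second list item of each genre (indexing starts at 1).
second = root.findall(".//genre/ul/li[2]")
print([li.text for li in second])  # ['Pop Smoke', 'Super Junior']
```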
Select by attribute name
//bookstore/bookgenre[@class='fiction']
//bookstore/bookgenre[contains(@class,'fiction')] can also be used
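The attribute filter can also be checked with xml.etree, which supports the [@class='fiction'] form shown above (though not the contains() function, which needs a fuller XPath engine such as lxml or Scrapy's selectors):

```python
import xml.etree.ElementTree as ET

# Bookstore fragment of the practice HTML, trimmed for the query.
doc = """
<bookstore id="book">
  <bookgenre class="fiction"><ul><li>The Beetle</li><li>The Bell Jar</li></ul></bookgenre>
  <bookgenre class="horror"><ul><li>House of Leaves</li></ul></bookgenre>
</bookstore>
"""
root = ET.fromstring(doc)

# [@class='fiction'] keeps only nodes whose class attribute matches.
fiction = root.find("bookgenre[@class='fiction']")
print([li.text for li in fiction.iter("li")])  # ['The Beetle', 'The Bell Jar']
```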
Web crawling
We will extract news links and topics from the first page of Nairaland.
First, we inspect Nairaland and decide on the XPath locators we are going to use.
For links: //table[@summary='links']//a/@href
For topics: //table[@summary='links']//a/text() should be the direct solution, but
the anchor tags contain nested markup, so we will use //table[contains(@class,'boards')][2]//tr[2]//a[descendant-or-self::text()]
With that key information identified, we import our libraries:
import scrapy
from scrapy.crawler import CrawlerProcess
We create a spider class that inherits from scrapy.Spider:
dc_dict = {}  # will hold topic title -> link pairs

class Spider(scrapy.Spider):
    name = 'yourspider'

    # start_requests method
    def start_requests(self):
        yield scrapy.Request(url="https://www.nairaland.com/", callback=self.parse)

    def parse(self, response):
        blocks = response.xpath("//table[contains(@class,'boards')][2]//tr[2]")
        News_Titles = blocks.xpath(".//a[descendant-or-self::text()]").extract()
        News_Links = blocks.xpath(".//a/@href").extract()
        for crs_title, crs_descr in zip(News_Titles, News_Links):
            dc_dict[crs_title] = crs_descr
So we start our crawler
process = CrawlerProcess()
process.crawl(Spider)
process.start()
print(dc_dict)
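The conversion step from the earlier list can then be as simple as writing the title-to-link dictionary out as CSV. A sketch with the standard csv module, where dc_dict holds hypothetical stand-in values in place of the entries the spider would collect:

```python
import csv
import io

# Hypothetical stand-in for the title -> link dictionary the spider fills.
dc_dict = {
    "Politics": "https://www.nairaland.com/politics",
    "Romance": "https://www.nairaland.com/romance",
}

# Write the dictionary as two-column CSV rows; here into a string
# buffer, but pass a real file object to persist it to disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["topic", "link"])
for topic, link in dc_dict.items():
    writer.writerow([topic, link])

print(buffer.getvalue().strip())
```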