Practical crawlers: How to use Webspot to automatically extract list pages

Primer

Crawling list pages is one of the most common tasks in web data extraction. For crawler engineers, generating extraction rules efficiently matters a great deal; otherwise a lot of time is wasted hand-writing CSS selectors or XPath. This article uses a practical example to show how to use the open source tool Webspot to automatically extract list pages.

Webspot

Webspot is an open source project designed to automate web page data extraction. It currently supports recognizing list pages and pagination and extracting the corresponding crawling rules. It also provides a web UI that lets users visually inspect the recognition results, as well as an API that lets developers fetch those results programmatically.

Installing Webspot is very simple. Follow the installation tutorial in the official documentation and install it with Docker and Docker Compose.

# clone git repo
git clone https://github.com/crawlab-team/webspot
# start docker containers
docker-compose up -d

Then wait for the program to start; initializing the application should take about half a minute.
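
If you prefer to script the wait, a minimal readiness check is to poll the web UI address until it responds. This is just a sketch: the 30-second budget and 1-second interval are arbitrary choices, not something Webspot prescribes.

import time
import requests

# poll the Webspot web UI until it responds, for up to ~30 seconds
base_url = 'http://localhost:9999'
for _ in range(30):
    try:
        if requests.get(base_url, timeout=2).ok:
            print('Webspot is up')
            break
    except requests.RequestException:
        pass
    time.sleep(1)
else:
    print('Webspot did not respond in time')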

After initialization is complete, you can visit the web interface at http://localhost:9999. You should see the following interface, which means Webspot has started successfully.

Webspot initialization interface

Now you can initiate a recognition request: click New Request, enter quotes.toscrape.com, click Submit, and after a short wait you should see the following interface.

Webspot Listing Page Recognition

Use the API to automatically fetch data

Next, we will use a Python program to call Webspot's API to automatically grab data.

The whole process is as follows.

  1. Call the Webspot API to obtain the extraction rules for the list page and pagination; the rules are CSS selectors.
  2. Define the crawl targets according to the list page extraction rules, i.e. each item on the list page and its corresponding fields.
  3. Determine the next-page target according to the pagination extraction rules, so that the crawler can automatically fetch the data of the following pages. (A compact roadmap of this flow is sketched right after this list.)
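
Before diving into the implementation, here is a compact roadmap of that flow. get_rules below is only a hypothetical helper name used for illustration; the article simply inlines this API call, while steps 2 and 3 are implemented as get_data and crawl in the following sections.

import requests

# step 1: ask Webspot for the extraction rules of a target URL
# (get_rules is a hypothetical helper; the article inlines this call below)
def get_rules(api_endpoint: str, url: str) -> dict:
    res = requests.post(f'{api_endpoint}/requests', json={'url': url})
    return res.json()['results']

# step 2 is implemented below as get_data() (parse list items and fields),
# step 3 as crawl() (follow the "next page" selector in a loop)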

Call the API

Calling the API is very simple: just pass the URL to be recognized in the request body. The code is shown below.

import requests
from bs4 import BeautifulSoup
from pprint import pprint
# API endpoint
api_endpoint = 'http://localhost:9999/api'
# url to extract
url = 'https://quotes.toscrape.com'
# call API to recognize list page and pagination elements
res = requests.post(f'{api_endpoint}/requests', json={
    'url': url
})
results = res.json()
pprint(results)

Run it in a Python console and you will get recognition result data similar to the following.

{...
 'method': 'request',
 'no_async': True,
 'results': {'pagination': [{'detector': 'pagination',
                             'name': 'Next',
                             'score': 1.0,
                             'scores': {'score': 1.0},
                             'selectors': {'next': {'attribute': None,
                                                    'name': 'pagination',
                                                    'node_id': 120,
                                                    'selector': 'li.next > a',
                                                    'type': 'css'}}}],
...
             'plain_list': [{...
                              'fields': [{'attribute': '',
                                         'name': 'Field_text_1',
                                         'node_id': None,
                                         'selector': 'div.quote > span.text',
                                         'type': 'text'},
                                         ...],
                          ...}],
            },
...}

The recognition results include the CSS selectors for the list page and pagination, as well as the fields corresponding to each list item.

List page and field extraction logic

Next, we will write the logic of list page and field extraction.

First, from results we can get the list item selector list_items_selector and the field list fields.

# list result
list_result = results['results']['plain_list'][0]
# list items selector
list_items_selector = list_result['selectors']['full_items']['selector']
print(list_items_selector)
# fields
fields = list_result['fields']
print(fields)

We can then write the logic that parses the list page items.

def get_data(soup: BeautifulSoup) -> list:
    # data
    data = []
    # items
    items_elements = soup.select(list_items_selector)
    for el in items_elements:
        # row data
        row = {}
        # iterate fields
        for f in fields:
            # field name
            field_name = f['name']
            # field element
            field_element = el.select_one(f['selector'])
            # skip if field element not found
            if not field_element:
                continue
            # add field value to row
            if f['type'] == 'text':
                row[field_name] = field_element.text
            else:
                row[field_name] = field_element.attrs.get(f['attribute'])
        # add row to data
        data.append(row)
    return data

In the get_data function above, we pass in a BeautifulSoup instance and use list_items_selector and fields to parse the list data, which is returned to the caller.
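
As a quick sanity check, get_data can be exercised on a single page like this. This is a minimal sketch that assumes the earlier API call has already populated list_items_selector and fields.

# fetch the first list page and parse it with get_data
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
rows = get_data(soup)
print(f'parsed {len(rows)} items from {url}')
pprint(rows[:3])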

Request list page and paging logic

Next, we need to write the logic that requests the list pages and handles pagination: request the specified URL, parse out the next-page link, and call the get_data function defined above.

We first need to get the CSS Selector for the pagination.

# pagination next selector
next_selector = results['results']['pagination'][0]['selectors']['next']['selector']
print(next_selector)
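Before wiring this selector into the crawl loop, here is a small illustration of how it is used on a page that has already been parsed into soup (continuing the sanity check above): urljoin resolves the relative href of the next link against the current page URL.

from urllib.parse import urljoin

# locate the "next page" link on the current page and resolve its URL
next_el = soup.select_one(next_selector)
if next_el:
    next_url = urljoin(url, next_el.attrs.get('href'))
    print(f'next page: {next_url}')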

Then we write the crawler logic, which continuously crawls the data of the website's list pages.

from urllib.parse import urljoin

def crawl(url: str) -> list:
    # all data to crawl
    all_data = []
    while True:
        print(f'requesting {url}')
        # request url
        res = requests.get(url)
        # beautiful soup of html
        soup = BeautifulSoup(res.content, 'html.parser')
        # add parsed data
        data = get_data(soup)
        all_data += data
        # pagination next element
        next_el = soup.select_one(next_selector)
        # end if pagination next element not found
        if not next_el:
            break
        # url of next page
        url = urljoin(url, next_el.attrs.get('href'))
    return all_data

With that, all the logic is in place.

Full code

Below is the complete code for the entire fetching logic.

from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
from pprint import pprint
def get_data(soup: BeautifulSoup) -> list:
    # data
    data = []
    # items
    items_elements = soup.select(list_items_selector)
    for el in items_elements:
        # row data
        row = {}
        # iterate fields
        for f in fields:
            # field name
            field_name = f['name']
            # field element
            field_element = el.select_one(f['selector'])
            # skip if field element not found
            if not field_element:
                continue
            # add field value to row
            if f['type'] == 'text':
                row[field_name] = field_element.text
            else:
                row[field_name] = field_element.attrs.get(f['attribute'])
        # add row to data
        data.append(row)
    return data
def crawl(url: str) -> list:
    # all data to crawl
    all_data = []
    while True:
        print(f'requesting {url}')
        # request url
        res = requests.get(url)
        # beautiful soup of html
        soup = BeautifulSoup(res.content, 'html.parser')
        # add parsed data
        data = get_data(soup)
        all_data += data
        # pagination next element
        next_el = soup.select_one(next_selector)
        # end if pagination next element not found
        if not next_el:
            break
        # url of next page
        url = urljoin(url, next_el.attrs.get('href'))
    return all_data
if __name__ == '__main__':
    # API endpoint
    api_endpoint = 'http://localhost:9999/api'
    # url to extract
    url = 'https://quotes.toscrape.com'
    # call API to recognize list page and pagination elements
    res = requests.post(f'{api_endpoint}/requests', json={
        'url': url
    })
    results = res.json()
    pprint(results)
    # list result
    list_result = results['results']['plain_list'][0]
    # list items selector
    list_items_selector = list_result['selectors']['full_items']['selector']
    print(list_items_selector)
    # fields
    fields = list_result['fields']
    print(fields)
    # pagination next selector
    next_selector = results['results']['pagination'][0]['selectors']['next']['selector']
    print(next_selector)
    # start crawling
    all_data = crawl(url)
    # print crawled results
    pprint(all_data[:50])

Run the code, and you will get result data like the following.

[{'Field_link_url_6': '/author/Albert-Einstein',
  'Field_link_url_8': '/tag/change/page/1/',
  'Field_text_1': '“The world as we have created it is a process of our '
                  'thinking. It cannot be changed without changing our '
                  'thinking.”',
  'Field_text_2': '“The world as we have created it is a process of our '
                  'thinking. It cannot be changed without changing our '
                  'thinking.”',
  'Field_text_3': '\n'
                  '            Tags:\n'
                  '            \n'
                  'change\n'
                  'deep-thoughts\n'
                  'thinking\n'
                  'world\n',
  'Field_text_4': 'Albert Einstein',
  'Field_text_5': '(about)',
  'Field_text_7': 'change'},
  ...
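If you want to keep the crawled rows around, one simple option is to dump them to a JSON file. This is just a sketch; the quotes.json filename is an arbitrary example, not something Webspot prescribes.

import json

# persist the crawled rows to a JSON file (the filename is arbitrary)
with open('quotes.json', 'w', encoding='utf-8') as f:
    json.dump(all_data, f, ensure_ascii=False, indent=2)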

In this way, we have implemented a crawler that uses Webspot to automatically extract list pages. There is no need to explicitly define CSS selectors or XPath; just call Webspot's API to get the list page data.

Community

If you are interested in the author's articles, you can add the author on WeChat (tikazyq1) and mention "the way of code"; the author will add you to the "way of code" discussion group.

Origin: juejin.im/post/7219899306946101306