Primer
Crawling list pages is one of the most common tasks in web data extraction. For crawler engineers, generating extraction rules efficiently matters a great deal; otherwise, a lot of time is wasted hand-writing CSS selectors or XPath expressions. This article walks through a practical example of using the open source tool Webspot to automatically extract data from list pages.
Webspot
Webspot is an open source project designed to automate web page data extraction. It currently supports recognizing list pages and pagination and extracting their crawling rules. It also provides a web UI that lets users visually inspect the recognition results, as well as an API that lets developers fetch those results programmatically.
Installing Webspot is straightforward. Following the installation tutorial in the official documentation, you can bring it up with Docker and Docker Compose:
# clone git repo
git clone https://github.com/crawlab-team/webspot
# start docker containers
docker-compose up -d
Then wait for the program to start; initializing the application should take about half a minute.
After initialization completes, visit the web interface at http://localhost:9999 . You should see the following screen, which means the application has started successfully.
Now you can initiate a recognition request: click New Request, enter quotes.toscrape.com , click Submit, and after a short wait you should see the following screen.
Use the API to automatically fetch data
Next, we will use a Python program to call Webspot's API and grab data automatically.
The whole process is as follows:
- Call the Webspot API to obtain the extraction rules for the list page and pagination; the rules are CSS selectors.
- Define the crawl targets according to the list-page rules, i.e. each list item and its corresponding fields.
- Determine the next-page target from the pagination rules, and let the crawler follow it to fetch the next page automatically.
Call the API
Calling the API is simple: just pass the URL to be recognized in the request body. The code is as follows.
import requests
from bs4 import BeautifulSoup
from pprint import pprint
# API endpoint
api_endpoint = 'http://localhost:9999/api'
# url to extract
url = 'https://quotes.toscrape.com'
# call API to recognize list page and pagination elements
res = requests.post(f'{api_endpoint}/requests', json={
    'url': url,
})
results = res.json()
pprint(results)
Run it in a Python console, and you will get recognition result data similar to the following.
{...
'method': 'request',
'no_async': True,
'results': {'pagination': [{'detector': 'pagination',
'name': 'Next',
'score': 1.0,
'scores': {'score': 1.0},
'selectors': {'next': {'attribute': None,
'name': 'pagination',
'node_id': 120,
'selector': 'li.next > a',
'type': 'css'}}}],
...
'plain_list': [{...
'fields': [{'attribute': '',
'name': 'Field_text_1',
'node_id': None,
'selector': 'div.quote > span.text',
'type': 'text'},
...],
...}],
},
...}
The recognition results include the CSS selectors for the list page and pagination, as well as the fields corresponding to each list item.
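Before relying on these selectors, it can be worth guarding against pages where Webspot finds no list. A minimal sketch, assuming the response follows the structure shown above; the helper name pick_list_result is ours, not part of the Webspot API.

```python
# Defensive access into the recognition results.
# The dict structure mirrors the sample response above; pick_list_result
# is a hypothetical helper, not part of the Webspot API.

def pick_list_result(results: dict) -> dict:
    plain_list = results.get('results', {}).get('plain_list', [])
    if not plain_list:
        raise ValueError('Webspot did not detect a list on this page')
    # the sample response puts the best candidate first
    return plain_list[0]

sample = {'results': {'plain_list': [{'fields': [], 'selectors': {}}]}}
print(pick_list_result(sample))  # → {'fields': [], 'selectors': {}}
```

This way a missing list fails loudly with a clear message instead of a bare KeyError or IndexError deep in the crawl.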
List page and field extraction logic
Next, we will write the list page and field extraction logic.
First, from results we can get the list item selector list_items_selector and the field list fields.
# list result
list_result = results['results']['plain_list'][0]
# list items selector
list_items_selector = list_result['selectors']['full_items']['selector']
print(list_items_selector)
# fields
fields = list_result['fields']
print(fields)
We can then write the logic that parses the list page items.
def get_data(soup: BeautifulSoup) -> list:
    # data
    data = []
    # items
    items_elements = soup.select(list_items_selector)
    for el in items_elements:
        # row data
        row = {}
        # iterate fields
        for f in fields:
            # field name
            field_name = f['name']
            # field element
            field_element = el.select_one(f['selector'])
            # skip if field element not found
            if not field_element:
                continue
            # add field value to row
            if f['type'] == 'text':
                row[field_name] = field_element.text
            else:
                row[field_name] = field_element.attrs.get(f['attribute'])
        # add row to data
        data.append(row)
    return data
In the get_data function above, we pass in a BeautifulSoup instance, use list_items_selector and fields to parse out the list data, and return it to the caller.
Request list page and paging logic
Next, we need to write the logic that requests the list pages and handles pagination: request the given URL, parse out the pagination element, and call the get_data function defined above.
We first need to get the CSS Selector for the pagination.
# pagination next selector
next_selector = results['results']['pagination'][0]['selectors']['next']['selector']
print(next_selector)
Then we write the crawler logic, which continuously crawls the data from the site's list pages.
from urllib.parse import urljoin  # needed to resolve relative pagination links

def crawl(url: str) -> list:
    # all data to crawl
    all_data = []
    while True:
        print(f'requesting {url}')
        # request url
        res = requests.get(url)
        # beautiful soup of html
        soup = BeautifulSoup(res.content, 'html.parser')
        # add parsed data
        data = get_data(soup)
        all_data += data
        # pagination next element
        next_el = soup.select_one(next_selector)
        # end if pagination next element not found
        if not next_el:
            break
        # url of next page
        url = urljoin(url, next_el.attrs.get('href'))
    return all_data
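Note the urljoin call at the end of the loop: the href on the "next" link is relative (e.g. /page/2/), so it has to be resolved against the current page URL before it can be requested.

```python
from urllib.parse import urljoin

# resolve the relative "next" href against the current page URL
next_url = urljoin('https://quotes.toscrape.com/page/1/', '/page/2/')
print(next_url)  # → https://quotes.toscrape.com/page/2/
```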
With that, all of the logic is written.
Full code
Below is the complete code for the entire crawl.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from pprint import pprint


def get_data(soup: BeautifulSoup) -> list:
    # data
    data = []
    # items
    items_elements = soup.select(list_items_selector)
    for el in items_elements:
        # row data
        row = {}
        # iterate fields
        for f in fields:
            # field name
            field_name = f['name']
            # field element
            field_element = el.select_one(f['selector'])
            # skip if field element not found
            if not field_element:
                continue
            # add field value to row
            if f['type'] == 'text':
                row[field_name] = field_element.text
            else:
                row[field_name] = field_element.attrs.get(f['attribute'])
        # add row to data
        data.append(row)
    return data


def crawl(url: str) -> list:
    # all data to crawl
    all_data = []
    while True:
        print(f'requesting {url}')
        # request url
        res = requests.get(url)
        # beautiful soup of html
        soup = BeautifulSoup(res.content, 'html.parser')
        # add parsed data
        data = get_data(soup)
        all_data += data
        # pagination next element
        next_el = soup.select_one(next_selector)
        # end if pagination next element not found
        if not next_el:
            break
        # url of next page
        url = urljoin(url, next_el.attrs.get('href'))
    return all_data


if __name__ == '__main__':
    # API endpoint
    api_endpoint = 'http://localhost:9999/api'
    # url to extract
    url = 'https://quotes.toscrape.com'
    # call API to recognize list page and pagination elements
    res = requests.post(f'{api_endpoint}/requests', json={
        'url': url,
    })
    results = res.json()
    pprint(results)
    # list result
    list_result = results['results']['plain_list'][0]
    # list items selector
    list_items_selector = list_result['selectors']['full_items']['selector']
    print(list_items_selector)
    # fields
    fields = list_result['fields']
    print(fields)
    # pagination next selector
    next_selector = results['results']['pagination'][0]['selectors']['next']['selector']
    print(next_selector)
    # start crawling
    all_data = crawl(url)
    # print crawled results
    pprint(all_data[:50])
Run the code, and you will get result data like the following.
[{'Field_link_url_6': '/author/Albert-Einstein',
'Field_link_url_8': '/tag/change/page/1/',
'Field_text_1': '“The world as we have created it is a process of our '
'thinking. It cannot be changed without changing our '
'thinking.”',
'Field_text_2': '“The world as we have created it is a process of our '
'thinking. It cannot be changed without changing our '
'thinking.”',
'Field_text_3': '\n'
' Tags:\n'
' \n'
'change\n'
'deep-thoughts\n'
'thinking\n'
'world\n',
'Field_text_4': 'Albert Einstein',
'Field_text_5': '(about)',
'Field_text_7': 'change'},
...
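Once you have all_data, you will usually want to persist it rather than just print it. A sketch using only the standard library, with sample rows shaped like the output above; the file names are arbitrary, and the real field names come from Webspot.

```python
import csv
import json

# sample rows shaped like the crawler output above; real keys come from Webspot
all_data = [
    {'Field_text_1': 'quote one', 'Field_text_4': 'Author A'},
    {'Field_text_1': 'quote two', 'Field_text_4': 'Author B'},
]

# JSON keeps the rows exactly as crawled
with open('quotes.json', 'w', encoding='utf-8') as f:
    json.dump(all_data, f, ensure_ascii=False, indent=2)

# CSV needs a fixed column set; take the union of keys across all rows,
# since rows can skip fields whose elements were not found
columns = sorted({k for row in all_data for k in row})
with open('quotes.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=columns)
    writer.writeheader()
    writer.writerows(all_data)
```

Taking the union of keys matters because get_data skips fields whose elements are missing, so not every row is guaranteed to have every column.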
In this way, we have implemented a crawler that uses Webspot to automatically extract list pages. There is no need to explicitly define CSS selectors or XPath expressions; just call Webspot's API to get the list page data.
Community
If you are interested in the author's articles, you can add the author on WeChat (tikazyq1) and mention "the way of code"; the author will add you to the "way of code" discussion group.