AutoScraper - a smart web-scraping tool

AutoScraper is a smart, easy-to-use automated web scraper written in Python. It is compatible with Python 3 and can quickly and intelligently learn to extract data from a given website. The project has 4.8K stars on GitHub: https://github.com/alirezamika/autoscraper.
AutoScraper is best suited to pages with weak anti-scraping measures, where it can extract data reliably.
Let's walk through how to use AutoScraper.

1 Installation

# AutoScraper requires Python 3

# Option 1: install the latest version from the git repository with pip
# pip install git+https://github.com/alirezamika/autoscraper.git

# Option 2: install from PyPI (recommended)
# pip install autoscraper

# Option 3: download the source and install it
# python setup.py install
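
To confirm the installation, a quick sanity check (not part of the library's documented workflow) is to import the main class from the command line:

# Should print "ok" if autoscraper is installed correctly
python -c "from autoscraper import AutoScraper; print('ok')"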

2 Using AutoScraper

Below I break AutoScraper's usage into several scenarios and demonstrate each one in practice, using Lianjia's second-hand housing listings as the example. Readers can study the code and adapt it to their own needs.

2.1 Scraping a single field

from autoscraper import AutoScraper  # import

# URL to scrape
url = 'https://bj.lianjia.com/ershoufang/'

# Provide a sample of the information to scrape; here, the title of a
# second-hand housing listing (any title from the current page will do)
wanted_list = ["西山枫林一二期南排高楼层南北通透四居室"]

# Create an AutoScraper object
scraper = AutoScraper()

# Build the scraper and fetch the data
result = scraper.build(url, wanted_list)
print('Number of results:', len(result))  # matches the number of titles on the page
print('Results:', result)


2.2 Scraping multiple fields

2.2.1 Method 1: a wanted_list with multiple samples

from autoscraper import AutoScraper  # import

# URL to scrape
url = 'https://bj.lianjia.com/ershoufang/'

# Provide samples of the information to scrape; here, a listing title and a price
wanted_list = ["西山枫林一二期南排高楼层南北通透四居室", "745万"]

# Create an AutoScraper object
scraper = AutoScraper()

# Build the scraper and fetch the data
result = scraper.build(url, wanted_list)
print("Results:", result)  # all titles are returned first, then all prices

This method returns all the scraped fields in one flat list, which makes titles and prices hard to align, so a second method is introduced.

2.2.2 Method 2: a wanted_dict with aliases

from autoscraper import AutoScraper  # import

# URL to scrape
url = 'https://bj.lianjia.com/ershoufang/'

# Provide samples of the information to scrape, keyed by an alias;
# here, a listing title and a price
wanted_dict = {'title': ["有燃气正规一居室,户型好,格局方正"],
               'price': ['398万']}

# Create an AutoScraper object
scraper = AutoScraper()

# Build the scraper and fetch the data
scraper.build(url=url, wanted_dict=wanted_dict)
# Fetch similar data; grouped=True returns a dict instead of a flat list (default False)
result = scraper.get_result_similar(url=url, grouped=True)
print('Results:')
# In the output, keys such as rule_m3wz and rule_sgqv are rule names; the same
# field may be matched by several rules, in which case pick one of them
print(result)

It is convenient to store the grouped results in a pandas DataFrame, with one column per field.
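
For example, a minimal sketch, assuming the grouped result uses the two rule names shown above (they are generated randomly, so substitute the keys from your own run):

import pandas as pd

# Hypothetical rule names; replace them with the keys printed by your run.
# Each list holds one field; on a listing page the titles and prices are
# one-to-one, so the lists have equal lengths.
df = pd.DataFrame({'title': result['rule_m3wz'],
                   'price': result['rule_sgqv']})
print(df.head())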

2.3 Multi-page crawling

When multiple pages with the same layout need to be crawled, the same scraping rules apply to every page, so the learned rules can be saved once and then reused on the other pages.

import pandas as pd
from autoscraper import AutoScraper  # import

# URL to scrape
url = 'https://bj.lianjia.com/ershoufang/'

# Provide samples of the information to scrape; here, a listing title and a price
wanted_dict = {'title': ["有燃气正规一居室,户型好,格局方正"],
               'price': ['398万']}

# Create an AutoScraper object
scraper = AutoScraper()

# Build the scraper and fetch the data
scraper.build(url=url, wanted_dict=wanted_dict)
result = scraper.get_result_similar(url=url, grouped=True)
print("Rule names:", result.keys())
# Note: rule names are generated randomly; because the pages share one layout,
# the same rules apply to every page

# Save the rules; when one field is matched by several rules, keep one per field
scraper.keep_rules(['rule_4alo', 'rule_xt47'])
scraper.save('lianjie_rule')

# Load the rules
lianjia_scraper = AutoScraper()
lianjia_scraper.load('lianjie_rule')

df = pd.DataFrame()
# Crawl several pages (Lianjia paginates as pg1, pg2, ...)
for n in range(1, 5):
    page_url = f'https://bj.lianjia.com/ershoufang/pg{n}/'
    result = lianjia_scraper.get_result_similar(url=page_url, group_by_alias=True)
    df = pd.concat([df, pd.DataFrame(result)])
print("Results:")
print(df)

AutoScraper also has some other capabilities, such as exact-match scraping with get_result_exact; for details, see autoscraper-examples.md in the repository.
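
As a minimal sketch, reusing the rules saved in section 2.3 (assuming get_result_exact, like get_result_similar, accepts group_by_alias to key results by their aliases):

from autoscraper import AutoScraper

exact_scraper = AutoScraper()
exact_scraper.load('lianjie_rule')  # rules saved in section 2.3

# Return only the elements that exactly match the saved rules, keyed by
# the aliases 'title' and 'price' via group_by_alias=True
result = exact_scraper.get_result_exact('https://bj.lianjia.com/ershoufang/',
                                        group_by_alias=True)
print(result)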
If any readers want a closer look at AutoScraper's source code, leave a comment and I will publish a code walkthrough.
