Advanced crawler library: Scrapeasy
1. Introduction
Xiao Diaosi: Brother Yu, I've been practicing writing web crawlers lately. Do you know of any convenient tricks...
Xiaoyu: For example?
Xiao Diaosi: For example, crawling an entire website with just one line of code.
Xiaoyu: Haven't I already written plenty of crawler examples? Why are you still asking this? This...
Xiao Diaosi: Brother Yu, I mean something like your article "Download Videos from the Whole Network with Only One Line of Code".
Xiaoyu: Let me think about it.
2. Scrapeasy
Following Xiao Diaosi's idea, a library came to mind: Scrapeasy.
Xiao Diaosi: Is that a third-party library?
Xiaoyu: Of course it is. Could Python's built-in libraries be this powerful?
2.1 Introduction
2.1.1 Scrapy
You may not know much about Scrapeasy yet,
but you have surely heard of the Scrapy crawler framework.
So what is Scrapy?
Scrapy is a powerful web crawling framework. It can be installed with the command pip install scrapy, and the large volumes of data it crawls can be stored in MongoDB.
My earlier architecture diagram:
For Scrapy's other features, you can read the official Scrapy documentation; I won't go into detail here.
2.1.2 Scrapeasy
Now let's get to know Scrapeasy.
Scrapeasy is a third-party Python library. Its main features are:
- Crawling data from websites;
- Extracting data from a single web page;
- Extracting data from multiple web pages;
- Extracting data from PDFs and HTML tables.
It sounds impressive.
Next, let's try it out in code and see just how impressive it is.
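Under the hood, libraries like Scrapeasy parse a page's HTML and collect attributes from the tags they care about. Here is a minimal conceptual sketch of that idea using only Python's standard html.parser module (my own illustration, not Scrapeasy's actual code); the hard-coded HTML string stands in for a real HTTP response:

```python
# Conceptual sketch (not Scrapeasy's implementation): collect the href
# targets of <a> tags and the src targets of <img> tags from a page.
from html.parser import HTMLParser

class LinkAndImageExtractor(HTMLParser):
    """Collects link and image URLs while parsing HTML."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])

# A small hard-coded page stands in for a downloaded web page.
html = """
<html><body>
  <a href="/post/1">First post</a>
  <a href="https://example.com/about">About</a>
  <img src="/static/logo.png">
</body></html>
"""

parser = LinkAndImageExtractor()
parser.feed(html)
print(parser.links)   # ['/post/1', 'https://example.com/about']
print(parser.images)  # ['/static/logo.png']
```

A real scraper adds downloading, URL normalization, and recursion into subpages on top of this core parsing step.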
2.2 Installation
As with any third-party library, it must be installed first.
The usual routine: pip install
pip install scrapeasy
For other installation methods, see these two articles:
- "Python3: Have Python Install Third-Party Libraries Automatically, and Say Goodbye to pip!"
- "Python3: I Quietly Import All Python Libraries with Only One Line of Code!"
2.3 Code example
# -*- coding:utf-8 -*-
# @Time   : 2022-10-31
# @Author : Carl_DJ
'''
Purpose:
    Crawl data with scrapeasy
'''
from scrapeasy import Website, Page

# Create a website object
# Here I use my own blog address as the example
webs = Website("https://blog.csdn.net/wuyoudeyuer?type=blog")

# Get all subpage links
urls = webs.getSubpagesLinks()
# Print the results
print(f'All links: {urls}')

# Find images
images = webs.getImages()
print(f'All images: {images}')

# Download images
webs.download('img', './data')
# Download PDFs
webs.download('pdf', './data')

# Get the domains of all links
main_urls = webs.getLinks(intern=False, extern=False, domain=True)
# Get external links
domain = webs.getLinks(intern=False, extern=True, domain=False)

# Get links to other file types
cal_urls = webs.get("php")
Analysis:
- Get all subpage links: the getSubpagesLinks() method;
- Find images: the getImages() method;
- Download files: the webs.download() method;
- Get links to other file types: get("file type").
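The intern/extern/domain parameters of getLinks() distinguish links that stay on the site's own domain from links that point elsewhere. That distinction can be sketched with the standard library's urllib.parse (split_links is my own hypothetical helper, not part of Scrapeasy's API):

```python
# Hypothetical helper (not Scrapeasy code): classify links relative to a
# site's domain into internal links, external links, and bare domains.
from urllib.parse import urlparse

def split_links(links, site_domain):
    """Mirror the idea behind getLinks(intern=..., extern=..., domain=...)."""
    intern, extern, domains = [], [], set()
    for link in links:
        netloc = urlparse(link).netloc
        domains.add(netloc)
        if netloc == site_domain:
            intern.append(link)
        else:
            extern.append(link)
    return intern, extern, sorted(domains)

links = [
    "https://blog.csdn.net/wuyoudeyuer?type=blog",
    "https://blog.csdn.net/wuyoudeyuer/article/details/1",
    "https://github.com/some/repo",
]
intern, extern, domains = split_links(links, "blog.csdn.net")
print(intern)   # the two blog.csdn.net links
print(extern)   # ['https://github.com/some/repo']
print(domains)  # ['blog.csdn.net', 'github.com']
```

Viewed this way, the three boolean flags simply select which of these three result sets the library returns.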
3. Summary
That's about it for today's sharing.
Today was mainly a quick introduction to the scrapeasy library.
Learning scrapeasy can be considered stepping through the doorway of web crawling.
In fact, I have written other crawler tutorials and cases, such as:
- "Python3: Download Videos from the Whole Network with Only One Line of Code; I'm Overwhelmed by My Own Talent!"
- "Python3: 20 Lines of Code to Crawl Moments Data via the Desktop WeChat Client; My Boss Can No Longer Catch Me Looking at My Phone at Work!"
- "Python3: After Multi-Threaded Crawling of a Bilibili Uploader's Video Danmaku and Comments, I'm Floating~~~"
- "Python3: Crawl the Information of Bilibili Uploaders!"
I won't list any more here. For more examples, see Xiaoyu's crawler-practice column.
I am Xiaoyu:
- CSDN Blog Expert;
- 51Testing certified lecturer;
- Gold-medal interviewer;
- For business cooperation, interview training, or career planning, scan the QR code to get in touch.
Follow me, and I'll take you deeper into more professional skills in the Python field.