Python 3: How simple can a crawler be? One library, one line of code. Sure you don't want to try it?

1 Introduction

Xiao Diaosi: Brother Yu, I've been practicing writing crawlers lately. Do you know of any convenient way...
Xiaoyu: For example?
Xiao Diaosi: For example, crawling an entire website with just one line of code.
Xiaoyu: Haven't I already written plenty of crawler examples? Why are you still asking this? Well...
Xiao Diaosi: Brother Yu, I mean something like your article "Download Videos from the Whole Network with Only One Line of Code".
Xiaoyu: Let me think about it.

2 Scrapeasy

Following Xiao Diaosi's idea, I thought of a library: Scrapeasy.
Xiao Diaosi: Is this a third-party library?
Xiaoyu: Of course it is. Could Python's built-in libraries alone offer such powerful features?

2.1 Introduction

2.1.1 Scrapy

You may not know much about Scrapeasy,
but you have surely heard of the Scrapy crawler framework.
So what is Scrapy?

Scrapy

Scrapy is a powerful web crawling framework. It can be installed with the command pip install scrapy, and the large volumes of data it crawls can be stored in MongoDB.

My previous architecture diagram:

[Scrapy architecture diagram]
For Scrapy's other features, see the official Scrapy website; I won't go into more detail here.
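Since Scrapy and MongoDB are only mentioned in passing here, below is a minimal sketch of what that combination can look like: a tiny spider plus an item pipeline that writes each item to MongoDB. The spider, the demo site quotes.toscrape.com, and the MongoPipeline class are illustrative choices of mine, not code from this article.

import pymongo
import scrapy


class MongoPipeline:
    """Illustrative pipeline: stores every crawled item in MongoDB.
    Enable it via ITEM_PIPELINES in the project's settings.py."""

    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.db = self.client["crawler_demo"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db["items"].insert_one(dict(item))
        return item


class QuoteSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

Saved as quotes_spider.py, this runs with scrapy runspider quotes_spider.py; if the pipeline is not enabled in the settings, the items are simply printed to the crawl log.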

2.1.2 Scrapeasy

Now let's get to know Scrapeasy.
Scrapeasy is a third-party Python library. Its main features are:

  • Crawl web page data:
    • extract data from a single web page;
    • extract data from multiple web pages;
  • Extract data from PDFs and HTML tables.

Sounds awesome.
Next, let's try it out in code and see just how awesome it is.

2.2 Installation

Since it is a third-party library, it has to be installed first.
The old routine: install it with pip.

pip install scrapeasy
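A quick way to confirm the installation worked is simply to import the library (a sanity check of my own, not a step from the original article):

# If this import succeeds, scrapeasy is installed correctly
from scrapeasy import Website, Page
print("scrapeasy is ready")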

For other installation methods, see my two earlier articles on the topic.

2.3 Code example


# -*- coding:utf-8 -*-
# @Time   : 2022-10-31
# @Author : Carl_DJ

'''
Purpose:
    Crawl data with scrapeasy
'''

from scrapeasy import Website, Page

# Create a Website object
# Here I use my own blog address as the example
webs = Website("https://blog.csdn.net/wuyoudeyuer?type=blog")

# Get all sub-page links
urls = webs.getSubpagesLinks()
# Print the results
print(f'All links: {urls}')

# Find images
images = webs.getImages()
print(f'All images: {images}')

# Download images
webs.download('img', './data')

# Download PDFs
webs.download('pdf', './data')

# Get the domains of all links
domains = webs.getLinks(intern=False, extern=False, domain=True)

# Get external links
externals = webs.getLinks(intern=False, extern=True, domain=False)

# Get links to files of other types (e.g. .php)
cal_urls = webs.get("php")


Analysis

  • Get all sub-page links: the getSubpagesLinks() method;
  • Find images: the getImages() method;
  • Download images or PDFs: the webs.download() method;
  • Get links, external links, or their domains: the getLinks() method with the intern/extern/domain flags;
  • Get links to files of other types: the get("file type") method; a single-page variant using the Page class is sketched below.
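The feature list in section 2.1.2 also promises extraction from a single web page, which goes through the Page class imported at the top of the example. A minimal sketch, assuming Page exposes the same get()/download() helpers as Website (the URL and output folder are placeholders of mine):

from scrapeasy import Page

# Work on one page instead of a whole website
page = Page("https://blog.csdn.net/wuyoudeyuer?type=blog")

# Links to PDF files found on this single page
pdf_links = page.get("pdf")
print(f'PDF links on this page: {pdf_links}')

# Download them into a local folder
page.download("pdf", "./data/pdfs")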

3 Summary

That's about it for today's sharing.
Today was mainly a brief introduction to the scrapeasy library.
Once you have learned scrapeasy, you can consider yourself over the threshold into crawlers.
In fact, I have also written a number of other crawler tutorials and case studies.

I won't list them all here; for more examples, see Xiaoyu's crawler-in-action column.

I am Xiaoyu:

Follow me, and I'll take you deeper into more professional skills in the Python field.

Origin: blog.csdn.net/wuyoudeyuer/article/details/127620699