Python Crawler Series, Part 01: A First Look at Web Crawlers

Overview of web crawlers

1. Definition

  • Also called web spiders or web robots: programs that fetch data from the web.

  • In practice, a crawler is a Python program that imitates a person using a browser to visit a website. The more closely its requests resemble a real browser's, the better.
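
One common way to make a request resemble a browser's is to send a browser-like User-Agent header. The sketch below only builds the request with the standard library and does not send it. The `build_request` helper, the `BROWSER_UA` string, and the example.com URL are assumptions made for illustration, not something from the original article.

```python
from urllib.request import Request

# Hypothetical browser-like User-Agent string, chosen for illustration only.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request object that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


# Placeholder URL; nothing is sent over the network here.
req = build_request("https://www.example.com/")

# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

To actually send it, the request object could be passed to `urllib.request.urlopen(req)`. Without a custom header, urllib identifies itself as `Python-urllib/<version>`, which some sites reject.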

2. Purpose of crawling data

  • Collect large amounts of data for data analysis
  • Supply test data for company projects and data the business needs

3. How companies obtain data

  • The company's own data

  • Data purchased from third-party data platforms (Data Hall, Guiyang Big Data Exchange)

  • Data collected with crawlers

4. Advantages of Python for writing crawlers

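  • Python: rich, mature request and parsing modules, plus the powerful Scrapy web crawling framework

  • PHP: relatively weak support for multithreading and asynchronous operations

  • Java: verbose code, and a lot of it

  • C/C++: efficient at run time, but slow to develop in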

5. Crawler classification

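  • General-purpose web crawler (used by search engines; obeys the robots protocol)

	robots protocol: a website uses its robots.txt file to tell search engines which pages may be crawled and which may not.

	General-purpose web crawlers are expected to obey the robots protocol (a gentlemen's agreement).

	Example: https://www.taobao.com/robots.txt

  • Focused web crawler: a crawler program you write yourself for a specific target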
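
To show how a crawler might check the robots protocol before fetching a page, here is a minimal sketch using the standard library's `urllib.robotparser`. The robots.txt content, the `make_parser` helper, the `MyCrawler` agent name, and the example.com URLs are all made up for illustration. A real crawler would load the site's actual robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def make_parser(robots_text):
    """Build a RobotFileParser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)

# Ask whether a crawler calling itself "MyCrawler" may fetch each URL.
print(parser.can_fetch("MyCrawler", "https://www.example.com/index.html"))
print(parser.can_fetch("MyCrawler", "https://www.example.com/private/data.html"))
```

With these rules, the first URL is allowed and the second is disallowed. For a live site, `RobotFileParser.set_url()` followed by `read()` downloads the real file.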

6. Steps for crawling data

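  1. Determine the URL to crawl

  2. Use a request module to send a request to the URL and receive the website's response

  3. Extract the required data from the response content

	① Save the data you need

	② If the page contains other URLs that need to be followed, go back to step 2 and request them, repeating the cycle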
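
The steps above can be sketched as a small loop. This is only an illustration under stated assumptions, not the article's own implementation: `crawl`, `LinkExtractor`, and `FAKE_SITE` are hypothetical names, and the `fetch` argument stands in for a real request module, so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=10):
    """Crawl from start_url; fetch is any callable mapping a URL to HTML text or None."""
    queue = deque([start_url])   # step 1: URLs waiting to be requested
    seen = {start_url}           # avoid requesting the same URL twice
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)        # step 2: send the request, get the response
        if html is None:
            continue
        pages[url] = html        # step 3 ①: save the data (here, the raw HTML)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:   # step 3 ②: queue follow-up URLs
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical in-memory "website" standing in for real HTTP responses.
FAKE_SITE = {
    "https://www.example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "https://www.example.com/a.html": '<a href="/">home</a>',
    "https://www.example.com/b.html": "<p>no links</p>",
}

pages = crawl("https://www.example.com/", FAKE_SITE.get)
print(sorted(pages))
```

The `seen` set and the `max_pages` limit keep the loop from revisiting pages or running forever. In a real crawler, `fetch` would wrap a request library call and saving would mean parsing out the specific fields you need.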

Origin blog.csdn.net/weixin_38640052/article/details/107351809