The Role of Web Crawlers and a Simple Classification

Crawled data is generally used for data analysis. The data is first cleaned, extracted, transformed, and standardized; it is then analyzed and mined to extract its business value.
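As a rough illustration of the cleaning and standardization step, here is a minimal Python sketch. The field names (`email`, `phone`), the sample records, and the `standardize` helper are all assumptions made up for this example; they do not come from any particular system.

```python
import re

def standardize(record):
    """Clean one raw record; return a standardized dict, or None if unusable."""
    email = record.get("email", "").strip().lower()
    # Cleaning: drop records whose e-mail does not look like an address.
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email):
        return None
    # Transformation: keep digits only so phone numbers compare equal.
    phone = re.sub(r"\D", "", record.get("phone", ""))
    return {"email": email, "phone": phone}

# Hypothetical raw input, as it might arrive from a registration form.
raw_records = [
    {"email": " Alice@Example.com ", "phone": "138-0000-0000"},
    {"email": "not-an-email", "phone": "12345"},
    {"email": "bob@example.com", "phone": "(139) 1111 2222"},
]

clean_records = [r for r in map(standardize, raw_records) if r is not None]
for r in clean_records:
    print(r)
```

Only after records share one shape like this do the later analysis and mining steps become straightforward.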

Data is divided into internal data and external data

In Internet companies, whether the data is internal or external, the goal is in fact to obtain data about specific users.

Once users' behavioral data has been collected, the users themselves can be analyzed.

For example, e-commerce sites use it to recommend products, while search sites use it for precision marketing (for example, furniture) through ad networks.

Internal company data

Business data: data generated by the systems the company uses, such as BI (Business Intelligence), CRM, ERP, and e-mail systems;

Financial data: data related to the company's daily operations, such as expenditure, procurement, and income;

User data: whether on a website, an app, or a game, users fill in an e-mail address, phone number, ID card number, and other details when they register. This data is actually very valuable, and it is complemented by the behavioral data users leave behind when they use the company's products.

Historical data: the various other data the company has accumulated over time.

External Data

Social network data, including data from WeChat, Weibo, Renren, Twitter, Facebook, LinkedIn, and other social media.

Note: part of this social data can be crawled; the rest requires authorization from the platform operator.

Offline collected data, including Wi-Fi hotspot data, map data, and the like.

Note: few companies currently do this, but the data is quite valuable.

Open government data, including corporate credit data, company registration data, public court data, and public transport data.

Note: if you need this data, look for it on the corresponding government website.

Smart device data, including data from the sensors in smart devices.

Note: did you know? A smartphone contains at least eight sensors.

Web crawler data, including everything on the Internet that can be crawled. Text, video, and images are all data, although they are unstructured data.

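Enterprise transaction data, including merchants' transaction records, Alipay transaction data, credit card spending data, and so on.

Note: this is currently the hardest kind of data to obtain, because data is a valuable asset.

Enterprise open data. For example, Weibo has opened a commercial data API, Tencent has opened the application data reported through the Tencent Cloud Analytics SDK, and AutoNavi (Gaode) Maps has opened its LBS data.

Note: if you are looking for more data APIs, I recommend browsing two sites, Datatang (数据堂) and Juhe Data (聚合数据), which offer a large number of API endpoints.

Other data, such as weather data, traffic data, population movement data, location data, and so on.

Note: the only data you cannot get is the data you have not thought of.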
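Web crawler data was described above as unstructured. To show what turning it into structured data can look like, the following sketch uses Python's standard `html.parser` module to pull a title and a list of links out of a page. The `PageExtractor` class and the sample HTML are assumptions invented for illustration, and no network request is made.

```python
from html.parser import HTMLParser

class PageExtractor(HTMLParser):
    """Collects the <title> text and all link targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Hypothetical page content standing in for a crawled response body.
SAMPLE_HTML = """<html><head><title>Sample product page</title></head>
<body><a href="/item/1">Item 1</a> <a href="/item/2">Item 2</a></body></html>"""

extractor = PageExtractor()
extractor.feed(SAMPLE_HTML)
record = {"title": extractor.title.strip(), "links": extractor.links}
print(record)
```

The resulting `record` is an ordinary dictionary that can be stored, deduplicated, and analyzed like any other structured row.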

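Additional Notes

Big data means integrating a company's internal and external data and storing it at scale. Through cleaning, labeling, deduplication, denoising, and association, the data can be structured; it can also be mined and analyzed, with the results presented through data visualization. This breaks down data silos and forms a closed data loop, turning data into "oil" and "means of production" that are finally applied to our daily life, study, and work.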
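Two of the steps named above, deduplication and association, can be sketched in a few lines of Python. The sample rows, the `user_id`, `city`, and `weather` fields, and both helper functions are hypothetical; the point is only to show internal records being joined with an external source.

```python
# Hypothetical internal data: user rows, one of them duplicated.
internal_users = [
    {"user_id": 1, "city": "Beijing"},
    {"user_id": 2, "city": "Shanghai"},
    {"user_id": 1, "city": "Beijing"},
]
# Hypothetical external data: weather keyed by city.
external_weather = {"Beijing": "sunny", "Shanghai": "rain"}

def deduplicate(rows, key):
    """Keep the first row seen for each value of `key`."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

def associate(rows, lookup, join_field, new_field):
    """Attach lookup[row[join_field]] to each row under `new_field`."""
    return [{**row, new_field: lookup.get(row[join_field])} for row in rows]

users = deduplicate(internal_users, "user_id")
enriched = associate(users, external_weather, "city", "weather")
for row in enriched:
    print(row)
```

At real scale these steps would run in a distributed storage and processing system, but the logic per record stays the same.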

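The Relationship Between Crawlers and Search Systems

Is the data in a search system always fetched by a crawler? Not necessarily.

Search systems can be roughly divided into three categories: general search, site search, and vertical search.

General search: engines such as Baidu and Google crawl publicly accessible data from across the whole Internet.

Site search: only needs the data of the site's own business systems.

Vertical search: industry data plus the site's own data.

Summary: a search system always involves a crawler (except for site search), but crawled data does not necessarily serve search. Apart from search, crawled data is mainly used for data analysis.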

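A Simple Classification of Crawlers

By system structure and implementation technique, web crawlers can be roughly divided into the following types:

- General Purpose Web Crawler

- Focused Web Crawler

- Incremental Web Crawler

- Deep Web Crawler

In practice, a web crawler system usually combines several of these techniques.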
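To make the combination concrete, here is a minimal sketch in which one `crawl` function behaves as a general, focused, or incremental crawler depending on its arguments. Everything here is an assumption for illustration: `FAKE_WEB` is an in-memory stand-in for real pages, relevance is a simple keyword test, and change detection uses a content hash. A production focused crawler would also prioritize or prune links, and deep web crawling (submitting forms to reach hidden pages) is not covered.

```python
import hashlib
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "/": ("home page", ["/news", "/shop"]),
    "/news": ("python crawler news", ["/news/1"]),
    "/news/1": ("python release notes", []),
    "/shop": ("furniture for sale", ["/"]),
}

def crawl(start, is_relevant=lambda text: True, seen_hashes=None):
    """Breadth-first crawl from `start`, returning the URLs that were kept.

    is_relevant: focused crawling; only pages passing this test are kept.
    seen_hashes: incremental crawling; pages whose content hash is
                 unchanged since an earlier run are skipped.
    """
    seen_hashes = {} if seen_hashes is None else seen_hashes
    queue, visited, fetched = deque([start]), {start}, []
    while queue:
        url = queue.popleft()
        text, links = FAKE_WEB[url]
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if is_relevant(text) and seen_hashes.get(url) != digest:
            seen_hashes[url] = digest
            fetched.append(url)
        # Follow every link once, whether or not this page was kept.
        for link in links:
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return fetched

general = crawl("/")                                   # keeps every page
focused = crawl("/", is_relevant=lambda text: "python" in text)
state = {}
first_pass = crawl("/", seen_hashes=state)             # everything is new
second_pass = crawl("/", seen_hashes=state)            # nothing has changed
print(general, focused, first_pass, second_pass)
```

Passing both `is_relevant` and `seen_hashes` at once gives a focused incremental crawler, which is the kind of combination the text describes.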

Origin blog.csdn.net/itcast_cn/article/details/90669815