Hadoop ecosystem study 1 (principles)

I. Background of big data techniques

1. With the rapid development and popularization of computer and information technology (especially the mobile Internet), application systems have expanded rapidly in both number of users and application scenarios (for example Facebook, Taobao, WeChat, China UnionPay, 12306), and the data produced by industry applications has grown explosively.

2. Data volumes easily reach hundreds of PB or even the EB level (1 EB = 1024 PB = 1024 * 1024 TB), far beyond the processing capacity of traditional computers and information systems.

3. Effective big data processing techniques, methods, and tools have become an urgent need.
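The unit relationship in point 2 can be checked with a few lines of arithmetic. This is only a minimal sketch; the constant names TB, PB, and EB are chosen here for illustration.

```python
# Express every unit in terabytes, using the 1024-based relationship from point 2.
TB = 1
PB = 1024 * TB
EB = 1024 * PB

print(EB // PB)  # number of PB in one EB
print(EB // TB)  # number of TB in one EB
```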

 

Google's "troika" of three papers laid a very important foundation for the development of big data.

 

 

Google's troika (very important): three papers ---> ideas and principles
1, GFS: Google File System ---> HDFS: Hadoop Distributed File System
A distributed file system that solves the problem of storing big data.
What is an inverted index?
Inverted index:

Suppose you want to search for "big data". With only a forward index (file ID ---> keywords), the search engine has to scan the entire table to find every record that contains the keyword "big data", and with a huge amount of data this is unacceptably slow.

With an inverted index, the search engine rebuilds the forward index into an inverted index: the mapping from file ID to keywords is converted into a mapping from keyword to file IDs, so each keyword corresponds to the list of documents in which it appears.

Put simply:

Use the data to find its address.
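The conversion from a forward index to an inverted index can be sketched in a few lines. This is a minimal illustration, not how any real search engine implements it; the sample documents and the function names build_forward_index and build_inverted_index are made up for this example, and keywords are simply whitespace-separated words.

```python
from collections import defaultdict

def build_forward_index(docs):
    """Forward index: document ID -> list of keywords in that document."""
    return {doc_id: text.lower().split() for doc_id, text in docs.items()}

def build_inverted_index(forward_index):
    """Inverted index: keyword -> sorted list of document IDs containing it."""
    inverted = defaultdict(set)
    for doc_id, words in forward_index.items():
        for word in words:
            inverted[word].add(doc_id)
    return {word: sorted(ids) for word, ids in inverted.items()}

# Hypothetical sample documents (assumption, not from the original notes).
docs = {
    1: "big data storage",
    2: "distributed file system",
    3: "big data computing",
}

forward = build_forward_index(docs)
inverted = build_inverted_index(forward)

# A keyword lookup is now a single dictionary access instead of a full scan.
print(inverted["big"])
print(inverted["file"])
```

With the forward index alone, answering "which documents contain big?" means visiting every document; with the inverted index it is one lookup, at the cost of storing the extra mapping.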



2, MapReduce computing model: the problem it originated from is PageRank (a large job is first split into many small computing tasks, and the results are then aggregated)
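The "split, then aggregate" idea can be sketched in plain Python. Note the assumptions: this uses word counting rather than PageRank because it is shorter, it runs in a single process, and it does not use the Hadoop API; the function names map_phase, shuffle, reduce_phase, and word_count are hypothetical.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: turn one small piece of input into (key, value) pairs."""
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the values of one key into a single result."""
    return key, sum(values)

def word_count(chunks):
    # Each chunk could be mapped on a different machine; here they run in a loop.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# The input is already split into small pieces (hypothetical sample data).
chunks = ["big data big", "data hadoop"]
counts = word_count(chunks)
print(counts)
```

In a real cluster the map and reduce calls run in parallel on many nodes, and the framework performs the shuffle between them.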


3, BigTable (big table) ----> NoSQL database: HBase (trades space for time)

 

 

 

II. Application scenarios of big data

Big data has a very wide range of application scenarios, covering almost all industries. For example:

Baidu Migration (Spring Festival population movements): during the 2014 Spring Festival, Baidu launched "Baidu Migration", which uses big data technology to analyze Baidu's own LBS (location-based services) data and presents the results with an innovative visualization. It was the first in the industry to show, comprehensively, dynamically, in real time, and visually, the trajectories and characteristics of the large-scale population migration before and after the Spring Festival, as shown in Figure 1-3. (Query URL: http://qianxi.baidu.com/)

 
 
Other examples include weather forecasting and recommendation systems; broadly speaking, big data techniques can be applied to almost any scenario.


III. Core problems and technical directions of big data

(1) Data storage: distributed file systems (GFS, HDFS, etc.)
(2) Data computation: distributed computing models (MapReduce, Spark RDD, etc.)
Computation falls into two directions:
Offline computation: Hadoop MapReduce, Spark Core, Flink DataSet
Real-time computation: Storm, Spark Streaming, Flink DataStream
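The difference between the two computation directions can be illustrated without any framework. This is only a conceptual sketch in plain Python, not the API of Hadoop, Spark, Storm, or Flink; the function names batch_total and streaming_totals are hypothetical.

```python
def batch_total(records):
    """Offline (batch) style: the full dataset exists before computation starts."""
    return sum(records)

def streaming_totals(record_stream):
    """Real-time (streaming) style: update and emit a result as each record arrives."""
    running = 0
    for record in record_stream:
        running += record
        yield running

# Hypothetical sample data.
records = [3, 5, 2]

batch_result = batch_total(records)                     # one answer at the end
stream_results = list(streaming_totals(iter(records)))  # one answer per record

print(batch_result)
print(stream_results)
```

The batch version produces a single result after reading everything, while the streaming version always has an up-to-date result available, which is what real-time computation needs.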

 

IV. Data Warehouse

Traditional data warehouses: Oracle, MySQL, etc.

Big data: Hadoop, Spark, and Flink can each be seen as an implementation of a data warehouse.

 

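Concepts: OLTP and OLAP
A data warehouse is a kind of OLAP system.
OLTP: online transaction processing
Typical operations: insert, update, delete, commit, rollback
Characteristics: ACID (atomicity, consistency, isolation, durability) -----> relational databases

OLAP: online analytical processing
Typical operation: select
Transactions are not a concern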
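The contrast between the two workloads can be shown with Python's built-in sqlite3 module. This is a minimal sketch: SQLite stands in for a relational database, and the orders table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# OLTP style: short write transactions (insert/update/delete + commit/rollback).
# "with conn" commits if the block succeeds and rolls back if it raises.
with conn:
    conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", ("alice", 30.0))
    conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", ("bob", 20.0))
    conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", ("alice", 50.0))

try:
    with conn:
        conn.execute("UPDATE orders SET amount = amount + 100 WHERE customer = 'bob'")
        raise RuntimeError("simulated failure")  # forces a rollback of the update
except RuntimeError:
    pass

bob_amount = conn.execute(
    "SELECT amount FROM orders WHERE customer = 'bob'"
).fetchone()[0]

# OLAP style: read-only analysis over many rows, usually a select with aggregation.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

print(bob_amount)
print(totals)
```

The failed update leaves bob's row unchanged (atomicity), while the analytical query only reads and aggregates, so it never needs commit or rollback.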

 

 

 

 

V. Architecture of the Hadoop ecosystem (Apache, simplified version)

 

 

 



 


Origin www.cnblogs.com/maowei0427/p/11795581.html