From Data Warehousing to Big Data

A general summary:

In terms of how the technology evolved, the order is as follows:

Database -> Data Warehouse -> Data Lake

| Data mart |

 

Stage one (independent database applications):

Each business application relies on its own database, which serves to collect that application's data.

 

Stage two (collecting data for data mining):

Each business application is a relatively independent subject unit. Data has to be collected from every application system so that aggregate analysis can dig out useful information.

This gave rise to the concept of the data warehouse: a data warehouse is a subject-oriented, integrated, stable (non-volatile), time-variant collection of data built to support management decisions.

Features: its data is stored by subject rather than by application, so a subject spans applications. For example, a product subject or a sales subject.
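
To make "stored by subject rather than by application" concrete, here is a minimal sketch in plain Python. The two source applications, their field names, and the build_product_subject helper are all hypothetical and exist only for illustration; a real warehouse would do this with tables and load jobs.

```python
from collections import defaultdict

# Hypothetical records from two independent business applications.
order_app = [
    {"order_id": 1, "product": "laptop", "amount": 1200.0},
    {"order_id": 2, "product": "phone", "amount": 800.0},
    {"order_id": 3, "product": "laptop", "amount": 1100.0},
]
inventory_app = [
    {"sku": "laptop", "in_stock": 14},
    {"sku": "phone", "in_stock": 50},
]


def build_product_subject(orders, inventory):
    """Merge per-application records into one record per product (the subject)."""
    subject = defaultdict(lambda: {"total_sales": 0.0, "in_stock": 0})
    for row in orders:
        subject[row["product"]]["total_sales"] += row["amount"]
    for row in inventory:
        subject[row["sku"]]["in_stock"] = row["in_stock"]
    return dict(subject)


product_subject = build_product_subject(order_app, inventory_app)
```

The result is keyed by product, not by source application, which is the point of a subject-oriented layout.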

Bottleneck: as business systems produce more and more data, the data warehouse runs into bottlenecks (in storage and in queries):

1. Because the data warehouse is built on a relational database, its storage can only scale up (vertically).

2. The larger the data volume, the lower the efficiency of complex queries, and it keeps getting lower.

 

A data mart serves the needs of a specific department or group of users: indicators are computed along the dimensions they require, producing multidimensional cubes for customized, decision-oriented analysis. Its data source may be the databases of the business applications, or it may be a data warehouse.
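
As a rough illustration of "indicators computed along dimensions", the sketch below builds a tiny cube in plain Python by summing an amount for every combination of two dimensions. The sample facts, the DIMENSIONS tuple, and the build_cube function are assumptions made up for this example, not part of any particular data mart product.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sales facts: (region, product, amount).
facts = [
    ("north", "laptop", 100),
    ("north", "phone", 50),
    ("south", "laptop", 70),
]
DIMENSIONS = ("region", "product")


def build_cube(rows):
    """Sum `amount` for every subset of the dimensions (a tiny data cube)."""
    cube = defaultdict(int)
    for region, product, amount in rows:
        values = {"region": region, "product": product}
        for size in range(len(DIMENSIONS) + 1):
            for dims in combinations(DIMENSIONS, size):
                # The key records which dimensions are fixed and to what value.
                key = tuple((d, values[d]) for d in dims)
                cube[key] += amount
    return dict(cube)


cube = build_cube(facts)
grand_total = cube[()]
north_total = cube[(("region", "north"),)]
```

The empty key holds the grand total, single-dimension keys hold per-region or per-product totals, and the full key holds the finest-grained cell.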

 

Stage three (solving the data warehouse bottlenecks; big data concepts introduced):

With the birth of Hadoop's HDFS came the concept of the data lake; HDFS is Hadoop's storage framework.

Solutions to the data warehouse bottlenecks:

1. Storage: data is moved onto HDFS, which runs on cheap hardware and can scale out (horizontally).

2. Computation: use Hadoop's disk-based computing framework MapReduce, or Spark's in-memory RDDs (resilient distributed datasets).
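
The MapReduce model mentioned in the list above can be sketched in a few lines of plain Python. This is only a single-process illustration of the map, shuffle, and reduce steps using word counting; it does not use Hadoop or Spark, and the function names are chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["data lake data warehouse", "data mart"])
```

In a real cluster the map and reduce calls run on many machines and the shuffle moves data between them; here everything runs in one process so the three steps are easy to see.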

Features:

Storage: the data includes structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, video), forming a centralized store that holds all forms of data.

 

 

 

Significant differences between a data warehouse and a data lake (big data):

1. Data ingestion: a data warehouse brings data in through an ETL process; a data lake brings data in through an ELT process.

2. Storage: a data warehouse holds structured data; a data lake keeps data in its native format and stores all kinds of structures.

3. Data access: a data warehouse is accessed through SQL; a data lake is accessed through directory access (by external programs) or through SQL-like programs.
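
The ETL versus ELT difference in point 1 can be illustrated with a small plain-Python sketch. The sample events, the parse helper, and the in-memory lists standing in for a warehouse table and a lake are all hypothetical; the only point is the order of the transform and load steps.

```python
import json

# Hypothetical raw events arriving from a source system.
raw_events = [
    '{"user": "a", "amount": "10.5"}',
    '{"user": "b", "amount": "oops"}',
    '{"user": "c", "amount": "4"}',
]


def parse(line):
    """Transform step: return (user, amount), or None if the row is malformed."""
    record = json.loads(line)
    try:
        return record["user"], float(record["amount"])
    except (KeyError, ValueError):
        return None


# ETL (warehouse style): transform first, load only rows that fit the schema.
warehouse_table = [row for row in map(parse, raw_events) if row is not None]

# ELT (lake style): load the raw data untouched, transform when it is read.
lake_storage = list(raw_events)


def query_lake_total(storage):
    """Apply the transform at query time and sum the valid amounts."""
    rows = (parse(line) for line in storage)
    return sum(row[1] for row in rows if row is not None)


lake_total = query_lake_total(lake_storage)
```

With ETL the malformed row never reaches storage; with ELT it stays in the lake in its native form and is filtered out only when a query interprets the data.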

Origin www.cnblogs.com/zhangwensi/p/11281771.html