Data Warehouse - v1

Reposted from: https://www.cnblogs.com/wangtao_20/p/8294974.html

Terminology notes:
1. OLTP: short for On-Line Transaction Processing, meaning the online handling of business transaction data. The databases used for this are relational databases.
2. OLAP: short for On-Line Analytical Processing. Put simply, it is a platform for statistics and analysis. The concept of the data warehouse arose to meet this need.
3. Data warehouse: just a concept, a warehouse of data. The technology used to build one can be a relational database or a column store. Put simply, the data warehouse and OLAP can be regarded as the same thing.
4. Business Intelligence (BI): in essence, BI depends on the support of the data warehouse. Without stored data in large volumes, there is nothing to run statistics on and nothing to analyze.
 
Why do data warehouses prefer column storage to a relational database?
There are many technical solutions for building a data warehouse. A relational database such as MySQL can be used, but at present the industry generally uses column storage.
 
Why not build the data warehouse on an online relational database such as MySQL, rather than on the column-store databases that are generally used? Consider the characteristics of a data warehouse:
1. The data comes from multiple source systems: files, transaction data, or other relational databases.
2. Statistical models over multiple dimensions need to be built.
3. The volume of stored data is large: historical, archived, and computed summary data.
4. Producing a statistical result requires reading a large number of records. If the statistics run too slowly, the results do not arrive in time and the needs of statistical analysis cannot be met.
Queries involving complex aggregate statistics are hard for such systems to handle. An example is finding the items purchased most by certain types of users over the past three months. Because such a query has to read a large amount of data at once, an OLTP (relational database) system is not good at this kind of demand.
5. Data is rarely updated. Almost all operations add data or query it, so query speed requirements are high.
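
The aggregate query mentioned under point 4 can be sketched as follows. This is only a minimal illustration using Python's built-in sqlite3 module with an in-memory database. The table names (users, orders), the column names, and the sample rows are all hypothetical, and the "past three months" window is replaced by a fixed cutoff date so the example stays deterministic.

```python
import sqlite3

# Hypothetical schema and data, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, user_type TEXT);
    CREATE TABLE orders (user_id INTEGER, item TEXT, ordered_on TEXT);
""")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "vip"), (2, "vip"), (3, "regular")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "refrigerator", "2024-03-01"),
    (2, "refrigerator", "2024-03-15"),
    (1, "kettle",       "2024-02-20"),
    (2, "kettle",       "2023-11-30"),  # before the cutoff, ignored
    (3, "toaster",      "2024-03-10"),  # other user type, ignored
])


def top_items(conn, user_type, since, limit=3):
    """Most purchased items for one user type since a cutoff date."""
    return conn.execute("""
        SELECT o.item, COUNT(*) AS purchases
        FROM orders AS o
        JOIN users  AS u ON u.id = o.user_id
        WHERE u.user_type = ? AND o.ordered_on >= ?
        GROUP BY o.item
        ORDER BY purchases DESC, o.item
        LIMIT ?
    """, (user_type, since, limit)).fetchall()


result = top_items(conn, "vip", "2024-01-01")
print(result)
```

The query needs only three columns (user_type, item, ordered_on), yet it has to visit every order in the time window. That access pattern is what makes it expensive on a store optimized for single-row transactions.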
 
Comparison of row storage and column storage

In row storage, all the data of one row is stored together, followed by the data of the second row, and so on in sequence.

Column storage is different: all the data of one column is stored together.
As the original figure shows, row storage keeps a table's rows together, while column storage saves each column separately. Each approach therefore has the following advantages and disadvantages:
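
The two layouts can be sketched in a few lines of Python. This is a simplified illustration rather than how any particular database lays out its files. The three-row table and its column names (id, customer, material) are hypothetical.

```python
# Hypothetical three-row table with columns (id, customer, material).
column_names = ("id", "customer", "material")
rows = [
    (1, "Miler", "Refrigerator"),
    (2, "Smith", "Kettle"),
    (3, "Miler", "Kettle"),
]

# Row storage: the values of each record sit next to each other.
row_store = [value for row in rows for value in row]

# Column storage: the values of each column sit next to each other.
column_store = {
    name: [row[i] for row in rows]
    for i, name in enumerate(column_names)
}

# Reading one column from the row store has to stride across every record...
customers_from_rows = row_store[1::len(column_names)]
# ...while the column store returns one contiguous list.
customers_from_columns = column_store["customer"]

print(row_store)
print(column_store)
```

Appending a record shows the opposite trade-off: the row store grows by one contiguous run of values, while the column store has to touch every column list.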
 
Row storage
Advantages:
- Data is saved together
- INSERT / UPDATE is easy
Disadvantages:
- A selection reads all the data, even when it involves only a few columns

Column storage
Advantages:
- Only the columns involved in a query are read
- Projection is very efficient
- Any column can serve as an index
Disadvantages:
- After a selection, the selected columns have to be reassembled
- INSERT / UPDATE is more troublesome
 
Note: as a reminder from relational database theory, selection picks the rows that satisfy a condition, and projection picks a subset of the columns.
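
The two operations can be illustrated with a short Python sketch. The helper names select and project and the sample table are hypothetical, chosen only for this example. Duplicates are kept in the projection, as in SQL without DISTINCT.

```python
# Hypothetical sample table, one dict per row.
table = [
    {"id": 1, "customer": "Miler", "material": "Refrigerator"},
    {"id": 2, "customer": "Smith", "material": "Kettle"},
    {"id": 3, "customer": "Miler", "material": "Kettle"},
]


def select(rows, predicate):
    """Selection: keep the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


def project(rows, columns):
    """Projection: keep only the named columns of every row."""
    return [{name: row[name] for name in columns} for row in rows]


selected = select(table, lambda row: row["customer"] == "Miler")
projected = project(table, ["material"])

print(selected)
print(projected)
```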
 
Column storage is more efficient for queries that combine several columns.
Take the following query against column storage: SELECT Customers, Material FROM Table WHERE Customers = "Miler" AND Material = "Refrigerator"

Because all the values of a column are stored together, each column acts as its own index. The data is also easy to compress by storing the values as numbers. The smaller the storage footprint, the faster the operations.

The key steps are as follows:

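1. Look up the number that corresponds to the string in the dictionary table (this takes only one string comparison).
2. Match that number against the column's list of values, and set each matching position to 1.
3. Combine the match results of the different columns with a bitwise operation to get the indexes of the records that satisfy all the conditions.
4. Use these indexes to assemble the final result set.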
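
The four steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation of any specific database. The function names dictionary_encode and match_bitmap and the five sample records are hypothetical, and a plain Python integer stands in for the bitmap.

```python
def dictionary_encode(values):
    """Return (dictionary, codes): each distinct string gets an integer code."""
    dictionary = {}
    codes = []
    for value in values:
        # A new string receives the next free code; a known one reuses its code.
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes


def match_bitmap(dictionary, codes, wanted):
    """Steps 1 and 2: one dictionary lookup, then integer comparisons only."""
    code = dictionary.get(wanted)        # step 1: string -> number
    bitmap = 0
    for position, value in enumerate(codes):
        if value == code:
            bitmap |= 1 << position      # step 2: mark the matching position
    return bitmap


# Hypothetical column data for five records.
customers = ["Miler", "Smith", "Miler", "Jones", "Miler"]
materials = ["Refrigerator", "Refrigerator", "Kettle", "Kettle", "Refrigerator"]

cust_dict, cust_codes = dictionary_encode(customers)
mat_dict, mat_codes = dictionary_encode(materials)

cust_bits = match_bitmap(cust_dict, cust_codes, "Miler")
mat_bits = match_bitmap(mat_dict, mat_codes, "Refrigerator")

# Step 3: bitwise AND keeps the positions that satisfy both conditions.
both = cust_bits & mat_bits
positions = [i for i in range(len(cust_codes)) if (both >> i) & 1]

# Step 4: decode the stored numbers back to strings for the result set.
cust_names = {code: name for name, code in cust_dict.items()}
mat_names = {code: name for name, code in mat_dict.items()}
result = [(cust_names[cust_codes[i]], mat_names[mat_codes[i]]) for i in positions]

print(positions, result)
```

Each string is compared only once per condition, in the dictionary lookup. Everything after that works on small integers and bits, which is why the encoded columns are both compact and fast to scan.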
 

 

Databases commonly used in the industry to build data warehouses
 
Commercial column databases in the data warehouse field
1. HP's Vertica
2. Oracle's Oracle Warehouse Builder
3. Sybase's Sybase IQ / SAP IQ
4. Infobright, which is built on MySQL
5. Greenplum, from the company Greenplum
 
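Developed in-house by internet companies
1. Huawei's CarbonData
2. Palo, developed by Baidu for internal use
3. Tencent's Hermes
4. Druid: used for ad analytics, for monitoring and measuring internet advertising systems, and for network monitoring. Open source and free.
5. ClickHouse, developed by the Russian company Yandex for its own internal statistics needs. Yandex's businesses are the Russian counterparts of "Baidu" and "Baidu Tongji" (Baidu's analytics service). ClickHouse was only open-sourced in June 2016. Its documentation is complete, it has good support for PHP, and its performance is no weaker than that of Baidu's Palo.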

 

 


Origin www.cnblogs.com/ylz8401/p/12310188.html