The difference between row-based and column-based database storage

1. What are row storage and column storage?

  Traditional relational databases such as Oracle, DB2, MySQL, and SQL Server use row-based storage. In a row-based database, the row is the basic logical storage unit, and the data of each row is stored contiguously on the storage medium.

  Column-based storage is defined in contrast to row-based storage. Newer distributed databases such as HBase, HP Vertica, and EMC Greenplum use column-based storage. In a column-based database, the column is the basic logical storage unit, and the data of each column is stored contiguously on the storage medium.
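  As a rough illustration of the two layouts, the sketch below flattens the same small table in row order and in column order. The table, its columns (id, name, age), and the function names are hypothetical; a real engine works with pages and files rather than Python lists.

```python
# Hypothetical three-row table with columns (id, name, age).
rows = [
    (1, "alice", 30),
    (2, "bob", 25),
    (3, "carol", 35),
]


def row_layout(table):
    """Row-based: the values of each row sit next to each other."""
    flat = []
    for row in table:
        flat.extend(row)
    return flat


def column_layout(table):
    """Column-based: the values of each column sit next to each other."""
    flat = []
    for column in zip(*table):  # zip(*table) transposes rows into columns
        flat.extend(column)
    return flat


row_store = row_layout(rows)
column_store = column_layout(rows)
print(row_store)     # [1, 'alice', 30, 2, 'bob', 25, 3, 'carol', 35]
print(column_store)  # [1, 2, 3, 'alice', 'bob', 'carol', 30, 25, 35]
```

  Both lists hold the same nine values; only the order in which they would reach the storage medium differs.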

2. OLTP and OLAP

  In a database, data processing falls into two categories: online transaction processing (OLTP) and online analytical processing (OLAP). OLTP is the main workload of traditional relational databases: basic, routine transactions such as inserting, deleting, and updating records. OLAP is the main workload of distributed databases. Its real-time requirements are looser, but it processes much larger volumes of data, and it is typically used in complex, dynamic reporting systems.
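  The two workload shapes can be sketched with Python's built-in sqlite3 module. The orders table and its data are made up for illustration, and SQLite itself is a row-based engine; the point here is only the contrast between small per-record transactions and an aggregate query over many records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)

# OLTP-style work: short transactions that touch individual records.
with conn:  # commits on success, rolls back on error
    conn.executemany(
        "INSERT INTO orders (id, region, amount) VALUES (?, ?, ?)",
        [(1, "north", 10.0), (2, "south", 20.0),
         (3, "north", 30.0), (4, "south", 40.0)],
    )
    conn.execute("UPDATE orders SET amount = ? WHERE id = ?", (25.0, 2))
    conn.execute("DELETE FROM orders WHERE id = ?", (4,))

# OLAP-style work: scan many records but use only a few columns.
report = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(report)  # [('north', 40.0), ('south', 25.0)]
```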

  Main differences between OLTP and OLAP:

 

3. Application scenarios for row storage and column storage

  Row storage application scenarios:

    (1) random insert, delete, and update operations;

    (2) queries that need all the attributes of a selected row;

    (3) workloads with frequent insert or update operations, whose cost depends mainly on the size of the indexes and rows.
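  A minimal sketch of why these cases favor row storage, using hypothetical in-memory stand-ins for the two layouts: inserting a record is one write in the row layout but one write per column in the column layout, and reading all attributes of a row is a single lookup in the row layout.

```python
# Hypothetical in-memory stand-ins for the two layouts.
row_store = [
    {"id": 1, "name": "alice", "age": 30},
    {"id": 2, "name": "bob", "age": 25},
]
column_store = {
    "id": [1, 2],
    "name": ["alice", "bob"],
    "age": [30, 25],
}


def insert_row_store(store, record):
    """The whole record lands in one place: a single append."""
    store.append(record)
    return 1  # number of storage locations touched


def insert_column_store(store, record):
    """The record is scattered: one append per column."""
    for column, values in store.items():
        values.append(record[column])
    return len(store)  # number of storage locations touched


new_record = {"id": 3, "name": "carol", "age": 35}
row_writes = insert_row_store(row_store, dict(new_record))
column_writes = insert_column_store(column_store, new_record)
print(row_writes, column_writes)  # 1 3

# All attributes of one row: a single lookup in the row layout.
print(row_store[2])
```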

  Column storage application scenarios:

    (1) queries can operate on each column concurrently, and the final record set is assembled in memory, which reduces query response time;

    (2) data can be located efficiently without maintaining separate indexes (any column can serve as an index), and queries can minimize unrelated I/O and avoid full table scans;

    (3) because each column is stored independently and its data type is known, a compression algorithm can be chosen dynamically for each column based on its data type, data volume, and other factors, which improves physical storage utilization; in addition, if a row has no data for a column, column storage does not need to store a value for that column, which saves more space than row storage.
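  Point (3) can be illustrated with a small sketch. Run-length encoding is used here only as one example of a per-column compression scheme, and the sparse column is modeled as a dictionary from row index to value; both choices, and all names and data, are assumptions for illustration rather than how any particular database implements them.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


# A column with few distinct values compresses well once stored on its own.
country = ["CN"] * 4 + ["US"] * 3 + ["DE"] * 2
encoded = rle_encode(country)
print(encoded)  # [('CN', 4), ('US', 3), ('DE', 2)]

# Sparse column: only rows that actually have a value take up space.
nickname_by_row = {1: "bobby", 7: "hal"}  # hypothetical data


def read_sparse(column, row_index):
    """Return the stored value, or None when the row has no value."""
    return column.get(row_index)


print(read_sparse(nickname_by_row, 7))  # hal
print(read_sparse(nickname_by_row, 0))  # None
```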

  In practice, row-based databases have an inherent weakness when reading data. A query often targets only a few fields, but those fields are buried inside each row's storage unit, and rows are often very wide. Every read must still fetch complete rows, so read efficiency is quite low. Database indexes were introduced as an optimization for this problem: in OLTP applications, indexing, table partitioning, and similar mechanisms can shorten the query path and improve query efficiency.

  For OLAP workloads over massive amounts of data (distributed databases, data warehouses, and so on), however, row-based storage falls short. Building indexes and materialized views in a row-based database takes a great deal of time and resources, so it is not cost-effective and does not fundamentally solve the problems of query performance and maintenance cost. Row-based storage is therefore a poor fit for data warehousing, which is why column-based databases later emerged.

  Data warehouses and distributed databases mostly collect data from various sources, then analyze it and report the results. Most of these operations revolve around a single field (attribute), and when a query asks for one attribute of the records, a column-based database only needs to return the values of the column for that attribute. In large-scale query scenarios, a column-based database can efficiently assemble the values of each column in memory into the final record set, which significantly reduces I/O and query response time. This makes it well suited to data warehouses and distributed applications.
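  The I/O argument can be made concrete with a small sketch that computes an average over one attribute in both layouts and counts how many values each layout has to touch. The table, its columns, and the use of "values read" as a stand-in for I/O are all assumptions for illustration; real systems read pages or blocks, not individual values.

```python
# Hypothetical table of 1000 users with columns (id, name, age, city).
rows = [(i, f"user{i}", 20 + i % 5, "city") for i in range(1000)]

# The same data in column layout: one list per column.
columns = {
    "id": [r[0] for r in rows],
    "name": [r[1] for r in rows],
    "age": [r[2] for r in rows],
    "city": [r[3] for r in rows],
}


def avg_age_row_store(table):
    """Row layout: every row is read in full to reach its age field."""
    values_read = 0
    total = 0
    for row in table:
        values_read += len(row)  # the whole row is fetched
        total += row[2]          # only the age is actually used
    return total / len(table), values_read


def avg_age_column_store(cols):
    """Column layout: only the age column is read."""
    ages = cols["age"]
    return sum(ages) / len(ages), len(ages)


row_avg, row_read = avg_age_row_store(rows)
col_avg, col_read = avg_age_column_store(columns)
print(row_avg, row_read)  # 22.0 4000
print(col_avg, col_read)  # 22.0 1000
```

  Both layouts return the same answer, but in this model the column layout touches a quarter of the values, and the gap widens as the table gains more columns.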

  


Origin www.cnblogs.com/jason--/p/11521554.html