The origins of the columnar database HBase

I. The dawn of big data

  In the past, for lack of a cost-effective way to store all of their information, many companies simply ignored certain data sources; today that approach leaves a company uncompetitive. The need to store and analyze every data point keeps growing, driven directly by e-commerce platforms, whose businesses produce more and more data.

  Previously, the only option was to delete collected data after a limited window, for example keeping only the last N days. But that approach is viable only in the short term: it cannot preserve the months or years of data needed to build a model covering a whole time period, or to re-run all historical data through an improved algorithm to get better results.

  Google and Amazon were early to recognize the value of this data, and they began developing solutions to fit their own business needs. Google, for instance, described scalable storage and processing systems built on commodity hardware in a series of technical publications. The open source community used these ideas to implement the two core modules of the Hadoop project: HDFS and MapReduce.

  Hadoop is good at storing arbitrary data, semi-structured or even unstructured, and it lets users decide how to interpret the data at analysis time. It also allows the way data is classified to change at any time: once an algorithm is updated, the user only needs to re-analyze the data.

  Today Hadoop complements almost every existing database system. It gives users effectively unlimited storage space, lets them store and retrieve data when needed, and is optimized for large files, batch access, and streaming access.

II. Columnar database

  A columnar database aggregates data in units of columns: the values of each column are stored sequentially on disk. This differs from traditional row-oriented databases, which store entire rows one after another. As shown below:

  [Figure: row-oriented vs. column-oriented storage layout]

  Column storage rests mainly on one assumption: a query does not need all the values of a row. This assumption holds especially often in analytical databases, which makes a different storage layout worth choosing. In this design, reducing total I/O is only one of several benefits. Because the values of a column naturally share a type, and often differ only slightly even when logically distinct, storing them together compresses better than row-structured data, since most compression algorithms look only at a limited compression window.
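To make the layout difference concrete, here is a minimal sketch in plain Python (the table and values are invented for illustration) of how the same three rows end up on disk in row order versus column order:

```python
# Three logical rows of a table with columns (id, name, age).
rows = [(1, "ann", 30), (2, "bob", 31), (3, "cal", 30)]

# Row store: entire rows are written one after another.
row_layout = [value for row in rows for value in row]

# Column store: each column's values are written contiguously, so a
# query touching only "age" reads one contiguous region of the file.
column_layout = [row[i] for i in range(3) for row in rows]

print(row_layout)     # [1, 'ann', 30, 2, 'bob', 31, 3, 'cal', 30]
print(column_layout)  # [1, 2, 3, 'ann', 'bob', 'cal', 30, 31, 30]
```

Note how in the column layout the three `age` values (30, 31, 30) sit next to each other, which is exactly what gives compression algorithms similar values inside one window.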

  Specialized algorithms such as delta compression or prefix compression build on this column-oriented layout and can improve the compression ratio considerably. A better compression ratio in turn reduces the bandwidth consumed when results are returned.
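Prefix compression works well here because sorted, same-typed values tend to share long prefixes. A minimal sketch of the idea (this is an illustration, not HBase's actual encoder):

```python
def prefix_compress(sorted_values):
    """Store each value as (shared_prefix_len, suffix) relative to its predecessor."""
    out, prev = [], ""
    for v in sorted_values:
        # Length of the common prefix with the previous value.
        n = 0
        while n < min(len(prev), len(v)) and prev[n] == v[n]:
            n += 1
        out.append((n, v[n:]))
        prev = v
    return out

def prefix_decompress(encoded):
    values, prev = [], ""
    for n, suffix in encoded:
        v = prev[:n] + suffix
        values.append(v)
        prev = v
    return values

keys = ["row-0001", "row-0002", "row-0010", "row-0011"]
enc = prefix_compress(keys)
print(enc)  # [(0, 'row-0001'), (7, '2'), (6, '10'), (7, '1')]
assert prefix_decompress(enc) == keys
```

Instead of repeating `row-00` for every key, only the differing suffix is stored, which is why the compression ratio improves as values become more similar.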

  Note that, from the perspective of a typical RDBMS, HBase is not really a columnar database: it merely uses a column-oriented format to store data on disk. It also differs from traditional columnar databases in its use: those are better suited to real-time analytical access, whereas HBase is better suited to key-based access and ordered, sequential access to data.
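The key-based and ordered access pattern that HBase favors can be sketched with a sorted map: point lookup by key, plus range scans over contiguous keys. The class below is a toy model for illustration, not the HBase client API:

```python
import bisect

class SortedKV:
    """Toy ordered key-value store: get by key, scan a key range."""
    def __init__(self):
        self._keys, self._vals = [], []

    def put(self, key, val):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._vals[i] = val          # overwrite existing key
        else:
            self._keys.insert(i, key)    # insert keeps keys sorted
            self._vals.insert(i, val)

    def get(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._vals[i]
        return None

    def scan(self, start, stop):
        """All (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return list(zip(self._keys[lo:hi], self._vals[lo:hi]))

kv = SortedKV()
for k in ["user-003", "user-001", "user-002", "user-010"]:
    kv.put(k, k.upper())
print(kv.get("user-002"))               # USER-002
print(kv.scan("user-001", "user-003"))  # [('user-001', 'USER-001'), ('user-002', 'USER-002')]
```

Because the keys are kept sorted, a range scan touches one contiguous run of entries; this is the access shape HBase is built around.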

III. Problems with relational database systems

  RDBMSs play an indispensable role in the design and implementation of business applications (and will, at least for the foreseeable future). Whenever users, products, sessions, orders, and so on must be retained, some service has to provide persistent data storage behind the front-end application servers. This structure works well for a limited amount of data, but with rapid data growth it becomes inadequate.

  Databases also offer built-in features, such as stored procedures. When a system needs to keep data consistent across more than one table at all times, stored procedures and transactions solve the consistency problems of multiple clients updating data simultaneously. Transactions make cross-table updates atomic: the modifications become visible all at once, or not at all. RDBMSs advertise this as the ACID properties, meaning that user data is strongly consistent. Referential integrity constraints keep the relationships between tables intact, and a domain-specific language, SQL, allows arbitrarily complex queries. In the end, users never need to care how the data is actually stored, only about higher-level concepts such as the table schema, which gives applications a very stable access model.
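The all-or-nothing behavior of transactions described above can be demonstrated with Python's built-in `sqlite3` module (the table and names are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

# Transfer 30 from alice to bob inside one transaction: both updates
# become visible together, or neither does.
try:
    with conn:  # commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
        raise RuntimeError("simulated failure before commit")
except RuntimeError:
    pass

# The failed transaction was rolled back: balances are unchanged.
print(dict(conn.execute("SELECT name, balance FROM accounts")))
# {'alice': 100, 'bob': 0}
```

Without the transaction, a crash between the two UPDATEs would leave 30 units missing from the system; atomicity is exactly what rules that state out.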

  A schema like this is usually designed to serve needs for quite a long time. But as the number of users grows, the pressure on the shared central database server keeps mounting. Adding application servers is relatively easy, because they all share the central database; but as the CPU and I/O load on that central server rises, it becomes hard to predict how long this growth can be sustained.

  The first step to reduce the pressure is to add slave servers and separate reads from writes, serving reads in parallel. This scheme keeps a master database server, but the master now handles only write requests. The choice makes sense mainly because most requests generated by user browsers are reads, so writes are far fewer than reads. What if this scheme also fails, or performance degrades, as the user count keeps climbing?
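The read/write split can be sketched as a tiny router (server names are hypothetical; a real setup would hold database connections, not strings, and would classify statements more carefully than by their first keyword):

```python
import itertools

class ReadWriteRouter:
    """Route writes to the master and spread reads across replicas."""
    def __init__(self, master, replicas):
        self.master = master
        self._replicas = itertools.cycle(replicas)  # round-robin the reads

    def route(self, sql):
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._replicas) if is_read else self.master

router = ReadWriteRouter("db-master", ["db-replica-1", "db-replica-2"])
print(router.route("SELECT * FROM users"))        # db-replica-1
print(router.route("UPDATE users SET name='x'"))  # db-master
print(router.route("SELECT 1"))                   # db-replica-2
```

Read capacity now scales by adding replicas, while every write still funnels through the single master, which is exactly the bottleneck the following paragraphs run into.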

  The next common step is to add a cache, such as Memcached. Read accesses can now be served from the in-memory cache, but this scheme cannot guarantee data consistency, because the application updates the database while the database does not push updates into the cache. The cache and database therefore have to be synchronized as quickly as possible, keeping the gap between a database update and the corresponding cache update to a minimum.
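This caching step is usually implemented as the "cache-aside" pattern: read through the cache, and on every write invalidate the cached entry so the next read fetches fresh data. A sketch with a plain dict standing in for Memcached:

```python
cache = {}                         # stands in for Memcached
database = {"user:1": "alice"}     # stands in for the RDBMS

def read(key):
    if key in cache:               # cache hit: no database access
        return cache[key]
    value = database.get(key)      # cache miss: go to the database
    cache[key] = value             # populate the cache for next time
    return value

def write(key, value):
    database[key] = value
    cache.pop(key, None)           # invalidate so the next read sees fresh data

print(read("user:1"))    # alice  (miss, then cached)
write("user:1", "bob")   # database updated, cache entry dropped
print(read("user:1"))    # bob    (re-read from the database)
```

The inconsistency window the text mentions is the interval between the database write and the invalidation; if another client reads the old cached value inside that interval, it sees stale data.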

  While this approach eases the pressure from read requests, it does not solve the growing pressure from writes. Once the master's write performance degrades, the master can be upgraded, scaling vertically so that the server has more resources. With a master-slave configuration, the slaves then need the same performance as the master; otherwise they cannot keep up with the master's update stream. In short, this consumes even more resources than the initial setup.

  As the project is used, more features are added, and each new feature inevitably turns into new queries against the database. SQL joins that used to execute fine suddenly slow down, or stop completing at all, and the schema has to be denormalized. If things get worse still, stored procedures must be abandoned as well, because they eventually become too slow to execute. Essentially, the data kept in the database is being whittled down so that access can be optimized.

  As users keep arriving and the load keeps rising, the next logical step is to precompute the most expensive queries from time to time, so users can be served results faster. Eventually secondary indexes have to be dropped too, because as the data volume grows, the index volume grows large enough to drag database performance straight down. In the end, the only query mode left is lookup by primary key.

  And what if the load is expected to grow by another order of magnitude, or more, in the coming months? At that point one can consider partitioning the data across multiple databases, but this turns operations and maintenance into a nightmare and is very expensive, so it is not the most sensible solution. Yet, in essence, people keep using an RDBMS this way because there seems to be nothing else to choose.

  Sharding describes a logical, horizontal division of the data. Its defining characteristic is that data is stored in groups, or shards, spread across servers, rather than stored contiguously in one place.

  Shards are implemented over fixed key ranges: before any data arrives, the storage ranges must be divided up in advance. If the load on one horizontal shard exceeds what it can serve, the data must be repartitioned and migrated. Repartitioning and migration are extremely resource-intensive operations, equivalent to redoing the data and drawing the horizontal boundaries anew. Large-scale copying consumes a great deal of I/O and temporarily increases storage requirements as well. While the data is being repartitioned, client applications still issue updates, and those updates, affected by the repartitioning, execute very slowly.
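Fixed-range sharding as described above can be sketched like this (the split points and server names are invented for the example):

```python
import bisect

# Pre-defined split points: keys < "h" -> shard 0, "h" <= keys < "p" -> shard 1,
# the rest -> shard 2. These boundaries are fixed before any data arrives.
split_points = ["h", "p"]
shards = ["server-a", "server-b", "server-c"]

def shard_for(key):
    """Pick the shard whose fixed key range contains `key`."""
    return shards[bisect.bisect_right(split_points, key)]

print(shard_for("apple"))  # server-a
print(shard_for("mango"))  # server-b
print(shard_for("zebra"))  # server-c
```

The weakness the text describes follows directly from the fixed `split_points`: if `server-b` becomes overloaded, its range has to be split and part of its data physically copied elsewhere, during which updates to that range slow down.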

  Virtual partitioning can reduce this resource cost: many more, finer-grained partitions are defined over the key space, and each server is loaded with an equal number of them. When a new server is added, partitions are reassigned to it; this still migrates data, but far less of it. Sharding bolted on afterwards in this way is purely a user-level operation, without database support, and a careless move can seriously damage a production system.
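The virtual-partition idea can be sketched as many small partitions mapped onto a few servers, where adding a server only reassigns a fair share of partitions instead of re-splitting key ranges (all counts and names below are hypothetical):

```python
NUM_VPARTS = 12   # many more virtual partitions than servers

# Initial layout: deal the virtual partitions out to three servers.
assignment = {p: ["s1", "s2", "s3"][p % 3] for p in range(NUM_VPARTS)}

def add_server(assignment, new_server):
    """Rebalance by moving only a fair share of partitions to the new server."""
    servers = sorted(set(assignment.values())) + [new_server]
    target = len(assignment) // len(servers)   # per-server load after rebalance
    moved = []
    for p in sorted(assignment):
        if len(moved) == target:
            break
        donor = assignment[p]
        donor_load = sum(1 for s in assignment.values() if s == donor)
        if donor_load > target:                # take only from overloaded servers
            assignment[p] = new_server
            moved.append(p)
    return moved

moved = add_server(assignment, "s4")
print(moved)                          # [0, 1, 2] -> only 3 of 12 partitions migrate
print(assignment[0], assignment[3])   # s4 s1
```

Because partition boundaries never change, growing the cluster means moving whole partitions, not re-splitting and recopying entire key ranges.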

IV. Non-relational database systems (Not-Only-SQL, NoSQL)

  The name itself turns out to be a good one: most of these newer storage systems do not offer SQL as the means of querying data, only simpler, API-like interfaces for accessing it. On the other hand, some tools give NoSQL stores an SQL entry point, used to express the kinds of complex query conditions familiar from relational databases. So, as far as query capability goes, there is no strict dividing line between relational and non-relational databases.

  The real differences between the two lie at a lower level, particularly around schemas and the ACID transaction properties, which are tied to how storage is actually architected. The first thing many systems of this new kind do is abandon a number of constraints in exchange for scalability: typically they support neither transactions nor secondary indexes. More importantly, such systems have no fixed schema, so they can evolve flexibly along with the application.


Origin www.cnblogs.com/yszd/p/12587825.html