HTAP (Hybrid Transaction/Analytical Processing): an analysis of key technical points

HTAP is a concept that has become popular in recent years. This article covers its background and its technical characteristics.

First, data application categories

By usage characteristics, data applications can be roughly divided into the categories below. Before selecting a technology platform, we need to work out which category our workload falls into.

1.1 OLTP (On-Line Transaction Processing)

OLTP is event-driven and application-oriented; it is also called transaction-oriented processing. Its basic characteristic is that data received from users is sent immediately to a computing center for processing, and the result is returned within a very short time, giving a rapid response to user actions. Banking and e-commerce trading systems are typical OLTP systems.

OLTP has the following features:

  • Serves applications directly; data is generated in the system.
  • Built around transaction processing.
  • Each transaction involves a small amount of data, and response time requirements are very strict.
  • The number of users is very large, users are mainly operators, and concurrency is high.
  • Database operations mainly rely on indexes.
  • SQL is the interface for interaction.
  • The overall data volume is relatively small.

1.2 OLAP (On-Line Analytical Processing)

OLAP is oriented toward data analysis and is also called analysis-oriented information processing. It lets analysts observe information quickly, consistently, and interactively from many angles in order to gain a deep understanding of the data. It handles huge amounts of data, supports complex analytical operations, focuses on decision support, and presents query results intuitively. A data warehouse is a typical OLAP system.

OLAP has the following features:

  • Does not generate data itself; its data comes from the operational data of production systems.
  • Built around query and analysis; complex queries often use multi-table joins and full table scans, and the number of rows involved is often very large.
  • Each query involves a large amount of data, and response time depends heavily on the specific query.
  • The number of users is relatively small; users are mainly business staff and managers.
  • Because the business questions are not fixed, database operations cannot rely on indexes alone.
  • SQL is the main interface, but other language-based interaction is also supported.
  • The overall data volume is relatively large.
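
The contrast between the two feature lists can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module purely for convenience; the `orders` table, its columns, and the function names are hypothetical and are not taken from any system discussed in this article.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a small, hypothetical orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, customer TEXT, region TEXT, amount REAL)"
    )
    rows = [
        (1, "alice", "north", 120.0),
        (2, "bob", "south", 80.0),
        (3, "alice", "north", 40.0),
        (4, "carol", "south", 200.0),
    ]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def oltp_update(conn, order_id, new_amount):
    """OLTP style: a short transaction that touches one row via the primary key."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "UPDATE orders SET amount = ? WHERE id = ?", (new_amount, order_id)
        )
    row = conn.execute(
        "SELECT amount FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    return row[0]


def olap_report(conn):
    """OLAP style: scan the whole table and aggregate by region."""
    cursor = conn.execute(
        "SELECT region, COUNT(*), SUM(amount) "
        "FROM orders GROUP BY region ORDER BY region"
    )
    return cursor.fetchall()
```

The first function reads and writes a single row located through an index, while the second has to visit every row; that difference in access pattern is what the two feature lists describe.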

1.3 Other

Beyond the traditional OLTP and OLAP categories, some new ways of using data have appeared in recent years. I classify them here as "other".

1) Multi-model

As businesses move toward the Internet and intelligent applications, and as architectures move toward microservices and the cloud, applications place new requirements on data storage and management, and the diversity of data has become a prominent issue. Early databases mainly dealt with structured data. Later, as businesses developed, requirements arose for processing other kinds of data as well: semi-structured data (JSON, XML, and so on), text, geospatial data, graph data, and audio and video. Multi-model means that a single database supports storing and processing multiple types of data.

2) Streaming

Streaming (real-time computation) arises from the need for timely data processing. The business value of data drops quickly as time passes, so data must be computed and processed as soon as possible after it is produced. Traditional periodic (batch) processing clearly cannot meet this demand.

The growth of the mobile Internet, the Internet of Things, and sensors has produced large volumes of streaming data, and dedicated stream processing platforms such as Storm and Kafka have appeared in response. In recent years, many databases have begun to support streaming data, for example MemSQL and PipelineDB. Some dedicated streaming platforms have also started to offer SQL interfaces; for example, KSQL, built on Kafka, provides a SQL-based stream processing engine.
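
As a rough illustration of processing data as it arrives rather than in periodic batches, the sketch below sums event values over fixed (tumbling) time windows and emits each window as soon as it closes. It is plain Python and does not use the API of any platform named above; the function name is hypothetical, and it assumes events arrive in timestamp order.

```python
def tumbling_windows(events, window_seconds):
    """Yield (window_start, total) for each fixed-size window as it closes.

    events: iterable of (timestamp_seconds, value) pairs in timestamp order.
    """
    current_start = None
    total = 0.0
    for timestamp, value in events:
        start = (timestamp // window_seconds) * window_seconds
        if current_start is None:
            current_start = start
        elif start != current_start:
            # The previous window is complete: emit it immediately.
            yield current_start, total
            current_start, total = start, 0.0
        total += value
    if current_start is not None:
        # Flush the last, still-open window when the stream ends.
        yield current_start, total
```

Because the function is a generator, a result is available the moment a window closes, without waiting for the whole input.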

3) Advanced use

As data is used more deeply, it is no longer limited to simple CRUD or grouping and aggregation; more advanced uses are gradually attracting attention, for example analyzing data with machine learning, statistical analysis, and pattern recognition algorithms.

1.4 Comparison - OLAP vs OLTP

Second, data processing modes

Faced with these complex application scenarios and multiple classes of data applications, should everything be handled by a single platform or by different platforms? In general, a specialized system outperforms a general-purpose one by one to two orders of magnitude, which argues for serving different workloads with different systems. But as the old saying goes, "that which is long divided must unite, and that which is long united must divide"; data processing also shows a trend toward a single platform.

The key is to take a dialectical view of demand and technology, which stand in tension with each other. When the tension eases, the data processing field tends to integrate; when it sharpens, the field tends to fragment. Given the current state of hardware and software technology and of demand, the trend toward consolidation is becoming more apparent. An integrated data platform will be able to cover most user scenarios, and only a handful of companies will need specialized systems for their special needs.

2.1 Decentralized (specialized platforms)

The more conventional approach today is to use several specialized platforms, each handling a different scenario. Because this crosses platforms, data has to be transferred between them, which raises two problems: data synchronization and data redundancy. The core issue in synchronization is timeliness, since outdated data tends to lose its value.

Common practice is as follows:

  • The OLTP system exposes data changes in the form of logs;
  • The changes are transmitted through a message queue for decoupling;
  • A back-end ETL consumer pulls the changes and synchronizes the data into the OLAP system.
  • The whole chain is long, which is a challenge for scenarios with strict timeliness requirements.
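
The three steps above can be sketched as follows. This is a single-process simplification that stands in for real components: Python's `queue.Queue` plays the role of the message queue, plain dictionaries play the role of log records and of the OLAP copy, and all function names and the record layout (`op`, `id`, `row`) are hypothetical.

```python
import queue


def capture_changes(oltp_log):
    """Step 1: expose OLTP data changes as a stream of log records."""
    for record in oltp_log:
        yield record


def publish(records, message_queue):
    """Step 2: hand records to a message queue to decouple both sides."""
    for record in records:
        message_queue.put(record)
    message_queue.put(None)  # sentinel marking the end of the stream


def etl_consume(message_queue, olap_table):
    """Step 3: the ETL consumer pulls records and applies them to the OLAP copy."""
    while True:
        record = message_queue.get()
        if record is None:
            break
        if record["op"] == "delete":
            olap_table.pop(record["id"], None)
        else:  # "insert" and "update" both overwrite the stored row
            olap_table[record["id"]] = record["row"]
    return olap_table
```

Every hop in this chain adds delay and keeps another copy of the data, which is the timeliness and redundancy cost discussed in this section.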

Further, as data flows along the chain, multiple redundant copies are stored. In a conventional high-availability environment, each system keeps additional replicas as well, so considerable technical and labor costs, plus the cost of data synchronization, are hidden here. With so many technology stacks and database products involved, each stack needs its own support and maintenance team (DBA, big data, infrastructure), which adds up to large costs in people, technology, time, and operations. Meeting business needs while improving timeliness, reducing data redundancy, shortening the chain, and converging technology stacks has therefore become important. This is the starting point for general-purpose platform solutions.

2.2 Centralized (general-purpose platform)

Users are tired of using a different system for each kind of data processing and tend to prefer an integrated platform that handles all of an enterprise's data types. Applied to the combination of online transaction processing and online analytical processing, this is HTAP, which is discussed below. A general-purpose platform has the following advantages:

  • Data integration avoids information silos and makes data easier to share and manage in a unified way.
  • A SQL-based integrated data platform provides good data independence, so applications can focus on business logic without caring about the operational details of the underlying data.
  • An integrated data platform provides better and more complete real-time data, enabling faster and more accurate analysis and decision making for the business.
  • It avoids glue code between systems; the company's overall technical architecture stays simple, needs no complex data import and export, and is easy to manage and maintain.
  • It makes staff training and knowledge sharing easier; there is no need to train development, operations, and management staff for a variety of specialized systems.

Third, HTAP

HTAP stands for Hybrid Transaction/Analytical Processing. In a 2014 report, Gartner used the term to describe a new application framework that breaks down the barrier between OLTP and OLAP: the same database can serve both transactional and analytical scenarios, enabling real-time business decisions.

This architecture has obvious advantages: it avoids cumbersome and expensive ETL operations, and it allows the latest data to be analyzed sooner. The ability to analyze data quickly will become one of the core competitive strengths of enterprises.

3.1 Techniques

  • There is only one copy of the underlying data, or copies can be replicated quickly, while highly concurrent real-time updates are still supported.
  • Storage capacity scales to massive data volumes, and computation scales close to linearly.
  • A good optimizer handles both transactional and analytical statements.
  • Standard SQL is supported, along with techniques such as secondary indexes, partitioning, columnar storage, and vectorized computation.

3.2 Key Technology - row and column storage

1) Row storage (row-based)

Traditional relational databases such as Oracle, MySQL, IBM DB2, and Microsoft SQL Server generally use row-based storage. In a row-based database, the row is the basic logical storage unit, and the data in a row is stored contiguously on the storage medium.

2) Column storage (column-based)

Columnar storage is defined relative to row storage. Newer distributed databases such as HBase (a column-family store), HP Vertica, and EMC Greenplum (which supports column-oriented tables) use or support it. In a column-based database, the column is the basic logical storage unit, and the data in a column is stored contiguously on the storage medium.

A conventional row database stores data by row, and maintaining a large number of indexes and materialized views is expensive in both time (processing) and space (storage). A column database is the opposite: data is stored by column, each column is stored separately, and the data itself serves as the index. A query reads only the columns it involves, which greatly reduces system I/O; each column can be processed by its own thread; and because the values in a column share the same data type and similar characteristics, they compress very well.
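
A minimal sketch of the idea, assuming a tiny hypothetical table held in memory: the same data is laid out first by row and then by column, an aggregate reads a single column, and a simple run-length encoding stands in for the compression schemes that real column stores use (those are more elaborate and are not shown here).

```python
def to_columns(rows):
    """Pivot a list of row dicts into one contiguous list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Compress consecutive repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


# Row layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "region": "north", "amount": 120},
    {"id": 2, "region": "north", "amount": 40},
    {"id": 3, "region": "south", "amount": 80},
    {"id": 4, "region": "south", "amount": 200},
]

# Column layout: each field is stored as its own sequence.
columns = to_columns(rows)

# An aggregate over one field only needs to read that one column.
total_amount = sum(columns["amount"])

# Values of the same type and similar content compress well.
compressed_region = run_length_encode(columns["region"])
```

In the row layout, summing `amount` would require reading every field of every record; in the column layout only the `amount` list is touched.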

3.3 Key Technology - MPP

MPP (Massively Parallel Processing) runs on a shared-nothing database cluster. Each node has its own disk storage and memory, business data is distributed across the nodes according to the database model and application characteristics, and the nodes are connected through a dedicated or general-purpose network, cooperating on computation and providing the database service as a whole. Shared-nothing clusters are credited with advantages such as scalability, high availability, high performance, good cost-effectiveness, and resource sharing.

Briefly, MPP distributes a task across multiple servers and nodes in parallel; after each node finishes its computation, the partial results are combined into the final result. Greenplum is a typical example of a product built on an MPP architecture.
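
The scatter-and-gather pattern described above can be sketched in a few lines. This is an illustration only: threads in one process stand in for separate shared-nothing nodes, the function names are hypothetical, and the code does not reflect how Greenplum or any other product is actually implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, node_count):
    """Distribute rows across nodes by the distribution key (here, the id)."""
    shards = [[] for _ in range(node_count)]
    for row in rows:
        shards[row["id"] % node_count].append(row)
    return shards


def local_aggregate(shard):
    """Work one node does on its own data: a partial sum and a row count."""
    return sum(row["amount"] for row in shard), len(shard)


def mpp_average(rows, node_count=3):
    """Scatter the rows, aggregate on each node in parallel, then combine."""
    shards = partition(rows, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(local_aggregate, shards))
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count if count else None
```

Note that the average is computed from partial sums and counts rather than from per-node averages, because only the former combine correctly when shards have different sizes.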

3.4 Key Technology - resource isolation

OLTP and OLAP workloads use resources in very different ways, so isolation is needed at the resource level to keep them from affecting each other. A common approach is to define resource queues and assign users to them, so that each queue's limits provide the isolation.
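
A minimal sketch of the resource-queue idea, assuming the only limit enforced is the number of statements running at the same time. The `ResourceQueue` and `ResourceManager` classes are hypothetical; real databases expose this through their own configuration, which typically also limits memory and CPU.

```python
import threading


class ResourceQueue:
    """Caps how many statements from one workload class may run at once."""

    def __init__(self, name, max_concurrency):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrency)

    def run(self, statement):
        # Block until a slot is free, then run the statement in that slot.
        with self._slots:
            return statement()


class ResourceManager:
    """Maps users to queues so each workload stays within its own limits."""

    def __init__(self):
        self._queues = {}
        self._user_queue = {}

    def define_queue(self, name, max_concurrency):
        self._queues[name] = ResourceQueue(name, max_concurrency)

    def assign_user(self, user, queue_name):
        if queue_name not in self._queues:
            raise KeyError(f"unknown resource queue: {queue_name}")
        self._user_queue[user] = queue_name

    def queue_for(self, user):
        return self._queues[self._user_queue[user]]

    def submit(self, user, statement):
        return self.queue_for(user).run(statement)
```

For example, giving the transactional queue many slots and the analytical queue only a few means a burst of heavy analytical queries has to wait in its own queue instead of crowding out short transactions.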

3.5 HTAP Products

The original figure here is a database product classification map found online, which can be used to look up HTAP-related products. It represents only one view and is for reference only.

Author: Han Feng

First published on the author's personal public account, "Han Feng channel."

Source: CreditEase Institute of Technology

Origin www.cnblogs.com/yixinjishu/p/11576996.html