A summary and analysis of big data processing technologies


Classification of data analysis and processing needs

1 Transactional processing

Transactional data processing needs are very common in everyday life. For example, Taobao's trading system, the 12306 ticket-booking system, and supermarket POS systems are all transactional data processing systems.

Such data processing systems share the following characteristics:

First, transactions are fine-grained operations, and each transaction involves only a small amount of data.

Second, the computation is relatively simple, usually consisting of only a few steps, such as modifying one column of one row.

Third, transactional processing involves insert, delete, update, and query operations, with very high requirements for data integrity and transactional consistency.

Fourth, transactional operations are real-time, interactive operations that must complete within at most a few seconds.

Fifth, given the characteristics above, indexing is a very important technology for supporting transactional processing.

When concurrency and data volume are modest, a stand-alone relational database such as Oracle, MySQL, or SQL Server, combined with high-availability measures such as data replication (Data Guard, RMAN, MySQL replication, etc.), is generally enough to meet business needs.
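As a minimal sketch of these properties, the following uses Python's standard sqlite3 module as a stand-in for a stand-alone relational database; the table, the index, and the values are invented for the example. It shows a fine-grained, index-supported operation executed atomically:

```python
import sqlite3

# In-memory database standing in for a stand-alone relational store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount_cents INTEGER)"
)
# An index on user_id supports the fine-grained point lookups typical of OLTP.
conn.execute("CREATE INDEX idx_orders_user ON orders(user_id)")

# A transaction: either both statements take effect, or neither does.
with conn:  # commits on success, rolls back on any exception
    conn.execute("INSERT INTO orders (user_id, amount_cents) VALUES (?, ?)", (42, 1999))
    conn.execute(
        "UPDATE orders SET amount_cents = amount_cents + 100 WHERE user_id = ?", (42,)
    )

row = conn.execute(
    "SELECT amount_cents FROM orders WHERE user_id = ?", (42,)
).fetchone()
print(row[0])  # 2099
```

Note that the `with conn:` block gives exactly the all-or-nothing consistency described above: if either statement raised an error, the insert would be rolled back as well.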

As data volume and transaction concurrency grow, systems can generally scale either through clustering approaches such as Oracle RAC, or through hardware upgrades (minicomputers or mainframes, as used in banking systems, carrier billing systems, and securities systems).

For transactional workloads at Internet companies such as Taobao and 12306, the sheer data volume and high concurrency make distributed technology unavoidable. This introduces the problem of distributed transactions, which are hard to process efficiently, so these companies generally solve the problem by developing custom systems tailored to the characteristics of their business applications.

2 Statistical analysis

Statistical analysis is used by enterprises of all kinds to support management and operational decisions by analyzing the sales records and other business data produced by daily operations. Typical scenarios include fixed-period statistical reports for leadership (weekly, monthly, and so on), and analyses by the marketing department across various combinations of dimensions in order to develop appropriate marketing strategies.

Statistical analysis has the following characteristics:

First, statistics generally involve aggregation over large amounts of data; each statistical job touches a relatively large data volume.

Second, the computation is comparatively complex, often involving many GROUP BY clauses, subqueries, nested queries, window functions, aggregate functions, and sorts; some statistics can only be produced with complex SQL scripts.

Third, the real-time requirements are not as strict as for transactional operations. However, beyond fixed reports, more and more users now want interactive, real-time statistics.
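A toy illustration of the first two characteristics, again using Python's standard sqlite3 module (the sales table and its rows are invented for the example): an aggregation query that groups many rows by a dimension, which is the basic shape of a statistical report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("north", "tv", 100.0), ("north", "tv", 150.0),
     ("south", "tv", 200.0), ("south", "radio", 50.0)],
)

# A typical statistical query: aggregate over many rows, grouped by a dimension.
rows = conn.execute(
    "SELECT region, SUM(amount) AS total, COUNT(*) AS n "
    "FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 250.0, 2), ('south', 250.0, 2)]
```

Real statistical workloads differ from this sketch mainly in scale: the same GROUP BY touches millions of rows, and the query stacks subqueries, window functions, and sorts on top.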

Traditional statistical analysis is based mainly on data warehouses and MPP parallel databases. Using dimensional models and techniques such as precomputation, data is organized into structures suited to high-performance statistical analysis, supporting drill-down and roll-up operations so that statistics can be computed across various dimensions at various combinations of granularity.

In the statistical analysis field, to meet the need for interactive analysis, data warehouse systems based on in-memory computing have also become a trend, such as the SAP HANA platform.


3 Data mining

Data mining mainly means using mining algorithms, guided by business objectives, to automatically discover the patterns and knowledge hidden in massive amounts of data.

The main data mining process is: according to the analysis or mining goal, extract data from the database, organize it via ETL into wide tables suitable for the mining algorithms, and then mine it with data mining software. Traditional data mining software generally supports only small-scale processing on a single machine; constrained by this limitation, traditional mining analysis typically applies sampling to reduce the size of the data being analyzed.

Data mining far exceeds the first two categories in computational complexity and flexibility. First, because data mining problems are open-ended, mining involves computing large numbers of derived variables, and the variability of derived variables makes data preprocessing computationally complex. Second, many mining algorithms are themselves computationally expensive, especially the many machine learning algorithms that must iterate repeatedly to converge on an optimal solution, such as the K-means clustering algorithm and the PageRank algorithm.
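As an illustration of such iterative algorithms, here is a toy PageRank computed by power iteration in plain Python. The four-page link graph, the damping factor, and the convergence threshold are all made up for the example; the point is the shape of the computation: repeat a whole pass over the data until the result stops changing.

```python
# Hypothetical link graph: page -> list of pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
pages = list(links)
damping, n = 0.85, len(pages)
rank = {p: 1.0 / n for p in pages}

# Power iteration: each pass redistributes rank along the links,
# and the loop runs until the ranks converge.
for _ in range(100):
    new_rank = {p: (1 - damping) / n for p in pages}
    for page, outgoing in links.items():
        share = rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += damping * share
    if max(abs(new_rank[p] - rank[p]) for p in pages) < 1e-10:
        break
    rank = new_rank

print(max(rank, key=rank.get))  # "c" collects the most inbound links
```

This multi-pass, state-carrying structure is exactly what is hard to express in a single SQL statement, as the summary below notes.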

Overall, then, data mining analysis is characterized by:

1. The whole mining computation is complex. It is typically a computation flow in which data is exchanged between multiple steps, producing a large number of intermediate results, and it is difficult to express in a single SQL statement.

2. The computation must be expressible very flexibly, so much of it needs to be programmed in a high-level language.

Related technologies of transactional processing systems in the big data context

After large Internet companies such as Google, Facebook, and Taobao emerged, their registered online users and activity volumes grew enormously, so their trading systems had to solve the combined problem of "massive data + high concurrency + data consistency + high availability".

Judging from the information currently available, there is in fact no universal solution to this problem; each company custom-develops systems according to its own business characteristics. The common ideas, however, include the following:

(1) Database sharding: partition the data across multiple machines according to the characteristics of the business data.

(2) Caching: make full use of memory to solve the random-I/O efficiency problems encountered under high concurrency.

(3) Read/write splitting combined with data replication, which also increases system availability.

(4) Heavy use of asynchronous processing to absorb the impact of high concurrency.

(5) Avoiding distributed transactions wherever the actual business requirements allow.
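Idea (1) can be sketched as a simple hash-based routing function. The shard count and the choice of user_id as the split field are hypothetical; the point is that deterministic routing keeps all of one user's rows on one machine, which is also how idea (5), avoiding distributed transactions, is achieved.

```python
import hashlib

SHARDS = 4  # hypothetical number of database instances

def shard_for(user_id: int) -> int:
    """Route a row to a shard by hashing its split field (here user_id)."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % SHARDS

# Deterministic: the same user always lands on the same shard, so a
# single-user transaction never has to span machines.
assert shard_for(12345) == shard_for(12345)
print(sorted({shard_for(uid) for uid in range(1000)}))  # all shards get traffic
```

As the Cobar discussion below shows, the hard part in practice is not the routing function but choosing the split field so that the important transactions stay single-shard.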

1 Introduction to related systems

1) Ali Cobar system

Ali's Cobar is a distributed database system built on MySQL: a distributed database implemented through database middleware. Its predecessor was the "Amoeba" system developed by Chen Siru (which I have studied before). After Chen Siru left Ali for Shanda, Ali, wary of stability problems in Amoeba, redeveloped the project.

The system implements the database sharding ideas: data splitting, read/write separation, and replication. Because it only needs to satisfy transactional operations, it does not provide the sophisticated cross-database processing offered by true parallel database clusters (for example Teradata), so it has the following limitations:

(1) No support for cross-database joins, paging, sorting, or subqueries.

(2) INSERT and other modification statements must include the split (sharding) field.

(3) It presumably does not support cross-machine transactions (the original Amoeba did not).

Put plainly, such systems have no parallel computing capability; they are essentially database routers!

In addition, the key issue with such systems in practice is deciding which field to shard on, because poor sharding leads to distributed transaction problems.

2) Ali OceanBase system

This system was likewise custom-developed to handle Taobao's transactional processing under high concurrency and big data. Its main ideas and features are as follows:

(1) The team found that in the actual production environment, less than 1% of the data is updated per day, so they divided the overall data into two parts: baseline data and incremental update data.

(2) The baseline data is static and is kept in distributed storage.

(3) The incremental update data is stored and processed on a single server, in memory.

(4) When the system load is light, the incremental updates are merged into the baseline in batches.

(5) Data access reads the baseline data and the incremental update data simultaneously and merges the results.
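The ideas above can be sketched with plain dictionaries. The keys, values, and function names here are invented; in the real system the baseline lives in distributed storage and the delta on a dedicated in-memory update server, but the read-merge and compaction logic has this shape:

```python
# Baseline: static data (distributed storage in the real system).
baseline = {"order:1": {"amount": 100}, "order:2": {"amount": 200}}
# Delta: incremental updates, held in memory on a single update server.
delta = {}

def write(key, value):
    delta[key] = value  # every write goes to the single update server

def read(key):
    # A read merges the static baseline with the in-memory delta;
    # the delta wins because it is newer.
    return delta.get(key, baseline.get(key))

def compact():
    # During light load, fold the delta into the baseline and reset it.
    baseline.update(delta)
    delta.clear()

write("order:2", {"amount": 250})
print(read("order:2"))  # {'amount': 250}
compact()
print(delta)  # {}
```

The single write path is what makes the design's trade-off visible: writes are a single point (limited scalability) precisely so that no write ever becomes a distributed transaction.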

The benefits of this design are:

(1) Separation of read and write transactions.
(2) By sacrificing a little scalability (writes go through a single point), distributed transaction processing is avoided.

Note: although the system can handle highly concurrent transactional processing and is reputedly very fast on its hardware, it is in fact only a proprietary system custom-developed for e-commerce transactions; personally, I think its technical difficulty is lower than that of a general-purpose database such as Oracle. The system cannot be applied to banks or to 12306, because their transaction logic is far more complex than the processing logic for selling e-commerce goods.

In the current big data era, good solutions must be found by customizing for the specific application!

3) HBase-based trading systems

On the Hadoop platform, HBase is a distributed KV database, belonging to the category of real-time databases. Alipay's current payment records are stored in HBase.

HBase's interface is not a SQL interface but a KV interface (key-based access plus range scans over keys). So although HBase scales very well, the limitations of its interface narrow the range of upper-layer applications it can support. Designing an HBase-backed application therefore centers on key design: the key must be designed to support the access patterns the application requires.

HBase can be thought of as supporting an index only on the key column. Although there are schemes that add secondary indexes to HBase, maintaining secondary indexes is quite troublesome.
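A hypothetical example of such key design for a payment table: row keys in a KV store like HBase sort lexicographically, so a composite key of zero-padded user id plus a reversed timestamp supports "all payments of one user, newest first" with a single prefix scan. The field widths and the MAX_TS constant are made up for the illustration.

```python
MAX_TS = 10**13  # hypothetical upper bound on millisecond timestamps

def row_key(user_id: int, timestamp_ms: int) -> str:
    # Zero-pad so lexicographic order matches numeric order;
    # reverse the timestamp so newer records sort first.
    return f"{user_id:012d}:{MAX_TS - timestamp_ms:013d}"

keys = sorted([
    row_key(42, 1_700_000_000_000),   # older payment, user 42
    row_key(42, 1_700_000_100_000),   # newer payment, user 42
    row_key(7, 1_700_000_000_000),    # payment by user 7
])
# User 7's key sorts before user 42's, and within user 42 the newer
# payment comes first because of the timestamp reversal.
print(keys)
```

This is exactly the trade-off the text describes: the design works beautifully for the access pattern baked into the key, and supports little else without secondary indexes.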

2 The difference between concurrency and parallelism

Concurrency refers to executing many usually unrelated tasks at the same time; transactional systems, for example, are typically highly concurrent systems.

Parallel computing means dividing one large computing task into many small tasks and executing those small tasks in parallel, in order to shorten the overall computation time of the task.

The two main differences:

(1) Communication and coordination: in parallel computing, the many small tasks belong to one large task, so there are dependencies among them, requiring extensive coordination and communication between the small tasks. By contrast, concurrent tasks are basically independent of one another, with little correlation between tasks.

(2) Fault tolerance: because concurrent tasks are independent, the failure of one task does not affect the others. But in parallel computing, the many tasks belong to one big task, so if a subtask fails and cannot be recovered (whether through fine-grained or coarse-grained fault tolerance), the whole job fails.
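The parallel case can be illustrated by splitting one big task, a large sum, into cooperating subtasks whose partial results must be combined (a sketch using Python's standard concurrent.futures; the task size and worker count are arbitrary):

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(bounds):
    """One small task: sum a half-open range [lo, hi)."""
    lo, hi = bounds
    return sum(range(lo, hi))

n, workers = 1_000_000, 4
step = n // workers
# Split the one big task into independent chunks covering [0, n).
chunks = [(i * step, (i + 1) * step) for i in range(workers)]

with ThreadPoolExecutor(max_workers=workers) as pool:
    # The combine step is the coordination: every partial result is
    # needed, so losing any one subtask would fail the whole job.
    total = sum(pool.map(partial_sum, chunks))

print(total == sum(range(n)))  # True
```

By contrast, a concurrent transactional system would run many such small jobs for unrelated users, and dropping one would affect only that user.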

3 Chapter summary

A large volume of data does not necessarily require parallel computing. Even when the data volume is large and the data is stored in a distributed fashion, if each operation touches only a small amount of data and basically completes on one server, no parallel computing is involved; all that is needed is support for high-concurrency access through data replication, data caching, asynchronous processing, and so on.


Origin blog.csdn.net/yyu000001/article/details/90578141