Hands-on case study: replacing Impala with Spark + CarbonData in a telecom big-data scenario with single tables at the ten-billion-record level


[Background]

A site of a domestic mobile carrier uses the Impala component to process telecom service detail records (CDRs), handling roughly 100TB of detail data per day, with single detail tables exceeding ten billion records per day. The following problems arose while using Impala:

  1. Detail records are stored in Parquet format, and the tables are partitioned by time + MSISDN. When an Impala query does not hit the partition columns, query performance is poor.
  2. Many performance issues were encountered while using Impala (for example, catalog metadata bloat leading to slow metadata synchronization), and concurrent query performance was poor.
  3. Impala is an MPP architecture that scales only to about a hundred nodes. Once concurrency reaches roughly 20 queries, the throughput of the whole system is saturated, and adding nodes does not improve it.
  4. Impala resources cannot be managed and scheduled by YARN, so the Hadoop cluster cannot share resources dynamically among Impala, Spark, Hive, and other components, and the detail-record query capability opened to third parties cannot be resource-isolated.

[Solution]

For the series of problems above, our big data team analyzed the issues and carried out a technology selection for this mobile carrier site. Using the site's typical business scenarios as input, we built prototypes and ran verification and performance tuning with Spark + CarbonData, Impala 2.6, HAWQ, Greenplum, and Sybase IQ. In the process we optimized CarbonData's data-loading and query performance for our business scenarios and contributed the improvements back to the CarbonData open-source community. In the end we chose the Spark + CarbonData solution, a typical SQL-on-Hadoop approach, which also indirectly confirms the trend of traditional data warehouses migrating to SQL on Hadoop.

Based on the community's official documentation, combined with our own verification tests and understanding: CarbonData is a high-performance data storage solution for the Hadoop big-data ecosystem, with especially noticeable acceleration on large data volumes. It is deeply integrated with Spark and compatible with the whole Spark ecosystem (SQL, ML, DataFrame, etc.). Spark + CarbonData can meet the data needs of a variety of business scenarios and provides the following capabilities:

  1. Storage: hybrid row and columnar file storage; the columnar storage is similar to Parquet and ORC, and the row storage is similar to Avro. Supports multiple index structures for detail-style data such as bills, logs, and transaction records.
  2. Compute: deep integration and optimization with the Spark compute engine; also supports integration with Presto, Flink, Hive, and other engines.
  3. Interfaces:
    1. API: compatible with Spark-native APIs such as DataFrame, MLlib, and PySpark;
    2. SQL: compatible with Spark's base syntax, and extends it with CarbonSQL (update/delete, indexes, pre-aggregation tables, etc.).
  4. Data management:
    1. Supports incremental loading and batch data management (data aging)
    2. Supports data update and delete
    3. Supports integration with Kafka for near-real-time ingestion

       For detailed introductions and usage of the key features, see the official documentation at https://carbondata.apache.org/
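To make the capabilities above concrete, here is a minimal quick-start sketch, assuming the CarbonData 1.5/1.6 releases that were current at the time (newer 2.x releases create the session through the spark.sql.extensions mechanism instead of CarbonSession). The table name, columns, and HDFS paths are illustrative, not taken from the original deployment.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._ // brings in getOrCreateCarbonSession

object CarbonQuickStart {
  def main(args: Array[String]): Unit = {
    // Spark session with CarbonData support; the argument is the Carbon store path.
    val carbon = SparkSession.builder()
      .appName("carbondata-quickstart")
      .getOrCreateCarbonSession("hdfs:///user/carbon/store")

    // Detail-record (CDR) table stored in CarbonData format
    // (older releases use STORED BY 'carbondata' instead of STORED AS).
    carbon.sql(
      """CREATE TABLE IF NOT EXISTS cdr_detail (
        |  msisdn STRING, start_time TIMESTAMP, duration INT, fee DECIMAL(10,2)
        |) STORED AS carbondata""".stripMargin)

    // Incremental load of one batch of raw CSV files.
    carbon.sql(
      """LOAD DATA INPATH 'hdfs:///raw/cdr/2018-09-01/' INTO TABLE cdr_detail
        |OPTIONS ('FILEHEADER' = 'msisdn,start_time,duration,fee')""".stripMargin)

    // Typical detail query filtered on the subscriber number.
    carbon.sql(
      "SELECT msisdn, SUM(fee) AS total_fee FROM cdr_detail " +
      "WHERE msisdn = '13800000000' GROUP BY msisdn").show()
  }
}
```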

[Technology Selection]

Here we add a brief explanation of why we chose SQL-on-Hadoop technology as the final solution.

Anyone who has worked with big data knows its "5V" characteristics. From traditional Internet data to mobile Internet data, and now to the booming IoT, each industry advance has brought two to three orders of magnitude of data growth, and that growth is still accelerating. The five Vs of big data, covering the mobile Internet and IoT, are: Volume, Velocity, Variety, Value, and Veracity. As data volumes grow, traditional data warehouses face more and more challenges.

Challenges for traditional data warehouses:

Meanwhile, data systems themselves keep evolving.

Evolution of storage:

• Evolution of availability: offline and nearline -> all online

• Evolution of storage infrastructure: centralized storage -> distributed storage

• Evolution of storage models: fixed structure -> flexible structure

Evolution of data processing modes:

• Fixed algorithms on fixed models -> flexible algorithms on flexible models

Evolution of data processing types:

• Structured, single-source, centralized computing -> multi-structured, multi-source, distributed computing

Evolution of data processing architecture:

• Static database processing -> real-time / streaming / massive data processing

Ralph Kimball, one of the fathers of data warehousing, made the following observation about these changes.

   Kimball's core idea:

  Hadoop is changing how data is processed compared with a traditional data warehouse: what is a single processing unit in a conventional database is decoupled into three layers in Hadoop:

• Storage layer: HDFS

• Metadata layer: HCatalog

• Query layer: Hive, Impala, Spark SQL

Schema on Read gives users more choices:

• Import data into the storage layer in its original format

• Manage the target data structure through the metadata layer

• Decide at the query layer when and how to extract the data

• After long-term exploration, once users are familiar with the data, they can switch to Schema on Write, materialize the intermediate tables, and improve query performance (a small sketch of this flow follows)
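As a concrete illustration of this Schema-on-Read flow, here is a small Spark sketch (the paths, column names, and target table are illustrative assumptions): raw files stay in their original format, a schema is applied only when they are read, and a frequently used result can later be materialized as a curated table in Schema-on-Write style.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object SchemaOnReadDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("schema-on-read-demo")
      .enableHiveSupport() // needed for saveAsTable below
      .getOrCreate()

    // The schema is declared at query time, not when the files were ingested.
    val cdrSchema = StructType(Seq(
      StructField("msisdn", StringType),
      StructField("start_time", StringType),
      StructField("fee", DoubleType)))

    // Read the raw files in place, applying the schema on read.
    val rawCdr = spark.read.schema(cdrSchema).csv("hdfs:///raw/cdr/")

    // Explore the raw data directly.
    rawCdr.where("fee > 100").show()

    // Once the access pattern is stable, materialize an intermediate table
    // (Schema on Write) to speed up recurring queries; assumes a 'dw' database exists.
    rawCdr.write.mode("overwrite").saveAsTable("dw.cdr_curated")
  }
}
```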

| No. | RDBMS-based data processing | Hadoop-based data processing |
|-----|-----------------------------|------------------------------|
| 1 | Strong consistency | Eventual consistency; processing efficiency is valued over strict accuracy |
| 2 | Data must be transformed, otherwise subsequent processing cannot continue | Data need not be transformed and can be stored long-term in its original format |
| 3 | Data must be cleansed and normalized | Cleansing and normalization of the data are not recommended |
| 4 | Data is mostly stored in physical tables; accessing files is inefficient | Most data is stored as files; a table is essentially a structured description of the files |
| 5 | Metadata is limited to dictionary tables | Metadata is extended into services such as HCatalog |
| 6 | SQL is the only data processing engine | Open data processing engines: SQL, NoSQL, Java API |
| 7 | Data processing is fully controlled by IT staff | Data engineers, data scientists, and data analysts can all participate in data processing |

SQL on Hadoop data warehouse:

• Data processing and analysis

     • SQL on Hadoop

     • Impala + Kudu, Spark, HAWQ, Presto, Hive, etc.

• Data modeling and storage

     • Schema on Read

     • Avro, ORC, Parquet, CarbonData

• Stream processing

     • Flume + Kafka + Spark Streaming

SQL-on-Hadoop technology continues to develop and mature.

After the analysis above, we ultimately chose SQL on Hadoop as the direction in which our platform's data warehouse technology will evolve. Some may ask why we did not choose MPPDB technology instead; here is a comparative analysis of SQL on Hadoop versus MPPDB (note that Impala is in fact an MPPDB-like technology):

| Comparison item | SQL on Hadoop | MPPDB |
|-----------------|---------------|-------|
| Fault tolerance | Fine-grained fault tolerance: a failed task is retried automatically without resubmitting the entire query | Coarse-grained fault tolerance, and no handling of straggler nodes: a failed task causes the whole query to fail, and the system must resubmit the entire query to get results |
| Scalability | The cluster can scale to hundreds or even thousands of nodes | Hard to scale beyond 100 nodes, typically around 50 (for example, in our earlier verification Greenplum performance declined once the cluster exceeded 32 machines) |
| Concurrency | As cluster size and available resources grow, the number of supported concurrent queries grows nearly linearly | MPPDB maximizes resource usage per query to improve its performance, so supported concurrency is low; at roughly 20 concurrent queries the whole system is saturated |
| Query latency | 1) For data below 1PB and single tables at the billion-record level, single-query latency is usually around 10s; 2) For data above 1PB, query performance can be maintained by adding cluster resources | 1) For data below 1PB and single tables at the billion-record level, a single MPP query usually returns within milliseconds to seconds; 2) For data above 1PB, architectural limits may cause query performance to drop sharply |
| Data sharing | Storage and compute are separated; common storage formats can be shared by different analysis engines, including data-mining engines | MPPDB's proprietary storage format cannot be used directly by other data analysis engines |

[Implementation Results]

The site went live with Spark + CarbonData replacing Impala at the end of September 2018 and has been running ever since, processing more than 100TB of detail records per day. During peak periods, data-loading performance improved from an average of 60MB/s per node with Impala to 100MB/s per node. In the site's typical business scenarios, at 20 concurrent queries, Spark + CarbonData query performance is more than twice that of Impala + Parquet.

At the same time, it solved the following problems:

  1. Hadoop cluster resource sharing: Impala resources could not be brought under YARN's unified resource scheduling, whereas Spark + CarbonData is scheduled by YARN and can share resources dynamically with other components such as Spark and Hive (see the configuration sketch after this list).
  2. Hadoop cluster scalability: Impala could only use about a hundred machines, whereas Spark + CarbonData scales to clusters with thousands of nodes.
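As a sketch of the unified YARN scheduling mentioned in item 1 (the queue name and executor figures are illustrative assumptions, not the site's actual settings), a Spark + CarbonData query service can be submitted to a YARN queue with dynamic allocation enabled, so idle resources flow back to Hive and other tenants:

```scala
import org.apache.spark.sql.SparkSession

object YarnManagedQueryService {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cdr-query-service")
      .master("yarn")                                       // resources scheduled by YARN
      .config("spark.yarn.queue", "detail_query")           // dedicated queue isolates third-party queries
      .config("spark.dynamicAllocation.enabled", "true")    // grow/shrink with the query load
      .config("spark.dynamicAllocation.minExecutors", "4")
      .config("spark.dynamicAllocation.maxExecutors", "200")
      .config("spark.shuffle.service.enabled", "true")      // required by dynamic allocation
      .getOrCreate()

    // Queries submitted through this session compete for cluster resources
    // under YARN's scheduler, together with Spark, Hive, and other workloads.
    spark.sql("SELECT COUNT(*) FROM cdr_detail").show()
  }
}
```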

Notes from the implementation process (items 1 to 4 are illustrated in the DDL sketch after the list):

  1. Data is loaded using CarbonData's local-sort mode. To avoid generating too many small files on a large cluster, only a small number of machines are designated to perform the loading; in addition, for tables whose per-load data volume is small, table-level compaction can be triggered to merge the small files produced during loading.
  2. Based on the query traffic characteristics, the fields frequently used as filters are set as the table's sort columns (for example, the subscriber number, which is queried frequently in telecom services). Within the sort columns, fields are ordered by query frequency from high to low, with the most frequently queried field first; fields with similar query frequency are ordered by their number of distinct values from high to low. This improves query performance.
  3. Set the blocksize when creating a table: the file block size of a single table can be defined through TBLPROPERTIES, in MB, with a default of 1024MB. Choose it according to the volume of data loaded into the table each time; in our experience, 256MB is recommended for tables with small data volumes and 512MB for tables with large data volumes.
  4. For query-performance tuning, also consider the business's query characteristics and create datamaps, such as a bloomfilter datamap, on high-frequency query fields to improve query performance.
  5. Some Spark parameters are also relevant to data loading and queries. First use the Spark UI to analyze the performance bottleneck, then adjust the related parameters in a targeted way; they are not listed one by one here. Keep in mind that performance tuning is a craft: adjust parameters with a clear goal, change only one or a few related parameters at a time, check the effect, roll back if there is no improvement, and never change too many parameters at once.
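The table-level notes in items 1 to 4 can be combined into a single DDL sketch, assuming the same CarbonData 1.5/1.6-era syntax as the quick-start above (in CarbonData 2.x the DATAMAP statements became CREATE INDEX); the table name, columns, and sizes are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

object CdrTableTuning {
  def main(args: Array[String]): Unit = {
    val carbon = SparkSession.builder()
      .appName("cdr-table-tuning")
      .getOrCreateCarbonSession("hdfs:///user/carbon/store")

    // Sort columns ordered by query frequency (subscriber number first),
    // local sort at load time, and a 512 MB block size for a large table.
    carbon.sql(
      """CREATE TABLE IF NOT EXISTS cdr_detail (
        |  msisdn STRING, start_time TIMESTAMP, cell_id STRING, fee DECIMAL(10,2)
        |) STORED AS carbondata
        |TBLPROPERTIES (
        |  'SORT_COLUMNS'    = 'msisdn, start_time',
        |  'SORT_SCOPE'      = 'LOCAL_SORT',
        |  'TABLE_BLOCKSIZE' = '512')""".stripMargin)

    // Bloom filter datamap on the highest-frequency filter column.
    carbon.sql(
      """CREATE DATAMAP dm_msisdn ON TABLE cdr_detail
        |USING 'bloomfilter'
        |DMPROPERTIES ('INDEX_COLUMNS' = 'msisdn')""".stripMargin)

    // Merge the small files/segments produced by incremental loads.
    carbon.sql("ALTER TABLE cdr_detail COMPACT 'MINOR'")
  }
}
```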


