Using EMR Spark Relational Cache to Synchronize Data Across Clusters

Related Relational Cache articles:

Using Relational Cache to Accelerate EMR Spark Data Analysis

Background

Relational Cache is an important feature of EMR Spark. It accelerates data analysis mainly through pre-computation and pre-organization of data, providing functionality similar to materialized views in traditional data warehouses. Beyond speeding up data processing, Relational Cache can also be applied to many other scenarios; this article describes how to use it to synchronize data tables across clusters.
Managing all data through a unified data lake is a goal many companies pursue. In reality, however, because of multiple data centers, different network regions, and even separate departments, there will inevitably be multiple clusters holding different data, and the need to synchronize data between them is widespread. In addition, cluster migrations commonly involve synchronizing data between the old and new clusters. Data synchronization has usually been a painful process: developing migration tools, handling incremental data, coordinating reads and writes during synchronization, and verifying the data afterwards all require substantial custom development and manual intervention. With Relational Cache, users can simplify this work and synchronize data across clusters at very little cost.
Below, we walk through a concrete example of how to synchronize data across clusters with EMR Spark Relational Cache.

Synchronizing Data with Relational Cache

Suppose we have two clusters, A and B, and we need to synchronize the activity_log table from cluster A to cluster B. Throughout the process, new data will continue to be inserted into activity_log. The table is created on cluster A with the following statement:

CREATE TABLE activity_log (
  user_id STRING,
  act_type STRING,
  module_id INT,
  d_year INT)
USING JSON
PARTITIONED BY (d_year)

Insert two rows representing historical data:

INSERT INTO TABLE activity_log PARTITION (d_year = 2017) VALUES("user_001", "NOTIFICATION", 10), ("user_101", "SCAN", 2)

Build a Relational Cache on the activity_log table:

CACHE TABLE activity_log_sync
REFRESH ON COMMIT
DISABLE REWRITE
USING JSON
PARTITIONED BY (d_year)
LOCATION "hdfs://192.168.1.36:9000/user/hive/data/activity_log"
AS SELECT user_id, act_type, module_id, d_year FROM activity_log

REFRESH ON COMMIT indicates that the cache is automatically refreshed whenever the source table's data is updated. LOCATION specifies where the cached data is stored; here we point it at the HDFS of cluster B, so that data is synchronized from cluster A to cluster B. The cache's fields and partitioning are kept consistent with those of the source table.
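Relational Cache also supports an on-demand refresh mode. As a sketch, assuming the REFRESH ON DEMAND clause described in the EMR documentation, the same cache could be declared to refresh only when explicitly requested:

CACHE TABLE activity_log_sync
REFRESH ON DEMAND
DISABLE REWRITE
USING JSON
PARTITIONED BY (d_year)
LOCATION "hdfs://192.168.1.36:9000/user/hive/data/activity_log"
AS SELECT user_id, act_type, module_id, d_year FROM activity_log

In this mode, data is copied to cluster B only when REFRESH TABLE activity_log_sync is executed, which can be useful when writes to the source table are frequent and continuous synchronization is too expensive.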

On cluster B, we also create an activity_log table, with the following statement:

CREATE TABLE activity_log (
  user_id STRING,
  act_type STRING,
  module_id INT,
  d_year INT)
USING JSON
PARTITIONED BY (d_year)
LOCATION "hdfs:///user/hive/data/activity_log"

Execute MSCK REPAIR TABLE activity_log to automatically repair the table's partition metadata, then run a query. On cluster B, you can now see the two rows that were previously inserted on cluster A.
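Concretely, the verification on cluster B consists of two statements (the SELECT is just one way to inspect the table):

MSCK REPAIR TABLE activity_log;
SELECT * FROM activity_log;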

image_1

Continue inserting new data on cluster A:

INSERT INTO TABLE activity_log PARTITION (d_year = 2018) VALUES("user_011", "SUBSCRIBE", 24);

Then, on cluster B, execute MSCK REPAIR TABLE activity_log and query the activity_log table again. You will find that the new data has been automatically synchronized to cluster B's activity_log table. For partitioned tables, when new data is added in a new partition, Relational Cache synchronizes the new partition incrementally rather than re-synchronizing all of the data.
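As before, the check on cluster B is a repair followed by a query; the SHOW PARTITIONS statement is an additional, optional way to confirm that the d_year=2018 partition has arrived:

MSCK REPAIR TABLE activity_log;
SHOW PARTITIONS activity_log;
SELECT * FROM activity_log WHERE d_year = 2018;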

image_2

If new data in cluster A's activity_log table is not inserted through Spark, but instead arrives through Hive or is written directly into the files of an external Hive table, users can trigger data synchronization manually or via a script with the REFRESH TABLE activity_log_sync statement. If the new data is imported in batches by partition, a single partition can be synchronized incrementally with a statement like REFRESH TABLE activity_log_sync WITH TABLE activity_log PARTITION (d_year=2018), as shown below.
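Collected as runnable statements, the two refresh forms from the paragraph above are (the partition value follows this article's example):

-- full refresh of the cache
REFRESH TABLE activity_log_sync;
-- incremental refresh of a single partition
REFRESH TABLE activity_log_sync WITH TABLE activity_log PARTITION (d_year=2018);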

Relational Cache ensures that the activity_log tables in cluster A and cluster B stay consistent. Downstream tasks or applications that depend on activity_log can be switched to cluster B at any time, while applications or services that write data to cluster A's activity_log can keep running. When ready, writes can be paused, pointed at cluster B's activity_log table, and the services restarted, completing the migration of the application or service. Afterwards, the activity_log and activity_log_sync tables on cluster A can be cleaned up.
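The final cleanup on cluster A might look like the following sketch; UNCACHE TABLE is the statement the EMR documentation provides for removing a Relational Cache, but verify it against your EMR version before running it:

-- on cluster A, after applications have been switched to cluster B
UNCACHE TABLE activity_log_sync;
DROP TABLE activity_log;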

Summary

This article has described how to use Relational Cache to synchronize data tables between different big data clusters, which turns out to be simple and convenient. Relational Cache can also be applied to many other scenarios, such as building OLAP platforms with sub-second response times, interactive BI and dashboard applications, and accelerating ETL pipelines. We will share more Relational Cache best practices for these scenarios in the future.


Source: yq.aliyun.com/articles/704649