Phoenix Practice | Phoenix Data Migration

1. Overview

To some extent, the richness of its data migration tools determines how popular a database and its ecosystem become, and knowing these tools makes migration work more efficient. This article introduces Phoenix's data import and export tools, in the hope of helping readers who are preparing to migrate data to Phoenix.


2. Data import and export instructions

Because the source keeps receiving new writes and modifications while data is being imported into Phoenix, real-time migration without stopping the business is difficult. The open source migration tools currently available require the data source to stop serving before the migration can complete.

For readers planning to migrate to Alibaba Cloud HBase, this is not a problem: it supports real-time, zero-downtime migration based on HFile copying plus WAL synchronization, parsing, and replay.

There are two types of import methods: BulkLoad import and API-based import/export (e.g., with DataX), covered in the next two sections.



3. BulkLoad data import

BulkLoad can import data directly into Phoenix tables, or into HBase tables over which Phoenix mappings are then created (the latter method is not covered here). The BulkLoad tools that import directly into Phoenix tables support the following data sources:

  • CSV data: CsvBulkLoadTool

  • JSON data: JsonBulkLoadTool

  • Text matched by regular expressions: RegexBulkLoadTool

  • ODPS tables: ODPSBulkLoadTool (only supported on cloud HBase); for details, see https://yq.aliyun.com/articles/691980

The Csv/Json/Regex BulkLoad tool classes are already provided in the open source Phoenix distribution; run any of them with --help to see the full list of parameters. Usage examples:

# Variant 1: build the classpath from `hbase mapredcp` plus the HBase config directory
HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf \
hadoop jar phoenix-<version>-client.jar \
org.apache.phoenix.mapreduce.CsvBulkLoadTool \
--table EXAMPLE \
--input /data/example.csv

# Variant 2: put hbase-protocol.jar and the HBase config directory on the classpath
HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf \
hadoop jar phoenix-<version>-client.jar \
org.apache.phoenix.mapreduce.CsvBulkLoadTool \
--table EXAMPLE \
--input /data/example.csv

# Loading JSON data with JsonBulkLoadTool
hadoop jar phoenix-<version>-client.jar \
org.apache.phoenix.mapreduce.JsonBulkLoadTool \
--table EXAMPLE \
--input /data/example.json
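
The list above also includes RegexBulkLoadTool, for which no example is shown. A minimal sketch, following the same invocation pattern as the tools above (the --regex value, three comma-separated capture groups, is only an illustration; adapt it to your input format):

# Load text lines parsed by a regular expression with RegexBulkLoadTool
hadoop jar phoenix-<version>-client.jar \
org.apache.phoenix.mapreduce.RegexBulkLoadTool \
--table EXAMPLE \
--regex '([^,]*),([^,]*),([^,]*)' \
--input /data/example.txt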


4. API data import and export

DataX is an offline data synchronization tool/platform widely used within Alibaba. It supports efficient data synchronization between many common heterogeneous data sources. Its principle is to read multiple data shards in parallel with multiple threads and write them into the target data source through its API. It now provides a plugin for Phoenix 4.12 and above, which covers everyday needs such as importing from a relational database into Phoenix, importing from ODPS into Phoenix, and exporting from Phoenix to CSV text. For details, see:

https://github.com/alibaba/DataX
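
As a rough illustration of how such a synchronization job is wired up, the sketch below writes a minimal DataX job file that reads a local CSV file and writes into a Phoenix table, then runs it. The writer plugin name (hbase11xsqlwriter) and its parameter keys are assumptions based on the DataX repository and should be verified against its documentation; the paths, ZooKeeper quorum, table, and column names are placeholders.

# Minimal DataX job sketch: CSV file -> Phoenix table.
# Plugin and parameter names below are assumptions; verify against the DataX docs.
cat > phoenix_job.json <<'EOF'
{
  "job": {
    "setting": { "speed": { "channel": 4 } },
    "content": [{
      "reader": {
        "name": "txtfilereader",
        "parameter": {
          "path": ["/data/example.csv"],
          "encoding": "UTF-8",
          "fieldDelimiter": ",",
          "column": [
            { "index": 0, "type": "string" },
            { "index": 1, "type": "string" }
          ]
        }
      },
      "writer": {
        "name": "hbase11xsqlwriter",
        "parameter": {
          "batchSize": "256",
          "column": ["ID", "NAME"],
          "hbaseConfig": { "hbase.zookeeper.quorum": "zk1,zk2,zk3" },
          "table": "EXAMPLE"
        }
      }
    }]
  }
}
EOF
python bin/datax.py phoenix_job.json

The channel count and batchSize above are also the main knobs for tuning DataX throughput, as noted in the summary below.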


5. Summary

For a full load of source data whose primary keys do not repeat, we recommend BulkLoad via MapReduce (cloud HBase itself does not provide MapReduce capability, so an external Hadoop environment with access to the HDFS of both the source and target clusters is required). For daily incremental data synchronization, DataX can be used (importing data into cloud HBase requires an ECS instance that can access both the source and target clusters to run DataX).

To improve BulkLoad write throughput, increase the number of regions in the target Phoenix table (new tables should specify pre-split points or add salt buckets) and scale out/up the cluster that runs the MapReduce jobs. For DataX, throughput is tuned mainly by adjusting the configured number of channels (threads) and the batch size; here too, the target table should not have too few regions.
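
For example, a new table can be pre-split at creation time by salting. A minimal sketch using Phoenix's sqlline.py (the schema, bucket count, and ZooKeeper quorum are placeholders to adapt):

# Create a salted Phoenix table so writes spread across 16 pre-split regions.
# Schema, bucket count, and ZooKeeper quorum are illustrative placeholders.
cat > create_example.sql <<'EOF'
CREATE TABLE EXAMPLE (
    ID   VARCHAR NOT NULL PRIMARY KEY,
    NAME VARCHAR
) SALT_BUCKETS = 16;
EOF
sqlline.py zk1,zk2,zk3 create_example.sql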

Finally, for data volumes on the order of tens of millions of rows, DataX is recommended, because it is simple and easy to use. :)


References

  • https://phoenix.apache.org/bulk_dataload.html




