To support the material that follows, the first tool we will learn alongside Hive is Sqoop. You will quickly find that Sqoop is the easiest of the big-data frameworks to pick up.
Sqoop is a tool for transferring data between Hadoop and relational databases in both directions: it can import data from a relational database (e.g. MySQL, Oracle, Postgres) into HDFS, and it can also export data from HDFS back into a relational database.
It also provides connectors for some NoSQL databases.
Like other ETL tools, Sqoop uses a metadata model to describe data types, so that data stays type-safe as it moves from the source system into Hadoop.
Sqoop is designed for bulk transfer of large data sets: it can split a data set into slices and create a Hadoop task to process each slice in parallel.
Despite these advantages, there are a few things to keep in mind when using Sqoop.
First, be careful with the default parallelism. Sqoop's parallel import assumes by default that the data is uniformly distributed across the range of the partition key. This works well when the source system generates primary keys with a sequence generator: on a 10-node cluster, the workload is spread evenly across the 10 servers. However, if the split column is an alphanumeric key, and there are, say, 20 times as many keys beginning with "A" as keys beginning with "M", then the workload becomes badly skewed from one server to the next.
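The split column and the degree of parallelism can both be set explicitly. A minimal sketch (the connection string, credentials, table, and column names here are hypothetical, not from the original text):

```shell
# Control the split column and mapper count explicitly so the
# partition-key skew described above does not dictate the split.
sqoop import \
  --connect jdbc:mysql://dbserver:3306/sales \
  --username etl --password-file /user/etl/.dbpass \
  --table orders \
  --split-by order_id \
  --num-mappers 10 \
  --target-dir /data/sales/orders
```

Choosing a numeric, evenly distributed column for `--split-by` is what keeps the 10 mappers evenly loaded.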
If performance is your chief concern, look into direct loading. Direct mode bypasses the usual JDBC import path and instead uses the bulk tools the database itself provides, such as MySQL's mysqldump.
Direct mode does have database-specific limitations, however. For example, the MySQL and PostgreSQL connectors cannot import BLOB and CLOB types in direct mode, and no direct driver supports importing from a view. Oracle's direct driver requires privileges to read metadata such as dba_objects and v_$parameter. Consult the documented limitations of the direct driver for your database.
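Enabling direct mode is a single flag. A sketch with hypothetical connection details:

```shell
# --direct delegates the transfer to mysqldump instead of JDBC;
# remember the BLOB/CLOB and view limitations noted above.
sqoop import \
  --connect jdbc:mysql://dbserver:3306/sales \
  --username etl --password-file /user/etl/.dbpass \
  --table orders \
  --direct \
  --target-dir /data/sales/orders
```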
Incremental imports are the efficiency topic that comes up most often, since Sqoop is built specifically for large data sets. Sqoop supports incremental updates: it can append the records added since the last import, or pick up rows modified after a specified timestamp.
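Both incremental modes are driven by a check column and the last value seen. A sketch (table, column, and value are hypothetical):

```shell
# Append mode: import only rows whose order_id exceeds the value
# recorded at the end of the previous run.
sqoop import \
  --connect jdbc:mysql://dbserver:3306/sales \
  --username etl --password-file /user/etl/.dbpass \
  --table orders \
  --incremental append \
  --check-column order_id \
  --last-value 100000 \
  --target-dir /data/sales/orders

# Alternatively, pick up rows changed since a timestamp:
#   --incremental lastmodified --check-column updated_at \
#   --last-value "2024-01-01 00:00:00"
```

Sqoop prints the new `--last-value` at the end of each run, which you feed into the next one (or let a saved job manage for you).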
Given Sqoop's ability to move data into and out of relational databases, it is no surprise that it has dedicated support for Hive, the well-known SQL-like data warehouse of the Hadoop ecosystem. The "create-hive-table" tool can be used to import a table definition into Hive.
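A sketch of the Hive integration, with hypothetical connection details and table names:

```shell
# --hive-import derives a Hive table definition from the source
# schema, creates the table, and loads the imported data into it.
sqoop import \
  --connect jdbc:mysql://dbserver:3306/sales \
  --username etl --password-file /user/etl/.dbpass \
  --table orders \
  --hive-import \
  --hive-table sales.orders

# create-hive-table on its own copies just the table definition,
# without moving any data:
sqoop create-hive-table \
  --connect jdbc:mysql://dbserver:3306/sales \
  --username etl --password-file /user/etl/.dbpass \
  --table orders \
  --hive-table sales.orders
```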
Versions (the two major versions are completely incompatible; everything up to this point uses Sqoop 1):
sqoop1: 1.4.x
sqoop2: 1.99.x
Similar products
DataX: Alibaba's flagship data exchange tool
Note that "import" and "export" here are always relative to Hadoop!
Import: data flows into HDFS:
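A minimal import sketch (all connection details, credentials, and table names are hypothetical):

```shell
# Copy the MySQL table `orders` into HDFS as tab-delimited text;
# -P prompts for the password interactively.
sqoop import \
  --connect jdbc:mysql://dbserver:3306/sales \
  --username etl -P \
  --table orders \
  --target-dir /data/sales/orders \
  --fields-terminated-by '\t'
```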
Export: data flows from HDFS out to a relational database:
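A minimal export sketch mirroring the import above (names again hypothetical; the target table must already exist in the database):

```shell
# Push the tab-delimited files under /data/sales/orders_out back
# into the MySQL table `orders_summary`.
sqoop export \
  --connect jdbc:mysql://dbserver:3306/sales \
  --username etl -P \
  --table orders_summary \
  --export-dir /data/sales/orders_out \
  --input-fields-terminated-by '\t'
```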