Opening up big data visually: how does Tableau connect to Hadoop Hive?

Preface

Hadoop Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides full SQL query capabilities; SQL statements are converted into MapReduce jobs for execution. Its main advantage is a low learning cost: simple MapReduce statistics can be produced quickly through SQL-like statements, without developing dedicated MapReduce applications, which makes Hive well suited to the statistical analysis of data warehouses. This article describes in detail how Tableau connects to Hadoop Hive and what to watch out for.

The following links are the blogger's carefully compiled Tableau learning tutorials, covering both the basics and advanced topics; readers who need them can subscribe on their own.

Tableau Visual Data Analysis Advanced Tutorial

https://blog.csdn.net/wenyusuran/category_9596753.html

Tableau visual data analysis knowledge points in detail

https://blog.csdn.net/wenyusuran/category_9274958.html

 

1. Introduction to Hadoop


Hadoop exists because it is well suited to storing and processing big data. A Hadoop cluster is mainly composed of two parts: a distributed file system for storing the data (HDFS) and a framework for computing over it (MapReduce).

1.1 Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is a file system implementation, similar in concept to NTFS, ext3, or ext4, but built at a higher level. Files stored on HDFS are divided into blocks (64 MB each by default, far larger than ordinary file system blocks) that are distributed across multiple machines, and each block is redundantly replicated (3 copies by default) to strengthen the fault tolerance of the file system. This storage model complements the MapReduce computation model described later. The main components of HDFS are the following:
 

1. Name Node (NameNode)

The name node stores the metadata of the entire file system, which makes it a critically important role. Metadata is loaded into memory when the cluster starts, and changes to the metadata are also written to a file system image on disk while an edit log of those changes is maintained. When HDFS stores files, the files are divided into logical blocks, and the mapping between files and blocks is kept on the name node. If the name node is damaged, the data of the entire cluster becomes unavailable. We can take measures to back up the name node's metadata, for example by configuring the name node directory as both a local directory and an NFS directory, so that any metadata change is written to both locations as a redundant backup. Writes to the two directories are atomic, so if the active name node goes down we can use the backup files on NFS to restore the file system.
 

2. Secondary Name Node (SecondaryNameNode)
The secondary name node periodically merges the namespace image with the edit log to prevent the edit log from growing too large. However, its state lags behind that of the primary name node, so if the primary name node goes down, some data loss is unavoidable.
 

3. Data Node (DataNode)
Data nodes are where HDFS actually stores data, and there are usually many of them. Besides providing storage, each data node periodically reports the list of blocks it holds to the name node. The name node therefore does not need to persistently store which data node holds each block of each file; that information is reconstructed from the data node reports after the system starts.
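To make the block size and replication figures above concrete, here is a minimal arithmetic sketch; the 200 MB file size is purely hypothetical, and real HDFS accounting has more detail (for example, the last block of a file is usually smaller than the block size).

```python
import math

block_size_mb = 64    # HDFS default block size described above
replication = 3       # default number of redundant copies per block
file_size_mb = 200    # hypothetical file, used only for illustration

logical_blocks = math.ceil(file_size_mb / block_size_mb)   # 4 logical blocks
stored_copies = logical_blocks * replication               # 12 block copies spread over data nodes

print(f"{logical_blocks} logical blocks, {stored_copies} stored block copies")
```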

1.2 MapReduce computing framework


The MapReduce computing framework is a distributed computing model. Its core idea is to decompose a task into small sub-tasks that different machines compute at the same time, and then combine the partial results from each machine into the final result. The model itself is very simple: in general only two interfaces (map and reduce) need to be implemented; the key is how to turn an actual problem into a MapReduce job. Hadoop's MapReduce is mainly composed of the following two parts:
 

1. Job tracking node (JobTracker)
The JobTracker is responsible for task scheduling (different scheduling strategies can be configured) and status tracking. Somewhat like the name node in HDFS, the JobTracker is also a single point of failure, which may be improved in future versions.
 

2. Task tracking node (TaskTracker)
The TaskTracker is responsible for executing the actual tasks. It reports its status to the JobTracker through a "heartbeat" mechanism, and the JobTracker assigns tasks to it based on the reported status. The TaskTracker starts a new JVM to run each task, although JVM instances can also be reused.
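As an illustration of the decompose-and-merge idea described above, here is a minimal word-count sketch in the style of Hadoop Streaming. It is not part of the Tableau workflow, and the mapper/reducer split into one file is only a convenience for this example: the mapper emits (word, 1) pairs, and the reducer merges the counts for each word.

```python
#!/usr/bin/env python
# Word count in the Hadoop Streaming style: map emits key/value pairs,
# reduce merges values per key. Run with "map" or "reduce" as the argument.
import sys


def run_mapper():
    # The "map" step, executed on many nodes in parallel.
    # Reads raw text from stdin and emits one "word<TAB>1" pair per word.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")


def run_reducer():
    # The "reduce" step. Hadoop Streaming sorts the mapper output by key,
    # so all pairs for the same word arrive consecutively and can be summed.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")


if __name__ == "__main__":
    # In a real Hadoop Streaming job these would be two separate scripts,
    # passed to hadoop-streaming via -mapper and -reducer.
    run_mapper() if sys.argv[1:] == ["map"] else run_reducer()
```

Submitted as a streaming job, the JobTracker would schedule many copies of the map step across TaskTrackers and then feed the sorted intermediate pairs to the reduce step, exactly the decompose-then-merge pattern described above.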

 

2. Basic conditions for connection


Hadoop Hive is a technology for working with data on a Hadoop cluster by mixing traditional SQL expressions with Hadoop-specific advanced data analysis and transformation operations. Tableau works with Hadoop through Hive to provide an environment that requires no programming. Tableau supports connecting through Hive and the data source's Hive ODBC driver to data stored in the Cloudera, Hortonworks, MapR, and Amazon EMR (Elastic MapReduce) distributions.

2.1 Hive version

The following describes the prerequisites and external resources for the connection. To connect to Hive Server, one of the following conditions must be met: a Cloudera distribution containing Apache Hadoop CDH3u1 or higher, including Hive 0.7.1 or higher; Hortonworks; MapR Enterprise Edition (M5); or Amazon EMR.

To connect to Hive Server 2, one of the following conditions must be met: a Cloudera distribution including Apache Hadoop CDH4u1; Hortonworks HDP 1.2; MapR Enterprise Edition (M5) with Hive 0.9+; or Amazon EMR. In addition, the appropriate Hive ODBC driver must be installed on every computer running Tableau Desktop or Tableau Server.

2.2 Driver


For Hive Server or Hive Server 2, the Cloudera, Hortonworks, MapR, or Amazon EMR ODBC driver must be downloaded and installed from the "Driver" page. Cloudera (Hive): Cloudera ODBC driver for Apache Hive 2.5.x; for Tableau Server 8.0.8 or higher, driver 2.5.0.1001 or higher is required.

Cloudera (Impala): Cloudera ODBC driver for Impala 2.5.x; if you connect to the Beeswax service on Cloudera Hadoop, you must use the Cloudera ODBC 1.2 connector for Tableau on Windows.

Hortonworks: Hortonworks Hive ODBC driver 1.2.x.

MapR: MapR_odbc_2.1.0_x86.exe or higher, or MapR_odbc_2.1.0_x64.exe or higher.

Amazon EMR: Hive ODBC.zip or Impala ODBC.zip. If another driver version has been installed, uninstall the driver first, and then install the corresponding version provided on the "Driver" page.

2.3 Start Hive service

Type the following command using the terminal interface of the Hadoop cluster:

hive --service hiveserver


The above command terminates when you exit the terminal session on the cluster, so you will usually want the Hive service to keep running. To move the Hive service to the background, type the following command:

nohup HIVE_PORT=10000 hive --service hiveserver &


For long-term use, you should configure Hive to start automatically with the cluster. Hive metadata describes the structure and location of Hive tables and must be kept somewhere that allows continuous read/write access; by default, Hive uses Derby as its metadata store.

Although Derby does not support concurrent use by multiple Hive instances, this is not a problem for external clients such as Tableau: the Hive service itself runs as the single accessor of the Derby metadata database while still supporting concurrent access from multiple external clients. If you plan to use Hive in long-term production, consider a multi-user metadata repository such as a PostgreSQL database; this does not change the way Tableau interacts with Hive.
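Before pointing Tableau at the cluster, it can be useful to confirm that the Hive service is actually accepting client connections. Below is a minimal sketch using the third-party pyhive package; it assumes a Hive Server 2 instance listening on port 10000 (the older Hive Server uses a different Thrift protocol, so this sketch does not apply to it), and the host name and user name are placeholders.

```python
# A quick connectivity check, independent of Tableau.
# Requires: pip install "pyhive[hive]"
from pyhive import hive

conn = hive.Connection(
    host="hadoop-master",   # placeholder: your HiveServer2 host
    port=10000,             # the HIVE_PORT used when starting the service
    username="hive",        # placeholder user name
)
cursor = conn.cursor()
cursor.execute("SHOW TABLES")
for table in cursor.fetchall():
    print(table)
cursor.close()
conn.close()
```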

 

3. Main steps to connect


In Tableau Desktop, select the appropriate connector (Cloudera Hadoop, Hortonworks Hadoop Hive, MapR Hadoop Hive, or Amazon EMR), and then enter the information required for the connection.

3.1 Cloudera Hadoop

Click Cloudera Hadoop under "Connect" on the start page, and then do the following:
 

(1) Enter the name and port number of the server hosting the database; port 21050 is the default for the 2.5.x drivers.
(2) In the "Type" drop-down list, select the type of database to connect to: Hive Server, Hive Server 2, or Impala.
(3) In the "Authentication" drop-down list, select the authentication method to use.
(4) Click "Initial SQL" to specify a SQL command to run once at connection time.
(5) Click the "Sign In" button.

If the connection is unsuccessful, verify that the user name and password are correct. If the connection still fails, the computer is having trouble locating the server; contact your network administrator or database administrator.
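If the steps above keep failing and you want to rule out Tableau itself, one option (a hedged sketch, not an official Tableau tool) is to query the same ODBC driver from a short Python script with pyodbc; the DSN name below is a placeholder that should match whatever you configured in the ODBC Data Source Administrator after installing the driver.

```python
# Sanity-check the Cloudera Hive/Impala ODBC DSN outside of Tableau.
# Requires: pip install pyodbc, plus the ODBC driver from the "Driver" page.
import pyodbc

# "ClouderaHiveDSN" is a placeholder DSN name created in the ODBC Data Source Administrator.
# autocommit=True avoids transaction calls that Hive may not support.
conn = pyodbc.connect("DSN=ClouderaHiveDSN", autocommit=True)
cursor = conn.cursor()
cursor.execute("SHOW TABLES")
for row in cursor.fetchall():
    print(row)
conn.close()
```

If this script reaches the server and lists tables, the driver and network path are fine and the problem is more likely in the Tableau connection settings or credentials.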

 

3.2 Hortonworks Hadoop Hive


Click Hortonworks Hadoop Hive under "Connect" on the start page, and then do the following:

(1) Enter the name of the server hosting the database.
(2) In the "Type" drop-down list, select the type of database to connect to: Hive Server or Hive Server 2.
(3) In the "Authentication" drop-down list, select the authentication method to use.
(4) Click "Initial SQL" to specify a SQL command to run once at connection time.
(5) Click the "Sign In" button.


If the connection is unsuccessful, verify that the user name and password are correct. If the connection still fails, the computer is having trouble locating the server; contact your network administrator or database administrator.

 

3.3 MapR Hadoop Hive


Click MapR Hadoop Hive under "Connect" on the start page, and then do the following:

(1) Enter the name of the server hosting the database.
(2) In the "Type" drop-down list, select the type of database to connect to: Hive Server or Hive Server 2.
(3) In the "Authentication" drop-down list, select the authentication method to use.
(4) Click "Initial SQL" to specify a SQL command to run once at connection time.
(5) Click the "Sign In" button.


If the connection is unsuccessful, verify that the user name and password are correct. If the connection still fails, the computer is having trouble locating the server; contact your network administrator or database administrator.

4. Connection considerations


When connecting to Hive from Tableau Desktop, you need to pay attention to how date/time data is handled and to the known limitations of Hive and Hadoop compared with traditional databases.

4.1 Date/time data


Tableau Desktop 9.0 and later versions support timestamps in Hive, and Tableau can use timestamps natively. If you store the date/time data as a string in Hive, make sure to store it in the ISO format (YYYY-MM-DD).

Tableau Desktop versions earlier than 9.0 do not have built-in support for the timestamp data type, but those versions do support operations on date/time data stored as strings.

To change the data type to a date/time format, create a data extract, then right-click the field in the "Data" pane and select "Change Data Type" → "Date" (or "Date & Time") to use the date or date/time value stored in the string, or use the DATEPARSE function to convert the string into a date/time field.
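As a small illustration of why the ISO format matters (a sketch outside Tableau; the sample value is made up), a string such as "2021-03-16" parses unambiguously with a fixed pattern, which is essentially what DATEPARSE does with a pattern along the lines of "yyyy-MM-dd".

```python
from datetime import datetime

# Hypothetical value stored as a string column in Hive, in ISO format.
raw_value = "2021-03-16"

# Parsing with a fixed pattern succeeds only when the storage format is
# consistent, which is why storing string dates as YYYY-MM-DD is recommended.
parsed = datetime.strptime(raw_value, "%Y-%m-%d")
print(parsed.date())  # 2021-03-16
```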

4.2 Known limitations


1. High latency
Hive is a batch-oriented system and cannot answer even simple queries with a fast turnaround time. This makes it difficult to explore new data sets or experiment with calculated fields, although some newer SQL-on-Hadoop technologies can be used to work around this limitation.
 

2. Date/time processing
Hive provides useful functions for working with string data that represents dates/times, and newer versions add support for storing date/time values as a native data type (timestamp).
 

3. Query progress and cancellation

Cancelling a query in Hadoop Hive is not simple, especially when working from computers that are not part of the cluster, and Hive does not provide a cancellation mechanism. A query issued by Tableau can therefore only be "abandoned": after abandoning it you can continue working in Tableau, but the query keeps running on the cluster and consuming resources.

4. Authentication

For connections to the traditional Hive Server, the Hive ODBC driver does not expose authentication operations, and Hive's authentication and data security model is incomplete. Tableau Server provides a data security model for such cases: "user filters" created in a Tableau workbook indicate how to restrict the data in each visualization, and Tableau Server ensures that these filters are enforced for users who access the interactive visualization in a browser.

 

5. Verify and test the connection

With the latest ODBC drivers from Cloudera, Hortonworks, MapR, and Amazon EMR, the driver configuration utility can be used to test the connection to the Hadoop Hive cluster.

 

 
