Tags: cloudera, cdh, hive, impala, Kerberos

Impala authentication can now be handled through the combined use of LDAP and Kerberos

Impala
Impala is an open source analytical database built on Apache Hadoop that uses Kerberos and LDAP to support authentication. Kerberos has been supported since version 1.0, while LDAP support was added only recently. As of CDH 5.2, you can use both at the same time.
Kerberos
Kerberos remains the primary authentication mechanism of Apache Hadoop.

  • A principal is a Kerberos identity, representing either a user or a daemon process. For our purposes, a principal takes the form name/hostname@realm for a daemon, or simply name@realm for a user.
  • The name field may be a process name, such as impala, or a user name, such as myoder.
  • The hostname may be a machine's fully qualified domain name, or the special _HOST string, which Hadoop automatically replaces with the fully qualified name of the local machine.
  • The realm is similar to (but not necessarily the same as) a DNS domain name.

Putting these pieces together, a complete principal might look like: principal=impala/host.example.com@EXAMPLE.COM

Basic Kerberos support in Impala is straightforward: provide the following parameters, and the daemon will use the given principal, with its key from the keytab file, to authenticate all communicating principals.

--principal=impala/hostname@realm
--keytab_file=/full/path/to/keytab

There is a further wrinkle when the Impala daemon (impalad) runs behind a load balancer.

When a client runs a query through the load balancer (a proxy), the client expects impalad to present a principal bearing the load balancer's name. So when impalad serves external queries, it needs a principal whose name matches the proxy, but for communication between the daemons, it needs a principal matching the actual host name.

--principal=impala/proxy-hostname@realm
--be_principal=impala/actual-hostname@realm
--keytab_file=/full/path/to/keytab

The first parameter, --principal, defines the principal impalad uses when serving external queries, and --be_principal defines the principal used for communication between impalad processes. The keys for both principals must live in the same keytab file.
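As a sketch, using hypothetical hostnames (impala-proxy.example.com for the load balancer, node01.example.com for the actual host), a hypothetical realm, and a hypothetical keytab path, the flags might be combined like this:

```shell
# Hypothetical hosts, realm, and keytab path: adjust for your cluster.
# External clients authenticate against the proxy's principal; the daemons
# authenticate to each other with the per-host principal. Both keys must
# be present in the same keytab file.
impalad \
  --principal=impala/impala-proxy.example.com@EXAMPLE.COM \
  --be_principal=impala/node01.example.com@EXAMPLE.COM \
  --keytab_file=/etc/impala/conf/impala.keytab
```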
Debugging Kerberos
Kerberos is an elegant protocol, but real-world implementations are not very helpful when something goes wrong. When an authentication failure occurs, there are two main things to check:

  • Time. Kerberos relies on synchronized clocks, so it is best practice to install and use NTP (Network Time Protocol) on all machines that use Kerberos.
  • DNS. Make sure your hostnames are fully qualified, and that both forward (name->IP) and reverse (IP->name) DNS lookups are correct.

In addition, two environment variables can be set to emit Kerberos debugging information. The output can be cryptic, but it usually helps you track down the problem.

  • KRB5_TRACE=/full/path/to/trace/output.log: This environment variable specifies where the Kerberos client-side trace log is written.
  • JAVA_TOOL_OPTIONS=-Dsun.security.krb5.debug=true: This environment variable is passed to the Impala daemons and on to their embedded Java components.
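A minimal sketch of enabling both debug channels in the environment before starting (or restarting) the daemons; the trace file path is an arbitrary example:

```shell
# Write libkrb5 client-side trace output to this file (path is an example).
export KRB5_TRACE=/tmp/krb5_trace.log

# Ask the JVM's Kerberos implementation, used by Impala's embedded Java
# components, to print debug information.
export JAVA_TOOL_OPTIONS="-Dsun.security.krb5.debug=true"

echo "KRB5_TRACE=$KRB5_TRACE"
```

After reproducing the failure, inspect the trace file alongside the daemon logs.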

Cloudera
Cloudera sells Hadoop-based software and releases its own distribution of Hadoop products to help subscription customers manage their data.
CDH
CDH is the most complete, tested, and popular distribution of Apache Hadoop and related projects. CDH provides the core elements of Hadoop (scalable storage and distributed computing) as well as a web-based user interface and important enterprise capabilities. CDH is open source under the Apache license and is the only Hadoop solution to offer unified batch processing, interactive SQL, interactive search, and role-based access control.

As a powerful commercial data center management platform, Cloudera provides several data computing frameworks that run quickly and stably, such as Apache Spark; it uses Apache Impala as a high-performance SQL query engine for HDFS and HBase; it ships the Hive data warehouse tool to help users analyze data; users can also use Cloudera to manage and install the HBase distributed column-oriented NoSQL database. Cloudera further includes a native Hadoop search engine and Cloudera Navigator Optimizer, which visually coordinates and optimizes computing tasks on Hadoop to improve operational efficiency. At the same time, the components provided in Cloudera let users conveniently manage, configure, and monitor Hadoop and all related components from a visual UI, with a degree of fault tolerance and disaster recovery handling.
Hive overview
The Hive data warehouse software supports reading, writing, and managing large datasets in distributed storage. Queries written in Hive Query Language (HiveQL), which is very similar to SQL, are converted into a series of jobs executed on the Hadoop cluster through MapReduce or Apache Spark.

Users can use Hive to run batch processing workloads, while also using tools such as Apache Impala or Apache Spark to analyze the same data for interactive SQL or machine learning workloads within a single platform.

Hive components
Hive consists of the following components:

1. Metastore database

The Metastore database is an important aspect of the Hive infrastructure. It is a separate database, backed by a traditional RDBMS such as MySQL or PostgreSQL, that stores metadata about Hive databases, tables, columns, and partitions, as well as Hadoop-specific information (such as underlying data files and HDFS block locations).

The Metastore database is shared by other components; for example, both Hive and Impala can insert into, query, and alter the same tables. Although you may see references to the "Hive metastore", note that the Metastore database is used widely across the Hadoop ecosystem even when you are not using Hive itself.

The Metastore database is relatively compact, but its data changes rapidly; backup, replication, and other management operations must take this database into account.
2. HiveServer2
HiveServer2 is a server interface that enables remote clients to submit queries to Hive and retrieve results. It replaces HiveServer1, which has been deprecated and will be removed in a future CDH version. HiveServer2 supports multi-client concurrency, capacity planning controls, Sentry authorization, Kerberos authentication, LDAP, and SSL, and provides better support for JDBC and ODBC clients.

HiveServer2 is the container for the Hive execution engine. For each client connection, it creates a new execution context that serves the client's Hive SQL requests. It supports JDBC clients, such as the Beeline CLI, and ODBC clients. Clients connect to HiveServer2 through the Thrift-based Hive service.
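As an illustration, connecting the Beeline CLI to a kerberized HiveServer2 uses a JDBC URL that carries the server's service principal; the hostnames, realm, and user principal below are hypothetical:

```shell
# Obtain a Kerberos ticket first (user principal is hypothetical), then
# connect. The principal in the URL is HiveServer2's own service principal,
# not the user's.
kinit myoder@EXAMPLE.COM
beeline -u "jdbc:hive2://hs2-host.example.com:10000/default;principal=hive/hs2-host.example.com@EXAMPLE.COM"
```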
Overview of Apache Impala
Impala provides fast, interactive SQL queries directly on data stored in HDFS and HBase in the Apache Hadoop platform. In addition to using the same unified storage platform, Impala uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (the Impala query UI in Hue) as Apache Hive. This provides a familiar, unified platform for both real-time and batch-oriented queries.

Impala is a complement to tools that can be used to query big data. Impala will not replace batch processing frameworks based on MapReduce (such as Hive). Hive and other frameworks built on MapReduce are most suitable for long-running batch jobs, such as batch jobs involving batch extraction, transformation, and loading (ETL) type jobs.

Benefits of Impala
Impala provides:

  • A familiar SQL interface that data scientists and analysts already know.
  • The ability to query large volumes of data ("big data") in Apache Hadoop.
  • Distributed queries in a cluster environment, with easy scaling and use of cost-effective commodity hardware.
  • No copy or export/import steps to share data files between different components; for example, you can write with Pig, transform with Hive, and query with Impala. Impala can read and write Hive tables, enabling simple data interchange when analyzing data produced by Hive.
  • A single system for big data processing and analytics, so customers can avoid costly modeling and ETL performed purely for analysis.
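To make the Hive/Impala interchange concrete, a hypothetical session might look like this (the impalad hostname and table name are invented for illustration):

```shell
# Suppose a table was created and populated by Hive. Impala shares the same
# metastore, so after refreshing its metadata cache it can query the table
# directly, with no export/import step.
impala-shell -i impalad-host.example.com -q "INVALIDATE METADATA sales;"
impala-shell -i impalad-host.example.com -q "SELECT COUNT(*) FROM sales;"
```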

CDH: MapReduce queries, Cloudera Search, and Impala
CDH provides storage of, and access to, large datasets via MapReduce jobs, but creating these jobs requires technical knowledge, and each job may take several minutes or more to run. The long run times associated with MapReduce jobs can interrupt the process of exploring data.

To provide more immediate queries and responses, and to eliminate the need to write MapReduce applications, you can use Apache Impala. Impala returns results in seconds rather than minutes.

Although Impala is a fast and powerful application, it uses SQL-based query syntax, so users who are unfamiliar with SQL may find it challenging. If you don't know SQL, you can use Cloudera Search. Whereas Impala, Apache Hive, and Apache Pig all require structure to be applied at query time, Cloudera Search supports free-text search over any data or fields that have been indexed.

Cloudera Search and other Cloudera components
Cloudera Search interacts with other Cloudera components to solve different problems. The following table lists the Cloudera components that contribute to the search process and describes how they interact with Cloudera Search:




Origin blog.csdn.net/u010010600/article/details/114070306