Big Data Analysis Engine: Presto

1. What is Presto?

  • Background: the shortcomings of Hive and the origins of Presto

Hive uses MapReduce as its underlying computing framework, which was designed for batch processing. But as data volumes keep growing, even a simple Hive query can take anywhere from a few minutes to a few hours, which clearly cannot satisfy the needs of interactive querying. Presto is a distributed SQL query engine designed specifically for fast, real-time data analysis. It supports standard ANSI SQL, including complex queries, aggregation, joins, and window functions. Two points are worth exploring here: first its architecture, and second how it achieves the low latency required for interactive use.

  • What is Presto?

Presto is an open-source distributed SQL query engine for interactive analytical queries, supporting data volumes from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics, approaching the speed of commercial data warehouses while scaling to the size of an organization like Facebook.

  • What can it do?

Presto supports online data queries over Hive, Cassandra, relational databases, and proprietary data stores. A single Presto query can combine data from multiple sources, allowing analytics across an entire organization. Presto targets analysts who expect response times ranging from under a second to a few minutes. Presto ends the dilemma of data analysis: either use a fast but expensive commercial solution, or use a slow "free" solution that consumes huge amounts of hardware.

  • Who uses it?

Facebook uses Presto for interactive queries against multiple internal data stores, including its 300 PB data warehouse. More than 1,000 Facebook employees use Presto daily, running more than 30,000 queries and scanning over 1 PB of data every day. Leading Internet companies including Airbnb and Dropbox also use Presto.

2. Presto Architecture

Presto is a distributed system that runs on multiple servers. A complete installation includes one coordinator and multiple workers. Queries are submitted from a client, such as the Presto CLI, to the coordinator. The coordinator parses and analyzes the query, plans its execution, and then distributes the processing tasks to the workers.


The Presto query engine uses a master-slave architecture: one Coordinator node, one Discovery Server node, and multiple Worker nodes, where the Discovery Server is usually embedded in the Coordinator node. The Coordinator is responsible for parsing SQL statements, generating execution plans, and distributing tasks to the Worker nodes; the Worker nodes carry out the actual query execution. When a Worker node starts, it registers itself with the Discovery Server, and the Coordinator obtains the list of healthy Worker nodes from the Discovery Server. If the Hive Connector is configured, a Hive MetaStore service must be available to provide Hive metadata to Presto, and the Worker nodes interact with HDFS to read the data.

3. Installing Presto Server

  • Installation packages
presto-cli-0.217-executable.jar
presto-server-0.217.tar.gz 
  • Presto Server installation and configuration

  1. Extract the installation package

tar -zxvf presto-server-0.217.tar.gz -C ~/training/

  2. Create the etc directory

cd ~/training/presto-server-0.217/
mkdir etc

  3. The etc directory needs to contain the following configuration files:

Node Properties: node configuration information
JVM Config: command-line options for the JVM
Config Properties: configuration parameters for the Presto server
Catalog Properties: configuration parameters for the data sources (connectors)
Log Properties: logging configuration parameters
  • Edit node.properties
# Cluster name. All Presto nodes in the same cluster must have the same cluster name.
node.environment=production

# Unique identifier for this Presto node. node.id must be unique for every node, and it must stay the same across restarts and upgrades. If multiple Presto instances are installed on the same machine (i.e. multiple Presto nodes on one host), each instance must have its own unique node.id.
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff

# Data storage directory (a path on the operating system). Presto stores logs and other data in this directory.
node.data-dir=/root/training/presto-server-0.217/data
  • Edit jvm.config

Because an OutOfMemoryError leaves the JVM in an inconsistent state, our response to such an error is to collect diagnostic information (a heap dump, for debugging) and then forcibly terminate the process. Presto compiles queries into bytecode, so it generates a large number of classes; we should therefore increase the size of the Perm region (where classes are mainly stored) and allow the JVM to unload classes.

-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
  • Edit config.properties

    Coordinator configuration

coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://192.168.157.226:8080

    Worker configuration

coordinator=false
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://192.168.157.226:8080

    If you want to test on a single machine that serves as both coordinator and worker, use the following configuration:

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://192.168.157.226:8080

    Parameter description:

  • coordinator: whether this node acts as the coordinator, i.e. accepts queries from clients and manages query execution.
  • node-scheduler.include-coordinator: whether the coordinator is also allowed to do query processing work. For larger clusters, processing work on the coordinator can hurt query performance.
  • http-server.http.port: the port for the HTTP server. Presto uses HTTP for all communication, internal and external.
  • query.max-memory: the maximum total distributed memory a single query may use across the cluster.
  • query.max-memory-per-node: the maximum user memory a single query may use on any one node.
  • query.max-total-memory-per-node: the maximum total (user plus system) memory a single query may use on any one node.
  • discovery-server.enabled: Presto uses the Discovery service to find all the nodes in the cluster. Every Presto instance registers with the Discovery service on startup. To simplify deployment, the coordinator can run an embedded version of the Discovery service on its own HTTP port.
  • discovery.uri: the URI of the Discovery server. Because the Discovery service is embedded in the coordinator here, this is the URI of the Presto coordinator.

  • Edit log.properties

    Configure the log level. Valid levels are DEBUG, INFO, WARN, and ERROR.

com.facebook.presto=INFO
  • Catalog Properties Configuration

Presto accesses data through connectors, which are mounted in catalogs. A connector provides all the schemas and tables inside its catalog. For example, the Hive connector maps each Hive database to a schema: if the Hive connector is mounted as a catalog named hive, and Hive contains a table clicks in the database web, that table can be accessed in Presto as hive.web.clicks. Catalogs are registered by creating a catalog properties file in the etc/catalog directory. For example, to create a Hive data source, create a file etc/catalog/hive.properties with the content below; once loaded, it mounts a hive connector as the hive catalog.

# indicates the connector and the hadoop version
connector.name=hive-hadoop2

# address of the Hive MetaStore (thrift protocol)
hive.metastore.uri=thrift://192.168.157.226:9083

# paths to the hadoop configuration files
hive.config.resources=/root/training/hadoop-2.7.3/etc/hadoop/core-site.xml,/root/training/hadoop-2.7.3/etc/hadoop/hdfs-site.xml

 Note: To access Hive, the Hive MetaStore service must be running: hive --service metastore
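 Since hive --service metastore runs in the foreground and blocks the terminal, it is convenient to start it in the background, for example:

nohup hive --service metastore > /tmp/metastore.log 2>&1 &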

4. Starting Presto Server

bin/launcher start
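
Besides start, the launcher script in the bin directory of the installation supports a few other commands that are handy while setting things up:

bin/launcher run       # run the server in the foreground and log to the console
bin/launcher stop      # stop a running server
bin/launcher restart   # stop, then start the server
bin/launcher status    # report whether the server is running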

5. Running presto-cli

  • Download: presto-cli-0.217-executable.jar
  • Rename the jar file and add execute permission
cp presto-cli-0.217-executable.jar presto 
chmod a+x presto
  • Connect to the Presto Server
./presto --server localhost:8080 --catalog hive --schema default
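
Once connected, standard SQL can be issued against the hive catalog. A short sample session is sketched below; the web.clicks table is the illustrative example from the catalog section, not a table that exists out of the box:

presto:default> SHOW SCHEMAS;
presto:default> SHOW TABLES;
presto:default> SELECT * FROM hive.web.clicks LIMIT 10;  -- fully qualified: catalog.schema.table
presto:default> SELECT count(*) FROM web.clicks;         -- resolved against the current catalog
presto:default> quit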

6. Using Presto

  • Operating on Hive with Presto
  • Using the Presto Web Console: port 8080 (e.g. http://192.168.157.226:8080)

  • Using Presto through JDBC

    1. Include the following Maven dependency:

<dependency>
	<groupId>com.facebook.presto</groupId>
	<artifactId>presto-jdbc</artifactId>
	<version>0.217</version>
</dependency>

    2. JDBC code

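A minimal sketch of the JDBC code, using the driver declared above. The coordinator address follows the earlier configuration; the clicks table and the root user are assumptions for illustration only. The driver registers itself automatically, so no Class.forName call is needed.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PrestoJdbcDemo {
    public static void main(String[] args) throws Exception {
        // JDBC URL format: jdbc:presto://host:port/catalog/schema
        String url = "jdbc:presto://192.168.157.226:8080/hive/default";

        // Presto requires a user name; a password is only needed when authentication is enabled
        try (Connection conn = DriverManager.getConnection(url, "root", null);
             Statement stmt = conn.createStatement();
             // "clicks" is a hypothetical table used only for illustration
             ResultSet rs = stmt.executeQuery("SELECT * FROM clicks LIMIT 10")) {
            while (rs.next()) {
                // print the first column of each row
                System.out.println(rs.getString(1));
            }
        }
    }
}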