Introduction to Apache Drill

Introduction

Apache Drill is a low-latency, distributed, interactive query engine for massive datasets, covering structured, semi-structured, and nested data. It uses ANSI-SQL-compatible syntax and supports local files, HDFS, HBase, MongoDB, and other back-end storage, as well as Parquet, JSON, CSV, TSV, PSV, and other data formats. Inspired by Google's Dremel, Drill targets interactive business-intelligence analysis over petabyte-scale data on thousands of nodes.

Install

Drill can be installed standalone or in a cluster, and supports Linux, Windows, and Mac OS X. For simplicity, we set it up on a single Linux machine (CentOS 6.3) for trial use.

Prepare the installation packages:

Install into the $WORK (/path/to/work) directory: extract the JDK and Drill into the java and drill directories respectively, and create soft links so a future upgrade only needs to repoint the links:

.
├── drill
│   ├── apache-drill -> apache-drill-0.8.0
│   └── apache-drill-0.8.0
├── init.sh
└── java
    ├── jdk -> jdk1.7.0_75
    └── jdk1.7.0_75
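The layout above can be reproduced with a few shell commands; a sketch, where mkdir stands in for extracting the real tarballs (in a real install you would run tar xzf on the JDK and Drill archives instead):

```shell
# Sketch of the soft-link layout. mkdir simulates the extracted tarballs;
# in a real install, replace the mkdir lines with e.g.
#   tar xzf jdk-7u75-linux-x64.tar.gz -C java
#   tar xzf apache-drill-0.8.0.tar.gz -C drill
WORK="$(mktemp -d)"            # use /path/to/work in a real install
mkdir -p "$WORK/java/jdk1.7.0_75" "$WORK/drill/apache-drill-0.8.0"
# The soft links decouple scripts from versions: an upgrade just repoints them.
ln -s jdk1.7.0_75        "$WORK/java/jdk"
ln -s apache-drill-0.8.0 "$WORK/drill/apache-drill"
ls -l "$WORK/java" "$WORK/drill"
```

Because init.sh (below) only references the jdk and apache-drill links, upgrading to a new JDK or Drill version never requires editing the script.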

Then add an init.sh script to initialize the Java-related environment variables:

export WORK="/path/to/work"
export JAVA="$WORK/java/jdk/bin/java"
export JAVA_HOME="$WORK/java/jdk"

Start up

To run in a stand-alone environment, you only need to start bin/sqlline:

$ cd $WORK
$ . ./init.sh
$ ./drill/apache-drill/bin/sqlline -u jdbc:drill:zk=local
Drill log directory /var/log/drill does not exist or is not writable, defaulting to ...
Apr 06, 2015 12:47:30 AM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
sqlline version 1.1.6
0: jdbc:drill:zk=local> 

-u jdbc:drill:zk=local indicates that Drill runs locally, so ZooKeeper does not need to be started. In a cluster environment, ZooKeeper must be configured and started, and its address filled in here instead. After startup, you can type commands at the 0: jdbc:drill:zk=local> prompt.
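For a cluster, the connection string lists the ZooKeeper quorum, optionally followed by the ZooKeeper directory and Drill cluster ID from drill-override.conf. A hypothetical example (zk1/zk2/zk3 and mydrillcluster are made-up names; adjust to your environment):

```shell
# Hypothetical cluster connection; host names and cluster ID are placeholders.
./drill/apache-drill/bin/sqlline -u "jdbc:drill:zk=zk1:2181,zk2:2181,zk3:2181/drill/mydrillcluster"
```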

Try out

Drill's sample-data directory has demo data in Parquet format for querying:

0: jdbc:drill:zk=local> select * from dfs.`/path/to/work/drill/apache-drill/sample-data/nation.parquet` limit 5;
+-------------+------------+-------------+------------+
| N_NATIONKEY |   N_NAME   | N_REGIONKEY | N_COMMENT  |
+-------------+------------+-------------+------------+
| 0           | ALGERIA    | 0           |  haggle. carefully f |
| 1           | ARGENTINA  | 1           | al foxes promise sly |
| 2           | BRAZIL     | 1           | y alongside of the p |
| 3           | CANADA     | 1           | eas hang ironic, sil |
| 4           | EGYPT      | 4           | y above the carefull |
+-------------+------------+-------------+------------+
5 rows selected (0.741 seconds)
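Ordinary SQL aggregates work as expected on this sample too; for instance, counting nations per region. This is an untested sketch using only the columns shown in the output above:

```sql
-- Sketch: group the sample nations by region (requires a running Drill).
select N_REGIONKEY, count(*) as nation_count
from dfs.`/path/to/work/drill/apache-drill/sample-data/nation.parquet`
group by N_REGIONKEY
order by N_REGIONKEY;
```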

The library name format used here is dfs.`absolute path to a local file` (a Parquet, JSON, CSV, etc. file). As you can see, anyone familiar with SQL syntax can use it with almost no learning cost. However, Parquet files require special tools to view and edit, which is inconvenient (we will introduce them later), so we will first demonstrate with the more common CSV and JSON formats.

Create the following test.csv file in $WORK/data:

1101,SteveEurich,Steve,Eurich,16,StoreT
1102,MaryPierson,Mary,Pierson,16,StoreT
1103,LeoJones,Leo,Jones,16,StoreTem
1104,NancyBeatty,Nancy,Beatty,16,StoreT
1105,ClaraMcNight,Clara,McNight,16,Store

Then query:

0: jdbc:drill:zk=local> select * from dfs.`/path/to/work/drill/data/test.csv`;
+------------+
|  columns   |
+------------+
| ["1101","SteveEurich","Steve","Eurich","16","StoreT"] |
| ["1102","MaryPierson","Mary","Pierson","16","StoreT"] |
| ["1103","LeoJones","Leo","Jones","16","StoreTem"] |
| ["1104","NancyBeatty","Nancy","Beatty","16","StoreT"] |
| ["1105","ClaraMcNight","Clara","McNight","16","Store"] |
+------------+
5 rows selected (0.082 seconds)

As you can see, the result differs slightly from the previous one: a CSV file has nowhere to store column names, so Drill uniformly exposes a single columns array instead. To select a specific column, use columns[n], for example:

0: jdbc:drill:zk=local> select columns[0], columns[3] from dfs.`/path/to/work/drill/data/test.csv`;
+------------+------------+
|   EXPR$0   |   EXPR$1   |
+------------+------------+
| 1101       | Eurich     |
| 1102       | Pierson    |
| 1103       | Jones      |
| 1104       | Beatty     |
| 1105       | McNight    |
+------------+------------+
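The positional columns can also be given readable aliases and proper types with a standard CAST; a sketch (the alias names here are my own invention, not part of the data):

```sql
-- Sketch: alias and cast positional CSV columns (requires a running Drill).
select cast(columns[0] as int) as emp_id,
       columns[2]              as first_name,
       columns[3]              as last_name
from dfs.`/path/to/work/drill/data/test.csv`
where cast(columns[0] as int) >= 1103;
```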

The CSV file format is relatively simple and cannot exert the powerful advantages of Drill. The more complex functions below are demonstrated using JSON files that are closer to Parquet.

Create the following test.json file in $WORK/data:

{
  "ka1": 1,
  "kb1": 1.1,
  "kc1": "vc11",
  "kd1": [
    {
      "ka2": 10,
      "kb2": 10.1,
      "kc2": "vc1010"
    }
  ]
}
{
  "ka1": 2,
  "kb1": 2.2,
  "kc1": "vc22",
  "kd1": [
    {
      "ka2": 20,
      "kb2": 20.2,
      "kc2": "vc2020"
    }
  ]
}
{
  "ka1": 3,
  "kb1": 3.3,
  "kc1": "vc33",
  "kd1": [
    {
      "ka2": 30,
      "kb2": 30.3,
      "kc2": "vc3030"
    }
  ]
}

As you can see, this JSON file has multiple levels of nesting, so its structure is much more complicated than the earlier CSV file; querying such nested data is exactly where Drill's advantage lies.

0: jdbc:drill:zk=local> select * from dfs.`/path/to/work/drill/data/test.json`;
+------------+------------+------------+------------+
|    ka1     |    kb1     |    kc1     |    kd1     |
+------------+------------+------------+------------+
| 1          | 1.1        | vc11       | [{"ka2":10,"kb2":10.1,"kc2":"vc1010"}] |
| 2          | 2.2        | vc22       | [{"ka2":20,"kb2":20.2,"kc2":"vc2020"}] |
| 3          | 3.3        | vc33       | [{"ka2":30,"kb2":30.3,"kc2":"vc3030"}] |
+------------+------------+------------+------------+
3 rows selected (0.098 seconds)

select * only retrieves the first-level fields; the deeper data is presented as raw JSON. Obviously we care about more than the first level, and drilling into it can be done quite freely:

0: jdbc:drill:zk=local> select sum(ka1), avg(kd1[0].kb2) from dfs.`/path/to/work/drill/data/test.json`;
+------------+------------+
|   EXPR$0   |   EXPR$1   |
+------------+------------+
| 6          | 20.2       |
+------------+------------+
1 row selected (0.136 seconds)

The data nested at the second level can be accessed via kd1[0]:

0: jdbc:drill:zk=local> select kc1, kd1[0].kc2 from dfs.`/path/to/work/drill/data/test.json` where kd1[0].kb2 = 10.1 and ka1 = 1;
+------------+------------+
|    kc1     |   EXPR$1   |
+------------+------------+
| vc11       | vc1010     |
+------------+------------+
1 row selected (0.181 seconds)

Create a view:

0: jdbc:drill:zk=local> create view dfs.tmp.tmpview as select kd1[0].kb2 from dfs.`/path/to/work/drill/data/test.json`;
+------------+------------+
|     ok     |  summary   |
+------------+------------+
| true       | View 'tmpview' created successfully in 'dfs.tmp' schema |
+------------+------------+
1 row selected (0.055 seconds)

0: jdbc:drill:zk=local> select * from dfs.tmp.tmpview;
+------------+
|   EXPR$0   |
+------------+
| 10.1       |
| 20.2       |
| 30.3       |
+------------+
3 rows selected (0.193 seconds)

The nested second-level table can also be flattened, merging kd1[0]..kd1[n] into separate rows:

0: jdbc:drill:zk=local> select kddb.kdtable.kc2 from (select flatten(kd1) kdtable from dfs.`/path/to/work/drill/data/test.json`) kddb;
+------------+
|   EXPR$0   |
+------------+
| vc1010     |
| vc2020     |
| vc3030     |
+------------+
3 rows selected (0.083 seconds)
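Since flatten turns each array element into its own row, the flattened result can be fed into ordinary aggregates; a sketch in the same style as the query above (requires a running Drill):

```sql
-- Sketch: average kb2 across all flattened kd1 elements.
select avg(kddb.kdtable.kb2)
from (select flatten(kd1) kdtable
      from dfs.`/path/to/work/drill/data/test.json`) kddb;
```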

The usage details still differ from MySQL, and multi-level tables involve more complex logic; to use Drill well, you need to read the official documentation carefully and practice a lot. This was just a quick tour; a deeper look at Drill's features at the syntax level will follow.

 

Source: https://segmentfault.com/a/1190000002652348
