What Is Druid

First, what is Druid

The word "druid" comes from Western mythology, where druids are figures of ancient Celtic lore; in Chinese the word is usually transliterated.

Anyone who has played World of Warcraft, Diablo, Dota, Hearthstone, or Dota Auto Chess will certainly be no stranger to the word.

The Druid described here is a distributed data storage system for real-time analytics. A popular way to put it: a high-performance real-time analytics database. It was created in 2011 by Metamarkets, an American advertising technology company, and open-sourced in 2012. Metamarkets provides data services to online media companies, focusing mainly on ad operations for DSP (demand-side platform) platforms. Because its real-time requirements were very demanding, the company had to abandon its original big data solution, and Druid came into being.

Druid's official website was originally at: http://druid.io/

Druid is open source under the Apache License 2.0 and is currently being developed in the Apache Incubator; the code is hosted on GitHub.

The current official website is:

https://druid.apache.org/

Alibaba also has an open source project called Druid, which is a database connection pool. It merely shares a name with the Druid described here; the two are unrelated, and each has its own repository on GitHub.

In this article, "Druid" refers to Apache Druid.

GitHub address: https://github.com/apache/druid/. At the time of writing it has more than 9k stars, the latest release is version 0.17, and its popularity is still rising.

Second, Druid features and basic concepts

The main problem Druid solves is query performance on large data volumes, which traditional databases cannot handle.

So it is essentially a distributed data storage system for real-time analytics.

Its goals are fast queries and data analysis, high availability, and high scalability.

Features

1. Fast queries: Druid provides fast aggregation and fast OLAP queries. With its multi-tenant design, it is well suited to user-facing analytics applications. Data can be aggregated at a granularity of 1 minute, 5 minutes, 1 hour, 1 day, and so on. Druid keeps data in memory to speed up queries.

OLAP is the counterpart of OLTP. Take an online store as an example: what does each of them do there?

OLTP (online transaction processing) covers product browsing, orders, and user data. It is the main use of traditional databases: basic CRUD operations with frequent queries and modifications. It is characterized by strict real-time requirements, small data volumes, data that can be modified and deleted, and strict transactional requirements.
OLAP (online analytical processing) stores large volumes of data for analysis. It supports complex analytical operations and decision support, and is characterized by large data volumes, high throughput, and query-only access.
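
The contrast can be sketched with Python's built-in sqlite3 module. This is only an illustration of the two access patterns, not of how Druid works internally; the orders table and its columns are hypothetical.

```python
import sqlite3

# Hypothetical online-store table, kept in memory for the illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, qty INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders (item, qty, status) VALUES (?, ?, ?)",
    [("book", 1, "paid"), ("pen", 3, "paid"), ("book", 2, "new")],
)

# OLTP-style access: a small transactional change to a single row.
with conn:  # commits on success, rolls back on error
    conn.execute("UPDATE orders SET status = 'paid' WHERE id = ?", (3,))

# OLAP-style access: a read-only aggregation over many rows.
totals = conn.execute(
    "SELECT item, SUM(qty) FROM orders WHERE status = 'paid' "
    "GROUP BY item ORDER BY item"
).fetchall()
print(totals)
```

The first statement touches one row inside a transaction; the second scans the whole table and returns one summary row per item, which is the kind of query Druid is built for.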

2. Real-time data ingestion: Druid supports streaming ingestion and provides event-driven data, ensuring the timeliness and consistency of events across real-time and offline environments. Historical data does not change, while real-time data is accessible in real time.

3. Petabyte-scale scalable storage: a Druid cluster can easily be expanded to petabytes of data and to ingestion of millions of events per second. Its performance is maintained even as the data size grows. Druid can partition data by time range for aggregation.

4. Multi-environment deployment: Druid can run on commodity hardware or in the cloud. It can ingest data from a variety of systems, including Hadoop, Spark, Kafka, Storm, and Samza.

5. Active community: Druid has an active community that everyone can learn from.

Several of Druid's original developers at Metamarkets went on to found a new company called Imply: https://imply.io/

[Figure: Druid compared with other OLAP solutions]

Use cases

From Druid's features, we can see that it suits scenarios with:

Many queries and few modifications
Queries based on grouping or aggregation
Fast queries
A need to support both offline and real-time data sources

Druid is therefore well suited to serve as the query layer in real-time computing, for real-time reports and real-time dashboards.

Druid also performs very well:

It is highly scalable and uses distributed columnar storage; it is highly fault-tolerant and self-balancing, which safeguards query latency and data integrity; and it automatically aggregates and indexes data, with a variety of algorithms to optimize query efficiency.

Data in Druid is therefore generally stored after aggregation.

Basic concepts

1. Data format

Before Druid ingests data, you first need to define a data source (dataSource). A dataSource is structured as a timestamp column, dimension columns, and metric columns.

Timestamp column: Druid groups data that is close in time together, and queries specify a time range.

Dimension columns: identify the dimensions used for statistics, such as categories of various kinds.

Metric columns: the columns used for aggregation and calculation, including count, sum, and the like.
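
A minimal sketch of how these three kinds of columns combine when data is aggregated by time. It is a plain-Python illustration, not Druid's ingestion code; the rollup function, the hourly bucket, and the sample events are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def rollup(events, dimensions, metric):
    """Group events by (hour, dimension values) and keep a count and a sum."""
    buckets = defaultdict(lambda: {"count": 0, "sum": 0})
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"])
        # Timestamp column: truncate to the hour so nearby events share a bucket.
        hour = ts.replace(minute=0, second=0, microsecond=0).isoformat()
        # Dimension columns: the values that events are grouped by.
        key = (hour,) + tuple(event[d] for d in dimensions)
        # Metric column: the value that is aggregated.
        buckets[key]["count"] += 1
        buckets[key]["sum"] += event[metric]
    return dict(buckets)


events = [
    {"timestamp": "2012-01-01T10:05:00", "country": "US", "device": "phone", "bytes": 100},
    {"timestamp": "2012-01-01T10:40:00", "country": "US", "device": "phone", "bytes": 50},
    {"timestamp": "2012-01-01T11:02:00", "country": "US", "device": "phone", "bytes": 70},
]
rolled = rollup(events, ["country", "device"], "bytes")
for key, value in sorted(rolled.items()):
    print(key, value)
```

Three input events become two stored rows, because the first two fall in the same hour and share the same dimension values.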

2. Data ingestion

Druid provides two ways to ingest data: real-time and batch.

3. Data queries

Druid supports two kinds of queries: native and SQL.

SQL queries take a form similar to the following:

[ EXPLAIN PLAN FOR ]
[ WITH tableName [ ( column1, column2, ... ) ] AS ( query ) ]
SELECT [ ALL | DISTINCT ] { * | exprs }
FROM table
[ WHERE expr ]
[ GROUP BY exprs ]
[ HAVING expr ]
[ ORDER BY expr [ ASC | DESC ], expr [ ASC | DESC ], ... ]
[ LIMIT limit ]
[ UNION ALL <another query> ]
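
As a hedged sketch, a SQL query of this shape can be sent to Druid's SQL HTTP endpoint (/druid/v2/sql) as a JSON body with a "query" field. The host and port below are assumptions about a local setup, and the table and columns are borrowed from the sample groupBy query in this article rather than from a real cluster.

```python
import json
import urllib.request

# Assumption: a Broker (or Router) is reachable here; adjust for your cluster.
DRUID_URL = "http://localhost:8082"


def build_sql_request(sql, base_url=DRUID_URL):
    """Build (but do not send) a POST request for Druid's SQL endpoint."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/druid/v2/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql(sql, base_url=DRUID_URL):
    """Send the query and return the decoded JSON rows (needs a running cluster)."""
    with urllib.request.urlopen(build_sql_request(sql, base_url)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical query; __time is Druid's built-in timestamp column.
sql = (
    "SELECT country, SUM(data_transfer) AS data_transfer "
    "FROM sample_datasource "
    "WHERE __time >= TIMESTAMP '2012-01-01 00:00:00' "
    "AND __time < TIMESTAMP '2012-01-03 00:00:00' "
    "GROUP BY country ORDER BY data_transfer DESC LIMIT 10"
)
request = build_sql_request(sql)
print(request.full_url)
```

Only build_sql_request runs here, so the sketch works without a cluster; calling run_sql(sql) would actually send the query.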

Druid native queries are written in JSON and sent over HTTP.

Below is an example Druid groupBy query, which specifies the time range, the aggregation granularity, the data source, and so on.

{
  "queryType": "groupBy",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "dimensions": ["country", "device"],
  "limitSpec": { "type": "default", "limit": 5000, "columns": ["country", "data_transfer"] },
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "carrier", "value": "AT&T" },
      { "type": "or",
        "fields": [
          { "type": "selector", "dimension": "make", "value": "Apple" },
          { "type": "selector", "dimension": "make", "value": "Samsung" }
        ]
      }
    ]
  },
  "aggregations": [
    { "type": "longSum", "name": "total_usage", "fieldName": "user_count" },
    { "type": "doubleSum", "name": "data_transfer", "fieldName": "data_transfer" }
  ],
  "postAggregations": [
    { "type": "arithmetic",
      "name": "avg_usage",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "fieldName": "data_transfer" },
        { "type": "fieldAccess", "fieldName": "total_usage" }
      ]
    }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ],
  "having": {
    "type": "greaterThan",
    "aggregation": "total_usage",
    "value": 100
  }
}
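
To make the semantics of the query above concrete, here is a small local emulation in Python. It is not Druid's execution engine, it ignores the limitSpec and intervals, and the input rows are invented for the illustration; it only mirrors the filter, the day granularity, the two sum aggregations, the avg_usage post-aggregation, and the having clause.

```python
from collections import defaultdict


def group_by(rows):
    """Emulate the filter, aggregations, post-aggregation and having clause."""
    totals = defaultdict(lambda: {"total_usage": 0, "data_transfer": 0.0})
    for row in rows:
        # filter: carrier == "AT&T" AND (make == "Apple" OR make == "Samsung")
        if row["carrier"] != "AT&T" or row["make"] not in ("Apple", "Samsung"):
            continue
        # granularity "day" plus the two dimensions form the grouping key
        key = (row["timestamp"][:10], row["country"], row["device"])
        totals[key]["total_usage"] += row["user_count"]       # longSum
        totals[key]["data_transfer"] += row["data_transfer"]  # doubleSum
    results = {}
    for key, agg in totals.items():
        if agg["total_usage"] > 100:  # having: total_usage greaterThan 100
            # postAggregation: avg_usage = data_transfer / total_usage
            results[key] = dict(agg, avg_usage=agg["data_transfer"] / agg["total_usage"])
    return results


# Hypothetical input rows.
rows = [
    {"timestamp": "2012-01-01T08:00:00", "country": "US", "device": "phone",
     "carrier": "AT&T", "make": "Apple", "user_count": 80, "data_transfer": 400.0},
    {"timestamp": "2012-01-01T09:00:00", "country": "US", "device": "phone",
     "carrier": "AT&T", "make": "Samsung", "user_count": 40, "data_transfer": 200.0},
    {"timestamp": "2012-01-01T10:00:00", "country": "US", "device": "tablet",
     "carrier": "AT&T", "make": "Apple", "user_count": 50, "data_transfer": 100.0},
    {"timestamp": "2012-01-01T11:00:00", "country": "US", "device": "phone",
     "carrier": "Verizon", "make": "Apple", "user_count": 500, "data_transfer": 900.0},
]
result = group_by(rows)
print(result)
```

The Verizon row is removed by the filter, the tablet group is removed by the having clause (50 is not greater than 100), and the remaining phone group has total_usage 120, data_transfer 600.0, and avg_usage 5.0.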

Third, application scenarios

Common applications of Druid include:

Clickstream analysis (web and mobile analytics)
Risk / fraud analysis
Network telemetry analysis (network performance monitoring)
Server metrics storage
Supply chain analysis (manufacturing metrics)
Application performance metrics
Business intelligence / OLAP

User behavior analysis

Druid can be used for clickstreams, view streams, and activity streams.

It can compute user metrics both exactly and approximately: for example, daily active users can be approximated to see the overall trend, or computed exactly to present to operations staff.
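
The trade-off between exact and approximate distinct counts can be sketched as follows. The estimator here is a simple k-minimum-values (KMV) sketch chosen for brevity; it is a stand-in technique and not the algorithm Druid itself uses, and the function names and sample user IDs are made up for the illustration.

```python
import hashlib
import heapq


def hash01(value):
    """Map a value to a pseudo-random float in [0, 1) using SHA-1."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(values, k=1024):
    """Estimate the number of distinct values while keeping only k hashes."""
    heap = []        # max-heap (via negation) holding the k smallest hashes
    members = set()  # the same hashes, for duplicate detection
    for value in values:
        h = hash01(value)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # Replace the current largest kept hash with the smaller new one.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values: the count is exact
    return int((k - 1) / -heap[0])


# Hypothetical activity stream: 60,000 events from 20,000 distinct users.
events = [f"user{i % 20000}" for i in range(60000)]
exact = len(set(events))         # memory grows with the number of users
approx = kmv_estimate(events)    # memory bounded by k
print(exact, approx)
```

The exact count must remember every user ID, while the sketch keeps at most k hash values regardless of how many users appear, at the cost of a small relative error.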

Digital marketing

Druid is often used to store and query online advertising data. This data usually comes from ad servers and is used to measure and understand campaign performance, click-through rates, conversion rates (churn rates), and so on.

OLAP and BI

Druid is commonly used for BI. Unlike SQL-on-Hadoop engines such as Hive, Druid is designed for high concurrency and sub-second queries, so that data can be explored interactively through a UI.

In short, as real-time computing becomes more widespread, Druid's performance and OLAP strengths will let it shine in areas such as real-time BI and real-time dashboards.