A talk on Druid, an open source distributed system

Introduction

Druid is a highly fault-tolerant, high-performance open source distributed system for real-time query and analysis over large data sets. It is designed to ingest large-scale data quickly and to support fast query and analysis on it. Even when code deployments, machine failures, or similar production incidents bring parts of the system down, Druid can still maintain 100% uptime. Druid was originally created to address query latency: interactive query and analysis was hard to implement on Hadoop, which could not meet the needs of real-time analysis. Druid provides interactive access to data and adopts a special storage format to balance the trade-off between query flexibility and performance.

 

Druid allows single-table queries in a manner similar to Dremel and PowerDrill, and adds new features on top, such as a columnar storage format for partially nested data structures, indexes for fast filtering, real-time data ingestion, and a highly fault-tolerant distributed architecture.

 

Features

Designed for analytics: built for exploratory OLAP workflows, supporting a variety of filters, aggregations, and query types;

Fast interactive queries: Druid's low-latency ingestion architecture makes events queryable within milliseconds of their creation;

High availability: data remains available while the system is updated, scaled up, or scaled down, with no data loss;

Scalable: Druid deployments have been shown to handle billions of events and terabytes of data per day.

 

Use cases

1. Interactive, fast aggregation queries over large amounts of data;

2. Real-time query and analysis;

3. Very large data volumes, e.g. hundreds of millions of new events and 10 TB of new data per day;

4. Real-time analysis of big data in particular;

5. A highly available, highly fault-tolerant, high-performance datastore.

 

Architecture

Historical: stores and serves queries over non-real-time (historical) data;

Realtime: ingests data in real time and monitors the input data stream;

Coordinator: monitors the Historical nodes and manages data distribution among them;

Broker: receives queries from external clients and forwards them to the Realtime and Historical nodes;

Indexer: responsible for the indexing service.
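As a concrete illustration of the Broker's role: a client submits a native JSON query to the Broker over HTTP, and the Broker fans it out to the Realtime and Historical nodes and merges the results. A minimal sketch in Python, assuming a Broker on the default port 8082; the `pageviews` datasource and `views` metric are illustrative names, not from the talk:

```python
import json

# Hypothetical Broker endpoint; adjust host/port for your cluster.
BROKER_URL = "http://localhost:8082/druid/v2"

# A minimal native timeseries query: daily totals over one week.
query = {
    "queryType": "timeseries",
    "dataSource": "pageviews",            # illustrative datasource name
    "granularity": "day",
    "intervals": ["2019-08-01/2019-08-08"],
    "aggregations": [
        {"type": "longSum", "name": "views", "fieldName": "views"}
    ],
}

payload = json.dumps(query)
# Against a live cluster, POST `payload` to BROKER_URL with
# Content-Type: application/json (e.g. via urllib.request or curl).
```

The Broker answers with one JSON row per `granularity` bucket in the requested intervals, regardless of which nodes actually held the segments.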


Comparison with the previous architecture

Our previous stack for real-time data exploration was Spark + Redis + HBase. It had the following problems:

  • Processing delays at traffic peaks

  • Inflexible cross-dimension analysis

  • Heavy resource consumption

  • Slow recomputation after system failures

This was our first-generation system: resource-hungry, failure-prone, and with large heaps it crashed easily. Mafengwo had already run into such an incident: one group of three machines, each with 512 GB of RAM. With memory that large, if a single memory stick went bad, that whole day's data might have to be recomputed, and recomputing one day took more than ten hours. Given our current real-time data volume, that simply no longer fit our situation.

 

At the time we asked ourselves whether, given the data volume, we could sacrifice some accuracy in UV computation. That is why we brought in Druid. After introducing Druid into MES, the error stayed at roughly 2%. Later, using Yahoo's DataSketches, we gained precise control over UV computation: the default sketch size is 16384, and cardinalities below 16384 are counted exactly. This value is configurable (it must be a power of 2); we currently set it very large, to more than 8 million. However, Druid does not support the virtual keys from the first-generation MES.
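The "exact below 16384, approximate above" behavior comes from how these sketches work: they keep only the k smallest hash values of the items seen. Below a toy, self-contained illustration of this k-minimum-values idea (the principle behind Theta sketches), not the DataSketches library itself; a small k is used so the demo runs quickly:

```python
import hashlib

# Toy k-minimum-values (KMV) sketch: keep only the k smallest hash values.
# Below k distinct items the count is exact; above k it becomes an estimate.
# (Druid/DataSketches defaults to k = 16384, as mentioned above.)
K = 1024

def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform float in (0, 1]."""
    h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
    return (h + 1) / 2**64

def kmv_estimate(items, k=K):
    mins = sorted({hash01(x) for x in items})[:k]
    if len(mins) < k:
        return len(mins)           # fewer than k distinct items: exact count
    return (k - 1) / mins[-1]      # otherwise estimate from the k-th smallest hash

small = [f"user-{i}" for i in range(500)]        # below k: exact
large = [f"user-{i}" for i in range(100_000)]    # above k: approximate

print(kmv_estimate(small))                        # exactly 500
rel_err = abs(kmv_estimate(large) - 100_000) / 100_000
print(round(rel_err, 3))                          # typically a few percent
```

The relative error scales roughly as 1/sqrt(k), which is why raising the sketch size (always a power of 2) tightens the UV estimate at the cost of memory.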


In Druid, a datasource is partitioned along time at a configurable density. At the query-granularity level, we let historical data be queried only down to the hour, with segments otherwise allocated by day. Only the most recent data, the last 15 days, can be queried with one-minute precision. For historical data, the finer the granularity, the more data Druid has to keep.
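The two regimes described above map onto the `granularitySpec` of a Druid ingestion spec. A sketch, assuming the standard field names from the native spec; the intervals are illustrative:

```python
# Recent data (last ~15 days): hourly segments, queryable to the minute.
recent_spec = {
    "type": "uniform",
    "segmentGranularity": "HOUR",
    "queryGranularity": "MINUTE",
    "intervals": ["2019-08-01/2019-08-16"],   # illustrative range
}

# Older data: daily segments, minute-level detail rolled up to the hour,
# which keeps the stored volume down as the text explains.
historical_spec = {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "queryGranularity": "HOUR",
    "intervals": ["2019-01-01/2019-08-01"],   # illustrative range
}
```

`queryGranularity` sets the finest timestamp kept at ingestion (rows within a bucket are rolled up), while `segmentGranularity` controls how segments are cut along time.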

Druid now supports offline batch import, which we use for T+1 data correction. In the PSPARK + TRANQUILITY stage, a failed Spark task can cause the PV counts in Druid to be inflated. So every day in the early morning we run a batch import to correct the previous day's data. Likewise, all the values reported by engineers need to be flattened into the attr field.
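The nightly correction amounts to a batch ingestion task whose interval covers yesterday and that replaces, rather than appends to, the existing segments. A minimal sketch, using field names from Druid's native parallel batch spec (newer than the 0.9-era setup described here); the datasource name and paths are illustrative:

```python
yesterday = "2019-08-12/2019-08-13"   # illustrative interval

# Re-ingest yesterday's raw data; appendToExisting=False means the new
# segments atomically replace the (possibly inflated) real-time ones.
reindex_task = {
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "mes_events",          # illustrative name
            "granularitySpec": {
                "segmentGranularity": "HOUR",
                "queryGranularity": "MINUTE",
                "intervals": [yesterday],
            },
        },
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {"type": "local",
                            "baseDir": "/data/raw",   # illustrative path
                            "filter": "*.json"},
            "appendToExisting": False,
        },
    },
}
```

Because segment replacement is atomic per interval, queries never see a mix of corrected and uncorrected data for that day.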

 

Druid cluster notes

When configuring Druid:

1. Don't use too many dimensions. Mafengwo started with more than 700 dimensions; nearly 100 GB was imported per day, and ingestion was very time-consuming.

2. Keep dimension values small. A dimension value several megabytes in size won't work.

3. Allocate resources sensibly. At first we used a single node with a 10 TB disk as the storage node for the entire Druid cluster, but whether you queried tasks or historical data, the 10 TB disk couldn't keep up: queries timed out in all sorts of ways.

4. Choose disks carefully. Use solid-state disks. Our current configuration is 256 GB of memory and a 1.2 TB SSD per node; with this setup, queries over the entire history, or any other queries, are fast.

5. Watch segment size. We started with daily segments of about 100 GB, then split them by hour down to a few GB each. A few GB is still too large, so we split further; the segments actually queried end up around 300-700 MB.

6. Commas are not supported in values, because Druid uses the comma as a separator internally.

7. Prioritize upgrading your Druid version. We slowly upgraded from 0.6 to 0.8, and are now on 0.9. Every Druid release brings many optimizations. If you run into query problems, or want queries to run faster, first check the latest status of Druid on GitHub.
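Tip 5, segment sizing, is steered through the `tuningConfig` of an ingestion spec rather than by hand. A sketch, assuming the `dynamic` partitions spec from modern Druid (this knob did not exist in this exact form in 0.9); the row count is illustrative and should be tuned until segments land in the few-hundred-megabyte range recommended above:

```python
# Cap rows per segment so segment files end up ~300-700 MB on disk.
# The right row count depends on your row width; treat 5M as a starting
# point to tune, not a recommendation from the talk.
tuning_config = {
    "type": "index_parallel",
    "partitionsSpec": {
        "type": "dynamic",
        "maxRowsPerSegment": 5_000_000,
    },
}
```

After ingestion, check actual segment sizes (e.g. in the Coordinator console) and adjust the cap until they fall in the target range.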

That is what I wanted to share with you today. Of course, over the course of using Druid we ran into many other problems as well. I hope Druid keeps getting better and better.

 

Other notes

Druid is open source under the Apache License 2.0, with the code hosted on GitHub; the latest stable version is 0.7.11, with 63 contributors and nearly 2,000 followers. Druid's main contributors include the advertising-analytics startup Metamarkets, the movie streaming site Netflix, Yahoo, and other companies. The official Druid documentation compares Druid with Shark, Vertica, Cassandra, Hadoop, Spark, Elasticsearch, and others in terms of fault tolerance, flexibility, and query performance.


Origin: www.cnblogs.com/liuys635/p/11295635.html