A Quick Look at Druid - Real-Time Big Data Analysis Software

What is Druid

  The word Druid comes from figures in ancient Western mythology of the Roman era; in Chinese it is usually transliterated as 德鲁伊.
  The Druid introduced in this article is a distributed data store that supports real-time analysis. The American advertising technology company MetaMarkets created the Druid project in 2011 and open-sourced it in late 2012. Druid was designed for analytics from the start. Compared with traditional OLAP systems, it offers significantly better performance in both data processing scale and real-time processing, and it embraces mainstream open source ecosystems, including Hadoop. It has been a very active open source project for years.
  Druid's official website is http://druid.io
  Alibaba has also created an open source project called Druid (referred to here as Ali Druid), which is a database connection pool. Ali Druid has nothing to do with the Druid discussed in this article; the two solve completely different problems.

Big Data Analytics and Druid

  Big data has been a hot topic in recent years. As data volumes have grown rapidly, the scale of data processing has risen from gigabytes to terabytes, and many image-related application fields have begun to analyze data at the petabyte level. The core goal of big data is to make the business more competitive by finding actionable insights. Data analysis is the core technology here; it covers data collection, processing, modeling, and analysis, and ends with finding ways to improve the business.
  Over the past year or two, as demand for big data analysis has exploded, many companies have moved their data platforms from commercial relational databases to open source big data platforms such as Hadoop or Spark, so that they can handle larger amounts of data while keeping hardware and software costs under control. Hadoop was designed to process big data in batches, and real-time processing is often its weak point. For example, it is often hard to estimate how long a MapReduce job will take to finish, which falls short of the seconds-level query response that many data analysts expect.
  To address the real-time problem, most companies go through a phase of moving data analysis toward more real-time, interactive solutions. This involves introducing new software, reworking data flows, and so on. Several common approaches to data analysis are shown in the figure below.
【figure 1】
  The data analysis infrastructure as a whole usually falls into the following categories.
(1) Run MapReduce (MR) analysis with Hadoop/Spark.
(2) Load the Hadoop/Spark results into an RDBMS to serve real-time analysis.
(3) Load the results into a higher-capacity NoSQL store, such as HBase.
(4) Stream the data source into a stream computing framework such as Storm, with the results landing in an RDBMS or NoSQL store.
(5) Stream the data source into an analytical database, such as Druid or Vertica.

Druid's three design principles

  At the outset, the developers settled on three design principles.
(1) Fast query: partial aggregation + in-memory data + indexes.
(2) Horizontal scalability: distributed data + parallelizable queries.
(3) Real-time analytics: immutable past, append-only future.

1 Fast Query

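  In data analysis, we mostly care about data aggregated at some granularity rather than the details of every raw row. The aggregation granularity can be 1 minute, 5 minutes, 1 hour, 1 day, and so on. Partial aggregation gives Druid a great deal of room for performance optimization.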
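The idea of partial aggregation can be illustrated with a minimal Python sketch. It is not Druid's implementation: the event layout (timestamp, publisher, country, clicks) and the function names are hypothetical, chosen only to show how raw rows collapse into one row per time bucket and dimension combination.

```python
from collections import defaultdict
from datetime import datetime, timezone


def truncate(ts, granularity_seconds):
    """Floor a timestamp to the start of its granularity bucket."""
    epoch = int(ts.timestamp())
    floored = epoch - epoch % granularity_seconds
    return datetime.fromtimestamp(floored, tz=timezone.utc)


def rollup(events, granularity_seconds):
    """Aggregate raw events per (time bucket, publisher, country)."""
    totals = defaultdict(lambda: {"count": 0, "clicks": 0})
    for ts, publisher, country, clicks in events:
        key = (truncate(ts, granularity_seconds), publisher, country)
        totals[key]["count"] += 1
        totals[key]["clicks"] += clicks
    return dict(totals)


# Hypothetical raw events: (timestamp, publisher, country, clicks).
RAW_EVENTS = [
    (datetime(2016, 1, 1, 10, 0, 5, tzinfo=timezone.utc), "site-a", "US", 1),
    (datetime(2016, 1, 1, 10, 0, 40, tzinfo=timezone.utc), "site-a", "US", 3),
    (datetime(2016, 1, 1, 10, 1, 10, tzinfo=timezone.utc), "site-a", "US", 2),
    (datetime(2016, 1, 1, 10, 1, 30, tzinfo=timezone.utc), "site-b", "CN", 5),
]

# Roll up at 1-minute granularity: four raw rows become three stored rows.
ROLLED_UP = rollup(RAW_EVENTS, 60)
```

A coarser granularity collapses more rows into each bucket, trading detail for less data to store and scan.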
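  Keeping data in memory is the other trump card for query speed. Memory is nearly a hundred times faster to access than disk, but it is also very limited in size, so memory use has to be designed carefully; Druid, for example, uses bitmaps and a variety of compression techniques.
  In addition, to support drill-down on particular dimensions, Druid maintains inverted indexes, which speed up operations such as AND and OR.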
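As a rough illustration of why inverted indexes make AND and OR cheap, the sketch below maps each dimension value to a bitmap of the rows that contain it. Plain Python integers stand in for the compressed bitmaps a real system would use, and the rows and function names are made up for the example.

```python
def build_inverted_index(rows, dimension):
    """Map each value of a dimension to a bitmap of matching row ids."""
    index = {}
    for row_id, row in enumerate(rows):
        value = row[dimension]
        # Set bit number row_id in the bitmap for this value.
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def matching_rows(bitmap):
    """Decode a bitmap back into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical table with two dimensions.
TABLE = [
    {"country": "US", "device": "mobile"},
    {"country": "CN", "device": "mobile"},
    {"country": "US", "device": "desktop"},
    {"country": "CN", "device": "desktop"},
]

COUNTRY_INDEX = build_inverted_index(TABLE, "country")
DEVICE_INDEX = build_inverted_index(TABLE, "device")

# A filter becomes a bitwise operation on two bitmaps, with no row scan.
US_AND_MOBILE = matching_rows(COUNTRY_INDEX["US"] & DEVICE_INDEX["mobile"])
US_OR_MOBILE = matching_rows(COUNTRY_INDEX["US"] | DEVICE_INDEX["mobile"])
```

Here `US_AND_MOBILE` is `[0]` and `US_OR_MOBILE` is `[0, 1, 2]`; only the rows named by the resulting bitmap need to be read.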

2 Horizontal Scalability

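  Druid's query performance depends heavily on making good use of memory. Data can be spread across the memory of many nodes, so when the data grows, capacity can be expanded simply by adding machines. To keep the load balanced, Druid partitions the aggregated data by time range. For high-cardinality dimensions, splitting by time alone is sometimes not enough (each Druid segment holds no more than 20 million rows), so Druid also supports partitioning segments further.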
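The two-level partitioning described above can be sketched as follows: rows are first grouped by time chunk, and a chunk that exceeds a row limit is split into numbered partitions. This is a simplified model under assumed names (`assign_segments`, `chunk_seconds`, `max_rows_per_segment`), with a tiny row limit so the split is visible; it does not reproduce Druid's actual sharding logic.

```python
from collections import defaultdict


def assign_segments(rows, chunk_seconds, max_rows_per_segment):
    """Group rows by time chunk, then split oversized chunks into partitions."""
    by_chunk = defaultdict(list)
    for row in rows:
        chunk_start = row["timestamp"] - row["timestamp"] % chunk_seconds
        by_chunk[chunk_start].append(row)

    segments = {}
    for chunk_start, chunk_rows in sorted(by_chunk.items()):
        for offset in range(0, len(chunk_rows), max_rows_per_segment):
            partition = offset // max_rows_per_segment
            # Key each segment by (start of time chunk, partition number).
            segments[(chunk_start, partition)] = chunk_rows[
                offset:offset + max_rows_per_segment
            ]
    return segments


# Hypothetical rows with epoch-second timestamps.
SAMPLE_ROWS = [
    {"timestamp": t, "user": f"u{t}"} for t in (0, 10, 20, 3599, 3600, 3700)
]

# Hourly chunks, at most 3 rows per segment.
SEGMENTS = assign_segments(SAMPLE_ROWS, chunk_seconds=3600, max_rows_per_segment=3)
```

The first hour holds four rows and is split into partitions 0 and 1, while the second hour fits in a single segment.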
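  Historical segment data can be kept in a deep storage system, which can be local disk, HDFS, or a remote cloud service. If some nodes fail, ZooKeeper is used to coordinate the other nodes in rebuilding the data.
  Druid's query module detects and handles changes in cluster state, so queries always run against a valid cluster topology. Queries across the cluster can be scaled out flexibly. Druid has built-in support for aggregations that parallelize easily, such as Count, Mean, Variance, and other query statistics. For operations that cannot be parallelized, such as Median, Druid does not provide support for now. Histograms are likewise supported through approximate computation in order to preserve Druid's overall query performance; these approximation methods also include cardinality estimation with HyperLogLog and DataSketches.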
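To see why Count, Mean, and Variance parallelize easily, consider the sketch below: each node reduces its own data to a small summary, and summaries are merged with the standard pairwise formula for combining means and sums of squared deviations. The class and function names are illustrative, not Druid's, and `summarize` assumes a non-empty list. An exact median has no comparable fixed-size summary, which is why it does not fit this pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partial:
    """Per-node summary that can be merged with other summaries."""
    count: int
    mean: float
    m2: float  # sum of squared deviations from the mean


def summarize(values):
    """Reduce one node's (non-empty) values to a mergeable summary."""
    n = len(values)
    mean = sum(values) / n
    return Partial(n, mean, sum((v - mean) ** 2 for v in values))


def merge(a, b):
    """Combine two summaries without revisiting the raw values."""
    n = a.count + b.count
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / n
    m2 = a.m2 + b.m2 + delta * delta * a.count * b.count / n
    return Partial(n, mean, m2)


# Two hypothetical nodes, each holding part of the data.
NODE_A = summarize([1.0, 2.0, 3.0])
NODE_B = summarize([4.0, 5.0, 6.0, 7.0])

MERGED = merge(NODE_A, NODE_B)
VARIANCE = MERGED.m2 / MERGED.count  # population variance of all 7 values
```

Merging gives the same count, mean, and variance as a single pass over all seven values, so each node can work independently and only the small summaries travel over the network.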

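3 Real-Time Analytics

  Druid provides storage for time-based data, and every row is an event that actually happened in the past, so it was agreed from the start that once an event enters the system it can no longer be changed.
  Druid organizes historical data as segment data files and stores them in a deep storage system, such as a file system or Amazon S3. When the data needs to be queried, Druid loads it from deep storage into memory.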
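The "immutable past, append-only future" principle can be modeled in a few lines. The toy store below is an assumption-laden sketch, not Druid's architecture: new events can only be appended, and publishing freezes the open events into a tuple that stands in for an immutable segment file in deep storage.

```python
class EventStore:
    """Toy model of 'immutable past, append-only future'."""

    def __init__(self):
        self._open = []      # events that are still being appended
        self._segments = []  # published segments, stored as tuples

    def append(self, event):
        """Accept a new event; existing events are never updated."""
        self._open.append(dict(event))

    def publish(self):
        """Freeze the open events into a segment and start a new buffer."""
        segment = tuple(self._open)
        self._segments.append(segment)
        self._open = []
        return segment

    def query(self):
        """Read across published segments and still-open events."""
        rows = [event for segment in self._segments for event in segment]
        return rows + list(self._open)


STORE = EventStore()
STORE.append({"timestamp": 1, "clicks": 2})
STORE.append({"timestamp": 2, "clicks": 5})
FIRST_SEGMENT = STORE.publish()
STORE.append({"timestamp": 3, "clicks": 1})

# Queries see both the frozen segment and the newly appended event.
TOTAL_CLICKS = sum(row["clicks"] for row in STORE.query())
```

Because published segments never change, they can be copied, cached, and reloaded freely without coordination between writers and readers.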

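Druid's technical features

  Druid has the following technical features.
• High data throughput.
• Support for streaming data ingestion and real-time analysis.
• Flexible, fast queries.
• Strong community support.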

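1 High Data Throughput

  Many companies choose Druid as their analytics platform for its data throughput. Processing billions to tens of billions of events per day is a scenario Druid suits well, and many Internet companies already run it at that scale. In other words, many companies select Druid to cope with the data explosion.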

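2 Support for Streaming Data Ingestion

  Many data analysis tools trade off throughput against streaming capability: Hadoop favors batch processing, while Storm is a stream computing platform. Few systems connect directly to a variety of streaming data sources at the analytics-platform level.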

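3 Flexible, Fast Queries

  Data analysts' ideas often range freely, and they want to analyze data from many different angles. OLAP's star schema addresses this by defining a space in which analysts can explore the data freely. When the data volume is small all is well, but once it grows, any analysis system that cannot return results within seconds draws complaints. Druid therefore supports queries on any combination of dimensions and answers them very quickly; these two properties are its most important strengths as an analytics platform.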
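As an illustration of querying on arbitrary dimension combinations, the sketch below builds request bodies modeled on Druid's native JSON groupBy query. The data source `ad_events`, the dimension names, and the `clicks` metric are hypothetical, and the exact set of supported fields should be checked against the documentation for the Druid version in use. The code only constructs the query; it does not send it.

```python
import json


def group_by_query(data_source, dimensions, interval, metric, granularity="hour"):
    """Build a groupBy-style query body for any combination of dimensions."""
    return {
        "queryType": "groupBy",
        "dataSource": data_source,
        "granularity": granularity,
        "dimensions": list(dimensions),
        "intervals": [interval],
        "aggregations": [
            {"type": "longSum", "name": metric, "fieldName": metric}
        ],
    }


# The same helper serves different analytical angles on the same data.
BY_COUNTRY = group_by_query(
    "ad_events", ["country"], "2016-01-01/2016-01-02", "clicks"
)
BY_COUNTRY_DEVICE = group_by_query(
    "ad_events", ["country", "device"], "2016-01-01/2016-01-02", "clicks"
)

# Serialized form, as it would be posted to a query endpoint.
QUERY_JSON = json.dumps(BY_COUNTRY_DEVICE, indent=2)
```

Changing the analysis angle only means changing the `dimensions` list; no new table or precomputed cube is needed for each combination.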

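4 Strong Community Support

  Since being open-sourced, Druid has won favor with many Internet companies, including Yahoo, eBay, and Alibaba; Yahoo has 5 committers, Google has 1, and Alibaba has 1. Recently, several of Druid's creators who were formerly at MetaMarkets founded a new company called Imply.io to advance the Druid ecosystem and promote Druid's growth and adoption.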

Druid's application scenarios

  In terms of technical positioning, Druid is a distributed data analysis platform. It is functionally very similar to a traditional OLAP system, but its implementation makes deliberate trade-offs: to support larger data volumes, more flexible distributed deployment, and more real-time data ingestion, Druid drops some of the more complex operations in OLAP queries, such as JOIN. Compared with a traditional database, Druid is a time-series database that aggregates data at a chosen time granularity to speed up analytical queries.
  In terms of application scenarios, Druid started as an advertising data analysis platform and is now widely used across industries and by many Internet companies. The latest list is available at http://druid.io/druidpowered.html .

  The Druid ecosystem is expanding and maturing, and Druid is addressing more and more business scenarios. The book "Principles and Practices of Druid Real-Time Big Data Analysis" aims to help engineers make better technology choices, understand Druid's features and principles in depth, and solve big data analysis problems more effectively.
  This article is excerpted from "Principles and Practices of Druid Real-Time Big Data Analysis", which is listed on the official website of Bowen Viewpoint.