Big Data Architecture Process

Data processing can be divided into three categories: 

  • The first is from a business perspective, subdivided into query and retrieval, data mining, statistical analysis, and in-depth analysis, where in-depth analysis is further divided into machine learning and neural networks.
  • The second is from a technical perspective, subdivided into batch processing, SQL, stream processing, machine learning, and deep learning.
  • The third is by programming model, subdivided into the offline programming model, the in-memory programming model, and the real-time programming model.

Combined with the data source characteristics, classification, collection methods, storage selection, data analysis, and data processing described above, I will give an overall big data platform architecture here. It is worth noting that monitoring, resource coordination, security logs, etc. have been removed from the architecture diagram. 


On the left is the data source: real-time streaming data (which may be structured or unstructured, but is characterized by arriving in real time) and offline data. Offline data is generally handled with ETL tools; within the data platform, Sqoop or Flume is used to synchronize data, or NIO-based frameworks are used to read and load it, and the result is written to HDFS. There are also some specialized storage options, such as HAWQ, an open-source distributed database that supports transactional consistency. 

From a business-scenario perspective: for statistical analysis we can use SQL, MapReduce, stream processing, or Spark; for query and retrieval we should also consider writing to ES while synchronously writing to HDFS; for data analysis we can build a cube and move into OLAP scenarios. 

 

Before introducing the content of this article in detail, let's take a look at the overall Hadoop business development flow.

From Hadoop's business development flow we can see that, in big data processing, data collection is a very important and unavoidable step, which brings us to the protagonist of this article: Flume. This article gives a detailed introduction to Flume's architecture and to a Flume application (log collection). 

 

  Features of Flume:
  Flume is a distributed, reliable, highly available system for collecting, aggregating, and transporting massive amounts of log data. It supports customizing various data senders in the log system for data collection, and it also provides the ability to do simple processing on the data and write it to various data receivers (such as text files, HDFS, HBase, etc.).
  Flume's data flow is carried end to end by events. The Event is Flume's basic data unit; it carries log data (as a byte array) plus header information. Events originate from data outside the Agent: when the Source captures an event, it applies a specific format and then pushes the event into one or more Channels. You can think of a Channel as a buffer that holds events until a Sink has finished processing them. The Sink is responsible for persisting the log or pushing the event on to another Source.
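To make the Source → Channel → Sink chain concrete, here is a minimal single-agent configuration sketch in Flume's standard properties format; the agent name, file paths, and HDFS location are hypothetical and not taken from any deployment described in this article.

```properties
# Hypothetical agent "a1": tail an application log into HDFS.
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: follow a log file (exec is the simplest illustration; Taildir or
# Spooling Directory sources are usually more robust in production).
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory until the sink drains them.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# Sink: write events to HDFS, partitioned by date.
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/logs/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```

Such an agent would typically be started with `flume-ng agent --conf conf --conf-file example.conf --name a1`.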

To support all of these functions, how should we build our data platform?

Let's look at the main steps of our data processing. First, our SDKs collect data. After collection, the data is thrown into our message queue for basic persistence. From there it splits into two parts: real-time statistics and offline statistics. After both parts finish, the statistical results are saved and provided to our query service, and finally to our external display interface. Our data platform is mainly the four green sections in the middle of the diagram.

As for the requirements: the message queue must have high throughput and very good scalability, so that it can be expanded at any time if there is a message peak; and because everything is distributed, a node failure must not affect normal business.

Our real-time computation currently works at minute-level granularity rather than second-level, while the offline computation needs to be very fast. With these two requirements in mind we chose Spark, because Spark itself supports both real-time and offline processing, and compared with other real-time solutions such as Flink, Storm, or Samza, we do not need second-level latency; we need throughput. The real-time part uses Spark Streaming, and the offline part uses Spark's batch solution.
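As a hedged sketch of what the real-time path could look like with the Spark 1.5-era APIs mentioned later in this article (the topic name, broker addresses, and counting logic are placeholders, not the authors' actual code):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object MinuteLevelStats {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("minute-level-stats")
    // Minute-level batches: we trade sub-second latency for throughput.
    val ssc = new StreamingContext(conf, Minutes(1))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092") // assumed brokers
    val topics = Set("events") // hypothetical topic name

    // Direct stream: each Kafka partition maps to an RDD partition.
    val events = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Toy aggregation: count events per value within each one-minute batch.
    events.map { case (_, value) => (value, 1L) }
          .reduceByKey(_ + _)
          .print() // a real job would write results to HBase instead of printing

    ssc.start()
    ssc.awaitTermination()
  }
}
```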

Query plan: because we want to support combined sorting across multiple dimensions, we hope to support SQL, so that the various combined sorts can be translated into SQL GROUP BY and ORDER BY operations.

Message queue -- Kafka

We chose Kafka for the message queue because, in our view, Kafka is currently the most mature distributed message queue solution; its performance and scalability are very good, and it supports fault tolerance: by configuring replication you can ensure data integrity. Kafka is also supported by all mainstream stream-computing frameworks, such as Spark, Flink, Storm, and Samza. In addition, several of our company's founders came from LinkedIn, had used Kafka while they were there, and were very familiar with it, so we chose Kafka.
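For reference, a minimal producer that relies on replication and acknowledgements for durability might look like the sketch below; the broker addresses and topic are assumptions, and the topic itself would be created separately with a replication factor (for example 3) to provide the redundancy mentioned above.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object EventProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092,broker2:9092") // assumed broker addresses
    // Wait for all in-sync replicas to acknowledge, trading latency for durability.
    props.put("acks", "all")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // "events" is a placeholder topic, assumed to exist with replicated partitions.
    producer.send(new ProducerRecord[String, String]("events", "user-123", "{\"action\":\"click\"}"))
    producer.close()
  }
}
```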

Message Timing -- HBase

But after choosing Kafka we ran into a problem: message ordering. In our data collection process, because different users have different network bandwidth, data may be delayed, so a message generated earlier can arrive later; and ordering is not guaranteed across different Kafka partitions.

However, all of our offline statistics programs need to count by time, so we need a database that supports time series to help us sort the data. Here we chose HBase: we use the time the message was generated plus the message ID we generate as its unique row key, for sorting and indexing.
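A hedged sketch of this row-key idea against the HBase 0.98-era client API (the table name, column family, and exact key layout are illustrative assumptions, not the authors' schema):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

object EventWriter {
  // Row key = event timestamp (millis, zero-padded) + "-" + message ID,
  // so a scan over a time range returns events in generation order.
  def rowKey(eventTimeMs: Long, messageId: String): Array[Byte] =
    Bytes.toBytes(f"$eventTimeMs%013d-$messageId")

  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()   // reads hbase-site.xml from the classpath
    val table = new HTable(conf, "events")   // hypothetical table name
    val put = new Put(rowKey(1451606400000L, "msg-42"))
    put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("{\"action\":\"click\"}"))
    table.put(put)
    table.close()
  }
}
```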

SQL On HBase -- Apache Phoenix

For the SQL scheme we chose Phoenix. We chose it after considering several current SQL-on-HBase solutions and finding that Phoenix is very efficient: it makes full use of HBase coprocessors and performs a great deal of computation on the server side, which greatly reduces the data and computation pressure on the client.

Phoenix also supports HBase's column family concept. For example, when we need to support 40 dimensions we have one wide table; if we put all of the columns into a single column family, then querying any one column requires reading all 40 columns, which is not worth it. Since Phoenix supports column families, we can split the columns into several column families according to how related they are; a query will then typically hit only one or two column families, which greatly reduces the amount of data read.
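As an illustration of splitting a wide table into column families (a sketch only: the table name, column names, the two families "A" and "B", and the ZooKeeper address are hypothetical):

```scala
import java.sql.DriverManager

object CreateWideTable {
  def main(args: Array[String]): Unit = {
    // Phoenix exposes HBase through a JDBC driver; "zk1:2181" is an assumed quorum address.
    val conn = DriverManager.getConnection("jdbc:phoenix:zk1:2181")
    val stmt = conn.createStatement()
    // Columns prefixed "A." and "B." land in different HBase column families,
    // so a query touching only A.* columns never has to read family B.
    stmt.execute(
      """CREATE TABLE IF NOT EXISTS EVENTS (
        |  ROW_KEY    VARCHAR PRIMARY KEY,
        |  A.COUNTRY  VARCHAR,
        |  A.DEVICE   VARCHAR,
        |  B.REVENUE  DECIMAL,
        |  B.DURATION BIGINT
        |)""".stripMargin)
    conn.close()
  }
}
```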

Phoenix also supports Spark's DataSource API, with column pruning and row filtering, and it supports writing data as well. What is Spark's DataSource API? Spark introduced the DataSource API in version 1.2; it mainly gives the Spark framework a fast way to read external data. With this API, different data formats can easily be registered as Spark tables and then read directly through Spark SQL. It can take full advantage of Spark's distributed nature to read in parallel, and Spark itself has a good optimization engine that greatly speeds up Spark SQL execution.
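A minimal read via the phoenix-spark connector, in the Spark 1.x style used elsewhere in this article (the table, column names, and ZooKeeper URL reuse the hypothetical values from the sketch above):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ReadPhoenixTable {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("read-phoenix"))
    val sqlContext = new SQLContext(sc)

    // The phoenix-spark connector registers itself under "org.apache.phoenix.spark".
    val df = sqlContext.read
      .format("org.apache.phoenix.spark")
      .options(Map("table" -> "EVENTS", "zkUrl" -> "zk1:2181"))
      .load()

    // Column pruning and filter pushdown keep the underlying HBase scan small.
    df.registerTempTable("events")
    sqlContext.sql(
      "SELECT COUNTRY, COUNT(*) AS CNT FROM events GROUP BY COUNTRY ORDER BY CNT DESC").show()
  }
}
```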

Because Spark has been very popular recently, its community resources are plentiful; basically all of the mainstream systems, such as Phoenix, Cassandra, and MongoDB, have Spark DataSource implementations. Another benefit is that it provides a unified type system: all external tables are converted into Spark's data types, so different external tables can be joined and operated on together.

After the considerations above, we settled on the following data architecture.

At the very bottom are three SDKs: JS, Android, and iOS. After collecting data they send it to our load balancer (we use AWS for load balancing), which automatically forwards the data to our servers. The servers perform an initial cleaning, filtering out irregular data, and then send the data to Kafka, after which it enters our real-time and offline pipelines.

Ultimately our statistics land in HBase, and what we expose externally is a SQL interface, through which the required statistics can be queried with various SQL combinations. The main versions we currently use: Spark is still 1.5.1 with some custom patches for our business needs, Hadoop is 2.5.2, HBase is 0.98, and Phoenix is 4.7.0, where we fixed some small bugs, added some features of our own, and applied our own patches. 


Lambda Architecture

The main idea of the Lambda architecture is to structure a big data system into several layers: the batch layer, the speed (real-time) layer, and the serving layer, as shown in Figure (C).

Ideally, any data access could start from the expression query = function(all data). However, once the data reaches a certain scale (for example, petabytes) and real-time queries still need to be supported, this consumes enormous resources. One solution is to precompute the query function; the book calls the result of this precomputed query function a batch view (A), so that when a query needs to be executed, the result can be read from the batch view. Such a precomputed view can be indexed and therefore supports random reads (B). The system then becomes:

(A) batch view = function(all data);

(B) query = function(batch view).
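To make the two functions concrete, here is a toy sketch (plain Scala, with made-up event data) of precomputing a batch view and answering queries from it rather than from the raw data:

```scala
object LambdaToy {
  // "All data": an immutable log of (userId, day) page-view events.
  case class Event(userId: String, day: String)
  val allData: Seq[Event] = Seq(
    Event("u1", "2016-01-01"), Event("u2", "2016-01-01"), Event("u1", "2016-01-02"))

  // (A) batch view = function(all data): precompute page views per day.
  val batchView: Map[String, Int] = allData.groupBy(_.day).mapValues(_.size).toMap

  // (B) query = function(batch view): answered from the view, not by rescanning all data.
  def viewsOn(day: String): Int = batchView.getOrElse(day, 0)

  def main(args: Array[String]): Unit =
    println(viewsOn("2016-01-01")) // 2
}
```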

Figure (C)

Hadoop in Crisis? 8 Excellent Alternatives to HDFS

Ceph is an open-source, multi-purpose distributed storage platform. Because of its high-performance parallel file system, some even regard it as a successor to HDFS in Hadoop environments; researchers have been exploring this possibility since 2010.

Apache Flink

Apache Flink is a stream-processing framework that can also handle batch tasks. It treats batch data as a data stream with finite boundaries, thereby handling batch processing as a subset of stream processing. Taking a stream-first approach to all processing has a number of interesting side effects.

This stream-first approach is also called the Kappa architecture, in contrast to the more widely known Lambda architecture (in which batch processing is the primary method and streaming supplements it with early, unrefined results). In the Kappa architecture everything is processed as a stream, which simplifies the model; this has only become feasible as stream-processing engines have matured recently.

Stream processing model

Flink's stream-processing model treats each incoming item as a true data stream. The DataStream API provided by Flink can be used to process unbounded streams of data. The basic components Flink works with include the following (a minimal sketch follows the list):

  • Stream: an immutable, unbounded dataset that flows through the system
  • Operator: a function that performs operations on data streams to produce other data streams
  • Source: the entry point where a data stream enters the system
  • Sink: the place where a data stream leaves the Flink system; a sink can be a database or a connector to another system
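A minimal sketch of these four pieces using Flink's Scala DataStream API (the socket source, host/port, and word-count logic are illustrative placeholders, not part of the original article):

```scala
import org.apache.flink.streaming.api.scala._

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Source: an unbounded stream of text lines from a socket (host/port are placeholders).
    val lines: DataStream[String] = env.socketTextStream("localhost", 9999)

    // Operators: split lines into words and count occurrences per word.
    val counts = lines
      .flatMap(_.toLowerCase.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)
      .sum(1)

    // Sink: print results to stdout; in practice this could be Kafka, a database, etc.
    counts.print()

    env.execute("socket word count")
  }
}
```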

To be able to recover from problems encountered during computation, streaming jobs create snapshots at predetermined points in time. For state storage, Flink can work with a variety of state backends, depending on the complexity and durability required.

In addition, Flink's stream processing understands the concept of "event time", meaning the time an event actually occurred, and it can also handle sessions. This means ordering and grouping can be ensured in some interesting ways.
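To illustrate event time and sessions together, here is a hedged sketch: the Click type, the inline test data, and the ten-minute session gap are assumptions made for illustration.

```scala
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows
import org.apache.flink.streaming.api.windowing.time.Time

object SessionCounts {
  // Hypothetical event type: a click with the time it actually occurred.
  case class Click(userId: String, timestamp: Long)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val clicks: DataStream[Click] = env.fromElements(
      Click("u1", 1000L), Click("u1", 5000L), Click("u1", 700000L))

    clicks
      // Use the time the event actually occurred, not the time Flink sees it.
      .assignAscendingTimestamps(_.timestamp)
      .map(c => (c.userId, 1))
      .keyBy(0)
      // Group clicks into sessions separated by 10 minutes of inactivity.
      .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
      .sum(1)
      .print()

    env.execute("session counts")
  }
}
```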

Batch processing model

Flink's batch-processing model is largely just an extension of the stream-processing model. Instead of reading from a continuous stream, it reads a bounded dataset from persistent storage as a stream, and it uses exactly the same runtime for both processing models.

Flink can apply certain optimizations to batch workloads. For example, since batch operations are backed by persistent storage, Flink can skip creating snapshots for batch workloads; the data can still be recovered, but normal processing runs faster.

Another optimization is decomposing batch tasks so that different stages and components are only invoked when needed, which lets Flink coexist better with the cluster's other users. Analyzing tasks ahead of time gives Flink visibility into all the operations to be performed, the size of the dataset, and the downstream steps, enabling further optimization.

Advantages and limitations

Flink is currently a unique technology in the processing-framework space. Although Spark can also perform batch and stream processing, the micro-batch architecture of Spark's streaming makes it unsuitable for many use cases. Flink's stream-first approach provides low latency, high throughput, and near item-by-item processing.

Many of Flink's components are self-managed. Although this approach is relatively rare, for performance reasons Flink manages its own memory rather than relying on the native Java garbage-collection mechanism. Unlike Spark, Flink does not require manual optimization and tuning when the characteristics of the data it processes change, and it also handles data partitioning and automatic caching on its own.

Flink analyzes its work in multiple ways in order to optimize tasks. This analysis is in part similar to what a SQL query planner does for a relational database: determining the most efficient implementation for a given task. It also supports multi-stage parallel execution and can bring the data of blocked tasks together. For iterative tasks, for performance reasons Flink tries to execute the computation on the nodes where the data is stored. It can also perform "delta iterations", iterating only over the parts of the data that have changed.

On the tooling side, Flink provides a web-based scheduling view for easily managing jobs and checking system status. Users can also view the optimization plan for submitted jobs to understand how a job is ultimately executed on the cluster. For analytical tasks, Flink offers SQL-like queries, graph processing, and a machine-learning library, and it also supports in-memory computation.

Flink works well with other components. Used within a Hadoop stack, it fits into the overall environment and only takes up the necessary resources at any given time. It integrates easily with YARN, HDFS, and Kafka, and with the help of compatibility packages it can also run tasks written for other processing frameworks such as Hadoop and Storm.

One of Flink's biggest limitations at the moment is that it is still a very young project. Large-scale real-world deployments are not yet as common as for other processing frameworks, and there has been little in-depth study of Flink's scaling limits. With its rapid development cycle and improvements such as the compatibility packages, more Flink deployments are likely to appear as more organizations begin to try it.

Summary

Flink provides low-latency stream processing while also supporting traditional batch tasks. It is perhaps best suited to organizations with heavy stream-processing needs and a smaller number of batch jobs. Its compatibility with native Storm and Hadoop programs and its ability to run on YARN-managed clusters make it easy to evaluate, and its fast-moving development makes it worth everyone's attention.

Conclusion

Big data systems can use a variety of processing techniques.

For batch-only workloads that are not time-sensitive, Hadoop, which is less expensive to implement than other solutions, would be a good choice.

For workloads that only require stream processing, Storm supports a wider range of languages and enables extremely low-latency processing, but its default configuration can produce duplicate results and cannot guarantee ordering. Samza's tight integration with YARN and Kafka provides greater flexibility, easier multi-team use, and simpler replication and state management.

For mixed workloads, Spark offers high-speed batch processing and micro-batch stream processing. It is well supported, with a wide range of integration libraries and tools for flexible composition. Flink provides true stream processing along with batch-processing capability; it is deeply optimized, can run tasks written for other platforms, and provides low-latency processing, but it is still early days for practical adoption.

The most suitable solution depends mainly on the state of the data to be processed, the time requirements for processing, and the desired results. Whether to use a full-featured solution or one tightly focused on a particular problem requires careful consideration, and the same questions need to be asked when evaluating any emerging, innovative solution as it matures and gains acceptance.
