Big data platform architecture technology selection and scenario application

 

Summary

This talk draws on experience from multiple big data projects and product development to discuss how to build a general-purpose big data platform for different demand scenarios. It covers mainstream technologies for data collection, storage, and analytical processing, along with lessons learned in architectural decision-making and technology selection.

Speaker video

http://t.cn/R9xaSOB

What a big data platform covers

Data sources usually live in business systems. Most data analysis does not operate directly on the business system's data source; instead, the data must first be collected.

After collection, the data is stored in a way that suits the characteristics of the data source.

Finally, analysis and processing are performed according to where the data is stored.

The core of the entire ecosystem is data collection, data storage and data analysis.


Features of the data source

The characteristics of the data source determine the technology selection for data collection and data storage. These characteristics fall into four categories: origin, structure, mutability, and data volume.

Data sources may be internal or external, and the two are handled differently.

Technology selection also differs between structured and unstructured data.

The third characteristic is mutability: data may be append-only (immutable once written), or it may be modified and deleted.

Finally, the data volume may be large or small.

Internal data

Internal data comes from the enterprise's own systems, so active-write techniques (having the source push its changes) can be used to capture changed data in a timely manner.

External data

External data is obtained in one of two ways: API calls or web crawlers.

If the target data source provides an API, the data can be fetched by calling that API.

Otherwise, when no API is provided, the data is "crawled" with a web crawler.
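As a sketch of the API route, the paginated pull below is illustrative only: the `get_page` callable and the page size are assumptions, not details from the talk. In practice `get_page` would wrap an HTTP client call against the provider's documented endpoint.

```python
from typing import Callable, Iterator

def fetch_all_pages(get_page: Callable[[int], list], page_size: int = 100) -> Iterator[dict]:
    """Pull records from a paginated API until a short (final) page appears.

    `get_page(page_number)` returns a list of records; a page shorter
    than `page_size` signals that no further pages exist.
    """
    page = 0
    while True:
        records = get_page(page)
        yield from records
        if len(records) < page_size:
            break
        page += 1
```

The same loop works regardless of the underlying transport, which keeps the collection logic testable without network access.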

Unstructured Data & Structured Data

Unstructured data and structured data are stored completely differently. Unstructured data tends toward NoSQL databases, while structured data tends toward traditional relational databases, non-open-source professional databases such as Teradata, or PostgreSQL-based distributed databases.

Append-only data

If the source data is immutable or append-only, collection becomes very easy: synchronization needs only the simplest incremental strategy, and maintaining data consistency is relatively straightforward.
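A minimal sketch of that incremental strategy for an append-only source, assuming each row carries a monotonically increasing `id` that can serve as the watermark (the field name is illustrative):

```python
def incremental_sync(source_rows, last_synced_id):
    """Copy only rows newer than the watermark from the previous run.

    Returns the new rows plus the updated watermark to persist for
    the next synchronization cycle.
    """
    new_rows = [r for r in source_rows if r["id"] > last_synced_id]
    new_watermark = max((r["id"] for r in new_rows), default=last_synced_id)
    return new_rows, new_watermark
```

Because rows are never modified, replaying this from any stored watermark always converges on a consistent copy.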

Modifiable and deletable data

Some data in the source may be modified or deleted; in particular, many dimension tables change frequently. The easiest way to analyze such data is a direct connection to the source. If the data must instead be collected, synchronization issues have to be addressed.
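When such data is collected rather than queried directly, synchronization amounts to applying the source's upserts and deletes to the local copy. A minimal sketch, assuming changes arrive keyed by primary key (field names are illustrative):

```python
def merge_changes(target, changed_rows, deleted_ids):
    """Apply one batch of source changes to a local copy keyed by primary key."""
    for row in changed_rows:
        target[row["id"]] = row       # insert new rows or overwrite changed ones
    for pk in deleted_ids:
        target.pop(pk, None)          # drop rows deleted at the source
    return target
```

Real systems also need to detect the changes in the first place (e.g. via change-data-capture or modification timestamps), which is what makes mutable sources harder than append-only ones.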

Large data volumes


Processing a large data volume in time-sliced batches is not real-time processing. True real-time processing requires stream processing. Combining the two approaches leads to the big data Lambda architecture.

The Lambda architecture has three layers. The bottom layer is the speed layer, which must be fast, that is, real time.

The top layer is the batch layer, i.e., batch processing.

The middle serving layer periodically (or on demand) merges the batch views and speed views, producing data that combines both. The result satisfies a degree of real-time responsiveness while still handling large data volumes, and it is currently a popular approach to big data processing.
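The serving-layer merge can be sketched as follows, under the simplifying assumptions that both views are key-to-count mappings and that the speed view holds only deltas not yet absorbed by the latest batch run:

```python
def merged_view(batch_view, speed_view):
    """Serving-layer merge for the Lambda architecture (simplified).

    Starts from the stable batch view and overlays the fresher
    speed-layer deltas, so queries see both historical and recent data.
    """
    view = dict(batch_view)               # copy: the batch view stays immutable
    for key, delta in speed_view.items():
        view[key] = view.get(key, 0) + delta
    return view
```

Once a new batch run completes, the speed view's absorbed deltas are discarded and the cycle repeats.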

A typical data-loading architecture


Technology selection for data storage

It depends on the type of data source and how the data is collected.

It depends on the format and scale of the collected data.

It depends on the application scenarios in which the data will be analyzed.

A defining feature of a big data platform is that the same business data is stored in multiple different representations across different types of databases, forming a poly-db ecosystem of deliberate data redundancy.

Scenario 1: Public opinion analysis

This was a public opinion analysis project for a mobile phone brand. The customer required full-text search over the opinion data, which could reach up to 7 billion records, with a full-text search response time of under 10 seconds.


The crawler writes into Kafka; stream processing performs deduplication and noise removal, followed by semantic analysis. After semantic analysis, the opinion data is written to Elasticsearch (ES) and the full data set is written to HDFS.
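The dedupe-and-denoise step in that stream can be sketched as below; the hashing scheme and the length threshold are illustrative assumptions, not details from the talk:

```python
import hashlib

def dedupe_and_denoise(messages, min_length=10):
    """One stream-processing step: drop exact duplicates (by content hash)
    and discard messages too short to carry meaningful opinion content."""
    seen = set()
    for text in messages:
        if len(text) < min_length:
            continue                  # denoise: too short to be useful
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue                  # dedupe: already processed
        seen.add(digest)
        yield text
```

In a real pipeline this logic would run inside a stream processor consuming from Kafka, with the `seen` set replaced by bounded state (e.g. a TTL-based store) since crawled streams are unbounded.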

Scenario 2: Business intelligence product


Aggregations over the collected data are column-based computations, whereas traditional databases use row-oriented storage. With row storage, a column-based computation requires scanning the full table, so Parquet was chosen here. Because Parquet stores data column-wise, statistical analysis over it is very fast. Point queries against Parquet, however, are slow; it offers no advantage there.
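The row-versus-column trade-off can be illustrated with plain Python structures; Parquet itself is a columnar file format, and this sketch only mimics its column-contiguous layout:

```python
# Row store: each record is a complete tuple. Summing one column
# still forces a visit to every full row.
row_store = [
    {"region": "north", "sales": 120, "units": 3},
    {"region": "south", "sales": 80,  "units": 2},
    {"region": "north", "sales": 50,  "units": 1},
]
total_from_rows = sum(r["sales"] for r in row_store)

# Column store (Parquet-like): each column is stored contiguously,
# so an aggregate reads only the single column it needs.
column_store = {
    "region": ["north", "south", "north"],
    "sales":  [120, 80, 50],
    "units":  [3, 2, 1],
}
total_from_columns = sum(column_store["sales"])

assert total_from_rows == total_from_columns == 250
```

The same asymmetry explains the slow point queries: fetching one whole record from a column store means reassembling it from every column.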

Scenario 3: Airbnb's big data platform


Part of Airbnb's data comes from its own business data in MySQL; another part is a large volume of events. Different data sources are processed differently.

Log-based events are written into Kafka; MySQL data is imported with Sqoop into HDFS, on top of which a Hive cluster is built. A copy of the data is also stored in Amazon S3.

Part of the business merges the data into HDFS for large volumes of business queries and statistics. SQL-style querying was desired here; among the many options, Airbnb chose Presto.

Stream processing and machine learning workloads use Spark, so the selection differs there.

Categories of data processing

From a business perspective, data processing can be divided into query and retrieval, data mining, statistical analysis, and deep analysis.

From a technical perspective, it falls into five categories: batch MapReduce, SQL, stream processing, machine learning, and deep learning.

Programming models include the offline (batch) model, the in-memory model, and the real-time model.

