An Overview of Big Data Platform Development Frameworks

Big Data background

With the surge in data traffic and growing demand for intelligent applications, big data has emerged in this DT (data technology) era. Big data development must solve two problems: how to store massive amounts of data in a unified way, and how to compute over massive amounts of data in a unified way. Many big data frameworks have been created to address these issues. Today, big data technology can fairly be called mature, and companies such as BAT (Baidu, Alibaba, Tencent) apply it very broadly. Analyzing massive data makes the data more valuable: operations teams can use big data analysis to build accurate marketing plans, and by collecting user operation logs and analyzing user behavior, a system can give users precise and valuable recommendations. AI likewise analyzes massive data to produce intelligent behavior through accurate algorithms.

Hadoop big data applications

Big Data Analysis

Statistics

Recommendation

Machine Learning

Artificial intelligence, prediction (algorithms)

SQL on Hadoop

Hive

Phoenix (based on HBase)

Spark SQL

Hadoop 2.x Overview

Hadoop contains four modules: Common, HDFS, YARN, and MapReduce.

As can be seen above, Hadoop centers on three main parts: HDFS for storage, MapReduce for distributed computation, and YARN for managing compute tasks, including resource scheduling, task coordination, and task monitoring.

HDFS services

YARN services

The offline computation framework MapReduce

Common offline and real-time techniques in the big data development process

Nutch crawls text data (semi-structured data).
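A crawler like Nutch essentially fetches pages, extracts the text payload, and follows the links it finds. Below is a minimal sketch of just the extraction step in plain Python using only the standard library; the HTML snippet and class name are invented for illustration, not taken from Nutch.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects hyperlinks and visible text from one HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []       # hrefs to feed back into the crawl frontier
        self.text_parts = []  # visible text to store for analysis

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)

html = '<html><body><h1>News</h1><a href="/a">A</a><a href="/b">B</a></body></html>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)                  # discovered links
print(" ".join(parser.text_parts))   # extracted text payload
```

A real crawler adds fetching, deduplication, and politeness policies on top of this loop.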

Flume collects and prepares log data.

Scribe is Facebook's open-source log collection system. It can collect logs from various log sources and store them on a central storage system for centralized statistical analysis.

Structured data is stored in databases and managed by an RDBMS (relational database management system).

Sqoop can import data from a relational database (e.g., MySQL, Oracle, Postgres) into Hadoop's HDFS, and can also export data from HDFS back into a relational database.

Oozie is a workflow engine server used to run Hadoop Map/Reduce and Pig task workflows.

MapReduce is the main processing logic and engine: map distributes the work, and reduce merges the results.
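The "map distributes, reduce merges" idea can be illustrated with the classic word count. This is a plain-Python sketch of the model, not the actual Hadoop Java API; the three functions mirror the map, shuffle, and reduce phases the framework runs for you.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # map: turn one input record into (key, 1) pairs
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # shuffle: group all values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # reduce: merge all values for one key into a single result
    return key, sum(values)

lines = ["big data big compute", "data platform"]
mapped = chain.from_iterable(map_phase(l) for l in lines)
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(result)  # word -> count
```

In real Hadoop, map tasks run in parallel across HDFS blocks and the shuffle moves data over the network, but the data flow is exactly this.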

HDFS is the distributed file storage system.

HBase provides fast storage and fast query response.

Jaql is a JSON query language used to simplify the modeling and manipulation of JSON data, mainly for analyzing large-scale semi-structured data.
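Jaql has its own query syntax; as a stand-in, the same kind of filter-and-project query over semi-structured JSON records can be sketched with Python's standard json module. The records below are invented example data.

```python
import json

raw = '''[
  {"user": "u1", "action": "click", "ms": 120},
  {"user": "u2", "action": "view",  "ms": 300},
  {"user": "u1", "action": "click", "ms": 80}
]'''

records = json.loads(raw)
# filter records by action, then project only the fields we care about
clicks = [{"user": r["user"], "ms": r["ms"]} for r in records if r["action"] == "click"]
print(clicks)
```

Tools like Jaql (or Hive with a JSON SerDe) express the same filter/project pipeline declaratively and run it over HDFS-scale data.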

Hue is a browser-based graphical user interface for quickly developing and debugging applications across the Hadoop ecosystem.

Mahout is used to analyze data, and learning to use it well is worthwhile. It provides scalable implementations of classic machine-learning algorithms, aiming to help developers create intelligent applications more conveniently and quickly. Mahout includes many implementations, covering clustering, classification, recommendation (collaborative filtering), and frequent itemset mining.
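Mahout's item-based recommenders are built on item co-occurrence across user histories. Here is a toy version of that idea in plain Python (the user histories are invented, and this is a simplification of what Mahout actually computes at scale):

```python
from collections import Counter

# which items each user has interacted with (invented data)
histories = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "B"},
    "u3": {"B", "C"},
}

def recommend(user, histories):
    """Score unseen items by how often they co-occur with the user's items."""
    seen = histories[user]
    scores = Counter()
    for other, items in histories.items():
        if other == user:
            continue
        overlap = len(seen & items)   # similarity = number of shared items
        for item in items - seen:     # only recommend items the user hasn't seen
            scores[item] += overlap
    return [item for item, _ in scores.most_common()]

print(recommend("u2", histories))
```

At production scale this co-occurrence counting is itself a MapReduce job, which is why recommendation sits naturally on top of Hadoop.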

A brief walkthrough of the ecosystem described above:

For big data development, the first step is of course the data source: crawler technology such as Nutch (other frameworks, for example Python-based ones, can also be used), collection through front-end interfaces (common in companies), and Flume log collection. The collected data is written into HDFS files. The second step, after collection, is log parsing. Companies generally develop this themselves; one implementation is a MapReduce job that performs a simple cleaning of the raw logs and loads them into Hive. The third step is log analysis: business data can be analyzed with simple SQL such as HiveQL or Spark SQL, and Oozie is used to manage the task flow. The final analysis results are written into a high-performance storage system with fast reads and writes, such as a relational database (MySQL, SQL Server) or HBase. That is the basic offline development workflow. Some multi-dimensional business analysis may also require pre-computation in advance, for example using Kylin cubes.
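The "simple cleaning" in the second step can be sketched as a pure-Python stand-in for the map side of such a MapReduce job. The tab-separated log format here is an invented example; real formats vary by company.

```python
from datetime import datetime

def parse_log_line(line):
    """Turn one raw access-log line into a structured row, or None if malformed."""
    parts = line.strip().split("\t")
    if len(parts) != 3:
        return None  # drop dirty records during cleaning
    ts, user, url = parts
    try:
        dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None  # drop records with unparseable timestamps
    return {"date": dt.date().isoformat(), "user": user, "url": url}

raw_lines = [
    "2020-02-14 10:01:02\tu1\t/home",
    "garbage line",
    "2020-02-14 10:05:00\tu2\t/cart",
]
rows = [r for r in (parse_log_line(l) for l in raw_lines) if r]
print(rows)  # clean rows ready to load into a Hive partition by date
```

In the real pipeline this function would run inside a mapper over HDFS files, and the cleaned rows would land in a date-partitioned Hive table.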

For real-time processing, the data source is generally logs that are collected and written into Kafka, then processed with Spark Streaming or Storm (Flink is also popular now). These frameworks all perform distributed computation in memory, trading space for time to achieve real-time computation.
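Spark Streaming processes data as a sequence of micro-batches while carrying running state across batches. The core of that model can be simulated in plain Python; the batches below stand in for messages consumed from Kafka and are invented example data.

```python
from collections import Counter

def update_state(state, batch):
    # one micro-batch step: count events in the batch, merge into running state
    state = state.copy()
    state.update(Counter(batch))
    return state

batches = [
    ["click", "view", "click"],   # micro-batch 1
    ["view", "click"],            # micro-batch 2
]

state = Counter()
for batch in batches:
    state = update_state(state, batch)
    print(dict(state))  # emitted after each batch, like a streaming output sink
```

The real frameworks distribute each batch across executors and checkpoint the state, but the batch-then-merge loop is the same shape.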

Offline and real-time system architecture

System architecture notes:

This system architecture is based on a real big data platform:

Basic data structure notes:

The following two diagrams show the offline and real-time system development architectures:

Offline system architecture diagram

Real-time system architecture diagram



Origin blog.csdn.net/mnbvxiaoxin/article/details/104319048