Good Programmer Big Data Sharing: How to Start Learning Hadoop from Scratch

  How should you start learning Hadoop from scratch? Many students approach big data by learning Hadoop, and their study materials are mostly books. "Hadoop: The Definitive Guide" is indeed a good introductory big data book, but a big data system is itself a distributed system, so I think the concepts of distributed systems are the foundation for mastering every kind of big data framework.

  

  1  Getting Started:

  The Hadoop framework integrates storage (HDFS), computing (the MapReduce, or MR, computation model), and resource management (YARN) into one framework. It is, of course, a product of its historical stage. Setting that aside, let's look at how the well-known wordcount example (MR) actually performs its computation, and what scenario lies behind it.

  

  1-1  Distributed Systems

  First, a wordcount program can also be handled in traditional stand-alone mode, where everyone will think of multithreading, file splitting, and similar implementations. Simply put, parallel computing is not a new concept. As hardware has kept improving, multi-core computing has also been developing for many years. At the same time, the data generated worldwide is growing rapidly, and stand-alone multi-core, multi-task, multi-threaded parallel computing has run into a serious mismatch between processing speed and data volume. The question of how to increase computing power became unavoidable, so the cluster approach emerged: it scales computing resources horizontally while keeping parallelism. That is the core idea. We can treat the cluster as a black box, analogous to a traditional single machine. Parallel computing across cluster nodes involves master-slave architecture, cluster management, message communication, fault tolerance, and so on. These are exactly the problems a distributed system has to consider and solve, because the cluster is itself a distributed system.
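  The stand-alone approach mentioned above (split the input, count each piece in its own thread, merge the partial results) can be sketched as follows. This is only a minimal illustration: the function names and the sample input are made up for this example, and in CPython the threads show the structure of the computation rather than a real CPU speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_lines(lines, parts):
    """Cut the input into roughly equal chunks of lines (the file splitting step)."""
    size = max(1, -(-len(lines) // parts))  # ceiling division, at least 1
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def count_chunk(chunk):
    """Count the words in one chunk, independently of the other chunks."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts


def parallel_wordcount(lines, workers=4):
    """Count words chunk by chunk in a thread pool, then merge the partial counts."""
    chunks = split_lines(lines, workers)
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_chunk, chunks):
            total.update(partial)
    return total


# Hypothetical sample input, one string per line of a text file.
sample = ["hello hadoop", "hello hdfs", "hadoop yarn", "hello world"]
result = parallel_wordcount(sample, workers=2)
print(result.most_common())
```

  The split, count, and merge steps map onto what a cluster does across nodes. What the single machine does not have to deal with is the network, node failures, and coordination between machines.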

  

  1-2  Distributed Storage

  We just briefly covered distributed systems. When it comes to computing, there is an implicit requirement: a computation needs data, which necessarily involves storage. Storage is fundamental, so we must understand how to use the distributed storage system (HDFS): part of how it works (what a block is, what a file system is, what a distributed file system is) and how to use it (reading and writing HDFS). However, most students are more familiar with relational databases and SQL, which are application-level things. They do not know the underlying details and have not taken part in developing database software, so they have little experience with file-oriented work. That includes file IO operations, serialization, compression, and reading and writing built-in or custom file formats. This kind of reading and writing feels unfamiliar, because HDFS is by nature a file system.
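  To make the idea of a block more concrete, here is a small sketch of how a file might be cut into fixed-size blocks and how the copies of each block might be spread over nodes. The 128 MB block size and replication factor of 3 are assumed values (common HDFS defaults), the node names are hypothetical, and the round-robin placement is a simplification for illustration, not the rack-aware policy real HDFS uses.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes
REPLICATION = 3                 # assumed replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical placement for illustration only; in real HDFS the
    NameNode decides placement with a rack-aware policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


file_size = 300 * 1024 * 1024  # a 300 MB file
blocks = split_into_blocks(file_size)
nodes = ["node1", "node2", "node3", "node4"]  # hypothetical DataNode names
placement = place_replicas(len(blocks), nodes)
for block_id, replicas in placement.items():
    offset, length = blocks[block_id]
    print(block_id, offset, length, replicas)
```

  The last block is simply shorter than the others, and losing any single node still leaves two copies of every block, which is the fault tolerance point made in 1-1.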

  

  1-3  Distributed Computing

  Most of us had little contact with the MR computation model before and have no concrete feel for what MR can do or which scenarios it is used in, because what we worked with before were OLTP operations.

  OLTP (Online Transaction Processing) describes highly transactional systems. These are generally highly available online systems dominated by small transactions and small queries. They are mainly applications of traditional relational databases, handling basic, everyday transactions and mostly business data, such as bank transactions.

  Big data was originally used for data mining, so it is more about OLAP operations.

  OLAP (Online Analytical Processing) is sometimes also called DSS, a decision support system, which is what we usually call a data warehouse. Its focus is analysis: it generates a large number of queries and seldom involves inserts, updates, or deletes.

  The map and reduce operations of the MR computation model are needs we meet all the time. The map operation is responsible for data cleansing and transformation, and the reduce operation is responsible for data aggregation. Don't the select clause and the group by clause in SQL correspond to the same practical needs? They are just met in a different way.
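  The correspondence between map/reduce and select/group by can be sketched with a tiny word count done both ways. This is a single-process illustration, not Hadoop code: the `map_phase`, `shuffle`, and `reduce_phase` names and the sample lines are made up for this example, and an in-memory SQLite table stands in for a real database.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample input with inconsistent case and spacing.
lines = ["Hello Hadoop", "hello  hdfs", "HADOOP yarn"]


def map_phase(line):
    """Map: clean and transform one record into (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key between the map and reduce steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


pairs = [pair for line in lines for pair in map_phase(line)]
mr_counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# The same need in SQL: the SELECT list does the per-row transformation,
# GROUP BY with COUNT does the aggregation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [(w,) for line in lines for w in line.split()],
)
sql_counts = dict(
    conn.execute("SELECT lower(word), COUNT(*) FROM words GROUP BY lower(word)")
)
conn.close()

print(mr_counts)
print(sql_counts)
```

  Both paths produce the same counts. The difference is in who does the work: with SQL the database plans the grouping, while with MR the grouping by key happens across machines between the map and reduce steps.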

 

 

  2  Advanced:

 

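  2-1  I recommend looking at the various big data frameworks from the perspective of distributed systems, and learning some distributed systems theory, such as the CAP theorem, master-slave architecture, and so on.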

 

  2-2  Of course, these frameworks do not all address the same kind of problem, so we first classify them. For reference:


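Technical architecture

```
1  Data collection: flume, logstash
2  Data storage: hdfs, hbase, alluxio, es, neo4j, janusGraph, redis, mongodb, tidb
3  Data computing: hive, impala, spark, flink, druid
4  Data channels: kafka, pulsar
5  Task scheduling: azkaban, airflow
6  Multidimensional data model
7  Data synchronization: sqoop, datax, canal
8  Data formats: parquet, orc, csv, json
9  Coordination service: zookeeper
10  Monitoring: zabbix, prometheus
```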

 

  3  Recommendations:

 

  3-1  For every big data framework, the official website is always the first-hand resource. Be sure to read it.

 

  3-2  The many WeChat official accounts, Stack Overflow, and GitHub

 

  3-3  Searching for resources on Google

 



Origin blog.51cto.com/14479068/2437463