Overall architecture design of big data platform based on Hadoop

1. Software Architecture Design

 

Software Architecture Diagram

 

The architecture of the big data platform follows a layered design: the services the platform needs are divided into module layers by function, and each layer interacts only with the layer directly above or below it (through interfaces at the layer boundary), avoiding cross-layer interaction. The benefit of this design is that each functional module is highly cohesive internally while the modules themselves are loosely coupled, which helps the platform achieve high reliability, high scalability, and easy maintenance. For example, when we need to expand the Hadoop cluster, we only have to add new Hadoop node servers at the infrastructure layer; no other layer changes, and the expansion is completely transparent to users.

The entire Lakala big data platform is divided into five module layers by function, from bottom to top:

Runtime environment layer:

The runtime environment layer provides the runtime environment for the infrastructure layer. It consists of two parts: the operating system and the supporting runtimes.

(1) Operating system: we recommend installing RHEL 5.0 or later (64-bit). In addition, to improve disk I/O throughput, avoid RAID; instead, distribute the data directories of the distributed file system across different disk partitions to improve disk I/O performance.
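Spreading the data directories across disks is done in the HDFS configuration. A sketch, assuming a Hadoop 1.x-era deployment (the /data1–/data3 mount points are example paths):

```xml
<!-- hdfs-site.xml: spread DataNode storage across independent disks
     so sequential I/O on one disk does not throttle the others
     (the /data1../data3 mount points are examples) -->
<property>
  <name>dfs.data.dir</name>
  <value>/data1/hdfs/data,/data2/hdfs/data,/data3/hdfs/data</value>
</property>
```

HDFS round-robins block writes across the listed directories, so each physical disk contributes its full bandwidth.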

(2) The specific requirements for the runtimes are as follows:

 

Name      Version        Notes
----      -------        -----
JDK       1.6 or above   Hadoop requires a Java runtime environment; a JDK must be installed.
gcc/g++   3.x or above   Needed when using Hadoop Pipes to run MapReduce tasks; optional.
python    2.x or above   Needed when using Hadoop Streaming to run MapReduce tasks; optional.
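As an illustration of why a Python runtime is needed, a minimal Hadoop Streaming word-count mapper might look like this (a sketch; in production it would be launched through the hadoop-streaming jar together with a matching reducer):

```python
import sys

def map_wordcount(lines):
    """Emit tab-separated (word, 1) pairs -- the key\tvalue text format
    Hadoop Streaming expects a mapper to write to stdout."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

if __name__ == "__main__":
    # Streaming feeds input splits to the mapper on stdin, line by line.
    for pair in map_wordcount(sys.stdin):
        print(pair)
```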

 

 

Infrastructure layer:

The infrastructure layer consists of two parts: a ZooKeeper cluster and a Hadoop cluster. It provides infrastructure services to the basic platform layer, such as naming services, the distributed file system, and MapReduce.

(1) The ZooKeeper cluster provides name mapping. Acting as the name server of the Hadoop cluster, it lets the task scheduling console in the basic platform layer locate the NameNode of the Hadoop cluster, and it also supports failover.
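A real implementation would use a ZooKeeper client library; purely to illustrate the naming-with-failover idea, a toy in-memory stand-in could look like this (the class and node names are hypothetical, not part of any ZooKeeper API):

```python
class NameService:
    """Toy stand-in for the ZooKeeper-based naming service described above:
    a logical name maps to an ordered list of candidate addresses, and
    lookups skip nodes that are known to be down."""

    def __init__(self):
        self._nodes = {}     # logical name -> list of candidate addresses
        self._alive = set()  # addresses currently considered live

    def register(self, name, address):
        self._nodes.setdefault(name, []).append(address)
        self._alive.add(address)

    def mark_down(self, address):
        # In ZooKeeper this would happen via session/ephemeral-node expiry.
        self._alive.discard(address)

    def resolve(self, name):
        # Return the first live address; failover is transparent to the
        # caller when the primary has been marked down.
        for addr in self._nodes.get(name, []):
            if addr in self._alive:
                return addr
        raise LookupError("no live node registered under %r" % name)
```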

(2) The Hadoop cluster is the core of the big data platform and the infrastructure of the basic platform layer. It provides HDFS, MapReduce, JobTracker, and TaskTracker services. We currently adopt a dual-master model to avoid a single point of failure in the Hadoop cluster.

Basic platform layer:

The basic platform layer consists of three parts: the task scheduling console, HBase, and Hive. It provides the basic service invocation interfaces for the user gateway layer.

(1) The task scheduling console is the scheduling center for MapReduce jobs; it decides the order and priority in which jobs execute. Users submit jobs through the scheduling console and retrieve the execution results through the Hadoop client in the user gateway layer. The execution steps are as follows:

  • After the task scheduling console receives a job submitted by a user, it applies its scheduling algorithm to the job;
  • Requests ZooKeeper for the address of an available JobTracker node in the Hadoop cluster;
  • Submits the MapReduce job;
  • Polls until the job completes;
  • When the job completes, sends a notification and invokes the callback function;
  • Proceeds to the next job.
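The steps above can be sketched as a simple priority-queue loop. This is a mock, not the real console: the `lookup_jobtracker` callable stands in for the ZooKeeper query, and the tracker/handle objects stand in for the Hadoop job client.

```python
import heapq
import itertools
import time

class SchedulerConsole:
    """Minimal sketch of the scheduling loop described above."""

    def __init__(self, lookup_jobtracker, poll_interval=0.01):
        self._queue = []                  # min-heap of (priority, seq, job)
        self._seq = itertools.count()     # tie-breaker: FIFO within a priority
        self._lookup = lookup_jobtracker
        self._poll_interval = poll_interval

    def submit(self, job, priority=0):
        # Step 1: accept the job and order it by the scheduling policy
        # (lower number = higher priority in this sketch).
        heapq.heappush(self._queue, (priority, next(self._seq), job))

    def run(self, on_done):
        while self._queue:
            _, _, job = heapq.heappop(self._queue)
            tracker = self._lookup()          # step 2: resolve a JobTracker
            handle = tracker.submit(job)      # step 3: submit the MapReduce job
            while not handle.is_done():       # step 4: poll for completion
                time.sleep(self._poll_interval)
            on_done(job, handle.result())     # step 5: notify via callback
            # step 6: the loop proceeds to the next job
```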

 

For a complete Hadoop cluster deployment, the task scheduling console should be developed in-house rather than taken off the shelf, which gives us greater flexibility and control.

(2) HBase is a column-oriented database built on Hadoop that provides users with table-based data access services.

(3) Hive is a query service on top of Hadoop. Users submit SQL-like (HQL) queries through the Hive client in the user gateway layer and view the returned results in the client UI. This interface gives the data department a near-real-time query and statistics service.

User gateway layer:

The user gateway layer provides end users with tailored invocation interfaces and user authentication; it is the only entry point to the big data platform visible to users. End users can interact with the platform only through the interfaces this layer exposes. The gateway layer currently provides three such interfaces:

(1) The Hadoop client is the entry point for users to submit MapReduce jobs; the processing results can be viewed in its UI.

(2) The Hive client is the entry point for users to submit HQL queries; the query results can be viewed in its UI.

(3) Sqoop is the interface for exchanging data between relational databases and HBase or Hive. It can import data from a relational database into HBase or Hive as required, so that users can query it with HQL. Conversely, HBase, Hive, or HDFS can export data back into a relational database for further analysis by other systems.
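As an illustration of this Sqoop interface, an import/export pair might look like the following (the JDBC URL, credentials, and table names are placeholders):

```shell
# Import a relational table into Hive so it can be queried with HQL
sqoop import \
  --connect jdbc:mysql://db.example.com/trade \
  --username etl --table transactions \
  --hive-import --hive-table transactions

# Export computed results from HDFS back into the relational database
sqoop export \
  --connect jdbc:mysql://db.example.com/trade \
  --username etl --table settlement_result \
  --export-dir /user/hive/warehouse/settlement_result
```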

The user gateway layer can be extended as needed to meet the requirements of different users.

Client application layer:

The client application layer consists of the various end applications, which may include relational databases, reports, transaction behavior analysis, account statements, clearing and settlement, and so on.

Applications I can currently see landing on the big data platform include:

1. Behavior analysis: import transaction data from the relational database into the Hadoop cluster, write MapReduce jobs based on data-mining algorithms, submit them to the JobTracker for distributed computation, and store the results in Hive. End users then submit HQL queries through the Hive client to retrieve the statistical results.

2. Account statements: import transaction data from the relational database into the Hadoop cluster, write MapReduce jobs based on business rules, and submit them to the JobTracker for distributed computation. End users retrieve the statement result files through the Hadoop client (HDFS is a distributed file system and supports ordinary file access).

3. Clearing and settlement: import the UnionPay file into HDFS, run a MapReduce job that reconciles it against the POSP transaction data previously imported from the relational database (the reconciliation step), feed the result into a second MapReduce job that computes fees and profit sharing (the settlement step), and finally import the results back into the relational database, where the user triggers transfers to merchants (the transfer step).
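Stripped of the MapReduce machinery, the reconciliation step amounts to matching the two record sets on a transaction id. A toy in-memory sketch (the `txn_id` field name is hypothetical; the real comparison would run as a MapReduce job keyed on that id):

```python
def reconcile(unionpay_records, posp_records):
    """Split the two sides into: ids present on both sides (matched),
    ids only in the UnionPay file, and ids only in the POSP data."""
    up = {r["txn_id"] for r in unionpay_records}
    pp = {r["txn_id"] for r in posp_records}
    return sorted(up & pp), sorted(up - pp), sorted(pp - up)
```

The two unmatched lists are exactly the discrepancies a reconciliation report has to surface before settlement can proceed.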

Deployment Architecture Design

Deployment Architecture Diagram

 

Description of key points:

1. At present, the entire Hadoop cluster is hosted in the bank's machine room.

2. The Hadoop cluster has 2 Master nodes and 5 Slave nodes. The two Master nodes back each other up, with failover handled through ZooKeeper. Both Master nodes manage all the Slave nodes, so replicas of the distributed file system's data are spread across all DataNodes. All hosts in the Hadoop cluster must be on the same network segment and in the same rack to guarantee the cluster's I/O performance.

3. The ZooKeeper cluster is configured with at least 2 hosts to avoid a single point of failure in the naming service (in practice an odd-sized ensemble of 3 or more is preferred, so that a majority quorum survives the loss of a node). With ZooKeeper we no longer need F5 for load balancing: the task scheduling console load-balances access to the Hadoop name nodes directly through ZooKeeper.
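A minimal zoo.cfg for such an ensemble might look like this (hostnames and the data directory are placeholders; the settings themselves are standard ZooKeeper configuration keys):

```
# zoo.cfg -- three-node quorum; a majority (2 of 3) must stay up
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```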

4. All servers must be configured for passwordless (key-based) SSH access.

5. External and internal users must go through the gateway to access the Hadoop cluster, and the gateway provides service only after authentication, which ensures access security for the Hadoop cluster.
