Database and distributed data computing platform

I. Overview

  Common databases and distributed data computing platforms: MySQL, Redis, HybridDB for MySQL, HBase (ten-billion-record scale), MongoDB (billion-record scale), MemcacheDB; plus Spark, Hadoop, Hive, Kafka, Flume, ZooKeeper, and MyBatis.

  1. RDBMS: relational database management system.
  2. HybridDB for MySQL: a relational HTAP (Hybrid Transaction/Analytical Processing) database that handles both online transaction processing (OLTP) and online analytical processing (OLAP).
  3. HBase: a distributed, column-oriented open-source database; a distributed storage system for unstructured data that organizes data by column rather than by row. HBase is well suited to low-latency data access.
  4. HBase and Redis are fairly similar in function: for example, both are NoSQL databases and both support data sharding.
  5. HBase fits simple data writes (e.g., message-type applications) and queries over massive volumes of simply structured data (e.g., detail-record applications); see the HBase sketch after this list.
  6. Examples of HBase in online applications: Facebook's messaging applications, including Messages, Chats, Emails, and SMS, all run on HBase; the web version of Taobao's Aliwangwang uses HBase as its backend; Xiaomi's MiTalk also uses HBase; and a mobile-phone detail-record query system for one province was migrated last year from Oracle to a 32-node HBase cluster.
  7. MongoDB: a database based on distributed file storage, written in C++, designed to provide scalable, high-performance data storage for web applications.
  8. MemcacheDB: a distributed, durable key-value storage system. It supports the memcached protocol but stores data persistently, and it is often combined with MySQL to improve write efficiency.
  9. Data computing frameworks: online (streaming) computing: Streaming; offline computing: MapReduce; in-memory computing: Spark.
  10. Spark: a fast, general-purpose engine for large-scale data processing.
  11. Hadoop: a distributed system infrastructure.
  12. Hive: a Hadoop-based data warehouse tool for data extraction, transformation, and loading (ETL: Extract-Transform-Load). It provides a mechanism to store, query, and analyze large-scale data kept in Hadoop. The underlying engine is MapReduce, but queries are written in SQL-like syntax.
  13. Kafka: a distributed messaging system that supports partitions and multiple replicas and relies on ZooKeeper for coordination; it can handle large volumes of data in real time to meet the needs of many different scenarios (see the Kafka producer sketch after this list).
  14. Flume: a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and transporting massive amounts of log data. To guarantee delivery, Flume buffers data in a channel before it reaches the destination (sink); only after the data has truly arrived at the sink does Flume delete its buffered copy.
  15. ZooKeeper: a distributed service framework mainly used to solve data management problems commonly encountered in distributed applications, such as unified naming, state synchronization, cluster management, and management of distributed application configuration items.
  16. MyBatis: a persistence framework that supports custom SQL, stored procedures, and advanced mappings. Its predecessor, the iBATIS persistence framework, includes SQL Maps and Data Access Objects (DAOs); see the MyBatis mapper sketch after this list.
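
A minimal write/read sketch of the HBase usage described above, using the HBase Java client API; the table name "messages", column family "cf", and row key are hypothetical, and the table is assumed to already exist:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseMessageExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("messages"))) {

            // Write: one row keyed by user + timestamp, one cell in family "cf".
            Put put = new Put(Bytes.toBytes("user42#20191108"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("body"), Bytes.toBytes("hello"));
            table.put(put);

            // Read the row back by key -- the low-latency access pattern HBase is good at.
            Result result = table.get(new Get(Bytes.toBytes("user42#20191108")));
            byte[] body = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("body"));
            System.out.println(Bytes.toString(body));
        }
    }
}
```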
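
Similarly, a minimal Kafka producer sketch; the broker address and topic name are assumptions:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker address.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // acks=all waits for all in-sync replicas, matching Kafka's multi-replica design.
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The record key determines which partition the message lands in.
            producer.send(new ProducerRecord<>("access-logs", "user42", "GET /index.html 200"));
        }
    }
}
```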
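
And a minimal MyBatis mapper sketch using annotation-based mapping; the users table and its columns are hypothetical:

```java
import org.apache.ibatis.annotations.Insert;
import org.apache.ibatis.annotations.Param;
import org.apache.ibatis.annotations.Select;

// MyBatis generates an implementation of this interface at runtime
// and binds each method to the SQL statement in its annotation.
public interface UserMapper {

    @Select("SELECT name FROM users WHERE id = #{id}")
    String findNameById(@Param("id") long id);

    @Insert("INSERT INTO users (id, name) VALUES (#{id}, #{name})")
    int insert(@Param("id") long id, @Param("name") String name);
}
```

Application code obtains a UserMapper from a SqlSession (built from a SqlSessionFactory) and calls these methods directly, so no SQL is concatenated by hand.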

II. Distributed data computing platform

1. Hadoop/MapReduce and Spark are both best suited to offline data analysis, but Hadoop is particularly suitable when the volume of data in a single analysis is "big", whereas Spark fits scenarios where the data volume is not that large. "Large" here is relative to the total memory capacity of the cluster, because Spark needs to hold the data in memory (see the sketch below).
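
As a minimal sketch of what "holding data in memory" looks like in practice (the input path and the status column are hypothetical), a Spark job might cache a dataset so that several actions reuse it without re-reading from HDFS:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CacheExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("cache-example")
                .getOrCreate();

        // Hypothetical input path; any columnar dataset would do.
        Dataset<Row> logs = spark.read().json("hdfs:///data/access_logs");

        // cache() asks Spark to keep the dataset in cluster memory,
        // so the two actions below do not re-read it from HDFS.
        logs.cache();

        long total = logs.count();
        long errors = logs.filter("status >= 500").count();

        System.out.println("total=" + total + ", errors=" + errors);
        spark.stop();
    }
}
```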

2. Data collected into HDFS by Flume is cleaned by MapReduce (selecting the relevant information fields, or, according to business requirements, parsing the information contained in the source-data fields and adding new fields); the cleaned data is saved back to HDFS, and statistical analysis is then run over this regularized data in HDFS according to business requirements.

3. Writing a MapReduce program breaks down into three basic parts: the Mapper (which pulls in the data), the Reducer, and the Job, as sketched below.
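
A minimal sketch of those three parts, counting records per user in hypothetical tab-separated log lines whose first field is a user id (class names and the input layout are illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UserCount {

    // Mapper: pulls in each raw log line and emits (userId, 1).
    public static class UserMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text userId = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                userId.set(fields[0]);
                context.write(userId, ONE);
            }
        }
    }

    // Reducer: sums the counts emitted for each user id.
    public static class UserReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    // Job: wires the Mapper and Reducer together and submits the job to the cluster.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "user-count");
        job.setJarByClass(UserCount.class);
        job.setMapperClass(UserMapper.class);
        job.setReducerClass(UserReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```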

4. The Hive data warehouse tool maps structured data files to database tables, provides SQL query capability, and translates SQL statements into MapReduce jobs for execution. Because the underlying engine is MapReduce, it runs slowly compared with Shark (which improves Hive's memory management and execution, among other parts) and Spark.
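
To illustrate the "SQL in, MapReduce underneath" idea, here is a small sketch that submits HiveQL through the HiveServer2 JDBC interface; the host, credentials, table, and column names are assumptions (an unsecured HiveServer2 is assumed):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 JDBC endpoint.
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // An aggregate query like this is compiled by Hive into MapReduce jobs.
            ResultSet rs = stmt.executeQuery(
                    "SELECT user_id, COUNT(*) AS visits " +
                    "FROM access_logs GROUP BY user_id");

            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```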

5. For both offline and online data processing, the basic data source is log data. For a web application, for example, this might be user access logs, user click logs, and so on.

6. Architecture of offline and online data processing:

 

Figure 1. Data processing architecture diagram

 

7. Example of a data processing software architecture:



Origin www.cnblogs.com/yinminbo/p/11824068.html