Basic principles of the big data technology system and NoSQL databases

1, Why NoSQL emerged

  Relational databases struggle to cope with rapidly growing volumes of data, and their ability to scale out horizontally in a distributed setup is relatively weak. Non-relational databases (known as NoSQL) were built in response. Their aim is a new kind of database system that is simple, distributed, scalable, efficient, and easy to use.

2, NoSQL features

NoSQL systems generally provide distributed data storage, unified management and maintenance of data tables, fast distributed writes, and simple query capabilities.

  • First, popular NoSQL software can meet business needs
  • Second, the well-known NoSQL software is open source

3, Typical NoSQL application scenarios

  • Managing and querying massive volumes of log data, business data, or monitoring data
  • Simplifying special or complex data models
  • Serving as the back end or data support for data warehouse, data mining, or OLAP systems

Data warehouse: a subject-oriented, integrated, time-variant, non-volatile collection of data that supports enterprise management and decision-making.

The data in a data warehouse may be collected from multiple operational data sources and then preprocessed, for example by cleaning, extraction, and transformation, to convert it into a uniform schema. The processed data is organized according to decision-making needs into a subject-oriented, integrated, stable data set whose content reflects the historical changes in the business and its operations.
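
The preprocessing step described above can be sketched roughly as follows. This is only an illustration: the two source systems, their field names, and the target schema (order_id, amount, order_date) are all hypothetical and not taken from any real product.

```python
from datetime import datetime

# Hypothetical raw records from two operational systems with different formats.
orders_system_a = [
    {"id": "A-1", "amount": "19.90", "date": "2020-04-01"},
    {"id": "A-2", "amount": "", "date": "2020-04-02"},  # missing amount
]
orders_system_b = [
    {"order_no": 7, "total_cents": 500, "ts": "01/04/2020"},
]

def from_system_a(rec):
    """Convert a system A record to the uniform schema, or None if unusable."""
    if not rec["amount"]:
        return None  # cleaning: drop records that have no amount
    return {
        "order_id": rec["id"],
        "amount": float(rec["amount"]),
        "order_date": datetime.strptime(rec["date"], "%Y-%m-%d").date(),
    }

def from_system_b(rec):
    """Convert a system B record (cents, day/month/year) to the uniform schema."""
    return {
        "order_id": f"B-{rec['order_no']}",
        "amount": rec["total_cents"] / 100,
        "order_date": datetime.strptime(rec["ts"], "%d/%m/%Y").date(),
    }

def build_warehouse_rows():
    """Extract from both sources, transform, and keep only the clean rows."""
    rows = [from_system_a(r) for r in orders_system_a]
    rows += [from_system_b(r) for r in orders_system_b]
    return [r for r in rows if r is not None]

warehouse_rows = build_warehouse_rows()
```

After this step every row has the same fields and types regardless of which system it came from, which is what makes later subject-oriented analysis possible.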

Data mining: the process of discovering new, useful patterns in large data sets.

OLAP: online analytical processing. OLAP can be seen as an application built on top of a data warehouse system. It lets decision makers and general analysts query and analyze massive amounts of data around specific business themes.

OLTP: online transaction processing, that is, business transaction systems implemented on traditional relational databases.
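
To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module. The sales table and its columns are made up for illustration, and a single in-memory SQLite database stands in for both kinds of system, which would normally be separate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

def record_sale(region, amount):
    """OLTP-style operation: a small write wrapped in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO sales (region, amount) VALUES (?, ?)", (region, amount)
        )

def sales_by_region():
    """OLAP-style operation: scan and aggregate many rows for analysis."""
    cur = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return cur.fetchall()

for region, amount in [("north", 10.0), ("south", 5.0), ("north", 2.5)]:
    record_sale(region, amount)

totals = sales_by_region()
```

The OLTP function touches one row per call and cares about the transaction succeeding or failing as a unit. The OLAP function reads the whole table and cares about the aggregate.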

Database: a collection of data, or the software container or warehouse that stores and organizes data according to a defined structure.

4, Differences between relational databases and NoSQL

  • Relational databases are better at maintaining data integrity and transactional consistency, and at supporting complex operations on data
  • Non-relational databases offer simple management and querying of data in a distributed environment
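
As a rough illustration of the second point, the sketch below spreads key-value pairs across several in-memory "nodes" by hashing the key. The class name ShardedKeyValueStore and its methods are hypothetical; this is not how any particular NoSQL product is implemented.

```python
import hashlib

class ShardedKeyValueStore:
    """Toy key-value store that spreads keys over several in-memory nodes."""

    def __init__(self, node_count):
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, key):
        # Use a stable hash so the same key always maps to the same node.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self.nodes[int.from_bytes(digest[:8], "big") % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)

store = ShardedKeyValueStore(node_count=3)
store.put("user:1", {"name": "Ann"})
store.put("user:2", {"name": "Bob"})
```

Reads and writes by key only ever touch one node, which is why this model scales out easily. The trade-off is that there are no joins and no multi-key transactions here. Plain modulo hashing also moves most keys when the node count changes, so real systems typically use schemes such as consistent hashing or range partitioning instead.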

5, The big data technology system

5.1 Features

Big data has several major characteristics: large volume, variety, high velocity, value, and fully online data

  • Variety: a big data business may need the system to process many types of data at the same time, coming from different businesses, in different formats, and from different domains. The data may also be semi-structured (e.g., logs) or unstructured (e.g., videos and photos)

5.2 Acquisition

Big data acquisition is the process of loading raw data into a distributed big data management system. There are two collection modes:

  • Online collection: monitor the data source directly for changes, and capture newly generated data and load it into the big data system in real time or near real time. Loading works in either push or pull mode, that is, either the data source actively distributes the data, or the collection service actively checks for new data and fetches it
  • Offline collection: the big data system periodically loads data in batches from a data source.
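
The difference between push and pull loading can be sketched as below. DataSource and PullCollector are hypothetical names, and everything runs in one process. In a real deployment the two sides would talk over the network.

```python
class DataSource:
    """Hypothetical data source that can push to subscribers or be polled."""

    def __init__(self):
        self.records = []
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def emit(self, record):
        self.records.append(record)
        for callback in self.subscribers:  # push: the source initiates delivery
            callback(record)

    def read_since(self, offset):
        return self.records[offset:]  # pull: the collector asks for new data


class PullCollector:
    """Remembers how far it has read and fetches only newer records."""

    def __init__(self, source):
        self.source = source
        self.offset = 0
        self.loaded = []

    def poll(self):
        new_records = self.source.read_since(self.offset)
        self.loaded.extend(new_records)
        self.offset += len(new_records)
        return len(new_records)


source = DataSource()
pushed = []
source.subscribe(pushed.append)  # push-mode consumer
collector = PullCollector(source)  # pull-mode consumer

source.emit("event-1")
source.emit("event-2")
first_poll = collector.poll()   # picks up both events
second_poll = collector.poll()  # nothing new since the last poll
```

Offline collection looks like the pull side run on a schedule (for example once a night) rather than continuously.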

5.3 Storage

Big data storage uses a distributed architecture and provides access over the network.

  • DAS: direct-attached storage. The storage device is connected directly to the server by a cable
  • NAS: network-attached storage. The storage device is connected to a network, usually a standard TCP/IP network. Clients access the stored data through a protocol such as Network File System (NFS)
  • SAN: storage area network. A dedicated network of storage devices, often connected over optical fiber
  • Cloud storage: storage offered as a service

Cloud storage advantages:

  • Users no longer need to buy storage equipment and management software. Instead, they rent storage services through network interfaces
  • Users no longer need to operate and maintain storage systems themselves. Instead, they pay the cloud storage provider to handle data backup and system maintenance

Common types of cloud storage

  • Object storage: data is placed in containers, and client applications use an HTTP or RESTful interface to access each piece of data and its metadata
  • File storage: a cloud service built on the NAS model that provides a rentable, maintenance-free network file system
  • Block storage (storage volumes): lets a cloud host mount a virtual drive (for example, a virtual storage volume mapped as drive D on a Windows host), and also supports functions such as storing cloud host images and snapshots
  • Key-value storage: a key-value NoSQL database provided directly on the cloud platform. It needs no installation or maintenance, and users can use it right away
  • Database storage: a relational database provided directly on the cloud platform
  • Snapshot and image storage: stores virtual machine images and instance snapshots on the cloud platform. Usually implemented on top of block storage
  • Message queue storage: asynchronous messaging is an important means of communication in distributed systems. Typically the sender delivers a message to a secure storage container, where it waits for the receiver to pick it up.
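
The message queue pattern in the last item can be sketched with Python's standard queue and threading modules. Here an in-process queue.Queue merely stands in for the cloud message container, and the None sentinel used to stop the receiver is a convention of this example, not of any cloud service.

```python
import queue
import threading

message_queue = queue.Queue()  # stands in for the cloud message container
received = []

def receiver():
    while True:
        message = message_queue.get()  # blocks until a message is available
        if message is None:  # sentinel: no more messages
            break
        received.append(message)

worker = threading.Thread(target=receiver)
worker.start()

# The sender returns right after each put; it does not wait for the receiver.
for text in ["order created", "order paid"]:
    message_queue.put(text)
message_queue.put(None)

worker.join()
```

The sender and receiver never call each other directly. They share only the queue, which is what lets them run at different speeds or at different times.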

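5.4 Managing and using big data

Rationale: gathering all the data in one place is hard to achieve and inefficient.

Big data systems therefore follow a "compute locality" strategy. Compute locality first requires the data to be stored across multiple network nodes, where each node is both a storage node and a processing node.

When data is queried or processed, the query instructions or the programs needed to process the data are distributed to all the nodes. Each node processes or analyzes only part of the data, preferably the data stored on that node. With this parallel approach, in which programs move to the data, the processing task completes in a relatively short time.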
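
A minimal sketch of this idea follows, with threads standing in for cluster nodes. The node names, the log lines, and the count_errors task are invented for the example. A real system would ship the task over the network to the machines that hold the data.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster: each node holds its own partition of the log data.
node_partitions = {
    "node-1": ["INFO start", "ERROR disk full", "INFO done"],
    "node-2": ["ERROR timeout", "ERROR retry failed"],
    "node-3": ["INFO idle"],
}

def count_errors(lines):
    """The 'program' shipped to every node; it only sees that node's lines."""
    return sum(1 for line in lines if line.startswith("ERROR"))

def run_on_all_nodes(task):
    """Run the task against each node's local partition in parallel."""
    with ThreadPoolExecutor(max_workers=len(node_partitions)) as pool:
        futures = {
            name: pool.submit(task, lines)
            for name, lines in node_partitions.items()
        }
        return {name: future.result() for name, future in futures.items()}

partial_counts = run_on_all_nodes(count_errors)
total_errors = sum(partial_counts.values())
```

Only the small per-node counts travel back to the caller. The raw log lines never leave the node that stores them.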

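Some NoSQL systems implement distributed storage themselves, for example MongoDB. HBase is built on the HDFS distributed file system and hands all file operations over to HDFS, handling only database table operations itself.

Big data storage and management gives file-based management of big data, but the data is then hard to use: its structure and relationships cannot be seen directly, and there is no notion of databases and tables.

NoSQL and similar tools provide tabular management of big data and fast query support, along with maintenance functions such as monitoring and scaling the database cluster.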

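The basic role of NoSQL in a big data business is distributed data organization and management, plus distributed data querying. There are two approaches.

  • The first maps semi-structured big data files to tables: the file is split vertically into columns, each column is given a name and attributes, and these names and attributes are managed as metadata, which yields tabular management. Because the files are stored in blocks, distributed queries are also possible once they are mapped to tables
  • The second requires data to be stored in the system's own prescribed format, so the raw data may have to be stored again in the new format, for example through a data import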
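
The first approach can be sketched as follows. The column metadata, the tab-separated log format, and the two file blocks are all hypothetical. The point is only that the table exists as metadata applied at read time, while the file itself is left unchanged.

```python
# Hypothetical metadata: column names and types for a tab-separated log file.
table_metadata = [("ts", str), ("level", str), ("latency_ms", int)]

# Two blocks of the same file, as they might be stored on different nodes.
file_blocks = [
    "2020-04-01T10:00:00\tINFO\t12\n2020-04-01T10:00:01\tERROR\t250\n",
    "2020-04-01T10:00:02\tINFO\t8\n",
]

def read_block_as_rows(block):
    """Apply the column metadata to one block, producing dict rows."""
    rows = []
    for line in block.splitlines():
        fields = line.split("\t")
        rows.append(
            {name: cast(value) for (name, cast), value in zip(table_metadata, fields)}
        )
    return rows

def query(predicate):
    """Run the same filter over every block and merge the results."""
    return [
        row
        for block in file_blocks
        for row in read_block_as_rows(block)
        if predicate(row)
    ]

slow_requests = query(lambda row: row["latency_ms"] > 10)
```

Because each block can be parsed and filtered on its own, the per-block work could run on whichever node stores that block. The second approach would instead rewrite the data into the system's own storage format up front.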

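Operations that can be performed on big data in a distributed environment: preprocessing, statistical analysis, and data mining

  Preprocessing tools: Hadoop's MapReduce module, Spark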
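
  For readers new to the MapReduce model, its three phases can be imitated in plain Python as below. This is a single-process sketch with hypothetical function names, not the Hadoop or Spark API. In a real framework the map and reduce calls run on many nodes and the shuffle moves data between them.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["big data", "Big table", "data data"])
```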

  Big data mining and machine learning engines: Mahout (from the Hadoop ecosystem), Spark's MLlib, Google's TensorFlow

6, Data visualization

By form, visualizations fall roughly into two categories: statistical graphics and thematic maps

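7, Big data security and governance

7.1 Identity management and access control

  • Identity management: managing user identities (credentials) and authenticating users.
  • Access control: restricting and managing users' access rights to resources according to their identity or attributes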
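
The two concepts can be told apart with a small sketch. Everything here is hypothetical: the in-memory user store, the role names, and the permission sets. It illustrates username/password authentication followed by a role check, not Kerberos or any specific product's mechanism.

```python
import hashlib
import hmac
import os

def hash_password(password, salt):
    """Derive a salted hash so the plain password is never stored."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)

# Hypothetical user store: salted password hash plus a role per user.
_salt = os.urandom(16)
users = {
    "alice": {"salt": _salt, "hash": hash_password("s3cret", _salt), "role": "analyst"},
}

# Hypothetical role-to-permission mapping.
role_permissions = {"analyst": {"read"}, "admin": {"read", "write"}}

def authenticate(username, password):
    """Identity check: does the password match the stored credential?"""
    user = users.get(username)
    if user is None:
        return False
    candidate = hash_password(password, user["salt"])
    return hmac.compare_digest(candidate, user["hash"])  # constant-time compare

def is_allowed(username, action):
    """Access control: does the user's role grant this action?"""
    user = users.get(username)
    return user is not None and action in role_permissions.get(user["role"], set())
```

For example, authenticate("alice", "s3cret") succeeds and is_allowed("alice", "read") is true, while is_allowed("alice", "write") is false because the analyst role has no write permission.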

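In big data scenarios, data is stored in a cluster, and nodes are added to the cluster as the data grows. Besides authenticating and authorizing clients that access the cluster, the system must also handle authentication and authorization between the cluster's own nodes, to prevent an attacker from impersonating a service node. NoSQL databases provide username/password-based authentication and authorization, covering the client-to-server case. Big data systems such as Hadoop provide Kerberos-based identity and permission management, which authenticates clients on the one hand and authenticates nodes or components to each other on the other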

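7.2 Big data encryption

This mainly covers transmission encryption and storage encryption

  • Storage encryption: a common strategy is to upload already-encrypted data to the storage platform, then download it and decrypt it locally when it is needed
  • Transmission encryption: this must solve not only the choice of encryption algorithm but also a series of related problems such as key transfer and identity authentication. These are addressed with approaches such as the SSL protocol and transparent encryption of data after it has been split into blocks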
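
On the transmission side, a client set up with Python's standard ssl module might look like the sketch below. SSL has since been superseded by TLS, which is what this module negotiates in practice. The function name make_client_context and the choice of TLS 1.2 as the minimum version are assumptions of this example, not requirements stated above.

```python
import ssl

def make_client_context():
    """Build a client-side TLS context that verifies the server's identity."""
    # Loads the system's trusted CA certificates and enables verification.
    context = ssl.create_default_context()
    # Refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context

context = make_client_context()
```

A client would then wrap its socket with context.wrap_socket(sock, server_hostname=host). The handshake takes care of key exchange and checks the server certificate, which covers the key transfer and identity authentication problems mentioned above.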

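Hadoop currently uses approaches such as the SSL protocol and transparent encryption of data after it has been split into blocks

Privacy protection and quasi-identifier protection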

Origin www.cnblogs.com/wendyw/p/12623978.html