Table of contents
Chapter 1 Big Data Overview
1. The era of big data
1.1 Three waves of informatization
1.2 Development of information technology
1.3 Changes in how data is generated
1.4 Impact of big data
2. The concept of big data
2.1 Characteristics of big data
2.2 Key technologies of big data
2.3 Big data computing models
2.4 The big data industry
3. Big Data, Cloud Computing, and the Internet of Things
3.1 Cloud computing
3.2 Internet of Things
3.3 The relationship between big data, cloud computing, and the Internet of Things
Chapter 2 Big Data Processing Architecture: Hadoop
Principles and Applications of Big Data Technology — summary notes on the knowledge points of Chapter 1 (Big Data Overview) and Chapter 2 (Big Data Processing Architecture: Hadoop).
Chapter 1 Big Data Overview
1. The era of big data
1.1 Three waves of informatization
| Information wave | Time | Hallmark | Problem solved |
| --- | --- | --- | --- |
| First wave | Around 1980 | Popularization of the personal computer | Information processing |
| Second wave | Around 1995 | The Internet age | Information transfer |
| Third wave | Around 2010 | Era of big data (big data, cloud computing, and IoT) | Information explosion |
1.2 Information Technology Development
Three core issues: information storage, information processing, and information transmission
| Key problem | Features |
| --- | --- |
| Information storage | Storage device capacity keeps increasing; growing data volumes demand ever larger storage capacity, and larger capacity in turn accelerates the growth of data volumes |
| Information transfer | Network bandwidth keeps increasing |
| Information processing | CPU performance has improved dramatically; Moore's Law: roughly every 18 months, performance doubles and price halves |
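The compounding implied by Moore's Law can be checked with a few lines of arithmetic. This is only an illustrative sketch; the 18-month doubling period is taken directly from the note above, and the function name is invented for the example:

```python
# Moore's Law as stated above: performance roughly doubles
# (and price halves) every 18 months. Project the growth
# factor over a span of years.

def moores_law_factor(years, doubling_period_months=18):
    """Return the performance multiple after `years`, assuming one
    doubling every `doubling_period_months` months."""
    doublings = years * 12 / doubling_period_months
    return 2 ** doublings

# Over a decade that is 10*12/18 ≈ 6.67 doublings, i.e. ~100x.
print(round(moores_law_factor(10)))  # -> 102
```

This exponential growth in processing power is what made it feasible to analyze the exploding data volumes described above.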
1.3 Changes in data generation methods
Data generation has gone through three stages: operational system stage, user-generated content stage, and perceptual system stage
| Stage of data generation | Example | Features |
| --- | --- | --- |
| Operational-system stage | Hospital information systems, bank transaction systems... | Data is generated passively: each transaction produces a record that is written to the database |
| User-generated-content stage | The "Web 2.0 era", dominated by self-service publishing such as Weibo (microblogging) | Disseminating data no longer requires physical media such as disks; the emphasis is on self-service |
| Perceptual-system stage | Sensors in the Internet of Things, e.g., temperature sensors | IoT devices automatically generate dense, massive data in a short period of time |
1.4 Impact of big data
Scientific research: experimental science -> theoretical science -> computational science [propose a candidate theory first, then verify it against data] -> data-intensive science [derive previously unknown theories directly from large amounts of data]
Three major shifts in thinking: the full dataset instead of a sample, efficiency instead of precision, and correlation instead of causation
Social development: big data driven decision-making, integration of big data with various industries, and big data promoting new technologies and applications
2. The concept of big data
2.1 Characteristics of big data
The four "V"s of big data: large data volume (Volume), varied data types (Variety), fast processing speed (Velocity), and low value density (Value)
2.2 Key Technologies of Big Data
Big data technology: the family of techniques that accompany the collection, storage, analysis, and presentation of big data — data processing and analysis techniques that use non-traditional tools to handle large amounts of structured, semi-structured, and unstructured data in order to produce analysis and prediction results
Two core technologies: distributed storage and distributed processing
Main learning content: distributed file system HDFS, distributed database BigTable, distributed parallel processing technology MapReduce
| Big data technology | Function |
| --- | --- |
| Data acquisition and preprocessing | Use data warehouse ETL tools to extract data scattered across heterogeneous sources (relational databases, flat data files...) into a temporary staging layer for cleaning, transformation, and integration, then load it into a data warehouse or data mart, where it becomes the basis for online analytical processing and data mining; alternatively, use log collection tools to feed data collected in real time into the system as streams for real-time processing and analysis |
| Data storage and management | Use distributed file systems, data warehouses, relational databases, NoSQL databases, etc. to store and manage massive structured, semi-structured, and unstructured data |
| Data processing and analysis | Use distributed parallel programming models and computing frameworks, combined with machine learning and data mining algorithms, to process and analyze massive data; visualize the results so that people can understand and analyze the data more easily |
| Data security and privacy protection | While mining commercial and academic value from big data, build data security and privacy protection systems to effectively protect data security and personal privacy |
NB: database != data warehouse
Database: a database is a transaction-oriented processing system (a business system). It performs day-to-day operations on the data for specific business tasks, typically querying and modifying records. Users care mainly about response time, data security, integrity, and the number of concurrent users supported. As the principal means of data management, the traditional database system is used mainly for operational processing, also known as OLTP (On-Line Transaction Processing).
Data warehouse: a data warehouse typically analyzes historical data on particular subjects to support management decisions; this is also known as OLAP (On-Line Analytical Processing).
ETL: extract the data from each independent system and, after a certain amount of transformation and filtering, store it in one centralized place, which becomes the data warehouse. This extract-transform-load process is called ETL (Extract, Transform, Load); its purpose is to integrate the scattered, messy, and non-uniform data within an enterprise.
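The extract-transform-load flow described above can be sketched in plain Python. Everything here is illustrative: the source records, field names, and cleaning rules are invented for the example, and real ETL tools operate on far larger, heterogeneous systems.

```python
# A toy ETL pass: extract rows from two heterogeneous "sources",
# clean and unify them, then load them into one warehouse table.
# All records and field names are hypothetical.

source_a = [{"id": 1, "amount": "100.5", "city": "beijing"},
            {"id": 2, "amount": "",      "city": "Shanghai"}]
source_b = [(3, 77.0, "GUANGZHOU")]          # flat-file style rows

def extract():
    # Extract: pull records out of each heterogeneous source.
    yield from source_a
    for id_, amount, city in source_b:
        yield {"id": id_, "amount": amount, "city": city}

def transform(rows):
    # Transform: clean (drop rows missing an amount) and
    # standardize (consistent types and capitalization).
    for row in rows:
        if row["amount"] in ("", None):
            continue
        yield {"id": row["id"],
               "amount": float(row["amount"]),
               "city": str(row["city"]).title()}

warehouse = []

def load(rows):
    # Load: append the integrated records to the "warehouse".
    warehouse.extend(rows)

load(transform(extract()))
# warehouse now holds unified rows for ids 1 and 3; id 2 was filtered out.
```

The integrated, uniform table is what OLAP queries and data mining then run against.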
2.3 Big data computing models
MapReduce, the best-known big data processing technique, represents batch processing of large-scale data. Beyond batch computing, there are several other big data computing models: stream computing, graph computing, and query/analysis computing.
| Big data computing model | Problem solved | Representative products |
| --- | --- | --- |
| Batch computing | Batch processing of large-scale data | MapReduce, Spark, etc. |
| Stream computing | Real-time computation over streaming data | Flink, DStream, etc. |
| Graph computing | Processing of large-scale graph-structured data | GraphX, Pregel, etc. |
| Query/analysis computing | Storage management and query analysis of large-scale data | Dremel, Hive, etc. |
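The difference between the first two models in the table can be shown with a toy computation. This is a sketch with made-up readings, not any framework's API: batch computing sees the complete dataset at once, while stream computing keeps only running state and updates its answer as each record arrives.

```python
# Contrast of two computing models: batch vs. stream.
# The sensor readings below are invented for illustration.

readings = [3.0, 5.0, 4.0, 8.0]

# Batch: one pass over the complete dataset after it is collected.
batch_mean = sum(readings) / len(readings)

# Stream: keep only running state (count, total) and emit an
# up-to-date mean after every arriving record.
def stream_means(stream):
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count

live = list(stream_means(readings))
# live[-1] equals batch_mean, but a result was available in real
# time after every record, not only at the end of the job.
```

Frameworks like MapReduce industrialize the batch pattern across a cluster, while Flink and DStream industrialize the incremental, per-record pattern.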
2.4 Big Data Industry
The big data industry comprises the IT infrastructure layer, data source layer, data management layer, data analysis layer, data platform layer, and data application layer; at each layer, a number of market-leading technologies and enterprises have emerged.
3. Big Data, Cloud Computing, and Internet of Things
3.1 Cloud Computing
Cloud computing: cloud computing delivers scalable, inexpensive distributed computing capability over the network; users can obtain whatever IT resources they need anytime and anywhere, as long as they have network access.
Key technologies of cloud computing: virtualization, distributed storage, distributed computing, multi-tenancy, etc.
| Three service modes | Explanation | Three deployment types | Explanation |
| --- | --- | --- | --- |
| Infrastructure as a Service (IaaS) | Rents out infrastructure such as computing resources and storage space | Public cloud | Provides services to all registered users |
| Platform as a Service (PaaS) | Rents out platforms | Private cloud | Provides services only to specific users, e.g., a private cloud built by an enterprise serves only that enterprise |
| Software as a Service (SaaS) | Rents out software | Hybrid cloud | Combines public and private clouds: data is kept in the private cloud for security, while public-cloud computing resources are used for efficiency |
3.2 Internet of Things
IoT: the Internet of Things is the internet of interconnected things, an extension of the Internet. Using communication technologies such as local networks and the Internet, it links sensors, controllers, machines, people, and objects together in new ways, connecting people with things and things with things, to realize informatization and remote management and control.
Key IoT technologies: identification and perception technologies (QR codes, RFID, sensors, etc.), network and communication technologies, data mining and fusion technologies, etc.
3.3 The relationship between big data, cloud computing, and the Internet of Things
Cloud computing, big data, and the Internet of Things represent the latest technological trends in the IT field; the three are distinct yet interrelated.
Chapter 2 Big Data Processing Architecture: Hadoop
1. Overview
1.1 Introduction to Hadoop
Hadoop is an open-source distributed computing platform under the Apache Software Foundation, providing users with a distributed infrastructure that keeps low-level system details transparent.
Hadoop is developed in Java, is highly cross-platform, and can be deployed on clusters of inexpensive machines.
The core of Hadoop is the Hadoop Distributed File System (HDFS) and MapReduce.
HDFS is an open-source implementation of the Google File System (GFS), a distributed file system designed for commodity hardware.
MapReduce is an open-source implementation of Google's MapReduce; it lets users develop parallel applications without understanding the low-level details of the distributed system.
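The MapReduce programming model can be approximated locally in a few lines of Python. This is only a sketch of the three logical phases (map, shuffle, reduce) using the canonical word-count example; it is not the Hadoop API, which runs the same user-supplied functions across a cluster:

```python
from collections import defaultdict

# Word count, the canonical MapReduce example, simulated locally.
# Real Hadoop distributes these phases over many nodes; this sketch
# only mirrors the logical structure of the model.

def map_phase(document):
    # Map: emit a (key, value) pair for every word.
    for word in document.split():
        yield word, 1

def shuffle(pairs):
    # Shuffle: group intermediate values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: aggregate the values for one key.
    return key, sum(values)

docs = ["big data", "big data processing"]
pairs = [p for d in docs for p in map_phase(d)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'processing': 1}
```

The appeal of the model is exactly what the text notes: the user writes only `map` and `reduce`; partitioning, shuffling, and fault recovery are handled by the framework.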
1.2 Hadoop features
Hadoop is a software framework capable of distributed processing of large amounts of data, and it does so in a reliable, efficient, and scalable way.
Features:
High reliability: uses redundant data storage, with multiple replicas checking one another.
Efficiency: as a parallel distributed computing platform, Hadoop rests on two core technologies, distributed processing and distributed storage, and can efficiently process petabyte-scale data.
High scalability: Hadoop runs stably and efficiently on clusters of inexpensive machines and can scale out to thousands of compute nodes.
High fault tolerance: uses redundant storage, automatically keeping multiple replicas of the data, and automatically reassigns failed tasks.
Low cost: relies on clusters of inexpensive machines; a Hadoop environment is easy to set up even on a personal computer.
Runs on the Linux operating system.
Supports multiple programming languages.
1.3 Hadoop versions
The first generation of Hadoop comprised three branches: 0.20.x eventually evolved into the stable 1.0.x, while 0.21.x and 0.22.x added major new features such as NameNode HA.
The second generation is an entirely new architecture; both of its branches include the HDFS Federation and YARN systems. Compared with 0.23.x, 2.x added two major features: NameNode HA and wire-compatibility.
Hadoop 2.0 was developed on JDK 1.7 (whose updates ended in April 2015); the community then released Hadoop 3.0 based on JDK 1.8.
1.4 The Hadoop ecosystem
The Hadoop project is now very mature and complete, including subprojects such as ZooKeeper, HDFS, MapReduce, HBase, Hive, and Pig; among these, HDFS and MapReduce are Hadoop's two core components.
| Component | Function |
| --- | --- |
| HDFS | Distributed file system |
| MapReduce | Distributed parallel programming model |
| YARN | Resource manager and scheduler |
| Tez | Next-generation Hadoop query processing framework running on YARN |
| Hive | Data warehouse on Hadoop |
| HBase | Non-relational distributed database on Hadoop |
| Pig | A Hadoop-based platform for large-scale data analysis, providing the SQL-like query language Pig Latin |
| Sqoop | Transfers data between Hadoop and traditional databases |
| Oozie | Workflow management system on Hadoop |
| ZooKeeper | Provides distributed coordination and consistency services |
| Storm | Stream computing framework |
| Flume | A highly available, highly reliable, distributed system for collecting, aggregating, and transferring massive log data |
| Ambari | A rapid Hadoop deployment tool, supporting provisioning, management, and monitoring of Apache Hadoop clusters |
| Kafka | A high-throughput distributed publish-subscribe messaging system that can handle all action-stream data on a consumer-scale website |
| Spark | A general-purpose parallel framework similar to Hadoop MapReduce |