Principles and Applications of Big Data Technology Part I Big Data Fundamentals

Table of contents

Chapter 1 Big Data Overview

1. The era of big data

1.1 Three waves of informatization

1.2 Information Technology Development

1.3 Changes in data generation methods

1.4 Impact of big data

2. The concept of big data

2.1 Characteristics of big data

2.2 Key Technologies of Big Data

2.3 Big Data Computing Models

2.4 Big Data Industry

3. Big Data, Cloud Computing, and Internet of Things

3.1 Cloud Computing

3.2 Internet of Things

3.3 The relationship between big data, cloud computing and Internet of Things

Chapter 2 Big Data Processing Architecture: Hadoop

I. Overview

1.1 Introduction to Hadoop

1.2 Hadoop Characteristics

1.3 Hadoop Versions

1.4 The Hadoop Ecosystem


Part I Big Data Fundamentals: summary and key points for Chapter 1 (Big Data Overview) and Chapter 2 (Big Data Processing Architecture: Hadoop)

Chapter 1 Big Data Overview

1. The era of big data

1.1 Three waves of informatization

| Information wave | Time | Hallmark | Problem solved |
| --- | --- | --- | --- |
| First wave of informatization | Around 1980 | Popularization of the personal computer | Information processing |
| Second wave of informatization | Around 1995 | The internet age | Information transmission |
| Third wave of informatization | Around 2010 | The big data era (big data, cloud computing, and IoT) | Information explosion |

1.2 Information Technology Development

Three core issues: information storage, information processing, and information transmission

| Key problem | Characteristics |
| --- | --- |
| Information storage | Storage device capacity keeps increasing; growing data volumes demand ever-larger storage capacity, and larger capacity in turn accelerates the growth of data volume |
| Information transmission | Network bandwidth keeps increasing |
| Information processing | CPU performance has improved dramatically; Moore's Law: performance doubles roughly every 18 months while price halves |
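The compounding effect of the 18-month doubling figure quoted above can be made concrete with a small calculation (a rough illustration only; Moore's law is a rule of thumb, not an exact growth law):

```python
# Rough illustration of Moore's law as stated above:
# performance doubles every 18 months (while price halves).

def moores_law_factor(months: float, doubling_period: float = 18.0) -> float:
    """Return the performance multiplier after `months` months."""
    return 2.0 ** (months / doubling_period)

# Over one decade (120 months), performance grows by roughly
# 2^(120/18), i.e. about a hundredfold.
decade_factor = moores_law_factor(120)
print(f"Performance after 10 years: ~{decade_factor:.0f}x")
```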

1.3 Changes in data generation methods

Data generation has gone through three stages: the operational-system stage, the user-generated-content stage, and the perceptual-system stage.

| Data generation stage | Example | Characteristics |
| --- | --- | --- |
| Operational-system stage | Hospital medical systems, bank transaction systems... | Data is generated passively: every transaction produces a record that is written to the database |
| User-generated-content stage | The "Web 2.0 era", with self-service publishing platforms such as Weibo | Users create and publish data themselves; dissemination no longer requires physical media such as disks |
| Perceptual-system stage | Sensors in the IoT, such as temperature sensors | IoT devices automatically generate dense, massive data within short periods of time |

1.4 Impact of big data

Scientific research has moved through four paradigms: experimental science -> theoretical science -> computational science [first propose a candidate theory, then verify it with data] -> data-intensive science [derive previously unknown theories directly from large amounts of data]

Three major shifts in thinking: analyze all the data rather than a sample, accept efficiency over precision, and look for correlation rather than causation

Social development: big-data-driven decision-making, integration of big data with various industries, and big data driving new technologies and new applications

2. The concept of big data

2.1 Characteristics of big data

The four "V"s of big data: Volume (large data volume), Variety (many data types), Velocity (fast processing), and Value (low value density)

2.2 Key Technologies of Big Data

Big data technology: the range of data processing and analysis techniques that accompany the collection, storage, analysis, and presentation of big data. It uses non-traditional tools to process large amounts of structured, semi-structured, and unstructured data in order to obtain analysis and prediction results.

Two core technologies: distributed storage and distributed processing

Main learning content: distributed file system HDFS, distributed database BigTable, distributed parallel processing technology MapReduce

| Big data technology | Function |
| --- | --- |
| Data acquisition and preprocessing | Use data-warehouse ETL tools to extract data scattered across heterogeneous sources [relational data, flat data files...] into a temporary staging layer for cleaning, transformation, and integration, and finally load it into a data warehouse or data mart, where it becomes the basis for online analytical processing and data mining; log-collection tools can also feed data collected in real time into the system as a stream for real-time processing and analysis |
| Data storage and management | Use distributed file systems, data warehouses, relational databases, NoSQL databases, etc. to store and manage structured, semi-structured, and unstructured massive data |
| Data processing and analysis | Use distributed parallel programming models and computing frameworks, combined with machine learning and data mining algorithms, to process and analyze massive data; visualize the results to help people better understand and analyze the data |
| Data security and privacy protection | While mining commercial and academic value from big data, build data security and privacy protection systems to effectively protect data security and personal privacy |

PS: a database is not the same as a data warehouse.

Database: a database is a transaction-oriented processing system (a business system), used for day-to-day operations on specific business data, typically querying and modifying records. Users care most about operation response time, data security, integrity, and the number of concurrent users supported. As the main means of data management, traditional database systems are used chiefly for operational processing, also known as OLTP (On-Line Transaction Processing).

Data warehouse: a data warehouse analyzes historical data on particular subjects to support management decisions; this workload is also called OLAP (On-Line Analytical Processing).

ETL: the data of each independent system is extracted, transformed, and filtered, then stored in a centralized place to become the data warehouse. This extract-transform-load process is called ETL (Extract, Transform, Load), and its purpose is to consolidate the scattered, messy, non-uniform data within an enterprise.
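The extract-transform-load flow described above can be sketched in a few lines of plain Python. This is a toy illustration only: real ETL tools add scheduling, incremental loads, and error recovery, and the record fields here are invented for the example.

```python
# Toy ETL sketch: pull records from two heterogeneous "sources",
# clean and transform them into one uniform schema, and load the
# result into a list standing in for a data-warehouse table.

def extract():
    # Two heterogeneous sources: a "relational" source and a flat-file source.
    relational = [{"id": 1, "amount": "100.5", "city": "beijing"},
                  {"id": 2, "amount": None, "city": "SHANGHAI"}]
    flat_file = ["3,42.0,guangzhou"]
    for row in relational:
        yield row
    for line in flat_file:
        id_, amount, city = line.split(",")
        yield {"id": int(id_), "amount": amount, "city": city}

def transform(rows):
    # Cleaning: drop rows with missing amounts.
    # Transformation: normalize types and letter case.
    for row in rows:
        if row["amount"] is None:
            continue
        yield {"id": row["id"],
               "amount": float(row["amount"]),
               "city": row["city"].lower()}

def load(rows):
    warehouse = []          # stands in for the target warehouse table
    warehouse.extend(rows)
    return warehouse

warehouse = load(transform(extract()))
print(warehouse)  # two clean, uniform records (ids 1 and 3)
```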

2.3 Big Data Computing Models

MapReduce, the most common big data processing technology, represents batch processing of large-scale data. Beyond it, there are several big data computing models: batch computing, stream computing, graph computing, query-analysis computing, and so on.

| Computing model | Problem solved | Representative products |
| --- | --- | --- |
| Batch computing | Batch processing of large-scale data | MapReduce, Spark, etc. |
| Stream computing | Real-time computing on streaming data | Flink, DStream, etc. |
| Graph computing | Processing of large-scale graph-structured data | GraphX, Pregel, etc. |
| Query-analysis computing | Storage management and query analysis of large-scale data | Dremel, Hive, etc. |
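The batch model that MapReduce represents is usually introduced with word count, its canonical example. The sketch below simulates the three phases in plain single-process Python; the real framework distributes the map, shuffle, and reduce phases across a cluster.

```python
from collections import defaultdict

# Single-process simulation of the MapReduce batch model:
# map -> shuffle -> reduce.

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every input split.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big value", "data stream"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'value': 1, 'stream': 1}
```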

2.4 Big Data Industry

The big data industry comprises the IT infrastructure layer, data source layer, data management layer, data analysis layer, data platform layer, and data application layer; at each level, a number of market-leading technologies and enterprises have emerged.

3. Big Data, Cloud Computing, and Internet of Things

3.1 Cloud Computing

Cloud computing: cloud computing provides scalable, inexpensive distributed computing capability over the network. Users can obtain the IT resources they need anytime, anywhere, as long as they have network access.

Key technologies of cloud computing: virtualization, distributed storage, distributed computing, multi-tenancy, etc.

| Service model | Explanation |
| --- | --- |
| Infrastructure as a Service (IaaS) | Rents out infrastructure such as computing resources and storage space |
| Platform as a Service (PaaS) | Rents out a platform |
| Software as a Service (SaaS) | Rents out software |

| Deployment type | Explanation |
| --- | --- |
| Public cloud | Provides services to all registered users |
| Private cloud | Serves only specific users; for example, a private cloud built by an enterprise serves only that enterprise |
| Hybrid cloud | Combines public and private clouds: data is kept in the private cloud for security, while the public cloud's computing resources are used for efficiency |

3.2 Internet of Things

Internet of Things: the IoT is the internet of interconnected things, an extension of the internet. It uses communication technologies such as local networks or the internet to connect sensors, controllers, machines, people, and objects in new ways, forming person-to-thing and thing-to-thing connections and enabling informatization and remote management and control.

Key IoT technologies: identification and sensing technologies (QR codes, RFID, sensors, etc.), network and communication technologies, data mining and fusion technologies, etc.

3.3 The relationship between big data, cloud computing, and the Internet of Things

Cloud computing, big data, and the IoT represent the latest technology trends in the IT field; the three are distinct yet interrelated.

Chapter 2 Big Data Processing Architecture: Hadoop

I. Overview

1.1 Introduction to Hadoop

Hadoop is an open-source distributed computing platform under the Apache Software Foundation. It provides users with a distributed infrastructure whose low-level system details are transparent.

Hadoop is developed in Java, is highly cross-platform, and can be deployed on clusters of inexpensive commodity machines.

The core of Hadoop is the Hadoop Distributed File System [HDFS] and MapReduce.

HDFS is an open-source implementation of the Google File System [GFS], a distributed file system designed for commodity hardware.

MapReduce is an open-source implementation of Google's MapReduce; it lets users develop parallel applications without understanding the low-level details of the system.

1.2 Hadoop Characteristics

Hadoop is a software framework for distributed processing of large amounts of data in a reliable, efficient, and scalable way.

Characteristics:

High reliability: redundant data storage keeps multiple replicas that check one another.

High efficiency: as a parallel distributed computing platform, Hadoop rests on two core technologies, distributed processing and distributed storage, and can efficiently process petabyte-scale data.

High scalability: Hadoop runs stably and efficiently on inexpensive clusters and can scale out to thousands of compute nodes.

High fault tolerance: redundant storage automatically keeps multiple replicas of the data, and failed tasks are automatically reassigned.

Low cost: it runs on clusters of inexpensive machines, and a Hadoop environment is easy to set up even on a personal computer.

Runs on the Linux operating system.

Supports multiple programming languages.
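The redundancy behind the reliability and fault-tolerance points above can be sketched as follows. This is a toy model, not HDFS's actual placement policy (real HDFS also considers racks and node load); only the default replication factor of 3 is taken from HDFS.

```python
import random

REPLICATION = 3  # HDFS's default replication factor

def place_replicas(block, nodes, replication=REPLICATION):
    # Toy placement: choose `replication` distinct nodes to hold the block.
    return {node: block for node in random.sample(nodes, replication)}

def read_block(placements, failed_nodes):
    # A read succeeds as long as any replica lives on a healthy node.
    for node, block in placements.items():
        if node not in failed_nodes:
            return block
    raise IOError("all replicas lost")

nodes = ["node1", "node2", "node3", "node4"]
placements = place_replicas("block-0001", nodes)

# Fail two of the three nodes holding replicas: the block is still readable.
survivor = read_block(placements, failed_nodes=set(list(placements)[:2]))
print(survivor)
```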

1.3 Hadoop Versions

The first generation of Hadoop comprised three versions: 0.20.x eventually evolved into 1.0.x and became the stable release, while 0.21.x and 0.22.x added major new features such as NameNode HA.

The second generation is a completely new architecture; its branches all include the HDFS Federation and YARN systems. Compared with 0.23.x, 2.x added two major features: NameNode HA and wire-compatibility.

Hadoop 2.0 was developed on JDK 1.7 (whose updates ended in April 2015); the community released Hadoop 3.0 based on JDK 1.8.

1.4 The Hadoop Ecosystem

The Hadoop project is now mature and complete, encompassing subprojects such as Zookeeper, HDFS, MapReduce, HBase, Hive, and Pig; of these, HDFS and MapReduce are the two core components of Hadoop.

| Component | Function |
| --- | --- |
| HDFS | Distributed file system |
| MapReduce | Distributed parallel programming model |
| YARN | Resource management and scheduling |
| Tez | Next-generation query processing framework that runs on top of YARN |
| Hive | Data warehouse on Hadoop |
| HBase | Non-relational distributed database on Hadoop |
| Pig | Hadoop-based large-scale data analysis platform providing Pig Latin, a SQL-like query language |
| Sqoop | Transfers data between Hadoop and traditional databases |
| Oozie | Workflow management system on Hadoop |
| Zookeeper | Distributed coordination and consistency service |
| Storm | Stream computing framework |
| Flume | Highly available, highly reliable, distributed system for collecting, aggregating, and moving massive amounts of log data |
| Ambari | Rapid Hadoop deployment tool supporting provisioning, management, and monitoring of Apache Hadoop clusters |
| Kafka | High-throughput distributed publish-subscribe messaging system capable of handling all the action-stream data of a consumer-scale website |
| Spark | General-purpose parallel framework similar to Hadoop MapReduce |

Origin: blog.csdn.net/CNDefoliation/article/details/127716474