Big Data Study Notes 01: Introduction to Big Data

Introduction to Big Data

- Definition of big data
Big data refers to data sets that cannot be captured, managed, and processed with conventional software tools within an acceptable time frame. Such data demands new processing models to provide stronger decision-making power, insight discovery, and process optimization capabilities, and it constitutes massive, high-growth-rate, and diversified information assets.

- Characteristics of big data
The characteristics of big data can be described by the "5V" model that IBM once proposed, as follows:

- Volume (large amount of data)
The amount of data collected, stored, and computed is very large.
Computer storage units are generally expressed as B, KB, MB, GB, TB, PB, EB, ZB, YB, BB, NB, and DB. The relationships between them are:
1 GB = 1024 MB
1 TB = 1024 GB
1 PB = 1024 TB
1 EB = 1024 PB
1 ZB = 1024 EB
1 YB = 1024 ZB
1 BB = 1024 YB
1 NB = 1024 BB
1 DB = 1024 NB
Taking PB as an example: how big is a petabyte of data, really?
If a phone plays MP3 audio at an average of 1 MB per minute, and the average song lasts 4 minutes, then 1 PB of songs could be played continuously for about 2,000 years (see the quick arithmetic sketch after this list).
1 PB is also roughly equivalent to half the contents of all academic research libraries in the United States.
- Velocity (high speed)
In the big data era, data creation, storage, and analysis all demand high-speed processing. For example, personalized recommendation on e-commerce websites should happen in real time as far as possible. This is also a key feature distinguishing big data from traditional data mining.
- Variety (diversity)
Data formats and sources are diverse, spanning structured, semi-structured, and unstructured data such as web logs, audio, video, pictures, and geographic location information. These many data types place higher demands on data processing capability.
- Veracity (authenticity)
The authenticity of the data is what guarantees the correctness of data analysis.
- Value (low value density)
The value density of the data is relatively low: finding value is like panning for gold in the waves. The growth of the Internet has spawned huge volumes of data and information, but the value density is low. How to combine business logic with powerful machine-learning algorithms to mine the value in data is the hardest problem of the big data era.
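As a sanity check on the Volume figures above, here is a minimal sketch that runs the 1 PB playback arithmetic directly (Java is used to match the Hadoop examples later in these notes; binary, 1024-based units and roughly 1 MB per minute of MP3 playback are assumed):

```java
public class PetabyteMath {
    public static void main(String[] args) {
        // 1 PB expressed in MB, using binary (1024-based) units:
        // 1 PB = 1024 TB = 1024^2 GB = 1024^3 MB
        long mbPerPb = 1024L * 1024L * 1024L;   // 1,073,741,824 MB
        double minutes = mbPerPb;               // playback at ~1 MB per minute
        double years = minutes / 60 / 24 / 365; // minutes -> hours -> days -> years
        System.out.printf("1 PB of MP3 plays for about %.0f years%n", years); // ~2043
    }
}
```

The result, a bit over 2,000 years, matches the claim in the text.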

- Big data application scenarios
With the development of big data, big data technology has been widely applied in many industries, such as warehousing and logistics, e-commerce and retail, automotive, telecommunications, biomedicine, artificial intelligence, and smart cities. Big data technology has also played an important role in epidemic prevention and control.

Introduction to Hadoop

1. What is Hadoop

Hadoop in the narrow sense refers to a single framework. It is composed of three parts: HDFS, a distributed file system (storage); MapReduce, a distributed offline computing framework (computation); and YARN, a resource scheduling framework.

Hadoop in the broad sense includes not only the Hadoop framework itself but also the auxiliary frameworks around it:
Flume: log data collection;
Sqoop: importing data from, and exporting data back to, relational databases;
Hive: relies heavily on the Hadoop framework to run computations expressed in SQL;
HBase: the database of the big data world (playing the role MySQL plays in the traditional one).
Broadly speaking, Hadoop refers to an entire ecosystem.

2. Features of Hadoop

3. Hadoop distributions
The three main versions used in enterprises are: the Apache Hadoop version (the original; all other distributions are improvements based on it), the Cloudera version (Cloudera's Distribution Including Apache Hadoop, "CDH"), and the Hortonworks version (Hortonworks Data Platform, "HDP").

  • The original Apache Hadoop version
    Official website: http://hadoop.apache.org/
    Advantages: open-source contributions come from all over the world, so code and version updates are relatively fast.
    Disadvantages: version upgrades, version maintenance, and compatibility between versions all have to be handled by the user.
    All Apache software (including all historical versions) can be conveniently downloaded from: http://archive.apache.org/dist/

  • Paid version: Cloudera Manager, the CDH version, for production use
    Official website: https://www.cloudera.com/
    Cloudera is a US big data company that builds on the open-source Apache Hadoop version, applying its own internal patches. To keep the components running stably together, it ships a matching version of each piece of software in the big data ecosystem, which solves problems such as difficult version upgrades and version incompatibility. Strongly recommended for production environments.

  • Free open-source version: Hortonworks, the HDP version, for production use
    Official website: https://hortonworks.com/
    Hortonworks was founded when the Yahoo vice president who led Hadoop development took more than 20 core members to start the company. Its core products HDP and HDF are free and open source, and it ships a complete web management interface (Ambari, http://ambari.apache.org/) for managing cluster status through the browser.

4. Advantages and disadvantages of Hadoop

Advantages of Hadoop

  • Hadoop has high reliability for storing and processing data.

  • Hadoop distributes storage and computing tasks across clusters of available computers; these clusters can easily scale to thousands of nodes, giving it high scalability.

  • Hadoop can move data dynamically between nodes to keep each node balanced, and its processing speed is very fast, making it highly efficient.

  • Hadoop can automatically save multiple copies of data, and can automatically redistribute failed tasks, with high fault tolerance.

Disadvantages of Hadoop

  • Hadoop is not suitable for low-latency data access.
  • Hadoop cannot store a large number of small files efficiently.
  • Hadoop does not support multiple writers to the same file or arbitrary in-place modification of file contents.
    Summary: weigh the actual scenario against these advantages and disadvantages to decide whether Hadoop is the right tool for your development needs.

Important components of Apache Hadoop

Hadoop=HDFS (distributed file system) + MapReduce (distributed computing framework) + Yarn (resource coordination framework) + Common module
1. Hadoop HDFS: (Hadoop Distributed File System) a highly reliable, high-throughput distributed file system.
For example, to store 100 TB of data, HDFS applies "divide and conquer":
split: the data is cut into blocks; the 100 TB is split into blocks (say, 10 GB each in this example), and each block is stored on a computer node.
In short: cut the data into blocks, replicate each block, and scatter the copies across nodes.
The HDFS architecture involves several roles:
NameNode (NN): stores the metadata of files, such as file name, directory structure, and file attributes (creation time, number of replicas, permissions), as well as each file's block list and the DataNodes on which each block resides.
SecondaryNameNode (2NN): assists the NameNode in doing its job. It is an auxiliary daemon that monitors the state of HDFS and takes snapshots of the HDFS metadata at regular intervals.
DataNode (DN): stores file block data in the local file system and verifies the checksums of the blocks.
Note: NN, 2NN, and DN are simultaneously role names, process names, and names of the computer nodes that host them!
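To make the client's view of these roles concrete, here is a minimal sketch that writes a file through the HDFS Java API: the client asks the NameNode for metadata and then streams block data to DataNodes. The address hdfs://localhost:9000 and the path /demo/hello.txt are placeholder assumptions; substitute your cluster's actual fs.defaultFS.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/demo/hello.txt"))) {
            // The NameNode records the file's metadata (name, block list, replicas);
            // the block contents themselves are streamed to DataNodes.
            out.writeUTF("hello HDFS");
        }
    }
}
```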
2. Hadoop MapReduce: a distributed offline parallel computing framework.
Split the task apart, process the pieces in a distributed way, and aggregate the results.
A MapReduce computation = Map phase + Reduce phase.
The Map phase is the "divide" step, which processes the input data in parallel;
the Reduce phase is the "combine" step, which summarizes the results of the Map phase.
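To make the two phases concrete, here is the classic word count as a minimal sketch against the Hadoop MapReduce Java API (the class names are illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase ("divide"): each mapper turns one line of input into (word, 1) pairs.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }
}

// Reduce phase ("combine"): each reducer sums all the counts for one word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        ctx.write(key, new IntWritable(sum));
    }
}
```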
3. Hadoop YARN: a framework for job scheduling and cluster resource management.
YARN has several main roles that coordinate computing resources. As with HDFS, each name is at once a role name, a process name, and the name of the computer node where that process runs.
ResourceManager (RM): handles client requests, starts and monitors ApplicationMasters, monitors NodeManagers, and performs resource allocation and scheduling.
NodeManager (NM): manages the resources of a single node, processing commands from the ResourceManager and from ApplicationMasters.
ApplicationMaster (AM): splits the input data, requests resources for the application and assigns them to its internal tasks, and handles task monitoring and fault tolerance.
Container: an abstraction of the task runtime environment that encapsulates multi-dimensional resources such as CPU and memory, along with environment variables, startup commands, and other information needed to run a task.
In short: the ResourceManager is the boss, the NodeManagers are its workers, and the ApplicationMaster is the specialist in charge of a single computing job.
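These roles are easiest to see from the submission side: when a client submits a MapReduce job, the ResourceManager launches an ApplicationMaster for it, and the ApplicationMaster then requests containers for the map and reduce tasks. Below is a minimal driver sketch, reusing the illustrative WordCountMapper and WordCountReducer classes from the MapReduce section above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // Submitting the job hands it to the ResourceManager, which launches an
        // ApplicationMaster; the AM then requests containers for map/reduce tasks.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```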
4. Hadoop Common: utility modules that support the other modules (configuration, RPC, the serialization mechanism, and logging).

Summary

Focus on mastering the components of Hadoop: HDFS (distributed file system) + MapReduce (distributed computing framework) + Yarn (resource coordination framework) + Common module.
