Introduction to Big Data Ecosystem

1) Hadoop is a distributed system infrastructure developed by the Apache Foundation.
2) It mainly solves the problems of storing and analyzing/computing massive amounts of data.
3) In a broad sense, Hadoop usually refers to a wider concept: the Hadoop ecosystem.


HDFS architecture overview

HDFS (Hadoop Distributed File System) is a distributed file system.

1) NameNode (nn): Stores file metadata, such as the file name, directory structure, and file attributes (creation time, number of replicas, permissions), as well as the block list of each file and the DataNodes on which each block resides.
(1) Manage HDFS namespace;
(2) Configure copy strategy;
(3) Manage data block (Block) mapping information;
(4) Process client read and write requests.
2) DataNode (dn): Stores the file block data, and the checksums of the block data, in the local file system.
(1) Stores the actual data blocks;
(2) Performs read/write operations on the data blocks.
3) Secondary NameNode (2nn): Backs up NameNode metadata at regular intervals. It is not a hot standby for the NameNode: when the NameNode fails, it cannot immediately take over and provide service.
(1) Assists the NameNode and shares its workload, for example by periodically merging the Fsimage and Edits files and pushing the result to the NameNode;
(2) In an emergency, it can assist in recovering the NameNode.
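
To make this division of labor concrete, here is a minimal sketch in Scala using the standard Hadoop FileSystem API; the path /data/input.txt is a made-up example, and fs.defaultFS is assumed to point at the cluster. The metadata comes from the NameNode, while the file bytes are streamed from DataNodes.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// minimal sketch; /data/input.txt is a hypothetical file
val fs = FileSystem.get(new Configuration())
val st = fs.getFileStatus(new Path("/data/input.txt"))
// file metadata: answered by the NameNode
println(s"length=${st.getLen}, replicas=${st.getReplication}, blockSize=${st.getBlockSize}")
// block contents: streamed from the DataNodes that hold the blocks
val in = fs.open(new Path("/data/input.txt"))
val firstByte = in.read()
in.close()
```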

YARN architecture overview
YARN (Yet Another Resource Negotiator) is Hadoop's resource coordinator and resource manager.

1) ResourceManager (RM): manages the resources (memory, CPU, etc.) of the entire cluster
2) NodeManager (NM): manages the resources of a single node (server)
3) ApplicationMaster (AM): manages the run of a single application
4) Container: equivalent to an independent server; it encapsulates the resources required for task execution, such as memory, CPU, disk, and network
Note 1: There can be multiple clients
Note 2: Multiple ApplicationMasters can run on the cluster
Note 3: Each NodeManager can have multiple Containers

Map process analysis

There are several main concepts in the map phase:

Partition: partitioning the map output by key, generally with a hash method

Sort: sorting by key; sorting plays a central role in Hadoop

Spill: writing data from memory to disk once the in-memory buffer fills
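
As a sketch of the partition step: Hadoop's default HashPartitioner assigns each record to a reduce task by hashing its key, along these lines:

```scala
// hash partitioning in miniature: same key -> same partition -> same reducer
def partitionFor(key: Any, numReduceTasks: Int): Int =
  (key.hashCode & Int.MaxValue) % numReduceTasks

println(partitionFor("hello", 4)) // always the same partition for "hello"
```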

Reduce process analysis

1. Sort and merge the data passed in from multiple map tasks;

2. Run the reducer logic;

3. Write the result to HDFS.
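
A minimal sketch of these three steps in plain Scala; the map outputs are made up, and a real job would write the result to HDFS rather than print it:

```scala
// made-up outputs of two map tasks, as (key, value) pairs
val mapOutputs = Seq(Seq("a" -> 1, "b" -> 1), Seq("a" -> 1, "c" -> 1))

// 1. sort and merge the map outputs by key
val merged = mapOutputs.flatten.sortBy(_._1)

// 2. reducer logic: all values of one key are reduced together (here: summed)
val reduced = merged.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

// 3. a real job would now write the result to HDFS
println(reduced) // e.g. Map(a -> 2, b -> 1, c -> 1)
```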

Spark has three major engines: Spark Core, Spark SQL, and Spark Streaming.

The key abstractions of Spark Core are SparkContext and RDD;

The key abstractions of Spark SQL are SparkSession and DataFrame;

The key abstractions of Spark Streaming are StreamingContext and DStream.

SparkSession is a concept introduced in Spark 2.0. It is mainly used in Spark SQL, although it can of course be used elsewhere as well, and it can replace SparkContext;

SparkSession actually encapsulates SQLContext and HiveContext.
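
A minimal sketch of creating a SparkSession; the app name, local master, and file name are placeholder choices:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("demo")            // placeholder app name
  .master("local[*]")         // placeholder master
  .enableHiveSupport()        // wires in what HiveContext used to provide
  .getOrCreate()

val sc = spark.sparkContext             // the SparkContext still sits underneath
val df = spark.read.json("people.json") // hypothetical file -> DataFrame
```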

Spark is a computing framework similar to MapReduce, but with low latency and support for interactive computation.

Core components

Spark Core: contains Spark's basic functionality, in particular the API for defining RDDs and the operations and actions on them. The other Spark libraries are built on top of RDDs and Spark Core (a short RDD example follows this list).

Spark SQL: provides an API for interacting with Spark through Apache Hive's SQL variant, Hive Query Language (HiveQL). Each database table is treated as an RDD, and Spark SQL queries are converted into Spark operations.

Spark Streaming: processes and controls real-time data streams, allowing programs to handle real-time data like ordinary RDDs.

MLlib: a library of common machine learning algorithms implemented as Spark operations on RDDs. It contains scalable learning algorithms, such as classification and regression, that need to iterate over large data sets.

GraphX: a collection of algorithms and tools for manipulating graphs, performing parallel graph operations, and computation. GraphX extends the RDD API with operations for manipulating graphs, creating subgraphs, and accessing all vertices on a path.
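
The RDD example promised above: a minimal sketch of transformations versus actions, assuming an existing SparkContext sc and a hypothetical input path:

```scala
// transformations are lazy; the action at the end triggers the actual job
val lines  = sc.textFile("hdfs:///tmp/words.txt") // hypothetical path
val counts = lines
  .flatMap(_.split(" "))      // transformation
  .map(word => (word, 1))     // transformation
  .reduceByKey(_ + _)         // transformation
println(counts.count())       // action: computation happens here
```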

The composition of the Spark architecture

Cluster Manager: in standalone mode, this is the Master node, which controls the whole cluster and monitors the Workers; in YARN mode it is the ResourceManager

Worker node: a slave node, responsible for managing the compute node and starting the Executor or Driver

Driver: runs the Application's main() function

Executor: a process launched on a worker node for an Application

Implementation process

After the user program creates a SparkContext, it connects to the cluster resource manager, which allocates computing resources for the program and starts the Executors;

The Driver divides the program into execution stages and multiple Tasks, and then sends the Tasks to the Executors;

The Executors are responsible for running the Tasks and reporting their status to the Driver, as well as reporting the current node's resource usage to the cluster resource manager.

Spark deployment

· Install Scala

Configuration file: spark-env.sh

Four modes of deployment: Standalone, Spark On Yarn, Spark On Mesos, Spark On Cloud
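
As a hedged sketch of how the deployment mode shows up in code: the master URL handed to SparkConf selects the mode (host names below are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("demo")
  .setMaster("spark://master-host:7077")   // Standalone (placeholder host)
  // .setMaster("yarn")                    // Spark On Yarn
  // .setMaster("mesos://mesos-host:5050") // Spark On Mesos (placeholder host)
val sc = new SparkContext(conf)
```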

Application Scenario

Spark can be used when a computation exceeds what a single machine can handle, i.e. when a single machine's memory is not enough;

Spark can also be used when a computation is very complex and time-consuming;

Batch processing: batch processing of complex computations over massive data, where runtimes of several minutes to several hours are acceptable

Interactive query: interactive queries over massive historical data, with response times from a few seconds to tens of minutes

Real-time processing: processing of real-time data streams, with latencies from hundreds of milliseconds to a few seconds


Spark cores, number of tasks and degree of parallelism

Each Spark job is divided into stages at shuffle boundaries, and each stage forms one or more TaskSets. Knowing how many tasks each stage needs to run helps us optimize how Spark runs.

Number of tasks

First you need to understand the following concepts:

RDD: resilient distributed dataset, consisting of multiple partitions;

split: a slice; why and how files on HDFS are sliced;

textFile partitioning: how textFile splits a file into partitions.

During the creation of an RDD, for example while reading an HDFS file, we can consider that there is no concept of a task yet;

The concept of a task only appears once the RDD exists.

Key points

An InputSplit corresponds to one partition of an RDD;

One partition of an RDD corresponds to one task; in other words, an InputSplit corresponds to one task;

Usually, one block corresponds to one InputSplit;

Taking textFile as an example: an InputSplit cannot be larger than the block size. That is, a block can be split further, but multiple blocks are never combined into one split. If no partition count is specified, each slice is one block (see the sketch below).
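
A minimal sketch of checking this in the spark-shell; the HDFS path is hypothetical:

```scala
// with default settings: roughly one partition per HDFS block
val rdd = sc.textFile("hdfs:///data/input.txt")
println(rdd.getNumPartitions) // = number of InputSplits = number of tasks in this stage

// asking for at least 8 splits: blocks may be split further, but never merged
val rdd8 = sc.textFile("hdfs:///data/input.txt", minPartitions = 8)
println(rdd8.getNumPartitions)
```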

HBase

➢ Region

The table is divided into multiple Regions in the row direction. The Region is the smallest unit of distributed storage and load balancing in HBase: different Regions can be on different RegionServers, but a single Region is never split across multiple servers. Regions are split by size, and each row of the table belongs to exactly one Region. As data is continuously inserted into the table, a Region keeps growing; when one of the Region's column families reaches a threshold, the Region is split into two new Regions.

➢ Store

Each Region consists of one or more Stores (at least one). HBase puts data that is accessed together into one Store, i.e. it builds one Store per ColumnFamily (as many ColumnFamilies as there are, that many Stores there are). A Store consists of one MemStore and zero or more StoreFiles.

➢ StoreFile

The physical file that holds the actual data. StoreFiles are stored on HDFS in the HFile format. Each Store has one or more StoreFiles (HFiles), and within each StoreFile the data is ordered.

➢ WAL (Hlog)

Since data can only be written to an HFile after being sorted in the MemStore, data held only in memory has a high probability of being lost. To solve this problem, data is first written to a file called the Write-Ahead Log and only then written to the MemStore. When the system fails, the data can be reconstructed from this log file.

➢ MemStore

The write cache. Because the data in an HFile must be ordered, data is first stored and sorted in the MemStore; it is flushed to an HFile only when the flush condition is reached, and each flush produces a new HFile.
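
A minimal sketch of this write path through the standard HBase client API; the table and column names are made up, and an hbase-site.xml on the classpath is assumed:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("user"))   // made-up table

val put = new Put(Bytes.toBytes("row1"))
put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"))
table.put(put) // server side: appended to the WAL first, then placed in the MemStore;
               // a later flush writes the sorted MemStore out as a new HFile
table.close()
conn.close()
```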

Comparison of HBase and Hive

1. Hive

Data warehouse: in essence, Hive creates a mapping (stored in a database such as MySQL) onto files that already live on HDFS, which makes it convenient to manage and query them with HQL.

For data analysis and cleaning: Hive is suited to offline data analysis and cleaning, and has high latency.

Based on HDFS and MapReduce: the data Hive stores still lives on DataNodes, and the HQL statements you write are ultimately converted into MapReduce code for execution.

2. HBase

Database: a non-relational database with column-family storage.

It stores structured and unstructured data and is suited to single-table, non-relational data; it is not suitable for associated queries or operations such as JOINs.

Based on HDFS: data is persisted in the form of HFiles stored on DataNodes, managed as Regions by RegionServers.

Low latency, suitable for online business access: faced with large volumes of enterprise data, HBase can store massive amounts of data in a single table while providing efficient random access.

HBase data structures & storage

1. The LSM tree in HBase

Definition of LSM tree:

1. The LSM tree is a "forest" spanning memory and disk that contains multiple subtrees.

2. The LSM tree is divided into Level 0, Level 1, Level 2 ... Level n subtrees, of which only Level 0 is in memory; Levels 1 to n are on disk.

3. The Level 0 subtree in memory generally uses an ordered data structure such as a sorted tree (red-black tree / AVL tree), a skip list, or a TreeMap, which makes the subsequent sequential write to disk convenient.

4. The Level 1 to n subtrees on disk are essentially files whose data was sorted before being written; they are only called trees.

5. The subtrees at each level have a size threshold; once it is reached, they are merged and the merged result is written to the next level.

6. Only data in memory may be updated in place; data changes on disk are append-only, with no in-place updates.

2. The skip list in HBase

HRegion uses the skip-list data structure ConcurrentSkipListMap in its Store management:

ConcurrentSkipListMap has several advantages that ConcurrentHashMap cannot match:

The keys of ConcurrentSkipListMap are ordered.

ConcurrentSkipListMap supports higher concurrency.

The access time of ConcurrentSkipListMap is O(log N) and is almost independent of the number of threads; in other words, for a fixed amount of data, the more concurrent threads there are, the more ConcurrentSkipListMap shows its advantage.
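
A minimal sketch of these properties using the JDK class itself:

```scala
import java.util.concurrent.ConcurrentSkipListMap

val map = new ConcurrentSkipListMap[String, Long]()
map.put("row2", 2L) // safe to call concurrently, no external locking needed
map.put("row1", 1L)

// unlike ConcurrentHashMap, iteration comes back in key order:
map.forEach((k, v) => println(s"$k -> $v")) // row1 -> 1, then row2 -> 2
println(map.firstKey()) // "row1": ordered access in O(log N)
```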

3. The Bloom filter in HBase

The role of the Bloom filter is to let us determine immediately whether a file might contain a specific row key: a negative answer is definite, while a positive answer is only probable. This helps us filter out files that do not need to be scanned.

For HBase, when we choose to use the Bloom filter, HBase includes a Bloom filter structure, called a MetaBlock, in each StoreFile (HFile) it generates; MetaBlocks and DataBlocks (the real KeyValue data) are maintained together by the LRUBlockCache. Therefore, enabling the Bloom filter has a certain storage and memory-cache overhead.
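
To make the "definitely absent / possibly present" behavior concrete, here is a toy sketch of the idea, with a deliberately simplified hash family; it is not HBase's implementation:

```scala
// toy Bloom filter: k bit positions per key; false positives possible, false negatives not
class ToyBloom(numBits: Int, numHashes: Int) {
  private val bits = new java.util.BitSet(numBits)
  private def index(key: String, i: Int): Int = {
    val h = key.hashCode * (2 * i + 1) // simplified way to derive several hash functions
    ((h % numBits) + numBits) % numBits
  }
  def add(key: String): Unit = (0 until numHashes).foreach(i => bits.set(index(key, i)))
  def mightContain(key: String): Boolean = (0 until numHashes).forall(i => bits.get(index(key, i)))
}

val bloom = new ToyBloom(numBits = 1024, numHashes = 3)
bloom.add("row1")
println(bloom.mightContain("row1")) // true
println(bloom.mightContain("rowX")) // almost certainly false: this file can be skipped
```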

Flink is a real-time stream computing framework that achieves high throughput, low latency, and high performance by implementing the Google Dataflow streaming computation model. At the same time, Flink supports highly fault-tolerant state management, preventing state from being lost due to system failures during computation: Flink periodically persists state through the distributed snapshot mechanism, Checkpoints, so that a correct result can be computed even when the system crashes or misbehaves.

Flink tasks run in a multi-threaded fashion, which is very different from MapReduce's multi-JVM process model and lets Flink use the CPU much more efficiently. Multiple tasks share system resources through TaskSlots; each TaskManager manages a pool of TaskSlots so that resources can be managed effectively.
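
A minimal sketch of enabling those periodic snapshots through Flink's DataStream API; the 10-second interval and parallelism of 4 are arbitrary example values:

```scala
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(10000) // distributed snapshot (checkpoint) every 10 s
env.setParallelism(4)          // tasks run as threads in TaskManager slots
```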

Hive is a computing engine for processing massive amounts of structured data;

Hive is a data warehouse tool based on Hadoop: it maps structured data files onto tables and provides SQL-like query functionality;

The SQL dialect Hive provides is called HQL; in essence, Hive converts HQL into MapReduce;

1. Although Hive is a big data tool, Hive itself is not distributed; it is installed on a single machine. You can of course install Hive on several machines, but those installations have no connection to one another;

2. We can regard Hive as a client of Hadoop: we use Hadoop by using Hive;

3. Hive is a computing engine and has no storage of its own. Although it has tables, they can be thought of as virtual tables: the table data is stored on HDFS, while the table metadata lives in a database such as MySQL. Hive uses the metadata to find the location of the data on HDFS and then launches MapReduce to do the computation;

Hive architecture

Client: the client, which connects to Hive via, for example, JDBC;

Metastore: metadata, stored in a relational database such as MySQL;

The metadata holds information about tables, including the table name, each table's storage path on HDFS, the table owner, its fields, and so on;

Driver: this is the core of Hive, comprising the parser, compiler, optimizer, and executor;

The parser checks whether the SQL syntax is correct; the compiler converts the SQL into a MapReduce job; the optimizer optimizes that job; the executor runs it;

The default location of the Hive data warehouse is the /user/hive/warehouse path on HDFS;

Hive has a default database named default;

However, no default folder is created under /user/hive/warehouse: tables in the default database are created directly under /user/hive/warehouse (see the sketch below).
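
A minimal sketch of where tables land, issued through the Hive-enabled SparkSession from earlier; the table and database names are made up, and the default warehouse location is assumed:

```scala
spark.sql("CREATE TABLE t1 (id INT)")     // -> /user/hive/warehouse/t1 (default db: no extra folder)
spark.sql("CREATE DATABASE db1")
spark.sql("CREATE TABLE db1.t2 (id INT)") // -> /user/hive/warehouse/db1.db/t2
```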
