Chapter 1 Overview of Big Data Computing System
1.1 Overview of Big Data Computing Framework
Hadoop
Hadoop's running process (5 steps?)
split => map => shuffle => reduce => output
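The five steps can be imitated in a single process. The sketch below is not Hadoop code; it is a plain-Python word count whose function names (`split_input`, `map_phase`, `shuffle`, `reduce_phase`) are made up for illustration, one per step.

```python
from collections import defaultdict

def split_input(text, n_splits):
    """Split: cut the input lines into roughly equal chunks."""
    lines = text.splitlines()
    size = max(1, -(-len(lines) // n_splits))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in the chunk."""
    return [(word, 1) for line in chunk for word in line.split()]

def shuffle(mapped_chunks):
    """Shuffle: group all values by key across every map output."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(text, n_splits=2):
    chunks = split_input(text, n_splits)
    mapped = [map_phase(c) for c in chunks]
    return reduce_phase(shuffle(mapped))  # Output: final key/value pairs

result = word_count("a b a\nb c\na")
```

In a real cluster each chunk would be mapped on a different node and the shuffle would move data over the network; here everything runs in one list comprehension.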
The detailed operation process of Hadoop? (4 big processes, 6+6+6+2)
- Create a new Job instance and schedule HDFS resources
- Enable MapTask to execute map function
- Start ReduceTask to execute the reduce function
- JobClient polls to know that the task is completed
What is the difference between Job and Task?
Job: A complete calculation process specified by the MapReduce program
Task: the basic unit of work for parallel computing in the MapReduce framework
A job (Job) can be split into several Map and Reduce tasks (Task) during execution
MapReduce scheduler (three? What is the default? The order of jobs executed?)
FIFO (the default in MapReduce 1.0; jobs run in submission order), Fair, Capacity
Two sorts after Map (what sorting algorithm is used respectively? Is the object a file or multiple files?)
Quick sort inside the file (Sort)
Merge and sort multiple files (Merge)
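A small sketch of the two sorts, assuming each spill file is just a list of (key, value) pairs. Python's `sorted()` is Timsort, not quicksort, so it only stands in for the in-file sort; `heapq.merge` performs the multi-file merge of already-sorted runs.

```python
import heapq

def sort_spill(records):
    """Sort one spill file in memory by key (stand-in for the quicksort step)."""
    return sorted(records, key=lambda kv: kv[0])

def merge_spills(spills):
    """Merge several already-sorted spill files into one sorted run."""
    return list(heapq.merge(*spills, key=lambda kv: kv[0]))

spill_a = sort_spill([("pear", 1), ("apple", 1), ("fig", 1)])
spill_b = sort_spill([("kiwi", 1), ("apple", 1)])
merged = merge_spills([spill_a, spill_b])
```

The first sort works on a single file; the merge works on multiple files and never needs to hold more than one record per file at a time.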
MapReduce task processing process (six steps?)
Big data to be processed => partition => submit to the master node => send to the map node, do some data sorting work (combining) => send to the reduce node
Failure node processing (what happens if the master node fails? What about the failure of the working node?)
Master node failure: The master periodically writes checkpoints; if it fails, the job can be restarted from the most recent valid checkpoint instead of being recomputed from scratch.
Worker node failure: If the master node gets no response from a worker node, it considers that worker failed and reschedules the failed tasks on other worker nodes.
Disadvantages of MapReduce 1.0 (mainly in which two aspects and which three directions?)
JobTracker is the centralized processing point of MapReduce, so it is a single point of failure.
JobTracker takes on too many tasks, causing excessive resource consumption.
YARN (what concepts are introduced? What are the three main parts and what do they do?)
ResourceManager: the global head; allocates and schedules cluster resources, and monitors the ApplicationMasters and NodeManagers
ApplicationMaster: the per-application head; applies to the ResourceManager for resources and assigns tasks
NodeManager: the subordinate on each node; manages that node's resources and accepts commands
1.2 Big Data Batch Computing Framework
Spark
RDD concept (full name? Partitionable? Is it stored on disk or in memory?)
Resilient Distributed Dataset
Different partitions of an RDD can be computed in parallel on different nodes in the cluster
Stored in memory
RDD operations (what are the three types? For?)
action, transformation, persistence
The execution process of RDD (how to create? How to generate different RDD? How to output to external data source? What are the advantages?)
read in external data sources
transformation
action
The reason for RDD's high efficiency (how to be fault-tolerant? On disk or in memory? What can store data?)
Fault tolerance: instead of data replication or logging, lost partitions are recomputed from the recorded lineage
Intermediate results are persisted in memory, and data is passed between RDD operations in memory
Stored data can be Java objects, which avoids serialization and deserialization overhead
basic concept
- DAG: What is the abbreviation for?
- Executor: Where is a process running? Responsible for running what?
- Application, job, stage, task relationship? What are they used for?
Directed Acyclic Graph
A process running on the worker node (WorkerNode), responsible for running the Task
Application > Jobs > Stages > Tasks
Architecture design (what are the three layers? English and Chinese names? What is in the third layer? What are the two Nodes? Based on what storage structure?)
Driver Program(SparkContext)
Cluster Manager
Worker Nodes(Executor(Task))
HDFS, HBase
7 steps for Spark to run code?
① Driver parses and generates Task
② Driver applies for resources from Cluster Manager
③ Cluster Manager allocates resources and nodes, and creates Executor
④ Executor registers with Driver
⑤ Driver passes the code and files to Executor
⑥ Executor runs Task
Shuffle operation (what is the situation of wide dependency and narrow dependency?)
Wide dependencies: one-to-many and many-to-many
Narrow dependencies: many-to-one or one-to-one
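Stage division (basis for division? Which kind of dependency is conducive to optimization?)
Traverse the DAG in reverse, from the final RDD backward: merge the RDD into the current stage on a narrow dependency, and cut a new stage on a wide dependency. Narrow dependencies are the ones conducive to optimization, since they can be pipelined within a stage.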
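The backward walk can be sketched as follows. This is a simplified illustration, not Spark's scheduler: the function name `split_stages` and the dictionary encoding of the DAG (each RDD name mapped to `(parent, "narrow" | "wide")` pairs) are assumptions made for the example.

```python
def split_stages(final_rdd, parents):
    """Return a list of stages (sets of RDD names), found by walking
    backward from final_rdd. `parents` maps an RDD name to a list of
    (parent_name, dependency) pairs, dependency being "narrow" or "wide"."""
    stages = []
    visited = set()
    pending = [final_rdd]              # RDDs that start a new stage
    while pending:
        root = pending.pop()
        if root in visited:
            continue
        stage, stack = set(), [root]
        while stack:
            rdd = stack.pop()
            if rdd in visited:
                continue
            visited.add(rdd)
            stage.add(rdd)
            for parent, dep in parents.get(rdd, []):
                if dep == "narrow":
                    stack.append(parent)    # merge into the current stage
                else:
                    pending.append(parent)  # cut: parent goes to an earlier stage
        stages.append(stage)
    return stages

# A -> B is a wide dependency (e.g. groupBy); C -> D and D, E -> F are narrow;
# G joins B (narrow) with F (wide).
parents = {
    "B": [("A", "wide")],
    "D": [("C", "narrow")],
    "F": [("D", "narrow"), ("E", "narrow")],
    "G": [("B", "narrow"), ("F", "wide")],
}
stages = split_stages("G", parents)
```

For this graph the walk yields three stages: one holding A, one holding C, D, E and F, and one holding B and G.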
RDD running process (4 stages? What did you do?)
① Create an RDD object
② Create DAG, that is, the dependency relationship between RDDs, and then decompose it into multiple Stages, each with multiple Tasks
③ Task is assigned by TaskScheduler to Executor on WorkerNode for execution
④ Worker executes Tasks
RDD fault tolerance mechanism (maintaining information for reconstruction)
RDD maintains information that can be used to create lost partitions
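A toy model of that idea, assuming a hypothetical `ToyRDD` class (not the Spark API): a derived dataset keeps only its parent and the function that produced it, so a lost partition can be rebuilt on demand.

```python
class ToyRDD:
    """A toy stand-in for an RDD: derived datasets store no input data,
    only their parent and the function that derives each partition."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.source = partitions      # only set for the input dataset
        self.parent = parent          # lineage: where this RDD came from
        self.fn = fn                  # lineage: how each partition is derived
        self.cache = {}               # partition index -> computed data

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def compute(self, i):
        """Return partition i, recomputing it from lineage if it is not cached."""
        if i not in self.cache:
            if self.parent is None:
                self.cache[i] = list(self.source[i])
            else:
                self.cache[i] = self.fn(self.parent.compute(i))
        return self.cache[i]

base = ToyRDD(partitions=[[1, 2], [3, 4]])
squares = base.map(lambda x: x * x)
before = squares.compute(1)
del squares.cache[1]          # simulate losing a cached partition
after = squares.compute(1)    # rebuilt from the parent, no replica needed
```

Only the lost partition is recomputed; partition 0 is never touched.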
The storage mechanism in Spark (Where does the RDD cache exist? How to obtain the data blocks corresponding to the partition from the disk? Where does the Shuffle data exist?)
RDD caching: including memory and disk-based caching
Memory cache = hash table + access strategy
Persistence of Shuffle data: must be cached on disk
Chapter 2 Big Data Management System
Big Data Management System 1
concept
Database definition?
A database is an organized, shareable collection of data stored in a computer for a long period of time.
DBMS (full name? Main function? What is DDL? What is DML?)
Database Management System
Data Definition Language, which defines the data objects in the database.
Data manipulation language, manipulating data to achieve basic operations on the database.
DBS (includes what?)
database, database management system, application system, database administrator, user
database storage structure
RAID (what is it? What is it made of?)
Redundant Array of Independent Disks
Array of several identical disks
Organization of records in the file (How are the 5 records organized? How are they recorded?)
Heap file organization: a record can be placed anywhere in the file where there is free space
Sequential file organization: ascending or descending order, pointer chain structure
Hash file organization: the value obtained by a certain attribute value through the hash function is used as the storage address
Clustered file organization: related records are stored in the same block
indexing technology
- What is an index? Is it a file?
- Index classification (two categories? What is the main index? The difference between clustering index and non-clustering index? What are the three types of index in clustering index?)
- Index update (delete and insert what kind of operations are for dense index and sparse index respectively?)
A small file with only indexed attributes that is recorded separately from the main file
Two categories: ordered index vs hash index
Clustered (non-clustered) index: the difference is whether the index order is consistent with the record order of the main file
Dense index, sparse index, multi-level index
Delete:
For a dense index, delete the corresponding index entry.
For a sparse index, if the search key of the deleted record appears in the index block, replace it with the search key A of the record that follows the deleted record in the main file. If A already appears in the index block, delete the index entry of the deleted record instead.
Insert:
For a dense index, if the search key does not appear in the index block, insert it into the index.
For a sparse index, if the data block has free space for the new record, the index needs no change; otherwise, add a new data block and insert a new index entry into the index block.
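To make the dense/sparse distinction concrete, here is a minimal lookup sketch. The in-memory `blocks` list stands in for the main file, and all names are illustrative assumptions; a dense index has one entry per record, while a sparse index has one entry per block and finishes with a sequential scan.

```python
import bisect

# The "main file": blocks of records sorted by search key.
blocks = [
    [(10, "a"), (20, "b"), (30, "c")],
    [(40, "d"), (50, "e"), (60, "f")],
    [(70, "g"), (80, "h")],
]

# Dense index: one entry per record -> (block number, slot).
dense_index = {key: (b, s) for b, blk in enumerate(blocks)
               for s, (key, _) in enumerate(blk)}

# Sparse index: one entry per block, holding that block's first key.
sparse_keys = [blk[0][0] for blk in blocks]

def dense_lookup(key):
    pos = dense_index.get(key)
    return None if pos is None else blocks[pos[0]][pos[1]][1]

def sparse_lookup(key):
    # Find the last index entry whose key is <= the search key,
    # then scan that block sequentially.
    b = bisect.bisect_right(sparse_keys, key) - 1
    if b < 0:
        return None
    for k, value in blocks[b]:
        if k == key:
            return value
    return None
```

The sparse index is smaller (three entries instead of eight) but only works because the main file is ordered on the search key.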
Transactions
- Definition, what does it consist of?
- ACID nature of transactions?
A logical unit of work in a DBMS, usually consisting of a set of database operations
Atomicity
Consistency
Isolation
Durability
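Atomicity in particular is easy to observe with Python's built-in `sqlite3` module. In this sketch the `account` table and the `transfer` helper are invented for the example; the `with conn:` block commits if it finishes normally and rolls back if an exception escapes it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts as one transaction: both updates
    commit together, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE name = ?", (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
        return True
    except ValueError:
        return False

ok = transfer(conn, "alice", "bob", 30)       # both updates are committed
failed = transfer(conn, "alice", "bob", 500)  # the debit is rolled back
balances = dict(conn.execute("SELECT name, balance FROM account"))
```

The second transfer debits alice before failing, yet her balance is unchanged afterwards, which is the "all or nothing" property.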
I/O Parallel
Partitioning technique | Definition (number of disks = n) | Advantage | Shortcoming |
---|---|---|---|
Round-robin partitioning | The i-th tuple goes to disk i mod n | Best for sequential scans | Difficulty handling range queries |
Hash partitioning | Hash function h with range 0…n-1 | Good for sequential access | No clustering, so difficult to answer range queries |
Range partitioning | Partition vector [v_0, v_1, ..., v_{n-2}] | | |
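The three techniques reduce to three small functions that map a tuple to a disk number in 0..n-1. This is a sketch under assumptions: the function names are invented, and the hash is a toy (sum of character codes) chosen only so the example is deterministic.

```python
import bisect

def round_robin(i, n):
    """Send the i-th tuple (in insertion order) to disk i mod n."""
    return i % n

def hash_partition(value, n):
    """Map the partitioning attribute to a disk in 0..n-1 with a toy hash."""
    return sum(ord(ch) for ch in str(value)) % n

def range_partition(value, vector):
    """vector = [v_0, ..., v_{n-2}] splits the domain into n ranges:
    value < v_0 -> disk 0, v_i <= value < v_{i+1} -> disk i+1,
    value >= v_{n-2} -> disk n-1."""
    return bisect.bisect_right(vector, value)

vector = [5, 11]                      # n = 3 disks
disks = [range_partition(v, vector) for v in (2, 5, 8, 11, 20)]
```

Range partitioning is the only one of the three that keeps neighbouring values on the same disk, which is why it is the one that suits range queries.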
Skew handling
- Types of skew (What are the 2 types of skew? What are the two divisions of skew?)
- Dealing with skew (3 ways?)
Attribute value skew: some values appear on the partition attribute of many tuples, and all tuples with the same value on the partition attribute are assigned in the same partition
partition skew
Range Partitioning: A bad partition vector may assign too many tuples to one partition and too few tuples to other partitions
Hash partitioning: unlikely to happen as long as good hash function is chosen
Dealing with skew in range partitioning: generate a balanced partition vector by sorting the relation on the partitioning attribute and scanning it; every time another 1/n of the relation has been read, the partitioning attribute value of the next tuple is added to the partition vector
Dealing with skew with histograms: a balanced partition vector can be constructed from a histogram in a relatively straightforward way
Dealing with skew with virtual processors: create more virtual partitions than real processors, so that skewed virtual partitions are spread across several real processors
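The balanced-partition-vector method can be sketched in a few lines. The function name and sample data are assumptions for illustration; the relation is represented only by its partitioning attribute values.

```python
def balanced_partition_vector(values, n):
    """Sort the partitioning attribute values, then scan them: each time
    another 1/n of the relation has been read, the value of the next tuple
    is appended to the vector. Returns n-1 cut points."""
    ordered = sorted(values)
    total = len(ordered)
    vector = []
    for k in range(1, n):
        next_index = k * total // n   # first tuple after k/n of the relation
        vector.append(ordered[next_index])
    return vector

# Skewed data: many small values, a few large ones.
values = [1, 1, 1, 1, 2, 3, 50, 60, 70, 80, 90, 100]
vector = balanced_partition_vector(values, 3)
```

With the cut points 2 and 70, each of the three ranges receives four tuples, whereas evenly spaced cut points over 1..100 would put most tuples in the first partition. If one value is repeated very often, the same cut point can appear twice, which is the attribute-value skew this method cannot remove.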
inter-query parallelism
Increases transaction throughput; used mainly to scale up a transaction processing system so that it supports more transactions per second
Cache Coherency Protocol
• Before reading/writing a page, the page must be shared/exclusively locked
• When locking a page, the page must be read from disk
• Before releasing the page lock, the page must be written to disk if updated
Intra-query parallelism (two complementary forms of intra-query parallelism)
Intraoperation parallelism - the execution of each individual operation within a query is parallelized
Interoperation parallelism - different operations within a query are executed in parallel
Big Data Management System II
Introduction to NoSQL
Not Only SQL
Typical NoSQL databases usually include key-value databases, column family databases, document databases, and graph databases
CAP
C (Consistency): Consistency means that any read operation can always read the result of the previously completed write operation;
A (Availability): Availability means that data can be obtained quickly, with operation results returned within a bounded time;
P (Partition tolerance): Partition tolerance means that when a network partition occurs, the separated parts of the system can still operate normally.
- CA: It means emphasizing consistency (C) and availability (A), and giving up partition tolerance (P). The easiest way is to put all transaction-related content on the same machine. Obviously, this approach will seriously affect the scalability of the system. Traditional relational databases adopt this design principle, so the scalability is relatively poor
- CP: That is, emphasizing consistency (C) and partition tolerance (P), giving up availability (A). When a network partition occurs, the affected service needs to wait for data consistency, so it cannot provide external services during the waiting period
- AP: That is, emphasizing availability (A) and partition tolerance (P), giving up consistency (C), allowing the system to return inconsistent data
BASE
BASE stands for Basically Available, Soft state, and Eventual consistency
NewSQL
NewSQL is an umbrella term for various new scalable, high-performance databases
- They offer the storage and management capacity for massive data that NoSQL has
- They also keep the characteristics of traditional databases, such as ACID and SQL
NewSQL features:
Support relational data model
Use SQL as the main interface
Chapter 3 Big Data Real-time Computing Framework
3.1 Storm
definition?
Real-time, distributed, streaming computing system
What are the typical application scenarios of Storm (synchronous, asynchronous, data stream processing, continuous computing, distributed remote procedure calls)?
Request response (synchronous): real-time image processing, real-time web page analysis
Stream processing (asynchronous): item-by-item processing, analysis and statistics
Data stream processing: It can be used to process new data and update databases in real time, with fault tolerance and scalability.
Continuous calculation: Continuous query can be performed and the results can be fed back to the client in real time.
Distributed Remote Procedure Call
Features of Storm
Reliable, fast, high fault tolerance, horizontal expansion
Technical architecture (what are the three parts?)
Nimbus (similar to JobTracker)
Zookeeper
Supervisor (similar to TaskTracker)
Worker (similar to Child)
The relationship between Worker, Executor and Task
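Similar to the earlier frameworks: a Worker process runs one or more Executors (threads), and each Executor runs one or more Tasks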
Storm's workflow
Client submits the Topology => Nimbus stores the task assignments in Zookeeper => Supervisor obtains its assigned tasks and starts Workers => Worker executes the specific Tasks
Storm fault tolerance (how to deal with task-level failures? What happens to task-level failures of task slot failures + cluster node (machine) failures? What happens to Nimbus node failures?)
Task-level failure
- A bolt task crash leaves a message unacknowledged, or the acker task fails => the fail method of the Spout is called
- Spout task failure => the external device (such as an MQ) connected to the Spout task is responsible for message integrity
Cluster node (machine) failure
- Node failure in a Storm cluster: tasks are shifted to other nodes
- Node failure in a Zookeeper cluster: the cluster keeps running as long as fewer than half of the machines are down
Nimbus node failure: Without Nimbus, workers are not rescheduled to other hosts when necessary, and clients cannot submit tasks.
What is Stream? What are Spouts? What are Tuples? What are Bolts? What is Topology?
Stream: infinite sequence of Tuples
Spouts: faucets, the source of Stream
Bolts: process Tuples, create new Streams
Topology: an abstract network of Spouts and Bolts
What is Stream Grouping? There are 6 ways?
Used to tell Topology how to transfer Tuple between two components (Spouts, Bolts)
ShuffleGrouping: Random grouping
FieldsGrouping: Grouping by field
AllGrouping: Broadcast sending, all Tuples are sent to all Tasks
GlobalGrouping: Global grouping, all Tuples are sent to the same Task
NoneGrouping: No grouping
DirectGrouping: Specify sending, specify receiving
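Four of these groupings can be expressed as one routing function that decides which downstream tasks receive a tuple. This is a plain-Python illustration, not the Storm API; `route` and its parameters are invented for the example, and tuples are modelled as dicts.

```python
import random

def route(grouping, tup, n_tasks, fields=(), rng=random):
    """Return the list of downstream task indices that receive `tup`
    (a dict) under the given grouping."""
    if grouping == "shuffle":
        return [rng.randrange(n_tasks)]            # random task
    if grouping == "fields":
        key = tuple(tup[f] for f in fields)        # same key -> same task
        return [hash(key) % n_tasks]
    if grouping == "all":
        return list(range(n_tasks))                # broadcast to every task
    if grouping == "global":
        return [0]                                 # always the same single task
    raise ValueError(f"unsupported grouping: {grouping}")

same_task = (route("fields", {"word": "storm", "n": 1}, 4, fields=("word",))
             == route("fields", {"word": "storm", "n": 2}, 4, fields=("word",)))
```

Fields grouping is what makes per-key state (such as a running word count) possible, because every tuple with the same field value reaches the same task.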
3.2 Spark Streaming
The input data stream is cut into segments by time slice, forming a DStream; each segment of data is converted into a Spark RDD
The core concept of Spark Streaming
- What does DStream represent?
- What does Transformations do? (Standard RDD operation, stateful operation)
A sequence of RDDs representing data streams
Transformations: Modify data from one DStream to create another DStream
Standard RDD operations: map, countByValue, reduce, join...
Stateful operations: window, countByValueAndWindow...
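The difference between a standard (per-batch) operation and a stateful window operation can be shown without Spark. In this sketch each inner list plays the role of one RDD in the DStream; the function names mirror the operation names above but are plain Python written for illustration, and timestamps are assumed to be non-negative seconds.

```python
from collections import Counter

def to_batches(events, batch_seconds):
    """Cut a stream of (timestamp, value) events into time slices."""
    if not events:
        return []
    last_slot = int(max(t for t, _ in events) // batch_seconds)
    batches = [[] for _ in range(last_slot + 1)]
    for t, value in events:
        batches[int(t // batch_seconds)].append(value)
    return batches

def count_by_value(batches):
    """Stateless: count values independently inside every batch."""
    return [Counter(batch) for batch in batches]

def count_by_value_and_window(batches, window):
    """Stateful: for every batch, count values over the last `window` batches."""
    results = []
    for i in range(len(batches)):
        merged = Counter()
        for batch in batches[max(0, i - window + 1): i + 1]:
            merged.update(batch)
        results.append(merged)
    return results

events = [(0.2, "a"), (0.7, "b"), (1.1, "a"), (2.5, "a"), (2.9, "c")]
batches = to_batches(events, batch_seconds=1)
windowed = count_by_value_and_window(batches, window=2)
```

The windowed result for the last batch covers seconds 1 to 3, so it counts "a" twice even though that batch alone contains it once.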
The input source of DStream
- Basic sources?
- Advanced sources?
Spark fault tolerance
- An RDD remembers the sequence of operations that created it from the original fault-tolerant input
- Batches of input data are replicated in memory across multiple worker nodes and are therefore fault tolerant
Some comparisons
Comparison between Spark Streaming and Storm
Spark Streaming | Storm |
---|---|
Cannot achieve millisecond-level stream computing | Can respond within milliseconds |
Its low-latency execution engine can still be used for real-time computing | |
Compared with Storm, RDD datasets make efficient fault-tolerant processing easier | |
Functional Correspondence between Storm and Hadoop Architecture Components
Item | Hadoop | Storm |
---|---|---|
Application name | Job | Topology |
System role | JobTracker | Nimbus |
System role | TaskTracker | Supervisor |
Component interface | Map/Reduce | Spout/Bolt |
Chapter 4 Big Graph Computing Framework
Computational model
Superstep: Parallel Node Computing
- For each node (six possible operations)
- Termination condition (two)
Accepts a message sent by the previous superstep
Executes the same user-defined function
Modifies its value or the value of its outgoing edges
Sends messages to other points (accepted by the next superstep)
Changes the topology of the graph
Ends the iteration when there is no more work to do
All vertices become inactive at the same time
No message is passed
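The superstep model above can be sketched as a small single-process loop. This is not Pregel itself; `pregel_max` is a made-up function that propagates the maximum value through a graph, with every vertex voting to halt after each superstep and being woken again only by an incoming message.

```python
def pregel_max(values, edges):
    """values: {vertex: int}; edges: {vertex: [out-neighbours]}.
    Every vertex ends up holding the largest value that can reach it."""
    values = dict(values)
    active = set(values)                       # all vertices start active
    inbox = {v: [] for v in values}
    superstep = 0
    while active or any(inbox.values()):       # stop: all halted, no messages
        outbox = {v: [] for v in values}
        for v in values:
            messages = inbox[v]
            if v not in active and not messages:
                continue                       # halted and nothing to wake it
            new_value = max([values[v]] + messages)
            changed = new_value != values[v]
            values[v] = new_value
            if superstep == 0 or changed:
                for target in edges.get(v, []):
                    outbox[target].append(new_value)   # delivered next superstep
            active.discard(v)                  # vote to halt
        inbox = outbox
        superstep += 1
    return values, superstep

final, steps = pregel_max(
    {"a": 3, "b": 6, "c": 2, "d": 1},
    {"a": ["b"], "b": ["a", "d"], "c": ["b", "d"], "d": ["c"]},
)
```

The loop ends exactly when both termination conditions hold: no vertex is active and no message is in flight. On this four-vertex graph that happens after four supersteps, with every vertex holding 6.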
4.1 Pregel
system structure
The Pregel system also uses a master/slave model
- Master node: schedules slave nodes, recovers from slave node failures
- Slave nodes: process their own tasks, communicate with other slave nodes
Aggregator (used for? aggregated with what structure?)
For global communication, global data and monitoring
At the end of the superstep, the partially aggregated values from each slave node are aggregated in a tree structure
Pregel execution (5 steps)
① The master node splits the graph and assigns one or more parts to each slave node
② The master node instructs each slave node to execute a superstep
③ Finally, the master node instructs each slave node to save its own graph
4.2 GraphX
Join site selection using routing tables
Caching for iterative mrTriplets
Aggregation for iterative mrTriplets
fault tolerance
- Checkpoint: The master node periodically instructs the slave nodes to save the state of the partition to persistent storage
- Error detection: timed use of "ping" messages
- Recovery
The master reassigns graph partitions to the currently available slaves
All workers reload their partition state from the latest available checkpoint
- Partial recovery: log outgoing messages, so that recovery only involves the partitions being recovered
Chapter 5 Big Data Storage
Small-probability events will become the norm on a large scale (what are small-probability events?)
Disk or machine damage, RAID card failure, network failure, power failure, data errors, system anomalies
HDFS
related terms
HDFS | GFS | MooseFS | Description |
---|---|---|---|
NameNode | Master | Master | Provides the file system's directory information, block information, and data block locations, and manages each data server. |
DataNode | Chunk Server | Chunk Server | Each file in the distributed file system is divided into several data blocks, and the data blocks are stored on different servers. |
Block | Chunk | Chunk | Each file is divided into several blocks (default 64MB); each block holds a contiguous piece of the file's content and is the basic unit of storage. |
Packet | none | none | The client buffers written data and, once a full packet has accumulated, writes it to the file system in one go. |
Chunk | none | Block(64KB) | Within each packet, the data is cut into smaller chunks (512 bytes), each paired with a checksum (CRC). Such a chunk is the unit of transmission. |
Secondary NameNode | none | Metalogs | A standby master server that pulls the master server's logs, waiting to be promoted if the master fails. |
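The per-chunk checksum in the table can be illustrated with `zlib.crc32` from the Python standard library. This is only a sketch of the idea, not HDFS code; the helper names are invented, and `zlib.crc32` stands in for whatever CRC variant the file system actually uses.

```python
import zlib

CHUNK_SIZE = 512  # bytes per transmission chunk, as in the table

def checksum_chunks(data):
    """Cut data into 512-byte chunks and pair each with its CRC-32."""
    return [(data[i:i + CHUNK_SIZE], zlib.crc32(data[i:i + CHUNK_SIZE]))
            for i in range(0, len(data), CHUNK_SIZE)]

def corrupted_chunks(chunks):
    """Return the indices of chunks whose stored CRC no longer matches."""
    return [i for i, (payload, crc) in enumerate(chunks)
            if zlib.crc32(payload) != crc]

data = bytes(range(256)) * 5          # 1280 bytes -> chunks of 512, 512, 256
chunks = checksum_chunks(data)
damaged = list(chunks)
damaged[1] = (b"X" + damaged[1][0][1:], damaged[1][1])  # corrupt chunk 1
```

Because each chunk carries its own checksum, a receiver can tell which 512-byte piece was damaged rather than rejecting the whole packet.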
Core functions
Function | Description |
---|---|
Namespace | Hierarchical file system namespace (directories and files) |
Shell commands | Interact directly with HDFS and other Hadoop-supported file systems |
Data replication | |
Rack awareness | With a replication factor of 3, the default placement policy stores one replica on a node in the local rack, another on a node in a different (remote) rack, and the last on a different node in that same remote rack |
EditLog | The core of the entire logging system |
Cluster balancing | |
Space reclamation | |
Read file process (5 steps?)
① HDFS Client initiates an RPC request to the remote NameNode
② NameNode returns, for each block of the file, the list of DataNodes that hold a copy
③ Client selects the DataNode closest to it to read the block
④ After reading a block, the Client closes the connection to that DataNode and finds the best DataNode for the next block
⑤ If the file has not been fully read, the Client obtains the next batch of block lists from the NameNode
Write file process (5 steps? How to write data?)
① HDFS Client initiates an RPC request to the remote NameNode
② NameNode checks whether the file already exists and whether the client has permission to create it
③ The Client divides the file into multiple packets, applies to the NameNode for new blocks, and obtains a list of DataNodes suitable for storing them
④ The Client writes each packet to the first DataNode; after storing it, each DataNode passes the packet on to the next, forming a pipeline
⑤ The last DataNode returns an ack packet, which travels back through the pipeline to the Client. On receiving it, the Client removes the corresponding packet from the ack queue
Data writing process summary
Data writing method | Advantage | Disadvantage |
---|---|---|
Chain write | Load balancing | Chain too long |
Master-slave write | Short chain | Single-point pressure |