Big data system self-test

Chapter 1 Overview of Big Data Computing System

1.1 Overview of Big Data Computing Framework

Hadoop

Hadoop's running process (5 steps?)

split => map => shuffle => reduce => output
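
A minimal pure-Python sketch (not Hadoop code) that mimics these five phases for a word count; the input lines and function name are illustrative:

```python
from collections import defaultdict

def run_wordcount(lines):
    # split: each input line stands in for one input split handed to a map task
    # map: emit (word, 1) pairs
    mapped = [(word, 1) for line in lines for word in line.split()]

    # shuffle: group intermediate pairs by key
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # reduce: sum the values for each key
    reduced = {key: sum(values) for key, values in groups.items()}

    # output: return (or write) the final key/value pairs
    return reduced

print(run_wordcount(["big data", "big compute"]))  # {'big': 2, 'data': 1, 'compute': 1}
```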

The detailed job execution process in Hadoop? (4 major phases, with 6+6+6+2 sub-steps)

  1. Create a new Job instance and schedule HDFS resources
  2. Enable MapTask to execute map function
  3. Start ReduceTask to execute the reduce function
  4. JobClient polls to know that the task is completed

What is the difference between Job and Task?

Job: A complete calculation process specified by the MapReduce program

Task: the basic unit of parallel computation in the MapReduce framework

A job (Job) can be split into several Map and Reduce tasks (Task) during execution

MapReduce scheduler (three? What is the default? The order of jobs executed?)

FIFO (the default; jobs run in submission order), Fair, Capacity

Two sorts after Map (what sorting algorithm is used respectively? Is the object a file or multiple files?)

Quick sort inside the file (Sort)

Merge and sort multiple files (Merge)
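
A small sketch of the two sorting steps, assuming the map side has produced a few in-memory spill files; Python's `sorted()` stands in for the in-file quick sort, and `heapq.merge` stands in for the multi-way merge across files:

```python
import heapq

# Sort: in-memory sort inside each spill file (sorted() here instead of quick sort)
spill_files = [[("b", 1), ("a", 1)], [("c", 1), ("a", 1)]]
sorted_spills = [sorted(spill) for spill in spill_files]

# Merge: multi-way merge of the sorted spill files into one sorted stream
merged = list(heapq.merge(*sorted_spills))
print(merged)  # [('a', 1), ('a', 1), ('b', 1), ('c', 1)]
```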

MapReduce task processing process (six steps?)

Large input data set => split into pieces => submitted to the master node => dispatched to map nodes, which do some local sorting and combining => sent to the reduce nodes

Failure node processing (what happens if the master node fails? What about the failure of the working node?)

Master node failure: the master writes periodic checkpoints, so it can be restarted from the most recent valid checkpoint, avoiding the waste of recomputing everything from scratch.

Working node failure: if the master node detects that a working node is not responding, it marks that node as failed. The master node then reschedules the failed node's tasks onto other worker nodes for execution.

Disadvantages of MapReduce 1.0 (mainly in which two aspects and which three directions?)

JobTracker is the centralized control point of MapReduce, so there is a single point of failure.

JobTracker takes on too many responsibilities, causing excessive resource consumption

YARN (what concepts are introduced? What are the three main parts and what do they do?)

ApplicationMaster: per-application manager; applies for resources and assigns tasks

ResourceManager: global manager; monitors ApplicationMasters and NodeManagers, and handles resource allocation and scheduling

NodeManager: per-node agent; manages the node's resources and accepts commands

1.2 Big Data Batch Computing Framework

Spark

RDD concept (full name? Can it be partitioned? Is it kept in memory or on disk?)

Resilient Distributed Datasets

Different partitions of an RDD can be computed in parallel on different nodes in the cluster

Kept in memory

RDD operations (what are the three types? What are they for?)

action, transformation, persistence

The execution process of RDD (how to create? How to generate different RDD? How to output to external data source? What are the advantages?)

read in external data sources

transformation

action
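
A hedged PySpark sketch of this flow (create an RDD from an external source, chain transformations, persist, then trigger actions); the HDFS paths are placeholders:

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="rdd-demo")

# Create an RDD by reading an external data source (path is illustrative)
lines = sc.textFile("hdfs:///tmp/input.txt")

# Transformations: lazily build new RDDs from existing ones (nothing runs yet)
words = lines.flatMap(lambda line: line.split())
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# Persistence: keep the intermediate RDD in memory for reuse
counts.persist(StorageLevel.MEMORY_ONLY)

# Actions: trigger the actual computation and return / write results
print(counts.count())
counts.saveAsTextFile("hdfs:///tmp/output")
```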

The reasons for RDD's high efficiency (how is fault tolerance achieved? In memory or on disk? What kind of data can it store?)

Fault tolerance through lineage: lost partitions can be recomputed, rather than relying on data replication and logging

Intermediate results are persisted in memory, and data is passed between RDD operations in memory

The stored data can be Java objects, avoiding unnecessary serialization and deserialization

basic concept

  • DAG: what is it an abbreviation for?
  • Executor: a process running where? Responsible for running what?
  • What is the relationship between Application, Job, Stage, and Task? What are they for?

Directed Acyclic Graph

A process running on the worker node (WorkerNode), responsible for running the Task

Application > Jobs > Stages > Tasks

Architecture design (what are the three layers and their names? What is in the third layer? What are the two kinds of Nodes? What storage structures is it based on?)

Driver Program(SparkContext)

Cluster Manager

Worker Nodes(Executor(Task))

HDFS, HBase

7 steps for Spark to run code?

① Driver parses and generates Task

② Driver applies for resources from Cluster Manager

③ Cluster Manager allocates resources and nodes, and creates Executor

④ Executor registers with Driver

⑤ Driver passes the code and files to Executor

⑥ Executor runs Task

Shuffle operation (what is the situation of wide dependency and narrow dependency?)

Wide dependencies: one-to-many and many-to-many

Narrow dependencies: many-to-one or one-to-one
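
A short PySpark illustration of the two dependency types, assuming a local SparkContext: map and filter keep a one-to-one partition relationship (narrow), while reduceByKey regroups tuples by key across partitions (wide) and therefore triggers a shuffle:

```python
from pyspark import SparkContext

sc = SparkContext(appName="dependency-demo")
nums = sc.parallelize(range(10), numSlices=4)

# Narrow dependency: each output partition depends on exactly one input partition
doubled = nums.map(lambda x: x * 2)
evens = doubled.filter(lambda x: x % 4 == 0)

# Wide dependency: tuples with the same key may live in many input partitions,
# so reduceByKey shuffles data across the cluster and starts a new stage
pairs = evens.map(lambda x: (x % 3, x))
summed = pairs.reduceByKey(lambda a, b: a + b)

print(summed.collect())
```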

Stage division (what is the basis for division? Which kind of dependency is conducive to optimization?)

Reverse parsing of the DAG: merge the RDD into the current stage when a narrow dependency is encountered, and break into a new stage when a wide dependency is encountered. Narrow dependencies are conducive to optimization, because their operations can be pipelined.

RDD running process (4 steps? What happens in each?)

① Create an RDD object

② Create DAG, that is, the dependency relationship between RDDs, and then decompose it into multiple Stages, each with multiple Tasks

③ Task is assigned by TaskScheduler to Executor on WorkerNode for execution

④ Worker executes Tasks

RDD fault tolerance mechanism (maintaining information for reconstruction)

RDD maintains information that can be used to create lost partitions

The storage mechanism in Spark (Where does the RDD cache exist? How to obtain the data blocks corresponding to the partition from the disk? Where does the Shuffle data exist?)

  • RDD caching: including memory and disk-based caching

    Memory cache = hash table + access strategy

  • Persistence of Shuffle data: must be cached on disk

Chapter 2 Big Data Management System

Big Data Management System I

concept

Database definition?

A database is an organized, shareable collection of data stored in a computer for a long period of time.

DBMS (full name? Main function? What is DDL? What is DML?)

Database Management System

DDL (Data Definition Language): defines the data objects in the database.

DML (Data Manipulation Language): manipulates data to carry out the basic operations on the database.

DBS (includes what?)

database, database management system, application system, database administrator, user

database storage structure

  • RAID (what is it? What is it made of?)

    Redundant Array of Independent Disks

    An array composed of several identical disks

  • Organization of records in the file (what are the ways of organizing records? How are records placed in each?)

    Heap file organization: a record can be placed anywhere in the file where there is space

    Sequential file organization: records are stored in ascending or descending order of the search key, chained together with pointers

    Hash file organization: a hash function applied to some attribute value gives the storage address

    Clustered file organization: related records are stored in the same block

indexing technology

  • What is an index? Is it a file?
  • Index classification (two categories? What is the main index? The difference between clustering index and non-clustering index? What are the three types of index in clustering index?)
  • Index updates (what do delete and insert operations do for a dense index and a sparse index respectively?)

A small file, stored separately from the main file, that records only the index attribute (search key)

Two categories: ordered index vs hash index

Clustered vs non-clustered index: the difference is whether the index order is consistent with the order of records in the main file

Dense index, sparse index, multi-level index

Delete:
For a dense index, delete the corresponding index entry;
for a sparse index, if the search key of the deleted record appears in the index block, replace it with the search key A of the next record in the main file. If A already appears in the index block, simply delete the index entry of the deleted record.

Insert:
For a dense index, if the search key does not yet appear in the index block, insert it into the index.
For a sparse index: if the data block has free space for the new record, the index does not need to change; otherwise, add a new data block and insert a new index entry into the index block
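
A toy sketch of the lookup difference behind these update rules: a dense index keeps one entry per record, while a sparse index keeps one entry per block, so a sparse lookup first locates the right block and then scans it. The data and structures are made up for illustration:

```python
from bisect import bisect_right

# Main file stored as blocks of sorted records: (search_key, payload)
blocks = [
    [(1, "a"), (3, "b")],
    [(5, "c"), (7, "d")],
    [(9, "e"), (11, "f")],
]

# Dense index: one entry per record -> (key, (block_no, offset))
dense = {key: (b, i) for b, block in enumerate(blocks)
         for i, (key, _) in enumerate(block)}

# Sparse index: one entry per block, keyed by the block's first search key
sparse_keys = [block[0][0] for block in blocks]

def lookup_sparse(key):
    # Find the last block whose first key <= key, then scan that block
    b = bisect_right(sparse_keys, key) - 1
    if b < 0:
        return None
    for k, payload in blocks[b]:
        if k == key:
            return payload
    return None

print(dense[7])          # (1, 1): block 1, offset 1
print(lookup_sparse(7))  # 'd'
```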

Transactions

  • Definition, what does it consist of?
  • ACID nature of transactions?

A logical unit of work in a DBMS, usually consisting of a set of database operations

Atomicity
Consistency
Isolation
Durability

I/O Parallelism

  • Partitioning techniques (definition with number of disks = n, advantages, shortcomings)
    Round-robin partitioning
    Hash partitioning
    Range partitioning

| Partitioning technique | Definition (number of disks = n) | Advantage | Shortcoming |
| --- | --- | --- | --- |
| Round-robin partitioning | tuple i goes to disk i mod n | Best for sequential scans | Difficult to handle range queries |
| Hash partitioning | hash function h with range 0 ... n-1 on the partitioning attribute | Good for sequential access and point queries on the partitioning attribute | No clustering, so range queries are difficult to answer |
| Range partitioning | partitioning vector [v_0, v_1, ..., v_{n-2}] | Clustering supports point and range queries on the partitioning attribute | Risk of data skew |
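
A small Python sketch of the three declustering strategies for n disks; the tuples, keys, and partitioning vector are illustrative:

```python
from bisect import bisect_right

n = 4  # number of disks

def round_robin(i):
    # Tuple number i goes to disk i mod n
    return i % n

def hash_partition(key):
    # Hash function with range 0 .. n-1 over the partitioning attribute
    return hash(key) % n

partition_vector = [10, 20, 30]  # [v0, ..., v_{n-2}] for range partitioning

def range_partition(key):
    # Disk j holds keys with v_{j-1} <= key < v_j (first/last ranges open-ended)
    return bisect_right(partition_vector, key)

print(round_robin(6), hash_partition("user42"), range_partition(25))
```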

skewed handling

  • Types of skew (what are the 2 types of skew? What are the two cases of partition skew?)

  • Dealing with skew (3 ways?)

  1. Attribute-value skew: some values appear in the partitioning attribute of many tuples, and all tuples with the same value of the partitioning attribute end up in the same partition

  2. Partition skew

    • Range partitioning: a bad partition vector may assign too many tuples to one partition and too few tuples to other partitions

    • Hash partitioning: unlikely to happen as long as a good hash function is chosen

  3. Dealing with skew in range partitioning: one way to generate a balanced partition vector is to scan the relation in sort order and, after every 1/n of the relation has been read, add the partitioning-attribute value of the next tuple to the partition vector (a sketch follows this list)

  4. Dealing with skew using histograms: it is relatively straightforward to construct a balanced partition vector from a histogram

  5. Use virtual processors to handle skew: skewed virtual partitions are spread across several real processors
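
A sketch of the balanced-vector idea, under the assumption that the relation's partitioning-attribute values can be scanned in sorted order: after each 1/n fraction of the tuples, the next value becomes a split point. The key list and function name are illustrative:

```python
def balanced_partition_vector(sorted_keys, n):
    """Pick n-1 split points so each range gets about len(sorted_keys)/n tuples."""
    step = len(sorted_keys) / n
    return [sorted_keys[int(round(step * j))] for j in range(1, n)]

keys = sorted([5, 7, 7, 7, 8, 12, 13, 40, 41, 95, 96, 99])
print(balanced_partition_vector(keys, 4))  # [7, 13, 95]
```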

inter-query parallelism

Increases transaction throughput; mainly used to scale up transaction processing systems so that they support a larger number of transactions per second

Cache Coherency Protocol

• Before reading/writing a page, the page must be shared/exclusively locked
• When locking a page, the page must be read from disk
• Before releasing the page lock, the page must be written to disk if updated

Intra-query parallelism (two complementary forms of intra-query parallelism)

Intra-operation parallelism - each individual operation within a query is executed in parallel

Inter-operation parallelism - different operations within a query are executed in parallel

Big Data Management System II

Introduction to NoSQL

Not Only SQL

Typical NoSQL databases usually include key-value databases, column family databases, document databases, and graph databases

CAP

C (Consistency): Consistency means that any read operation can always read the result of the previously completed write operation;

A (Availability): Availability refers to the rapid acquisition of data and the return of operation results within a certain period of time;

P (Tolerance of Network Partition): Partition tolerance means that when a network partition occurs, the separated parts of the system can still operate normally.

  1. CA: It means emphasizing consistency (C) and availability (A), and giving up partition tolerance (P). The easiest way is to put all transaction-related content on the same machine. Obviously, this approach will seriously affect the scalability of the system. Traditional relational databases adopt this design principle, so the scalability is relatively poor
  2. CP: That is, emphasizing consistency (C) and partition tolerance (P), giving up availability (A). When a network partition occurs, the affected service needs to wait for data consistency, so it cannot provide external services during the waiting period
  3. AP: That is, emphasizing availability (A) and partition tolerance (P), giving up consistency (C), allowing the system to return inconsistent data

BASE

BASE (Basically Available, Soft state, Eventual consistency)

The basic meaning of BASE is basically available (Basically Available), soft state (Soft state) and eventual consistency (Eventual consistency)

NewSQL


NewSQL is an abbreviation for various new scalable/high-performance databases

  • Capable of storage and management of massive data with NoSQL
  • It also maintains the characteristics of traditional databases such as ACID and SQL

NewSQL features:
Support relational data model
Use SQL as the main interface

Chapter 3 Big Data Real-time Computing Framework

3.1 Storm

definition?

Real-time, distributed, streaming computing system

What are the typical application scenarios of Storm (synchronous, asynchronous, data stream processing, continuous computing, distributed remote procedure calls)?

  • Request response (synchronous): real-time image processing, real-time web page analysis

  • Stream processing (asynchronous): item-by-item processing, analysis and statistics

  • Data stream processing: It can be used to process new data and update databases in real time, with fault tolerance and scalability.

  • Continuous calculation: Continuous query can be performed and the results can be fed back to the client in real time.

  • Distributed Remote Procedure Call

Features of Storm

Reliable, fast, high fault tolerance, horizontal expansion

Technical architecture (what are the three parts?)

Nimbus (similar to JobTracker)

Zookeeper

Supervisor (similar to TaskTracker)

Worker (similar to Child)

The relationship between Worker, Executor and Task

A Worker is a process that runs one or more Executors (threads), and each Executor runs one or more Tasks - analogous to the hierarchy seen earlier

Storm's workflow

Client submits the Topology => Nimbus stores the task assignments in Zookeeper => Supervisor obtains its assigned tasks from Zookeeper and starts Workers => Workers execute the specific Tasks

Storm fault tolerance (how are task-level failures handled? What happens on task slot failures and cluster node (machine) failures? What happens when the Nimbus node fails?)

task level failure

  • A bolt task crashes so a message is not acknowledged, or the acker task itself fails
    => the fail method of the Spout will be called.

  • Spout task failure
    => the external device connected to the spout task (such as an MQ) is responsible for the integrity of the messages.

Cluster node (machine) failure

  • Node failure in the Storm cluster: tasks are shifted to other nodes
  • Node failure in the Zookeeper cluster: as long as fewer than half of the machines are down, Zookeeper keeps running

Nimbus node failure: without Nimbus, Workers will not be rescheduled to other hosts when necessary, and clients will not be able to submit new tasks.

What is Stream? What are Spouts? What are Tuples? What are Bolts? What is Topology?

Stream: an infinite sequence of Tuples

Tuples: the basic unit of data, a named list of values (fields)

Spouts: faucets, the source of a Stream

Bolts: process Tuples, create new Streams

Topology: an abstract network of Spouts and Bolts

What is Stream Grouping? There are 6 ways?

Used to tell Topology how to transfer Tuple between two components (Spouts, Bolts)

ShuffleGrouping: Random grouping
FieldsGrouping: Grouping by field
AllGrouping: Broadcast sending, all Tuples are sent to all Tasks
GlobalGrouping: Global grouping, all Tuples are sent to the same Task
NonGrouping: No grouping
DirectGrouping: Specify sending, specify receiving
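
A toy sketch (not the Storm API) of how shuffle, fields, and global grouping choose the target task for a tuple; the task ids and tuple format are made up:

```python
import random

TASKS = [0, 1, 2, 3]  # downstream bolt tasks

def shuffle_grouping(tuple_):
    # Random grouping: tuples are spread over tasks at random
    return random.choice(TASKS)

def fields_grouping(tuple_, field):
    # Tuples with the same value of `field` always reach the same task
    return TASKS[hash(tuple_[field]) % len(TASKS)]

def global_grouping(tuple_):
    # All tuples go to a single task
    return TASKS[0]

t = {"user": "alice", "clicks": 3}
print(shuffle_grouping(t), fields_grouping(t, "user"), global_grouping(t))
```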

3.2 Spark Streaming

The input data stream is divided into segments by time slice to form a DStream, and each segment of data is converted into a Spark RDD

The core concept of Spark Streaming

  • What does DStream represent?
  • What does Transformations do? (Standard RDD operation, stateful operation)

A sequence of RDDs representing data streams

Transformations: modify one DStream to create another DStream
Standard RDD operations: map, countByValue, reduce, insert...
Stateful operations: window, countByValueAndWindow...
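
A hedged PySpark sketch using the classic DStream API: a socket source feeding a per-batch count plus a windowed count. The host, port, and checkpoint directory are placeholders:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-demo")
ssc = StreamingContext(sc, 1)                    # 1-second time slices (batches)
ssc.checkpoint("/tmp/checkpoint")                # required by windowed/stateful ops

lines = ssc.socketTextStream("localhost", 9999)  # basic (built-in) input source

# Standard RDD-style operations applied batch by batch
words = lines.flatMap(lambda line: line.split())
counts = words.countByValue()

# Stateful/windowed operation: counts over the last 30 s, sliding every 10 s
windowed = words.countByValueAndWindow(30, 10)

counts.pprint()
windowed.pprint()
ssc.start()
ssc.awaitTermination()
```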

The input source of DStream

  • Basic sources? (e.g., file streams and socket connections)
  • Advanced sources? (e.g., Kafka and Flume)

Spark fault tolerance

  • RDDs can remember the sequence of operations that created it from the original fault-tolerant input
  • Batches of input data are replicated in memory across multiple worker nodes and are therefore fault tolerant

Some comparisons

Comparison between Spark Streaming and Storm

| Spark Streaming | Storm |
| --- | --- |
| Cannot achieve millisecond-level stream computing | Can respond at the millisecond level |
| Its low-latency execution engine can be used for (near) real-time computing | |
| RDD data sets make efficient fault-tolerant processing easier than in Storm | |

Functional Correspondence between Storm and Hadoop Architecture Components

| | Hadoop | Storm |
| --- | --- | --- |
| Application name | Job | Topology |
| System roles | JobTracker | Nimbus |
| | TaskTracker | Supervisor |
| Component interface | Map/Reduce | Spout/Bolt |

Chapter 4 Big Graph Computing Framework

Computational model

Superstep: Parallel Node Computing

  • For each node (six possible operations; see the sketch after this list)

  • Termination condition (two)

  • Accepts a message sent by the previous superstep
    Executes the same user-defined function
    Modifies its value or the value of its outgoing edges
    Sends messages to other points (accepted by the next superstep)
    Changes the topology of the graph
    Ends the iteration when there is no more work to do

  • All vertices become inactive at the same time
    No message is passed
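
A compact Python sketch of the vertex-centric model (not Pregel's actual API), using the classic maximum-value example: in each superstep every active vertex consumes messages from the previous superstep, updates its value, sends messages along its out-edges, and votes to halt when nothing changes; the loop stops when no vertex is active and no messages are in flight. The graph is illustrative:

```python
from collections import defaultdict

# vertex -> (initial value, out-edges); the graph is made up
graph = {"a": (3, ["b"]), "b": (6, ["a", "c"]), "c": (2, ["b"])}
values = {v: val for v, (val, _) in graph.items()}

messages = defaultdict(list)
active = set(graph)        # at superstep 0 every vertex is active
superstep = 0

while active:
    outbox = defaultdict(list)
    next_active = set()
    for v in active | set(messages):
        # 1. receive messages sent in the previous superstep
        inbox = messages.get(v, [])
        new_value = max(inbox + [values[v]])
        changed = new_value > values[v]
        values[v] = new_value
        # 2. if the value changed (or this is superstep 0), send it along out-edges
        if changed or superstep == 0:
            for w in graph[v][1]:
                outbox[w].append(new_value)
            next_active.add(v)
        # 3. otherwise the vertex votes to halt; a later message reactivates it
    messages = outbox
    active = next_active
    superstep += 1

print(values)  # every vertex ends with the global maximum: {'a': 6, 'b': 6, 'c': 6}
```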

4.1 Pregel

system structure

The Pregel system also uses a master/slave model

  • Master node: schedule slave nodes, fix slave node errors
  • Slave nodes: process their own tasks, communicate with other slave nodes

Aggregator (used for? aggregated with what structure?)

For global communication, global data and monitoring

At the end of a superstep, the partially aggregated values from each slave node are aggregated using a tree structure

Pregel execution (5 steps)

① The master node splits the graph and assigns one or more parts to each slave node

② The master node instructs each slave node to execute a superstep

③ Finally, the master node instructs each slave node to save its own graph

4.2 GraphX

Join site selection using routing tables

Caching for iterative mrTriplets

Aggregation for iterative mrTriplets

fault tolerance

  • Checkpoint: The master node periodically instructs the slave nodes to save the state of the partition to persistent storage
  • Error detection: timed use of "ping" messages
  • Recovery
    The master reassigns graph partitions to the currently available slaves; all workers reload their partition state from the latest available checkpoint
  • Partial recovery: log outgoing messages; recovery only involves the lost partitions

Chapter 5 Big Data Storage

Small-probability events will become the norm on a large scale (what are small-probability events?)

Disk or machine damage, RAID card failure, network failure, power failure, data errors, system anomalies

HDFS

related terms

| HDFS | GFS | MooseFS | Description |
| --- | --- | --- | --- |
| NameNode | Master | Master | Provides the file system's directory information, block information and block locations, and manages the data servers |
| DataNode | Chunk Server | Chunk Server | Each file in the distributed file system is split into several data blocks, and each data block is stored on a different server |
| Block | Chunk | Chunk | Each file is split into several blocks (64 MB by default); each block holds a contiguous piece of file content and is the basic unit of storage |
| Packet | none | none | Data is accumulated into a Packet and then written to the file system in one go |
| Chunk | none | Block (64 KB) | Within each packet, the data is cut into smaller pieces (512 bytes), each with a parity check code (CRC); such a piece is a transmission block |
| Secondary NameNode | none | Metalogger | A standby master server that keeps pulling the primary server's logs, standing by to take over |

Core functions

| Function | Description |
| --- | --- |
| Namespace | |
| Shell commands | Interact directly with HDFS and other file systems supported by Hadoop |
| Data replication | |
| Rack awareness | Storage strategy: one replica on a node in the local rack, one on a node in a different rack, and one on another node in the local rack |
| Editlog | The core of the entire log system |
| Cluster balancing | |
| Space reclamation | |

Read file process (5 steps?)

① HDFS Client initiates an RPC request to the remote Namenode

② Namenode returns the DataNode list of the block copy of the file

③ Client selects the DataNode close to the client to read the block

④ If the file reading is not over yet, the Client continues to obtain the next batch of block lists from the NameNode

⑤ After reading, close the connection with DataNode and find the best DataNode for reading the next block
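
A conceptual sketch of this read path (not the real HDFS client API): ask the "NameNode" for block locations, pick the closest DataNode for each block, and read the blocks in order. All names and data structures are invented for illustration:

```python
# Simulated NameNode metadata: block id -> DataNodes holding a replica
block_locations = {
    "blk_1": ["dn3", "dn1", "dn7"],
    "blk_2": ["dn2", "dn5", "dn1"],
}
block_order = ["blk_1", "blk_2"]                 # blocks of the file, in order
distance_to_client = {"dn1": 1, "dn2": 3, "dn3": 2, "dn5": 4, "dn7": 5}

def read_block_from(datanode, blk):
    # Placeholder for the actual socket read from a DataNode
    return f"<{blk}@{datanode}>".encode()

def read_file():
    data = []
    for blk in block_order:                      # 1. RPC to the NameNode gave us this list
        replicas = block_locations[blk]          # 2. DataNode list for the block
        best = min(replicas, key=distance_to_client.get)  # 3. pick the closest DataNode
        data.append(read_block_from(best, blk))  # 4./5. read, then move to the next block
    return b"".join(data)

print(read_file())
```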

Write file process (5 steps? How to write data?)

① HDFS Client initiates an RPC request to the remote Namenode

② NameNode checks whether the file exists and whether it has the right to operate

③ Divide the file into multiple packets, apply for new blocks from the NameNode, and obtain a list of DataNodes suitable for storage

④ The client writes packets to the first DataNode in the pipeline; after storing a packet, each DataNode forwards it to the next DataNode, pipeline style

⑤ The last DataNode returns an ack packet that is passed back along the pipeline to the Client; when the client receives it, it removes the corresponding packet from its ack queue
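
A toy sketch of the pipelined (chain) write and ack path described above; the DataNode names and packet contents are illustrative:

```python
def store(dn, packet):
    # Placeholder for the actual block write on a DataNode
    print(f"{dn} stored {len(packet)} bytes")

def pipeline_write(packet, datanodes):
    """Write a packet along the DataNode chain, then propagate the ack back."""
    # Forward pass: each node stores the packet and hands it to the next one
    for dn in datanodes:
        store(dn, packet)
    # Ack pass: the last DataNode acks first; the ack travels back to the client
    for dn in reversed(datanodes):
        print(f"ack from {dn}")
    return True  # client then removes the packet from its ack queue

pipeline_write(b"x" * 512, ["dn1", "dn2", "dn3"])
```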

Data writing process summary

| Data writing method | Advantage | Shortcoming |
| --- | --- | --- |
| Chain (pipelined) write | Load balancing | The chain is too long |
| Master-slave write | Short chain | Pressure on a single point |
