Big data system focus

Chapter 1 Overview of Big Data Computing System

1 Overview of Big Data Computing Framework

Computing framework : an abstraction that provides general-purpose functionality on top of which users write code to implement their specific logic, resulting in application-oriented software.

Big Data Computing Framework : Computing framework for big data.

Hadoop

The running process of Hadoop

image-20221115205701832

The detailed operation process of Hadoop

image-20221115221408609

① MapReduce program
=> create a new JobClient instance
=> request a new JobId from the JobTracker, put the job resources into HDFS, and compute the number of input splits and map tasks
=> submit the job to the JobTracker and obtain a handle to the job's status object
=> the job submission request is placed in a queue for scheduling
=> the JobTracker retrieves the job's split information from HDFS and creates a corresponding number of TaskInProgress objects to schedule and monitor the Map tasks

② Extract the related resources (JAR package, data) from HDFS
=> create a TaskRunner to run the Map task
=> start a MapTask in a separate JVM to execute the map function
=> intermediate results are periodically buffered in memory
=> the buffer is spilled to disk
=> progress is reported periodically

③ Assign Reduce tasks
=> create a TaskRunner to run the Reduce task
=> start a ReduceTask in a separate JVM to execute the reduce function
=> download intermediate result data from the Map nodes
=> write results to temporary output files
=> report progress periodically

④ JobClient polls and learns that the job has completed
=> notifies the user

The difference between Job and Task

Job : during execution, a job is split into several Map and Reduce tasks

Task : the basic unit of work for parallel computing

MapReduce scheduler

(FIFO) First-in-first-out scheduler (default): jobs are ordered first by priority and then by arrival time; preemption is not supported

(Fair) Fair scheduler : resources are shared fairly among users, and among the jobs each user submits; resource preemption is supported

(Capacity) Capacity scheduler : for each queue, compute the ratio between the number of running tasks and the computing resources the queue should be allocated, pick the queue with the smallest ratio, and then select a job from that queue in FIFO order

Two sorts after the Map phase

In-file quick sort (Sort): after the map function processes its input, the intermediate data is written to one or more files on the local machine, and the records inside each file are sorted with quicksort

Multi-file merge sort (Merge): after the Map task finishes, these sorted files are merge-sorted and the sorted result is written to one large file

MapReduce task processing process

Input data to be processed => split into partitions => submitted to the master node => dispatched to map nodes, which also do some local sorting and combining => results sent to reduce nodes
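To make the split => map => combine => reduce flow concrete, here is a minimal WordCount sketch against the classic Hadoop MapReduce Java API. It is an illustration only; the input and output paths come from the command-line arguments.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input split assigned to this task
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts for each word; the same class also works as a map-side combiner
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // map-side combining, as described above
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```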

Failure Node Handling

Checkpoints are periodically set in the master node to check the execution of the entire computing job

Master node failure: once it fails, execution can resume from the most recent valid checkpoint, avoiding the waste of recomputing from scratch.

Worker node failure: if the master node gets no response from a worker node, the worker is considered failed, and the master reschedules that worker's failed tasks to other worker nodes.

MapReduce 1.0 Disadvantages

  • JobTracker is the centralized processing point of MapReduce, so it is a single point of failure.
  • JobTracker takes on too many responsibilities, which consumes excessive resources; when there are many MapReduce jobs, the memory overhead becomes large.
  • Using the number of map/reduce task slots as the resource unit is too simplistic and does not take CPU/memory usage into account.

YARN

image-20221115231012981

2 Big Data Batch Computing Framework

Spark

RDD concept

  • Resilient Distributed Datasets
  • Each RDD can be divided into multiple partitions, and different partitions of an RDD can be computed in parallel on different nodes in the cluster
  • RDD is a collection of read-only record partitions and cannot be directly modified. RDDs can only be created from datasets in stable physical storage, or by performing deterministic transformation operations on other RDDs
  • The conversion operations between different RDDs form dependencies, which can realize data pipeline processing and avoid intermediate data storage

Operations on RDDs

  • Operations are divided into two types: "transformations" (Transformation) and "actions" (Action); a short Java sketch follows the table below

  • The transformation interfaces provided by RDDs are very simple: coarse-grained data transformations such as map, filter, groupBy, and join

| Transformations | Actions | Persistence |
| --- | --- | --- |
| Create new datasets from existing datasets. Lazy: they are only executed when an action runs. | Return a value to the driver or output data to a storage system after computation. | Cache datasets in memory for subsequent operations; optionally stored on disk, in RAM, or a hybrid of both. |
| map(func), filter(func), distinct() | count(), reduce(func), collect(), take(n) | persist(), cache() |
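As mentioned above, a short sketch with Spark's Java API makes the difference concrete: transformations only build new RDDs, persistence marks an RDD for caching, and nothing is computed until an action runs. This is a minimal local example with made-up input, not code from the course material.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class RddLazinessDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-laziness").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.parallelize(
                Arrays.asList("spark makes rdds", "rdds are read only", "rdds are lazy"));

            // Transformations: build new RDDs, nothing is computed yet
            JavaRDD<String> words = lines.flatMap(l -> Arrays.asList(l.split(" ")).iterator());
            JavaRDD<String> distinctWords = words.filter(w -> !w.isEmpty()).distinct();

            // Persistence: keep the result in memory for reuse by later actions
            distinctWords.persist(StorageLevel.MEMORY_ONLY());

            // Actions: trigger the actual computation and return results to the driver
            long count = distinctWords.count();
            System.out.println("distinct words = " + count);
            System.out.println(distinctWords.take(3));
        }
    }
}
```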

RDD execution process

  • Create: read in external data sources
  • The RDD then undergoes a series of "transformation" operations, each producing a new RDD that feeds the next transformation
  • The final RDD is processed by an "action" operation and the result is output to an external data source
  • Advantages: lazy evaluation, pipelining, no synchronous waiting, no need to materialize intermediate results, and each individual operation stays simple

The reason for the high efficiency of RDD

  • Fault tolerance: instead of data replication or logging, RDDs rely on their lineage (the recorded sequence of transformations that produced them)
  • The intermediate results are persisted to the memory, and the data is transferred between multiple RDD operations in the memory, avoiding unnecessary read and write disk overhead
  • Stored data can be Java objects, avoiding unnecessary object serialization and deserialization

Basic concepts

  • DAG: Directed Acyclic Graph
  • Executor: a process running on a worker node (WorkerNode), responsible for running Tasks
  • image-20221121202016155

architecture design

image-20221115232957239 image-20221116110254314

Shuffle operation

image-20221116104657828

Stage division

  • Stages are divided based on narrow and wide dependencies: narrow dependencies are very amenable to optimization (they can be pipelined), while wide dependencies, which require a shuffle, cannot be optimized this way

  • Every RDD operation can be viewed as a fork/join

  • Dividing stages: analyze the DAG in reverse, break it at wide dependencies, and merge (pipeline) narrow dependencies into the same stage; see the sketch after the figure below

    image-20221116105218320
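As referenced above, a rough Java sketch of how stage boundaries arise: map and filter are narrow dependencies that can be pipelined inside one stage, while reduceByKey introduces a wide (shuffle) dependency, so the DAG is cut there and a new stage begins. The data and names are made up.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class StageBoundaryDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("stage-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> logs = sc.parallelize(
                Arrays.asList("a 1", "b 2", "a 3", "b 4", "c 5"));

            // Stage 1: narrow dependencies (filter, map) are pipelined within each partition
            JavaPairRDD<String, Integer> pairs = logs
                .filter(line -> !line.isEmpty())
                .mapToPair(line -> {
                    String[] parts = line.split(" ");
                    return new Tuple2<>(parts[0], Integer.parseInt(parts[1]));
                });

            // Wide dependency: reduceByKey needs a shuffle, so the DAG is cut here
            // and a new stage begins
            JavaPairRDD<String, Integer> sums = pairs.reduceByKey(Integer::sum);

            System.out.println(sums.collect()); // action: triggers both stages
        }
    }
}
```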

RDD operation process

image-20221116110810412

RDD fault tolerance mechanism

image-20221116111043971

Storage mechanism in Spark

  • RDD caching: including memory and disk-based caching

    Memory cache = hash table + access strategy

  • Persistence of Shuffle data: must be cached on disk

Chapter 2 Big Data Management System

Definition of a database: a database is an organized, shareable collection of data stored in a computer for long-term use.

DBMS

  • A Database Management System (DBMS) is a layer of data management software that sits between the user and the operating system.
  • The main function of DBMS
    • Data definition function: Provides data definition language (DDL) for defining data objects in the database.
    • Data manipulation function: provide data manipulation language (DML), which is used to manipulate data to realize basic operations on the database (query, insert, delete and modify).
    • Operation and management of database
    • Database creation and maintenance functions

DBS

  • A database system (Database System, DBS for short) refers to the system composition after introducing a database into a computer system.
  • It consists of database, database management system (and its development tools), application system, database administrator and users.

database storage structure

  • Redundant Array of Independent Disks (RAID): the most commonly used external storage medium for database servers; it is an array composed of several identical disks, with many configurations ranging from RAID 0 to RAID 8.
  • Database data is stored in external storage in the form of files.
  • How records are organized within the file
    • Heap file organization: records can be placed anywhere in the file, in order of input. Delete and insert operations do not need to move data.
    • Sequential file organization: records are logically stored in ascending or descending order of lookup key values. Generally use the pointer chain structure.
    • Hash file organization: The value of an attribute value obtained through a hash function is used as the storage address of the record.
    • Clustering file organization: A file can store multiple related relationships, and related records are stored in the same block to improve I/O speed.

indexing technology

  • Index: a small file, separate from the records of the main file, that contains only the index attribute and is sorted by index value, so lookups can be very fast.

  • index classification

    1. Ordered index: an index built according to some sort order on the records.

      • Primary index (clustering index): the order of the index's search-key values matches the order of the main file. Such a file is called an index-sequential file, and there can be only one such index.
        Index record: a search-key value and a pointer to the first record in the main file with that value

        • Dense index: an index record (index entry) is created for every search-key value in the main file
        • Sparse index: an index record is created for only some of the search-key values in the main file
        • Multi-level index: a sparse index may still be very large, making queries inefficient, so another level of index is built on top of the primary index
      • Non-clustered index: the order of the index's search-key values does not match the order of the main file.

    2. Hash index: a value computed by applying a hash function to some attribute of the record is used as the bucket number in the storage space.

  • index update

    • Deletion:
      ① Delete the record from the main file.
      ② If multiple records match the index key value, the index need not be modified.
      Otherwise: for a dense index, delete the corresponding index entry;
      for a sparse index, if the index value of the deleted record appears in an index block, replace it with the search-key value of the next record in the main file; if that value already appears in the index block, simply delete the index entry of the deleted record.
    • Insertion:
      ① Use the search key of the inserted record to find the insertion position, and insert the record into the main file.
      ② For a dense index, if the search key does not yet appear in the index block, insert it into the index.
      ③ For a sparse index: if the data block has free space for the new record, the index need not be modified;
      otherwise, allocate a new data block and insert a new index entry into the index block
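To make the sparse-index lookup path concrete, here is a small self-contained Java sketch (assuming Java 16+ for records; it is not tied to any particular DBMS): a sparse index keeps one entry per data block, so a lookup binary-searches the index for the largest key not greater than the search key and then scans only that block.

```java
import java.util.ArrayList;
import java.util.List;

public class SparseIndexDemo {

    // One index entry per data block: the first key in the block and the block number
    record IndexEntry(int firstKey, int blockNo) {}

    static final int BLOCK_SIZE = 4;

    public static void main(String[] args) {
        // Main file: sorted records, grouped into blocks of BLOCK_SIZE keys
        int[] sortedKeys = {2, 5, 7, 9, 12, 15, 18, 21, 25, 28, 31, 35};

        // Build the sparse index: one entry per block
        List<IndexEntry> index = new ArrayList<>();
        for (int i = 0; i < sortedKeys.length; i += BLOCK_SIZE) {
            index.add(new IndexEntry(sortedKeys[i], i / BLOCK_SIZE));
        }

        System.out.println("block of key 18 = " + lookupBlock(index, 18)); // -> 1
        System.out.println("block of key 31 = " + lookupBlock(index, 31)); // -> 2
    }

    // Binary search for the last index entry whose firstKey <= searchKey
    static int lookupBlock(List<IndexEntry> index, int searchKey) {
        int lo = 0, hi = index.size() - 1, result = 0;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            if (index.get(mid).firstKey() <= searchKey) {
                result = index.get(mid).blockNo();
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return result; // the block that must be scanned for searchKey
    }
}
```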

Transactions

  • Definition: A transaction is a logical unit of work in a DBMS, usually consisting of a set of database operations
  • ACID properties of transactions
    • Atomicity: the operations that constitute a transaction are either all executed or none of them are executed.
    • Consistency: executing a transaction moves the database from one consistent state to another consistent state.
    • Isolation: in a multi-user environment, the execution of a transaction is not affected by other transactions executing at the same time.
    • Durability: once a transaction commits successfully, its effect on the database is permanent and is not lost due to subsequent failures.
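A minimal JDBC sketch of how atomicity shows up in application code: either both updates commit together or, on failure, both are rolled back. The connection URL and the account table are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TransferDemo {
    public static void main(String[] args) throws SQLException {
        // Hypothetical database URL and schema: account(id, balance)
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:bank", "sa", "")) {
            conn.setAutoCommit(false); // start an explicit transaction
            try (PreparedStatement debit = conn.prepareStatement(
                     "UPDATE account SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = conn.prepareStatement(
                     "UPDATE account SET balance = balance + ? WHERE id = ?")) {

                debit.setInt(1, 100); debit.setInt(2, 1); debit.executeUpdate();
                credit.setInt(1, 100); credit.setInt(2, 2); credit.executeUpdate();

                conn.commit();   // atomic: both updates become durable together
            } catch (SQLException e) {
                conn.rollback(); // atomic: neither update takes effect
                throw e;
            }
        }
    }
}
```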

I/O Parallel

  • Reduce the time it takes to retrieve relations from disk by partitioning them across multiple disks

  • Horizontal partitioning: the tuples of a relation are divided among multiple disks, and each tuple is stored on only one disk

| Partitioning technique | Definition (n = number of disks) | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Round-robin partitioning | Send the i-th tuple of the relation to disk (i mod n) | Best suited for queries that sequentially scan the entire relation; all disks hold nearly equal numbers of tuples, so the query workload is balanced across disks | Difficult to handle range queries; no clustering: tuples are scattered across all disks |
| Hash partitioning | 1. Select one or more attributes as the partitioning attributes. 2. Select a hash function h with range 0...n-1. 3. Let i be the result of applying h to a tuple's partitioning-attribute value; the tuple is sent to disk i | Good for sequential access; good for point queries on the partitioning attributes, which need to examine only a single disk, leaving the other disks available for other queries | No clustering, so range queries are difficult to answer |
| Range partitioning | 1. Select an attribute as the partitioning attribute and choose a partitioning vector $[v_0, v_1, \ldots, v_{n-2}]$. 2. Let v be a tuple's partitioning-attribute value: tuples with $v_i \le v < v_{i+1}$ go to disk i+1, tuples with $v < v_0$ go to disk 0, and tuples with $v \ge v_{n-2}$ go to disk n-1 | Provides data clustering by partitioning-attribute value; suitable for sequential access; suitable for point queries on the partitioning attribute (only one disk access needed); for range queries on the partitioning attribute, only one or a few disks are touched | |
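A compact Java sketch of the three placement rules above; the number of disks, the hash function, and the partitioning vector are illustrative choices, not values from the course material.

```java
import java.util.Arrays;

public class PartitioningDemo {
    static final int N_DISKS = 4;

    // Round-robin: the i-th tuple goes to disk (i mod n)
    static int roundRobin(long tupleIndex) {
        return (int) (tupleIndex % N_DISKS);
    }

    // Hash partitioning: hash the partitioning attribute into 0..n-1
    static int hashPartition(Object partitionAttr) {
        return Math.floorMod(partitionAttr.hashCode(), N_DISKS);
    }

    // Range partitioning with partition vector [v0, ..., v_{n-2}]:
    // v < v0 -> disk 0; v_i <= v < v_{i+1} -> disk i+1; v >= v_{n-2} -> disk n-1
    static final int[] PARTITION_VECTOR = {10, 20, 30}; // the n-1 boundaries

    static int rangePartition(int v) {
        int i = Arrays.binarySearch(PARTITION_VECTOR, v);
        if (i >= 0) {
            return i + 1;  // exact match on boundary v_i goes to disk i+1
        }
        return -(i + 1);   // insertion point = number of boundaries <= v
    }

    public static void main(String[] args) {
        System.out.println(roundRobin(7));        // 3
        System.out.println(hashPartition("key")); // some disk in 0..3
        System.out.println(rangePartition(5));    // 0  (v < v0)
        System.out.println(rangePartition(20));   // 2  (v1 <= v < v2)
        System.out.println(rangePartition(99));   // 3  (v >= v_{n-2})
    }
}
```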

Skew handling

  • Kinds of skew

    • Attribute-value skew (can occur with range partitioning and hash partitioning)
      • Some value occurs in the partitioning attribute of many tuples; all tuples with the same partitioning-attribute value end up in the same partition
    • Partition skew
      • Range partitioning: a badly chosen partition vector may assign too many tuples to some partitions and too few to others
      • Hash partitioning: unlikely to happen as long as a good hash function is chosen
  • Handling skew

    • Dealing with skew in range partitioning: a method for generating a balanced partitioning vector

      • Scan the relation in sorted order to construct the partitioning vector: after every 1/n of the relation has been read, the partitioning-attribute value of the next tuple is appended to the vector
      • Duplicates or imbalance can result if the partitioning attribute has duplicate values
      • Other techniques based on histograms are also used in practice
    • Dealing with skew with histograms

      • It is relatively straightforward to construct a balanced partition vector from the histogram
      • Histograms can be constructed by scanning relations or sampling relation tuples
    • Take advantage of virtual processors to handle skew

      • Skew in range partitioning can be handled nicely with virtual processors:

        Create a large number of virtual processors (say, 10 to 20 times the number of real processors)
        Assign relational tuples to virtual processors using any partitioning technique
        Map each virtual processor to real processors in a round-robin fashion

      • Basic idea:

        If normal partitioning would cause skew, the skew is likely to be spread over several virtual partitions
        Skewed virtual partitions are spread over several real processors, so the distribution becomes even
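A tiny Java sketch of the virtual-processor idea: tuples are first assigned to many virtual partitions by any partitioning technique, and the virtual partitions are then mapped to real processors round-robin, so a skewed key range is spread over several real processors. The counts are made up.

```java
public class VirtualProcessorDemo {
    public static void main(String[] args) {
        int realProcessors = 4;
        int virtualProcessors = realProcessors * 10; // e.g. 10x as many virtual partitions

        // Tuples are first assigned to virtual partitions by any partitioning technique,
        // then each virtual partition is mapped to a real processor round-robin.
        for (int vp = 0; vp < virtualProcessors; vp++) {
            int realProcessor = vp % realProcessors;
            // A skewed range of the key space now spans several consecutive virtual
            // partitions, which land on different real processors, evening out the load.
            if (vp < 8) { // print a few mappings as illustration
                System.out.println("virtual partition " + vp + " -> real processor " + realProcessor);
            }
        }
    }
}
```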

inter-query parallelism

  • Queries/transactions execute in parallel with each other
  • Increases transaction throughput; mainly used to scale transaction processing systems so that they support more transactions per second
  • More complex implementation on shared-disk or shared-nothing architectures
    • Messages must be passed between processors to coordinate locking and logging
    • Data in the local buffer may have been updated by another processor
    • Cache coherency must be maintained: reads and writes of buffered data must see the most recent version of the data

Cache Coherency Protocol

  • Examples of cache coherency protocols for shared-disk systems:
    • Before reading/writing a page, the page must be locked in shared/exclusive mode
    • When the page lock is acquired, the page must be read from disk
    • Before the page lock is released, the page must be written back to disk if it was updated

Intra-query parallelism

  • A single query is executed in parallel on multiple processors/disks: important to speed up time-consuming queries

  • Two complementary forms of intra-query parallelism:

    Intra-operation parallelism - each individual operation within a query is executed in parallel

    Inter-operation parallelism - the different operations within a query are executed in parallel

  • The first form scales well with increased parallelism, since each operation typically processes more tuples than the number of operations in the query

Chapter 3 Big Data Real-time Computing Framework

image-20221116144844979

Storm

Storm is a real-time, distributed and highly fault-tolerant computing system.

  • Distributed
    • Horizontal expansion: Improve processing capabilities by adding machines and increasing concurrency
    • Automatic fault tolerance: Automatically handle process, machine, and network exceptions
  • Real-time: data is not written to disk, with low latency (millisecond level)
  • Streaming: data is constantly flowing in, processed, and out

Typical application scenarios of Storm

  • Request response (synchronous): real-time image processing, real-time web page analysis

  • Stream processing (asynchronous): item-by-item processing, analysis and statistics

  • Data stream processing: It can be used to process new data and update databases in real time, with fault tolerance and scalability.

  • Continuous calculation: Continuous query can be performed and the results can be fed back to the client in real time.

  • Distributed Remote Procedure Call

Features of Storm

  • Reliable message processing: Storm guarantees that each message will be fully processed at least once. When a task fails, it takes care of retrying the message from the source.
  • Fast: The design of the system ensures that messages can be processed quickly
  • Local Mode: A Storm cluster can be fully simulated during processing, enabling rapid development and unit testing.
  • Fault tolerance: Storm manages worker process and node failures.
  • Horizontal scaling: Computations are performed in parallel across multiple threads, processes, and servers.

Technology Architecture

image-20221116164335239

The relationship between Worker, Executor and Task

image-20221116165446139

Storm's workflow

image-20221116170049775

Storm fault tolerance

  • Task-level failures

    • A bolt task crashes, so a message is not acknowledged: all messages associated with this bolt task held by the acker will fail due to timeout
      => the Spout's fail method will be called.

    • The acker task fails: all messages it was tracking before the failure will time out and fail
      => the Spout's fail method will be called.

    • The spout task fails
      => the external device connected to the spout task (such as a message queue) is responsible for the completeness of the messages.

  • Task slot failure: task-level failure + cluster node (machine) failure

    image-20221116173546327
  • Nimbus node failure

    If the Nimbus node is lost, the Workers will continue to run, and if a worker dies, its Supervisor will restart it.

    However, without Nimbus, workers cannot be reassigned to other hosts when necessary, and clients cannot submit new topologies.

image-20221116160448291 image-20221116160745709 image-20221116161236754 image-20221116161757034 image-20221116172420039

Storm Development API

  • Spout
    nextTuple(): Callback function, triggered cyclically
    ack(id): Callback function, triggered when the message is successfully processed
    fail(id): Callback function, triggered when the message times out

  • Bolt
    execute(Tuple input): callback function, data trigger
    collector.emit(tuple): send tuple downstream through collector
    collector.ack(tuple): confirm through collector that the input tuple has been processed

    public class MD5Topology {

        public static class MD5Bolt extends BaseBasicBolt {

            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                String input = tuple.getString(0); // the actual input data from the DRPCSpout
                String output = MD5Util.getMD5Str(input);
                // Emit data to the downstream ReturnBolt:
                // the first field is the computed result (the MD5 string);
                // the second field is the return-info from the DRPCSpout, a JSON string
                // containing the DRPC request id and the server host and port
                collector.emit(new Values(output, tuple.getString(1)));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // Declare the two output fields, matching the emit above
                declarer.declare(new Fields("result", "return-info"));
            }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();

            // Arguments: spout/bolt id, spout/bolt instance, parallelism
            builder.setSpout("DRPCSpout", new DRPCSpout(args[0]), 2);
            builder.setBolt("MD5Bolt", new MD5Bolt(), 4)
                .shuffleGrouping("DRPCSpout"); // specify the upstream component and the grouping
            builder.setBolt("ReturnBolt", new ReturnResults(), 2)
                .shuffleGrouping("MD5Bolt");

            Config conf = new Config();
            conf.setNumWorkers(4); // number of workers
            StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
        }
    }

Spark Streaming

image-20221116160026036 image-20221116162110054 image-20221116162809890

The core concept of Spark Streaming

  • DStream: A sequence of RDDs representing data streams

  • Transformations - modify data from one DStream to create another DStream

    Standard RDD operations: map, countByValue, reduce, insert...
    Stateful operations: window, countByValueAndWindow...

  • Output operations: transfer data to external entities

    saveAsHadoopFiles: save to HDFS
    foreach: process each batch of results

The input source of DStream

  • Base sources: Sources directly available in the StreamingContext API.
  • Advanced sources: such as Kafka, Flume, Kinesis, Twitter, etc. can be created with additional utility classes.
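A minimal Spark Streaming sketch in Java tying these concepts together: a DStream built from a basic source (a socket), a couple of transformations, and an output operation. The host, port, and batch interval are placeholders.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("streaming-word-count").setMaster("local[2]");
        // Micro-batches of 1 second: each batch becomes one RDD in the DStream
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Basic source available directly from the StreamingContext API
        JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        // Transformations: build new DStreams from the input DStream
        JavaDStream<String> words = lines.flatMap(l -> Arrays.asList(l.split(" ")).iterator());
        JavaPairDStream<String, Integer> counts = words
            .mapToPair(w -> new Tuple2<>(w, 1))
            .reduceByKey(Integer::sum);

        counts.print(); // output operation: push each batch's result out

        ssc.start();             // start receiving and processing data
        ssc.awaitTermination();  // wait until the streaming job is stopped
    }
}
```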

Spark Streaming Architecture

image-20221116170114448

Spark fault tolerance

  • An RDD remembers the sequence of operations that created it from the original fault-tolerant input data
  • Batches of input data are replicated in memory across multiple worker nodes and are therefore fault tolerant

Some comparisons

Comparison between Spark Streaming and Storm

| Spark Streaming | Storm |
| --- | --- |
| Cannot achieve millisecond-level stream computation | Can respond within milliseconds |
| Its low-latency execution engine can be used for real-time computing | |
| Compared with Storm, its RDD-based datasets make efficient fault tolerance easier to implement | |

Functional Correspondence between Storm and Hadoop Architecture Components

| | Hadoop | Storm |
| --- | --- | --- |
| Application name | Job | Topology |
| System roles | JobTracker | Nimbus |
| | TaskTracker | Supervisor |
| Component interface | Map/Reduce | Spout/Bolt |

Chapter 4 Big Graph Computing Framework

Computational model

Superstep: Parallel Node Computing

  • For each vertex

    ▪ receive the messages sent to it in the previous superstep
    ▪ execute the same user-defined function
    ▪ modify its value or the values of its outgoing edges
    ▪ send messages to other vertices (to be received in the next superstep)
    ▪ possibly change the topology of the graph
    ▪ vote to halt when it has no further work to do

  • Termination condition

    ▪ All vertices are inactive at the same time
    ▪ and no messages are in transit
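The superstep model is easiest to see as a vertex program. Below is a self-contained, single-process Java sketch (not the actual Pregel or Giraph API) of maximum-value propagation: in each superstep every vertex reads the messages from the previous superstep, updates its value, sends messages to its neighbors if its value changed, and votes to halt; execution ends when all vertices are inactive and no messages are in transit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Single-process illustration of the superstep model: propagate the maximum vertex value.
public class MiniPregelMax {

    static class Vertex {
        final int id;
        int value;
        boolean active = true;
        final List<Integer> neighbors = new ArrayList<>();
        Vertex(int id, int value) { this.id = id; this.value = value; }
    }

    public static void main(String[] args) {
        Map<Integer, Vertex> graph = new HashMap<>();
        graph.put(1, new Vertex(1, 3));
        graph.put(2, new Vertex(2, 6));
        graph.put(3, new Vertex(3, 2));
        graph.get(1).neighbors.add(2);
        graph.get(2).neighbors.addAll(List.of(1, 3));
        graph.get(3).neighbors.add(2);

        Map<Integer, List<Integer>> inbox = new HashMap<>();
        int superstep = 0;

        // Terminate when every vertex is inactive and no messages are in transit
        while (graph.values().stream().anyMatch(v -> v.active) || !inbox.isEmpty()) {
            Map<Integer, List<Integer>> outbox = new HashMap<>();
            for (Vertex v : graph.values()) {
                List<Integer> msgs = inbox.getOrDefault(v.id, List.of());
                if (!v.active && msgs.isEmpty()) continue; // stays halted unless a message wakes it

                // Compute the new value from the vertex's own value and the incoming messages
                int max = v.value;
                for (int m : msgs) max = Math.max(max, m);
                boolean changed = (superstep == 0) || max > v.value;
                v.value = max;

                // Send messages to neighbors only if something changed
                if (changed) {
                    for (int n : v.neighbors) {
                        outbox.computeIfAbsent(n, k -> new ArrayList<>()).add(v.value);
                    }
                }
                v.active = false; // vote to halt; a later message reactivates the vertex
            }
            inbox = outbox;
            superstep++;
        }

        // Every vertex converges to 6, the maximum value in the graph
        graph.values().forEach(v -> System.out.println("vertex " + v.id + " = " + v.value));
    }
}
```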

Differences from MapReduce

  • Graph algorithms can be written as a series of chained MapReduce invocations
  • Pregel
    keeps vertices and edges on the machine that performs the computation and
    uses network transfers only for messages

  • MapReduce, by contrast, passes the entire state of the graph from one stage to the next
    and requires coordinating the steps of a chained MapReduce

image-20221116181622573

    // Graph operations (Scala)
    class Graph[V, E] {
      def Graph(vertices: Table[(Id, V)],
                edges: Table[(Id, Id, E)])
      // Table Views -----------------
      def vertices: Table[(Id, V)]
      def edges: Table[(Id, Id, E)]
      def triplets: Table[((Id, V), (Id, V), E)]
      // Transformations ------------------------------
      def reverse: Graph[V, E]
      def subgraph(pV: (Id, V) => Boolean,
                   pE: Edge[V, E] => Boolean): Graph[V, E]
      def mapV(m: (Id, V) => T): Graph[T, E]
      def mapE(m: Edge[V, E] => T): Graph[V, T]
      // Joins ----------------------------------------
      def joinV(tbl: Table[(Id, T)]): Graph[(V, T), E]
      def joinE(tbl: Table[(Id, Id, T)]): Graph[V, (E, T)]
      // Computation ----------------------------------
    }

Pregel

system structure

The Pregel system also uses a master/slave model

  • Master node: schedules the slave nodes and recovers from slave-node failures
  • Slave nodes: process their own tasks and communicate with the other slave nodes

Aggregation

image-20221116183452238

aggregator

image-20221116184243570

Pregel execution

image-20221116184931841 image-20221116184944191 image-20221116184954887

GraphX

Connection Site Selection Using Routing Tables

image-20221116185125847

Iterate over the cache of mrTriplets

image-20221116185222651

Iterate over the aggregation of mrTriplets

image-20221116190402120

fault tolerance

  • Checkpoint: The master node periodically instructs the slave nodes to save the state of the partition to persistent storage
  • Error detection: timed use of "ping" messages
  • Recovery
    The master reassigns graph partitions to the currently available slaves;
    all workers reload their partition state from the most recent available checkpoint
  • Partial recovery: outgoing messages are logged, so recovery involves only the failed partitions

Chapter 5 Big Data Storage

Small-probability events become the norm at large scale:

  • Disk or machine damage
  • RAID card failure
  • Network failure
  • Power failure
  • Data corruption
  • System exceptions
  • Hotspots
  • Software defects
  • Operator error

HDFS

Related terms

| HDFS | GFS | MooseFS | Description |
| --- | --- | --- | --- |
| NameNode | Master | Master | The brain of the file system: it provides the directory tree of the whole file system, the block list of each file, and the locations of the data blocks, and it manages the data servers. |
| DataNode | Chunk Server | Chunk Server | Each file in the distributed file system is split into several data blocks, and each data block is stored on a different server. |
| Block | Chunk | Chunk | Each file is divided into blocks (64 MB by default); each block holds a contiguous piece of the file's content and is the basic unit of storage. |
| Packet | none | none | When the client writes a file, it does not write to the file system byte by byte; data is accumulated and then sent in one batch, and each batch of data sent is called a packet. |
| Chunk | none | Block (64 KB) | Inside each packet, the data is cut into smaller pieces (512 bytes), and each piece is paired with a checksum (CRC); such a piece is the unit of transmission. |
| Secondary NameNode | none | Metalogger | A standby master that silently pulls the master's logs, waiting to take over after the master dies. |

Core functions

| Function | Description |
| --- | --- |
| Namespace | HDFS supports a traditional hierarchical file organization. As in most other file systems, users can create directories and create, delete, move, and rename files within them. |
| Shell commands | Hadoop includes a set of shell-like commands that interact directly with HDFS and the other file systems Hadoop supports. |
| Data replication | The block size and replication factor of each file are configurable. The replication factor can be set when the file is created and changed later. Files in HDFS are write-once, and at any time there is strictly only one writer. |
| Rack awareness | In most cases the replication factor is 3. HDFS's placement policy puts one replica on a node in the local rack, another replica on a different node in the same rack, and the last replica on a node in a different rack. Rack failures are far less common than node failures, which is what motivates this policy. |
| Editlog | The FSEditLog class is the core of the logging subsystem; it provides a rich set of convenient APIs for writing log entries, as well as log recovery and storage. |
| Cluster balancing | If the free space on a DataNode falls below a certain threshold, a plan is started that automatically moves data from that DataNode to DataNodes with free space. |
| Space reclamation | A deleted file is not removed from HDFS immediately; HDFS renames it and moves it into the /trash directory so it can be recovered. The retention time of /trash is configurable. |

HDFS structure

image-20221116214529041

File read flow

image-20221116215023142

File write flow

image-20221116220255283
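A small sketch of the client side of these flows using the standard Hadoop FileSystem Java API (the path is illustrative): behind the output stream the client packs data into packets and pushes them through the replica pipeline, while a read first asks the NameNode for block locations and then streams the blocks directly from the DataNodes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS (e.g. hdfs://namenode:9000) is taken from core-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/tmp/demo.txt"); // illustrative path

            // Write: the client buffers packets and streams them through the replica pipeline
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the client asks the NameNode for block locations,
            // then reads the blocks directly from the DataNodes
            try (FSDataInputStream in = fs.open(path);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }
        }
    }
}
```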

Summary of the data write flow

| Write strategy | Overview | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Chained (pipeline) write | Client -> Replica A -> Replica B -> Replica C | Load and traffic are fairly balanced across nodes | The chain is long, so diagnosing and repairing failures is complex |
| Primary-secondary write | Client -> primary Replica A -> Replica B and Replica C | The total path is shorter, and the management logic is handled by the primary node | The primary node may become a load and traffic bottleneck |

| Exception handling | Overview | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Re-repair | Remove the abnormal node, bump the replica version, and reorganize the replica group currently being written | Preserves as much of the previously written data as possible | |
| Seal and New | 1. Seal all the chunk replicas currently being written so that no further writes are allowed, taking the current shortest replica as the chunk's standard length. 2. Allocate a new chunk and continue writing data into the new chunk's replicas | Simple and fast; can bypass abnormal nodes | Chunk length is not fixed, so more metadata management is required |

Read Process Summary

  • Any valid replica can be selected for reading
  • If an error occurs, try another replica
  • Backup reads can effectively reduce read latency
  • Based on the principle of locality, select the currently optimal replica to access

Data validation

Checksum verification along the full I/O path

  • <buffer, len, crc>
  • Verification during storage and network transmission
  • Checksums are persisted together with the data
  • Disk data is scanned periodically in the background for checksum verification
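A minimal sketch of the <buffer, len, crc> idea using java.util.zip.CRC32: compute a checksum when the data is written or sent, and recompute and compare it when the data is read back or received.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumDemo {

    // Compute the CRC over <buffer, len>, to be stored or transmitted alongside the data
    static long checksum(byte[] buffer, int len) {
        CRC32 crc = new CRC32();
        crc.update(buffer, 0, len);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] buffer = "block payload".getBytes(StandardCharsets.UTF_8);
        long stored = checksum(buffer, buffer.length); // persisted with the data

        // Later: on read or after network transfer, recompute and compare
        boolean ok = checksum(buffer, buffer.length) == stored;
        System.out.println("checksum valid: " + ok);

        buffer[0] ^= 0x01; // simulate a corrupted bit on disk or on the wire
        System.out.println("after corruption: " + (checksum(buffer, buffer.length) == stored));
    }
}
```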

Data backup

When a machine or disk fails, data can be recovered quickly from the other replicas

Data balancing

When new machines or disks come online, data is migrated to keep the load balanced

Garbage collection

Completed asynchronously, so the system remains stable and smooth

