"Offer is here: Java interview core knowledge points (framework)" reading notes

Chapter 1 Spring Principles and Applications

Chapter 2 Spring Cloud Principles and Applications

Chapter 3 Netty Network Programming Principles and Applications

Chapter 4 ZooKeeper Principles and Applications

Roles
ZAB protocol

Chapter 5 Kafka Principles and Applications

Chapter 6 Hadoop Principles and Applications

Chapter 7 HBase Principles and Applications

Chapter 8 Cassandra Principles and Applications

Features

The features of Cassandra include column-based storage, a P2P decentralized design, scalability, multi-data-center awareness and remote disaster recovery, secondary indexes, and support for distributed write operations:

  1. Column-based storage: Like HBase, Cassandra is a column-oriented database. Because query selection rules are defined over columns, the entire database is automatically indexed and queries are very efficient.
  2. P2P decentralized design: Cassandra adopts a P2P decentralized design. There is no master node in the cluster, so there is no risk of the cluster becoming unavailable because a master node goes down, and no master-node performance bottleneck. Cassandra automatically distributes data and requests evenly across all nodes.
  3. Scalability: Cassandra is fully horizontally scalable. When the cluster needs more capacity, nodes can be added dynamically; Cassandra migrates data automatically, so no processes need restarting and no data needs to be migrated by hand.
  4. Multi-data-center awareness and remote disaster recovery: Cassandra is aware of racks and data centers. For remote disaster recovery, you only need to configure the database to span different data centers, and Cassandra ensures that each data center holds a full copy of the data. When the primary data center goes down, the backup data center can fully serve business requests; and if the primary data center is lost to force majeure such as an earthquake or fire, the cluster can quickly be rebuilt in the primary data center from the standby data center, with data recovery completed automatically.
  5. Secondary indexes: In addition to key-value queries and key-range queries, Cassandra supports secondary indexes, on which Group By and Count operations can be performed conveniently.
  6. Support for distributed write operations: Thanks to the P2P architecture, users can read or write any data at any time from anywhere, without worrying about a single point of failure.

Data model

The Cassandra data model is composed of Key Space, Column Family, Key, and Column:
(figure omitted)

  1. Key Space
    A Key Space can contain several Column Families. The core parameters set when creating a Key Space are the replication factor and the replica placement strategy. The replication factor is the number of copies of the same piece of data kept in the cluster; the replica placement strategy determines how those copies are distributed across the servers in the cluster. Replica placement strategies include the simple strategy (single-data-center strategy), the old network topology strategy (rack-aware strategy), and the network topology strategy (data-center-aware strategy). Key Space creation command:
    CREATE KEYSPACE loginlog WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
  2. Key
    In Cassandra, each row of data is stored as a Key-Value pair, where the Key is its unique identifier.
  3. Column
    A Column is similar to a column in a relational database. In Cassandra, the Value of each Key-Value pair is also called a Column, and it is the smallest data unit in Cassandra. It is a triplet of name, value, and timestamp; both name and value are of type byte[] and have unlimited length.
  4. Super Column
    A Super Column allows the Value in a Key-Value pair to be a Map<Key, Value List>; a Column inside a Super Column can have multiple sub-columns.
  5. Column Family
    A Column Family is a structure containing many Rows, similar to a Table in an RDBMS. Each Row contains the Key provided by the Client and a series of Columns associated with that Key. A Column Family can be of the Standard or the Super Column Family type (a sketch follows this list).
  6. Standard Column Family
    A Standard Column Family is similar to a Table in a relational database. Each Column Family consists of a series of Rows, and each Row contains a Key and its corresponding Columns, which are of the Column type.
  7. Super Column Family
    Each Super Column Family is composed of a series of Rows, and each Row contains a Key and its corresponding Super Columns.
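
To make the model concrete, here is a minimal sketch using cassandra-driver-core (the contact point, keyspace, and table names are assumptions for illustration, not from the book): it creates a Key Space and a Column Family (a table in CQL terms), then writes one Row addressed by a unique Key.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

// Hypothetical names: keyspace "loginlog", table "login_log", contact point 127.0.0.1.
public class DataModelDemo {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Key Space: replication factor 3 with the simple (single-data-center) strategy.
            session.execute("CREATE KEYSPACE IF NOT EXISTS loginlog WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 3}");
            // Column Family (table): Rows keyed by 'key', non-key cells are Columns.
            session.execute("CREATE TABLE IF NOT EXISTS loginlog.login_log "
                    + "(key text PRIMARY KEY, value text)");
            // Writing a Row: the Key is 'user1'; 'value' becomes a Column whose
            // timestamp component is assigned at write time.
            session.execute("INSERT INTO loginlog.login_log (key, value) VALUES ('user1', 'ok')");
        }
    }
}
```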

Gossip protocol

The Gossip protocol, also known as Anti-Entropy, does not require a node to know the status of all other nodes. It is decentralized, the roles of all nodes are completely equal, and the cluster needs no central node. It is often used in scenarios that can accept eventual consistency, such as failure detection, route synchronization, publish/subscribe, and dynamic load balancing.

Two nodes A and B can communicate in three ways: push, pull, and push & pull.

Convergence: messages in the Gossip protocol spread through the network at an exponential rate, and all state inconsistencies converge to the same value in a short time; convergence takes on the order of log n rounds. Push & pull is the fastest of the communication methods and converges fastest.
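
To illustrate the convergence claim, here is a toy push-only simulation (an illustration of the math, not Cassandra's actual gossip implementation): each round, every informed node pushes the update to one random peer, so the informed set roughly doubles per round and full convergence takes on the order of log n rounds.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Random;

// Toy push-gossip simulation; Cassandra's real protocol also uses pull and push & pull.
public class GossipSim {
    public static void main(String[] args) {
        int n = 1024;
        Random rnd = new Random(42);
        BitSet informed = new BitSet(n);
        informed.set(0); // one node starts with the new state

        int rounds = 0;
        while (informed.cardinality() < n) {
            // Snapshot this round's senders so newly informed nodes push next round.
            List<Integer> senders = new ArrayList<>();
            for (int i = informed.nextSetBit(0); i >= 0; i = informed.nextSetBit(i + 1)) {
                senders.add(i);
            }
            for (int s : senders) {
                informed.set(rnd.nextInt(n)); // push the update to a random peer
            }
            rounds++;
        }
        // For n = 1024, log2(n) = 10; pure push typically finishes in roughly 15-20
        // rounds (it has a slow tail); push & pull converges faster still.
        System.out.println("Converged after " + rounds + " rounds");
    }
}
```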

Problem:
Multiple data centers can lead to data partitions (split brain).
Solution:
Cassandra configures the same Seed List (a list describing the seed nodes in the cluster) on all nodes in the cluster. When other nodes start, they first communicate with a seed node to exchange cluster information. This prevents data partition problems and ensures that newly added nodes quickly learn the state of the entire cluster.

NWR theory

  1. N (Number): the number of replicas of each piece of data kept in the distributed storage system
  2. W (Write): the minimum number of replicas that must be written successfully for a write operation to be considered successful
  3. R (Read): the minimum number of replicas that must be read successfully for a read operation to be considered successful

Different NWR combinations produce different consistency guarantees. When W + R > N, the system can guarantee strong consistency to the client; when W + R <= N, strong consistency of the data cannot be guaranteed.
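
A minimal sketch of the rule (the class and method names are illustrative): when W + R > N, every read set must intersect every successful write set, so a read always observes the latest acknowledged write.

```java
// Quorum-overlap check for the NWR rule.
public class NwrCheck {
    static boolean isStronglyConsistent(int n, int w, int r) {
        return w + r > n; // read and write quorums are forced to overlap
    }

    public static void main(String[] args) {
        System.out.println(isStronglyConsistent(3, 2, 2)); // true: quorum reads and writes
        System.out.println(isStronglyConsistent(3, 1, 1)); // false: stale reads possible
    }
}
```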

Consistent Hashing

Data replication strategy

Data storage

Data write

Data read

Data deletion mechanism

When data is deleted, Cassandra only inserts a tombstone for that data instead of deleting the original data directly. The tombstone is treated as a modification record of the data; in the MemTable and SSTable, the tombstone's content is the time at which the delete request was executed. When a client later queries the deleted data, Cassandra finds that it is marked as deleted, considers it deleted, and returns an empty result to the client. The deleted data is not removed from disk immediately and continues to occupy disk space for a while; the garbage collection mechanism, Major Compaction, periodically removes the data marked by tombstones.
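
A small sketch with cassandra-driver-core (reusing the hypothetical loginlog keyspace and login_log table from the data model section) showing how this is observed by a client: the read after the DELETE returns no row, even though the data has only been tombstoned and not yet purged from disk.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class TombstoneDemo {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("loginlog")) {
            // The DELETE does not remove the row on disk; it writes a tombstone
            // (a deletion marker carrying the time of the delete request).
            session.execute("DELETE FROM login_log WHERE key = 'user1'");
            // Subsequent reads see the tombstone and return no row; the data is
            // physically purged only later, during compaction.
            ResultSet rs = session.execute("SELECT * FROM login_log WHERE key = 'user1'");
            System.out.println(rs.one()); // null: the row is logically deleted
        }
    }
}
```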

Compared with HBase
(comparison figure omitted)
Installation: omitted.

Spring Boot integration with Cassandra: simple, omitted.

spring-boot-starter-data-cassandra is Spring Boot's secondary encapsulation of Cassandra client operations, built by Spring on top of cassandra-driver-core.
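
As a minimal illustration of what the starter gives you (the entity and table names below are assumptions, not from the book): Spring Data Cassandra maps an annotated entity to a table and generates the repository implementation on top of the driver, while Spring Boot auto-configures the session from application properties (e.g. spring.data.cassandra.keyspace-name=loginlog).

```java
import org.springframework.data.cassandra.core.mapping.PrimaryKey;
import org.springframework.data.cassandra.core.mapping.Table;
import org.springframework.data.cassandra.repository.CassandraRepository;

// Hypothetical entity mapped to the login_log table.
@Table("login_log")
class LoginLog {
    @PrimaryKey
    private String key;
    private String value;

    public String getKey() { return key; }
    public void setKey(String key) { this.key = key; }
    public String getValue() { return value; }
    public void setValue(String value) { this.value = value; }
}

// Spring generates the implementation; save()/findById() issue CQL via the driver.
interface LoginLogRepository extends CassandraRepository<LoginLog, String> {
}
```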

Chapter 9 ElasticSearch Principles and Applications

Chapter 10 Spark Principles and Applications

Chapter 11 Flink Principles and Applications

Flink abstracts data into bounded data streams and unbounded data streams.

Key concepts

  1. Flink Cluster: a distributed system used to run Flink applications. A Flink cluster consists of three roles: ZooKeeper, Job Manager, and Task Manager. In high-availability mode, ZooKeeper is generally a cluster of at least 3 nodes, and the Job Manager is a cluster of at least 2 nodes running in active/standby mode: under normal circumstances the primary node serves requests, and when it goes down a standby node is promoted to primary and takes over. The Task Manager is the concrete compute node; a cluster has one or more Task Managers.
  2. Flink Master: the management node of the cluster. A Flink Master consists of three roles: Flink Resource Manager (resource management), Flink Dispatcher (job dispatch), and Flink Job Manager.
  3. Flink Job Manager: Flink's task-management node, used for task submission, distribution, and running-status monitoring. A cluster can have one or more Job Managers (more in high-availability mode).
  4. Flink Task Manager: the compute node of a Flink cluster. Flink tasks are scheduled by the Job Manager onto multiple Task Managers, and tasks on different Task Managers exchange computation results with each other to complete the data flow computation.
  5. Job: a running Flink application. A Job can be submitted to the cluster through the Job Manager via the command line or via the Flink monitoring page.
  6. Flink Graph: the dataflow graph that makes up a Flink streaming program, divided into the Logical Graph and the Physical Graph. The former describes the logical relationships between the data flows defined by the application (usually a Flink program written in Java or Scala) and corresponds to the logical operators, Input, Output, DataStream, and DataSet. The latter is the physical execution graph obtained by translating the Logical Graph for the distributed runtime environment and corresponds to the Task, Input, Output, DataStream, and DataSet of the physical computation.
  7. Flink Operator and Operator Chain: a Flink Operator is a node in the Flink Logical Graph that executes a Flink Function. A complete data flow usually includes a Source Operator (data ingestion), Process Functions (data computation), and a Sink Operator (data output). Multiple adjacent Operators connected together form an Operator Chain; within an Operator Chain, Operators can access each other's data directly, without serialization or network transmission by the Flink cluster (see the sketch after this list).
  8. Flink Task and SubTask: a Flink Task is a node of the Physical Graph and corresponds to a physical computing unit. A Task is composed of multiple SubTasks, and each SubTask corresponds to one processing function applied to the data stream.
  9. Event: a state change of the data model while Flink is running. Events are the inputs and outputs of the streaming and batch computing interfaces and carry the recording and transmission of state.
  10. Function: a logical computation unit implemented by the application program. Functions are generally defined by implementing Flink's Function interfaces or inheriting its Function classes; commonly used ones include MapFunction, ReduceFunction, ProcessFunction, and RichFunction.
  11. Flink Record: an element in a data stream.
  12. Flink State Backend: defines how the state of a Job running on a Task Manager is stored (for example JVM heap memory, RocksDB, or a file system), as well as the storage rules and locations for SavePoints and CheckPoints.
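
To tie several of these concepts together, here is a minimal DataStream sketch (standard Flink APIs; the socket source host and port are assumptions for illustration). It wires a Source Operator, a FlatMapFunction, a keyed aggregation, and a Sink; the keyBy introduces a network shuffle, while adjacent stages on either side of it can be fused into Operator Chains.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)   // Source Operator
           .flatMap(new Tokenizer())              // Process Function (chained with the source)
           .keyBy(t -> t.f0)                      // network shuffle: breaks the Operator Chain
           .sum(1)                                // stateful keyed aggregation
           .print();                              // Sink Operator

        env.execute("WordCount");
    }

    // A FlatMapFunction, one of the commonly used Function interfaces.
    public static class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    out.collect(Tuple2.of(word, 1));
                }
            }
        }
    }
}
```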

Framework

(figure omitted)

Flink is composed of the Job Manager, Task Manager, and client. The Job Manager is the management node, responsible for the submission and allocation of cluster tasks and for resource management; the Task Manager is the compute node that executes specific tasks; the client is used to submit jobs.

  1. Responsibilities of the Job Manager
    The Job Manager, also called the Master node, coordinates the distributed compute nodes. It is responsible for scheduling tasks, coordinating CheckPoints, failure recovery, and so on. The Job Manager splits a job into multiple tasks and communicates with the Task Managers through the Actor system to deploy, stop, and cancel tasks. In a high-availability deployment there are multiple Job Managers: one Leader and several Followers. The Leader is always in the Active state and serves the cluster, while the Followers are in the Standby state; after the Leader goes down, one of the Followers is elected as the new Leader and continues to serve the cluster. Job Manager election is implemented through ZooKeeper.
  2. Responsibilities of the Task Manager
    The Task Manager, also called the Worker node, executes the Tasks (SubTasks) assigned by the Job Manager. It divides its system resources (CPU, network, memory) into multiple Task Slots (computing slots), and Tasks run in specific Task Slots. The Task Manager communicates with the Job Manager through the Actor system and periodically reports the running status of its Tasks, together with its own status, to the Job Manager. Tasks on multiple Task Managers exchange state and computation results with one another through DataStreams.
  3. Client
    The client is not part of the runtime environment. It is mainly used to submit jobs to the Job Manager; after a job is submitted, the client can disconnect, or stay connected to receive the job's running status.
  4. The running process of an application
    1. Write the application's data flow job, in Java or Scala
    2. Build the DAG and optimize the execution plan
    3. Submit the job to the cluster's Job Manager Leader node through client commands
    4. The Job Manager feeds the result and running status of the job submission back to the client
    5. The Job Manager splits the job into multiple tasks according to the resource usage on each Task Manager and deploys the tasks to specific Task Manager nodes through the Actor system
    6. The Task Managers run the Tasks in Task Slots and periodically send their own status and the Tasks' running status to the Job Manager; the Job Manager schedules the cluster according to the resource usage and task status on the Task Managers
    7. The Job Manager interacts with ZooKeeper to complete Job Manager election and failure recovery
  5. Task Slot resource allocation
  6. Tasks and operators
  7. State storage
  8. Operating mode

Event-driven model

Definition
The event-driven model is a stateful computing model based on event streams. It receives a continuous stream of events and, according to event type, updates different states and triggers different computations. The biggest difference between the event-driven model and the common compute-storage-separated model is that the latter stores data in remote object storage systems (such as S3, OSS, or OBS), transactional/relational databases, or distributed memory systems (such as LevelDB): its computation runs on local memory and disk, while the data itself is kept in the remote storage system. The advantage is that compute and storage, being separated, can scale independently without affecting each other. The architecture of the compute-storage-separated model:
(figure omitted)
The event-driven model, by contrast, is based on stateful stream processing. It does not separate compute and storage; instead, during computation it accesses local storage (operating-system memory or disk) to fetch data as quickly as possible and complete the computation. The event-driven model periodically writes CheckPoints and SavePoints to remote persistent storage to support state rollback, failure recovery, and program upgrades. The architecture is shown in the figure:
(figure omitted)

Features
The defining feature of the event-driven model is efficiency. Because it does not need to access remote data frequently, most data operations complete in local memory and a small portion on disk, so it achieves higher throughput and lower latency. At the same time, the event-driven model periodically and incrementally stores the state of data processing in remote persistent storage in the form of CheckPoints, to facilitate program state rollback and failure recovery.

Features of Flink's event-driven model
Flink's bottom layer is designed around stateful event processing. It provides rich state operations, an Exactly-Once data consistency guarantee, and computation over massive (terabyte-scale and larger) state, and it offers a variety of window computations, making it extremely flexible.
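
To illustrate the window support, here is a minimal sketch of Flink's standard window API (the socket host and port are assumptions): it counts records per key over 10-second tumbling processing-time windows, with window state held in the local State Backend and snapshotted to remote storage only through periodic CheckPoints.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowedCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)
           .map(word -> Tuple2.of(word, 1))
           // Lambdas lose generic type information to erasure; declare it explicitly.
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           .keyBy(t -> t.f0)
           // Per-key counts over 10-second tumbling processing-time windows; the
           // window state lives locally and is snapshotted via CheckPoints.
           .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
           .sum(1)
           .print();

        env.execute("WindowedCount");
    }
}
```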

Data analysis application

Data cleaning and data pipeline

Basic concepts of data stream processing

The whole framework is built on three core components: data streams, state (State), and time (Time).

API classification

State-based memory computing

Programming model


Source: blog.csdn.net/lonelymanontheway/article/details/114379058