Flink in brief

1. Flink operating model
Flink is based on the stream-processing model and supports unified batch and stream computing, with SLA (service level agreement) guarantees.
2. 10 major features:

Stateful computation with exactly-once semantics
Stream processing and window operations with event-time semantics
Highly flexible windows; failed operations can be retried conveniently and quickly
Fault tolerance based on lightweight state snapshots
High-throughput, low-latency, high-performance stream processing
Savepoint mechanism
Large-scale cluster deployments
Backpressure support
Iterative computation
Its own memory management implemented inside the JVM

3. Two major APIs: DataStream API (streaming), DataSet API (batch)
4. Libraries: streaming (CEP, Table), batch (ML, Gelly)
5. Three roles in task submission: Client, JobManager (HA, splits the job into tasks), TaskManager (executes the tasks)
Communication: the Client and JobManager, and the JobManager and TaskManager, communicate via the Akka framework (actor system); the former sends instructions, the latter reports status and statistics
Data exchange between the Client and JobManager: transmitted via the Netty framework
Between TaskManagers: data is transferred over the network
Standalone JobManager HA: ZooKeeper
6. Flink on YARN has two modes:

Single YARN session: start the cluster first, then submit jobs; resources are requested from YARN up front, and if resources are insufficient the next job has to wait.
Multiple YARN sessions: one cluster per job; each job requests its own resources from YARN without affecting other jobs.

7. Data sources supported by Flink:
socket / file (e.g. HDFS) / collection / custom source
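As a minimal Java sketch of these four source types (the host, port, file path and the counting source are placeholder assumptions, not part of the original notes):

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

import java.util.Arrays;

public class SourceExamples {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // socket source (placeholder host/port)
        DataStream<String> socket = env.socketTextStream("localhost", 9999);

        // file source, e.g. on HDFS (placeholder path)
        DataStream<String> file = env.readTextFile("hdfs:///tmp/input.txt");

        // collection source
        DataStream<Integer> collection = env.fromCollection(Arrays.asList(1, 2, 3));

        // custom source: emits an increasing counter until cancelled
        DataStream<Long> custom = env.addSource(new SourceFunction<Long>() {
            private volatile boolean running = true;
            @Override
            public void run(SourceContext<Long> ctx) throws Exception {
                long i = 0;
                while (running) {
                    ctx.collect(i++);
                    Thread.sleep(100);
                }
            }
            @Override
            public void cancel() { running = false; }
        });

        custom.print();
        env.execute("source examples");
    }
}
```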
8. Three types of DataStream operators:

transformation / partition (repartitioning, e.g. to mitigate data skew) / sink
9. Window types:
Windows split an unbounded stream into bounded chunks of data: time window, count window (countWindow), session window and custom window
Tumbling window, sliding window: timeWindow, countWindow
Window aggregation falls into 2 categories: incremental aggregation and full-window aggregation
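A minimal sketch of tumbling/sliding time windows, a count window and incremental aggregation, assuming a keyed stream of Tuple2<String, Integer> and the older timeWindow/countWindow shorthands (later releases use window(...) with explicit window assigners):

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowExamples {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KeyedStream<Tuple2<String, Integer>, String> keyed = env
                .fromElements(Tuple2.of("a", 1), Tuple2.of("a", 2), Tuple2.of("b", 3))
                .keyBy(t -> t.f0);

        // tumbling 10 s time window with incremental aggregation (reduce)
        keyed.timeWindow(Time.seconds(10))
             .reduce((x, y) -> Tuple2.of(x.f0, x.f1 + y.f1))
             .print();

        // sliding time window: 10 s size, 5 s slide
        keyed.timeWindow(Time.seconds(10), Time.seconds(5))
             .sum(1)
             .print();

        // tumbling count window of 100 elements
        keyed.countWindow(100)
             .sum(1)
             .print();

        env.execute("window examples");
    }
}
```

Full-window aggregation would instead use apply()/process() on the windowed stream, which receives all elements of the window at once.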
10. Three time types:
EventTime (requires specifying a watermark) / IngestionTime (the source system time when the record is received) / ProcessingTime (the system time of the machine performing the operation)
Performance: ProcessingTime > IngestionTime > EventTime
Latency: ProcessingTime < IngestionTime < EventTime
Determinism: EventTime > IngestionTime > ProcessingTime
11. Watermark:
Solves out-of-order and late data, and is usually combined with windows
Two moments at which a watermark can be generated: immediately after a record is received, or after the record has been processed
Two ways to generate watermarks: Periodic Watermark (define a maximum allowed out-of-orderness) and Punctuated Watermark (emitted per record)
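A minimal sketch of a periodic watermark using the older BoundedOutOfOrdernessTimestampExtractor; the 5-second bound and the (key, timestamp) records are assumptions, and newer releases use WatermarkStrategy instead:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WatermarkExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // elements are (key, eventTimestampMillis) pairs
        DataStream<Tuple2<String, Long>> events =
                env.fromElements(Tuple2.of("a", 1000L), Tuple2.of("a", 7000L), Tuple2.of("b", 3000L));

        events
            // periodic watermark: allow at most 5 s of out-of-orderness
            .assignTimestampsAndWatermarks(
                    new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(Time.seconds(5)) {
                        @Override
                        public long extractTimestamp(Tuple2<String, Long> element) {
                            return element.f1; // event time carried in the record
                        }
                    })
            .keyBy(t -> t.f0)
            .timeWindow(Time.seconds(10))
            .max(1)   // max event timestamp per key and window
            .print();

        env.execute("watermark example");
    }
}
```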
12. Two conditions for Flink to fire a window:
the watermark time >= the window end time, and there is data in the window
13. Starting from fault tolerance and exactly-once message-processing semantics: state (the state of a specific task/operator, kept in JVM heap memory on the TaskManager), checkpoint (the persisted form of state; a snapshot of the states of all tasks/operators, held in JobManager memory with the default backend)
14. Two types of state: keyed state (state on a KeyedStream) and operator state; two forms of state: raw state (managed by the user, for custom operators) and managed state (managed by the Flink framework, recommended on DataStream)
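A minimal sketch of keyed managed state (ValueState) inside a RichFlatMapFunction; the running-sum logic and names are only illustrative:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// keeps a running sum per key in keyed managed state
public class RunningSum extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> sumState;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Long> descriptor =
                new ValueStateDescriptor<>("runningSum", Long.class);
        sumState = getRuntimeContext().getState(descriptor);
    }

    @Override
    public void flatMap(Tuple2<String, Long> value, Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = sumState.value();                   // null on the first access for this key
        long updated = (current == null ? 0L : current) + value.f1;
        sumState.update(updated);
        out.collect(Tuple2.of(value.f0, updated));
    }
}

// usage (must be on a keyed stream): stream.keyBy(t -> t.f0).flatMap(new RunningSum())
```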
15. Three types of state backend:
MemoryStateBackend (memory-based storage; config value "jobmanager")
FsStateBackend (file-system-based storage; config value "filesystem")
RocksDBStateBackend (RocksDB-based storage; config value "rocksdb")
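A minimal sketch of selecting each backend with the pre-1.13 API; the paths are placeholders, RocksDBStateBackend needs the flink-statebackend-rocksdb dependency, and a real job would set only one of them:

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.runtime.state.memory.MemoryStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendExamples {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 1) in-memory: state on the TaskManager heap, checkpoints in JobManager memory
        //    (flink-conf.yaml equivalent: state.backend: jobmanager)
        env.setStateBackend(new MemoryStateBackend());

        // 2) file system: checkpoints persisted to a file system such as HDFS
        //    (flink-conf.yaml equivalent: state.backend: filesystem)
        env.setStateBackend(new FsStateBackend("hdfs://namenode:8020/flink/checkpoints"));

        // 3) RocksDB: state in an embedded RocksDB instance, incremental checkpoints on a file system
        //    (flink-conf.yaml equivalent: state.backend: rocksdb)
        env.setStateBackend(new RocksDBStateBackend("hdfs://namenode:8020/flink/checkpoints", true));
    }
}
```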
16. Checkpoints: by default only the most recently completed checkpoint is retained; keeping multiple checkpoints is supported (state.checkpoints.num-retained: 20)
Savepoint: after a program upgrade, computation continues from the point taken before the upgrade, so data processing is not interrupted.
When Flink consumes Kafka, exactly-once is not guaranteed by tracking the offsets of the Kafka consumer group, but by tracking the offsets inside Flink and including them in checkpoints. For the Kafka partitions, Flink starts a matching degree of parallelism.
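A minimal sketch of exactly-once checkpointing combined with the older FlinkKafkaConsumer (newer releases use KafkaSource); the topic, broker and interval are assumptions, and state.checkpoints.num-retained is a flink-conf.yaml setting rather than a job API:

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaExactlyOnce {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // checkpoint every 60 s with exactly-once semantics; Kafka offsets are stored in the checkpoint
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
        // keep the externalized checkpoint when the job is cancelled, so it can be restored later
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092");   // placeholder
        props.setProperty("group.id", "flink-demo");              // placeholder

        // Kafka partitions are distributed over the parallel source subtasks; Flink tracks offsets itself
        FlinkKafkaConsumer<String> consumer =
                new FlinkKafkaConsumer<>("demo-topic", new SimpleStringSchema(), props);

        env.addSource(consumer).print();
        env.execute("kafka exactly-once example");
    }
}
```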
17. DataSet built-in data sources: files, collections, generic (input formats)
18. Commonly used DataSet operators:
transformation / partition (repartitioning: rebalance / hashPartition / rangePartition / custom)
DataSet sink operators
19. DataSet file-system connectors: HDFS (hdfs://), S3 (s3://), MapR (maprfs://), Alluxio (alluxio://)
20. Broadcast variables (broadcast): DataStream (broadcast to all partitions, so elements may be processed repeatedly); DataSet (a read-only cached variable on every machine, rather than a copy of the variable shipped to each Task)
Accumulators (accumulator)
Distributed cache (convenient for reading local files inside parallel functions): register and name a cached file on the env, and Flink automatically copies it to the local file system of every TaskManager
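A minimal sketch of a DataSet broadcast variable; the sample data and the "numbers" name are only illustrative:

```java
import java.util.List;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

public class BroadcastExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Integer> toBroadcast = env.fromElements(1, 2, 3);
        DataSet<String> data = env.fromElements("a", "b", "c");

        data.map(new RichMapFunction<String, String>() {
                private List<Integer> broadcastValues;

                @Override
                public void open(Configuration parameters) {
                    // read-only cached copy on every TaskManager, not shipped per Task
                    broadcastValues = getRuntimeContext().getBroadcastVariable("numbers");
                }

                @Override
                public String map(String value) {
                    return value + ":" + broadcastValues;
                }
            })
            .withBroadcastSet(toBroadcast, "numbers")
            .print();
    }
}
```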
21. Two relational APIs: Table and SQL (unified batch and stream processing)
TableSink supports: multiple file formats (CSV/Parquet/Avro), storage systems (JDBC/HBase/ES), message systems (Kafka, RabbitMQ); batch: BatchTableSink; streaming: AppendStreamTableSink / RetractStreamTableSink / UpsertStreamTableSink
Two modes for converting a Table into a DataStream: append mode (insert-only queries) / retract mode
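A minimal sketch of the append and retract conversions, assuming the older StreamTableEnvironment bridge API (toAppendStream/toRetractStream) and a table planner on the classpath; the table name and queries are illustrative assumptions:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class TableConversionExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        DataStream<Tuple2<String, Integer>> stream =
                env.fromElements(Tuple2.of("a", 1), Tuple2.of("a", 2), Tuple2.of("b", 3));

        Table table = tEnv.fromDataStream(stream);
        tEnv.createTemporaryView("t", table);

        // append mode: insert-only results (plain projection/filter)
        Table appendOnly = tEnv.sqlQuery("SELECT * FROM t WHERE f1 > 1");
        tEnv.toAppendStream(appendOnly, Row.class).print();

        // retract mode: results may be updated, so each record carries an add/retract flag
        Table aggregated = tEnv.sqlQuery("SELECT f0, SUM(f1) FROM t GROUP BY f0");
        tEnv.toRetractStream(aggregated, Row.class).print();

        env.execute("table conversion example");
    }
}
```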
Connecting streams: coGroup (joint grouping; yields a CoGroupedStream, and after converting to WithWindow the apply method is the key step)
/ interval join (handles out-of-order, late and cross-window cases; implemented on KeyedStream)
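A minimal sketch of an interval join between two keyed streams; the order/payment tuples and the ±5 s bounds are assumptions:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class IntervalJoinExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        DataStream<Tuple2<String, Long>> orders = env
                .fromElements(Tuple2.of("a", 1000L), Tuple2.of("b", 2000L))
                .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Tuple2<String, Long>>() {
                    @Override
                    public long extractAscendingTimestamp(Tuple2<String, Long> e) { return e.f1; }
                });

        DataStream<Tuple2<String, Long>> payments = env
                .fromElements(Tuple2.of("a", 3000L), Tuple2.of("b", 9000L))
                .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Tuple2<String, Long>>() {
                    @Override
                    public long extractAscendingTimestamp(Tuple2<String, Long> e) { return e.f1; }
                });

        orders.keyBy(t -> t.f0)
              .intervalJoin(payments.keyBy(t -> t.f0))
              .between(Time.seconds(-5), Time.seconds(5))   // payment within 5 s before/after the order
              .process(new ProcessJoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, String>() {
                  @Override
                  public void processElement(Tuple2<String, Long> left, Tuple2<String, Long> right,
                                             Context ctx, Collector<String> out) {
                      out.collect(left.f0 + " joined at " + left.f1 + "/" + right.f1);
                  }
              })
              .print();

        env.execute("interval join example");
    }
}
```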
22. Two types of SQL join:
global join, time-windowed join (supports EventTime/ProcessingTime, multiple joins)
23. CEP (complex event processing):
pattern (a rule for matching events): individual pattern (a single event; quantifiers, iterative conditions, simple conditions, combining conditions, stop conditions)
/ combined pattern (a pattern sequence of multiple individual patterns; contiguity conditions [strict contiguity, relaxed contiguity, non-deterministic relaxed contiguity])
/ pattern group (nesting a pattern sequence as the condition of an individual pattern). Match-skip strategies: NO_SKIP, SKIP_PAST_LAST_EVENT, SKIP_TO_FIRST, SKIP_TO_LAST
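A minimal sketch of a combined pattern (two individual patterns with strict contiguity via next) plus a match-skip strategy, using flink-cep; the login events, names and the 10 s window are assumptions:

```java
import java.util.List;
import java.util.Map;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternSelectFunction;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.nfa.aftermatch.AfterMatchSkipStrategy;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class CepExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // events are (userId, action) pairs
        DataStream<Tuple2<String, String>> events = env.fromElements(
                Tuple2.of("u1", "login_fail"), Tuple2.of("u1", "login_fail"), Tuple2.of("u1", "login_ok"));

        // combined pattern: two consecutive failures (strict contiguity via next) within 10 s
        Pattern<Tuple2<String, String>, ?> pattern = Pattern
                .<Tuple2<String, String>>begin("first", AfterMatchSkipStrategy.skipPastLastEvent())
                .where(new SimpleCondition<Tuple2<String, String>>() {
                    @Override
                    public boolean filter(Tuple2<String, String> e) { return e.f1.equals("login_fail"); }
                })
                .next("second")
                .where(new SimpleCondition<Tuple2<String, String>>() {
                    @Override
                    public boolean filter(Tuple2<String, String> e) { return e.f1.equals("login_fail"); }
                })
                .within(Time.seconds(10));

        PatternStream<Tuple2<String, String>> patternStream =
                CEP.pattern(events.keyBy(t -> t.f0), pattern);

        patternStream.select(new PatternSelectFunction<Tuple2<String, String>, String>() {
            @Override
            public String select(Map<String, List<Tuple2<String, String>>> match) {
                return "two consecutive login failures for " + match.get("first").get(0).f0;
            }
        }).print();

        env.execute("cep example");
    }
}
```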
24. Monitoring metrics: task monitoring (whether there are errors or failures), system metrics (CPU, memory, etc.)
Custom metrics: extend RichFunction, call getRuntimeContext().getMetricGroup(), and register the custom metrics
Four metric types: Counter / Gauge / Histogram / Meter (throughput)
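A minimal sketch of custom metrics (a Counter and a Meter) registered through getRuntimeContext().getMetricGroup(); the metric names are assumptions:

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.Meter;
import org.apache.flink.metrics.MeterView;

// counts processed records and measures their throughput
public class MetricMapper extends RichMapFunction<String, String> {

    private transient Counter recordCounter;
    private transient Meter recordsPerSecond;

    @Override
    public void open(Configuration parameters) {
        recordCounter = getRuntimeContext().getMetricGroup().counter("processedRecords");
        recordsPerSecond = getRuntimeContext().getMetricGroup()
                .meter("recordsPerSecond", new MeterView(60)); // rate over a 60 s window
    }

    @Override
    public String map(String value) {
        recordCounter.inc();
        recordsPerSecond.markEvent();
        return value;
    }
}

// usage: stream.map(new MetricMapper())
```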
25. Backpressure mechanism: gives early warning about data processing (whether production is faster than processing) so that a strategy can be adopted in time.
Backpressure thread sampling: sample the task threads in a loop (Thread.getStackTrace()) to monitor data-processing speed
Backpressure sampling configuration: refresh-interval / num-samples / delay-between-samples
Checkpoint monitoring (web UI): Overview, History, Summary, Configuration
Checkpoint tuning (a configuration sketch follows this list): an inefficient synchronous phase and queued-up checkpoints consume system resources
Measure checkpoint speed: the start time of each checkpoint, whether there is idle time, the amount of buffered data
Interval between adjacent checkpoints: forcibly insert idle time between checkpoints
Checkpoint resource settings: checkpoint on the Task first, then persist to external storage; give each task more parallelism
Task-local checkpoint recovery: speeds up recovery; each Task writes checkpoint data to its local disk and to remote distributed storage at the same time
Asynchronous checkpoint settings: greatly improve checkpoint performance, but state may be overwritten during the snapshot; two requirements: managed state, and a backend that supports asynchronous snapshots
Checkpoint data compression
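A minimal sketch of the tuning knobs above (checkpoint interval, minimum pause between checkpoints, concurrency, timeout, snapshot compression); all values are placeholder assumptions:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuning {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // force idle time between adjacent checkpoints so they do not queue up
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);
        // only one checkpoint in flight at a time
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
        // fail a checkpoint that takes longer than 10 minutes
        env.getCheckpointConfig().setCheckpointTimeout(600_000);

        // compress checkpointed keyed state
        env.getConfig().setUseSnapshotCompression(true);
    }
}
```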
26. Flink divides the heap into three regions: network buffers (allocated when the TaskManager starts, by default 2048 buffers of 32 KB), memory manager buffers (buffer records in serialized form), remaining heap (user code, TaskManager data structures)
Memory segment management: memory is represented as a collection of memory segments; when Flink stores data somewhere, it serializes it into one or more memory segments, and the serialization format is defined by Flink
Memory segments rather than java.nio.ByteBuffer; reasons: they use sun.misc.Unsafe, provide absolute get/put methods on a byte array, and are thread-safe
JVM default: -XX:NewRatio=2, i.e. OldGen is twice the size of NewGen; while jobs run, short-lived objects in NewGen are garbage-collected frequently, and long-lived objects are rarely collected
Memory allocation only distinguishes the memory manager from the remaining heap: the relative value taskmanager.memory.fraction specifies how much of the remaining heap is used as managed pages, and the absolute value taskmanager.memory.size allocates the managed pages when the TM starts
Off-heap memory: zero-copy spilling to disk/SSD and zero-copy network sends; increase the maximum direct memory with -XX:MaxDirectMemorySize


Origin blog.csdn.net/victory0508/article/details/122043446