67. Performance Tuning

1. Data reception parallelism tuning

(1)
When data is received over the network (such as from Kafka or Flume), it is deserialized and stored in Spark's memory. If data reception becomes the bottleneck, consider parallelizing it.
Each input DStream starts one Receiver on a Worker node's Executor, and that Receiver receives a single data stream. Reception can therefore be parallelized by creating multiple input DStreams and configuring them to receive different partitions of the data source, so that multiple data streams are received at once. For example, one input DStream that receives two Kafka topics can be split into two input DStreams, each receiving one topic. This creates two Receivers that receive data in parallel, improving throughput. The multiple DStreams can then be aggregated into a single DStream with the union operator, and the subsequent transformation operators are applied to that unified DStream.

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.kafka.KafkaUtils;

int numStreams = 5;
List<JavaPairDStream<String, String>> kafkaStreams =
    new ArrayList<JavaPairDStream<String, String>>(numStreams);
for (int i = 0; i < numStreams; i++) {
  kafkaStreams.add(KafkaUtils.createStream(...)); // Kafka parameters elided in the original
}
JavaPairDStream<String, String> unifiedStream =
    streamingContext.union(kafkaStreams.get(0), kafkaStreams.subList(1, kafkaStreams.size()));
unifiedStream.print();


(2)
Besides creating more Receivers and input DStreams, data reception parallelism can also be tuned by adjusting the block interval, which is set with the spark.streaming.blockInterval parameter; the default is 200ms. For most Receivers, the received data is cut into blocks before being saved to Spark's BlockManager. The number of blocks in each batch determines the number of partitions of the RDD corresponding to that batch, and therefore the number of tasks created when transformation operations are executed on that RDD. The number of tasks per batch can be roughly estimated as batch interval / block interval.

For example, with a batch interval of 2s and a block interval of 200ms, 10 tasks are created per batch. If you think the number of tasks per batch is too small, that is, lower than the number of CPU cores on each machine, the CPU resources cannot be fully utilized. To increase the number of blocks per batch, reduce the block interval. However, the recommended minimum block interval is 50ms; below that value, the startup overhead of the large number of tasks can itself become a performance problem.
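As a minimal sketch of adjusting this setting (the app name and the 100ms value are illustrative choices, not recommendations from the text; note that some older Spark versions expect a plain millisecond number such as "100" rather than a duration string):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// 2s batch interval / 100ms block interval ≈ 20 tasks per batch per receiver
SparkConf conf = new SparkConf()
    .setAppName("BlockIntervalTuning") // illustrative app name
    .set("spark.streaming.blockInterval", "100"); // milliseconds
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(2));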


(3)
Besides the two methods above for raising reception parallelism, the input data stream can also be explicitly repartitioned, using inputStream.repartition(<number of partitions>). This distributes the received batches of data across the specified number of machines before further processing.
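For instance, a one-line sketch reusing the unifiedStream from the earlier snippet (the partition count of 10 is an arbitrary example):

// spread each received batch across 10 partitions before the heavy transformations run
JavaPairDStream<String, String> repartitioned = unifiedStream.repartition(10);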


2. Task startup tuning

If too many tasks are launched every second, say 50 per second, then the overhead of sending these tasks to the Executors on the Worker nodes becomes significant, and millisecond-level latency becomes hard to achieve. This overhead can be reduced in the following ways:

1. Task serialization: using the Kryo serialization mechanism to serialize tasks reduces task size, and thereby the time spent sending tasks to the Executors on the Worker nodes.
2. Execution mode: running Spark in Standalone mode gives shorter task startup times.

These approaches may reduce the processing time of each batch by up to 100 milliseconds, bringing latency from the second level down to milliseconds.


3. Data processing parallelism tuning

If the number of tasks used in parallel at any stage of the computation is not large enough, cluster resources will not be fully utilized. For example, for distributed reduce operations such as reduceByKey and reduceByKeyAndWindow, the default number of parallel tasks is determined by the spark.default.parallelism parameter. You can pass a second argument to operations like reduceByKey to manually specify the parallelism of that operation, or adjust spark.default.parallelism to change the default for the whole application.
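A minimal sketch of both options (the pairs stream and the value 20 are illustrative placeholders):

// option 1: raise the default parallelism for all shuffle operations
conf.set("spark.default.parallelism", "20");

// option 2: override the parallelism of a single operation; pairs is a hypothetical
// JavaPairDStream<String, Integer> built earlier in the application
JavaPairDStream<String, Integer> counts =
    pairs.reduceByKey((a, b) -> a + b, 20); // 20 reduce-side partitions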


4. Data serialization tuning

(1)
The overhead caused by data serialization can be reduced by optimizing the serialization format. In streaming computation scenarios, there are two types of data that need to be serialized.

1. Input data: by default, received input data is stored in the Executor's memory with the StorageLevel.MEMORY_AND_DISK_SER_2 persistence level. This means the data is serialized into bytes to reduce GC overhead, and is replicated for fault tolerance against executor failure. The data is stored in memory first and spills to disk only when memory is insufficient to hold all the data the streaming computation needs. Serialization here carries significant performance overhead: the Receiver must deserialize the data received from the network and then re-serialize it using Spark's serialization format.

2. Persistent RDDs generated by streaming operations: RDDs generated by streaming computations may be persisted into memory. For example, window operations persist their data in memory by default, because the data may be used by multiple windows and processed multiple times. However, unlike Spark Core's default persistence level, StorageLevel.MEMORY_ONLY, the default persistence level of RDDs generated by streaming operations is StorageLevel.MEMORY_ONLY_SER, which serializes the data by default to reduce GC overhead.


(2)
In both of the above scenarios, using the Kryo serialization library can reduce CPU and memory overhead. When using Kryo, consider registering your custom classes and disabling reference tracking (spark.kryo.referenceTracking).
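A hedged configuration sketch (MyRecord is a hypothetical application class standing in for your own types):

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // disable reference tracking if your objects contain no cyclic references
    .set("spark.kryo.referenceTracking", "false");
// register application classes so Kryo does not have to write full class names
conf.registerKryoClasses(new Class<?>[]{ MyRecord.class }); // MyRecord is hypothetical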

In some special scenarios, such as when the amount of data a streaming application must keep around is small, it may be feasible to persist data in non-serialized form, which reduces the CPU overhead of serialization and deserialization without incurring much extra GC overhead. For example, if your batch interval is a few seconds and you are not using window operations, you can consider explicitly setting a persistence level that disables serialization when persisting data.
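A minimal sketch of this, where stream stands for any JavaDStream in your application that comfortably fits in memory:

import org.apache.spark.storage.StorageLevel;

// keep the data as deserialized Java objects instead of serialized bytes
stream.persist(StorageLevel.MEMORY_ONLY());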


5. Batch interval tuning (important)

For a Spark Streaming application to run stably on a cluster, it must process the received data as quickly as possible. In other words, each batch should be processed as soon as possible after it is generated. Whether this is a problem for an application can be determined by observing the batch processing times on the Spark UI: the batch processing time must be less than the batch interval.

Given the nature of streaming computation, the batch interval has an enormous impact on the data reception rate that an application can sustain with fixed cluster resources. For example, in the WordCount example, for a particular data reception rate the application may be able to guarantee printing a word count every 2 seconds, but not every 500ms. The batch interval therefore needs to be set so that the data reception rate expected in production can be sustained.

A good way to find the right batch size for your application is to test with a very conservative batch interval, e.g., 5~10s, and a slow data reception rate, as in the sketch below. To check whether the application keeps up with the data rate, check the processing delay of each batch: if the processing time stays roughly consistent with the batch interval, the application is stable. Otherwise, if the batch scheduling delay keeps growing, the application cannot keep up with that rate and is unstable. To reach a stable configuration, you can try to speed up data processing or increase the batch interval. Remember that a temporary delay increase caused by a temporary data spike is acceptable, as long as the delay recovers within a short time.
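As a sketch of the conservative starting point described above (5 seconds is one of the suggested values, not a rule):

// begin testing with a conservative 5s batch interval, then tighten it
// once the Spark UI shows batches finishing well within the interval
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));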


6. Memory tuning

(1)
Optimizing the memory usage and GC behavior of Spark applications was already covered in the Spark Core tuning discussion. Here we cover the tuning points specific to Spark Streaming applications.

The amount of cluster memory a Spark Streaming application needs is determined by the types of transformation operations it uses. For example, if you want to use a window operator with a window length of 10 minutes, the cluster must have enough memory to hold 10 minutes of data. If you want to use updateStateByKey to maintain state for many keys, your memory resources must be large enough. Conversely, a simple map-filter-store job needs very little memory.

Generally speaking, data received through a Receiver is stored with the StorageLevel.MEMORY_AND_DISK_SER_2 persistence level, so data that cannot be kept in memory spills to disk. Spilling to disk reduces application performance, so it is usually recommended to provide the application with as much memory as it needs; the sketch below illustrates the window case mentioned above. It is also recommended to measure and evaluate memory usage in a small-scale test first.
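A hedged illustration of the memory implication above (pairs, the reduce function, and the 10s slide interval are placeholders):

// a 10-minute window means the cluster must hold 10 minutes of data in memory
JavaPairDStream<String, Integer> windowedCounts =
    pairs.reduceByKeyAndWindow((a, b) -> a + b, Durations.minutes(10), Durations.seconds(10));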



(2)
Another aspect of memory tuning is garbage collection. For a streaming application that needs low latency, long pauses caused by JVM garbage collection are clearly undesirable. There are a number of measures that can help reduce memory usage and GC overhead:

1. DStream persistence: as mentioned in the "Data serialization tuning" section, input data and the persisted RDDs produced by some operations are serialized into bytes by default. Compared with non-serialized storage, this reduces memory usage and GC overhead. Using the Kryo serialization mechanism reduces them further. To shrink memory usage even more, the data can also be compressed, controlled by the spark.rdd.compress parameter (default false); see the sketch after this list.

2. Clean up old data: by default, all input data, and the persisted RDDs generated by DStream transformation operations, are cleared automatically. Spark Streaming decides when to clean up the data based on the transformation operations used. For example, if you are using a window operation with a window length of 10 minutes, Spark keeps about 10 minutes of data and cleans up older data once that time has passed. In some special scenarios, however, such as integrating Spark SQL with Spark Streaming and running Spark SQL queries against a batch's RDDs in an asynchronously started thread, you need Spark to keep the data for a longer period, until the Spark SQL queries finish. This can be achieved with the streamingContext.remember() method (see the sketch after this list).

3. CMS garbage collector: using the concurrent mark-sweep garbage collector is recommended to keep GC overhead low. Although concurrent GC reduces overall throughput, it is still recommended because it shortens batch processing times (by reducing GC pauses during processing). If you use it, it must be enabled on both the driver side and the executor side: for the driver, pass it via spark-submit's --driver-java-options; for the executors, set the spark.executor.extraJavaOptions parameter to -XX:+UseConcMarkSweepGC.
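To make the three points above concrete, here is a hedged Java sketch (the values are illustrative; the driver-side GC flag must go through spark-submit's --driver-java-options, since the driver JVM is already running by the time application code executes):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf conf = new SparkConf()
    // point 1: compress serialized, persisted RDD blocks (trades CPU for memory)
    .set("spark.rdd.compress", "true")
    // point 3: CMS GC on the executors; pass -XX:+UseConcMarkSweepGC to the driver
    // separately via --driver-java-options on spark-submit
    .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC");

JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(2));
// point 2: keep batch data beyond Spark's default cleanup window so that
// asynchronous Spark SQL queries against batch RDDs can still see it
jssc.remember(Durations.minutes(5)); // 5 minutes is an illustrative retention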

Origin: www.cnblogs.com/weiyiming007/p/11388447.html