Data Warehouse Task Optimization

Tip: Optimize across the full pipeline: upstream tasks, the current task, downstream tasks, and the surrounding environment.

1. Model optimization

Start from reasonable requirements, a reasonable table structure, and reasonable processing logic.

1. Field optimization

1 Reduce fields

1) Delete unused fields;

2) For fields stored in both Chinese and English, keep only the English version;

3) Delete fields that are rarely used and can be derived by joining on existing fields;

2 Field types

1) Choose an appropriate data type, preferring the one with the smallest storage footprint;

2) Do not default to the string type just for convenience;

3 Field values

1) Store concise codes, and look up the real value in a code table when it is needed;

2) With columnar storage, fill empty values with null;
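
The two field-value rules above can be sketched in plain Python. This is only an illustration: the code table CITY_CODES and the field names user_id, city, and city_code are hypothetical, not part of the original article.

```python
# Hypothetical code table: short integer codes mapped to the full values.
CITY_CODES = {1: "Beijing", 2: "Shanghai", 3: "Guangzhou"}
# Reverse mapping, used when loading raw records.
CITY_TO_CODE = {name: code for code, name in CITY_CODES.items()}


def encode(rows):
    """Store the compact code; unknown or empty values become None (null)."""
    return [
        {"user_id": r["user_id"], "city_code": CITY_TO_CODE.get(r.get("city"))}
        for r in rows
    ]


def decode(rows):
    """Look up the real value in the code table only when it is needed."""
    return [
        {"user_id": r["user_id"], "city": CITY_CODES.get(r["city_code"])}
        for r in rows
    ]


raw = [
    {"user_id": 1, "city": "Beijing"},
    {"user_id": 2, "city": "Shanghai"},
    {"user_id": 3, "city": ""},  # empty value, stored as None
]
stored = encode(raw)
restored = decode(stored)
print(stored)
print(restored)
```

The stored rows carry only a small integer or None, which compresses well in a columnar format.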

2. Storage type

1 Storage format

1) Columnar formats (Parquet, ORC) are preferred: high compression ratio and good read performance;

2) Prefer the default storage format of the computing engine, which usually performs best;

2 Compression formats

1) For tables that are queried often and need high processing performance, choose a low compression ratio;

2) For tables that are queried rarely and need low storage cost, choose a high compression ratio;

3) Different warehouse layers can use different compression formats;

3 Tiered storage

1) Separate hot and cold data into different storage tiers;

4 File uniformity

1) Keep file sizes as uniform as possible;

2) Merge small files;
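
One way to plan a small-file merge that also keeps output sizes uniform is to pack files into batches close to a target size. The sketch below uses a first-fit-decreasing heuristic; the 128 MB target and the function name plan_merges are assumptions for illustration, and a real job would read the sizes from the file system.

```python
def plan_merges(file_sizes_mb, target_mb=128):
    """Group files into merge batches of at most target_mb each.

    Uses first-fit decreasing: the largest files are placed first, each into
    the first batch that still has room. A file larger than target_mb ends up
    in a batch of its own.
    """
    batches = []  # each entry is [total_size, [file sizes]]
    for size in sorted(file_sizes_mb, reverse=True):
        for batch in batches:
            if batch[0] + size <= target_mb:
                batch[0] += size
                batch[1].append(size)
                break
        else:
            batches.append([size, [size]])
    return [files for _, files in batches]


small_files = [5, 100, 20, 60, 3, 40, 28]
plan = plan_merges(small_files)
for index, batch in enumerate(plan):
    print(index, sum(batch), batch)
```

Each batch would then be rewritten as a single output file, so seven input files become two files of equal size here.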

3. Dimension table fields

Join the code table as late as possible, ideally in the last step:

1) Join the required dimensions after the aggregation is complete;
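
A minimal sketch of this ordering, using in-memory rows instead of real tables. The names PRODUCT_DIM, orders, and aggregate_then_join are hypothetical. The point is that the dimension lookup runs once per aggregated row rather than once per fact row.

```python
from collections import defaultdict

# Hypothetical dimension (code) table.
PRODUCT_DIM = {"p1": "Laptop", "p2": "Phone"}

# Hypothetical fact rows that carry only the compact code.
orders = [
    {"product_code": "p1", "amount": 100},
    {"product_code": "p2", "amount": 50},
    {"product_code": "p1", "amount": 200},
    {"product_code": "p2", "amount": 25},
]


def aggregate_then_join(facts, dim):
    # Step 1: aggregate on the code only.
    totals = defaultdict(int)
    for row in facts:
        totals[row["product_code"]] += row["amount"]
    # Step 2: attach the dimension to the (much smaller) aggregated result.
    return [
        {"product_code": code, "product_name": dim.get(code), "total": total}
        for code, total in sorted(totals.items())
    ]


result = aggregate_then_join(orders, PRODUCT_DIM)
print(result)
```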

4. Full table optimization

1 Incremental data merge

Full tables usually hold a large amount of historical data, so the key question is how to merge incremental data into the full table efficiently;

2 Filter invalid data

1) Filter out deactivated data;

2) Filter out data that is no longer used;
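
The article does not say how the merge is done. A common approach, sketched here under that assumption, is to keep the latest version of each primary key: rows from the increment replace matching rows in the full table, and new keys are appended. The field names id and updated_at are hypothetical.

```python
def merge_increment(full_rows, delta_rows, key="id", version="updated_at"):
    """Return the new full snapshot, keeping the latest row for each key."""
    latest = {row[key]: row for row in full_rows}
    for row in delta_rows:
        current = latest.get(row[key])
        # An incremental row wins when the key is new or its version is not older.
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


full = [
    {"id": 1, "status": "a", "updated_at": "2024-01-01"},
    {"id": 2, "status": "b", "updated_at": "2024-01-01"},
]
delta = [
    {"id": 2, "status": "c", "updated_at": "2024-01-02"},  # update
    {"id": 3, "status": "d", "updated_at": "2024-01-02"},  # insert
]
merged = merge_increment(full, delta)
print(merged)
```

In SQL the same idea is often written as a union of the two tables followed by a row-number window over the key, keeping only the newest row.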

5. Add indexes

For tables with many partitions, where metadata queries are slow, partition indexes can be added;

2. Code optimization

1. Order optimization

1) When joining multiple tables, join first the tables that reduce the data volume;

2) In a join condition, put the most selective join key on the left;

3) In a where clause, filter by partition first and then by field, with the most selective field on the left;

4) In window keys, put the most selective key first;

5) Join the data first, and then apply the table-generating function;

2 Hint optimization

1) Explicitly specify a broadcast join;

2) Control the number of files written;
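
Assuming the engine is Spark SQL, both hints can be written directly in the query. The sketch below only builds the statements as Python strings. The table names fact_orders, dim_city, and dws_city_sales are hypothetical, and run_all expects a SparkSession that already exists.

```python
# Hypothetical tables: fact_orders (large), dim_city (small), dws_city_sales (target).

# 1) Explicit broadcast join: the BROADCAST hint asks Spark to ship the small
#    table to every executor instead of shuffling the large one.
BROADCAST_JOIN_SQL = """
SELECT /*+ BROADCAST(d) */ f.order_id, d.city_name
FROM fact_orders f
JOIN dim_city d ON f.city_code = d.city_code
"""

# 2) Control the number of files written: the COALESCE hint reduces the number
#    of output partitions, and therefore the number of files written.
FEWER_FILES_SQL = """
INSERT OVERWRITE TABLE dws_city_sales
SELECT /*+ COALESCE(8) */ city_code, SUM(amount) AS total_amount
FROM fact_orders
GROUP BY city_code
"""


def run_all(spark):
    """Run both statements on an existing SparkSession (not created here)."""
    spark.sql(BROADCAST_JOIN_SQL).show()
    spark.sql(FEWER_FILES_SQL)
```

A REPARTITION hint can be used in the same position when the output should be redistributed evenly rather than only reduced.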

3 Primary key filtering

Handle empty, null, and abnormal values of the join key before joining; this reduces the amount of data joined and avoids data skew:
1) id = 'abnormal primary key'
2) id is null
3) id = ''
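
The three conditions above can be sketched as a filter applied before the join. The placeholder values in ABNORMAL_KEYS and the function names are hypothetical; which values count as abnormal depends on the data.

```python
ABNORMAL_KEYS = {"-1", "unknown"}  # hypothetical placeholder values


def is_valid_key(value):
    """True when the key is not null, not empty, and not a known abnormal value."""
    return value is not None and value != "" and value not in ABNORMAL_KEYS


def join_on_valid_keys(left_rows, right_by_id):
    """Inner join on id, setting aside rows whose key cannot match anyway."""
    joined, skipped = [], []
    for row in left_rows:
        if is_valid_key(row["id"]):
            match = right_by_id.get(row["id"])
            if match is not None:
                joined.append({**row, **match})
        else:
            skipped.append(row)
    return joined, skipped


left = [
    {"id": "a", "amount": 1},
    {"id": None, "amount": 2},
    {"id": "", "amount": 3},
    {"id": "-1", "amount": 4},
    {"id": "b", "amount": 5},
]
right = {"a": {"name": "Alice"}, "b": {"name": "Bob"}}
joined, skipped = join_on_valid_keys(left, right)
print(joined)
print(len(skipped))
```

Because null, empty, and placeholder keys are often the most frequent values, removing them before the join also removes the largest skewed groups.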

4 Approximate calculations

When exact statistics are not required, approximate calculations can be used;

1) Sample the table;

2) Use approximate aggregation operators;
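
To show what an approximate operator trades away, here is a minimal distinct-count estimator based on the k-minimum-values sketch. It is a simplified stand-in chosen for brevity, not the algorithm any particular engine uses (Spark SQL's approx_count_distinct, for example, is based on HyperLogLog++). The function name and the default k are assumptions.

```python
import hashlib
import heapq


def approx_distinct(values, k=1024):
    """Estimate the number of distinct values with a k-minimum-values sketch.

    Only the k smallest hash values are kept, so memory stays bounded no
    matter how many rows are scanned.
    """
    heap = []     # max-heap (stored as negatives) of the k smallest hashes
    kept = set()  # the same hashes, for duplicate detection
    for value in values:
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big") / 2.0 ** 64  # uniform in [0, 1]
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values: the count is exact
    # The k-th smallest hash of n uniform values is about k / n.
    return int((k - 1) / -heap[0])


rows = [i % 20000 for i in range(100000)]
estimate = approx_distinct(rows)
print(estimate)
```

With k = 1024 the relative error is typically a few percent, which is usually acceptable for trend reports and dashboards.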

5 Data skew

6 Read file optimization

1) Optimize reads of large numbers of small files;
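
The article names data skew without giving a remedy. One common technique, shown here as an assumption rather than the author's method, is two-stage aggregation with a random salt: a hot key is first split across several partial groups and the partial results are then combined. The function name and parameters are hypothetical.

```python
import random
from collections import Counter


def salted_count(keys, num_salts=4, seed=0):
    """Count keys in two stages so that no single group holds a whole hot key."""
    rng = random.Random(seed)
    # Stage 1: append a random salt, spreading each key over up to num_salts groups.
    partial = Counter((key, rng.randrange(num_salts)) for key in keys)
    # Stage 2: drop the salt and combine the partial counts.
    final = Counter()
    for (key, _salt), count in partial.items():
        final[key] += count
    return partial, final


keys = ["hot"] * 1000 + ["a", "b"]
partial, final = salted_count(keys)
largest_hot_group = max(c for (k, _), c in partial.items() if k == "hot")
print(final["hot"], largest_hot_group)
```

The final counts are identical to a direct count, but the largest group processed in the first stage is only a fraction of the hot key's rows.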

3. Environment optimization

1. Hardware resources

1) CPU

2) Memory

a. When memory is short, more data spills to disk; spark.shuffle.spill.numElementsForceSpillThreshold sets the number of in-memory elements beyond which a spill is forced;

b. With a faster disk, spilled data is written to disk more quickly;

3) Disk

a. If shuffle read and shuffle write are slow, disk IOPS may have reached its upper limit;

b. Configure more volumes on a single node

Keep the total size of all volumes equal to the previous single-volume size; this increases the total disk read and write throughput of a single task node.

For example, for a cluster's task spot instance group, you can use four 1 TB EBS volumes instead of one 4 TB volume.

4) Network bandwidth

Check whether the network bandwidth has reached its upper limit.

5) Bucket bandwidth

Check whether the bucket bandwidth has reached its upper limit.

2. Spread out tasks

The more concentrated the tasks, the worse the overall performance, so try to distribute tasks evenly; the larger the cluster, the worse the overall performance;

1) Tasks with low timeliness requirements can be moved to off-peak periods;

2) If several days need to be backfilled, consider whether execution can be delayed so that the task runs only once;

3) Run high-value tasks first and low-value tasks during off-peak periods;

4) Run key tasks on a stable cluster, so that retries caused by downtime or resource reclamation do not affect timeliness;

5) Split clusters by importance and by region;

3. Spot cluster

Suitable for tasks with a short execution time and little shuffle.

5. Make sure the driver is stable

1) Run the driver on a core node where possible

If the driver is started on a non-core node and that node is reclaimed, the task fails and cannot be retried.

2) Do not run non-driver tasks on core nodes

This avoids a resource shortage that would prevent the driver from starting.
 

6. Object storage

  • Object storage flow control and request-rate limits;
  • Object versioning in object storage;
  • Tiered storage of data;

4. Scheduling optimization

1. Do not schedule too many tasks at the same time (limit scheduling parallelism)

2. Change high-frequency tasks to low-frequency ones (hourly to daily, daily to weekly)
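
A parallelism limit can be sketched as assigning tasks to consecutive time slots with a cap per slot, highest priority first. The task names, the priority field, and the function assign_slots are hypothetical; a real scheduler would also respect task dependencies.

```python
def assign_slots(tasks, max_parallel):
    """Assign tasks to time slots, at most max_parallel tasks per slot.

    Tasks are ordered by descending priority, so high-value tasks land in
    the earliest slots and low-value tasks are pushed later.
    """
    ordered = sorted(tasks, key=lambda t: -t["priority"])
    return [
        [t["name"] for t in ordered[i:i + max_parallel]]
        for i in range(0, len(ordered), max_parallel)
    ]


tasks = [
    {"name": "report_a", "priority": 3},
    {"name": "report_b", "priority": 1},
    {"name": "core_etl", "priority": 5},
    {"name": "cleanup", "priority": 0},
    {"name": "core_dim", "priority": 4},
]
slots = assign_slots(tasks, max_parallel=2)
for index, names in enumerate(slots):
    print(index, names)
```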

5. External optimization

Check whether adjusting upstream and downstream tasks can reduce the complexity of this task;


 



Origin blog.csdn.net/weixin_40829577/article/details/124894568