Why we chose Parquet as the data storage format

Scenario

 

We built a data warehouse for customer login logs, and in day-to-day business use a few patterns kept recurring:

A. Most queries need to join a dimension table.

B. In the end, only the data for a certain product within a certain time range is needed.

C. Only a handful of fields are of interest.

Based on this, we decided to join the dimension tables once a day and store the joined result separately, so that each business line runs its offline computations directly on the pre-joined data.
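A minimal sketch of what that daily job could look like, assuming a Spark 2.x SparkSession and made-up paths and column names (`product_id`, `dt`); it illustrates the approach rather than our exact job:

```scala
import org.apache.spark.sql.SparkSession

object DailyLoginJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daily-login-dim-join")
      .enableHiveSupport()   // optional; only needed if the result is later queried through Hive
      .getOrCreate()

    // Hypothetical inputs: raw login facts and a product dimension table.
    val logins  = spark.read.parquet("/warehouse/login_log/raw")
    val product = spark.read.parquet("/warehouse/dim/product")

    // Join the dimension once per day and persist the result, so every
    // downstream job reads the pre-joined data directly.
    val joined = logins.join(product, Seq("product_id"), "left")

    joined.write
      .mode("overwrite")
      .partitionBy("product_id", "dt")   // one folder per product, then per day
      .parquet("/warehouse/login_log/joined")

    spark.stop()
  }
}
```

Partitioning by product and day at write time is what later makes partition filtering cheap.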

 

 

External factors in choosing Parquet

 

Among the various columnar formats we eventually settled on Parquet. Besides Parquet's own advantages, the following external factors mattered:

 

A. The company had already deployed a Spark cluster at that time, and Spark supports Parquet natively; it is Spark's recommended storage format and its default data source format.

 

B. Hive supports Parquet as a storage format, so if we later query the data with HiveQL it remains fully compatible (see the sketch below).
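As an illustration of both points, continuing the hypothetical `spark` session and `joined` DataFrame from the sketch above: with no format specified, Spark's writer falls back to its default data source (Parquet), and the same files can be exposed to Hive as an external table. The table and column names here are assumptions.

```scala
// Default data source: save() with no explicit format uses
// spark.sql.sources.default, which is "parquet" out of the box.
joined.write.mode("overwrite").save("/warehouse/login_log/joined_default")

// Register the partitioned Parquet directory as a Hive external table so
// later HiveQL queries work unchanged (requires Hive support on the session).
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS login_log_joined (
    user_id STRING,
    login_time TIMESTAMP
  )
  PARTITIONED BY (product_id STRING, dt STRING)
  STORED AS PARQUET
  LOCATION '/warehouse/login_log/joined'
""")
spark.sql("MSCK REPAIR TABLE login_log_joined")  // discover the existing partition directories
```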

 

 

Internal factors in choosing Parquet

 

Let's look at Parquet's own advantages by comparing Parquet with CSV.

 

An uncompressed CSV file stored on HDFS takes up exactly its original file size; counting replicas, it takes the original size × the replication factor.

 

Parquet with different compression codecs

[Figure: on-disk size of the raw log stored as CSV vs Parquet (uncompressed / gzip / snappy)]

Note: the original log is about 214 GB and has 120+ fields.

Stored as CSV (uncompressed), there is essentially no size reduction.

Stored as Parquet in uncompressed, gzip, and snappy modes, the sizes are 17.4 GB, 8.0 GB, and 11 GB respectively, giving compression ratios of roughly 12×, 27×, and 19×.

Even with 3 replicas on HDFS, the effective ratio relative to the single 214 GB source file is still about 4×, 9×, and 6× (e.g. 214 / (8.0 × 3) ≈ 9).
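For reference, this is roughly how the three variants above can be produced; `spark` and the paths are the same hypothetical ones as before, and the codec names are values Spark's Parquet writer accepts:

```scala
val df = spark.read.parquet("/warehouse/login_log/joined")

// Write the same data once per codec, matching the three variants measured above.
Seq("uncompressed", "gzip", "snappy").foreach { codec =>
  df.write
    .mode("overwrite")
    .option("compression", codec)              // per-write codec
    .parquet(s"/tmp/login_log_parquet_$codec")
}

// Alternatively, set a session-wide default codec instead of a per-write option:
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
```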

 

Partition filtering and column pruning

Partition filtering

Parquet combined with Spark supports partition filtering very well. For example, if you only need one product's data for a certain period, only the matching directories on HDFS are read.

The filter and where clauses of Spark SQL, and the equivalent DataFrame/RDD operations, all trigger partition filtering.

Partitions are created with Spark's partitionBy. Passing multiple columns creates a multi-level layout: the first column becomes the first-level partition and the second column the second-level partition.
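A short sketch of a partition-filtered read against the layout produced by the partitionBy write in the first sketch; the product value `app_x` and the dates are made up:

```scala
import org.apache.spark.sql.functions.col

// partitionBy("product_id", "dt") lays the data out as
//   /warehouse/login_log/joined/product_id=app_x/dt=2020-12-01/...
// Filtering on those partition columns only lists and reads the matching
// directories on HDFS (partition pruning).
val oneProductWeek = spark.read
  .parquet("/warehouse/login_log/joined")
  .filter(col("product_id") === "app_x" &&
          col("dt").between("2020-12-01", "2020-12-07"))

oneProductWeek.count()

// The equivalent WHERE clause in Spark SQL is pruned the same way:
//   SELECT * FROM login_log_joined
//   WHERE product_id = 'app_x' AND dt BETWEEN '2020-12-01' AND '2020-12-07'
```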

 

Column pruning

Column pruning, simply put, means reading only the columns we actually want to retrieve.

The fewer columns we read, the faster the query. Reading all columns, for example all 120+ of ours, is extremely inefficient and defeats the purpose of using Parquet in the first place.
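Column pruning then just means selecting the few columns you need; a minimal sketch under the same assumptions, where only the selected column's chunks are read from disk:

```scala
import org.apache.spark.sql.functions.col

// Read one day and a single column out of the 120+; the Parquet reader
// only scans the matching column chunks in each row group.
val oneColumn = spark.read
  .parquet("/warehouse/login_log/joined")
  .filter(col("dt") === "2020-12-01")
  .select("user_id")

println(oneColumn.count())
```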

The partition filtering and column pruning tests are as follows:

 

[Figure: partition filtering / column pruning test results — task count, input size, and elapsed time from the Spark web UI]

Description:

A. The task counts, input sizes, and elapsed times are all real figures taken from the Spark web UI.

B. There is no CSV run for comparison: with more than 200 GB of data and 120 fields per record, even reading a single field and counting it caused the CSV job to lose executors.

C. Note: to prevent automatic optimizations from skipping work, we print the value of every field of every record. (A large part of the elapsed time above is presumably spent doing this.)

D. From the comparison in the figure above we can see:

  • When we read all records, the three compression modes differ little in elapsed time: about 7 minutes each.

  • When we read only a single day, the advantage of Parquet's partition filtering shows: the time drops to roughly one sixth, which matches the full data set covering about seven or eight days.

  • When we read only one field of that day, the time drops again. The disk only scans the column chunks of the row groups containing that column, which saves a large amount of IO.


E. Make sure the filterPushdown option is enabled when testing.
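Filter pushdown is controlled by a Spark SQL setting. It defaults to true in recent Spark versions, but setting it explicitly makes the benchmark unambiguous:

```scala
// Push filters down to the Parquet reader so that row-group statistics
// can be used to skip data that cannot match the predicate.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
```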

Conclusion

Parquet with gzip has the highest compression ratio, up to 27× if replication is not counted, which is perhaps why the Spark version we used wrote Parquet with gzip compression by default.

Partition filtering and column pruning save a significant amount of disk IO and reduce the load on the servers.

If your data has many fields but each business only reads a small number of them, Parquet is a very good choice.


Origin: blog.csdn.net/ccpit2b2c/article/details/111511165