[Reprint] Why Spark SQL chose Parquet as the default external data source type for DataFrames

 

 

Five reasons to choose Parquet for Spark SQL

The following details why Spark SQL uses Parquet as its default input and output data source.
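
For context, "default" here means that when no format is specified, Spark SQL's DataFrame reader and writer fall back to the spark.sql.sources.default setting, which is parquet out of the box. Below is a minimal sketch against the Spark 1.x API (the HDFS paths are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetDefaultSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parquet-default"))
    val sqlContext = new SQLContext(sc)

    // No .format(...) is given, so load() reads Parquet files.
    val df = sqlContext.read.load("hdfs:///user/spark/some_parquet_table")

    // Likewise, save() without a format writes Parquet.
    df.limit(100).write.save("hdfs:///user/spark/some_parquet_sample")
  }
}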

To understand just how powerful Parquet is, we selected 24 queries derived from TPC-DS for the comparison (out of 99 queries in total; at the 1TB scale factor, some queries simply cannot run against the flat CSV data files in spark-perf-sql, as explained below). These queries represent all the TPC-DS categories: reporting, ad hoc, iterative, and data mining. We made sure to include both short queries (queries 12 and 91) and long-running queries (queries 24a and 25), as well as a query well known for using 100% of the CPU (query 97).

We used an on-premises six-node Cisco UCS cluster configured similarly to a Cisco Validated Design. We tuned the underlying hardware so that none of the tests ran into network or disk I/O bottlenecks. The focus of this article is how differently these queries perform on Spark 1.5.1 and the just-released Spark 1.6.0 when run against the text and Parquet storage formats. The total Spark working storage was 500 GB, and the TPC-DS scale factor was 1TB.

1. Spark SQL is faster with Parquet!

The chart below compares the total execution time of all 24 queries on Spark 1.5.1. With flat CSV files, the queries took about 12 hours to complete; with Parquet, they completed in less than one hour, an 11x performance improvement.

2. Spark SQL works better at large scale with Parquet (fewer problems occur when Parquet is used)

Choosing the wrong storage format often leads to failures that are hard to diagnose and hard to fix. At the 1TB scale factor, for example, at least one-third of the queries could not complete when flat CSV files were used, whereas all of them completed with Parquet.

Some of the errors and exceptions were very cryptic. Here are three examples of problems that occurred frequently:

Error Example 1:

This error is very common; it is thrown when the shuffle cannot fetch map output data after the map phase:

WARN scheduler.TaskSetManager: Lost task 145.0 in stage 4.0 (TID 4988, rhel8.cisco.com): FetchFailed(BlockManagerId(2, rhel4.cisco.com, 49209), shuffleId=13, mapId=47, reduceId=145, message=

org.apache.spark.shuffle.FetchFailedException: java.io.FileNotFoundException: /data6/hadoop/yarn/local/usercache/spark/appcache/application_1447965002296_0142/blockmgr-44627d4c-4a2b-4f53-a471-32085a252cb0/15/shuffle_13_119_0.index (No such file or directory)

at java.io.FileInputStream.open0(Native Method)

at java.io.FileInputStream.open(FileInputStream.java:195)

Error Example 2:

This error is also very common; it indicates that shuffle data has been lost:

WARN scheduler.TaskSetManager: Lost task 1.0 in stage 13.1 (TID 13621, rhel7.cisco.com): FetchFailed(null, shuffleId=9, mapId=-1, reduceId=148, message=

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 9

at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:460)

at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:456)

at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)

at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)

Error Example 3:

This error is also very common; an executor is lost along with its data:

ERROR cluster.YarnScheduler: Lost executor 59 on rhel4.cisco.com: remote Rpc client disassociated

Most of these failures force Spark to retry by re-queuing tasks (or even restarting some stages). From there, things only get worse; eventually the application fails, or appears to never complete.

Simply by switching to Parquet, without changing any other Spark configuration, these problems were resolved. Compression reduces the file sizes, the columnar layout allows reading only the selected data, and the reduced input directly affects the decisions the scheduler makes about the Spark execution DAG (see below for more details). All of these advantages are critical to Parquet's fast query performance.
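
As a rough sketch of the one-time conversion this implies (the paths are illustrative, and the CSV reader assumed here is the spark-csv package that shows up as CsvRelation in the plans below):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CsvToParquetSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("csv-to-parquet"))
    val sqlContext = new SQLContext(sc)

    // Read the pipe-delimited TPC-DS text files with the spark-csv package.
    val csv = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .option("delimiter", "|")
      .load("hdfs:///user/spark/hadoopds1000g/catalog_sales/*")

    // Writing Parquet compresses the data (gzip by default in Spark 1.5/1.6)
    // and lays it out column by column.
    csv.write.parquet("hdfs:///user/spark/hadoopds1tbparquet/catalog_sales")
  }
}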

3. Less disk I/O

With compression enabled, Parquet reduces data storage by 75% on average; that is, 1TB of data files takes up only about 250 GB of disk space. This dramatically reduces the input data required by a Spark SQL application. In addition, in Spark 1.6.0 the Parquet reader uses push-down filters (predicate pushdown) to further reduce disk I/O. Push-down filters allow selection decisions to be made on the data before it is read into Spark. For example, consider how the BETWEEN clause in query 97 is processed:

select cs_bill_customer_sk customer_sk, cs_item_sk item_sk

from catalog_sales,date_dim

where cs_sold_date_sk = d_date_sk

and d_month_seq between 1200 and 1200 + 11

The Spark SQL explain output shows the following scan in the physical plan for this query:

+- Scan ParquetRelation[d_date_sk#141,d_month_seq#144L] InputPaths: hdfs://rhel10.cisco.com/user/spark/hadoopds1tbparquet/date_dim/_SUCCESS, hdfs://rhel10.cisco.com/user/spark/hadoopds1tbparquet/date_dim/_common_metadata, hdfs://rhel10.cisco.com/user/spark/hadoopds1tbparquet/date_dim/_metadata, hdfs://rhel10.cisco.com/user/spark/hadoopds1tbparquet/date_dim/part-r-00000-4d205b7e-b21d-4e8b-81ac-d2a1f3dd3246.gz.parquet, hdfs://rhel10.cisco.com/user/spark/hadoopds1tbparquet/date_dim/part-r-00001-4d205b7e-b21d-4e8b-81ac-d2a1f3dd3246.gz.parquet, PushedFilters: [GreaterThanOrEqual(d_month_seq,1200), LessThanOrEqual(d_month_seq,1211)]]

Here, PushedFilters returns only the records whose d_month_seq column lies in the 1200-1211 range, which is only a handful of records. By contrast, when a flat file is used, the entire table (every column of every row) is read, as the physical plan shows:

[                  Scan CsvRelation(hdfs://rhel10.cisco.com/user/spark/hadoopds1000g/date_dim/*,false,|,",null,PERMISSIVE,COMMONS,false,false,StructType(StructField(d_date_sk,IntegerType,false), StructField(d_date_id,StringType,false), StructField(d_date,StringType,true), StructField(d_month_seq,LongType,true), StructField(d_week_seq,LongType,true), StructField(d_quarter_seq,LongType,true), StructField(d_year,LongType,true), StructField(d_dow,LongType,true), StructField(d_moy,LongType,true), StructField(d_dom,LongType,true), StructField(d_qoy,LongType,true), StructField(d_fy_year,LongType,true), StructField(d_fy_quarter_seq,LongType,true), StructField(d_fy_week_seq,LongType,true), StructField(d_day_name,StringType,true), StructField(d_quarter_name,StringType,true), StructField(d_holiday,StringType,true), StructField(d_weekend,StringType,true), StructField(d_following_holiday,StringType,true), StructField(d_first_dom,LongType,true), StructField(d_last_dom,LongType,true), StructField(d_same_day_ly,LongType,true), StructField(d_same_day_lq,LongType,true), StructField(d_current_day,StringType,true), StructField(d_current_week,StringType,true), StructField(d_current_month,StringType,true), StructField(d_current_quarter,StringType,true), StructField(d_current_year,StringType,true)))[d_date_sk#141,d_date_id#142,d_date#143,d_month_seq#144L,d_week_seq#145L,d_quarter_seq#146L,d_year#147L,d_dow#148L,d_moy#149L,d_dom#150L,d_qoy#151L,d_fy_year#152L,d_fy_quarter_seq#153L,d_fy_week_seq#154L,d_day_name#155,d_quarter_name#156,d_holiday#157,d_weekend#158,d_following_holiday#159,d_first_dom#160L,d_last_dom#161L,d_same_day_ly#162L,d_same_day_lq#163L,d_current_day#164,d_current_week#165,d_current_month#166,d_current_quarter#167,d_current_year#168]]
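
One way to check this yourself is to call explain on a pruned and filtered DataFrame; the range predicate should show up under PushedFilters in the Parquet scan, just as in the plan above. This is only a sketch: the column names come from the date_dim table used in query 97, and the path is illustrative.

val dateDim = sqlContext.read.parquet("hdfs:///user/spark/hadoopds1tbparquet/date_dim")

val filtered = dateDim
  .select("d_date_sk", "d_month_seq")                 // column pruning
  .filter("d_month_seq between 1200 and 1200 + 11")   // candidate for predicate pushdown

// Prints the logical and physical plans; look for PushedFilters in the Parquet scan.
filtered.explain(true)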

4. Spark 1.6.0 provides higher scan throughput

The Databricks blog post for the Spark 1.6.0 release mentioned significantly improved flat-schema scan throughput, attributing it to a "more optimized code path." To illustrate this in the real world, we ran query 97 on Spark 1.5.1 and on 1.6.0 and captured nmon data. The improvement was very noticeable.

First, the query response time was cut in half: query 97 took 138 seconds on Spark 1.5.1 but only 60 seconds on Spark 1.6.0.

Figure 2. Time taken by query 97 with Parquet (in seconds)

 

Second, CPU usage on the worker nodes is somewhat lower in Spark 1.6.0, mainly thanks to SPARK-11787:

Figure 3. CPU usage for query 97 on Spark 1.6.0, peaking at 70%

 

Figure 4. CPU usage for query 97 on Spark 1.5.1, peaking at 100%

Related to the numbers above, disk read throughput is 50% higher on Spark 1.6.0:

Figure 5. Disk read throughput on Spark 1.5.1 and 1.6.0

5. Efficient Spark execution graphs

Beyond a smarter reader such as Parquet's, the data format also directly affects the Spark execution graph, because one of the scheduler's main inputs is the RDD count. In our example, we ran the same query 97 on Spark 1.5.1 with text and with Parquet and obtained the following execution patterns for the stages.

With text, there are many long-running stages (note that the unit on the y-axis is milliseconds).

Figure 6. Execution stages with text

With Parquet, although there are more stages, the work executes quickly, and only two long-running stages appear near the end of the job. This indicates that the parent-child stage boundaries are more clear-cut, so less intermediate data needs to be persisted to disk and/or shuffled across the network between nodes, which speeds up end-to-end execution.

Figure 7. Execution stages with Parquet
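
To make the "scheduler input" point above a bit more concrete, here is a small illustrative sketch (hypothetical paths) that compares how many partitions, and therefore how many initial tasks, a text scan and a Parquet scan produce for the same table:

val textRdd   = sc.textFile("hdfs:///user/spark/hadoopds1000g/catalog_sales/*")
val parquetDf = sqlContext.read.parquet("hdfs:///user/spark/hadoopds1tbparquet/catalog_sales")

println(s"text partitions:    ${textRdd.partitions.length}")
println(s"parquet partitions: ${parquetDf.rdd.partitions.length}")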

Conclusion

Parquet performs extremely well with Spark SQL. It not only provides higher compression rates, but also allows reading only the records of interest through selected columns and low-level reader filters. So if you need to make multiple passes over your data, it is probably worth spending some time encoding your existing flat files into Parquet.

Origin www.cnblogs.com/huomei/p/12094014.html