Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/u012292754/article/details/84194170
1 DataFrame
DataFrame = RDD + Schema
- In Scala, DataFrame is just a type alias for Dataset[Row]
- Advantages of DataFrame over RDD: Catalyst optimization and schemas
- DataFrame can handle: text, JSON, Parquet, …
- Queries written in SQL and queries written with DataFrame API functions are both optimized by Catalyst
2 Schema
https://spark.apache.org/docs/2.1.3/sql-programming-guide.html#interoperating-with-rdds
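- inferred: Spark derives the schema from the data by reflection
- explicit: the schema is specified programmatically (StructType / StructField)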
3 Loading & Saving Results
https://spark.apache.org/docs/2.1.3/sql-programming-guide.html#save-modes
4 SQL Function Coverage
SQL coverage:
- SQL:2003 support
- Runs all 99 TPC-DS benchmark queries
- Subquery support
- Vectorization
5 External Data Sources
https://spark-packages.org/
- RDBMS: requires the JDBC driver jars
- Parquet, Phoenix, CSV, Avro, …