Azure Databricks for Spark

Azure Databricks is a data analytics platform optimized for the Microsoft Azure cloud services platform. It offers two environments:

Azure Databricks SQL Analytics: it provides an easy-to-use platform for analysts who want to run SQL queries on their data lake, create multiple visualization types to explore query results from different perspectives, and build and share dashboards.
Azure Databricks Workspace: it provides an interactive workspace that enables collaboration between data engineers, data scientists, and machine learning engineers. For a big data pipeline:
- The data (raw or structured) is ingested into Azure through Azure Data Factory in batches, or streamed in near real time using Apache Kafka, Event Hub, or IoT Hub.
- This data lands in a data lake for long-term persisted storage, in Azure Blob Storage or Azure Data Lake Storage.
- As part of your analytics workflow, you use Azure Databricks to read data from multiple data sources and turn it into breakthrough insights using Spark.

Azure Databricks Workspace

workspace: an interactive environment for accessing all of your Databricks assets. From it you can manage Databricks objects such as clusters, notebooks, and jobs.

cluster: a set of computation resources and configurations on which notebooks and jobs run.

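pool: a set of idle, ready-to-use instances that reduces cluster start and auto-scaling times.
runtime: the set of core components that run on a cluster. Variants include Runtime for Machine Learning, which adds machine learning libraries (TensorFlow, Keras, PyTorch, Horovod, scikit-learn, and XGBoost); Runtime for Genomics, which is optimized for biomedical data; and a Light version.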

job: a non-interactive mechanism for running a notebook or library, either immediately or on a schedule.

data loading, feature engineering, model training, hyperparameter tuning, model inference, and model deployment and export

Delta Lake: an open-source storage layer that brings reliability to data lakes (the delta format). Delta Lake offers:

ACID transactions: Serializable isolation levels ensure that readers never see inconsistent data.
Time travel: Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments.
...
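
To make the link between atomic commits and time travel concrete, here is a minimal sketch in plain Python. It is a toy model under stated assumptions, not Delta Lake's implementation or API: the class `ToyVersionedTable` and its methods are hypothetical names invented for illustration. The idea it shows is that every commit appends one entry to an ordered log, and reading a given version replays the log up to that entry.

```python
# Toy model of a versioned table. NOT Delta Lake's implementation or API;
# all names here are hypothetical and exist only for illustration.


class ToyVersionedTable:
    def __init__(self):
        # One log entry per committed version, in commit order.
        self._log = []

    def commit(self, add=(), remove=()):
        """Record one new version as a single log entry; return its number."""
        self._log.append({"add": list(add), "remove": list(remove)})
        return len(self._log) - 1

    def read(self, version=None):
        """Return the rows visible at `version` (default: the latest)."""
        if not self._log:
            return []
        if version is None:
            version = len(self._log) - 1
        if not 0 <= version < len(self._log):
            raise ValueError(f"unknown version {version}")
        rows = []
        # Replay every commit up to and including the requested version.
        for entry in self._log[: version + 1]:
            rows = [r for r in rows if r not in entry["remove"]]
            rows.extend(entry["add"])
        return rows

    def history(self):
        """Audit trail: (version, rows added, rows removed) per commit."""
        return [
            (v, len(e["add"]), len(e["remove"]))
            for v, e in enumerate(self._log)
        ]


table = ToyVersionedTable()
v0 = table.commit(add=["alice", "bob"])
v1 = table.commit(add=["carol"], remove=["bob"])

print(table.read())            # rows at the latest version
print(table.read(version=v0))  # "time travel" back to the first version
print(table.history())
```

Because a reader only ever sees whole log entries, it never observes a half-applied change, and rolling back amounts to reading an earlier version number.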

Delta Engine: a query engine that provides an efficient way to process data in data lakes.

Notes

OLAP classified by query type:

OLAP engines classified by architecture:

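MPP architecture systems (Presto/Impala/SparkSQL/Drill, etc.): this approach focuses on the query engine, replacing the Hive + MapReduce architecture with a distributed query engine to improve query efficiency.
Search-engine architecture systems (Elasticsearch/Solr, etc.): data is converted into an inverted index at ingestion time and queried with a scatter-gather computation model. This trades flexibility for good performance: search-style queries can return in under a second, but for scan- and aggregation-heavy queries the response time degrades to minutes as the data volume grows.
Pre-computation systems (Druid/Kylin, etc.): data is pre-aggregated at ingestion time, giving up even more flexibility in exchange for performance, in order to achieve second-level responses on very large datasets.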
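
The search-engine and pre-computation approaches above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not how Elasticsearch, Solr, Druid, or Kylin are implemented: the sample `EVENTS` data and the functions `build_shard`, `scatter_gather`, and `pre_aggregate` are hypothetical names made up for this example.

```python
from collections import defaultdict

# Hypothetical sample events: (doc_id, city, text, amount).
EVENTS = [
    (1, "Paris", "spark cluster started", 10),
    (2, "Paris", "job failed on cluster", 5),
    (3, "Tokyo", "spark job finished", 7),
    (4, "Tokyo", "cluster resized", 3),
]


def build_shard(events):
    """Search-engine style: build an inverted index (term -> doc ids) at ingest."""
    index = defaultdict(set)
    for doc_id, _city, text, _amount in events:
        for term in text.split():
            index[term].add(doc_id)
    return index


def scatter_gather(shards, term):
    """Scatter the lookup to every shard, then gather and merge the hits."""
    hits = set()
    for shard in shards:
        hits |= shard.get(term, set())
    return sorted(hits)


def pre_aggregate(events):
    """Pre-computation style: keep only per-city totals at ingest."""
    totals = defaultdict(int)
    for _doc_id, city, _text, amount in events:
        totals[city] += amount
    return dict(totals)


# Two shards, each indexing half of the events.
shards = [build_shard(EVENTS[:2]), build_shard(EVENTS[2:])]

cluster_docs = scatter_gather(shards, "cluster")
city_totals = pre_aggregate(EVENTS)

print(cluster_docs)  # ids of documents containing the term "cluster"
print(city_totals)   # totals that were computed once, at ingest time
```

The flexibility trade-off shows up directly: `city_totals` answers "total per city" with a dictionary lookup, but it can no longer answer a question about any dimension that was dropped at ingest (for example, totals per search term), whereas the inverted index answers term lookups quickly but would need to touch every matching document to aggregate amounts.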

Reference

Azure Databricks documentation