Azure Databricks for Spark

Azure Databricks is a data analytics platform optimized for the Microsoft Azure cloud services platform. It offers two environments:

  • Azure Databricks SQL Analytics: it provides an easy-to-use platform for analysts who want to run SQL queries on their data lake, create multiple visualization types to explore query results from different perspectives, and build and share dashboards.
  • Azure Databricks Workspace: it provides an interactive workspace that enables collaboration between data engineers, data scientists, and machine learning engineers. For a big data pipeline:
    • the data (raw or structured) is ingested into Azure through Azure Data Factory in batches, or streamed in near real time using Apache Kafka, Event Hub, or IoT Hub.
    • this data lands in a data lake for long-term persistent storage, in Azure Blob Storage or Azure Data Lake Storage.
    • as part of your analytics workflow, use Azure Databricks to read data from multiple data sources and turn it into breakthrough insights using Spark.

Azure Databricks Workspace

Concepts

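workspace: an interactive environment for accessing all of your Databricks assets; it manages Databricks objects such as clusters, notebooks, and jobs.

cluster: a set of computation resources and configurations on which notebooks and jobs run.

  • pool: a set of idle, ready-to-use instances that reduces cluster start and auto-scaling times
  • runtime: the set of core components that run on a cluster. Variants include the runtime for Machine Learning, which adds machine learning libraries (TensorFlow, Keras, PyTorch, Horovod, scikit-learn, and XGBoost); the runtime for Genomics, optimized for biomedical data; and a Light version

job: a non-interactive mechanism for running a notebook or library, either immediately or on a schedule.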

Machine learning and deep learning

The workflow stages covered are data loading, feature engineering, model training, hyperparameter tuning, model inference, and model deployment and export.

Delta Lake and Delta Engine

Delta Lake: a storage layer that brings reliability to data lakes (the delta format). Delta Lake offers:

  • ACID transactions: Serializable isolation levels ensure that readers never see inconsistent data.
  • Time travel: Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments.
  • ...
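
To make the time-travel idea concrete, here is a minimal pure-Python sketch of a table that keeps every committed version as an immutable snapshot. It is a toy model only: `ToyVersionedTable` and its methods are hypothetical names invented for this illustration, not the Delta Lake API, and real Delta Lake stores a transaction log alongside Parquet files rather than full in-memory copies.

```python
import copy


class ToyVersionedTable:
    """Toy append-only commit log (hypothetical; NOT the Delta Lake API)."""

    def __init__(self):
        # Version 0 is the empty table; each commit adds one complete snapshot.
        self._snapshots = [[]]

    @property
    def latest_version(self):
        return len(self._snapshots) - 1

    def commit(self, new_rows):
        """Publish a new version holding the previous rows plus new_rows."""
        snapshot = self._snapshots[-1] + copy.deepcopy(new_rows)
        # The snapshot is appended in one step, so a reader sees either the
        # old version or the new one, never a half-written state.
        self._snapshots.append(snapshot)
        return self.latest_version

    def read(self, version=None):
        """Return a copy of the rows as of `version` (latest if omitted)."""
        if version is None:
            version = self.latest_version
        return copy.deepcopy(self._snapshots[version])

    def rollback(self, version):
        """Restore an old version by committing it again as the newest one."""
        self._snapshots.append(copy.deepcopy(self._snapshots[version]))
        return self.latest_version


table = ToyVersionedTable()
v1 = table.commit([{"id": 1, "value": "a"}])
v2 = table.commit([{"id": 2, "value": "b"}])

rows_v1 = table.read(version=v1)   # time travel: the table as of version 1
rows_latest = table.read()         # the current table
v3 = table.rollback(v1)            # history is kept; the rollback is a new version
```

Because the rollback is itself recorded as a new version, the earlier versions stay readable, which is what makes an audit trail and reproducible experiments possible.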

Delta Engine: a query engine that provides an efficient way to process data in data lakes.

Notes

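OLAP workloads by query type:

  • Ad hoc queries: hand-written SQL for one-off data analysis needs. This SQL varies widely in form and is logically complex, with no strict requirement on query time;
  • Canned (fixed) queries: recurring data-retrieval and reporting needs that have been standardized and are delivered to users as data products, improving the efficiency of data analysis and operations. This SQL has a fairly fixed form and demands fast response times.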
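
The difference between the two query types can be sketched with Python's built-in `sqlite3` module standing in for an OLAP engine. The `orders` table, its rows, and the `daily_total` helper are all made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("east", "2021-02-01", 120.0),
        ("east", "2021-02-02", 80.0),
        ("west", "2021-02-01", 200.0),
        ("west", "2021-02-02", 50.0),
    ],
)

# Ad hoc query: written by hand for a one-off question. Its shape changes
# from one analysis to the next, and nobody is waiting on a dashboard for it.
ad_hoc_sql = """
    SELECT region, AVG(amount) AS avg_amount
    FROM orders
    WHERE day >= '2021-02-01'
    GROUP BY region
    HAVING AVG(amount) > 100
    ORDER BY region
"""
ad_hoc_result = conn.execute(ad_hoc_sql).fetchall()

# Canned query: the SQL text is fixed and only the parameters change, so it
# can be wrapped in a function and exposed through a data product.
DAILY_TOTAL_SQL = "SELECT SUM(amount) FROM orders WHERE region = ? AND day = ?"


def daily_total(region, day):
    """Return the total order amount for one region on one day."""
    row = conn.execute(DAILY_TOTAL_SQL, (region, day)).fetchone()
    return row[0] or 0.0  # SUM over no rows yields NULL, reported here as 0.0
```

Since the canned query's shape is known in advance, it is the kind of query an engine can serve from indexes or pre-aggregated data, while the ad hoc query has to be planned and executed from scratch.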

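OLAP engines by architecture:

  • MPP systems (Presto/Impala/SparkSQL/Drill, etc.): these focus on the query engine, replacing the Hive + MapReduce architecture with a distributed query engine to improve query efficiency;
  • Search-engine-based systems (Elasticsearch/Solr, etc.): data is converted into an inverted index at ingest time and queried with a scatter-gather computation model. They trade flexibility for strong performance and can deliver sub-second responses on search-style queries, but for queries dominated by scans and aggregations, response time degrades to minutes as the data volume grows;
  • Pre-computation systems (Druid/Kylin, etc.): data is pre-aggregated at ingest time, giving up still more flexibility in exchange for performance, to achieve responses within seconds on very large datasets.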
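
The trade-offs above can be illustrated with a small pure-Python sketch. It assumes a made-up `EVENTS` dataset and hypothetical helper functions; it mimics the ideas (inverted index with scatter-gather, rollup at ingest, scan at query time) and does not use the API of any engine named in the list.

```python
from collections import defaultdict

EVENTS = [
    {"id": 0, "day": "2021-02-01", "page": "home", "views": 3, "text": "spark job started"},
    {"id": 1, "day": "2021-02-01", "page": "docs", "views": 5, "text": "delta table created"},
    {"id": 2, "day": "2021-02-02", "page": "home", "views": 2, "text": "spark job failed"},
    {"id": 3, "day": "2021-02-02", "page": "home", "views": 4, "text": "cluster resized"},
]


def build_inverted_index(events):
    """Search-engine style ingest: map each term to the ids of events containing it."""
    index = defaultdict(set)
    for event in events:
        for term in event["text"].split():
            index[term].add(event["id"])
    return index


def scatter_gather_search(shard_indexes, term):
    """Scatter the term lookup to every shard, then gather and merge the partial hits."""
    hits = set()
    for index in shard_indexes:
        hits |= index.get(term, set())
    return sorted(hits)


def build_rollup(events):
    """Pre-computation style ingest: keep only SUM(views) per (day, page)."""
    rollup = defaultdict(int)
    for event in events:
        rollup[(event["day"], event["page"])] += event["views"]
    return dict(rollup)


def scan_total_views(events, day, page):
    """Query-time scan over raw rows: flexible, but the cost grows with data volume."""
    return sum(e["views"] for e in events if e["day"] == day and e["page"] == page)


# Two shards, each indexed independently at ingest time.
shard_indexes = [build_inverted_index(EVENTS[:2]), build_inverted_index(EVENTS[2:])]
spark_hits = scatter_gather_search(shard_indexes, "spark")

# The rollup answers the aggregate with one dictionary lookup, but it no longer
# holds the raw text, so it cannot serve the term search above.
rollup = build_rollup(EVENTS)
```

The rollup and the scan return the same total for a given day and page, but only the scan can answer a question that was not anticipated at ingest time, which is the flexibility the pre-computation systems give up.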

Reference

Azure Databricks documentation

深入分析 Parquet 列式存储格式 (An In-Depth Analysis of the Parquet Columnar Storage Format)


Reposted from blog.csdn.net/qq_34276652/article/details/113884118