[Data Lake Architecture] The Azure Data Lake Guide

  • The Hitchhiker's Guide to the Data Lake

    • File size and number of files

    • File format

    • Partition scheme

    • Use query acceleration

    • How do I manage access to my data?

    • What data format do I choose?

    • How do I manage my data lake costs?

    • How do I monitor my data lake?

    • When is ADLS Gen2 the right choice for your data lake?

    • Key Considerations for Designing a Data Lake

    • Terminology

    • Organize and manage data in the data lake

    • Do I want a centralized or federated data lake implementation?

    • How should I organize my data?

    • Optimize your data lake for better scale and performance

    • Recommended reading

    • Questions, comments or feedback?

Azure Data Lake Storage Gen2 (ADLS Gen2) is a highly scalable and cost-effective data lake solution for big data analytics. As we continue to work with our customers to leverage ADLS Gen2 to uncover key insights from their data, we have identified some key patterns and considerations that will help them effectively leverage ADLS Gen2 in their large-scale Big Data platform architectures.

This document captures the considerations and best practices we have learned while working with our customers. For the purposes of this document, we will focus on the modern data warehouse pattern heavily used by our large enterprise customers on Azure, including solutions such as Azure Synapse Analytics.

We will expand this documentation to include more analytics patterns in future iterations.

Important: Please consider the contents of this document as guidance and best practices to help you make architectural and implementation decisions. This is not an official HOW-TO document.

When is ADLS Gen2 the right choice for your data lake? #


Enterprise data lakes are designed to be the central repository for the unstructured, semi-structured, and structured data used in big data platforms. The goal of an enterprise data lake is to eliminate data silos (where data is only accessible by one part of the organization) and provide a single storage tier that accommodates an organization's diverse data needs. For more information on choosing the right storage solution, see the Choosing a big data storage technology in Azure article.

A common question that arises is when to use a data warehouse versus a data lake. We urge you to think of data lakes and data warehouses as complementary solutions that work together to help you gain critical insights from your data. A data lake is a repository that stores all types of data from various sources. Data in its natural form is stored as raw data, and schemas and transformations are applied to this raw data to gain valuable business insights based on the key questions the business is trying to answer. A data warehouse is a store of highly structured, schematized data that is usually organized and processed for very specific insights. For example, a retail customer can store the past 5 years of sales data in a data lake; additionally, they can process data from social media to extract new trends in consumption and intelligence from retail analytics solutions, and use all of this together as input to generate a dataset that can be used to predict sales targets for the next year. They can then store the highly structured data in a data warehouse, where BI analysts can build targeted sales forecasts. In addition, they can use the same sales data and social media trends in the data lake to build intelligent machine learning models for personalized recommendations on their website.

ADLS Gen2 is an enterprise-grade, hyperscale data repository for big data analytics workloads. ADLS Gen2 provides faster performance and Hadoop-compatible access through the hierarchical namespace, lower cost, and security through fine-grained access controls and native AAD integration. It is a suitable choice for enterprise data lakes focused on big data analytics scenarios - extracting high-value structured data from unstructured data using transformations, advanced analytics using machine learning, or real-time data ingestion and analysis for rapid insights. It's worth noting that we've seen customers define hyperscale differently, depending on the data stored, the number of transactions, and the transaction throughput. When we say hyperscale, we usually mean petabytes of data and hundreds of Gbps of throughput; the challenges involved in that kind of analytics are very different from those of a few hundred GB of data and a few Gbps of transaction throughput.

Key considerations for designing a data lake#


When you are building an enterprise data lake on ADLS Gen2, it is important to understand your requirements for key use cases, including

  1. What am I storing in the data lake?

  2. How much data am I storing in the data lake?

  3. What portion of the data am I running analytics workloads on?

  4. Who needs access to which parts of my data lake?

  5. What kinds of analytics workloads will I be running on my data lake?

  6. What are the different transaction patterns for analytics workloads?

  7. What is my working budget?

For the key design/architecture questions we've been hearing from our customers, we anchor the rest of this document in the following structure:

  • Available options with pros and cons

  • Factors to consider when choosing the option that's right for you

  • Recommended patterns when applicable

  • Antipatterns you want to avoid

To get the most out of this document, identify your key scenarios and requirements, and weigh our options against your requirements to determine your approach. If you are unable to choose an option that perfectly fits your scenario, we recommend doing a Proof of Concept (PoC) with some options and letting the data guide your decision.

Terminology#


Before we discuss best practices for building a data lake, it's important to be familiar with the various terms we'll be using in the context of building a data lake with ADLS Gen2. This document assumes you have an account in Azure.

  • Resource: A manageable item available through Azure. Virtual machines, storage accounts, VNETs are examples of resources.

  • Subscription: An Azure subscription is a logical entity that separates management and financial (billing) logic for Azure resources. Subscriptions are associated with limits and quotas for Azure resources, you can read about them here.

  • Resource group: A logical container that holds the resources required by an Azure solution so they can be managed together as a group. You can read more about resource groups here.

  • Storage account: An Azure resource that contains all Azure Storage data objects: blobs, files, queues, tables, and disks. You can read more about storage accounts here. For the purposes of this document, we will focus on an ADLS Gen2 storage account - which is essentially an Azure Blob storage account with hierarchical namespaces enabled, which you can read more about here.

  • Container (also referred to as a container for non-HNS enabled accounts): A container organizes a set of objects (or files). A storage account has no limit on the number of containers, and a container can store an unlimited number of folders and files. Some properties can be applied at the container level, such as RBAC and SAS keys. (A short SDK sketch follows this list.)

  • Folders/Directories: Folders (also known as directories) organize a set of objects (other folders or files). There is no limit on how many folders or files can be created under a folder. Folders also have access control lists (ACLs) associated with them; there are two types of ACLs associated with folders - access ACLs and default ACLs - and you can read more about them here.

  • Object/File: A file is an entity that holds data that can be read/written. A file has an access control list associated with it. Files only have access ACLs, not default ACLs.
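
To make these terms concrete, here is a minimal sketch using the Python azure-storage-file-datalake SDK that creates a container, a folder hierarchy, and a file in an HNS-enabled account; the account name, container name, and paths are hypothetical placeholders.

    # Minimal sketch: container (file system) -> folders (directories) -> file.
    # The account name, container name, and paths below are hypothetical.
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        account_url="https://contosodatalake.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )

    filesystem = service.create_file_system(file_system="raw")
    directory = filesystem.create_directory("sensordata/2024/01/01")
    file_client = directory.create_file("readings.json")
    file_client.upload_data(b'{"sensorid": "s-001", "temperature": 21.4}', overwrite=True)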

Organizing and managing data in a data lake#


As our enterprise customers develop their data lake strategies, one of the key value propositions of ADLS Gen2 is to serve as a single data store for all of their analytics scenarios. Keep in mind that this single data store is a logical entity; depending on design considerations, it can appear as a single ADLS Gen2 account or as multiple accounts. Some customers have end-to-end ownership of the components of the analytics pipeline, while others have a central team/organization that manages the infrastructure, operations, and governance of the data lake while serving multiple consumers - either other organizations within the company or customers external to their business.

In this section, we present our thoughts and recommendations on a set of common questions we hear from customers when they design an enterprise data lake. As an illustration, we'll use the example of a large retail customer, Contoso.com, building its data lake strategy to support various predictive analytics scenarios.

Do I want a centralized or federated data lake implementation? #


As an enterprise data lake, you have two options available - either centralize all data management for your analytics needs within one organization, or adopt a federated model in which your customers manage their own data lakes while a centralized data team provides direction and manages a few key aspects of the data lake, such as security and data governance. It is important to remember that both the centralized and the federated data lake strategies can be implemented with a single storage account or with multiple storage accounts.

A common question customers ask us is whether they can build their data lake in a single storage account or whether they need multiple storage accounts. While technically a single ADLS Gen2 account can address your business needs, there are various reasons why customers choose multiple storage accounts, including (but not limited to) the scenarios discussed in the remainder of this section.

Key Considerations#


The following considerations can help you decide how many storage accounts to provision.

  • A single storage account enables you to manage a set of control plane management operations, such as RBAC for all data in the storage account, firewall settings, data lifecycle management policies, while allowing you to work with containers, files, and folders on the storage account. This is a good model to consider if you want to optimize for easier management, especially if you have a centralized data lake strategy.

  • Multiple storage accounts enable you to segregate data between different accounts so that you can apply different management policies to them or manage their billing/costing logic independently. If you are considering a federated data lake strategy, where each organization or business unit has its own set of manageability requirements, then this model may be best for you.

Let's put these aspects in the context of some scenarios.

Enterprise data lake with global coverage#


Driven by global markets and/or geographically distributed organizations, there are cases where an enterprise's analytics scenarios span multiple geographic regions. The data itself can be divided into two broad categories:

  • Data that can be shared globally across all regions - eg Contoso is trying to plan sales targets for the next fiscal year and wants to get sales data from each region.

  • Data that needs to stay within a single region - for example, Contoso wants to provide a personalized buyer experience based on the buyer's profile and buying patterns. Given that this is customer data, data sovereignty requirements need to be met, so the data cannot leave the region.

In this case, the customer would provision region-specific storage accounts to store data for a particular region and allow sharing of specific data with other regions. There is still a centralized logical data lake, with a central set of infrastructure for management, data governance, and other operations that spans the multiple storage accounts.


Client or Data Specific Isolation#


There are scenarios where an enterprise data lake serves multiple customer (internal/external) scenarios, which may be subject to different requirements - different query patterns and different access requirements. Let's take our example of Contoso.com, they have an analytics solution to manage their operations. In this case, they have various data sources—employee data, customer/event data, and financial data—that are subject to different governance and access rules and may also be managed by different organizations within the company. In this case, they can choose to create different data lakes for various data sources.

In another scenario, an enterprise that runs a multi-tenant analytics platform serving multiple customers may end up provisioning separate data lakes for its customers in different subscriptions, to help ensure that each customer's data and its associated analytics workloads are isolated from other customers and to help manage their cost and billing models.


Recommendations#

  • Create separate storage accounts (preferably in separate subscriptions) for your development and production environments. This helps you efficiently track and optimize management and billing policies, in addition to ensuring sufficient isolation between development and production environments that require different SLAs.

  • Identify the different logical sets of data and consider the need to manage them in a consolidated or segregated fashion - this will help define your account boundaries.

  • Start your design approach with one storage account, and consider why you need multiple storage accounts (isolation, region-based requirements, etc.) and not the other way around.

  • Other resources (e.g. VM cores, ADF instances) also have subscription limits and quotas - take these into consideration when designing your data lake.

Anti-Patterns#


Beware of multiple data lake management#


When you decide on the number of ADLS Gen2 storage accounts, make sure to optimize for your consumption patterns. If you don't need isolation and you're not fully utilizing your storage account's capabilities, you'll incur the overhead of managing multiple accounts without a meaningful return on investment.

Copy data back and forth#


One of the things you need to be careful about when you have multiple data lakes is whether and how you replicate data across multiple accounts. This creates an administrative problem of what the source of truth is and how fresh it needs to be, and also consumes transactions that involve copying data back and forth. If you have a legitimate plan to replicate your data, we have some features on our roadmap to make this workflow easier.

A note on scalability#


A common question our customers ask is whether a single storage account can continue to scale indefinitely to meet their data, transaction, and throughput needs. Our goal with ADLS Gen2 is to meet the limits our customers require. When you have a scenario where you need to store very large amounts of data (several petabytes) and need the account to support a very large transaction and throughput pattern (tens of thousands of TPS and hundreds of Gbps of throughput), typically requiring thousands of compute cores for analytics processing via Databricks or HDInsight, please do contact our product group so that we can plan to support your requirements appropriately.

How should I organize my data? #


Data in an ADLS Gen2 account can be organized in a hierarchy of containers, folders, and files, in that order, as we saw above. When we work with customers on their data lake strategies, a very common point of discussion is how best to organize their data. There are multiple approaches to organizing data in a data lake; this section documents a common approach taken by many customers building data platforms.

This organization follows the lifecycle of the data as it flows from the source systems all the way to the end consumers - the BI analysts or data scientists. As an example, let's follow the journey of sales data through Contoso.com's data analytics platform.


As an analogy, think of the raw data as a lake or pond with water in its natural state: the data is ingested and stored as is, without transformation. The enriched data is water in a reservoir, cleaned and stored in a predictable state (schematized, in the case of our data). The curated data is like bottled water, ready to consume. Workspace data is like a laboratory where scientists can bring their own data for experiments. It is worth noting that while all of these data layers live in a single logical data lake, they may be spread across different physical storage accounts. In these cases, having a metastore helps with discovery.

  • Raw data: This is the data as it comes from the source systems. This data is stored as is in the data lake and is consumed by analytics engines such as Spark to perform cleansing and enrichment operations that produce the curated data. Data in the raw zone is sometimes also stored as aggregated datasets, e.g. in streaming scenarios, where data is ingested via a message bus such as Event Hubs, aggregated via a real-time processing engine such as Azure Stream Analytics or Spark Streaming, and then stored in the data lake. Depending on your business needs, you can choose to keep the data as is (e.g. log messages from servers) or aggregate it (e.g. real-time streaming data). This layer of data is tightly controlled by a central data engineering team, and other consumers are rarely given access to it. Depending on your enterprise's retention policies, this data is either stored as is for the period required by the retention policy or deleted when you deem the data no longer needed. For example, this would be the raw sales data ingested from Contoso's sales management tool running on their on-premises systems.

  • Enriched data: This layer of data is the version of the raw data (as is or aggregated) that has a defined schema and is cleansed, enriched (with other sources), and ready to be consumed by analytics engines to extract high-value data. Data engineers generate these datasets and also proceed to extract high-value/curated data from them. For example, this would be the enriched sales data - ensuring the sales data is schematized, enriched with other product or inventory information, and separated into multiple datasets for the different business units within Contoso.

  • Curated data: This layer of data contains the high-value information that is served to the consumers of the data - the BI analysts and data scientists. This data has structure and can be served to the consumers either as is (e.g. data science notebooks) or through a data warehouse. Data assets in this layer are usually highly governed and well documented. For example, this would be the high-quality sales data for a business unit (that is, data from the enriched zone correlated with other demand-forecasting signals, such as social media trend patterns), used in predictive analytics to determine the sales projections for the next fiscal year.

  • Workspace data: In addition to the data ingested at the source by the data engineering team, the consumers of the data can choose to bring in other datasets that may be valuable. In this case, the data platform can allocate workspaces to these consumers so they can generate valuable insights using the curated data along with the other datasets they bring. For example, a data science team trying to determine the product placement strategy for a new region can bring in additional datasets, such as customer demographics and usage data for other similar products in that region, and use the high-value sales insights to analyze the product-market fit and the distribution strategy.

  • Archived data: This is your organization's data "vault" - data stored primarily to comply with retention policies and for very restricted purposes such as supporting audits. You can use the Cool and Archive tiers in ADLS Gen2 to store this data. You can read more about our data lifecycle management policies to identify a plan that works for you.

Key Considerations#


When deciding on a data structure, consider the semantics of the data itself and the consumers who will access it to determine the data organization strategy that is right for you.

Recommendations#

  • Create different folders or containers (more on folder vs. container considerations below) for the different data zones - the raw, enriched, curated, and workspace datasets.

  • Within a zone, choose to organize data in folders according to logical divisions, such as datetime or business unit, or both. You can find more examples and scenarios for directory layouts in our best practices documentation.

    • Consider the analytics usage patterns when designing your folder structure. For example, if you have a Spark job that reads all the sales data for a product from a specific region for the past 3 months, an ideal folder structure would be /enriched/product/region/timestamp.

    • When deciding on a folder structure, consider the access control model you want to follow.

    • The following table provides a framework for thinking about the different zones of your data and the associated management of those zones, using the patterns we commonly see.

Consideration | Raw data | Enriched data | Curated data | Workspace data
--- | --- | --- | --- | ---
Consumer | Data engineering team | Data engineering team, with ad-hoc access by data scientists/BI analysts | Data engineers, BI analysts, data scientists | Data scientists/BI analysts
Access control | Locked down, accessible only to the data engineering team | Full control for the data engineering team, read access for BI analysts/data scientists | Full control for the data engineering team, read and write access for BI analysts/data scientists | Full control for data engineers, data scientists/BI analysts
Data lifecycle management | Once the enriched data is generated, the raw data can be moved to a cooler storage tier to manage costs | Older data can be moved to a cooler tier | Older data can be moved to a cooler tier | While the end consumers control this workspace, make sure there are processes and policies in place to clean up data that is no longer needed (e.g. using policy-based DLM); otherwise data can easily build up
Folder structure and hierarchy | Folder structure reflects the ingestion patterns | Folder structure reflects the organization, e.g. business units | Folder structure reflects the organization, e.g. business units | Folder structure reflects the teams using the workspace
Example | /raw/sensordata, /raw/lobappdata, /raw/userclickdata | /enriched/sales, /enriched/manufacturing | /curated/sales, /curated/manufacturing | /workspace/salesBI, /workspace/manufacturingdatascience
  • Another common question our customers ask is when to use containers and when to use folders to organize their data. While at a high level both are used for the logical organization of data, they have a few key differences.

Consideration | Container | Folder
--- | --- | ---
Hierarchy | Containers can contain folders or files. | Folders can contain other folders or files.
Access control with AAD | At the container level, coarse-grained access control can be set using RBAC. These role assignments apply to all the data inside the container. | At the folder level, fine-grained access control can be set using ACLs. The ACL applies only to that folder (unless the default ACL is used, in which case it is snapshotted when new files/folders are created under that folder).
Non-AAD access control | At the container level, it is possible to enable anonymous access (via shared keys) or set SAS keys specific to the container. | Folders do not support non-AAD access control.

Anti-Patterns#


Infinite growth of irrelevant data#


While ADLS Gen2 storage is not very expensive and lets you store a large amount of data in your storage accounts, the absence of lifecycle management policies can cause the data in storage to grow very quickly, even if you don't need the entire corpus of data for your scenarios. Two common patterns where we see this kind of data growth are:

  • Data refreshed with newer versions - customers often retain a few older versions of the data for analysis when the same data gets refreshed over time, e.g. customer engagement data for the last month refreshed daily over a rolling 30-day window. Every refresh produces another 30 days of engagement data, and if you don't have a clean-up process in place, your data can grow very quickly.

  • Workspace data accumulation - in the workspace data zone, the consumers of your data platform, i.e. the BI analysts or data scientists, can bring their own datasets. We have often seen that this data accumulates when unused datasets are left lying around in the storage space.

How do I manage access to my data? #


ADLS Gen2 supports an access control model that combines RBAC and ACLs to manage data access. You can find more information on access control here. In addition to managing access using AAD identities using RBAC and ACLs, ADLS Gen2 also supports the use of SAS tokens and shared secrets to manage access to data in Gen2 accounts.

A common question we hear from customers is when to use RBAC and when to use ACLs to manage access to data. RBAC allows you to assign roles to security principals (users, groups, service principals, or managed identities in AAD) and these roles are associated with permission sets for data in containers. RBAC can help manage roles related to control plane operations (such as adding additional users and assigning roles, managing encryption settings, firewall rules, etc.) or data plane operations (such as creating containers, reading and writing data, etc.). For more information about RBAC, you can read this article.

RBAC is essentially scoped to top-level resources - storage accounts or containers in ADLS Gen2. You can also apply RBAC across resources at the resource group or subscription level. ACLs let you manage a specific set of permissions for a security principal at a much narrower scope - a file or a directory in ADLS Gen2. There are 2 types of ACLs - access ACLs control access to a file or a directory, while default ACLs are templates of ACLs set on a directory; a snapshot of the default ACL is inherited by any child items created under that directory.

Key Considerations#


The following table provides a quick overview of how ACLs and RBAC are used to manage permissions to data in an ADLS Gen2 account - at a high level, RBAC is used to manage coarse-grained permissions (at the storage account or container level) and ACLs are used to manage fine-grained permissions (at the file and directory level).

Consideration | RBAC | ACLs
--- | --- | ---
Scope | Storage accounts, containers; cross-resource RBAC at the subscription or resource group level | Files, directories
Limits | 2000 role assignments per subscription | 32 ACL entries (effectively 28) per file and 32 ACL entries (effectively 28) per folder, for the default and access ACLs each
Supported levels of permission | Built-in RBAC roles or custom roles | ACL permissions

If you use RBAC at the container level as the only mechanism for data access control, be aware of the 2000 limit, especially if you are likely to have a large number of containers. You can check the number of role assignments per subscription in any Access Control (IAM) blade in the portal.

Recommendations#

  • Create security groups for the level of permissions you want on an object (typically a directory, from what we have seen with our customers) and add them to the ACLs. For specific security principals you want to grant permissions to, add them to the security groups instead of creating specific ACLs for them. Following this practice will help you minimize the process of managing access for new identities, which would otherwise take a very long time if you had to recursively add the new identity to every single file and folder in the container. Let's take an example: you have a directory /logs in your data lake that contains log data from your servers. You ingest data into this folder via ADF, let specific users from the service engineering team upload logs and manage other users of this folder, and also have various Databricks clusters analyzing the logs. You would create the /logs directory and create two AAD groups, LogsWriter and LogsReader, with the following permissions (a minimal SDK sketch follows these steps).

    • LogsWriter added to the ACL of the /logs folder with rwx permissions.

    • LogsReader added to the ACL of the /logs folder with r-x permissions.

    • The SPN/MSI for ADF, along with the users from the service engineering team, will be added to the LogsWriter group.

    • The SPN/MSI for Databricks will be added to the LogsReader group.
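
A minimal sketch of applying these ACLs with the Python azure-storage-file-datalake SDK is shown below; the account name, container name, and the two group object IDs are hypothetical placeholders that you would replace with your own values.

    # Minimal sketch: grant the LogsWriter group rwx and the LogsReader group r-x
    # on the /logs directory, including default ACLs so new children inherit them.
    # The account, container, and AAD group object IDs below are hypothetical.
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    LOGS_WRITER_OID = "11111111-1111-1111-1111-111111111111"  # AAD group object ID
    LOGS_READER_OID = "22222222-2222-2222-2222-222222222222"  # AAD group object ID

    service = DataLakeServiceClient(
        account_url="https://contosodatalake.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    logs_dir = service.get_file_system_client("insights").get_directory_client("logs")

    acl = (
        "user::rwx,group::r-x,other::---,"
        f"group:{LOGS_WRITER_OID}:rwx,group:{LOGS_READER_OID}:r-x,"
        f"default:group:{LOGS_WRITER_OID}:rwx,default:group:{LOGS_READER_OID}:r-x"
    )

    # Apply to the folder itself and, recursively, to anything already under it.
    logs_dir.set_access_control(acl=acl)
    logs_dir.update_access_control_recursive(acl=acl)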

What data format do I choose? #


Data may arrive in your data lake account in a variety of formats - human-readable formats such as JSON, CSV, or XML files, compressed binary formats such as .tar.gz - and in a variety of sizes - huge files (a few TB), such as an export of a SQL table from your on-premises systems, or a large number of tiny files (a few KB), such as real-time events from an IoT solution. While ADLS Gen2 supports storing all kinds of data without imposing any restrictions, it is better to think about data formats to maximize the efficiency of your processing pipelines and optimize costs - you can achieve both of these by picking the right format and the right file sizes. Hadoop has a set of file formats it supports for optimized storage and processing of structured data. Let's look at some common file formats - Avro, Parquet, and ORC. All of them are machine-readable binary file formats, offer compression to manage file size, and are self-describing in nature, with a schema embedded in the file. The difference between the formats is in how the data is stored - Avro stores data in a row-based format, while Parquet and ORC store data in a columnar format.

Key Considerations#

  • The Avro file format is favored for write-heavy I/O patterns or query patterns that retrieve multiple rows of records in their entirety. For example, the Avro format is favored by a message bus such as Event Hubs or Kafka that writes multiple events/messages in succession.

  • The Parquet and ORC file formats are favored when the I/O patterns are more read-heavy and/or when the query patterns focus on a subset of columns in the records - where the read transactions can be optimized to retrieve specific columns instead of reading the entire record.

How do I manage my data lake costs? #

ADLS Gen2 provides data lake storage for your analytics scenarios with the goal of reducing your total cost of ownership. Pricing for ADLS Gen2 can be found here. As our enterprise customers serve the needs of multiple organizations, including analytics use cases on a central data lake, their data and transactions tend to increase dramatically. With little or no centralized control, associated costs also increase. This section provides key considerations you can use to manage and optimize your data lake costs.

Key Considerations#

  • ADLS Gen2 offers policy management that you can use to manage the lifecycle of the data stored in your Gen2 account. You can read more about these policies here. For example, if your organization has a retention policy requirement to keep data for 5 years, you can set a policy to automatically delete data that has not been modified for 5 years. If your analytics scenarios primarily operate on data ingested in the past month, you can move data older than a month to a lower tier (cool or archive), where storing data is cheaper. Note that the lower tiers are cheaper for data at rest but have higher transaction costs, so do not move data to a lower tier if you expect the data to be accessed frequently. (A minimal policy sketch follows this list.)


  • Make sure you have chosen the right replication option for your account; you can read the data redundancy article to learn more about your options. For example, while a GRS account ensures that your data is replicated across multiple regions, it also costs more than an LRS account (where the data is replicated within the same data center). When you have a production environment, replication options such as GRS are valuable for ensuring business continuity through high availability and disaster recovery. However, an LRS account may be sufficient for your development environment.

  • As you can see from the ADLS Gen2 pricing page, your read and write transactions are billed in 4 MB increments. For example, if you perform 10,000 reads and each file read is 16 MB in size, you will be charged for 40,000 transactions. When you read just a few KB of data in a transaction, you are still charged for a 4 MB transaction. Packing more data into a single transaction, i.e. optimizing for higher throughput per transaction, not only saves costs but also greatly improves your performance.
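
As an illustration of the lifecycle policies mentioned above, here is a minimal sketch using the Python azure-mgmt-storage SDK, under the assumption that the management policy can be passed as a dict that mirrors the ARM lifecycle policy JSON; the subscription, resource group, account name, prefix, and tiering/deletion thresholds are all hypothetical placeholders.

    # Minimal sketch: tier raw-zone blobs to cool after 30 days without
    # modification and delete them after 5 years (1825 days). All names and
    # thresholds below are hypothetical placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.storage import StorageManagementClient

    client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

    policy = {
        "policy": {
            "rules": [
                {
                    "name": "tier-and-expire-raw",
                    "enabled": True,
                    "type": "Lifecycle",
                    "definition": {
                        "filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["raw/"]},
                        "actions": {
                            "baseBlob": {
                                "tierToCool": {"daysAfterModificationGreaterThan": 30},
                                "delete": {"daysAfterModificationGreaterThan": 1825},
                            }
                        },
                    },
                }
            ]
        }
    }

    client.management_policies.create_or_update(
        resource_group_name="contoso-rg",
        account_name="contosodatalake",
        management_policy_name="default",
        properties=policy,
    )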

How do I monitor my data lake? #


Understanding how your data lake is used and how it performs is a key component of operating your service and making sure it is available to any workload that consumes the data it contains. This includes:

  • Ability to audit your data lake based on frequent operations

  • Understand key performance indicators, such as high-latency operations

  • Understand common errors, the operations that cause them, and the operations that cause server-side throttling

Key Considerations#

All telemetry data for the data lake is available through Azure Storage logs in Azure Monitor. Azure Storage Logs in Azure Monitor is a new preview feature for Azure Storage that allows direct integration between your storage account and Log Analytics, Event Hubs, and archiving logs to another storage account using standard diagnostic settings. A reference to a complete list of metrics and resource logs and their associated schemas can be found at Azure Storage Monitoring Data Reference.

  • The choice of where to store logs in Azure Storage Logs becomes important when considering how to access them:

    • If you want near real-time access to your logs and the ability to correlate events in your logs with other metrics from Azure Monitor, you can store your logs in a Log Analytics workspace. This allows you to query your logs using KQL and author queries that enumerate the StorageBlobLogs table in your workspace.

    • If you want to store logs for near real-time queries and long-term retention, you can configure diagnostic settings to send logs to a Log Analytics workspace and storage account.

    • If you want to access your logs through another query engine (such as Splunk), you can configure your diagnostic settings to send logs to Event Hubs and ingest logs from Event Hubs to a destination of your choice.

  • Azure Storage logs in Azure Monitor can be enabled through the Azure portal, PowerShell, the Azure CLI, and Azure Resource Manager templates; a minimal sketch using the Python SDK follows the links below. For at-scale deployments, Azure Policy can be used with full support for remediation tasks. For more details, see:

    • Azure/Community Policies

    • ciphertxt/AzureStoragePolicy
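
For illustration, here is a minimal sketch of enabling these logs with the Python azure-mgmt-monitor SDK, as one option alongside the mechanisms listed above; the subscription, resource group, storage account, workspace, and diagnostic setting names are hypothetical placeholders, and the exact SDK surface may vary by package version.

    # Minimal sketch: send blob read/write/delete logs for an ADLS Gen2 account
    # to a Log Analytics workspace via a diagnostic setting. All resource names
    # and IDs below are hypothetical placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.monitor import MonitorManagementClient

    SUBSCRIPTION_ID = "<subscription-id>"
    BLOB_SERVICE_ID = (
        f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/contoso-rg"
        "/providers/Microsoft.Storage/storageAccounts/contosodatalake/blobServices/default"
    )
    WORKSPACE_ID = (
        f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/contoso-rg"
        "/providers/Microsoft.OperationalInsights/workspaces/contoso-logs"
    )

    monitor = MonitorManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)
    monitor.diagnostic_settings.create_or_update(
        resource_uri=BLOB_SERVICE_ID,
        name="datalake-logs",
        parameters={
            "workspace_id": WORKSPACE_ID,
            "logs": [
                {"category": "StorageRead", "enabled": True},
                {"category": "StorageWrite", "enabled": True},
                {"category": "StorageDelete", "enabled": True},
            ],
        },
    )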


Common KQL queries for Azure Storage logs in Azure Monitor#


The following queries can be used to gain insight into the performance and health of the data lake:

  • Frequent operations

    StorageBlobLogs
    | where TimeGenerated > ago(3d)
    | summarize count() by OperationName
    | sort by count_ desc
    | render piechart
  • High latency operations

    StorageBlobLogs
    | where TimeGenerated > ago(3d)
    | top 10 by DurationMs desc
    | project TimeGenerated, OperationName, DurationMs, ServerLatencyMs, 
    ClientLatencyMs = DurationMs - ServerLatencyMs
  • Operations causing the most errors

    StorageBlobLogs
    | where TimeGenerated > ago(3d) and StatusText !contains "Success"
    | summarize count() by OperationName
    | top 10 by count_ desc

A list of all of the built-in queries for Azure Storage logs in Azure Monitor can be found in the Azure Services/Storage accounts/Queries folder of the Azure Monitor community on GitHub.

Optimizing your data lake for better scale and performance#


Under construction, seeking contributions

In this section, we discuss how to optimize your data lake store for the performance of your analytics pipelines, focusing on the fundamental principles that help you optimize your storage transactions. To set the right context: there is no silver bullet or 12-step process to optimize a data lake, since many considerations depend on the specific usage and the business problems you are trying to solve. However, when we talk about optimizing your data lake for performance, scalability, and even cost, it boils down to two key factors:

  • Optimize for high throughput - Aim to get at least a few MB per transaction (higher is better).

  • Optimize data access mode - reduce unnecessary file scanning and only read the data you need to read.

As a prerequisite for optimization, it is important to know more about transaction profiles and data organization. Given the varying nature of analytics scenarios, optimization depends on your analytics pipeline, storage I/O patterns, and datasets you operate on, especially the following aspects of your data lake.

Note that the scenarios we discuss are primarily focused on optimizing ADLS Gen2 performance. Beyond storage performance, the overall performance of an analytics pipeline also has engine-specific considerations; our partnerships with the analytics offerings on Azure, such as Azure Synapse Analytics, HDInsight, and Azure Databricks, ensure that we focus on making the overall experience better. In the meantime, while we use engine-specific examples, please note that these examples primarily illustrate storage performance.

File Size and Number of Files#


The analytics engine (your ingestion or data processing pipeline) incurs overhead for every file it reads (related to listing, checking access, and other metadata operations), and too many small files can negatively affect the performance of your overall job. Further, when your files are too small (in the KB range), the throughput you achieve per I/O operation is low, so more I/O operations are needed to read the data you want. In general, it is a best practice to organize your data into larger files (aim for at least 100 MB or more) for better performance.

In many cases, if your raw data (from various sources) is not inherently large, you can use the following options to ensure that the datasets your analytics engine operates on are still optimized with large files.

  • Add a data processing layer in your analytics pipeline to coalesce data from multiple small files into one large file. You can also use this opportunity to store the data in a read-optimized format such as Parquet for downstream processing (see the sketch after this list).

  • In the case of dealing with real-time data, you can use a real-time streaming engine (such as Azure Stream Analytics or Spark Streaming) with a message broker (such as Event Hubs or Apache Kafka) to store your data as larger files.
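
A minimal PySpark sketch of such a compaction step is shown below; the container and folder names are hypothetical placeholders, and the target partition count would be tuned so the output files land in the hundreds-of-MB range.

    # Minimal sketch: compact many small JSON files from the raw zone into a few
    # large Parquet files in the enriched zone. Container and folder names are
    # hypothetical placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

    raw_path = "abfss://raw@contosodatalake.dfs.core.windows.net/sensordata/2024/01/"
    enriched_path = "abfss://enriched@contosodatalake.dfs.core.windows.net/sensordata/2024/01/"

    df = spark.read.json(raw_path)      # many small files in
    (
        df.repartition(8)               # a few large output files out
          .write.mode("overwrite")
          .parquet(enriched_path)       # read-optimized columnar format
    )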


File format#


As we've already discussed, optimizing your storage I/O patterns can greatly benefit the overall performance of your analytics pipeline. It is worth mentioning that choosing the right file format can not only provide better performance, but also reduce data storage costs. Parquet is a very popular data format worth exploring for your big data analytics pipeline.

Apache Parquet is an open source file format optimized for read-heavy analysis pipelines. Parquet's columnar storage structure allows you to skip irrelevant data, thereby improving query efficiency. This ability to skip also results in only the data you want being sent from storage to the analytics engine, reducing costs and improving performance. Also, since similar data types (for a column) are stored together, Parquet facilitates efficient data compression and encoding schemes, thereby reducing data storage costs, compared to storing the same data in a text file format.
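
As a minimal illustration of the column pruning and predicate pushdown that Parquet enables, here is a hypothetical PySpark read that touches only the columns and rows it needs; the paths and column names are placeholders.

    # Minimal sketch: with Parquet, Spark reads only the selected columns and can
    # push the filter down to skip irrelevant row groups, instead of scanning whole
    # records as it would with CSV or JSON. Paths and column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("parquet-column-pruning").getOrCreate()

    sales = spark.read.parquet(
        "abfss://enriched@contosodatalake.dfs.core.windows.net/sales/"
    )

    eu_revenue_by_day = (
        sales.select("region", "orderdate", "revenue")   # column projection
             .filter(F.col("region") == "EU")            # predicate pushdown
             .groupBy("orderdate")
             .sum("revenue")
    )
    eu_revenue_by_day.show()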


Services such as Azure Synapse Analytics, Azure Databricks, and Azure Data Factory have built-in native functionality to take advantage of the Parquet file format.

Partition Scheme#


An effective data partitioning scheme can improve the performance of your analytics pipeline and also reduce the overall transaction costs incurred by your queries. In simple terms, partitioning is a way of organizing your data by grouping datasets with similar attributes together in a storage entity, such as a folder. When your data processing pipeline queries data with a similar attribute (e.g. all the data from the past 12 hours), the partitioning scheme (in this case, by datetime) lets you skip the irrelevant data and read only the data you want.

Let's take an example of Contoso's IoT scenario, where data is ingested into a data lake in real-time from various sensors. You now have a variety of options for storing your data, including (but not limited to) those listed below:

  • Option 1 - /<sensorid>/<datetime>/<temperature>, <sensorid>/<datetime>/<pressure>, <sensorid>/<datetime>/<humidity>

  • Option 2 - /<datetime>/<sensorid>/<temperature>, /<datetime>/<sensorid>/<pressure>, /<datetime>/<sensorid>/<humidity>

  • Option 3 - <temperature>/<datetime>/<sensorid>, <pressure>/<datetime>/<sensorid>, <humidity>/<datetime>/<sensorid>


If the high-priority scenario is to understand the health of the sensors based on the values they send, to ensure the sensors are working correctly, then you would run analytics pipelines every hour or so to triangulate the data from a specific sensor against the data from the other sensors and make sure they are working properly. In this case, Option 2 would be the optimal way to organize the data. Conversely, if your high-priority scenario is to understand the weather patterns in an area based on the sensor data, in order to determine what remedial action you need to take, you would run analytics pipelines periodically to assess the weather based on the sensor data from that area. In this case, you would want to optimize the organization by date and attribute over the sensor ID.

Open source computing frameworks such as Apache Spark provide native support for partitioning schemes that you can leverage in your big data applications.
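
Below is a minimal PySpark sketch of writing data partitioned along the lines of Option 2 (date first, then sensor ID); the paths and column names are hypothetical placeholders.

    # Minimal sketch: write sensor readings partitioned by date and then sensor ID,
    # so a query for "the past 12 hours for sensor s-001" scans only the matching
    # folders. Paths and column names are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("partition-sensor-data").getOrCreate()

    readings = spark.read.json(
        "abfss://raw@contosodatalake.dfs.core.windows.net/iot/landing/"
    )

    (
        readings.withColumn("date", F.to_date("eventtime"))
                .write.mode("append")
                .partitionBy("date", "sensorid")   # layout: .../date=.../sensorid=.../
                .parquet("abfss://enriched@contosodatalake.dfs.core.windows.net/iot/readings/")
    )

    # Reading back only what is needed relies on partition pruning:
    subset = spark.read.parquet(
        "abfss://enriched@contosodatalake.dfs.core.windows.net/iot/readings/"
    ).filter((F.col("date") == "2024-01-01") & (F.col("sensorid") == "s-001"))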

Use query acceleration#


Azure Data Lake Storage has a capability called query acceleration, available in preview, that is intended to optimize your performance while lowering cost. Query acceleration lets you specify predicates (think of these as similar to the conditions you would provide in the WHERE clause of a SQL query) and column projections (think of these as the columns you would specify in the SELECT statement of a SQL query) on your unstructured data, so that only the data you need is retrieved.


In addition to improving performance by filtering down to the specific data used by the query, query acceleration also lowers the overall cost of your analytics pipeline by optimizing the data transferred: overall storage transaction costs are reduced, and you also save the cost of the compute that would otherwise have been needed to read the entire dataset and filter out the subset of data you want.
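
A minimal sketch of the query acceleration API in the Python azure-storage-file-datalake SDK follows; the container, path, and column names are hypothetical placeholders, and since the feature is in preview, check its availability for your account.

    # Minimal sketch: retrieve only two columns of the rows matching a predicate
    # from a CSV file, instead of downloading the whole file. Container, path,
    # and column names are hypothetical placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient, DelimitedTextDialect

    service = DataLakeServiceClient(
        account_url="https://contosodatalake.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    file_client = service.get_file_system_client("raw").get_file_client(
        "sensordata/2024/01/01/readings.csv"
    )

    text_format = DelimitedTextDialect(delimiter=",", quotechar='"', has_header=True)

    reader = file_client.query_file(
        "SELECT sensorid, temperature FROM BlobStorage WHERE temperature > 30",
        file_format=text_format,
        output_format=text_format,
    )
    print(reader.readall().decode("utf-8"))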

Source: https://architect.pub/hitchhikers-guide-data-lake