Introduction to Azure Data Lake Storage Gen2

Azure Data Lake Storage Gen2 builds on Azure Blob storage and is a set of capabilities for big data analytics.

Data Lake Storage Gen2 combines the features of Azure Data Lake Storage Gen1 with Azure Blob Storage. For example, Data Lake Storage Gen2 provides scale, file-level security, and file system semantics. You also get low-cost tiered storage with high-availability and disaster-recovery capabilities, because these capabilities are built on top of Blob storage.

Designed for enterprise big data analytics

Azure Storage is now the starting point for building enterprise data lakes on Azure, thanks to Data Lake Storage Gen2. Data Lake Storage Gen2 was designed from the ground up to service multiple petabytes of data while sustaining hundreds of gigabits of throughput, so you can easily manage massive amounts of data.

A key feature of Data Lake Storage Gen2 is the addition of a hierarchical namespace to Blob storage. The hierarchical namespace organizes objects and files into a hierarchy of directories for efficient data access. Object stores often use slashes in object names to imitate a hierarchical directory structure; with Data Lake Storage Gen2, that structure becomes real. Operations on directories, such as renaming or deleting a directory, become single atomic metadata operations. There is no need to enumerate and process every object that shares the directory's name prefix.
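
A minimal sketch of what an atomic rename looks like with the hierarchical namespace, using the azure-storage-file-datalake Python SDK; the account, container, and directory names here are hypothetical placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# The Data Lake Storage endpoint uses the dfs.core.windows.net host.
service = DataLakeServiceClient(
    account_url="https://<account-name>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
file_system = service.get_file_system_client("raw-data")

# With a hierarchical namespace, renaming a directory is a single atomic
# metadata operation; no per-blob enumeration, copy, or delete is needed.
directory = file_system.get_directory_client("2024/01/staging")
directory.rename_directory(new_name="raw-data/2024/01/landed")  # "<filesystem>/<new path>"
```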

Data Lake Storage Gen2 builds on Blob storage and enhances performance, management, and security in the following ways:

Performance

Performance is optimized because you don't need to copy or transform data before analyzing it. Also, the hierarchical namespace performs directory management operations far better than a flat namespace, which improves overall job performance.

Management

Management is even easier since you can use directories and subdirectories to arrange and manage files.

Security

Security is enforceable because POSIX permissions can be set on directories or individual files.
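
As a brief illustration, the sketch below sets a POSIX-style ACL on a directory with the azure-storage-file-datalake Python SDK; the account, container, path, and Azure AD group object ID are hypothetical.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<account-name>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
directory = service.get_file_system_client("raw-data").get_directory_client("sales/2024")

# Owner: rwx, owning group: r-x, everyone else: no access,
# plus read/execute for one named Azure AD group (placeholder object ID).
directory.set_access_control(
    acl="user::rwx,group::r-x,other::---,"
        "group:11111111-2222-3333-4444-555555555555:r-x"
)
```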

Also, Data Lake Storage Gen2 is very cost-effective because it is built on top of low-cost Azure Blob storage. The additional capabilities further lower the total cost of ownership of running big data analytics on Azure.

Key features of Data Lake Storage Gen2

  • Hadoop-compatible access: Data Lake Storage Gen2 enables you to manage and access data just as you would with a Hadoop Distributed File System (HDFS). The ABFS driver, which is used to access the data, is available in all Apache Hadoop environments, including Azure HDInsight, Azure Databricks, and Azure Synapse Analytics.
  • A superset of POSIX permissions: The Data Lake Storage Gen2 security model supports ACLs and POSIX permissions, along with some extra granularity specific to Data Lake Storage Gen2. Settings can be configured through Storage Explorer or through frameworks such as Hive and Spark.
  • Cost-effective: Data Lake Storage Gen2 provides low-cost storage space and transactions. With features like Azure Blob Storage Lifecycle, costs are reduced as data moves through its lifecycle.
  • Optimized driver: The ABFS driver is optimized specifically for big data analytics. The corresponding REST APIs are surfaced through the endpoint dfs.core.windows.net; see the sketch after this list for how that endpoint is typically addressed.
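
The sketch below shows how an analytics engine typically addresses that endpoint through an abfss:// URI. It assumes PySpark running in an environment where the ABFS driver is already available (for example Azure Databricks, Synapse, or HDInsight); the account, container, and key are placeholders, and account-key authentication is used only for brevity.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("abfs-read-example").getOrCreate()

# ABFS URIs have the form abfss://<container>@<account>.dfs.core.windows.net/<path>.
# Account-key auth is the simplest to show; Azure AD or managed identities are
# usually preferable in production.
spark.conf.set(
    "fs.azure.account.key.<account-name>.dfs.core.windows.net",
    "<storage-account-key>",
)

df = spark.read.parquet(
    "abfss://raw-data@<account-name>.dfs.core.windows.net/sales/2024/"
)
df.show(10)
```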

Scalability

Whether accessed through Data Lake Storage Gen2 or Blob storage interfaces, Azure Storage is designed to scale. It can store and serve many exabytes of data. Throughput at this storage volume is measured in gigabits per second (Gbps) at high levels of input/output operations per second (IOPS). Processing latency is measured at the service, account, and file levels and remains nearly constant for each request.

Cost-effectiveness

Storage capacity and transaction costs are lower because Data Lake Storage Gen2 is built on Azure Blob storage. Unlike with other cloud storage services, you don't need to move or transform your data before you can analyze it. For more details on pricing, see Azure Storage pricing.

Features such as hierarchical namespaces also greatly improve the overall performance of many analysis activities. Due to improved performance, less computing power is now required to process the same amount of data, reducing the total cost of ownership (TCO) of the entire analysis project.

One service, multiple concepts

Because Data Lake Storage Gen 2 is based on Azure Blob storage, the same shared object can be described by several concepts.

The following table shows how the same objects are described by different concepts. Unless otherwise noted, these terms are directly synonymous:

| Concept | Top-level organization | Lower-level organization | Data container |
| --- | --- | --- | --- |
| Blob - general-purpose object storage | Container | Virtual directory (SDK only; does not provide atomic operations) | Blob |
| Azure Data Lake Storage Gen2 - analytics storage | Container | Directory | File |

Supported Blob storage features

Accounts have access to Blob storage features such as diagnostic logging, access tiers, and Blob storage lifecycle management policies. Most blob storage features are fully supported, but some features are only supported in preview mode or not at all.

See Blob storage feature support in Azure Storage Accounts for details on how each Blob storage feature is supported in Data Lake Storage Gen 2.

Supported Azure service integrations

Data Lake Storage Gen2 supports several Azure services. You can use them to ingest data, perform analytics, and create visualizations. For a list of supported Azure services, see Azure services that support Azure Data Lake Storage Gen2.

Supported Open Source Platforms

Data Lake Storage Gen2 is supported by several open source platforms. For a complete list, see Open source platforms supporting Azure Data Lake Storage Gen 2.

Best practices for using Azure Data Lake Storage Gen2

Azure Data Lake Storage Gen2 is not a dedicated service or account type. It is a set of capabilities for high-throughput analytics workloads. Best practices and guidance for taking advantage of these capabilities are provided in the Data Lake Storage Gen2 documentation. For all other aspects of account management, including setting up network security, designing for high availability, and disaster recovery, see the Blob storage documentation.

Review feature compatibility and known issues

Use the following approach when setting up your account so that you get the most out of the underlying Blob storage features.

  • To find out whether an account fully supports a feature, read the page on Blob storage feature support for Azure Storage accounts. In accounts enabled for Data Lake Storage Gen2, some features are either not supported at all or are only partially supported. As feature support continues to grow, be sure to check this page frequently for changes.
  • Check out the Known issues with Azure Data Lake Storage Gen 2 article to see if there are any limitations or specific instructions for the features you want to use.
  • Browse the Blob storage articles for any guidance that applies specifically to Data Lake Storage Gen2 enabled accounts.

Understand the terms used in the documentation

When switching between content sets, you'll notice some slight differences in terminology. For example, content about Blob storage uses the term "blob" instead of "file". Technically, data ingested into your storage account becomes a blob there, so the term is accurate. However, "blob" can be confusing if you're used to the term "file". You'll also see the term "container" used to refer to a file system. Treat these terms as interchangeable.

Consider the premium tier

If your workload requires consistently low latency and/or a high number of input/output operations per second (IOPS), consider using a premium block blob storage account. These accounts use high-performance hardware to serve data. Data is stored on solid-state drives (SSDs), which are optimized for low latency and provide higher throughput than traditional hard drives. Premium performance has higher storage costs but lower transaction costs, so if your application performs a large number of transactions, a premium block blob account can be cost-effective.

If the storage account will be used for analytics, we strongly recommend using Azure Data Lake Storage Gen 2 with a premium block blob storage account. The Premium tier of Azure Data Lake Storage is a premium block blob storage account combined with a Data Lake Storage enabled account.

Improve data ingestion

When ingesting data from the source system, the source hardware, source network hardware, or network connection to the storage account can be a bottleneck.

 

Source hardware

Make sure to choose the right hardware carefully, whether you're using a virtual machine (VM) in Azure or an on-premises machine. For disk hardware, choose disks with faster spindles and consider solid-state drives (SSDs). For network hardware, use the fastest network interface controller (NIC) available. On Azure, we recommend D14 VMs, which have suitably powerful disk and network hardware.

Network connectivity to the storage account

The network connection between the source data and the storage account can sometimes be a bottleneck. When the source data is on-premises, consider using a dedicated link with Azure ExpressRoute. Performance is best when the source data (if it's in Azure) is in the same Azure region as the Data Lake Storage Gen2 enabled account.

Configure data ingestion tools for maximum parallelization

For the best performance, use all available throughput by performing as many reads and writes in parallel as possible.
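
As a rough sketch of what "as parallel as possible" can mean in practice, the example below uploads a batch of local files concurrently with the azure-storage-file-datalake Python SDK and a thread pool; the account, container, and local paths are hypothetical, and the right worker count depends on your hardware and network.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<account-name>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
file_system = service.get_file_system_client("raw-data")


def upload_one(local_path: Path) -> str:
    # Each worker uploads one whole file; overwrite=True keeps reruns idempotent.
    file_client = file_system.get_file_client(f"ingest/{local_path.name}")
    with local_path.open("rb") as data:
        file_client.upload_data(data, overwrite=True)
    return local_path.name


local_files = list(Path("./export").glob("*.csv"))

# Several uploads in flight at once make better use of the available throughput
# than uploading the files one after another.
with ThreadPoolExecutor(max_workers=8) as pool:
    for name in pool.map(upload_one, local_files):
        print(f"uploaded {name}")
```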

 

Structure your data set

Consider planning the structure of your data ahead of time. File format, file size, and directory organization can all affect performance and cost.

File format

Data can be ingested in various formats. It can appear in human-readable formats such as JSON, CSV, or XML, or in compressed binary formats such as tar.gz. It can also come in a range of sizes. Data can consist of large files (a few terabytes), such as an export of a SQL table from an on-premises system. Data can also arrive as a large number of small files (a few kilobytes each), such as real-time event data from an Internet of Things (IoT) solution. By choosing the right file format and file size, you can maximize efficiency and minimize costs.

Hadoop supports a range of file formats designed for storing and analyzing structured data. Avro, Parquet, and Optimized Row Columnar (ORC) are some popular formats. These are machine-readable binary file formats. They are compressed to help you control file size, and they are self-describing because each file contains an embedded schema. The way data is stored differs between these formats: Parquet and ORC store data in a columnar fashion, while Avro stores data in a row-based format.

If your I/O patterns are more write-heavy, or if your query patterns favor retrieving multiple rows of records in their entirety, consider the Avro file format. For example, the Avro format is suitable for message buses such as Event Hubs or Kafka that write a series of events or messages.

Consider Parquet and ORC file formats when I/O patterns are more read-intensive or when query patterns focus on a specific subset of columns in a record. Read transactions may be reduced to fetching only certain columns, rather than reading full records.

Apache Parquet is an open-source file format designed for read-heavy analytics pipelines. Thanks to Parquet's columnar storage format, you can skip over irrelevant data. Queries are much more efficient because they can narrowly scope which data to send from storage to the analytics engine. Additionally, because similar data types (for a column) are stored together, Parquet supports efficient data encoding and compression schemes that can lower data storage costs. Services such as Azure Synapse Analytics, Azure Databricks, and Azure Data Factory natively support the Parquet file format.
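
A small sketch of the columnar advantage using pyarrow; the table contents and file name are made up, but the pattern, writing once and then reading back only the columns a query needs, is what makes Parquet effective for read-heavy pipelines.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Toy data standing in for a much wider analytics table.
df = pd.DataFrame(
    {
        "device_id": ["a1", "a2", "a3"],
        "region": ["UK", "UK", "US"],
        "temperature": [21.4, 19.8, 23.1],
        "payload": ["...", "...", "..."],
    }
)

# Write a compressed, columnar Parquet file.
pq.write_table(pa.Table.from_pandas(df), "telemetry.parquet", compression="snappy")

# A query that only needs two columns never has to read the others from storage.
subset = pq.read_table("telemetry.parquet", columns=["device_id", "temperature"])
print(subset.to_pandas())
```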

File size

Larger files improve performance and reduce costs.

Analysis engines such as HDInsight typically include per-file overhead, which includes activities such as listing, determining access permissions, and performing various metadata operations. Data storage in the form of several small files can negatively impact performance. To improve performance, organize the data into larger files (256 MB to 100 GB in size). Some engines and programs may not be able to efficiently handle files larger than 100 GB.

Reduced transaction costs are another benefit of enlarging files. You will be charged for read and write activity in 4 MB increments, whether the file contains 4 MB or only a few KB. For pricing details, see Azure Data Lake Storage Pricing.
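
A back-of-the-envelope sketch of why file size matters for transaction costs. It models only the 4 MB increment rule quoted above and ignores every other pricing factor, so treat the numbers as illustrative rather than a bill estimate.

```python
import math

FOUR_MB = 4 * 1024 * 1024


def write_increments(file_size_bytes: int) -> int:
    # Each write is billed in 4 MB increments, so even a tiny file costs one increment.
    return max(1, math.ceil(file_size_bytes / FOUR_MB))


# Ingesting the same 1 GiB as a few large files versus many tiny ones.
as_large_files = 4 * write_increments(256 * 1024 * 1024)   # 4 files of 256 MiB
as_small_files = 262_144 * write_increments(4 * 1024)      # 262,144 files of 4 KiB

print(as_large_files)   # 256 billable write increments
print(as_small_files)   # 262,144 billable write increments
```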

Sometimes, data pipelines have limited control over the raw data, which arrives as a large number of small files. We recommend that your system include a process that merges small files into larger ones for use by downstream applications. If you're processing data in real time, you can use a real-time streaming engine (such as Spark Streaming or Azure Stream Analytics) together with a message broker (such as Event Hubs or Apache Kafka) to save the data as larger files. When merging small files into larger ones, consider saving them in a read-optimized format such as Apache Parquet for downstream processing.
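
A minimal compaction sketch using pandas: many small CSV drops are merged into a single read-optimized Parquet file. The landing and curated paths follow the date-based layouts discussed later in this article and are hypothetical.

```python
from pathlib import Path

import pandas as pd

# Many small CSV files (a few KB each) landed by an upstream pipeline.
small_files = sorted(Path("landing/events/2017/08/14").glob("*.csv"))

frames = [pd.read_csv(path) for path in small_files]
combined = pd.concat(frames, ignore_index=True)

# One compacted, columnar file instead of thousands of tiny ones cuts per-file
# metadata overhead and read transactions for downstream jobs.
out_dir = Path("curated/events/2017/08/14")
out_dir.mkdir(parents=True, exist_ok=True)
combined.to_parquet(out_dir / "events_batch.parquet", index=False)
```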

Directory Structure

Every workload has different requirements for how its data is consumed. These are some common layouts to consider when working with Internet of Things (IoT) workloads, batch scenarios, or time-series data.

IoT structure

IoT workloads can ingest large volumes of data that span numerous products, devices, organizations, and customers. It's important to pre-plan the directory layout so that the data is organized, secure, and can be processed efficiently by downstream consumers. The following layout can serve as a general template to consider:

{Region}/{SubjectMatter(s)}/{yyyy}/{mm}/{dd}/{hh}/

For example, landing telemetry for an aircraft engine within the UK might look like the following structure:

UK/Planes/BA1293/Engine1/2017/08/11/12/

In this example, placing the date at the end of the directory structure makes it much easier to restrict regions and subject matters to specific users and groups. If the date structure came first, securing those regions and subject matters would be much more difficult. For example, to restrict access to UK data or to specific aircraft only, you would need to apply separate permissions to numerous directories under every hourly directory. This arrangement would also rapidly increase the number of directories as time goes on.
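
A tiny helper that assembles paths following the {Region}/{SubjectMatter(s)}/{yyyy}/{mm}/{dd}/{hh}/ template above; the region and subject values are just the example from this section.

```python
from datetime import datetime, timezone


def telemetry_path(region: str, subject: str, event_time: datetime) -> str:
    # Region and subject matter come first so ACLs can be applied once at those
    # levels; the date components sit at the end of the hierarchy.
    return f"{region}/{subject}/{event_time:%Y/%m/%d/%H}/"


print(telemetry_path("UK", "Planes/BA1293/Engine1",
                     datetime(2017, 8, 11, 12, tzinfo=timezone.utc)))
# UK/Planes/BA1293/Engine1/2017/08/11/12/
```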

Batch jobs structure

A commonly used approach in batch processing is to place data into an "in" directory. After the data is processed, the output goes into an "out" directory so that other processes can consume it. This directory structure is sometimes used for jobs that process individual files and don't require massively parallel processing over large data sets. Like the IoT structure above, a good directory structure has parent-level directories for things such as region and subject matter (for example, organization, product, or producer). Consider date and time in the structure to allow better organization, filtered searches, security, and automation in the processing. The level of granularity for the date structure is determined by how frequently the data is uploaded or processed.

File processing occasionally fails because of data corruption or unexpected formats. In those cases, the directory structure may benefit from a /bad folder where files can be moved for further inspection. Batch jobs can also report or alert users about these problem files so that they can take manual action. Consider the following template structure:

  • {Region}/{SubjectMatter(s)}/In/{yyyy}/{mm}/{dd}/{hh}/
  • {Region}/{SubjectMatter(s)}/Out/{yyyy}/{mm}/{dd}/{hh}/
  • {Region}/{SubjectMatter(s)}/Bad/{yyyy}/{mm}/{dd}/{hh}/

For example, a marketing firm receives daily extracts of customer updates from North American customers. Before and after processing, it might look like the following code snippet:

  • NA/Extracts/ACMEPaperCo/In/2017/08/14/updates_08142017.csv
  • NA/Extracts/ACMEPaperCo/Out/2017/08/14/processed_updates_08142017.csv

In the common case where batch data is processed directly into databases such as Hive or traditional SQL databases, the output already goes into a separate folder for the Hive table or external database, so /in and /out directories aren't required. For example, daily extracts from customers would land in their respective directories. Then, a service such as Azure Data Factory, Apache Oozie, or Apache Airflow would trigger a daily Hive or Spark job to process and write the data into a Hive table.

Time-series data structure

For Hive workloads, partition pruning of time-series data can enable some queries to read only a subset of the data, which improves performance.

Pipelines for ingesting time series data typically organize their files into named folders. Here's an example of data typically arranged by date:

  • /DataSet/YYYY/MM/DD/datafile_YYYY_MM_DD.tsv

Notice that the date and time details appear both as folders and in the file name.

The following are typical formats for dates and times:

  • /DataSet/YYYY/MM/DD/HH/mm/datafile_YYYY_MM_DD_HH_mm.tsv

Once again, the decisions you make about how to organize your folders and files should be optimized for larger file sizes and manageable numbers of files in each folder.
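
A small sketch of path-based pruning for the date layout shown above: only the folders for the requested window are enumerated, so whatever reads those paths never touches the rest of the data set. The data set name and dates are placeholders; engines such as Hive and Spark apply the same idea through partition pruning.

```python
from datetime import date, timedelta


def day_paths(dataset: str, start: date, end: date) -> list[str]:
    # Generate /DataSet/YYYY/MM/DD/ folders for each day in the window.
    days = (end - start).days + 1
    return [
        f"/{dataset}/{d:%Y/%m/%d}/"
        for d in (start + timedelta(n) for n in range(days))
    ]


print(day_paths("DataSet", date(2023, 3, 1), date(2023, 3, 3)))
# ['/DataSet/2023/03/01/', '/DataSet/2023/03/02/', '/DataSet/2023/03/03/']
```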

Set up security

Start by reviewing the recommendations in the Security considerations for blob storage article. You'll receive best-practice advice on how to protect data behind the firewall, prevent malicious or accidental deletion, and use Azure Active Directory (Azure AD) as the foundation for identity management.

Then, for recommendations specific to accounts with Data Lake Storage Gen 2 enabled, see the Access control model section on the Azure Data Lake Storage Gen 2 page. This article describes how to apply security permissions to directories and files in a hierarchical file system by using Azure role-based access control (Azure RBAC) roles and access control lists (ACLs).

Ingest, process, and analyze

You can ingest data into a Data Lake Storage Gen2 enabled account from a variety of sources and in a variety of ways.

For example, you can ingest large sets of data from HDInsight and Hadoop clusters, or smaller sets of ad hoc data for prototyping applications. You can ingest streamed data generated by various sources such as applications, devices, and sensors, and use tools to capture and process that data on a real-time, event-by-event basis before writing the events in batches to your account. You can also ingest web server logs, which contain information such as the history of page requests. If you want the flexibility to integrate the data-upload component into your larger big data application, consider using a custom script or program to upload log data.

Once the data is available in your account, you can run analytics on it, create visualizations, and even download the data to your local computer or to another repository, such as an Azure SQL Database or SQL Server instance.

Monitor telemetry

Monitoring usage and performance is an important part of operating your service. Examples include frequent operations, operations with high latency, and operations that cause service-side throttling.

All telemetry data for a storage account is accessible through the Azure Storage logs in Azure Monitor. This feature allows archiving logs to another storage account and linking the storage account with Log Analytics and Event Hubs. Visit the Azure Storage Monitoring Data Reference to view the entire collection of metrics, resource logs, and their corresponding structures.

Where you store your logs depends on how you plan to access them. For example, if you want near real-time access to your logs and the ability to correlate events in your logs with other metrics in Azure Monitor, you can store your logs in a Log Analytics workspace. Then, query your logs by using KQL, writing queries against the StorageBlobLogs table in your workspace.
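
A hedged sketch of querying those logs from code with the azure-monitor-query Python package; the workspace ID is a placeholder, and the KQL simply counts the most frequent operations in the StorageBlobLogs table mentioned above.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Top blob operations over the last 24 hours.
query = """
StorageBlobLogs
| summarize Count = count() by OperationName
| top 10 by Count desc
"""

response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",
    query=query,
    timespan=timedelta(days=1),
)

for table in response.tables:
    for row in table.rows:
        print(row)
```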

If you want to store logs for near real-time query and long-term retention, you can set the diagnostic settings to send logs to the Log Analytics workspace and storage account.

If you want to retrieve logs via other query engines such as Splunk, you can set the diagnostic settings to send logs to Event Hubs and ingest logs from Event Hubs to your preferred destination.

Conclusion

The features of Azure Data Lake Storage Gen1 and Azure Blob Storage are combined in Data Lake Storage Gen2. For example, Data Lake Storage Gen2 provides scale, file-level security, and file system semantics. You also get low-cost tiered storage with high-availability/disaster recovery features, since those features are built on top of Blob storage.
