Building an enterprise-class data lake? Hands-on with Azure Data Lake Storage Gen2 (Part 2)

Introduction

Compared with the traditional, heavyweight OLAP data warehouse, the data lake offers large data capacity, integration of many data sources, low cost, support for unstructured data, and flexible querying. It is winning favor with more and more enterprises and is gradually becoming a core architectural paradigm of the modern data platform.

As a result, data lake services have become a focus of cloud computing development. Azure released its first-generation Data Lake Storage early on; Microsoft then integrated it tightly with Azure Storage and, earlier this year, officially launched the second-generation product: Azure Data Lake Storage Gen2 (hereinafter ADLS Gen2). Its slogan is "a no-compromise data lake platform that combines the rich feature set of advanced data lake solutions with the economics, global scale, and enterprise-grade security of Azure Blob Storage."

How does the new-generation ADLS Gen2 hold up in practice? Are its architecture and features up to the job of serving as the primary storage for big data and data lake applications? In the previous article we got a first look at the basic operations and the permission system of ADLS Gen2. Here we continue to explore in depth, focusing on how ADLS Gen2 behaves once it is mounted on a big data cluster as the storage layer.

Hands-on with ADLS Gen2: mounting on a cluster

Data lake storage is mainly used in big data processing scenarios, so we create an HDInsight big data cluster for the experiment and use Spark to access and operate on the data in the lake. As you can see, HDInsight already supports ADLS Gen2:

Next comes the key step, the storage configuration. We specify a new instance, hdiclusterroot, as the ADLS Gen2 store for the entire cluster, with a file system named hdfs-root, as shown:

(In the figure we also configured Additional Storage Accounts to mount a traditional Blob storage account, which will be used for a later performance comparison. We set that aside for now.)

The lower half of the figure is especially interesting: it lets us specify an Identity, which represents the identity and access rights of the Spark cluster. This is critical, because it means the cluster's identity maps directly onto the permission system of ADLS Gen2, so that access control over big data resources can be put into practice in enterprise scenarios.

Here we chose spark-cluster-identity, an identity created specifically for the cluster. In advance, we granted it the Storage Blob Data Owner role on the hdiclusterroot storage account, so that the identity can make any change to the data in the lake:

After completing the remaining configuration and clicking the create button, Azure provisions the Spark cluster with a single click. About ten minutes later, the whole cluster becomes available:

We eagerly log in to the cluster over SSH to look at the default mounted file system, using hadoop fs -ls to list the files in the root directory:

sshuser@hn0-cloudp:~$ hadoop fs -ls /
Found 18 items
drwxr-xr-x   - sshuser sshuser          0 2019-08-26 03:10 /HdiNotebooks
drwxr-xr-x   - sshuser sshuser          0 2019-08-26 03:29 /HdiSamples
drwxr-x---   - sshuser sshuser          0 2019-08-26 02:54 /ams
drwxr-x---   - sshuser sshuser          0 2019-08-26 02:54 /amshbase
drwxrwx-wt   - sshuser sshuser          0 2019-08-26 02:54 /app-logs
drwxr-x---   - sshuser sshuser          0 2019-09-06 07:41 /apps
drwxr-x--x   - sshuser sshuser          0 2019-08-26 02:54 /atshistory
drwxr-xr-x   - sshuser sshuser          0 2019-08-26 03:25 /custom-scriptaction-logs
drwxr-xr-x   - sshuser sshuser          0 2019-08-26 03:19 /example
drwxr-x---   - sshuser sshuser          0 2019-08-26 02:54 /hbase
drwxr-x--x   - sshuser sshuser          0 2019-09-06 07:41 /hdp
drwxr-x---   - sshuser sshuser          0 2019-08-26 02:54 /hive
drwxr-x---   - sshuser sshuser          0 2019-08-26 02:54 /mapred
drwxrwx-wt   - sshuser sshuser          0 2019-08-26 03:19 /mapreducestaging
drwxrwx-wt   - sshuser sshuser          0 2019-08-26 02:54 /mr-history
drwxrwx-wt   - sshuser sshuser          0 2019-08-26 03:19 /tezstaging
drwxr-x---   - sshuser sshuser          0 2019-08-26 02:54 /tmp
drwxrwx-wt   - sshuser sshuser          0 2019-09-09 02:31 /user

Comparing this listing with ADLS Gen2, you can see that the "root directory" here corresponds exactly to the hdfs-root file system under the hdiclusterroot data lake instance, which shows that the cluster has mounted the data lake file system:
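The same kind of listing can also be produced programmatically through the Hadoop FileSystem API. The following is a minimal sketch, assuming it runs on a cluster node with the Hadoop client libraries and the cluster's configuration on the classpath; the object name ListRoot and the helper formatEntry are hypothetical names chosen for illustration.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ListRoot {
  // Hypothetical helper: renders one entry as "<d or -> <owner> <path>".
  def formatEntry(isDirectory: Boolean, owner: String, path: String): String = {
    val kind = if (isDirectory) "d" else "-"
    s"$kind $owner $path"
  }

  def main(args: Array[String]): Unit = {
    // Assumption: the cluster's Hadoop configuration files are on the classpath.
    val conf = new Configuration()
    // Returns the file system that the cluster treats as its default.
    val fs = FileSystem.get(conf)
    // List the direct children of the root directory.
    fs.listStatus(new Path("/")).foreach { status =>
      println(formatEntry(status.isDirectory, status.getOwner, status.getPath.toUri.getPath))
    }
  }
}
```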

So how is this remote mount achieved? Open the cluster's core-site.xml configuration file; the answer lies in the fs.defaultFS property:

<property>
    <name>fs.defaultFS</name>
    <value>abfs://[email protected]</value>
    <final>true</final>
</property>

It turns out that, unlike the usual hdfs scheme, the cluster's fs.defaultFS was set at creation time to a URL beginning with abfs, and that URL points to our data lake storage. The ABFS driver (Azure Blob File System) was developed by Microsoft specifically for Data Lake Storage Gen2. It fully implements the Hadoop FileSystem interface and serves as the bridge between the Hadoop ecosystem and ADLS Gen2.
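An ABFS URL follows the general pattern abfs://<file system>@<account>.dfs.core.windows.net/<path>. The sketch below composes and takes apart such a URL with java.net.URI. It is only an illustration: the object AbfsUri and its methods are hypothetical helpers, and the account and file system names are the ones mentioned earlier in this article, combined here on the assumption that the standard pattern applies.

```scala
import java.net.URI

object AbfsUri {
  // Host suffix of the Data Lake Storage Gen2 endpoint in the public Azure cloud.
  val EndpointSuffix = "dfs.core.windows.net"

  // Composes abfs://<fileSystem>@<account>.dfs.core.windows.net<path>.
  def build(fileSystem: String, account: String, path: String = "/"): URI =
    new URI("abfs", fileSystem, s"$account.$EndpointSuffix", -1, path, null, null)

  // Splits an ABFS URI into (file system, storage account, path).
  def parse(uri: URI): (String, String, String) = {
    require(uri.getScheme == "abfs" || uri.getScheme == "abfss", s"not an ABFS URI: $uri")
    val account = uri.getHost.stripSuffix(s".$EndpointSuffix")
    (uri.getUserInfo, account, uri.getPath)
  }

  def main(args: Array[String]): Unit = {
    // Names taken from this article; their combination is an assumption.
    val root = build("hdfs-root", "hdiclusterroot")
    println(root)
    println(parse(root))
  }
}
```

The file system name travels in the user-info part of the URI and the storage account in the host, which is why a single fs.defaultFS value is enough to identify both.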

To prove that the data lake file system works properly, let's run the classic WordCount program. I used AzCopy to upload the novel "A Tale of Two Cities" (ATaleOfTwoCities.txt) to the data lake, then went to the Jupyter Notebook that comes with the HDInsight cluster and ran a Scala script that uses Spark to count word frequencies:
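A word count for this step might look like the sketch below. It is not the exact script from the notebook: it is written as a standalone application, the path /ATaleOfTwoCities.txt assumes the file was uploaded to the root of the cluster's default file system, and the object name WordCountSketch and the helper tokenize are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object WordCountSketch {
  // Lower-cases a line and splits it on runs of non-word characters.
  def tokenize(line: String): List[String] =
    line.toLowerCase.split("\\W+").filter(_.nonEmpty).toList

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCountSketch").getOrCreate()

    // Assumed location of the uploaded novel. A path without a scheme is
    // resolved against fs.defaultFS, i.e. the mounted data lake file system.
    val lines = spark.sparkContext.textFile("/ATaleOfTwoCities.txt")

    val counts = lines
      .flatMap(line => tokenize(line))   // one record per word
      .map(word => (word, 1))            // pair each word with a count of 1
      .reduceByKey(_ + _)                // sum the counts per word
      .sortBy(_._2, ascending = false)   // most frequent words first

    // Print the 20 most frequent words.
    counts.take(20).foreach { case (word, n) => println(s"$word\t$n") }

    spark.stop()
  }
}
```

Nothing in the job refers to ADLS Gen2 explicitly; the storage layer is selected entirely by the cluster configuration.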

Great! Our Spark-on-ADLS-Gen2 test ran perfectly, and the whole process was silky smooth.

Summary

Azure Data Lake Storage Gen2 is Microsoft Azure's new generation of big data storage, designed for building enterprise-class data lake applications. It inherits the ease of use and low cost of Azure Blob Storage while adding a directory hierarchy, fine-grained access control, and other enterprise-level features.

As the second article in this ADLS Gen2 series, this post practiced mounting ADLS Gen2 on a big data cluster as its primary storage. Along the way it confirmed ADLS Gen2's good compatibility with the Hadoop ecosystem and gave us a taste of an architecture that, unlike traditional HDFS, separates storage from compute. Because compute and storage can scale independently, this kind of architecture suits the cloud very well and is becoming increasingly popular. We will continue to explore more ADLS Gen2 features in follow-up posts, so stay tuned.

Related reading:

Building an enterprise-class data lake? Hands-on with Azure Data Lake Storage Gen2 (Part 1)

"Among the Clouds Supplements" introduces cloud computing products and technologies from the user's perspective, keeps hands-on experience at the core of its content, and interprets each product's logic and use cases in depth. You are welcome to scan the QR code below to follow the "Among the Clouds Supplements" WeChat official account, or to subscribe to this blog.

 


Origin www.cnblogs.com/yunjianshiyi/p/adls-gen2-part2.html