Building an enterprise-grade data lake? Hands-on with Azure Data Lake Storage Gen2 (Part 3)

Compared with traditional, heavyweight OLAP data warehouses, the "data lake", with its huge capacity, all-in-one storage, low cost, support for unstructured data and flexible querying, has won over more and more enterprises and has gradually become a core paradigm of modern data platform architecture.

Azure Data Lake Storage Gen2, the latest generation of Microsoft's data lake service on Azure, raises the capability and experience of the cloud data lake to a new level. In the previous articles we covered its basic usage and how to mount it from a big data cluster. As the final installment of this series, let's continue the deep dive.

ADLS Gen2 hands-on: sharing the data lake

Within an enterprise, a large data lake often needs to be shared. Such a lake is typically divided into multiple zones, and each zone should ideally be accessible to the compute cluster responsible for the corresponding computing task. This also fully embodies the separation of compute and storage, which is the essence of cloud architecture. So, can ADLS Gen2 support this important scenario?

The answer is yes. For each compute cluster, we can play down the role of its own "local" storage and instead ask whether the cluster can read data from a remote data lake instance. Following this line of thought, you can set up a single, independent data lake instance that is shared by multiple compute clusters, with data isolated by directory-level permissions. The lake's data lifecycle is then independent of the creation and destruction of the compute clusters; data is referenced externally and accessed whenever it is needed.

Next, let's take the HDInsight Spark cluster created in the previous article as an example and verify data lake sharing in practice. This passage from Microsoft's documentation tells us how to let an HDInsight cluster access an ADLS Gen2 account "outside" the cluster:

To add a secondary Data Lake Storage Gen2 account, at the storage account level, simply assign the managed identity created earlier to the new Data Lake Storage Gen2 storage account that you wish to add.

https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-data-lake-storage-gen2

It looks quite easy: just grant the identity that represents the cluster the appropriate permissions on ADLS Gen2. Note that the permissions can in fact be very fine-grained, down to the directory and file level, which is exactly what we need.

Next, let's build and test a common scenario of this kind: the raw data sits in zone 1 of the data lake and is processed by a Spark program on cluster 1, which lands the processed data in zone 2 of the same lake; cluster 2 then uses Hive to query the processed data in zone 2. Note that because compute and storage are separated, both the processing cluster and the query cluster can be stateless: they can be shut down when there is no workload, or created and scaled out at any time.

Let's first prepare the shared data lake. In the storage account cloudpickerdlg2 created earlier in this series, create a new file system named datalakefs-shared for data sharing. Inside it, create two folders, zone-rawdata and zone-processed, and upload ATaleOfTwoCities.txt, the text of the novel A Tale of Two Cities used earlier, into the zone-rawdata folder.
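
The folders can be created in the portal or with Azure Storage Explorer. If you prefer to script this step, here is a minimal sketch using the Hadoop FileSystem API from a Spark notebook; it assumes the caller already has owner rights on the account, and the names are the ones used in this series:

import org.apache.hadoop.fs.{FileSystem, Path}

// Root URI of the shared lake: file system datalakefs-shared in account cloudpickerdlg2
val lakeRoot = "abfss://datalakefs-shared@cloudpickerdlg2.dfs.core.windows.net/"
val fs = FileSystem.get(new java.net.URI(lakeRoot), spark.sparkContext.hadoopConfiguration)

// Create the two zones; the sample text can then be uploaded into zone-rawdata
// with Storage Explorer, or copied in with fs.copyFromLocalFile if it sits on local disk
fs.mkdirs(new Path(lakeRoot + "zone-rawdata"))
fs.mkdirs(new Path(lakeRoot + "zone-processed"))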

Next, to let the Spark cluster access the data in this shared lake smoothly, we only need to authorize the previously created spark-cluster-identity on the two folders: grant read permission on zone-rawdata, and read and write permissions on zone-processed.
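
These ACLs are easiest to grant in the portal or in Storage Explorer. For reference, the same assignment could also be scripted through the Hadoop FileSystem ACL API; the sketch below is only illustrative, the GUID is a hypothetical placeholder for the object ID of spark-cluster-identity, and the caller must itself hold owner rights on the file system:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.AclEntry

// Hypothetical object ID; copy the real one from the managed identity's page in the portal
val sparkIdentityOid = "00000000-0000-0000-0000-000000000000"
val fs = FileSystem.get(
  new java.net.URI("abfss://datalakefs-shared@cloudpickerdlg2.dfs.core.windows.net/"),
  spark.sparkContext.hadoopConfiguration)

// Read (and list) on the raw zone, read/write on the processed zone
fs.modifyAclEntries(new Path("/zone-rawdata"),
  AclEntry.parseAclSpec(s"user:$sparkIdentityOid:r-x", true))
fs.modifyAclEntries(new Path("/zone-processed"),
  AclEntry.parseAclSpec(s"user:$sparkIdentityOid:rwx", true))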

Then we can reuse the Spark cluster from the second article in this series to access the remote data lake. Open a Jupyter Notebook, process the data with Spark, and write the result to zone-processed in Parquet format:

// Root URI of the shared data lake (file system datalakefs-shared in account cloudpickerdlg2)
val domain = "abfss://datalakefs-shared@cloudpickerdlg2.dfs.core.windows.net/"
// Read the raw novel text from the raw-data zone
val book = spark.sparkContext.textFile(domain + "zone-rawdata/ATaleOfTwoCities.txt")
// Count occurrences of each word and record the word's length
val wordCounts = book.flatMap(l => l.split(" ")).map(w => (w, 1)).reduceByKey((a, b) => a + b).map { case (w, c) => (w, c, w.length) }
// Attach column names and land the result in the processed zone as Parquet
val wordCountsWithSchema = spark.createDataFrame(wordCounts).toDF("word", "count", "word_length")
wordCountsWithSchema.write.parquet(domain + "zone-processed/ATaleOfTwoCities.parq")

After it runs, you can see that the Parquet result set has landed in zone-processed.
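
For a quick sanity check from the same notebook, you can read the Parquet result back and peek at the most frequent words (a small verification snippet, not part of the original walkthrough):

// Read the freshly written result back from the processed zone
val check = spark.read.parquet(domain + "zone-processed/ATaleOfTwoCities.parq")
check.printSchema()
// Show the ten most frequent words
check.sort(check("count").desc).show(10)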

Next, let's look at the query side of the lake: we can create a separate cluster dedicated to querying and point it at the processed results on the lake. HDInsight has a cluster type optimized for high-performance online queries over big data, called Interactive Query on Azure. It is based on Hive LLAP and is a very good fit for our scenario, so we will deploy it as the query cluster.

We will omit the details of creating the Hive LLAP cluster here; the steps are similar to creating the Spark cluster, just follow the creation wizard step by step. Note that we need to create a corresponding identity, hive-cluster-identity, to represent the querier, and assign this hive-cluster-identity to the query cluster.

The query cluster does not need to be directly associated with the shared data lake datalakefs-shared; we only have to grant hive-cluster-identity read permission on the path to be read (the zone-processed folder).

Then you can query the data on the lake with SQL on the Hive cluster:

-- Create an external table over the Parquet data on the data lake
CREATE EXTERNAL TABLE WordsOnDataLake (
    word string,
    `count` int,
    word_length int
)
STORED AS PARQUET
LOCATION 'abfss://datalakefs-shared@cloudpickerdlg2.dfs.core.windows.net/zone-processed/ATaleOfTwoCities.parq';

-- The data on the lake can now be queried directly,
-- for example aggregating the counts grouped by word length
SELECT word_length, SUM(`count`) AS total_count
FROM WordsOnDataLake
GROUP BY word_length
ORDER BY total_count DESC
LIMIT 10;

The query ran successfully and returned the results:

+--------------+--------------+--+
| word_length  | total_count  |
+--------------+--------------+--+
| 3            | 31667        |
| 4            | 24053        |
| 2            | 22800        |
| 5            | 15942        |
| 6            | 12133        |
| 7            | 9624         |
| 8            | 6791         |
| 9            | 4716         |
| 1            | 4434         |
| 0            | 4377         |
+--------------+--------------+--+
10 rows selected (4.295 seconds)

As you can see, Hive LLAP read the remote data lake successfully, and the speed is quite impressive: the result came back within a few seconds. If you turn off LLAP and fall back to the traditional execution mode (by setting hive.llap.execution.mode to none), the same query takes about 25 seconds in our tests.

In a real-world setup, the Hive LLAP cluster can stay online to serve queries at any time, while the Spark cluster responsible for ETL can be started on demand and shut down once its jobs complete. This design not only makes full use of the shared data lake architecture, but also reflects the cloud's ability to start, stop and scale on demand.

 

Summary

The data lake is an architectural idea that has become popular in recent years. It helps improve the agility of data services, and it provides both a guiding paradigm and practical support for building a unified, enterprise-wide big data platform.

Accordingly, the major public cloud providers have all been building out and strengthening their data lake product lines, and Azure Data Lake Storage Gen2, the focus of this series, is one of the outstanding representatives. Over three in-depth articles we have not only analyzed the product's features, but also run POC validations against concrete application scenarios. Practice shows that ADLS Gen2 can serve as a solid and reliable foundation for an enterprise-grade data lake.

Finally, let's wrap up with an architecture diagram; it nicely summarizes ADLS Gen2's capabilities, its positioning, and its relationship with neighboring systems:

  (Picture from https://www.blue-granite.com/blog/10-things-to-know-about-azure-data-lake-storage-gen2)

From now on, consider using ADLS Gen2 and its related cloud services to build your own data lake!

Related articles:

Building an enterprise-grade data lake? Hands-on with Azure Data Lake Storage Gen2 (Part 1)

Building an enterprise-grade data lake? Hands-on with Azure Data Lake Storage Gen2 (Part 2)

 

"Clouds Supplements" from the user's perspective focused on the introduction of cloud computing products and technology, adhere to the practical operation experience as the core content of the output, combined with the logical product of the scenarios depth interpretation. Welcome Fanger Wei code concern "among the clouds Supplements" micro-channel public number scan next.

 

 
