Data Governance: Apache Atlas Metadata Management

Table of Contents

1. Conceptual Background

1.1 Overview

1.2 Core features

1.3 Components of Atlas

1.4 HDP components Apache Atlas relies on

1.5 Metadata processing flow

1.6 Type system

2. Atlas Metadata Lineage

2.1 Atlas configuration files

1. Atlas audit database

2. Titan graph database

3. hive-site.xml configuration file

4. hbase-site.xml configuration file

2.2 hive_db type example

1. Create a database in Hive

2. Create three tables in the atlas_test database

3. Metadata information of the atlas_test database

4. Information about the teacher table in atlas_test

2.3 Classification propagation

3. Tag-Based Security Policies

3.1 Adding a tag in Atlas

3.2 Configuring a tag-based policy in Ranger

3.3 Applying the tag service to Hive

3.4 Verification


1. Conceptual Background

1.1 Overview

Faced with a huge and continuously growing variety of data objects, can you confidently say where your data comes from and how it changes over time? Data management is an unavoidable concern when adopting Hadoop, and metadata and data governance have become essential parts of an enterprise-level data lake.

In search of an open-source solution for data governance, Hortonworks, together with other vendors and users, launched a data governance initiative in 2015 covering data classification, a centralized policy engine, data lineage, security, and lifecycle management. Apache Atlas is the result of this initiative, and community partners continue to contribute new functions and features to the project. Atlas is used to manage shared metadata, data classification, auditing, security, and data protection, and it integrates with Apache Ranger for data access control policies.

Apache Atlas is Hadoop's data governance and metadata framework. It provides a scalable and extensible set of core data governance services, enabling enterprises to meet compliance requirements within Hadoop effectively and efficiently, and allowing integration with the data ecosystem across the whole enterprise.

1.2 Core features

Apache Atlas provides the following features for Hadoop metadata governance:

  • Data classification

  - Import or define business-oriented classification annotations for metadata

  - Define, annotate, and automatically capture the relationships between data sets and their underlying elements

  - Export metadata to third-party systems

  • Centralized auditing

  - Capture security access information for every application, process, and interaction with data

  - Capture the operational information of executed steps, activities, and so on

  • Search and lineage

  - Predefined navigation paths for exploring data classification and audit information

  - Text-based search to quickly and accurately locate relevant data and audit events

  - Visual browsing of data set lineage, letting users drill down into operational, security, and provenance information

  • Security and policy engine

  - Runtime compliance policies based on data classification schemes, attributes, and roles

  - Advanced policy definitions based on classification to prevent data derivation

  - Row/column-level masking based on cell attributes and values

1.3 Components of Atlas

  • Core

        - Type System: Atlas allows users to define a model for the metadata objects they want to manage. The model is composed of definitions called "types". Instances of types, called "entities", represent the actual metadata objects being managed. The type system is the component that lets users define and manage types and entities. All metadata objects managed by Atlas (such as Hive tables) are modeled with types and represented as entities. To store a new kind of metadata in Atlas, you need to understand the concepts of the type system component.

        - Ingest/Export: The ingest component allows metadata to be added to Atlas. Similarly, the export component exposes metadata changes detected by Atlas as events, and consumers can use these change events to react to metadata changes in real time.

        - Graph Engine: Internally, Atlas represents the metadata objects it manages with a graph model, which provides great flexibility and supports rich relationships between metadata objects. The graph engine is the component responsible for translating between the types and entities of the type system and the underlying graph model. Besides managing the graph objects, it also creates the appropriate indexes for metadata objects so they can be searched efficiently.

        - Titan: Currently, Atlas uses the Titan graph database to store metadata objects. Titan uses two stores: by default the metadata store is HBase and the index store is Solr. The metadata store can also be BerkeleyDB and the index store ElasticSearch, selected through the corresponding configuration (the atlas.graph.storage.backend and atlas.graph.index.search.backend properties). The metadata store holds the metadata objects themselves, while the index store holds indexes over metadata attributes to allow efficient searching.

  • Integration

Users can manage metadata in Atlas in two ways:

  - API: All Atlas functionality is exposed to end users through a REST API that allows creating, updating, and deleting types and entities. It is also the primary mechanism for querying and discovering the types and entities managed by Atlas (see the REST sketch after the Applications list below).

  - Messaging: In addition to the API, users can integrate with Atlas through a message-based interface built on Kafka. This is useful both for feeding metadata objects into Atlas and for consuming the metadata change events that Atlas emits in order to build applications on top of them. The messaging interface is particularly useful when a more loosely coupled integration with Atlas is desired, as it allows better scalability, reliability, and so on. Atlas uses Apache Kafka as the notification server for communication between hooks and downstream consumers of metadata notification events. Hooks and Atlas write events to two different Kafka topics (a minimal consumer sketch follows this list):

      - ATLAS_HOOK: metadata notification events from the hooks of the various components are written to the Kafka topic named ATLAS_HOOK and consumed by Atlas

      - ATLAS_ENTITIES: events from Atlas to other integrated components (such as Ranger) are written to the Kafka topic named ATLAS_ENTITIES
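
For illustration, here is a minimal consumer sketch using the kafka-python package. The topic name ATLAS_ENTITIES comes from the list above; the broker address, consumer group, and printed fields are assumptions to adapt to your cluster.

# Minimal sketch of an ATLAS_ENTITIES consumer (kafka-python package assumed).
# The broker address and group id below are placeholders for your cluster.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "ATLAS_ENTITIES",                    # topic Atlas writes entity change events to
    bootstrap_servers="localhost:6667",  # typical HDP Kafka port; adjust as needed
    group_id="atlas-demo-consumer",      # arbitrary group id for this demo
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for record in consumer:
    # Each notification is a JSON document describing the operation
    # (entity create/update, classification add, etc.); print a summary.
    print(record.value.get("message", record.value))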

  • Metadata sources

     - Hive: through the Hive bridge, Atlas can access Hive metadata, including hive_db / hive_table / hive_column / hive_process

     - Sqoop: through the Sqoop bridge, Atlas can access the metadata of relational databases, including sqoop_operation_type / sqoop_dbstore_usage / sqoop_process / sqoop_dbdatastore

     - Falcon: through the Falcon bridge, Atlas can access Falcon metadata, including falcon_cluster / falcon_feed / falcon_feed_creation / falcon_feed_replication / falcon_process

     - Storm: through the Storm bridge, Atlas can access streaming metadata, including storm_topology / storm_spout / storm_bolt

     - HBase: through the HBase bridge

  For Atlas to integrate a big data component as a metadata source, two things must be implemented:

   - First, a metadata model that can express the component's metadata objects must be defined on top of the Atlas type system (for example, the Hive metadata model is implemented in org.apache.atlas.hive.model.HiveDataModelGenerator);

   - Second, a hook component must be provided that extracts metadata objects from the component's metadata source, listens for metadata changes in real time, and feeds them back to Atlas.

  • Applications

 - Atlas Admin UI: this component is a web-based application that allows data administrators and data scientists to discover and annotate metadata. The Admin UI provides a search interface and an SQL-like query language (the Atlas DSL) that can be used to query the metadata types and objects managed by Atlas. The Admin UI builds its functionality on Atlas's REST API (a sketch of both follows this list).

 - Tag Based Policies: Apache Ranger is an advanced security management solution for the Hadoop ecosystem with extensive integration across Hadoop components. Through its integration with Atlas, Ranger lets security administrators define metadata-driven security policies for effective governance. Ranger is a consumer of the metadata change events that Atlas publishes.

 - Business Taxonomy: the metadata objects Atlas ingests from metadata sources are primarily technical metadata. To enhance discoverability and governance, Atlas provides a business taxonomy interface that lets users first define a set of business terms representing their business domain and then associate them with the metadata entities Atlas manages. Business Taxonomy is a web application that is currently part of the Atlas Admin UI and integrates with Atlas through the REST API.
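
As a concrete illustration of the REST API and the SQL-like DSL mentioned above, here is a minimal sketch using the requests package. The /api/atlas/v2/search/dsl and /api/atlas/v2/entity/uniqueAttribute/type/{typeName} endpoints are standard Atlas v2 APIs; the host, credentials, and qualifiedName values are placeholders for your environment.

# Minimal sketch of the Atlas v2 REST API (requests package assumed).
# Host, credentials, and qualifiedName are placeholders.
import requests

ATLAS = "http://localhost:21000/api/atlas/v2"
AUTH = ("admin", "admin")   # default Atlas credentials; change in production

# DSL search: the same SQL-like query language exposed by the Admin UI.
resp = requests.get(
    f"{ATLAS}/search/dsl",
    params={"query": 'hive_table where name = "teacher"', "limit": 10},
    auth=AUTH,
)
resp.raise_for_status()
for entity in resp.json().get("entities", []):
    print(entity["typeName"], entity["guid"], entity["attributes"]["qualifiedName"])

# Fetch a single entity by its unique attribute (qualifiedName is db.table@cluster).
resp = requests.get(
    f"{ATLAS}/entity/uniqueAttribute/type/hive_table",
    params={"attr:qualifiedName": "atlas_test.teacher@mycluster"},
    auth=AUTH,
)
resp.raise_for_status()
print(resp.json()["entity"]["attributes"])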

1.4 HDP components Apache Atlas relies on

  1. HBase: Titan uses HBase as its metadata store by default
  2. Ambari Infra / Solr: Titan uses Solr as its index store (metadata indexes) by default
  3. Kafka: Apache Atlas uses Kafka as the message queue for communication between hooks and the consumers of metadata notification events

 

1.5 Metadata processing flow

The overall process of metadata processing is shown in the figure below (figure omitted):

When querying a metadata object in Atlas, it is often necessary to traverse multiple vertices and edges in the graph database, which is considerably more complex than directly reading a row from a relational database. Of course, using a graph database as the underlying store also has its advantages: for example, it can support complex data types and handles reading and writing lineage data well.

1.6 Type system

Atlas allows users to define a model for the metadata objects they want to manage. The model is composed of definitions called "types". Instances of types, called "entities", represent the actual metadata objects being managed. All metadata objects managed by Atlas (such as Hive tables) are modeled using types and represented as entities.

- Type: an Atlas "type" defines how a particular kind of metadata object is stored and accessed. A type represents one or more collections of attributes of the metadata objects being defined. Users with a development background can understand a "type" as the "class" of an object-oriented programming language or the "table schema" of a relational database. Every type has a metatype, which indicates what kind of model it is in Atlas:

           - Basic metatypes: Int, String, Boolean, etc.

           - Collection metatypes: such as Array and Map

           - Composite metatypes: Class, Struct, Trait

 - Entities: an Atlas "entity" is a specific instance, or value, of a "type", and therefore represents a specific metadata object in the real world. Returning to the object-oriented programming analogy, an "instance" is an "object" of a "class".

 - Attributes: attributes are properties of the types and entities in the Atlas type system. An attribute definition includes:

     - isComposite: whether the attribute is composite

     - isIndexable: whether the attribute is indexed

     - isUnique: whether the attribute is unique

     - multiplicity: whether the attribute is required, optional, or multi-valued

Atlas provides some predefined system types:

- Referenceable: this type represents all entities that can be searched for by a unique attribute called qualifiedName

- Asset: this type contains attributes such as name, description, and owner

- Infrastructure: this type extends Referenceable and Asset, and is generally the common supertype for infrastructure metadata objects (such as clusters and hosts)

- DataSet: this type extends Referenceable and Asset. Conceptually, it represents a type that stores data. In Atlas, Hive tables, Sqoop RDBMS tables, and so on are all types that extend DataSet. Types that extend DataSet can be expected to have a schema, i.e. attributes that define the structure of the data set (for example, the columns attribute of hive_table). In addition, entities of types that extend DataSet participate in data transformations, which Atlas can use to generate lineage (or provenance) graphs.

- Process: this type extends Referenceable and Asset. Conceptually, it represents any data transformation operation. For example, an ETL process that converts a Hive table of raw data into another Hive table storing some aggregate can be a specific type that extends Process. The Process type has two specific attributes, inputs and outputs, which are used to build lineage.

A Hive table is an example of a type defined natively by Atlas. The hive_table type is defined with the following attributes:

Name:         hive_table
TypeCategory: Entity
SuperTypes:   DataSet
Attributes:    
	name:              string    
	db:                 hive_db    
	owner:             string    
	createTime:       date    
	lastAccessTime:   date    
	comment:           string    
	retention:         int    
	sd:                 hive_storagedesc    
	partitionKeys:    array<hive_column>    
	aliases:           array<string>    
	columns:           array<hive_column>    
	parameters:        map<string>    
	viewOriginalText: string    
	viewExpandedText: string    
	tableType:         string    
	temporary:         boolean
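
To make the type system concrete, here is a hedged sketch that registers a small custom entity type extending DataSet through the standard /api/atlas/v2/types/typedefs endpoint; the type name, its attribute, and the connection details are invented for illustration.

# Sketch: register a custom entity type extending DataSet via the Atlas v2
# typedefs API. The type name and its attribute are invented for illustration.
import requests

ATLAS = "http://localhost:21000/api/atlas/v2"
AUTH = ("admin", "admin")

typedef = {
    "entityDefs": [
        {
            "name": "demo_dataset",     # hypothetical type name
            "superTypes": ["DataSet"],  # inherits qualifiedName/name/owner and lineage participation
            "attributeDefs": [
                {
                    "name": "retentionDays",
                    "typeName": "int",
                    "isOptional": True,
                    "cardinality": "SINGLE",
                    "isUnique": False,
                    "isIndexable": False,
                },
            ],
        },
    ],
}

resp = requests.post(f"{ATLAS}/types/typedefs", json=typedef, auth=AUTH)
resp.raise_for_status()
print("created:", [t["name"] for t in resp.json().get("entityDefs", [])])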

2. Atlas Metadata Lineage

2.1 Atlas configuration files

1. Atlas audit database

Atlas entity audit information is stored in the HBase table ATLAS_ENTITY_AUDIT_EVENTS, which is created by default.

2. Titan graph database

Atlas saves its graph metadata in the HBase table atlas_janus (newer Atlas releases replaced Titan with its successor, JanusGraph, hence the table name) and uses Solr to index the information.

3. hive-site.xml configuration file

When the Atlas service is selected, Ambari automatically creates the required configuration at installation time and registers the Atlas hook in hive-site.xml: the class org.apache.atlas.hive.hook.HiveHook is added to hive.exec.post.hooks, so Hive operations are reported to Atlas.

4. hbase-site.xml configuration file

2.2 hive_db type example

1. Create a database in Hive

Create the atlas_test database in Hive. The metadata captured by the Hive hook is synchronized to Atlas, and the new database can then be seen under the hive_db type. (A combined sketch of this step and the next follows the heading below.)

2. Create three tables in the atlas_test database
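
A minimal sketch of steps 1 and 2, assuming the PyHive package and an unsecured HiveServer2 on localhost:10000. The column definitions are invented; the table names (employee, student, teacher) match the lineage demo in section 2.3, and teacher is created with CTAS so that the Hive hook reports a hive_process and Atlas can draw lineage.

# Sketch: create the atlas_test database and three tables through HiveServer2
# (PyHive package and an unsecured HiveServer2 on localhost:10000 assumed).
# Column definitions are invented for illustration.
from pyhive import hive

conn = hive.connect(host="localhost", port=10000, username="hive")
cur = conn.cursor()

# Step 1: the database.
cur.execute("CREATE DATABASE IF NOT EXISTS atlas_test")

# Step 2: two base tables plus one derived table.
cur.execute(
    "CREATE TABLE IF NOT EXISTS atlas_test.employee (id INT, name STRING, dept STRING)"
)
cur.execute(
    "CREATE TABLE IF NOT EXISTS atlas_test.student (id INT, name STRING, employee_id INT)"
)
# CTAS: the Hive hook reports this as a hive_process linking the inputs
# (employee, student) to the output (teacher), which Atlas renders as lineage.
cur.execute(
    "CREATE TABLE atlas_test.teacher AS "
    "SELECT e.id, e.name, s.name AS student_name "
    "FROM atlas_test.employee e JOIN atlas_test.student s ON s.employee_id = e.id"
)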

3. Metadata information of the atlas_test database

(1) Properties

(2) Relationships

(3) Classifications

(4) Audits

(5) Tables

4. Information about the teacher table in atlas_test

(1) Properties

(2) Lineage

(3) Relationships

(4) Classifications

(5) Audits

(6) Schema

2.3 Classification propagation

Classification propagation enables a classification associated with an entity to be automatically associated with other entities related to it. This is very useful in scenarios where data sets derive data from other data sets, for example a table loaded from data in a file, or a report generated from a table or view.

For example, when a table is classified as PII, a table or view that derives data from that table (through CTAS or a CREATE VIEW operation) is automatically classified as PII as well.

(1) Create the PII classification

(2) Demonstration with the Hive teacher table

Below, I tag the employee table with the PII classification; the student table associated with it and the teacher table derived from employee and student are then automatically classified as PII as well (a REST sketch of the same operation follows).

Propagated classifications behave like an infectious disease, passed on to the next generation and its derivatives.
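
The same tagging can be done programmatically. Here is a hedged sketch using the standard v2 endpoints, with host, credentials, and cluster name as placeholders; the PII classification is assumed to exist already, as created in step (1).

# Sketch: attach the PII classification to the employee table via REST and
# let it propagate; host, credentials, and cluster name are placeholders.
import requests

ATLAS = "http://localhost:21000/api/atlas/v2"
AUTH = ("admin", "admin")

# Look up the table's guid by its unique qualifiedName attribute.
resp = requests.get(
    f"{ATLAS}/entity/uniqueAttribute/type/hive_table",
    params={"attr:qualifiedName": "atlas_test.employee@mycluster"},
    auth=AUTH,
)
resp.raise_for_status()
guid = resp.json()["entity"]["guid"]

# Add the PII classification; propagate=True lets Atlas carry the tag along
# lineage to derived entities such as the teacher table.
resp = requests.post(
    f"{ATLAS}/entity/guid/{guid}/classifications",
    json=[{"typeName": "PII", "propagate": True}],
    auth=AUTH,
)
resp.raise_for_status()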

3. Tag-Based Security Policies

Atlas integrates with Ranger to provide tag-based dynamic access control for Hadoop. By attaching controls to the tags associated with resources, instead of to the resources themselves, the access control model gains considerable flexibility: classification-based, cross-component access control can be achieved without creating separate services and policies in each component:

  • It separates resource classification from permission control. Resources in different Hadoop components (such as HDFS directories, Hive tables, HBase tables) holding the same class of data (such as social security or credit card numbers) can be tagged with the same tag, and access to them controlled uniformly
  • Once a Hadoop resource is tagged, the permissions associated with that tag are automatically applied to the resource
  • A single access control policy can be applied to resources across different Hadoop components, instead of creating a separate policy for each component's resources

3.1 Adding a tag in Atlas

Add the tag based on the database and tables created above.

3.2 Configuring a tag-based policy in Ranger

Configure Ranger Tagsync to synchronize tags from Atlas. This can be done through the Ambari web UI under the Ranger service, as shown in the figure (figure omitted):

(1) Create tag-based policies

(2) Create a policy in the tag service

The tag used here is the one added in Atlas.

3.3 Applying the tag service to Hive

(1) Create a new service

(2) Select the tag service

3.4 Verification

My test machine was reformatted, so this verification was interrupted. The verification consists of logging in to HiveServer2 as the kangll user and accessing the atlas_test database (a sketch follows).
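
A minimal sketch of that check, assuming PyHive and an unsecured HiveServer2 as in the earlier sketch; whether the query succeeds or is denied depends on the tag-based policy defined for the kangll user.

# Sketch: verify the tag-based policy by querying as the kangll user
# (PyHive and an unsecured HiveServer2 on localhost:10000 assumed).
from pyhive import hive

conn = hive.connect(host="localhost", port=10000, username="kangll")
cur = conn.cursor()
try:
    cur.execute("SELECT * FROM atlas_test.teacher LIMIT 5")
    print(cur.fetchall())     # the tag-based policy allowed access
except Exception as exc:      # a Ranger denial surfaces as an authorization error
    print("access denied:", exc)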

Detailed steps are described in the official tutorial: https://www.cloudera.com/tutorials/tag-based-policies-with-apache-ranger-and-apache-atlas/2.html

 

Reference:

https://www.jianshu.com/p/8c07974111dd
