Apache Atlas metadata management and governance platform: usage and architecture

1. Introduction

Apache Atlas is a metadata management and data governance platform hosted by the Apache Software Foundation, and it is now widely used in the big data field. It helps enterprises manage and classify their data assets, and provides high-quality metadata as a foundation for data analysis and data governance.

As an enterprise's business volume gradually expands, its data grows day by day. Data from different business lines may be stored in several types of databases and is eventually gathered into the enterprise's data warehouse for integrated analysis. At that point, tracing where a piece of data came from and untangling the relationships between datasets becomes extremely troublesome, and if a problem occurs in one link of the pipeline, the cost of tracing it back is huge. Atlas was born in this context. With it, we can conveniently manage metadata and trace lineage at both the table level and the column level, providing strong support and guarantees for the enterprise's data assets. Atlas supports extracting and managing metadata from HBase, Hive, Sqoop, Storm, and Kafka, and you can also define your own metadata model and generate metadata through its REST API.

In this article, we first introduce the core concepts of Atlas to help you understand it better, and then explain in detail how to define a custom data model and generate lineage through the REST API, so that you can build your own customized features on top of it.

2. Atlas principles and related concepts

Metadata

Metadata is data that describes data: tables, fields, views, and so on. Each business system may define its own tables, fields, and views. Where does the data come from and where does it go? Are datasets related to one another? Do different systems contain duplicate or contradictory fields? These are the problems that metadata management, and therefore Atlas, needs to solve.

Operating principle

The principle behind Atlas is not hard to understand. It reads the database structure of the data warehouse through the scripts it ships with, generates data models from it, and stores them in Atlas's HBase store. At the same time, it monitors data changes in the warehouse through hooks, analyzes the executed SQL statements to derive table-to-table and column-to-column dependencies, and displays them to the user in the web UI.

Data warehouse support

Atlas supports Hive best. As we all know, Hive relies on Hadoop and stores its data in HDFS. Atlas ships with a shell script that reads metadata such as Hive's table structures, synchronizes it into the Atlas repository, and automatically generates a metadata model. The HiveHook provided by Atlas can then monitor Hive data changes, infer the relationships between datasets from the SQL that Hive executes, and generate a lineage graph. For analyzing the metadata and lineage of other storage systems, Atlas's support is less ideal. In practice, however, we usually synchronize data from business databases such as MySQL and Oracle into the data warehouse for integration and analysis, and the warehouse is generally built on the Hadoop ecosystem, so this is not a problem.

Architecture diagram

The following is an architecture diagram of Atlas. As you can see, the ecosystem Atlas depends on is extremely large, which makes its deployment quite cumbersome. This article will not cover deployment; there are many tutorials online, and interested readers can search for one and try it themselves.
(figure: Atlas architecture diagram)

Core component concepts

Atlas has the following core components, which deserve our attention. The custom modeling we will do later through the REST API is essentially a set of create, read, update, and delete operations on these components.

1. Type

The definition of a metadata type, such as a database, table, or column; types can also be subdivided, for example into MySQL tables (mysql_table) and Oracle tables (oracle_table). Atlas ships with many built-in types, such as DataSet and Process. By convention, data-related types inherit from DataSet and process-related types inherit from Process, which makes it easy to generate lineage. We can also define custom types by calling the API. This is the starting point of everything: once types are defined, entities of those types can be created and lineage can be generated. Personally, I like to call defining metadata types "modeling".
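A quick way to check whether a type is already defined is the v2 REST endpoint GET /api/atlas/v2/types/typedef/name/{name}, which returns 200 if the type exists and 404 if it does not. A minimal sketch of building that URL (the host and port are assumptions matching a default local deployment):

```java
// Sketch: build the URL for Atlas's "get typedef by name" endpoint
// (GET /api/atlas/v2/types/typedef/name/{name} in the Atlas v2 REST API).
public class TypeDefUrl {
    public static String byName(String baseUrl, String typeName) {
        return baseUrl + "/api/atlas/v2/types/typedef/name/" + typeName;
    }

    public static void main(String[] args) {
        // a 200 response on this URL means the type exists; 404 means it does not
        System.out.println(byName("http://127.0.0.1:21000", "DataSet"));
    }
}
```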

2. Classification

Classification, in plain terms, means labeling metadata with tags. Classifications can propagate: for example, if view A is generated from table A and table A is tagged with a, then view A is automatically tagged with a as well. The advantage of this is that it makes data tracking easier.

3. Entity

Entities represent concrete metadata; the objects that Atlas manages are entities of various types.

4. Lineage

Data lineage represents how data flows between datasets. Through lineage, we can clearly see where data comes from, where it flows to, and what operations it went through along the way. That way, once there is a problem with the data, we can quickly trace back and locate the link where the error occurred.
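Lineage for any entity can be fetched from the v2 REST endpoint GET /api/atlas/v2/lineage/{guid}. A minimal sketch of building such a request URL (the host, port, and guid below are placeholder assumptions):

```java
// Sketch: build the URL of Atlas's lineage endpoint
// (GET /api/atlas/v2/lineage/{guid}?direction=...&depth=...).
public class LineageUrl {
    // direction is INPUT, OUTPUT, or BOTH; depth limits how many hops are returned
    public static String of(String baseUrl, String guid, String direction, int depth) {
        return baseUrl + "/api/atlas/v2/lineage/" + guid
                + "?direction=" + direction + "&depth=" + depth;
    }

    public static void main(String[] args) {
        System.out.println(of("http://127.0.0.1:21000", "some-entity-guid", "BOTH", 3));
    }
}
```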

3. Atlas installation

(Reference links; adapt them to your actual situation)
1. https://blog.csdn.net/hshudoudou/article/details/123899947
2. https://blog.csdn.net/javaThanksgiving/article/details/130505251

4. Using Atlas

After Atlas is deployed successfully, using it is straightforward. This is the login page; the default username and password are admin / admin:
Enter the home page and click "Switch to New" in the upper right corner; the new UI is more intuitive:
On the left side of the page is the Atlas type tree. Click a type node to view the entities under it. Here we click mysql_table:
You can see many tables under it; these are the ones I previously defined and uploaded using the REST API.

Next, let's explain how to define custom types, generate entities, and create lineage through the REST API.

5. Detailed explanation and examples of the Atlas REST API

Click Help -> API Documentation at the top of the home page to view all of Atlas's open interfaces:
One thing to note is that Atlas's interfaces require authentication, so we need to include the username and password when building an HTTP request. In this example, we use the open-source atlas-client-v2 component to make the Atlas API calls.
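Atlas uses HTTP Basic authentication, so if you call the REST API without the client library, you just need an Authorization header built from "username:password". A minimal sketch (admin/admin is the default credential pair mentioned above):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch: build the HTTP Basic authentication header that Atlas expects.
public class AtlasAuth {
    public static String basicAuthHeader(String user, String password) {
        String token = Base64.getEncoder()
                .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));
        return "Basic " + token;
    }

    public static void main(String[] args) {
        // set this value on the "Authorization" request header of every Atlas call
        System.out.println(basicAuthHeader("admin", "admin")); // Basic YWRtaW46YWRtaW4=
    }
}
```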

In this example, we define a my_db type and a my_table type and make my_db one-to-many with my_table. We then create a test_db entity of type my_db, create test_table_source and test_table_target entities of type my_table, and finally declare that the data of test_table_target comes from test_table_source, generating a lineage dependency between the two entities.

Customize my_db and my_table types

We define the my_db and my_table types. Atlas's REST API allows a single request to define multiple types. We first build the JSON request body and then implement the same thing in code; comparing the two makes it easier to understand. The JSON request body is as follows (comments are provided in key places):

{
  "enumDefs": [],
  "structDefs": [],
  "classificationDefs": [],
  // type definitions
  "entityDefs": [
    {
      "name": "my_db",
      // by convention, data-related types inherit from Atlas's built-in DataSet
      "superTypes": [
        "DataSet"
      ],
      // service type (used to group types in the UI)
      "serviceType": "my_type",
      "typeVersion": "1.1",
      "attributeDefs": []
    },
    {
      "name": "my_table",
      "superTypes": [
        "DataSet"
      ],
      "serviceType": "my_type",
      "typeVersion": "1.1",
      "attributeDefs": []
    }
  ],
  // defines the relationship between the types
  "relationshipDefs": [
    {
      "name": "my_table_db",
      "serviceType": "my_type",
      "typeVersion": "1.1",
      // relationship categories:
      // ASSOCIATION: plain association, no container, 1-to-1
      // AGGREGATION: container relationship, 1-to-many, both sides can exist independently
      // COMPOSITION: container relationship, 1-to-many, but contained instances cannot exist outside the container
      "relationshipCategory": "AGGREGATION",
      // end 1
      "endDef1": {
        "type": "my_table",
        // the name of the relationship attribute on the table, pointing to my_db below
        "name": "db",
        // whether this end is the container
        "isContainer": false,
        // cardinality: SINGLE, LIST, or SET
        "cardinality": "SINGLE"
      },
      // end 2
      "endDef2": {
        "type": "my_db",
        "name": "tables",
        "isContainer": true,
        // a db contains tables, and tables must not repeat, so the cardinality is SET
        "cardinality": "SET"
      },
      // tag propagation; NONE means do not propagate
      "propagateTags": "NONE"
    }
  ]
}
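
As a side note, the JSON body above can also be POSTed directly, without the Java client, to the typedefs endpoint (POST /api/atlas/v2/types/typedefs). A minimal stdlib-only sketch; the host, port, and admin/admin credentials are assumptions matching a default local deployment:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch: POST a typedefs JSON body straight to the Atlas v2 REST API.
public class CreateTypeDefs {
    public static String typedefsUrl(String baseUrl) {
        return baseUrl + "/api/atlas/v2/types/typedefs";
    }

    public static int post(String baseUrl, String user, String password, String json) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(typedefsUrl(baseUrl)).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        // Atlas uses HTTP Basic authentication
        String token = Base64.getEncoder().encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + token);
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode(); // 200 on success
    }

    public static void main(String[] args) {
        // post("http://127.0.0.1:21000", "admin", "admin", jsonBody) against a running Atlas instance
        System.out.println(typedefsUrl("http://127.0.0.1:21000"));
    }
}
```

Note that the JSON must have its // comments stripped before being sent, since they are only annotations for the reader.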

Coding implementation:

Introduce the pom dependencies. Note that if you integrate this into your own business system and that system uses another logging framework, you need to exclude the slf4j-log4j12 dependency, otherwise the logging frameworks will conflict and cause startup failures. In addition, atlas-client-common depends on commons-configuration 1.10; if your business system pulls in a lower version, remember to exclude it, otherwise the two will conflict and client initialization will fail.

<dependencies>
        <!-- Apache Atlas -->
        <dependency>
            <groupId>org.apache.atlas</groupId>
            <artifactId>atlas-client-common</artifactId>
            <version>2.1.0</version>
            <exclusions>
                <exclusion>
                    <artifactId>slf4j-log4j12</artifactId>
                    <groupId>org.slf4j</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        <!-- Apache Atlas Client  Version2 -->
        <dependency>
            <groupId>org.apache.atlas</groupId>
            <artifactId>atlas-client-v2</artifactId>
            <version>2.1.0</version>
            <exclusions>
                <exclusion>
                    <artifactId>slf4j-log4j12</artifactId>
                    <groupId>org.slf4j</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>log4j</artifactId>
                    <groupId>log4j</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>${fastjson.version}</version>
        </dependency>
    </dependencies>

Add atlas-application.properties to the classpath (it must exist, otherwise client initialization will fail):

atlas.rest.address=http://127.0.0.1:21000
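
Besides the REST address, client timeout settings can also go in this file; a minimal sketch (the timeout values below are illustrative assumptions, not requirements):

```properties
# address of the Atlas server (required by the client)
atlas.rest.address=http://127.0.0.1:21000
# optional client timeouts, in milliseconds
atlas.client.connectTimeoutMSecs=60000
atlas.client.readTimeoutMSecs=60000
```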

The code is implemented as follows (it is easy to follow when compared with the JSON above):

		// imports: org.apache.atlas.AtlasClientV2, org.apache.atlas.model.SearchFilter,
		//          org.apache.atlas.model.typedef.*, java.util.*
		AtlasClientV2 atlasClientV2 = new AtlasClientV2(
				new String[]{"http://127.0.0.1:21000"},
				new String[]{"admin", "admin"});
		// set of super types
		Set<String> superTypes = new HashSet<>();
		superTypes.add(AtlasBaseTypeDef.ATLAS_TYPE_DATASET);
		// the container for all type definitions
		AtlasTypesDef myType = new AtlasTypesDef();
		// define my_db
		AtlasEntityDef myDb = new AtlasEntityDef();
		myDb.setName("my_db");
		myDb.setServiceType("my_type");
		myDb.setSuperTypes(superTypes);
		myDb.setTypeVersion("1.1");
		// define my_table
		AtlasEntityDef myTable = new AtlasEntityDef();
		myTable.setName("my_table");
		myTable.setServiceType("my_type");
		myTable.setSuperTypes(superTypes);
		myTable.setTypeVersion("1.1");
		// define the relationship between the two types
		AtlasRelationshipDef relationshipDef = new AtlasRelationshipDef();
		relationshipDef.setName("my_table_db");
		relationshipDef.setServiceType("my_type");
		relationshipDef.setTypeVersion("1.1");
		relationshipDef.setRelationshipCategory(AtlasRelationshipDef.RelationshipCategory.AGGREGATION);
		relationshipDef.setPropagateTags(AtlasRelationshipDef.PropagateTags.NONE);
		// end 1: the table side
		AtlasRelationshipEndDef endDef1 = new AtlasRelationshipEndDef();
		endDef1.setType("my_table");
		endDef1.setName("db");
		endDef1.setIsContainer(false);
		endDef1.setCardinality(AtlasStructDef.AtlasAttributeDef.Cardinality.SINGLE);
		relationshipDef.setEndDef1(endDef1);
		// end 2: the db side, which is the container
		AtlasRelationshipEndDef endDef2 = new AtlasRelationshipEndDef();
		endDef2.setType("my_db");
		endDef2.setName("tables");
		endDef2.setIsContainer(true);
		endDef2.setCardinality(AtlasStructDef.AtlasAttributeDef.Cardinality.SET);
		relationshipDef.setEndDef2(endDef2);
		// entityDefs
		List<AtlasEntityDef> entityDefs = new ArrayList<>(2);
		entityDefs.add(myDb);
		entityDefs.add(myTable);
		myType.setEntityDefs(entityDefs);
		// relationshipDefs
		List<AtlasRelationshipDef> relationshipDefs = new ArrayList<>(1);
		relationshipDefs.add(relationshipDef);
		myType.setRelationshipDefs(relationshipDefs);
		// check whether the my_db type already exists; create it only if it does not
		SearchFilter filter = new SearchFilter();
		filter.setParam("name", "my_db");
		AtlasTypesDef allTypeDefs = atlasClientV2.getAllTypeDefs(filter);
		if (allTypeDefs.getEntityDefs().isEmpty()) {
			// call the REST API
			atlasClientV2.createAtlasTypeDefs(myType);
		}

Run the code above, then go to the Atlas home page; you can see the type has been created successfully:
View the type model diagram:
The types are now created. Next we create the entities.

Create entities test_db, test_table_source and test_table_target

The JSON is as follows:

// the my_db entity
{
	"typeName": "my_db",
	"attributes": {
		"qualifiedName": "test_db",
		"name": "test_db",
		"description": "测试创建db"
	}
}
// the test_table_source entity
{
	"typeName": "my_table",
	"attributes": {
		"qualifiedName": "test_table_source",
		"name": "test_table_source",
		"description": "测试创建test_table_source"
	},
	"relationshipAttributes": {
		"db": {
			"typeName": "my_db",
			// the guid of my_db (returned after my_db is created)
			"guid": "xxxx"
		}
	}
}
// the test_table_target entity
{
	"typeName": "my_table",
	"attributes": {
		"qualifiedName": "test_table_target",
		"name": "test_table_target",
		"description": "测试创建test_table_target"
	},
	"relationshipAttributes": {
		"db": {
			"typeName": "my_db",
			"guid": "xxx"
		}
	}
}

The code is implemented as follows:

		// additional imports: org.apache.atlas.AtlasServiceException,
		//   org.apache.atlas.model.instance.AtlasEntity, org.apache.atlas.model.instance.EntityMutationResponse,
		//   com.sun.jersey.api.client.ClientResponse
		// create the test_db entity
		AtlasEntity testDb = new AtlasEntity();
		testDb.setTypeName("my_db");
		Map<String, Object> attributes = new HashMap<>();
		attributes.put("qualifiedName", "test_db");
		attributes.put("name", "test_db");
		attributes.put("description", "测试创建db");
		testDb.setAttributes(attributes);
		Map<String, String> queryAttributes = new HashMap<>();
		queryAttributes.put("qualifiedName", "test_db");
		String myDbGuid = null;
		try {
			// throws if the entity does not exist yet
			AtlasEntity.AtlasEntityWithExtInfo extInfo = atlasClientV2.getEntityByAttribute("my_db", queryAttributes);
			myDbGuid = extInfo.getEntity().getGuid();
		} catch (AtlasServiceException e) {
			if (ClientResponse.Status.NOT_FOUND.equals(e.getStatus())) {
				AtlasEntity.AtlasEntityWithExtInfo extInfo = new AtlasEntity.AtlasEntityWithExtInfo(testDb);
				// call the REST API
				EntityMutationResponse response = atlasClientV2.createEntity(extInfo);
				myDbGuid = response.getGuidAssignments().values().toArray(new String[0])[0];
			}
		}
		// build the relationship attribute pointing at the db
		Map<String, Object> relationShipAttr = new HashMap<>();
		Map<String, String> dbMap = new HashMap<>();
		dbMap.put("guid", myDbGuid);
		dbMap.put("typeName", "my_db");
		relationShipAttr.put("db", dbMap);
		// create the test_table_source entity
		AtlasEntity testTableSource = new AtlasEntity();
		testTableSource.setTypeName("my_table");
		// use a fresh map so the attributes of test_db are not mutated
		attributes = new HashMap<>();
		attributes.put("qualifiedName", "test_table_source");
		attributes.put("name", "test_table_source");
		attributes.put("description", "测试创建test_table_source");
		testTableSource.setAttributes(attributes);
		testTableSource.setRelationshipAttributes(relationShipAttr);
		queryAttributes.put("qualifiedName", "test_table_source");
		try {
			AtlasEntity.AtlasEntityWithExtInfo extInfo = atlasClientV2.getEntityByAttribute("my_table", queryAttributes);
			testTableSource = extInfo.getEntity();
		} catch (AtlasServiceException e) {
			if (ClientResponse.Status.NOT_FOUND.equals(e.getStatus())) {
				AtlasEntity.AtlasEntityWithExtInfo extInfo = new AtlasEntity.AtlasEntityWithExtInfo(testTableSource);
				EntityMutationResponse response = atlasClientV2.createEntity(extInfo);
				testTableSource.setGuid(response.getGuidAssignments().values().toArray(new String[0])[0]);
			}
		}
		// create the test_table_target entity
		AtlasEntity testTableTarget = new AtlasEntity();
		testTableTarget.setTypeName("my_table");
		attributes = new HashMap<>();
		attributes.put("qualifiedName", "test_table_target");
		attributes.put("name", "test_table_target");
		attributes.put("description", "测试创建test_table_target");
		testTableTarget.setAttributes(attributes);
		testTableTarget.setRelationshipAttributes(relationShipAttr);
		queryAttributes.put("qualifiedName", "test_table_target");
		try {
			AtlasEntity.AtlasEntityWithExtInfo extInfo = atlasClientV2.getEntityByAttribute("my_table", queryAttributes);
			testTableTarget = extInfo.getEntity();
		} catch (AtlasServiceException e) {
			if (ClientResponse.Status.NOT_FOUND.equals(e.getStatus())) {
				AtlasEntity.AtlasEntityWithExtInfo extInfo = new AtlasEntity.AtlasEntityWithExtInfo(testTableTarget);
				EntityMutationResponse response = atlasClientV2.createEntity(extInfo);
				testTableTarget.setGuid(response.getGuidAssignments().values().toArray(new String[0])[0]);
			}
		}

After executing the code, check the type tree again; the entities have been generated:

Click the test_db entity on the right to see its basic information and its relationships, which include the two entities test_table_source and test_table_target:
View the relationship information, including test_table_source and test_table_target:

Create the lineage dependency between test_table_source and test_table_target

As mentioned earlier, we declare that the data of test_table_target comes from test_table_source. A lineage dependency is itself stored as an entity in Atlas, but one whose type inherits from Process; its inputs and outputs attributes are what define the lineage. The JSON is as follows:

{
	"typeName": "Process",
	"attributes": {
		"name": "test_process",
		"qualifiedName": "test_process",
		"description": "test_table_target 的数据来自 test_table_source",
		"inputs": [{
			"typeName": "my_table",
			// the guid of test_table_source, taken from the create-entity response
			"guid": "xxx"
		}],
		"outputs": [{
			"typeName": "my_table",
			// the guid of test_table_target, taken from the create-entity response
			"guid": "xxx"
		}]
	}
}

The code is implemented as follows:

		AtlasEntity lineage = new AtlasEntity();
		// use the built-in Process type to carry the lineage
		lineage.setTypeName(AtlasBaseTypeDef.ATLAS_TYPE_PROCESS);
		// use a fresh attribute map so previously created entities are not mutated
		attributes = new HashMap<>();
		attributes.put("qualifiedName", "test_process");
		attributes.put("name", "test_process");
		attributes.put("description", "test_table_target 的数据来自 test_table_source");
		attributes.put("inputs", getLineAgeInfo(testTableSource));
		attributes.put("outputs", getLineAgeInfo(testTableTarget));
		lineage.setAttributes(attributes);
		queryAttributes.put("qualifiedName", "test_process");
		// optional debug output; SingletonObject.OBJECT_MAPPER is a Jackson ObjectMapper helper from the author's project
		System.out.println(SingletonObject.OBJECT_MAPPER.writeValueAsString(lineage));
		try {
			// check whether the process entity already exists
			atlasClientV2.getEntityByAttribute(AtlasBaseTypeDef.ATLAS_TYPE_PROCESS, queryAttributes);
		} catch (AtlasServiceException e) {
			if (ClientResponse.Status.NOT_FOUND.equals(e.getStatus())) {
				// create it
				AtlasEntity.AtlasEntityWithExtInfo extInfo = new AtlasEntity.AtlasEntityWithExtInfo(lineage);
				atlasClientV2.createEntity(extInfo);
			}
		}

	// build the value of the inputs/outputs attributes
	private static List<Map<String, String>> getLineAgeInfo(AtlasEntity entity) {
		List<Map<String, String>> list = new ArrayList<>();
		Map<String, String> map = new HashMap<>();
		map.put("guid", entity.getGuid());
		map.put("typeName", entity.getTypeName());
		list.add(map);
		return list;
	}

Run the code above, then open the home page, click test_table_source under my_table, and check the Lineage tab. The lineage has been built successfully:
At this point, we have finished defining our own model, creating entities, and building lineage through the Atlas REST API.

Metadata management and data governance remain hot topics today. They help us better manage enterprise data assets, analyze data more effectively, and provide solid support for business decisions.

Reference links:
1. https://blog.csdn.net/hshudoudou/article/details/123899947
2. https://blog.csdn.net/javaThanksgiving/article/details/130505251
3. Original article: https://blog.csdn.net/m0_37719874/article/details/124245209

Origin: blog.csdn.net/qq_44787816/article/details/133784054