Data warehouse metadata management system (data asset management): Apache Atlas

1. If a company has no such management system, how does it manage its data assets?

Typically with Word documents or Excel spreadsheets that record which fields each table has, when it was created, who the author is, what the table is for, and so on.

These methods are very primitive, and managing data this way is highly inefficient.

What is really needed is a piece of software that manages these data assets.

Large companies generally develop such a system themselves, typically as a Java EE program.

2. Data warehouse metadata management system: a system that manages the descriptive information of the various data assets (tables, databases, directories, etc.) in the data warehouse.

Apache has now open-sourced a general-purpose data warehouse metadata management system: Atlas.

Atlas mainly stores the descriptive information about those data assets.

Atlas itself is a Java web (Java EE) program.

3. Atlas architecture:

1. Atlas's underlying storage: the JanusGraph graph database (which in turn depends on HBase and Solr).

2. Atlas's core functional layer (Core): equivalent to the service layer in a Java EE application.

The core layer contains ingest and export functions:

  • Ingest takes metadata in from the outside and records it in Atlas's storage.
  • Export exports the metadata stored inside Atlas.

3. Atlas's external service layer (API layer): Integration, equivalent to the controller layer in a Java EE application.

It exposes an address to the outside world; by sending HTTP requests to that address, you can use Atlas's functionality.
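
For example, here is a minimal sketch of calling the Atlas v2 REST API with Java's built-in HTTP client, listing the registered type definitions. The host name is an assumption; port 21000 and the admin/admin credentials are only Atlas's defaults and should be adjusted for a real deployment:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Base64;

    public class AtlasApiDemo {
        public static void main(String[] args) throws Exception {
            // Atlas protects its API with basic auth; admin/admin is only the default.
            String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());
            // List the type definitions registered in Atlas (tables, DBs, processes, ...).
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://atlas-host:21000/api/atlas/v2/types/typedefs"))
                    .header("Authorization", "Basic " + auth)
                    .GET()
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body()); // JSON with entity and classification types
        }
    }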

4. To inject metadata into Atlas, external systems connect through Kafka. The ingest component consumes the messages from Kafka and stores the metadata in the graph database's format; the API layer can then query it from the graph database.
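
As a sketch of that plumbing only: the snippet below produces one message to ATLAS_HOOK, the Kafka topic that Atlas's ingest side consumes. The broker address is an assumption, and a real message must follow Atlas's hook-notification JSON format, which is elided here:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class AtlasHookTopicSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka-host:9092"); // assumed broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Hooks publish to the ATLAS_HOOK topic; Atlas's ingest component consumes
                // it and writes the metadata into the graph store. The payload must follow
                // Atlas's hook-notification JSON format, elided here.
                String notificationJson = "...";
                producer.send(new ProducerRecord<>("ATLAS_HOOK", notificationJson));
            }
        }
    }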

4. Atlas only manages metadata; it does not operate on or run computations over the data in Hive itself.

5. Atlas features:

1. Data classification management (classifications)

There are many tables in Hive. On Atlas you can define several categories, such as app event-tracking log data or WeChat mini-program event-tracking log data, and assign each table to a category. Then, by clicking on a category, you can display all the tables in that category.
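
Assuming a classification named app_event_log has already been defined, here is a sketch of attaching it to a table entity through the v2 REST API (the GUID, host, and classification name are hypothetical):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Base64;

    public class TagHiveTableSketch {
        public static void main(String[] args) throws Exception {
            String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());
            String guid = "...";  // hypothetical GUID of a hive_table entity in Atlas
            // Attach an existing classification to the entity identified by the GUID.
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://atlas-host:21000/api/atlas/v2/entity/guid/"
                            + guid + "/classifications"))
                    .header("Authorization", "Basic " + auth)
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString("[{\"typeName\":\"app_event_log\"}]"))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode()); // 2xx on success
        }
    }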

2. Audit

Atlas can capture the operations you perform on those data assets.

When you insert into, create, or drop a table in Hive, Atlas can capture these actions and turn them into records.

Later you can run audits on this history, for example checking which user performed which operations on a given table, when, and with what details.
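
The v2 REST API also exposes these audit records per entity. The sketch below assumes an /api/atlas/v2/entity/{guid}/audit endpoint and invents the GUID and host:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Base64;

    public class EntityAuditSketch {
        public static void main(String[] args) throws Exception {
            String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());
            String guid = "...";  // hypothetical GUID of the table whose history you want
            // Fetch the audit events recorded for this entity.
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://atlas-host:21000/api/atlas/v2/entity/"
                            + guid + "/audit"))
                    .header("Authorization", "Basic " + auth)
                    .GET()
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body()); // JSON list of audit events (who/when/what)
        }
    }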

3. Searching your data assets (search)

You can search Hive tables, Hive databases, Hive processing jobs, Kafka topics, relational database tables, descriptions of Sqoop extraction processes, HDFS directories, and so on.

In every case, what is searched is the descriptive metadata, not the data itself.

You can quickly search for relevant metadata based on categories, keywords, data asset types, etc.
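
A sketch of such a search through the v2 basic-search endpoint, here looking for hive_table entities carrying a hypothetical classification (host and credentials are assumptions, as before):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Base64;

    public class BasicSearchSketch {
        public static void main(String[] args) throws Exception {
            String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());
            // Find all hive_table entities carrying the app_event_log classification.
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://atlas-host:21000/api/atlas/v2/search/basic"
                            + "?typeName=hive_table&classification=app_event_log"))
                    .header("Authorization", "Basic " + auth)
                    .GET()
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body()); // JSON with the matching entities
        }
    }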

4. Data lineage (lineage)

It is very convenient to inspect the lineage of data assets.

For example: where does table a come from? From the calculation process of an insert. Where does that insert into read from? From table b. Where does table b come from? It was created with a create external... statement, and its data came from a load data... operation, which loaded files from some directory in HDFS. In the other direction, you can see which tables were derived from table a, and through which statements.

It is shown in the form of a graph.
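
That graph can also be retrieved programmatically through the v2 lineage endpoint for a given entity GUID (GUID and host are hypothetical; direction can be INPUT, OUTPUT, or BOTH):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Base64;

    public class LineageSketch {
        public static void main(String[] args) throws Exception {
            String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());
            String guid = "...";  // hypothetical GUID of table a
            // Walk the lineage graph upstream and downstream from the entity, 3 hops deep.
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://atlas-host:21000/api/atlas/v2/lineage/"
                            + guid + "?direction=BOTH&depth=3"))
                    .header("Authorization", "Basic " + auth)
                    .GET()
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body()); // nodes and edges of the lineage graph
        }
    }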

6. How does Atlas know which tables exist in Hive?

The metadata in Atlas is not there from the start; it has to be injected from the outside. You can call Atlas's API to pass it in.

In other words, we could write our own program with a user interface for entering metadata about data assets, and have it call the Atlas API to pass that metadata to Atlas for storage.
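
Such a program would ultimately issue a call like the following sketch, which creates an entity through the v2 REST API. The hdfs_path attribute set shown is illustrative only; which attributes are mandatory is dictated by the type definition registered in Atlas, and the host and values are assumptions:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Base64;

    public class CreateEntitySketch {
        public static void main(String[] args) throws Exception {
            String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());
            // Illustrative hdfs_path entity; attribute names/values are assumptions.
            String body = "{\"entity\": {\"typeName\": \"hdfs_path\", \"attributes\": {"
                    + "\"qualifiedName\": \"/data/logs@primary\","
                    + "\"name\": \"logs\","
                    + "\"path\": \"/data/logs\"}}}";
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://atlas-host:21000/api/atlas/v2/entity"))
                    .header("Authorization", "Basic " + auth)
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body()); // mutation response including the new GUID
        }
    }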

Done that way, though, the company's development workload would be too large; it would almost be better to skip Atlas and develop a complete metadata management system from scratch.

Fortunately, Atlas provides hook programs for the components of the Hadoop ecosystem, which automatically detect the descriptive metadata inside each component and pass it on to Atlas.

For example, Atlas ships a hook for Hive.

This hook automatically detects operations in Hive, builds the corresponding metadata, and passes it to Atlas.
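
Enabling the Hive hook is mostly a configuration step. A sketch of the relevant settings: org.apache.atlas.hive.hook.HiveHook is the hook class Atlas documents for Hive, while the broker address is an assumption and property names can vary between Atlas versions.

In hive-site.xml:

    <property>
        <name>hive.exec.post.hooks</name>
        <value>org.apache.atlas.hive.hook.HiveHook</value>
    </property>

In atlas-application.properties on the Hive side:

    # Kafka brokers that carry the ATLAS_HOOK topic (assumed address)
    atlas.kafka.bootstrap.servers=kafka-host:9092
    # Cluster name used in qualified names such as default.t1@primary
    atlas.cluster.name=primary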
