Metadata Management - Introduction and Use of Atlas (Integrating Hive, Solr, Kafka, Kerberos)

Overview

Introduction

Apache Atlas provides open metadata management and governance capabilities for organizations to build catalogs of their data assets, classify and manage these assets, and provide data analysts and data governance teams with collaboration capabilities around these data assets.

Managing this data well with plain text and documents alone is not enough; you need graphs. Atlas is a tool for turning metadata into a graph.

The specific functions of Atlas are as follows:

  • Metadata classification: supports classified management of metadata, such as personal information, sensitive information, etc.
  • Metadata retrieval: metadata can be retrieved by type and by classification, and full-text search is supported
  • Lineage: supports table-to-table and field-to-field lineage, which is convenient for problem backtracking, impact analysis, etc.

(1) Lineage dependencies between tables

(2) Lineage dependencies between fields

Architecture evolution

(1) The figure below describes the first-generation metadata architecture. It is usually a classic monolithic frontend (often a Flask application) connected to a primary store for queries (usually MySQL/Postgres), with a search index serving search queries (usually Elasticsearch). In a "generation 1.5" of this architecture, once the recursive-query limits of relational databases are reached, a graph index (typically Neo4j) is added to handle lineage and graph queries.

(2) Soon afterwards, the second-generation architecture appeared. The monolithic application is split into services that sit in front of the metadata store database, and a service exposes an API that allows metadata to be written to the system through a push mechanism.

(3) The third-generation architecture is an event-based metadata management architecture in which consumers can interact with the metadata database in different ways according to their needs: low-latency lookup of metadata, full-text and ranked search over metadata attributes, graph queries over metadata relationships, and full scans for analysis.

Apache Atlas uses this architecture and is tightly coupled with the Hadoop ecosystem.

Architecture principle

Atlas includes the following components:

  • HBase is used to store metadata
  • Solr is used for indexing
  • The Ingest/Export components, the Type System, and the Graph Engine together form the core of Atlas
  • All functionality is exposed to users through an API, and integration is also possible through the Kafka messaging system
  • Atlas supports obtaining metadata from various sources: Hive, Sqoop, Storm, etc.
  • It also provides good UI support

(1) Core layer

Atlas core consists of the following components:

  • Ingest/Export : The ingest component allows metadata to be added to Atlas. Likewise, the Export component exposes metadata changes detected by Atlas as events. Consumers can use these change events to respond to metadata changes in real time.

  • Type system : Atlas allows users to define models for the metadata objects they want to manage. The model consists of definitions called "types". Instances of "Type" called "Entities" represent the actual metadata objects that are managed. Type System is a component that allows users to define and manage types and entities. All metadata objects managed by Atlas out of the box, such as Hive tables, are modeled using types and represented as entities. To store new types of metadata in Atlas requires an understanding of the concept of type system components.

    A key point to note is that the generic nature of modeling in Atlas allows data stewards and integrators to define both technical and business metadata. It is also possible to define rich relationships between the two using the capabilities of Atlas.

  • Graph Engine : Atlas internally uses a graph model to persist the metadata objects it manages. This approach provides great flexibility and handles rich relationships between metadata objects efficiently. The graph engine component is responsible for converting between the types and entities of the Atlas type system and the underlying graph persistence model. In addition to managing graph objects, the graph engine also creates the appropriate indexes for metadata objects so that they can be searched efficiently. Atlas uses JanusGraph to store metadata objects.

(2) Integration layer

In Atlas, users can manage metadata in the following two ways:

  • API − All functionality of Atlas is exposed to end-users through a REST API that allows creating, updating and deleting types and entities. It is also the primary mechanism for querying and discovering the types and entities managed by Atlas.
  • Messaging : In addition to the API, users can choose to integrate with Atlas using a Kafka-based messaging interface. This is useful both for pushing metadata objects into Atlas and for consuming the metadata change events published by Atlas in order to build downstream applications. The messaging interface is especially useful if you wish to use a more loosely coupled integration with Atlas for better scalability, reliability, etc. Atlas uses Apache Kafka as the notification server for communication between hooks and downstream consumers of metadata notification events. Events are written to different Kafka topics by the hooks and by Atlas, as illustrated below.
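For example, a downstream consumer can simply tail the default entity-notification topic. A minimal sketch, assuming the hadoop102 broker configured later in this article and Kafka's console consumer on the PATH:

kafka-console-consumer.sh --bootstrap-server hadoop102:9092 --topic ATLAS_ENTITIES --from-beginning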

(3) Metadata sources layer

Atlas supports out-of-the-box integration of multiple metadata sources. More integrations will be added in the future. Currently, Atlas supports extracting and managing metadata from the following sources:

  • HBase
  • Hive
  • Sqoop
  • Storm
  • Kafka

Integration means two things: the metadata models that Atlas defines to represent the objects of these components, and the components Atlas provides to ingest metadata objects from these sources (either in real time or, in some cases, in batch mode).

(4) Applications layer

Metadata managed by Atlas is used by various applications to satisfy many governance needs.

Atlas Admin UI − This component is a web-based application that allows data stewards and scientists to discover and annotate metadata. Most important here are the search interface and SQL-like query language that can be used to query the metadata types and objects managed by Atlas. Admin UI uses Atlas' REST API to build its functionality.

Tag Based Policies : Apache Ranger is an advanced security management solution for the Hadoop ecosystem that provides extensive integration with various Hadoop components. By integrating with Atlas, Ranger allows security administrators to define metadata-driven security policies for effective governance. Ranger is a consumer of metadata change events notified by Atlas.

Introduction to the type system

Atlas allows users to define models for the metadata objects they want to manage. The model consists of definitions called "types". Instances of types, called "entities", represent the actual metadata objects that are managed. The Type System is the component that allows users to define and manage types and entities. All metadata objects managed by Atlas out of the box, such as Hive tables, are modeled using types and represented as entities. Storing new kinds of metadata in Atlas requires an understanding of the type system's concepts.

Types

A "type" in Atlas defines how a specific type of metadata object is stored and accessed. A type represents a collection of one or more properties of a defined metadata object. Users with a development background can understand a "type" as defined by a "class" in an object-oriented programming language or a "table schema" in a relational database.

  • All Atlas type definitions can be obtained through this API: http://atlas:21000/api/atlas/v2/types/typedefs
  • The definition of the hive_table type can be obtained through this API: http://atlas:21000/api/atlas/v2/types/typedef/name/hive_table
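For example, with curl (a sketch; admin/admin are the default file-based credentials, and the host name atlas matches the URLs above):

# List all type definitions
curl -s -u admin:admin "http://atlas:21000/api/atlas/v2/types/typedefs"

# Fetch only the hive_table type definition
curl -s -u admin:admin "http://atlas:21000/api/atlas/v2/types/typedef/name/hive_table"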

Introduction to hive_table type

One example of a type that ships with Atlas is the Hive table. A Hive table is defined with the following attributes:

Name:         hive_table
TypeCategory: Entity
SuperTypes:   DataSet
Attributes:
    name:             string
    db:               hive_db
    owner:            string
    createTime:       date
    lastAccessTime:   date
    comment:          string
    retention:        int
    sd:               hive_storagedesc
    partitionKeys:    array<hive_column>
    aliases:          array<string>
    columns:          array<hive_column>
    parameters:       map<string,string>
    viewOriginalText: string
    viewExpandedText: string
    tableType:        string
    temporary:        boolean

The full type definition, as returned by the REST API, is:

{
    "category": "ENTITY",
    "guid": "30a12b7c-faed-4ead-ad83-868893ebed93",
    "createdBy": "cloudera-scm",
    "updatedBy": "cloudera-scm",
    "createTime": 1536203750750,
    "updateTime": 1536203750750,
    "version": 1,
    "name": "hive_table",
    "description": "hive_table",
    "typeVersion": "1.1",
    "options": {
        "schemaElementsAttribute": "columns"
    },
    "attributeDefs": [
        {
            "name": "db",
            "typeName": "hive_db",
            "isOptional": false,
            "cardinality": "SINGLE",
            "valuesMinCount": 1,
            "valuesMaxCount": 1,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false
        },
        {
            "name": "createTime",
            "typeName": "date",
            "isOptional": true,
            "cardinality": "SINGLE",
            "valuesMinCount": 0,
            "valuesMaxCount": 1,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false
        },
        {
            "name": "lastAccessTime",
            "typeName": "date",
            "isOptional": true,
            "cardinality": "SINGLE",
            "valuesMinCount": 0,
            "valuesMaxCount": 1,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false
        },
        {
            "name": "comment",
            "typeName": "string",
            "isOptional": true,
            "cardinality": "SINGLE",
            "valuesMinCount": 0,
            "valuesMaxCount": 1,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false
        },
        {
            "name": "retention",
            "typeName": "int",
            "isOptional": true,
            "cardinality": "SINGLE",
            "valuesMinCount": 0,
            "valuesMaxCount": 1,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false
        },
        {
            "name": "sd",
            "typeName": "hive_storagedesc",
            "isOptional": true,
            "cardinality": "SINGLE",
            "valuesMinCount": 0,
            "valuesMaxCount": 1,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false,
            "constraints": [
                {
                    "type": "ownedRef"
                }
            ]
        },
        {
            "name": "partitionKeys",
            "typeName": "array<hive_column>",
            "isOptional": true,
            "cardinality": "SET",
            "valuesMinCount": 0,
            "valuesMaxCount": 2147483647,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false,
            "constraints": [
                {
                    "type": "ownedRef"
                }
            ]
        },
        {
            "name": "aliases",
            "typeName": "array<string>",
            "isOptional": true,
            "cardinality": "SET",
            "valuesMinCount": 0,
            "valuesMaxCount": 2147483647,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false
        },
        {
            "name": "columns",
            "typeName": "array<hive_column>",
            "isOptional": true,
            "cardinality": "SET",
            "valuesMinCount": 0,
            "valuesMaxCount": 2147483647,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false,
            "constraints": [
                {
                    "type": "ownedRef"
                }
            ]
        },
        {
            "name": "parameters",
            "typeName": "map<string,string>",
            "isOptional": true,
            "cardinality": "SINGLE",
            "valuesMinCount": 0,
            "valuesMaxCount": 1,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false
        },
        {
            "name": "viewOriginalText",
            "typeName": "string",
            "isOptional": true,
            "cardinality": "SINGLE",
            "valuesMinCount": 0,
            "valuesMaxCount": 1,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false
        },
        {
            "name": "viewExpandedText",
            "typeName": "string",
            "isOptional": true,
            "cardinality": "SINGLE",
            "valuesMinCount": 0,
            "valuesMaxCount": 1,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false
        },
        {
            "name": "tableType",
            "typeName": "string",
            "isOptional": true,
            "cardinality": "SINGLE",
            "valuesMinCount": 0,
            "valuesMaxCount": 1,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false
        },
        {
            "name": "temporary",
            "typeName": "boolean",
            "isOptional": true,
            "cardinality": "SINGLE",
            "valuesMinCount": 0,
            "valuesMaxCount": 1,
            "isUnique": false,
            "isIndexable": true,
            "includeInNotification": false
        }
    ],
    "superTypes": [
        "DataSet"
    ],
    "subTypes": []
}

The following points can be noticed from the above example:

  • Types in Atlas are uniquely identified by name

  • attributeDefs represents the definition of attributes in this type

  • Type has a metatype. There are the following metatypes in Atlas:

    • Primitive metatypes : boolean, byte, short, int, long, float, double, biginteger, bigdecimal, string, date
    • Enum metatypes
    • Collection metatypes : array, map
    • Composite metatypes : Entity, Struct, Classification, Relationship
  • Entity and Classification types can inherit from other types, called "supertypes" (parent types); a subtype includes the attributes defined in its supertypes. This allows modelers to define common attributes across a set of related types, much like object-oriented languages define superclasses for classes. Types in Atlas can also extend multiple supertypes.

    In this example, each hive table extends from a predefined supertype called DataSet. More details about this predefined type are provided later.

  • A type with metatype Entity, Struct, Classification, or Relationship can have a collection of attributes. Each attribute has a name (e.g. name) and some other associated properties. An attribute can be referenced with the expression type_name.attribute_name. It is worth noting that attributes themselves are defined using Atlas metatypes.

    In this example, hive_table.name is a string, hive_table.aliases is an array of strings, hive_table.db references an instance of the hive_db type, and so on.

  • Type references in attributes (such as hive_table.db) are particularly interesting: with such attributes we can define arbitrary relationships between two types defined in Atlas and thus build rich models. In addition, lists of references can be used as attribute types (for example hive_table.columns, which represents a list of references from hive_table to the hive_column type)

DataSet type definition

{
    "category": "ENTITY",
    "guid": "d31c0a02-6999-4f81-a62a-07d7654aec84",
    "createdBy": "cloudera-scm",
    "updatedBy": "cloudera-scm",
    "createTime": 1536203676149,
    "updateTime": 1536203676149,
    "version": 1,
    "name": "DataSet",
    "description": "DataSet",
    "typeVersion": "1.1",
    "attributeDefs": [],
    "superTypes": [
        "Asset"
    ],
    "subTypes": [
        "rdbms_foreign_key",
        "rdbms_db",
        "kafka_topic",
        "hive_table",
        "sqoop_dbdatastore",
        "hbase_column",
        "rdbms_instance",
        "falcon_feed",
        "jms_topic",
        "hbase_table",
        "rdbms_table",
        "rdbms_column",
        "rdbms_index",
        "hbase_column_family",
        "access_info",
        "hive_column",
        "avro_type",
        "fs_path"
    ]
}

You can see that DataSet has many subtypes, and some of the types that come with Atlas are inherited from DataSet.

At the same time, DataSet inherits from Asset, which represents an asset and defines some common attributes.

Asset type definition

{
    "category": "ENTITY",
    "guid": "349a5c61-47c3-4f4b-9a79-7fd59454a73a",
    "createdBy": "cloudera-scm",
    "updatedBy": "cloudera-scm",
    "createTime": 1536203676083,
    "updateTime": 1536203676083,
    "version": 1,
    "name": "Asset",
    "description": "Asset",
    "typeVersion": "1.1",
    "attributeDefs": [
        {
            "name": "name",
            "typeName": "string",
            "isOptional": false,
            "cardinality": "SINGLE",
            "valuesMinCount": 1,
            "valuesMaxCount": 1,
            "isUnique": false,
            "isIndexable": true,
            "includeInNotification": false
        },
        {
            "name": "description",
            "typeName": "string",
            "isOptional": true,
            "cardinality": "SINGLE",
            "valuesMinCount": 0,
            "valuesMaxCount": 1,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false
        },
        {
            "name": "owner",
            "typeName": "string",
            "isOptional": true,
            "cardinality": "SINGLE",
            "valuesMinCount": 0,
            "valuesMaxCount": 1,
            "isUnique": false,
            "isIndexable": true,
            "includeInNotification": false
        }
    ],
    "superTypes": [
        "Referenceable"
    ],
    "subTypes": [
        "rdbms_foreign_key",
        "rdbms_db",
        "rdbms_instance",
        "DataSet",
        "rdbms_table",
        "rdbms_column",
        "rdbms_index",
        "Infrastructure",
        "Process",
        "avro_type",
        "hbase_namespace",
        "hive_db"
    ]
}

Three attributes are defined in the Asset type:

  • name
  • owner
  • description

It also has many subtypes that represent concrete assets, such as the Hive database type hive_db and the HBase namespace type hbase_namespace. Asset itself inherits from the Referenceable type.

Referenceable type definition

{
    "category": "ENTITY",
    "guid": "34c72533-2e80-4e5c-9226-e15b163f98d1",
    "createdBy": "cloudera-scm",
    "updatedBy": "cloudera-scm",
    "createTime": 1536203673540,
    "updateTime": 1536203673540,
    "version": 1,
    "name": "Referenceable",
    "description": "Referenceable",
    "typeVersion": "1.0",
    "attributeDefs": [
        {
            "name": "qualifiedName",
            "typeName": "string",
            "isOptional": false,
            "cardinality": "SINGLE",
            "valuesMinCount": 1,
            "valuesMaxCount": 1,
            "isUnique": true,
            "isIndexable": true,
            "includeInNotification": false
        }
    ],
    "superTypes": [],
    "subTypes": [
        "hive_storagedesc",
        "Asset"
    ]
}

This type defines a very important attribute, qualifiedName, a qualified name that is unique within a given type. Combined with the type name, this attribute can be used to locate the corresponding unique entity in Atlas. Note the difference from guid, which is globally unique.

For example:

  • A qualifiedName of a hive database: test@primary
  • The qualifiedName of the table test_table under the database test: test.test_table@primary
  • The qualifiedName of the field name in the table test_table: test.test_table.name@primary

@primary is the default cluster name. By configuring the cluster name as below, an entity can be uniquely identified across different clusters:

atlas.cluster.name=primary
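Given a qualifiedName, an entity can be looked up through the REST API without knowing its GUID. A minimal sketch, assuming the hadoop102 host used later in this article and the default admin/admin credentials:

curl -s -u admin:admin \
  "http://hadoop102:21000/api/atlas/v2/entity/uniqueAttribute/type/hive_table?attr:qualifiedName=test.test_table@primary"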

Process type definition

{
    "category": "ENTITY",
    "guid": "7c03ccad-29aa-4c5f-8a27-19b536068f69",
    "createdBy": "cloudera-scm",
    "updatedBy": "cloudera-scm",
    "createTime": 1536203677547,
    "updateTime": 1536203677547,
    "version": 1,
    "name": "Process",
    "description": "Process",
    "typeVersion": "1.1",
    "attributeDefs": [
        {
            "name": "inputs",
            "typeName": "array<DataSet>",
            "isOptional": true,
            "cardinality": "SET",
            "valuesMinCount": 0,
            "valuesMaxCount": 2147483647,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false
        },
        {
            "name": "outputs",
            "typeName": "array<DataSet>",
            "isOptional": true,
            "cardinality": "SET",
            "valuesMinCount": 0,
            "valuesMaxCount": 2147483647,
            "isUnique": false,
            "isIndexable": false,
            "includeInNotification": false
        }
    ],
    "superTypes": [
        "Asset"
    ],
    "subTypes": [
        "falcon_feed_replication",
        "falcon_process",
        "falcon_feed_creation",
        "sqoop_process",
        "hive_column_lineage",
        "storm_topology",
        "hive_process"
    ]
}
  • The Process type inherits from the Asset type, so it has the four attributes name, owner, description, and qualifiedName
  • Its own attributes, inputs and outputs, represent the input and output of a process; Process is the supertype of all types used in Atlas lineage management

Conceptually, it can be used to represent any data transformation operation. For example, an ETL process that transforms a hive table of raw data into another hive table storing an aggregate can be modeled as a specific type extending Process. The Process type has two specific attributes, inputs and outputs, both of which are arrays of DataSet entities. An instance of a Process type can therefore use these inputs and outputs to capture how a DataSet's lineage evolves.
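Once Process entities record inputs and outputs, the lineage of any entity can be read back through the lineage REST endpoint. A minimal sketch, assuming the default admin/admin credentials and a placeholder GUID:

# direction can be INPUT, OUTPUT, or BOTH; depth limits how far the lineage graph is traversed
curl -s -u admin:admin \
  "http://hadoop102:21000/api/atlas/v2/lineage/9ba387dd-fa76-429c-b791-ffc338d3c91f?direction=BOTH&depth=3"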

Entities

In Atlas, an "entity" is a specific value or instance of a "type", and thus represents a specific metadata object in the real world. In the object-oriented programming analogy, an entity is an instance (object) of a certain class.

One example of an entity is a Hive table. Hive has a table called 'customers' in the 'default' database. This table is an "entity" of type hive_table in Atlas. Since it is an instance of an entity type, it has a value for every attribute defined in the hive_table type, for example:

guid:     "9ba387dd-fa76-429c-b791-ffc338d3c91f"
typeName: "hive_table"
status:   "ACTIVE"
values:
    name:             "customers"
    db:               { "guid": "b42c6cfc-c1e7-42fd-a9e6-890e0adf33bc", "typeName": "hive_db" }
    owner:            "admin"
    createTime:       1490761686029
    updateTime:       1516298102877
    comment:          null
    retention:        0
    sd:               { "guid": "ff58025f-6854-4195-9f75-3a3058dd8dcf", "typeName": "hive_storagedesc" }
    partitionKeys:    null
    aliases:          null
    columns:          [ { "guid": "65e2204f-6a23-4130-934a-9679af6a211f", "typeName": "hive_column" }, { "guid": "d726de70-faca-46fb-9c99-cf04f6b579a6", "typeName": "hive_column" }, ...]
    parameters:       { "transient_lastDdlTime": "1466403208"}
    viewOriginalText: null
    viewExpandedText: null
    tableType:        "MANAGED_TABLE"
    temporary:        false

The following points can be noticed from the above example:

  • Each instance of an entity type is identified by a unique identifier GUID. This GUID is generated by the Atlas server when the object is defined and remains constant throughout the lifetime of the entity. At any point in time, this particular entity can be accessed using its GUID.

    In this example, the "customers" table in the default database is uniquely identified by the GUID "9ba387dd-fa76-429c-b791-ffc338d3c91f".

  • Entities are of the given type, and the name of the type is provided with the entity definition.

    In this example, the 'customers' table is of type 'hive_table'.

  • The value of this entity is a map of all the attribute names and their values, for the attributes defined in the hive_table type definition.
    The value of each attribute depends on its data type; attributes of an entity type hold a value of type AtlasObjectId.

With this design of entities, we can now see the difference between the Entity and Struct metatypes. Both can be used as attribute types of other types. However, an instance of an Entity type has an identity (a GUID value) and can be referenced from other entities (for example, a hive_db entity referenced from a hive_table entity). An instance of a Struct type has no identity of its own; its value is a collection of attributes "embedded" within the entity itself.
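Entities can also be fetched over the REST API by GUID, together with the entities they refer to. A sketch, assuming the default admin/admin credentials and using the GUID of the hive_table entity in the response shown below:

curl -s -u admin:admin \
  "http://hadoop102:21000/api/atlas/v2/entity/guid/75a0c17a-dfdb-4532-9aae-c87b64be958d"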

{
  "referredEntities" : {
    "779734cc-9011-4066-9bb1-25df6f28ac72" : {
      "typeName" : "hive_column",
      "attributes" : {
        "owner" : "wangjian5185",
        "qualifiedName" : "test.student.age@primary",
        "name" : "age",
        "description" : null,
        "comment" : null,
        "position" : 1,
        "type" : "int",
        "table" : {
          "guid" : "75a0c17a-dfdb-4532-9aae-c87b64be958d",
          "typeName" : "hive_table"
        }
      },
      "guid" : "779734cc-9011-4066-9bb1-25df6f28ac72",
      "status" : "ACTIVE",
      "createdBy" : "admin",
      "updatedBy" : "admin",
      "createTime" : 1536215508751,
      "updateTime" : 1536215508751,
      "version" : 0
    },
    "c47aed54-e4d2-4080-aa7a-5428075f5b20" : {
      "typeName" : "hive_column",
      "attributes" : {
        "owner" : "wangjian5185",
        "qualifiedName" : "test.student.phone@primary",
        "name" : "phone",
        "description" : null,
        "comment" : null,
        "position" : 2,
        "type" : "int",
        "table" : {
          "guid" : "75a0c17a-dfdb-4532-9aae-c87b64be958d",
          "typeName" : "hive_table"
        }
      },
      "guid" : "c47aed54-e4d2-4080-aa7a-5428075f5b20",
      "status" : "ACTIVE",
      "createdBy" : "admin",
      "updatedBy" : "admin",
      "createTime" : 1536215508751,
      "updateTime" : 1536215508751,
      "version" : 0
    },
    "7c4b4cd5-841d-409b-b38a-77ec8779e252" : {
      "typeName" : "hive_column",
      "attributes" : {
        "owner" : "wangjian5185",
        "qualifiedName" : "test.student.name@primary",
        "name" : "name",
        "description" : null,
        "comment" : null,
        "position" : 0,
        "type" : "string",
        "table" : {
          "guid" : "75a0c17a-dfdb-4532-9aae-c87b64be958d",
          "typeName" : "hive_table"
        }
      },
      "guid" : "7c4b4cd5-841d-409b-b38a-77ec8779e252",
      "status" : "ACTIVE",
      "createdBy" : "admin",
      "updatedBy" : "admin",
      "createTime" : 1536215508751,
      "updateTime" : 1536215508751,
      "version" : 0
    },
    "a6038b00-ce2d-4612-9436-d63092d09182" : {
      "typeName" : "hive_storagedesc",
      "attributes" : {
        "bucketCols" : null,
        "qualifiedName" : "test.student@primary_storage",
        "sortCols" : null,
        "storedAsSubDirectories" : false,
        "location" : "hdfs://cdhtest/user/hive/warehouse/test.db/student",
        "compressed" : false,
        "inputFormat" : "org.apache.hadoop.mapred.TextInputFormat",
        "outputFormat" : "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        "parameters" : null,
        "table" : {
          "guid" : "75a0c17a-dfdb-4532-9aae-c87b64be958d",
          "typeName" : "hive_table"
        },
        "serdeInfo" : {
          "typeName" : "hive_serde",
          "attributes" : {
            "serializationLib" : "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
            "name" : null,
            "parameters" : null
          }
        },
        "numBuckets" : 0
      },
      "guid" : "a6038b00-ce2d-4612-9436-d63092d09182",
      "status" : "ACTIVE",
      "createdBy" : "admin",
      "updatedBy" : "admin",
      "createTime" : 1536215508751,
      "updateTime" : 1536215508751,
      "version" : 0
    }
  },
  "entity" : {
    "typeName" : "hive_table",
    "attributes" : {
      "owner" : "wangjian5185",
      "temporary" : false,
      "lastAccessTime" : 1533807252000,
      "aliases" : null,
      "qualifiedName" : "test.student@primary",
      "columns" : [ {
        "guid" : "7c4b4cd5-841d-409b-b38a-77ec8779e252",
        "typeName" : "hive_column"
      }, {
        "guid" : "779734cc-9011-4066-9bb1-25df6f28ac72",
        "typeName" : "hive_column"
      }, {
        "guid" : "c47aed54-e4d2-4080-aa7a-5428075f5b20",
        "typeName" : "hive_column"
      } ],
      "description" : null,
      "viewExpandedText" : null,
      "sd" : {
        "guid" : "a6038b00-ce2d-4612-9436-d63092d09182",
        "typeName" : "hive_storagedesc"
      },
      "tableType" : "MANAGED_TABLE",
      "createTime" : 1533807252000,
      "name" : "student",
      "comment" : null,
      "partitionKeys" : null,
      "parameters" : {
        "transient_lastDdlTime" : "1533807252"
      },
      "db" : {
        "guid" : "a804165e-77ff-4c60-9ee7-956760577a1e",
        "typeName" : "hive_db"
      },
      "retention" : 0,
      "viewOriginalText" : null
    },
    "guid" : "75a0c17a-dfdb-4532-9aae-c87b64be958d",
    "status" : "ACTIVE",
    "createdBy" : "admin",
    "updatedBy" : "admin",
    "createTime" : 1536215508751,
    "updateTime" : 1536281983181,
    "version" : 0
  }
}

The table above, student, contains 3 columns: name, age, and phone.

  • The referredEntities section contains the entity objects referenced by the table, i.e. the 3 column entities and one hive_storagedesc entity (describing the table's storage); within the table entity they are referenced by their GUIDs
  • The entity section contains the information about the table student: name, owner, guid, status, attributes, etc.
  • Each entity that is an instance of an entity type is identified by a unique identifier, a GUID. This GUID is generated by the Atlas server when the object is defined and remains constant throughout the life of the entity. At any point in time, that particular entity can be accessed using its GUID.

Attributes

We have seen that attributes are defined within metatypes such as Entity, Struct, Classification, and Relationship, and so far we have described an attribute only as having a name and a type. However, attributes in Atlas have additional properties that define further concepts related to the type system.

Attributes have the following properties:

    name:        string,
    typeName:    string,
    isOptional:  boolean,
    isIndexable: boolean,
    isUnique:    boolean,
    cardinality: enum

For example, an excerpt of the hive_table attribute definitions illustrates these properties:

"attributeDefs" : [ {
    "name" : "db",
    "typeName" : "hive_db",
    "isOptional" : false,   //请注意“isOptional = false”约束 - 如果没有db引用,则无法创建表实体。
    "cardinality" : "SINGLE",
    "valuesMinCount" : 1,
    "valuesMaxCount" : 1,
    "isUnique" : false,
    "isIndexable" : false
  }, {
    "name" : "createTime",
    "typeName" : "date",
    "isOptional" : true,
    "cardinality" : "SINGLE",
    "valuesMinCount" : 0,
    "valuesMaxCount" : 1,
    "isUnique" : false,
    "isIndexable" : false
  }, {
    "name" : "columns",
    "typeName" : "array<hive_column>",
    "isOptional" : true,
    "cardinality" : "SET",
    "valuesMinCount" : 0,
    "valuesMaxCount" : 2147483647,
    "isUnique" : false,
    "isIndexable" : false,
    "constraints" : [ {
      "type" : "ownedRef"    //请注意列的“ownedRef”约束。通过这样,我们指出定义的列实体应始终绑定到它们所定义的表实体。
    } ]
  } ]

The above properties have the following meanings:

  • name: the name of the attribute

  • typeName: the attribute's type, which may be a primitive type (including date), another defined type, or a collection type

  • isOptional: whether the attribute is optional; false means the attribute must be specified

  • cardinality: one of SINGLE (single value), LIST (multiple values, duplicates allowed), or SET (multiple values, no duplicates)

  • valuesMinCount: The minimum number of attributes

  • valuesMaxCount: The maximum number of attributes

  • isComposite:

    • This flag represents an aspect of modeling. If a property is defined as composite, it means that it cannot have a life cycle independent of the entity it contains. A good example of this concept is the set of columns that form part of a hive table. Since the columns have no meaning outside the hive table, they are defined as composite attributes.
    • Composite attributes and the entities that contain them must be created together in Atlas; that is, a hive column must be created along with its hive table.
  • isIndexable: a flag indicating whether this attribute should be indexed, so that lookups using the attribute value as a predicate can be performed efficiently.

  • isUnique

    • Also related to indexing. If specified as unique, it means that a special index has been created for this property in JanusGraph, allowing equality-based lookups.
    • Any attribute with a true value for this flag is considered a primary key to distinguish this entity from other entities. Therefore, care should be taken to ensure that this property does indeed model a unique property in the real world.
    • For example, consider the name attribute of hive_table. On its own, name is not unique for hive_table, because tables with the same name can exist in multiple databases. Even the pair (database name, table name) is not unique if Atlas stores metadata of Hive tables from multiple clusters. Only the combination of cluster location, database name, and table name can be considered unique in the physical world.
  • multiplicity: Indicates whether the attribute is required, optional, or multi-valued. If the entity's property value definition does not match the multiplicity declaration in the type definition, then this will violate the constraint and the entity addition will fail. Therefore, this field can be used to define some constraints on metadata information.

  • constraints: constraints on this attribute, such as ownedRef; presumably this can be used to achieve a function similar to a foreign key in MySQL.

Based on the above, look back at the 'db' attribute in the excerpt shown earlier, which indicates the database a Hive table belongs to: its isOptional is false and its cardinality is SINGLE, so a table entity cannot be created without a database reference. Custom types with such attribute definitions can be registered through the REST API.
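The sketch below registers such a custom type through the typedefs API; the type name demo_dataset and the attribute sourceSystem are made up for illustration, and admin/admin are the default file-based credentials:

curl -s -u admin:admin -X POST -H "Content-Type: application/json" \
  "http://hadoop102:21000/api/atlas/v2/types/typedefs" \
  -d '{
    "entityDefs": [ {
      "name": "demo_dataset",
      "superTypes": [ "DataSet" ],
      "attributeDefs": [ {
        "name": "sourceSystem",
        "typeName": "string",
        "isOptional": true,
        "cardinality": "SINGLE",
        "isUnique": false,
        "isIndexable": true
      } ]
    } ]
  }'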

Install

(1) Atlas official website address: https://atlas.apache.org/

(2) Document viewing address: https://atlas.apache.org/2.1.0/index.html

(3) Download address: https://www.apache.org/dyn/closer.cgi/atlas/2.1.0/apache-atlas-2.1.0-sources.tar.gz

Installation environment preparation

Atlas installation comes in two flavors: using the built-in HBase + Solr, or integrating an external HBase + Solr. Enterprise deployments usually choose to integrate an external HBase + Solr, which makes it easier to integrate with the project as a whole.

The following is the environment and cluster planning that Atlas relies on. This article only contains the installation guides for Solr and Atlas, please refer to the previous chapters for the installation of other dependent services.

The dependent services are deployed across three servers, hadoop102, hadoop103, and hadoop104, and include: JDK; Zookeeper (QuorumPeerMain); Kafka; HBase (HMaster, HRegionServer); Solr; Hive; and Atlas. The per-node service totals are 13, 7, and 7 respectively.

Install Solr-7.7.3

(1) Create system user solr on each node

[root@hadoop102 ~]# useradd solr
[root@hadoop102 ~]# echo solr | passwd --stdin solr

[root@hadoop103 ~]# useradd solr
[root@hadoop103 ~]# echo solr | passwd --stdin solr

[root@hadoop104 ~]# useradd solr
[root@hadoop104 ~]# echo solr | passwd --stdin solr

(2) Unzip solr-7.7.3.tgz to the /opt/module directory and rename it to solr

[root@hadoop102 ~]# tar -zxvf solr-7.7.3.tgz -C /opt/module/
[root@hadoop102 ~]# mv /opt/module/solr-7.7.3/ /opt/module/solr

(3) Modify the owner of the solr directory to be the solr user

[root@hadoop102 ~]# chown -R solr:solr /opt/module/solr

(4) Modify the solr configuration file

Modify the following properties in the /opt/module/solr/bin/solr.in.sh file:

ZK_HOST="hadoop102:2181,hadoop103:2181,hadoop104:2181"

(5) Distribute solr

[root@hadoop102 ~]# xsync /opt/module/solr

(6) Start the solr cluster

  • Start the Zookeeper cluster

    [root@hadoop102 ~]# zk.sh start
    
  • Start the solr cluster

    For security reasons, it is not recommended to start Solr as the root user. Instead, execute the following command as the solr user on every node to start the Solr cluster:

    [root@hadoop102 ~]# sudo -i -u solr /opt/module/solr/bin/solr start
    [root@hadoop103 ~]# sudo -i -u solr /opt/module/solr/bin/solr start
    [root@hadoop104 ~]# sudo -i -u solr /opt/module/solr/bin/solr start
    

    The words "Happy Searching! " appear to indicate that the startup is successful.

    Explanation: The above warning content is: the maximum number of processes and the maximum number of open files allowed by the solr recommended system are 65000 and 65000 respectively, and the system default value is lower than the recommended value. If you need to modify, please refer to the following steps. After the modification, you need to restart to take effect. You can not modify it here.

    (1) Increase the open-file limit
    Edit /etc/security/limits.conf and add the following:
    * soft nofile 65000
    * hard nofile 65000
    
    (2) Increase the process limit
    Edit /etc/security/limits.d/20-nproc.conf and add the following:
    * soft nproc 65000
    
    (3) Reboot the server
    

(7) Visit the web page

The default port is 8983; you can access any one of the three nodes, e.g. http://hadoop102:8983

Tip: SolrCloud mode is deployed successfully only if the Cloud menu appears in the web UI.

Install Atlas 2.1.0

(1) Upload apache-atlas-2.1.0-server.tar.gz to the /opt/software directory of hadoop102

(2) Unzip apache-atlas-2.1.0-server.tar.gz to the /opt/module/ directory

[root@hadoop102 software]# tar -zxvf apache-atlas-2.1.0-server.tar.gz -C /opt/module/

(3) Modify the name of apache-atlas-2.1.0 to atlas

[root@hadoop102 ~]# mv /opt/module/apache-atlas-2.1.0 /opt/module/atlas

Atlas configuration

Atlas integrates Hbase

(1) Modify the following parameters in the /opt/module/atlas/conf/atlas-application.properties configuration file

atlas.graph.storage.hostname=hadoop102:2181,hadoop103:2181,hadoop104:2181

(2) Modify the /opt/module/atlas/conf/atlas-env.sh configuration file and add the following content

export HBASE_CONF_DIR=/opt/module/hbase/conf

Atlas integrates Solr

(1) Modify the following parameters in the /opt/module/atlas/conf/atlas-application.properties configuration file

atlas.graph.index.search.backend=solr
atlas.graph.index.search.solr.mode=cloud
atlas.graph.index.search.solr.zookeeper-url=hadoop102:2181,hadoop103:2181,hadoop104:2181

(2) Create a solr collection

[root@hadoop102 ~]# sudo -i -u solr /opt/module/solr/bin/solr create  -c vertex_index -d /opt/module/atlas/conf/solr -shards 3 -replicationFactor 2
[root@hadoop102 ~]# sudo -i -u solr /opt/module/solr/bin/solr create -c edge_index -d /opt/module/atlas/conf/solr -shards 3 -replicationFactor 2
[root@hadoop102 ~]# sudo -i -u solr /opt/module/solr/bin/solr create -c fulltext_index -d /opt/module/atlas/conf/solr -shards 3 -replicationFactor 2
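You can verify that the three collections were created with the Solr Collections API (any of the three nodes works):

[root@hadoop102 ~]# curl -s "http://hadoop102:8983/solr/admin/collections?action=LIST"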

Atlas integrates Kafka

Modify the following parameters in the /opt/module/atlas/conf/atlas-application.properties configuration file

atlas.notification.embedded=false
atlas.kafka.data=/opt/module/kafka/data
atlas.kafka.zookeeper.connect= hadoop102:2181,hadoop103:2181,hadoop104:2181/kafka
atlas.kafka.bootstrap.servers=hadoop102:9092,hadoop103:9092,hadoop104:9092
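With embedded notification disabled, Atlas relies on this external Kafka cluster for its ATLAS_HOOK and ATLAS_ENTITIES notification topics. After Atlas has started, their presence can be checked with the Kafka CLI (a sketch, assuming kafka-topics.sh is on the PATH):

[root@hadoop102 ~]# kafka-topics.sh --bootstrap-server hadoop102:9092 --list | grep ATLAS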

Atlas Server configuration

(1) Modify the following parameters in the /opt/module/atlas/conf/atlas-application.properties configuration file

#########  Server Properties  #########
atlas.rest.address=http://hadoop102:21000
# If enabled and set to true, this will run setup steps when the server starts
atlas.server.run.setup.on.start=false

#########  Entity Audit Configs  #########
atlas.audit.hbase.zookeeper.quorum=hadoop102:2181,hadoop103:2181,hadoop104:2181

(2) To record performance metrics, enter the /opt/module/atlas/conf/ path and modify atlas-log4j.xml in that directory

[root@hadoop102 conf]# vim atlas-log4j.xml

# Uncomment the following block
<appender name="perf_appender" class="org.apache.log4j.DailyRollingFileAppender">
    <param name="file" value="${atlas.log.dir}/atlas_perf.log" />
    <param name="datePattern" value="'.'yyyy-MM-dd" />
    <param name="append" value="true" />
    <layout class="org.apache.log4j.PatternLayout">
        <param name="ConversionPattern" value="%d|%t|%m%n" />
    </layout>
</appender>

<logger name="org.apache.atlas.perf" additivity="false">
    <level value="debug" />
    <appender-ref ref="perf_appender" />
</logger>

Kerberos related configuration

If Kerberos authentication is enabled in the Hadoop cluster, Kerberos authentication must be performed before Atlas interacts with the Hadoop cluster. If Kerberos authentication is not enabled in the Hadoop cluster, skip this section.

(1) Create a Kerberos principal for Atlas and generate a keytab file

[root@hadoop102 ~]# kadmin -padmin/admin -wadmin -q"addprinc -randkey atlas/hadoop102"
[root@hadoop102 ~]# kadmin -padmin/admin -wadmin -q"xst -k /etc/security/keytab/atlas.service.keytab atlas/hadoop102"

(2) Modify the /opt/module/atlas/conf/atlas-application.properties configuration file and add the following parameters

atlas.authentication.method=kerberos
atlas.authentication.principal=atlas/hadoop102@EXAMPLE.COM
atlas.authentication.keytab=/etc/security/keytab/atlas.service.keytab
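Before starting Atlas, it is worth verifying that the principal and keytab work. A sketch, where the realm EXAMPLE.COM is an assumption and should be replaced with the realm of your KDC:

[root@hadoop102 ~]# kinit -kt /etc/security/keytab/atlas.service.keytab atlas/hadoop102@EXAMPLE.COM
[root@hadoop102 ~]# klist
[root@hadoop102 ~]# kdestroy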

Atlas integrates with Hive

(1) Install Hive Hook

  • Unzip the Hive Hook

    [root@hadoop102 ~]# tar -zxvf apache-atlas-2.1.0-hive-hook.tar.gz
    
  • Copy the Hive Hook dependencies to the Atlas installation path

    [root@hadoop102 ~]# cp -r apache-atlas-hive-hook-2.1.0/* /opt/module/atlas/
    
  • Modify the /opt/module/hive/conf/hive-env.sh configuration file

    Note: You need to change the file name first

    [root@hadoop102 ~]# mv hive-env.sh.template hive-env.sh
    

    Add the following parameters

    export HIVE_AUX_JARS_PATH=/opt/module/atlas/hook/hive
    

(2) Modify the Hive configuration file, add the following parameters in the /opt/module/hive/conf/hive-site.xml file, and configure Hive Hook.

<property>
    <name>hive.exec.post.hooks</name>
    <value>org.apache.atlas.hive.hook.HiveHook</value>
</property>

(3) Modify the following parameters in the /opt/module/atlas/conf/atlas-application.properties configuration file

######### Hive Hook Configs #######
atlas.hook.hive.synchronous=false
atlas.hook.hive.numRetries=3
atlas.hook.hive.queueSize=10000
atlas.cluster.name=primary

(4) Copy the Atlas configuration file /opt/module/atlas/conf/atlas-application.properties to the Hive conf directory

[root@hadoop102 ~]# cp /opt/module/atlas/conf/atlas-application.properties  /opt/module/hive/conf/

Atlas launch

(1) Start the Hadoop cluster

Execute the following command on the NameNode node to start HDFS:

[root@hadoop102 ~]# start-dfs.sh

Execute the following command on the ResourceManager node to start Yarn:

[root@hadoop103 ~]# start-yarn.sh

(2) Start the Zookeeper cluster:

[root@hadoop102 ~]# zk.sh start

(3) Start the Kafka cluster:

[root@hadoop102 ~]# kf.sh start

(4) Start the Hbase cluster:

Execute the following command on the HMaster node to start HBase as the hbase user:

[root@hadoop102 ~]# sudo -i -u hbase start-hbase.sh

(5) Start the Solr cluster:

Execute the following command on all nodes to start Solr as the solr user:

[root@hadoop102 ~]# sudo -i -u solr /opt/module/solr/bin/solr start
[root@hadoop103 ~]# sudo -i -u solr /opt/module/solr/bin/solr start
[root@hadoop104 ~]# sudo -i -u solr /opt/module/solr/bin/solr start

(6) Enter the /opt/module/atlas path and start the Atlas service

[root@hadoop102 atlas]# bin/atlas_start.py

Tips:

  • Error messages can be found in /opt/module/atlas/logs/*.out and application.log
  • The Atlas service is stopped with atlas_stop.py

(7) Access the WebUI of Atlas

Access address: http://hadoop102:21000

Note: it may take several minutes for the web UI to become available after startup.

Account: admin

Password: admin
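A quick way to confirm that the server is responding is the version endpoint, using the default file-based admin/admin credentials:

[root@hadoop102 ~]# curl -s -u admin:admin http://hadoop102:21000/api/atlas/admin/version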

Atlas use

Using Atlas is relatively simple. Its main job is to synchronize the metadata of each service (mainly Hive), build the relationships between metadata entities, index the stored metadata, and finally provide users with data lineage viewing and metadata retrieval.

When Atlas is first installed, you need to perform a full import of metadata manually; afterwards, Atlas uses the Hive Hook to synchronize Hive metadata incrementally.

Import Hive metadata for the first time

Atlas provides a script for importing Hive metadata. Execute the script directly to complete the initial full import of Hive metadata.

(1) Import Hive metadata

Execute the following command:

[root@hadoop102 ~]# /opt/module/atlas/hook-bin/import-hive.sh

Enter the user name as prompted: admin; enter the password: admin

Enter username for atlas :- admin
Enter password for atlas :- 

Wait for a while, and the following log appears, indicating that the import is successful:

Hive Meta Data import was successful!!!

(2) View Hive metadata

  • Search for metadata of the hive_table type; you can see that Atlas has already obtained the Hive metadata

  • Choose a table to view its lineage dependencies

    The expected lineage does not appear at this point, because Atlas derives the dependencies between tables and between fields from the SQL statements executed by Hive. For example, after executing insert into table_a select * from table_b, Atlas can derive the dependency between table_a and table_b. Since no such SQL statement has been executed yet, no lineage is shown.

Hive metadata incremental synchronization

Incremental synchronization of Hive metadata requires no human intervention: whenever the metadata in Hive changes (a DDL statement is executed), the Hive Hook notifies Atlas of the change. In addition, Atlas derives data lineage from DML statements, as sketched below.
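For example, a minimal sketch (the test database and the table names are made up for illustration) that triggers the hook and produces table-level and column-level lineage:

[root@hadoop102 ~]# hive -e "
create table if not exists test.table_b (id int, name string);
create table if not exists test.table_a (id int, name string);
insert into table test.table_a select * from test.table_b;
"

After the INSERT finishes, the lineage between table_b and table_a (and between their columns) becomes visible in the Atlas UI.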

Whole Process Scheduling

To check the lineage results, Azkaban is used here to run the whole data-warehouse pipeline once.

(1) New data preparation

  • User Behavior Log

    a. Start the log collection channel, including Zookeeper, Kafka, Flume, etc.

    b. Modify the /opt/module/applog/application.yml file of hadoop102 and hadoop103 nodes, and change the simulation date to 2020-06-17 as follows:

    # Business date
    mock.date: "2020-06-17"
    

    c. Execute the script that generates the log:

    # lg.sh
    

    d. Wait for a while and observe whether the log file of 2020-06-17 appears in HDFS

  • business data

    a. Modify /opt/module/db_log/application.properties, and change the simulation date to 2020-06-17, as follows:

    # Business date
    mock.date=2020-06-17
    

    b. Enter the /opt/module/db_log path, and execute the command to simulate and generate business data, as follows:

    # java -jar gmall2020-mock-db-2021-01-22.jar
    

    c. Check whether data for 2020-06-17 appears in the gmall database in MySQL

(2) Start Azkaban

Note that you need to use the azkaban user to start Azkaban

  • Start Executor Server

    Execute the following command on each node to start the Executor:

    [root@hadoop102 ~]# sudo -i -u azkaban bash -c "cd /opt/module/azkaban/azkaban-exec;bin/start-exec.sh"
    [root@hadoop103 ~]# sudo -i -u azkaban bash -c "cd /opt/module/azkaban/azkaban-exec;bin/start-exec.sh"
    [root@hadoop104 ~]# sudo -i -u azkaban bash -c "cd /opt/module/azkaban/azkaban-exec;bin/start-exec.sh"
    
  • Activate the Executor Servers by executing the following activation commands (they can all be run from one node):

    [root@hadoop102 ~]# curl http://hadoop102:12321/executor?action=activate
    [root@hadoop102 ~]# curl http://hadoop103:12321/executor?action=activate
    [root@hadoop102 ~]# curl http://hadoop104:12321/executor?action=activate
    
  • Start the web server:

    [root@hadoop102 ~]# sudo -i -u azkaban bash -c "cd /opt/module/azkaban/azkaban-web;bin/start-web.sh"
    

(3) Full process scheduling

  • workflow parameters

  • operation result

View lineage

At this point, viewing the Hive metadata in Atlas shows the lineage dependencies:

Extended content

Atlas source code compilation

Install Maven

(1) Maven download: https://maven.apache.org/download.cgi

(2) Upload apache-maven-3.6.1-bin.tar.gz to the /opt/software directory of linux

(3) Unzip apache-maven-3.6.1-bin.tar.gz to the /opt/module/ directory

[root@hadoop102 software]# tar -zxvf apache-maven-3.6.1-bin.tar.gz -C /opt/module/

(4) Modify the name of apache-maven-3.6.1 to maven

[root@hadoop102 module]# mv apache-maven-3.6.1/ maven

(5) Add environment variables to /etc/profile

[root@hadoop102 module]# vim /etc/profile
#MAVEN_HOME
export MAVEN_HOME=/opt/module/maven
export PATH=$PATH:$MAVEN_HOME/bin

(6) Test installation results

[root@hadoop102 module]# source /etc/profile
[root@hadoop102 module]# mvn -v

(7) Modify settings.xml to use the Alibaba Cloud mirror:

[root@hadoop101 module]# cd /opt/module/maven/conf/
[root@hadoop102 maven]# vim settings.xml

<!-- Add the Aliyun mirror -->
<mirror>
    <id>nexus-aliyun</id>
    <mirrorOf>central</mirrorOf>
    <name>Nexus aliyun</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>
<mirror>
    <id>UK</id>
    <name>UK Central</name>
    <url>http://uk.maven.org/maven2</url>
    <mirrorOf>central</mirrorOf>
</mirror>
<mirror>
    <id>repo1</id>
    <mirrorOf>central</mirrorOf>
    <name>Human Readable Name for this Mirror.</name>
    <url>http://repo1.maven.org/maven2/</url>
</mirror>
<mirror>
    <id>repo2</id>
    <mirrorOf>central</mirrorOf>
    <name>Human Readable Name for this Mirror.</name>
    <url>http://repo2.maven.org/maven2/</url>
</mirror>

Compile Atlas source code

(1) Upload apache-atlas-2.1.0-sources.tar.gz to the /opt/software directory of hadoop102

(2) Unzip apache-atlas-2.1.0-sources.tar.gz to the /opt/module/ directory

[root@hadoop101 software]# tar -zxvf apache-atlas-2.1.0-sources.tar.gz -C /opt/module/

(3) Download Atlas dependencies

[root@hadoop101 software]# export MAVEN_OPTS="-Xms2g -Xmx2g"

[root@hadoop101 software]# cd /opt/module/apache-atlas-sources-2.1.0/
[root@hadoop101 apache-atlas-sources-2.1.0]# mvn clean -DskipTests install
[root@hadoop101 apache-atlas-sources-2.1.0]# mvn clean -DskipTests package -Pdist
# The two mvn commands above must be executed in ${atlas_home} (the root of the Atlas source tree)
[root@hadoop101 apache-atlas-sources-2.1.0]# cd distro/target/
[root@hadoop101 target]# mv apache-atlas-2.1.0-server.tar.gz /opt/software/
[root@hadoop101 target]# mv apache-atlas-2.1.0-hive-hook.tar.gz /opt/software/

Tip: the build takes a fairly long time (roughly half an hour) because many dependencies are downloaded. If an error is reported during this period, it is most likely a timeout caused by a network interruption; simply retry.

Atlas memory configuration

If you plan to store tens of thousands of metadata objects, it is recommended to adjust the parameter value to obtain the best JVM GC performance. The following are common server-side options:

Modify the configuration file /opt/module/atlas/conf/atlas-env.sh:

# Set Atlas memory
export ATLAS_SERVER_OPTS="-server -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=dumps/atlas_server.hprof -Xloggc:logs/gc-worker.log -verbose:gc -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1m -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps"

# Recommended heap settings for JDK 1.7
export ATLAS_SERVER_HEAP="-Xms15360m -Xmx15360m -XX:MaxNewSize=3072m -XX:PermSize=100M -XX:MaxPermSize=512m"

# Recommended heap settings for JDK 1.8
export ATLAS_SERVER_HEAP="-Xms15360m -Xmx15360m -XX:MaxNewSize=5120m -XX:MetaspaceSize=100M -XX:MaxMetaspaceSize=512m"

# Required for Mac OS users
export ATLAS_SERVER_OPTS="-Djava.awt.headless=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="

Parameter description: -XX:SoftRefLRUPolicyMSPerMB is especially useful for managing GC performance for query-heavy workloads with many concurrent users.

Configure username and password

Atlas supports the following authentication methods: File, Kerberos protocol, LDAP protocol.

Enable or disable the three authentication methods by modifying the atlas-application.properties configuration file:

atlas.authentication.method.kerberos=true|false
atlas.authentication.method.ldap=true|false
atlas.authentication.method.file=true|false

If two or more authentication methods are set to true, authentication falls back to the next method when an earlier one fails. For example, if Kerberos authentication is set to true and LDAP authentication is also set to true, LDAP authentication is used as a fallback for requests without a Kerberos principal and keytab.

This article mainly explains how to modify the user name and password settings by means of files. For other methods, please refer to the configuration on the official website.

(1) Open the /opt/module/atlas/conf/users-credentials.properties file

[atguigu@hadoop102 conf]$ vim users-credentials.properties

#username=group::sha256-password
admin=ADMIN::8c6976e5b5410415bde908bd4dee15dfb167a9c873fc4bb8a81f6f2ab448a918
rangertagsync=RANGER_TAG_SYNC::e3f67240f5117d1753c940dae9eea772d36ed5fe9bd9c94a300e40413f1afb9d

admin is the username

8c6976e5b5410415bde908bd4dee15dfb167a9c873fc4bb8a81f6f2ab448a918 is the sha256 hash of the password; the default password is admin.

(2) For example: change the user name to atguigu and the password to atguigu

  • Get the sha256 encrypted atguigu password:

    [atguigu@hadoop102 conf]$ echo -n "atguigu"|sha256sum
    2628be627712c3555d65e0e5f9101dbdd403626e6646b72fdf728a20c5261dc2
    
  • Modify username and password:

    [atguigu@hadoop102 conf]$ vim users-credentials.properties
    
    #username=group::sha256-password
    atguigu=ADMIN::2628be627712c3555d65e0e5f9101dbdd403626e6646b72fdf728a20c5261dc2
    rangertagsync=RANGER_TAG_SYNC::e3f67240f5117d1753c940dae9eea772d36ed5fe9bd9c94a300e40413f1afb9d
    
