Data Security Hardening: In-depth Analysis of Didi ES Security Authentication Technical Solution

The previous articles described how Didi's self-developed ES strong consistency and multi-active capabilities were implemented, and how the performance potential of ES was improved. ES offers powerful search and analysis functions, but because it is open source and easy to use, it has also become a target for attackers. In recent years, ES data leaks have occurred frequently in the industry. The following are some serious cases:

In December 2021, Socialarks leaked 400GB of data. Because of a misconfigured Elasticsearch database, more than 318 million user records were exposed, involving user information from social platforms such as Instagram, LinkedIn, and Facebook. [1]

In June 2022, an unauthorized-access vulnerability was found in an Elasticsearch instance on a sub-site of Midea (midea.com). [2]

In August 2022, more than 280 million records of Indian citizens were leaked online, including user account information, bank account information, and personally identifiable information. [3]

When Didi first adopted ES, it had the same problem: the ES HTTP port 9200 and the Kibana HTTP port 5601 could be accessed without authorization. To keep data secure, the Didi ES team decided to fix these problems as soon as possible.

1

Problem Description

Let's briefly introduce the Didi ES architecture, which consists of the following five parts:

1. ES cluster: provides data storage, tokenization, retrieval, and other services.

2. Gateway cluster: a proxy in front of the ES clusters that provides authentication, authorization, rate limiting, routing, and metrics collection. Users read and write ES cluster indexes through the Gateway cluster.

3. ES Admin management and control platform: provides metadata display, scheduled index creation and cleanup, DCDR master-slave switchover verification, and other functions.

4. User console: a platform where users operate on indexes and view index information.

5. Client: users access the Gateway cluster through the ES client.


The problem is that the ES service as a whole has authentication and authorization capabilities, but an ES cluster on its own has none. The service exposes authentication and authorization externally through the Gateway, while the ES cluster itself performs no security checks: anyone who obtains the IP address and port of an ES cluster can perform any operation on it. We therefore need to add security authentication to the ES clusters, and adapt Admin, Gateway, and the clients that access the ES clusters accordingly.

2

Solution

Solution 1: ES X-Pack plug-in

1. Introduction to ES X-Pack plug-in

ES X-Pack is an official plugin for Elasticsearch that provides a range of features, including support for security, alerting, monitoring, reporting, and graph visualization. The security features provide authentication, authorization, and encryption functions to ensure that only authorized users can access the cluster.

After the X-Pack security feature is enabled, accounts and passwords are created through HTTP requests and stored in an ES index. When the ES cluster receives an HTTP request, it calls AuthenticationService during processing to authenticate it, checking whether the account and password in the request header match those stored in the index. Once HTTP authentication passes, the authenticated user information is written into the thread context. Before the TCP layer performs the corresponding operation, an authorization check verifies that the authenticated user has permission for it, and the actual business logic runs only after authorization passes.

2. Advantages

  • Kibana requires no code changes. Kibana supports the X-Pack plug-in natively and provides a login page for entering the account and password.

  • Native support for complete authentication, authorization, and audit logic.

3. Disadvantages

  • It does not support rolling restarts and upgrades of the cluster. Once the plug-in is enabled, it not only forces account and password authentication at the HTTP layer, but also forces SSL-encrypted communication at the TCP layer between nodes so that malicious nodes cannot join the cluster. While security authentication is being rolled out through a rolling restart, nodes that have not yet been upgraded cannot communicate over TCP with nodes that have, which makes the cluster unavailable.

  • DCDR cannot synchronize data. If security authentication is disabled on the master cluster but enabled on the slave cluster, requests from the master cluster to the slave cluster fail because of the authorization logic at the TCP layer.

  • It cannot be rolled back quickly to stop losses. After ES account and password authentication is enabled, if a caller's access breaks because of security authentication, the only way to recover is to roll back every node in the cluster, so losses cannot be stopped in time.

  • Accidentally deleting the index or index alias that stores the accounts and passwords makes the cluster inaccessible, and the account and password information cannot be recovered. The plug-in stores this information in the cluster's security index and accesses it through an index alias. If the index or alias is deleted by mistake, the information is lost, requests can no longer be authenticated, and they fail.

  • Changing the password by mistake makes the cluster inaccessible, and recovery takes too long. The password recovery process is complicated: you have to log in to the production machine and go through 5 steps to restore the original cluster account password.

4. Required modifications

  • Add a dynamic configuration item that turns security authentication on or off with one click, so that losses can be stopped with one click.

  • Remove the TLS/SSL encrypted communication logic at the TCP layer between nodes, so that DCDR data synchronization and cluster rolling restarts are supported.

  • Gateway, Admin, and the clients need to carry the account and password in the request header.
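
As a hedged illustration of the last point, the sketch below builds an HTTP Basic `Authorization` header, a standard way to carry an account and password in a request header that X-Pack security accepts. The article does not say which header scheme Didi's Gateway, Admin, and clients use, so the scheme, the function name `basic_auth_header`, and the credentials shown are all assumptions.

```python
import base64


def basic_auth_header(user: str, password: str) -> dict:
    """Build an HTTP Basic Authorization header (RFC 7617)."""
    raw = f"{user}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Hypothetical credentials, for illustration only.
headers = basic_auth_header("es_user", "es_password")
```

The resulting dictionary can be merged into the headers of whatever HTTP client a component already uses. Basic credentials are only encoded, not encrypted, so they remain readable on the wire unless the connection itself is protected.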

Solution 2: Self-developed ES security plug-in

1. How the self-developed ES security plug-in works

A self-developed plug-in implements an HTTP request interceptor. The interceptor reads the account and password carried in the HTTP request header and compares them with the account and password saved in the local configuration file. If authentication succeeds, the subsequent logic continues to execute; if it fails, an authentication failure exception is returned.
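
The interceptor logic described above can be sketched roughly as follows. This is not Didi's plug-in code (a real ES plug-in would be written in Java inside the ES process); it is a minimal Python sketch of the check itself, assuming the credentials arrive as an HTTP Basic `Authorization` header. The names `AuthenticationError`, `parse_basic_auth`, and `intercept` are hypothetical.

```python
import base64
import hmac


class AuthenticationError(Exception):
    """Raised when a request carries missing or wrong credentials."""


def parse_basic_auth(headers: dict):
    """Return (user, password) from a Basic Authorization header, or None."""
    value = headers.get("Authorization", "")
    scheme, _, token = value.partition(" ")
    if scheme.lower() != "basic" or not token:
        return None
    try:
        decoded = base64.b64decode(token, validate=True).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return None
    user, sep, password = decoded.partition(":")
    return (user, password) if sep else None


def intercept(headers: dict, configured_user: str, configured_password: str) -> None:
    """Let the request continue (return None) or raise AuthenticationError."""
    creds = parse_basic_auth(headers)
    if creds is None:
        raise AuthenticationError("missing or malformed credentials")
    # Constant-time comparison against the locally configured account.
    user_ok = hmac.compare_digest(creds[0].encode(), configured_user.encode())
    pass_ok = hmac.compare_digest(creds[1].encode(), configured_password.encode())
    if not (user_ok and pass_ok):
        raise AuthenticationError("authentication failed")
```

The header lookup here is case-sensitive for brevity; a real HTTP layer treats header names case-insensitively.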

2. Advantages

  • Simple architecture and clear logic. It only performs a simple string check in the HTTP request processing path and does not touch the internal TCP communication checks between nodes.

  • Supports rolling restarts and upgrades of ES clusters. A dynamic cluster setting makes it easy to turn the check on and off, which is friendly to rolling upgrades.

  • Supports switching security authentication on and off with one click, so losses can be stopped quickly. With the added dynamic cluster setting, security authentication can be turned off with one click when it causes abnormal access for users.

  • Kibana requires no code changes.

    1) You only need to configure the correct account and password in kibana.yml, and Kibana requests will automatically carry them when accessing the ES cluster.

    2) Logging in to the Kibana page also requires the correct account and password, with no additional redirect to an authentication page.

  • Avoids requests becoming unavailable because the password was changed by mistake. The account and password are configured in elasticsearch.yml and cannot be modified at runtime.
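
The one-click switch mentioned in the list above can be pictured with a small sketch. In Elasticsearch, dynamic cluster settings are changed at runtime through the cluster settings API; the article does not name the setting the plug-in registers, so the in-memory `SecuritySwitch` class and the `handle_request` function below are hypothetical stand-ins that only show the control flow.

```python
import threading


class SecuritySwitch:
    """Runtime on/off flag, standing in for a dynamic cluster setting."""

    def __init__(self, enabled: bool = False) -> None:
        self._enabled = enabled
        self._lock = threading.Lock()

    def set_enabled(self, enabled: bool) -> None:
        with self._lock:
            self._enabled = enabled

    def is_enabled(self) -> bool:
        with self._lock:
            return self._enabled


def handle_request(switch: SecuritySwitch, credentials_ok: bool) -> int:
    """Return an HTTP-style status: 200 to proceed, 401 to reject."""
    # With the switch off, every request passes, as before the upgrade.
    if switch.is_enabled() and not credentials_ok:
        return 401
    return 200
```

Under this model the switch stays off while nodes and clients are being upgraded, is turned on once everything carries credentials, and can be turned off again immediately if a caller starts failing.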

3. Disadvantages

  • It only provides authentication with a single shared account; there are no other functions such as authorization and auditing.

  • A password change takes effect only after the cluster nodes are restarted.

4. Required modifications

[Figure: modification points for the self-developed ES security plug-in]

Solution selection

After comparing the two solutions in terms of development effort, ease of operation and maintenance, stability, and ease of use, we decided to adopt Solution 2. The query flow of the ES ecosystem after adopting Solution 2 is as follows:

1. The ES client initiates a query request to the Gateway.

2. Gateway authenticates and authorizes the request. After the request passes, Gateway obtains from Admin the access address of the corresponding cluster and the account and password for accessing the ES cluster, and caches them locally.

3. Gateway forwards the query request to the corresponding ES cluster, using the ES cluster account and password obtained in step 2.

4. ES executes the query logic and returns the result to Gateway, and Gateway returns the result to the client. This completes the query flow.
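
The four steps above can be sketched as a small model. This is an illustration under stated assumptions rather than Didi's Gateway code: the class and field names (`Gateway`, `ClusterAccess`, `app_id`) are hypothetical, and the calls to Admin and to the ES cluster are injected as plain functions so the sketch runs without a network.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ClusterAccess:
    address: str
    user: str
    password: str


class Gateway:
    def __init__(
        self,
        authorize: Callable[[str, str], bool],
        fetch_from_admin: Callable[[str], ClusterAccess],
        forward: Callable[[ClusterAccess, str], str],
    ) -> None:
        self._authorize = authorize
        self._fetch_from_admin = fetch_from_admin
        self._forward = forward
        self._cache: Dict[str, ClusterAccess] = {}

    def query(self, app_id: str, cluster: str, body: str) -> str:
        # Step 2: authenticate and authorize the client request.
        if not self._authorize(app_id, cluster):
            raise PermissionError(f"{app_id} may not access {cluster}")
        # Step 2: ask Admin for address and credentials only on a cache miss.
        access = self._cache.get(cluster)
        if access is None:
            access = self._fetch_from_admin(cluster)
            self._cache[cluster] = access
        # Steps 3 and 4: forward with the cluster credentials, return the result.
        return self._forward(access, body)
```

The cache in this sketch never expires. A real gateway would need some way to refresh cached credentials after a cluster password changes; the article does not describe that mechanism.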


3

Rollout Safeguards

The security upgrade involves the ES clusters, ES Gateway, ES Admin, the ES client, Fastindex (Hive2ES), DataX (Mysql2ES), and Flink2ES. The scale of each component is as follows:

  • ES has a total of 66 clusters and 2236 nodes

  • Gateway has a total of 28 clusters and 492 nodes

  • Admin has 2 clusters and 12 nodes

  • Flink2ES has a total of 8500+ tasks that need to be restarted to upgrade to the latest ES client

  • Fastindex has 3 clusters

  • DataX has 3 clusters

Of these, the ES clusters are the most laborious to upgrade. ES is a distributed engine that spreads data across multiple nodes. During a rolling upgrade, restarting a node may turn the cluster status "yellow", which means that some replica shards in the cluster are not assigned to any node. To keep data available, we must wait for the cluster status to return to "green" before restarting the next node. How long the cluster takes to recover from "yellow" to "green" depends on the amount of data and the number of nodes in the cluster. In extreme cases, upgrading a single node in a public cluster takes more than 1 hour, so a major version upgrade of all ES clusters often takes more than 3 months.
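
The wait-for-green loop described above can be sketched as follows. In a real deployment the status would come from the Elasticsearch cluster health API, whose response contains a `status` field; here `restart_node` and `get_status` are hypothetical callables passed in by the caller, and the polling interval and timeout are assumed values.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    nodes: Iterable[str],
    restart_node: Callable[[str], None],
    get_status: Callable[[], str],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart nodes one at a time, waiting for green between restarts."""
    done: List[str] = []
    for node in nodes:
        restart_node(node)
        deadline = time.monotonic() + timeout_seconds
        # Do not touch the next node until all shards are assigned again.
        while get_status() != "green":
            if time.monotonic() >= deadline:
                raise TimeoutError(f"cluster not green after restarting {node}")
            sleep(poll_seconds)
        done.append(node)
    return done
```

Because each node waits for recovery before the next one starts, total upgrade time grows roughly linearly with node count, which is consistent with the multi-month duration mentioned above.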

To upgrade all components to the secure version stably, to hold the bottom line on stability, to avoid affecting business reads and writes once ES security is enabled, and to roll back quickly when problems are found during the change, we did the following:

  • The ES engine supports switching the security feature on and off with one click. As mentioned above, the ES security authentication plug-in supports one-click enabling and disabling of cluster security authentication. This supports rolling restarts and upgrades of ES clusters, keeps upgrades of the ES clusters, ES clients, and other components from affecting each other, and allows rollback within seconds when access becomes unavailable.

  • Upgrade and enable the security feature cluster by cluster in priority order. The priorities, from low to high, are: log clusters, public clusters, and independent clusters. If a problem occurs along the way, the security feature can be rolled back and turned off in time to avoid larger stability problems.

  • A script regularly scans the ES clusters and Gateway clusters to make sure that all ES nodes and Gateway nodes have been upgraded to the version with security authentication.

  • Count the running versions of Flink tasks to make sure that all Flink2ES tasks have been upgraded.

  • Add metrics for ES security authentication exceptions. The metrics ensure that if enabling a cluster's security feature affects the business, this is discovered and the feature is turned off in time; it is turned on again after the users finish upgrading.
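
To make the interplay of the priority order, the exception metric, and the one-click rollback concrete, here is a minimal sketch. It is not Didi's rollout tooling: `staged_enable`, the `enable` and `disable` callables, the `auth_failures` metric reader, and the zero-failure threshold are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Rollout order from the article: lowest priority first.
PRIORITY = {"log": 0, "public": 1, "independent": 2}


def staged_enable(
    clusters: List[Tuple[str, str]],  # (cluster name, cluster kind)
    enable: Callable[[str], None],
    disable: Callable[[str], None],
    auth_failures: Callable[[str], int],
    max_failures: int = 0,
) -> Tuple[List[str], Optional[str]]:
    """Enable security cluster by cluster; stop and roll back on failures.

    Returns (clusters left enabled, cluster that was rolled back or None).
    """
    enabled: List[str] = []
    for name, kind in sorted(clusters, key=lambda c: PRIORITY[c[1]]):
        enable(name)
        if auth_failures(name) > max_failures:
            disable(name)  # one-click rollback for this cluster
            return enabled, name
        enabled.append(name)
    return enabled, None
```

In practice the metric would be observed over a time window after enabling rather than read once, and the rollout would resume after the affected users have upgraded.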


4

Summary

The work took more than 3 months. ES RD and ES SRE completed the upgrade of all ES components. The Flink2ES tasks were all restarted and upgraded with strong support from the Flink team, and DataX completed its upgrade with the cooperation of the synchronization center. The security feature was then turned on for all clusters with almost no impact perceived by the business. All of Didi's online ES clusters now support security authentication, which greatly reduces the risk of data leakage and data loss.


Origin blog.csdn.net/DiDi_Tech/article/details/132222302