Elasticsearch: How to Remove Personally Identifiable Information from Elastic Data in 3 Easy Steps

Author: Peter Titov

Personally identifiable information (PII) compliance is a growing challenge for any organization. Whether you're in e-commerce, banking, healthcare, or another data-sensitive field, PII can be captured and stored unintentionally. With structured logs it is easy to quickly identify, remove, and secure sensitive data fields; but what about unstructured messages, or call center transcriptions?

Elasticsearch, drawing on its long experience with machine learning, offers several options here: you can bring in custom models, such as large language models (LLMs), or use the models Elastic provides. These models make it possible to implement PII redaction.

If you want to learn more about natural language processing, machine learning, and Elastic, be sure to check out the related articles on the Elastic blog.

In this blog, we'll show you how to combine Elasticsearch's ability to load trained machine learning models with the flexibility of Elastic ingest pipelines to set up PII redaction.

Specifically, we will walk through loading a named entity recognition (NER) model for detecting names and locations, and configuring a redact processor for detecting and removing custom patterns. All of this comes together in an ingest pipeline, where Elastic's machine learning and data transformation capabilities strip sensitive information from the data.

Load the trained model

Before starting, we must load the NER model into the Elasticsearch cluster. This can be easily done with Docker and the Elastic Eland client. From the command line, let's clone the Eland client via git:

git clone https://github.com/elastic/eland.git

Navigate to the recently downloaded client:

cd eland/

Now let's build the client:

docker build -t elastic/eland .

From here, you can deploy the trained model to an Elastic machine learning node! Be sure to replace your username, password, es-cluster-hostname and esport.

If you use Elastic Cloud or signed certificates, just run the following command:

docker run -it --rm --network host elastic/eland eland_import_hub_model --url https://<username>:<password>@<es-cluster-hostname>:<esport>/ --hub-model-id dslim/bert-base-NER --task-type ner --start

If you use a self-signed certificate, run the following command:

docker run -it --rm --network host elastic/eland eland_import_hub_model --url https://<username>:<password>@<es-cluster-hostname>:<esport>/ --insecure --hub-model-id dslim/bert-base-NER --task-type ner --start

From here, you will see the Eland client download the trained model from Hugging Face and automatically deploy it to your cluster!

In my case, I prefer to use the released Eland package directly. For detailed installation steps, please refer to the article "Elasticsearch: How to implement image similarity search in Elastic". With Eland installed, we can use the following command:

  eland_import_hub_model --url https://<user>:<password>@<hostname>:<port> \
  --hub-model-id dslim/bert-base-NER \
  --task-type ner \
  --ca-certs <your certificate> \
  --start

On my machine, I run the command above with my own cluster URL and certificate path filled in.

Once the model has been imported, open the Machine Learning Overview UI and click the blue "Synchronize your jobs and trained models" hyperlink to sync the newly loaded trained model.


That's it! Congratulations, you just loaded your first trained model into Elasticsearch! 
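If you prefer to stay in Dev Tools rather than the UI, you can also verify that the model was imported and that its deployment has started. This is a quick check, assuming the model ID that Eland generates for dslim/bert-base-NER:

GET _ml/trained_models/dslim__bert-base-ner

GET _ml/trained_models/dslim__bert-base-ner/_stats

The _stats response should show the deployment in a started state before you move on.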

Create the redact processor and ingest pipeline

In Dev Tools, we configure a redact processor and an inference processor to use the trained model we just loaded. Together they form an ingest pipeline called redact, which we can then use to remove sensitive data from any field we wish. In this example, I'll focus on the "message" field. NOTE: At the time of writing, the redact processor is experimental and must be created via Dev Tools.

Introduction to the redact processor: the redact processor uses the Grok rules engine to obscure text in an input document that matches a given Grok pattern. It can be used to hide personally identifiable information (PII) by configuring it to detect known patterns such as email or IP addresses. Text that matches a Grok pattern is replaced with a configurable string, e.g. <EMAIL> for an email address, or, if you prefer, every match can simply be replaced with the text <REDACTED>.
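Before wiring everything together, here is a minimal sketch of the redact processor on its own, run through the simulate API, so you can see the Grok-based replacement in isolation (the email and IP in the sample document are made up for illustration):

// the sample values below are made up for illustration
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "redact": {
          "field": "message",
          "patterns": [
            "%{EMAILADDRESS:EMAIL}",
            "%{IP:IP_ADDRESS}"
          ]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "Contact me at jdoe@example.com from 10.1.2.3"
      }
    }
  ]
}

The simulated document comes back with the matches replaced, i.e. "Contact me at <EMAIL> from <IP_ADDRESS>".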

We enter the following command under Dev Tools:

PUT _ingest/pipeline/redact
{
  "processors": [
    {
      "set": {
        "field": "redacted",
        "value": "{
   
   {
   
   {message}}}"
      }
    },
    {
      "inference": {
        "model_id": "dslim__bert-base-ner",
        "field_map": {
          "message": "text_field"
        }
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": """
           String msg = ctx['message'];
           for (item in ctx['ml']['inference']['entities']) {
             msg = msg.replace(item['entity'], '<' + item['class_name'] + '>')
           }
           ctx['redacted']=msg
        """
      }
    },
    {
      "redact": {
        "field": "redacted",
        "patterns": [
          "%{EMAILADDRESS:EMAIL}",
          "%{IP:IP_ADDRESS}",
          "%{CREDIT_CARD:CREDIT_CARD}",
          "%{SSN:SSN}",
          "%{PHONE:PHONE}"
        ],
        "pattern_definitions": {
          "CREDIT_CARD": """\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{4}""",
          "SSN": """\d{3}-\d{2}-\d{4}""",
          "PHONE": """\d{3}-\d{3}-\d{4}"""
        }
      }
    },
    {
      "remove": {
        "field": [
          "ml"
        ],
        "ignore_missing": true,
        "ignore_failure": true
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "field": "failure",
        "value": "pii_script-redact"
      }
    }
  ]
}

Ok, but what does each processor really do? Let's go over each processor in detail here:

  1. The SET processor creates the redacted field, copying in the contents of the message field so it can be modified later in the pipeline.
  2. The INFERENCE processor runs our loaded NER model against the message field to identify names, locations, and organizations.
  3. The SCRIPT processor then replaces each entity the model detected in the redacted field with its class name (see the sketch of the inference output after this list).
  4. The REDACT processor uses Grok patterns to identify any custom data we wish to remove from the redacted field (which was copied from the message field).
  5. The REMOVE processor drops the extraneous ml.* fields from the indexed document; note that we will add the message field to this processor once we've verified that the data is being redacted correctly.
  6. The ON_FAILURE / SET processor catches any errors, just in case something goes wrong.
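For reference, the entities that the script processor iterates over look roughly like this. This is a sketch of the ml.inference.entities array the inference processor adds, with example values; the exact entities and scores depend on the model's predictions:

"ml": {
  "inference": {
    "entities": [
      {
        "entity": "John Smith",
        "class_name": "PER",
        "class_probability": 0.99,
        "start_pos": 0,
        "end_pos": 10
      }
    ],
    "model_id": "dslim__bert-base-ner"
  }
}

The script walks this array and replaces each entity string in the redacted field with <PER>, <LOC>, or <ORG> accordingly.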

Redact your PII

Now that the ingest pipeline is configured with all the necessary steps, let's test how it removes sensitive data from a document. Navigate to Stack Management, select Ingest Pipelines, search for redact, and click the result.


Here we'll test our pipeline by adding a sample document. Below is an example that you can copy and paste to make sure everything works.

{"_source":{"message": "John Smith lives at 123 Main St. Highland Park, CO. His email address is [email protected] and his phone number is 412-189-9043.  I found his social security number, it is 942-00-1243. Oh btw, his credit card is 1324-8374-0978-2819 and his gateway IP is 192.168.1.2"}}

Just press the "Run the pipeline" button to see the output.
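The exact labels depend on the model's predictions, but you should see a new redacted field in which the name is replaced with <PER>, the locations with <LOC>, and the email address, phone number, social security number, credit card, and IP address replaced with <EMAIL>, <PHONE>, <SSN>, <CREDIT_CARD>, and <IP_ADDRESS> respectively.

If you would rather run the same test from Dev Tools, the pipeline's simulate API works too (a sketch using the sample document from above):

POST _ingest/pipeline/redact/_simulate
{
  "docs": [
    {
      "_source": {
        "message": "John Smith lives at 123 Main St. Highland Park, CO. His email address is [email protected] and his phone number is 412-189-9043.  I found his social security number, it is 942-00-1243. Oh btw, his credit card is 1324-8374-0978-2819 and his gateway IP is 192.168.1.2"
      }
    }
  ]
}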

What's next?

After attaching this ingest pipeline to the data you index and verifying that it behaves as expected, you can add the message field to the list of fields to remove so that the raw PII is never indexed. Just update your REMOVE processor to include the message field, as shown below, and simulate again to see only the redacted fields.
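The remove processor at the end of the pipeline would then look like this; only this block changes, the rest of the redact pipeline stays the same:

    {
      "remove": {
        "field": [
          "message",
          "ml"
        ],
        "ignore_missing": true,
        "ignore_failure": true
      }
    }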

Run the test pipeline again, and you will see that the message field has disappeared.
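Once you're happy with the results, the pipeline can be applied automatically at ingest time. A minimal sketch, assuming a hypothetical index named my-logs: either set it as the index's default pipeline, or pass it explicitly when indexing a document.

// "my-logs" is a placeholder index name
PUT my-logs
{
  "settings": {
    "index.default_pipeline": "redact"
  }
}

POST my-logs/_doc?pipeline=redact
{
  "message": "..."
}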


Conclusion

With this step-by-step approach, you are now ready to detect and redact any sensitive data across your indices.

Here's a quick recap of what we discussed:

  • Load the pretrained named entity recognition model into the Elastic cluster
  • Configure the redact processor and inference processor to use the trained model during data ingestion
  • Test on sample data and modify the ingest pipeline to securely remove personally identifiable information

Ready to get started? Sign up for Elastic Cloud and try out the features and functionality outlined above to get the most value and visibility from your data.

The release and timing of any features or functionality described in this article is at the sole discretion of Elastic. Any features or functionality not currently available may not be delivered on time or at all.

In this blog post, we may have used third-party generative artificial intelligence tools that are owned and operated by their respective owners. Elastic has no control over third-party tools, and we are not responsible for their content, operation, or use, nor shall we be liable for any loss or damage that may arise from your use of such tools. Exercise caution when using artificial intelligence tools with personal, sensitive or confidential information. Any data you submit may be used for artificial intelligence training or other purposes. There can be no guarantee that information you provide will be secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative artificial intelligence tool before using it.

Elastic, Elasticsearch, and related marks are trademarks, logos, or registered trademarks of Elasticsearch NV in the US and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.
