Protect Your Sensitive Data from a Nosy LLM.

Many of us face the challenge of leveraging the power of AI while safeguarding sensitive or private data. The stakes are high: the consequences of data exposure can be severe. We’ve all been warned not to send such data to an external LLM, yet most businesses lack the resources to host a local LLM. This is where ‘data masking’ comes in, the technique we will explore in this article.

Several practical forms of data masking can be used. One is to replace sensitive values with anonymized placeholders, a technique that provides security while preserving some limited usability of the data:

"ssn": "XXX-XX-6789",
"drug_id": "DRG-XXXXX",
"age": "XX"

In these examples, the sensitive data was replaced with fixed characters.
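
Here is a minimal Python sketch of this kind of fixed-character masking. The field names and patterns are illustrative assumptions, not any particular library’s API:

import re

def mask_fixed(record: dict) -> dict:
    """Replace sensitive values with fixed characters (illustrative fields)."""
    masked = dict(record)
    if "ssn" in masked:
        # Keep the last four digits of the SSN; mask the rest.
        masked["ssn"] = re.sub(r"^\d{3}-\d{2}", "XXX-XX", masked["ssn"])
    if "drug_id" in masked:
        # Mask every digit after the "DRG-" prefix.
        masked["drug_id"] = re.sub(r"(?<=DRG-)\d+",
                                   lambda m: "X" * len(m.group()),
                                   masked["drug_id"])
    if "age" in masked:
        masked["age"] = "XX"
    return masked

record = {"ssn": "123-45-6789", "drug_id": "DRG-90210", "age": "42"}
print(mask_fixed(record))
# {'ssn': 'XXX-XX-6789', 'drug_id': 'DRG-XXXXX', 'age': 'XX'}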
Another approach is to use “tokens.” With a token, you replace each sensitive value with a unique placeholder that can be mapped back to the original later in the pipeline. An example looks like this:

"patient_id": "TOKEN12345",
"SSN": "TOKEN11223",
"birthdate": "TOKEN44556"

The original data or document is changed using the tokens, but a “token dictionary” is kept so the masking can be reversed:

"TOKEN12345": "ID-3467",
"TOKEN11223": "123-45-6789",
"TOKEN44556": "Feb-29-1980"
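
Here is a minimal sketch of how the tokens and the dictionary might be generated together. The tokenize helper and its sequential counter are assumptions for illustration; a production system would use random, non-guessable tokens:

import itertools

_counter = itertools.count(10000)

def tokenize(record: dict, sensitive_fields: set) -> tuple[dict, dict]:
    """Replace sensitive fields with tokens; return the masked record
    plus the token dictionary needed to reverse the masking."""
    masked, token_dict = {}, {}
    for field, value in record.items():
        if field in sensitive_fields:
            token = f"TOKEN{next(_counter)}"   # unique placeholder
            token_dict[token] = value          # remember the original
            masked[field] = token
        else:
            masked[field] = value
    return masked, token_dict

record = {"patient_id": "ID-3467", "SSN": "123-45-6789", "birthdate": "Feb-29-1980"}
masked, token_dict = tokenize(record, {"patient_id", "SSN", "birthdate"})
# masked     -> {"patient_id": "TOKEN10000", "SSN": "TOKEN10001", ...}
# token_dict -> {"TOKEN10000": "ID-3467", "TOKEN10001": "123-45-6789", ...}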

Token-based data masking is especially useful in LLM/RAG processing. An LLM pipeline might extract a complex JSON structure from a data source, and that JSON might contain sensitive fields like the ones above. You don’t want to send the original data, but if you send the masked data, the tokens reappear in the final LLM response, where they are meaningless to the reader. Your LLM response might come back like this:

“Patient TOKEN12345 should not take medication TOKEN8675 due to possible side effects.”

By keeping the token dictionary immutable during processing, the final answer received from the LLM can be translated back into meaningful text before it is displayed to the user:

“Patient ID-3467 should not take medication acetylsalicylic acid due to possible side effects.”
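
A small de-tokenization step handles that translation. This sketch assumes the TOKEN8675 entry from the response above maps to “acetylsalicylic acid” in the dictionary:

import re

def detokenize(text: str, token_dict: dict) -> str:
    # Swap each TOKENnnnnn placeholder for its original value;
    # unknown tokens are left untouched.
    return re.sub(r"TOKEN\d+",
                  lambda m: token_dict.get(m.group(), m.group()),
                  text)

token_dict = {"TOKEN12345": "ID-3467", "TOKEN8675": "acetylsalicylic acid"}
response = "Patient TOKEN12345 should not take medication TOKEN8675 due to possible side effects."
print(detokenize(response, token_dict))
# Patient ID-3467 should not take medication acetylsalicylic acid due to possible side effects.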

Here is the conceptual sequence of steps you might perform in RAG processing for an LLM app, with the workflow built on token-based data masking.