Do embeddings protect private information?

The increasing prominence of Generative AI technologies such as ChatGPT, RAG and others has led to groundbreaking and innovative applications in various fields. However, it has also raised concerns about privacy and security. One of those concerns relates to the reliance of these models on embeddings, which are essentially vector representations of text. The crux of the matter is whether these embeddings are reversible, since they can contain sensitive information. Can these vectors be translated back to their original text, potentially unveiling sensitive health (PHI), credit (PCI) or other financial or private information (PII)?

The simple answer is yes and no.

On the YES side — The Argument for Reversibility

Several research papers and studies make a compelling case that embeddings can be reversed back to text. Most embedding-reversal attacks start with a bag of words (BoW) recovered from the vectors.

From these studies, it is evident that embeddings are not entirely secure and can indeed leak content under specific circumstances. In that sense, then, YES, embeddings do have the potential to be reversed, and some or all of their content leaked.

On the NO side — The Argument against Reversibility

However, the examples in those studies predominantly revolve around short sentences, which leaves the narrative incomplete.

Text Embeddings Reveal… (paper #1)

This attack reversed the vectors back to the original text by using a bag of words to initialize a candidate and then iteratively refining it until the candidate's embedding was close enough to the original vector. This brute-force search does achieve impressive similarity with the original text: the technique was almost 100% successful on 32-token samples, but its success declined at 128 tokens, and no tests were conducted beyond that length.
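To make the general shape of that loop concrete, here is a minimal, illustrative sketch (not the paper's actual code): `embed` stands in for any embedding model the attacker can query, and `propose` for whatever heuristic generates candidate rewrites, for example by swapping in words from the recovered bag of words.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def invert(target_vec: np.ndarray, embed, propose, steps: int = 100) -> str:
    """Greedy search for text whose embedding approaches target_vec.

    embed:   text -> vector, any embedding model the attacker can query
    propose: text -> list of candidate rewrites (e.g., swapping in words
             from a recovered bag of words)
    """
    hypothesis = ""          # often seeded from the recovered bag of words
    best_score = -1.0
    for _ in range(steps):
        candidates = propose(hypothesis)
        if not candidates:
            break
        # Keep the candidate whose embedding is closest to the target vector.
        score, text = max((cosine(embed(c), target_vec), c) for c in candidates)
        if score <= best_score:  # no candidate got any closer; stop
            break
        best_score, hypothesis = score, text
    return hypothesis            # text whose vector is "close enough"
```

The real attack is considerably more sophisticated, but the essential loop is the same: guess, re-embed, compare, refine.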

Sentence Embedding Leaks… (paper #2)

This attack used Generative AI to build a phrase from the bag of words recovered from the original vectors. For this exercise, the average length of their test samples was 11.71 for chat material and 18.25 for Wikipedia (I assume these are in token units). The examples shown by paper #2 are:
– I love playing the cello! It relaxes me!
– Nope, my hobbies are singing, running and cooking.
– Which network broadcasted Super Bowl 50 in the U.S.?
– What was Fresno’s population in 2010?

So, this study did demonstrate reversibility for relatively short text snippets.
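The flavor of this second attack can also be sketched in a few lines. This is only an illustration, not the paper's pipeline; `generate` is a hypothetical stand-in for whichever instruction-following LLM the attacker queries.

```python
def reconstruct_from_bow(bag_of_words: list[str], generate) -> str:
    """generate: prompt -> text, any instruction-following LLM the attacker can query."""
    prompt = (
        "Write one natural-sounding sentence that uses as many of these "
        "words as possible: " + ", ".join(bag_of_words)
    )
    return generate(prompt)

# For example, reconstruct_from_bow(["love", "playing", "cello", "relaxes"], generate)
# could plausibly yield something like: "I love playing the cello! It relaxes me!"
```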

The Role of Compression

The most popular model for creating embeddings today is text-embedding-ada-002. Given an input of up to 8,191 tokens, roughly 32K characters of text, the model returns a 1,536-dimension vector, which is about 12.3K in size. So, just from the sizes involved, we can infer that there is probably some loss of fidelity in the reproducibility of the data. The model is not open source, so we do not have access to its specifications or a detailed discussion of its technology, and we should not infer much more than the obvious compression effect.
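As a rough back-of-the-envelope check of that size argument, assuming about four characters per token and 64-bit floats for the vector (neither of which is published for this model):

```python
# Back-of-the-envelope comparison of input size vs. vector size.
# The character and byte counts are assumptions, not published figures.
max_tokens = 8191           # input limit of text-embedding-ada-002
chars_per_token = 4         # common rule of thumb
input_bytes = max_tokens * chars_per_token   # ~32K characters of text
dims = 1536                 # dimensions of the returned vector
vector_bytes = dims * 8     # ~12.3K if stored as 64-bit floats
print(f"text ~ {input_bytes} bytes, vector ~ {vector_bytes} bytes, "
      f"ratio ~ {input_bytes / vector_bytes:.1f}:1")
```

A roughly 2.7:1 reduction for a maximum-length input does not prove information is lost, but it does suggest that exact reconstruction of long inputs is unlikely.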

Conclusion (or not)

So, it should be clear at this point that:

  • Vectorization has the potential to preserve, and leak, personal or private information in the vector.
  • Vectorization of sentences or paragraphs is not a substitute for deidentification.
  • Given a short enough sentence, there are techniques that can recover the original content of the text.
  • Vectorization provides obfuscation against the casual observer, but not protection against the ill-intentioned.
  • There are currently no known successes in reversing the original meaning of texts of 256 tokens or longer, which are more common chunk sizes for embeddings.

Further research is needed to determine long-term repercussions and policies, but at present, vectorized private data should be treated as merely obfuscated data.

Good software architecture should be employed to layer access to the data by need. Using vectors in conjunction with micro-service architectures and APIs provides better access control and security, increases visibility into access, and creates better separation of concerns.
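As an illustrative sketch only (the roles, stores, and names below are assumptions, not a prescribed design), the layering might look like this: the service that performs similarity search only ever touches vectors, while the raw, re-identifiable text sits behind a narrower, audited call.

```python
from datetime import datetime, timezone

VECTOR_STORE = {"doc-1": [0.12, -0.87, 0.44]}                    # embeddings only
TEXT_STORE = {"doc-1": "Patient John Doe, DOB 1980-01-01, ..."}  # raw PHI/PII
AUDIT_LOG: list[dict] = []

def get_vector(doc_id: str, role: str) -> list[float]:
    # Vectors are treated as obfuscated, not anonymized, so reads are still logged.
    AUDIT_LOG.append({"at": datetime.now(timezone.utc), "role": role,
                      "doc": doc_id, "resource": "vector"})
    return VECTOR_STORE[doc_id]

def get_text(doc_id: str, role: str) -> str:
    # Access to the source text is a separate, narrower privilege.
    if role not in {"clinician", "auditor"}:
        raise PermissionError(f"role {role!r} may not read source text")
    AUDIT_LOG.append({"at": datetime.now(timezone.utc), "role": role,
                      "doc": doc_id, "resource": "text"})
    return TEXT_STORE[doc_id]
```

Behind an API manager, the same split lets authentication, authorization, auditing, and logging be centralized per endpoint rather than per application.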

Glossary

  • PHI (Protected Health Information): Any information about a person’s health that can be used to identify them, including data items such as name, address, date of birth, social security number, medical record number, etc. This information is protected in the United States by the Health Insurance Portability and Accountability Act (HIPAA), which sets forth the rules for how PHI is to be collected, used and disclosed.
  • PII (Personally Identifiable Information): Any information that can be used to identify an individual, including data items such as name, address, phone number, email, social security number, driver’s license, financial information, credit card information, etc. While there is no single federal law in the United States to protect this class of data, there are a number of laws that protect specific types of PII. For example, the Gramm-Leach-Bliley Act (GLBA) protects financial information, and the Fair Credit Reporting Act (FCRA) protects credit information.
  • PCI (Payment Card Industry): This term describes a set of security standards designed to protect cardholder data. PCI is normally used as shorthand for PCI DSS (Payment Card Industry Data Security Standard).
  • Deidentification: Deidentification refers to the process of removing or modifying personal information from a document or database so that individuals cannot be readily identified. This is a crucial practice in data privacy and protection, particularly when dealing with sensitive information in sectors like healthcare, finance, or research. The goal of deidentification is to preserve the utility of the data for analysis and research while minimizing the risk of privacy breaches and ensuring compliance with legal and regulatory standards. Techniques employed in deidentification include anonymization, where identifiers are completely removed, and pseudonymization, where identifiers are replaced with fictitious but consistent values. However, it is important to note that deidentification is not foolproof, and in certain cases, individuals might still be re-identified through linkage with other data sources or by leveraging advanced analytical techniques. As such, ongoing attention to data security practices and ethical considerations is paramount.
  • RAG (Retrieval Augmented Generation): RAG is a widely used technique that augments a large language model’s (LLM’s) answer to a question by retrieving relevant context and providing it to the model. The process starts by generating a set of candidate pieces of information that might contain the answer to the user query. The candidates are then ranked based on their relevance to the query, and the top-ranked candidates are used to provide additional context to the LLM. The LLM is then used to generate a final response that is based on both the original user query and the provided context.
  • API (Application Programming Interface): An API is a set of definitions and protocols that allows different software components to communicate with each other. APIs can be used to expose functionality, data, or events to other applications, or to other components of the same application. A well-constructed and documented API defines the methods and the data formats for both requests and responses. APIs enable communication between different components, and also provide the opportunity to use API managers, where authentication, authorization, auditing, and logging can be centralized.
  • Embeddings and Vectors: An embedding is a representation of a word or phrase as a vector of numbers. The numbers in the vector represent the semantic meaning of the word or phrase. For example, the embedding of the word “dog” might be very similar to the embedding for the word “canine”. “Embeddings” and “Vectors” are used almost interchangeably when discussing NLP and LLMs, and for these purposes, they are. But, while all embeddings are represented as vectors, not all vectors are embeddings. Vectors are a more generic term that has mathematical connotations, but embeddings specifically refer to the transformation of language data into numerical vectors. This allows the models to perform tasks such as text classification, sentiment analysis and language translation as mathematical operations.
  • BoW (Bag of Words): A bag of words (BoW) is a statistical language model that represents text as a collection of words based on their frequency, without regard to their order or grammar. Since it is a frequency-based representation of the text, it is often used to represent text for many natural language processing (NLP) tasks, such as text classification and sentiment analysis.
  • Token: A unit of text that is the result of the tokenization process, in which a word, sentence, or paragraph is broken into smaller units. A token can be a word, a couple of words, or a portion of a word. Tokens and their length vary depending on the tokenization method used, but a common rule of thumb is about one token for every four characters, so a 32-token sentence is roughly 128 characters long. The following two sentences represent 32 and 128 tokens using a common tokenization algorithm; a quick way to count tokens programmatically is sketched after this glossary.
    32-token sentence: In a quaint village nestled between a crystal-clear lake and a lush, verdant forest, there lived a baker named Tom, who was renowned for his bread.
    128-token sentence: In a quaint village nestled between a crystal-clear lake and a lush, verdant forest, there lived a humble baker named Tom, who was renowned throughout the region for his incredibly delicious, freshly-baked bread, which he meticulously crafted using a secret recipe handed down through generations, and every morning, the villagers would gather, forming a long, winding line outside his tiny, rustic bakery, eagerly awaiting the moment when the warm, inviting aroma of the freshly-baked loaves would waft through the air, signaling the beginning of a new, hopeful day, and Tom, with his comforting smile, would warmly greet each customer with encouraging words.
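As a postscript to the rule of thumb in the Token entry above, token counts can also be checked directly. A minimal sketch, assuming the tiktoken package is available (its cl100k_base encoding is the one used by text-embedding-ada-002):

```python
# Count tokens for the 32-token example sentence from the glossary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
sentence = ("In a quaint village nestled between a crystal-clear lake and a "
            "lush, verdant forest, there lived a baker named Tom, who was "
            "renowned for his bread.")
tokens = enc.encode(sentence)
print(f"{len(tokens)} tokens for {len(sentence)} characters")
```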