The GenAI Data Spill - How your IP is leaking into GenAI - and what to do about it
Source - OpenAI

The GenAI Data Spill - How your IP is leaking into GenAI - and what to do about it

Introduction: The Data Leak Risk

Generative AI is making inroads across industries—but beneath the surface, it is creating new data risks: sensitive data is seeping into AI models, sometimes without anyone noticing. From proprietary code and internal documents to personal customer information, the risks of unintended data exposure are real and rising.

Some publicized examples of this risk have included highly IP sensitive environments in which engineers accidentally uploaded sensitive source code to ChatGPT while troubleshooting. The incident led to an immediate ban on AI tools across departments.  Another example in the Government sector included a spreadsheet leak involving military personnel details.  Both these cases highlighted the risks of integrating AI with sensitive data environments.

In this blog, we’ll unpack what’s leaking into GenAI models, how it’s happening, and why it should be on every executive’s radar.

Let’s break down what types of data are leaking into these models, how it happens, and—most importantly—what you can do to keep your data safe.


What’s Leaking? The 5 Big Categories of Data at Risk

According to industry research, there are 5 major categories of data being leaked into GenAI models -

Article content
Source - Harmonic

 

1.      Customer Interaction Data

Chatbot interactions, support tickets, and recorded calls used for AI training can lead to data retention issues. If a customer service AI model remembers past interactions too well, it might regurgitate confidential details to the wrong user.

Article content
Source - Harmonic

Example: A support AI tool trained on customer complaints might start auto-suggesting responses that contain details of another company’s dispute resolution process.

 

2. Employee Data

Article content
Source - Harmonic

Customer names, emails, phone numbers, and even Social Security numbers can end up in AI models if organizations don’t have proper data handling safeguards.

Example: AI-powered HR tools trained on employee records can accidentally retain and reveal personal data, creating compliance nightmares (possible GDPR and CCPA fines!).


3. Legal & Compliance Documents

Article content
Source - Harmonic

Contracts, legal memos, and regulatory compliance documents are sometimes processed by GenAI for quick summarization or redlining. However, some models retain fragments of legal texts, leading to unintended leaks.

Example: A contract review AI tool might generate clauses eerily similar to another company’s private agreements.


4.      Security data leakage

Article content
Source - Harmonic

Imagine an AI-powered chatbot generating marketing insights using last quarter’s confidential sales data. That’s not magic—it’s a data leak! Proprietary business data is being absorbed by GenAI models when employees unknowingly feed them sensitive information.

Example: Employees using AI-powered document summarizers may unknowingly share sensitive sales forecasts or internal memos, making them part of the model’s dataset.

 

5.      Source Code & Software Secrets

Article content
Source - Harmonic

Companies using AI-powered coding assistants (think GitHub Copilot or ChatGPT for coding) may accidentally expose proprietary algorithms, security keys, or even entire application architectures.

Example: Developers pasting company code into an AI tool for debugging may unknowingly add proprietary algorithms to the AI’s training set, making them retrievable by others.

How Does This Happen? The Data Leak Pipeline

AI data leaks don’t happen because AI models are malicious. Instead, they’re often the unintended consequence of how we train and interact with AI systems. Here’s how the leaks occur due to human interactions:

The Data Leak Pipeline: Human Factors

Article content

Technical Vulnerabilities That Can Lead to IP Leakage

IP leakage in AI systems can also result from various system vulnerabilities, including inadequate data governance, model inversion attacks, insecure APIs, and adversarial threats. Below are key ways IP leakage can occur due to technology deficiencies:

Article content
Source - NIST

IP leakage in AI systems can also result from various vulnerabilities, including inadequate data governance, model inversion attacks, insecure APIs, and adversarial threats. Below are key ways IP leakage can occur due to technology deficiencies:

Protecting Your Enterprise: 7 Key Steps to Stop AI Data Leaks

So, how can enterprise leaders and CISO’s prevent their organizations from falling victim to GenAI data leaks? To help organizations harness AI securely, there are emerging standards to structure your GenAI risk strategy:

  • NIST AI RMF 600-1 (2024): The National Institute of Standards and Technology (NIST) offers a robust AI Risk Management Framework emphasizing governance, mapping risks, measuring model behavior, and managing harm.
  • ISO/IEC 42001: A new international standard providing AI-specific guidance on risk controls and organizational policies.
  • MITRE ATLAS: A threat modeling framework specific to AI, offering red-team testing guidance for AI/ML models.

Preventing IP and Data Leakage Solutions

Although NIST and other industry bodies provide a broader analysis of GenAI risks and techniques to address them, in this blog we focus on the key security techniques, governance, and risk mitigation strategies to minimize the risk of IP data leakage in AI systems.  Below is a summary of techniques available to prevent IP data leakage -

Article content
Source - NIST

Examples of commercial solution providers providing these solutions are as follows –

1. Data Masking and Redaction

Data masking and redaction techniques prevent sensitive information from being exposed during model training or inference.

How It Works:

  • Data Masking: Sensitive fields are replaced with placeholders or pseudonyms (e.g., "John Smith" → "User123").
  • Redaction: Specific sensitive terms or patterns are removed based on predefined rules.

Examples:

  • Granica: Offers real-time sensitive data detection and masking for structured and unstructured data, ensuring privacy while preserving functionality.
  • Protecto: Provides AI-powered data masking integrated with platforms like Snowflake and AWS, ensuring compliance with privacy regulations.

2. Secure Enterprise AI Platforms

Enterprise-grade AI platforms focus on controlled access, encryption, and compliance to prevent unauthorized data leakage.

How It Works:

  • Centralized control over data interactions with AI models.
  • Encryption of sensitive data during transit and storage.
  • Policy-based tools to block unauthorized uploads or downloads.

Examples:

  • Kiteworks: The AI Data Gateway ensures secure interactions with LLMs through encryption, audit trails, and policy enforcement.
  • Microsoft Copilot: Designed for enterprises, it ensures that user data is not exposed to public models while maintaining compliance.

3. Secrets Management and Machine Identity

These solutions protect credentials and enforce trust between machines interacting with AI systems.

How It Works:

  • Secrets Management: Dynamically generates encrypted API keys for short-term use, reducing exposure risks.
  • Machine Identity Management: Establishes unique identities for machines using certificates issued by enterprise Certificate Authorities (CA).

Examples:

  • Akeyless: Provides unified secrets management for API keys and encryption keys alongside machine identity verification.
  • CyberArk: Focuses on privileged access management (PAM) to secure machine-to-machine communications.

4. Tokenization

Tokenization replaces sensitive data with non-sensitive tokens while retaining usability for analysis.

How It Works:

  • Sensitive fields (e.g., names or Social Security Numbers) are replaced with unique tokens before being processed by AI models.
  • Original data mapping remains securely stored in a separate system.

Examples:

  • CipherTrust: Provides flexibility for IT environments, including cloud and big data platforms, making it suitable for GenAI applications requiring secure data handling.
  • Protegrity: Offers enterprise-grade tokenization solutions integrated into AI workflows.

5. AI-Powered Encryption

AI-enhanced encryption dynamically adapts to threats, ensuring secure data handling even during computation.

How It Works:

  • Real-time analysis of evolving threats enables tailored encryption protocols.
  • Post-quantum cryptography protects against future quantum computing risks.

Examples:

  • RTS Labs: Develops AI-powered encryption methods that autonomously adjust security parameters based on risk factors.
  • Granica: Combines encryption with anomaly detection for enhanced security in enterprise environments.

6. Self-Hosted Models

Self-hosted models allow enterprises to maintain full control over their data without relying on external providers.

How It Works:

  • Models are deployed within the organization’s infrastructure, ensuring sensitive information never leaves internal systems.

Examples:

  • Custom deployments using OpenAI APIs or open-source models like Meta’s LLaMA or Mistral ensure security while enabling customization.

7. Threat Detection and Runtime Security

Real-time monitoring systems detect anomalies or potential breaches in AI interactions.

How It Works:

  • Continuous scanning of inputs/outputs to identify unauthorized access or adversarial attacks.
  • Automated red teaming evaluates vulnerabilities in LLMs before deployment.

Examples:

  • Protect AI: Guardian: Enables zero-trust enforcement for model usage. Recon: Provides automated red teaming to identify vulnerabilities in GenAI systems.

8. Data Loss Prevention (DLP) Tools

DLP solutions monitor and control data transfers across endpoints to prevent leaks.

How It Works:

  • Detects unusual activity such as large-scale data transfers or unauthorized access attempts.
  • Integrates with firewalls, intrusion detection systems, and other security frameworks for comprehensive protection.

Examples:

·       Nightfall AI – Offers advanced AI-driven insights reduce false positives and provide comprehensive data flow visibility, making it a strong choice for organizations leveraging GenAI.

·       Proofpoint Enterprise DLP - Provides comprehensive protection across email, cloud, and endpoint environments with a people-centric approach to data security.

Key Takeaways for Enterprise Leaders:

  • IP leakage prevention is happening everyday — perhaps without adequate oversight.
  • Strong AI governance & security controls are essential for safe AI adoption.  AI Governance is synonymous with enterprise cybersecurity posture.
  • Industry-specific risk management strategies can ensure compliance and resilience with this evolving risk.

Conclusion: Future-Proofing AI Against IP Leakage

Preventing IP leakage in AI systems requires a multilayered defense strategy combining technical safeguards, governance frameworks, and proactive monitoring. Industry frameworks together with commercial solutions provide a structured approach to mitigating AI risks, ensuring organizations can leverage AI’s potential without exposing proprietary data.

 

Tim Freestone

Kiteworks CMO | AI-first Strategist | I Love Lamp

4mo

Thanks Rahul Chandra for the solution recognition.

Like
Reply
Edward Finegold

Tech product marketing and thought leadership.

4mo

Good piece Rahul. This adds some clarity to what the "data in motion" issues are far beyond the faceless evil hacker assumption. Clearly there can be real problems and liabilities created with no malicious intent whatsoever.

Like
Reply

To view or add a comment, sign in

Others also viewed

Explore topics