This Stanford study examined how six major AI companies (Anthropic, OpenAI, Google, Meta, Microsoft, and Amazon) handle user data from chatbot conversations. Here are the main privacy concerns:
👀 All six companies use chat data for training by default, though some allow opt-out
👀 Data retention is often indefinite, with personal information stored long-term
👀 Cross-platform data merging occurs at multi-product companies (Google, Meta, Microsoft, Amazon)
👀 Children's data is handled inconsistently, with most companies not adequately protecting minors
👀 Limited transparency in privacy policies, which are complex, hard to understand, and often missing crucial details about actual practices
Practical takeaways for acceptable use policies and training for nonprofits using generative AI:
✅ Assume anything you share will be used for training - sensitive information, uploaded files, health details, biometric data, etc.
✅ Opt out when possible - proactively disable data collection for training (Meta is the one provider where you cannot)
✅ Information cascades through ecosystems - your inputs can lead to inferences that affect ads, recommendations, and potentially insurance or other third parties
✅ Special concern for children's data - age verification and consent protections are inconsistent
Some questions to consider in acceptable use policies and to incorporate into any training:
❓ What types of sensitive information might your nonprofit staff share with generative AI?
❓ Does your nonprofit currently identify what counts as “sensitive information” (beyond PII) that should not be shared with generative AI? Is this incorporated into training?
❓ Are you working with children, people with health conditions, or others whose data could be particularly harmful if leaked or misused?
❓ What would be the consequences if sensitive information or strategic organizational data ended up being used to train AI models? How might this affect trust, compliance, or your mission? How is this communicated in training and policy?
Across the board, the Stanford research points out that developers’ privacy policies lack essential information about their practices. The researchers recommend that policymakers and developers address the data privacy challenges posed by LLM-powered chatbots through comprehensive federal privacy regulation, affirmative opt-in for model training, and filtering personal information from chat inputs by default. “We need to promote innovation in privacy-preserving AI, so that user privacy isn’t an afterthought."
How are you advocating for privacy-preserving AI? How are you educating your staff to navigate this challenge? https://lnkd.in/g3RmbEwD
Best Practices for Data Management
Explore top LinkedIn content from expert professionals.
-
How To Handle Sensitive Information in your next AI Project
It's crucial to handle sensitive user information with care. Whether it's personal data, financial details, or health information, understanding how to protect and manage it is essential to maintain trust and comply with privacy regulations. Here are 5 best practices to follow:
1. Identify and Classify Sensitive Data
Start by identifying the types of sensitive data your application handles, such as personally identifiable information (PII), sensitive personal information (SPI), and confidential data. Understand the specific legal requirements and privacy regulations that apply, such as GDPR or the California Consumer Privacy Act.
2. Minimize Data Exposure
Only share the necessary information with AI endpoints. For PII, such as names, addresses, or social security numbers, consider redacting this information before making API calls, especially if the data could be linked to sensitive applications, like healthcare or financial services (see the redaction sketch after this post).
3. Avoid Sharing Highly Sensitive Information
Never pass sensitive personal information, such as credit card numbers, passwords, or bank account details, through AI endpoints. Instead, use secure, dedicated channels for handling and processing such data to avoid unintended exposure or misuse.
4. Implement Data Anonymization
When dealing with confidential information, like health conditions or legal matters, ensure that the data cannot be traced back to an individual. Anonymize the data before using it with AI services to maintain user privacy and comply with legal standards.
5. Regularly Review and Update Privacy Practices
Data privacy is a dynamic field with evolving laws and best practices. To ensure continued compliance and protection of user data, regularly review your data handling processes, stay updated on relevant regulations, and adjust your practices as needed.
Remember, safeguarding sensitive information is not just about compliance — it's about earning and keeping the trust of your users.
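The redaction step in point 2 can be as simple as a pre-processing pass over the prompt. Below is a minimal sketch in Python (standard library only); the regex patterns, placeholder tokens, and example prompt are illustrative assumptions, not a complete PII detector.

```python
import re

# Minimal sketch: strip common PII patterns from a prompt before it is
# sent to an AI endpoint. Patterns and placeholder tokens are illustrative,
# not a production-grade PII solution.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace recognized PII with a labeled placeholder token."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text

if __name__ == "__main__":
    prompt = ("Summarize this note: Jane Doe (jane.doe@example.com, "
              "555-123-4567) reported SSN 123-45-6789.")
    print(redact(prompt))  # this redacted version is what gets forwarded to the model
```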
-
Vision-Language Models connect what AI sees with what it reads and reasons. They’re the foundation of AI systems that can interpret charts, medical images, retail shelves, or product catalogs. But a generic VLM doesn’t understand your domain’s visual language. That’s where fine-tuning becomes essential.
𝐖𝐡𝐲 𝐟𝐢𝐧𝐞-𝐭𝐮𝐧𝐢𝐧𝐠 𝐚 𝐕𝐋𝐌 𝐦𝐚𝐭𝐭𝐞𝐫𝐬
A pretrained VLM already knows the basics of visual-text reasoning. Fine-tuning helps it specialize for your domain.
→ In healthcare, it learns to detect anomalies in MRIs and X-rays.
→ In retail, it interprets shelf images and product layouts.
→ In enterprise, it extracts structured data from invoices and reports.
You’re not rebuilding intelligence, you’re refining perception to fit your use case.
𝐇𝐨𝐰 𝐋𝐨𝐑𝐀 𝐦𝐚𝐤𝐞𝐬 𝐢𝐭 𝐞𝐟𝐟𝐢𝐜𝐢𝐞𝐧𝐭
Full model fine-tuning is expensive and compute-heavy. Low-Rank Adaptation (LoRA) keeps the base model frozen and trains only small adapter layers. That means:
→ Faster training cycles
→ Smaller memory footprint
→ Lower compute costs
→ Domain adapters that are easy to swap in and out
You can maintain one base model and multiple lightweight adapters for each use case such as invoices, medical forms, or retail analytics (see the adapter sketch after this post).
𝐈𝐧𝐬𝐢𝐝𝐞 𝐚 𝐕𝐢𝐬𝐢𝐨𝐧-𝐋𝐚𝐧𝐠𝐮𝐚𝐠𝐞 𝐌𝐨𝐝𝐞𝐥
A VLM has three main components:
→ 𝐕𝐢𝐬𝐢𝐨𝐧 𝐄𝐧𝐜𝐨𝐝𝐞𝐫 converts pixels into visual tokens.
→ 𝐅𝐮𝐬𝐢𝐨𝐧 𝐋𝐚𝐲𝐞𝐫 combines visual and text context for reasoning.
→ 𝐋𝐚𝐧𝐠𝐮𝐚𝐠𝐞 𝐃𝐞𝐜𝐨𝐝𝐞𝐫 generates captions, summaries, or structured responses.
Each layer introduces potential failure modes like poor resolution, misaligned regions, or verbose hallucinations. Fine-tuning improves alignment and reliability across these components.
𝐓𝐡𝐞 𝐟𝐢𝐧𝐞-𝐭𝐮𝐧𝐢𝐧𝐠 𝐥𝐢𝐟𝐞𝐜𝐲𝐜𝐥𝐞
→ 𝐃𝐚𝐭𝐚 𝐝𝐞𝐬𝐢𝐠𝐧: Collect diverse, high-quality, clearly labeled visuals.
→ 𝐓𝐚𝐬𝐤 𝐝𝐞𝐟𝐢𝐧𝐢𝐭𝐢𝐨𝐧: Choose the right setup: captioning, VQA, extraction, or localization.
→ 𝐋𝐨𝐑𝐀 𝐭𝐫𝐚𝐢𝐧𝐢𝐧𝐠: Train adapters for each domain efficiently.
→ 𝐄𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧: Use both quantitative metrics and human review for grounding and accuracy.
Always evaluate across different slices such as document type, lighting, and template to surface hidden biases.
𝐆𝐨𝐯𝐞𝐫𝐧𝐚𝐧𝐜𝐞 𝐚𝐧𝐝 𝐫𝐞𝐥𝐢𝐚𝐛𝐢𝐥𝐢𝐭𝐲
Each domain adapter should have its own dataset lineage, version, and evaluation score. Reliability requires attention to fairness, privacy, consistency, and uncertainty handling. Fine-tuning doesn’t just improve accuracy, it strengthens governance and ethical alignment.
LoRA fine-tuning makes VLMs faster to adapt, cheaper to deploy, and more aligned with your real-world data.
〰️〰️〰️
Follow me (Aishwarya Srinivasan) for more AI insight and subscribe to my Substack to find more in-depth blogs and weekly updates in AI: https://lnkd.in/dpBNr6Jg
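To make the adapter idea concrete, here is a minimal sketch of attaching LoRA adapters to a pretrained VLM with Hugging Face transformers and peft. The checkpoint name is a placeholder and the target_modules list is an assumption; the projection layers worth adapting vary by model, so check the model card of your base VLM before training.

```python
# Minimal sketch: attach LoRA adapters to a pretrained vision-language model.
# The base weights stay frozen; only the small adapter matrices are trained.
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

base_checkpoint = "your-org/your-vlm-checkpoint"  # placeholder, not a real model ID
model = AutoModelForVision2Seq.from_pretrained(base_checkpoint)
processor = AutoProcessor.from_pretrained(base_checkpoint)

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # assumed attention projections to adapt
    bias="none",
)

model = get_peft_model(model, lora_config)  # wraps the frozen base model
model.print_trainable_parameters()          # typically a small fraction of all weights

# ...train on your domain data (invoices, medical forms, shelf images), then
# save just the adapter, e.g. model.save_pretrained("adapters/invoices"),
# and keep one adapter per use case while sharing the same base model.
```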
-
Treating data as a product is a necessity these days, but the main question is: How do you operationalize it without adding more tools, more silos, and more manual work?
There have been some confusion and process gaps around it, especially when you're working with Databricks Unity Catalog. From contract to catalog, it's important for us to treat the data journey as a single process. Here, I'd like to talk about a practical user flow that organisations should adopt to create governed, discoverable, and mature data products using UC and a contract-first approach (a minimal contract-to-DDL sketch follows this post).
But before I begin with the flow, it's important to make sure that:
✅ Producers clearly define what they're offering (table schema, metadata, policies);
✅ Consumers know what to expect (quality, access, usage);
✅ Governance and lifecycle management are enforced automatically.
That's why, to do this, I'd like to divide the architecture into 3 parts:
👉 Data Contract Layer: To define expectations and ownership;
👉 UC Service Layer: API-driven layer to enforce contracts as code;
👉 UC Layer: Acting as Data & Governance plane.
☘️ The Ideal flow:
🙋 Step 1: Producer would define the schema of the table (columns, dtypes, descriptions) including ownership, purpose, and intended use.
👨‍💻 Step 2: Producer would add table descriptions, table tags, column-level tags (e.g., PII, sensitive), and domain ownership rules.
🏌️‍♂️ Step 3: Behind the scenes, the API service would trigger the table creation process in the right catalog/schema. Metadata would also be registered.
🥷 Step 4: Producer would include policies like: Who can see what? Which columns require masking? What's visible for which role? etc.
😷 Step 5: Row/column filters and masking logic would be applied to the table.
⚡ Step 6: Once the table is live, validation would kick in, including schema checks, contract compliance, etc.
💡 Step 7: Just-in-Time Access would ensure consumers don't get access by default. Instead, access would be granted on demand based on Attribute Based Access Control (ABAC). The process, again, would be managed by APIs with no ad-hoc grants via the UI.
👍 Step 8-9: All access and permission changes would be audited and stored. As soon as the consumer requests access to the table, SELECT permission would be granted based on approvals, ensuring right data usage and compliance.
🔔 Step 10-11: Upon consumer request and based on the metrics provided, Lakehouse Monitoring would be hooked in to the table to monitor freshness, completeness, and anomalies. Alerts would also be configured to notify consumers proactively.
☑️ Step 12: The Lakehouse Monitoring dashboard attached to the table would be shared with the stakeholders.
🚀 What do you get⁉️
- A fully governed & discoverable data product.
- Lifecycle policies enforced for both producer and consumer.
- Decoupled producer and consumer responsibilities.
- Quality monitoring and observability built in.
#Databricks #UnityCatalog #DataGovernance #DataContract #DataProducts
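A minimal sketch of the contract-first idea in Steps 1-3: a producer-authored contract (a plain Python dict here, YAML in practice) is rendered into the DDL a UC service layer would execute. The table name, columns, tags, and contract shape are illustrative assumptions, this is not a Unity Catalog client, and the tag syntax should be verified against your Databricks runtime.

```python
# Minimal sketch of a contract-first flow: turn a producer-authored data
# contract into the DDL that a service layer would run against the catalog.
contract = {
    "table": "finance.billing.invoices",          # illustrative catalog.schema.table
    "owner": "billing-domain-team",
    "purpose": "Monthly invoice facts for revenue reporting",
    "columns": [
        {"name": "invoice_id", "type": "STRING", "comment": "Surrogate key", "tags": []},
        {"name": "customer_email", "type": "STRING", "comment": "Billing contact", "tags": ["pii"]},
        {"name": "amount_eur", "type": "DECIMAL(12,2)", "comment": "Invoice total", "tags": []},
    ],
}

def contract_to_ddl(c: dict) -> list[str]:
    """Render the contract as DDL statements the service layer would execute."""
    cols = ",\n  ".join(
        f"{col['name']} {col['type']} COMMENT '{col['comment']}'" for col in c["columns"]
    )
    stmts = [f"CREATE TABLE IF NOT EXISTS {c['table']} (\n  {cols}\n) COMMENT '{c['purpose']}';"]
    for col in c["columns"]:
        for tag in col["tags"]:
            # Column-tag syntax follows Unity Catalog SQL; verify on your runtime.
            stmts.append(
                f"ALTER TABLE {c['table']} ALTER COLUMN {col['name']} SET TAGS ('{tag}' = 'true');"
            )
    return stmts

for stmt in contract_to_ddl(contract):
    print(stmt, end="\n\n")
```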
-
⏳ Modeling time travel in data? Slowly Changing Dimensions keep your story intact — or rewrite it.
Tracking how data changes over time is a core headache for Data Engineers. We call these Slowly Changing Dimensions (SCDs), and managing them always forces a tough trade-off between simplicity and history.
𝗧𝘆𝗽𝗲 𝟭: The “Whiteboard Eraser”
→ Like erasing old notes and writing fresh ones.
→ Overwrites old data with new — no history kept.
→ Use Case: When history doesn't matter, e.g., fixing a typo or updating an address.
𝗧𝘆𝗽𝗲 𝟮: The “Time Machine” (see the sketch after this post)
→ Every change creates a new snapshot in time.
→ Adds new rows to keep historical data intact.
→ Use Case: Track changes over time, like customer status or pricing changes.
𝗧𝘆𝗽𝗲 𝟰: The “Photo Album”
→ Keep daily snapshots in a separate album.
→ Stores history in separate tables, adds to storage cost.
→ Use Case: When detailed history is needed but the original table stays clean.
⚖️ The Final Trade-Offs
Choosing an SCD type is always a compromise:
Type 1: Simple ✅ | History ❌ | Storage ✅
Type 2: Simple ❌ | History ✅ | Storage ⚠️
Type 4: Simple ⚠️ | History ✅ | Storage ❌
Snapshots: Simple ✅ | History ✅ | Storage ❌❌
Image Credits: DataExpert.io - Zach Wilson
💡 My take? Type 2 for 80% of cases. Storage is cheap. Losing history isn't.
Each method solves a problem but brings tradeoffs—choose wisely based on what your business truly needs.
#data #engineering
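A minimal sketch of the Type 2 “Time Machine” pattern in pandas: an incoming change expires the current row and appends a new versioned one. The column names (effective_from, effective_to, is_current) are illustrative conventions, not a fixed standard.

```python
# Minimal sketch of SCD Type 2: close out the current row, append a new version.
import pandas as pd

dim = pd.DataFrame([
    {"customer_id": 1, "status": "trial", "effective_from": "2024-01-01",
     "effective_to": None, "is_current": True},
])

def apply_scd2(dim: pd.DataFrame, customer_id: int, new_status: str, change_date: str) -> pd.DataFrame:
    """Expire the current row for the key, then append the new version."""
    mask = (dim["customer_id"] == customer_id) & dim["is_current"]
    if not dim.loc[mask, "status"].ne(new_status).any():
        return dim  # no-op if the attribute did not actually change
    dim.loc[mask, ["effective_to", "is_current"]] = [change_date, False]  # expire old row
    new_row = {"customer_id": customer_id, "status": new_status,
               "effective_from": change_date, "effective_to": None, "is_current": True}
    return pd.concat([dim, pd.DataFrame([new_row])], ignore_index=True)

dim = apply_scd2(dim, customer_id=1, new_status="paid", change_date="2024-06-01")
print(dim)  # both versions remain, so point-in-time queries still work
```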
-
Leveraging GHG data and analytics to accelerate business transformation 🌎
As regulations tighten and the demand for transparency grows, businesses face increasing pressure to adopt robust greenhouse gas (GHG) data and analytics systems. Establishing a structured framework for emissions measurement and analysis is critical for compliance, but its benefits extend far beyond regulatory requirements. A comprehensive GHG data architecture enables businesses to measure, manage, and act on emissions across the full value chain, paving the way for meaningful transformation.
To meet both current and future expectations, organizations must focus on measuring emissions across Scopes 1, 2, and 3. Addressing direct emissions (Scope 1), energy-related emissions (Scope 2), and value chain emissions (Scope 3) ensures a complete understanding of an organization’s carbon footprint. Scope 3, in particular, represents the largest and most complex challenge, but it also holds the greatest opportunity for reducing environmental impact and driving systemic change across supply chains.
With precise data on emissions across all scopes, businesses can move beyond compliance to actionable insights. By identifying carbon hotspots and setting reduction targets (see the aggregation sketch after this post), organizations can optimize processes such as energy efficiency, supply chain sourcing, and logistics management. These actions help integrate sustainability into business operations while delivering cost efficiencies and improving resilience.
A robust GHG data and analytics system also facilitates full-value chain transformation. Leveraging technologies like machine learning, scenario modeling, and ecosystem data exchanges enables businesses to plan for long-term carbon reduction strategies and innovate low-carbon products. Addressing emissions holistically across Scopes 1, 2, and 3 ensures alignment with global climate goals while creating competitive advantages in sustainable markets.
Measuring and acting on emissions across the entire value chain is no longer optional. Businesses equipped with accurate data and advanced analytics capabilities can meet regulatory demands, reduce emissions at scale, and drive meaningful progress toward a low-carbon economy.
Source: Gartner
#sustainability #sustainable #business #esg #climatechange #GHG
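As a small illustration of the “identify carbon hotspots” step, the sketch below aggregates activity-level emissions by scope and ranks the largest sources. The records, source names, and figures are invented for illustration only.

```python
# Minimal sketch: aggregate emissions by scope and surface the biggest sources.
import pandas as pd

emissions = pd.DataFrame([
    {"scope": "Scope 1", "source": "Company fleet",          "tCO2e": 1200},
    {"scope": "Scope 2", "source": "Purchased power",        "tCO2e": 3400},
    {"scope": "Scope 3", "source": "Supplier A - steel",     "tCO2e": 18500},
    {"scope": "Scope 3", "source": "Supplier B - logistics", "tCO2e": 7300},
    {"scope": "Scope 3", "source": "Business travel",        "tCO2e": 950},
])

by_scope = emissions.groupby("scope")["tCO2e"].sum().sort_values(ascending=False)
hotspots = emissions.sort_values("tCO2e", ascending=False).head(3)

print("Footprint by scope (tCO2e):\n", by_scope, sep="")
print("\nTop reduction candidates:\n", hotspots[["source", "tCO2e"]], sep="")
```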
-
In an era where data sharing is both essential and concerning, six fundamental techniques are emerging to protect privacy while enabling valuable insights.
Fully Homomorphic Encryption encrypts data before it is shared, allowing analysis without decoding the original information, thus safeguarding sensitive details.
Differential Privacy adds noise to a dataset, making it impossible to recover the original inputs while still allowing generalized analysis (see the sketch after this post).
Functional Encryption gives selected users a key to view specific parts of the encrypted text, offering relevant insights while withholding other details.
Federated Analysis allows parties to share only the insights from their analysis, not the data itself, promoting collaboration without direct exposure.
Zero-Knowledge Proofs enable users to prove their knowledge of a value without revealing it, supporting secure verification without unnecessary exposure.
Secure Multi-Party Computation distributes data analysis across multiple parties, so no single entity can see the complete set of inputs, ensuring a collaborative yet compartmentalized approach.
Together, these techniques pave the way for a more responsible and secure future for data management and analytics.
#privacy #dataprotection
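A minimal sketch of the differential privacy idea: calibrated Laplace noise is added to an aggregate before release, so individual inputs cannot be recovered from the published number. The epsilon and sensitivity values and the toy dataset are illustrative assumptions.

```python
# Minimal sketch of the Laplace mechanism for a differentially private count.
import numpy as np

def dp_count(true_count: int, epsilon: float = 1.0, sensitivity: float = 1.0) -> float:
    """Return a noisy count; smaller epsilon means more noise and stronger privacy."""
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

ages = np.array([34, 29, 45, 52, 41, 38])          # toy dataset
true_over_40 = int((ages > 40).sum())

print("true count:", true_over_40)
print("released (DP) count:", round(dp_count(true_over_40, epsilon=0.5), 2))
```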
-
🧭 “𝐍𝐨𝐧𝐞 𝐨𝐟 𝐨𝐮𝐫 𝐝𝐚𝐭𝐚 𝐢𝐬 𝐞𝐪𝐮𝐚𝐥 - 𝐬𝐨𝐦𝐞 𝐝𝐚𝐭𝐚 𝐢𝐬 𝐦𝐨𝐫𝐞 𝐞𝐪𝐮𝐚𝐥 𝐭𝐡𝐚𝐧 𝐨𝐭𝐡𝐞𝐫𝐬”
A Chief Data Officer once told me during a workshop: “I thought all our definitions were aligned and consistent… until a dashboard told me otherwise.”
We both laughed — but the truth hit hard. That moment summed up the data reality for so many organizations today.
💡 Millions invested in modern platforms.
💡 Fancy dashboards everywhere.
💡 Yet… conflicting numbers, duplicated data, and endless debates about which report to trust.
That’s when the real problem shows up — not a lack of data, but a lack of trust.
⚙️ The Turning Point: From Data Chaos to Data Confidence
At DataGalaxy, we’ve learned that not all data deserves the same level of attention. Some data fuels decisions, innovation, and growth. Other data? It’s just noise.
That’s why we help organizations take a pragmatic path — one that starts with identifying and certifying what truly matters: Critical Data Elements (CDEs).
Here’s the simple, human logic behind it 👇
1️⃣ Identify your key data elements. Which data really drives business outcomes?
2️⃣ Score & prioritize. Focus your data quality and governance energy where it counts most.
3️⃣ Establish data contracts. Know who owns what, where data comes from, and how it’s used.
4️⃣ Certify your data products. Give them a visible seal of quality — trusted, traceable, and ready for self-service.
Think of it as building your own Data Marketplace, where every product is transparent, reliable, and business-aligned.
🚀 The Impact: Trust That Scales
When certification becomes part of your culture, everything changes.
✅ Decision-makers stop arguing over “which number is right.”
✅ Teams move faster because ownership is clear.
✅ Data becomes a trusted business asset, not an ongoing frustration.
Certification isn’t about bureaucracy — it’s about clarity, confidence, and credibility. It’s about creating a world where business and data teams finally speak the same language.
🎯 Ready to Act? Start Here 👇
💥 Step 1: Identify your top 10 Critical Data Elements.
💥 Step 2: Define a lightweight certification playbook — focus on quick wins.
💥 Step 3: Share success stories early. Visibility builds momentum.
Small, consistent actions will create an unstoppable movement toward trusted data.
✨ Final thought: In the age of AI and automation, trustworthy data isn’t a luxury — it’s your competitive advantage. Let’s make certified data the new standard for business excellence.
That's what you can practically learn during our CDO Masterclass sessions hosted by Kash Mehdi and Laurent Dresse ☁ (𝐒𝐞𝐚𝐬𝐨𝐧 12 is already open, registration link in comments)
#DataGovernance #DataQuality #CDO #DataProducts #AI #Metadata #Leadership #DataCertification #DataGalaxy #DataTrust
-
⸻
🔍 Mastering Slowly Changing Dimensions (SCD) in Data Warehousing 📊
Whether you’re designing a data warehouse or building robust ETL pipelines, understanding SCD types is essential for accurate historical tracking and reporting. Here’s a breakdown of all major SCD types, from SCD Type 0 to Type 6, with real-world use cases:
⸻
🔸 SCD Type 0 – Fixed (No Changes Allowed)
Use when data must never change.
🔹 Example: Date of Birth, National ID.
🔹 Ensures absolute consistency across systems.
⸻
🔸 SCD Type 1 – Overwrite (No History)
Simple and fast – just update the existing record.
🔹 Use Case: Correcting typos or non-critical attributes like email or phone.
⚠️ Note: Old value is lost.
⸻
🔸 SCD Type 2 – Add Row (Track Full History)
Adds a new row for every change with versioning, effective_date, and expiry_date.
🔹 Use Case: Employee role changes, address history, pricing evolution.
✅ Best for auditable and time-travel scenarios.
⸻
🔸 SCD Type 3 – Add Column (Limited History)
Adds a new column for the previous value (see the sketch after this post).
🔹 Use Case: Storing current and previous department or region.
⚠️ Not suitable if many changes need to be tracked.
⸻
🔸 SCD Type 4 – History Table (Separate History Storage)
Current table holds only the latest data. Historical changes are stored in a dedicated history table.
🔹 Use Case: Performance-heavy systems, regulatory audits.
⸻
🔸 SCD Type 5 – Mini-Dimension + Type 1
High-frequency changing attributes are moved to a mini-dimension, with Type 1 handling in the main fact table.
🔹 Use Case: Frequent but low-impact attributes like marital status, job grade.
⸻
🔸 SCD Type 6 – Hybrid (Type 1 + 2 + 3)
Combines overwrite, row addition, and column tracking.
🔹 Use Case: Complex audit requirements + quick access to current & previous values.
✅ Most flexible and powerful, but implementation is more complex.
⸻
💡 Pro Tip: Use the right SCD strategy based on data volatility, reporting needs, and performance constraints. Not every dimension needs full history!
⸻
🔁 How do you handle historical data in your data warehouse?
💬 Share your experience or best practices below!
#DataEngineering #DataWarehouse #SCD #ETL #Databricks #Snowflake #AzureDataFactory #DimensionalModeling #SQL #BigData
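Since Type 2 is covered in the earlier SCD post, here is a minimal sketch of the less common Type 3 pattern in pandas: the current value moves into a dedicated “previous” column before being overwritten. The column names are illustrative conventions.

```python
# Minimal sketch of SCD Type 3: keep exactly one prior value in its own column.
import pandas as pd

dim_employee = pd.DataFrame([
    {"employee_id": 7, "current_department": "Finance", "previous_department": None},
])

def apply_scd3(dim: pd.DataFrame, employee_id: int, new_department: str) -> pd.DataFrame:
    """Shift the current value into the 'previous' column, then overwrite it."""
    mask = dim["employee_id"] == employee_id
    dim.loc[mask, "previous_department"] = dim.loc[mask, "current_department"]
    dim.loc[mask, "current_department"] = new_department
    return dim

dim_employee = apply_scd3(dim_employee, employee_id=7, new_department="Data & Analytics")
print(dim_employee)
# A second change would overwrite "Finance" in previous_department - the
# limited-history trade-off the post calls out versus Type 2's full row history.
```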
-
This used to be a joke. But with LLMs actually understanding JSON, unstructured data isn’t a liability anymore; it’s becoming the semantic layer for business context.
Instead of flattening JSON into tables, you can (see the sketch below):
→ Keep rich, nested business context in JSON.
→ Use LLMs to interpret and map it directly into semantic layers.
→ Let analysts query it naturally (NLQ, instructions), without first “fixing” the structure.
So yeah, what was once a database anti-pattern is starting to look like a business context pattern. Because “unstructured” data is no longer unreadable noise, it’s fuel for models.
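A minimal sketch of that pattern: a nested JSON record is handed to an LLM as-is, as context for a natural-language question, instead of being flattened into tables first. The order record, prompt wording, and call_llm placeholder are illustrative assumptions for whatever model client you actually use.

```python
# Minimal sketch: use nested JSON directly as LLM context for a natural-language query.
import json

order_event = {
    "order_id": "A-1042",
    "customer": {"segment": "enterprise", "region": "EMEA"},
    "items": [
        {"sku": "PLAN-PRO", "qty": 3, "unit_price": 49.0},
        {"sku": "ADDON-SSO", "qty": 1, "unit_price": 15.0},
    ],
    "status": {"fulfilled": False, "reason": "payment_review"},
}

question = "What is blocking this order, and what is its total value?"

prompt = (
    "You are a data assistant. Answer using only the JSON record below.\n\n"
    f"Record:\n{json.dumps(order_event, indent=2)}\n\n"
    f"Question: {question}"
)

def call_llm(prompt: str) -> str:
    """Placeholder for your model client (hosted API or local model)."""
    raise NotImplementedError("wire up your LLM client here")

print(prompt)  # inspect the prompt; swap print for call_llm(prompt) in practice
```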
Explore categories
- Hospitality & Tourism
- Productivity
- Finance
- Soft Skills & Emotional Intelligence
- Project Management
- Education
- Technology
- Leadership
- Ecommerce
- User Experience
- Recruitment & HR
- Customer Experience
- Real Estate
- Marketing
- Sales
- Retail & Merchandising
- Supply Chain Management
- Future Of Work
- Consulting
- Writing
- Economics
- Artificial Intelligence
- Employee Experience
- Healthcare
- Workplace Trends
- Fundraising
- Networking
- Corporate Social Responsibility
- Negotiation
- Communication
- Engineering
- Career
- Business Strategy
- Change Management
- Organizational Culture
- Design
- Innovation
- Event Planning
- Training & Development