Ensuring Data Quality

Explore top LinkedIn content from expert professionals.

  • View profile for Jan Beger

    Healthcare needs AI ... because it needs the human touch.

    85,669 followers

    This paper reviews how bias affects AI in healthcare and outlines strategies to detect and reduce such bias across the AI model lifecycle.

    1️⃣ Bias in healthcare AI often originates from human, data, algorithmic, or deployment-related factors, each introducing unique risks that can worsen health disparities.
    2️⃣ Implicit, systemic, and confirmation biases are introduced during data collection and model design due to unconscious attitudes or structural inequalities.
    3️⃣ Data biases like representation, sampling, and measurement issues stem from underrepresented populations or inconsistent data acquisition practices.
    4️⃣ Algorithmic biases, including aggregation and feature selection bias, often arise from decisions made during model development and preprocessing.
    5️⃣ Deployment-related biases like automation, feedback loop, and dismissal biases emerge from how clinicians interact with AI tools in practice.
    6️⃣ Mitigating bias requires a lifecycle approach spanning conception, data collection, preprocessing, algorithm development, deployment, and post-deployment surveillance.
    7️⃣ Effective mitigation involves team diversity, use of diverse and representative data, careful feature selection, subgroup testing, and fairness metrics like equalized odds and demographic parity.
    8️⃣ International bodies like WHO and regulators such as the FDA and Health Canada have issued frameworks emphasizing fairness, explainability, and ethical use in healthcare AI.
    9️⃣ Future directions include embedding DEI principles in AI development, expanding bias training, and integrating AI ethics into clinical education.

    ✍🏻 Fereshteh Hasanzadeh Alagoz, Colin B. Josephson, Gabriella Waters, Demilade Adedinsewo, Zahra Azizi, MD, MSc, James White. Bias recognition and mitigation strategies in artificial intelligence healthcare applications. npj Digital Medicine. 2025. DOI: 10.1038/s41746-025-01503-7
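    Since the post names demographic parity and equalized odds as fairness metrics, here is a minimal, illustrative Python sketch (not from the paper) of how those subgroup checks can be computed for a binary classifier. The column names, group variable, and synthetic data below are assumptions for demonstration only.

```python
import numpy as np
import pandas as pd

def fairness_report(df, group_col="sex", y_true="label", y_pred="prediction"):
    """Compute simple group-fairness gaps for a binary classifier.

    Demographic parity compares positive-prediction rates across groups;
    equalized odds compares true-positive and false-positive rates.
    Column names here are hypothetical placeholders."""
    rows = []
    for group, sub in df.groupby(group_col):
        rows.append({
            "group": group,
            "pos_rate": sub[y_pred].mean(),                      # P(pred = 1 | group)
            "tpr": sub.loc[sub[y_true] == 1, y_pred].mean(),     # true-positive rate
            "fpr": sub.loc[sub[y_true] == 0, y_pred].mean(),     # false-positive rate
        })
    report = pd.DataFrame(rows).set_index("group")
    gaps = report.max() - report.min()                            # gap across subgroups
    return report, gaps

# Toy usage with synthetic data
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "sex": rng.choice(["F", "M"], 1000),
    "label": rng.integers(0, 2, 1000),
    "prediction": rng.integers(0, 2, 1000),
})
report, gaps = fairness_report(demo)
print(report)
print("Gaps (closer to 0 is fairer):\n", gaps)
```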

  • View profile for Antonio Vizcaya Abdo
    Antonio Vizcaya Abdo is an Influencer

    LinkedIn Top Voice | Sustainability Advocate & Speaker | ESG Strategy, Governance & Corporate Transformation | Professor & Advisor

    118,976 followers

    Criteria for Climate & Net Zero Reporting 🌎

    This framework, developed by KPMG, provides a clear and practical benchmark to evaluate the quality of corporate climate and net zero disclosures. It is based on international best practices and draws heavily from the recommendations of the Task Force on Climate-related Financial Disclosures (TCFD). The criteria are grouped into four focus areas: Governance, Risk Identification, Impacts, and Net Zero Transition. Each area outlines specific reporting elements that reflect the maturity and robustness of a company’s approach to climate-related issues.

    Under Governance, disclosures should show that board-level responsibility has been assigned to oversee climate matters. In addition, climate risks should be referenced in the Chair or CEO’s message, and the company should clearly acknowledge climate change as a material financial risk.

    In the Risk Identification category, strong reporting includes a dedicated climate risk section in the annual report or a standalone TCFD-aligned report. It should also cover both physical risks (e.g., extreme weather) and transitional risks (e.g., policy shifts or market changes).

    Impacts criteria emphasize the importance of scenario analysis to understand how different climate outcomes could affect the business. Companies are expected to report using multiple warming scenarios and clear timeframes, relying on reputable sources such as the IPCC or IEA.

    The Net Zero Transition section highlights the need for science-based or net zero targets. A credible strategy for decarbonization should be disclosed, including the actions the company will take and the timelines involved. Transparent progress tracking is also essential: disclosures should communicate whether the company is on track to meet its targets, identifying any challenges or adjustments made along the way. Finally, the use of an internal carbon price is seen as a strong indicator of preparedness for future regulation. It demonstrates that climate-related financial risks are being factored into planning and investment decisions.

    Source: KPMG
    #sustainability #sustainable #business #esg #reporting

  • View profile for Kevin Hartman

    Associate Teaching Professor at the University of Notre Dame, Former Chief Analytics Strategist at Google, Author "Digital Marketing Analytics: In Theory And In Practice"

    24,199 followers

    ChatGPT can be a great data cleaning tool. But most analysts let it ruin their data. They upload a messy CSV and give a bad command: "Clean this." The LLM will "clean" it by making massive, undocumented assumptions. It will silently delete outliers. It will hallucinate standardizations. It will turn messy data into wrong data.

    Stop asking LLMs to be your data janitor. Start directing them to be your data engineer. Instead of asking for a clean dataset, ask for an executable script in Python or R that you can audit, trust, and scale to millions of rows.

    Here is the 3-part framework for a perfect Data Transformation Blueprint using an LLM:

    1. Provide the Schema. Never ask an LLM to write code for data it cannot see. You must provide the context. Paste the output of `df.info()` [if you use Python] or `str(df)` / `glimpse(df)` and `head(df, 5)` [if you prefer R] so the LLM knows your column types. Pasting the first five rows lets it see the messy reality.

    2. Separate Logic from Engineering. Don't just say "clean this." Bifurcate your instructions. Tell it WHAT to do (business rules: "standardize dates to YYYY-MM-DD") and HOW to do it (engineering standards: "use vectorized operations, not loops").

    3. Show, Don't Tell. Complex text cleaning requires complex Regex. Don't try to describe it. Use examples. Show it: "Input: 'Calif.' -> Output: 'CA'". The LLM will deduce the pattern and write the complex code for you.

    If your data foundation is cracked by a bad prompt, your advanced models will just generate noise. Use an LLM to clean your data the right way and free yourself up to do the more important work of analysis and interpretation.

    Art+Science Analytics Institute | University of Notre Dame | University of Notre Dame - Mendoza College of Business | University of Illinois Urbana-Champaign | University of Chicago | D'Amore-McKim School of Business at Northeastern University | ELVTR | Grow with Google - Data Analytics #Analytics #DataStorytelling
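    As a concrete illustration of the kind of auditable, executable script this blueprint asks the LLM to produce, here is a minimal Python sketch. It applies the post's two example rules (dates to YYYY-MM-DD, 'Calif.' -> 'CA') with vectorized pandas operations; the column names, sample rows, and extra state mappings are hypothetical, not from the post.

```python
import pandas as pd

# "Show, don't tell": the examples define the normalization rule
STATE_MAP = {"Calif.": "CA", "Tex.": "TX", "N.Y.": "NY"}

def clean(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Business rule: standardize dates to YYYY-MM-DD (vectorized, no loops)
    out["order_date"] = pd.to_datetime(out["order_date"], errors="coerce").dt.strftime("%Y-%m-%d")
    # Business rule: normalize state abbreviations; unknown values pass through unchanged
    out["state"] = out["state"].str.strip().replace(STATE_MAP)
    # Engineering standard: report what could not be parsed instead of silently dropping it
    print(f"Unparseable dates: {out['order_date'].isna().sum()}")
    return out

# Toy usage with hypothetical messy rows
raw = pd.DataFrame({
    "order_date": ["2024/01/05", "2024/01/07", "not a date"],
    "state": [" Calif.", "Tex.", "WA"],
})
print(clean(raw))
```

    Because the logic lives in a script rather than in an opaque chat response, every assumption can be reviewed, version-controlled, and rerun on millions of rows.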

  • View profile for Dr. Sebastian Wernicke

    Driving growth & transformation with data & AI | Partner at Oxera | Best-selling author | 3x TED Speaker

    11,233 followers

    All data ultimately has a human source: it is not collected, but created. Data-savvy leaders understand this nuance.

    Decision infrastructures are often built on the premise that data is objective, definitive, and value-neutral. This leads organizations to treat data as an infallible compass. However, every byte of information springs from human actions, decisions, interactions, goals, and biases. Customer data, for example, doesn't just show behavior but reflects how people navigate interfaces we've designed, within constraints we've established. Even pristine financial data carries the imprint of human judgment, from revenue recognition timing to expense categorization, codified in vast accounting guidelines but human-made nonetheless.

    Does this mean data is just subjective figures open to any conclusion? Of course not! It means that for proper understanding and interpretation, data's context is vital. All that metadata and methodology documentation isn't a footnote, but a crucial user's manual. Even the most carefully constructed dataset can be misinterpreted without proper context.

    This demands a targeted response. Implementing the following five specific structural changes can help address this reality:

    1️⃣ Make the documentation of collection methods, decision points, known biases, and limitations a part of your data quality metrics.
    2️⃣ For major decisions, require stakeholders to articulate which assumptions the data implicitly reflects and how changes would affect conclusions.
    3️⃣ Pair data specialists with subject matter experts who understand the contexts generating the data. Formalize this collaboration for critical insights.
    4️⃣ Integrate behavioral variables into risk assessment by testing how human motivations could invalidate data patterns. Create alternate scenarios for more robust strategies.
    5️⃣ Establish mechanisms to test data-derived insights against lived experiences, where frontline observations can challenge or validate data-based conclusions.

    When businesses acknowledge that humans shape every piece of data, they gain insights that others miss and avoid misinterpretations, strategic missteps and compliance failures (like algorithmic bias). Success comes not from making data more human-friendly, but from recognizing data as fundamentally human in the first place.

  • View profile for José Manuel de la Chica
    José Manuel de la Chica is an Influencer

    Global Head of Santander AI Lab | Leading frontier AI with responsibility. Shaping the future with clarity and purpose.

    15,041 followers

    AI meets Consensus? A New Consensus Framework that Makes Models More Reliable and Collaborative.

    This paper addresses the challenge of ensuring the reliability of LLMs in high-stakes domains such as healthcare, law, and finance. Traditional methods often depend on external knowledge bases or human oversight, which can limit scalability. To overcome this, the author proposes a novel framework that repurposes ensemble methods for content validation through model consensus.

    Key Findings:
    - Improved Precision: In tests involving 78 complex cases requiring factual accuracy and causal consistency, the framework increased precision from 73.1% to 93.9% with two models (95% CI: 83.5%-97.9%) and to 95.6% with three models (95% CI: 85.2%-98.8%).
    - Inter-Model Agreement: Statistical analysis showed strong inter-model agreement (κ > 0.76), indicating that while models often concurred, their independent errors could be identified through disagreements.
    - Scalability: The framework offers a clear pathway to further enhance precision with additional validators and refinements, suggesting its potential for scalable deployment.

    Relevance to Multi-Agent and Collaborative AI Architectures: This framework is particularly pertinent to multi-agent systems and collaborative AI architectures for several reasons:
    - Enhanced Reliability: By leveraging consensus among multiple models, the system can achieve higher reliability, which is crucial in collaborative environments where decisions are based on aggregated outputs.
    - Error Detection: The ability to detect errors through model disagreement allows for more robust systems where agents can cross-verify information, reducing the likelihood of propagating incorrect data.
    - Scalability Without Human Oversight: The framework's design minimizes the need for human intervention, enabling scalable multi-agent systems capable of operating autonomously in complex, high-stakes domains.

    In summary, the proposed ensemble validation framework offers a promising approach to improving the reliability of LLMs, with significant implications for the development of dependable multi-agent AI systems. https://lnkd.in/d8is44jk
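    To make the consensus idea concrete, here is a minimal Python sketch of generic consensus-based validation: several independent validator models judge the same claim, and disagreement is the signal used to catch independent errors. This is an illustration of the general technique, not the paper's implementation; `query_model` is a hypothetical stub standing in for whatever LLM API you actually call.

```python
from collections import Counter
from typing import Callable, List

def query_model(model_name: str, claim: str) -> str:
    """Hypothetical stub: ask one model whether the claim is supported.

    A real implementation would call an LLM API and parse the answer
    into 'valid' or 'invalid'."""
    raise NotImplementedError

def consensus_validate(claim: str,
                       models: List[str],
                       ask: Callable[[str, str], str] = query_model,
                       require_unanimous: bool = True) -> dict:
    votes = {m: ask(m, claim) for m in models}
    tally = Counter(votes.values())
    if require_unanimous:
        accepted = len(tally) == 1 and "valid" in tally
    else:
        accepted = tally.get("valid", 0) > len(models) / 2   # simple majority
    return {
        "claim": claim,
        "votes": votes,
        "accepted": accepted,
        # Disagreement is the useful signal: route these cases to review.
        "needs_review": len(tally) > 1,
    }

# Example usage with a fake validator, for demonstration only
fake = lambda model, claim: "valid" if "2 + 2 = 4" in claim else "invalid"
print(consensus_validate("The statement 2 + 2 = 4 is correct.", ["model-a", "model-b"], ask=fake))
```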

  • View profile for Andy Werdin

    Director Logistics Analytics & Network Strategy | Designing data-driven supply chains for mission-critical operations (e-commerce, industry, defence) | Python, Analytics, and Operations | Mentor for Data Professionals

    32,981 followers

    Struggling with messy datasets? Let Python do the heavy lifting for you!

    1. Start with Pandas: Pandas is your main library for data manipulation. Use it to load data, handle missing values, and perform all kinds of transformations. Its simple syntax makes complex tasks easier.
    2. Handle Missing Data: Use Pandas functions like `isnull()`, `fillna()`, and `dropna()` to identify and manage missing values. Decide whether to fill gaps, interpolate data, or remove incomplete rows.
    3. Normalize and Transform: Clean up inconsistent data formats using Pandas and NumPy. Functions like `lower()`, `pd.to_datetime()`, and `apply()` help you standardize and transform data efficiently.
    4. Detect and Remove Duplicates: Ensure data integrity by finding and removing duplicates with Pandas `duplicated()` and `drop_duplicates()` functions.
    5. Regex for Text Cleaning: Use regular expressions (regex) to clean and standardize text data. Python's `re` library and Pandas `replace()` function are perfect for removing unwanted characters and patterns.
    6. Automate with Scripts: Write Python scripts to automate repetitive cleaning tasks. Automation saves time and ensures consistency across your data-cleaning processes.
    7. Validate Your Data: Always validate your cleaned data. Check for consistency and completeness. Use descriptive statistics and visualizations to confirm your data is ready for analysis.
    8. Document Your Cleaning Process: Keeping detailed records helps maintain transparency and allows others to understand your steps and reasoning.

    Using Python for data cleaning will enhance your efficiency, ensure data quality, and generate accurate insights for your stakeholders. Have you tried automating your cleaning tasks? How did it go?
    ----------------
    ♻️ Share if you find this post useful
    ➕ Follow for more daily insights on how to grow your career in the data field
    #dataanalytics #datascience #python #datacleaning #careergrowth
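    Here is a minimal, illustrative pandas sketch that strings together the functions named in steps 1-5 and 7 above. The DataFrame, column names, and regex pattern are hypothetical examples, not from the post.

```python
import pandas as pd

# Step 1: load data with pandas (an inline frame here; read_csv/read_excel in practice)
df = pd.DataFrame({
    "customer": ["  Alice ", "BOB", "BOB", None],
    "signup_date": ["2024-01-03", "2024-02-10", "2024-02-10", "2024-02-11"],
    "city": ["Berlin!", "Hamburg", "Hamburg", None],
})

# Step 2: handle missing data
print(df.isnull().sum())                          # where are the gaps?
df["city"] = df["city"].fillna("unknown")         # fill gaps ...
df = df.dropna(subset=["customer"])               # ... or drop rows missing key fields

# Step 3: normalize and transform
df["customer"] = df["customer"].str.lower().str.strip()
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# Step 4: detect and remove duplicates
print(f"Duplicates found: {df.duplicated().sum()}")
df = df.drop_duplicates()

# Step 5: regex for text cleaning - keep letters, spaces, and hyphens only
df["city"] = df["city"].str.replace(r"[^A-Za-z\s-]", "", regex=True)

# Step 7: validate before handing the data off
print(df.describe(include="all"))
```

    Wrapping these steps in a script or function (step 6) is what makes the process repeatable and documentable (step 8).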

  • View profile for Pooja Jain
    Pooja Jain is an Influencer

    Storyteller | Lead Data Engineer @ Wavicle | LinkedIn Top Voice 2025, 2024 | Globant | LinkedIn Learning Instructor | 2x GCP & AWS Certified | LICAP’2022

    183,810 followers

    Data Quality isn't boring, it's the backbone of data outcomes! Let's dive into some real-world examples that highlight why these six dimensions of data quality are crucial in our day-to-day work.

    1. Accuracy: I once worked on a retail system where a misplaced minus sign in the ETL process led to inventory levels being subtracted instead of added. The result? A dashboard showing negative inventory, causing chaos in the supply chain and a very confused warehouse team. This small error highlighted how critical accuracy is in data processing.

    2. Consistency: In a multi-cloud environment, we had customer data stored in AWS and GCP. The AWS system used 'customer_id' while GCP used 'cust_id'. This inconsistency led to mismatched records and duplicate customer entries. Standardizing field names across platforms saved us countless hours of data reconciliation and improved our data integrity significantly.

    3. Completeness: At a financial services company, we were building a credit risk assessment model. We noticed the model was unexpectedly approving high-risk applicants. Upon investigation, we found that many customer profiles had incomplete income data, exposing the company to significant financial losses.

    4. Timeliness: Consider a real-time fraud detection system for a large bank. Every transaction is analyzed for potential fraud within milliseconds. One day, we noticed a spike in fraudulent transactions slipping through our defenses. We discovered that our real-time data stream was experiencing intermittent delays of up to 2 minutes. By the time some transactions were analyzed, the fraudsters had already moved on to their next target.

    5. Uniqueness: A healthcare system I worked on had duplicate patient records due to slight variations in name spelling or date format. This not only wasted storage but, more critically, could have led to dangerous situations like conflicting medical histories. Ensuring data uniqueness was not just about efficiency; it was a matter of patient safety.

    6. Validity: In a financial reporting system, we once had a rogue data entry that put a company's revenue in billions instead of millions. The invalid data passed through several layers before causing a major scare in the quarterly report. Implementing strict data validation rules at ingestion saved us from potential regulatory issues.

    Remember, as data engineers, we're not just moving data from A to B. We're the guardians of data integrity. So next time someone calls data quality boring, remind them: without it, we'd be building castles on quicksand. It's not just about clean data; it's about trust, efficiency, and ultimately, the success of every data-driven decision our organizations make. It's the invisible force keeping our data-driven world from descending into chaos, as well depicted by Dylan Anderson.

    #data #engineering #dataquality #datastrategy
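    A few of these dimensions can be checked automatically at ingestion. Below is a small, illustrative pandas sketch (not from the post) covering completeness, uniqueness, validity, and the negative-inventory accuracy case; the column names, thresholds, and sample values are hypothetical.

```python
import pandas as pd

def quality_checks(df: pd.DataFrame) -> dict:
    return {
        # Completeness: share of non-missing values per column
        "completeness": (1 - df.isnull().mean()).round(3).to_dict(),
        # Uniqueness: duplicate rows that could hide conflicting records
        "duplicate_rows": int(df.duplicated().sum()),
        # Validity: revenue should sit in a plausible range (catches billions-vs-millions)
        "invalid_revenue": int((~df["revenue"].between(0, 1e9)).sum()),
        # Accuracy proxy: inventory should never be negative
        "negative_inventory": int((df["inventory"] < 0).sum()),
    }

# Toy data echoing two of the stories above
demo = pd.DataFrame({
    "revenue": [1_200_000, 3_500_000_000_000, 900_000],   # the "billions instead of millions" entry
    "inventory": [42, -7, 13],                            # the misplaced minus sign
})
print(quality_checks(demo))
```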

  • View profile for Chris French

    Helping you excel your analytics career | Linked[in] Instructor

    91,942 followers

    One of my favorite ways to clean data in Excel is to use the SORT and UNIQUE functions.

    Here’s a challenge: We have a list of sales reps and their regions, but the list includes duplicates. We want to see each rep once, with their corresponding region, sorted alphabetically.

    Using the following formula:
    =SORT(UNIQUE(Table1[[Region]:[Sales Rep]]), 2, TRUE)

    How it works:
    - UNIQUE removes duplicates, returning one clean record per rep and region.
    - SORT then organizes those names A–Z (the “2” means sort by the second column, which is the rep name).

    And just like that, we get exactly what we need! If you haven’t used SORT, UNIQUE, or other dynamic array formulas, I highly recommend doing so! Any other functions or parts of Excel you’d like to see?
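    For readers working outside Excel, here is an illustrative pandas equivalent of the same dedupe-and-sort step; the DataFrame and column names simply mirror the post's example and are not from the original.

```python
import pandas as pd

# Hypothetical rep/region list with duplicates, as in the challenge above
reps = pd.DataFrame({
    "Region": ["West", "East", "West", "South"],
    "Sales Rep": ["Ana", "Bob", "Ana", "Cara"],
})

# UNIQUE -> drop_duplicates, SORT by the second column -> sort_values("Sales Rep")
clean = reps.drop_duplicates().sort_values("Sales Rep", ignore_index=True)
print(clean)
```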

  • View profile for Shobha Moni

    25+ Years Transforming Businesses with ERP Systems | Partner Founder at Triad Software Services (award-winning Sage partner) | Digital Transformation Leader

    21,020 followers

    I’ve audited 120+ ERP data migrations in the last 5 years. 80% of them failed.

    And most ERP failures are not because it’s SAP, Oracle, or Dynamics. Not even the custom build from 2012. They fail because the data going in was never cleaned.

    Here’s what I keep seeing (even in $10M+ projects). In 80% of failed ERP migrations, I found:
    ☠️ UOM mismatches that break inventory.
    ☠️ Customer and vendor duplicates.
    ☠️ Zombie SKUs and dead warehouses.
    ☠️ Orphaned transactions.
    ☠️ No audit trail of what got transformed.

    Here’s my Data Migration Checklist (to use before go-live):

    ✅ Units of Measure (UOM):
    → Are all UOMs mapped 1:1 between legacy and new ERP?
    → Have we tested conversion logic in live transactions?

    ✅ Master Data Uniqueness:
    → Do we have duplicate SKUs, vendors, or customers?
    → What’s the deduplication logic? Who owns it?

    ✅ Historical Data Mapping:
    → Are all past transactions (GR/IR, payments, returns) traceable?
    → Can we audit them after go-live?

    ✅ Open Transactions Review:
    → How many open POs, SOs, GRNs exist in legacy?
    → Who validated carry-forward rules?

    ✅ Dummy Runs with Real Data:
    → Did we run full-cycle transactions with migrated data in UAT?
    → Were accounting, tax, and inventory balances reconciled?

    ✅ Cleanup Ownership:
    → Who is responsible for final data sign-off: IT or Finance?
    → Is it documented?

    I think ERP is not an Excel import. It’s a financial and operational rebirth. And the data is either your foundation or your downfall.

    How confident are you in the quality of the data being loaded into your next ERP?

    ♻️ REPOST so others can learn.
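    The first two checklist items lend themselves to automated pre-go-live checks. Here is a small, illustrative Python sketch of what such checks might look like; the tables, column names, and values are hypothetical and not from the post.

```python
import pandas as pd

# Hypothetical extracts from the legacy system and the migration mapping
legacy_items = pd.DataFrame({"sku": ["A1", "A2", "A3"], "legacy_uom": ["EA", "CS", "BX"]})
uom_map = pd.DataFrame({"legacy_uom": ["EA", "CS", "CS"], "new_uom": ["EACH", "CASE", "CARTON"]})

# UOM check: every legacy UOM must map to exactly one new UOM
merged = legacy_items.merge(uom_map, on="legacy_uom", how="left")
unmapped = merged.loc[merged["new_uom"].isna(), "legacy_uom"].unique()
fanout = uom_map.groupby("legacy_uom")["new_uom"].nunique()
print("Unmapped UOMs:", list(unmapped))                                   # e.g. ['BX']
print("UOMs mapped to more than one target:", list(fanout[fanout > 1].index))  # e.g. ['CS']

# Master data uniqueness: same normalized name = candidate duplicate for review
vendors = pd.DataFrame({"vendor_id": [10, 11, 12], "name": ["Acme GmbH", " acme gmbh ", "Beta AG"]})
vendors["name_key"] = vendors["name"].str.lower().str.strip()
dupes = vendors[vendors.duplicated("name_key", keep=False)]
print("Candidate duplicate vendors:\n", dupes[["vendor_id", "name"]])
```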

  • View profile for Felipe Daguila
    Felipe Daguila is an Influencer

    Helping enterprises simplify and accelerate their transformation through sustainable, net-positive business models | Climate Tech, Sustainability & AI enthusiast

    18,585 followers

    Interesting article from the Financial Times: carbon reporting is broken, but a bold idea from the world of accounting might be the fix.

    The Financial Times just spotlighted “E-liability accounting,” a new model from Harvard Business School’s @Bob Kaplan and University of Oxford’s Karthik Ramanna that tracks CO₂ like cash: emissions are measured at the source, tagged to each product, and passed down the value chain with audit-grade precision. Think blockchain meets double-entry bookkeeping.

    In a Hitachi Energy pilot, tracing just copper through three supplier tiers captured 80% of emissions in a transformer coil, and it turns out virgin copper smelted with hydropower beat recycled copper made with coal. Scope 3 averages can mislead; real data reveals the truth.

    It is an interesting model because it combines climate integrity with financial logic. It’s transparent, product-specific, decision-ready, and a huge leap from the fuzzy spreadsheets and spend-based data many companies are still stuck with. But it only works if every player joins the ledger. I already work with several customers where every invoice carries a carbon number.

    https://lnkd.in/g2ihc_fd
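    To illustrate the mechanism as described above (and only as described; this is not Kaplan and Ramanna's actual method), here is a toy Python sketch of emissions being measured at each tier and passed down the chain like a running balance. All tier names and numbers are made up.

```python
def pass_down(upstream_kg_co2: float, own_process_kg_co2: float) -> float:
    """Carbon carried by a product = what arrived with its inputs + what this tier adds."""
    return upstream_kg_co2 + own_process_kg_co2

# Three hypothetical supplier tiers for a transformer coil: mine -> smelter -> coil maker
mine = pass_down(0.0, 120.0)       # copper ore extraction
smelter = pass_down(mine, 40.0)    # smelting with hydropower (low added CO2)
coil = pass_down(smelter, 25.0)    # winding the coil

print(f"Carbon number on the coil invoice: {coil} kg CO2e")
```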
