Strategies for Ensuring Data Quality

Explore top LinkedIn content from expert professionals.

  • View profile for Amanjeet Singh

    Seasoned AI, analytics and cloud software business leader, currently Head of Strategy & Operations and Strategic Business Unit Leader at Axtria Inc.

    6,169 followers

    Managing data quality is critical in the pharma industry because poor data quality leads to inaccurate insights, missed revenue opportunities, and compliance risks. Various studies estimate the industry loses between $15 million and $25 million annually per company due to poor data quality. To mitigate these challenges, the industry can adopt AI-driven data cleansing, enforce master data management (MDM) practices, and implement real-time monitoring systems to proactively detect and address data issues. There are several options that I have listed below:

    Automated Data Reconciliation: Set up an automated, AI-enabled reconciliation process that compares expected vs. actual data received from syndicated data providers. By cross-referencing historical data or other data sources (such as direct sales reports or CRM systems), discrepancies like missing accounts can be quickly identified (a minimal sketch follows this post).

    Data Quality Dashboards: Create real-time dashboards that display prescription data from key accounts, highlighting gaps or missing data as soon as they occur. These dashboards can be designed with alerts that notify the relevant teams when an expected data point is missing.

    Proactive Exception Reporting: Implement exception reports that flag missing or incomplete data. By establishing business rules for prescription data based on historical trends and account importance, any deviation from the norm (like missing data from key accounts) can trigger alerts for further investigation.

    Data Quality Checks at the Source: Develop specific data quality checks within the data ingestion pipeline that assess the completeness of account-level prescription data from syndicated data providers. If key account data is missing, this triggers a notification to the data management team for immediate follow-up with the data providers.

    Redundant Data Sources: To cross-check, leverage additional data providers or internal data sources (such as sales team reports or pharmacy-level data). By comparing datasets, missing data from syndicated data providers can be quickly identified and verified.

    Data Stewardship and Monitoring: Assign data stewards or a dedicated team to monitor data feeds from syndicated data providers. These stewards can track patterns in missing data and work closely with data providers to resolve any systemic issues.

    Regular Audits and SLA Agreements: Establish a service level agreement (SLA) with data providers that includes specific penalties or remedies for missing or delayed data from key accounts. Regularly auditing the data against these SLAs ensures timely identification and correction of missing prescription data.

    By addressing data quality challenges with advanced technologies and robust management practices, the industry can reduce financial losses, improve operational efficiency, and ultimately enhance patient outcomes.
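    To make the reconciliation and exception-reporting ideas concrete, here is a minimal sketch in Python/pandas of an expected-vs-actual comparison; the account IDs, column names, and key-account flag are hypothetical, and a production version would read from the actual CRM and syndicated feeds rather than inline data.

    ```python
    import pandas as pd

    # Hypothetical feeds: a CRM account master and the latest syndicated prescription file.
    crm_accounts = pd.DataFrame({
        "account_id": ["A001", "A002", "A003", "A004"],
        "is_key_account": [True, True, False, True],
    })
    syndicated_rx = pd.DataFrame({
        "account_id": ["A001", "A003"],
        "trx_count": [120, 45],
    })

    def reconcile(crm: pd.DataFrame, rx: pd.DataFrame) -> pd.DataFrame:
        """Return accounts expected in the syndicated feed but missing from it."""
        merged = crm.merge(rx, on="account_id", how="left", indicator=True)
        missing = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
        # Escalate key accounts first; everything else goes to a routine exception report.
        return missing.sort_values("is_key_account", ascending=False)

    exceptions = reconcile(crm_accounts, syndicated_rx)
    if exceptions["is_key_account"].any():
        print("ALERT: key accounts missing from syndicated feed")
    print(exceptions)
    ```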

  • View profile for Chad Sanderson

    CEO @ Gable.ai (Shift Left Data Platform)

    89,317 followers

    Many companies talk about implementing data contracts and shifting left, but Zakariah S. and the team at Glassdoor have actually done it. In an article published earlier today, the Glassdoor Data Platform team goes in-depth about how they have started driving data quality from the source through data contracts, proactive monitoring/observability, and Data DevOps.

    Here's a great quote from the article on the value of Shifting Left:

    "This approach offers many benefits, but the top four we’ve observed are:
    Data Quality by Design: Incorporating data quality checks early in the lifecycle helps prevent bad data from entering production systems.
    Fewer Downstream Breakages: By resolving potential issues closer to the source, the entire data pipeline becomes more resilient and less susceptible to cascading failures.
    Stronger Collaboration: Equipping product engineers with tools, frameworks, and guidelines to generate high-quality data nurtures a closer partnership between data producers and consumers.
    Cost & Time Efficiency: Preventing bad data is significantly cheaper than diagnosing and fixing it after propagating across multiple systems.
    These were the foundational principles upon which our motivation for shifting left was achieved."

    Glassdoor achieved this through six primary technology investments:

    Data Contracts (Gable.ai): Define clear specifications for fields, types, and constraints, ensuring product engineers are accountable for data quality from the start (a generic sketch follows this post).
    Static Code Analysis (Gable.ai): Integrated with GitLab/GitHub and Bitrise to catch and block problematic data changes before they escalate downstream.
    LLMs for Anomaly Detection (Gable.ai): Identify subtle issues (e.g., swapped field names) that may not violate contracts but could lead to data inconsistencies.
    Schema Registry (Confluent): Screens incoming events, enforcing schema validation and directing invalid data to dead-letter queues to keep pipelines clean.
    Real-time Monitoring (DataDog): Provides continuous feedback loops to detect and resolve issues in real time.
    Write-Audit-Publish (WAP) / Blue-Green Deployment: Ensures each data batch passes through a staging area before being promoted to production, isolating risks before they impact downstream consumers.

    "By addressing the psychological dimension of trust through shared responsibility, transparent validation, and confidence-building checks, we’re scaling to petabytes without compromising our data’s essential sense of faith. Ultimately, this combination of technical rigor and cultural awareness empowers us to build resilient, trustworthy data systems — one contract, one check, and one validation at a time."

    It's a fascinating article and insight into incredibly sophisticated thinking around data quality and governance. You can check out the link below: https://lnkd.in/d-ADip42 Good luck!
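    To illustrate the data-contract idea in general terms (this is not Gable.ai's actual API, just a hedged sketch), a contract can be thought of as a spec of field names, types, and required flags that events are validated against before they enter the pipeline. The event shape and field names below are hypothetical.

    ```python
    from typing import Any

    # A toy contract: field name -> (expected type, required?). Field names are illustrative.
    JOB_VIEW_CONTRACT: dict[str, tuple[type, bool]] = {
        "job_id": (str, True),
        "user_id": (str, True),
        "viewed_at": (str, True),   # ISO-8601 timestamp kept as a string in this sketch
        "referrer": (str, False),
    }

    def validate_event(event: dict[str, Any], contract: dict[str, tuple[type, bool]]) -> list[str]:
        """Return a list of contract violations; an empty list means the event passes."""
        violations = []
        for field, (expected_type, required) in contract.items():
            if field not in event:
                if required:
                    violations.append(f"missing required field: {field}")
                continue
            if not isinstance(event[field], expected_type):
                violations.append(
                    f"{field}: expected {expected_type.__name__}, got {type(event[field]).__name__}"
                )
        for field in event:
            if field not in contract:
                violations.append(f"unexpected field not in contract: {field}")
        return violations

    bad_event = {"job_id": "J-42", "user_id": 7, "referer": "email"}  # wrong type + misspelled field
    print(validate_event(bad_event, JOB_VIEW_CONTRACT))
    ```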

  • View profile for Joseph M.

    Data Engineer, startdataengineering.com | Bringing software engineering best practices to data engineering.

    47,741 followers

    It took me 10 years to learn about the different types of data quality checks; I'll teach it to you in 5 minutes:

    1. Check table constraints
    The goal is to ensure your table's structure is what you expect:
    * Uniqueness
    * Not null
    * Enum check
    * Referential integrity
    Ensuring the table's constraints is an excellent way to cover your data quality base.

    2. Check business criteria
    Work with the subject matter expert to understand what data users check for:
    * Min/Max permitted value
    * Order of events check
    * Data format check, e.g., check for the presence of the '$' symbol
    Business criteria catch data quality issues specific to your data/business.

    3. Table schema checks
    Schema checks ensure that no inadvertent schema changes happened:
    * Using an incorrect transformation function leading to a different data type
    * Upstream schema changes

    4. Anomaly detection
    Metrics change over time; ensure it's not due to a bug.
    * Check the percentage change of metrics over time
    * Use simple percentage change across runs
    * Use standard deviation checks to ensure values are within the "normal" range
    Detecting value deviations over time is critical for business metrics (revenue, etc.).

    5. Data distribution checks
    Ensure your data size remains similar over time.
    * Ensure row counts remain similar across days
    * Ensure critical segments of data remain similar in size over time
    Distribution checks help catch data lost to faulty joins or filters.

    6. Reconciliation checks
    Check that your output has the same number of entities as your input.
    * Check that your output didn't lose data due to buggy code

    7. Audit logs
    Log the number of rows input and output for each "transformation step" in your pipeline.
    * Having a log of the number of rows going in and coming out is crucial for debugging
    * Audit logs can also help you answer business questions
    Debugging data questions? Look at the audit log to see where data duplication/dropping happens.

    DQ warning levels: Make sure your data quality checks are tagged with appropriate warning levels (e.g., INFO, DEBUG, WARN, ERROR). Based on the criticality of the check, you can block the pipeline (a few of these checks are sketched after this post).

    Get started with the business and constraint checks, adding more only as needed. Before you know it, your data quality will skyrocket! Good luck!
    -
    Like this thread? Read about the types of data quality checks in detail here 👇
    https://lnkd.in/eBdmNbKE
    Please let me know what you think in the comments below. Also, follow me for more actionable data content.
    #data #dataengineering #dataquality
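    A minimal pandas sketch of a few of these checks (constraint checks, one business rule, a simple percentage-change anomaly check, and warning-level tagging) might look like the following; the table, allowed statuses, and 50% threshold are illustrative values, not recommendations from the post.

    ```python
    import pandas as pd

    orders = pd.DataFrame({
        "order_id": [1, 2, 2, 4],
        "status": ["paid", "paid", "refunded", "shipped"],
        "amount": [20.0, None, 15.0, 30.0],
    })
    ALLOWED_STATUSES = {"paid", "refunded", "shipped", "cancelled"}

    checks = {
        # 1. Constraint checks: uniqueness, not-null, enum
        "order_id is unique": orders["order_id"].is_unique,
        "amount has no nulls": orders["amount"].notna().all(),
        "status within enum": orders["status"].isin(ALLOWED_STATUSES).all(),
        # 2. Business criterion: amounts must be positive
        "amount is positive": (orders["amount"].dropna() > 0).all(),
    }

    # 4. Simple anomaly detection: flag a >50% swing in a metric between runs
    previous_revenue, current_revenue = 1000.0, 420.0
    pct_change = abs(current_revenue - previous_revenue) / previous_revenue
    checks["revenue within 50% of last run"] = pct_change <= 0.5

    for name, passed in checks.items():
        level = "INFO" if passed else "ERROR"   # tag each check with a warning level
        print(f"[{level}] {name}: {'PASS' if passed else 'FAIL'}")
    ```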

  • View profile for Pan Wu

    Senior Data Science Manager at Meta

    48,608 followers

    Ensuring data quality at scale is crucial for developing trustworthy products and making informed decisions. In this tech blog, the Glassdoor engineering team shares how they tackled this challenge by shifting from a reactive to a proactive data quality strategy.

    At the core of their approach is a mindset shift: instead of waiting for issues to surface downstream, they built systems to catch them earlier in the lifecycle. This includes introducing data contracts to align producers and consumers, integrating static code analysis into continuous integration and delivery (CI/CD) workflows, and even fine-tuning large language models to flag business logic issues that schema checks might miss.

    The blog also highlights how Glassdoor distinguishes between hard and soft checks, deciding which anomalies should block pipelines and which should raise visibility (a toy gate is sketched after this post). They adapted the concept of blue-green deployments to their data pipelines by staging data in a controlled environment before promoting it to production. To round it out, their anomaly detection platform uses robust statistical models to identify outliers in both business metrics and infrastructure health.

    Glassdoor’s approach is a strong example of what it means to treat data as a product: building reliable, scalable systems and making quality a shared responsibility across the organization.

    #DataScience #MachineLearning #Analytics #DataEngineering #DataQuality #BigData #MLOps #SnacksWeeklyonDataScience
    – – –
    Check out the "Snacks Weekly on Data Science" podcast and subscribe, where I explain in more detail the concepts discussed in this and future posts:
    -- Spotify: https://lnkd.in/gKgaMvbh
    -- Apple Podcast: https://lnkd.in/gj6aPBBY
    -- YouTube: https://lnkd.in/gcwPeBmR
    https://lnkd.in/gUwKZJwN
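    The hard/soft distinction can be pictured as a simple gate: hard-check failures block promotion to production, while soft-check failures only raise visibility. The check names, severities, and logging below are illustrative assumptions, not Glassdoor's actual implementation.

    ```python
    import logging

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("dq")

    # Illustrative results from upstream validations: (check name, passed?, severity)
    results = [
        ("schema matches contract", True, "hard"),
        ("row count within 10% of 7-day average", False, "soft"),
        ("no nulls in primary key", False, "hard"),
    ]

    def gate(results):
        """Block promotion to production only on hard-check failures; surface soft ones."""
        blockers = [name for name, ok, severity in results if not ok and severity == "hard"]
        for name, ok, severity in results:
            if not ok and severity == "soft":
                log.warning("soft check failed (pipeline continues): %s", name)
        if blockers:
            raise RuntimeError(f"hard checks failed, batch stays in staging: {blockers}")
        log.info("all hard checks passed, promoting batch")

    try:
        gate(results)
    except RuntimeError as err:
        log.error("%s", err)
    ```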

  • View profile for Ajay Patel

    Product Leader | Data & AI

    3,484 followers

    My AI was ‘perfect’—until bad data turned it into my worst nightmare.

    📉 By the numbers:
    85% of AI projects fail due to poor data quality (Gartner).
    Data scientists spend 80% of their time fixing bad data instead of building models.

    📊 What’s driving the disconnect?
    Incomplete or outdated datasets
    Duplicate or inconsistent records
    Noise from irrelevant or poorly labeled data

    The result? Faulty predictions, bad decisions, and a loss of trust in AI. Without addressing the root cause—data quality—your AI ambitions will never reach their full potential.

    Building Data Muscle: AI-Ready Data Done Right
    Preparing data for AI isn’t just about cleaning up a few errors—it’s about creating a robust, scalable pipeline. Here’s how:
    1️⃣ Audit Your Data: Identify gaps, inconsistencies, and irrelevance in your datasets.
    2️⃣ Automate Data Cleaning: Use advanced tools to deduplicate, normalize, and enrich your data (a small sketch follows this post).
    3️⃣ Prioritize Relevance: Not all data is useful. Focus on high-quality, contextually relevant data.
    4️⃣ Monitor Continuously: Build systems to detect and fix bad data after deployment.
    These steps lay the foundation for successful, reliable AI systems.

    Why It Matters
    Bad #data doesn’t just hinder #AI—it amplifies its flaws. Even the most sophisticated models can’t overcome the challenges of poor-quality data. To unlock AI’s potential, you need to invest in a data-first approach.

    💡 What’s Next?
    It’s time to ask yourself: Is your data AI-ready? The key to avoiding AI failure lies in your preparation (#innovation #machinelearning). What strategies are you using to ensure your data is up to the task? Let’s learn from each other.

    ♻️ Let’s shape the future together:
    👍 React
    💭 Comment
    🔗 Share
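    As a small illustration of steps 1 and 2 (audit, then automated cleaning), here is a hedged pandas sketch; the columns and records are made up, and a real pipeline would add enrichment, validation, and continuous monitoring on top.

    ```python
    import pandas as pd

    # Illustrative raw customer records with common problems:
    # duplicates, inconsistent casing, stray whitespace, missing values.
    raw = pd.DataFrame({
        "email": ["a@x.com", "A@X.COM", "b@y.com", None],
        "country": ["US", "us", "DE ", "US"],
        "signup_date": ["2024-01-05", "2024-01-05", "2024-02-10", "2024-03-01"],
    })

    def audit(df: pd.DataFrame) -> dict:
        """Step 1: quantify gaps and duplicates before touching anything."""
        return {
            "rows": len(df),
            "null_emails": int(df["email"].isna().sum()),
            "duplicate_emails": int(df["email"].str.lower().duplicated().sum()),
        }

    def clean(df: pd.DataFrame) -> pd.DataFrame:
        """Step 2: normalize, drop records that can't be keyed, deduplicate."""
        out = df.copy()
        out["email"] = out["email"].str.strip().str.lower()
        out["country"] = out["country"].str.strip().str.upper()
        out = out.dropna(subset=["email"]).drop_duplicates(subset=["email"])
        return out

    print(audit(raw))
    print(clean(raw))
    ```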

  • View profile for Durga Gadiraju

    GVP - AI, Data, and Analytics @ INFOLOB | Gen AI Evangelist & Thought Leader

    50,875 followers

    📊 How do you ensure the quality and governance of your data? In the world of data engineering, maintaining proper data governance and quality is critical for reliable insights. Let’s explore how Google Cloud Platform (GCP) can help.

    🌐 Data Governance and Quality in GCP
    Managing data governance and ensuring data quality are essential for making informed decisions and maintaining regulatory compliance. Here are some best practices for managing data governance and quality on GCP:

    Key Strategies for Data Governance:
    1. Centralized Data Management:
       - Data Catalog: Use Google Cloud’s Data Catalog to organize and manage metadata across your GCP projects. This tool helps you discover, classify, and document your datasets for better governance.
    2. Data Security and Compliance:
       - Encryption: Implement end-to-end encryption (both in transit and at rest) for all sensitive data. GCP provides encryption by default and allows you to manage your own encryption keys.
    3. Data Auditing and Monitoring:
       - Audit Logs: Enable Cloud Audit Logs to track access and changes to your datasets, helping you maintain an audit trail for compliance purposes.
       - Data Retention Policies: Implement policies to automatically archive or delete outdated data to ensure compliance with data retention regulations.

    Key Strategies for Data Quality:
    1. Data Validation:
       - Automated Checks: Use tools like Cloud Data Fusion to integrate automated data validation checks at every stage of your data pipelines, ensuring data integrity from source to destination.
       - Monitoring Data Quality: Set up alerts in Stackdriver Monitoring to notify you if data quality metrics (like completeness, accuracy, and consistency) fall below defined thresholds.
    2. Data Cleaning:
       - Cloud Dataprep: Use Cloud Dataprep for data cleaning and transformation before loading it into data warehouses like BigQuery. Ensure data is standardized and ready for analysis.
       - Error Handling: Build error-handling mechanisms into your pipelines to flag and correct data issues automatically.
    3. Data Consistency Across Pipelines:
       - Schema Management: Implement schema enforcement across your data pipelines to maintain consistency. Use BigQuery’s schema enforcement capabilities to ensure your data adheres to predefined formats (a sketch follows this post).

    Benefits of Data Governance and Quality:
    - Informed Decision-Making: High-quality, well-governed data leads to more accurate insights and better business outcomes.
    - Compliance: Stay compliant with regulations like GDPR, HIPAA, and SOC 2 by implementing proper governance controls.
    - Reduced Risk: Proper governance reduces the risk of data breaches, inaccuracies, and inconsistencies.

    📢 Stay Connected: Follow my LinkedIn profile for more tips on data engineering and GCP insights: https://zurl.co/lEpN

    #DataGovernance #DataQuality #GCP #DataEngineering #CloudComputing
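    For the schema-enforcement point, a minimal sketch using the google-cloud-bigquery client library is below. It assumes a GCP project with credentials already configured; the dataset, table, and Cloud Storage path are hypothetical placeholders.

    ```python
    from google.cloud import bigquery

    client = bigquery.Client()

    # Declaring REQUIRED fields makes BigQuery reject rows with missing values,
    # and explicit types block malformed or silently coerced columns.
    schema = [
        bigquery.SchemaField("account_id", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("rx_count", "INTEGER", mode="REQUIRED"),
        bigquery.SchemaField("report_date", "DATE", mode="REQUIRED"),
    ]

    job_config = bigquery.LoadJobConfig(
        schema=schema,
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        max_bad_records=0,  # fail the load on any row that violates the schema
    )

    load_job = client.load_table_from_uri(
        "gs://example-bucket/prescriptions/2024-06-01.csv",   # hypothetical path
        "example_project.sales_mart.daily_prescriptions",     # hypothetical table
        job_config=job_config,
    )
    load_job.result()  # raises if validation fails, keeping bad data out of the warehouse
    ```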

  • View profile for Manisha Lodha

    Follow me for GenAI, Agentic AI, Data related content | Chief Data Scientist | GenAI | I write to 74k+ followers | We need more WOMEN in DATA

    77,421 followers

    How robust is your organization's data governance framework?

    Data governance is crucial for ensuring data quality, security, and compliance across any organization. Here are key components that form the backbone of effective data governance:

    ✔ Data Stewardship: Managing and overseeing data to ensure its quality, consistency, and compliance with policies and regulations.
    ✔ Quality Management: Monitoring and improving the accuracy, completeness, and timeliness of data, ensuring trustworthiness and reliability.
    ✔ Data Ownership: Assigning responsibility and accountability for data assets to specific individuals or teams for proper management and governance.
    ✔ Data Classification: Categorizing data based on sensitivity, value, or risk to implement appropriate security and compliance measures.
    ✔ Retention & Archiving: Defining and implementing policies for data retention, storage, and disposal based on legal and business requirements.
    ✔ Privacy Compliance: Ensuring data management practices adhere to privacy laws and regulations like GDPR or CCPA.
    ✔ Lineage & Provenance: Tracking the flow of data through systems, ensuring accuracy and compliance by understanding data origin and transformations.
    ✔ Cataloging & Discovery: Creating a searchable repository for an inventory of data assets, facilitating easy access and management.
    ✔ Risk Management: Identifying, assessing, and mitigating data-related risks such as breaches or regulatory non-compliance.

    Implementing a comprehensive data governance strategy can significantly enhance an organization's ability to make data-driven decisions while ensuring compliance and reducing risks.

    Credits: Deepak Bhardwaj

  • View profile for Animesh Kumar

    CTO | DataOS: Data Products in 6 Weeks ⚡

    13,161 followers

    This visual captures how a 𝗠𝗼𝗱𝗲𝗹-𝗙𝗶𝗿𝘀𝘁, 𝗣𝗿𝗼𝗮𝗰𝘁𝗶𝘃𝗲 𝗗𝗮𝘁𝗮 𝗤𝘂𝗮𝗹𝗶𝘁𝘆 𝗖𝘆𝗰𝗹𝗲 breaks the limitations of reactive data quality maintenance and its overheads.

    📌 Let's break it down:

    𝗧𝗵𝗲 𝗮𝗻𝗮𝗹𝘆𝘀𝘁 𝘀𝗽𝗼𝘁𝘀 𝗮 𝗾𝘂𝗮𝗹𝗶𝘁𝘆 𝗶𝘀𝘀𝘂𝗲
    But instead of digging through pipelines or guessing upstream sources, they immediately access metadata-rich diagnostics. Think data contracts, semantic lineage, validation history.

    𝗧𝗵𝗲 𝗶𝘀𝘀𝘂𝗲 𝗶𝘀 𝗮𝗹𝗿𝗲𝗮𝗱𝘆 𝗳𝗹𝗮𝗴𝗴𝗲𝗱
    Caught at the ingestion or transformation layer by embedded validations.

    𝗔𝗹𝗲𝗿𝘁𝘀 𝗮𝗿𝗲 𝗰𝗼𝗻𝘁𝗲𝘅𝘁-𝗿𝗶𝗰𝗵
    No generic failure messages. Engineers see exactly what broke, whether it was an invalid assumption, a schema change, or a failed test (a toy example follows this post).

    𝗙𝗶𝘅𝗲𝘀 𝗵𝗮𝗽𝗽𝗲𝗻 𝗶𝗻 𝗶𝘀𝗼𝗹𝗮𝘁𝗲𝗱 𝗯𝗿𝗮𝗻𝗰𝗵𝗲𝘀 𝘄𝗶𝘁𝗵 𝗺𝗼𝗰𝗸𝘀 𝗮𝗻𝗱 𝘃𝗮𝗹𝗶𝗱𝗮𝘁𝗶𝗼𝗻𝘀
    Just like modern application development. Then they’re redeployed via CI/CD. This is non-disruptive to existing workflows.

    𝗙𝗲𝗲𝗱𝗯𝗮𝗰𝗸 𝗹𝗼𝗼𝗽𝘀 𝗸𝗶𝗰𝗸 𝗶𝗻
    Metadata patterns improve future anomaly detection. The system evolves.

    𝗨𝗽𝘀𝘁𝗿𝗲𝗮𝗺 𝘀𝘁𝗮𝗸𝗲𝗵𝗼𝗹𝗱𝗲𝗿𝘀 𝗮𝗿𝗲 𝗻𝗼𝘁𝗶𝗳𝗶𝗲𝗱 𝗮𝘂𝘁𝗼𝗺𝗮𝘁𝗶𝗰𝗮𝗹𝗹𝘆
    In most cases, they’re already resolving the root issue through the data product platform.

    ---

    This is what happens when data quality is owned at the model layer, not bolted on with monitoring scripts.
    ✔️ Root cause in minutes, not days
    ✔️ Failures are caught before downstream users are affected
    ✔️ Engineers and analysts work with confidence and context
    ✔️ If deployed, AI agents work with context and without hallucination
    ✔️ Data products become resilient by design

    This is the operational standard we’re moving toward: 𝗣𝗿𝗼𝗮𝗰𝘁𝗶𝘃𝗲, 𝗺𝗼𝗱𝗲𝗹-𝗱𝗿𝗶𝘃𝗲𝗻, 𝗰𝗼𝗻𝘁𝗿𝗮𝗰𝘁-𝗮𝘄𝗮𝗿𝗲 𝗱𝗮𝘁𝗮 𝗾𝘂𝗮𝗹𝗶𝘁𝘆. Reactive systems can’t support strategic decisions.

    🔖 If you're curious about the essence of "model-first", here's something for a deeper dive: https://lnkd.in/dWVzv3EJ

    #DataQuality #DataManagement #DataStrategy
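    One way to picture a "context-rich alert" is a validation result that carries the rule, the upstream source, and sample offending rows instead of a bare failure message. The sketch below is purely illustrative; the dataset name, upstream source, and rule are made-up assumptions, not part of any specific platform.

    ```python
    from dataclasses import dataclass, field

    @dataclass
    class QualityAlert:
        """A failed check that carries its own diagnostic context."""
        dataset: str
        rule: str
        upstream_source: str
        sample_bad_rows: list = field(default_factory=list)

        def render(self) -> str:
            return (f"[DQ] {self.dataset}: rule '{self.rule}' failed\n"
                    f"     upstream: {self.upstream_source}\n"
                    f"     examples: {self.sample_bad_rows[:3]}")

    def validate_non_negative(rows: list[dict], column: str, dataset: str, upstream: str):
        """Embedded validation: return a context-rich alert if any value is negative."""
        bad = [r for r in rows if r.get(column, 0) < 0]
        if bad:
            return QualityAlert(dataset, f"{column} >= 0", upstream, bad)
        return None

    rows = [{"order_id": 1, "amount": 25.0}, {"order_id": 2, "amount": -4.0}]
    alert = validate_non_negative(rows, "amount", "orders_daily", "checkout_service.orders v3")
    if alert:
        print(alert.render())
    ```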
