Understanding the Value of Imperfect Data


  • Ross Helenius

    Director AI Transformation Engineering & Architecture

    There is a trap waiting for you in your #data quality initiatives, and a lot of people fall into it. I call it the "all data must be perfect" fallacy. It is surprising how many people can only think of data quality in binary terms: is it perfect or imperfect? Rarely is that the real question. Data is a product of action, process, and subjectivity; trying to boil it down to a binary is a trap. Guess what, that’s ok!

    Your time is better spent understanding the value and outcomes of that data. What can it support now, and what additional value would you get from fixing it? Observe your data and its patterns with tools such as data observability platforms. With a better understanding of the shape and patterns of the data, you can prioritize what to fix by weighing the potential value against the quality of the data pipeline.

    The journey to fix data can run deep, or even sit outside your control. Is it a flawed process? A technical error? Maybe you have to retrain staff on a process or get a vendor to update their feed.

    You will still get value out of imperfect data when you use it in the right way and understand its limitations. Don’t use imperfect data in a scenario that requires audit-level accuracy, but do evaluate it for trends and other higher-level analysis. You don’t have to put all of your data science and GenAI projects on hold waiting for perfect data, as many folks on LinkedIn will tell you to do. You should, though, understand when and how to use it based on its properties.
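
The "prioritize fixes by value, not by perfection" idea above lends itself to a quick sketch. The Python example below is hypothetical: the column names, the business-value weights, and the completeness-only quality score are all assumptions, and a data observability platform would supply much richer shape and freshness signals.

```python
# A minimal sketch of ranking data fixes by business value x quality gap.
# The dataset, column names, and value weights below are hypothetical.
import pandas as pd

def column_quality(series: pd.Series) -> float:
    """Crude quality score: share of non-null values in the column."""
    if len(series) == 0:
        return 0.0
    return 1.0 - series.isna().mean()

def prioritize_fixes(df: pd.DataFrame, value_weights: dict[str, float]) -> pd.DataFrame:
    """Rank columns by business value x quality gap, biggest payoff first."""
    rows = []
    for col, value in value_weights.items():
        quality = column_quality(df[col])
        rows.append({
            "column": col,
            "quality": round(quality, 2),
            "value_weight": value,
            "fix_priority": round(value * (1.0 - quality), 2),
        })
    return pd.DataFrame(rows).sort_values("fix_priority", ascending=False)

# Hypothetical orders table: revenue matters far more than a free-text note.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "revenue": [120.0, None, 88.5, 101.0],
    "note": [None, None, "gift", None],
})
print(prioritize_fixes(orders, {"revenue": 0.9, "note": 0.1}))
```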

  • Malcolm Hawker

    CDO | Author | Keynote Speaker | Podcast Host

    "I don't want excellent data quality, I want 𝒈𝒐𝒐𝒅 𝒆𝒏𝒐𝒖𝒈𝒉 data quality." Or put another way, any effort expended on data quality beyond the minimum needed to make it 'fit for purpose' is wasted.

    This is a quote from my friend and former Gartner colleague Andrew White at their recent Data and Analytics Summit in Orlando. As with many things Andrew says, this statement is worth pondering in greater detail, as it touches on several areas where we need to drastically improve our awareness as data teams and data leaders:

    ✅ Optimally, we must strive to quantify what 'good enough' looks like, across multiple contexts / domains and individual use cases.

    ✅ We must define data quality standards across both analytical *and* operational use cases. Only looking through the lens of analytics hinders our ability to understand how our businesses operate and, ultimately, our customer relationships.

    ✅ These quality standards cannot be absolute, nor can they be unrealistic. Put another way, perfection is the enemy, and striving for it is a fool's errand.

    ✅ There is an elasticity in the cost/benefit of data quality that we must also do a better job of understanding. The cost to go from 95% accuracy to 96% accuracy is 𝐝𝐫𝐚𝐬𝐭𝐢𝐜𝐚𝐥𝐥𝐲 𝐡𝐢𝐠𝐡𝐞𝐫 than the cost to go from 75% to 76%.

    ✅ Finding the optimal balance between business impact and effective cost management in data quality should be a top goal for data leaders.

    ✅ All of the above points assume we are able to quantify the business impacts of our data, which unfortunately remains an elusive goal for most.

    Doing the above would put a data leader in a position to have an informed (dare I say 'data driven'?) conversation with their customers about the tradeoffs inherent to effective data quality management. Today, we place far too much emphasis on highly deterministic and rigid data quality rules that might keep our CFO happy while alienating other customers desperate for more pragmatic approaches.

    Our approaches to data quality must adapt. Breaking a mindset that treats data quality as 'all or nothing' is a great first step in that direction. What do you think? #dataquality #datagovernance #cdo

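The elasticity point is easy to picture with a toy cost curve. The 1 / (1 - accuracy) shape and the unit cost in the sketch below are illustrative assumptions, not figures from the post; the only claim carried over is that the same one-point gain costs far more near the top of the scale.

```python
# Illustrative only: a convex cost curve for the data quality "elasticity" point.
# The 1 / (1 - accuracy) shape and unit cost are assumptions, not from the post.

def cleanup_cost(accuracy: float, unit_cost: float = 1.0) -> float:
    """Toy model: cumulative cost explodes as accuracy approaches 100%."""
    return unit_cost / (1.0 - accuracy)

def marginal_cost(start: float, end: float) -> float:
    """Cost of moving accuracy from `start` to `end` under the toy model."""
    return cleanup_cost(end) - cleanup_cost(start)

# Going from 75% -> 76% is cheap; 95% -> 96% is roughly 30x more expensive here.
print(round(marginal_cost(0.75, 0.76), 2))  # ~0.17
print(round(marginal_cost(0.95, 0.96), 2))  # ~5.0
```
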
  • Kevin Hu

    Data Observability at Datadog | CEO of Metaplane (acquired)

    High-quality data != perfect data. Data is never perfect; it's only good enough to satisfy a use case. That’s why truly understanding data quality goes beyond just “checking for errors”: you have to evaluate whether the data fits its intended use case. For instance:

    • What specific business processes or decisions rely on this data?
    • Is the data relevant to the KPIs and metrics we’re tracking?
    • How often is it refreshed? Is it current?

    Full disclosure: the answers to these questions might not be what you’re hoping for. But they should give you what you need to get your data quality initiatives off the ground, with a threshold for what is “good enough” for your business needs. Because let’s be honest, if data doesn’t help you meet your business needs, what good is it? #dataengineering #dataquality #data
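
Those questions translate almost directly into a "good enough" gate. The sketch below is a hypothetical illustration: the thresholds, the freshness and null-rate checks, and the example use case are assumptions standing in for whatever metadata your own stack exposes.

```python
# A minimal sketch of a fitness-for-purpose check, following the questions above.
# Thresholds, metadata fields, and the example use case are hypothetical.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class UseCaseThresholds:
    max_staleness: timedelta  # how current the data must be for this use case
    max_null_rate: float      # tolerable share of missing values in key fields

def fit_for_purpose(last_refreshed: datetime, null_rate: float,
                    thresholds: UseCaseThresholds) -> bool:
    """True if the data meets this use case's bar, not some absolute ideal."""
    fresh_enough = datetime.now(timezone.utc) - last_refreshed <= thresholds.max_staleness
    complete_enough = null_rate <= thresholds.max_null_rate
    return fresh_enough and complete_enough

# A weekly revenue trend review tolerates more slack than an invoicing job would.
trend_review = UseCaseThresholds(max_staleness=timedelta(days=7), max_null_rate=0.05)
last_load = datetime.now(timezone.utc) - timedelta(days=2)
print(fit_for_purpose(last_load, null_rate=0.03, thresholds=trend_review))  # True
```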

  • Mikhail Panko

    Product @ Airbnb (previously: founder, Google, Uber, Coursera)

    A big part of building Motif Analytics has been re-thinking how an exploration-first analytics tool should work. Most analytics tools today are built for narrow reporting, provide appealing but misleading insights, or require many hours of work by a strong data practitioner to answer practical custom questions. We asked ourselves: how can we adjust existing analytics abstractions and tradeoffs to provide the fastest path from raw data to practical insights in modern tech companies? This is hard because many fundamental assumptions about how analytics “ought to be done” have become entrenched and are rarely questioned. I’d like to do a series of short posts about several such assumptions, which can get in the way of fast practical analytics, and hear reactions from data folks (you!). Let’s start with one of the biggest headaches of every data practitioner I know.

    🔬 Data Quality 🔬

    Organizational trust in the accuracy of metrics is something every data practitioner has to grapple with. The widely accepted solution is improving and monitoring data quality through every step of data capture and processing. But does it solve the problem? Do you know 100+ person organizations where trust in metrics is not an issue? Several big factors work against it:

    ➡ No guarantee of result correctness: missing even one small thing breaks the whole processing chain.
    ➡ Dynamic environment: software products and logging are constantly changing, and business definitions of metrics are shifting.
    ➡ Inherent errors: there is an inherent loss of analytics logs (often ~1%).
    ➡ Distributed ownership: feature engineers, data engineers, analysts, and data scientists all touch the same data.

    Is there another approach? How do strong data folks answer analytics questions today when working with imperfect data? They:

    1️⃣ Check result correctness through its coherence over:
    - time: review metric stability over time
    - context: view data in the broad context of prior/later behavior
    - question tweaks: inspect how results change based on slight changes to the question
    - redundancy: compare metric values coming from redundant data sources

    2️⃣ Work around identified data quality issues quickly during the analysis by filtering out bad data, using proxy data, making reasonable simplifying assumptions, etc.

    Both are specific to the question at hand and hard to generalize across analyses. Unfortunately, analytics tools today don't focus on making this type of work easy. Here are some approaches we are using for Motif:

    ➡ Display data in the broad context of prior/later events in user flows.
    ➡ Preserve high interactivity, with ~2 second exploratory query times on datasets of any size.
    ➡ Provide the ability to modify data on the fly by replacing event patterns.
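
The coherence checks in 1️⃣ are easy to sketch in code. The example below shows two of them, stability over time and agreement between redundant sources; the tolerances and example numbers are made up, and real checks would be tuned to the metric in question.

```python
# A rough sketch of two of the coherence checks above: stability over time and
# agreement between redundant sources. Tolerances and example values are made up.

def stable_over_time(weekly_values: list[float], max_rel_change: float = 0.2) -> bool:
    """Flag the series if any week-over-week swing exceeds the tolerance."""
    for prev, curr in zip(weekly_values, weekly_values[1:]):
        if prev != 0 and abs(curr - prev) / abs(prev) > max_rel_change:
            return False
    return True

def sources_agree(metric_a: float, metric_b: float, max_rel_gap: float = 0.05) -> bool:
    """Compare the same metric computed from two redundant data sources."""
    baseline = max(abs(metric_a), abs(metric_b))
    return baseline == 0 or abs(metric_a - metric_b) / baseline <= max_rel_gap

signups = [1040, 1010, 980, 2400]   # the jump in the last week looks suspicious
print(stable_over_time(signups))    # False -> investigate before trusting the metric
print(sources_agree(10230, 10190))  # True  -> frontend and backend counts line up
```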
