Correlation Does Not Mean Causation
Testing insights into DataOps, Big Data Analytics, and AI
Peter Varhol
About me
• International speaker and writer
• Graduate degrees in Math, CS, Psychology
• Technology communicator
• AWS certified
• Former university professor, tech journalist
• Cat owner and distance runner
• peter@petervarhol.com
What You Will Learn
• How AI systems make determinations based on data
• Why big data is so important in analytics and AI
• What we actually learn when we work with AI and analytics systems
Agenda
• The Evolution of Data
• The Role of DataOps
• Logistics of Big Data
• Using data to train machine learning systems
• Bias in data
• Summary
The Evolution of Data
• Thirty years ago
• Hardware was king
• Twenty years ago
• Software ruled the roost
• Ten years ago
• Hardware and software went to the cloud
• Today
• Nothing matters but data
How Did This Happen?
• Prices fell with commodity hardware
• Storage became much less expensive
• We developed better software abstractions
• Operating systems became standardized
• Nicholas Carr was wrong – software did matter
• Business decision-makers became comfortable with data
• “Gut feel” is no longer an acceptable basis for decision-making
How Did This Happen?
• Storage is cheap
• We can easily store and retrieve terabytes of data
• Processing power is fast
• It doesn’t take long to operate on large datasets
• Data can produce information
• Decision-making became more refined
What Does This Mean?
• The business is now using data as an integral part of decision-making
• That data is often in real time
• Data is also critical to machine learning applications
• IT has to keep data up to date and clean
• Old data is worse than useless
• We need a data pipeline similar to DevOps
• Data → Information, seamlessly
What is DataOps?
• Data collection is a natural part of business operations
• No out-of-cycle effort required
• Data collection, storage, workflow, integration, and analytics deployment in a consistent, repeatable process
• Plus data about your data (metadata), as in the sketch below
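As an illustration, here is a minimal Python sketch (the function name and file layout are hypothetical, not from any specific DataOps tool, and it assumes a pandas-based pipeline) of an ingestion step that records data about your data – row counts, schema, and a content hash:

import datetime
import hashlib
import json
import pandas as pd

def ingest_with_metadata(csv_path: str) -> pd.DataFrame:
    """Load a dataset and write a metadata record alongside it."""
    df = pd.read_csv(csv_path)
    metadata = {
        "source": csv_path,
        "ingested_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "rows": len(df),
        "columns": list(df.columns),
        # Hash the content so any later change to the data is detectable
        "content_hash": hashlib.sha256(df.to_csv(index=False).encode()).hexdigest(),
    }
    with open(csv_path + ".meta.json", "w") as f:
        json.dump(metadata, f, indent=2)
    return df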
DataOps Versus DevOps
• Data can be designed to follow flow principles similar to DevOps
• Process
• Automation
• Data production and workflow are important to effective data consumption
• Cross-functional teams are essential in both
• Developers, testers
• DBAs, report writers
Why Would We Want To?
• Many teams don’t know how to handle big data
• Defining a practice provides guidance
• We need information in real time
• We can’t wait for the next monthly report
• It helps companies better understand their data
• Data is now front and center
Principles of DataOps
• Individuals and interactions over processes and tools
• Working analytics over comprehensive documentation
• Customer collaboration over contract negotiation
• Experimentation, iteration, and feedback over extensive upfront design
• Cross-functional ownership of operations over siloed responsibilities
• https://www.dataopsmanifesto.org/
Why We Need DataOps
• Data is a valuable commodity
• It’s not simply a byproduct of our work
• We need to reap intelligence from data
• In real time
• With a standard process
• We must get it into the hands of those who need it
• For analysis
• For decision-making
Why We Need DataOps
• Auditability – Versioning every output and input, from source data to data science experiments to trained models, means you can show exactly how a model was created and where it was implemented (sketched below)
• Reliability – Deploy quickly, but with increased consistency and quality
• Repeatability – Automation ensures a repeatable process
• Productivity – Provide a self-service environment with access to curated data sets
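A hedged sketch of the auditability point, assuming a simple file-based workflow (the naming scheme is invented for illustration): every trained model is saved with a lineage record tying it to the exact version of its training data.

import hashlib
import json
import pickle

def save_with_lineage(model, training_data_bytes: bytes, base_path: str) -> str:
    """Persist a model together with a record of the data it was trained on."""
    data_version = hashlib.sha256(training_data_bytes).hexdigest()[:12]
    model_file = f"{base_path}-{data_version}.pkl"
    with open(model_file, "wb") as f:
        pickle.dump(model, f)
    # The lineage record shows exactly how the model was created
    with open(f"{base_path}-{data_version}.lineage.json", "w") as f:
        json.dump({"model_file": model_file, "data_version": data_version}, f)
    return model_file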
We’re All In This Together
• Data is a team sport
• DBA
• Report writer
• Data scientist/analyst
• Ops person
• Tester
• And more
The Logistics of Big Data
• We get data from a variety of sources
• Our own databases
• Measurement of processes
• Natural and social science
• A single source is no longer enough
• We tie together sales, weather reports, more
• This can’t be done manually
Big Data and Machine Learning
• Our intelligent systems learn through data
• The more data, the better (usually)
• Algorithms manipulate the data to draw a conclusion
• It can seem like intelligence because that’s how we make decisions
• Your algorithms are your competitive advantage
• And the better your data, the more effective your algorithms
Using Data to Train Machine Learning Systems
• Big Data is used for “training” machine learning systems
• Data is fed through a series of nonlinear algorithms that adjust parameters in response (sketched below)
• We tend to believe it is infallible
• Um, no
• Data is only as good as how we select and collect it
• And results are only as good as the data
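To make the “adjust parameters in response to data” idea concrete, here is a deliberately tiny sketch – one-parameter gradient descent rather than a real nonlinear network, with invented numbers – fitting y = w·x to noisy data:

import random

random.seed(0)
# Synthetic training data with true slope 3.0 plus noise
data = [(x, 3.0 * x + random.gauss(0, 0.5)) for x in range(100)]

w = 0.0            # the parameter the algorithm adjusts
lr = 1e-5          # learning rate
for epoch in range(50):
    for x, y in data:
        error = w * x - y      # how wrong the current parameter is
        w -= lr * error * x    # nudge the parameter toward the data
print(f"learned w = {w:.2f}")  # near 3.0, but only as good as the data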
The Limitations of Data
• Data is typically a sample or representation of a real-world circumstance
• Not necessarily exact
• And not necessarily correct
• Data can be misinterpreted
• That doesn’t mean what you think it means
Bias and Machine Learning Systems
• Worst of all, data can be biased
• It may not accurately and consistently represent the problem domain
• That’s a problem
• All data is biased in some way
• So we need to understand our data bias
Where Do Biases Come From?
• Data selection
• We choose training data that represents only one segment of the domain
• We limit our training data to certain times or seasons
• We overrepresent one population (a small example follows below)
• Or
• The problem domain has subtly changed
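A small illustrative example of selection bias (the numbers are invented): a statistic computed from an overrepresented segment misses the true population value, so a model trained on it never learns the missing segment.

import random

random.seed(2)
segment_a = [random.gauss(10, 1) for _ in range(900)]  # the segment we sampled
segment_b = [random.gauss(20, 1) for _ in range(100)]  # the segment we missed
population = segment_a + segment_b

biased_mean = sum(segment_a) / len(segment_a)          # estimate from one segment
true_mean = sum(population) / len(population)
print(f"biased estimate: {biased_mean:.1f}, true value: {true_mean:.1f}")
# About 10 vs. 11: the model never sees segment_b, so it gets that segment wrong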
Where Do Biases Come From?
• Latent bias
• Concepts become incorrectly correlated
• Correlation does not mean causation
• But the correlation is strong enough to be believable
• We could be promoting stereotypes
• This describes Amazon’s recruiting-tool problem (illustrated below)
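A hedged illustration of correlation without causation, using the classic ice cream and drownings example (both driven by hot weather): two variables that share a hidden confounder correlate strongly even though neither causes the other.

import random

random.seed(1)
heat = [random.random() for _ in range(1000)]            # hidden confounder
ice_cream = [h + random.gauss(0, 0.1) for h in heat]     # driven by heat
drownings = [h + random.gauss(0, 0.1) for h in heat]     # also driven by heat

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

# Strong correlation (r near 0.9), yet neither variable causes the other
print(f"r = {pearson(ice_cream, drownings):.2f}")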
Where Do Biases Come From?
• Interaction bias
• We may focus on keywords that users apply incorrectly
• Users incorporate slang or unusual words
• “That’s bad, man”
• The story of Microsoft Tay
• It wasn’t bad, it was trained that way
Why Does Bias Matter?
• Wrong answers
• Often with no recourse
• Subtle discrimination (legal or illegal)
• And no one knows it
• Suboptimal results
• We’re not getting it right often enough
• Although bias may also have value
Delivering in the Clutch
• Machines treat all events as equal
• Humans recognize the importance of some events
• And sometimes can rise to the occasion
• There is no mechanism for code to do this
• We could have data and algorithms to recognize the importance of a specific event
• But the software cannot “improve” its answer
• This is less a bias than an inherent weakness
The Human in the Loop
• We don’t understand complex software systems
• Disasters often happen because software behaves in unexpected ways
• Human oversight may prevent disasters, or wrong decisions
• Can we overcome human bias?
• The problem is that machines respond too quickly
• In many cases, there is not enough time for human oversight
• Aircraft, autonomous vehicles need to respond instantly
Where Testing Fits In
• Data must be accurate
• How do we make it so?
• Humans need to be proactive
• We test – objectively
• We anticipate
• Bias
• Wrong answers
• Puzzles
• Ensuring data represents the problem domain (a drift-check sketch follows below)
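One way a tester might check that data still represents the problem domain – a minimal sketch, assuming a single numeric feature and an invented tolerance threshold – is to compare training and production distributions:

def mean(xs):
    return sum(xs) / len(xs)

def drift_alarm(train, prod, tolerance=0.1):
    """Flag when the production mean drifts away from the training mean."""
    baseline = abs(mean(train)) or 1.0       # avoid division by zero
    return abs(mean(prod) - mean(train)) / baseline > tolerance

train_ages = [34, 29, 41, 38, 33, 45, 30]   # what the model learned from
prod_ages = [62, 58, 71, 65, 60, 68, 59]    # what it now sees in production
print(drift_alarm(train_ages, prod_ages))   # True: investigate and retrain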
How to Test
• Many scenarios
• Hundreds or thousands
• With detailed documentation
• Edge cases
• The training data may not cover them
• Think outside the box
• Try to create a mental model from test results
• So you can say, “I understand how this works” (a scenario sketch follows below)
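A minimal sketch of scenario-driven testing (the system under test is a hypothetical stand-in): run many documented scenarios, including edge cases, and record results so you can build your own model of the system’s behavior.

scenarios = [
    {"name": "typical input", "input": 42, "expect_range": (0, 100)},
    {"name": "edge: zero", "input": 0, "expect_range": (0, 100)},
    {"name": "edge: negative", "input": -1, "expect_range": (0, 100)},
    {"name": "edge: huge", "input": 10**9, "expect_range": (0, 100)},
]

def system_under_test(x):
    # Hypothetical stand-in for the real model or analytics system
    return max(0, min(100, x * 2))

for s in scenarios:
    result = system_under_test(s["input"])
    lo, hi = s["expect_range"]
    status = "PASS" if lo <= result <= hi else "FAIL"
    print(f"{s['name']}: input={s['input']} -> {result} [{status}]")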
Conclusions
• Data is central to all applications
• Big data is the norm
• Managed by DataOps
• But data can’t make our decisions for us
• Put data in its proper role
• But the burden is on us
• How can we respond when response time is in seconds?
Correlation does not mean causation
Thank You
• Peter Varhol
peter@petervarhol.com
Editor's Notes
• Notes for “Principles of DataOps” – the full DataOps Manifesto principles:
1. Continually satisfy your customer: Our highest priority is to satisfy the customer through the early and continuous delivery of valuable analytic insights, from a couple of minutes to weeks.
2. Value working analytics: We believe the primary measure of data analytics performance is the degree to which insightful analytics are delivered, incorporating accurate data, atop robust frameworks and systems.
3. Embrace change: We welcome evolving customer needs, and in fact, we embrace them to generate competitive advantage. We believe that the most efficient, effective, and agile method of communication with customers is face-to-face conversation.
4. It’s a team sport: Analytic teams will always have a variety of roles, skills, favorite tools, and titles. A diversity of backgrounds and opinions increases innovation and productivity.
5. Daily interactions: Customers, analytic teams, and operations must work together daily throughout the project.
6. Self-organize: We believe that the best analytic insight, algorithms, architectures, requirements, and designs emerge from self-organizing teams.
7. Reduce heroism: As the pace and breadth of need for analytic insights ever increases, we believe analytic teams should strive to reduce heroism and create sustainable and scalable data analytic teams and processes.
8. Reflect: Analytic teams should fine-tune their operational performance by self-reflecting, at regular intervals, on feedback provided by their customers, themselves, and operational statistics.
9. Analytics is code: Analytic teams use a variety of individual tools to access, integrate, model, and visualize data. Fundamentally, each of these tools generates code and configuration which describes the actions taken upon data to deliver insight.
10. Orchestrate: The beginning-to-end orchestration of data, tools, code, environments, and the analytic teams’ work is a key driver of analytic success.
11. Make it reproducible: Reproducible results are required, and therefore we version everything: data, low-level hardware and software configurations, and the code and configuration specific to each tool in the toolchain.
12. Disposable environments: We believe it is important to minimize the cost for analytic team members to experiment by giving them easy-to-create, isolated, safe, and disposable technical environments that reflect their production environment.
13. Simplicity: We believe that continuous attention to technical excellence and good design enhances agility; likewise simplicity – the art of maximizing the amount of work not done – is essential.
14. Analytics is manufacturing: Analytic pipelines are analogous to lean manufacturing lines. We believe a fundamental concept of DataOps is a focus on process-thinking aimed at achieving continuous efficiencies in the manufacture of analytic insight.
15. Quality is paramount: Analytic pipelines should be built with a foundation capable of automated detection of abnormalities (jidoka) and security issues in code, configuration, and data, and should provide continuous feedback to operators for error avoidance (poka yoke).
16. Monitor quality and performance: Our goal is to have performance, security, and quality measures that are monitored continuously to detect unexpected variation and generate operational statistics.
17. Reuse: We believe a foundational aspect of analytic insight manufacturing efficiency is to avoid the repetition of previous work by the individual or team.
18. Improve cycle times: We should strive to minimize the time and effort to turn a customer need into an analytic idea, create it in development, release it as a repeatable production process, and finally refactor and reuse that product.
• Notes for “Why We Need DataOps”:
Auditability – Versioning every output and input, from source data to data science experiments to trained model, means that you can show exactly how the model was created and where it was implemented.
Reliability – Incorporating MLOps enables you not just to deploy quickly, but with increased consistency and quality.
Repeatability – Automating every process helps you ensure a repeatable process, including how the machine learning model is deployed, evaluated, trained, and versioned.
Productivity – Providing a self-service environment with access to curated data sets allows data scientists and data engineers to waste less time with invalid or missing data and move faster.
Read more: https://techbullion.com/a-basic-guide-to-understanding-machine-learning-operations/