Correlation Does Not Mean Causation
Testing insights into DataOps, Big Data Analytics, and AI
Peter Varhol
About me
• International speaker and writer
• Graduate degrees in Math, CS, Psychology
• Technology communicator
• AWS certified
• Former university professor, tech journalist
• Cat owner and distance runner
• peter@petervarhol.com
What You Will Learn
• How AI systems make determinations based on data
• Why big data is so important in analytics and AI
• What we actually learn when we work with AI and analytics systems
Agenda
• The Evolution of Data
• The Role of DataOps
• Logistics of Big Data
• Using data to train machine learning systems
• Bias in data
• Summary
The Evolution of Data
• Thirty years ago
• Hardware was king
• Twenty years ago
• Software ruled the roost
• Ten years ago
• Hardware and software went to the cloud
• Today
• Nothing matters but data
How Did This Happen?
• Prices fell with commodity hardware
• Storage became much less expensive
• We developed better software abstractions
• Operating systems became standardized
• Nicholas Carr was wrong – software did matter
• Business decision-makers became comfortable with data
• “Gut feel” is no longer an acceptable basis for decision-making
How Did This Happen?
• Storage is cheap
• We can easily store and retrieve terabytes of data
• Processing power is fast
• It doesn’t take long to operate on large datasets
• Data can produce information
• Decision-making became more refined
What Does This Mean?
• The business is now using data as an integral part of decision-making
• That data is often in real time
• Data is also critical to machine learning applications
• IT has to keep data up to date and clean
• Old data is worse than useless
• We need a data pipeline similar to DevOps
• Data → Information, seamlessly
What is DataOps?
• Data collection is a natural part of business operations
• No out-of-cycle effort required
• Data collection, storage, workflow, integration, and analytics deployment in a consistent, repeatable process
• Plus data about your data (metadata), as in the sketch below
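As an illustration, here is a minimal Python sketch (the function name and file layout are hypothetical, not from any specific DataOps tool, and it assumes a pandas-based pipeline) of an ingestion step that records data about your data – row counts, schema, and a content hash:

import datetime
import hashlib
import json
import pandas as pd

def ingest_with_metadata(csv_path: str) -> pd.DataFrame:
    """Load a dataset and write a metadata record alongside it."""
    df = pd.read_csv(csv_path)
    metadata = {
        "source": csv_path,
        "ingested_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "rows": len(df),
        "columns": list(df.columns),
        # Hash the content so any later change to the data is detectable
        "content_hash": hashlib.sha256(df.to_csv(index=False).encode()).hexdigest(),
    }
    with open(csv_path + ".meta.json", "w") as f:
        json.dump(metadata, f, indent=2)
    return df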
DataOps Versus DevOps
• Data can be designed to follow flow principles similar to DevOps
• Process
• Automation
• Data production and workflow are important to effective data consumption
• Cross-functional teams are essential in both
• Developers, testers
• DBAs, report writers
Why Would We Want To?
• Many teams don’t know how to handle big data
• Defining a practice provides guidance
• We need information in real time
• We can’t wait for the next monthly report
• It helps companies better understand their data
• Data is now front and center
Principles of DataOps
• Individuals and interactions over processes and tools
• Working analytics over comprehensive documentation
• Customer collaboration over contract negotiation
• Experimentation, iteration, and feedback over extensive upfront design
• Cross-functional ownership of operations over siloed responsibilities
• https://www.dataopsmanifesto.org/
Why We Need DataOps
• Data is a valuable commodity
• It’s not simply a byproduct of our work
• We need to reap intelligence from data
• In real time
• With a standard process
• We must get it into the hands of those who need it
• For analysis
• For decision-making
Why We Need DataOps
• Auditability – Versioning every output and input, from source data to data science experiments to trained models, means you can show exactly how a model was created and where it was implemented (sketched below)
• Reliability – Deploy quickly, but with increased consistency and quality
• Repeatability – Automation ensures a repeatable process
• Productivity – Provide a self-service environment with access to curated data sets
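A hedged sketch of the auditability point, assuming a simple file-based workflow (the naming scheme is invented for illustration): every trained model is saved with a lineage record tying it to the exact version of its training data.

import hashlib
import json
import pickle

def save_with_lineage(model, training_data_bytes: bytes, base_path: str) -> str:
    """Persist a model together with a record of the data it was trained on."""
    data_version = hashlib.sha256(training_data_bytes).hexdigest()[:12]
    model_file = f"{base_path}-{data_version}.pkl"
    with open(model_file, "wb") as f:
        pickle.dump(model, f)
    # The lineage record shows exactly how the model was created
    with open(f"{base_path}-{data_version}.lineage.json", "w") as f:
        json.dump({"model_file": model_file, "data_version": data_version}, f)
    return model_file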
We’re All In This Together
• Data is a team sport
• DBA
• Report writer
• Data scientist/analyst
• Ops person
• Tester
• And more
The Logistics of Big Data
• We get data from a variety of sources
• Our own databases
• Measurement of processes
• Natural and social science
• A single source is no longer enough
• We tie together sales, weather reports, more
• This can’t be done manually
Big Data and Machine Learning
• Our intelligent systems learn through data
• The more data, the better (usually)
• Algorithms manipulate the data to draw a conclusion
• It can seem like intelligence because that’s how we make decisions
• Your algorithms are your competitive advantage
• And the better your data, the more effective your algorithms
Using Data to Train Machine Learning Systems
• Big Data is used for “training” machine learning systems
• Data is fed through a series of nonlinear algorithms that adjust parameters in response (sketched below)
• We tend to believe it is infallible
• Um, no
• Data is only as good as how we select and collect it
• And results are only as good as the data
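To make the “adjust parameters in response to data” idea concrete, here is a deliberately tiny sketch – one-parameter gradient descent rather than a real nonlinear network, with invented numbers – fitting y = w·x to noisy data:

import random

random.seed(0)
# Synthetic training data with true slope 3.0 plus noise
data = [(x, 3.0 * x + random.gauss(0, 0.5)) for x in range(100)]

w = 0.0            # the parameter the algorithm adjusts
lr = 1e-5          # learning rate
for epoch in range(50):
    for x, y in data:
        error = w * x - y      # how wrong the current parameter is
        w -= lr * error * x    # nudge the parameter toward the data
print(f"learned w = {w:.2f}")  # near 3.0, but only as good as the data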
The Limitations of Data
• Data is typically a sample or representation of a real-world circumstance
• Not necessarily exact
• And not necessarily correct
• Data can be misinterpreted
• That doesn’t mean what you think it means
Bias and Machine Learning Systems
• Worst of all, data can be biased
• It may not accurately and consistently represent the problem domain
• That’s a problem
• All data is biased in some way
• So we need to understand our data bias
Where Do Biases Come From?
• Data selection
• We choose training data that represents only one segment of the domain
• We limit our training data to certain times or seasons
• We overrepresent one population (a small example follows below)
• Or
• The problem domain has subtly changed
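A small illustrative example of selection bias (the numbers are invented): a statistic computed from an overrepresented segment misses the true population value, so a model trained on it never learns the missing segment.

import random

random.seed(2)
segment_a = [random.gauss(10, 1) for _ in range(900)]  # the segment we sampled
segment_b = [random.gauss(20, 1) for _ in range(100)]  # the segment we missed
population = segment_a + segment_b

biased_mean = sum(segment_a) / len(segment_a)          # estimate from one segment
true_mean = sum(population) / len(population)
print(f"biased estimate: {biased_mean:.1f}, true value: {true_mean:.1f}")
# About 10 vs. 11: the model never sees segment_b, so it gets that segment wrong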
Where Do Biases Come From?
• Latent bias
• Concepts become incorrectly correlated
• Correlation does not mean causation
• But the correlation is strong enough to be believable
• We could be promoting stereotypes
• This describes Amazon’s recruiting-tool problem (illustrated below)
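A hedged illustration of correlation without causation, using the classic ice cream and drownings example (both driven by hot weather): two variables that share a hidden confounder correlate strongly even though neither causes the other.

import random

random.seed(1)
heat = [random.random() for _ in range(1000)]            # hidden confounder
ice_cream = [h + random.gauss(0, 0.1) for h in heat]     # driven by heat
drownings = [h + random.gauss(0, 0.1) for h in heat]     # also driven by heat

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

# Strong correlation (r near 0.9), yet neither variable causes the other
print(f"r = {pearson(ice_cream, drownings):.2f}")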
Where Do Biases Come From?
• Interaction bias
• We may focus on keywords that users apply incorrectly
• Users incorporate slang or unusual words
• “That’s bad, man”
• The story of Microsoft Tay
• It wasn’t bad, it was trained that way
Why Does Bias Matter?
• Wrong answers
• Often with no recourse
• Subtle discrimination (legal or illegal)
• And no one knows it
• Suboptimal results
• We’re not getting it right often enough
• Although bias may also have value
Delivering in the Clutch
• Machines treat all events as equal
• Humans recognize the importance of some events
• And sometimes can rise to the occasion
• There is no mechanism for code to do this
• We could have data and algorithms to recognize the importance of a specific event
• But the software cannot “improve” its answer
• This is less a bias than an inherent weakness
The Human in the Loop
• We don’t understand complex software systems
• Disasters often happen because software behaves in unexpected ways
• Human oversight may prevent disasters, or wrong decisions
• Can we overcome human bias?
• The problem is that machines respond too quickly
• In many cases, there is not enough time for human oversight
• Aircraft, autonomous vehicles need to respond instantly
Where Testing Fits In
• Data must be accurate
• How do we make it so?
• Humans need to be proactive
• We test – objectively
• We anticipate
• Bias
• Wrong answers
• Puzzles
• Ensuring data represents the problem domain (a drift-check sketch follows below)
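One way a tester might check that data still represents the problem domain – a minimal sketch, assuming a single numeric feature and an invented tolerance threshold – is to compare training and production distributions:

def mean(xs):
    return sum(xs) / len(xs)

def drift_alarm(train, prod, tolerance=0.1):
    """Flag when the production mean drifts away from the training mean."""
    baseline = abs(mean(train)) or 1.0       # avoid division by zero
    return abs(mean(prod) - mean(train)) / baseline > tolerance

train_ages = [34, 29, 41, 38, 33, 45, 30]   # what the model learned from
prod_ages = [62, 58, 71, 65, 60, 68, 59]    # what it now sees in production
print(drift_alarm(train_ages, prod_ages))   # True: investigate and retrain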
How to Test
• Many scenarios
• Hundreds or thousands
• With detailed documentation
• Edge cases
• The training data may not cover them
• Think outside the box
• Try to create a mental model from test results
• So you can say, “I understand how this works” (a scenario sketch follows below)
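A minimal sketch of scenario-driven testing (the system under test is a hypothetical stand-in): run many documented scenarios, including edge cases, and record results so you can build your own model of the system’s behavior.

scenarios = [
    {"name": "typical input", "input": 42, "expect_range": (0, 100)},
    {"name": "edge: zero", "input": 0, "expect_range": (0, 100)},
    {"name": "edge: negative", "input": -1, "expect_range": (0, 100)},
    {"name": "edge: huge", "input": 10**9, "expect_range": (0, 100)},
]

def system_under_test(x):
    # Hypothetical stand-in for the real model or analytics system
    return max(0, min(100, x * 2))

for s in scenarios:
    result = system_under_test(s["input"])
    lo, hi = s["expect_range"]
    status = "PASS" if lo <= result <= hi else "FAIL"
    print(f"{s['name']}: input={s['input']} -> {result} [{status}]")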
Conclusions
• Data is central to all applications
• Big data is the norm
• Managed by DataOps
• But data can’t make our decisions for us
• Put data in its proper role
• But the burden is on us
• How can we respond when response time is in seconds?
Correlation does not mean causation
Thank You
• Peter Varhol
peter@petervarhol.com
Editor's Notes
• Notes for “Principles of DataOps” – the full DataOps Manifesto principles:
1. Continually satisfy your customer: Our highest priority is to satisfy the customer through the early and continuous delivery of valuable analytic insights, from a couple of minutes to weeks.
2. Value working analytics: We believe the primary measure of data analytics performance is the degree to which insightful analytics are delivered, incorporating accurate data, atop robust frameworks and systems.
3. Embrace change: We welcome evolving customer needs, and in fact, we embrace them to generate competitive advantage. We believe that the most efficient, effective, and agile method of communication with customers is face-to-face conversation.
4. It’s a team sport: Analytic teams will always have a variety of roles, skills, favorite tools, and titles. A diversity of backgrounds and opinions increases innovation and productivity.
5. Daily interactions: Customers, analytic teams, and operations must work together daily throughout the project.
6. Self-organize: We believe that the best analytic insight, algorithms, architectures, requirements, and designs emerge from self-organizing teams.
7. Reduce heroism: As the pace and breadth of need for analytic insights ever increases, we believe analytic teams should strive to reduce heroism and create sustainable and scalable data analytic teams and processes.
8. Reflect: Analytic teams should fine-tune their operational performance by self-reflecting, at regular intervals, on feedback provided by their customers, themselves, and operational statistics.
9. Analytics is code: Analytic teams use a variety of individual tools to access, integrate, model, and visualize data. Fundamentally, each of these tools generates code and configuration which describes the actions taken upon data to deliver insight.
10. Orchestrate: The beginning-to-end orchestration of data, tools, code, environments, and the analytic teams’ work is a key driver of analytic success.
11. Make it reproducible: Reproducible results are required, and therefore we version everything: data, low-level hardware and software configurations, and the code and configuration specific to each tool in the toolchain.
12. Disposable environments: We believe it is important to minimize the cost for analytic team members to experiment by giving them easy-to-create, isolated, safe, and disposable technical environments that reflect their production environment.
13. Simplicity: We believe that continuous attention to technical excellence and good design enhances agility; likewise simplicity – the art of maximizing the amount of work not done – is essential.
14. Analytics is manufacturing: Analytic pipelines are analogous to lean manufacturing lines. We believe a fundamental concept of DataOps is a focus on process-thinking aimed at achieving continuous efficiencies in the manufacture of analytic insight.
15. Quality is paramount: Analytic pipelines should be built with a foundation capable of automated detection of abnormalities (jidoka) and security issues in code, configuration, and data, and should provide continuous feedback to operators for error avoidance (poka yoke).
16. Monitor quality and performance: Our goal is to have performance, security, and quality measures that are monitored continuously to detect unexpected variation and generate operational statistics.
17. Reuse: We believe a foundational aspect of analytic insight manufacturing efficiency is to avoid the repetition of previous work by the individual or team.
18. Improve cycle times: We should strive to minimize the time and effort to turn a customer need into an analytic idea, create it in development, release it as a repeatable production process, and finally refactor and reuse that product.
• Notes for “Why We Need DataOps”:
Auditability – Versioning every output and input, from source data to data science experiments to trained model, means that you can show exactly how the model was created and where it was implemented.
Reliability – Incorporating MLOps enables you not just to deploy quickly, but with increased consistency and quality.
Repeatability – Automating every process helps you ensure a repeatable process, including how the machine learning model is deployed, evaluated, trained, and versioned.
Productivity – Providing a self-service environment with access to curated data sets allows data scientists and data engineers to waste less time with invalid or missing data and move faster.
Read more: https://techbullion.com/a-basic-guide-to-understanding-machine-learning-operations/