Grab some coffee and enjoy the pre-show banter before the top of the hour!
The Data Lake Survival Guide
Exploratory Webcast | October 26, 2016
SPONSORED BY
Presenting
Robin Bloor
Chief Analyst, The Bloor Group
@robinbloor robin.bloor@bloorgroup.com
Host: Eric Kavanagh
CEO, The Bloor Group
@eric_kavanagh eric.kavanagh@bloorgroup.com
Dez Blanchfield
Data Scientist, The Bloor Group
@dez_blanchfield dez.blanchfield@bloorgroup.com
Data Lake Survival Guide
Exploratory Webcast | October 26, 2016
Roundtable Webcast | December 8, 2016
Findings Webcast | January 12, 2017
Data Lake
Survival
Robin Bloor, PhD
The Sequence of Topics…
1  Disturbance in the Force
2  What is a Data Lake, exactly?
3  Streams and Events
1  Disturbance in the Force
The Generic Dimensions of IT
• All IT involves 4 components (only)
  • Users
  • Software
  • Data
  • Hardware
• They all relate to each other
• Change any one of these and the other three components have to adjust
• Aggregate these and you get a process
• Time will impose change anyway
• We can also consider:
  • Staff
  • Business Processes
  • Business Information
  • Facility
• And also
  • People
  • Information
  • Human Activity
  • Civilization (Stuff)
Four Fundamental (IT) Factors
[Diagram labels: Hardware, Users, Software, Data; Business Information, Business Process, Human Activity, All Information; Staff, Facility, People, Civilization; Time]
The Technology Layers
• The buying impulse descends through the stack
• The impact of technology change rises up the stack
• This ensures the eventual “legacification” of all technology
[Diagram: The Technology Layers. The buying impulse goes down; technology change rises up.]
Disruption in the Technology Layers
• Disruption (as innovation) can happen in any layer
• Where it occurs, it will impact all layers above it
• And it may also impact the layers below it (but less quickly)
• There is no such thing as future-proof, but some technologies definitely live longer
Tech Revolutions
• Mainframe Computer (Batch architecture)
• On-line Interaction (Centralized architecture)
• PC (Client Server)
• Internet (Multi-tier architecture)
• Mobile (Service Oriented architecture)
• Internet of Things (Event Driven Architecture)
Note that all of these disruptive changes were driven by hardware innovation.
Cloud
Centralized Computer Systems: limited process power; terminals only; few applications; no external data sources
PC Based Systems: moderate process power; PCs; spreadsheets & email; many applications; few external data sources
Integrated Systems: extensive process power; PCs & apps; analytics capability; wealth of applications; many external data sources
Parallelism: The Imp Out of the Bottle
• Multicore chips enabled parallelism
• It has changed the whole performance equation
• It enabled Big Data
• Big Data is really Big Processing
The Impact of Parallelism
We used to see 10x performance improvement every 6 years; now we see 1000x (and that’s just an approximation).
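To put the two figures on the same footing, the sketch below converts each into a per-year multiplier. It assumes the 1000x figure refers to the same 6-year window as the 10x figure, which the slide implies but does not state; the function name `annual_factor` is illustrative only.

```python
def annual_factor(total_factor: float, years: float) -> float:
    """Per-year multiplier that compounds to total_factor over `years`."""
    return total_factor ** (1.0 / years)

# Assumption: both figures describe improvement over a 6-year window.
old = annual_factor(10, 6)     # 10x every 6 years
new = annual_factor(1000, 6)   # 1000x every 6 years

print(f"old: {old:.2f}x per year, new: {new:.2f}x per year")
# prints "old: 1.47x per year, new: 3.16x per year"
```

Under that assumption, the yearly gain moves from roughly 1.5x to a little over 3x, which is the compounding behind the slide's rough numbers.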
Hardware Factors
• CPUs, GPUs & FPGAs
• Cross breeding
• SoCs
• 3D XPoint and PCM (and memristor?)
• SSDs & parallel access
• Parallel hardware architectures
Performance is accelerating and costs continue to fall.
The Perfect Storm (Software)
• The triumph of Open Source as a business model
• The dominance of Apache
  • Hadoop, the platform for data
  • Spark, for speed
  • Kafka, for connectivity
• The triumph of the cloud and its dominance
• Little data is also big data
• Cost challenges
Then the Data Lake evaporated into the Cloud
2  What is a Data Lake?
Everything in flux
• Hardware (network, storage, servers)
• Data Sources
• Data Staging
• Data Volumes
• Data Flow
• Data Governance
• Data Usage
• Data Structures
• Schema definition
• Ingest Speeds
• Data Workloads
Hadoop Applications
The Scale Out Applications
• Data Ingest & Staging
• Data Governance
• Software development platform
• Analytics environment
• Database/Data Warehouse
• Data Archiving
• Video rendering & other niche apps
The Data Lake involves just the first two and does not necessarily involve Hadoop.
Data Lake, Refinery, Hub, in Overview
Think Logical, Implement Physical
The Data Lake Analytics Picture
[Diagram labels: Data Sources; Data Streams; ETL; Staging Area (Hadoop); ETL; Data Warehouse or other location; Analytics; Round-Up and Wrangling; Service Mgt; Life Cycle Mgt; MetaData Discovery; MDM; MetaData Mgt; Data Cleansing; Data Lineage]
How Data Gets to be Wrong
• Accidentally born wrong
• Deliberately born wrong
• Defective sensor/data source
• Murdered (truncated, overwritten)
• Corrupted in flight (rare)
• Corrupted by bad code (surely not!)
• Corrupted by bad DBA
Data Governance
If data governance was important before Big Data (and it was), it is far more important in the era of Data Lakes.
What Needs To Be Governed
Data Governance
• Data Flows and Data Storage
• Security & Access
• Data cleansing and transformation
• Data meaning
• Data provenance and lineage
• Data archive and disposal
• Availability and performance
Analytics Is a Process Not an Activity
• Data Analytics is a multi-disciplinary end-to-end process
• Until recently it was a walled garden. But the walls were torn down by…
  • Data availability
  • Scalable technology
  • Open source tools
• It is now becoming an integrated process
Data Governance is a process, not an activity!!
The Global Map and Data Options
• Move the data to the processing
• Move the processing to the data
• Move the processing and the data
• Shard
All network nodes can be data creators, data stores and processing points.
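As an illustration of the "Shard" option, here is a minimal sketch of hash-based sharding, in which each record key is deterministically assigned to one of a fixed number of nodes. This is one common approach, not something the slides prescribe; the function names, the shard count and the `device-N` keys are all hypothetical.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a record key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer so the result is stable across runs.
    return int.from_bytes(digest[:8], "big") % num_shards

def partition(keys, num_shards):
    """Group keys by the shard they map to."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards

# Hypothetical device identifiers spread across three nodes.
device_ids = [f"device-{n}" for n in range(12)]
shards = partition(device_ids, 3)
for index, members in shards.items():
    print(index, members)
```

A stable hash is used rather than Python's built-in `hash()`, which is randomized per process for strings and would send the same key to different shards on different runs.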
Logical Data Lakes
Soon we will be speaking of a logical data lake and multiple physical data lakes.
3  Events and Streams
Big Data, Event Data – The Data of Everything
What is Big Data?
• Business data
• Traditional data
• Log file data
• Operational data
• Mobile data
• Location data
• Social network data
• Public data
• Commercial databases
• Streaming data
• Internet of Things
Events: Atoms and Molecules
A TRANSACTION is a MOLECULE of ATOMIC EVENTS
The ATOM of data has become the EVENT
It’s Become an Event Based World
Events
Think of events as drops of water. They can live in streams, and they can also live in data pools and data lakes.
Two Data Flows
The Traffic Cop (Events)
Event Types
• Instantiation Event
• A State Report
• A Trigger Event
• A Correction Event
We also need to consider:
• Data Refinement
• Aggregations
• Homogeneous Collections
• Derived Data
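The four event types can be sketched as an enumeration plus a small function that folds events into current state. The slides only name the types, so the semantics here are assumptions: an instantiation creates an entity, a state report and a correction both overwrite reported values, and a trigger requests an action without changing state. The names `EventType`, `apply_event` and `pump-1` are hypothetical.

```python
from enum import Enum

class EventType(Enum):
    INSTANTIATION = "instantiation"
    STATE_REPORT = "state_report"
    TRIGGER = "trigger"
    CORRECTION = "correction"

def apply_event(state: dict, event_type: EventType, entity: str, payload: dict) -> list:
    """Fold one event into `state` and return any actions it requests."""
    actions = []
    if event_type is EventType.INSTANTIATION:
        state[entity] = dict(payload)                  # entity comes into being
    elif event_type in (EventType.STATE_REPORT, EventType.CORRECTION):
        state.setdefault(entity, {}).update(payload)   # later values replace earlier ones
    elif event_type is EventType.TRIGGER:
        actions.append((entity, payload.get("action")))  # no state change, just a request
    return actions

state = {}
apply_event(state, EventType.INSTANTIATION, "pump-1", {"status": "new"})
apply_event(state, EventType.STATE_REPORT, "pump-1", {"temp_c": 71})
apply_event(state, EventType.CORRECTION, "pump-1", {"temp_c": 17})   # fixes a mistyped reading
actions = apply_event(state, EventType.TRIGGER, "pump-1", {"action": "shutdown"})
print(state, actions)
```

In this sketch a correction is applied the same way as a state report; a real system would more likely keep both the original and the corrected event so that lineage is preserved.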
Event Based IoT Architecture
• The pulse and the threshold alert
• Some of this involves distributed processing
• There are known apps and unknown apps, so analytical exploration needs to be enabled
• Only aggregations will migrate
[Diagram labels: Sensors, controllers, CPUs; Source Proc.; Depot; Depot Proc.; Central Hub; Central Proc.; Data]
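A minimal sketch of processing at the source, assuming numeric sensor readings: raw values are checked against a threshold locally, and only a compact aggregate is passed up to the depot or central hub. The function name, the threshold and the sample readings are hypothetical, and the batch is assumed to be non-empty.

```python
from statistics import mean

def process_at_source(readings, threshold):
    """Return (alerts, aggregate) for one non-empty batch of raw readings."""
    # Threshold alert: raised locally, without waiting for central processing.
    alerts = [r for r in readings if r > threshold]
    # Only this summary migrates upstream; the raw readings stay at the source.
    aggregate = {
        "count": len(readings),
        "min": min(readings),
        "max": max(readings),
        "mean": mean(readings),
    }
    return alerts, aggregate

readings = [20.5, 21.0, 35.2, 20.8]   # hypothetical temperature samples
alerts, aggregate = process_at_source(readings, threshold=30.0)
print(alerts)
print(aggregate)
```

The trade-off is that anything not captured in the aggregate is unavailable centrally, which is why the slide's point about enabling analytical exploration for unknown apps matters.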
Events and Event Data
• Time
• Geographic location
• Virtual/logical location
• Source device
• Device ID
• Actors
• Ownership/Provenance
• Values
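The attributes listed above map naturally onto a record type. The sketch below is one possible shape, not a schema from the presentation: the field names, their types and which fields are optional are all assumptions, as are the sample values.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, Optional, Tuple

@dataclass
class Event:
    time: datetime                                        # when the event occurred
    device_id: str                                        # Device ID
    source_device: str                                    # kind of device that emitted it
    values: Dict[str, Any]                                # the measured or reported values
    geo_location: Optional[Tuple[float, float]] = None    # (latitude, longitude)
    logical_location: Optional[str] = None                # virtual/logical location
    actors: Tuple[str, ...] = ()                          # people or systems involved
    provenance: Optional[str] = None                      # ownership/provenance

# Hypothetical example event.
event = Event(
    time=datetime(2016, 10, 26, 15, 0, tzinfo=timezone.utc),
    device_id="sensor-0042",
    source_device="thermostat",
    values={"temp_c": 21.5},
    geo_location=(51.5, -0.12),
    logical_location="building-a/floor-2",
    actors=("hvac-controller",),
    provenance="facilities-team",
)
print(event)
```

Storing the timestamp with an explicit time zone avoids ambiguity when events from different locations land in the same lake.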
Spark, Storm, Flink & Kafka
• Spark has dethroned Hadoop as a platform and has momentum, both for microbatch and streaming
• Storm provides streaming (event processing); paired with a batch layer, it supports batch and streaming concurrently via the lambda architecture
• Flink was purpose built for streaming
• Kafka is the pipe
• Lambda and Zeta Architectures…
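The lambda architecture mentioned above can be illustrated with a toy, in-memory event counter: a batch view is periodically recomputed from the full event log, a speed view covers events that arrived since the last batch run, and queries merge the two. This sketch uses no Spark, Storm, Flink or Kafka APIs; the class and method names are hypothetical stand-ins for those layers.

```python
from collections import Counter

class LambdaCounter:
    """Toy lambda architecture: batch view plus speed view, merged at query time."""

    def __init__(self):
        self.master_log = []          # append-only record of every event
        self.batch_view = Counter()   # recomputed from the master log by run_batch()
        self.speed_view = Counter()   # incremental counts since the last batch run

    def ingest(self, key):
        # New events go to both the master log and the speed layer.
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        # Batch layer: recompute from scratch, then reset the speed layer.
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, key):
        # Serving layer: merge the batch and real-time views.
        return self.batch_view[key] + self.speed_view[key]

lam = LambdaCounter()
for key in ["a", "a", "b"]:
    lam.ingest(key)
before_batch = lam.query("a")   # answered entirely by the speed view
lam.run_batch()
lam.ingest("a")
after_batch = lam.query("a")    # batch view plus one newer event
print(before_batch, after_batch)
# prints "2 3"
```

The point of the pattern is that the batch recomputation can correct anything the incremental path got wrong, at the cost of maintaining two code paths.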
In Summary
1  Disturbance in the Force
2  What is a Data Lake, exactly?
3  Streams and Events
The Central Hub: Defining the Data Lake
Questions?
THANK YOU!
FIND OUT MORE at InsideAnalysis.com

