10 Clusters
1,000 Users
10,000 Flows
The Dali Experience at LinkedIn
Carl Steinbach
Senior Staff Software Engineer
LinkedIn Data Analytics Infrastructure Group
in/carlsteinbach
@cwsteinbach
Hadoop @ LinkedIn: Circa 2008
1 cluster
20 nodes
10 users
10 production workflows
MapReduce, Pig
Hadoop @ LinkedIn: NOW
> 10 clusters
> 10,000 nodes
> 1,000 users
Hundreds of production workflows, thousands of
development flows and ad-hoc queries
MapReduce, Pig, Hive, Gobblin, Cubert, Scalding, Spark,
Presto, …
What did we learn along the way?
Scaling Hardware Infrastructure is Hard
What did we learn along the way?
Scaling Human Infrastructure is Harder
Hidden, constantly evolving dependencies bind producers, consumers, and infra providers
Motivations: Producers, Consumers
Data Consumers have to manage too many details:
• Where is the data located? (cluster, path)
• How is the data partitioned? (logical → physical mapping)
• How do I read the data? (storage format, wire protocol)
Data Producers are flying blind:
• Who is consuming the data that I produce?
• Will anything break if I make this change?
• Deprecating legacy schemas is too expensive.
Motivations: Infra Providers
This mess makes things really hard for infrastructure providers!
Lots of optimizations are impossible because producer/consumer logic locks us into
what should be backend decisions:
• Storage format
• Physical partitioning scheme
• Data location, wire protocol
Lots of redundant code paths to support: Spark, Hive, Presto, Pig, etc.
Dali Vision and Mission
Motivation:
Make analytics infrastructure invisible by abstracting away the
underlying physical details
Mission: Make data on HDFS easier to access + manage
Filesystem: protocol-independent, multi-cluster
Datasets: tables not files
Views: virtual datasets, contract management for producers and consumers
Lineage and Discovery: map datasets to producers, consumers, and track
changes over time
Dali Dataset API: Catalog Service
Is a Dataset API Enough?
Some use cases at LinkedIn:
Structural transformations (flattening and nesting)
Muxing and de-muxing data (unions)
Patching bad data
Backward incompatible changes (intentional and otherwise…)
Code reuse
What we need:
Ability to decouple the API from the dataset
Producer control over public and private APIs
Tooling and processes to support safe evolution of these APIs
Dali View
A sample view
CREATE VIEW profile_flattened
TBLPROPERTIES(
'functions' =
'get_profile_section:isb.GetProfileSections',
'dependencies' =
'com.linkedin.dali-udfs:get-profile-sections:0.0.5')
AS SELECT
get_profile_section(...)
FROM
prod_identity.profile;
Reading a Dali View from Pig
register ivy://com.linkedin.dali:dali-all:2.3.52;
data = LOAD 'dalids:///tracking.pageviewevent'
USING DaliStorage();
data = FILTER data
BY datepartition >= '2016-05-08-00'
AND datepartition <= '2016-05-10-00';
View Versioning
• Views can evolve to add/remove fields, update UDF and views/table
dependencies, update logic, etc.
• Multiple versions of each view can be registered with Dali at the same time.
• Consumers can migrate to newer versions at their own pace.
• Incremental upgrades reduce the cost and risk of change!
Example:
For a database foo that contains a view bar, we could have
bar_1_0_0, bar_1_1_0, and bar_2_0_0 registered with Dali at the same time.
We also register bar, a "latest" pointer to bar_2_0_0.
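The versioned-registration scheme above can be sketched as a small resolver. The registry set, naming convention, and resolve() helper here are illustrative assumptions, not Dali's actual API:

```python
# Illustrative sketch of resolving versioned view names like bar_1_0_0.
# The registry set and resolve() helper are hypothetical, not Dali APIs.

def parse_version(name: str, base: str):
    """Extract (major, minor, patch) from a name like 'bar_2_0_0'."""
    suffix = name[len(base) + 1:]               # e.g. '2_0_0'
    return tuple(int(p) for p in suffix.split("_"))

def resolve(registry, base, version=None):
    """Return the exact view name for a requested version.

    version=None follows the 'latest' pointer; otherwise an exact
    (major, minor, patch) tuple pins the consumer to that version."""
    candidates = [n for n in registry if n.startswith(base + "_")]
    if version is None:
        return max(candidates, key=lambda n: parse_version(n, base))
    return f"{base}_{version[0]}_{version[1]}_{version[2]}"

registry = {"bar_1_0_0", "bar_1_1_0", "bar_2_0_0"}
resolve(registry, "bar")             # 'bar_2_0_0' -- the latest pointer
resolve(registry, "bar", (1, 1, 0))  # 'bar_1_1_0' -- a pinned consumer
```

Pinning lets a consumer stay on bar_1_1_0 while bar_2_0_0 rolls out, which is exactly what makes incremental migration possible.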
Semantic Versioning for Views
Major Version
• Backward incompatible changes to the view schema
• Removing a field
• Changing the physical type of an existing field
Minor Version
• Backward compatible changes visible to consumers of the view
• Adding a new field to the schema
Patch Version
• Everything else that doesn’t alter the schema or semantic output of the view
• Updating one of the view’s binary dependencies
• Updating SQL for better execution plan
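The bump rules above can be encoded as a small classifier. The change categories and function names below are illustrative assumptions, not part of Dali:

```python
# Hypothetical sketch of the semantic-versioning rules for views:
# given a list of change descriptions, compute the next version.

MAJOR_CHANGES = {"remove_field", "change_field_type"}  # backward incompatible
MINOR_CHANGES = {"add_field"}                          # compatible, but visible

def required_bump(changes):
    if any(c in MAJOR_CHANGES for c in changes):
        return "major"
    if any(c in MINOR_CHANGES for c in changes):
        return "minor"
    return "patch"  # e.g. dependency bumps, SQL rewrites with same output

def bump(version, changes):
    major, minor, patch = version
    kind = required_bump(changes)
    if kind == "major":
        return (major + 1, 0, 0)
    if kind == "minor":
        return (major, minor + 1, 0)
    return (major, minor, patch + 1)

bump((1, 1, 0), ["remove_field"])       # (2, 0, 0)
bump((1, 1, 0), ["add_field"])          # (1, 2, 0)
bump((1, 1, 0), ["update_dependency"])  # (1, 1, 1)
```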
Leveraging Existing LI Infra Tools
Query view/UDF version dependency graph
who-depends-on-me?
Deprecate, EOL, and purge a specific view/UDF version
Plug into existing global namespace management provided by LI developer tools
Enforce referential integrity for views at deployment time
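A "who-depends-on-me?" query over the view/UDF dependency graph amounts to a reverse reachability search. The graph representation and example edges below are illustrative assumptions:

```python
# Sketch of a who-depends-on-me query: given edges "view -> things it
# depends on", find every downstream dependent of a node, transitively.
# The example graph is hypothetical.
from collections import defaultdict

def who_depends_on_me(deps, target):
    # Invert the edges: dependency -> direct dependents.
    rdeps = defaultdict(set)
    for view, uses in deps.items():
        for u in uses:
            rdeps[u].add(view)
    # Graph search over the reversed edges collects transitive dependents.
    seen, frontier = set(), [target]
    while frontier:
        node = frontier.pop()
        for dependent in rdeps[node]:
            if dependent not in seen:
                seen.add(dependent)
                frontier.append(dependent)
    return seen

deps = {
    "profile_flattened": {"prod_identity.profile", "GetProfileSections:0.0.5"},
    "member_summary": {"profile_flattened"},
}
who_depends_on_me(deps, "GetProfileSections:0.0.5")
# -> {'profile_flattened', 'member_summary'}
```

An empty result for a given view/UDF version is the signal that it is safe to deprecate, EOL, and purge.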
Contract Law for Datasets
Vague, poorly defined contracts bind data producers to consumers
Physical types don’t tell us much
STRING or URI?
STRING or ENUM?
Semantic types help, but what about other types of relationships?
X IS NOT NULL
a_time is in seconds, b_time is in milliseconds
Attributes of a good contract
Easy to find
Easy to understand
Easy to change
Hijacking an existing process
Express contracts as logical constraints against the fields of a view
Make the contract easy to find by storing it in the view’s Git repo
Contract negotiation follows an existing process
Data producer (view owner) controls the ACL on the view repo
Data consumer requests a contract change via a ReviewBoard review request
View owner either accepts or rejects the request
If accepted, view version is bumped to notify downstream consumers
If rejected, consumer still has the option of committing the constraint to their own repo
Contract  Constraint based testing for views
Contract  Data Quality tests
Case Study: Project Voyager
Views allowed us to parallelize
development by decoupling the online
and offline sides of the project.
• Read existing data using new
schemas
• Legacy apps can continue using old
schemas
~ 100 views for the Voyager project
• 31 consumer (leaf) views
• 63 producer views
• Dependencies on 48 unique tables
Why Dali?
Consumers
Make data stable, predictable, discoverable
Producers
Explicit, manageable contracts with consumers
Frictionless, familiar process for modifying existing contracts
Infra Providers
Freedom to optimize
Flow portability → DR, multi-DC scheduling
Simplifying with Views
©2014 LinkedIn Corporation. All Rights Reserved.
csteinbach@linkedin.com
linkedin.com/in/carlsteinbach
@cwsteinbach

LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016


Editor's Notes

#3: Since "People You May Know" is long, we call it PYMK at LinkedIn. The original version ran on Oracle, and it worked by attempting to find overlaps between every pair of people: Did they share the same school? Did they work at the same company? One big indicator was common connections, and for that we used something called triangle closing.