10 Clusters
1,000 Users
10,000 Flows
The Dali Experience at LinkedIn
Carl Steinbach
Senior Staff Software Engineer
LinkedIn Data Analytics Infrastructure Group
in/carlsteinbach
@cwsteinbach
Hadoop @ LinkedIn: Circa 2008
1 cluster
20 nodes
10 users
10 production workflows
MapReduce, Pig
Hadoop @ LinkedIn: NOW
> 10 clusters
> 10,000 nodes
> 1,000 users
Hundreds of production workflows, thousands of
development flows and ad-hoc queries
MapReduce, Pig, Hive, Gobblin, Cubert, Scalding, Spark,
Presto, …
What did we learn along the way?
Scaling Hardware Infrastructure is Hard
What did we learn along the way?
Scaling Human Infrastructure is Harder
Hidden, constantly evolving dependencies bind producers, consumers, and infra providers
Motivations: Producers, Consumers
Data Consumers have to manage too many details:
• Where is the data located? (cluster, path)
• How is the data partitioned? (logical → physical mapping)
• How do I read the data? (storage format, wire protocol)
Data Producers are flying blind:
• Who is consuming the data that I produce?
• Will anything break if I make this change?
• Deprecating legacy schemas is too expensive.
Motivations: Infra Providers
This mess makes things really hard for infrastructure providers!
Lots of optimizations are impossible because producer/consumer logic locks us into
what should be backend decisions:
• Storage format
• Physical partitioning scheme
• Data location, wire protocol
Lots of redundant code paths to support: Spark, Hive, Presto, Pig, etc.
Dali Vision and Mission
Motivation:
Make analytics infrastructure invisible by abstracting away the
underlying physical details
Mission: Make data on HDFS easier to access + manage
Filesystem: protocol-independent, multi-cluster
Datasets: tables not files
Views: virtual datasets, contract management for producers and consumers
Lineage and Discovery: map datasets to producers, consumers, and track
changes over time
Dali Dataset API: Catalog Service
Is a Dataset API Enough?
Some use cases at LinkedIn:
Structural transformations (flattening and nesting)
Muxing and de-muxing data (unions)
Patching bad data
Backward incompatible changes (intentional and otherwise…)
Code reuse
What we need:
Ability to decouple the API from the dataset
Producer control over public and private APIs
Tooling and processes to support safe evolution of these APIs
Dali View
A sample view
CREATE VIEW profile_flattened
TBLPROPERTIES(
'functions' =
'get_profile_section:isb.GetProfileSections',
'dependencies' =
'com.linkedin.dali-udfs:get-profile-sections:0.0.5')
AS SELECT
get_profile_section(...)
FROM
prod_identity.profile;
Reading a Dali View from Pig
register ivy://com.linkedin.dali:dali-all:2.3.52;
data = LOAD 'dalids:///tracking.pageviewevent'
USING DaliStorage();
data = FILTER data
BY datepartition >= '2016-05-08-00'
AND datepartition <= '2016-05-10-00';
View Versioning
• Views can evolve to add/remove fields, update UDF and views/table
dependencies, update logic, etc.
• Multiple versions of each view can be registered with Dali at the same time.
• Consumers can migrate to newer versions at their own pace.
• Incremental upgrades reduce the cost and risk of change!
Example:
For a database foo that contains a view bar, we could have
bar_1_0_0, bar_1_1_0, and bar_2_0_0 registered with Dali at the same time.
We also register bar, a "latest" pointer to bar_2_0_0.
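The versioned-registration scheme above can be sketched as a small resolver. The registry set, naming convention, and resolve() helper here are illustrative assumptions, not Dali's actual API:

```python
# Illustrative sketch of resolving versioned view names like bar_1_0_0.
# The registry set and resolve() helper are hypothetical, not Dali APIs.

def parse_version(name: str, base: str):
    """Extract (major, minor, patch) from a name like 'bar_2_0_0'."""
    suffix = name[len(base) + 1:]               # e.g. '2_0_0'
    return tuple(int(p) for p in suffix.split("_"))

def resolve(registry, base, version=None):
    """Return the exact view name for a requested version.

    version=None follows the 'latest' pointer; otherwise an exact
    (major, minor, patch) tuple pins the consumer to that version."""
    candidates = [n for n in registry if n.startswith(base + "_")]
    if version is None:
        return max(candidates, key=lambda n: parse_version(n, base))
    return f"{base}_{version[0]}_{version[1]}_{version[2]}"

registry = {"bar_1_0_0", "bar_1_1_0", "bar_2_0_0"}
resolve(registry, "bar")             # 'bar_2_0_0' -- the latest pointer
resolve(registry, "bar", (1, 1, 0))  # 'bar_1_1_0' -- a pinned consumer
```

Pinning lets a consumer stay on bar_1_1_0 while bar_2_0_0 rolls out, which is exactly what makes incremental migration possible.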
Semantic Versioning for Views
Major Version
• Backward incompatible changes to the view schema
• Removing a field
• Changing the physical type of an existing field
Minor Version
• Backward compatible changes visible to consumers of the view
• Adding a new field to the schema
Patch Version
• Everything else that doesn’t alter the schema or semantic output of the view
• Updating one of the view’s binary dependencies
• Updating SQL for better execution plan
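The bump rules above can be encoded as a small classifier. The change categories and function names below are illustrative assumptions, not part of Dali:

```python
# Hypothetical sketch of the semantic-versioning rules for views:
# given a list of change descriptions, compute the next version.

MAJOR_CHANGES = {"remove_field", "change_field_type"}  # backward incompatible
MINOR_CHANGES = {"add_field"}                          # compatible, but visible

def required_bump(changes):
    if any(c in MAJOR_CHANGES for c in changes):
        return "major"
    if any(c in MINOR_CHANGES for c in changes):
        return "minor"
    return "patch"  # e.g. dependency bumps, SQL rewrites with same output

def bump(version, changes):
    major, minor, patch = version
    kind = required_bump(changes)
    if kind == "major":
        return (major + 1, 0, 0)
    if kind == "minor":
        return (major, minor + 1, 0)
    return (major, minor, patch + 1)

bump((1, 1, 0), ["remove_field"])       # (2, 0, 0)
bump((1, 1, 0), ["add_field"])          # (1, 2, 0)
bump((1, 1, 0), ["update_dependency"])  # (1, 1, 1)
```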
Leveraging Existing LI Infra Tools
Query view/UDF version dependency graph
who-depends-on-me?
Deprecate, EOL, and purge a specific view/UDF version
Plug into existing global namespace management provided by LI developer tools
Enforce referential integrity for views at deployment time
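A "who-depends-on-me?" query over the view/UDF dependency graph amounts to a reverse reachability search. The graph representation and example edges below are illustrative assumptions:

```python
# Sketch of a who-depends-on-me query: given edges "view -> things it
# depends on", find every downstream dependent of a node, transitively.
# The example graph is hypothetical.
from collections import defaultdict

def who_depends_on_me(deps, target):
    # Invert the edges: dependency -> direct dependents.
    rdeps = defaultdict(set)
    for view, uses in deps.items():
        for u in uses:
            rdeps[u].add(view)
    # Graph search over the reversed edges collects transitive dependents.
    seen, frontier = set(), [target]
    while frontier:
        node = frontier.pop()
        for dependent in rdeps[node]:
            if dependent not in seen:
                seen.add(dependent)
                frontier.append(dependent)
    return seen

deps = {
    "profile_flattened": {"prod_identity.profile", "GetProfileSections:0.0.5"},
    "member_summary": {"profile_flattened"},
}
who_depends_on_me(deps, "GetProfileSections:0.0.5")
# -> {'profile_flattened', 'member_summary'}
```

An empty result for a given view/UDF version is the signal that it is safe to deprecate, EOL, and purge.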
Contract Law for Datasets
Vague, poorly defined contracts bind data producers to consumers
Physical types don’t tell us much
STRING or URI?
STRING or ENUM?
Semantic types help, but what about other types of relationships?
X IS NOT NULL
a_time is in seconds, b_time is in milliseconds
Attributes of a good contract
Easy to find
Easy to understand
Easy to change
Hijacking an existing process
Express contracts as logical constraints against the fields of a view
Make the contract easy to find by storing it in the view’s Git repo
Contract negotiation follows an existing process
Data producer (view owner) controls the ACL on the view repo
Data consumer requests a contract change via a ReviewBoard review request
View owner either accepts or rejects the request
If accepted, view version is bumped to notify downstream consumers
If rejected, consumer still has the option of committing the constraint to their own repo
Contract  Constraint based testing for views
Contract  Data Quality tests
Case Study: Project Voyager
Views allowed us to parallelize
development by decoupling the online
and offline sides of the project.
• Read existing data using new
schemas
• Legacy apps can continue using old
schemas
~ 100 views for the Voyager project
• 31 consumer (leaf) views
• 63 producer views
• Dependencies on 48 unique tables
Why Dali?
Consumers
Make data stable, predictable, discoverable
Producers
Explicit, manageable contracts with consumers
Frictionless, familiar process for modifying existing contracts
Infra Providers
Freedom to optimize
Flow portability → DR, multi-DC scheduling
Simplifying with Views
©2014 LinkedIn Corporation. All Rights Reserved.
csteinbach@linkedin.com
linkedin.com/in/carlsteinbach
@cwsteinbach

LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016


Editor's Notes

#3: Since "People You May Know" is long, we call it PYMK at LinkedIn. The original version ran on Oracle, and it worked by attempting to find overlaps between every pair of people: Did they share the same school? Did they work at the same company? One big indicator was common connections, and for that we used something called triangle closing.