H104: Harnessing the Hadoop Ecosystem
Optimizations in Apache Hive
Jason Huang, Senior Solutions Architect – Qubole, Inc.
May 12, 2015
NYC Data Summit Hadoop Day
A little bit about Qubole
Ashish Thusoo
Founder & CEO
Joydeep Sen Sarma
Founder & CTO
Founded in 2011 by the pioneers of “big data” at
Facebook and the creators of the Apache Hive project.
Based in Mountain View, CA with offices in Bangalore,
India. Investments by Charles River, LightSpeed,
Norwest Ventures.
2015 CNBC Disruptor 50 Companies – announced today!
World class product and engineering team from:
Hive – SQL on Hadoop
●  A system for managing and querying unstructured data as if it were
structured
●  Uses Map-Reduce for execution
●  HDFS for Storage (or Amazon S3)
●  Key Building Principles
●  SQL as a familiar data warehousing tool
●  Extensibility (Pluggable map/reduce scripts in the language of your
choice, Rich and User Defined Data Types, User Defined Functions)
●  Interoperability (Extensible Framework to support different file and data
formats)
●  Performance
Why Hive?
●  Problem : Unlimited data
●  Terabytes everyday
●  Wide Adoption of Hadoop
●  Scalable/Available
●  But, Hadoop can be …
●  Complex
●  Different Paradigm
●  Map-Reduce hard to program
Qubole DataFlow Diagram
[Architecture diagram] User access (Qubole UI via browser, SDK, ODBC) reaches
Qubole’s AWS account over the REST API (HTTPS). Qubole’s account hosts an
ephemeral web tier, web servers, an encrypted result cache, the default Hive
metastore, and an RDS instance holding Qubole user and account configurations
(encrypted credentials). The customer’s AWS account runs ephemeral Hadoop
clusters managed by Qubole over SSH (master and slave nodes with encrypted
HDFS), backed by Amazon S3 with S3 Server Side Encryption, an optional custom
Hive metastore, and optional data flow to other RDS or Redshift instances.
Encryption Options:
a)  Qubole can encrypt the result cache
b)  Qubole supports encryption of the ephemeral drives used for HDFS
c)  Qubole supports S3 Server Side Encryption
De-normalizing data:
Normalization:
-  models data tables with certain rules to deal with redundancy
-  normalizing creates multiple relational tables
-  requires joins at runtime to produce results
Joins are expensive operations and one of the most common causes of
performance issues. Because of this, it’s a good idea to avoid highly
normalized table structures when they would require join queries to derive
the desired metrics.
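As a sketch (table and column names here are hypothetical), a normalized pair of tables needs a join at query time, while a denormalized table answers the same question with a single-table scan:

```sql
-- Normalized: requires a join at query time
SELECT o.order_id, c.region
FROM orders o JOIN customers c ON (o.customer_id = c.customer_id);

-- Denormalized: region is copied into the orders table at load time,
-- so the same question becomes a single-table scan
SELECT order_id, region FROM orders_denorm;
```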
Partitioning Tables:
Hive partitioning is an effective method to improve query performance on
larger tables. The partition key is best chosen as a low-cardinality attribute.
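A minimal sketch of a partitioned table (names are hypothetical); the partition key here is a low-cardinality date column:

```sql
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (dt STRING);

-- queries that filter on the partition key only scan matching partitions
SELECT COUNT(*) FROM page_views WHERE dt = '2015-05-12';
```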
Bucketing:
-  improves the join performance if the bucket key and join keys are
common
-  distributes the data into different buckets based on the hash of the
bucket key
-  reduces I/O scans during the join process if the join is happening
on the same keys (columns)
Note: set the bucketing flag (hive.enforce.bucketing=true) each time before
writing data to a bucketed table.
To leverage bucketing in a join operation, set
hive.optimize.bucketmapjoin=true. This setting hints to Hive to do a
bucket-level join during the map stage of the join.
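A sketch of bucketing a table on its join key (table and column names are hypothetical; bucket count is an example):

```sql
CREATE TABLE user_events (
  user_id BIGINT,
  event   STRING
)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

SET hive.enforce.bucketing=true;       -- set before every write to a bucketed table
INSERT OVERWRITE TABLE user_events
SELECT user_id, event FROM raw_events;

SET hive.optimize.bucketmapjoin=true;  -- enable bucket-level map joins
```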
Map join:
Really efficient if the table on one side of a join is small enough to fit
in memory.
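A map join can be requested explicitly with a hint, or Hive can convert qualifying joins automatically; a sketch (table names hypothetical):

```sql
-- let Hive convert joins to map joins when one side is small enough
SET hive.auto.convert.join=true;

-- or hint it explicitly: load dim_table into memory on each mapper
SELECT /*+ MAPJOIN(d) */ f.id, d.name
FROM fact_table f JOIN dim_table d ON (f.dim_id = d.id);
```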
File Input Formats:
-  play a critical role in Hive performance
Text-based input formats (e.g., JSON):
-  not a good choice for a large production system where data volume is
really high
-  readable formats take a lot of space and carry parsing overhead
(e.g., JSON parsing)
To address these problems, Hive comes with columnar input formats like
RCFile, ORC, etc. Columnar formats reduce read operations in queries by
allowing each column to be accessed individually.
Other binary formats like Avro, SequenceFile, and Thrift can be effective in
various use cases.
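Choosing a columnar format is a one-line change at table creation; a sketch (table names hypothetical):

```sql
CREATE TABLE page_views_orc (
  user_id BIGINT,
  url     STRING
)
STORED AS ORC;

-- convert existing text-format data into the ORC table
INSERT OVERWRITE TABLE page_views_orc
SELECT user_id, url FROM page_views_text;
```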
Compress map/reduce output:
-  reduce the intermediate data volume
-  reduces the amount of data transfers between mappers and reducers
over the network
Note: gzip-compressed files are not splittable, so apply with caution.
File sizes should not be larger than a few hundred megabytes
-  otherwise they can potentially lead to an imbalanced job
-  compression codec options: e.g., Snappy, LZO, bzip2, etc.
For map output compression: set mapred.compress.map.output=true
For job output compression: set mapred.output.compress=true
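Putting these together with a codec (Snappy here; the property names are from the classic MapReduce API and the codec class path is the standard Hadoop one):

```sql
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compress=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```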
Parallel execution:
Hadoop can execute MapReduce jobs in parallel, and Hive queries whose stages
are independent of one another can automatically take advantage of this.
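Stage-level parallelism is controlled by Hive settings (the thread count below is an example value):

```sql
SET hive.exec.parallel=true;            -- run independent stages concurrently
SET hive.exec.parallel.thread.number=8; -- maximum stages to run at once
```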
Vectorization:
-  allows Hive to process a batch of rows in ORC format together instead
of processing one row at a time
Each batch consists of column vectors, each usually an array of
primitive types. Operations are performed on the entire column vector,
which improves instruction pipelining and cache usage.
To enable: set hive.vectorized.execution.enabled=true
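A minimal sketch (table name hypothetical); vectorization applies to queries over ORC-backed tables:

```sql
SET hive.vectorized.execution.enabled=true;

-- rows are now processed in batches (typically 1024 at a time)
SELECT COUNT(*) FROM page_views_orc WHERE user_id > 0;
```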
Sampling:
-  allows users to take a subset of a dataset and analyze it, without having
to analyze the entire dataset
Hive offers a built-in TABLESAMPLE clause that allows you to sample
your tables.
TABLESAMPLE can sample at various granularity levels
-  return only subsets of buckets (bucket sampling)
-  HDFS blocks (block sampling)
-  first N records from each input split
Alternatively, you can implement your own UDF that filters out records
according to your sampling algorithm.
Sampling on Buckets:
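The bucket and block forms of TABLESAMPLE look like this, assuming a table bucketed on user_id (table and column names are hypothetical):

```sql
-- bucket sampling: read 1 of 32 buckets, hashing on user_id
SELECT * FROM user_events TABLESAMPLE(BUCKET 1 OUT OF 32 ON user_id);

-- block sampling: roughly 10 percent of the input data
SELECT * FROM user_events TABLESAMPLE(10 PERCENT);
```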
Unit Testing:
-  In Hive, you can unit test UDFs, SerDes, streaming scripts, Hive queries
and more.
-  Verify the correctness of your whole HiveQL query without touching a
Hadoop cluster.
-  Executing a HiveQL query in local mode takes literally seconds,
compared to the minutes, hours, or days it can take in Hadoop mode.
Various tools are available: e.g., HiveRunner, hive_test, and Beetest.
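Local-mode execution can be enabled with a Hive setting, letting small jobs run on the local machine instead of the cluster:

```sql
SET hive.exec.mode.local.auto=true;  -- Hive runs small enough jobs locally
```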
Qubole Data Service
Use Cases and Additional Information
“Qubole has enabled more users within Pinterest to get to the
data and has made the data platform a lot more scalable and
stable”
Mohammad Shahangian - Lead, Data Science and Infrastructure
Moved to Qubole from Amazon EMR for stability, and rapidly expanded
big data usage by giving data access to users beyond developers.
Rapid expansion of big data beyond developers (240 users
in a 600-person company)
Use Cases / User and Query Growth
Rapid expansion in use cases ranging from ETL, search,
adhoc querying, product analytics etc.
Rock-solid infrastructure sees 50% fewer failures as
compared to AWS Elastic MapReduce
Enterprise scale processing and data access
“We needed something that was reliable and easy to learn,
setup, use and put into production without the risk and high
expectations that comes with committing millions of dollars in
upfront investment. Qubole was that thing.”
Marc Rosen - Sr. Director, Data Analytics
Moved to big data in the cloud (from internal Oracle
clusters) because getting to analysis was much
quicker than operating the infrastructure themselves.
Used to answer client queries and power client
dashboards.
Use Cases / # Commands Per Month
[Chart: number of queries per month, Aug-13 through Feb-14, growing from
near zero toward 5,000]
Segment audiences based on their behavior including such
topics as user pathway and multi-dimensional recency
analysis
Build customer profiles (both uni/multivariate) across
thousands of first party (i.e., client CRM files) and third
party (i.e., demographic) segments
Simplify attribution insights showing the effects of upper
funnel prospecting on lower funnel remarketing media
strategies
Roles across the organization: Operations Analyst, Marketing Ops Analyst,
Data Architect, Business Users, Product Support, Customer Support,
Developer, Sales Ops, Product Managers, Data Infrastructure
Links for more information
http://www.datacenterknowledge.com/archives/2015/04/02/hybrid-clouds-need-for-speed/
http://engineering.pinterest.com/post/92742371919/powering-big-data-at-pinterest
http://www.itbusinessedge.com/slideshows/six-details-your-big-data-provider-wont-tell-you.html
http://www.marketwired.com/press-release/qubole-reports-rapid-adoption-of-its-self-service-big-data-analytics-platform-1990272.htm