© Hortonworks Inc. 2013.
Hive 0.13: An upgrade in Performance, Scaling, Security and Multi-tenancy
Vikram Dixit
(vikram@apache.org)
Hive – SQL on Hadoop
• Open source Apache project
• Started by Facebook in 2009
• Tools to enable easy data extract/transform/load (ETL)
• Work with structured, unstructured, semi-structured data
• Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™
• Query execution via MapReduce/Tez
• Metadata sharing via HCatalog allows your Pig scripts to work with Hive tables
Using Hive Effectively
• Understand Hive’s use case
– Current focus on making it a fast analytics engine that scales
– Transactions coming
• Choose the storage format that is right for you
– ORC File - highest compression, metadata used to enable faster reads
– Parquet - intermediate compression, fast reads
– RC File - legacy, most widely used but suffers in performance
– Text - ease of use but lowest in terms of performance
• Use the right execution engine
– Tez is the recommended execution engine for performance
– MapReduce is used by default in cases where Tez cannot yet run the query
• Use the right configuration flags
– Many optimizations are turned on by default
– Some are not and need to be tuned for your cluster, because a good universal default is hard to come up with
What’s new in Hive 0.13
• Speed
– Hive on Tez – Broadcast Joins, Bucket Map Joins
– Vectorized Query processing
– Split elimination for ORC file
– Parquet file format support
• Scale
– Smaller hash tables allowing more scalable map joins
– More scalable dynamic partition loads
What’s new in Hive 0.13
• More SQL improvements
– SQL standard Authorization
– Char support, Decimal improvements
– Permanent UDFs
– Streaming ingest from Flume for ACID capability
• Additional Improvements
– Hive Server 2 improvements
– HCatalog parity with Hive data types
– JDBC improvements viz. job cancel, async execution
• Even more goodies
– Mavenization
– Parallel test framework
– Lots of documentation
Stinger Project
(announced February 2013)
Batch AND Interactive SQL-IN-Hadoop
Stinger Initiative: a broad, community-based effort to drive the next generation of Hive
Goals:
• Speed: improve Hive query performance by 100X to allow for interactive query times (seconds)
• Scale: the only SQL interface to Hadoop designed for queries that scale from TB to PB
• SQL: support the broadest range of SQL semantics for analytic applications running against Hadoop
…all IN Hadoop
• Hive 0.11, May 2013:
– Base Optimizations
– SQL Analytic Functions
– ORCFile, Modern File Format
• Hive 0.12, October 2013:
– VARCHAR, DATE Types
– ORCFile predicate pushdown
– Advanced Optimizations
– Performance Boosts via YARN
• Hive 0.13, April 2014:
– Hive on Apache Tez
– Query Service
– Buffer Cache
– Cost Based Optimizer (Optiq)
– Vectorized Processing
SPEED: Increasing Hive Performance
Key Highlights
– Tez: New execution engine
– Vectorized Query Processing
– Startup time improvement
– Statistics to accelerate query execution
– Cost Based Optimizer: Optiq (missed the cut for 0.13)
Interactive Query Times across ALL use cases
• Simple and advanced queries in seconds
• Integrates seamlessly with existing tools
• Currently a >100x improvement in just nine months
Elements of Fast SQL Execution
• Query Planner/Cost Based Optimizer w/ Statistics
• Query Startup
• Query Execution
• I/O Path
Apache Tez (“Speed”)
• Replaces MapReduce as the execution primitive for Pig, Hive, Cascading, etc.
– Lower latency for interactive queries
– Higher throughput for batch queries
– 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
• A YARN ApplicationMaster runs a DAG of Tez tasks
• Each task has a pluggable Input, Processor and Output
Tez Task - <Input, Processor, Output>
[Figure: a Tez task drawn as Input, Processor, Output]
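The <Input, Processor, Output> idea can be illustrated with a small sketch. This is plain Python, not the Tez Java API: the `Task` class, its three callables and the in-memory lists standing in for data movement are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Task:
    """Toy stand-in for a Tez task: pluggable input, processor and output."""
    read_input: Callable[[], Iterable[int]]              # where rows come from
    process: Callable[[Iterable[int]], Iterable[int]]    # what is done to them
    write_output: Callable[[Iterable[int]], None]        # where results go

    def run(self) -> None:
        self.write_output(self.process(self.read_input()))


# Two tasks chained through an in-memory buffer (hypothetical, not HDFS).
intermediate: List[int] = []
final: List[int] = []

scan = Task(
    read_input=lambda: [1, 2, 3, 4],
    process=lambda rows: (r * 10 for r in rows),
    write_output=intermediate.extend,
)
total = Task(
    read_input=lambda: intermediate,
    process=lambda rows: [sum(rows)],
    write_output=final.extend,
)

for task in (scan, total):  # run the two-vertex DAG in dependency order
    task.run()
```

Because each piece is swappable, the same processor could in principle be fed from a file, a shuffle or a broadcast without changing its logic, which is the property the slide highlights.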
Hive-on-MR vs. Hive-on-Tez
SELECT g1.x, g1.avg, g2.cnt
FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1
JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
ON (g1.x = g2.x)
ORDER BY avg;
[Figure: two plans for the query above. Hive – MR: GROUP a BY a.x, GROUP b BY b.x, JOIN (a,b) and ORDER BY run as separate map (M) / reduce (R) jobs with HDFS writes between them. Hive – Tez: GROUP BY a.x, GROUP BY x, JOIN (a,b) and ORDER BY run as connected M and R stages.]
Tez avoids unnecessary writes to HDFS (HIVE-4660)
Shuffle Join
SELECT ss.ss_item_sk, ss.ss_quantity, inv.inv_quantity_on_hand
FROM inventory inv
JOIN store_sales ss
ON (inv.inv_item_sk = ss.ss_item_sk);
[Figure: shuffle join plans, Hive – MR vs. Hive – Tez]
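As a rough illustration of what a shuffle join does, the sketch below hash-partitions both relations on the join key and joins each partition independently. It is a toy Python model, not Hive code: `shuffle_join`, `num_partitions` and the sample rows are assumptions made for the example.

```python
from collections import defaultdict


def shuffle_join(left, right, num_partitions=4):
    """Join two lists of (key, value) pairs by hash-partitioning both sides."""
    # "Shuffle" phase: route every row of both relations by hash(key).
    left_parts = defaultdict(list)
    right_parts = defaultdict(list)
    for key, value in left:
        left_parts[hash(key) % num_partitions].append((key, value))
    for key, value in right:
        right_parts[hash(key) % num_partitions].append((key, value))

    # "Reduce" phase: each partition joins only the rows routed to it.
    joined = []
    for p in range(num_partitions):
        by_key = defaultdict(list)
        for key, value in left_parts[p]:
            by_key[key].append(value)
        for key, value in right_parts[p]:
            for left_value in by_key.get(key, []):
                joined.append((key, left_value, value))
    return joined


inventory = [(1, 50), (2, 0), (3, 7)]            # (inv_item_sk, inv_quantity_on_hand)
store_sales = [(1, 2), (3, 1), (3, 4), (9, 5)]   # (ss_item_sk, ss_quantity)
result = sorted(shuffle_join(store_sales, inventory))
```

Both inputs are moved across the network in this scheme, which is the cost the broadcast and bucketed joins on the following slides try to avoid.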
Broadcast Join
• Similar to map-join w/o the need to build a hash table on the client
• Will work with any level of sub-query nesting
• Uses stats to determine if applicable
• How it works:
– Broadcast result set is computed in parallel on the cluster
– Join processors are spun up in parallel
– Broadcast set is streamed to the join processors
– Join processors build the hash table
– Other relation is joined with the hash table
• Tez handles:
– Best parallelism
– Best data transfer of the hashed relation
– Best scheduling to avoid latencies
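The "How it works" steps above can be sketched as follows. This is a simplified Python model under stated assumptions: `broadcast_join` and the list-of-splits representation are hypothetical, and the loop over splits stands in for join processors that would run in parallel on the cluster.

```python
def broadcast_join(small, large_splits):
    """Join each split of the large relation against a copy of the small one."""
    joined = []
    for split in large_splits:  # one iteration per join processor
        # Each join processor builds its own hash table from the broadcast set.
        hash_table = {}
        for key, value in small:
            hash_table.setdefault(key, []).append(value)
        # The other relation is streamed past the hash table.
        for key, value in split:
            for small_value in hash_table.get(key, []):
                joined.append((key, value, small_value))
    return joined


inventory = [(1, 50), (3, 7)]                            # small, broadcast side
store_sales_splits = [[(1, 2), (9, 5)], [(3, 1), (3, 4)]]  # large side, two splits
result = broadcast_join(inventory, store_sales_splits)
```

Only the small relation is copied to every processor; the large relation is read in place, split by split.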
Broadcast Join
SELECT ss.ss_item_sk, ss.ss_quantity, inv.inv_quantity_on_hand
FROM store_sales ss
JOIN inventory inv
ON (inv.inv_item_sk = ss.ss_item_sk);
[Figure: broadcast join plans, Hive – MR vs. Hive – Tez]
• Hive – MR: the inventory scan runs as a single local map task; the store sales scan and join reads the inventory hash table as a side file.
• Hive – Tez: the inventory scan runs on the cluster, potentially with more than 1 mapper; its output reaches the store sales scan and join over a broadcast edge.
Dynamically Partitioned Hash Join
• Kicks in when the large table is bucketed
– Either stored as a bucketed table
– Or bucketed dynamically as part of query processing
– Enabled via set hive.convert.join.bucket.mapjoin.tez = true; (use 0.13.1)
• Uses a custom edge to match the partitioning on the smaller table
• Allows a hash join in cases where a broadcast would be too large
• Tez gives us the option of building custom edges and vertex managers
– Fine-grained control over how the data is replicated and partitioned
– Scheduling and the actual data transfer are handled by Tez
Dynamically Partitioned Hash Join
SELECT ss.ss_item_sk, ss.ss_quantity, inv.inv_quantity_on_hand
FROM store_sales ss
JOIN inventory inv
ON (inv.inv_item_sk = ss.ss_item_sk);
[Figure: bucketed join plans, Hive – MR vs. Hive – Tez]
• Hive – MR: the inventory scan runs as a single local map task; the store sales scan and join reads the inventory hash table as a side file.
• Hive – Tez: the inventory scan runs on the cluster, potentially with more than 1 mapper; a custom edge routes the outputs of that stage to the correct mappers of the next stage; the store sales scan and join uses a custom vertex that reads both inputs, with no side file reads.
Dynamically Partitioned Hash Join
Plans look very similar to a map join, but the way they execute differs between MR and Tez.
Hive – MR (Bucket map-join)
• Not dynamically partitioned.
• Both tables need to be bucketed by the join key.
• The local task that generates the hash table writes n files corresponding to n buckets.
• The number of mappers for the join must be the same as the number of buckets.
• Each of these mappers reads the corresponding bucket file of the local task to perform the join.
Hive – Tez
• Only one of the sides needs to be bucketed; the other side is dynamically bucketed.
• Also works if neither side is explicitly bucketed but another operation in the pipeline forced bucketing (traits).
• No writing to HDFS.
• There can be more mappers than buckets, but splits do not span multiple buckets.
• The dynamically bucketed mappers have as many outputs as there are buckets, and custom Tez routing ensures these outputs reach the right mappers.
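A minimal sketch of the Tez-side behaviour, assuming a simple `hash(key) % num_buckets` bucketing function shared by both sides: the large table is already bucketed, and the small table is routed bucket by bucket so that each join task only hashes its own share. The function names and data are hypothetical, and the routing list is only a stand-in for the custom edge.

```python
def bucket_of(key, num_buckets):
    """Bucketing function shared by both sides (an assumption of this sketch)."""
    return hash(key) % num_buckets


def dynamic_partitioned_hash_join(bucketed_large, small, num_buckets):
    # Custom-edge stand-in: partition the small side to match the buckets.
    routed = [[] for _ in range(num_buckets)]
    for key, value in small:
        routed[bucket_of(key, num_buckets)].append((key, value))

    joined = []
    for b in range(num_buckets):
        # Each join task hashes only its share of the small side ...
        hash_table = {}
        for key, value in routed[b]:
            hash_table.setdefault(key, []).append(value)
        # ... and probes it with the matching bucket of the large side.
        for key, value in bucketed_large[b]:
            for small_value in hash_table.get(key, []):
                joined.append((key, value, small_value))
    return joined


num_buckets = 2
store_sales = [(1, 2), (2, 6), (3, 1), (4, 9)]
# Pre-bucketed large table: bucket b holds the keys with bucket_of(key) == b.
bucketed_sales = [
    [row for row in store_sales if bucket_of(row[0], num_buckets) == b]
    for b in range(num_buckets)
]
inventory = [(1, 50), (3, 7), (4, 0)]
result = sorted(dynamic_partitioned_hash_join(bucketed_sales, inventory, num_buckets))

# Size of the largest per-task hash table, versus len(inventory) for a broadcast.
largest_hash_table = max(
    sum(1 for key, _ in inventory if bucket_of(key, num_buckets) == b)
    for b in range(num_buckets)
)
```

No task ever holds the whole small relation, which is why this form of join can still apply when a full broadcast would not fit in memory.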
Bulk Inner loop: Vectorization
• Avoid Writable objects & use primitive int/long
– Allows efficient JIT code for primitive types
• Generate per-type loops & avoid runtime type-checks
• The classes generated look like
– LongColEqualDoubleColumn
– LongColEqualLongColumn
– LongColEqualLongScalar
• Avoid duplicate operations on repeated values
– isRepeating & hasNulls
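The per-type loop and the repeated-value shortcut can be sketched as below. This is an illustrative Python model, not Hive's generated Java: the function name only mirrors `LongColEqualLongScalar` from the list above, the `LongColumnVector` class here is a simplified stand-in, and null handling is left out.

```python
from array import array


class LongColumnVector:
    """Toy column batch: a primitive long array plus an is_repeating flag."""

    def __init__(self, values, is_repeating=False):
        self.vector = array("q", values)   # unboxed 64-bit integers
        self.is_repeating = is_repeating   # True: vector[0] stands for every row


def long_col_equal_long_scalar(col, scalar, size):
    """Return the row indices (out of `size` rows) where col == scalar."""
    if col.is_repeating:
        # One comparison decides the whole batch.
        return list(range(size)) if col.vector[0] == scalar else []
    vector = col.vector
    # Tight loop over one known type, with no per-row type checks.
    return [i for i in range(size) if vector[i] == scalar]


batch = LongColumnVector([7, 3, 7, 7])
repeated = LongColumnVector([7], is_repeating=True)
selected = long_col_equal_long_scalar(batch, 7, 4)
selected_repeated = long_col_equal_long_scalar(repeated, 7, 4)
```

One specialised function per operand-type combination is what makes the class list on the slide so long, and also what removes type dispatch from the inner loop.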
ORC: ZeroCopy & caching
• Use the memory-mapped I/O path in HDFS
– HDFS in-memory cache
• ORC reads can start deserializing early
– There is no blocking read() call
• Allow OS read-ahead to kick in
• Use buffer-cache pages without copying them
• Avoid wasting heap space on ORC stripes
• Decompress directly from mapped buffers
– Fast JNI code for Snappy decompressors
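As an analogy only, the sketch below uses the operating system's `mmap` from Python to read part of a local file without first copying the whole file into a heap buffer. It does not use the HDFS zero-copy API; the temporary file and the `stripe-N` byte layout are made up for the example.

```python
import mmap
import os
import tempfile

# Write a small stand-in for a file that would normally live in HDFS.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"stripe-0|stripe-1|stripe-2")
    path = f.name

with open(path, "rb") as f:
    # Map the file instead of calling read() into a heap buffer.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        view = memoryview(mapped)          # no copy of the mapped pages
        second_stripe = bytes(view[9:17])  # copy out only the slice we need
        view.release()                     # release before the map is closed

os.remove(path)
```

The mapped pages are served from the OS buffer cache, so only the slice that is actually consumed ends up on the heap.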
Scaling
• Reduce size of map join hash tables
– About a hundred bytes were being used to store a single integer (map join key)
– HIVE-6430 reduced the size of the hash tables by 60-70% in many cases
– This allows more efficient use of memory, so larger tables fit into map joins
• Large number of open record writers in ORC file reduced to just 1
– HIVE-6455
– In a multi-insert scenario, performance is now much better and many more inserts can be done in parallel
TPC-DS 10 TB Query Times (Tuned Queries)
Data: 10 TB data loaded into ORCFile using defaults.
Hardware: 20 nodes: 24 CPU threads, 64 GB RAM, 5 local SATA drives each.
Queries: TPC-DS queries with partition filters added.
Security
• Old authorization based on grant/revoke
– Incomplete model, e.g. anybody can run a grant statement
– Does not try to follow the standard
• Why follow the standard?
– A lot of thought has been put into the standard, which is important for security!
– It’s a standard!
• Hive should have built-in authorization
– Easy to use, no additional components to manage
– New features get added that need authorization
– Life cycle of objects should be synced with authorization policy
Managing privileges
• Grant/revoke privilege on object to/from user/role
• SHOW GRANT statement
– View privilege grants based on user/role name and/or object name
• Privileges: INSERT, SELECT, DELETE, UPDATE, ALL
• Privileges for some actions based on object ownership
– Table/view ownership: most alter commands, drop
– Database ownership: create table, drop database
• URI privileges based on file permissions
• https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+based+hive+authorization#SQLStandardBasedHiveAuthorization-Configuration
• Use Hive 0.13.1, which fixes the issues listed under known issues in the wiki doc above.
Hive Server 2 improvements
• HiveServer2 now supports Thrift over HTTP, with Kerberos/LDAP authentication on HTTP
• Also supports HTTPS
• HiveServer2 can keep sessions alive
– Between different JDBC queries
• New security model helps
– All secure queries run as “hive” user
• Ideal for short exploratory queries
• Uses same JARs (no download for task)
• Even better JIT performance when a session runs more than one query
Other improvements
• Insert-update-delete semantics. Streaming ingest from Flume. (HIVE-5317)
– Transaction manager added. Only the ORC file format is supported at this time.
• Lots of UDF support via permanent functions. No need to run add jar for the most commonly used UDFs. Ideally, the admin adds the permanent (trusted) functions.
• Parquet is a supported storage format. https://cwiki.apache.org/confluence/display/Hive/Parquet
• HCatalog now supports all the datatypes supported in Hive.
• Hive is now mavenized (Thanks Brock Noland!)
• The parallel test framework means unit testing happens faster and changes get in faster.
• Lots of new documentation for all the new features. (Thanks Lefty!)
• Bottom line: Hive 0.13 is the fastest, most feature-rich version of Hive so far.
Future Work
• Many more improvements are coming
• Speed
– Sort Merge Bucket Map join in Tez
– Total ordering of data
– Skew joins
– Cost based Optimizer
• Security
– Authorizing permanent UDF access
– Authorizing ‘show grant’
– Support HDFS ACLs in URI permission checks (new in Hadoop 2.4)
– More SQL syntax support, e.g. revoke just the admin option on a role
• Multi-tenancy
– Sticky HS2 sessions for improved performance in a multi-tenant environment
– Improve scheduling in a multi-tenant environment
Questions?

More Related Content

PDF
Foss evolution cos-boudnik
Data Con LA
 
PPTX
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
Cloudera, Inc.
 
PDF
Facebook - Jonthan Gray - Hadoop World 2010
Cloudera, Inc.
 
PDF
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
Cloudera, Inc.
 
PPTX
Keynote: The Future of Apache HBase
HBaseCon
 
PDF
Maintainable cloud architecture_of_hadoop
Kai Sasaki
 
PPTX
Asbury Hadoop Overview
Brian Enochson
 
PPTX
HBaseCon 2012 | HBase, the Use Case in eBay Cassini
Cloudera, Inc.
 
Foss evolution cos-boudnik
Data Con LA
 
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
Cloudera, Inc.
 
Facebook - Jonthan Gray - Hadoop World 2010
Cloudera, Inc.
 
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
Cloudera, Inc.
 
Keynote: The Future of Apache HBase
HBaseCon
 
Maintainable cloud architecture_of_hadoop
Kai Sasaki
 
Asbury Hadoop Overview
Brian Enochson
 
HBaseCon 2012 | HBase, the Use Case in eBay Cassini
Cloudera, Inc.
 

What's hot (20)

ODP
Hadoop Ecosystem Overview
Gerrit van Vuuren
 
PPTX
HBase at Bloomberg: High Availability Needs for the Financial Industry
HBaseCon
 
PPTX
HBaseCon 2015: HBase and Spark
HBaseCon
 
PDF
Hadoop Hardware @Twitter: Size does matter!
DataWorks Summit
 
PDF
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Hadoop / Spark Conference Japan
 
PDF
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
Cloudera, Inc.
 
PDF
Big Data and Hadoop Ecosystem
Rajkumar Singh
 
PDF
Large-scale Web Apps @ Pinterest
HBaseCon
 
PPT
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
Cloudera, Inc.
 
PPTX
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Esther Kundin
 
PDF
Quick Introduction to Apache Tez
GetInData
 
PPTX
Digital Library Collection Management using HBase
HBaseCon
 
PPTX
February 2014 HUG : Hive On Tez
Yahoo Developer Network
 
PDF
Big Data Journey
Tugdual Grall
 
PDF
Kudu - Fast Analytics on Fast Data
Ryan Bosshart
 
PPTX
Pig on Tez - Low Latency ETL with Big Data
DataWorks Summit
 
PDF
HBaseCon 2015: Elastic HBase on Mesos
HBaseCon
 
PDF
HBase Applications - Atlanta HUG - May 2014
larsgeorge
 
PPTX
2015 GHC Presentation - High Availability and High Frequency Big Data Analytics
Esther Kundin
 
PDF
Migrating structured data between Hadoop and RDBMS
Bouquet
 
Hadoop Ecosystem Overview
Gerrit van Vuuren
 
HBase at Bloomberg: High Availability Needs for the Financial Industry
HBaseCon
 
HBaseCon 2015: HBase and Spark
HBaseCon
 
Hadoop Hardware @Twitter: Size does matter!
DataWorks Summit
 
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Hadoop / Spark Conference Japan
 
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
Cloudera, Inc.
 
Big Data and Hadoop Ecosystem
Rajkumar Singh
 
Large-scale Web Apps @ Pinterest
HBaseCon
 
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
Cloudera, Inc.
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Esther Kundin
 
Quick Introduction to Apache Tez
GetInData
 
Digital Library Collection Management using HBase
HBaseCon
 
February 2014 HUG : Hive On Tez
Yahoo Developer Network
 
Big Data Journey
Tugdual Grall
 
Kudu - Fast Analytics on Fast Data
Ryan Bosshart
 
Pig on Tez - Low Latency ETL with Big Data
DataWorks Summit
 
HBaseCon 2015: Elastic HBase on Mesos
HBaseCon
 
HBase Applications - Atlanta HUG - May 2014
larsgeorge
 
2015 GHC Presentation - High Availability and High Frequency Big Data Analytics
Esther Kundin
 
Migrating structured data between Hadoop and RDBMS
Bouquet
 
Ad

Viewers also liked (20)

PDF
Aziksa hadoop for buisness users2 santosh jha
Data Con LA
 
PDF
140614 bigdatacamp-la-keynote-jon hsieh
Data Con LA
 
PDF
20140614 introduction to spark-ben white
Data Con LA
 
PDF
Ag big datacampla-06-14-2014-ajay_gopal
Data Con LA
 
PPT
Big datacamp june14_alex_liu
Data Con LA
 
PDF
Big Data Day LA 2015 - Solr Search with Spark for Big Data Analytics in Actio...
Data Con LA
 
PPTX
2014 bigdatacamp asya_kamsky
Data Con LA
 
PDF
Yarn cloudera-kathleenting061414 kate-ting
Data Con LA
 
PPTX
Summit v4 dave wolcott
Data Con LA
 
PDF
Big Data Day LA 2015 - HBase at Factual: Real time and Batch Uses by Molly O'...
Data Con LA
 
PPTX
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Data Con LA
 
PDF
Kiji cassandra la june 2014 - v02 clint-kelly
Data Con LA
 
PDF
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...
Data Con LA
 
PDF
Hadoop and NoSQL joining forces by Dale Kim of MapR
Data Con LA
 
PPTX
Hadoop Innovation Summit 2014
Data Con LA
 
PPTX
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Data Con LA
 
PPTX
Big Data Day LA 2015 - Deep Learning Human Vocalized Animal Sounds by Sabri S...
Data Con LA
 
PPTX
Big Data Day LA 2016/ Data Science Track - Decision Making and Lambda Archite...
Data Con LA
 
PDF
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
Data Con LA
 
PDF
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Data Con LA
 
Aziksa hadoop for buisness users2 santosh jha
Data Con LA
 
140614 bigdatacamp-la-keynote-jon hsieh
Data Con LA
 
20140614 introduction to spark-ben white
Data Con LA
 
Ag big datacampla-06-14-2014-ajay_gopal
Data Con LA
 
Big datacamp june14_alex_liu
Data Con LA
 
Big Data Day LA 2015 - Solr Search with Spark for Big Data Analytics in Actio...
Data Con LA
 
2014 bigdatacamp asya_kamsky
Data Con LA
 
Yarn cloudera-kathleenting061414 kate-ting
Data Con LA
 
Summit v4 dave wolcott
Data Con LA
 
Big Data Day LA 2015 - HBase at Factual: Real time and Batch Uses by Molly O'...
Data Con LA
 
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Data Con LA
 
Kiji cassandra la june 2014 - v02 clint-kelly
Data Con LA
 
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...
Data Con LA
 
Hadoop and NoSQL joining forces by Dale Kim of MapR
Data Con LA
 
Hadoop Innovation Summit 2014
Data Con LA
 
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Data Con LA
 
Big Data Day LA 2015 - Deep Learning Human Vocalized Animal Sounds by Sabri S...
Data Con LA
 
Big Data Day LA 2016/ Data Science Track - Decision Making and Lambda Archite...
Data Con LA
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
Data Con LA
 
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Data Con LA
 
Ad

Similar to La big datacamp2014_vikram_dixit (20)

PPTX
Performance Hive+Tez 2
t3rmin4t0r
 
PDF
Gunther hagleitner:apache hive & stinger
hdhappy001
 
PDF
Apache Tez : Accelerating Hadoop Query Processing
Teddy Choi
 
PPTX
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Lester Martin
 
PPTX
Unit II Hadoop Ecosystem_Updated.pptx
BhavanaHotchandani
 
PPTX
Stinger hadoop summit june 2013
alanfgates
 
PPTX
An In-Depth Look at Putting the Sting in Hive
DataWorks Summit
 
PPTX
Mapreduce over snapshots
enissoz
 
PDF
Sept 17 2013 - THUG - HBase a Technical Introduction
Adam Muise
 
PDF
April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times Faster
Yahoo Developer Network
 
PDF
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
tcloudcomputing-tw
 
PPTX
Getting started big data
Kibrom Gebrehiwot
 
PPTX
Big Data and Cloud Computing
Farzad Nozarian
 
PPT
Apache hadoop, hdfs and map reduce Overview
Nisanth Simon
 
PPTX
February 2014 HUG : Tez Details and Insides
Yahoo Developer Network
 
PDF
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
tcloudcomputing-tw
 
PPTX
Stinger Initiative - Deep Dive
Hortonworks
 
PPTX
Big data processing engines, Atlanta Meetup 4/30
Ashish Narasimham
 
PPTX
MHUG - YARN
Joseph Niemiec
 
PDF
A Reference Architecture for ETL 2.0
DataWorks Summit
 
Performance Hive+Tez 2
t3rmin4t0r
 
Gunther hagleitner:apache hive & stinger
hdhappy001
 
Apache Tez : Accelerating Hadoop Query Processing
Teddy Choi
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Lester Martin
 
Unit II Hadoop Ecosystem_Updated.pptx
BhavanaHotchandani
 
Stinger hadoop summit june 2013
alanfgates
 
An In-Depth Look at Putting the Sting in Hive
DataWorks Summit
 
Mapreduce over snapshots
enissoz
 
Sept 17 2013 - THUG - HBase a Technical Introduction
Adam Muise
 
April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times Faster
Yahoo Developer Network
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
tcloudcomputing-tw
 
Getting started big data
Kibrom Gebrehiwot
 
Big Data and Cloud Computing
Farzad Nozarian
 
Apache hadoop, hdfs and map reduce Overview
Nisanth Simon
 
February 2014 HUG : Tez Details and Insides
Yahoo Developer Network
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
tcloudcomputing-tw
 
Stinger Initiative - Deep Dive
Hortonworks
 
Big data processing engines, Atlanta Meetup 4/30
Ashish Narasimham
 
MHUG - YARN
Joseph Niemiec
 
A Reference Architecture for ETL 2.0
DataWorks Summit
 

More from Data Con LA (20)

PPTX
Data Con LA 2022 Keynotes
Data Con LA
 
PPTX
Data Con LA 2022 Keynotes
Data Con LA
 
PDF
Data Con LA 2022 Keynote
Data Con LA
 
PPTX
Data Con LA 2022 - Startup Showcase
Data Con LA
 
PPTX
Data Con LA 2022 Keynote
Data Con LA
 
PDF
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA
 
PPTX
Data Con LA 2022 - AI Ethics
Data Con LA
 
PDF
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA
 
PDF
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA
 
PDF
Data Con LA 2022 - Real world consumer segmentation
Data Con LA
 
PPTX
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA
 
PPTX
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA
 
PDF
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA
 
PDF
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA
 
PDF
Data Con LA 2022 - Intro to Data Science
Data Con LA
 
PDF
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA
 
PPTX
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA
 
PPTX
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA
 
PPTX
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA
 
PPTX
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA
 
Data Con LA 2022 Keynotes
Data Con LA
 
Data Con LA 2022 Keynotes
Data Con LA
 
Data Con LA 2022 Keynote
Data Con LA
 
Data Con LA 2022 - Startup Showcase
Data Con LA
 
Data Con LA 2022 Keynote
Data Con LA
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA
 
Data Con LA 2022 - AI Ethics
Data Con LA
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA
 
Data Con LA 2022 - Intro to Data Science
Data Con LA
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA
 

Recently uploaded (20)

PDF
Unlocking the Future- AI Agents Meet Oracle Database 23ai - AIOUG Yatra 2025.pdf
Sandesh Rao
 
PDF
Automating ArcGIS Content Discovery with FME: A Real World Use Case
Safe Software
 
PDF
Research-Fundamentals-and-Topic-Development.pdf
ayesha butalia
 
PDF
AI-Cloud-Business-Management-Platforms-The-Key-to-Efficiency-Growth.pdf
Artjoker Software Development Company
 
PPTX
What-is-the-World-Wide-Web -- Introduction
tonifi9488
 
PPTX
cloud computing vai.pptx for the project
vaibhavdobariyal79
 
PDF
A Strategic Analysis of the MVNO Wave in Emerging Markets.pdf
IPLOOK Networks
 
PPTX
AI and Robotics for Human Well-being.pptx
JAYMIN SUTHAR
 
PDF
Advances in Ultra High Voltage (UHV) Transmission and Distribution Systems.pdf
Nabajyoti Banik
 
PDF
Structs to JSON: How Go Powers REST APIs
Emily Achieng
 
PPTX
Applied-Statistics-Mastering-Data-Driven-Decisions.pptx
parmaryashparmaryash
 
PPTX
OA presentation.pptx OA presentation.pptx
pateldhruv002338
 
PPTX
IT Runs Better with ThousandEyes AI-driven Assurance
ThousandEyes
 
PDF
REPORT: Heating appliances market in Poland 2024
SPIUG
 
PDF
Orbitly Pitch Deck|A Mission-Driven Platform for Side Project Collaboration (...
zz41354899
 
PDF
The Evolution of KM Roles (Presented at Knowledge Summit Dublin 2025)
Enterprise Knowledge
 
PDF
Responsible AI and AI Ethics - By Sylvester Ebhonu
Sylvester Ebhonu
 
PDF
Accelerating Oracle Database 23ai Troubleshooting with Oracle AHF Fleet Insig...
Sandesh Rao
 
PPTX
New ThousandEyes Product Innovations: Cisco Live June 2025
ThousandEyes
 
PDF
Oracle AI Vector Search- Getting Started and what's new in 2025- AIOUG Yatra ...
Sandesh Rao
 
Unlocking the Future- AI Agents Meet Oracle Database 23ai - AIOUG Yatra 2025.pdf
Sandesh Rao
 
Automating ArcGIS Content Discovery with FME: A Real World Use Case
Safe Software
 
Research-Fundamentals-and-Topic-Development.pdf
ayesha butalia
 
AI-Cloud-Business-Management-Platforms-The-Key-to-Efficiency-Growth.pdf
Artjoker Software Development Company
 
What-is-the-World-Wide-Web -- Introduction
tonifi9488
 
cloud computing vai.pptx for the project
vaibhavdobariyal79
 
A Strategic Analysis of the MVNO Wave in Emerging Markets.pdf
IPLOOK Networks
 
AI and Robotics for Human Well-being.pptx
JAYMIN SUTHAR
 
Advances in Ultra High Voltage (UHV) Transmission and Distribution Systems.pdf
Nabajyoti Banik
 
Structs to JSON: How Go Powers REST APIs
Emily Achieng
 
Applied-Statistics-Mastering-Data-Driven-Decisions.pptx
parmaryashparmaryash
 
OA presentation.pptx OA presentation.pptx
pateldhruv002338
 
IT Runs Better with ThousandEyes AI-driven Assurance
ThousandEyes
 
REPORT: Heating appliances market in Poland 2024
SPIUG
 
Orbitly Pitch Deck|A Mission-Driven Platform for Side Project Collaboration (...
zz41354899
 
The Evolution of KM Roles (Presented at Knowledge Summit Dublin 2025)
Enterprise Knowledge
 
Responsible AI and AI Ethics - By Sylvester Ebhonu
Sylvester Ebhonu
 
Accelerating Oracle Database 23ai Troubleshooting with Oracle AHF Fleet Insig...
Sandesh Rao
 
New ThousandEyes Product Innovations: Cisco Live June 2025
ThousandEyes
 
Oracle AI Vector Search- Getting Started and what's new in 2025- AIOUG Yatra ...
Sandesh Rao
 

La big datacamp2014_vikram_dixit

  • 1. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. Hive 0.13: An upgrade in Performance, Scaling, Security and Multi-tenancy Vikram Dixit ([email protected])
  • 2. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. Hive – SQL on Hadoop • Open source Apache project • Started by Facebook in 2009 • Tools to enable easy data extract/transform/load (ETL) • Work with structured, unstructured, semi-structured data • Access to files stored either directly in Apache HDFSTM or in other data storage systems such as Apache HBaseTM • Query execution via MapReduce/Tez • Metadata sharing via HCatalog allows your Pig scripts to work with Hive tables
  • 3. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. Using Hive Effectively • Understanding Hive’s use case – Current focus on making it a fast analytics engine that scales – Transactions coming • Understand the storage mechanism right for you – ORC File - highest compression, metadata used to enable faster reads – Parquet - intermediate compression, fast reads – RC File - legacy, most widely used but suffers in performance – Text - ease of use but lowest in terms of performance • Use the right execution engine – Tez is the recommended execution engine for performance – Map reduce is chosen by default in cases where tez can not yet run the query • Use the right configuration flags – Many optimizations are turned on by default – Some are not. Need to tune it for your cluster because a default is hard to come up with.
  • 4. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. What’s new in Hive 0.13 •Speed – Hive on Tez – Broadcast Joins, Bucket Map Joins – Vectorized Query processing – Split elimination for ORC file – Parquet file format support •Scale – Smaller hash tables allowing more scalable map joins – More scalable dynamic partition loads
  • 5. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. What’s new in Hive 0.13 • More SQL improvements – SQL standard Authorization – Char support, Decimal improvements – Permanent UDFs – Streaming ingest from Flume for ACID capability • Additional Improvements – Hive Server 2 improvements – HCatalog parity with Hive data types – JDBC improvements viz. job cancel, async execution • Even more goodies – Mavenization – Parallel test framework – Lots of documentation
  • 6. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. Stinger Project (announced February 2013) Batch AND Interactive SQL-IN- Hadoop Stinger Initiative A broad, community-based effort to drive the next generation of HIVE Hive 0.13, April, 2013 • Hive on Apache Tez • Query Service • Buffer Cache • Cost Based Optimizer (Optiq) • Vectorized Processing Hive 0.11, May 2013: • Base Optimizations • SQL Analytic Functions • ORCFile, Modern File Format Hive 0.12, October 2013: • VARCHAR, DATE Types • ORCFile predicate pushdown • Advanced Optimizations • Performance Boosts via YARN Speed Improve Hive query performance by 100X to allow for interactive query times (seconds) Scale The only SQL interface to Hadoop designed for queries that scale from TB to PB SQL Support broadest range of SQL semantics for analytic applications running against Hadoop …all IN Hadoop Goals:
  • 7. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. SPEED: Increasing Hive Performance Key Highlights – Tez: New execution engine – Vectorized Query Processing – Startup time improvement – Statistics to accelerate query execution – Cost Based Optimizer: Optiq (missed the cut) Interactive Query Times across ALL use cases • Simple and advanced queries in seconds • Integrates seamlessly with existing tools • Currently a >100x improvement in just nine months Elements of Fast SQL Execution • Query Planner/Cost Based Optimizer w/ Statistics • Query Startup • Query Execution • I/O Path
  • 8. Apache Tez (“Speed”)
    • Replaces MapReduce as the primitive for Pig, Hive, Cascading, etc.
      – Lower latency for interactive queries
      – Higher throughput for batch queries
      – 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
    • YARN ApplicationMaster to run a DAG of Tez tasks
    • Task with pluggable Input, Processor and Output
      – Tez Task - <Input, Processor, Output>
    [Diagram: a Tez task drawn as Input, Processor, Output]
  • 9. Hive-on-MR vs. Hive-on-Tez
      SELECT g1.x, g1.avg, g2.cnt
      FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1
      JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
      ON (g1.x = g2.x)
      ORDER BY avg;
    [Diagram: map (M) and reduce (R) stages for GROUP BY a.x, GROUP BY b.x, JOIN (a, b) and ORDER BY, shown for Hive – MR with HDFS writes between jobs and for Hive – Tez]
    • Tez avoids unnecessary writes to HDFS
    • HIVE-4660
  • 10. Shuffle Join
      SELECT ss.ss_item_sk, ss.ss_quantity, inv.inv_quantity_on_hand
      FROM inventory inv
      JOIN store_sales ss ON (inv.inv_item_sk = ss.ss_item_sk);
    [Diagram: shuffle join plan under Hive – MR and under Hive – Tez]
  • 11. Broadcast Join
    • Similar to a map join, without the need to build a hash table on the client
    • Works with any level of sub-query nesting
    • Uses stats to determine whether it is applicable
    • How it works:
      – The broadcast result set is computed in parallel on the cluster
      – Join processors are spun up in parallel
      – The broadcast set is streamed to the join processors
      – Join processors build the hash table
      – The other relation is joined with the hash table
    • Tez handles:
      – Best parallelism
      – Best data transfer of the hashed relation
      – Best scheduling to avoid latencies
  • 12. Broadcast Join
      SELECT ss.ss_item_sk, ss.ss_quantity, inv.inv_quantity_on_hand
      FROM store_sales ss
      JOIN inventory inv ON (inv.inv_item_sk = ss.ss_item_sk);
    • Hive – MR
      – Inventory scan (runs as a single local map task)
      – Store Sales scan and join (inventory hash table read as a side file)
    • Hive – Tez
      – Inventory scan (runs on the cluster, potentially with more than one mapper)
      – Broadcast edge
      – Store Sales scan and join
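The "how it works" steps for the broadcast join can be sketched in a few lines of Python. This is only an illustration of the idea under simplifying assumptions, not Hive or Tez code: the function names `build_hash_table` and `broadcast_join` are hypothetical, rows are plain dictionaries, and each element of `store_sales_splits` stands in for the input of one join processor.

```python
from collections import defaultdict


def build_hash_table(rows, key):
    """Build an in-memory hash table from the small (broadcast) relation."""
    table = defaultdict(list)
    for row in rows:
        table[row[key]].append(row)
    return table


def broadcast_join(small_rows, big_splits, small_key, big_key):
    """Join every split of the large relation against the broadcast relation.

    Each split models one join processor: it receives the full small
    relation, builds its own hash table, then probes it row by row.
    """
    joined = []
    for split in big_splits:
        table = build_hash_table(small_rows, small_key)  # built per processor
        for row in split:
            for match in table.get(row[big_key], []):
                joined.append({**row, **match})
    return joined


# Toy data modelled on the query from the slides (values are made up).
inventory = [
    {"inv_item_sk": 1, "inv_quantity_on_hand": 40},
    {"inv_item_sk": 2, "inv_quantity_on_hand": 7},
]
store_sales_splits = [
    [{"ss_item_sk": 1, "ss_quantity": 3}],
    [{"ss_item_sk": 2, "ss_quantity": 5}, {"ss_item_sk": 9, "ss_quantity": 1}],
]

result = broadcast_join(inventory, store_sales_splits, "inv_item_sk", "ss_item_sk")
```

The row with `ss_item_sk` 9 has no partner in `inventory`, so an inner join drops it. The sketch also shows why the technique depends on the broadcast side being small: every processor holds a complete copy of its hash table.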
  • 13. Dynamically partitioned hash join
    • Kicks in when the large table is bucketed
      – Bucketed table
      – Dynamic as part of query processing
      – Enabled via set hive.convert.join.bucket.mapjoin.tez = true; (use 0.13.1)
    • Uses a custom edge to match the partitioning on the smaller table
    • Allows a hash join in cases where a broadcast would be too large
    • Tez gives us the option of building custom edges and vertex managers
      – Fine-grained control over how the data is replicated and partitioned
      – Scheduling and the actual data transfer are handled by Tez
  • 14. Dynamically Partitioned Hash Join
      SELECT ss.ss_item_sk, ss.ss_quantity, inv.inv_quantity_on_hand
      FROM store_sales ss
      JOIN inventory inv ON (inv.inv_item_sk = ss.ss_item_sk);
    • Hive – MR
      – Inventory scan (runs as a single local map task)
      – Store Sales scan and join (inventory hash table read as a side file)
    • Hive – Tez
      – Inventory scan (runs on the cluster, potentially with more than one mapper)
      – Custom edge (routes outputs of the previous stage to the correct mappers of the next stage)
      – Store Sales scan and join (custom vertex reads both inputs, no side file reads)
  • 15. Dynamically Partitioned Hash Join
    Plans look very similar to a map join, but the way things work changes between MR and Tez.
    • Hive – MR (bucket map-join)
      – Not dynamically partitioned.
      – Both tables need to be bucketed by the join key.
      – The local task that generates the hash table writes n files corresponding to n buckets.
      – The number of mappers for the join must be the same as the number of buckets.
      – Each of these mappers reads the corresponding bucket file of the local task to perform the join.
    • Hive – Tez
      – Only one of the sides needs to be bucketed; the other side is dynamically bucketed.
      – Also works if neither side is explicitly bucketed but another operation in the pipeline forced bucketing (traits).
      – No writing to HDFS.
      – There can be more mappers than buckets, but splits do not span multiple buckets.
      – The dynamically bucketed mappers have as many outputs as there are buckets, and custom Tez routing ensures these outputs reach the right mappers.
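The Tez side of this comparison can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not Hive's implementation: `NUM_BUCKETS`, `bucket_of`, `route_small_side` and `join_bucket` are hypothetical names, the bucketing function is a plain modulo on integer keys (Hive's real hash differs), and Python lists stand in for the custom edge's outputs.

```python
NUM_BUCKETS = 4  # hypothetical bucket count of the large, bucketed table


def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Stand-in bucketing function for integer keys (an assumption)."""
    return key % num_buckets


def route_small_side(rows, key, num_buckets=NUM_BUCKETS):
    """Dynamically bucket the small side: one output list per bucket.

    This models the custom edge, which sends each row only to the
    mapper(s) that own the matching bucket of the large table.
    """
    outputs = [[] for _ in range(num_buckets)]
    for row in rows:
        outputs[bucket_of(row[key], num_buckets)].append(row)
    return outputs


def join_bucket(big_rows, small_rows, big_key, small_key):
    """Hash-join one bucket; only that bucket's slice of the small side is held."""
    table = {}
    for row in small_rows:
        table.setdefault(row[small_key], []).append(row)
    return [{**b, **s} for b in big_rows for s in table.get(b[big_key], [])]


# store_sales is assumed to be bucketed on ss_item_sk already.
store_sales_buckets = [[] for _ in range(NUM_BUCKETS)]
for item_sk, qty in [(1, 3), (2, 5), (5, 2), (8, 1)]:
    store_sales_buckets[bucket_of(item_sk)].append(
        {"ss_item_sk": item_sk, "ss_quantity": qty}
    )

# inventory is not bucketed; it is bucketed on the fly during the query.
inventory = [{"inv_item_sk": k, "inv_quantity_on_hand": k * 10} for k in (1, 2, 5)]
inventory_outputs = route_small_side(inventory, "inv_item_sk")

joined = []
for b in range(NUM_BUCKETS):
    joined.extend(
        join_bucket(store_sales_buckets[b], inventory_outputs[b],
                    "ss_item_sk", "inv_item_sk")
    )
```

Compared with a broadcast, each join task here builds a hash table over roughly 1/n of the small relation, which is why this plan remains usable when the full broadcast set would not fit in memory.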
  • 16. Bulk Inner Loop: Vectorization
    • Avoid Writable objects and use primitive int/long
      – Allows efficient JIT code for primitive types
    • Generate per-type loops and avoid runtime type checks
    • The generated classes look like
      – LongColEqualDoubleColumn
      – LongColEqualLongColumn
      – LongColEqualLongScalar
    • Avoid duplicate operations on repeated values
      – isRepeating and hasNulls
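To make the idea of a per-type inner loop concrete, here is a minimal Python sketch modelled loosely on the `LongColEqualLongScalar` name from the slide. The class and function below are simplified, hypothetical stand-ins rather than Hive's actual classes, and null handling is left out for brevity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LongColumnVector:
    """Simplified stand-in for one column of a row batch (longs only)."""
    vector: List[int]
    is_repeating: bool = False  # True: every row in the batch equals vector[0]


def long_col_equal_long_scalar(col: LongColumnVector, scalar: int, size: int) -> List[int]:
    """Return the indices of the rows in the batch where col == scalar.

    The function is specialised for (long column, long scalar), so the
    loop body needs no per-row type check.
    """
    if col.is_repeating:
        # One comparison decides the outcome for the whole batch.
        return list(range(size)) if col.vector[0] == scalar else []
    return [i for i in range(size) if col.vector[i] == scalar]


batch = LongColumnVector([7, 3, 7, 1])
selected = long_col_equal_long_scalar(batch, 7, 4)

repeated = LongColumnVector([7], is_repeating=True)
selected_repeated = long_col_equal_long_scalar(repeated, 7, 1024)
```

In the repeating case a single comparison stands in for 1024 of them, which is the saving the `isRepeating` flag is meant to capture.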
  • 17. ORC: ZeroCopy & Caching
    • Use the memory-mapped I/O path in HDFS
      – HDFS in-memory cache
    • ORC reads can start deserializing early
      – There is no blocking read() call
    • Allow OS read-ahead to kick in
    • Use buffer-cache pages without copying them
    • Avoid wasting heap space on ORC stripes
    • Decompress directly from mapped buffers
      – Fast JNI code for Snappy decompressors
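The general pattern of decompressing straight from a memory-mapped buffer can be shown with Python's standard library. This is an analogy only: a local temporary file stands in for an HDFS block, `zlib` stands in for Snappy, and the 6-byte `HEADER` prefix is a made-up placeholder, not part of the ORC format.

```python
import mmap
import os
import tempfile
import zlib

payload = b"orc-stripe-bytes " * 1000
compressed = zlib.compress(payload)  # zlib is a stand-in for Snappy

# Write a small file: a hypothetical 6-byte prefix followed by the stripe.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"HEADER")
    f.write(compressed)
    path = f.name

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
    view = memoryview(mapped)           # a window onto the mapped pages, no copy
    stripe = view[6:]                   # slicing a memoryview does not copy either
    restored = zlib.decompress(stripe)  # decompress directly from the mapping
    # Release the views before the mapping is closed.
    stripe.release()
    view.release()

os.remove(path)
```

The compressed bytes are never copied into an intermediate Python buffer; only the decompressed result is allocated, which mirrors the slide's point about not wasting heap space on stripes.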
  • 18. Scaling
    • Reduce the size of map join hash tables
      – A hundred bytes were being used to store an integer (map join key)
      – HIVE-6430 reduced the size of the hash tables by 60-70% in many cases
      – Allows more efficient use of memory, and hence more tables fit in
    • Large number of open record writers in ORC file reduced to just 1
      – HIVE-6455
      – In a multi-insert scenario, performance is now much better and many more inserts can be done in parallel
  • 19. TPC-DS 10 TB Query Times (Tuned Queries)
    • Data: 10 TB of data loaded into ORCFile using defaults.
    • Hardware: 20 nodes, each with 24 CPU threads, 64 GB RAM and 5 local SATA drives.
    • Queries: TPC-DS queries with partition filters added.
  • 20. Security
    • Old authorization based on grant/revoke
      – Incomplete model, e.g. anybody can run a grant statement
      – Does not try to follow the standard
    • Why follow the standard?
      – A lot of thought has been put into the standard, which is important for security!
      – It’s a standard!
    • Hive should have built-in authorization
      – Easy to use, no additional components to manage
      – New features that need authorization keep getting added
      – Life cycle of objects should be synced with the authorization policy
  • 21. Managing privileges
    • Grant/revoke privilege on object to/from user/role
    • SHOW GRANT statement
      – View privilege grants based on user/role name and/or object name
    • INSERT, SELECT, DELETE, UPDATE, ALL
    • Privileges for some actions are based on object ownership
      – Table/view ownership: most alter commands, drop
      – Database ownership: create table, drop database
    • URI privileges are based on file permissions
    • https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+based+hive+authorization#SQLStandardBasedHiveAuthorization-Configuration
    • Use Hive 0.13.1, which fixes the issues listed under “known issues” in the wiki doc above.
  • 22. HiveServer2 improvements
    • HiveServer2 now supports Thrift over HTTP, with Kerberos/LDAP authentication on HTTP
    • Also supports HTTPS
    • HiveServer2 can keep sessions alive
      – Between different JDBC queries
    • New security model helps
      – All secure queries run as the “hive” user
    • Ideal for short exploratory queries
    • Uses the same JARs (no download for each task)
    • Even better JIT performance on >1 queries
  • 23. Other improvements
    • Insert-update-delete semantics. Streaming ingest from Flume. (HIVE-5317)
      – Transaction manager added. Support for the ORC file format only at this time.
    • Lots of UDF support via permanent functions. No need to run add jar for the most commonly used UDFs. Ideally, an admin adds the permanent (trusted) functions.
    • Parquet is a supported storage format. https://cwiki.apache.org/confluence/display/Hive/Parquet
    • HCatalog now supports all the data types supported in Hive.
    • Hive is now mavenized (Thanks Brock Noland!)
    • The parallel test framework means unit testing happens faster and changes get in faster.
    • Lots of new documentation for all the new features. (Thanks Lefty!)
    • Bottom line: Hive 0.13 is the fastest, most feature-rich version of Hive so far.
  • 24. Future Work
    • Many more improvements coming up
    • Speed
      – Sort Merge Bucket Map join in Tez
      – Total ordering of data
      – Skew joins
      – Cost Based Optimizer
    • Security
      – Authorizing permanent UDF access
      – Authorizing ‘show grant’
      – Support HDFS ACLs in URI permission checks (new in Hadoop 2.4)
      – More SQL syntax support, e.g. revoking just the admin option on a role
    • Multi-tenancy
      – Sticky HS2 sessions for improved performance in a multi-tenant environment
      – Improve scheduling in a multi-tenant environment
  • 25. Questions?

Editor's Notes

  • #7: With Hive and Stinger we are focused on enabling the SQL ecosystem and to do that we’ve put Hive on a clear roadmap to SQL compliance. That includes adding critical datatypes like character and date types as well as implementing common SQL semantics seen in most databases.