© 2014 MapR Technologies
Real Time and Big Data – It’s About Time
What is Real Time
Time elapsed: Event Occurs → Gain Insight → Take Action
Time to Insight
Event Occurs → Gain Insight
• NFS + Drill
• Kafka + Camus + Drill
• HBase/MapR-DB + Drill
Time to Ingest Data + Time to Iterate
Real-time Data Exploration on newly ingested data via NFS
[Diagram: sources (relational, web server, application server) write through NFS into the MapR Distribution for Hadoop; every node runs a drillbit; real-time analytics connect over ODBC.]
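As a minimal sketch of the flow on this slide: once a source has written a file through the NFS mount, a client can query that file in place with Drill. The snippet below only builds the HTTP request for Drill's REST endpoint (`POST /query.json`) and does not send it. The host, port 8047 (Drill's default web port), the `dfs` storage plugin name, and the file path are assumptions for illustration, not values from the deck.

```python
import json
import urllib.request


def build_drill_query(host, path, limit=10):
    """Build (but do not send) a REST request that queries a file in place."""
    # Drill addresses files directly in SQL: plugin name plus back-quoted path.
    sql = f"SELECT * FROM dfs.`{path}` LIMIT {limit}"
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:8047/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical file that a web server just wrote through the NFS mount.
request = build_drill_query("localhost", "/data/logs/clicks.json")
print(request.full_url)
print(request.data.decode("utf-8"))
```

Passing `request` to `urllib.request.urlopen` would submit the query to a running drillbit; no schema or load step is declared anywhere in the sketch, which is the point of the slide.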
Real-time Data Exploration on newly ingested streams via
Kafka and Camus
[Diagram: sources (log files, clickstreams, sensors, blogs, tweets, link data) feed a Kafka cluster; Camus loads the streams into the MapR Distribution for Hadoop; every node runs a drillbit; real-time analytics connect over ODBC.]
Real-time Data Exploration on Operational Data stored in
HBase/MapR-DB
[Diagram: an application server writes operational data to the MapR Distribution for Hadoop; every node runs HBase alongside a drillbit; real-time analytics connect over ODBC.]
Apache Drill Brings Flexibility & Performance
Access to any data type, any data source
• Relational
• Nested data
• Schema-less
Rapid time to insights
• Query data in-situ
• No Schemas required
• Easy to get started
Integration with existing tools
• ANSI SQL
• BI tool integration
Scale in all dimensions
• TB-PB of scale
• 1000s of users
• 1000s of nodes
Granular Security
• Authentication
• Row/column level controls
• De-centralized
Omni-SQL (“SQL-on-Everything”)
Drill: Omni-SQL
“Whereas the other engines we're discussing here create a relational database
environment on top of Hadoop, Drill instead enables a SQL language interface to
data in numerous formats, without requiring a formal schema to be declared. This
enables plug-and-play discovery over a huge universe of data without
prerequisites and preparation. So while Drill uses SQL, and can connect to
Hadoop, calling it SQL-on-Hadoop kind of misses the point. A better name might
be SQL-on-Everything, with very low setup requirements.”
Andrew Brust
JSON Model, Columnar Speed
[Diagram: data formats placed on two axes, fixed schema to schema-less and flat to complex: JSON, BSON, Mongo, HBase, NoSQL, Parquet, Avro, CSV, TSV.]
RDBMS/SQL-on-Hadoop table
Name      Gender  Age
Michael   M       6
Jennifer  F       3
Apache Drill table
{
  "name": {
    "first": "Michael",
    "last": "Smith"
  },
  "hobbies": ["ski", "soccer"],
  "district": "Los Altos"
}
{
  "name": {
    "first": "Jennifer",
    "last": "Gates"
  },
  "hobbies": ["sing"],
  "preschool": "CCLC"
}
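To make the contrast between the two tables concrete, here is a small sketch in plain Python (not Drill code) that turns the two nested records from this slide into flat rows keyed by dotted paths, roughly the way a query would address `name.first` or `hobbies[0]`. The `flatten` helper and the path notation are illustrative assumptions.

```python
import json

# The two records from the slide, written as valid JSON.
records = [
    json.loads('{"name": {"first": "Michael", "last": "Smith"}, '
               '"hobbies": ["ski", "soccer"], "district": "Los Altos"}'),
    json.loads('{"name": {"first": "Jennifer", "last": "Gates"}, '
               '"hobbies": ["sing"], "preschool": "CCLC"}'),
]


def flatten(value, prefix=""):
    """Map a nested record to {path: leaf value} (hypothetical helper)."""
    if isinstance(value, dict):
        out = {}
        for key, child in value.items():
            out.update(flatten(child, f"{prefix}.{key}" if prefix else key))
        return out
    if isinstance(value, list):
        out = {}
        for index, child in enumerate(value):
            out.update(flatten(child, f"{prefix}[{index}]"))
        return out
    return {prefix: value}


rows = [flatten(record) for record in records]
for row in rows:
    print(row)
```

The two rows end up with different column sets (`district` in one, `preschool` in the other), which a fixed-schema table cannot hold without declaring both columns up front.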
Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (schema on write, schema before read)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (schema on the fly)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
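The idea of discovering a schema while reading, rather than declaring it first, can be sketched in a few lines of plain Python. This is only an illustration of the concept, not how Drill implements it; the `discover_schema` function and the sample batch (including the record whose `age` is a string) are hypothetical.

```python
def discover_schema(records):
    """Return {field: set of type names} observed while scanning the records."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


# Hypothetical self-describing records: no schema is declared beforehand.
batch = [
    {"name": "Michael", "age": 6},
    {"name": "Jennifer", "age": 3, "preschool": "CCLC"},  # a new field appears
    {"name": "Michael", "age": "6"},                      # a field changes type
]

schema = discover_schema(batch)
for field, types in schema.items():
    print(field, sorted(types))
```

A schema-on-write system would reject the second and third records or require a schema change first; here the schema simply grows as the data is read.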
Drill’s Role in the Enterprise Data Architecture
Raw data
• JSON, CSV, ...
“Optimized” data
• Parquet, …
Centrally-structured data
• Schemas in Hive Metastore
Relational data
• Highly-structured data
Hive, Impala, Spark SQL
Oracle, Teradata
Exploration (known and unknown questions)
Data Warehouse Augmentation with Drill
Retail Analytics: augment existing expensive SQL analytics platform with Hadoop and Drill
OBJECTIVES
• Mine purchase data and compare consumer shopping habits
• Require internal SQL specialists to gain instant access to data at all times
CHALLENGES
• Currently process tens of TB on traditional MPP DB
• Want to preserve instant access to data but at a lower price point
• Need a system that is reliable, does not lose data and is fast
• Must be able to leverage the SQL skill sets in the company
SOLUTION
• Apache Drill allows interactive analysis on large datasets with MapR as the underlying platform that meets scale, reliability and data protection needs
• SQL users do not have to learn Pig, HiveQL or any other language and continue to use Tableau on top of Drill
BUSINESS IMPACT POTENTIAL
• Hadoop and Drill dramatically reduce the price point to about $1,000 / TB
• MapR platform with Drill delivers reliability and performance for the end users
• Leverage existing BI and SQL skill sets on Hadoop without retraining
Real-time Action
Event Occurs → Take Action
Real-time processing leading to instant action
[Diagram: application servers, Kafka and stream processing on the MapR Distribution for Hadoop, where every node has HBase and the file system; batch processing runs with Spark and Drill; stream processing leads to actions.]
Stream Processing – Global MSSP
Sources (globally dispersed datacenters)
• Sensor data
• Firewall logs
• Intrusion protection system logs
• Security appliance logs
MapR M7 Distribution for Hadoop: 1 million events/sec, over 100 channels
• Stream processing: Spark Streaming, for known threats & aggregation
• Batch processing: Mahout, MLlib
• SQL queries and reporting: Drill, Impala
• Graph processing: GraphX & Titan
Outcome: new threat footprint within 2-5 min, closed-loop operations
Benefits: unified platform for analytics
• Low operational costs
• Faster response times
• Better algorithms
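The streaming stage on this slide (matching known threats and aggregating events) can be sketched without any cluster. The code below is plain Python, not the Spark Streaming API; the signature names, the 60-second window, and the sample events are hypothetical.

```python
from collections import Counter, defaultdict

KNOWN_THREATS = {"port-scan", "sql-injection"}  # hypothetical signatures
WINDOW_SECONDS = 60                              # hypothetical window size


def process(events):
    """Flag known threats and count events per channel in each time window.

    Each event is a (timestamp_seconds, channel, signature) tuple.
    """
    alerts = []
    counts = defaultdict(Counter)
    for timestamp, channel, signature in events:
        window = timestamp - timestamp % WINDOW_SECONDS  # window start time
        counts[window][channel] += 1
        if signature in KNOWN_THREATS:
            alerts.append((timestamp, channel, signature))
    return alerts, counts


# Hypothetical events from two log channels.
events = [
    (5, "firewall", "allow"),
    (42, "ips", "port-scan"),
    (61, "firewall", "sql-injection"),
    (75, "firewall", "allow"),
]
alerts, counts = process(events)
print(alerts)
print({window: dict(channels) for window, channels in counts.items()})
```

In the architecture described here the alerts would drive the closed-loop action, while the per-window counts would be stored for the batch, SQL and graph stages to analyze later.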
Operations + Analytics = Real-time, Personalized Services
[Diagram: on the MapR Distribution for Hadoop, real-time operational applications (online transactions, fraud detection, personalized offers) and analytics (clickstream analysis, fraud investigation tool) share a fraud model and a recommendations table; the users shown are a fraud investigator and an interactive marketer.]
Q&A
@mapr maprtech
tshiran@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies
