MapR 5.2: Getting More Value from the MapR Converged Community Edition

© 2016 MapR Technologies
Sep 14, 2016

Today’s Presenters
Deborah Littlefield
Technical Curriculum Developer
Ankur Desai
Sr. Manager, Platform and Products

Today’s Agenda
• Recent updates to the MapR Converged Data Platform
• Latest Ecosystem Support in MapR 5.2
• How to upgrade to the latest version of the Community Edition
• Q&A

The MapR Converged Data Platform

4 Major Additions to the MapR Platform in the Past 12 Months
• Taking cluster monitoring to the next level with the Spyglass Initiative
• Real-time streaming with MapR Streams
• MapR-DB JSON document database and application development with OJAI
• Securing your data with access control expressions (ACEs)

Project Spyglass

MapR Vision: Maximizing User/Operator Productivity
• Deep Visibility
• Easy Management
• Full Control

The MapR Spyglass Initiative
• New approach for increasing user and administrator productivity
– Comprehensive, open, extensible
• Simplifies the management of growing big data deployments
• Starts with 5.2 release
– Phase 1 – MapR Monitoring
– Initial focus on operational visibility
• Helps community innovate faster
– Extensive use of open source visualization and dashboarding tools

Spyglass Initiative Phase 1 - MapR Monitoring
Empower administrators with cluster monitoring capabilities, including metric and log collection from nodes, services, and jobs, with dashboards to display information in a useful way.
• Converged
• Customizable
• Extensible

MapR Monitoring Architecture
[Diagram with three stages: Collection, Aggregation & Storage, Visualization]
• Data sources: node environmentals (CPU, Mem, I/O); service daemons (YARN, Drill, Hive, etc.)
• Collection: log shippers, metrics collectors
• Other labels in the diagram: Future, Alerting, MapR Control System

Project Spyglass – Monitoring All You Care About
Node/Infrastructure Monitoring
• Global aggregate (average, min, max) charts (e.g. CPU, disk utilization)
• Per-node charts (e.g. I/O throughput by disk)
• MFS read/writes and throughput
• DB puts, gets, scans and cache metrics
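
The difference between the global aggregates and the per-node charts listed above can be sketched in a few lines of Python. This is only an illustration: the node names and sample values are invented, and it does not use the MapR Monitoring collectors or their API.

```python
from statistics import mean

# Hypothetical per-node CPU utilization samples (percent); values are made up.
cpu_by_node = {
    "node1": [35.0, 40.0, 45.0],
    "node2": [60.0, 65.0, 70.0],
    "node3": [10.0, 20.0, 30.0],
}

def global_aggregates(samples_by_node):
    """Return cluster-wide average, min, and max over every node's samples."""
    all_samples = [v for samples in samples_by_node.values() for v in samples]
    return {"avg": mean(all_samples), "min": min(all_samples), "max": max(all_samples)}

def per_node_average(samples_by_node):
    """Return the average sample value for each node (one per-node chart series)."""
    return {node: mean(samples) for node, samples in samples_by_node.items()}

aggregates = global_aggregates(cpu_by_node)
node_avgs = per_node_average(cpu_by_node)
```

The global view hides which node is busy; the per-node view shows that `node2` carries most of the load.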

Cluster Space Utilization Monitoring
• Cluster wide storage utilization
• Storage Utilization Trend
• Utilization per volume and per accountable
entity (data, volume, snapshot and total size)

YARN/MR Application Monitoring
• Global YARN trend graphs
• Containers - Pending, Active
• vCores & RAM - Allocated & Used
• Per Queue charts - containers, vCores, RAM

Service Daemon Monitoring
• Per-service charts (CPU usage by type, memory)
• Centralized, searchable logs
• MapR core and ecosystem services (includes YARN, Drill and Spark)

Customizable Dashboards for Visualizing Metrics
Log Analytics

Destination to Learn and Collaborate
Blog about topics and ideas
Share code snippets and dashboards
View demos, tutorials, and videos
Engage in use case discussion/development

The Exchange
• Dashboards are defined with JSON and are easy to export and import in Grafana and Kibana
• Extend/integrate using the REST API

Dashboards can be viewed on mobile devices.

Summary
● Data collection and storage infrastructure (packaged
and supported)
○ Collection/storage of metrics & logs across node, storage,
services
● Visualization dashboard (Driven via community)
○ Sample dashboards for Grafana & Kibana
5.2 - Spyglass 1.0 GA
CUSTOMIZABLE, shareable and mobile-ready dashboards
CONVERGED monitoring with deep search
EXTENSIBLE and easy to integrate with REST API

MapR Streams

MapR Streams: Enabling Continuous Data Processing
To enable continuous, globally scalable streaming of event data, allowing developers to create real-time applications that their business can depend on.
• Converged
• Continuous
• Global

MapR Streams:
Publish-subscribe Event Streaming System for Big Data
• Producers publish billions of messages/sec to a topic in a stream.
• Guaranteed, immediate delivery to all consumers.
• Standard real-time API (Kafka).
• Integrates with Spark Streaming, Storm, Apex, and Flink.
• Direct data access (OJAI API) from analytics frameworks.
[Diagram labels: Producers, Stream, Topic (x2), Consumers (x2), Batch analytics, Replication, Remote sites and consumers]
Available in the Enterprise Edition Only
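
The publish-subscribe model described on this slide (producers append to a topic, and every consumer reads all messages at its own pace) can be illustrated with a toy in-memory sketch. This is not the Kafka API or MapR Streams; the class `ToyStream`, the topic name, and the messages are all hypothetical.

```python
from collections import defaultdict

class ToyStream:
    """In-memory stand-in for a stream holding named topics (illustration only)."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic name -> ordered list of messages
        self._offsets = defaultdict(int)   # (consumer, topic) -> next index to read

    def publish(self, topic, message):
        """Append a message to the end of a topic's log."""
        self._topics[topic].append(message)

    def poll(self, consumer, topic):
        """Return messages this consumer has not yet seen and advance its offset."""
        log = self._topics[topic]
        start = self._offsets[(consumer, topic)]
        self._offsets[(consumer, topic)] = len(log)
        return log[start:]

stream = ToyStream()
stream.publish("sensor-readings", {"id": 1, "temp": 21.5})
stream.publish("sensor-readings", {"id": 2, "temp": 22.0})

first_for_a = stream.poll("consumer-a", "sensor-readings")   # messages 1 and 2
stream.publish("sensor-readings", {"id": 3, "temp": 22.4})
second_for_a = stream.poll("consumer-a", "sensor-readings")  # only message 3
all_for_b = stream.poll("consumer-b", "sensor-readings")     # messages 1, 2 and 3
```

Because each consumer keeps its own offset, a late consumer such as `consumer-b` still receives every message that was published.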

MapR Streams: Building Faster and Simpler Apps
Simpler and Faster Architecture
• Converged platform with file storage and database reduces data movement, data latency, hardware cost, and administration cost
• Event streaming and stream processing in the same cluster enables faster processing
• Unified security framework with files and database tables reduces administration cost around setting up and enforcing security policies
• Multi-tenant: topic isolation, quotas, and data placement control allow multiple isolated streaming applications to run on the same cluster, reducing hardware cost and data movement

MapR Streams: Building Faster and Simpler Apps
Scalable
• Ingest more events to enable faster insights
• Hold on to events longer to enable deeper insights
• Develop the app once and apply it to short- and long-term data (i.e. run analysis on 15 days of data AND 1 year of data using the same application)

MapR-DB JSON document database and application development with OJAI

Open Source OJAI API for JSON-Based Applications
Open JSON Application Interface (OJAI)
[Diagram labels: {JSON}, Databases, Streams, File Systems, MapR-Client]

Familiar JSON Paradigm – Similar API Constructs
MapR-DB
Document record = Json.newDocument()
    .set("firstName", "John")
    .set("lastName", "Doe")
    .set("age", 50);
table.insert("jdoe", record);
MongoDB
BasicDBObject doc = new BasicDBObject("firstName", "John")
    .append("lastName", "Doe")
    .append("age", 50);
coll.insert(doc);

JSON: Easy Variation with Documents
{
"_id" : "rp-prod132546",
"name" : "Marvel T2 Athena",
"brand" : "Pinarello",
"category" : "bike",
"type" : "Road Bike",
"price" : 2949.99,
"size" : "55cm",
"wheel_size" : "700c",
"frameset" : {
"frame" : "Carbon Toryaca",
"fork" : "Onda 2V C"
},
"groupset" : {
"chainset" : "Camp. Athena 50/34",
"brake" : "Camp."
},
"wheelset" : {
"wheels" : "Camp. Zonda",
"tyres" : "Vittoria Pro"
}
}
{
"_id" : "rp-prod106702",
"name" : "Ultegra SPD-SL 6800",
"brand" : "Shimano",
"category" : "pedals",
"type" : "Components",
"price" : 112.99,
"features" : [
"Low profile design increases ...",
"Supplied with floating SH11 cleats",
"Weight: 260g (pair)"
]
}
{
"_id" : "rp-prod113104",
"name" : "Bianchi Pride Jersey SS15",
"brand" : "Nalini",
"category" : "Jersey",
"type" : "Clothing",
"price" : 76.99,
"features" : [
"100% Polyester",
"3/4 hidden zip",
"3 rear pocket"
],
"color" : "black"
}
[Image labels: jersey, pedal, bike]
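
One way to see why this variation is easy to work with: application code can query the shared fields and treat the product-specific ones as optional. The sketch below uses plain Python dictionaries trimmed from the three documents above; it is an illustration, not the OJAI or MapR-DB API, and the helper name `ids_under` is hypothetical.

```python
# Trimmed copies of the three catalog documents; each has different optional fields.
catalog = [
    {"_id": "rp-prod132546", "category": "bike", "price": 2949.99,
     "frameset": {"frame": "Carbon Toryaca", "fork": "Onda 2V C"}},
    {"_id": "rp-prod106702", "category": "pedals", "price": 112.99,
     "features": ["Supplied with floating SH11 cleats", "Weight: 260g (pair)"]},
    {"_id": "rp-prod113104", "category": "Jersey", "price": 76.99,
     "features": ["100% Polyester"], "color": "black"},
]

def ids_under(docs, max_price):
    """Return the _id of every document priced below max_price."""
    return [d["_id"] for d in docs if d["price"] < max_price]

cheap = ids_under(catalog, 200)

# Optional fields are read with a default instead of requiring a schema change.
colors = {d["_id"]: d.get("color", "n/a") for d in catalog}
```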

Product Catalog - RDBMS
“Entity Attribute Value” (EAV) pattern. To get a single product:
SELECT * FROM (
  SELECT
    ce.sku,
    ea.attribute_id,
    ea.attribute_code,
    CASE ea.backend_type
      WHEN 'varchar' THEN ce_varchar.value
      WHEN 'int' THEN ce_int.value
      WHEN 'text' THEN ce_text.value
      WHEN 'decimal' THEN ce_decimal.value
      WHEN 'datetime' THEN ce_datetime.value
      ELSE ea.backend_type
    END AS value,
    ea.is_required AS required
  FROM catalog_product_entity AS ce
  LEFT JOIN eav_attribute AS ea
    ON ce.entity_type_id = ea.entity_type_id
  LEFT JOIN catalog_product_entity_varchar AS ce_varchar
    ON ce.entity_id = ce_varchar.entity_id
    AND ea.attribute_id = ce_varchar.attribute_id
    AND ea.backend_type = 'varchar'
  -- ce_int is used in the CASE above but was never joined;
  -- the table name is assumed from the naming pattern of the other joins
  LEFT JOIN catalog_product_entity_int AS ce_int
    ON ce.entity_id = ce_int.entity_id
    AND ea.attribute_id = ce_int.attribute_id
    AND ea.backend_type = 'int'
  LEFT JOIN catalog_product_entity_text AS ce_text
    ON ce.entity_id = ce_text.entity_id
    AND ea.attribute_id = ce_text.attribute_id
    AND ea.backend_type = 'text'
  LEFT JOIN catalog_product_entity_decimal AS ce_decimal
    ON ce.entity_id = ce_decimal.entity_id
    AND ea.attribute_id = ce_decimal.attribute_id
    AND ea.backend_type = 'decimal'
  LEFT JOIN catalog_product_entity_datetime AS ce_datetime
    ON ce.entity_id = ce_datetime.entity_id
    AND ea.attribute_id = ce_datetime.attribute_id
    AND ea.backend_type = 'datetime'
  WHERE ce.sku = 'rp-prod132546'
) AS tab
WHERE tab.value != '';

Product Catalog - NoSQL/Document
Store the product “as a business object”:
{
  "_id" : "rp-prod132546",
  "name" : "Marvel T2 Athena",
  "brand" : "Pinarello",
  "category" : "bike",
  "type" : "Road Bike",
  "price" : 2949.99,
  "size" : "55cm",
  "wheel_size" : "700c",
  "frameset" : {
    "frame" : "Carbon Toryaca",
    "fork" : "Onda 2V C"
  },
  "groupset" : {
    "chainset" : "Camp. Athena 50/34",
    "brake" : "Camp."
  },
  "wheelset" : {
    "wheels" : "Camp. Zonda",
    "tyres" : "Vittoria Pro"
  }
}
To get a single product:
products.findById("rp-prod132546")
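
The contrast between the two catalog designs can be sketched in Python. Under the EAV pattern a product has to be reassembled from one row per attribute, while the document model returns the whole object with one key lookup. The rows, the helper names `product_from_eav` and `find_by_id`, and the in-memory dictionaries are assumptions for illustration; no real database is involved.

```python
# Hypothetical EAV rows: (sku, attribute_code, value). One row per attribute.
eav_rows = [
    ("rp-prod132546", "name", "Marvel T2 Athena"),
    ("rp-prod132546", "brand", "Pinarello"),
    ("rp-prod132546", "price", 2949.99),
    ("rp-prod106702", "name", "Ultegra SPD-SL 6800"),
]

def product_from_eav(rows, sku):
    """Rebuild one product by scanning and pivoting its attribute rows."""
    return {code: value for row_sku, code, value in rows if row_sku == sku}

# Document model: the whole product is stored under its _id.
products = {
    "rp-prod132546": {"name": "Marvel T2 Athena", "brand": "Pinarello", "price": 2949.99},
}

def find_by_id(collection, doc_id):
    """Fetch the complete document with a single key lookup (None if absent)."""
    return collection.get(doc_id)

pivoted = product_from_eav(eav_rows, "rp-prod132546")
looked_up = find_by_id(products, "rp-prod132546")
```

Both calls yield the same product; the difference is how much work is needed to get there.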

Native JSON Support in MapR-DB
{
  order_num: 5555,
  products: [
    { product_id: 348752,
      quantity: 1,
      unit_price: 149.99,
      total_price: 149.99
    },
    { product_id: …,
      quantity: 1,
      unit_price: 99.99,
      total_price: 99.99
    },
    { product_id: …,
      quantity: 1,
      unit_price: 49.99,
      total_price: 49.99
    }
  ]
}
Reads/writes at element level
• Granular disk reads/writes
• Less network traffic
• Higher concurrency
Any new elements added on demand
• No predefined schemas
• Easy to store evolving data
Not all NoSQL databases treat JSON as a native data type.
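
A rough sketch of what "reads/writes at element level" and "new elements added on demand" mean from the application's point of view: only the addressed element changes, and a missing parent is created when needed. The helper `set_path` is hypothetical and works on a plain Python dictionary; it says nothing about how MapR-DB performs the write on disk or over the network.

```python
def set_path(doc, path, value):
    """Set a nested element in place, creating intermediate maps on demand.

    `path` is a list of dict keys and list indexes, e.g. ["products", 0, "quantity"].
    """
    node = doc
    for key in path[:-1]:
        if isinstance(node, dict) and key not in node:
            node[key] = {}          # new elements are added on demand
        node = node[key]
    node[path[-1]] = value

order = {
    "order_num": 5555,
    "products": [
        {"product_id": 348752, "quantity": 1, "unit_price": 149.99, "total_price": 149.99},
    ],
}

# Update two elements of the first product without rewriting the rest of the order.
set_path(order, ["products", 0, "quantity"], 2)
set_path(order, ["products", 0, "total_price"], 299.98)

# Add an element that no schema declared in advance (field names are made up).
set_path(order, ["shipping", "method"], "ground")
```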

Leverage the Column Family Construct (Optional)
/
{a:
{a1:
{b1: "v1",
b2: [
{c1: "v1",
c2: "v2"}
]
},
a2:
{
e1: "v1",
e2: <inline jpg>
}
}
}
Column Family 1
Column Family 2
Control layout for faster data access
Different TTL requirements
Separate Table Replication settings
Specific data placement policies
Efficient ACEs

Fine Grained Security for JSON Documents
{
  "fname": "John",
  "lname": "Doe",
  "address": "111 Main St.",
  "city": "San Jose",
  "state": "CA",
  "zip": "95134",
  "credit_cards": [
    {"issuer": "Visa",
     "number": "4444555566667777"},
    {"issuer": "MasterCard",
     "number": "5555666677778888"}
  ]
}
• Entire document
• Element: "fname"
• Array: "credit_cards"
• Sub-element in array element: "credit_cards[*].number"
Specify different permission levels within the document.
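
The effect of a permission on a sub-element such as "credit_cards[*].number" can be pictured as a filtered view of the document. The sketch below is only an analogy in plain Python: `redact` and its path syntax handling are hypothetical, and in MapR-DB the permissions are enforced by the database through ACEs, not by application code like this.

```python
import copy

def redact(doc, denied_paths):
    """Return a copy of doc with every denied field path removed.

    Paths use dots for nesting and "[*]" for "each element of this array",
    for example "credit_cards[*].number".
    """
    result = copy.deepcopy(doc)
    for path in denied_paths:
        _remove(result, path.split("."))
    return result

def _remove(node, parts):
    """Delete the field addressed by parts from node, if it is present."""
    if not isinstance(node, dict):
        return
    head, rest = parts[0], parts[1:]
    if head.endswith("[*]"):
        name = head[:-3]
        if not rest:
            node.pop(name, None)
        else:
            for item in node.get(name, []):
                _remove(item, rest)
    elif rest:
        _remove(node.get(head), rest)
    else:
        node.pop(head, None)

customer = {
    "fname": "John",
    "address": "111 Main St.",
    "credit_cards": [
        {"issuer": "Visa", "number": "4444555566667777"},
        {"issuer": "MasterCard", "number": "5555666677778888"},
    ],
}

# A reader who may not see the address or any card number gets this view.
view = redact(customer, ["address", "credit_cards[*].number"])
```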

Comprehensive Data Type Support for MapR-DB
• NULL
• Boolean
• String
• Map
• Array
• Float, Double
• Binary
• Byte, Short, Int, Long
• Date
• Decimal
• Interval
• Time
• Timestamp
Examples:
{
  "sample_int": {"$numberLong": 2147483647},
  "sample_date": {"$dateDay": "2016-02-22"},
  "sample_decimal": {"$decimal": "1234567890.23456789"},
  "sample_time": {"$time": "10:26:12.487"},
  "sample_timestamp": {"$date": "2016-02-22T10:26:12.487+Z"}
}
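
The tagged values in the example follow a simple convention: a one-key map whose key names the type. A minimal Python sketch of decoding four of those tags into native values is shown below. The `DECODERS` table and `decode` function are hypothetical helpers, not part of OJAI, and the timestamp tag is deliberately left undecoded.

```python
from datetime import date, time
from decimal import Decimal

# Tag -> converter. Only some of the tags used in the examples above are covered.
DECODERS = {
    "$numberLong": int,
    "$dateDay": date.fromisoformat,
    "$decimal": Decimal,
    "$time": time.fromisoformat,
}

def decode(value):
    """Convert a single-key tagged map into a native Python value."""
    if isinstance(value, dict) and len(value) == 1:
        tag, raw = next(iter(value.items()))
        if tag in DECODERS:
            return DECODERS[tag](raw)
    return value  # untagged or unknown values pass through unchanged

sample = {
    "sample_int": {"$numberLong": 2147483647},
    "sample_date": {"$dateDay": "2016-02-22"},
    "sample_decimal": {"$decimal": "1234567890.23456789"},
    "sample_time": {"$time": "10:26:12.487"},
    "plain": "unchanged",
}

decoded = {key: decode(value) for key, value in sample.items()}
```

Decoding the decimal into `Decimal` rather than `float` keeps all of its digits.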

Data Security with Access Control Expressions

File ACEs – Key Features
• Intuitive inheritance: subdirectories and files inherit permissions from the parent directory
• Whole-volume ACEs: volume-level filter, useful in multitenant environments
• Roles: arbitrary grouping of users according to your business needs
• High performance: no performance hit
• Boolean operators: allow for ultra-fine-grained permissions
AUTHORIZATION

File ACEs: Whole Volume ACE Example
• Whole-volume ACE: r: group:finance
• Jane (Finance) grants read access to Bob (Developer).
  File: /finance/final_report.csv
  r: user:bob
• Bob cannot read the file /finance/final_report.csv because the whole-volume ACE is set to allow read access to finance only.
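
The example above can be read as a two-stage check: a request must pass the whole-volume ACE before the file ACE is even relevant. A minimal Python sketch of that reading follows; the function `can_read` and the group name "developers" are assumptions for illustration, and the real evaluation is done by the MapR file system.

```python
def can_read(user, groups, volume_read_groups, file_read_users):
    """Allow a read only if both the whole-volume ACE and the file ACE allow it."""
    passes_volume = bool(groups & volume_read_groups)   # r: group:finance
    passes_file = user in file_read_users               # r: user:bob
    return passes_volume and passes_file

volume_ace = {"finance"}
file_ace = {"bob"}

# Bob is named in the file ACE but is not in the finance group.
bob_as_developer = can_read("bob", {"developers"}, volume_ace, file_ace)

# Hypothetical: if Bob were also in finance, the same file ACE would take effect.
bob_in_finance = can_read("bob", {"developers", "finance"}, volume_ace, file_ace)
```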

POSIX ACLs vs ACEs
Access Control Lists
MapR Access Control Expressions:
r : user:sally | (group:dev_team & group:managers)
Which one is easier to set and understand?
Which one allows for higher granularity?

MapR Has ACEs for Files and MapR-DB Records
Example: user:mary | (group:admins & group:VP) & user:!bob
Permissions on files, tables, column families, columns, JSON documents and sub-documents
Use Access Control Expressions (ACEs) to set granular permissions.
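
To make the boolean operators concrete, here is a small Python evaluator for expressions written the way they appear on these slides. It is a sketch under stated assumptions: the `user:`/`group:` spelling and `user:!bob` negation are copied from the slide text rather than from the product's ACE syntax reference, `&` is assumed to bind more tightly than `|`, and the function names are hypothetical.

```python
import re

TOKEN = re.compile(r"\s*(\(|\)|&|\||!|(?:user|group):!?\w+)")

def tokenize(text):
    """Split an expression into parentheses, operators, and user:/group: terms."""
    tokens, pos, text = [], 0, text.strip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def evaluate(expression, user, groups):
    """Return True if the given user (with these groups) satisfies the expression."""
    tokens, pos = tokenize(expression), 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        pos += 1
        return tokens[pos - 1]

    def parse_or():                      # lowest precedence: a | b
        result = parse_and()
        while peek() == "|":
            advance()
            right = parse_and()
            result = result or right
        return result

    def parse_and():                     # assumed to bind tighter than |
        result = parse_factor()
        while peek() == "&":
            advance()
            right = parse_factor()
            result = result and right
        return result

    def parse_factor():
        token = advance()
        if token == "!":
            return not parse_factor()
        if token == "(":
            result = parse_or()
            if advance() != ")":
                raise ValueError("missing closing parenthesis")
            return result
        kind, name = token.split(":", 1)
        negated = name.startswith("!")   # slide-style negation, e.g. user:!bob
        name = name.lstrip("!")
        matched = (name == user) if kind == "user" else (name in groups)
        return matched != negated

    result = parse_or()
    if peek() is not None:
        raise ValueError("unexpected trailing tokens")
    return result

ACE = "user:mary | (group:admins & group:VP) & user:!bob"
mary_ok = evaluate(ACE, "mary", set())
vp_admin_ok = evaluate(ACE, "alice", {"admins", "VP"})
bob_denied = evaluate(ACE, "bob", {"admins", "VP"})
```

With these assumptions, Mary is always allowed, any other user needs both groups, and Bob is excluded even if he holds both.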

Ecosystem Updates

5.2 Ecosystem Support
These are the only component version changes in MEP 1.0 as of the 5.2 release date, and all of them have already been released for 5.1.
Columns 2 and 3: Eco on 5.1 today. Column 4: MEP 1.0 on 5.2.
Component   Released with 5.1   Subsequently released for 5.1   MEP 1.0 on 5.2
Drill       1.4                 1.6                             1.6
Spark       1.5.2               1.6.1                           1.6.1 (2.0 in dev preview)
Impala      2.2.0               2.5                             2.5
Storm       0.10.0              0.10.1                          0.10.1
Mahout      0.11.2              0.12.2                          0.12.2

Converging SQL and JSON with Apache Drill 1.6
• Flexible and operational analytics on NoSQL
– MapR-DB plugin allows analysts to perform SQL queries directly on JSON data in MapR-DB tables
– Pushdown capabilities provide optimal interactive experience
• Enhanced query performance
– Provides better query performance via partition pruning, metadata caching and other optimizations
– Delivers up to 10-60X performance gains in query planning compared to previous releases of Drill
• Better memory management
– Delivers greater stability and scale, which enables customers to run not only larger but also more SQL workloads on a MapR cluster
• Improved integration with visualization tools like Tableau
– Introduces client impersonation for end-to-end security from the visualization tool to data in Hadoop.
– Enhanced SQL Window functions

What’s New in Spark 2.0?
• Structured Streaming with Spark SQL
– The ability to perform interactive queries against live streaming data.
– Output can now be aggregated in a stream for continuous applications.
– Pre-computation of analytics in a continuous fashion can occur as the data is generated
• Whole Stage Code-gen
– Provided by the second-generation Tungsten engine.
– Eliminates the need for multiple JVM calls by flattening SQL queries into a single function evaluated as bytecode at runtime.
• DataFrame APIs
– Runs on the same engine as SparkSQL.
– Allows access to data from a variety of different data sources.
– Can run database-like operations or allow for passing in custom code.

Upgrade to the Latest MapR Converged Community Edition

Select an Upgrade Method
[Diagram: the four upgrade methods compared by time and complexity]
• Rolling (Manual or Scripted): takes advantage of high-availability features
• Offline (Installer or Manual): cluster offline during upgrade

Community Edition and Rolling Upgrades
• Expect interruptions to cluster operations when nodes running the
only copy of a service (for example, CLDB) are upgraded
• Minimize cluster access
• With 10 or fewer nodes, offline upgrade probably makes the most sense

Supported Upgrade Methods
From Version Offline Installer Offline Manual Rolling Manual Rolling Scripted
3.x
4.0
4.1
5.0
5.1
* Supported for clusters that were installed using the MapR Installer. This is the only
method that also upgrades ecosystem components.

High-Level Overview
1. Plan!
2. Prepare
3. Upgrade

Plan: Determine What to Include
MapR Core
Ecosystem components not at supported MEP
MapR clients
New features

Plan: Develop a Test Plan
• Run tests before and after each upgrade step
– Compare results
• Test basic functionality
– Verify cluster access and volumes
– Use maprcli, hadoop fs, MCS
• Test jobs and queries
– Based on the components you use
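
The "run tests before and after, then compare results" step can be as simple as recording pass/fail per check and diffing the two runs. The Python sketch below assumes the results were already collected by hand or by a script; the check names and outcomes are invented, and `find_regressions` is a hypothetical helper.

```python
def find_regressions(before, after):
    """List checks that passed before the upgrade step but fail or are missing after it."""
    return sorted(name for name, passed in before.items()
                  if passed and not after.get(name, False))

# Hypothetical recorded outcomes: True means the check passed.
before = {"maprcli volume list": True, "hadoop fs -ls /": True, "sample Drill query": True}
after = {"maprcli volume list": True, "hadoop fs -ls /": False, "sample Drill query": True}

regressions = find_regressions(before, after)
```

An empty list after each step is the signal to continue with the next one.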

Plan: Create an Upgrade Schedule
• What can be done weeks ahead?
• What can be done days ahead?
• What needs to happen the day of the upgrade?
• What needs to happen after the upgrade?

Prepare: Weeks Ahead
• Review Release Notes
• Verify node specifications
– Update the JDK if needed
• Upgrade on a test cluster
– Document surprises
– Prepare configuration files
Critical!

Prepare: Days Ahead
• Download the installer, packages, etc.
• Run tests and record results
• Back up critical data

Prepare: Day of Upgrade
• Verify cluster health and clear alarms
• Empty job queue/terminate jobs
• Stop cross-cluster operations
– Volume mirroring
– Table replication

Upgrade Order
1. MapR core
2. Ecosystem components
• Upgraded manually, unless using MapR Installer
3. MapR clients
4. Enable new features

Upgrade MapR Core
Component Includes
MapReduce binaries
MapR Core
Webserver
maprcli command binaries, MCS, REST API
Other services
New features, performance enhancements (varies by release)

Upgrade MapR Core: Config Files
New default configuration files created:
Active Configuration Files (do not change during upgrade)    New Configuration Files (added with upgrade)
/opt/mapr/conf                                               /opt/mapr/conf.new
/opt/mapr/conf/conf.d                                        /opt/mapr/conf.d.new
/opt/mapr/hadoop/hadoop-<ver>/conf                           /opt/mapr/hadoop/hadoop-<ver>/conf.new
Important! Merge required changes into active configuration files
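
Before merging, it helps to see exactly what the new default file adds or changes relative to the active one. The sketch below uses Python's standard `difflib` on two in-memory strings; the file name, keys, and values are invented, and in practice the text would be read from the active and `.new` directories listed above.

```python
import difflib

def config_diff(active_text, new_text, name):
    """Return unified-diff lines between an active config and its new default."""
    return list(difflib.unified_diff(
        active_text.splitlines(), new_text.splitlines(),
        fromfile=f"conf/{name}", tofile=f"conf.new/{name}", lineterm=""))

# Hypothetical file contents; the keys and values are invented for illustration.
active = "cache.size=512\nlog.level=INFO\n"
new_default = "cache.size=512\nlog.level=INFO\nnew.feature.enabled=false\n"

diff_lines = config_diff(active, new_default, "example.conf")

# Lines the new default adds (skipping the "+++" file header line).
added = [line[1:] for line in diff_lines
         if line.startswith("+") and not line.startswith("+++")]
```

Only the lines in `added` (and any changed lines) need a merge decision; the active file itself is never overwritten by this check.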

Upgrade MapR Core: Hadoop Common Version
1. New Hadoop directory created at:
/opt/mapr/hadoop/hadoop-<version>
2. Existing Hadoop directory moved to:
/opt/mapr/hadoop/OLD_HADOOP_VERSIONS
3. Links updated for new version:
/opt/mapr/lib/*.jar
4. Paths in service configuration files updated:
/opt/mapr/conf/conf.d/warden.<service name>.conf

Upgrade MapR Core: Post-Upgrade Tasks
• If upgrading from 5.0 or earlier, copy the new license file into place on each node:
cp /opt/mapr/conf.new/BaseLicense.txt /opt/mapr/conf/
• After a manual (rolling or offline) upgrade, update the Hadoop configuration file with the new version:
/opt/mapr/conf/hadoop_version
• Resume cross-cluster operations
– Volume mirroring
– Table replication

Upgrade Ecosystem Components
• Follow pre- and post-upgrade steps in documentation
• As of MapR 5.2, must upgrade to ecosystem components that belong to the same MapR Ecosystem Pack (MEP)
http://maprdocs.mapr.com/home/InteropMatrix/r_MEP_52.html

Upgrade MapR Clients
MapR Client
(Windows, Mac, Linux)
Cluster
hadoop fs -ls /
maprcli volume list

Upgrade MapR POSIX Clients
• Loopback POSIX client
• FUSE-based POSIX client
– FUSE-based new in MapR 5.1
• Recommend: upgrade to
FUSE-based POSIX client
MapR POSIX Client
(Linux only)

Upgrading from MapR 3.x
• To run MapReduce v1 jobs, change the default MapReduce
mode or submit them with the appropriate command
• May need to recompile MapReduce jobs
• May need to add YARN services to cluster
http://maprdocs.mapr.com/home/UpgradeGuide/RunningMRjobsYarn.html

Other Upgrade Considerations
• Mirroring between clusters
– Volumes must be mirrored to a cluster at the same, or higher, revision
– Upgrade the destination cluster first!
– Consider disabling mirror operations during the upgrades, to avoid
alarms and maximize available bandwidth
• Table replication between clusters
– Clusters involved in table replication can be at different versions
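
The mirroring rule above (the destination must be at the same or a higher revision, so it is upgraded first) can be expressed as a small version check. This Python sketch assumes plain dotted numeric version strings such as "5.1" or "5.2"; the helper names are hypothetical and nothing here queries a real cluster.

```python
def parse_version(text, width=3):
    """Turn "5.2" into (5, 2, 0) so versions compare numerically."""
    parts = [int(p) for p in text.split(".")]
    return tuple(parts + [0] * (width - len(parts)))

def can_mirror(source_version, destination_version):
    """Mirroring requires the destination at the same or a higher revision."""
    return parse_version(destination_version) >= parse_version(source_version)

def upgrade_order(source_cluster, destination_cluster):
    """Upgrade the destination first so the rule holds throughout the upgrade."""
    return [destination_cluster, source_cluster]

during_upgrade_ok = can_mirror("5.1", "5.2")    # destination upgraded first
wrong_order = can_mirror("5.2", "5.1")          # source upgraded first
```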

Q&A
Engage with us!
• Spyglass Initiative
o https://www.mapr.com/products/spyglass-initiative
• Try out MapR Streams and MapR-DB in the free MapR Community Edition
o https://www.mapr.com/products/hadoop-download
• Try out MapR Streams and MapR-DB in the MapR Sandbox (virtual machine)
o https://www.mapr.com/products/mapr-sandbox-hadoop
