Big Data
Analysis Patterns
TriHUG
6/27/2013
whoami
• Brad Anderson
• Solutions Architect at MapR (Atlanta)
• ATLHUG co-chair
• NoSQL East Conference 2009
• “boorad” most places (twitter, github)
• banderson@maprtech.com

BIG DATA
Big Data is not new!
but the tools are.

The Good News in Big Data:

“Simple algorithms and lots of data
trump complex models”

Halevy, Norvig, and Pereira, Google
IEEE Intelligent Systems
The Challenge: So Many Solutions!
What solutions fit your business problem?
For example, do you need…

• Apache Mahout?
• Storm?
• Apache Solr/Lucene?
• Apache HBase (or MapR M7)?
• Apache Drill (or Impala)?
• d3.js or Tableau?
• Node.js?
• Apache Hadoop?
• Titan?

Ask a Different Question
It may be more useful to define the problem better by asking some
of these questions:

• How large is the data to be stored?
• How large is the data to be queried? (the analysis volume)
• What time frame is appropriate for your query response?
• How fast is data arriving? (bursts or continuously?)
• How are your data sources structured?
• Are queries by sophisticated users?
• Are you looking for common patterns or outliers?

Picking the Best Solution
Your responses to these questions can help you better:

• define the problem
• recognize the analysis pattern to which it belongs
• guide the choice of solutions to try

But first, here’s a quick review of a few of the technologies you
might choose, and then we will focus on three of the questions as a
part of the landscape.

Apache Solr/Lucene
Solr/Lucene is a powerful search engine used for flexible, heavily
indexed queries over data such as:

• Full text
• Geographical data
• Statistically weighted data

Solr is a small data tool that has flourished in a big data world.

Apache Mahout
Mahout provides a library of scalable machine learning algorithms
useful for big data analysis based on Hadoop or other storage
systems.

Mahout algorithms are mainly used for:

• Recommendation (collaborative filtering)
• Clustering
• Classification

Mahout can be used in conjunction with solutions such as Solr: you
might use Mahout to create a co-occurrence database that could
then be queried using a search tool such as Solr.

Apache Drill


• Google Dremel clone
• Pluggable Query Languages
  – Starts with ANSI SQL 2003
  – Hive, Pig, Cascading, MongoQL, …
• Pluggable Storage Backends
  – Hadoop, HBase
  – MongoDB (BSON)
  – RDBMS?
• Bypasses MapReduce

Storm


• Realtime Stream Computation Engine
• Horizontal Scalability
• Guaranteed Data Processing
• Fault Tolerance
• Higher level abstraction over:
  – Message Queues
  – Worker Logic

“The Hadoop of Realtime”

Titan


• Distributed Graph Database
• Property Graph
• Pluggable Backend Storage
  – HBase or M7
  – Cassandra
  – Berkeley DB
• Search Integrated
  – Solr/Lucene
  – Elasticsearch
• Faunus
  – Graph traversals on subset
  – In-memory

Using the Answers to Guide Your Choices
For simplicity, let’s focus on the first three questions:

• How large is the data to be stored?
• How large is the data to be queried? (the analysis volume)
• What time frame is appropriate for your query response?

Big Data Decision Tree
[Figure: decision tree with leaves labeled A to E. Branch questions and their options:
 How big is your data? (<10 GB; mid; >200 GB)
 What size queries? (single element at a time; one pass over 100%; multiple passes over big chunks)
 Response time? (< 100s, human scale; throughput, not response)
 Other labels in the figure: Big storage; Streaming]

Use Cases
• Company
• Data Shape
• Technique(s)
• Business Value

Business Value
Telecommunications Giant

ETL Offload
Telecommunications
Data Shape
• Lots of Data
• Lots of Queries across Large Sets
• Throughput important

Telecommunications
Techniques
[Figure: labels Analytics; ETL]

Telecommunications
Techniques
[Figure: labels ETL (Hadoop) + Analytics (Teradata)]

Telecommunications
Business Value

Credit Card Issuer

Recommendations

Credit Card Issuer
Data Shape
• Customer Purchase History (big)
• Merchant Designations
• Merchant Special Offers
• Throughput important

Search Abuse

Techniques
A Recommendation Engine with Mahout and Solr/Lucene

History matrix
• One row per user
• One column per thing

Techniques
Recommendation based on cooccurrence
• Cooccurrence gives item-item mapping
• One row and column per thing
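
The step from history matrix to cooccurrence matrix can be sketched in a few lines of plain Python. This is an illustration only, not Mahout code: the `cooccurrence` function and the toy `history` data are hypothetical, and it keeps raw pair counts with no weighting or filtering.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(history):
    """Count, for every pair of items, how many users have both in their history."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in history.values():
        # Each user contributes one count to every unordered pair of distinct items.
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


# Hypothetical toy data: one entry per user (a row of the history matrix),
# listing the things (columns) that user has interacted with.
history = {
    "alice": ["coffee", "books", "fuel"],
    "bob": ["coffee", "books"],
    "carol": ["books", "fuel"],
}

cooc = cooccurrence(history)
print(dict(cooc["books"]))
```

The result has one row and one column per thing, and it is symmetric: `cooc[a][b]` always equals `cooc[b][a]`.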
Techniques
Cooccurrence matrix can also be
implemented as a search index
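
A rough sketch of that idea, assuming each item becomes one "document" whose indicator field lists the items it cooccurs with, and the user's history is used as the query. This is plain Python standing in for a search engine, not Solr's API; `build_index`, `recommend`, and the counts are hypothetical, and scoring is simple term overlap.

```python
def build_index(cooc, min_count=1):
    """One 'document' per item; its indicator field is the set of cooccurring items."""
    return {
        item: {other for other, n in related.items() if n >= min_count}
        for item, related in cooc.items()
    }


def recommend(index, user_history, top_n=3):
    """Use the user's history as the query and rank items by indicator overlap."""
    seen = set(user_history)
    scores = {
        item: len(indicators & seen)
        for item, indicators in index.items()
        if item not in seen  # do not recommend what the user already has
    }
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]


# Hypothetical cooccurrence counts (item -> {other item: count}).
cooc = {
    "coffee": {"books": 2, "fuel": 1},
    "books": {"coffee": 2, "fuel": 2},
    "fuel": {"coffee": 1, "books": 2},
    "toys": {"books": 1},
}

index = build_index(cooc)
print(recommend(index, ["coffee", "fuel"]))
```

A real search engine would add relevance weighting on top of the overlap; the sketch only shows why a recommendation lookup can be phrased as a query.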

Techniques
[Figure: indexing pipeline. Labels: Complete history; Cooccurrence (Mahout); Solr Indexer (×3), indexing; Item metadata; Index shards. Annotation: 20 Hrs → 3 Hrs]

Techniques
[Figure: search pipeline. Labels: User history; Solr Indexer (×3), search; Web tier; Item metadata; Index shards. Annotation: 8 Hrs → 3 Min]

Techniques
[Figure: labels Hadoop; Purchase History; Merchant Information; Recommendation Engine Results (Mahout); Export (4 hrs); Presentation Data Store (DB2); Import (4 hrs); Merchant Offers; App (×5)]

Techniques
[Figure: labels Hadoop; Purchase History; Merchant Information; Recommendation Engine Results (Mahout); Index Update (3 min); Recommendation Search Index (Solr); Merchant Offers; App (×5)]

Business Value

Waste & Recycling Leader

Idle Alerts

Data Shape
• Truck Geolocation Data
  – 20,000 trucks
  – 5 sec interval (arriving quickly)
• Landfill Geographic Boundaries

Techniques
[Figure: labels Truck Geolocation Data; Hadoop Storage;
 Realtime Stream Computation (Storm) → Immediate Alerts;
 Batch Computation (MapReduce) → Tax Reduction Reporting;
 Shortest Path Graph Algorithm (Titan) → Route Optimization]
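
The per-truck logic behind an idle alert might look roughly like the sketch below. It is plain Python, not Storm code, and everything specific in it is an assumption: the `Ping` fields, the bounding-box landfill boundary, the 300-second threshold, the movement tolerance, and the rule that waiting inside a landfill is expected and therefore not alerted.

```python
from dataclasses import dataclass


@dataclass
class Ping:
    truck_id: str
    ts: float   # seconds since some epoch
    lat: float
    lon: float


# Assumed landfill boundary as a bounding box: (min_lat, min_lon, max_lat, max_lon).
LANDFILLS = [(35.00, -79.00, 35.10, -78.90)]
IDLE_SECONDS = 300      # assumed alert threshold
MOVE_EPSILON = 0.0001   # assumed tolerance, in degrees, for "has not moved"


def in_landfill(lat, lon):
    return any(a <= lat <= c and b <= lon <= d for a, b, c, d in LANDFILLS)


class IdleDetector:
    def __init__(self):
        self._anchor = {}      # truck_id -> (ts, lat, lon) where it last came to rest
        self._alerted = set()  # trucks already alerted for the current stop

    def process(self, ping):
        """Return an alert string once per stop that exceeds the threshold, else None."""
        anchor = self._anchor.get(ping.truck_id)
        moved = (
            anchor is None
            or abs(ping.lat - anchor[1]) > MOVE_EPSILON
            or abs(ping.lon - anchor[2]) > MOVE_EPSILON
        )
        if moved or in_landfill(ping.lat, ping.lon):
            # Restart the idle clock (assumption: stops inside a landfill are fine).
            self._anchor[ping.truck_id] = (ping.ts, ping.lat, ping.lon)
            self._alerted.discard(ping.truck_id)
            return None
        idle_for = ping.ts - anchor[0]
        if idle_for >= IDLE_SECONDS and ping.truck_id not in self._alerted:
            self._alerted.add(ping.truck_id)
            return f"truck {ping.truck_id} idle for {idle_for:.0f}s"
        return None


# One truck reporting every 5 seconds from the same spot outside any landfill.
detector = IdleDetector()
alerts = []
for t in range(0, 305, 5):
    alert = detector.process(Ping("t1", float(t), 35.5, -78.5))
    if alert:
        alerts.append(alert)
print(alerts)
```

Because the state is keyed by truck, the same logic could be partitioned by truck id across workers, which is what makes it a fit for a horizontally scalable stream engine.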

Business Value

Beverage Company

Social Engagement Application


Data Shape
• Tweets, FB Messages
• Person, Activity links
• Graph Traversal

Consumer Activity Graph
[Figure: labels Wal*Mart.com; Ebay; Shopping.com; Sam’s; Ebay Motors; Dollar General; StubHub; CVS; Toys R Us]

Techniques
[Figure: labels Social Activity Stream; Key/Value Store (MapR M7); Property Graph (Titan); Graph Traversal (Faunus)]
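
To make the person/activity links concrete, here is a minimal in-memory property graph with a bounded breadth-first traversal. It is a sketch in plain Python and does not use the Titan or Faunus APIs; the `PropertyGraph` class, the vertex names, and the edge labels are all hypothetical.

```python
from collections import deque


class PropertyGraph:
    def __init__(self):
        self.vertices = {}  # vertex id -> properties dict
        self.edges = {}     # vertex id -> list of (label, neighbour id)

    def add_vertex(self, vid, **props):
        self.vertices[vid] = props
        self.edges.setdefault(vid, [])

    def add_edge(self, src, label, dst):
        # Stored in both directions so traversals can walk either way.
        self.edges[src].append((label, dst))
        self.edges[dst].append((label, src))

    def within(self, start, hops):
        """Breadth-first traversal: vertex id -> distance, limited to `hops` edges."""
        dist = {start: 0}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            if dist[v] == hops:
                continue  # do not expand past the hop limit
            for _, nxt in self.edges[v]:
                if nxt not in dist:
                    dist[nxt] = dist[v] + 1
                    queue.append(nxt)
        return dist


g = PropertyGraph()
for person in ("ann", "bob", "cat"):
    g.add_vertex(person, kind="person")
for activity in ("tweet-1", "post-7"):
    g.add_vertex(activity, kind="activity")
g.add_edge("ann", "mentioned", "tweet-1")
g.add_edge("bob", "mentioned", "tweet-1")
g.add_edge("bob", "liked", "post-7")
g.add_edge("cat", "liked", "post-7")

# People exactly two hops from ann, i.e. linked to her through a shared activity.
reach = g.within("ann", 2)
peers = sorted(
    v for v, d in reach.items() if d == 2 and g.vertices[v]["kind"] == "person"
)
print(peers)
```

Limiting the traversal to a small number of hops keeps the working set to a subset of the graph, which is the kind of query the slide's traversal layer is listed for.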
Business Value

Questions?
