Introduction to Big Data and NoSQL
SQL Azure Saturday, April 21, 2012
Don Demsak, Advisory Solutions Architect, EMC Consulting
www.donxml.com
Meet Don
Advisory Solutions Architect, EMC Consulting
Application Architecture, Development & Design
DonXml.com
Twitter: donxml
SlideShare: http://www.slideshare.net/dondemsak
The era of Big Data
How did we get here?
Expensive
Processors
Disk space
Memory
Operating Systems
Software
Programmers
Monoculture
Limit CPU cycles
Limit disk space
Limit memory
Limited OS Development
Limited Software
Programmers
Mono-lingual
Mono-persistence
Typical RDBMS Implementations
Fixed table schemas
Small but frequent reads/writes
Large batch transactions
Focus on ACID
Atomicity
Consistency
Isolation
Durability
How we scale RDBMS implementations
1st Step: Build a relational database
[Diagram: a single Database]
2nd Step: Table Partitioning
[Diagram: a Database with partitions p1, p2, p3]
3rd Step: Database Partitioning
[Diagram: Customers #1, #2, and #3 each have their own Browser, Web Tier, B/L Tier, and Database]
4th Step: Move to the cloud?
[Diagram: Customers #1, #2, and #3 each have their own Browser, Web Tier, and B/L Tier, with SQL Azure Federation in place of the Database]
There have to be other ways
Polyglot Persistence
Polyglot Programmer
Where Did NoSQL Originate?
1998 - Carlo Strozzi
NoSQL project - lightweight open-source relational DB with no SQL interface
2009 - Johan Oskarsson of Last.fm wanted to organize an event to discuss open-source distributed databases; Eric Evans suggested the name NoSQL
NoSQL (loose) Definition
(often) Open source
Non-relational
Distributed
(often) don't guarantee ACID
Atlanta 2009
No:sql(east) conference
select fun, profit from real_world where relational=false
Billed as the conference of no-rel datastores
Types Of NoSQL Data Stores
5 Groups of Data Models
Relational
Document
Key Value
Graph
Column Family
Document Store
Apache Jackrabbit
CouchDB
MongoDB
SimpleDB
XML Databases
MarkLogic Server
eXist
Document?
Okay, think of a web page...
Relational model requires a column per tag
Lots of empty columns
Wasted space
Document model just stores the pages as is
Saves on space
Very flexible
Graph Storage
AllegroGraph
Core Data
Neo4j
DEX
FlockDB
Microsoft Trinity (research project)
http://research.microsoft.com/en-us/projects/trinity/
What's a graph?
A graph consists of
Nodes (stations of the graph)
Edges (lines between them)
FlockDB
Created by the Twitter folks
Nodes = Users
Edges = Nature of the relationship between nodes
Key/Value Stores
On disk
Cache in RAM
Eventually Consistent
Weak Definition
If no updates occur for a period, eventually all updates will propagate through the system and all replicas will be consistent
Strong Definition
For a given update and a given replica, eventually either the update reaches the replica or the replica retires
Ordered
Distributed Hash Table allows lexicographical processing
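The weak definition of eventual consistency above can be illustrated with a small in-memory sketch. It is not taken from any particular key/value product: the replica layout, the version numbers, and the helper names (createReplica, applyWrite, synchronize) are all assumptions made for illustration. Each replica keeps the newest version it has seen for a key, and once updates stop, a few synchronization rounds leave every replica with the same value.

```javascript
// Hypothetical replica: a Map from key to { value, version }.
function createReplica() {
  return new Map();
}

// Keep a write only if it is newer than what the replica already holds.
function applyWrite(replica, write) {
  const current = replica.get(write.key);
  if (!current || write.version > current.version) {
    replica.set(write.key, { value: write.value, version: write.version });
  }
}

// Exchange entries in both directions between two replicas.
function synchronize(a, b) {
  for (const [key, entry] of a) applyWrite(b, { key, ...entry });
  for (const [key, entry] of b) applyWrite(a, { key, ...entry });
}

const replicas = [createReplica(), createReplica(), createReplica()];

// Two updates land on different replicas; the third has seen nothing yet.
applyWrite(replicas[0], { key: "cart", value: ["book"], version: 1 });
applyWrite(replicas[2], { key: "cart", value: ["book", "pen"], version: 2 });
const staleRead = replicas[1].get("cart");

// No further updates occur; the replicas exchange state until they agree.
synchronize(replicas[0], replicas[1]);
synchronize(replicas[1], replicas[2]);
synchronize(replicas[0], replicas[1]);

const convergedVersions = replicas.map((r) => r.get("cart").version);
console.log(staleRead, convergedVersions);
```

Before synchronization, a read from the second replica finds nothing, which is the inconsistency window the definition allows.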
Key/Value Examples
Azure AppFabric Cache
Memcached
VMware vFabric GemFire
Object Databases
db4o
GemStone/S
InterSystems Caché
Objectivity/DB
ZODB
Tabular
BigTable
Mnesia
HBase
Hypertable
Azure Table Storage
SQL Server 2012
Azure Table Storage Demo
Big Data
Big Data Definition
Volumes & volumes of data
Unstructured
Semi-structured
Not suited for Relational Databases
Often utilizes MapReduce frameworks
Big Data Examples
Cassandra
Hadoop
Greenplum
Azure Storage
EMC Atmos
Amazon S3
SQL Azure (with Federations support)
Real World Example
Twitter
The challenges
Needs to store many graphs
Who you are following
Who's following you
Who you receive phone notifications from
etc.
To deliver a tweet requires rapid paging of followers
Heavy write load as followers are added and removed
Set arithmetic for @mentions (intersection of users)
What did they try?
Started with Relational Databases
Tried Key-Value storage of denormalized lists
Did it work?
Nope
Either good at
Handling the write load
Or paging large amounts of data
But not both
What did they need?
Simplest possible thing that would work
Allow for horizontal partitioning
Allow write operations to
Arrive out of order
Or be processed more than once
Failures should result in redundant work
Not lost work!
The Result was FlockDB
Stores graph data
Not optimized for graph traversal operations
Optimized for large adjacency lists
List of all edges in a graph
Key is the edge; value is a set of the node end points
Optimized for fast reads and writes
Optimized for page-able set arithmetic
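As a rough illustration of what page-able set arithmetic could look like, the sketch below intersects two sorted lists of user ids and returns one page at a time. It is not FlockDB code: the function name intersectPage, the cursor convention (the last id already returned), and the sample ids are all assumptions for illustration, and a real store would resume from the cursor rather than rescanning.

```javascript
// Intersect two ascending-sorted id arrays, returning at most pageSize ids
// that are greater than afterId, plus a cursor for the next page.
function intersectPage(a, b, afterId, pageSize) {
  const page = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length && page.length < pageSize) {
    if (a[i] < b[j]) {
      i++;
    } else if (a[i] > b[j]) {
      j++;
    } else {
      // Common id: include it only if it comes after the cursor.
      if (a[i] > afterId) page.push(a[i]);
      i++;
      j++;
    }
  }
  // A full page may have more results behind it; a short page is the last one.
  const nextCursor = page.length === pageSize ? page[page.length - 1] : null;
  return { page, nextCursor };
}

// Hypothetical data: followers of the author and followers of the mentioned user.
const followersOfAuthor = [1, 3, 5, 7, 9, 11];
const followersOfMentioned = [3, 4, 5, 9, 11, 12];

const firstPage = intersectPage(followersOfAuthor, followersOfMentioned, -Infinity, 2);
const secondPage = intersectPage(followersOfAuthor, followersOfMentioned, firstPage.nextCursor, 2);
console.log(firstPage, secondPage);
```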
How Does it Work?
Stores graphs as sets of edges between nodes
Data is partitioned by node
All queries can be answered by a single partition
Write operations are idempotent
Can be applied multiple times without changing the result
And commutative
Changing the order of operands doesn't change the result
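One way to get writes that are both idempotent and commutative is to stamp every operation and keep only the newest one per edge. The sketch below assumes that approach; it is an illustration rather than FlockDB's actual implementation, and the operation shape (source, destination, state, timestamp) and helper names are hypothetical. Replaying the same operations in a different order, with a duplicate, produces the same final state.

```javascript
// Hypothetical edge store: "source->destination" maps to the newest operation seen.
function applyEdgeOp(store, op) {
  const key = `${op.source}->${op.destination}`;
  const current = store.get(key);
  // Older or duplicate operations are ignored, so replays are harmless.
  if (!current || op.timestamp > current.timestamp) {
    store.set(key, { state: op.state, timestamp: op.timestamp });
  }
  return store;
}

function replay(ops) {
  const store = new Map();
  for (const op of ops) applyEdgeOp(store, op);
  return store;
}

// Order-independent view of a store, for comparing two replays.
function snapshot(store) {
  const entries = [...store.entries()].sort(([a], [b]) => a.localeCompare(b));
  return JSON.stringify(entries);
}

const ops = [
  { source: "ann", destination: "bob", state: "follow", timestamp: 1 },
  { source: "ann", destination: "bob", state: "unfollow", timestamp: 2 },
  { source: "ann", destination: "cat", state: "follow", timestamp: 3 },
];

const inOrder = replay(ops);
const shuffledWithDuplicate = replay([ops[2], ops[1], ops[0], ops[1]]);
console.log(snapshot(inOrder) === snapshot(shuffledWithDuplicate));
```

This is why a failure can safely result in redundant work: re-sending an operation that was already applied changes nothing.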
Working With Big Data
ACID
Atomicity
All or Nothing
Consistency
Valid according to all defined rules
Isolation
No transaction should be able to interfere with another transaction
Durability
Once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors
BASE
Basically Available
High availability but not always consistent
Soft state
Background cleanup mechanism
Eventual consistency
Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system and all the replicas will be consistent.
Traditional (relational) Approach
[Diagram: Extract, Transform, Load from a Transactional Data Store into a Data Warehouse]
Big Data Approach
MapReduce Pattern/Framework
An input reader
A map function
To transform to a common shape (format)
A partition function
A compare function
A reduce function
An output writer
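To show how these pieces fit together, here is a minimal single-process sketch of the pattern. It is not a real framework: runMapReduce, the job object, and its field names are assumptions made for illustration, and a real MapReduce system would run the map and reduce steps on many machines.

```javascript
// Runs one job in memory. Every field of `job` is a hypothetical name.
function runMapReduce(job) {
  // Input reader: produces the records to process.
  const records = job.readInput();

  // Map function: turns each record into [key, value] pairs of a common shape.
  const pairs = records.flatMap((record) => job.map(record));

  // Partition function: decides which reducer bucket receives each key.
  const buckets = Array.from({ length: job.reducerCount }, () => []);
  for (const pair of pairs) {
    buckets[job.partition(pair[0], job.reducerCount)].push(pair);
  }

  const results = [];
  for (const bucket of buckets) {
    // Compare function: sorts the bucket so equal keys sit next to each other.
    bucket.sort((x, y) => job.compare(x[0], y[0]));

    // Reduce function: called once per distinct key with all of its values.
    let start = 0;
    while (start < bucket.length) {
      const key = bucket[start][0];
      const values = [];
      let end = start;
      while (end < bucket.length && job.compare(bucket[end][0], key) === 0) {
        values.push(bucket[end][1]);
        end++;
      }
      results.push([key, job.reduce(key, values)]);
      start = end;
    }
  }

  // Output writer: stores or returns the reduced results.
  return job.writeOutput(results);
}

// Example job: count how often each tag appears in a few sample documents.
const tagCountJob = {
  reducerCount: 2,
  readInput: () => [
    { tags: ["nosql", "azure"] },
    { tags: ["nosql", "mongodb"] },
    { tags: ["azure"] },
  ],
  map: (doc) => doc.tags.map((tag) => [tag, 1]),
  partition: (key, n) => key.charCodeAt(0) % n,
  compare: (a, b) => (a < b ? -1 : a > b ? 1 : 0),
  reduce: (key, values) => values.reduce((sum, v) => sum + v, 0),
  writeOutput: (results) => Object.fromEntries(results),
};

const tagCounts = runMapReduce(tagCountJob);
console.log(tagCounts);
```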
MongoDB Example
> // map function
> m = function(){
...     this.tags.forEach(
...         function(z){
...             emit( z , { count : 1 } );
...         }
...     );
... };
> // reduce function
> r = function( key , values ){
...     var total = 0;
...     for ( var i=0; i<values.length; i++ )
...         total += values[i].count;
...     return { count : total };
... };
> // execute
> res = db.things.mapReduce(m, r, { out : "myoutput" } );
MongoDB Demo
Big Data on Azure
Azure Table Storage
Azure Service Bus
SQL Azure Federations
MongoDB on Azure
http://www.mongodb.org/display/DOCS/MongoDB+on+Azure
Hadoop on Azure
https://www.hadooponazure.com/
Using Azure for Computing
[Diagram components: Client, Master, Job/Task Scheduler, Workers, Data]
Moving to Event Based Architecture
[Diagram components: Web Roles, requests (Req), Queue, Worker Roles]
Monitor queue length against users' expectations
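Monitoring queue length against users' expectations can be turned into a simple sizing rule. The sketch below is one possible rule, not an Azure API: the function name, its parameters, and the sample numbers are all hypothetical. It estimates how many worker roles are needed to drain the current backlog within the longest wait users will accept.

```javascript
// Estimate the worker count needed to drain the queue within maxWaitSeconds.
// All parameter names and values are assumptions for illustration.
function desiredWorkerCount(options) {
  const {
    queueLength,
    messagesPerWorkerPerSecond,
    maxWaitSeconds,
    minWorkers,
    maxWorkers,
  } = options;
  // Messages a single worker can clear inside the acceptable wait.
  const capacityPerWorker = messagesPerWorkerPerSecond * maxWaitSeconds;
  const needed = Math.ceil(queueLength / capacityPerWorker);
  // Stay within the configured bounds.
  return Math.min(maxWorkers, Math.max(minWorkers, needed));
}

const limits = {
  messagesPerWorkerPerSecond: 5,
  maxWaitSeconds: 60,
  minWorkers: 1,
  maxWorkers: 10,
};

const quiet = desiredWorkerCount({ ...limits, queueLength: 0 });
const busy = desiredWorkerCount({ ...limits, queueLength: 1200 });
const flooded = desiredWorkerCount({ ...limits, queueLength: 100000 });
console.log(quiet, busy, flooded);
```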
Aggregate Stores
Visualizing Aggregates
ID: 1001
Customer: Ann
Line Items
32411234    2    $48    $96
707423234   1    $56    $56
125145      1    $24    $24
Payment Details
Card: AmEx
CC#: 12343
Expiration: 07/2015
[Diagram labels: Orders, Customers, Order Lines, Credit Cards]
Visualizing Aggregates
ID: 1001
Customer: Ann
Line Items
32411234    2    $48    $96
707423234   1    $56    $56
125145      1    $24    $24
Payment Details
Card: AmEx
CC#: 12343
Expiration: 07/2015
{ SalesOrdersView: {
    ID: 1001,
    Customer: Ann,
    LineItems: [] .. . ..
} }
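The aggregate on this slide can be written out as a single document. The sketch below is one possible shape, not the schema from the demo: the field names are assumptions, and the line item columns are read as item id, quantity, and unit price, which is an interpretation of the slide rather than something it states.

```javascript
// One sales order stored as a single aggregate instead of rows in several tables.
// Field names are hypothetical; the values come from the slide.
const salesOrder = {
  id: 1001,
  customer: "Ann",
  lineItems: [
    { item: "32411234", quantity: 2, unitPrice: 48 },
    { item: "707423234", quantity: 1, unitPrice: 56 },
    { item: "125145", quantity: 1, unitPrice: 24 },
  ],
  paymentDetails: { card: "AmEx", number: "12343", expiration: "07/2015" },
};

// Everything needed to total the order is already inside the aggregate.
function orderTotal(order) {
  return order.lineItems.reduce(
    (sum, line) => sum + line.quantity * line.unitPrice,
    0
  );
}

// The whole aggregate serializes to one document and reads back unchanged.
const stored = JSON.stringify(salesOrder);
const loaded = JSON.parse(stored);
const total = orderTotal(loaded);
console.log(total);
```

Because the order travels as one unit, a document store can read or write it without joining separate tables.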
MongoDB on Azure Demo
Next Steps
Learn a NoSQL product
Great places to start: AppFabric Cache, Azure Table Storage, MongoDB
Pick a new programming language to learn
Not Java or C#/VB
Node.js, JavaScript, F#
THANK YOU