Cloud as a Data Platform
What is (Big) Data? Amazon Data Services
Andrei Savu
Founder of Axemblr.com
Co-organizer of Bucharest JUG
Lead of Apache Provisionr
Passion for Automation & Data Analysis
Connect with me on LinkedIn
@ Axemblr
Data Processing Infrastructure
Deployment Automation on IaaS platforms
Product: Hadoop On-Demand Appliance
Apache Provisionr (Open Source)
Consulting & Professional Services
Topics
Introduction to (Big)Data
● Characteristics
● In Practice
● Value
Amazon Data Platform
● Tools
● How they fit
What is (Big)Data?
Beyond the Hype (Source)
... size & speed are relative
Characteristics #1
Too big, Too fast, Unstructured
1. Volume
"Simple models work better with more data"
The Unreasonable Effectiveness of Data
Alon Halevy, Peter Norvig, and Fernando Pereira, Google
Challenging from a technical perspective
Needs scalable storage
Distributed query engines (massively parallel)
2. Velocity
Nothing new for financial traders
Tight feedback loop as competitive advantage
Complex event processing (CEP)
Online stream summarization (estimation)
Online aggregation (key-value stores)
Long term storage for batch processing
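To make "online stream summarization" concrete, here is a minimal sketch that keeps a running count, mean and variance in constant memory using Welford's algorithm. It is one possible technique, not something the slides prescribe; the class name `RunningStats` and the sample latency values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Single-pass summary of a numeric stream (Welford's algorithm)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        # Fold one new observation into the summary without storing it.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 until at least one value has been seen.
        return self.m2 / self.count if self.count else 0.0


stats = RunningStats()
for latency_ms in [12.0, 15.0, 11.0, 14.0]:  # hypothetical sample values
    stats.update(latency_ms)
print(stats.count, stats.mean, stats.variance)
```

The raw events can still go to long term storage for batch processing; the summary only answers questions that fit in a few numbers.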
3. Variety
The reality of data is messy and the format
evolves over time
Entity Resolution, Language Detection etc.
Mantra: Detect Schema, Annotate, Enrich
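A rough sketch of the "Detect Schema" step only, assuming records arrive as Python dictionaries: it collects the value types observed per field, so format drift shows up as a field with more than one type. The function name `detect_schema` and the sample events are hypothetical; annotation and enrichment are not covered here.

```python
from collections import defaultdict


def detect_schema(records):
    """Map each field name to the set of Python type names observed for it."""
    schema = defaultdict(set)
    for record in records:
        for field, value in record.items():
            schema[field].add(type(value).__name__)
    return dict(schema)


# Hypothetical records whose format drifted over time.
events = [
    {"user": "u1", "amount": 10},
    {"user": "u2", "amount": "12.50", "country": "RO"},
    {"user": "u3", "amount": None},
]
schema = detect_schema(events)
for field in sorted(schema):
    print(field, sorted(schema[field]))
```

A field that reports several types (here `amount`) is a candidate for cleaning before any analysis.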
Characteristics #2
In Practice
(Big) data is messy
80% of the effort goes into identifying sources,
integration and cleaning
Messy and disconnected: different systems,
different networks, different departments
Consider data-markets
(Big) data has gravity
Tends to attract processing services
The cost of moving may be large
Cloud or in-house?
Cloud:
● for development & exploration
● low usage or variable capacity needs
In-house:
● due to strict regulations
● for performance and cost efficiency
People & Data Science
You need a team that combines: math,
programming and scientific instinct
Building data-science teams
http://radar.oreilly.com/2011/09/building-data-science-teams.html
(Big)Data Value
... answer them w/ Data
Enables New Products
Recommendation engines (think Amazon,
Netflix, Facebook, LinkedIn)
Advanced advertising (more later)
Advanced search & spelling suggestions
(and many more)
Rule of thumb
"Advice to businesses starting out with big data:
first, decide what problem you want to solve." *
Christer Johnson, IBM’s leader for advanced
analytics in North America
* create data-driven business processes (more)
(Big)Data on AWS
http://aws.amazon.com/big-data/
Based on my work at
Magnolia Labs Inc. http://magnolialabs.com/
San Francisco, CA based company with R&D
in Romania
Various products: RTB (real-time bidding),
Secure Browsing etc.
They are hiring! info@magnolialabs.com
Overview
Amazon S3
Amazon Glacier
Amazon EMR (Elastic MapReduce)
AWS Data Pipeline
Amazon Redshift
Amazon DynamoDB
How do they fit?
Thanks! Questions?
Andrei Savu - asavu @ axemblr.com
