Solving Big Data Problems on AWS
Rajnish Malik
Email: rajnishm@amazon.com
Contact number: 09833311878
[Graphic: data volumes growing from GB to TB, PB, EB, ZB]
The World is Producing Ever-Larger Volumes of Big Data
• IT / application server logs: IT infrastructure logs, metering, audit logs, change logs
• Web sites / mobile apps / ads: clickstream, user engagement
• Sensor data: weather, smart grids, wearables
• Social media, user content: 450MM+ tweets/day
Big Data: unconstrained data growth
• 95% of the 1.2 zettabytes of data in the digital universe is unstructured
• 70% of this is user-generated content
• Unstructured data growth is explosive, with an estimated compound annual growth rate (CAGR) of 62% from 2008 to 2012
Source: IDC
Big Data Lifecycle: Generation → Collection & storage → Analytics & computation → Collaboration & sharing
Big Data Lifecycle – Volume, Velocity & Variety: Generation (lower cost, higher throughput) → Collection & storage → Analytics & computation → Collaboration & sharing
Use Cases
• Customer segmentation
• Marketing spend optimization
• Financial modeling & forecasting
• Ad targeting & real-time bidding
• Clickstream analysis
• Fraud detection
Metrics
• Visits, views, clicks, purchases
• Source, device, location, time
• Latency, throughput, uptime
• Likes, shares, friends, follows
• Price, frequency
Sources
• Relational
• NoSQL
• Web servers
• Mobile phones
• Tablets
• 3rd-party feeds
Formats
• Structured
• Unstructured
• Text
• Binary
• Near real-time
• Batched
Analysis
• Reporting
• Dashboards
• Sentiment
• Clustering
• Machine learning
• Optimization
Generation (lower cost, higher throughput) → Collection & storage → Analytics & computation → Collaboration & sharing (highly constrained)
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
[Chart: the gap between generated data and data available for analysis widens from 1990 to 2020]
Elastic and highly scalable + No upfront capital expense + Only pay for what you use + Available on-demand = Remove constraints
Big Data: technologies and techniques for working productively with data, at any scale.
Big data and AWS Cloud computing
• Big data: variety, volume, and velocity requiring new tools. Cloud computing: variety of compute, storage, and networking options.
• Big data: potentially massive datasets. Cloud computing: massive, virtually unlimited capacity.
• Big data: iterative, experimental style of data manipulation and analysis. Cloud computing: iterative, experimental style of infrastructure deployment and usage.
• Big data: frequently not a steady-state workload; peaks and valleys. Cloud computing: at its most efficient with highly variable workloads.
• Big data: absolute performance not as critical as "time to results"; shared resources are a bottleneck. Cloud computing: parallel compute projects give each workgroup more autonomy and faster results.
Lower costs
• Only pay for what you use
• No capital investment
• Pay as you go

Ease of use
• Programmable
• Integrate with existing tools
• Zero admin
• Easy to configure
One tool to rule them all? Use the right tools.
Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon Redshift, Amazon Elastic MapReduce
Amazon S3
• Store anything
• Object storage
• Scalable
• 99.999999999% durability
Amazon DynamoDB
• NoSQL database
• Seamless scalability
• Zero admin
• Single-digit millisecond latency
Amazon Elastic MapReduce
• Hadoop/HDFS clusters
• Hive, Pig, Impala, HBase
• Easy to use; fully managed
• On-demand and Spot pricing
• Tight integration with S3, DynamoDB, and Kinesis
Amazon Kinesis
• Real-time processing
• High throughput; elastic
• Easy to use
• EMR, S3, Redshift, DynamoDB integrations
Amazon Redshift
• Relational data warehouse
• Massively parallel
• Petabyte scale
• Fully managed
• $1,000/TB/year
Free steak campaign
Disaster recovery
Web site & media sharing
Facebook app
Ground campaign
SAP & SharePoint
Marketing web site
Business line of sight
Consumer social app
IT operations
Mars exploration ops
Interactive TV apps
Media streaming
Consumer social app
Facebook page
Securities Trading Data Archiving
Financial markets analytics
Web and mobile apps
Big data analytics
Digital media
Ticket pricing optimization
Streaming webcasts
Mobile analytics
Consumer social app
Core IT and media
Collection & storage
Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Direct Connect, AWS Storage Gateway, AWS Import/Export, Amazon Glacier, Amazon S3, Amazon Kinesis, Amazon EMR
Analytics & computation
Amazon EC2, Amazon EMR, Amazon Kinesis
Collaboration & sharing
Amazon Redshift, Amazon DynamoDB, Amazon RDS, Amazon S3, Amazon EC2, Amazon EMR, Amazon CloudFront, AWS CloudFormation, AWS Data Pipeline
The right tools.
At the right scale.
At the right time.
Big Data Case Studies
Learn from other AWS customers
aws.amazon.com/solutions/case-studies/big-data
AWS Marketplace
AWS Online Software Store
aws.amazon.com/marketplace
Shop the big data category
http://aws.amazon.com/marketplace
AWS Public Data Sets
Free access to big data sets
aws.amazon.com/publicdatasets
Thank You


Editor's Notes

  • #4 Due to the convergence of cloud, mobile, and social technologies, and advances in fields such as genomics, life sciences, and space, the size of the digital universe is growing at an ever-increasing rate. Customers have also found tremendous value in mining this data: to make better medicine, tailor purchasing recommendations, detect fraudulent financial transactions in real time, provide on-demand digital content such as movies and songs, predict the weather, and much more.
  • #5 We see big data as having a lifecycle with several high-level but distinct phases, from generation to storage to analysis and sharing.
  • #6 Big data has received a lot of attention over the last few years due to the ever-increasing scale of volume, velocity, and variety, the famous three Vs of big data. Let's start with the generation of data: big data is being used in many different use cases because the cost of generating data keeps falling while the aggregate throughput of data keeps growing.
  • #7 Here are a few big data use cases
  • #8 Which require a lot of metrics such as…
  • #9 From many different kinds of sources, including machines and application logs
  • #10 Delivered in a variety of formats and on different timescales, from near real-time to batched
  • #11 All of which feeds into why we have big data in the first place: to gain knowledge through various types of analysis, build situational awareness, discover patterns and trends, and make predictions
  • #12 The need and value of data along with the ease of generating it puts pressure on the rest of the big data lifecycle
  • #13 There is an estimated and growing gap between what is generated and what is readily available for analysis. There is one more point to make about the current state of data analysis. Various analysts have attempted to quantify the gap between data generated by applications and data that makes its way into an analytical environment. The general trend is that the gap is large and growing: people make decisions on what data to keep and what to leave on the cutting room floor. However, we feel big data is an asset to an organization on par with capital and labor. Cloud computing enables you to flip the script: instead of asking questions based on the data you decided to keep, ask what you should be learning from ALL of your data. You no longer have to let your data model dictate what you keep; keep everything and evolve your data model.
  • #14 This is done by letting the cloud remove those constraints across the rest of the big data lifecycle: infrastructure that can scale as demand increases, and the ability to add or remove resources on demand without a large upfront capital investment.
  • #15 When we think of big data, we think of both the proliferation of digital information and the innovations that extract value from that data: increased sales, greater efficiency, better health, analysis, predictions, recommendations, and innovation. More specifically, we think cloud computing is a fundamental component of any big data strategy because of its inherent benefits.
  • #16 We will go over several of these storage and compute options
  • #17 From TBs to PBs, we have the capacity and scale to handle your largest big data workloads
  • #18 You can start and stop on demand, run big data workloads in parallel as you test out new ideas, allowing you to explore without commitments
  • #19 With services such as Auto Scaling and Elastic Load Balancing, you can dial the amount of infrastructure up and down for your variable or experimental workloads (a minimal scaling sketch follows below).
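To make the dial-up/dial-down point concrete, here is a minimal sketch using the AWS SDK for Python (boto3); the Auto Scaling group name and capacity numbers are hypothetical placeholders, and the group is assumed to already exist.

```python
import boto3

# Assumes an existing Auto Scaling group (hypothetical name) backing the workload.
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Scale out for a burst of experimental work...
autoscaling.set_desired_capacity(
    AutoScalingGroupName="bigdata-workers",
    DesiredCapacity=20,
    HonorCooldown=False,
)

# ...and scale back in once the experiment is done, so you stop paying for idle capacity.
autoscaling.set_desired_capacity(
    AutoScalingGroupName="bigdata-workers",
    DesiredCapacity=2,
)
```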
  • #20 The total time to results also includes waiting to get access to IT resources; with the cloud you can be up and running in minutes, and work in parallel.
  • #22 We provide all of our services with a self-service API. We also provide managed services so you don't have to do the back-end administration, and you can configure your infrastructure with code, scripts, or point-and-click from our console, all while maintaining compatibility with your current tools.
  • #23 However, we don't believe there is one tool that can do everything; rather, if you use the right tools, you can build a highly configurable big data architecture to meet your specific needs.
  • #24 While I won't be able to cover all of our big data services, I would like to spend some time introducing several key big data services that are designed for high availability and durability. As managed services, we provision the infrastructure on your behalf, so you can get significant big data storage and analytics with a few clicks or API calls.
  • #25 Fundamental storage at internet scale: S3 can store any number of objects, from 1 byte to 5 TB each. It is engineered for 11 nines of durability, replicating your data at least three times across three distinct physical data centers we call Availability Zones. Customers such as Dropbox, Spotify, and Pinterest store billions of objects, whether photos, videos, songs, or any other type of file. (A minimal upload/download sketch follows.)
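A minimal sketch of storing and retrieving objects with the AWS SDK for Python (boto3); the bucket and key names are hypothetical and the bucket is assumed to already exist.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bigdata-bucket"  # hypothetical bucket name

# Store anything: upload a local log file as an object.
s3.upload_file("clickstream-2014-03-01.log", BUCKET,
               "logs/clickstream-2014-03-01.log")

# Small payloads can also be written straight from memory.
s3.put_object(Bucket=BUCKET, Key="manifests/run-001.json",
              Body=b'{"status": "ok"}')

# Pull the object back down later for analysis.
s3.download_file(BUCKET, "logs/clickstream-2014-03-01.log", "local-copy.log")
```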
  • #26 DynamoDB is a fast, fully managed NoSQL database service that makes it simple and cost-effective to store and retrieve any amount of data and serve any level of request traffic. Its guaranteed throughput and single-digit millisecond latency make it a great fit for gaming, ad tech, mobile, and many other applications. It runs on solid-state drives for high-speed performance at scale, and you can provision reads and writes on a table without worrying about the administration of scaling or sharding; that is all handled behind the scenes for you. For instance, real-time bidding, where three rounds of bidding for an ad slot happen in under 200 milliseconds while a page loads, needs single-digit millisecond latency to determine which ad to place and what price to bid for that impression. (A provisioned-throughput sketch follows.)
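A minimal sketch of creating a table with provisioned throughput and reading/writing an item via boto3; the table name, key schema, and capacity numbers are illustrative assumptions, not values from the talk.

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Create a table with provisioned reads/writes; DynamoDB handles
# partitioning and scaling behind the scenes.
dynamodb.create_table(
    TableName="AdImpressions",  # hypothetical table name
    AttributeDefinitions=[{"AttributeName": "ImpressionId", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "ImpressionId", "KeyType": "HASH"}],
    ProvisionedThroughput={"ReadCapacityUnits": 100, "WriteCapacityUnits": 100},
)
dynamodb.get_waiter("table_exists").wait(TableName="AdImpressions")

# Low-latency write and read by primary key.
dynamodb.put_item(
    TableName="AdImpressions",
    Item={"ImpressionId": {"S": "imp-123"}, "BidPrice": {"N": "0.42"}},
)
item = dynamodb.get_item(
    TableName="AdImpressions",
    Key={"ImpressionId": {"S": "imp-123"}},
)
print(item["Item"])
```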
  • #27 When you think of big data these days, Hadoop is always an integral part. When you combine the benefits of the cloud with the computational paradigm of MapReduce, you get Elastic MapReduce. Customers have launched millions of clusters to run big data workloads. Amazon Elastic MapReduce is a key tool in the toolbox for big data challenges; it makes possible analytics processes that previously were not feasible, and it is cost effective when leveraged with the EC2 Spot market. (A cluster-launch sketch follows.)
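A minimal sketch of launching a cluster with boto3, mixing on-demand and Spot capacity as the note suggests; the release label, instance types, bid price, and S3 log path are assumptions.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a small Hadoop/Hive cluster; core nodes bid on the Spot market
# to reduce cost. Names and paths are placeholders.
response = emr.run_job_flow(
    Name="clickstream-analysis",
    ReleaseLabel="emr-5.36.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
    LogUri="s3://my-bigdata-bucket/emr-logs/",
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1,
             "Market": "ON_DEMAND"},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 4,
             "Market": "SPOT", "BidPrice": "0.10"},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster id:", response["JobFlowId"])
```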
  • #28 Amazon Kinesis is a fully managed service for real-time processing of streaming data at massive scale. Amazon Kinesis can collect and process hundreds of terabytes of data per hour from hundreds of thousands of sources. For instance, instead of processing log files in batch, you can stream log events into Kinesis and have workers using the Kinesis Client Library read from the stream, process the information, and drive a real-time dashboard. Later today, Adi Krishnan, the product manager for Amazon Kinesis, will give a deep dive into the service. (A put/read sketch follows.)
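A minimal producer/consumer sketch with boto3 (a production consumer would normally use the Kinesis Client Library mentioned above); the stream name is hypothetical and the stream is assumed to already exist.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
STREAM = "clickstream"  # hypothetical stream name

# Producer side: push one log event into the stream.
kinesis.put_record(
    StreamName=STREAM,
    Data=json.dumps({"user": "u-42", "action": "click"}).encode("utf-8"),
    PartitionKey="u-42",  # records with the same key land on the same shard
)

# Consumer side (simplified): read from the first shard.
shard_id = kinesis.describe_stream(StreamName=STREAM)[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]
for record in kinesis.get_records(ShardIterator=iterator, Limit=10)["Records"]:
    print(record["Data"])
```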
  • #29 Provision a petabyte-scale cluster to handle complex SQL queries in just a few minutes. You can choose either an HDD-based cluster or the recently introduced SSD-based cluster, which is smaller in total size but delivers higher performance per GB. This data warehouse solution costs about a tenth of what traditional solutions of comparable size cost. Redshift can drive business intelligence tools such as Jaspersoft or MicroStrategy because it supports standard SQL and connects over ODBC or JDBC drivers. (A SQL connection sketch follows.)
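Because Redshift is reached over standard PostgreSQL-compatible ODBC/JDBC drivers, here is a minimal sketch of querying it from Python with psycopg2; the endpoint, credentials, and table are placeholders.

```python
import psycopg2

# Cluster endpoint, database, and credentials are placeholders.
conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password="REPLACE_ME",
)

with conn.cursor() as cur:
    # Standard SQL runs unchanged against the massively parallel cluster.
    cur.execute("""
        SELECT campaign, SUM(spend) AS total_spend
        FROM ad_events
        WHERE event_date >= '2014-01-01'
        GROUP BY campaign
        ORDER BY total_spend DESC
        LIMIT 10;
    """)
    for campaign, total_spend in cur.fetchall():
        print(campaign, total_spend)

conn.close()
```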
  • #30 We have had many customers, from startups to enterprises, government agencies, and banks, run big data workloads on AWS, such as analytics to recommend where to eat.
  • #31 For collection and storage, we have a variety of options depending on your requirements: Direct Connect, Storage Gateway, Import/Export, Glacier, RDS.
  • #32 EMR integrates with the Hadoop ecosystem tools: Kinesis tools; Nutch (web crawler); Cascading (data processing); HBase (large-table NoSQL store); Cassandra (NoSQL database); Chukwa (data collection system); Pig (write MapReduce programs with easy scripting); Thrift (build services and interfaces); Hive (SQL on MapReduce); HDFS (distributed file system); Avro (compact binary serialization); MapReduce (process large data sets in parallel); Mahout (machine learning); Flume (collect, aggregate, and move large amounts of log data); Sqoop (command-line transfer of data between Hadoop and relational databases). (A Hive-step sketch follows.)
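As one example of these integrations, a sketch of submitting a Hive script stored in S3 as a step on a running cluster via boto3; the cluster id and script path are hypothetical, and the exact step arguments vary by EMR release.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# "j-XXXXXXXX" and the S3 script location are placeholders.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXX",
    Steps=[{
        "Name": "hourly-clickstream-rollup",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script", "--args",
                     "-f", "s3://my-bigdata-bucket/queries/rollup.hql"],
        },
    }],
)
```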
  • #34 In summary, AWS provides you the tools so you can pick the right one at the scale that you need when you need it.
  • #35 Life Technologies, LinkedIn, Dropcam, ICRAR, CDC, Channel 4, Yelp, Nokia
  • #36 AWS Marketplace is the AWS online software store. Customers can find, research, and buy software, including a wide variety of big data options and software to help you manage your databases. With AWS Marketplace, the simple hourly pricing of most products aligns with the EC2 usage model. You can find, purchase, and 1-Click launch software in minutes, making deployment easy. Marketplace billing is integrated into your AWS account. There are 1,300+ product listings across 25 categories.
  • #37 The 1000 Genomes Project aims to build the most detailed map of human genetic variation, ultimately with data from the genomes of over 2,600 people from 26 populations around the world. The data in this release include results from sequencing the DNA of approximately the first 1,700 of those people; the remaining samples are expected to be sequenced in 2012 and released to researchers as soon as possible. The data, over 200 TB, is intended for analysis on Amazon EC2 or Elastic MapReduce rather than for download. Common Crawl is a non-profit organization dedicated to providing an open repository of web crawl data that can be accessed and analyzed by everyone: a corpus of over 5 billion web pages (541 TB), freely available on Amazon S3 and released under the Common Crawl Terms of Use. The most current crawl data sets include three types of files: Raw Content, Text Only, and Metadata; data sets from before 2012 contain only Raw Content files. Common Crawl provides the glue code required to launch Hadoop jobs on Amazon Elastic MapReduce against the crawl corpus residing in the Amazon Public Data Sets; by using Elastic MapReduce to access the S3-resident data, end users can bypass costly network transfer costs. Common Crawl's Hadoop classes and other code can be found in its GitHub repository. Three NASA NEX data sets are also available to all via Amazon S3, including climate projections and satellite images of Earth. NASA NEX is a collaboration and analytical platform that combines state-of-the-art supercomputing, Earth system modeling, workflow management, and NASA remote-sensing data; through NEX, users can explore and analyze large Earth science data sets, run and share modeling algorithms, collaborate on new or existing projects, and exchange workflows and results within and among science communities. One data set, the NEX downscaled climate simulations, provides high-resolution climate change projections for the 48 contiguous U.S. states. The second, from the Moderate Resolution Imaging Spectroradiometer (MODIS) instrument on NASA's Terra and Aqua satellites, offers a global view of Earth's surface every 1 to 2 days. The third, the Landsat data record from the U.S. Geological Survey, provides the longest continuous space-based record of Earth's land. The data sets are available at s3://nasanex/NEX-DCP30, s3://nasanex/MODIS, and s3://nasanex/Landsat. (An anonymous-access sketch follows.)
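Many public data sets, including the NASA NEX buckets listed above, can be read anonymously straight from S3; a minimal sketch with boto3 using unsigned requests (the prefix shown is just one example).

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) client: no AWS credentials needed for public data sets.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List a few objects from the NASA NEX downscaled climate projections.
resp = s3.list_objects_v2(Bucket="nasanex", Prefix="NEX-DCP30/", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# A single object could then be downloaded for local analysis, e.g.:
# s3.download_file("nasanex", resp["Contents"][0]["Key"], "sample.nc")
```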