Solving Big Data Problems on AWS
Rajnish Malik
Email: rajnishm@amazon.com
Contact number: 09833311878
[Graphic: data volumes growing from GB to TB, PB, EB, ZB]
The World is Producing Ever-Larger Volumes of Big Data
• IT / application server logs: IT infrastructure logs, metering, audit logs, change logs
• Web sites / mobile apps / ads: clickstream, user engagement
• Sensor data: weather, smart grids, wearables
• Social media, user content: 450MM+ tweets/day
Big Data: unconstrained data growth
• 95% of the 1.2 zettabytes of data in the digital universe is unstructured
• 70% of this is user-generated content
• Unstructured data growth is explosive, with an estimated compound annual growth rate (CAGR) of 62% from 2008 to 2012
Source: IDC
Big Data Lifecycle: Generation → Collection & storage → Analytics & computation → Collaboration & sharing
Big Data Lifecycle – Volume, Velocity & Variety: Generation (lower cost, higher throughput) → Collection & storage → Analytics & computation → Collaboration & sharing
Use Cases
• Customer segmentation
• Marketing spend optimization
• Financial modeling & forecasting
• Ad targeting & real-time bidding
• Clickstream analysis
• Fraud detection
Metrics
• Visits, views, clicks, purchases
• Source, device, location, time
• Latency, throughput, uptime
• Likes, shares, friends, follows
• Price, frequency
Sources
• Relational
• NoSQL
• Web servers
• Mobile phones
• Tablets
• 3rd-party feeds
Formats
• Structured
• Unstructured
• Text
• Binary
• Near real-time
• Batched
Analysis
• Reporting
• Dashboards
• Sentiment
• Clustering
• Machine learning
• Optimization
Generation (lower cost, higher throughput) → Collection & storage → Analytics & computation → Collaboration & sharing (highly constrained)
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
[Chart: the gap between generated data and data available for analysis widens from 1990 to 2020]
Elastic and highly scalable + No upfront capital expense + Only pay for what you use + Available on-demand = Remove constraints
Big Data: technologies and techniques for working productively with data, at any scale.
Big data and AWS Cloud computing
• Big data: variety, volume, and velocity requiring new tools. Cloud computing: variety of compute, storage, and networking options.
• Big data: potentially massive datasets. Cloud computing: massive, virtually unlimited capacity.
• Big data: iterative, experimental style of data manipulation and analysis. Cloud computing: iterative, experimental style of infrastructure deployment and usage.
• Big data: frequently not a steady-state workload; peaks and valleys. Cloud computing: at its most efficient with highly variable workloads.
• Big data: absolute performance not as critical as "time to results"; shared resources are a bottleneck. Cloud computing: parallel compute projects give each workgroup more autonomy and faster results.
Lower costs
• Only pay for what you use
• No capital investment
• Pay as you go

Ease of use
• Programmable
• Integrate with existing tools
• Zero admin
• Easy to configure
One tool to rule them all? Use the right tools.
Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon Redshift, Amazon Elastic MapReduce
Amazon S3
• Store anything
• Object storage
• Scalable
• 99.999999999% durability
Amazon DynamoDB
• NoSQL database
• Seamless scalability
• Zero admin
• Single-digit millisecond latency
Amazon Elastic MapReduce
• Hadoop/HDFS clusters
• Hive, Pig, Impala, HBase
• Easy to use; fully managed
• On-demand and Spot pricing
• Tight integration with S3, DynamoDB, and Kinesis
Amazon Kinesis
• Real-time processing
• High throughput; elastic
• Easy to use
• EMR, S3, Redshift, DynamoDB integrations
Amazon Redshift
• Relational data warehouse
• Massively parallel
• Petabyte scale
• Fully managed
• $1,000/TB/year
Free steak campaign
Disaster recovery
Web site & media sharing
Facebook app
Ground campaign
SAP & SharePoint
Marketing web site
Business line of sight
Consumer social app
IT operations
Mars exploration ops
Interactive TV apps
Media streaming
Consumer social app
Facebook page
Securities Trading Data Archiving
Financial markets analytics
Web and mobile apps
Big data analytics
Digital media
Ticket pricing optimization
Streaming webcasts
Mobile analytics
Consumer social app
Core IT and media
Collection & storage
Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Direct Connect, AWS Storage Gateway, AWS Import/Export, Amazon Glacier, Amazon S3, Amazon Kinesis, Amazon EMR
Analytics & computation
Amazon EC2, Amazon EMR, Amazon Kinesis
Collaboration & sharing
Amazon Redshift, Amazon DynamoDB, Amazon RDS, Amazon S3, Amazon EC2, Amazon EMR, Amazon CloudFront, AWS CloudFormation, AWS Data Pipeline
The right tools.
At the right scale.
At the right time.
Big Data Case Studies
Learn from other AWS customers
aws.amazon.com/solutions/case-studies/big-data
AWS Marketplace
AWS Online Software Store
aws.amazon.com/marketplace
Shop the big data category
http://aws.amazon.com/marketplace
AWS Public Data Sets
Free access to big data sets
aws.amazon.com/publicdatasets
Thank You


Editor's Notes

  • #4 Due to the convergence of cloud, mobile, and social technologies, and advances in fields such as genomics, life sciences, and space, the size of the digital universe is growing at an ever-increasing rate. Customers have also found tremendous value in mining this data: to make better medicine, tailor purchasing recommendations, detect fraudulent financial transactions in real time, provide on-demand digital content such as movies and songs, predict the weather, and much more.
  • #5 We see big data as having a lifecycle with several high-level but distinct phases, from generation to storage to analysis and sharing.
  • #6 Big data has received a lot of attention over the last few years due to the ever-increasing scale of volume, velocity, and variety, the famous three Vs of big data. Let's start with the generation of data: big data is being used in many different use cases because the cost of generating data keeps falling while the aggregate throughput of data keeps growing.
  • #7 Here are a few big data use cases
  • #8 Which require a lot of metrics such as…
  • #9 From many different kinds of sources, including machines and application logs
  • #10 Delivered in a variety of formats and on different timescales, from near real-time to batched
  • #11 All of which feeds into why we have big data in the first place: to gain knowledge through various types of analysis, build situational awareness, discover patterns and trends, and make predictions
  • #12 The need and value of data along with the ease of generating it puts pressure on the rest of the big data lifecycle
  • #13 There is an estimated and growing gap between what is generated and what is readily available for analysis. There is one more point to make about the current state of data analysis. Various analysts have attempted to quantify the gap between data generated by applications and data that makes its way into an analytical environment. The general trend is that the gap is large and growing: people make decisions on what data to keep and what to leave on the cutting room floor. However, we feel big data is an asset to an organization on par with capital and labor. Cloud computing enables you to flip the script: instead of asking questions based on the data you decided to keep, ask what you should be learning from ALL of your data. You no longer have to let your data model dictate what you keep; keep everything and evolve your data model.
  • #14 This is done by letting the cloud remove those constraints across the rest of the big data lifecycle: infrastructure that can scale as demand increases, and the ability to add or remove resources on demand without a large upfront capital investment.
  • #15 When we think of big data, we think of both the proliferation of digital information and the innovations that extract value from that data: increased sales, greater efficiency, better health, analysis, predictions, recommendations, and innovation. More specifically, we think cloud computing is a fundamental component of any big data strategy because of its inherent benefits.
  • #16 We will go over several of these storage and compute options
  • #17 From TBs to PBs, we have the capacity and scale to handle your largest big data workloads
  • #18 You can start and stop on demand, run big data workloads in parallel as you test out new ideas, allowing you to explore without commitments
  • #19 With services such as Auto Scaling and Elastic Load Balancing, you can dial the amount of infrastructure up and down for your variable or experimental workloads (a minimal scaling sketch follows below).
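To make the dial-up/dial-down point concrete, here is a minimal sketch using the AWS SDK for Python (boto3); the Auto Scaling group name and capacity numbers are hypothetical placeholders, and the group is assumed to already exist.

```python
import boto3

# Assumes an existing Auto Scaling group (hypothetical name) backing the workload.
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Scale out for a burst of experimental work...
autoscaling.set_desired_capacity(
    AutoScalingGroupName="bigdata-workers",
    DesiredCapacity=20,
    HonorCooldown=False,
)

# ...and scale back in once the experiment is done, so you stop paying for idle capacity.
autoscaling.set_desired_capacity(
    AutoScalingGroupName="bigdata-workers",
    DesiredCapacity=2,
)
```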
  • #20 The total time to results also includes waiting to get access to IT resources; with the cloud you can be up and running in minutes, and work in parallel.
  • #22 We provide all of our services with a self-service API. We also provide managed services so you don't have to do the back-end administration, and you can configure your infrastructure with code, scripts, or point-and-click from our console, all while maintaining compatibility with your current tools.
  • #23 However, we don't believe there is one tool that can do everything; rather, if you use the right tools, you can build a highly configurable big data architecture to meet your specific needs.
  • #24 While I won't be able to cover all of our big data services, I would like to spend some time introducing several key big data services that are designed for high availability and durability. As managed services, we provision the infrastructure on your behalf, so you can get significant big data storage and analytics with a few clicks or API calls.
  • #25 Fundamental storage at internet scale: S3 can store any number of objects, from 1 byte to 5 TB each. It is engineered for 11 nines of durability, replicating your data at least three times across three distinct physical data centers we call Availability Zones. Customers such as Dropbox, Spotify, and Pinterest store billions of objects, whether photos, videos, songs, or any other type of file. (A minimal upload/download sketch follows.)
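A minimal sketch of storing and retrieving objects with the AWS SDK for Python (boto3); the bucket and key names are hypothetical and the bucket is assumed to already exist.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bigdata-bucket"  # hypothetical bucket name

# Store anything: upload a local log file as an object.
s3.upload_file("clickstream-2014-03-01.log", BUCKET,
               "logs/clickstream-2014-03-01.log")

# Small payloads can also be written straight from memory.
s3.put_object(Bucket=BUCKET, Key="manifests/run-001.json",
              Body=b'{"status": "ok"}')

# Pull the object back down later for analysis.
s3.download_file(BUCKET, "logs/clickstream-2014-03-01.log", "local-copy.log")
```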
  • #26 DynamoDB is a fast, fully managed NoSQL database service that makes it simple and cost-effective to store and retrieve any amount of data and serve any level of request traffic. Its guaranteed throughput and single-digit millisecond latency make it a great fit for gaming, ad tech, mobile, and many other applications. It runs on solid-state drives for high-speed performance at scale, and you can provision reads and writes on a table without worrying about the administration of scaling or sharding; that is all handled behind the scenes for you. For instance, real-time bidding, where three rounds of bidding for an ad slot happen in under 200 milliseconds while a page loads, needs single-digit millisecond latency to determine which ad to place and what price to bid for that impression. (A provisioned-throughput sketch follows.)
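A minimal sketch of creating a table with provisioned throughput and reading/writing an item via boto3; the table name, key schema, and capacity numbers are illustrative assumptions, not values from the talk.

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Create a table with provisioned reads/writes; DynamoDB handles
# partitioning and scaling behind the scenes.
dynamodb.create_table(
    TableName="AdImpressions",  # hypothetical table name
    AttributeDefinitions=[{"AttributeName": "ImpressionId", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "ImpressionId", "KeyType": "HASH"}],
    ProvisionedThroughput={"ReadCapacityUnits": 100, "WriteCapacityUnits": 100},
)
dynamodb.get_waiter("table_exists").wait(TableName="AdImpressions")

# Low-latency write and read by primary key.
dynamodb.put_item(
    TableName="AdImpressions",
    Item={"ImpressionId": {"S": "imp-123"}, "BidPrice": {"N": "0.42"}},
)
item = dynamodb.get_item(
    TableName="AdImpressions",
    Key={"ImpressionId": {"S": "imp-123"}},
)
print(item["Item"])
```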
  • #27 When you think of big data these days, Hadoop is always an integral part. When you combine the benefits of the cloud with the computational paradigm of MapReduce, you get Elastic MapReduce. Customers have launched millions of clusters to run big data workloads. Amazon Elastic MapReduce is a key tool in the toolbox for big data challenges; it makes possible analytics processes that previously were not feasible, and it is cost effective when leveraged with the EC2 Spot market. (A cluster-launch sketch follows.)
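A minimal sketch of launching a cluster with boto3, mixing on-demand and Spot capacity as the note suggests; the release label, instance types, bid price, and S3 log path are assumptions.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a small Hadoop/Hive cluster; core nodes bid on the Spot market
# to reduce cost. Names and paths are placeholders.
response = emr.run_job_flow(
    Name="clickstream-analysis",
    ReleaseLabel="emr-5.36.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
    LogUri="s3://my-bigdata-bucket/emr-logs/",
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1,
             "Market": "ON_DEMAND"},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 4,
             "Market": "SPOT", "BidPrice": "0.10"},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster id:", response["JobFlowId"])
```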
  • #28 Amazon Kinesis is a fully managed service for real-time processing of streaming data at massive scale. Amazon Kinesis can collect and process hundreds of terabytes of data per hour from hundreds of thousands of sources. For instance, instead of processing log files in batch, you can stream log events into Kinesis and have workers using the Kinesis Client Library read from the stream, process the information, and drive a real-time dashboard. Later today, Adi Krishnan, the product manager for Amazon Kinesis, will give a deep dive into the service. (A put/read sketch follows.)
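A minimal producer/consumer sketch with boto3 (a production consumer would normally use the Kinesis Client Library mentioned above); the stream name is hypothetical and the stream is assumed to already exist.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
STREAM = "clickstream"  # hypothetical stream name

# Producer side: push one log event into the stream.
kinesis.put_record(
    StreamName=STREAM,
    Data=json.dumps({"user": "u-42", "action": "click"}).encode("utf-8"),
    PartitionKey="u-42",  # records with the same key land on the same shard
)

# Consumer side (simplified): read from the first shard.
shard_id = kinesis.describe_stream(StreamName=STREAM)[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]
for record in kinesis.get_records(ShardIterator=iterator, Limit=10)["Records"]:
    print(record["Data"])
```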
  • #29 Provision a petabyte-scale cluster to handle complex SQL queries in just a few minutes. You can choose either an HDD-based cluster or the recently introduced SSD-based cluster, which is smaller in total size but delivers higher performance per GB. This data warehouse solution costs about a tenth of what traditional solutions of comparable size cost. Redshift can drive business intelligence tools such as Jaspersoft or MicroStrategy because it supports standard SQL and connects over ODBC or JDBC drivers. (A SQL connection sketch follows.)
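Because Redshift is reached over standard PostgreSQL-compatible ODBC/JDBC drivers, here is a minimal sketch of querying it from Python with psycopg2; the endpoint, credentials, and table are placeholders.

```python
import psycopg2

# Cluster endpoint, database, and credentials are placeholders.
conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password="REPLACE_ME",
)

with conn.cursor() as cur:
    # Standard SQL runs unchanged against the massively parallel cluster.
    cur.execute("""
        SELECT campaign, SUM(spend) AS total_spend
        FROM ad_events
        WHERE event_date >= '2014-01-01'
        GROUP BY campaign
        ORDER BY total_spend DESC
        LIMIT 10;
    """)
    for campaign, total_spend in cur.fetchall():
        print(campaign, total_spend)

conn.close()
```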
  • #30 We have had many customers, from startups to enterprises, government agencies, and banks, run big data workloads on AWS, such as analytics to recommend where to eat.
  • #31 For collection and storage, we have a variety of options depending on your requirements: Direct Connect, Storage Gateway, Import/Export, Glacier, RDS.
  • #32 EMR integrates with the Hadoop ecosystem tools: Kinesis tools; Nutch (web crawler); Cascading (data processing); HBase (large-table NoSQL store); Cassandra (NoSQL database); Chukwa (data collection system); Pig (write MapReduce programs with easy scripting); Thrift (build services and interfaces); Hive (SQL on MapReduce); HDFS (distributed file system); Avro (compact binary serialization); MapReduce (process large data sets in parallel); Mahout (machine learning); Flume (collect, aggregate, and move large amounts of log data); Sqoop (command-line transfer of data between Hadoop and relational databases). (A Hive-step sketch follows.)
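As one example of these integrations, a sketch of submitting a Hive script stored in S3 as a step on a running cluster via boto3; the cluster id and script path are hypothetical, and the exact step arguments vary by EMR release.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# "j-XXXXXXXX" and the S3 script location are placeholders.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXX",
    Steps=[{
        "Name": "hourly-clickstream-rollup",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script", "--args",
                     "-f", "s3://my-bigdata-bucket/queries/rollup.hql"],
        },
    }],
)
```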
  • #34 In summary, AWS provides you the tools so you can pick the right one at the scale that you need when you need it.
  • #35 Life Technologies, LinkedIn, Dropcam, ICRAR, CDC, Channel 4, Yelp, Nokia
  • #36 AWS Marketplace is the AWS online software store. Customers can find, research, and buy software, including a wide variety of big data options and software to help you manage your databases. With AWS Marketplace, the simple hourly pricing of most products aligns with the EC2 usage model. You can find, purchase, and 1-Click launch software in minutes, making deployment easy. Marketplace billing is integrated into your AWS account. There are 1,300+ product listings across 25 categories.
  • #37 The 1000 Genomes Project aims to build the most detailed map of human genetic variation, ultimately with data from the genomes of over 2,600 people from 26 populations around the world. The data in this release include results from sequencing the DNA of approximately the first 1,700 of those people; the remaining samples are expected to be sequenced in 2012 and released to researchers as soon as possible. The data, over 200 TB, is intended for analysis on Amazon EC2 or Elastic MapReduce rather than for download. Common Crawl is a non-profit organization dedicated to providing an open repository of web crawl data that can be accessed and analyzed by everyone: a corpus of over 5 billion web pages (541 TB), freely available on Amazon S3 and released under the Common Crawl Terms of Use. The most current crawl data sets include three types of files: Raw Content, Text Only, and Metadata; data sets from before 2012 contain only Raw Content files. Common Crawl provides the glue code required to launch Hadoop jobs on Amazon Elastic MapReduce against the crawl corpus residing in the Amazon Public Data Sets; by using Elastic MapReduce to access the S3-resident data, end users can bypass costly network transfer costs. Common Crawl's Hadoop classes and other code can be found in its GitHub repository. Three NASA NEX data sets are also available to all via Amazon S3, including climate projections and satellite images of Earth. NASA NEX is a collaboration and analytical platform that combines state-of-the-art supercomputing, Earth system modeling, workflow management, and NASA remote-sensing data; through NEX, users can explore and analyze large Earth science data sets, run and share modeling algorithms, collaborate on new or existing projects, and exchange workflows and results within and among science communities. One data set, the NEX downscaled climate simulations, provides high-resolution climate change projections for the 48 contiguous U.S. states. The second, from the Moderate Resolution Imaging Spectroradiometer (MODIS) instrument on NASA's Terra and Aqua satellites, offers a global view of Earth's surface every 1 to 2 days. The third, the Landsat data record from the U.S. Geological Survey, provides the longest continuous space-based record of Earth's land. The data sets are available at s3://nasanex/NEX-DCP30, s3://nasanex/MODIS, and s3://nasanex/Landsat. (An anonymous-access sketch follows.)
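Many public data sets, including the NASA NEX buckets listed above, can be read anonymously straight from S3; a minimal sketch with boto3 using unsigned requests (the prefix shown is just one example).

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) client: no AWS credentials needed for public data sets.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List a few objects from the NASA NEX downscaled climate projections.
resp = s3.list_objects_v2(Bucket="nasanex", Prefix="NEX-DCP30/", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# A single object could then be downloaded for local analysis, e.g.:
# s3.download_file("nasanex", resp["Contents"][0]["Key"], "sample.nc")
```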