Deep Dive On AWS Redshift
Pratim Das
Specialist Solutions Architect, Data & Analytics, EMEA
28th June, 2017
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
Performance tuning
Redshift Spectrum
Summary + Q&A
Architecture | Tuning | Integration | Echo System | Spectrum | MAG | Summary
Redshift Architecture
[Diagram labels]
• Fast, Cost Efficient
• Leader node: coordinates parallel SQL processing
• 10 GigE (HPC)
• Compute nodes: 16 cores, 16TB disk
• 2, 16 or 32 slices
• Ingestion, Backup, Restore: S3 / EMR / DynamoDB / SSH
Design for Queryability
[Diagram: three compute nodes with slices numbered 1 2 3 4 on each; zone maps over block values such as 10 | 13 | 14 | 26 | … and … | 100 | 245 | 324; query states Waiting and Running; SQL clients.]
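The zone maps named in the diagram can be pictured with a small sketch. It assumes the usual description of a zone map as the minimum and maximum value kept per storage block, so that a range filter can skip blocks that cannot match. The `Block` class, the `blocks_to_scan` helper and the middle block's values are hypothetical; only the first and last rows reuse numbers from the slide.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """One storage block of a sorted column (hypothetical model)."""
    values: List[int]

    @property
    def zone(self) -> Tuple[int, int]:
        # The zone map entry: smallest and largest value in the block.
        return (min(self.values), max(self.values))


def blocks_to_scan(blocks: List[Block], lo: int, hi: int) -> List[int]:
    """Return indices of blocks whose [min, max] range overlaps [lo, hi]."""
    return [
        i for i, b in enumerate(blocks)
        if not (b.zone[1] < lo or b.zone[0] > hi)
    ]


blocks = [
    Block([10, 13, 14, 26]),   # values taken from the slide
    Block([30, 41, 77, 99]),   # hypothetical filler block
    Block([100, 245, 324]),    # values taken from the slide
]

# A filter such as "value BETWEEN 200 AND 300" only needs the last block.
needed = blocks_to_scan(blocks, 200, 300)
```

The sketch only helps when the column is sorted or nearly sorted; on unsorted data every block's range tends to overlap every filter and nothing is skipped.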
amzn.to/2quChdM
Optimizing Amazon Redshift by Using the AWS Schema Conversion Tool
amzn.to/2sTYow1
Amazon Redshift
S3
SQL
• High concurrency: Multiple clusters access same data
• No ETL: Query data in-place using open file formats
• Full Amazon Redshift SQL support
Life of a query
[Diagram, repeated for each step: client, JDBC/ODBC, Amazon Redshift, nodes 1 2 3 4 ... N]
1. Query
   SELECT COUNT(*)
   FROM S3.EXT_TABLE
   GROUP BY…
2. Query is optimized and compiled at the leader node. Determine what gets run locally and what goes to Amazon Redshift Spectrum.
...
7. Amazon Redshift Spectrum projects, filters, joins and aggregates.
...
9. Result is sent back to client.
Let's build an analytic query - #3
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were the prior first few days' sales?
Let's compute the sales of the prior books she's written in this series and return the top 20 values, just for the first three days of sales of first editions.
• 3 Tables (1 S3, 2 local)
• 5 Filters
• 2 Joins
• 3 Group By columns
• 1 Order By
• 1 Limit
• 1 Aggregation
• 1 Function
• 2 Casts

SELECT
    P.ASIN,
    P.TITLE,
    P.RELEASE_DATE,
    SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
    s3.d_customer_order_item_details D,
    asin_attributes A,
    products P
WHERE
    D.ASIN = P.ASIN AND
    P.ASIN = A.ASIN AND
    A.EDITION LIKE '%FIRST%' AND
    P.TITLE LIKE '%Potter%' AND
    P.AUTHOR = 'J. K. Rowling' AND
    D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND
    D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE)
GROUP BY P.ASIN, P.TITLE, P.RELEASE_DATE
ORDER BY SALES_sum DESC
LIMIT 20;
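The two ORDER_DAY conditions in the query form a half-open window: the release date is included, and the day three days later is excluded, which gives exactly the first three days of sales. A minimal sketch of that same predicate outside SQL, assuming a hypothetical helper `in_first_days` and a made-up release date used purely for illustration:

```python
from datetime import date, timedelta


def in_first_days(order_day: date, release_date: date, days: int = 3) -> bool:
    """Mirror of: ORDER_DAY >= RELEASE_DATE AND ORDER_DAY < dateadd(day, days, RELEASE_DATE)."""
    return release_date <= order_day < release_date + timedelta(days=days)


# Hypothetical release date, not taken from the slides.
release = date(2017, 6, 1)

window = [
    release + timedelta(days=offset)
    for offset in range(-1, 5)
    if in_first_days(release + timedelta(days=offset), release)
]
```

Writing the upper bound as a strict "less than" avoids an off-by-one: a closed upper bound at release date plus three days would count four days of sales.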
Let's build an analytic query - #4
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were the prior first few days' sales?
Let's compute the sales of the prior books she's written in this series and return the top 20 values, just for the first three days of sales of first editions in the city of Seattle, WA, USA.
• 4 Tables (1 S3, 3 local)
• 8 Filters
• 3 Joins
• 4 Group By columns
• 1 Order By
• 1 Limit
• 1 Aggregation
• 1 Function
• 2 Casts

SELECT
    P.ASIN,
    P.TITLE,
    R.POSTAL_CODE,
    P.RELEASE_DATE,
    SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
    s3.d_customer_order_item_details D,
    asin_attributes A,
    products P,
    regions R
WHERE
    D.ASIN = P.ASIN AND
    P.ASIN = A.ASIN AND
    D.REGION_ID = R.REGION_ID AND
    A.EDITION LIKE '%FIRST%' AND
    P.TITLE LIKE '%Potter%' AND
    P.AUTHOR = 'J. K. Rowling' AND
    R.COUNTRY_CODE = 'US' AND
    R.CITY = 'Seattle' AND
    R.STATE = 'WA' AND
    D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND
    D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE)
GROUP BY P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE
ORDER BY SALES_sum DESC
LIMIT 20;
Now let’s run that query over an exabyte of data in S3
• Compression: 5X
• Columnar file format: 10X
---------------------------------------------------
Total reduction: 3.5B X
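The reduction factors on this slide compound by multiplication rather than addition. As a rough check, the two factors listed here account for 50X, so the stated 3.5B X total implies a further 70,000,000X from factors that are not listed in this text. The variable names below are illustrative only.

```python
from math import prod

# Factors as listed on the slide.
listed_factors = {
    "compression": 5,
    "columnar file format": 10,
}

# Total reduction as stated on the slide (3.5B X).
total_reduction = 3.5e9

# Independent reductions multiply: 5 * 10 = 50.
listed_product = prod(listed_factors.values())

# What the remaining, unlisted factors must contribute for the total to hold.
remaining_factor = total_reduction / listed_product
```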
aws.amazon.com/redshift/partners/
“Some” Amazon Redshift Customers
Manchester Airport Group
An AWS Redshift customer story
Stuart Hutson
Head of Data and BI, MAG
+
Munsoor Negyal
Director of Data Science, Crimson Macaw
MAG – take-off with cloud and data
Stuart Hutson – Head of Data and BI
THE AVIATION PROFESSIONALS
MAG is a leading UK-based airport company, which owns and operates Manchester, London Stansted, East Midlands and Bournemouth airports.
MAG is privately managed on behalf of its shareholders, the local authorities of Greater Manchester
and Industry Funds Management (IFM). IFM is a highly experienced, long-term investor in airports
and already has significant interests in ten airports across Australia and Europe.
• 48.5 MILLION passengers served per year.
• Over 80 AIRLINES serving 272 DESTINATIONS direct.
• £134.3 MILLION RETAIL INCOME per annum delivered via 200+ shops, bars and restaurants.
• £623 MILLION property assets across all airports, 5.67m sq ft of commercial property.
• £738.4 MILLION REVENUE, +10.0% increase from last year.
• £283.6 MILLION EBITDA, growth of 17.2% in 2015.
MAN STN
OUR CONNECTIVITY…
80+ airlines and over 270 direct destinations providing global connectivity.
MAG has a diverse carrier mix from global destinations with an excellent track record of incentivising passenger growth.
MAG has exceeded expectations with industry leading rates of passenger growth. Importantly for passengers, by forging strong commercial partnerships with airlines, our airports have been able to increase choice and convenience and make a stronger contribution to economy and growth.
MAG's Cargo produces an annual income of £20.2 million and holds 26% of the UK freight market share.
East Midlands is the UK's largest dedicated freight hub handling 310,000 tonnes of freight per annum. Stansted handles 233,000 tonnes of freight per annum and is a key gateway to London and the South of England.
OUR DEVELOPMENTS…
Manchester Transformation Programme and London Stansted Transformation Programme are developments that all aim to drive improved customer service.
With investment of £1 billion, Manchester will become one of the most modern and customer
focused airports in Europe demonstrating the importance of Manchester as a global gateway.
The £80 million terminal transformation project at London Stansted will transform the passenger
experience and boost commercial yields.
MAG’S CURRENT BUSINESS INTELLIGENCE MATURITY
[Chart axes: VALUE, MATURITY]
1. WHAT HAPPENED? DESCRIPTIVE ANALYTICS
2. WHY DID IT HAPPEN? DIAGNOSTIC ANALYTICS
3. WHAT WILL HAPPEN? PREDICTIVE ANALYTICS
4. HOW CAN WE MAKE IT HAPPEN? PRESCRIPTIVE ANALYTICS
MAG’S LEGACY ARCHITECTURE - CHALLENGES…
Technical
• Database at 95% capacity on physical kit that cannot be scaled.
• Dashboards are slow to run.
• Constant optimisation and maintenance of database.
• Limited concurrent connections for queries.
• Lack of self-serve – centralised BI model.
• No direct connection to database – business wants to expand into using R and Python etc.
• All data in batch with no possibility of streaming.
[Diagram labels: access via browser to online dashboards and ad-hoc queries; PDFs via email.]
VALUE OF BI STRATEGY IN MAG…
PHASE 1 - IMPLEMENT SELF-SERVE RETAIL BI SOLUTION
- 50+ PARTNERS GENERATING OVER £130M REVENUE…
• Real-time streaming.
• Enable MAG to become a real-time business across their customer journey.
• Cloud environment:
• Secured.
• Resilient.
• Repeatable build.
• Create an architecture that can evolve over time to meet MAG's new challenges.
• Benefits delivered early and continuously.
• No need for MAG to invest in a large, front-loaded EDW programme.
EXAMPLE OF MAG’S DESIGN PRINCIPLES TO SOLVE THE PROBLEMS…
• Evolutionary architecture
• Infrastructure as Code
• Serverless computing
• Etc.
MAG – OUR 6 MONTH JOURNEY…
From → To
• Single instance database. → Scalable Data Warehouse.
• Daily sales rung in at store level. → Over 90% of all sales automatically ingested at product level.
• Car parking - flat files ingested in batch. → Ingest and interrogate streaming data directly: car park data is being added via Kinesis.
• Access to database limited to reporting tool. → Authorised users can use visualisation and data science tools (e.g. R and Python) of their choice for self-serve analytics.
• No database writeback for end-users. → Sandboxes in Redshift for end user experimentation.
MAG – NEXT 6-12 MONTHS…
Storage in S3
Data Warehouse in
Amazon Redshift
Cloud Architecture + Data Architecture = Solution
How do you match the pace of infrastructure build in the cloud with understanding the data & BI
requirements?
Deliver value quickly vs conformed dimensions?
• A horizontal analytical 'slice' across the estate.
• Understand conformed dimensions.
• Vertical slice of a business domain.
• Reduced refactoring due to the prior horizontal analysis.
Understand how the business will consume and use the data?
• Produce artefacts that are:
• Shared by stakeholders and the delivery team.
• Understandable by all parties.
• Highly visual, allow complex information to be absorbed - sun modelling.
Sun modelling vs Enterprise Bus Matrix
[Figure: a Sun Model shown alongside an Enterprise Bus Matrix and a Star Schema. Labels visible in the Sun Model:
• Measure: Sales £, Net Despatches £, Sales Units
• Time: Calendar Month, Financial Week, Financial Period, Financial Quarter, Year, Date
• Employee: ID, Name
• Customer: ID, Name, Postcode, Town, County, Country, Salutation, Gender
• Product: SKU, Type, Item Description, Product hierarchy
• Annotations: measure, dimension, hierarchy]
Building the infrastructure (as code)
• Why use infrastructure as code?
• Repeatability.
• Consistency.
• Versioned.
• Code reviews.
• Speed of delivery.
• Technology Used:
• CloudFormation in YAML format with custom YAML Tags.
• Lambda Functions for Custom Resource Types.
• Bespoke deployment utility.
• Puppet Standalone in Cloud Init for EC2.
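As an illustration of the approach listed above, the sketch below builds a small CloudFormation template in code. It is not MAG's template: the slides mention YAML with custom tags and a bespoke deployment utility, while this version emits JSON to stay within the Python standard library. The resource names (`Warehouse`, `SchemaSetup`), the parameter names, the default node type and the custom resource's `ClusterId` property are all assumptions made for the example. The `Custom::` type prefix with a `ServiceToken` property is the standard CloudFormation way to hand a resource to a Lambda function.

```python
import json


def build_template(node_type: str = "dc2.large", db_name: str = "analytics") -> dict:
    """Return a minimal CloudFormation template as a plain dict (illustrative only)."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            "MasterUsername": {"Type": "String"},
            "MasterUserPassword": {"Type": "String", "NoEcho": True},
            # ARN of a Lambda function backing the custom resource (hypothetical).
            "CustomResourceFunctionArn": {"Type": "String"},
        },
        "Resources": {
            "Warehouse": {
                "Type": "AWS::Redshift::Cluster",
                "Properties": {
                    "ClusterType": "single-node",
                    "NodeType": node_type,
                    "DBName": db_name,
                    "MasterUsername": {"Ref": "MasterUsername"},
                    "MasterUserPassword": {"Ref": "MasterUserPassword"},
                },
            },
            # Custom resource type handled by a Lambda function.
            "SchemaSetup": {
                "Type": "Custom::SchemaSetup",
                "Properties": {
                    "ServiceToken": {"Ref": "CustomResourceFunctionArn"},
                    "ClusterId": {"Ref": "Warehouse"},
                },
            },
        },
    }


template = build_template()

# Sorted keys give byte-identical output on every run, which keeps
# version-control diffs and code reviews meaningful.
rendered = json.dumps(template, indent=2, sort_keys=True)
```

Generating the template from a function means the same definition can be rendered for several environments by changing only the arguments, which is one way to get the repeatability and consistency listed above.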